IMPLEMENTING VALIDATION PROCEDURES TO STUDY THE PROPERTIES 
OF WIDELY USED STATISTICAL ANALYSIS METHODS OF RNA 
SEQUENCING EXPERIMENTS
 
By
 
Pablo Daniel Reeb
 
 
A DISSERTATION
 
Submitted
 
to
 
Michigan State University
 
in partial fulfillment of the requirements
 
for the degree of
 
 
Fisheries and Wildlife
-
Doctor of Philosophy
 
2015
 
 
ABSTRACT
 
IMPLEMENTING VALIDATION PROCEDURES TO STUDY THE PROPERTIES 
OF WIDELY USED STATISTICAL ANALYSIS METHODS 
OF RNA 
SEQUENCING EXPERIMENTS
 
By
 
Pablo Daniel Reeb
 
RNA sequencing (RNA
-
seq) technology is being rapidly adopted as the platform of choice
 
for tran
scriptomic studies. Although its
 
major focus has been gene expression profiling, other 
interests, such as single nucleotide profiling, are emerging as the technology evolves. In addition, 
applications are being rapidly expanding in model and nonmodel organisms. 
T
he
 
overall objective
 
of t
his dissertation was to propose and implement validation procedures based on experimental 
data to estimate the properties of widely used statistical analysis methods of RNA
-
seq 
experiments. 
 
The first study evaluated differential expression methods based o
n count data distribution 
and Gaussian transformed models. Parametric simulations and plasmode datasets derived from 
RNA
-
seq experiments were generated to compare the statistical models in terms of type I error 
rate, power and null p
-
value distribution. Ov
erall, Gaussian models presented p
-
values closer to 
nominal significance levels and a p
-
value distribution closer to the expected uniform distribution. 
Researchers using models with these properties 
will 
have less false positives when inferring 
differentia
lly expresses transcripts. Additionally, the use of Gaussian transformations enables the 
applications of all the well
-
known theory of linear models for instance to account for complex 
experimental designs.
 
The second study assessed the properties of dissim
ilarity measures for agglomerative 
hierarchical cluster analysis. The validation comprise
d
 
dissimilarity measures based on Euclidean 
distance, correlation
-
based dissimilarities and count data
-
based dissimilarities. I used plasmode 
 
 
datasets generated from t
wo RNA
-
seq experiments with different 
sample structures and 
simulated
 
scenarios based on informative and non 
informative transcripts. In addition
, I proposed 
two measures, agreement and consistency, for comparing dendrograms. Dissimilarity measures 
based o
n non
-
transformed data resulted in dendrograms that did not resemble the expected 
sample structure, whereas dissimilarities calculated with appropriate transformations for count 
data were consistent in reproducing the expected dendrograms under different s
cenarios
.
         
 
The third study compared variant calling programs 
that used reference
 
genotypes 
obtained from a SNP
 
chip. The evaluation included multiple samples and multiple tissue datasets 
and considered the effect of per base read depth. Sensitivity 
and false discovery rates were 
computed separately for heterozygous and homozygous sites in order to provide information for 
potentially different applications such as allele
-
specific expression or RNA
-
editing. Additionally, I 
explored the use of SNP calle
d from RNA
-
seq to compute relationship matrices in population 
studies. Heterozygous sites with more than 10 reads per base and per sample were called with 
high sensitivity and low false discovery rates. Homozygous sites were called with higher sensitivity 
than heterozygous irrespective of depth but presented higher false discovery rates. A relationship 
matrix based on accurate genotypes obtained with RNA
-
seq presented a high correlation with a 
relation matrix based on genotypes from a SNP
 
chip.
 
In conclusion
, u
sing synthetic and reference datasets, I compared statistical models to 
perform differential expression analysis, sampled
-
base hierarchical cluster analysis, and variant 
calling and genotyping
. This validation framework can be extended to 
evaluate other
 
methods of 
RNA
-
seq analysis as well as to evaluate 
the periodic
 
publication of new and updated analysis 
methods. 
Choosing the most appropriate software can help 
researcher
s
 
to obtained better results 
and to achieve the goals of the
ir
 
investigation
s.
 
 
Copyright by
 
PABLO DANIEL REEB
 
2015
 
 
v
 
 
To Romina, Sophia and Olivia
.
 
 
vi
 
 
ACKNOWLEDGMENTS
 
I dedicate this dissertation to my lovely family: Romina, Sophia and Olivia. Saying thank you will 
never be sufficient. Romina and I simultaneously embraced the challenges of building a family 
and moving abroad to continue my education. Both ventures have been tough but both were worth 
the efforts. We can say that one challenge has arrived to the end af
ter
 
writing
 
five chapters. The 
remaining challenge
 
has two little sprouts
, Sophia and Olivia, 
and I hope 
we will never finish 
writing that
 
story. I love you! 
 
An especial recognition goes to my advisor Dr. Juan P. Steibel for being extremely patient and 
su
pporting throughout the whole program. Certainly, he has been more than a mentor while giving 
guidelines and advice, and offering friendship in the right moments.
 
I would like to say thank you to the members of my guidance committee Drs
.
 
James Bence, C. 
Ti
tus Brown, Weiming Li, and Robert Tempelman. They all helped me to build a comprehensive 
academic background from their specific disciplines.
 
I shared a productive and friendly academic environment at Michigan State University, especially 
at the Animal Bre
eding and Genetics Group. I am taking with me an enriched experience of 
collaborative work with graduate students, professors, visiting scholars and post
-
docs.             
 
Finally, I want to thank my large family and friends. All your short or long messag
es, phone calls, 
and hugs given virtually from distant places or in person in Michigan were essential for me and 
my family. 
 
 
vii
 
 
TABLE OF CONTENTS
 
 
LIST OF TABLES
 
................................
................................
................................
.......................
 
ix
 
 
LIST OF FIGURES
 
................................
................................
................................
.....................
 
x
 
 
Chapter 1 Introduction
 
................................
................................
................................
................
 
1
 
 
Chapt
er 2 Evaluating statistical analysis models for RNA sequencing experiments
 
..................
11
 
2.1
 
Introduction
 
................................
................................
................................
.................
12
 
2.2
 
Material and Methods
 
................................
................................
................................
.
15
 
2.2.1 Simulations
 
................................
................................
................................
...............
15
 
2.2.2 Plas
modes
 
................................
................................
................................
................
18
 
2.2.2.1 Null dataset (Cheung)
 
................................
................................
.........................
18
 
2.2.2.2 Differentially expressed dataset (Bottomly)
 
................................
.........................
20
 
2.2.3 Comparison of alternative 
analysis tools for evaluating differential expression
 
..........
22
 
2.2.3.1 Filtering and normalization
 
................................
................................
..................
22
 
2.2.3.2 Differential expression analysis
 
................................
................................
..........
23
 
2.2.3.3 Multiple comparisons
 
................................
................................
..........................
23
 
2.2.4 Evaluating and comparing results from alternative analysis packages
 
......................
24
 
2.3
 
Results
 
................................
................................
................................
........................
26
 
2.3.1 Simulation
s
 
................................
................................
................................
...............
26
 
2.3.2 Plasmodes
 
................................
................................
................................
................
28
 
2.3.2.1 Null dataset (Cheung)
 
................................
................................
.........................
28
 
2.3.2.2 DE dataset (Bottomly)
 
................................
................................
.........................
29
 
2.4
 
Discussion and conclusion
 
................................
................................
..........................
31
 
APPENDIX
 
................................
................................
................................
............................
35
 
 
Chapter 3 Assessing dis
similarity measures for sample
-
based hierarchical clustering of RNA 
sequencing data using plasmode datasets
 
................................
................................
................
41
 
3.1
 
Introduction
 
................................
................................
................................
.................
42
 
3.2
 
Material and Methods
 
................................
................................
................................
.
44
 
3.2.1 Datasets
 
................................
................................
................................
....................
44
 
3.2.2 Plasmodes
 
................................
................................
................................
................
44
 
3.2.2.1 Plasmodes from Bottomly dataset
 
................................
................................
......
45
 
3.2.2.2 Plasmodes from MSUPRP dataset
 
................................
................................
.....
47
 
3.2.3 Clustering
 
................................
................................
................................
..................
49
 
3.2.4 Cluster validation using results from plasmodes
 
................................
........................
50
 
3.3
 
Results
 
................................
................................
................................
........................
51
 
3.3.1 Bottomly
 
................................
................................
................................
....................
51
 
3.3.2 MSUPRP
 
................................
................................
................................
..................
57
 
3.4
 
Discussion
 
................................
................................
................................
..................
60
 
3.5
 
Conclusion
 
................................
................................
................................
..................
66
 
viii
 
 
APPENDIX
 
................................
................................
................................
............................
68
 
 
Chapter 4 Assessing genotype call accuracy from RNA sequencing data
 
................................
.
75
 
4.1
 
Introduction
 
................................
................................
................................
.................
76
 
4.2
 
Material and Methods
 
................................
................................
................................
.
78
 
4.2.1 Resource population
 
................................
................................
................................
.
78
 
4.2.2 Variant calling
 
................................
................................
................................
...........
79
 
4.2.3 Assessing genotype call accuracy
 
................................
................................
.............
80
 
4.2.4 Equivalence between relationsh
ip matrices and correlation with SNP chip data
 
........
82
 
4.3
 
Results
 
................................
................................
................................
........................
83
 
4.3.1 Analysis across variant calling programs for 24 animals
 
................................
............
83
 
4.3.2 Comparison of genotype calls from DNA
-
seq and RNA
-
seq from multiple tissues
 
.....
85
 
4.3.3 Equi
valence between relationship matrices
 
................................
...............................
87
 
4.4
 
Discussion
 
................................
................................
................................
..................
89
 
 
Chapter 5 General discussion
 
................................
................................
................................
...
94
 
5.1
 
Introduction
 
................................
................................
................................
.................
95
 
5.2
 
Objectiv
es revisited
 
................................
................................
................................
.....
95
 
5.3
 
Future research directions
 
................................
................................
........................
 
101
 
 
REFERENCES
 
................................
................................
................................
.......................
 
103
 
 
ix
 
 
LIST OF TABLES
 
 
Table 1 Classification rule to compute false and true positive rates
................................
...........
25
 
 
Table 2 Consistency for each dissimilarity measure
 
................................
................................
..
56
 
 
Table 3 Reference similarity matrix (S) for MSUPRP plasmodes
 
................................
..............
72
 
 
Table 4 Reference dissimilarity matrix (D) for MSUPRP plasmodes
 
................................
..........
72
 
 
Table 5 Cophenetic matrix of the reference dendrogram for MSUPRP plasmod
es.
 
..................
73
 
 
Table 6 Correlation between dissimilarity measures and reference dissimilarity
 
........................
74
 
 
Table 7 Contingency table with classification rule used to compute sensitivity and error rate of 
genotype calling
 
................................
................................
................................
........................
81
 
 
x
 
 
LIST OF FIGURES
 
 
Figure 1 Algorithm used to simulate counts from existing estimates of model parameters
 
........
17
 
 
Figure 2 Multidimensional scaling analysis of experimental datasets
 
................................
........
19
 
 
Figure 3 Algorithm used to generate plasmode datasets with no differentially expressed 
transcripts under a model with one classification variable
 
................................
.........................
20
 
 
Figure 4 Algorithm used to generate plasmode datasets with differentially expressed transcripts 
under a model with two classification variables (block + treatment)
 
................................
...........
21
 
 
Figure 5 Simulation results from a scenario with three biological replicates
 
..............................
27
 
 
Figure 6 Plasmode results from Cheung dataset:
 
................................
................................
......
28
 
 
Figure 7 Plasmode results from Bottomly dataset:
 
................................
................................
....
29
 
 
-
value results
................................
................................
30
 
 
Figure 9 Steps used to process reads to obtain matrix
 
of counts
 
................................
..............
36
 
 
Figure 10 P
-
value distribution of differential expression analysis performed with edgeR for 
Bottomly data using a 
model with block and treatment fixed effects
 
................................
..........
37
 
 
Figure 11 P
-
value distribution of non differentially expressed transcripts for a 
simulated scenario 
with three biological replicates
 
................................
................................
................................
..
38
 
 
Figure 12 P
-
value distribution of non differentially expressed transcript
s for plasmodes 
generated from Cheung dataset
 
................................
................................
................................
39
 
 
Figure 13 P
-
value distribution of non differentially expressed transcripts for 
plasmodes 
generated from Bottomly dataset
 
................................
................................
..............................
40
 
 
Figure 14 Algorithm used to generate plasmodes from Bottomly dataset
 
................................
..
47
 
 
Figure 15 Multidimensional scaling analysis of MSUPRP dataset
 
................................
.............
48
 
 
Figure 16 Typical dendrograms obtained for plasmode datasets from Bottomly experimental 
data with two dissimilarity measures under three scenarios
 
................................
......................
53
 
 
Figure 17 Agreement between dissimilarity measures using Bottomly plasmode datasets
 
........
54
 
 
xi
 
 
Figure 18 Typical dendrograms obtained for plasmode datasets from MSUPRP experimental 
data with two dissimilarity measures
 
................................
................................
.........................
58
 
 
Figure 19 Agreement between dissimilarity measures using  MSUPRP plasmode datasets
 
.....
59
 
 
Figure 20 Sequencing layout of 24 barcoded samples
 
................................
..............................
69
 
 
Figure 21 Plasmode generation for MSUPRP dataset
 
................................
...............................
71
 
 
Figure 22 Reference dendrogram for MSUPRP plasmodes
 
................................
......................
73
 
 
Figure 23 Assessing SNP calling accuracy of RNA
-
seq samples of
 
longisimus dorsi
 
................
84
 
 
Figure 24 Assessing SNP calling accuracy of DNA and RNA
-
seq samples of three tissues from 
the same animal
 
................................
................................
................................
........................
87
 
 
Figure 25 Scatterplot of correlation between relationship matrices for the 24 F2 females
 
.........
89
 
 
1
 
 
Chapter
 
1
 
Introduction
 
 
Introduction
 
 
2
 
 
Transcriptome analysis through next generation sequencing technologies 
(Metzker, 2010)
 
is known as RNA sequencing (RNA
-
seq) 
(Wang 
et al., 2009)
. This technology has revolutionized 
transcriptomics since  its first introduction in 2004 
(Bennett, 2004)
 
and has been rapidly applied 
to a variety of studies and species 
(Ozsolak and Milos, 2011; Marguerat et al., 2010; 
Wickramasinghe et al., 2014)
. Gene expression analysis has been the main objective in RNA
-
seq 
experiments 
(Oshlack et al., 2010)
 
but  other quantitative and qualitative aspects of transcript 
biology have been increasing in importance as the technology evolves. These other objectives 
include studies for detecting  allele
-
specific expression
 
(Quinn et al., 2014; Piri
nen et al., 2015; 
Steibel et al., 2015)
, RNA editing 
(Lee et al., 2013; Ramaswami et al., 2013)
, 
alternative splicing 
(Griffith et al., 2010; Alamancos et al., 2014)
, novel transcripts 
(Roberts et al., 2011; Trapnell et 
al., 2010)
, and nucleotide variations 
(Quinn et al., 2013; Cánovas et al., 2010)
. 
 
A typical RNA
-
seq experiment includes 
the following steps 
(Marguerat et a
l., 2010; 
Oshlack et al., 2010)
. First, a sample is extracted from the tissue of interest, 
then 
RNA is purified 
and submitted to a series of processes known as library preparation. This process comprises 
poly
-
A RNA isolation, RNA fragmentation, reverse 
transcription to cDNA, adapter ligation, size
 
selection, and PCR enrichment. The library preparation converts the input RNA into small 
fragments of cDNA that are ready to be sequenced. Second, the cDNA libraries are placed in the 
sequencing machine and the
 
fragments are sequenced in parallel in a predefined number of 
cycles. At each cycle, fluorescent labeled nucleotides are added and the signals emitted are 
recorded and converted to base
-
calls. As a result, the sequencing machine provides a number of 
short
 
reads with length equal to the number of cycles and with read abundance proportional to 
the fragment abundance. Third, the sequenced reads are filtered (for quality) and mapped to a 
reference genome or transcriptome (either pre
-
existing or assembled from 
the sequenced reads 
themselves).
 
3
 
 
After reads have been aligned, qualitative and quantitative information can be extracted 
from them in order to be applied for different types of analysis. Qualitative information refer
s
 
to 
the study of the sequences of base
s 
per se
 
and can be used, for example, for discovering and 
calling single nucleotide polymorphisms (SNPs). Quantitative information, on the other hand, is 
based on the characteristic that the number of reads aligning to any given region of the genome 
is pr
oportional to the abundance of fragments for that region within the sample 
(Mortazavi et al., 
2008)
. These read counts can be summarized and aggregated over biologically meaningful units, 
such as exons, transcripts
,
 
or genes in order to provide a direct measure of expression 
(Oshlack 
et al., 2010)
. The most common use of this expression profiling is the analysis of differential 
expression (DE) 
(Oshlack et al., 2010)
. Differential expres
sion analysis pursue
s
 
the identification 
of genes whose transcripts show substantial changes in abundance across experimental 
conditions, such as differences between strains, tissues
,
 
or treated versus untreated individuals. 
Finally, both qualitative and q
uantitative information can be combined in studies like allele
-
specific 
expression (ASE). In ASE experiments, the goal is to identify genes for which the two alleles in 
an individual are expressed at different l
evels. ASE studies, first call
 
for SNP and 
th
en identify
 
heterozygous sites. For each of the heterozygous sites the number of reads supporting each of 
the alleles is counted and a statistic
al
 
test is performed to evaluate whether the alleles are 
expressed in balance. 
 
RNA
-
seq has expanded the field o
f transcriptomics 
as no other previous technology (
e.g
. 
microarray),
 
not 
only for model but also for non
model organisms in general 
(Ekblom and Galindo, 
2011)
. Some of the reasons that contribute to this unpr
ecedented expansion in non
model 
organisms
,
 
and for fish and wildlife species 
(Qian et al., 2014)
 
specifically
, 
are the cost
-
effectiveness
 
and the independence of prior genomic knowledge of the speci
es of interest 
(Ekblom and Galindo, 2011; Vijay et al., 2013)
. For instance, Salem et al. 
(
2012)
 
used RNA
-
seq 
analysis to identify SNPs markers associated with growth
-
rate in rainbow trout. The authors 
4
 
 
compared two full
-
sib families, one selected for improved growth. Putative SNPs identified by 
RNA
-
seq as associated with growth trait were further est
imated by genotyping individuals from 
40 families using a designed array platform. Smith et al. 
(
2013)
 
studied the ability of crimson 
spotted rainbowfish to adapt to temperature stress. The authors tested for differential exp
ression 
across two groups of fish exposed to different conditions using a 
de novo 
assembled 
transcriptome. The genes identified as differentially expressed were related to critical metabolic 
pathways for temperature tolerance and provided candidates to ext
end investigations of 
population adaptations to climate change. Babbit et al. 
(
2012)
 
used RNA
-
seq to compare gene 
expression patterns in the ovarian tissue of juvenile and adult female baboons and were
 
able to 
link the differentially expressed genes to selection occurring in human and chimpanzee. This 
information is valuable to study the evolution of gene regulation
 
in humans, specifically providing
 
insight into the loci that contributed to shifts in de
velopmental timing and physiology during human 
evolution. All the applications of RNA
-
seq aimed at making population
-
wide inferences (e.g: 
differential 
expression, allele
-
specific expression
),
 
including the cited ones, use
 
a sample from 
the population in q
uestion, and thus, they rely on statistical analysis
 
to make inferences
. 
Consequently
, the validity of the conclusions depend directly on the validity of the 
statistical 
methods used in the studies.
 
Even though a number of statistical analysis methods and 
tools have been developed for 
various applications of RNA
-
seq experiments 
(Chen et al., 2011)
, limitations and challenges still 
remain in determining their statistical validity due to potential variation introduced in each of the 
steps of the design and execution of an RNA
-
seq study 
(Ozsolak and Milos, 2011; Marguerat et 
al., 2010)
. A characteristic aspect of RNA
-
seq is that the measure of expression, total number of 
reads mapping to a certain region in the reference, is of discrete nature 
(Anders and Huber, 2010)
 
in contrast to continuous intensity measures obtained in microarray analysi
s. These count data 
require statistical models that can account for the discrete nature of the data when testing for 
5
 
 
differential expression or when summarizing massively parallel multivariate expression records 
through clustering analysis. A simple Poisso
n model seems appropriate for this type of data when 
the experiment includes only technical replicates 
(Marioni et al., 2008)
. H
owever, due to extra 
sources of variation at the biological level, read counts are always overdispersed and other 
models 
have been proposed
, for example,
 
based
 
on the negative binomial distribution 
(Robinson 
et al., 2010; Anders and Huber, 2010; Hardcastle and Kelly, 2010; Di et al., 2011)
. 
Other 
characteristics of RNA
-
seq experiments such as the small sample size and large number of 
re
sponses have limited the direct application of 
generalized linear mixed models developed for 
count data in other disciplines. 
Current models based on discrete distributions (i.e. Poisson, 
negative binomial) are limited to simple experimental design
s
 
such a
s pairwise or multiple fixed 
factors. Although those designs are extremely useful in most of lab experiments, they cannot 
account for random effects and hierarchical structures which are commonly present in ecological 
experiments and studies. Thus, it woul
d be valuable to evaluate the
 
use of transformation in linear
 
mixed models using realistic datasets. The evaluation should include properties of interest when 
inferring results such as the rate of false positives 
and statistical power. Specifically
, some a
uthors 
(Langmead et al., 2010; Law et al., 2014)
 
have proposed the use of data transformation to fit 
simpler Gaussian linear model analysis, which allow a straightforward implementation of 
multilevel models.  Thus, there are multiple recomm
ended ways of analyzing RNA
-
seq for 
differential expression and clustering whose statistical properties need to be evaluated.
 
Hierarchical cluster analysis has been the preferred statistical method to represent results 
of samples and genes after differenti
al expression  analysis
 
(Liu and Si, 2014)
. Hierarchical 
clustering is also a good unsupervised technique to discover subpopulation structure with respect 
to a set of multivariate responses 
(Legendre and Legendre, 2012)
. Thus, hierarchical clustering 
of samples using gene expression data from RNA
-
seq experiments holds great promise for the 
study of natural variation in relation 
to gene expression in ecological experiments. Hierarchical 
6
 
 
clustering analysis relies on a measure of similarity/dissimilarity among the features (i.e. genes
, 
transcripts or samples) to build a dendrogram. But dissimilarity measures may be greatly affected
 
by the distribution of the measured variables. Dissimilarity measures for RNA
-
seq derived gene 
expression count data are mainly based on logarithmic transformation of raw counts. Additionally, 
a specific Poisson dissimilarity measure has been proposed for
 
RNA
-
seq data 
(Witten, 2011)
. 
However, only parametric simulations and exemplar data have been used to compare the 
performance of these measures for sample clustering. Due to the high dispersion of count data 
derived from RNA
-
seq
, a comparison based on experimental
 
datasets and using a good measure 
of consistency of the resulting dendrogram representation could contribute to decide on the best 
dissimilarity measure
s
 
for hierarchical clustering implementations. Another relevant que
stion in 
this area is: are transformation
s
 
used for differential expression analysis also appropriate for 
clustering analysis? Finding a dissimi
larity measure that can reliably
 
represent the sample 
structure before fitting any model could be of interest to
 
conduct 
a priori
 
exploratory data analysis 
to be followed up by inferential analyses.
 
Finally, another inferential problem that deserves study is the use of algorithms for calling 
SNP genotyping from RNA
-
seq data. SNP genotyping from RNA
-
seq is an applica
tion of RNA
-
seq experiments 
that 
has proven to be useful in some studies and we can expect that its use will 
increase in the future 
(Wickramasinghe et al., 2014; Seeb et al., 2011; Schunter et al., 2013; 
Narum et al., 2013)
. Curre
ntly, some questions remain
 
about the equivalence of using this 
technology versus more traditional genotyping chips and DNA
-
seq. For instance, the effe
ct of per 
base read depth when calling and genot
yping variants should be studied
 
in terms of its impact in 
sensitivity and specificity for calling SNP from RNA
-
seq data. Moreover, the statistical properties 
of called genotypes should be studied separately 
for heterozygous and homozygous genotypes 
because they are likely to be used for different purposes. Ano
ther important property
 
of genotypes 
derived from RNA
-
seq data is how well they can represent the genetic structure of a population, 
7
 
 
when compared to mo
re traditional measures of relatedness based on SNP chip genotypes and 
on genealogies.
 
 
In order to compare and decide among the available statistical methods in each of the 
mentioned applications of RNA
-
seq (e.g. differential expression analysis, clusteri
ng, SNP 
genotyping) a validation framework should be established for each case. Consequently, in this 
dissertation I followed the epistemological guidelines proposed by Mehta et al. 
(
2004)
 
for high
-
dimensional biology. Assessing statistical validity requires an explanation of what a method is 
supposed to do or what properties 
an estimate or test 
is supposed to exhibit. Although there is 
subjectivity in selecting the desired properties, once
 
the criteria are chosen, methods can be 
evaluated objectively 
(Mehta et al., 2004)
. For instance, a desired property to compare among 

rate. Once we
 
choose this property, we can compare the models by applying a validation, such 
as simulating data or other validating procedures.
 
 
Validating procedures for RNA
-
seq analysis methods have followed the same strategies 
that have been previously implemented i
n technologies such as microarray. Basically, the most 
used validating procedures can be grouped into the following categories: a) theoretical 
demonstration, b) exemplar dataset, c) simulation, d) gold
-
standard reference, and e) plasmode 
datasets. Theoreti
cal derivations in the context of high
-
dimensional biology often lacks of a 
mathematical demonstration to support their validity 
(Mehta et al., 2006)
. The use of an exemplar 
dataset offer the advantage of accounting for the whole complexity of the transcriptome such as 
having a variance
-
covariance matrix that reflect real interactions am
ong features. However, this 
approach must be considered only as an illustration in a particular dataset and not as evidence 
to support a method. Computer simulations have been the most commonly used procedure 
(Anders and Huber, 2010; McCarthy et al., 2012)
 
d
ue to ease
 
in creating data
sets under 
hypothetical scenarios by assuming a parametric data gener
ation model. However, although
 
gene 
8
 
 
expression data is easy to simulate, mimicking a realistic gene expr
ession dataset is quite 
challenging. Gold
-
standard references obtained with a different technology have been used to 
validate results 
(Fang and Cui, 2011)
. Typical r
eferences for gene expression results from RNA
-
seq are qPCR data 
(Bullard et al., 2
010; Rapaport et al., 2013)
. However, analysis models for 
qPCR data should themselves be validated 
(Steibel et al., 2009)
. Anoth
er example of gold 
standard is the use of mask
-
and
-
impute for estimating imputation accuracy 
(Badke et al., 2013; 
Gualdrón Duarte et al., 2013)
 
and the use of genotyping chips to validate gen
otype calls from 
genome sequencing 
(Kumar et al., 2014)
. Finally, the use of plasmodes is another app
ropriate 
procedure that can be applied to validate a statistical method. This approach aims at generating 
datasets that preserve the characteristics of experimental data with the benefit of knowing the 
true status as it happens with simulated data. 
 
Plasmo
des are synthetic datasets generated from experimental data 
(Mehta et al.
, 2004)
. 
The term plasmode was introduced in 1967 by Cattel
l
 
et al. 
(
1967)
 
as a tool for validating 
techniques in multiv
ariate data. Unlike parametric simulations, data distributions and correlations 
generated in plasmodes can be more realistic because they are taken directly
 
from experimental 
data thus no
 
assumptions are required 
a priori
 
to create the datasets. Plasmodes have to be 
generated according to the validation objectives and the available experimental data 
(Mehta et 
al., 2006)
 
and one challenge is precisely how to create a plasmode for differential expression 
analysis versus a plas
mode to validate results for
 
sample
-
 
or even gene
-
 
based
 
clustering.
 
In this diss
ertation
,
 
I show how to validate statistical analysis methods for RNA
-
seq data 
by creating plasmode datasets. I also use parametric simulations and a gold
-
standard reference 
to complement the validation analysis. Plasmodes were created in three different w
ays. First, I 
created plasmodes by applying resampling
-
based methods consistent with the null hypothesis 
(no differential expression). Second, I created differentially expressed plasmodes with a known 
proportion of differentially expressed genes. These two
 
methods of creating plasmodes are 
9
 
 
particularly useful to evaluate statistical properties of differential expression analysis models and, 
for evaluating dissimilarity metrics for clustering. Third, I generated synthetic individuals by 
combining known propo
rtions of transcript
-
specific read counts from two paternal individuals, 
which is more relevant for comparing dissimilarity metrics for hierarchical clustering. Finally, I used 
a gold
-
standar
d reference comparison 
to assess properties of variant calling me
thods.
 
 
Given the rapid
 
advances in analysis tools designed for RNA
-
seq experiments, a better 
understanding of the properties of results obtained by commonly used statistical analysis methods 
would provide valuable 
information for researchers who
 
either de
velop analysis models and tools 
or utilize analysis tools in their investigations. A systematic comparison t
h
rough a valid framework, 
as provided by plasmodes, could supply reference datasets to be used as benchmark
s
 
for 
comparing analytic approaches. Furt
hermore, researchers interested in specific experimental 
data could use plasmodes generated by similar experimental conditions to conduct pilot 
in silico
 
studies and to choose the most suitable analysis procedure. For instance, a researcher may 
evaluate wh
ether to use a generalized linear model for analyzing the read counts of a differential 
expression experiment or to apply an appropriate transformation in order to use a more flexible 
general linear mixed model. As the range of objectives of RNA
-
seq studie
s e
xpands (i.e. focusing 
not only o
n differential expression analysis) it will be of interest to contribute with analogous 
validation framework in such 
specific applications. In accor
dance with this demand, we provide 
metrics to evaluate dissimilarity matr
ices to represent sample relationships using hierarchical 
cluster analysis, and metrics to compare variant calling software. The evaluation of dissimilarity 
measures are of direct benefit to summarize and represent biological/technical variability in 
exper
imental design, or relationships among individuals in population studies. The comparison of 
variant calling software provide
s
 
information on genotyping accuracy that could improve other 
related analysis such as allele
-
specific expression or RNA editing.   
  
 
10
 
 
Thus, the overarching goal of this dissertation was to propose and implement validation 
procedures based on experimental data to estimate the properties of widely used statistical 
analysis methods of RNA
-
seq experiments.
 
The specific objectives were:
 
1.
 
To
 
evaluate statistical models for differential expression analysis in RNA
-
seq experiments 
 
2.
 
To assess the properties of dissimilarity measures for agglomerative hierarchical 
clustering of RNA
-
seq data
 
3.
 
To  compare sensitivity and false discovery rates for SNP
 
genotyping in variant calling 
programs
 
 
11
 
 
Chapter
 
2
 
Evaluating statistical analysis 
models for RNA sequencing experiments
 
 
Evaluating statistical analysis models for 
RNA sequencing experiments
 
 
Reeb, P. D., and Steibel, J. P. (2013). Evaluating statistical analysis models 
for RNA sequencing experiments. 
Front. Genet. 
4, 1
-
9. 
doi:10.3389/fgene.2013.00178.
 
 
12
 
 
2.1
 
Introduction
 
 
RNA sequencing (RNA
-
seq) technology is being rapidly adopted as the platform
 
of choice 
for high
-
throughput gene expression analysis 
(Ozsolak and Milos, 2011)
. Many methods have 
been proposed to model 
relative transcript abundances obtained in RNA
-
seq experiments but it 
is still difficult to evaluate whether they provide accurate estimations and inferences.
 
Sound statistical analysis of RNA
-
seq data should consider not only the factors of any basic 
expe

transcriptomic, etc).  An RNA
-
seq experimental design must consider treatment and block 
structures, and combine them according to the principles of a well
-
planned design: 
randomization, 
blocking and replication 
(Aue
r and Doerge, 2010)
. Typically, fixed or random effects such as 
library multiplexing, sequencing lane, flow cell, individual sample, tissue, or time can be crossed 
or nested with treatments or other experimental conditions. Such a design is used to mode
l 
thousands of correlated variables (i.e transcripts), usually, in a context of small number of 
biological replicates. Although the development of reliable models that account for all these 
factors is challenging, it is even more difficult to assess the va
lidity of a particular analysis model 
(Pachter, 2011)
. 
 
Validity of statistical models for differential expression analyses has been evaluated by (i) 
app
lying th
e model to a novel data
set, (ii) deriving analytical proofs, (iii) using simulations, (iv) 
comparing to a gold
-
standard measure, or (v) constructing plasmodes. In (i) the true status of 
nature is unknown, therefore this method must only be accepted as an i
llustration and not as 
evidence to support a model. However, any of the last four options, or a combination of them, 
could be used to demonstrate adequacy of a model. Obtaining a mathematical demonstration (ii), 
may be impossible for some models 
(Gadbury et al., 2008)
.   Most of the models rely on 
assumptions that are difficult to verify and the consequences of departures from assumptions may 
13
 
 
not be clear. Computer simulation (iii) has been the most commonly used procedure 
(Anders and 
Huber, 2010; McCarthy et al., 2012)
 
. This preference 
is due to ease in creating data
sets under 
diverse scenarios by controlling the set of parameters used in the simulation. Nevertheless, such 
g
enerated data depend on the parameterization selected and the assumptions of the simulation 
model. Moreover, these dataset may constitute a partial representation of reality as the complexity 
of RNA
-
seq data is hard to mimic. Typical gold
-
standard (iv) for
 
gene expression are qPCR data 
(Bullard et al., 2010; Rapaport et al., 2013)
. Ho
wever, analysis models for qPCR data should 
themselves be validated 
(Steibel et al., 2009)
. The use of plasmodes (v) is another app
ropriate 
procedure that can be applied to validate a statistical method. This approach aims at generating 
datasets that preserve the characteristics of experimental data with the benefit of knowing the 
true status as it happens with simulated data.
 
A plasm
ode is a data
set obtained from experimental data but for which some truth is known 
(Mehta et al., 2004)
. Plasmodes have been applied in microarrays 
(Gadbury et al., 2008)
, 
admixture estimation methodologies 
(Vaughan et al., 2009)
 
and qPCR 
(Steibel et al., 2009)
. 
This 
procedure has not been extensively applied in RNA
-
seq since it requires large sets of raw data 
with an accurate description of
 
the experimental conditions under which they were obtained. This 
information is essential to accurately develop plasmodes under null and alternative hypotheses.  
Only recently,  an initiative has provided a repository with ready
-
to
-
use databases from RNA
-
seq 
studies 
(Frazee et al., 2011)
. 
 
Processed data obtained from RNA
-
seq experiments are essentially counts that in the 
simplest model represent total number of reads mapping to a region in a reference genome or 
transcriptome. A comprehensive comparison of stochastic models that have been pr
oposed is 
presented in Pachter 
(2011)
. Although different discrete distributions such as binomial, 
multinomial, beta
-
binomial, Poi
sson, and negative binomial, have been proposed to model RNA
-
seq data, Poisson and negative binomial are the most implemented ones in RNA
-
seq analysis 
14
 
 
software. A simple Poisson model seems appropriate when the experiment includes only 
technical replicates
 
from a single source of RNA 
(Marioni et al., 2008)
. In practice, however, due 
to e
xtra sources of variation, the observed dispersion is larger than the expected for a simple 
Poisson distribution and to correctly account for over
-
dispersion, generalized Poisson (GPseq) 
(Srivastava and Chen, 2010)
, mixed Poisson (TSPM) 
(Auer and Doerge, 2011)
, Poisson log
-
linear 
(PoissonSeq) 
(Li et al., 2012)
 
and negative binomial (edgeR, DESeq, baySeq, NBPSeq) 
(Robinson et al., 2010; Ander
s and Huber, 2010; Hardcastle and Kelly, 2010; Di et al., 2011)
 
are 
used instead.  Regardless of the model, calculating dispersion parameters requires special 
statistical and numerical approaches due to the small sample sizes and large number of 
respons
es used in RNA
-
seq studies. In particular, borrowing information across transcripts when 
estimating model parameters, as used in microarrays 
(Smyth, 2004; Cui et al., 2005)
, has been 
also proposed for RNA
-
seq 
(Robinson and Smyth, 2008; Anders and Huber, 2010; Zhou et al., 
2011)
. Another challenging issue for these statistical anal
ysis models, is the ability to handle 
different experimental sources of variation. Most of the models allow fitting simple effect models 
and pair
-
wise comparison between treatments but only a few allow multiple factors 
(McCarthy et 
al., 2012)
.  Currently, to the best of our knowledge, there is only one available model that can fit 
random
 
effects 
(Van De Wiel et al., 2013)
. Methods that can accommodate complex hierarchic
al 
designs and provide more powerful tests to detect differentially expressed transcripts are under 
active research. 
On the other hand, microarray analysis models and software usually assume a 
Gaussian distribution for response variables, but they accommod
ate fixed and random effects in 
a straightforward manner
 
(Rosa et al., 2005; Cui et al., 2005)
. Consequently, an alternative to 
model counts in RNA
-
seq experiments is to transform counts and use Gaussian models 
(Langmead et al., 2010; Smyth et al., 201
2)
. 
 
In any case, given the multitude of available statistical models and the complexity of 
experimental design of many gene expression studies, researchers often find themselves having 
15
 
 
to decide between competing models and analysis program. In other c
ases, although a 
researcher may have an a priori designated software and model for RNA
-
seq data analysis, the 
question is if the fitted model produces sound inferences.
 
In this paper, we present and apply a methodology for evaluating statistical methods fo
r RNA
-
seq experiments by combining results from computer simulations and plasmodes. We follow the 
epistemological guidelines stated in Mehta et al. 
(2006)
  
for high
-
dimensional biology and provide 
a general framework that can be adapted to different experimental conditions.
 
 
2.2
 
Material and Methods
 
2.2.1
 
Simulations
 
Simulated
 
datasets were created conditional on estimated parameter values and results that 
had been previously obtained 
(Ernst et al., 2011)
. The data consisted of read co
unts from an 
RNA
-
seq experiment based on a developmental expression study 
(Sollero et al., 2011)
. 
Experimental and alignment protocols are described in the supplemental material (
Figure 
9
). 
Estimations for parameters 


and 


were obtained by fitting generalized linear Poisson models 
with log
-
library size as an offset variable using function lmer 
(Bates et al., 2013)
 
from R 
(R 
Development Core Team, 2014)
.
 
 
Equation [1] represents the generalized linear model used to generate the simulated 
datasets: 
 

[1]
 
 
16
 
 
where 
y
ij
 
is the read count for a particular transcript in treatment 
i
 
and sample 
j
, 
O
ij
 
is a known off
-
set value (in this case the total library size), 

i
 
is the group mean, 
e
ij
 
is a sample
-
specific residual. 
The transcript sub
-
index (g) was omitted for convenience.
 
Given estimates of parameters from equation [1] for transcripts, we simulated read counts by 
following the algorithm described in 
Figure 
1
. The output from this procedure consisted of a matrix 
of counts of size 
T
 
by 
2nr
 
with a known proportion (
p
0
) of differentially expressed transcripts and 
known group effec
ts (

i
). Treatment is represented in this matrix by 
nr
 
columns, but with only 
n
 
independent (biological) replicates. While this simulation is not based on the negative binomial 
distribution, it continues to be an over
-
dispersed Poisson process commonly use
d to simulate 
RNA
-
seq counts 
(Blekhman et al., 2010; Auer and Doerge, 2011; Hu et al., 2011)
.  The resulting 
over
-
dispersed Poisson counts will have means, variances, and treatment effects sampled from 
those estimated from experimental
 
data. The procedure can be repeated K times to produce 
several simulated datasets.
 
17
 
 
Figure 
1
 
Algorithm used to simulate counts from existing estimates of model parameters
 
 
We set 
K=1000
 
and 
T=5000
, producing 1000 simulated datasets with 5000 transcripts each. 
Noteworthy, when sampling transcripts in S, it is assumed that all transcripts are differentially 
expressed (no significance testing is performed). But subsequently, the mean treatment 
differe
nces (in the log
-
scale) are zeroed out if the transcripts are assigned to 
S
0
. For transcripts 
assigned to 
S
1
, mean differences are kept unchanged; consequently 
S
1
 
includes a whole 
distribution of treatment effects from very small to large 
according to the 
distribution 
estimated 
from the experimental data.
 
We simulated nine scenarios by combining three levels of biological replication (n=3, 5, 10) 
and three levels of technical replication (
r
=1, 3, 5).  The proportion of differentially expressed 
transcripts w
as set to 0.1. 
 
18
 
 
2.2.2 
Plasmodes
 
In contrast to simulation datasets based on equation [1], we generated plasmode datasets not 
based on any model. Plasmodes were generated using data available in the online resource 
ReCount  
(Frazee et al., 2011)
. From the whole collection of analysis
-
ready datasets, we chose 
to work with two RNA
-
seq experiments to illustrate t
he generation of 1) a null data
set, where there 
are no obvious systematic effects that explain v
ariance in gene expression 
and, 2) a data
set with 
treatment and block effects.
 
 
2.2.2.1 
Null dataset (Cheung)
 
The data originated in a study of immortalized B
-
cells from 41 (17 females and 24 males) 

(Cheung et al., 
2010)
. The samples were sequenced using the Illumina Genome Analyzer. To generate a 
plasmode dataset, we selecte
d the 21 samples from male individuals that were represented with 
only one technical replicate. The resulting gene expression data exhibits extensive variation that 
cannot be attributed to any systemati
c factor (
Figure 
2
a
). Any random partition of the dataset into 
two (or more) categories should shield a null dataset where no differential expression is expected 
beyond the normal sample
-
to
-
sample variation. Consequently
 
this dataset lends itself to create 
plasmodes to evaluate statistical properties of analysis models under the null hypothesis.
 
 
19
 
 
Figure 
2
 
Multidimensional scaling analysis of experimental datasets
 
(A)
 
Cheung samples: F=Females and M=males; 
(B)
 
Bottomly samples:  labels correspond to 
strain (treatment) B6=C57BL/6J, D2=DBA/2J, and colors to flowcell number (block): red=4, 
black=6 and green=7. In Cheung dataset there is not clear distinction between femal
es and males 
while in Bottomly samples are first grouped in two large groups corresponding to strain B6 and 
D2 and then in subgroups consistent with flowcell number
 
 
To generate null data
sets, we pr
oceeded as explained in 
Figure 
3
. Using 
n=21
 
samples from 
males, we generated 
p=10
 
plasmodes each with 
t=2
 
groups and 
r=10
 
biological replicates in each 
group. 
 
Notice that no
 
parametric model is used at any time. We constru
cted plasmodes by reshuffling 
data and assigning an arbitrary treatment label. In this way overall distribution and gene
-
to
-
gene 
correlations remain unchanged with respect to real data.
 
20
 
 
Figure 
3
 
Algorithm
 
used to generate plasmode data
sets with no differentially expressed 
transcripts under a model with one classification variable
 
 
2.2.2.2 
Differentially expressed 
dataset (Bottomly)
 
In Bottomly et al. 
(2011)
, the authors arranged 21 samples from two inbred mouse strains 
(B6 and D2; n for B6=10, n for D2=11) on 21 lanes of three Illumina GAIIx flowcells and they 
analyzed the RNA
-
seq reads with a simple one
-
way classification (strain) model. A
fter 
performing descriptive analysis of gene expression data, we found that not only strain but also 
the experiment number (flowcell) explained a large amount of the variation (
Figure 
2
b). For 
example, the first principal dimension clearly divides samples from each strain, but the second 
principal dimension shows substantial variation between flowcells, especially flowcell 4 (red) 
versus the other two.
 
Consequently, we 
blocked by experiment and used edgeR to fit a model with strain and 
experiment as fixed effect, resulting in a large number of putatively differentially expressed 
genes (
Figure 
10
). Due to a strong experiment effect, we decided to conduct randomization for 
plasmode construction within experiment number as detailed in 
Figure 
4
.
 
We g
enerated 10 plasmodes executing step 4 to 7 with p=10 and 

0.20. Notice that in step 
3, we used edgeR to obtain a list of DE genes (set G) to build a plasmode with some genes under 
alternative hypothesis. Any other statistical software, however, can be us
ed with the only 
21
 
 
requirement of defining a sufficient small q
-
value threshold. After genes are selected no model is 
used at any time. Similar to the previous section plasmodes are constructed by reshuffling data, 
but in this case an effect estimated from r
eal data is added to selected genes. Again, we expect 
that this procedure yields plasmodes with identical distribution to real data for non differentially 
expressed genes and with comparable effect sizes for differentially expressed genes.
 
 
Figure 
4
 
Algorithm used to generate plasmode datasets with differentially expressed transcripts 
under a model with two classification variables (block + treatment)
 
 
22
 
 
2.2.3 
Comparison of alternative analysis tools for evaluating 
differential expression 
 
To illustrate the use of simulated datasets and plasmodes we compared three R 
(R 
Development Core Team, 2014)
 
packages from Bioconductor 
(Gentleman et al., 2004)
. Two of 
them, edgeR  and D
ESeq, were designed specifically for statistical analyses of RNA
-
seq 
experiments while the third one,  MAANOVA 
(Cui et al., 2005)
, was originally conceived for 
analyzing microarray experiments. As mentioned before, MAANOVA has the ability of fitting 
hierarchical models that can better accommodate complex experimental design assumptions. 
However, such flexib
ility comes at the price of assuming a Gaussian distribution. Data 
transformation and use of permutation to set significance thresholds can help alleviate these 
limitations, but its performance may still be contingent upon sample size and total read counts
 
per 
transcript. Consequently, we included MAANOVA in this study and compare it to two well 
established packages for RNA
-
seq analysis.
 
 
2.2.3.1 
Filtering and normalization
 
A double filtering cr
iterion was applied to all data
sets previous to normalization a
nd 
statistical analysis. Transcripts with 2 or more reads per million in at least as many libraries as 
number or biological replicates were kept in the analysis. In the simulation study, technical 
replicates were summed up before filtering. Consequently, t
he technical replicate level only 
represents increased sequencing depth.
 
Normalization aimed at accounting for differences in library size and composition not 
attributable to treatments. To conduct the analysis with edgeR, data were normalized using the 
sc
aling method proposed by Robinson and Oshlack 
(2010)
 
and the logarithm of the resulting 
effective library size were used by default as offsets in the model. 
 
23
 
 
Analyses with DESeq were performed on counts previously normalized by function 
estimateSizeFactors. According t
o Anders and Huber 
(2010)
, this normalization method is
 
similar to the one proposed by Robinson and Oshlack 
(Robinson and Oshlack, 2010)
 
in edgeR, 
and it is the recommended procedure by the authors of DESeq.
 
Normalized values to use in MAANOVA were obtained with function voom() of  the limma 
package 
(Smyth, 2005)
. The process, analogous to the one proposed in 
(Smyth et al., 2012)
, 
included adjustment for compositional structure using function calcNormFactors() of edgeR  and 
transformation to log2
-
counts per million.
 
 
2.2.3.2 
Diff
erential expression analysis
 
edgeR
:
 
Differential expression was tested by likelihood ratio tests using the generalized linear 
model functionality and estimating tagwise dispersions.
 
DESeq
:
To look for differentially expressed genes, function nbinomGLMTest 
was applied using 
the dispersion estimates generated by function estimateDispersions.
 
MAANOVA
:
 
In the linear model fit by MAANOVA lane was treated as a fixed array effect of a 
single
-
color microarray. Differential expression analysis was performed using bo
th, moderated 
F
-
test (Fs) and transcript by transcript F
-
test (F1). Significance was assessed using 100 sample 
permutations 
(Yang and Churchill, 2007)
.
 
 
2.2.3.3 
Multiple 
comparisons 
 
It is recognized that correction of p
-
values when making multiple comparisons is essential in 
high throughput differential expression analyses 
(Storey and Tibshirani, 2003)
. The most common 
procedure used is the computation of the false discovery rate or FDR 
(Benjamini and Ho
chberg, 
1995)
. Properties of methods to estimate FDR rely heavily on the distribution of p
-
values 
(Li et 
24
 
 
al., 2012)
. In this case we did not aim at selecting individual differentially expressed genes or 
gene sets but we focused at studying the properties of tests in terms of type I and type II error 
rates. Consequently, we concentrate on compar
ison of nominal and empirical type I and type II 
error rates without applying multiple correction and we discuss how departures of assumed values 
can further affect decisions when applying p
-
value corrections.
 
 
2.2.4 
Evaluating and comparing results from 
alternative analysis packages
 
To compare performances of derived tests in terms of power and type I error rates, we 
generated receiver operator characteristic (ROC) curves by computing true positive rate (TPR) 
and false positive rate (FPR) at given signifi
cance thresholds. The TPR was calculated as the 
proportion of true positives (TP) over the total number of simulated differentially expressed 
transcripts (S
1
). FPR, on the other hand, was calculated as the proportion of false positives (FP) 
over the total 
number of transcripts simulated with no differential expression (S
0
). See table 1 for 
details.
 
 
25
 
 
Table 
1
 
Classification rule to compute false and true positive rates
 
Analysis 
method status
 
Transcript simulation status
 
Total
 
Not 
differentially 
expressed
 
Differentially 
expressed
 
not declared 
significant
 
T
N
 
FN
 
R
0
 
declared 
significant
 
FP
 
TP
 
R
1
 
Total
 
#S
0
 
#S
1
 
G
 
 
FP=number of false positives (transcripts in S
0
 
set declared differentially expressed), TP= 
number of true positives (transcripts in S
1
 
declared differentially expressed), FPR= false positive 
rate= FP/#S
0
, TPR=true positive rate=TP/#S
1
 
 
Finally, distributions of p
-
values were compared by quantile
-
to
-
qu
antile plots and 
histograms.
 
Analyses were performed at the Michigan State University High Performance Computing 
Center facilities using R (version 2.15.1), edgeR (version 3.0.8.4.6), limma (version 3.14.4), 
DESeq (version 1.10.1) and MAANOVA (version 
1.28.0).
 
 
26
 
 
2.3
 
Results
 
2.3.1 
Simulations
 
Figure 
5
 
shows results obtained for a simulation with 3 biological replicates and 1 technical 
replicate. Similar results were found in other simulated scenarios (data not shown).  
 
The quantile
-
to
-
qua
ntile plot in 
Figure 
5
 
allows evaluation of the fit of observed p
-
values to the 
uniform (0,1) distribution expected under null hypothesis 
(
Leek and Storey 2011
)
. P
-
values 
corresponding to MAANOVA showed a more characteristic pattern whereas edgeR and DESeq 
presented significant departures from such distribution. Furthermore, 
the logarithmic scale allows 
to easily inspect the behavior of very small p
-
values. DESeq
 
presented larger p
-
values than 
expected up to  a cutoff  of 0.001, while the opposite pattern occur for p
-
values smaller than 0.001. 
Both MAANOVA approaches presented a close to expected pattern with a small deviation for p
-
values smaller than 0.0001. To 
compute the logarithm, all p
-
values equal to zero were replaced 
by the minimum observed p
-
value and thus generated the plateau at the end of the distributions 
of MAANOVA results.  In addition, quantile
-
to
-
quantile plots allowed us to select Fs and F1 tests
 
computed with permutation against the tabulated approach (
Figure 
8
a
-
b). An alternative 
representation of p
-
value distribution using histograms is presented in the sup
plemental material 
(
Figure 
11
).
 
 
27
 
 
Figure 
5
 
Simulatio
n results from a scenario with three
 
biological replicates
 
(A)
 
Q
-
Q uniform plot of non differentially expressed transcripts, 
(B)
 
type I error rate 
vs. nominal 
significance values, and 
(C)
 
ROC curves. Models:  
i) edgeR (blue), ii) DESeq (red), iii) MAA
-
Fs: 
MAANOVA  Fs moderated test using permutation (green), and iv) M
AA
-
F1: MAANOVA F1 
transcript by transcript test using permutation (blue)
 
 
In concordance with the observed p
-
value distributions, the realized type I error rates 
levels for DESeq and edgeR were much different than expected in comparison with MAANOVA 
approaches (
Figure 
5
b). All the packages presented higher realized significance levels when 
evaluated at nominal values below 0.01, with edgeR being the most liberal, and MAANOVA the 
least deviated from nominal values. 
 
ROC curves had similar patterns for each of the nine sim
ulated scenarios. Power improved at 
a given FPR as the number of technical and/or biological replicates increased. In the scenario 
with 3 biological replicates, the enhancement in power when adding technical replicates seems 
to be particularly greater than
 
in a scenario with 5 or 10 biological replicates (data not shown). In 
the case with 3 biological replicates and 1 technical replicate (
Figure 
5
c), edgeR and DEseq ha
d 
similar power while the MAANOVA analyses reported less power. 
 
 
28
 
 
2.3.2 
Plasmodes
 
2.3.2.1 
Null data
set (Cheung)
 
Q
-
Q plot in 
Figure 
6
a
 
shows the adequacy of p
-
values t
o the uniform distributio
n for each of 
the plasmode data
sets analyzed with the different models. All the models presented large 
dispersions with some cases being close to the expected values and some being far apart. In 
particular, edgeR results tend to be above the identity line which means that observed p
-
val
ues 
are smaller than expected. On the contrary, DESeq and both MAANOVA tests tend to have a 
more conservative behavior as they presented larger observed p
-
values than expected. See also 
Figure 
6
b where edgeR  presented inflated type I error rates for nominal significance threshold 
smaller than 0.01.
 
 
Figure 
6
 
Plasmode results from Cheung dataset:
 
(A)
 
Q
-
Q uniform plot of non differentially expressed transcripts, 
(B)
 
type I error rate 
vs. nominal 
significance values. Models:  
i) edgeR (blue), ii) DESeq (red), iii) MAA
-
Fs: MAANOVA  Fs 
moderated test using permutation  (green), and iv) MAA
-
F1: MAANOVA F1 t
ranscript by transcript 
test using  permutation  (blue)
 
 
29
 
 
2.3.2.2 
DE dataset (Bottomly)
 
The p
-
value distributions (
Figure 
7
a) presented similar dispersion patterns to the one 
observed in the plasmodes generated from Cheung dataset utilizing edgeR and DESeq. 
However, p
-
value distributions for MAANOVA tests were 
more homogeneous across datasets 
with the p
-
values from F1 test tabulated approach being closer to the expected values under 
uniform distribution.  ROC curves for DESeq and edgeR were analogous after adjusting for type 
I error rates. Besides, both programs
 
reported higher power than analysis performed with 
MAANOVA (
Figure 
7
c).
 
Interestingly, and opposite to previous datasets,  the best F test to apply 
when using MAANOVA
 
was F1 with tabulated F values
-
compare the proximity to the red line in 
Figure 
8
e in contrast to the pattern in 
Figure 
8
f.
 
 
Figure 
7
 
Plasmode results from Bottomly dataset:
 
Q
-
Q uniform plot of non differentially expressed transcripts, 
(B)
 
type I error rate 
vs. nominal 
significance values, and 
(C)
 
ROC curves. Models:  
i) edgeR (blue), ii) DESeq (red), iii) MAA
-
Fs: 
MAANOVA  Fs moderated test  with permutation (green), and iv) MAA
-
F1: MAANOVA F1 
transcript by transcript test tabulated (blue)
 
 
30
 
 
Figure 
8
 

-
value results
 

-
value results of non 
differentially expressed transcripts using Fs 
moderated test and F1 transcript by transcript test, with a tabulated (right) or permutation (left) 
approach. In the simul
ated data
set 
(A
-
B)
 
the permutation approach presented a more 
characteristic uniform distribution, the plateau at the end is caused by the replacement of zeroes 
by the minimum observed p
-
value when computing logarithm. Plasmodes generated from Cheung 
datase
t, presented similar patterns either using a tabulated or a permutation approach 
(C
-
D)
. 
Plasmodes generated from Bottomly presented better patterns for Fs with tabulated and F1 with 
permutation approach 
(E
-
F)
 
31
 
 
2.4
 
Discussion
 
and conclusion
 
Validating and comparing methods to analyze RNA
-
seq data is essential for providing 
powerful statistical packages that can detect differentially expressed genes in downstream 
analyses 
(Robles et al., 2012)
. In this paper we illustra
te how to utilize plasmode data
sets in 
combination with simulations to evaluate analysis methods more co
mprehensively.
 
Parametric simulations can benefit a particular model depending on the distribution and 
specifications used to generate the dataset. For example, it can be argued that in our simulation 
study, edgeR and DESeq resulted too liberal compared to
 
MAANOVA due to the additive 
generalized Poisson model that was used to simulate the dataset. However, results from two 
independent plasmode datasets, generated without using specific parametric models, confirmed 
the same behavior (
Figure 
6
b and 
Figure 
7
b
). Moreover, a common problem of parametric 
simulations is that genes are sim
ulated independently.  Such misspecification is overcome in 
plasmode datasets where the residual correlation structure among genes after adjusting for 
systematic effects is preserved with respect to the original dataset.    
 
Exploring the joint null distri
bution of p
-
values for a particular test helps to determine the 
adequacy of a model and to decide the best method to correct for multiple comparisons, and 
doing so requires generation of multiple accurate high
-
dimensional datasets 
(Leek and Storey, 
2011)
. For example, we compared null p
-
value distribution obtained for the two types of 
MAANOVA F tests (Fs or F1) combined with two methods to compute the p
-
values (tabulated
 
F 
or permutation). The choice of the best c
ombination varies for each data
set: in the simulation 
study, either Fs or F1 using permutation provide a p
-
value distribution close to a uniform 
distribution while none of the F tests using tabulated values provi
de a reasonable distribution 
(
Figure 
8
a
-
b). Plas
modes generated from Cheung dataset
 
presented similar patterns for all the 
combinations (
Figure 
8
c
-
d), then Fs and F1 using permutation were chosen as suggested by Cui 
32
 
 
et al. 
(Cui et al., 2005)
. Conversely, in the analysis of plasmodes generated from Bottomly data
set
, 
F1 test using tabulated F values was the best approach (
Figure 
8
e
-
f). According to Cui et al.
 
(2005)
,
 
the F1
 
test for a fixed effect model has a standard F distribution and critical values could 
be obtained from F tables. These results are important because typical correction by FDR as 
proposed by Benjamini and Hochber 
(1995)
 
may not be appropriate if the underlying uniform 
distribution is not supported. Other strategies have been adapted from Storey 
(2002)
 
to estimate 
FDR for RNA
-
seq data and which correction should be applied is a topic of re
seach 
(Li et al., 
2012)
. All in all, these results emphasize the need to validate methods und
er realistic conditions 
and to select a base dataset for a plasmode where total sample size and sequencing depth 
(magnitude of counts) are considered.
 
In addition to the base dataset used to build a plasmode, the specific algorithm for plasmode 
generation 
should vary according to the objective of the study.  
Gadbury et. al 
(2008)
 
presented 
an algorithm that generates the partition of samples into two groups and repeatedly samples 
different e
ffect sets to be added to that unique partition. 
In this work, we propose to make several 
partitions from the original set of samples and add a set of effect in each case (
Figure 
4
). This 
approach constitutes a way to incorporate valuable information on biological variation. For 
example, one can easily study the dispersion of patterns in the Q
-
Q plots or ROC curves. 
Alternatively, both approaches, Gadbury et al.(2008) 
and the one presented in this paper, can be 
combined to study the influence of different sets of genes as well as sample variability.
 
Moreover, the construction of a plasmode must consider all the experimental conditions under 
which the base data were 
collected. Treatment and block effects may be easily identified from the 
experimental design but further restrictions in randomization (flowcell, lane, time) or technical 
issues (operator, use of technical replicates) may arise only from inspecting protoco
l details and 
applying explorative statistical analyses. For instance, descriptive analysis of the Cheung dataset 
and visualization of samples using multidimensional scaling analysis (
Figure 
2
a) suggested that 
33
 
 
no specific effects were present in the data structure; therefore we used it as an example to build 
a null plasmode. However, the same procedu
re applied to the Bottomly data
set indicated that not 
only the main st
rain, but also a characteristic effect due to flowcell number was an important 
source of variation (
Figure 
2
b). Consequently, strain and block (flowcell) were considered in two 
parts of the plasmode generation algorithm: firstly, when defining the model to select the effects 
(step 2 in 
Figure 
4
), and secondly, when partitioning samples within each flowcell (step 5 in 
Figure 
4
). These considerations allowed us to generate approp
riate null 
and alternative data
sets. A 
similar process should be followed with any new dataset plausible of being used as a base for 
plasmode generation.
 
We used the plasmodes and simulated data to illustrate the selection of optimal differential 
expression analysis 
strategies.  To this end, we focused on comparing true and false positive 
rates of tests to assess type I error rates and power. While we did not intend to perform a 
comprehensive evaluation of analysis protocols for RNA
-
seq data analysis, we did want to i
nclude 
two broad types of methods: 1) those directly tailored to count data by using negative binomial 
distributions (DESeq, EdgeR) or 2) a Gaussian model after transformation (MAANOVA). We 
found that edgeR and DESeq incur
red
 
in inflated type I error rates
 
for small significance levels 
(Figures 
Figure 
5
b
,
 
Figure 
6
b
, and
 
Figure 
7
b

-
values tend to be closer to the 
nominal significance levels. Admittedly, after adjusting for type I error rates, power was similar for 
edgeR and 
DESeq and higher than that from MAANOVA (
Figure 
7
c). However, in a real data 
scenario, adjusting is not possible because the true status is unknown.
 
These results emph
asize the fact that RNA
-
seq data are complex and to decide what method 
to use may be experiment
-
specific due to the unknown distributions of expression levels. 
Plasmodes may contribute to decisions on which method to choose by us
ing a similar pre
-
existing 
data
set and comparing results.  It is critical to select a dataset that has a complete description of 
the experimental design and detailed protocols of how the data were obtained. Using this 
34
 
 
information, it is possible to design proper null and alternative
 
datasets. For example, it was easy 
to find a set of differentially expressed genes in the mouse dataset that studied two inbred lines. 
Contrarily, in the human dataset, it was not possible to explain the variation in expression only as 
a consequence of ge
nder effects. The human subjects came from an outbred population and 
factors such as age, weight, or other characteristics could have explained differences in gene 
expression. Granted, any of the mentioned effects could have been included in the model if t
he 
information was available. The promising results obtained from this approach emphasize the need 
of promoting and improving systematic data sharing across the research community to facilitate 
plasmode building. 
 
Finally, the flexibility of plasmode const
ruction allows comparing model tuning selection for 
downstream analysis but also upstream analysis, as normalization procedures or alignment 
pipelines, could be contrasted. Future uses of plasmodes could be: comparison of alignment 
programs for a given sta
tistical analysis model or even exploring interaction of statistical model 
and read processing protocols to find optimal combined pipelines for data processing from reads
-
to
-
p
-
values.
 
 
35
 
 
APPENDIX
 
 
36
 
 
RNA
-
seq data processing 
 
In a previous 
experiment, we used the Pig
 
oligoarray for transcriptional profiling of developing 
pig skeletal muscle, and results for a study comparing transcript profiles of 
longis
s
imus dorsi
 
muscle from fetuses at 40 and 70 d of gestation in two different breed 
(Soller
o et al., 2011)
. For 
this study, we used the same RNA samples from one of the breed types profiled with the 
Pigoligoarray (n = 3 for each developmental age) with deep sequencing technology (Illumina 
GAIIx) to obtain 50nt paired end reads from 6 librarie
s (2 conditions, 3 bio
-
replicates each). The 
processing steps are described in 
the following 
Figure.
 
 
Figure 
9
 
Steps used to process reads to obtain matrix of counts
 
1. Filter passing read pairs were aligned to the S. scrofa reference genome (Sscrofa9, April 2009, 
Ensemble release 61) using the spliced RNA aware alig
ner, TopHat. Reads from each library were 
aligned separately. The reference gene annotation (same version as above) was provided to TopHat 
to provided information about predicted splice junctions, but TopHat also predicts novel splice 
junctions.
 
2. Novel s
plice junctions predicted from each of the 6 libraries were combined with the splice 
annotations from the reference to create a single, non
-
redundant set of predicted splice sites.  
TopHat is better able to map spliced reads if a list of potential junction
s is provided as input.
 
3. Each library was aligned to the reference a second time, providing the non
-
redundant set of potential 
splice sites as input.
 
4. The alignments produced from each library were used as input to Cufflinks. Cufflinks was used to 
exam
ine RNA
-
Seq alignments and to generate a set of predicted transcripts based on assembly of 
overlapping reads.
 
5. The transcript models generated by Cufflinks for each library were combined into a single, non
-
redundant set of transcript/gene models with Cuf
fcompare. The reference annotation was also 
provided to Cuffcompare to associate the predicted gene models with their most likely reference 
model.
 
6. The models generated by Cuffcompare are filtered to remove those models with little support from 
the under
lying read alignments. Specifically, if aligned reads are observed in only one of the six 
libraries, that model is removed from the final set.
 
7. The alignments for each library produced during the second round of TopHat (step 3) and the 
curated set of gen
e models (step 6) were used as input to htseq
-
count.  This program compared a set 
of alignments to an annotation file and reported the number of fragments uniquely aligned to each 
gene in the annotation. The models generated by Cuffcompare may have mul
tipl
e transcripts 
modeled for a
 
particular gene but htseq
-
count only reports the total fragments for a gene.
 
37
 
 
Figure 
10
 
P
-
value distribution of differential expression analysis performed with edgeR for 
Bottomly data using a model with block and treatment fixed effects
 
 
38
 
 
Figure 
11
 
P
-
value distribution of non differentially expressed transcripts for a simulated 
s
cenario with three
 
biological replicates
 
P
-
value distribution of non differentially expressed transcripts for a simulated scenario with 3 
biological replicates using: 1) 
edgeR
, 2) DESeq, 3) MAA
-
Fs: MAANOVA 
Fs moderated test 
(permutation), and 4) MAA
-
F1: MAANOVA F1 transcript by transcript test (permutation)
 
39
 
 
Figure 
12
 
P
-
value distribution of non differentially expressed trans
cripts for plasmodes 
g
enerated 
from Cheung dataset
 
P
-
value distribution of non differentially expressed transcripts for plasmodes generated from 
Cheung dataset using: 1) edgeR, 2) DESeq, 3) MAA
-
Fs: MAANOVA Fs moderated test 
(permutation), and 4) MAA
-
F1: MAANOVA F1 transcript by
 
transcript test (permutation)
 
 
40
 
 
Figure 
13
 
P
-
value distribution of non differentially expressed transcripts for plasmodes 
generated from Bottomly dataset
 
P
-
value distribution of non differentially expressed transcripts for 
plasmodes generated from 
Bottomly dataset using: 1) edgeR, 2) DESeq, 3) MAA
-
Fs: MAANOVA Fs moderated test 
(permutation), and 4) MAA
-
F1: MAANOVA F1 transcript by transcript test (tabulated)
 
 
41
 
 
Chapter
 
3
 
Assessing dissimilarity 
measures for sample
-
based hier
archical 
clustering of RNA sequencing data using 
plasmode datasets
 
Assessing dissimilarity measures for 
sample
-
based 
hierarchical 
clustering of 
RNA sequencing data using plasmode 
datasets
 
 
Reeb, P. D., Bramardi,
 
S. J., and Steibel, J. P. (2015
) Assessing dissimilarity 
measures for sample
-
based 
hierarchical 
clustering of RNA sequencing d
a
ta 
using plasmode datasets. 
PLoS
 
One
 
10(7)
:
 
e0132310. 
doi:10.1371/journal.pone.0132310
 
 
42
 
 
3.1
 
Introduction
 
Hierarchical cluster analysis has been a popular method for finding patterns in data and for 
representing results of gene expression analysis 
(Liu and Si, 2014)
. Clustering algorithms have 
been widely studied for analyzing microarray data 
(Jiang et al., 2004; Dalton et al., 2009)
, 
however, such technology is being rapidly replaced by RNA sequencing technology (RNA
-
seq) 
(Wang et al., 2009)
. In contrast to microarray experiments, RNA
-
seq generates count data of 
discrete nature that may call for di
fferent analysis methods. One of the most obvious differences 
between clustering gene expression data from RNA
-
seq or microarray is the choice of a 
dissimilarity measure, or the need to transform and normalize RNA
-
seq data in order to use 
dissimilarity mea
sures commonly used for microarray data 
(Liu and Si, 2014)
.    
 
Before implementing any statistical analysis of RNA
-
seq data, normalization and 
transformation have to be performed. 
(Bullard et al., 2010; Law et al., 2014; Liu and Si, 2014)
. 
Normalization aims at reducing non
-
systematic va
riation within and between samples, such as 
sequencing depth and library preparation. Data transformation could be very important because 
it aims at reducing the effects of skewness, scale and presence of outliers that can be found in 
read count data that 
usually follow a Poisson 
(Marioni et al., 2008)
 
or negative binomial distribution 
(Robinson et al., 2010; Anders and Huber, 2010)
. Through appropriate transformation, 
dissimilarity measures that are sensitive to asymme
tric distributions and scale magnitude, such 
as Euclidean and 1 

 
Pearson correlation 
(Liu and Si, 2014; Jiang et al., 2004; Johnson and 
Wichern, 2002)
 
could be used for clustering RNA
-
seq data.
 
Although a Gaussian distribution assumption is not required to compute Euclidean
 
and 
correlation based distances, transformations that convert count data into a continuous and almost 
Gaussianly distributed variable 
(Law et al., 2014)
 
could be used for hierarchical clustering. For 
instance, besides the classica
l logarithmic transformation, several functions have been proposed 
43
 
 
to model the mean
-
variance relationship of RNA
-
seq data 
(Anders and Huber, 2010; Love et al., 
2014
a; Law et al., 2014)
, while accounting for over
-
dispersion. But the properties of those 
transformations need to be tested. 
 
Finally, instead of using transformations to approximate the data to a pre
-
specified distribution 
where available dissimilarity m
easures perform well, model based methods can be directly used 
to compute dissimilarity measures 
(Witten, 2011)
.  
 
Evaluating the adequacy of alternative dissimilarity measures for hierarchical clustering 
requires the fundamental
 
step of choosing reference datasets 
(Handl et al., 2005)
. An ideal 
reference dataset should mimic the technical and biological variability found in experimental data, 
and it should also have some 
a priori 
known st
ructure in order to assess the goodness of results 
from alternative analyses. Parametric simulations, exemplar datasets, and permutation sampling 
have been used to generate such datasets in clustering analysis of biological data 
(Sloutsky et 
al., 2013)
. Similarly, plasmode datasets 
(Mehta et al., 2004)
  
have been proposed for evaluating 
differential expression analysis in RNA
-
seq experiments 
(Reeb and Steibel, 2013)
. A plasmode 
is a data
set obtained from experimental data from which some truth is known, thus, it is an ideal 
way to generate data with an a priori defined structure that realistically mimics RNA
-
seq data. 
Plasmodes were originally proposed for assessing multivariate analysis 
methods 
(Cattell and 
Jaspers, 1967)
 
and have been used in behavioral science 
(Waller et al., 2010)
 
and also in 
genomics 
(Gadbury et al., 2008; Steibel et al., 2009)
.   
 
In this paper, we propose the use of pla
smode datasets to assess the properties of 
dissimilarity measures for agglomerative hierarchical clustering or RNA
-
seq data. We present two 
possible ways of creating plasmode datasets that depend on the available data structure, and we 
use the resulting re
ference datasets to compare several commonly used dissimilarity measures.
 
 
44
 
 
3.2
 
Material and Methods
 
3.2.1 
Datasets
 
Two experimental datasets were used in this study to create reference datasets. The first 
dataset, 

Bottomly

 
corresponds to an experiment descr
ibed elsewhere 
(Bottomly et al., 2011)
. 
Briefly, 21 samples of 
striatum
 
tissue from two inbred mouse strains (C57BL/6J (B6), n=10; and 
DBA/2J (D2), n=11) were sequenced in three Illumina GAIIx flowcells.  Data were downloaded 
from  ReCount website 
(Frazee et al., 2011)
.
 
A
fter filtering out genes with zero counts in all 
sampl
es, the count matrix contained 
13932  rows (transcripts) and 21 columns (samples). The 

longissimus
 
muscle selected from the 
MSU Pig Resource Popula
tion 
(Steibel et al., 2011)
 
and sequenced 
by our collaborators 
(Steibel 
et al., 2014)
. Total RNA from 24 F2 female pigs of Duroc b
y Pietrain ancestry was barcoded and 
sequenced on Illumnina HiSeq 2000. Read mapping, gene modelling and read counting were 
performed using Tophat 
(Trapnell et al., 2009)
, Cufflinks 
(Trapnell et al., 2012)
 
and HTSeq 
(Anders et al., 2015)
, respectively. After processing the sequence reads, we
 
obtained a count 
matrix with 26740 rows (transcripts) and 24 columns (samples). (Fo
r details, see supplementary 
material
). The count matrix of the five samples (animals) used in this paper is available as 
s
upporting information at PLoS ONE website
.  
 
 
3.2
.2 
Plasmodes
 
Plasmodes  are synthetic datasets generated from experimental data for which some true 
characteristic is known 
(Mehta et al., 2004)
. For instance, we may know 
a priori
 
which genes are 
not differentially expressed or we may know group 
membership of each sample. Then, we build 
a plasmode by re
-
shuffling the existing data without assuming any probability distributions or 
correlation structures. Thus, we can use the known characteristic of the synthetic dataset to 
45
 
 
assess properties of anal
ysis methods. For instance we can apply resampling
-
based methods to 
create plasmodes consistent with the null hypothesis (no differential expression) and use them to 
evaluate the type I error rate hypothesis of testing procedures 
(Mehta et al., 2006)
, or we can use 
the known group memberships to assess the accuracy of clustering methods,
 
as we do in this 
paper. Thus, plasmodes need to be constructed according to the validation objectives (i.e. 
considering the statistical method that is being evaluated) and considering the available 
experimental data. 
 
In this paper, we present two example
s on how to create plasmodes to assess the effect of 
choice of dissimilarity measures on the results of hierarchical clustering of RNA
-
seq data. In the 
first experimental dataset, the natural structure of the data is known 
a priori
 
and it was generated 
thr
ough the experimental design (sequencing flowcells and mice strains), while in the second 
experimental dataset there is not an 
a priori
 
known structure, so we create a set of artificial 
samples where the structure is generated by construction. 
 
 
3.2.2.1 
Plasmodes from Bottomly dataset
 
We built plasmodes for this dataset by using samples from B6 strain, partitioning them in two 
groups and adding known effects for selected genes taken from the difference in gene expression 
with strain D2. 
Figure 
14
 
presents the algorithmic steps used to generate the plasmodes. Two 
main effects, strain and flowcell, were used to classify the 21 samples (Step 2.1 in 
Figure 
14
) 
given the importance of both sources of variation has been described before 
(Reeb and Steibel, 
2013; Law et al., 2014)
. Then, a differential expression analysis including all the samples (both 
strains) was conducted with edgeR 
(Robinson et al., 2010)
 
and transcripts with q
-
value <0.05 
were identified as differentially expressed  
(set 
G
1
 
in step 4 of 
Figure 
14
). Subsequently, samples 
from strain B6 were randomly assigned to two groups (A or B) within each flowcell, and a subset 
(
S
1
) of effects randomly selected from 
G
1
 
was added to the correspondin
g genes in samples 
46
 
 
labelled as B (Steps 5
-
6) . Therefore, samples from group A and B differ due to the strain effect 
added by the subset (
S
1
) of differentially expressed genes, while samples within each group differ 
due to the flowcell effect. We generated
 
50 plasmodes with 10% of differentially expressed 

B and one or two samples, if available, to group A within each flowcell (Step 5.2 in 
Figure 
14
).  As 
a result, in each plasmode generation we obtained a total of 10 samples under two artificial 
treatments (A or B) and three flowcell effects (1, 2 or 3), result
ing in a set of samples indexed by 
such factors as:{(A
1
, A
1
, B
1
, B
1
),(A
2
,B
2
,B
2)
,(A
3
,B
3
,B
3
)}. If we use only differentially expressed 
genes, we expect the samples with same letter to cluster together because of the treatment effect, 
but as we add a large number of non differentially expressed genes, we can expect that samples 
with the same subi
ndex (flowcell) will tend to cluster together because it has been shown before 
that there is a strong flowcell effect in this experiment 
(Law et al., 2014; Reeb and Steibel
, 2013)
. 
To evaluate the performance of dissimilarity measures under various differentially expressed
 
/ 
non differentially expressed
 
ratios (DE/nonDE), we analyzed three scenarios for each plasmode: 
1) only DE transcripts (
DE
[100%]
), 2) DE transcripts +
 
all nonDE transcripts (
DE
[10%]
+nonDE
[90%]
), and 3) 
DE transcripts + a random sample of 50% from nonDE transcripts (
DE
[20%]
+nonDE
[80%]
).   
 
 
47
 
 
Figure 
14
 
Algorithm used to generate plasmodes from Bottomly dataset
 
 
3.2.2.2 
Plasmodes 
from MSUPRP dataset
 
Since this dataset did not have a natural sample structure derived from experimental 
conditions, a structure had to be induced in order to know a priori the expected clustering 
configuration. From a descriptive multidimensional scaling 
analysis of the 24 pig samples 
(animals), we selected 5 dissimilar samples (A, B, C, D, E) according to their configuration in the 
main plane (
Figure 
15
). Synthetic s
amples were generated by combining a known proportion of 
randomly sampled read counts of individual genes from each of two of the five selected samples. 
For instance, a new synthetic sample named AAC was generated combining 2/3 and 1/3 of read 
counts of in
dividual genes from A and C respectively. A full plasmode consisted of 12 samples 
that included the five selected samples {AAA, BBB, CCC, DDD, EEE}, five synthetic samples 
{AAC, BBC, CCB, DDE, EED} obtained by combining 2/3:1/3 proportions from two of the 
selected 
samples, and two synthetic samples {CxB, ExD} obtained by combining 1/2:1/2 proportions of two 
of the selected samples
 
(see 
Figure 
21
.in 
supplemental material 
with a representation of the 
48
 
 
relationships among the 12 samples of each plasmode). Following this procedure, a total of 50 
replicated plasmodes were generated. As a r
esult we created a synthetic dataset where the 
samples were expected to resemble each other to a known degree given the proportions of shared 
reads. 
 
 
Figure 
15
 
Multidimensional scaling analysis of MSUPRP dataset
 
Twenty four 
samples were represented in the main plane (dimension 1 and dimension 2 explained 
22.4% and 13.8% respectively) and five distant samples (A, B, C, D, E, marked with ovals) were 
selected as input samples to generate plasmode datasets.
 
 
49
 
 
3.2.3 
Clustering
 
Defining a dissimilarity measure and a linkage method are the two key decisions for 
performing hierarchical cluster analysis. We focused on assessing the adequacy of dissimilarity 
measures that have been commonly used for clustering gene expression data. W
e also include 
a recently proposed dissimilarity measure for RNA
-
seq count data 
(Witten, 2011)
.  As linkage 
method, we decided to use complete linkage because it is invariant under monotone 
transformations 
(Izenman, 2008)
, and hence dissimilarity measures that have the same relative 
ranking result in the same cluste
r structure 
(Liu and 
Si, 2014)
. This robustness reduces the effect 
of linkage method when comparing dendrograms and allowed us to concentrate in the evaluation 
of dissimilarity measures. Hierarchical cluster analysis was applied to each plasmode using the 
agglomerative proc
edure implemented in function 
hclust
 
from R 
(R Development Core Team, 
2014)
 
to concatenate samples and to generate dendrograms. 
 
Eight dissimilarity measures were compared, including 4 variants based on Euclidean
 
distance, 3 correlation based approaches, and one Poisson based measure. Euclidean distances 
were computed between samples following one of 4 approaches: i) using raw count data (
raw
), 
ii) after normalizing samples using the median ratio size factor propo
sed by Anders and Huber 
(Anders and Huber, 2010)
 
(
rnr
), iii) after applying a variance stabilizing transformation computed 
with DESeq2 
(Love et al., 2014b)
 
(
vsd
), and iv) after applying a regularized logarithm 
transformation implemented in DESeq2  
(Love et al., 2014b)
 
(
rld
). Correlation based dis
similarities 
comprised: i) 1
-
 
Pearson correlation between samples using raw counts (
pea
), ii) 1
-
 
Pearson 
correlation between samples using counts transformed by logarithm of raw counts +1 (
plg
), and 
iii) 1
-
 
Spearman correlation between samples using raw co
unts (
spe
).  The Poisson dissimilarity 
(
poi
), which is based on a log likelihood ratio statistic for a Poisson model 
(Witten, 2011)
, was 
computed on data that were transformed by a power function to account for overdispersion, an
d 
normalized by total sum of counts for each sample.
 
50
 
 
3.2.4 
Cluster validation using results from plasmodes
 
Cluster validation can be assessed using several indices 
(Halkidi et al., 2001; Xiong and Li, 
2013)
 
and the choice of a particular 
measure is application dependent 
(Jiang et al., 2004)
. 
Cophenetic distances provide a way to quantify similarities among dendrograms in hierarchical 
clustering. The cophenetic distance is the distance from the bottom of the tree at which two 
elements (samples in this paper) are groupe
d in the same cluster for the first time in the hierarchy. 
To represent a dendrogram in terms of a set of cophenetic distances, the distances between all 
pairs of elements is computed and arranged into a matrix called cophenetic matrix that represents 
the 
whole hierarchy, as illustrated in
 
Figure 
22
 
and 
Table 
5
 
in supplementary material
. Cophenetic 
matrices  can be used to compare dendrograms 
(Sokal and Rohlf, 1962)
. For instance, to 
compare how similar are two dendrograms, the Pearson correlation betwee
n the lower triangular 
portions of two cophenetic matrices can be used.
 
 
We computed the correlation between cophenetic matrices 
(Handl et al., 2005)
 
to compare 
dendrograms obtained with different dissimilarity mea
sures (between dissimilarity measure 
comparison) as well as to compare all dendrograms obtained with a particular dissimilarity 
measure (within dissimilarity measure comparison). Mean and standard deviation of correlations 
between dissimilarity measures we
re used as a measure of agreement while mean and standard 
deviation of correlations within a dissimilarity measure were used as a measure of consistency. 
 
We also visually compared the obtained dendrograms to a reference dendrogram built 
according to the s
ample structure known 
a priori
 
from the plasmode generation process in the 
MSUPRP dataset. For the MSUPRP dataset, we defined the expected similarity between two 
samples (
s
ij
) as the maximum proportion of shared reads, and we defined 1
-
 
s
ij
 
as a reference 
dissimilarity (see 
Tables 
Table 
3
 
and 
Table 
4
 
in 
supplementary material
). With the correlation 
between each of the dissimilarity matric
es and the reference dissimilarity, we assessed how well 
each dissimilarity measure recovered the expected sample structure. An equivalent reference 
51
 
 
dissimilarity matrix and reference dendrogram cannot be easily built for the Bottomly dataset 
because we di
d not exploit relationships between samples to build the plasmode, except for their 
group membership.  In this case, we compared typical dendrograms obtained from plasmodes to 
the known strain and experiment membership in the original data.
 
 
3.3
 
Results
 
3.3.1
 
Bottomly
 
Figure 
16
 
shows the typical dendrograms obtained for plasmode datasets using two 
dissimilarity measures, 
poi
 
and 
rnr
, which are representative examples of 
two sets of results under 
the three different scenarios (
DE
[100%]
, 
DE
[10%]
+nonDE
[90%],  
and DE
[20%]
+nonDE
[80%]
,). On the one hand, 
scenario 1 (
DE
[100%]
) uses only differentially expressed transcripts, therefore the expected hierarchy 
should arrange samples
 
in two separate groups according to main treatment labels. Such is the 
structure obtained utilizing the Poisson (
poi
) dissimilarity measure (
Figure 
16
a). Using the Po
isson 
dissimilarity measure, samples were clustered in two groups corresponding to treatments A or B, 
and within each of the groups, samples were arranged according to block numbers (4, 6, or 7). 
Differently, the dendrogram based on Euclidean distance calc
ulated from raw normalized data 
(
rnr
) (
Figure 
16
b) mixed treatment labels and did not recover any expected structure.  On the 
other hand, scenario 2 (
DE
[10%]
+nonDE
[90%]
),  uses information from differentially (10%) and non 
differentially expressed (90%)  transcripts. As a result, we expected that the dissimilarity 
measures would tend to represent other aspects of samples in addition to the treatment effect. In 
conco
rdance with such expected structure, dendrogram obtained using the Poisson dissimilarity 
(
Figure 
16
c) firstly separated samples according to block labels, block 4 bein
g the most different 
group. Subgroups for treatments A and B were arranged within each block. Conversely, 
dendrogram based on Euclidean distance calculated from raw normalized data (
rnr
) (
Figure 
16
3d) 
52
 
 
did not present any expected structure.  Finally, scenario 3 (
DE
[20%]
+nonDE
[80%]
) represents an 
intermediate case that is useful to further explore the performance of  dissimilarity measures 
because it is enriched in DE ge
nes with respect to scenario 1, but it still conserves 80% of 
background (nonDE) genes. The dendrogram based on the Poisson dissimilarity (
Figure 
16
e) 
presented an in
termediate structure where we observed that samples from block 4 were clustered 
together while the remaining samples were clustered in a separate group mainly classified by 
treatment effect. Yet again, dendrogram based on 
rnr
 
(
Figure 
16
f) did not characterize any 
expected configuration.  To sum up, for this dataset, dendrograms generated from a Poisson 
dissimilarity resemble the expected hierarchical structures in all three scenarios, however, 
dendrograms based on Euclidean distance comp
uted on raw normalized data did not.  
Comparison of hierarchies between clusters constructed using 
poi
 
and 
rnr
 
dissimilarities across 
50 plasmodes presented correlation of cophenetic matrices with low means and  high standard 
deviations (0.52±0.24, , 0.65±
0.17, 0.59±0.23 , for scenarios 1, 2 and 3, respectively) (
Figure 
17
). 
These results emphasize a poor correspondence between hierarchies constructed upon 
poi
 
and 
rnr 
dissimilarities.
 
53
 
 
Figure 
16
 
Typical dendrograms obtained for plasmode datasets from Bottomly experimental data 
with two dissimilarity measures under three scenarios
 
Dendrograms obtained using complete linkage hierarchical 
clustering based on  Poisson 
dissimilarity (
poi
) are presented in the left column  (a, c and e), and dendrograms based on 
Euclidean distance calculated from raw normalized data (
rnr
) are presented in right column (b, d, 
f). The rows correspond to three sce
narios with different percentage of differentially expressed 
(DE) transcripts: 1) 
DE
[100%] 
(a and  b)
, 2)
 
DE
[10%]
+nonDE
[90%]
 
(c and d)
, and 3) 
DE
[20%]
+nonDE
[80%] 
(e  and f). 
Sample labels correspond to main treatment (A or B) and flowcell number (4, 6 or 7
). 
Dendrograms based on 
poi
 
separates samples according to the expected sources of variation; in 
(a), only DE transcripts, samples are arranged in two separate groups following treatment labels; 
in (c), with a predominant number of non DE transcripts, the 
structure of groups is dominated by 
flowcell characteristics in addition to main treatment: and in (e) an in
-
between scenario, the 
dendrogram presents an intermediate group structure. Dendrograms based on 
rnr 
 
do not 
resemble any expected configuration.
 
54
 
 
Figure 
17
 
Agreement between dissimilarity measures using Bottomly plasmode datasets
 
Each matrix contains means (upper triangle) and standard deviations (lower triangle) of 
correlation between cophenetic matrices of dendrograms (N= 
50 plasmode datasets) for eight 
dissimilarity measures: Euclidean distances using raw count data (
raw
), Euclidean distances 
using normalized samples (
rnr
), Euclidean distances using variance stabilizing transformation 
(
vsd
), Euclidean distances using regul
arized logarithm (
rld
).) 1
-
 
Pearson correlation using raw 
counts (
pea
), 1
-
 
Pearson correlation using counts transformed by logarithm of raw counts +1 (plg), 
and 1
-
 
Spearman correlation using raw counts (
spe
), and Poisson dissimilarity (
poi
). Panel labels 
 
55
 
 
(a), (b) and (c) correspond to one of three scenarios of proportion of differential expressed genes: 
DE
[100%]
,
 
DE
[10%]
+nonDE
[90%]
, and 
DE
[20%]
+nonDE
[80%]
, respectively. In all scenarios, we identified three sets 
of dissimilarity measures: 1) 
raw
, 2) 
rnr
 
and 
pea
, and 3) 
poi
,
 
rld
,
 
vsd
, 
plg
 
and 
spe
.  Results from 
raw
, set 1, were poorly related to results from any other dissimilarity measure. Dendrograms from
 
dissimilarity measures in set 2 presented correlation of cophenetic matrices with medium to high 
means and high variability with each other, and low correlation with dendrograms from other 
dissimilarity measures. Dendrograms from dissimilarity measures in
 
set 3 exhibited high 
correlations of cophenetic matrices and low to medium variability when compared to each other.
 
 
Correlations between hierarchies obtained with the eight dissimilarity measure approaches 
for each of three scenarios (
DE
[100%]
,
 
DE
[10%]
+n
onDE
[90%]
 
and 
DE
[20%]
+nonDE
[80%]
) are presented in 
Figure 
17
. Each matrix contains means and standard deviations of correlations between cophenetic 
matrices, in the upper and lower triangle respectively. Regardless of the scenario, dissimilarity 
measur
es can be apportioned to three groups with common patterns. First, Euclidean distance 
computed on raw data (
raw
)  is poorly related to any other dissimilarity measure. Second, 
Euclidean distance computed on normalized data (
rnr
) and 1
-
 
Pearson correlation 
dissimilarity 
(
pea
) presented medium to high correlations of cophenetic matrices  (mean from 0.67 to 0.89) 
with high variability (standard deviation from 0.13 to 0.29) with each other and low correlation 
values with other dissimilarity measures. Third, the
re is a subset comprising 1
-
Pearson correlation 
dissimilarity computed on log
-
transformed counts (
plg
), 1
-
Spearman correlation dissimilarity 
(
spe
), Euclidean distance computed on transformed counts after applying either a variance 
stabilizing function (
vsd
) or a regularized logarithm (
rld
), and the Poisson dissimilarity (
poi
). This 
last group of dissimilarity measures presents high correlations of cophenetic matrices (mean from 
0.82 to 0.99) with low to medium variability (standard deviation from 0.01 to 0.
2). Only hierarchies 
obtained with dissimilarity measures from the third group consistently presented the expected 
natural structure created by design in the plasmode generation process, being 
rld
, 
vsd
, 
spe
 
and 
plg
 
the more consistent across all scenarios 
(
Table 
2
, columns 2, 3 and 4).
 
56
 
 
Table 
2
 
Consistency for each dissimilarity measure
 
Dissimilarity
 
Bottomly
 
MSUPRP
 
 
DE
[100%]
 
DE
[10%]
+nonDE
[90%]
 
DE
[20%]
+nonDE
[80%]
 
 
raw
 
0.58 (0.22)
 
0.91 (0.13)
 
0.75 (0.20)
 
0.56 (0.23)
 
rnr
 
0.35 (0.27)
 
0.43 (0.28)
 
0.33 (0.28)
 
0.40 (0.22)
 
rld
 
0.98 (0.01)
 
0.90 (0.11)
 
0.88 (0.13)
 
0.99 (0.01)
 
vsd
 
0.96 (0.04)
 
0.91 (0.10)
 
0.86 (0.15)
 
0.99 (0.01)
 
pea
 
0.37 (0.31)
 
0.53 
(0.30)
 
0.45 (0.30)
 
0.43 (0.28)
 
plg
 
0.98 (0.01)
 
0.89 (0.13)
 
0.75 (0.19)
 
0.98 (0.01)
 
spe
 
0.99 (0.01)
 
0.86 (0.14)
 
0.88 (0.14)
 
0.99 (0.01)
 
poi
 
0.86 (0.21)
 
0.92 (0.09)
 
0.88 (0.14)
 
0.99 (0.01)
 
 
Mean and standard deviation of correlation between cophenetic 
matrices of dendrograms (N=50 
plasmode datasets) for each of eight dissimilarity measures: Euclidean distances using raw count 
data (
raw
), Euclidean distances using normalized samples (
rnr
), Euclidean distances using 
regularized logarithm (
rld
) Euclidean 
distances using variance stabilizing transformation (
vsd
), 1
-
 
Pearson correlation using raw counts (
pea
), 1
-
 
Pearson correlation using counts transformed by 
logarithm (plg), and 1
-
 
Spearman correlation using raw counts (
spe
), and Poisson dissimilarity 
(
poi
).  Columns correspond to the three scenarios generated for Bottomly (with different proportion 
of DE genes) and the MSUPRP dataset.
 
We considered a clustering from a dissimilarity measure 
to be consistent if hierarchies obtained for different plasmode dat
asets within each dissimilarity 
measure were highly correlated and presented a low standard deviation. Clustering based on 
raw
, 
rnr 
and 
pea
 
were generally inconsistent presenting a number of very different hierarchical 
structures.
 
 
We considered a dissimil
arity measure to be consistent if hierarchies obtained for different 
plasmode datasets within each dissimilarity measure were highly correlated and presented a low 
standard deviation. Consequently, we computed correlations of cophenetic matrices for 
dendro
gram within each dissimilarity measure and calculated the mean and standard deviation 
for each ensemble of 50 plasmodes (
Table 
2
, columns 2, 3 and 4).  Dissimilarity 
measures 
raw
, 
rnr 
and 
pea
 
were generally inconsistent, resulting in a number of different hierarchical structures. 
For instance, 
rnr 
presented mean correlation values of 0.35±0.27, 0.43±0.28, and 0.33±0.28 for 
57
 
 
the three respective scenarios. Conversely, al
l the other dissimilarity measures were much more 
consistent. For example, 
rld
 
presented mean correlation values of 0.98±0.01, 0.90±0.11, and 
0.88±0.13 for the three respective scenarios. Such high values mean that hierarchies obtained 
with 
rld
 
for the 50 
plasmodes were all very similar to each other.  
 
 
3.3.2 
MSUPRP
 
Plasmodes from MSUPRP were constructed by combining known proportions of sequence 
reads from pairs of samples, including nonDE as well as potentially DE transcripts across 
individual. We expect
 
that dendrograms cluster the samples according to the known proportions 
of shared reads as presented in the reference dendrogram in 
Figure 
18
c (see 
Table 
4
 
in 
supplementary material
 
with the corresponding ref
erence dissimilarity matrix). 
Figure 
18
a
-
b 
present the typical dendrograms obtained for plasmode datasets using 
rnr
 
and 
poi
, which are 
representative examples of the 8 dissimilarity metrics. Dendrogram
 
based on the Poisson 
dissimilarity (
Figure 
18
a) clustered the original samples A, B
,
 
and C and their synthetic 
combinations in one group, and original samples E and D and their synthetic combinations in a 
distinct group. The hierarchical structure of each of these two groups represented the degree of 
shared reads between samples by joinin

that shared ½ of reads. Additionally, the separation between samples {A, B, C} and {D,E} agreed 
with positions along the most important dimension (dim1) in 
Figure 
15
. In contrast, dendrogram 
based on 
rnr
 
(
Figure 
18
b)
 
did not cluster samples according to the anticipated configur
ation.  
Comparison of hierarchies between clusters constructed from 
rnr
 
and 
poi 
dissimilarities for all 
plasmodes presented low mean and high standard deviations (0.44±0.22) of correlation between 
cophenetic matrices (
Figure 
19
).
 
 
58
 
 
Figure 
18
 
Typical dendrograms obtained for plasmode datasets from MSUPRP experimental 
data with two dissimilarity measures
 
Dendrograms using complete linkage based 
on (a) Poisson dissimilarity (
poi
),
 
(b) Euclidean 
distance calculated from raw normalized data (
rnr
), and (c) reference dissimilarity based on 
maximum proportion of shared reads. Original samples are labeled with 3 same letters (AAA, 
BBB, CCC, DDD or DDD),
 
synthetic samples are labelled with 2 or 3 letters symbolizing the 

clustered original samples A, B and C and their synthetic combinations in one group, and ori
ginal 
samples E and D and their synthetic combinations in another group. The hierarchical structure of 
each of these two groups represented the degree of shared reads between samples by joining 

red ½ of reads. Dendrogram (b) 
did not cluster samples according to the expected configuration.
 
 
59
 
 
Figure 
19
 
Agreement between dissimilarity measures using MSUPRP plasmode datasets
 
The matrix contains means (upper triangle) and 
standard deviations (lower triangle) of correlation 
between cophenetic matrices of dendrograms (N= 50 plasmode datasets) for eight dissimilarity 
measures: Euclidean distances using raw count data (
raw
), Euclidean distances using normalized 
samples (
rnr
), E
uclidean distances using variance stabilizing transformation (
vsd
), Euclidean 
distances using regularized logarithm (
rld)
 
1
-
 
Pearson correlation using raw counts (
pea
), 1
-
 
Pearson correlation using counts transformed by logarithm of raw counts +1 (
plg
), an
d 1
-
 
Spearman correlation using raw counts (
spe
). We identified the same three sets of dissimilarity 
measures described before: 1) 
raw
, 2) 
rnr
 
and 
pea
, and 3) 
poi
,
 
rld
,
 
vsd
, 
plg
 
and 
spe
.
 
 
Correlations between hierarchies for the eight dissimilarity measures are summarized in 
Figure 
19
. It contains mean and standard deviations,  in the upper and lower tr
iangle respectively, 
of correlations between cophenetic matrices computed on 50 plasmode datasets. As observed in 
the three scenarios for the Bottomly experiment, dissimilarity measures can be apportioned to 
three groups: 1) 
raw
, 2) 
rnr
 
and 
pea
, and 3) 
poi
,
 
rld
,
 
vsd
, 
plg
 
and 
spe
.  Dendrograms from 
raw
 
did 
not agree with dendrograms from other groups.  Hierarchies from dissimilarity measures in group 
2 presented a medium correlation of cophenetic matrices with high variability (0.69±0.22). 
Dendrograms from t
he dissimilarity measures in group 3 presented high correlation values with 
each other  (>0.98) and low variation (<0.01).  
 
60
 
 
The correlation of hierarchies within each of the dissimilarity measures (
Table 
2
, column 5) 
was low for 
raw
, 
rnr 
and 
pea
, whereas clusters were much more consistent (
r
>0.98) for 
poi
, 
rld
, 
vsd
, 
plg
 
and 
spe 
dissimilarities. Additionally, dissimilarity measures 
raw
, 
rnr
, and 
pea
 
were poorly 
correlated with the reference dissimilarity (0.57±0.07, 0.53±0.07, and 0.51±0.04, respectively) 
while 
poi
, 
rld
, 
vsd
, 
plg
, and 
spe
 
were highly correlated with the reference dissimilarity 
(r>0.8±0.001, see 
Table 
6
 
in supplementary material
). Consequently, dissimilarities raw, 
rnr
, and 
pea 
did not resemble the expected sample structure and resulted in dendrograms that were very 
inconsistent over repeated samp
ling of the same dataset. In contrast, dissimilarities
 
poi
, 
rld
, 
vsd
, 
plg
, and 
spe 
maintained the sample structure and produced highly reproducible results in 
hierarchical dendrograms.
 
 
3.4
 
Discussion
 
Hierarchical cluster analysis is one of the most used techniques for exploring expression 
patterns in sequencing data 
(Liu and Si, 2014)
. In this paper, we showed how to assess the 
adequacy of dissimilarity measures for clustering samples from RNA
-
seq experiments by 
generat
ing plasmode datasets from experimental data. 
 
Plasmode datasets are useful alternatives to parametric simulations for assessing statistical 
methodologies as data are generated on more realistic conditions and do not depend on a specific 
parametric model 
(Gadbury et al., 2008)
.  The algorithm used to build a plasmode dataset 
depends on the characteristics of the experimental data and the objective of the study.  We 
presented two examples on how to build plasmode datasets from tw
o experiments with different 
conditions.  
 
61
 
 
The Bottomly dataset had an experimental design with two main sources of variation and used 
highly inbred individuals.  Such context allowed us to generate plasmode datasets with known 
proportions of differentiall
y expressed transcripts (
Figure 
14
) and focused on assessing the 
adequacy of dissimilarity measures in recovering the main sources of variation in the
 
hierarchical 
structure (
Figure 
16
). Analogous plasmode generation algorithms have been used with different 
objectives, for example to validate differ
ential expression  methods for  RNA
-
seq 
(Reeb and 
Steibel, 2013)
, microarray analysis
 
(Gadbury et al., 2008)
, and qPCR 
(Steibel et al., 2009)
, but 
this is the first time that they are used to assess the properties of sample
-
based clustering. These 
procedur
es are by no means exhaustive of the possible ways of creating plasmodes for clustering. 
For instance, the algorithm presented in 
Figure 
14
 
preserves t
he correlation among genes 
(Reeb 
and Steibel, 2013)
 
when generating plasmodes, but other sampling strategies could purposefully 
select groups of genes with speci
fic correlation patterns. For instance, instead of sampling from 
DE and nonDE groups, transcripts could be sampled from blocks of co
-
expressed genes, resulting 
in more realistic datasets especially if the study is focused on gene
-
based clustering and co
-
ex
pression analysis 
(Si et al., 2014; Rau et al., 2015)
.
 
Diffe
rent from the Bottomly dataset, the MSUPRP did not present any experimental treatment, 
and individual characteristics were more important.  Under these circumstances, we built 
plasmodes creating synthetic individuals, by combining known proportions of read
 
counts from 
original individuals and we evaluated the adequacy of dissimilarity measures in resembling the 
different mixture proportions in the hierarchical sample structure (
Figure 
18
). A similar plasmode 
generation algorithm was proposed 
(Vaughan et al., 2009)
 
to evaluate admixture estimation 

originates from different founding populations, but using SN
P genotypes instead of sequence 
read counts.
 
62
 
 
Although the utility of plasmode datasets has been recently highlighted in RNA
-
seq studies 
(Reeb and Steibel, 2013; Zhou et al., 2014)
, only parametric simulations or exemplar experimental 
datasets have been used to compare dissimilarity measures and clustering methods  
(Witten, 
2011; Ma and Wang, 2012)
. While extremely useful, parametric simulations are often criticized 
as being too simplistic to appropriately 
capture the complexity in gene expression data 
(Gadbury 
et al., 2009)
, thus limiting the scope and validity of the resulting conclusions. On the other hand, 
using a single exemplar dataset with unknown properties is not an appropriate approach for 
comparing statistical methods 
(Mehta et al., 2004)
. As a partial sol
ution to these limitations, in this 
paper we show that plasmodes can supplement the evaluation of clustering algorithms by 
including agreement and consistency measures based on datasets that mimic read count 
distributions more realistically.  One likely cr
iticism of plasmode is that the results may heavily 
depend on the original dataset 
(Reeb and Steibel, 2013)
. However, this does not invalidate their 
use. Moreove
r, as shown in this paper, using two very different datasets, some properties of 
alternative metrics remain consistent, which encourages the use of the plasmode generation 
methods presented here using alternative datasets.
 
We built plasmodes to evaluate al
ternative dissimilarity measures. The selected dissimilarity 
measures allowed i) the comparison of  traditional dissimilarity measures and dissimilarity 
measures based on discrete count distributions specifically proposed for RNA
-
seq, and ii) 
studying the 
effect of normalization and transformation prior to computing dissimilarities. 
 
The Euclidean distance or the Pearson correlation based dissimilarity computed after 
transforming data is a routine method adopted from microarray gene expression analysis 
(Jiang 
et al., 2004; Liu and Si, 2014)
. In fact, the Pearson correlation based dissi
milarity is equivalent to 
squared Euclidean distance of standardized data 
(Hastie et al., 2009)
. On the other hand, the 
most common transformations used for RNA
-
seq are the logarithm of counts, or logarithm of 
counts plus a constant 
(Severin et 
al., 2010)
, but other variance stabilizing and regularized 
63
 
 
logarithm transformation functions have been proposed to model the mean
-
variance relationship 
of RNA
-
seq counts 
(Anders and Huber, 2010; Law et al., 2014; Love et al., 2014b)
.    The 
Spearman correlation based dissimilarity uses the rank of the read count instead of the counts 
themselves to compute correlation; co
nsequently it could be applied without transforming the 
data. Although the use of Spearman correlation based dissimilarity has been discouraged for 
gene
-
based clustering of RNA
-
seq data 
(Liu and Si, 2014)
, because it uses a small number of 
grouped samples to compute ranks, 
we have used it for sample
-
based clustering where the 
number of genes is potentially large enough to obtain more precise ranks. Finally, the Poisson 
dissimilarity 
(Witten, 2011)
 
was specifically proposed for clustering of sequenc
ing data based on 
a Poisson log
-
linear model of normalized counts, and thus, it is a natural candidate to be included 
in this comparison.  
 
The eight evaluated dissimilarity measures presented a common pattern of agreement and 
consistency in recovering the
 
expected sample structure for both plasmode datasets. Dissimilarity 
measures with high level of agreement between them

correlations between cophenetic matrices 
with high mean
 
and low standard deviation (Figures
 
Figure
 
17
 
and 
Figure
 
19
)

produce 
dendrograms with very similar hierarchical structures.  However, if dissimilarity measures have 
correlations of cophenetic matrices with either low mean or high standard deviation (
F
igures
 
Figure
 
17
 
and 
Figure
 
19
), they genera
te dendrograms with different hierarchical structures.  In 
addition, correlation between dendrograms obtained with a particular dissimilarity measure 
summarizes the consistency of such dissimilarity measure. If a dissimilarity measure has a within 
cophenet
ic correlation with high mean and low standard deviation, it consistently generates 
similar dendrograms. 
 
To assess the adequacy of a dissimilarity measure, both agreement and consistency are 
important. We showed this with 
poi
,
 
rld
,
 
vsd
, 
plg
 
and 
spe 
dissim
ilarities, which presented similar 
level of agreement
 
in both datasets (Figures
 
Figure
 
17
 
and 
Figure
 
19
). However, dendrograms 
64
 
 
based on these dissimilarity measures were consi
stent in the MSUPRP dataset, but showed 
different consistency under the three scenarios in the Bottomly datasets (
Table 
2
). A counter 
example is dissim
ilarity measures 
rnr
 
and
 
pea
 
that agreed with each other but were very 
inconsistent.  This means that
 
rnr
 
and 
pea
 
tended to reproduce similar clusters on each plasmode, 
and wide range of dendrograms structures. This has not been reported before, because 
co
nsistency of clustering under repeated sampling has not been studied in previous works. But a 
reason for the agreement is that both 
rnr
 
and 
pea
 
are focusing on the same features, because 
they are essentially normalized Euclidean distances. The reason for t
he low consistency could be 
that these measures are expected to behave well with heterogeneous approximately Gaussian 
data, but not too well with extremely non
-
Gaussian data. On the other side, the measure 
raw
 
is 
expected to be better suited for homogenous
 
Gaussian data (all variances are of similar 
magnitude).
 
As mentioned before, we used plasmode to study the effect of normalizing data. Normalization 
is an essential data processing step in RNA
-
seq analysis that aims at removing systematic biases 
in order 
to make consistent comparisons within and between samples
 
(Dillies et al., 2012)
. 
Although several methods
 
(Bullard et al., 2010; Anders and Huber, 2010; Robinson and Oshlack, 
2010)
 
have been proposed to normalize data,  especially for differential expression analysis, the 
impact of a particular normalization method seem
ed to be less important in classification and 
clustering analyses 
(Witten, 2011)
. We confirmed this, showing that normalizing counts to equal 
library sizes was not enough to capture the natural structure of samples when it was th
e only 
transformation applied. For instance, dissimilarity measures 
raw
 
and 
rnr
 
had low agreement with 
dissimilarity measures that resemble better the true structure of data, e.g. 
rld
, 
vsd
, 
spe
 
or 
plg
 
or 
poi 
(
Figures
 
Figure
 
17
 
and 
Figure
 
19
). 
 
Accounting for the discrete nature of read counts in RNA
-
seq data is the most important issue 
to consider when computing dissimilarity measures. For instance, Euclidean distance and 
65
 
 
Pearso
n correlation based measures are known to be influenced by scale, skewness and outliers, 
thus, they may not work well for count data 
(Liu and Si, 2014)
. In support of this, we found that 
dissimilarity measures 
raw
, 
rnr
, 
pea
, based directly on counts, regardless of normaliza
tion or 
standardization, did not resemble the expected dendrogram and were generally inconsistent. 
Dendrograms obtained with Pearson based correlation resembled the expected structures only 
when data were previously log
-
transformed. However, we found that 
the Spearman correlation 
based dissimilarity measure (
spe
) was suitable to represent the natural structure of samples even 
without normalizing data, possibly because it preserves the relative rank relationships, and it is 
less influenced by skewness and ou
tliers 
(Kendall and Gobbons, 1990)
 
when it is based on a 
large number of genes. The variance stabilizing (
vsd
) and regularized logarithm (
rld
) approaches 
consistently retrieved the expected dendrogram structure. Both t
ransformations model the mean
-
variance relationship across all genes to stabilize the variance of counts across samples 
(Anders 
and Huber, 2010; Love et al., 2014b)
. The regularized logarithm transformation also accounts for 
variation in sequencing depth across samples 
(Love et al., 2014b)
. Both functions have been 
suggested as appropriate transformations for c
lustering and classification of RNA
-
seq data with 
less ambiguous results in hierarchical clustering than using simply log
-
transformed counts 
(Love 
et al., 2014b)
. Finally, directly using the Poisson dissimilarity (
poi
) generated dendrograms with 
the 
expected structure. This is not surprising considering that read counts are usually assumed 
to fit over dispersed Poisson distributions
 
(Pachter, 2011)
. Similarly, Witten 
(Witten, 2011)
 
obtained dendrograms with lower clustering error rates  when using the Poisson dissimilarity 
rather than 
vsd
 
or Euclidean distances on normalized data, but using overdispersed Poisson 
simulation
s. Our results are encouraging because we did not use a parametric model to produce 
similar outcomes.                
 
Sample
-
based hierarchical cluster analysis can be used as a tool to present results after 
differential expression analysis or it can be us
ed as an explorative technique for finding patterns 
66
 
 
in data. In the first approach only informative genes, i.e. differentially expressed genes (called 
signal in data mining literature) are used while in the second approach informative as well as non 
inform
ative genes (also known as noise) are utilized 
(Jiang et al., 2004)
. As the signal
-
to
-
noise 
ratio (proportion of DE to nonDE genes) is usually less than 1:10 
(Jiang et al., 2004)
, particular 
methods are applied to diminish the influence of non informative genes that can degrade the 
reliability of clustering resul
ts 
(Jiang et al., 2004)
. In R
NA
-
seq analysis, cluster analysis is 
commonly applied only to differentially expressed genes or a subset of them 
(Liu and Si, 2014)
. 
We have assessed dissimilarity measures under scenarios that include not only a set of 
differentially expressed transcripts but we also combi
ned differentially and non differentially 
expressed transcripts (signal
-
to
-
noise ratio 1:9 and 1:4), as well as a mixture of individuals. We 
found that 
rld
, 
vsd
, 
plg
, 
spe
 
and 
poi
 
were highly consistent under all scenarios with a tendency to 
diminish consis
tency as the number of non informative genes increases. Although we focused 
our comparison on the effect of dissimilarity measures on hierarchical clustering results, the same 
plasmodes could be used to investigate the effect of other decisions made when p
erforming 
sample
-
based clustering as the selection of the hierarchical clustering algorithm per
-
se or even 
the effect of pre
-
filtering transcript according to their level of expression  
(Rau et al., 2013; 
Bourgon et 
al., 2010; van Iterson et al., 2010)
.  
We did not explore those aspects of sample 
clustering, but their investigation will be facilitated by the plasmode building strategies described 
in this paper.
 
 
3.5
 
Conclusion
 
Generating plasmode datasets from experimental data is a reliable tool for evaluating 
dissimilarity measures in agglomerative hierarchical cluster analysis of RNA
-
seq data. Depending 
on the characteristics of the available datasets, several scenarios can b
e established to compare 
67
 
 
dissimilarity measures upon a broad spectrum of more realistic conditions than using other 
simulation approaches. Similar methodologies can be applied to study gene
-
based clustering as 
well as other clustering analysis methods.
 
Exp
lorative sample
-
based hierarchical clustering of RNA
-
seq data needs as an input a 
dissimilarity matrix that accounts for the mean
-
variance relationship of the discrete nature of read 
counts. Euclidean distance calculated either on data that have been previ
ously logarithm
-
transformed or regularized with more complex 
ad hoc 
functions, as well as model
-
based 
dissimilarity for RNA
-
seq data, were consistent in reproducing the expected sample structure in 
hierarchical dendrograms.
 
 
68
 
 
APPENDIX
 
 
69
 
 
Sample preparation, sequencing and mapping 
 
The MSUPRP dataset corresponds to 24 samples of 
longissimus
 
dorsi
 
muscle extracted from 
24 F2 females selected from the MSU Pig Resource Population 
(Steibel et al., 2011)
. Total RNA 
was obtained using TRIzol reagent (Invitrogen/Life Technologies, Ca
rlsbad, CA, USA), and RNA 

reverse transcribed into cDNA, fragmented and labeled to generate  24 barcoded libraries that 
were sequenced on an Illumina HiSeq 2000 (100 
bp, paired
-
end reads) at the Michigan State 
University Genomics Core Facility. Four technical replicates were collected from each library and 
arranged in four different lanes of a flowcell allowing up to 12 barcode
s per lane as illustrated in 
Figure 
20
.
  
 
Each library has a tag with the sample number and a colour symbolizing 
its barcode. A same set 
of 12 tags is repeated 4 times (technical replicates).
 
 
The raw read data consisted of 96 pairs of fastq files (4 per sample) containing approximately 
15 million short
-
reads (100bp) each. Those fastq files were pre
-
processed 
                                       
Figure 
20
 
Sequencing layout of 24 barcoded samples
 
70
 
 
using
 
FASTX
toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and FASTQC 
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess read quality. Then, Tophat 
(Trapnell et al., 2009)
 
was used for mapping the reads to the reference genome (Sus scrofa 
10.2.69 retrieved from the Ensembl database) using an index generated by Bowtie2 
(Langmead 
and Salzberg, 2012)
. The aligned records were stored in BAM/SAM format 
(Li et al., 2009a)
. 
Alignment statistics and base coverage was cal
culated for each file using SAMt
ools 
(Li et al., 
2009a)
. After that, Cufflinks software 
(Trapnell et al., 2012)
 
was used to obtain gene models and 
to merge gene models from all samples and reference annotation. Finally, transcript specific read 
counts were estimated using HTSeq
 
(Anders et al., 2014)
. 
 
Consistently, about 85% of reads were successfully mapped to referen
ce genome. We 
detected a total of 26740 transcripts with at least one read aligned.  Average coverage per base 
across 96 pairs of fastq files was 45.79X. 
 
To obtain the final count matrix, we filtered out transcript with zero expression in all samples 
and 
merged the 4 technical replicates from each sample. As a result we obtain a count matrix 
with 26740 transcripts and 24 samples.
 
 
71
 
 
Plasmode generation for MSUPRP samples  
 
 
A singular plasmode dataset comprised 12 samples: 5 original samples, ovals labelled as 
{AAA,BBB,CCC,DDD,EEE}, and 7 synthetic samples, circles labelled as 
{AAC,BBC,CXB,CCB,DDE,EXD,EED}, obtained by  combining know
n proportions of transcripts 

 
Figure 
21
 
Plasmode generation for MSUPRP dataset
 
72
 
 
Reference similarity and dissimilarity matrices for MSUPRP plasmodes
 
 
The similarity between two samples (
s
ij
) was calculated as the maximum proportion of original 
shared reads.
 
 
The dissimilarity between two samples (
d
ij
) was calculated as 1
-
 
s
ij
 
.
 
 
AAA  AAC  BBB  BBC  CXB  CCB  CCC  DDD  DDE  EXD  EED  EEE
 
AAA
  
1.00 0.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
 
AAC
  
0.66 1.00 0.00 0.33 0.33 0.33 0.33 0.00 0.00 0.00 0.00 0.00
 
BBB
  
0.00 0.00 1.00 0.66 0.50 0.33 0.00 0.00 0.00 0.00 0.00 0.00
 
BBC
  
0.00 0.33 0.66 1.00 0.83 0.66 0.33 0.00 0.00 0.00 0.00 0.00
 
CXB
  
0.00 0.33 0.50 0.83 1.00 0.83 0.50 0.00 0.00 0.00 0.00 0.00
 
CCB
  
0.00 0.33 0.33 0.66 0.83 1.00 0.66 0.00 0.00 0.00 0.00 0.00
 
CCC
  
0.00 0.33 0.00 0.33 0.50 0.66 1.00 0.00 0.00 0.00 0.00 0.00
 
DDD
  
0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.66 0.50 0.30 
0.00
 
DDE
  
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.66 1.00 0.83 0.66 0.33
 
EXD 
 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.83 1.00 0.83 0.50
 
EED
 
 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.30 0.66 0.83 1.00 0.66
 
EEE
 
 
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.50 
0.66 1.00
 
Table 
3
 
Reference similarity matrix (S) for MSUPRP plasmodes
 
     
AAA  AAC  BBB  BBC  CXB  CCB  CCC  DDD  DDE  EXD  EED
 
AAC 0.34                                                  
 
BBB 1.00 1.00                                             
 
BBC 1.00 0.67 0.34
                                        
 
CXB 1.00 0.67 0.50 0.17                                   
 
CCB 1.00 0.67 0.67 0.34 0.17                              
 
CCC 1.00 0.67 1.00 0.67 0.50 0.34                         
 
DDD 1.00 1.00 1.00 1.00 1.00 1.00 1.00
                    
 
DDE 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.34               
 
EXD 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.50 0.17          
 
EED 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.70 0.34 0.17     
 
EEE 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.67 0.50 0.34
 
 
Table 
4
 
Reference dissimilarity matrix (D) for MSUPRP plasmodes
 
73
 
 
Reference dendrogram and cophenetic matrix for MSUPRP plasmodes
 
 
Figure 
22
 
Reference dendrogram for MSUPRP plasmodes
 
 
AAA  AAC  BBB  BBC  CXB  CCB  CCC  DDD  DDE  EXD  EED
 
AAC 0.34                                                  
 
BBB 0.87 0.87                                             
 
BBC 0.87 0.87 0.63                                        
 
CXB 0.87 0.87 0.63 0.26                                   
 
CCB 0.87 0.87 0.63 0.26 0.17                              
 
CCC 0.87 0.87 0.63 0
.50 0.50 0.50                         
 
DDD 1.00 1.00 1.00 1.00 1.00 1.00 1.00                    
 
DDE 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.64               
 
EXD 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.64 0.26          
 
EED 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0
.64 0.26 0.17     
 
EEE 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.64 0.50 0.50 0.50
 
 
Table 
5
 
Cophenetic matrix of the reference dendrogram for MSUPRP plasmodes.
 
74
 
 
Correlation between dissimilarity measures and reference dissimilarity for the 
MSUPRP plasmodes
 
 
Table 
6
 
Correlation between dissimilarity measures and reference dissimilarity
 
Dissimilarity
 
Correlation 
with   
reference 
dissimilarity
 
raw
 
0.57 (0.075
)
 
rnr
 
0.53 (0.069
)
 
rld
 
0.83
 
(0.0
0
1)
 
vsd
 
0.82
 
(0.0
0
1)
 
pea
 
0.51 (0.045
)
 
plg
 
0.82
 
(0.0
0
1)
 
spe
 
0.81
 
(0.0
0
1)
 
poi
 
0.81
 
(0.01)
 
 
Mean and standard deviation of correlation between each of the eight dissimilarity matrices and 
the reference dissimilarity for MSUPRP plasmodes (N=50 plasmode datasets). Euclidean 
distances using raw count data (
raw
), Euclidean 
distances using normalized samples (
rnr
), 
Euclidean distances using regularized logarithm (
rld
), Euclidean distances using variance 
stabilizing transformation (
vsd
), 1
-
 
Pearson correlation using raw counts (
pea
), 1
-
 
Pearson 
correlation using counts transfo
rmed by logarithm (plg), and 1
-
 
Spearman correlation using raw 
counts (
spe
), and Poisson dissimilarity (
poi
). Distances measures 
raw
, 
rnr
, and 
pea
 
poorly 
preserved the expected sample structure (r<0.57) while 
poi
, 
rld
, 
vsd
, 
plg
, and 
spe
 
highly preserved 
th
e expected sample structure (r>0.8).
 
 
MSUPRP dataset
. This file contains raw count data (6 columns x 25798 rows) for the five 
animals used to generate the MSUPRP plasmode datasets.
 
Available online 
at PLoS ONE 
website
.
 
 
75
 
 
Chapter
 
4
 
Assessing genotype cal
l 
accuracy from RNA sequencing data
 
 
Assessing genotype call accuracy from RNA 
sequencing data
 
 
76
 
 
4.1
 
Introduction
 
RNA sequencing technology (RNA
-
seq) 
(Wang et al., 2009)
 
has become the platform of 
choice for gene expression
 
profiling 
(Oshlack et al., 2010)
. In addition to quantifying total gene 
expression measures, RNA
-
seq experiments can be used for calling SNP genotypes to 
investigate allele
-
specific expression (ASE) 
(Pirinen et al., 2015; Steibel et al., 2015)
 
, and for 
studying
 
RNA
-
editing 
(Ramaswami et al., 2013)
. 
Ano
ther proposed use of genotypes called from 
RNA
-
seq data is for performing population genetics studies in nonmodel organisms 
(De Wit et al., 
2012; Schunter et al., 2013; Lemay et al., 2013; Yu et al., 2014)
 
and for performing genetic 
associations with phenotypes 
(Cánovas et al., 2013)
. These experiments, which study and 
quantify variation in transcripts nucleotide sequence rely on accurately calling genomic variants 
and genotyping individuals. Moreover, calling variants from RNA
-
seq data poses
 
challenges that 
are not present in variant calling from DNA sequence 
(Piskol et al., 2013; Quinn et al., 2013)
. For 
instance, alternative splicing 
(Piskol et al., 2013; Quinn et al., 2013)
 
and ASE 
(Steibel et al., 2015)
 
may substantially complicate genotype calling from RNA
-
seq data. Another difference between 
RNA
-
seq and DNA
-
seq data 
is that the sequence coverage of
 
certain genes across samples may 
differ substantially, due to gene
-
specific 
and subject to subject variation in gene expression.
 
Despite these challenges
, identification of genetic variants and calling genotypes from RNA
-
seq data is commonly performed using tools developed for whole genome sequencing (WGS) of 
DNA samples 
(Altmann et al., 2012; Nielsen et al., 2011)
. Consequently, the perfo
rmance of such 
algorithms have to be studied separately for RNA
-
seq experiments. 
 
Assessing the performance of variant calling programs from RNA
-
seq data is usually done by 
comparing the called genotypes to a gold standard such as genotypes from a SNP chip
 
(Djari et 
al., 2013; Cánovas et al., 2013)
 
or from WGS 
(Piskol et al., 2013; Quinn et al., 2013)
 
and 
estimating genotyping accuracy. But in the context of calling SNP genotypes from RNA
-
seq data, 
77
 
 
estimating genotyping 
accuracie
s separately for 
heterozygous and homozygous is important due 
to the different use of those genotypes. For instance: false heterozygous genotypes (a true 
homozygous site that is called heterozygous in RNA
-
seq data) will likely lead to biases in 
estimating 
ASE 
(Steibel et al., 2015; Pirinen et al., 2015)
 
and they w
ill lead to false positives in 
RNA editing studies. On the other side, false homozygous genotypes (true heterozygous sites 
called homozygous) will likely result in lack of power to detect ASE and RNA editing. In general, 
notable exceptions 
(Quinn et al., 2013)
, this issue has been ignored, and usually omnibus 
genotype calling 
accuracies or error
 
rates are defined and reported.
 
Calling reliable SNP genotypes may have other applications such as conducti
ng population 
genetics studies for estimating effective population size, studying population substructure, and 
performing genetic associations with ecologically or economically important phenotypes. In those 
cases, estimating SNP
-
 
specific genotyping call 
accuracies and error rates may not be relevant. 
For those cases it is more important to correctl
y estimate the relatedness between
 
individual
s
. 
For instance, reconstructing genomic relationship matrices 
(VanRaden, 2008)
 
with SNP
s
 
called 
from RNA
-
seq data and comparing 
them 
to a reference relationship matrix would definitely be 
more infor
mative for animal breeding applications 
(Habier et al., 2007)
, as well as for population 
history and demography studies of nonmodel organisms 
(Ekblom and Galindo
, 2011)
.
 
In this work, we compared sensitivity and false discovery rates for SNP calling using two well
-
established variant calling programs 
BCFtools 
(Li, 2011)
, and VarScan2 
(Koboldt et al., 2012)
. 
We also used Beagle 
(Browning and Browning, 2007)
 
to impute genotype calls from BCFtools, 
as previously recommended 
(Nielsen et al., 2011; Li et al., 2013)
.
 
We apply these algorithms to 
RNA
-
seq data from a pig resource population that has been extensively genotyped and 
phenotyped. We report homozygous and heterozygous genotype call rates and we estimate 
error 
rates by comparing called genotypes against genotypes obtained from SNP chip array and DNA 
sequencing.
 
78
 
 
4.2
 
Material and Methods
 
4.2.1 
Resource population 
 
The Michigan State University pig resource population (MSUPRP) is an F
2
 
cross originated 
from 4 F
0
 
Duroc sires and 16 F
0
 
Pietrain dams. The full pedigree includes 20 F
0
, 56 F
1
 
and 954 
F
2
 
animals. This population has been described in detail elsewhere 
(Edwards et al., 2008)
.
 
For 
this study we selected 24 F2 
fe
males that presented extreme
 
phenotypes in loin eye area 
(Steibel 
et al., 2011)
. 
 
Total RNA was extracted from 
longissimus dorsi
 
muscle of the 24 F2 females using TRIzol 
reagent (Invitrogen/Life Technologies, Carlsbad, CA, USA), and RNA quantity and quality were 
determined using an Agilent 2100 Bi

fragmented and labeled to generate 24 barcoded libraries that were sequenced on an Illumina 
HiSeq 2000 (100 bp, paired
-
end reads) at the Michigan State University Genomics Core Facility. 
Four tech
nical replicates were collected from each library and arranged in four different lanes of 
a flowcell allowing up to 12 barcodes per lane. The raw read data consisted of 96 pairs of fastq 
files (4 per sample) containing approximately 15million short
-
reads (
100bp) each. Those fastq 
files were pre
-
processed using FASTX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and 
FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess read quality. 
Then, Tophat 
(Trapnell et al., 2009)
 
was used for mapping the reads to the reference genome 
(Sus scrofa 10.2.69 retrieved from the Ensembl database) using an index generated by Bowtie2
 
(Langmead and Salzberg, 2012)
. The aligned records were stored in BAM/SAM format
 
SNP 
chip genotypes were obtained with the SNP60K Illumina chip 
(Gualdrón Duarte et al., 
2013)
 
and used as gold standard to compare genotypes called from mRNA sequencing data.
 
79
 
 
In addition to the SNP 
chip genotypes and muscle tissue RNA
-
seq
 
data, genomic DNA 
sequence, 
liver
 
and fat tissue RNA
-
seq data were
 
available for one of the 24 animals. Genomi
c 
DNA was purified from white blood cells using the
 
Invitrogen Purelink Genomic DNA Mini Kit. 
Library preparation was performed with the Illumina TruSeq Nano DNA Library Preparation Kit 
HT and sequenced on 2 lanes of an Illumina HiSeq 2500 Rapid Run flow c
ell and sequenced in 
a 2x150bp (PE150) configuration Rapid SBS reagents. Reads were trimmed with Condetri 
(Smeds and Künstner, 2011)
 
and aligned using Bowtie 2 
(Langmead and Salzberg, 2012)
. For 
this animal, RNA
-
seq was performed 
slightly
 
differently from the other 23 pigs. Total RNA was 
extracted from subcutaneous fat, liver and 
longissimus dorsi
 
muscle tissue. The samples were 
prepared for seq
uencing using the Illumina TruSeq Stranded mRNA Library Preparation Kit. After 
validation and quantitation the libraries were pooled and loaded on both lanes of an Illumina HiSeq 
2500 Rapid Run flow cell (v1). Base calling was performed by Illumina Real Ti
me Analysis (RTA) 
v1.18.61 and output of RTA was demultiplexed and converted to FastQ files with Illumina 
Bcl2fastq v1.8.4.
 
Reads were trimmed with Condetri 
(Smeds and Künstner, 2011)
 
and aligned 
to the reference 
genome (Sus scrofa 10.2.69) with Tophat 
(Trapnell et al., 2009)
. The BAM files were merged and 
unique alignments were e
xtracted and separated by strand.
 
 
4.2.2 
Variant calling 
 
First, base alignment files were obtained using the mpileup command of SAMtools 
version 
1.0 
(Li et al., 2009a)
 
using options 
-
Q20 
-
C50 
-
B of mpileup to eliminate reads with 
quality scores 
bel
ow 20, 
to 
downgrade mapping quality for reads with exces
sive mismatches, and to di
sable
 
the 
calculation of base realignment quality in order to diminish false SNPs generate
d
 
by 
misalignments. For the single animal with mRNA and DNA sequence data
, reads with quality 
80
 
 
scores bel
ow 25 were filtered and the coefficient for downgrading mappin
g quality with excessive 
mismatches was applied, but base realignment quality was allowed (option 
-
Q25 
-
C50 
-
E
). These 
options have been used before for the study of study RNA editing 
(Chen et al., 2014)
.
 
The multiallelic calling version of 
BCFtoo
ls
 
was used (bcftools
 
call 
-
vm) to call genotypes,  
w
hile to call SNP with VarScan2, the mpileup2snp command was used with default settings (a 
minimum of 8 reads to make a call, with at least 2 supporting reads to call a variant, a minimum 
average ba
se quality of 15, a
nd a p
-
value threshold of 0.01)
. Finally, Beagle 
(Browning and 
Browning, 2007)
 
was used to refine genotype calls from BCFtools using genotype imputation. 
Bea
gle used as input vcf files obtained by BCFtools. We used default parameters for Beagle 
version 4.0 (java 
-
Xmx4000m 
-
jar beagle.r1398.jar gl=bcfvariants).
 
 
4.2.3 
Assessing genotype call accuracy
 
SNP genotypes obtained with each variant calling program were compared to genotypes 
obtained from a DNA 
SNP60K Illumina chip 
 
to estimate genotype call rates. For each SNP 
genotyped with both technologies, we built a contingency table (
Table 
7
) and we computed 
several genotyping accuracy and error measures. For instance: true homozygous called 
homozygous as well as true heterozygous called heterozygous contribute to the accuracy of
 
the 
variant calling method and they are the basis to estimate sensitivity. On the other hand, true 
homozygous called heterozygous and true heterozygous called homozygous contribute to 
genotyping errors.
 
 
81
 
 
Table 
7
 
Contingency table w
ith classification rule used to compute sensitivity and error rate of 
genotype calling
 
RNA
-
seq 
Variant
 
calling 
Genotype
 
SNP
 
c
hip Genotype
 
Total
 
Heterozygous
 
Homozygous
 
Heterozygous
 
True He
1
 
False He
2
 
Called He
3
 
Homozygous
 
False Ho
4
 
True Ho
5
 
Called Ho
6
 
Non Called
 
NC He
7
 
NC Ho
8
 
NC
9
 
Total
 
He
10
 
Ho
11
 
 
1
 
Number of he
terozygous genotyped in the SNP
 
chip called heterozygous from RNA
-
seq; 
 
2
 
Number of 
homozygous genotyped in the SNP
 
chip called heterozygous from RNA
-
seq; 
 
3
 
Total Number of called heterozygous from RNA
-
seq;
 
4
 
Number of he
terozygous genotyped in the SNP
 
chip called homozygous from RNA
-
seq;
 
5
 
Number of homozygous
 
genotyped in the SNP
 
chip called homozygous from RNA
-
seq;
 
6
 
Total Number of called homozygous from 
RNA
-
seq;
 
7
 
Number of het
erozygous genotyped in the SNP
 
chip non
-
called from RNA
-
seq;
 
8
 
Number of 
homozygous genotyped in the SNP
 
chip non
-
called from RNA
-
seq;
 
9
 
Total number of non
-
called genotypes from RNA
-
seq;
 
10
 
Total number of he
terozygous genotyped in
 
the SNP
 
chip
 
11
 
Total number of 
homozygous genotyped in the SNP
 
chip
 
 
To assess the performance of the variant calling methods, we computed true genotype 
rates (sensitivity) and false discovery rates (FDR) for each genotype as: 
 

82
 
 
4.2.4 
Equivalence between relationship matrices and correlation with SNP chip data
 
The marker
-
based rel
ationship matrix among individuals 
(VanRaden, 2008)
 
was calculated 
using genotypes 
from the 60K SNPchip (


) and compared to the marker
-
based relationship 
using genotypes from RNA
-
seq
 
(


)
. In order to compute 

 
we selected  SNPs called from 
RNA
-
seq or from the chip and built a genotype matrix
 

containing
 
standardized allelic dosages 


, where 


are the allelic counts of the reference allele and its expectation 
and variance are computed assuming Hardy Weinberg equilibrium across SNP 
(VanRaden, 
2008)
. Finally, following Van Raden 
(
2008)
, 


To generate


, 
we used SNP called by BCFtools with at 
least 240 reads per base 
across the 24 animals. This level of total depth ensure
d
 
that both heterozygous and homozygous 
genotypes were accurately called (see results section). Additionally, only monomorphic (minor 
allele frequency > 0.0) sites in autosomic
 
chromosomes with genotyping data in all 24 samples 
were considered. After applying all the filters a total of 125
,
277 SNPs were used to compute 
the


. Finally, the maximum correlation between SNPs in the SNPchip and SNPs within 1 
Mb distance
 
in RNA
-
seq data were calculated to assess the redundancy of the genotype set 
obtained with RNA
-
seq with respect to the SNP set in the 60K chip.
 
 
83
 
 
4.3
 
Results
 
4.3.1 
Analysis across variant calling programs for 24 animals
 
Sensitivity curves presented similar p
atterns for heterozygous and homozygous genotypes, 
but with specific quantitative differences. 
Figure 
23
a
-
b
 
shows the sensitivity for calling 
heterozygous and homozyg
ous genotypes, respectively, for three variant calling programs as a 
function of inverse cumul
ative read depth. Sensitivity for
 
calling heterozygous genotypes
 
was 
higher wit
h BCFtools compared to VarScan2
 
for sequencing depth below 1000 reads per base 
(across all 24 samples). Using Beagle to impute missi
ng genotypes called by BCFtools
 
provided 
a slight increment in sensitivity o
f heterozygous genotype calling
 
at the expense of FDR as we 
describe later. Reg
ardless of the variant calling program, heterozygous genotypes were called 
with sensitivity > 94% when base read depth was equal or higher than 1000. When calling SNP 
genotypes at homozygous sites, the sensitiv
ity was usually higher than 
corresponding to 
h
eterozygous sites, especially for sites covered by less than 1000 reads. Moreover, the reported 
advantage in sensitivity of BCFtools compared to VarScan2 for calling 
genotypes at heterozygous 
sites
 
was still present but it was lower for homozygous sites. C
onsistently, imputing non
-
called 
genotypes using Beagle improved the sensitivity of homozygous calling with respect to BCF
tools
 
even more markedly than for heterozygous sites. Similarly to heterozygous, calling homozygous 
in s
ites covered by over 1000 read
s
 
re
sulted in sensitivity > 98%,
 
regardless of the variant calling 
program.
 
84
 
 
Figure 
23
 
Assessing SNP calling accuracy of RNA
-
seq samples of
 
longisimus dorsi
 
Sensitivity analysis for heterozygous 
(A) 
and
 
homozygous
 
(B)
. False discovery rate for 
heterozygous 
(C),
 
and homozygous
 
(D)
.
 
Raw read depth in the x axis is inversely cumulative in 
the log
10 
scale. Varcalling software correspond to: bcf=BCFtools, bea=Beagle, and  
vsc=VarScan2
 
 
False discovery rates curves presented
 
more distinct patterns between heterozygous and 
homozygous genotypes. 
Figure 
23
c
-
d
 
shows the FDR for heterozygous and homozygous calls, 
respectively, for each of the
 
three variant calling programs as a function of inverse cumulative 
read depth. Heterozygous sites called with VarScan2 were consistently correct regardless of the 
85
 
 
coverage depth (FDR very close to 0). However this comes as a price of very low sensitivity.
 
For 
example, in sites covered by 10 to 100 reads across all 24 samples, the sensitivity of VarScan2 
was 56% on average (
Figure 
23
a
). Contrastingly, at that same 
depth range, FDR with BCFtools 
was between 12% and 4%,
 
but sensitivity was between 50% and 87%. At high sequencing d
epth, 
differences in FDR vanish
 
and all programs presented
 
FDR < 2% when base read coverage was 
equal or higher than 1000. As mentioned befo
re, imputation with low coverage increased 
sensitivity at the cost of FDR especially for sites with less than 100 reads per base. For instance, 
the use of Beagle increased the FDR from 4% to 21% when coverage dropped below 10 reads 
per base. The proportion
 
of incorrectly called homozygous was generally higher than the 
proportion of incorrectly called heterozygous (
Figure 
23
d
). Only sites with more than 1000 reads 
per b
ase presented a FDR of homozygous below 3%. BCFtools had lower error rates than 
VarScan2 for base read depth between 10 and 1000. The use of Beagle increased the FDR for 
homozygous particularly for sites with very low coverage (less than 10 reads per base)
, and the 
increase in FDR was less than when imputing heterozygous sites.
 
 
4.3.2 
Comparison of genotype calls from DNA
-
seq and RNA
-
seq from multiple tissues
 
Comparing SNP calling from RNA
-
seq to DNA
-
seq data, genotypes called from DNA
-
seq data 
showed unifo
rmly high sensitivity (sensitivity > 0.99) and
 
low FDR (2%). Conversely
, error rates 
from genotypes called from RNA
-
seq were influenced by sequencing depth. Sensitivity for 
heterozygous sites called from RNA
-
seq data varied with base read depth while sensi
tivity for 
homozygous sites was always high (>96%) regardless of base read depth (
Figure 
24
a
-
b
) and 
comparable to the sensitivity of homozygous genotypes called from D
NA
-
seq. Sensitivity of 
heterozygous genotype calls using RNA
-
seq for any of the tissues was below 90 % for sites with 
less than 10 reads per base and almost constant (94%
-
98%) for sites with more than 10 reads 
per base. A coverage of 10 reads per base in o
ne animal is on average equivalent to 240 reads 
86
 
 
across 24 animals, at which point BCFtools provided a sen
si
tivity of 94% across all samples 
(
Figure 
23
a
).
 
False discovery rates for heterozygous sites were below 2% regardless of base read depth 
either for DNA or RNA samples. Different patterns presented when analyzing homozygous sites. 
For ho
mozygous sites, variant calling using RNA
-
seq incorrectly genotyped a large proportion of 
sites (between 59% and 10%)
 
that were covered by 10 or fewer
 
reads per base. Homozygous 
sites with more than 10 reads per base presented lower false discovery rates f
or RNA
-
seq 
samples from 
longissimus dorsi
 
than from liver or fat tissue. FDR for homozygous was always 
below 1% when using the DNA sample.
 
87
 
 
Figure 
24
 
Assessing SNP calling accuracy of DNA and RNA
-
seq samples of three tissues from 
the same animal
 
Sensitivity analysis for heterozygous 
(A) 
and
 
homozygous
 
(B)
. False discovery rate for 
heterozygous 
(C),
 
and homozygous
 
(D)
.
 
Raw read depth in the x axis is in
versely cumulative in 
the log
10 
scale. Varcalling software correspond to: bcf=BCFtools, bea=Beagle, and  
vsc=VarScan2
 
 
4.3.3 
Equivalence between relationship matrices
 
Correlation between relationship matrices 


and 


(based on 
SNPchip markers 
and based on RNA
-
seq markers, respectively) was 0.99 (
Figure 
25
). This value indicated that 
88
 
 
both matrices were virtually equivalent and either of them could be used to account for the 
relationship among the animals
. In 
depth analysis of the correlation re
sults showed that 
correlation between the diagonals of both relationship matrices was 0.66 and correlation between 
off
-
diagonals was 0.95 (
Figure 
25
). These results a
re of interest to distinguish the use of the 
relationship matrices to compute genomic inbreeding coefficient from the diagonals or to compute 
genomic relationships between individuals from the off
-
diagonals 
(VanRaden, 2008)
.  
 
An average of 110 SNPs called from RNA
-
seq were within the 1 Mb window defined for ea
ch 
SNP in the SNP
60K 
chip a
nd only n
ine percent of SNP
s
 
in the 
SNP
 
60K 
chi
p did not have any 
SNP called f
r
o
m RNA
-
seq.  The median corr
elation between SNPs in the SNP
60K 
c
hip and SNPs 
called from RNA
-
seq was 0.71 and increased to 0.78 for SNPs in the SNP
60K 
chip with at least 
one SNP
 
called from RNA
-
seq within the 1 Mb window.  This result implied that on average the 
information contained in all the SNPs in the SNP
60K 
chip could be replaced by information on 
SNP
s
 
called from RNA
-
seq.
 
89
 
 
Figure 
25
 
Scatterplot of
 
correlation between relationship matrices for the 24 F2 females
 
The scatterplot shows a high equivalence (r=0.99) between the relationship matrix based on 
SNPchip genotypes (
G
SNPchip
) and the relationship matrix based on SNPs genotypes called from 
RNA
-
seq
 
sites (
G
RNA
-
seq
) with depth >=240 reads per base. Correlation between relationship 
matrices for the same animal (diagonal values of the relationship matrices) correspond to values 
in the right top quadrant (r=0.99) while correlation between relationship m
atrices for the different 
animals (off
-
diagonal values of the relationship matrices) are represented in the left bottom 
quadrant (r=0.95)
 
 
4.4
 
Discussion
 
This work assessed genotype calling from RNA
-
seq and DNA
-
seq data. We compared 
genotypes called by two var
iant calling programs to genotypes obtained from a DNA SNP
60K
 
chip by computing sensitivity and false discover
y
 
rates separately. Accuracy of SNP calling varied 
between programs for different genotypes. Based on these results we make suggestions for 
implem
enting genotype calling from RNA
-
seq data, depending on its intended use, for instance 
90
 
 
when using heterozygous genotypes for allele
-
specific expression studies or when using 
homozygous for exploring RNA editing. 
 
Identifying heterozygous SNP is the first s
tep to conduct allele
-
specific expression studies 
(Pickrell et al., 2010)
. Heterozygous sites provide inform
ation to classify reads according to their 
haplotype transcript of origin and to quantify the allelic expression ratio 
(Lee et al., 2013)
. Two 
major consequences from genotyping error in ASE are:1) the omission of calling a heterozygous 
site
,
 
or 2) the incorrect assignment of a heterozygous genotype to a true homozygous genotype. 
In the case of missing a heterozygou
s call the direct consequence will be a loss in power, while 
when mislabeling the site the effect could be biases in estimation of the ASE ratio 
(Steibel et al., 
2015)
. Quinn et
 
al. 
(2013)
 
found that sensitivity for heterozygous varied between 40% to 80% at 
coverage depth below 10X per sample, and it reached 90% for coverage depth above 10X per 
sample. Similarly, we found that
 
for the single sample analysis more than 90% of heterozygous 
were called in all tissues when depth was above 10 reads per base (
Figure 
24
A). For the 24 
multiple samples, the same proportion (>90%) was obtained using BCFtools for an equivalent 
total depth of 240 reads or more per base across all samples (24 samples x 10X per sample, 
Figure 
23
a). However, VarScan2 called only 70% of heterozygous sites at the same depth of 240 
reads per base. Interestingly, false discovery rates for heterozygous genotypes at 10 reads per 
b
ase per sample, or equivalently
 
a 
multi
-
sample total depth of 240 reads per base, were below 
2%
.
 
Consequently, calling heterozygous from RNA
-
seq data with more than 10 reads on average 
per base and per sample can provide a good number of sites to conduct ASE studies with 
sufficient power a
nd unbiased estimations of the allelic expression rate. Additionally, BCFtools 
and Beagle provided a larger set of true heterozygous genotypes (e.g. an average of 40% more 
SNP compared to VarScan2), due to increased sensitivity and to more stringent filter
ing criteria 
implemented in VarScan2. Yet, if a variant is called heterozygous by VarScan2, it will be likely be 
a true heterozygous as the FDR is very low regardless of the depth (Figures 
Figure 
23
c and 
91
 
 
Figure 
24
c). However, more caution should be considered when using BCFtools and Beagle in 
sites with low depth (Figures 
Figure 
23
c and 
Figure 
24
c). 
 
Homozygous sites provide an initial list of candidate sites for studying RNA editing 
(Li et al., 
2009b; Chen et al., 2014)
.  The typical approach to study RNA editing compares the mismatches 
between homozygous DNA and RNA sequences 
(Peng et al., 2012; Ramaswami et al., 20
12)
 
from a single individual. Besides identifying the type of editing process, e.g. A
-
to
-
I editing, it is 
possible to quantify editing levels in an analogous way as in ASE studies 
(Bahn et al., 2012; Lee 
et al., 2013)
. Therefore, similar implications as discussed above when analyzing sensitivity and 
FDR for
 
calling heterozygo
us also applies
 
when analyzing homozygous.  Calling genotypes from 
DN
A
-
seq is reliable and has been 
studied in detail 
(Nielsen et al., 2011; Li, 2014; Li et al., 2013; 
Li, 2011; Kumar et al., 2014)
. 
We confirm
ed previously published results
 
for instance, when using 
DNA
-
seq
,
 
we found that sensiti
vity for homozygous was > 0.99 (
Figure 
24
b) with a corresponding 
FDR < 2% (
Figure 
24
D) irrespective of depth. Similar values and patterns were found for 
heterozygous genotypes (
Figure 
24
a
-
b). However, calling genotypes from RNA
-
seq protocols is 
subject to diverse sources of false genotyping error, for instance: increased mapping errors due 
to the complexity of the transcripto
me compared to the genome 
(Bass et al., 2012)
. For that 
reason, stringent mapping and filtering options are applied particularly when analyzing nucleotide 
variation profiles from RNA
-
seq 
(Quinn et al., 2013; Piskol et al., 2013; Lee et al., 2013)
. 
In this 
work, we found that calling homozygous from RNA
-
seq presented high sensitivity (>95%) in all 
tissues and irrespective of depth (
Figure 
24
b), but it could result in a high number of false positives 
(
Figure 
24
D)
. Similar results for 
calling 
homozygous were found in Quinn et al. 
(Quinn et al., 
2013)
. Consequently, depth of RNA
-
seq reads should be considered when analyzing 
homozygous sites using RNA
-
seq. This is usually not considered in RNA
-
editing studies where 
protocols for including editing sites emphasize requirements in DNA depth but not in RNA
-
seq 
depth 
(Chen et al., 2014; Lee et al., 2013)
. Adding a depth filter to RNA
-
seq reads could help to 
92
 
 
lower the number of false posi
tives in addition to other filtering strategies that are currently used 
(Chen et al., 2014; Lee et al., 2013)
.
 
Imputation of genotyp
es using Beagle after calling variants with BCFtools neither improve 
sensitivity nor worsen FDR in either heterozygous or homozygous sites covered by more than 
100 reads across 24 samples (
Figure 
23
). At lower coverage, the effect was detrimental as 
imputation provided little increase in sensitivity at the cost of expanding FDR in both genotypes 
(
Figure 
23
c
-
d). These F2 pigs were sampled from an admixed population originated from two 
different founding populations, thus it is expected long range persistence of linkage disequilibrium 
(Vaughan et al., 2009)
 
and heterogeneity of haplotype blocks 
(Gualdrón Duarte et al., 2013)
 
in 
comparison to a panmictic population. Consequently, imputation based only on linkage 
dis
equilibrium may not result in high imputation accuracy 
(Gualdrón Duarte et al., 2013)
. Ho
wever, 
if RNA
-
seq data alone or a mixture of DNA
-
seq and RNA
-
seq data were available across multiple 
generations, pedigree based imputation could result in a more accurate imputation 
(Gualdrón 
Duarte et al., 2013)
. We did not explore this aspect in this paper due to small sample size and 
lack of sequence genotypes in ancestors of the F2 pigs.
 
Accurate genotypes obtained by RNA
-
seq 
can 
be used to estimate genetic relationships 
between individuals in a similar way to using genotypes from a SNP
60K 
chip (
Figure 
25
). Even 
though the SNP sets from the chip and the RNA
-
seq analysis did not completely overlap, we 
found that they are highly correlated and they represent the relationships among individual in a 
very similar way (
Figure 
25
). We know from previous analyses that the  
SNP
60K 
chip represents 
the relationships among individuals in this population with high accuracy 
(Gualdrón Duarte et al., 
2013)
. 
Additionally
, 
RNA
-
seq derived SNP
s
 
may be used to estimate relationships in populations 
that are not well re
presented by SNPs in the SNP
60K 
chip to avoid problems related to 
ascertainment bias. In nonmodel organisms of little agricul
tural interest
 
SNP
 
chip
s
 
are generally 
not available. In this situations,
 
RNA
-
seq data could be used to estimate relationships amon
g 
93
 
 
individuals 
(Seeb et al., 2011; Helyar et al., 2011; Weinman et al., 2014)
 
or to complement 
genotyping by sequencing studies 
(Narum et al., 2013)
,
 
and to improve genome
-
wide association 
s
tudies by supplementing randomly selected SNP sets with more likely functional SNP sets 
(Cánovas et al., 2013)
.
 
 
94
 
 
Chapter
 
5 
General d
iscussion
 
 
General discussion
 
 
95
 
 
5.1
 
Introduction
 
RNA sequencing (RNA
-
seq) 
(Wang et al., 2009)
 
has emerged as a revolutionary technology 
to study transcriptomes. Both quantitative and qualitative aspects are being intensively stud
ied in 
a broad spectrum of biological applications, such as gene differential expression analysis and 
single nucleotide variation profiling in both model and nonmodel organisms 
(Wickramasinghe et 
al., 2014; Qian et al., 2014)
. Simultaneously, a number of methods are being pr
oposed for 
analyzing RNA
-
seq data in each of those applications (Nookaew et al., 2012). Moreover, methods 
are being periodically updated and new algorithms are being continuously published to improve 
the analyses (Seyednasrollah et al., 2013; Kvam et al., 
2012). As a consequence, researchers 
often find themselves having to decide between competing models and assessing the reliability 
of results obtained with a designated analysis program. Choosing the most appropria
te software 
can help to get
 
better results
 
and to achieve the goals of the investigation 
(Mehta et al., 2006)
. 
The overarchin
g goal of this dissertation was to propose and implement validation procedures 
based on experimental data to estimate the properties of widely used statistical analysis methods 
of RNA
-
seq experiments. Using synthetic and reference datasets, I compared stat
istical models 
to perform differential expression analysis, sampled
-
base hierarchical cluster analysis, and 
variant calling and genotyping
. 
 
 
5.2
 
Objectives revisited
 
1.
 
To evaluate statistical models for differential expression analysis in RNA
-
seq experiments 
 
Evaluation of statistical models for differential expression analysis has been based on 
parametric simulated datasets using count data distributions, such as Poisson and negative 
binomial. In this dissertation, I provided a more comprehensively approach us
ing pla
s
modes from 
96
 
 
experimental datasets as well as parametric simulations. Two methods based on negative 
binomial (edgeR and DESeq) presented higher type I error rates than a transformed Gaussian 
model (MAANOVA) despite the use of simulations or plasmodes
. The complement between the 
tw
o types of reference datasets was
 
particularly useful when comparing 
the
 
methods. Since 
parametric simulation can benefit a model that is based on the same distribution, the use of 
plasmode
s
 
provide
d
 
an ind
ependent reference 
to supplement
 
the analysis.
 
Additionally, I presented the comparison of models by exploring the joint null distribution of p
-
values that is expected to resemble a uniform distribution. The MAANOVA program used to fit the 
transformed Gaussian model can gene
rate p
-
values using different strategies, such as using 
moderated test or transcript
-
by
-
transcirp test. The best approach to compute the p
-
values differed 
between datasets. The comparison of the p
-
value distributions helped to decide on the best set 
of spe
cifications to use in each datase
t. Usually, this measure of comparison is
 
not reported when 
evaluating the adequacy of a model. However, researchers typically correct p
-
values by FDR as 
proposed by Benjamini and Hochber 
(Benjamini and Hochberg, 1995)
 
and such correction may 
not be appropriate if the uniform distribution is not supported, thus, leading to wrong decisions 
when comparing treatments.
 
Overall, Gaussian models had p
-
values closer to nominal signifi
cance levels and presented 
p
-
value distributions closer to the uniform distribution
. Researchers using models with these 
characteristics will have less false positive when distinguishing differentiall
y expressed 
transcripts. In addition
, Gaussian transform
ed model can include random effects and hierarchical 
structures that arise in complex experiments, such as when collecting data in fisheries or wildlife 
studies. For instance, landscape genomics has been proposed as a framework for studying 
adaptive and ne
utral genetic variation at the population level in a spatially explicit context 
(Joost 
et al., 2007)
, however, flexible models that can account for environmental and spatial 
autocorrelations are nee
ded 
(Schoville et al., 2012)
. Similar to
 
macroscale biology 
applications
, 
97
 
 
more complex analysis methods are also required at the molecular scale for analyzing gene 
families or regulatory networks that need to account for time course and gene
-
set correlations 
(Yang and Wei, 2015; Emmert
-
Str
eib, 2013)
.
 
Additionally, the proposed plasmode algorithm is an alternative to previous
 
implemented
 
plasmode
s
. The
 
algorithm allows incorporation
 
of 
biological variation by making several partitions 
from the original set of samples instead of only one a
s implemented in Gadbury et al. 
(
2008)
. This 
way of creating
 
plasmodes can be of interest in field studies where the individual variability can 
be studied 
a priori
 
to improve experimental designs aspects such as sample size.  In conclusion, 
I proposed and implemented a general validation method that was applied to specifically compare 
count data based models versus a Gaussian transformed data model. Furthermore, the
 
procedure 
allow
ed
 
to set the best 
analysis option for
 
the Gaussian transformed model 
according to
 
specific 
datasets, 
and presented other uses of the information generated by the validation process that 
can help researchers to plan data collection.    
 
 
2.
 
To
 
assess the properties of dissimilarity measures for agglomerative hierarchical 
clustering of RNA
-
seq data
 
Hierarchical cluster analysis is one of the most used techniques for representing pattern 
recognition and representation 
of results from 
sequencing d
ata
 
(Liu and Si, 2014
)
. Arguably, 
selection of a distance or similarity metrics is one of the most important decision in the 
implementation of hierarchical clustering 
(Izenman, 2008)
. As discussed for differential 
expression methods, the validity of alternatives metrics for hierarchical clustering has been 
previously assessed using parametric si
mulations. In t
his dissertation
, I presented two ways for 
creating plasmode datasets for assessing the suitability of distance metrics for clustering RNA
-
seq expression data. For that, I used data from two experiments conducted under different 
conditions. One experiment 
was the result of performing RNA
-
seq from animals in a pre
-
designed 
98
 
 
experiment based on two inbred strains of mice, while the other experiment consisted in 
sequencing the transcriptome of randomly sampled (untreated) individuals from an outbred 
population.
 
The common goal of performing clustering in both datasets was to uncover population 
structure due
 
to
 
genetic differences in gene expression (differences between breeds of mice or 
differences within a crossbred population)
.
 
In both situations, the proposed
 
plasmode building 
strategy allowed for individual
-
to
-
individual variability and it extend
ed
 
the validation of cluster 
analysis for datasets with different experimental structures, e.g the first dataset is a typical 
example of wet lab experiment for studyi
ng the influence of controlled factors, e.g.  Smith et al. 
(2013)
, whereas the second is a common example of field studies for establishing relationships 
among individuals, e.g. Lamichhaney et al. 
(2012)
, or for studying evolution of gene differential 
expression, e.g. Gu et al.  
(2013
)
. 
 
The validation included the comparison of a wide range of dissimilarity measures, including 
the one based on the traditional Euclidean dist
ance, correlation
-
based dissimilarities using raw, 
transformed and normalized count data, as well as measures specifically developed for discrete 
count data. Dissimilarity measures based on non
-
transformed count data (Euclidean and Pearson 
correlation), re
sulted in dendrograms that did not resemble the expected sample structure. 
Normalization did not help with the use of those measures either. The dissimilarity based on 
Spearman correlation was the only correlation
-
based dissimilarity that recovered the nat
ural 
sample structure. Dissimilarities calculated using appropriate transformations for count data were 
consistent in reproducing the expected dendrograms.
 
Additionally, the validation procedure proposed two metrics
, agreement and consistency,
 
for 
measurin
g the adequacy of dissimilarity measures
.
 
Agreement measures the correlation between 
dissimilarities
,
 
then two dissimilarities with high agreement produce dendrograms with similar 
hierarchical structures. Consistency measures the correlation between dendro
grams obtained 
with a particular dissimilarity
,
 
then a dissimilarity with high consistency always generates similar 
99
 
 
dendrograms. These type of objective measures had not been reported before when comparing 
dendrograms in RNA
-
seq studies. 
 
Finally, plasmode
s allowed the analysis
 
to be extended
 
to different hypothetical scenarios. In 
one scenario only differentially expressed genes were used in clustering analysis while in another 
scenario all genes were included (differentially and non differentially express
ed). All the 
dissimilarities that accounted for appropriate transformations were highly consistent in all 
scenarios. This dual use of representing information is useful to explore relationships but has to 
be interpreted in the correct context. For instance
, Lamichhaney et al. 
(
2012)
 
presented 
philoge
netic trees
 
using all markers or
 
using only markers that showed significantly genetic 
differentiation when studying local adaptation to salinity levels in Atlantic herring populations
. The 
study
 
failed to uncover the population structure, except when the gene set was
 
restricted to pre
-
selected genes known for harboring SNP segregating at extreme frequencies among diverse 
populations.
 
All in all, the proposed validation method compared the most common dissimilarity measures 
used in hierarchical clustering and dissimila
rity measures potentially appropriate for count 
overdispersed data. The validation included a variety of scenarios suitable for RNA
-
seq 
applications in biology studies and specifically in fisheries and wildlife applications, and proposed 
agreement and cons
istency metrics to objectively compare dendrograms structures.   
 
 
3.
 
To  compare sensitivity and false discovery rates for SNP genotyping in variant calling 
programs
 
Calling and genotyping variants from DNA
-
seq has been actively studied 
(Nie
lsen et al., 2011; 
Li et al., 2014, 2013)
 
and the same programs are being used for calling variant from RNA
-
seq 
(Piskol et al., 2013; Lee et al., 2013)
. In 
this thesis I compared three well known programs, namely 
100
 
 
BCFtools, Beagle and VarScan2, in terms of sensitivity and false discovery rate  when genotyping 
heterozygous and homozygous sites using as reference genotypes obtained from a 60K SNPchip. 
Additional
ly, the comparison considered multi samples and multi tissue scenarios, as well as 
studied the influence of read depth. Calling heterozygous with more than 10 reads per base and 
per sample provided sensitivity of 70% for VarScan2 and more that 90% for BCFt
ools and Beagle 
with low FDR in all programs. Homozygous were called with higher sensitivity in all tissues and 
samples irrespective of depth but presented higher FDR. 
 
Distinguishing the sensitivity and false genotyping rate of homozygous from heterozygou
s 
sites separately allows using the information for designing analysis for differ
ent applications such 
as allele
-
specific expression and RNA edi
ting. This type of study is
 
emerging in fish and wildlife 
populations. For instance, Wang et al. 
(
2013)
 
used RNA
-
seq to study differential
 
expression 
analysis and allele
-
specific expression related to disease resistance against enteric septicemia of 
catfish. Furthermore, other uses of SNP genotyping from RNA
-
seq are expected to increase in 
the future. In nonmodel organisms, SNP genotyping will be relevant for ecological an
d 
evolutionary studies and population genetics 
(Seeb et al., 2011; Helyar et al., 2011)
. For instance, 
Weinman et al.
 
(
2014)
 
used RNA
-
seq for discovering SNP and establishing parentage and kinship 
in cooperatively breeding super starlings that live in highly kin
-
structured groups. In this sens
e, I 
demonstrated in this dissertation
 
that
 
accurate genotypes obtained by RNA
-
seq can be used to 
estimate genetic relationships between individuals in a similar way to using genotypes from a 
SNPchip. SNPchip are usually not available or are expensive to design for nonmodel organisms 
of little agri
cultural interest, thus RNA
-
seq can be used to complement other genotyping 
approaches  like genotyping
-
by
-
sequencing 
(Narum et al., 2013)
 
or genotyping
-
in
-
thousands by 
sequencing 
(Campbell et al., 2015; Pavey, 2015)
.
 
In summary, the comparison of variant calling programs provided information on the accuracy 
of calling and genotyping homozygous and heterozygous sites including multiple samples and 
101
 
 
multiple tissues scenarios, and considering the effect of read depth. The
 
evaluation explored the 
results in the context of applications for allele
-
specific and RNA
-
editing studies, as well as for 
studying relationships in a population. 
 
 
5.3
 
Future research directions
 
The g
enomics era is providing unprecedented opportunities for r
esearchers to study 
mechanisms underlying biological processes. The amount and type of data that are being 

having data is just the first step towards generating
 
information that can help to elucidate the 
biological processes. This context requires researchers that are able to understand the biology 
and to develop and apply valid statistical methods for extracting information 
(Gomez
-
Ca
brero et 
al., 2014)
. 
 
Systems biology is a field of study in which disciplines converged to study biological systems 
in a holistic approach. Neuroscience, physiology, and ecology all converged on the idea that it is 
as important to study and model the whole system and its inter
actions as to analyze the individuals 
parts alone 
(Conesa an
d Mortazavi, 2014)
. For instance the molecular systems biology needs to 
integrate data from complex interactions among biomolecules such as DNA, different types of 
RNA, proteins, and metabolites but similar integrative challenges exist at other multiple
 
scales, 
such as cellular, organismal and ecological organization 
(Conesa and Mortazavi, 2014)
. 
Moreover, any integrative approach in omics data has to deal with high dimensional data and in 
most cases 
a priori
 
uncertainty in the definition of the hypotheses to be tested. Thus, a field of 
active research is the 
improvement of analysis methods that allow integration of multiple data 
types to produce statistically valid and biologically relevant interpretations. 
 
102
 
 
The procedures studied and developed in this dissertation are extremely useful to continue 
research in 
integrative analysis method in life sciences. The flexibility of plasmode construction, 
as presented in chapters 2 and 3, can be extended to validate further analysis methods. In chapter 
2, the algorithm to generate plasmodes repeatedly sampled a random se
t of genes. However, it 
may be customized to sample particular related sets of genes in order to validate explorative and 
inference methodologies of gene and molecular networks such as pathway analysis. In chapter 
3, the assessment of dissimilarity measure
s for sample
-
based clustering analysis can be 
extended to gene
-
based or bi
-
clustering, and other integrative clustering analysis for multiple 
omics datasets 
(Milone et al., 2014; Consortia, 2014)
. Importantly, the validation could be easily 
tested in dimension reduction techniques that are important to reduce 
the noisy high
-
dimensional 
characteristic of RNA
-
seq and other omics. Results obtained with principal components analysis, 
multidimensional scaling or correspondence analysis can be compared and validated including 
new approaches of dimensionality reductio
n recently proposed for integrative analysis 
(Reverter 
et al., 2014; Reshetova et al., 2014; Simmons et al., 2015; Tomescu et al., 2014)
. 
Chapter 3, 
provided and in
-
depth exploration of calling and genotyping SNPs. SNP datasets are one of the 
fundamental data sources used when integrati
ng datasets in genomics. For instance it is typical 
to integrate SNP with gene expression 
(Conesa and Mortazavi, 2014)
 
as in eQTL studies
.
 
 
103
 
 
REFERENCES
 
 
104
 
 
REFERENCES
 
 
Alamancos, G., Agirre, E., and Eyras, E. (2014). 

-

Spliceosomal Pre
-
mRNA Splicing
 
Methods in 
Molecular Biology., ed. K. J. Hertel (Totowa, NJ: Humana Press), 357

397. 
doi:10.1007/978
-
1
-
62703
-
980
-
2.
 
Altmann, A., Weber, P., Bader, D., P
reuss, M., Binder, E. B., and Müller
-
Myhsok, B. (2012). A 
beginners guide to SNP calling from high
-
throughput DNA
-
sequencing data. 
Hum. Genet.
 
131, 1541

54. doi:10.1007/s00439
-
012
-
1213
-
z.
 
Anders, S., and Huber, W. (2010). Differential expression analysis f
or sequence count data. 
Genome Biol.
 
11, R106. doi:10.1186/gb
-
2010
-
11
-
10
-
r106.
 
Anders, S., Pyl, P. T., and Huber, W. (2014). HTSeq 
-
 
A Python framework to work with high
-
throughput sequencing data. 
bioRxiv
, 0

4. doi:10.1101/002824.
 
Anders, S., Pyl, P. T., 
and Huber, W. (2015). HTSeq A Python framework to work with high
-
throughput sequencing data. 
Bioinformatics
 
31, 166

169. doi:10.1101/002824.
 
Auer, P. L., and Doerge, R. W. (2011). A Two
-
Stage Poisson Model for Testing RNA
-
Seq Data. 
Stat. Appl. Genet. Mol. 
Biol.
 
10, 1

28. doi:10.2202/1544
-
6115.1627.
 
Auer, P. L., and Doerge, R. W. (2010). Statistical design and analysis of RNA sequencing data. 
Genetics
 
185, 405

416. doi:10.1534/genetics.110.114983.
 
Babbitt, C. C., Tung, J., Wray, G. A., and Alberts, S. C. (20
12). Changes in Gene Expression 
Associated with Reproductive Maturation in Wild Female Baboons. 
Genome Biol. Evol.
 
4, 
102

109. doi:10.1093/gbe/evr134.
 
Badke, Y. M., Bates, R. O., Ernst, C. W., Schwab, C., Fix, J., Van Tassell, C. P., and Steibel, J. 
P. (20
13). Methods of tagSNP selection and other variables affecting imputation accuracy 
in swine. 
BMC Genet.
 
14, 8. doi:10.1186/1471
-
2156
-
14
-
8.
 
Bahn, J. H., Lee, J., Li, G., Greer, C., and Peng, G. (2012). Accurate identification of A
-
to
-
I RNA 
editing in human 
by transcriptome sequencing Accurate identification of A
-
to
-
I RNA editing 
in human by transcriptome sequencing. 142

150. doi:10.1101/gr.124107.111.
 
Bass, B., Hundley, H., Li, J. B., Peng, Z., Pickrell, J., Xiao, X. G., and Yang, L. (2012). The 
difficult ca
lls in RNA editing. 
Nat. Biotechnol.
 
30, 1207

1209. doi:nbt.2452 
[pii]
\
r10.1038/nbt.2452.
 
Bates, D., Maechler, M., and Bolker, B. (2013). lme4: Linear and mixed
-
effects models using S4 
classes.
 
Benjamini, Y., and Hochberg, Y. (1995). Controlling the False 
Discovery Rate 
-
 
a Practical and 
Powerful Approach to Multiple Testing. 
J. R. Stat. Soc. Ser. B
-
Methodological
 
57, 289

300.
 
105
 
 
Bennett, S. (2004). Solexa Ltd. 
Pharmacogenomics
 
5, 433

8. doi:10.1517/14622416.5.4.433.
 
Blekhman, R., Marioni, J. C., Zumbo, P., St
ephens, M., and Gilad, Y. (2010). Sex
-
specific and 
lineage
-
specific alternative splicing in primates. 
Genome Res.
 
20, 180

189. 
doi:10.1101/gr.099226.109.
 
Bottomly, D., Walter, N. A. R., Hunter, J. E., Darakjian, P., Kawane, S., Buck, K. J., Searles, R. 
P.,
 
Mooney, M., McWeeney, S. K., and Hitzemann, R. (2011). Evaluating Gene Expression 
in C57BL/6J and DBA/2J Mouse Striatum Using RNA
-
Seq and Microarrays. 
PLoS One
 
6, 
e17820. doi:10.1371/journal.pone.0017820.
 
Bourgon, R., Gentleman, R., and Huber, W. (2010). Independent filtering increases detection 
power for high
-
throughput experiments. 
Proc. Natl. Acad. Sci.
 
107, 9546

9551. 
doi:10.1073/pnas.0914005107.
 
Browning, S. R., and Browning, B. L. (2007). Rapid and ac
curate haplotype phasing and 
missing
-
data inference for whole
-
genome association studies by use of localized haplotype 
clustering. 
Am. J. Hum. Genet.
 
81, 1084

1097. doi:10.1086/521987.
 
Bullard, J. H., Purdom, E., Hansen, K. D., and Dudoit, S. (2010). Evalu
ation of statistical 
methods for normalization and differential expression in mRNA
-
Seq experiments. 
BMC 
Bioinformatics
 
11, 94.
 
Campbell, N. R., Harmon, S. a., and Narum, S. R. (2015). Genotyping
-
in
-
Thousands by 
sequencing (GT
-
seq): A cost effective SNP gen
otyping method based on custom amplicon 
sequencing. 
Mol. Ecol. Resour.
 
15, 855

867. doi:10.1111/1755
-
0998.12357.
 
Cánovas, a, Rincón, G., Islas
-
Trejo, a, Jimenez
-
Flores, R., Laubscher, a, and Medrano, J. F. 
(2013). 
RNA sequencing to study gene expression and single nucleotide polymorphism 
variation associated with citrate content in cow milk. 
J. Dairy Sci.
 
96, 2637

48. 
doi:10.3168/jds.2012
-
6213.
 
Cánovas, A., Rincon, G., Islas
-
Trejo, A., Wickramasinghe, S., and Medran
o, J. F. (2010). SNP 
discovery in the bovine milk transcriptome using RNA
-
Seq technology. 
Mamm. Genome
 
21, 592

8. doi:10.1007/s00335
-
010
-
9297
-
z.
 
Cattell, R. B., and Jaspers, J. (1967). General Plasmode No. 30
-
10
-
5
-
2 for factor analytic 
exercises and resear
ch. 
Multivariate Behav. Res.
 
Chen, G., Wang, C., and Shi, T. (2011). Overview of available methods for diverse RNA
-
Seq 
data analyses. 
Sci. China. Life Sci.
 
54, 1121

8. doi:10.1007/s11427
-
011
-
4255
-
x.
 
Chen, J. Y., Peng, Z., Zhang, R., Yang, X. Z., Tan, B. C.
 
M., Fang, H., Liu, C. J., Shi, M., Ye, Z. 
Q., Zhang, Y. E., et al. (2014). RNA Editome in Rhesus Macaque Shaped by Purifying 
Selection. 
PLoS Genet.
 
10, 8

12. doi:10.1371/journal.pgen.1004274.
 
Cheung, V. G., Nayak, R. R., Wang, I. X., Elwyn, S., Cousins, S
. M., Morley, M., and Spielman, 
R. S. (2010). Polymorphic <italic>Cis</italic>
-
 
and <italic>Trans</italic>
-
Regulation of 
Human Gene Expression. 
PLoS Biol
 
8, e1000480. doi:10.1371/journal.pbio.1000480.
 
106
 
 
Conesa, A., and Mortazavi, A. (2014). The common ground
 
of genomics and systems biology. 
BMC Syst. Biol.
 
8 Suppl 2, S1. doi:10.1186/1752
-
0509
-
8
-
S2
-
S1.
 
Consortia, Stat. (2014). STATegRa: Classes and methods for multi
-
omics data integration. R 
package v 1.2.1.
 
Cui, X., Hwang, J. T. G., Qiu, J., Blades, N. J., an
d Churchill, G. A. (2005). Improved statistical 
tests for differential gene expression by shrinking variance components estimates. 
Biostatistics
 
6, 59

75. doi:10.1093/biostatistics/kxh018.
 
Dalton, L., Ballarin, V., and Brun, M. (2009). Clustering algorithm
s: on learning, validation, 
performance, and applications to genomics. 
Curr. Genomics
 
10, 430

45. 
doi:10.2174/138920209789177601.
 
Di, Y., Schafer, D. W., Cumbie, J. S., and Chang, J. H. (2011). The NBP Negative Binomial 
Model for Assessing Differential Gen
e Expression from RNA
-
Seq. 
Stat. Appl. Genet. Mol. 
Biol.
 
10. doi:10.2202/1544
-
6115.1637.
 
Dillies, M. A., Rau, A., Aubert, J., Hennequet
-
Antier, C., Jeanmougin, M., Servant, N., Keime, 
C., Marot, G., Castel, D., Estelle, J., et al. (2012). A comprehensive e
valuation of 
normalization methods for Illumina high
-
throughput RNA sequencing data analysis. 
Brief. 
Bioinform.
 
doi:10.1093/bib/bbs046.
 
Djari, A., Esquerré, D., Weiss, B., Martins, F., Meersseman, C., Boussaha, M., Klopp, C., and 
Rocha, D. (2013). Gene
-
bas
ed single nucleotide polymorphism discovery in bovine muscle 
using next
-
generation transcriptomic sequencing. 
BMC Genomics
 
14, 307. 
doi:10.1186/1471
-
2164
-
14
-
307.
 
Edwards, D. B., Ernst, C. W., Tempelman, R. J., Rosa, G. J. M., Raney, N. E., Hoge, M. D., and
 
Bates, R. O. (2008). Quantitative trait loci mapping in an F2 Duroc x Pietrain resource 
population: I. Growth traits. 
J. Anim. Sci.
 
86, 241

253. doi:10.2527/jas.2006
-
625.
 
Ekblom, R., and Galindo, J. (2011). Applications of next generation sequencing in mo
lecular 
ecology of non
-
model organisms. 
Heredity (Edinb).
 
107, 1

15. doi:10.1038/hdy.2010.152.
 
Emmert
-
Streib, F. (2013). Influence of the experimental design of gene expression studies on 
the inference of gene regulatory networks: environmental factors. 
Pe
erJ
 
1, e10. 
doi:10.7717/peerj.10.
 
Ernst, C. W., Steibel, J. P., Sollero, B. P., Strasburg, G. M., Guimarães, J. D., and Raney, N. E. 
(2011). Transcriptional profiling during pig fetal skeletal muscle development using direct 
high
-
throughput sequencing and 
crossplatform comparison with gene expression 
microarrays. in 
Annual Meeting American Dairy Science Association and American Society 
of Animal Science.
 
(New Orleans, Louisiana: J. Anim. Sci).
 
Fang, Z., and Cui, X. (2011). Design and validation issues in RN
A
-
seq experiments. 
Brief. 
Bioinform.
 
12, 280

287. doi:10.1093/bib/bbr004.
 
107
 
 
Frazee, A., Langmead, B., and Leek, J. (2011). ReCount: A multi
-
experiment resource of 
analysis
-
ready RNA
-
seq gene count datasets. 
BMC Bioinformatics
 
12, 449.
 

Design and Inference in High
-

Plant Systems Biology SE 
-
 
9
 
Methods in Molecular Biology
TM
., ed. D. A. Belostotsky (Humana Press), 181

206 LA 

 
English. doi:10.1007/978
-
1
-
60327
-
563
-
7_9.
 
Gadbury, G. L., Xiang, Q., Yang, L., Barnes, S., Page, G. P., and Allison, D. B. (2008). 
Evaluating statistical methods using plasmode data sets in the age of massive public 
databases: an illustration usi
ng false discovery rates. 
PLoS Genet.
 
4, e1000098. 
doi:10.1371/journal.pgen.1000098.
 
Gentleman, R., Carey, V., Bates, D., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., 
Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software developm
ent for 
computational biology and bioinformatics. 
Genome Biol.
 
5, R80.
 
Gomez
-
Cabrero, D., Abugessaisa, I., Maier, D., Teschendorff, A., Merkenschlager, M., Gisel, 
A., Ballestar, E., Bongcam
-
Rudloff, E., Conesa, A., and Tegnér, J. (2014). Data integration 
i
n the era of omics: current and future challenges. 
BMC Syst. Biol.
 
8, I1. doi:10.1186/1752
-
0509
-
8
-
S2
-
I1.
 
Griffith, M., Griffith, O. L., Mwenifumbo, J., Goya, R., Morrissy, a S., Morin, R. D., Corbett, R., 
Tang, M. J., Hou, Y.
-
C., Pugh, T. J., et al. (2010)
. Alternative expression analysis by RNA 
sequencing. 
Nat. Methods
 
7, 843

7. doi:10.1038/nmeth.1503.
 
Gu, X. (2015). Statistical Detection of Differentially Expressed Genes based on RNA
-
seq: from 
Biological to Phylogenetic Replicates. 
Brief. Bioinform.
, 
bbv035. doi:10.1093/bib/bbv035.
 
Gu, X., Zou, Y., Huang, W., Shen, L., Arendsee, Z., and Su, Z. (2013). Phylogenomic distance 
method for analyzing transcriptome evolution based on RNA
-
seq data. 
Genome Biol. Evol.
 
5, 1746

1753. doi:10.1093/gbe/evt121.
 
Gualdr
ón Duarte, J. L., Bates, R. O., Ernst, C. W., Raney, N. E., Cantet, R. J. C., and Steibel, J. 
P. (2013). Genotype imputation accuracy in a F2 pig population using high density and low 
density SNP panels. 
BMC Genet.
 
14, 38. doi:10.1186/1471
-
2156
-
14
-
38.
 
Habi
er, D., Fernando, R. L., and Dekkers, J. C. M. (2007). The impact of genetic relationship 
information on genome
-
assisted breeding values. 
Genetics
 
177, 2389

2397. 
doi:10.1534/genetics.107.081190.
 
Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2001). On
 
Clustering Validation Techniques. 
J. 
Intell. Inf. Syst.
 
17, 107

145.
 
Handl, J., Knowles, J., and Kell, D. B. (2005). Computational cluster validation in post
-
genomic 
data analysis. 
Bioinformatics
 
21, 3201

12. doi:10.1093/bioinformatics/bti517.
 
108
 
 
Hardcastle,
 
T. J., and Kelly, K. A. (2010). baySeq: empirical Bayesian methods for identifying 
differential expression in sequence count data. 
BMC Bioinformatics
 
11, 422. 
doi:10.1186/1471
-
2105
-
11
-
422.
 
Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). 
The Elemen
ts of Statistical Learning. data 
Mining, Inference, and Prediction
. Second Edi. New York, New York, USA: Springer.
 
Helyar, S. J., Hemmer
-
Hansen, J., Bekkevold, D., Taylor, M. I., Ogden, R., Limborg, M. T., 
Cariani, a., Maes, G. E., Diopere, E., Carvalho, G
. R., et al. (2011). Application of SNPs for 
population genetics of nonmodel organisms: New opportunities and challenges. 
Mol. Ecol. 
Resour.
 
11, 123

136. doi:10.1111/j.1755
-
0998.2010.02943.x.
 
Hu, M., Zhu, Y., Taylor, J. M. G., Liu, J. S., and Qin, Z. S. (2
011). Using Poisson mixed
-
effects 
model to quantify transcript
-
level gene expression in RNA
-
Seq. 
Bioinformatics
 
28, 63

68. 
doi:10.1093/bioinformatics/btr616.
 
Van Iterson, M., Boer, J. M., and Menezes, R. X. (2010). Filtering, FDR and power. 
BMC 
Bioinformat
ics
 
11, 450. doi:10.1186/1471
-
2105
-
11
-
450.
 
Izenman, A. J. (2008). 
Modern Multivariate Statistical Techniques. Regression, Classification, 
and Manifold Learning
. New York, New York, USA: Springer.
 
Jiang, D., Tang, C., and Zhang, A. (2004). Cluster analysis for gene expression data: a survey. 
IEEE Trans. Knowl. Data Eng.
 
16, 1370

1386. doi:10.1109/TKDE.2004.68.
 
Johnson, R. A., and Wichern, D. W. (2002). 
Applied multivariate statistical analysis
. 5th 
ed. 
Upper Saddle River, N.J.: Prentice Hall.
 
Joost, S., Bonin, a., Bruford, M. W., Després, L., Conord, C., Erhardt, G., and Taberlet, P. 
(2007). A spatial analysis method (SAM) to detect candidate loci for selection: Towards a 
landscape genomics approach 
to adaptation. 
Mol. Ecol.
 
16, 3955

3969. 
doi:10.1111/j.1365
-
294X.2007.03442.x.
 
Kendall, M. G., and Gobbons, J. D. (1990). 
Rank Correlation Methods
. 5th ed. USA: Oxford 
University Press.
 
Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., Mclellan, M. D., L
in, L., Miller, C. a, Mardis, 


copy number alteration discovery in cancer by exome se
quencing. 568

576. 
doi:10.1101/gr.129684.111.
 
Kumar, P., Al
-
shafai, M., Ahmed, W., Muftah, A., Chalhoub, N., Elsaid, M. F., Aleem, A. A., and 
Suhre, K. (2014). Evaluation of SNP calling using single and multiple
-
sample calling 
algorithms by validation agai
nst array base genotyping and Mendelian inheritance. 7, 1

13. 
doi:10.1186/1756
-
0500
-
7
-
747.
 
Lamichhaney, S., Martinez Barrio, A., Rafati, N., Sundström, G., Rubin, C.
-
J., Gilbert, E. R., 
Berglund, J., Wetterbom, A., Laikre, L., Webster, M. T., et al. (2012)
. Population
-
scale 
109
 
 
sequencing reveals genetic differentiation due to local adaptation in Atlantic herring. 
Proc. 
Natl. Acad. Sci. U. S. A.
 
109, 19345

50. doi:10.1073/pnas.1216128109.
 
Langmead, B., Hansen, K. D., and Leek, J. T. (2010). Cloud
-
scale RNA
-
sequ
encing differential 
expression analysis with Myrna. 
Genome Biol.
 
11, R83. doi:10.1186/gb
-
2010
-
11
-
8
-
r83.
 
Langmead, B., and Salzberg, S. L. (2012). Fast gapped
-
read alignment with Bowtie 2. 
Nat. 
Methods
 
9, 357

360. doi:10.1038/nmeth.1923.
 
Law, C. W., Chen, Y
., Shi, W., and Smyth, G. K. (2014). Voom: precision weights unlock linear 
model analysis tools for RNA
-
seq read counts. 
Genome Biol.
 
15, R29. doi:10.1186/gb
-
2014
-
15
-
2
-
r29.
 
Lee, J.
-
H., Ang, J. K., and Xiao, X. (2013). Analysis and design of RNA sequencing 
experiments 
for identifying RNA editing and other single
-
nucleotide variants. 
RNA
, rna.037903.112

. 
doi:10.1261/rna.037903.112.
 
Leek, J. T., and Storey, J. D. (2011). The Joint Null Criterion for Multiple Hypothesis Tests. 
Stat. 
Appl. Genet. Mol. Biol.
 
10.
 
doi:10.2202/1544
-
6115.1673.
 
Legendre, P., and Legendre, L. (2012). 
Numerical ecology
. Third. Amsterdam: Elsevier.
 
Lemay, M. A., Donnelly, D. J., and Russello, M. A. (2013). Transcriptome
-
wide comparison of 
sequence variation in divergent ecotypes of kokan
ee salmon. 
BMC Genomics
. 
doi:10.1186/1471
-
2164
-
14
-
308.
 
Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping 
and population genetical parameter estimation from sequencing data. 
Bioinformatics
 
27, 
2987

2993. doi:10.
1093/bioinformatics/btr509.
 
Li, H. (2014). Toward better understanding of artifacts in variant calling from high
-
coverage 
samples. 
Bioinformatics
 
30, 2843

2851. doi:10.1093/bioinformatics/btu356.
 
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., 
Homer, N., Marth, G., Abecasis, G., 
and Durbin, R. (2009a). The Sequence Alignment / Map ( SAM ) Format and SAMtools 
1000 Genome Project Data Processing Subgroup. 1

2.
 
Li, J. B., Levanon, E. Y., Yoon, J.
-
K., Aach, J., Xie, B., LePoust, E., Zhang, K., Gao, 
Y., and 
Church, G. M. (2009b). Genome
-
Wide Identification of Human RNA editing sites by parallel 
DNA capturing and sequencing. 
Science (80
-
. ).
 
324, 1210:1213.
 
Li, J., Witten, D. M., Johnstone, I. M., and Tibshirani, R. (2012). Normalization, testing, and 
false 
discovery rate estimation for RNA
-
sequencing data. 
Biostatistics
 
13, 523

38. 
doi:10.1093/biostatistics/kxr031.
 

-
Y., Wang, M., 
Wang, C., et al. (2014). Detecting and corre
cting systematic variation in large
-
scale RNA 
sequencing data. 
Nat. Biotechnol.
 
32. doi:10.1038/nbt.3000.
 
110
 
 
Li, Y., Chen, W., Liu, E. Y., and Zhou, Y. H. (2013). Single Nucleotide Polymorphism (SNP) 
Detection and Genotype Calling from Massively Parallel Sequ
encing (MPS) Data. 
Stat. 
Biosci.
 
5, 3

25. doi:10.1007/s12561
-
012
-
9067
-
4.
 

-

Statistical Analysis of 
Next Generation Sequencing Data SE
 
 
-
 
10
 
Frontiers in Probability and the Statistical 
Sciences., eds. S. Datta and D. Nettleton (Springer International Publishing), 191

217. 
doi:10.1007/978
-
3
-
319
-
07212
-
8_10.
 
Love, M. I., Huber, W., and Anders, S. (2014a). Moderated estimation of fold change and 
dispersion for RNA
-
Seq data with DESeq2. 
Genome Biol.
 
15, 550. doi:10.1186/s13059
-
014
-
0550
-
8.
 
Love, M. I., Huber, W., and Anders, S. (2014b). Moderated estimation of fold change and 
dispersion for RNA
-
Seq data with DESeq2. 
bioRxiv
. doi:10.1101/002832.
 
Ma, 
C., and Wang, X. (2012). Application of the Gini correlation coefficient to infer regulatory 
relationships in transcriptome analysis. 
Plant Physiol.
 
160, 192

203. 
doi:10.1104/pp.112.201962.
 
Marguerat, S., Bähler, J., and Bahler, J. (2010). RNA
-
seq: from te
chnology to biology. 
Cell. Mol. 
Life Sci.
 
67, 569

579. doi:10.1007/s00018
-
009
-
0180
-
6.
 
Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y. (2008). RNA
-
seq: an 
assessment of technical reproducibility and comparison with gene expression arr
ays. 
Genome Res.
 
18, 1509

1517.
 
McCarthy, D. J., Chen, Y., and Smyth, G. K. (2012). Differential expression analysis of 
multifactor RNA
-
Seq experiments with respect to biological variation. 
Nucleic Acids Res.
, 
gks042

. doi:10.1093/nar/gks042.
 
Mehta, T. S.,
 
Zakharkin, S. O., Gadbury, G. L., and Allison, D. B. (2006). Epistemological issues 
in omics and high
-
dimensional biology: give the people what they want. 
Physiol. Genomics
 
28, 24

32. doi:10.1152/physiolgenomics.00095.2006.
 
Mehta, T., Tanik, M., and Allis
on, D. B. (2004). Towards sound epistemological foundations of 
statistical methods for high
-
dimensional biology. 
Nat. Genet.
 
36, 943

947. 
doi:10.1038/ng1422.
 
Metzker, M. L. (2010). Sequencing technologies 
-
 
the next generation. 
Nat. Rev. Genet.
 
11, 31

46. 
doi:10.1038/nrg2626.
 
Milone, D. H., Stegmayer, G., Lopez, M., Kamenetzky, L., and Carrari, F. (2014). Improving 
clustering with metabolic pathway data. 
BMC Bioinformatics
 
15, 101. doi:10.1186/1471
-
2105
-
15
-
101.
 
Mortazavi, A., Williams, B. A., McCue, K., Sch
aeffer, L., and Wold, B. (2008). Mapping and 
quantifying mammalian transcriptomes by RNA
-
Seq. 
Nat. Methods
 
5, 621

628. 
doi:http://www.nature.com/nmeth/journal/v5/n7/suppinfo/nmeth.1226_S1.html.
 
111
 
 
Narum, S. R., Buerkle, C. A., Davey, J. W., Miller, M. R., and
 
Hohenlohe, P. a. (2013). 
Genotyping
-
by
-
sequencing in ecological and conservation genomics. 
Mol. Ecol.
, n/a

n/a. 
doi:10.1111/mec.12350.
 
Nielsen, R., Paul, J. S., Albrechtsen, A., and Song, Y. S. (2011). Genotype and SNP calling 
from next
-
generation sequenc
ing data. 
Nat. Rev. Genet.
 
12, 443

51. doi:10.1038/nrg2986.
 
Oshlack, A., Robinson, M., and Young, M. (2010). From RNA
-
seq reads to differential 
expression results. 
Genome Biol.
 
11, 220.
 
Ozsolak, F., and Milos, P. M. (2011). RNA sequencing: advances, challe
nges and opportunities. 
Nat. Rev. Genet.
 
12, 87

98. doi:10.1038/nrg2934.
 
Pachter, L. (2011). Models for transcript quantification from RNA
-
Seq. 
ArXiv
 
1104.3889, 1

28.
 
Pavey, S. a (2015). High
-
throughput SNPs for all: genotyping
-
in
-
thousands. 
Mol. Ecol. Res
our.
 
15, 685

687. doi:10.1111/1755
-
0998.12405.
 
Peng, Z., Cheng, Y., Tan, B. C.
-
M., Kang, L., Tian, Z., Zhu, Y., Zhang, W., Liang, Y., Hu, X., 
Tan, X., et al. (2012). Comprehensive analysis of RNA
-
Seq data reveals extensive RNA 
editing in a human transcript
ome. 
Nat. Biotechnol.
 
30, 253

260. doi:10.1038/nbt.2122.
 
Pickrell, J. K., Marioni, J. C., Pai, A. A., Degner, J. F., Engelhardt, B. E., Nkadori, E., Veyrieras, 
J.
-
B., Stephens, M., Gilad, Y., and Pritchard, J. K. (2010). Understanding mechanisms 
underlying
 
human gene expression variation with RNA sequencing. 
Nature
 
464, 768

772. 
doi:http://www.nature.com/nature/journal/v464/n7289/suppinfo/nature08872_S1.html.
 
Pirinen, M., Lappalainen, T., Zaitlen, N. A., Dermitzakis, E. T., Donnelly, P., McCarthy, M. I., and 
Rivas, M. A. (2015). Assessing allele
-
specific expression across multiple tissues from 
RNA
-
seq read data. 
Bioinformatics
, btv074

. doi:10.1093/bioinformati
cs/btv074.
 
Piskol, R., Ramaswami, G., and Li, J. B. (2013). Reliable Identification of Genomic Variants 
from RNA
-
Seq Data. 
Am. J. Hum. Genet.
 
93, 641

51. doi:10.1016/j.ajhg.2013.08.008.
 
Qian, X., Ba, Y., Zhuang, Q., and Zhong, G. (2014). RNA
-
Seq technology
 
and its application in 
fish transcriptomics. 
OMICS
 
18, 98

110. doi:10.1089/omi.2013.0110.
 
Quinn, A., Juneja, P., and Jiggins, F. M. (2014). Estimates of allele
-
specific expression in 
Drosophila with a single genome sequence and RNA
-
seq data. 
Bioinformatics
 
30, 2603

10. doi:10.1093/bioinformatics/btu342.
 
Quinn, E. M., Cormican, P., Kenny, E. M., Hill, M., Anney, R., Gill, M., Corvin, A. P., and Morris, 
D. W. (2013). Development of Strategies for SNP Detection in RNA
-
Seq Data: Application 
to Lym
phoblastoid Cell Lines and Evaluation Using 1000 Genomes Data. 
PLoS One
 
8. 
doi:10.1371/journal.pone.0058815.
 
R Development Core Team (2014). R: A language and environment for statistical computing.
 
112
 
 
Ramaswami, G., Lin, W., Piskol, R., Tan, M. H., Davis, C.,
 
and Li, J. B. (2012). Accurate 
identification of human Alu and non
-
Alu RNA editing sites. 
Nat. Methods
 
9, 579

581. 
doi:10.1038/nmeth.1982.
 

(2013). Identifying RN
A editing sites using RNA sequencing data alone. 
Nat. Methods
 
10, 
128

32. doi:10.1038/nmeth.2330.
 
Rapaport, F., Khanin, R., Liang, Y., Krek, A., Zumbo, P., Mason, C. E., Socci, N. D., and Betel, 
D. (2013). Comprehensive evaluation of differential expressio
n analysis methods for RNA
-
seq data. 1

21.
 
Rau, A., Gallopin, M., Celeux, G., and Jaffrézic, F. (2013). Data
-
based filtering for replicated 
high
-
throughput transcriptome sequencing experiments. 
Bioinformatics
 
29, 2146

2152. 
doi:10.1093/bioinformatics/btt35
0.
 
Rau, A., Maugis
-
Rabusseau, C., and Celeux, G. (2015). Co
-
expression analysis of high
-
throughput transcriptome sequencing data with Poisson mixture models. 
Bioinformatics
 
31, 
1420

1427. doi:10.1093/bioinformatics/btu845.
 
Reeb, P. D., and Steibel, J. P. (
2013). Evaluating statistical analysis models for RNA 
sequencing experiments. 
Front. Genet.
 
4, 1

9. doi:10.3389/fgene.2013.00178.
 
Reshetova, P., Smilde, A. K., van Kampen, A. H., and Westerhuis, J. a (2014). Use of prior 
knowledge for the analysis of high
-
throughput transcriptomics and metabolomics data. 
BMC Syst. Biol.
 
8 Suppl 2, S2. doi:10.1186/1752
-
0509
-
8
-
S2
-
S2.
 
Reverter, F., Vegas, E., and Oller, J. M. (2014). Kernel
-
PCA data integration with enhanced 
interpretability. 
BMC Syst. Biol.
 
8 Suppl 2, S6. doi
:10.1186/1752
-
0509
-
8
-
S2
-
S6.
 
Roberts, A., Pimentel, H., Trapnell, C., and Pachter, L. (2011). Identification of novel transcripts 
in annotated genomes using RNA
-
Seq. 
Bioinformatics
 
27, 2325

9. 
doi:10.1093/bioinformatics/btr355.
 
Robinson, M. D., McCarthy, D.
 
J., and Smyth, G. K. (2010). edgeR: a Bioconductor package for 
differential expression analysis of digital gene expression data. 
Bioinformatics
 
26, 139

140. doi:10.1093/bioinformatics/btp616.
 
Robinson, M. D., and Oshlack, A. (2010). A scaling normalizatio
n method for differential 
expression analysis of RNA
-
seq data. 
Genome Biol.
 
11, R25. doi:10.1186/gb
-
2010
-
11
-
3
-
r25.
 
Robinson, M. D., and Smyth, G. K. (2008). Small
-
sample estimation of negative binomial 
dispersion, with applications to SAGE data. 
Biostatist
ics
 
9, 321

332. 
doi:10.1093/biostatistics/kxm030.
 
Robles, J. a, Qureshi, S. E., Stephen, S. J., Wilson, S. R., Burden, C. J., and Taylor, J. M. 
(2012). Efficient experimental design and analysis strategies for the detection of differential 
113
 
 
expression using
 
RNA
-
Sequencing. 
BMC Genomics
 
13, 484. doi:10.1186/1471
-
2164
-
13
-
484.
 
Rosa, G. J. M., Steibel, J. P., and Tempelman, R. J. (2005). Reassessing design and analysis 
of two
-
colour microarray experiments using mixed effects models. 
Comp. Funct. Genomics
 
6, 123

31. doi:10.1002/cfg.464.
 
Salem, M., Vallejo, R. L., Leeds, T. D., Palti, Y., Liu, S., Sabbagh, A., Rexroad, C. E., and Yao, 
J. (2012). RNA
-
seq identifies SNP markers for growth traits in rainbow trout. 
PLoS One
 
7. 
doi:10.1371/journal.pone.0036264.
 
Schoville, S. D., Bonin, A., François, O., Lobreaux, S., Melodelima, C., and Manel, S. (2012). 
Adaptive Genetic Variation on the Landscape: Methods and Cases. 
Annu. Rev. Ecol. Evol. 
Syst.
 
43, 23

43. doi:10.1146/annurev
-
ecolsys
-
110411
-
160248.
 
Schunter, C., 
Garza, J. C., Macpherson, E., and Pascual, M. (2013). SNP development from 
RNA
-
seq data in a nonmodel fish: how many individuals are needed for accurate allele 
frequency prediction? 
Mol. Ecol. Resour.
, n/a

n/a. doi:10.1111/1755
-
0998.12155.
 
Seeb, J. E., Car
valho, G., Hauser, L., Naish, K., Roberts, S., and Seeb, L. W. (2011). Single
-
nucleotide polymorphism (SNP) discovery and applications of SNP genotyping in 
nonmodel organisms. 
Mol. Ecol. Resour.
 
11, 1

8. doi:10.1111/j.1755
-
0998.2010.02979.x.
 
Severin, A. J.
, Woody, J. L., Bolon, Y.
-
T., Joseph, B., Diers, B. W., Farmer, A. D., Muehlbauer, 
G. J., Nelson, R. T., Grant, D., Specht, J. E., et al. (2010). RNA
-
Seq Atlas of Glycine max: 
a guide to the soybean transcriptome. 
BMC Plant Biol.
 
10, 160. doi:10.1186/1471
-
2229
-
10
-
160.
 
Si, Y., Liu, P., Li, P., and Brutnell, T. P. (2014). Model
-
Based Clustering for RNA
-
Seq Data. 
Bioinformatics
 
30, 197

205. doi:10.1093/bioinformatics/btt632.
 
Simmons, S., Peng, J., Bienkowska, J., and Berger, B. (2015). Discovering What Dimensi
onality 
Reduction Really Tells Us About RNA
-
Seq Data. 
J. Comput. Biol.
 
22, 150622063210002. 
doi:10.1089/cmb.2015.0085.
 
Sloutsky, R., Jimenez, N., Swamidass, S. J., and Naegle, K. M. (2013). Accounting for noise 
when clustering biological data. 
Brief. Bioin
form.
 
14, 423

36. doi:10.1093/bib/bbs057.
 
Smeds, L., and Künstner, A. (2011). ConDeTri
--
a content dependent read trimmer for Illumina 
data. 
PLoS One
 
6, e26314. doi:10.1371/journal.pone.0026314.
 
Smith, S., Bernatchez, L., and Beheregaray, L. B. (2013). RNA
-
seq analysis reveals extensive 
transcriptional plasticity to temperature stress in a freshwater fish species. 
BMC Genomics
 
14, 375. doi:10.1186/1471
-
2164
-
14
-
375.
 

Bioinformatics
 
Statistics
 
for Biology and Health., eds. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber 
(New York: Springer), 397

420. doi:10.1007/0
-
387
-
29362
-
0_23.
 
114
 
 
Smyth, G. K. (2004). Linear models and empirical bayes methods for assessing differential 
expression in microarray experiments. 
Stat. Appl. Genet. Mol. Biol.
 
3, Article3. 
doi:10.2202/1544
-
6115.1027.
 
Smyth, G. K., Ritchie, M., and Thorne, N. (2012). limma: 
Linear Models for Microarray Data 

-
Seq Data Analysis ). Melbourne, Australia: 
Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research.
 
Sokal, R. R., and Rohlf, F. J. (1962). The comparison of dendr
ograms by objective methods. 
Taxon
 
11, 33

40.
 
Sollero, B. P., Guimarães, S. E. F., Rilington, V. D., Tempelman, R. J., Raney, N. E., Steibel, J. 
P., Guimarães, J. D., Lopes, P. S., Lopes, M. S., and Ernst, C. W. (2011). Transcriptional 
profiling during foe
tal skeletal muscle development of Piau and Yorkshire

Landrace cross
-
bred pigs. 
Anim. Genet.
 
42, 600

612. doi:10.1111/j.1365
-
2052.2011.02186.x.
 
Srivastava, S., and Chen, L. (2010). A two
-
parameter generalized Poisson model to improve the 
analysis of RNA
-
se
q data. 
Nucleic Acids Res.
 
38, e170. doi:10.1093/nar/gkq670.
 
Steibel, J. P., Bates, R. O., Rosa, G. J. M., Tempelman, R. J., Rilington, V. D., Ragavendran, 
A., Raney, N. E., Ramos, A. M., Cardoso, F. F., Edwards, D. B., et al. (2011). Genome
-
wide linkage a
nalysis of global gene expression in loin muscle tissue identifies candidate 
genes in pigs. 
PLoS One
 
6, e16766. doi:10.1371/journal.pone.0016766.
 
Steibel, J. P., Poletto, R., Coussens, P. M., and Rosa, G. J. M. (2009). A powerful and flexible 
linear mixed 
model framework for the analysis of relative quantification RT
-
PCR data. 
Genomics
 
94, 146

52. doi:10.1016/j.ygeno.2009.04.008.
 
Steibel, J. P., Reeb, P. D., Ernst, C. W., and Bates, R. O. (2014). Mapping cis and trans
-
acting 
eQTL in swine populations. in 
10
th WCGALP
 
(Vancouver, Canada).
 
Steibel, J. P., Wang, H., and Zhong, P.
-
S. (2015). A hidden Markov approach for ascertaining 
cSNP genotypes from RNA sequence data in the presence of allelic imbalance by 
exploiting linkage disequilibrium. 
BMC Bioinformatics
 
16, 1

12. doi:10.1186/s12859
-
015
-
0479
-
2.
 
Storey, J. D. (2002). A direct approach to false discovery rates. 
J. R. Stat. Soc. Ser. B 
(Statistical Methodol.
 
64, 479

498. doi:10.1111/1467
-
9868.00346.
 
Storey, J. D., and Tibshirani, R. (2003). Statistical methods for identifying differentially 
expressed genes in DNA microarrays. 
Methods Mol. Biol.
 
224, 149

57. doi:10.1385/1
-
59259
-
364
-
X:149.
 
Tomescu, O. a, Mattanovich, D., and Thallinger, G. G. (2014). In
tegrative omics analysis. A 
study based on Plasmodium falciparum mRNA and protein data. 
BMC Syst. Biol.
 
8 Suppl 
2, S4. doi:10.1186/1752
-
0509
-
8
-
S2
-
S4.
 
Trapnell, C., Pachter, L., and Salzberg, S. L. (2009). TopHat: discovering splice junctions with 
RNA
-
Seq. 
Bioinformatics
 
25, 1105

1111.
 
115
 
 
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S. 
L., Rinn, J. L., and Pachter, L. (2012). Differential gene and transcript expression analysis 
of RNA
-
seq experiments with TopH
at and Cufflinks. 
Nat. Protoc.
 
7, 562

578. 
doi:10.1038/nprot.2012.016.
 
Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. 
L., Wold, B. J., and Pachter, L. (2010). Transcript assembly and quantification by RNA
-
Seq 
reveals unannotated transcripts and isoform switching during cell differentiation. 
Nat. 
Biotechnol.
 
28, 511

5. doi:10.1038/nbt.1621.
 
VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. 
J. Dairy Sci.
 
91, 
4414

4423. doi:10.3168/jds
.2007
-
0980.
 
Vaughan, L. K., Divers, J., Padilla, M., Redden, D. T., Tiwari, H. K., Pomp, D., and Allison, D. B. 
(2009). The use of plasmodes as a supplement to simulations: A simple example 
evaluating individual admixture estimation methodologies. 
Comput. 
Stat. Data Anal.
 
53, 
1755

1766. doi:10.1016/j.csda.2008.02.032.
 
Vijay, N., Poelstra, J. W., Künstner, A., and Wolf, J. B. W. (2013). Challenges and strategies in 
transcriptome assembly and differential gene expression quantification. A comprehensive 
in sil
ico assessment of RNA
-
seq experiments. 
Mol. Ecol.
 
22, 620

34. 
doi:10.1111/mec.12014.
 
Waller, N. G., Underhill, J. M., and Heather, A. (2010). Multivariate Behavioral A Method for 
Generating Simulated Plasmodes and Artificial Test Clusters with User
-
Defined
 
Shape , 
Size , and. 37

41.
 
Wang, R., Sun, L., Bao, L., Zhang, J., Jiang, Y., Yao, J., Song, L., Feng, J., Liu, S., and Liu, Z. 
(2013). Bulk segregant RNA
-
seq reveals expression and positional candidate genes and 
allele
-
specific expression for disease resi
stance against enteric septicemia of catfish. 
BMC 
Genomics
 
14, 929. doi:10.1186/1471
-
2164
-
14
-
929.
 
Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA
-
Seq: a revolutionary tool for 
transcriptomics. 
Nat Rev Genet
 
10, 57

63. doi:10.1038/nrg2484.
 
Weinman, L. R.
, Solomon, J. W., and Rubenstein, D. R. (2014). A comparison of single 
nucleotide polymorphism and microsatellite markers for analysis of parentage and kinship 
in a cooperatively breeding bird. 
Mol. Ecol. Resour.
, n/a

n/a. doi:10.1111/1755
-
0998.12330.
 
Wick
ramasinghe, S., Cánovas, A., Rincón, G., and Medrano, J. F. (2014). RNA
-
Sequencing: A 
tool to explore new frontiers in animal genetics. 
Livest. Sci.
 
166, 206

216. 
doi:10.1016/j.livsci.2014.06.015.
 
Van De Wiel, M. a, Leday, G. G. R., Pardo, L., Rue, H., Van
 
Der Vaart, A. W., and Van 
Wieringen, W. N. (2013). Bayesian analysis of RNA sequencing data by estimating multiple 
shrinkage priors. 
Biostatistics
, 113

128. doi:10.1093/biostatistics/kxs031.
 
De Wit, P., Pespeni, M. H., Ladner, J. T., Barshis, D. J., Senec
a, F., Jaris, H., Therkildsen, N. 

116
 
 
genomics via RNA
-
Seq: an introduction to high
-
throughput sequencing data analysis. 
Mol. 
Ecol. Resour.
 
12, 1058

67. doi:10.1111/1755
-
0998.12
003.
 
Witten, D. M. (2011). Classification and clustering of sequencing data using a Poisson model. 
Ann. Appl. Stat.
 
5, 2493

2518. doi:10.1214/11
-
AOAS493.
 

Data Clustering
 
Chapman & 
Hall/CRC 
Data Mining and Knowledge Discovery Series. (Chapman and Hall/CRC). 
doi:doi:10.1201/b15410
-
24.
 
Yang, C., and Wei, H. (2015). Designing Microarray and RNA
-
Seq Experiments for Greater 
Systems Biology Discovery in Modern Plant Genomics. 
Mol. Plant
 
8, 196

206.
 
doi:10.1016/j.molp.2014.11.012.
 
Yang, H., and Churchill, G. (2007). Estimating p
-
values in small microarray experiments. 
Bioinformatics
 
23, 38

43. doi:10.1093/bioinformatics/btl548.
 
Yu, Y., Wei, J., Zhang, X., Liu, J., Liu, C., Li, F., and Xiang, J. (2014). SNP discovery in the 
transcriptome of white Pacific shrimp Litopenaeus vannamei by next generation 
sequencing. 
PLoS One
 
9, e87218. doi:10.1371/journal.pone.0087218.
 
Zhou, X., Lindsay, H., and Robinson, M. D. (2014). Robustly detecting differential expression in 
RNA sequencing data using observation weights. 
Nucleic Acids Res.
, 1

10. 
doi:10.1093/nar/gku310.
 
Zhou, Y.
-
H., Xia, K., and Wright, F. a (2011). A powerful and 
flexible approach to the analysis of 
RNA sequence count data. 
Bioinformatics
 
27, 2672

8. doi:10.1093/bioinformatics/btr449.