Implementing validation procedures to study the properties of widely used statistical analysis methods of RNA sequencing experiments
RNA sequencing (RNA-seq) technology is being rapidly adopted as the platform of choice for transcriptomic studies. Although its major focus has been gene expression profiling, other interests, such as single nucleotide profiling, are emerging as the technology evolves. In addition, applications are rapidly expanding in model and nonmodel organisms. The overall objective of this dissertation was to propose and implement validation procedures based on experimental data to estimate the properties of widely used statistical analysis methods of RNA-seq experiments. The first study evaluated differential expression methods based on count data distributions and Gaussian transformed models. Parametric simulations and plasmode datasets derived from RNA-seq experiments were generated to compare the statistical models in terms of type I error rate, power and null p-value distribution. Overall, Gaussian models presented p-values closer to nominal significance levels and a p-value distribution closer to the expected uniform distribution. Researchers using models with these properties will have fewer false positives when inferring differentially expressed transcripts. Additionally, the use of Gaussian transformations enables the application of the well-known theory of linear models, for instance to account for complex experimental designs.
The second study assessed the properties of dissimilarity measures for agglomerative hierarchical cluster analysis. The validation comprised dissimilarity measures based on Euclidean distance, correlation-based dissimilarities and count data-based dissimilarities. I used plasmode datasets generated from two RNA-seq experiments with different sample structures and simulated scenarios based on informative and noninformative transcripts. In addition, I proposed two measures, agreement and consistency, for comparing dendrograms. Dissimilarity measures based on non-transformed data resulted in dendrograms that did not resemble the expected sample structure, whereas dissimilarities calculated with appropriate transformations for count data were consistent in reproducing the expected dendrograms under different scenarios.
The third study compared variant calling programs using reference genotypes obtained from a SNPchip. The evaluation included multiple-sample and multiple-tissue datasets and considered the effect of per-base read depth. Sensitivity and false discovery rates were computed separately for heterozygous and homozygous sites in order to provide information for potentially different applications such as allele-specific expression or RNA editing. Additionally, I explored the use of SNPs called from RNA-seq to compute relationship matrices in population studies. Heterozygous sites with more than 10 reads per base and per sample were called with high sensitivity and low false discovery rates. Homozygous sites were called with higher sensitivity than heterozygous sites irrespective of depth but presented higher false discovery rates. A relationship matrix based on accurate genotypes obtained with RNA-seq presented a high correlation with a relationship matrix based on genotypes from a SNPchip.
In conclusion, using synthetic and reference datasets, I compared statistical models to perform differential expression analysis, sample-based hierarchical cluster analysis, and variant calling and genotyping. This validation framework can be extended to evaluate other methods of RNA-seq analysis, as well as new and updated analysis methods as they are periodically published. Choosing the most appropriate software can help researchers obtain better results and achieve the goals of their investigations.
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Authors: Reeb, Pablo Daniel
- Thesis Advisors: Steibel, Juan P.
- Committee Members: Brown, Charles T.; Bence, James R.; Li, Weiming; Tempelman, Robert J.
- Date Published: 2015
- Program of Study: Fisheries and Wildlife - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: xi, 116 pages
- ISBN: 9781321977530; 1321977530
- Permalink: https://doi.org/doi:10.25335/t1wf-xd84