Search results
(1 - 20 of 21)
- Title
- Application of human liver stem cells for receptor-mediated toxicogenomic study
- Creator
- Kim, Suntae
- Date
- 2011
- Collection
- Electronic Theses & Dissertations
- Description
-
Recent US and EU legislation and high drug development failure rates have prompted the need to develop reliable human in vitro models for toxicity testing to replace, reduce and refine animal testing. This also stems from the response discrepancy among species and the need for novel approaches for mechanism-based high-throughput screening. In vitro models have been widely used, but the abnormality of continuous cell lines and the restricted acquisition, source inconsistency and instability of primary cells limit their use. Adult stem cells derived from intact human tissue provide an innovative alternative that may more accurately predict in vivo toxicity.
The objective of this research was to evaluate the HL1-1 human hepatic stem cell line as a viable model for receptor-mediated toxicogenomic studies using the aryl hydrocarbon receptor (AhR) and peroxisome proliferator-activated receptor α (PPARα) as prototypical ligand-activated transcription factors for comparative toxicogenomic investigation. AhR and PPARα are targets for environmental contaminants and pharmaceutical reagents, and complementary species-specific hepatic responses are currently being studied.
Comprehensive time course and dose-response gene expression studies were conducted in HL1-1 cells to assess the differential gene expression elicited by AhR and PPARα ligands in comparison to other in vitro and in vivo models. Some conserved responses with overlapping biological functions were identified. Although subsets of conserved differentially expressed genes are consistent with the known in vivo responses, the results suggest that species- and model-specific gene expression profiles are linked to species-specific physiology.
HL1-1 cells were also immortalized by hTERT stable transfection (HLhT1) to overcome cell senescence and life span limitations. Immortalized HLhT1 cells maintain pluripotency characteristics as defined by stem cell and oval cell marker protein expression. The expression of functional AhR and PPARα and their ligand responsiveness are also comparable to the parental cell line. Collectively, human liver stem cells are viable models and warrant further development for mechanistic investigation of receptor-mediated hepatic toxicity and high-throughput screening.
- Title
- Comparative and functional genomics approaches to understand environmental adaptation in woody perennial, Populus
- Creator
- Park, Sunchung
- Date
- 2004
- Collection
- Electronic Theses & Dissertations
- Title
- Defining the characteristics and roles of functional genomic sequences using computational approaches
- Creator
- Lloyd, John P.
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
"Advances in biotechnology have provided a wealth of sequencing data that is transforming our view of a genome. Eukaryotic genomes, initially thought to contain discrete genes in a sea of non-functional DNA, have been found to exhibit pervasive biochemical activity, particularly transcription. However, whether this biochemical activity is functional (i.e. under evolutionary selection) or the result of noisy activity of cellular machinery represents a fundamental debate of the post-genome era. The research described in this dissertation focuses on two open questions confronting genome biology: 1) Where are the functional elements within a genome? 2) What roles are functional elements performing?
For the first question, I focused on transcribed regions in unannotated, intergenic regions of genomes, which represent functionally ambiguous sequences. To determine which and how many intergenic transcribed regions (ITRs) represent functional sequences, machine learning-based function prediction models were established using Arabidopsis thaliana as a model. The prediction models were able to successfully distinguish between benchmark functional (phenotype genes) and non-functional sequences (pseudogenes) using evolutionary, biochemical, and sequence-based structural features. When applied to ITRs, ~40% of ITRs were predicted as functional, suggesting ITRs primarily represent transcriptional noise. I further investigated the evolutionary histories of ITRs in four grass (Poaceae) species. ITRs were found to be primarily species-specific and exhibit recent duplicates, with rare examples of ancient duplicate retention. In addition, ITR duplicates and orthologs were usually not expressed. Function prediction models were also generated in Oryza sativa (rice) that predicted ~60% of rice ITRs as non-functional. The results of function prediction models and evaluating evolutionary histories both suggest ITRs are primarily non-functional sequences. However, I also provide a list of potentially functional ITRs that should be considered high-priority targets for future experimental studies.
For the second question, I established a machine learning framework to predict mutant phenotypes, which provide potent evidence for the role of a gene. Phenotype predictions were focused on essential genes (those with lethal mutant phenotypes) in A. thaliana, as these genes represent a historically well-studied group. Combining 57 expression, duplication, evolutionary, and gene network characteristics through machine learning methods accurately distinguished between genes with lethal and non-lethal mutant phenotypes. Additionally, essential gene prediction models could be applied across species; essential gene prediction models generated in A. thaliana could identify essential genes in rice and Saccharomyces cerevisiae. Thus, machine learning represents a promising avenue for prioritizing candidate genes for large-scale phenotyping efforts.
Overall, the research described in this dissertation highlights computational approaches as highly effective in defining functional sequences and classifying the likely roles of genes."--Pages ii-iii.
- Title
- Design, management, and quality control of toxicogenomic experiments
- Creator
- Burgoon, Lyle David
- Date
- 2005
- Collection
- Electronic Theses & Dissertations
- Title
- Evaluation of reagents and methods for genome editing in potato (Solanum tuberosum L.)
- Creator
- Butler, Nathaniel Martin
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
-
Genome editing using sequence-specific nucleases (SSNs) is rapidly becoming a standard tool for genetic engineering in crop species. The implementation of zinc finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs) and CRISPR/Cas (clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated systems (Cas)) for inducing double-strand breaks enables targeting of virtually any sequence for genetic modification. Targeted mutagenesis via nonhomologous end-joining (NHEJ) and gene targeting via homologous recombination (HR) have been demonstrated in a number of plant species, but reports have been limited in vegetatively propagated crops, such as potato (Solanum tuberosum Group Tuberosum L.).
The aim of this dissertation was to develop reagents and methods for genome editing in potato. This was accomplished by demonstrating that TALEN and CRISPR/Cas reagents targeting the potato ACETOLACTATE SYNTHASE1 (ALS1) gene were successful in inducing targeted mutations in reporter and endogenous gene targets. Targeted mutations using CRISPR/Cas were capable of both clonal and germline transmission, making CRISPR/Cas the preferred reagent for this application. TALEN and CRISPR/Cas reagents were also used in combination with a geminivirus expression vector for gene targeting experiments to incorporate point mutations within the ALS1 locus. Transformed events modified by both TALEN and CRISPR/Cas reagents in the geminivirus expression vector carried gene targeting modifications that supported reduced herbicide susceptibility phenotypes. Gene targeting modification detection and reduced herbicide susceptibility phenotypes were enhanced by regenerating lines under high selection. The reagents and methods evaluated in this dissertation provide a framework for genome editing in potato and other vegetatively propagated crops and have important implications for basic research and agriculture.
- Title
- Flexible hierarchical Bayesian modeling extensions to improve whole genome prediction and genome wide association analyses
- Creator
- Chen, Chunyu (Graduate of Michigan State University)
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
"Whole genome prediction (WGP) has been widely implemented in animal and plant breeding for genomic selection of economically important traits, having already accelerated genetic progress for such traits in some species, especially dairy cattle. Genome wide association (GWA) analysis is used for screening genomic regions that may include important candidate genes segregating for the trait of interest and is being increasingly integrated with WGP analysis. Both WGP and GWA typically represent m ≫ n problems as defined by a large number of single nucleotide polymorphism (SNP) markers (m) and a comparatively much smaller number of individuals (n). Two broad types of parametric models are typically considered for these analyses: traditional best linear unbiased prediction approaches based on SNP marker effects being normally distributed, and Bayesian WGP models that allow more flexible specifications for SNP marker effects based on either heavy-tailed or variable selection specifications. Bayesian WGP models can achieve higher prediction accuracies than traditional approaches in many applications if properly tuned; however, their implementation can be computationally challenging. My dissertation aimed to address some of these emerging issues in Bayesian WGP models as well as to provide software tools for real data applications.
In Chapter 2, I developed an expectation maximization (EM) algorithm as a fast alternative to traditional Markov Chain Monte Carlo (MCMC) for Bayesian WGP models. I proposed EM implementations for two models, heavy-tailed BayesA and stochastic search and variable selection (SSVS), adapting the EM algorithm for maximum a posteriori (MAP) inference of SNP effects and adapting REML-like strategies to estimate key hyperparameters. Using a comprehensive simulation study and real data analysis, I found that these empirical Bayes approaches can be quite sensitive to starting values for SNP effects. However, using a deterministic annealing variant of EM, I obtained hyperparameter estimates and prediction accuracies comparable to their MCMC counterparts.
In Chapter 3, I further assessed the possibility of using two Bayesian WGP models, BayesA and SSVS, for GWA studies. I also included a popular GWA analysis (EMMAX) based on the linear mixed model. In addition to basing inferences on traditional single SNP tests and fixed genomic window tests, I assessed the merit of tests involving adaptively determined windows based on clustering the genome into blocks based on linkage disequilibrium. I found that SSVS and BayesA under MCMC and adaptive window tests led to the best receiver operating characteristic (ROC) properties.
In Chapter 4, I extended SSVS to single step SSVS to incorporate phenotypes of non-genotyped individuals and compared its performance with corresponding models ignoring these genotypes for both WGP and GWA. I found single step SSVS to be promising for WGP and GWA, particularly for genetic architectures characterized by a few genes with large effects.
In Chapter 5, I combined many of the developments in Chapters 2 through 4 and beyond in a unified framework as an open source R package, BATools, to implement several different Bayesian models for WGP and GWA."--Pages ii-iii.
- Title
- Genetic and genomics approaches to understanding the biosynthesis of specialized metabolites in trichomes of the cultivated tomato and its wild relatives
- Creator
- Kim, Jeongwoon
- Date
- 2011
- Collection
- Electronic Theses & Dissertations
- Description
-
Trichomes are specialized epidermal appendages that cover the surface of plant tissues. Glandular secreting trichomes produce a variety of plant specialized metabolites. Trichome metabolites across the kingdom Plantae are extremely diverse and include those important for plant defense and therapeutic purposes. Due to these benefits, biosynthetic pathways leading to the production of trichome metabolites have been intensively studied. Recent advances in analytical chemistry and in genetic and genomic tools facilitate the study of trichome biochemistry. In this study, analytical chemistry, genetic and genomics approaches were used to understand biosynthetic pathways for the production of non-volatile metabolites in Solanum trichomes. Specifically, this study included three individual projects: (i) identification of tomato EMS mutants altered in biosynthetically diverse trichome non-volatile metabolites and phenotypic characterization of a glycosylated flavonoid mutant, (ii) identification and analysis of Solanum lycopersicum genetic variants altered in trichome methylated myricetin biochemistry, and (iii) chemical analysis of diverse trichome acylsugar profiles in 80 accessions of S. habrochaites, a wild relative of the cultivated tomato.
- Title
- Genetics and genomics of the DST-mediated decay pathway in Arabidopsis thaliana
- Creator
- Lidder, Preetmoninder
- Date
- 2004
- Collection
- Electronic Theses & Dissertations
- Title
- Genomic approaches to heartwood formation in hardwood tree species, black locust (Robinia pseudoacacia L.)
- Creator
- Yang, Jaemo
- Date
- 2004
- Collection
- Electronic Theses & Dissertations
- Title
- Genomics of Beta vulgaris crop types : insights into tap root development and storage characteristics
- Creator
- Galewski, Paul John
- Date
- 2020
- Collection
- Electronic Theses & Dissertations
- Description
-
Cultivated Beta vulgaris L. (beet) is a species complex composed of several distinct crop types developed for specific end uses. The crop types include sugar beet, fodder beet, table beet and leaf beet/chard. The evolution of each crop type appears to have resulted from interactions between selection, drift, gene flow, recombination, and the sorting of ancestral variation. Beets are generally heterozygous and contain self-incompatibility mechanisms. Therefore, reproducing and maintaining the genetic constitution of a single individual for genetic and phenotypic analysis is a challenge. Beet populations are the fundamental unit of improvement and contain the evolutionary and adaptive potential of the species. This research used several approaches which explore the utility of pooled population genomic sequencing to survey the organization and distribution of genetic diversity within cultivated B. vulgaris lineages, and give context and clarity to the genetics underlying important agronomic characters.
Whole genome sequence data was produced for important varieties and germplasm releases which represent the B. vulgaris crop type lineages. Using population genetic and statistical methods, relationships were determined between populations. Lineage-specific variation, or variation unique to specific crop types, was uncovered and used to quantify the level of support for these groups as discrete units. Allele frequency was able to differentiate between crop types using Principal Component Analysis (PCA), suggesting positive selection for end use was a major driver of crop type divergence. PCA carried out on a chromosome-by-chromosome basis showed the relative contributions of specific chromosomes to crop type diversification. Gene diversity (e.g., expected heterozygosity) and FST proved powerful indicators of selection along the chromosome at nucleotide resolution. In total, 12.13% of loci within the genome were differentiated with respect to crop type. Interestingly, this corresponds to levels of divergence observed in studies of incipient speciation. Differentiated regions, indicated by FST outliers, contained 472 genes, or 1.6% of the 24,255 genes predicted in the reference genome assembly. Respectively, sugar beet, table beet, fodder beet, and chard genomes contained 16, 283, 2, and 171 genes characterized as differentiated between crop types. Cryptic relationships were observed between crop types due to a high degree of genetic variation shared between crop type lineages. Specific instances of common ancestry, sorting of ancestral variation, and admixture and introgression were identified, which explain the degree of substructure observed between specific crop types.
The content and organization of diversity in beet genomes reflects a complex history related to B. vulgaris crop type diversification. With the exception of chard, much of the species' historical selection has focused on the improvement of root characters (e.g., root enlargement, biomass, dry matter content, and sucrose concentration). As a result, major differences in root morphology and physiology can be observed between these lineages. Measures of root development and physiology between crop types were compared, and interestingly, much of the phenotypic variation partitioned between crop types corresponds to candidate genes identified from analyses of genome-wide variation using FST and 2pq. Admixture and introgression appear to have shared specific variation involved in the reduction of lateral roots (e.g., Root primordium defective 1), root enlargement (e.g., Brevis radix-like 4, putative NAC domain-containing protein 94, cytokinin dehydrogenase 3), and biomass accumulation (e.g., 6-phosphofructo-2-kinase). High relationship coefficients and high correlations in allele frequency for this variation were observed, indicating the genetic variation influencing these characters may have been derived from a single origin. The development of beet into an economically viable sugar crop required both an enlarged root and an increased sucrose concentration. Genes were identified that may explain these physiological changes within the root (e.g., decrease in water concentration, increase in dry matter content and increase in sucrose concentration). These genes correspond to shared variation, distributed among crop types, as well as lineage-specific variation, restricted to sugar beet lineages. Integrating selection, drift, and admixture into a putative demographic history of beet provides evidence for the role of specific genes in the development of beet crop types and the expression of novel phenotypic characters.
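The abstract above names expected heterozygosity (2pq) and FST as its indicators of selection without defining them. As a rough illustration only, the sketch below computes the textbook versions of both quantities for a single biallelic locus from per-population allele frequencies. The function names and the equal weighting of populations are assumptions made for this example; the dissertation's actual estimators for pooled sequencing data are not described in this record and may differ.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2pq at a biallelic locus, where q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * p * (1.0 - p)


def fst(freqs):
    """Wright's F_ST = (H_T - H_S) / H_T for one biallelic locus.

    freqs holds the frequency of the same allele in each population.
    Populations are weighted equally (an assumption of this sketch).
    """
    if not freqs:
        raise ValueError("need at least one population")
    # H_S: mean expected heterozygosity within populations.
    h_s = sum(expected_heterozygosity(p) for p in freqs) / len(freqs)
    # H_T: expected heterozygosity of the pooled (mean) allele frequency.
    p_bar = sum(freqs) / len(freqs)
    h_t = expected_heterozygosity(p_bar)
    if h_t == 0.0:
        return 0.0  # locus is fixed everywhere, so there is no differentiation to measure
    return (h_t - h_s) / h_t
```

Under these definitions, populations with identical allele frequencies give an FST of 0, and populations fixed for alternative alleles give an FST of 1.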
- Title
- Hidden Markov model-based homology search and gene prediction in NGS ERA
- Creator
- Techa-angkoon, Prapaporn
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
The exponential cost reduction of next-generation sequencing (NGS) has enabled researchers to sequence a large number of organisms in order to answer various questions in biology, ecology, health, etc. For newly sequenced genomes, gene prediction and homology search against characterized protein sequence databases are two fundamental tasks for annotating functional elements in the genomes. The main goal of gene prediction is to identify gene loci and their structures. As there is accumulating evidence showing important functions of non-coding RNAs (ncRNAs), comprehensive gene prediction should include both protein-coding genes and ncRNAs. Homology search against protein sequences can aid identification of functional elements in genomes. Although there has been intensive research in the fields of gene prediction, ncRNA search, and homology search, there are still unaddressed challenges. In this dissertation, I made contributions in these three areas. For gene prediction, I designed an HMM-based ab initio gene prediction tool that considers the G+C gradient in grass genomes. For homology search, I designed a method that can align short reads against protein families using profile HMMs. For ncRNA search, I designed an ncRNA alignment tool that can align highly structured ncRNAs using only sequence similarity. Below I summarize my contributions.
Despite decades of research about gene prediction, existing gene prediction tools are not carefully designed to deal with variant G+C content and 5'-3' changing patterns inside coding regions. Thus, these tools can miss genes with a positive or negative G+C gradient in grass genomes such as rice, maize, sorghum, etc. I implemented a tool named AUGUSTUS-GC that accounts for the 5'-3' G+C gradient. Our tool can accurately predict protein-coding genes in plant genomes, especially grass genomes.
A large number of sequencing projects have produced short reads from whole genomes or transcriptomic data. I designed a short-read homology search tool that employs paired-end reads to improve homology search sensitivity. The experimental results show that our tool can achieve significantly better sensitivity and accuracy in aligning short reads that are part of remote homologs.
Despite the extensive studies of ncRNA search, the existing tools that heavily depend on secondary structure in homology search cannot efficiently handle RNA-seq data, which is accumulating rapidly. It would be ideal to have a faster ncRNA homology search tool with accuracy similar to that of tools adopting secondary structure. I implemented an accurate ncRNA alignment tool called glu-RNA that can achieve similar accuracy to structural alignment tools while keeping the same running time complexity as sequence alignment tools. The experimental results demonstrate that our tool can achieve more accurate alignments than the popular sequence alignment tools and a well-known structural alignment program.
- Title
- Hierarchical extensions of Bayesian parametric models for whole genome prediction
- Creator
- Yang, Wenzhao
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Whole genome prediction (WGP) is increasingly used to predict breeding values (BV) of plants and animals based on the use of single nucleotide polymorphism (SNP) marker panels. Two particularly popular WGP models, labeled BayesA and BayesB, are based on specifying all SNP-associated effects to be independent of each other. In this dissertation, we further extend these two models to allow for greater flexibility to infer upon BV and SNP effects in three different frameworks: 1) allowing for correlated SNP effects, 2) reaction norm modeling of genotype by environment interaction (G×E) and 3) bivariate WGP models. We complement these efforts by focusing on strategies to infer upon key hyperparameters that anchor some of these specifications.
Based on a first order nonstationary antedependence specification, we extended BayesA and BayesB to account for spatial correlation between SNP effects due to the proximal QTL; we label the corresponding extensions as ante-BayesA and ante-BayesB respectively. Using simulation studies and application to the publicly available heterogeneous stock mice data and other provided benchmark data, we determined that antedependence models had significantly higher WGP accuracies compared to their conventional counterparts, especially at higher LD levels.
Subsequently, we extended reaction norm (RN) and random regression (RR) models to account for G×E. Several specifications on the SNP-specific variance-covariance matrices (VCV) of intercept and slope effects were considered using independent inverted Wishart (IW) prior densities (IW-BayesA, IW-BayesB and IW-BayesC). Two potentially more flexible RR/RN models using square root free Cholesky decomposition (CD) were proposed (CD-BayesA and CD-BayesB). Based on a RN simulation study and a RR data analysis in pigs, RR/RN WGP models provided greater WGP accuracies compared to conventional WGP models, although differences were not substantial between the competing IW- vs CD-based methods except with simpler genetic architectures (i.e., low number of QTL).
We also developed bivariate WGP models based on more or less the same specifications for SNP-specific VCV in RR/RN models (i.e., IW-BayesA, CD-BayesA and CD-BayesB), comparing them to the more conventional bivariate genomic BLUP (bGBLUP) model. Using a LD simulation study, the three bivariate trait models generally demonstrated higher WGP accuracy than univariate BayesA or BayesB when the number of pleiotropic QTL was relatively large and the heritability of the trait was low. Furthermore, in an application to data from pine trees, CD-BayesB exhibited higher predictive ability compared to other competing models.
Comparisons between competing WGP models require appropriate tuning of key hyperparameters. Hence we also studied three alternative Metropolis-Hastings (MH) sampling strategies to infer upon key hyperparameters in BayesA and BayesB. In both simulation studies and an application to the heterogeneous stock mice data, strategies that were more heavily based on Metropolis-Hastings sampling of key hyperparameters demonstrated significantly greater computational efficiencies compared to strategies that deferred to usage of Gibbs sampling.
- Title
- Identification and analysis of non-coding RNAs in large scale genomic data
- Creator
- Achawanantakun, Rujira
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
High-throughput sequencing technologies have created the opportunity for large-scale transcriptome analyses and intensified attention on the study of non-coding RNAs (ncRNAs). NcRNAs play important roles in many cellular processes. For example, transfer RNAs and ribosomal RNAs are involved in the protein translation process; micro RNAs regulate gene expression; long ncRNAs are found to associate with many human diseases ranging from autism to cancer.
Many ncRNAs function through both their sequences and secondary structures. Thus, accurate secondary structure prediction provides important information to understand the tertiary structures and thus the functions of ncRNAs.
The state-of-the-art ncRNA identification tools are mainly based on two approaches. The first approach is a comparative structure analysis, which determines the consensus structure from homologous ncRNAs. Structure prediction is a costly process, because the number of putative structures increases exponentially with the sequence length. Thus it is not practical for very long ncRNAs such as lncRNAs. The accuracy of current structure prediction tools is still not satisfactory, especially on sequences containing pseudoknots. An alternative identification approach that has become increasingly popular is sequence-based expression analysis, which relies on next generation sequencing (NGS) technologies for quantifying gene expression on a genome-wide scale. The specific expression patterns are used to identify the type of ncRNAs. This method is therefore limited to ncRNAs that have medium to high expression levels and have unique expression patterns that are different from other ncRNAs.
In this work, we address the challenges presented in ncRNA identification using different approaches. To be specific, we have proposed four tools: grammar-string-based alignment, KnotShape, KnotStructure, and lncRNA-ID. Grammar-string is a novel ncRNA secondary structure representation that encodes an ncRNA's sequence and secondary structure in the parameter space of a context-free grammar and a full RNA grammar including pseudoknots. It simplifies a complicated structure alignment to a simple grammar-string-based alignment. Also, grammar-string-based alignment incorporates both sequence and structure into multiple sequence alignment. Thus, we can enhance the speed of alignment and achieve an accurate consensus structure. KnotShape and KnotStructure focus on reducing the size of the structure search space to enhance the speed of the structure prediction process. KnotShape predicts the best shape by grouping similar structures together and applying SVM classification to select the best representative shape. KnotStructure improves the performance of structure prediction by using grammar-string-based alignment and the predicted shape output by KnotShape.
lncRNA-ID is specially designed for lncRNA identification. It incorporates balanced random forest learning to construct a classification model to distinguish lncRNA from protein-coding sequences. The major advantage is that it can maintain good predictive performance with limited or imbalanced training data.
- Title
- Multivariate generalized functional linear models with applications to genomics
- Creator
- Jadhav, Sneha
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
This thesis is focused on developing functional data methodology with the aim of addressing problems that arise in genetic sequencing data. While significant progress has been made in identifying common genetic variants associated with diseases, these variants only explain a small proportion of heritability. Recent studies suggest that rare variants could account for this variability. With advancements in sequencing technology, large-scale sequencing studies are now being conducted to comprehensively investigate the contribution of rare variants to the genetic etiology of various diseases. Although these studies hold great potential for uncovering new disease-associated variants, the massive amount of data and complex structure of sequencing data pose great analytical challenges for association analysis. Advanced methods are needed to address these challenges and to facilitate the discovery process of new variants predisposing to various diseases. We use functional data analysis methods to capture the complexities of sequencing data. In the first chapter we investigate the importance of considering the genetic structure of sequencing data. In association studies, the effect of appropriately modeling the genetic structure of sequencing data on association analysis has not been well studied. We compare three statistical approaches which use different strategies to model the genetic structure. They are a burden test, a burden test that considers pairwise correlation, and a functional analysis of variance (FANOVA) test that models the gene through fitting continuous curves on an individual's genotype profile. We find some evidence in favor of treating sequencing data as a function. In the second chapter we present the definitions of some fundamental concepts in Functional Data Analysis like the mean element, covariance operator and its eigendecomposition, and Karhunen-Loeve expansion.
Basis expansion and in particular Karhunen-Loeve expansion play an important role in this thesis. We briefly discuss the estimators for the mean function, the covariance operator and their consistency. Results on the consistency of the eigenvalues and eigenfunctions of the sample covariance operator are also stated. Genetic data are often collected on families, where the response variable or the trait of the family members can be dependent on each other. Additionally, this trait of interest can be discrete or continuous. Thus there is a need for a functional model that can handle dependent data that may be continuous or discrete. The model proposed by Muller and Stadtmuller (2005) uses the generalized estimating equations approach that can handle both continuous and discrete data. However, they assume the response variable to be univariate and the sample to be independent. There are no existing functional methods that we know of that can be directly applied to the family data. In the third chapter we develop a framework for dependent generalized functional linear models where the response is multivariate, that can be used to test for a certain type of association between the genetic data and the trait of interest for family data. In the fourth chapter we develop a regression framework where the response variable has a normal distribution and there is measurement error in the regressor function. In this set-up, the true regressor function is not observable. Instead, we observe a surrogate variable and its replicates. The relation between the true function and the surrogate one is assumed to follow the additive classical measurement error model. We use the approach developed by Stefanski and Carroll (1987) to propose an estimating equation for the parameters and show asymptotic existence and consistency of the estimate obtained from this equation.
- Title
- Non-coding RNA identification in large-scale genomic data
- Creator
- Yuan, Cheng
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Noncoding RNAs (ncRNAs), which function directly as RNAs without being translated into proteins, play diverse and important biological roles. ncRNAs function not only through their primary structures, but also through their secondary structures, which are defined by interactions between Watson-Crick and wobble base pairs. Common types of ncRNA include microRNA, rRNA, snoRNA, and tRNA. Functions of ncRNAs vary among different types. Recent studies suggest the existence of a large number of ncRNA genes. Identification of novel and known ncRNAs becomes increasingly important in order to understand their functionalities and the underlying communities. Next-generation sequencing (NGS) technology sheds light on more comprehensive and sensitive ncRNA annotation. Lowly transcribed ncRNAs or ncRNAs from rare species with low abundance may be identified via deep sequencing. However, there exist several challenges in ncRNA identification in large-scale genomic data. First, the massive volume of datasets could lead to very long computation time, making existing algorithms infeasible. Second, NGS has a relatively high error rate, which could further complicate the problem. Third, high sequence similarity among related ncRNAs could make them difficult to identify, resulting in incorrect output. Fourth, while secondary structures should be adopted for accurate ncRNA identification, they usually incur high computational complexity. In particular, some ncRNAs contain pseudoknot structures, which cannot be effectively modeled by the state-of-the-art approach. As a result, ncRNAs containing pseudoknots are hard to annotate. In my PhD work, I aimed to tackle the above challenges in ncRNA identification. First, I designed a progressive search pipeline to identify ncRNAs containing pseudoknot structures. The algorithms are more efficient than the state-of-the-art approaches and can be used for large-scale data.
Second, I designed an ncRNA classification tool for short reads in NGS data lacking quality reference genomes. The initial homology search phase significantly reduces the size of the original input, making the tool feasible for large-scale data. Last, I focused on identifying 16S ribosomal RNAs from NGS data. 16S ribosomal RNAs are a very important type of ncRNA, which can be used for phylogenetic study. A set of graph-based assembly algorithms was applied to form longer or full-length 16S rRNA contigs. I utilized paired-end information in NGS data, so lowly abundant 16S genes can also be identified. To reduce the complexity of the problem and make the tool practical for large-scale data, I designed a list of error correction and graph reduction techniques for graph simplification.
- Title
- Open reading frame composition and organization as indicators of phenotypic diversity in bacteria and archaea
- Creator
- Harrison, Scott Henry
- Date
- 2006
- Collection
- Electronic Theses & Dissertations
- Title
- Scalable phylogenetic analysis and functional interpretation of genomes with complex evolutionary histories
- Creator
- Hejase, Hussein El Abbass
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
"Phylogenomics involves the inference of a genome-scale phylogeny. A phylogeny is typically inferred using sequences from multiple loci across a set of genomes of multiple organisms by reconstructing gene trees and then reconciling them into a species phylogeny. Many studies have shown that evolutionary processes such as gene flow, incomplete lineage sorting, recombination, selection, gene duplication and loss have shaped our genomes and played a major role in the evolution of a diverse array of metazoans, including humans and ancient hominins, mice, bacteria, and butterflies. The aforementioned evolutionary processes are primary causes of gene tree discordance, which introduce different loci in a genome that exhibit local genealogical variation (i.e. gene trees differing from each other and the species phylogeny in terms of topology and/or branch length). In this dissertation, we develop a method for fast and accurate inference of phylogenetic networks using large-scale sequence data. The advent of high-throughput sequencing technologies has brought about two main scalability challenges: (1) dataset size in terms of the number of taxa and (2) the evolutionary divergence of the taxa in a study. We explore the impact of both dimensions of scale on phylogenetic network inference and then introduce a new phylogenetic divide-and-conquer method which we call FastNet. We show using synthetic and empirical data spanning a range of evolutionary scenarios that FastNet outperforms the state-of-the-art in terms of accuracy and computational requirements. Furthermore, we develop methods that use better and more accurate phylogenies to functionally interpret genomes. One way to study and understand the biological function of genomes is through association mapping, which pinpoints statistical associations between genotypic and phenotypic characters while modeling the relatedness between samples to avoid generating spurious inferences.
Many methods have been proposed to perform association mapping while accounting for sample relatedness. However, the state of the art predominantly utilizes the simplifying assumption that sample relatedness is effectively fixed across the genome. Recent studies have shown that sample relatedness can vary greatly across different loci within a genome where gene trees could differ from each other and the species phylogeny. Thus, there is an imminent need for methods to account for local genealogical variation in functional genomic analyses. We address this methodological gap by introducing two methods, Coal-Map and Coal-Miner, which account for sample relatedness locally within loci and globally across the entire genome. We show through simulated and empirical datasets that these newly introduced methods offer comparable or typically better statistical power and type I error control compared to the state-of-the-art."--Pages ii-iii.
- Title
- The identification of ATPAF1 as a novel asthma susceptibility gene and the characterization of functional regulatory variants
- Creator
- Schauberger, Eric Michael
- Date
- 2011
- Collection
- Electronic Theses & Dissertations
- Description
-
Asthma, the most common chronic disease of childhood, is driven by genetic and environmental determinants. To identify genes that increase the risk of asthma in children, a multiple-stage genome-wide association study was conducted in a nested case-control study of a whole-population birth cohort from the Isle of Wight, UK. This study resulted in the identification of a cluster of associated SNPs and SNP haplotypes in the ATPAF1 gene (ATP synthase mitochondrial F1 complex assembly factor 1) on human chromosome 1p33, with two SNPs achieving significance at a genome-wide level (P=2.26E-5 to 2.2E-8). SNP, haplotype, and gene-level associations were confirmed in three of four replication populations. The ATPAF1 gene contains 303 reported variants, which were assessed using in silico techniques and prioritized through annotated function in public databases, and/or inferred function based on their location in experimentally reported or predicted functional DNA sequences. The in silico screen prioritized 27 variants, of which several had predicted function in coding, splicing, and/or gene expression regulation. These prioritized variants were targeted, in addition to exons and conserved and regulatory regions, for selective resequencing in 40 cohort individuals using Sanger sequencing. Selective resequencing of 14.6 kb resulted in the identification of 35 total variants. This included validation of 9 (of the 27) prioritized variants from the in silico screen and 9 new rare variants, including 1 nonsynonymous mutation. Three variants with gene expression regulatory potential were found to be clustered within 600 bp of each other in the promoter/exon 1 of ATPAF1 in four haplotypes. This region was targeted for analysis using luciferase reporter gene assays in BEAS-2B and COS-7 cell lines.
These cell culture assays confirmed promoter functionality and indicated a statistically significant difference in luciferase expression (mean differences ranging between 2- and 3-fold) among the promoter haplotypes. In conclusion, ATPAF1 was identified as a childhood asthma susceptibility gene. In silico studies coupled with selective resequencing of the ATPAF1 region provided an efficient method to identify functional variants. DNA variant haplotypes within the ATPAF1 promoter demonstrated the ability to differentially regulate gene expression. However, the roles of these and other functional variants in ATPAF1 and their ability to modulate asthma susceptibility need further study.
- Title
- The utility of whole genome amplification in forensic DNA analysis
- Creator
- Barber, Amy Leigh
- Date
- 2005
- Collection
- Electronic Theses & Dissertations
- Title
- Toxicogenomic biomarker discovery of AHR-mediated TCDD-induced hepatotoxicity
- Creator
- Dere, Edward
- Date
- 2010
- Collection
- Electronic Theses & Dissertations
- Description
-
2,3,7,8-Tetrachlorodibenzo-p-dioxin (TCDD) is a ubiquitous environmental contaminant that causes a wide array of species-specific adverse biochemical and physiological responses, including increased tumor promotion, lethality and hepatotoxicity. Most, if not all, of the effects elicited by TCDD are due to inappropriate changes in gene expression that are mediated through activation of the aryl hydrocarbon receptor (AhR). Although the mechanism of AhR gene regulation is well known, the full spectrum of targeted genes leading to the subsequent toxicological responses remains poorly understood. The objective of this research was to integrate disparate and complementary toxicogenomic approaches to identify putative biomarkers of TCDD-induced hepatotoxicity that would aid in reducing the uncertainties involved in cross-species and cross-model extrapolations. In vitro microarray investigation of a mouse hepatoma cell line treated with TCDD identified complex temporal and dose-dependent gene expression responses. Comparative analysis with in vivo hepatic gene expression responses in mice identified a small subset of conserved genes with biological functions related to xenobiotic metabolism, consistent with the known responses observed in vivo. Furthermore, in vitro cross-species comparison using human, mouse, and rat hepatoma cell lines identified relatively few species-conserved gene expression responses and corroborates prior reports of species-specific TCDD-induced toxicities. Genome-wide computational identification and characterization of dioxin response elements (DREs) using a position weight matrix identified species-specific regulons in the promoter regions of targeted genes that may account for the observed species-divergent and -specific responses.
In order to better understand the molecular mechanisms responsible for regulating the transcriptional responses and downstream hepatotoxicity, ChIP-chip analysis was performed to globally identify TCDD-induced AhR/DNA interactions in mouse hepatic tissue. Interestingly, integration of the DRE, ChIP-chip and gene expression analyses found that only ~32% of all TCDD-elicited hepatic gene expression responses are mediated by a DRE-dependent mechanism. These direct targets of AhR regulation have biological functions related to xenobiotic and lipid metabolism, which correspond with the physiological responses observed in vivo. The remaining transcriptional responses that are mediated through a DRE-independent mechanism illustrate the diverse regulatory role of the AhR. Collectively, these results have expanded our knowledge of the hepatic AhR regulatory network and provide insight into the species-conserved responses elicited by TCDD.