Search results
(1 - 20 of 28)
- Title
- IDENTIFICATION OF LTR RETROTRANSPOSONS, EVALUATION OF GENOME ASSEMBLY, AND MODELING RICE DOMESTICATION
- Creator
- Ou, Shujun
- Date
- 2018
- Collection
- Electronic Theses & Dissertations
- Description
-
The majority of fundamental theories in genetics and evolution were proposed prior to the discovery of DNA as the genetic material in 1952. Those include Darwin’s theory of evolution (1859), Mendelian genetics (1865), Wright and Fisher’s population genetics (1918), and McClintock’s transposition of genetic elements (1951). Nevertheless, the underlying mechanisms of those theories were not fully elucidated until the appearance of DNA sequencing technology. At present, technological advances have minimized the cost of sequencing genomes. The real bottleneck in establishing genomic resources is the annotation of genomic sequences. Long terminal repeat (LTR) retrotransposons are a major type of transposable genetic element and dominate plant genomes. We developed a new method called LTR_retriever for accurate annotation of LTR retrotransposons. Further, we studied genome dynamics, genome size variation, and polyploidy origin using LTR retrotransposons. The presence of LTR retrotransposons challenges current sequencing and assembly techniques due to their size and repetitiveness. We proposed an unbiased metric called LTR Assembly Index (LAI), which utilizes the assembled LTR retrotransposons to evaluate the continuity of a genome assembly. We revealed a massive gain in continuity for assemblies sequenced with long-read techniques over short-read methods, and further proposed a standardized classification system for genome quality based on LAI. With high-quality genomes, we can extend our knowledge about microevolution events using a population of genomes. The domestication history of rice is still unresolved due to its complicated demographic history. We collected, re-mapped, and re-analyzed 3,485 cultivated and wild rice resequencing accessions. With data imputation, a total of 17.7 million high-quality single-nucleotide polymorphisms (SNPs) were identified. Our dataset is highly accurate as verified by cross-platform Affymetrix Microarray data, with a pairwise concordance rate of 99%. Combining phylogeny, PCA, and ADMIXTURE analyses, we present profound diversification among rice ecotypes.
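As a rough illustration of the pairwise concordance rate mentioned in this abstract, the sketch below compares genotype calls from two platforms at the same sites and reports the fraction that agree. The function name, the genotype encoding, and the toy calls are all hypothetical; the dissertation's actual procedure is not described here and may differ (for example, in how missing calls are handled).

```python
from typing import Optional, Sequence


def pairwise_concordance(
    calls_a: Sequence[Optional[str]],
    calls_b: Sequence[Optional[str]],
) -> float:
    """Fraction of sites with identical calls, among sites called on both platforms.

    Assumption: a missing call is encoded as None and is excluded from the
    comparison rather than counted as a mismatch.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both call lists must cover the same sites")
    # Keep only the sites where both platforms produced a genotype call.
    compared = [(a, b) for a, b in zip(calls_a, calls_b) if a is not None and b is not None]
    if not compared:
        raise ValueError("no sites were called on both platforms")
    matches = sum(1 for a, b in compared if a == b)
    return matches / len(compared)


# Hypothetical toy data: six sites, one missing call, one disagreement.
seq_calls = ["AA", "AG", "GG", "AA", None, "GG"]
array_calls = ["AA", "AG", "GG", "AG", "AA", "GG"]
rate = pairwise_concordance(seq_calls, array_calls)
print(f"concordance: {rate:.2f}")
```

Excluding missing calls is only one possible convention; counting them as mismatches would give a lower, more conservative rate.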
- Title
- Non-coding RNA identification in large-scale genomic data
- Creator
- Yuan, Cheng
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Noncoding RNAs (ncRNAs), which function directly as RNAs without being translated into proteins, perform diverse and important biological functions. ncRNAs function not only through their primary structures, but also through their secondary structures, which are defined by interactions between Watson-Crick and wobble base pairs. Common types of ncRNA include microRNA, rRNA, snoRNA, and tRNA. Functions of ncRNAs vary among different types. Recent studies suggest the existence of a large number of ncRNA genes. Identification of novel and known ncRNAs becomes increasingly important in order to understand their functionalities and the underlying communities. Next-generation sequencing (NGS) technology sheds light on more comprehensive and sensitive ncRNA annotation. Lowly transcribed ncRNAs or ncRNAs from rare species with low abundance may be identified via deep sequencing. However, there exist several challenges in ncRNA identification in large-scale genomic data. First, the massive volume of datasets could lead to very long computation time, making existing algorithms infeasible. Second, NGS has a relatively high error rate, which could further complicate the problem. Third, high sequence similarity among related ncRNAs could make them difficult to identify, resulting in incorrect output. Fourth, while secondary structures should be adopted for accurate ncRNA identification, they usually incur high computational complexity. In particular, some ncRNAs contain pseudoknot structures, which cannot be effectively modeled by the state-of-the-art approach. As a result, ncRNAs containing pseudoknots are hard to annotate. In my PhD work, I aimed to tackle the above challenges in ncRNA identification. First, I designed a progressive search pipeline to identify ncRNAs containing pseudoknot structures. The algorithms are more efficient than the state-of-the-art approaches and can be used for large-scale data. Second, I designed an ncRNA classification tool for short reads in NGS data lacking quality reference genomes. The initial homology search phase significantly reduces the size of the original input, making the tool feasible for large-scale data. Last, I focused on identifying 16S ribosomal RNAs from NGS data. 16S ribosomal RNAs are a very important type of ncRNA, which can be used for phylogenetic study. A set of graph-based assembly algorithms was applied to form longer or full-length 16S rRNA contigs. I utilized paired-end information in NGS data, so low-abundance 16S genes can also be identified. To reduce the complexity of the problem and make the tool practical for large-scale data, I designed a list of error correction and graph reduction techniques for graph simplification.
- Title
- Studies of improving therapeutic outcomes of breast cancer through development of personalized treatments and characterization of gene interactions
- Creator
- Jhan, Jing-Ru
- Date
- 2016
- Collection
- Electronic Theses & Dissertations
- Description
-
With an understanding of the heterogeneity of breast cancer, patients with luminal or HER2 breast cancer have more specific treatment options other than traditional chemotherapy, the standard therapy for triple-negative breast cancer (TNBC) patients. However, the response to current treatments as well as the prognosis have been clinical challenges. In fact, based on gene expression, breast cancer consists of more subtypes than those routinely used. In addition, gene expression is highly correlated with response to treatment and prognosis. This suggests that the development of personalized treatment with targeted therapy could improve the outcomes, especially for the TNBC subtype. To address this need, I used two approaches: the development of pathway-guided individualized treatment and an understanding of the interactions of potential genes for targeted therapy. Considering the complexity of gene and pathway interactions, the probability of pathway activation was predicted using pathway signatures generated by comparing gene expression differences between cells overexpressing genes of interest and those expressing GFP. This approach was validated in two subtypes of mouse mammary tumors from MMTV-Myc mice, and then further validated in human TNBC patient-derived xenografts (PDXs). The inhibition of tumor growth in mouse mammary tumors and the regression of tumors in PDXs were observed. These proof-of-principle experiments demonstrated the flexibility of pathway-guided personalized treatment. Because this approach needs the combination of different targeted therapies, it is necessary to understand the characteristics of these targeted genes and therapies, such as gene-gene interactions. To meet this demand, I studied the effects of Stat3 in Myc-driven tumors. Here, MMTV-Myc mice with a conditional knockout of Stat3 were generated. I noted that the deletion of Stat3 in MMTV-Myc mice accelerated tumorigenesis as well as delayed tumor growth, with an alteration in the frequency of histological subtypes. These tumors also had deficient angiogenesis. Unexpectedly, mice with this genotype had lactation deficiencies, and lethality of pups was found. This model shared some of the same effects of loss of Stat3 in other oncogene-induced tumors and also had distinct effects compared with other models. This suggests that the oncogene drivers determine the role of Stat3, as an oncogene or tumor suppressor, and emphasizes again the importance of understanding the pathways and interactions in the development of treatment. In sum, these studies demonstrate the potential of guiding individualized treatments in preclinical platforms using bioinformatics analyses. Combined with other genomic profiles, this approach could offer more complete assessments before being translated to practice. In addition, this could be further applied in adaptive clinical trials through matching with mouse models.
- Title
- It's both who you are and where you're from : relating vocational interests and socioeconomic status to bias in biodata and SJTs
- Creator
- Prasad, Joshua
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
"Differences in responding to biodata and situational judgement tests (SJTs) based on gender and racial minority group status were evaluated. It was hypothesized that vocational interests and socioeconomic status (SES) could be used to help characterize the differences in experience between groups (e.g. Cottrell, Newman, Roisman, 2015; Nye, Su, Rounds, & Drasgow, 2012). As a result, interests and SES may help explain differences in both the constructs assessed by biodata and SJTs as well as differences in item functioning (DIF; Drasgow, 1987). Hypotheses were evaluated using multiple-indicator multiple-cause models to simultaneously model latent constructs and item responses (MIMIC; Muthén, 1989). Findings indicate that interests helped explain differences across gender in both the constructs assessed as well as DIF. Interests explained few differences based on minority group status and SES did not seem to meaningfully explain differences in either of the demographic group comparisons. Many items still exhibited DIF as a function of gender or minority group status after accounting for vocational interests and SES, suggesting that further work is needed to identify additional substantive explanations of DIF. Overall, the present work constitutes a thorough examination of differential functioning in noncognitive assessments and establishes a meaningful relationship between the noncognitive constructs assessed here and vocational interests."--Page ii.
- Title
- GENOMIC APPLICATIONS TO PLANT BIOLOGY
- Creator
- Hoopes, Genevieve
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
The study of the total nuclear DNA content of an organism, i.e., the genome, is a relatively new field and has evolved as sequencing technology and its output have changed. A shift from model species to ecological and crop species occurred as sequencing costs decreased and the technology became more broadly accessible, enabling new discoveries in genome biology as increasingly diverse species and populations were profiled. Here, a genome assembly and several transcriptional studies in multiple non-model plant species provided new knowledge of molecular pathways and gene content. Over 157 Mb of the genome of the medicinal plant species Calotropis gigantea (L.) W.T.Aiton was sequenced, de novo assembled, and annotated using Next Generation Sequencing technologies. The resulting assembly represents 92% of the genic space and provides a resource for discovery of the enzymes involved in biosynthesis of the anticancer metabolite, cardenolide. An updated gene expression atlas for 79 developmental maize (Zea mays L., 1753) tissues and five abiotic/biotic stress treatments was developed, revealing 4,154 organ-specific and 7,704 stress-induced differentially expressed (DE) genes. Presence-absence variants (PAVs) were enriched for organ-specific and stress-induced DE genes, tended to be lowly expressed, and had few co-expression network connections, suggesting that PAVs function in environmental adaptation and are on an evolutionary path to pseudogenization. The Maize Genomics Resource (http://maize.plantbiology.msu.edu/) was developed to view and data-mine these resources. Through profiling global gene expression over time in potato (Solanum tuberosum L.) leaf and tuber tissue, the first circadian rhythmic gene expression profiles of the below-ground heterotrophic tuber tissue were generated. The tuber displayed a longer circadian period, a delayed phase, and a lower amplitude compared to leaf tissue. Over 500 genes were differentially phased between the leaf and tuber, and many carbohydrate metabolism enzymes are under both diurnal and circadian regulation, reflecting the importance of the circadian clock for tuber bulking. Most core circadian clock genes do not display circadian rhythmic gene expression in the leaf or tuber, yet robust transcriptional and gene expression circadian rhythms are present.
- Title
- Molecular epidemiology, pangenomic diversity, and comparative genomics of Campylobacter jejuni
- Creator
- Rodrigues, Jose Alexandre
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
-
Campylobacter jejuni, the leading cause of bacterial gastroenteritis in the United States, is often resistant to commonly used antibiotics and has been classified as a serious threat to public health. Through this work, we sought to evaluate infection trends, quantify resistance frequencies, identify epidemiological factors associated with infection, and use whole-genome sequencing (WGS) as well as comparative phylogenomic and pangenomic approaches to understand circulating C. jejuni populations in Michigan. C. jejuni isolates (n=214) were collected from patients via an active surveillance system at four metropolitan hospitals in Michigan between 2011 and 2014. Among the 214 C. jejuni isolates, 135 (63.1%) were resistant to at least one antibiotic. Resistance was observed for all nine antibiotics tested, yielding 11 distinct resistance phenotypes. Tetracycline resistance predominated (n=120; 56.1%), followed by resistance to ciprofloxacin (n=49; 22.9%), which increased from 15.6% in 2011 to 25.0% in 2014. Notably, patients with ciprofloxacin-resistant infections were more likely to report traveling in the past month (Odds Ratio (OR): 3.0; 95% confidence interval (CI): 1.37, 6.68) and international travel (OR: 9.8; 95% CI: 3.69, 26.09). To further characterize these strains, we used WGS to examine the pangenome and investigate the genomic epidemiology of this set of C. jejuni strains recovered from Michigan patients. Among the 214 strains evaluated, 83 unique multilocus sequence types (STs) were identified that were classified as belonging to 19 previously defined clonal complexes (CCs). Core-gene phylogenetic reconstruction based on 615 genes identified three clades, with Clade I comprising six subclades (IA-IF) and predominating (83.2%) among the strains. Because specific cattle-associated STs, such as ST-982, predominated among strains from Michigan patients, we also used WGS to examine a collection of 72 C. jejuni strains recovered from cattle during an overlapping time period. Several phylogenetic analyses demonstrated that most cattle strains clustered separately within the phylogeny, but a subset clustered together with human strains. Hence, we used high-quality single nucleotide polymorphism (hqSNP) profiling to more comprehensively examine those cattle and human strains that clustered together to evaluate the likelihood of interspecies transmission. Notably, this method distinguished highly related strains and identified clusters comprising strains from both humans and cattle. For instance, 88 SNPs separated a cattle and human strain that were previously classified as ST-8, while the human- and cattle-derived ST-982 strains differed by >200 SNPs. These findings demonstrate that highly similar strains were circulating among Michigan patients and cattle during the same time period and highlight the potential for interspecies transmission and diversification within each host. In all, the data presented illustrate that WGS and pangenomic analyses are important tools for enhancing our understanding of the distribution, dissemination, and evolution of specific pathogen populations. Combined with more traditional phenotypic and genotypic approaches, these tools can guide the development of public health prevention and mitigation strategies for C. jejuni and other foodborne pathogens.
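For readers unfamiliar with how an odds ratio and its 95% confidence interval, such as those quoted in this abstract, are commonly obtained, here is a minimal sketch using the standard Wald interval on the log odds ratio. The 2x2 counts are hypothetical and are not taken from the dissertation, which may have used a different estimator (for example, logistic regression or an exact method).

```python
import math
from typing import Tuple


def odds_ratio_wald_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed cases        b: unexposed cases
    c: exposed non-cases    d: unexposed non-cases
    z = 1.96 corresponds to an approximate 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) is the square root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only: resistant vs. susceptible
# infections, split by whether recent travel was reported.
or_, lo, hi = odds_ratio_wald_ci(a=20, b=10, c=10, d=20)
print(f"OR = {or_:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

An interval that excludes 1, as both intervals reported in the abstract do, indicates an association that is statistically significant at roughly the 5% level.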
- Title
- UNDERSTANDING THE GENETIC BASIS OF HUMAN DISEASES BY COMPUTATIONALLY MODELING THE LARGE-SCALE GENE REGULATORY NETWORKS
- Creator
- Wang, Hao
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
-
Many severe diseases, including breast cancer and Alzheimer's disease, are known to be caused by genetic disorders of the human genome. Understanding the genetic basis of human diseases plays a vital role in personalized medicine and precision therapy. However, the pervasive spatial correlations between the disease-associated SNPs have hindered the ability of traditional GWAS to discover causal SNPs and obscured the underlying mechanisms of disease-associated SNPs. Recently, diverse biological datasets generated by large data consortia provide a unique opportunity to fill the gap between genotypes and phenotypes using biological networks, representing the complex interplay between genes, enhancers, and transcription factors (TFs) in 3D space. The comprehensive delineation of the regulatory landscape calls for highly scalable computational algorithms to reconstruct the 3D chromosome structures and mechanistically predict the enhancer-gene links. In this dissertation, I first developed two algorithms, FLAMINGO and tFLAMINGO, to reconstruct the high-resolution 3D chromosome structures. The algorithmic advancements of FLAMINGO and tFLAMINGO lead to the reconstruction of the 3D chromosome structures at an unprecedented resolution from the highly sparse chromatin contact maps. I further developed two integrative algorithms, ComMUTE and ProTECT, to mechanistically predict the long-range enhancer-gene links by modeling the TF profiles. Based on extensive evaluations, these two algorithms demonstrate superior performance in predicting enhancer-gene links and decoding TF regulatory grammars over existing algorithms. The successful application of ComMUTE and ProTECT in 127 cell types not only provides a rich resource of gene regulatory networks but also sheds light on the mechanistic understanding of QTLs, disease-associated genetic variants, and high-order chromatin interactions.
- Title
- THE TRANSCRIPTOMIC AND EPIGENOMIC RESPONSE OF KOCHIA SCOPARIA TO SUBLETHAL GLYPHOSATE
- Creator
- Claucherty, Carly Abbegail
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
-
Weed populations respond and adapt to herbicide stress by evolving resistance. Glyphosate resistance is primarily caused by the amplification of the target site gene, EPSPS, where multiple copies produce a large enough protein pool so that field rates do not kill the plant. This mechanism has evolved independently in at least nine divergent weed species. It has been demonstrated that EPSPS gene duplication may be transposon-mediated in Kochia scoparia. A key regulator of transposable element (TE) activity is DNA methylation. The role of the epigenome and subsequent transcriptome in the transient responses of weeds, the primary target of herbicides, is not well understood. In this study, we performed RNA-Seq and bisulfite sequencing on leaf tissue from glyphosate-sensitive kochia before and three weeks after treatment with two sublethal doses to determine if glyphosate causes hypomethylation of the genome, allowing for the activation of transposons and upregulation of stress-related genes. Our results show that overall gene expression was suppressed by glyphosate and that increases in CHH methylation through development also ceased. We did not observe significant global changes in cytosine methylation, and overall responses were stochastic. When combining the two datasets, there was no direct correlation between changes in methylation and changes in gene expression, suggesting that DNA methylation is not the primary cause of differential expression in our study. Our results broaden the knowledge pool of weedy species epigenomics and aid in understanding the contribution of DNA methylation to plant resilience in response to herbicide stress.
- Title
- USING FRAGARIA AS A MODEL SYSTEM FOR THE STUDY OF SUBGENOME DOMINANCE AND ADAPTATION IN CROPS
- Creator
- Alger, Elizabeth
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
Polyploidy, or the presence of three or more complete genomes in a single organism, has occurred frequently in plants, especially in the angiosperm lineage. Allopolyploids, or polyploids resulting from the merging of different genomes in an interspecific hybrid, have often been shown to experience subgenome dominance. Subgenome dominance is the phenomenon where there is bias in the gene loss and expression between the different genomes in a polyploid, known as subgenomes. Despite the prevalence of polyploids and subgenome dominance, little is known about the factors and mechanisms that influence this process. Strawberry (Fragaria sp.) is emerging as a powerful model system to investigate polyploid subgenome dominance evolution due to the recent identification of the four extant diploid progenitor species of the cultivated octoploid strawberry (Fragaria x ananassa). Having the diploid progenitors in hand allows us to identify differences between the dominant subgenome, F. vesca, and the other three progenitors that may have an impact on subgenome dominance. One possible factor is transposable element (TE) abundance, as low TE density has been consistently associated with the dominant subgenome in allopolyploids. Epigenetic silencing of TEs by DNA methylation to suppress TE activity has been shown to result in decreased expression of neighboring genes, and this lowered gene expression may affect the establishment of subgenome dominance. F. vesca will be used as a diploid model for the study of subgenome dominance in strawberry where I can examine how TE abundance and other factors influence gene expression in a single accession and in hybrid crosses between different accessions. Tracking changes in gene expression in the hybrids will allow us to examine how genomes with different sizes and genomic factors interact. The results and insights observed from this study can then be applied to subgenome dominance research in octoploid strawberry. In addition to the germplasm and genomic resources, strawberries are also a high-value crop, and the loss of their production due to (a)biotic stressors results in the loss of millions of United States dollars annually. Using a population of octoploid strawberries segregating for salt tolerance, I will identify candidate genes related to salt tolerance. Together, this work will identify factors and mechanisms related to subgenome dominance and use genotypic data in a practical breeding context.
- Title
- Genomic basis of electric signal variation in African weakly electric fish
- Creator
- Losilla-Lacayo, Mauricio
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
A repeated theme in speciation is reproductive isolation centered around divergence in few, highly variable traits, especially in cases without strong geographic isolation and with high speciation rates. Understanding the genomic basis of highly variable traits that are key to speciation is a major goal of evolutionary biology, because they can characterize crucial drivers and foundations of the speciation process. African weakly electric fish (Mormyridae) are a decidedly speciose clade of teleost fish, and their electric organ discharges (EODs) are highly variable traits central to species divergence. However, little is known about the genes and cellular processes that underlie EOD variation. In this dissertation, I employ RNAseq and Nanopore sequencing to study the genomic basis of electric signal variation in mormyrids. In Chapter 1, I take a transcriptome-wide approach to describe the molecular basis of electric signal diversity in species of the mormyrid genus Paramormyrops, divergent for EOD complexity, duration and polarity. My results emphasize genes that influence the shape and structure of the electrocyte cytoskeleton, membrane, and extracellular matrix, and the membrane’s physiological properties. In Chapter 2, I compare gene expression patterns between electric organs that produce long vs. short EODs. The results strongly support known aspects of morphological and physiological bases of EOD duration, and for the first time I identified specific genes and broad cellular processes expected to alter morphological and physiological properties of electrocytes; most striking among these is the differential expression of multiple potassium voltage-gated channels. These two chapters independently identified the gene epdl2 as of interest for EOD divergence. In Chapter 3, I study the molecular evolutionary history of epdl2 in Mormyridae, with emphasis on Paramormyrops. My results suggest that three rounds of gene duplication produced four epdl2 paralogs in a Paramormyrops ancestor. In addition, I identify ten sites in epdl2 expected to have experienced strong positive selection in paralogs and implicate them in key functional domains. Overall, the results of this dissertation greatly solidify and expand our understanding of how the genome underpins changes to electrocytes, and in turn, divergence in their electric signals, a highly variable trait that may facilitate speciation in African weakly electric fish. This work provides an evidence-grounded list of candidate genes for functional analyses aimed to corroborate their contribution to the EOD phenotype.
- Title
- Algebraic topology and machine learning for biomolecular modeling
- Creator
- Cang, Zixuan
- Date
- 2018
- Collection
- Electronic Theses & Dissertations
- Description
-
Data is expanding at an unprecedented speed in both quantity and size. Topological data analysis provides excellent tools for analyzing high-dimensional and highly complex data. Inspired by the ability of topological data analysis to characterize data robustly and at multiple scales, and motivated by the demand for practical predictive tools in computational biology and biomedical research, this dissertation extends the capability of persistent homology toward quantitative and predictive data analysis tools with an emphasis on biomolecular systems. Although persistent homology is almost parameter free, careful treatment is still needed toward practically useful prediction models for realistic systems. This dissertation carefully assesses the representability of persistent homology for biomolecular systems and introduces a collection of characterization tools for both macromolecules and small molecules focusing on intra- and inter-molecular interactions, chemical complexities, electrostatics, and geometry. The representations are then coupled with deep learning and machine learning methods for several problems in drug design and biophysical research. In real-world applications, data often come with heterogeneous dimensions and components. For example, in addition to location, atoms of biomolecules can also be labeled with chemical types, partial charges, and atomic radii. While persistent homology is powerful in analyzing the geometry of data, it lacks the ability to handle non-geometric information. Based on cohomology, we introduce a method that attaches the non-geometric information to the topological invariants in persistent homology analysis. This method is not only useful for handling biomolecules but can also be applied to general situations where the data carries both geometric and non-geometric information. In addition to describing biomolecular systems as a static frame, we are often interested in the dynamics of the systems. An efficient way is to assign an oscillator to each atom and study the coupled dynamical system induced by atomic interactions. To this end, we propose a persistent homology-based method for the analysis of the resulting trajectories from the coupled dynamical system. The methods developed in this dissertation have been applied to several problems, namely, prediction of protein stability change upon mutations, protein-ligand binding affinity prediction, virtual screening, and protein flexibility analysis. The tools have shown top performance in both commonly used validation benchmarks and community-wide blind prediction challenges in drug design.
- Title
- COMPUTATIONAL DISCOVERY AND ANNOTATIONS OF CELL-TYPE SPECIFIC LONG-RANGE GENE REGULATION
- Creator
- Huang, Binbin
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
Long-range regulation by distal enhancers plays critical roles in cell-type specific transcriptional programs. Delineation of the mechanisms underlying long-range enhancer regulation will improve our systems-level understanding of gene regulatory networks and their functional impacts on human diseases. Although there are experimental approaches to infer cell-type specific long-range regulation, they suffer from low resolution or high false negative rates. Recent technological advances make it possible to have a comprehensive profile of the regulatory activities in multiple layers, bringing us to the multi-omics era. Here, we made use of the booming data resources and integrated them into machine learning models to uncover the resulting effects of long-range regulation, especially in diseases. In the first study, on androgen-induced gene regulation in the ovary and its impact on female fertility, we identified a total of 190 annotated significant differentially expressed genes (DEGs). A change in H3K27me3 histone modification level was observed in more than half of the DEGs, highlighting the importance of complex long-range multi-enhancer regulation of androgen receptor-regulated genes in ovarian cells. However, computational prediction of genome-wide enhancer–promoter interactions is still challenging due to limited accuracy and the lack of knowledge of the molecular mechanisms. Based on recent biological investigations, the protein–protein interactions (PPIs) between transcription factors (TFs) have been found to participate in the regulation of chromatin loops. Therefore, we developed a novel predictive model for cell-type specific enhancer–promoter interactions by leveraging the information of TF PPI signatures. Evaluated by a series of rigorous performance comparisons, the new model achieves superior performance over other methods. 
In this chromatin loop prediction model, TF bindings inferred from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) make an essential contribution by guiding the prioritization of specific TF PPIs that may mediate cell-type specific long-range regulatory interactions and reveal new mechanistic understanding of enhancer regulation. When processing ChIP-seq data, we found that, on average, 25% of the ChIP-seq reads can be aligned to multiple positions in the reference genome. These reads are discarded by traditional pipelines, which causes a large loss of information. To recover this information, we developed a Bayesian model and designed a Gibbs sampling algorithm to properly align these reads. Evidence from a series of biological comparisons indicated significantly better performance of this model over the competing tool. In summary, our studies took full advantage of the booming data in this multi-omics era to provide a novel view of cell-type specific long-range regulation by distal enhancers and its effects on diseases.
- Title
- Oocyte and Preimplantation Embryo Cross-Species Transcriptome Meta-Analysis Reveals Divergence at Gene Level but Conservation in Functions
- Creator
- Schall, Peter Zachary
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
Two of the most critical stages in early development occur during the maturation of oocytes and during the first lineage specification at the morula-to-blastocyst transition. The accurate regulation of the transcriptome during these essential events is necessary for the development of a healthy embryo. This thesis presents the culmination of custom pipelines developed to produce three meta-analyses: 1) transcriptome changes during oocyte maturation across four mammalian species (human, rhesus monkey, cow, and mouse), 2) predictive modeling of RNA binding proteins and microRNAs binding to the 3’ UTR, impacting stability during oocyte maturation across four mammalian species (human, rhesus monkey, cow, and mouse), and 3) transcriptome changes during the morula-to-blastocyst transition and the establishment of the inner cell mass and trophectoderm across five mammalian species (human, rhesus monkey, cow, pig, and mouse). The results of these studies reveal that relatively few individual transcripts are regulated commonly across species, while more features are shared at the pathway and functional level. This underscores that different species may utilize a different cohort of genes to accomplish a given outcome. Additionally, the pipelines developed for this thesis are highly applicable across many areas of biology.
- Title
- AUTO-PARAMETRIZED KERNEL METHODS FOR BIOMOLECULAR MODELING
- Creator
- Szocinski, Timothy Andrew
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
Being able to predict various physical quantities of biomolecules is of great importance to biologists, chemists, and pharmaceutical companies. By applying machine learning techniques to develop these predictive models, we find much success in our endeavors. Advanced mathematical techniques involving graph theory, algebraic topology, differential geometry, etc. have been very profitable in generating first-rate biomolecular representations that are used to train a variety of machine learning models. Some of these representations are dependent on a choice of kernel function along with parameters that determine its shape. These kernel-based methods of producing features require careful tuning of the kernel parameters, and the tuning cost increases exponentially as more kernels are involved. This limitation largely restricts us to the use of machine learning models with fewer hyperparameters, such as random forest (RF) and gradient-boosting trees (GBT), thus precluding the use of neural networks for kernel-based representations. To alleviate these concerns, we have developed the auto-parametrized weighted element-specific graph neural network (AweGNN), which uses kernel-based geometric graph features in which the kernel parameters are automatically updated throughout the training to reach an optimal combination of kernel parameters. The AweGNN models have been shown to be particularly successful in toxicity and solvation predictions, especially when a multi-task approach is taken. Although the AweGNN introduced hundreds of parameters that were automatically tuned, the ability to include multiple kernel types simultaneously was hindered by the computational expense. In response, the GPU-enhanced AweGNN was developed to tackle the issue. Working with GPU architecture, the AweGNN's computation speed was greatly enhanced. 
To achieve a more comprehensive representation, we suggested a network consisting of fixed topological and spectral auxiliary features to bolster the original AweGNN success. The proposed network was tested on new hydration and solubility datasets, with excellent results. To extend the auto-parametrized kernel technique to include features of a different type, we introduced the theoretical foundation for building an auto-parametrized spectral layer, which uses kernel-based spectral features to represent biomolecular structures. In this dissertation, we explore some underlying notions of mathematics useful in our models, review important topics in machine learning, discuss techniques and models used in molecular biology, detail the AweGNN architecture and results, and test and expand new concepts pertaining to these auto-parametrized kernel methods.
- Title
- Leveraging Angiosperm Pangenomics to Understand Genome Evolution
- Creator
- Yocca, Alan E.
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
-
My dissertation work focused on species-level comparative genomics and pangenomics to describe patterns of genetic variation. I studied multiple systems and unsurprisingly discovered different patterns of variation. Within a species, individuals are genetically diverse. There are some DNA regions present in every individual (core), while others may be specific to a single individual or lineage (variable). The sum of the genetic sequences found across an entire taxonomic group is called the pangenome. This DNA variation greatly contributes to observed phenotypic differences between individuals. Therefore, to understand genome evolution and the link between genotype and phenotype, we must understand the pangenome. In this work, I compare the core and variable genetic regions, both coding and noncoding, across different flowering plant lineages. I note many consistent features across lineages as well as ways in which each pangenomic pattern is unique. These consistencies and differences can be leveraged in the future to better understand genome evolution as well as how genotype relates to phenotype. Specifically, my dissertation includes four chapters: (1) Evolution of Conserved Noncoding Sequences in Arabidopsis thaliana, (2) Machine learning identifies differences between core and variable genes in Brachypodium distachyon and Oryza sativa, (3) Current status and future perspectives on the evolution of cis-regulatory elements in plants, and (4) A pangenome for Vaccinium.
- Title
- ELUCIDATION AND REPURPOSING OF PLANT DITERPENOID BIOSYNTHETIC PATHWAYS
- Creator
- Miller, Garret P.
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
-
Terpenoids are the largest class of specialized metabolites in plants, with widespread uses ranging from fragrances and cosmetics to biofuels, antifeedants, and pharmaceuticals. Terpenoids are derived from a small set of prenyl diphosphate substrates which are cyclized into different terpene scaffolds by terpene synthases. These scaffolds are then modified by various tailoring enzymes, typically starting with cytochrome P450s, into functionalized terpenoids. Given the structural complexity of many of these metabolites, total chemical synthesis is often challenging to achieve at a relevant scale and cost, and as such, biosynthetic methods are increasingly being employed as an alternative for their production. The work presented in this dissertation describes the elucidation of two terpenoid biosynthetic pathways and the repurposing of known pathways to convert synthetic substrates not found in nature. First, three steps constituting the full biosynthetic pathway to leubethanol, an antimicrobial diterpenoid active against multidrug-resistant TB, were elucidated in the Texas Sage (Leucophyllum frutescens). Second, seven steps in the biosynthetic pathway towards structurally complex diterpenoid alkaloids were elucidated in the Siberian Larkspur (Delphinium grandiflorum). Third, twenty-four terpene synthases were screened for activity against twenty synthetic substrate analogs not found in nature, resulting in fifty-six new products and demonstrating the ability to derivatize terpene scaffolds through the derivatization of a starting substrate. In all, this work expands access to different classes of terpenoids through the elucidation of biosynthetic pathways and semi-biosynthesis of terpene scaffolds not found in nature, allowing for more feasible and sustainable production of these structurally complex compounds.
- Title
- MODELING AND PREDICTION OF GENETIC REDUNDANCY IN ARABIDOPSIS THALIANA AND SACCHAROMYCES CEREVISIAE
- Creator
- Cusack, Siobhan Anne
- Date
- 2020
- Collection
- Electronic Theses & Dissertations
- Description
-
Genetic redundancy is a phenomenon where more than one gene encodes products that perform the same function. This frequently manifests experimentally as a single-gene knockout mutant that does not demonstrate a phenotypic change compared to the wild type due to the presence of a paralogous gene performing the same function; a phenotype is only observed when one or more paralogs are knocked out in combination. This presents a challenge to a fundamental goal of genetics, linking genotypes to phenotypes, especially because it is difficult to determine a priori which gene pairs are redundant. Furthermore, while some factors that are associated with redundant genes have been identified, little is known about factors contributing to long-term maintenance of genetic redundancy. Here, we applied a machine learning approach to predict redundancy among benchmark redundant and nonredundant gene pairs in the model plant Arabidopsis thaliana. Predictions were validated using well-characterized redundant and nonredundant gene pairs. Additionally, we leveraged the availability of fitness and multi-omics data in the budding yeast Saccharomyces cerevisiae to build machine learning models for predicting genetic redundancy and related phenotypic outcomes (single and double mutant fitness) among paralogs, and to identify features important in generating these predictions. Collectively, our models of genetic redundancy provide quantitative assessments of how well existing data allow predictions of fitness and genetic redundancy, shed light on characteristics that may contribute to long-term maintenance of paralogs that are seemingly functionally redundant, and will ultimately allow for more targeted generation of phenotypically informative mutants, advancing functional genomic studies.
- Title
- Interpretable machine learning in plant genomes : studies in modeling and understanding complex biological systems
- Creator
- Azodi, Christina Brady
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- Description
-
Complex systems are ubiquitous in genetics and genomics. From the regulation of gene expression to the genetic basis of complex traits, we see that complex networks of diverse cellular molecules underpin the natural world. Driven by technological advances, today's researchers have access to large amounts of omics data from diverse species. At the same time, improvements in computer processing and algorithms have produced more powerful computational tools. Taken together, these advances mean that those working at the interface of data science and biology are poised to better model and understand complex biological systems. The research in this dissertation demonstrates how a data-driven approach can be used to better understand three complex systems: (1) transcriptional response to single and combined heat and drought stress in Arabidopsis thaliana, (2) the genetic basis of flowering time, a complex trait, in Zea mays, and (3) the social basis for opinions and beliefs about biotechnology products. To study the first system, we generated models of the cis-regulatory code from information about DNA sequence and additional omics levels using both classic machine learning and deep learning algorithms. We identified 1,061 putative cis-regulatory elements associated with different patterns of response to single and combined heat and drought stress and found that information about additional levels of regulation, especially chromatin accessibility and known transcription factor binding, improved our models of the cis-regulatory code. To study the second system, we generated phenotype prediction models for flowering time, height, and yield based on either genetic markers or transcript levels at the seedling stage. We found that, while genetic marker-based models performed better than transcript level-based models, models that integrated both types of data performed best. 
Furthermore, transcript-based models were more useful for finding genes known to be associated with flowering time, highlighting how using additional levels of omics data can improve our ability to understand the genetic basis of complex traits. Finally, to study the third system, we integrated 29 characteristics about a person (e.g. age, political ideology, education, values, environmental beliefs) into a machine learning model that would predict an individual's beliefs and opinions about five different types of biotechnology products (e.g. biofortification, biopharmaceuticals). While this approach was particularly useful for identifying individuals that were broadly supportive of biotechnology, finding characteristics of individuals with negative or conditional (i.e. support product A, but not B) opinions was more challenging, highlighting the complexity of public opinions about biotechnology.
- Title
- UNDERSTANDING THE ROLES OF INTERKINGDOM MICROBIAL INTERACTIONS, MICROBIAL TRAITS, AND HOST FACTORS IN THE ASSEMBLY OF PLANT MICROBIOMES
- Creator
- Liber, Julian Aaron
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
The community of organisms that associates with plants is vital both to the survival of the host plant and to the diseases that may kill it. The processes by which this community, called the microbiome, assembles and functions can contribute to the traits of the host, including plants that humans rely on for food, resources, and ecosystem services. This thesis focuses on understanding the assembly of microbiomes at the scale of microbe-microbe interactions and traits of individual microbes, as well as how characters of the host may change this process. I first address this by examining the in vitro and in planta interactions within small synthetic communities of root-inhabiting bacteria and fungi and with the plant host and viral disease of the host. While intermicrobial interactions in vitro were not predictive of in planta interactions, adding host disease or additional organisms to the system altered the assembly process. I then show the development and applications of the CONSTAX2 classifier, a taxonomic assignment tool for metabarcoding studies, which offers improved accuracy and ease of use for conducting metabarcoding studies exploring the diversity and structure of microbial communities. Last, I present a study testing which factors affected the composition of forest fungal communities to understand the ecology of litter-inhabiting fungi and improve methodologies for sampling leaf-associated fungal communities. The factors affecting the assembly of plant microbiomes are complex and varied, but connecting individual interactions to community composition and ultimately function may improve our abilities to predict and manage microbiome processes.
- Title
- Approaches to scaling and improving metagenome sequence assembly
- Creator
- Pell, Jason (Jason A.)
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Since the completion of the Human Genome Project in the early 2000s, new high-throughput sequencing technologies have been developed that produce more DNA sequence reads at a much lower cost. Because of this, large quantities of data have been generated that are difficult to analyze computationally, not only because of the sheer number of reads but also because of errors. One area where this is a particularly difficult problem is metagenomics, where an ensemble of microbes in an environmental sample is sequenced. In this scenario, blends of species with varying abundance levels must be processed together in a bioinformatics pipeline. One common goal with a sequencing dataset is to assemble the genome from the set of reads, but since comparing reads with one another scales quadratically, new algorithms had to be developed to handle the large quantity of short reads generated from the latest sequencers. These assembly algorithms frequently use de Bruijn graphs where reads are broken down into k-mers, or small DNA words of a fixed size k. Despite these algorithmic advances, DNA sequence assembly still scales poorly due to errors and computer memory inefficiency. In this dissertation, we develop approaches to tackle the current shortcomings in metagenome sequence assembly. First, we devise the novel use of a Bloom filter, a probabilistic data structure with false positives, for storing a de Bruijn graph in memory. We study the properties of the de Bruijn graph with false positives in detail and observe that the components in the graph abruptly connect together at a specific false positive rate. Then, we analyze the memory efficiency of a partitioning algorithm at various false positive rates and find that this approach can lead to a 40x decrease in memory usage. Extending the idea of a probabilistic de Bruijn graph, we then develop a two-pass error correction algorithm that effectively discards erroneous reads and corrects the remaining majority to be more accurate. 
In the first pass, we use the digital normalization algorithm to collect novelty and discard reads that are already at sufficient coverage. In the second, a read-to-graph alignment strategy is used to correct reads. Some heuristics are employed to improve the performance. We evaluate the algorithm with an E. coli dataset as well as a mock human gut metagenome dataset and find that the error correction strategy works as intended.
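As a rough illustration of the idea summarized in this abstract, the following minimal Python sketch stores the k-mers of a read in a Bloom filter and treats de Bruijn graph edges as implicit: a successor exists if the filter reports it as present. It is not the dissertation's implementation. The names `kmers`, `BloomFilter`, and `neighbors`, the SHA-256-based hashing, and the filter sizes are all assumptions chosen for illustration.

```python
import hashlib


def kmers(read, k):
    """Yield every length-k substring (k-mer) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


class BloomFilter:
    """Hypothetical minimal Bloom filter: false positives possible, no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function (an assumption;
        # any family of independent hash functions would serve).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def neighbors(bloom, kmer):
    """Successor k-mers that the filter reports as present (may include false positives)."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


# Usage: load the 4-mers of one read, then walk one step in the implicit graph.
bloom = BloomFilter(num_bits=4096, num_hashes=3)
for km in kmers("ACGTACGTGA", 4):
    bloom.add(km)
print(neighbors(bloom, "ACGT"))
```

Because only bits are stored, memory use is independent of k-mer length, at the cost of occasionally reporting k-mers (and therefore edges) that were never inserted.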