Search results
(1 - 20 of 28)
- Title
- AUTO-PARAMETRIZED KERNEL METHODS FOR BIOMOLECULAR MODELING
- Creator
- Szocinski, Timothy Andrew
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
Being able to predict various physical quantities of biomolecules is of great importance to biologists, chemists, and pharmaceutical companies. By applying machine learning techniques to develop these predictive models, we have had much success. Advanced mathematical techniques involving graph theory, algebraic topology, differential geometry, etc. have been very profitable in generating first-rate biomolecular representations that are used to train a variety of machine learning models. Some of these representations are dependent on a choice of kernel function along with parameters that determine its shape. These kernel-based methods of producing features require careful tuning of the kernel parameters, and the tuning cost increases exponentially as more kernels are involved. This limitation largely restricts us to the use of machine learning models with fewer hyperparameters, such as random forest (RF) and gradient-boosting trees (GBT), thus precluding the use of neural networks for kernel-based representations. To alleviate these concerns, we have developed the auto-parametrized weighted element-specific graph neural network (AweGNN), which uses kernel-based geometric graph features in which the kernel parameters are automatically updated throughout the training to reach an optimal combination of kernel parameters. The AweGNN models have proven particularly successful in toxicity and solvation predictions, especially when a multi-task approach is taken. Although the AweGNN introduced hundreds of parameters that were automatically tuned, the ability to include multiple kernel types simultaneously was hindered by the computational expense. In response, the GPU-enhanced AweGNN was developed to tackle the issue. By working with GPU architecture, we greatly enhanced the AweGNN's computation speed.
To achieve a more comprehensive representation, we suggested a network consisting of fixed topological and spectral auxiliary features to bolster the original AweGNN success. The proposed network was tested on new hydration and solubility datasets, with excellent results. To extend the auto-parametrized kernel technique to include features of a different type, we introduced the theoretical foundation for building an auto-parametrized spectral layer, which uses kernel-based spectral features to represent biomolecular structures. In this dissertation, we explore some underlying notions of mathematics useful in our models, review important topics in machine learning, discuss techniques and models used in molecular biology, detail the AweGNN architecture and results, and test and expand new concepts pertaining to these auto-parametrized kernel methods.
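The idea of letting training update a kernel's shape parameter can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the kernel family, so a generalized exponential kernel exp(-(r/eta)^kappa) is used, and the names `exp_kernel`, `feature`, and `fit_eta` are hypothetical. The sketch tunes only the scale `eta` by plain gradient descent on a squared error; it is not the AweGNN architecture.

```python
import math

def exp_kernel(r, eta, kappa):
    """Assumed kernel form: exp(-(r/eta)**kappa) for a distance r > 0."""
    return math.exp(-((r / eta) ** kappa))

def d_exp_kernel_d_eta(r, eta, kappa):
    """Analytic derivative of exp_kernel with respect to the scale eta."""
    u = (r / eta) ** kappa
    return math.exp(-u) * kappa * u / eta

def feature(distances, eta, kappa):
    """Toy kernel-based feature: sum of kernel values over pairwise distances."""
    return sum(exp_kernel(r, eta, kappa) for r in distances)

def fit_eta(distances, target, eta, kappa, lr=0.05, steps=500):
    """Gradient descent on (feature - target)**2 with respect to eta only."""
    for _ in range(steps):
        err = feature(distances, eta, kappa) - target
        grad = 2.0 * err * sum(d_exp_kernel_d_eta(r, eta, kappa) for r in distances)
        eta -= lr * grad
    return eta

# Hypothetical usage: recover the scale that generated a target feature value.
distances = [1.0, 2.0, 3.0]
target = feature(distances, 2.0, 2.0)
eta_fit = fit_eta(distances, target, eta=1.0, kappa=2.0)
```

Because the kernel is differentiable in its parameters, the same gradient can flow through a larger network, which is what removes the need for a separate grid search over kernel parameters.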
- Title
- Algebraic topology and machine learning for biomolecular modeling
- Creator
- Cang, Zixuan
- Date
- 2018
- Collection
- Electronic Theses & Dissertations
- Description
-
Data is expanding at an unprecedented rate in both quantity and size. Topological data analysis provides excellent tools for analyzing high-dimensional and highly complex data. Inspired by the ability of topological data analysis to characterize data robustly and at multiple scales, and motivated by the demand for practical predictive tools in computational biology and biomedical research, this dissertation extends the capability of persistent homology toward quantitative and predictive data analysis tools with an emphasis on biomolecular systems. Although persistent homology is almost parameter free, careful treatment is still needed to build practically useful prediction models for realistic systems. This dissertation carefully assesses the representability of persistent homology for biomolecular systems and introduces a collection of characterization tools for both macromolecules and small molecules focusing on intra- and inter-molecular interactions, chemical complexities, electrostatics, and geometry. The representations are then coupled with deep learning and machine learning methods for several problems in drug design and biophysical research. In real-world applications, data often come with heterogeneous dimensions and components. For example, in addition to location, atoms of biomolecules can also be labeled with chemical types, partial charges, and atomic radii. While persistent homology is powerful in analyzing the geometry of data, it lacks the ability to handle non-geometric information. Based on cohomology, we introduce a method that attaches the non-geometric information to the topological invariants in persistent homology analysis. This method is useful not only for handling biomolecules but also in general situations where the data carry both geometric and non-geometric information. In addition to describing biomolecular systems as a static frame, we are often interested in the dynamics of the systems.
An efficient way to do this is to assign an oscillator to each atom and study the coupled dynamical system induced by atomic interactions. To this end, we propose a persistent-homology-based method for the analysis of the resulting trajectories from the coupled dynamical system. The methods developed in this dissertation have been applied to several problems, namely, prediction of protein stability change upon mutations, protein-ligand binding affinity prediction, virtual screening, and protein flexibility analysis. The tools have shown top performance in both commonly used validation benchmarks and community-wide blind prediction challenges in drug design.
- Title
- Analysis of host transcriptome response to porcine reproductive and respiratory syndrome virus infection
- Creator
- Arceo, Maria Eugenia
- Date
- 2012
- Collection
- Electronic Theses & Dissertations
- Description
-
Porcine Reproductive and Respiratory Syndrome (PRRS) has been affecting commercial populations of pigs in the US for more than 20 years. We evaluated differences in gene expression in pigs from the PRRS Host Genetics Consortium initiative showing a range of responses to PRRS virus infection. Pigs were allocated into four phenotypic groups according to their serum viral level and weight gain. We obtained RNA at several days post-infection and hybridized it to the 20K 70-mer oligonucleotide Pigoligoarray. We initially used plasmode datasets to select an optimal procedure for analyzing these data. We showed that the random array effects model with the moderated F statistic and significance thresholds obtained by permutation provided the most powerful analysis procedure. We then addressed global differential gene expression between phenotypic groups. We identified cell death as a biological function significantly associated with several gene networks enriched for differentially expressed genes. We found the genes interferon-alpha 1; major histocompatibility complex, class II, DR alpha; and major histocompatibility complex, class II, DQ alpha 1 to be differentially expressed between phenotypic groups. Finally, we used this study as pilot data to inform the design of future time-course transcriptional profiling experiments. We concluded that the best scenario for investigating early response to PRRSV infection consists of sampling at 4 and 7 days post-infection using approximately 30 pigs per phenotypic group.
- Title
- Approaches to scaling and improving metagenome sequence assembly
- Creator
- Pell, Jason (Jason A.)
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Since the completion of the Human Genome Project in the early 2000s, new high-throughput sequencing technologies have been developed that produce more DNA sequence reads at a much lower cost. Because of this, large quantities of data have been generated that are difficult to analyze computationally, not only because of the sheer number of reads but also because of errors. One area where this is a particularly difficult problem is metagenomics, where an ensemble of microbes in an environmental sample is sequenced. In this scenario, blends of species with varying abundance levels must be processed together in a bioinformatics pipeline. One common goal with a sequencing dataset is to assemble the genome from the set of reads, but since comparing reads with one another scales quadratically, new algorithms had to be developed to handle the large quantity of short reads generated from the latest sequencers. These assembly algorithms frequently use de Bruijn graphs where reads are broken down into k-mers, or small DNA words of a fixed size k. Despite these algorithmic advances, DNA sequence assembly still scales poorly due to errors and computer memory inefficiency. In this dissertation, we develop approaches to tackle the current shortcomings in metagenome sequence assembly. First, we devise the novel use of a Bloom filter, a probabilistic data structure with false positives, for storing a de Bruijn graph in memory. We study the properties of the de Bruijn graph with false positives in detail and observe that the components in the graph abruptly connect together at a specific false positive rate. Then, we analyze the memory efficiency of a partitioning algorithm at various false positive rates and find that this approach can lead to a 40x decrease in memory usage. Extending the idea of a probabilistic de Bruijn graph, we then develop a two-pass error correction algorithm that effectively discards erroneous reads and corrects the remaining majority to be more accurate.
In the first pass, we use the digital normalization algorithm to collect novelty and discard reads whose content has already been seen at sufficient coverage. In the second, a read-to-graph alignment strategy is used to correct reads. Some heuristics are employed to improve performance. We evaluate the algorithm with an E. coli dataset as well as a mock human gut metagenome dataset and find that the error correction strategy works as intended.
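The abstract's idea of holding a de Bruijn graph in a Bloom filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the class `BloomFilter`, the helpers `kmers` and `successors`, the SHA-256-based hashing, and the filter sizes are all hypothetical choices. Only k-mers are stored; edges are implicit, found by testing the four possible one-base extensions of a k-mer for membership.

```python
import hashlib

class BloomFilter:
    """Bit array with several hash functions; may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def successors(kmer, bloom):
    """One-base extensions of kmer that the filter reports as present (may include false positives)."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

# Hypothetical usage: load the k-mers of one read, then walk one edge of the implicit graph.
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for km in kmers("ACGTACGA", 4):
    bloom.add(km)
next_kmers = successors("ACGT", bloom)
```

A false positive in the filter shows up as a spurious node or edge in the graph, which is why the false positive rate controls when separate graph components start to connect, as the abstract describes.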
- Title
- COMPUTATIONAL DISCOVERY AND ANNOTATIONS OF CELL-TYPE SPECIFIC LONG-RANGE GENE REGULATION
- Creator
- Huang, Binbin
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
Long-range regulation by distal enhancers plays critical roles in cell-type specific transcriptional programs. Delineation of the mechanisms underlying long-range enhancer regulation will improve our systems-level understanding of gene regulatory networks and their functional impacts on human diseases. Although there are experimental approaches to infer cell-type specific long-range regulation, they suffer from the problems of low resolution or high false negative rates. Recent technological advances make it possible to have a comprehensive profile of the regulatory activities in multiple layers, bringing us to the multi-omics era. Here, we made use of the booming data resources and integrated them into machine learning models to uncover the resulting effects of long-range regulation, especially in diseases. In the first study, on androgen-induced gene regulation in the ovary and its impact on female fertility, we identified a total of 190 annotated significant differentially expressed genes (DEGs). A change in the H3K27me3 histone modification level was observed in more than half of the DEGs, highlighting the importance of complex long-range multi-enhancer regulation of androgen receptor-regulated genes in the ovarian cells. However, current computational predictions of genome-wide enhancer–promoter interactions are still challenging due to limited accuracy and the lack of knowledge on the molecular mechanisms. Based on recent biological investigations, the protein–protein interactions (PPIs) between transcription factors (TFs) have been found to participate in the regulation of chromatin loops. Therefore, we developed a novel predictive model for cell-type specific enhancer–promoter interactions by leveraging the information of TF PPI signatures. Evaluated by a series of rigorous performance comparisons, the new model achieves superior performance over other methods.
In this chromatin loop prediction model, TF binding inferred from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) makes an essential contribution: it helps prioritize specific TF PPIs that may mediate cell-type specific long-range regulatory interactions and reveals new mechanistic understanding of enhancer regulation. When processing ChIP-seq data, we found that, on average, 25% of the ChIP-seq reads can be aligned to multiple positions in the reference genome. These reads are discarded by traditional pipelines, which causes a large loss of information. To cope with this waste, we developed a Bayesian model and designed a Gibbs sampling algorithm to properly align these reads. Evidence from a series of biological comparisons indicated a significantly better performance of this model over the competing tool. In summary, our studies took full advantage of the booming data in this multi-omics era to provide a novel view of the cell-type specific long-range regulation by distal enhancers and its effects on diseases.
- Title
- Contributions to machine learning in biomedical informatics
- Creator
- Baytas, Inci Meliha
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- Description
-
"With innovations in digital data acquisition devices and increased memory capacity, virtually all commercial and scientific domains have been witnessing an exponential growth in the amount of data they can collect. For instance, healthcare is experiencing a tremendous growth in digital patient information due to the high adaptation rate of electronic health record systems in hospitals. The abundance of data offers many opportunities to develop robust and versatile systems, as long as the underlying salient information in data can be captured. On the other hand, today's data, often named big data, is challenging to analyze due to its large scale and high complexity. For this reason, efficient data-driven techniques are necessary to extract and utilize the valuable information in the data. The field of machine learning essentially develops such techniques to learn effective models directly from the data. Machine learning models have been successfully employed to solve complicated real world problems. However, the big data concept has numerous properties that pose additional challenges in algorithm development. Namely, high dimensionality, class membership imbalance, non-linearity, distributed data, heterogeneity, and temporal nature are some of the big data characteristics that machine learning must address. Biomedical informatics is an interdisciplinary domain where machine learning techniques are used to analyze electronic health records (EHRs). EHR comprises digital patient data with various modalities and depicts an instance of big data. For this reason, analysis of digital patient data is quite challenging although it provides a rich source for clinical research. While the scale of EHR data used in clinical research might not be huge compared to the other domains, such as social media, it is still not feasible for physicians to analyze and interpret longitudinal and heterogeneous data of thousands of patients.
Therefore, computational approaches and graphical tools to assist physicians in summarizing the underlying clinical patterns of the EHRs are necessary. The field of biomedical informatics employs machine learning and data mining approaches to provide the essential computational techniques to analyze and interpret complex healthcare data to assist physicians in patient diagnosis and treatment. In this thesis, we propose and develop machine learning algorithms, motivated by prevalent biomedical informatics tasks, to analyze the EHRs. Specifically, we make the following contributions: (i) A convex sparse principal component analysis approach along with variance reduced stochastic proximal gradient descent is proposed for the patient phenotyping task, which is defined as finding clinical representations for patient groups sharing the same set of diseases. (ii) An asynchronous distributed multi-task learning method is introduced to learn predictive models for distributed EHRs. (iii) A modified long-short term memory (LSTM) architecture is designed for the patient subtyping task, where the goal is to cluster patients based on similar progression pathways. The proposed LSTM architecture, T-LSTM, performs a subspace decomposition on the cell memory such that the short term effect in the previous memory is discounted based on the length of the time gap. (iv) An alternative approach to T-LSTM model is proposed with a decoupled memory to capture the short and long term changes. The proposed model, decoupled memory gated recurrent network (DM-GRN), is designed to learn two types of memories focusing on different components of the time series data. In this study, in addition to the healthcare applications, behavior of the proposed model is investigated for traffic speed prediction problem to illustrate its generalization ability. 
In summary, the aforementioned machine learning approaches have been developed to address complex characteristics of electronic health records in routine biomedical informatics tasks such as computational patient phenotyping and patient subtyping. Proposed models are also applicable to different domains with similar data characteristics as EHRs."--Pages ii-iii.
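The T-LSTM memory adjustment described above (split the previous cell memory into short-term and long-term parts, then discount the short-term part by the elapsed time) can be sketched as follows. This is a simplified illustration under stated assumptions: the decay function 1/log(e + Δt), the elementwise scalar weights `w` and `b` (a trained model would learn a full weight matrix), and the names `time_decay` and `adjust_memory` are all hypothetical and not taken from the record.

```python
import math

def time_decay(delta_t):
    """Assumed decay weight: 1 at delta_t == 0, decreasing toward 0 as the gap grows."""
    return 1.0 / math.log(math.e + delta_t)

def adjust_memory(c_prev, delta_t, w=1.0, b=0.0):
    """Discount only the short-term component of each memory entry by the time gap."""
    adjusted = []
    for c in c_prev:
        c_short = math.tanh(w * c + b)   # short-term component (learned projection in a real model)
        c_long = c - c_short             # remainder is treated as long-term memory
        adjusted.append(c_long + time_decay(delta_t) * c_short)
    return adjusted

# Hypothetical usage: the same memory after a short and a long gap between visits.
memory = [1.0, -0.5, 0.25]
after_short_gap = adjust_memory(memory, delta_t=1.0)
after_long_gap = adjust_memory(memory, delta_t=100.0)
```

With a zero gap the memory passes through unchanged, and as the gap grows only the long-term component survives, which matches the intent of discounting short-term effects for irregularly spaced patient records.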
- Title
- Development of a nanoparticle-based electrochemical bio-barcode DNA biosensor for multiplexed pathogen detection on screen-printed carbon electrodes
- Creator
- Zhang, Deng
- Date
- 2011
- Collection
- Electronic Theses & Dissertations
- Description
-
A highly amplified, nanoparticle-based, bio-barcoded electrochemical biosensor for the simultaneous multiplexed detection of the protective antigen A (pagA) gene (accession number = M22589) from Bacillus anthracis and the insertion element (Iel) gene (accession number = Z83734) from Salmonella Enteritidis was developed. The biosensor system is mainly composed of three nanoparticles: gold nanoparticles (AuNPs), magnetic nanoparticles (MNPs), and nanoparticle tracers (NTs), such as lead sulfide (PbS) and cadmium sulfide (CdS). The AuNPs are coated with the first target-specific DNA probe (1pDNA), which can recognize one end of the target DNA sequence (tDNA), and many NT-terminated bio-barcode ssDNA (bDNA-NT), which act as signal reporter and amplifier. The MNPs are coated with the second target-specific DNA probe (2pDNA) that can recognize the other end of the target gene. After binding the nanoparticles with the target DNA, the following sandwich structure is formed: MNP-2pDNA/tDNA/1pDNA-AuNP-bDNA-NTs. A magnetic field is applied to separate the sandwich structure from the unreacted materials. Because the AuNPs have a large number of nanoparticle tracers per DNA probe binding event, there is substantial amplification. After the nanoparticle tracer is dissolved in 1 mol/L nitric acid, the NT ions, such as Pb²⁺ and Cd²⁺, show distinct non-overlapping stripping curves by square wave anodic stripping voltammetry (SWASV) on screen-printed carbon electrode (SPCE) chips. The oxidation potential of NT ions is unique for each nanoparticle tracer and the peak current is related to the target DNA concentration. The results show that the biosensor has good specificity, and the sensitivity of single detection of the pagA gene from Bacillus anthracis using PbS NTs is as low as 0.2 pg/mL. The detection limit of this multiplex bio-barcoded DNA sensor is 50 pg/mL using PbS or CdS NTs.
The nanoparticle-based bio-barcoded DNA sensor has potential applications for multiple detections of bioterrorism threat agents, co-infection, and contaminants in the same sample.
- Title
- ELUCIDATION AND REPURPOSING OF PLANT DITERPENOID BIOSYNTHETIC PATHWAYS
- Creator
- Miller, Garret P.
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
-
Terpenoids are the largest class of specialized metabolites in plants, with widespread uses ranging from fragrances and cosmetics to biofuels, antifeedants, and pharmaceuticals. Terpenoids are derived from a small set of prenyl diphosphate substrates which are cyclized into different terpene scaffolds by terpene synthases. These scaffolds are then modified by various tailoring enzymes (typically starting with cytochrome P450s) into functionalized terpenoids. Given the structural complexity of many of these metabolites, total chemical synthesis is often challenging to achieve at a relevant scale and cost, and as such, biosynthetic methods are increasingly being employed as an alternative for their production. The work presented in this dissertation describes the elucidation of two terpenoid biosynthetic pathways and the repurposing of known pathways to convert synthetic substrates not found in nature. First, three steps constituting the full biosynthetic pathway to leubethanol, an antimicrobial diterpenoid active against multidrug-resistant TB, were elucidated in the Texas Sage (Leucophyllum frutescens). Second, seven steps in the biosynthetic pathway towards structurally complex diterpenoid alkaloids were elucidated in the Siberian Larkspur (Delphinium grandiflorum). Third, twenty-four terpene synthases were screened for activity against twenty synthetic substrate analogs not found in nature, resulting in fifty-six new products and demonstrating the ability to derivatize terpene scaffolds through the derivatization of a starting substrate. In all, this work expands access to different classes of terpenoids through the elucidation of biosynthetic pathways and semi-biosynthesis of terpene scaffolds not found in nature, allowing for more feasible and sustainable production of these structurally complex compounds.
- Title
- GENOMIC APPLICATIONS TO PLANT BIOLOGY
- Creator
- Hoopes, Genevieve
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
The study of the total nuclear DNA content of an organism, i.e., the genome, is a relatively new field and has evolved as sequencing technology and its output have changed. A shift from model species to ecological and crop species occurred as sequencing costs decreased and the technology became more broadly accessible, enabling new discoveries in genome biology as increasingly diverse species and populations were profiled. Here, a genome assembly and several transcriptional studies in multiple non-model plant species provided new knowledge of molecular pathways and gene content. Over 157 Mb of the genome of the medicinal plant species Calotropis gigantea (L.) W.T.Aiton was sequenced, de novo assembled, and annotated using Next Generation Sequencing technologies. The resulting assembly represents 92% of the genic space and provides a resource for discovery of the enzymes involved in biosynthesis of the anticancer metabolite, cardenolide. An updated gene expression atlas for 79 developmental maize (Zea mays L., 1753) tissues and five abiotic/biotic stress treatments was developed, revealing 4,154 organ-specific and 7,704 stress-induced differentially expressed (DE) genes. Presence-absence variants (PAVs) were enriched for organ-specific and stress-induced DE genes, tended to be lowly expressed, and had few co-expression network connections, suggesting that PAVs function in environmental adaptation and are on an evolutionary path to pseudogenization. The Maize Genomics Resource (http://maize.plantbiology.msu.edu/) was developed to view and data-mine these resources. Through profiling global gene expression over time in potato (Solanum tuberosum L.) leaf and tuber tissue, the first circadian rhythmic gene expression profiles of the below-ground heterotrophic tuber tissue were generated. The tuber displayed a longer circadian period, a delayed phase, and a lower amplitude compared to leaf tissue.
Over 500 genes were differentially phased between the leaf and tuber, and many carbohydrate metabolism enzymes are under both diurnal and circadian regulation, reflecting the importance of the circadian clock for tuber bulking. Most core circadian clock genes do not display circadian rhythmic gene expression in the leaf or tuber, yet robust transcriptional and gene expression circadian rhythms are present.
- Title
- Genomic basis of electric signal variation in African weakly electric fish
- Creator
- Losilla-Lacayo, Mauricio
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
A repeated theme in speciation is reproductive isolation centered around divergence in a few highly variable traits, especially in cases with high speciation rates and without strong geographic isolation. Understanding the genomic basis of highly variable traits that are key to speciation is a major goal of evolutionary biology, because they can characterize crucial drivers and foundations of the speciation process. African weakly electric fish (Mormyridae) are a decidedly speciose clade of teleost fish, and their electric organ discharges (EODs) are highly variable traits central to species divergence. However, little is known about the genes and cellular processes that underlie EOD variation. In this dissertation, I employ RNAseq and Nanopore sequencing to study the genomic basis of electric signal variation in mormyrids. In Chapter 1, I take a transcriptome-wide approach to describe the molecular basis of electric signal diversity in species of the mormyrid genus Paramormyrops, divergent for EOD complexity, duration and polarity. My results emphasize genes that influence the shape and structure of the electrocyte cytoskeleton, membrane, and extracellular matrix, and the membrane’s physiological properties. In Chapter 2, I compare gene expression patterns between electric organs that produce long vs short EODs. The results strongly support known aspects of morphological and physiological bases of EOD duration, and for the first time I identified specific genes and broad cellular processes expected to alter morphological and physiological properties of electrocytes; most striking among these is the differential expression of multiple potassium voltage-gated channels. These two chapters independently identified the gene epdl2 as of interest for EOD divergence. In Chapter 3, I study the molecular evolutionary history of epdl2 in Mormyridae, with emphasis on Paramormyrops.
My results suggest that three rounds of gene duplication produced four epdl2 paralogs in a Paramormyrops ancestor. In addition, I identify ten sites in epdl2 expected to have experienced strong positive selection in paralogs and implicate them in key functional domains. Overall, the results of this dissertation greatly solidify and expand our understanding of how the genome underpins changes to electrocytes, and in turn, divergence in their electric signals, a highly variable trait that may facilitate speciation in African weakly electric fish. This work provides an evidence-grounded list of candidate genes for functional analyses aimed to corroborate their contribution to the EOD phenotype.
- Title
- IDENTIFICATION OF LTR RETROTRANSPOSONS, EVALUATION OF GENOME ASSEMBLY, AND MODELING RICE DOMESTICATION
- Creator
- Ou, Shujun
- Date
- 2018
- Collection
- Electronic Theses & Dissertations
- Description
-
The majority of fundamental theories in genetics and evolution were proposed prior to the discovery of DNA as the genetic material in 1952. Those include Darwin’s theory of evolution (1859), Mendelian genetics (1865), Wright and Fisher’s population genetics (1918), and McClintock’s transposition of genetic elements (1951). Nevertheless, the underlying mechanisms of those theories were not fully elucidated until the appearance of DNA sequencing technology. At present, technological advances have minimized the cost of sequencing genomes. The real bottleneck in establishing genomic resources is the annotation of genomic sequences. Long terminal repeat (LTR) retrotransposons are a major type of transposable genetic element and dominate plant genomes. We developed a new method called LTR_retriever for accurate annotation of LTR retrotransposons. Further, we studied genome dynamics, genome size variation, and polyploidy origin using LTR retrotransposons. The presence of LTR retrotransposons challenges current sequencing and assembly techniques due to their size and repetitiveness. We proposed an unbiased metric called the LTR Assembly Index (LAI), which uses the assembled LTR retrotransposons to evaluate the continuity of a genome assembly. We revealed a massive gain in continuity for assemblies sequenced with long-read techniques over short-read methods, and further proposed a standardized classification system for genome quality based on LAI. With high-quality genomes, we can extend our knowledge of microevolution events using a population of genomes. The domestication history of rice is still unresolved due to its complicated demographic history. We collected, re-mapped, and re-analyzed 3,485 cultivated and wild rice resequencing accessions. With data imputation, a total of 17.7 million high-quality single-nucleotide polymorphisms (SNPs) were identified. Our dataset is highly accurate as verified by cross-platform Affymetrix Microarray data, with a pairwise concordance rate of 99%. Combining phylogeny, PCA, and ADMIXTURE analyses, we present profound diversification among rice ecotypes.
- Title
- Interpretable machine learning in plant genomes : studies in modeling and understanding complex biological systems
- Creator
- Azodi, Christina Brady
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- Description
-
Complex systems are ubiquitous in genetics and genomics. From the regulation of gene expression to the genetic basis of complex traits, we see that complex networks of diverse cellular molecules underpin the natural world. Driven by technological advances, today's researchers have access to large amounts of omics data from diverse species. At the same time, improvements in computer processing and algorithms have produced more powerful computational tools. Taken together, these advances mean that those working at the interface of data science and biology are poised to better model and understand complex biological systems. The research in this dissertation demonstrates how a data-driven approach can be used to better understand three complex systems: (1) transcriptional response to single and combined heat and drought stress in Arabidopsis thaliana, (2) the genetic basis of flowering time, a complex trait, in Zea mays, and (3) the social basis for opinions and beliefs about biotechnology products. To study the first system, we generated models of the cis-regulatory code from information about DNA sequence and additional omics levels using both classic machine learning and deep learning algorithms. We identified 1,061 putative cis-regulatory elements associated with different patterns of response to single and combined heat and drought stress and found that information about additional levels of regulation, especially chromatin accessibility and known transcription factor binding, improved our models of the cis-regulatory code. To study the second system, we generated phenotype prediction models for flowering time, height, and yield based on either genetic markers or transcript levels at the seedling stage. We found that, while genetic marker-based models performed better than transcript level-based models, models that integrated both types of data performed best. Furthermore, transcript-based models were more useful for finding genes known to be associated with flowering time, highlighting how using additional levels of omics data can improve our ability to understand the genetic basis of complex traits. Finally, to study the third system, we integrated 29 characteristics about a person (e.g. age, political ideology, education, values, environmental beliefs) into a machine learning model that would predict an individual's beliefs and opinions about five different types of biotechnology products (e.g. biofortification, biopharmaceuticals). While this approach was particularly useful for identifying individuals that were broadly supportive of biotechnology, finding characteristics of individuals with negative or conditional (i.e. support product A, but not B) opinions was more challenging, highlighting the complexity of public opinions about biotechnology.
- Title
- It's both who you are and where you're from : relating vocational interests and socioeconomic status to bias in biodata and SJTs
- Creator
- Prasad, Joshua
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
"Differences in responding to biodata and situational judgement tests (SJTs) based on gender and racial minority group status were evaluated. It was hypothesized that vocational interests and socioeconomic status (SES) could be used to help characterize the differences in experience between groups (e.g. Cottrell, Newman, Roisman, 2015; Nye, Su, Rounds, & Drasgow, 2012). As a result, interests and SES may help explain differences in both the constructs assessed by biodata and SJTs as well as differences in item functioning (DIF; Drasgow, 1987). Hypotheses were evaluated using multiple-indicator multiple-cause models to simultaneously model latent constructs and item responses (MIMIC; Muthén, 1989). Findings indicate that interests helped explain differences across gender in both the constructs assessed as well as DIF. Interests explained few differences based on minority group status and SES did not seem to meaningfully explain differences in either of the demographic group comparisons. Many items still exhibited DIF as a function of gender or minority group status after accounting for vocational interests and SES, suggesting that further work is needed to identify additional substantive explanations of DIF. Overall, the present work constitutes a thorough examination of differential functioning in noncognitive assessments and establishes a meaningful relationship between the noncognitive constructs assessed here and vocational interests."--Page ii.
- Title
- Leveraging Angiosperm Pangenomics to Understand Genome Evolution
- Creator
- Yocca, Alan E.
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
-
My dissertation work focused on species-level comparative genomics and pangenomics to describe patterns of genetic variation. I studied multiple systems and unsurprisingly discovered different patterns of variation. Within a species, individuals are genetically diverse. There are some DNA regions present in every individual (core), while others may be specific to a single individual or lineage (variable). The sum of the genetic sequences found across an entire taxonomic group is called the pangenome. This DNA variation greatly contributes to observed phenotypic differences between individuals. Therefore, to understand genome evolution and the link between genotype and phenotype, we must understand the pangenome. In this work, I compare the core and variable genetic regions, both coding and noncoding, across different flowering plant lineages. I note many consistent features across lineages as well as ways in which each pangenomic pattern is unique. These consistencies and differences can be leveraged in the future to better understand genome evolution as well as how genotype relates to phenotype. Specifically, my dissertation includes four chapters: (1) Evolution of Conserved Noncoding Sequences in Arabidopsis thaliana, (2) Machine learning identifies differences between core and variable genes in Brachypodium distachyon and Oryza sativa, (3) Current status and future perspectives on the evolution of cis-regulatory elements in plants, and (4) A pangenome for Vaccinium.
- Title
- MODELING AND PREDICTION OF GENETIC REDUNDANCY IN ARABIDOPSIS THALIANA AND SACCHAROMYCES CEREVISIAE
- Creator
- Cusack, Siobhan Anne
- Date
- 2020
- Collection
- Electronic Theses & Dissertations
- Description
-
Genetic redundancy is a phenomenon where more than one gene encodes products that perform the same function. This frequently manifests experimentally as a single gene knockout mutant which does not demonstrate a phenotypic change compared to the wild type due to the presence of a paralogous gene performing the same function; a phenotype is only observed when one or more paralogs are knocked out in combination. This presents a challenge in a fundamental goal of genetics, linking genotypes to phenotypes, especially because it is difficult to determine a priori which gene pairs are redundant. Furthermore, while some factors that are associated with redundant genes have been identified, little is known about factors contributing to long-term maintenance of genetic redundancy. Here, we applied a machine learning approach to predict redundancy among benchmark redundant and nonredundant gene pairs in the model plant Arabidopsis thaliana. Predictions were validated using well-characterized redundant and nonredundant gene pairs. Additionally, we leveraged the availability of fitness and multi-omics data in the budding yeast Saccharomyces cerevisiae to build machine learning models for predicting genetic redundancy and related phenotypic outcomes (single and double mutant fitness) among paralogs, and to identify features important in generating these predictions. Collectively, our models of genetic redundancy provide quantitative assessments of how well existing data allow predictions of fitness and genetic redundancy, shed light on characteristics that may contribute to long-term maintenance of paralogs that are seemingly functionally redundant, and will ultimately allow for more targeted generation of phenotypically informative mutants, advancing functional genomic studies.
- Title
- Molecular epidemiology, pangenomic diversity, and comparative genomics of Campylobacter jejuni
- Creator
- Rodrigues, Jose Alexandre
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
-
Campylobacter jejuni, the leading cause of bacterial gastroenteritis in the United States, is often resistant to commonly used antibiotics and has been classified as a serious threat to public health. Through this work, we sought to evaluate infection trends, quantify resistance frequencies, identify epidemiological factors associated with infection, and use whole-genome sequencing (WGS) as well as comparative phylogenomic and pangenomic approaches to understand circulating C. jejuni populations in Michigan. C. jejuni isolates (n=214) were collected from patients via an active surveillance system at four metropolitan hospitals in Michigan between 2011 and 2014. Among the 214 C. jejuni isolates, 135 (63.1%) were resistant to at least one antibiotic. Resistance was observed for all nine antibiotics tested, yielding 11 distinct resistance phenotypes. Tetracycline resistance predominated (n=120; 56.1%), followed by resistance to ciprofloxacin (n=49; 22.9%), which increased from 15.6% in 2011 to 25.0% in 2014. Notably, patients with ciprofloxacin-resistant infections were more likely to report traveling in the past month (Odds Ratio (OR): 3.0; 95% confidence interval (CI): 1.37, 6.68) and international travel (OR: 9.8; 95% CI: 3.69, 26.09). To further characterize these strains, we used WGS to examine the pangenome and investigate the genomic epidemiology of this set of C. jejuni strains recovered from Michigan patients. Among the 214 strains evaluated, 83 unique multilocus sequence types (STs) were identified that were classified as belonging to 19 previously defined clonal complexes (CCs). Core-gene phylogenetic reconstruction based on 615 genes identified three clades, with Clade I comprising six subclades (IA-IF) and predominating (83.2%) among the strains. Because specific cattle-associated STs, such as ST-982, predominated among strains from Michigan patients, we also used WGS to examine a collection of 72 C. jejuni strains from cattle recovered during an overlapping time period. Several phylogenetic analyses demonstrated that most cattle strains clustered separately within the phylogeny, but a subset clustered together with human strains. Hence, we used high-quality single nucleotide polymorphism (hqSNP) profiling to more comprehensively examine those cattle and human strains that clustered together to evaluate the likelihood of interspecies transmission. Notably, this method distinguished highly related strains and identified clusters comprising strains from both humans and cattle. For instance, 88 SNPs separated a cattle and human strain that were previously classified as ST-8, while the human- and cattle-derived ST-982 strains differed by >200 SNPs. These findings demonstrate that highly similar strains were circulating among Michigan patients and cattle during the same time period and highlight the potential for interspecies transmission and diversification within each host. In all, the data presented illustrate that WGS and pangenomic analyses are important tools for enhancing our understanding of the distribution, dissemination, and evolution of specific pathogen populations. Combined with more traditional phenotypic and genotypic approaches, these tools can guide the development of public health prevention and mitigation strategies for C. jejuni and other foodborne pathogens.
- Title
- Non-coding RNA identification in large-scale genomic data
- Creator
- Yuan, Cheng
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Noncoding RNAs (ncRNAs), which function directly as RNAs without being translated into proteins, perform diverse and important biological functions. ncRNAs function not only through their primary structures, but also through their secondary structures, which are defined by interactions between Watson-Crick and wobble base pairs. Common types of ncRNA include microRNA, rRNA, snoRNA, and tRNA. Functions of ncRNAs vary among different types. Recent studies suggest the existence of a large number of ncRNA genes. Identification of novel and known ncRNAs becomes increasingly important in order to understand their functionalities and the underlying communities. Next-generation sequencing (NGS) technology sheds light on more comprehensive and sensitive ncRNA annotation. Lowly transcribed ncRNAs or ncRNAs from rare species with low abundance may be identified via deep sequencing. However, there exist several challenges in ncRNA identification in large-scale genomic data. First, the massive volume of datasets could lead to very long computation time, making existing algorithms infeasible. Second, NGS has a relatively high error rate, which could further complicate the problem. Third, high sequence similarity among related ncRNAs could make them difficult to identify, resulting in incorrect output. Fourth, while secondary structures should be adopted for accurate ncRNA identification, they usually incur high computational complexity. In particular, some ncRNAs contain pseudoknot structures, which cannot be effectively modeled by the state-of-the-art approach. As a result, ncRNAs containing pseudoknots are hard to annotate. In my PhD work, I aimed to tackle the above challenges in ncRNA identification. First, I designed a progressive search pipeline to identify ncRNAs containing pseudoknot structures. The algorithms are more efficient than the state-of-the-art approaches and can be used for large-scale data. Second, I designed an ncRNA classification tool for short reads in NGS data lacking quality reference genomes. The initial homology search phase significantly reduces the size of the original input, making the tool feasible for large-scale data. Last, I focused on identifying 16S ribosomal RNAs from NGS data. 16S ribosomal RNAs are a very important type of ncRNA, which can be used for phylogenetic study. A set of graph-based assembly algorithms were applied to form longer or full-length 16S rRNA contigs. I utilized paired-end information in NGS data, so lowly abundant 16S genes can also be identified. To reduce the complexity of the problem and make the tool practical for large-scale data, I designed a list of error correction and graph reduction techniques for graph simplification.
- Title
- Oocyte and Preimplantation Embryo Cross-Species Transcriptome Meta-Analysis Reveals Divergence at Gene Level but Conservation in Functions
- Creator
- Schall, Peter Zachary
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
Two of the most critical stages in early development occur during the maturation of oocytes and during the first lineage specification at the morula-to-blastocyst transition. The accurate regulation of the transcriptome during these essential events is necessary for the development of a healthy embryo. This thesis presents the culmination of custom pipelines developed to produce three meta-analyses: 1) transcriptome changes during oocyte maturation across four mammalian species (human, rhesus monkey, cow, and mouse), 2) predictive modeling of RNA binding proteins and microRNAs binding to the 3’ UTR, impacting stability during oocyte maturation across four mammalian species (human, rhesus monkey, cow, and mouse), and 3) transcriptome changes during the morula-to-blastocyst transition and the establishment of the inner cell mass and trophectoderm across five mammalian species (human, rhesus monkey, cow, pig, and mouse). The results of these studies reveal that there are relatively few individual transcripts regulated commonly across species, while there are greater shared features at the pathway and functional level. This underscores that different species may utilize a different cohort of genes to accomplish a given outcome. Additionally, the pipelines developed for this thesis are highly applicable across many areas of biology.
- Title
- Profile HMM-based protein domain analysis of next-generation sequencing data
- Creator
- Zhang, Yuan
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Sequence analysis is the process of analyzing DNA, RNA or peptide sequences using a wide range of methodologies in order to understand their functions, structures or evolutionary history. Next generation sequencing (NGS) technologies generate large-scale sequence data of high coverage and nucleotide-level resolution at low costs, benefiting a variety of research areas such as gene expression profiling, metagenomic annotation, ncRNA identification, etc. Therefore, functional analysis of NGS sequences becomes increasingly important because it provides insightful information, such as gene expression, protein composition, and phylogenetic complexity, of the species from which the sequences are generated. One basic step during the functional analysis is to classify genomic sequences into different functional categories, such as protein families or protein domains (or domains for short), which are independent functional units in a majority of annotated protein sequences. The state-of-the-art method for protein domain analysis is based on comparative sequence analysis, which classifies query sequences into annotated protein or domain databases. There are two types of domain analysis methods, pairwise alignment and profile-based similarity search. The first one uses pairwise alignment tools such as BLAST to search query genomic sequences against reference protein sequences in databases such as NCBI-nr. The second one uses profile HMM-based tools such as HMMER to classify query sequences into annotated domain families such as Pfam. Compared to the first method, the profile HMM-based method has a smaller search space and higher sensitivity in remote homolog detection. Therefore, I focus on profile HMM-based protein domain analysis. There are several challenges with protein domain analysis of NGS sequences. First, sequences generated by some NGS platforms such as pyrosequencing have relatively high error rates, making it difficult to classify the sequences into their native domain families. Second, existing protein domain analysis tools have low sensitivity with short query sequences and poorly conserved domain families. Third, the volume of NGS data is usually very large, making it difficult to assemble short reads into longer contigs. In this work, I focus on addressing these three challenges using different methods. To be specific, we have proposed four tools, HMM-FRAME, MetaDomain, SALT, and SAT-Assembler. HMM-FRAME focuses on detecting and correcting frameshift errors in sequences generated by pyrosequencing technology, thus accurately classifying metagenomic sequences containing frameshift errors into their native protein domain families. MetaDomain and SALT are both designed for short reads generated by NGS technologies. MetaDomain uses relaxed position-specific score thresholds and alignment positions to increase the sensitivity while keeping the false positive rate at a low level. SALT combines both position-specific score thresholds and graph algorithms and achieves higher accuracy than MetaDomain. SAT-Assembler conducts targeted gene assembly from large-scale NGS data. It has smaller memory usage, higher gene coverage, and a lower chimera rate compared with existing tools. Finally, I conclude my work and briefly discuss some future work.
- Title
- QTL and transcriptomic analysis between red wheat and white wheat during pre-harvest sprouting induction stage
- Creator
- Su, Yuanjie
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Wheat pre-harvest sprouting (PHS) is a precocious germination of seed in the head when prolonged wet conditions occur during the harvest period. Recent damage caused by PHS occurred in 2008, 2009 and 2011, resulting in severe losses to the Michigan wheat industry. Direct annual losses caused by PHS worldwide can reach up to US $1 billion. Breeding for PHS resistant wheat cultivars is critical for securing soft white wheat production and reducing the economic loss to Michigan farmers, food processors and millers. In general, white wheat is more susceptible to PHS in comparison to red wheat. However, the underlying mechanism connecting seed coat color and PHS resistance has not been clearly described. In this study, a recombinant inbred line population segregating for seed coat color alleles was evaluated for seed coat color and alpha-amylase activity in three years with two treatments. The genotyping results enabled us to group individuals by the specific red allele combinations and allowed us to examine the allelic contribution of each color locus to both seed coat color and alpha-amylase activity. A high-density genetic map based upon the Infinium 9K SNP array was generated to locate QTL in relatively narrow regions. A total of 38 Quantitative Trait Loci (QTL) for seed coat color and alpha-amylase activity were identified from this population and mapped on eleven chromosomes (1B, 2A, 2B, 3A, 3B, 3D, 4B, 5A, 5D, 6B and 7B) from three years and two post-harvest treatments. Most QTL explained 6-15% of the phenotypic variance, while a major QTL on chromosome 2B explained up to 37.6% of the phenotypic variance of alpha-amylase activity in the 2012 non-mist condition. Significant QTL × QTL interactions were also found between and within color and enzyme related traits. Next generation sequencing (NGS) technology was used in the current study to generate a wheat transcriptome using Trinity with two methods: de novo assembly and Genome Guided assembly. Quality assessment of the two assemblies was conducted based on their concordance, completeness and contiguity. Three assembly scenarios were evaluated in order to find a balance between sample specificity and transcriptome completeness. Red wheat and white wheat lines from the previous QTL population were collected under mist and non-mist conditions and their expression profiles were compared to identify differentially expressed (DE) genes. Under the non-mist condition, only around 1% of the genes were differentially expressed between physiologically mature red wheat and white wheat, while the rate increased 10-fold after a 48 hr misting treatment. Annotation of the DE genes showed signature genes involved in the germination process, such as late embryogenesis abundant protein, peroxidase, hydrolase, and several transcription factors. They can be potential key players involved in the underlying genetic networks related to the PHS induction process. Gene Ontology (GO) terms enriched in DE genes were also summarized for each comparison, and germination related molecular functions and biological processes were retrieved. In conclusion, with the population segregating for seed coat color loci, the relationship between seed coat color and alpha-amylase activity was examined using biochemical methods, QTL analysis, and transcriptome profiling. Variation in seed coat color is closely linked with PHS resistance at all three levels. DE genes and enriched GO terms identified were discussed for their potential role in bridging the gap between seed coat color and PHS resistance.