Search results
(21 - 31 of 31)
- Title
- Molecular epidemiology, pangenomic diversity, and comparative genomics of Campylobacter jejuni
- Creator
- Rodrigues, Jose Alexandre
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
- Campylobacter jejuni, the leading cause of bacterial gastroenteritis in the United States, is often resistant to commonly used antibiotics and has been classified as a serious threat to public health. Through this work, we sought to evaluate infection trends, quantify resistance frequencies, identify epidemiological factors associated with infection, and use whole-genome sequencing (WGS) as well as comparative phylogenomic and pangenomic approaches to understand circulating C. jejuni populations in Michigan. C. jejuni isolates (n=214) were collected from patients via an active surveillance system at four metropolitan hospitals in Michigan between 2011 and 2014. Among the 214 C. jejuni isolates, 135 (63.1%) were resistant to at least one antibiotic. Resistance was observed for all nine antibiotics tested, yielding 11 distinct resistance phenotypes. Tetracycline resistance predominated (n=120; 56.1%), followed by resistance to ciprofloxacin (n=49; 22.9%), which increased from 15.6% in 2011 to 25.0% in 2014. Notably, patients with ciprofloxacin-resistant infections were more likely to report traveling in the past month (Odds Ratio (OR): 3.0; 95% confidence interval (CI): 1.37, 6.68) and international travel (OR: 9.8; 95% CI: 3.69, 26.09). To further characterize these strains, we used WGS to examine the pangenome and investigate the genomic epidemiology of this set of C. jejuni strains recovered from Michigan patients. Among the 214 strains evaluated, 83 unique multilocus sequence types (STs) were identified that were classified as belonging to 19 previously defined clonal complexes (CCs). Core-gene phylogenetic reconstruction based on 615 genes identified three clades, with Clade I comprising six subclades (IA-IF) and predominating (83.2%) among the strains. Because specific cattle-associated STs, such as ST-982, predominated among strains from Michigan patients, we also examined by WGS a collection of 72 C. jejuni strains recovered from cattle during an overlapping time period. Several phylogenetic analyses demonstrated that most cattle strains clustered separately within the phylogeny, but a subset clustered together with human strains. Hence, we used high-quality single nucleotide polymorphism (hqSNP) profiling to more comprehensively examine those cattle and human strains that clustered together to evaluate the likelihood of interspecies transmission. Notably, this method distinguished highly related strains and identified clusters comprising strains from both humans and cattle. For instance, 88 SNPs separated a cattle and human strain that were previously classified as ST-8, while the human- and cattle-derived ST-982 strains differed by >200 SNPs. These findings demonstrate that highly similar strains were circulating among Michigan patients and cattle during the same time period and highlight the potential for interspecies transmission and diversification within each host. In all, the data presented illustrate that WGS and pangenomic analyses are important tools for enhancing our understanding of the distribution, dissemination, and evolution of specific pathogen populations. Combined with more traditional phenotypic and genotypic approaches, these tools can guide the development of public health prevention and mitigation strategies for C. jejuni and other foodborne pathogens.
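The SNP comparison described in the abstract above can be illustrated with a minimal sketch. It assumes the strains have already been aligned to a common reference so that positions correspond; the function names, the strain labels, and the toy sequences are hypothetical, and this is not the hqSNP pipeline used in the dissertation, which the abstract does not specify.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count aligned positions where both strains have an unambiguous base and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def pairwise_snp_distances(strains):
    """Return {(name_1, name_2): SNP count} for every unordered pair of strains."""
    return {
        (x, y): snp_distance(strains[x], strains[y])
        for x, y in combinations(sorted(strains), 2)
    }


# Toy, made-up alignment: 'N' marks an ambiguous call that is skipped.
toy_strains = {
    "human_1": "ACGTACGTAC",
    "cattle_1": "ACGTACGTAT",
    "cattle_2": "TCGAACNTAT",
}
distances = pairwise_snp_distances(toy_strains)
```

Pairs with small distances would be candidates for closer epidemiological review; any threshold for calling strains "highly related" is a separate choice that this sketch does not make.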
- Title
- GENOMIC APPLICATIONS TO PLANT BIOLOGY
- Creator
- Hoopes, Genevieve
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
- The study of the total nuclear DNA content of an organism, i.e., the genome, is a relatively new field and has evolved as sequencing technology and its output have changed. A shift from model species to ecological and crop species occurred as sequencing costs decreased and the technology became more broadly accessible, enabling new discoveries in genome biology as increasingly diverse species and populations were profiled. Here, a genome assembly and several transcriptional studies in multiple non-model plant species provided new knowledge of molecular pathways and gene content. Over 157 Mb of the genome of the medicinal plant species Calotropis gigantea (L.) W.T.Aiton was sequenced, de novo assembled and annotated using Next Generation Sequencing technologies. The resulting assembly represents 92% of the genic space and provides a resource for discovery of the enzymes involved in biosynthesis of the anticancer metabolite, cardenolide. An updated gene expression atlas for 79 developmental maize (Zea mays L., 1753) tissues and five abiotic/biotic stress treatments was developed, revealing 4,154 organ-specific and 7,704 stress-induced differentially expressed (DE) genes. Presence-absence variants (PAVs) were enriched for organ-specific and stress-induced DE genes, tended to be lowly expressed, and had few co-expression network connections, suggesting that PAVs function in environmental adaptation and are on an evolutionary path to pseudogenization. The Maize Genomics Resource (http://maize.plantbiology.msu.edu/) was developed to view and data-mine these resources. Through profiling global gene expression over time in potato (Solanum tuberosum L.) leaf and tuber tissue, the first circadian rhythmic gene expression profiles of the below-ground heterotrophic tuber tissue were generated. The tuber displayed a longer circadian period, a delayed phase, and a lower amplitude compared to leaf tissue. Over 500 genes were differentially phased between the leaf and tuber, and many carbohydrate metabolism enzymes are under both diurnal and circadian regulation, reflecting the importance of the circadian clock for tuber bulking. Most core circadian clock genes do not display circadian rhythmic gene expression in the leaf or tuber, yet robust transcriptional and gene expression circadian rhythms are present.
- Title
- EXTRACTING STRUCTURE AND FUNCTION FROM COMPLEX SYSTEMS USING INFORMATION-THEORETIC TOOLS
- Creator
- C G, Nitash
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
- One of the primary areas of scientific research is understanding how complex systems work, both structurally and functionally. In the natural world, complex systems are very high dimensional, with many interacting parts, making studying them difficult and in some cases nearly impossible. Due to the complexity of these systems, a lot of modern research focuses on studying these systems from a computational viewpoint. While this necessarily abstracts away from the true system, we attempt to represent the salient aspects of the system in order to better understand it. Results from such computational studies can yield insight into the natural system, and actively constrain the research space by suggesting hypotheses that can be tested. In this work, I investigate the structure and function of two seemingly disparate complex digital systems. I begin with an investigation of the structure and function of an evolved cognitive architecture, and look at how this structure is affected by environmental changes by developing some new metrics to classify cognitive systems. I then look at the structure of the primordial fitness landscape in a different digital system, and use techniques inspired by information theory to understand the structure of this landscape. I first look at the role of historical contingency in the evolution of life by studying how the structure of this fitness landscape affects the evolutionary trajectories of life. I then investigate how information is encoded in the primordial fitness landscape. I then extend this analysis by developing a general approach for calculating the information content of individual sequences, and use it to analyze the primordial landscape. Finally, I validate this information-theoretic technique by predicting the effects of mutations on the function of a specific protein, and show that this technique can outperform the current state-of-the-art approaches.
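To make the idea of "information content" in the abstract above concrete, here is a minimal sketch of one textbook measure: per-site Shannon information in a set of aligned functional sequences, computed as the maximum entropy minus the observed entropy at each site. This is an illustrative stand-in, not the dissertation's own approach, which the abstract does not detail; the function names, the alphabet size of 4, and the toy sequences are assumptions.

```python
import math
from collections import Counter


def site_information(column, alphabet_size=4):
    """Information (bits) at one aligned site: log2(alphabet size) minus Shannon entropy."""
    total = len(column)
    counts = Counter(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def alignment_information(sequences, alphabet_size=4):
    """Sum the per-site information over all columns of equal-length sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    return sum(site_information(col, alphabet_size) for col in zip(*sequences))


# Toy, made-up sample of functional sequences over a 4-letter alphabet.
toy_sequences = ["ACGT", "ACGA", "ACCA", "ACTT"]
total_bits = alignment_information(toy_sequences)
```

In this toy sample the two fully conserved sites contribute 2 bits each and the two variable sites contribute 0.5 and 1 bit. Estimates from so few sequences are biased, so a real analysis would need a much larger sample or a bias correction.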
- Title
- COMPUTATIONAL ANNOTATIONS OF CELL TYPE SPECIFIC TRANSCRIPTION FACTORS BINDING AND LONG-RANGE ENHANCER-GENE INTERACTIONS
- Creator
- Qi, Wenjie
- Date
- 2022
- Collection
- Electronic Theses & Dissertations
- Description
- Precise execution of cell-type-specific gene transcription is critical for cell differentiation and development. The accurate lineage-specific gene regulation lies in the proper combinatorial binding of transcription factors (TFs) to the cis-regulatory elements. TFs bind to the proximal DNA sequences around the genes to exert control over gene transcription. Recently, experimental studies revealed that enhancers also recruit TFs to stimulate gene expression by forming long-range chromatin interactions, suggesting the interplay between gene, enhancer, and TFs in the 3D space in specifying cell fates. Identification of transcription factor binding sites (TFBSs) as well as pinpointing the long-range chromatin interactions is pivotal for understanding the transcriptional regulatory circuits. Experimental approaches have been developed to profile protein binding as well as the 3D genome but have their limitations. Therefore, accurate and highly scalable computational methods are needed to comprehensively delineate the gene regulatory landscape. Accordingly, I have developed a supervised machine learning model, TF-wave, to predict TFBSs based on DNase-Seq data. By incorporating multi-resolution features generated by applying Wavelet Transform to DNase-Seq data, TF-wave can accurately predict TFBSs at the genome-wide level in a tissue-specific way. I further designed a matrix factorization model, EP3ICO, to jointly infer enhancer-promoter interactions based on protein-protein interactions (PPIs) between TFs with combined orders. Compared with existing algorithms, EP3ICO not only identifies underlying mechanistic regulators that mediate the 3D chromatin interactions but also achieves superior performance in predicting long-range enhancer-promoter links. In conclusion, our models provide new computational approaches for profiling cell-type-specific TF binding and high-resolution chromatin interactions.
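The abstract above mentions multi-resolution features obtained by applying a wavelet transform to DNase-Seq signal. As a rough illustration of what such features look like, the sketch below applies a simple averaging Haar decomposition to a toy coverage vector. The choice of the Haar wavelet, the unnormalized averaging form, the function names, and the signal values are all assumptions made for illustration; the abstract does not say which wavelet or feature layout TF-wave uses.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (coarse signal) and pairwise half-differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_multiresolution(signal, levels):
    """Return detail coefficients for each level (finest first) plus the final coarse signal."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    features = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_step(current)
        features.append(detail)
    features.append(current)
    return features


# Toy, made-up cleavage counts in eight consecutive bins around a candidate site.
toy_coverage = [4, 6, 10, 12, 8, 8, 0, 2]
toy_features = haar_multiresolution(toy_coverage, levels=2)
```

Each level summarizes the signal at twice the bin width of the previous one, so concatenating the levels gives a classifier both fine local shape and broader context around a site.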
- Title
- IDENTIFICATION OF LTR RETROTRANSPOSONS, EVALUATION OF GENOME ASSEMBLY, AND MODELING RICE DOMESTICATION
- Creator
- Ou, Shujun
- Date
- 2018
- Collection
- Electronic Theses & Dissertations
- Description
- The majority of fundamental theories in genetics and evolution were proposed prior to the discovery of DNA as the genetic material in 1952. Those include Darwin’s theory of evolution (1859), Mendelian genetics (1865), Wright and Fisher’s population genetics (1918), and McClintock’s transposition of genetic elements (1951). Nevertheless, the underlying mechanisms of those theories were not fully elucidated until the appearance of DNA sequencing technology. At present, technological advances have minimized the cost of sequencing genomes. The real bottleneck in establishing genomic resources is the annotation of genomic sequences. Long terminal repeat (LTR) retrotransposons are a major type of transposable genetic element and dominate plant genomes. We developed a new method called LTR_retriever for accurate annotation of LTR retrotransposons. Further, we studied genome dynamics, genome size variation, and polyploidy origin using LTR retrotransposons. The presence of LTR retrotransposons challenges current sequencing and assembly techniques due to their size and repetitiveness. We proposed an unbiased metric called LTR Assembly Index (LAI) which utilizes the assembled LTR retrotransposons to evaluate continuity of genome assembly. We revealed the massive gain of continuity for assemblies sequenced with long-read techniques over short-read methods, and further proposed a standardized classification system for genome quality based on LAI. With high-quality genomes, we can extend our knowledge about microevolution events using a population of genomes. The domestication history of rice is still unresolved due to its complicated demographic history. We collected, re-mapped, and re-analyzed 3,485 cultivated and wild rice resequencing accessions. With data imputation, a total of 17.7 million high-quality single-nucleotide polymorphisms (SNPs) were identified. Our dataset is highly accurate as verified by cross-platform Affymetrix Microarray data, with a pairwise concordance rate of 99%. Combining phylogeny, PCA, and ADMIXTURE analyses, we present profound diversification among rice ecotypes.
- Title
- Non-coding RNA identification in large-scale genomic data
- Creator
- Yuan, Cheng
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
- Noncoding RNAs (ncRNAs), which function directly as RNAs without being translated into proteins, perform diverse and important biological functions. ncRNAs function not only through their primary structures, but also secondary structures, which are defined by interactions between Watson-Crick and wobble base pairs. Common types of ncRNA include microRNA, rRNA, snoRNA, and tRNA. Functions of ncRNAs vary among different types. Recent studies suggest the existence of a large number of ncRNA genes. Identification of novel and known ncRNAs becomes increasingly important in order to understand their functionalities and the underlying communities. Next-generation sequencing (NGS) technology sheds light on more comprehensive and sensitive ncRNA annotation. Lowly transcribed ncRNAs or ncRNAs from rare species with low abundance may be identified via deep sequencing. However, there exist several challenges in ncRNA identification in large-scale genomic data. First, the massive volume of datasets could lead to very long computation time, making existing algorithms infeasible. Second, NGS has a relatively high error rate, which could further complicate the problem. Third, high sequence similarity among related ncRNAs could make them difficult to identify, resulting in incorrect output. Fourth, while secondary structures should be adopted for accurate ncRNA identification, they usually incur high computational complexity. In particular, some ncRNAs contain pseudoknot structures, which cannot be effectively modeled by the state-of-the-art approach. As a result, ncRNAs containing pseudoknots are hard to annotate. In my PhD work, I aimed to tackle the above challenges in ncRNA identification. First, I designed a progressive search pipeline to identify ncRNAs containing pseudoknot structures. The algorithms are more efficient than the state-of-the-art approaches and can be used for large-scale data. Second, I designed an ncRNA classification tool for short reads in NGS data lacking quality reference genomes. The initial homology search phase significantly reduces the size of the original input, making the tool feasible for large-scale data. Last, I focused on identifying 16S ribosomal RNAs from NGS data. 16S ribosomal RNAs are a very important type of ncRNA, which can be used for phylogenetic study. A set of graph-based assembly algorithms was applied to form longer or full-length 16S rRNA contigs. I utilized paired-end information in NGS data, so lowly abundant 16S genes can also be identified. To reduce the complexity of the problem and make the tool practical for large-scale data, I designed a list of error correction and graph reduction techniques for graph simplification.
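The abstract above refers to secondary structures built from Watson-Crick and wobble base pairs. The sketch below shows one simple way to read a structure written in dot-bracket notation and label each pair. The notation choice, the function names, and the toy sequence are assumptions for illustration and are not taken from the dissertation's tools.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pairs_from_dot_bracket(structure):
    """Return sorted (i, j) index pairs for matching parentheses in a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return sorted(pairs)


def classify_pairs(sequence, structure):
    """Label each base pair as 'watson-crick', 'wobble', or 'non-canonical'."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    labeled = []
    for i, j in pairs_from_dot_bracket(structure):
        pair = (sequence[i].upper(), sequence[j].upper())
        if pair in WATSON_CRICK:
            kind = "watson-crick"
        elif pair in WOBBLE:
            kind = "wobble"
        else:
            kind = "non-canonical"
        labeled.append((i, j, kind))
    return labeled


# Toy, made-up hairpin: a three-pair stem closing a three-base loop.
toy_pairs = classify_pairs("GGGAAAUCC", "(((...)))")
```

Plain parentheses can only express nested pairs, so a pseudoknot, where pairs cross, cannot be written this way without extra bracket types. That limitation is one reason pseudoknotted ncRNAs are harder to model.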
- Title
- Studies of improving therapeutic outcomes of breast cancer through development of personalized treatments and characterization of gene interactions
- Creator
- Jhan, Jing-Ru
- Date
- 2016
- Collection
- Electronic Theses & Dissertations
- Description
- With an understanding of the heterogeneity of breast cancer, patients with luminal or HER2 breast cancer have more specific treatment options other than traditional chemotherapy, the standard therapy for triple-negative breast cancer (TNBC) patients. However, the response to current treatments as well as the prognosis have been clinical challenges. In fact, breast cancer consists of more subtypes than those routinely used based on gene expression. In addition, gene expression is highly correlated with response to treatment and prognosis. This suggests that the development of personalized treatment with targeted therapy could improve the outcomes, especially for the TNBC subtype. To address this need, I used two approaches, the development of pathway-guided individualized treatment and an understanding of the interactions of potential genes for targeted therapy. Considering the complexity of gene and pathway interactions, the probability of pathway activation was predicted using pathway signatures generated by comparing gene expression differences between cells overexpressing genes of interest and those expressing GFP. This approach was validated in two subtypes of mouse mammary tumors from MMTV-Myc mice, and then further validated in human TNBC patient-derived xenografts (PDXs). The inhibition of tumor growth in mouse mammary tumors and the regression of tumors in PDXs were observed. These proof-of-principle experiments demonstrated the flexibility of pathway-guided personalized treatment. Because this approach needs the combination of different targeted therapies, it is necessary to understand the characteristics of these targeted genes and therapies, such as gene-gene interactions. To meet this demand, I studied the effects of Stat3 in Myc-driven tumors. Here, MMTV-Myc mice with a conditional Stat3 knockout were generated. I noted that the deletion of Stat3 in MMTV-Myc mice accelerated tumorigenesis and delayed tumor growth, with an alteration in the frequency of histological subtypes. These tumors also had deficient angiogenesis. Unexpectedly, mice with this genotype had lactation deficiencies, and pup lethality was observed. This model shared some of the same effects of loss of Stat3 in other oncogene-induced tumors and also had distinct effects compared with other models. This suggests that the oncogene drivers determine the role of Stat3 as an oncogene or tumor suppressor, and emphasizes again the importance of understanding the pathways and interactions in the development of treatment. In sum, these studies demonstrate the potential of guiding individualized treatments in preclinical platforms using bioinformatics analyses. Combined with other genomic profiles, this approach could offer more complete assessments before being translated to practice. In addition, this could be further applied in adaptive clinical trials through matching with mouse models.
- Title
- Development of a nanoparticle-based electrochemical bio-barcode DNA biosensor for multiplexed pathogen detection on screen-printed carbon electrodes
- Creator
- Zhang, Deng
- Date
- 2011
- Collection
- Electronic Theses & Dissertations
- Description
- A highly amplified, nanoparticle-based, bio-barcoded electrochemical biosensor for the simultaneous multiplexed detection of the protective antigen A (pagA) gene (accession number = M22589) from Bacillus anthracis and the insertion element (Iel) gene (accession number = Z83734) from Salmonella Enteritidis was developed. The biosensor system is mainly composed of three nanoparticles: gold nanoparticles (AuNPs), magnetic nanoparticles (MNPs), and nanoparticle tracers (NTs), such as lead sulfide (PbS) and cadmium sulfide (CdS). The AuNPs are coated with the first target-specific DNA probe (1pDNA), which can recognize one end of the target DNA sequence (tDNA), and many NT-terminated bio-barcode ssDNA (bDNA-NT), which act as signal reporters and amplifiers. The MNPs are coated with the second target-specific DNA probe (2pDNA) that can recognize the other end of the target gene. After binding the nanoparticles with the target DNA, the following sandwich structure is formed: MNP-2pDNA/tDNA/1pDNA-AuNP-bDNA-NTs. A magnetic field is applied to separate the sandwich structure from the unreacted materials. Because the AuNPs have a large number of nanoparticle tracers per DNA probe binding event, there is substantial amplification. After the nanoparticle tracer is dissolved in 1 mol/L nitric acid, the NT ions, such as Pb2+ and Cd2+, show distinct non-overlapping stripping curves by square wave anodic stripping voltammetry (SWASV) on screen-printed carbon electrode (SPCE) chips. The oxidation potential of NT ions is unique for each nanoparticle tracer and the peak current is related to the target DNA concentration. The results show that the biosensor has good specificity, and the sensitivity of single detection of the pagA gene from Bacillus anthracis using PbS NTs is as low as 0.2 pg/mL. The detection limit of this multiplex bio-barcoded DNA sensor is 50 pg/mL using PbS or CdS NTs. The nanoparticle-based bio-barcoded DNA sensor has potential applications for multiple detections of bioterrorism threat agents, co-infection, and contaminants in the same sample.
- Title
- Profile HMM-based protein domain analysis of next-generation sequencing data
- Creator
- Zhang, Yuan
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
- Sequence analysis is the process of analyzing DNA, RNA or peptide sequences using a wide range of methodologies in order to understand their functions, structures or evolutionary history. Next generation sequencing (NGS) technologies generate large-scale sequence data of high coverage and nucleotide level resolution at low costs, benefiting a variety of research areas such as gene expression profiling, metagenomic annotation, ncRNA identification, etc. Therefore, functional analysis of NGS sequences becomes increasingly important because it provides insightful information, such as gene expression, protein composition, and phylogenetic complexity, of the species from which the sequences are generated. One basic step during the functional analysis is to classify genomic sequences into different functional categories, such as protein families or protein domains (or domains for short), which are independent functional units in a majority of annotated protein sequences. The state-of-the-art method for protein domain analysis is based on comparative sequence analysis, which classifies query sequences into annotated protein or domain databases. There are two types of domain analysis methods, pairwise alignment and profile-based similarity search. The first one uses pairwise alignment tools such as BLAST to search query genomic sequences against reference protein sequences in databases such as NCBI-nr. The second one uses profile HMM-based tools such as HMMER to classify query sequences into annotated domain families such as Pfam. Compared to the first method, the profile HMM-based method has a smaller search space and higher sensitivity with remote homolog detection. Therefore, I focus on profile HMM-based protein domain analysis. There are several challenges with protein domain analysis of NGS sequences. First, sequences generated by some NGS platforms such as pyrosequencing have relatively high error rates, making it difficult to classify the sequences into their native domain families. Second, existing protein domain analysis tools have low sensitivity with short query sequences and poorly conserved domain families. Third, the volume of NGS data is usually very large, making it difficult to assemble short reads into longer contigs. In this work, I focus on addressing these three challenges using different methods. To be specific, we have proposed four tools, HMM-FRAME, MetaDomain, SALT, and SAT-Assembler. HMM-FRAME focuses on detecting and correcting frameshift errors in sequences generated by pyrosequencing technology, thus accurately classifying metagenomic sequences containing frameshift errors into their native protein domain families. MetaDomain and SALT are both designed for short reads generated by NGS technologies. MetaDomain uses relaxed position-specific score thresholds and alignment positions to increase the sensitivity while keeping the false positive rate at a low level. SALT combines both position-specific score thresholds and graph algorithms and achieves higher accuracy than MetaDomain. SAT-Assembler conducts targeted gene assembly from large-scale NGS data. It has smaller memory usage, higher gene coverage, and a lower chimera rate compared with existing tools. Finally, I conclude my work and briefly discuss future work.
- Title
- Studying the effects of sampling on the efficiency and accuracy of k-mer indexes
- Creator
- Almutairy, Meznah
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
"Searching for local alignments is a critical step in many bioinformatics applications and pipelines. This search process is often sped up by finding shared exact matches of a minimum length. Depending on the application, the shared exact matches are extended to maximal exact matches, and these are often extended further to local alignments by allowing mismatches and/or gaps. In this dissertation, we focus on searching for all maximal exact matches (MEMs) and all highly similar local...
Show more"Searching for local alignments is a critical step in many bioinformatics applications and pipelines. This search process is often sped up by finding shared exact matches of a minimum length. Depending on the application, the shared exact matches are extended to maximal exact matches, and these are often extended further to local alignments by allowing mismatches and/or gaps. In this dissertation, we focus on searching for all maximal exact matches (MEMs) and all highly similar local alignments (HSLAs) between a query sequence and a database of sequences. We focus on finding MEMs and HSLAs over nucleotide sequences. One of the most common ways to search for all MEMs and HSLAs is to use a k-mer index such as BLAST. A major problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. We classify sampling strategies used to create k-mer indexes in two ways: how they choose k-mers and how many k-mers they choose. The k-mers can be chosen in two ways: fixed sampling and minimizer sampling. A sampling method might select enough k-mers such that the k-mer index reaches full accuracy. We refer to this sampling as hard sampling. Alternatively, a sampling method might select fewer k-mers to reduce the index size even further but the index does not guarantee full accuracy. We refer to this sampling as soft sampling. In the current literature, no systematic study has been done to compare the different sampling methods and their relative benefits/weakness. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. 
Also, most previous work uses hard sampling, in which all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the k-mer index at a cost of decreasing query accuracy. We systematically compare fixed and minimizer sampling to find all MEMs between large genomes such as the human genome and the mouse genome. We also study soft sampling to find all HSLAs using the NCBI BLAST tool with the human genome and human ESTs. We use BLAST, since it is the most widely used tool to search for HSLAs. We compared the sampling methods with respect to index size, query time, and query accuracy. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. The results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy."--Pages ii-iii.
- Title
- QTL and transcriptomic analysis between red wheat and white wheat during pre-harvest sprouting induction stage
- Creator
- Su, Yuanjie
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Wheat pre-harvest sprouting (PHS) is the precocious germination of seed in the head when prolonged wet conditions occur during the harvest period. Recent damage caused by PHS occurred in 2008, 2009 and 2011, resulting in severe losses to the Michigan wheat industry. Direct annual losses caused by PHS worldwide can reach up to US $1 billion. Breeding for PHS-resistant wheat cultivars is critical for securing soft white wheat production and reducing the economic loss to Michigan farmers, food processors and millers. In general, white wheat is more susceptible to PHS than red wheat. However, the underlying mechanism connecting seed coat color and PHS resistance has not been clearly described. In this study, a recombinant inbred line population segregating for seed coat color alleles was evaluated for seed coat color and alpha-amylase activity in three years with two treatments. The genotyping results enabled us to group individuals by the specific red allele combinations and allowed us to examine the allelic contribution of each color locus to both seed coat color and alpha-amylase activity. A high-density genetic map based upon the Infinium 9K SNP array was generated to locate QTL in relatively narrow regions. A total of 38 Quantitative Trait Loci (QTL) for seed coat color and alpha-amylase activity were identified from this population and mapped on eleven chromosomes (1B, 2A, 2B, 3A, 3B, 3D, 4B, 5A, 5D, 6B and 7B) from three years and two post-harvest treatments. Most QTL explained 6-15% of the phenotypic variance, while a major QTL on chromosome 2B explained up to 37.6% of the phenotypic variance of alpha-amylase activity in the 2012 non-mist condition. Significant QTL × QTL interactions were also found between and within color- and enzyme-related traits. Next-generation sequencing (NGS) technology was used in the current study to generate a wheat transcriptome using Trinity with two methods: de novo assembly and genome-guided assembly.
Quality assessment of the two assemblies was conducted based on their concordance, completeness and contiguity. Three assembly scenarios were evaluated in order to find a balance between sample specificity and transcriptome completeness. Red wheat and white wheat lines from the previous QTL population were collected under mist and non-mist conditions and their expression profiles were compared to identify differentially expressed (DE) genes. Under the non-mist condition, only around 1% of the genes were differentially expressed between physiologically mature red wheat and white wheat, whereas the rate increased 10-fold after a 48 hr misting treatment. Annotation of the DE genes showed signature genes involved in the germination process, such as late embryogenesis abundant protein, peroxidase, hydrolase, and several transcription factors. They are potential key players in the underlying genetic networks related to the PHS induction process. Gene Ontology (GO) terms enriched in DE genes were also summarized for each comparison, and germination-related molecular function and biological process terms were retrieved. In conclusion, with the population segregating for seed coat color loci, the relationship between seed coat color and alpha-amylase activity was examined using biochemical methods, QTL analysis, and transcriptome profiling. Variation in seed coat color was closely linked with PHS resistance at all three levels. DE genes and enriched GO terms identified were discussed for their potential role in bridging the gap between seed coat color and PHS resistance.