Search results (1 - 11 of 11)
- Title
- Computational identification and analysis of non-coding RNAs in large-scale biological data
- Creator
- Lei, Jikai
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
- Non-protein-coding RNAs (ncRNAs) are RNA molecules that function directly at the level of RNA without being translated into protein. They play important biological roles in all three domains of life, i.e. Eukarya, Bacteria and Archaea. To understand the working mechanisms and the functions of ncRNAs in various species, a fundamental step is to identify both known and novel ncRNAs from large-scale biological data. Large-scale genomic data includes both genomic sequence data and NGS sequencing data, and both types provide great opportunities for identifying ncRNAs. For genomic sequence data, many ncRNA identification tools that use comparative sequence analysis have been developed. These methods work well for ncRNAs with strong sequence similarity, but they are not well suited for detecting ncRNAs that are only remotely homologous. Next-generation sequencing (NGS), while it opens a new horizon for annotating and understanding known and novel ncRNAs, also introduces many challenges. First, existing genomic sequence search tools cannot be readily applied to NGS data because NGS technology produces short, fragmentary reads. Second, most NGS data sets are large-scale, and existing algorithms are infeasible on them because of high resource requirements. Third, metagenomic sequencing, which uses NGS technology to sequence uncultured, complex microbial communities directly from their natural habitats, further aggravates these difficulties. Thus, the massive amount of genomic sequence data and NGS data calls for efficient algorithms and tools for ncRNA annotation. In this dissertation, I present three computational methods and tools to efficiently identify ncRNAs from large-scale biological data. Chain-RNA is a tool that combines both sequence similarity and structure similarity to locate cross-species conserved RNA elements with low sequence similarity in genomic sequence data. It achieves significantly higher sensitivity in identifying remotely conserved ncRNA elements than sequence-based methods such as BLAST, and is much faster than existing structural alignment tools. miR-PREFeR (miRNA PREdiction From small RNA-Seq data) uses the expression patterns of miRNAs and follows the criteria for plant microRNA annotation to accurately predict plant miRNAs from one or more small RNA-Seq data samples. It is sensitive, accurate, fast, and has a low memory footprint. metaCRISPR focuses on identifying Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) in large-scale metagenomic sequencing data. It uses a k-mer hash table to efficiently detect reads that belong to CRISPRs in the raw metagenomic data set. Overlap-graph-based clustering is then conducted on the reduced data set to separate different CRISPRs, and a set of graph-based algorithms is used to assemble and recover CRISPRs from the clusters.
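The abstract above describes metaCRISPR's first step as detecting CRISPR-associated reads with a k-mer hash table. A minimal sketch of that general idea follows; the seed repeat sequence, k, and the hit threshold are illustrative assumptions, not the tool's actual parameters or algorithm.

```python
# Minimal sketch: k-mer hash filtering of reads, in the spirit of metaCRISPR's
# first pass. The seed repeat, k, and the match threshold are illustrative only.

def kmers(seq, k):
    """Yield all k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_kmer_table(repeat_seqs, k=21):
    """Hash table (set) of k-mers drawn from candidate CRISPR repeat sequences."""
    table = set()
    for seq in repeat_seqs:
        table.update(kmers(seq, k))
    return table

def filter_reads(reads, table, k=21, min_hits=3):
    """Keep reads sharing at least min_hits k-mers with the repeat table."""
    kept = []
    for read in reads:
        hits = sum(1 for km in kmers(read, k) if km in table)
        if hits >= min_hits:
            kept.append(read)
    return kept

# Toy usage: two reads, only the second contains the made-up seed repeat.
seed_repeats = ["GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC"]
reads = [
    "ACGT" * 10,
    "AAAA" + seed_repeats[0] + "TTTT",
]
table = build_kmer_table(seed_repeats)
print(filter_reads(reads, table))  # only the second read is retained
```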
- Title
- Data-driven and task-specific scoring functions for predicting ligand binding poses and affinity and for screening enrichment
- Creator
- Ashtawy, Hossam Mohamed Farg
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
- Molecular modeling has become an essential tool in the early stages of drug discovery and development. Molecular docking, scoring, and virtual screening are three such modeling tasks of particular importance in computer-aided drug discovery. They are used to computationally simulate the interaction between small drug-like molecules, known as ligands, and a target protein whose activity is to be altered. Scoring functions (SFs) are typically employed to predict the binding conformation (docking task), binary activity label (screening task), and binding affinity (scoring task) of ligands against a critical protein in the disease's pathway. In most molecular docking software packages available today, a generic binding-affinity-based (BA-based) SF is invoked for all three tasks, even though they pose three different, but related, prediction problems. The vast majority of these predictive models are knowledge-based, empirical, or force-field scoring functions. A fourth family of SFs, based on machine-learning (ML) approaches, has gained popularity recently and has shown potential for improved accuracy. Despite intense efforts in developing conventional and current ML SFs, their limited predictive accuracy on these three tasks has been a major roadblock toward cost-effective drug discovery. Therefore, in this work we present (i) novel task-specific and multi-task SFs employing large ensembles of deep neural networks (NNs) and other state-of-the-art ML algorithms, in conjunction with (ii) data-driven, multi-perspective descriptors (features) for accurate characterization of protein-ligand complexes (PLCs), extracted using our Descriptor Data Bank (DDB) platform. We assess the docking, screening, scoring, and ranking accuracies of the proposed task-specific SFs with DDB descriptors, as well as several conventional approaches, on the 2007 and 2014 PDBbind benchmarks, which encompass a diverse set of high-quality PLCs. Our approaches substantially outperform conventional SFs based on BA and single-perspective descriptors in all tests. In terms of scoring accuracy, we find that the ensemble NN SFs, BsN-Score and BgN-Score, achieve more than 34% better correlation (0.844 and 0.840 vs. 0.627) between predicted and measured BAs than X-Score, a top-performing conventional SF. We further find that ensemble NN models surpass SFs based on other state-of-the-art ML algorithms. Similar results are obtained for the ranking task: within clusters of PLCs with different ligands bound to the same target protein, the best ensemble NN SF ranks the ligands correctly 64.6% of the time, compared to 57.8% for X-Score. A substantial improvement in the docking task has also been achieved by our proposed docking-specific SFs. The docking NN SF, BsN-Dock, has a success rate of 95% in identifying poses that are within 2 Å RMSD of the native poses across 65 different protein families, compared to a success rate of only 82% for the best conventional SF, ChemPLP, employed in the commercial docking software GOLD. As for the ability to distinguish active molecules from inactives, our screening-specific SFs show excellent improvements over conventional approaches: the proposed SF BsN-Screen achieves a screening enrichment factor of 33.90, as opposed to 19.54 for the best conventional SF, GlideScore, employed in the docking software Glide.
For all tasks, we observe that the proposed task-specific SFs benefit more than their conventional counterparts from increases in the number of descriptors and training PLCs, and they also perform better on novel proteins they were never trained on. In addition to the three task-specific SFs, we propose a novel multi-task deep neural network (MT-Net) that is trained on data from all three tasks to simultaneously predict binding poses, affinities, and activity labels. MT-Net is composed of hidden layers shared across the three tasks to learn common features, task-specific hidden layers for higher-level feature representations, and three outputs, one per task. We show that the performance of MT-Net is superior to conventional SFs and competitive with other ML approaches. Based on current results and potential improvements, we believe our proposed ideas will have a transformative impact on the accuracy and outcomes of molecular docking and virtual screening.
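The MT-Net description above (shared hidden layers, task-specific hidden layers, three outputs) maps naturally onto a small multi-task network. The sketch below illustrates that layout only; the layer widths, activations, and use of PyTorch are assumptions, not the dissertation's implementation.

```python
# Hedged sketch of the multi-task layout described for MT-Net: a shared trunk,
# task-specific hidden layers, and one output head per task (pose, affinity,
# activity). Sizes and activations are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskSF(nn.Module):
    def __init__(self, n_descriptors, shared=(512, 256), task_hidden=128):
        super().__init__()
        layers, prev = [], n_descriptors
        for width in shared:                      # shared trunk: common features
            layers += [nn.Linear(prev, width), nn.ReLU()]
            prev = width
        self.trunk = nn.Sequential(*layers)
        def head(out_dim):                        # task-specific layer + output
            return nn.Sequential(nn.Linear(prev, task_hidden), nn.ReLU(),
                                 nn.Linear(task_hidden, out_dim))
        self.pose_head = head(1)      # docking: pose score
        self.affinity_head = head(1)  # scoring: binding affinity
        self.activity_head = head(1)  # screening: active/inactive logit

    def forward(self, x):
        h = self.trunk(x)
        return self.pose_head(h), self.affinity_head(h), self.activity_head(h)

# Toy usage with random descriptors for a batch of 4 protein-ligand complexes.
model = MultiTaskSF(n_descriptors=300)
pose, affinity, activity = model(torch.randn(4, 300))
```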
- Title
- Rhizosphere metagenomics of three biofuel crops
- Creator
- Guo, Jiarong
- Date
- 2016
- Collection
- Electronic Theses & Dissertations
- Description
- "Soil microbes form beneficial associations with crops in the rhizosphere and also play a major role in ecosystem functions, such as the N and C cycles. Thus, large-scale cultivation of biofuel crops will have a significant impact on ecosystem functions, at least regionally. In recent years, advances in high-throughput sequencing technologies have enabled metagenomics, which in turn opens new ways to access the unknown majority in microbiology but poses great challenges for data analysis due to the large data size and short read length of sequence data sets. We generated about 1 Tb of shotgun metagenomic data from rhizosphere soil samples of three biofuel crops: corn, switchgrass, and Miscanthus. My central goal is to devise methods to extract meaning from these rhizosphere metagenomic data, with a focus on N cycle genes, since N is the most limiting resource for sustainable biofuel production. I initially provide a review of gene-targeted methods for analyzing shotgun metagenomes. In the second chapter I develop a method that improves the speed with which rRNA gene fragments can be found and analyzed in large shotgun metagenome data sets, thereby avoiding the primer bias and chimeras that are problematic with PCR-based methods. I present a pipeline, SSUsearch, that achieves faster identification of small subunit rRNA gene fragments and provides unsupervised community analysis. The pipeline also includes classification and copy number correction, and its output can be used in traditional amplicon downstream analysis platforms. Shotgun-derived rhizosphere data from this pipeline yielded higher diversity estimates than amplicon data but retained the grouping of samples in ordination analysis. Our analysis confirmed the known bias against Verrucomicrobia in a commonly used V6-V8 primer set and uncovered likely biases against Actinobacteria and for Verrucomicrobia in a commonly used V4 primer set. In the third chapter, I explore an alternative phylogenetic marker to the widely used SSU rRNA gene, which has several limitations, including multiple copies in the same genome and low resolution for differentiating strains. I demonstrate that rplB, a single-copy protein-coding gene, provides finer resolution, more akin to the species and subspecies level, and thus finer-scale (OTU) diversity analysis. The method requires shotgun sequence data, since the gene is not conserved enough for recovery by primers. When applied to the rhizosphere sequence data, it revealed more microbial diversity and better differentiated the communities among the three crops than SSU rRNA gene analysis. In the last chapter I address my central biological question: do the rhizosphere metagenomes differ among the three crops, and what does this information suggest about function? I compare the rhizosphere metagenomes for overall community structure (SSU rRNA gene), overall function (annotation from global assembly), and N cycle genes (using Xander, a targeted gene assembly tool). All three levels showed that corn had a significantly different community from Miscanthus and switchgrass (except for ammonia-oxidizing Archaea), and that the two perennials showed a trend of separation. In terms of life history strategy, the corn rhizosphere was enriched in copiotrophs while the perennials were enriched in oligotrophs. This is further supported by a higher abundance of genes in the 'Carbohydrates' subsystem category and higher fungi-to-bacteria ratios.
Additionally, the nitrogen-fixing community of corn was dominated by nifH genes most closely affiliated with Rhizobium and Bradyrhizobium, while the perennials had nifH sequences most closely related to Coraliomargarita, Novosphingobium and Azospirillum, indicating that the perennials independently selected beneficial members. Moreover, higher numbers of nitrogen fixation genes and lower numbers of nitrite reduction genes suggest better nitrogen sustainability of the perennials. These data indicate that perennial bioenergy crops have advantages over corn in higher microbial species richness and functional diversity, as well as in selecting members with beneficial traits, consistent with N use efficiency."--Pages ii-iii.
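The SSUsearch pipeline above mentions copy number correction. As a hedged illustration of what that step usually involves, the sketch below divides per-taxon rRNA read counts by assumed rRNA copy numbers and renormalizes; the taxa and copy numbers are toy values, not the pipeline's actual data or code.

```python
# Hedged sketch of rRNA copy number correction: divide each taxon's read count
# by its assumed rRNA operon copy number, then renormalize to relative abundance.
# Values below are toy numbers, not real data.

def copy_number_correct(counts, copy_numbers):
    """counts, copy_numbers: dicts keyed by taxon. Returns relative abundances."""
    corrected = {t: counts[t] / copy_numbers.get(t, 1.0) for t in counts}
    total = sum(corrected.values())
    return {t: v / total for t, v in corrected.items()}

raw_counts = {"TaxonA": 120, "TaxonB": 40}      # reads assigned per taxon
copies = {"TaxonA": 6.0, "TaxonB": 1.0}         # assumed rRNA copy numbers
print(copy_number_correct(raw_counts, copies))  # TaxonA: ~0.33, TaxonB: ~0.67
```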
- Title
- Studying the effects of sampling on the efficiency and accuracy of k-mer indexes
- Creator
- Almutairy, Meznah
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
- "Searching for local alignments is a critical step in many bioinformatics applications and pipelines. This search is often sped up by first finding shared exact matches of a minimum length. Depending on the application, the shared exact matches are extended to maximal exact matches, and these are often extended further to local alignments by allowing mismatches and/or gaps. In this dissertation, we focus on searching for all maximal exact matches (MEMs) and all highly similar local alignments (HSLAs) between a query sequence and a database of nucleotide sequences. One of the most common ways to search for all MEMs and HSLAs is to use a k-mer index such as BLAST. A major problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also the query time, is sampling, where only some k-mer occurrences are stored. We classify the sampling strategies used to create k-mer indexes in two ways: how they choose k-mers and how many k-mers they choose. The k-mers can be chosen in two ways: fixed sampling and minimizer sampling. A sampling method might select enough k-mers that the k-mer index retains full accuracy; we refer to this as hard sampling. Alternatively, a sampling method might select fewer k-mers to reduce the index size even further, at the cost of no longer guaranteeing full accuracy; we refer to this as soft sampling. In the current literature, no systematic study has compared the different sampling methods and their relative benefits and weaknesses. It is well known that fixed sampling produces a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling produces faster query times, since query k-mers can also be sampled; however, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. Also, most previous work uses hard sampling, in which all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the k-mer index at the cost of some query accuracy. We systematically compare fixed and minimizer sampling for finding all MEMs between large genomes such as the human and mouse genomes. We also study soft sampling for finding all HSLAs using the NCBI BLAST tool with the human genome and human ESTs; we use BLAST since it is the most widely used tool for searching for HSLAs. We compare the sampling methods with respect to index size, query time, and query accuracy, and reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling, at the cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling typically requires half as much space, whereas minimizer sampling processes queries slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. The results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs.
We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy."--Pages ii-iii.
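Fixed and minimizer sampling, the two k-mer selection strategies compared above, can be stated compactly in code. This sketch is only meant to make the distinction concrete; the lexicographic ordering and the toy sequence are assumptions, not the dissertation's implementation.

```python
# Hedged sketch contrasting the two sampling strategies. Fixed sampling keeps
# every w-th k-mer position; minimizer sampling keeps, for each window of w
# consecutive k-mers, the position of the smallest k-mer (lexicographic order
# is assumed here; any total order on k-mers works).

def fixed_sampling(seq, k, w):
    """Positions of k-mers sampled at a fixed stride w."""
    return list(range(0, len(seq) - k + 1, w))

def minimizer_sampling(seq, k, w):
    """Positions of window minimizers over all windows of w consecutive k-mers."""
    n = len(seq) - k + 1
    kept = set()
    for start in range(0, n - w + 1):
        window = range(start, start + w)
        kept.add(min(window, key=lambda i: seq[i:i + k]))
    return sorted(kept)

seq = "ACGTACGGTTACGTAGCCA"
print(fixed_sampling(seq, k=5, w=4))      # [0, 4, 8, 12]
print(minimizer_sampling(seq, k=5, w=4))  # typically denser than fixed sampling
```

The two outputs make the space trade-off discussed above concrete: for the same k and w, minimizer sampling usually retains more positions than fixed sampling, which is consistent with fixed sampling producing the smaller index.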
- Title
- Scalable phylogenetic analysis and functional interpretation of genomes with complex evolutionary histories
- Creator
- Hejase, Hussein El Abbass
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
- "Phylogenomics involves the inference of a genome-scale phylogeny. A phylogeny is typically inferred from sequences at multiple loci across the genomes of multiple organisms by reconstructing gene trees and then reconciling them into a species phylogeny. Many studies have shown that evolutionary processes such as gene flow, incomplete lineage sorting, recombination, selection, and gene duplication and loss have shaped our genomes and played a major role in the evolution of a diverse array of organisms, including humans and ancient hominins, mice, bacteria, and butterflies. These evolutionary processes are primary causes of gene tree discordance, in which different loci in a genome exhibit local genealogical variation (i.e. gene trees differing from each other and from the species phylogeny in topology and/or branch lengths). In this dissertation, we develop a method for fast and accurate inference of phylogenetic networks from large-scale sequence data. The advent of high-throughput sequencing technologies has brought about two main scalability challenges: (1) dataset size, in terms of the number of taxa, and (2) the evolutionary divergence of the taxa in a study. We explore the impact of both dimensions of scale on phylogenetic network inference and then introduce a new phylogenetic divide-and-conquer method which we call FastNet. Using synthetic and empirical data spanning a range of evolutionary scenarios, we show that FastNet outperforms the state of the art in both accuracy and computational requirements. Furthermore, we develop methods that use better and more accurate phylogenies to functionally interpret genomes. One way to study and understand the biological function of genomes is association mapping, which pinpoints statistical associations between genotypic and phenotypic characters while modeling the relatedness between samples to avoid spurious inferences. Many methods have been proposed to perform association mapping while accounting for sample relatedness, but the state of the art predominantly relies on the simplifying assumption that sample relatedness is effectively fixed across the genome. Recent studies have shown that sample relatedness can vary greatly across different loci within a genome, where gene trees can differ from each other and from the species phylogeny. Thus, there is a pressing need for methods that account for local genealogical variation in functional genomic analyses. We address this methodological gap by introducing two methods, Coal-Map and Coal-Miner, which account for sample relatedness both locally within loci and globally across the entire genome. We show through simulated and empirical datasets that these newly introduced methods offer comparable or typically better statistical power and type I error control compared to the state of the art."--Pages ii-iii.
- Title
- Approaches to scaling and improving metagenome sequence assembly
- Creator
- Pell, Jason (Jason A.)
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
- Since the completion of the Human Genome Project in the early 2000s, new high-throughput sequencing technologies have been developed that produce more DNA sequence reads at a much lower cost. Because of this, large quantities of data have been generated that are difficult to analyze computationally, not only because of the sheer number of reads but also because of sequencing errors. One area where this is a particularly difficult problem is metagenomics, where an ensemble of microbes in an environmental sample is sequenced. In this scenario, blends of species with varying abundance levels must be processed together in a bioinformatics pipeline. One common goal with a sequencing dataset is to assemble the genome from the set of reads, but since comparing reads with one another scales quadratically, new algorithms had to be developed to handle the large quantities of short reads generated by the latest sequencers. These assembly algorithms frequently use de Bruijn graphs, in which reads are broken down into k-mers, or small DNA words of a fixed size k. Despite these algorithmic advances, DNA sequence assembly still scales poorly due to errors and inefficient use of computer memory. In this dissertation, we develop approaches to tackle the current shortcomings in metagenome sequence assembly. First, we devise the novel use of a Bloom filter, a probabilistic data structure with false positives, for storing a de Bruijn graph in memory. We study the properties of the de Bruijn graph with false positives in detail and observe that the components in the graph abruptly connect together at a specific false positive rate. We then analyze the memory efficiency of a partitioning algorithm at various false positive rates and find that this approach can lead to a 40x decrease in memory usage. Extending the idea of a probabilistic de Bruijn graph, we then develop a two-pass error correction algorithm that discards erroneous reads and corrects the remaining majority to be more accurate. In the first pass, we use the digital normalization algorithm to retain reads carrying novel information and discard reads whose coverage is already sufficient. In the second pass, a read-to-graph alignment strategy is used to correct reads, with some heuristics employed to improve performance. We evaluate the algorithm on an E. coli dataset as well as a mock human gut metagenome dataset and find that the error correction strategy works as intended.
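The central data structure described above, a de Bruijn graph stored in a Bloom filter, can be illustrated with a short sketch. The filter size, hash construction, and neighbor query below are illustrative assumptions, not the thesis software.

```python
# Hedged sketch of the core idea: store the k-mers of a de Bruijn graph in a
# Bloom filter, trading a tunable false positive rate for memory.
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.bits = bytearray(n_bits)          # one byte per bit, for simplicity
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def graph_neighbors(kmer, bloom):
    """Implicit de Bruijn graph: successors are k-mers present in the filter."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in bloom]

# Toy usage: load the k-mers of one read, then query one node's successors.
k, read = 5, "ACGTACGTT"
bloom = BloomFilter()
for i in range(len(read) - k + 1):
    bloom.add(read[i:i + k])
print(graph_neighbors("ACGTA", bloom))   # ['CGTAC'] (barring false positives)
```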
- Title
- Pervasive alternative RNA editing in Trypanosoma brucei
- Creator
- Kirby, Laura Elizabeth
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- Description
- "Trypanosoma brucei is a single-celled eukaryote that uses a complex RNA editing system to render many of its mitochondrial genes translatable. Editing of these genes requires multiple small RNAs, called guide RNAs (gRNAs), to direct the insertion and deletion of uridines. These gRNAs act sequentially, each generating the anchor binding site for the next gRNA. This sequential dependence should render the process quite fragile, and mutations in the gRNAs should not be tolerated. Yet in examining the gRNA transcriptome of T. brucei, many gRNAs were identified that are capable of generating alternative mRNA sequences and potentially disrupting the editing process. In this work, the effects of alternative editing are characterized. The analysis revealed a role for gRNAs in the developmental regulation of gene expression, showing a correlation between the abundance of the initiating gRNAs and the expression of the corresponding genes at two different points in the life cycle of T. brucei. This study also revealed the existence of mitochondrial dual-coding genes, which provide protection for genetic material that is not under selection at all points of the life cycle of T. brucei. Examination of these dual-coding genes showed that RNA editing patterns can shift between cell lines and under different energetic conditions. Examining the gRNAs involved in these editing pathways revealed that a high number of mismatched base pairs is tolerated while editing still functions, and that gRNA abundance is not a reliable predictor of editing preference. Finally, a reexamination of the gRNA transcriptome revealed that many gRNAs remain unidentified and most likely generate new alternatively edited sequences."--Pages ii-iii.
- Title
- Novel computational approaches to investigate microbial diversity
- Creator
- Zhang, Qingpeng
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
- Species diversity is an important measure of ecological communities, and scientists believe that there is a strong relationship between species diversity and ecosystem processes. However, efforts to investigate microbial diversity using whole genome shotgun read data are still scarce. Using novel applications of data structures and newly developed algorithms, we first developed an efficient k-mer counting approach and methods to enable scalable streaming analysis of large and error-prone short-read shotgun data sets. Building on these efforts, we then developed a statistical framework allowing scalable diversity analysis of large, complex metagenomes without the need for assembly or reference sequences. This method is evaluated on multiple large metagenomes from different environments, such as seawater, the human microbiome, and soil. Given the rapid growth of sequencing data, this method is promising for analyzing highly diverse samples with relatively low computational requirements. Further, because the method does not depend on reference genomes, it also provides opportunities to tackle the large amounts of unknowns found in metagenomic datasets.
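The abstract names an efficient k-mer counting approach without detailing the data structure. Purely as a point of reference, the sketch below shows one common way to count k-mers approximately in bounded memory, a count-min-sketch-style counter; it is not presented as the author's method, and the table sizes are arbitrary assumptions.

```python
# Hedged, generic sketch of approximate k-mer counting in bounded memory
# (count-min style). Not the thesis implementation; sizes are illustrative.
import hashlib

class CountMinKmers:
    def __init__(self, width=1 << 16, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, row, kmer):
        h = hashlib.sha256(f"{row}:{kmer}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, kmer):
        for row in range(self.depth):
            self.tables[row][self._index(row, kmer)] += 1

    def count(self, kmer):
        # The minimum across rows upper-bounds the true count.
        return min(self.tables[row][self._index(row, kmer)]
                   for row in range(self.depth))

counter = CountMinKmers()
read, k = "ACGTACGTACGT", 4
for i in range(len(read) - k + 1):
    counter.add(read[i:i + k])
print(counter.count("ACGT"))   # 3 (exact here; in general an upper bound)
```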
- Title
- Data quality control and inter-functional analysis on dynamic phenotype-environmental relationships
- Creator
- Xu, Lei
- Date
- 2016
- Collection
- Electronic Theses & Dissertations
- Description
- Plant phenomics has become an essential component of modern plant science. Such complex data sets are critical for understanding the mechanisms governing energy intake and storage in plants. Large-scale phenotyping techniques have been developed to conduct high-throughput phenotyping of plants. However, a major issue facing these efforts is determining the quality of phenotypic data: automated methods are needed to identify and characterize alterations caused by system errors, which are difficult to remove at the data collection step. Another issue is that we are limited by the tools available to fully analyze phenomics data, especially the dynamic relationships between environments and phenotypes. The overarching goal of this thesis is to explore dynamic phenotype-environment data via data mining and machine learning methods. Raw data measured from biological devices is pre-processed into numerical data, then cleaned by Dynamic Filter to ensure high data quality for further analysis. The cleaned data is further explored with inter-functional analysis in order to find patterns that comply with both machine learning methodologies and biological constraints. In this thesis we developed two tools to make exploration of phenotyping data possible: (1) for data quality control, a coarse-to-refined model called Dynamic Filter that identifies abnormalities in plant photosynthesis phenotype data; and (2) for inter-functional phenomics data analysis, a new algorithm called PhenoCurve.
- Title
- Making heads and tails of Molgula : next generation sequencing analysis of closely related tailed and tail-less ascidian species
- Creator
- Lowe, Elijah Kariem
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
- Tunicates are invertebrate chordates and the sister group to the vertebrates. Although tunicates bear little morphological resemblance to vertebrates in their adult stage, they share several features in their larval stage: a hollow dorsal neural tube, gill slits, and a post-anal tail containing a notochord, a group of cells organized into a rod-shaped structure; these are the key features that define the phylum. Within the tunicates, several ascidians have undergone tail loss, and many of them are in the family Molgulidae. Hybrids have been produced through the cross-fertilization of two Molgula species (Molgula occulta and Molgula oculata), and no other solitary Molgula species are known to hybridize. Here we sequenced the transcriptomes of several developmental stages of both M. occulta and M. oculata, two closely related, free-spawning ascidian species, and of their hybrid, in order to study the mechanisms behind tail loss in M. occulta. We were first presented with the problem of identifying the best pipeline for de novo assembly of our transcriptomes; we determined that processing reads through digital normalization, a redundancy reduction step, had less of an effect on assemblies than the choice of assembler. We then sequenced and assembled the genomes of M. occulta, M. oculata, and a more distant species, M. occidentalis. This allowed us to characterize the genomes, revealing that the species are more divergent than they appear phenotypically, and to build better gene models. Through differential expression analysis we determined that M. oculata and the hybrid have overlapping transcripts that are up-regulated during the formation of the ascidian tail, and that these genes are primarily overexpressed by the tailed species and the hybrid relative to the tail-less species.
- Title
- Cis-regulatory code controlling spatially specific high salinity response in Arabidopsis thaliana
- Creator
- Seddon, Alexander
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
- Plants are subjected to a variety of environmental stresses, and their ability to respond depends in large part on the proper regulation of gene activities, including transcription. Earlier studies show that the regulation of the stress transcriptional response has a significant spatial component: each organ, tissue, and cell type may respond to a stress by differentially regulating different sets of genes. Although our knowledge is accumulating on how specific transcription factors (TFs) and their associated cis-regulatory elements (CREs) are involved in stress responses, a genome-wide model of which plant TFs and CREs are key to spatial stress response regulation has yet to emerge. In this study, a set of 1,894 putative CREs (pCREs) was identified that are associated with salt stress up-regulated genes in the root and shoot of Arabidopsis thaliana. These pCREs led to models that predict salt up-regulated genes in root and shoot better than models based on known TF binding motifs. The full pCRE set could be broken into root, shoot and general subsets that are enriched among root, shoot, or both root and shoot salt up-regulated genes, respectively. We also identified pCRE subsets that are enriched among genes induced by salt in specific root cell types. Most importantly, combinations of the pCRE subsets allowed prediction of genes up-regulated by high salinity in root, shoot, and various root cell types, and considering pCRE combinatorial rules further improved salt up-regulation prediction. Our results suggest that the organ- and cell-type-specific transcriptional response to high salinity is regulated by a core set of pCREs that need to be considered in combination, and they provide a genome-wide view of the cis-regulation of spatial transcriptional responses to stress.
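The enrichment analyses described above ("enriched among ... up-regulated genes") are commonly carried out with a hypergeometric tail test on the overlap between motif-carrying genes and the up-regulated gene set. The sketch below illustrates that generic test with toy counts; it is not the study's actual statistical procedure or data.

```python
# Hedged sketch of a motif enrichment test: the probability of seeing at least
# the observed overlap between motif-carrying genes and up-regulated genes if
# the motif were distributed at random. Counts are toy values.
from scipy.stats import hypergeom

def enrichment_pvalue(n_genome, n_upregulated, n_with_motif, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null."""
    return hypergeom.sf(n_overlap - 1, n_genome, n_upregulated, n_with_motif)

# Toy numbers: 20,000 genes, 1,500 salt up-regulated, motif present in 400
# genes, 75 of which are up-regulated (expected ~30 by chance).
print(enrichment_pvalue(20000, 1500, 400, 75))  # very small p-value -> enriched
```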