Search results
(1 - 20 of 28)
- Title
- The evolutionary origins of memory use in navigation
- Creator
- Grabowski, Laura M.
- Date
- 2009
- Collection
- Electronic Theses & Dissertations
- Title
- Development of a nanoparticle-based electrochemical bio-barcode DNA biosensor for multiplexed pathogen detection on screen-printed carbon electrodes
- Creator
- Zhang, Deng
- Date
- 2011
- Collection
- Electronic Theses & Dissertations
- Description
-
A highly amplified, nanoparticle-based, bio-barcoded electrochemical biosensor for the simultaneous multiplexed detection of the protective antigen A (pagA) gene (accession number = M22589) from Bacillus anthracis and the insertion element (Iel) gene (accession number = Z83734) from Salmonella Enteritidis was developed. The biosensor system is mainly composed of three types of nanoparticles: gold nanoparticles (AuNPs), magnetic nanoparticles (MNPs), and nanoparticle tracers (NTs), such as lead sulfide (PbS) and cadmium sulfide (CdS). The AuNPs are coated with the first target-specific DNA probe (1pDNA), which can recognize one end of the target DNA sequence (tDNA), and many NT-terminated bio-barcode ssDNA (bDNA-NT), which act as signal reporters and amplifiers. The MNPs are coated with the second target-specific DNA probe (2pDNA), which can recognize the other end of the target gene. After binding the nanoparticles with the target DNA, the following sandwich structure is formed: MNP-2pDNA/tDNA/1pDNA-AuNP-bDNA-NTs. A magnetic field is applied to separate the sandwich structure from the unreacted materials. Because the AuNPs have a large number of nanoparticle tracers per DNA probe binding event, there is substantial amplification. After the nanoparticle tracer is dissolved in 1 mol/L nitric acid, the NT ions, such as Pb2+ and Cd2+, show distinct non-overlapping stripping curves by square wave anodic stripping voltammetry (SWASV) on screen-printed carbon electrode (SPCE) chips. The oxidation potential of NT ions is unique for each nanoparticle tracer and the peak current is related to the target DNA concentration. The results show that the biosensor has good specificity, and the sensitivity of single detection of the pagA gene from Bacillus anthracis using PbS NTs is as low as 0.2 pg/mL. The detection limit of this multiplex bio-barcoded DNA sensor is 50 pg/mL using PbS or CdS NTs. The nanoparticle-based bio-barcoded DNA sensor has potential applications for multiple detections of bioterrorism threat agents, co-infection, and contaminants in the same sample.
- Title
- Analysis of host transcriptome response to porcine reproductive and respiratory syndrome virus infection
- Creator
- Arceo, Maria Eugenia
- Date
- 2012
- Collection
- Electronic Theses & Dissertations
- Description
-
Porcine Reproductive and Respiratory Syndrome (PRRS) has been affecting commercial populations of pigs in the US for more than 20 years. We evaluated differences in gene expression in pigs from the PRRS Host Genetics Consortium initiative showing a range of responses to PRRS virus infection. Pigs were allocated into four phenotypic groups according to their serum viral level and weight gain. We obtained RNA at several days post-infection and hybridized it to the 20K 70-mer oligonucleotide Pigoligoarray. We initially used plasmode datasets to select an optimal procedure for analyzing these data. We showed that the random array effects model with the moderated F statistic and significance thresholds obtained by permutation provided the most powerful analysis procedure. We then addressed global differential gene expression between phenotypic groups. We identified cell death as a biological function significantly associated with several gene networks enriched for differentially expressed genes. We found three genes differentially expressed between phenotypic groups: interferon-alpha 1; major histocompatibility complex, class II, DR alpha; and major histocompatibility complex, class II, DQ alpha 1. Finally, we used this study as pilot data to inform the design of future time-course transcriptional profiling experiments. We concluded that the best scenario for investigating the early response to PRRSV infection consists of sampling at 4 and 7 days post-infection using approximately 30 pigs per phenotypic group.
- Title
- Approaches to scaling and improving metagenome sequence assembly
- Creator
- Pell, Jason (Jason A.)
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Since the completion of the Human Genome Project in the early 2000s, new high-throughput sequencing technologies have been developed that produce more DNA sequence reads at a much lower cost. Because of this, large quantities of data have been generated that are difficult to analyze computationally, not only because of the sheer number of reads but also because of errors. One area where this is a particularly difficult problem is metagenomics, where an ensemble of microbes in an environmental sample is sequenced. In this scenario, blends of species with varying abundance levels must be processed together in a bioinformatics pipeline. One common goal with a sequencing dataset is to assemble the genome from the set of reads, but since comparing reads with one another scales quadratically, new algorithms had to be developed to handle the large quantity of short reads generated from the latest sequencers. These assembly algorithms frequently use de Bruijn graphs where reads are broken down into k-mers, or small DNA words of a fixed size k. Despite these algorithmic advances, DNA sequence assembly still scales poorly due to errors and computer memory inefficiency.
In this dissertation, we develop approaches to tackle the current shortcomings in metagenome sequence assembly. First, we devise the novel use of a Bloom filter, a probabilistic data structure with false positives, for storing a de Bruijn graph in memory. We study the properties of the de Bruijn graph with false positives in detail and observe that the components in the graph abruptly connect together at a specific false positive rate. Then, we analyze the memory efficiency of a partitioning algorithm at various false positive rates and find that this approach can lead to a 40x decrease in memory usage.
Extending the idea of a probabilistic de Bruijn graph, we then develop a two-pass error correction algorithm that effectively discards erroneous reads and corrects the remaining majority to be more accurate. In the first pass, we use the digital normalization algorithm to collect novelty and discard reads that are already at sufficient coverage. In the second, a read-to-graph alignment strategy is used to correct reads. Some heuristics are employed to improve the performance. We evaluate the algorithm with an E. coli dataset as well as a mock human gut metagenome dataset and find that the error correction strategy works as intended.
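As an illustration of the idea summarized in the abstract above (k-mers stored in a Bloom filter, with graph edges implied by membership queries), here is a minimal Python sketch. It is not the dissertation's implementation: the class name `BloomFilter`, the helper names `kmers` and `neighbors`, the SHA-256-based hashing, and the filter size are all assumptions chosen for readability.

```python
import hashlib


def kmers(read, k):
    """Yield every length-k substring (k-mer) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


class BloomFilter:
    """Hypothetical minimal Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def neighbors(bloom, kmer):
    """Successor k-mers the filter reports as present (may include false positives)."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


# Build an implicit de Bruijn graph from one short read.
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for km in kmers("ACGTACGGA", 4):
    bloom.add(km)
```

The graph is never stored explicitly: an edge exists whenever a one-base extension of a k-mer tests positive, which is why false positives in the filter can connect otherwise separate components.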
- Title
- Profile HMM-based protein domain analysis of next-generation sequencing data
- Creator
- Zhang, Yuan
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Sequence analysis is the process of analyzing DNA, RNA or peptide sequences using a wide range of methodologies in order to understand their functions, structures or evolutionary history. Next generation sequencing (NGS) technologies generate large-scale sequence data of high coverage and nucleotide-level resolution at low costs, benefiting a variety of research areas such as gene expression profiling, metagenomic annotation, ncRNA identification, etc. Therefore, functional analysis of NGS sequences becomes increasingly important because it provides insightful information, such as gene expression, protein composition, and phylogenetic complexity, of the species from which the sequences are generated. One basic step during the functional analysis is to classify genomic sequences into different functional categories, such as protein families or protein domains (or domains for short), which are independent functional units in a majority of annotated protein sequences. The state-of-the-art method for protein domain analysis is based on comparative sequence analysis, which classifies query sequences into annotated protein or domain databases. There are two types of domain analysis methods, pairwise alignment and profile-based similarity search. The first one uses pairwise alignment tools such as BLAST to search query genomic sequences against reference protein sequences in databases such as NCBI-nr. The second one uses profile HMM-based tools such as HMMER to classify query sequences into annotated domain families such as Pfam. Compared to the first method, the profile HMM-based method has a smaller search space and higher sensitivity in remote homolog detection. Therefore, I focus on profile HMM-based protein domain analysis.
There are several challenges with protein domain analysis of NGS sequences. First, sequences generated by some NGS platforms such as pyrosequencing have relatively high error rates, making it difficult to classify the sequences into their native domain families. Second, existing protein domain analysis tools have low sensitivity with short query sequences and poorly conserved domain families. Third, the volume of NGS data is usually very large, making it difficult to assemble short reads into longer contigs. In this work, I focus on addressing these three challenges using different methods. To be specific, I have proposed four tools: HMM-FRAME, MetaDomain, SALT, and SAT-Assembler. HMM-FRAME focuses on detecting and correcting frameshift errors in sequences generated by pyrosequencing technology, thus accurately classifying metagenomic sequences containing frameshift errors into their native protein domain families. MetaDomain and SALT are both designed for short reads generated by NGS technologies. MetaDomain uses relaxed position-specific score thresholds and alignment positions to increase the sensitivity while keeping the false positive rate at a low level. SALT combines both position-specific score thresholds and graph algorithms and achieves higher accuracy than MetaDomain. SAT-Assembler conducts targeted gene assembly from large-scale NGS data. It has smaller memory usage, higher gene coverage, and a lower chimera rate compared with existing tools. Finally, I conclude my work and briefly discuss future work.
- Title
- QTL and transcriptomic analysis between red wheat and white wheat during pre-harvest sprouting induction stage
- Creator
- Su, Yuanjie
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Wheat pre-harvest sprouting (PHS) is the precocious germination of seed in the head when prolonged wet conditions occur during the harvest period. Recent damage caused by PHS occurred in 2008, 2009 and 2011, resulting in severe losses to the Michigan wheat industry. Direct annual losses caused by PHS worldwide can reach up to US $1 billion. Breeding for PHS-resistant wheat cultivars is critical for securing soft white wheat production and reducing the economic loss to Michigan farmers, food processors and millers. In general, white wheat is more susceptible to PHS than red wheat. However, the underlying mechanism connecting seed coat color and PHS resistance has not been clearly described. In this study, a recombinant inbred line population segregating for seed coat color alleles was evaluated for seed coat color and alpha-amylase activity in three years with two treatments. The genotyping results enabled us to group individuals by the specific red allele combinations and allowed us to examine the allelic contribution of each color locus to both seed coat color and alpha-amylase activity. A high-density genetic map based upon the Infinium 9K SNP array was generated to locate QTL in relatively narrow regions. A total of 38 Quantitative Trait Loci (QTL) for seed coat color and alpha-amylase activity were identified from this population and mapped on eleven chromosomes (1B, 2A, 2B, 3A, 3B, 3D, 4B, 5A, 5D, 6B and 7B) from three years and two post-harvest treatments. Most QTL explained 6-15% of the phenotypic variance, while a major QTL on chromosome 2B explained up to 37.6% of the phenotypic variance of alpha-amylase activity in the 2012 non-mist condition. Significant QTL × QTL interactions were also found between and within color and enzyme related traits. Next generation sequencing (NGS) technology was used in the current study to generate a wheat transcriptome using Trinity with two methods: de novo assembly and Genome Guided assembly. Quality assessment of the two assemblies was conducted based on their concordance, completeness and contiguity. Three assembly scenarios were evaluated in order to find a balance between sample specificity and transcriptome completeness. Red wheat and white wheat lines from the previous QTL population were collected under mist and non-mist conditions and their expression profiles were compared to identify differentially expressed (DE) genes. Under the non-mist condition, only around 1% of the genes were differentially expressed between physiologically mature red wheat and white wheat, while the rate increased 10-fold after 48 hr of misting treatment. Annotation of the DE genes showed signature genes involved in the germination process, such as late embryogenesis abundant protein, peroxidase, hydrolase, and several transcription factors. They can be potential key players involved in the underlying genetic networks related to the PHS induction process. Gene Ontology (GO) terms enriched in DE genes were also summarized for each comparison, and germination-related molecular functions and biological processes were retrieved.
In conclusion, with the population segregating for seed coat color loci, the relationship between seed coat color and alpha-amylase activity was examined using biochemical methods, QTL analysis, and transcriptome profiling. The variation in seed coat color is closely linked with PHS resistance level at all three levels. DE genes and enriched GO terms identified were discussed for their potential role in bridging the gap between seed coat color and PHS resistance.
- Title
- Non-coding RNA identification in large-scale genomic data
- Creator
- Yuan, Cheng
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Noncoding RNAs (ncRNAs), which function directly as RNAs without being translated into proteins, have diverse and important biological functions. ncRNAs function not only through their primary structures, but also through their secondary structures, which are defined by interactions between Watson-Crick and wobble base pairs. Common types of ncRNA include microRNA, rRNA, snoRNA, and tRNA. Functions of ncRNAs vary among different types. Recent studies suggest the existence of a large number of ncRNA genes. Identification of novel and known ncRNAs becomes increasingly important in order to understand their functionalities and the underlying communities.
Next-generation sequencing (NGS) technology sheds light on more comprehensive and sensitive ncRNA annotation. Lowly transcribed ncRNAs or ncRNAs from rare species with low abundance may be identified via deep sequencing. However, there exist several challenges in ncRNA identification in large-scale genomic data. First, the massive volume of datasets could lead to very long computation time, making existing algorithms infeasible. Second, NGS has a relatively high error rate, which could further complicate the problem. Third, high sequence similarity among related ncRNAs could make them difficult to identify, resulting in incorrect output. Fourth, while secondary structures should be adopted for accurate ncRNA identification, they usually incur high computational complexity. In particular, some ncRNAs contain pseudoknot structures, which cannot be effectively modeled by the state-of-the-art approach. As a result, ncRNAs containing pseudoknots are hard to annotate.
In my PhD work, I aimed to tackle the above challenges in ncRNA identification. First, I designed a progressive search pipeline to identify ncRNAs containing pseudoknot structures. The algorithms are more efficient than the state-of-the-art approaches and can be used for large-scale data. Second, I designed an ncRNA classification tool for short reads in NGS data lacking quality reference genomes. The initial homology search phase significantly reduces the size of the original input, making the tool feasible for large-scale data. Last, I focused on identifying 16S ribosomal RNAs from NGS data. 16S ribosomal RNAs are a very important type of ncRNA, which can be used for phylogenetic study. A set of graph-based assembly algorithms was applied to form longer or full-length 16S rRNA contigs. I utilized paired-end information in NGS data, so lowly abundant 16S genes can also be identified. To reduce the complexity of the problem and make the tool practical for large-scale data, I designed a list of error correction and graph reduction techniques for graph simplification.
- Title
- Studies of improving therapeutic outcomes of breast cancer through development of personalized treatments and characterization of gene interactions
- Creator
- Jhan, Jing-Ru
- Date
- 2016
- Collection
- Electronic Theses & Dissertations
- Description
-
With an understanding of the heterogeneity of breast cancer, patients with luminal or HER2 breast cancer have more specific treatment options than traditional chemotherapy, which is the standard therapy for triple-negative breast cancer (TNBC) patients. However, the response to current treatments as well as the prognosis have been clinical challenges. In fact, based on gene expression, breast cancer consists of more subtypes than those routinely used. In addition, gene expression is highly correlated with response to treatment and prognosis. This suggests that the development of personalized treatment with targeted therapy could improve the outcomes, especially for the TNBC subtype. To address this need, I used two approaches: the development of pathway-guided individualized treatment and an understanding of the interactions of potential genes for targeted therapy. Considering the complexity of gene and pathway interactions, the probability of pathway activation was predicted using pathway signatures generated by comparing gene expression differences between cells overexpressing genes of interest and those expressing GFP. This approach was validated in two subtypes of mouse mammary tumors from MMTV-Myc mice, and then further validated in human TNBC patient-derived xenografts (PDXs). The inhibition of tumor growth in mouse mammary tumors and the regression of tumors in PDXs were observed. These proof-of-principle experiments demonstrated the flexibility of pathway-guided personalized treatment. Because this approach needs the combination of different targeted therapies, it is necessary to understand the characteristics of these targeted genes and therapies, such as gene-gene interactions. To meet this demand, I studied the effects of Stat3 in Myc-driven tumors. Here, MMTV-Myc mice with a conditional knockout of Stat3 were generated. I noted that the deletion of Stat3 in MMTV-Myc mice accelerated tumorigenesis and delayed tumor growth, with an alteration in the frequency of histological subtypes. These tumors also had deficient angiogenesis. Unexpectedly, mice with this genotype had lactation deficiencies, and lethality of pups was found.
This model shared some of the same effects of loss of Stat3 in other oncogene-induced tumors and also had distinct effects compared with other models. This suggests that the oncogene drivers determine the role of Stat3, as an oncogene or a tumor suppressor, and emphasizes again the importance of understanding the pathways and interactions in the development of treatment.
In sum, these studies demonstrate the potential of guiding individualized treatments in preclinical platforms using bioinformatics analyses. Combined with other genomic profiles, this approach could offer more complete assessments before being translated to practice. In addition, this could be further applied in adaptive clinical trials through matching with mouse models.
- Title
- It's both who you are and where you're from : relating vocational interests and socioeconomic status to bias in biodata and SJTs
- Creator
- Prasad, Joshua
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
"Differences in responding to biodata and situational judgement tests (SJTs) based on gender and racial minority group status were evaluated. It was hypothesized that vocational interests and socioeconomic status (SES) could be used to help characterize the differences in experience between groups (e.g. Cottrell, Newman, Roisman, 2015; Nye, Su, Rounds, & Drasgow, 2012). As a result, interests and SES may help explain differences in both the constructs assessed by biodata and SJTs as well as differences in item functioning (DIF; Drasgow, 1987). Hypotheses were evaluated using multiple-indicator multiple-cause models to simultaneously model latent constructs and item responses (MIMIC; Muthén, 1989). Findings indicate that interests helped explain differences across gender in both the constructs assessed as well as DIF. Interests explained few differences based on minority group status and SES did not seem to meaningfully explain differences in either of the demographic group comparisons. Many items still exhibited DIF as a function of gender or minority group status after accounting for vocational interests and SES, suggesting that further work is needed to identify additional substantive explanations of DIF. Overall, the present work constitutes a thorough examination of differential functioning in noncognitive assessments and establishes a meaningful relationship between the noncognitive constructs assessed here and vocational interests."--Page ii.
- Title
- Studying the effects of sampling on the efficiency and accuracy of k-mer indexes
- Creator
- Almutairy, Meznah
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
"Searching for local alignments is a critical step in many bioinformatics applications and pipelines. This search process is often sped up by finding shared exact matches of a minimum length. Depending on the application, the shared exact matches are extended to maximal exact matches, and these are often extended further to local alignments by allowing mismatches and/or gaps. In this dissertation, we focus on searching for all maximal exact matches (MEMs) and all highly similar local alignments (HSLAs) between a query sequence and a database of sequences. We focus on finding MEMs and HSLAs over nucleotide sequences. One of the most common ways to search for all MEMs and HSLAs is to use a k-mer index such as BLAST. A major problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. We classify sampling strategies used to create k-mer indexes in two ways: how they choose k-mers and how many k-mers they choose. The k-mers can be chosen in two ways: fixed sampling and minimizer sampling. A sampling method might select enough k-mers such that the k-mer index reaches full accuracy. We refer to this sampling as hard sampling. Alternatively, a sampling method might select fewer k-mers to reduce the index size even further but the index does not guarantee full accuracy. We refer to this sampling as soft sampling. In the current literature, no systematic study has been done to compare the different sampling methods and their relative benefits/weakness. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. Also, most previous work uses hard sampling, in which all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the k-mer index at a cost of decreasing query accuracy. We systematically compare fixed and minimizer sampling to find all MEMs between large genomes such as the human genome and the mouse genome. We also study soft sampling to find all HSLAs using the NCBI BLAST tool with the human genome and human ESTs. We use BLAST, since it is the most widely used tool to search for HSLAs. We compared the sampling methods with respect to index size, query time, and query accuracy. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. The results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy."--Pages ii-iii.
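The abstract above names fixed sampling and minimizer sampling without defining them. The Python sketch below uses the commonly cited definitions as an assumption: fixed sampling keeps the k-mer at every w-th position, and minimizer sampling keeps the lexicographically smallest k-mer in each window of w consecutive k-mers. The function names and the parameter `w` are hypothetical and are not taken from the dissertation.

```python
def fixed_sample(seq, k, w):
    """Start positions of the k-mers kept by fixed sampling (every w-th position)."""
    return list(range(0, len(seq) - k + 1, w))


def minimizer_sample(seq, k, w):
    """Start positions of the k-mers kept by minimizer sampling.

    Each window of w consecutive k-mers contributes its lexicographically
    smallest k-mer; ties are broken by the leftmost position.
    """
    num_kmers = len(seq) - k + 1
    chosen = set()
    for start in range(num_kmers - w + 1):
        window = range(start, start + w)
        chosen.add(min(window, key=lambda i: (seq[i:i + k], i)))
    return sorted(chosen)


seq = "ACGTACGTAC"
fixed_positions = fixed_sample(seq, k=3, w=2)
minimizer_positions = minimizer_sample(seq, k=3, w=2)
```

On this toy sequence, fixed sampling keeps 4 of the 8 k-mer positions and minimizer sampling keeps 6, which matches the direction of the space difference the abstract describes. A single short example says nothing about the typical ratio.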
- Title
- Identification of LTR retrotransposons, evaluation of genome assembly, and modeling rice domestication
- Creator
- Ou, Shujun
- Date
- 2018
- Collection
- Electronic Theses & Dissertations
- Description
-
The majority of fundamental theories in genetics and evolution were proposed prior to the discovery of DNA as the genetic material in 1952. Those include Darwin’s theory of evolution (1859), Mendelian genetics (1865), Wright and Fisher’s population genetics (1918), and McClintock’s transposition of genetic elements (1951). Nevertheless, the underlining mechanisms of those theories were not fully elucidated till the appearance of DNA sequencing technology. At present, technological advances...
Show moreThe majority of fundamental theories in genetics and evolution were proposed prior to the discovery of DNA as the genetic material in 1952. Those include Darwin’s theory of evolution (1859), Mendelian genetics (1865), Wright and Fisher’s population genetics (1918), and McClintock’s transposition of genetic elements (1951). Nevertheless, the underlining mechanisms of those theories were not fully elucidated till the appearance of DNA sequencing technology. At present, technological advances have minimized the cost for sequencing genomes. The real bottleneck to establish genomic resources is the annotation of genomic sequences. Long Terminal Repeat (LTR) retrotransposon is a major type of transposable genetic elements and dominating plant genomes. We developed a new method called LTR_retriever for accurate annotation of LTR retrotransposons. Further, we studied genome dynamics, genome size variation, and polyploidy origin using LTR retrotransposons. The presence of LTR retrotransposons challenges current sequencing and assembly techniques due to their size and repetitiveness. We proposed an unbiased metric called LTR Assembly Index (LAI) which utilizes the assembled LTR retrotransposons to evaluate continuity of genome assembly. We revealed the massive gain of continuity for assembly sequenced based on long-read techniques over short-read methods, and further proposed a standardized classification system for genome quality based on LAI. With high-quality genomes, we can extend our knowledge about microevolution events using a population of genomes. The domestication history of rice is still unresolved due to its complicated demographic history. We collected, re-mapped, and re-analyzed 3,485 cultivated and wild rice resequencing accessions. With data imputation, a total of 17.7 million high-quality single-nucleotide polymorphisms (SNPs) were identified. 
Our dataset is highly accurate as verified by cross-platform Affymetrix Microarray data, with a pairwise concordance rate of 99%. Combining phylogeny, PCA, and ADMIXTURE analyses, we present profound diversification among rice ecotypes.
- Title
- Algebraic topology and machine learning for biomolecular modeling
- Creator
- Cang, Zixuan
- Date
- 2018
- Collection
- Electronic Theses & Dissertations
- Description
-
Data is expanding at an unprecedented speed in both quantity and size. Topological data analysis provides excellent tools for analyzing high dimensional and highly complex data. Inspired by the ability of topological data analysis to characterize data robustly and at multiple scales, and motivated by the demand for practical predictive tools in computational biology and biomedical research, this dissertation extends the capability of persistent homology toward quantitative and predictive data analysis tools with an emphasis on biomolecular systems. Although persistent homology is almost parameter free, careful treatment is still needed to build practically useful prediction models for realistic systems. This dissertation carefully assesses the representability of persistent homology for biomolecular systems and introduces a collection of characterization tools for both macromolecules and small molecules focusing on intra- and inter-molecular interactions, chemical complexities, electrostatics, and geometry. The representations are then coupled with deep learning and machine learning methods for several problems in drug design and biophysical research. In real-world applications, data often come with heterogeneous dimensions and components. For example, in addition to location, atoms of biomolecules can also be labeled with chemical types, partial charges, and atomic radii. While persistent homology is powerful in analyzing the geometry of data, it lacks the ability to handle non-geometric information. Based on cohomology, we introduce a method that attaches the non-geometric information to the topological invariants in persistent homology analysis. This method is not only useful for handling biomolecules but can also be applied to general situations where the data carries both geometric and non-geometric information. In addition to describing biomolecular systems as a static frame, we are often interested in the dynamics of the systems. 
An efficient way is to assign an oscillator to each atom and study the coupled dynamical system induced by atomic interactions. To this end, we propose a persistent homology based method for the analysis of the resulting trajectories from the coupled dynamical system. The methods developed in this dissertation have been applied to several problems, namely, prediction of protein stability change upon mutations, protein-ligand binding affinity prediction, virtual screening, and protein flexibility analysis. The tools have shown top performance in both commonly used validation benchmarks and community-wide blind prediction challenges in drug design.
- Title
- THE TAXONOMIC AND FUNCTIONAL MICROBIAL DIVERSITY IN LAKE BAIKAL AND OTHER NORTH TEMPERATE LAKES
- Creator
- Wilburn, Paul
- Date
- 2018
- Collection
- Electronic Theses & Dissertations
- Description
-
Microorganisms cycle nutrients in every environment on Earth, and their importance in aquatic environments has been recognized for at least 75 years. However, many systems key to better understanding the role specific microbes play in natural environments remain poorly characterized. Lake Baikal is a UNESCO world heritage site. It is the planet’s deepest (1642 m), most voluminous (23615 km³), and oldest (25 to 30 my) lake, containing about 20% of the world’s unfrozen freshwater. Baikal’s size and millions of years of evolutionary development have turned this ancient system into a biodiversity hotspot; however, little is known about its microbial communities. I describe the first -omics-based survey of the microbial communities of Lake Baikal, covering all three basins, multiple depths, and including measured environmental covariates. In Chapter One, I show that temperature, stratification, nutrients, and dissolved oxygen define major microbial habitats and influence patterns of community diversity in summer Lake Baikal. The environment, not geographical distance, structured microbial communities in Lake Baikal. The overall main driver of community dissimilarity was temperature. Increases in community diversity are driven by richness in the upper mixed layer and evenness in the deep waters, and those aspects of diversity were associated with different environmental drivers. Next, we used a co-occurrence network to identify lake habitats consistently preferred by groups of co-occurring microorganisms, discovering two sets of candidate resident and two sets of candidate transient habitat-cohort pairs. Taxonomic makeup reflected the abiotic conditions of those clusters, suggesting key microbial players in each one. In Chapter Two, I expand microbial community and functional surveys to thirteen additional lakes across Michigan, Minnesota, and Wisconsin, sampled in summer and winter seasons. 
Lake Baikal indeed harbored microbial communities that were distinct from other north temperate lakes in both seasons, with the next closest communities supported by oligotrophic epilimnia of lakes Superior, Portsmouth, and La Salle. In the summer epilimnion of Lake Baikal, which was N-P co-limited at the time of the survey, the enzymes responsible for assimilatory reduction of N species to ammonium and assimilation of ammonium into glutamate were present as ferredoxin-dependent isoforms at the low end of the N availability gradient, in a trade-off with NADH-dependent isoforms. Chapter Three presents 369 high quality draft genomes of microorganisms from Lake Baikal, assembled using computational tools that are currently at the cutting edge of bioinformatics. The metagenome assembled genomes (MAGs) were culture-independent and included the archaea domain, as well as 15 bacterial phyla, four of which have no previously sequenced lineages from Lake Baikal. Most MAGs were small but with large variation. At the same time, genomes assembled from the most stable, aseasonal, and resource environment in the Lake Baikal hypolimnion harbored the smallest genomes with remarkably little size variation, reflecting the oligotrophic environment.
- Title
- Interpretable machine learning in plant genomes : studies in modeling and understanding complex biological systems
- Creator
- Azodi, Christina Brady
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- Description
-
Complex systems are ubiquitous in genetics and genomics. From the regulation of gene expression to the genetic basis of complex traits, we see that complex networks of diverse cellular molecules underpin the natural world. Driven by technological advances, today's researchers have access to large amounts of omics data from diverse species. At the same time, improvements in computer processing and algorithms have produced more powerful computational tools. Taken together, these advances mean that those working at the interface of data science and biology are poised to better model and understand complex biological systems. The research in this dissertation demonstrates how a data-driven approach can be used to better understand three complex systems: (1) transcriptional response to single and combined heat and drought stress in Arabidopsis thaliana, (2) the genetic basis of flowering time, a complex trait, in Zea mays, and (3) the social basis for opinions and beliefs about biotechnology products. To study the first system, we generated models of the cis-regulatory code from information about DNA sequence and additional omics levels using both classic machine learning and deep learning algorithms. We identified 1,061 putative cis-regulatory elements associated with different patterns of response to single and combined heat and drought stress and found that information about additional levels of regulation, especially chromatin accessibility and known transcription factor binding, improved our models of the cis-regulatory code. To study the second system, we generated phenotype prediction models for flowering time, height, and yield based on either genetic markers or transcript levels at the seedling stage. We found that, while genetic marker-based models performed better than transcript level-based models, models that integrated both types of data performed best. 
Furthermore, transcript-based models were more useful for finding genes known to be associated with flowering time, highlighting how using additional levels of omics data can improve our ability to understand the genetic basis of complex traits. Finally, to study the third system, we integrated 29 characteristics about a person (e.g. age, political ideology, education, values, environmental beliefs) into a machine learning model that would predict an individual's beliefs and opinions about five different types of biotechnology products (e.g. biofortification, biopharmaceuticals). While this approach was particularly useful for identifying individuals that were broadly supportive of biotechnology, finding characteristics of individuals with negative or conditional (i.e. support product A, but not B) opinions was more challenging, highlighting the complexity of public opinions about biotechnology.
- Title
- Contributions to machine learning in biomedical informatics
- Creator
- Baytas, Inci Meliha
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- Description
-
"With innovations in digital data acquisition devices and increased memory capacity, virtually all commercial and scientific domains have been witnessing an exponential growth in the amount of data they can collect. For instance, healthcare is experiencing a tremendous growth in digital patient information due to the high adaptation rate of electronic health record systems in hospitals. The abundance of data offers many opportunities to develop robust and versatile systems, as long as the underlying salient information in data can be captured. On the other hand, today's data, often named big data, is challenging to analyze due to its large scale and high complexity. For this reason, efficient data-driven techniques are necessary to extract and utilize the valuable information in the data. The field of machine learning essentially develops such techniques to learn effective models directly from the data. Machine learning models have been successfully employed to solve complicated real world problems. However, the big data concept has numerous properties that pose additional challenges in algorithm development. Namely, high dimensionality, class membership imbalance, non-linearity, distributed data, heterogeneity, and temporal nature are some of the big data characteristics that machine learning must address. Biomedical informatics is an interdisciplinary domain where machine learning techniques are used to analyze electronic health records (EHRs). EHR comprises digital patient data with various modalities and depicts an instance of big data. For this reason, analysis of digital patient data is quite challenging although it provides a rich source for clinical research. While the scale of EHR data used in clinical research might not be huge compared to the other domains, such as social media, it is still not feasible for physicians to analyze and interpret longitudinal and heterogeneous data of thousands of patients. 
Therefore, computational approaches and graphical tools to assist physicians in summarizing the underlying clinical patterns of the EHRs are necessary. The field of biomedical informatics employs machine learning and data mining approaches to provide the essential computational techniques to analyze and interpret complex healthcare data to assist physicians in patient diagnosis and treatment. In this thesis, we propose and develop machine learning algorithms, motivated by prevalent biomedical informatics tasks, to analyze the EHRs. Specifically, we make the following contributions: (i) A convex sparse principal component analysis approach along with variance reduced stochastic proximal gradient descent is proposed for the patient phenotyping task, which is defined as finding clinical representations for patient groups sharing the same set of diseases. (ii) An asynchronous distributed multi-task learning method is introduced to learn predictive models for distributed EHRs. (iii) A modified long-short term memory (LSTM) architecture is designed for the patient subtyping task, where the goal is to cluster patients based on similar progression pathways. The proposed LSTM architecture, T-LSTM, performs a subspace decomposition on the cell memory such that the short term effect in the previous memory is discounted based on the length of the time gap. (iv) An alternative approach to T-LSTM model is proposed with a decoupled memory to capture the short and long term changes. The proposed model, decoupled memory gated recurrent network (DM-GRN), is designed to learn two types of memories focusing on different components of the time series data. In this study, in addition to the healthcare applications, behavior of the proposed model is investigated for traffic speed prediction problem to illustrate its generalization ability. 
In summary, the aforementioned machine learning approaches have been developed to address complex characteristics of electronic health records in routine biomedical informatics tasks such as computational patient phenotyping and patient subtyping. Proposed models are also applicable to different domains with similar data characteristics as EHRs."--Pages ii-iii.
- Title
- MODELING AND PREDICTION OF GENETIC REDUNDANCY IN ARABIDOPSIS THALIANA AND SACCHAROMYCES CEREVISIAE
- Creator
- Cusack, Siobhan Anne
- Date
- 2020
- Collection
- Electronic Theses & Dissertations
- Description
-
Genetic redundancy is a phenomenon where more than one gene encodes products that perform the same function. This frequently manifests experimentally as a single gene knockout mutant which does not demonstrate a phenotypic change compared to the wild type due to the presence of a paralogous gene performing the same function; a phenotype is only observed when one or more paralogs are knocked out in combination. This presents a challenge in a fundamental goal of genetics, linking genotypes to phenotypes, especially because it is difficult to determine a priori which gene pairs are redundant. Furthermore, while some factors that are associated with redundant genes have been identified, little is known about factors contributing to long-term maintenance of genetic redundancy. Here, we applied a machine learning approach to predict redundancy among benchmark redundant and nonredundant gene pairs in the model plant Arabidopsis thaliana. Predictions were validated using well-characterized redundant and nonredundant gene pairs. Additionally, we leveraged the availability of fitness and multi-omics data in the budding yeast Saccharomyces cerevisiae to build machine learning models for predicting genetic redundancy and related phenotypic outcomes (single and double mutant fitness) among paralogs, and to identify features important in generating these predictions. Collectively, our models of genetic redundancy provide quantitative assessments of how well existing data allow predictions of fitness and genetic redundancy, shed light on characteristics that may contribute to long-term maintenance of paralogs that are seemingly functionally redundant, and will ultimately allow for more targeted generation of phenotypically informative mutants, advancing functional genomic studies.
- Title
- GENOMIC APPLICATIONS TO PLANT BIOLOGY
- Creator
- Hoopes, Genevieve
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
The study of the total nuclear DNA content of an organism, i.e., the genome, is a relatively new field and has evolved as sequencing technology and its output has changed. A shift from model species to ecological and crop species occurred as sequencing costs decreased and the technology became more broadly accessible, enabling new discoveries in genome biology as increasingly diverse species and populations were profiled. Here, a genome assembly and several transcriptional studies in multiple non-model plant species provided new knowledge of molecular pathways and gene content. Over 157 Mb of the genome of the medicinal plant species Calotropis gigantea (L.) W.T.Aiton was sequenced, de novo assembled and annotated using Next Generation Sequencing technologies. The resulting assembly represents 92% of the genic space and provides a resource for discovery of the enzymes involved in biosynthesis of the anticancer metabolite, cardenolide. An updated gene expression atlas for 79 developmental maize (Zea mays L., 1753) tissues and five abiotic/biotic stress treatments was developed, revealing 4,154 organ-specific and 7,704 stress-induced differentially expressed (DE) genes. Presence-absence variants (PAVs) were enriched for organ-specific and stress-induced DE genes, tended to be lowly expressed, and had few co-expression network connections, suggesting that PAVs function in environmental adaptation and are on an evolutionary path to pseudogenization. The Maize Genomics Resource (http://maize.plantbiology.msu.edu/) was developed to view and data-mine these resources. Through profiling global gene expression over time in potato (Solanum tuberosum L.) leaf and tuber tissue, the first circadian rhythmic gene expression profiles of the below-ground heterotrophic tuber tissue were generated. The tuber displayed a longer circadian period, a delayed phase, and a lower amplitude compared to leaf tissue. 
Over 500 genes were differentially phased between the leaf and tuber, and many carbohydrate metabolism enzymes are under both diurnal and circadian regulation, reflecting the importance of the circadian clock for tuber bulking. Most core circadian clock genes do not display circadian rhythmic gene expression in the leaf or tuber, yet robust transcriptional and gene expression circadian rhythms are present.
- Title
- USING FRAGARIA AS A MODEL SYSTEM FOR THE STUDY OF SUBGENOME DOMINANCE AND ADAPTATION IN CROPS
- Creator
- Alger, Elizabeth
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
Polyploidy, or the presence of three or more complete genomes in a single organism, has occurred frequently in plants, especially in the angiosperm lineage. Allopolyploids, or polyploids resulting from the merging of different genomes in an interspecific hybrid, have often been shown to experience subgenome dominance. Subgenome dominance is the phenomenon where there is bias in the gene loss and expression between the different genomes in a polyploid, known as subgenomes. Despite the prevalence of polyploids and subgenome dominance, little is known about the factors and mechanisms that influence this process. Strawberry (Fragaria sp.) is emerging as a powerful model system to investigate polyploid subgenome dominance evolution due to the recent identification of the four extant diploid progenitor species of the cultivated octoploid strawberry (Fragaria x ananassa). Having the diploid progenitors in hand allows us to identify differences between the dominant subgenome, F. vesca, and the other three progenitors that may have an impact on subgenome dominance. One possible factor is transposable element (TE) abundance, as low TE density has been consistently associated with the dominant subgenome in allopolyploids. Epigenetic silencing of TEs by DNA methylation to suppress TE activity has been shown to result in decreased expression of neighboring genes, and this lowered gene expression may affect the establishment of subgenome dominance. F. vesca will be used as a diploid model for the study of subgenome dominance in strawberry where I can examine how TE abundance and other factors influence gene expression in a single accession and in hybrid crosses between different accessions. Tracking changes in gene expression in the hybrids will allow us to examine how genomes with different sizes and genomic factors interact. The results and insights observed from this study can then be applied to subgenome dominance research in octoploid strawberry. 
In addition to the germplasm and genomic resources, strawberries are also a high value crop and the loss of their production due to (a)biotic stressors results in the loss of millions of United States dollars annually. Using a population of octoploid strawberries segregating for salt tolerance, I will identify candidate genes related to salt tolerance. Together this work will identify factors and mechanisms related to subgenome dominance and use genotypic data in a practical breeding context.
- Title
- Genomic basis of electric signal variation in African weakly electric fish
- Creator
- Losilla-Lacayo, Mauricio
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
A repeated theme in speciation is reproductive isolation centered around divergence in a few highly variable traits, especially in cases without strong geographic isolation and with high speciation rates. Understanding the genomic basis of highly variable traits that are key to speciation is a major goal of evolutionary biology, because they can characterize crucial drivers and foundations of the speciation process. African weakly electric fish (Mormyridae) are a decidedly speciose clade of teleost fish, and their electric organ discharges (EODs) are highly variable traits central to species divergence. However, little is known about the genes and cellular processes that underlie EOD variation. In this dissertation, I employ RNAseq and Nanopore sequencing to study the genomic basis of electric signal variation in mormyrids. In Chapter 1, I take a transcriptome-wide approach to describe the molecular basis of electric signal diversity in species of the mormyrid genus Paramormyrops, divergent for EOD complexity, duration and polarity. My results emphasize genes that influence the shape and structure of the electrocyte cytoskeleton, membrane, and extracellular matrix, and the membrane’s physiological properties. In Chapter 2, I compare gene expression patterns between electric organs that produce long vs short EODs. The results strongly support known aspects of the morphological and physiological bases of EOD duration, and for the first time I identified specific genes and broad cellular processes expected to alter morphological and physiological properties of electrocytes; most striking among these is the differential expression of multiple potassium voltage-gated channels. These two chapters independently identified the gene epdl2 as of interest for EOD divergence. In Chapter 3, I study the molecular evolutionary history of epdl2 in Mormyridae, with emphasis on Paramormyrops. 
My results suggest that three rounds of gene duplication produced four epdl2 paralogs in a Paramormyrops ancestor. In addition, I identify ten sites in epdl2 expected to have experienced strong positive selection in paralogs and implicate them in key functional domains. Overall, the results of this dissertation greatly solidify and expand our understanding of how the genome underpins changes to electrocytes, and in turn, divergence in their electric signals, a highly variable trait that may facilitate speciation in African weakly electric fish. This work provides an evidence-grounded list of candidate genes for functional analyses aimed to corroborate their contribution to the EOD phenotype.
- Title
- COMPUTATIONAL DISCOVERY AND ANNOTATIONS OF CELL-TYPE SPECIFIC LONG-RANGE GENE REGULATION
- Creator
- Huang, Binbin
- Date
- 2021
- Collection
- Electronic Theses & Dissertations
- Description
-
Long-range regulation by distal enhancers plays critical roles in cell-type specific transcriptional programs. Delineation of the mechanisms underlying long-range enhancer regulation will improve our systems-level understanding of the gene regulatory networks and their functional impacts on human diseases. Although there are experimental approaches to infer cell-type specific long-range regulation, they suffer from the problems of low resolution or high false negative rates. Recent technological advances make it possible to have a comprehensive profile of the regulatory activities in multiple layers, bringing us to the multi-omics era. Here, we made use of the booming data resources and integrated them into machine learning models to uncover the resulting effects of long-range regulation, especially in diseases. In the first study, about androgen-induced gene regulation in the ovary and its impact on female fertility, we identified a total of 190 annotated significant differentially expressed genes (DEGs). A change in H3K27me3 histone modification level was observed in more than half of the DEGs, highlighting the importance of complex long-range multi-enhancer regulation of androgen receptor-regulated genes in the ovarian cells. However, current computational predictions of genome-wide enhancer–promoter interactions are still challenging due to limited accuracy and the lack of knowledge on the molecular mechanisms. Based on recent biological investigations, the protein–protein interactions (PPIs) between transcription factors (TFs) have been found to participate in the regulation of chromatin loops. Therefore, we developed a novel predictive model for cell-type specific enhancer–promoter interactions by leveraging the information of TF PPI signatures. Evaluated by a series of rigorous performance comparisons, the new model achieves superior performance over other methods. 
In this chromatin loop prediction model, TF bindings inferred from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) make an essential contribution to prioritizing specific TF PPIs that may mediate cell-type specific long-range regulatory interactions and to revealing new mechanistic understanding of enhancer regulation. When processing ChIP-seq data, we found that, on average, 25% of the ChIP-seq reads can be aligned to multiple positions in the reference genome. These reads are discarded by traditional pipelines, which causes a large loss of information. To cope with this waste, we developed a Bayesian model and designed a Gibbs sampling algorithm to properly align these reads. Evidence from a series of biological comparisons indicated a significantly better performance of this model over the competing tool. In summary, our studies took full advantage of the booming data in this multi-omics era to provide a novel view of the cell-type specific long-range regulation by distal enhancers and its effects on diseases.