Navigating Protein Fitness Landscapes with Machine Learning and Biological Insights
Proteins are essential biomolecules that help address challenges in medicine, nanotechnology, and industry. Protein engineering designs and optimizes these molecules for specific functions, such as catalyzing reactions or facilitating drug delivery. However, designing proteins with desired properties is extremely challenging due to unpredictable mutation effects and complex fitness landscapes, which depict the relationship between a protein's sequence, structure, and function. Traditional methods like directed evolution and rational design have limitations in exploring vast sequence spaces and modeling amino acid interactions. Recent advances in machine learning (ML) and the increasing availability of biological data have shifted protein engineering from a theory-driven to a data-driven approach. Despite progress, challenges remain, such as capturing nuanced protein behaviors under distinct biological conditions, enhancing data quality and diversity, and developing models that handle complex protein-ligand interactions.
This dissertation explores innovative protein engineering approaches by integrating ML and computational tools with biological insights. It addresses designing proteins with desired properties, enhancing their numerical representations, and modeling protein-drug interactions. It also develops methodologies to generate new-to-nature proteins with desired properties and to optimize experimental design strategies. Protein representation methods were optimized by combining traditional encodings with protein sequence language models. This ensemble approach achieved a 94% F1 score, enhancing sequence-function predictions by capturing diverse protein fitness aspects. Performance varied for larger proteins and for different protein properties, suggesting the need for specialized, biologically aware ML methodologies.
Additionally, the study addressed the critical challenge of modeling protein-drug interactions, focusing on organic anion-transporting polypeptides (OATPs). OATPs are crucial for drug absorption and distribution and significantly impact drug efficacy and safety. A comprehensive pipeline was developed, combining AlphaFold structure prediction, molecular docking, and a novel Heterogeneous Graph Neural Network model named HIPO. This model captured complex inter- and intramolecular interactions, outperforming existing methods for OATP inhibition prediction. By identifying key drug attributes influencing these interactions, the study demonstrated the effectiveness of structure-based approaches in elucidating protein-drug interactions, contributing to advancements in drug development and toxicity prediction.
Advancing beyond protein representation and drug interaction modeling, this work addresses the generation of novel protein sequences with desired properties. It integrates evolutionary information into generative ML models through a dual approach: combining ancestral sequence reconstruction (ASR) with a Variational Autoencoder (VAE) for sequence generation, and using ASR-derived data to fine-tune language models for improved protein representations. This methodology explores sequences from evolutionary history not observed in modern organisms, accessing a vast, unexplored protein space. This data-centric approach leverages ASR to provide a rich source of information beyond extant species, emphasizing the crucial role of biologically diverse datasets in machine learning frameworks.
The result is the generation of proteins with enhanced diversity and stability, particularly in thermal properties. By synthesizing evolutionary insights with advanced ML techniques, this work expands the possibilities for engineering proteins with unprecedented characteristics.
In conclusion, this thesis presents a comprehensive framework integrating ML with protein engineering, advancing the design and optimization of biomolecules and addressing specific biological challenges for improved therapeutic and diagnostic applications.
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Authors: Mardikoraem, Mehrsa
- Thesis Advisors: Woldring, Daniel
- Committee Members: Walton, Patrick; Krishnan, Arjun; Mendoza Cortes, Jose
- Date Published: 2024
- Subjects: Bioinformatics; Bioengineering
- Program of Study: Chemical Engineering - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 204 pages
- Permalink: https://doi.org/doi:10.25335/66dj-1x54