By Mehrsa Mardikoraem

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Chemical Engineering—Doctor of Philosophy

2024

NAVIGATING PROTEIN FITNESS LANDSCAPES WITH MACHINE LEARNING AND BIOLOGICAL INSIGHTS

ABSTRACT

Proteins are essential biomolecules addressing challenges in medicine, nanotechnology, and industry. Protein engineering designs and optimizes these molecules for specific functions, such as catalyzing reactions or facilitating drug delivery. However, designing proteins with desired properties is extremely challenging due to unpredictable mutation effects and complex fitness landscapes, which depict the relationship between a protein’s sequence, structure, and function. Traditional methods like directed evolution and rational design have limitations in exploring vast sequence spaces and modeling amino acid interactions. Recent advances in machine learning (ML) and the increasing availability of biological data have shifted protein engineering from a theory-driven to a data-driven approach. Despite progress, challenges remain, such as capturing nuanced protein behaviors under distinct biological conditions, enhancing data quality and diversity, and developing models that handle complex protein-ligand interactions.

This dissertation explores innovative protein engineering approaches by integrating machine learning (ML) and computational tools with biological insights. It addresses designing proteins with desired properties, enhancing their numerical representations, and modeling protein-drug interactions. Methodologies are also developed to generate new-to-nature proteins with desired properties and to optimize experimental design strategies.

Protein representation methods were optimized by combining traditional encodings with protein sequence language models. This ensemble approach achieved a 94% F1 score, enhancing sequence-function predictions by capturing diverse aspects of protein fitness. Performance varied for larger proteins and for different protein properties, suggesting the need for specialized, biologically aware ML methodologies.

Additionally, the study addressed the critical challenge of modeling protein-drug interactions, focusing on organic anion-transporting polypeptides (OATPs). OATPs are crucial for drug absorption and distribution and significantly impact drug efficacy and safety. A comprehensive pipeline was developed, combining AlphaFold structure prediction, molecular docking, and a novel Heterogeneous Graph Neural Network model named HIPO. This model captured complex inter- and intramolecular interactions, outperforming existing methods for OATP inhibition prediction. By identifying key drug attributes influencing these interactions, the study demonstrated the effectiveness of structure-based approaches in elucidating protein-drug interactions, contributing to advancements in drug development and toxicity prediction.

Advancing beyond protein representation and drug interaction modeling, this work addresses the generation of novel protein sequences with desired properties. It integrates evolutionary information into generative ML models through a dual approach: combining ancestral sequence reconstruction (ASR) with a Variational Autoencoder (VAE) for sequence generation and utilizing ASR-derived data to fine-tune language models for improved protein representations.
This methodology explores sequences from evolutionary history not observed in modern organisms, accessing a vast, unexplored protein space. This data-centric approach leverages ASR to provide a rich source of information beyond extant species, emphasizing the crucial role of biologically diverse datasets in machine learning frameworks. The result is the generation of proteins with enhanced diversity and stability, particularly in thermal properties. By synthesizing evolutionary insights with advanced ML techniques, this work expands the possibilities for engineering proteins with unprecedented characteristics.

In conclusion, this thesis presents a comprehensive framework integrating ML with protein engineering, advancing the design and optimization of biomolecules, and addressing specific biological challenges for improved therapeutics and diagnostics applications.

Copyright by MEHRSA MARDIKORAEM 2024

ACKNOWLEDGEMENTS

The PhD journey has been one of my most challenging yet enlightening pursuits, filled with valuable life lessons. Looking back on the past five years, I see myself as stronger and more adaptable to life’s challenges. This experience has taught me patience, the importance of trusting the process, and keeping my curiosity alive while preserving the joy of learning. None of this would have been possible without the support of my mentors, colleagues, and friends.

I owe my deepest gratitude to my PhD advisor, Dr. Daniel Woldring. Under his guidance, I’ve learned to maintain curiosity, actively pursue optimal solutions, and embody the essence of a true scientist. Dr. Woldring’s support throughout my PhD journey has been nothing short of extraordinary. Our lengthy brainstorming meetings and discussions about future methodologies have been invaluable, and I will cherish these memories as I move forward in my career.

I’m deeply grateful to my PhD committee members, Dr. S. Patrick Walton, Dr. Jose L. Mendoza-Cortes, and Dr. Arjun Krishnan, for their invaluable mentorship and insightful feedback. Their critical guidance, especially during my comprehensive exam, was instrumental in shaping my final projects and refining my graduation goals. Their expertise and dedication have significantly enriched my academic journey and contributed immensely to my growth as a researcher.

I extend my heartfelt appreciation to my wonderful lab mates in the Woldring lab, whose professionalism and punctuality made our collaborations both enjoyable and productive. Their camaraderie and support have been integral to my research journey. Equally important has been the invaluable support and resources provided by the Chemical and Material Science Department, which have been crucial in facilitating my research and academic progress throughout my PhD studies. I’m also thankful for my dear friends Reza, Sunanda, and Parisa. Your friendship has been a source of comfort and openness as always.

My family deserves special mention: my parents and my two lovely sisters, Mahsa and Melina. Your constant support, patience, and presence have meant everything. From my mother, I learned determination, and from my father, ambition. Thank you for raising me to believe in myself and pursue my goals.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . 1
BIBLIOGRAPHY . . . 9
CHAPTER 2 MACHINE LEARNING-DRIVEN PROTEIN LIBRARY DESIGN: A PATH TOWARD SMARTER LIBRARIES . . . 17
BIBLIOGRAPHY . . . 35
CHAPTER 3 PROTEIN FITNESS PREDICTION IS IMPACTED BY THE INTERPLAY OF LANGUAGE MODELS, ENSEMBLE LEARNING, AND SAMPLING METHODS . . . 40
BIBLIOGRAPHY . . . 64
APPENDIX 3A SUPPLEMENTARY INFORMATION . . . 70
CHAPTER 4 PREDICTING ORGANIC ANION TRANSPORTER PROTEIN-DRUG INHIBITION BY INTEGRATING STRUCTURAL INSIGHTS WITH GRAPH NEURAL NETWORKS . . . 80
BIBLIOGRAPHY . . . 104
APPENDIX 4A LIGAND-PROTEIN DOCKING WORKFLOW . . . 108
APPENDIX 4B COMPARATIVE ANALYSIS BASED ON STRUCTURE AND LIGAND PREDICTIONS ACROSS OATP SUBFAMILIES . . . 111
APPENDIX 4C DISTRIBUTION OF ROSETTA INTERFACE BINDING ENERGIES FOR OATP TRANSPORTERS . . . 113
APPENDIX 4D DISTRIBUTION OF OATP1B1 RESIDUE INTERACTIONS WITH INHIBITORS AND NONINHIBITORS . . . 114
CHAPTER 5 GENERATIVE MODELS FOR PROTEIN SEQUENCE MODELING: RECENT ADVANCES AND FUTURE DIRECTIONS . . . 115
BIBLIOGRAPHY . . . 151
CHAPTER 6 EVOSEQ-ML: ADVANCING DATA-CENTRIC MACHINE LEARNING WITH EVOLUTIONARY-INFORMED PROTEIN SEQUENCE REPRESENTATION AND GENERATION . . . 161
BIBLIOGRAPHY . . . 182
APPENDIX 6A PK2 RECONSTRUCTED PHYLOGENIC TREE WITH SELECTED NODES FOR TRAINING THE GENERATIVE MODEL . . . 186
APPENDIX 6B STATISTICAL ANALYSIS FOR STABILITY OF GENERATED PROTEINS . . . 188
APPENDIX 6C GENERATIVE MODEL ARCHITECTURE . . . 190
APPENDIX 6D STATISTICAL ANALYSIS FOR PROTEIN REPRESENTATION PERFORMANCES – LYSOZYME C . . . 191
APPENDIX 6E STATISTICAL ANALYSIS FOR PROTEIN REPRESENTATION PERFORMANCES – ENDOLYSIN . . . 192
CHAPTER 7 CONCLUSION . . . 193
BIBLIOGRAPHY . . . 197

CHAPTER 1 INTRODUCTION

Proteins are biological machines involved in nearly all life processes [1, 2, 3, 4]. They are composed of chains of amino acids that fold into three-dimensional structures, allowing them to perform essential biological functions. The diverse chemical properties of the twenty natural amino acids enable proteins to maintain cellular structure, facilitate chemical reactions [5, 6, 7], and transport drugs [8, 9]. Protein engineering is vital for modifying proteins to meet modern industrial and medical needs by directing their natural evolution [10, 11, 12, 13]. This field uses advanced techniques like recombinant DNA technology [14], CRISPR-Cas [15], and high-throughput screening [16] to design proteins with specific functions, such as developing antiviral peptides, engineering antibodies, and creating personalized therapies. Protein engineering enhances enzyme specificity and functionality under various conditions and helps develop protein therapeutics with improved pharmacokinetics, pharmacodynamics, and reduced immunogenicity [17, 18, 19, 20, 21].
1.1 Fundamentals of Protein Engineering Guided by the Sequence-Structure-Function Paradigm

Protein fitness [22, 23, 24], which measures how well a protein performs a specific task, is influenced by its amino acid sequence, structural stability, and interactions with other molecules (e.g., regulatory interactions [25], transport interactions [9], inhibition [26], and binding [27]). Protein engineering alters protein functions through experimental methods inspired by natural selection, aiming to yield high-fitness proteins that perform well in relevant tests. The interplay between protein sequences and their functional outcomes highlights the transformative potential of protein engineering. This sets the groundwork for innovative methods to tackle biological problems, guided by the fundamental paradigm called the "sequence-structure-function" relationship [28, 29]. However, this is challenging due to the vast number of possible mutations. For example, a small protein with 100 amino acids has 20^100 possible sequences, more than the number of atoms in the observable universe. Additionally, the complex relationship between protein sequences and their fitness creates a sparse and rugged fitness landscape, making it difficult to identify optimal mutations [30, 31]. This rugged landscape means that even one mutation can drastically affect a protein’s function, causing significant variability in fitness. Ultimately, the challenges posed by the vast sequence space and rugged fitness landscape underscore the need for more strategic approaches in protein engineering.

Protein engineers approach biological problems by employing a diverse array of methodologies, all guided by the foundational concept of protein sequence-structure-function relationships [28, 29]. This concept articulates that the sequence of amino acids dictates a protein’s structure, and in turn, its structure determines its function. A key task in this field is structure prediction, exemplified by breakthroughs like AlphaFold [32], which reveals how sequences fold into intricate three-dimensional shapes, enabling advancements in drug design and disease understanding. Conversely, when a function or interaction is predefined, engineers focus on designing novel (de novo) proteins tailored to fold into structures that execute specific functions [33]. This creative process addresses the challenge of the vast and sparse search space of amino acid sequences. Additionally, protein engineering extends to fitness optimization, aimed at refining an existing protein’s stability and activity for enhanced function [34, 35]. Collectively, these efforts exemplify how leveraging a broad spectrum of information beyond mere structure can lead to groundbreaking advances in protein engineering.

Addressing complex biological challenges involves tools like rational design [36, 37, 38] and directed evolution (DE) [39]. Rational design utilizes already-known information—including protein sequences, structures, phylogenetic data, knowledge of active sites, and predictions from simulations—to predict new, functional proteins. Techniques like X-ray crystallography [40], NMR spectroscopy [41], cryo-electron microscopy (Cryo-EM) [42], and machine learning (ML) models like AlphaFold [32] further aid this process. For example, rational design has created novel enzymes that break down the cell walls of antibiotic-resistant bacteria, improving drug efficacy, and has stabilized therapeutic proteins for enhanced drug delivery [43, 44, 45].
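Before turning to directed evolution, it is worth making the sequence-space estimate quoted above concrete. The short calculation below is a back-of-the-envelope sketch; the figure of roughly 10^80 atoms in the observable universe is a commonly cited estimate used here as an assumption, not a value taken from this dissertation.

import math

n_positions = 100        # residues in a small protein
n_amino_acids = 20       # natural amino acids
log10_sequences = n_positions * math.log10(n_amino_acids)
print(f"Possible sequences: ~10^{log10_sequences:.0f}")   # ~10^130
print("Atoms in the observable universe: ~10^80 (commonly cited estimate)")

Even for this small protein, the sequence space exceeds the atom count of the universe by roughly fifty orders of magnitude, which is why exhaustive experimental screening is impossible and strategic navigation of the landscape is required.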
DE mimics natural selection in the lab to iteratively improve proteins. Starting with a parent protein, random mutations create a library of variants, which are screened for desirable traits. DE has been impactful in enhancing the binding of affibody proteins for improved HER2-positive breast cancer diagnosis [46], developing therapeutic protein candidates with selective biological impact for treating a wide array of human disorders including cancer and autoimmune/inflammatory diseases [47, 48], and improving nitrilases for greener pharmaceutical synthesis [49]. This approach has significantly advanced therapeutic efficacy, diagnostic accuracy, and industrial sustainability. Combining rational design and DE has led to breakthroughs like enzymes for plastic degradation [50]. These methods are now enhanced by machine learning and data-driven approaches. The fusion of traditional methods with computational models and high-throughput screening is paving the way for innovative solutions in medicine, diagnostics, industry, and environmental science.

1.2 A New Era in Protein Engineering: The Integration of Machine Learning

Recent advancements in ML techniques, along with the growing availability of biological data, have fundamentally transformed protein engineering, shifting it from a theory-driven to a data-driven discipline [51, 52]. Traditionally, theory-driven approaches relied on domain knowledge and mathematical models to capture the attributes and physics of proteins. However, ML methods focus on modeling observed data, which is particularly beneficial for the complex, high-dimensional challenges encountered in protein engineering [53]. ML models extract meaningful features from labeled protein sequence data, significantly enhancing predictive accuracy and generalization [54]. These models identify patterns and relationships within the data, enabling the forecasting of protein properties such as binding affinity, solubility, and stability [51, 55, 56, 57, 58, 59]. Despite the abundance of large-scale sequence data, the lack of experimental fitness annotations necessitates the use of more advanced ML models, such as self-supervised and unsupervised methods. Self-supervised learning techniques, including large language models (LLMs) [60, 61] like AlphaFold [32] and Evolutionary Scale Modeling (ESM) [62], leverage inherent patterns in unlabeled sequences to decipher the biological "language" of proteins, enabling accurate predictions of their structure and function [62, 63]. Generative models, such as ProGen [64], ProtVAE [65], ProtGPT2 [66], and ProteinGAN [67], learn the underlying data distribution to create novel proteins with desired characteristics. They use various neural network architectures, including variational autoencoders, transformers, and adversarial neural networks [68]. LLM-based models like TAPE [69], UniRep [63], and ProtTrans [70] capture dependencies within protein sequences, resulting in rich semantic and syntactic numerical representations. By integrating these advanced ML techniques with traditional protein engineering methods, researchers can effectively navigate the high-dimensional landscape of protein design and engineering. This comprehensive toolkit not only enhances our understanding of protein data but also paves the way for innovative solutions in industrial, medical, and research applications.
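As a concrete illustration of how such learned representations are used downstream, the minimal sketch below extracts ESM-2 embeddings with the fair-esm package and feeds them to a simple classifier. This is a hedged sketch under stated assumptions: it assumes fair-esm, PyTorch, and scikit-learn are installed, the checkpoint name is one of the publicly released ESM-2 models, and the sequences and labels are toy placeholders rather than data from this work.

import torch
import esm
from sklearn.linear_model import LogisticRegression

# Load a small public ESM-2 checkpoint (downloads weights on first use).
model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Toy, equal-length example sequences with placeholder labels (e.g., binder vs. non-binder).
data = [("seq1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
        ("seq2", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEAQ")]
labels = [1, 0]

_, _, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[12])
reps = out["representations"][12]

# Mean-pool over residue positions (dropping BOS/EOS tokens) to get one vector per sequence.
features = reps[:, 1:-1, :].mean(dim=1).numpy()

clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))

In practice, the pooled embeddings would be paired with experimentally labeled variants and a held-out test set; the sketch only shows the mechanics of turning sequences into fixed-length numerical representations.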
The continued developments in ML have significantly enhanced our ability to analyze extensive datasets, predict protein mutations, and generate novel functional proteins. However, these techniques often face challenges due to the complex nature of biological problems. In protein engineering, effectively predicting protein properties or designing novel proteins requires integrating biological knowledge with ML tools. The protein family and specific property being studied can drastically influence the problem-solving strategy [24]. While protein language models have been successful, training simpler ML models on one protein family at a time with evolutionary tools like alignment has sometimes yielded more promising results [69, 71]. Additionally, LLMs trained on public datasets may lack the high-quality, specialized data needed for protein engineering, where data quality and diversity are crucial for reliable predictions [72, 73, 74]. Furthermore, the property under investigation, whether stability or binding, also dictates the complexity of the ML models required [24]. By identifying and addressing these issues, we can achieve considerable progress in the field, paving the way for more precise and effective protein engineering solutions.

1.3 Thesis Overview: ML Solutions for Biological Challenges

This thesis investigates optimization methods for navigating protein fitness space by integrating ML and computational tools to address domain-specific challenges. The five chapters tackle critical issues such as reducing data noise, optimizing ML algorithms for next-generation sequencing (NGS) data [75], and employing advanced techniques in protein engineering. They delve into the use of generative models, data-centric approaches, and evolutionary algorithms to boost ML performance and improve protein fitness predictions. Chapter Two (Mardikoraem, Yeast Surface Display, 2022) [76] explores the latest ML algorithm advancements and their application to predict sequence mutations before experiments, integrating ML into directed evolution techniques. Chapter Three (Mardikoraem, Pharmaceutics, 2023) [24] focuses on enhancing ML model performance by refining protein representations, managing specific properties like sequence length, and addressing imbalanced datasets. Chapter Four examines Organic Anion Transporter Proteins (OATPs), crucial in drug development, using multiple ML algorithms to predict their behavior and highlight important physicochemical properties despite the scarcity of experimental structures. Chapter Five (Mardikoraem, Briefings in Bioinformatics, 2023) [68] provides a comprehensive overview of ML-driven protein engineering, particularly language modeling and generative models for protein sequence modeling, identifying gaps and familiarizing researchers with this specialized field. Chapter Six (Mardikoraem, ICLR 2024 Workshop on Generative and Experimental Perspectives for Biomolecular Design, 2024) [74] emphasizes the importance of high-quality domain-specific data and data-centric methods, highlighting how evolutionary information and ancestral sequence reconstruction can enhance generative models. Overall, this thesis explores the integration of advanced ML techniques, generative models, data-centric approaches, and evolutionary algorithms to contribute to the ongoing advancements in protein engineering.

In Chapter Two, I developed a robust methodology for effectively mapping next-generation sequencing (NGS) [75] data to experimental annotations from high-throughput techniques.
This incorporates cleaning the data, preprocessing, algorithm selection, and post-processing. For cleaning, I carefully chose and combined multiple software tools (PEAR [77], SEQTK [78]) to minimize experimental noise and increase the signal-to-noise ratio (SNR) [79] in the initial data. Preprocessing then follows, such as counting how many times each sequence appeared in the obtained data and ranking sequences from highly functional to weakly functional. Other processing steps, such as the choice of protein numerical representations to make the data compatible with ML and the selection of ML algorithms, were also explored. The most significant contribution of this book chapter, aside from a comprehensive guideline on implementing ML with NGS data, is showing how simple ML models can contribute to protein engineering paradigms. ML-driven methodologies offer several key advantages: (i) They provide a better starting point for directed evolution techniques, which are highly path-dependent and rely on the initial sequence. (ii) They help explore previously uncharted regions of fitness space, revealing the relationship between sequence data and its properties. (iii) They assist in experimental design, including the design of oligonucleotides.

In Chapter Three [24], I investigated the performance of language models in predicting protein properties, comparing them to simpler representations like one-hot encoding and physicochemical encoding. This study aimed to understand how different representations capture various aspects of protein fitness. I proposed an ensemble technique that combines these methods to enhance ML algorithm performance. The study evaluated four representation methods: one-hot encoding, physicochemical encoding, UniRep [63], and ESM [62]. It also examined the impact of protein size, revealing significant differences between small proteins (≤120 amino acids) and large proteins (400–1500 amino acids). For small proteins, one-hot encoding significantly boosted classification metrics, achieving a mean F1-score of 94% when combined with other encodings. However, it proved problematic for large proteins due to sparsity. The findings demonstrated that SMOTE [80] outperformed undersampling for imbalanced datasets when using one-hot, UniRep, and ESM representations. Ensemble learning increased predictive performance by 4% in affinity-based datasets and achieved high F1 scores in stability prediction using ESM. By employing advanced sampling techniques and multiple-criteria decision analysis (MCDA) [81], the study ensured statistical rigor. This approach not only benchmarks and compares protein representation models but also highlights the benefits of ensemble modeling in capturing distinct aspects of protein fitness. Critical features for distinguishing functional proteins were identified, providing a solid foundation for future research in protein engineering and significant advancements in predictive modeling.

Another significant focus of this thesis in Chapter Four is overcoming the challenges associated with OATPs [82, 83]. Drug-drug interactions (DDIs) pose substantial challenges in pharmacology, potentially leading to severe side effects and reduced therapeutic efficacy. The FDA’s guidelines emphasize the importance of predictive DDI models in drug regulation. While predictive models for metabolic DDIs involving cytochrome P450 enzymes are well-developed, there is a notable gap in transporter-mediated DDI (tDDI) models, particularly for OATPs.
These transport proteins are crucial in ADME processes (absorption, distribution, metabolism, and excretion) and play a vital role in drug disposition and pharmacokinetics. The lack of high-resolution structural data for OATPs and significant variability in available in vitro data have historically limited the development of accurate predictive models. This research employs a multimodal approach, combining advanced neural network architectures like graph neural networks (GNNs) and molecular modeling tools to integrate diverse information sources effectively. GNNs model complex molecular interactions and spatial relationships within proteins, enabling the development of novel, generalizable methods for modeling OATP-drug interactions and filling a critical gap in transporter-DDI prediction models.

Chapter Five delves into advanced sequence modeling techniques, expanding on the previous chapters’ focus on model improvement and language model performance in predicting protein properties, and offers a more holistic examination of current ML model development and its success in the protein engineering domain [68]. With the rapid expansion of high-throughput omics technologies, the volume of protein sequence data has surged, necessitating sophisticated ML methods. This chapter systematically reviews promising neural networks for protein sequences, covering core mathematical concepts and their implementation in protein engineering tasks. The discussion begins with language models, ideal for protein sequence data due to their ability to handle variable sequence lengths and track long-term dependencies while maintaining token order. It covers recurrent neural networks (RNNs) [84], the self-attention mechanism [85], and transformers [86]. The chapter then explores generative models: variational autoencoders (VAEs) [87], generative adversarial networks (GANs) [88], and diffusion models [89]. VAEs use a probabilistic approach to learn the data distribution, GANs involve competing neural networks to generate realistic samples, and diffusion models add noise to data and reverse the process to generate new samples. Additionally, current challenges and future directions in applying these models to protein engineering are discussed, providing practical insights for researchers.

In Chapter Six [74], a data-centric approach is adopted by integrating evolutionary data, particularly through ancestral sequence reconstruction (ASR) [90], to enhance the generative and language models discussed in Chapter Five. ASR provides high-quality training datasets from evolutionary trajectories and diverse functional adaptations, improving the robustness and generalizability of ML models. Evolutionary insights are valuable because ancient proteins often exhibit desirable traits: stability, adaptability, and pharmacologically relevant properties [90, 91]. By utilizing statistical models and phylogenetic tree reconstruction, extensive, high-quality datasets are created for training generative models. These datasets, generated with the AP-LASR [91] tool, include homogeneous and diverse sets to enhance sequence variability and model robustness. Instead of relying solely on the most probable sequences, sampling from a posterior probability distribution captures a wide spectrum of evolutionary insights [92].
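To illustrate what sampling ancestral variants from per-site posterior probabilities looks like in practice, the short sketch below draws sequences from a hypothetical posterior matrix and contrasts them with the single most-probable ancestor. The per-site probabilities are invented placeholders for illustration only; they are not values from the AP-LASR runs described in this work.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-site posterior over amino acids at three ancestral positions (each sums to 1).
posterior = [
    {"L": 0.85, "M": 0.10, "I": 0.05},   # site 1
    {"K": 0.60, "R": 0.35, "Q": 0.05},   # site 2
    {"D": 0.95, "E": 0.05},              # site 3
]

def sample_ancestor(posterior, rng):
    # Draw one ancestral sequence by sampling each site independently from its posterior.
    seq = []
    for site in posterior:
        aas, probs = zip(*site.items())
        seq.append(rng.choice(aas, p=probs))
    return "".join(seq)

# Most-probable (maximum a posteriori) ancestor vs. a handful of sampled variants.
map_ancestor = "".join(max(site, key=site.get) for site in posterior)
samples = [sample_ancestor(posterior, rng) for _ in range(5)]
print(map_ancestor, samples)

Sampling in this way yields a family of plausible ancestors rather than a single consensus, which is the mechanism by which posterior sampling broadens the diversity of the training data.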
Both generative models and language models are employed to utilize evolutionary data: the former to generate novel hyperstable and diverse protein sequences, and the latter to create competitive and sometimes superior protein representations for downstream stability prediction tasks, informed by evolutionary information. VAE is employed for generating novel protein sequences, transforming them into a computationally suitable format using one-hot encoding, and processing them through a 1D convolutional neural network (CNN) [93] layer to capture local sequence patterns. The VAE’s architecture, with a latent space dimensionality of 100, captures the nuances of protein sequence variability. The generated sequences are evaluated using AlphaFold2 for 3D structure prediction and FoldX [94] for stability calculations, ensuring structural integrity and thermal stability. Additionally, a new protein representation aware of their evolutionary path is introduced using the ESM2 protein language model. This representation is extracted to refine protein family classification tasks with KNN [95], Random Forest [96], and XGBoost algorithms [97], resulting in competitive and sometimes better performance in stability prediction tasks. This comprehensive framework demonstrates that addressing specific biological problems requires a nuanced understanding of both the biological context and the strategic application of appropriate ML tools, leading to more efficient and effective protein design. 8 BIBLIOGRAPHY [1] Hamid Ramezani and Hendrik Dietz. Building machines with dna molecules. Nature Reviews Genetics, 21(1):5–26, 2020. [2] Paul C Whitford and José N Onuchic. What protein folding teaches us about biological function and molecular machines. Current Opinion in Structural Biology, 30:57–62, 2015. [3] David C Miller, Soumitra V Athavale, and Frances H Arnold. Combining chemistry and protein engineering for new-to-nature biocatalysis. Nature Synthesis, 1(1):18–23, 2022. [4] Jie Zhu, Nicole Avakyan, Albert Kakkis, Alexander M Hoffnagle, Kenneth Han, Yiying Li, Zhiyin Zhang, Tae Su Choi, Youjeong Na, Chung-Jui Yu, et al. Protein assembly by design. Chemical reviews, 121(22):13701–13796, 2021. [5] Tristan Bepler and Bonnie Berger. Learning the protein language: Evolution, structure, and function. Cell systems, 12(6):654–669, 2021. [6] Isabell Bludau and Ruedi Aebersold. Proteomic and interactomic insights into the molecular basis of cell functional diversity. Nature Reviews Molecular Cell Biology, 21(6):327–340, 2020. [7] Elizabeth L Lieu, Tu Nguyen, Shawn Rhyne, and Jiyeon Kim. Amino acids in cancer. Experimental & molecular medicine, 52(1):15–30, 2020. [8] Ehsan Kianfar. Protein nanoparticles in drug delivery: animal protein, plant proteins and protein cages, albumin nanoparticles. Journal of Nanobiotechnology, 19(1):159, 2021. [9] Mattia D Pizzagalli, Ariel Bensimon, and Giulio Superti-Furga. A guide to plasma membrane solute carrier proteins. The FEBS journal, 288(9):2784–2835, 2021. [10] Bruno S. Silvestre and Diana Mihaela Ţîrcă. Innovations for sustainable development: Moving toward a sustainable future. Journal of Cleaner Production, 208:325–332, 2019. [11] Ali Miserez, Jing Yu, and Pezhman Mohammadi. Protein-based biological materials: Molecular design and artificial production. Chemical Reviews, 123(5):2049–2111, 2023. [12] Chenyi Li, Ruihua Zhang, Jian Wang, Lauren Marie Wilson, and Yajun Yan. Protein engineering for improving and diversifying natural product biosynthesis. 
Trends in biotechnology, 38(7):729–744, 2020. [13] Sasha B Ebrahimi and Devleena Samanta. Engineering protein-based therapeutics through structural and chemical design. Nature Communications, 14(1):2411, 2023. [14] Bernard R Glick and Cheryl L Patten. Molecular biotechnology: principles and applications 9 of recombinant DNA. John Wiley & Sons, 2022. [15] Fuguo Jiang and Jennifer A Doudna. Crispr–cas9 structures and mechanisms. Annual review of biophysics, 46:505–529, 2017. [16] Christopher B Black, Thomas D Duensing, Linda S Trinkle, and R Terry Dunlay. Cell-based screening using high-throughput flow cytometry. Assay and drug development technologies, 9(1):13–20, 2011. [17] Linghui Qian, Xuefen Lin, Xue Gao, Rizwan Ullah Khan, Jia-Yu Liao, Shubo Du, Jingyan Ge, Su Zeng, and Shao Q Yao. The dawn of a new era: targeting the “undruggables” with antibody-based therapeutics. Chemical reviews, 123(12):7782–7853, 2023. [18] Sasha B Ebrahimi and Devleena Samanta. Engineering protein-based therapeutics through structural and chemical design. Nature Communications, 14(1):2411, 2023. [19] Rohitash Yadav, Jitendra Kumar Chaudhary, Neeraj Jain, Pankaj Kumar Chaudhary, Supriya Khanra, Puneet Dhamija, Ambika Sharma, Ashish Kumar, and Shailendra Handu. Role of structural and non-structural proteins and therapeutic targets of sars-cov-2 for covid-19. Cells, 10(4):821, 2021. [20] Timothy M Caradonna and Aaron G Schmidt. Protein engineering strategies for rational immunogen design. npj Vaccines, 6(1):154, 2021. [21] Naveen K Mehta, Roma V Pradhan, Ava P Soleimany, Kelly D Moynihan, Adrienne M Rothschilds, Noor Momin, Kavya Rakhra, Jordi Mata-Fink, Sangeeta N Bhatia, K Dane Wittrup, et al. Pharmacokinetic tuning of protein–antigen fusions enhances the immunogenicity of t-cell vaccines. Nature biomedical engineering, 4(6):636–648, 2020. [22] Chloe Hsu, Hunter M. Nisonoff, Clara Fannjiang, and Jennifer Listgarten. Learning protein fitness models from evolutionary and assay-labeled data. Nature Biotechnology, 2022. [23] Mehrsa Mardikoraem and Daniel R. Woldring. Protein fitness prediction is impacted by the interplay of language models, ensemble learning, and sampling methods. Pharmaceutics, 2023. [24] Mehrsa Mardikoraem and Daniel Woldring. Protein fitness prediction is impacted by the interplay of language models, ensemble learning, and sampling methods. Pharmaceutics, 15(5), 2023. [25] Jacques Monod, Jeffries Wyman, and Jean-Pierre Changeux. On the nature of allosteric transitions: A plausible model. Journal of Molecular Biology, 12(1):88–118, 1965. [26] Lauren E Powell and Paul A Foster. Protein disulphide isomerase inhibition as a potential cancer therapeutic strategy. Cancer medicine, 10(8):2812–2825, 2021. 10 [27] Alexander S. Hauser, Maggie M. Attwood, Mathias Rask-Andersen, Helgi B. Schiöth, and David E. Gloriam. Pharmacogenomics of gpcr drug targets. Cell, 172(1–2):41–54, 2017. [28] Julia Koehler Leman, Pawel Szczerbiak, P Douglas Renfrew, Vladimir Gligorijevic, Daniel Berenberg, Tommi Vatanen, Bryn C Taylor, Chris Chandler, Stefan Janssen, Andras Pataki, et al. Sequence-structure-function relationships in the microbial protein universe. Nature communications, 14(1):2351, 2023. [29] Noelia Ferruz, Michael Heinzinger, Mehmet Akdel, Alexander Goncearenco, Luca Naef, and Christian Dallago. From sequence to function through structure: Deep learning for protein design. Computational and Structural Biotechnology Journal, 21:238–250, 2023. [30] JoséNelson Onuchic, Zaida Luthey-Schulten, and PeterGWolynes. 
Theory of protein folding: the energy landscape perspective. Annual review of physical chemistry, 48(1):545–600, 1997. [31] Frances H Arnold. Design by directed evolution. Accounts of chemical research, 31(3):125–131, 1998. [32] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zıdek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A.A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with alphafold. Nature, 596:583–589, 2021. [33] Po-Ssu Huang, Scott E Boyken, and David Baker. The coming of age of de novo protein design. Nature, 537(7620):320–327, 2016. [34] Jason Yang, Julie Ducharme, Kadina E Johnston, Francesca-Zhoufan Li, Yisong Yue, and Frances H Arnold. Decoil: Optimization of degenerate codon libraries for machine learning-assisted protein engineering. ACS Synthetic Biology, 12(8):2444–2454, 2023. [35] Ariel Goldenzweig and Sarel J. Fleishman. Automated structure- and sequence-based design of proteins for high bacterial expression and stability. Methods in Enzymology, 580:281–315, 2016. [36] Xuefei Wang, Duan Ni, Yaqin Liu, and Shaoyong Lu. Rational design of peptide-based inhibitors disrupting protein-protein interactions. Frontiers in chemistry, 9:682675, 2021. [37] Vincent Frappier and Rafael J. Najmanovich. Rational design of proteins and enzymes for biocatalysis: Recent advances and future perspectives. Biotechnology Advances, 36(8):2230–2241, 2018. 11 [38] Jonathan Weinstein, Olga Khersonsky, and Sarel J Fleishman. Practically useful protein-design methods combining phylogenetic and atomistic calculations. Current Opinion in Structural Biology, 63:58–64, 2020. Membranes, Engineering and Design. [39] Frances H Arnold. Directed evolution, bringing new chemistry to life. Angewandte Chemie International Edition, 57(16):4143–4148, 2018. [40] Andrea Ilari and Carmelinda Savino. Protein Structure Determination by X-Ray Crystallography, pages 63–87. Humana Press, Totowa, NJ, 2008. [41] Kurt Wüthrich. Protein structure determination in solution by nmr spectroscopy. Journal of Biological Chemistry, 265(36):22059–22062, 1990. [42] Ka ManYip, Niels Fischer, Elham Paknia, Ashwin Chari, and Holger Stark. Atomic-resolution protein structure determination by cryo-em. Nature, 587(7832):157–161, 2020. [43] Li Tang and Anne Besson. Rational design of chemically controlled antibodies and protein therapeutics. ACS Chemical Biology, 2023. [44] Aisaraphon Phintha and Pimchai Chaiyen. Rational and mechanistic approaches for improving biocatalyst performance. Chem catalysis, 2022. [45] Shima Ghaedizadeh, Majid Zeinali, Bahareh Dabirmanesh, Behnam Rasekh, Ali Mohammad Banaei-Moghaddam, et al. Rational design engineering of a thermostable 𝛼-carbonic anhydrase from sulfurihydrogenibium yellowstonense to improve its thermostability for industrial applications. 2022. [46] Mikaela Friedman, Anna Orlova, Eva Johansson, Tove L.J. Eriksson, Ingmarie Höidén-Guthenberg, Vladimir Tolmachev, Fredrik Y. Nilsson, and Stefan Ståhl. Directed evolution to low nanomolar affinity of a tumor-targeting epidermal growth factor receptor-binding affibody molecule. 
Journal of Molecular Biology, 376(5):1388–1402, 2008. [47] S. D. Levin, L. S. Evans, S. Bort, E. Rickel, K. E. Lewis, R. P. Wu, J. Hoover, S. MacNeil, D. La, M. F. Wolfson, M. W. Rixon, S. R. Dillon, M. G. Kornacker, R. Swanson, and S. L. Peng. Novel immunomodulatory proteins generated via directed evolution of variant igsf domains. Frontiers in Immunology, 10:3086, 2020. [48] Ron Amon, Ronit Rosenfeld, Shahar Perlmutter, Oliver C. Grant, Sharon Yehuda, Aliza Borenstein-Katz, Ron Alcalay, Tal Marshanski, Hai Yu, Ron Diskin, Robert J. Woods, Xi Chen, and Vered Padler-Karavani. Directed evolution of therapeutic antibodies targeting glycosylation in cancer. Cancers, 12(10), 2020. [49] Keith A Powell, SandraWRamer, Stephen B Del Cardayré,Willem PC Stemmer, Matthew B Tobin, Pascal F Longchamp, and Gjalt W Huisman. Directed evolution and biocatalysis. Angewandte Chemie International Edition, 40(21):3948–3959, 2001. 12 [50] Valentina Pirillo, Marco Orlando, Caren Battaglia, Loredano Pollegioni, and Gianluca Molla. Efficient polyethylene terephthalate degradation at moderate temperature: a protein engineering study of lc-cutinase highlights the key role of residue 243. The FEBS journal, 290(12):3185–3202, 2023. [51] Chang Feng. Potential of deep representative learning features to interpret the sequence information in proteomics. Proteomics, 2021. [52] O. Simeone. Avery brief introduction to machine learning with applications to communication systems. IEEE Transactions on Cognitive Communications and Networking, 4:648–664, 2018. [53] Kevin K Yang, Zachary Wu, and Frances H Arnold. Machine-learning-guided directed evolution for protein engineering. Nature methods, 16(8):687–694, 2019. [54] Ameya Harmalkar, Roshan Rao, Yuxuan Richard Xie, Jonas Honer, Wibke Deisting, Jonas Anlahr, Anja Hoenig, Julia Czwikla, Eva Sienz-Widmann, Doris Rau, et al. Toward generalizable prediction of antibody thermostability using machine learning on sequence and structure features. In MAbs, volume 15, page 2163584. Taylor & Francis, 2023. [55] T. B. Alakus and I. Turkoglu. Prediction of protein-protein interactions with lstm deep learning model. In 3rd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT 2019) - Proceedings, pages 1–5, 2019. [56] A. Ahmed, B. Mam, and R. Sowdhamini. Deelig: A deep learning-based approach to predict protein-ligand binding affinity. 2020. [57] L. Heo and M. Feig. High-accuracy protein structures by combining machine-learning with physics-based refinement. Proteins: Structure, Function, and Bioinformatics, 88:637–642, 2020. [58] S. Khurana, R. Rawi, K. Kunji, et al. Deepsol: A deep learning framework for sequence-based protein solubility prediction. Bioinformatics, 34:2605–2613, 2018. [59] M. Giollo, A. J. M. Martin, I. Walsh, et al. Neemo: A method using residue interaction networks to improve prediction of protein stability upon mutation. BMC Genomics, 15:1–11, 2014. [60] Lawrence Fong. Learning meaningful representations of protein sequences. Nature Communications, 2022. [61] Prakash M Nadkarni, Lucila Ohno-Machado, and Wendy W Chapman. Natural language processing: an introduction. Journal of the American Medical Informatics Association, 18(5):544–551, 2011. [62] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu,Wenting Lu, Nikita Smetanin, 13 Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. 
Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023. [63] Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-based deep representation learning. Nature methods, 16(12):1315–1322, 2019. [64] Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R Eguchi, Po-Ssu Huang, and Richard Socher. Progen: Language modeling for protein generation. arXiv preprint arXiv:2004.03497, 2020. [65] Emre Sevgen, Joshua Moller, Adrian Lange, John Parker, Sean Quigley, Jeff Mayer, Poonam Srivastava, Sitaram Gayatri, David Hosfield, Maria Korshunova, et al. Prot-vae: protein transformer variational autoencoder for functional protein design. bioRxiv, pages 2023–01, 2023. [66] Noelia Ferruz, Steffen Schmidt, and Birte Höcker. Protgpt2 is a deep unsupervised language model for protein design. Nature communications, 13(1):4348, 2022. [67] Donatas Repecka, Vykintas Jauniskis, Laurynas Karpus, Elzbieta Rembeza, Irmantas Rokaitis, Jan Zrimec, Simona Poviloniene, Audrius Laurynenas, Sandra Viknander, Wissam Abuajwa, et al. Expanding functional protein sequence spaces using generative adversarial networks. Nature Machine Intelligence, 3(4):324–333, 2021. [68] Mehrsa Mardikoraem, Zirui Wang, Nathaniel Pascual, and Daniel Woldring. Generative models for protein sequence modeling: recent advances and future directions. Briefings in Bioinformatics, 24(6):bbad358, 2023. [69] Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, and Yun Song. Evaluating protein transfer learning with tape. Advances in neural information processing systems, 32, 2019. [70] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, et al. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 44(10):7112–7127, 2021. [71] Amir Shanehsazzadeh, David Belanger, and David Dohan. Is transfer learning necessary for protein landscape prediction? arXiv preprint arXiv:2011.03443, 2020. [72] Pascal Notin, AaronWKollasch, Daniel Ritter, Lood Van Niekerk, Steffan Paul, Han Spinner, Nathan J Rollins, Ada Shaw, Rose Orenbuch, Ruben Weitzman, Jonathan Frazer, Mafalda Dias, Dinko Franceschi, Yarin Gal, and Debora Susan Marks. Proteingym: Large-scale benchmarks for protein fitness prediction and design. In Thirty-seventh Conference on Neural 14 Information Processing Systems Datasets and Benchmarks Track, 2023. [73] Christian Dallago, Jody Mou, Kadina E Johnston, Bruce Wittmann, Nick Bhattacharya, Samuel Goldman, Ali Madani, andKevinKYang. FLIP: Benchmark tasks in fitness landscape inference for proteins. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. [74] Mehrsa Mardikoraem and Daniel R. Woldring. PROTEIN FITNESS LANDSCAPE NAVIGATION ISBOOSTEDVIA INCORPORATINGEVOLUTIONARYINFORMATION INTO MACHINE LEARNING MODELS. In ICLR 2024 Workshop on Generative and Experimental Perspectives for Biomolecular Design, 2024. [75] Emily E Wrenbeck, Matthew S Faber, and Timothy A Whitehead. Deep sequencing methods for protein engineering and design. Current opinion in structural biology, 45:36–44, 2017. [76] Mehrsa Mardikoraem and Daniel Woldring. 
Machine Learning-driven Protein Library Design: A Path Toward Smarter Libraries, pages 87–104. Springer US, New York, NY, 2022. [77] Jiajie Zhang, Kassian Kobert, Tomáš Flouri, and Alexandros Stamatakis. Pear: a fast and accurate illumina paired-end read merger. Bioinformatics, 30(5):614–620, 2014. [78] H Li. Github-lh3/seqtk: toolkit for processing sequences in fasta/q formats, 2021. [79] Barry T Bosworth,WRobert Bernecky, JamesDNickila, BerhaneAdal, andGClifford Carter. Estimating signal-to-noise ratio (snr). IEEE Journal of Oceanic Engineering, 33(4):414–418, 2008. [80] BKovács, F Tinya, CNémeth, and P Ódor. Smote: synthetic minority over-sampling technique nitesh. Ecol. Appl, 30(2):321–357, 2020. [81] Ralph L Keeney, Howard Raiffa, and David W Rajala. Decisions with multiple objectives: Preferences and value trade-offs. IEEE transactions on Systems, man, and cybernetics, 9(7):403–403, 1979. [82] Jörg König, Annick Seithel, Ulrike Gradhand, and Martin F Fromm. Pharmacogenomics of human oatp transporters. Naunyn-Schmiedeberg’s archives of pharmacology, 372:432–443, 2006. [83] Klaas Nico Faber, Michael Müller, and Peter LM Jansen. Drug transport proteins in the liver. Advanced drug delivery reviews, 55(1):107–124, 2003. [84] Robin M. Schmidt. Recurrent neural networks (rnns): A gentle introduction and overview. CoRR, abs/1912.05911, 2019. 15 [85] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017. [86] Richard E. Turner. An introduction to transformers, 2024. [87] Diederik P Kingma, Max Welling, et al. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019. [88] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. [89] Hanqun Cao, Cheng Tan, Zhangyang Gao, Yilun Xu, Guangyong Chen, Pheng-Ann Heng, and Stan Z Li. A survey on generative diffusion models. IEEE Transactions on Knowledge and Data Engineering, 2024. [90] Yvonne Joho, Vanessa Vongsouthi, Matthew A. Spence, Jennifer Ton, Chloe Gomez, Li Lynn Tan, Joe A. Kaczmarski, Alessandro T. Caputo, Santana Royan, Colin J. Jackson, and Albert Ardevol. Ancestral sequence reconstruction identifies structural changes underlying the evolution of ideonella sakaiensis petase and variants with improved stability and activity. Biochemistry, 62(2):437–450, 2023. PMID: 35951410. [91] James VanAntwerp, Mehrsa Mardikoraem, Nathaniel Pascual, and DanielWoldring. Ap-lasr: Automated protein libraries from ancestral sequence reconstruction. bioRxiv, 2023. [92] Geeta N. Eick, Jamie T. Bridgham, Douglas P. Anderson, Michael J. Harms, and Joseph W. Thornton. Robustness of Reconstructed Ancestral Protein Functions to StatisticalUncertainty. Molecular Biology and Evolution, 34(2):247–261, 10 2016. [93] Keiron O’Shea and Ryan Nash. An introduction to convolutional neural networks. CoRR, abs/1511.08458, 2015. [94] Joost Schymkowitz, Jesper Borg, Francois Stricher, Robby Nys, Frederic Rousseau, and Luis Serrano. The FoldX web server: an online force field. Nucleic Acids Research, 33(suppl2):W382–W388, 07 2005. [95] Gongde Guo, HuiWang, David Bell, Yaxin Bi, and Kieran Greer. Knn model-based approach in classification. pages 986–996, 2003. [96] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. [97] Tianqi Chen and Carlos Guestrin. 
Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016. 16 CHAPTER 2 MACHINE LEARNING-DRIVEN PROTEIN LIBRARY DESIGN: A PATH TOWARD SMARTER LIBRARIES 1 Abstract Proteins are small yet valuable biomolecules that play a versatile role in therapeutics and diagnostics. The intricate sequence-structure-function paradigm in the realm of proteins opens the possibility for directly mapping amino acid sequences to function. However, the rugged nature of the protein fitness landscape and an astronomical number of possible mutations even for small proteins make navigating this system a daunting task. Moreover, the scarcity of functional proteins and the ease with which deleterious mutations are introduced, due to complex epistatic relationships, compound the existing challenges. This highlights the need for auxiliary tools in current techniques such as rational design and directed evolution. To that end, state-of-the-art machine learning can offer time and cost efficiency in finding high-fitness proteins, circumventing unnecessary wet-lab experiments. In the context of improving library design, machine learning provides valuable insights via its unique features such as high adaptation to complex systems, multi-tasking and parallelism, and the ability to capture hidden trends in input data. Finally, both the advancements in computational resources and the rapidly increasing number of sequences in protein databases will allow more promising and detailed insights delivered from machine learning to protein library design. In this chapter, fundamental concepts and a method for machine learning-driven library design leveraging deep sequencing datasets will be discussed. We elaborate on i) basic knowledge about machine learning algorithms, ii) the benefit of machine learning in library design, and iii) methodology for implementing machine learning in library design. Keywords: Library Design, Directed Evolution, Machine Learning, Deep Learning 1This chapter is adapted from content published in "Machine Learning-driven Protein Library Design: A Path Toward Smarter Libraries," SpringerLink. All rights reserved. For more information, visit https://link.springer.com/ protocol/10.1007/978-1-0716-2285-8_5. 17 2.1 Introduction Proteins are molecules with a wide variety of applications in biological processes. They have many fundamental functions in living organisms such as being catalysts, receptors, structural elements, transporters, and regulators [1, 2, 3, 4, 5]. Accordingly, increased potential for mapping the protein sequence to its function will result in comprehension and regulation of biological processes associated with functional disorders. As a result, techniques to efficiently navigate the protein fitness landscape are at the forefront of protein engineering. Nevertheless, the complexity and ruggedness of the protein fitness landscape, and the high potential of failure in searching through the fitness landscape (mutations resulting in unstable and non-functional variants), make this task more demanding. A growing number of computational and experimental approaches (e.g. high-resolution stability calculations [reference to stability calculation in the book chapter], virtual screening [6], deep sequencing [7], cytometry-based selections [8], ancestral sequence reconstruction [ref to ASR in book chapter]) seek to address these gaps in knowledge. 
Machine learning algorithms offer a platform for harnessing large, diverse datasets to understand natural protein features and guide protein engineering efforts. This provides the opportunity to map protein sequences to function without requiring explicit biophysical knowledge of individual sequences. Another advantage of such algorithms lies in their ability to expand the utility of experimentally derived sequences beyond the typically small subsets of lead variants. While directed evolution discards low-fitness protein variants, machine learning can leverage these sub-optimal variants to enhance the model’s predictive performance [9]. Machine learning proves particularly advantageous for protein engineering campaigns characterized by low-throughput or labor-intensive selection processes. Therefore, its usefulness depends on factors such as library size, screening difficulty, fitness landscape ruggedness, and the accuracy of the predictive model. As a result, the ability to capture complicated trends among protein datasets, aided by nonlinear functionality to reveal important features, makes machine learning a powerful tool for guiding protein library design.

Machine learning has played an important role in protein engineering for more than 30 years, yielding improved prediction of tertiary structure [10] and protein-protein interactions [11] using amino acid sequence information alone [10, 11]. Machine learning algorithms such as support vector machines (SVMs), random forest (RF), and gradient boosting machines (GBMs) have provided platforms for protein function and property prediction including stability, catalytic activity, and secondary structure [12, 13, 14, 15]. Deep learning (DL) [16], as a sub-field of machine learning, imitates human brain functionality in decision-making and learning experiences. Utilizing non-linear functions, these algorithms can learn and extract desired features from the provided input data, making them well suited for rich, high-dimensional datasets. This makes deep learning methods particularly promising for evaluating sequence–function trends among a rapidly growing number of protein sequences. Although practical and promising, developing a fine-tuned strategy to employ machine learning in the field demands an awareness of the existing challenges and capabilities. Here, we present a procedure for establishing deep-learning models that guide protein library design. Common challenges and best practices in this burgeoning field will be highlighted. We consider multiple practical applications of machine learning within the context of protein library design:

• Combinatorial library design based on deep sequencing data following high-throughput directed evolution
• Process parallelization and parameterization of features within far-reaching parts of the fitness landscape
• The ability to sample the diversity applied by specific degenerate codon techniques and oligonucleotide combinations before experimental implementation.

2.1.1 Providing a better starting point for directed evolution

Directed evolution campaigns are initialized using a parent sequence to implement mutations. Therefore, the path-dependency of directed evolution benefits from a high-fitness or highly stable sequence as a starting point to increase the probability of finding optimal regions within a fitness landscape [17]. Machine learning can guide directed evolution by identifying a large collection of promising sequences based on curated input data, as sketched below.
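The following is a minimal sketch of this idea, not the workflow of any specific published study: a simple supervised model is fit to one-hot encoded variants with measured fitness and then used to rank untested candidates as potential directed-evolution starting points. The sequences, fitness values, and candidate designs are toy placeholders.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    # Flattened one-hot encoding of a fixed-length amino acid sequence.
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

train_seqs = ["ACDKL", "ACDQL", "AGDKL", "TCDKL"]      # labeled variants (toy)
train_fitness = [0.9, 0.4, 0.7, 0.1]                   # e.g., enrichment ratios (toy)
candidates = ["ACDKI", "AGDQL", "TCDQL"]               # untested designs (toy)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit([one_hot(s) for s in train_seqs], train_fitness)

scores = model.predict([one_hot(s) for s in candidates])
ranked = sorted(zip(candidates, scores), key=lambda t: t[1], reverse=True)
print(ranked)   # highest-scoring candidates are natural starting points for DE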
In a recent example, a machine learning model was trained based on the initial library of fluorescent proteins to build a second small yet enriched protein library [18].

2.1.2 Investigating unexplored parts of the fitness landscape

Machine learning algorithms provide some unique advantages that enable finding unexplored variants that may possess high performance. Large datasets with high-complexity algorithms are not only able to predict functionality but can learn hidden characteristics and rules that exist in provided data. As an example, UniRep [19, 20] has been trained on 24 million amino acid sequences from UniRef50 [21] to obtain general trends in protein sequences and statistical representations of amino acids. Another factor that is effective in finding unexplored high-fitness sequences is the application of parallelism and multi-tasking within machine learning. Multi-task learning is a subset of machine learning algorithms that trains multiple tasks simultaneously in one unique model [22]. This feature enables capturing the epistatic relationships in protein sequences by providing a path-independent search in the fitness landscape [23]. In this way, applying machine learning to protein engineering enables an increased likelihood of accessing undetected regions of the fitness landscape.

2.1.3 Estimating degenerate codon performance via fitness distribution analysis

Implementing a well-trained machine learning algorithm enables the evaluation of multiple design strategies and reduces experimental effort. Various experimental techniques have been developed to improve the efficiency of directed evolution, such as gene shuffling [24] and neutral drift library screening [25], to manage the library size and increase the likelihood of finding the desired property. Another highly used method is to generate libraries based on degenerate codons to introduce tailored diversity at individual positions. As an example, while NNK is used to generate a library coding for all amino acids (with 3% stop codons), other combinations such as NYC (coding for hydrophobic residues), KST (coding for small residues), and NDT (coding for a balanced set of all amino acid properties) are available. In addition, several impressive computational attempts have been used to go beyond these techniques and optimize the oligonucleotide combinations [26, 27]. Among these potential degenerate codon techniques for library design, the user can pre-analyze their performance for the desired protein. Figure 2.1 proposes a comparison of candidate degenerate codons. As a base platform, an initial deep-learning model is trained on the objective protein. Subsequently, the algorithm can generate sequences that incorporate candidate degenerate codons and assign a predicted fitness score to each based on the previously trained model. Finally, the distribution of fitness scores is calculated for all variants of each degenerate codon strategy. Statistical tools such as the Jensen-Shannon divergence [28] and the survival function [29] provide a direct comparison between individual distributions in terms of diversity and fitness (see Note 1). It should be emphasized that the criteria for choosing the best-performing strategy depend on the particular application the platform is used for.

Figure 2.1 Evaluation of degenerate codon performance within multiple design strategies using the trained model based on experimental data on the protein of interest.
(A) Elaboration of different hypothetical degenerate codon strategies at three different sites of the protein (the model must first be trained on high-quality experimental data to predict the fitness of the sequences generated by each technique). The first strategy utilizes the NNK technique at all sites, while the second uses different degenerate codons at each site. The third strategy uses custom mixed codons (versatile ratios of base pairs). Sequences compatible with the rubrics of each strategy can be generated, and each generated sequence is assigned a fitness score by the model previously trained on the protein of interest. (B) The score function (transformed and standardized predicted scores) distribution is obtained for each candidate library. One method to comprehend the predicted data is by clustering the generated data. The exemplary plot shown here is based on both the sequences and scores of the three strategies, where each dot represents one sequence produced by one of the strategies. The radius of each dot represents its fitness score, and the distance between any two dots represents how distant those amino acid sequences are from each other in sequence space. (C) The Jensen-Shannon divergence is useful for quantifying differences between individual distributions and a reference distribution, or the extent of similarity between candidate distributions. (D) The survival function provides the probability of improved score functions compared to the current score function, providing more insight for the analysis of distributions. (E) Box plots can be produced based on the generated fitness scores and are informative for recognizing the quantiles and making comparisons among the three different strategies.
2.2 Materials
There is a plethora of software libraries and packages for implementing model optimization, machine learning, and deep learning algorithms. Choosing one depends on the specific application that the user has in mind. TensorFlow [30], Keras [31], PyTorch [32], and Scikit-learn [33] are among the most popular libraries, each with some trade-offs (see Note 2 for analysis). The discussion found here focuses on deep learning in Keras. Please refer to our GitHub (https://github.com/WoldringLabMSU/DeepLearning.git) for a collection of relevant Python scripts to guide model implementation as well.
• The latest version of Python (https://www.python.org); installing Anaconda (https://www.anaconda.com) is highly suggested for beginners as it is straightforward to use.
• Spyder or Jupyter Notebook environments in Anaconda are both popular and practical for machine learning and deep learning projects.
• pandas [34] (for dealing with data frames)
• numpy [35] (for efficient mathematical operations)
• scikit-learn [36] (mostly used for machine learning, but its preprocessing module also has a myriad of useful functions for analyzing and fine-tuning the data)
• tensorflow [37] (Keras requires a backend; CNTK and Theano are alternative options)
• keras [31]
To install these libraries, use pip install X in the command prompt.
2.3 Methods
Below is a general workflow representing the required actions in building a deep learning algorithm from deep sequencing data. Here, the goal is to produce a supervised learning algorithm for predicting protein function based on amino acid sequence.
Figure 2.2 Overall workflow for applying a machine learning algorithm to deep sequencing data. The main steps are data processing, choosing an appropriate learning algorithm, and decision-making based on the algorithm's performance. A detailed explanation of the steps is provided in the text.
2.3.1 Data processing as an initial yet pivotal step in any DL algorithm
The main purpose of this step is to prepare the data to be fed into the deep learning algorithm. Importantly, what can be learned by the model strongly depends on what is provided to the algorithm as an input. If the aim is to map sequence to function, the protein sequence serves as the input, labeled with the desired functionality (output). Three important steps in data processing include input data refinement, input data representation, and output data representation (if dealing with supervised learning). See Note 2.
2.3.1.1 Input data refinement
First, the features should be imported and standardized. One rationale behind this refinement is that scaling the input removes the "importance" of the raw magnitude of one factor relative to another, which reduces estimation errors and calculation times. Standardization can be performed with StandardScaler or RobustScaler (advantageous when dealing with outliers) from the sklearn.preprocessing module:
from sklearn.preprocessing import StandardScaler
scaler1 = StandardScaler()
scaler1.fit(Feature)
Feature_standardized1 = scaler1.transform(Feature)

from sklearn.preprocessing import RobustScaler
scaler2 = RobustScaler()
scaler2.fit(Feature)
Feature_standardized2 = scaler2.transform(Feature)
2.3.1.2 Input data representation
One-hot encoding (Note 3), integer encoding (Note 4), physiochemical property-based encoding (Note 5), UniRep encoding [19] (https://github.com/churchlab/UniRep), and TAPE [38] (https://github.com/songlab-cal/tape) are notable options for representing amino acid sequence data. An example of one-hot encoding the input data is shown below. The user can manually one-hot encode the sequences by defining dictionaries or by using particular packages in Python (refer to https://github.com/WoldringLabMSU/DeepLearning.git for more information on encoding).
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
Amino_Acids = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]
label_encoder = LabelEncoder()
onehot_encoder = OneHotEncoder(sparse=False)
integer_encode = label_encoder.fit_transform(Amino_Acids)
integer_encoded = integer_encode.reshape(len(integer_encode), 1)
Amino_Acids_onehot = onehot_encoder.fit_transform(integer_encoded)
2.3.1.3 Output data representation
Following high-throughput selection and deep sequencing of an initial combinatorial library, amino acid sequences can be labeled based on the observed enrichment ratios as a metric for relative fitness [39]. Depending on the experimental conditions for selection and the depth of sequencing, the distribution of enrichment ratio data may take on various forms and require further refinement (see Note 6).
2.3.2 Deep learning algorithm selection requires an understanding of model structure
2.3.2.1 Overview
Artificial neural network (ANN) architectures are inspired by human brain function and can learn from various types of input data.
The building blocks of these networks (neurons) receive information from adjacent neurons and process this information with the aid of an activation function (see Note 7) before passing it to downstream neurons. The effect and significance of each connection are proportional to its assigned weight. Many deep learning architectures are used in the field, from the feed-forward neural network (FNN) to the convolutional neural network (CNN) [40] and the recurrent neural network (RNN) [41]. FNNs are the most basic neural network architecture, yet they remain powerful for capturing high-dimensional features. An FNN consists of three distinct types of layers (each composed of neurons): the input layer, the hidden layer(s), and the output layer. The input layer receives the data features, the hidden layers transform these features, and the output layer captures the results. For each of these neural network algorithms, backpropagation is used to iteratively update the model weights and minimize the loss function via gradient descent. Figure 2.3 illustrates one example structure of a feed-forward neural network.
Figure 2.3 This figure represents one example of a feed-forward neural network. It consists of three main layers: the input, hidden layer(s), and output layers. It is a symbolic representation of the path from sequence to function. The magnifying glass focuses on one neuron (a21), illustrating how information passes through every neuron. The information from six neurons in the previous hidden layer is multiplied by their weights (different colors represent the activation level in each corresponding neuron). Afterward, the result is summed with a bias and passed through the activation function to determine the output of that neuron.
Convolutional neural networks are an additional deep learning algorithm inspired by the visual cortex in animals. CNNs capture hidden relationships and spatial dependencies in the provided input via various filters, pooling operations, and convolutional kernels. Local properties are thus obtained using sliding filters and non-linear functions [42]. In recurrent neural networks, there is a cyclic connection in the algorithm structure, whereby the current state of the algorithm is updated based upon the past state and the current input data [43]. This characteristic makes RNNs useful for time-series data and for capturing temporal relationships in sequences. In addition, variations of the RNN such as the LSTM [44] and GRU [45] have been developed to address limitations of the basic architecture, such as capturing long-distance relationships and mitigating the vanishing gradient problem. The best-performing algorithm depends on the type of data and the purpose for using deep learning; however, more advanced architectures like RNNs and CNNs can often learn properties more efficiently than a traditional FNN.
2.3.2.2 Guidance for building a deep learning structure
Here we build a simple FNN in Keras to demonstrate a potential structure for constructing a neural network (https://github.com/WoldringLabMSU/DeepLearning.git).
# Importing the required libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split

# Splitting the data set into train and test
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Defining the model
My_Model = Sequential()

# Defining the input layer and the number of neurons in that layer
My_Model.add(Dense(20, activation='relu', input_shape=(100,)))

# Defining the hidden layers, the number of neurons in each hidden layer, and the corresponding activation function
My_Model.add(Dense(10, activation='relu'))
My_Model.add(Dense(5, activation='relu'))

# Defining the output layer
My_Model.add(Dense(1, activation='relu'))
# The last activation function depends on the inherent nature of the task (see Note 7)

# Compiling the model and determining the optimizer, learning rate, loss function, and evaluation metrics
My_Model.compile(optimizer=Adam(learning_rate=0.01), loss='mean_squared_error', metrics=['mse'])

# Fitting the defined model
My_Model.fit(X_train, Y_train, batch_size=100, epochs=40, verbose=2, validation_split=0.2, shuffle=True)

# Evaluating the fitting performance
Metric = My_Model.evaluate(X_test, Y_test, verbose=2)

# Predicting the labels for the test data set
y_test_predicted = My_Model.predict(X_test, verbose=2)
2.3.3 Visualization guidance
Evaluation metrics provide useful insights into algorithm performance. Visualizations of these results help to further interpret and communicate the findings. For example, the seaborn library allows for showing the correlation between predicted output values and actual values. Below is a simple code snippet implemented for this purpose.
import seaborn as sns
sns.jointplot(x=Prediction, y=Actual_value, kind='scatter', s=150, color='m', edgecolor='skyblue', linewidth=2)
Understanding and building such a structure is an important first step toward utilizing deep learning in different projects. However, it should be emphasized that, even with a powerful algorithm and an appropriate data processing step, the predictions might be drastically off, which leads us to decision-making and further fine-tuning steps.
2.3.4 Decision Making & evaluating parameters
After processing the data and choosing the appropriate algorithm, one should be able to accurately evaluate the performance of the algorithm. Evaluation metrics depend on whether the problem is regression or classification. In classification, metrics such as accuracy, the confusion matrix, AUC, and recall are useful, while in regression, metrics such as RMSE, MSE, and MAPE are better suited (see Note 8). Poor model metrics may arise from issues at various stages, such as inadequate input data, a deficient preprocessing step, overfitting, or untuned hyperparameters. Choosing the right hyperparameters and preventing the model from overfitting are two necessary tasks in training any deep learning algorithm. Two generally used methods for resolving some of these issues are elaborated in the following.
2.3.4.1 Hyperparameter Optimization [46]
Hyperparameters are a set of variables that dictate various characteristics of the algorithm's structure and influence the process used for training models. Hyperparameter optimization can be considered a meta-optimization technique whereby parameter value fitness is monitored via the loss function during the training process [47].
Multiple methods can be employed for searching through hyperparameter space (e.g., manual search, grid search, random search, and Bayesian optimization). Manual search involves the user tuning the hyperparameters by guess and check. In grid search, the user defines a list of values for each parameter of interest, and the algorithm then calculates the loss for all possible combinations of parameter values. In random search, some but not all combinations of parameters are selected randomly, and the best loss is chosen from among the selected combinations. Bayesian optimization offers high efficiency by using information from past evaluations to choose the next set of hyperparameters. HYPEROPT (http://hyperopt.github.io/hyperopt/) and OPTUNA (https://optuna.org/) are among the Python libraries designed for hyperparameter optimization based on an objective function (refer to https://github.com/WoldringLabMSU/DeepLearning.git for more information).
2.3.4.2 K-fold cross-validation
This resampling approach allows for a more accurate estimate of model performance, even with a limited data set. It splits the data into k complementary groups and uses k-1 groups for training and one group for evaluating performance. The cross-validation performance is calculated by taking the mean and variance over all k performances. This enables a less biased estimate of model performance [47] (refer to https://github.com/WoldringLabMSU/DeepLearning.git for an example of K-fold cross-validation).
2.3.5 Library Construction
Machine learning techniques enable the design of smart libraries by identifying the protein positions that are highly amenable to mutation and determining the most suitable degenerate codon candidate(s) for the protein of interest. Based on these designs, full-length genes can then be constructed using overlap extension PCR of degenerate oligos to incorporate the intended site-wise amino acid diversity. Finally, the newly constructed full-length gene library, combined with a linearized yeast surface display vector (e.g., pCT-CON2), can be electroporated into yeast (EBY100) [48] and evaluated by high-throughput techniques [49, 50, 51, 52, 53].
Notes
2.3.6 Learning paradigms in ANN
The learning paradigms in ANNs are supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. In supervised learning, the data are provided with assigned labels that guide the algorithm to learn the input-output mapping through backpropagation. Unsupervised learning algorithms (e.g., K-means clustering) are used when no values are assigned to the input features. This allows for the detection of patterns even when additional information is not provided with the input data. Semi-supervised learning is well suited to cases where only partially labeled data exist, and the algorithm aims to fully benefit from the provided information, whether labeled or unlabeled [54]. Reinforcement learning is agent-based learning in which the agent interacts with its environment and learns from rewards for correct predictions and penalties for incorrect guesses (i.e., a trial-and-error methodology) [55].
2.3.7 Deep learning packages analysis
TensorFlow is one of the most popular and fastest evolving open-source deep learning tools [37]. It is compatible with both GPU and CPU computation and is well suited for working with multi-dimensional arrays.
One downside is its low-level API, which makes it less than ideal for the direct creation of deep learning algorithms. Keras (open source) can support backends such as CNTK, TensorFlow, and Theano, and its simple API makes it straightforward to use. PyTorch is among the more flexible programming packages in Python and supports tensor computation and GPU acceleration. Its dynamic computation graph and easy debugging make it a strong option. At its core, it uses CPU and GPU tensor and neural network backends; PyTorch therefore brings speed and flexibility to deep learning models, despite needing third-party visualization tools [56]. Regardless of the choice of package, the magnitude and quality of the input data will strongly impact the predictive power of the resulting model.
2.3.8 One-hot encoding
The simplest method is one-hot encoding, which constructs a matrix of ones and zeros where a one represents the presence of a given element at a specific position of the sequence. One-hot encoding is easy to implement and has proven effective in many cases, yet it is highly memory intensive and struggles to capture the relationships between amino acids in protein sequences. This large and sparse encoding often leads to complications in training as a result of the inherently high dimensionality.
2.3.9 Integer encoding
Integer encoding is implemented by representing each amino acid by an integer. One observed drawback of integer encoding is the tendency of the model to assume linear relationships among the provided labels. For example, if the labels are 1, 2, and 3, the model assumes a relationship among the amino acids assigned to these labels (such that 1 is closer to 2 than to 3). As a result, orthogonality between the labels matters. However, integer encoding is often used with a linear embedding layer (see tf.keras.layers.Embedding), whereby the integer encoding indexes into an embedding "lookup" table.
2.3.10 Property-based encoding
Some practical encodings are based on the physiochemical properties (e.g., charge, hydrophobicity, and size) of the sequences [57]. One example is the principal components score Vectors of Hydrophobic, Steric, and Electronic properties (VHSE8) [58].
2.3.11 Desired data distribution for DL algorithms
Labels with Gaussian-like distributions tend to yield better performance in deep learning. Therefore, defining scoring functions that change the distribution of labels is a viable option. Power transform functions are used to reduce the skewness of data and produce more Gaussian-like distributions. For example, simple power transforms take the nth root or the logarithm of the variable. More advanced power transformers include the Box-Cox and Yeo-Johnson methods. To obtain new distributions with the mentioned methods, the SciPy [59] library or the PowerTransformer class in the scikit-learn preprocessing module can be used.
# Example implementation
import scipy.stats
yeo_johnson = scipy.stats.yeojohnson(label_list)
box_cox = scipy.stats.boxcox(label_list)
2.3.12 Activation function
One suggested approach is to use ReLU in all hidden layers and choose an activation function for the output layer that matches the nature of the task and the distribution of the data. Generally, for the output layer, sigmoid is used for binary classification, softmax for multi-class classification, and a linear activation function for regression.
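For concreteness, the following minimal Keras sketch illustrates how the output-layer activation and loss can be matched to the task as suggested above. The layer sizes, input dimension, and helper name (build_model) are hypothetical choices for illustration rather than part of the protocol.
# Hypothetical sketch: matching the output-layer activation and loss to the task
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_model(task, input_dim=100, n_classes=3):
    model = Sequential()
    model.add(Dense(32, activation='relu', input_shape=(input_dim,)))  # hidden layers use ReLU
    model.add(Dense(16, activation='relu'))
    if task == 'binary':
        model.add(Dense(1, activation='sigmoid'))           # binary classification
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    elif task == 'multiclass':
        model.add(Dense(n_classes, activation='softmax'))   # multi-class classification
        model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    else:
        model.add(Dense(1, activation='linear'))            # regression
        model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mse'])
    return model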
2.3.13 Evaluation metrics
Evaluation metrics are representative of algorithm performance and should be considered within the context of the nature of the problem and the origin of the input data. For example, with imbalanced data in which 90 percent of the population belongs to category 1 and the remainder to category 2, the algorithm will likely predict that all of the data belong to the first category. In this case, the accuracy will be 90 percent, which does not reflect the true performance of the algorithm. Here, the confusion matrix provides useful information about the number of false positives, false negatives, true positives, and true negatives.
BIBLIOGRAPHY
[1] B.L. Hogan. Bone morphogenetic proteins: multifunctional regulators of vertebrate development. Genes Dev, 10:1580–1594, 1996.
[2] J. Schlessinger. Cell signaling by receptor tyrosine kinases. Cell, 103:211–225, 2000.
[3] V. Syrovatkina, K.O. Alegre, R. Dey, et al. Regulation, signaling, and physiological functions of g-proteins. J Mol Biol, 428:3850–3868, 2016.
[4] H.W. Hellinga and J.S. Marvin. Protein engineering and the development of generic biosensors. Trends Biotechnol, 16:183–189, 1998.
[5] N.K. Mishra, J. Chang, and P.X. Zhao. Prediction of membrane transport proteins and their substrate specificities using primary sequence information. PLoS One, 9:e100278–e100278, 2014.
[6] T. Yang, J.C. Wu, C. Yan, et al. Virtual screening using molecular simulations. Proteins, 79:1940–1951, 2011.
[7] E.E. Wrenbeck, M.S. Faber, and T.A. Whitehead. Deep sequencing methods for protein engineering and design. Curr Opin Struct Biol, 45:36–44, 2017.
[8] N. Kronqvist, J. Löfblom, A. Jonsson, et al. A novel affinity protein selection system based on staphylococcal cell surface display and flow cytometry. Protein Eng Des Sel, 21:247–255, 2008.
[9] K.K. Yang, Z. Wu, and F.H. Arnold. Machine-learning-guided directed evolution for protein engineering. Nat Methods, 16:687–694, 2019.
[10] H. Bohr, J. Bohr, S. Brunak, et al. A novel approach to prediction of the 3-dimensional structures of protein backbones by neural networks. FEBS Lett, 261:43–46, 1990.
[11] Y. Ofran and B. Rost. Predicted protein-protein interaction sites from local sequence information. FEBS Lett, 544:236–239, 2003.
[12] J.J. Ward, L.J. McGuffin, B.F. Buxton, et al. Secondary structure prediction with support vector machines. Bioinformatics, 19:1650–1655, 2003.
[13] N.V. Petrova and C.H. Wu. Prediction of catalytic residues using support vector machine with selected protein sequence and structural properties. BMC Bioinformatics, 7:1–12, 2006.
[14] B.Q. Li, K.Y. Feng, L. Chen, et al. Prediction of protein-protein interaction sites by random forest algorithm with mrmr and ifs. PLoS One, 7:1–10, 2012.
[15] L. Quan, Q. Lv, and Y. Zhang. Strum: Structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics, 32:2936–2946, 2016.
[16] M. Tahir, H. Tayara, and K.T. Chong. irna-pseknc(2methyl): Identify rna 2'-o-methylation sites by convolution neural network and chou's pseudo components. J Theor Biol, 465:1–6, 2019.
[17] J.D. Bloom, S.T. Labthavikul, C.R. Otey, et al. Protein stability promotes evolvability. Proc Natl Acad Sci U S A, 103:5869–5874, 2006.
[18] Y. Saito, M. Oikawa, H. Nakazawa, et al. Machine-learning-guided mutagenesis for directed evolution of fluorescent proteins. ACS Synth Biol, 7:2014–2022, 2018.
[19] E.C. Alley, G. Khimulya, S. Biswas, et al.
Unified rational protein engineering with sequence-based deep representation learning. Nat Methods, 16:1315–1322, 2019.
[20] S. Biswas. Low-n protein engineering with data-efficient deep learning: A paradigm for low-n protein engineering. Preprint, pages 1–39, 2020.
[21] B.E. Suzek, Y. Wang, H. Huang, et al. Uniref clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31:926–932, 2015.
[22] M. Crawshaw. Multi-task learning with deep neural networks: A survey. Preprint, 2020.
[23] J. Im, B. Park, and K. Han. A generative model for constructing nucleic acid sequences binding to a protein. BMC Genomics, 20:1–13, 2019.
[24] J.E. Ness, S. Kim, A. Gottman, et al. Synthetic shuffling expands functional protein diversity by allowing amino acids to recombine independently. Nat Biotechnol, 20:1251–1255, 2002.
[25] R.D. Gupta and D.S. Tawfik. Directed enzyme evolution via small and effective neutral drift libraries. Nat Methods, 5:939–942, 2008.
[26] M.K.M. Engqvist and J. Nielsen. Ant: Software for generating and evaluating degenerate codons for natural and expanded genetic codes. ACS Synth Biol, 4:935–938, 2015.
[27] T.M. Jacobs, H. Yumerefendi, B. Kuhlman, et al. Swiftlib: Rapid degenerate-codon-library optimization through dynamic programming. Nucleic Acids Res, 43:1–9, 2015.
[28] M.L. Menéndez, J.A. Pardo, L. Pardo, et al. The jensen-shannon divergence. J Franklin Inst, 334:307–318, 1997.
[29] V. Bewick, L. Cheek, and J. Ball. Statistics review 12: Survival analysis. Crit Care, 8:389–394, 2004.
[30] TensorFlow. https://www.tensorflow.org, 2017.
[31] F. Chollet et al. Keras, 2015.
[32] D. Mazza and M. Pagani. Automatic differentiation in pcf. Proc ACM Program Lang, 5:1–4, 2021.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. Scikit-learn: Machine learning in python. J Mach Learn Res, 12:2825–2830, 2012.
[34] W. McKinney. Data structures for statistical computing in python. In Proc 9th Python Sci Conf, pages 56–61, 2010.
[35] C.R. Harris, K.J. Millman, S.J. van der Walt, et al. Array programming with numpy. Nature, 585:357–362, 2020.
[36] F. Pedregosa, G.G. Varoquaux, A. Gramfort, et al. Scikit-learn: Machine learning in python. J Mach Learn Res, 12:2825–2830, 2011.
[37] M. Abadi, P. Barham, J. Chen, et al. Tensorflow: A system for large-scale machine learning. In Proc 12th USENIX Symp Oper Syst Des Implementation, OSDI 2016, pages 265–283, 2016.
[38] R. Rao, N. Bhattacharya, N. Thomas, et al. Evaluating protein transfer learning with tape. Adv Neural Inf Process Syst, 32:9689, 2019.
[39] T.A. Whitehead, A. Chevalier, Y. Song, et al. Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing. Nat Biotechnol, 30:543–548, 2012.
[40] R. Shroff, A.W. Cole, D.J. Diaz, et al. Discovery of novel gain-of-function mutations guided by structure-based deep learning. ACS Synth Biol, 9:2927–2935, 2020.
[41] Z. Zhao and X. Gong. Protein-protein interaction interface residue pair prediction based on deep learning architecture. IEEE/ACM Trans Comput Biol Bioinforma, 16:1753–1759, 2019.
[42] Q. Zhang, M. Zhang, T. Chen, et al. Recent advances in convolutional neural network acceleration. Neurocomputing, 323:37–51, 2019.
[43] Y. Yu, X. Si, C. Hu, et al. A review of recurrent neural networks: Lstm cells and network architectures. Neural Comput, 31:1235–1270, 2019.
[44] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput, 9:1735–1780, 1997.
[45] K. Cho, B. Van Merriënboer, C.
Gulcehre, et al. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP 2014 - 2014 Conf Empir Methods Nat Lang Process Proc Conf, pages 1724–1734, 2014.
[46] J. Bergstra, D. Yamins, and D.D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In 30th Int Conf Mach Learn ICML 2013, pages 115–123, 2013.
[47] S. Raschka. Model evaluation, model selection, and algorithm selection in machine learning. Preprint, 2018.
[48] G. Chao, W.L. Lau, B.J. Hackel, et al. Isolating and engineering human antibodies using yeast surface display. Nat Protoc, 1:755–768, 2006.
[49] D.R. Woldring, P.V. Holec, H. Zhou, et al. High-throughput ligand discovery reveals a sitewise gradient of diversity in broadly evolved hydrophilic fibronectin domains. PLoS One, 10:e0138956, 2015.
[50] D.R. Woldring, P.V. Holec, L.A. Stern, et al. A gradient of sitewise diversity promotes evolutionary fitness for binder discovery in a three-helix bundle protein scaffold. Biochemistry, 56:1656–1671, 2017.
[51] M.A. Kruziki, S. Bhatnagar, D.R. Woldring, et al. A 45-amino-acid scaffold mined from the pdb for high-affinity ligand engineering. Chem Biol, 22:946–956, 2015.
[52] M.A. Kruziki, V. Sarma, and B.J. Hackel. Constrained combinatorial libraries of gp2 proteins enhance discovery of pd-l1 binders. ACS Comb Sci, 20:423–435, 2018.
[53] L.A. Stern, C.M. Csizmar, D.R. Woldring, et al. Titratable avidity reduction enhances affinity discrimination in mammalian cellular selections of yeast-displayed ligands. ACS Comb Sci, 19:315–323, 2017.
[54] X. Zhu and A.B. Goldberg. Introduction to semi-supervised learning, volume 3. Synth Lect Artif Intell Mach Learn, 2009.
[55] R.S. Sutton and A.G. Barto. Reinforcement learning: An introduction. Bradford Book, 1998.
[56] G. Nguyen, S. Dlugolinsky, M. Bobák, et al. Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif Intell Rev, 52:77–124, 2020.
[57] K.K. Yang, Z. Wu, C.N. Bedbrook, et al. Learned protein embeddings for machine learning. Bioinformatics, 34:2642–2648, 2018.
[58] H.U. Mei, Z.H. Liao, Y. Zhou, et al. A new set of amino acid descriptors and its application in peptide qsars. Pept Sci Orig Res Biomol, 80:775–786, 2005.
[59] P. Virtanen, R. Gommers, T.E. Oliphant, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nat Methods, 17:261–272, 2020.
CHAPTER 3
PROTEIN FITNESS PREDICTION IS IMPACTED BY THE INTERPLAY OF LANGUAGE MODELS, ENSEMBLE LEARNING, AND SAMPLING METHODS
Abstract
Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance include handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings).
Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein-encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physiochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). Performance is elaborated with respect to protein fitness, protein size, and sampling techniques. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling while encoding sequences with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased the predictive performance of the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was rigorous enough for stability prediction (F1-score = 92%).
This chapter is adapted from content published in "Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods." All rights reserved. For more information, visit https://doi.org/10.3390/pharmaceutics15051337.
Keywords: Machine learning; protein fitness prediction; embeddings; sequence representation; imbalanced assay-labeled datasets; sampling methods; ensemble learning; MCDA; TOPSIS
3.1 Introduction
Proteins are biological machines involved in almost all biological processes [1, 2, 3, 4]. These molecules are made of amino acids that fold into 3-dimensional structures and perform life-sustaining biological functions [5]. Protein engineering practices aim to modify proteins to redirect what has already evolved in nature and address the industrial and medical needs of modern society [6, 7]. This has been a challenging task due to the astronomical number of possible mutations and the complex sequence-function relationship of proteins (i.e., the fitness landscape) [8]. Protein fitness—a measure of how well a protein performs a task of interest—is influenced by a variety of factors, including its structure, stability, and interactions with other molecules. In protein engineering campaigns, where protein function is modified by experimental approaches rather than natural selection, proteins with high fitness are those that perform well within relevant experimental assays, whereas proteins with low fitness exhibit reduced activity, altered specificity, or decreased stability. To overcome the challenge of finding high-fitness proteins among mostly non-functional mutants, various experimental and computational techniques have been developed. Recently, machine learning (ML) has shown promise as a tool to supplement already established techniques, such as rational design and directed evolution [9, 10, 11, 12]. Unlike directed evolution, ML models can learn from non-functional mutants instead of simply discarding them during enrichment for functional clones. ML-assisted protein engineering, therefore, has potential as a time-efficient and cost-effective approach to searching for desired protein functionality.
This provides a unique opportunity to create smart protein libraries, elevate and accelerate directed evolution and rational design strategies, and ultimately enhance the probability of finding unexplored high-fitness variants in the protein fitness landscape [13, 14, 15]. Machine learning methods have attained a high success rate in predicting essential protein properties (i.e., protein fitness) including secondary structure, solubility, binding affinity, flexibility, and specificity [16, 17, 18, 19, 20]. Despite these recent milestones, obtaining generalizability and robustness in ML models requires further exploration of different protein fitness prediction tasks and training details.
Dealing with protein fitness landscape challenges will require us to view proteins from a new perspective that supplements our biochemical knowledge with lessons from written languages. Recent advances in ML and artificial intelligence have applied natural language processing (NLP) methods to identify context-specific patterns from written or spoken text. NLP tasks learn how words function grammatically (syntax) and how they deliver meaning both on their own and in the context of surrounding words (semantics) [21, 22]. This has given rise to virtual assistants with voice recognition and sentiment analysis of text from diverse languages [23, 24]. Similarly, protein engineering can leverage these NLP tools—treating a string of amino acids as if they were letters on a page—to understand the language of proteins, providing a promising route to capture nuances (e.g., epistatic relationships, functional motifs) in complex sequence–function mappings [25, 26]. The rapid expansion of publicly available protein sequence data (e.g., UniProt [27], SRA [28]) further supports the use of big data and language models in the domain of protein engineering [29]. Self-supervised language models learn the context of the provided text by reconstructing the masked tokens/linguistic units of the text string using the unmasked parts. In the context of protein engineering, pre-trained protein language models—carrying valuable information about the epistasis/interaction of amino acids—can be applied to downstream tasks by extracting the optimized weights as a fixed-size vector (embedding) [25, 30, 31]. Among early embedding developments, Alley et al. introduced UniRep [32], a deep learning model that was trained on 24 million unique protein sequences to perform next amino acid prediction tasks, extracting information about the global fitness landscape of proteins. Rives et al. trained ESM, a language model for masked amino acid prediction tasks, on over 250 million protein sequences [33]. The learned representations (including UniRep [32], ESM [33], TAPE [34], and ProteinBERT [35]) have generated promising results in diverse areas such as predicting protein fitness, protein localization, protein-protein interaction, and the disease risk of mutations, in terms of improved prediction scores, increased generalizability, and reduced data requirements [36, 37, 38, 39, 40]. Using embeddings for sequence representations (transfer learning) enables knowledge transfer between protein domains and future prediction tasks by further optimizing the already-learned weights. For example, Min et al. obtained a 20% increase in the F1-score (the harmonic mean of precision and recall) for a heat shock protein identification task when training their NLP-based model, DeepHSP [41], on top of pre-trained representations.
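As a brief illustration of this transfer-learning workflow, the hedged sketch below extracts mean-pooled ESM-2 embeddings with the fair-esm package and feeds them to a simple downstream classifier. The example sequences and labels are placeholders, and the checkpoint name (esm2_t33_650M_UR50D) follows the ESM repository's documented interface rather than any setting prescribed by this chapter.
# Hedged sketch: fixed-size protein embeddings from a pre-trained language model (fair-esm),
# followed by a downstream classifier. Sequences and labels below are illustrative placeholders.
import torch
import esm
from sklearn.linear_model import LogisticRegression

sequences = [("seq1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
             ("seq2", "MKTAYLAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels = [1, 0]  # placeholder fitness labels

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

_, _, tokens = batch_converter(sequences)
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]

# Mean-pool over residue positions (skipping the BOS token) to obtain one 1280-d vector per sequence
embeddings = [reps[i, 1:len(seq) + 1].mean(0).numpy() for i, (_, seq) in enumerate(sequences)]

# Downstream task: an L2-penalized logistic regression over the fixed embeddings
clf = LogisticRegression(penalty="l2", max_iter=1000).fit(embeddings, labels)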
In this study, we perform protein sequence fitness prediction with ML techniques to demonstrate how model performance varies given the choice of protein representation, protein size, and the biological attribute (e.g., binding affinity or thermal stability) to be predicted. This work provides actionable insights for effectively building discriminative models and improving their prediction scores via sampling techniques and ensemble learning. As efficient use of embedding methods on experimental datasets is in its infancy, rigorous studies are needed to gain new insights into the performance of pre-trained models given various training conditions and distinct biological function predictions. Importantly, embedding methods have been trained over millions of protein sequences in public databases and have produced high performance in certain fitness tasks (e.g., stability prediction), while they may not do as well in all fitness prediction tasks. To this end, we used two large datasets that were representative of common protein engineering tasks. First, we leveraged a highly imbalanced dataset (93% non-functional; Table 3.1), consisting of our previously described affinity-evolved affibody sequences [42], to explore NLP-driven practices. We then expanded our analysis to include thousands of protein sequences labeled with their experimentally measured stabilities (melting temperatures, Tm) obtained from the Novozymes Enzyme Stability Prediction (NESP) dataset [43]. Thus, with our two datasets having unique attributes, we were well positioned to address multiple questions: (i) How do different representation methods perform in predicting distinct fitness attributes such as stability or affinity? (ii) How do sampling methods perform in imbalanced protein datasets? (iii) Is ensemble learning over different protein representations helpful in boosting the performance of discriminative models? (iv) How do we rank model performances while using multiple conflicting metrics in ML prediction tasks? By addressing these challenges, we also gain direct insights for model interpretation and reveal the features that are most important for discriminating between fit and non-fit sequences (Figures 3A.1, 3A.6, and 3A.8). We discovered that oversampling (especially SMOTE) generally outperformed the undersampling techniques. In addition, ensemble learning over representations greatly improved the predictive performance in the affibody data, both using a single performance metric and using multiple metrics via multiple criteria decision analysis (MCDA) [44]. For protein representations (i.e., single encoders), UniRep and One-Hot outperformed other methods in the affibody (affinity) dataset, while ESM achieved the best score in stability prediction on NESP. Finally, it was observed that the performance of various protein representation methods is strongly impacted by protein sequence length.

Table 3.1 Dataset attributes and prediction tasks.
Dataset   Task            Fitness           Model                    Attributes
Affibody  Classification  Binding Affinity  Logistic Regression      82,663 non-binders, 6077 binders
NESP      Classification  Stability         Logistic Regression      3743 high-stability, 1311 low-stability
NESP      Regression      Stability         Random Forest Regressor  18,190 total

3.2 Materials and Methods
3.2.1 Obtaining Experimentally Labeled Sequence Data
Two different datasets with varying data characteristics were explored. The first is our experimental dataset of affibody sequences that were previously iteratively evolved for binding affinity and specificity against a panel of diverse targets [45].
The second collection of labeled protein sequences was obtained from the recently released Kaggle dataset wherein numerous proteins (n = 18,190) of various lengths are labeled according to their thermal stability (Tm). This dataset, NESP, was filtered to only include sequences characterized at pH = 7. For the affibody dataset, raw sequence data were cleaned by removing any sequences that contained stop codons or invalid characters. Afterwards, the frequency of each unique sequence in the experimental steps was tabulated. Infrequent sequences appearing fewer than ten and four times (within magnetic-activated cell sorting (MACS) and fluorescence-activated cell sorting (FACS), respectively) were treated as background and removed from the analysis. Note that the more stringent frequency cutoff for MACS was mainly due to the experiment type and its higher probability of introducing noise into the dataset. After removing the background, sequences from MACS and FACS were combined to form the final high-fitness population of binders. The non-binding population included the initial affibody sequence pool that did not appear in the enriched population of binder sequences. Initial affibody sequences within one Hamming distance (i.e., a single amino acid mutation) of any enriched sequence were removed as well to account for potential errors encountered during deep sequencing. All affibody sequences were exactly 58 amino acids in length, with mutations present at up to 17 of these positions.
3.2.2 Obtaining the Sequence Representations
We obtained four different numerical representations for our sequence data: One-Hot encoding, physiochemical encoding, UniRep, and ESM embeddings. One-Hot encoding refers to building a matrix (amino acids × protein length) and filling it with a one wherever a specific amino acid occurs at a given position, and zeros elsewhere. For physiochemical encoding, we used the modlamp [46] package in Python, which extracts physical features from protein sequences. Two types of physical features are represented in the modlamp package (global and peptide descriptors). All the global features (e.g., sequence length, molecular weight, aliphatic index, etc.) and the local physiochemical features based on the Eisenberg scale were extracted for this analysis (twenty in total). Embedding refers to the continuous representation of a protein sequence as a fixed-size vector that should contain meaningful information about the protein [47]. For example, in low-dimensional visualizations of the amino acid embeddings for both UniRep and ESM, similar amino acids (in terms of size, charge, hydrophobicity, etc.) lie close to each other. For the UniRep representation, we used the 1900-dimensional embedding with the mean representation over layers. We used Jax_UniRep for obtaining the UniRep embeddings, https://github.com/ElArkk/jax-unirep (accessed on 20 August 2022). UniRep uses an mLSTM architecture for next-token prediction and was trained on 24 million sequences from the UniRef50 dataset with 18 M parameters. For ESM, we chose ESM2 [48] with 1280 vector dimensions and 650 M parameters, with the mean over layer representations. The GitHub repository for ESM is https://github.com/facebookresearch/esm (accessed on 21 January 2023).
3.2.3 Sampling and Splitting
Sampling refers to choosing a random subset of data to represent the underlying population.
Three different sampling methods were tested for our severely imbalanced affibody dataset: undersampling, random oversampling, and the synthetic minority oversampling technique (SMOTE) [49]. Due to the sparse and rugged nature of the protein fitness landscape, it is common for experimental data obtained in the protein domain to be highly imbalanced. One practical approach for resolving the imbalanced dataset issue is using sampling techniques when training on the dataset. Oversampling randomly repeats the minority class examples; thus, it can be prone to overfitting in comparison to undersampling. However, undersampling may discard useful information, especially in severely imbalanced datasets, as it removes many samples from the majority class. SMOTE is a more recent addition to sampling methods; it oversamples the minority population by synthetically generating new instances that are highly similar to existing minority-class examples. While SMOTE has shown promising results in increasing prediction performance for various imbalanced datasets [50, 51, 52], there are also studies in which undersampling outperforms oversampling methods [53, 54]. As a result, we examined the performance of all three sampling techniques to determine which sampling method performs well within our wet-lab protein dataset over different encoding methods. For splitting the data points within the test set of an imbalanced dataset, sampling equally from each class may lead to an overestimation of the model performance [55]. As a result, we made sure that the test set distribution follows the initial data distribution (93% naïve vs. 7% enriched).
3.2.4 Algorithm Selection and Training Details
For classification, logistic regression (LR) was chosen, and L2 penalization (Ridge) was used to reduce the likelihood of overfitting. We reasoned that a simple logistic regression enables a fair comparison between cases. One regression task was also implemented over the NESP dataset with a random forest regressor (RFR). We used regression to observe how models perform as the prediction challenge increases from binary prediction to actual label prediction. The rationale for using RFR was that a linear regression model could not meet the complexity of the prediction task. For a fair comparison between protein-encoding performances in regression, the RFR hyperparameters, the number of estimators and max_depth, were optimized with OPTUNA [56].
3.2.5 Ensemble Learning
To enhance the predictive performance of protein-encoding predictions, we designed a framework that integrates multiple encoding methods. Our experiments included two approaches: concatenation and a voting mechanism. In concatenation, the encodings were combined into a single representation, which we used as input for our predictive model. In voting, separate predictive models were trained for each encoding method. The final prediction was then calculated as the majority-voted label over a fixed test set.
3.2.6 Metrics and Statistical Analysis
One key metric we used for analyzing classification performance is the F1 score. By considering both precision and recall (Figure 3.1E), the F1 score is particularly well suited for evaluating the highly imbalanced data within our study (Table 3.1). The model is therefore trained to correctly identify positive instances among all positive predictions while minimizing missed positive instances when predicting classes.
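As a small illustration (not part of the study's pipeline), the sketch below computes these quantities with scikit-learn on placeholder labels and predictions; the toy arrays stand in for the held-out test labels and the classifier's outputs.
# Illustrative sketch: precision, recall (TPR), F1, and FPR for an imbalanced binary task.
# y_true and y_pred are placeholders for the test labels and model predictions.
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # imbalanced toy labels (few positives)
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall (TPR):", recall_score(y_true, y_pred))   # TP / (TP + FN)
print("F1:", f1_score(y_true, y_pred))                 # harmonic mean of precision and recall
print("FPR:", fp / (fp + tn))                          # false positives among all true negatives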
Note that other classification metrics, such as the confusion matrix values (TP, TN, FP, FN), are reported in the supplementary figures. For the regression analysis of the NESP data, we used the mean squared error (MSE) and R² to indicate how the models perform. MSE is the mean of the squared differences between the actual labels and the predicted values in the test set, while R² represents the proportion of variance explained by the independent variables.
Figure 3.1 Overview of the Implemented Techniques, Data Attributes, and Evaluation Metrics. (A) Illustrates the use of sequence-function mapping to identify protein sequence functionality (e.g., therapeutics, diagnostics, enzymatic function). (B) Data attributes for the two datasets used in this study. The first dataset includes high-fitness protein binders among a pool of non-binder affibody sequences with up to 17 mutation sites. The other dataset includes a wide array of proteins with their associated melting points. (C) One-hot encoding, physicochemical encoding, and pre-trained models were used to encode the protein sequences present in our datasets. All of these methods present protein amino acid information in a machine-readable format, but in different ways. One-hot encoding converts each amino acid to a binary vector of all 0s except for a single 1 at the position corresponding to that amino acid. In physicochemical encoding, each amino acid is represented by its physiochemical characteristics, such as polarity, charge, size, etc. Pre-trained models are trained over a large corpus of unlabeled data, capturing the syntax and semantics of the protein language via NLP-driven objectives, such as next-token prediction (e.g., UniRep) and masked-token prediction (e.g., ESM). (D) The sampling methods used in this study are undersampling, oversampling, and the synthetic minority oversampling technique (SMOTE). (E) The main metrics used for evaluating the performance of prediction tasks (classification and regression) are defined (a complete list of performance metrics is provided in Figure 3A.11).
Experiments were implemented with multiple random seeds (20 for the affibody dataset and 30 for the NESP dataset) to obtain a distribution of performances for each pair of encoding and sampling methods in each fitness prediction task. We then implemented multiple statistical tests to confirm whether the observed differences were significant. Analysis of variance among the groups was performed with ANOVA [57]. After obtaining significant results in ANOVA, post hoc methods were implemented to account for family-wise error rates. Here, we implemented two post hoc methods, Bonferroni [58] and Tukey [59], for adjusting the p-values and reducing the risk of type-1 error. The null hypothesis assumes that the performances of the methods are similar; when it is rejected, we consider the difference between methods to be statistically significant. The results for multiple seeds are shown with violin plots, where the white dots represent the mean values. A complete collection of statistical analyses comparing the significance among the means is located in the supplementary information.
3.2.7 Multiple Criteria Decision Analysis (MCDA)
We used the F1-score as our primary classification metric to optimize the algorithms toward finding the rare positive sequences. Depending on the specific application, the user may need to choose different criteria for analyzing ML predictive performance.
Note that it is also generally advised to use multiple metrics to establish more rigorous analyses, specifically in imbalanced datasets [60, 61]. Therefore, we incorporated five more classification metrics in addition to the F1-score and implemented MCDA, which is a robust approach for decision-making (i.e., ranking alternatives based on multiple, often conflicting, criteria). In our study, the alternatives are the choices of protein representation within different sampling methods. The criteria (classification metrics) used for this MCDA include the F1-score, false positive rate (FPR), true positive rate (TPR), precision, negative predictive value (NPV), and false discovery rate (FDR). TPR measures the model's ability to identify the positive class, while FPR measures how often negative instances are misclassified as positive. Precision quantifies the fraction of correct positive predictions among all instances predicted as positive, while NPV measures the analogous quantity for the negative class. FDR measures the fraction of false positives among all instances predicted as positive. The performance of each encoding and sampling technique was recorded based on all six mentioned criteria. For implementing the decision-making, we chose a well-established and widely used MCDA method: the technique for order of preference by similarity to ideal solution (TOPSIS) [62]. TOPSIS finds the optimal solution rooted in the idea that the best alternative should have the minimum Euclidean distance from the positive ideal solution and the maximum distance from the negative ideal solution. For the implementation of TOPSIS, the PyTopsis Python package was utilized (https://github.com/shivambehl/PyTopsis, accessed on 4 April 2023). This method requires three primary inputs: a decision matrix representing alternative scores across various criteria, a list of weights reflecting the importance of each criterion, and a list of signs where −1 indicates a criterion to be minimized and 1 indicates maximization. Utilizing this framework, we constructed our decision matrix and defined the optimization direction for each criterion in the classification tasks. Weights for the criteria were determined through two approaches: subjective weighting, based on the decision-maker's preferences, and objective weighting, derived from a numerical analysis of the decision matrix. Both weighting methods were employed to assess and compare their efficacy in ranking alternatives. In the subjective weighting approach, higher weights were assigned to the precision and FPR metrics to emphasize the identification of positive instances. For objective weighting, we applied Shannon's entropy method [63] to calculate the entropy values from the decision matrix, which then informed the weighting process. The formulas for calculating weights based on entropy, where $x_{ij}$ represents each entry in the matrix, $n$ the number of alternatives, and $m$ the number of criteria, are as follows.
Normalizing the decision matrix values:
$$r_{ij} = \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}} \quad (3.1)$$
Calculating the entropy of each criterion:
$$E_j = -k \sum_{i=1}^{n} r_{ij} \ln r_{ij}, \qquad k = \frac{1}{\ln(n)} \quad (3.2)$$
Calculating the weight of each criterion:
$$w_j = \frac{1 - E_j}{\sum_{j=1}^{m} (1 - E_j)} \quad (3.3)$$
The results from TOPSIS need to be validated via statistical methods to ensure the correct ranking among alternatives (i.e., that differences between performances are significant rather than random). Therefore, we applied multivariate analysis of variance (MANOVA) [64] followed by a post-hoc method, Tukey [59], to analyze the significance of the results overall and between pairs of alternatives, respectively.
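For illustration, the following sketch implements the entropy-weighting step of Equations (3.1)-(3.3) with NumPy on a hypothetical decision matrix (rows are alternatives, columns are criteria); in the study itself, the subsequent TOPSIS ranking was carried out with the PyTopsis package.
# Hypothetical sketch of entropy weighting (Eqs. 3.1-3.3); the decision matrix values are placeholders.
import numpy as np

# rows = alternatives (encoding/sampling pairs), columns = criteria (e.g., F1, FDR, TPR)
X = np.array([[0.97, 0.02, 0.96],
              [0.93, 0.09, 0.95],
              [0.92, 0.11, 0.96]])

r = X / X.sum(axis=0)                        # Eq. (3.1): normalize each criterion column
k = 1.0 / np.log(X.shape[0])                 # k = 1 / ln(n), n = number of alternatives
E = -k * np.nansum(r * np.log(r), axis=0)    # Eq. (3.2): entropy of each criterion
w = (1 - E) / np.sum(1 - E)                  # Eq. (3.3): entropy-based weight of each criterion
print(w)                                     # weights sum to 1; lower-entropy criteria weigh more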
Figure 3.1 provides an overview of the data attributes (e.g., protein size, protein fitness) that will be predicted and the alternatives (e.g., protein encodings within different sampling methods) that will be compared.
3.3 Results
3.3.1 Sequence-Function Mapping Obtained from High-Throughput Selection Methods and Deep Sequencing
Affibody Dataset. To investigate the impact of feature representation, ensemble learning, and sampling methods on predictive modeling, we engaged in several prediction tasks using deep sequencing data. Specifically, we focused on a classification task involving the affibody dataset. The goal was to predict the scarce high-affinity binder class from a pool of non-binders. Details of the sequences obtained after data cleaning and the specific prediction tasks are presented in Table 3.1. For the NESP dataset, we categorized the data into two stability classes based on their thermal stability (Tm): low-stability (Tm ≤ 35 °C) and high-stability (Tm ≥ 60 °C). Additionally, we implemented a regression task to predict the actual Tm values, increasing the prediction difficulty and providing insights into the performance of different protein encodings. Only sequences measured at pH 7 were included in this analysis. Detailed results and additional analyses of the NESP dataset are discussed in Section 3.5 and provided in the supplementary materials.
3.3.2 Physiochemical Feature Encoding, Interpretable yet Lower Predictive Capacity
The classification results for physiochemical encodings are shown in Figures 3.2 and 3.3. We ranked the leading features for discriminating the non-binder and binder classes and listed the encoding method's F1 score under different sampling methods. The physiochemical encoding was not among the leading encoding methods, yet it achieved a high F1 score with only 20 features. It also provided insights into how the physical features correlate with each other in the given data (Figure 3A.1).
Figure 3.2 The lead physical features for discriminating the naïve and enriched classes in the affinity-based data were H-Eisenberg, Boman Index, and H-Gravy. Gravy and Eisenberg capture hydrophobicity scales. The Boman Index is a measure of the protein's ability to interact with its environment based on the solubility of individual residues. The enriched proteins in our library have gone through negative screening and are specific to their target; therefore, there is a shift toward a lower Boman index for this population. Note that the plot is the result of SMOTE oversampling in the logistic regression task.
3.3.3 Comparison of Encoding and Sampling Methods
Once the primary physical features of high-affinity binders were established, we evaluated the performance of various protein representation techniques alongside our chosen sampling methods. This analysis aimed to assess how each encoding method influenced the predictive capability regarding protein fitness. Our findings revealed distinct performance variations among the encoding methods. Specifically, One-Hot and UniRep emerged as the most effective techniques, showing superior performance across several metrics. Concerning sampling strategies, the synthetic minority oversampling technique (SMOTE) consistently enhanced the F1 score in almost all scenarios. Figure 3.4 displays the distribution of F1-scores, generated using 20 different random seeds, as violin plots, which visually represent the density and distribution of the scores.
Figure 3.3 When physical features were used to encode the affibody sequences, the mean F1-score was 75.5% with SMOTE. Both SMOTE and undersampling methods were similarly effective, with no significant difference in performance (i.e., the null hypothesis was not rejected). The violin plots are created over 20 random seeds for each sampling method.

Figure 3.4 Performance analysis of encoding methods highlighting the comparative weaknesses of physical feature-based encodings and the effectiveness of the SMOTE sampling method.

The analysis incorporated protein sequences encoded using physical features, One-Hot, UniRep, and ESM to perform classification tasks within the affibody dataset. For each encoding strategy, various sampling methods, including undersampling, random oversampling, and SMOTE, were evaluated. The resultant F1 scores, observed over 20 random seeds, provided a robust basis for statistical analysis. A one-way ANOVA was conducted to quantify the differences in performance across the encoding and sampling methods, yielding a p-value of 9.52 × 10⁻¹⁹⁰. This result signifies a statistically significant difference among the groups. Detailed post-hoc results for method ranking are presented in Figure 3A.2, further supporting the conclusions drawn from the comparative analysis.

3.3.4 Increased Generalizability and Predictive Performance via Ensemble Learning
Due to the varying performances of the protein encodings, we postulated that ensemble learning would increase the models' predictive performance. As oversampling performed better than undersampling in three out of four encoding methods, we exclusively analyzed ensemble learning for the two oversampling types (i.e., random oversampling and SMOTE). The physical encoding was discarded from this analysis since its performance was not as strong as the other encodings. Figure 3.5 shows that the ensemble technique, voting, remarkably enhanced the performance with respect to all methods, with a mean F1-score of 97% over 20 random seeds. As shown in Figure 3.5 and Figures 3A.3-4, voting notably enhanced the prediction score among the candidates, and SMOTE improved the performance of single encoders compared to random oversampling (R-oversampling). In pursuit of a more informed and transparent decision-making process among the discussed methods, we also integrated a Multi-Criteria Decision Analysis (MCDA) using the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) across five additional classification metrics beyond the F1-score. These classification criteria were evaluated over single encoders, the concat_all encoder, and the upvoting technique. Figure 3.6 presents a summary of our MCDA design along with the ranking results obtained from TOPSIS.

Figure 3.5 Voting substantially improved the predictive performance across all random initializations over different encoding methods. The illustrated plot is divided into three regions from left to right: single encoding methods, concatenation of encodings, and voting of predictions. The voting was executed such that each encoding was processed through a predictive model over the same dataset, with the final prediction obtained by majority voting. This figure highlights how voting enhances the models' robustness and generalizability. The concatenation of encodings performed similarly to or worse than the best model among single encodings.
The best model across all predictions was "Upvote" with oversampling, achieving mean F1-scores of 97% with SMOTE and 96.8% with random oversampling (no statistically significant difference was detected between the two oversampling methods in upvoting). For a comprehensive summary of the statistical analysis and confusion matrix plots, refer to the supplementary materials (Figures 3A.3-4, Figure 3A.6).

3.3.5 MCDA Design and Statistical Validation
The MCDA design for the affibody dataset, with only a 7% high-fitness population, enabled the comparison of encoding and sampling methods over multiple conflicting criteria. Additionally, selective weighting in MCDA allows users to bias the results towards more favorable outcomes, based on data attributes and specific applications. It is important to note that these rankings need to be validated by statistical analysis. Our MANOVA analysis showed significant differences between the candidates. The Tukey method for pairwise comparison and family-wise error correction indicated that while upvoting methods achieved significantly better results than the other candidates, there was no statistical difference between upvoting with SMOTE and upvoting with R-oversampling.

Table 3.2 Tukey results over all classification metrics between selected representation methods.

Comparison                  Mean GP1   Mean GP2   Metric      Reject Null
Upvote_SM vs. Upvote_RO     0.9712     0.9682     F1          FALSE
Upvote_SM vs. Upvote_RO     0.0187     0.0258     FDR         FALSE
Upvote_SM vs. Upvote_RO     0.9614     0.9622     TPR         FALSE
Upvote_SM vs. Upvote_RO     0.9813     0.9742     Precision   FALSE
Upvote_SM vs. Upvote_RO     0.9972     0.9972     NPV         FALSE
Upvote_SM vs. Upvote_RO     0.0019     0.0013     FPR         FALSE
Upvote_SM vs. UniRep_SM     0.9712     0.9307     F1          TRUE
Upvote_SM vs. UniRep_SM     0.0187     0.0930     FDR         TRUE
Upvote_SM vs. UniRep_SM     0.9614     0.9557     TPR         TRUE
Upvote_SM vs. UniRep_SM     0.9813     0.9070     Precision   TRUE
Upvote_SM vs. UniRep_SM     0.9972     0.9967     NPV         TRUE
Upvote_SM vs. UniRep_SM     0.0019     0.0072     FPR         TRUE
Upvote_RO vs. Concat_RO     0.9682     0.9261     F1          TRUE
Upvote_RO vs. Concat_RO     0.0258     0.1061     FDR         TRUE
Upvote_RO vs. Concat_RO     0.9622     0.9607     TPR         FALSE
Upvote_RO vs. Concat_RO     0.9742     0.8939     Precision   TRUE
Upvote_RO vs. Concat_RO     0.9972     0.9971     NPV         FALSE
Upvote_RO vs. Concat_RO     0.0013     0.0084     FPR         TRUE
Concat_RO vs. UniRep_SM     0.9261     0.9307     F1          TRUE
Concat_RO vs. UniRep_SM     0.1061     0.0930     FDR         TRUE
Concat_RO vs. UniRep_SM     0.9607     0.9557     TPR         TRUE
Concat_RO vs. UniRep_SM     0.8939     0.9070     Precision   TRUE
Concat_RO vs. UniRep_SM     0.9971     0.9967     NPV         TRUE
Concat_RO vs. UniRep_SM     0.0084     0.0072     FPR         TRUE

Figure 3.6 Upvoting achieved the best ranking with both subjective and objective weighting in the MCDA design. A. The main steps for performing MCDA are elaborated, including the selected methods for implementing MCDA such as classification criteria, model selection, and statistical analysis. B. TOPSIS scores (i.e., closeness coefficients) and their associated rankings are shown for subjective and objective weighting, demonstrating the effectiveness of different encoding methods. A list of pairwise comparisons over all alternatives can be found in the supplementary information.

The voting method enhanced the prediction score using both a single metric and multiple metrics in MCDA by combining the predictions of multiple models based on single encodings. We concluded that, as different encodings might capture the distances and relationships among the data points differently, combining their predictions boosted the final model performance.
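The following is a minimal sketch of the hard majority voting scheme described above (one classifier per encoding, trained on the same split, with the final label decided by vote). The per-encoding feature matrices and the logistic regression stand-in are hypothetical; they are not the exact models reported here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Hypothetical per-encoding feature matrices for the same set of sequences.
rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.07).astype(int)
encodings = {
    "onehot": rng.normal(size=(1000, 400)),
    "unirep": rng.normal(size=(1000, 1900)),
    "esm": rng.normal(size=(1000, 1280)),
}

idx_tr, idx_te = train_test_split(np.arange(len(y)), test_size=0.3, stratify=y, random_state=1)

# Train one classifier per encoding on the same split, then combine predictions by majority vote.
votes = []
for name, X in encodings.items():
    clf = LogisticRegression(max_iter=1000).fit(X[idx_tr], y[idx_tr])
    votes.append(clf.predict(X[idx_te]))

majority = (np.mean(votes, axis=0) >= 0.5).astype(int)   # hard voting over the three encodings
print("Ensemble (upvote) F1:", f1_score(y[idx_te], majority))
```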
The encoding methods used for the voting technique in this dataset are visualized in Figure 3.7 in a uniform manifold approximation and projection (UMAP) plot.

3.3.6 How Protein Encodings Perform Considering Different Data Attributes
The hypotheses were tested over affibody datasets that had notable attributes such as severe imbalance, multiple mutation sites, affinity and specificity enrichment, and short protein length. The obtained results indicated that voting and oversampling were highly effective methods for boosting fitness prediction performance. However, comparisons of individual protein-encoding performance require a more convincing explanation and thorough exploration. Specifically, we wondered why ESM underperformed One-Hot and UniRep despite its more powerful pretraining setup and its demonstrated predictive potential in other studies [66].

Figure 3.7 Different protein encodings potentially capture distinct functional aspects of the proteins. A 2D UMAP visualization of the encoding techniques that resulted in improved prediction with the voting method. UMAP is a dimensionality reduction technique, like principal component analysis (PCA) [65], but with unique advantages such as preserving the local structure of the data and capturing non-linear relationships between data points. In observing the sequence-function relationship in proteins, one can conclude that each protein sequence representation/encoding has the potential to capture different aspects of fitness.

While this performance could be due to the data type (e.g., small protein, complex fitness), we decided to further analyze the encoding prediction scores in a completely different dataset to gain insight into embedding performance under various conditions (e.g., training data size, protein length, prediction task difficulty). The curated data contain 18,190 sequences with varying amino acid (aa) lengths and provide melting points that indicate protein stability. Figure 3.8 compares the performance of the embeddings, their concatenation, and voting for stability prediction across different data sizes. Despite underperforming on the affibody affinity data, ESM performed best for stability prediction when including proteins with a maximum aa length of 500. We further evaluated the performance of the embedding methods on large proteins (400 ≤ aa length ≤ 1500) and small proteins (aa length ≤ 120) to check whether ESM still outperforms the other representations in stability prediction. A complete list of statistical analyses is attached in the supplementary materials. The analysis of the physical feature encoding is provided in Figure 3A.9.

Figure 3.8 The effect of protein size on the performance of encoding methods in stability prediction while data sizes vary. The obtained results differ markedly with respect to protein size: small proteins (aa length ≤ 120) vs. large proteins (400 ≤ aa length ≤ 1500). Highlights: For small proteins, comparing the violin plots and statistical test results shows that the protein sequence encoding methods performed distinctly relative to the initial dataset (protein max length = 500). One-Hot encoding made a more significant contribution to boosting the classification metrics for small proteins. As an example, when n = 400, both One-Hot and the All-Encoding concatenation, with a mean F1-score of 94%, outperformed the other encoding methods. One-Hot tends to be problematic for large proteins as it results in a highly sparse encoding vector.
This is evident in the plot, where One-Hot encoding performance was not satisfactory in comparison with ESM and UniRep. When n = 400, based on both the violin plots and the post-hoc analysis after ANOVA (both Bonferroni and Tukey), either ESM or ESM_UniRep, with a 92% mean F1-score, achieved the highest performance. One-Hot, with a 73% mean F1-score, had the lowest score among all the encodings. Refer to the supplementary information for all one-by-one comparisons of the statistics and classification.

3.3.7 Regression Analysis for Melting Point Prediction
The last analysis is a regression task for predicting the melting point value. We wondered how different encoding methods would perform if we used all the data and increased the prediction challenge (Tm prediction rather than stability class prediction). MSE and R² are reported for predicting the Tm values of the dataset of 18,190 sequences with a 0.3 test-set fraction. There was a significant difference in the performance of the encoding methods, which was not the case in the classification task. ESM was the best encoding method for predicting stability (R² = 0.65). Note that we used all the data (i.e., did not use sampling) for training our regression model. The regression metrics are reported in Table 3.3.

Table 3.3 Regression metrics for encoding methods on validation and test sets.

Encoding   Validation R²   Validation MSE   Test R²   Test MSE
One-Hot    0.21            141              0.24      130
UniRep     0.49            108              0.40      102
ESM        0.65            63               0.65      60

3.4 Discussion
In this study, we shed light on two key challenges of applying discriminative models to amino acid sequence data for protein engineering applications: (1) handling imbalanced data and (2) choosing an appropriate protein representation (i.e., encoding). Assay-labeled sequence data in this domain are often severely imbalanced (due to the rugged and sparse nature of the protein fitness landscape) and require careful consideration in data sampling, splitting, and the choice of data representation for model training. To capture this common occurrence of imbalanced data, we trained discriminative ML models on our cytometry-sorted, deep-sequenced small protein (affibody) data to distinguish functional sequences (n = 6077) from a large collection of non-functional protein sequences (n = 82,663). We then explored the impact of encoding protein sequences using two simplistic approaches (One-Hot encoding, physicochemical encoding) and two language-based methods (UniRep, ESM). We hypothesized that, as each protein representation may capture distinct information, combining representations via embedding concatenation and ensemble learning would increase overall performance and generalizability. To address the issue of imbalanced data, we implemented various sampling techniques, including undersampling, random oversampling, and SMOTE, and evaluated their performance using multiple classification metrics. Our results indicate that implementing oversampling techniques over imbalanced datasets improves predictive performance relative to undersampling or the exclusion of sampling methods. Among the sequence representation methods, embeddings offer a promising route to improved fitness prediction with reduced data requirements. However, it is essential to consider the choice of protein representation, along with its benefits and drawbacks. For example, the choice of fitness to be predicted (e.g., thermal stability, binding affinity, target specificity) and the language model pretraining procedure affect the model's predictive performance and need further discussion.
Therefore, we analyzed an additional dataset (i.e., the NESP dataset, which includes a variety of protein sequences with their Tm values) to examine the effect of protein representations across variables such as protein length, protein fitness, and prediction type (i.e., classification vs. regression). For ensemble learning, we used majority voting to combine the predictions of each representation over the same ML model, which significantly improved the prediction score; the obtained results were statistically significant according to MANOVA and the post-hoc method (Figure 3.6, Table 3.2).

3.5 Conclusions
This study intends to inform protein engineers that: (i) embeddings derived from self-supervised representation techniques are not always the optimal route to take, depending on the protein size and the protein fitness to be predicted; (ii) oversampling techniques, especially SMOTE, have the ability to overcome the notorious challenge of highly imbalanced data in the protein fitness landscape; and (iii) the different aspects learned by each protein encoding can be combined through voting techniques, resulting in better predictive scores. These conclusions were revealed in the context of integrating machine learning and protein engineering knowledge to identify high-fitness protein sequences. Specifically, we quantified model performance while varying the choice of feature representation, ensemble learning, and sampling methods. Analysis across a broad range of protein chain lengths revealed the ESM language model to be most beneficial for encoding large protein sequences (Figure 3.8). However, in the context of small protein sequences, comparable performance was observed between One-Hot encoding and the language models (ESM and UniRep). In our analysis, oversampling proved to be an effective technique for improving performance when dealing with severely imbalanced datasets (Figure 3.4). Finally, ensemble learning was a promising method for boosting the binding prediction scores when using unique, competitive encoding methods (Figure 3.5, Figure 3.6).

BIBLIOGRAPHY
[1] W. Liebermeister, E. Noor, A. Flamholz, D. Davidi, J. Bernhardt, and R. Milo. Visual account of protein investment in cellular functions. Proceedings of the National Academy of Sciences of the United States of America, 111:8488–8493, 2014.
[2] J. Schlessinger. Cell signaling by receptor tyrosine kinases. Cell, 103:211–225, 2000.
[3] B.L. Hogan. Bone morphogenetic proteins: Multifunctional regulators of vertebrate development. Genes & Development, 10:1580–1594, 1996.
[4] E. Andrianantoandro, S. Basu, D.K. Karig, and R. Weiss. Synthetic biology: New engineering rules for an emerging discipline. Molecular Systems Biology, 2:2006.0028, 2006.
[5] M. Heim, L. Römer, and T. Scheibel. Hierarchical structures made of proteins. The complex architecture of spider webs and their constituent silk proteins. Chemical Society Reviews, 39:156–164, 2010.
[6] H. Kolmar. Biological diversity and therapeutic potential of natural and engineered cystine knot miniproteins. Current Opinion in Pharmacology, 9:608–614, 2009.
[7] A. Krasniqi, M. D'Huyvetter, N. Devoogdt, F.Y. Frejd, J. Sörensen, A. Orlova, M. Keyaerts, and V. Tolmachev. Same-day imaging using small proteins: Clinical experience and translational prospects in oncology. Journal of Nuclear Medicine, 59:885–891, 2018.
[8] P.A. Romero and F.H. Arnold. Exploring protein fitness landscapes by directed evolution. Nature Reviews Molecular Cell Biology, 10:866–876, 2009.
[9] H.W. Hellinga. Rational protein design: Combining theory and experiment. Proceedings of the National Academy of Sciences of the United States of America, 94:10015–10017, 1997.
[10] C. Jäckel, P. Kast, and D. Hilvert. Protein design by directed evolution. Annual Review of Biophysics, 37:153–173, 2008.
[11] G. Li, Y. Dong, and M.T. Reetz. Can machine learning revolutionize directed evolution of selective enzymes? Advanced Synthesis & Catalysis, 361:2377–2386, 2019.
[12] N. Anand, R. Eguchi, I.I. Mathews, C.P. Perez, A. Derry, R.B. Altman, and P.S. Huang. Protein sequence design with a learned potential. Nature Communications, 13:716, 2022.
[13] Z. Wu, S.B.J. Kan, R.D. Lewis, B.J. Wittmann, and F.H. Arnold. Machine learning-assisted directed protein evolution with combinatorial libraries. Proceedings of the National Academy of Sciences of the United States of America, 116:8852–8858, 2019.
[14] Y. Saito, M. Oikawa, T. Sato, H. Nakazawa, T. Ito, T. Kameda, K. Tsuda, and M. Umetsu. Machine-learning-guided library design cycle for directed evolution of enzymes: The effects of training data composition on sequence space exploration. ACS Catalysis, 11:14615–14624, 2021.
[15] A.W. Golinski, K.M. Mischler, S. Laxminarayan, N.L. Neurock, M. Fossing, H. Pichman, S. Martiniani, and B.J. Hackel. High-throughput developability assays enable library-scale identification of producible protein scaffold variants. Proceedings of the National Academy of Sciences of the United States of America, 118:e2026658118, 2021.
[16] J. Chen, S. Zheng, H. Zhao, and Y. Yang. Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map. Journal of Cheminformatics, 13:1–10, 2021.
[17] S. Wang, D. Liu, M. Ding, Z. Du, Y. Zhong, T. Song, J. Zhu, and R. Zhao. SE-OnionNet: A convolution neural network for protein–ligand binding affinity prediction. Frontiers in Genetics, 11:607824, 2021.
[18] K. Kuzmin, A.E. Adeniyi, A.K. DaSouza, D. Lim, H. Nguyen, N.R. Molina, L. Xiong, I.T. Weber, and R.W. Harrison. Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone. Biochemical and Biophysical Research Communications, 533:553–558, 2020.
[19] S. Das and S. Chakrabarti. Classification and prediction of protein–protein interaction interface using machine learning algorithm. Scientific Reports, 11:1–12, 2021.
[20] Y. Vander Meersche, G. Cretin, A.G. de Brevern, J.C. Gelly, and T. Galochkina. MEDUSA: Prediction of protein flexibility from sequence. Journal of Molecular Biology, 433:166882, 2021.
[21] T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 13:55–75, 2018.
[22] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.
[23] M. Mnasri. Recent advances in conversational NLP: Towards the standardization of chatbot building. arXiv preprint, 2019.
[24] G. Campagna, S. Xu, M. Moradshahi, R. Socher, and M.S. Lam. Genie: A generator of natural language semantic parsers for virtual assistant commands. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 394–410, 2019.
[25] M. Heinzinger, A. Elnaggar, Y. Wang, C. Dallago, D. Nechaev, F. Matthes, and B. Rost. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics, 20:1–17, 2019.
[26] D. Ofer, N. Brandes, and M. Linial. The language of proteins: NLP, machine learning & protein sequences. Computational and Structural Biotechnology Journal, 19:1750–1758, 2021.
[27] A. Bateman, M.J. Martin, C. O'Donovan, M. Magrane, R. Apweiler, E. Alpi, R. Antunes, J. Arganiska, B. Bely, and M. Bingley. UniProt: A hub for protein information. Nucleic Acids Research, 43:D204–D212, 2015.
[28] K. Katz, O. Shutov, R. Lapoint, M. Kimelman, J.R. Brister, and C. O'Sullivan. The sequence read archive: A decade more of explosive growth. Nucleic Acids Research, 50:D387–D390, 2022.
[29] M. Torrisi, G. Pollastri, and Q. Le. Deep learning methods in protein structure prediction. Computational and Structural Biotechnology Journal, 18:1301–1310, 2020.
[30] A. Elnaggar, M. Heinzinger, C. Dallago, G. Rehawi, W. Yu, L. Jones, T. Gibbs, T. Feher, C. Angerer, and M. Steinegger. ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:7112–7127, 2021.
[31] N. Ferruz, S. Schmidt, and B. Höcker. ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13:4348, 2022.
[32] E.C. Alley, G. Khimulya, S. Biswas, M. AlQuraishi, and G.M. Church. Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 16:1315–1322, 2019.
[33] A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu, D. Guo, M. Ott, C.L. Zitnick, and J. Ma. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences of the United States of America, 118:e2016239118, 2021.
[34] R. Rao, N. Bhattacharya, N. Thomas, Y. Duan, X. Chen, J. Canny, P. Abbeel, and Y.S. Song. Evaluating protein transfer learning with TAPE. In Advances in Neural Information Processing Systems, volume 32, pages 9689–9699, 2019.
[35] N. Brandes, D. Ofer, Y. Peleg, N. Rappoport, and M. Linial. ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics, 38:2102–2110, 2022.
[36] C. Hsu, H. Nisonoff, C. Fannjiang, and J. Listgarten. Combining evolutionary and assay-labelled data for protein fitness prediction. bioRxiv, 2021.
[37] J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, and A. Rives. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 35:29287–29303, 2021.
[38] S.K.S. Chu and J. Siegel. Predicting single-point mutational effect on protein stability. Growth, 16:35–45, 2021.
[39] Z. Lv, P. Wang, Q. Zou, and Q. Jiang. Identification of sub-Golgi protein localization by use of deep representation learning features. Bioinformatics, 36:5600–5609, 2020.
[40] K. Li, Y. Zhong, X. Lin, and Z. Quan. Predicting the disease risk of protein mutation sequences with pre-training model. Frontiers in Genetics, 11:1–10, 2020.
[41] S. Min, H.G. Kim, B. Lee, and S. Yoon. Protein transfer learning improves identification of heat shock protein families. PLOS ONE, 16:e0251865, 2021.
[42] D.R. Woldring, P.V. Holec, L.A. Stern, Y. Du, and B.J. Hackel. A gradient of sitewise diversity promotes evolutionary fitness for binder discovery in a three-helix bundle protein scaffold. Biochemistry, 56:1656–1671, 2017.
[43] D. Pultz, E. Friis, Inversion, J. Salomon, Maggie, P. Fischer Hallin, and S. Baagøe Jørgensen. Novozymes enzyme stability prediction, 2022. Kaggle.
[44] R.L. Keeney, H. Raiffa, and D.W. Rajala. Decisions with Multiple Objectives: Preferences and Value Trade-Offs, volume 9. IEEE Transactions on Systems, Man, and Cybernetics, 1977.
[45] D.R. Woldring, P.V. Holec, L.A. Stern, Y. Du, and B.J. Hackel. A gradient of sitewise diversity promotes evolutionary fitness for binder discovery in a three-helix bundle protein scaffold. Biochemistry, 56:1656–1671, 2017.
[46] A.T. Müller, G. Gabernet, J.A. Hiss, and G. Schneider. modlAMP: Python for antimicrobial peptides. Bioinformatics, 33:2753–2755, 2017.
[47] K.K. Yang, Z. Wu, C.N. Bedbrook, and F.H. Arnold. Learned protein embeddings for machine learning. Bioinformatics, 34:2642–2648, 2018.
[48] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, and Y. Shmueli. Evolutionary-scale prediction of atomic level protein structure with a language model. Science, 379:1123–1130, 2023.
[49] N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[50] A. Fernández, S. García, F. Herrera, and N.V. Chawla. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61:863–905, 2018.
[51] A.J. Mohammed. Improving classification performance for a novel imbalanced medical dataset using SMOTE method. International Journal of Advanced Trends in Computer Science and Engineering, 9:3161–3172, 2020.
[52] V. Rupapara, F. Rustam, H.F. Shahzad, A. Mehmood, I. Ashraf, and G.S. Choi. Impact of SMOTE on imbalanced text features for toxic comments classification using RVVC model. IEEE Access, 9:78621–78634, 2021.
[53] T. Hasanin, T.M. Khoshgoftaar, J.L. Leevy, and R.A. Bauder. Severely imbalanced big data challenges: Investigating data sampling approaches. Journal of Big Data, 6:1–25, 2019.
[54] R. Blagus and L. Lusa. Evaluation of SMOTE for high-dimensional class-imbalanced microarray data. In Proceedings of the 11th International Conference on Machine Learning and Applications (ICMLA 2012), volume 2, pages 89–94, 2012.
[55] R. van den Goorbergh, M. van Smeden, D. Timmerman, and B. Van Calster. The harm of class imbalance corrections for risk prediction models: Illustration and simulation using logistic regression. Journal of the American Medical Informatics Association, 29:1525–1534, 2022.
[56] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019.
[57] M.L. McHugh. Multiple comparison analysis testing in ANOVA. Biochemia Medica, 21:203–209, 2011.
[58] R.A. Armstrong. When to use the Bonferroni correction. Ophthalmic & Physiological Optics, 34:502–508, 2014.
[59] J. Tukey. The problem of multiple comparisons. Unpublished manuscript, Department of Statistics, Princeton University, Princeton, NJ.
[60] P. Branco, L. Torgo, and R. Ribeiro. A survey of predictive modelling under imbalanced distributions. ACM Computing Surveys, 49:1–50, 2015.
[61] K. Borowska and J. Stepaniuk. Imbalanced data classification: A novel re-sampling approach combining versatile improved SMOTE and rough sets. In Proceedings of the Computer Information Systems and Industrial Management: 15th IFIP TC8 International Conference, CISIM 2016, volume 9842, pages 31–42, Vilnius, Lithuania, 2016. Springer International Publishing.
[62] C.-L. Hwang and K. Yoon. Multiple Attribute Decision Making: Methods and Applications. A State-of-the-Art Survey. Springer, Berlin, Heidelberg, Germany, 1981.
[63] C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.
[64] R.A. Johnson and D.W. Wichern. Applied Multivariate Statistical Analysis. Pearson Prentice Hall, NJ, USA, 2007.
[65] I.T. Jolliffe. Principal component analysis: A beginner's guide—I. Introduction and application. Weather, 45:375–382, 1990.
[66] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, A.d.S. Costa, M. Fazel-Zarandi, T. Sercu, S. Candido, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022.

APPENDIX 3A SUPPLEMENTARY INFORMATION

Figure 3A.1 The majority of physical features are correlated with at least one other physical feature, providing insight into attribute dependencies in the affibody dataset. The plot is drawn upon sampling from both naïve and enriched populations. Bold colors represent a higher correlation of features. As the affibody length in our dataset was fixed, the first row and column of the matrix are empty. Insights from the correlation plot include (i) a high correlation between the Boman Index and the protein flexibility index, (ii) a high correlation of aromaticity with molecular weight, MSW, and refractivity, and (iii) the aliphatic index correlating with multiple features such as bulkiness, H-Eisenberg, and H-Gravy.

Figure 3A.2 The initial ANOVA test was performed for multiple comparisons and resulted in a p-value of 9.53 × 10⁻¹⁹⁰. The following tables in the supplement show the t-test results when the Bonferroni correction is considered. Statistically significant results are bolded.

Figure 3A.3 T-test results for Figure 3.5, part 1.

Figure 3A.4 T-test results for Figure 3.5, part 2.

Figure 3A.5 SMOTE either improved the performance or had no hampering effect relative to R-oversampling.

Figure 3A.6 Violin plot-based confusion matrix. Individual encodings perform uniquely across the confusion matrix entries. While the overall predictive performance is the main goal and is represented via the F1-score throughout the literature, inspecting how each model performs in maximizing true positives (TP) and true negatives (TN) while minimizing false positives (FP) and false negatives (FN) provides insights about each model's performance.

Figure 3A.7 Physical feature correlation plot for the NESP dataset.

Figure 3A.8 Physical feature ranking for the NESP dataset. The feature ranking is presented after using all the data for the classification of stable vs. unstable sequences. The Boman Index, H-Eisenberg, and MSS are the lead features. However, their scores are not significantly higher than those of the other physical attributes, indicating that more features contribute to the final F1 score of 0.86.

Figure 3A.9 The physical feature representation, when using a maximum of N = 1000, performed poorly and was not selected for the main figure.

Figure 3A.10 The predictive performances (F1-score) of multiple protein representations (One-Hot, UniRep, and ESM) were evaluated across stable (Tm ≥ 60 °C; n = 3140) vs. unstable (Tm ≤ 35 °C; n = 1116) proteins.
The performance of individual representations is compared against the effects of concatenating each embedding as well as ensemble methods (upvote1: hard voting, upvote2: soft voting). The violin plots were generated by repeating the analysis over 30 random seeds for sensitivity analysis. N represents the total number of data points used, with 0.3 as the test-size ratio. A Welch t-test with unequal variances was applied to the obtained results to showcase the statistical significance of the comparisons (refer to the supplementary information for p-values). Highlight: At a low data size (N = 20), One-Hot performs poorly, with a mean F1-score of 0.60, and the embeddings that included One-Hot were outperformed by both ESM and UniRep. While increasing the data size improved performance for all the methods, concatenating the ESM and UniRep representations obtained the best score, with a mean F1 of 0.92.

Figure 3A.11 A complete list of the criteria used for MCDA and their derivations from confusion matrix values.

CHAPTER 4 PREDICTING ORGANIC ANION TRANSPORTER PROTEIN-DRUG INHIBITION BY INTEGRATING STRUCTURAL INSIGHTS WITH GRAPH NEURAL NETWORKS

Abstract
Organic Anion Transporting Polypeptides (OATPs) are crucial for hepatic drug uptake, impacting drug efficacy and toxicity. However, predicting OATP-mediated drug-drug interactions (DDIs) remains challenging due to the scarcity of structural data, the dynamic behavior of these proteins, and inconsistencies in experimental protocols. This chapter introduces the Heterogeneous Graph Neural Network for Inhibition Prediction of OATPs (HIPO-GNN), a novel computational approach that combines molecular modeling and graph neural networks to improve the prediction of OATP-drug inhibition. By combining ligand molecular features with protein-ligand interaction data, HIPO-GNN significantly outperforms traditional ligand-based methods, achieving median F1 and AUC scores of 0.82 and 0.81, respectively, compared to ECFP (F1: 0.68, AUC: 0.70) and RDKit (F1: 0.78, AUC: 0.75) built upon XGBoost. The analysis further evaluates the utility of interaction features in GNNs, with HIPO-GNN showing a marked improvement in F1 score over a ligand-only GNN model and no significant compromise in AUC. Beyond improving inhibition prediction, this study explores the impact of computational noise from molecular modeling and identifies optimal thresholds in molecular docking for improved modeling performance. Additionally, protein residues involved in inhibitory versus non-inhibitory drug interactions are characterized, specifically highlighting residues F38, K41, N213, L378, G552, and V556, which show notable differences between inhibitors and non-inhibitors. The findings enhance the predictive performance and interpretability of OATP-mediated DDIs, advocating for the integration of advanced computational techniques with standardized experimental protocols. This work contributes to the field of drug discovery and development by providing a novel approach to predict and understand OATP inhibition.

Keywords: Molecular Modeling, Drug-Drug Interaction, Machine Learning, OATP, GNN

4.1 Introduction
Drug-drug interactions (DDIs) pose a significant challenge in modern pharmacotherapy, as they can lead to the impairment or induction of critical metabolic enzymes and transporter proteins responsible for processing multiple drugs simultaneously. These interactions can be further exacerbated by genetic variation, potentially resulting in lethal outcomes [1].
Recognizing the importance of predictive DDI models in the drug development process, the US Food and Drug Administration (FDA) has recently issued guidelines advocating for their consideration in regulatory pathways for new drugs [2]. While such models have been well-established for metabolic DDIs involving cytochrome P450s, there is a notable lack of predictive models for transporter-mediated DDIs (tDDIs). This deficiency is particularly concerning in the context of organic anion transporting polypeptides (OATPs), a family of proteins responsible for transporting a wide range of drugs into the liver for bile-mediated elimination. Numerous alarming DDIs involving OATPs have been reported, highlighting the urgent need for the development of reliable predictive tDDI models [3]. Addressing this knowledge gap is crucial for ensuring the safety and efficacy of medications and ultimately improving patient outcomes. OATPs are not only expressed in the liver for drug transport but also in other pharmacologically relevant tissues, such as the kidneys and intestines. These proteins facilitate the uptake of a diverse array of endogenous and exogenous small molecules, including hormones and drugs. To accommodate the vast number and variety of substrates processed by these tissues, OATPs exhibit a high degree of promiscuity. This promiscuity gives rise to intricate multimolecular interactions and elaborate regulatory networks that can profoundly influence drug efficacy and toxicity. Understanding the complexities of these interactions is essential for predicting and mitigating the potential risks associated with OATP-mediated drug transport. By elucidating the mechanisms underlying OATP promiscuity and the resulting regulatory networks, researchers can develop more comprehensive and accurate models for predicting tDDIs. This knowledge will ultimately contribute to the design of safer and more effective medications, as well as the optimization of drug dosing and administration strategies to minimize the risk of adverse drug reactions. Currently, there is no generalizable method to consistently predict how a molecule will interact with OATPs, despite valiant efforts to develop machine learning (ML) algorithms to do so [4, 5, 6]. This is due to two major challenges: (1) there is significant variability in the available in vitro data, resulting in inconsistent data labeling [7, 8], and (2) until very recently, no OATP structures had been solved, substantially limiting options for accurate structure-based analyses [4]. Challenge one—the lack of consistent in vitro data—severely constrains model applicability to the data on which it was trained. These data inconsistencies arise from different in vitro experimental conditions being used to evaluate the same ligand (e.g., with or without pre-incubation, various cell types), leading to drugs being labeled inconsistently as inhibitors or substrates depending on the context of the experimental conditions [9]. Overlooking the context (i.e., chemical environment) associated with OATP-inhibition assays is deeply problematic, as it distorts our understanding of how OATPs function and results in inaccurately labeled inhibitor datasets. For example, studies have shown significant variability in OATP1B1/1B3 inhibition data due to different experimental conditions, leading to inconsistencies in drug classifications [9, 10].
This variability can be attributed to various factors, including genetic polymorphisms in transporters like OATP1B1, which are major determinants of drug pharmacokinetics and can lead to variability in drug disposition [11, 12]. We identify key physicochemical properties that correlate with inhibition measurement variability, as these properties play a significant role in compound disposition and safety, influencing formulation, absorption, and toxicity during drug development [13]. Challenge two—unsolved OATP structures (prior to AlphaFold2 [14])—limited past ML models to physicochemical and/or topological ligand properties alone. The absence of elucidated structures precluded incorporating protein-ligand interaction analysis into past computational workflows. Adding to the complexity of this challenge is the dynamic nature of OATPs, whose transport mechanisms involve at least three conformers (outward-facing, occluded, and inward-facing). Recently published cryo-EM structures of OATP1B1 [15] (inward- and outward-facing) and OATP1B3 [16] (inward-facing) have enabled us to explore how protein-ligand interaction features (PLIFs) may be integrated into these models. The availability of both inward- and outward-facing structures of OATP1B1 [ref] allowed us to expand our analyses to determine if purely computational workflows could capture the dynamic nature of the PLIFs to be used in predictive tDDI models. Collectively, in this study, we (1) identify physicochemical ligand features (LFs) that highly correlate with in vitro data variability, and (2) explore the integration of computationally produced PLIFs into tDDI models. Herein, we present a workflow to extract PLIFs from OATP1B1 conformers docked against a library of experimentally characterized ligands. Adapting to the constraints of the available data, we limited evaluation of our models to data obtained from a single in vitro inhibition study [7].

4.2 Methods
This section details the steps taken to curate data, select and label ligands, prepare ligand and protein structures, perform molecular docking, and develop machine learning models for predicting OATP inhibition.

4.2.1 Data Curation
Several studies have been conducted over the past decade that characterize and evaluate OATP interactions with ligands. Due to largely unexplained inter-study variability among the available in vitro OATP inhibition data [9], we reasoned that sourcing data from a single set of experiments would provide the most consistent framework for our study [7].

4.2.1.1 Ligand Selection & Labeling
Experimental inhibition data were obtained from Karlgren et al. [7] for 225 ligands against each of OATP1B1, OATP1B3, and OATP2B1. Of these 225 compounds, 222 were compatible with our molecular docking workflow; the remaining three ligands were excluded. As was done in the original Karlgren study, compounds with reported inhibition percentages greater than 50% were labeled as inhibitors (Class = 1), while those with inhibition percentages less than or equal to 50% were labeled as non-inhibitors (Class = 0).

4.2.1.2 Training & Test Sets
Under the constraint that each test set must have an equal class balance, twenty test sets were constructed by randomly selecting 10% of ligands from the total dataset. For each of the twenty test sets, the corresponding training set consisted of the remaining 90% of compounds not in the hold-out test set.
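The following is a minimal pandas/NumPy sketch of the labeling rule and the class-balanced hold-out construction described above; the table, column names, and random values are hypothetical placeholders rather than the actual Karlgren data.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one row per ligand with a percent-inhibition column for one OATP subtype.
df = pd.DataFrame({
    "ligand_id": [f"L{i:03d}" for i in range(222)],
    "pct_inhibition_1b1": np.random.default_rng(0).uniform(0, 100, 222),
})

# Labeling rule from the text: >50% inhibition -> inhibitor (1), otherwise non-inhibitor (0).
df["label"] = (df["pct_inhibition_1b1"] > 50).astype(int)

def balanced_holdout(df, frac=0.10, seed=0):
    """Draw a class-balanced hold-out test set containing roughly `frac` of the ligands."""
    rng = np.random.default_rng(seed)
    n_per_class = int(len(df) * frac) // 2
    test_idx = []
    for cls in (0, 1):
        members = df.index[df["label"] == cls].to_numpy()
        test_idx.extend(rng.choice(members, size=n_per_class, replace=False))
    test = df.loc[test_idx]
    train = df.drop(index=test_idx)
    return train, test

train, test = balanced_holdout(df, frac=0.10, seed=0)
print(len(train), len(test), test["label"].mean())   # ~10% held out, half inhibitors
```

Repeating the split with twenty different seeds reproduces the twenty train/test partitions used for evaluation.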
4.2.2 Structural Modeling
4.2.2.1 Ligand Structure Preparation
Canonical SMILES were compiled from PubChem [17] for all 222 compounds to be used for structural modeling. Ligand representations were converted from SMILES format to three-dimensional structures (SDF format), energetically minimized, and protonated (to pH 7.5) using Open Babel [18].

4.2.2.2 Protein Structure Preparation
Protein sequences of human OATP1B1, OATP1B3, and OATP2B1 were obtained from the UniProt database in FASTA format [19]. The cryo-EM structures of the inward- and outward-facing OATP1B1 conformers (PDB IDs 8HND, 8HNB) [20], along with the inward-facing OATP1B3 conformer (PDB ID 8PG0) [16], were retrieved from the Protein Data Bank [21]. These structures were prepared for docking by removing water molecules, ions, and small molecule ligands using PyMOL [22] and then relaxed using the Rosetta FastRelax protocol [23] to minimize any structural artifacts from the cryo-EM experiments. Due to the lack of experimental structures for the outward-facing OATP1B3 and both conformers of OATP2B1, we utilized AlphaFold2 [24, 25], a state-of-the-art protein structure prediction algorithm, to generate these missing structures. The AlphaFold2-predicted structures were validated by comparing them to the available cryo-EM structures of OATP1B1 and OATP1B3 using root-mean-square deviation (RMSD) calculations, as described by Sala et al. [26]. The low RMSD values observed between the predicted and experimental structures provided confidence in the quality of the AlphaFold2 predictions for the missing conformers.

4.2.2.3 Ligand-Protein Docking
Molecular docking was performed between all 222 ligands and both the outward and inward conformers of human OATP1B1, OATP1B3, and OATP2B1 using the Rosetta modeling suite (version 3.12) [27]. The RosettaLigand score function (pre-talaris2013) was chosen for its proven performance in small molecule docking compared to other Rosetta score functions and external docking algorithms [28]. The docking protocol involved the following steps: (1) concatenating the ligand conformer files with the prepared protein structures, (2) defining the starting coordinates for the ligand within the known transport site of the OATPs [20] using an XML file, (3) specifying the dimensions of the grid in which the ligand is allowed to move during the docking simulation, and (4) running the docking simulation for all ligands. To ensure a thorough sampling of the binding space, 1000 docked poses were generated for each ligand-protein pair. To further validate the docking results, the top-ranked poses for each ligand were visually inspected for any unrealistic conformations or interactions. Additionally, the binding energies of the top poses were compared to available experimental binding data [7] to assess the accuracy of the RosettaLigand score function in ranking the poses. These validation steps provided additional confidence in the quality of the docking results for downstream analysis.

4.2.2.4 Ranking Docked Poses for Each Ligand
For each docked pose, Rosetta reports the ligand-protein interface energy (interface_delta_X). This was used as a relative metric to rank the 1000 docked poses for a single ligand-OATP conformer pair. Poses with the lowest interface_delta_X scores were interpreted as higher-quality docked poses (i.e., more stable binding). For all poses, interface_delta_X scores were recorded in a CSV file for use in subsequent analyses (e.g., determining the optimal number of poses to capture transporter dynamics).
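A minimal sketch of this pose-ranking step is shown below, using pandas on a toy stand-in for the recorded score table; the column names and synthetic energies are illustrative assumptions, with interface_delta_X playing the role of the Rosetta-reported interface energy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the score table: one row per docked pose, with its ligand ID and
# the reported interface energy (interface_delta_X); lower values are taken as more stable.
rng = np.random.default_rng(0)
scores = pd.DataFrame({
    "ligand_id": np.repeat([f"L{i:03d}" for i in range(3)], 1000),   # 1000 poses per ligand
    "pose_id": np.tile(np.arange(1000), 3),
    "interface_delta_X": rng.normal(loc=-8.0, scale=2.0, size=3000),
})

# Rank poses within each ligand and keep the k best for downstream PLIF extraction;
# k is the pose-count threshold explored later in the chapter.
k = 20
scores["rank"] = scores.groupby("ligand_id")["interface_delta_X"].rank(method="first")
top_poses = scores[scores["rank"] <= k].sort_values(["ligand_id", "rank"])
print(top_poses.groupby("ligand_id").size())   # k poses retained per ligand
```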
4.2.3 Feature Selection, Encoding, and Preprocessing
Two distinct types of feature sets were generated for each OATP conformer: traditional ligand-only molecular descriptors (ligand features, LFs) and protein-ligand interaction features (PLIFs).

4.2.3.1 Ligand Features (LFs)
Two types of vectorized molecular representations, extended-connectivity fingerprints (ECFP6) and RDKit physicochemical descriptors, were generated for each of the 222 ligands using the pinky (https://github.com/ubccr/pinky/) and RDKit [29] Python libraries, respectively. As both types of LFs consist of numeric, non-categorical data, no encoding was required. Variance (<0.001) and correlation coefficient (>0.99) thresholding was performed to remove any obvious or redundant features.

4.2.3.2 Protein-Ligand Interaction Features (PLIFs)
The Protein-Ligand Interaction Profiler (PLIP) [30] was used to generate interaction profiles for each docked pose. PLIP produces intricate features to describe each interaction present in a docked pose, including interaction type, distance, angle, donor and acceptor atom IDs, etc. These comprehensive PLIFs were parsed into a CSV file, with each row containing vectorized PLIFs for a single docked pose. Raw PLIF CSV files were processed with an initial variance filtering (<0.001) to remove any non-categorical, numerical features (e.g., interaction distance, angle) with variance below this threshold. One-hot encoding was then performed for all categorical features. Encoded data were then filtered by correlation coefficient (>0.99) to remove highly redundant features.

4.2.3.3 Protein-Ligand Interaction Feature Reduction: Minimal PLIFs (mPLIFs)
To capture a more interpretable, less noisy PLIP representation, a simplified set of PLIFs (minimal PLIFs, mPLIFs) was parsed from the PLIF data. These more minimal features define whether each OATP residue is involved in a specific interaction type in a given pose (0 = not involved, 1 = involved). The mPLIF naming system follows the pattern [interaction type].[OATP residue] (examples in supplementary Figure 4D.1).

4.2.4 Machine Learning Classification Models
This section elaborates on the data processing techniques used to improve the ML classification tasks and on model selection, ranging from classical machine learning classifiers to a neural network for the multi-label classification task.

4.2.4.1 Data Preprocessing
Prior to training the machine learning models, several preprocessing steps were applied to the dataset. One-hot encoding was performed on categorical features to convert them into a binary representation suitable for the models. Variance and multicollinearity correlation thresholding were employed to remove features with low variance (<0.001) and high correlation (>0.99), respectively, reducing the dimensionality of the feature space and mitigating the risk of overfitting. A drug-wise splitting approach was adopted to create the holdout test sets to ensure a robust evaluation of the models' performance. This approach, as opposed to entirely random pose-wise splitting, prevents leakage of information between the training and test sets. Five balanced test sets were created, each containing 10% of the total dataset, with an equal representation of inhibitors and non-inhibitors. The remaining 90% of the data was used for training and cross-validation. Additionally, feature normalization using StandardScaler was applied to ensure that all features have zero mean and unit variance, improving the convergence and stability of the machine learning algorithms.
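For illustration, a minimal scikit-learn/pandas sketch of the preprocessing steps just described (variance filtering, correlation filtering, and standard scaling) is given below; the feature table and thresholds follow the text, but the helper function and data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

def preprocess(features: pd.DataFrame) -> pd.DataFrame:
    """Variance filter (<0.001), correlation filter (>0.99), then zero-mean/unit-variance scaling."""
    # Drop near-constant numerical features.
    vt = VarianceThreshold(threshold=0.001)
    kept = features.loc[:, vt.fit(features).get_support()]

    # Drop one feature from each highly correlated pair (|r| > 0.99).
    corr = kept.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > 0.99).any()]
    kept = kept.drop(columns=redundant)

    # Standard scaling; in practice the scaler is fit on the training split only.
    scaled = StandardScaler().fit_transform(kept)
    return pd.DataFrame(scaled, columns=kept.columns, index=kept.index)

# Hypothetical encoded PLIF/LF table: rows = poses or ligands, columns = numeric features.
X = pd.DataFrame(np.random.default_rng(0).normal(size=(500, 50)),
                 columns=[f"feat_{i}" for i in range(50)])
print(preprocess(X).shape)
```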
4.2.4.2 Classifier Training and Evaluation
Several machine learning algorithms, including logistic regression, support vector machines (SVM), random forests, and XGBoost, were trained and evaluated using the preprocessed data. The training process involved k-fold cross-validation, where the training data were split into k subsets and the models were trained and validated k times, using a different subset for validation each time. This approach helps to assess the models' performance and stability across different subsets of the data. The classification results are reported for XGBoost due to its consistent out-performance. The trained models were then evaluated on the holdout test sets to assess their generalization performance. Metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) were calculated to provide a comprehensive assessment of the models' performance in predicting OATP inhibition.

4.2.4.3 Multi-label Neural Network
In our approach to predicting OATP inhibition, we developed a multi-label neural network classification model. This methodology facilitates simultaneous prediction of inhibition across multiple OATP subtypes (1B1, 1B3, and 2B1), providing a more holistic perspective on potential drug interactions compared to traditional single-family classification models. The multi-label strategy excels in revealing the intricate interplay between drugs and various OATP subtypes, enabling the discovery of cross-subtype inhibition patterns that might be overlooked in isolated analyses. Our implementation featured a custom-designed InhibitorClassifier neural network architecture, consisting of multiple hidden layers with ReLU activation functions. For multi-label classification, we employed the BCEWithLogitsLoss criterion, which is well suited for handling multiple binary labels. To enhance generalization and prevent overfitting, we incorporated an early stopping mechanism. Additionally, hyperparameter tuning was conducted using the Optuna optimization package [31], ensuring robust performance across different OATP subtypes. This comprehensive setup highlights the efficacy of our approach in tackling the challenges of OATP-mediated drug interaction prediction.

4.2.5 Heterogeneous Graph Neural Network for Inhibition Prediction of OATPs (HIPO-GNN)
A heterogeneous graph neural network (HeteroGNN) is constructed for each protein-ligand complex to capture the intricate interactions between the protein and the ligand. The graph contains two types of nodes: amino acid residues and ligand atoms. For amino acids in the protein sequence, 28 features are extracted (Table 4.1), including one-hot encoded residue type, molecular weight, aromaticity, isoelectric point, hydrophobicity, flexibility, and secondary structure propensities. These features capture the essential properties involved in protein-ligand interactions. For ligand atoms, 79 atomic features are derived (Table 4.2), including one-hot encoded atom type, number of heavy neighbors, formal charge, hybridization state, presence in rings, aromaticity, atomic mass, van der Waals radius, covalent radius, chirality, and number of implicit hydrogens. This set of atomic features provides a detailed representation of the ligand's chemical structure and properties, ensuring that the HeteroGNN effectively models the complex dynamics of protein-ligand interactions.
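One plausible way to derive Table 4.2-style atomic descriptors with RDKit is sketched below. The one-hot vocabulary is truncated and the integer codes stand in for the one-hot encodings used in the actual 79-dimensional featurization; this is an illustrative assumption rather than the HIPO-GNN implementation.

```python
from rdkit import Chem

PT = Chem.GetPeriodicTable()
ATOM_TYPES = ["C", "N", "O", "S", "F", "Cl", "Br", "P"]   # assumed (truncated) one-hot vocabulary

def atom_features(atom):
    """Numeric descriptors for one ligand atom, mirroring the Table 4.2 feature groups."""
    onehot = [int(atom.GetSymbol() == s) for s in ATOM_TYPES]
    return onehot + [
        atom.GetDegree(),                       # number of explicitly connected (heavy-atom) neighbors
        atom.GetFormalCharge(),
        int(atom.GetHybridization()),           # hybridization state as an integer code
        int(atom.IsInRing()),
        int(atom.GetIsAromatic()),
        atom.GetMass(),
        PT.GetRvdw(atom.GetAtomicNum()),        # van der Waals radius
        PT.GetRcovalent(atom.GetAtomicNum()),   # covalent radius
        int(atom.GetChiralTag()),
        atom.GetNumImplicitHs(),
    ]

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")          # aspirin as a stand-in ligand
node_feats = [atom_features(a) for a in mol.GetAtoms()]
print(len(node_feats), len(node_feats[0]))
```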
Edges in the constructed graph represent the inter-interactions between amino acids and atoms, as well as the intra-interactions among atoms within the ligand. For protein-ligand interactions, 332 edge features are extracted (Table 4.3), including interaction type, distance, angle, and other geometric properties. For ligand bonds, 10 features are derived (Table 4.4), such as bond type, conjugation, presence in rings, and stereochemistry. This graph construction process captures the protein-ligand complex's local and global structural information, providing a comprehensive representation for subsequent analysis. Lastly, Table 4.5 presents the edge formulation indices and their corresponding sizes, detailing the interactions between amino acids (aa) and atoms, as well as the bonds between atoms.

4.2.5.1 Model Architecture
The HeteroGNN architecture is designed to model the protein-drug interaction and perform a graph classification task. It uses separate convolution layers for amino acid-to-atom, atom-to-amino acid, and atom-to-atom interactions. Edge features are first processed through multi-layer perceptrons (MLPs) to capture interaction characteristics. Convolutional layers (i.e., NNConv) then update node features based on neighboring nodes and processed edge features, propagating information across the protein-ligand interface. A HeteroConv layer aggregates outputs from these convolutions, combining information from different interaction types (i.e., inter and intra). Linear projection layers then map atom and amino acid features to a common space. Global mean pooling is applied to create fixed-size representations for the ligand and protein. These pooled features are concatenated and passed through a classification network consisting of linear layers with ReLU activation and dropout. The network outputs a single value predicting the compound's inhibitor status. This architecture integrates heterogeneous information from protein-ligand complexes, considering both local interactions and global structure to predict inhibitory activity.

4.2.5.2 Data Splitting and Cross-Validation
The data splitting strategy aims to ensure robust evaluation and minimize data leakage. Ten percent of unique ligands are randomly selected as the test set, including all related docked poses. The remaining data are split into training (90%) and validation (10%) sets based on unique ligands. This process is repeated 20 times, creating 20 different train/validation/test splits for a more comprehensive assessment of model performance.

4.2.5.3 Training and Evaluation
The model is trained using the AdamW optimizer with weight decay. The OneCycleLR learning rate scheduler is used for adaptive learning rates. Mixed precision training improves computational efficiency. Binary cross-entropy with logits serves as the loss function. Early stopping, based on validation loss with a patience of 5 epochs, prevents overfitting. Each model trains for up to 100 epochs or until early stopping occurs. For each of the 20 data splits, a separate model is trained and evaluated on the corresponding test set. Multiple metrics are calculated: accuracy, precision, recall, F1 score, balanced accuracy, and AUC-ROC. This thorough evaluation provides a robust understanding of the model's predictive capabilities across various classification aspects.
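A minimal PyTorch sketch of the optimization setup in Section 4.2.5.3 (AdamW, OneCycleLR, BCEWithLogitsLoss, and early stopping with a patience of 5) is shown below. The model, data loaders, and hyperparameters are placeholders, not the actual HIPO-GNN implementation, and mixed precision is omitted for brevity.

```python
import copy
import torch
from torch import nn

def train_with_early_stopping(model, train_loader, val_loader, epochs=100, patience=5):
    """Training loop mirroring Section 4.2.5.3; model and loaders are assumed placeholders."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=len(train_loader))

    best_val, best_state, stale = float("inf"), None, 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x).squeeze(-1), y.float())   # single logit per graph/compound
            loss.backward()
            optimizer.step()
            scheduler.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x).squeeze(-1), y.float()).item()
                           for x, y in val_loader) / len(val_loader)

        if val_loss < best_val:                      # keep the best checkpoint seen so far
            best_val, best_state, stale = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:                    # early stopping after 5 stagnant epochs
                break

    model.load_state_dict(best_state)
    return model

if __name__ == "__main__":
    # Tiny synthetic usage example with a linear stand-in model.
    from torch.utils.data import DataLoader, TensorDataset
    X, y = torch.randn(64, 16), (torch.rand(64) > 0.5).float()
    loader = DataLoader(TensorDataset(X, y), batch_size=8)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
    train_with_early_stopping(model, loader, loader, epochs=3)
```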
Table 4.1 Node Features for Amino Acids.

Feature                    Type               Size
One-Hot Encoded Residue    Binary Vector      20
Molecular Weight           Numerical          1
Aromaticity                Numerical          1
Isoelectric Point          Numerical          1
Hydrophobicity             Numerical          1
Flexibility                Numerical          1
Secondary Structure        Numerical Vector   3

Table 4.2 Node Features for Atoms.

Feature                        Type            Size
One-Hot Encoded Atom Type      Binary Vector   43
Number of Heavy Neighbors      Binary Vector   6
Formal Charge                  Binary Vector   8
Hybridization Type             Binary Vector   7
In a Ring                      Binary          1
Aromaticity                    Binary          1
Atomic Mass                    Numerical       1
Van der Waals Radius           Numerical       1
Covalent Radius                Numerical       1
Chirality                      Binary Vector   4
Number of Implicit Hydrogens   Binary Vector   6

Table 4.3 Edge Features for Interactions (Amino Acid Interaction with Atom).

Feature               Description
Interaction Type      Hydrophobic, halogen, hydrogen bond, salt bridge, pi-cation, pi-stacking
Residue Type          One-hot encoded vector for protein residues involved in the interaction
Distance              Normalized distance between the interacting protein residue and ligand atom
Angle                 Normalized angle measurements (if applicable, e.g., in hydrogen bonds)
Donor/Acceptor Type   One-hot encoded vector for donor/acceptor roles (in halogen/hydrogen bonds)

Table 4.4 Edge Features for Intra-Interactions (Within Ligand Atoms).

Feature           Description
Bond Type         One-hot encoded vector for bond types (e.g., single, double, aromatic)
Conjugation       Binary for whether the bond is conjugated
Ring Status       Binary for whether the bond is part of a ring
Stereochemistry   One-hot encoded vector for stereochemistry (e.g., cis, trans)

Table 4.5 Edge Formulation.

Edge Index                         Size
('aa', 'interacts with', 'atom')   332
('atom', 'interacts with', 'aa')   332
('atom', 'bonded to', 'atom')      10

4.3 Results
To address the challenge of predicting OATP-mediated drug-drug interactions (DDIs), we developed a comprehensive computational approach that integrates molecular modeling, machine learning classifiers, and graph neural networks (GNNs). Our novel structure-based OATP-ligand interaction prediction platform, HIPO-GNN, combines ligand molecular features with protein-ligand interaction data to significantly enhance inhibition prediction accuracy. The study unfolds in several key stages. Initially, we employed molecular docking followed by the Protein-Ligand Interaction Profiler (PLIP) [30] to extract interaction features, generating protein-ligand interaction fingerprints (PLIFs). These PLIFs enabled comparisons between key residues interacting with inhibitors versus non-inhibitors, highlighting specific residues (F38, K41, N213, L378, G552, and V556) that show notable differences in their interactions. We then utilized these interaction features to build and optimize machine learning classifiers, exploring the effect of docked pose sampling on predictive performance. The culmination of this process was the development of HIPO-GNN, designed to capture intricate intra- and inter-molecular dynamics between drugs and proteins. HIPO-GNN significantly outperformed traditional ligand-based methods, achieving median F1 and AUC scores of 0.82 and 0.81, respectively, compared to ECFP (F1: 0.68, AUC: 0.70) and RDKit (F1: 0.78, AUC: 0.75) models built upon XGBoost. Additionally, we conducted a meta-analysis to underscore the critical importance of data quality in studying complex transporter proteins like OATPs.
This analysis revealed challenges such as discrepancies in experimental dataset labeling under different conditions and identified physicochemical features of drugs highly correlated with these discrepancies. We also demonstrated how the availability of crystal structures influences ML prediction outcomes, showing significant variations in results between proteins with and without these structures. By addressing these aspects, our integrated approach aims to enhance the prediction accuracy for OATP inhibition, provide more comprehensive insights into OATP-ligand interactions, and contribute to the design of safer and more effective medications.
4.3.1 Structure-Based Protein-Drug Interaction Prediction
This section explores the effectiveness of structure-based and drug-based features in predicting protein-drug interactions, specifically for the OATP1B1 subfamily, for which a crystal structure has been determined. We present key findings from our studies on molecular docking simulations and inhibition classification, emphasizing the performance of various feature representations (e.g., ECFP, PLIP, RDKit) in classification tasks. Additionally, our analysis identifies critical residues and interaction types that differentiate inhibitory drugs from non-inhibitors. The results indicate that structure-based features are most effective when high-quality crystal structures for both inward and outward conformers are available. The analysis of key residues and interaction types also shows promising results and agreement with recent findings about OATP transport. For instance, non-inhibitors exhibited approximately twice as many pi-stacking interactions with residues Y352 and F356 in the inward conformer compared to known inhibitors. This observation aligns with the findings of Shan et al. [20], highlighting the significance of Y352 and F356 in the binding and transport of known OATP1B1 substrates.
4.3.1.1 Molecular Docking Simulation Generated Novel OATP-Ligand Interaction Interfaces
Molecular docking simulations have identified novel interaction interfaces between OATP1B1 and various ligands. Utilizing high-quality cryo-EM structures of OATP1B1, we performed extensive docking simulations to model the binding poses of 222 experimentally characterized ligands with the inward-facing conformer of OATP1B1 [20]. From each docking pose, we extracted protein-ligand interaction features (PLIFs), highlighted in orange in Figure 4.1 (inset). These features encompass critical interaction details such as distance, angle, and the specific atoms involved, capturing the intricacies of potential interaction hotspots. The distribution of Rosetta interface binding energies for OATP1B1, OATP1B3, and OATP2B1 transporters can be found in the supplementary appendix (Figure 4C.1).
4.3.1.2 Interacting Residues in Docked Poses are Generally Consistent with Experimental Findings
To identify key residues that characteristically interact more often with inhibitors, the distribution of interactions per OATP1B1 residue was analyzed. This involved parsing the comprehensive Protein-Ligand Interaction Fingerprints (PLIFs) into a subset of interaction features, termed minimal PLIFs (mPLIFs). Specifically, mPLIFs encompass only the interaction types present at each OATP1B1 residue. Figure 4.1B (left) highlights the frequency of any interaction type occurring at each orthosteric site residue (Y352, F356, and F386) identified by Shan et al. [20].
Inhibitory ligands were observed to interact significantly more often with the phenylalanine residues in the orthosteric site than non-inhibitory ligands (p < 5 × 10⁻⁴ for F356 and p < 5 × 10⁻⁵ for F386).
Figure 4.1 Interaction probabilities of OATP1B1 residues with inhibitors and non-inhibitors. (A) The inward-facing conformer of OATP1B1; (B) the outward-facing conformer of OATP1B1. Each panel is divided to show interaction probabilities at the orthosteric site (left) and opportunistic sites (right). White bars represent non-inhibitors and gray bars denote inhibitors. Error bars indicate 95% confidence intervals. Statistical significance is marked by asterisks, where *p<0.05, **p<0.01, ***p<0.001, and ****p<0.0001, as determined by Mann-Whitney U-tests with Bonferroni correction. The inset structural model (right) illustrates E3S (green) docked to OATP1B1, with protein-ligand interaction fingerprints (PLIFs) highlighted in orange.
4.3.1.3 Structure-Based vs. Ligand-Based ML Predictions Perform Distinctly Across OATP Families
To obtain a holistic view of the performance of structure-based and drug-based features, a multilabel classification study was conducted using PLIP (structure-based), RDKit, and ECFP (both drug-based) features. This approach allows us to evaluate the overall effectiveness of each feature set in distinguishing protein-drug interactions (i.e., whether a given drug is an inhibitor of each transporter or not) simultaneously for all three OATP families. Figure 4.2 compares the distribution of F1 scores and AUC values for these feature sets. The results show that PLIP-based features achieved a relatively high median F1 score, suggesting they are effective in predicting true positive rates across multiple labels. RDKit features exhibited the highest median AUC, indicating a strong ability to distinguish between classes. The ECFP features displayed more variability in both F1 scores and AUC values, suggesting some inconsistency in performance. To further elucidate the performance differences across OATP subfamilies, we conducted a comparative analysis of structure-based and ligand-based prediction methods for each transporter type. As shown in Figure 4B.1, RDKit-based machine learning models consistently outperformed other approaches for OATP1B3 and OATP2B1, indicating that ligand-based features are particularly crucial for predicting inhibition in these subtypes. In contrast, OATP1B1 demonstrated superior performance with structure-based modeling approaches, suggesting that the availability of high-quality structural data for this transporter provides valuable insights into protein-ligand interaction dynamics. Notably, OATP2B1 posed the greatest challenge for prediction among the tested families in the Karlgren dataset, highlighting the complex nature of its interactions. These findings underscore the importance of tailoring computational approaches to the specific characteristics of each OATP subfamily and informed our decision to focus subsequent analyses on OATP1B1, leveraging its amenability to structure-based prediction methods for further enhancement of predictive performance.
4.3.2 GNN Model Captures Nuanced Protein-Ligand Interactions and Improves Inhibition Prediction Performance
We present the performance of our novel HIPO-GNN model compared to other state-of-the-art methods for predicting OATP inhibition.
Our results demonstrate that HIPO-GNN, which integrates both ligand features and protein-ligand interaction data, achieves superior performance in terms of F1 score, indicating a better balance between precision and recall in inhibition prediction.
Figure 4.2 PLIP-based features demonstrated the highest median F1 scores, while RDKit features exhibited the highest median AUC value. The dotted lines indicate statistically significant differences among the performances.
This improvement is particularly noteworthy given the complex nature of OATP-mediated drug interactions and the challenges associated with their prediction.
Figure 4.3 The box plots display the distribution of scores for ECFP, RDKit, PLIP, PLIP Vote, HIPO-GNN, and GNN-Ligand methods. HIPO-GNN shows superior F1 scores, indicating better overall prediction accuracy, while maintaining competitive AUC scores.
Post-analysis of these results reveals interesting insights. While HIPO-GNN demonstrates the highest median F1 score, its AUC score, though competitive, is slightly lower than that of the GNN-Ligand model. This discrepancy might be attributed to the additional noise introduced by the protein-ligand interaction features in HIPO-GNN, which are absent in the ligand-only GNN model. The interaction features, while providing valuable information about the binding dynamics, may also introduce some level of uncertainty due to the inherent variability in protein-ligand interactions and potential limitations in the molecular docking process. Nevertheless, the higher weighted F1 score suggests that HIPO-GNN's predictions are more reliable in predicting OATP inhibition across the dataset, taking into account the class distribution. This indicates that HIPO-GNN performs well on both common and rare cases of inhibition, providing a more comprehensive and practically useful model for drug-drug interaction predictions in real-world scenarios where class imbalance is typical.
4.3.3 From Noise to Knowledge
In this section, we aim to enhance the interpretability and knowledge gained from our OATP inhibition prediction model. We explored the physicochemical properties most closely associated with OATP inhibition, providing a mechanistic understanding of the inhibition process. Additionally, we introduce simple machine learning engineering techniques to maximize the signal obtained from our computational modeling, effectively separating meaningful patterns from noise. This comprehensive approach not only enhances our model's performance but also contributes to a deeper understanding of OATP inhibition mechanisms, addressing challenges in data consistency and improving the overall reliability of drug-drug interaction predictions for OATP-mediated transport.
4.3.3.1 Consideration of Multiple Docked Poses Provides Dynamic Insight at the Expense of Signal
For each ligand, Rosetta docking simulations generated 1,000 poses of varying quality. The Rosetta interface_delta_X energy score was employed as a quality metric to represent the binding interface energy. Figure 4.4A illustrates the distribution of interface energy scores for the most stable 100 poses for both inhibitors and noninhibitors docked to inward-facing OATP1B1.
These distributions indicated that the docked interfaces of noninhibitors to the inward-facing conformer are significantly more stable compared to those of inhibitors. To investigate the impact of including multiple poses on model performance, datasets were created containing 1, 2, 3, 5, 10, 20, 30, 50, 75, and 100 of the 'best' poses per ligand. The performance of the XGBoost model was observed to peak when approximately 30 poses per ligand were included in the PLIF dataset. Interestingly, mPLIF models displayed maximum performance with only two poses per ligand (Figure 4.4B). We hypothesized that removing nuanced, quantitative PLIFs might reduce noise and improve model performance, although this was not observed in our results. In fact, regardless of the number of poses included per ligand, XGBoost models trained on the comprehensive PLIF dataset generally outperformed those trained on the simplified mPLIFs (Figure 4.4B). The best-performing conditions (30 poses, full PLIFs) were selected for constructing the HIPO-GNN model. In addition to implications for predictive performance, expanding datasets to include multiple poses per ligand may broaden the dynamic perspective of transporter interactions. The residue-interaction distributions of our docking-derived PLIFs were evaluated for consistency with experimentally identified interaction sites. Our analyses were divided into three groups of OATP1B1 residues:
1. We investigated PLIFs involving the orthosteric site (Y352, F356, F386), which is consistently seen to be involved in OATP1B1 transport function [20].
2. We defined a second category of residues from known opportunistic sites, groups of residues important for the binding and/or transport of only certain ligands. Our analyses considered a broadly encompassing set of residues present in opportunistic sites of simeprevir, estrone-3-sulfate, bilirubin, and 2',7'-dichlorofluorescein (DCF). We refer to this grouping as the 'opportunistic sites' for brevity.
3. Lastly, we investigated PLIFs involving residues previously unrecognized in either the orthosteric or opportunistic sites.
As expected, increasing the number of poses included in our datasets resulted in a greater diversity of OATP1B1 residues captured in the PLIFs of each ligand. Figures 4.4C–E depict this coverage: for each ligand, the fraction of residues from the (a) orthosteric site, (b) opportunistic sites, and (c) remaining residues that were present across all n poses.
4.3.3.2 Physicochemical Feature Importance Analysis for OATP Inhibition
Building upon our previous identification of key protein residues involved in OATP inhibition (Figure 4.1), this section focuses on the physicochemical properties of drugs that contribute to this inhibition.
Figure 4.4 Multiple Docking Pose Effect on ML Performance and OATP Inhibition Mechanism Insight. (A) Distribution of Rosetta binding interface energy scores (interface_delta_X, in Rosetta energy units) for 100 poses of each ligand docked to the inward-facing OATP1B1. Experimentally characterized noninhibitors are shown in green, and inhibitors in red. Dashed lines indicate the 30th percentile, i.e., the 'best' 30 poses. (B) Area under the curve (AUC) and F1 scores plotted for PLIF and mPLIF feature sets for a variable number of poses per ligand. Scores shown are for models incorporating majority voting.
(C–E) Fraction of unique residues from the orthosteric site, opportunistic sites, and remaining protein sequence captured in PLIFs as a function of the number of poses. Shaded regions indicate 95% confidence intervals in panels B–E.
Understanding both protein-ligand interactions and drug characteristics is crucial for a detailed view of OATP-mediated drug-drug interactions. While achieving strong predictive performance is important, the interpretability of our models offers critical insights for advancing drug development and ensuring safety assessments. Our analysis in Figure 4.5 reveals the key physicochemical properties that influence OATP inhibition across multiple subtypes and specifically for OATP1B1. These findings not only enhance our mechanistic understanding but also offer practical guidance for rational drug design.
Figure 4.5 Feature importance analysis for OATP inhibition prediction. (A) Top features contributing to the inhibition prediction across OATP1B1, OATP2B1, and OATP1B3, categorized by physicochemical properties. The model classifies compounds as inhibitors only if they inhibit all three OATP subtypes. (B) Specific features important for OATP1B1 inhibition prediction. Feature importance is quantified based on the model's reliance on each feature for predictions. Color-coding represents different categories of physicochemical properties. Key features include molecular weight (HeavyAtomMolWt), molar refractivity (MolMR), partial charge distributions (PEOE_VSA2), lipophilicity (SlogP_VSA1/5/6), and molecular shape descriptors (SMR_VSA1/3/7/10, Kappa1, Chi3n).
4.3.4 Discussion
Understanding the transport mechanisms of OATPs is particularly challenging due to their membrane-bound nature and dynamic conformational changes. Unlike soluble proteins, OATPs are embedded in cellular membranes, complicating their study. Their polyspecificity (i.e., the ability to interact with a wide range of substrates) exacerbates this challenge, as it demands detailed knowledge of diverse and often overlapping binding interactions. The hydrophobic nature of OATPs makes them difficult to crystallize, resulting in a scarcity of high-resolution structural data, which impedes the development of accurate models for predicting drug interactions. This lack of structural information, combined with their significant flexibility and propensity for instability, presents challenges throughout various stages of research, including expression, solubilization, purification, and structural elucidation. Consequently, the principles governing OATP drug transport and inhibition mechanisms are not fully understood. This necessitates resource-intensive in vitro and in vivo assays and highlights the need for advanced computational methods to fill these gaps. Machine learning (ML) techniques have been extensively applied to predict interactions between drugs and OATPs and to provide transport insights. Researchers have made progress in identifying and predicting OATP-mediated drug transport and inhibition mechanisms by applying distinct ML algorithms and encoding various features of drug-protein interactions into ML platforms. Khuri et al. [32] employed a Random Forest classifier to screen a virtual library of over 5000 compounds, identifying potential OATP2B1 inhibitors. The descriptors for the ligands encompassed physicochemical and topological properties of the drugs, capturing the essence of their interaction capabilities with the transporter.
To enhance the precision of their initial findings, they proceeded to evaluate the shortlisted compounds via docking against several comparative protein structure models of OATP2B1. This docking process served as a subsequent step to further refine and filter the list of candidates. Among the ten drugs selected for experimental validation, three were confirmed as potent inhibitors of OATP2B1-mediated E3S transport. In another study, De Bruyn et al. [33] incorporated protein sequence physicochemical features and ligand circular fingerprints (ECFP6) into a random forest classifier to predict OATP1B1/3 inhibitors. The trained model, prospectively validated on 54 compounds not previously included in the original library, predicted their inhibitory potential against OATP1B1 and OATP1B3, with 80% and 74% of the compounds correctly identified for their inhibition effects on OATP1B1 and OATP1B3, respectively. Subsequent studies employed multiple strategies such as logistic regression, SVC, XGBoost, and ensemble models. These approaches leveraged varied representations of features, incorporating extensive protein sequence data (such as amino acid compositions and sequence order features) along with ligand attributes (e.g., ECFP and RDKit fingerprints). This expanded feature set facilitated the early detection and pharmacokinetic profiling of OATP-drug interactions [34]. Recently, Hao Duan et al. [35] presented a more robust approach for predicting transporter inhibitors using both conventional machine learning and multi-task deep learning models. The study leverages a comprehensive dataset curated from multiple sources (i.e., Karlgren, ChEMBL, De Bruyn; 12 papers and 6 databases in total), achieving high predictive performance, particularly with the MLT-GAT model. However, while the dataset curation is extensive, the study faces challenges with data imbalance and the need for higher-quality data to further enhance model accuracy and reliability. This work highlights the high potential and current limitations of computational methods in transporter-drug interaction prediction. While influential, these studies face several limitations. Notably, the inconsistency in public datasets regarding the classification of drugs as inhibitors or substrates leads to conflicting definitions and, in turn, misleading labels across ML implementations. Additionally, most studies concentrate solely on drug attributes, which, while crucial, may not fully capture the nuances necessary for accurate predictions of interactions with complex and dynamic proteins. The lack of protein structural information in ML models limits the ability to comprehensively incorporate features from both drugs and proteins. Given the complexity of the system, a more sophisticated modeling approach is required to address the challenges posed by the structural variability of proteins and the intricate nature of drug-protein interactions. This study demonstrates the value of integrating structural information and advanced machine learning techniques in predicting OATP-mediated drug interactions. OATPs' crucial role in hepatic drug uptake, combined with the challenges of their membrane-bound nature and limited structural data, necessitates innovative approaches. Our structure-based approach leverages protein-ligand interaction features and heterogeneous graph neural networks to address these challenges by effectively capturing the complex dynamics of OATP-ligand interactions.
The HIPO-GNN model's improved predictive performance, particularly in F1 score, highlights the benefit of incorporating both ligand and protein structural information. Our analysis of feature importance and dataset quality provides key insights for drug development strategies. This work enhances our ability to predict and understand OATP inhibition mechanisms, contributing to more efficient drug design and improved patient safety through better prediction of drug-drug interactions. Looking ahead, the synergy between computational models and experimental techniques promises to unlock new frontiers in OATP research. By iteratively refining our models with emerging structural and functional data, we can anticipate more accurate predictions, leading to tailored drug designs that minimize unwanted interactions and optimize therapeutic outcomes.
BIBLIOGRAPHY
[1] Innocent G Asiimwe and Munir Pirmohamed. Drug–drug–gene interactions in cardiovascular medicine. Pharmacogenomics and Personalized Medicine, pages 879–911, 2022.
[2] A Kalliokoski and M Niemi. Impact of oatp transporters on pharmacokinetics. British Journal of Pharmacology, 158(3):693–705, 2009.
[3] Kenta Yoshida, Cen Guo, and Rucha Sane. Quantitative prediction of oatp-mediated drug-drug interactions with model-based analysis of endogenous biomarker kinetics. CPT: Pharmacometrics & Systems Pharmacology, 7(8):517–524, 2018.
[4] Alžběta Türková and Barbara Zdrazil. Current advances in studying clinically relevant transporters of the solute carrier (slc) family by connecting computational modeling and data science. Computational and Structural Biotechnology Journal, 17:390–405, 2019.
[5] Akash Khandelwal, Matthew D Krasowski, Erica J Reschly, Michael W Sinz, Peter W Swaan, and Sean Ekins. Machine learning methods and docking for predicting human pregnane x receptor activation. Chemical Research in Toxicology, 21(7):1457–1467, 2008.
[6] Lei Zhang, Yuanchao Zhang, and Shiew-Mei Huang. Scientific and regulatory perspectives on metabolizing enzyme-transporter interplay and its role in drug interactions: challenges in predicting drug interactions. Molecular Pharmaceutics, 6(6):1766–1774, 2009.
[7] Maria Karlgren, Anna Vildhede, Ulf Norinder, Jacek R Wisniewski, Emi Kimoto, Yurong Lai, Ulf Haglund, and Per Artursson. Classification of inhibitors of hepatic organic anion transporting polypeptides (oatps): influence of protein expression on drug–drug interactions. Journal of Medicinal Chemistry, 55(10):4740–4763, 2012.
[8] Mikko Niemi, Marja K Pasanen, and Pertti J Neuvonen. Organic anion transporting polypeptide 1b1: a genetically polymorphic transporter of major importance for hepatic drug uptake. Pharmacological Reviews, 63(1):157–181, 2011.
[9] Savannah J McFeely, Tasha K Ritchie, and Isabelle Ragueneau-Majlessi. Variability in in vitro oatp 1b1/1b3 inhibition data: Impact of incubation conditions on variability and subsequent drug interaction predictions. Clinical and Translational Science, 13(1):47–52, 2020.
[10] Y Tomita, K Maeda, and Y Sugiyama. Ethnic variability in the plasma exposures of oatp1b1 substrates such as hmg-coa reductase inhibitors: a kinetic consideration of its mechanism. Clinical Pharmacology & Therapeutics, 94(1):37–51, 2013.
[11] Mikko Niemi, Janne T Backman, Lauri I Kajosaari, Julian B Leathart, Mikko Neuvonen, Ann K Daly, Michel Eichelbaum, Kari T Kivistö, and Pertti J Neuvonen. Polymorphic organic anion transporting polypeptide 1b1 is a major determinant of repaglinide pharmacokinetics.
Clinical Pharmacology & Therapeutics, 77(6):468–478, 2005.
[12] Anne T Nies, Mikko Niemi, Oliver Burk, Stefan Winter, Ulrich M Zanger, Bruno Stieger, Matthias Schwab, and Elke Schaeffeler. Genetics is a major determinant of expression of the human hepatic uptake transporter oatp1b1, but not of oatp1b3 and oatp2b1. Genome Medicine, 5:1–11, 2013.
[13] Nicholas A Meanwell. Improving drug candidates by design: a focus on physicochemical properties as a means of improving compound disposition and safety. Chemical Research in Toxicology, 24(9):1420–1456, 2011.
[14] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
[15] Massoud Saidijam, Fatemeh Karimi Dermani, Sareh Sohrabi, and Simon G Patching. Efflux proteins at the blood–brain barrier: review and bioinformatics analysis. Xenobiotica, 48(5):506–532, 2018.
[16] Anca-Denise Ciută, Kamil Nosol, Julia Kowal, Somnath Mukherjee, Ana S Ramírez, Bruno Stieger, Anthony A Kossiakoff, and Kaspar P Locher. Structure of human drug transporters oatp1b1 and oatp1b3. Nature Communications, 14(1):5774, 2023.
[17] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. Pubchem 2023 update. Nucleic Acids Research, 51(D1):D1373–D1380, 2023.
[18] Noel M O'Boyle, Michael Banck, Craig A James, Chris Morley, Tim Vandermeersch, and Geoffrey R Hutchison. Open babel: An open chemical toolbox. Journal of Cheminformatics, 3:1–14, 2011.
[19] Amos Bairoch, Rolf Apweiler, Cathy H. Wu, Winona C. Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, Maria J. Martin, Darren A. Natale, Claire O'Donovan, Nicole Redaschi, and Lai-Su L. Yeh. The Universal Protein Resource (UniProt). Nucleic Acids Research, 33(Suppl 1):D154–D159, 2005.
[20] Ziyang Shan, Xuemei Yang, Huihui Liu, Yafei Yuan, Yuan Xiao, Jing Nan, Wei Zhang, Wenqi Song, Jufang Wang, Feiwen Wei, et al. Cryo-em structures of human organic anion transporting polypeptide oatp1b1. Cell Research, 33(12):940–951, 2023.
[21] Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic Acids Research, 28(1):235–242, 2000.
[22] Serena Rosignoli and Alessandro Paiardini. Boosting the full potential of pymol with structural biology plugins. Biomolecules, 12(12):1764, 2022.
[23] Patrick Conway, Michael D Tyka, Frank DiMaio, David E Konerding, and David Baker. Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Science, 23(1):47–55, 2014.
[24] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
[25] Jeffrey Skolnick, Mu Gao, Hongyi Zhou, and Suresh Singh. Alphafold 2: why it works and its implications for understanding the relationships of protein sequence, structure, and function. Journal of Chemical Information and Modeling, 61(10):4827–4831, 2021.
[26] D Sala, F Engelberger, HS Mchaourab, and J Meiler. Modeling conformational states of proteins with alphafold. Current Opinion in Structural Biology, 81:102645, 2023.
[27] Jens Meiler and David Baker. Rosettaligand: Protein–small molecule docking with full side-chain flexibility. Proteins: Structure, Function, and Bioinformatics, 65(3):538–548, 2006.
[28] Shannon T Smith and Jens Meiler. Assessing multiple score functions in rosetta for drug discovery. PLoS One, 15(10):e0240450, 2020.
[29] Greg Landrum. Rdkit documentation. Release, 1(1-79):4, 2013.
[30] Melissa F Adasme, Katja L Linnemann, Sarah Naomi Bolz, Florian Kaiser, Sebastian Salentin, V Joachim Haupt, and Michael Schroeder. Plip 2021: Expanding the scope of the protein–ligand interaction profiler to dna and rna. Nucleic Acids Research, 49(W1):W530–W534, 2021.
[31] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. pages 2623–2631, 2019.
[32] Natalia Khuri, Arik A Zur, Matthias B Wittwer, Lawrence Lin, Sook Wah Yee, Andrej Sali, and Kathleen M Giacomini. Computational discovery and experimental validation of inhibitors of the human intestinal transporter oatp2b1. Journal of Chemical Information and Modeling, 57(6):1402–1413, 2017.
[33] Tom De Bruyn, Gerard J P van Westen, Adriaan P Ijzerman, Bruno Stieger, Peter de Witte, Patrick F Augustijns, and Pieter P Annaert. Structure-based identification of oatp1b1/3 inhibitors. Molecular Pharmacology, 83(6):1257–1267, June 2013.
[34] Thomas R. Lane, Fabio Urbina, Xiaohong Zhang, Margret Fye, Jacob Gerlach, Stephen H. Wright, and Sean Ekins. Machine learning models identify new inhibitors for human oatp1b1. Molecular Pharmaceutics, 19(11):4320–4332, 2022. PMID: 36269563.
[35] Hao Duan, Chaofeng Lou, Yaxin Gu, Yimeng Wang, Weihua Li, Guixia Liu, and Yun Tang. In silico prediction of inhibitors for multiple transporters via machine learning methods. Molecular Informatics, 43(3):e202300270, 2024.
APPENDIX 4A
LIGAND-PROTEIN DOCKING WORKFLOW
In this study, we leverage protein-ligand docking simulations to inform machine learning models on accurately classifying OATP ligands as either inhibitors or non-inhibitors. The docking workflow began with selecting the protein and ligand to dock. Optionally, the CASTp server was used to identify potential binding sites on the protein. The workflow required organizing files in a specific manner and utilizing Python scripts designed for this structure. The initial script prepared the necessary files, including a Params file for ligand processing instructions and a concatenated file of the protein and ligand structures. The subsequent script generated the docking job and an XML file with docking parameters. Customizations in the XML file included adjustments to the docking starting coordinates, the box size for ligand movement, and the scoring grid for residue analysis. The options file was edited to define file locations for Rosetta, including the concatenated files, Params file, and XML file, along with the docking iterations count. Docking jobs were executed from a designated directory, with outputs stored in a specified output folder. Post-docking, a script was used to extract the top docking results. The approach was designed for high-throughput processing of multiple proteins and ligands, ensuring file paths and naming conventions were consistent with the established directory structure. This protocol enabled effective protein-ligand docking using Rosetta for multiple entities (an illustrative scripting sketch is given below).
Figure 4A.1 Docking Workflow for Drug Inhibition Prediction.
Figure 4A.2 ML Workflow for Drug Inhibition Prediction.
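To make the batch-processing aspect of this workflow concrete, the sketch below shows how the per-complex docking jobs might be assembled and launched from Python. It is an illustrative sketch only, not the dissertation's actual scripts: the executable name, directory layout, file-naming convention, and the specific set of Rosetta flags are assumptions and would need to be adapted to the local Rosetta installation and the XML protocol described above.

```python
# Illustrative sketch (not the actual workflow scripts): batch RosettaLigand-style
# docking over multiple protein-ligand pairs. The executable name, flags, paths,
# and naming convention below are placeholders/assumptions for illustration.
import subprocess
from pathlib import Path

ROSETTA_BIN = "rosetta_scripts.default.linuxgccrelease"  # assumed executable name
PROTOCOL_XML = "dock.xml"   # XML file with docking parameters (see text)
N_POSES = 1000              # docking iterations per ligand

def dock(complex_pdb: Path, params_file: Path, out_dir: Path) -> None:
    """Run one docking job for a concatenated protein+ligand PDB file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = [
        ROSETTA_BIN,
        "-in:file:s", str(complex_pdb),     # concatenated protein+ligand structure
        "-extra_res_fa", str(params_file),  # ligand Params file
        "-parser:protocol", PROTOCOL_XML,   # docking protocol / scoring grid setup
        "-nstruct", str(N_POSES),           # number of docked poses to generate
        "-out:path:all", str(out_dir),      # where score files and poses are written
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    inputs = Path("docking_inputs")         # assumed directory layout
    for complex_pdb in sorted(inputs.glob("*_complex.pdb")):
        params = complex_pdb.with_name(
            complex_pdb.stem.replace("_complex", "") + ".params")
        dock(complex_pdb, params, Path("output") / complex_pdb.stem)
```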
APPENDIX 4B
COMPARATIVE ANALYSIS BASED ON STRUCTURE AND LIGAND PREDICTIONS ACROSS OATP SUBFAMILIES
Figure 4B.1 Comparative performance of machine learning models across OATP subfamilies. RDKit-based predictions consistently outperform other models for OATP1B3 and OATP2B1, while OATP1B1 shows superior performance with structure-based modeling approaches.
These results provide valuable insights into the relative effectiveness of structure-based and ligand-based approaches for predicting inhibition across different OATP subfamilies. The analysis reveals distinct patterns in predictive performance, highlighting the importance of tailored methods for each OATP subtype. For OATP1B3 and OATP2B1, ligand information emerges as the key determinant in inhibition prediction. The consistently superior performance of RDKit-based machine learning models for these subtypes underscores the significance of ligand physicochemical properties and molecular descriptors in capturing the essence of their interactions with these transporters. In contrast, OATP1B1 exhibits a unique behavior, demonstrating enhanced predictive performance when structure-based methods are employed. This suggests that the availability of high-quality structural data for OATP1B1 provides crucial insights into the protein-ligand interaction dynamics that are not fully captured by ligand-based approaches alone. Notably, the analysis reveals that among the tested OATP families in the Karlgren dataset, OATP2B1 posed the most significant challenge for machine learning prediction. This observation highlights the complex nature of OATP2B1 interactions and suggests that further refinement of predictive models or the incorporation of additional data types may be necessary to improve accuracy for this subtype. The comparative performance across OATP subfamilies, as illustrated in Figure 4B.1, provides a clear visualization of these trends. The consistent outperformance of RDKit-based models for OATP1B3 and OATP2B1 is evident, while the superior performance of structure-based modeling for OATP1B1 is distinctly visible. Based on these findings, the study's focus naturally gravitated towards OATP1B1, leveraging its amenability to structure-based prediction methods. This strategic shift paved the way for the development and implementation of novel approaches aimed at further enhancing the predictive performance for OATP1B1 inhibition. These results not only provide valuable insights into the differential predictive requirements of OATP subfamilies but also underscore the importance of tailoring computational approaches to the specific characteristics of each transporter. Such nuanced understanding is crucial for advancing our ability to accurately predict OATP-mediated drug interactions and ultimately contribute to more effective and safer drug development processes.
APPENDIX 4C
DISTRIBUTION OF ROSETTA INTERFACE BINDING ENERGIES FOR OATP TRANSPORTERS
Figure 4C.1 Distribution of Rosetta interface binding energies for OATP1B1, OATP1B3, and OATP2B1 transporters. The histograms show the binding energies for the inward (blue) and outward (orange) conformers. The x-axis represents the Rosetta interface binding energy, while the y-axis represents the count of interactions. Data indicate a broader distribution for the inward conformer of OATP1B1, a noticeable difference between the conformers of OATP1B3, and a symmetric distribution for OATP2B1, suggesting diverse binding affinities and interaction patterns.
Compared with AlphaFold-generated structures, the experimentally solved structures exhibit greater diversity and wider distributions in Rosetta energy units (REU).
APPENDIX 4D
DISTRIBUTION OF OATP1B1 RESIDUE INTERACTIONS WITH INHIBITORS AND NONINHIBITORS
Figure 4D.1 Distribution of OATP1B1 residue involvement in interactions with inhibitors (blue) and noninhibitors (orange), as computed from PLIFs of the 'best' 30 poses for each OATP conformer-ligand pair. Distributions presented are per interaction type: hydrophobic, hydrogen bonding, halogen bonding, pi-stacking, and pi-cation interactions (A–E). Inward-facing conformer interactions (left) are notably more frequent with noninhibitors, whereas the outward-facing conformer (right) is observed to have a greater frequency of inhibitor interactions. Especially notable, known noninhibitors were observed to have roughly twice as many pi-stacking interactions with each of Y352 and F356 in the inward conformer when compared to the known inhibitors in the dataset. This agrees with the findings of Shan et al. (2023) [20] regarding the importance of Y352 and F356 in the binding and transport of known OATP1B1 substrates.
CHAPTER 5
GENERATIVE MODELS FOR PROTEIN SEQUENCE MODELING: RECENT ADVANCES AND FUTURE DIRECTIONS¹
Abstract
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underscores the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g., variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) guiding how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences, and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope region and subcellular localization prediction to the generation of high-fitness sequences and protein design rules). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to offer structured guidance and a robust framework for delivering a prospective outlook on the ML-driven protein engineering field.
Keywords: Generative Machine Learning (ML) Models, Protein Engineering, Generative Adversarial Neural Networks (GANs), Variational Autoencoders (VAE), NLP, Diffusion Models
¹This chapter is adapted from content published in Briefings in Bioinformatics, Oxford Academic. All rights reserved. For more information, visit https://doi.org/10.1093/bib/bbad358.
5.1 Introduction
Proteins, as genetically encoded macromolecules, play a pivotal role in regulating biological systems. Their diverse sizes and chemical compositions empower them with a wide range of functionalities.
Consequently, engineered proteins with tailored functions find applications across various fields, including cosmetics and environmental bioremediation. Engineered proteins can be optimized to target disease biomarkers for the early detection of cancer, Alzheimer's and inflammatory diseases [1, 2, 3, 4, 5]. Protein-based therapeutics are another significant application; for example, serum therapy with therapeutic antibodies began more than a century ago and has evolved alongside scientific advancements. In addition, protein engineering is a practical tool to address environmental issues resulting from industrialization [6, 7, 8, 9]. For example, heavy metal protein binders displayed on the bacterial surface are capable of remediating environments contaminated with heavy metals [7, 10]. Despite the promise of protein engineering to revolutionize medicine and industry, discovering proteins with desired functionalities is exceedingly challenging. Within the astronomical number of ways to build a protein (i.e., unique protein sequences), the vast majority lack function entirely. Moreover, making a random change to a functional protein is typically detrimental to its function and stability. This highlights a need for improved strategies in obtaining novel proteins with favorable properties such as high binding affinity and desired developability [11, 12]. While previous strategies such as energy-based scoring [13] and evolutionary methods [14] are still informative, they have drawbacks, including inaccurate modeling and search strategy inefficiency. Recent advancements in computational methods and the rapidly growing availability of protein sequence data have facilitated the use of new, data-driven approaches for protein design and engineering [15, 16]. ML techniques may offer a promising route for navigating the high-dimensional landscape of protein design and engineering. They have shown a high success rate in various domains such as processing text, images and audio. In theory-driven approaches, the researcher obtains the domain knowledge of the problem that needs to be solved and produces mathematical models to capture the attributes and physics of the study. In contrast, ML methods are mainly centered on modeling the observed data, while previous knowledge and theory may also be infused into the model. To assess how well a dataset is modeled via ML, a loss function is defined that measures the difference between the model's prediction and the real data. It is then optimized so that the prediction is close to reality (i.e., the difference between the model and actual data is minimized). Training a model involves fitting the model parameters to optimize the loss function. A trained ML model, therefore, is useful in understanding the given data and aiding in future decisions by identifying trends, predicting outcomes and recognizing anomalies. Machine learning models can be classified into two types: discriminative and generative. Discriminative models are ML models used to predict the conditional probability of labels based on the given data features. In contrast, generative models aim to discover how the data is constructed by estimating the joint probability distributions of features and corresponding labels. Deep learning (DL), a subset of ML that uses neural networks for training, is particularly well-suited for complex domains since it can extract high-level features (i.e., features that cannot be interpreted by humans) from the given dataset [17].
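As a concrete (toy) illustration of the workflow just described, the sketch below vectorizes two short protein sequences with one-hot encoding (a representation discussed further in the next section), defines a loss, and fits the parameters of a small discriminative model by minimizing that loss. The sequences, labels, and model size are placeholders chosen only to make the steps explicit.

```python
# Minimal sketch of the workflow described above: vectorize protein sequences
# (one-hot encoding), define a loss, and fit a small discriminative model that
# predicts a label from the encoded sequence. Data and sizes are toy assumptions.
import torch
import torch.nn.functional as F
from torch import nn

AA = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq: str) -> torch.Tensor:
    # [length, 20] binary matrix, flattened into a single feature vector.
    idx = torch.tensor([AA.index(a) for a in seq])
    return F.one_hot(idx, num_classes=len(AA)).float().flatten()

# Toy dataset: two fixed-length sequences with binary fitness labels.
X = torch.stack([one_hot("MKTAYIAKQR"), one_hot("MKTAYIAKQW")])
y = torch.tensor([1.0, 0.0])

model = nn.Linear(X.shape[1], 1)                  # a simple discriminative model
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(100):                              # fit parameters by minimizing the loss
    loss = F.binary_cross_entropy_with_logits(model(X).squeeze(-1), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
print(float(loss))                                # loss shrinks as predictions approach labels
```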
Sparse high-fitness variants can be efficiently sampled from the vast, rugged protein fitness landscape using DL techniques that implement statistical and probabilistic models [18]. For sequence-function mapping, protein sequences can be vectorized (e.g., with one-hot encodings or embeddings) to obtain input representations that are compatible with ML algorithms. Therefore, due to both high demand and compatibility, various ML models have been applied to predict protein binding affinity [19], thermostability [20], developability [12], solubility [21] and stability [22, 23]. ML-driven predictive models have shown remarkable success in various applications compared to wet lab methods like directed evolution and traditional computational methods like rational design. However, the incorporation of natural language processing (NLP) techniques and generative models in protein engineering has led to a revolution in the field by improving prediction accuracy, reducing data requirements and enabling the generation of novel and functional proteins. Exploiting NLP techniques has been made feasible by changing the perspective about proteins and finding similarities between human language and protein sequences. For example, both feature an alphabet (20 amino acids in the case of proteins), have hierarchy in their organization, and evolve over time [24]. TAPE [25], UniRep [26], ESM [27] and ProtTrans [28] are among the many successful studies that have applied NLP techniques to learn the dependencies in protein sequences and to numerically represent them in fixed vector formats (i.e., embeddings) that are rich in semantic and syntactic information. Generative models (based on NLP or purely statistical methods) are also used in different tasks, such as improved representation of protein sequences and generation of protein sequences unseen in nature. BioSeqVAE [29], ProtGPT2 [30], ProteinGAN [31] and RFdiffusion [31] are examples that generated de novo proteins via different model architectures, namely variational autoencoders, transformers, adversarial neural networks and diffusion models, respectively. Here, we provide a systematic overview of promising neural networks applicable to protein sequences. For each architecture, we introduce core mathematical details of the model before describing the implementation of these models towards protein engineering tasks. For each model type, we also provide commentary through selected case studies to describe practical considerations for integrating ML towards protein applications and to illustrate how specific model architectures can be advantageous for a given protein engineering task. The first type to be discussed is language models, which have several features that make them strong candidates for protein sequence data. These techniques can handle variable sequence lengths, and they can track long-term dependencies while maintaining the order of tokens (e.g., amino acid positions). We start with recurrent neural networks (RNNs), then introduce the self-attention mechanism (whereby the model learns to selectively focus on important positions of the input sequence), and finally dive into transformers and their variants. After discussing transformers, we examine three other generative models in detail: variational autoencoders (VAEs) [32], generative adversarial neural networks (GANs) [33] and diffusion models [34].
While the VAE takes a probabilistic approach to learn the training data distribution, GANs use two competing neural networks to generate realistic samples. Diffusion models take a different approach by progressively adding noise to the data until it reaches the prior distribution; new samples can then be generated in the reverse diffusion process. Finally, we discuss the current challenges and future perspectives in applying such models to protein engineering. Figure 5.1 represents the overview of models and applications that are discussed in detail in this study. With this information, we hope researchers are better prepared to integrate these technologies into future investigations.
Figure 5.1 A diverse set of protein engineering applications benefit from the generative and discriminative potential of sequence models. These applications include stability, solubility, bioluminescence, binding capacity, phylogeny, gene ontology and protein localization. The schematics of the sequence models are represented here, but their detailed descriptions are elaborated in their corresponding sections. The autoregressive model forecasts future values based on the previous values in the time-series data. The VAE is a probabilistic modeling architecture that contains an encoder (E) and a decoder (D): the encoder compresses high-dimensional input data and the decoder reconstructs the data from the hidden dimension. This architecture, along with variational inference techniques, leads to learning the given data distribution and generating novel instances. In GANs, G represents a generative model that aims to generate realistic data from noise input, and D is a discriminator that acts as a critic to distinguish the real data from model-generated data. Diffusion models are a relatively new class of generative models that generate novel samples from a state of maximum randomness (at point x_T), which is produced through the iterative addition of random noise to the data distribution. These models have been demonstrated in diverse applications ranging from antibody binding prediction to protein localization prediction, in addition to novel protein sequence generation tasks.
5.2 Sequence Forecasting – RNNs & ARs
Given that proteins consist of linear chains of discrete amino acids, an amino acid sequence can be treated as time-series data, with each amino acid acting as a discrete data point for each position (i.e., time point). Because of this attribute, models that forecast future values using past values can be used to sequentially generate amino acids based on previous amino acids in the sequence chain. In this section, we discuss two major models for sequence forecasting: recurrent neural networks (RNNs) and autoregressive models (ARs). RNNs are a popular architecture in natural language processing and speech recognition because they hold 'memory': internal states store information from the past and are updated with every new token processed. There are different types of RNNs, including one-to-one, one-to-many, many-to-one, and many-to-many [35]. For protein sequence models, one-to-many RNNs are suitable for sequence forecasting. By giving a starting token to initialize the RNN, the trained model can sequentially produce tokens at the current time step using the output token from the previous time step (Figure 5.2A). Despite being an architecture well suited for sequential data, some disadvantages of RNNs need to be addressed.
During prediction tasks on a given token, RNNs are not able to learn any effects of subsequent parts of the sequence because RNNs process sequences unidirectionally (usually from left to right). This can hurt predictive power because the interactions between amino acids within a protein occur in three-dimensional space. Bidirectional RNNs (BRNNs) address the issue of unidirectionality and improve prediction performance using two connected layers: one layer processes sequences in the forward direction and another layer processes sequences in the backward direction [36]. In this way, sequence information is learned from both directions. BRNNs have also been successfully applied to unsupervised tasks by enabling their probabilistic interpretation to reconstruct missing values [37]. Optimizing RNNs while tracking long-term dependencies within sequences can be challenging. Due to the repeated weight matrix multiplication, the weights from early parts of the sequence have a progressively lower influence on the final representation relative to the later parts. The repeated multiplication of small weights results in even smaller values, causing the vanishing gradient problem, in which the gradient eventually reaches a drastically small value. Therefore, RNNs tend to generate biased parameters that capture short-term dependencies, especially when dealing with long sequences. To enable long-term memory, Hochreiter and Schmidhuber introduced Long Short-Term Memory (LSTM) [37], and Chung and colleagues introduced Gated Recurrent Units (GRU) [38] (Figure 5.2B). Both networks have a gated cell that contains not only multiplication but also addition operations of sigmoid and hyperbolic tangent (tanh) functions to regulate the information flow. The sigmoid activation function, which outputs values between 0 and 1, serves as a gate to keep relevant and discard irrelevant information. The information is pertinent for prediction when values are close to 1, and it is carried forward to future time steps. The tanh activation function regulates values to prevent vanishing and exploding gradients by limiting outputs between −1 and 1.
Figure 5.2 The architecture of generative recurrent neural networks versus the autoregressive model. (A) An autoregressive model (AR) has a similar structure to a recurrent neural network (RNN). However, while the RNN relies only on the current input and its most recent hidden state, the AR utilizes the actual inputs from previous time steps as well as the current time step to predict the next token. (B) Two important RNN architectures for resolving vanishing gradient problems when training on sequence data are the LSTM and the GRU. These networks contain gates to control the information flow. The LSTM contains three gates (input gate, forget gate and output gate) and the GRU contains two gates (reset gate and update gate). Note that C indicates the cell state, and h is the hidden state in the shown architectures.
RNNs and their variants have been implemented in protein design tasks for both generative and discriminative applications. For example, a generative LSTM-based model was trained to design de novo antimicrobial peptides [39]. The generated sequences elicited higher antimicrobial activity than randomly mutated peptides. In another study, LSTM units were implemented to generate antibody sequences with well-correlated negative log-likelihood and more than 100-fold affinity maturation [40]. Interestingly, an LSTM model trained on only the φ and ψ angles of each residue enabled helical protein design [41].
The authors demonstrated that dihedral angles are adequate features to design protein backbones without considering the amino acids in a sequence. In another study, for predicting protein secondary structures, bidirectional recurrent neural networks with GRU units were used to capture global features within sequences [42]. ARs share a similar structure with RNNs (Figure 5.2A). In both, the output at time t depends not only on the input at time t but also on information from earlier time steps. However, RNNs use the hidden state from only the most recent time step, whereas ARs use actual inputs from the past to generate future values. AR models, as the name suggests, perform regression tasks over their own lagged variables (i.e., forecasting future values using linear combinations of past values). Protein sequence generation through ARs can be achieved by maximizing sequence likelihood through a tractable probability density function. This objective function is a product of the conditional probabilities of tokens at each position, conditioned on all previous tokens, as shown in Equation 5.1, where X is the full-length sequence, x_i is the token at position i, and n is the sequence length. The objective function is decomposed from the joint probability of a full-length sequence using the chain rule of probability and Bayes' theorem. For each generation step, the features are the past tokens, the label is the true token at the current time step, and the loss is the difference between the predicted token and the true token (a minimal code sketch of this factorization is given at the end of this section).

p(X) = \prod_{i=1}^{n} p(x_i \mid x_1, x_2, \ldots, x_{i-1})    (5.1)

Theoretically, the space complexity of ARs grows exponentially with forward processing of sequences. This complexity is represented as O(n^k) in big-O notation, where k grows with increasing n for a sequence of length n. In practice, ARs use a fixed number of parameters to specify each prior. This reduces the complexity to a polynomial O(n^c), where c is a constant. However, this restricts the ability of ARs to represent all possible conditional distributions and limits model expressiveness. Lin et al. [43] proposed energy-based models and latent-variable autoregressive models as alternatives to alleviate the limited distributional modeling of standard ARs. Recent studies have implemented ARs for protein design. A model with one autoregressive layer paired with generalized logistic regression was used for mutational prediction, contact prediction, and sequence generation of a response-regulator protein family [44]. The negative log-ratio of the joint probabilities of mutant and wildtype was used to indicate single mutational effects, and the sum of log probabilities of single mutations was employed as the double-mutation likelihood for residue-residue contact prediction. The trained model generated sequences that were similar to natural sequences, as judged by comparing their principal components. Another study used a dilated convolutional, autoregressive model to capture the sequential constraints of long nanobodies [45]. They showed that their alignment-free model matches the accuracies of alignment-dependent models in the context of mutation effect prediction, thermostability prediction and fitness predictions for indels. In addition, their model yielded a designed library that contained stable and functional nanobodies with biochemical properties comparable to, and diversity exceeding, the natural nanobody repertoire. Though RNNs and ARs are powerful, their sequential operation results in linear-time O(n) complexity that makes their training time-consuming.
Both RNNs and ARs employ supervised learning, which helps models optimize with experience and yield high accuracy. However, this increases the chance of overfitting if the training data does not represent the true data distribution well. The LSTM and GRU address some limitations of RNNs, but RNNs remain inefficient due to short-term memory and long gradient paths. ARs have an explicit probability density function to maximize sequence likelihood, but the computation of a series of conditional probabilities requires significant computational resources. The loss of information during training also hinders the overall performance of RNNs and ARs. These issues are significant when considering whether RNNs and ARs should be implemented for sequence generation tasks.
5.2.1 Protein Engineering Highlights of RNNs and ARs
A comprehensive collection of notable applications of sequence forecasting models in protein engineering is provided in Table 5.1 below. Following this, a case study is presented to elucidate the details of a selected paper marked in the table.

Table 5.1 Summary of highlighted applications of sequence forecasting models for protein engineering.
Protein Engineering Task | Advancements | Model Type | Training Data Source(s) | Year | Ref.
Classification of sequences based on predicted iron sequestration capabilities, protease activity, GPCR activity and p450 activity | Ten diverse, unannotated sequences predicted to exhibit iron sequestration activity were experimentally validated. As measured by accuracy, precision, recall and F1 scores, the RNN model outperforms logistic regression and random forest models. | RNN | UniProt | Jan. 2017 | [46]
Predict solvent accessibility | 8.8% mean absolute error and 74.8% Pearson's correlation coefficient for predicting solvent accessibility were observed. | Stacked Bidirectional LSTMs | PISCES Database | May 2018 | [42]
Structural class prediction | Using a low-dimension feature space (18-D), classification accuracies ranging from 84.2% to 95.9% were observed. | RNN | PDB25, FC699, 640, 498, 277 and 204 | June 2021 | [47]
Novel artificial protein sequence generation | Libraries of artificially generated monobody mutases were experimentally validated. | DCA | 1259 natural monobody mutase sequences | July 2020 | [48]
Antibody binding pocket prediction | Outperformed CNN/RNN models and random-forest models. | Bidirectional LSTMs | Structural Antibody Database (SAbDab) | Dec. 2021 | [49]

5.2.1.1 Prediction of Antibody Paratope with Bidirectional LSTMs
A deep learning model for predicting the antigen binding sites of antibodies was developed through the implementation of bidirectional LSTMs in a transformer neural network (discussed in Section 5.3, Sequence Design with Attention Mechanism: Transformer-Based Language Models) [49]. The DeepANIS (Antibody Interacting Site) model was able to elucidate the relationships among residues of the loop sequences of complementarity determining regions (CDRs) of a given antibody (https://github.com/HideInDust/DeepANIS). Trained on only 277 antibody–antigen complexes from the Protein Sequence Culling Server (PISCES) database, the authors demonstrated the ability of a transformer neural network using bidirectional LSTMs to outperform alternative CNN-based and random-forest-based models for paratope prediction. This architecture also enabled the developers to perform these predictions using the concatenated CDR loop sequences of a given antibody as the only input.
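To ground the autoregressive factorization in Equation 5.1 and the LSTM-based generative models discussed in this section, the sketch below shows a minimal character-level protein sequence model in PyTorch. It is an illustrative toy, not any of the published models cited above: the vocabulary, architecture sizes, and sampling temperature are assumptions, and the model would need to be trained on real sequence data before its likelihoods or samples were meaningful.

```python
# Minimal sketch: an LSTM-based autoregressive model over the 20 amino acids.
# It scores a sequence via Eq. (5.1), log p(X) = sum_i log p(x_i | x_<i),
# and generates new sequences token by token. Untrained toy, for illustration.
import torch
import torch.nn.functional as F
from torch import nn

AA = "ACDEFGHIKLMNPQRSTVWY"
BOS = len(AA)                       # start-of-sequence token
stoi = {a: i for i, a in enumerate(AA)}

class TinyProteinLM(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(len(AA) + 1, hidden)          # +1 for BOS
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)   # left-to-right (causal)
        self.head = nn.Linear(hidden, len(AA))

    def forward(self, tokens):                  # tokens: [batch, length]
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)                     # logits for the next token at each step

def log_likelihood(model, seq):
    """log p(X) = sum_i log p(x_i | x_1..x_{i-1}), per Eq. (5.1)."""
    idx = [BOS] + [stoi[a] for a in seq]
    tokens = torch.tensor([idx])
    logits = model(tokens[:, :-1])              # predict each residue from its prefix
    logp = F.log_softmax(logits, dim=-1)
    targets = tokens[:, 1:]
    return logp.gather(-1, targets.unsqueeze(-1)).sum().item()

@torch.no_grad()
def sample(model, length=30, temperature=1.0):
    """Generate a sequence one residue at a time (one-to-many forecasting)."""
    tokens = [BOS]
    for _ in range(length):
        logits = model(torch.tensor([tokens]))[0, -1] / temperature
        tokens.append(torch.multinomial(F.softmax(logits, dim=-1), 1).item())
    return "".join(AA[i] for i in tokens[1:])

model = TinyProteinLM()
print(log_likelihood(model, "MKTAYIAKQR"))
print(sample(model, length=25))
```

In practice, such a model would be trained with cross-entropy under teacher forcing on a corpus of protein sequences; bidirectional variants like those in Table 5.1 replace the left-to-right pass with forward and backward passes for prediction tasks, at the cost of no longer being directly usable for left-to-right generation.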
Alternative models require either the CDRs to be provided as separate sequences for each CDR loop or additional information about the antigen or antibody.

5.3 Sequence Design with Attention Mechanism: Transformer-Based Language Models

Transformer models consist of a specific neural network architecture that transforms input sequences into output sequences using a series of operations (e.g., matrix multiplications, scaled dot-product attention, and feed-forward neural networks). Transformers have given rise to various sequence-to-sequence applications such as machine translation, question answering (chatbots) and text summarization. Their design enables parallel operation (constant-time O(1) complexity with respect to sequence position), resulting in faster and more efficient performance than ARs and RNNs (linear-time O(n) complexity). This parallelization promotes uniform learning across each position of a sequence by eliminating the short-term dependencies that disproportionately weigh later parts of the sequence compared to earlier parts. The original transformer introduced by Vaswani et al. [15] consists of an 'encoder' that encodes a complete sentence into a representation and a 'decoder' that decodes a target sentence using that contextual representation (Figure 5.2). Both the encoder and decoder contain multiple self-attention and feed-forward neural network units. Self-attention is a key component of transformers that enables the model to learn which tokens are important when processing a given token (e.g., epistatic interactions in protein sequences). The feed-forward networks then add non-linear operations to the network during training. Note that the order information of tokens is lost due to parallel computing. Thus, transformers include an additional embedding, the positional encoding, which encodes the exact position of tokens using as many sinusoidal functions as embedding dimensions. To train a transformer to translate text from language A to language B, the encoder takes language A as input to generate a representation. The decoder takes language B as input and combines it with the encoder-generated representation to learn the correct mapping of words between the two languages. Similarly, protein-specific de novo drug design can be treated as a translation problem [50]. The authors used a transformer as a biological translator to generate novel molecular binders given amino acid sequences only. For a question-answering task, the encoder input is the question, and the decoder input is the answer. Through the connected encoder and decoder, the transformer learns to give an answer to a specified question. Protein–protein interactions are analogous to question–answer pairs in syntax. This strategy was utilized to generate signal peptides using available organism data in Swiss-Prot [51]. Their experimental results showed that the generated peptides are functional and diverse. As noted in Figure 5.3, transformers capture the influence of other tokens (e.g., through epistasis) on the query token with a self-attention mechanism. The self-attention computation starts with three inputs (query, key, and value), analogous to those in retrieval systems. When retrieving an item, the machine takes a request (query) against a list of item descriptions (keys) and returns the top matches (values). In protein chains, we retrieve attention from a sequence first by multiplying the queries (amino acid requests) with the transposed keys (amino acid identities) to obtain the attention weighting; the complete computation is sketched in the example below.
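The complete scaled dot-product attention computation, including the scaling, softmax normalization, and value weighting described in the next paragraph, can be sketched as follows. The tensor sizes are illustrative, and the optional causal flag anticipates the decoder-style masking discussed later; this is a generic single-head sketch rather than any specific published implementation.

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(Q, K, V, causal=False):
        # Q, K, V: (sequence_length, d_k) single-head tensors for one sequence
        d_k = Q.shape[-1]
        scores = Q @ K.transpose(-2, -1) / d_k ** 0.5       # attention weighting (L, L)
        if causal:
            # Decoder-style mask: future positions set to -inf before softmax
            L = scores.shape[-1]
            mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)                  # normalized attention
        return weights @ V                                   # attended representations

    L, d = 6, 8                                              # toy sequence length and head size
    Q = K = V = torch.randn(L, d)                            # self-attention: all from one sequence
    print(scaled_dot_product_attention(Q, K, V, causal=True).shape)  # (6, 8)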
Then, the scaled and normalized attention weighting is multiplied with the values (amino acid representations) to obtain the attention output (Figure 5.3). Often, the attention layer is split into several heads in parallel to capture attention from different subspaces. The multi-head attention layer, combined with a fully connected feed-forward network with layer normalization in between, builds an attention block; these blocks are then stacked to form the encoder. Since the inputs (query, key and value) of the encoder come from the same sequence, it generates a self-attentive representation of that sequence. The attention block of the decoder has an additional layer, a masked multi-head attention layer, which is placed before the multi-head attention layer. The inputs of the masked attention layer come from the decoder, meaning that it is self-attentive. The inputs of the following attention layer come from both the encoder and decoder (query from the decoder; key and value from the encoder), meaning that it is cross-attentive. Due to this self- and cross-attention mechanism, the decoder generates a target sequence by considering both the encoder and the decoder.

Figure 5.3 Visualization of attention mapping and attention computation. (A) Based on the protein fold, amino acids at different positions have varied epistatic effects on each other. The highlighted circle refers to a query amino acid in a protein active site. The color gradient shows how attention can capture the influence of other amino acids (tokens) on the queried token. (B) Attention computation requires three components: key, query and value. By calculating scaled dot-product attention scores, the model chooses which areas of the sequence to prioritize for the prediction task.

The attention mechanism has been applied to understand the semantics and syntax of the protein language. It has been implemented between sets of gene ontology terms to predict protein–protein interactions [52]. This mechanism has also been employed with a convolutional neural network (CNN) to predict protein contacts [53, 54]. Similarly, in another report, a CNN with an attention mechanism improved protein–drug interaction prediction [55]. The CNN was used to obtain feature matrices for the proteins and ligands, and the attention mechanism was then implemented to assign weights to each atom or amino acid. Their model evaluation on benchmark datasets showed improvements compared to previous baselines. The transformer decoder is autoregressive by nature, owing to its masked self-attention layer. By masking out the attention to future tokens, the decoder decodes target sequences by inferring attention over past tokens only. This is achieved by adding to the attention weighting a mask matrix whose upper triangle is filled with negative infinity and whose lower triangle is filled with zeros. While the decoder is autoregressive during testing, it is non-autoregressive at training time. During training, the decoder generates tokens at all time steps simultaneously, without relying on tokens from previous time steps. The autoregressive attribute of the decoder allows transformers to be used in generative applications. Figure 5.4 presents the overall schematic of the transformer architecture and its variants.

Figure 5.4 Architecture overview for the transformer and two important transformer-based language models: Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-Training (GPT). The transformer utilizes an encoder-decoder method for handling language tasks.
However, BERT uses encoder blocks only, and GPT includes only decoder blocks. The difference in their architectures is mainly due to their training objectives: in pretraining, BERT takes a bidirectional approach while GPT is based on an autoregressive method.

The ability of transformers to generate text representations has led to the development of various transformer-variant models, such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-Training (GPT) [56, 57]. They can generate meaningful representations which can be used for downstream, task-specific modeling (e.g., named entity recognition, question answering and text generation). Pre-training models on a large corpus (unsupervised training) followed by fine-tuning with task-specific objectives (supervised training) results in improved performance across different language modeling tasks. Note that BERT and GPT have also shown good performance in few-shot (i.e., when only a few labeled examples are available) and zero-shot (i.e., when the model generalizes to a new task with no further training data) settings [57, 58, 59]. The architecture of BERT is composed of layers of the original transformer encoder. The pre-training of BERT uses masked language modeling (MLM): by masking out tokens in an input sentence, the model is trained to predict the masked tokens from their context. This is achieved by minimizing the cross-entropy loss between the predicted and actual tokens at the masked positions. This context-dependent training makes the representation bidirectional, which is why BERT is a popular architecture for representation learning. In contrast, GPT comprises a series of transformer decoder blocks without the multi-head cross-attention layer, due to the absence of an encoder. In contrast to BERT's MLM, GPT uses a causal language modeling (CLM) approach that predicts each token considering only the tokens to its left. By left-shifting each token in the input sequence, GPT does not have access to the actual token it is about to predict. Therefore, the representation generated by GPT pre-training is unidirectional and self-attentive, making it a popular model for text generation. By having the task layer operate directly on the pre-trained representation, the number of layers, learnable parameters and training time are reduced. The training of the task layer occurs simultaneously with the fine-tuning of the pre-trained model to improve the compatibility between a representation and a given task. Based on the architecture of the task layer, these models handle either sequence-to-sequence (sequence generation) or sequence-to-scalar (sequence classification) tasks. Generally, BERT is not optimal for text generation, and GPT is limited to unidirectional interactions. Lewis et al. [60] proposed Bidirectional and Auto-Regressive Transformers (BART), which combines the strengths of BERT and GPT to perform sequence-to-sequence denoising effectively. The BART encoder learns from corrupted sequences in which noise is introduced through masking, insertion, deletion, infilling, permutation, and rotation. The BART decoder learns to reconstruct the original sequences autoregressively. The encoder and decoder work together to recognize and remove the intentionally added noise. Hence, BART is a useful architecture for sequence noise reduction and feature extraction.
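A simplified sketch of the masking step used in BERT-style MLM pretraining, applied to an amino-acid sequence, is shown below. The 15% masking rate and the <mask> token are illustrative assumptions, and the full BERT corruption scheme (in which some selected tokens are replaced with random or unchanged residues) is omitted for brevity.

    import random

    AA = list("ACDEFGHIKLMNPQRSTVWY")
    MASK = "<mask>"

    def mask_for_mlm(sequence, mask_rate=0.15, seed=0):
        """Return (corrupted_tokens, labels) for masked-language-model training.
        Labels are None at unmasked positions, mirroring the usual convention of
        computing cross-entropy only over the masked tokens."""
        rng = random.Random(seed)
        tokens, labels = [], []
        for residue in sequence:
            if rng.random() < mask_rate:
                tokens.append(MASK)        # the model must recover this residue from context
                labels.append(residue)
            else:
                tokens.append(residue)
                labels.append(None)
        return tokens, labels

    tokens, labels = mask_for_mlm("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    print(tokens)
    print([l for l in labels if l is not None])  # residues the model is trained to predict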
Another challenge in training is capturing long-term dependencies in sequence data whose length is much greater than its embedding dimension. The Performer model was proposed to reduce the cost of training the attention mechanism, scaling linearly instead of quadratically with sequence length [61]. This model provides an unbiased estimate of the regular attention matrix, and the estimate is uniformly convergent.

5.3.1 Protein Engineering Highlights of Transformer-based Language Models

Table 5.2 below presents a wide range of impactful applications of transformer-based language models in protein engineering. Following this table, two case studies selected from the table are discussed.

5.3.2 Pre-training of Deep Bidirectional Protein Sequence Representations with Structural Information

The pre-training scheme PLUS was able to outperform leading pre-training models that are based solely on language modeling (e.g., UniRep, P-ELMo) by integrating protein-specific structural information with amino acid sequence data (https://github.com/mswzeus/PLUS) [65]. Structural information was obtained from protein family labels in the Pfam dataset. This provided a more accurate and less computationally intensive route than using sequence similarity to predict protein function. Additionally, masked language modeling is performed in a manner similar to BERT to extract syntactic and semantic information. In this study, an informative comparison was made wherein PLUS was used to pre-train bidirectional RNN (PLUS-RNN) and Transformer (PLUS-TFM) architectures. Despite much of the literature indicating that attention-based models are superior, the PLUS-RNN architecture was found to be advantageous over PLUS-TFM in this study because the RNN-based implementation more accurately captures local amino acid sequence motifs. For PLUS-RNN, bidirectional representations of amino acid sequences were used to capture context in both the right-to-left and left-to-right directions. In doing so, the PLUS-RNN model achieved higher performance than a similarly sized transformer on protein-level classification and regression tasks and on amino acid-level classification tasks. Higher performance was observed even against the leading task-specific models in predicting homology, stability, fluorescence and transmembrane residues.

Table 5.2 Summary of highlighted applications of transformer-based models for protein engineering.
Protein Engineering Task | Advancements | Model Type | Training Data Source(s) | Year | Ref.
Protein sequence language modeling | Approached protein sequence modeling as a language translation task using one RNN to encode the protein sequence and another RNN to decode the previously encoded sequence. | Transformer | CAFA3 | Oct. 2017 | [62]
DNA-binding protein prediction | Incorporated sequence context with an attention mechanism to improve prediction performance over random-forest, RNN, and support vector machine models. | CNN-Bidirectional RNN | UniProt | Nov. 2019 | [63]
Signal peptide sequence generation | A diverse library of 53 artificially generated signal peptides was generated and validated in Bacillus subtilis. | Transformer | Swiss-Prot | Aug. 2020 | [51]
Novel sequence generation for enhanced GB1 variants | Used Gibbs sampling to generate new sequences from a BERT language model. | BERT | GB1 | Jan. 2021 | [64]
Homology, solubility, subcellular localization, stability, fluorescence, secondary structure, and topology prediction | Demonstrated a pre-training strategy that incorporates sequence data with structural information, improving performance on protein fitness prediction tasks compared to similar-size language models. | Bidirectional RNN | Pfam | Sept. 2021 | [65]
Variant effect prediction | Introduced a new approach incorporating specialized attention heads and sequence context information. | Transformer | UniRef100 | June 2022 | [66]
Novel sequence generation | Generated novel protein sequences that mimic properties of natural proteins (e.g., stability, dynamics). The pre-trained model enables rapid, accessible sequence generation on desktop machines. | Transformer | UniRef50, Swiss-Prot | July 2022 | [30]
GFP fluorescence intensity, stability | Introduced key design constraints to the Transformer architecture in their regularized latent space optimization (ReLSO) approach to protein sequence modeling. | Transformer | GB1, Gifford, GFP, TAPE | Oct. 2022 | [67]

5.3.3 ProtGPT2 is a Deep Unsupervised Language Model for Protein Design

ProtGPT2 is an autoregressive transformer model capable of generating highly diverse protein sequences [30]. Among the generated sequences, amino acid propensities and the fraction of disordered regions are consistent with proteins found in nature, yet the generated sequences are highly distinct from natural proteins. ProtGPT2 provides a useful platform for fine-tuning on a particular protein family, function, or fold of interest. Using a transformer decoder model with byte-pair-encoded input sequences enabled self-supervised training on nearly 50 million unlabeled protein sequences from UniRef and Swiss-Prot. With 738 million parameters, ProtGPT2 allows users to generate novel sequences in mere seconds on a desktop computer (https://huggingface.co/nferruz/ProtGPT2).

5.4 Pre-trained Language Models & Embeddings

Transfer learning (TL) is an ML technique for transferring useful knowledge learned in a source domain to another related domain (i.e., the target domain). This is particularly useful when labeled data in the target domain are scarce and obtaining them is time-consuming and costly. Inductive learning, transductive learning and unsupervised learning are specific transfer learning approaches for different applications. Inductive TL improves the target predictive function via information learned from the source domain prediction task; note that both the domains and the tasks are different but related. Transductive TL aims to improve prediction on target tasks using knowledge learned from the source domain; here the learning tasks must be the same while the domains differ. For unsupervised TL, the target domain prediction function still benefits from the source domain when the learning tasks are not the same and there is no labeled data in either the source or the target domain [68]. The use of TL in protein engineering applications can increase the efficiency and generalizability of downstream tasks by transferring the domain knowledge learned in pretraining to the prediction task. One highly explored and successful application of TL in protein engineering is the use of pretrained language models to predict protein properties (e.g., thermostability, kinetic activity, binding affinity and disordered regions) from sequence; a minimal sketch of this embed-then-predict workflow is shown below.
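The sketch below illustrates the general shape of this embed-then-predict workflow: sequences are mapped to fixed-size vectors and a small supervised model is fit on top. The embed function here is a simple amino-acid-composition stand-in for a real pretrained language-model embedding (e.g., a UniRep- or ESM-style vector), and the sequences and stability values are invented solely for illustration.

    import numpy as np
    from sklearn.linear_model import Ridge

    AA = "ACDEFGHIKLMNPQRSTVWY"

    def embed(sequence):
        # Stand-in for a pretrained language-model embedding (e.g., a UniRep- or
        # ESM-style fixed-size vector); here, simple amino-acid composition.
        counts = np.array([sequence.count(a) for a in AA], dtype=float)
        return counts / max(len(sequence), 1)

    # Hypothetical labeled data: a handful of sequences with invented stability scores
    sequences = ["MKTAYIAK", "MKTAYLAK", "GGSGGSGG", "AAAAKKKK"]
    stability = np.array([1.2, 1.1, 0.3, 0.5])

    X = np.vstack([embed(s) for s in sequences])
    model = Ridge(alpha=1.0).fit(X, stability)           # small supervised head on top of the embeddings
    print(model.predict(embed("MKTAYVAK").reshape(1, -1)))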
These models are trained on a large number of unlabeled sequences from protein databases such as UniProt [69], UniRef [70] and SRA [71]. With NLP techniques such as masked-token prediction and next-token prediction, these models extract useful information from their training data for use in downstream tasks. Note that the trained model can either be used directly or its information can be extracted into a fixed-size continuous vector (i.e., an embedding). These embeddings are unique to each input sequence, and they contain structural, evolutionary, statistical and biophysical information about the proteins. This is considered a breakthrough in ML-guided protein engineering, where pretrained models alleviate the lack of labeled data and improve the performance of ML models. Unified representation (UniRep) is among the early pretrained models; it was trained via an mLSTM model over 25 million sequences to distill biophysical and evolutionary information about proteins into a fixed-size representation. The UniRep model has shown generalization to distant regions of the fitness landscape and requires relatively few labeled data points for viable predictions [72].

5.4.1 Protein Engineering Highlights of Pre-trained Language Models & Embeddings

Table 5.3 contains a collection of highlighted applications of pre-trained language models and embeddings in protein engineering, along with an added case study for further understanding.

5.4.2 Unsupervised Learning on 250M Protein Sequences Derives Biological Insights

BERT and GPT are versatile if trained appropriately and have been successfully implemented in the protein sequence domain. Rives et al. [27] trained a BERT-style model on 250 million protein sequences to generate representations that capture biological properties. They applied downstream tasks such as remote homology detection, linear projections, secondary structure prediction, and contact prediction to demonstrate the rich information captured by their deep contextual language model, Evolutionary Scale Modeling (ESM) (https://github.com/facebookresearch/esm). Additionally, they explored how sequence diversity and model size impact performance. To enable transfer of the model to new tasks with no additional supervision, extended ESM architectures such as ESM-1v and ESM-2 were proposed for variant effect prediction (i.e., mapping sequence changes to functional changes) and for capturing high-resolution structural features, respectively [78].

Table 5.3 Summary of highlighted applications of transfer learning & embeddings for protein engineering.
Protein Engineering Task | Advancements | Model Type | Training Data Source(s) | Year | Ref.
Stability prediction, fitness (GFP brightness) | Introduced the RNN-based unified representation (UniRep) to improve efficiency in protein prediction tasks. | mLSTM RNN | UniRef50 | Dec. 2019 | [26]
Secondary structure, subcellular localization, and solubility prediction | Developed the SeqVec model of protein sequences that represents sequences as continuous vectors to predict biophysical and biochemical properties. | ELMo | UniRef50 | Dec. 2019 | [73]
Secondary structure, contact, homology, fluorescence, and stability prediction | Demonstrated the usefulness of multi-task benchmarks like TAPE to evaluate protein transfer learning models. | Transformer | CB513, CASP12, TS115, avGFP, ProteinNet, SCOP 1.75 | Dec. 2019 | [25]
Protein family classification, protein interaction prediction | Using the training procedure developed for RoBERTa, PRoBERTa offers a more generalized pre-training framework that outperformed alternatives in protein interaction prediction tasks. | Transformer | UniProtKB/Swiss-Prot | Jun. 2020 | [74]
Protein folding, binding site, and substitution matrix prediction | Demonstrated how attention can capture protein features from BERT-based models for protein engineering tasks. | Transformer | Pfam, BFD, UniRef100 | March 2021 | [75]
Secondary structure, subcellular localization, and solubility prediction | Demonstrated the usefulness of pre-trained embeddings for various prediction tasks without requiring the use of MSAs. | ARs | UniRef, BFD | July 2021 | [28]
Secondary structure, contact, homology, fluorescence, and stability prediction | Improved protein representations by incorporating a pairwise masked language model to encode co-evolutionary information. | Transformer | Pfam, UniRef50 | Oct. 2021 | [76]
Sequence profile construction | Performed sequence profile reconstruction with the pre-trained ProtAlbert for sequences with limited homology to database sequences. | Transformer | UniRef, BFD | Aug. 2022 | [77]
Structure prediction | Despite not requiring MSAs, ESMFold, trained on UniRef sequences, offers rapid structure generation from a single input protein sequence. | Transformer | UniRef50 and UniRef90 | March 2023 | [78]

There have been several attempts to model protein sequences via language modeling techniques and to apply the learned information to protein engineering tasks; some of the most successful are listed above. Embedding methods can alleviate the lack of labeled data and improve generalization. In addition, self-supervised training with more parameters captures more nuanced information about the language of proteins [78]. While transfer learning has shown great promise in protein engineering applications, there is a need for a deeper understanding of what information is learned in pretraining and transferred to downstream prediction tasks [79]. For example, some studies have observed similar or superior performance for protein fitness prediction without the use of embedding methods [80, 81].

5.5 Probabilistic Modeling of Sequences: VAEs

Unlike transformers, which treat protein sequences as a language, variational autoencoders (VAEs) treat sequences as a parameterized multivariate distribution [82]. The VAE architecture takes high-dimensional data, reduces it to a low-dimensional representation (encoder) and then reconstructs the representation back to the original dimensionality of the input data (decoder) (Figure 5). This encoder-decoder bottleneck structure is also a hallmark of standard autoencoders (AEs). The latent representation of an AE is a fixed-length vector where each value (dimension) is associated with a single learned feature of the data. In contrast, the latent representations of VAEs are probability distributions (which are continuous and smooth) for each data attribute. By randomly sampling a vector from the latent distributions, the VAE decoder acts as a generative model that can generate new data instances (e.g., novel protein sequences). The VAE encoder is a recognition model with the ability to recognize the statistical distributions that describe variation in the data.
The latent representation of a VAE is forced to be continuous and smooth by training the encoder to output pairs of means and standard deviations (probability distributions), which are subsequently sampled before decoding. Compared with the fixed-vector representation of AEs, the continuous distribution representation of VAEs allows the decoder to learn that both a single value and its nearby values refer to the same class. Accordingly, representations of the same class are clustered together as a distribution in latent space, and nearby representations have similar reconstructions.

Figure 5.5 GAN architecture for generating sequence data; a model that learns to sample from the given data distribution and contains two separate, opposed networks: a generator and a discriminator. The generator aims to generate synthetic data from noise that cannot be distinguished from real data by the discriminator. The discriminator, on the other hand, is optimized to distinguish synthetic data from real data. By evolving together, the model is finally able to generate samples very similar to the real training data.

Gaps between classes in latent space are troublesome, as the decoder has no training data from which to learn the features of those regions. Therefore, VAEs incorporate a term in the loss function, the Kullback–Leibler (KL) divergence, which measures how one distribution differs from another. By minimizing the KL divergence between the learned latent distribution and a prior distribution (e.g., a Gaussian), VAEs regularize the latent space. This regularization promotes a continuous latent space, which allows VAEs to interpolate smoothly from one class to another. The reconstruction loss encourages the formation of data points similar to the original input, and the KL divergence regularizes the latent space. Together they yield a well-structured and information-rich latent space from which VAEs can sample. Intuitively, for latent variables z that generate an observation x, we would maximize the data likelihood p(x) = ∫ p(x|z) p(z) dz. However, this integral is intractable and cannot be directly optimized. Instead, VAEs model p(x) through the encoder distribution p(z|x), the decoder distribution p(x|z), and the latent prior p(z). The posterior distribution p(z|x), which describes the latent variables given an observation, is also intractable, but we can apply variational inference to approximate this density [83]. By defining a tractable encoder distribution q(z|x) and minimizing the KL divergence between p(z|x) and q(z|x), we obtain the objective function of VAEs, with the derivation shown below:

D_KL(q(z|x) || p(z|x)) = D_KL(q(z|x) || p(z)) − E_q[log p(x|z)] + log p(x)    (5.2)

log p(x) − D_KL(q(z|x) || p(z|x)) = E_q[log p(x|z)] − D_KL(q(z|x) || p(z)) = ELBO    (5.3)

log p(x) ≥ ELBO    (5.4)

Since D_KL(q(z|x) || p(z|x)) is intractable and the KL divergence is always non-negative, we can maximize the tractable Evidence Lower Bound (ELBO) in order to maximize the log data likelihood. Within this objective function, E_q[log p(x|z)] corresponds to the reconstruction loss, and D_KL(q(z|x) || p(z)) corresponds to the KL-divergence loss mentioned in the previous paragraph.
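A minimal sketch of this objective for fixed-length one-hot protein sequences is shown below: the encoder outputs a mean and log-variance, the reparameterization trick samples z, and the loss combines the reconstruction term with the closed-form KL divergence against a standard-normal prior. The architecture sizes are arbitrary illustrative choices, and the beta weight on the KL term anticipates the Beta-VAE variant discussed next.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SequenceVAE(nn.Module):
        """Toy VAE over fixed-length one-hot protein sequences (20 amino acids)."""
        def __init__(self, seq_len=30, n_aa=20, latent_dim=8, hidden=128):
            super().__init__()
            self.seq_len, self.n_aa = seq_len, n_aa
            self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * n_aa, hidden), nn.ReLU())
            self.to_mu = nn.Linear(hidden, latent_dim)
            self.to_logvar = nn.Linear(hidden, latent_dim)
            self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                         nn.Linear(hidden, seq_len * n_aa))

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
            logits = self.decoder(z).view(-1, self.seq_len, self.n_aa)
            return logits, mu, logvar

    def negative_elbo(logits, x, mu, logvar, beta=1.0):
        # Reconstruction term E_q[log p(x|z)] as per-position cross-entropy, plus the
        # closed-form KL term D_KL(q(z|x) || N(0, I)); beta weights the KL (Beta-VAE).
        recon = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                                x.argmax(-1).reshape(-1), reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + beta * kl

    x = F.one_hot(torch.randint(0, 20, (4, 30)), num_classes=20).float()  # 4 random toy sequences
    model = SequenceVAE()
    logits, mu, logvar = model(x)
    print(negative_elbo(logits, x, mu, logvar, beta=1.0))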
Several VAE-derived models have been developed to address common issues like attribute entanglement and posterior collapse. Higgins et al. [84] proposed Beta-VAEs to facilitate the learning of disentangled data attributes. By introducing a hyperparameter that weights the KL-divergence loss, the trade-off between reconstruction and regularization of the latent representation can be adjusted. Razavi et al. [85] proposed a method for preventing a common issue in training VAEs, posterior collapse. Posterior collapse happens when the approximate posterior fails to capture meaningful information about the latent variables (collapsing toward the prior), and the model becomes ineffective at generating diverse, high-quality samples. Their proposed method, delta-VAE, constrains the parameters of the posterior to enforce a minimum KL divergence between the prior and the posterior.

5.5.1 Protein Engineering Highlights of VAEs

Explore Table 5.4 for an overview of how VAE models have been implemented for protein engineering applications.

Table 5.4 Summary of highlighted applications of VAE models for protein engineering.
Protein Engineering Task | Advancements | Model Type | Training Data Source(s) | Year | Ref.
Metalloprotein sequence generation and metal binding site prediction | Generated novel metalloproteins that feature copper and calcium binding sites. | VAE | UniRef90 | Nov. 2018 | [86]
Novel T cell receptor sequence generation | The VAE model was able to estimate cohort frequency, learn VDJ recombination regulation, generalize to unexplored sequence space, and generate novel T cell receptor sequences. | VAE | Adaptive Biotech immuneAccess | Sept. 2019 | [87]
Protein dynamics prediction | Introduced a novel approach to unsupervised learning to train VAEs. | VAE | MoDEL | May 2020 | [88]
Novel protein sequence generation | Implemented a "Deep Exploration Network" architecture to generate novel sequences with polyadenylation sites, splicing sites, transcription enhancer binding sites, and enhanced GFP fluorescence. | VAE | MPRA, APA, avGFP | July 2020 | [89]
Novel luciferase sequence generation | Novel luciferase sequences generated from the VAE model showed increased solubility while maintaining bioluminescence. | VAE | InterPro | Feb. 2021 | [90]
Novel protein sequence generation | Proposed a new evaluation metric in a comparison of Potts, VAE and site-independent generative models. | VAE | UniProt/TrEMBL, Pfam | Nov. 2021 | [91]
Ancestral sequence reconstruction | Demonstrated the ability to capture higher-order epistatic effects through VAE-generated phylogenetic trees. | VAE | CFP, EvolveAGene4 simulations, Pfam | Feb. 2023 | [92]

5.5.1.1 Deep Generative Models for T Cell Receptor Protein Sequences

Davidsen et al. [87] demonstrated the capability of VAE models to generate T-cell receptor (TCR) sequences with characteristics similar to real sequences (https://github.com/matsengrp/vampire/). Rather than modeling the probability that a given sequence undergoes VDJ recombination and then approximating the properties of the mature TCR repertoire, the VAE architecture enables direct modeling of the distribution of the mature TCR repertoire. In addition to generating novel TCR sequences, the VAE-based models were able to predict the frequency of a TCR in a given cohort and learn the rules of V(D)J recombination. The training data of TCR sequence repertoires were sourced from Adaptive Biotechnologies' immunoSEQ assay. Despite some requiring fewer than 100 lines of Python code, these simple VAE models were found to outperform previous models that implemented more complicated graphical models mimicking the biological process of V(D)J recombination; a minimal sketch of generating new sequences by sampling a VAE latent space is shown below.
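The sketch below illustrates that sampling step in isolation: latent vectors are drawn from the standard-normal prior and passed through a decoder to produce sequences. The decoder here is an untrained stand-in (in practice it would be the trained decoder of a sequence VAE, such as the toy model sketched earlier), so the outputs are random strings rather than realistic receptor sequences.

    import torch
    import torch.nn as nn

    AA = "ACDEFGHIKLMNPQRSTVWY"
    SEQ_LEN, LATENT_DIM = 30, 8

    # Stand-in decoder; in practice this would be the trained decoder of a sequence VAE.
    decoder = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                            nn.Linear(128, SEQ_LEN * len(AA)))

    @torch.no_grad()
    def sample_sequences(n_samples=5):
        z = torch.randn(n_samples, LATENT_DIM)                 # draw latent codes from the prior N(0, I)
        logits = decoder(z).view(n_samples, SEQ_LEN, len(AA))
        return ["".join(AA[int(i)] for i in row) for row in logits.argmax(-1)]  # greedy decoding per position

    print(sample_sequences())   # with an untrained decoder these are random strings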
5.6 Sequence Generation through Min-Max Gaming: GANs

Up to this point, we have discussed generative models that use explicit probability density functions: RNNs and ARs have a tractable likelihood, and VAEs maximize an approximate likelihood. Here, we turn to generative adversarial networks (GANs), an implicit probabilistic model that directly generates new data instances by defining a stochastic procedure [33]. GANs employ a two-player game approach to replicate training data distributions without assumptions about their priors. One player is the generator network, and the other is the discriminator network. The objective of the generator is to generate realistic data instances from random noise to fool the discriminator. The objective of the discriminator, on the other hand, is to distinguish real data from the training set from fake data produced by the generator (Figure 6). By having these two networks compete, the generator learns to generate (fake) data that are close to the (real) training samples, and the discriminator provides feedback to the generator for improvement. This approach allows the two networks to evolve together so that, ideally, the generator can produce synthetic samples that are indistinguishable from real samples. To train the two networks jointly, GANs use the minimax objective function shown below:

min_{θ_g} max_{θ_d} V(D, G) = E_{x∼p_data}[log D_{θ_d}(x)] + E_{z∼p_noise}[log(1 − D_{θ_d}(G_{θ_g}(z)))]    (5)

In practice, the generator is often trained by gradient ascent on the term log(D_{θ_d}(G_{θ_g}(z))) instead of gradient descent on the term log(1 − D_{θ_d}(G_{θ_g}(z))). In this manner, GANs have a steep gradient to drive learning by maximizing the likelihood of the discriminator being wrong instead of minimizing the likelihood of the discriminator being correct. Developing a loss function for GANs that leads to more stable and better learning is still an active research area. Arjovsky et al. [93] proposed the Wasserstein loss, in which the discriminator maximizes D_{θ_d}(x) − D_{θ_d}(G_{θ_g}(z)) and the generator maximizes D_{θ_d}(G_{θ_g}(z)). This means the discriminator is not a classifier but a 'critic' that maximizes the difference between its scores for real and generated (fake) data, while the generator maximizes the critic's score for generated data. Usually, the trained discriminator is discarded, and the trained generator is kept for new data generation.

5.6.1 Protein Engineering Highlights of GAN Models

Table 5.5 presents a selection of key applications of GAN models in protein engineering, with an added case study for deeper analysis.

Table 5.5 Summary of highlighted applications of GAN models for protein engineering.
Protein Engineering Task | Advancements | Model Type | Training Data Source(s) | Year | Ref.
Antimicrobial peptide sequence generation | Implemented a GAN model to generate novel antimicrobial peptides. | GAN | UniProt | Feb. 2019 | [94]
Target-drug binding affinity prediction | Used unlabeled data to train a GAN model in a semi-supervised manner. | GAN | [97], KIBA | Jan. 2020 | [95]
Antibody design | Experimentally validated GAN-generated antibody sequences after initial phage display library screening. | GAN | Observed Antibody Space Repository | April 2020 | [96]
Ancestral sequence reconstruction | Performed ancestral sequence reconstruction on H3N2 influenza proteins and predicted the evolution of these proteins to improve pathogen forecasting. | GAN | NCBI Influenza Virus Resource | Aug. 2020 | [97]
Synthetic data generation, gene ontology prediction tasks | Improved gene ontology prediction with a GAN model by creating synthetic feature samples. | FFPred-GAN | modENCODE | Sept. 2020 | [98]
Gene ontology classification | Used a WGAN to improve gene ontology term correlation performance. | WGAN | SwissProt | Feb. 2021 | [99]
Novel malate dehydrogenase sequence generation | Experimentally validated novel malate dehydrogenase sequences with up to 106 mutations. | GAN | UniProt | April 2021 | [31]

5.6.2 GAN Architecture Enables the Generation of Synthetic Samples to Improve Training

Data augmentation with high-quality synthetic samples can help overcome the challenges of developing models that predict protein function. Wan and Jones [98] demonstrated the ability to generate such high-quality synthetic protein feature samples using their GAN-based FFPred-GAN. In addition to using the FFPred model to determine protein biophysical information from protein sequences, FFPred-GAN implemented a WGAN with gradient penalty to learn the distribution of the training protein data set. FFPred-GAN enabled significantly higher accuracy in all three gene ontology domains (i.e., cellular component, molecular function and biological process) without demanding significant computational resources to generate both negative and positive synthetic samples (https://github.com/psipred/FFPredGAN).

5.7 Diffusion Models

Diffusion models are a recently developed and rapidly ascending class of generative models that have shown competitive performance against established benchmarks. This approach offers better distribution convergence and more diversity in the generated samples. The underlying principle of diffusion models is adopted from non-equilibrium thermodynamics, in which the diffusion process increases the system's entropy, driving it toward a state of maximum randomness [34]. In the context of generative modeling, diffusion models therefore gradually transform noisy signals into coherent data structures (i.e., reversing the noise). These models have shown promising results in image synthesis, image inpainting (i.e., filling missing regions in images) and text generation. For example, DALL-E 2 [100], a text-to-image framework developed by OpenAI, incorporates a diffusion model during training to generate realistic and high-quality images. Their model resulted in up to four times higher resolution compared to the original DALL-E trained with the GPT-3 architecture [101]. Effective training of generative diffusion models requires a detailed understanding of their main components and foundational concepts. In this section, we describe the core concepts, models, and the main mathematical formulations used for training diffusion models. Finally, we examine the evolutionary improvements of these models since their introduction in 2015. The forward diffusion process is the transition from the data distribution to a prior distribution (e.g., an isotropic Gaussian). This is a Markov chain process, and each step only depends on the previous step (i.e., progressively adding noise). For example, we can generate a noisy image at t = 1 by adding a small amount of Gaussian noise to the pixel values of the image at t = 0, repeating this process for subsequent time steps until the data distribution transforms into the prior distribution. The forward diffusion step can be parametrized as shown below:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)    (6)

q(x_{1:T} | x_0) = ∏_{t=1}^{T} q(x_t | x_{t−1}) = ∏_{t=1}^{T} N(x_t; √(1 − β_t) x_{t−1}, β_t I)    (7)

where t is the time step ranging from 1 to T, x_0 is the instance sampled from the true data distribution, β_t is the variance schedule, and I is the identity matrix.
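The forward process in Equation 6 amounts to repeatedly applying a small Gaussian perturbation, as in the sketch below; the closed-form shortcut for jumping directly from x_0 to x_t (discussed next) is also included. The linear beta schedule, the number of steps, and the toy one-hot "sequence" representation are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    T = 100
    betas = np.linspace(1e-4, 0.02, T)            # illustrative linear variance schedule beta_t
    alphas_bar = np.cumprod(1.0 - betas)          # cumulative product used by the closed form

    def forward_step(x_prev, t):
        # Equation 6: q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) x_{t-1}, beta_t I)
        return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * rng.standard_normal(x_prev.shape)

    def sample_xt_directly(x0, t):
        # Closed form via the reparameterization trick: q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)
        return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * rng.standard_normal(x0.shape)

    x0 = np.eye(20)[rng.integers(0, 20, size=30)].astype(float)   # toy one-hot "sequence" (30 x 20)
    x = x0
    for t in range(T):                                            # Markov chain of small Gaussian perturbations
        x = forward_step(x, t)
    print(x.std(), sample_xt_directly(x0, T - 1).std())           # both drift toward an isotropic Gaussian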
Given the equations above, the conditional distribution of each step given the previous step is assumed to be a Gaussian with mean √(1 − β_t) x_{t−1} and variance β_t I. The noised distribution at any timestep can also be obtained directly from x_0 in closed form using a reparameterization trick [102]. The backward diffusion process represents the challenging task of transforming the noised distribution back into the data distribution. Once accomplished, new data instances can be generated by sampling from the noise distribution. In the backward diffusion process, the model starts with pure Gaussian noise and, at each step, learns the Gaussian transition parameters with the aid of a parametrized model (e.g., a neural network). Note that this network should have matching input and output dimensions (e.g., the U-Net architecture [103]). The backward step parametrization is represented in Equations 8 and 9.

p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t))    (8)

p_θ(x_{0:T}) = p_θ(x_T) ∏_{t=1}^{T} p_θ(x_{t−1} | x_t)    (9)

These equations differ from the forward diffusion parametrization in two main ways: (i) the time trajectory is reversed, and (ii) the Gaussian distribution parameters must be learned via a parametrized model. The diffusion model loss function is a negative log-likelihood (NLL) loss that measures the discrepancy between the true data distribution and the learned distribution. Minimizing the NLL with respect to the model parameters is intractable, but it becomes tractable via variational inference techniques. Similar to maximizing the ELBO as discussed in the context of VAEs, applying Bayes' rule and simplifying the evidence lower bound for diffusion models yields a tractable loss function. The original formulation [34] showed that gradually perturbing the data distribution in this way yields a tractable generative model when a sufficient number of steps is applied. The reasoning behind this is that small perturbations of the data are more tractable to predict than a one-shot prediction of the full distribution. In 2020, Ho et al. [102] introduced a series of novel enhancements to this technique, leading to high-quality image synthesis through denoising diffusion probabilistic models (DDPMs). In their work, the authors employed a linear noise schedule and innovatively chose to predict the image noise at each iteration of the backward process. Building upon these advancements, the model's performance was further elevated by treating the variance of the normal distribution as a learned parameter instead of a fixed number. Also, introducing non-linear noise schedules (e.g., a cosine schedule) resulted in effective preservation of the data distribution during the early noising steps of the forward diffusion process [104, 105]. One main breakthrough in diffusion generative model development was made by Song et al. [106] by incorporating a stochastic differential equation (SDE) framework. The 'score function' in their methodology refers to the gradient of the log probability density. In the forward process, the data distribution is perturbed in continuous time (in contrast to earlier diffusion models with a finite number of noising steps) via the proposed SDE formulation, which has no trainable parameters. The reverse SDE can be solved numerically with methods such as Euler–Maruyama once the score function term is handled [107]. The authors addressed this by modeling the score function with a neural network, which can then be plugged into the reverse SDE formula. Equations 12 and 13 show the main forward and reverse formulations used in an SDE process.
The stochastic differential equations (SDEs) used in the forward and reverse processes are:

dx = f(x, t) dt + g(t) dw    (12)

dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄    (13)

where f(x, t) is the drift coefficient, g(t) is the diffusion coefficient, w and w̄ are Wiener processes running forward and backward in time, and ∇_x log p_t(x) is the score function approximated by the neural network. The stochastic nature of diffusion models allows for the generation of diverse protein conformations from an initial noise distribution. This inherent uncertainty is particularly beneficial and offers a more realistic modeling approach, since proteins are dynamic and adopt multiple conformations. Given these unique features of their architecture and training procedure, diffusion models are potentially an invaluable tool for navigating the intricate energy landscape within which proteins operate.

5.7.1 Protein Engineering Highlights of Diffusion Models

Explore Table 5.6 for an array of recently developed diffusion model applications in protein engineering, extended with two analytical case studies.

5.7.1.1 ProteinGenerator Enables the Joint Generation of Protein Sequence and Structure

The authors implemented a DDPM with coordinated guidance on sequence and structure, resulting in improved generation (github.com/RosettaCommons/protein_generator) [110]. They leveraged RoseTTAFold's [116] capability to simultaneously generate protein sequences and structures. Drawing inspiration from RoseTTAFold joint inpainting, they adapted this ability for the diffusive creation of consistent sequence-structure pairs. Fine-tuning to retrieve noised native protein sequences while simultaneously ensuring the accuracy of structure prediction enabled guidance from both the sequence and structural domains. In unconditional generation, ProteinGenerator was able to generate sequence-structure pairs close to native proteins. Note that various structural properties and amino acid frequencies were obtained by sampling from different noise distributions. The model architecture also enabled high versatility and, as a result, compatibility with different conditioning and classifier-guidance methods. In conditioning, additional constraints were added to the generation process. For instance, the model was conditioned to generate sequences enriched in particular amino acids (e.g., cysteine for forming disulfide bonds to increase stability, or histidine for pH sensitivity) while still satisfying the corresponding structural fold. In another example, the DeepGOPlus Gene Ontology (GO) classifier was used to guide the generation process. The classifier provides scores or gradients that can be used to influence the outputs of the main model and, as a result, generate functionally rich sequences.

Table 5.6 Summary of highlighted applications of diffusion models for protein engineering.
Protein Engineering Task | Advancements | Model Type | Training Data Source(s) | Year | Ref.
Protein sequence inpainting and generation | Demonstrated the use of equivariant denoising using the invariant point attention technique to jointly model protein sequences and structures. | Transformer-Based Diffusion Model | PDB | May 2022 | [112]
Joint sequence-structure generation | Modeled structure and sequence of full protein complexes in a computationally efficient manner. | Graph Neural Network-Based Diffusion Model | PDB, UniProt | Dec. 2022 | [109]
Protein backbone joint sequence-structure model development | Demonstrated the feasibility of developing generative diffusion models through a comparison of sequence-only, structure-only, and joint sequence-structure models. | CARP | PDB | May 2023 | [113]
Joint sequence-structure generation | Developed a model capable of generating both novel sequences and novel protein backbones via RoseTTAFold. | DDPM | INDI, SCOP | May 2023 | [110]
De novo protein sequence generation with desired structural features | Generated novel protein sequences with desired secondary-structure features using an attention-based diffusion model. | Attention-Based Diffusion Model | – | July 2023 | [114]
Antibody joint sequence-structure modeling | Improved joint protein sequence and structure generation using both domain knowledge and physics-based constraints. | SE(3)-based Diffusion Model | pOAS Database, HER2 binder data set | July 2023 | [115]

5.7.1.2 Chroma Enables the Generation of Novel Protein Complexes via its Joint Sequence and Structure Model

Another highly successful implementation of the diffusion-based framework is Chroma, which enables joint modeling of the sequence and structure of full protein complexes (https://github.com/lucidrains/chroma-pytorch) [109]. The authors introduced sophisticated computational techniques and conditional sampling to manage computational challenges while crafting proteins with specific attributes. Rooted in diffusion modeling and graph neural networks, this versatile generative model excels at refining noisy structures while preserving the intricate 3D details inherent in protein configurations. The model facilitates programmable protein design, as generation can be conditioned on different shapes, symmetries, textual prompts, and various other properties. Remarkably, Chroma's capability to generate protein complexes holds significant value, as most protein functions, such as binding, occur through protein-protein interactions. Furthermore, the authors indicated that a large protein (e.g., with more than 3000 residues) can be generated within minutes on an appropriate GPU (e.g., an NVIDIA V100).

5.8 Discussion

Generative models such as VAEs, autoregressive models, GANs, and diffusion models have shown significant promise in the protein engineering domain for generating novel and functional sequences. This ongoing research has mitigated long-standing challenges in designing proteins with improved properties, generating interfaces for protein-protein interactions, establishing rules for high-fitness protein variants and capturing phylogenetic relationships between proteins. These models aim to learn the underlying data distribution and generate novel instances by sampling from the learned distribution. Distinct model structures are employed to learn the given data distribution by directly modeling or approximating the probability density function. VAEs are probabilistic generative models that approximate the explicit density function via variational inference. Upon learning the underlying distribution of a given dataset, a VAE can generate novel samples similar to the input data. VAEs have been used in various protein engineering tasks, including improving fitness (e.g.
thermal stability, solubility, bioluminescence and binding) and capturing phylogenetic relationships via learned latent-space relationships. Autoregressive models compute an explicit density function in which each token is conditioned on the previous tokens. Autoregressive models have also led to successful outcomes in generating sequences with improved fitness, paratope prediction, and protein localization. Unlike VAEs and autoregressive models, which rely on restricted neural network architectures or approximations to handle the intractable normalizing constant, GANs model only the generation process. As a result, they are not used for likelihood estimation, yet they have superior potential for generating high-quality instances. Two networks (a generator and a discriminator) are used in GANs to sample from the density function without calculating or estimating the function itself (i.e., implicit density estimation). GANs have provided promising results in diverse tasks such as gene ontology correlation, binding affinity prediction, phylogeny prediction, antimicrobial peptide generation and developing rules for antibody solubility and thermal stability. Diffusion models are a more recent class of generative models adapted from non-equilibrium thermodynamics. The idea is that if noise is added to the data gradually, the process can be reversed; the data distribution can therefore be approximated from pure noise in the reverse diffusion process. While each of these generative models has achieved promising outcomes in protein design applications, they differ in their training process, output quality and the diversity of generated outputs. In general, given their efficient architecture, VAEs are potentially easier to train, yet they may produce lower-quality outputs (e.g., blurry images in image generation) compared to other generative models [117]. Note that recent architectural developments have tried to overcome common issues in VAEs (e.g., posterior collapse and the reconstruction-regularization trade-off), yet these solutions may require more computational resources. For instance, beta-VAE [84], hierarchical VAE [118, 119] and VQ-VAE [120] are distinct types of VAE models that address common issues in traditional VAEs. Beta-VAE adds a hyperparameter to the loss function to obtain more disentangled representations. Hierarchical VAE aims to preclude posterior over-regularization by incorporating hierarchical priors in the model. Finally, VQ-VAE has been shown to generate high-quality data and prevent posterior collapse by learning discrete representations and an autoregressive prior (versus the continuous learned representations and static prior of the original VAE). Similarly, there are improved variants of GANs and autoregressive models that boost the quality of generated data and resolve model restrictions. Examples of autoregressive developments include GPT-3 [58], Reformer [121] and Big Bird [122], which use more parameters in training, a reversible sequence-to-sequence architecture, and a sparse attention mechanism, respectively. For GANs, improved variants include CycleGAN [123], LSGAN [124] and VEEGAN [125] for training without paired data, resolving vanishing-gradient issues during training, and reducing mode collapse to increase the diversity of generated data, respectively. Although diffusion models have been developed recently, their architecture is rapidly evolving. For example, subspace diffusion has shown improved sampling quality and reduced computational cost by restricting the diffusion through projection onto subspaces [126].
Denoising diffusion policy optimization (DDPO) is another architectural development in diffusion models, which frames the denoising process as a multi-step decision-making problem [127]. The architectures mentioned here are a few variants among a large pool, and their performance depends on the specific application and data attributes (e.g., the number of training samples, data complexity and input data length). Despite the newfound opportunities provided by generative models in this realm, the remaining challenges in generative sequence modeling include validating the generated sequences, navigating the rugged fitness landscape in pursuit of sequences with desired features, de novo binder design, effectively infusing biological priors into models and strategically combining distinct generative models to enhance sampling quality and diversity. In many cases, wet-lab experiments are required to assess the quality of generated sequences, from basic required properties (e.g., stability and expression) to more design-specific properties (e.g., affinity and specificity). This in itself has hindered model optimization, as there is no immediate and definitive feedback on the quality of generated sequences (in contrast to rapidly assessing the visual quality of image data generated by these models). That being said, there are computational tools to aid in filtering generated sequences and increasing the success rate of experimental characterization. For example, AlphaFold2 for structural prediction [128], discriminative models that assign probabilities to sequences based on their fitness [129], and self-supervised models for few- and zero-shot predictions [134] are among the most beneficial tools for analyzing generated sequences in silico. In this work, we provided an overview of the architecture and underlying assumptions of four commonly used generative models (VAEs, autoregressive models, GANs and diffusion models). By analyzing the strengths and limitations of each model, we hope that researchers are better equipped to make informed decisions when selecting the appropriate model for specific data and objectives. We also elaborated on specific protein engineering applications for each of these models, highlighting their potential to generate novel protein sequences with improved properties. With the exponential growth of biological and protein sequence datasets, the increasing efficiency of generative models, and improved methods for generating and validating de novo sequences, we envision a promising future for the development of effective protein design and engineering applications.

5.8.1 Key Points

• To address the gap between the growing number of machine learning (ML) models and their application to protein engineering tasks, we have reviewed recent protein engineering applications of generative ML models.
• The architecture and mathematical background of three generative models (diffusion models, generative adversarial networks and variational autoencoders) are described in depth with a focus on applications toward protein design (e.g., to predict protein properties and to generate protein design rules and sequences).
• The architecture and application of language-based machine learning models (namely recurrent neural networks, autoregressive models and transformers) are also described, particularly in the context of treating protein design tasks on amino acid sequences similarly to human language tasks on strings of text.
• Incorporating transfer learning and embeddings can improve the efficiency and generalizability of ML modeling tasks.

BIBLIOGRAPHY

[1] JM Webster, R Zhang, SS Gambhir, et al. Engineered two-helix small proteins for molecular recognition. ChemBioChem, 10:1293–1296, 2009.
[2] CS Eke, E Jammeh, X Li, et al. Early detection of alzheimer's disease with blood plasma proteins using support vector machines. IEEE J Biomed Health Inform, 25:218–226, 2021.
[3] Y Luan and Y Yao. The clinical significance and potential role of c-reactive protein in chronic inflammatory and neurodegenerative diseases. Front Immunol, 9:1302, 2018.
[4] R Bam, PS Lown, LA Stern, et al. Efficacy of affibody-based ultrasound molecular imaging of vascular b7-h3 for breast cancer detection. Clin Cancer Res, 26:2140–2150, 2020.
[5] J Małecki, S Muszyński, and BG Sołowiej. Proteins in food systems—bionanomaterials, conventional and unconventional sources, functional properties, and development opportunities. Polymers, 13:2506, 2021.
[6] DB Janssen and JP Schanstra. Engineering proteins for environmental applications. Curr Opin Biotechnol, 5:253–259, 1994.
[7] K Kuroda and M Ueda. Molecular design of the microbial cell surface toward the recovery of metal ions. Curr Opin Biotechnol, 22:427–433, 2011.
[8] D Prakash, P Gabani, AK Chandel, et al. Bioremediation: a genuine technology to remediate radionuclides from the environment. J Microbial Biotechnol, 6:349–360, 2013.
[9] JM Jez. Toward protein engineering for phytoremediation: possibilities and challenges. Int J Phytoremediation, 13:77–89, 2011.
[10] X Jia, Y Li, T Xu, and K Wu. Display of lead-binding proteins on escherichia coli surface for lead bioremediation. Biotechnol Bioeng, 117:3820–3834, 2020.
[11] MD Diem, L Hyun, F Yi, et al. Selection of high-affinity centyrin fn3 domains from a simple library diversified at a combination of strand and loop positions. Protein Eng Des Sel, 27:419–429, 2014.
[12] AW Golinski, KM Mischler, S Laxminarayan, et al. High-throughput developability assays enable library-scale identification of producible protein scaffold variants. Proc Natl Acad Sci, 118:e2026658118, 2021.
[13] M Zacharias. Protein–protein docking with a reduced protein model accounting for side-chain flexibility. Protein Sci, 12:1271–1282, 2003.
[14] R Merkl and R Sterner. Reconstruction of ancestral enzymes. Perspect Sci, 9:17–23, 2016.
[15] A Vaswani, N Shazeer, N Parmar, et al. Attention is all you need. arXiv preprint, page arXiv:1706.03762, 2017.
[16] B Ghojogh and A Ghodsi. Attention mechanism, transformers, bert, and gpt: Tutorial and survey. Open Science Framework, 2020.
[17] Y LeCun, Y Bengio, and G Hinton. Deep learning. Nature, 521:436–444, 2015.
[18] D Ofer, N Brandes, and M Linial. The language of proteins: Nlp, machine learning & protein sequences. Comput Struct Biotechnol J, 19:1750–1758, 2021.
[19] K Wang, R Zhou, Y Li, and M Li. Deepdtaf: a deep learning method to predict protein–ligand binding affinity. Brief Bioinform, 22:bbab072, 2021.
[20] G Li, KS Rabe, J Nielsen, and MKM Engqvist. Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima. ACS Synth Biol, 8:1411–1420, 2019.
[21] S Khurana, R Rawi, K Kunji, et al. Deepsol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics, 34:2605–2613, 2018.
[22] S Hashemifar, B Neyshabur, AA Khan, and J Xu. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics, 34:i802–i810, 2018.
[23] LWang, H-FWang, S-R Liu, et al. Predicting protein-protein interactions from matrix-based protein sequence using convolution neural network and feature-selective rotation forest. Sci Rep, 9:9848, 2019. [24] N Ferruz, S Schmidt, and B Höcker. A deep unsupervised language model for protein design. 2022, page 2022.03.09.483666, 2022. [25] R Rao, N Bhattacharya, N Thomas, et al. Evaluating protein transfer learning with tape. Adv Neural Inf Process Syst, 32:9689–9701, 2019. [26] EC Alley, G Khimulya, S Biswas, et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods, 16:1315–1322, 2019. [27] A Rives, J Meier, T Sercu, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci, 118:e2016239118, 2021. [28] A Elnaggar,MHeinzinger, C Dallago, et al. Prottrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell, 44:7112–7127, 2022. 152 [29] Z Costello and HG Martin. How to hallucinate functional proteins. arXiv, page arXiv:1903.00458v1, 2019. [30] N Ferruz, S Schmidt, and B Höcker. Protgpt2 is a deep unsupervised language model for protein design. Nat Commun, 13:4348, 2022. [31] D Repecka, V Jauniskis, L Karpus, et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat Mach Intell, 3:324–333, 2021. [32] DP Kingma and M Welling. Auto-encoding variational bayes. arXiv, page arXiv:1312.6114v11, 2022. [33] IJ Goodfellow, J Pouget-Abadie, M Mirza, et al. Generative adversarial networks. arXiv, page arXiv:1406.2661, 2014. [34] J Sohl-Dickstein, EWeiss, N Maheswaranathan, and S Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the 32nd International Conference on Machine Learning, pages 2256–2265, 2015. [35] A Sherstinsky. Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm) network. Phys Nonlinear Phenom, 404:132306, 2020. [36] M Schuster and KK Paliwal. Bidirectional recurrent neural networks. IEEE Trans Signal Process, 45:2673–2681, 1997. [37] S Hochreiter and J Schmidhuber. Long short-term memory. Neural Comput, 9:1735–1780, 1997. [38] J Chung, C Gulcehre, K Cho, and Y Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv, page arXiv:1412.3555, 2014. [39] AT Müller, JA Hiss, and G Schneider. Recurrent neural network model for constructive peptide design. J Chem Inf Model, 58:472–479, 2018. [40] K Saka, T Kakuzaki, S Metsugi, et al. Antibody design using lstm based deep generative model from phage display library for affinity maturation. Sci Rep, 11:5852, 2021. [41] S Sabban and M Markovsky. Ramanet: computational de novo helical protein backbone design using a long short-term memory generative neural network. F1000 Research Full, 9:671552, 2020. [42] B Zhang, J Li, and Q Lü. Prediction of 8-state protein secondary structures by a novel deep learning architecture. BMC Bioinformatics, 19:293, 2018. [43] C-C Lin, A Jaech, X Li, et al. Limitations of autoregressive models and their alternatives. 153 Association for Computational Linguistics, 2021. [44] J Trinquier, GUguzzoni, A Pagnani, et al. Efficient generative modeling of protein sequences using simple autoregressive models. Nat Commun, 12:5800, 2021. [45] J-E Shin, AJ Riesselman, AW Kollasch, et al. Protein design and variant prediction using autoregressive generative models. Nat Commun, 12:2403, 2021. [46] X Liu. 
Deep recurrent neural network for protein function prediction from sequence. arXiv, page arXiv:1701.08318, 2017. [47] B Panda and B Majhi. A novel improved prediction of protein structural class using deep recurrent neural network. Evol Intell, 14:253–260, 2021. [48] WP Russ, M Figliuzzi, C Stocker, et al. An evolution-based model for designing chorismate mutase enzymes. Science, 369:440–445, 2020. [49] P Zhang, S Zheng, J Chen, et al. Deepanis: Predicting antibody paratope from concatenated cdr sequences by integrating bidirectional long-short-term memory and transformer neural networks. 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 118–124, 2021. [50] D. Grechishnikova. Transformer neural network for protein-specific de novo drug generation as a machine translation problem. Scientific Reports, 11:321, 2021. [51] Z. Wu, K. K. Yang, M. J. Liszka, et al. Signal peptides generated by attention-based neural networks. ACS Synthetic Biology, 9:2154–2161, 2020. [52] I. Ieremie, R. M. Ewing, and M. Niranjan. Transformergo: predicting protein–protein interactions by modeling the attention between sets of gene ontology terms. Bioinformatics, 38:2269–2277, 2022. [53] C. Chen, T. Wu, Z. Guo, and J. Cheng. Combination of deep neural network with attention mechanism enhances the explainability of protein contact prediction. Proteins: Structure, Function, and Bioinformatics, 89:697–707, 2021. [54] K. O’Shea and R. Nash. An introduction to convolutional neural networks. arXiv:1511.08458, 2015. [55] Q. Zhao, H. Zhao, K. Zheng, and J. Wang. Hyperattentiondti: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics, 38:655–662, 2022. [56] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Association for 154 Computational Linguistics (ACL), 2019. [57] A. Radford, K.Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training, 2018. [58] T. B. Brown, B. Mann, N. Ryder, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 1877–1901, 2020. [59] M. Tsimpoukelli, J. L. Menick, S. Cabi, et al. Multimodal few-shot learning with frozen language models. In Proceedings of Neural Information Processing Systems (NeurIPS), volume 34, pages 200–212, 2021. [60] M. Lewis, Y. Liu, N. Goyal, et al. Bart: Denoising sequence-to-sequence pre-training for natural language generation. Transactions on Comprehension, 58:7871–7880, 2019. [61] K. Choromanski, V. Likhosherstov, D. Dohan, et al. Masked language modeling for proteins via linearly scalable long-context transformers. arXiv:2006.03555, 2020. [62] R. Cao, C. Freitas, L. Chan, et al. Prolango: protein function prediction using neural machine translation based on a recurrent neural network. Molecules, 22:1732, 2017. [63] S. Hu, R. Ma, and H. Wang. An improved deep learning method for predicting dna-binding proteins based on contextual features in amino acid sequences. PLoS One, 14:e0225317, 2019. [64] S. R. Johnson, S. Monaco, K. Massie, and Z. Syed. Generating novel protein sequences using gibbs sampling of masked language models. arXiv:2021.01.26.428322, 2021. [65] S. Min, S. Park, S. Kim, et al. Pre-training of deep bidirectional protein sequence representations with structural information. IEEE Access, 9:123912–123926, 2021. [66] P. Notin, M. Dias, J. 
Frazer, et al. Tranception: Protein fitness prediction with autoregressive transformers and inference-time retrieval. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 16990–17017, 2022. [67] E. Castro, A. Godavarthi, J. Rubinfien, et al. Transformer-based protein generation with regularized latent space optimization. Nature Machine Intelligence, 4:840–851, 2022. [68] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345–1359, 2010. [69] UniProt. A hub for protein information. Nucleic Acids Research, 43:D204–D213, 2023. [70] B. E. Suzek, H. Huang, P. McGarvey, et al. Uniref: comprehensive and non-redundant uniprot reference clusters. Bioinformatics, 23:1282–1288, 2007. 155 [71] K. Katz, O. Shutov, R. Lapoint, et al. The sequence read archive: a decade more of explosive growth. Nucleic Acids Research, 50:D387–D390, 2022. [72] S. Biswas, G. Khimulya, E. C. Alley, et al. Low-n protein engineering with data-efficient deep learning. Nature Methods, 18:389–396, 2021. [73] M. Heinzinger, A. Elnaggar, Y.Wang, et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics, 20:723, 2019. [74] A. Nambiar, S. Liu, M. Hopkins, et al. Transforming the language of life: transformer neural networks for protein prediction tasks. Journal of Computational Biology, 30:95–111, 2020. [75] J. Vig, A. Madani, L. R. Varshney, et al. Bertology meets biology: interpreting attention in protein language models. arXiv:2006.15222, 2021. [76] L. He, S. Zhang, L. Wu, et al. Pre-training co-evolutionary protein representation via a pairwise masked language model. arXiv:2110.15527, 2021. [77] A. Behjati, F. Zare-Mirakabad, S. S. Arab, and A. Nowzari-Dalini. Protein sequence profile prediction using protalbert transformer. Computational Biology and Chemistry, 99, 2021. [78] Z. Lin, H. Akin, R. Rao, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379:1123–1130, 2023. [79] M. Mardikoraem and D.Woldring. Protein fitness prediction is impacted by the interplay of language models, ensemble learning, and sampling methods. Pharmaceutics, 15(5), 2023. [80] A. Shanehsazzadeh, D. Belanger, and D. Dohan. Is transfer learning necessary for protein landscape prediction? arXiv:2011.03443, 2020. [81] B. J. Wittmann, Y. Yue, and F. H. Arnold. Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Systems, 12:1026–1045.e7, 2021. [82] S. Sinai, E. Kelsic, G. M. Church, and M. A. Nowak. Variational auto-encoding of protein sequences. arXiv:1712.03346, 2018. [83] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: a review for statisticians. Journal of the American Statistical Association, 112:859–877, 2017. [84] I. Higgins, L. Matthey, A. Pal, et al. Beta-vae: learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. [85] A. Razavi and O. Vinyals. Preventing posterior collapse with 𝛿-vaes. arXiv:1901.03416, 2019. 156 [86] J. G. Greener, L. Moffat, and D. T. Jones. Design of metalloproteins and novel protein folds using variational autoencoders. Scientific Reports, 8:16189, 2018. [87] K. Davidsen, B. J. Olson, III DeWitt, W. S., et al. Iv deep generative models for t cell receptor protein sequences. Elife, 8:e46935, 2019. [88] A.-I. Albu and G. Czibula. 
Analysing protein dynamics using machine learning based generative models. In Proceedings of the 2020 IEEE 14th International Symposium on Applied Computational Intelligence and Informatics (SACI), pages 135–140, 2020. [89] J. Linder, N. Bogard, A. B. Rosenberg, and G. Seelig. A generative neural network for maximizing fitness and diversity of synthetic dna and protein sequences. Cell Systems, 11:49–62.e16, 2020. [90] A. Hawkins-Hooker, F. Depardieu, S. Baur, et al. Generating functional protein variants with variational autoencoders. PLoS Computational Biology, 17:e1008736, 2021. [91] F. McGee, S. Hauri, Q. Novinger, et al. The generative capacity of probabilistic protein sequence models. Nature Communications, 12:6302, 2021. [92] L. S. Moreta, O. Rønning, and A. S. Al-Sibahi. Ancestral protein sequence reconstruction using a tree-structured ornstein-uhlenbeck variational autoencoder. In International Conference on Learning Representations (ICLR), 2022. [93] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv:1701.07875, 2017. [94] A. Gupta and J. Zou. Feedback gan for dna optimizes protein functions. Nature Machine Intelligence, 1:105–111, 2019. [95] L. Zhao, J.Wang, L. Pang, et al. Gansdta: predicting drug-target binding affinity using gans. Frontiers in Genetics, 10:1243, 2020. [96] T. Amimeur, J. M. Shaver, R. R. Ketchem, et al. Designing feature-controlled humanoid antibody discovery libraries using generative adversarial networks. arXiv:2020.04.12.024844, 2020. [97] D. S. Berman, C. Howser, T. Mehoke, and J. D. Evans. Mutagan: A seq2seq gan framework to predict mutations of evolving protein populations. Virus Evolution, 2020. [98] C.Wan and D. T. Jones. Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks. Nature Machine Intelligence, 2:540–550, 2020. [99] S. F. Seyyedsalehi, M. Soleymani, H. R. Rabiee, and M. R. K. Mofrad. Pfp-wgan: protein function prediction by discovering gene ontology term correlations with generative 157 adversarial networks. PLoS One, 16:e0244430, 2021. [100] A. Ramesh, P. Dhariwal, A. Nichol, et al. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022. [101] A. Ramesh, M. Pavlov, G. Goh, et al. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 8821–8831, 2021. [102] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. arXiv:2006.11239, 2020. [103] W. Weng and X. Zhu. Inet: convolutional networks for biomedical image segmentation. IEEE Access, 9:16591–16603, 2021. [104] A. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. arXiv:2102.09672, 2021. [105] P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 8780–8794, 2021. [106] H. Song, B. J. Bremer, E. C. Hinds, et al. Inferring protein sequence-function relationships with large-scale positive-unlabeled learning. Cell Systems, 12:92–101.e8, 2021. [107] M. Bayram, T. Partal, and G. Orucova Buyukoz. Numerical methods for simulation of stochastic differential equations. Advances in Differential Equations, 2018. [108] L. Yang, Z. Zhang, Y. Song, et al. Diffusion models: a comprehensive survey of methods and applications. arXiv:2209.00796, 2023. [109] J. Ingraham, M. Baranov, Z. Costello, et al. Illuminating protein space with a programmable generative model. 
bioRxiv:2022.12.01.518682, 2022. [110] S. L. Lisanza, J. M. Gershon, S. Tipps, et al. Joint generation of protein sequence and structure with rosettafold sequence space diffusion. bioRxiv:2023.05.08.539766, 2023. [111] B. Ni, D. L. Kaplan, and M. J. Buehler. Generative design of de novo proteins based on secondary-structure constraints using an attention-based diffusion model. Chem, 9:1828–1849, 2023. [112] N. Anand and T.Achim. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. arXiv:2205.15019, 2022. [113] R. Vinod, K. K. Yang, and L. Crawford. Joint protein sequence-structure co-design via equivariant diffusion, 2022. 158 [114] C.-H. Yu, W. Chen, Y.-H. Chiang, et al. End-to-end deep learning model to predict and design secondary structure content of structural proteins. ACS Biomaterials Science and Engineering, 8:1156–1165, 2022. [115] K. Martinkus, J. Ludwiczak, K. Cho, et al. Abdiffuser: full-atom generation of in-vitro functioning antibodies. arXiv:2308.05027, 2023. [116] M. Baek, F. DiMaio, I. Anishchenko, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373:871–876, 2021. [117] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances inNeural Information Processing Systems (NeurIPS), volume 29, 2016. [118] A. Klushyn, N. Chen, R.Kurle, et al. Learning hierarchical priors in vaes. arXiv:1905.04982, 2019. [119] C. K. Sønderby, T. Raiko, L. Maaløe, et al. Ladder variational autoencoders. arXiv:1602.02282, 2016. [120] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. arXiv:1711.00937, 2018. [121] N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: the efficient transformer. arXiv:2001.04451, 2020. [122] M. Zaheer, G. Guruganesh, K. A. Dubey, et al. Big bird: transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 17283–17297, 2020. [123] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv:1703.10593, 2020. [124] X. Mao, Q. Li, H. Xie, et al. Least squares generative adversarial networks. arXiv:1611.04076, 2017. [125] A. Srivastava, L. Valkov, C. Russell, et al. Veegan: Reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017. [126] B. Jing, G. Corso, R. Berlinghieri, and T. Jaakkola. Subspace diffusion generative models. In Computer Vision – ECCV 2022, volume 13683, pages 274–289. SpringerNature Switzerland, 2022. [127] K. Black, M. Janner, Y. Du, et al. Training diffusion models with reinforcement learning. 159 arXiv:2305.13301, 2023. [128] J. Jumper, R. Evans, A. Pritzel, et al. Highly accurate protein structure prediction with alphafold. Nature, 596:583–589, 2021. [129] A. Strokach and P. M. Kim. Deep generative modeling for protein design. Current Opinion in Structural Biology, 72:226–236, 2022. 160 CHAPTER 6 EVOSEQ-ML: ADVANCING DATA-CENTRIC MACHINE LEARNING WITH EVOLUTIONARY-INFORMED PROTEIN SEQUENCE REPRESENTATION AND GENERATION 1 Abstract In protein engineering, machine learning (ML) advancements have led to significant progress, including protein structure prediction (e.g., AlphaFold), sequence representation through language models, and novel protein generation. However, the impact of data curation on ML model performance is underexplored. 
As more sequence and structural data become available, a data-centric approach is increasingly favored over a model-centric one. A data-centric approach prioritizes high-quality, domain-specific data, ensuring ML tools are trained on datasets that accurately reflect biological complexity and diversity. This paper introduces a novel methodology that integrates ancestral sequence reconstruction (ASR) into ML models, enhancing data-centric strategies in the field. ASR uses computational techniques to infer ancient protein sequences from modern descendants, providing diverse, stable sequences with rich evolutionary information. While multiple sequence alignments (MSAs) are commonly used in protein engineering frameworks to incorporate evolutionary information, ASR offers deeper insights into protein evolution. Unlike MSAs, ASR captures mutation rates, phylogenetic relationships, evolutionary trajectories, and specific ancestral sequences, giving access to novel protein sequences beyond those that natural selection has left in public databases. We employed two statistical methods for ASR: joint Bayesian inference and maximum likelihood. Bayesian approaches infer ancestral sequences by sampling from the entire posterior distribution, accounting for epistatic interactions between multiple amino acid positions to capture the nuances and uncertainties of ancestral sequences. In contrast, maximum likelihood methods estimate the most probable amino acids at individual positions in isolation. Both methods provide extensive ancestral data, enhancing ML model performance in protein sequence generation and fitness prediction tasks. Our results demonstrate that generative ML models trained on either Bayesian or maximum likelihood reconstructions produce highly stable and diverse protein sequences. We also fine-tuned the evolutionary scale ESM protein language model with reconstructed ancestral data to obtain evolutionary-driven protein representations, which were then evaluated on downstream stability prediction tasks for the Endolysin and Lysozyme C families. For Lysozyme C, ancestral-based representations outperformed the baseline ESM in KNN classification and matched the established InterPro method. In Endolysin, our novel ASR-Dist method performed on par with or better than the baseline and other fine-tuning approaches across various classification metrics. ASR-Dist showed consistent performance in both simple and complex classification models, suggesting the effectiveness of this data-centric approach in enhancing protein representations. This work demonstrates how evolutionary data can improve ML-driven protein engineering, presenting a novel data-centric approach that expands our exploration of protein sequence space and enhances our ability to predict and design functional proteins. (This chapter is adapted from content published in the ICLR 2024 Workshop on Generative and Experimental Perspectives for Biomolecular Design; for more information, visit https://openreview.net/forum?id=jYaBeydI5P.)
Keywords: Protein Engineering, Generative Models, Language Models, Fine-Tuning, Ancestral Sequence Reconstruction, Data-Centric Models
6.1 Introduction
Recent advancements in machine learning (ML) highlight the critical role of data quality and diversity in model performance, shifting focus towards data-centric approaches that complement algorithmic innovations [1, 2]. This evolving perspective underscores the importance of curating diverse datasets in developing effective ML models.
Unlike model-centric methods, which are widely accessible through open-source platforms, data-centric methods offer insights tailored to specific applications, which enhances model learning capabilities beyond scoring metrics [3, 4]. This trend is evident in the development of the GPT model, which evolved from focusing on architectural improvements to prioritizing data quality and data collection strategies [5]. As a result, focusing on data-centric approaches promises a profound contribution toward ML-driven problem-solving in distinct domains. These approaches have already exhibited potential in fields ranging from finance to healthcare, improving model reliability, scalability, trustworthiness, and generalizability [6, 7, 8, 9, 10]. This shift towards data-centric approaches is especially significant in areas demanding intricate analysis, like protein engineering, where such strategies could enhance the understanding of the complex protein fitness landscape [11, 12].

In protein engineering, the adoption of data-centric approaches is paramount due to the field's unique challenges. With minor alterations to protein sequences significantly impacting functionality and stability, the protein fitness landscape is complex and rugged [13, 14, 15]. Despite advances in applying ML that have facilitated the discovery of high-fitness proteins [16, 17], these models struggle with imbalanced datasets and the scarcity of data across protein families. This underscores the necessity of reevaluating our strategies for curating training data [11, 12, 18, 19, 20, 21]. A focus on enhancing the diversity and quality of datasets is crucial, as it ensures that ML models are trained on data that accurately reflect the intricate biological properties of proteins. Establishing a method to provide more effective training datasets for ML models will pave the way for strategically advancing protein engineering campaigns.

Ancestral sequence reconstruction (ASR) offers one possible data-centric approach to protein engineering. ASR utilizes computational techniques to infer ancient protein sequences from modern descendants, thereby enriching our datasets with high-quality, diverse, and stable sequences [22, 23, 24, 25]. Built on the foundation of evolutionary biology and molecular phylogenetics, ASR constructs phylogenetic trees using substitution models to map out evolutionary relationships and predict ancestral states. ASR's assignment of posterior probabilities to amino acids at individual positions allows for the exploration of a vast array of sequence combinations. For example, with just 15 positions that each have four high-likelihood amino acids, ASR can generate up to a billion unique sequences (refer to Figure 6.1).
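To make the scale of this combinatorial space concrete, the short sketch below works through the arithmetic of the example above and enumerates a toy case. The positions and residues shown are made up for illustration only and are not taken from any reconstruction in this work.

```python
import itertools

# Worked example of the sequence-space estimate in the text: 15 ambiguously
# reconstructed sites, each with 4 high-likelihood amino acids.
n_sites, n_states = 15, 4
print(n_states ** n_sites)       # 1,073,741,824 combinations (~1e9 sequences)

# Enumerating all of them is rarely practical; in practice a subset is sampled.
# Toy illustration with 3 hypothetical sites (positions and residues invented):
ambiguous_sites = {10: "AST", 42: "LIV", 77: "DE"}
variants = itertools.product(*ambiguous_sites.values())
print(sum(1 for _ in variants))  # 3 * 3 * 2 = 18 combinations for this toy case
```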
Recent studies across various protein families have highlighted ASR's effectiveness in dealing with statistical uncertainties [26, 27]. These studies demonstrate that sequences generated from ASR predictions, which sample a broad distribution of amino acids at ambiguously reconstructed sites, can be functional and, in some cases, more efficient or offer novel functionalities compared to sequences based solely on the most likely amino acid predictions. This robustness to uncertainty is a crucial aspect of ASR's reliability. Even when incorporating alternative amino acids at up to 30% of residues, reconstructed proteins often maintain their overall function [ref]. While quantitative parameters such as binding affinities or enzymatic rates may show some variation, qualitative functional characteristics typically remain consistent. This robustness extends to the effects of key historical mutations, which often produce similar functional shifts regardless of the specific ancestral background. Importantly, different methods for incorporating uncertainty, such as the "AltAll" approach (which creates a worst-case scenario protein) and Bayesian sampling, have been employed to test this robustness. These findings suggest that while precise quantitative estimates may vary, qualitative functional inferences from ASR are generally reliable, allowing researchers to make confident claims about ancient protein functions and evolutionary trajectories despite sequence ambiguity. Therefore, with the potential to generate large-scale, high-quality sequences, ASR represents a potentially impactful approach to incorporating data-centric strategies in protein engineering.

Recent studies have demonstrated the promising potential of leveraging uncertainty in ancestral sequence reconstruction (ASR) within machine learning frameworks for protein engineering. Notably, work from Colin Jackson's lab (2024) [28, 29] introduced multiplexed ASR (mASR), a novel method that utilizes ASR to obtain family-specific protein representations. Their approach employs a custom transformer model (LASE) trained on ancestral sequences reconstructed using the maximum a posteriori (MAP) character at each site. While this method has shown significant improvements in protein representation learning and downstream task performance, particularly demonstrating a smoother fitness landscape and computational efficiency for the phosphotriesterase (PTE) family, it has limitations. The primary disadvantage of the MAP-based approach is its consideration of each site in isolation, potentially overlooking important epistatic effects and dependencies between amino acid sites.

In this study, we explore the integration of ASR-reconstructed data into a comprehensive ML framework to address two primary objectives: (1) the generation of novel protein sequences through generative modeling and (2) the enhancement of protein classification via fine-tuned language models. For the generative ML model, we implemented a variational autoencoder (VAE) using reconstructed sequence data obtained through both Bayesian (via BAli-Phy) and maximum likelihood (via IQ-TREE) ASR methods for the ethylene-forming enzyme (EFE) isolated from Pseudomonas syringae pv. phaseolicola PK2 [30]. The structures and thermal stabilities of the generated sequences were predicted using AlphaFold2 [31] and FoldX [32], respectively. We compared sequences generated from ancestral data to those generated from modern sequences using metrics of structural stability, sequence variability, and semantic diversity. For fine-tuning, we utilized evolutionary scale modeling (ESM) protein representations [33], comparing evolutionary-driven to modern sequences for two protein families: Lysozyme C and Endolysin. The evaluation of family-specific representations on stability prediction tasks revealed promising results across these protein families.
Our method, ASR-Dist, demonstrated robust performance in both Lysozyme C and Endolysin, consistently matching or surpassing the baseline models tested. These innovations advance the use of evolutionary information in protein engineering, offering new tools for sequence generation and representation learning, and enhancing exploration of the protein fitness landscape.
6.2 Methods
6.2.1 Generative Model for Novel Protein Sequence Generation
We aimed to test whether incorporating evolutionary data into the training of generative models yields protein sequences that are both novel and structurally stable. To this end, we selected the ethylene-forming enzyme (EFE) from PK2 as our focus and employed a VAE as our generative ML model. Subsequent computational analyses (i.e., AlphaFold [31], FoldX [32], and UMAP visualization [34]) were conducted on the sequences generated by this model to assess their quality.
6.2.1.1 Data Processing and Ancestral Sequence Reconstruction
We extracted the evolutionary information of EFE using our AP-LASR [35] software, a tool designed to reconstruct ensembles of ancestral proteins by leveraging the phylogenetic tree of the query protein sequence. This tool has facilitated ASR by fully automating the reconstruction process, from the initial BLAST search and multiple sequence alignment with MAFFT [36] to phylogenetic tree inference with IQ-TREE2 [37]. AP-LASR outputs several key datasets: sequences of modern proteins, ancestral proteins (ASR-Max), and near-ancestral proteins (ASR-Dist). The ASR-Dist dataset is created by sampling from the posterior probability distribution in the .state file generated by IQ-TREE. For our project, the threshold for picking amino acids from this distribution was set at 0.2 to strike an optimal balance between maintaining ancestral properties and sequence diversity (i.e., amino acids with a posterior probability greater than 0.2 at a given position were included for sampling in the dataset) [26, 38]. From the data generated by AP-LASR, we sampled sequences from four nodes representing high-stability ancestors obtained at various evolutionary timescales (Node10, Node13, Node253, and Node384; Figure S6.1). To understand how sequence diversity in our training set would impact the robustness and generalizability of our model, two distinct datasets of ancestral protein sequences were crafted for training our ML model. The "Homogeneous Dataset" comprises sequences sampled equally from the ancestral nodes, whereas the "Diverse Dataset" was created by passing the initial dataset through CD-HIT with a 0.9 similarity threshold [39]. The sampled sequences for each node were initially aligned using MAFFT [36]. Subsequently, the sequences from all ancestral nodes were pooled and re-aligned with MAFFT.
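As an illustration of how an ASR-Dist-style ensemble can be assembled from per-site posterior probabilities, the sketch below reads a tab-separated state table and samples sequences using the 0.2 threshold described above. The column layout (Node, Site, and one p_<amino acid> column per residue type) mirrors IQ-TREE's .state output but is an assumption here, as are the file and node names; the AP-LASR interface itself is not reproduced.

```python
import numpy as np
import pandas as pd

AMINO_ACIDS = list("ARNDCQEGHILKMFPSTWYV")

def load_site_posteriors(state_file, node):
    """Read per-site posterior probabilities for one ancestral node from a
    tab-separated table. An IQ-TREE-style .state layout is assumed: comment
    lines start with '#', and columns include Node, Site, and p_<AA>."""
    df = pd.read_csv(state_file, sep="\t", comment="#")
    df = df[df["Node"] == node].sort_values("Site")
    return df[[f"p_{aa}" for aa in AMINO_ACIDS]].to_numpy()  # (n_sites, 20)

def sample_asr_dist(site_probs, n_samples=1000, threshold=0.2, seed=0):
    """Build an ASR-Dist-style ensemble: at every site, keep amino acids whose
    posterior probability is at least `threshold`, renormalize, and sample."""
    rng = np.random.default_rng(seed)
    sequences = []
    for _ in range(n_samples):
        residues = []
        for probs in site_probs:
            keep = np.where(probs >= threshold, probs, 0.0)
            if keep.sum() == 0.0:              # no state above threshold:
                keep[np.argmax(probs)] = 1.0   # fall back to the MAP residue
            residues.append(rng.choice(AMINO_ACIDS, p=keep / keep.sum()))
        sequences.append("".join(residues))
    return sequences

# Example usage (file name and node label are placeholders):
# probs = load_site_posteriors("efe.state", node="Node10")
# ensemble = sample_asr_dist(probs, n_samples=1000, threshold=0.2)
```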
We compared a Bayesian inference method (BAli-Phy) against a maximum likelihood approach (IQ-TREE) to evaluate how the choice of probabilistic sequence reconstruction method influences generative model performance. BAli-Phy uses Bayesian inference to simultaneously estimate the sequence alignment, tree phylogeny, and model parameters, providing full posterior distributions of ancestral sequences. It does not assume a fixed alignment or tree topology, making it more accurate but also more computationally intensive. In brief, the same multiple sequence alignment generated by AP-LASR for IQ-TREE was used to initialize eight independent Markov chain Monte Carlo (MCMC) runs in BAli-Phy. The optimal amino acid substitution model for the runs, LG08 [40], was determined using IQ-TREE's ModelFinder [41]. After running simulations for seven days, three runs were selected for convergence by minimizing the average standard deviation of split frequencies (ASDSF) and the maximum standard deviation of split frequencies (MSDSF). For convergence analysis, the first 2,000 iterations of each run were discarded to account for the bias of the initial random starting guess (i.e., burn-in). The ensemble of ancestors at each ancestral node was subsequently retrieved from the BAli-Phy runs, which produce a single alignment at every iteration that can be sampled. The ETE4 Toolkit [42] (https://github.com/etetoolkit/ete) was used to navigate the phylogeny and define the child sequences for all ancestral nodes, and DendroPy [43] (https://github.com/jeetsukumaran/DendroPy) was used to retrieve all ancestral nodes from the BAli-Phy output. For more targeted protein generation, we sampled sequences from four nodes representing high-stability ancestors obtained at various evolutionary timescales (Node10, Node13, Node253, and Node384, as shown in Figure S6.1).
6.2.1.2 Generative Model Training
A variational autoencoder (VAE) was selected for its proficiency in generating new data points that are coherent with the training data [44, 45]. We employed one-hot encoding to transform the sequences into a format suitable for computational processing. Feature extraction from these one-hot encoded sequences was performed using a 1D convolutional neural network (CNN) layer, allowing us to capture local sequence patterns effectively. The architecture of our VAE was designed with a latent space dimensionality of 100, ensuring sufficient complexity to capture the nuances of protein sequence variability. Additionally, we incorporated batch normalization within the network to facilitate smoother and more stable learning dynamics. This combination of a 1D CNN for feature learning and batch normalization for optimization contributed to refining the model's ability to generate meaningful protein sequences.
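A minimal PyTorch sketch of a VAE with these ingredients is shown below. Only the elements stated in the text are taken from this chapter (one-hot inputs, a 1D CNN feature extractor, batch normalization, and a 100-dimensional latent space); the channel count, kernel size, alignment length, and alphabet size are illustrative assumptions, and the training loop is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ALPHABET = 21          # 20 amino acids + a gap/pad token (assumed)
SEQ_LEN = 350          # aligned sequence length, illustrative
LATENT_DIM = 100       # latent dimensionality stated in the text

class ProteinVAE(nn.Module):
    def __init__(self, seq_len=SEQ_LEN, alphabet=ALPHABET, latent_dim=LATENT_DIM):
        super().__init__()
        # Encoder: 1D convolution over one-hot sequences, then batch norm.
        self.conv = nn.Conv1d(alphabet, 64, kernel_size=7, padding=3)
        self.bn = nn.BatchNorm1d(64)
        self.fc_mu = nn.Linear(64 * seq_len, latent_dim)
        self.fc_logvar = nn.Linear(64 * seq_len, latent_dim)
        # Decoder: map a latent vector back to per-position residue logits.
        self.fc_dec = nn.Linear(latent_dim, 64 * seq_len)
        self.deconv = nn.Conv1d(64, alphabet, kernel_size=7, padding=3)
        self.seq_len = seq_len

    def encode(self, x):                      # x: (batch, alphabet, seq_len) one-hot
        h = F.relu(self.bn(self.conv(x))).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def decode(self, z):
        h = F.relu(self.fc_dec(z)).view(-1, 64, self.seq_len)
        return self.deconv(h)                 # logits: (batch, alphabet, seq_len)

    def forward(self, x):
        mu, logvar = self.encode(x)
        logits = self.decode(self.reparameterize(mu, logvar))
        recon = F.cross_entropy(logits, x.argmax(dim=1), reduction="mean")
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kld                    # standard VAE objective (ELBO)

# Generation after training: sample latent vectors from the prior and decode.
# z = torch.randn(16, LATENT_DIM)
# residue_indices = model.decode(z).argmax(dim=1)  # map back to amino acids via the one-hot alphabet
```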
6.2.1.3 Evaluation
To gauge the ability of our models to generate stable and viable sequences, we utilized AlphaFold2 to verify that the predicted 3D structures of the generated PK2 ancestors exhibited folding patterns similar to the wild-type structure. Structures predicted by AlphaFold2 for randomly selected generated sequences were superimposed onto the wild-type structure (PDB ID: 5V2Y) using PyMOL to calculate RMSD values and compare folding patterns. Following structural prediction, stability calculations (ΔΔG) were carried out using FoldX. This evaluation phase was crucial, as it allowed us to ascertain not just the novelty of the generated sequences but also their practical applicability in terms of structural integrity and thermal stability. We then compared the distributions of stability measurements for the generated structures, followed by Dunn's statistical test to assess the significance of the observed differences. The sequences generated by a model trained on evolutionary-reconstructed sequences were compared to those generated by a model trained on modern sequences. The comparison focused on three aspects: structural quality, sequence diversity, and semantic diversity. To assess structural quality, we used the predicted Local Distance Difference Test (pLDDT) scores, which were extracted using the alphapickle library (https://github.com/mattarnoldbio/alphapickle). Ranging from 0 to 100, pLDDT is used to assess the reliability of the predicted atomic positions in the protein structure. We used the Levenshtein edit distance [46, 47] (i.e., the minimum number of single-character edits required to change one string into another) to quantify the divergence between training sequences and generated sequences across the different datasets. This allowed us to assess the sequence variability introduced by the generation process. To characterize semantic diversity, which takes into account the biological or chemical properties of the sequences rather than just their raw sequence differences, the sequence representations obtained via the ESM language model were visualized with UMAP, a dimensionality reduction technique that represents the data manifold in lower dimensions [34].
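The two sequence-level comparisons described here can be sketched compactly. The snippet below computes minimum Levenshtein edit distances between generated and training sequences and a 2-D UMAP projection of per-sequence embeddings; the python-Levenshtein and umap-learn packages are assumptions, and the UMAP hyperparameters shown are library defaults rather than the settings used in this study.

```python
import numpy as np
import Levenshtein          # pip install python-Levenshtein
import umap                 # pip install umap-learn

def min_edit_distances(generated, training):
    """For each generated sequence, the minimum Levenshtein distance to any
    training sequence (the sequence-variability metric described above)."""
    return [min(Levenshtein.distance(g, t) for t in training) for g in generated]

def umap_projection(embeddings, n_neighbors=15, min_dist=0.1, seed=0):
    """2-D UMAP projection of per-sequence embeddings (e.g., mean-pooled ESM
    representations) used to visualize semantic diversity."""
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                        n_components=2, random_state=seed)
    return reducer.fit_transform(np.asarray(embeddings))
```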
6.2.2 Evolutionary-Derived Protein Sequence Representation
In this section, we detail our approach to creating family-specific protein representations using fine-tuned methods. We divide the fine-tuning into two categories: regular fine-tuning, which adjusts model parameters using the InterPro protein family dataset, and evolutionary fine-tuning, which incorporates specific evolutionary data relevant to protein families to guide the tuning process. The performance of these fine-tuned representations is evaluated by comparing them against the baseline ESM2 (Evolutionary Scale Modeling) representation [33] in a protein stability prediction task.
6.2.2.1 Data Processing
We fine-tuned ESM2 for Endolysin and Lysozyme C, for which we obtained labeled datasets for stability prediction from FireProtDB (https://loschmidt.chemi.muni.cz/fireprotdb/). Our data processing involved assembling three distinct unlabeled datasets to fine-tune the ESM model for each protein family: (i) a collection of InterPro-derived sequences (https://www.ebi.ac.uk/interpro/) that encompass an expanded set of modern proteins based on family affiliations, (ii) a collection of the single most-probable ancestors (ASR-Max) taken from every internal node of the phylogenetic tree built from modern sequence BLAST results, and (iii) a collection of ancestral ensembles taken from the internal nodes (ASR-Dist). The ASR-Dist approach enriches the dataset with a broader spectrum of evolutionary possibilities, offering a more robust dataset that surpasses ASR-Max in both diversity and volume. This comprehensive dataset integrates extensive evolutionary insights, significantly enhancing the model's training base. We sampled 1,000 sequences inferred from each high-quality ancestral node (defined as having SH-aLRT > 80% and ultrafast bootstrap support > 95%) reconstructed in ASR and removed duplicate sequences. It is important to note that, depending on the application and computational resources, the number of ASR-Dist sequences can be tuned by modifying the number of sequences sampled from each node in the phylogenetic tree. Table 6.1 summarizes the dataset information for both protein families tested. The prediction task was stability classification (i.e., determining whether a given sequence is stable, with ΔΔG < −0.5 kcal/mol, or unstable, with ΔΔG > 0.5 kcal/mol).

Table 6.1 Dataset Quantification for Fine-Tuning Task.
Protein      InterPro ID   InterPro Seqs   ASR-Max Seqs   ASR-Dist Seqs   Classification Seqs (FireProtDB)
Endolysin    IPR034690     12K             302            40K             647
Lysozyme C   IPR036328     9K              452            36K             338

6.2.2.2 Classification Model Training
For model fine-tuning, we employed the ESM2 model (esm2_t12_35M_UR50D), which has 35M parameters, 12 transformer layers, and 480-dimensional embeddings. We unfroze its last two layers to adapt its learning to our specific datasets. A batch size of 32 sequences was utilized to optimize the training process, alongside the implementation of early stopping to mitigate the risk of over-fitting. This fine-tuning phase was critical, allowing us to tailor the model to our evolutionary-informed datasets. Post-tuning, we extracted embeddings under each of the four representations (the ESM2 baseline and the three fine-tuned variants) and employed KNN, Random Forest, and XGBoost algorithms to assess predictive performance with each representation derived from the distinct fine-tuning methods.
6.2.2.3 Evaluation of Classification Models
The evaluation of our fine-tuned language models focused on their ability to support accurate stability classification within each protein family, employing a suite of classification metrics to gauge performance comprehensively. Precision, recall, balanced accuracy, area under the curve (AUC), and the F1 score were calculated for each protein family (Endolysin and Lysozyme C) across the representations. For more robust training, we performed 5-fold cross-validation on the datasets. The trained models were then tested on a held-out test set comprising 30% of the initial data. Note that, for robustness, we repeated this procedure over 20 distinct random states and report the mean and standard deviation of all the classification scores.
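A hedged sketch of this setup, using the Hugging Face port of ESM2, is given below. The checkpoint identifier, the elided masked-language-model fine-tuning step, the mean pooling of token embeddings, and the Random Forest hyperparameters are assumptions of this sketch; the chapter specifies only the esm2_t12_35M_UR50D model, unfreezing the last two layers, a batch size of 32 with early stopping, a 70/30 split with 5-fold cross-validation, and repetition over 20 random states. The sketch covers only the held-out evaluation for a single classifier and random state.

```python
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (balanced_accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, EsmModel

# Load the 35M-parameter ESM2 checkpoint and freeze everything except the
# last two transformer layers (the unfreezing scheme described in the text).
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t12_35M_UR50D")
for param in model.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True
# ... fine-tune here on the unlabeled family sequences (e.g., with a masked-LM
# head, batch size 32, early stopping), then reuse `model` as a feature extractor.

def embed(seqs, batch_size=8):
    """Mean-pooled per-sequence embeddings from the (fine-tuned) encoder."""
    feats = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(seqs), batch_size):
            batch = tokenizer(seqs[i:i + batch_size], return_tensors="pt", padding=True)
            out = model(**batch).last_hidden_state            # (B, L, 480)
            mask = batch["attention_mask"].unsqueeze(-1)
            feats.append(((out * mask).sum(1) / mask.sum(1)).numpy())
    return np.vstack(feats)

def evaluate(X, y, seed):
    """Held-out evaluation for one random state; repeated over seeds in the text."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    pred, prob = clf.predict(X_te), clf.predict_proba(X_te)[:, 1]
    return {"balanced_accuracy": balanced_accuracy_score(y_te, pred),
            "f1": f1_score(y_te, pred), "precision": precision_score(y_te, pred),
            "recall": recall_score(y_te, pred), "roc_auc": roc_auc_score(y_te, prob)}
```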
6.3 Results
The critical importance of high-quality data in protein engineering, particularly for structure prediction and improving generalization metrics, is widely recognized. This paper explores the potential of integrating underutilized yet rich evolutionary information to enhance both generative models and language models in the field. For generative models, the selection of training data that accurately captures the prior distribution is crucial in determining the quality of the sequences produced. In parallel, the efficacy of fine-tuning language models, a state-of-the-art method for ML predictive tasks, is intrinsically linked to the caliber of the dataset used for refinement. This investigation aims to demonstrate how evolutionary information can be strategically employed in ML platforms to facilitate more informed exploration and navigation of the protein fitness landscape. By applying this approach to both sequence generation and protein classification tasks, the study seeks to leverage evolutionary data in protein engineering, potentially opening new avenues for designing proteins with enhanced stability and functionality.
6.3.1 Ancestral Sequence Generation Using Different Probabilistic Methods: Maximum Likelihood vs Bayesian
This section explores the differences in the qualities of generated sequences obtained using Bayesian inference (BAli-Phy) and maximum likelihood approaches (IQ-TREE) for ancestral sequence reconstruction. Our objective was to understand how different methods for reconstructing ancestral sequences impact the quality of the reconstructed sequences and of the sequences generated using the ancestral sequences as the training dataset. While BAli-Phy employs a Bayesian approach to simultaneously reconstruct the phylogenetic tree and sequence alignment, it is more computationally intensive compared to IQ-TREE, which uses a maximum likelihood approach. Our results, presented in Figure 6.2, illustrate the stability, sequence variability, and semantic diversity of the proteins generated using these two methods. Notably, we utilized all reconstructed nodes from both methods (i.e., without selecting for nodes with high stability) to provide a comprehensive overview of their performance. The comparison revealed no significant differences in sequence variability, semantic diversity patterns, and stability profiles between the two methods. The comparable performance of IQ-TREE and BAli-Phy across stability, variability, and semantic diversity metrics challenges expectations based on their distinct mathematical approaches (i.e., maximum likelihood and Bayesian inference). A key factor potentially contributing to this similarity is the sampling strategy. By incorporating sequences from all nodes in the phylogenetic tree, both methods likely capture a wide range of ancestral states, including those with varying degrees of certainty. This comprehensive sampling may balance out the theoretical advantages of each approach, resulting in similar overall performance. The stability results indicate that both methods generate sequences with comparable thermodynamic properties, suggesting IQ-TREE's point estimates are as robust as BAli-Phy's distribution-based estimates for maintaining protein stability. The equivalent sequence variability and semantic diversity imply that IQ-TREE's maximum likelihood approach explores sequence space as effectively as BAli-Phy's Bayesian method. For the EFE family from PK2, our results show that IQ-TREE and BAli-Phy perform similarly across key metrics. Given this, we chose to use IQ-TREE for further analyses due to its faster computation time. To improve reconstruction quality, we focused on sampling from high-confidence nodes in the phylogenetic tree. This approach combines IQ-TREE's efficiency with a strategy to minimize uncertainty in our ancestral sequence predictions.
6.3.2 Incorporating Evolutionary Information in Sequence Generation Results in Novel & Stable Proteins
Novel sequences were generated by sampling from the learned latent representations after loss minimization on the validation set. Three different sets of training data (modern, Ancestral Type 1 (homogeneous), and Ancestral Type 2 (diverse)) were used to generate distinct sets of sequences. A representative subset of sequences (1,000 sequences per data type) was randomly sampled from each dataset for both the training and generation populations. Our study's findings are promising, revealing that: (i) our datasets not only augment the volume of training data through the innovative integration of uncertainty in ML but also provide a richer set of sequences with inherently higher stability for training purposes. Moreover, (ii) the stability distribution of the sequences generated using ancestral data aligns closely with that of the training set, attesting to the potential of our method to replicate high-quality protein stability profiles in novel sequence creation.
Also, the mean RMSD values were 0.43 Å for modern-derived structures, 0.42 Å for ancestral-derived structures (Ancestral 1), and 0.39 Å for further ancestral-derived structures (Ancestral 2), indicating that the generated structures maintained folding patterns similar to the wild type. Figure 6.3 presents a comparative analysis of the thermal stability of proteins derived from both ancestral and modern sequences, examining stability within the training data and the sequences generated via the VAE. The findings underscore the significant potential of evolutionary protein sequences. Within the training dataset, ancestral proteins exhibited markedly higher stability relative to modern proteins. Furthermore, the sequences generated from these ancestral proteins retained this enhanced stability. Another notable observation is the pronounced distribution shift towards increased stability in the generated sequences, particularly when using modern sequences as compared to ancestral ones. Although the precise mechanisms underlying this shift remain unclear, it is possible that the generative model, when applied to modern sequences, mitigates noise from low-quality sequences (in terms of epistasis and folding). Conversely, the ancestral training data may inherently encompass stabilizing interactions. Consequently, the sequences generated from ancestral proteins consistently demonstrate superior stability compared to those derived from modern sequences.

We evaluated our generated proteins based on semantic diversity (i.e., the distance between training and generated sequences) and structural qualities (i.e., uncertainty quantification and intrinsic disorder region distributions). The results of these assessments are presented in Figure 6.4. The comparative analysis of ancestral and modern PK2 sequences reveals several key differences. Ancestral sequences demonstrate improved structural stability, evidenced by fewer intrinsically disordered regions and consistently higher pLDDT scores. They also exhibit greater sequence conservation, as indicated by lower edit distances between training and generated datasets. Despite this conservation, ancestral sequences display broader semantic diversity in the UMAP projection. In contrast, modern sequences show more variability in structural quality and sequence composition but form tighter clusters in semantic space. These findings highlight the potential benefits of incorporating evolutionary data in generative models for protein design. Ancestral sequences appear to combine desirable properties such as structural stability with an expanded exploration of sequence space. This unique combination could prove valuable in generating novel protein variants that maintain crucial ancestral characteristics while introducing functional innovations. By leveraging the broader semantic diversity of ancestral sequences, generative models could potentially access a wider range of protein designs, opening up new avenues for engineering proteins with enhanced or altered functions while preserving their fundamental structural integrity.
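The distribution comparisons reported in this section can be reproduced in outline with a non-parametric omnibus test followed by Dunn's post-hoc test. The chapter names only Dunn's test; pairing it with Kruskal-Wallis, the Bonferroni adjustment, and the scikit-posthocs package are assumptions of this sketch, and the input values below are placeholders rather than data from this study.

```python
import pandas as pd
from scipy.stats import kruskal
import scikit_posthocs as sp   # pip install scikit-posthocs

def compare_ddg_distributions(ddg_by_group):
    """Compare FoldX ΔΔG distributions across sequence sets (e.g., modern vs.
    ancestral training and generated sets): Kruskal-Wallis omnibus test, then
    Dunn's pairwise post-hoc test. `ddg_by_group` maps a set name to ΔΔG values."""
    groups = list(ddg_by_group)
    h_stat, p_global = kruskal(*[ddg_by_group[g] for g in groups])
    long = pd.DataFrame([(g, v) for g in groups for v in ddg_by_group[g]],
                        columns=["group", "ddg"])
    pairwise = sp.posthoc_dunn(long, val_col="ddg", group_col="group",
                               p_adjust="bonferroni")
    return h_stat, p_global, pairwise

# Example with toy values (not results from this work):
# h, p, table = compare_ddg_distributions({"modern": [...], "ancestral": [...], "generated": [...]})
```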
6.3.3 Evolutionary protein representations demonstrate significant potential for enhancing classification tasks
As detailed in the methods section, we fine-tuned the ESM2 model on three distinct datasets to obtain protein representations termed InterPro, ASR-Max, and ASR-Dist. Intriguingly, the representations derived from fine-tuning ESM2 with ancestral data exhibited comparable or enhanced performance relative to those derived from modern data on the predictive stability task, i.e., determining whether a given sequence is stable (ΔΔG < −0.5 kcal/mol) or unstable (ΔΔG > 0.5 kcal/mol), for both the Endolysin and Lysozyme C proteins. The outcomes for both protein families are presented in Table 6.2 and Table 6.3. Our ASR-Dist method, which leverages uncertainty in ASR, showed improved performance over the other representations for KNN classification, while for the ensemble-based classifiers (i.e., Random Forest and XGBoost) it demonstrated comparable or improved performance relative to the InterPro-derived representations.

Table 6.2 Comparison of Classifier Performance Across Fine-Tuned Representations – Lysozyme C. Scores are reported as mean±STD.
Classifier      Dataset    Balanced Accuracy   F1          Precision   Recall      ROC_AUC
KNN             ESM-Base   0.51±0.02           0.05±0.07   0.23±0.33   0.03±0.04   0.49±0.07
KNN             InterPro   0.70±0.05           0.54±0.10   0.74±0.12   0.44±0.12   0.81±0.04
KNN             ASR-Max    0.71±0.05           0.57±0.10   0.90±0.12   0.43±0.10   0.81±0.04
KNN             ASR-Dist   0.73±0.05           0.61±0.09   0.83±0.13   0.50±0.09   0.86±0.05
Random Forest   ESM-Base   0.51±0.02           0.06±0.07   0.38±0.45   0.03±0.04   0.52±0.06
Random Forest   InterPro   0.75±0.06           0.64±0.10   0.80±0.09   0.55±0.13   0.92±0.02
Random Forest   ASR-Max    0.73±0.06           0.60±0.10   0.78±0.11   0.50±0.12   0.93±0.02
Random Forest   ASR-Dist   0.75±0.05           0.64±0.09   0.82±0.11   0.54±0.11   0.94±0.02
XGBoost         ESM-Base   0.51±0.01           0.03±0.05   0.23±0.37   0.02±0.03   0.53±0.06
XGBoost         InterPro   0.78±0.05           0.67±0.08   0.80±0.10   0.60±0.13   0.94±0.02
XGBoost         ASR-Max    0.77±0.05           0.67±0.09   0.78±0.12   0.60±0.09   0.94±0.03
XGBoost         ASR-Dist   0.78±0.05           0.67±0.07   0.77±0.10   0.60±0.10   0.93±0.03

Table 6.3 Comparison of Classifier Performance Across Fine-Tuned Representations – Endolysin. Scores are reported as mean±STD.
Classifier      Dataset    Balanced Accuracy   F1          Precision   Recall      ROC_AUC
KNN             ESM-Base   0.82±0.05           0.73±0.08   0.85±0.08   0.66±0.10   0.93±0.03
KNN             InterPro   0.77±0.05           0.66±0.09   0.84±0.10   0.55±0.10   0.90±0.03
KNN             ASR-Max    0.80±0.05           0.69±0.08   0.78±0.09   0.62±0.09   0.91±0.04
KNN             ASR-Dist   0.82±0.05           0.73±0.08   0.84±0.09   0.65±0.10   0.94±0.03
Random Forest   ESM-Base   0.83±0.05           0.75±0.08   0.84±0.08   0.68±0.10   0.96±0.02
Random Forest   InterPro   0.82±0.04           0.73±0.07   0.85±0.06   0.65±0.09   0.96±0.02
Random Forest   ASR-Max    0.83±0.05           0.74±0.07   0.83±0.07   0.67±0.09   0.96±0.02
Random Forest   ASR-Dist   0.84±0.04           0.77±0.06   0.85±0.07   0.70±0.08   0.97±0.02
XGBoost         ESM-Base   0.85±0.04           0.77±0.05   0.85±0.06   0.72±0.08   0.95±0.04
XGBoost         InterPro   0.84±0.05           0.76±0.07   0.85±0.07   0.70±0.09   0.95±0.04
XGBoost         ASR-Max    0.84±0.05           0.75±0.07   0.83±0.07   0.69±0.10   0.95±0.03
XGBoost         ASR-Dist   0.85±0.04           0.78±0.06   0.84±0.07   0.72±0.07   0.94±0.04
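For completeness, the sketch below shows one way that per-seed scores (for example, the dictionaries returned by the evaluation sketch in Section 6.2.2) could be aggregated into the mean±STD entries reported in Tables 6.2 and 6.3; the numeric values are placeholders, not results from this study.

```python
import pandas as pd

# Per-seed metric dictionaries collected for one representation/classifier
# combination over repeated random states (placeholder values shown).
per_seed_results = [
    {"balanced_accuracy": 0.76, "f1": 0.65, "precision": 0.81, "recall": 0.55, "roc_auc": 0.93},
    {"balanced_accuracy": 0.74, "f1": 0.62, "precision": 0.79, "recall": 0.53, "roc_auc": 0.92},
    # ... one entry per random state
]

df = pd.DataFrame(per_seed_results)
summary = df.agg(["mean", "std"]).T          # rows: metrics; columns: mean, std
summary["mean±std"] = summary.apply(lambda r: f"{r['mean']:.2f}±{r['std']:.2f}", axis=1)
print(summary["mean±std"])
```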
Figure 6.1 A. Generation of high-quality evolutionary data. The phylogenetic tree for the family of interest is generated via ASR to access the ancestral sequences. Ancestral sequences are often known to be stable, promiscuous, and evolvable. To generate ancestral sequences, IQ-TREE assumes a fixed alignment and tree topology, then employs marginal reconstruction, which calculates the most likely state (amino acid), shown as yellow nodes, for each site (residue position number) independently. Ensembles of ancestral sequences can then be produced by sampling from other probable amino acids at individual positions of the ancestral sequences. Additionally, BAli-Phy is used for sequence reconstruction to co-estimate phylogeny and alignment, allowing for simultaneous inference of multiple parameters and providing a robust framework for evolutionary analysis. In brief, the same multiple sequence alignment generated by AP-LASR for IQ-TREE is used to initialize eight independent Markov chain Monte Carlo (MCMC) runs in BAli-Phy. The optimal amino acid substitution model for the runs is determined using IQ-TREE's ModelFinder. After running simulations for seven days, three runs are selected for convergence by minimizing the average standard deviation of split frequencies (ASDSF) and the maximum standard deviation of split frequencies (MSDSF). For convergence analysis, the first 2,000 iterations of each run are discarded to account for the bias of the initial random starting guess (i.e., burn-in). The ensemble of ancestors at each ancestral node is subsequently retrieved from the BAli-Phy runs, which produce a single alignment at every iteration that can be sampled. B. The obtained evolutionary information is used as training data for sequence generation and family-specific protein sequence representation.

Figure 6.2 Maximum likelihood (IQ-TREE) and Bayesian (BAli-Phy) approaches for ancestral sequence reconstruction of the EFE family from PK2 are comparable in terms of thermostability, sequence variability, and structural quality. A. Stability analysis using FoldX predictions of thermodynamic stability (ΔΔG, kcal/mol) between generated and training sequences. No statistically significant differences were observed (p-value = 0.17). B. Sequence variability, illustrated by minimum edit distances between training and generated sequences, reveals similar diversity for both methods. Similarly, semantic diversity visualized via UMAP projections of sampled generated sequences indicates comparable patterns. C. Structural quality analysis using pLDDT scores demonstrated no significant differences between methods (p-value = 0.37 for means and p-value = 0.33 for IDRs).

Figure 6.3 ASR-derived PK2 sequences exhibited higher thermal stability compared to modern sequences (both before and after generation via VAE). A. Stability analysis, Modern vs. Ancestral_Type1 (Homogeneous). The density plot shows the thermodynamic stability (ΔΔG values, measured relative to the stability of the wild type, ΔG) of modern and homogeneous ancestral (AncType1) protein sequences. B. Stability analysis, Modern vs. Ancestral_Type2 (Diverse). Ancestral-based models consistently produced sequences with improved stability compared to modern sequences. Generated sequences closely mirror the stability profiles of their respective training sets. Notably, the generated sequences from both ancestral types showed no statistically significant difference from their training sets (p-values > 0.05), while differing significantly from modern sequences (p-values < 1e-8). The statistical significance of all comparisons is provided in Supplementary Table 6B.1.

Figure 6.4 The Generation-Quality panel. Structural and sequence properties of ancestral and modern PK2 sequences before and after generative modeling. A) Structural Quality: Intrinsically Disordered Regions (IDRs), measured by pLDDT < 50 counts, and mean pLDDT scores across sampled structures, showing fewer IDRs and higher mean pLDDT scores in ancestral sequences. B) Sequence Variability: Minimum edit distances between training and generated datasets, revealing significantly higher variability in modern sequences (p-value < 0.05). C) Semantic Diversity: UMAP projection of sampled generated sequences, illustrating distinct clustering patterns with ancestral sequences exhibiting broader distribution.
Ancestral sequences demonstrate improved structural stability, greater sequence conservation, and increased semantic diversity compared to modern sequences, while modern sequences show higher sequence variability before and after training.
6.4 Discussion
In this work, we explored the integration of evolutionary information into machine learning models for protein engineering. Our approach combined ancestral sequence reconstruction (ASR) with generative modeling and language model fine-tuning to enhance both sequence generation and fitness prediction. The results demonstrated that sequences generated from evolutionary-informed datasets exhibited improved characteristics, particularly in thermal stability, compared to those derived from models trained solely on modern sequences. This highlights the potential of ASR to enrich training data for generative models, leading to more diverse and higher-quality outputs. While our study provided valuable insights, it is important to note that we only scratched the surface of the vast sequence space available through ASR. For instance, considering the 186 high-quality nodes from our Lysozyme ancestral tree reconstruction, with 15 variable positions and 4 possible amino acids per position, the theoretical sequence space encompasses approximately 200 billion unique sequences (186 × 4^15 ≈ 2 × 10^11). Our approach, utilizing clustering algorithms and strategic sampling methods, aimed to capture a representative subset of this diversity. Despite these efforts to efficiently sample the sequence space, there remains immense potential for further exploration and optimization of these evolutionary-informed datasets. Future work could focus on developing more sophisticated sampling techniques or exploring a larger portion of the available sequence space to potentially uncover even more beneficial protein variants.

To evaluate the utility of ancestral sequences on downstream stability prediction tasks, we fine-tuned the evolutionary scale ESM protein language model with reconstructed ancestral data to obtain evolutionary-driven protein representations for the Endolysin and Lysozyme C families. For Lysozyme C, ancestral-based representations showed promising results, outperforming the baseline ESM in KNN classification and matching the established InterPro method for the other classification models. In Endolysin, our novel ASR-Dist method performed comparably to the baseline and other fine-tuning approaches across various classification metrics. While not consistently outperforming existing methods, ASR-Dist demonstrated stable performance across both simple and complex classification models, suggesting potential in this data-centric approach for enhancing protein representations.

This study advances protein engineering by integrating diverse ancestral sequence reconstruction (ASR) techniques into generative modeling and representation learning. It demonstrates the generation of sequences with ancestral-like properties using machine learning, a novel achievement in the field. While previous work such as local ancestral sequence embedding (LASE) introduced ASR into representation learning using transformers trained from scratch, this approach explores fine-tuning existing models with various evolutionary and modern sequence datasets, addressing potential over-fitting issues and offering an accessible framework for researchers.
Another key innovation is the implementation and comparison of two probabilistic approaches for phylogeny inference and ancestral sequence reconstruction: maximum likelihood and Bayesian inference. This comprehensive incorporation of ASR-driven protein sequences, and the assessment of their impact on both generative modeling and representation learning, provides a nuanced view of the role of evolutionary information in protein engineering.

While this study illuminates the potential of integrating evolutionary data into ML models, it also highlights the need for further investigation. We computationally validated the stability and diversity of the sequences generated for the generative task, but experimental validation is crucial to understanding the functional advantages that evolutionary signals may confer.

BIBLIOGRAPHY

[1] P. Singh. Systematic review of data-centric approaches in artificial intelligence and machine learning. Data Sci. Manag., 6:144–157, 2023.
[2] J. Adeoye, L. Hui, and Y.X. Su. Data-centric artificial intelligence in oncology: A systematic review assessing data quality in machine learning models for head and neck cancer. J. Big Data, 2023.
[3] L. Miranda. Study notes on data-centric machine learning, 2023.
[4] D. Zha et al. Data-centric artificial intelligence: A survey, 2023.
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[6] H. Liu, M. Chaudhary, and H. Wang. Towards trustworthy and aligned machine learning, 2023.
[7] A.X. Wang, S.S. Chukova, C.R. Simpson, and B.P. Nguyen. Data-centric AI to improve early detection of mental illness. In IEEE Work. Stat. Signal Process. Proc., volume 2023-July, pages 369–373, 2023.
[8] Johannes Jakubik, Michael Vössing, Niklas Kühl, Jannis Walk, and Gerhard Satzger. Data-centric artificial intelligence, 2024.
[9] Yunkun Zhang, Jin Gao, Zheling Tan, Lingfeng Zhou, Kexin Ding, Mu Zhou, Shaoting Zhang, and Dequan Wang. Data-centric foundation models in computational healthcare: A survey. arXiv, 2024.
[10] John Adeoye, L. M. Christy Hui, and Yu-xiong Su. Data-centric artificial intelligence in oncology: a systematic review assessing data quality in machine learning models for head and neck cancer. Journal of Big Data, 2023.
[11] P.A. Romero and F.H. Arnold. Exploring protein fitness landscapes by directed evolution. Nat. Rev. Mol. Cell Biol., 10:866–876, 2009.
[12] S.A. Kauffman and E.D. Weinberger. The NK model of rugged fitness landscapes and its application to maturation of the immune response. In Mol. Evol. Rugged Landscapes Proteins, RNA Immune Syst., pages 135–175, 2018.
[13] Sonja Billerbeck. A computational walk to the hidden peaks of protein performance. Synthetic Biology, 2021.
[14] Mehrsa Mardikoraem and Daniel R. Woldring. Protein fitness prediction is impacted by the interplay of language models, ensemble learning, and sampling methods. Pharmaceutics, 2023.
[15] Shimon Bershtein, Michal Segal, Roy Bekerman, Nobuhiko Tokuriki, and Dan S. Tawfik. Robustness–epistasis link shapes the fitness landscape of a randomly drifting protein. Nature, 2006.
[16] Kevin K Yang, Zachary Wu, and Frances H Arnold. Machine-learning-guided directed evolution for protein engineering. Nature Methods, 16(8):687–694, August 2019.
[17] Yuchi Qiu, Jian Hu, and Guo-Wei Wei. Cluster learning-assisted directed evolution. Nature Computational Science, 1(12):809–818, 2021.
[18] M.A. Mena and P.S. Daugherty. Automated design of degenerate codon libraries. Protein Eng. Des. Sel., 18:559–561, 2005.
[19] W. Gao, S.P. Mahajan, J. Sulam, and J.J. Gray. Deep learning in protein structural modeling and design. Patterns, 1:100142, 2020.
[20] Iman Deznabi, Bulent Arabaci, Mehmet Koyuturk, and Oznur Tastan. DeepKinZero: Zero-shot learning for predicting kinase-phosphosite associations involving understudied kinases. Bioinformatics, 36:3652–3661, 2020.
[21] Joshua Meier, Roshan Rao, Reinier Verkuil, Jesse Liu, Tom Sercu, and Alexander Rives. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv, 2021.07.09.450648, 2021.
[22] Yvonne Joho, Vanessa Vongsouthi, Matthew A. Spence, Jennifer Ton, Chloe Gomez, Li Lynn Tan, Joe A. Kaczmarski, Alessandro T. Caputo, Santana Royan, Colin J. Jackson, and Albert Ardevol. Ancestral sequence reconstruction identifies structural changes underlying the evolution of Ideonella sakaiensis PETase and variants with improved stability and activity. Biochemistry, 62(2):437–450, 2023.
[23] Yoseph Gumulya and Elizabeth MJ Gillam. Exploring the past and the future of protein evolution with ancestral sequence reconstruction: The "retro" approach to protein engineering. Biochem J, 474:1–19, 2017.
[24] Jonas Zaugg, Yoseph Gumulya, Elizabeth MJ Gillam, and Mikael Bodén. Chapter 21 of prospective and retrospective strategies. 2014.
[25] Linus Pauling, Emile Zuckerkandl, Thorkil Henriksen, and Rolf Lövstad. Chemical paleogenetics. Acta Chem Scand, 17:S9–S16, 1963.
[26] Geeta N Eick, Jamie T Bridgham, David P Anderson, Michael J Harms, and Joseph W Thornton. Robustness of reconstructed ancestral protein functions to statistical uncertainty. Mol Biol Evol, 34:247–261, 2017.
[27] Hila Bar-Rogovsky, Adi Stern, Osnat Penn, Itay Kobl, Tal Pupko, and Dan S Tawfik. Assessing the prediction fidelity of ancestral reconstruction by a library approach. Protein Eng Des Sel, 28:507–518, 2015.
[28] Dana M Matthews, Matthew A Spence, Adam C Mater, James Nichols, Sacha B Pulsford, Mahakaran Sandhu, Joe A B Kaczmarski, Charlotte M Miton, Nobuhiko Tokuriki, and Colin J Jackson. Leveraging ancestral sequence reconstruction for protein representation learning. bioRxiv, pages 2023–12, 2023.
[29] Vanessa Vongsouthi, Rosemary Georgelin, Dana Matthews, Jake Saunders, Brendon M Lee, Jennifer Ton, Adam M Damry, Rebecca L Frkic, Matthew A Spence, and Colin J Jackson. Ancestral reconstruction of polyethylene terephthalate degrading cutinases reveals a rugged and unexplored sequence-fitness landscape. bioRxiv, pages 2024–04, 2024.
[30] Kazuhiro Nagahama, Takahira Ogawa, Takao Fujii, Masato Tazaki, Sumio Tanase, Yoshimasa Morino, and Hideo Fukuda. Purification and properties of an ethylene-forming enzyme from Pseudomonas syringae pv. phaseolicola PK2. Microbiology, 137(10):2281–2286, 1991.
[31] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
[32] Joost Schymkowitz, Jesper Borg, Francois Stricher, Robby Nys, Frederic Rousseau, and Luis Serrano. The FoldX web server: an online force field. Nucleic Acids Research, 33(suppl_2):W382–W388, 2005.
[33] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
[34] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[35] James VanAntwerp, Mehrsa Mardikoraem, Nathaniel Pascual, and Daniel Woldring. AP-LASR: Automated protein libraries from ancestral sequence reconstruction. bioRxiv, 2023.
[36] K. Katoh and D.M. Standley. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4):772–780, April 2013.
[37] Bui Quang Minh, Heiko A Schmidt, Olga Chernomor, Dominik Schrempf, Michael D Woodhams, Arndt von Haeseler, and Robert Lanfear. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Molecular Biology and Evolution, 37(5):1530–1534, 2020.
[38] M.A. Sennett and D.L. Theobald. Extant sequence reconstruction: The accuracy of ancestral sequence reconstructions evaluated by extant sequence cross-validation. Journal of Molecular Evolution, 92:181–206, 2024.
[39] Weizhong Li and Adam Godzik. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006.
[40] Si Quang Le and Olivier Gascuel. An improved general amino acid replacement matrix. Molecular Biology and Evolution, 25(7):1307–1320, 2008.
[41] Subha Kalyaanamoorthy, Bui Quang Minh, Thomas KF Wong, Arndt von Haeseler, and Lars S Jermiin. ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods, 14(6):587–589, 2017.
[42] Jaime Huerta-Cepas, François Serra, and Peer Bork. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Molecular Biology and Evolution, 33(6):1635–1638, 2016.
[43] Matthew Andres Moreno, Jeet Sukumaran, and Mark T. Holder. DendroPy 5: a mature Python library for phylogenetic computing, 2024.
[44] C. Ziegler, Claude Sinner, and Faruck Morcos. Latent generative landscapes as maps of functional diversity in protein sequence space. Nature Communications, 2023.
[45] Mehrsa Mardikoraem, Zirui Wang, Nathaniel Pascual, and Daniel Woldring. Generative models for protein sequence modeling: recent advances and future directions. Briefings in Bioinformatics, 24(6):bbad358, 2023.
[46] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.
[47] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88, 2001.

APPENDIX 6A PK2 RECONSTRUCTED PHYLOGENETIC TREE WITH SELECTED NODES FOR TRAINING THE GENERATIVE MODEL

Figure 6A.1 The tree is generated via AP-LASR and visualized with the FigTree software. The highlighted nodes (Node 10, Node 13, Node 253, and Node 384) possessed high stability and were used as starting points to produce ancestral sequences as training data for the generative model.

Figure 6A.2 Reconstructed Tree Quality Metrics for Branches.

APPENDIX 6B STATISTICAL ANALYSIS FOR STABILITY OF GENERATED PROTEINS

Table 6B.1 Comparison of Model Performance Across Different Training and Generated Data Types.
(Entries are pairwise p-values; columns follow the same order as the rows. Abbreviations: Mod = Modern, Anc1/Anc2 = Ancestral-Type1/Type2, Gen = Generated, Train = Training.)

                             Mod-Gen     Anc1-Gen    Anc2-Gen    Mod-Train   Anc1-Train  Anc2-Train
Modern-Generated             1           5.40E-10    1.39E-08    1.21E-08    2.18E-07    6.66E-06
Ancestral-Type1-Generated    5.40E-10    1           0.6972      2.31E-31    0.2143      0.0608
Ancestral-Type2-Generated    1.39E-08    0.6972      1           3.61E-28    0.4168      0.1519
Modern-Training              1.21E-08    2.31E-31    3.61E-28    1           1.65E-27    2.59E-24
Ancestral-Type1-Training     2.18E-07    0.2143      0.4168      1.65E-27    1           0.5069
Ancestral-Type2-Training     6.66E-06    0.0608      0.1519      2.59E-24    0.5069      1

APPENDIX 6C GENERATIVE MODEL ARCHITECTURE

ProteinVAE(
  (encoder): Encoder(
    (conv1): Conv1d(21, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True)
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (fc1): Linear(in_features=25856, out_features=10000, bias=True)
    (fc2): Linear(in_features=10000, out_features=5000, bias=True)
    (fc3): Linear(in_features=5000, out_features=2000, bias=True)
    (fc4): Linear(in_features=2000, out_features=500, bias=True)
    (fc5): Linear(in_features=500, out_features=100, bias=True)
    (z_mean): Linear(in_features=100, out_features=100, bias=True)
    (z_log_var): Linear(in_features=100, out_features=100, bias=True)
    (sampling): Sampling()
  )
  (decoder): Decoder(
    (fc1): Linear(in_features=100, out_features=500, bias=True)
    (fc2): Linear(in_features=500, out_features=2000, bias=True)
    (fc3): Linear(in_features=2000, out_features=5000, bias=True)
    (fc4): Linear(in_features=5000, out_features=10000, bias=True)
    (fc5): Linear(in_features=10000, out_features=25856, bias=True)
    (deconv1): ConvTranspose1d(64, 21, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn1): BatchNorm1d(21, eps=1e-05, momentum=0.1, affine=True)
    (dropout): Dropout(p=0.2, inplace=False)
  )
)

APPENDIX 6D STATISTICAL ANALYSIS FOR PROTEIN REPRESENTATION PERFORMANCES – LYSOZYME C

Table 6D.1 Results of Dunn Test in Case of Statistical Significance after Kruskal-Wallis for Different Metrics – Lysozyme C. Only KNN results had statistical significance and needed post-hoc reports.

Group 1    Group 2     p-value    Classifier  Metric             Mean 1  Mean 2
InterPro   ASR-Max     0.000165   KNN         Precision          0.73    0.89
InterPro   ASR-Dist    0.003335   KNN         ROC AUC            0.81    0.86
ASR-Max    ASR-Dist    0.003965   KNN         ROC AUC            0.81    0.86
InterPro   ASR-Dist    0.021234   KNN         Precision          0.73    0.83
ASR-Max    ASR-Dist    0.105603   KNN         Precision          0.89    0.83
InterPro   ASR-Max     0.903116   KNN         ROC AUC            0.81    0.81

APPENDIX 6E STATISTICAL ANALYSIS FOR PROTEIN REPRESENTATION PERFORMANCES – ENDOLYSIN

Table 6E.1 Results of Dunn Test in Case of Statistical Significance after Kruskal-Wallis for Different Metrics – Endolysin. Only KNN results had statistical significance.

Group 1    Group 2     p-value    Classifier  Metric             Mean 1  Mean 2
InterPro   ASR-Dist    0.000686   KNN         ROC AUC            0.90    0.93
InterPro   ASR-Dist    0.005173   KNN         Recall             0.55    0.65
InterPro   ASR-Dist    0.005527   KNN         Balanced Accuracy  0.76    0.81
ASR-Max    ASR-Dist    0.010570   KNN         ROC AUC            0.91    0.93
InterPro   ASR-Dist    0.020592   KNN         F1 Score           0.65    0.72
InterPro   ASR-Max     0.033665   KNN         Recall             0.55    0.62
InterPro   ASR-Max     0.067481   KNN         Balanced Accuracy  0.76    0.80
ASR-Max    ASR-Dist    0.155124   KNN         F1 Score           0.68    0.72
InterPro   ASR-Max     0.217757   KNN         F1 Score           0.65    0.68
ASR-Max    ASR-Dist    0.284875   KNN         Balanced Accuracy  0.80    0.81
ASR-Max    ASR-Dist    0.388140   KNN         Recall             0.62    0.65
InterPro   ASR-Max     0.524948   KNN         ROC AUC            0.90    0.91
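For reference, the kind of significance testing summarized in Tables 6D.1 and 6E.1 (a Kruskal-Wallis test followed by Dunn post-hoc comparisons) can be reproduced with a few lines of Python. The snippet below is an illustrative sketch, assuming SciPy and the scikit-posthocs package, with made-up per-fold scores rather than the values behind these tables.

import numpy as np
from scipy.stats import kruskal
import scikit_posthocs as sp            # scikit-posthocs package (assumed)

# Hypothetical per-fold metric values (e.g., KNN ROC AUC) for three representations.
scores = {
    "InterPro": [0.80, 0.81, 0.82, 0.80, 0.81],
    "ASR-Max":  [0.81, 0.80, 0.82, 0.81, 0.81],
    "ASR-Dist": [0.86, 0.85, 0.87, 0.86, 0.85],
}

stat, p = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {stat:.3f}, p = {p:.4g}")

if p < 0.05:
    # Dunn's post-hoc test, run only when the omnibus test indicates a difference.
    pvals = sp.posthoc_dunn(list(scores.values()), p_adjust="bonferroni")
    pvals.index = pvals.columns = list(scores.keys())
    print(pvals)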
CHAPTER 7 CONCLUSION

7.1 Overview

This thesis has presented a multifaceted approach to protein engineering, leveraging the power of machine learning (ML) and computational tools. The research addressed the complex challenge of designing proteins with desired functions through several innovative strategies. First, a robust methodology was developed for mapping next-generation sequencing data to experimental annotations, enhancing signal-to-noise ratios, and optimizing directed evolution starting points. The evaluation of various protein representation methods, including one-hot and physicochemical encodings as well as language-based representations such as ESM and UniRep, led to the creation of ensemble techniques that significantly improved ML performance. For proteins under 120 amino acids, this approach achieved a remarkable 94% F1 score.

The study then progressed to modeling protein-drug interactions, focusing on the therapeutically important organic anion-transporting polypeptides (OATPs). A comprehensive pipeline was established, integrating AlphaFold for structure prediction, molecular docking for interaction modeling, and a Heterogeneous Graph Neural Network (HeteroGNN) to capture both inter- and intra-molecular interactions. This approach revealed inconsistencies in reported OATP-drug interactions, particularly for OATP1B1, and identified key drug attributes such as topological surface area, heteroatom count, and hydrophobicity as critical factors. The research demonstrated that drug-based representations alone are insufficient for interpreting OATP-drug interactions, emphasizing the necessity of structure-based methods.

Furthermore, generative modeling to create novel proteins with desired properties was explored. By incorporating evolutionary insights through ancestral sequence reconstruction (ASR), both generative and language models were enhanced. A Variational Autoencoder (VAE) trained on ASR-derived sequences produced novel proteins that maintained the stability profiles of ancestral sequences, as validated using AlphaFold2 and FoldX. The integration of evolutionary-driven protein representations also improved downstream prediction tasks, particularly in protein stability classification. Fine-tuning the ESM2 language model with ASR data created more effective protein embeddings, enhancing classification accuracy for multiple protein families.

In the following section, broader perspectives on the integration of protein engineering and machine learning are presented, along with suggestions for future improvements and research directions. These insights aim to guide the next generation of researchers in furthering the advancements made in this thesis and addressing the remaining challenges in the field.

7.2 Perspective

Significant advancements have propelled the field of protein engineering forward in remarkable ways. AlphaFold3 has achieved unprecedented accuracy in predicting protein structures and interactions, while language models like ESM3 [1] have enhanced our understanding of protein sequences and functions. Diffusion models such as Chroma [2] have enabled the generation of novel protein-protein interaction domains with greater control, advancing programmable protein design. Recent studies have even demonstrated the generation of new CRISPR-Cas proteins that are more compatible with human cells [3], opening new avenues for gene editing technology. These developments highlight the transformative potential of integrating AI with biological research. Despite these advancements, several critical challenges persist.
The lack of standardized benchmarks and comprehensive metrics, given the vast and diverse nature of biological space, makes it difficult to establish universally applicable standards for evaluating ML models in protein engineering. Ensuring reliable and reproducible outcomes requires addressing data leakage and inconsistencies in ML implementations. Current methods also often fail to fully account for the interconnected nature of proteins and their interaction profiles.

The gap between computational discoveries and experimental validation presents a significant bottleneck in protein engineering. Unlike image generation, where results can be assessed visually and instantly, protein engineering lacks immediate feedback mechanisms, which significantly slows the iterative process of design, testing, and refinement. In Chapter 6, protein embeddings were used to analyze the semantic diversity of AI-generated proteins, and software such as AlphaFold was employed to assess their folding profiles, followed by FoldX for stability analysis. While these approaches are useful for analyzing success, more generalizable and rapid methods are needed to assess AI-generated proteins in terms of their likelihood of success for the intended properties. Ideally, such methods would provide a likelihood panel for both success and failure across different protein attributes, recognizing that protein engineering often involves balancing multiple properties simultaneously. This comprehensive view of potential outcomes would enhance both the speed and reliability of the protein engineering process.

Another significant challenge is the low signal-to-noise ratio in experimental data. Next-generation sequencing (NGS) experiments and docking studies often produce noisy data. To address this issue, several strategies were implemented throughout this research. In Chapter 2, the PEAR software was utilized to reduce noise in the obtained sequence data. Chapter 5 focused on mitigating structural noise for OATPs by analyzing the distribution of docking scores and selecting thresholds that maximize signal. Furthermore, collaborative work with the Harada lab at MSU-IQ investigated the impact of experimental conditions, such as incubation time, on data noise levels. These efforts demonstrate the potential for improving data quality through careful preprocessing and analysis of experimental parameters. Nevertheless, there remains a need for more advanced methods to identify and quantify the various sources of noise, as well as for enhanced post-processing techniques that maximize signal quality in protein engineering data.

Shifting emphasis from developing ever more complex models to critically evaluating and enhancing data quality is crucial for ensuring model effectiveness and reliability. For example, more than 99% of the entries in UniProt, a widely used public protein database, remain unannotated, and the quality of these proteins is not assessed [4]. This thesis addressed this issue by incorporating high-quality ancestral data into the generative modeling platform, leveraging evolutionary information to improve the training data. Despite this advancement, there is room for further improvement in making generative models conditional on co-optimizing attributes such as stability and fitness.

In conclusion, this research has shown that integrating machine learning with high-quality data and advanced computational tools can significantly enhance protein engineering efforts.
The innovative approaches developed throughout this work offer a solid foundation for future advancements, promising more efficient and accurate design of proteins with desired properties and functionalities. This progress paves the way for continued breakthroughs in the field, bringing us closer to overcoming existing challenges and unlocking new possibilities in protein engineering.

BIBLIOGRAPHY

[1] Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, 2024.
[2] John B Ingraham, Max Baranov, Zak Costello, Karl W Barber, Wujie Wang, Ahmed Ismail, Vincent Frappier, Dana M Lord, Christopher Ng-Thow-Hing, Erik R Van Vlack, et al. Illuminating protein space with a programmable generative model. Nature, 623(7989):1070–1078, 2023.
[3] Jeffrey A. Ruffolo, Stephen Nayfach, Joseph Gallagher, Aadyot Bhatnagar, Joel Beazer, Riffat Hussain, Jordan Russ, Jennifer Yip, Emily Hill, Martin Pacesa, Alexander J. Meeske, Peter Cameron, and Ali Madani. Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences. bioRxiv, 2024.
[4] Baohui Lin, Xiaoling Luo, Yumeng Liu, and Xiaopeng Jin. A comprehensive review and comparison of existing computational methods for protein function prediction. Briefings in Bioinformatics, 25(4):bbae289, 2024.