MACHINE LEARNING FOR TRANSITION METAL COMPLEXES By Hongni Jin A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Chemistry – Doctor of Philosophy 2024 ABSTRACT Transition metal complexes, dubbed ‘Lego molecules’, are composed of small molecules, ions, or atoms arranged around a central metal. The diversified research field of organometallic compounds includes but is not limited to the study of metal- ligand interactions, structure-property relationships, and practical applications. This dissertation leverages machine learning techniques to expedite the research in this domain. The first part focuses on neural network potentials (NNPs). A Zn_NNPs model was built to depict the potential energy surface of zinc complexes. In this work, a simple but useful embedding of partial charges was proposed, which could model the long-range interactions accurately. Furthermore, an Fe_NNPs model was designed to identify the lowest energy spin state of Fe (II) complexes. The model integrates electronic characteristics such as total charge and spin state to account for long-range interactions effectively. For each model, a high-quality data set including tens of thousands of distinctive conformations was well curated using metadynamics. The third model is a scaffold-based diffusion model, called LigandDiff which can generate valid, novel, and unique ligands for organometallic compounds. Users only need to specify the desired size of the ligand, LigandDiff then generates a diverse and potentially infinite number of ligands of that size from scratch. Collectively, these models surpass traditional computational methods on both accuracy and efficiency, demonstrating substantial potential to accelerate transition metal complexes research. Copyright by HONGNI JIN 2024 This dissertation is dedicated to my elder brother, Jay Jin, for his unconditional support, encouragement, and belief in me. iv ACKNOWLEDGEMENTS Thank my families for their support and love. As the youngest child of my generation in our large family, I have always received extra care and attention from my family members. Thank my parents for their continuous encouragement and trust which have given me the confidence to explore the unlimited possibilities of my life. And I also want to express my special gratitude to my elder brother, who sets an excellent model for me to follow. He is truly my personalized ChatGPT, always providing me valuable and timely advice about my life. I am grateful to my advisor, Professor Kenneth M. Merz Jr. for his dedication and support. As an educator, he has given me trust and the freedom to pursue my interests. As a researcher, his extensive knowledge has answered many of my questions. As an editor, he has taught me to express my ideas clearly and succinctly in scientific writing. And I thank my committee members (Professor Angela K. Wilson, Professor Gary J. Blanchard, and Professor Guo-Wei Wei) for their insightful feedback on my research, especially their valuable experience on scientific writing. I also appreciate all members of the Merz group for their useful advice on my work. Last but not least, thank all my friends throughout my life for their companionship. Their presence brightens my life. v TABLE OF CONTENTS LIST OF ABBREVIATIONS ............................................................................... vii CHAPTER 1 INTRODUCTION ............................................................................ 
1 1.1 General Introduction of Transition Metal Complexes .................................. 1 1.2 Machine Learning ......................................................................................... 9 CHAPTER 2 MODELING ZINC COMPLEXES USING NEURAL NETWORKS ........................................................................................................ 14 2.1 Zinc Complexes .......................................................................................... 14 2.2 Neural Network Potentials .......................................................................... 17 2.3 Zinc Data Set Curation ................................................................................ 44 2.4 Zn_NNPs Framework ................................................................................. 53 2.5 Results and Discussion ............................................................................... 58 2.6 Conclusions ................................................................................................. 65 CHAPTER 3 MODELING Fe (II) COMPLEXES USING NEURAL NETWORKS ........................................................................................................ 66 3.1 Fe (II) Complexes ....................................................................................... 66 3.2 Fe (II) Data Set Curation ............................................................................. 70 3.3 Fe_NNPs ..................................................................................................... 77 3.4 Results and Discussion ............................................................................... 80 3.5 Conclusions ................................................................................................. 87 CHAPTER 4 3D TRANSITION METAL COMPLEXES GENERATION ........ 88 4.1 Generative Models ...................................................................................... 88 4.2 LigandDiff ................................................................................................... 93 4.3 Data Set Curation and Evaluation Metrics .................................................. 99 4.4 Results and Discussion ............................................................................. 104 4.5 Conclusions ............................................................................................... 111 BIBLIOGRAPHY ............................................................................................... 112 APPENDIX A: TABLES .................................................................................... 136 APPENDIX B: FIGURES .................................................................................. 166 vi LIST OF ABBREVIATIONS TMCs Transition metal complexes CN Coordination number AI Artificial intelligence ML Machine learning MD Molecular dynamics PES Potential energy surface MLPs Machine learning potentials NNPs Neural network potentials RMSE Root mean square error GNNs Graph neural networks CSD Cambridge Structural Database MAE Mean absolute error HS High spin LS Low spin SCO Spin crossover SE Splitting energy DDPM Denoising Diffusion Probabilistic Model vii CHAPTER 1 INTRODUCTION 1.1 General Introduction of Transition Metal Complexes Transition metals are chemical elements which mainly lie in the d block of the periodic table. They include 3d elements from Sc to Zn, 4d elements from Y to Cd, and 5d elements from Hf to Au. 
The abundance of transition metals on Earth varies widely, from ubiquitous to rare; 3d Fe ranks as the fourth most abundant element in the Earth’s crust while 4d Tc is only artificially produced.1 Most transition metals display a silvery color, but copper and gold are slightly reddish, and mercury is the only one that is liquid at ambient temperatures. The study of these transition metals started in the nineteenth century and quickly drew attention because the compounds of these metals exhibit properties different from those of typical covalent organic compounds. First, the partially filled d orbitals can accommodate electrons from outside to form dative covalent interactions. This bonding pattern is usually called coordination, which is a representation of the Lewis acid-base interaction. Such coordination compounds are interchangeably called complexes, where the transition metal is the electron acceptor, while the electron donor is known as a ligand. The coordination sphere typically contains the central transition metal atom or ion along with the ligands that are bonded to it. The type of ligand is very flexible, and the sole criterion for ligands is that they must possess at least one pair of electrons to donate. Therefore, ligands cover a diverse range of chemical species, whether atoms, molecules, or ions. Some common ligands are water, ammonia, and chloride. The coordination number (CN) is the count of ligand attachment sites surrounding a transition metal center in a complex, ranging from 1 to 16,2,3 and it depends on both the central metal and the ligands. Some metals prefer certain coordination numbers since these coordination numbers may stabilize electron energies. Moreover, the nature of the ligands also plays a role: the shape and size of the ligands greatly determine the coordination number, and larger ligands typically give rise to a reduced coordination number. Overall, this diverse coordination range further enriches the chemical space of TMCs. Second, a characteristic feature of transition metals is their ability to exist in various oxidation states, because, in contrast to the alkali metals and alkaline earth metals, they can lose varying numbers of electrons. Alkali metals, with a single electron in their s orbitals, typically exhibit a +1 oxidation state. Similarly, alkaline earth metals, which have two electrons in their s orbitals, almost invariably show a +2 oxidation state. In contrast, transition metals are more complex because their oxidation states are determined by both the charge of the ligands and the overall charge of the complex. For instance, most transition metals in the 3d block have oxidation states of +2 or +3, like Co2+/Co3+, Ni2+/Ni3+, and Fe2+/Fe3+, but the element Mn has more choices, ranging from +2 to +7. The flexible oxidation states give TMCs intriguing redox properties, resulting in a myriad of potential applications that will be discussed later. In addition, the spin state of transition metals can vary. The spin state describes the spin configuration of the central metal’s d electrons and manifests as the number of unpaired electrons on the metal. Usually, a complex can be categorized as either high spin or low spin; sometimes an intermediate spin state is also possible. The spin state is determined by the relative magnitudes of the crystal field splitting energy (∆) and the pairing energy (P). When ∆ exceeds P, electrons occupy all the lower energy orbitals and pair up within them prior to climbing to the higher energy orbitals.
In this case, it is energetically more advantageous for electrons to pair and occupy all of the low energy orbitals. Conversely, when the pairing energy exceeds the crystal field energy, electrons will first fill all available orbitals singly before any pairing occurs, regardless of the orbitals’ energy levels. Many properties of TMCs are highly related to the spin state, such as the magnetic properties and the spin-crossover properties. The overall unique properties of TMCs can be briefly summarized below: 1. Various charge states. The transition metals are usually cationic in aqueous solution. But the charge is manipulated by the coordination environment so that the whole complex can be either cationic, neutral, or anionic. 2. Unique interactions with ligands. The dative covalent bonding between ligands and transition metal center is selective and specific. The ligands can tune the electron configurations of the d orbitals of the metal, thereby forming unique overall property that the individual metal or ligands do not own. 3. Diverse coordination geometries. The flexible choice of CN provides a wide variety of coordination shapes different from the organic molecules. 3 Furthermore, isomerism exists extensively in coordination compounds which further enriches the diversity of geometries. For instance, one type of isomerism is attributed to the relative arrangement of the ligands. A planar complex M(L1)2(L2)2, where M is the metal, L1 and L2 are two different ligands, can be in either cis or trans form, depending on the relative position of the same type of ligand. TMCs have extensive applications in many fields.4-20 In biology, optical imaging is an important tool in life sciences for disease diagnostics. Many luminescent TMCs have shown promising applications in bioimaging and biosensing due to their remarkable photophysical properties.21,22 For instance, oxygen deprivation in biological systems can cause various diseases such as fatty liver, cerebral infarction, diabetic retinopathy, and cancer.23-25 Investigating the mechanism of diseases related to oxygen deprivation requires O2 imaging technology that is capable of detecting O2 in real time with high selectivity and high stability. Studies have shown that a wide range of TMCs including Pd (II) and Pt (II) porphyrins, Ir (III) complexes, Ru (II) complexes have strong phosphorescence within the visible to near-infrared spectrum, with notably extended lifetimes exceeding 1.0 microseconds and they have been successfully used to monitor O2 level in cell nucleus,26 PC12 cells,27 bone marrow,28 etc. Transition metals also play important catalytic and structural roles in biology. For instance, iron is an essential ion for the process of the respiration and electron transport in biological systems.29 Fe is bound to a porphyrin, and the whole 4 complex, called a heme, is the major oxygen carrier in blood. Small ligand like CO can easily bind with the heme so that the cellular uptake of oxygen is blocked which deactivates oxygen transport, leading to the death of the organisms. Such process is called carbon monoxide poisoning.30 In addition to respiration, Fe actively interacts with many enzymes to support the redox processes,31 electron transfer reactions,32 and even nitrogen fixation in plants.33 Another important application of TMCs is metallodrugs. 
Barnett Rosenberg and colleagues at Michigan State University fortuitously discovered that cisplatin has anticancer effects, which initiated metallodrug research.34 Currently, the platinum-based metallodrug family, including carboplatin and oxaliplatin, is extensively used to treat ovarian, breast, colon, testicular, and prostate cancer.35 Some other TMCs have also shown intriguing properties as promising metallodrug candidates. Ru-based complexes exhibit high affinity for DNA and can reversibly bind to the double helix.36 They are expected to be potential candidates as antineoplastic drugs since they can tune the tumor cell cycle and cause apoptosis.37 A couple of complexes, such as KP1019, KP1339, and NAMI-A, are being evaluated in clinical trials.38 Copper(II) complexes have emerged as a promising chemotype against cancer since they are capable of generating reactive nitrogen species and reactive oxygen species, resulting in cellular death.39 A typical family of copper metallodrugs is the Casiopeínas, which have phenanthroline or bipyridine bidentate ligands in the coordination sphere, and the second charged ligand is either O-O or N-O coordinated, such as acetylacetonate, salicylaldehyde, and aminoacidate. These compounds are able to increase endonuclease G and activate caspase 3, thereby leading to apoptosis.40 Recently, other new copper complexes similar to the Casiopeínas have also been successfully synthesized, and the results indicate that these compounds have remarkable cytotoxic and antiproliferative bioactivities in multiple cancer cell lines, including breast cancer, melanoma, osteosarcoma, cervical cancer, colon cancer, and ovarian carcinoma.41-43 Overall, the mechanisms of metallodrugs against diseases include inhibition of angiogenesis, induction of cell apoptosis, alteration of the cytoskeleton, etc.44 They open novel pathways for treating diseases that traditional organic medications cannot address due to issues with drug resistance.45 But currently, the comprehensive functions of TMCs in disease diagnostics and treatments are still unclear and more research is greatly needed. TMCs have also been widely used in materials science. One important application is their contribution as organometallic catalysts in chemical synthesis. The properties of transition metal catalysts are determined, to a large degree, by the ligands coordinated to the metal. The electronic properties, the size of the ligand, and the ligand bite angle are a few of the important parameters to consider in the design of new TMCs as catalysts.46,47 For instance, electrocatalytic CO2 conversion offers tremendous potential to harness carbon-free renewable energy in line with green chemistry.48 The high energy barrier for electrochemical CO2 activation is related to the high energy requirement for converting the linear neutral molecule into the bent radical anion, leading to a high overpotential demand for the one-electron reduction from CO2 to CO2•−. But once a transition metal catalyst is introduced to the system, this critical potential can be met by the reduction potential of the catalyst, thus enabling the conversion.49 The whole process starts from the coordination of CO2 to the central transition metal so that an electron can transfer to CO2. The selection of ligands in TMCs is of great importance in this step. Suitable ligands should be flexible electron carriers that can accept or donate electrons via redox reactions. Examples include pyridines and imines.
Currently, TMCs including the elements Mo, W, Mn, Rh, Re, Cr, Fe, Ru, Ni, Pd, Pt, Cu, Zn, Os, and Ir have been successfully synthesized as catalysts for the electrochemical conversion of CO2.50 One more example of TMCs in materials science is their utilization as photosensitizers. Photosensitization is a process in which energy is transferred from the photosensitizer to a substrate so that the substrate can be activated to undergo further chemical transformations.51 The generation of singlet oxygen is a typical example of a photosensitization reaction.52,53 The reaction is initiated by the excitation of the photosensitizer via the absorption of a single photon, resulting in a high-energy singlet state. The photosensitizer then undergoes intersystem crossing to reach an excited triplet state and conveys its energy to ground-state molecular oxygen (3O2). The transferred energy allows the production of metastable excited-state singlet oxygen (1O2). TMCs are ideal candidates for photochemical applications due to their intense absorption in the visible light spectrum. The effectiveness of a photosensitizer depends on the presence of readily available, low-energy valence excited states. For TMCs, these usually correspond to charge transfer states between metal and ligand, either from metal to ligand or from ligand to metal. For instance, Ru (II) complexes are the most widely used photosensitizers because their low-lying valence excited states are predominantly long-lived metal-to-ligand charge transfer states.54 Considering that Ru is extremely rare on earth, other transition metals, especially the first-row TMCs, have recently been explored extensively. The relative inaccessibility of MLCT excited states may limit the viability of these complexes to activate photosensitization, but well-designed novel complexes show a promising balance between cost, abundance, and efficiency.55 1.2 Machine Learning Artificial intelligence (AI) refers to a computer system’s capacity to emulate human cognitive functions, including learning and problem-solving, with machine learning (ML) representing a subset of AI applications. ML is a process in which mathematical models are designed to make predictions by learning implicitly from available data. The beauty of ML is that the learning process is driven by the model itself without direct human intervention. The learning process is evaluated by the loss function of an ML model, and the overall goal is to reduce the error between the true values and the predicted values as much as possible. Depending on the type of data used, machine learning mainly includes supervised learning and unsupervised learning. The former uses labeled training data, thereby having a baseline understanding of what the correct output should be. By contrast, the latter uses unlabeled data, which means the model learns independently to understand the inherent structure of the given data without any specific guidance. While the type of data is a distinctive difference between both approaches, they also have different goals and applications which set them apart from each other. Supervised learning is used to investigate the underlying relation between input and output, while unsupervised learning is more focused on discovering new patterns and relationships in raw, unlabeled data. For instance, a supervised model might be designed to predict flight times given conditions such as weather, airport traffic, and peak flight hours.
An unsupervised model, by contrast, might be used to automatically categorize unlabeled images. In this case, the data are not labeled, so the model does not know what object is in each image, but it can discern the common characteristics of images of the same type and thus classify the images correctly. Traditional machine learning requires feature engineering with human intervention: features related to the target property need to be carefully determined and extracted and then fed into the model. Traditional models usually have simple frameworks and are thus easy to interpret. Examples of traditional machine learning models include linear regression, logistic regression, support vector machines, decision trees, random forests, etc. However, feature engineering is a meticulous and time-intensive task, requiring expert knowledge to pinpoint the pertinent features for the model. In addition, due to their simple and fixed structures, traditional models are usually unable to model complex and high-dimensional data, which limits their application domains. To circumvent the limitations of traditional machine learning models, deep learning is used. Deep learning uses neural networks to process raw data without any feature engineering, which eliminates the human intervention and thus allows the use of large amounts of data. Artificial neural networks (ANNs), or simply neural networks, were proposed to imitate the mechanism by which the human brain operates. The human brain consists of billions of neurons, all interconnected and sending electric signals back and forth to each other to help humans process information and make decisions. Neural networks were first designed with inspiration from these biological neurons, dating back to 1943. A typical neural network architecture consists of one input layer, a minimum of one hidden layer, and one output layer. Every layer has multiple nodes, analogous to the neurons in biological systems. Usually, nodes are fully connected between two consecutive layers, and the connection strength is determined by the weights. The input layer processes the data input from outside, analyzes it, and then passes the data to the hidden layers. The hidden layers further transform the data via nonlinear functions and pass the transformed data to the output layer. Subsequently, the output layer gives the final result of all the data processed by the neural network. The transfer of information from one layer to the subsequent one in a neural network is termed feedforward propagation, which produces the predictions during training. Once the error between the predictions and the true values is determined, backpropagation is activated to minimize the error by fine-tuning the neural network’s weights and biases. Backpropagation uses the gradient descent algorithm to propagate the error from the output back to the input layer. The gradient descent algorithm computes the gradient of the error function with respect to the model’s parameters, such as the weights and biases, and then updates the weights and biases in the direction of the negative gradient to reduce the error. In essence, feedforward and backpropagation work together in a neural network's training phase: feedforward makes predictions, backpropagation assesses and corrects them, and this cycle repeats until the network optimally learns to map inputs to the correct output. Machine learning in chemistry is not new.
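Before surveying chemistry applications, the training cycle just described can be made concrete. The following NumPy sketch is purely illustrative (the layer sizes, learning rate, and synthetic sine-curve data are arbitrary choices, not anything used in this dissertation): it performs feedforward passes through a one-hidden-layer network, backpropagates the error, and updates the weights and biases by gradient descent.

```python
# Minimal feedforward + backpropagation sketch with plain NumPy (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task: y = sin(x) on [-pi, pi]
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
y = np.sin(X)

# One hidden layer with a tanh activation, one linear output neuron
W1, b1 = rng.normal(0, 0.5, (1, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 0.5, (16, 1)), np.zeros(1)
lr = 0.05

for epoch in range(2000):
    # Feedforward: propagate inputs through the hidden and output layers
    h = np.tanh(X @ W1 + b1)          # hidden activations
    y_pred = h @ W2 + b2              # predictions
    err = y_pred - y
    loss = np.mean(err ** 2)          # mean squared error loss

    # Backpropagation: gradients of the loss w.r.t. weights and biases
    grad_out = 2 * err / len(X)                 # dL/dy_pred
    dW2 = h.T @ grad_out
    db2 = grad_out.sum(axis=0)
    grad_h = (grad_out @ W2.T) * (1 - h ** 2)   # tanh derivative
    dW1 = X.T @ grad_h
    db1 = grad_h.sum(axis=0)

    # Gradient descent: step along the negative gradient
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final training loss: {loss:.4f}")
```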
The earliest application of data science techniques to chemistry research is the determination of molecular formula from low resolution mass spectrometry in 1969.56 Machine learning has wide-ranging applications in chemistry: For instance, traditional ML methods have made contributions to quantitative structure activity relationship (QSAR) applications.57,58 Usually, a set of molecular descriptors which are precomputed molecular physicochemical properties are fed into a traditional ML model, such as linear regression model, random forests, support vector machines, etc.. With such a well- trained model, the target property of new, unseen molecules can be quickly predicted. In addition, using ML to accelerate traditional quantum mechanical (QM) calculations has been emerging in the last few years. Having a deep understanding of the electronic structure of chemical systems is crucial for the design of molecules and materials. The most accurate way to decode the chemical structures is to solve the Schrödinger equation used to calculate the wave function of a given system. However, this equation can be solved exactly only for the single electron system, e.g., the hydrogen atom but not for multi-electron systems, such as the Helium atom. For larger molecules, carefully chosen approximations are needed at the cost of losing accuracy as little as possible.59 Machine-learning potentials (MLPs) with reference to high-level quantum chemistry methods have gained increasing attention in computational chemistry.60-63 Well-built MLPs can reach as high accuracy as their reference method but at much lower computational cost. Since the 12 MLPs are built based on non-linear functions which do not require any physical knowledge, such methods are very flexible and can be used for almost any system, such as a molecule,64,65 nanoporous materials,66 oxides,67 and metals.68,69 And one more example of ML for chemistry is computational material design. The design of novel structures with desirable properties is a core part of chemistry. In the early years, materials discovery was primarily driven by serendipity.70 Compared to traditional high-throughput screening methods, deep generative models can generate rational molecules at a reduced cost since little human intervention is required and much time can be saved. Once a generative model has been well designed, unlimited and diverse new molecules can be automatically generated within seconds. Currently, several generative models have been widely used for molecular generation, including variational autoencoder (VAE), convolutional neural network (CNN), Transformer-based models, recurrent neural network (RNN), flow-based models, generative adversarial network (GAN), and diffusion models.71 The powerful learning ability of ML has revolutionized many fields in modern society, and we have already witnessed great progress in chemistry that interacts with ML. Currently, AI or specifically ML, is still quickly developing, and powerful large language models (LLMs), such as ChatGPT, demonstrate potential for advancing human development. However, how these tools can benefit chemists thus accelerating chemistry research, has been underexplored to date. 
CHAPTER 2 MODELING ZINC COMPLEXES USING NEURAL NETWORKS 2.1 Zinc Complexes Zinc is a crucial trace element, essential for all living organisms on our planet.72-75 Zinc deficiency can result in a variety of health issues related to skin, bones, and the reproductive, digestive, and immune systems.76 With an [Ar]3d104s2 electronic configuration, zinc has only one oxidation state, but it exists in different isotopic forms with masses ranging from 64 to 70. The completely filled d-shell enables Zn2+ to coordinate various ligands in highly flexible geometries and to undergo fast ligand-exchange reactions.77 The coordinating atoms of zinc are usually N, O, or S, and these electron-donor ligands attach to the central metal zinc in tetrahedral, trigonal bipyramidal, or octahedral geometries, among which tetrahedral, with a coordination number of four, is the most common. As the second most abundant essential trace metal in the human body, zinc plays important structural and catalytic roles in various biological activities.78-82 Zinc is a good electron acceptor, thus serving as a Lewis acid in catalysis. Its structural functions are validated by the fact that zinc is found in various protein structures and superstructures. The importance of zinc to biological systems can be briefly summarized as follows: first, zinc is indispensable to the genome, based on the fact that approximately 25% of the zinc content of rat liver is found in the nucleus.83 Specifically, zinc actively participates in genetic stability and gene expression in various ways, such as the structure of chromatin, DNA repair, RNA transcription, DNA and RNA polymerases, as well as programmed cell death.84 Second, the d10 configuration makes Zn2+ redox inactive, an ideal antioxidant which inhibits any possibility of free radical reactions, and this property is crucial for antioxidant protection in biological systems. The antioxidant effect of zinc can be mediated in multiple ways, including the direct activity of zinc ions, the regulatory effect on metallothionein induction, as well as its structural functions in antioxidant enzymes. Specifically, zinc ions can directly bind to thiol functional groups to prevent oxidation.85 Meanwhile, zinc is also identified as a crucial component of the antioxidant enzyme known as Cu, Zn-superoxide dismutase (SOD1), which plays a vital role in defending the body against oxidative stress. When there is a deficiency of zinc in the body, the activity of SOD1 can be suppressed, potentially leading to increased oxidative damage in cells.86 Furthermore, it has been indicated that zinc can indirectly influence the functionality of various other antioxidant enzymes.87 Third, zinc complexes are reported to have great medicinal effects for a variety of diseases. For instance, zinc complexes have shown appealing properties as photosensitizers in photodynamic therapy against cancers.88 Compared to platinum derivatives, which are the most widely used anticancer agents, zinc complexes have lower toxicity, which makes them ideal alternatives. Besides, Zn complexes have been widely used as anti-Alzheimer agents,89 anticonvulsants,90 and for antidiabetic treatment.91-93 2.2 Neural Network Potentials Electronic structure theory methods enable the understanding of molecules and materials at the quantum level and thus complement experimental studies. The rapid development of computational resources allows large-scale electronic structure simulations for simple systems.
However, the ever-increasing demand for accurate computational treatment of large and complex systems is currently infeasible to meet. Density functional theory (DFT), currently the computational backbone of electronic structure theory, provides a practical compromise between chemical accuracy and computational cost, but the exact exchange-correlation functional of Kohn-Sham DFT is still unknown, and the approximate functionals are formulated in increasingly complicated analytical forms as they climb Jacob’s ladder.94 In addition, the functionals are crafted to satisfy certain physical conditions, including asymptotic behaviors and scaling characteristics; however, these designs depend heavily on human heuristics, particularly in the intermediate regime where the asymptotic principles are not applicable.95 Alternatively, force field methods have been proposed to model large chemical systems by summing over bonded and nonbonded interactions.96 The bonded terms describe the simple interactions at close distances between directly bonded atoms, as well as angles and dihedrals among atoms connected through shared bonding partners. Conversely, nonbonded terms model the pairwise interactions between atoms, mainly electrostatics with Coulomb’s law and dispersion with Lennard-Jones parameters.97 The simple form of classical force field (FF) methods greatly improves computational efficiency compared to DFT methods, thus making it possible to model systems that include thousands of atoms via molecular dynamics (MD) simulations. Even though a physically motivated description of chemical interactions is retained, deep insights derived from MD simulations are usually limited due to the low accuracy of FF methods.98 Especially for systems where polarization and many-body interactions have to be considered, large errors may appear using FF methods since both types of interactions are completely ignored in classical FF methods. Considering the limitations of both DFT and classical FF methods, new pathways for accurately and efficiently modeling the electronic structure of molecules, clusters, and materials are highly desirable, and ML is a good candidate. ML methods strive to discern the implicit relationships between inputs (either predefined chemical descriptors or simply xyz coordinates) and outputs, i.e., the chemical properties. The learning process relies on the provided data set, and a well-trained model can capture the fundamental physical laws of quantum mechanics embedded in the data. Practically, ML methods do not need to deal with any mathematical formulas that obey the physical principles to represent the structure-property relation, which greatly simplifies the calculations. This unique ability allows one to investigate the realm of chemistry and forecast the attributes of new molecules and materials with unparalleled efficiency and remarkable accuracy.99-101 A potential energy surface (PES) delineates the relationship between a system’s energy and its geometrical parameters, such as the atomic coordinates, under the Born-Oppenheimer approximation. Each point on the PES represents a unique conformation of the given configuration at a particular energy. As a result, the PES is utilized to locate stable conformers, i.e. local minima, to investigate the minimum energy pathways among numerous possible conformers, and to find transition states, which together contain all the information about chemical reactions.
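As a toy illustration of working with a PES, the snippet below defines a one-dimensional model potential (a Morse function with arbitrary, purely illustrative parameters) and locates its minimum numerically; in real applications the analytic function is replaced by electronic structure calculations.

```python
# Toy example: locate the minimum of a one-dimensional model PES.
# The Morse parameters below are arbitrary illustrative values.
import numpy as np
from scipy.optimize import minimize

D_e, a, r_e = 4.5, 1.9, 1.1   # well depth, width parameter, equilibrium distance

def morse(r):
    """One-dimensional Morse potential energy at bond length r[0]."""
    return float(D_e * (1.0 - np.exp(-a * (r[0] - r_e))) ** 2)

# Start from a stretched geometry and relax to the nearest local minimum
result = minimize(morse, x0=np.array([1.8]))
print(f"optimized bond length: {result.x[0]:.3f}, energy: {result.fun:.6f}")
```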
Furthermore, the PES is also utilized to run MD simulations to gain understanding of the reaction dynamics.104 The PES aims at providing an understanding of systems at the atomic level, such as small organic molecules, liquids, solids, and polymers, because all stable and metastable structures, atomic vibrations, transition states, and activation barriers between various structures can be tracked on the PES. However, MD simulations assisted by the PES are challenging. On one hand, for large systems the PES can only be generated using quantum mechanics/molecular mechanics (QM/MM) techniques, since it is practically impossible to determine the complete PES for systems of thousands of atoms due to the complexity of the system. For instance, the PES was used to investigate the catalytic mechanism of enzymes.105 Insights into electron movement in a chemical reaction and the mechanistic role of active-site residues can all be obtained by analyzing the PES. However, the semiempirical methods used in QM/MM are often unreliable for providing accurate microscopic chemical properties. On the other hand, to develop the full PES, the energy of a nonlinear molecule with N atoms is a function of 3N-6 degrees of freedom. The computational expense for calculating the PES escalates quickly with an increase in the number of atoms because electronic structure theory or DFT methods are utilized: the computational scaling of DFT is typically cubic with respect to the number of atoms (O(N³)), whereas the coupled-cluster singles, doubles, and perturbative triples [CCSD(T)] method scales as the seventh power (O(N⁷)) with the number of atoms. To circumvent the limitations mentioned above, machine learning potentials (MLPs) were proposed to represent the multi-dimensional PES two decades ago.106 MLPs learn the contours of the PES using reference data curated from first-principles calculations. These MLPs implicitly encode the atomic interactions in terms of nuclear charges and atomic positions without a significant loss of accuracy compared to the electronic structure calculations, yet they are much faster than those traditional quantum calculations. To build an ML model that investigates the potential energy surface (PES), suitable reference data are of great importance. The reference data are usually obtained from ab initio calculations, but the ab initio calculations are only required for data preparation, since once the model is trained with the reference data, any prediction can be obtained directly from the model with ab initio level accuracy. The reference data determine the applicability domain of the trained model as well as its reliability. Any deficiencies in the data will unavoidably result in imperfections of the trained model. Therefore, the reference data stand as a crucial element of any model. But in computational chemistry, compiling the data sets is not easy. This is primarily because each reference point derives from calculations that are both computationally intensive and complex, restricting the volume of data that can be gathered. Besides, the chemical space of molecules, clusters, and materials is extremely large, and it is not easy to locate the representative geometries efficiently. The most common strategies for sampling and generating the reference data sets can be summarized below: (1) ab initio molecular dynamics (AIMD) Sampling.
In this method, dynamical trajectories at finite temperature are produced using forces derived from electronic structure computations. Although this method is expensive, it is able to accurately describe chemical processes, such as chemical bond breaking and forming. In most cases, the system is simplified so that only N nuclei and Ne electrons are considered to meet the Born-Oppenheimer approximation, and the dynamics of the nuclei are assumed to occur on the ground-state electronic surface. But due to the impossibility of precisely solving the differential equations of the ground-state electronic structure, approximate electronic structure methods are widely used, among which DFT is the most common one due to its good performance at a relatively acceptable computational cost. (2) Normal Mode Sampling. This method does not require running any MD simulations. To generate a set of data spanning the energy range around the minimum-energy structure, it starts from an equilibrium point on the PES, where the atoms of the energy-minimum geometry are randomly displaced along the normal modes. To achieve this, the normal mode coordinates at the minimum position are first calculated, and the displacement coordinates are then obtained based on a harmonic potential derived from the normal mode coordinates. The single point energy of the newly generated geometry is calculated as the reference data while the displaced coordinates serve as the input. A typical example of this sampling method is ANI-1.102 Although this method is efficient since no MD simulations are involved, only samples close to minima can be generated, which limits the sampled energy space and thus results in redundant geometries in the data set. (3) Metadynamics Sampling. Metadynamics103 is a dynamic sampling method which is capable of biasing configurations away from the positions that have already been visited. This enhanced sampling method uses collective variables to define a biased potential, compelling the system to explore less probable states, so that all states on the PES can be sampled. The collective variables are a predefined number of degrees of freedom. In metadynamics, the externally applied bias potential, a function of the collective variables, is iteratively added to the Hamiltonian of the system, which induces the system to sample the high-energy areas. One benefit of this approach is that the high-energy landscape of the PES, which is usually ignored in classical MD simulations, can be frequently visited. Therefore, the data collected from this method are evenly distributed, instead of being limited to a narrow energy window. One drawback of ML methods is their limited extrapolation ability, since the trained models can only make reliable predictions within the training data domain. For data curation in ML, sufficient sampling of the PES is therefore crucial. Of the three sampling methods discussed above, metadynamics sampling is highly recommended. In the past 20 years, various MLPs have been proposed. Models based on traditional ML include support vector machine potentials,107 atomic cluster expansion potentials,108 spectral neighbor analysis potentials,109,110 Gaussian approximation potentials,111,112 moment tensor potentials,113 gradient domain machine learning,114 etc.
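As a schematic illustration of how such kernel-based potentials operate, the sketch below fits a Gaussian-kernel ridge regression model to a handful of energies from a toy one-dimensional potential; the functional form, kernel width, and regularization strength are arbitrary illustrative choices and do not correspond to any published potential. Gaussian approximation potentials and related kernel methods follow the same basic pattern, with descriptors and kernels tailored to atomic environments.

```python
# Schematic kernel ridge regression on a toy 1D potential (illustrative only).
import numpy as np

r_train = np.linspace(0.8, 3.0, 12)                  # training geometries
E_train = (1.0 / r_train**12) - (2.0 / r_train**6)   # toy Lennard-Jones-like energies

gamma, lam = 4.0, 1e-8                               # kernel width, regularization

def gaussian_kernel(a, b):
    # Pairwise Gaussian (RBF) kernel matrix between two sets of geometries
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# Solve (K + lam*I) alpha = E for the regression weights
K = gaussian_kernel(r_train, r_train)
alpha = np.linalg.solve(K + lam * np.eye(len(r_train)), E_train)

# Predict the energy at new geometries as a weighted sum of kernel values
r_test = np.linspace(0.9, 2.9, 5)
E_pred = gaussian_kernel(r_test, r_train) @ alpha
print(np.round(E_pred, 4))
```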
As a concrete example, Roman and coworkers developed a least-squares support vector machine (LS-SVM) model to investigate the interatomic potentials of around 200 small organic molecules which contain only the five elements H, C, N, O, and F.107 To accurately model the pairwise interactions in the system, constitutional descriptors, such as the mole fractions of different atoms and the size of the molecule, and quantum-chemical descriptors, including the average polarizability, dipole moment, quadrupole moment, and HOMO-LUMO gap, were used to decode the structure of the molecules. With these molecular descriptors, it is possible to gain a clear intuition of what the model learns from the provided reference data, thus allowing one to understand the structure-property relations. The LS-SVM is able to deal with multivariate calibration problems and solve both linear and nonlinear multivariate calibrations. A regularization technique can also be introduced to the LS-SVM model to better balance overfitting and underfitting. The findings indicated that the LS-SVM is more efficient to train than ANNs. Furthermore, for two extra test sets, the LS-SVM model showed better interpolation and extrapolation abilities than ANNs. Although in this specific example of the LS-SVM, traditional ML methods outperform ANNs, this conclusion is not universal. For large amounts of data, which are necessary for real-world applications, ANNs definitely surpass traditional ML methods due to their structural flexibility. The potentials modelled using neural networks are called neural network potentials (NNPs). Starting from 1995, NNPs have been developed extensively and can be classified into four generations.115 Doren and coworkers first proposed a single feedforward neural network model to estimate the adsorption energy of an H2 molecule on the Si (100) cluster surface.116 This first-generation neural network is a two-dimensional system since the input layer has only three input coordinates and the single neuron in the output layer gives the predicted energy. At least one intermediate hidden layer resides between the initial input and the final output layer. But no physical information is embedded in the nodes of these hidden layers, since they are used just to increase the neural network’s flexibility and allow for the processing of complex features and patterns in the input data before reaching the output. Adding more hidden layers and nodes increases the network’s capacity for flexibility, thus giving better generalizability. The relation between any two nodes in two consecutive layers is defined as $y = wx + b$ (1) where w is the weight, b is the bias, and both are learnable parameters. However, this simple linear representation is not enough. A nonlinear function, or activation function, is applied so that any arbitrary function can be well represented. The output of a node is then defined as $y = \sigma(wx + b)$ (2) where $\sigma$ is the activation function. Various activation functions are frequently used. Some examples are available from Figure 35 to Figure 38 (see APPENDIX B: FIGURES). The advantages of these first-generation NNPs are obvious: they can accommodate numerous training parameters so that any physical principles can be modelled, and no preliminary equations are required to represent these physical rules, which makes the method simple to implement. Furthermore, the high flexibility also enables the model to accurately represent the energy of the system with respect to the atomic coordinates.
Finally, this method allows the calculation of analytic derivatives. Hence, both energy and forces can be calculated at speeds that are much faster than first-principles methods. However, the limitations of this single feed-forward neural network are also obvious. First, this method is not applicable to large systems, since extra input neurons are required whenever a new atom is introduced to the system. For a system with hundreds of atoms, the number of input nodes is excessively large, and it can impede efficient training of the model due to increased computational demands and potential overfitting. Second, in this simple feed-forward ANN, the symmetry is not guaranteed. Changing the input order of the atoms results in a different energy output, which contradicts the fundamental principle that the PES should inherently be invariant under translation, rotation, and permutation of identical particles. For example, for a water monomer the input can be the pairwise distances of the atoms; since the two O-H bonds are chemically equivalent, exchanging the input order of the two H atoms should not change the energy of the water molecule. However, because each weight and bias in the neural network is usually numerically different, different input orders can give rise to different outputs. Finally, once the NNP is determined, it can only be applied to a specific class of system, because the framework of this neural network prevents it from predicting the potential energy of a system with a different number of atoms. Any new system containing more or fewer atoms requires retraining. The restrictions of these low-dimensional neural networks prevent their application to complex systems. To solve these issues, Behler and Parrinello introduced the second-generation NNPs in 2007.117 The first improvement is to abandon the single feedforward neural network. Instead, in the second-generation NNPs, each atom is given a neural network to predict its atomic energy. By aggregating these atomic energies, the cumulative energy for the entire system can be determined by $E_T = \sum_{i}^{N} E_i$ (3) where $E_i$ is the atomic energy and $E_T$ is the total energy across all N atoms. The unique design of this method is that all atoms of the same element share the same neural network with the same weights and biases, which reduces the computational cost. In addition, this atomic neural network allows one to model the PES for any arbitrary system: if a new atom is introduced to an existing system, a corresponding neural network of that element can easily be added to the whole framework of the neural network. On the other hand, an atomic neural network can also easily be deleted if one atom is removed. The second improvement is that the locality approximation is introduced. Instead of including all pairwise interatomic interactions, interactions are only taken into account between atoms that fall within a predefined cutoff radius. These short-range interactions can cover the main interactions among atoms without a significant loss of accuracy. Finally, the so-called atom-centered symmetry functions (ACSFs) are introduced to encode the local structural fingerprints of the atomic environments. Meanwhile, these descriptors also preserve translational, rotational, and permutational invariance. One important component of ACSFs is the cutoff function, which is used to define the local atomic environments.
The cutoff function should meet some criteria: (1) it should be differentiable to ensure there is no discontinuity in the descriptor values or the corresponding derivatives; (2) it should decay smoothly to zero at the cutoff radius to make sure the interactions decrease at larger distances and become zero outside the cutoff. A common cutoff function117 adapted from the cosine function is defined as $f_c(R_{ij}) = \begin{cases} 0.5\left[\cos\left(\pi \frac{R_{ij}}{R_c}\right) + 1\right], & R_{ij} \le R_c \\ 0, & R_{ij} > R_c \end{cases}$ (4) where $R_{ij}$ denotes the distance between atom i and atom j, and $R_c$ denotes a predefined cutoff radius. The shape of this cutoff function is given in Figure 1. The local atomic environments are then described based on this cutoff function. The ACSFs include two types of functions, i.e., the radial and angular symmetry functions. Both symmetry functions depend on the distance between the center atom i and its neighbor j. They complement each other to fully describe the local atomic environment of each atom. And each type of symmetry function has a range of different functional forms. Figure 1. The cosine-adapted cutoff function. But the symmetry functions have to meet some requirements. First, like the cutoff function, they should also decay in value as a neighbor moves away from the center atom and become zero beyond the cutoff range, reflecting how physical interactions decay with distance. Moreover, they should be able to capture the minimal differences between similar structures, such as conformers. Each unique structure should have a unique representation decoded by these symmetry functions. Finally, they should not depend on the number of neighbors, since the number of atoms within the threshold value varies in different molecules, but the dimensionality of the input layer of the neural network stays fixed once it is determined. The radial symmetry functions are two-body terms, based on the pairwise distances among atoms. They are designed to describe the radial environment of atoms. The frequently used radial function is a sum of Gaussians over all neighbors, $G_i^{rad} = \sum_{j \ne i} e^{-\eta \left(R_{ij} - R_s\right)^2} f_c(R_{ij})$ (5) Figure 2. Radial symmetry functions (η = 2). Figure 2 shows the radial functions with a set of different cutoff radii. Summing over these Gaussian functions ensures that the representation of the local atomic environment is not affected by the number of neighbors, so that molecules of various system sizes can all be fed into the neural network. The radial function’s spatial range is controlled by the parameter η to provide a smooth transition of both potentials and forces at the cutoff.118 However, using only the radial symmetry functions is not enough to exactly describe the local atomic environments, because the radial functions only cover the radial environment; for some systems, such as square planar coordination and tetrahedral coordination, if all the bond lengths are the same, the radial function is not able to distinguish the two geometries. The angular symmetry function is then designed to solve this issue. Angular terms are constructed to incorporate the angles for any three atoms. A typical form is $G_i^{ang} = 2^{1-\zeta} \sum_{j,k \ne i} \left(1 + \lambda \cos\theta_{ijk}\right)^{\zeta} e^{-\eta\left(r_{ij}^2 + r_{ik}^2 + r_{jk}^2\right)} f_c(r_{ij}) f_c(r_{ik}) f_c(r_{jk})$ (6) where $\theta_{ijk}$ is the angle spanned by the atoms i, j, and k. The parameter $\lambda = \pm 1$ inverts the shape of the cosine function to capture an accurate depiction across different values of $\theta_{ijk}$, while $\zeta$ controls its width. As shown in Figure 3, the exponent $\zeta$ can achieve good angular resolution. Figure 3. The angular term $2^{1-\zeta}\left(1 + \lambda \cos\theta_{ijk}\right)^{\zeta}$ for a set of $\zeta$ values with $\lambda = 1$.
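To make the preceding definitions concrete before discussing the angular term further, the sketch below evaluates the cutoff function of eq 4 and the radial symmetry functions of eq 5 for a toy geometry and feeds the resulting descriptors into placeholder per-element "networks" whose outputs are summed as in eq 3. The parameter values, the geometry, and the random linear maps standing in for trained element networks are all illustrative assumptions, not those of any published potential.

```python
# Minimal sketch of the cutoff function (eq 4), radial symmetry functions (eq 5),
# and the atomic-energy decomposition of eq 3 (illustrative values throughout).
import numpy as np

def f_cut(r, r_c=6.0):
    """Cosine cutoff of eq 4: smooth decay to zero at the cutoff radius r_c."""
    return np.where(r <= r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_acsf(dists, etas, r_s=0.0, r_c=6.0):
    """Radial symmetry functions of eq 5 for one atom, one value per eta."""
    return np.array([np.sum(np.exp(-eta * (dists - r_s) ** 2) * f_cut(dists, r_c))
                     for eta in etas])

# Toy geometry: a water-like molecule (coordinates in angstrom, illustrative)
symbols = ["O", "H", "H"]
coords = np.array([[0.00, 0.00, 0.00],
                   [0.96, 0.00, 0.00],
                   [-0.24, 0.93, 0.00]])
etas = [0.5, 1.0, 2.0, 4.0]

# Placeholder "element networks": fixed random linear maps from descriptor to
# atomic energy, standing in for the trained per-element neural networks.
rng = np.random.default_rng(0)
element_nets = {el: rng.normal(0, 0.1, size=len(etas)) for el in set(symbols)}

E_total = 0.0
for i, el in enumerate(symbols):
    dists = np.linalg.norm(coords - coords[i], axis=1)
    dists = dists[dists > 1e-8]              # exclude the atom itself
    G_i = radial_acsf(dists, etas)           # descriptor of atom i
    E_total += element_nets[el] @ G_i        # atomic energy, summed as in eq 3
print(f"toy total energy: {E_total:.4f}")
```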
Returning to eq 6, the term $2^{1-\zeta}$ is a normalizing factor to control the range of the angular symmetry functions. And since $\theta_{ijk}$ and $2\pi - \theta_{ijk}$ should give the same angular function value, the cosine function is used to keep the symmetry. Both radial and angular symmetry functions use a set of different parameters to fully describe the atomic environments. The ACSFs are useful descriptors for NNPs and have shown successful applications in a variety of systems. For example, ANI-165 is a well-trained NNP for small organic molecules that have fewer than 8 heavy atoms of carbon, nitrogen, and oxygen. From ~58k neutral molecules, the authors used the normal mode sampling method to generate ~17.2 million conformers. They then redesigned the symmetry functions. First, they introduced a predefined parameter to shift the angular functions. Second, the cutoff radius was added to the distance exponent part of the angular function. Both modifications allow the descriptors to recognize different molecular features, such as bonding patterns and functional groups, more accurately. The ANI-1 model uses 32 evenly spaced hyperparameters for the radial functions and 16 shifting hyperparameters for the modified angular functions, resulting in a total of 768 predefined parameters. It includes 3 hidden layers with 128, 128, and 64 neurons, respectively. The results indicate that the ANI-1 model can yield an RMSE of 0.6 kcal/mol with respect to the DFT reference method and also has good transferability for predicting the energetics of larger systems with 10-24 heavy atoms. Although the second-generation NNPs with ACSFs greatly improve the performance compared to the first-generation single feed-forward neural network, one limitation is that these predefined descriptors require much human expertise, since each hyperparameter needs to be manually selected or adapted through numerous tests, which is laborious and time-consuming. In addition, an increasing number of input dimensions can rapidly lead to high computational cost for both descriptor calculation and model evaluation in multicomponent systems. To overcome this bottleneck of ACSFs, learnable descriptors are introduced. The base framework is ‘message passing neural networks’ (MPNNs), in which every molecule is considered as a three-dimensional graph.119 MPNNs are adapted from Graph Neural Networks (GNNs),120 in which every sample is considered as a graph. A graph is an ensemble of objects with specific connectivity, and it includes three types of attributes: node, edge, and global. Generally, the graph represents the relations (edges) between a collection of nodes. The simplest GNN applies an individual multilayer perceptron (MLP) to each element in a graph, i.e., for each node an MLP is applied and returns a learned node embedding, and a learned embedding can also be applied to each single edge and to the whole graph. Finally, an updated graph is obtained. One drawback of this simplest GNN is that it does not consider the connectivity of the graph, since each node, edge, and the global context are processed independently. But the connectivity contains very important information; for example, the connectivity of an atom with its neighbors reflects its local environment, which greatly affects its own atomic contribution to the total energy. MPNNs were then proposed to overcome this limitation. An MPNN aims at updating all attributes of a graph while maintaining the graph symmetry, i.e., the connectivity is unchanged during the course of the transformation.
In other words, MPNNs follow a ‘graph-in, graph-out’ mechanism, meaning that the information loaded into a graph’s nodes, edges, and global context is progressively transformed, while the connectivity between nodes is preserved. Molecules in chemistry are good examples of graphs, since all atoms in a molecule interact with each other via electrons, and the interactions can be reflected by bonds connecting two atoms. Therefore, for a given molecule, each atom can be considered as a node and each bond is regarded as an edge in GNNs. The task on the graph can be either node-level, edge-level, or global-level. For example, on the global level, we can predict a property of the whole graph, like whether a ligand binds to a receptor. For an edge-level property, we can predict the type of a bond: single, double, or even triple. Finally, for a node-wise property, GNNs can be designed to estimate the atomic partial charges of a given molecule. In terms of NNPs based on MPNNs, the node-level prediction is of greatest concern: the goal is to predict the atomic energy of each node and sum them up to get the total energy of the molecule. MPNNs leverage the connectivity of graphs, which makes GNN models more sophisticated. Further, multiple GNN layers can be stacked together, through which a node can eventually incorporate information from across the entire graph. MPNNs involve three steps: 1. Each node in the graph collects all of its neighbors’ embeddings. 2. All these neighboring messages are aggregated through a pooling technique. 3. The pooled messages are passed through nonlinear layers to update the node information. Specifically, each node’s embedding is first randomly initialized, $h_i^0 \in \mathbb{R}^d$, and each node exchanges its own information with its neighbors via the message passing block $M_t$. The central atom first collects its neighbors’ messages, $m_i^{t+1} = \sum_j M_t\left(h_i^t, h_j^t, e_{ij}\right)$ (7) where $m_i^{t+1}$ is the total message passed from the neighbors j to the central node i at step t, and the bonding information of the pairwise atoms, $e_{ij}$, is included to model the interactions between i and j. The central node is then updated based on both the gathered message $m_i^{t+1}$ and its own representation $h_i^t$ via the update block, $h_i^{t+1} = U_t\left(h_i^t, m_i^{t+1}\right)$ (8) The message passing and update blocks together are called one interaction. And each node goes through the interaction block multiple times, allowing the message to be disseminated throughout the molecule. At the final step T, each node transforms its updated embedding into an atomic energy, $E_i = R\left(h_i^T\right)$ (9) Finally, all atomic energies are summed up as the total energy via eq 3. Compared to neural networks based on ACSFs, MPNNs do not require any predefined descriptors because these are replaced by neural networks that learn all pairwise interactions implicitly. In MPNNs, $M_t$, $U_t$, and $R$ are all nonlinear functions designed by neural networks, and they are much more expressive than manually selected descriptors due to their flexible framework. Figure 4. The schematic process of SchNet. A well-designed MPNN for NNPs is SchNet.121 The inputs of SchNet are the nuclear charges $Z = \{Z_1, Z_2, \ldots, Z_N\}$ and the atom positions $R = \{r_1, r_2, \ldots, r_N\}$. First, these nuclear charges are transformed into high-dimensional features to represent the atom type, meaning that the same atom type gets the same initial nuclear embedding. This is a common step for initialization in MPNNs.
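Before turning to SchNet’s specific design choices, the generic interaction loop of eqs 7-9 can be written compactly. The PyTorch sketch below uses hypothetical layer sizes, a GRU-style update, a simple distance as the edge feature, and a toy three-atom graph; it is a minimal illustration of the message/update/readout pattern, not the published SchNet implementation.

```python
# Generic message passing / update / readout loop of eqs 7-9 (illustrative only).
import torch
import torch.nn as nn

class ToyMPNN(nn.Module):
    def __init__(self, hidden=32, steps=3, max_z=100):
        super().__init__()
        self.embed = nn.Embedding(max_z, hidden)          # initial h_i from atom type
        self.message = nn.Linear(2 * hidden + 1, hidden)  # M_t(h_i, h_j, e_ij)
        self.update = nn.GRUCell(hidden, hidden)          # U_t(h_i, m_i)
        self.readout = nn.Linear(hidden, 1)               # R(h_i^T) -> atomic energy
        self.steps = steps

    def forward(self, z, pos, edge_index):
        h = self.embed(z)                                  # node embeddings
        src, dst = edge_index                              # edges j -> i
        for _ in range(self.steps):
            e_ij = (pos[src] - pos[dst]).norm(dim=1, keepdim=True)     # edge feature
            msg = self.message(torch.cat([h[dst], h[src], e_ij], dim=1))
            m = torch.zeros_like(h).index_add_(0, dst, msg)            # sum over neighbors (eq 7)
            h = self.update(m, h)                                      # node update (eq 8)
        e_atom = self.readout(h)                                       # atomic energies (eq 9)
        return e_atom.sum()                                            # total energy (eq 3)

# Toy "molecule": 3 atoms (O, H, H) with both directions of each bond listed
z = torch.tensor([8, 1, 1])
pos = torch.tensor([[0.00, 0.00, 0.00], [0.96, 0.00, 0.00], [-0.24, 0.93, 0.00]])
edge_index = torch.tensor([[1, 2, 0, 0], [0, 0, 1, 2]])   # (source, target) pairs
model = ToyMPNN()
print(model(z, pos, edge_index))
```

In practice the raw interatomic distance used here as the edge feature is replaced by a learned expansion of the distance, which is exactly the design choice described next for SchNet.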
The first feature of SchNet is the expansion of distances in a set of Gaussian basis functions, $e_k(r_j - r_i) = \exp\left(-\gamma \left(\lVert r_j - r_i \rVert - \mu_k\right)^2\right)$ (10) Eq 10 is similar to the radial symmetry functions in ACSFs: it expands the distance by adding a nonlinear transformation, $\mathbb{R} \rightarrow \mathbb{R}^D$, where D is the dimension of the expanded distance. Both $\gamma$ and $\mu_k$ are hyperparameters that determine the degree of expansion. Another contribution of SchNet is how it decodes the edge information. Instead of using the direct relative positions of pairs of atoms, the authors used convolutional filters to further transform the bonding information. These convolutional filters are designed by a neural network, $m_i^{t+1} = \sum_j M_t\left(h_i^t, h_j^t, e_{ij}\right) = \sum_j h_j^t \circ W^t\left(e_k(r_j - r_i)\right)$ (11) where $\circ$ represents element-wise multiplication and $W^t$ are the convolutional filters at step t. The advantage of $W^t$ is that the expanded distances can be mapped from $\mathbb{R}^D \rightarrow \mathbb{R}^F$, where F is the number of hidden features. The framework of SchNet is given in Figure 4. As the results in the original work show, SchNet models the interatomic potentials very well. The authors first tested the well-known benchmark data set QM9,122 which includes ~131k organic molecules with fewer than nine heavy atoms of carbon, nitrogen, oxygen, and fluorine. SchNet evaluated 12 molecular properties, which include the minimal energy, enthalpy, the energies of the HOMO and LUMO along with the corresponding energy gap, polarizability, harmonic frequencies, dipole moments, etc. SchNet shows good performance on all these tested properties. For example, the mean absolute error (MAE) of the minimal energy predictions is 0.014 eV. In the past five years, many variants of SchNet have been proposed to further improve the performance of NNPs, such as PaiNN,123 tensor field networks (TFNs),124 NequIP,125 and SE(3)-transformers,126 to name a few. The common feature of these NNPs is that they are all equivariant GNNs. For example, in order to overcome the limitation of only invariant representations in SchNet, a new type of rotationally equivariant representation was proposed in PaiNN. As shown in Figure 5 (a), only distances and angles are considered in SchNet, and since both representations are rotationally invariant, they are unable to differentiate the two structures. However, the directional vectors in Figure 5 (b) can easily recognize the differences between the two structures. Therefore, geometric vectors and tensors should be introduced to NNPs to make the model more expressive. Another advantage of these equivariant GNNs is that with the equivariant atom-wise representations, the model can predict tensorial properties accurately, such as forces, molecular dipole moments, polarizability, etc. Figure 5. Invariant representations and equivariant representations. The second-generation NNPs with either predefined or learnable descriptors have greatly advanced MLPs. They are the mainstream methods in machine learning for MD simulations. However, their limitations are also very obvious: only short-range interactions are considered. This design can dramatically decrease the computational complexity, but for systems where long-range interactions play an important role, large errors may appear.
Electrostatics is a major component of long-range interactions, and the relatively weak dispersion interactions can also contribute substantially in large systems.127 It is necessary to incorporate long-range interactions in MLPs: doing so not only covers the interactions beyond the cutoff radius, following Coulomb’s law, which adds a physically meaningful energy term to the MLPs, but also allows the cutoff radius to be reduced to a smaller threshold, which shrinks the configurational space that must be sampled. The few NNPs which contain electrostatic interactions explicitly are called third-generation NNPs. The first example is an extension of the high-dimensional NNPs described above, proposed in 2011.128,129 In this method, a second neural network is designed to predict the atomic partial charges. Like the neural network for atomic energy predictions, the partial charges depend on a set of ACSFs representing the local atomic environments. The electrostatic energy $E_{elec}$ is subsequently computed from the predicted partial charges q using Coulomb’s law. Finally, the system’s total energy is calculated as