OPTIMIZING AND IMPROVING THE FIDELITY OF REACTIVE, POLARIZABLE MOLECULAR DYNAMICS SIMULATIONS ON MODERN HIGH PERFORMANCE COMPUTING ARCHITECTURES

By

Kurt A. O'Hearn

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2022

ABSTRACT

OPTIMIZING AND IMPROVING THE FIDELITY OF REACTIVE, POLARIZABLE MOLECULAR DYNAMICS SIMULATIONS ON MODERN HIGH PERFORMANCE COMPUTING ARCHITECTURES

By

Kurt A. O'Hearn

Reactive, polarizable molecular dynamics simulations are a crucial tool for the high-fidelity study of large systems with chemical reactions. In support of this, several approaches have been employed with varying degrees of computational cost and physical accuracy. One of the more successful approaches in recent years, the reactive force field (ReaxFF) model, was designed to fill the gap between traditional classical models and quantum mechanical models by incorporating a dynamic bond order potential term. When coupling ReaxFF with dynamic global charge models for electrostatics, special considerations are necessary for obtaining highly performant implementations, especially on modern high-performance computing architectures.

In this work, we detail the performance optimization of the PuReMD (PuReMD Reactive Molecular Dynamics) software package, an open-source, GPLv3-licensed implementation of ReaxFF coupled with dynamic charge models. We begin by exploring the tuning of the iterative Krylov linear solvers underpinning the global charge models in a shared-memory parallel context using OpenMP, with the explicit goal of minimizing the mean combined preconditioner and solver time. We found that appropriate solver tuning yielded significant speedups and scalability improvements. Following these successes, we extend these approaches to the solvers in the distributed-memory MPI implementation of PuReMD, and we broaden the scope of optimization to other portions of the ReaxFF potential such as the bond order computations. Here again, sizable performance gains were achieved for large simulations numbering in the hundreds of thousands of atoms.

With these performance improvements in hand, we next change focus to another important use of PuReMD – the development of ReaxFF force fields for new materials. The high fidelity inherent in ReaxFF simulations for different chemistries oftentimes comes at the expense of a steep learning curve for parameter optimization, due in part to complexities in the high-dimensional parameter space and due in part to the necessity of deep domain knowledge of how to adequately control the ReaxFF functional forms. To diagnose and combat these issues, a study was undertaken to optimize parameters for Li-O systems using the OGOLEM genetic algorithms framework coupled with a modified shared-memory version of PuReMD. We found that with careful training set design, sufficient optimization control through tuned genetic algorithms, and improved polarizability through enhanced charge model use, higher accuracy was achieved in simulations involving ductile fracture behavior, a phenomenon that has heretofore been difficult to model correctly.

Finally, we return to performance optimization for the GPU-accelerated distributed-memory PuReMD codebase. Modern supercomputers have recently achieved exascale levels of peak arithmetic rates due in large part to the design decision to incorporate massive numbers of GPUs.
In order to take advantage of such computing systems, the MPI+CUDA version of PuReMD was redesigned and benchmarked on modern NVIDIA Tesla GPUs. Performance was found to be on par with or to exceed that of LAMMPS Kokkos, a ReaxFF implementation developed at Sandia National Laboratories, with PuReMD typically outperforming LAMMPS Kokkos at larger scales.

Copyright by
KURT A. O'HEARN
2022

This thesis is dedicated to my family for their unconditional support and investment in my educational aspirations.

ACKNOWLEDGEMENTS

The work in Chapter 2 was supported through an MSU Foundation Strategic Partnership Grant, and NSF Grants ACI-1566049 and OAC-1807622. This research used computational resources provided by the Institute for Cyber-Enabled Research at Michigan State University.

The work in Chapter 3 was supported through a Michigan State University Foundation Strategic Partnership Grant, NSF Grants ACI-1566049 and OAC-1807622, and the computing resources at the National Energy Research Scientific Computing Center (NERSC).

The work in Chapter 4 was supported in part by a Strategic Partnership Grant from the Michigan State University Foundation, an NSF CDS&E grant (award number 1807622), an NSF DMR grant (award number 1832808), and a U.S. DOE grant (number DE-FG02-01ER15228). This work used computational resources provided by the Institute for Cyber-Enabled Research at Michigan State University. Thanks to Stan Moore, Jason Koski, and the LAMMPS group for sharing.

The work in Chapter 5 used computational resources provided by the Institute for Cyber-Enabled Research at Michigan State University.

TABLE OF CONTENTS

LIST OF TABLES . . . . . x LIST OF FIGURES . . . . . xi LIST OF ALGORITHMS . . . . . xiii CHAPTER 1 BACKGROUND AND RELATED MATERIAL . . . . . 1 1.1 Description of the Reactive Force Field Method . . . . . 1 1.1.1 Bond Orders . . . . . 2 1.1.2 Bonding Term . . . . . 3 1.1.3 Lone Pair Term . . . . . 3 1.1.4 Over- and Under-coordination Term . . . . . 4 1.1.5 Valence Angle Term . . . . . 4 1.1.6 Torsion Term . . . . . 5 1.1.7 Additional Bonding Terms . . . . . 5 1.1.8 Hydrogen Bonding Term . . . . . 6 1.1.9 van der Waals Term . . . . . 6 1.1.10 Coulomb Term . . . . . 7 CHAPTER 2 FAST SOLVERS FOR CHARGE DISTRIBUTION MODELS ON SHARED-MEMORY PLATFORMS . . . . . 8 2.1 Introduction . . . . . 9 2.2 Dynamic Charge Distribution Models and Linear Solvers . . . . . 10 2.2.1 Charge Equilibration (QEq) . . . . . 10 2.2.2 Electronegativity Equalization (EE) . . . . . 11 2.2.3 Atom-Condensed Kohn-Sham Approximated to Second Order (ACKS2) 12 2.3 Preconditioning Techniques for Linear Solvers . . . . . 13 2.3.1 Approaches . . . . .
. . . . . . . . 14 2.3.1.1 Jacobi (Jacobi) . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1.2 Incomplete LU (ILU) based Techniques . . . . . . . . . . . 15 2.3.1.3 Sparse Approximate Inverse (SAI(τ ))) . . . . . . . . . . . . 19 2.3.2 Dynamic Determination of Preconditioner Recomputation . . . . . . 20 2.3.3 Solver Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4.1 Computing Environment and Benchmark Systems . . . . . . . . . . . 24 2.4.2 Preconditioner Longevity . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.4.3 Tuning of Preconditioner Parameters for Cost and Effectiveness . . . 27 2.4.4 Scalability of Preconditioned Solvers . . . . . . . . . . . . . . . . . . 31 2.4.5 Preconditioned Solver Performance . . . . . . . . . . . . . . . . . . . 33 2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 vii CHAPTER 3 PERFORMANCE OPTIMIZATION OF REACTIVE MOLECULAR DYNAMICS SIMULATIONS WITH DYNAMIC CHARGE DISTRI- BUTION MODELS ON DISTRIBUTED MEMORY PLATFORMS . 37 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2.1 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2.2 Neighbor Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2.3 Initialization of Interaction Lists . . . . . . . . . . . . . . . . . . . . . 41 3.2.4 Bonded Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2.5 Charge Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2.6 Nonbonded Interactions . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.1 Reduction of Global Communication Overheads in Charge Solvers . . 44 3.3.1.1 Baseline Preconditioned Conjugate Gradient . . . . . . . . . 44 3.3.1.2 Preconditioned Pipelined Conjugate Gradient . . . . . . . . 45 3.3.2 Acceleration of QEq through preconditioning . . . . . . . . . . . . . . 47 3.3.2.1 SAI Preconditioning . . . . . . . . . . . . . . . . . . . . . . 47 3.3.2.2 Parallelization of the SAI Preconditioner . . . . . . . . . . . 49 3.3.2.3 Structure of the Hamiltonian . . . . . . . . . . . . . . . . . 50 3.3.2.4 SAI Preconditioner Sparsity Pattern Selection . . . . . . . . 50 3.3.2.5 SAI Preconditioner Computation . . . . . . . . . . . . . . . 51 3.3.3 Optimization at the Ghost Regions . . . . . . . . . . . . . . . . . . . 54 3.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.4.1 Benchmarking Systems and Hardware . . . . . . . . . . . . . . . . . . 56 3.4.2 Impact of Charge Solver Improvements . . . . . . . . . . . . . . . . . 57 3.4.2.1 SAI Longevity . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.2.2 SAI Preconditioning Convergence Rates . . . . . . . . . . . 58 3.4.2.3 Strong Scaling Tests . . . . . . . . . . . . . . . . . . . . . . 58 3.4.2.4 Weak Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.4.3 Impact of Ghost Region Optimizations . . . . . . . . . . . . . . . . . 63 3.4.4 Overall Simulation Performance . . . . . . . . . . . . . . . . . . . . . 63 3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
64 CHAPTER 4 OPTIMIZATION OF THE REAX FORCE FIELD FOR THE LITHIUM- OXYGEN SYSTEM USING A HIGH FIDELITY CHARGE MODEL 67 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.2.1 Training Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.2.2 High Fidelity Charge Model . . . . . . . . . . . . . . . . . . . . . . . 72 4.2.3 Optimization of the Force Field Parameters . . . . . . . . . . . . . . 75 4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.3.1 Fitness to the Training Data Set . . . . . . . . . . . . . . . . . . . . 78 4.3.2 Evaluation using Bulk Lithium Oxide . . . . . . . . . . . . . . . . . . 82 4.3.3 Evaluation using Deformation and Fracture . . . . . . . . . . . . . . 83 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 viii CHAPTER 5 PERFORMANCE OPTIMIZATION OF LARGE-SCALE DISTRIBUTED GPU-ACCELERATED REAXFF SIMULATIONS . . . . . . . . . . . 90 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.2.1 Improving Performance Portability and Scalability via Software Mod- ernization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.2.2 Increasing Parallelism through Kernel-Level Restructuring . . . . . . 92 5.2.3 Exploiting Task-Level Parallelism via CUDA Streams . . . . . . . . . 93 5.2.4 Global Charge Solver Optimizations . . . . . . . . . . . . . . . . . . . 95 5.3 Performance Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.3.1 Computing Environment and Benchmark Systems . . . . . . . . . . . 96 5.3.2 Scalability Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 APPENDIX A SUPPLEMENTAL DATA FOR GLOBAL CHARGE MODEL OPTIMIZATIONS IN SHARED-MEMORY . . . . . . . . . . 103 APPENDIX B SUPPLEMENTAL DATA FOR OPTIMIZATION OF THE REAX FORCE FIELD FOR THE LITHIUM-OXYGEN SYS- TEM USING A HIGH FIDELITY CHARGE MODEL . . . . 105 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 ix LIST OF TABLES Table 2.1 Molecular systems used in shared-memory performance evaluation. . . . . 25 Table 2.2 Qualitative behavior of the preconditioning techniques. . . . . . . . . . . . 25 Table 2.3 Number of recomputations for the shared-memory preconditioned solvers. 28 Table 2.4 Shared-memory preconditioner parameter comparison using the QEq charge method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Table 2.5 Comparison of the cost and effectiveness of the shared-memory precon- ditioned solvers for the bulk water system. . . . . . . . . . . . . . . . . . 30 Table 3.1 Mean solver iterations for the Jacobi and SAI distributed preconditioned QEq solvers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Table 3.2 Average number of bond order calculations per simulation step. . . . . . . 63 Table 4.1 Materials properties as predicted by DFT, the fitted ACKS2 FF, and the 2016 QEq FF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
81 Table 4.2 Properties of the notched slabs used as a starting point for the fracture simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Table A.1 Force field parameter values utlized for simulations with the molecular systems in Table 2.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Table A.2 Charge matrix condition numbers for the charge distribution models. . . . 103 Table A.3 Tuned preconditioned solver parameters for shared-memory experiments. . 104 Table B.1 The ReaxFF parameters adjusted to create the new ACKS2-based force field for Li-O. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Table B.2 Nuclear geometries used in the calculations for the Li-O-Li system, along with the MRCI+Q total electronic energies and MRCI atomic charges on Li and O atoms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 x LIST OF FIGURES Figure 2.1 Longevity of preconditioners with the QEq model. . . . . . . . . . . . . . 27 Figure 2.2 Strong scaling plots of shared-memory preconditioned solvers with QEq. 32 Figure 2.3 Speedups for shared-memory solvers for the benchmark systems by charge distribution method. . . . . . . . . . . . . . . . . . . . . . . . . . 34 Figure 3.1 ReaxFF’s computational workflow includes bonded, non-bonded, and system-specific interactions, each with different physical formulations and cut-offs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Figure 3.2 Structure of the local charge matrix in distributed-memory simulations. . 51 Figure 3.3 The global view of the process of constructing and solving one of the least-squares problems for the distributed SAI preconditioning approach with an example QEq matrix. . . . . . . . . . . . . . . . . . . . . . . . . 53 Figure 3.4 Longevity of the CG+SAI(0.15) distributed preconditioned solver for systems using the QEq model. . . . . . . . . . . . . . . . . . . . . . . . . 57 Figure 3.5 Strong scaling plots for mean QEq time of the distributed precondi- tioned solvers for the silica and water systems. . . . . . . . . . . . . . . . 60 Figure 3.6 Speedups and parallel efficiencies for the distributed strong scaling study. 61 Figure 3.7 Weak scaling plots for the distributed solver sub-kernels for QEq. . . . . 62 Figure 3.8 Strong scaling plots for total simulation time of distributed precondi- tioned solvers for the benchmark systems. . . . . . . . . . . . . . . . . . 64 Figure 3.9 Weak scaling plots for main simulation sub-kernels in distributed mem- ory using the benchmark systems. . . . . . . . . . . . . . . . . . . . . . . 65 Figure 4.1 Fitness of the best performing ReaxFF parameter set during optimization. 77 Figure 4.2 Charge during dissociation of the Li – O – Li molecule for asymmetric symmetric bond stretching. . . . . . . . . . . . . . . . . . . . . . . . . . 79 Figure 4.3 Energy vs. hydrostatic strain of Li2 O and Li2 O2 crystals along with energy vs. separation of the Li2 O molecule for bond stretching. . . . . . . 80 xi Figure 4.4 NPT simulation and material properties of crystalline Li2 O with the ACKS2 FF and the 2016 QEq FF. . . . . . . . . . . . . . . . . . . . . . 84 Figure 4.5 Simulation results and material properties for Li2 O slabs before and after NPT equilibration with ACKS2 and 2016QEq FF. . . . . . . . . . . 85 Figure 4.6 Simulation results for fracture in Li2 O slabs. . . . . . . . . . . . . . . . . 
86 Figure 5.1 Task dependency graph for the ReaxFF potential coupled with a global charge model (QEq) within a single MD step. . . . . . . . . . . . . . . . 94 Figure 5.2 Execution profile of the MPI+CUDA version of PuReMD using a single CUDA stream. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Figure 5.3 Weak scaling study for the MPI+CUDA PuReMD implementation and LAMMPS Kokkos. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Figure 5.4 Strong scaling study for the MPI+CUDA PuReMD implementation and LAMMPS Kokkos. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Figure B.1 Deformation simulation using the fitted QEq force field for Li-O. . . . . . 110 Figure B.2 Detailed information on oxygen charge during cracking simulation. . . . . 111 Figure B.3 Second cracking experiment with identical parameters and starting struc- tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Figure B.4 Evolution of the RDF during NPT simulation with the ACKS2 FF and the 2016 QEq FF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 xii LIST OF ALGORITHMS Algorithm 2.1 Dynamic Determination of Preconditioner Recomputation. . . . . . . 21 Algorithm 3.1 Preconditioned Pipelined Conjugate Gradient. . . . . . . . . . . . . . 46 Algorithm 3.2 Distributed SAI(τ ) Preconditioner Sparsity Pattern Selection. . . . . 52 Algorithm 3.3 Distributed SAI(τ ) Preconditioner Computation. . . . . . . . . . . . 54 Algorithm 5.1 Atomic Transactional Approach for GPU Memory Management. . . . 92 xiii CHAPTER 1 BACKGROUND AND RELATED MATERIAL 1.1 Description of the Reactive Force Field Method The reactive force field (ReaxFF) method is a relatively recent model (developed in early 2000s [1]) and is similar to the classical molecular dynamics (MD) model in the sense that it models atomic nuclei together with their electrons as a point mass. Unlike classical MD models, ReaxFF mimics bond formation and breakage observed in quantum mechanics (QM) methods by replacing the static harmonic bond models with the bond order concept, which is a quantity indicating bond strength between a pair of atoms based on the types of the atoms and the distance between them. Consequently, ReaxFF can overcome many of the limitations inherent to conventional MD. While the bond order concept dates back to 1980s and has been exploited in other force fields before (such as COMB [2] and AIREBO [3]), the distinguishing aspect of ReaxFF is the flexibility and transferability of its force field that allows ReaxFF to be applied to diverse systems of interest [1, 4, 5, 6]. ReaxFF is currently implemented by major open source (PuReMD Reactive Molecular Dynamics (PuReMD) [7], large-scale atomic/molecular massively parallel simulator (LA- MMPS) [8], RXMD [9]) and commercial (ADF, Material Studio) software with an esti- mated userbase of over 1,000 groups. In this thesis, the work builds off and utilizes the PuReMD software, as PuReMD and its LAMMPS integrations, i.e., the User-ReaxC and User-ReaxC/OMP packages [8], represent the most widely used implementations of ReaxFF. PuReMD uses novel algorithms and data structures to achieve high performance while re- taining a small memory footprint. 
An optimized neighbor generation scheme, elimination of the bond order derivatives list in bonded interactions, lookup tables to accelerate non-bonded interaction computations, and efficient iterative solvers for charge distribution are the major algorithmic innovations in PuReMD [10, 11].

In ReaxFF, the total energy of the system is comprised of partial energy contributions according to Eq. (1.1), where summation over atomic indices is implied for each term.

E_{\text{system}} = E_{\text{bond}} + E_{\text{lp}} + E_{\text{over}} + E_{\text{under}} + E_{\text{val}} + E_{\text{pen}} + E_{\text{3conj}} + E_{\text{tors}} + E_{\text{4conj}} + E_{\text{H-bond}} + E_{\text{vdW}} + E_{\text{Coulomb}}   (1.1)

While there are many similarities between ReaxFF and classical MD methods, one major difference is that bond orders are calculated system-wide at the beginning of every MD step before computing energies and gradients. Moreover, due to the dynamic bonding scheme of ReaxFF, these potentials must be modified to ensure smooth potential energy curves as bonds form or break. In the following subsections, these details are elucidated.

1.1.1 Bond Orders

In ReaxFF, the bond order between a pair of atoms i and j signifies the strength of the bond between the two atoms. Eq. (1.2) quantifies this notion for specific types of atoms i and j at a distance r_ij apart.

\mathrm{BO}_{ij}^{\prime\alpha}(r_{ij}) = \exp\left[ a_\alpha \left( \frac{r_{ij}}{r_{0\alpha}} \right)^{b_\alpha} \right]   (1.2)

In the above equation, α corresponds to σ–σ, σ–π, or π–π bonds; a_α and b_α are parameters specific to the bond type; and r_{0α} is the optimal length for this bond type. With this in mind, the total bond order BO′_ij is expressed as the summation of these bond types in the equation below.

\mathrm{BO}^{\prime}_{ij} = \mathrm{BO}^{\prime\sigma}_{ij} + \mathrm{BO}^{\prime\pi}_{ij} + \mathrm{BO}^{\prime\pi\pi}_{ij}   (1.3)

In order to model complex bonding behavior in real-life systems, additional considerations beyond pairwise bond orders are necessary, including the total coordination number of each atom and 1–3 bond corrections in valence angles. Eq. (1.4) describes these corrections.

\mathrm{BO}_{ij} = \mathrm{BO}^{\prime}_{ij} \cdot f_1\left(\Delta^{\prime}_i, \Delta^{\prime}_j\right) \cdot f_4\left(\Delta^{\prime}_i, \mathrm{BO}^{\prime}_{ij}\right) \cdot f_5\left(\Delta^{\prime}_j, \mathrm{BO}^{\prime}_{ij}\right)   (1.4)

In the above equation, Δ′_i denotes the deviation of atom i from its optimal coordination number, f_1(Δ′_i, Δ′_j) applies an over-coordination correction, and f_4(Δ′_i, BO′_ij), together with f_5(Δ′_j, BO′_ij), constitute the 1–3 bond order corrections. Only corrected bond orders are used in energy and gradient computations within ReaxFF.
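As a concrete (if simplified) illustration of the pairwise bond-order evaluation in Eqs. (1.2)–(1.3), the sketch below computes the uncorrected bond order for a single atom pair. The parameter triplets and the `bo_cut` threshold are illustrative placeholders rather than values from an actual ReaxFF force field, and the corrections of Eq. (1.4) are intentionally omitted.

```python
import numpy as np

def uncorrected_bond_order(r_ij, params, bo_cut=1.0e-3):
    """Evaluate Eqs. (1.2)-(1.3): the uncorrected bond order BO'_ij for one
    atom pair, summed over the sigma, pi, and pi-pi contributions.

    `params` maps each bond type alpha to a (a_alpha, b_alpha, r0_alpha)
    triplet; these are the generic symbols of Eq. (1.2), not PuReMD's
    internal force-field field names.
    """
    bo_total = 0.0
    contributions = {}
    for alpha, (a, b, r0) in params.items():
        bo_alpha = np.exp(a * (r_ij / r0) ** b)   # Eq. (1.2)
        contributions[alpha] = bo_alpha
        bo_total += bo_alpha                      # Eq. (1.3)
    # In practice a small cutoff is applied so that negligible bond orders
    # do not generate bonded interaction list entries.
    if bo_total < bo_cut:
        return 0.0, {alpha: 0.0 for alpha in params}
    return bo_total, contributions

# Example with made-up parameters (a_alpha < 0 so BO' decays with distance).
params = {
    "sigma": (-0.10, 6.0, 1.4),
    "pi":    (-0.20, 5.0, 1.2),
    "pipi":  (-0.35, 4.0, 1.1),
}
bo, parts = uncorrected_bond_order(1.5, params)
print(bo, parts)
```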
1.1.2 Bonding Term

Classical MD methods adopt a spring model in which the energy of a bond is determined solely by its deviation from the optimal bond distance, thereby ignoring the effects of neighboring bonds. In contrast, ReaxFF tabulates the energy incident on a bond from all bond order constituents. The higher the bond order, the lower the energy and the stronger the force associated with the bond. Eq. (1.5) formalizes this notion and ensures that the energy and force due to a bond smoothly go to zero as the bond dissipates.

E_{\text{bond}} = -D_e^{\sigma} \cdot \mathrm{BO}^{\sigma}_{ij} \cdot \exp\left[ p_{be1} \left( 1 - \left( \mathrm{BO}^{\sigma}_{ij} \right)^{p_{be2}} \right) \right] - D_e^{\pi} \cdot \mathrm{BO}^{\pi}_{ij} - D_e^{\pi\pi} \cdot \mathrm{BO}^{\pi\pi}_{ij}   (1.5)

1.1.3 Lone Pair Term

Classical MD methods do not need to resolve bond defects due to their aforementioned modeling. In contrast, the bonding in ReaxFF must account for unpaired electrons of an atom via an explicit lone pair term. In a well-formed and equilibrated system, the lone pair energy does not contribute significantly to the total energy; however, it is important for describing atoms with defective bonds. Lone-pair energy is computed using the following equation:

E_{\text{lp}} = \frac{p_{lp2} \cdot \Delta^{\text{lp}}_i}{1 + \exp\left\{ -75 \cdot \Delta^{\text{lp}}_i \right\}}   (1.6)

In Eq. (1.6), \Delta^{\text{lp}}_i = n^{\text{lp}}_{\text{opt}} - n^{\text{lp}}_i essentially corresponds to the number of unpaired electrons.

1.1.4 Over- and Under-coordination Term

Despite the valence correction applied during bond order corrections, there may still remain some over- or under-coordinated atoms in the system. Over-coordination energy penalizes over-coordinated atoms, while under-coordination energy accounts for the energy due to a resonant π-electron between atomic centers in the presence of a π-bond between atoms i and j. Eqs. (1.7) and (1.8) describe over- and under-coordination, respectively.

E_{\text{over}} = \Delta^{\text{lpcorr}}_i \cdot \frac{\sum_{j \in \text{nbrs}(i)} p_{ovun1} \cdot D_e^{\sigma} \cdot \mathrm{BO}_{ij}}{\Delta^{\text{lpcorr}}_i + Val_i} \cdot \frac{1}{1 + \exp\left\{ p_{ovun2} \cdot \Delta^{\text{lpcorr}}_i \right\}}   (1.7)

E_{\text{under}} = -p_{ovun5} \cdot f_6(i, p_{ovun7}, p_{ovun8}) \cdot \frac{1 - \exp\left\{ p_{ovun6} \cdot \Delta^{\text{lpcorr}}_i \right\}}{1 + \exp\left\{ -p_{ovun2} \cdot \Delta^{\text{lpcorr}}_i \right\}}   (1.8)

where

\Delta^{\text{lpcorr}}_i = \Delta_i - \Delta^{\text{lp}}_i \cdot f_6(i, p_{ovun3}, p_{ovun4})

f_6(i, p_1, p_2) = \left[ 1 + p_1 \cdot \exp\left\{ p_2 \cdot \sum_{j \in \text{nbrs}(i)} \left( \Delta_j - \Delta^{\text{lp}}_j \right) \cdot \left( \mathrm{BO}^{\pi}_{ij} + \mathrm{BO}^{\pi\pi}_{ij} \right) \right\} \right]^{-1}

1.1.5 Valence Angle Term

The energy associated with vibration about the optimum valence angle between a triplet of atoms i, j, and k is computed according to Eq. (1.9).

E_{\text{val}} = f_7(\mathrm{BO}_{ij}, p_{val3}, p_{val4}) \cdot f_7(\mathrm{BO}_{jk}, p_{val3}, p_{val4}) \cdot f_8(\Delta_j, p_{val5}, p_{val6}, p_{val7}) \cdot \left[ p_{val1} - p_{val1} \cdot \exp\left( -p_{val2} \cdot (\Theta_0 - \Theta_{ijk})^2 \right) \right]   (1.9)

where

f_7(\mathrm{BO}, p_1, p_2) = 1 - \exp\left\{ -p_1 \cdot \mathrm{BO}^{p_2} \right\}

f_8(\Delta, p_1, p_2, p_3) = p_1 - (p_1 - 1) \cdot f_9(\Delta, p_2, p_3)

f_9(\Delta, p_1, p_2) = \frac{2 + \exp\left\{ p_1 \cdot \Delta \right\}}{1 + \exp\left\{ p_1 \cdot \Delta \right\} + \exp\left\{ -p_2 \cdot \Delta \right\}}

Similar to its classical counterparts, the energy associated with Θ_ijk increases as it moves away from its corrected optimum Θ_0, which is obtained from the theoretical optimum Θ_00 by accounting for the effects of over- and under-coordination on the central atom j as well as the influence of any lone electron pairs. Valence angle energy further depends on the strength of the bonds BO_ij and BO_jk; the f_7(BO_ij) and f_7(BO_jk) terms in Eq. (1.9) ensure that the valence angle energy goes smoothly to zero as either bond dissociates.

1.1.6 Torsion Term

Eq. (1.10) accounts for the energy resulting from torsions in a molecule.

E_{\text{tors}} = \frac{1}{2} \cdot f_{10}(\mathrm{BO}_{ij}, \mathrm{BO}_{jk}, \mathrm{BO}_{kl}, p_{tor2}, 1) \cdot \sin(\Theta_{ijk}) \cdot \sin(\Theta_{jkl}) \cdot \Big[ V_1 \cdot \left( 1 + \cos(\omega_{ijkl}) \right) + V_2 \cdot \exp\left\{ p_{tor1} \cdot \left( 2 - \mathrm{BO}^{\pi}_{jk} - f_9(\Delta_j + \Delta_k, p_{tor3}, p_{tor4}) \right)^2 \right\} \cdot \left( 1 - 2\cos(2\omega_{ijkl}) \right) + V_3 \cdot \left( 1 + \cos(3\omega_{ijkl}) \right) \Big]   (1.10)

f_{10}(\mathrm{BO}_1, \mathrm{BO}_2, \mathrm{BO}_3, p_1, p_2) = f_7(\mathrm{BO}_1, p_1, p_2) \cdot f_7(\mathrm{BO}_2, p_1, p_2) \cdot f_7(\mathrm{BO}_3, p_1, p_2)

As in the valence angle energy term, the torsional contribution from a four-body structure should vanish as any of its bonds dissociates. Here, f_{10}(BO_ij, BO_jk, BO_kl, p_tor2, 1) enforces this constraint. If either of the two valence angles defined by these four atoms approaches π, the torsional energy should again disappear; this is accomplished by the term sin(Θ_ijk) · sin(Θ_jkl).

1.1.7 Additional Bonding Terms

In ReaxFF, there are other bonded interaction terms shown in Eq. (1.1) for which details have been omitted for brevity. The stability of 3-body structures in which the central atom has two double bonds is achieved through a penalty energy term E_pen. The three-body conjugation energy E_3conj and four-body conjugation energy E_4conj terms, as their names imply, encapsulate the energy contribution from conjugated systems. Full details regarding these terms can be found in [1].

1.1.8 Hydrogen Bonding Term

The energy term attached to a hydrogen bond in ReaxFF is given by Eq. (1.11).
E_{\text{hbond}} = p_{hb1} \cdot f_7(\mathrm{BO}_{XH}, p_{hb2}, 1) \cdot \sin^4\left( \frac{\Theta_{XHZ}}{2} \right) \cdot \exp\left[ -p_{hb3} \cdot \left( \frac{r^0_{hb}}{r_{HZ}} + \frac{r_{HZ}}{r^0_{hb}} - 2 \right) \right]   (1.11)

In this equation, a hydrogen bond is defined as existing between an electronegative atom (denoted Z) in the vicinity of a hydrogen atom covalently bonded to a nitrogen, oxygen, or fluorine atom (denoted X). Similar to the equations for valence angles and torsions, the f_7(BO_XH, p_hb2, 1) term ensures that contributions from hydrogen bonding smoothly approach zero as the covalent bond breaks. For hydrogen bonding to be strong, it is crucial that all three atoms are geometrically aligned on a line. This fact is evident by observing that sin^4(Θ_XHZ / 2) is maximized when Θ_XHZ = π.

1.1.9 van der Waals Term

A distance-corrected Morse-potential term is used for van der Waals interactions, as shown in Eq. (1.12).

E_{\text{vdWaals}} = T(r_{ij}) \cdot D_{ij} \cdot \left[ \exp\left( \alpha_{ij} \cdot \left( 1 - \frac{f_{13}(r_{ij})}{r_{vdW}} \right) \right) - 2 \cdot \exp\left( \frac{1}{2} \cdot \alpha_{ij} \cdot \left( 1 - \frac{f_{13}(r_{ij})}{r_{vdW}} \right) \right) \right]   (1.12)

T(r_{ij}) = t_7 \cdot r_{ij}^7 + t_6 \cdot r_{ij}^6 + t_5 \cdot r_{ij}^5 + t_4 \cdot r_{ij}^4 + t_3 \cdot r_{ij}^3 + t_2 \cdot r_{ij}^2 + t_1 \cdot r_{ij} + t_0   (1.13)

f_{13}(r_{ij}) = \left( r_{ij}^{p_{vdW1}} + \gamma_w^{-p_{vdW1}} \right)^{\frac{1}{p_{vdW1}}}   (1.14)

In contrast to classical force fields, where van der Waals interactions are computed only between non-bonded atom pairs, in ReaxFF all atom pairs – bonded or non-bonded – contribute to the energy term. The rationale behind this is that excluding bonded pairs would result in discontinuities on the potential energy surface as bonds are formed or broken. To prevent extremely high repulsion forces between pairs at short distances, a shielding term is included via Eq. (1.14). Additionally, the tapering function T in Eq. (1.13) is a seventh order polynomial with coefficients t_i, 0 ≤ i ≤ 7, chosen to ensure that the van der Waals energy smoothly goes to zero for pairs at distances beyond the non-bonded interaction cut-off distance, r_nonb.

1.1.10 Coulomb Term

Coulomb interactions are defined pairwise for all atoms, akin to van der Waals interactions. For corrections, shielding and taper terms are included in the potential as shown in Eq. (1.15).

E_{\text{Coulomb}} = C \cdot T(r_{ij}) \cdot \frac{q_i \cdot q_j}{\left( r_{ij}^3 + \gamma_{ij}^{-3} \right)^{\frac{1}{3}}}   (1.15)

Coulomb interactions are truncated at a specified cut-off r_nonb, typically valued in the 10 to 12 Å range. Moreover, no long-range electrostatic interactions are included in ReaxFF.

CHAPTER 2
FAST SOLVERS FOR CHARGE DISTRIBUTION MODELS ON SHARED-MEMORY PLATFORMS

This chapter presents previously published work on the optimization of global charge models typically coupled with ReaxFF [12]. This work is reproduced with the permission of SIAM.

Chronologically, the first avenues explored for optimizing PuReMD focused on the shared-memory implementation. This version of the software, which prior to this work was a serial-only implementation, serves several purposes, including acting as an optimization and development framework for ReaxFF parameter sets and providing a codebase used for integration with external molecular software packages. Moreover, the relative simplicity of this implementation – all simulation data resides within the same memory space – enables rapid prototyping and exploration of new algorithms and avenues for performance optimization.

With the above points in mind, the shared-memory implementation was used for developing and optimizing several global charge solver models which are instrumental to polarizability [13, 12].
These models, which are described in detail in the following section, require the determination of partial atomic charges at a given instant in time. Algorithmically, this requires solving large sparse linear systems for equations at each MD timestep. From pre- vious analysis of the performance of PuReMD, these linear solvers constitute a significant portion of the execution time – 50% and upward depending on the particular system under simulation. Thus, these kernels were clear top candidates for initial optimization efforts. In the following sections, the charge models themselves and the techniques used to optimize the linear solvers applied to these models are described. 8 2.1 Introduction Molecular dynamics (MD) simulations are utilized across a wide range of fields including physics, chemistry, biology, and materials science. On one hand, the force field s (set of parameterized mathematical equations describing the interactions between atoms) used in these simulations must be high fidelity so that simulations can have scientific impact. On the other hand, the force field formulations need to be computationally inexpensive such that simulations of millions to billions of timesteps can be routinely performed, allowing scientists to probe into chemical or biological phenomena that take place in the microseconds to milliseconds ranges. In this regard, over the past few decades several highly successful force fields have been developed to model liquids, proteins, and materials [14, 15, 16]. The efficiency of these force fields is mainly due to the use of static bonds between atom pairs and fixed charges on each atom. While suitable for various applications, these simplifications render classical force fields unsuitable in applications where polarization effects or chemical reactions are important. The impact of modeling charge distribution on the fidelity of MD simulations is well known [17, 18, 19, 20]. As such, to this day several charge distribution models have been developed for atomistic systems [21, 22, 23, 24]. While each of these models basically aims to find the charge distribution that minimizes the total electrostatic energy, the set of phys- ical/chemical constraints they impose in this optimization problem is essentially what sepa- rates them. As discussed in Section 2.2.1, minimization of the electrostatic energy subject to the set of constraints defined by a particular model requires the solution of large systems of linear equations. It is important to note that trying to incorporate the long-range Coulomb effects into the dynamic charge distribution models would significantly increase their compu- tational costs, making them impractical for actual simulations. Hence, force fields utilizing dynamic charge models adopt tapering functions which limit the effect of charges only within a certain radius (which is typically 10-12 Å). As a result, linear systems arising in the formu- lations of dynamic charge models are sparse, and iterative solvers are used to find sufficiently 9 accurate solutions. Development of fast and efficient iterative solvers for dynamic charge models constitutes the main objective of this paper. Previous work on this topic has explored the use of Jacobi preconditioners and good initial guesses which have indeed been shown to be highly effective [11, 10]. 
Nevertheless, charge distribution in actual simulations still takes up a significant amount of computation time and represents an important scalability bottleneck, as demonstrated for instance for the Reax Force Field (ReaxFF) [1, 4] in Refs. [11, 8]. It can be argued that dynamic charge distribution would represent an even more significant bottleneck for other force fields if it were to be incorporated into their formulations, because ReaxFF is known to be a complex and computationally expensive force field.

In this paper, we focus on the development of effective, yet inexpensive preconditioners to accelerate the solvers used in dynamic charge distribution models, as well as the high performance implementation of the resulting solvers in the open source PuReMD package [11, 10]. The presented techniques are shared-memory parallel, but an efficient shared-memory parallel solver constitutes the critical building block for extending our techniques to distributed memory systems. We demonstrate through extensive numerical tests that these carefully designed preconditioners and their efficient implementations can significantly accelerate the solvers for dynamic charge models.

2.2 Dynamic Charge Distribution Models and Linear Solvers

2.2.1 Charge Equilibration (QEq)

We define and discuss the charge distribution models and their formulations as linear systems of equations beginning with the QEq model [22]. In this model, the distribution of charges over atoms is determined by minimizing the electrostatic energy E_ele given the atomic positions. Let R = (r_1, r_2, ..., r_n) signify the positions of the system of n atoms, where r_i ∈ R^3 for 1 ≤ i ≤ n. Atomic charges q = (q_1, q_2, ..., q_n), q_i ∈ R, are thus defined by solving the following problem:

\min_{\mathbf{q}} E_{\text{ele}}(\mathbf{q}) = \sum_i \chi_i q_i + \frac{1}{2} \sum_{i,j} H_{ij} q_i q_j \quad \text{subject to} \quad q_{\text{net}} = \sum_i q_i   (2.1)

In Eq. (2.1), we define H_ij as δ_ij η_i + (1 − δ_ij) · F_ij, where δ_ij denotes the Kronecker delta operator; χ_i and η_i denote the atomic electronegativity and idempotential; r_ij = ||r_j − r_i||_2 signifies the distance between the atomic pair i and j; and γ_ij = \sqrt{γ_i · γ_j} denotes a pairwise shielding term tuned for the element types of atoms i and j to avoid unbounded electrostatic energy at short distances. Additionally, F_ij is defined as

F_{i,j} = \begin{cases} \dfrac{1}{\left( r_{ij}^3 + \gamma_{ij}^{-3} \right)^{1/3}}, & r_{ij} \le r_{\text{nonb}} \text{ (typically 10 to 12 Å)} \\ 0, & \text{otherwise.} \end{cases}   (2.2)

Applying the method of Lagrange multipliers to Eq. (2.1), we obtain the sets of linear equations below [25, 10].

\sum_{i=1}^{n} H_{ki} s_i = -\chi_k, \quad \sum_{i=1}^{n} H_{ki} t_i = -1, \quad k = 1, \ldots, n   (2.3)

Here, the values s_i and t_i can be thought of as pseudo-charges used during the Lagrangian method. From these, partial atomic charges q_i can be computed as follows:

q_i = s_i - \frac{\sum_{j=1}^{n} s_j}{\sum_{j=1}^{n} t_j} \cdot t_i   (2.4)

We henceforth refer to the coefficient matrix in Eq. (2.3) as H_QEq ∈ R^{n×n}. We note that H_QEq is symmetric and has been observed to be positive definite across numerous atomic systems we have worked with.
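As a concrete (if simplified) illustration of how Eqs. (2.3)–(2.4) are used, the sketch below solves the two linear systems with an off-the-shelf conjugate gradient routine and combines the resulting pseudo-charges. It assumes a recent SciPy (where the relative tolerance keyword is `rtol`), a zero net system charge (matching the form of Eq. (2.4)), and a prebuilt sparse H; the preconditioning and initial-guess extrapolation used in PuReMD are deliberately omitted.

```python
import numpy as np
from scipy.sparse.linalg import cg

def qeq_charges(H, chi, tol=1e-8):
    """Solve the two QEq linear systems of Eq. (2.3) and combine the
    pseudo-charges s and t via Eq. (2.4), assuming a net system charge of 0.

    H   : sparse, symmetric QEq coefficient matrix (n x n), e.g. in CSR format
    chi : per-atom electronegativities chi_k (length n)
    """
    n = H.shape[0]
    # H s = -chi and H t = -1 (Eq. (2.3)); CG is applicable since H_QEq is
    # symmetric and observed to be positive definite.
    s, info_s = cg(H, -np.asarray(chi, dtype=float), rtol=tol)
    t, info_t = cg(H, -np.ones(n), rtol=tol)
    if info_s != 0 or info_t != 0:
        raise RuntimeError("CG did not converge")
    return s - (np.sum(s) / np.sum(t)) * t   # Eq. (2.4)
```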
2.2.2 Electronegativity Equalization (EE)

EE is another commonly used approach for partial charge calculation; it relies on the principle that charges should be distributed to atoms in order to satisfy constraints for both the net system charge and equalized atomic electronegativity [26, 21]. For the latter, we represent the electronegativity of atom i as ϵ_i, and thus the electronegativity equalization constraint is formalized as

\epsilon_1 = \epsilon_2 = \cdots = \epsilon_n = \bar{\epsilon}.   (2.5)

Using the same definition of electrostatic energy (i.e., E_ele) and empirical parameters as in Eq. (2.1), in conjunction with the constraints in Eq. (2.5), we can represent the system of linear equations to be solved in EE in block form as a superset of those from the QEq model:

\begin{bmatrix} \mathbf{H}_{\text{QEq}} & \mathbf{1}_n \\ \mathbf{1}_n^T & 0 \end{bmatrix} \begin{bmatrix} \mathbf{q} \\ \bar{\epsilon} \end{bmatrix} = \begin{bmatrix} -\boldsymbol{\chi} \\ q_{\text{net}} \end{bmatrix}   (2.6)

In Eq. (2.6), q_net again represents the net charge of the atomic system, while the EE charge matrix H_EE ∈ R^{(n+1)×(n+1)} is symmetric, indefinite, and sparse.

2.2.3 Atom-Condensed Kohn-Sham Approximated to Second Order (ACKS2)

The recently proposed ACKS2 model [24] can be regarded as an extension of the EE model. Essentially, ACKS2 was developed with the aim of correcting issues with accurately modeling dipole polarizability and charges during bond formation/dissociation via additional empirically fitted parameters. A block matrix representation of the linear system arising in the ACKS2 model is as follows:

\begin{bmatrix} \mathbf{H}_{\text{QEq}} & \mathbf{I}_n & \mathbf{1}_n & \mathbf{0}_n \\ \mathbf{I}_n & \mathbf{X} & \mathbf{0}_n & \mathbf{1}_n \\ \mathbf{1}_n^T & \mathbf{0}_n^T & 0 & 0 \\ \mathbf{0}_n^T & \mathbf{1}_n^T & 0 & 0 \end{bmatrix} \begin{bmatrix} \mathbf{q} \\ \mathbf{U} \\ \mu_{\text{mol}} \\ \lambda_U \end{bmatrix} = \begin{bmatrix} -\boldsymbol{\chi} \\ \mathbf{0} \\ q_{\text{net}} \\ 0 \end{bmatrix}   (2.7)

In Eq. (2.7), U contains Kohn-Sham potential coefficients, µ_mol and λ_U are Lagrange multipliers from the underpinning optimization problem, and X contains terms for the linear response kernel of the Kohn-Sham potential. Similarly to the EE model, the ACKS2 matrix H_ACKS2 ∈ R^{(2n+2)×(2n+2)} is symmetric, indefinite, and sparse.

In Table A.1 of Appendix A, we present the empirically fitted values utilized for studying the computational behavior of the charge models discussed in this section. In addition to the parameters mentioned in Section 2.2.1, σ_i and Λ are ACKS2-specific parameters which control entries in the linear response kernel X, with the former being tuned for thresholding entries based on element types and the latter being used as a global bond softness parameter.

2.3 Preconditioning Techniques for Linear Solvers

Two commonly used techniques to accelerate the convergence of iterative solvers are using good initial guesses and preconditioning; the latter essentially amounts to leveraging the better spectral properties of a transformed linear system. In the context of charge distribution models, the good initial guess approach was previously explored in the form of spline extrapolations for QEq [10], which capitalized on the key observation that in a molecular simulation environment, atomic positions change slowly due to short timestep lengths. Thus, extrapolations from solutions to the linear systems in previous timesteps are likely to provide good initial guesses.

While initial guesses are easy to apply and can be very effective, a good preconditioner is crucial to ensure fast convergence. Aktulga et al. have also proposed an incomplete Cholesky based preconditioner for the QEq model, albeit in a purely sequential context [10]. Due to the difficulties in attaining a scalable implementation of the incomplete Cholesky technique, actual simulations carried out on distributed memory systems (such as those with PuReMD and LAMMPS [27]) still rely on the simple Jacobi preconditioning technique in the QEq solver. As an illustration of the performance impact of this situation, 60% or more of the total runtime in large-scale ReaxFF simulations is spent in the QEq kernel, as the cost of inter-node communications in the QEq solver dominates the execution time [8].
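Before turning to the preconditioning techniques themselves, the initial-guess extrapolation mentioned above can be illustrated with a short sketch. It is schematic only: the function name and the choice of polynomial degree are illustrative assumptions, and the production implementation uses specific spline orders per linear system (cubic for the s system and quadratic for the t system in QEq, as noted later in Section 2.4.1).

```python
import numpy as np

def extrapolated_initial_guess(history):
    """Build an initial guess for the current timestep's linear solve from
    solutions at previous timesteps, exploiting the slow evolution of atomic
    positions (and hence of the solution vectors) between MD steps.

    `history` is a list of the most recent solution vectors, oldest first.
    """
    if not history:
        return None                      # fall back to the solver default
    if len(history) == 1:
        return history[-1].copy()        # reuse the last solution
    k = min(len(history), 4)             # up to cubic extrapolation
    recent = history[-k:]
    steps = np.arange(-k + 1, 1)         # past steps: ..., -2, -1, 0
    # Fit a degree-(k-1) polynomial per vector component over the past steps.
    coeffs = np.polyfit(steps, np.stack(recent), deg=k - 1)
    # Evaluate the fit at t = +1 (the upcoming step) via Horner's rule.
    t = 1.0
    guess = np.zeros_like(recent[-1])
    for c in coeffs:
        guess = guess * t + c
    return guess
```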
When one considers the EE and ACKS2 models, the need for efficient solvers is exacerbated. As shown in Table A.2 of Appendix A for the systems considered in this study (and this seems to hold true in general), the condition numbers of the coefficient matrices for the EE model tend to be slightly worse than those of the QEq model. For ACKS2 (where we only had empirically fitted force field parameters for the water system), the condition number is significantly worse than either QEq or EE.

Motivated by the factors above, identifying effective and efficient preconditioning techniques for the charge distribution models constitutes the main focus of this paper. One issue that prevents well-known preconditioners from being directly useful in this context is the fast execution requirement of molecular dynamics simulations. The key enabler for our use of high-quality preconditioners is the same observation that allows good initial guesses – leveraging the slowly-evolving characteristic of a molecular simulation, we amortize the cost of a preconditioner construction by re-using the same preconditioner over several timesteps. However, as we discuss below and demonstrate through numerical experiments in Section 2.4, there still exist important trade-offs that must be considered in building a practical solver:

• Effectiveness: How much a preconditioner can increase the convergence rate of the solver,
• Cost: Time to compute and apply a preconditioner,
• Longevity: Number of subsequent timesteps that a preconditioner can be used effectively without having to recompute it, and
• Parallelizability: How scalable a preconditioner is in a parallel environment.

2.3.1 Approaches

In the subsections that follow, we describe the preconditioning techniques explored, which we refer to in the remainder of the paper by the bolded notation in parentheses. The symbol H represents the generic coefficient matrix of the sparse linear system to be solved for a specified charge model (QEq, EE, or ACKS2).

2.3.1.1 Jacobi (Jacobi)

A simple approach is to form a preconditioning matrix P^{-1} by taking the multiplicative inverses of the diagonal entries of H. The ill-defined cases of zero entries on the diagonals of the H_EE and H_ACKS2 matrices can be treated by simply assuming those zeros to be ones for the Jacobi preconditioner.

Despite being an inexpensive and easily parallelizable preconditioner, the Jacobi preconditioner yields limited improvements in the solver convergence rate, because P^{-1} is a poor approximation to H^{-1} for the charge models considered due to the presence of a large number of off-diagonal entries (on the order of a few hundred per row) with not-so-insignificant nonzero values. Nevertheless, given its simplicity, Jacobi preconditioned solver performance serves as our baseline to measure the improvements obtained by the other preconditioning techniques explored.

2.3.1.2 Incomplete LU (ILU) based Techniques

The idea behind incomplete LU preconditioning is to compute and leverage an approximate LU decomposition H ≈ L′U′, where L′ and U′ are approximate lower and upper triangular factors, respectively. ILU techniques are known to be among the most effective preconditioners [28], but an important question is how to determine the factors L′ and U′, i.e., which sparsity pattern to select so that the factors are computationally inexpensive to compute and apply, yet effective as a preconditioner.
A common choice for the sparsity pattern is the 0-fillin approach, ILU(0), where L′ + U′ has the same sparsity pattern as H [28]. Another effective choice utilizes the thresholding approach, ILUT(t), where the sparsity patterns in L′ and U′ are a subset of the 0-fillin approach in which entries smaller than a prescribed threshold t are dropped during factorization [29]. Note that since the matrices in the QEq model are symmetric positive-definite, ILU preconditioning essentially amounts to computing an incomplete Cholesky (IC) factorization with L′ = (U′ )T . 15 Sparsification Strategies for Factor Computation: Plain ILU(0) and ILUTP(t) precondi- tioning turn out to be too costly to be useful in practice for our purposes. As such, we investigated a number of techniques to reduce the computational costs associated with con- struction and application of ILU based preconditioning techniques and improve their scaling. In particular, we developed two custom sparsification schemes for matrices arising in dynamic charge models: • Numeric dual drop strategy: In this scheme, we start with a zero-fillin precondi- tioner (i.e., ILU(0), IC(0)), but as in ILUTP(t), we apply a numerical threshold to drop small entries in the incomplete factors. Specifically, after a row in a factor is com- puted, the 1-norm of the row is computed, followed by a thresholding operation where zero-fillin entries that are less than a predetermined threshold t times the 1-norm of that row are discarded. This hybrid scheme allows us to easily determine the location of the nonzeros (by virtue of not allowing fillins), and obtain sparser factors that con- tain only the most significant entries (as in an ILUTP factorization). As such, it is an inexpensive, yet effective and easy-to-implement preconditioner for the spd matrices of QEq, which we refer to as IC with dual drop (ICDD(t)), where the parameter t denotes the numerical threshold to be used. In the cases of EE and ACKS2 models, to prevent numerical breakdowns during factorization, we replace zeros on the diagonal with ones and perform an ILU decomposition because EE and ACKS2 matrices are not spd. We denote this scheme as ILU with dual drop (ILUDD(t)). • Distance drop strategy: A number of other works have utilized insights in the un- derlying application in order to select which entries to threshold [30, 31]. We choose to follow these approaches and look to the physics underpinning the charge model prob- lem. From the definition of the coefficient matrices for the charge distribution models in Eq. (2.1), we observe that as the distance between a pair of atoms increases, the corresponding off-diagonal entry in H decreases proportionately to the inverse of the 16 distance. Said differently, small off-diagonal entries in H are contributed by atomic pairs separated by large distances. Intuitively, these entries contribute relatively less to the preconditioner quality; thus, they are good candidates for elimination. We denote the ILU/IC factors sparsified using the distance drop technique with the suffix DS(d), e.g., ICDD(t)+DS(d) and ILUDD(t)+DS(d), where d is a real parameter in the range (0, 1] that is applied as a scaling of the distance cutoff of the nonbonded inter- action radius rnonb . The distance drop sparsification technique can be implemented simply by constructing a sparser H given a prescribed parameter d, and then perform- ing an ICDD or ILUDD factorization. 
As will be demonstrated through numerical experiments, distance based sparsification is highly preferred across a large majority of our benchmark cases.
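The two dropping rules described above can be summarized with the short sketch below. It is schematic only: the real implementations operate on PuReMD's internal CSR structures during factorization, handle periodic images when computing pair distances, and always retain diagonal entries, none of which is shown here.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def distance_drop(H, positions, d, r_nonb):
    """Distance drop (the DS(d) rule): build a sparser copy of the charge
    matrix H by removing off-diagonal entries whose atom pair is separated
    by more than d * r_nonb, with 0 < d <= 1.  Minimum-image handling for
    periodic boundaries is omitted for brevity."""
    A = coo_matrix(H)
    rij = np.linalg.norm(positions[A.row] - positions[A.col], axis=1)
    keep = (A.row == A.col) | (rij <= d * r_nonb)
    return csr_matrix((A.data[keep], (A.row[keep], A.col[keep])), shape=A.shape)

def dual_drop_row(factor_row, t):
    """Numeric dual-drop rule applied to one freshly computed (dense) row of
    an incomplete factor: zero-fill-in entries smaller than t times the row's
    1-norm are discarded.  (In practice the diagonal entry is always kept.)"""
    threshold = t * np.sum(np.abs(factor_row))
    kept = factor_row.copy()
    kept[np.abs(kept) < threshold] = 0.0
    return kept
```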
Parallelization of Factor Computation and Application: In addition to the cost and effectiveness of the incomplete factors utilized, another important consideration is exploiting the high degrees of parallelism available on high performance computing systems. Due to the need for eliminating data race conditions in a parallel ILU computation, the nonzero pattern of the sparse matrix plays an important role during both computation and application of the approximate factors. In this regard, it can be argued that the above sparsification schemes increase the degree of available parallelism by virtue of reducing the overlap between different rows of the coefficient matrices. To further improve the scalability of our ILU preconditioners, we implemented the techniques below:

• Balanced Level Scheduling using Graph Coloring: Fundamentally, this approach is based on the idea of level scheduling. Using the row dependency graph, groups of independent rows (i.e., levels) are determined. During the subsequent factorization, threads perform the scaling and elimination operations in parallel within a level, and computations between subsequent levels proceed in a lock-step fashion. Therefore, the degree of parallelism is heavily dependent on the sparsity pattern.

As shown in our previous work [13], a pure level scheduling approach can produce poor thread scalability in situations where there is a large number of dependencies between the rows. In order to improve the performance of level scheduling, many other works have explored the use of reorderings based on graph coloring [32, 33, 34]. For the solvers in this work, we utilize an iterative graph coloring approach [35] to break data dependencies through an explicit permutation of H [36]. In this method, a permutation matrix Q – which increases the number of independent nodes per level – is determined by first finding an approximate coloring of the dependency graph. This permutation can be applied as Q^T H Q, and subsequently factored as Q^T H Q ≈ L′_P U′_P using level scheduling [37]. The graph coloring approach can be used with both the numerical and distance drop strategies (or their combination) discussed above.

• Fine-grained ILU (FG-ILUDD(t, s)): This iterative approach, recently proposed for computing the incomplete factors, seeks to address the issue of limited parallelism in ILU preconditioning [38]. Here, the factorization problem is posed as a constrained optimization problem where one seeks to satisfy a series of equations of the form

\sum_{k=1}^{\min(i,j)} L_{ik} U_{kj} = H_{ij}   (2.8)

by asynchronously updating individual nonzeros in the sparse factors. A single pass over all nonzeros in the factors is termed a "sweep"; at each sweep, the factors computed by this algorithm with fine-grained parallelism get asymptotically closer to the actual L′ and U′ factors. We use the parameter s to denote the number of sweeps used to compute the factors in our implementation. To achieve convergence of the factors, H must have a unit diagonal, which may require diagonal scaling. Additionally, we assume the convention that the factor L′ has unit diagonal entries.

The numerical dual dropping strategy can still be applied with fine-grained ILU. In this scheme, which we refer to as FG-ILUDD(t, s), after all sweeps are completed we perform a post-processing step where all nonzeros are scanned and those below the specified numerical threshold t times the 1-norm of that row are dropped. On the other hand, distance drop sparsification is straightforward to use here because it applies the sparsification before factorization, exactly as required by the fine-grained ILU method, and it can be highly effective because our numerical experiments show that construction of the fine-grained ILU preconditioner can be very costly despite being easily parallelizable.

2.3.1.3 Sparse Approximate Inverse (SAI(τ))

A technique that is generally less effective than ILU preconditioning, but lends itself more easily to parallelization, is sparse approximate inverse (SAI) preconditioning. Given H, SAI aims to find an approximate inverse matrix M such that the right preconditioned system HMy = b, with x = My, (or the left preconditioned system MHx = Mb) can be solved with fewer solver iterations than the original sparse linear system, and ideally with a lower overall execution time. Generally, there are three popular categories of SAI techniques: Frobenius norm minimization, factorized sparse approximate inverse, and incomplete biconjugation [39]. In this work, we focus on the first group. Hence, the objective is to minimize ||I − HM|| for right preconditioning (or ||I − MH|| for left preconditioning).

To control the cost of constructing as well as applying the SAI preconditioner, M is often chosen to be no denser than the coefficient matrix H itself [40]. Given the relatively large number of nonzeros per row in our problems, it is crucial that the number of nonzeros in the approximate inverses is significantly lower than that of the charge distribution matrix. Our experiments on a number of sample problems have shown that the positions of the numerically large nonzeros in the charge matrix constitute good candidates for the sparsity pattern of M. In accordance with this observation, we select only the τ percent largest nonzeros for inclusion when computing the approximate inverse matrix.

During construction of the SAI preconditioner, a benefit of using the Frobenius form is that we can easily exploit parallelism, because the SAI minimization decouples as:

\|I - HM\|_F^2 = \sum_{j=1}^{n} \|e_j - H m_j\|_2^2   (2.9)

Here, e_j is the j-th column of the identity matrix and m_j is the j-th column of M. Leveraging the fact that H is sparse, we build least squares problems with smaller dense matrices Ĥ_j and vectors m̂_j, where Ĥ_j is obtained by eliminating the rows and columns of H that do not correspond to the set of nonzeros found in the row of M currently being computed (see Section 2.1 of [41] for details of this construction). Thus, the resulting least squares problems \min_{\hat{m}_j} \|\hat{e}_j - \hat{H}_j \hat{m}_j\|_2^2 can be solved independently through QR decomposition of Ĥ_j.

We note that the distance based sparsification of the coefficient matrix can easily be applied for SAI preconditioning as well. In fact, distance based sparsification can be highly effective in this context by reducing the cost of the QR factorizations without much impact on the effectiveness of the SAI preconditioner. Additionally, we utilize optimized LAPACK libraries (such as Intel MKL or Cray LibSci) to perform the QR decomposition tasks, each of which is executed sequentially by a different thread using dynamic assignment for load balancing purposes.
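A minimal sketch of the Frobenius-norm SAI construction of Eq. (2.9) is shown below, assuming a column-oriented formulation and a pattern built from the largest-magnitude entries of H. It is illustrative only: it works on a dense copy of H for clarity, solves each small least-squares problem with a generic routine rather than explicit QR calls into MKL/LibSci, and omits the per-column thread-level task parallelism used in the shared-memory solver.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sai_preconditioner(H, tau):
    """For each column m_j of M, solve a small dense least-squares problem
    min || e_j - H m_j ||_2 restricted to a chosen sparsity pattern; the
    pattern keeps roughly a tau fraction (tau in (0, 1]) of the
    largest-magnitude entries in the corresponding column of H."""
    H = csr_matrix(H)
    n = H.shape[0]
    Hd = H.toarray()                       # dense copy for clarity only
    rows, cols, vals = [], [], []
    for j in range(n):
        col = Hd[:, j]
        nz = np.nonzero(col)[0]
        k = max(1, int(np.ceil(tau * len(nz))))
        pattern = nz[np.argsort(-np.abs(col[nz]))[:k]]   # largest entries
        # Rows of H touching the pattern columns define the small problem.
        touched = np.nonzero(np.any(Hd[:, pattern] != 0.0, axis=1))[0]
        H_hat = Hd[np.ix_(touched, pattern)]
        e_hat = (touched == j).astype(float)             # restriction of e_j
        m_hat, *_ = np.linalg.lstsq(H_hat, e_hat, rcond=None)
        rows.extend(pattern)
        cols.extend([j] * len(pattern))
        vals.extend(m_hat)
    return csr_matrix((vals, (rows, cols)), shape=(n, n))
```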
2.3.2 Dynamic Determination of Preconditioner Recomputation

One important insight considered when designing our preconditioned solver is that the molecular systems (and hence the corresponding linear systems) change slowly over the course of a simulation. To leverage this insight, preconditioners are only occasionally computed and are reused for several solves to amortize the preconditioner construction costs. The natural question which arises from this is exactly when to recompute the preconditioner. In Algorithm 2.1, we give a heuristic approach that adopts a gain-loss comparison perspective for recomputing the preconditioner versus reusing it.

Algorithm 2.1 Dynamic Determination of Preconditioner Recomputation.
1: Select α ≥ 0
2: for each simulation step do
3:   if Tloss ≥ α · TPreComp or first simulation step then
4:     (Re)compute the preconditioner
5:     Define TPreComp as the preconditioner computation time from Line 4
6:     Perform solve(s)
7:     Define Tideal as the solve time from Line 6
8:     Update Tloss = 0
9:     Check if the preconditioner quality is satisfactory and act accordingly
10:  else
11:    Perform solve(s)
12:    Define Tactual as the solve time from Line 11
13:    Update Tloss += Tactual − Tideal
14:  end if
15: end for

In this approach, we make the assumption that a preconditioner is most effective in terms of improving solver convergence at the simulation step in which it was computed. The preconditioned solve time required right after a preconditioner is recomputed is recorded as the ideal solve time for this preconditioner (denoted Tideal). From this point onward, we accumulate the excess solve time compared to this ideal solve time (Tloss); this is considered the loss incurred by reusing the preconditioner. When the aggregated loss time exceeds α times the time it would take to recompute the preconditioner (TPreComp), we proceed to recompute it; otherwise we continue reusing it.

This dynamic scheme essentially justifies the reconstruction of a preconditioner by detecting when the loss of effectiveness of the preconditioner would exceed α times the cost of recomputing it, and tries to balance the gains and losses in preconditioner computation time versus other solver time (preconditioner application, solver operations including sparse matrix-dense vector multiplications, and vector operations). We study the consequences of this heuristic in further detail in Section 2.4.2.
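The sketch below transcribes Algorithm 2.1 into code to make the bookkeeping explicit. The callables `compute_precond` and `solve` are hypothetical placeholders standing in for the corresponding PuReMD kernels, and the preconditioner-quality check of Line 9 is omitted.

```python
import time

def run_with_dynamic_precond(steps, compute_precond, solve, alpha=1.0):
    """Recompute the preconditioner only when the accumulated excess solve
    time (T_loss) exceeds alpha times the last construction time (T_precomp),
    following Algorithm 2.1."""
    t_precomp = t_ideal = 0.0
    t_loss = float("inf")      # force construction on the first step
    precond = None
    for step in range(steps):
        if t_loss >= alpha * t_precomp or precond is None:
            start = time.perf_counter()
            precond = compute_precond()               # Line 4
            t_precomp = time.perf_counter() - start
            start = time.perf_counter()
            solve(precond)                            # Line 6
            t_ideal = time.perf_counter() - start     # best-case solve time
            t_loss = 0.0
        else:
            start = time.perf_counter()
            solve(precond)                            # Line 11
            t_actual = time.perf_counter() - start
            t_loss += t_actual - t_ideal              # accumulate the loss
```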
To provide insight into how to select the value of the parameter α in Algorithm 2.1, consider the following argument for the α that minimizes total solver execution time under a fixed preconditioner recomputation cost and a linear degradation model for the solve time under preconditioner reuse.

Theorem 1 (Optimal Parameterization for Fixed Preconditioner Recomputation). Suppose that for a sequence of n linear systems, a preconditioned solver is utilized where the preconditioner is recomputed and reused according to Algorithm 2.1. Assuming that the solver performance degradation due to preconditioner reuse across solves is linear (solve time increases linearly with reuse) and the cost is identical each time the preconditioner is recomputed, the value which minimizes total solve execution time is α = 1 ± √(d / T_PreComp), where d is the increase in solve time per step due to reusing the preconditioner.

Proof. Let r be the number of solves for which the preconditioner is used. First, observe that from Line 3 of Algorithm 2.1 with our linear degradation assumption, we have

\alpha \cdot T_{PreComp} = T_{loss} = 0 + d + 2d + \dots + (r - 1) d = d \cdot \frac{r^2 - r}{2} .

Equivalently, we have r = \sqrt{ \frac{2 \alpha T_{PreComp}}{d} + \frac{1}{4} } + \frac{1}{2}. Next, considering the total solve time, we observe the following (where, for simplicity, we assume that r divides n):

T_{total} = \sum_{i=1}^{n} T_{actual}^{(i)} + \frac{n}{r} T_{PreComp}
          = n T_{ideal} + \sum_{i=1}^{n} T_{loss}^{(i)} + \frac{n}{r} T_{PreComp}
          = n T_{ideal} + \frac{n d}{2} \left[ \sqrt{ \frac{2 \alpha T_{PreComp}}{d} + \frac{1}{4} } - \frac{1}{2} \right] + n T_{PreComp} \left[ \sqrt{ \frac{2 \alpha T_{PreComp}}{d} + \frac{1}{4} } + \frac{1}{2} \right]^{-1} .

Finally, setting \partial T_{total} / \partial \alpha = 0 shows that the minimum of T_{total} is achieved when α = 1 ± √(d / T_PreComp).

We note that in practice √(d / T_PreComp) ≈ 0, so we select α = 1 for our experiments in Section 2.4.

2.3.3 Solver Implementation

As mentioned above, by constructing the preconditioners only occasionally, the costs associated with computing the approximate inverses in SAI or the incomplete factors in ILU/IC schemes become negligible in practice. However, despite utilizing good initial guesses and the described preconditioning techniques, our solvers typically take tens to hundreds of iterations to converge depending on the charge model and the convergence tolerance used. Hence, the implementation of the preconditioned solver is important from a performance point of view.

Since the EE and ACKS2 models produce indefinite coefficient matrices, our implementation uses a restarted GMRES solver [42], which has in fact been observed to yield better convergence for QEq compared to conjugate gradient and minimum residual solvers, despite the coefficient matrices being symmetric positive semi-definite in QEq. In a left preconditioned restarted GMRES(k) implementation, with k being the maximum number of iterations before restarting, the main parts are the sparse matrix-dense vector multiplications (SpMVs), the application of a preconditioner (PreApp), and the dense vector operations (VecOps). In PuReMD, the sparse coefficient matrix H is stored in CSR format; solver SpMVs are therefore implemented by parallelizing the loop over matrix rows. In the PreApp phase of the ICDD/ILUDD and FG-ILUDD approaches, parallelization is applied across rows within each level of the level scheduling with graph coloring technique. For the SAI preconditioner, PreApp essentially amounts to an SpMV operation, again using the CSR format with simple parallelization of the loop over matrix rows. Finally, VecOps are also implemented by using simple loop parallelization. Overall, with a simple Jacobi preconditioner, SpMV costs are expected to dominate the overall computation, while VecOps and PreApp times are negligible. However, for the SAI, ICDD/ILUDD, and FG-ILUDD preconditioners, the PreApp time is likely to be significant.

2.4 Numerical Experiments

2.4.1 Computing Environment and Benchmark Systems

Results from the numerical experiments presented in the following subsections were obtained on Laconia, a cluster with over 400 compute nodes at Michigan State University's High Performance Computing Center (https://docs.icer.msu.edu/2018_cluster_resources/). Each of the base compute nodes on Laconia has 28 cores, consisting of two fourteen-core Intel Xeon E5-2680v4 Broadwell 2.4 GHz processors, and has 128 GB DDR3 2133 MHz ECC memory. Each core possesses a 64 KB L1 cache (32 KB instruction, 32 KB data), a 256 KB L2 cache, and the capability of running one or two user threads (i.e., hyperthreading). A 35 MB L3 cache is shared between the fourteen cores on a single Broadwell processor. At the time of the experiments, the Laconia nodes ran the CentOS 7.6.1810 distribution of GNU/Linux for x86_64 architectures, kernel version 3.10.0-957.5.1, and glibc version 2.17-260.el7_6.3.
The preconditioned solvers detailed in Section 2.3 were implemented in the shared-memory version of the PuReMD ReaxFF software [10, 11]. The software was built using the GNU Compiler Collection version 8.2.0 with the optimizations enabled by the -O3 flag. For numeric libraries, Intel MKL version 2019.1.144 was utilized. For the experiments, we restricted our simulations to a single socket on Laconia to avoid performance issues due to non-uniform memory access (NUMA) effects. In these cases, the maximum number of OpenMP threads was set to 14 with no hyperthreading enabled.

For benchmarking purposes, four characteristically varied molecular systems from various application scenarios of reactive and polarizable force fields were selected. These systems, which were comprised of bulk water (H2O), amorphous silica (SiO2), pentaerythritol tetranitrate (PETN, C5H8N4O12), and a phospholipid bilayer (not solvated in water), represent liquids, amorphous materials, perfect crystals, and soft matter, respectively. Quantitatively, each system contained between 48,000 and 78,000 atoms; thus, these systems represented moderately sized molecular structures. Table 2.1 summarizes the systems used for performance evaluation.

Table 2.1 Molecular systems used in shared-memory performance evaluation. The third column indicates the number of atoms in the system (N) and the fourth column denotes the dimensions of the rectilinear simulation boxes in R3 in Angstroms (Å).

Name         Chem. Rep.     N      Sim. Box Dims. (Å)      Category
Water (W)    H2O            78480  120.9 × 80.6 × 80.6     Liquid
Silica (S)   SiO2           72000  109.4 × 100.4 × 104.2   Amorphous
PETN (P)     C5H8N4O12      48256  114.5 × 114.5 × 155.5   Crystal
Bilayer (B)  Phospholipid   56800  82.7 × 81.5 × 80.0      Soft matter

Unless indicated otherwise, all simulations were conducted with the following parameters: periodic boundary conditions enabled, 0.25 fs timestep lengths, NVE ensembles, rnonb = 10 Å, and 2000 total simulation timesteps. For the charge distribution solvers, the initial guesses were extrapolated from solutions at earlier timesteps, with cubic and quadratic spline extrapolation, respectively, used for the two linear systems arising in the QEq model solvers in Eq. (2.3), while cubic spline extrapolation was used for the EE and ACKS2 models. All reported data have been averaged over the 2000 simulation timesteps.

Table 2.2 Qualitative behavior of the preconditioning techniques.

Preconditioner   Cost       Longevity  Effectiveness   Parallelism
Jacobi           Very Low   High       Low             High
ICDD(t)          Moderate   Moderate   High            Low
ILUDD(t)         High       Moderate   High            Low
FG-ICDD(t, s)    Very High  Moderate   High            High
FG-ILUDD(t, s)   Very High  Moderate   High            High
SAI(τ)           Moderate   High       Moderate/High   High

Before moving on to the numerical results, in Table 2.2 we give a qualitative summary of the trade-offs involved with the preconditioning approaches we explore. As indicated in this table and demonstrated quantitatively in the subsections that follow, it is crucial to identify preconditioners that are i) effective at reducing the number of solver iterations, ii) computationally inexpensive, and iii) easily parallelizable, especially in the PreApp phase. In the following, we examine each of these trade-off points to identify preconditioners with the best overall performance.

2.4.2 Preconditioner Longevity

As discussed previously and indicated in Table 2.2, construction of the ILU and SAI based preconditioners constitutes a significant cost in terms of computational time.
To reduce this cost, we leverage the fact that atomic positions, and hence the coefficient matrices, evolve slowly over the course of a simulation; as such, a preconditioner, once computed, is reused over several timesteps. An important consideration is then for how many subsequent timesteps a preconditioner can remain effective.

In Fig. 2.1, we show the longevity of the ICDD and SAI preconditioners (the former of which behaves very similarly to its FG-ICDD and ILU-based counterparts) for the bilayer and bulk water systems under our dynamic refactorization scheme. For systems where atoms have limited mobility, like the bilayer system, we observe that both preconditioners remain effective for hundreds to thousands of steps. For systems like water where atoms can move more freely, we observe that the heuristic employed in Algorithm 2.1 leads to significantly more frequent preconditioner recomputations to decrease the overall solve time. Additionally, we observe that as solver tolerances decrease, the frequency of preconditioner recomputation tends to increase, as the potential losses – in terms of solve time – from using a poor quality preconditioner tend to increase (while the preconditioner construction time remains constant). To precisely quantify the frequency, we report in Table 2.3 the total number of preconditioner recomputations during the course of the 2000-step simulations. Overall, these results provide strong evidence that preconditioner construction costs can be amortized by reusing ILU and SAI based preconditioners over hundreds of steps for high tolerances (e.g., 10^-6) and over tens of steps for lower tolerances (e.g., 10^-10 or 10^-14). As will be shown in subsequent results (see Fig. 2.2), such reuse rates are sufficient to reduce the preconditioner construction costs to only a small percentage of the overall solver time. As such, the term "preconditioner cost" will refer solely to the preconditioner application cost (but not its construction cost) in the remainder of this section.

[Figure 2.1 Longevity of preconditioners with the QEq model. Results for the ICDD and SAI preconditioners with the bilayer (left column) and bulk water (right column) systems using the QEq model are presented at various solver tolerances. Panels: (a) ICDD, Bilayer; (b) ICDD, Water; (c) SAI, Bilayer; (d) SAI, Water; each panel plots solver iterations versus simulation step.]

2.4.3 Tuning of Preconditioner Parameters for Cost and Effectiveness

With our proposed sparsification techniques, there is a large space of possible configurations to choose from for each preconditioner (beyond the Jacobi method). Since there exist important trade-offs regarding the effectiveness (convergence rate) and cost of different configurations, we conducted a parameter search study to determine which parameters to utilize for each scheme. In Table 2.4, we present a small snapshot from this study in which parameters that control preconditioner quality and sparsity for the various schemes were varied for the QEq model.

Table 2.3 Number of recomputations for the shared-memory preconditioned solvers.
Results for the ICDD, ILUDD, FG-ICDD, FG-ILUDD, and SAI preconditioned solvers are presented over 2000 simulation steps.

                    Tolerance 10^-6       Tolerance 10^-10      Tolerance 10^-14
CM      Prec.       B    P    S    W      B    P    S    W      B    P    S    W
QEq     ICDD        1    7    1    4      2    28   2    16     3    45   2    27
        FG-ICDD     1    2    1    2      2    9    2    4      3    13   2    6
        SAI         1    3    1    5      2    12   2    13     6    20   2    17
EE      ILUDD       2    4    1    2      2    9    2    7      2    13   2    11
        FG-ILUDD    1    1    1    1      1    4    1    3      1    6    1    4
        SAI         5    16   1    3      3    22   2    8      16   35   5    12
ACKS2   ILUDD       -    -    -    23     -    -    -    35     -    -    -    55
        FG-ILUDD    -    -    -    5      -    -    -    8      -    -    -    13
        SAI         -    -    -    9      -    -    -    12     -    -    -    22

To illustrate the aforementioned trade-offs, focusing on ICDD(0.0)+DS(1.0) and ICDD(0.01)+DS(0.8), we notice that the former yields fewer solver iterations (5.4 versus 5.7 for a tolerance of 10^-6, widening to 29.6 versus 32.0 for 10^-14), but the latter produces the lower mean solve times (0.01 versus 0.03 for 10^-6, widening to 0.03 versus 0.08 for 10^-14). Note that during application of the incomplete Cholesky preconditioner, each nonzero requires a single multiply/add operation; therefore, fewer nonzeros translate to fewer floating point operations. The better solve time for ICDD(0.01)+DS(0.8) can then be attributed to a lower preconditioner application cost, as it has an order of magnitude fewer nonzeros than the ICDD(0.0)+DS(1.0) variant. Also, as we show in more detail in the next subsection, fewer nonzeros allow higher parallelizability during preconditioner application. Similar trade-offs for the other techniques can be observed in Table 2.4. Overall, we notice that sparsification strategies hit the "sweet spot" between the two extremes of achieving high effectiveness and keeping costs as low as possible.

Table 2.4 Shared-memory preconditioner parameter comparison using the QEq charge method. Results from bulk water simulations (6540 atoms for 100 steps) are presented for cost (number of non-zeros), mean solver iterations (SI), and mean solver time (ST) in seconds. Preconditioners were computed only once at step 0.
                          Tolerance 10^-6   Tolerance 10^-10   Tolerance 10^-14
Preconditioner            Cost     SI    ST      SI    ST          SI     ST
ICDD(0.0)+DS(0.6)         2.98E5   5.7   0.01    17.5  0.02        32.2   0.03
ICDD(0.0)+DS(0.8)         7.05E5   5.4   0.02    16.5  0.03        30.3   0.05
ICDD(0.0)+DS(1.0)         1.37E6   5.3   0.03    16.1  0.05        29.6   0.08
ICDD(0.01)+DS(0.6)        1.46E6   5.7   0.01    18.1  0.02        32.6   0.03
ICDD(0.01)+DS(0.8)        1.57E5   5.6   0.01    17.6  0.02        32.0   0.03
ICDD(0.01)+DS(1.0)        1.59E5   5.6   0.01    17.4  0.02        31.7   0.04
FG-ICDD(0.0,3)+DS(0.6)    2.98E5   5.6   0.01    17.4  0.03        32.0   0.05
FG-ICDD(0.0,3)+DS(0.8)    7.05E5   5.4   0.02    16.4  0.05        30.2   0.09
FG-ICDD(0.0,3)+DS(1.0)    1.37E6   5.3   0.05    16.3  0.11        29.9   0.18
FG-ICDD(0.1,3)+DS(0.6)    2.28E4   10.9  0.01    37.9  0.02        72.8   0.07
FG-ICDD(0.1,3)+DS(0.8)    2.29E4   10.8  0.02    37.9  0.02        72.6   0.07
FG-ICDD(0.1,3)+DS(1.0)    2.32E4   10.8  0.03    37.9  0.02        72.5   0.08
SAI(0.01)+DS(0.6)         6.54E3   18.9  0.01    62.2  0.04        134.0  0.12
SAI(0.01)+DS(0.8)         7.58E3   18.9  0.01    63.0  0.04        135.3  0.12
SAI(0.01)+DS(1.0)         2.09E4   10.6  0.01    35.0  0.02        72.3   0.06
SAI(0.5)+DS(0.6)          2.33E4   10.1  0.01    33.0  0.02        67.5   0.05
SAI(0.5)+DS(0.8)          6.40E4   6.7   0.01    21.0  0.02        40.4   0.03
SAI(0.5)+DS(1.0)          1.31E5   14.5  0.01    30.3  0.03        58.4   0.05
SAI(0.1)+DS(0.6)          5.36E4   6.8   0.01    21.5  0.02        41.1   0.03
SAI(0.1)+DS(0.8)          1.35E5   14.4  0.01    29.9  0.02        58.7   0.05
SAI(0.1)+DS(1.0)          2.68E5   8.4   0.01    23.0  0.01        43.2   0.04

Using insights from the above study, high quality parameters were selected for each preconditioning scheme for a given molecular system and convergence tolerance by scanning the potential values of the preconditioner parameters. These parameters are the dual drop threshold t ∈ {0.0, 0.1, 0.01} in ICDD(t), ILUDD(t), FG-ICDD(t, s), and FG-ILUDD(t, s); the additional parameter s for the latter two methods is the number of sweeps (ranging from 2 to 4). For SAI, the percentage of the top nonzeros to be kept (in terms of magnitude) was varied from 3% to 15% in 1.5% increments. For all solvers, distance based pruning was also investigated for values of d ∈ {0.6, 0.8, 1.0}. In the results presented below, the combination that yields the best execution time for each preconditioner is used. We report these parameters in Table A.3 of Appendix A. As can be seen in that table, both dual dropping and distance based pruning are heavily utilized across the board by the different preconditioning schemes, providing evidence of their merit. We note that such a parameter search can be made an integral part of the PuReMD software itself, so that automated tuning is performed for the particular simulation and architecture during the first few hundred to thousand steps of a simulation. Considering the fact that a production simulation typically takes millions to billions of timesteps, the overheads incurred would be negligible.

Table 2.5 Comparison of the cost and effectiveness of the shared-memory preconditioned solvers for the bulk water system. Results for the Jacobi, ICDD, ILUDD, FG-ILUDD, and SAI preconditioned solvers are presented. For cost, the number of nonzeros in the preconditioner is reported for the chosen parameters at a solver tolerance of 10^-6 (sum of factors for ILU-based methods). For effectiveness, the mean solver iterations (SI) and mean solve time (ST) in seconds are given for various solver tolerances (top row). To enable easier comparisons, the solver iterations for the preconditioners have been normalized as improvement over the Jacobi preconditioner (bolded and unnormalized).
                          Tolerance 10^-6    Tolerance 10^-10   Tolerance 10^-14
CM      Prec.     Cost       SI     ST          SI     ST          SI     ST
QEq     None      0          0.8x   0.12        0.8x   0.51        0.9x   0.97
        Jacobi    7.84E4     16.3   0.09        62.7   0.37        128.4  0.80
        ICDD      3.46E6     2.7x   0.06        4.3x   0.14        5.0x   0.25
        FG-ICDD   7.16E6     2.3x   0.08        3.3x   0.21        3.8x   0.38
        SAI       6.42E5     2.6x   0.05        3.5x   0.13        3.6x   0.24
EE      None      0          0.8x   0.10        0.8x   0.30        0.9x   0.55
        Jacobi    7.84E4     12.2   0.08        36.2   0.23        80.9   0.55
        ILUDD     1.97E6     2.4x   0.05        3.9x   0.10        4.4x   0.17
        FG-ILUDD  7.32E6     1.8x   0.06        3.0x   0.16        3.4x   0.26
        SAI       6.53E5     2.8x   0.03        3.5x   0.08        3.5x   0.15
ACKS2   None      0          0.1x   14.98       0.1x   27.58       0.1x   64.60
        Jacobi    1.56E5     125.6  1.62        216.1  3.01        425.7  5.94
        ILUDD     5.07E6     5.4x   0.43        5.0x   0.81        4.5x   1.74
        FG-ILUDD  8.48E6     2.4x   1.02        2.9x   1.55        2.6x   3.38
        SAI       6.34E5     2.2x   0.72        2.3x   1.25        2.1x   3.06

Next, in Table 2.5, we demonstrate the trade-offs between preconditioner cost and effectiveness for the parameters determined as a result of the above search procedure across all charge models. As expected, despite its low cost, Jacobi is the least effective in improving the convergence rate, while still yielding important improvements over the unpreconditioned case, especially for the ACKS2 charge model. The incomplete factorization methods ICDD, ILUDD, FG-ICDD, and FG-ILUDD are the most effective ones. The effectiveness gap between Jacobi and the ILU-based schemes widens as the convergence tolerance decreases from 10^-6 to 10^-10, with lesser gains when moving to 10^-14. However, this effectiveness comes at the expense of increased preconditioner costs, for which the number of nonzeros is of the same order of magnitude as that of the charge matrices themselves. Comparatively, the cost of SAI is between that of Jacobi and the ILU schemes, but the effectiveness of this method is closer to the latter. In addition to the cost and effectiveness examined here, the parallel efficiency of applying the preconditioners represents an important consideration for overall performance in actual simulations. We inspect this issue next.

2.4.4 Scalability of Preconditioned Solvers

In order to analyze the scalability of our preconditioned solvers, strong scaling experiments were performed on a single socket of a dual-socket 28-core node on Laconia. To better comprehend the trade-offs of the preconditioning approaches, the mean solver time was broken down by the major computational kernels in the solver: sparse matrix-dense vector multiplications (SpMV), preconditioner computation (PreComp), preconditioner application (PreApp), and the remaining vector operations (VecOps).

As shown in Fig. 2.2, the percentage of time spent in the preconditioner computation is insignificant for the ICDD and SAI methods as a result of preconditioner reuse. However, the iterative approach used for asynchronously computing the FG-ICDD preconditioner takes a significant percentage of the overall solver execution time, despite reusing the preconditioners and keeping the number of sweeps very low (2-4 sweeps). The high cost of preconditioner computation renders the FG-ICDD method less competitive, at least on the multi-core architectures on which we have experimented. But the fact that preconditioners generated by FG-ICDD exhibit convergence rates similar to those of the regular IC techniques shows that it can provide advantages on massively parallel architectures such as GPUs or Xeon Phis.
[Figure 2.2 Strong scaling plots of shared-memory preconditioned solvers with QEq. Results are presented for three preconditioned solvers with the bilayer (left column) and bulk water (right column) systems. Figures from left to right within a grouping show results at convergence tolerance levels of 10^-6, 10^-10, and 10^-14. Panels: (a) ICDD, Bilayer; (b) ICDD, Water; (c) FG-ICDD, Bilayer; (d) FG-ICDD, Water; (e) SAI, Bilayer; (f) SAI, Water. Each panel plots mean time (s) versus number of threads (1, 2, 6, 14), broken down into PreComp, PreApp, Solver SpMV, and Solver VecOps.]

In terms of parallel scaling, ICDD achieves strong scaling efficiencies ranging from 36% to 46% through 14 threads for the bilayer and bulk water systems, as shown in Fig. 2.2. Improvements over ICDD are observed for FG-ICDD, with scaling efficiencies ranging between 45% and 65%. For the SAI preconditioned solver, we see efficiencies between 45% and 50% for the bilayer system, and between 58% and 59% for the bulk water system. Given that the vast majority of operations involved in the preconditioned solvers have low arithmetic intensities and are therefore memory-bound, the observed efficiencies are in the expected regime.

An important observation in Fig. 2.2 is the relatively higher percentage of time spent in the PreApp phase for ICDD and FG-ICDD when compared to the SAI preconditioner. One would normally expect the PreApp phase of the ILU-based methods to exhibit poorer scaling than that of the SAI method, as the latter boils down to a simple SpMV operation, whereas the former requires level-scheduled forward and backward solves. However, we observe that the PreApp phase for the ILU-based methods (which exploits graph coloring based reordering of matrix rows) and the SAI method scale at roughly the same rate, which is also similar to the scaling efficiency of the SpMV kernel. As such, the reason for the high percentage of time spent by the ILU-based methods in the PreApp phase is directly related to the cost of applying the preconditioners, which is essentially characterized by the number of nonzeros they contain (as given in Table 2.4 and Table 2.5).

2.4.5 Preconditioned Solver Performance

In this subsection, we analyze the overall benefits of the preconditioning techniques we explored (on a single socket, i.e., 14 cores, of a node on Laconia). Fig. 2.3 shows the speedups over the Jacobi preconditioner for our preconditioned solvers on the four benchmark systems.
[Figure 2.3 Speedups for shared-memory solvers for the benchmark systems by charge distribution method. Figures from left to right show results at convergence tolerance levels of 10^-6, 10^-10, and 10^-14, respectively. Speedups achieved are over the Jacobi preconditioned solver. Panels: (a) QEq and (b) EE compare the None, Jacobi, ICDD/ILUDD, FG-ICDD/FG-ILUDD, and SAI preconditioners on the Water, Silica, PETN, and Bilayer systems; (c) ACKS2 shows results for bulk water only.]

Note that we take the Jacobi preconditioner as our base case due to its simplicity, but we also show the unpreconditioned solver performance for reference. We observe that for QEq, the ICDD, FG-ICDD, and SAI preconditioners result in speedups of 1.1 to 6.7 times over the mean solve time using the Jacobi preconditioner, with speedups generally improving as the solver tolerance decreases. A similar trend is observed for these preconditioners with the EE model.

In the case of the ACKS2 model, only the force field parameters for the bulk water system were available to us. For bulk water, SAI produces speedups ranging from 1.8 to 2.4 times, which are similar to the speedups observed for the QEq and EE models. On the other hand, ILUDD produces speedups ranging from 3.3- to 3.7-fold for ACKS2, despite being less parallelizable than SAI. In fact, ILUDD reduces the Jacobi solver iteration counts by about a factor of 5, as can be seen in Table 2.5, compared to the roughly 2.2x improvement observed with SAI. This suggests that the quality of the SAI preconditioner can potentially be improved further by selecting a better sparsity pattern – we currently do not make any distinction between the H_QEq and X subblocks in the ACKS2 matrices while selecting the sparsity pattern. In contrast to the ILUDD and SAI techniques, the FG-ILUDD preconditioner leads to speedups of only 1.6 to 1.9 times, mostly due to the high preconditioner construction cost and the lack of pivoting to handle zero diagonal elements in the charge matrix (the zero values were substituted with unit values for preconditioner computation, thereby introducing errors). One idea that was explored to improve the performance of FG-ILUDD was to significantly improve the convergence of the incomplete factorization by raising the number of sweeps performed. However, this approach proved to be far too costly, as any gains from decreasing the solver iterations were more than outweighed by the increase in preconditioner computation time.

As previously highlighted in Table A.2, the very poor condition numbers of the ACKS2 charge matrices pose a significant challenge, as the total number of solver iterations remains very large despite the limited success of the ILUDD and SAI preconditioners. This results in long solution times during simulations, making preconditioners beyond those explored here desirable for reducing the overall solve time.

2.5 Conclusions

We explored several shared-memory preconditioning methods to improve the performance of iterative linear solvers in the computationally expensive charge distribution models. By applying sparsification techniques, carefully tuning the relevant preconditioner parameters, and adopting an efficient dynamic refactorization scheme, we observed that incomplete factorization-based (such as ICDD, ILUDD, and FG-ILUDD) and SAI-based methods produce good quality preconditioners with relatively low costs.
Performant implementation of these techniques on shared-memory architectures results in solvers that significantly outperform the Jacobi baseline in time-to-solution across the majority of the charge models and molecular systems studied. As such, these techniques have been incorporated into PuReMD through an easy-to-use interface. Our future work on this topic will focus on developing GPU and distributed-memory versions of the solvers explored in this study.

CHAPTER 3
PERFORMANCE OPTIMIZATION OF REACTIVE MOLECULAR DYNAMICS SIMULATIONS WITH DYNAMIC CHARGE DISTRIBUTION MODELS ON DISTRIBUTED MEMORY PLATFORMS

The following chapter presents previously published work on optimization of the distributed-memory MPI implementation of ReaxFF in PuReMD [43]. This work is reproduced with the permission of ACM. Following the successes from the charge solver optimizations in a shared-memory context, efforts were made to apply similar preconditioning techniques together with additional communication reduction algorithms within a distributed-memory MPI-only version of PuReMD. During this work, additional observations were made which allowed other portions of the code to be optimized, including reducing the number of bond order calculations through appropriate pruning of interactions.

3.1 Introduction

Molecular dynamics (MD) simulations play increasingly important roles in diverse fields, ranging from biophysics to chemistry to materials science. Classical MD techniques rely on static bonds and fixed partial charges associated with atoms, limiting their applicability to non-reactive systems. To study phenomena involving chemical reactions, quantum mechanical (QM) methods have typically been the method of choice. QM simulations must account for the electronic degrees of freedom present in the system. As such, they are typically limited to sub-nanometer length and picosecond time scales. However, long-time reactive simulations are critical for several scientific problems such as catalysis, battery interfaces, biological simulations involving water, and emerging areas like surface oxidation and chemical vapor deposition (CVD) growth. Progress on these fronts is limited because long continuous-time simulations of large-scale systems are very difficult, if not impossible, to perform using purely quantum mechanical (QM) or hybrid quantum mechanical/molecular dynamics (QM/MD) simulations.

The Reactive Force Field (ReaxFF) method [1], a bond order potential that bridges quantum-scale and classical MD approaches by explicitly modeling bond activity and redistribution of charges, is in principle ideally suited for this purpose. However, the computationally complex force field formulation hinders ReaxFF's scalability to large numbers of processing cores. In this paper, we present algorithmic and numerical techniques to address scaling bottlenecks and enable high performance ReaxFF simulations at scale.

ReaxFF is a relatively recent model (developed in the early 2000s [1]) and is similar to the classical MD model in the sense that it models atomic nuclei together with their electrons as a point particle. Unlike classical MD models, ReaxFF mimics the bond formation and breakage observed in QM methods by replacing the static harmonic bond models with the bond order concept, which is a quantity indicating bond strength between a pair of atoms based on the types of the atoms and the distance between them. Consequently, ReaxFF can overcome many of the limitations inherent to conventional MD.
While the bond order concept dates back to the 1980s and has been exploited in other force fields before (such as COMB [44] and AIREBO [3]), the distinguishing aspect of ReaxFF is the flexibility and transferability of its force field, which allows ReaxFF to be applied to diverse systems of interest [1, 4, 5, 6]. ReaxFF is currently supported by major open source (PuReMD [7], LAMMPS [8], RXMD [9]) and commercial (ADF, Material Studio) software with an estimated userbase of over 1,000 groups.

In this paper, we focus on the PuReMD software, as PuReMD and its LAMMPS integrations, i.e., the User-ReaxC and User-ReaxC/OMP packages [8], represent the most widely used implementations of ReaxFF. PuReMD uses novel algorithms and data structures to achieve high performance while retaining a small memory footprint. An optimized neighbor generation scheme, elimination of the bond order derivatives list in bonded interactions, lookup tables to accelerate non-bonded interaction computations, and efficient iterative solvers for charge distribution are the major algorithmic innovations in PuReMD [10, 11]. PuReMD has been shown to outperform the LAMMPS/Reax package by 3-5× on various systems while using only a fraction of the memory space [11]. However, the solution of the large sparse linear systems required for the distribution of partial charges and the computations related to dynamic bonding constitute significant bottlenecks against the scalability of PuReMD. More precisely, the sparse linear solves and bond order computations at halo (ghost) regions may start accounting for a significant portion of the execution time in large runs due to communication overheads and redundant computations.

In this paper, we discuss the use of a communication-hiding conjugate gradient solver to prevent the onset of collective communication overheads at scale (Section 3.3.1), present a preconditioning technique that significantly accelerates the sparse linear solves in charge distribution (Section 3.3.2), and describe a novel technique to avoid redundant bond order computations at halo regions (Section 3.3.3). We evaluate the performance impact of these techniques through extensive tests and demonstrate that they significantly improve the execution time and scalability of large ReaxFF simulations (Section 3.4). While the techniques presented are discussed in the context of the ReaxFF model, they can directly be used in other bond order potentials. The accelerated charge model solvers can also be useful for classical force fields with polarizable charge models.

3.2 Background

The ReaxFF approach allows reactive phenomena to be modeled with atomistic resolution in a molecular dynamics framework. Consequently, ReaxFF can overcome many of the limitations inherent to conventional molecular simulation methods, while retaining, to a great extent, the desired scalability. Fig. 3.1 depicts the various ReaxFF interactions and summarizes the workflow of a simulation, which includes computation of various atomic interaction functions (bonds, lone pair, over-/under-coordination, valence angles, torsions, van der Waals and Coulomb) and summing up the various contributions to obtain the net force on each atom for a given time step. In terms of its implementation in PuReMD, the major components of ReaxFF can be summed up as neighbor generation, initialization of interaction lists, bonded interactions, charge distribution, and non-bonded interactions.
[Figure 3.1 ReaxFF's computational workflow includes bonded, non-bonded, and system-specific interactions, each with different physical formulations and cutoffs. The workflow proceeds from determining new atom positions and near/far neighbors, through uncorrected bond orders (BO), partial charges (QEq), and local BO corrections, to the bonded (bonds, lone pair, over-/under-coordination, valence angle, torsion), non-bonded (Coulomb, van der Waals), and system-specific (H-bonds, etc.) terms that are summed into the system energy and the net forces.]

Before presenting our performance improvement techniques, we summarize each of these major components and the general parallelization scheme in PuReMD.

3.2.1 Parallelization

PuReMD uses the well-known domain decomposition technique where the input domain is divided into equal-sized subdomains according to a prescribed 3D Cartesian configuration of processes. Each process is then responsible for calculating the net force on all atoms in its subdomain and updating their positions based on the velocity Verlet integration scheme [45]. After position updates, atoms crossing process boundaries, if any, are transferred, and atom information at the boundaries is exchanged locally (halo exchange) between processes that are neighbors in the 3D grid. Since interaction functionals in ReaxFF are computationally very expensive, interactions falling at process boundaries are computed on only one of the processes (as explained in [11]) to prevent redundant computations. This necessitates the exchange of force information at the end of a timestep for those atoms whose coordinates were communicated at the start of the timestep.

3.2.2 Neighbor Generation

In PuReMD, neighbor lists are generated using the linked cell algorithm [45], which bins atoms into small cells based on their coordinates and generates the neighbors for each atom by only scanning atoms in cells within the applicable interaction cutoffs (typically 10-12 Å). The neighbor list is stored by default as a half list, i.e., for neighboring atoms i and j, only a single record is kept. A compact adjacency list format (similar to the compressed row format in sparse matrices) is used for storing the neighbor list. Neighbor generation is accelerated using the Verlet list method, which adds a buffer region on top of the largest interaction cutoff and updates the neighbors of each atom by scanning the Verlet list instead of the neighboring cells. This way, Verlet lists need to be formed from scratch (which we call reneighboring and which is performed using the neighboring cell information) only occasionally. Reneighborings in ReaxFF can be delayed for a few hundred timesteps (typically 250 to 500) as each timestep in ReaxFF is only fractions of a femtosecond, i.e., an order of magnitude smaller than the typical timestep length in classical MD. A toy sketch of this cell-based neighbor search is given below.
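The following is a toy serial sketch of cell-list based neighbor generation in a periodic cubic box; it only illustrates the idea described above (binning plus scanning the 27 surrounding cells), not PuReMD's optimized implementation, and all names are illustrative. Calling it with cutoff + skin in place of the cutoff yields a Verlet list that can be reused until atoms have moved by more than half the skin.

import numpy as np

def cell_list_neighbors(pos, box, cutoff):
    # Bin atoms (coordinates in [0, box)) into cells of edge >= cutoff, then
    # scan only the 27 surrounding cells of each atom's cell; a half list is
    # returned (one record per pair), using the minimum-image convention.
    ncell = max(1, int(box // cutoff))
    edge = box / ncell
    cells = {}
    for a, c in enumerate(map(tuple, np.floor(pos / edge).astype(int) % ncell)):
        cells.setdefault(c, []).append(a)
    pairs = set()
    for (ix, iy, iz), atoms in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    key = ((ix + dx) % ncell, (iy + dy) % ncell, (iz + dz) % ncell)
                    for i in atoms:
                        for j in cells.get(key, []):
                            if i < j:
                                d = pos[i] - pos[j]
                                d -= box * np.round(d / box)   # minimum image
                                if d @ d < cutoff * cutoff:
                                    pairs.add((i, j))
    return sorted(pairs)

rng = np.random.default_rng(2)
pos = rng.uniform(0.0, 25.0, size=(300, 3))
verlet = cell_list_neighbors(pos, box=25.0, cutoff=4.0 + 1.0)   # cutoff + skin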
3.2.3 Initialization of Interaction Lists

While neighbor information is needed only for non-bonded interactions (with the cutoff radius rnonb typically being 10-12 Å) in most classical MD models, in ReaxFF neighbor information is additionally needed for the formation of the Hamiltonian matrix of the dynamic charge model (which uses a distance cutoff identical to that of the non-bonded Coulomb interactions), for bonded interactions (with cutoff radius rbond being 4-5 Å), and for hydrogen bond interactions (with cutoff radius rhbond being 7.5 Å). Consequently, PuReMD creates four interaction lists by scanning through the Verlet list, updating the atomic pair distances found therein, and checking them against the various cutoffs involved.

As expected, the process of creating the interaction lists is memory-bound in terms of performance because the calculations associated with creation of these lists are trivial. However, the bonds list is an exception here, as it requires expensive bond order calculations between pairs of atoms that are within the prescribed rbond cutoff. The memory-bound nature of this kernel and the expensive bond order calculations make this routine (which we denote shortly by init) one of the expensive parts of PuReMD.

3.2.4 Bonded Interactions

Since bonds are dynamic in the bond order formalism, 3-body and 4-body interactions (which involve three and four atoms, respectively, as their names indicate) also need to be discovered on-the-fly. Accurately modeling chemical reactions and avoiding discontinuities on the potential energy surface in the presence of dynamic bonds requires almost all bonded interactions (such as bond energy, 3-body valence angle energy, and 4-body torsion energy) to have significantly more complex mathematical formulations than those found in classical MD models [1, 4]. In addition, in a reactive environment, atoms often do not achieve their optimal coordination numbers; to compensate for this, ReaxFF requires additional modeling abstractions such as lone pair, over/under-coordination, and 3-body and 4-body conjugation potentials, which introduce significant computational cost to the evaluation of bonded interactions. Consequently, bonded interaction calculation costs, which are insignificant in classical MD models, constitute a significant part of the total execution time in ReaxFF.

3.2.5 Charge Distribution

An important requirement for correctly modeling reactions is the charge distribution procedure, which tries to approximate the partial charges on atoms using suitable models such as the Electronegativity Equilibration Method (EEM) [21] and the Charge Equilibration Method (QEq) [22]. While the chemical intuition behind these two methods is quite different, they produce identical-looking charge distributions in practice. Since QEq produces a symmetric positive definite Hamiltonian, which is easier to solve using distributed memory solvers, PuReMD uses the QEq method.

In QEq, charges are determined by minimizing the electrostatic energy E_ele. Let R = (r_1, r_2, ..., r_n) denote the atomic positions in a system with n atoms, where r_i ∈ R^3. Atomic charges q = (q_1, q_2, ..., q_n), q_i ∈ R, are thus defined by solving the following optimization problem:

\arg\min_{q} E_{ele}(q) = \sum_{i} \chi_i q_i + \frac{1}{2} \sum_{i,j} H_{ij} q_i q_j
\quad \text{subject to} \quad q_{net} = \sum_{i} q_i ,
\quad \text{where} \quad H_{ij} = \delta_{ij} \cdot J_i + (1 - \delta_{ij}) \cdot F_{ij} ,   (3.1)
F_{ij} = \begin{cases} \left( r_{ij}^3 + \gamma_{ij}^{-3} \right)^{-1/3}, & r_{ij} \le r_{nonb} \\ 0, & \text{otherwise.} \end{cases}

In Eq. (3.1), χ_i and J_i denote the atomic electronegativity and idempotential; δ_ij denotes the Kronecker delta operator; r_ij = ||r_j − r_i||_2 signifies the distance between the atomic pair i and j; and γ_ij = \sqrt{\gamma_i \cdot \gamma_j} denotes a pairwise tuned parameter for elements i and j for avoiding unbounded electrostatic energy at short distances. Applying the method of Lagrange multipliers to Eq. (3.1), the sets of linear equations below are obtained [25, 10]:

\sum_{i=1}^{n} H_{ki} s_i = -\chi_k , \quad k = 1, \dots, n ,
\sum_{i=1}^{n} H_{ki} t_i = -1 , \quad k = 1, \dots, n .   (3.2)

The dimension of the linear systems in Eq. (3.2) is equal to the number of atoms in the simulation (and can be on the order of several millions to billions). Due to the computational expense of including long-range interactions during charge distribution, QEq uses a truncated electrostatic kernel and therefore H is a sparse matrix. Consequently, these systems must be solved approximately using iterative methods [10]. Using the solutions to Eq. (3.2), the partial atomic charges q_i can be determined:

q_i = s_i - \frac{\sum_{j=1}^{n} s_j}{\sum_{j=1}^{n} t_j} \, t_i .
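For illustration, the following is a toy dense-matrix rendering of Eqs. (3.1)-(3.2) and the charge recovery formula above. PuReMD solves these systems with sparse iterative Krylov solvers, not a dense factorization, and the element parameters (chi, eta, gamma) used here are random placeholders rather than force field values.

import numpy as np

def qeq_charges(pos, chi, eta, gamma, r_nonb=10.0, q_net=0.0):
    # Build the shielded Coulomb matrix H of Eq. (3.1), solve the two linear
    # systems of Eq. (3.2), and combine them to satisfy the charge constraint.
    n = len(pos)
    H = np.zeros((n, n))
    for i in range(n):
        H[i, i] = eta[i]                        # idempotential J_i on the diagonal
        for j in range(i + 1, n):
            r = np.linalg.norm(pos[i] - pos[j])
            if r <= r_nonb:
                g = np.sqrt(gamma[i] * gamma[j])
                H[i, j] = H[j, i] = 1.0 / (r**3 + g**-3) ** (1.0 / 3.0)
    s = np.linalg.solve(H, -chi)
    t = np.linalg.solve(H, -np.ones(n))
    return s - (s.sum() - q_net) / t.sum() * t  # reduces to the formula above for q_net = 0

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 15.0, size=(50, 3))
chi = rng.uniform(1.0, 8.0, 50)
eta = rng.uniform(8.0, 14.0, 50)
gamma = rng.uniform(0.5, 1.0, 50)
q = qeq_charges(pos, chi, eta, gamma)
print(q.sum())                                  # ~0 up to round-off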
3.2.6 Nonbonded Interactions

Nonbonded interactions include van der Waals and Coulomb forces, which have the functional forms of the Morse and electrostatic potentials, respectively. As these interactions are pairwise interactions, they can be calculated directly by going over the Verlet list. When advantageous performance-wise, non-bonded interactions can be approximated with very high accuracy using cubic spline interpolations over lookup tables.

3.3 Methods

Charge distribution is essentially a precursor to the Coulomb interactions. However, the iterative method used therein is significantly more expensive than the Coulomb interaction itself. Even worse, this iterative method requires a large number of communication operations (both local and global). We note that, with the exception of the position updates and force exchanges needed as part of parallelization, among all the kernels described above only the charge distribution solver requires communication. Again, among all kernels, only the initialization of the bonded interaction lists incurs redundant calculations, which may hamper scalability in large simulations. In this section, we present algorithms and numerical techniques to alleviate these inefficiencies.

3.3.1 Reduction of Global Communication Overheads in Charge Solvers

3.3.1.1 Baseline Preconditioned Conjugate Gradient

Previous versions of PuReMD used the conjugate gradient (CG) algorithm for solving for the charges in the QEq procedure [11]. CG is a member of the Krylov subspace methods; these methods seek a solution x to the sparse system Ax = b, with A being symmetric positive definite, which falls within the Krylov subspace \kappa(A, b) = span\{ b, Ab, A^2 b, \dots \}. Each iteration of CG increases the dimension of the subspace, and the procedure terminates when an acceptable solution is found. In the case of CG, storing all the vectors which define κ is not necessary due to the orthogonality and A-orthogonality of the conjugate vectors. Computing additional conjugate vectors with the prescribed A-orthogonality obviously requires sparse matrix-vector multiplications (SpMVs), which in the parallel context of PuReMD entail two local communications over the 3D process torus – a forward communication to distribute the input vectors and a backward communication to accumulate partial results into the output vector (note that in PuReMD the Hamiltonian is stored as a half matrix to leverage symmetry). Also, the orthonormalizations require the calculation of inner products and vector norms. In a parallel setting, these operations entail two all-reduce operations, which are quite expensive global communications in large scale runs. In PuReMD, the standard CG solver is improved by: i) producing good initial guesses, ii) solving both systems in Eq. (3.2) through simultaneous iteration, and iii) utilizing a Jacobi (diagonal) preconditioner. The initial guesses in PuReMD are based on the observation that atomic positions (and hence charges) change only slightly from timestep to timestep; as such, PuReMD uses cubic and quadratic extrapolations of the solutions of Eq. (3.2) from previous timesteps.
The Jacobi preconditioning idea stems from the fact that the Hamiltonian H carries a heavy diagonal. Both techniques are inexpensive, yet quite effective; together they substantially improve the convergence rate of CG for QEq, but as we demonstrate in Section 3.4, this basic solver still requires tens of iterations per timestep, hampering the overall scalability.

3.3.1.2 Preconditioned Pipelined Conjugate Gradient

The scalability of Krylov subspace methods like CG is hampered by the global communications during the calculation of inner products and vector norms. To combat this scalability issue, the pipelined CG (PIPECG) algorithm [46] aims to achieve lower communication latency by reducing the number of these communications to only one non-blocking global reduction per iteration. PIPECG essentially rearranges the algebraic formulation of CG in order to achieve this goal of one communication per step. As a trade-off, PIPECG ends up performing more computation (dense vector operations) per step and also exhibits slightly worse convergence properties than CG (i.e., stagnation at tolerances beginning at around 10^-11). Algorithm 3.1 highlights the advantages and disadvantages of PIPECG by showing that the single global communication (lines 20 and 21) can be overlapped with the preconditioner application and SpMV on the subsequent lines, effectively hiding some of the communication latency – specifically, an MPI_Iallreduce call can be started at these lines and waited on until the completion of the SpMV. Additionally, the increased number of vector operations is shown in lines 17 through 19. Because of these advantages, we implement PIPECG and analyze its impact on performance in the following section.

Algorithm 3.1 Preconditioned Pipelined Conjugate Gradient.
1: function PIPECG(H, x0, b, M, τ, miters)
2:   u ← Hx0                          ▷ SpMV
3:   r ← b − u
4:   u ← Mr                           ▷ Apply Preconditioner
5:   w ← Hu                           ▷ SpMV
6:   δ ← wᵀu, γnew ← rᵀu
7:   κ ← √(uᵀu), bnorm ← √(bᵀb)       ▷ Global Redux
8:   m ← Mw                           ▷ Apply Preconditioner
9:   n ← Hm                           ▷ SpMV
10:  x ← x0, i ← 0
11:  while κ/bnorm ≥ τ and i ≤ miters do
12:    if i > 0 then
13:      β ← γnew/γold, α ← γnew/(δ − β·γnew/α)
14:    else
15:      β ← 0, α ← γnew/δ
16:    end if
17:    z ← n + βz, q ← m + βq, p ← u + βp
18:    d ← w + βd, x ← x + αp, u ← u − αq
19:    w ← w − αz, r ← r − αd
20:    γold ← γnew, δ ← wᵀu
21:    γnew ← rᵀu, κ ← √(uᵀu)         ▷ Global Redux
22:    m ← Mw                         ▷ Apply Preconditioner
23:    n ← Hm                         ▷ SpMV
24:    i ← i + 1
25:  end while
26:  return x
27: end function
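The following is a serial NumPy rendering of Algorithm 3.1 with a Jacobi preconditioner, intended only to make the recurrences concrete; it omits the MPI aspects, and the comments indicate where the single non-blocking all-reduce of the distributed implementation would overlap with the preconditioner application and SpMV.

import numpy as np

def pipecg(H, b, M_diag_inv, tol=1e-8, max_iters=500):
    # Serial sketch of Algorithm 3.1 (Jacobi preconditioner M = diag(H)^-1).
    x = np.zeros_like(b)
    r = b - H @ x
    u = M_diag_inv * r                    # u = M r
    w = H @ u
    delta, gamma = w @ u, r @ u
    kappa, bnorm = np.sqrt(u @ u), np.sqrt(b @ b)
    m = M_diag_inv * w
    n_vec = H @ m                         # the vector n of Algorithm 3.1
    z = np.zeros_like(b); q = np.zeros_like(b)
    p = np.zeros_like(b); d = np.zeros_like(b)
    alpha, gamma_old, i = 0.0, gamma, 0
    while kappa / bnorm >= tol and i <= max_iters:
        if i > 0:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha)
        else:
            beta, alpha = 0.0, gamma / delta
        z = n_vec + beta * z; q = m + beta * q; p = u + beta * p
        d = w + beta * d
        x = x + alpha * p; u = u - alpha * q
        w = w - alpha * z; r = r - alpha * d
        gamma_old = gamma
        delta = w @ u                     # lines 20-21: in the MPI code these
        gamma = r @ u                     # reductions start one non-blocking
        kappa = np.sqrt(u @ u)            # all-reduce ...
        m = M_diag_inv * w                # ... overlapped with PreApp
        n_vec = H @ m                     # ... and with this SpMV
        i += 1
    return x, i

A = np.random.default_rng(1).standard_normal((200, 200))
H = A @ A.T + 200.0 * np.eye(200)         # SPD test matrix
b = np.ones(200)
x, iters = pipecg(H, b, 1.0 / np.diag(H))
print(iters, np.linalg.norm(H @ x - b))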
3.3.2 Acceleration of QEq through Preconditioning

Despite reducing global communication overheads, the PIPECG solver for QEq described above still has excessive computation and communication costs compared to the rest of the operations that must be performed in a simulation step. Therefore, to accelerate the convergence of the PIPECG algorithm, we present a novel distributed-memory preconditioning scheme based on the sparse approximate inverse (SAI) technique.

3.3.2.1 SAI Preconditioning

Given a linear system Hx = b, sparse approximate inverse (SAI) preconditioning aims to find a matrix M that serves as a good approximation to H^-1 by selectively computing the entries of H^-1. Using M as a preconditioner (the transformed system is MHx = Mb in the case of left preconditioning and HMu = b, where x = Mu, in the case of right preconditioning), the preconditioned system is ideally expected to converge faster than the original linear system. For this to pay off, the cost of computing and applying the preconditioner must be marginal, as the construction of M is non-trivial and SAI preconditioning introduces one extra SpMV per solver iteration. Considering that the costs of SAI preconditioner construction and application are directly proportional to the number of non-zeros in M, M should be chosen to be even sparser than the linear system H itself (note that H^-1 is actually a dense matrix).

For the QEq problem, we choose the Frobenius norm minimization variant of left SAI preconditioning [47], which tries to find an M such that ||I − MH||_F is minimized, where ||·||_F denotes the Frobenius norm of a matrix. In PuReMD, H is stored in the compressed sparse row (CSR) format, which actually makes constructing M as a right preconditioner more straightforward and affordable. Note that since H is symmetric, the left SAI preconditioner can then easily be inferred because the left and right SAI preconditioners are transposes of each other [41]. Let e_j be the j-th column of the n × n identity matrix and let m_j be the j-th column of M. Then we have

\min_{M \in \mathbb{R}^{n \times n}} \| I - HM \|_F^2 = \sum_{j=1}^{n} \min_{m_j \in \mathbb{R}^{n \times 1}} \| e_j - H m_j \|_2^2 .

Finding the m_j that minimizes each \| e_j - H m_j \|_2^2 corresponds to solving a set of independent least squares problems, and as such M can be constructed column by column through a series of QR decompositions. These decompositions are performed using optimized LAPACK libraries (e.g., the Intel Math Kernel Library). Note that solving each least squares problem can be quite expensive considering the dimensions of H. Leveraging the fact that M will be much sparser than H, we reduce the computational cost of the least squares problems significantly by building dense vectors m̂_j and ê_j, and a dense matrix Ĥ_j. This densification process essentially amounts to eliminating rows and columns that are entirely composed of zeros, as those rows and columns do not contribute to Hm_j (see Sect. 2.1 of [41] for further details).

While the densification step significantly reduces the computation cost of M, it alone is not sufficient to make our SAI preconditioner useful in practice. The key insight that makes SAI useful for ReaxFF in practice is the same as the one we use for the solver initial guesses: because an atomic system evolves slowly, the preconditioner M computed at a certain timestep can be reused effectively for several subsequent steps. In fact, as we demonstrate in Section 3.4, the SAI preconditioner can be useful for hundreds of steps. As such, in the accelerated QEq solver we reconstruct M only occasionally and amortize the SAI preconditioner computation costs over several steps.

Determining the sparsity pattern of M, however, remains a challenge. The inverse of a sparse matrix is generally a dense matrix; for the inverses of our QEq matrices (H^-1), we have observed that the magnitude of many entries is small. While it is desirable to select a sparsity pattern that captures the positions of the numerically large entries of H^-1, these positions cannot be known a priori. Several works have suggested approaches for choosing a sparsity pattern [40], but these are neither sparser than H nor practical to compute in our case. During experiments with several molecular systems, we observed a strong correlation between the positions of the numerically large entries of H and H^-1. Therefore, we hypothesized that the positions of the numerically large entries in H are a good candidate for the sparsity pattern of M.
Subsequent experiments confirmed that including only the τ percent numerically largest entries of H indeed yields promising results. We remark, however, that finding the ideal τ value is non-trivial. On the one hand, larger τ values (15%-20%) substantially improve the CG convergence, but the computation and application of M become expensive. On the other hand, smaller τ values make the computation and application of M more affordable at the expense of smaller improvements in solver convergence. The longevity of M (discussed in the previous paragraph), the number of processes used, and the parallel efficiency are certainly other important considerations. We observe that, in general, τ being 10% to 15% (depending on the molecular system) leads to decent improvements in the total QEq solve time. While we empirically determine the ideal value of τ for the experiments presented in this paper, note that MD simulations typically last for millions to billions of steps. As such, an auto-tuning mechanism that empirically determines the ideal τ value and preconditioner reconstruction frequency on-the-fly for a given atomic system and architecture, while performing the MD simulation itself, is certainly feasible and is planned as future work.

3.3.2.2 Parallelization of the SAI Preconditioner

Having described how we "engineer" a custom SAI preconditioner that is practical for dynamic charge models, we next detail its parallelization. While this topic is arguably well documented in the literature, our contribution here is the development of an SAI implementation that leverages problem-specific characteristics of ReaxFF for high efficiency and scalability, so that the resulting solver can perform significantly better than PuReMD's existing QEq solver, which already delivers decent results with its simple Jacobi preconditioning scheme.

Our discussion focuses on the parallel construction of the SAI preconditioner M, as this step is highly non-trivial; application of M, which needs to be performed at each iteration of PIPECG, is trivial as it simply requires a parallel SpMV. Construction of M includes two main subtasks: i) pattern selection, and ii) setup and solution of the least squares problems. As both subtasks are intimately tied to the structure of the Hamiltonian H, we begin with its description.

3.3.2.3 Structure of the Hamiltonian

In the Hamiltonian H, a matrix entry (i, j) represents the charge affinity relation between atoms i and j. While the original charge solver in PuReMD adopts a half matrix to reduce storage and SpMV computation costs, in the SAI preconditioned solver we store the full H matrix in distributed memory (hence redundantly storing the symmetric entries twice), as this simplifies the setup of the least squares problems (subtask ii). With the full storage scheme, the H matrix is effectively 1D block partitioned because each process knows all neighbors (i.e., all non-zeros in a row) of its local atoms (i.e., all rows it owns). Locally, each process stores its submatrix in a structure that facilitates the construction of M, as well as the parallel SpMVs (needed in CG/PIPECG). As depicted in Fig. 3.2, lower indices are reserved for local atoms. In PuReMD, local communications between neighboring processes follow a three-stage communication scheme that respects the 3D torus topology: first, atom information is exchanged in the −x and +x directions; then in the −y and +y directions, and finally in the −z and +z directions [11].
Atoms received after each stage are indexed contiguously in the atom list as well as in the local Hamiltonian, and messages to be transmitted in the subsequent stage are augmented with the received atom information as necessary. This communication scheme yields the indexing and organization shown in Fig. 3.2.

[Figure 3.2 Structure of the local charge matrix in distributed-memory simulations. The local matrix is organized into column regions R0 through R6, where region R0 corresponds to the interactions between local atoms, R1 corresponds to the interactions between local atoms and the atoms imported from the −x direction, and so on for the +x, −y, +y, −z, and +z directions.]

3.3.2.4 SAI Preconditioner Sparsity Pattern Selection

Given p processes and the filtering parameter τ, for the local Hamiltonians H_s^loc of each process, with 1 ≤ s ≤ p, the top τ percent numerically largest non-zeros of H are selected for the sparsity pattern of M. This problem can equivalently be stated as finding the k-th largest non-zero, namely α, in the union of all local Hamiltonians, such that when entries less than α are filtered out, only τ percent of the non-zeros remain. However, finding the exact α does not lend itself to an efficient parallel algorithm, and exactness is not required since τ is an empirically determined threshold. As such, we approximate α using a sampling-based technique as shown in Algorithm 3.2 (see line 2). Sampled numbers from each process are collected at the root (line 5), and these numbers are processed using a Quickselect algorithm to efficiently find α (line 7). After broadcasting α, each process can obtain its local sparsity pattern H_s^sp by filtering out entries in its local H_s^loc matrix (line 10); a serial stand-in for this selection step is sketched after the listing.

Algorithm 3.2 Distributed SAI(τ) Preconditioner Sparsity Pattern Selection.
1: function SetupSAI(H, p, τ)
2:   Plocal ← Sample(H, samplerate)
3:   s ← Reduce(length of Plocal)
4:   if rank = root then
5:     Pglobal ← Gather(Plocal)
6:     k ← s · τ / 100
7:     α ← Quickselect(Pglobal, k)
8:   end if
9:   Bcast(α)                          ▷ Broadcast threshold value
10:  Hsp ← FilterMatrix(H, α)
11:  return Hsp
12: end function
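The following toy NumPy routine mirrors the sampling-plus-selection step of Algorithm 3.2 in a serial setting; np.partition stands in for Quickselect, the per-process arrays emulate the distributed non-zero values, and in the MPI version the samples would be gathered at a root rank with the resulting α broadcast afterwards. All names are illustrative.

import numpy as np

def select_threshold(local_nnz_values, tau, sample_rate=0.05, seed=0):
    # Estimate the magnitude threshold alpha that keeps roughly the top
    # tau percent of non-zeros across all "processes".
    rng = np.random.default_rng(seed)
    samples = []
    for vals in local_nnz_values:                    # one array per process
        m = max(1, int(sample_rate * len(vals)))
        samples.append(np.abs(rng.choice(vals, size=m, replace=False)))
    pooled = np.concatenate(samples)                 # Gather at the root
    k = max(1, int(len(pooled) * tau / 100.0))
    alpha = np.partition(pooled, len(pooled) - k)[len(pooled) - k]   # Quickselect
    return alpha                                     # Bcast in the MPI version

# Filtering a local CSR matrix H then amounts to dropping entries below alpha:
#   H.data[np.abs(H.data) < alpha] = 0.0
#   H.eliminate_zeros()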
3.3.2.5 SAI Preconditioner Computation

As described in Section 3.3.2.1, constructing the SAI preconditioner entails solving n least squares problems, where n is the total number of atoms. These problems are independent and thus can be solved in parallel by performing QR decompositions. However, the precursor to the QR decompositions is building the least squares problems, which we explain through the example in Fig. 3.3. After determining α, the least squares problem for atom i on process s is built based on the sparsity pattern of the i-th column of H_s^sp. In this example, we form the least squares problem for atom 0, and suppose that the non-zeros in column 0 of H_s^sp are in rows 0 and 3 (the non-zeros in rows 4 and 7 are below α and are dropped). Consequently, columns 0 and 3 of H are included, but the real challenge lies in finding the subset of rows of H that are to be included in the densified least squares problem Ĥ_i. Those rows are the union of indices j such that at least one of the selected columns of Ĥ_i has a non-zero element at index j. Referring to the example in Fig. 3.3, the rows incorporated into the densified matrix will be {0, 3, 4, 7} ∪ {3, 5} = {0, 3, 4, 5, 7}, which are the rows with non-zero entries in columns 0 and 3, respectively.

Figure 3.3 The global view of the process of constructing and solving one of the least-squares problems for the distributed SAI preconditioning approach with an example QEq matrix. In the figure, the formation of ê_0, Ĥ_0, and m̂_0 is shown for an example QEq matrix H ∈ R^{8×8} and preconditioner column m_0, through column selection (blue) and row elimination (red). In the matrices, empty spaces and X's denote zero and non-zero entries, respectively.

This process is straightforward if all atoms are local, but when there is a non-local atom j in the sparsity pattern of the i-th column of H_s^sp, all neighbors of j are needed; this information must be imported from another process. Fortunately, when the local Hamiltonians are stored as full matrices, all neighbors of atom j can be obtained from the process that owns atom j.

We describe the SAI preconditioner setup operation in Algorithm 3.3. To efficiently import all the neighbors of a non-local atom, a modified version of the three-stage messaging scheme is employed in the CompSAI function. Each process saves the number of neighbors of each of its local atoms, which equals the number of non-zero entries in the corresponding row of the local matrix, to a list (lines 2-3). Then, every process distributes its list using staged communication (line 8) so that each process learns how many non-zero entries must be imported for each of its non-local neighbors (lines 10-14). Next, the processes allocate space for those non-zero entries (line 15) and pack the non-zero entries of the local atoms that are needed by other processes (lines 16-20). Then, again using the same communication scheme, each process sends its packed entries (line 21). Finally, the processes construct the dense matrices and solve the resulting least squares problems via QR decompositions for each of their local atoms (lines 22-25).

Algorithm 3.3 Distributed SAI(τ) Preconditioner Computation.
 1: function CompSAI(H, H_sp, n, N)
 2:   for i ← 0 to n − 1 do
 3:     row_nnz(i) ← H(i).end − H(i).start
 4:   end for
 5:   for i ← n to N − 1 do
 6:     row_nnz(i) ← 0
 7:   end for
 8:   Dist(row_nnz)                   ▷ Send non-zeros per row to neighboring processes
 9:   nnz_recv ← 0
10:   for i ← n to N − 1 do
11:     if row_nnz(i) ≠ 0 then        ▷ If non-local atom i is needed by this process
12:       nnz_recv ← nnz_recv + row_nnz(i)
13:     end if
14:   end for
15:   AllocateSpace(rows_recv, nnz_recv)
16:   for i ← 0 to n − 1 do
17:     if atom i is needed by a neighbor then
18:       rows_send.append(H(i))      ▷ Copy row i non-zeros into comm. buffers
19:     end if
20:   end for
21:   Dist(rows_send)                 ▷ Send non-zeros to neighboring processes
22:   for i ← 0 to n − 1 do
23:     ê_i, Ĥ_i, m̂_i ← BuildDenseMatrix(H, H_sp, rows_recv, i)
24:     H_sai(i) ← QR(ê_i, Ĥ_i, m̂_i)
25:   end for
26:   return H_sai
27: end function
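The per-atom solve at lines 23-24 of Algorithm 3.3 can be illustrated with a short, self-contained C sketch. Here LAPACKE's dgels routine (a QR-based least-squares solver) stands in for PuReMD's internal dense QR kernel, and the names Hi, ei, and solve_sai_column are assumptions made for illustration rather than the actual implementation.

/* Minimal sketch of the per-atom least-squares solve behind lines 23-24 of
 * Algorithm 3.3. BuildDenseMatrix is assumed to have produced the densified
 * matrix Hi (nrows x ncols, row-major) and the right-hand side ei (the unit
 * vector restricted to the selected rows). */
#include <stdlib.h>
#include <string.h>
#include <lapacke.h>

/* Solves min || Hi * mi - ei ||_2 and writes the ncols entries of column i
 * of the preconditioner M into mi. Returns 0 on success. */
int solve_sai_column( const double *Hi, const double *ei,
        int nrows, int ncols, double *mi )
{
    lapack_int info;
    /* dgels overwrites its inputs, so work on copies */
    double *A = malloc( sizeof(double) * nrows * ncols );
    double *b = malloc( sizeof(double) * (nrows > ncols ? nrows : ncols) );

    memcpy( A, Hi, sizeof(double) * nrows * ncols );
    memcpy( b, ei, sizeof(double) * nrows );

    /* QR-based least-squares solve (row-major storage, single right-hand side) */
    info = LAPACKE_dgels( LAPACK_ROW_MAJOR, 'N', nrows, ncols, 1,
            A, ncols, b, 1 );

    if ( info == 0 )
        memcpy( mi, b, sizeof(double) * ncols );   /* solution in first ncols entries */

    free( A );
    free( b );
    return (int) info;
}

Each of the n such problems is independent, which is why they can be solved concurrently across (and within) processes.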
3.3.3 Optimization at the Ghost Regions

In ReaxFF, the precursor to the bonded interactions is the calculation of the uncorrected bond orders (BOp), which is performed within the r_bond radius of each atom during the initialization of the interaction lists. Once the bond order list is initialized, all bonded interaction functions are calculated without any redundancies. For instance, a bond at the boundary between two processes is calculated by the owner of the atom with the smaller global id, and a 3-body interaction shared between 3 processes is calculated by the owner of the middle atom. In PuReMD, a process initializes all bonds that fall within its boundaries, including the ghost regions, so that it is prepared to handle all kinds of bonded interactions that it may end up being responsible for. Note that in the strong scaling regime, the redundant computations associated with doing so can be significant. For instance, when the dimensions of the subdomain of a process are equal to r_nonb, the three-dimensional import region (which must have a thickness of at least r_nonb in each direction) would have a volume of roughly 26 times the volume of the process's local subdomain. However, inspecting how bonded interactions are shared between processes in PuReMD reveals that one must only extend up to 3 hops into the ghost region from any local atom (as expected, this happens for 4-body bonded interactions), where a hop is defined to be a bond order with strength above a certain threshold. This suggests that there can be a significant number of redundant bond order calculations in the ghost region.

To avoid redundant BOp calculations in an effort to improve the strong scaling performance, i) we adopted a BFS-style branching scheme with multiple sources, and ii) we implemented a scheme to automatically tighten the r_bond cutoff for a given input system. In our first optimization, i.e., the BFS-style branching scheme, all local atoms are initially added to a queue with a hop distance of 0. Then, an atom is popped from the queue at each BFS step and its neighbors within one hop are inserted into the queue. The method stops when all atoms at a hop distance of less than 4 have been inserted into the queue. Notice that this approach can be used only when the full neighbor list format is employed; otherwise, we would have to calculate the hop distances of atoms in the underlying undirected graph of a digraph, which is equivalent to turning a half neighbor list into a full one. Moreover, since the full neighbor list is the only viable option, we take advantage of symmetry to optimize the computations even further. To realize this, we ensured that the interaction between atom i and atom j is computed by whichever atom is popped from the queue first.

Our second optimization, i.e., automatic tightening of r_bond, relies on the observation that, given two atom types t_i and t_j, BOp is a monotonically decreasing function of the distance between the atoms. However, r_bond is usually conservatively set to 4 Å (or even 5 Å) as part of the simulation parameters. Instead of relying on this parameter, we scan all possible atom type pairs at several distances to precisely determine the distance beyond which the BOp value for all type pairs present in the given system is guaranteed to fall below the bond order acceptance threshold.
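A minimal sketch of the hop-limited, multi-source BFS described above follows, assuming a CSR-like full neighbor list (start, end, nbr arrays) and a hop limit of 3; the per-neighbor bond-order strength check that defines a hop is omitted, and all names are illustrative rather than PuReMD's actual data structures.

/* Minimal sketch of the hop-limited BFS from Section 3.3.3. The function
 * marks every atom within max_hops bonded hops of a local atom, so that
 * uncorrected bond orders are only computed for marked atoms. */
#include <stdlib.h>

void mark_bonded_region( int n_local, int n_total, const int *start,
        const int *end, const int *nbr, int max_hops, int *hops )
{
    int i, j, pj, head = 0, tail = 0;
    int *queue = malloc( sizeof(int) * n_total );

    /* hops[i] < 0 means "not reached yet" */
    for ( i = 0; i < n_total; ++i )
        hops[i] = -1;

    /* multi-source BFS: every local atom is a source at hop distance 0 */
    for ( i = 0; i < n_local; ++i )
    {
        hops[i] = 0;
        queue[tail++] = i;
    }

    while ( head < tail )
    {
        i = queue[head++];
        if ( hops[i] >= max_hops )
            continue;   /* do not branch past the hop limit (3 here) */

        for ( pj = start[i]; pj < end[i]; ++pj )
        {
            j = nbr[pj];
            if ( hops[j] < 0 )
            {
                hops[j] = hops[i] + 1;
                queue[tail++] = j;
            }
        }
    }

    free( queue );
    /* afterwards, BOp is evaluated only for atoms with hops[] >= 0 */
}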
3.4 Numerical Results

3.4.1 Benchmarking Systems and Hardware

Results from the numerical experiments presented in the following subsections were obtained on Cori at the National Energy Research Scientific Computing Center (NERSC). Each Haswell node in this Cray XC40 system has 32 cores on two sixteen-core Intel Xeon E5-2698v3 Haswell 2.3 GHz processors and 128 GB of DDR4 2133 MHz ECC memory. Each core possesses a 64 KB L1 cache (32 KB instruction, 32 KB data), a 256 KB L2 cache, and the capability of running one or two user threads (i.e., hyperthreading). All cores on a single processor share a 40 MB L3 cache. At the time of the experiments, Cori ran SuSE Linux Enterprise Server version 12.3 for x86_64 architectures, Linux kernel version 4.4.162-94.72-default, and glibc version 2.22-62.16.2.

The preconditioned solvers detailed in the above sections were implemented in the PuReMD ReaxFF software [10, 11]. The software was built using the Intel Compiler Collection version 18.0.1 with the -O3 -march=native flags and the Cray MPICH library version 7.7.3. For the relevant experiments, we restricted our simulations to one process per core (no hyperthreading was utilized).

For benchmarking purposes, two molecular systems from application scenarios of reactive and polarizable force fields were selected. These systems, bulk water (H2O) and amorphous silica (SiO2), range in size from thousands to millions of atoms depending on the experiment. To quantify the impact of the presented optimizations, we created four different versions of the PuReMD code: i) the original PuReMD code as published on its website [7] (i.e., CG+Jacobi [Half]), ii) the PuReMD code modified to work with full neighbor lists and full Hamiltonians (i.e., CG+Jacobi [Full], which aims to quantify the impact of the switch to full lists and matrices), iii) PuReMD with SAI preconditioning and ghost region optimizations (i.e., CG+SAI(0.15), where 0.15 denotes a τ value of 15%), and iv) our final version, which includes all presented optimizations (i.e., PIPECG+SAI(0.15)).

3.4.2 Impact of Charge Solver Improvements

3.4.2.1 SAI Longevity

We start by examining how long an SAI preconditioner can remain effective, as the longevity of the preconditioner is crucial for determining how much of the preconditioner computation cost can be amortized, as well as for estimating the average number of solver iterations. As shown in Fig. 3.4, the SAI preconditioners can be effective for tens to hundreds of steps, especially for the silica system, which is a solid material where atoms cannot move as freely as they do in water. The SAI preconditioner loses its effectiveness after some point, and the necessity of preconditioner reconstruction can be seen clearly. Overall, the silica system requires fewer iterations for convergence than the water system at the same tolerances, and the gap becomes more apparent as the convergence tolerance is decreased to 10−10.

Figure 3.4 Longevity of the CG+SAI(0.15) distributed preconditioned solver for systems using the QEq model. In the figure, results are shown for the silica system with 864K atoms (left column) and the bulk water system with 837K atoms (right column) using the QEq model at different solver tolerances. Preconditioners were utilized for 250 steps before being re-computed.

Based on the results shown in Fig. 3.4 and further empirical tests (not shown), we set the SAI reconstruction interval to 250 steps for the evaluations performed in this paper. All simulations were executed for 500 steps and averages of the relevant quantities are reported, unless stated otherwise.
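The way this longevity is exploited in the solver driver can be sketched as follows; the stub functions are placeholders for the corresponding PuReMD kernels (Algorithms 3.2 and 3.3 and the preconditioned solve), and the refresh interval of 250 steps matches the setting used in these experiments.

/* Sketch of amortizing the SAI construction cost: the expensive preconditioner
 * setup runs only every refresh_interval MD steps, while the preconditioned
 * charge solve runs every step. Stubs stand in for the actual kernels. */
#include <stdio.h>

static void setup_sai_pattern( void )    { /* Algorithm 3.2: threshold + filter */ }
static void compute_sai( void )          { /* Algorithm 3.3: least-squares solves */ }
static void solve_charges_pipecg( void ) { /* preconditioned PIPECG solve */ }

static void charge_solver_step( int step, int refresh_interval )
{
    /* recompute the SAI pattern and factors only periodically */
    if ( step % refresh_interval == 0 )
    {
        setup_sai_pattern( );
        compute_sai( );
    }

    /* every step: solve for the atomic charges with the current preconditioner */
    solve_charges_pipecg( );
}

int main( void )
{
    int step;

    for ( step = 0; step < 500; ++step )
        charge_solver_step( step, 250 );

    printf( "ran 500 steps with SAI refresh every 250 steps\n" );
    return 0;
}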
3.4.2.2 SAI Preconditioning Convergence Rates

In Table 3.1, we present the mean solver iterations for the Jacobi and SAI preconditioned QEq solvers at 10−6 and 10−10 tolerances for the 864K-atom silica and 837K-atom water systems using 512 processes. As expected, neither the switch to the full matrix with CG nor replacing the CG solver with PIPECG affects convergence rates. SAI preconditioning significantly improves the convergence rate of QEq compared to Jacobi preconditioning, yielding about 4 to 5 times better convergence rates for silica and water at the 10−6 tolerance. The margin between SAI and Jacobi preconditioning grows as we move to the 10−10 tolerance threshold for both the silica and water systems. Other tests (not shown here) indicate that this trend continues as we move to even smaller tolerances, e.g., 10−13.

Table 3.1 Mean solver iterations for the Jacobi and SAI distributed preconditioned QEq solvers.

Dataset   Tol.     Solver              Iters.
Silica    10−6     CG+Jacobi [Half]    11.8
                   CG+Jacobi [Full]    11.8
                   CG+SAI(0.15)         1.9
                   PIPECG+SAI(0.15)     1.9
          10−10    CG+Jacobi [Half]    39.2
                   CG+Jacobi [Full]    39.2
                   CG+SAI(0.15)         5.8
                   PIPECG+SAI(0.15)     5.8
Water     10−6     CG+Jacobi [Half]     9.5
                   CG+Jacobi [Full]     9.5
                   CG+SAI(0.15)         2.7
                   PIPECG+SAI(0.15)     2.7
          10−10    CG+Jacobi [Half]    38.4
                   CG+Jacobi [Full]    38.4
                   CG+SAI(0.15)        10.2
                   PIPECG+SAI(0.15)    10.2

3.4.2.3 Strong Scaling Tests

While the convergence results in Table 3.1 are highly encouraging regarding the speedups that can be obtained through SAI preconditioning, they omit the execution time overheads associated with SAI in actual solves, which include preconditioner construction and application costs. While the preconditioner construction is amortized over several steps, preconditioner application – which essentially is a parallel SpMV – must be performed at each iteration of the QEq solver and requires an extra set of local communications; therefore it can be expensive. In Fig. 3.5, we present the results of a strong scaling study for the silica and water systems. In comparing CG+Jacobi [Half] and CG+Jacobi [Full], we observe that the switch to full neighbor lists and full charge matrices does not have a significant negative effect. In fact, at a large number of cores, we observe a positive impact from CG+Jacobi [Full], because it only needs to do a forward communication before the SpMV in CG iterations. In comparison, CG+Jacobi [Half] needs to do a forward communication before the SpMV and a backward communication after the SpMV. In any case, neither CG+Jacobi [Half] nor CG+Jacobi [Full] exhibits good scaling beyond 2048 cores.

By switching to SAI preconditioning, CG+SAI(0.15) is able to achieve significant speedups over the Jacobi variants by virtue of reduced CG iterations. As expected, the fully optimized PIPECG+SAI(0.15) version exhibits the best performance at large core counts, while it attains similar (or sometimes slightly worse) performance to its CG variants in smaller runs. As shown in Fig. 3.6, we observe 0.7x to 2.4x speedups at the 10−6 tolerance and 1.6x to 4.6x speedups at the 10−10 tolerance from PIPECG+SAI(0.15) over the original QEq solver in PuReMD. Again looking at Fig. 3.6, PIPECG+SAI(0.15) almost always yields better strong scaling efficiencies than the original QEq solver.

Figure 3.5 Strong scaling plots for mean QEq time of the distributed preconditioned solvers for the silica and water systems. Figures from left to right within a row show results at convergence tolerance levels of 10−6 and 10−10. Panels: (a) silica (864,000 atoms), (b) water (837,120 atoms).

Note that the simulation volume (hence computational load) assigned to each process decreases linearly with the increase in the core count. However, the volume of the ghost region (hence local communication) decreases at a slower pace.
Additionally, the overhead due to the global communication operations of CG (i.e., all-reduces) grows with increasing core counts. Consequently, the ratio of communication overheads to computational load on each process is maximized at high core counts, and unsurprisingly this is where the impact of SAI and PIPECG is observed most significantly.

Figure 3.6 Speedups and parallel efficiencies for the distributed strong scaling study. In the figures, results are shown for silica (top row) and water (bottom row); panels (a) and (c) show speedups, while panels (b) and (d) show parallel efficiencies. Speedups are relative to the CG+Jacobi [Half] baseline at the same number of cores.

3.4.2.4 Weak Scaling

To better analyze the impact of core count on execution time and scaling, we also conducted weak scaling experiments. As indicated earlier, in the current PuReMD implementation, the dimensions of the simulation box that can be assigned to each process must be equal to or greater than the neighbor generation cutoff (which is equal to the r_nonb cutoff plus the Verlet buffer size). Therefore, for our weak scaling experiments, we chose bulk water and silica systems with cubic shapes that roughly have the minimum possible box dimensions per process, i.e., ≈13 Å, and scaled them up to 1,728 cores. As given in Fig. 3.7, PIPECG+SAI(0.15) yields speedups of 0.9x to 2.5x at the 10−6 tolerance and 1.4x to 5.1x at the 10−10 tolerance over the original QEq solver. As a result of our highly optimized SAI preconditioner, SAI-related overheads are minimal, allowing the reduction in iteration counts to translate into QEq speedups. Another source of performance gain with PIPECG+SAI(0.15) is the overlapping of global communications, which constitute the most significant part of the original QEq solver on 1,728 cores.

Figure 3.7 Weak scaling plots for the distributed solver sub-kernels for QEq. In the figures, results are shown for the silica and bulk water systems at solver tolerances of 10−6 and 10−10, with bars broken down into the Pre Comp, Pre App, Solver Comm, Solver Comm (AllReduce), Solver SpMV, and Solver Vec Ops sub-kernels. Unhatched bars (left within a grouping) are from CG+Jacobi [Half] while hatched bars (right within a grouping) are from PIPECG+SAI(0.15).
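The overlap of global communications mentioned above is the central idea behind PIPECG: the all-reduce for the dot products is initiated with a non-blocking MPI_Iallreduce, the local sparse matrix-vector product (and preconditioner application) proceeds while the reduction is in flight, and the reduction is completed only when its result is required. The fragment below is a schematic illustration of this pattern rather than PuReMD's actual solver; sparse_matvec and the vector names are assumed placeholders.

/* Schematic fragment showing how PIPECG hides the latency of CG's global
 * reductions: the all-reduce for the current dot products is overlapped with
 * the local sparse matrix-vector product. */
#include <mpi.h>

void overlap_reduction_with_spmv( const double *local_dots, double *global_dots,
        int num_dots, void (*sparse_matvec)(const double *, double *),
        const double *u, double *w, MPI_Comm comm )
{
    MPI_Request req;

    /* start the global reduction without blocking */
    MPI_Iallreduce( local_dots, global_dots, num_dots, MPI_DOUBLE,
            MPI_SUM, comm, &req );

    /* overlap: apply the charge matrix (and, in practice, the SAI
     * preconditioner) while the reduction is in flight */
    sparse_matvec( u, w );

    /* the reduced dot products are only needed for the next vector updates */
    MPI_Wait( &req, MPI_STATUS_IGNORE );
}

Since the reduced values are not needed until the subsequent vector updates, most of the all-reduce latency is hidden behind the local work, which is why the benefit grows with core count.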
3.4.3 Impact of Ghost Region Optimizations

In Table 3.2, we quantify the impact of the ghost region optimizations. Since the implementation of the original interaction list initializations in PuReMD does not allow us to time the exact duration spent in the bond order initializations, we instead compare the number of BOp evaluations in both schemes. As shown in this table, the presented optimizations significantly reduce the number of BOp evaluations, by a factor of about 5 for water and about 2 for silica. This is expected: silica has a longer maximum bond distance (3.5 Å), so 3 hops into the ghost region already cover a significant volume. The maximum bond distance for water is shorter, and as such redundant calculations in a larger portion of the ghost region can be avoided.

Table 3.2 Average number of bond order calculations per simulation step. Simulations were performed with the silica and bulk water systems (6000 and 6540 atoms, respectively) using 27 cores.

Dataset   Original Bond Init   Optimized Bond Init
Silica    2.6E5                1.1E5
Water     6.3E5                1.3E5

3.4.4 Overall Simulation Performance

In Figs. 3.8 and 3.9, we present the strong scaling and weak scaling results for the entire PuReMD simulation, which are indeed quite similar to those presented above for QEq. Since the QEq solves and the initialization of the interaction lists constitute a significant portion of the total running time, the improvements obtained for each of these kernels carry over to the overall simulation times, especially when the solver tolerance is reduced to 10−10. In our experiments, we observed overall speedups of 1.1x to 1.9x in the strong scaling experiments and 0.9x to 1.8x in the weak scaling experiments.

Figure 3.8 Strong scaling plots for total simulation time of the distributed preconditioned solvers for the benchmark systems. Figures from left to right within a row show results at convergence tolerance levels of 10−6 and 10−10. Panels: (a) silica (864,000 atoms), (b) water (837,120 atoms).

Figure 3.9 Weak scaling plots for the main simulation sub-kernels in distributed memory using the benchmark systems. Figures depict results for the silica (left column) and bulk water (right column) systems at solver tolerances of 10−6 (top row) and 10−10 (bottom row), with bars broken down into the FS Comm, Nbr Gen, Init, Bonded, and Nonbonded sub-kernels. Unhatched bars (left within a grouping) are from CG+Jacobi [Half] while hatched bars (right within a grouping) are from PIPECG+SAI(0.15).

3.5 Conclusions

Reactive MD models fill an important void in the molecular dynamics landscape, but the scalability of existing software is limited due to the onset of communication overheads and redundant calculations, specifically due to the charge distribution solvers and the computations related to dynamic bond order lists. In this paper, we presented a number of novel techniques to address these bottlenecks and studied their performance impact in large-scale simulations. Our results show that the presented techniques can significantly improve the overall performance and scaling of reactive MD simulations. A number of potential performance improvement opportunities have been observed, as discussed above, and will be the subject of our future work in this area. While the techniques presented in this paper are discussed in the context of the ReaxFF model, they can be directly used in other bond order potentials. The accelerated charge model solvers can also be useful for classical force fields with polarizable charge models.
66 CHAPTER 4 OPTIMIZATION OF THE REAX FORCE FIELD FOR THE LITHIUM-OXYGEN SYSTEM USING A HIGH FIDELITY CHARGE MODEL This chapter presents previously published work on improving ReaxFF parameter optimiza- tion with the recently published ACKS2 charge model using the OGOLEM software inte- grated with a modified shared-memory version of PuReMD [48]. This work is reproduced with the permission of AIP Publishing. While ReaxFF has been shown to be highly accurate potential with acceptable compu- tational cost for a variety of application areas, the development of new force fields remains a challenging task which is often limited to a handful of researchers with deep domain knowl- edge and experience. Moreover, the high dimensionality and difficult features of the ReaxFF parameter space pose significant challenges to applying global optimization techniques, such as evolutionary genetic algorithms. To address and further diagnose these issues, a parame- ter optimization study was conducted with Li-O systems for ReaxFF coupled with the QEq and ACKS2 charge models. In support of this, the OGOLEM genetic algorithm framework was coupled with a modified shared-memory version of PuReMD. 4.1 Introduction Accurately simulating chemical bond breaking is a non-trivial task. In particular, the widely used methods based on density functional theory (DFT) [49] struggle with describing bond dissociation processes. In materials science, fracture can simply be defined as a long series of bond breaking events, naturally making it a significantly more difficult task than the description of an individual bond breaking event [50]. Given the hardships faced at the quantum chemistry level, it is no surprise that fracture simulations have proven challenging for atomistic models, too. Fracture is especially pivotal in ceramic materials, which exhibit catastrophic brittle failures as opposed to the more 67 ductile failure modes of metals. Their failure typically starts from pre-existing cracks, in which stress is concentrated at the sharp tip of the cracks. Griffith theory [51] further connects the energy to break the bonds ahead of the crack tip with the strain energy due to the tensile stress to determine whether it is energetically favorable for a crack to propagate. This highlights the need for correct description of bond breaking in predicting fracture of ceramic materials. Fracture simulations are typically beyond the length and time scales currently accessible to quantum-based methods. Reactive molecular dynamics (MD) simulations based on em- pirical force fields (FFs) allow simulation of mechanical deformation and crack propagation in materials at nanometer scales [52, 53, 54, 55, 56]. However, FF-based MD simulations tend to predict features of ductile fracture, such as void formation [57, 58], extensive sliding- induced necking [59], and crack blunting [56], even in ceramics such as Si, Ge, SiO2 , Li2 O, and Lix S that are known to exhibit brittle fracture [60, 61, 59, 62]. Although small simulation volumes, lack of preexisting defects or notches, excess surface diffusion in nanostructures, and fast strain rates can all contribute to over-ductility in MD simulations [63], the most important factor is the underlying force field. For example, Kang and Cai compared five different force fields for simulating Si and Ge nanowires under tension. Although all of them predicted reasonable elastic modulus, the predicted tensile strengths and fracture behaviors varied widely [59]. 
This points to an inaccurate description of bond breaking within the force field itself as a source of the error. In this study, we use the reactive FF abbreviated as ReaxFF [1, 4] due to its applicability to a wide range of systems. In ReaxFF, reactions are simulated by dynamically computing the bond order between pairs and determining partial charges based on atomic positions. Previous work has shown that ReaxFF is well suited to study mechanical failure in a variety of systems [64, 65, 66]. However, major challenges still remain when it comes to fracture simulations. As mentioned above, accurate description of bond breaking plays a crucial role in being 68 able to correctly capture the fracture behavior (ductile vs. brittle). Of the several partial energies employed in ReaxFF, bond energy and van der Waals and Coulomb interactions play the main role in bond breaking. However, the charge prediction scheme utilized in existing ReaxFF is not sufficiently accurate. Traditionally, ReaxFF adopts the electronegativity equalization method (EEM) [26, 21] (which is closely related to the charge equilibration (QEq) method [22]) for dynamically determining charges in a simulation. Both of these methods obtain the atomic charges by minimizing the electrostatic energy at each step as a function of atomic positions. Such system-wide optimization of charges can result in instantaneous long-range charge trans- fer during bond breaking. For example, when a diatomic molecule dissociates and the constituent atoms move far apart, they should become neutral species rather than being charged. However, the QEq and EEM methods allow these two atoms to exchange electrons and carry fractional charges even when they are well separated. Since Coulomb interaction decays slowly, residual charges in a fracture simulation remaining after bond breaking can still lead to strong electrostatic forces between opposing surfaces which cause overestimation of the toughness and lead to unphysical ductility. Another shortcoming in existing force fields is the training set used in parameter fitting. Typically, training set data are computed by DFT calculations for structures near the equi- librium bond distance. Since bond breaking is crucial for fracture, highly accurate bond breaking data should be included in the training set. However, as mentioned earlier, some electronic structure theory methods, such as DFT, are unable to correctly describe bond breaking processes. In those cases, highly correlated ab initio wave function methods, espe- cially those using multiple suitably chosen reference determinants, such as the multireference configuration interaction (MRCI) approach developed in Refs. [67] and [68] exploited in this study, may provide the desired solution. As illustrated in Ref. [4], data from these higher- level quantum chemistry calculations can be used to train a Reax force field to correctly describe bond breaking events. 69 In this paper, we explore a new Reax force field concept and its ability to describe crack propagation in lithium oxide materials. Our force field development method combines advancements in the training set generation phase and the fitting protocol itself, in addition to utilizing a high-fidelity charge model to mitigate the accuracy issues with the conventional ReaxFF charge models. 
We created a training data set suitable for the study of crack propagation, and fitted ReaxFF parameters by adopting the atom-condensed Kohn-Sham DFT approximated to second order (ACKS2), [24] which is derived from the Kohn-Sham formulation of DFT, [69] as the charge model. The use of ACKS2 instead of QEq/EEM for charge distribution together with the training data set used lead to a significantly different ReaxFF bond description (where Coulomb interactions are heavily emphasized as they should be in an ionic system) compared to an existing ReaxFF parameter set [70]. Therefore, contributions from other energy terms had to be re-balanced by refitting those parameters as well. Since ReaxFF was developed with the capability to describe many varying chemistries, the complexity of the force field makes parameter optimization a challenging task. Oftentimes tens to hundreds of parameters must be simultaneously optimized when fitting to a large training dataset. For this purpose, a genetic algorithm (GA) based parameter optimization software was used, while the choice of parameters and their ranges were chosen carefully to avoid an over-fitted force field with incorrect physics. We demonstrate that this new Reax force field is able to correctly describe bond breaking and charge transfer in the Li2 O molecule. This translates to improved charge prediction dur- ing bond formation and breaking in solid Li2 O systems, leading to an improved description of fracture in amorphous Li2 O. Together with improvements in formation enthalpies of both Li2 O and Li2 O2 , the resulting force field represents a major step forward in the simulation of the lithium oxidation process, and paves the way for improved atomistic simulations of lithium-air batteries as well as dendrite formation, solid-electrolyte interphase formation, and passivation layers in lithium-ion batteries. [71] 70 4.2 Methodology Our newly fitted force field for the Li-O system advances in three distinct improvements: i) selection of a suitable training data set, ii) adoption of the ACKS2 model for charge prediction, iii) the fitting procedure itself. We present the details for these improvements next. 4.2.1 Training Data Set In training the new force field, a mix of crystalline and molecular structures were used. Relative energies (with respect to a closely related reference structure) rather than absolute energies are used in the training data set. This allows us to combine different quantum chemistry methods for generating the training data to balance accuracy and efficiency. For example, DFT was used to obtain the equations of state for crystals and surface energies for slab models, while the higher-level ab initio MRCI methodology was employed to determine accurate partial charges and, after adding the multireference Davidson correction through the MRCI+Q approach, accurate energetics characterizing the asymmetric and symmetric bond dissocation in a model molecular system containing Li and O atoms. The training set consisted of 276 items: energies of crystalline Li2 O, Li2 O2 , and Li metal under varying strain rates, energies of Li – Li molecules (taken from Ref. [72]), and energies and charges of Li – O – Li molecules under dissociation. Since DFT calculations are suitable to give the relative energies for systems near their equilibrium structures, energies for crystals were calculated using the DFT method in vasp [73] with PBE-GGA [74] and PAW pseudopotentials [75]. 
In order to correctly capture the thermodynamic driving forces for oxidation, the formation energies of crystal structures were used to train the force field:

E^{F}[\mathrm{Li}_x\mathrm{O}_y] = E[\mathrm{Li}_x\mathrm{O}_y] - (x - y)\, E_0[\mathrm{Li}] - \frac{y}{2}\, E_0[\mathrm{Li_2O_2}].    (4.1)

The required MRCI+Q calculations of the potential energy curves characterizing asymmetric and symmetric bond dissociations in the linear Li–O–Li triatomic and the corresponding MRCI computations of atomic charges were performed using the MOLPRO 2010.1 program. [76] In the case of asymmetric single bond breaking, we stretched one of the Li–O bonds while keeping the other fixed at the equilibrium distance. In the case of symmetric double dissociation, we stretched both Li–O bonds symmetrically. The information about the grid of nuclear geometries used in these calculations can be found in Appendix B. In order to perform the MRCI computations, we employed the highly efficient internally contracted MRCI algorithm developed in Refs. [67] and [68]. The underlying orbitals and reference functions were obtained with the complete active-space self-consistent field (CASSCF) approach [77, 78] using an active space of 8 valence electrons distributed among 12 valence orbitals correlating with the 2s and 2p shells of the lithium and oxygen atoms. In the subsequent internally contracted MRCI calculations, we explicitly correlated the 8 valence electrons of Li–O–Li by including all asymmetric and symmetric excitations out of the multi-determinantal reference space. The final energetics entering our training dataset included the quasi-degenerate Davidson correction through the aforementioned MRCI+Q approach. Because the Mulliken population analysis is not reliable for lithium-containing compounds, [79] the MRCI atomic charges that form part of our training dataset were obtained using the natural bond orbital analysis of Weinhold and co-workers, [80, 81, 82] as implemented in MOLPRO 2010.1. [83] All of our MRCI and MRCI+Q calculations for the Li–O–Li system used the aug-cc-pVDZ basis set. [84, 85, 86] The complete set of results of our MRCI and MRCI+Q computations can be found in Appendix B.

4.2.2 High Fidelity Charge Model

The residual charge problem is a common issue beyond the Li-O systems studied herein. Some ad hoc fixes have been proposed in the literature. Patel and Brooks add artificial constraints on atomic charges in CHARMM [87]. A different approach called atom-atom charge transfer (AACT), proposed by Chelli et al., [88] introduced a dummy variable, called a split charge, to share charges across a bond. The split charge variable determines the amount of charge transferred from one atom to the other. Later, Nistor et al. [23] proposed the split-charge equilibration (SQE) method, which combines the EEM and AACT methods. SQE and its improved version can accurately capture some charge transfer phenomena, such as static electricity transfer [89]. In this study, we adopt the more general ACKS2 method, which is a relatively recent model [24]. It is derived by expanding the Legendre transform of the Kohn-Sham kinetic energy to second order in the atomic populations and introducing a new set of dual atomic variables, the relative atomic Kohn-Sham potentials. The ACKS2 model [24] can be regarded as an extension of the QEq/EEM methods.
Essentially, ACKS2 was developed with the aim of correcting issues with accurately modeling dipole polarizability and charges during bond formation/dissociation via additional empirically fitted parameters. From a computational point of view, a block matrix representation of the linear system arising in the ACKS2 model is as follows:

\begin{bmatrix}
H_{\mathrm{QEq}} & I_n            & \mathbf{1}_n & \mathbf{0}_n \\
I_n              & X              & \mathbf{0}_n & \mathbf{1}_n \\
\mathbf{1}_n^T   & \mathbf{0}_n^T & 0            & 0            \\
\mathbf{0}_n^T   & \mathbf{1}_n^T & 0            & 0
\end{bmatrix}
\begin{bmatrix} q \\ U \\ \mu_{\mathrm{mol}} \\ \lambda_U \end{bmatrix}
=
\begin{bmatrix} -\chi \\ q_{\mathrm{ref}} \\ q_{\mathrm{net}} \\ 0 \end{bmatrix}.    (4.2)

In Eq. (4.2), n is the number of atoms in the system, H_QEq is the n × n Hamiltonian matrix for the QEq model, X is an n × n matrix that contains terms for the linear response kernel of the Kohn-Sham potential, I_n is the n-dimensional identity matrix, 0_n and 1_n are n-dimensional column vectors whose individual elements equal 0 and 1, respectively, q is the n-dimensional column vector whose elements are the atomic charges, U is an n-dimensional column vector containing the coefficients defining the Kohn-Sham potential, μ_mol and λ_U are Lagrange multipliers from the underpinning optimization problem, χ is an n-dimensional column vector containing the electronegativity of each atom, q_ref is the set of atomic reference charges (taken as all zero for this work), and q_net represents the net charge of the system to be simulated.

The entries of H_QEq are defined as H_QEq^{ij} = δ_{ij} η_i + (1 − δ_{ij}) · T_{ij}, where δ_{ij} denotes the Kronecker delta operator, η_i denotes the atomic idempotential, and T_{ij} is defined as

T_{ij} =
\begin{cases}
\dfrac{1}{\sqrt[3]{r_{ij}^{3} + \gamma_{ij}^{-3}}}, & \text{if } r_{ij} \le r_{\mathrm{nonb}}, \\
0, & \text{otherwise.}
\end{cases}

In the above equation, r_{ij} signifies the distance between atoms i and j, and γ_{ij} = \sqrt{\gamma_i \cdot \gamma_j} denotes a pairwise shielding term tuned for the element types of atoms i and j to avoid unbounded electrostatic energy at short distances. Using a tapering function, the term T_{ij} decays smoothly to zero for all pairs beyond the non-bonded interaction cutoff r_nonb (typically 10 Å). The diagonal entries of the linear response kernel are X^{ii} = \sum_{j=1}^{i-1} -X^{ij}, while the off-diagonal entries are defined as

X^{ij} =
\begin{cases}
\Lambda \cdot \left(\dfrac{r_{ij}}{\sigma_{ij}}\right)^{3} \cdot \left(1 - \dfrac{r_{ij}}{\sigma_{ij}}\right)^{6}, & r_{ij} \le r_{\mathrm{min}},\ X^{ij} > 0, \\
0, & \text{otherwise.}
\end{cases}

In the previous equation, σ_{ij} = (σ_i + σ_j)/2 and r_min = min{r_nonb, σ_{ij}}. The ACKS2-specific parameters σ_i and Λ are fitted as part of the GA optimization process: σ_i thresholds entries based on element types, and Λ is a global bond softness parameter. The resulting ACKS2 matrix H_ACKS2 ∈ R^{(2n+2)×(2n+2)} is symmetric, indefinite, and sparse. The atomic charges q = (q_1, q_2, ..., q_n), q_i ∈ R, are thus obtained by solving the linear system in Eq. (4.2). Due to the size of the linear systems involved, we use the accelerated sparse solvers for ACKS2 [12, 43] in the PuReMD package [10, 11]. It has been shown that ACKS2 gives a better description of atomic charges, as well as a more accurate energy prediction [4]. Therefore, we anticipate that ACKS2 will describe the bond breaking process in Li-O with higher fidelity, and hence lead to a more accurate characterization of its brittle fracture property.
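As a concrete illustration of the charge-matrix entries defined above, the following C sketch evaluates the shielded Coulomb term T_ij and the diagonal idempotential entry; the smooth tapering that is applied near r_nonb in practice is omitted for brevity, and the function names are illustrative only.

/* Minimal sketch of the shielded Coulomb entries of H_QEq used in Eq. (4.2).
 * gamma_i, gamma_j are the element shielding parameters, eta_i the atomic
 * idempotential, and r_nonb the non-bonded cutoff; the tapering function
 * applied near the cutoff in PuReMD is not shown. */
#include <math.h>

/* off-diagonal entry T_ij = 1 / cbrt(r_ij^3 + gamma_ij^{-3}) for r_ij <= r_nonb */
double qeq_offdiag( double r_ij, double gamma_i, double gamma_j, double r_nonb )
{
    double gamma_ij;

    if ( r_ij > r_nonb )
        return 0.0;

    gamma_ij = sqrt( gamma_i * gamma_j );
    return 1.0 / cbrt( r_ij * r_ij * r_ij
            + 1.0 / (gamma_ij * gamma_ij * gamma_ij) );
}

/* diagonal entry H_QEq^{ii} is simply the atomic idempotential eta_i */
double qeq_diag( double eta_i )
{
    return eta_i;
}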
4.2.3 Optimization of the Force Field Parameters

Published parameter sets for ReaxFF are the result of a combination of chemical intuition, guided by a physical interpretation of the parameters, together with several generations of parameter fitting. Hence, when an extension of the method to a new system is desired, a careful balance must be struck: adjusting parameters to better describe the new system while maintaining the core chemistry of ReaxFF. It is also desirable to maintain information from past training that is encoded in the parameters, without needing to explicitly include the entire training history of ReaxFF in a new training set (due to the computational costs of the parameter fitting phase).

A two-element system in ReaxFF (such as Li-O) typically requires simultaneous optimization of tens of parameters, or over 100 parameters for a full re-fit. In the case of the Li-O system, the bond-order parameters have already been fit to capture bond formation in this system reasonably well [4, 70, 90]. We therefore keep the bond-order related parameters fixed in our fitting, with the notable exception of the lithium bond-order over-coordination correction. In the Li2O crystal, the 2016 QEq FF finds a large number of lithium-lithium bonds in addition to the desired lithium-oxygen bonds. This is because the Li-Li distance in Li2O is smaller than that in the lithium metal, so the bond-order terms appropriate for the metallic bonds are also activated in Li2O. This results in a large over-coordination correction to the energy, but the forces due to the Li-Li bonds may still have undesired effects in fitting all quantum-mechanical training data and in subsequent MD simulations. We therefore turned on the over-coordination correction at the bond-order level for lithium. This ensured that lithium only formed bonds with oxygen in the Li2O solid.

While these parameters were kept fixed, adopting the ACKS2 model does require re-optimization of a significant number of force field parameters. Most obvious are the ACKS2-specific parameters: the ACKS2 softness parameter as well as the EEM shielding and atomic softness parameters for oxygen and lithium. Upon fitting these parameters to the MRCI charges and MRCI+Q energies, we observe that the use of ACKS2 instead of QEq for charge assignment leads to significant differences in the Coulomb force. Consequently, the strengths of the other partial energy terms and how much they contribute to the total energy need to be re-balanced. We therefore perform a "high-level" re-fit of the lithium and oxygen bonding and angle terms, too. We find that re-fitting a total of 39 parameters (the ACKS2 softness parameter, 14 atomic parameters, 12 bonding parameters, and 12 angle parameters) provides sufficient flexibility to achieve a good fit while preserving the basic chemical properties of the underlying ReaxFF structure. Additional details are provided in Section B.1 and Table B.1.

For such high-dimensional ReaxFF parameter optimization problems, recent works have utilized GAs [91, 92], which are essentially heuristic optimization techniques that take their name from the biological concept of evolution. In GAs, optimization proceeds using mutation and crossover operations (at a predetermined rate) until some termination criterion is reached (e.g., a solution with a small enough error is found or a maximum number of generations is reached). For evaluation of the solutions, a fitness function is used:

F_i = \sum_{j=1}^{T} \left( \frac{x_{\mathrm{ref}}^{(j)} - x_{\mathrm{actual}}^{(i,j)}}{\sigma^{(j)}} \right)^{2}.    (4.3)

In Eq.
(4.3), Fi is the fitness score for organism i, j denotes an index over the items in the (j) training dataset (T total items), xref denotes the reference values for the j-th training item (i,j) with weight σ (j) , and xactual denotes the computed value for training item j using the force field parameters associated with organism i. In this study, we utilized the OGOLEM software [93], a global optimization GA frame- work aimed at computational chemistry problems, including the fitting of MD force field parameters. In conjunction with OGOLEM, the PuReMD ReaxFF software was used for evaluation of the parameter sets and the MD simulations [11, 10, 94]. Novel to this work has been the integration of ACKS2 model into PuReMD for optimization with OGOLEM. Several identical fitting runs were performed in parallel using OGOLEM; a model was deemed to have converged when fitness as a function of GA generation leveled off (see Fig. 4.1). Among the converged models, the force field that performed well on the training 76 Figure 4.1 Fitness of the best performing ReaxFF parameter set during optimiza- tion. Results obtained from using OGOLEM are presented by run number for using the (a) QEq and (b) ACKS2 charge models. data according to the fitness score and showed good stability in brief energy minimization runs was selected for use in the fracture simulations. 4.3 Results and Discussion In this section, we present the fitting results using the ACKS2 in Reax force field (denoted as ACKS2 FF) and the MD simulation results for Li2 O fracture. There already is a parameter set for ReaxFF from 2016 to study Li-O systems [70], which uses the EEM/QEq model for charge equilibration. We use this parameter set, denoted as “2016 QEq FF” in what follows, as our comparison point. The results from these two force fields are compared. 77 4.3.1 Fitness to the Training Data Set The optimized ReaxFF with the ACKS2 model significantly improves the accuracy of the atomic charge assignment when compared against the MRCI training data. Fig. 4.2 shows the predicted atomic charges in the Li – O – Li molecule during stretching. The ACKS2 FF shows an important qualitative improvement over the 2016 QEq FF, as the latter leaves a large residual charge on the mobile Li even after bond-breaking. ACKS2 FF, on the other hand, correctly predicts it to be neutral at a sufficiently large separation. In the bottom figure of Fig. 4.2, when both Li atoms are slowly pulled away from the central O atom, charges on Li atoms decrease as expected. The MRCI method can correctly predict the neutrality of Li atoms when they are more than 4 Å away from O. But DFT still predicts Li and O are +0.2 and −0.4 charged after their distance is beyond 8 Å. This result shows that it is necessary to use more accurate quantum-mechanical methods to predict charges during the bond breaking process. The 2016 QEq FF also significantly under-predicted the Li – O – Li molecule dissociation energy compared to the MRCI+Q results, as shown in Fig. 4.3. This has been corrected in our new force field. We achieve much improved agreement particularly for the “asymmetric stretching” mode in which one Li – O bond is kept fixed while the other Li atom is removed. Significant improvement is also found in the description of the Li2 O and Li2 O2 crystal energies, as shown in Fig. 4.3. The 2016 QEq FF significantly over-predicted the formation energy of Li2 O (as given by Eq. (4.1)) in addition to the incorrect prediction of the overall shape of the energy vs. strain curve. 
Li2 O2 is even more poorly described, with the 2016 QEq FF failing to even recognize Li2 O2 as a local minimum in energy. Table 4.1 summarizes the basic materials properties predicted by the ACKS2 and the 2016 QEq FFs in comparison with experiments and DFT data for the crystal phases. ACKS2 FF indeed shows improvement on lattice parameters, formation energies, and bulk modulus, especially for the oxide phases, as these properties are embedded in the energies in the training data. The cohesive energies were not included in the training set for the ACKS2 FF but were fitted by 2016 QEq FF, which 78 (a) (b) Figure 4.2 Charge during dissociation of the Li – O – Li molecule for asymmetric symmetric bond stretching. Results are plotted for the MRCI training data, the fitted ACKS2 FF, and the 2016 QEq FF for (a) asymmetric stretching (one Li – O bond remains static) and (b) symmetric stretching (O remains static). shows slightly better performance. The elastic constants were not included in the training data. ACKS2 FF is softer for Li than 2016 QEq FF. The 2016 QEq predicted two shear instabilities (C44 < 0 and C11 − C12 < 0) and the ACKS2 FF showed one shear instability (C11 − C12 < 0). This can be potentially problematic, as they suggest that deformation may lead to phase instability, which is not the case experimentally. The fracture energy was not included in the training set either. For fracturing along ⟨111⟩ direction, the force fields are very close to one another and significantly over-predict the DFT value. By contrast, both force fields significantly under-predict the DFT fracture energy along the ⟨110⟩ direction, but 79 Figure 4.3 Energy vs. hydrostatic strain of Li2 O and Li2 O2 crystals along with energy vs. separation of the Li2 O molecule for bond stretching. Energy vs. hydro- static strain of (a) Li2 O and (b) Li2 O2 crystals, plotted for the DFT training data, the fitted ACKS2 FF, and the 2016 QEq FF. (a) shows the Li2 O formation energy (given by Eq. (4.1)) and (b) shows the Li2 O2 energy referenced to the DFT-optimized structure. Energy vs. separation of the Li2 O molecule for (c) asymmetric stretching (one Li – O bond fixed) and (d) symmetric stretching. Plotted for the MRCI+Q training data, the ACKS2 FF, and the 2016 QEq FF. the ACKS2 FF shows marked improvement over the 2016 QEq FF. These results suggest that it is difficult for ReaxFF to predict the large variation in surface energies among crystal facets. In the future, a larger number of amorphous structures, surfaces, and deformed lattices could be included in the training set, potentially improving the performance of the FF on phase stability and mechanical properties. For example, both the ACKS2 FF and the 2016 QEq FF predict that the Li2 O crystal is higher in energy than the amorphous phase, which is not a correct description at room temperature. However, the ACKS2 FF obtained in this work already represents a major improvement over the 2016 QEq FF in this 80 regard, placing the Li2 O crystal at 16 kcal/mol above the amorphous phase, as opposed to 42 kcal/mol predicted by the 2016 QEq FF. We anticipate that with the future consideration of larger number of amorphous structures in the ACKS2 FF fitting, we should be able to lower the energy of the crystalline phase even further, making the simulations more realistic. Table 4.1 Materials properties as predicted by DFT, the fitted ACKS2 FF, and the 2016 QEq FF. 
Cohesive energy, elastic constants, and fracture energy, which is defined as twice the surface energy, were not included in the training data. Li Li2 O2 Li2 O Fitted Stable Bcc, Im3m P63 /mmc Fm3̄m Cryst. Phase Lattice Param. Experiment a = 3.49 [95] a = 3.17, c = 7.72 [96] a = 4.57 [97] (Å) DFT a = 3.43 a = 3.18, c = 7.70 a = 4.66 ACKS2 FF a = 3.44 a = 3.15 (a/c fixed) a = 4.63 2016 QEq FF a = 3.44 not stable a = 4.65 Form. Energy DFT - - 263 (kJ/mol) ACKS2 FF - - 263 (Eq. (4.1)) 2016 QEq FF - - 547 Cohesive Energy Experiment 158 [95] - (kJ/mol) DFT 153 1515 1173 ACKS2 FF 99.7 2583 1654 2016 QEq FF 157.7 832 1121 Bulk Modulus Experiment 11.6 [95] - 88.0 [97] (GPa) DFT 12.3 78.7 82.5 ACKS2 FF 2.55 159 132 2016 QEq FF 7.30 - 217 Elastic Consts. Experiment 14, 11, 9 [95] - 217, 25, 68 [97] C11 , C12 , C44 DFT 14, 13, 16 153, 51, 37 203, 20.3, 53.7 (GPa) ACKS2 FF 2, 2, 2 211, 67, −94 143, 185, 33 2016 QEq FF 16, 8, 8 - 250, 250, −164 Fracture Energy Surface ⟨111⟩ ⟨110⟩ ⟨111⟩ (J/m2 ) DFT 1.08 1.68 1.07 ACKS2 FF 3.09 39.3 2.26 2016 QEq FF 7.78 −2.24 2.31 Surface ⟨110⟩ ⟨100⟩ ⟨110⟩ DFT 1.00 3.55 7.53 ACKS2 FF 0.70 19.0 3.63 2016 QEq FF 3.44 −8.23 2.50 81 4.3.2 Evaluation using Bulk Lithium Oxide To test the newly fitted ACKS2 FF, a 324-atom cubic supercell of crystalline Li2 O under periodic boundary conditions was simulated within an NPT ensemble at zero pressure and 300K for 40 ps. Tests in the NVE ensemble showed that 0.25 fs time step was required to avoid energy drift, so this time step value was selected and is used consistently throughout the paper. In both force fields, the structure became amorphous, but the volume increased by 6% using the ACKS2 FF while decreasing by 25% using the 2016 QEq FF. Fig. 4.4 shows the evolution of the volume and the resulting radial distribution functions (RDFs), while more detailed RDF evaluations are shown in Fig. B.4. The nearest-neighbor Li-O peak in the ACKS2 FF split into two peaks at 1.8 Å and 2.4 Å, respectively, causing anisotropic mechanical behavior. The 2016 QEq FF showed similar Li-O peak splitting, with a small Li-Li peak appearing close to 1.3 Å, which is much shorter than the Li-Li bond distance in crystal Li (3.0 Å) and Li2 O (2.3 Å). Based on these results, we conclude that the ACKS2 FF equilibrium structure is more realistic and stable than that of 2016 QEq FF. Subsequent to NPT relaxation of the Li2 O boxes, slabs were then constructed from a 8 × 1 × 6 tiling of the equilibrated boxes, with a 90 Å vacuum space included in the x direction. The slabs were then further equilibrated within the NPT ensemble under 3D periodic boundary conditions at 300K (Fig. 4.5). Equilibration of the ACKS2 FF slab does not have much effect on the amorphous Li2 O other than smoothing out the random distribution of atoms over a large number of particles. The RDF stabilizes after about 10 ps in ACKS2, showing very little subsequent change out to 40 ps. The same process occurs in the first 8 ps of the 2016 QEq FF slab, but then starting around 10 ps the slab shrinks considerably and develops long-range order, as shown by the emergence of distinct peaks in the RDF, reaching a new equilibrium by 40 ps. The order is not commensurate with real crystalline Li2 O and represents an unphysical crystalline phase. This provides further confirmation that the ACKS2 FF predicts a more stable solid Li2 O than the 2016 QEq FF. 
In order to test if the improved success of our ACKS2 FF was due to the charge assignment 82 method and not just the training procedure, a nearly-identical parameter-fitting run was attempted for a ReaxFF force field using QEq for charge assignment. The only difference was that the ACKS2-specific parameters were not allowed to change during optimization, since they are not relevant to a QEq force field. While a comparable fitness value on the training items was achieved as shown in Fig. 4.1, the resulting force fields are unstable when simulating solid Li2 O (see Section B.2). This suggests that QEq-based ReaxFF is incapable of describing the underlying charges properly in the training set, and the genetic algorithm may be forced to “over-fit” the training data with an unphysical parameter combination which leads to an unusable force field. This highlights the need for ACKS2 for the correct description of charges in these materials and the difficulty of using fully automated fitting procedure, as the physics insights are still important. 4.3.3 Evaluation using Deformation and Fracture In a molecular dynamics simulation, brittle fracture is characterized by the propagation of a sharp crack tip across a sample and a relatively flat fracture surface. To explore brittle fracture, MD simulations were performed on slabs of Li2 O. Atoms were removed from each slab to create a sharp notch in the xy plane, which corresponded to the [100] crystal direction in the original structures. Every picosecond, 0.5% strain was applied normal to the crack plane while the atom positions were evolved under NVT at 300K. This simulates a plane- strain condition. Initially, crystalline slabs were used as the starting point for the deformation simulations. However, the large difference in built-in stresses due to the instability of the crystal structure made it impossible to draw meaningful comparison between the results. Therefore, NPT-equilibrated slabs described in Section 4.3.2 by ACKS2 FF and 2016 QEq FF methods (both with and without long-range order) were selected to test crack propagation. All of these initially have low stress in all three directions, suggesting equilibrium state, and thus represent a fair comparison. Parameters of the slabs at the beginning of the fracture simulations are shown in Table 4.2. Duplicates of each fracture simulation were performed 83 (a) (c) x10 (b) (d) Figure 4.4 NPT simulation and material properties of crystalline Li2 O with the ACKS2 FF and the 2016 QEq FF. Simulations were performed for 40 picoseconds at 300K using a 324-atom cube of crystalline Li2 O with the ACKS2 FF and the 2016 QEq FF. Panels (a) and (b) show side lengths of the simulation box (under periodic boundary conditions). Resulting RDFs are shown in (c), along with the crystal RDF (scaled by 0.1 for ease of comparison). The ACKS2 FF equilibration is more realistic, developing a single nearest-neighbor peak closer to the crystal Li-O bond distance, whereas the 2016 QEq FF peak is split and at smaller separation. Panel (d) shows the average stress over the course of the simulation. 84 (a) (b) Figure 4.5 Simulation results and material properties for Li2 O slabs before and after NPT equilibration with ACKS2 and 2016QEq FF. Results are shown using (a) ACKS2 and (b) 2016QEq FF. Noise is reduced in both simulations due to the large number of particles, but the ACKS2 FF shape remains the same. 
For the first 8 ps the 2016 QEq FF RDF remains similar, but then it develops long-range order evidenced by the emergence of distinct peaks. to test the impact of the stochastic MD simulation. The different random starting velocities did not impact the conclusions (see Fig. B.3). Table 4.2 Properties of the notched slabs used as a starting point for the fracture simulations. a is the notch length, b is the slab width, y is the thickness in the periodic direction along the crack, z is the thickness in the periodic direction normal to the crack, h is the space between the crack edge and the periodic boundary, and N is the number of atoms in the slab. ACKS2 FF 2016 QEq FF 2016 QEq FF (amorphous) (ordered) a (Å) 49.7 46.0 46.2 b (Å) 108.3 101.2 97.5 y (Å) 14.8 12.7 12.0 z (Å) 86.6 74.6 71.7 h (Å) 32.5 27.9 27.5 N 14700 14624 14654 Stress-strain curves and visualizations of the slab undergoing crack propagation are shown in Fig. 4.6. Stress was calculated by taking the average pressure over each picosecond of constant strain. Pressure was converted to true stress by multiplying by V /Vocc where V is the total volume of the simulation cell and Vocc is volume occupied by the slab at each time 85 (a) (d) 0% 0% 0% (b) 10% 5% 10% 15% 8% 15% (c) 20% 12% 20% Figure 4.6 Simulation results for fracture in Li2 O slabs. After NPT equilibration, a notch is cut in the z-normal plane in each slab, and strain is applied along z constant engineering strain rate of 5 × 10−3 ps−1 up to 20% strain to simulate fracture. Panels (a) to (c) show the resulting stress-strain curves using the ACKS2 FF equilibrated slab and two 2016 QEq FF equilibrated slabs: one amorphous, one ordered. The amorphous ACKS2 FF slab does not show complete fracture, but it does show a well-defined peak around 9% strain and stress relaxation as the crack propagates. The 2016 QEq FF amorphous slab shows more ductile behavior, with a broad peak around 13% strain and more gradual stress relaxation. The 2016 QEq FF ordered slab shows classic fracture behavior, with a sharp peak around 5% strain, with the stress falling to zero by about 12% strain. This cracking behavior is likely assisted by the increased density and long-ranged order. Panel (d) shows the shape of the slabs during deformation through scatter plots of the oxygen charges in slabs under various levels of strain. The ACKS2 FF slab retains a sharper crack tip than the amorphous 2016 QEq FF slab, and the oxygen charge remains close to the formal charge of −2 throughout the slab. Both 2016 QEq FF slabs show oxygen charge significantly smaller than the formal charge, with noticeable charge depletion around the crack tip and fracture surfaces. step. The stress in the x direction remains zero due to the free surface. The stress in the y direction varies due to the different Poisson’s ratios. According to the elastic constants for Li2 O shown in Table 4.1, the 2016 QEq FF over-predicts the Poisson’s ratio while the ACKS2 FF under-predicts the Poisson’s ratio compared to the DFT results. As expected based on its performance on the training data, the newly optimized ACKS2 FF parameter set shows 86 improvement in the description of fracture in amorphous Li2 O. The ACKS2 FF stress-strain curves show crack propagation events in which the stress along z direction drops sharply at 10% of strain, while the 2016 QEq FF shows more smooth, ductile behavior. 
However, neither force field produces true brittle fracture in amorphous slabs, though this may be partly because the stress intensity factor remains below 1 MPa·m^(1/2) (calculated based on the peak stress [98] and the geometry of the notched slab) due to the small crack length used in MD simulations. Interestingly, the 2016 QEq FF slab with long-range order appears to show classic brittle fracture at 5% strain. However, there is local pre-fracture melting near the crack tip, which is not likely to occur in room-temperature brittle fracture. The simulated fracture is therefore likely due to unphysical mechanisms. This suggests that, while the force field used is important, the overall fracture propagation is a much more complicated process that may be related to some non-fitted features in the force field.

The current force fields have deficiencies in describing some elastic and fracture properties of the lithium oxides. The phase change during relaxation and deformation is related to the over-stabilization of the amorphous structure and the mechanical instability shown in Table 4.1. For future improvement, the training set should include crystalline structures with a wider variety of elastic deformations in addition to unstable amorphous structures. However, since the energy differences between elastically deformed structures are relatively small (±5 kcal/mol) compared to the formation energy of lithium oxides (∼260 kcal/mol), the weights for fitting have to be carefully picked.

The improvement in fracture behavior on amorphous slabs may once again be attributed to an improvement in the charge assignment due to ACKS2. The oxygen charges are shown in Fig. 4.6 (d). The magnitudes of both the oxygen and lithium charges are significantly larger in the ACKS2 FF and closer to the formal charges of Li+ and O2−. The difference between atomic charges at the surface and in the bulk is also improved in the ACKS2 FF, with only a modest reduction in the magnitude of the atomic charge at the surfaces. These ACKS2 FF results are consistent with Bader charge analysis [99] of DFT results on Li2O surfaces, which shows that atom charges at the most favorable (111) surface are essentially unchanged from their bulk values. By contrast, the (already small) oxygen charge in the QEq FF is significantly depleted at the surfaces of the crack, to a greater degree than at the slab surface. This is especially pronounced at the crack tip, suggesting that the depletion arises from QEq's incomplete charge transfer upon bond breaking. Further details on the charges may be found in Fig. B.2.

We expect that in the future, if an ACKS2-based force field can be trained to produce a stable ordered structure, it will show brittle fracture behavior similar to or better than that exhibited by the ordered 2016 QEq FF slab. This will be particularly valuable if the crystal structure matches the experimental Li2O structure, as this could allow for realistic large-scale MD simulations of lithium oxidation and cracking.

4.4 Conclusion

Both ionic and covalent bond breaking are individually challenging in an MD context, and simulating fracture in ceramics requires an accurate description of both phenomena simultaneously. Our results represent a significant step forward in the atomistic simulation of bond breaking in crystalline systems with mixed ionic-covalent bonding character.
We show that a Reax force field employing the high-fidelity ACKS2 method for charge assignment, trained using a GA with training data from high-level ab initio wave function quantum chemistry calculations, is able to describe bond breaking in Li-O-Li molecules. When comparing fracture on stable amorphous Li2O, the ACKS2 force field represents a modest improvement in fracture behavior compared to the existing 2016 QEq force field.

Future work may be able to improve fracture behavior in Li2O by training an ACKS2-empowered ReaxFF with a genetic algorithm, while also including a large sample of amorphous structures in the training set. This may train the force field to predict the correct crystal structure of Li2O rather than amorphization. The improvement in bond breaking and charge assignment demonstrated by ACKS2 may then lead to true brittle fracture.

The insights gained in this study, our GA-based fitting method, and our training set design strategies can easily be extended to other systems. They may also serve as a foundation for further automation of force-field generation and fitting [92]. Use of ACKS2 is also likely to improve the description of other systems, especially ceramics or other materials with a high degree of ionic bonding.

CHAPTER 5

PERFORMANCE OPTIMIZATION OF LARGE-SCALE DISTRIBUTED GPU-ACCELERATED REAXFF SIMULATIONS

5.1 Introduction

The original MPI+CUDA implementation of PuReMD was authored when the Fermi and Kepler microarchitectures from the Tesla line of NVIDIA GPUs were in their heyday [94]. Since that time, subsequent hardware revisions have drastically improved peak arithmetic throughput and memory bandwidth. These advancements have highlighted the need for revision and enhancement of ReaxFF implementations.

5.2 Methods

The following subsections discuss several optimizations employed in the MPI+CUDA codebase for PuReMD. These efforts range from generic best-practice software engineering approaches, to tuning individual kernels for lower resource usage and higher parallelizability, to implementing several of the previously discussed charge solver optimizations.

5.2.1 Improving Performance Portability and Scalability via Software Modernization

As previously mentioned, the original MPI+CUDA implementation of PuReMD was developed on the Fermi and Kepler microarchitectures of NVIDIA Tesla GPUs [94]. While this work made great strides in improving performance, numerous limitations existed in this implementation, including hand-optimized kernels solely for Kepler GPUs (due to hardware assumptions); fixed-size memory allocations which severely limited scalability to systems with at most a few thousand atoms; low parallelizability resulting in poor device utilization due to assigning one CUDA thread to compute all interactions for an atom; excessive data transfers between the host and device memory spaces; and lack of implementation of critical simulation features (e.g., non-NVE ensembles and additional charge models). To address these shortcomings, the MPI+CUDA codebase was revised and improved using several approaches. To resolve issues with microarchitecture-specific optimizations, several kernels and small sub-tasks within kernels were replaced with performant, portable alternatives from the NVIDIA CUB library [100].
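To make the nature of these replacements concrete, the sketch below shows the kind of device-wide primitive involved: a CUB exclusive prefix sum over per-atom interaction counts, of the sort used when laying out flattened interaction lists. This is a minimal, self-contained illustration, not code from PuReMD; the function and array names are hypothetical.

```c
/* Minimal sketch of a device-wide exclusive prefix sum with CUB (compile with nvcc).
 * Names (prefix_sum_interaction_counts, d_counts, d_start_idx) are hypothetical. */
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void prefix_sum_interaction_counts( const int *d_counts, int *d_start_idx, int n_atoms )
{
    void *d_temp = NULL;
    size_t temp_bytes = 0;

    /* first call: query the required temporary storage size */
    cub::DeviceScan::ExclusiveSum( d_temp, temp_bytes, d_counts, d_start_idx, n_atoms );
    cudaMalloc( &d_temp, temp_bytes );

    /* second call: compute the starting offset of each atom's inner list */
    cub::DeviceScan::ExclusiveSum( d_temp, temp_bytes, d_counts, d_start_idx, n_atoms );

    cudaFree( d_temp );
}
```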
CUB provides optimized implementations of algorithms and performance primitives at varying scopes (device-wide, block-wide, warp-wide) across NVIDIA GPU microarchitectures, and it has been used in several scientific and numeric codes [101, 102]. For the PuReMD codebase, several CUB routines were leveraged, including reductions and prefix sums.

In order to resolve issues with fixed-size memory allocations, a robust approach was required, as the number of interactions per atom may vary significantly across simulation steps for different terms in the ReaxFF potential. Previous iterations of PuReMD addressed this issue by adopting a three-stage approach to managing memory: 1) estimate the number of interactions per atom, 2) check current data structure allocations and reallocate more memory if needed, and 3) perform computations for an interaction. While this approach is simple and has reasonable performance in a traditional CPU code, several downsides become apparent when moving to a programming model where most of the computation resides on the GPU. Namely, estimation of interaction counts requires expensive device-wide reductions at every simulation step. Moreover, following these estimations, data must be transferred back to host memory, as the host context drives memory management for the majority of the device memory space. In order to avoid these issues, an atomic transactional approach for dynamic memory reallocation was developed (Algorithm 5.1). Algorithm 5.1 relies on writing kernels where the computations may be rolled back if an out-of-memory condition is detected. If this atomicity condition is satisfied, then the benefit of avoiding the steps within the conditional clause can be gained on the majority of simulation steps, provided that either the system under simulation does not have rapidly changing memory requirements or sufficiently large over-allocations are performed when reallocation steps occur.

Algorithm 5.1 Atomic Transactional Approach for GPU Memory Management.
1: Perform computation and note any out-of-memory conditions (e.g., compute bond list)
2: Copy out-of-memory status from computation to host
3: if Out-of-memory condition detected during computation then
4:     Revert any state altered by previous computation
5:     Estimate new storage requirements for computation
6:     Copy storage requirements to host
7:     Reallocate storage space from the host
8:     Redo computation
9: end if

In practice, a 20% to 40% increase above the current estimation is often sufficient to significantly reduce the number of reallocation steps. Furthermore, refining the scoping of reallocations in data structures from a global scope to a more localized scope may further reduce the frequency of triggering reallocations; a prime example of this is in the interaction lists, where changing from global counts across all atoms to localized counts per atom dramatically drops the number of reallocations.
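The following is a minimal, self-contained sketch of how the transactional pattern of Algorithm 5.1 can look in practice. It is illustrative only and not the actual PuReMD implementation: the kernel, list, and driver names are hypothetical, and the "list" being filled is a stand-in for a real interaction list.

```c
/* Sketch of Algorithm 5.1: the kernel flags an out-of-memory condition instead of
 * writing out of bounds, and the host redoes the step only when that flag is set. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void k_fill_list( int *list, int *count, int capacity, int *d_oom, int n_items )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i >= n_items )
        return;

    int pos = atomicAdd( count, 1 );
    if ( pos < capacity )
        list[pos] = i;          /* normal case: record the entry */
    else
        atomicExch( d_oom, 1 ); /* transaction fails: flag it and let the host redo the step */
}

int main( void )
{
    const int n_items = 100000;
    int capacity = 1024, oom = 1;
    int *d_list, *d_count, *d_oom;

    cudaMalloc( &d_list, capacity * sizeof(int) );
    cudaMalloc( &d_count, sizeof(int) );
    cudaMalloc( &d_oom, sizeof(int) );

    while ( oom )
    {
        cudaMemset( d_count, 0, sizeof(int) );
        cudaMemset( d_oom, 0, sizeof(int) );
        k_fill_list<<< (n_items + 255) / 256, 256 >>>( d_list, d_count, capacity, d_oom, n_items );

        /* only a single status flag crosses the PCIe bus in the common case */
        cudaMemcpy( &oom, d_oom, sizeof(int), cudaMemcpyDeviceToHost );
        if ( oom )
        {
            /* revert (nothing to undo here), re-estimate with ~30% headroom, reallocate, redo */
            cudaMemcpy( &capacity, d_count, sizeof(int), cudaMemcpyDeviceToHost );
            capacity = (int) (1.3 * capacity);
            cudaFree( d_list );
            cudaMalloc( &d_list, capacity * sizeof(int) );
        }
    }

    printf( "final capacity: %d\n", capacity );
    cudaFree( d_list ); cudaFree( d_count ); cudaFree( d_oom );
    return 0;
}
```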
5.2.2 Increasing Parallelism through Kernel-Level Restructuring

In order to fully utilize the thousands of CUDA cores in modern NVIDIA GPUs, additional programming changes are required to achieve finer-grained parallelism. A core mechanism through which a higher degree of parallelism can be expressed is the interaction list data structure. The interaction list in PuReMD is effectively a list of lists, where the inner lists are flattened and stored contiguously as a 1-D array (with inter-list padding for memory alignment), with supporting arrays storing the starting and ending indices of each inner list within the flattened 1-D array. Conceptually, the outer index used to select a specific inner list corresponds to some interacting entity, such as an atom or a group of atoms interacting in some physical manner. In terms of parallelism, the original MPI+CUDA code utilized one CUDA thread to process all the valid interactions for an inner list. As the inner lists can vary significantly in size, and as many of the computations performed for each entry in an inner list possess a relatively high arithmetic intensity, the introduction of additional parallelism in processing the inner lists leads to significant gains. Toward this end, a group of threads (e.g., a warp of 32 threads) was utilized to process the inner lists.

The group-of-threads approach has several advantages, including enabling kernels to use fast collective operations (reductions, scans, broadcasts, atomic operations with leader election, and synchronizations) and performing fast memory accesses through memory coalescing via aligned burst reads and writes to aligned sections of the inner lists, at the cost of increased memory usage due to this padding. Additionally, by combining delayed writes to global memory, use of fast shared memory, and warp-level reductions of partial results, pressure on atomic operations in kernels (e.g., for partial energy and force calculations) is decreased.

5.2.3 Exploiting Task-Level Parallelism via CUDA Streams

Moving beyond the idea of exposing more parallelism, another approach which is broadly applicable to the PuReMD MPI+CUDA codebase is to leverage task-level parallelism within the ReaxFF potential and the coupled global charge model. Conceptually, the use of multiple simultaneous CUDA streams provides the CUDA warp scheduler more opportunity to hide memory access latency, as well as to increase device utilization by scheduling computation on idling streaming multiprocessors. With this in mind, Fig. 5.1 depicts the task dependency graph in ReaxFF. In this graph, there are two points with multiple independent task paths following Verlet list generation/updating: 1) initialization following Verlet list generation, and 2) energy and force computation for the potential terms in ReaxFF. In the former section, three independent paths are present (bond, H-bond, and charge matrix initialization), while in the latter section six independent paths exist (QEq/Coulomb, van der Waals, BO/Bonds, BO/OverUnder, BO/Valence/Torsion, H-Bonds). As such, the updated codebase employs six CUDA streams.

Figure 5.1 Task dependency graph for the ReaxFF potential coupled with a global charge model (QEq) within a single MD step.

In order to facilitate the use of CUDA streams in PuReMD, several changes were required, including using atomic operations to ensure true data independence between tasks for intermediate results, switching from synchronous to asynchronous data transfers between the host and device memory spaces, and injecting CUDA events between asynchronous kernel launches in order to translate dependencies in the task graph into dependencies between kernels executing in different streams.
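The event mechanism just mentioned can be sketched as follows. This is a minimal illustration with hypothetical stub kernels, not the actual PuReMD task graph: a producer kernel runs in one stream, an event is recorded behind it, and a dependent kernel in another stream waits on that event without blocking the host thread.

```c
/* Minimal sketch of a cross-stream dependency expressed with CUDA events. */
#include <cuda_runtime.h>

__global__ void k_init_charge_matrix( double *H, int n )
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if ( i < n ) H[i] = 0.0; }

__global__ void k_spmv_stub( const double *H, double *q, int n )
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if ( i < n ) q[i] = H[i]; }

int main( void )
{
    const int n = 1 << 20;
    double *d_H, *d_q;
    cudaStream_t s_init, s_solve;
    cudaEvent_t matrix_ready;

    cudaMalloc( &d_H, n * sizeof(double) );
    cudaMalloc( &d_q, n * sizeof(double) );
    cudaStreamCreate( &s_init );
    cudaStreamCreate( &s_solve );
    cudaEventCreateWithFlags( &matrix_ready, cudaEventDisableTiming );

    /* producer task in one stream, with an event recorded behind it */
    k_init_charge_matrix<<< (n + 255) / 256, 256, 0, s_init >>>( d_H, n );
    cudaEventRecord( matrix_ready, s_init );

    /* consumer task in another stream waits only on that event, not the whole device */
    cudaStreamWaitEvent( s_solve, matrix_ready, 0 );
    k_spmv_stub<<< (n + 255) / 256, 256, 0, s_solve >>>( d_H, d_q, n );

    cudaDeviceSynchronize();
    cudaEventDestroy( matrix_ready );
    cudaStreamDestroy( s_init ); cudaStreamDestroy( s_solve );
    cudaFree( d_H ); cudaFree( d_q );
    return 0;
}
```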
Expanding on this last point about CUDA events: synchronization of tasks across different streams is accomplished by recording and waiting on CUDA events. As to data movement in the context of multiple CUDA streams, the synchronous CUDA API functions all operate in the default stream and thus cannot execute concurrently with non-default streams; the switch to the asynchronous functions is therefore required and yields additional performance gains due to decreased synchronization overhead.

5.2.4 Global Charge Solver Optimizations

Another, more targeted area for optimization in the MPI+CUDA code is the set of solvers for the global charge models. As previously discussed concerning optimizations for the shared-memory and distributed-memory MPI-only versions of PuReMD, the sparse linear solvers underpinning the global charge models constitute a significant portion of the total execution time per MD step within ReaxFF+QEq. In order to better understand the need for optimization of the charge solver, Fig. 5.2 presents kernel execution from a profiling run of the MPI+CUDA code for a typical MD step (as delimited by the green vertical bars). As shown, the charge solver starts to run about mid-way through the total step time, and once the charge solver begins execution, the GPU utilization drops due to the low arithmetic intensity of the kernels in the solver, especially the SpMV kernel. Also, due to host-device data transfers and the MPI communications, several gaps occur in the execution timeline where the streaming multiprocessors on the GPU sit idle. As previously discussed, multiple CUDA streams may be used to hide the data transfer latency, but other optimizations are required to further decrease charge solver runtime. These methods are discussed in detail below.

Figure 5.2 Execution profile of the MPI+CUDA version of PuReMD using a single CUDA stream. The nvprof profile was collected on a single V100 GPU using a 6000-atom amorphous silica system running for 10K steps with 0.25 fs step sizes and an NVE ensemble. Individual kernels, CUDA API calls, and host functions are visualized horizontally in the center, while cumulative runtime percentages are summarized in the left-most column.

Approaches for improving the charge solver performance include optimizing significant kernels within the iterative solver algorithm, namely the sparse matrix-dense vector multiplications (SpMVs); overlapping the charge solver with other kernels via multiple streams, as discussed in the previous section; and decreasing solver runtime through efficient preconditioning. For kernel-level tuning, warps of threads were utilized for the SpMVs, with the symmetric charge matrices stored in the full format. As to the use of CUDA streams, the approach followed the guidance outlined in the previous section, with the additional insight that if one host thread is used to manage the CUDA context, then the charge solver should be launched last in order to avoid blocking kernel launches in other streams due to waits on memory transfers and MPI communication routines. Lastly, insights from the preconditioning methods explored in the distributed-memory MPI-only work guided the implementation of SAI-based preconditioning. For this work, computation of the SAI preconditioner was performed on the host using multiple OpenMP threads due to the current lack of optimized batched least-squares solvers in the CUDA ecosystem; it is left as an area of future work to move this computation to the device.
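The warp-per-inner-list idea from Section 5.2.2 carries over directly to the solver's SpMV. Below is a minimal CSR-format kernel in which one warp of 32 threads processes one matrix row, combining coalesced reads over the row with a warp-shuffle reduction of the partial dot products. This is an illustrative sketch, not the actual PuReMD charge-solver kernel.

```c
/* Illustrative warp-per-row CSR SpMV: y = A*x, one warp per row. */
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define FULL_MASK 0xffffffffu

__global__ void k_spmv_csr_warp( int n_rows, const int *row_ptr, const int *col_idx,
        const double *vals, const double *x, double *y )
{
    int lane = threadIdx.x % WARP_SIZE;
    int row = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;

    if ( row >= n_rows )
        return;

    double sum = 0.0;
    /* lanes stride across the row's nonzeros, giving coalesced global reads */
    for ( int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += WARP_SIZE )
        sum += vals[j] * x[col_idx[j]];

    /* warp-level tree reduction of the partial sums via register shuffles */
    for ( int offset = WARP_SIZE / 2; offset > 0; offset /= 2 )
        sum += __shfl_down_sync( FULL_MASK, sum, offset );

    if ( lane == 0 )
        y[row] = sum;   /* one write per row; no atomic operations needed */
}
```

A likely motivation for storing the symmetric charge matrix in full format in such a kernel (our inference, not a claim from the source) is that half storage would require scattered atomic updates to y for the mirrored entries, whereas full storage keeps the output write pattern to one result per row.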
5.3 Performance Studies

5.3.1 Computing Environment and Benchmark Systems

Results from the numerical experiments presented in the following subsections were collected on clusters at the High Performance Computing Center at Michigan State University. Subsets of nodes in these clusters are equipped with NVIDIA Tesla GPUs of varying models and multiplicities: K20's, K80's, V100's, and A100's. For these experiments, we focus on nodes equipped with V100's and A100's. Targeted nodes equipped with V100's contained 8 GPUs and 40 CPU cores (two twenty-core Intel Xeon Gold 6148 Skylake 2.4 GHz processors) and 384 GB main memory. Nodes equipped with A100's contained 4 GPUs and 64 CPU cores (two thirty-two-core Intel Xeon Platinum 8358 Ice Lake 2.6 GHz processors) and 256 GB main memory.

At the time of the experiments, the clusters ran version 7.9.2009 of the CentOS distribution of GNU/Linux for x86_64 architectures, kernel version 3.10.0-1160.36.2, and glibc version 2.17-325. For software, the MPI+CUDA version of PuReMD was built using the GNU Compiler Collection version 10.3.0, OpenMPI version 4.1.1, and CUDA version 11.4.2 bundled with CUB version 1.12.1. Compiler optimizations enabled included those implied by the -O3 flag and those enabled by optimized device code generation for the Volta and Ampere microarchitectures. For numeric libraries, Intel MKL version 2021.2.0 was utilized. For SAI-based preconditioner computation, the number of OpenMP threads was set to the number of available CPU cores (no hyperthreading enabled). For benchmarking purposes, two familiar molecular systems were selected: bulk water (H2O) and amorphous silica (SiO2). The sizes of these systems varied from hundreds to thousands of atoms depending on the experiment.

5.3.2 Scalability Studies

To better analyze the impact of GPU acceleration on execution time and scaling, we conducted weak and strong scaling experiments measuring mean simulation time per step for PuReMD and LAMMPS Kokkos, a high-performance ReaxFF implementation maintained by Sandia National Laboratories. As indicated earlier, the current PuReMD implementation relies upon a uniform domain decomposition of the simulation space among the computing resources, with MPI processes and GPUs in a one-to-one mapping. Therefore, for our weak scaling experiments, we chose water and silica systems of moderate simulation volume containing several thousand atoms, and scaled them up with the number of GPUs in powers of 2. As given in Fig. 5.3, results for the silica systems on the V100's with PuReMD CG+SAI(0.15) yield speedups of 1.2x to 1.26x over PuReMD CG+Jacobi, and of 1.12x to 1.31x over LAMMPS Kokkos. For the water systems, speedups for PuReMD CG+SAI(0.15) range from 0.88x to 1.04x over PuReMD CG+Jacobi and from 0.76x to 0.97x over LAMMPS Kokkos. Similar to the V100's, results for the silica systems on the A100's with PuReMD CG+SAI(0.15) yield speedups of 1.14x to 1.37x over PuReMD CG+Jacobi, and of 1.17x to 1.54x over LAMMPS Kokkos. For the water systems, PuReMD CG+SAI(0.15) fares better, with speedups ranging from 1.02x to 1.08x over PuReMD CG+Jacobi and from 1.05x to 1.24x over LAMMPS Kokkos. It is worth noting that gains from SAI preconditioning due to the reduction in iteration counts are somewhat eroded by the relatively large preconditioner computation cost, which stems from host-side computation and subsequent transfer to device memory.
In particular, as the number of GPUs increases, the number of OpenMP threads available per MPI process decreases (as the host CPU cores are shared among more processes), and hence the preconditioner computation cost increases.

In Fig. 5.4, we present the strong scaling results for mean per-step simulation time. Since QEq solves constitute a significant amount of the total running time, improvements obtained from SAI preconditioning carry their benefits over to the overall simulation times. In our experiments, we observed overall speedups of 0.81x to 1.74x and 0.89x to 1.77x for PuReMD CG+SAI(0.15) over LAMMPS Kokkos on the V100's with the silica and water systems, respectively, and 0.79x to 1.17x and 0.82x to 1.57x for silica and water on the A100's. Interestingly, the LAMMPS Kokkos code fares best at lower GPU counts, where the number of atoms per GPU is greatest. Thus, additional tuning of PuReMD may be required, especially if performance can be gained by avoiding memory transfers between the host and device memory spaces during the charge solver kernels.

5.4 Conclusions

We explored several optimizations to improve the performance of the GPU-accelerated distributed-memory implementation of ReaxFF+QEq in PuReMD. By applying techniques including leveraging performance-portable primitives in code libraries, exploiting higher degrees of parallelism through warp-tuned kernels, hiding memory access latency and increasing device utilization through CUDA streams, and improving charge solver performance through tuning and improved preconditioning, the updated MPI+CUDA version of PuReMD stands to enable efficient large-scale scientific studies on modern supercomputers. To confirm these assertions, we performed scaling experiments against the LAMMPS Kokkos codebase, an efficient ReaxFF implementation developed by Sandia National Laboratories. In several cases, PuReMD performance is at or above parity with the LAMMPS codebase, with greater relative performance tending to be achieved at larger scales, thus showing great promise for the future use of PuReMD.

[Bar plots: mean time (ms) per MD step vs. number of GPUs; panels (a) Volta V100's and (b) Ampere A100's.]

Figure 5.3 Weak scaling study for the MPI+CUDA PuReMD implementation and LAMMPS Kokkos. Simulations coupled with the QEq charge model were performed on servers with (a) 8 Volta V100 and (b) 4 Ampere A100 GPUs at the MSU HPCC. Within each group of plots, the left plots correspond to amorphous silica systems (6000 atoms per GPU) and the right plots correspond to bulk water systems (6540 atoms per GPU). PuReMD results correspond to the left two bars in a group (CG+Jacobi and CG+SAI(0.15) solvers, respectively), while LAMMPS Kokkos (CG+Jacobi) results correspond to the transparent, hatched bars. Reported times are mean total times per MD step. For these results, simulations were performed using an NVE ensemble for 10K steps with 0.25 fs per step. Reconstruction of the Verlet neighbor list was done every 25 steps using a 2.0 Å Verlet list buffer for the codes using CG+Jacobi solvers, while for the CG+SAI(0.15) solver reconstruction occurred every 250 steps using a 3.0 Å buffer. All codes used a 10^-6 solver tolerance.
[Bar plots: mean time (ms) per MD step vs. number of GPUs; panels (a) Volta V100's and (b) Ampere A100's.]

Figure 5.4 Strong scaling study for the MPI+CUDA PuReMD implementation and LAMMPS Kokkos. Simulations were performed on servers with (a) 8 Volta V100 and (b) 4 Ampere A100 GPUs. Within each group of plots, the left plots correspond to amorphous silica systems (48000 atoms) and the right plots correspond to bulk water (22500 atoms). PuReMD results correspond to the left two bars in a group (CG+Jacobi and CG+SAI(0.15) solvers, respectively), while LAMMPS Kokkos (CG+Jacobi) results correspond to the rightmost bars. Reported times are mean total times per MD step. Additional simulation parameters were identical to those used in the weak scaling study.

APPENDICES

APPENDIX A

SUPPLEMENTAL DATA FOR GLOBAL CHARGE MODEL OPTIMIZATIONS IN SHARED-MEMORY

Table A.1 Force field parameter values utilized for simulations with the molecular systems in Table 2.1.

System             Element   γi       χi       ηi       σi       Λ
Bilayer & Silica   C         0.7631   5.9993   6.0000   -        -
                   H         0.8203   3.7248   9.6093   -        -
                   N         1.0000   6.8287   7.2217   -        -
                   O         1.0898   8.5000   8.3122   -        -
                   P         1.0000   1.8292   7.2520   -        -
                   Si        0.8925   4.6988   6.0000   -        -
PETN               C         0.8712   5.7254   6.9235   -        -
                   H         0.8910   3.8446   1.0698   -        -
                   N         0.8922   6.7424   6.2435   -        -
                   O         0.8712   8.5000   7.1412   -        -
Water              H         0.8203   3.7248   9.6093   3.4114   -
                   O         1.0898   8.5000   8.3122   0.9745   -
                   -         -        -        -        -        548.6451

Table A.2 Charge matrix condition numbers for the charge distribution models. Condition numbers were derived from the eigen-spectrum computed by the LAPACKE dgesvd function with Intel Math Kernel Library (MKL) version 2018.1.163.

System    Method   dim(H)     κ(H) = σ_max / σ_min
Bilayer   QEq      56800      2.569E2
          EE       56801      3.166E2
PETN      QEq      48256      1.347E2
          EE       48257      2.117E2
Silica    QEq      72000      1.061E2
          EE       72001      2.315E2
Water     QEq      78480      6.008E1
          EE       78481      1.022E2
          ACKS2    156962     8.231E4

Table A.3 Tuned preconditioned solver parameters for shared-memory experiments. For selection, a parameter search was conducted in order to minimize mean solver time for the ICDD(t)+DS(d), ILUDD(t)+DS(d), FG-ICDD(t, s)+DS(d), FG-ILUDD(t, s)+DS(d), and SAI(τ)+DS(d) preconditioned solvers. Parameters in the table are ordered from left to right as appearing in the above preconditioner names. Solver Tolerance 10^-6 10^-10 10^-14 CM Prec.
System QEq ICDD Bilayer (0.0, 0.6) (0.0, 0.6) (0.0, 0.8) PETN (0.0, 0.6) (0.0, 0.6) (0.0, 0.6) Silica (0.0, 0.6) (0.0, 0.6) (0.0, 0.6) Water (0.01, 0.6) (0.0, 0.6) (0.0, 0.6) FG-ICDD Bilayer (0.0, 4, 0.6) (0.0, 3, 0.6) (0.01, 3, 0.6) PETN (0.0, 3, 0.6) (0.0, 2, 0.6) (0.01, 2, 0.6) Silica (0.0, 2, 0.6) (0.0, 2, 0.6) (0.0, 3, 0.6) Water (0.0, 3, 0.6) (0.0, 3, 0.6) (0.0, 4, 0.6) SAI Bilayer (0.05, 0.6) (0.1, 1.0) (0.15, 1.0) PETN (0.1, 0.6) (0.15, 1.0) (0.15, 1.0) Silica (0.075, 0.8) (0.1, 0.8) (0.1, 0.8) Water (0.1, 0.6) (0.15, 0.6) (0.1, 0.6) EE ILUDD Bilayer (0.0, 0.6) (0.0, 0.6) (0.0, 0.6) PETN (0.0, 0.6) (0.0, 0.6) (0.0, 0.6) Silca (0.0, 0.6) (0.0, 0.6) (0.0, 0.6) Water (0.0, 0.6) (0.0, 0.6) (0.0, 0.6) FG-ILUDD Bilayer (0.01, 3, 0.6) (0.01, 3, 0.6) (0.01, 4, 0.6) PETN (0.1, 2, 0.6) (0.0, 2, 0.6) (0.0, 2, 0.6) Silica (0.0, 3, 0.6) (0.0, 3, 0.6) (0.01, 4, 0.6) Water (0.1, 2, 0.6) (0.0, 2, 0.6) (0.01, 2, 0.6) SAI Bilayer (0.15, 1.0) (0.1, 1.0) (0.15, 1.0) PETN (0.125, 0.8) (0.15, 0.6) (0.125, 1.0) Silica (0.1, 0.8) (0.1, 0.8) (0.1, 0.8) Water (0.1, 0.6) (0.1, 0.6) (0.1, 0.6) ACKS2 ILUDD Water (0.0, 0.6) (0.0, 0.6) (0.0, 0.6) FG-ILUDD Water (0.0, 2, 0.6) (0.0, 2, 0.6) (0.0, 2, 0.6) SAI Water (0.075, 0.6) (0.15, 0.6) (0.15, 0.6) 104 APPENDIX B SUPPLEMENTAL DATA FOR OPTIMIZATION OF THE REAX FORCE FIELD FOR THE LITHIUM-OXYGEN SYSTEM USING A HIGH FIDELITY CHARGE MODEL B.1 Fitted ACKS2-based ReaxFF Parameters Table B.1 shows the parameters chosen to fit are labeled according to the standard ReaxFF parameter file indexing [103]. The first integer in parentheses indexes the parameter section corresponding to one of the following five categories: General, Atoms, Bonds, Off-diagonal, and Angles. The second integer indexes the block within the section. In this case, we fit lithium and oxygen atomic parameters O-O, Li-O and Li-Li bonds; Li-O off-diagonal param- eters; and O-Li-O, O-O-Li, and Li-O-Li angle parameters. The third integer corresponds to the specific parameter within each block (see the ReaxFF Manual for details [103]). The ACKS2-specific parameters are a global bond softness parameter Λ ((1, 35, 1) in the force field), element-specific EEM shielding parameters γi ((2, 2, 6) and (2, 5, 6)), and element-specific thresholding parameters σi ((2, 2, 23) and (2, 5, 23)), discussed in Section 4.2.2. Table B.1 The ReaxFF parameters adjusted to create the new ACKS2-based force field for Li-O. Each parameter is constrained to the range defined by its lower bound (L.B.) and upper bound (U.B.) during the optimization procedure. The fourth and fifth columns report the fitted parameter values for the ACKS2-based ReaxFF and the 2016 QEq-based ReaxFF, respectively. Parameters marked with an asterisk∗ are specific to the ACKS2 model and were not fitted (N.F.) for the ReaxFF QEq 2016 force field. Parameteres not listed in the table are fixed to the values from the 2016 ReaxFF reference. Parameter Indices L.B. U.B. Fitted ACKS2 Values ReaxFF QEq 2016 Values 105 Table B.1 (cont’d) (1, 35, 1)∗ 10.00 500.00 234.3037 N.F. (2, 2, 5) 0.01 0.20 0.0502 0.1000 (2, 2, 6)∗ 0.50 1.50 0.6318 N.F. (2, 2, 9) 8.00 10.00 9.8993 9.7300 (2, 2, 14) 1.00 20.00 9.0609 8.5000 (2, 2, 15) 1.00 20.00 7.4275 8.3122 (2, 2, 23)∗ 0.10 10.00 3.4489 N.F. (2, 2, 25) −5.00 −1.00 −4.0553 −3.5500 (2, 5, 5) 0.01 0.20 0.1996 0.1109 (2, 5, 6)∗ 0.40 1.00 0.7511 N.F. (2, 5, 9) 8.00 10.00 9.9731 9.3147 (2, 5, 14) −12.00 12.00 −6.0479 −6.9345 (2, 5, 15) 1.00 20.00 6.2978 15.0000 (2, 5, 23)∗ 0.50 15.00 9.3096 N.F. 
(2, 5, 25) −28.00 −17.00 −23.0012 −25.0000 (3, 2, 1) 140.00 150.00 141.9324 142.2858 (3, 2, 4) 0.00 1.50 1.2989 0.2506 (3, 2, 8) 0.00 1.75 0.4910 0.6051 (3, 12, 1) 75.00 85.00 76.8938 76.3632 (3, 12, 4) −1.50 0.00 −1.1429 −0.4309 (3, 12, 8) 0.00 1.50 0.0631 0.1271 (3, 15, 1) 45.00 75.00 45.0406 52.4319 (3, 15, 4) 0.00 1.00 0.0064 0.2219 (3, 15, 8) 0.00 3.00 1.5411 0.5355 (4, 8, 1) 0.00 2.00 1.4736 0.1112 (4, 8, 2) 0.00 2.50 1.4147 1.6982 (4, 8, 3) 9.00 13.00 9.0283 11.0473 (5, 31, 1) 70.00 90.00 73.8444 83.7146 (5, 31, 2) 2.00 12.00 7.3527 8.7579 (5, 31, 3) 3.00 5.00 4.2355 4.0000 (5, 31, 7) 0.50 2.50 2.1889 1.2463 (5, 32, 1) 78.00 89.00 83.7686 81.6233 (5, 32, 2) 28.00 32.00 29.7164 30.0000 (5, 32, 3) 1.00 3.00 1.4275 2.0000 106 Table B.1 (cont’d) (5, 32, 7) 0.00 2.00 0.8509 1.0000 (5, 33, 1) 85.00 110.00 93.3484 86.6874 (5, 33, 2) 15.00 25.00 19.0545 22.4076 (5, 33, 3) 3.00 5.00 3.5779 4.0000 (5, 33, 7) 0.50 2.50 1.7236 1.3084 B.2 Fitted QEq force field As discussed in the main text, a ReaxFF force field was fitted using the same parameters and training set as the ACKS2 force field. While it achieved similar fitness values on the training set (see Fig. 4.1), it was unstable and did not capture the correct physics. This is illustrated in Fig. B.1. The failure of this force field is important for two reasons. Firstly, it highlights the fact that good fitness values on the training set alone do not guarantee a good force field; the proof is in the MD simulations. Secondly, it shows that the ACKS2 method is necessary to correctly capture bond-breaking. A QEq force field breaks when forced to fit bond-breaking data. B.3 The MRCI+Q energies and MRCI atomic charges Table B.2 shows the MRCI+Q energies and MRCI atomic charges that form part of the training dataset used in force field fitting. Table B.2 Nuclear geometries used in the calculations for the Li-O-Li system, along with the MRCI+Q total electronic energies and MRCI atomic charges on Li and O atoms. 
rLi(1)–O rLi(2)–O MRCI+Q qLi(1) qO qLi(2) 1.1381 1.6259 -89.865009 0.87 -1.79 0.93 1.3007 1.6259 -89.982903 0.91 -1.85 0.94 1.4633 1.6259 -90.030280 0.93 -1.88 0.94 107 Table B.2 (cont’d) 1.6259 1.6259 -90.043345 0.94 -1.89 0.94 1.7885 1.6259 -90.039912 0.94 -1.89 0.94 1.9511 1.6259 -90.028990 0.94 -1.88 0.94 2.1137 1.6259 -90.015046 0.93 -1.87 0.94 2.2763 1.6259 -90.000344 0.92 -1.85 0.94 2.4389 1.6259 -89.986037 0.90 -1.83 0.93 2.6015 1.6259 -89.972698 0.86 -1.80 0.93 2.7641 1.6259 -89.960250 0.82 -1.75 0.93 2.9266 1.6259 -89.947858 0.75 -1.68 0.93 3.0892 1.6259 -89.936609 0.65 -1.58 0.93 3.2518 1.6259 -89.926896 0.52 -1.46 0.93 3.4144 1.6259 -89.919775 0.38 -1.32 0.94 3.7396 1.6259 -89.909602 0.17 -1.11 0.94 4.0648 1.6259 -89.904763 0.07 -1.02 0.95 4.3900 1.6259 -89.902426 0.03 -0.98 0.95 4.7152 1.6259 -89.901202 0.01 -0.96 0.95 5.0403 1.6259 -89.900506 0.01 -0.96 0.95 5.3655 1.6259 -89.900080 0.00 -0.95 0.95 5.6907 1.6259 -89.899804 0.00 -0.95 0.95 6.0159 1.6259 -89.899618 0.00 -0.95 0.95 1.1381 1.1381 -89.672358 0.88 -1.75 0.88 1.3007 1.3007 -89.917700 0.92 -1.83 0.92 1.4633 1.4633 -90.016373 0.93 -1.87 0.93 1.6259 1.6259 -90.043345 0.94 -1.89 0.94 1.7885 1.7885 -90.036195 0.95 -1.89 0.95 1.9511 1.9511 -90.013953 0.94 -1.89 0.94 2.1137 2.1137 -89.985959 0.94 -1.87 0.94 2.2763 2.2763 -89.956459 0.92 -1.83 0.92 2.4389 2.4389 -89.928055 0.88 -1.77 0.88 2.6015 2.6015 -89.900678 0.83 -1.66 0.83 2.7641 2.7641 -89.877388 0.76 -1.51 0.76 2.9266 2.9266 -89.856157 0.67 -1.35 0.67 3.0892 3.0892 -89.838679 0.61 -1.22 0.61 3.2518 3.2518 -89.814604 0.50 -1.00 0.50 108 Table B.2 (cont’d) 3.4144 3.4144 -89.805887 0.50 -1.00 0.50 3.7396 3.7396 -89.790753 0.00 0.00 0.00 4.0648 4.0648 -89.791244 0.00 0.00 0.00 4.3900 4.3900 -89.791538 0.00 0.00 0.00 4.7152 4.7152 -89.791685 0.00 0.00 0.00 5.0403 5.0403 -89.791736 0.00 0.00 0.00 5.3655 5.3655 -89.791730 0.00 0.00 0.00 5.6907 5.6907 -89.791698 0.00 0.00 0.00 6.0159 6.0159 -89.791658 0.00 0.00 0.00 B.4 Further details of the MD simulations Fig. B.2 shows detailed information on oxygen charge during cracking simulation from Fig. 4.6 (d). Also, Fig. B.3 provides analysis from a second cracking experiment with iden- tical parameters and starting structures to Fig. 4.6 (a)-(c). Fig. B.4 displays evolution of the RDF of the 324-atom Li2 O cube during NPT simulation in Section 4.3.2. 109 (a) (b) Figure B.1 Deformation simulation using the fitted QEq force field for Li-O. In the atomic visualization (a), atoms are clearly being ejected from the slab, despite the fact that this simulation was run at 1K in an attempt to avoid the ejections. In the radial distribution function (b), a small peak around 1 Å appears. This is unphysical and does not occur in the other force fields tested. 110 Amorphous Ordered Amorphous Ordered 0% 0% 0% 0% 0% 0% Oxygen charge deviation (e) 10% 10% 5% 10% 5% 10% Number 15% 15% 8% 15% 8% 15% 20% 20% 12% 20% 12% 20% Oxygen charge (e) Figure B.2 Detailed information on oxygen charge during cracking simulation. Left: histograms of oxygen charge simulations at various points during the deformation simulations. Note that the axis scales are identical among the subplots. Right: deviation of oxygen charge from the average, shown at various points during the deformation simulations. It is clear from this figure that the charge-depleted population is larger in the QEq FF simulations, whereas the distribution is narrower and more symmetric under the ACKS2 FF. 
111 Figure B.3 Second cracking experiment with identical parameters and starting structures. The behavior is slightly different but the qualitative trends remain. This provides evidence that the conclusions are not sensitive to the stochastic nature of MD simulation. 112 0 ps ACKS2 FF 2016 QEq FF 10 ps 20 ps 40 ps Figure B.4 Evolution of the RDF during NPT simulation with the ACKS2 FF and the 2016 QEq FF. Results are shown for the ACKS2 FF (left) and the 2016 QEq FF (right). 113 BIBLIOGRAPHY 114 BIBLIOGRAPHY [1] A. C. T. van Duin, S. Dasgupta, F. Lorant, and W. A. Goddard, “Reaxff: A reactive force field for hydrocarbons,” The Journal of Physical Chemistry A, vol. 105, no. 41, pp. 9396–9409, 2001. [2] T.-R. Shan, B. D. Devine, T. W. Kemper, S. B. Sinnott, and S. R. Phillpot, “Charge- optimized many-body potential for the hafnium/hafnium oxide system,” Phys. Rev. B, vol. 81, p. 125328, Mar 2010. [3] S. J. Stuart, A. B. Tutein, and J. A. Harrison, “A reactive potential for hydrocarbons with intermolecular interactions,” The Journal of Chemical Physics, vol. 112, no. 14, pp. 6472–6486, 2000. [4] T. P. Senftle, S. Hong, M. M. Islam, S. B. Kylasa, Y. Zheng, Y. K. Shin, C. Junker- meier, R. Engel-Herbert, M. J. Janik, H. M. Aktulga, T. Verstraelen, A. Grama, and A. C. T. van Duin, “The reaxff reactive force-field: development, applications and future directions,” npj Computational Materials, vol. 2, p. 15011, 2016. [5] H. M. Park, Yumi nd Atkulga, A. Grama, and A. Strachan, “Strain relaxation in si/ge/si nanoscale bars from molecular dynamics simulations,” Journal of Applied Physics, vol. 106, no. 3, p. 034304, 2009. [6] J. C. Fogarty, H. M. Aktulga, A. Y. Grama, A. C. T. van Duin, and S. A. Pandit, “A reactive molecular dynamics simulation of the silica-water interface,” The Journal of Chemical Physics, vol. 132, no. 17, p. 174704, 2010. [7] A. Grama, H. M. Aktulga, and S. B. Kylasa, “PuReMD, Purdue Reactive Molecular Dynamics package,” 2014. Accessed on June 8, 2016. [8] H. M. Aktulga, C. Knight, P. Coffman, K. A. O’Hearn, T.-R. Shan, and W. Jiang, “Op- timizing the performance of reactive molecular dynamics simulations for many-core ar- chitectures,” The International Journal of High Performance Computing Applications, vol. 33, no. 2, pp. 304–321, 2019. [9] K. ichi Nomura, R. K. Kalia, A. Nakano, and P. Vashishta, “A scalable parallel algo- rithm for large-scale reactive force-field molecular dynamics simulations,” Computer Physics Communications, vol. 178, no. 2, pp. 73–87, 2008. [10] H. M. Aktulga, S. A. Pandit, A. C. T. van Duin, and A. Y. Grama, “Reactive molec- ular dynamics: Numerical methods and algorithmic techniques,” SIAM Journal on Scientific Computing, vol. 34, no. 1, pp. C1–C23, 2012. [11] H. Aktulga, J. Fogarty, S. Pandit, and A. Grama, “Parallel reactive molecular dy- namics: Numerical methods and algorithmic techniques,” Parallel Computing, vol. 38, no. 4, pp. 245–259, 2012. 115 [12] K. A. O’Hearn, A. Alperen, and H. M. Aktulga, “Fast Solvers for Charge Distribution Models on Shared Memory Platforms,” SIAM Journal on Scientific Computing, vol. 42, no. 1, pp. C1–C22, 2020. [13] K. A. O’Hearn and H. M. Aktulga, “Towards fast scalable solvers for charge equilibra- tion in molecular dynamics applications,” in 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), pp. 9–16, 2016. [14] A. D. MacKerell, D. Bashford, M. Bellott, R. L. Dunbrack, J. D. Evanseck, M. J. Field, S. Fischer, J. Gao, H. Guo, S. Ha, D. Joseph-McCarthy, L. Kuchnir, K. 
Kucz- era, F. T. K. Lau, C. Mattos, S. Michnick, T. Ngo, D. T. Nguyen, B. Prodhom, W. E. Reiher, B. Roux, M. Schlenkrich, J. C. Smith, R. Stote, J. Straub, M. Watan- abe, J. Wiórkiewicz-Kuczera, D. Yin, and M. Karplus, “All-atom empirical potential for molecular modeling and dynamics studies of proteins,” The Journal of Physical Chemistry B, vol. 102, no. 18, pp. 3586–3616, 1998. PMID: 24889800. [15] W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, and P. A. Kollman, “A second generation force field for the simulation of proteins, nucleic acids, and organic molecules,” Journal of the American Chemical Society, vol. 117, no. 19, pp. 5179–5197, 1995. [16] A. K. Rappe, C. J. Casewit, K. S. Colwell, W. A. Goddard, and W. M. Skiff, “Uff, a full periodic table force field for molecular mechanics and molecular dynamics simulations,” Journal of the American Chemical Society, vol. 114, no. 25, pp. 10024–10035, 1992. [17] T. A. Halgren and W. Damm, “Polarizable force fields,” Current opinion in structural biology, vol. 11, no. 2, pp. 236–242, 2001. [18] S. W. Rick and S. J. Stuart, “Potentials and algorithms for incorporating polarizability in computer simulations,” Reviews in computational chemistry, vol. 18, pp. 89–146, 2002. [19] S. Patel and C. L. B. III, “Fluctuating charge force fields: recent developments and applications from small molecules to macromolecular biological systems,” Molecular Simulation, vol. 32, no. 3-4, pp. 231–249, 2006. [20] P. Li and K. M. Merz, “Metal ion modeling using classical mechanics,” Chemical Reviews, vol. 117, no. 3, pp. 1564–1686, 2017. PMID: 28045509. [21] W. J. Mortier, S. K. Ghosh, and S. Shankar, “Electronegativity-equalization method for the calculation of atomic charges in molecules,” Journal of the American Chemical Society, vol. 108, no. 15, pp. 4315–4320, 1986. [22] A. K. Rappe and W. A. Goddard, “Charge equilibration for molecular dynamics sim- ulations,” The Journal of Physical Chemistry, vol. 95, no. 8, pp. 3358–3363, 1991. [23] R. A. Nistor, J. G. Polihronov, M. H. Müser, and N. J. Mosey, “A generalization of the charge equilibration method for nonmetallic materials,” The Journal of Chemical Physics, vol. 125, no. 9, p. 094108, 2006. 116 [24] T. Verstraelen, P. W. Ayers, V. Van Speybroeck, and M. Waroquier, “Acks2: Atom- condensed kohn-sham dft approximated to second order,” The Journal of Chemical Physics, vol. 138, no. 7, p. 074108, 2013. [25] A. Nakano, “Parallel multilevel preconditioned conjugate-gradient approach to variable-charge molecular dynamics,” Computer Physics Communications, vol. 104, no. 1, pp. 59–69, 1997. [26] W. J. Mortier, K. Van Genechten, and J. Gasteiger, “Electronegativity equalization: application and parametrization,” Journal of the American Chemical Society, vol. 107, no. 4, pp. 829–835, 1985. [27] S. Plimpton, “Fast parallel algorithms for short-range molecular dynamics,” Journal of Computational Physics, vol. 117, no. 1, pp. 1–19, 1995. [28] Y. Saad, Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, second ed., 2003. [29] Y. Saad, “Ilut: A dual threshold incomplete lu factorization,” Numerical Linear Alge- bra with Applications, vol. 1, no. 4, pp. 387–402, 1994. [30] B. Carpentieri and M. Bollhöfer, “Symmetric inverse-based multilevel ilu precondition- ing for solving dense complex non-hermitian systems in electromagnetics,” Progress in Electromagnetics Research, vol. 128, pp. 55–74, 01 2012. [31] X.-M. 
Pan and X.-Q. Sheng, “Sparse approximate inverse preconditioner for multiscale dynamic electromagnetic problems,” Radio Science, vol. 49, no. 11, pp. 1041–1051, 2014. [32] J. R. Allwright, R. Bordawekar, P. D. Coddington, K. Dincer, and C. L. Martin, “A comparison of parallel graph coloring algorithms,” Technical Report SCCS-666, vol. Northeast Parallel Architecture Center, Syracuse University, 1995. [33] E. F. F. Botta and A. van der Ploeg, “Renumbering strategies based on multi-level techniques combined with ILU-decompositions,” Zh. Vychisl. Mat. Mat. Fiz., vol. 37, no. 11, pp. 1294–1300, 1997. [34] P. Hénon and Y. Saad, “A parallel multistage ilu factorization based on a hierarchical graph decomposition,” SIAM Journal on Scientific Computing, vol. 28, no. 6, pp. 2266– 2293, 2006. [35] ı. V. ÇAtalyüRek, J. Feo, A. H. Gebremedhin, M. Halappanavar, and A. Pothen, “Graph coloring algorithms for multi-core and massively multithreaded architectures,” Parallel Comput., vol. 38, pp. 576–594, Oct. 2012. [36] M. Naumov, P. Castonguay, and J. Cohen, “Parallel graph coloring with applications to the incomplete-lu factorization on the gpu,” NVIDIA Technical Report, vol. NVR- 2015-001, pp. 1–23, 2015. 117 [37] M. Naumov, “Parallel incomplete-lu and cholesky factorization in the preconditioned iterative methods on the gpu,” NVIDIA Technical Report, vol. NVR-2012-003, pp. 1– 19, 2012. [38] E. Chow and A. Patel, “Fine-grained parallel incomplete lu factorization,” SIAM Jour- nal on Scientific Computing, vol. 37, no. 2, pp. C169–C193, 2015. [39] M. Benzi and M. Tûma, “A comparative study of sparse approximate inverse precon- ditioners,” Applied Numerical Mathematics, vol. 30, no. 2, pp. 305 – 340, 1999. [40] M. Grote and T. Huckle, “Parallel preconditioning with sparse approximate inverses,” SIAM Journal on Scientific Computing, vol. 18, no. 3, pp. 838–853, 1997. [41] M. Benzi and M. Tuma, “A sparse approximate inverse preconditioner for nonsymmet- ric linear systems,” SIAM Journal on Scientific Computing, vol. 19, no. 3, pp. 968–994, 1998. [42] Y. Saad and M. Schultz, “Gmres: A generalized minimal residual algorithm for solving nonsymmetric linear systems,” SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 3, pp. 856–869, 1986. [43] K. A. O’Hearn, A. Alperen, and H. M. Aktulga, “Performance optimization of re- active molecular dynamics simulations with dynamic charge distribution models on distributed memory platforms,” in ICS ’19: Proceedings of the Association of Com- puting Machinery International Conference on Supercomputing, pp. 150–159, 2019. [44] T. Liang, T.-R. Shan, Y.-T. Cheng, B. D. Devine, M. Noordhoek, Y. Li, Z. Lu, S. R. Phillpot, and S. B. Sinnott, “Classical atomistic simulations of surfaces and heteroge- neous interfaces with the charge-optimized many body (comb) potentials,” Materials Science and Engineering: R: Reports, vol. 74, no. 9, pp. 255–279, 2013. [45] D. Frenkel and B. Smit, Chapter 10 - Free Energies of Solids. San Diego: Academic Press, second edition ed., 2002. [46] P. Ghysels and W. Vanroose, “Hiding global synchronization latency in the precondi- tioned conjugate gradient algorithm,” Parallel Computing, vol. 40, no. 7, pp. 224–238, 2014. 7th Workshop on Parallel Matrix Algorithms and Applications. [47] M. Benzi and M. Tuma, “A comparative study of sparse approximate inverse precon- ditioners,” Applied Numerical Mathematics, vol. 30, pp. 305–340, 04 1998. [48] K. A. O’Hearn, M. W. Swift, J. Liu, I. Magoulas, P. Piecuch, A. C. T. van Duin, H. M. 
Aktulga, and Y. Qi, “Optimization of the reax force field for the lithium–oxygen system using a high fidelity charge model,” The Journal of Chemical Physics, vol. 153, no. 8, p. 084107, 2020. [49] P. Hohenberg and W. Kohn, “Inhomogeneous electron gas,” Phys. Rev., vol. 136, pp. B864–B871, Nov 1964. 118 [50] S. Huang, S. Zhang, T. Belytschko, S. S. Terdalkar, and T. Zhu, “Mechanics of nanoc- rack: Fracture, dislocation emission, and amorphization,” Journal of the Mechanics and Physics of Solids, vol. 57, no. 5, pp. 840–850, 2009. [51] A. A. Griffith and G. I. Taylor, “Vi. the phenomena of rupture and flow in solids,” Philosophical Transactions of the Royal Society of London. Series A, Containing Pa- pers of a Mathematical or Physical Character, vol. 221, no. 582-593, pp. 163–198, 1921. [52] J. Belak, J. N. Glosli, D. B. Boercker, and I. F. Stowers, “Molecular dynamics simu- lation of mechanical deformation of ultra-thin metal and ceramic films,” MRS Proc., vol. 389, p. 181, 1995. [53] I. Salehinia, J. Wang, D. Bahr, and H. Zbib, “Molecular dynamics simulations of plastic deformation in nb/nbc multilayers,” International Journal of Plasticity, vol. 59, pp. 119–132, 2014. [54] D. Wolf, V. Yamakov, S. Phillpot, A. Mukherjee, and H. Gleiter, “Deformation of nanocrystalline materials by molecular-dynamics simulation: relationship to experi- ments?,” Acta Materialia, vol. 53, no. 1, pp. 1–40, 2005. [55] P. S. Branı́cio and J.-P. Rino, “Large deformation and amorphization of ni nanowires under uniaxial strain: A molecular dynamics study,” Phys. Rev. B, vol. 62, pp. 16950– 16955, Dec 2000. [56] F. Yuan and L. Huang, “Brittle to ductile transition in densified silica glass,” Sci. Rep., vol. 4, p. 5035, 2015. [57] K. Muralidharan, K.-D. Oh, P. A. Deymier, K. Runge, and J. H. Simmons, “Molecular dynamics simulations of atomic-level brittle fracture mechanisms in amorphous silica,” Journal of Materials Science, vol. 42, pp. 4159–4169, 2007. [58] M. M. Islam, A. Ostadhossein, O. Borodin, A. T. Yeates, W. W. Tipton, R. G. Hennig, N. Kumar, and A. C. T. van Duin, “Reaxff molecular dynamics simulations on lithiated sulfur cathode materials,” Phys. Chem. Chem. Phys., vol. 17, pp. 3383–3393, 2015. [59] K. Kang and W. Cai, “Brittle and ductile fracture of semiconductor nanowires – molec- ular dynamics simulations,” Philosophical Magazine, vol. 87, no. 14-15, pp. 2169–2189, 2007. [60] L. Van Brutzel, C. L. Rountree, R. K. Kalia, A. Nakano, and P. Vashishta, “Dy- namic fracture mechanisms in nanostructured and amorphous silica glasses million- atom molecular dynamics simulations,” MRS Online Proceedings Library, vol. 703, no. 1, p. 39, 2011. [61] C. Xu and W. Gao, “Pilling-bedworth ratio for oxidation of alloys,” Materials Research Innovations, vol. 3, no. 4, pp. 231–235, 2000. [62] J. Liu, Atomic Simulation on Chemical-Mechanical Coupled Deformations in Complex Nano Structures. PhD thesis, Michigan State University, 2019. 119 [63] F. G. Sen, A. T. Alpas, A. C. T. van Duin, and Y. Qi, “Oxidation-assisted ductility of aluminium nanowires,” Nature Communications, vol. 5, p. 3959, 2014. [64] B. Mortazavi, A. Ostadhossein, T. Rabczuk, and A. C. T. van Duin, “Mechanical response of all-mos2 single-layer heterostructures: a reaxff investigation,” Phys. Chem. Chem. Phys., vol. 18, pp. 23695–23701, 2016. [65] G. M. Odegard, B. D. Jensen, S. Gowtham, J. Wu, J. He, and Z. Zhang, “Predict- ing mechanical response of crosslinked epoxy using reaxff,” Chemical Physics Letters, vol. 591, pp. 175–178, 2014. [66] B. D. 
Jensen, K. E. Wise, and G. M. Odegard, “The effect of time step, thermostat, and strain rate on reaxff simulations of mechanical failure in diamond, graphene, and carbon nanotube,” Journal of Computational Chemistry, vol. 36, no. 21, pp. 1587–1596, 2015. [67] H. Werner and P. J. Knowles, “An efficient internally contracted multiconfigura- tion–reference configuration interaction method,” The Journal of Chemical Physics, vol. 89, no. 9, pp. 5803–5814, 1988. [68] P. J. Knowles and H.-J. Werner, “An efficient method for the evaluation of coupling co- efficients in configuration interaction calculations,” Chemical Physics Letters, vol. 145, no. 6, pp. 514–522, 1988. [69] W. Kohn and L. J. Sham, “Self-consistent equations including exchange and correlation effects,” Phys. Rev., vol. 140, pp. A1133–A1138, Nov 1965. [70] A. Ostadhossein, S.-Y. Kim, E. D. Cubuk, Y. Qi, and A. C. T. van Duin, “Atomic insight into the lithium storage and diffusion mechanism of sio2/al2o3 electrodes of lithium ion batteries: Reaxff reactive force field modeling,” The Journal of Physical Chemistry A, vol. 120, no. 13, pp. 2114–2127, 2016. PMID: 26978039. [71] A. Yulaev, V. Oleshko, P. Haney, J. Liu, Y. Qi, A. A. Talin, M. S. Leite, and A. Kol- makov, “From microparticles to nanowires and back: Radical transformations in plated li metal morphology revealed via in situ scanning electron microscopy,” Nano Letters, vol. 18, no. 3, pp. 1644–1650, 2018. PMID: 29397748. [72] B. Narayanan, A. C. T. van Duin, B. B. Kappes, I. E. Reimanis, and C. V. Ciobanu, “A reactive force field for lithium–aluminum silicates with applications to eucryp- tite phases,” Modelling and Simulation in Materials Science and Engineering, vol. 20, p. 015002, nov 2011. [73] G. Kresse and J. Furthmüller, “Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set,” Phys. Rev. B, vol. 54, pp. 11169–11186, Oct 1996. [74] J. P. Perdew, K. Burke, and M. Ernzerhof, “Generalized gradient approximation made simple,” Phys. Rev. Lett., vol. 77, pp. 3865–3868, Oct 1996. 120 [75] P. E. Blöchl, “Projector augmented-wave method,” Phys. Rev. B, vol. 50, pp. 17953– 17979, Dec 1994. [76] H.-J. Werner, P. J. Knowles, G. Knizia, F. R. Manby, M. Schütz, P. Celani, T. Korona, R. Lindh, A. Mitrushenkov, G. Rauhut, K. R. Shamasundar, T. B. Adler, R. D. Amos, A. Bernhardsson, A. Berning, D. L. Cooper, M. J. O. Deegan, A. J. Dobbyn, F. Eckert, E. Goll, C. Hampel, A. Hesselmann, G. Hetzer, T. Hrenar, G. Jansen, C. Köppl, Y. Liu, A. W. Lloyd, R. A. Mata, A. J. May, S. J. McNicholas, W. Meyer, M. E. Mura, A. Nicklass, D. P. O’Neill, P. Palmieri, K. Pflüger, R. Pitzer, M. Reiher, T. Shiozaki, H. Stoll, A. J. Stone, R. Tarroni, T. Thorsteinsson, and M. Wang. MOLPRO, version 2010.1, a package of ab initio programs, see https://www.molpro.net. [77] K. Ruedenberg, M. W. Schmidt, M. M. Gilbert, and S. Elbert, “Are atoms intrinsic to molecular electronic wavefunctions? i. the fors model,” Chemical Physics, vol. 71, no. 1, pp. 41–49, 1982. [78] B. O. Roos, The Complete Active Space Self-Consistent Field Method and its Appli- cations in Electronic Structure Calculations, pp. 399–445. John Wiley & Sons, Ltd, 1987. [79] J. B. Collins and A. Streitwieser Jr, “Integrated spatial electron populations in molecules: Application to simple molecules,” Journal of Computational Chemistry, vol. 1, no. 1, pp. 81–87, 1980. [80] A. E. Reed and F. Weinhold, “Natural bond orbital analysis of near-hartree–fock water dimer,” The Journal of Chemical Physics, vol. 
78, no. 6, pp. 4066–4073, 1983. [81] A. E. Reed, R. B. Weinstock, and F. Weinhold, “Natural population analysis,” The Journal of Chemical Physics, vol. 83, no. 2, pp. 735–746, 1985. [82] A. E. Reed and F. Weinhold, “Natural localized molecular orbitals,” The Journal of Chemical Physics, vol. 83, no. 4, pp. 1736–1740, 1985. [83] R. A. Mata and H.-J. Werner, “Local correlation methods with a natural localized molecular orbital basis,” Molecular Physics, vol. 105, no. 19-22, pp. 2753–2761, 2007. [84] T. H. Dunning, “Gaussian basis sets for use in correlated molecular calculations. i. the atoms boron through neon and hydrogen,” The Journal of Chemical Physics, vol. 90, no. 2, pp. 1007–1023, 1989. [85] R. A. Kendall, T. H. Dunning, and R. J. Harrison, “Electron affinities of the first-row atoms revisited. systematic basis sets and wave functions,” The Journal of Chemical Physics, vol. 96, no. 9, pp. 6796–6806, 1992. [86] B. P. Prascher, D. E. Woon, K. A. Peterson, T. H. Dunning, and A. K. Wilson, “Gaussian basis sets for use in correlated molecular calculations. vii. valence, core- valence, and scalar relativistic basis sets for li, be, na, and mg,” Theoretical Chemistry Accounts, vol. 128, no. 1, pp. 69–82, 2011. 121 [87] S. Patel and C. L. Brooks III, “Charmm fluctuating charge force field for proteins: I parameterization and application to bulk organic liquid simulations,” Journal of Computational Chemistry, vol. 25, no. 1, pp. 1–16, 2004. [88] R. Chelli, P. Procacci, R. Righini, and S. Califano, “Electrical response in chemical potential equalization schemes,” The Journal of Chemical Physics, vol. 111, no. 18, pp. 8569–8575, 1999. [89] W. B. Dapp and M. H. Müser, “Redox reactions with empirical potentials: Atom- istic battery discharge simulations,” The Journal of Chemical Physics, vol. 139, no. 6, p. 064106, 2013. [90] M. M. Islam and A. C. T. van Duin, “Reductive decomposition reactions of ethylene carbonate by explicit electron transfer from lithium: An ereaxff molecular dynamics study,” The Journal of Physical Chemistry C, vol. 120, no. 48, pp. 27128–27134, 2016. [91] M. Dittner, J. Müller, H. M. Aktulga, and B. Hartke, “Efficient global optimization of reactive force-field parameters,” Journal of Computational Chemistry, vol. 36, no. 20, pp. 1550–1561, 2015. [92] G. Shchygol, A. Yakovlev, T. Trnka, A. C. T. van Duin, and T. Verstraelen, “Reaxff parameter optimization with monte-carlo and evolutionary algorithms: Guidelines and insights,” Journal of Chemical Theory and Computation, vol. 15, no. 12, pp. 6799–6812, 2019. PMID: 31657217. [93] J. M. Dieterich and B. Hartke, “Ogolem: Global cluster structure optimisation for arbitrary mixtures of flexible molecules. a multiscaling, object-oriented approach,” Molecular Physics, vol. 108, no. 3-4, pp. 279–291, 2010. [94] S. Kylasa, H. Aktulga, and A. Grama, “Puremd-gpu: A reactive molecular dynamics simulation package for gpus,” Journal of Computational Physics, vol. 272, pp. 343–359, 2014. [95] C. Kittel, Introduction to solid state physics. J. Wiley, 8th ed ed., 2004. [96] M. K. Y. Chan, E. L. Shirley, N. K. Karan, M. Balasubramanian, Y. Ren, J. P. Greeley, and T. T. Fister, “Structure of lithium peroxide,” The Journal of Physical Chemistry Letters, vol. 2, no. 19, pp. 2483–2486, 2011. [97] S. Hull, T. Farley, W. Hayes, and M. Hutchings, “The elastic properties of lithium oxide and their variation with temperature,” Journal of Nuclear Materials, vol. 160, no. 2, pp. 125–134, 1988. [98] S.-H. Cheng and C. T. 
Sun, “Size-dependent fracture toughness of nanoscale structures: Crack-tip stress approach in molecular dynamics,” Journal of Nanomechanics and Micromechanics, vol. 4, no. 4, p. A4014001, 2014. [99] W. Tang, E. Sanville, and G. Henkelman, “A grid-based bader analysis algorithm without lattice bias,” Journal of Physics: Condensed Matter, vol. 21, p. 084204, jan 2009. 122 [100] D. Merrill and NVIDIA CORPORATION. CUB, version 1.12.1, see https://github. com/NVIDIA/cub. [101] M. M. Hussain and N. Fujimoto, “Gpu-based parallel multi-objective particle swarm optimization for large swarms and high dimensional problems,” Parallel Computing, vol. 92, p. 102589, 2020. [102] C. A. Navarro, R. Carrasco, R. J. Barrientos, J. A. Riquelme, and R. Vega, “Gpu tensor cores for fast arithmetic reductions,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 1, pp. 72–84, 2021. [103] A. C. T. van Duin, ReaxFF User Manual, 2002 (Accessed December 19, 2019), see https://www.engr.psu.edu/adri/ReaxffManual.aspx. 123