NUMERICAL METHODS FOR THE EVOLUTION OF FIELDS WITH APPLICATIONS TO PLASMAS

By

William A. Sands

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science, and Engineering – Doctor of Philosophy

2022

ABSTRACT

In this dissertation, we present a collection of algorithms for evolving fields in plasmas, with applications to the Vlasov-Maxwell system. Maxwell’s equations are reformulated in terms of the Lorenz and Coulomb gauge conditions to obtain systems involving wave equations. These wave equations are solved using the methods developed in this thesis and are combined with a particle-in-cell method to simulate plasmas. The particle-in-cell methods developed in this work treat particles using several approaches, including the standard Newton-Lorentz equations, as well as a generalized momentum formulation that eliminates the need to compute time derivatives of the field data.

In the first part of this thesis, we develop and extend earlier methods for scalar wave equations, which are used to update the potentials in these formulations. Our developments are based on a class of algorithms known as the MOL^T (Method of Lines Transpose), which combines a dimensional splitting technique with a one-dimensional integral equation method. This results in methods that are unconditionally stable, can address geometry, and are O(N), where N is the number of mesh points. Our work contributes methods for constructing spatial derivatives of the potentials for this class of dimensionally split algorithms, which are used to evolve particles.

The second part of this thesis considers core algorithms used in the MOL^T and the related class of successive convolution methods in the context of high-performance computing environments.
We developed a novel domain decomposition approach that ultimately allows the method to be used on distributed memory computing platforms. Shared memory algorithms were developed using the Kokkos performance portability library, which permits a user to write a single code that can be executed on various computing devices, with the architecture-dependent details being managed by the library. We optimized the predominant loop structures in the code and developed a blocking pattern that prescribes parallelism at multiple levels and is also more cache-friendly. Moreover, the proposed iteration pattern is flexible enough to work with the shared memory features available on GPU systems.

The final part of this thesis presents the particle-in-cell method for the Vlasov-Maxwell system, which leverages the methods for fields and derivatives developed in this work. The proposed methods are applied to several test problems involving beams. Our results are generally encouraging and demonstrate the capabilities of the proposed field solvers in simulating basic plasma phenomena. Additionally, our results serve to validate the generalized momentum formulation, which will be the foundation of our future work.

Copyright by
WILLIAM A. SANDS
2022

To my family, friends, and educators who have always brought out the best in me.

ACKNOWLEDGEMENTS

Many thanks and acknowledgements are in order, as this work was influenced by many people in a collaborative environment. First, I would like to express my gratitude to my advisor, Professor Andrew Christlieb, who supported me as a graduate student over the past few years, encouraged me to take on challenging problems, and who “believed in me” at times when I did not. I am, however, most grateful for the warm friendship he has provided me and for the compassion and emotional support he provided at an incredibly difficult juncture in my personal life.
I am also grateful for my committee members, who took the time to write generous recommendation letters on my behalf during my time as a student. I would also like to thank Dr. John Luginsland for his insightful suggestions and comments regarding the plasma examples presented in this work and for always putting a humorous spin on everything! Discussions with Dr. Eric Wolf were also helpful in this regard, as he was always willing to take time to answer my questions. Additionally, I would like to thank Professor Michael Murillo and Dr. Jeffrey Haack, who provided me with an opportunity to work at Los Alamos National Laboratory in the summer of 2019. This experience significantly contributed to my growth, both as a person and as a scientist.

The Christlieb group, informally SPECTRE (yep, like the James Bond movie), has been my home for the past couple of years. Even though I am changing groups for my next position, I hope to continue working with many of you as you assemble your own research portfolios. I have had the great fortune of being in a position to share my experiences and knowledge with many of you, whose curiosity inspires my own work.

Lastly, but certainly not least, are my parents, who have constantly encouraged me to pursue the things I enjoy and always made themselves available at times when I needed help. We will see if the late night phone calls still happen after grad school... I am also grateful for the valuable friendships forged in the CMSE program. This includes the secretaries, who were always incredibly friendly and helpful; they are the unsung heroes! While many of us are moving on to different stages in our professional lives, we will always be connected by the time we spent together.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF SCHEMES
LIST OF ALGORITHMS

CHAPTER 1  INTRODUCTION
  1.1 Background and Literature Review
  1.2 Mathematical Models
    1.2.1 Vlasov-Maxwell System
    1.2.2 Problem Formulation
      1.2.2.1 Maxwell’s Equations with the Lorenz Gauge
      1.2.2.2 Maxwell’s Equations with the Coulomb Gauge
      1.2.2.3 Formulation for the Particles
  1.3 Non-dimensionalization
    1.3.1 Equations of Motion in E-B Form
    1.3.2 Equations of Motion for the Generalized Hamiltonian
    1.3.3 Maxwell’s Equations in the Lorenz Gauge
    1.3.4 Maxwell’s Equations in the Coulomb Gauge
  1.4 Contributions of This Thesis

CHAPTER 2  NUMERICAL METHODS FOR THE FIELD EQUATIONS
  2.1 Integral Equation Methods and Green’s Functions
  2.2 Semi-discrete Schemes for the Wave Equation
    2.2.1 The BDF Scheme
    2.2.2 Time-centered Scheme
    2.2.3 Splitting Method Used for Multi-dimensional Problems
  2.3 Inverting One-dimensional Operators
    2.3.1 Integral Solution
    2.3.2 Fast Summation Method
    2.3.3 Approximating the Local Integrals
  2.4 Applying Boundary Conditions
    2.4.1 BDF Method
      2.4.1.1 Dirichlet Boundary Conditions
      2.4.1.2 Neumann Boundary Conditions
      2.4.1.3 Periodic Boundary Conditions
      2.4.1.4 Outflow Boundary Conditions
    2.4.2 Centered Method
      2.4.2.1 Dirichlet Boundary Conditions
      2.4.2.2 Neumann Boundary Conditions
      2.4.2.3 Periodic Boundary Conditions
      2.4.2.4 Outflow Boundary Conditions
    2.4.3 Some Remarks for Multi-dimensional Problems
      2.4.3.1 Sweeping Patterns in Multi-dimensional Problems
      2.4.3.2 Periodic Boundary Conditions
      2.4.3.3 Dirichlet Boundary Conditions
      2.4.3.4 Neumann Boundary Conditions
      2.4.3.5 Outflow Boundary Conditions
  2.5 Extensions for High-order Accuracy
    2.5.1 Successive Convolution Methods
    2.5.2 BDF Methods
  2.6 Numerical Examples
    2.6.1 BDF and Time-centered Derivatives in One Spatial Dimension
    2.6.2 Periodic Boundary Conditions
    2.6.3 Dirichlet Boundary Conditions
    2.6.4 Outflow Boundary Conditions
  2.7 Conclusion

CHAPTER 3  PARALLEL ALGORITHMS FOR SUCCESSIVE CONVOLUTION
  3.1 Introduction
  3.2 Description of Numerical Methods
    3.2.1 Connections Among Different PDEs
    3.2.2 Representation of Derivatives
    3.2.3 Comment on Boundary Conditions
    3.2.4 Fast Convolution Algorithm and Spatial Discretization
    3.2.5 Coupling Approximations with Time Integration Methods
  3.3 Nearest-Neighbor Domain Decomposition Algorithm
    3.3.1 Nearest-Neighbor Criterion
    3.3.2 Enforcing Boundary Conditions for 𝜕𝑥
    3.3.3 Enforcing Boundary Conditions for 𝜕𝑥𝑥
    3.3.4 Additional Comments
  3.4 Strategies for Efficient Implementation on Parallel Systems
    3.4.1 Selecting a Shared Memory Programming Model
    3.4.2 Comment on Performance Metrics
    3.4.3 Benchmarking Prototypical Loop Patterns
    3.4.4 Shared Memory Algorithms
    3.4.5 Code Strategies for Domain Decomposition
    3.4.6 Some Remarks
  3.5 Numerical Results
    3.5.1 Description of Test Problems and Convergence Experiments
    3.5.2 Weak Scaling Experiments
    3.5.3 Strong Scaling Experiments
    3.5.4 Effect of CFL
  3.6 Conclusion

CHAPTER 4  DEVELOPING A PARTICLE-IN-CELL METHOD
  4.1 Introduction
  4.2 Moving from Point-particles to Macro-particles
  4.3 Methods for Controlling Divergence Errors
    4.3.1 A Classic Elliptic Projection Method Based on Gauss’ Law
    4.3.2 Elliptic Divergence Cleaning Based on Potentials
    4.3.3 Enforcing the Lorenz Gauge through Lagrange Multipliers
    4.3.4 Enforcing the Coulomb Gauge
    4.3.5 Analytical Maps for Enforcing Charge Conservation
  4.4 Conventional Methods for Pushing Particles
    4.4.1 Leapfrog Time Integration
    4.4.2 The Boris Push
  4.5 Time Integration with Non-separable Hamiltonians
    4.5.1 The Molei Tao Integrator
      4.5.1.1 Approach for Implicit Sources: Particles Lead the Fields
      4.5.1.2 Approaches for a Mixed Advance: Dealing with Explicit and Implicit Source Terms
    4.5.2 The Asymmetrical Euler Method
  4.6 Numerical Examples
    4.6.1 Motion of a Single Charged Particle
    4.6.2 The Cold Two-Stream Instability
    4.6.3 Numerical Heating Study
    4.6.4 The Bennett Equilibrium Pinch
    4.6.5 The Expanding Beam Problem
    4.6.6 A Narrow Beam Problem and the Effect of Particle Count
  4.7 Conclusion

CHAPTER 5  CONCLUSION AND FUTURE DIRECTIONS

APPENDICES
  APPENDIX A  APPENDIX FOR CHAPTER 3
  APPENDIX B  APPENDIX FOR CHAPTER 4

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Architecture and code configuration for the loop experiments conducted on the Intel 18 cluster at Michigan State University’s Institute for Cyber-Enabled Research. To leverage the wide vector registers, we encourage the compiler to use AVX-512 instructions. Hardware prefetching is not used, as initial experiments seemed to indicate that it hindered performance. Initially, we used GCC 8.2.0-2.31.1 as our compiler, but we found through experimentation that using an Intel compiler improved the performance of our application by a factor of ∼2 for this platform.
Authors in [75] experienced similar behavior for their application and attribute this to a difference in auto-vectorization capabilities between compilers. An examination of the source code for loop execution policies in Kokkos reveals that certain decorators, e.g., #pragma ivdep, are present, which help encourage auto-vectorization when Intel compilers are used. We are unsure whether similar hints are provided for GCC.

Table 3.2: Architecture and code configuration for the numerical experiments conducted on the Intel 18 cluster at Michigan State University’s Institute for Cyber-Enabled Research. As with the loop experiments in Section 3.4.3, we encourage the compiler to use AVX-512 instructions and avoid the use of prefetching. All available threads within the node (40 threads/node) were used in the experiments. Each node consists of two Intel Xeon Gold 6148 CPUs and at least 83 GB of memory. We wish to note that hyperthreading is not supported on this system. As mentioned in Section 3.4.3, when hyperthreading is not enabled, Kokkos::AUTO() defaults to a team size of 1. In cases where the base block size did not divide the problem evenly, this parameter was adjusted to ensure that blocks were nearly identical in size. The parameter 𝛽, which does not depend on Δ𝑡, is used in the definition of 𝛼. For details on the range of admissible 𝛽 values, we refer the reader to [56, 44], where this parameter was introduced. Lastly, recall that 𝜖 is the tolerance used in the NN constraints.

Table 4.1: Table of the physical constants (SI units) used in the numerical experiments.

Table 4.2: Summary of the algorithms explored for the two-stream instability example. Both time integration methods considered are second-order.

Table 4.3: Table of the plasma parameters used in the two-stream instability example.

Table 4.4: Table of the plasma parameters used in the numerical heating example.

Table 4.5: Table of the parameters used in the setup for the Bennett pinch problem.

Table 4.6: Table of the parameters used in the setup for the expanding beam problem.

Table 4.7: Table of the parameters used in the setup for the narrow beam problem.

LIST OF FIGURES

Figure 2.1: Spatial refinement study for the solution and its derivative obtained with second-order methods for the periodic test problem in Section 2.6.1. In Figure 2.1a, we plot the ℓ∞ errors for both the numerical solution and the derivative obtained with the time-centered method. Similarly, in Figure 2.1b, we show the same quantities, which are instead computed using the BDF method. The derivative for the time-centered method fails to refine in space, while the BDF derivative is as accurate as the numerical solution itself.

Figure 2.2: Time refinement study for the solution and its derivative obtained with second-order methods for the test problem in Section 2.6.1. In Figure 2.2a, we plot the ℓ∞ errors for both the numerical solution and the derivative obtained with the time-centered method. Similarly, in Figure 2.2b, we show the same quantities, which are instead computed using the BDF method. The derivative for the time-centered method initially converges together with the numerical solution, but at some point begins to diverge. In contrast, we can see that the errors for the derivatives obtained with the BDF method are aligned with those of the solution. Comparing the scales of the plots, we note that the BDF solution is slightly less accurate than the time-centered method.

Figure 2.3: Time refinement study for the solution and its derivative for the two-dimensional periodic example of Section 2.6.2 obtained with second-order methods. In Figure 2.3a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method.
Similarly, in Figure 2.3b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.4: Space refinement of the solution and its derivative for the two-dimensional periodic example of Section 2.6.2 obtained with second-order methods. In Figure 2.4a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.4b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.5: Time refinement study of the solution and its derivatives in the two-dimensional Dirichlet problem of Section 2.6.3 obtained with second-order methods. In Figure 2.5a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.5b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.6: Space refinement of the solution and its derivatives in the two-dimensional Dirichlet problem of Section 2.6.3 obtained with second-order methods. In Figure 2.6a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.6b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.7: Here we show the reflection observed between the implicit and explicit forms of outflow boundary conditions for the second-order BDF method in a one-dimensional outflow problem. We run with the same Gaussian initial condition until the final time 𝑇 = 4, at which point the wave data should no longer be in the simulation. What is left is the reflection at the artificial boundaries of the domain. The plot shown on the left shows the results obtained with the proposed implicit form of the outflow weights developed for the BDF-2 method, while the plot on the right uses the explicit form of the weights. We find that the explicit form of the weights is more effective at suppressing the spurious reflections at the artificial boundaries.

Figure 2.8: Time refinement study of the solution and its derivatives in the two-dimensional outflow problem of Section 2.6.4 obtained with second-order methods. In Figure 2.8a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.8b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.9: A comparison of the temporal refinement properties for the one-dimensional implicit methods. The weights for the time-centered method, shown in Figure 2.9a, are taken from the paper [34]. We compare this to the proposed implicit approach to outflow, shown on the right, in Figure 2.9b.

Figure 2.10: Space refinement of the solution and its derivatives in the two-dimensional outflow problem of Section 2.6.4 obtained with second-order methods. In Figure 2.10a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.10b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.11: A comparison of the spatial refinement properties for the one-dimensional implicit methods. The weights for the time-centered method, shown in Figure 2.11a, are taken from the paper [34].
We compare this to the proposed implicit approach to outflow, shown on the right, in Figure 2.11b.

Figure 3.1: Stencils used to build the six-point quadrature [56, 44].

Figure 3.2: A six-point WENO quadrature stencil in 2-D.

Figure 3.3: Fast convolution communication stencil in 2-D based on N-Ns.

Figure 3.4: Heterogeneous platform targeted by Kokkos [46].

Figure 3.5: Plots comparing the performance of different parallel execution policies for the pattern in Scheme 3.1 using test cases in 2-D (left) and 3-D (right). Tests were conducted on a single node that consists of 40 cores using the code configuration outlined in Table 3.1. Each group consists of three plots, whose difference is the value selected for the team size. We note that hyperthreading is not enabled on our systems, so Kokkos::AUTO() defaults to a team size of 1. In each pane, we use “best” to refer to the best run for that configuration across different team sizes. Tile experiments used block sizes of 256² in 2-D problems and 32³ in 3-D. We observe that vectorized policies are generally faster than non-vectorized policies. Interestingly, among blocked/tiled policies, those that construct subviews appear to be faster than those that skip the subview construction, despite the additional work. As the problem size increases, the performance of blocked policies improves substantially. This can be attributed to the large number of idle thread teams when the problem size does not produce enough blocks. In such cases, increasing the size of the team does offer an improvement, as it reduces the number of idle thread teams. For non-blocked policies, we observe that increasing the team size generally results in minimal, if any, improvement in performance. In all cases, the use of blocking provides a more consistent update rate when enough work is introduced.

Figure 3.6: Task charts for the domain-decomposition algorithm under fixed (left) and adaptive (right) time stepping rules. The work overlap regions are indicated, laterally, using gray boxes. The work inside the overlap regions should be sufficiently large to hide the communications occurring in the background. To clarify, the overlap in calculations for 𝐼∗ is achieved by changing the sweeping direction during an exchange of the boundary data. As indicated in the adaptive task chart, the reduction over the “lagged” wave speed data can be performed in the background while building the various operators. Note the use of MPI_WAIT prior to performing the integrator step. This is done to prevent certain overwrite issues during the local reductions in the subsequent integrator step.

Figure 3.7: Convergence results for each of the 2-D example problems. Results were obtained using 9 MPI ranks with 40 threads/node. Also included is a first-order reference line (solid black). Our convergence results indicate first-order accuracy resulting from the low-order temporal discretization. The final reported 𝐿∞ errors for each of the applications, on a grid containing 5277² total zones, are 2.874 × 10⁻³ (advection), 4.010 × 10⁻⁴ (diffusion), and 2.674 × 10⁻⁴ (H-J).

Figure 3.8: Results on the N-N method for the linear advection equation using a fixed mesh with 5377² total DOF and a variable CFL number. In each case, we used the fastest time-to-solution collected from repeating each configuration a total of 20 times. This particular data was collected using an older version of the code, compiled with GCC, which did not use the blocking approach.
For larger block sizes, increasing the CFL has a noticeable improvement on the run time, but as the block sizes become smaller, the gains diminish. For example, if 9 MPI ranks are used, improvements are observed as long as CFL ≤ 4. However, when CFL = 5, the run times begin to increase, with a significant decrease in efficiency. As the blocks become smaller, Δ𝑡 needs to be adjusted (decreased) so that the support of the non-local convolution data does not extend beyond N-Ns.

Figure 4.1: Trajectories for the single particle test, which are obtained using the Boris method (Figure 4.1a) and the second-order integrator by Molei Tao, as presented in [47] (Figure 4.1b). Both methods produce identical trajectories under identical experimental conditions. The particles rotate about the magnetic field, which points in the 𝑧-direction.

Figure 4.2: Self-refinement for the single particle test using the Boris method (Figure 4.2a) and the second-order integrator by Molei Tao, as presented in [47] (Figure 4.2b). Second-order accuracy is achieved by both methods, but the ℓ∞ errors for the Boris method are nearly a factor of 2 larger than those produced by the Molei Tao method. While we have not presented timing results, it is worth noting that the run times for the Boris method were considerably faster than those of the Molei Tao method due to the latter’s additional “stages”. The final error measurements taken from the refinement study are 1.4728 × 10⁻⁷ (Boris) and 6.5592 × 10⁻⁸ (Tao).

Figure 4.3: Initial configuration of electrons used in the two-stream experiments.

Figure 4.4: A comparison of the Molei Tao particle integrator with and without averaging for the two-stream example with the Poisson model. Over time, the pairs of phase space data, including the associated fields, can grow apart, leading to vastly different potentials that kick particles off their smooth trajectories. Averaging appears to be fairly effective at controlling this behavior.

Figure 4.5: We present plots of the electrons in phase space obtained using the Poisson model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. The FFT is used to compute the scalar potentials in both methods. At later times, despite improvements from “averaging” the particle data, the Tao method causes particles to move off the stream lines. This phenomenon is a numerical artifact that is not present in the leapfrog method.

Figure 4.6: Time refinement of a tracer particle’s position for the two-stream instability using the Poisson model for the potential with leapfrog (a) and the Molei Tao integrator with averaging (b). We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. Both methods converge to second-order accuracy, with leapfrog generally displaying a larger absolute error than the Tao method. The exception to this is the smallest Δ𝑡 used in the leapfrog experiments.

Figure 4.7: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator.
The second-order (diffusive) BDF scheme (BDF-2) is used to compute the scalar potentials and their derivatives in for both methods. Unlike the results obtained with the Poisson model, which used the FFT as the field solver (shown in Figure 4.5), the particles at the later times in the Molei Tao method seem to stay attached to their trajectories. . . . . . . . . . . . . . . . . . . . . . 165 Figure 4.8: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second- order integrator based on Molei Tao and applies averaging. We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials are evolved using the second-order central scheme (central- 2), while the derivatives are computed at each step with the second-order BDF scheme (BDF-2). In the bottom row, which uses the Molei Tao method, we obtain results that are similar to the BDF-2 method (see 4.7) in the sense that particles do not seem to jump off of their trajectories. . . . . . . . . . . . . 166 xvi Figure 4.9: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second- order integrator based on Molei Tao and applies averaging. We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials are evolved using the second-order central scheme (central- 2), while the derivatives are computed at each step with the fourth-order BDF scheme (BDF-4). As with the other wave solver methods, the particles in the Molei Tao experiments seem to stay attached to their smooth trajectories, even at the later times. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
167
Figure 4.10: Time refinement of a tracer particle’s position for the two-stream instability. For the particle push, we consider both leapfrog and the Molei Tao method with averaging, in combination with different methods for the fields and their derivatives. We selected 𝜔 = 500 as the value of the coupling parameter in all of the Molei Tao integrator experiments. Each of the methods converges to second-order accuracy, with the error in the Tao method being smaller than in leapfrog. . . . . 168
Figure 4.11: Initial electron data in phase space used for the numerical heating tests. . . . . 170
Figure 4.12: We present results from the numerical heating tests based on the Poisson model. Plots show the average electron temperature as a function of the number of angular plasma periods using leapfrog (left) and the second-order integrator by Molei Tao with averaging (right). Fields and their derivatives are obtained using the FFT. . . . . 171
Figure 4.13: We display results from the numerical heating tests that use the wave model for the potentials. Plots show the average electron temperature as a function of the number of angular plasma periods using leapfrog (top) and the second-order integrator by Molei Tao with averaging (bottom). We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials and derivatives are computed with the scheme indicated in the individual captions. . . . . 172
Figure 4.14: Initialization of the steady-state toroidal magnetic field in the Bennett problem computed with the BDF-2 wave solver after 1000 steps against a fixed current density. The derivatives of the vector potential 𝐴 (3) are also obtained with the BDF-2 method. . . . .
178
Figure A.1: Plots comparing the performance of different parallel execution policies for the pattern in Scheme 3.2 using test cases in 2-D (top) and 3-D (bottom). Tests were conducted on a single node that consists of 40 cores using the code configuration outlined in 3.1. Each group consists of three plots, which differ in the value selected for the team size. We note that hyperthreading is not enabled on our systems, so Kokkos::AUTO() defaults to a team size of 1. Tile experiments used a block size of 256², in 2-D problems, and 32³ in 3-D. A tiled MDRange was not implemented in the 2-D cases because the block size was larger than some of the problems. The results generally agree with those presented in 3.5. For smaller problem sizes, using the non-portable range_policy with OpenMP simd directives is clearly superior to the other policies. However, when enough work is available, we see that blocked policies with subviews and vectorization generally become the fastest. In both cases, MDRange seems to have fairly good performance. Tiling, when used with MDRange in the 3-D cases, seems to be slower than plain MDRange. Again, we see that the use of blocking provides a more consistent update rate if enough work is available. . . . . 199
Figure A.2: Weak scaling results, for each of the applications, using up to 49 nodes (1960 cores). For each of the applications, we have provided the update rate and weak scaling efficiency computed via the fastest time/step (top) and average time/step (bottom). Results for the advection and diffusion applications are quite similar, despite the use of different operators. The results for the H-J application seem to indicate that no major performance penalties are incurred by use of the adaptive time stepping method. Scalability appears to be excellent, up to 16 nodes (640 cores), then begins to decline.
While some loss in performance due to network effects is to be expected, this loss appears to be larger than was previously observed. The nodes used in the runs were not contiguous, which hints at a possible sensitivity to data locality. . . . . 200
Figure A.3: Weak scaling results obtained with contiguous allocations of up to 9 nodes (360 cores) for each of the applications. For comparison, the same information is displayed as in A.2. Data from the fastest trials indicates nearly perfect weak scaling, across all applications, up to 9 nodes, with a consistent update rate between 2–4 × 10⁸ DOF/node/s. A comparison of the fastest timings between the large and small runs supports our claim that data proximity is crucial to achieving the peak performance of the code. Note that the error bars are generally smaller than those in A.2. This indicates that the timing data collected from individual trials exhibits less overall variation. . . . . 201
Figure A.4: Strong scaling results for each of the applications obtained on contiguous allocations of up to 9 nodes (360 cores). Displayed among each of the applications are the update rate and strong scaling efficiency computed from the fastest time/step (top) and average time/step (bottom). This method does not contain a substantial amount of work, so we do not expect good performance for smaller base problem sizes, as the work per node becomes insufficient to hide the cost of communication. Larger base problem sizes, which introduce more work, are capable of saturating the resources, but the work per node eventually becomes insufficient as more nodes are added. Moreover, threads become idle when the work per node fails to introduce enough blocks. . . . . 202
Figure B.1: The state of the Bennett problem after 50 thermal crossing times using the Boris method with the steady-state Poisson model for the fields.
The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii, which uses a total of 50 bins. The plots on the bottom are cross-sections of the steady-state magnetic field 𝐵 (𝜃) , which are plotted against the analytical field. We see good agreement in the magnetic field with its analytical solution, which is enough to confine most of the particles within the beam. . . . . 206
Figure B.2: The state of the Bennett problem after 45 thermal crossing times obtained with the Molei Tao method (𝜔 = 500) using the steady-state Poisson model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii, which uses a total of 50 bins. The plots on the bottom are slices of the steady-state magnetic field 𝐵 (𝜃) , which is plotted against the analytical field. We observe a significant drift in the numerical field away from its steady-state that results in a loss of confinement of the particles to the beam. . . . . 207
Figure B.3: The state of the Bennett problem after 35 thermal crossing times using the Boris method with the wave model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii. Again, the beam radius is indicated as a reference. A total of 50 bins are used in the plot. The plots on the bottom are slices of the steady-state magnetic field 𝐵 (𝜃) , which is plotted against the analytical field. We see good agreement in the magnetic field with its analytical solution, which is enough to confine most of the particles within the beam. . . . .
. . . . 208
Figure B.4: A comparison of the time derivatives of the vector potentials after 1000 particle crossings for the expanding beam problem. This particular data was obtained using the Lorenz gauge formulation for the fields with the Boris method for particles. In the top row, the vector potentials are updated with the time-centered approach, which is purely dispersive and generates noisy time derivatives. The bottom row performs the same experiment, but uses the BDF method, which is purely dissipative. The differences in the quality of the results are quite apparent. This was discussed in [42], but results illustrating the severity of the dispersive effects were not shown. . . . . 209
Figure B.5: We plot the expanding beam after 1000 particle crossings obtained with the Lorenz gauge formulation that combines the Boris method with the BDF-2 field solver. In Figure B.5a, we plot the beam and the corresponding charge density. We observe some oscillations along the top edge of the beam, which also appear in the charge density. In Figure B.5b, we observe an increase in the size of violations of the Lorenz gauge condition, which indicates that the method will eventually fail. We plot the Lorenz gauge error as a surface in Figure B.5c using data from the final step. The most significant violations occur near the injection region and along the boundary where particles are removed. . . . . 210
Figure B.6: Here we show the potentials (and their derivatives) for the expanding beam problem after 1000 particle crossings. This data was obtained using the Lorenz gauge formulation which combines the Boris method with the BDF-2 wave solver. The first row plots the scalar potential 𝜓 and its partial derivatives.
Similarly, in the second row, we plot the derivatives of the vector potentials 𝐴 (1) and 𝐴 (2) , which are used to construct the magnetic field 𝐵 (3) (shown in the right-most plot). Note that the time derivative data for the vector potentials were plotted in Figure B.4b, so we exclude them here. . . . . 211
Figure B.7: We show the expanding beam after 2000 particle crossings obtained with the Coulomb gauge formulation, which uses the AEM for time stepping without a cleaning method. In Figure B.7a, we plot the beam and the corresponding charge density, which show visible striations and oscillations along the edge of the beam due to violations in the gauge condition. The growth of the errors in the gauge condition is reflected in Figure B.7b, which exhibits unbounded growth. The surface plot of the gauge condition at 2000 crossings shows large errors, especially near the injection region and along the boundary where particles are removed. . . . . 212
Figure B.8: We show the expanding beam after 3000 particle crossings obtained with the Coulomb gauge formulation that uses the AEM for time stepping with elliptic divergence cleaning. In Figure B.8a, we plot the beam and the corresponding charge density. The elliptic divergence cleaning seems effective at controlling the errors in the gauge condition, compared to the results shown in Figure B.7, which do not apply the cleaning method. The fluctuations of the gauge error away from the boundaries are now in the sixth decimal place, which is a notable improvement over the result shown in Figure B.7c. . . . . 213
Figure B.9: Here we show the potentials (and their derivatives) for the expanding beam problem after 3000 particle crossings. This data was obtained using the Coulomb gauge formulation which combines the AEM for time integration with the BDF-2 wave solver. Elliptic divergence cleaning was applied to the vector potential.
In each row, we plot a field quantity and its corresponding derivatives. The top row shows the scalar potential 𝜓 and its derivative, which are computed with finite differences. The middle and last rows show the vector potential components 𝐴 (1) and 𝐴 (2) , respectively, along with their derivatives, which are computed with the BDF method. . . . . 214
Figure B.10: We show the expanding beam after 3000 particle crossings obtained with the Lorenz gauge formulation that uses the AEM for time stepping along with a first-order BDF solver. No divergence cleaning is applied. In Figure B.10a, we plot the beam and the corresponding charge density. The beam surprisingly remains intact after many particle crossings without the use of a cleaning method. The fluctuations of the gauge error over time are quite small. We do not observe the growth in the gauge error shown earlier in Figure B.5b for the Boris method. . . . . 215
Figure B.11: Here we show the potentials (and their derivatives) for the expanding beam problem after 3000 particle crossings. This data was obtained using the Lorenz gauge formulation which combines the AEM for time integration with the BDF-1 wave solver. A divergence cleaning method is not used in this example. In each row, we plot a field quantity and its corresponding derivatives. The top row shows the scalar potential 𝜓 and its derivative, while the middle and last rows show the vector potential components 𝐴 (1) and 𝐴 (2) , respectively, along with their derivatives. . . . . 216
Figure B.12: Error in Gauss’ law for the Coulomb gauge formulation of the expanding beam problem which applies the AEM for time integration and uses elliptic divergence cleaning. On the left, we show the time evolution of an “averaged” residual in Gauss’ law. The plot on the right is a surface of the error in Gauss’ law taken after 3000 particle crossings.
Even though cleaning is used to control violations in the gauge condition, whose corresponding surface was shown in Figure B.8c, the metric based on point-wise violations in Gauss’ law seems to indicate a significant loss of conservation. On the other hand, the plot on the left implies that Gauss’ law is satisfied in an integral sense. . . . . 217
Figure B.13: Error in Gauss’ law for the Coulomb gauge formulation of the expanding beam problem which applies the AEM for time integration. Elliptic divergence cleaning is not used here. On the left, we show the time evolution of the “averaged” residual in Gauss’ law. The plot on the right is a surface of the error in Gauss’ law taken after 3000 particle crossings. The point-wise violations in Gauss’ law are much larger than we observed in Figure B.12b. Similarly, the time evolution of the average defect in Gauss’ law is roughly three orders of magnitude larger than in Figure B.12a. . . . . 218
Figure B.14: We show the narrow beam after 5 particle crossings obtained with the Coulomb gauge formulation that uses the AEM for time stepping with elliptic divergence cleaning. We injected 400 particles per time step. In Figure B.14a, we plot the beam and the corresponding charge density. The plot of the particles appears more solid due to the increased injection rate. The density itself is quite smooth due to the use of additional particles. As before, we see there are violations in the gauge condition along the boundaries due to the injection and removal of particles there. Additionally, the gauge error appears to be quite small away from the boundaries due to the increased smoothness offered by the use of additional particles. . . . . 219
Figure B.15: Here we show the potentials, as well as their derivatives, for the narrow beam problem after 5 particle crossings using an injection rate of 400 particles per step.
We used the Coulomb gauge formulation which combines the AEM for time integration with the BDF-2 wave solver for the fields. Elliptic divergence cleaning was applied to the vector potential. In each row, we plot a field quantity and its corresponding spatial derivatives. The top row shows the scalar potential 𝜓 and its derivative, which are computed with finite differences. The middle and last rows show the vector potential components 𝐴 (1) and 𝐴 (2) , respectively, along with their derivatives, which are computed with the BDF method. The structure of the fields and their derivatives is quite smooth here. . . . . 220
Figure B.16: Error in Gauss’ law for the narrow beam problem that uses an injection rate of 400 particles per step. On the left, we show the time evolution of an “averaged” residual in Gauss’ law. There is a jump in the “bulk” error for Gauss’ law at step 1000, which coincides with the beam’s first crossing, before the error stabilizes. The plot on the right is a surface of the error in Gauss’ law taken after 5 particle crossings. Even though cleaning is used to control violations in the gauge condition, whose corresponding surface was shown in Figure B.14c, the metric based on point-wise violations in Gauss’ law seems to indicate a loss of charge conservation similar to the previous example. . . . . 221
Figure B.17: We show the derivatives used to calculate the divergence of the electric field for the narrow beam problem at 5 particle crossings. We used an injection rate of 400 particles. Derivatives are computed with second-order finite differences. We note the appearance of small oscillations in the 𝑥 derivative, which is shown on the left. The plot to the right, which corresponds to the 𝑦-derivative, is largely uniform on the interior of the beam, but is sharp along the edge of the beam. . . . .
221
Figure B.18: We show the effect of the particle injection rate on the gauge error for the narrow beam problem at 5 particle crossings. In each row, we plot the error in the Coulomb gauge as a surface (left column) and as a slice in 𝑥 along the middle of the beam (right column). The rows correspond to injection rates of 100, 200, and 400 particles per time step, respectively, from top to bottom. We can see that the increase in particle count reduces the gauge error on the interior of the domain due to the smoothing effect on the particle data. . . . . 222
LIST OF SCHEMES
Scheme 3.1: Looping pattern used in the construction of local integrals, convolutions, and boundary steps. . . . . 103
Scheme 3.2: Another looping pattern used to build “resolvent” operators. With some modifications, this same pattern could be used for the integrator step. In several cases, this iteration pattern may require reading entries in memory that are separated by large distances (i.e., the data is strided). . . . . 103
Scheme A.1: An example of a coarse-grained parallel nested loop structure. . . . . 196
Scheme A.2: Kokkos kernel for the fast-convolution algorithm. . . . . 196
LIST OF ALGORITHMS
Algorithm 3.1: Distributed adaptive time stepping rule. . . . . 112
CHAPTER 1
INTRODUCTION
Plasmas are ubiquitous in nature. In fact, it is well known that they comprise a majority of the visible universe, appearing in settings that range from the electronic devices we regularly use as part of our daily lives to the sun, which sustains life on Earth. Consequently, the properties of plasmas span an enormous range of space and time scales, often many orders of magnitude in size.
These multi-scale features pose a significant challenge for model developers and computational scientists, as many of the models used to interrogate such systems are computationally intractable, even with the existing capabilities offered by supercomputers. This necessitates the development of new algorithms for plasmas which can successfully address the issue of scales and fit within the design constraints posed by new computational hardware. Mathematically, plasmas are systems of charged particles, which can be conveniently described in terms of probability distribution functions that characterize the probability of finding a particular charged particle at some point in phase space. The background electromagnetic fields, which are described by Maxwell’s equations, adapt to the motion of these charges, inducing changes in the probability distribution functions. This results in a complex system of partial differential equations known as the Vlasov-Maxwell system. A defining characteristic of Vlasov plasmas is that effects due to collisions arise only through the electromagnetic fields, so they are often called “collisionless” plasmas in an effort to distinguish them from more complex collision models such as the Boltzmann equation. Building on the concept of a plasma, the next section provides a review of the literature concerning techniques for evolving the electromagnetic fields in the context of particle-in-cell methods for the Vlasov-Maxwell system. Then, we discuss the specific details concerning the mathematical models adopted in this work, as well as the non-dimensionalization process. Lastly, we provide an overview of the work presented in this thesis, delineating it from some of the earlier work.
1.1 Background and Literature Review
In this section, we provide a review of the literature pertaining to algorithms employed in the simulation of Vlasov-type plasmas.
An emphasis is placed on the particle-in-cell (PIC) method, which is the plasma simulation technique adopted in this thesis. A comprehensive review of the literature for PIC methods up to 2005 can be found in the review article [1]. Much of the work highlighted by this reference is now largely considered standard, so, instead, we discuss articles from more recent years that are more aligned with the developments in this thesis. PIC methods [2, 3] have been extensively applied in the numerical simulation of plasmas and are an important class of techniques used in the design of experimental devices, including lasers, pulsed-power systems, and particle accelerators, among others. The earliest work involving these methods began in the 1950s and 1960s, and it remains an active area of research to this day. At its core, a PIC method combines an Eulerian approach for the electromagnetic fields with a Lagrangian method that evolves collections of samples taken from general distribution functions in phase space. In other words, the fields are evolved using a mesh, while the distribution function is evolved using particles whose equations of motion are set according to the characteristics of the PDEs for the distribution functions. Lastly, to combine the two approaches, an interpolation method is used to map data between the mesh and the particles. These maps are typically taken to be linear splines, with the multi-dimensional interpolation performed by taking tensor products of one-dimensional interpolants. While PIC methods are capable of simulating complex nonlinear processes in plasmas, it is worth mentioning their weaknesses. Firstly, a natural consequence of the statistical element in PIC is that “bulk” processes in plasmas will be well represented, while the tails of the distribution will be largely under-resolved, even with good sampling methods.
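To make the interpolation step concrete, the following minimal sketch (function names are our own illustration, not taken from a particular PIC code) computes the linear-spline (cloud-in-cell) weights in one dimension and gathers a mesh quantity to a particle position in two dimensions via a tensor product of the 1-D weights; the same weights would be reused when depositing particle charge back onto the mesh:

```python
import numpy as np

def cic_weights(xp, dx):
    """Cloud-in-cell (linear spline) weights for a particle at xp on a
    uniform 1-D grid with spacing dx. Returns the left grid index and
    the weights assigned to the (left, right) pair of grid points."""
    i = int(np.floor(xp / dx))   # index of the grid point left of xp
    frac = xp / dx - i           # normalized distance from that point
    return i, (1.0 - frac, frac)

def gather_2d(field, xp, yp, dx, dy):
    """Interpolate a mesh quantity to a particle position using the
    tensor product of the 1-D linear-spline weights."""
    i, (wx0, wx1) = cic_weights(xp, dx)
    j, (wy0, wy1) = cic_weights(yp, dy)
    return (wx0 * wy0 * field[i, j]     + wx1 * wy0 * field[i + 1, j]
          + wx0 * wy1 * field[i, j + 1] + wx1 * wy1 * field[i + 1, j + 1])
```

Because the splines are linear, this gather reproduces any linear function of position exactly, which is a convenient sanity check for an implementation.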
Another well-known consequence of the statistical element is the large number of simulation particles that are required in more systematic refinement studies due to certain numerical fluctuations. This consequence is critical to the computational efficiency of the method and, in most applications, warrants the use of supercomputers to perform the simulations. The goal of this work is to supply new algorithms which aim to improve the computational efficiency of existing simulation tools, such as PIC, in the numerical investigation of plasmas. Most PIC methods evolve the simulation particles explicitly using some form of leapfrog time integration along with the Boris rotation method [4] in the case of electromagnetic plasmas. The exploration of semi- and fully-implicit particle treatments in PIC methods began in the 1980s [5, 6, 7]. These approaches suffered from a number of unattractive features, including issues with numerical heating and cooling [8], slow nonlinear convergence, and inconsistencies between the fluid moments and particle data. Consequently, these approaches were abandoned in favor of explicit formulations which were more aligned with the computational hardware available at the time. In recent years, implicit PIC methods have seen a resurgence, beginning with [9], which addressed many of these issues in the case of the Vlasov-Poisson system. Nonlinear convergence and self-consistency were enforced using a Jacobian-free Newton-Krylov method [10] with a fluid preconditioner that enforced the continuity equation. The resulting solver demonstrated remarkable savings compared to explicit methods because it eliminated the need to resolve the charge separation in the plasma, allowing for a coarser mesh to be used. These techniques were later extended to curved geometries through the use of smooth grid mappings [11].
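For reference, the Boris rotation mentioned above splits the Lorentz force update into a half acceleration by E, a rotation about B, and a second half acceleration. A minimal non-relativistic sketch (variable names are our own, not drawn from a specific code) is:

```python
import numpy as np

def boris_push(v, E, B, q, m, dt):
    """One velocity update of the (non-relativistic) Boris scheme:
    half electric kick, rotation about B, second half electric kick."""
    qmdt2 = q * dt / (2.0 * m)
    v_minus = v + qmdt2 * E                  # first half acceleration
    t = qmdt2 * B                            # rotation vector
    s = 2.0 * t / (1.0 + np.dot(t, t))       # rescaled rotation vector
    v_prime = v_minus + np.cross(v_minus, t)
    v_plus = v_minus + np.cross(v_prime, s)  # completes rotation about B
    return v_plus + qmdt2 * E                # second half acceleration
```

A key property of this update, and a reason for its popularity, is that the rotation step preserves the particle's speed exactly when E = 0, so the magnetic field does no work on the particle at the discrete level.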
Recently, an effort has been made to extend these techniques to the full Vlasov-Maxwell system [12] to avoid the highly restrictive CFL condition posed by the gyrofrequency. While these contributions are significant in their own right, there are many opportunities for improvement. Many of these methods are fundamentally stuck at second-order accuracy in both space and time and may greatly benefit from more accurate field solvers. Additionally, applications of interest involve complex geometries which introduce additional complications with stability and are often poorly resolved with uniform Cartesian meshes. Lastly, there is the concern of scalability. Krylov subspace methods pose a massive challenge for scalability on large machines due to the various collective operations used in the algorithms. It seems that the scalability of these methods could be significantly improved if similar implicit methods could be developed which eliminate the inner Krylov solve altogether, though this is beyond the scope of the present work. A challenge associated with developing any solver for Maxwell’s equations is the enforcement of the involutions for the fields, namely ∇ · E = 𝜌/𝜖0 and ∇ · B = 0. In the case of a structured Cartesian grid, Maxwell’s equations can be discretized using a staggered grid technique introduced by Yee [13]. The use of a staggered mesh yields a structure-preserving discrete analogue of the integral form of Maxwell’s equations that automatically enforces the involutions for E and B without additional treatment. This is the basis of the well-known finite-difference time-domain method (FD-TD) [14]. While the staggering in both space and time used in the FD-TD method is second-order accurate, there exists a fourth-order extension of the spatial discretization that was developed as a way of dealing with certain dispersion errors known as Cerenkov radiation [15].
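To make the space-time staggering concrete, here is a minimal 1-D vacuum sketch of the Yee leapfrog update in normalized units (c = 1), with the electric field at integer grid points and times, and the magnetic field offset by half a cell and half a step; the function name and the fixed-endpoint boundary treatment are our own illustration, not a prescription from this thesis:

```python
import numpy as np

def yee_1d(ez, hy, courant):
    """One leapfrog step of the 1-D Yee scheme in normalized units (c = 1).
    ez (length N) lives at integer points/times; hy (length N-1) lives at
    half points/times. courant = c*dt/dx."""
    # advance hy from t - dt/2 to t + dt/2 using ez at time t
    hy -= courant * np.diff(ez)
    # advance ez from t to t + dt using the updated hy; the endpoint
    # values of ez are held fixed (e.g., conducting walls with ez = 0)
    ez[1:-1] -= courant * np.diff(hy)
    return ez, hy
```

A well-known property of this scheme in one dimension is that, at the "magic" time step c*dt/dx = 1, the update propagates a rightward-moving pulse exactly, shifting it one cell per step.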
While the use of a staggered mesh with finite differences is quite effective for Cartesian grids, issues arise in problems defined with geometry, such as curved surfaces, in which one resorts to stair-step boundaries [16]. To mitigate the effect of stair-step boundaries in explicit methods, the mesh resolution is increased, resulting in a highly restrictive time step to meet the CFL stability criterion. Recently, an approach for dealing with geometry in the Yee scheme, which avoids the stair-stepping along the boundary, was developed for two-dimensional problems [17]. The grid cells along the boundary in the method are replaced with cut-cells which use generalized finite-difference updates that account for different intersections with the boundary. While this scheme was shown to be energy conserving and, perhaps more remarkably, preserved the CFL condition of the Yee scheme, the convergence rates demonstrated by the method are suboptimal. The theory in this article established half-order accuracy, yet first-order accuracy was demonstrated in numerical experiments. While many electromagnetic PIC methods solve Maxwell’s equations on Cartesian meshes through the FD-TD method, other methods have been developed specifically for addressing issues posed by geometries through the use of unstructured meshes. In [18], a finite-element method (FEM) was coupled with PIC to solve the Darwin model in which the fields move significantly faster than the plasma. Explicit finite-volume methods (FVM), which can address geometry, were considered in [19], which also developed divergence cleaning methods suitable for applications to PIC simulations of the Vlasov-Maxwell system. Discontinuous Galerkin (DG) methods have also been used to develop high-order PIC methods with elliptic [20] and hyperbolic [21] divergence cleaning methods being employed to enforce Gauss’ law.
Other work in this area has explored more generalized FEM discretizations to enforce charge conservation on arbitrary grids [22], as well as certain structure-preserving discretizations [23, 24, 25]. In particular, [24, 25, 26] employed so-called Whitney basis sets taken from the de Rham sequences to automatically enforce involutions for the electric and magnetic fields. Despite the advantages afforded by the use of such bases, these implicit field solvers rely on the solutions of large linear systems, which need to be solved using GMRES [27]. Even with preconditioning, such methods can be slow, and scalability is difficult to achieve. In the case of explicit solvers, such as FVMs and DG methods, other challenges exist. The basic FVM, without additional reconstructions, is first-order in space. These methods can, of course, be improved to second-order accuracy by performing reconstructions based on a collection of cells. Beyond second-order accuracy, the reconstruction process becomes quite complicated due to the size of the interpolation stencils. DG methods, on the other hand, store cell-wise expansions in a basis, which eliminates the issue encountered in the FVM, typically at the cost of a highly restrictive condition on the size of a time step. Additionally, the significant amount of local work in DG methods makes them appealing for newer hardware, yet the restriction on the time step size is often left unaddressed. Notable exceptions to this restriction exist for the two-way wave equation, including staggered formulations [28] and Hermite methods [29], which allow for a much larger time step. It will be interesting to see the performance of such methods in plasma problems, especially in problems with intricate geometric features. Other methods for Maxwell’s equations have been developed with unconditional stability for the time discretization.
The first of these methods is the ADI-FDTD method [30, 31], which combined an ADI approach with a two-stage splitting to achieve an unconditionally stable solver. Time stepping in these methods was later generalized using a Crank-Nicolson splitting, and several techniques for enhancing the temporal accuracy were proposed [32]. Of particular significance to this work are methods based on successive convolution, also known as the method-of-lines-transpose (MOL𝑇 ) [33, 34]. These methods are unconditionally stable in time and can be obtained by reversing the typical order in which discretization is performed. By first discretizing in time, one can solve a resulting boundary-value problem by formally inverting the differential operator using a Green’s function in conjunction with a fast summation method. Mesh-free methods [35, 36] for Maxwell’s equations have also been explored in the Darwin limit under the Coulomb gauge. These formulations are in some ways similar to PIC in that they evolve particles with shapes, except no mesh is used in the simulation. The elliptic equations are solved using a Green’s function on an unbounded domain, and a fast summation method is used for efficiency. Green’s function methods have also been used to develop asymptotic preserving schemes [37]. This article utilized a boundary integral formulation with a multi-dimensional Green’s function to obtain a method that recovers the Darwin limit under appropriate conditions. The methods considered in this work incorporate dimensional splittings, which results in algorithms with unconditional stability, high-order accuracy [38], parallel scalability [39], and geometric flexibility [40]. In [41], a PIC method was developed based on the MOL𝑇 discretization combined with a staggered grid formulation of the Vlasov-Maxwell system. In this approach, the field equations were cast in terms of the Lorenz gauge, producing wave equations for the scalar and vector potentials.
Additionally, the wave equation for the scalar potential was replaced with an elliptic equation to control errors in the gauge condition. Since the particle equations were written in terms of E and B, additional finite-difference derivatives were required to compute the electric and magnetic fields from the potentials. While the contributions of this work differ significantly from the methods in [41], we consider the latter work as a baseline against which to focus our efforts. A fairly detailed outline of the contributions of this thesis can be found in section 1.4 at the end of this chapter.

1.2 Mathematical Models

In this section, we provide relevant details of the mathematical models employed for the plasma applications considered in this work. We begin with a discussion of the Vlasov-Maxwell system, which is the most general model considered in this work, in section 1.2.1. Then, once we have finished the discussion of the model, we discuss the mathematical formulation in section 1.2.2, which expresses Maxwell's equations in terms of potentials through the use of gauge conditions. More specifically, we discuss two formulations: one which employs the Lorenz gauge, and another which uses the Coulomb gauge. With the exception of the expanding beam problems in section 4.6, the numerical examples shall exclusively work with the Lorenz gauge. Since we have adopted a gauge formulation of Maxwell's equations, we also introduce the equations of motion for the particles in section 1.2.2, which are written entirely in terms of the potentials used for Maxwell's equations. Once we have presented the formulations, we discuss the non-dimensionalized systems used in the numerical experiments presented in section 4.6. Finally, we conclude the chapter with a summary of the original contributions presented in this thesis, which can be found in section 1.4.
1.2.1 Vlasov-Maxwell System

In this work, we develop numerical algorithms for plasmas described by the Vlasov-Maxwell (VM) system, which, in SI units, reads as

    ∂f_s/∂t + v·∇_x f_s + (q_s/m_s)(E + v×B)·∇_v f_s = 0,   (1.1)
    ∇×E = −∂B/∂t,   (1.2)
    ∇×B = μ₀ (J + ε₀ ∂E/∂t),   (1.3)
    ∇·E = ρ/ε₀,   (1.4)
    ∇·B = 0.   (1.5)

The first equation (1.1) is the Vlasov equation, which describes the evolution of a probability distribution function f_s(x, v, t) for particles of species s in phase space, which have mass m_s and charge q_s. More specifically, it describes the probability of finding a particle of species s at the position x, with a velocity v, at a given time t. Since the position and velocity data are vectors with 3 components, the distribution function is a scalar function of 6 dimensions plus time. While the equation itself has a fairly simple structure, the primary challenge in numerically solving it is its high dimensionality. This growth in the dimensionality has posed tremendous difficulties for grid-based discretization methods, where one often needs to use many grid points to resolve scales in the problem. The use of grids for problems involving beams poses additional challenges due to excessive dissipation along the edge of the beams. This difficulty is further compounded by the fact that many plasmas of interest contain multiple species. Despite the lack of a collision operator on the right-hand side of (1.1), collisions do exist in a certain mean-field sense, through the electric and magnetic fields, which appear as coefficients of the velocity gradient. Equations (1.2) - (1.5) are Maxwell's equations, which describe the evolution of the background electric and magnetic fields. Since the plasma is a collection of moving charges, any changes in the distribution function for each species will be reflected in the charge density ρ(x, t), as well as the current density J(x, t), which, respectively, are the source terms for Gauss' law (1.4) and Ampère's law (1.3).
For N_s species, the total charge density and current density are defined by summing over the species:

    ρ(x, t) = Σ_{s=1}^{N_s} ρ_s(x, t),    J(x, t) = Σ_{s=1}^{N_s} J_s(x, t),   (1.6)

where the species charge and current densities are defined through moments of the distribution function f_s according to

    ρ_s(x, t) = q_s ∫_{Ω_v} f_s(x, v, t) dv,    J_s(x, t) = q_s ∫_{Ω_v} v f_s(x, v, t) dv.   (1.7)

Here, the integrals are taken over the velocity components of phase space, which we have denoted by Ω_v. The remaining parameters ε₀ and μ₀ describe the permittivity and permeability of the media in which the fields propagate, which we take to be free-space. In free-space, solutions of Maxwell's equations propagate at the speed of light c, which leads to the useful relation c² = (μ₀ε₀)⁻¹. The last two equations (1.4) and (1.5) are constraints placed on the fields to maintain charge conservation and prevent the appearance of so-called "magnetic monopoles". It is imperative that numerical schemes for Maxwell's equations satisfy these conditions. This requirement is one of the reasons we adopt a formulation of Maxwell's equations in potential form, which is the subject of the next section.

1.2.2 Problem Formulation

In this work, we adopt a particle formulation of the Vlasov equation (1.1) and use a gauge formulation of Maxwell's equations. Here, we discuss the models that form the basis of the numerical methods presented in this work.

1.2.2.1 Maxwell's Equations with the Lorenz Gauge

Under the Lorenz gauge, Maxwell's equations transform to a system of decoupled wave equations of the form

    (1/c²) ∂²ψ/∂t² − Δψ = ρ/ε₀,   (1.8)
    (1/c²) ∂²A/∂t² − ΔA = μ₀ J,   (1.9)
    ∇·A + (1/c²) ∂ψ/∂t = 0,   (1.10)

where c is the speed of light, and ε₀ and μ₀ represent, respectively, the permittivity and permeability of free-space. Further, we have used ψ to denote the scalar potential and A to denote the vector potential.
In fact, under any choice of gauge condition, given ψ and A, one can recover E and B using the definitions

    E = −∇ψ − ∂A/∂t,    B = ∇×A,   (1.11)

where "×" denotes the vector cross product. The structure of equations (1.8) and (1.9) is appealing because the system, modulo the gauge condition (1.10), is essentially four "decoupled" scalar wave equations. Since this system is over-determined, the coupling manifests itself through the gauge condition, which should be thought of as a constraint. Moreover, Maxwell's equations (1.2) - (1.5) are equivalent to (1.8) and (1.9) as long as the Lorenz gauge condition (1.10) is satisfied by ψ and A. This formulation is appealing for several reasons. First, this system is purely hyperbolic, so it evolves in a local sense. Computationally, this means that a localized method can be used to evolve the system, which will be more efficient on parallel computers. Another attractive feature is that many of the methods developed for scalar wave equations, e.g., [38] and [34], can be applied to the system in a straightforward manner.

1.2.2.2 Maxwell's Equations with the Coulomb Gauge

If, instead, we impose the Coulomb gauge in Maxwell's equations, we obtain a coupled, mixed-type system

    −Δψ = ρ/ε₀,   (1.12)
    (1/c²) ∂²A/∂t² − ΔA = μ₀ J − (1/c²) ∂(∇ψ)/∂t,   (1.13)
    ∇·A = 0,   (1.14)

where, again, c is the speed of light, ε₀ and μ₀ represent, respectively, the permittivity and permeability of free-space, and ψ and A denote the scalar potential and vector potential, respectively. There are several noticeable differences in the system above when compared to the one obtained with the Lorenz gauge (1.8)-(1.10). The equations are no longer decoupled in the same way as in the case of the Lorenz gauge. Moreover, the equation for the scalar potential (1.12) is an elliptic equation, rather than a hyperbolic equation.
This requires elliptic solvers, which are more difficult to scale on parallel computers than hyperbolic solvers due to their global properties. As a consequence, additional parallel communications are required to coordinate the solves. Additionally, the Coulomb gauge introduces a somewhat unusual time derivative ∂_t∇ψ, which is connected to the steady-state equation (1.12). In our implementation of the solver with the Coulomb gauge, we use a Helmholtz decomposition of the vector fields A and J that separates the rotational and irrotational components according to

    A = A_rot + A_irrot ≡ A_rot − ∇ξ,   (1.15)
    J = J_rot + J_irrot ≡ J_rot − ∇η.   (1.16)

Here, ξ and η are scalar functions and, by definition, ∇·A_rot = 0 and ∇·J_rot = 0. Substituting this decomposition into equation (1.13), and separating the equations (by linearity), we obtain

    (1/c²) ∂²A_rot/∂t² − ΔA_rot = μ₀ J_rot,   (1.17)
    (1/c²) ∂²A_irrot/∂t² − ΔA_irrot = μ₀ J_irrot − (1/c²) ∂(∇ψ)/∂t.   (1.18)

The second equation is connected to the continuity equation. If we assume that the Coulomb gauge is satisfied, then it follows that ∇·A_irrot = −Δξ = 0. With this in mind, if we now take the divergence of equation (1.18), using (1.12) and (1.16), we find that

    0 = −(1/c²) ∂(Δψ)/∂t + μ₀ ∇·J_irrot
      = (1/(c²ε₀)) ∂ρ/∂t + μ₀ ∇·(J − J_rot)
      = μ₀ (∂ρ/∂t + ∇·J),

which is the continuity equation. In order to solve (1.17), which avoids the term involving the strange time derivative, we need to identify J_rot from J. One way of doing this is by appealing to equation (1.16). Taking its divergence and rearranging the terms, we obtain the elliptic equation

    −Δη = ∇·J.   (1.19)

Once η is identified, we can compute ∇η and set

    J_rot = J + ∇η.   (1.20)

Our implementation works off of equations (1.12) and (1.17), in addition to (1.19) and (1.20). The issue of enforcing the gauge condition will be addressed in section 4.3, where we discuss methods for controlling errors in the gauges.
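To make the decomposition step (1.19)-(1.20) concrete, the following sketch extracts the rotational component J_rot on a doubly periodic grid using FFTs. This is only an illustrative stand-in under a periodic-boundary assumption, not the solver used in this work; the function name and discretization choices are our own.

```python
import numpy as np

def rotational_part(Jx, Jy, dx):
    """Extract J_rot on a periodic 2D grid: solve -Δη = ∇·J spectrally,
    then set J_rot = J + ∇η, following (1.19)-(1.20)."""
    N = Jx.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(N, d=dx)       # integer wavenumbers on [0, N*dx)
    kx, ky = np.meshgrid(k, k, indexing='ij')
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                                # avoid 0/0 for the mean mode
    divJ = 1j * kx * np.fft.fft2(Jx) + 1j * ky * np.fft.fft2(Jy)
    eta_hat = divJ / k2                           # -Δη = ∇·J  →  η̂ = (∇·J)^ / |k|²
    eta_hat[0, 0] = 0.0
    Jx_rot = Jx + np.real(np.fft.ifft2(1j * kx * eta_hat))   # J_rot = J + ∇η
    Jy_rot = Jy + np.real(np.fft.ifft2(1j * ky * eta_hat))
    return Jx_rot, Jy_rot
```

By construction, the discrete divergence of the returned field vanishes mode-by-mode, which mirrors the role of (1.19)-(1.20) in isolating the source for the wave equation (1.17).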
1.2.2.3 Formulation for the Particles

A particle formulation of the Vlasov equation (1.1) can be developed by writing the species distribution function f_s as a collection of Dirac delta distributions over phase space:

    f_s(x, v, t) = Σ_{p=1}^{N_{p_s}} δ(x − x_p) δ(v − v_p).   (1.21)

Here, N_{p_s} symbolizes the number of particles of species s. Notice that we also have the relation

    ∫_{Ω_x} ∫_{Ω_v} f_s(x, v, t) dv dx = N_{p_s},

which holds at any time t. Furthermore, f_s can be converted into a proper distribution by including a normalization factor of 1/N_{p_s}. By combining the ansatz (1.21) with the definitions (1.6) and (1.7), we obtain the following definitions of the charge density and current density for a collection of N_p simulation particles:

    ρ(x) = Σ_{p=1}^{N_p} q_p δ(x − x_p),   (1.22)
    J(x) = Σ_{p=1}^{N_p} q_p v_p δ(x − x_p).   (1.23)

In the above equations, q_p, x_p, and v_p denote the charge, position, and velocity, respectively, of a particle whose label is p. In defining things this way, we have dropped the reference to the species altogether, since each particle can be thought of as its own entity. These particles move along characteristics of the equation (1.1), which are given by the system of ordinary differential equations

    ẋ_i = v_i,    v̇_i = (1/m_i) F(x_i).

The vector field F is the Lorentz force that acts on the particles and is defined by

    F = q (E + v×B),   (1.24)

where we have removed the subscript that refers to a specific particle for simplicity. Next, we write the fields in terms of their potentials. Using (1.11), we can obtain the equivalent expression

    F = q (−∇ψ − ∂A/∂t + v × (∇×A)).

This expression can be simplified with the aid of the vector identity

    ∇(a·b) = a × (∇×b) + b × (∇×a) + (a·∇)b + (b·∇)a.   (1.25)

Using a ≡ A and b ≡ v, along with the fact that the velocity v does not depend on x, we obtain the relation

    v × (∇×A) = ∇(A·v) − (v·∇)A.

Inserting this expression into the force yields

    F = q (−∇ψ − ∂A/∂t + ∇(A·v) − (v·∇)A).

Then we can use the definition of the total (convective) derivative to write

    dA/dt = ∂A/∂t + (v·∇)A,

which means the force is equivalent to

    F = q (−∇ψ − dA/dt + ∇(A·v)).

If we let p denote the classical momentum p = mv, so that dp/dt = F, then we can move the time derivative to the left side of the equation, which gives

    d/dt (p + qA) = q (−∇ψ + ∇(A·v)).

The expression on the left contains the canonical (generalized) momentum

    P := p + qA,   (1.26)

and the right side can be expressed as −∇U, if we let U = q(ψ − A·v). Therefore, we obtain the canonical momentum equation

    dP/dt = q (−∇ψ + ∇(A·v)).

This is the penultimate form of the equation we seek. Instead, we want the derivatives to appear on the vector potential A rather than on A·v. For this, we can use an equivalent form of the identity (1.25), namely

    ∇(a·b) = (∇b)·a + (∇a)·b.

Again, we select a ≡ A and b ≡ v, which shows that

    ∇(A·v) = (∇v)·A + (∇A)·v.

Equation (1.26) provides the connection between the linear momentum p = mv and the canonical momentum P, so that the velocity is given by

    v ≡ (1/m)(P − qA).   (1.27)

Since v is a function only of time, we have that (∇v)·A = 0. Combining this with the other term yields an expanded form for the canonical momentum update

    dP/dt = q (−∇ψ + (1/m)(∇A)·(P − qA)).   (1.28)

Note that since A is a vector, taking the gradient increases its rank by 1, which means that ∇A is a dyad. In component form, we can write ∇(a·b) as

    ∂_i (a_j b_j) = (∂a^(j)/∂x_i) b_j + (∂b^(j)/∂x_i) a_j,   (1.29)

where the summation convention over repeated indices has been used. For our case, the vector b in the above calculation does not depend on space. Therefore, one only requires computing the entries of the dyad

    (∂A^(j)/∂x_i) ≡ J_A^T,   (1.30)

where J_A is the Jacobian matrix associated with A.

Remark 1.2.1.
Another way to see that (1.29) and (1.30) are the correct expressions for (1.28) is to use the fact that A·v is a scalar, and then apply the usual gradient operator for scalar functions. In other words,

    ∇(A·v) = ∇(A^(1) v^(1) + A^(2) v^(2) + A^(3) v^(3))
           = ( ∂_x (A^(1) v^(1) + A^(2) v^(2) + A^(3) v^(3)),
               ∂_y (A^(1) v^(1) + A^(2) v^(2) + A^(3) v^(3)),
               ∂_z (A^(1) v^(1) + A^(2) v^(2) + A^(3) v^(3)) )^T.

The result follows by distributing the derivatives in each row, using the fact that the components of the velocity do not depend on space, followed by use of the definition (1.27).

To obtain the position equation, we note that dx/dt = v and that the classical momentum is given by p = mv. This implies that

    P = m dx/dt + qA,

which can be rearranged to obtain

    dx/dt = (1/m)(P − qA).   (1.31)

Since the transformed equations of motion given by (1.28) and (1.31) will be identical in structure among the particles (differing only by the particle labels), this results in the system

    dx_i/dt = (1/m_i)(P_i − q_i A),   (1.32)
    dP_i/dt = q_i (−∇ψ + (1/m_i)(∇A)·(P_i − q_i A)).   (1.33)

The complete formulation consists of evolving the fields with (1.8)-(1.10) and the particles with (1.32)-(1.33). In the next section, we provide details concerning the non-dimensionalization used for the models.

1.3 Non-dimensionalization

In this section, we discuss the scalings used to non-dimensionalize the models explored in this work. Our choice in exploring the normalized form of these models is simply to reduce the number of floating point operations involving small or large numbers. We first non-dimensionalize the models for the particles, then focus our efforts on the field equations obtained with both the Lorenz and Coulomb gauges. To minimize repetition, we shall illustrate the process for parts of the formulation and simply state the results for those that follow an identical pattern.
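Before turning to the scaling analysis, the particle update (1.32)-(1.33) can be sketched in code. The sketch below uses a plain forward-Euler step for a single particle, purely for illustration; the thesis instead uses integrators for non-separable Hamiltonian systems (see chapter 4). The field callables `grad_psi`, `A`, and `jac_A` are hypothetical interfaces, with the dyad (∇A)·w evaluated as J_A^T w per (1.30).

```python
import numpy as np

def euler_push(x, P, q, m, t, dt, grad_psi, A, jac_A):
    """One forward-Euler step of (1.32)-(1.33) for a single particle.
    grad_psi(x, t) -> (3,), A(x, t) -> (3,), jac_A(x, t) -> (3, 3) with
    J[i, j] = dA_i/dx_j.  (Illustrative only; not the integrator used here.)"""
    w = P - q * A(x, t)                   # kinetic momentum m*v, from (1.26)
    v = w / m                             # velocity via (1.27)
    dP = q * (-grad_psi(x, t) + jac_A(x, t).T @ w / m)   # right side of (1.33)
    return x + dt * v, P + dt * dP
```

Note that only ψ, A, and the spatial derivatives of the potentials enter; no time derivatives of field data are required, which is the point of the generalized momentum formulation.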
The setup for the non-dimensionalization process used in this section considers the following substitutions:

    x → L x̃,    t → T t̃,    P → P P̃,    n → n̄ ñ,
    ψ → ψ₀ ψ̃,    A → A₀ Ã,    ρ → Q n̄ ρ̃,    J → Q n̄ V J̃ ≡ (Q n̄ L / T) J̃.

Here, we use n̄ to denote a reference number density [m⁻³], Q is the scale for charge in [C], and we also introduce M, which represents the scale for mass [kg]. The values of Q and M are set according to the electrons, so that Q = |q_e| and M = m_e. Other scales, such as ψ₀ and A₀, shall be specified later. A natural choice of the scales for L and T are the Debye length and the angular plasma period, which are defined, respectively, by

    L = λ_D = sqrt( ε₀ k_B T̄ / (n̄ q_e²) )  [m],    T = ω_pe⁻¹ = sqrt( m_e ε₀ / (n̄ q_e²) )  [s/rad],

where k_B is the Boltzmann constant, m_e is the electron mass, q_e is the electron charge, and T̄ is an average macroscopic temperature for the plasma. We select these scales for all test problems considered in section 4.6, with the exception of the expanding beam problems, which are the last two test problems in that section. For those problems, the length scale L corresponds to the longest side of the simulation domain and T is the crossing time for a particle injected into the domain. Generally, the user will need to provide a macroscopic temperature T̄ [K] in addition to the reference number density n̄ in order to compute λ_D and ω_pe⁻¹. Note that, in some cases, we shall refer to the plasma period, which can be obtained from the angular plasma period T by multiplying by 2π. Having introduced the definitions for the normalized variables, we proceed to non-dimensionalize the models, beginning with the equations for the particles before addressing the field equations in both the Lorenz and Coulomb gauges.
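The reference scales above are straightforward to compute from user inputs. The small helper below (our own illustration, with standard CODATA values for the physical constants) evaluates λ_D and ω_pe⁻¹ from n̄ and T̄:

```python
import math

# Physical constants in SI units (standard CODATA values).
EPS0 = 8.8541878128e-12   # vacuum permittivity [F/m]
KB   = 1.380649e-23       # Boltzmann constant [J/K]
ME   = 9.1093837015e-31   # electron mass [kg]
QE   = 1.602176634e-19    # elementary charge [C]

def reference_scales(n_bar, T_bar):
    """Debye length L = λ_D [m] and inverse angular plasma frequency
    T = 1/ω_pe [s/rad] from a number density n̄ [m⁻³] and temperature T̄ [K]."""
    L = math.sqrt(EPS0 * KB * T_bar / (n_bar * QE**2))
    T = math.sqrt(ME * EPS0 / (n_bar * QE**2))
    return L, T
```

A useful sanity check on the pair is the identity L/T = λ_D ω_pe = sqrt(k_B T̄ / m_e), the electron thermal speed, which is independent of n̄.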
1.3.1 Equations of Motion in E-B Form

In our development of new field solvers with applications to PIC, it is helpful to benchmark the proposed methods against standard approaches for integrating particles, which use the Newton-Lorentz equations

    dx_i/dt = v_i,   (1.34)
    dv_i/dt = (q_i/m_i)(E(x_i) + v_i × B(x_i)).   (1.35)

After inserting the scales into the position equation (1.34), we obtain the equivalent non-dimensional form

    dx̃_i/dt̃ = ṽ_i.

Following the same process for the velocity equation, after some rearrangement, we obtain

    dṽ_i/dt̃ = (q̃_i/r_i) ( (Q E₀ T)/(M V) Ẽ + (Q B₀ T)/M ṽ_i × B̃ ).

Here, we have introduced the non-dimensional electric and magnetic fields Ẽ and B̃, which are normalized by E₀ and B₀, respectively, and r_i = m_i/M is a mass ratio. From equation (1.11), we can express these scales in terms of ψ₀ and A₀ as

    E₀ = ψ₀/L,    B₀ = A₀/L.

Therefore, the non-dimensionalized equation for the velocity can be expressed in terms of these scales as

    dṽ_i/dt̃ = (q̃_i/r_i) ( (Q ψ₀ T²)/(M L²) Ẽ + (Q A₀ T)/(M L) ṽ_i × B̃ )
            ≡ (q̃_i/r_i) ( α₁ Ẽ + α₂ ṽ_i × B̃ ),

where we have introduced the parameters

    α₁ = (Q ψ₀ T²)/(M L²),    α₂ = (Q A₀ T)/(M L).

We then select ψ₀ and A₀ so that α₁ = α₂ = 1, i.e.,

    ψ₀ = (M L²)/(Q T²),    A₀ = (M L)/(Q T),   (1.36)

which results in the non-dimensional system

    dx_i/dt = v_i,   (1.37)
    dv_i/dt = (q_i/r_i)(E + v_i × B).   (1.38)

Note that we have dropped the tildes for brevity. The next section provides the non-dimensionalized form of the analogous equations that evolve particles using the potentials ψ and A in the generalized momentum framework.

1.3.2 Equations of Motion for the Generalized Hamiltonian

Here, we non-dimensionalize the generalized momentum model for the particles, which is given by equations (1.32) and (1.33). For convenience, the system is

    dx_i/dt = (1/m_i)(P_i − q_i A),
    dP_i/dt = q_i (−∇ψ + (1/m_i)(∇A)·(P_i − q_i A)).

Substituting the scales introduced at the beginning of the section into the position equation, and rearranging terms, we obtain

    dx̃_i/dt̃ = (1/r_i) ( (T P)/(M L) P̃_i − (Q T A₀)/(M L) q̃_i Ã ).

This equation can be simplified further by noting that A₀ is chosen according to (1.36) and the scale for the momentum is P = M L T⁻¹. Therefore, we obtain the non-dimensionalized position equation

    dx̃_i/dt̃ = (1/r_i)(P̃_i − q̃_i Ã) ≡ ṽ_i.

Following an identical treatment for the generalized momentum equation, using the scales set according to (1.36), we obtain

    dP̃_i/dt̃ = q̃_i (−∇̃ψ̃ + (1/r_i)(∇̃Ã)·(P̃_i − q̃_i Ã)).

Therefore, the non-dimensional system is (again, dropping tildes for simplicity)

    dx_i/dt = (1/r_i)(P_i − q_i A) ≡ v_i,   (1.39)
    dP_i/dt = q_i (−∇ψ + (1/r_i)(∇A)·(P_i − q_i A)).   (1.40)

The next sections are focused on the non-dimensionalized form of the field equations cast under the Lorenz and Coulomb gauge conditions.

1.3.3 Maxwell's Equations in the Lorenz Gauge

Substituting the scales introduced at the beginning of the section into equations (1.8) - (1.10), we find that

    (1/c²)(ψ₀/T²) ∂²ψ̃/∂t̃² − (ψ₀/L²) Δ̃ψ̃ = (Q n̄/ε₀) ρ̃,
    (1/c²)(A₀/T²) ∂²Ã/∂t̃² − (A₀/L²) Δ̃Ã = (μ₀ Q n̄ L/T) J̃,
    (A₀/L) ∇̃·Ã + (1/c²)(ψ₀/T) ∂ψ̃/∂t̃ = 0.

The first equation can be rearranged to obtain

    (L²/(c²T²)) ∂²ψ̃/∂t̃² − Δ̃ψ̃ = (L² Q n̄)/(ε₀ ψ₀) ρ̃.

Similarly, with the second equation, we obtain

    (L²/(c²T²)) ∂²Ã/∂t̃² − Δ̃Ã = (L² Q n̄ V)/(c² ε₀ A₀) J̃,

where we have used V = L T⁻¹, as well as the fact that c² = (μ₀ε₀)⁻¹. Finally, the gauge condition becomes

    ∇̃·Ã + (ψ₀ V)/(c² A₀) ∂ψ̃/∂t̃ = 0.

Introducing the normalized speed of light κ = c/V, and selecting ψ₀ and A₀ from (1.36), we find that the above equations simplify to (dropping the tildes)

    (1/κ²) ∂²ψ/∂t² − Δψ = (1/σ₁) ρ,   (1.41)
    (1/κ²) ∂²A/∂t² − ΔA = σ₂ J,   (1.42)
    ∇·A + (1/κ²) ∂ψ/∂t = 0,   (1.43)

where we have introduced the new parameters

    σ₁ = (M ε₀)/(Q² T² n̄),    σ₂ = (Q² L² n̄ μ₀)/M.   (1.44)

These are nothing more than normalized versions of the permittivity and permeability constants in the original equations.

1.3.4 Maxwell's Equations in the Coulomb Gauge

Following an approach identical to the one used for the Lorenz gauge in the previous section, one finds that the non-dimensional form of Maxwell's equations in the Coulomb gauge is given by

    −Δψ = (1/σ₁) ρ,   (1.45)
    (1/κ²) ∂²A/∂t² − ΔA = σ₂ J − (1/κ²) ∂(∇ψ)/∂t,   (1.46)
    ∇·A = 0,   (1.47)

where σ₁ and σ₂ are defined according to (1.44) and κ = c/V is the normalized speed of light.

1.4 Contributions of This Thesis

The remainder of this thesis is devoted to the design of algorithms for solving the VM system using a PIC method. The particles are advanced in the proposed methods according to either the Newton-Lorentz equations (1.37)-(1.38) or the generalized formulation (1.39)-(1.40). In the latter approach, particles are evolved using the smooth potentials ψ and A, as well as their derivatives, which has the advantage of eliminating time derivatives from the gauge formulation. The former approach for moving particles is largely standard and serves not only as a useful benchmark, but also demonstrates ways in which our methods can be incorporated into existing methodologies. While the methods developed in this work are more aligned with the Lorenz gauge formulation of Maxwell's equations (1.41)-(1.43), we also include some results obtained with the formulation (1.45)-(1.47) based on the Coulomb gauge. Chapter 2 provides a discussion of the methods used to evolve the wave equations that appear in the gauge formulations considered in this work. The core algorithms for the fields are quite similar to those presented in the thesis [42], as well as the article [34]. The latter article analyzed stability and characterized other properties of the second-order solvers, including dissipation and dispersion.
This work extends these ideas by introducing new methods for the construction of analytical derivatives that can be obtained directly from the base solver with no degradation in the rate of convergence. This provides a more general approach to the calculation of derivatives, which can be leveraged in problems with non-uniform meshes and non-trivial geometries [40], and costs the same as the base solver. In contrast, the particle methods developed in [41] used centered finite-differences on staggered Cartesian grids, which reduce the accuracy of the fields by one order in space. We also revisit outflow boundary conditions, focusing primarily on the multi-dimensional setting, and propose a form of outflow that is based on extrapolation. Additional details concerning these approaches are provided to clarify ambiguities in some of the earlier presentations. Time and space refinement experiments are used to demonstrate the effectiveness of the proposed approaches. In chapter 3, we discuss parallel algorithms for successive convolution methods (see, e.g., [38, 43, 44, 45]). Our paper [39] developed a nearest-neighbor domain decomposition algorithm by leveraging certain decay properties of the methods, which ultimately allowed the algorithms to run on distributed memory systems. Using the Kokkos performance portability library [46], we assessed the performance of several different strategies for dispatching threads in the shared memory space, focusing on optimizing common looping patterns in the algorithms. Weak and strong scaling experiments were performed on both linear and nonlinear problems to study the efficiency of the algorithms. The focus of chapter 4 concerns the development of new PIC methods for the simulation of plasmas. We first introduce the time stepping methods used to evolve particles in the simulations, including leapfrog integration, as well as the popular Boris rotation method [4].
These methods are used to forge a comparison with the generalized momentum formulation (1.39)-(1.40), which is evolved using algorithms for non-separable Hamiltonian systems [36, 47]. Several techniques are proposed in an effort to control errors in the gauge conditions and enforce charge conservation. We demonstrate the performance of the methods on several test problems, beginning with a single particle test in fixed fields, before moving to more challenging beam problems. The last chapter of this thesis provides a high-level summary of the results. We also discuss several ideas for future work. This includes outlining possible improvements to the methods presented here, as well as other interesting ideas and test problems that there was not enough time to explore.

CHAPTER 2
NUMERICAL METHODS FOR THE FIELD EQUATIONS

In this chapter, we describe algorithms used for wave propagation, which will be used in the formulations of Maxwell's equations discussed in the previous chapter. We begin with a general discussion of Green's function methods and integral equations in section 2.1, which are the foundation of the approaches considered in this work. The discussion of the solvers is primarily focused on two types of second-order accurate (in time) methods, which are presented in section 2.2 in their corresponding semi-discrete forms. Using a dimensional splitting technique, which is presented at the end of the same section, we formulate the solution in terms of one-dimensional operators that can be inverted using the methods discussed in section 2.3. We then discuss the application of boundary conditions in section 2.4, which also includes caveats for multi-dimensional problems. Additionally, we show how to construct derivatives of the fields analytically, which retains the convergence rate of the original method. This section also discusses outflow boundary conditions, which have presented a challenge for this class of methods.
Following [38], we briefly discuss the extension of these methods to higher-order accuracy in time, and we take care to address the complications for boundary conditions in these methods. We present some numerical results in section 2.6, which demonstrate the accuracy of the proposed methods. Lastly, in section 2.7, we provide a brief summary of the contributions in this work.

2.1 Integral Equation Methods and Green's Functions

Integral equation methods or, more generally, Green's function methods, are a powerful class of techniques used in the solution of boundary value problems that occur in a range of applications, including mechanics, fluid dynamics, and electromagnetism. Such methods allow one to write an explicit solution of an elliptic PDE in terms of a fundamental solution or Green's function. While explicit, this solution can be difficult or impossible to evaluate analytically, so numerical quadrature is used to evaluate these terms. So-called layer potentials can then be introduced in the form of surface integrals, which are used to adjust the solution to satisfy the prescribed boundary data. This framework allows one to solve problems in complicated domains without resorting to the use of a mesh. We illustrate the basics of the method with an example. Suppose that we are solving the following modified Helmholtz equation

    (I − (1/α²) Δ) u(x) = S(x),  x ∈ Ω,   (2.1)

where Ω ⊂ R^n, I is the identity operator, Δ is the Laplacian operator in R^n, S is a source term, and α ∈ R is a parameter. While this method can be broadly applied to other elliptic PDEs, equation (2.1) is of interest to us because it can be obtained from the time discretization of a parabolic or hyperbolic PDE. In this case, the source function would include additional time levels of u, and the parameter α = α(Δt) would be connected to the time discretization of the original problem being solved. We shall not prescribe boundary conditions for this equation, and instead consider the most general solution.
To apply a Green's function method to equation (2.1), one first needs to identify a function G(x, y) that solves the equation

    (I − (1/α²) Δ) G(x, y) = δ(x − y),  x, y ∈ R^n,   (2.2)

over free-space, with δ(x − y) being the Dirac delta distribution. There are many ways to approach solving this equation. For example, it is common to take advantage of radial symmetry, so that the problem reduces to a single variable, which can be solved using a Fourier transform. Green's functions have been tabulated for many different operators, including the modified Helmholtz operator. Therefore, we shall not elaborate on this further and assume that the fundamental solution G(x, y) is readily available. We now show the connection between the fundamental solution G(x, y), which is defined on free-space and solves (2.2), and the original problem (2.1). First, let u be a solution of the problem (2.1). If we multiply equation (2.2) by u and integrate over Ω, then

    ∫_Ω (I − (1/α²) Δ) G(x, y) u(y) dV_y = ∫_Ω δ(x − y) u(y) dV_y = u(x),   (2.3)

using properties of the delta distribution. The left side of this equation can be addressed using integration by parts. First, we split the left side into two terms:

    ∫_Ω (I − (1/α²) Δ) G(x, y) u(y) dV_y = ∫_Ω G(x, y) u(y) dV_y − (1/α²) ∫_Ω ΔG(x, y) u(y) dV_y.   (2.4)

Using integration by parts and the divergence theorem, we find that the second integral can be expressed as

    ∫_Ω ΔG(x, y) u(y) dV_y = ∫_Ω ∇·(∇G(x, y) u(y)) dV_y − ∫_Ω ∇G(x, y)·∇u(y) dV_y
                           = ∫_{∂Ω} (∂G/∂n) u(y) dS_y − ∫_Ω ∇G(x, y)·∇u(y) dV_y.

Here, we have used ∂/∂n to denote the normal derivative. If we, again, apply integration by parts along with the divergence theorem to the second integral, we find that

    ∫_Ω ∇G(x, y)·∇u(y) dV_y = ∫_Ω ∇·(G(x, y) ∇u(y)) dV_y − ∫_Ω G(x, y) Δu(y) dV_y
                            = ∫_{∂Ω} G(x, y) (∂u/∂n) dS_y − ∫_Ω G(x, y) Δu(y) dV_y.

Combining each of these results with the relation (2.4), and using this in place of the left side of (2.3), we obtain, after some simplifications, the equation

    u(x) = ∫_Ω G(x, y) (I − (1/α²) Δ) u(y) dV_y + (1/α²) ∫_{∂Ω} ( G(x, y) (∂u/∂n) − (∂G/∂n) u(y) ) dS_y.

Finally, since u solves the PDE (2.1), the above equation is equivalent to

    u(x) = ∫_Ω G(x, y) S(y) dV_y + (1/α²) ∫_{∂Ω} ( G(x, y) (∂u/∂n) − (∂G/∂n) u(y) ) dS_y.   (2.5)

Since the volume integral term does not enforce boundary conditions, the surface integral contributions involving u are replaced with layer potentials to ensure that these conditions are satisfied. Therefore, the general solution, shown above, is replaced with the ansatz

    u(x) = ∫_Ω G(x, y) S(y) dV_y + ∫_{∂Ω} ( σ(y) G(x, y) + γ(y) (∂G/∂n) ) dS_y,   (2.6)

where σ(y) is the single-layer potential and γ(y) is the double-layer potential, which must now be determined. The names reflect the behavior of the Green's function associated with each of the terms. The Green's function itself is continuous, but its derivative will have a jump. Based on the boundary conditions, one selects either a single or double layer form as the ansatz for the solution. The single layer form is used in the Neumann problem, while the double layer form is chosen for the Dirichlet problem. Numerically evaluating the solution (2.6) first requires a discretization of the integrals using quadrature. The evaluation proceeds by computing the volume integral term, which is the particular solution to the problem (2.1). Using the particular solution, one obtains the homogeneous solution with modified boundary data that accounts for the particular solution's contributions along the boundary. This step for the homogeneous solution requires the identification of σ(y) or γ(y) at the quadrature points taken along the domain boundary, which results in a dense linear system.
In contrast to other classes of solvers, e.g., finite-element or finite-difference schemes, these linear systems are usually well-conditioned, so the application of an iterative solver, such as the GMRES method [27], converges in only a few iterations, with an iteration count that is independent of the number of quadrature points. For efficiency, the fast multipole method (FMM) [48] can be used to reduce the evaluation time required in the GMRES iterations [49]. The algorithms presented in the subsequent sections are essentially a one-dimensional analogue of these methods. Rather than invert the multi-dimensional operator corresponding to (2.6), the methods presented here, instead, factor the Laplacian and invert one-dimensional operators, dimension-by-dimension, using the one-dimensional form of (2.6). We will see that the resulting methods solve for something that looks like a layer potential, with the key difference being that the linear system is now a 2 × 2 matrix that can be inverted by hand rather than with an iterative method. Similarly, the particular solution along a given line segment can be rapidly computed with a lightweight, recursive, fast summation method, rather than a more complicated method, such as the FMM. Moreover, these methods retain geometric flexibility, since the domain can be represented using one-dimensional line segments with termination points specified by the geometry.

2.2 Semi-discrete Schemes for the Wave Equation

Here we provide a description of the second-order accurate wave solvers based on backward-difference (BDF) and time-centered discretizations. We show how to derive the semi-discrete equations associated with each of the methods, which take the form of modified Helmholtz equations (2.1). Then, we discuss the splitting technique that is used for multi-dimensional problems.
2.2.1 The BDF Scheme

To derive the BDF form of the wave solver, we start with the equation
$$\frac{1}{c^2} \frac{\partial^2 u}{\partial t^2} - \Delta u = S(\mathbf{x}, t), \quad (2.7)$$
where $c$ is the wave speed and $S$ is a source function. Then, using the notation $u(\mathbf{x}, t^n) = u^n$, we can apply a second-order accurate backward finite-difference stencil for the second derivative
$$\left. \frac{\partial^2 u}{\partial t^2} \right|_{t = t^{n+1}} = \frac{2u^{n+1} - 5u^n + 4u^{n-1} - u^{n-2}}{\Delta t^2} + \mathcal{O}(\Delta t^2),$$
where $\Delta t = t^k - t^{k-1}$, for any $k$, is the grid spacing in time. Evaluating the remaining terms in equation (2.7) at time level $t^{n+1}$, and inserting the above difference approximation, we obtain
$$\frac{1}{c^2 \Delta t^2} \left( 2u^{n+1} - 5u^n + 4u^{n-1} - u^{n-2} \right) - \Delta u^{n+1} = S^{n+1}(\mathbf{x}) + \mathcal{O}\left( \frac{1}{\alpha^2} \right),$$
which can be rearranged to obtain the semi-discrete equation
$$\left( \mathcal{I} - \frac{1}{\alpha^2}\Delta \right) u^{n+1} = \frac{1}{2}\left( 5u^n - 4u^{n-1} + u^{n-2} \right) + \frac{1}{\alpha^2} S^{n+1}(\mathbf{x}) + \mathcal{O}\left( \frac{1}{\alpha^4} \right), \quad (2.8)$$
where we have introduced the parameter $\alpha = \sqrt{2}/(c\Delta t)$. We note that the source term is treated implicitly in this method, which creates additional complications if the source function $S$ depends on $u$. This necessitates some form of iteration, which increases the cost of the method. Stability properties for the semi-discrete equation (2.8) were presented in the thesis [42], which showed that the method, above, is purely diffusive and unconditionally stable. Higher-order BDF methods can be obtained simply by using a wider finite-difference stencil to approximate the time derivative $\partial_{tt} u$. While moving to higher order reduces the (overly) diffusive nature of the second-order method, there are other concerns surrounding the stability of such methods.

2.2.2 Time-centered Scheme

In the semi-discrete form of the second-order BDF method, we saw that the source term was treated implicitly, which, in some cases, requires some form of iteration. If the iteration procedure converges slowly, this can increase the cost of the method significantly. Another approach to deal with this problem is to use a time-centered method, in which the source is treated explicitly.
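Before detailing the centered scheme, the backward stencil used in the BDF-2 method of the previous subsection can be verified numerically. The following sketch (Python with NumPy; the test function $u(t) = \sin t$ and the step sizes are illustrative choices, not taken from the thesis) evaluates the truncation error of the four-point stencil at two step sizes and estimates the observed order of accuracy.

```python
import numpy as np

def bdf2_second_derivative(u, t, dt):
    """Backward stencil (2u^{n+1} - 5u^n + 4u^{n-1} - u^{n-2}) / dt^2,
    evaluated at t = t^{n+1}, as in the BDF-2 semi-discrete scheme."""
    return (2*u(t) - 5*u(t - dt) + 4*u(t - 2*dt) - u(t - 3*dt)) / dt**2

u = np.sin
exact = -np.sin(1.0)                  # u''(t) for u = sin, at t = 1

# Halving dt should reduce the error by roughly a factor of four.
err = [abs(bdf2_second_derivative(u, 1.0, dt) - exact) for dt in (0.01, 0.005)]
order = np.log2(err[0] / err[1])
print(order)                          # close to 2
```

The stencil is exact for cubic polynomials, so the leading error term involves the fourth derivative of $u$, consistent with the stated $\mathcal{O}(\Delta t^2)$ accuracy.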
In this case, the second time derivative is approximated with a second-order centered difference 𝜕 2𝑢 𝑢 𝑛+1 − 2𝑢 𝑛 + 𝑢 𝑛−1 = + O (Δ𝑡 2 ), 𝜕𝑡 2 𝑡=𝑡 𝑛 Δ𝑡 2 where, again, Δ𝑡 = 𝑡 𝑘 − 𝑡 𝑘−1 , for any 𝑘, is the grid spacing in time. Evaluating the equation (2.7) at time level 𝑡 𝑛 , and using this difference approximation, we obtain 1  𝑛+1  𝑢 − 2𝑢 𝑛 + 𝑢 𝑛−1 − Δ𝑢 𝑛 = 𝑆 𝑛 (x) + O (Δ𝑡 2 ). (2.9) 𝑐 Δ𝑡 2 2 To obtain a semi-discrete equation of the form (2.1) for the data 𝑢 𝑛+1 , the Laplacian term is made implicit through the introduction of the term   𝑢 𝑛+1 − 2𝑢 𝑛 + 𝑢 𝑛−1 Δ , 𝛽 ∈ R, 𝛽2 which is added and subtracted from both sides of (2.9) to obtain (after some rearrangement) 𝛽2  𝑛+1 𝑛 𝑛−1   𝑛+1 2 𝑛 𝑛−1  𝑢 − 2𝑢 + 𝑢 − Δ 𝑢 − (2 − 𝛽 )𝑢 + 𝑢 𝑐2 Δ𝑡 2     = 𝛽2 𝑆 𝑛 (x) − Δ 𝑢 𝑛+1 − 2𝑢 𝑛 + 𝑢 𝑛−1 + O 𝛽2 Δ𝑡 2 . 28 𝛽4 To make the semi-discrete equation take the form (2.1), we add 𝑐2 Δ𝑡 2 𝑢𝑛 to both sides of the equation, so that  2  𝛽  𝑛+1  𝛽4 𝑛 − Δ 𝑢 − (2 − 𝛽 2 𝑛 )𝑢 + 𝑢 𝑛−1 = 𝑢 + 𝛽2 𝑆 𝑛 (x) 𝑐2 Δ𝑡 2 𝑐2 Δ𝑡 2     − Δ 𝑢 𝑛+1 − 2𝑢 𝑛 + 𝑢 𝑛−1 + O 𝛽2 Δ𝑡 2 . If we now write 𝛼 = 𝛽/(𝑐Δ𝑡), and multiply through by 1/𝛼2 , this equation can be written as    2 1  𝑛+1 2 𝑛 𝑛−1  2 𝑛 𝛽2 𝑛 𝛽 I − 2 Δ 𝑢 − (2 − 𝛽 )𝑢 + 𝑢 = 𝛽 𝑢 + 2 𝑆 (x) + O 4 , (2.10) 𝛼 𝛼 𝛼 where we have used the fact that the Laplacian term on the right side satisfies     𝑛+1 𝑛 𝑛−1 𝑛 2 4 2 1 Δ 𝑢 − 2𝑢 + 𝑢 = Δ (𝜕𝑡𝑡 𝑢 ) Δ𝑡 + O (Δ𝑡 ) = O (Δ𝑡 ) ≡ O 2 . 𝛼 In contrast to the semi-discrete equation (2.8), it was shown that the time-centered update (2.10) is purely dispersive [42]. Through stability analysis, it was shown that an unconditionally stable scheme could be obtained as long as 0 < 𝛽 ≤ 2. 2.2.3 Splitting Method Used for Multi-dimensional Problems The semi-discrete equations (2.8) and (2.10) are both modified Helmholtz equations of the form (2.1). Rather than appealing to (2.6), which formally inverts the multi-dimensional modified Helmholtz operator, we apply a factorization into a product of one dimensional operators. 
For example, in two-spatial dimensions the factorization writes    1 1 1 1 I − 2 Δ = I − 2 𝜕𝑥𝑥 I − 2 𝜕𝑦𝑦 + 4 𝜕𝑥𝑥 𝜕𝑦𝑦 , 𝛼 𝛼 𝛼 𝛼 1 ≡ L𝑥 L 𝑦 + 4 𝜕𝑥𝑥 𝜕𝑦𝑦 , 𝛼 where L𝑥 and L 𝑦 are one-dimensional operators and the last term represents the splitting error associated with the factorization step. For second-order accuracy in time, the coefficient of the splitting error is 1/𝛼4 = O (Δ𝑡 4 ), so we shall ignore this term. Therefore, the semi-discrete equation (2.8) and (2.10) can be written more compactly (dropping error terms) as √ 𝑛+1 1 𝑛 𝑛−1 𝑛−2  1 𝑛+1 2 L𝑥 L 𝑦 𝑢 = 5𝑢 − 4𝑢 +𝑢 + 2 𝑆 (x), 𝛼 := , (2.11) 2 𝛼 𝑐Δ𝑡 29 and  𝑛+1 2 𝑛 𝑛−1  2 𝑛 𝛽2 𝑛 𝛽 L𝑥 L 𝑦 𝑢 − (2 − 𝛽 )𝑢 + 𝑢 = 𝛽 𝑢 + 2 𝑆 (x), 𝛼 := , 0 < 𝛽 ≤ 2, (2.12) 𝛼 𝑐Δ𝑡 respectively. 2.3 Inverting One-dimensional Operators The choice of factoring the multi-dimensional modified Helmholtz operator means we now have to solve a sequence of one-dimensional boundary value problems (BVPs) of the form   1 I − 2 𝜕𝑥𝑥 𝑤(𝑥) = 𝑓 (𝑥), 𝑥 ∈ [𝑎, 𝑏], (2.13) 𝛼 where [𝑎, 𝑏] is a one-dimensional line and 𝑓 is a new source term that can be used to represent a time history or an intermediate variable constructed from the inversion of an operator along another direction. We also point out that the parameter 𝛼 depends on the choice of the semi-discrete scheme √ employed to solve the problem. For the BDF scheme 𝛼 = 2/(𝑐Δ𝑡), while the centered scheme uses 𝛼 = 𝛽/(𝑐Δ𝑡), with 0 < 𝛽 ≤ 2. We will show the process by which one obtains the general solution to the problem (2.13), deferring the application of boundary conditions to section 2.4. 2.3.1 Integral Solution Since the BVP (2.13) is linear, its general solution can be expressed using the one dimensional analogue of equation (2.5): ∫ 𝑦=𝑏 𝑏 1   𝑤(𝑥) = 𝐺 (𝑥, 𝑦) 𝑓 (𝑦) 𝑑𝑦 + 2 𝐺 (𝑥, 𝑦)𝜕𝑦 𝑢(𝑦) − 𝑢(𝑦)𝜕𝑦 𝐺 (𝑥, 𝑦) . (2.14) 𝑎 𝛼 𝑦=𝑎 A simple way to obtain the free-space Green’s function for this problem is to use a Fourier transform. In Fourier space, this equation reads   𝑘2 b 𝛼2 1 + 2 𝐺 = 1 =⇒ 𝐺 = 2 b . 
𝛼 𝛼 + 𝑘2 A closely related Fourier transform is obtained with the function h i 2𝜆 −𝜆|𝑥| F 𝑒 = , 𝜆2 + 𝑘2 30 from which, it follows that   𝜆 −𝜆|𝑥| 𝜆2 F 𝑒 = 2 . 2 𝜆 + 𝑘2 Therefore, matching transforms, it follows that the free-space Green’s function in one-dimension is 𝛼 −𝛼|𝑥−𝑦| 𝐺 (𝑥, 𝑦) = 𝑒 . (2.15) 2 To use the relation (2.14), we need to compute the derivatives in the Green’s function. We note that   𝛼𝑒 −𝛼(𝑥−𝑦) ,  𝑥 ≥ 𝑦,    𝜕𝑦 𝐺 (𝑥, 𝑦) =  −𝛼𝑒 −𝛼(𝑦−𝑥) ,    𝑥 < 𝑦.  Taking limits, we find that lim 𝜕𝑦 𝐺 (𝑥, 𝑦) = 𝛼𝑒 −𝛼(𝑥−𝑎) , 𝑦→𝑎 lim 𝜕𝑦 𝐺 (𝑥, 𝑦) = −𝛼𝑒 −𝛼(𝑏−𝑥) . 𝑦→𝑏 Combining these limits with (2.15) and (2.14), we obtain the general solution ∫ 𝛼 𝑏 −𝛼|𝑥−𝑦| 𝑤(𝑥) = 𝑒 𝑓 (𝑦) 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.16) 2 𝑎 where 𝐴 and 𝐵 are constants that are determined by boundary conditions. Comparing with (2.6), these terms serve the same purpose as the layer potentials. Further, we identify the general solution (2.16) as the inverse of the one-dimensional modified Helmholtz operator. In other words, we define L𝑥−1 so that 𝑤(𝑥) = L𝑥−1 [ 𝑓 ] (𝑥), (2.17) ∫ 𝛼 𝑏 −𝛼|𝑥−𝑦| ≡ 𝑒 𝑓 (𝑦) 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.18) 2 𝑎 ≡ I𝑥 [ 𝑓 ] (𝑥) + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) . (2.19) Section 2.4 will make repeated use of (2.17)-(2.19) to illustrate the application of boundary con- ditions. Additionally, the result (2.16) is general enough that it can be used for both the BDF and time-centered schemes. 31 2.3.2 Fast Summation Method In order to compute the inverse operators according to (2.17)-(2.19), suppose we have discretized the one-dimensional computational domain [𝑎, 𝑏] into a mesh consisting of 𝑁 + 1 grid points: 𝑎 = 𝑥0 < 𝑥1 < · · · < 𝑥 𝑁 = 𝑏, with the spacing defined by Δ𝑥 𝑗 = 𝑥 𝑗 − 𝑥 𝑗−1 , 𝑗 = 1, · · · 𝑁. If we directly evaluate the function 𝑤(𝑥) at each of the mesh points, according to (2.18), we obtain ∫ 𝑏 𝛼 𝑤(𝑥𝑖 ) = 𝑒 −𝛼|𝑥𝑖 −𝑦| 𝑓 (𝑦) 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥𝑖 −𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥𝑖 ) , 𝑖 = 0, · · · , 𝑁 + 1. 
Since the evaluation of the integral term in the variable $y$ with quadrature requires $\mathcal{O}(N)$ operations, this direct approach requires a total of $\mathcal{O}(N^2)$ operations. With the aid of a recursive fast summation algorithm, the cost of evaluating these terms can be reduced from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$. To this end, it is helpful to introduce the operators
$$\mathcal{I}_x^R[f](x) \equiv \alpha \int_a^x e^{-\alpha(x-y)} f(y)\, dy, \quad (2.20)$$
$$\mathcal{I}_x^L[f](x) \equiv \alpha \int_x^b e^{-\alpha(y-x)} f(y)\, dy, \quad (2.21)$$
so that the total integral over $[a,b]$ can be expressed as the average of these operators
$$\mathcal{I}_x[f](x) = \frac{1}{2}\left( \mathcal{I}_x^R[f](x) + \mathcal{I}_x^L[f](x) \right). \quad (2.22)$$
The task now reduces to computing the integrals (2.20) and (2.21) in an efficient manner. To develop a recursive expression, consider evaluating the integral (2.20) at a grid point $x_i$. Then, it follows that
$$\begin{aligned} \mathcal{I}_x^R[f](x_i) &= \alpha \int_a^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy, \\ &= \alpha \int_a^{x_{i-1}} e^{-\alpha(x_i - y)} f(y)\, dy + \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy, \\ &= e^{-\alpha \Delta x_i} \left( \alpha \int_a^{x_{i-1}} e^{-\alpha(x_{i-1} - y)} f(y)\, dy \right) + \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy, \\ &\equiv e^{-\alpha \Delta x_i}\, \mathcal{I}_x^R[f](x_{i-1}) + \mathcal{J}_x^R[f](x_i). \end{aligned}$$
In the last line, we have introduced the local integral
$$\mathcal{J}_x^R[f](x_i) = \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy.$$
We see that the integral (2.20) can be expressed through a recursive weighting of its previous values plus an additional term that is localized in space. In the next chapter, we use a variation of this method to develop a domain decomposition algorithm that allows the method to scale on parallel computers. To initialize the recursion, we set $\mathcal{I}_x^R[f](x_0) = 0$, which follows directly from the definition (2.20). Since the calculation of the local integrals $\mathcal{J}_x^R[f](x_i)$ employs quadrature over a collection of $M$ points, the cost of computing (2.20) is now of the form $\mathcal{O}(MN)$.
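The recursion can be checked against a closed form. For $f \equiv 1$, the right-moving integral is $\mathcal{I}_x^R[1](x) = 1 - e^{-\alpha(x - a)}$, and the exact local integrals are $\mathcal{J}_x^R[1](x_i) = 1 - e^{-\alpha \Delta x_i}$. The sketch below (Python/NumPy; the grid and $\alpha$ are illustrative choices) runs the recursion with these exact local values and compares against the closed form at every grid point.

```python
import numpy as np

alpha, a, b, N = 5.0, 0.0, 2.0, 200
x = np.linspace(a, b, N + 1)
dx = np.diff(x)                          # cell widths Delta x_i, i = 1..N

# Recursion: I^R(x_i) = e^{-alpha dx_i} I^R(x_{i-1}) + J^R(x_i),
# with local integrals J^R computed exactly for f = 1.
J = 1.0 - np.exp(-alpha * dx)
IR = np.zeros(N + 1)                     # I^R(x_0) = 0
for i in range(1, N + 1):
    IR[i] = np.exp(-alpha * dx[i - 1]) * IR[i - 1] + J[i - 1]

closed_form = 1.0 - np.exp(-alpha * (x - a))   # alpha * int_a^x e^{-alpha(x-y)} dy
print(np.max(np.abs(IR - closed_form)))        # near machine precision
```

Note that a single pass over the grid produces the integral at every mesh point, which is the source of the $\mathcal{O}(N)$ cost.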
Furthermore, since the number of localized quadrature points $M$ is independent of the mesh size $N$, and we additionally select $M \ll N$, the resulting approach scales as $\mathcal{O}(N)$. A similar argument is made for the second integral (2.21). In summary, the fast summation method computes the integrals (2.20) and (2.21) according to
$$\mathcal{I}_x^R[f](x_i) = e^{-\alpha \Delta x_i}\, \mathcal{I}_x^R[f](x_{i-1}) + \mathcal{J}_x^R[f](x_i), \quad \mathcal{I}_x^R[f](x_0) = 0, \quad i = 1, \cdots, N, \quad (2.23)$$
$$\mathcal{I}_x^L[f](x_i) = e^{-\alpha \Delta x_{i+1}}\, \mathcal{I}_x^L[f](x_{i+1}) + \mathcal{J}_x^L[f](x_i), \quad \mathcal{I}_x^L[f](x_N) = 0, \quad i = 0, \cdots, N-1, \quad (2.24)$$
where the local integrals are defined by
$$\mathcal{J}_x^R[f](x_i) = \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy, \quad i = 1, \cdots, N, \quad (2.25)$$
$$\mathcal{J}_x^L[f](x_i) = \alpha \int_{x_i}^{x_{i+1}} e^{-\alpha(y - x_i)} f(y)\, dy, \quad i = 0, \cdots, N-1. \quad (2.26)$$
Next, we discuss the approximations used in the evaluation of the local integrals.

2.3.3 Approximating the Local Integrals

Here, we present the general process used to obtain quadrature rules for the local integrals defined by (2.25) and (2.26), in the case of a uniform grid, i.e., $\Delta x = x_j - x_{j-1}$, $j = 1, \cdots, N$. Rather than use numerical quadrature rules, e.g., Gaussian quadrature or Newton-Cotes formulas, for purposes of stability, it was discovered that a certain form of analytical integration was required [50]. In this approach, the operand $f(x)$ is approximated by an interpolating function, which is then analytically integrated against the kernel. We provide a sketch of the approach to illustrate the idea. Specific details can be found in a number of papers, e.g., [34, 43, 44]. First, it is helpful to transform the integrals (2.25) and (2.26) using a change of variables. Consider the integral (2.25) and let
$$y = (x_j - x_{j-1})\tau + x_{j-1} \equiv \Delta x\, \tau + x_{j-1}, \quad \tau \in [0, 1].$$
Then we can write
$$\mathcal{J}_x^R[f](x_i) = \alpha \Delta x\, e^{-\alpha \Delta x} \int_0^1 e^{\alpha \tau \Delta x} f(\tau \Delta x + x_{i-1})\, d\tau. \quad (2.27)$$
Next, we approximate $f$ in (2.27) using interpolation of some desired order of accuracy.
As an example, suppose that we want to use linear interpolation with the data { 𝑓𝑖−1 , 𝑓𝑖 } using the basis {1, 𝑥 − 𝑥𝑖−1 }, which is shifted for convenience to cancel with the shift in (2.27). A direct calculation shows that the interpolating polynomial is 𝑓𝑖 − 𝑓𝑖−1 𝑝(𝑥) = 𝑓𝑖−1 + (𝑥 − 𝑥𝑖−1 ) . Δ𝑥 34 By replacing 𝑓 in (2.27) with the above interpolant, and integrating the result analytically, we find that ∫ 1   −𝛼Δ𝑥 J𝑥𝑅 [ 𝑓 ] (𝑥𝑖 ) ≈ 𝛼Δ𝑥𝑒 𝑒 𝛼𝜏Δ𝑥 𝑓𝑖−1 + ( 𝑓𝑖 − 𝑓𝑖−1 ) 𝜏 𝑑𝜏, 0 = 𝑤 0 𝑣 𝑖−1 + 𝑤 1 𝑣 𝑖 , where the weights for integration are 1 − 𝑒 −𝛼Δ𝑥 − 𝛼Δ𝑥𝑒 −𝛼Δ𝑥 𝑤0 = , 𝛼Δ𝑥 (𝛼Δ𝑥 − 1) + 𝑒 −𝛼Δ𝑥 𝑤1 = . 𝛼Δ𝑥 Modifications of the above can be made to accommodate additional interpolation points, as well as techniques for shock capturing. In the latter case, methods have been devised following the idea of WENO reconstruction [51] to create quadrature methods that can address non-smooth features including shocks and cusps [44, 45, 52]. Additional details on the WENO quadrature, including the reconstruction stencils can be found in chapter 3. In [45], we developed a quadrature rule using the exponential polynomial basis, which offers additional flexibility in capturing localized features through a “shape" parameter introduced in the basis. These tools offer a promising approach to addressing problems with discontinuities in the material properties as well as more complex domains with non-smooth boundaries. Despite the notable differences in the type of approximating function used for the operand, the process is essentially identical to the example shown here. We also wish to point out that certain issues may arise when 𝛼 ≫ 1 (i.e., Δ𝑡 ≪ 1). In such circumstances, when the weights are computed on-the-fly, the kernel function can be replaced with a Taylor expansion [38]. Otherwise, this results in a “narrow" Green’s function that is vastly under-resolved by the mesh, which causes wave phenomena to remain stagnant. 
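The linear-interpolation weights $w_0$ and $w_1$ above can be sanity-checked numerically: because the rule integrates the linear interpolant against the kernel exactly, it must reproduce the local integral $\mathcal{J}_x^R[f](x_i)$ exactly whenever $f$ itself is linear. The sketch below (Python/NumPy; the values of $\alpha$, $\Delta x$, and the linear operand are illustrative) compares the weighted sum $w_0 f_{i-1} + w_1 f_i$ against a fine composite-midpoint approximation of the local integral.

```python
import numpy as np

alpha, dx = 3.0, 0.1
nu = alpha * dx

# Weights obtained by integrating the linear interpolant analytically.
w0 = (1.0 - np.exp(-nu) - nu * np.exp(-nu)) / nu
w1 = ((nu - 1.0) + np.exp(-nu)) / nu

# Linear operand on the cell [x_{i-1}, x_i]; exactness should hold.
xi = 1.0
f = lambda y: 2.0 + 3.0 * y
J_weights = w0 * f(xi - dx) + w1 * f(xi)

# Reference: fine midpoint rule for J = alpha * int e^{-alpha(x_i - y)} f(y) dy.
y = np.linspace(xi - dx, xi, 20001)
ym = 0.5 * (y[:-1] + y[1:])
J_ref = alpha * np.sum(np.exp(-alpha * (xi - ym)) * f(ym)) * dx / 20000
print(abs(J_weights - J_ref))       # small
```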
Our experience has found this situation to be quite rare, but it is something to be aware of when a small CFL number is used in a simulation. 35 2.4 Applying Boundary Conditions In this section, we discuss the application of boundary conditions for the schemes based on BDF and centered time discretizations. Boundary conditions are presented for these methods from the perspective of one-dimensional problems in sections 2.4.1 and 2.4.2. Lastly, in section 2.4.3, we describe how these conditions can be used in multi-dimensional problems. 2.4.1 BDF Method The update for second order BDF method, in one-spatial dimension, can be obtained by combining (2.16) with the semi-discrete equation (2.8). Defining the operand 1 𝑛  1 𝑅(𝑥) = 5𝑢 − 4𝑢 𝑛−1 + 𝑢 𝑛−2 (𝑥) + 2 𝑆 𝑛+1 (𝑥), 2 𝛼 we obtain ∫ 𝑏 𝛼 𝑢 𝑛+1 (𝑥) = 𝑒 −𝛼|𝑥−𝑦| 𝑅(𝑦) 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.28) 2 𝑎 ≡ I𝑥 [𝑅] (𝑥) + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.29) where we have used I𝑥 [·] to denote the term involving the convolution integral which is not to be confused with the identity operator. Applying different boundary conditions amounts to determining the values of 𝐴 and 𝐵 used in (2.29). In the description of boundary conditions for the method, we shall assume that the boundary conditions at the ends of the one-dimensional domain are the same. Using slight variations of the methods illustrated below, one can mix the boundary conditions at the ends of the line segments. In order to enforce conditions on the derivatives of the solution, we will also need to compute a derivative of the update (2.28) (equivalently (2.29)). For this, we observe that the dependency for 𝑥 appears only on analytical functions, i.e., the Green’s function (kernel) and the exponential functions in the boundary terms. To differentiate (2.28) we start with the definition (2.22), which splits the integral at the point 𝑦 = 𝑥 and makes the kernel easier to manipulate. 
Then, using the fundamental theorem of calculus, we can calculate derivatives of (2.20) and (2.21) to find that
$$\frac{d}{dx}\left( \mathcal{I}_x^R[f](x) \right) = \frac{d}{dx}\left( \alpha \int_a^x e^{-\alpha(x-y)} f(y)\, dy \right) = -\alpha \mathcal{I}_x^R[f](x) + \alpha f(x), \quad (2.30)$$
$$\frac{d}{dx}\left( \mathcal{I}_x^L[f](x) \right) = \frac{d}{dx}\left( \alpha \int_x^b e^{-\alpha(y-x)} f(y)\, dy \right) = \alpha \mathcal{I}_x^L[f](x) - \alpha f(x). \quad (2.31)$$
These results can be combined according to (2.22), which provides an expression for the derivative of the convolution term:
$$\frac{d}{dx}\left( \mathcal{I}_x[f](x) \right) = \frac{\alpha}{2}\left( -\mathcal{I}_x^R[f](x) + \mathcal{I}_x^L[f](x) \right). \quad (2.32)$$
Additionally, by evaluating this equation at the ends of the interval, we obtain the identities
$$\frac{d}{dx}\left( \mathcal{I}_x[f] \right)(a) = \alpha \mathcal{I}_x[f](a), \quad (2.33)$$
$$\frac{d}{dx}\left( \mathcal{I}_x[f] \right)(b) = -\alpha \mathcal{I}_x[f](b), \quad (2.34)$$
which are helpful in enforcing the boundary conditions. The relation (2.32) can be used to obtain a derivative for the solution at the new time level. From the update (2.29), a direct computation reveals that
$$\frac{du^{n+1}}{dx} = \frac{\alpha}{2}\left( -\mathcal{I}_x^R[R](x) + \mathcal{I}_x^L[R](x) \right) - \alpha A e^{-\alpha(x-a)} + \alpha B e^{-\alpha(b-x)}. \quad (2.35)$$
Notice that no additional approximations have been made beyond what is needed to compute $\mathcal{I}_x^R$ and $\mathcal{I}_x^L$, which are needed in the base method. For this reason, we think of equation (2.35) as an analytical derivative. The boundary coefficients $A$ and $B$ appearing in (2.35) will be calculated in the same way as in the update (2.29), as discussed in the remaining subsections. This treatment ensures that the discrete derivative will be consistent with the conditions imposed on the solution variable.

2.4.1.1 Dirichlet Boundary Conditions

Suppose we are given the function values along the boundary, which are represented by the data
$$u^{n+1}(a) = g_a\left(t^{n+1}\right), \quad u^{n+1}(b) = g_b\left(t^{n+1}\right).$$
If we evaluate the BDF-2 update (2.29) at the ends of the interval, we obtain the conditions
$$g_a\left(t^{n+1}\right) = \mathcal{I}_x[R](a) + A + \mu B,$$
$$g_b\left(t^{n+1}\right) = \mathcal{I}_x[R](b) + \mu A + B,$$
where we have defined $\mu = e^{-\alpha(b-a)}$. This is a simple linear system for the boundary coefficients $A$ and $B$, which can be inverted by hand.
Proceeding, we find that    𝑔𝑎 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑎) − 𝜇 𝑔𝑏 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑏) 𝐴= , 1 − 𝜇2    𝑔𝑏 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑏) − 𝜇 𝑔𝑎 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑎) 𝐵= . 1 − 𝜇2 2.4.1.2 Neumann Boundary Conditions We can also enforce conditions on the derivatives at the end of the domain. Given the Neumann data 𝑑𝑢 𝑛+1 (𝑎)   𝑑𝑢 𝑛+1 (𝑏)   = ℎ𝑎 𝑡 𝑛+1 , = ℎ 𝑏 𝑡 𝑛+1 , 𝑑𝑥 𝑑𝑥 we can evaluate the derivative formula for the update (2.35) and use the identities (2.33) and (2.34). Performing these evaluations, we obtain the system of equations 1   −𝐴 + 𝜇𝐵 = ℎ𝑎 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑎), 𝛼 1   𝑛+1 −𝜇𝐴 + 𝐵 = ℎ 𝑏 𝑡 + I𝑥 [𝑅] (𝑏), 𝛼 where, again, 𝜇 = 𝑒 −𝛼(𝑏−𝑎) . Solving this system, we find that     1 𝑛+1 − I [𝑅] (𝑎) − 𝜇 1 ℎ 𝑡 𝑛+1 + I [𝑅] (𝑏) 𝛼 ℎ 𝑎 𝑡 𝑥 𝛼 𝑏 𝑥 𝐴=− , 1 − 𝜇2       𝜇 𝛼1 ℎ𝑎 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑎) − 𝛼1 ℎ 𝑏 𝑡 𝑛+1 + I𝑥 [𝑅] (𝑏) 𝐵=− . 1 − 𝜇2 We note that Robin boundary conditions, which combine Dirichlet and Neumann conditions can be enforced in a nearly identical way. 38 2.4.1.3 Periodic Boundary Conditions Periodic boundary conditions are enforced by taking 𝑢 𝑛+1 (𝑎) = 𝑢 𝑛+1 (𝑏), 𝜕𝑥 𝑢 𝑛+1 (𝑎) = 𝜕𝑥 𝑢 𝑛+1 (𝑏). Enforcing these conditions through the update (2.29) and its derivative (2.35), using the identities (2.33)-(2.34), leads to the system of equations (1 − 𝜇) 𝐴 + (𝜇 − 1)𝐵 = I𝑥 [𝑅] (𝑏) − I𝑥 [𝑅] (𝑎), (𝜇 − 1) 𝐴 + (𝜇 − 1)𝐵 = −I𝑥 [𝑅] (𝑏) − I𝑥 [𝑅] (𝑎), with 𝜇 = 𝑒 −𝛼(𝑏−𝑎) . The solution of this system, after some simplifications is given by I𝑥 [𝑅] (𝑏) 𝐴= , 1−𝜇 I𝑥 [𝑅] (𝑎) 𝐵= . 1−𝜇 2.4.1.4 Outflow Boundary Conditions In problems defined over free-space, we must allow for waves to exit the computational domain. Additionally, as the waves exit the domain, we would like to minimize the number of reflections, which are non-physical, along this boundary. Exit conditions can be formulated in one spatial dimension in the sense of characteristics, by requiring that 𝜕𝑢 𝜕𝑢 −𝑐 = 0, 𝑥 = 𝑎, (2.36) 𝜕𝑡 𝜕𝑥 𝜕𝑢 𝜕𝑢 +𝑐 = 0, 𝑥 = 𝑏, (2.37) 𝜕𝑡 𝜕𝑥 where 𝑐 > 0 is a wave speed. 
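Before pursuing the outflow case, the Dirichlet coefficients derived earlier in this section can be exercised end-to-end on a steady modified Helmholtz problem $(\mathcal{I} - \partial_{xx}/\alpha^2)w = f$. The sketch below (Python/NumPy; the manufactured solution $w = \sin x$, the grid size, and $\alpha$ are illustrative, and a dense trapezoidal rule stands in for the fast summation quadrature) builds the convolution term, solves the 2 × 2 system for $A$ and $B$, and checks the result against the manufactured solution.

```python
import numpy as np

alpha, a, b, N = 10.0, 0.0, 1.0, 400
x = np.linspace(a, b, N + 1)
h = (b - a) / N

# Manufactured problem: w(x) = sin(x) solves (I - dxx/alpha^2) w = f.
w_exact = np.sin(x)
f = (1.0 + 1.0 / alpha**2) * np.sin(x)

# Convolution I_x[f](x_i) = (alpha/2) int_a^b e^{-alpha|x_i - y|} f(y) dy,
# via the trapezoidal rule (the kernel kink at y = x_i lies on a node).
wts = np.full(N + 1, h)
wts[0] = wts[-1] = h / 2
K = 0.5 * alpha * np.exp(-alpha * np.abs(x[:, None] - x[None, :]))
I = K @ (wts * f)

# Dirichlet system: A + mu*B = g_a - I(a), mu*A + B = g_b - I(b).
mu = np.exp(-alpha * (b - a))
p, q = w_exact[0] - I[0], w_exact[-1] - I[-1]
A = (p - mu * q) / (1.0 - mu**2)
B = (q - mu * p) / (1.0 - mu**2)

w = I + A * np.exp(-alpha * (x - a)) + B * np.exp(-alpha * (b - x))
print(np.max(np.abs(w - w_exact)))      # small (quadrature-limited)
```

This dense evaluation costs $\mathcal{O}(N^2)$; it is the step that the recursive fast summation of section 2.3.2 reduces to $\mathcal{O}(N)$.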
To formulate the boundary conditions for the BDF-2 update (2.28), we follow the approach used in [34], which developed outflow boundary conditions for the central method, discussed in the next section. We start from the free-space solution that is defined over the real line: 𝛼 ∞ −𝛼|𝑥−𝑦| ∫ 𝑛+1 𝑢 (𝑥) = 𝑒 𝑅(𝑦) 𝑑𝑦. 2 −∞ 39 Here, 𝑅(𝑥) is the operand for the BDF-2 method and is defined as 1 𝑛  𝑅(𝑥) = 5𝑢 − 4𝑢 𝑛−1 + 𝑢 𝑛−2 . 2 The free-space solution can then be separated to isolate the computational domain over [𝑎, 𝑏], which we write as ∫ 𝑎 ∫ ∞ 𝛼 −𝛼|𝑥−𝑦| 𝛼 𝑢 𝑛+1 (𝑥) = I𝑥 [𝑅] (𝑥) + 𝑒 𝑅(𝑦) 𝑑𝑦 + 𝑒 −𝛼|𝑥−𝑦| 𝑅(𝑦) 𝑑𝑦. 2 −∞ 2 𝑏 If we use the definitions ∫ 𝛼 𝑎 −𝛼(𝑎−𝑦) 𝐴= 𝑒 𝑅(𝑦) 𝑑𝑦, (2.38) 2 −∞ 𝛼 ∞ −𝛼(𝑦−𝑏) ∫ 𝐵= 𝑒 𝑅(𝑦) 𝑑𝑦, (2.39) 2 𝑏 then the decomposed free-space solution can be written in the familiar form (2.29): 𝑢 𝑛+1 (𝑥) = I𝑥 [𝑅] (𝑥) + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) . Now assume that the support of 𝑢(𝑥, 0) and 𝑅(𝑥) is entirely contained within the domain [𝑎, 𝑏] initially. Using the finite speed of propagation 𝑐, it follows that the region of support at time 𝑡 𝑛 extends to [𝑎 − 𝑐𝑡 𝑛 , 𝑏 + 𝑐𝑡 𝑛 ], which means that the integrals (2.38) and (2.39) can be simplified to ∫ 𝛼 𝑎 𝑛 𝐴 = 𝑒 −𝛼(𝑎−𝑦) 𝑅(𝑦) 𝑑𝑦, (2.40) 2 𝑎−𝑐𝑡 𝑛 ∫ 𝑛 𝑛 𝛼 𝑏+𝑐𝑡 −𝛼(𝑦−𝑏) 𝐵 = 𝑒 𝑅(𝑦) 𝑑𝑦. (2.41) 2 𝑏 Since these integrals are defined over regions of space outside of the computational domain, the idea is now to exchange space with time, using the characteristics of equations (2.36) and (2.37). Along the right boundary, the waves propagate to the right so that for any 𝑥 > 𝑏, the solution to (2.37) is 𝑢(𝑥, 𝑡) = 𝑢(𝑥 − 𝑐𝑡). Tracing the ray backwards in time, we see that  𝑦 𝑢(𝑏 + 𝑦, 𝑡) = 𝑢 𝑏, 𝑡 − , 𝑦 > 0. (2.42) 𝑐 Note that on a space-time plot, the characteristic has a slope of 𝑐−1 . Similarly, for along the left boundary, we have that 𝑥 < 𝑎, and the solution to (2.36) is given by 𝑢(𝑥, 𝑡) = 𝑢(𝑥 + 𝑐𝑡). Here, the 40 characteristic has slope −𝑐−1 , so we obtain  𝑦 𝑢(𝑎 − 𝑦, 𝑡) = 𝑢 𝑎, 𝑡 − , 𝑦 > 0. 
$$u(a - y, t) = u\left(a,\, t - \frac{y}{c}\right), \quad y > 0. \quad (2.43)$$
In other words, since the data remains constant along characteristics, it follows that any point outside of the computational domain corresponds to a boundary point at some earlier time. We now illustrate the process for converting space integrals to recursive time integrals by considering the integral (2.41). The argument for (2.40) is identical and is therefore omitted. To use the ray tracing formula (2.42), we shift the integration variable so that
$$\begin{aligned} B^n &= \frac{\alpha}{2} \int_b^{b + ct^n} e^{-\alpha(y - b)} R(y)\, dy, \\ &= \frac{\alpha}{2} \int_0^{ct^n} e^{-\alpha y} R(b + y)\, dy, \\ &\equiv \frac{\alpha}{2} \int_0^{ct^n} e^{-\alpha y} \left[ \frac{1}{2}\left( 5u^n(b+y) - 4u^{n-1}(b+y) + u^{n-2}(b+y) \right) \right] dy, \\ &= \frac{\alpha}{2} \int_0^{ct^n} e^{-\alpha y} \left[ \frac{1}{2}\left( 5u\left(b, t^n - y/c\right) - 4u\left(b, t^{n-1} - y/c\right) + u\left(b, t^{n-2} - y/c\right) \right) \right] dy, \end{aligned}$$
where the last line applies (2.42) to the time history. If we now apply another transformation $y/c \mapsto y$, then the integral becomes
$$\begin{aligned} B^n &= \frac{\alpha c}{2} \int_0^{t^n} e^{-\alpha c y} \left[ \frac{1}{2}\left( 5u(b, t^n - y) - 4u\left(b, t^{n-1} - y\right) + u\left(b, t^{n-2} - y\right) \right) \right] dy, \\ &= \frac{\alpha c}{2} \int_0^{\Delta t} e^{-\alpha c y} \left[ \frac{1}{2}\left( 5u(b, t^n - y) - 4u\left(b, t^{n-1} - y\right) + u\left(b, t^{n-2} - y\right) \right) \right] dy \\ &\quad + \frac{\alpha c}{2} \int_{\Delta t}^{t^n} e^{-\alpha c y} \left[ \frac{1}{2}\left( 5u(b, t^n - y) - 4u\left(b, t^{n-1} - y\right) + u\left(b, t^{n-2} - y\right) \right) \right] dy, \\ &\equiv I_1 + I_2. \end{aligned}$$
The integral $I_2$ exhibits a recursive form. If we shift the integration bounds by $-\Delta t$, then it follows that $I_2$ is now given by
$$\begin{aligned} I_2 &= \frac{\alpha c}{2} \int_0^{t^{n-1}} e^{-\alpha c (y + \Delta t)} \left[ \frac{1}{2}\left( 5u\left(b, t^{n-1} - y\right) - 4u\left(b, t^{n-2} - y\right) + u\left(b, t^{n-3} - y\right) \right) \right] dy, \\ &= e^{-\alpha c \Delta t} \left( \frac{\alpha c}{2} \int_0^{t^{n-1}} e^{-\alpha c y} \left[ \frac{1}{2}\left( 5u\left(b, t^{n-1} - y\right) - 4u\left(b, t^{n-2} - y\right) + u\left(b, t^{n-3} - y\right) \right) \right] dy \right), \\ &\equiv e^{-\alpha c \Delta t} B^{n-1}. \quad (2.44) \end{aligned}$$
The integral term $I_1$, which is defined over $[0, \Delta t]$, can be mapped to the interval $[0, 1]$ using the transformation $y = z\Delta t$, $z \in [0, 1]$. Then it follows that
$$I_1 = \frac{\alpha c \Delta t}{2} \int_0^1 e^{-\alpha c \Delta t z} \left[ \frac{1}{2}\left( 5u(b, t^n - z\Delta t) - 4u\left(b, t^{n-1} - z\Delta t\right) + u\left(b, t^{n-2} - z\Delta t\right) \right) \right] dz. \quad (2.45)$$
To evaluate this local integral (in time), we use a stored set of time history data at the boundary point to construct an interpolating function.
After inserting this approximation into (2.45), the integration can be performed analytically to generate the weights for the time history data. This is done in the same spirit as the approach for computing quadrature weights to approximate the local integrals, which was discussed in section 2.3.3. We demonstrate this process using two approaches, which differ in the stencil used to interpolate the time history. A simple way to approximate the integral (2.45) is to build an "explicit" interpolating function based on the time history $\left\{ u^{n-2}(b), u^{n-1}(b), u^n(b) \right\}$. Using the algebraic polynomial basis and performing the integration of the resulting interpolant analytically, we obtain an approximation of the form
$$I_1 \approx \gamma_0 u^n(b) + \gamma_1 u^{n-1}(b) + \gamma_2 u^{n-2}(b),$$
where the outflow weights are
$$\gamma_0 = -\frac{5\nu + 2e^{-\nu} + \nu^2 e^{-\nu} - 5\nu^2 - 3\nu e^{-\nu} - 2}{4\nu^2},$$
$$\gamma_1 = -\frac{\nu^2 e^{-\nu} - 2e^{-\nu} - 4\nu + 2\nu^2 + 2\nu e^{-\nu} + 2}{2\nu^2},$$
$$\gamma_2 = \frac{e^{-\nu}(\nu - 1)\left(\nu - 2e^{\nu} + \nu e^{\nu} + 2\right)}{4\nu^2},$$
with $\nu = \sqrt{2}$. Combining this result with (2.44), we obtain the explicit update formula
$$B^n = e^{-\alpha c \Delta t} B^{n-1} + \gamma_0 u^n(b) + \gamma_1 u^{n-1}(b) + \gamma_2 u^{n-2}(b). \quad (2.46)$$
In the case of outflow boundary conditions for the second-order, time-centered update, which was presented in [34], the authors mentioned that including $u^{n+1}(b)$ in the interpolation stencil is necessary for a convergent outflow method (see Remark 3 in [34]). Numerical experiments, which we present in section 2.6, seem to suggest otherwise. This includes the BDF-2 method with the explicit form of outflow given by (2.46). An "implicit" form of the outflow procedure can be obtained by including time level $n+1$ data in the interpolation stencil used to approximate the integral (2.45).
Repeating the steps shown above, we obtain a corresponding implicit formula 𝐼1 ≈ 𝛾0 𝑢 𝑛+1 (𝑏) + 𝛾1 𝑢 𝑛 (𝑏) + 𝛾2 𝑢 𝑛−1 (𝑏) + 𝛾3 𝑢 𝑛−2 (𝑏), where the outflow weights are 𝜈 2 𝑒 −𝜈 − 6𝑒 −𝜈 − 12𝜈 − 3𝜈 3 𝑒 −𝜈 + 8𝜈 2 + 6𝜈𝑒 −𝜈 + 6 𝛾0 = − , 12𝜈 3 4𝜈 2 𝑒 −𝜈 − 6𝑒 −𝜈 − 10𝜈 − 4𝜈 3 𝑒 −𝜈 + 3𝜈 2 + 5𝜈 3 + 4𝜈𝑒 −𝜈 + 6 𝛾1 = , 4𝜈 3 5𝜈 2 𝑒 −𝜈 − 8𝜈 − 𝜈 3 𝑒 −𝜈 + 4𝜈 3 + 2𝜈𝑒 −𝜈 + 6 𝛾2 = − , 4𝜈 3 6𝜈 + 6𝑒 −𝜈 − 4𝜈 2 𝑒 −𝜈 + 𝜈 2 − 3𝜈 3 − 6 𝛾3 = − , 12𝜈 3 √ again, with 𝜈 = 2. To deal with the appearance of 𝑢 𝑛+1 (𝑏) in the approximation of (2.45), we appeal to the update for the BDF-2 scheme (2.29). Assuming that outflow is applied to the left boundary point, as well, we can repeat the above process to obtain a linear system, as with the other types of boundary conditions discussed earlier. This results in the linear system (1 − 𝛾0 ) 𝐴𝑛 − 𝛾0 𝜇𝐵𝑛 = 𝑒 −𝛼𝑐Δ𝑡 𝐴𝑛−1 + 𝛾0 I𝑥 [𝑅] (𝑎) + 𝛾1 𝑢 𝑛 (𝑎) + 𝛾2 𝑢 𝑛−1 (𝑎) + 𝛾3 𝑢 𝑛−2 (𝑎) ≡ 𝑓𝑎 , −𝛾0 𝜇 𝐴𝑛 + (1 − 𝛾0 )𝐵𝑛 = 𝑒 −𝛼𝑐Δ𝑡 𝐵𝑛−1 + 𝛾0 I𝑥 [𝑅] (𝑏) + 𝛾1 𝑢 𝑛 (𝑏) + 𝛾2 𝑢 𝑛−1 (𝑏) + 𝛾3 𝑢 𝑛−2 (𝑏) ≡ 𝑓𝑏 , 43 where 𝜇 = 𝑒 −𝛼(𝑏−𝑎) and we have used 𝑓𝑎 and 𝑓𝑏 to indicate the right side of the system for brevity. This system can be analytically inverted and leads to the solution (1 − 𝛾0 ) 𝑓𝑎 + 𝛾0 𝜇 𝑓𝑏 𝐴𝑛 = , (1 − 𝛾0 ) 2 − (𝛾0 𝜇) 2 𝛾0 𝜇 𝑓𝑎 + (1 − 𝛾0 ) 𝑓𝑏 𝐵𝑛 = . (1 − 𝛾0 ) 2 − (𝛾0 𝜇) 2 This finishes the introduction for outflow boundary conditions. In section 2.4.3, we mention some of the caveats in applying the aforementioned boundary conditions to problems in a multi- dimensional setting. Next, in section 2.4.2, we present the boundary conditions for the second-order, time-centered scheme. 2.4.2 Centered Method The update for second order time centered method, in one spatial dimension, can be obtained by combining (2.16) with the semi-discrete equation (2.10) to obtain ∫ 𝑏 " # ! 
𝛼 1 𝑢 𝑛+1 (𝑥) = (2 − 𝛽2 )𝑢 𝑛 − 𝑢 𝑛−1 + 𝛽2 𝑒 −𝛼|𝑥−𝑦| 𝑢 𝑛 + 2 𝑆 𝑛 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , 2 𝑎 𝛼 (2.47)   1 𝑛 ≡ (2 − 𝛽2 )𝑢 𝑛 − 𝑢 𝑛−1 + 𝛽2 I𝑥 𝑢 𝑛 + 𝑆 (𝑥) + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.48) 𝛼2 where we have, again, used I𝑥 [·] to denote the convolution integral term. By specifying different conditions on the boundary data for the solution 𝑢, we can identify the corresponding values of 𝐴 and 𝐵 in the update (2.48). As with the BDF methods discussed in the previous sections, we will assume that the boundary conditions along the line are the same, though this can be generalized to mixed boundary conditions with minimal modifications. The techniques used to obtain boundary conditions are nearly identical to the BDF-2 method shown in the previous section, so we skip the details and simply state the results. As with the BDF method, we can obtain a method for constructing derivatives by directly differentiating the update (2.48). Making use of the identity (2.32), the derivative at the new time 44 level is found to be       𝑑𝑢 𝑛+1 2 𝑑𝑢 𝑛 𝑑𝑢 𝑛−1 𝛽2 𝛼 𝑅 𝑛 1 𝑛 𝐿 𝑛 1 𝑛 = (2 − 𝛽 ) − + −I𝑥 𝑢 + 2 𝑆 (𝑥) + I𝑥 𝑢 + 2 𝑆 (𝑥) 𝑑𝑥 𝑑𝑥 𝑑𝑥 2 𝛼 𝛼 − 𝛼𝐴𝑒 −𝛼(𝑥−𝑎) + 𝛼𝐵𝑒 −𝛼(𝑏−𝑥) (2.49) In contrast to the derivative (2.35) obtained with the BDF-2 method, we can see that the derivative of the time-centered method includes a time history for the derivative itself. We have not shown that computing derivatives with a recursive approach of this form is a stable process. Otherwise, no additional approximations have been made beyond what is needed to compute I𝑥𝑅 and I𝑥𝐿 . As with the BDF method, the boundary coefficients 𝐴 and 𝐵 appearing in (2.49) will be calculated in the same way as the update (2.48). 2.4.2.1 Dirichlet Boundary Conditions Dirichlet boundary conditions for which we are given the data     𝑢 𝑛+1 (𝑎) = 𝑔𝑎 𝑡 𝑛+1 , 𝑢 𝑛+1 (𝑏) = 𝑔𝑏 𝑡 𝑛+1 , can be enforced by solving the linear system that results from evaluating (2.48) at the ends of the domain. 
The solution to this system is
$$A = -\frac{f_a - \mu f_b}{1 - \mu^2}, \quad B = -\frac{f_b - \mu f_a}{1 - \mu^2},$$
where we have used $\mu = e^{-\alpha(b-a)}$ and
$$f_a = \mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](a) - g_a(t^n) - \frac{g_a\left(t^{n+1}\right) - 2g_a(t^n) + g_a\left(t^{n-1}\right)}{\beta^2},$$
$$f_b = \mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](b) - g_b(t^n) - \frac{g_b\left(t^{n+1}\right) - 2g_b(t^n) + g_b\left(t^{n-1}\right)}{\beta^2},$$
for brevity.

2.4.2.2 Neumann Boundary Conditions

The boundary coefficients associated with Neumann boundary conditions
$$\frac{du^{n+1}(a)}{dx} = h_a\left(t^{n+1}\right), \quad \frac{du^{n+1}(b)}{dx} = h_b\left(t^{n+1}\right),$$
can be determined using the derivative (2.49) with the aid of the identities (2.33) and (2.34). The resulting linear system has the solution
$$A = \frac{w_a - \mu w_b}{1 - \mu^2}, \quad B = \frac{w_b - \mu w_a}{1 - \mu^2},$$
where we have used the definitions
$$w_a = \mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](a) - \frac{1}{\alpha} h_a(t^n) - \frac{h_a\left(t^{n+1}\right) - 2h_a(t^n) + h_a\left(t^{n-1}\right)}{\alpha\beta^2},$$
$$w_b = \mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](b) + \frac{1}{\alpha} h_b(t^n) - \frac{h_b\left(t^{n+1}\right) - 2h_b(t^n) + h_b\left(t^{n-1}\right)}{\alpha\beta^2},$$
and $\mu = e^{-\alpha(b-a)}$. An approach for Robin boundary conditions follows a nearly identical path.

2.4.2.3 Periodic Boundary Conditions

For periodic boundary conditions, we assume that for any time level $n > 0$,
$$u^{n+1}(a) = u^{n+1}(b), \quad \partial_x u^{n+1}(a) = \partial_x u^{n+1}(b).$$
To enforce these conditions, we can appeal to the update (2.48), its derivative (2.49), and the identities (2.33) and (2.34). By solving the linear system obtained from the evaluation of these updates at the ends of the domain, we obtain the coefficients
$$A = \frac{\mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](b)}{1 - \mu}, \quad B = \frac{\mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](a)}{1 - \mu},$$
where, again, $\mu = e^{-\alpha(b-a)}$.

2.4.2.4 Outflow Boundary Conditions

The procedure used to derive outflow boundary coefficients for the time-centered method was originally presented in [34], so we shall skip many of the details here. In fact, the developments for the time-centered method can be treated as a simplification of the process used in section 2.4.1.4, which developed the outflow coefficients for the BDF-2 method.
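The periodic coefficients above, combined with the time-centered update, give a complete time-stepping loop. As a quick end-to-end check of the semi-discrete equation (2.10), the sketch below advances a standing wave of the 1-D wave equation with periodic boundary conditions and compares against the exact solution. For brevity, the inversion of $(\mathcal{I} - \Delta/\alpha^2)$ is done spectrally with an FFT rather than with the integral solver of section 2.3; the choices $\beta = 1$ (within the stable range $0 < \beta \le 2$), the grid, and the standing-wave data are illustrative.

```python
import numpy as np

c, beta, dt, N, steps = 1.0, 1.0, 0.01, 64, 100
alpha = beta / (c * dt)
x = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
k = np.fft.fftfreq(N, d=2.0 * np.pi / N) * 2.0 * np.pi   # integer wave numbers

# Exact standing wave u = cos(x) cos(c t); seed two time levels exactly.
u_prev = np.cos(x)                        # t = 0
u_curr = np.cos(x) * np.cos(c * dt)       # t = dt

# Centered update: (I - Lap/alpha^2)[u^{n+1} - (2 - beta^2) u^n + u^{n-1}] = beta^2 u^n,
# inverted mode-by-mode with the multiplier 1 / (1 + k^2/alpha^2).
minv = 1.0 / (1.0 + (k / alpha) ** 2)
for _ in range(steps - 1):
    rhs = beta**2 * np.fft.fft(u_curr)
    u_next = (2.0 - beta**2) * u_curr - u_prev + np.real(np.fft.ifft(minv * rhs))
    u_prev, u_curr = u_curr, u_next

t_final = steps * dt
err = np.max(np.abs(u_curr - np.cos(x) * np.cos(c * t_final)))
print(err)                                # small dispersive error, O(dt^2)
```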
Repeating the steps shown in section 2.4.1.4 with the central scheme, one finds that
\[
B^n = \frac{\alpha c}{2}\int_0^{t^n} e^{-\alpha c y}\, u(b, t^n - y)\,dy
= \frac{\alpha c}{2}\int_0^{\Delta t} e^{-\alpha c y}\, u(b, t^n - y)\,dy + \frac{\alpha c}{2}\int_{\Delta t}^{t^n} e^{-\alpha c y}\, u(b, t^n - y)\,dy
\equiv I_1 + I_2,
\]
where the second integral $I_2$ can be written in the form
\[
I_2 = \frac{\alpha c}{2}\int_0^{t^{n-1}} e^{-\alpha c(y+\Delta t)}\, u\left(b, t^{n-1} - y\right)dy
= e^{-\beta}\left(\frac{\alpha c}{2}\int_0^{t^{n-1}} e^{-\alpha c y}\, u\left(b, t^{n-1} - y\right)dy\right)
\equiv e^{-\beta}B^{n-1}. \tag{2.50}
\]
Note that the simplifications shown above use the definition $\alpha = \beta/(c\Delta t)$ for the central scheme. Likewise, the first integral $I_1$ can be expressed as
\[
I_1 = \frac{\beta}{2}\int_0^1 e^{-\beta z}\, u\left(b, t^n - z\Delta t\right)dz, \tag{2.51}
\]
using the same definition $\alpha = \beta/(c\Delta t)$. As with the BDF-2 method, this integral can be approximated in an explicit or implicit manner, depending on the choice of points used to create the interpolating function for $u(b, t^n - z\Delta t)$.

An explicit outflow method can be obtained if we approximate the function $u(b, t^n - z\Delta t)$ using algebraic polynomials with the stencil $\left\{u^{n-2}(b), u^{n-1}(b), u^n(b)\right\}$. Integrating this analytically and combining the result with (2.50), we obtain the explicit outflow method
\[
B^n = e^{-\beta}B^{n-1} + \gamma_0 u^n(b) + \gamma_1 u^{n-1}(b) + \gamma_2 u^{n-2}(b),
\]
where the integration weights for the second-order, time-centered method are given by
\[
\gamma_0 = \frac{2e^{-\beta} - \beta + 2\beta^2 + 3\beta e^{-\beta} - 2}{4\beta^2}, \qquad
\gamma_1 = -\frac{2e^{-\beta} - 2\beta + 3\beta^2 e^{-\beta} + 4\beta e^{-\beta} - 2}{6\beta^2}, \qquad
\gamma_2 = -\frac{\beta + 2e^{-\beta} + \beta e^{-\beta} - 2}{12\beta^2}.
\]
An implicit form of the outflow method for the central-2 scheme can be obtained using the stencil $\left\{u^{n-1}(b), u^n(b), u^{n+1}(b)\right\}$ to approximate $u(b, t^n - z\Delta t)$. Using the update (2.48) eliminates the variable $u^{n+1}(b)$, which is not yet available, from the interpolation formula.
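The explicit recursion above is easy to sanity-check numerically: for a constant boundary history the weights must integrate $e^{-\beta z}$ exactly, and the recursion must converge to the exact steady convolution value of $1/2$. A short sketch (Python, with an illustrative choice of $\beta$):

```python
import numpy as np

beta = 1.0
d = np.exp(-beta)

# explicit second-order outflow weights, as quoted above
g0 = (2*d - beta + 2*beta**2 + 3*beta*d - 2) / (4*beta**2)
g1 = -(2*d - 2*beta + 3*beta**2*d + 4*beta*d - 2) / (6*beta**2)
g2 = -(beta + 2*d + beta*d - 2) / (12*beta**2)

# consistency: for a constant history the weights must reproduce
#   I_1 = (beta/2) * int_0^1 e^{-beta z} dz = (1 - e^{-beta}) / 2
weight_sum = g0 + g1 + g2

# recursion B^n = e^{-beta} B^{n-1} + g0 u^n + g1 u^{n-1} + g2 u^{n-2},
# driven by u(b, t) == 1; the exact convolution tends to 1/2
B = 0.0
for _ in range(500):
    B = d * B + weight_sum
```

The first check verifies the quadrature weights against the exact integral of the kernel; the second verifies that the time-history recursion reproduces the full convolution in the steady limit.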
Repeating the steps for the other boundary point, we obtain the linear system
\[
\left(1 - \beta^2\gamma_0\right)A^n - \mu\beta^2\gamma_0\, B^n = f_a, \qquad
-\mu\beta^2\gamma_0\, A^n + \left(1 - \beta^2\gamma_0\right)B^n = f_b,
\]
with
\[
f_a = e^{-\beta}A^{n-1} + \gamma_0\beta^2\,\mathcal{I}_x\left[u^n + \frac{1}{\alpha^2}S^n\right](a) + \left((2-\beta^2)\gamma_0 + \gamma_1\right)u^n(a) + (\gamma_2 - \gamma_0)u^{n-1}(a),
\]
\[
f_b = e^{-\beta}B^{n-1} + \gamma_0\beta^2\,\mathcal{I}_x\left[u^n + \frac{1}{\alpha^2}S^n\right](b) + \left((2-\beta^2)\gamma_0 + \gamma_1\right)u^n(b) + (\gamma_2 - \gamma_0)u^{n-1}(b),
\]
and $\mu = e^{-\alpha(b-a)}$. The solution to this linear system is given by
\[
A^n = \frac{\left(1 - \beta^2\gamma_0\right)f_a + \mu\beta^2\gamma_0 f_b}{\left(1 - \beta^2\gamma_0\right)^2 - \left(\mu\beta^2\gamma_0\right)^2}, \qquad
B^n = \frac{\mu\beta^2\gamma_0 f_a + \left(1 - \beta^2\gamma_0\right)f_b}{\left(1 - \beta^2\gamma_0\right)^2 - \left(\mu\beta^2\gamma_0\right)^2},
\]
where the corresponding integration weights are defined as
\[
\gamma_0 = -\frac{\beta + 2e^{-\beta} + \beta e^{-\beta} - 2}{4\beta^2}, \qquad
\gamma_1 = \frac{2e^{-\beta} + \beta^2 + 2\beta e^{-\beta} - 2}{2\beta^2}, \qquad
\gamma_2 = -\frac{2e^{-\beta} - \beta + 2\beta^2 e^{-\beta} + 3\beta e^{-\beta} - 2}{4\beta^2}.
\]

2.4.3 Some Remarks for Multi-dimensional Problems

In this section, we briefly discuss issues concerning the application of boundary conditions for the multi-dimensional updates given by (2.11) for the BDF-2 method and (2.12) for the time-centered method. For convenience, these are given, respectively, by
\[
\mathcal{L}_x\mathcal{L}_y u^{n+1} = \frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) + \frac{1}{\alpha^2}S^{n+1}(\mathbf{x}), \qquad \alpha := \frac{\sqrt{2}}{c\Delta t},
\]
and
\[
\mathcal{L}_x\mathcal{L}_y\left[u^{n+1} - (2-\beta^2)u^n + u^{n-1}\right](x,y) = \beta^2\left[u^n + \frac{1}{\alpha^2}S^n\right](x,y), \qquad \alpha := \frac{\beta}{c\Delta t}, \quad 0 < \beta \le 2.
\]
By inverting the factored operator one direction at a time using the techniques presented in section 2.3, it follows that the solutions to these two equations are given, respectively, by
\[
u^{n+1} = \mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[\frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) + \frac{1}{\alpha^2}S^{n+1}\right]\right], \qquad \alpha := \frac{\sqrt{2}}{c\Delta t},
\]
and
\[
u^{n+1} = (2-\beta^2)u^n - u^{n-1} + \beta^2\,\mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[u^n + \frac{1}{\alpha^2}S^n\right]\right], \qquad \alpha := \frac{\beta}{c\Delta t}, \quad 0 < \beta \le 2.
\]
We wish to point out that solutions are assumed to be smooth, so the ordering conventions used for the operators are irrelevant, i.e., $\mathcal{L}_x\mathcal{L}_y = \mathcal{L}_y\mathcal{L}_x$.

2.4.3.1 Sweeping Patterns in Multi-dimensional Problems

In the two-dimensional case, we need to construct terms of the form
\[
\mathcal{L}_y\mathcal{L}_x w = f \implies w = \mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[f\right]\right],
\]
with boundary data being prescribed for the variable $w$. The construction is performed over two steps.
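The commutativity $\mathcal{L}_x\mathcal{L}_y = \mathcal{L}_y\mathcal{L}_x$ noted above is what allows the sweep order to be chosen freely. The sketch below illustrates this with a finite-difference stand-in for the one-dimensional operators (an assumption made purely for brevity; the actual solver inverts $\mathcal{L}$ through the integral formulation) on a doubly periodic grid:

```python
import numpy as np

# Finite-difference stand-in for the one-dimensional modified Helmholtz
# operator L = I - (1/alpha^2) d^2/dx^2 on a periodic grid.
def helmholtz_matrix(N, h, alpha):
    D2 = -2.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
    D2[0, -1] = D2[-1, 0] = 1.0          # periodic wrap
    return np.eye(N) - D2 / (alpha * h)**2

N, alpha = 64, 5.0
h = 2 * np.pi / N
x = np.arange(N) * h
f = np.sin(x)[:, None] * np.cos(x)[None, :]    # f(x, y) = sin(x) cos(y)
L = helmholtz_matrix(N, h, alpha)

# invert line-by-line: y-sweep then x-sweep, and the reverse order
w_yx = np.linalg.solve(L, np.linalg.solve(L, f.T).T)   # L_y^{-1}, then L_x^{-1}
w_xy = np.linalg.solve(L, np.linalg.solve(L, f).T).T   # L_x^{-1}, then L_y^{-1}

# f is a discrete eigenfunction of L in each direction, so the result is f / s^2
s = 1.0 + (2.0 - 2.0 * np.cos(h)) / (alpha * h)**2
```

Because the two solves act on different array axes, the two orderings agree to roundoff, and for the discrete eigenfunction $\sin(x)\cos(y)$ the result matches the closed-form eigenvalue computation.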
The first step inverts the $y$ operator, so we obtain
\[
\mathcal{L}_x w = \mathcal{L}_y^{-1}\left[f\right]. \tag{2.52}
\]
The step (2.52) requires boundary data for the intermediate variable $\mathcal{L}_x w$, when we are only given boundary data for $w$. From the definition of $\mathcal{L}_x$, we note that
\[
\mathcal{L}_x w \equiv \left(\mathcal{I} - \frac{1}{\alpha^2}\partial_{xx}\right)w = w + \mathcal{O}\left(\frac{1}{\alpha^2}\right). \tag{2.53}
\]
In other words, boundary conditions for $\mathcal{L}_x w$ can be approximated to second-order in time by those of $w$; however, unless we are dealing with outflow boundary conditions, we do not need to sweep along the boundary of the domain, so the approximation (2.53) is not necessary. Proceeding further, the second step of the inversion process leads to the solution $w$,
\[
w = \mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[f\right]\right], \tag{2.54}
\]
which simply enforces the known boundary data on $w$. For purposes of clarity, we separate the types of boundary conditions. In each case, we summarize the changes associated with moving to multi-dimensional problems, which includes any changes necessitated by the proposed methods for calculating derivatives.

2.4.3.2 Periodic Boundary Conditions

Periodic boundary conditions in multi-dimensional problems can be enforced in a straightforward way by directly applying the one-dimensional approaches outlined in sections 2.4.1.3 and 2.4.2.3 to each dimension. No modifications are required for either the scheme or the proposed derivative methods.

2.4.3.3 Dirichlet Boundary Conditions

In the case of Dirichlet boundary conditions, the values of the function are known along the boundary. Therefore, we only need to update the grid points corresponding to the interior of the domain. As mentioned earlier, rather than approximating the boundary conditions for the intermediate sweep, e.g., via (2.53), we simply avoid sweeping along the boundary points of the domain, since the values there are known. The direction corresponding to the intermediate sweep will now only use boundary data set by the solution, since the boundaries are left untouched.
In the case of homogeneous Dirichlet conditions, the sweeps can be performed on the boundary with no effect. When sweeping over different directions for the derivatives, we note that the derivative information is not known along the boundary. Therefore, the sweeps should extend all the way to the boundary; otherwise, the derivative will not be available there. In this case, the boundary data for the intermediate variable can be approximated according to (2.53).

2.4.3.4 Neumann Boundary Conditions

The treatment of Neumann boundary conditions in multi-dimensional problems is identical to the Dirichlet case discussed in the previous section for Cartesian grids. This also applies to the proposed methods for computing derivatives. In problems defined on complex geometries with embedded boundaries, based on the theory presented in [53], it was discovered that dissipation was necessary to obtain stable numerical solutions [34]. While this is not a problem for the BDF-2 method, which is dissipative, problems occur for the centered scheme. This motivated the introduction of a tunable dissipation term using the successive convolution framework, for which we provide an overview in section 2.5.1. We also wish to note that Robin boundary conditions follow an identical approach.

2.4.3.5 Outflow Boundary Conditions

In the case of outflow boundary conditions, one should pay careful attention to the structure of the time history used in the convolution integral for a particular scheme. For example, the integrand for the second-order BDF method (2.11) uses data from three time levels, $\{u^{n-2}, u^{n-1}, u^n\}$, while the time-centered method uses a single time level, i.e., $u^n$, for this term. This led to two different outflow procedures, which were presented in sections 2.4.1.4 and 2.4.2.4. When dealing with outflow boundaries, the sweeps can be performed along the boundary, since these values are unspecified. We note that this includes derivatives in addition to the fields.
Next, we focus on the structure of the sweeping patterns used in the methods, beginning with the time-centered scheme and then moving to the BDF method. In the time-centered approach (2.12), the structure of the time history used in the convolution integral follows a regular pattern in the sense that the first and second layers of sweeps defined by (2.52) and (2.54) operate on data from a single time level. Assuming the sweeping pattern (2.54), we first need to evaluate the term
\[
v^{(1)} = \mathcal{L}_y^{-1}\left[u^n + \frac{1}{\alpha^2}S^n\right],
\]
which requires a time history for the solution $u$ at the $x$ boundary points along the $y$-direction. The implicit approach to outflow for the time-centered method is constructed using the one-dimensional update (2.48), which is not consistent with what we have written here. Instead, we recommend that the explicit form of the outflow procedure be used for the calculation of the boundary data. For the second set of sweeps given by (2.54), we need to compute
\[
v^{(2)} = \mathcal{L}_x^{-1}\left[v^{(1)}\right],
\]
which similarly requires a time history for $v^{(1)}$, now at the $y$ boundary points along the $x$-direction. Since this is the last layer of sweeps, one is free to use either the implicit or explicit forms of outflow to construct the boundary terms. While we have proposed approaches for computing derivatives for the time-centered scheme, we find the structure of the BDF derivatives (2.35) more appealing. Therefore, we shall not discuss multi-dimensional aspects of the derivatives for the time-centered method any further. This will be clarified when we present numerical results in section 2.6.

The BDF-2 scheme (2.11), in contrast to the time-centered method, has other nuances which should be addressed. Again, if we assume a sweeping pattern of the form (2.54), we first need to evaluate
\[
v^{(1)} = \mathcal{L}_y^{-1}\left[\frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) + \frac{1}{\alpha^2}S^{n+1}\right], \tag{2.55}
\]
which requires a time history for the solution $u$ at the $x$ boundary points along the $y$-direction.
The construction of the boundary coefficients for this term follows the procedure presented in section 2.4.1.4; however, when one changes directions to perform the second layer of sweeps following (2.54), we see that we now need to construct a term of the form
\[
v^{(2)} = \mathcal{L}_x^{-1}\left[v^{(1)}\right],
\]
which is identical in structure to the term found with the central-2 scheme. Computing this term with outflow boundary conditions requires that we store a time history for $v^{(1)}$ at the $y$ boundary points along the $x$-direction. This requires us to change the reconstruction methods based on the order in which sweeps are performed, which creates additional complications if we wish to interchange the sweep ordering. Instead, for the multi-dimensional BDF-2 update, we Taylor expand the time history in the convolution integral (2.55) about time level $n$, so that it looks like the integrand for the time-centered scheme. Assuming the source is not present at the boundary, the integrand can be approximated as
\[
\frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) = u^n + \mathcal{O}(\Delta t).
\]
While this modification results in a loss of accuracy when the method is applied in outflow problems, it creates a regular structure that is easier to work with in the numerical implementation, since the outflow procedure for the time-centered method can now be used in both directions. A consequence of this decision is that the approach is only compatible with explicit outflow, because the implicit form of outflow uses knowledge of the particular update, which would be inconsistent.

Lastly, we remark on the treatment of derivatives obtained with the BDF method in the multi-dimensional case. In section 2.4.1, we obtained an equation for spatial derivatives through direct calculations involving the one-dimensional BDF-2 method, which resulted in (2.35). We would like to apply these one-dimensional derivatives in multi-dimensional problems, as well.
Since the directions over which sweeps occur are treated independently, the one-dimensional derivatives can also be applied in a dimensionally-split manner. When combined with outflow boundary conditions, care should be taken to ensure that the time history used in the reconstruction is consistent with the operand of a particular convolution integral. To illustrate, assuming a sweeping pattern of the form (2.54), we find that the BDF-2 update in two spatial dimensions has the form
\[
u^{n+1} = \mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[R\right]\right], \tag{2.56}
\]
where we used
\[
R(x,y) = \frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) + \frac{1}{\alpha^2}S^{n+1}.
\]
Alternatively, if we swap the order in which the sweeps are performed, we obtain a similar update
\[
u^{n+1} = \mathcal{L}_y^{-1}\left[\mathcal{L}_x^{-1}\left[R\right]\right], \tag{2.57}
\]
with $R$ being unchanged. Now, consider a $y$-derivative of the schemes (2.56) and (2.57). Since the inverse operator $\mathcal{L}_x^{-1}$ varies only with respect to $x$ and remains constant in $y$, we identify two options to compute $y$-derivatives, namely
\[
\partial_y u^{n+1} = \mathcal{L}_x^{-1}\left[\partial_y\mathcal{L}_y^{-1}\left[R\right]\right], \tag{2.58}
\]
\[
\partial_y u^{n+1} = \partial_y\mathcal{L}_y^{-1}\left[\mathcal{L}_x^{-1}\left[R\right]\right]. \tag{2.59}
\]
Both options are valid for computing the derivatives in the multi-dimensional case, but, in problems with outflow boundary conditions, the data stored in these approaches is different. In the first approach (2.58), the outflow boundary conditions require the time history for the derivative of the intermediate variable. The second approach (2.59) proceeds in a manner which is more closely related to the update of the solution, since the derivative (2.35) does not change $A$ and $B$ for a given line. For this reason, we prefer the second option (2.59) in our implementation to compute the $y$-derivative. Similarly, the $x$-derivative works with the pattern (2.56).

2.5 Extensions for High-order Accuracy

Here we provide a brief discussion regarding high-order extensions of the aforementioned methods in the context of the two-way wave equation.
We include an overview of the successive convolution method for the wave equation, which is loosely taken from [38]. We also mention methods for increasing the accuracy of the BDF methods introduced in section 2.2, including ways to obtain more accurate spatial derivatives. While we are primarily focused on the developments with the second-order solvers, the purpose of this section is to demonstrate paths to high-order solvers. Therefore, we provide fewer details concerning the implementation of these methods compared to earlier sections.

2.5.1 Successive Convolution Methods

The first work on high-order extensions of the solvers presented in the previous sections appeared in the 2014 paper [38]. Rather than increasing the width of the stencil to approximate the second time derivative, these methods introduced additional spatial derivatives to retain a compact stencil in time, using the symmetries appearing in the truncation error for the time-centered method. To illustrate, suppose we are solving the homogeneous wave equation
\[
\frac{\partial^2 u}{\partial t^2} = c^2\Delta u.
\]
If the second time derivative is approximated with a second-order centered finite-difference about time level $n$, then
\[
\frac{\partial^2 u}{\partial t^2} = \frac{u^{n+1} - 2u^n + u^{n-1}}{\Delta t^2} + \mathcal{O}(\Delta t^2).
\]
Taylor expanding the terms in the numerator of the above approximation yields an expansion containing only even-order time derivatives, i.e.,
\[
u^{n+1} - 2u^n + u^{n-1} = 2\sum_{m=1}^{\infty}\frac{\Delta t^{2m}}{(2m)!}\frac{\partial^{2m}u^n}{\partial t^{2m}}.
\]
The time derivatives in the above equation were then replaced with spatial derivatives using the PDE to obtain the expansion
\[
u^{n+1} - 2u^n + u^{n-1} = 2\sum_{m=1}^{\infty}\frac{\beta^{2m}}{(2m)!}\left(\frac{\Delta}{\alpha^2}\right)^m u^n, \qquad \alpha := \frac{\beta}{c\Delta t}, \quad 0 < \beta \le \beta_{\max}, \tag{2.60}
\]
with the powers of the Laplacian being evaluated in a dimension-by-dimension fashion; the parameter $\beta$ was introduced to tune the stability of the method.
In [38], the powers of the Laplacian were constructed recursively using the convolution operators defined earlier in this chapter, so the approach became known as successive convolution. To approximate the Laplacian as a convolution, the authors introduce the one-dimensional modified Helmholtz operators
\[
\mathcal{L}_\gamma := \mathcal{I} - \frac{1}{\alpha^2}\partial_{\gamma\gamma}, \qquad \gamma = x, y, \cdots, \tag{2.61}
\]
and another operator
\[
\mathcal{D}_\gamma := \mathcal{I} - \mathcal{L}_\gamma^{-1}, \qquad \gamma = x, y, \cdots, \tag{2.62}
\]
which can be combined to form the Laplacian operator through the relation
\[
-\frac{1}{\alpha^2}\Delta = \mathcal{L}_x\mathcal{D}_x + \mathcal{L}_y\mathcal{D}_y + \cdots = -\frac{1}{\alpha^2}\partial_{xx} - \frac{1}{\alpha^2}\partial_{yy} - \cdots.
\]
The authors introduce a new operator $\mathcal{C}$, which, in two dimensions, is defined as
\[
\mathcal{C}_{xy} := \mathcal{L}_y^{-1}\mathcal{D}_x + \mathcal{L}_x^{-1}\mathcal{D}_y, \tag{2.63}
\]
so that the Laplacian can be expressed in the factored form
\[
-\frac{1}{\alpha^2}\Delta = \mathcal{L}_x\mathcal{L}_y\,\mathcal{C}_{xy}.
\]
To remove the term $\mathcal{L}_x\mathcal{L}_y$, they introduced yet another operator
\[
\mathcal{D}_{xy} := \mathcal{I} - \mathcal{L}_x^{-1}\mathcal{L}_y^{-1}, \tag{2.64}
\]
which can be rearranged to obtain the identity
\[
\mathcal{L}_x\mathcal{L}_y = \left(\mathcal{I} - \mathcal{D}_{xy}\right)^{-1}.
\]
With this last identity, the Laplacian becomes
\[
-\frac{1}{\alpha^2}\Delta = \left(\mathcal{I} - \mathcal{D}_{xy}\right)^{-1}\mathcal{C}_{xy}.
\]
Then, they expand the Laplacian into a power series to obtain the result
\[
\left(\frac{\Delta}{\alpha^2}\right)^m = (-1)^m\sum_{p=m}^{\infty}\binom{p-1}{m-1}\mathcal{C}_{xy}^m\mathcal{D}_{xy}^{p-m}. \tag{2.65}
\]
Note that the expansion above is valid because $\|\mathcal{D}_{xy}\| \le 1$ in the sense of operator norms. This can be seen from the one-dimensional analogue (2.62). In Fourier space, the operator $\mathcal{D}_x$ satisfies
\[
\mathcal{F}[\mathcal{D}_x] = 1 - \mathcal{F}\left[\mathcal{L}_x^{-1}\right] = 1 - \frac{1}{1 + (k/\alpha)^2} = \frac{(k/\alpha)^2}{1 + (k/\alpha)^2} \le 1.
\]
With slight modifications, one can similarly show that $\left|\mathcal{F}\left[\mathcal{D}_{xy}\right]\right| \le 1$ also holds. By inserting the identity (2.65) into the error expansion (2.60), they obtained a family of methods defined by
\[
u^{n+1} = 2u^n - u^{n-1} + 2\sum_{p=1}^{N}\sum_{m=1}^{p}(-1)^m\frac{\beta^{2m}}{(2m)!}\binom{p-1}{m-1}\mathcal{C}_{xy}^m\mathcal{D}_{xy}^{p-m}\left[u^n\right], \qquad \alpha := \frac{\beta}{c\Delta t}, \quad 0 < \beta \le \beta_{\max}, \tag{2.66}
\]
where the truncation of the outer sum to $N$ terms leads to a method of order $2N$ in time. The value of $\beta_{\max}$ in this expansion depends on the desired order of the scheme and decreases as the order of the scheme increases.
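Identity (2.65) is easy to verify in Fourier space, where all of the operators reduce to scalars. The sketch below sums the series (without the $(-1)^m$ prefactor) for illustrative wavenumbers and checks it against the symbol of $-\Delta/\alpha^2$:

```python
import math

# Fourier symbols of the operators at wavenumbers (kx, ky); the values are
# illustrative, chosen so that the symbol of D_xy is safely below 1
kx, ky, alpha = 1.0, 2.0, 3.0
Lx = 1.0 + (kx / alpha)**2
Ly = 1.0 + (ky / alpha)**2
C = (1.0 - 1.0 / Lx) / Ly + (1.0 - 1.0 / Ly) / Lx   # symbol of C_xy
Dxy = 1.0 - 1.0 / (Lx * Ly)                          # symbol of D_xy, < 1

def laplacian_power_symbol(m, P=400):
    # partial sum of (2.65) in symbol form, without the (-1)^m prefactor:
    #   sum_{p=m}^{P} binom(p-1, m-1) C^m Dxy^{p-m}
    return sum(math.comb(p - 1, m - 1) * C**m * Dxy**(p - m)
               for p in range(m, P + 1))

# the symbol of -Delta/alpha^2 is (kx^2 + ky^2)/alpha^2
target = (kx**2 + ky**2) / alpha**2
```

For $m = 1$ the series is geometric and sums to $\mathcal{C}_{xy}\mathcal{L}_x\mathcal{L}_y$, recovering the Laplacian symbol exactly; higher powers converge at the same geometric rate set by the symbol of $\mathcal{D}_{xy}$.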
For specific information on this particular topic, we refer the reader to the original paper [38], which analyzes the stability in great detail. To include source terms, the expansion (2.60) is modified to account for space and time derivatives of the source itself. We wish to point out that there is a typographical error in the fourth-order scheme with sources presented in the paper [38]. The correct form of the scheme is given by
\[
u^{n+1} = 2u^n - u^{n-1} - \beta^2\mathcal{C}_{xy}\left[u^n\right] - \left(\beta^2\mathcal{D}_{xy} - \frac{\beta^4}{12}\mathcal{C}_{xy}\right)\mathcal{C}_{xy}\left[u^n\right] + \frac{\beta^2}{12\alpha^2}\left(S^{n+1} + 10S^n + S^{n-1}\right) - \frac{\beta^4}{12\alpha^2}\mathcal{C}_{xy}\left[S^n\right]. \tag{2.67}
\]
The above scheme contains an implicit source term that arises from an approximation of the second time derivative about time level $n$. It should be noted that a different stencil can be used to make the source explicit. The approach used to apply boundary conditions in multi-dimensional problems with successive convolution is similar to the procedures presented for the second-order methods in section 2.4.3, with some caveats. Periodic boundary conditions can be applied level-by-level in the truncated operator expansions, directly following the procedure outlined in section 2.4.2.3 for the second-order time-centered scheme. Other boundary conditions, such as Dirichlet and Neumann, leverage the linearity of the wave equation being solved to enforce boundary conditions at the lowest level, with a homogeneous variant used for the remaining levels. Higher-order outflow methods with successive convolution were introduced in the thesis [54] and were later published in the article [40]. We wish to mention that the outflow methods proposed in the paper [40] showed large errors along the boundaries in numerical experiments, despite being fourth-order. These methods are not considered in this work, so we shall not elaborate on this any further.
The subject of boundary conditions, especially high-order outflow, shall be the focus of future work, once this situation is better understood for the second-order methods.

2.5.2 BDF Methods

A straightforward way of increasing the accuracy of the BDF method is to increase the size of the time history. In place of the second-order accurate approximation for $\partial_{tt}u^{n+1}$, we could instead use the third-order accurate difference given by
\[
\partial_{tt}u^{n+1} = \frac{\frac{35}{12}u^{n+1} - \frac{26}{3}u^n + \frac{19}{2}u^{n-1} - \frac{14}{3}u^{n-2} + \frac{11}{12}u^{n-3}}{\Delta t^2} + \mathcal{O}(\Delta t^3), \tag{2.68}
\]
or even
\[
\partial_{tt}u^{n+1} = \frac{\frac{15}{4}u^{n+1} - \frac{77}{6}u^n + \frac{107}{6}u^{n-1} - 13u^{n-2} + \frac{61}{12}u^{n-3} - \frac{5}{6}u^{n-4}}{\Delta t^2} + \mathcal{O}(\Delta t^4), \tag{2.69}
\]
which is fourth-order accurate. Then, to derive higher-order BDF methods, one can repeat the steps in section 2.2.1, replacing the second-order stencil with either the third-order stencil (2.68) or the fourth-order stencil (2.69). This leads to the respective semi-discrete schemes
\[
\left(\mathcal{I} - \frac{1}{\alpha^2}\Delta\right)u^{n+1} = \frac{12}{35}\left(\frac{26}{3}u^n - \frac{19}{2}u^{n-1} + \frac{14}{3}u^{n-2} - \frac{11}{12}u^{n-3}\right) + \frac{1}{\alpha^2}S^{n+1}(\mathbf{x}) + \mathcal{O}\left(\frac{1}{\alpha^5}\right), \qquad \alpha := \frac{\sqrt{35/12}}{c\Delta t}, \tag{2.70}
\]
and
\[
\left(\mathcal{I} - \frac{1}{\alpha^2}\Delta\right)u^{n+1} = \frac{4}{15}\left(\frac{77}{6}u^n - \frac{107}{6}u^{n-1} + 13u^{n-2} - \frac{61}{12}u^{n-3} + \frac{5}{6}u^{n-4}\right) + \frac{1}{\alpha^2}S^{n+1}(\mathbf{x}) + \mathcal{O}\left(\frac{1}{\alpha^6}\right), \qquad \alpha := \frac{\sqrt{15/4}}{c\Delta t}. \tag{2.71}
\]
The process for calculating derivatives is identical, with the only differences from (2.35) being the operand of the convolution integral and the particular definition of the parameter $\alpha$, so we shall not present these equations. Additionally, boundary conditions for this approach do not present any additional challenges beyond the base second-order scheme. This feature makes higher-order BDF methods simple to implement. Despite the advantages afforded by the BDF methods, there are two issues to address in this type of approach. First, there is the issue of stability. In the case of first-order ODEs, it is well known that BDF discretizations become unstable if the order is greater than 6.
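The stencil coefficients in (2.68) can be checked directly: they must annihilate constants and linear functions, reproduce $t^2$ exactly, and deliver third-order accuracy on smooth data. A small sketch (Python/NumPy):

```python
import numpy as np

# third-order backward stencil for u_tt at t^{n+1}, from (2.68);
# node offsets are 0, -1, -2, -3, -4 in units of dt
c = np.array([35/12, -26/3, 19/2, -14/3, 11/12])
k = np.arange(5)

def second_derivative(f, t, dt):
    # apply the stencil to samples f(t), f(t - dt), ..., f(t - 4*dt)
    return sum(ck * f(t - j * dt) for j, ck in zip(k, c)) / dt**2

# observed order on a smooth function: the error should scale like dt^3
t0 = 1.0
e1 = abs(second_derivative(np.sin, t0, 0.1)  - (-np.sin(t0)))
e2 = abs(second_derivative(np.sin, t0, 0.05) - (-np.sin(t0)))
rate = np.log2(e1 / e2)
```

The moment conditions below are exactly the constraints that define the stencil; the halving experiment then confirms the stated $\mathcal{O}(\Delta t^3)$ behavior.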
Similar issues will be encountered in these methods, and we plan to address the topic of stability in our later work. The second issue, which is less of a concern, is the splitting error in multi-dimensional problems that arises from the factorization used for the modified Helmholtz operator; however, it is possible to eliminate the splitting error by essentially "subtracting" it from the scheme with the aid of an iterative method [43].

2.6 Numerical Examples

In this section, we present numerical results for the field solvers introduced in this chapter. Using a one-dimensional test problem, we first compare the performance of the second-order BDF and time-centered derivatives. Then, we assess the performance of the proposed methods in multi-dimensional problems, focusing on the types of boundary conditions required in the problems considered in chapter 4. In the case of outflow boundary conditions, we also demonstrate the complications that arise when moving from one- to two-dimensional problems.

2.6.1 BDF and Time-centered Derivatives in One Spatial Dimension

As a first test, we perform a refinement study to compare the derivatives obtained with the second-order BDF and time-centered methods, which are given, respectively, by equations (2.35) and (2.49). The test problem we consider is the homogeneous two-way scalar wave equation
\[
\frac{1}{c^2}\partial_{tt}u - \partial_{xx}u = 0, \tag{2.72}
\]
with $c = 1$ and subject to periodic boundary conditions on $[0, 2\pi]$. The solution is evolved to the final time $T = 1$, and we use the initial data
\[
u(x, 0) = \sin(x), \qquad \partial_t u(x, 0) = 0. \tag{2.73}
\]
This equation has the solution
\[
u(x, t) = \frac{\sin(x - t) + \sin(x + t)}{2}, \tag{2.74}
\]
with the corresponding derivative
\[
\partial_x u(x, t) = \frac{\cos(x - t) + \cos(x + t)}{2}. \tag{2.75}
\]
Since these are multi-step methods, we need to supply values for the time history. While we can certainly use Taylor expansions to approximate this data, we use the analytical solution (2.74) and its corresponding spatial derivative (2.75).
This allows us to avoid any potential errors associated with the initialization. We also wish to point out that the splitting error is not present in one-dimensional problems, so this eliminates another source of error. Regarding the implementation, we note that, unlike the time-centered method (2.49), the BDF method (2.35) does not require additional time history for the derivative. In the experiment shown here, the derivatives for the time-centered method are stored in a time history, similar to the solution, which is updated in each time step of the simulation. A fifth-order spatial quadrature is used to compute the local integrals in both methods. In the time-centered method (2.48) and its derivative (2.49), we use $\beta = 2$, which is the largest allowable $\beta$ that retains the stability of the method [34].

Figure 2.1: Spatial refinement study for the solution and its derivative obtained with second-order methods for the periodic test problem in section 2.6.1. In Figure 2.1a (Central-2), we plot the $\ell_\infty$ errors for both the numerical solution and the derivative obtained with the time-centered method. Similarly, in Figure 2.1b (BDF-2), we show the same quantities, which are instead computed using the BDF method. The derivative for the time-centered method fails to refine in space, while the BDF derivative is as accurate as the numerical solution itself.

In the refinement experiments for space, we run each case with $N_t = 2^{12}$ time steps. We successively double the number of mesh points, beginning with $N_x = 32$ and finishing with $N_x = 2048$. In Figure 2.1, we compare the accuracy of the time-centered and BDF methods, and their proposed derivatives, against analytical solutions. The results indicate that the proposed spatial derivatives computed via (2.35) refine at the same rate as the BDF method (2.29).
In contrast, the proposed derivative based on the time-centered method fails to refine, even though the numerical solution demonstrates fifth-order accuracy in space.

The time refinement experiment uses a fixed spatial mesh consisting of $N_x = 256$ grid points and successively doubles the number of time steps from $N_t = 8$ to $N_t = 512$. In Figure 2.2, we show the results from the refinement study performed in time for the proposed methods. Based on Figure 2.2a, we can see that when fewer time steps are used in the time-centered method, the derivatives exhibit refinement properties similar to those of the numerical solution. As we use more time steps, the errors between the solution and its derivative grow, which suggests an issue with the accumulation of errors in time.

Figure 2.2: Time refinement study for the solution and its derivative obtained with second-order methods for the test problem in section 2.6.1. In Figure 2.2a (Central-2), we plot the $\ell_\infty$ errors for both the numerical solution and the derivative obtained with the time-centered method. Similarly, in Figure 2.2b (BDF-2), we show the same quantities, which are instead computed using the BDF method. The derivative for the time-centered method initially converges together with the numerical solution, but at some point begins to diverge. In contrast, the errors for the derivatives obtained with the BDF method are aligned with those of the solution. Comparing the scales of the plots, we note that the BDF solution is slightly less accurate than the time-centered solution.

The results obtained with the BDF method are displayed in Figure 2.2b and match the second-order accuracy of the base method. The results of the space and time refinement experiments for the proposed methods suggest that we should use (2.35) to construct derivatives.
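Throughout these studies, the observed convergence rates are read off from errors at successively halved resolutions; a small helper (illustrative) makes the computation explicit:

```python
import numpy as np

def observed_orders(errors):
    # successive halving of the mesh: rate_k = log2(e_k / e_{k+1})
    e = np.asarray(errors, dtype=float)
    return np.log2(e[:-1] / e[1:])

# synthetic check: errors behaving like C*h^2 under halving give rate 2
h = np.array([0.1, 0.05, 0.025, 0.0125])
rates = observed_orders(3.0 * h**2)
```

In practice the rate is reported between consecutive rows of a refinement table, and a clean scheme of order $p$ shows rates settling near $p$ as the mesh is refined.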
While derivatives computed with the BDF method require a time history, we can think of this method as a one-step application in the sense that the derivatives at any given time level depend on the solution and not on derivatives at other time levels. Moreover, the time history in the BDF derivatives does not necessarily have to come from the BDF method itself. In fact, as we will soon see, these methods work even if this time history is supplied by another solver, e.g., the time-centered method or perhaps a higher-order method based on successive convolution.

2.6.2 Periodic Boundary Conditions

The next problem we consider is the two-dimensional inhomogeneous scalar wave equation
\[
\frac{1}{c^2}\partial_{tt}u - \Delta u = S(x, y, t), \tag{2.76}
\]
where $c = 1$ and
\[
S(x, y, t) = 3e^{-t}\sin(x)\cos(y). \tag{2.77}
\]
We apply periodic boundary conditions in both directions on the domain $[0, 2\pi] \times [0, 2\pi]$ and use the initial data
\[
u(x, y, 0) = \sin(x)\cos(y), \qquad \partial_t u(x, y, 0) = -\sin(x)\cos(y). \tag{2.78}
\]
The problem (2.76) is associated with the manufactured solution
\[
u(x, y, t) = e^{-t}\sin(x)\cos(y), \tag{2.79}
\]
which defines the source function (2.77). The partial derivatives of this solution are calculated to be
\[
\partial_x u(x, y, t) = e^{-t}\cos(x)\cos(y), \tag{2.80}
\]
\[
\partial_y u(x, y, t) = -e^{-t}\sin(x)\sin(y). \tag{2.81}
\]
We performed temporal and spatial refinement studies using a mixed approach, as well as a pure BDF approach. The mixed approach uses the second-order time-centered method to evolve the solution $u$, and its partial derivatives in both variables are calculated using the second-order BDF scheme; this is denoted as "Central-2 + BDF-2" in the figures. Similarly, the pure BDF approach computes both the solution and its derivatives using the BDF scheme, i.e., "BDF-2 + BDF-2". In these experiments, we used a fifth-order spatial quadrature rule. In the temporal refinement study, the solution is computed until a final time of $T = 1$ using a fixed $256 \times 256$ mesh in space. We successively double the number of time steps from $N_t = 8$ to $N_t = 512$.
We use the analytical solution to initialize the method, since it is available. The results of the temporal refinement study are presented in Figure 2.3, in which all methods, including those for the derivatives, display the expected second-order convergence rate in time.

Figure 2.3: Time refinement study for the solution and its derivative for the two-dimensional periodic example of section 2.6.2 obtained with second-order methods. In Figure 2.3a (Central-2 + BDF-2), we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.3b (BDF-2 + BDF-2), we show the same quantities, both of which are obtained using the BDF-2 method.

For the space refinement experiment, we varied the spatial mesh in each direction from 16 points to 512 points. To keep the temporal error in the methods small during the refinement, we applied the methods for 1 time step using a step size of $\Delta t = 1 \times 10^{-4}$. Note that the disparity in the order between time and space (second-order versus fifth-order) necessitates a small time step here; however, the waves may fail to propagate if too small a time step is used. This can be fixed using a Taylor expansion of the Green's function in the quadrature rule, as mentioned in section 2.3.3.

Figure 2.4: Space refinement of the solution and its derivative for the two-dimensional periodic example of section 2.6.2 obtained with second-order methods. In Figure 2.4a (Central-2 + BDF-2), we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.4b (BDF-2 + BDF-2), we show the same quantities, both of which are obtained using the BDF-2 method.

The refinement plots in Figure 2.4 indicate fifth-order accuracy in space for all methods. We note that the derivatives in the methods begin to level off as the error approaches $1 \times 10^{-11}$. This
is likely due to a different error coefficient in time, which arises from the differentiation process. A smaller time step would be necessary to remove this feature, but this requires some modification of the quadrature.

2.6.3 Dirichlet Boundary Conditions

For the Dirichlet problem, we again consider the two-dimensional inhomogeneous scalar wave equation
\[
\frac{1}{c^2}\partial_{tt}u - \Delta u = S(x, y, t), \tag{2.82}
\]
where $c = 1$ and
\[
S(x, y, t) = 3e^{-t}\sin(x)\sin(y). \tag{2.83}
\]
We apply homogeneous Dirichlet boundary conditions on the domain $[0, 2\pi] \times [0, 2\pi]$ and use the initial data
\[
u(x, y, 0) = \sin(x)\sin(y), \qquad \partial_t u(x, y, 0) = -\sin(x)\sin(y). \tag{2.84}
\]
The problem (2.82) is associated with the manufactured solution
\[
u(x, y, t) = e^{-t}\sin(x)\sin(y), \tag{2.85}
\]
which defines the source function (2.83). The partial derivatives of this solution are calculated to be
\[
\partial_x u(x, y, t) = e^{-t}\cos(x)\sin(y), \tag{2.86}
\]
\[
\partial_y u(x, y, t) = e^{-t}\sin(x)\cos(y). \tag{2.87}
\]
We performed temporal and spatial refinement experiments using the same mixed and pure BDF approaches considered in the previous section 2.6.2. As a reminder, the mixed approach uses the second-order time-centered method to evolve the solution $u$, with its partial derivatives calculated using the second-order BDF scheme; this is denoted as "Central-2 + BDF-2" in the figures. The pure BDF approach computes both the solution and its derivatives with the BDF scheme, similarly denoted as "BDF-2 + BDF-2" in the figures. We use the same fifth-order spatial quadrature rule as in the periodic test case to perform the runs. In the temporal refinement study, the solution is computed until a final time of $T = 1$. We use a fixed $256 \times 256$ spatial mesh, and the number of time steps in each case is successively doubled from $N_t = 8$ to $N_t = 512$.
Errors can be measured directly against the analytical solution and its derivatives.

Figure 2.5: Time refinement study of the solution and its derivatives in the two-dimensional Dirichlet problem of section 2.6.3 obtained with second-order methods. (a) Central-2 + BDF-2; (b) BDF-2 + BDF-2. In Figure 2.5a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. In Figure 2.5b, we show the same quantities, both of which are obtained using the BDF-2 method.

The results of the temporal refinement study are presented in Figure 2.5, in which all methods, including those for the derivatives, display the expected second-order convergence rate. The behavior is essentially identical to the results obtained for the periodic problem, which were presented in Figure 2.3.

Figure 2.6: Space refinement of the solution and its derivatives in the two-dimensional Dirichlet problem of section 2.6.3 obtained with second-order methods. (a) Central-2 + BDF-2; (b) BDF-2 + BDF-2. In Figure 2.6a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. In Figure 2.6b, we show the same quantities, both of which are obtained using the BDF-2 method.

We performed the spatial refinement study by varying the number of mesh points in each direction from 16 points to 512 points. Again, to keep the temporal error in the methods small while space is refined, we applied the methods for only 1 time step with a step size of Δt = 1 × 10⁻⁴. The same remark about small time step sizes made for the spatial refinement experiment in the periodic problem applies to this case as well (see section 2.6.2). The refinement plots in Figure 2.6 indicate that the methods refine at fifth-order accuracy in space. In both the mixed and pure BDF approaches, the error in the derivatives behaves differently from what was observed in the periodic example.
In particular, we do not observe a flattening of the error when the spacing Δx is small.

2.6.4 Outflow Boundary Conditions

To test outflow boundary conditions, we use the homogeneous scalar wave equation

    (1/c²) ∂_tt u − Δu = 0,    (2.88)

where c = 1. We solve the problem (2.88) on the two-dimensional domain [−2, 2] × [−2, 2] with the initial data

    u(x, y, 0) = e^{−16(x² + y²)},    ∂_t u(x, y, 0) = 0,    (2.89)

which is a Gaussian centered at the origin. Note that the width of the Gaussian is chosen so that the data is essentially machine zero at points along the boundary. If the initial data were not zero outside of the domain, a specialized approach for initializing the boundary coefficients in the temporal recursion for outflow would need to be devised. Since that is not the case for this problem, we can initialize the recursion for the boundary coefficients with zeros. The refinement experiments consider the same mixed and BDF approaches as in the previous sections. Since we do not have an analytical solution for the problem (2.88), we use a reference solution computed on a sufficiently fine temporal or spatial mesh. In all tests, the time history data is initialized using Taylor expansions in time, keeping terms up to fourth-order accuracy. Recall that we presented two forms of outflow in sections 2.4.1.4 and 2.4.2.4. We chose to use the explicit forms of outflow because they are simpler to implement in multi-dimensional problems. Moreover, using the one-dimensional analogue of problem (2.88), we found that the explicit approaches were more effective at suppressing artificial reflections of the waves along the boundary.
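To make the "essentially machine zero" claim concrete, a quick check (hypothetical, not from the thesis) evaluates the Gaussian (2.89) at the nearest and farthest boundary points of [−2, 2] × [−2, 2]:

```python
import math

def u0(x, y):
    # initial condition (2.89): Gaussian centered at the origin
    return math.exp(-16.0 * (x**2 + y**2))

edge = u0(2.0, 0.0)    # boundary point closest to the origin: e^{-64}
corner = u0(2.0, 2.0)  # farthest boundary point (a corner): e^{-128}

# both values are far below double-precision unit roundoff (~2.2e-16)
print(edge, corner)
```

Since even the closest boundary point carries data of size e⁻⁶⁴ ≈ 10⁻²⁸, initializing the boundary recursion with zeros introduces no error at the level of double precision.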
Figure 2.7 shows an example of this comparison between the implicit and explicit forms of outflow for the BDF method.

Figure 2.7: Reflections produced by the implicit and explicit forms of the outflow boundary conditions for the second-order BDF method in a one-dimensional outflow problem. (a) BDF-2 with implicit weights (1-D); (b) BDF-2 with explicit weights (1-D). We run with the same Gaussian initial condition until the final time T = 4, at which point the wave data should no longer be in the simulation; what is left is the reflection at the artificial boundaries of the domain. The plot on the left shows the results obtained with the proposed implicit form of the outflow weights developed for the BDF-2 method, while the plot on the right uses the explicit form of the weights. We find that the explicit form of the weights is more effective at suppressing the spurious reflections at the artificial boundaries.

At the time shown in the plots, what remains of the original wave is due to reflections. In the refinement study for time, we evolve the solution on a fixed 512 × 512 mesh until the final time T = 1.0, which is before the wave leaves the domain. The number of time steps is successively doubled from N_t = 8 until N_t = 512. We note that the coarsest time discretization gives a CFL ≈ 16, which is much larger than we would typically use for the second-order schemes. The reference solution is obtained on the same spatial mesh with a total of N_t = 2048 time steps. Results obtained with the mixed and BDF approaches are presented in Figures 2.8a and 2.8b, respectively. The methods for the solution and the corresponding derivatives refine at second-order accuracy in time, with the mixed approach being the more accurate of the two. The overall error in these methods is notably larger than for the periodic and Dirichlet test problems.
One reason for this is that periodic and Dirichlet conditions can be enforced exactly, while outflow conditions are only approximately enforced. This error is further amplified in the case of the derivatives, which introduce an additional factor of O(1/Δt) that amplifies the overall size of the error.

Figure 2.8: Time refinement study of the solution and its derivatives in the two-dimensional outflow problem of section 2.6.4 obtained with second-order methods. (a) Central-2 + BDF-2 (2-D); (b) BDF-2 + BDF-2 (2-D). In Figure 2.8a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. In Figure 2.8b, we show the same quantities, both of which are obtained using the BDF-2 method.

Similar error properties can be observed in the analogous one-dimensional problem. Using the same problem setup for the one-dimensional case, we applied the implicit forms of outflow for the time-centered and BDF methods, which we show in Figure 2.9. While the outflow methods themselves are different, we can see that similar refinement properties are observed here as well.

Figure 2.9: A comparison of the temporal refinement properties for the one-dimensional implicit methods. (a) Central-2 with implicit weights (1-D); (b) BDF-2 with implicit weights (1-D). The weights for the time-centered method shown in Figure 2.9a are taken from the paper [34]. We compare this to the proposed implicit approach to outflow, shown on the right, in Figure 2.9b.

This comparison informs us that the issue is likely not related to dimensionality or the form of the outflow weights (e.g., implicit or explicit). The issue seems to be more closely related to the propagation of errors in the methods when only approximate boundary data is available. The spatial refinement study was performed by varying the number of mesh points in each direction from 17 points to 513 points.
The methods were applied for only 1 time step with a step size of Δt = 1 × 10⁻⁴, and the errors were measured against a reference solution computed on a 2049 × 2049 spatial mesh, also using Δt = 1 × 10⁻⁴. The refinement plots in Figure 2.10 show that each of these methods is approximately first-order in space. As alluded to in the time refinement study, this seems to be a consequence of using inexact boundary conditions. The particular form of the inverse operator (2.19) suggests that errors in A and B, which are used to enforce boundary conditions, can impact regions in the vicinity of the boundary. The size of these regions depends on the size of α, which, in turn, depends on the wave speed c and the value of Δt.

Figure 2.10: Space refinement of the solution and its derivatives in the two-dimensional outflow problem of section 2.6.4 obtained with second-order methods. (a) Central-2 + BDF-2 (2-D); (b) BDF-2 + BDF-2 (2-D). In Figure 2.10a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. In Figure 2.10b, we show the same quantities, both of which are obtained using the BDF-2 method.

As with the time refinement, we can look at the analogous one-dimensional problem and compare these results with the two-dimensional problem. The one-dimensional refinement results, obtained with the implicit form of the outflow weights, are presented in Figure 2.11. Again, we observe similar behavior in the refinement properties of the one- and two-dimensional test problems, which again hints at issues concerning the propagation of errors in the method.

Figure 2.11: A comparison of the spatial refinement properties for the one-dimensional implicit methods. (a) Central-2 with implicit weights (1-D); (b) BDF-2 with implicit weights (1-D). The weights for the time-centered method, shown in Figure 2.11a, are taken from the paper [34].
We compare this to the proposed implicit approach to outflow, shown on the right, in Figure 2.11b.

2.7 Conclusion

In this work, we developed new approaches for computing fields and their derivatives with applications to scalar wave equations. Our contributions build on prior developments for a class of algorithms known as the MOL𝑇, which combines a dimensional splitting technique with a one-dimensional integral equation method to yield algorithms with unconditional stability, geometric flexibility, and O(N) complexity. The proposed methods for derivatives use data that is already available through the base method, so they naturally inherit its properties. We also presented a treatment of outflow boundary conditions for the BDF method. The accuracy of the proposed methods was evaluated by performing a series of refinement experiments in both space and time, using several types of boundary conditions. In particular, we established some refinement properties for outflow boundary conditions, which were not presented in earlier work.

CHAPTER 3
PARALLEL ALGORITHMS FOR SUCCESSIVE CONVOLUTION

3.1 Introduction

In this chapter, we develop parallel algorithms using novel approaches to represent derivative operators for linear and nonlinear time-dependent partial differential equations (PDEs). We chose to investigate algorithms for these representations due to the stability properties observed for a wide range of linear and nonlinear PDEs. The approach considered here uses expansions involving integral operators to approximate spatial derivatives. We shall refer to this approach as the Method of Lines Transpose (MOL𝑇), though it can be more broadly categorized within a larger class of successive convolution methods. The name arises because the terms in the operator expansions, which we describe later, involve convolution integrals whose operand is recursively, or successively, defined.
Despite the use of explicit data in these integral terms, the boundary data remains implicit, which contributes to both the speed and stability of the representations. The inclusion of more terms in these operator expansions, when combined with a high-order quadrature method, allows one to obtain a high-order discretization in both space and time. Another benefit of this approach is that extensions to multiple spatial dimensions are straightforward, as operators can be treated in a line-by-line fashion. Moreover, the integral equations are amenable to fast-summation techniques, which reduce the overall computational complexity along a given dimension from O(N²) to O(N), where N is the number of discrete grid points along that dimension. High-order successive convolution algorithms have been developed to solve a range of time-dependent PDEs, including the wave equation [50], the heat equation (e.g., the Allen-Cahn [55] and Cahn-Hilliard [43] equations), Maxwell's equations [37], the Vlasov equation [52], the degenerate advection-diffusion (A-D) equation [44], and the Hamilton-Jacobi (H-J) equation [56, 45]. In contrast to these papers, this work focuses on the performance of the method in parallel computing environments, which is a largely unexplored area of research. Specifically, our work focuses on developing effective domain decomposition strategies for distributed memory systems and building thread-scalable algorithms using the low-order schemes as a baseline. By leveraging the decay properties of the integral representation, we restrict the calculations to localized, non-overlapping subsets of the spatial domain. The algorithms presented in this work consider dependencies between nearest neighbors (N-N), but, as we will see, this restriction can be generalized to include additional information at the cost of additional communication.
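The decay that justifies the N-N restriction can be quantified with a back-of-the-envelope estimate (illustrative only; the parameter α and subdomain width w below are hypothetical values, not taken from the thesis). The exponential kernel damps contributions from points a distance d away by e^{−αd}, so anything beyond the neighboring subdomain is negligible once αw is modestly large:

```python
import math

def tail_fraction(alpha, d):
    # fraction of the kernel mass alpha * exp(-alpha * s) lying beyond
    # distance d:  alpha * int_d^inf exp(-alpha * s) ds = exp(-alpha * d)
    return math.exp(-alpha * d)

alpha, w = 50.0, 1.0   # hypothetical kernel parameter and subdomain width
print(tail_fraction(alpha, w) < 2.2e-16)  # beyond-neighbor terms fall below roundoff
```

In this regime, truncating the convolution to nearest-neighbor subdomains discards only contributions below double-precision roundoff, which is what makes the localized, non-overlapping decomposition viable.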
Using a hybrid design that employs MPI and Kokkos [46] for the distributed and shared memory components of the algorithms, respectively, we show that our methods are efficient and can sustain an update rate > 1 × 10⁸ DOF/node/s. While experimentation on graphics processing units (GPUs) is left to future work, we believe choosing Kokkos will provide a path for a more performant and seamless integration of our algorithms with new computing hardware. Recent developments in successive convolution methods have focused on extensions to solve more general nonlinear PDEs, for which an integral solution is generally not applicable. This work considers discretizations developed for degenerate advection-diffusion (A-D) equations [44], as well as the Hamilton-Jacobi (H-J) equations [56, 45]. The key idea in these papers was to exploit the linearity of a given differential operator rather than the underlying equations, allowing derivatives in nonlinear problems to be expressed using the same representations developed for linear problems. For linear problems, it was demonstrated that one could couple these representations for the derivative operators with an explicit time-stepping method, such as the strong-stability-preserving Runge-Kutta (SSP-RK) methods [57], and still obtain schemes that maintain unconditional stability [56, 44]. To address shock capturing and control non-physical oscillations, the latter two papers introduced quadratures that use WENO reconstructions, along with a nonlinear filter to further control oscillations. In [45], the schemes for the H-J equations were extended to enable calculations on mapped grids. That paper also proposed a new WENO quadrature method built on a basis of exponential polynomials, which improves the shock-capturing capabilities. Our choice to discretize time first, before treating the spatial quantities, is not a new idea.
A well-known approach is Rothe's method [58, 59], in which a finite difference approximation is used for time derivatives and an integral equation solver is developed for the resulting sequence of elliptic PDEs (see e.g., [60, 61, 62, 49, 63, 64, 65]). The earlier developments for successive convolution methods, such as [50], are quite similar to Rothe's method in the treatment of the time derivatives. However, successive convolution methods differ considerably from Rothe's method in the treatment of spatial derivatives for nonlinear problems, such as those considered in more recent work on successive convolution (see e.g., [56, 44, 45]), as Newton iteration can be avoided on nonlinear terms. Additionally, these methods do not require solutions to linear systems. In contrast, Nyström methods, which are used to discretize the integral equations in Rothe's method, result in dense linear systems, which are typically solved using an iterative method such as GMRES [27]. Despite the fact that these linear systems are well-conditioned, the various collective operations that occur in distributed GMRES solves can become quite expensive on large computing platforms. Similarly, in [66, 67], Bruno and Lyon introduced a spectral method, based on the FFT, for computing spatial derivatives of general, possibly non-periodic, functions, known as Fourier-Continuation (FC). They combined this representation with the well-known Alternating-Direction (AD) methods, e.g., [68, 69, 70], dubbed FC-AD, to develop implicit solvers suitable for linear equations. This resulted in a method capable of computing spatial derivatives in a rapidly convergent and dispersionless fashion. A domain decomposition technique for the FC method is described in [71]: weak scaling was demonstrated up to 256 processors, using 4 processors per node, but larger runs were not considered. Another related transform approach was developed to solve the linear wave equation [72].
This work introduced a windowed Fourier methodology and combined it with a frequency-domain boundary integral equation solver to simulate long-time, high-frequency wave propagation. While that particular work does not focus on parallel implementations, the authors suggest several generic strategies, including a trivially parallelizable approach that solves a collection of frequency-domain integral equations in parallel; however, a purely parallel-in-time approach may not be appropriate for massively parallel systems, especially if few frequencies are required across time. This issue may be further complicated by the parallel implementation of the frequency-domain integral equation solvers, which, as previously mentioned, require the solution of dense linear systems. Therefore, it may be a rather difficult task to develop robust parallel algorithms that are capable of achieving their peak performance. This paper is organized as follows: In 3.2, we provide an overview of the numerical scheme used to formulate our algorithms. To this end, we first illustrate the connections among several characteristically different PDEs through the appearance of common operators in 3.2.1. Using these fundamental operators, we define the integral representation in 3.2.2, which is used to approximate spatial derivatives. Once the representations have been introduced, we briefly discuss the complications associated with boundary conditions and provide the relevant information, in 3.2.3, for implementing the boundary conditions used in our numerical tests. Sections 3.2.4 and 3.2.5 briefly review the spatial discretization process (including the fast-summation method and the quadrature) and the coupling with time integration methods, respectively. We provide the details of our new domain decomposition algorithm in 3.3, beginning with the derivation of the so-called N-N conditions in 3.3.1.
Using these conditions, we show how boundary conditions can be enforced locally for first and second derivative operators (3.3.2 and 3.3.3, respectively). Details concerning the implementation of the parallel algorithms are contained entirely in 3.4. This includes the introduction of the shared memory programming model (3.4.1), the definition of a certain performance metric (3.4.2) used in both loop optimization experiments and scaling studies (3.4.3), the presentation of shared memory algorithms (3.4.4), and, lastly, implementation details concerning the distributed memory algorithms (3.4.5). 3.5 contains the core numerical results, which confirm the convergence (3.5.1), as well as the weak and strong scalability (3.5.2 and 3.5.3, respectively), of the proposed algorithms. In 3.5.4, we examine the impact of the restriction posed by the N-N conditions. Finally, we summarize our findings with a brief conclusion in 3.6.

3.2 Description of Numerical Methods

In this section, we outline the approach used to develop unconditionally stable solvers by exploiting knowledge of linear operators. We start by demonstrating the connections between several different PDEs using operator notation, which allows us to reuse, or combine, approximations in several different ways. Once we have established these connections, we define an appropriate "inverse" operator and use it to develop the expansions used to represent derivatives. The representations we develop for derivative operators are motivated by the solution of simple 1-D problems. However, in multi-dimensional problems, these expressions are still valid approximations, in a certain sense, even though the kernels in the integral representation may not be solutions to the PDE in question.
While these approximations can be made high-order in both time and space, the focus of this work is strictly on the scalability of the method, so we limit ourselves to formulations that are first-order in time. Note that the approach described in 3.4, which considers first-order schemes, is quite general and can easily be extended to high-order representations. Once we have discussed our treatment of derivative terms, we describe the fast-summation algorithm and quadrature method in 3.2.4. Despite the fact that this work only considers smooth test problems, we include the relevant modifications required for non-smooth problems for completeness. In 3.2.5, we illustrate how the representation of derivative operators can be used within a time-stepping method to solve PDEs.

3.2.1 Connections Among Different PDEs

Before introducing the operators relevant to successive convolution algorithms, we establish the operator connections appearing in several linear PDE examples. This process helps identify key operators that can be represented with successive convolution. Specifically, we shall consider the following three prototypical linear PDEs:

• Linear advection equation: (∂_t − c ∂_x) u = 0,
• Diffusion equation: (∂_t − ν ∂_xx) u = 0,
• Wave equation: (∂_tt − c² ∂_xx) u = 0.

Next, we apply an implicit time discretization to each of these problems. For discussion purposes, we shall consider lower-order time discretizations, i.e., backward Euler for ∂_t u and a second-order central difference for ∂_tt u. If we identify the current time as t^n, the new time level as t^{n+1}, and Δt = t^{n+1} − t^n, then we obtain the corresponding set of semi-discrete equations:

• Linear advection equation: (I − Δt c ∂_x) u^{n+1} = u^n,
• Diffusion equation: (I − Δt ν ∂_xx) u^{n+1} = u^n,
• Wave equation: (I − Δt² c² ∂_xx) u^{n+1} = 2u^n − u^{n−1}.
Here, we use I to denote the identity operator, and, in all cases, each of the spatial derivatives is taken at time level t^{n+1} to keep the schemes implicit. The key observation is that the operator (I ± (1/α) ∂_x) arises in each of these examples. Notice that

    (I − (1/α²) ∂_xx) = (I − (1/α) ∂_x)(I + (1/α) ∂_x),

where α is a parameter that is selected according to the equation one wishes to solve. For example, in the case of diffusion, one selects

    α = β / √(νΔt),

while for the linear advection and wave equations, one selects

    α = β / (cΔt).

The parameter β, which does not depend on Δt, is then used to tune the stability of the approximations. For the test problems appearing in this paper, we always use β = 1. In 3.2.2, we demonstrate how the operator (I ± (1/α) ∂_x) can be used to approximate spatial derivatives. We remark that for second derivatives, one can also use (I − (1/α²) ∂_xx) to obtain a representation of second-order spatial derivatives, instead of factoring into "left" and "right" characteristics. Next, we introduce the following definitions to simplify the notation:

    L_L ≡ I − (1/α) ∂_x,    L_R ≡ I + (1/α) ∂_x.    (3.1)

Written in this manner, these definitions indicate the left- and right-moving components of the characteristics, respectively, as the subscripts are associated with the direction of propagation. For second derivative operators, which are not factored into first derivatives, we shall use

    L_0 ≡ I − (1/α²) ∂_xx.    (3.2)

In order to connect these operators with suitable expressions for spatial derivatives, we need to define the corresponding "inverse" for each of these linear operators on a 1-D interval [a, b]. These definitions are given as

    L_L^{-1}[·; α](x) ≡ α ∫_x^b e^{−α(s−x)} (·) ds + B e^{−α(b−x)}
                      ≡ I_L[·; α](x) + B e^{−α(b−x)},    (3.3)

    L_R^{-1}[·; α](x) ≡ α ∫_a^x e^{−α(x−s)} (·) ds + A e^{−α(x−a)}
                      ≡ I_R[·; α](x) + A e^{−α(x−a)}.    (3.4)

These definitions can be derived in a number of ways.
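The factorization of the second-derivative operator quoted above can be spot-checked on a concrete function (an illustrative verification, not from the thesis; the exact derivatives of sin stand in for the operators' action):

```python
import math

alpha = 3.0
x = 0.7

# action of the operators on v(x) = sin(x), using exact derivatives:
#   L_R v       = v + v'/alpha          = sin + cos/alpha
#   L_L (L_R v) = L_R v - (L_R v)'/alpha
#   L_0 v       = v - v''/alpha^2       = sin + sin/alpha^2
L_R_v = math.sin(x) + math.cos(x) / alpha
dL_R_v = math.cos(x) - math.sin(x) / alpha   # derivative of sin + cos/alpha
L_L_L_R_v = L_R_v - dL_R_v / alpha
L_0_v = math.sin(x) + math.sin(x) / alpha**2

# (I - alpha^{-2} d_xx) = (I - alpha^{-1} d_x)(I + alpha^{-1} d_x)
print(abs(L_L_L_R_v - L_0_v))  # ~ 0: the factorization holds
```

The two evaluations agree to machine precision, which is the algebraic fact that lets the second-derivative operator be built either directly from L_0 or as the composition of the "left" and "right" characteristic operators.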
In A.1, we demonstrate how these definitions can be derived for the linear advection equation using the integrating factor method. In these definitions, A and B are constants associated with the "homogeneous solution" of a corresponding semi-discrete problem and are used to satisfy the boundary conditions. In a similar way, one can compute the inverse operator for definition (3.2), which yields

    L_0^{-1}[·; α](x) ≡ (α/2) ∫_a^b e^{−α|x−s|} (·) ds + A e^{−α(x−a)} + B e^{−α(b−x)}
                      ≡ I_0[·; α](x) + A e^{−α(x−a)} + B e^{−α(b−x)}.    (3.5)

In these definitions, we refer to "·" as the operand and, again, α is a parameter selected according to the problem being solved. Although it is a slight abuse of notation, when it is not necessary to explicitly indicate the parameter or the point of evaluation, we shall place the operand inside a pair of parentheses. If we connect these definitions to each of the linear semi-discrete equations mentioned earlier, we can determine the update equation through an analytic inversion of the corresponding linear operator(s):

• Linear advection equation: u^{n+1} = L_R^{-1}(u^n), or u^{n+1} = L_L^{-1}(u^n),
• Diffusion equation: u^{n+1} = L_L^{-1}(L_R^{-1}(u^n)), or u^{n+1} = L_0^{-1}(u^n),
• Wave equation: u^{n+1} = L_L^{-1}(L_R^{-1}(2u^n − u^{n−1})), or u^{n+1} = L_0^{-1}(2u^n − u^{n−1}),

with the appropriate choice of α for the problem being considered. We note that each of these methods can be made high-order following the work in [56, 44, 45, 55, 38], where it was demonstrated that these approaches lead to methods that are unconditionally stable to all orders of accuracy for these linear PDEs, even with variable wave speeds or diffusion coefficients. Since the process of analytic inversion yields an integral term, a fast-summation technique should be used to reduce the computational complexity of a naive implementation, which would otherwise scale as O(N²).
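The unconditional stability claim can be illustrated for the simplest case, the diffusion update u^{n+1} = L_0^{-1}(u^n). On a Fourier mode e^{ikx}, with α² = 1/(νΔt) (i.e., β = 1), the symbol of L_0 = I − (1/α²) ∂_xx is 1 + νΔt k², so the update multiplies each mode by 1/(1 + νΔt k²) ≤ 1 for every wavenumber and every time step. A small sketch (illustrative only, not code from the thesis):

```python
def amp(k, nu, dt):
    # symbol of L_0^{-1} = (I - alpha^{-2} d_xx)^{-1} on the mode e^{ikx},
    # with alpha^2 = 1/(nu * dt) (beta = 1)
    return 1.0 / (1.0 + nu * dt * k**2)

# every mode is damped (or left unchanged), even at a huge time step
worst = max(amp(k, nu=1.0, dt=100.0) for k in range(0, 512))
print(0.0 < worst <= 1.0)  # True: no mode can grow
```

The same symbol calculus applied to the advection and wave updates gives amplification factors of modulus at most one as well, which is the sense in which analytic inversion yields unconditionally stable schemes.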
Some details concerning the spatial discretization and the O(N) fast-summation method are briefly summarized in 3.2.4 (for full details, please see [55]). Next, we demonstrate how the operator L∗ can be used to approximate spatial derivatives.

3.2.2 Representation of Derivatives

In the previous section, we observed that characteristically different PDEs can be described in terms of a common set of operators. The focus of this section shall be on manipulating these approximations to obtain a high-order discretization in time through certain operator expansions. The process begins by introducing an operator related to L∗^{-1}, namely,

    D∗ ≡ I − L∗^{-1},    (3.6)

where ∗ can be L, R, or 0. The motivation for these definitions will become clear soon. Additionally, we can derive an identity from the definitions (3.6). By manipulating the terms, we quickly find that

    L∗ ≡ (I − D∗)^{-1},    (3.7)

again, where ∗ can be L, R, or 0. The purpose of the identity (3.7) is that it connects the spatial derivative to an expression involving integrals of the solution rather than derivatives. In other words, it allows us to avoid having to use a stencil operation for derivatives. To obtain an approximation for the first derivative in space, we can use L_L, L_R, or both of them, which may occur as part of a monotone splitting. If we combine the definition of the left-propagating first derivative operator in equation (3.1) with the definition (3.6) and identity (3.7), we can define the first derivative in terms of the D_L operator. Observe that

    ∂_x⁺ = α (I − L_L)
         = α (L_L L_L^{-1} − L_L)
         = α L_L (L_L^{-1} − I)
         = −α L_L (I − L_L^{-1})
         = −α (I − D_L)^{-1} D_L
         = −α Σ_{p=1}^∞ D_L^p,    (3.8)

where, in the last step, we used the fact that the operator D_L is bounded by unity in an operator norm, so that (I − D_L)^{-1} can be expanded as a Neumann series. We use the + convention to indicate that this is a right-sided derivative.
Likewise, for the right-propagating first derivative operator, we find that the complementary left-biased derivative is given by

    ∂_x⁻ = α Σ_{p=1}^∞ D_R^p.    (3.9)

Additionally, the second derivative can be expressed as

    ∂_xx = −α² Σ_{p=1}^∞ D_0^p.    (3.10)

As the name implies, each power of D∗ is successively defined according to

    D∗^k ≡ D∗(D∗^{k−1}).    (3.11)

In previous work [44], for periodic boundary conditions, it was established that the partial sums for the left- and right-biased approximations to ∂_x satisfy

    ∂_x⁺ = −α (Σ_{p=1}^n D_L^p + O(1/α^{n+1})),    ∂_x⁻ = α (Σ_{p=1}^n D_R^p + O(1/α^{n+1})).    (3.12)

Similarly, for second derivatives with periodic boundaries, retaining n terms leads to a truncation error of the form

    ∂_xx = −α² (Σ_{p=1}^n D_0^p + O(1/α^{2n+2})).    (3.13)

In both cases, the relations can be obtained through a repeated application of integration by parts with induction. These approximations are still exact in space, but the integral operators nested in D∗ will eventually be approximated with quadrature. From these relations, we can also observe the impact of α on the size of the error in time. In particular, if we select α = O(1/Δt) in (3.12) and α = O(1/√Δt) in (3.13), each of the approximations has an error of the form O(Δt^n). The results concerning the consistency and stability of these higher-order approximations were established in [56, 44, 38]. As mentioned earlier, this work only considers approximations that are first-order in time. Therefore, we shall restrict ourselves to the following operator representations:

    ∂_x⁺ ≈ −α D_L,    ∂_x⁻ ≈ α D_R,    ∂_xx ≈ −α² D_0.    (3.14)

This is a consequence of retaining a single term from each of the partial sums in equations (3.8), (3.9), and (3.10). Consequently, computing higher powers of D∗ is unnecessary, so the successive property (3.11) is not needed.
However, it clearly indicates a possible path for higher-order extensions of the ideas presented here. For the moment, we shall delay prescribing the choice of α used in the representations (3.14), in order to avoid a problem-dependent selection. As alluded to at the beginning of this section, an identical representation is used for multi-dimensional problems, where the D∗ operators are now associated with a particular dimension of the problem. The operators along a particular dimension are constructed using data along that dimension of the domain, so that the directions remain uncoupled. This completes the discussion of the generic form of the representations used for spatial derivatives. Next, we provide some information regarding the treatment of boundary conditions, which determine the constants A and B appearing in the D∗ operators.

3.2.3 Comment on Boundary Conditions

The process of prescribing the values of A and B inside equations (3.3), (3.4), and (3.5), which are required to construct D∗, is highly dependent on the structure of the problem being solved. Previous work has shown how to prescribe a variety of boundary conditions for linear PDEs (see e.g., [50, 55, 43, 38]). For example, in linear problems, such as the wave equation, with either periodic or non-periodic boundary conditions, one can directly enforce the boundary conditions to determine the constants A and B. The situation can become much more complicated for problems that are both nonlinear and non-periodic. For approximations of at most third-order accuracy with nonlinear PDEs, one can use the techniques from [56] for non-periodic problems. To achieve high-order time accuracy, the partial sums for this case were modified to eliminate certain low-order terms along the boundaries.
We note that the development of high-order time discretizations, subject to non-trivial boundary conditions, for nonlinear operators, is still an open area of research for successive convolution methods. As this paper concerns the scalability of the method, we shall consider test problems that involve periodic boundary conditions. For periodic problems defined on the line interval [a, b], the constants associated with the boundary conditions for first derivatives are given by
\[
A = \frac{I_R[v;\alpha](b)}{1-\mu}, \qquad B = \frac{I_L[v;\alpha](a)}{1-\mu}, \tag{3.15}
\]
where \(I_L\) and \(I_R\) were defined in equations (3.3) and (3.4). Similarly, for second derivatives, the constants can be determined to be
\[
A = \frac{I_0[v;\alpha](b)}{1-\mu}, \qquad B = \frac{I_0[v;\alpha](a)}{1-\mu}, \tag{3.16}
\]
with the definition of \(I_0\) coming from (3.5). In the expressions (3.15) and (3.16) provided above, we use \(\mu \equiv e^{-\alpha(b-a)}\), where α is the appropriately chosen parameter. Note that the function v(x) denotes the generic operand of the operator \(\mathcal{D}_*\). This helps reduce the complexity of the notation when several applications of \(\mathcal{D}_*\) are required, since they are recursively defined. As an example, suppose we wish to compute \(\partial_{xx} h(u)\), where h is some known function. For this, we can use the first-order scheme for second derivatives (see (3.14)) and take v = h(u) in the expressions (3.16) for the boundary terms.

3.2.4 Fast Convolution Algorithm and Spatial Discretization

To perform a spatial discretization over [a, b], we first create a grid of N + 1 points, \(a = x_0 < x_1 < \cdots < x_N = b\), with local spacing \(\Delta x_i = x_{i+1} - x_i\). A naive approach to computing the convolution integral would lead to a method of complexity \(\mathcal{O}(N^2)\), where N is the number of grid points. However, using some algebra, we can write recurrence relations for the integral terms which comprise \(\mathcal{D}_*\):
\[
I_R[v;\alpha](x_i) = e^{-\alpha \Delta x_{i-1}} I_R[v;\alpha](x_{i-1}) + J_R[v;\alpha](x_i), \qquad I_R[v;\alpha](x_0) = 0,
\]
\[
I_L[v;\alpha](x_i) = e^{-\alpha \Delta x_i} I_L[v;\alpha](x_{i+1}) + J_L[v;\alpha](x_i), \qquad I_L[v;\alpha](x_N) = 0.
\]
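The recurrence above can be sketched in a few lines of code. The following is a minimal, self-contained example of the \(\mathcal{O}(N)\) sweep for \(I_R\) on a uniform grid; the local integrals \(J_R\) are approximated here with a simple trapezoidal rule purely for illustration, whereas the method described in this chapter uses a sixth-order WENO-based quadrature for that step. The function name and the quadrature choice are ours, not part of the thesis implementation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal sketch of the O(N) fast convolution sweep for I_R on a
// uniform grid, using the recurrence
//   I_R(x_i) = e^{-alpha*dx} I_R(x_{i-1}) + J_R(x_i),  I_R(x_0) = 0.
// The local integrals J_R are approximated here with a trapezoidal
// rule (a stand-in for the sixth-order WENO quadrature of the text).
std::vector<double> convolve_right(const std::vector<double>& v,
                                   double alpha, double dx) {
    const std::size_t n = v.size();
    std::vector<double> I(n, 0.0);           // I_R(x_0) = 0
    const double decay = std::exp(-alpha * dx);
    for (std::size_t i = 1; i < n; ++i) {
        // Trapezoid on alpha * e^{-alpha(x_i - s)} v(s) over [x_{i-1}, x_i]
        const double J = 0.5 * dx * alpha * (decay * v[i - 1] + v[i]);
        I[i] = decay * I[i - 1] + J;         // reuse the previous partial sum
    }
    return I;
}
```

For v ≡ 1 on [0, 1], the exact value is \(I_R(x) = 1 - e^{-\alpha x}\), which the sweep reproduces to quadrature accuracy; each grid point costs one multiply-add, which is the source of the \(\mathcal{O}(N)\) complexity.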
Here, we have defined the local integrals
\[
J_R[v;\alpha](x_i) = \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - s)} v(s)\,ds, \tag{3.17}
\]
\[
J_L[v;\alpha](x_i) = \alpha \int_{x_i}^{x_{i+1}} e^{-\alpha(s - x_i)} v(s)\,ds. \tag{3.18}
\]
By writing the convolution integrals this way, we obtain a summation method which has a complexity of \(\mathcal{O}(N)\). Note that the same algorithm can be applied to compute the convolution integral for the second derivative operator by splitting the integral at a point x. After applying the above algorithm to the left and right contributions, we can recombine them through "averaging" to recover the original integral. While a variety of quadrature methods have been proposed to compute the local integrals (3.17) and (3.18) (see e.g., [45, 50, 55, 43, 38, 33]), we shall consider the sixth-order quadrature methods introduced in [44], which use WENO interpolation to address both smooth and non-smooth problems. In what follows, we describe the procedure for \(J_R[v;\alpha](x_i)\), since the reconstruction for \(J_L[v;\alpha](x_i)\) is similar. This approximation uses a six-point stencil given by \(S(i) = \{x_{i-3}, \cdots, x_{i+2}\}\), which is then divided into three smaller stencils, each of which contains four points, defined by \(S_r(i) = \{x_{i-3+r}, \cdots, x_{i+r}\}\) for r = 0, 1, 2. We associate r with the shift in the stencil. A graphical depiction of this stencil is provided in Figure 3.1. The quadrature method is developed as follows:

1. On each of the small stencils \(S_r(i)\), we use the approximation
\[
J_R^{(r)}[v;\alpha](x_i) \approx \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - s)} p_r(s)\,ds = \sum_{j=0}^{3} c^{(r)}_{-3+r+j}\, v_{i-3+r+j}, \tag{3.19}
\]
where \(p_r(x)\) is the Lagrange interpolating polynomial formed from points in \(S_r(i)\) and the \(c^{(r)}_\ell\) are the interpolation coefficients, which depend on the parameter α and the grid spacing, but not v.

2. In a similar way, on the large stencil \(S(i)\) we obtain the approximation
\[
J_R[v;\alpha](x_i) \approx \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - s)} p(s)\,ds. \tag{3.20}
\]

3.
When the function v(x) is smooth, we can combine the interpolants on the smaller stencils so they are consistent with the high-order approximation obtained on the larger stencil, i.e.,
\[
J_R[v;\alpha](x_i) = \sum_{r=0}^{2} d_r J_R^{(r)}[v;\alpha](x_i), \tag{3.21}
\]
where the \(d_r > 0\) are called the linear weights, which form a partition of unity. The problems we consider in this work involve smooth functions, so this is sufficient for the final approximation. For instances in which the solution is not smooth, the linear weights can be mapped to nonlinear weights using the notion of smoothness. We refer the interested reader to previous work [56, 44, 45] for details concerning non-smooth data sets.

Figure 3.1: Stencils used to build the six-point quadrature [56, 44].

In A.3, we provide the expressions used to compute the coefficients \(c^{(r)}_\ell\) and \(d_r\) for a uniform grid, although the non-uniform grid case can be done as well [73]. In the case of a non-uniform mesh, the linear weights \(d_r\) would become locally defined in the neighborhood of a given point and would need to be computed on-the-fly. Uniform grids eliminate this requirement, as the linear weights, for a given direction, can be computed once per time step and reused in each of the lines pointing along that direction.

3.2.5 Coupling Approximations with Time Integration Methods

Here, we demonstrate how one can use this approach to solve a large class of PDEs by coupling the spatial discretizations in 3.2.4 with explicit time stepping methods. In what follows, we shall consider general PDEs of the form \(\partial_t U = F(t, U)\), where F(t, U) is a collective term for spatial derivatives involving the solution variable U. Possible choices for F might include generic nonlinear advection and diffusion terms \(F(t, U) = \partial_x g_1(U) + \partial_{xx} g_2(U)\), or even components of the HJ equations \(F(t, U) = H(U, \partial_x U)\).
To demonstrate how one can couple these approaches, we start by discretizing a PDE in time, but, rather than use backwards Euler, we use an s-stage explicit Runge-Kutta (RK) method, i.e.,
\[
u^{n+1} = u^n + \Delta t \sum_{i=1}^{s} b_i k_i,
\]
where the various stages are given by
\[
k_1 = F(t_n, u^n), \qquad
k_2 = F(t_n + c_2 \Delta t, u^n + \Delta t\, a_{21} k_1), \qquad \ldots, \qquad
k_s = F\Big(t_n + c_s \Delta t,\; u^n + \Delta t \sum_{j=1}^{s-1} a_{sj} k_j\Big).
\]
As with a standard Method-of-Lines (MOL) discretization, we would need to reconstruct derivatives within each RK stage. To illustrate, consider the nonlinear A-D equation, \(F(t, u) = \partial_x g_1(u) + \partial_{xx} g_2(u)\). For a term such as \(g_1\), we would use a monotone Lax-Friedrichs flux splitting, i.e., \(g_1 = g_1^+ + g_1^-\), where \(g_1^{\pm} = \frac{1}{2}(g_1(u) \pm r u)\) with \(r = \max_u |g_1'(u)|\). Hence, a particular RK stage can be approximated using
\[
F(t,u) \approx -\alpha \sum_{p=1}^{s} \mathcal{D}_L^p[g_1^+(u); \alpha]
+ \alpha \sum_{p=1}^{s} \mathcal{D}_R^p[g_1^-(u); \alpha]
- \alpha_\nu^2 \sum_{p=1}^{s} \mathcal{D}_0^p[g_2(u); \alpha_\nu].
\]
The resulting approximation to the RK stage can be shown to be \(\mathcal{O}(\Delta t^s)\) accurate. Another nonlinear PDE of interest to us is the H-J equation \(F(t, u) = H(\partial_x u)\). In a similar way, we would replace the Hamiltonian with a monotone numerical Hamiltonian, such as
\[
\hat{H}(v^-, v^+) = H\left(\frac{v^- + v^+}{2}\right) - r(v^-, v^+)\, \frac{v^+ - v^-}{2},
\]
where \(r(v^-, v^+) = \max_v |H'(v)|\). Then, the left and right derivative operators in the numerical Hamiltonian can be replaced with
\[
\partial_x^- u = \alpha \sum_{p=1}^{s} \mathcal{D}_R^p[u; \alpha], \qquad
\partial_x^+ u = -\alpha \sum_{p=1}^{s} \mathcal{D}_L^p[u; \alpha],
\]
which, again, yields an \(\mathcal{O}(\Delta t^s)\) approximation. In previous work [56, 44], for linear forms of F(t, u), it was shown that the resulting methods are unconditionally stable when coupled to explicit RK methods, up to order 3. Extensions beyond third-order are possible, but were not considered. For general, nonlinear problems, we typically couple an s-stage RK method to a successive convolution approximation of the same time accuracy, so that the error in the resulting approximation is \(\mathcal{O}(\Delta t^s)\).
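The flux splitting step can be illustrated concretely. The sketch below implements the monotone Lax-Friedrichs splitting \(g_1 = g_1^+ + g_1^-\) with \(g_1^{\pm}(u) = \frac{1}{2}(g_1(u) \pm r u)\), where r is taken as the maximum wave speed over the current data. The Burgers-type flux \(g_1(u) = u^2/2\) used here is our illustrative choice, not one fixed by the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Monotone Lax-Friedrichs flux splitting: g1 = g1^+ + g1^-, with
// g1^{±}(u) = (g1(u) ± r*u)/2 and r = max_u |g1'(u)| over the data.
// Illustration uses the Burgers flux g1(u) = u^2/2, so g1'(u) = u.
void lax_friedrichs_split(const std::vector<double>& u,
                          std::vector<double>& g_plus,
                          std::vector<double>& g_minus) {
    double r = 0.0;
    for (double ui : u) r = std::max(r, std::fabs(ui));  // r = max |g1'(u)|
    g_plus.resize(u.size());
    g_minus.resize(u.size());
    for (std::size_t i = 0; i < u.size(); ++i) {
        const double g = 0.5 * u[i] * u[i];      // g1(u) = u^2 / 2
        g_plus[i]  = 0.5 * (g + r * u[i]);       // fed to the left-biased operator
        g_minus[i] = 0.5 * (g - r * u[i]);       // fed to the right-biased operator
    }
}
```

By construction, the two pieces sum back to the original flux at every point, and each piece has a single-signed derivative, which is what makes the splitting monotone.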
3.3 Nearest-Neighbor Domain Decomposition Algorithm

In this section, we provide the relevant mathematical definitions of our domain decomposition algorithm, which are derived from the key operators used in successive convolution. Our goal is to establish and exploit data locality in the method, so that certain reconstructions, which are nonlocal, can be independently completed on non-overlapping blocks of the domain. This is achieved in part by leveraging certain decay properties of the integral representations. Once we have established some useful definitions, we use them to derive conditions in 3.3.1 which restrict the communication pattern to N-Ns. Then in sections 3.3.2 and 3.3.3, we illustrate how this condition can be used to enforce boundary conditions, in a consistent manner, for first and second derivative operators on each of the blocks. We then provide a brief summary of these findings, along with additional comments, in 3.3.4.

Maintaining a localized stencil is often advantageous in parallel computing applications. Code which is based on N-Ns is generally much easier to write and maintain. Additionally, messages used to exchange data owned by other blocks may not have to travel long distances within the network, provided that the blocks are mapped physically close together in hardware. Communication, even on modern computing systems, is far more expensive than computation. Therefore, an initial strategy for domain decomposition is to enforce N-N dependencies between the blocks. In order to decompose a problem into smaller, independent pieces, we separate the global domain into blocks that share borders with their nearest neighbors. For example, in the case of a 1-D problem defined on the interval [a, b], we can form N blocks by writing \(a = c_0 < c_1 < c_2 < \cdots < c_N = b\), with \(\Delta c_i = c_{i+1} - c_i\) denoting the width of block i.

Figure 3.2: A six-point WENO quadrature stencil in 2-D.
Multidimensional problems can be addressed in a similar way by partitioning the domain along multiple directions. Solving a PDE on each of these blocks, independently, requires an understanding of the various data dependencies. First, we address the local integrals \(J_*\). Depending on the quadrature method, the reconstruction algorithm might require data from neighboring blocks. Reconstructions based on previously described WENO-type quadratures require an extension of the grid in order to build the interpolant. This involves a "halo" region (see Figure 3.2), which is distributed amongst N-N blocks in the decomposition. On the other hand, more compact quadratures, such as Simpson's method [33], do not require this data. In this case, the quadrature communication phase can be ignored. The major task for this work involves efficiently communicating the data necessary to build each of the convolution integrals \(I_*\). Once the local integrals \(J_*\) are constructed through quadrature, we sweep across the lines of the domain to build the convolution integrals. It is this operation which couples the integrals across the blocks. To decompose this operation, we first rewrite the integral operators \(I_*\), assuming we are in block i, as
\[
I_L[v;\alpha](x) = \alpha \int_x^b e^{-\alpha(s-x)} v(s)\,ds
= \alpha \int_x^{c_{i+1}} e^{-\alpha(s-x)} v(s)\,ds + \alpha \int_{c_{i+1}}^{c_N} e^{-\alpha(s-x)} v(s)\,ds,
\]
and
\[
I_R[v;\alpha](x) = \alpha \int_a^x e^{-\alpha(x-s)} v(s)\,ds
= \alpha \int_{c_0}^{c_i} e^{-\alpha(x-s)} v(s)\,ds + \alpha \int_{c_i}^x e^{-\alpha(x-s)} v(s)\,ds.
\]
These relations, which assume x is within the interval \([c_i, c_{i+1}]\), elucidate the local and non-local contributions to the convolution integrals within block i.
Using simple algebraic manipulations, we can expand the non-local contributions to find that
\[
\int_{c_{i+1}}^{c_N} e^{-\alpha(s-x)} v(s)\,ds
= \sum_{j=i+1}^{N-1} \int_{c_j}^{c_{j+1}} e^{-\alpha(s-x)} v(s)\,ds
= \sum_{j=i+1}^{N-1} e^{-\alpha(c_j - x)} \int_{c_j}^{c_{j+1}} e^{-\alpha(s-c_j)} v(s)\,ds,
\]
and
\[
\int_{c_0}^{c_i} e^{-\alpha(x-s)} v(s)\,ds
= \sum_{j=0}^{i-1} \int_{c_j}^{c_{j+1}} e^{-\alpha(x-s)} v(s)\,ds
= \sum_{j=0}^{i-1} e^{-\alpha(x - c_{j+1})} \int_{c_j}^{c_{j+1}} e^{-\alpha(c_{j+1}-s)} v(s)\,ds,
\]
for the right- and left-moving data, respectively. With these relations, each of the convolution integrals can be formed according to
\[
I_L[v;\alpha](x) = \alpha \int_x^{c_{i+1}} e^{-\alpha(s-x)} v(s)\,ds
+ \alpha \sum_{j=i+1}^{N-1} e^{-\alpha(c_j - x)} \int_{c_j}^{c_{j+1}} e^{-\alpha(s-c_j)} v(s)\,ds, \tag{3.22}
\]
and
\[
I_R[v;\alpha](x) = \alpha \sum_{j=0}^{i-1} e^{-\alpha(x - c_{j+1})} \int_{c_j}^{c_{j+1}} e^{-\alpha(c_{j+1}-s)} v(s)\,ds
+ \alpha \int_{c_i}^{x} e^{-\alpha(x-s)} v(s)\,ds. \tag{3.23}
\]
From equations (3.22) and (3.23), we observe that both of the global convolution integrals can be split into a localized convolution with additional contributions coming from preceding or successive global integrals owned by other blocks in the decomposition. These global integrals contain exponential attenuation factors, the size of which depends on the respective distances between any pair of sub-domains. Next, we use this result to derive the restriction that facilitates N-N dependencies.

3.3.1 Nearest-Neighbor Criterion

Building a consistent block-decomposition for the convolution integral is non-trivial, since this operation globally couples unknowns along a dimension of the grid. Fortunately, the exponential kernel used in these reconstructions is pleasant in the sense that it automatically generates a region of compact support around a given block. Examining the exponential attenuation factors in (3.22) and (3.23), we see that contributions from blocks beyond N-Ns become small provided that (1) the distance between the blocks is large or (2) α is taken to be sufficiently large. Since we have less control over the block sizes, e.g., \(\Delta c_i\), we can enforce the latter criterion.
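The decomposition (3.23) can be verified numerically in the simplest possible setting. For v ≡ 1, the right-moving integral over [lo, x] has the closed form \(1 - e^{-\alpha(x - lo)}\), so the global integral over a two-block domain should equal the local piece plus the attenuated contribution of the preceding block. The two-block layout and function names below are our own illustrative choices.

```cpp
#include <cassert>
#include <cmath>

// Closed form of the right-moving convolution of v ≡ 1 over [lo, x]:
//   alpha * ∫_{lo}^{x} e^{-alpha(x-s)} ds = 1 - e^{-alpha (x - lo)}.
double I_R_const(double lo, double x, double alpha) {
    return 1.0 - std::exp(-alpha * (x - lo));
}

// Global integral over [a, x] reassembled per (3.23): the local piece
// on [c, x] plus the attenuated contribution of the block [a, c].
double I_R_blocked(double a, double c, double x, double alpha) {
    return I_R_const(c, x, alpha)
         + std::exp(-alpha * (x - c)) * I_R_const(a, c, alpha);
}
```

The attenuation factor \(e^{-\alpha(x - c)}\) multiplying the neighbor's contribution is exactly the term whose decay the nearest-neighbor criterion of the next subsection exploits.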
That is, we constrain α so that
\[
e^{-\alpha L_m} \leq \epsilon, \tag{3.24}
\]
where \(\epsilon \ll 1\) is some prescribed error tolerance, typically taken as \(1 \times 10^{-16}\), and \(L_m = \min_i \Delta c_i\) denotes the length of the smallest block. Taking logarithms of both sides and rearranging the inequality, we obtain the bound
\[
-\alpha \leq \frac{\log(\epsilon)}{L_m}.
\]
Our next step is to write this in terms of the time step Δt, using the choice of α. However, the bound on the time step depends on the choice of α. In 3.2.1, we presented two definitions for the parameter α, namely
\[
\alpha \equiv \frac{\beta}{c_{\max} \Delta t}, \qquad \text{or} \qquad \alpha \equiv \frac{\beta}{\sqrt{\nu \Delta t}}.
\]
Using these definitions for α, we obtain two conditions depending on the choice of α. For the linear advection equation and the wave equation, we obtain the condition
\[
-\frac{\beta}{c_{\max} \Delta t} \leq \frac{\log(\epsilon)}{L_m}
\quad \Longrightarrow \quad
\Delta t \leq -\frac{\beta L_m}{c_{\max} \log(\epsilon)}. \tag{3.25}
\]
Likewise, for the diffusion equation, the restriction is given by
\[
-\frac{\beta}{\sqrt{\nu \Delta t}} \leq \frac{\log(\epsilon)}{L_m}
\quad \Longrightarrow \quad
\Delta t \leq \frac{1}{\nu} \left( \frac{\beta L_m}{\log(\epsilon)} \right)^2. \tag{3.26}
\]
Depending on the problem, if the condition (3.25) or (3.26) is not satisfied, then we use the maximally allowable time step for a given tolerance ε, which is given by the equality component of the relevant condition. If several different operators appear in a given problem and are to be approximated with successive convolution, then each operator will be associated with its own α. In such a case, we should bound the time step according to the condition that is more restrictive among (3.25) and (3.26), which can be accomplished through the choice
\[
\Delta t \leq \min\left( -\frac{\beta L_m}{c_{\max} \log(\epsilon)}, \; \frac{1}{\nu} \left( \frac{\beta L_m}{\log(\epsilon)} \right)^2 \right). \tag{3.27}
\]
As before, when the condition is not met, we use the equality in (3.27). Restricting Δt according to (3.25), (3.26), or (3.27) ensures that contributions to the right- and left-moving convolution integrals, beyond N-Ns, become negligible. This is important because it significantly reduces the amount of communication, at the expense of a potentially restrictive time step. Note that in 3.5.4, we analyze the limitations of such restrictions for the linear advection equation.
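The combined restriction (3.27) reduces to a one-line computation. The following sketch evaluates the maximally allowable time step for given problem parameters; the function name and default tolerance are our own choices, with the tolerance matching the \(1 \times 10^{-16}\) value quoted in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Sketch of the combined nearest-neighbor time step restriction
// (3.27): the minimum of the wave/advection bound (3.25) and the
// diffusion bound (3.26). L_m is the width of the smallest block.
double nn_max_dt(double beta, double c_max, double nu,
                 double L_m, double eps = 1e-16) {
    const double ratio = beta * L_m / std::log(eps);  // log(eps) < 0, so ratio < 0
    const double dt_wave = -ratio / c_max;            // condition (3.25)
    const double dt_diff = ratio * ratio / nu;        // condition (3.26)
    return std::min(dt_wave, dt_diff);
}
```

Note that because the diffusion bound is quadratic in \(\beta L_m / \log(\epsilon)\), it is typically the binding constraint when that ratio is small, which is precisely the regime the N-N criterion enforces.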
In our future work, we shall consider generalizations of our approach which do not require (3.25), (3.26), or (3.27). In sections 3.3.2 and 3.3.3, we demonstrate how to formulate block-wise definitions of the global \(\mathcal{L}_*^{-1}\) operators using the derived conditions (3.25), (3.26), or (3.27).

3.3.2 Enforcing Boundary Conditions for \(\partial_x\)

In order to enforce the block-wise boundary conditions for the first derivative \(\partial_x\), we recall our definitions (3.3) and (3.4) for the left- and right-moving inverse operators:
\[
\mathcal{L}_L^{-1}[v;\alpha](x) = I_L[v;\alpha](x) + B e^{-\alpha(b-x)}, \tag{3.28}
\]
\[
\mathcal{L}_R^{-1}[v;\alpha](x) = I_R[v;\alpha](x) + A e^{-\alpha(x-a)}. \tag{3.29}
\]
We can modify these definitions so that each block contains a pair of inverse operators given by
\[
\mathcal{L}_{L,i}^{-1}[v;\alpha](x) = I_{L,i}[v;\alpha](x) + B_i e^{-\alpha(c_{i+1}-x)}, \tag{3.30}
\]
\[
\mathcal{L}_{R,i}^{-1}[v;\alpha](x) = I_{R,i}[v;\alpha](x) + A_i e^{-\alpha(x-c_i)}, \tag{3.31}
\]
where the \(I_{*,i}\) are defined as
\[
I_{L,i}[v;\alpha](x) = \alpha \int_x^{c_{i+1}} e^{-\alpha(s-x)} v(s)\,ds, \qquad
I_{R,i}[v;\alpha](x) = \alpha \int_{c_i}^x e^{-\alpha(x-s)} v(s)\,ds, \tag{3.32}
\]
and the subscript i denotes the block in which the operator is defined. As before, this assumes \(x \in [c_i, c_{i+1}]\). To address the boundary conditions, we need to determine expressions for the constants \(B_i\) and \(A_i\) on each of the blocks in the domain. First, substitute the definitions (3.22) and (3.23) into (3.28) and (3.29):
\[
\mathcal{L}_L^{-1}[v;\alpha](x) = \alpha \int_x^{c_{i+1}} e^{-\alpha(s-x)} v(s)\,ds
+ \alpha \sum_{j=i+1}^{N-1} e^{-\alpha(c_j-x)} \int_{c_j}^{c_{j+1}} e^{-\alpha(s-c_j)} v(s)\,ds
+ B e^{-\alpha(c_N - x)}, \tag{3.33}
\]
\[
\mathcal{L}_R^{-1}[v;\alpha](x) = \alpha \sum_{j=0}^{i-1} e^{-\alpha(x-c_{j+1})} \int_{c_j}^{c_{j+1}} e^{-\alpha(c_{j+1}-s)} v(s)\,ds
+ \alpha \int_{c_i}^x e^{-\alpha(x-s)} v(s)\,ds
+ A e^{-\alpha(x - c_0)}. \tag{3.34}
\]
Since we wish to maintain consistency with the true operator being inverted, we require that each of the block-wise operators satisfy
\[
\mathcal{L}_L^{-1}[v;\alpha](x) = \mathcal{L}_{L,i}^{-1}[v;\alpha](x), \qquad
\mathcal{L}_R^{-1}[v;\alpha](x) = \mathcal{L}_{R,i}^{-1}[v;\alpha](x),
\]
which can be explicitly written as
\[
B_i e^{-\alpha(c_{i+1}-x)} = \sum_{j=i+1}^{N-1} e^{-\alpha(c_j-x)} I_{L,j}[v;\alpha](c_j) + B e^{-\alpha(c_N-x)}, \tag{3.35}
\]
\[
A_i e^{-\alpha(x-c_i)} = \sum_{j=0}^{i-1} e^{-\alpha(x-c_{j+1})} I_{R,j}[v;\alpha](c_{j+1}) + A e^{-\alpha(x-c_0)}. \tag{3.36}
\]
Evaluating (3.35) at \(c_{i+1}\) and (3.36) at \(c_i\), we obtain
\[
B_i = \sum_{j=i+1}^{N-1} e^{-\alpha(c_j-c_{i+1})} I_{L,j}[v;\alpha](c_j) + B e^{-\alpha(c_N-c_{i+1})}, \tag{3.37}
\]
\[
A_i = \sum_{j=0}^{i-1} e^{-\alpha(c_i-c_{j+1})} I_{R,j}[v;\alpha](c_{j+1}) + A e^{-\alpha(c_i-c_0)}. \tag{3.38}
\]
Modifying Δt according to either (3.25) or, if necessary, (3.27), results in the communication stencil shown in Figure 3.3. More specifically, the terms representing the boundary contributions in each of the blocks are given by
\[
B_i =
\begin{cases}
B, & i = N-1, \\
I_{L,i+1}[v;\alpha](c_{i+1}), & i < N-1,
\end{cases}
\qquad
A_i =
\begin{cases}
A, & i = 0, \\
I_{R,i-1}[v;\alpha](c_i), & i > 0.
\end{cases}
\]
These relations generalize the various boundary conditions set by a problem. For example, with periodic problems, we can select
\[
B = I_{L,0}[v;\alpha](c_0), \qquad A = I_{R,N-1}[v;\alpha](c_N).
\]
This is the relevant strategy employed by domain decomposition algorithms in this work.

Figure 3.3: Fast convolution communication stencil in 2-D based on N-Ns.

3.3.3 Enforcing Boundary Conditions for \(\partial_{xx}\)

The enforcement of boundary conditions on blocks of the domain for the second derivative can be accomplished using an identical procedure to the one described in 3.3.2. First, we recall the inverse
operator associated with a second derivative (3.5):
\[
\mathcal{L}_0^{-1}[v;\alpha](x) = I_0[v;\alpha](x) + A e^{-\alpha(x-a)} + B e^{-\alpha(b-x)}, \tag{3.39}
\]
and define an analogous block-wise version of (3.39) as
\[
\mathcal{L}_{0,i}^{-1}[v;\alpha](x) = I_{0,i}[v;\alpha](x) + A_i e^{-\alpha(x-c_i)} + B_i e^{-\alpha(c_{i+1}-x)},
\]
with the localized convolution integral
\[
I_{0,i}[v;\alpha](x) = \frac{\alpha}{2} \int_{c_i}^{c_{i+1}} e^{-\alpha|x-s|} v(s)\,ds.
\]
Again, the subscript i denotes the block in which the operator is defined, and we take \(x \in [c_i, c_{i+1}]\). For the purposes of the fast summation algorithm, it is convenient to split this integral term into an average of left and right contributions, i.e.,
\[
\mathcal{L}_{0,i}^{-1}[v;\alpha](x) = \frac{1}{2}\Big( I_{L,i}[v;\alpha](x) + I_{R,i}[v;\alpha](x) \Big)
+ A_i e^{-\alpha(x-c_i)} + B_i e^{-\alpha(c_{i+1}-x)}, \tag{3.40}
\]
where the \(I_{*,i}\) are the same integral operators shown in equation (3.32) used to build the first derivative. As in the case of the first derivative, a condition connecting the boundary conditions on the blocks to the non-local integrals can be derived, which, if evaluated at the ends of the block, results in the 2 × 2 linear system
\[
A_i + B_i e^{-\alpha \Delta c_i}
= \frac{1}{2} \sum_{j=i+1}^{N-1} e^{-\alpha(c_j - c_i)} I_{L,j}[v;\alpha](c_j)
+ \frac{1}{2} \sum_{j=0}^{i-1} e^{-\alpha(c_i - c_{j+1})} I_{R,j}[v;\alpha](c_{j+1})
+ A e^{-\alpha(c_i - c_0)} + B e^{-\alpha(c_N - c_i)}, \tag{3.41}
\]
\[
A_i e^{-\alpha \Delta c_i} + B_i
= \frac{1}{2} \sum_{j=i+1}^{N-1} e^{-\alpha(c_j - c_{i+1})} I_{L,j}[v;\alpha](c_j)
+ \frac{1}{2} \sum_{j=0}^{i-1} e^{-\alpha(c_{i+1} - c_{j+1})} I_{R,j}[v;\alpha](c_{j+1})
+ A e^{-\alpha(c_{i+1} - c_0)} + B e^{-\alpha(c_N - c_{i+1})}. \tag{3.42}
\]
Equations (3.41) and (3.42) can be solved analytically to find that
\[
\begin{pmatrix} A_i \\ B_i \end{pmatrix}
= \frac{1}{1 - e^{-2\alpha \Delta c_i}}
\begin{pmatrix} 1 & -e^{-\alpha \Delta c_i} \\ -e^{-\alpha \Delta c_i} & 1 \end{pmatrix}
\begin{pmatrix} r_0 \\ r_1 \end{pmatrix},
\]
where we have used the variables \(r_0\) and \(r_1\) to denote the terms appearing on the right-hand sides of (3.41) and (3.42), respectively.
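The analytic 2 × 2 solve is cheap enough to perform once per block per sweep. A minimal sketch, with our own function name and with \(\mu = e^{-\alpha \Delta c_i}\) passed in precomputed:

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Analytic solve of the 2x2 system (3.41)-(3.42) for the block-wise
// boundary constants (A_i, B_i). r0 and r1 are the right-hand sides,
// and mu = e^{-alpha * dc_i} for a block of width dc_i, so the system
// matrix is [[1, mu], [mu, 1]] with determinant 1 - mu^2.
std::pair<double, double> boundary_constants(double r0, double r1, double mu) {
    const double det = 1.0 - mu * mu;
    const double A = (r0 - mu * r1) / det;
    const double B = (r1 - mu * r0) / det;
    return {A, B};
}
```

Applying the forward map to known constants and solving recovers them exactly, which is a convenient unit test for an implementation.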
Under the N-N constraints (3.26) or (3.27), many of the exponential terms can be neglected, resulting in the compact expressions
\[
A_i =
\begin{cases}
A, & i = 0, \\
\tfrac{1}{2} I_{R,i-1}[v;\alpha](c_i) \equiv I_{0,i-1}[v;\alpha](c_i), & i > 0,
\end{cases}
\]
and
\[
B_i =
\begin{cases}
B, & i = N-1, \\
\tfrac{1}{2} I_{L,i+1}[v;\alpha](c_{i+1}) \equiv I_{0,i+1}[v;\alpha](c_{i+1}), & i < N-1.
\end{cases}
\]

3.3.4 Additional Comments

In this section, we developed the mathematical framework behind our proposed domain decomposition algorithm. We derived a condition which reduces the construction of a nonlocal operator to a N-N dependency by leveraging the decay properties of the exponential term within the convolution integrals. We wish to reiterate that this condition is not strictly necessary. One could remove this condition by including contributions beyond N-Ns at the expense of additional communication. This change would certainly result in a loss of speed per time step, but the additional expense could be amortized by the ability to use a much larger time step, which would reduce the overall time-to-solution. As a first pass, we shall ignore these additional contributions, which may limit the scope of problems we can study, but we plan to generalize these algorithms in our future work via an adaptive strategy. This approach would begin using data from N-Ns, then gradually include additional contributions using information about the decay from the exponential. In the next section, we shall discuss details regarding the implementation of our methods and particular design choices made in the construction of our algorithms.

3.4 Strategies for Efficient Implementation on Parallel Systems

In this section, we discuss strategies for constructing parallel algorithms to solve PDEs. We provide the details related to our work on thread-scalable, shared memory algorithms, as well as distributed memory algorithms, where the problem is decomposed into smaller, independent problems that communicate necessary information via message-passing.
3.4.1 introduces the core concepts of the Kokkos performance portability library, which is used to develop our shared memory algorithms. Once we have introduced these ideas, we explore numerous loop-level optimizations for essential loop structures in 3.4.3, using the performance metrics discussed in 3.4.2. Building on the results of these loop experiments, we outline the structure of our shared memory algorithms in 3.4.4. We then discuss the implementation of the distributed memory component of our algorithms in 3.4.5, along with modifications which enable the use of an adaptive time stepping rule. Finally, we summarize the key findings and developments of the implementation which are used to conduct our numerical experiments.

3.4.1 Selecting a Shared Memory Programming Model

Many programming models exist to address the aspects of shared memory parallelization, such as OpenMP, OpenACC, CUDA, and OpenCL. The question of which model to use often depends on the target architecture on which the code is to run. However, given the recent trend towards deploying more heterogeneous computing systems, e.g., ones in which a given node contains a variety of CPUs with one, or many, accelerators (typically GPUs), the choice becomes far more complicated. Developing codes which are performant across many computing architectures is a highly non-trivial task. Due to memory access patterns, code which is optimized to run on CPUs is often not optimal on GPUs, so these models address portability rather than performance. This introduces yet another concern related to code management and maintenance: as new architectures are deployed, code needs to be tuned or modified to take advantage of new features, which can be time consuming. Additionally, enabling these abstractions almost invariably results in either multiple versions of the code or rather complicated build systems. In our work, we choose to adopt Kokkos [46], a performance portable, shared memory programming model.
Kokkos tries to address the aforementioned problem posed by rapidly evolving architectures through template-metaprogramming abstractions. Their model provides abstractions for common parallel policies (i.e., for-loops, reductions, and scans), memory spaces, and execution spaces. The architecture-specific details are hidden from users through these abstractions, yet the setup allows the application programmer to take advantage of numerous performance-related features. Given basic knowledge of templates, operator overloads, as well as functors and lambdas, one can implement a variety of program designs. Also provided are the so-called views, which are powerful multi-dimensional array containers that allow Kokkos iterators to map data onto various architectures in a performant way. Additionally, the bodies of iterators become either a user-defined functor or a KOKKOS_LAMBDA, which is just a functor that is generated by the compiler. This allows users to maintain one version of the code which has the flexibility to run on various architectures, such as the one depicted in Figure 3.4.

Figure 3.4: Heterogeneous platform targeted by Kokkos [46].

Other performance portability models, such as RAJA [74], work in a similar fashion to Kokkos, but they are less intrusive with regard to memory management. With RAJA, the user is responsible for implementing architecture-dependent details such as array layouts and policies. In this sense, RAJA emphasizes portability, with the user being responsible for handling performance.

3.4.2 Comment on Performance Metrics

In order to benchmark the performance of the algorithms, we need a descriptive metric that accounts for varying workloads among problem sizes. In our numerical simulations we use a time stepping rule so that problems with a smaller mesh spacing require more time steps, i.e., Δt ∼ Δx.
Therefore, one can either time the code for a fixed number of steps or track the number of steps in the entire simulation \(t \in (0, T]\) and compute the average time per time step. We adopt the former approach throughout this work. To account for the varying workloads attributed to varying numbers of cells/grid points, we define the update rate as Degrees-of-Freedom/node/s (DOF/node/s), which can be computed via
\[
\text{DOF/node/s} = \frac{\text{total variables} \times N^d}{\text{nodes} \times \dfrac{\text{total time (s)}}{\text{total steps}}}, \tag{3.43}
\]
where d is the number of spatial dimensions. This metric is a more general way of comparing the raw performance of the code, as it allows for simultaneous comparisons among linear or nonlinear problems with varying degrees of dimensionality and numbers of components. It also allows for a comparison, in terms of speed, against other classes of methods, such as finite element methods, where the workload on a given cell is allowed to vary according to the number of basis elements.¹ In 3.4.3, we shall use this performance metric to benchmark a collection of techniques for prescribing parallelism across predominant loop structures in the algorithms for successive convolution.

3.4.3 Benchmarking Prototypical Loop Patterns

Often, when designing shared memory algorithms, one has to make design decisions prescribing the way threads are dispatched to the available data. However, there are often many ways of accomplishing a given task. Kokkos provides a variety of parallel iteration techniques; the selection of a particular pattern typically depends on the structure of the loop (perfectly or imperfectly nested) and the size of the loops. In [75], the authors sought to optimize a recurring pattern, consisting of triple or quadruple nested for-loops, in the Athena++ MHD code [76]. Their strategy was to use a flexible loop macro to test various loop structures across a range of architectures.
Athena++ was already optimized to run on Intel Xeon Phi platforms, so they primarily focused on approaches for porting to GPUs which maintained this performance on CPUs. Our work differs in that we have not yet identified optimal loop patterns for CPUs or GPUs, and the algorithms used here contain at least two major prototypical loop patterns. At the moment, we are not focusing on optimizing for GPUs, but we do our best to keep in mind possible performance-related issues associated with various parallelization techniques. Some examples of recurring loop structures in successive convolution algorithms, for 3-D problems, are provided in Scheme 3.1 and Scheme 3.2. Technically, there are left- and right-moving operators associated with each direction, but, for simplicity, we will ignore this in the pseudo-code.

¹Note that the update frequency does not account for error in the numerical solution. Certainly, in order to compare the efficiency of various methods, especially those that belong to different classes, one must take into account the quality of the solution. This would be reflected in, for example, an error versus time-to-solution plot.

for(int ix = 0; ix < Nx; ix++){
    for(int iy = 0; iy < Ny; iy++){
        // Perform some intermediate calculations
        // ...
        // Apply 1-D algorithm to z-line data
        for(int iz = 0; iz < Nz; iz++){
            z_operator(ix, iy, iz) = ...
        }
    }
}

Scheme 3.1: Looping pattern used in the construction of local integrals, convolutions, and boundary steps.

Another important note we wish to make concerns the storage of the operator data on the mesh. Since the operations are performed "line-by-line" on potentially large multidimensional arrays, we
choose to store the data in memory so that the sweeps are performed on the fastest-changing loop variables. This allows us to avoid significant memory access penalties associated with reading and writing to arrays, as the entries of interest are now consecutive in memory.² For example, suppose we have an N-dimensional array with indices \(x_1, x_2, \cdots, x_N\), and we wish to construct an operator in the \(x_1\) direction. Then, we would store this operator in memory as operator(\(x_2, \cdots, x_N, x_1\)). The loops appearing in Scheme 3.1 and Scheme 3.2 can then be permuted accordingly. Note that the solution variable u(\(x_1, x_2, \cdots, x_N\)) is not transposed and is a read-only quantity during the construction of the operators.

// Looping pattern for the resolvent operators
for(int ix = 0; ix < Nx; ix++){
    for(int iy = 0; iy < Ny; iy++){
        for(int iz = 0; iz < Nz; iz++){
            z_operator(ix, iy, iz) = u(ix, iy, iz) - z_operator(ix, iy, iz);
            z_operator(ix, iy, iz) *= alpha_z;
        }
    }
}

Scheme 3.2: Another looping pattern used to build "resolvent" operators. With some modifications, this same pattern could be used for the integrator step. In several cases, this iteration pattern may require reading entries which are separated by large distances in memory (i.e., the data is strided).

In an effort to develop an efficient application, we follow the approach described in [75] to determine optimal loop iteration techniques for patterns such as Scheme 3.1 and Scheme 3.2. Our simple 2-D and 3-D experiments tested numerous combinations of policies, including naive as well as more complex parallel iteration patterns, using the OpenMP backend in Kokkos.

²This is true when the memory space is that of the CPU (host memory). In device memory, these entries will be "coalesced", which is the optimal layout for threading on GPUs. This mapping of indices, between memory spaces, is automatically handled by Kokkos.
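The transposed storage described above amounts to choosing which index has unit stride in the flat layout. The sketch below shows the index map for a row-major z-operator, where the sweep index iz varies fastest; the extents and function name are our own illustrative choices (a Kokkos View with the appropriate layout performs this mapping automatically).

```cpp
#include <cassert>
#include <cstddef>

// Flat, row-major index map for a z-direction operator stored so that
// the sweep index iz has unit stride in memory:
//   operator(ix, iy, iz) -> (ix*Ny + iy)*Nz + iz.
// Consecutive iz values are therefore adjacent, which is what makes
// the line sweeps cache-friendly on CPUs (and coalesced on GPUs).
inline std::size_t line_index(std::size_t ix, std::size_t iy, std::size_t iz,
                              std::size_t Ny, std::size_t Nz) {
    return (ix * Ny + iy) * Nz + iz;  // iz varies fastest in memory
}
```

An x-direction operator would instead be stored with the roles permuted, e.g., operator(iy, iz, ix), so that its sweep index ix becomes the unit-stride index.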
Our goals were to quantify possible performance gains attainable through the following strategies:

1. auto-vectorization via #pragma statements or ThreadVectorRange (TVR);
2. improving data reuse and caching behavior with loop tiling/blocking;
3. prescribing parallelism across combinations of team-type execution policies and team sizes.

Vectorization can offer substantial performance improvements for data that is contiguous in memory; however, several performance-critical operations in the algorithms involve reading data which is strided in memory. Therefore, it is not clear a priori whether vectorization would offer any improvement. Additionally, for larger problems, the line operations along certain directions involve reading strided data, so the benefits of caching are lost. The performance penalty of operating on data with the wrong layout depends on the architecture, with penalties on GPUs typically being quite severe compared to CPUs. The use of a blocked iteration pattern, such as the one outlined in Scheme A.1 (see A.2), is a step toward minimizing such performance penalties. In order to see a performance benefit from this approach, the algorithms must be structured in such a way as to reuse the data that is read into caches as much as possible. Naturally, one could prescribe one or more threads (of a team) to process blocks, so we chose to implement cache blocking using the hierarchical execution policies provided by Kokkos. From coarse-to-fine levels of granularity, these can be ordered as follows: TeamPolicy (TP), TeamThreadRange (TTR), and ThreadVectorRange (TVR). For perfectly nested loops, one can achieve similar behavior using MDRange and prescribing block sizes. During testing, we found that when block sizes are larger than or equal to the size of the view, a segmentation fault occurs, so this was avoided. The results of our loop experiments are provided in Figures 3.5 and A.1.
For tests employing blocking, we used a block size of 256² in 2-D, while 3-D problems used a block size of 32³. Information regarding the various choices, such as the compiler, optimization flags, etc., used to generate these results can be found in Table 3.1.

CPU Type: Intel Xeon Gold 6148
C++ Compiler: ICC 2019.03
Optimization Flags: -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qno-opt-prefetch
Thread Bindings: OMP_PROC_BIND=close, OMP_PLACES=threads

Table 3.1: Architecture and code configuration for the loop experiments conducted on the Intel 18 cluster at Michigan State University's Institute for Cyber-Enabled Research. To leverage the wide vector registers, we encourage the compiler to use AVX-512 instructions. Hardware prefetching is not used, as initial experiments seemed to indicate that it hindered performance.

Initially, we used GCC 8.2.0-2.31.1 as our compiler, but we found through experimentation that using an Intel compiler improved the performance of our application by a factor of ∼2 for this platform. The authors of [75] experienced similar behavior for their application and attribute this to a difference in auto-vectorization capabilities between compilers. An examination of the source code for loop execution policies in Kokkos reveals that certain decorators, e.g., #pragma ivdep, are present, which help encourage auto-vectorization when Intel compilers are used. We are unsure whether similar hints are provided for GCC. As part of our blocking implementation, we stored block information in views, which could then be accessed by a team of threads. After the information about the block is obtained, we compute indices for the data within the block and use these to extract the relevant grid data. Then, one can either create subviews (shallow copies) of the block data or proceed directly with the line calculations on the block data. We refer to these as tiling with and without subviews, respectively.
Intuitively, one would think that skipping the block subview creation step would be faster. However, among the blocked or tiled experiments, those that created subviews of the tile data were generally faster than those that did not. Using blocking for smaller problems typically resulted in a large number of idle threads, which significantly degraded the performance compared to non-blocked policies. In such situations, a user would need to take care to ensure that a sufficient number of blocks are used to generate enough work, i.e., each thread (or team) has at least one block to process. For larger problems, blocking was faster when compared to variants that did not use blocking. We observe that the performance of non-blocked policies begins to degrade once a problem becomes sufficiently large, whereas blocked policies maintained a consistent update rate, even as the problem size increased. By separating the key loop structures from the complexities of the application, we were able to expedite the experimental process for identifying efficient loop execution techniques. In 3.4.4, we use the results of these experiments to inform choices regarding the design of the shared memory algorithms.

[Figure 3.5 compares the execution policies Tiling w/subviews + TVR, Tiling w/o subviews + TVR, TP + TTR + TVR, and TP + TTR w/o TVR (each with a “best” variant), at team sizes Kokkos::AUTO(), 2, and 4, plotting DOF/s against the problem size N.]

Figure 3.5: Plots comparing the performance of different parallel execution policies for the pattern in Scheme 3.1 using test cases in 2-D (left) and 3-D (right). Tests were conducted on a single node that consists of 40 cores using the code configuration outlined in Table 3.1.
Each group consists of three plots, whose difference is the value selected for the team size. We note that hyperthreading is not enabled on our systems, so Kokkos::AUTO() defaults to a team size of 1. In each pane, we use “best” to refer to the best run for that configuration across different team sizes. Tile experiments used block sizes of 256² in 2-D problems and 32³ in 3-D. We observe that vectorized policies are generally faster than non-vectorized policies. Interestingly, among blocked/tiled policies, those that construct subviews appear to be faster than those that skip the subview construction, despite the additional work. As the problem size increases, the performance of blocked policies improves substantially. This can be attributed to the large number of idle thread teams when the problem size does not produce enough blocks. In such cases, increasing the size of the team does offer an improvement, as it reduces the number of idle thread teams. For non-blocked policies, we observe that increasing the team size generally results in minimal, if any, improvement in performance. In all cases, the use of blocking provides a more consistent update rate when enough work is introduced.

3.4.4 Shared Memory Algorithms

The line-by-line approach to operator reconstruction suggests that we employ a hierarchical design, which consists of thread teams. Rather than employ a fine-grained threading approach over loop indices, we use the coarse-grained, blocked iteration pattern devised in 3.4.3. In this approach, we divide the iteration space into blocks of nearly identical size and assign one or more blocks to a team of threads. The threads within a given team are then dispatched to one (or more) lines, with vector instructions being used within the lines.
As opposed to loop-level parallelism, coarse-grained approaches allow one to exploit the multiple levels of parallelism common to many modern CPUs and to load balance the computation across blocks by adjusting the loop scheduling policy. In our implementation, we provide the flexibility of setting the number of threads per block with a macro, but, in general, we let Kokkos choose the appropriate team size using Kokkos::AUTO(). If running on the CPU, this sets the team size to be the number of hyperthreads (if supported) on a given core. For GPU architectures, the team size is the size of the warp. A hierarchical design pattern is used because the loops in our algorithms are not perfectly nested, i.e., calculations are performed between adjoining loops. Information related to blocking can be precomputed to minimize the number of operations required to manipulate blocks. The process of subview construction consists of shallow copies involving pointers to vertices of the blocks, so no additional memory is required. With a careful choice of a base block size, one can fit these blocks into high-bandwidth memory, so that access costs are reduced. Furthermore, a team-based, hierarchical pattern seems to provide a large degree of flexibility compared to standard loop-level parallelism. In particular, we can fuse adjacent kernels into a single parallel region, which reduces the effect of kernel launch overhead and minimizes the number of synchronization points. The use of a team-type execution policy also allows us to exploit features present on other architectures, such as CUDA's shared memory feature, through scratchpad constructs. Performing a stenciled operation on strided data is associated with an architecture-dependent penalty. On CPUs, while one wishes to operate in a contiguous or cached pattern, various compilers can hide these penalties through optimizations, such as prefetching. GPUs, on the other hand, prefer to operate in a lock-step fashion.
Therefore, if a kernel is not vectorizable, then one pays a significant performance penalty for poor data access patterns. Shared memory, while slower than register accesses, does not require coalesced accesses, so the cost can be significantly reduced. The advantage of using square-like blocks of a fixed size, as opposed to long pencils, is that one can adjust the dimensions of the blocks so that they fit into the constraints of the high-bandwidth memory. Moreover, these blocks can be loaded once and reused for additional directions, whereas pencils would require numerous transfers as lines are processed along a given dimension. Such optimizations are not explored in this work, but algorithmic flexibility is something we must emphasize moving forward. The parallel nested loop structures, such as the one provided in A.2 (see Scheme A.1), are applied during reconstructions for the local integrals J∗ and inverse operators L∗⁻¹, as well as the integrator update. The current exception to this pattern is the convolution algorithm, shown in Scheme A.2, which is also provided in A.2. Here, each thread is responsible for constructing I∗ on one or more lines of the grid. Therefore, within a line, each thread performs the convolution sweeps, in serial, using our O(N) algorithm. Adopting the team-tiling approach for this operation requires that we modify our convolution algorithm considerably; this optimization is left to future work. Additionally, the benefit of this optimization is not large for CPUs, as profiling indicated that < 5% of the total time for a given run was spent inside this kernel. However, this kernel will likely consume more time on GPUs, so this will need to be investigated. If one wishes to use variable time stepping rules, where the time step is computed from a formula of the form

Δt = CFL · min(Δx/c_x, Δy/c_y, · · · ),    (3.44)

then one must supply parallel loop structures with simultaneous maximum reductions for each of the wave speeds c_i.
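In serial form, such a simultaneous reduction might look like the following sketch (our names, not the production code): a single pass over the mesh produces the maxima of |c_x| and |c_y| together, and the result feeds directly into rule (3.44). In Kokkos, the fused pass would be a parallel_reduce with a multi-value reducer.

```cpp
#include <vector>
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cassert>

// Pair of maximum wave speeds accumulated in one fused pass.
struct WaveSpeeds { double cx = 0.0, cy = 0.0; };

// One sweep over the mesh data; assumes cx and cy have equal length.
WaveSpeeds max_wave_speeds(const std::vector<double>& cx,
                           const std::vector<double>& cy) {
    WaveSpeeds s;
    for (std::size_t k = 0; k < cx.size(); ++k) {   // both maxima in one pass
        s.cx = std::max(s.cx, std::fabs(cx[k]));
        s.cy = std::max(s.cy, std::fabs(cy[k]));
    }
    return s;
}

// Equation (3.44): dt = CFL * min(dx/c_x, dy/c_y).
double adaptive_dt(double CFL, double dx, double dy, const WaveSpeeds& s) {
    return CFL * std::min(dx / s.cx, dy / s.cy);
}
```

Fusing the two maxima into one traversal is exactly what the custom reducer would buy in the parallel setting: one read of the mesh data instead of two.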
This can be implemented as a custom functor, but the use of blocking/tiling introduces some complexities, and more complex reducers that enable such calculations are not currently available. (We refer to a pencil as a generally long rectangle in 2-D or a rectangular prism in 3-D. The use of pencils, as opposed to square blocks, would require additional precomputing efforts and, possibly, restrictions on the problem size.) For this reason, problems that use time stepping rules such as (3.44) are constructed with symmetry in the wave speeds, i.e., c_x = c_y = · · · , to avoid an overly complex implementation with blocking/tiling. However, we plan to revisit this in later work as we begin targeting more general problems. Next, in 3.4.5, we discuss the distributed memory component of the implementation and the strategy used to employ an adaptive time stepping rule, such as (3.44).

3.4.5 Code Strategies for Domain Decomposition

One of the issues with distributed computing involves mapping the problem data in an intelligent way so that it best aligns with the physical hardware. Since the kernels used in our algorithms consume a relatively small amount of time, it is crucial that we minimize the time spent communicating data. Given that these schemes were designed to run on Cartesian meshes, we can use a “topology aware” virtual communicator supplied by MPI libraries. These constructs take a collection of ranks in a communicator (each of which manages a sub-domain) and, if permitted, attempt to reorganize them to best align with the physical hardware. This mapping might not be optimal, since it depends on a variety of factors related to the job allocation and the MPI implementation. Depending on the problem, these tools can greatly improve the performance of an application compared to a hand-coded implementation that uses the standard communicator.
Additionally, MPI’s Cartesian virtual communicator provides functionality to obtain neighbor references and derive additional communicator groups, say, along rows, columns, etc. The send and receive operations are performed with persistent communications, which remove some of the overhead of communication channel creation. Persistent communications require a regular communication pattern, which, for our purposes, is simply N-N. For more general problems with irregular communication patterns, standard send and receive operations can be used.

One strategy to minimize exposed communications is to use non-blocking communications. This allows the programmer to overlap calculations with communications, and is especially beneficial if the application can be written in a staggered fashion. If certain data required for a later calculation is available, then communication can proceed while another calculation is being performed. Once the data is needed, we can block progress until the message transfer is complete. However, we hope that the calculation done in between is sufficient to hide the time spent performing the communication. In the multi-dimensional setting, other operators may be needed, such as ∂_x and ∂_y. However, in our algorithms, directions are not coupled, which allows us to stagger the calculations. So, we can initialize the communications along a given direction and build pieces of other operators in the background.

[Figure 3.6 task charts. Fixed time stepping (left): Start → MPI_Isends: halo → interior reconstructions → MPI_Wait: halo → halo reconstructions → convolution sweeps → MPI_Isends: BC data → MPI_Wait: BC data → build “inverse” operators → build “D” operators → integrator. Adaptive time stepping (right): the same sequence, except an MPI_Iallreduce over the wave speeds is posted at the start, and an MPI_Wait on the wave speeds plus a local reduce accompany the integrator step.]

Figure 3.6: Task charts for the domain-decomposition algorithm under fixed (left) and adaptive (right) time stepping rules. The work overlap regions are indicated, laterally, using gray boxes. The work inside the overlap regions should be sufficiently large to hide the communications occurring in the background. To clarify, the overlap in calculations for I∗ is achieved by changing the sweeping direction during an exchange of the boundary data. As indicated in the adaptive task chart, the reduction over the “lagged” wave speed data can be performed in the background while building the various operators. Note the use of MPI_WAIT prior to performing the integrator step. This is done to prevent certain overwrite issues during the local reductions in the subsequent integrator step.

A typical complication that arises in distributed implementations of PDE solvers concerns the use of various expensive collective operations, such as “all-to-one” and “one-to-all” communications. For implicit methods, these operations occur as part of the iterative method used for solving distributed linear systems. The method employed here is “matrix-free”, which eliminates the need to solve such distributed linear systems. For explicit methods, these operations arise when an adaptive time stepping rule, such as equation (3.44), is employed to ensure that the CFL restriction is satisfied for stability purposes. At each time step, each of the processors, or ranks, must know the maximum wave speeds across the entire simulation domain. On a distributed system, transferring this information requires the use of certain collective operations, which typically have an overall complexity of O(log N_p), where N_p is the number of processors.
While the logarithmic complexity results in a massive reduction in the overall number of steps, these operations use a barrier, in which all progress stops until the operation is completed. This step cannot be avoided for explicit methods, as the most recent information from the solution is required to accurately compute the maximum wave speeds. In contrast, successive convolution methods do not require this information. However, implementations of the schemes developed in, e.g., [56, 44, 45] considered “explicit” time stepping rules given by equation (3.44) because they improved the convergence of the approximations. By exploiting the stability properties associated with successive convolution methods, we can eliminate the need for accurate wave speed information based on the current state and, instead, use approximations obtained with “lagged” data from the previous time step. We present two generic, distributed memory task charts in Figure 3.6. The algorithm shown in the left half of Figure 3.6, which is based on a fixed time step, contains less overall communication, as the local and global reductions for certain information used to compute the time step are no longer necessary. The second version, shown in the right half of Figure 3.6, illustrates the key steps used in the implementation of an adaptive rule, which can be used for problems with more dynamic quantities (e.g., wave speeds and diffusivity). In contrast to distributed implementations of explicit methods, our adaptive approach allows us to overlap expensive global collective operations (approximately) with the construction of derivative operators, resulting in a more asynchronous algorithm (see Algorithm 3.1).

3.4.6 Some Remarks

In this section, we introduced key aspects that are necessary in developing a performant application.
We began with a brief discussion on Kokkos, which is the programming model used for our shared memory implementation. Then we introduced one of the metrics, namely (3.43), used to characterize the performance of our parallel algorithms. Using this performance metric, we analyzed a collection of techniques for parallelizing prototypical loop structures in our algorithms. These techniques considered several different approaches to the prescription of parallelism through both naive and complex execution policies. Informed by these results, we chose to adopt a coarse-grained, hierarchical approach that utilizes the extensive capabilities available on modern hardware. In consideration of our future work, this approach also offers a large degree of algorithmic flexibility, which will be essential for moving to GPUs. Finally, we provided some details concerning the implementation of the distributed memory components of the parallel algorithms.

Goal: Approximate the global maximum wave speeds c_x, c_y, · · · using the corresponding “lagged” variables c̃_x, c̃_y, · · · .
1: Initialize the c_i's and c̃_i's via the initial condition (no lag has been introduced)
2: while timestepping do
3:   Update the N-N condition (i.e., (3.25), (3.26), or (3.27)) using the “lagged” wave speeds c̃_x, c̃_y, · · ·
4:   Compute Δt using the “lagged” wave speeds and check the N-N condition
5:   Start the MPI_Iallreduce over the local wave speeds c_x, c_y, · · ·
6:   Construct the spatial derivative operators of interest
7:   Post the MPI_WAIT (in case the reductions have not completed)
8:   Transfer the global wave speed information to the corresponding lagged variables: c̃_x ← c_x, c̃_y ← c_y, · · ·
9:   Perform the update step and compute the local wave speeds c_x, c_y, · · ·
10:  Return to step 3 to begin the next time step

Algorithm 3.1: Distributed adaptive time stepping rule.
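Stripped of the MPI calls, the lagged logic of Algorithm 3.1 reduces to a small state machine, sketched here in serial C++ for a single wave speed (the struct and its names are ours, not the production code): Δt for the current step is computed from the wave speed reduced during the previous step, which is what lets the global reduction overlap with operator construction.

```cpp
#include <cmath>
#include <cassert>

// Serial sketch of the "lagged" adaptive rule; MPI_Iallreduce and
// MPI_WAIT (steps 5 and 7 of Algorithm 3.1) are omitted.
struct LaggedStepper {
    double c_lag;   // global max wave speed from the previous step
    double CFL;
    double dx;

    // Returns dt for the current step using the lagged speed (step 4),
    // then accepts the freshly reduced speed for the next step (step 8).
    double step(double c_current) {
        double dt = CFL * dx / c_lag;
        c_lag = c_current;
        return dt;
    }
};
```

The stability of the underlying scheme is what tolerates the one-step-old wave speed; an explicit method could not safely make this substitution.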
We introduced two different approaches: one based on a fixed time step, with minimal communication, and another, which exploits the stability properties of the representations and allows for adaptive time stepping rules. The next section provides numerical results, which demonstrate not only the performance and scalability of these algorithms, but also their versatility in addressing different PDEs.

3.5 Numerical Results

This section provides the experimental results for our parallel algorithms using MPI and Kokkos together with the OpenMP backend. First, in 3.5.1, we define several test problems and verify the rates of convergence for the hybrid algorithms described in sections 3.3 and 3.4. Next, we provide both weak and strong scaling results obtained from each of the example problems discussed in 3.5.1. 3.5.4 provides some insight on issues faced by the distributed memory algorithms, in light of the N-N condition (3.25), which was derived in 3.3.1. Unless otherwise stated, the results presented in this section were obtained using the configurations outlined in Table 3.2. Timing data presented in Figures A.2, A.3, and A.4 was collected using 10 trials for each configuration (problem size and node count), with the update metric (3.43) being displayed relative to 10⁹ DOF/node/s. Each of these trials evolved the numerical solution over 10 time steps. Error bars, for data involving averages, were computed using the sample standard deviation.

3.5.1 Description of Test Problems and Convergence Experiments

Despite the fact that we are primarily focused on developing codes for high-performance applications, we must also ensure that the parallel algorithms produce reliable answers. Here, we demonstrate convergence of the 2-D hybrid parallel algorithms on several test problems, including a nonlinear example that employs the adaptive time stepping rule outlined in Algorithm 3.1.
The convergence results used 9 nodes, with 40 threads per node, assigning 1 MPI rank to each node, for a total of 360 threads. The quadrature method used to construct the local integrals is the fifth-order WENO quadrature rule, described in 3.2.4, which uses only the linear weights. The numerical solution, in each of the examples, remains smooth over the corresponding time interval of interest. Therefore, it is not necessary to transform the linear quadrature weights to nonlinear ones. According to the analysis of the truncation error presented in [56, 44], retaining a single term in the partial sums for D∗ should yield a first-order convergence rate, depending on the choice of α. Convergence results for each of the three test problems defined below are provided in Figure 3.7.

CPU Type: Intel Xeon Gold 6148
C++ Compiler: ICC 2019.03
MPI Library: Intel MPI 2019.3.199
Optimization Flags: -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qno-opt-prefetch
Thread Bindings: OMP_PROC_BIND=close, OMP_PLACES=threads
Team Size: Kokkos::AUTO()
Base Block Size: 256²
CFL: 1.0
β: 1.0
ε: 1 × 10⁻¹⁶

Table 3.2: Architecture and code configuration for the numerical experiments conducted on the Intel 18 cluster at Michigan State University's Institute for Cyber-Enabled Research. As with the loop experiments in 3.4.3, we encourage the compiler to use AVX-512 instructions and avoid the use of prefetching. All available threads within a node (40 threads/node) were used in the experiments. Each node consists of two Intel Xeon Gold 6148 CPUs and at least 83 GB of memory. We wish to note that hyperthreading is not supported on this system. As mentioned in 3.4.3, when hyperthreading is not enabled, Kokkos::AUTO() defaults to a team size of 1. In cases where the base block size did not divide the problem evenly, this parameter was adjusted to ensure that blocks were nearly identical in size. The parameter β, which does not depend on Δt, is used in the definition of α. For details on the range of admissible β values, we refer the reader to [56, 44], where this parameter was introduced. Lastly, recall that ε is the tolerance used in the N-N constraints.

Example 1: Linear Advection Equation

The first test problem considered in this work is the 2-D linear advection equation

∂_t u + ∂_x u + ∂_y u = 0,    (x, y) ∈ [0, 2π]²,
u_0(x, y) = (1/4)(1 − cos(x))(1 − cos(y)),

subject to two-way periodic BCs. We evolve the numerical solution to the final time T = 2π. In the experiments, we used the same number of mesh points in both directions, with α_x = α_y = β/Δt, with β = 1.

[Figure 3.7 plots the L∞ error against Δx for the advection, diffusion, and Hamilton-Jacobi applications, together with a first-order reference line.]

Figure 3.7: Convergence results for each of the 2-D example problems. Results were obtained using 9 MPI ranks with 40 threads/node. Also included is a first-order reference line (solid black). Our convergence results indicate first-order accuracy resulting from the low-order temporal discretization. The final reported L∞ errors for each of the applications, on a grid containing 5277² total zones, are 2.874 × 10⁻³ (advection), 4.010 × 10⁻⁴ (diffusion), and 2.674 × 10⁻⁴ (H-J).

While this problem is rather simple and does not highlight many of the important features of our algorithm, it is nearly identical to the code for a nonlinear example. For initial experiments, a simple test problem is preferable because it gives more control over quantities which are typically dynamic, such as wave speeds. Moreover, the error can be easily computed from the exact solution u(x, y, t) = u_0(x − t, y − t).

Example 2: Linear Diffusion Equation

The next test problem that we consider is the linear diffusion equation

∂_t u = ∂_xx u + ∂_yy u,    (x, y) ∈ [0, 2π]²,
u_0(x, y) = sin(x) + sin(y),

subject to two-way periodic BCs. The numerical solution is evolved over (0, T], with T = 1, in order to prevent substantial decay.
As with the previous example, we use an equal number of mesh points in both directions, so that Δx = Δy. The fixed time stepping rule was used, with Δt = Δx. Compared with the previous example, we used the parameter definitions α_x = α_y = 1/√Δt, which corresponds to β = 1 in the definition for second derivative operators. The exact solution for this problem is given by

u(x, y, t) = e^(−t)(sin(x) + sin(y)).

Characteristically, this example is different from the advection equation in the previous example, which allows us to illustrate some key features of the method. Firstly, code developed for advection operators can be reused to build diffusion operators, an observation made in 3.2.1. More specifically, to construct the left and right-moving local integrals, we used the same linear WENO quadrature as with the advection equation in Example 1. However, we note that this particular example could, instead, use a more compact quadrature to eliminate the halo communication, which would remove a potential synchronization point. The second feature concerns the time-to-solution and is related to the unconditional stability of the method. Linear diffusion equations, when solved by an explicit method, are known to incur a harsh stability restriction on the time step, namely Δt ∼ Δx², making long-time simulations prohibitively expensive. The implicit aspect of this method drastically reduces the time-to-solution, as one can now select time steps which are, for example, proportional to the mesh spacing. This benefit is further emphasized by the overall speed of the method, which can be observed in sections 3.5.2 and 3.5.3.

Example 3: Nonlinear Hamilton-Jacobi Equation

The last test problem we consider is the nonlinear H-J equation

∂_t u + (1/2)(1 + ∂_x u + ∂_y u)² = 0,    (x, y) ∈ [0, 2π]²,
u_0(x, y) = 0,

subject to two-way periodic BCs.
To prevent the characteristic curves from crossing, which would lead to jumps in the derivatives of the function u, the numerical solution is tracked over a short time, i.e., T = 0.5. We applied a high-order linear WENO quadrature rule to approximate the left and right-moving local integrals and used the same parameter choices for α_x, α_y, and β as with the advection equation in Example 1. However, since the wave speeds fluctuate based on the behavior of the solution u, we allow the time step to vary according to (3.44), which requires the use of the distributed adaptive time stepping rule outlined in Algorithm 3.1. Typically, an exact solution is not available for such problems. Therefore, to test the convergence of the method, we use a manufactured solution given by

u(x, y, t) = t (sin(x) + sin(y)),

with a corresponding source term included on the right-hand side of the equation. Methods employed to solve this class of problems are typically explicit, with a shock-capturing method being used to handle the appearance of “cusps” that would otherwise lead to jumps in the derivative of the solution. A brief summary of such methods is provided in our recent paper [45], where extensions of successive convolution were developed for curvilinear and non-uniform grids. The method follows the same structural format as an explicit method, with the ability to take larger time steps as in an implicit method. However, the explicit-like structure of this method does not require iteration for nonlinear terms and allows for a more straightforward coupling with high-order shock-capturing methods. We wish to emphasize that despite the fact that this example is nonlinear, the only major mathematical difference with Example 1 is the evaluation of a different Hamiltonian function.
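To make that last point concrete, the two Hamiltonians can be written as plain functions (the function names here are ours): writing each equation as ∂_t u + H(∂_x u, ∂_y u) = 0, Example 1 corresponds to H(p, q) = p + q, while Example 3 uses H(p, q) = (1/2)(1 + p + q)².

```cpp
#include <cassert>

// Hamiltonian for Example 1 (linear advection): H(p, q) = p + q,
// where p = du/dx and q = du/dy.
double hamiltonian_advection(double p, double q) {
    return p + q;
}

// Hamiltonian for Example 3 (nonlinear H-J): H(p, q) = 0.5*(1 + p + q)^2.
double hamiltonian_hj(double p, double q) {
    double s = 1.0 + p + q;
    return 0.5 * s * s;
}
```

Swapping one function for the other is, mathematically, the only change between the two applications; the surrounding reconstruction and time stepping machinery is shared.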
3.5.2 Weak Scaling Experiments

A useful performance property for examining the scalability of parallel algorithms describes how they behave when the compute resources are chosen proportionally to the size of the problem. Here, the amount of work per compute unit remains fixed, and the compute units are allowed to fluctuate. Weak scaling assumes ideal or best-case performance for the parallel components of algorithms and ignores the influence of bottlenecks imposed by the sequential components of a code. Therefore, for N compute units, we shall expect a speedup of N. This motivates the following definitions for speedup and efficiency in the context of weak scaling:

S_N = N T_1 / T_N,    E_N = S_N / N ≡ T_1 / T_N.

Therefore, with weak scaling, ideal performance is achieved when the run times for a fixed work size (or, equivalently, the DOF/node/s) remain constant as we vary the compute units. To scale the problem size, we take advantage of the periodicity of the test problems. The base problem on [0, 2π] × [0, 2π] can be replicated across nodes so that the total work per node remains constant. Provided in A.2 are plots of the weak scaling data (specifically, the update metric (3.43) and the corresponding efficiency) obtained from the fastest of 10 trials of each configuration, using up to 49 nodes (1,960 cores). These results generally indicate good performance, both in terms of the update frequency and efficiency, for a variety of problem sizes. Weak scalability appears to be excellent up to 16 nodes (640 cores), then begins to decline, most likely due to network effects. The performance behavior for the advection and diffusion applications is quite similar, which is to be expected, since the parallel algorithms used to construct the base operators are nearly identical. With regard to the Hamilton-Jacobi application, we see that the performance is similar to the other applications at larger node counts.
This seems to indicate that no major communication penalties are incurred by the use of the adaptive time stepping method shown in Algorithm 3.1, compared to fixed time stepping. Additionally, in the Hamilton-Jacobi application, we observe a sharp decline in the performance at 9 nodes in A.2. A closer investigation reveals that this is likely an artifact of the job scheduler for the system on which the experiments were conducted, as we were unable to secure a “contiguous” allocation of nodes. This has the unfortunate consequence of not being able to guarantee that data for a particular trial remain in close physical proximity. This could result in issues such as network contention and delays that exacerbate the cost of communication relative to the computation, as discussed in 3.4. A non-contiguous placement of data is problematic for codes with inexpensive operations, such as the methods shown here, because the work may be insufficient to hide this increased cost of communication. For this reason, we chose to include plots containing the averaged weak scaling data in A.2, which contain error bars calculated from the sample standard deviation. The noticeable size of the error bars in these plots generally indicates a large degree of variation in the timings collected from trials. To more closely examine the importance of data proximity on the nodes, we repeated the weak scaling study, but with node counts for which a contiguous allocation could be guaranteed. We have provided results for the fastest and averaged data in A.3. Data collected from the fastest trials indicates nearly perfect weak scaling, across all applications, up to 9 nodes, with a consistent update rate between 2 and 4 × 10⁸ DOF/node/s. For convenience, these results were plotted with the same markers and formats so that results from the larger experiments in A.2 can be compared directly.
A comparison of the fastest timings between the large and small runs supports our claim that data proximity is crucial to achieving the peak performance of the code. Furthermore, the error bars for the contiguous experiments displayed in A.3 show that the individual trials exhibit less overall variation in timings.

3.5.3 Strong Scaling Experiments

Another form of scalability considers a fixed problem size and examines the effect of varying the number of work units used to find the solution. In these experiments, we allow the work per compute unit to decrease, which helps identify regimes where sequential bottlenecks in algorithms become problematic, provided we are granted enough resources. Applications which are said to strong scale exhibit run times which are inversely proportional to the number of resources used. For example, when $N$ compute units are applied to a problem, one expects the run to be $N$ times faster than with a single compute unit. Additionally, if an algorithm's performance is memory bound, rather than compute bound, this will, at some point, become apparent in these experiments. Supplying additional compute units should not improve performance if more time is spent fetching data rather than performing useful computations. This motivates the following definitions for speedup and efficiency in the context of strong scaling:
\[
S_N = \frac{T_1}{T_N}, \qquad E_N = \frac{S_N}{N}.
\]
Here, $N$ is the number of nodes used, so that $T_1$ and $T_N$ correspond to the time measured using a single node and $N$ nodes, respectively. Results of our strong scaling experiments are provided in A.4. As with the weak scaling experiments, we have plotted the update metric 3.43 along with the strong scaling efficiency using both the fastest and averaged configuration data from a set of 10 trials.
In contrast to weak scaling, strong scaling does not assume ideal speedup, so one could plot this information; however, it can be ascertained from the efficiency data, so we refrain from plotting it. Results from these experiments show decent strong scalability for the N-N method. This method does not contain a substantial amount of work, so we do not expect good performance for smaller base problem sizes, as the work per node becomes insufficient to hide the cost of communication. On the other hand, larger base problem sizes, which introduce more work, are capable of saturating the resources, but will also, at some point, become insufficient. This behavior is apparent in our efficiency plots. Increasing the problem size generally results in an improvement of the efficiency and speedup for the method. Part of these problems can be attributed to the use of the blocking pattern for loop structures discussed in 3.4.4. Depending on the size of the mesh, it may be the case that the block size and the team size set by the user result in idle threads. One possible improvement is to simply increase the team size so that there are fewer idle threads within an MPI task. Alternatively, one can adjust the number of threads per task, so that each task is responsible for fewer threads. While these approaches can be implemented with no changes to the code, they will likely not resolve this issue. Profiling seems to indicate that the source of the problem is the low arithmetic intensity of the reconstruction algorithms. In other words, the method is memory bound because the calculations required in the reconstructions are inexpensive relative to the cost of retrieving data from memory. As part of our future work, we plan to investigate such limitations through the use of detailed roofline models. We also plan to consider test problems in 3-D, which will introduce additional work.
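As a concrete sketch of the strong-scaling definitions used above (the function and the sample timings are illustrative, not measured data):

```python
def strong_scaling_table(times_by_nodes):
    """Given {n: T_n} for a fixed total problem size, return
    {n: (S_N, E_N)} with speedup S_N = T_1/T_N and efficiency E_N = S_N/N.
    Requires an entry for n = 1 as the baseline."""
    t1 = times_by_nodes[1]
    return {n: (t1 / tn, t1 / (tn * n)) for n, tn in times_by_nodes.items()}

# A hypothetical run that scales perfectly to 4 nodes, then saturates:
table = strong_scaling_table({1: 100.0, 4: 25.0, 16: 12.5})
```

Here the 4-node run has efficiency 1.0, while the 16-node run achieves only a speedup of 8 (efficiency 0.5), the kind of saturation the efficiency plots in A.4 reveal.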
3.5.4 Effect of CFL

In order to enforce an N-N dependency for our domain decomposition algorithm, we obtained several possible restrictions on Δt, depending on the problem and the choice of α. In the case of linear advection, we would, for example, require that
\[
\Delta t \le -\frac{\beta L_m}{c_{\max} \log(\epsilon)},
\]
with the largest possible time step permitting N-N dependencies being set by the equality. Admittedly, such a restriction is undesirable. As mentioned in 3.3.1, this assumption can be problematic if the problem admits fast waves ($c_{\max}$ is large) and/or if the block sizes are particularly small ($L_m$ is small). In many applications, the former circumstance is quite common. However, our test problem contains fixed wave speeds, so this is less of an issue. The latter condition is a concern for configurations which use many blocks, such as a large simulation on many nodes of a cluster. Another potential circumstance is related to the granularity of the blocks. For example, in these experiments, we use 1 MPI rank per compute node. However, it may be advantageous to consider different task configurations, e.g., using 1 (or more) rank(s) per NUMA region of a compute node. A larger CFL parameter is generally preferable because it reduces the overall time-to-solution. Eventually, however, for a given CFL, there will be a crossover point, where the time step restriction causes the performance to drop due to the increasing number of sequential time steps. This experiment used a highly refined grid and varied the CFL number, using up to 9 nodes. Results from the CFL experiments are provided in 3.8. The data was obtained using an older version of our parallel algorithms, compiled with GCC 8.2.0-2.31.1, which does not use blocking. By plotting the behavior according to the number of nodes (ranks) used, we can fix the lengths of the blocks, hence $L_m$, and change the time step to identify the breakdown region.
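The time step bound above is simple enough to sketch directly; the function name is ours, and the parameters follow the notation of the restriction (β a scheme constant, L_m the smallest block length, c_max the fastest wave speed, ε the kernel truncation tolerance).

```python
import math

def max_dt_nearest_neighbor(beta, L_m, c_max, eps):
    """Largest time step for which the non-local convolution support stays
    within nearest-neighbor blocks: dt <= -beta*L_m / (c_max*log(eps)).
    Since 0 < eps < 1, log(eps) < 0 and the bound is positive."""
    assert 0.0 < eps < 1.0
    return -beta * L_m / (c_max * math.log(eps))
```

Note that halving L_m (more, smaller blocks) halves the permitted time step linearly, while tightening the tolerance from 1e-8 to 1e-16 only halves it, reflecting the logarithmic suppression of the tolerance discussed below.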
We observe a substantial decrease in performance for the 9 node configuration, specifically when the CFL number increases from 4 to 5. For more complex problems with dynamic wave behavior, this breakdown may be observed earlier. In response to this behavior, a user could simply increase (relax) the tolerance, but the logarithm tends to suppress the impact of large relaxations.

Figure 3.8: Results on the N-N method for the linear advection equation using a fixed mesh with 53772 total DOF and a variable CFL number, showing walltime (s) and efficiency against the CFL number for 1, 4, and 9 MPI ranks. In each case, we used the fastest time-to-solution collected from repeating each configuration a total of 20 times. This particular data was collected using an older version of the code, compiled with GCC, which did not use the blocking approach. For larger block sizes, increasing the CFL has a noticeable improvement on the run time, but as the block sizes become smaller, the gains diminish. For example, if 9 MPI ranks are used, improvements are observed as long as CFL ≤ 4. However, when CFL = 5, the run times begin to increase, with a significant decrease in efficiency. As the blocks become smaller, Δt needs to be adjusted (decreased) so that the support of the non-local convolution data does not extend beyond N-Ns.

Another option, which shall be considered in future work, is to include more information from neighboring ranks by either eliminating this condition or, at the least, communicating enough information to achieve a prescribed tolerance.

3.6 Conclusion

In this paper, we presented hybrid parallel algorithms capable of addressing a wide class of both linear and nonlinear PDEs.
To enable parallel simulations on distributed systems, we derived a set of conditions that use available wave speed (and/or diffusivity) information, along with the size of the sub-domains, to limit the communication through an adjustment of the time step. Although not considered here, these conditions, which are needed to ensure accuracy rather than the stability of the method, can be removed at the cost of additional communication. Using these restrictions, boundary conditions are enforced across sub-domains in the decomposition. Results were obtained for 2-D examples consisting of linear advection, linear diffusion, and a nonlinear H-J equation to highlight the versatility of the methods in addressing characteristically different PDEs. As part of the implementation, we used constructs from the Kokkos performance portability library to parallelize the shared memory components of the algorithms. We extracted essential loop structures from the algorithms and analyzed a variety of parallel execution policies in an effort to develop an efficient application. These experiments considered several common optimization techniques, such as vectorization, cache-blocking, and placement of threads. From these experiments, we chose to use a blocked iteration pattern in which threads (or teams thereof) are mapped to blocks of an array, with vector instructions being applied to 1-D line segments. These design choices offer a large degree of flexibility, which is an important consideration as we proceed with experimentation on other architectures, including GPUs, to leverage the full capabilities provided by Kokkos. By exploiting the stability properties of the representations, we also developed an adaptive time stepping method for distributed systems that uses "lagged" wave speed information to calculate the time step.
While the methods presented here do not require adaptive time stepping for stability, it was included as an option because of its ability to prevent excessive numerical diffusion, as observed in previous work. Convergence and scaling properties for the hybrid algorithms were established using at most 49 nodes (1,960 cores), with a peak performance > 10^8 DOF/node/s. Larger weak scaling experiments, which used up to 49 nodes (1,960 cores), initially performed reasonably well, with all applications later tending to 60% efficiency, corresponding (roughly) to 2 × 10^8 DOF/node/s. While some performance loss is to be expected from network-related complications, we found this to be much larger than what was observed in prior experiments. Later, it was discovered that the request for a contiguous allocation could not be accommodated, so data locality in the experiments was compromised. By repeating the experiments on a smaller collection of nodes, which granted this request, we discovered that data locality plays a pivotal role in the overall performance of the method. We observe that a large base problem size is required to achieve good strong scaling. Furthermore, when threads are prescribed work at a coarse granularity (i.e., across blocks, rather than entries within the blocks), one must ensure that the problem size is capable of saturating the resources to avoid idle threads. This approach introduces further complications for strong scaling, as the workload per node drops substantially while the block size remains fixed. Successive convolution is quite similar to finite difference methods, which generally do not generate a substantial amount of work; therefore, we do not expect excellent strong scalability. Certain aspects of the algorithms can be tuned to improve the arithmetic intensity, which will improve the strong scaling behavior. At some point, however, the algorithms will be limited by the speed of memory transfers rather than computation.
We also provided experimental results that demonstrate the limitations of the N-N condition in the context of strong scaling. While we have presented several new ideas with this work, there is still much untapped potential with successive convolution methods. Firstly, optimizations on GPU architectures, which shall play an integral role in the upcoming exascale era, need to be explored and compared with CPUs. A roofline model should be developed for these algorithms to help identify key limitations and bottlenecks and to formulate possible solutions. Although this work considered only first-order time discretizations, our future developments shall be concerned with evaluating a variety of high-order time discretization techniques in an effort to increase the efficiency of the method. Lastly, the parallel algorithms should be modified to enable the possibility of mesh adaptivity, which is a common feature offered by many state-of-the-art computing libraries.

CHAPTER 4

DEVELOPING A PARTICLE-IN-CELL METHOD

4.1 Introduction

This chapter concerns the development of a particle-in-cell (PIC) method for plasmas that is constructed using the solvers for fields developed in the preceding chapters. We begin by introducing the concept of a macro-particle, which is the foundation of all PIC methods, in section 4.2. Next, in section 4.3, we discuss techniques used in our methods to enforce the involutions of the computational model. We then discuss the methods employed to advance the particle data, including standard integrators, which can be found in section 4.4, as well as more recent integrators developed for problems with so-called non-separable Hamiltonians in section 4.5. In the sections on integrators, we propose modifications which allow for a coupling of these methods to the proposed field solvers. Numerical examples which demonstrate the performance of the proposed methods on several plasma test problems are presented in section 4.6.
We provide a summary of our key findings in section 4.7.

4.2 Moving from Point-particles to Macro-particles

One of the essential features of PIC methods is that the simulation particles are not physical particles. Instead, they represent a sample of particles collected from an underlying distribution function. For this reason, they are often called super-particles or macro-particles. Moreover, the "size" of this sample is reflected in the weight associated with a given macro-particle, $w_{mp}$, which can be calculated as
\[
w_{mp} = \frac{N_{\text{real}}}{N_{\text{simulation}}}.
\]
Here, we use $N_{\text{real}}$ to denote the number of physical particles contained within a simulation domain and $N_{\text{simulation}}$ to be the number of simulation particles. The calculation of $N_{\text{real}}$ is problem dependent, but it can be expressed in terms of the average macroscopic number density $\bar{n}$ that describes the plasma and a volume associated with either the domain or the beam being considered. Once the weight for each particle is calculated, it can be absorbed into properties of the particle species, e.g., the charge, so that $w_{mp} q_i$ is written as $q_i$. In section 1.2.2, the charge density and current density were defined in equations (1.22) and (1.23) using a linear combination of Dirac delta distributions for a collection of $N_p$ simulation particles. While PIC methods can certainly be developed to work with these point-particle representations, most PIC methods, including the ones developed in this work, represent particles using shape functions so that
\[
\rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p S\left(\mathbf{x} - \mathbf{x}_p(t)\right), \tag{4.1}
\]
\[
\mathbf{J}(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \mathbf{v}_p S\left(\mathbf{x} - \mathbf{x}_p(t)\right), \tag{4.2}
\]
where the shape function $S$ is now used to represent a simulation particle. The shape functions most often employed in PIC simulations are B-splines, which are compact (local), positive, and satisfy the partition of unity property. Furthermore, they can be easily extended to include additional dimensions using tensor products of univariate splines.
While higher-order splines produce smoother mappings to the mesh and possess higher degrees of continuity, the extended support regions create complications for plasmas on bounded domains. The methods developed in this work employ linear splines to represent particle shapes. The linear spline function that represents the particle $x_p$ on the mesh with spacing $\Delta x$ is given by
\[
S(x - x_p) =
\begin{cases}
1 - \dfrac{|x - x_p|}{\Delta x}, & 0 \le \dfrac{|x - x_p|}{\Delta x} \le 1, \\[6pt]
0, & \dfrac{|x - x_p|}{\Delta x} > 1.
\end{cases} \tag{4.3}
\]
The shape function (4.3) generally serves two purposes: (1) it provides a way to map particle data onto the mesh (the scatter operation) and (2) it can be used to interpolate mesh-based quantities to the particles during the time integration (the gather operation). For consistency in a PIC method, it is important that these maps be identical. In the next section, we address the issue of enforcing charge conservation with the proposed formulation. To this end, we discuss several approaches, including divergence cleaning methods, as well as a modification of the usual PIC charge mapping (4.1) that solves the continuity equation.

4.3 Methods for Controlling Divergence Errors

The formulation adopted in this work considers formulations of Maxwell's equations in the Lorenz gauge (1.8)-(1.10) as well as the Coulomb gauge (1.12)-(1.14). The benefit of adopting a gauge formulation is that the involution for the magnetic field, $\nabla \cdot \mathbf{B} = 0$, will be trivially satisfied, since $\mathbf{B} = \nabla \times \mathbf{A}$; however, it is important to recognize that this formulation is over-determined and only makes sense as long as the particular gauge condition (either (1.10) or (1.14)) is satisfied. The issue of enforcing the gauge condition is ultimately connected to charge conservation through the involution $\nabla \cdot \mathbf{E} = \rho/\epsilon_0$. When this condition is not satisfied or properly controlled in a numerical scheme, the solutions can become non-physical.
In [19], the authors proposed techniques for controlling the growth of errors in Gauss' law and demonstrated the behavior that results when these conditions are not properly enforced in PIC methods. Even if a Yee mesh [13] is used to represent the fields, this condition is not guaranteed to be satisfied in PIC codes due to errors introduced through the particle-to-mesh mappings [77]. In an effort to control this error, we examine two general classes of methods: (1) divergence cleaning through an auxiliary equation and (2) an analytical charge map for particles that is based on the continuity equation. The methods contained in the first class perform the cleaning through field solvers, which require a Poisson solve or a wave solve, the latter of which can be performed with the methods introduced in Chapter 2. The method in the second class enforces charge conservation through a path integral of the particle trajectories and is an analytical extension of the map presented in [25] for FEM-PIC. In contrast to [25], which computed the map numerically via quadrature, the map presented in this work is obtained by analytically integrating and differentiating the shape functions used to represent particles.

4.3.1 A Classic Elliptic Projection Method Based on Gauss' Law

In PIC methods based on the Boris method [4], which evolve particles using E and B fields, one can use a classic elliptic projection technique to reduce the errors in Gauss' law [2, 20, 77]. The idea is that one would like to enforce Gauss' law
\[
\nabla \cdot \mathbf{E} = \frac{1}{\sigma_1}\rho.
\]
During the simulation, a build-up of violations of the continuity equation introduces certain errors that change the irrotational part of the electric field. If we let $\mathbf{E}^*$ denote the electric field computed with a numerical method, we can see that these violations take the form
\[
\mathbf{E} = \mathbf{E}^* - \nabla \psi_E, \tag{4.4}
\]
where $\psi_E$ is some scalar function that should be determined.
Taking the divergence of both sides and using Gauss' law, we obtain the elliptic equation
\[
-\Delta \psi_E = \frac{1}{\sigma_1}\rho - \nabla \cdot \mathbf{E}^*. \tag{4.5}
\]
The divergence term for the numerical electric field requires some form of numerical derivative obtained using a difference approximation or a basis. If we assume that the error is zero at the boundary, which is a realistic assumption, this equation can be solved with homogeneous Dirichlet boundary conditions. One then takes a discrete gradient and corrects the numerical electric field using (4.4). For the case in which fields are expressed in terms of potentials, recall that the electric field is calculated as
\[
\mathbf{E} = -\nabla \psi - \frac{\partial \mathbf{A}}{\partial t},
\]
which is valid for any gauge. After solving equation (4.5), we can correct the scalar potential and its gradient according to
\[
\psi = \psi^* + \psi_E, \qquad \nabla \psi = \nabla \psi^* + \nabla \psi_E,
\]
where we have, again, used "*" to denote the initial approximation.

4.3.2 Elliptic Divergence Cleaning Based on Potentials

Another elliptic divergence cleaning method was developed in the thesis [42] and also appeared in [41] for PIC simulations of the VM system in the Lorenz gauge. The idea is to construct an elliptic equation by combining Gauss' law with the electric field, which is expressed in terms of the potentials. In other words, we can substitute the definition
\[
\mathbf{E} = -\nabla \psi - \frac{\partial \mathbf{A}}{\partial t}
\]
into Gauss' law to arrive at the elliptic equation
\[
-\Delta \psi = \frac{1}{\sigma_1}\rho + \frac{\partial (\nabla \cdot \mathbf{A})}{\partial t}.
\]
This equation was solved using a second-order centered finite-difference approximation of the Laplacian, and it was shown that a discrete form of Gauss' law will be satisfied as long as a staggered grid is used for the mesh data. This mesh staggering, which is problem dependent, is different from the more commonly used Yee method [13] and is largely motivated by the difference scheme used to approximate the Laplacian.
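As a minimal, hedged illustration of the projection in section 4.3.1, the sketch below solves the elliptic equation (4.5) spectrally on a 1-D periodic domain and applies the correction (4.4). The thesis uses homogeneous Dirichlet conditions and mesh-based derivatives, so the periodic FFT solve and the function name here are simplifications of ours for demonstration only.

```python
import numpy as np

def project_gauss_1d(E_star, rho, sigma1, L):
    """Elliptic projection on a periodic 1-D domain of length L:
    solve -psi'' = rho/sigma1 - dE*/dx in Fourier space, then return the
    corrected field E = E* - dpsi/dx, which satisfies Gauss' law.
    Assumes the right-hand side has zero mean (solvability on a torus)."""
    n = E_star.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)       # wavenumbers
    rhs_hat = np.fft.fft(rho / sigma1) - 1j * k * np.fft.fft(E_star)
    psi_hat = np.zeros(n, dtype=complex)
    nz = k != 0.0
    psi_hat[nz] = rhs_hat[nz] / (k[nz] ** 2)           # k^2 psi_hat = rhs_hat
    return E_star - np.real(np.fft.ifft(1j * k * psi_hat))
```

For example, polluting the field E = sin(x) (with rho = cos(x), sigma1 = 1) by the gradient 0.2 cos(2x) and projecting recovers sin(x) to machine precision, since the added error is purely irrotational.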
4.3.3 Enforcing the Lorenz Gauge through Lagrange Multipliers

In this section, we introduce a Lagrange multiplier to enable the enforcement of the Lorenz gauge condition (1.10), following [19], which proposed a divergence cleaning technique for the E-B form of Maxwell's equations using a hyperbolic model. The approach presented in this section considers the same system written in terms of the Lorenz gauge. To this end, we modify the Lorenz gauge condition by introducing a function $\phi$ which represents a residual:
\[
\frac{\phi}{d^2} = \nabla \cdot \mathbf{A} + \frac{1}{\kappa^2}\frac{\partial \psi}{\partial t}. \tag{4.6}
\]
As we will see, this function $\phi$ satisfies a wave equation whose purpose is to sweep away the residual in the Lorenz gauge. The non-physical parameter $d > 1$ is connected to the wave speed for $\phi$ and is selected to ensure that waves for $\phi$ propagate faster than the other waves of interest in the problem. This choice is often problem dependent, but $d \approx 10$ is often used [19, 20, 21]. We consider a modification of the original system (1.41)-(1.43), which has the form
\[
\frac{1}{\kappa^2}\frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1}\rho,
\]
\[
\frac{1}{\kappa^2}\frac{\partial^2 \mathbf{A}}{\partial t^2} - \Delta \mathbf{A} - \nabla \phi = \sigma_2 \mathbf{J},
\]
\[
\nabla \cdot \mathbf{A} + \frac{1}{\kappa^2}\frac{\partial \psi}{\partial t} = \frac{1}{d^2}\phi.
\]
We have left the particle equations unspecified for generality. Observe that this modified system is equivalent to the original system when $\phi \equiv 0$. To develop the hyperbolic auxiliary equation, we first create a coupling between the wave equation for the scalar potential $\psi$ and the modified gauge condition (4.6). This is accomplished by substituting $\frac{1}{\kappa^2}\partial_t \psi$ from the latter into the former to obtain
\[
\frac{1}{d^2}\frac{\partial \phi}{\partial t} - \nabla \cdot \frac{\partial \mathbf{A}}{\partial t} - \Delta \psi = \frac{1}{\sigma_1}\rho.
\]
We then take a time derivative of this equation to obtain
\[
\frac{1}{d^2}\frac{\partial^2 \phi}{\partial t^2} - \nabla \cdot \frac{\partial^2 \mathbf{A}}{\partial t^2} - \Delta\left(\frac{\partial \psi}{\partial t}\right) = \frac{1}{\sigma_1}\frac{\partial \rho}{\partial t}.
\]
Using the wave equation for $\mathbf{A}$ for the second time derivative in the above equation, we obtain, after rearranging the terms and with the aid of (4.6), a wave equation for $\phi$, namely
\[
\frac{1}{\kappa^2 d^2}\frac{\partial^2 \phi}{\partial t^2} - \left(1 + \frac{1}{d^2}\right)\Delta \phi = \sigma_2\left(\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J}\right).
\]
Hence, the system to be solved is
\[
\frac{1}{\kappa^2}\frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1}\rho,
\]
\[
\frac{1}{\kappa^2}\frac{\partial^2 \mathbf{A}}{\partial t^2} - \Delta \mathbf{A} - \nabla \phi = \sigma_2 \mathbf{J},
\]
\[
\frac{1}{\kappa^2 d^2}\frac{\partial^2 \phi}{\partial t^2} - \left(1 + \frac{1}{d^2}\right)\Delta \phi = \sigma_2\left(\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J}\right),
\]
where the new wave equation for $\phi$ is used to enforce the gauge condition. Since the function $\phi$ represents the gauge error, which should be "swept away", it is prescribed outflow boundary conditions. Initially, the data for the potentials should satisfy the gauge condition, so the initial data is $\phi(\mathbf{x}, 0) = 0$. Furthermore, notice that the source in this auxiliary equation contains
\[
\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J},
\]
which describes the evolution of charge in the domain. If the charge in the problem is conserved, then
\[
\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J} = 0,
\]
and the gauge condition will be naturally satisfied. Otherwise, local violations of this constraint act as sources that generate waves for the residual function $\phi$.

4.3.4 Enforcing the Coulomb Gauge

The Coulomb gauge (1.47) can be enforced using a projection technique that is nearly identical to the one discussed in section 4.3.1 that enforces Gauss' law. Any errors in the gauge condition manifest through irrotational components, which need to be removed. To this end, we apply a Helmholtz decomposition to the numerical vector potential (indicated with "*"):
\[
\mathbf{A}^* = \mathbf{A}_{\text{rot}} + \mathbf{A}_{\text{irrot}}.
\]
If the vector potential is purely rotational, then it is the curl of another function, so naturally $\nabla \cdot \mathbf{A}_{\text{rot}} = 0$. The irrotational or curl-free part of the vector potential can be expressed as the gradient of a scalar function, so the decomposition for the vector potential is given by
\[
\mathbf{A}^* = \mathbf{A}_{\text{rot}} - \nabla \eta, \tag{4.7}
\]
where $\eta$ is some scalar function. Taking the divergence of equation (4.7), we see that
\[
-\Delta \eta = \nabla \cdot \mathbf{A}^*.
\]
After solving this equation for $\eta$ and taking its gradient, we can obtain the rotational part of the vector potential by rearranging (4.7) so that
\[
\mathbf{A}_{\text{rot}} = \mathbf{A}^* + \nabla \eta. \tag{4.8}
\]
If we follow our formulation (1.45)-(1.47), then $\mathbf{A}^*$ will already be an approximation of the rotational part of the vector potential.
Moreover, the rotational part of the wave equation (1.46) (i.e., (1.17)) can be evolved using the field solvers developed in Chapter 2, which means that we can also compute these derivatives as well.

4.3.5 Analytical Maps for Enforcing Charge Conservation

A new approach to enforcing charge conservation was recently proposed in [25] in the context of finite-element particle-in-cell methods. The map for the charge density is based on the continuity equation, which is integrated in time:
\[
\rho^{n+1} = \rho^n - \int_{t^n}^{t^{n+1}} \nabla \cdot \mathbf{J} \, dt. \tag{4.9}
\]
Their map exchanges the time integral (4.9) for a spatial integral that effectively traces the motion of the particle on the mesh from $t^n$ to $t^{n+1}$. In [25] and [26], this mapping was enforced in a weak sense through the finite element basis. Here, we shall extend this technique a bit further to devise a discrete analogue of (4.9), which can be computed analytically and enforces charge conservation to machine precision. We summarize the key identities presented in the paper [26], offering additional details which, we believe, make the method easier to understand. We begin with the usual definitions for the charge density and current density, which were defined in equations (4.1) and (4.2) in terms of a shape function $S(\mathbf{x})$ as
\[
\rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p S\left(\mathbf{x} - \mathbf{x}_p(t)\right), \qquad
\mathbf{J}(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \mathbf{v}_p S\left(\mathbf{x} - \mathbf{x}_p(t)\right).
\]
Again, $N_p$ is the total number of macro-particles in the simulation, so that $q_p$ now represents the charge of a macro-particle. Additionally, the position at time $t$ for a given particle can be expressed in terms of its velocity using
\[
\mathbf{x}_p(t) = \mathbf{x}_p(0) + \int_0^t \mathbf{v}_p(\tau) \, d\tau. \tag{4.10}
\]
Various distributional identities are presented in [26]. In particular, they make repeated use of the Dirac delta distribution, which is defined in Fourier space as
\[
\delta(\mathbf{x} - \mathbf{a}) = \frac{1}{(2\pi)^3} \int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{a})} \, d^3\mathbf{k}. \tag{4.11}
\]
Additionally, the delta distribution obeys the "sifting" properties
\[
f(\mathbf{x}) = \int_{-\infty}^{\infty} f(\mathbf{y}) \, \delta(\mathbf{x} - \mathbf{y}) \, d^3\mathbf{y} \tag{4.12}
\]
and
\[
\delta(\boldsymbol{\eta} - \boldsymbol{\xi}) = \int_{-\infty}^{\infty} \delta(\boldsymbol{\eta} - \mathbf{y}) \, \delta(\mathbf{y} - \boldsymbol{\xi}) \, d^3\mathbf{y}, \tag{4.13}
\]
which shall be used in the discussion that follows. The first identity, given as equation (10) in [26], is
\[
\delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) = \delta\left(\mathbf{x} - \mathbf{x}_p(0)\right) * \delta\left(\mathbf{x} - \int_0^t \mathbf{v}_p(\tau)\, d\tau\right), \tag{4.14}
\]
where we use "*" to denote convolution. To see the equivalence, use the definition (4.10) to convert the particle position to velocity
\[
\delta\left(\mathbf{x} - \mathbf{x}_p\right) = \delta\left(\mathbf{x} - \mathbf{x}_p(0) - \int_0^t \mathbf{v}_p(\tau)\, d\tau\right),
\]
then appeal to identity (4.13), taking $\boldsymbol{\eta} = \int_0^t \mathbf{v}_p(\tau)\, d\tau$ and $\boldsymbol{\xi} = \mathbf{x} - \mathbf{x}_p(0)$ to obtain
\[
\int_{-\infty}^{\infty} \delta\left(\mathbf{y} - \int_0^t \mathbf{v}_p(\tau)\, d\tau\right) \delta\left(\mathbf{x} - \mathbf{x}_p(0) - \mathbf{y}\right) d^3\mathbf{y} \equiv \delta\left(\mathbf{x} - \mathbf{x}_p(0)\right) * \delta\left(\mathbf{x} - \int_0^t \mathbf{v}_p(\tau)\, d\tau\right).
\]
The next identity relates time and space derivatives of the delta distribution and is given (see equation (11) in [26]) as
\[
\partial_t \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) = -\nabla \cdot \left(\mathbf{v}_p(t)\, \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right)\right), \tag{4.15}
\]
where the divergence is taken over the spatial variable $\mathbf{x}$. This can be obtained from a direct calculation using the chain rule and the definition (4.11):
\[
\partial_t \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) = \partial_t \left( \frac{1}{(2\pi)^3} \int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(t))} \, d^3\mathbf{k} \right)
= -\frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} i\mathbf{k}\cdot\mathbf{v}_p(t)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(t))} \, d^3\mathbf{k}
= -\nabla \cdot \left(\mathbf{v}_p(t)\, \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right)\right).
\]
The next identity (equation (12) in [26]) is the distributional equivalent of the continuity equation
\[
\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J} = 0.
\]
This identity can be obtained by combining the definitions (4.1) and (4.2) with the identity (4.15). First, notice that
\[
\partial_t \rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \, \partial_t S\left(\mathbf{x} - \mathbf{x}_p(t)\right).
\]
Using the identity (4.15) along with the property (4.12), we can express the shape function as a convolution with a delta function, which yields
\[
\partial_t \rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \, \partial_t \left( S(\mathbf{x}) * \delta\left(\mathbf{x} - \mathbf{x}_p\right) \right)
= \sum_{p=1}^{N_p} q_p \, S(\mathbf{x}) * \partial_t \delta\left(\mathbf{x} - \mathbf{x}_p\right)
= \sum_{p=1}^{N_p} q_p \, S(\mathbf{x}) * \left( -\nabla \cdot \left( \mathbf{v}_p(t)\, \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) \right) \right).
\]
Next, observe that derivatives of convolutions commute when the integrand and its derivatives decay rapidly away from zero. The shape function $S$ is compactly supported so that these conditions will be easily satisfied. This permits one to write
\[
\partial_t \rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \, S(\mathbf{x}) * \left( -\nabla \cdot \left( \mathbf{v}_p(t)\, \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) \right) \right)
= -\nabla \cdot \sum_{p=1}^{N_p} q_p \mathbf{v}_p(t)\, S(\mathbf{x}) * \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right)
= -\nabla \cdot \sum_{p=1}^{N_p} q_p \mathbf{v}_p(t)\, S\left(\mathbf{x} - \mathbf{x}_p(t)\right)
= -\nabla \cdot \mathbf{J},
\]
which shows how the continuity equation can be obtained using distributions. The last identity (equation (13) in [26]), which builds on these intermediate results, describes how the charge density can be calculated on the mesh so that it satisfies the continuity equation. To start, they integrate the continuity equation from $0$ to $t$, which yields
\[
\rho(\mathbf{x},t) = \rho(\mathbf{x},0) - \int_0^t \nabla \cdot \mathbf{J}(\mathbf{x},\tau)\, d\tau.
\]
Using the definition (4.2), this is equivalent to writing
\[
\rho(\mathbf{x},t) = \rho(\mathbf{x},0) - \sum_{p=1}^{N_p} q_p \int_0^t \nabla \cdot \left( \mathbf{v}_p(\tau)\, S\left(\mathbf{x} - \mathbf{x}_p(\tau)\right) \right) d\tau
= \rho(\mathbf{x},0) - \sum_{p=1}^{N_p} q_p \int_0^t \nabla \cdot \left( \mathbf{v}_p(\tau)\, S(\mathbf{x}) * \delta\left(\mathbf{x} - \mathbf{x}_p(\tau)\right) \right) d\tau
= \rho(\mathbf{x},0) - \sum_{p=1}^{N_p} q_p \int_0^t \nabla \cdot \left( \mathbf{v}_p(\tau)\, S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d^3\mathbf{k} \right) d\tau.
\]
For convenience, we shall express the integrals in the last line in their component-wise form, which is given by the expression
\[
\partial_j \int_0^t \left( v_p^{(j)}(\tau)\, S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d^3\mathbf{k} \right) d\tau.
\]
Note that we adopt the usual summation convention in which repeated indices are summed, and we have used $\partial_j = \frac{\partial}{\partial x^{(j)}}$ for brevity. It is (hopefully) clear from the notation that these spatial derivatives act only on the mesh data rather than the particles. Next, we use the definition (4.10) and rearrange the integrals to obtain
\[
\partial_j \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} \int_0^t v_p^{(j)}(\tau)\, e^{i\mathbf{k}\cdot\left(\mathbf{x}-\mathbf{x}_p(0)-\int_0^\tau \mathbf{v}_p(s)\, ds\right)}\, d\tau\, d^3\mathbf{k} \right).
\]
This expression can be further simplified by using the fact that $\frac{d\mathbf{x}_p}{dt} = \mathbf{v}_p$, from which we deduce that
\[
\partial_j \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} \int_0^t v_p^{(j)}(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau\, d^3\mathbf{k} \right). \tag{4.16}
\]
The term above contains three separate contributions to the mesh corresponding to each index value $j$. We shall provide details for $j = 1$, and state the results for the remaining cases $j = 2, 3$. Focusing on the inner integral, we see that we need to evaluate the time integral
\[
\int_0^t v_p^{(1)}(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau.
\]
Next, we use the definition (4.10) to see that $v_p^{(1)}\, d\tau$ is the total change in the position over a time frame $d\tau$; in other words, $v_p^{(1)}\, d\tau = dx_p^{(1)}$. Then the time integral can be converted into a path integral that connects $x_p^{(1)}(0)$ and $x_p^{(1)}(t)$:
\[
\int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} e^{i\left(k^{(1)}\left(x^{(1)} - x_p^{(1)}\right) + k^{(2)}\left(x^{(2)} - x_p^{(2)}\right) + k^{(3)}\left(x^{(3)} - x_p^{(3)}\right)\right)}\, dx_p^{(1)}.
\]
Here, the variables for the remaining particle coordinates are transformed to $x_p^{(2)}$ and $x_p^{(3)}$, and the differential element runs over the first component of the particle position vector, leaving these remaining variables unaffected. Inserting this result into (4.16) (with $j = 1$) and interchanging the integrals, we obtain
\[
\partial_1 \left( S(\mathbf{x}) * \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p)}\, d^3\mathbf{k}\; dx_p^{(1)} \right),
\]
which, by (4.11), is equivalent to writing
\[
\partial_1 \left( S(\mathbf{x}) * \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} \delta\left(\mathbf{x} - \mathbf{x}_p\right) dx_p^{(1)} \right) = \partial_1 \left( \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} S\left(\mathbf{x} - \mathbf{x}_p\right) dx_p^{(1)} \right). \tag{4.17}
\]
The remaining indices ($j = 2, 3$) in (4.16) yield
\[
\partial_2 \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty}\int_0^t v_p^{(2)}(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau\, d^3\mathbf{k} \right) = \partial_2 \left( \int_{x_p^{(2)}(0)}^{x_p^{(2)}(t)} S\left(\mathbf{x} - \mathbf{x}_p\right) dx_p^{(2)} \right), \tag{4.18}
\]
\[
\partial_3 \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty}\int_0^t v_p^{(3)}(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau\, d^3\mathbf{k} \right) = \partial_3 \left( \int_{x_p^{(3)}(0)}^{x_p^{(3)}(t)} S\left(\mathbf{x} - \mathbf{x}_p\right) dx_p^{(3)} \right), \tag{4.19}
\]
which can be more succinctly expressed in vector notation as
\[
\nabla \cdot \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty}\int_0^t \mathbf{v}_p(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau\, d^3\mathbf{k} \right) = \nabla \cdot \left( \int_{\mathbf{x}_p(0)}^{\mathbf{x}_p(t)} S\left(\mathbf{x} - \mathbf{x}_p\right) d\mathbf{x}_p \right), \tag{4.20}
\]
with the vector integration bounds being interpreted in an entry-wise fashion. Once the shape function $S(\mathbf{x} - \cdot)$ is selected, one can analytically compute the derivatives of the shape functions in (4.20) via the equations (4.17)-(4.19). Actually, it is not necessary to introduce delta distributions to obtain the mapping (4.20). In [26], a convolution between the shape functions and delta distributions was performed, so that derivatives could be transferred directly onto the delta distributions.
These derivatives are to be interpreted in the sense of distributions. A far simpler approach can be obtained by starting from the continuity equation with the definition (4.2):
\[
\rho(x,t) = \rho(x,0) - \sum_{p=1}^{N_p} q_p\, \nabla \cdot \int_0^t v_p(\tau)\, S\big(x - x_p(\tau)\big)\, d\tau.
\]
Using the fact that $\frac{d x_p}{dt} = v_p$, we can immediately see that $dx_p = v_p\, d\tau$, hence
\[
\nabla \cdot \left( \int_0^t v_p(\tau)\, S\big(x - x_p(\tau)\big)\, d\tau \right) = \nabla \cdot \left( \int_{x_p(0)}^{x_p(t)} S\big(x - x_p\big)\, dx_p \right).
\]
The charge mapping involves derivatives of the shape function $S$ according to equations (4.17)-(4.19). The evaluation of these terms can be illustrated by considering a single component, such as the first one, i.e.,
\[
\frac{\partial}{\partial x^{(1)}} \left( \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} S\big(x - x_p\big)\, dx_p^{(1)} \right).
\]
Using the change of variables $z = x - x_p$, this integral can be expressed as
\[
\frac{\partial}{\partial x^{(1)}} \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} S\big(x - x_p\big)\, dx_p^{(1)} = -\frac{\partial}{\partial x^{(1)}} \int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}, z^{(2)}, z^{(3)}\big)\, dz^{(1)}.
\]
If we write the multivariate spline as a tensor product of univariate shape functions, then this is equivalent to writing
\[
\frac{\partial}{\partial x^{(1)}} \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} S\big(x - x_p\big)\, dx_p^{(1)} = -\frac{\partial}{\partial x^{(1)}} \int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big) S\big(z^{(2)}\big) S\big(z^{(3)}\big)\, dz^{(1)}.
\]
Next, we define the anti-derivative of the univariate spline as
\[
I(x) := \int_0^x S(\xi)\, d\xi =
\begin{cases}
x + \dfrac{x^2}{2\Delta x}, & -\Delta x \le x \le 0, \\[1ex]
x - \dfrac{x^2}{2\Delta x}, & 0 < x \le \Delta x, \\[1ex]
0, & |x| > \Delta x.
\end{cases}
\]
Let us temporarily ignore the functions of $z^{(2)}$ and $z^{(3)}$, as the integral is taken only in $z^{(1)}$. At the end, we will reintroduce these shape functions with the time level of the coordinates in $z^{(2)}$ and $z^{(3)}$ being selected to match those of $z^{(1)}$. Using the anti-derivative, the integral reduces to
\[
\begin{aligned}
\int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big)\, dz^{(1)} &= \int_0^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big)\, dz^{(1)} - \int_0^{x^{(1)} - x_p^{(1)}(0)} S\big(z^{(1)}\big)\, dz^{(1)}, \\
&= I\big(x^{(1)} - x_p^{(1)}(t)\big) - I\big(x^{(1)} - x_p^{(1)}(0)\big).
\end{aligned}
\]
Taking the derivative with respect to $x^{(1)}$, we obtain
\[
\begin{aligned}
\frac{\partial}{\partial x^{(1)}} \int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big)\, dz^{(1)} &= \frac{\partial}{\partial x^{(1)}} I\big(x^{(1)} - x_p^{(1)}(t)\big) - \frac{\partial}{\partial x^{(1)}} I\big(x^{(1)} - x_p^{(1)}(0)\big), \\
&= S\big(x^{(1)} - x_p^{(1)}(t)\big) - S\big(x^{(1)} - x_p^{(1)}(0)\big).
\end{aligned}
\]
Re-introducing the functions of $x_p^{(2)}$ and $x_p^{(3)}$, and collecting the results, we get
\[
-\frac{\partial}{\partial x^{(1)}} \int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big) S\big(z^{(2)}\big) S\big(z^{(3)}\big)\, dz^{(1)} = -S\big(x - x_p\big) \Big|_{x_p = x_p(0)}^{x_p = x_p(t)}. \tag{4.21}
\]
Again, the time levels for $x_p^{(2)}$ and $x_p^{(3)}$ in (4.21) should be the same as the one used for $x_p^{(1)}$ for consistency. It is interesting to note that when this process is repeated for the remaining derivatives, the result is identical to (4.21). The final form of the map is given by
\[
\rho(x,t) = \rho(x,0) - \sum_{p=1}^{N_p} q_p\, \nabla \cdot \int_{x_p(0)}^{x_p(t)} S\big(x - x_p\big)\, dx_p, \tag{4.22}
\]
where the divergence of the integral can be evaluated through repeated use of (4.21). While the map defined by (4.21) and (4.22) is conservative in the sense that the total charge in the domain is preserved in time, it introduces an image charge on the mesh that corresponds to the starting position along a particle path. This is problematic for the expanding beam problem considered in section 4.6, which injects an electron beam into a metal cavity: the image charge induces a large potential at the injection site that reverses the trajectory of the beam. The appearance of an image charge can be illustrated with a simple example that considers a single test particle whose macro-particle charge is $q = 1$. Its charge at any time level can be calculated with the mapping presented above, which, in 2-D, for a single particle, is given by
\[
\rho^{n+1} = \rho^n + 2q \left( S\big(x - x_p^{n+1}\big) - S\big(x - x_p^n\big) \right).
\]
In this example, we suppose that at time $t = 0$ the particle is in the center of the grid cell $(i,j)$ and that at time $t = \Delta t$ the particle has reached the cell boundary.
We use a matrix convention so that the grid points in question align with entries of the matrix. The linear spline shape functions at these grid points are found to be
\[
S\big(x - x_p^n\big) =
\begin{pmatrix}
\tfrac{1}{4} & \tfrac{1}{4} & 0 \\
\tfrac{1}{4} & \tfrac{1}{4} & 0
\end{pmatrix},
\qquad
S\big(x - x_p^{n+1}\big) =
\begin{pmatrix}
0 & \tfrac{1}{2} & 0 \\
0 & \tfrac{1}{2} & 0
\end{pmatrix}.
\]
Plugging these directly into the charge mapping, we find that
\[
\rho^{n+1} =
\begin{pmatrix}
\tfrac{1}{4} & \tfrac{1}{4} & 0 \\
\tfrac{1}{4} & \tfrac{1}{4} & 0
\end{pmatrix}
+ 2
\begin{pmatrix}
0 & \tfrac{1}{2} & 0 \\
0 & \tfrac{1}{2} & 0
\end{pmatrix}
- 2
\begin{pmatrix}
\tfrac{1}{4} & \tfrac{1}{4} & 0 \\
\tfrac{1}{4} & \tfrac{1}{4} & 0
\end{pmatrix}
=
\begin{pmatrix}
-\tfrac{1}{4} & \tfrac{3}{4} & 0 \\
-\tfrac{1}{4} & \tfrac{3}{4} & 0
\end{pmatrix}.
\]
From this, it is clear that the total charge is conserved in the sense that
\[
\sum_{i,j} \rho_{ij}^{n+1} = \sum_{i,j} \rho_{ij}^n,
\]
but the method introduces a non-physical image charge corresponding to the previous location of the particle. If we remove the factor of 2 that appears in the spline mapping, so that the charge map now reads
\[
\rho^{n+1} = \rho^n + q \left( S\big(x - x_p^{n+1}\big) - S\big(x - x_p^n\big) \right),
\]
then repeating the process yields
\[
\rho^{n+1} =
\begin{pmatrix}
0 & \tfrac{1}{2} & 0 \\
0 & \tfrac{1}{2} & 0
\end{pmatrix},
\]
which removes the oppositely signed image charge; however, this modification is equivalent to the usual spline mapping for charge. For this reason, we shall not consider this map in our numerical experiments.

4.4 Conventional Methods for Pushing Particles

In this section, we discuss standard methods used in the scientific computing community for pushing particles. These developments are important to the core task of this work because they not only provide a basis for comparison with other methods, but also describe how potential users may employ our field solvers in their own codes. Beginning with the well-known leapfrog method, we discuss how the proposed field solvers can be leveraged to develop solvers for electrostatic problems. We then address the electromagnetic case, which begins with a description of the Boris rotation algorithm.
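As a quick numerical check, the single test-particle example above can be reproduced on a small mesh. The following is a minimal sketch (the variable names are our own; a tensor product of linear hat splines with unit grid spacing is assumed):

```python
import numpy as np

# Nodes of a small 2-D mesh with unit spacing (dx = dy = 1).
nodes_x = np.array([0.0, 1.0, 2.0])
nodes_y = np.array([0.0, 1.0])

def spline_weights(xp, yp):
    """Tensor product of univariate linear (hat) splines, S(x - x_p)."""
    wx = np.maximum(0.0, 1.0 - np.abs(nodes_x - xp))
    wy = np.maximum(0.0, 1.0 - np.abs(nodes_y - yp))
    return np.outer(wx, wy)

q = 1.0
S_old = spline_weights(0.5, 0.5)  # particle at the center of a cell
S_new = spline_weights(1.0, 0.5)  # particle moved to the cell boundary

rho_old = q * S_old
# Conservative map: the factor of 2 comes from the repeated use of (4.21)
# in the 2-D divergence.
rho_new = rho_old + 2.0 * q * (S_new - S_old)

print(rho_new.T)  # rows contain the entries -1/4, 3/4, 0 from the text
```

Summing the entries of `rho_new` and `rho_old` confirms that the total charge is conserved, while the negative entries reproduce the image charge discussed above.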
4.4.1 Leapfrog Time Integration

Leapfrog time integration is a well-known technique for evolving particles due to its simplicity, long-time accuracy, and symplectic nature. While a comprehensive treatment of this integrator can be found in any of the classic texts on particle methods, including Birdsall and Langdon [2] and Hockney and Eastwood [3], we provide a few details here so that we can discuss the coupling of the method to the collection of solvers introduced in chapter 2. To this end, we consider Newton's second law of motion for a single particle, which can be written in the first-order form
\[
\dot{x} = v, \tag{4.23}
\]
\[
\dot{v} = \frac{1}{m} F(x, t), \tag{4.24}
\]
where $m$ is the mass of the particle and $x(t)$ and $v(t)$ denote, respectively, the position and velocity of the particle at time $t$. Further, $F$ is the force that is used to accelerate the particle, which depends only on the position data of the particle. The leapfrog method can be derived by integrating the position equation (4.23) from $t^n$ to $t^{n+1}$ and the velocity equation (4.24) from $t^{n-1/2}$ to $t^{n+1/2}$:
\[
x(t^{n+1}) = x(t^n) + \int_{t^n}^{t^{n+1}} v(\tau)\, d\tau,
\]
\[
v(t^{n+1/2}) = v(t^{n-1/2}) + \frac{1}{m} \int_{t^{n-1/2}}^{t^{n+1/2}} F(x(\tau), \tau)\, d\tau.
\]
Then, the integrals are approximated with a second-order accurate midpoint rule to obtain the fully discrete update
\[
v(t^{n+1/2}) = v(t^{n-1/2}) + \frac{\Delta t}{m} F(x(t^n), t^n), \tag{4.25}
\]
\[
x(t^{n+1}) = x(t^n) + \Delta t\, v(t^{n+1/2}). \tag{4.26}
\]
The $\frac{\Delta t}{2}$ offset in time between the position and velocity in this method gives the scheme its name. In practice, an initial offset in the velocity can be achieved by stepping the velocity backwards in time by $-\frac{\Delta t}{2}$ using an explicit Euler method, so that
\[
v(t^{-1/2}) = v(t^0) - \frac{\Delta t}{2m} F(x(t^0), t^0).
\]
Since the local truncation error for the explicit Euler method is second-order in time, this step will not degrade the rate of convergence. Once the initialization is complete, a given time step consists of the following ingredients:

1. Compute the force $F(x(t^n), t^n)$ at time $t = t^n$.

2.
Update the velocity by $\Delta t$ using equation (4.25).

3. Update the position by $\Delta t$ using equation (4.26).

For electrostatic problems, in the absence of external forces, the force term represents the acceleration caused by an electric field, i.e.,
\[
F(x(t^n), t^n) = q E(x(t^n), t^n) \equiv -q \nabla \psi(x(t^n), t^n),
\]
where $q$ is the charge of the particle, $E$ is the electric field, and $\psi$ is a scalar potential. Depending on the context, the scalar potential may be provided as part of the problem or require its own solve. In the electrostatic applications considered in this thesis, the scalar potential $\psi$ is the solution to either a Poisson equation
\[
-\Delta \psi = \frac{1}{\sigma_1} \rho, \tag{4.27}
\]
or the two-way wave equation
\[
\frac{1}{\kappa^2} \partial_{tt} \psi - \Delta \psi = \frac{1}{\sigma_1} \rho, \tag{4.28}
\]
with $\kappa$ being the speed at which the wave propagates and $\rho$ being the charge density deposited by particles on a mesh. In the case of the Poisson equation (4.27), the electric field can be computed using elliptic solvers, and the leapfrog time stepping procedure is unchanged. When the potential is described by a wave equation (4.28), some minor modifications are required depending on whether the source $\rho$ is to be treated explicitly or implicitly by our methods. In the latter case, both $\psi$ and its derivatives $\nabla \psi$ can be computed once the position update (4.26) is complete. Similarly, an advance for $\psi$ with an explicit source (e.g., the time-centered or central schemes) would take place prior to the position update (4.26). Then, once the positions have been updated, the derivatives can be computed using the now implicit sources. In the next section, we describe the Boris push, which is an extension of the leapfrog time integration scheme to include contributions from more general electromagnetic fields.

4.4.2 The Boris Push

The Boris push, introduced in a 1970 paper by Jay Boris [4], is an extension of the leapfrog method discussed in section 4.4.1 to address rotations supported by the more general Lorentz force
\[
F = q \left( E + v \times B \right).
\]
This technique is the standard method for pushing particles in electromagnetic fields and is widely adopted for its simplicity, speed, and long-time accuracy [2, 3]. While the method itself is not symplectic, its success has largely been attributed to its volume-preserving property [78]. Moreover, despite its lack of symplecticity, the fluctuations in the energy introduced by the method are generally known to remain small, even for problems that require integration over large time intervals. The setup for the Boris method begins with a discretization in time of the equations of motion for the particles that results in a leapfrog structure identical to (4.25) and (4.26). A key difference is that the velocity update equation for the particles involves the velocity itself, which is time averaged to obtain the implicit equation
\[
v^{n+1/2} = v^{n-1/2} + \frac{q \Delta t}{m} \left( E + \frac{v^{n+1/2} + v^{n-1/2}}{2} \times B \right).
\]
While it is possible to rearrange this equation and analytically solve for $v^{n+1/2}$, Boris realized that the electric and magnetic field contributions could be separated with a simple advance obtained through the use of rotations. Following the notation of [79], the Boris rotation method for the velocity can be performed with the following steps:

1. Compute $v^- = v^{n-1/2} + \frac{q \Delta t}{2m} E$.

2. Compute $v' = v^- + v^- \times t$, where $t = \frac{q \Delta t}{2m} B$.

3. Compute $v^+ = v^- + v' \times s$, where $s = \dfrac{2t}{1 + |t|^2}$.

4. Lastly, the velocity update is given as $v^{n+1/2} = v^+ + \frac{q \Delta t}{2m} E$.

The position data can then be updated using the newly acquired velocity $v^{n+1/2}$ following the usual leapfrog update (4.26). We now describe an approach that couples our field solvers to the Boris method. The proposed approach treats the fields at integer time levels, with the particle data being stored in the usual leapfrog format. First, the initialization begins with $x^0$ and $v^0$, along with the fields $E^0$ and $B^0$ (obtained from $\psi^0$ and $A^0$). To create the velocity staggering in time, we use the initial fields to push the velocities back by $\Delta t/2$ using the Boris rotation method, which is second-order and results in $v^{-1/2}$. Then, assuming we have the data $\left( E^n, B^n, x^n, v^{n-1/2} \right)$, the procedure for evolving the system from $t^n$ to $t^{n+1}$ consists of the following steps:

1. Apply the Boris rotation: $\left( E^n, B^n, v^{n-1/2} \right) \mapsto v^{n+1/2}$.

2. Advance the positions: $\left( x^n, v^{n+1/2} \right) \mapsto x^{n+1}$.

3. Time average the velocities $\left( v^{n-1/2}, v^{n+1/2} \right) \mapsto v^n$ and compute the current density $J^n$.

4. Advance $A$ explicitly with the central scheme: $\left( A^n, J^n \right) \mapsto A^{n+1}$.

5. Compute the charge density $\rho^{n+1}$.

6. Advance $\psi$ and its derivatives: $\left( \psi^n, \rho^{n+1} \right) \mapsto \left( \psi^{n+1}, \nabla \psi^{n+1} \right)$.

7. Compute the electric field $E^{n+1} := -\nabla \psi^{n+1} - \partial_t A^{n+1}$ using data from steps 4 and 6.

8. Iterate on the magnetic field data:

   a) Compute derivatives of $A$ with the BDF update: $\left( A^n, J^{n+1,[k]} \right) \mapsto \nabla A^{n+1,[k]}$.

   b) Approximate the magnetic field $B^{n+1,[k]} := \nabla \times A^{n+1,[k]}$ with $\nabla A^{n+1,[k]}$.

   c) Advance the particle velocities: $\left( E^{n+1}, B^{n+1,[k]}, v^{n+1/2} \right) \mapsto v^{n+3/2}$.

   d) Time average the velocities $\left( v^{n+1/2}, v^{n+3/2} \right) \mapsto v^{n+1}$ and compute $J^{n+1,[k+1]}$.

   e) Repeat for some prescribed number of iterations or until convergence.

9. Prepare for the next time step: $\left( E^{n+1}, B^{n+1}, x^{n+1}, v^{n+1/2} \right) \mapsto \left( E^n, B^n, x^n, v^{n-1/2} \right)$.

The fields $(\psi, A)$, the spatial derivatives $(\nabla \psi, \nabla A)$, and $\partial_t A$ all live at the integer time levels in this approach. The reason is that the particle velocity evolution step over $[t^{n-1/2}, t^{n+1/2}]$ requires the field data at the mid-point $t^n$. The particle position data is at the integer levels. We generally use variables with square brackets as an upper index to indicate that it is an iteration variable. For instance, in step 8a, we use the source data at the wrong time level, which means that the BDF scheme will be using the incorrect source.
By iterating on the current density $J^{n+1}$, generally with a few iterates, we can obtain a good approximation of this source. To start the iteration in step 8, we can apply a Taylor expansion to the current density so that it is centered about time $t = t^n$. For second-order accuracy in time, we can start the iteration with
\[
J^{n+1,[0]} = J^n + \left( J^n - J^{n-1} \right).
\]
In the next section, we discuss a time integration method for particles that are evolved using non-separable Hamiltonians.

4.5 Time Integration with Non-separable Hamiltonians

In this section, we introduce the time integration methods for the particles in formulations that evolve a certain non-separable Hamiltonian. An outline of the approach is provided in section 4.5.1. Once we have introduced the basic elements of the method, we propose several approaches for incorporating the field evolution into the scheme. We consider two perspectives. First, in section 4.5.1.1, we develop a naive implementation in which the particles "lead" the fields, i.e., the fields are modified through changes in the particles. The second approach, presented in section 4.5.1.2, is a variant of the first approach that allows one to utilize methods in which the sources are explicit.

4.5.1 The Molei Tao Integrator

For equations of motion of the form
\[
\dot{x}_i = \frac{1}{m_i} \left( P_i - q_i A \right) \equiv V(x_i, P_i),
\]
\[
\dot{P}_i = -q_i \nabla \psi + \frac{q_i}{m_i} (\nabla A) \cdot \left( P_i - q_i A \right) \equiv W(x_i, P_i),
\]
the phase space variables are non-separable, which means traditional integrators, such as those in the previous section, cannot be applied to the system. Recently, a paper by Tao [47] introduced methods, inspired by the work [80], to approximate non-separable Hamiltonians $H(x, P)$ using an augmented form
\[
\bar{H}(x, P, y, Q) := H_A + H_B + \omega H_C,
\]
with
\[
H_A := H(x, Q), \qquad H_B := H(y, P), \qquad H_C := \frac{1}{2} \| x - y \|^2 + \frac{1}{2} \| P - Q \|^2.
\]
By duplicating the phase space information, Tao was able to construct a family of explicit symplectic integration schemes of any even order of accuracy that do not require negative time steps. This latter point is significant for plasma problems modeled on bounded domains. In such problems, charged particles may be absorbed into the boundary within a single time step, and it is not clear how one should reverse this process. To build these methods, Tao introduces the following set of flow maps that evolve the system forward in $\Delta t$-time:
\[
\phi_{H_A}^{\Delta t} :
\begin{pmatrix} x \\ P \\ y \\ Q \end{pmatrix}
\mapsto
\begin{pmatrix} x \\ P - \Delta t\, \partial_x H(x, Q) \\ y + \Delta t\, \partial_Q H(x, Q) \\ Q \end{pmatrix},
\qquad
\phi_{H_B}^{\Delta t} :
\begin{pmatrix} x \\ P \\ y \\ Q \end{pmatrix}
\mapsto
\begin{pmatrix} x + \Delta t\, \partial_P H(y, P) \\ P \\ y \\ Q - \Delta t\, \partial_y H(y, P) \end{pmatrix},
\]
\[
\phi_{\omega H_C}^{\Delta t} :
\begin{pmatrix} x \\ P \\ y \\ Q \end{pmatrix}
\mapsto
\frac{1}{2}
\begin{pmatrix}
\begin{pmatrix} x + y \\ P + Q \end{pmatrix} + R(\omega, \Delta t) \begin{pmatrix} x - y \\ P - Q \end{pmatrix} \\[1ex]
\begin{pmatrix} x + y \\ P + Q \end{pmatrix} - R(\omega, \Delta t) \begin{pmatrix} x - y \\ P - Q \end{pmatrix}
\end{pmatrix}.
\]
Here, $R$ is the block rotation matrix
\[
R(\omega, \Delta t) =
\begin{pmatrix}
\cos(2 \omega \Delta t)\, I & \sin(2 \omega \Delta t)\, I \\
-\sin(2 \omega \Delta t)\, I & \cos(2 \omega \Delta t)\, I
\end{pmatrix},
\]
with $I$ being the $2 \times 2$ or $3 \times 3$ identity matrix. Various integrators of any even order of accuracy can be obtained through a composition of these mappings. We refer the interested reader to the paper [47] for details. As an example, the paper provides the following second-order method:
\[
\phi_2^{\Delta t} := \phi_{H_A}^{\Delta t/2} \circ \phi_{H_B}^{\Delta t/2} \circ \phi_{\omega H_C}^{\Delta t} \circ \phi_{H_B}^{\Delta t/2} \circ \phi_{H_A}^{\Delta t/2}.
\]
In this composition, it is important to note that the coupling map $\phi_{\omega H_C}^{\Delta t}$ does not evolve the particles but, instead, mixes the data in phase space. A key element of this method is the use of a binding constant $\omega$ that synchronizes the two sets of phase space variables. Tao establishes an estimate on the accuracy of these methods in the context of long time simulations for integrable systems.
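To make the composition concrete, the three flow maps and the second-order scheme $\phi_2^{\Delta t}$ can be sketched in code. This is a minimal sketch for a generic Hamiltonian supplied through its two partial derivatives; the function names and the test Hamiltonian below are our own, not taken from [47]:

```python
import numpy as np

def phi_HA(x, p, y, q, dt, dHdx, dHdp):
    # Flow of H_A = H(x, Q): (x, Q) are frozen; P and y are updated.
    return x, p - dt * dHdx(x, q), y + dt * dHdp(x, q), q

def phi_HB(x, p, y, q, dt, dHdx, dHdp):
    # Flow of H_B = H(y, P): (y, P) are frozen; x and Q are updated.
    return x + dt * dHdp(y, p), p, y, q - dt * dHdx(y, p)

def phi_C(x, p, y, q, dt, omega):
    # Flow of omega*H_C: rotates the differences by the angle 2*omega*dt,
    # while the sums (x + y, P + Q) are left unchanged.
    c, s = np.cos(2.0 * omega * dt), np.sin(2.0 * omega * dt)
    sx, sp = x + y, p + q
    dx, dp = x - y, p - q
    dx, dp = c * dx + s * dp, -s * dx + c * dp
    return 0.5 * (sx + dx), 0.5 * (sp + dp), 0.5 * (sx - dx), 0.5 * (sp - dp)

def tao_step(x, p, y, q, dt, omega, dHdx, dHdp):
    # Second-order composition: A(dt/2), B(dt/2), C(dt), B(dt/2), A(dt/2).
    z = phi_HA(x, p, y, q, 0.5 * dt, dHdx, dHdp)
    z = phi_HB(*z, 0.5 * dt, dHdx, dHdp)
    z = phi_C(*z, dt, omega)
    z = phi_HB(*z, 0.5 * dt, dHdx, dHdp)
    return phi_HA(*z, 0.5 * dt, dHdx, dHdp)

# Demo on the (separable) harmonic oscillator H = (x^2 + p^2)/2, so that
# dH/dx = x and dH/dp = p; the exact solution from (1, 0) is x(t) = cos(t).
x = y = 1.0
p = q = 0.0
for _ in range(1000):
    x, p, y, q = tao_step(x, p, y, q, 0.01, 10.0,
                          lambda a, b: a, lambda a, b: b)
print(x, np.cos(10.0))  # the two values agree to a few digits
```

Here the copies are initialized identically, $(y, Q) = (x, P)$, and the coupling map only mixes the small differences that the splitting introduces between them.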
For a method with order $\ell$, time step size $\Delta t$, coupling parameter $\omega$, and simulation time $T$, he shows that the error is of the form
\[
\mathcal{O}\left( T \Delta t^{\ell} \omega \right),
\]
as long as $T = \mathcal{O}\left( \min\left( \Delta t^{-\ell} \omega^{-\ell}, \omega^{1/2} \right) \right)$. Based on this bound, he recommends that $\Delta t \ll \omega^{-1/\ell}$, and notes that it is both more accurate and more efficient to increase the order $\ell$ under a fixed $\omega$. The test problems shown in Tao's paper evolve particles in fixed fields, which can be either static or non-static, but are known functions. In our applications, the electric and magnetic fields respond to changes in the plasma, which is represented using particles. Therefore, coupling these particle integration schemes to our methods for evolving fields is a non-trivial task, and we describe the necessary modifications in the sections that follow.

4.5.1.1 Approach for Implicit Sources: Particles Lead the Fields

The wave equations for the potentials $\psi$ and $A$ are non-linear due to the source terms that couple with the particle data. We break this coupling through the use of duplicate fields, analogous to Tao's approach for the particles. To this end, we let $\left( \psi_{xq}^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$ and $\left( \psi_{yp}^n, \nabla \psi_{yp}^n, A_{yp}^n, \nabla A_{yp}^n \right)$ denote two pairs of field data at time level $t^n$. Moreover, each pair of field data is associated with a set of particle data indicated by the subscripts. Assuming that we have the data $(x^n, Q^n)$, $(y^n, P^n)$, $\left( \psi_{xq}^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$, and $\left( \psi_{yp}^n, \nabla \psi_{yp}^n, A_{yp}^n, \nabla A_{yp}^n \right)$, the second-order update $\phi_2^{\Delta t}$ can be modified to include updates to fields as follows:

1. Push particles: $(y^n, P^n) \mapsto \left( y^{n+1/2}, P^{n+1/2} \right)$ using $\left( x^n, Q^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$.

2. Evolve the fields $\left( \psi_{yp}^n, \nabla \psi_{yp}^n, A_{yp}^n, \nabla A_{yp}^n \right) \mapsto \left( \psi_{yp}^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$ using the particle data $\left( y^{n+1/2}, P^{n+1/2} \right)$.

3. Push particles: $(x^n, Q^n) \mapsto \left( x^{n+1/2}, Q^{n+1/2} \right)$ using $\left( y^{n+1/2}, P^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$.

4.
Coupling step: $\left( x^{n+1/2}, P^{n+1/2}, y^{n+1/2}, Q^{n+1/2} \right) \mapsto \left( x^*, P^*, y^*, Q^* \right)$.

5. Recompute the field data $\left( \psi_{yp}^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$ using the particle data $(y^*, P^*)$.

6. Evolve the fields $\left( \psi_{xq}^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right) \mapsto \left( \psi_{xq}^{n+1/2}, \nabla \psi_{xq}^{n+1/2}, A_{xq}^{n+1/2}, \nabla A_{xq}^{n+1/2} \right)$ using the particle data $(x^*, Q^*)$.

7. Push particles: $(x^*, Q^*) \mapsto \left( x^{n+1}, Q^{n+1} \right)$ using $\left( y^*, P^*, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$.

8. Evolve the fields $\left( \psi_{xq}^{n+1/2}, \nabla \psi_{xq}^{n+1/2}, A_{xq}^{n+1/2}, \nabla A_{xq}^{n+1/2} \right) \mapsto \left( \psi_{xq}^{n+1}, \nabla \psi_{xq}^{n+1}, A_{xq}^{n+1}, \nabla A_{xq}^{n+1} \right)$ using the particle data $\left( x^{n+1}, Q^{n+1} \right)$.

9. Push particles: $(y^*, P^*) \mapsto \left( y^{n+1}, P^{n+1} \right)$ using $\left( x^{n+1}, Q^{n+1}, \nabla \psi_{xq}^{n+1}, A_{xq}^{n+1}, \nabla A_{xq}^{n+1} \right)$.

10. Evolve the fields $\left( \psi_{yp}^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right) \mapsto \left( \psi_{yp}^{n+1}, \nabla \psi_{yp}^{n+1}, A_{yp}^{n+1}, \nabla A_{yp}^{n+1} \right)$ using the particle data $\left( y^{n+1}, P^{n+1} \right)$.

There are several essential details embedded in the steps shown above which require further explanation. While it may not be apparent from the notation, the field updates involve additional time history, not just the most recent one. Quantities labeled with "$*$" live at time level $t^{n+1/2}$, but we use this notation to distinguish the data from $t^{n+1/2}$ pre/post-mixing when clarification is needed. The particle updates on pairs of coordinates may appear strange because of their implicit-like form, which merely reflects the coupling of phase space. As an example, in order to perform the particle push in step 3, we need to use the fields associated with the particle data $\left( y^{n+1/2}, P^{n+1/2} \right)$. This is reflected in step 2. Next, it would seem that we are forgetting to evolve fields between steps 3 and 4; however, the mixing step will modify the particle data regardless of the fields, so there is no need to perform this evolution. Instead, the fields are recomputed in step 5 because of the mixing that occurs in step 4.

Remark 4.5.1. This algorithm is quite costly both in terms of memory and computation.
It requires duplicates of both particle and field information, which can be problematic when large numbers of particles and fine meshes are required. Computationally, a single time step of this second-order scheme has a total of 10 steps in which fields and derivatives are updated. The total number of calls made to the wave solver methods depends entirely on the dimensionality of the problem, which will change the entries of the gradient vector, and the number of components retained for the vector potentials (if any).

4.5.1.2 Approaches for a Mixed Advance: Dealing with Explicit and Implicit Source Terms

This section presents a modification of the approach discussed in section 4.5.1.1 to develop a mixed approach in which the fields can be advanced using an explicit form of the source terms, while their corresponding derivatives use sources in an implicit form. Assuming that we have the data $(x^n, Q^n)$, $(y^n, P^n)$, $\left( \psi_{xq}^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$, and $\left( \psi_{yp}^n, \nabla \psi_{yp}^n, A_{yp}^n, \nabla A_{yp}^n \right)$, the second-order update $\phi_2^{\Delta t}$ can be modified to include updates to fields as follows:

1. Evolve the fields $\left( \psi_{yp}^n, A_{yp}^n \right) \mapsto \left( \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2} \right)$ using the particle data $(y^n, P^n)$.

2. Push particles: $(y^n, P^n) \mapsto \left( y^{n+1/2}, P^{n+1/2} \right)$ using $\left( x^n, Q^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$.

3. Compute the derivatives $\left( \nabla \psi_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$ using the particle data $\left( y^{n+1/2}, P^{n+1/2} \right)$.

4. Evolve the fields $\left( \psi_{xq}^n, A_{xq}^n \right) \mapsto \left( \psi_{xq}^{n+1/2}, A_{xq}^{n+1/2} \right)$ using the particle data $(x^n, Q^n)$.

5. Push particles: $(x^n, Q^n) \mapsto \left( x^{n+1/2}, Q^{n+1/2} \right)$ using $\left( y^{n+1/2}, P^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$.

6. Coupling step: $\left( x^{n+1/2}, P^{n+1/2}, y^{n+1/2}, Q^{n+1/2} \right) \mapsto \left( x^*, P^*, y^*, Q^* \right)$.

7. Recompute the derivatives $\left( \nabla \psi_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$ using the particle data $(y^*, P^*)$.

8. Compute the derivatives $\left( \nabla \psi_{xq}^{n+1/2}, \nabla A_{xq}^{n+1/2} \right)$ using the particle data $(x^*, Q^*)$.

9. Evolve the fields $\left( \psi_{xq}^{n+1/2}, A_{xq}^{n+1/2} \right) \mapsto \left( \psi_{xq}^{n+1}, A_{xq}^{n+1} \right)$ using the particle data $(x^*, Q^*)$.

10. Push particles: $(x^*, Q^*) \mapsto \left( x^{n+1}, Q^{n+1} \right)$ using $\left( y^*, P^*, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$.

11. Compute the derivatives $\left( \nabla \psi_{xq}^{n+1}, \nabla A_{xq}^{n+1} \right)$ using the particle data $\left( x^{n+1}, Q^{n+1} \right)$.

12. Evolve the fields $\left( \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2} \right) \mapsto \left( \psi_{yp}^{n+1}, A_{yp}^{n+1} \right)$ using the particle data $(y^*, P^*)$.

13. Push particles: $(y^*, P^*) \mapsto \left( y^{n+1}, P^{n+1} \right)$ using $\left( x^{n+1}, Q^{n+1}, \nabla \psi_{xq}^{n+1}, A_{xq}^{n+1}, \nabla A_{xq}^{n+1} \right)$.

14. Compute the derivatives $\left( \nabla \psi_{yp}^{n+1}, \nabla A_{yp}^{n+1} \right)$ using the particle data $\left( y^{n+1}, P^{n+1} \right)$.

Remark 4.5.2. Consideration of a mixed approach of this form, with regard to the field solvers developed in chapter 2, is useful in preventing excessive dissipation. More specifically, the time-centered field solvers, which use an explicit source, are known to be purely dispersive [42]. Using the time-centered update for the fields, which requires an explicit source, in conjunction with duplicate field and particle data may create a mismatch in time levels between the fields and the particles. This motivates us to consider a variation on the first approach where particles lead the fields, but the field advances apply the time-centered method in which the source is made implicit by averaging. For example, the charge density at time level $t^n$ can be approximated to second-order accuracy in time by
\[
\rho^n \approx \frac{1}{2} \left( \rho^{n+1} + \rho^{n-1} \right).
\]

4.5.2 The Asymmetrical Euler Method

We recently became aware of an alternative particle push, suitable for non-separable Hamiltonian systems, which was proposed in [36]. For context, this paper considered mesh-free methods for solving the Darwin limit of the VM system using the Coulomb gauge. Their adoption of a generalized Hamiltonian model for particles was largely motivated by the numerical instabilities associated with time derivatives of the vector potential in this particular limit. The resulting model, which is essentially identical to the formulation (1.39)-(1.40) used in this work, trades additional coupling of phase space for numerical stability.
They proposed a semi-implicit method, dubbed the asymmetrical Euler method (AEM), which has the form
\[
x_i^{n+1} = x_i^n + v_i^n \Delta t, \tag{4.29}
\]
\[
P_i^{n+1} = P_i^n + q_i \left( -\nabla \psi^{n+1} + \left( \nabla A^{n+1} \right) \cdot v_i^n \right) \Delta t, \tag{4.30}
\]
\[
v_i^n \equiv \frac{1}{m_i} \left( P_i^n - q_i A^n \right). \tag{4.31}
\]
This method, which is first-order in time, proceeds by first performing an explicit update of the particle positions using (4.29). Next, with the new positions, the fields are updated, and finally, the generalized momentum update (4.30) is performed. While it may appear that iteration is required to compute gradients of the vector potential $\nabla A^{n+1}$, since $A^{n+1}$ requires $v_i^{n+1}$ through the current, the authors use $v_i^n$, which results in a fully explicit update. It is recommended to avoid iteration, especially in the case that the vector potential is strong, to prevent the formation of numerical instabilities. Further, this lagging of the velocity is consistent with a first-order method. At the end of the particle update, the velocity is modified according to (4.31) for use in the next time step. In contrast to the method of Molei Tao, discussed in section 4.5.1, this method requires significantly less overall storage and fewer overall field solves. Despite the fact that this method is first-order, experiments performed in [36] demonstrated good energy conservation and good accuracy, in addition to computational efficiency. Additionally, the lagging of the velocity in the generalized momentum update (4.30) means that this method avoids the pitfall of circular logic that occurs in the Molei Tao method when the current needs to be mapped to the mesh. We apply this integrator to the expanding beam problems in section 4.6.

4.6 Numerical Examples

In this section, we apply the proposed PIC methods to several well-known benchmark problems in the literature. First, we test the particle methods and our formulation in the case of a single particle moving through known fields.
Then, we apply the methods to more interesting problems involving dynamic fields, which respond to the motion of the particles. We use the physical constants listed in Table 4.1 in the numerical experiments presented in this work.

Parameter | Value
Ion mass ($m_i$) [kg] | $9.108379025973462 \times 10^{-29}$
Electron mass ($m_e$) [kg] | $9.108379025973462 \times 10^{-31}$
Boltzmann constant ($k_B$) [kg m$^2$ s$^{-2}$ K$^{-1}$] | $1.38064852 \times 10^{-23}$
Permittivity of free space ($\epsilon_0$) [kg$^{-1}$ m$^{-3}$ s$^4$ A$^2$] | $8.854187817 \times 10^{-12}$
Permeability of free space ($\mu_0$) [kg m s$^{-2}$ A$^{-2}$] | $1.25663706 \times 10^{-6}$
Speed of light ($c$) [m/s] | $2.99792458 \times 10^8$

Table 4.1: Table of the physical constants (SI units) used in the numerical experiments.

4.6.1 Motion of a Single Charged Particle

We first compare the integrator proposed by Tao [47] with the Boris method [4], which are discussed in sections 4.5.1 and 4.4.2, respectively. This is a natural first step before applying the method to problems with dynamic fields that respond to particle motion. Here, we consider a simple model for the motion of a single charged particle that is given by
\[
\dot{x} = v, \qquad \dot{v} = \frac{q}{m} \left( E + v \times B \right).
\]
We use electro- and magneto-static fields here and suppose that the magnetic field lies along the $\hat{z}$ unit vector, so
\[
B = B_0 \hat{z}, \qquad E = E^{(1)} \hat{x} + E^{(2)} \hat{y} + E^{(3)} \hat{z},
\]
where $B_0$ is a constant. Again, component-based definitions have been used for the fields $E = \left( E^{(1)}, E^{(2)}, E^{(3)} \right)$ and $B = \left( B^{(1)}, B^{(2)}, B^{(3)} \right)$. Consequently, we have that
\[
v \times B = v^{(2)} B_0 \hat{x} - v^{(1)} B_0 \hat{y},
\]
so the full equations of motion are
\[
\begin{aligned}
\frac{dx^{(1)}}{dt} &= v^{(1)}, & \frac{dv^{(1)}}{dt} &= \frac{q}{m} \left( E^{(1)} + v^{(2)} B_0 \right), \\
\frac{dx^{(2)}}{dt} &= v^{(2)}, & \frac{dv^{(2)}}{dt} &= \frac{q}{m} \left( E^{(2)} - v^{(1)} B_0 \right), \\
\frac{dx^{(3)}}{dt} &= v^{(3)}, & \frac{dv^{(3)}}{dt} &= \frac{q}{m} E^{(3)}.
\end{aligned}
\]
We can then use the classical momentum $p = m v$ to obtain
\[
\begin{aligned}
\frac{dx^{(1)}}{dt} &= \frac{1}{m} p^{(1)}, & \frac{dp^{(1)}}{dt} &= q \left( E^{(1)} + \frac{1}{m} p^{(2)} B_0 \right), \\
\frac{dx^{(2)}}{dt} &= \frac{1}{m} p^{(2)}, & \frac{dp^{(2)}}{dt} &= q \left( E^{(2)} - \frac{1}{m} p^{(1)} B_0 \right), \\
\frac{dx^{(3)}}{dt} &= \frac{1}{m} p^{(3)}, & \frac{dp^{(3)}}{dt} &= q E^{(3)}.
\end{aligned}
\]
Next, we show how to convert the electric and magnetic fields to potentials for use in Molei Tao's method. Using the potentials $\psi$ and $A \equiv \left( A^{(1)}, A^{(2)}, A^{(3)} \right)$, one can compute the electric and magnetic fields via (1.11), which is equivalent to writing
\[
\begin{aligned}
E^{(1)} &= -\partial_x \psi - \partial_t A^{(1)}, & B^{(1)} &= \partial_y A^{(3)} - \partial_z A^{(2)}, \\
E^{(2)} &= -\partial_y \psi - \partial_t A^{(2)}, & B^{(2)} &= -\partial_x A^{(3)} + \partial_z A^{(1)}, \\
E^{(3)} &= -\partial_z \psi - \partial_t A^{(3)}, & B^{(3)} &= \partial_x A^{(2)} - \partial_y A^{(1)}.
\end{aligned}
\]
The time-independence of the magnetic field for this problem implies that $\partial_t A = 0$, so that
\[
E = -\nabla \psi \implies E^{(1)} = -\partial_x \psi, \quad E^{(2)} = -\partial_y \psi, \quad E^{(3)} = -\partial_z \psi.
\]
Therefore, for this problem, we can use
\[
\psi = -E^{(1)} x - E^{(2)} y - E^{(3)} z.
\]
Moreover, since the magnetic field lives only in the $z$-direction, the vector potential satisfies
\[
B = (0, 0, B_0) = \left( 0, 0, \partial_x A^{(2)} - \partial_y A^{(1)} \right).
\]
As the choice of potentials is not unique, it suffices to pick
\[
A^{(1)} \equiv 0, \qquad A^{(2)} = B_0 x, \qquad A^{(3)} \equiv 0.
\]
In summary, the values and required derivatives for the potentials are given by
\[
\begin{aligned}
-\partial_x \psi &= E^{(1)}, & -\partial_y \psi &= E^{(2)}, & -\partial_z \psi &= E^{(3)}, \\
A^{(1)} &= 0, & A^{(2)} &= B_0 x, & A^{(3)} &= 0, \\
\partial_x A^{(1)} &= 0, & \partial_y A^{(1)} &= 0, & \partial_z A^{(1)} &= 0, \\
\partial_x A^{(2)} &= B_0, & \partial_y A^{(2)} &= 0, & \partial_z A^{(2)} &= 0, \\
\partial_x A^{(3)} &= 0, & \partial_y A^{(3)} &= 0, & \partial_z A^{(3)} &= 0,
\end{aligned}
\]
which yield the simplified equations of motion
\[
\begin{aligned}
\frac{dx^{(1)}}{dt} &= \frac{1}{m} P^{(1)}, & \frac{dP^{(1)}}{dt} &= q E^{(1)} + \frac{q}{m} B_0 \left( P^{(2)} - q B_0 x^{(1)} \right), \\
\frac{dx^{(2)}}{dt} &= \frac{1}{m} \left( P^{(2)} - q B_0 x^{(1)} \right), & \frac{dP^{(2)}}{dt} &= q E^{(2)}, \\
\frac{dx^{(3)}}{dt} &= \frac{1}{m} P^{(3)}, & \frac{dP^{(3)}}{dt} &= q E^{(3)}.
\end{aligned}
\]
The setup for the test consists of a single particle with mass $m = 1.0$ and charge $q = -1.0$ whose initial position is at the origin of the domain, i.e., $x(0) = (0, 0, 0)$. Initially, the particle has momentum components only in the $x$ and $z$ directions to generate so-called "cyclotron" motion. We choose the initial momenta to be $p(0) = P(0) = \left( 1.0 \times 10^{-2}, 0, 1.0 \times 10^{-2} \right)$.
The strength of the magnetic field in the $z$ direction is selected to be $B_0 = 1.0$, and we ignore the contributions from the electric field, so that $E = (0, 0, 0)$. Both methods are run to a final time of $T = 30.0$, and a total of 1000 time steps are used so that $\Delta t = 0.03$. Lastly, we select the coupling parameter $\omega = 100$ for the Molei Tao integrator. The particle's position is tracked through time and plotted as a curve in 3-D. Figure 4.1 compares the particle trajectories obtained with both methods.

Figure 4.1: Trajectories for the single particle test, obtained using the Boris method (4.1a) and the second-order integrator by Molei Tao (as presented in [47]) (4.1b). Both methods produce identical trajectories under identical experimental conditions. The particles rotate about the magnetic field, which points in the $z$-direction.

Next, we perform a refinement study of the two methods to examine their error properties using the same experimental parameters from the cyclotron test. Errors are measured against reference solutions computed with $10^6$ time steps, so that $\Delta t = 3.0 \times 10^{-5}$. Errors are measured using the $\ell^{\infty}$ norm. The test starts with a total of 100 time steps and successively doubles the number of steps, using, at most, $1.28 \times 10^4$ steps. The results of the refinement study are shown in Figure 4.2. Both methods refine at second-order, with the Boris method producing a larger error in the solution compared to the Molei Tao integrator. While it may be possible to further reduce the size of the errors made by the Molei Tao method through better choices of the coupling parameter $\omega$, the Boris method will likely remain more efficient in terms of error for a given amount of compute time. Although we are not reporting timing results, the Boris method was notably faster than the Molei Tao integrator. This is not all that surprising given that the latter method requires more intermediate steps.
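The cyclotron test above is straightforward to reproduce. The following is a minimal sketch of the Boris rotation steps from section 4.4.2 applied to this setup (the variable names are our own):

```python
import numpy as np

def boris_push(x, v, q, m, E, B, dt):
    """One Boris step: rotate v^{n-1/2} to v^{n+1/2}, then advance x."""
    v_minus = v + (q * dt / (2.0 * m)) * E        # first half electric kick
    t = (q * dt / (2.0 * m)) * B
    v_prime = v_minus + np.cross(v_minus, t)
    s = 2.0 * t / (1.0 + np.dot(t, t))
    v_plus = v_minus + np.cross(v_prime, s)       # rotation about B
    v_new = v_plus + (q * dt / (2.0 * m)) * E     # second half electric kick
    return x + dt * v_new, v_new

# Setup from section 4.6.1: m = 1, q = -1, B = B0 z-hat, E = 0.
q, m, B0 = -1.0, 1.0, 1.0
E, B = np.zeros(3), np.array([0.0, 0.0, B0])
dt, steps = 0.03, 1000

x = np.zeros(3)
v = np.array([1.0e-2, 0.0, 1.0e-2])
# Stagger the velocity back by dt/2 (the returned position is discarded).
_, v = boris_push(x, v, q, m, E, B, -0.5 * dt)

for _ in range(steps):
    x, v = boris_push(x, v, q, m, E, B, dt)

# With E = 0, the rotation preserves |v| exactly, and the z-velocity is
# untouched, so the particle drifts a distance v_z * T = 0.3 along z.
print(np.linalg.norm(v), x[2])
```

Replacing `E` with a nonzero field exercises the half-step kicks as well; the same loop structure applies.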
4.6.2 The Cold Two-Stream Instability

We consider the motion of "cold" streams of ions and electrons restricted to a one-dimensional periodic domain by a sufficiently strong (uniform) magnetic field in the two remaining directions.

Figure 4.2: Self-refinement for the single particle test using the Boris method (4.2a) and the second-order integrator by Molei Tao (as presented in [47]) (4.2b). Second-order accuracy is achieved by both methods, but the $\ell^\infty$ errors for the Boris method are nearly a factor of 2 larger than those produced by the Molei Tao method. While we have not presented timing results, it is worth noting that the run times for the Boris method were considerably faster than those of the Molei Tao method due to the latter's additional "stages". The final error measurements taken from the refinement study are $1.4728 \times 10^{-7}$ (Boris) and $6.5592 \times 10^{-8}$ (Tao).

Ions are taken to be uniformly distributed in space and sufficiently heavy compared to the electrons that their motion can be ignored in the simulation. While the ions remain stationary, they act as a neutralizing background against the dynamic electrons. Mathematically, the electron velocities are represented as a sum of two Dirac delta distributions which are symmetric in velocity about the origin, i.e., the streams move in opposite directions with the same speed. A slight perturbation in the electron velocities is then introduced to force a charge imbalance, as some particles move faster than others. This, in turn, generates an electric field that attempts to restore the neutrality of the system, causing the streams to interact, or "roll up", creating regions of trapped particles.
In order to describe the models used in the simulation, let us denote the components of the position and momentum vectors for particle $i$ as $\mathbf{x}_i \equiv \left( x_i^{(1)}, x_i^{(2)}, x_i^{(3)} \right)$ and $\mathbf{P}_i \equiv \left( P_i^{(1)}, P_i^{(2)}, P_i^{(3)} \right)$, respectively. The equations for the motion of particle $i$ then assume the form
\[
\frac{dx_i^{(1)}}{dt} = \frac{1}{r_i} P_i^{(1)}, \qquad
\frac{dP_i^{(1)}}{dt} = -q_i \partial_x \psi.
\]
Therefore, the motion in this plane requires knowledge of $\psi$ and $\partial_x \psi$, which can be obtained by solving a two-way wave equation for the scalar potential:
\[
\frac{1}{\kappa^2} \frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1} \rho. \tag{4.32}
\]
As this is an electrostatic problem, the gauge condition can be safely ignored. In the limit where $\kappa \gg 1$, the characteristic thermal velocities of the particles become well-separated from the speed of light. Rather than solve the two-way wave equation, one instead solves the Poisson equation
\[
-\Delta \psi = \frac{1}{\sigma_1} \rho. \tag{4.33}
\]
Using asymptotic analysis, it can be shown that the approximation error made by employing the Poisson model for the scalar potential is $O(1/\kappa)$ [42].

Model | Time Integration | Fields + Derivatives
$-\Delta \psi = \frac{1}{\sigma_1} \rho$ | Leapfrog, Tao (with averaging) | FFT + FFT
$\frac{1}{\kappa^2} \partial_{tt} \psi - \Delta \psi = \frac{1}{\sigma_1} \rho$ | Leapfrog, Tao (with averaging) | BDF-2 + BDF-2, Central-2 + BDF-2, Central-2 + BDF-4

Table 4.2: Summary of the algorithms explored for the two-stream instability example. Both time integration methods considered are second-order.

We benchmark the performance of several combinations of algorithms (see Table 4.2) for time stepping particles and evolving fields by comparing with well-known methods. This helps establish the baseline properties of the methods and reduces the parameter space of viable methods. The setup for this test problem employs a spatial mesh defined on the interval $[-10\pi/3, 10\pi/3]$, which is discretized using 128 total grid points and supplied with periodic boundary conditions. The non-dimensional final time for the simulation is taken to be $T_f = 50.0$, with 4,000 time steps being used to evolve the system.
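The FFT field solver listed in Table 4.2 amounts to dividing the Fourier coefficients of the charge density by the squared wavenumber. The sketch below is our own minimal stand-in: a naive $O(N^2)$ DFT replaces the FFT for clarity, and the zero mode is fixed to zero (zero-mean potential), which is the usual convention on a periodic domain.

```python
import cmath, math

def poisson_periodic_1d(rho, L, sigma1=1.0):
    """Pseudo-spectral solve of -psi'' = rho / sigma1 with periodic BCs.
    A naive O(N^2) DFT stands in for the FFT; the k = 0 mode is set to
    zero, fixing the mean of psi."""
    N = len(rho)
    # forward transform
    rho_hat = [sum(rho[j] * cmath.exp(-2j * math.pi * k * j / N)
                   for j in range(N)) for k in range(N)]
    psi_hat = [0j] * N
    for k in range(1, N):
        m = k if k <= N // 2 else k - N      # signed wavenumber index
        kx = 2.0 * math.pi * m / L           # physical wavenumber
        psi_hat[k] = rho_hat[k] / (sigma1 * kx * kx)
    # inverse transform (real part; rho is real)
    return [sum(psi_hat[k] * cmath.exp(2j * math.pi * k * j / N)
                for k in range(N)).real / N for j in range(N)]
```

For $\rho = \cos(x)$ on $[0, 2\pi]$ with $\sigma_1 = 1$, the exact solution of $-\psi'' = \rho$ is $\psi = \cos(x)$, which the solver reproduces to round-off.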
The plasma is represented using a total of 20,000 macro-particles, which are split equally between ions and electrons. As mentioned earlier, the positions of the ions and electrons are taken to be uniformly spaced along the grid. Ions remain stationary in this problem, so we set their velocity to zero. The construction of the streams begins by splitting the electrons into two equally sized groups, whose respective (non-dimensional) drift velocities are set to $\pm 1$. To generate an instability, we add a perturbation to the electron velocities of the form
\[
\epsilon \sin \left( \frac{2 \pi k (x - a)}{L} \right).
\]
Here, $\epsilon = 5 \times 10^{-3}$ controls the perturbation strength, $k = 1$ is the wave number of the perturbation, $x$ is the position of the electron, $a$ is the left-most grid point, and $L$ is the length of the domain. In a more physically realistic simulation, the perturbation would be induced by some external force, which would also perturb the position data of the particles. Such a perturbation of the position data requires a self-consistent field solve to properly initialize the potentials. In our simulation, we assume that no spatial perturbation is present, so the fields are identically zero at the initial time step. A plot of the electron streams at the initial condition is shown in Figure 4.3.

Figure 4.3: Initial configuration of electrons used in the two-stream experiments.

The plasma parameters used in the non-dimensionalization for this test problem are displayed in Table 4.3. Note that under these scales, the normalized speed of light is $\kappa = 50$. In some sense, this value is close to the relativistic regime, but far enough away to avoid the need for relativistic time integration methods. Additionally, we find that this configuration is sufficient to resolve the Debye length ($\approx 6$ cells/$\lambda_D$) and the angular plasma period ($\approx 80$ steps/$\omega_{pe}$), and to keep the particle CFL $< 1$, all of which are necessary for maintaining stability in explicit PIC methods.
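The stream construction above can be sketched as follows. This is our own illustration; in particular, assigning alternating particles to the two streams is our choice (splitting the array into two halves works equally well), and the defaults mirror the stated parameters.

```python
import math

def init_two_stream(n_electrons, a, L, v_drift=1.0, eps=5.0e-3, k=1):
    """Initialize two cold counter-streaming electron beams on [a, a + L]
    with a sinusoidal velocity perturbation of strength eps and wave
    number k. Electrons are uniformly spaced; streams alternate in the
    sign of the drift velocity."""
    x, v = [], []
    for i in range(n_electrons):
        xi = a + (i + 0.5) * L / n_electrons           # uniform spacing
        drift = v_drift if i % 2 == 0 else -v_drift    # +1 / -1 streams
        vi = drift + eps * math.sin(2.0 * math.pi * k * (xi - a) / L)
        x.append(xi)
        v.append(vi)
    return x, v
```

Since $\epsilon \ll |v_{\mathrm{drift}}|$, the perturbation never changes the sign of a particle's velocity, so each stream keeps exactly half the electrons.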
In order to get a sense of the behavior attributed to the particle integrator, we first considered the Poisson model (4.33) for the scalar potential.

Parameter | Value
Average number density ($\bar{n}$) [m$^{-3}$] | $7.856060 \times 10^{1}$
Average temperature ($\bar{T}$) [K] | $2.371698 \times 10^{6}$
Debye length ($\lambda_D$) [m] | $1.199170 \times 10^{4}$
Inverse angular plasma frequency ($\omega_{pe}^{-1}$) [s/rad] | $2.000000 \times 10^{-3}$
Thermal velocity ($v_{th}$) [m/s] | $5.995849 \times 10^{6}$

Table 4.3: Table of the plasma parameters used in the two-stream instability example.

Since the combination of leapfrog time integration with an FFT field solver is such a commonly used approach to this problem, it allowed us to identify key differences attributable solely to the choice of time integration method used for the particles. We found that a direct application of the Molei Tao integrator to this problem, including the corresponding field solves, gave nonphysical results. As the streams began to develop interesting structures, the particles appeared to "jump" off their smooth trajectories, manifesting as a form of noise. This can be attributed to the duplicate field and particle data required by Molei Tao's method, which are not guaranteed to remain close; eventually, this leads to differences in the potentials used to move the particles. Adjustments to the coupling parameter in Molei Tao's method were unsuccessful at controlling this behavior and, in fact, exacerbated the phenomenon. A method parameter that can lead to such exceptionally nonphysical results is a highly undesirable feature. In an attempt to fix this problem, we chose to adjust the particle data at the end of the time step by replacing the values with the averages of $\left( \mathbf{x}^{n+1}, \mathbf{Q}^{n+1} \right)$ and $\left( \mathbf{y}^{n+1}, \mathbf{P}^{n+1} \right)$. In Figure 4.4, we compare the Molei Tao method with and without this averaging.
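The end-of-step averaging can be sketched as below. The copy naming $(\mathbf{x}, \mathbf{Q})$ and $(\mathbf{y}, \mathbf{P})$ follows the text; the function itself is our own illustration of the adjustment, not the thesis code.

```python
def average_tao_copies(x, Q, y, P):
    """Replace the two phase-space copies (x, Q) and (y, P) carried by
    Tao's splitting integrator with their componentwise averages, so
    both copies leave the step in the same state."""
    x_avg = [(xi + yi) / 2.0 for xi, yi in zip(x, y)]
    p_avg = [(qi + pi) / 2.0 for qi, pi in zip(Q, P)]
    # both copies are overwritten with the averaged state
    return x_avg, p_avg, list(x_avg), list(p_avg)
```

Collapsing the two copies onto their mean is what prevents the duplicated field and particle data from drifting apart over many steps, at the likely cost of the method's symplectic structure.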
This approach, while most likely not symplectic, seems to be reasonably effective at controlling this difference and the number of particles that leave the stream lines. This modification is therefore used in all subsequent experiments involving the Molei Tao integrator. Having (somewhat) resolved the issue posed by the Molei Tao method, we then compared this integrator against the leapfrog method using the Poisson model (4.33) for the scalar potential. We present snapshots of the electron streams obtained with both methods in Figure 4.5. We observe nearly identical behavior from the two methods, with the exception of later times, at which point several particles have moved off their trajectories.

Figure 4.4: A comparison of the Molei Tao particle integrator with and without averaging for the two-stream example with the Poisson model. Over time, the pairs of phase space data, including the associated fields, can grow apart, leading to vastly different potentials that kick particles off their smooth trajectories. Averaging appears to be fairly effective at controlling this behavior.

Figure 4.5: We present plots of the electrons in phase space obtained using the Poisson model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The FFT is used to compute the scalar potentials in both methods. At later times, despite improvements from "averaging" the particle data, the Tao method causes particles to move off the stream lines.
This phenomenon is a numerical artifact that is not present in the leapfrog method.

Figure 4.6: Time refinement of a tracer particle's position for the two-stream instability using the Poisson model for the potential with leapfrog (a) and the Molei Tao integrator with averaging (b). We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. Both methods converge to second-order accuracy, with leapfrog generally displaying a larger absolute error than the Tao method. The exception to this is the smallest $\Delta t$ used in the leapfrog experiments.

A time refinement experiment for the two-stream example with the Poisson model was also performed using the same numerical parameters outlined in Table 4.3. Errors were measured by following the trajectory of a single tracer particle, arbitrarily selected from the middle of the particle array, for several values of $\Delta t$. The location of this tracer particle, in each case, is stored at the non-dimensional final time $T_f = 25$, which we note is prior to the occurrence of the "roll-up" in the streams. The reference solution used to measure the error was computed using 16,384 time steps. We ran the code starting with 256 time steps and successively doubled this number until reaching 8,192 steps. Errors were measured using the $\ell^\infty$ norm, which in one dimension becomes the absolute value. We plot the errors against the time step size in Figure 4.6. The plots show second-order time accuracy for both methods, with cleaner refinement behavior observed in the leapfrog integrator despite a larger overall error. Additionally, the Molei Tao integrator displays a noticeable increase in the error when larger time steps are taken.
It is worth noting that the initial values of $\Delta t$ (i.e., where the jump in the error occurs) violate the particle CFL condition, which should be $< 1$ for explicit particle methods; however, this jump in the error is not observed in the leapfrog method, which is known for its long-time accuracy. Moreover, it is difficult to identify the exact time at which the breakdown in the Molei Tao method occurs. The accuracy guarantees are given by Tao in an asymptotic form that involves the duration of the simulation, the order of the method, and the value of the coupling parameter. Additionally, these estimates were obtained only in the case of particles moving through fields known at all positions in space and time, so they may no longer be valid for the dynamic fields considered in this work.

This same experiment was repeated using the two-way wave model (4.32) in place of the Poisson model (4.33) for the scalar potential. For problems which are strongly electrostatic (i.e., $\kappa \gg 1$), the wave model should produce results similar to those of the Poisson model shown in Figure 4.5. This setting allows us to benchmark the performance of the wave solvers and methods for derivatives discussed in chapter 2. We include the second-order time-centered and BDF field solvers, as well as derivative methods based on BDF-2 and BDF-4 discretizations, in this experiment. Recall that in section 2.5.1 we mentioned concerns about the stability of field solvers based on BDF-3 and BDF-4 discretizations. Here, we use the higher-order methods only to compute derivatives, which are not evolved in time, so stability is less of a concern. Moreover, since this problem is one-dimensional in space, we avoid the splitting error, so the derivatives will be more accurate. We considered three pairings of these methods: (1) BDF-2 with BDF-2, (2) central-2 with BDF-2, and (3) central-2 with BDF-4.
Results obtained with each of these pairings for the field solvers are shown in Figures 4.7, 4.8, and 4.9, respectively. A notable difference from the Poisson model is that particles stayed attached to their smooth trajectories. This can be attributed to the use of a wave model, which, due to the finite speed of propagation, responds more slowly to changes in the charge density. Among the results for the wave models, we observe excellent agreement between the leapfrog and Molei Tao integrators. In these experiments, the run parameters we selected gave a CFL $\approx 3.79$, which is not large enough to see noticeable improvements gained by moving to higher-order methods. Nevertheless, the overall consistency among the results is quite encouraging. We note that time averaging is applied to the charge density in solvers which combine Molei Tao with the central-2 scheme for the fields (see section 4.5.1.2 for details). Without this averaging, the streams interact at an accelerated rate, which is nonphysical.

A time refinement study of the proposed methods was also performed, following the same tracer-particle procedure used for the Poisson model. The results of the refinement study are presented in Figure 4.10. We (generally) observe second-order temporal accuracy with each of the methods. When leapfrog is used to move the particles, we observe fairly clean second-order accuracy. The Molei Tao method, in contrast, shows some irregularities in the refinement pattern, which in some cases appears to diverge when large time steps are used. This is likely due to the time step simply being too large for an explicit method, despite satisfying the CFL condition for the particles. Additionally, it is worth noting that the methods using Molei Tao display a smaller error. In terms of efficiency, however, the increased number of field solves required by the Molei Tao method may not be offset by this improvement in the error.
Figure 4.7: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The second-order (diffusive) BDF scheme (BDF-2) is used to compute the scalar potentials and their derivatives for both methods. Unlike the results obtained with the Poisson model, which used the FFT as the field solver (shown in Figure 4.5), the particles at later times in the Molei Tao method seem to stay attached to their trajectories.

Figure 4.8: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials are evolved using the second-order central scheme (central-2), while the derivatives are computed at each step with the second-order BDF scheme (BDF-2). In the bottom row, which uses the Molei Tao method, we obtain results that are similar to the BDF-2 method (see Figure 4.7) in the sense that particles do not seem to jump off their trajectories.
Figure 4.9: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials are evolved using the second-order central scheme (central-2), while the derivatives are computed at each step with the fourth-order BDF scheme (BDF-4). As with the other wave solver methods, the particles in the Molei Tao experiments seem to stay attached to their smooth trajectories, even at later times.

Figure 4.10: Time refinement of a tracer particle's position for the two-stream instability. For the particle push, we consider both leapfrog and the Molei Tao method with averaging, in combination with different methods for the fields and their derivatives. We selected $\omega = 500$ as the value of the coupling parameter in all of the Molei Tao integrator experiments. Each of the methods converges to second-order accuracy, with the error in the Tao method being smaller than that of leapfrog.

Parameter | Value
Average number density ($\bar{n}$) [m$^{-3}$] | $1.129708 \times 10^{14}$
Average temperature ($\bar{T}$) [K] | $2.371698 \times 10^{6}$
Debye length ($\lambda_D$) [m] | $1.000000 \times 10^{-2}$
Inverse angular plasma frequency ($\omega_{pe}^{-1}$) [s/rad] | $1.667820 \times 10^{-9}$
Thermal velocity ($v_{th}$) [m/s] | $5.995849 \times 10^{6}$

Table 4.4: Table of the plasma parameters used in the numerical heating example.
4.6.3 Numerical Heating Study

We now perform a numerical heating study using the same combination of models and algorithms shown in Table 4.2 for the two-stream problem. The primary purpose of this test is to characterize the effect of resolving the Debye length $\lambda_D$ in a steady-state problem for the Vlasov equation. Moreover, it allows us to benchmark the degree of heating observed under different selections of models, particle integrators, and field solvers. These numerical properties turn out to be connected to the symplecticity of the method. Explicit PIC methods are not symplectic because the fields and the particles are not self-consistent with one another. A consequence of this is that the grid should be sufficiently fine that a given particle can "see" the correct potential, which is otherwise screened by particles of opposite charge. In other words, with explicit PIC methods, one needs to resolve the charge separation inside the plasma, which is determined by the Debye length. A general rule of thumb for explicit PIC simulations is that the grid spacing $\Delta x$ should be chosen to satisfy $4 \Delta x < \lambda_D$. The phenomenon of heating occurs in simulations which do not adequately resolve this scale; in such cases, the system will increase the temperature of the plasma until it becomes adequately resolved on the given mesh. Fully-implicit methods, which break this restriction [9], allow a substantially coarser mesh to be used for a given calculation and will be the subject of future work.

The setup for this problem is slightly different from the two-stream example discussed earlier. Here, we provide, as input, a Debye length $\lambda_D$ as well as a normalized speed of light $\kappa$, which can be used to calculate the average number density $\bar{n}$ and macroscopic temperature $\bar{T}$ for the plasma.
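The resolution rules of thumb discussed above are easy to encode as a pre-flight check. The sketch below is our own; in particular, the threshold on radians of plasma oscillation per step is an assumed value, not one taken from the text.

```python
def explicit_pic_checks(dx, dt, lambda_D, omega_pe, v_max):
    """Heuristic resolution checks for explicit PIC: Debye resolution
    (4 dx < lambda_D) and a particle CFL number below one, as discussed
    in the text. The plasma-period threshold (0.2 rad per step) is an
    assumed value for illustration."""
    return {
        "debye_resolved": 4.0 * dx < lambda_D,
        "plasma_period_resolved": omega_pe * dt < 0.2,
        "particle_cfl_ok": v_max * dt / dx < 1.0,
    }
```

A run that fails the Debye check is exactly the regime probed by this heating study: the simulation still executes, but grid heating raises the temperature until the Debye length becomes resolved.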
The remaining parameters related to the plasma can be derived from these values and are shown in Table 4.4.

Figure 4.11: Initial electron data in phase space used for the numerical heating tests.

The non-dimensional grid used in this problem is taken to be periodic on the interval $[-25, 25]$. This grid is refined by successively doubling the number of mesh points from 16 to 256. In each case, the simulation uses $5 \times 10^5$ time steps to reach the non-dimensional final time $T_f = 1 \times 10^3$. A total of $5 \times 10^3$ particles are used for each species in the simulation, which consist of ions and electrons. As before, we assume that the ions remain stationary, since they are heavier than the electrons. Electrons are given uniform positions in space, and their velocities are initialized by sampling from the standard normal distribution; there is no drift velocity in this problem. This non-dimensional standard normal distribution corresponds to a Maxwellian distribution with mean zero and temperature $\bar{T}$. To ensure consistency across the runs, we seed the random number generator. A plot of the electrons in phase space at the initial condition is displayed in Figure 4.11.

In order to monitor heating during the simulations, we track the time history of the variance of the electron velocities, which is connected to the temperature of a Maxwellian distribution. Note that the variance data at $N + 1$ time levels $\left\{ \mathrm{var}(v^n) \right\}_{n=0}^{N}$ can be converted to a temperature history using
\[
\left\{ \bar{T}^n \right\}_{n=0}^{N} = \frac{m_e}{k_B} \left\{ \mathrm{var}(v^n) \right\}_{n=0}^{N}.
\]
The results of our heating study can be found in Figures 4.12 and 4.13, which correspond to the Poisson and wave models for the potential, respectively. In the case of the Poisson model, identical heating properties are observed when particles are integrated with either leapfrog or the averaged version of the Molei Tao integrator.
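The variance-to-temperature relation above can be sketched as follows. This helper is our own illustration; the defaults $m_e = k_B = 1$ reflect the non-dimensional setting, and dimensional constants can be passed in when needed.

```python
def temperature_history(velocity_snapshots, m_e=1.0, k_B=1.0):
    """Convert per-step electron velocity samples into a temperature
    history via T^n = (m_e / k_B) * var(v^n). Each entry of
    velocity_snapshots is the list of electron velocities at one step."""
    history = []
    for v in velocity_snapshots:
        mean = sum(v) / len(v)
        var = sum((vi - mean) ** 2 for vi in v) / len(v)  # population variance
        history.append(m_e * var / k_B)
    return history
```

Monitoring this history over the run is what reveals grid heating: on an under-resolved mesh the sequence drifts upward instead of fluctuating about the initial temperature.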
The degree of heating becomes noticeably larger as the grid is coarsened.

Figure 4.12: We present results from the numerical heating tests based on the Poisson model. Plots show the average electron temperature as a function of the number of angular plasma periods using leapfrog (left) and the second-order integrator by Molei Tao with averaging (right). Fields and their derivatives are obtained using the FFT.

Similar behaviors are observed in the wave model for the potential, with some caveats. Firstly, we observe a less substantial degree of heating in cases where the Debye length is underresolved, due to the finite speed of propagation in the wave model. Additionally, we see that two of the configurations, specifically the ones which combine Molei Tao with the time-centered method, become unstable in time; however, the same approaches with leapfrog behave as expected. A likely source of the problem is the time averaging applied to the source terms in the Molei Tao method, which is not applied in other methods that use either the Molei Tao integrator or the time-centered scheme for the scalar potential.

Figure 4.13: We display results from the numerical heating tests that use the wave model for the potentials. Plots show the average electron temperature as a function of the number of angular plasma periods using leapfrog (top) and the second-order integrator by Molei Tao with averaging (bottom). We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials and derivatives are computed with the scheme labels provided in the individual captions.
4.6.4 The Bennett Equilibrium Pinch

To benchmark the performance of our method on electromagnetic problems, we first consider the Bennett equilibrium pinch, named after its discoverer W. H. Bennett, who first analyzed the problem [81]. In that work, Bennett constructed a steady-state solution of the ideal MHD equations in cylindrical coordinates. Electrons are modeled as a fluid that drifts along the $z$-direction, creating currents that generate magnetic fields with $x$ and $y$ components that confine or "squeeze" the plasma towards the axis of the cylinder. The fluid velocity along the axis of the cylinder is carefully chosen to create a proper equilibrium balance between the plasma and the confining magnetic field that surrounds it.

Our particle simulation of the Bennett pinch closely follows the description provided in chapter 13 of Bittencourt [82]. We consider the motion of electrons inside a cross-section of a cylindrical column of plasma of radius $R_b$ and suppose that the axis of the beam is centered at the origin of a bounding box $\Omega = [-2R_b, 2R_b] \times [-2R_b, 2R_b]$. Let $\mathbf{J} = \left( J^{(1)}, J^{(2)}, J^{(3)} \right)$ denote the components of the current density. In the $x$-$y$ plane, the particles are sampled from a Maxwellian distribution, so we ignore the contributions from $J^{(1)}$ and $J^{(2)}$ because they carry zero net current. Since the particles drift along the axis of the beam, we retain the third component of the current density, $J^{(3)}$. Consequently, we ignore the wave equations for $A^{(1)}$ and $A^{(2)}$, choosing to retain only $A^{(3)}$. Ions are spatially distributed by sampling the same distribution as the electrons and are assumed to be stationary within the cross-section, since their mass is taken to be much larger than that of the electrons.
If we denote the components of the position and momentum vectors for particle $i$ as $\mathbf{x}_i \equiv \left( x_i^{(1)}, x_i^{(2)}, x_i^{(3)} \right)$ and $\mathbf{P}_i \equiv \left( P_i^{(1)}, P_i^{(2)}, P_i^{(3)} \right)$, respectively, the equations for the motion of each particle assume the form
\[
\frac{dx_i^{(1)}}{dt} = \frac{1}{r_i} P_i^{(1)}, \qquad
\frac{dx_i^{(2)}}{dt} = \frac{1}{r_i} P_i^{(2)},
\]
\[
\frac{dP_i^{(1)}}{dt} = -q_i \partial_x \psi + \frac{q_i}{r_i} \partial_x A^{(3)} \left( P_i^{(3)} - q_i A^{(3)} \right), \qquad
\frac{dP_i^{(2)}}{dt} = -q_i \partial_y \psi + \frac{q_i}{r_i} \partial_y A^{(3)} \left( P_i^{(3)} - q_i A^{(3)} \right).
\]
Note that while we retain the last component of the momentum, we do not include an equation for $P_i^{(3)}$, since each of its terms involves $z$-derivatives, which are ignored. In a more realistic simulation, we would retain the full 3D-3P system to monitor changes in $P_i^{(3)}$. That said, this test is still useful because it serves as a benchmark to assess the quality of particle confinement inside the beam.

The particle equations of motion for this problem can also be formulated in terms of $\mathbf{E}$ and $\mathbf{B}$ and solved with the Boris method discussed in section 4.4.2. Given the potentials $\psi$ and $\mathbf{A}$, we can calculate the $\mathbf{E}$ and $\mathbf{B}$ fields using (1.11):
\[
\mathbf{E} = -\nabla \psi - \partial_t \mathbf{A}, \qquad \mathbf{B} = \nabla \times \mathbf{A}.
\]
These can be used to evolve particles through the non-dimensional model
\[
\frac{dx_i^{(1)}}{dt} = v_i^{(1)}, \qquad
\frac{dx_i^{(2)}}{dt} = v_i^{(2)},
\]
\[
\frac{dv_i^{(1)}}{dt} = \frac{q_i}{r_i} \left( E^{(1)} - v_i^{(3)} B^{(2)} \right), \qquad
\frac{dv_i^{(2)}}{dt} = \frac{q_i}{r_i} \left( E^{(2)} + v_i^{(3)} B^{(1)} \right), \qquad
\frac{dv_i^{(3)}}{dt} = \frac{q_i}{r_i} \left( v_i^{(1)} B^{(2)} - v_i^{(2)} B^{(1)} \right),
\]
where the last equation for the velocity is neglected so that this form is consistent with the generalized momentum formulation.

Parameter | Value
Beam radius ($R_b$) [m] | $1.0 \times 10^{-6}$
Average number density ($\bar{n}$) [m$^{-3}$] | $4.391989 \times 10^{19}$
Average temperature ($\bar{T}$) [K] | $5.929245 \times 10^{1}$
Debye length ($\lambda_D$) [m] | $8.019042 \times 10^{-8}$
Inverse angular plasma frequency ($\omega_{pe}^{-1}$) [s/rad] | $2.674864 \times 10^{-12}$
Thermal velocity ($v_{th}$) [m/s] | $2.997925 \times 10^{4}$
Electron drift velocity ($v_{\mathrm{drift}}$) [m/s] | $2.997925 \times 10^{7}$
Fraction of particles contained in the beam ($\alpha$) [non-dimensional] | $0.99$

Table 4.5: Table of the parameters used in the setup for the Bennett pinch problem.
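The Boris velocity update referenced above follows the classic kick-rotate-kick pattern. The non-relativistic sketch below is our own illustration and omits the position update and the field gather; only the velocity rotation and electric kicks are shown.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def boris_velocity_update(v, E, B, q, m, dt):
    """One non-relativistic Boris velocity update: half electric kick,
    exact-norm rotation about B, then a second half electric kick."""
    h = q * dt / (2.0 * m)
    v_minus = [v[i] + h * E[i] for i in range(3)]      # first half kick
    t = [h * B[i] for i in range(3)]                   # rotation vector
    t2 = sum(ti * ti for ti in t)
    s = [2.0 * ti / (1.0 + t2) for ti in t]
    vxt = cross(v_minus, t)
    v_prime = [v_minus[i] + vxt[i] for i in range(3)]
    vxs = cross(v_prime, s)
    v_plus = [v_minus[i] + vxs[i] for i in range(3)]   # rotated velocity
    return [v_plus[i] + h * E[i] for i in range(3)]    # second half kick
```

Because the magnetic step is a pure rotation, the kinetic energy is conserved exactly when $\mathbf{E} = 0$, which is the property that makes the Boris scheme so robust for gyromotion.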
Particles that leave the domain due to inadequate confinement by the fields are prescribed new positions by sampling the initial distribution, which essentially re-injects them into the beam. The velocity and momenta are left unchanged in an effort to keep the total current density constant. Next, we provide the experimental parameters for the simulations along with details regarding the initialization procedure for the problem.

During the initialization and time evolution phases of the simulation, we use an analytical solution for the toroidal magnetic field to verify the correctness of the numerical fields, which maintain the steady-state. To derive the analytical solution for the toroidal magnetic field $B^{(\theta)}$, we solve the differential equation (equation 13.3.5 in [82])
\[
\frac{d}{dr} \left( r B^{(\theta)} \right) = \mu_0 q_e v_e^{(3)} r \, n(r),
\]
where $\mu_0$ is the permeability of free space, $q_e$ is the electron charge, $v_e^{(3)}$ is the $z$-component of the electron velocity, and $n(r)$ is the Bennett distribution for the electrons (equation 13.3.8 in [82]), given as
\[
n(r) = \frac{n_0}{\left( 1 + n_0 b r^2 \right)^2}. \tag{4.34}
\]
The on-axis number density $n_0$ is calculated according to equation 13.3.14 of [82],
\[
n_0 = \frac{1}{b R_b^2} \left( \frac{\alpha}{1 - \alpha} \right),
\]
where $\alpha \in [0, 1)$ is a parameter specifying the fraction of particles in the cross-section contained within the beam. We use the constant $b$, whose form is given in equation 13.3.9 of [82], namely
\[
b = \frac{\mu_0 \left( q_e v_e^{(3)} \right)^2}{8 k_B \bar{T}}.
\]
Note that the above expression is equivalent to assuming that the ions are cold, so that $T_i = 0$ and $T_e = \bar{T}$. To solve this differential equation, we integrate both sides from $0$ to $r$:
\[
\int_0^r \frac{d}{dr'} \left( r' B^{(\theta)}(r') \right) dr' = \mu_0 q_e v_e^{(3)} \int_0^r \frac{n_0 r'}{\left( 1 + n_0 b r'^2 \right)^2} \, dr'.
\]
The left side simplifies to $r B^{(\theta)}(r)$, and the right side can be evaluated using the substitution $\tau = 1 + n_0 b r'^2$, which yields
\[
r B^{(\theta)}(r) = \frac{\mu_0 q_e v_e^{(3)}}{2b} \int_1^{1 + n_0 b r^2} \tau^{-2} \, d\tau
= \frac{\mu_0 q_e v_e^{(3)}}{2b} \left( 1 - \frac{1}{1 + n_0 b r^2} \right)
= \frac{\mu_0 q_e v_e^{(3)}}{2} \frac{n_0 r^2}{1 + n_0 b r^2}.
\]
Hence, the analytical solution for the toroidal magnetic field is
\[
B^{(\theta)}(r) = \frac{\mu_0 q_e v_e^{(3)}}{2} \frac{n_0 r}{1 + n_0 b r^2}. \tag{4.35}
\]
Numerically, we solve the problem in a Cartesian coordinate system instead of a cylindrical coordinate system. Therefore, we will have to convert $B^{(1)} \equiv B^{(x)}$ and $B^{(2)} \equiv B^{(y)}$ to $B^{(\theta)}$. The transformation that converts between these coordinate systems is
\[
\begin{pmatrix} B^{(r)} \\ B^{(\theta)} \\ B^{(z)} \end{pmatrix}
=
\begin{pmatrix} \cos(\theta) & \sin(\theta) & 0 \\ -\sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} B^{(x)} \\ B^{(y)} \\ B^{(z)} \end{pmatrix}.
\]
Note that this transformation can be further simplified with $\cos(\theta) = \frac{x}{r}$ and $\sin(\theta) = \frac{y}{r}$, for $r > 0$, so that we obtain for the $\theta$-component
\[
B^{(\theta)}(r) = -\frac{y}{r} B^{(x)} + \frac{x}{r} B^{(y)}, \qquad r = \sqrt{x^2 + y^2} > 0.
\]
For the case of $r = 0$, we can appeal to the analytical solution (4.35), so that $B^{(\theta)}(0) = 0$.

The setup of our PIC simulation requires an average, or macroscopic, number density $\bar{n}$. This can be obtained from a variation of equation 13.3.11 in [82]:
\[
\bar{n} = \frac{2\pi}{16 R_b^2} \int_0^{R_b} n(r) \, r \, dr = \frac{n_0 \pi R_b^2}{16 R_b^2 \left( 1 + n_0 b R_b^2 \right)}, \tag{4.36}
\]
where we have used the definition (4.34). A summary of the plasma parameters used in the simulation of the Bennett pinch is presented in Table 4.5.

Next, we describe the initialization procedure for the fields $\psi$ and $\mathbf{A}$ that recovers the steady-state solution with enough accuracy to achieve particle confinement. This problem is defined over free-space, so we prescribe outflow boundary conditions along the boundary of the box that encloses the beam. To initialize the steady-state with the wave solvers, we first set $\psi = 0$ and $\mathbf{A} = 0$, and then proceed by stepping the corresponding wave equations to steady-state, holding the sources $\rho$ and $\mathbf{J}$ fixed. Although it is more expensive than direct evaluation of the free-space integral solution, this technique allows for a self-consistent initialization of the data used by the outflow procedure for this problem.
Further, since the data is stepped to steady-state, this approach is general enough that it can also be used to initialize both the Poisson and wave models. In Figure 4.14, we show plots of the steady-state toroidal magnetic field obtained with this initialization procedure using a 128 × 128 mesh. The errors in the initial condition are primarily concentrated along the edge of the simulation domain and near the origin. Along the domain boundary, the errors are primarily due to the explicit outflow procedure. Errors occurring near the origin are due to the steep gradients created by the beam, which can be corrected by using a finer mesh. As a first test, we compare the Molei Tao integrator with the Boris method for the true steady-state problem, which is described by the elliptic system
\[
-\Delta \psi = \frac{1}{\sigma_1} \rho, \qquad -\Delta A^{(3)} = \sigma_2 J^{(3)},
\]
which is defined over free-space.

Figure 4.14: Initialization of the steady-state toroidal magnetic field in the Bennett problem computed with the BDF-2 wave solver after 1000 steps against a fixed current density. The derivatives of the vector potential 𝐴^(3) are also obtained with the BDF-2 method.

An approach based on Green's functions is quite natural, as the free-space solution of this decoupled elliptic system requires the evaluation of the volume integrals
\[
\psi(\mathbf{x}) = -\frac{1}{2\pi\sigma_1} \int_\Omega \ln\left( \|\mathbf{x} - \mathbf{x}'\| \right) \rho(\mathbf{x}') \, d\mathbf{x}',
\qquad
A^{(3)}(\mathbf{x}) = -\frac{\sigma_2}{2\pi} \int_\Omega \ln\left( \|\mathbf{x} - \mathbf{x}'\| \right) J^{(3)}(\mathbf{x}') \, d\mathbf{x}',
\]
with ‖·‖ being the usual Euclidean distance. The evaluation of these integrals can be performed efficiently with a fast summation method such as a tree-code [83] or a fast multipole method [48], which is beyond the scope of the present work. Instead, we solve this elliptic system using second-order finite-differences with a sparse direct solver that applies Dirichlet boundary conditions. Derivatives of the fields are computed with second-order finite-differences.
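For reference, a direct O(N²) evaluation of the logarithmic volume integral for 𝜓 looks like the following; this is the brute-force sum that a tree-code or FMM would accelerate. The source positions and weights are arbitrary test data, not values from the experiment.

```python
import math

def psi_direct(targets, sources, weights, sigma1=1.0):
    """psi(x) = -1/(2 pi sigma1) * sum_j ln|x - x'_j| * rho_j * dA_j.

    `weights` holds rho(x'_j) * dA_j, the charge carried by each source cell.
    Self-interactions (zero distance) are skipped, as in a point-mass sum.
    """
    out = []
    for (x, y) in targets:
        acc = 0.0
        for (xs, ys), w in zip(sources, weights):
            d = math.hypot(x - xs, y - ys)
            if d > 0.0:
                acc += math.log(d) * w
        out.append(-acc / (2.0 * math.pi * sigma1))
    return out

# A single unit charge at the origin reproduces -ln(r) / (2 pi sigma1).
vals = psi_direct([(1.0, 0.0), (2.0, 0.0)], [(0.0, 0.0)], [1.0])
assert abs(vals[0]) < 1e-14                              # ln(1) = 0
assert abs(vals[1] + math.log(2.0) / (2 * math.pi)) < 1e-14
```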
The Dirichlet data used for the elliptic solves is supplied by evolving the wave equations for the potentials with the BDF-2 method using outflow boundary conditions. While there are many possible approaches to this problem, it is important that this data be updated at each time step so that the boundary data is consistent with the source. We applied both the Boris and Tao solvers for 50 thermal crossings using a total of 1 × 10^6 steps, which gives a CFL ≈ 15.90. For each species, i.e., ions and electrons, we used 102,400 particles. In Figure B.1, we show the state of the beam and corresponding fields after 50 thermal crossings obtained with the Boris method. We observe satisfactory preservation of the steady-state fields for the problem. We note that the CFL for this test is quite large for the second-order wave solver; however, we only use the data from the wave solver along the boundary of the simulation domain. In these regions the fields are mostly flat, so this is an acceptable approximation, even with a diffusive solver. We also observe good agreement with the analytical solution around the axis of the beam, specifically in capturing the wells in the field. Results obtained with the Molei Tao integrator, using 𝜔 = 500, are shown in Figure B.2. The fields, which are shown before the final time, demonstrate a loss of the steady-state attributable to an excess of particles escaping from the beam. This causes the current density to spread outwards, resulting in changes in the potentials used to calculate the fields. We have not determined the source of this issue in the Tao method; however, given the other issues encountered with the approach, we ultimately decided that this integrator was no longer a viable option for experimentation and chose not to pursue it further.
Allowing for time variation in the fields requires that we now solve a system of two wave equations for the potentials,
\[
\frac{1}{\kappa^2} \frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1} \rho,
\qquad
\frac{1}{\kappa^2} \frac{\partial^2 A^{(3)}}{\partial t^2} - \Delta A^{(3)} = \sigma_2 J^{(3)},
\]
subject to outflow boundary conditions. In the Boris method, all derivatives for 𝜓 and 𝐴^(3), as well as 𝜓 itself, are computed with the BDF-2 scheme; however, the vector potential 𝐴^(3) is evolved explicitly using the central-2 method. To obtain the current density J^{𝑛+1}, which is required for the derivatives obtained with BDF-2, we use the iterative approach presented in section 4.4.2, which uses a Taylor expansion to create an initial guess for the current density. A total of 5 iterates are used in each time step. We ran the solver for 35 thermal crossings using a total of 3.5 × 10^6 steps, which gives a CFL ≈ 3.18. For each of the particle species, i.e., ions and electrons, we used 102,400 particles. In Figure B.3, we show the state of the beam and corresponding fields after 35 thermal crossings obtained with the Boris method. There is some slight dissipation in the regions surrounding the axis of the beam, which is caused by the BDF method. As mentioned earlier, the use of a finer mesh would improve the quality of the solution in these regions. Along the boundary, we can see a discrepancy with the analytical solution due to inaccuracies in the treatment of outflow boundary conditions by the wave solver. Future work will seek corrections to this behavior so that the fields and their derivatives will be more accurate along the boundary. Despite these slight inaccuracies, we observe satisfactory preservation of the steady-state fields for the problem.

4.6.5 The Expanding Beam Problem

We now apply the proposed methods to the expanding beam test problem [24]. This example is well-known for its sensitivity to issues concerning charge conservation, which makes it well-suited for evaluating methods used to enforce gauge conditions and involutions.
While this particular example is normally solved in cylindrical coordinates, our simulation is performed using a two-dimensional rectangular box that retains the fields 𝐸^(1) and 𝐸^(2), as well as 𝐵^(3). An injection zone, which is placed on one of the faces of the box, injects a steady beam of particles into the domain. The beam expands as particles move along the box due to the electric field and eventually settles into a steady-state. Particles are absorbed or "collected" once they reach the edge of the domain and are removed from the simulation. Based on the results of the previous test, we decided to abandon the Molei Tao integrator for this example; however, we recently became aware of another particle integrator that can be used to evolve the generalized momentum formulation used in this work. In [36], a semi-implicit Euler discretization was used in a mesh-free simulation of the VM system, in the Darwin limit. To avoid time derivatives of the potentials, the particle equations were cast in terms of a generalized momentum. This method, while only first-order accurate, is simple to implement, efficient, and, based on the results in [36], surprisingly accurate. Moreover, they point out that this approach can likely be generalized to obtain high-order extensions. Note that an overview of this integrator was presented in section 4.5.2. The generalized momentum formulation for this problem evolves the particle equations
\[
\frac{dx_i^{(1)}}{dt} = \frac{1}{r_i}\left( P_i^{(1)} - q_i A^{(1)} \right),
\qquad
\frac{dx_i^{(2)}}{dt} = \frac{1}{r_i}\left( P_i^{(2)} - q_i A^{(2)} \right),
\]
\[
\frac{dP_i^{(1)}}{dt} = -q_i \partial_x \psi + \frac{q_i}{r_i}\left[ \partial_x A^{(1)} \left( P_i^{(1)} - q_i A^{(1)} \right) + \partial_x A^{(2)} \left( P_i^{(2)} - q_i A^{(2)} \right) \right],
\]
\[
\frac{dP_i^{(2)}}{dt} = -q_i \partial_y \psi + \frac{q_i}{r_i}\left[ \partial_y A^{(1)} \left( P_i^{(1)} - q_i A^{(1)} \right) + \partial_y A^{(2)} \left( P_i^{(2)} - q_i A^{(2)} \right) \right].
\]
The formulation, shown above, is written in terms of scalar and vector potentials, which are obtained by solving Maxwell's equations in the Coulomb gauge.
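To illustrate the semi-implicit Euler update used for equations of this type, the sketch below steps a single particle in a static, hand-picked potential with A = 0 (so the generalized momentum reduces to 𝑟𝑣). This is a toy configuration with assumed values for 𝑞, 𝑟_𝑖, and 𝜓, not the solver's actual fields.

```python
q, r_i = 1.0, 1.0                 # toy charge and mass-like factor (assumed)
grad_psi = lambda x, y: (x, y)    # toy potential psi = (x^2 + y^2)/2

def step(x, y, px, py, dt):
    """Semi-implicit Euler: update the momentum first, then the position."""
    gx, gy = grad_psi(x, y)
    px -= dt * q * gx             # dP/dt = -q grad(psi) when A = 0
    py -= dt * q * gy
    x += dt * px / r_i            # dx/dt = (P - q A)/r with A = 0
    y += dt * py / r_i
    return x, y, px, py

def energy(x, y, px, py):
    return 0.5 * (px**2 + py**2) / r_i + q * 0.5 * (x**2 + y**2)

state = (1.0, 0.0, 0.0, 1.0)
e0 = energy(*state)
for _ in range(10000):
    state = step(*state, dt=1e-3)
# The semi-implicit update keeps the energy error bounded rather than drifting.
assert abs(energy(*state) - e0) / e0 < 1e-2
```

The momentum-then-position ordering is what makes the scheme semi-implicit; updating both with the old state (standard explicit Euler) produces secular energy growth on this same test.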
As shown in section 1.2.2.2, the complete system in the Coulomb gauge (1.45)-(1.47) can be simplified to obtain an equivalent system in which the vector potential A is purely rotational. The system in this form is given by
\[
-\Delta \psi = \frac{1}{\sigma_1} \rho, \tag{4.37}
\]
\[
\frac{1}{\kappa^2} \frac{\partial^2 A^{(1)}}{\partial t^2} - \Delta A^{(1)} = \sigma_2 J^{(1)}_{\text{rot}}, \tag{4.38}
\]
\[
\frac{1}{\kappa^2} \frac{\partial^2 A^{(2)}}{\partial t^2} - \Delta A^{(2)} = \sigma_2 J^{(2)}_{\text{rot}}. \tag{4.39}
\]
Note that we have abused notation here, since the vector potentials 𝐴^(1) and 𝐴^(2), as written above, are actually the rotational components of A. Along the boundary of the domain, the electric and magnetic fields are prescribed perfectly electrically conducting (PEC) boundary conditions, which, in two spatial dimensions, is equivalent to enforcing homogeneous Dirichlet boundary conditions on the potentials 𝜓, 𝐴^(1), and 𝐴^(2). The rotational part of the current density is obtained by solving the elliptic equation
\[
-\Delta \eta = \partial_x J^{(1)} + \partial_y J^{(2)}, \tag{4.40}
\]
which is then used to adjust the current density according to
\[
J^{(1)}_{\text{rot}} = J^{(1)} + \partial_x \eta, \tag{4.41}
\]
\[
J^{(2)}_{\text{rot}} = J^{(2)} + \partial_y \eta. \tag{4.42}
\]
Since the problem is PEC, there can be no currents or charges on the boundary. This suggests we enforce homogeneous Neumann boundary conditions for equation (4.40). Despite the projection (4.40)-(4.42) used for the current density, irrotational components of the vector potential can be introduced through the discretization of the wave equations (4.38) and (4.39). After solving the wave equations, we extract the rotational components of the vector potential by first solving the elliptic equation
\[
-\Delta \xi = \partial_x A^{(1)} + \partial_y A^{(2)}. \tag{4.43}
\]
In the second step, the gradient of 𝜉 is used to remove the irrotational parts of the potential via
\[
A^{(1)}_{\text{rot}} = A^{(1)} + \partial_x \xi, \tag{4.44}
\]
\[
A^{(2)}_{\text{rot}} = A^{(2)} + \partial_y \xi. \tag{4.45}
\]
As in the projection step for the current, we solve the elliptic equation (4.43) using homogeneous Neumann boundary conditions.
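The projection in (4.40)-(4.42) can be prototyped on a small grid. The sketch below is a simplified stand-in: it uses a periodic domain and a Gauss-Seidel Poisson solve purely for illustration (the thesis uses homogeneous Neumann conditions and a sparse direct solver), and the backward-difference divergence with a forward-difference gradient is chosen so that their composition is exactly the five-point Laplacian, letting the cleaning cancel the divergence to solver tolerance.

```python
import math

n, h = 16, 1.0 / 16  # small periodic grid, purely illustrative

def div(j1, j2):
    """Backward-difference divergence on a periodic grid."""
    return [[(j1[i][k] - j1[i - 1][k]) / h + (j2[i][k] - j2[i][k - 1]) / h
             for k in range(n)] for i in range(n)]

def clean(j1, j2, sweeps=1000):
    """Solve -Lap(eta) = div J with Gauss-Seidel, then set J += grad(eta)."""
    d = div(j1, j2)
    eta = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            for k in range(n):
                nb = (eta[(i + 1) % n][k] + eta[i - 1][k]
                      + eta[i][(k + 1) % n] + eta[i][k - 1])
                eta[i][k] = (nb + h * h * d[i][k]) / 4.0
    # Forward-difference gradient: the adjoint-compatible choice.
    for i in range(n):
        for k in range(n):
            j1[i][k] += (eta[(i + 1) % n][k] - eta[i][k]) / h
            j2[i][k] += (eta[i][(k + 1) % n] - eta[i][k]) / h
    return j1, j2

# A deliberately non-solenoidal current field.
j1 = [[math.sin(2 * math.pi * i * h) for k in range(n)] for i in range(n)]
j2 = [[math.cos(2 * math.pi * k * h) for k in range(n)] for i in range(n)]
before = max(abs(v) for row in div(j1, j2) for v in row)
j1, j2 = clean(j1, j2)
after = max(abs(v) for row in div(j1, j2) for v in row)
assert after < 1e-6 * before
```

The same two-step pattern (Poisson solve for a correction potential, then a gradient update) implements the vector potential cleaning (4.43)-(4.45).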
The sequence of corrections described by (4.43)-(4.45) is identical to the elliptic divergence cleaning method discussed in section 4.3.4 to enforce the gauge condition. As with the Bennett problem, we shall compare results obtained using the generalized momentum formulation with the formulation that employs the Boris method, in which the equations of motion for the particles are expressed in terms of E and B and take the form
\[
\frac{dx_i^{(1)}}{dt} = v_i^{(1)},
\qquad
\frac{dx_i^{(2)}}{dt} = v_i^{(2)},
\]
\[
\frac{dv_i^{(1)}}{dt} = \frac{q_i}{r_i}\left( E^{(1)} + v_i^{(2)} B^{(3)} \right),
\qquad
\frac{dv_i^{(2)}}{dt} = \frac{q_i}{r_i}\left( E^{(2)} - v_i^{(1)} B^{(3)} \right).
\]
A key difference with the generalized momentum formulation is that the fields in the Boris method use the Lorenz gauge, rather than the Coulomb gauge. Therefore, the Boris approach requires the fields E and B, which are obtained by solving the system
\[
\frac{1}{\kappa^2} \frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1} \rho,
\]
\[
\frac{1}{\kappa^2} \frac{\partial^2 A^{(1)}}{\partial t^2} - \Delta A^{(1)} = \sigma_2 J^{(1)},
\]
\[
\frac{1}{\kappa^2} \frac{\partial^2 A^{(2)}}{\partial t^2} - \Delta A^{(2)} = \sigma_2 J^{(2)}.
\]
Using equation (1.11), the potentials 𝜓, 𝐴^(1), and 𝐴^(2) can be used to obtain E and B for the particle updates. At the present time, we do not have a working method for enforcing the Lorenz gauge condition, so a cleaning method is not used in this approach. The formulation involving the Boris push could certainly be modified to work with the Coulomb gauge rather than the Lorenz gauge condition, and this will be explored in later work. To set up the simulation, we first create a box specified by the region [0, 1] × [−1/2, 1/2], which has been normalized by some length scale 𝐿. We shall further assume that the beam consists only of electrons, which are prescribed some injection velocity 𝑣_injection and travel along the 𝑥-axis of the box. An estimate of the crossing time for a particle can be obtained using the injection velocity and the length of the domain, which sets the time scale 𝑇 for the simulation. The duration of the simulation is given in terms of particle crossings, which are then used to set the time step Δ𝑡.
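A minimal sketch of one Boris velocity update for the 2D equations above follows (half electric kick, magnetic rotation, half kick). The values of 𝑞/𝑚, the time step, and the field samples are placeholders; the check exploits the fact that the Boris rotation preserves the particle speed exactly when E = 0.

```python
import math

def boris_step_2d(vx, vy, ex, ey, b3, q_over_m, dt):
    """One Boris velocity update in 2D with out-of-plane magnetic field b3."""
    # Half acceleration by E.
    vx += 0.5 * dt * q_over_m * ex
    vy += 0.5 * dt * q_over_m * ey
    # Rotation about the z-axis due to the magnetic field.
    t = 0.5 * dt * q_over_m * b3
    s = 2.0 * t / (1.0 + t * t)
    vpx = vx + vy * t          # v' = v- + v- x t  (t points along z)
    vpy = vy - vx * t
    vx += vpy * s              # v+ = v- + v' x s
    vy -= vpx * s
    # Second half acceleration by E.
    vx += 0.5 * dt * q_over_m * ex
    vy += 0.5 * dt * q_over_m * ey
    return vx, vy

# With E = 0 the Boris rotation preserves the particle speed exactly.
vx, vy = 1.0, 0.5
speed0 = math.hypot(vx, vy)
for _ in range(1000):
    vx, vy = boris_step_2d(vx, vy, 0.0, 0.0, 2.0, -1.0, 0.05)
assert abs(math.hypot(vx, vy) - speed0) < 1e-10
```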
At each time step, particles are initialized in an injection region specified by [−𝐿_ghost, 0) × [−𝑅_𝑏, 𝑅_𝑏], where 𝑅_𝑏 is the radius of the beam, and the width of the injection zone 𝐿_ghost is chosen such that 𝐿_ghost = 𝑣_injection Δ𝑡. This ensures that all particles initialized in the injection zone will be in the domain after one time step. Particle positions in the injection region are set according to samples taken from a uniform distribution, and the number of particles injected for a given time step is set by the injection rate. In each time step, the injection procedure is applied before the particle position update, so that, at the end of the time step, the injection zone is empty. To prevent the introduction of an impulse response in the fields due to the initial injection of particles, we apply a linear ramp function to the macro-particle weights whose duration is one particle crossing. A summary of the parameters used to set up the problem is presented in Table 4.6.

Parameter | Value
Beam radius (𝑅_𝑏) [m] | 8.0 × 10⁻³
Average number density (𝑛̄) [m⁻³] | 7.8025 × 10¹⁴
Largest box dimension (𝐿) [m] | 1.0 × 10⁻¹
Electron injection velocity (𝑣_injection) [m/s] | 5.0 × 10⁷
Electron crossing time (𝑇) [s] | 2.0 × 10⁻⁹
Electron macro-particle weight (𝑤_mp) [non-dimensional] | 1.67 × 10⁻⁵
Injection rate (per Δ𝑡) [1/s] | 10

Table 4.6: Table of the parameters used in the setup for the expanding beam problem.

At some point, the particles will reach the boundary and should be removed from the simulation. A list is a natural choice for managing the injection and deletion of particles, since particles can be easily added or removed. However, the types of problems we are considering require many particles, so having to constantly resize a list can become quite expensive. Moreover, implementations of lists are not cache-friendly, so we lose opportunities for vectorization.
Instead, we use arrays whose lengths are determined by (over)estimating the total number of particles in the domain at any given time. We estimate the number of particles by first calculating the number of time steps required for a particle to cross the box, which is then multiplied by the injection rate. Finally, we double this result for additional safety, since the beam will spread to some degree. We store a running total of the number of particles in the domain at any given time, which identifies the entries of the array to update, with entries beyond this value being considered "deleted." Consequently, at any given time step, we need to sort the particle arrays so that particles outside of the domain are placed after this counter variable. This sort step is performed by first creating a Boolean array which indicates whether each particle is outside the domain. A sorting method is then applied to the Boolean array, which is quite fast, as the data is mostly sorted. Then, the sorted Boolean array is used to remap the entries of the particle arrays. We first test the formulation which combines the Boris method with the fields in the Lorenz gauge (1.43). Initially, we applied the time-centered method to update the components of the vector potentials; however, the lack of dissipation in the time-centered approach led to noise in the fields due to dispersion error, which was further amplified by the application of the time derivative. Instead, we used the BDF-2 method, which is dissipative, to perform these calculations. The effect on the time derivatives, which is shown in Figure B.4, is quite apparent. We ran the simulation with this configuration to a final time of 1000 particle crossings. A mesh of 128 × 128 grid points was used in the calculation, which gave a CFL ≈ 0.761. The results for this experiment are shown in Figure B.5. The beam, itself, is surprisingly stable and does not display significant issues associated with violating the gauge condition.
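Returning briefly to the particle bookkeeping described above, the capacity estimate and the sort-and-remap step can be sketched as follows. This is a simplified stand-in using Python lists and a stable sort on the Boolean flags; the production code operates on preallocated arrays.

```python
def capacity(steps_per_crossing, injection_rate):
    """Overestimate of live particles: crossing steps x rate, doubled for safety."""
    return 2 * steps_per_crossing * injection_rate

def compact(xs, ys, alive_count, in_domain):
    """Stable partition: keep in-domain particles in the first `count` slots.

    `in_domain(x, y)` flags live particles; entries past the returned count
    are treated as deleted and may be overwritten by newly injected particles.
    """
    flags = [in_domain(xs[i], ys[i]) for i in range(alive_count)]
    # Stable sort on the Boolean key: in-domain entries first, in order.
    order = sorted(range(alive_count), key=lambda i: not flags[i])
    xs[:alive_count] = [xs[i] for i in order]
    ys[:alive_count] = [ys[i] for i in order]
    return sum(flags)  # new running total of in-domain particles

xs = [0.5, 1.2, 0.3, -0.1, 0.9]
ys = [0.0, 0.0, 0.0, 0.0, 0.0]
inside = lambda x, y: 0.0 <= x <= 1.0
count = compact(xs, ys, 5, inside)
assert count == 3
assert xs[:3] == [0.5, 0.3, 0.9]   # stable: relative order preserved
assert capacity(100, 10) == 2000
```

The stability of the sort is what keeps the surviving particles in their original relative order, so the remap disturbs memory locality as little as possible.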
Along the edge of the beam, we observe some small oscillations that will eventually grow over time, causing the beam to break apart. This is also reflected in the growth of the error in the Lorenz gauge toward the end of the run. In Figure B.6, we show the smooth potentials and their partial derivatives, which are constructed using the proposed BDF-2 wave solver. As mentioned earlier, we have not had success with enforcing the Lorenz gauge in this particular work, but it is something we plan to revisit in the future. Next, we test the Coulomb gauge formulation, which applies the AEM for time integration [36] (discussed in section 4.5.2). In this experiment, we ran the code for 3000 particle crossings without a cleaning technique to enforce the gauge condition. We used the same 128 × 128 mesh for the fields as in the Boris approach. The Poisson solves are performed using second-order finite-differences with a sparse linear solver. Derivatives of the data obtained from the Poisson solves are computed with second-order finite-differences. After 3000 particle crossings, the structure of the beam is largely destroyed due to violations in the gauge condition. The results presented in Figure B.7 show the beam at an earlier time, corresponding to 2000 crossings, at which point the striations and clumping in the beam are quite apparent. In Figure B.8, we show the beam after 3000 crossings, which uses the same approach but applies the cleaning procedure described by equations (4.43)-(4.45). The impact of the cleaning is quite remarkable, as the integrity of the beam is no longer compromised. The cleaning approach displays some violations in the gauge condition at the boundaries of the domain, which are expected because particles enter or leave the domain at these points. On the interior, the fluctuations are in the sixth decimal position, which can likely be improved through the use of a more accurate Poisson solver along with additional particles.
The smooth potentials and their derivatives, which were obtained with this formulation, are presented in Figure B.9. The derivatives used to evaluate the gauge condition show some jumps along the boundary where particles enter and leave, which we plan to investigate in greater detail. Comparing Figures B.6 and B.9, it is interesting to see the structural differences in the potentials (and their derivatives) obtained with the different formulations. As mentioned earlier, we have not yet constructed a functioning cleaning method for the Lorenz gauge formulation. In spite of this, we find the Lorenz gauge formulation to be quite appealing because it avoids the use of elliptic solvers. For this reason, we combined the AEM for the particles with a first-order BDF field solver. No cleaning method is used for the fields. We ran the simulation out to 3000 particle crossings on the same 128 × 128 mesh for the fields. The beam, which remains surprisingly intact without a cleaning method, is shown in Figure B.10a. The time trace of the Lorenz gauge error, which is displayed in Figure B.10b, shows some oscillations that appear to be bounded. The fields from the same experiment, at the final step, are presented in Figure B.11. The fields are quite similar to those obtained with the second-order BDF scheme combined with the Boris method presented in Figure B.6. Despite the low-order accuracy, it can be shown that a first-order time discretization of the fields is consistent with a discrete form of the Lorenz gauge (see B.1 for details). Furthermore, the low-order time accuracy of the fields does not introduce significant dissipation. While the goal of our work is to build higher-order field solvers for plasma applications, this result is interesting due to its practicality. More specifically, it demonstrates that it is possible to obtain a reasonable solution in an inexpensive manner. The last point we wish to mention concerns the metrics used to assess charge conservation.
One of the advantages of working with a gauge formulation is that the metrics for charge conservation are embedded in the gauge condition, which can be calculated using the derivatives of the potentials. Compare this with measurements based on Gauss' law, ∇ · E = 𝜌/𝜎_1, which utilizes particle data that may be under-resolved by the mesh. If either 𝜌 or the divergence term is not smooth, this could give the impression that the method is ineffective at conserving charge. As an example, in the last experiment, which enforced the Coulomb gauge through an elliptic method, we found that the error in the gauge condition was quite small, especially away from the boundary. For the same problem, the point-wise error in Gauss' law, which is shown in Figure B.12b, indicates that the method is not conserving charge, even in regions away from the boundaries. On the other hand, if we compute the residual
\[
\int_\Omega \left( \nabla \cdot \mathbf{E}(\mathbf{x}) - \frac{1}{\sigma_1} \rho(\mathbf{x}) \right) dV_{\mathbf{x}}
\approx \sum_{i,j} \left( \nabla \cdot \mathbf{E}(x_i, y_j) - \frac{1}{\sigma_1} \rho(x_i, y_j) \right) \Delta x_i \Delta y_j, \tag{4.46}
\]
then we can say whether or not the method conserves charge in a bulk sense. In the above definition, the divergence is interpreted in a discrete sense and the sum runs over the mesh points so that Δ𝑥_𝑖 Δ𝑦_𝑗 is the volume of the grid-cell (𝑖, 𝑗). In Figure B.12a, we plot the bulk error as a function of time, which shows far smaller violations and is symmetric about zero. The large violations in Gauss' law for the case of cleaning (shown in Figure B.12b) are likely the result of the treatment used for the divergence term. If we compare this with the same formulation that skips the cleaning step (shown in Figure B.13), then we can see that the point-wise violations occur on a much greater scale. A similar discrepancy was observed in [19], where the ℓ² error in Gauss' law was shown to be O(1), even when cleaning methods were applied.
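The bulk residual (4.46) takes only a few lines to compute. The sketch below uses central differences over the interior of a uniform grid and a manufactured field E = (x, y) with 𝜌 = 2𝜎₁, for which the residual should vanish to machine precision; the grid size and field values are illustrative.

```python
def bulk_gauss_residual(ex, ey, rho, dx, dy, sigma1=1.0):
    """Discrete form of (4.46): sum over interior points of (div E - rho/sigma1) dA."""
    nx, ny = len(ex), len(ex[0])
    total = 0.0
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            div = ((ex[i + 1][j] - ex[i - 1][j]) / (2 * dx)
                   + (ey[i][j + 1] - ey[i][j - 1]) / (2 * dy))
            total += (div - rho[i][j] / sigma1) * dx * dy
    return total

n, h = 32, 1.0 / 31
xs = [i * h for i in range(n)]
ex = [[xs[i] for j in range(n)] for i in range(n)]   # E = (x, y): div E = 2
ey = [[xs[j] for j in range(n)] for i in range(n)]
rho = [[2.0 for j in range(n)] for i in range(n)]
assert abs(bulk_gauss_residual(ex, ey, rho, h, h)) < 1e-12
```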
It is also important to note that if components of the electric field display non-smooth features, such as cusps or steep gradients, then finite-difference derivatives over modest stencils will show nonphysical oscillations, which directly impact the point-wise error in Gauss' law. One approach we have considered to deal with this is the application of WENO derivatives, which we plan to explore in future work.

4.6.6 A Narrow Beam Problem and the Effect of Particle Count

In this last test, we slightly modify our setup from the previous example in section 4.6.5 to construct a beam with a lower density, which will be used to conduct a certain type of refinement study. This modification primarily concerns the prescription of particle weights. In Table 4.6 for the previous problem, we provided a value for the number density 𝑛̄ in addition to a particle weight 𝑤_mp. The particle weighting from the previous example was obtained by essentially hard-coding an estimate for the number of simulation particles. While it is not incorrect to do this, the currents in the beam may become too large as the injection rate increases, causing particles to move back to the injection zone. To fix this problem, we compute the particle weight 𝑤_mp using an estimate for the number of particles in the beam that accounts for the injection rate. Since we discussed how to estimate this number in the previous section, we omit these details for brevity. Therefore, the particle weight no longer needs to be prescribed in the setup, and it naturally adjusts according to the injection rate specified by the user. This modification ultimately allows us to examine the effect of the particle count on the solution by fixing the number density and varying the particle injection rate. Aside from this modification, all other details, e.g., the models, injection procedure, etc., are identical to those provided in section 4.6.5, so we shall not describe them further.
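One way to realize this adaptive weighting is sketched below. The specific formula is our own assumption for illustration (physical particle count in the beam, per unit depth, divided by the injection-rate-based overestimate of simulation particles from the previous section); the thesis does not spell out the exact expression, and the inputs here are placeholder values.

```python
def particle_weight(n_bar, beam_area, steps_per_crossing, injection_rate):
    """Hypothetical weight estimate: physical particles per simulation particle.

    n_bar * beam_area stands in for the physical particle count (per unit
    depth); the denominator is the injection-rate-based overestimate of the
    number of simulation particles, so the weight adapts to the rate.
    """
    n_physical = n_bar * beam_area
    n_simulated = 2 * steps_per_crossing * injection_rate
    return n_physical / n_simulated

# Doubling the injection rate halves the weight, keeping total charge fixed.
w1 = particle_weight(1.55e14, 1.6e-3, 200, 100)
w2 = particle_weight(1.55e14, 1.6e-3, 200, 200)
assert abs(w1 - 2 * w2) < 1e-9 * w1
```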
We test the effect of an increased particle count on the numerical solution using the solver that combines the AEM for particles with the BDF-2 field solver in the Coulomb gauge, along with elliptic projections to enforce the gauge condition. We use the same 128 × 128 mesh as in the previous problem and an injection rate of 400 particles per time step. The remaining parameters are specified in Table 4.7.

Parameter | Value
Beam radius (𝑅_𝑏) [m] | 8.0 × 10⁻³
Average number density (𝑛̄) [m⁻³] | 1.552258 × 10¹⁴
Largest box dimension (𝐿) [m] | 1.0 × 10⁻¹
Electron injection velocity (𝑣_injection) [m/s] | 5.0 × 10⁷
Electron crossing time (𝑇) [s] | 2.0 × 10⁻⁹

Table 4.7: Table of the parameters used in the setup for the narrow beam problem.

We run the problem for a total of 5 particle crossings, which is sufficient for the refinement purposes of this problem, using the same CFL ≈ 0.761 as the previous problem. We plot the narrow beam and the gauge error in Figure B.14. The increased smoothness in the charge density due to the increase in the number of particles is apparent. Moreover, the error in the gauge condition seems to be quite small away from the boundary, where particles are injected and removed. The corresponding fields and derivatives, which are used to move the particles, are presented in Figure B.15. The fields appear to be smooth. We also plot the "bulk" error in Gauss' law as a function of time and show the point-wise error as a surface in Figure B.16. The bulk error shows a jump at the time that corresponds to the first crossing, at which point some particles begin to leave the domain. Shortly after this jump, the error settles. As before, the point-wise violations in Gauss' law seem to indicate a loss of charge conservation. For convenience, we show the derivatives of the electric field, which are used to measure the error in Gauss' law, in Figure B.17. We note that the derivative in the 𝑥-direction shows some oscillations in the interior of the
beam, and the 𝑦-derivative contains a mix of sharp and uniform features. Lastly, we show the effect of increasing the particle count on the gauge condition by considering injection rates of 100, 200, and 400 particles per time step. These results are presented in Figure B.18. While the error at the boundaries remains largely unchanged across the runs, there is a noticeable improvement in the error on the interior. Specifically, we see the more jagged features on the interior become smoother and smaller in size due to the increased particle count.

4.7 Conclusion

In this work, we developed a PIC method by coupling dimensionally-split integral equation solvers for the fields with standard and non-standard time integration methods for particles. After introducing the concepts of a general PIC method, we presented several approaches for enforcing gauges and charge conservation. We then introduced methods for integrating the particle equations of motion that are necessitated by the formulations considered in this work. The discussion was primarily focused on two methods designed for problems with non-separable Hamiltonians. While the time integration methods employed in this work are not new, the novel contribution of our work is that we demonstrated how existing methods for particles can effectively leverage the proposed field solvers in simulations of plasmas. This includes the construction of spatial derivatives, which can be obtained directly from the field solvers. To this end, we applied the proposed methods to several application problems involving beams. Results were compared with standard methods based on leapfrog time integration, and in nearly all examples, the proposed methods recovered similar behavior. The results not only validate the generalized momentum formulation, but also demonstrate the versatility and flexibility of the proposed field solvers in simulating plasma phenomena.
CHAPTER 5

CONCLUSION AND FUTURE DIRECTIONS

In this thesis, we have presented a collection of algorithms for evolving fields in plasmas with specific applications to the Vlasov-Maxwell system. Maxwell's equations are reformulated in terms of the Lorenz gauge, as well as the Coulomb gauge, to obtain systems involving wave equations. These wave equations are solved using the methods proposed in this work and are combined with a particle-in-cell method [2, 3] to simulate plasmas. This particle description of the Vlasov equation couples directly to the fields, which are solved using a mesh. We considered two formulations of the equations for the particles. First, a standard approach was presented, which is based on the Newton-Lorentz equations, while the other used a generalized Hamiltonian to write the particle equations in terms of the potentials (and their derivatives) used in the gauge formulation. The advantage offered by the generalized Hamiltonian framework is that it eliminates the need to compute time derivatives, which reduce the time accuracy of the fields and can lead to instabilities in certain limits [35, 36]. In the first part of this thesis, we developed and extended methods for scalar wave equations, which can be used to update the potentials in these formulations. Our developments are based on a class of algorithms known as the MOL𝑇, which combines a dimensional splitting technique with a one-dimensional integral equation method. The resulting methods are unconditionally stable, can address geometry, and are O(𝑁), where 𝑁 is the number of mesh points. Our work contributed methods to construct derivatives of the potentials for this class of dimensionally-split methods. These derivatives, which are used to evolve particles, were constructed directly from the data used in the one-dimensional integral solution. Consequently, the methods naturally inherit the speed, stability, and geometric flexibility offered by the base solver.
Moreover, we established, through refinement experiments, that these derivatives converge at the same rate, in both space and time, as the base method. We also presented a more systematic treatment of outflow boundary conditions for the second-order (in time) methods, including refinement studies, which were not presented in earlier work. While the outflow procedure used in this work is convergent, more work should be done to reduce the magnitude of the error and improve the rate of convergence. The core algorithms used in the MOL𝑇 and the related class of successive convolution methods were also explored in the context of high-performance computing environments. We developed a novel domain decomposition approach, which ultimately allowed the method to be used on distributed memory computing platforms. Shared memory algorithms were developed using the Kokkos performance portability library, which allows a user to write a single version of a code that can be executed on various computing devices with the architecture-dependent details being managed by the library. We optimized predominant loop structures in the code and settled on a blocking pattern that prescribed parallelism at multiple levels. Moreover, the proposed iteration pattern is flexible enough to work with shared memory features available on GPU systems. While the results indicated a high sensitivity to data locality, which is a feature of memory-bound algorithms, the methods were shown to be quite fast. Scaling experiments demonstrated that the proposed algorithms could sustain an update rate in excess of 2.5 × 10⁶ grid points per second, per physical core. On a shared memory system with 40 cores per node, this translates to an update rate of 1 × 10⁸ grid points per second, per node. This was true even for the largest experiment conducted in that same study, which used nearly 35 billion grid points.
We also presented particle-in-cell methods for the Vlasov-Maxwell system, which leveraged the methods for fields and derivatives developed in this work. We showed how to combine the proposed methods with standard and non-standard time integration methods for the particles and applied these methods to a variety of plasma test problems. The focus on beam problems is primarily motivated by the preference for particle methods over mesh-based discretizations, which are overly diffusive near the edge of the beam. Our results are generally encouraging and demonstrate the capabilities of the proposed field solvers in simulating plasma phenomena. Additionally, our results serve to validate the generalized Hamiltonian formulation, which will be the foundation of our future work. The results presented in this thesis suggest several interesting directions for future research. In terms of field solvers, the methods used for outflow boundary conditions should be reconsidered. This is especially true in the case of the more intricate methods based on successive convolution, which, with the current methods, require a fairly substantial amount of storage for the time history along the boundary. The success of the BDF methods suggests that higher-order approaches should be considered and analyzed in a rigorous fashion. In light of the stability issues associated with elongated time stencils, it may be worthwhile to construct higher-order methods using extrapolation or other correction techniques, which can simultaneously address the splitting error associated with multi-dimensional problems. Extensions of the parallel algorithms presented in this thesis should also be considered. First, the current approach should be evaluated on GPUs. A generalization of the decomposition, which extends beyond nearest-neighbors and eliminates the artificial CFL-like condition, should also be evaluated.
This would provide a clear path for addressing problems involving non-uniform mesh patches, which are frequently encountered in algorithms that support adaptivity. Future work should also evaluate other fast summation algorithms that are better aligned with the vectorization capabilities supported by new hardware. The research directions above also have implications for the development of new solvers for the Vlasov-Maxwell system. In the near term, we plan to revisit the formulation involving the Lorenz gauge, which avoids the use of elliptic solvers. This suggests a pairing with the hyperbolic divergence cleaning method discussed in this thesis, which requires effective approaches for enforcing outflow boundary conditions. On the other hand, we believe the Coulomb gauge formulation has great potential despite the three elliptic solves required to properly enforce the gauge condition. In order to support problems with geometry, the Poisson equations should be discretized as integral equations rather than with finite differences. This suggestion is motivated by the access to analytical derivatives offered by integral equation methods, which are aligned with the techniques used in this work. Furthermore, there have been several interesting developments involving fast summation methods using a technique known as barycentric Lagrange interpolation [84]. These approaches, which are kernel-independent, have also been explored on GPUs [85, 86] and show great promise in addressing challenges posed by new computational hardware.

APPENDICES

APPENDIX A

APPENDIX FOR CHAPTER 3

A.1 Example for Linear Advection

Suppose we wish to solve the 1D linear advection equation:

∂_t u + c ∂_x u = 0,  (x, t) ∈ (a, b) × ℝ⁺,  (A.1)

where c > 0 is the wave speed; we leave the boundary conditions unspecified. The procedure for c < 0 is analogous. Discretizing (A.1) in time with backward Euler yields a semi-discrete equation of the form

(u^{n+1}(x) − u^n(x))/Δt + c ∂_x u^{n+1}(x) = 0.

If we rearrange this, we obtain a linear equation of the form

L[u^{n+1}; α](x) = u^n(x),  (A.2)

where we have used

α := 1/(cΔt),  L := I + (1/α) ∂_x.

By reversing the order in which the discretization is performed, we have created a sequence of BVPs at discrete time levels. If we had discretized equation (A.1) using the MOL formalism, then L would be an algebraic operator. To solve equation (A.2) for u^{n+1}, we analytically invert the operator L. Notice that this equation is actually an ODE, which is linear, so the problem can be solved using methods developed for ODEs. If we apply the integrating factor method to the problem, we obtain

∂_x ( e^{αx} u^{n+1}(x) ) = α e^{αx} u^n(x).

To integrate this equation, we use the fact that characteristics move to the right, so integration is performed from a to x. After rearranging the result, we arrive at the update equation

u^{n+1}(x) = e^{−α(x−a)} u^{n+1}(a) + α ∫_a^x e^{−α(x−s)} u^n(s) ds,
           ≡ e^{−α(x−a)} A^{n+1} + α ∫_a^x e^{−α(x−s)} u^n(s) ds,
           ≡ L^{−1}[u^n; α](x).

This update displays the origins of the implicit behavior of the method. While convolutions are performed on data from the previous time step, the boundary terms are taken at time level n + 1. Now that we have obtained the update equation, we need to apply the boundary conditions. Clearly, if the problem specifies a Dirichlet boundary condition at x = a, then A^{n+1} = u^{n+1}(a). We can compute a variety of boundary conditions using the update equation

u^{n+1}(x) = e^{−α(x−a)} A^{n+1} + I[u^n; α](x),

where

I[u^n; α](x) = α ∫_a^x e^{−α(x−s)} u^n(s) ds.

For example, with periodic boundary conditions, we would need to satisfy

u^{n+1}(a) = u^{n+1}(b),  (A.3)
∂_x u^{n+1}(a) = ∂_x u^{n+1}(b).  (A.4)

Applying condition (A.3), we find that

A^{n+1} = e^{−α(b−a)} A^{n+1} + α ∫_a^b e^{−α(b−s)} u^n(s) ds.

Solving this equation for A^{n+1} shows that

A^{n+1} = I[u^n; α](b) / (1 − μ),  with μ = e^{−α(b−a)}.

Alternatively, we could have started with (A.4), which would give an identical solution.
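To make the periodic update above concrete, the following minimal sketch (an illustrative Python/NumPy script, not the solver developed in this work) advances the backward-Euler update for (A.1). The global convolution I[u^n; α] is evaluated with the O(N) exponential recursion, and, as one common second-order choice, the local integrals use weights obtained by integrating the kernel exactly against a piecewise-linear interpolant of u; the boundary coefficient A^{n+1} = I[u^n; α](b)/(1 − μ) follows the periodic derivation above.

```python
import numpy as np

def molt_advection_step(u, dx, alpha):
    """One backward-Euler MOLT step for u_t + c u_x = 0 with c > 0, periodic BCs.

    u holds nodal values on x_j = a + j*dx, j = 0, ..., N, with u[0] == u[N].
    """
    N = u.size - 1
    nu = alpha * dx
    d = np.exp(-nu)
    # Weights from integrating e^{-alpha(x_j - s)} exactly against a
    # piecewise-linear interpolant of u on [x_{j-1}, x_j] (second order).
    P = (nu - 1.0 + d) / nu
    Q = (1.0 - (1.0 + nu) * d) / nu
    # O(N) recursion: I_j = e^{-nu} I_{j-1} + (local integral on [x_{j-1}, x_j]).
    I = np.zeros(N + 1)
    for j in range(1, N + 1):
        I[j] = d * I[j - 1] + Q * u[j - 1] + P * u[j]
    mu = np.exp(-nu * N)                      # e^{-alpha(b - a)}
    A = I[N] / (1.0 - mu)                     # periodic boundary coefficient A^{n+1}
    x = dx * np.arange(N + 1)
    return np.exp(-alpha * x) * A + I         # u^{n+1} = e^{-alpha(x - a)} A^{n+1} + I

# Advect a sine wave on [0, 1]: unconditionally stable, first order in time.
N, c, dt = 200, 1.0, 0.01
dx = 1.0 / N
alpha = 1.0 / (c * dt)
x = dx * np.arange(N + 1)
u = np.sin(2 * np.pi * x)
for _ in range(10):
    u = molt_advection_step(u, dx, alpha)
err = np.max(np.abs(u - np.sin(2 * np.pi * (x - c * 10 * dt))))
```

Refining Δt recovers the expected first-order accuracy of backward Euler; a Dirichlet condition at x = a would simply replace A^{n+1} with the prescribed boundary data.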
While this particular procedure is only applicable to linear problems, this exercise motivates some of the choices made to define operators in the method.

A.2 Kokkos Kernels

This section provides listings that outline the general format of the Kokkos kernels used in this work. Specifically, we provide structures for the tiled/blocked algorithms (Scheme A.1) in addition to the kernel that executes the fast summation method along a line (Scheme A.2).

// Distribute tiles of the array to teams of threads dynamically
Kokkos::parallel_for("team loop over tiles",
    team_policy(total_tiles, Kokkos::AUTO()),
    KOKKOS_LAMBDA(team_type &team_member) {
        // Determine the flattened tile index via the team rank
        // and compute the unflattened indices of the tile T_{i,j}
        const int tile_idx = team_member.league_rank();
        const int tj = tile_idx % num_tiles_x;
        const int ti = tile_idx / num_tiles_x;
        // Retrieve tile sizes & offsets and
        // obtain subviews of the relevant grid data on tile T_{i,j}
        // ...
        // Use a team's thread range over the lines
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team_member, Ny_tile),
            [&](const int iy) {
                // Slice to extract a subview of my line's data and
                // call line methods which use vector loops
                // ...
            });
    });

Scheme A.1: An example of a coarse-grained parallel nested loop structure.

// Distribute the threads to lines
Kokkos::parallel_for("Fast sweeps along x",
    range_policy(0, Ny),
    KOKKOS_LAMBDA(const int iy) {
        // Slice to obtain the local integrals and apply
        // the convolution kernel to the entire line
        // ...
    });

Scheme A.2: Kokkos kernel for the fast-convolution algorithm.

A.3 WENO Quadrature

We provide the various expressions for the coefficients and smoothness indicators used in the reconstruction process for J_R^{(r)}.
Defining ν ≡ αΔx, the coefficients for the fixed stencils are given in [44] as follows:

c_{−3}^{(0)} = (6 − 6ν + 2ν² − (6 − ν²) e^{−ν}) / (6ν³),
c_{−2}^{(0)} = −(6 − 8ν + 3ν² − (6 − 2ν − 2ν²) e^{−ν}) / (2ν³),
c_{−1}^{(0)} = (6 − 10ν + 6ν² − (6 − 4ν − ν² + 2ν³) e^{−ν}) / (2ν³),
c_{0}^{(0)} = −(6 − 12ν + 11ν² − 6ν³ − (6 − 6ν + 2ν²) e^{−ν}) / (6ν³),

c_{−2}^{(1)} = (6 − ν² − (6 + 6ν + 2ν²) e^{−ν}) / (6ν³),
c_{−1}^{(1)} = −(6 − 2ν − 2ν² − (6 + 4ν − ν² − 2ν³) e^{−ν}) / (2ν³),
c_{0}^{(1)} = (6 − 4ν − ν² + 2ν³ − (6 + 2ν − 2ν²) e^{−ν}) / (2ν³),
c_{1}^{(1)} = −(6 − 6ν + 2ν² − (6 − ν²) e^{−ν}) / (6ν³),

c_{−1}^{(2)} = (6 + 6ν + 2ν² − (6 + 12ν + 11ν² + 6ν³) e^{−ν}) / (6ν³),
c_{0}^{(2)} = −(6 + 4ν − ν² − 2ν³ − (6 + 10ν + 6ν²) e^{−ν}) / (2ν³),
c_{1}^{(2)} = (6 + 2ν − 2ν² − (6 + 8ν + 3ν²) e^{−ν}) / (2ν³),
c_{2}^{(2)} = −(6 − ν² − (6 + 6ν + 2ν²) e^{−ν}) / (6ν³).

The corresponding linear weights are

d_0 = (6 − ν² − (6 + 6ν + 2ν²) e^{−ν}) / (3ν(2 − ν − (2 + ν) e^{−ν})),
d_2 = (60 − 60ν + 15ν² + 5ν³ − 3ν⁴ − (60 − 15ν² + 2ν⁴) e^{−ν}) / (10ν²(6 − ν² − (6 + 6ν + 2ν²) e^{−ν})),
d_1 = 1 − d_0 − d_2.

The expressions for the smoothness indicators are given in [56] as

β_0 = (13/12)(−v_{i−3} + 3v_{i−2} − 3v_{i−1} + v_i)² + (1/4)(v_{i−3} − 5v_{i−2} + 7v_{i−1} − 3v_i)²,
β_1 = (13/12)(−v_{i−2} + 3v_{i−1} − 3v_i + v_{i+1})² + (1/4)(v_{i−2} − v_{i−1} − v_i + v_{i+1})²,
β_2 = (13/12)(−v_{i−1} + 3v_i − 3v_{i+1} + v_{i+2})² + (1/4)(−3v_{i−1} + 7v_i − 5v_{i+1} + v_{i+2})².

To obtain the analogous expressions for J_L^{(r)}, we exploit the “mirror-symmetry” property of WENO reconstructions. That is, one can keep the left side of each of the expressions, then reverse the order of the expressions on the right. Expressions for calculating one particular smoothness indicator, if interested, can be found in [44].
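As a sanity check on these formulas, the short sketch below (an illustrative Python script, not code from this work) evaluates the linear weights at ν = 1/2 and the smoothness indicators on data containing a single jump confined to the leftmost stencil. Nonlinear weights formed with the usual d_r/(ε + β_r)² normalization, assumed here purely for illustration, then suppress the contribution of the stencil that crosses the jump.

```python
import numpy as np

def linear_weights(nu):
    """Linear weights (d0, d1, d2) for J_R as functions of nu = alpha*dx."""
    e = np.exp(-nu)
    d0 = (6 - nu**2 - (6 + 6*nu + 2*nu**2)*e) / (3*nu*(2 - nu - (2 + nu)*e))
    d2 = (60 - 60*nu + 15*nu**2 + 5*nu**3 - 3*nu**4
          - (60 - 15*nu**2 + 2*nu**4)*e) / (10*nu**2*(6 - nu**2 - (6 + 6*nu + 2*nu**2)*e))
    return d0, 1.0 - d0 - d2, d2

def smoothness(v):
    """beta_0, beta_1, beta_2 from the six values v = (v_{i-3}, ..., v_{i+2})."""
    b0 = 13/12*(-v[0] + 3*v[1] - 3*v[2] + v[3])**2 + 1/4*(v[0] - 5*v[1] + 7*v[2] - 3*v[3])**2
    b1 = 13/12*(-v[1] + 3*v[2] - 3*v[3] + v[4])**2 + 1/4*(v[1] - v[2] - v[3] + v[4])**2
    b2 = 13/12*(-v[2] + 3*v[3] - 3*v[4] + v[5])**2 + 1/4*(-3*v[2] + 7*v[3] - 5*v[4] + v[5])**2
    return np.array([b0, b1, b2])

d = linear_weights(0.5)
# Jump between v_{i-3} and v_{i-2}: only stencil 0 crosses the discontinuity,
# so beta_1 = beta_2 = 0 while beta_0 = 13/12 + 1/4 = 4/3.
beta = smoothness(np.array([0.0, 1.0, 1.0, 1.0, 1.0, 1.0]))
w = np.array(d) / (1e-6 + beta)**2
w /= w.sum()          # nonlinear weights: stencil 0 is strongly suppressed
```
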
A.4 Some Larger Figures from Experiments

[Figure A.1 panels: update rate (DOF/s) versus problem size N for 2-D (top) and 3-D (bottom) test cases, for team sizes Kokkos::AUTO(), 2, and 4, comparing tiling with/without subviews and TVR, TP + TTR without tiling, range_policy (with and without simd), MDRange, and tiled MDRange.]

Figure A.1: Plots comparing the performance of different parallel execution policies for the pattern in Scheme 3.2 using test cases in 2-D (top) and 3-D (bottom). Tests were conducted on a single node consisting of 40 cores using the code configuration outlined in 3.1. Each group consists of three plots, which differ in the value selected for the team size. We note that hyperthreading is not enabled on our systems, so Kokkos::AUTO() defaults to a team size of 1. Tile experiments used a block size of 256² in 2-D problems and 32³ in 3-D. A tiled MDRange was not implemented in the 2-D cases because the block size was larger than some of the problems. The results generally agree with those presented in 3.5. For smaller problem sizes, using the non-portable range_policy with OpenMP simd directives is clearly superior to the other policies. However, when enough work is available, we see that blocked policies with subviews and vectorization generally become the fastest. In both cases, MDRange seems to have fairly good performance. Tiling, when used with MDRange in the 3-D cases, seems to be slower than plain MDRange. Again, we see that the use of blocking provides a more consistent update rate if enough work is available.
[Figure A.2 panels: DOF/node/s and weak scaling efficiency versus node count (1-49) for the advection, diffusion, and Hamilton-Jacobi applications at DOF/node = 3361², 13441², and 26881².]

Figure A.2: Weak scaling results, for each of the applications, using up to 49 nodes (1960 cores). For each of the applications, we have provided the update rate and weak scaling efficiency computed via the fastest time/step (top) and average time/step (bottom). Results for the advection and diffusion applications are quite similar, despite the use of different operators. The results for the H-J application seem to indicate that no major performance penalties are incurred by use of the adaptive time stepping method. Scalability appears to be excellent up to 16 nodes (640 cores), then begins to decline. While some loss in performance, due to network effects, is to be expected, this loss appears to be larger than was previously observed. The nodes used in the runs were not contiguous, which hints at a possible sensitivity to data locality.

[Figure A.3 panels: DOF/node/s and weak scaling efficiency versus node count (1-9) for the advection, diffusion, and Hamilton-Jacobi applications at DOF/node = 1681², 3361², 6721², 13441², and 26881².]

Figure A.3: Weak scaling results obtained with contiguous allocations of up to 9 nodes (360 cores) for each of the applications.
For comparison, the same information is displayed as in A.2. Data from the fastest trials indicates nearly perfect weak scaling, across all applications, up to 9 nodes, with a consistent update rate between 2–4 × 10⁸ DOF/node/s. A comparison of the fastest timings between the large and small runs supports our claim that data proximity is crucial to achieving the peak performance of the code. Note that the error bars are generally smaller than those in A.2. This indicates that the timing data collected from individual trials exhibits less overall variation.

[Figure A.4 panels: DOF/node/s and strong scaling efficiency versus node count (1-9) for the advection, diffusion, and Hamilton-Jacobi applications at total DOF = 1681², 3361², 6721², 13441², and 26881².]

Figure A.4: Strong scaling results for each of the applications obtained on contiguous allocations of up to 9 nodes (360 cores). Displayed for each of the applications are the update rate and strong scaling efficiency computed from the fastest time/step (top) and average time/step (bottom). This method does not contain a substantial amount of work, so we do not expect good performance for smaller base problem sizes, as the work per node becomes insufficient to hide the cost of communication. Larger base problem sizes, which introduce more work, are capable of saturating the resources, but at some point even these become insufficient. Moreover, threads become idle when the work per node fails to introduce enough blocks.
APPENDIX B

APPENDIX FOR CHAPTER 4

B.1 Semi-discrete Time Consistency of the Lorenz Formulation with BDF-1

Here we show that the Lorenz gauge formulation of Maxwell's equations (1.8)-(1.10) satisfies a certain time consistency relation when a first-order BDF time discretization is applied. By time consistent, we mean that the semi-discrete system for the potentials induces both a gauge condition and a continuity equation at the semi-discrete level. In the treatment of the semi-discrete equations, we shall ignore the effects of dimensional splittings. Following the procedure used in section 2.2.1, we can derive the semi-discrete equations for the scalar and vector potentials using first-order backward differences for each of the time derivatives. Proceeding, one obtains the following semi-discrete equations for the Lorenz gauge formulation:

Lψ^{n+1} = 2ψ^n − ψ^{n−1} + (1/(α² ε₀)) ρ^{n+1},  (B.1)
LA^{n+1} = 2A^n − A^{n−1} + (μ₀/α²) J^{n+1},  (B.2)
(ψ^{n+1} − ψ^n)/(c² Δt) + ∇ · A^{n+1} = 0,  (B.3)

where we have used the usual operator notation

L := I − (1/α²) Δ,  α := 1/(c Δt).  (B.4)

We can verify that this semi-discrete system is time consistent in the sense of the semi-discrete Lorenz gauge (B.3) through a direct calculation. First, note that the linear operator L can be inverted in the equation for the scalar potential to obtain the update

ψ^{n+1} = L^{−1}[ 2ψ^n − ψ^{n−1} + (1/(α² ε₀)) ρ^{n+1} ].  (B.5)

If we evaluate this equation at time level n, we obtain

ψ^n = L^{−1}[ 2ψ^{n−1} − ψ^{n−2} + (1/(α² ε₀)) ρ^n ].  (B.6)

Next, we take the divergence of A in equation (B.2) and find that

L(∇ · A^{n+1}) = 2∇ · A^n − ∇ · A^{n−1} + (μ₀/α²) ∇ · J^{n+1}.

Formally inverting the operator L, we obtain the relation

∇ · A^{n+1} = L^{−1}[ 2∇ · A^n − ∇ · A^{n−1} + (μ₀/α²) ∇ · J^{n+1} ].  (B.7)

We now use equations (B.5), (B.7), and (B.6) to evaluate the semi-discrete Lorenz gauge (B.3). Using the linearity of the operator L, we obtain

(ψ^{n+1} − ψ^n)/(c² Δt) + ∇ · A^{n+1} = L^{−1}[ (2ψ^n − 3ψ^{n−1} + ψ^{n−2})/(c² Δt) + 2∇ · A^n − ∇ · A^{n−1} + (μ₀/α²)( (ρ^{n+1} − ρ^n)/Δt + ∇ · J^{n+1} ) ].

Note that we have used the relation c² = (μ₀ ε₀)^{−1}. From these calculations, we can see that the corresponding semi-discrete continuity equation appears as a residual for the gauge condition (B.3). The remaining terms in the operand for the inverse can also be expressed directly in terms of this semi-discrete gauge, since

(2ψ^n − 3ψ^{n−1} + ψ^{n−2})/(c² Δt) + 2∇ · A^n − ∇ · A^{n−1} = 2( (ψ^n − ψ^{n−1})/(c² Δt) + ∇ · A^n ) − ( (ψ^{n−1} − ψ^{n−2})/(c² Δt) + ∇ · A^{n−1} ).

This gives rise to an inductive argument for the time consistency. The initial data for the problem satisfies both the semi-discrete gauge condition and the continuity equation. If the discrete gauge condition

(ψ^{n+1} − ψ^n)/(c² Δt) + ∇ · A^{n+1} = 0

holds for any time level n, then it follows that the analogous semi-discrete continuity equation

(ρ^{n+1} − ρ^n)/Δt + ∇ · J^{n+1} = 0

holds as well. We briefly sketch the idea for both directions. The forward direction can be easily seen by assuming that the semi-discrete gauge condition holds up to time level n + 1, which is equivalent to writing

0 = L^{−1}[ (μ₀/α²)( (ρ^{n+1} − ρ^n)/Δt + ∇ · J^{n+1} ) ].

The result follows by applying the operator L to both sides. A similar argument can be used for the converse. The discrete gauge condition is assumed to be satisfied by the initial condition and all relevant earlier times, i.e.,

(ψ^{n+1} − ψ^n)/(c² Δt) + ∇ · A^{n+1} = 0,  n = −2, −1.

Now, we assume that the continuity equation

(ρ^{n+1} − ρ^n)/Δt + ∇ · J^{n+1} = 0

is true for any time level n. Then, the gauge condition at n = 0 is also satisfied because

(ψ¹ − ψ⁰)/(c² Δt) + ∇ · A¹ = L^{−1}[0] ≡ 0.

This argument can be iterated n more times to obtain the result.
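The inductive argument can also be checked numerically. For a single Fourier mode with wavenumber k, the operator L acts as multiplication by 1 + k²/α², so the semi-discrete updates reduce to scalar recurrences. The sketch below (an illustrative Python script in non-dimensional units ε₀ = μ₀ = c = 1, not code from this work) seeds ψ⁰ with one application of the update so that (B.5) holds at level 0, seeds ∇·A so the gauge holds at levels −1 and 0, drives the updates with sources satisfying the semi-discrete continuity equation, and confirms that the gauge residual stays at machine precision.

```python
import numpy as np

eps0 = mu0 = c = 1.0
dt, k = 0.1, 2.0
alpha = 1.0 / (c * dt)
ell = 1.0 + (k / alpha)**2            # Fourier symbol of L = I - (1/alpha^2)*Laplacian

def rho(n):                            # prescribed charge-density mode amplitude
    return np.cos(0.3 * n)

# Arbitrary starting history for psi; psi^0 is generated by the scheme itself
# so that (B.5) holds at level 0, and div(A) is seeded to satisfy the gauge.
psi_mm, psi_m = 0.2, -0.5                                      # psi^{-2}, psi^{-1}
psi = (2*psi_m - psi_mm + rho(0) / (alpha**2 * eps0)) / ell    # psi^{0}
divA_m = -(psi_m - psi_mm) / (c**2 * dt)                       # gauge at level -1
divA = -(psi - psi_m) / (c**2 * dt)                            # gauge at level 0

worst = 0.0
for n in range(50):
    # Source satisfying the semi-discrete continuity equation:
    # (rho^{n+1} - rho^n)/dt + div(J)^{n+1} = 0.
    divJ = -(rho(n + 1) - rho(n)) / dt
    psi_new = (2*psi - psi_m + rho(n + 1) / (alpha**2 * eps0)) / ell   # (B.1)
    divA_new = (2*divA - divA_m + (mu0 / alpha**2) * divJ) / ell       # div of (B.2)
    gauge = (psi_new - psi) / (c**2 * dt) + divA_new                   # residual of (B.3)
    worst = max(worst, abs(gauge))
    psi_m, psi = psi, psi_new
    divA_m, divA = divA, divA_new
```

Replacing divJ with a source that violates the discrete continuity equation makes the gauge residual nonzero on the very first step, mirroring the forward direction of the argument above.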
B.2 Some Larger Figures from Experiments

(a) The beam of electrons (left) and the corresponding distribution in terms of their radii (right). (b) Slices in x (left) and y (right) of the toroidal magnetic field B^{(θ)}(r).

Figure B.1: The state of the Bennett problem after 50 thermal crossing times using the Boris method with the steady-state Poisson model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii, which uses a total of 50 bins. The plots on the bottom are cross-sections of the steady-state magnetic field B^{(θ)}, which are plotted against the analytical field. We see good agreement of the magnetic field with its analytical solution, which is enough to confine most of the particles within the beam.

(a) The beam of electrons (left) and the corresponding distribution in terms of their radii (right). (b) Slices in x (left) and y (right) of the toroidal magnetic field B^{(θ)}(r).

Figure B.2: The state of the Bennett problem after 45 thermal crossing times obtained with the Molei Tao method (ω = 500) using the steady-state Poisson model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii, which uses a total of 50 bins. The plots on the bottom are slices of the steady-state magnetic field B^{(θ)}, which is plotted against the analytical field. We observe a significant drift in the numerical field away from its steady-state that results in a loss of confinement of the particles to the beam.
(a) The beam of electrons (left) and the corresponding distribution in terms of their radii (right). (b) Slices in x (left) and y (right) of the toroidal magnetic field B^{(θ)}(r).

Figure B.3: The state of the Bennett problem after 35 thermal crossing times using the Boris method with the wave model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii. Again, the beam radius is indicated as a reference. A total of 50 bins are used in the plot. The plots on the bottom are slices of the steady-state magnetic field B^{(θ)}, which is plotted against the analytical field. We see good agreement of the magnetic field with its analytical solution, which is enough to confine most of the particles within the beam.

(a) Time-centered method. (b) BDF method.

Figure B.4: A comparison of the time derivatives of the vector potentials after 1000 particle crossings for the expanding beam problem. This particular data was obtained using the Lorenz gauge formulation for the fields with the Boris method for particles. In the top row, the vector potentials are updated with the time-centered approach, which is purely dispersive and generates noisy time derivatives. The bottom row performs the same experiment, but uses the BDF method, which is purely dissipative. The differences in the quality of the results are quite apparent. This was discussed in [42], but results were not shown to illustrate the severity of the effects due to dispersion.

Figure B.5: We plot the expanding beam after 1000 particle crossings obtained with the Lorenz gauge formulation that combines the Boris method with the BDF-2 field solver. In Figure B.5a, we plot the beam and the corresponding charge density. We observe some oscillations along the top edge of the beam, which also appear in the charge density.
In Figure B.5b, we observe an increase in the size of violations of the Lorenz gauge condition, which indicates that the method will eventually fail. We plot the Lorenz gauge error as a surface in Figure B.5c using data from the final step. The most significant violations occur near the injection region and along the boundary where particles are removed.

Figure B.6: Here we show the potentials (and their derivatives) for the expanding beam problem after 1000 particle crossings. This data was obtained using the Lorenz gauge formulation which combines the Boris method with the BDF-2 wave solver. The first row plots the scalar potential ψ and its partial derivatives. Similarly, in the second row, we plot the derivatives of the vector potentials A^{(1)} and A^{(2)}, which are used to construct the magnetic field B^{(3)} (shown in the right-most plot). Note that the time derivative data for the vector potentials were plotted in Figure B.4b, so we exclude them here.

Figure B.7: We show the expanding beam after 2000 particle crossings obtained with the Coulomb gauge formulation, which uses the AEM for time stepping without a cleaning method. In Figure B.7a, we plot the beam and the corresponding charge density, which show visible striations and oscillations along the edge of the beam due to violations in the gauge condition. The growth in the errors associated with the gauge condition is reflected in Figure B.7b, which exhibits unbounded growth. The surface plot of the gauge condition at 2000 crossings shows large errors, especially near the injection region and along the boundary where particles are removed.

Figure B.8: We show the expanding beam after 3000 particle crossings obtained with the Coulomb gauge formulation that uses the AEM for time stepping with elliptic divergence cleaning. In Figure B.8a, we plot the beam and the corresponding charge density.
The elliptic divergence cleaning seems effective at controlling the errors in the gauge condition, compared to the results shown in Figure B.7, which do not apply the cleaning method. The fluctuations of the gauge error away from the boundaries are now in the sixth decimal position, which is a notable improvement over the result shown in Figure B.7c.

Figure B.9: Here we show the potentials (and their derivatives) for the expanding beam problem after 3000 particle crossings. This data was obtained using the Coulomb gauge formulation which combines the AEM for time integration with the BDF-2 wave solver. Elliptic divergence cleaning was applied to the vector potential. In each row, we plot a field quantity and its corresponding derivatives. The top row shows the scalar potential ψ and its derivative, which are computed with finite differences. The middle and last rows show the vector potential components A^{(1)} and A^{(2)}, respectively, along with their derivatives, which are computed with the BDF method.

Figure B.10: We show the expanding beam after 3000 particle crossings obtained with the Lorenz gauge formulation that uses the AEM for time stepping along with a first-order BDF solver. No divergence cleaning is applied. In Figure B.10a, we plot the beam and the corresponding charge density. The beam surprisingly remains intact after many particle crossings without the use of a cleaning method. The fluctuations of the gauge error over time are quite small. We do not observe the growth in the gauge error shown earlier in Figure B.5b for the Boris method.

Figure B.11: Here we show the potentials (and their derivatives) for the expanding beam problem after 3000 particle crossings. This data was obtained using the Lorenz gauge formulation which combines the AEM for time integration with the BDF-1 wave solver. A divergence cleaning method is not used in this example.
In each row, we plot a field quantity and its corresponding derivatives. The top row shows the scalar potential ψ and its derivative, while the middle and last rows show the vector potential components A^{(1)} and A^{(2)}, respectively, along with their derivatives.

Figure B.12: Error in Gauss' law for the Coulomb gauge formulation of the expanding beam problem, which applies the AEM for time integration and uses elliptic divergence cleaning. On the left, we show the time evolution of an “averaged” residual in Gauss' law. The plot on the right is a surface of the error in Gauss' law taken after 3000 particle crossings. Even though cleaning is used to control violations in the gauge condition, whose corresponding surface was shown in Figure B.8c, the metric based on point-wise violations in Gauss' law seems to indicate a significant loss of conservation. On the other hand, the plot on the left implies that Gauss' law is satisfied in an integral sense.

Figure B.13: Error in Gauss' law for the Coulomb gauge formulation of the expanding beam problem, which applies the AEM for time integration. Elliptic divergence cleaning is not used here. On the left, we show the time evolution of the “averaged” residual in Gauss' law. The plot on the right is a surface of the error in Gauss' law taken after 3000 particle crossings. The point-wise violations in Gauss' law are much larger than we observed in Figure B.12b. Similarly, the time evolution of the average defect in Gauss' law is roughly three orders of magnitude larger than in B.12a.

Figure B.14: We show the narrow beam after 5 particle crossings obtained with the Coulomb gauge formulation that uses the AEM for time stepping with elliptic divergence cleaning. We injected 400 particles per time step. In Figure B.14a, we plot the beam and the corresponding charge density. The plot of the particles appears more solid due to the increased injection rate.
The density itself is quite smooth due to the use of additional particles. As before, we see there are violations in the gauge condition along the boundaries due to the injection and removal of particles there. Additionally, the gauge error appears to be quite small away from the boundaries due to the increased smoothness offered by the use of additional particles.

Figure B.15: Here we show the potentials, as well as their derivatives, for the narrow beam problem after 5 particle crossings using an injection rate of 400 particles per step. We used the Coulomb gauge formulation which combines the AEM for time integration with the BDF-2 wave solver for the fields. Elliptic divergence cleaning was applied to the vector potential. In each row, we plot a field quantity and its corresponding spatial derivatives. The top row shows the scalar potential ψ and its derivative, which are computed with finite differences. The middle and last rows show the vector potential components A^{(1)} and A^{(2)}, respectively, along with their derivatives, which are computed with the BDF method. The structure of the fields and their derivatives is quite smooth here.

Figure B.16: Error in Gauss' law for the narrow beam problem that uses an injection rate of 400 particles per step. On the left, we show the time evolution of an “averaged” residual in Gauss' law. There is a jump in the “bulk” error for Gauss' law at step 1000, since this coincides with the beam's first crossing, before stabilizing. The plot on the right is a surface of the error in Gauss' law taken after 5 particle crossings. Even though cleaning is used to control violations in the gauge condition, whose corresponding surface was shown in Figure B.14c, the metric based on point-wise violations in Gauss' law seems to indicate a loss of charge conservation similar to the previous example.
Figure B.17: We show the derivatives used to calculate the divergence of the electric field for the narrow beam problem at 5 particle crossings. We used an injection rate of 400 particles. Derivatives are computed with second-order finite differences. We note the appearance of small oscillations in the x derivative, which is shown on the left. The plot to the right, which corresponds to the y-derivative, is largely uniform on the interior of the beam, but is sharp along the edge of the beam.

Figure B.18: We show the effect of the particle injection rate on the gauge error for the narrow beam problem at 5 particle crossings. In each row, we plot the error in the Coulomb gauge as a surface (left column) and as a slice in x along the middle of the beam (right column). The rows correspond to injection rates of 100, 200, and 400 particles per time step, respectively, from top to bottom. We can see that the increase in particle count reduces the gauge error on the interior of the domain due to the smoothing effect on the particle data.

BIBLIOGRAPHY

[1] J. P. Verboncoeur, “Particle simulation of plasmas: Review and advances,” Plasma Physics and Controlled Fusion, vol. 47, A231–A260, 5A 2005.
[2] C. K. Birdsall and A. B. Langdon, Plasma Physics via Computer Simulation. McGraw-Hill Book Company, 1985.
[3] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, First. CRC Press, 1988.
[4] J. Boris, “Relativistic plasma simulation-optimization of a hybrid code,” in Proceedings of the Fourth Conference on Numerical Simulations of Plasmas, 1970, pp. 3–67.
[5] A. Langdon, B. Cohen, and A. Friedman, “Direct implicit large time-step particle simulation of plasmas,” Journal of Computational Physics, vol. 51, pp. 107–138, 1 1981.
[6] J. Brackbill and D. Forslund, “An implicit method for electromagnetic plasma simulation in two dimensions,” Journal of Computational Physics, vol. 46, pp. 271–308, 2 1982.
[7] R.
Mason, “An electromagnetic field algorithm for 2D implicit plasma simulation,” Journal of Computational Physics, vol. 71, pp. 429–473, 2 1987.
[8] B. Cohen, A. Langdon, D. Hewett, and R. Procassini, “An implicit method for electromagnetic plasma simulation in two dimensions,” Journal of Computational Physics, vol. 81, pp. 151–168, 1 1989.
[9] G. Chen, L. Chacón, and D. Barnes, “An energy- and charge-conserving, implicit, electrostatic particle-in-cell algorithm,” Journal of Computational Physics, vol. 230, pp. 7018–7036, 18 2011.
[10] D. Knoll and D. Keyes, “Jacobian-free Newton–Krylov methods: A survey of approaches and applications,” Journal of Computational Physics, vol. 193, pp. 357–397, 2 2004.
[11] L. Chacón, G. Chen, and D. Barnes, “A charge- and energy-conserving implicit, electrostatic particle-in-cell algorithm on mapped computational meshes,” Journal of Computational Physics, vol. 233, pp. 1–9, 2012.
[12] G. Chen, L. Chacón, L. Yin, B. Albright, J. Stark, and R. Bird, “A semi-implicit, energy- and charge-conserving particle-in-cell algorithm for the relativistic Vlasov-Maxwell equations,” Journal of Computational Physics, vol. 407, p. 109228, 2020.
[13] K. S. Yee, “Numerical solution of initial boundary value problems involving Maxwell’s equations in isotropic media,” IEEE Transactions on Antennas and Propagation, vol. 14, pp. 302–307, 3 1966.
[14] A. Taflove and S. C. Hagness, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Third. Artech House Publishers, 2005.
[15] A. D. Greenwood, K. L. Cartwright, J. W. Luginsland, and E. A. Baca, “On the elimination of numerical Cerenkov radiation in PIC simulations,” Journal of Computational Physics, vol. 201, pp. 665–684, 2 2004.
[16] J. P. Verboncoeur, “Aliasing of electromagnetic fields in stair step boundaries,” Computer Physics Communications, vol. 164, pp. 344–352, 1 2004.
[17] B. Engquist, J. Häggblad, and O.
Runborg, “On energy preserving consistent boundary conditions for the Yee scheme in 2D,” BIT Numerical Mathematics, vol. 52, pp. 615–637, 3 2012.
[18] E. Sonnendrücker, J. Ambrosiano, and S. Brandon, “A finite element formulation of the Darwin PIC model for use on unstructured grids,” Journal of Computational Physics, vol. 121, pp. 281–297, 2 1995.
[19] C.-D. Munz, P. Omnes, R. Schneider, E. Sonnendrücker, and U. Voss, “Divergence correction techniques for Maxwell solvers based on a hyperbolic model,” Journal of Computational Physics, vol. 161, no. 2, pp. 484–511, 2000.
[20] G. Jacobs and J. Hesthaven, “High-order nodal discontinuous Galerkin particle-in-cell method on unstructured grids,” Journal of Computational Physics, vol. 214, pp. 96–121, 1 2006.
[21] ——, “Implicit–explicit time integration of a high-order particle-in-cell method with hyperbolic divergence cleaning,” Journal of Computational Physics, vol. 180, pp. 1760–1767, 10 2009.
[22] M. Pinto, S. Jund, S. Salmon, and E. Sonnendrücker, “Charge-conserving FEM–PIC schemes on general grids,” Comptes Rendus Mécanique, vol. 342, pp. 570–582, 10-11 2014.
[23] M. Pinto, K. Kormann, and E. Sonnendrücker, Variational framework for structure-preserving electromagnetic particle-in-cell methods, 2021. doi: 10.48550/ARXIV.2101.09247. [Online]. Available: https://arxiv.org/abs/2101.09247.
[24] S. O’Connor, Z. D. Crawford, J. P. Verboncoeur, J. Luginsland, and B. Shanker, “A set of benchmark tests for validation of 3-D particle in cell methods,” IEEE Transactions on Plasma Science, vol. 49, pp. 1724–1731, 5 Apr. 2021.
[25] S. O’Connor, Z. D. Crawford, O. H. Ramachandran, J. Luginsland, and B. Shanker, “Time integrator agnostic charge conserving finite element PIC,” Physics of Plasmas, vol. 28, p. 092111, 9 Sep. 2021.
[26] Z. D. Crawford, S. O’Connor, J. Luginsland, and B.
Shanker, “Rubrics for charge conserving current mapping in finite element electromagnetic particle in cell methods,” IEEE Transactions on Plasma Science, vol. 49, pp. 3719–3732, 11 Nov. 2021.
[27] Y. Saad and M. H. Schultz, “GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems,” SIAM Journal on Scientific and Statistical Computing, vol. 7, pp. 856–869, 3 1986.
[28] D. Appelö, L. Zhang, T. Hagstrom, and F. Li, “An energy-based discontinuous Galerkin method with tame CFL numbers for the wave equation,” 2021. doi: 10.48550/ARXIV.2110.07099. [Online]. Available: https://arxiv.org/abs/2110.07099.
[29] O. Beznosov and D. Appelö, “Hermite-discontinuous Galerkin overset grid methods for the scalar wave equation,” Communications on Applied Mathematics and Computation, vol. 3, pp. 391–418, 3 2021.
[30] F. Zheng, Z. Chen, and J. Zhang, “A finite-difference time-domain method without the Courant stability conditions,” IEEE Microwave and Guided Wave Letters, vol. 9, pp. 497–523, 11 1999.
[31] ——, “Toward the development of a three-dimensional unconditionally stable finite-difference time-domain method,” IEEE Transactions on Microwave Theory and Techniques, vol. 48, pp. 1550–1558, 9 2000.
[32] J. Lee and B. Fornberg, “Some unconditionally stable time stepping methods for the 3D Maxwell’s equations,” Journal of Computational and Applied Mathematics, vol. 166, pp. 497–523, 2 2004.
[33] M. F. Causley, A. J. Christlieb, Y. Güçlü, and E. Wolf, “Method of lines transpose: A fast implicit wave propagator,” 2013. doi: 10.48550/ARXIV.1306.6902. [Online]. Available: https://arxiv.org/abs/1306.6902.
[34] M. Causley, A. Christlieb, and E. Wolf, “Method of lines transpose: An efficient unconditionally stable solver for wave propagation,” Journal of Scientific Computing, vol. 70, pp. 896–921, 2 2017.
[35] M. Mašek and P. Gibbon, “Mesh-free magnetoinductive plasma model,” IEEE Transactions on Plasma Science, vol. 38, pp. 2377–2382, 9 2010.
[36] L.
Siddi, G. Lapenta, and P. Gibbon, “Mesh-free Hamiltonian implementation of two dimensional Darwin model,” Physics of Plasmas, vol. 24, pp. 1–11, 8 2017.
[37] Y. Cheng, A. J. Christlieb, W. Guo, and B. Ong, “An asymptotic preserving Maxwell solver resulting in the Darwin limit of electrodynamics,” Journal of Scientific Computing, vol. 71, no. 3, pp. 959–993, 2017.
[38] M. F. Causley and A. J. Christlieb, “Higher order A-stable schemes for the wave equation using a successive convolution approach,” SIAM Journal on Numerical Analysis, vol. 52, no. 1, pp. 220–235, 2014.
[39] A. J. Christlieb, P. T. Guthrey, W. A. Sands, and M. Thavappiragasam, “Parallel algorithms for successive convolution,” Journal of Scientific Computing, vol. 86, pp. 1–44, 1 2021.
[40] M. Thavappiragasam, A. Christlieb, J. Luginsland, and P. Guthrey, “A fast local embedded boundary method suitable for high power electromagnetic sources,” AIP Advances, vol. 10, p. 115318, 11 2020.
[41] E. Wolf, M. Causley, A. Christlieb, and M. Bettencourt, “A particle-in-cell method for the simulation of plasmas based on an unconditionally stable field solver,” Journal of Computational Physics, vol. 326, pp. 342–372, 2016.
[42] E. Wolf, “A particle-in-cell method for the simulation of plasmas based on an unconditionally stable wave equation solver,” Ph.D. dissertation, Michigan State University, 2015.
[43] M. Causley, H. Cho, and A. Christlieb, “Method of lines transpose: Energy gradient flows using direct operator inversion for phase-field models,” SIAM Journal on Scientific Computing, vol. 39, no. 5, B968–B992, 2017.
[44] A. Christlieb, W. Guo, Y. Jiang, and H. Yang, “Kernel based high order explicit unconditionally-stable scheme for nonlinear degenerate advection-diffusion equations,” Journal of Scientific Computing, vol. 82:52, pp. 1–29, 3 2020.
[45] A. Christlieb, W. Sands, and H.
Yang, “A kernel-based explicit unconditionally stable scheme for Hamilton-Jacobi equations on nonuniform meshes,” Journal of Computational Physics, vol. 415, pp. 1–25, 2020, Art. No. 109543.
[46] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling manycore performance portability through polymorphic memory access patterns,” Journal of Parallel and Distributed Computing, vol. 74, pp. 3202–3216, 12 2014.
[47] M. Tao, “Explicit symplectic approximation of nonseparable Hamiltonians: Algorithm and long time performance,” Phys. Rev. E, vol. 94, p. 043303, 4 Oct. 2016. doi: 10.1103/PhysRevE.94.043303.
[48] L. Greengard and V. Rokhlin, “A fast algorithm for particle simulations,” Journal of Computational Physics, vol. 73, no. 2, pp. 325–348, 1987.
[49] M. C. A. Kropinski and B. D. Quaife, “Fast integral equation methods for Rothe’s method applied to the isotropic heat equation,” Computers and Mathematics with Applications, vol. 61, pp. 2436–2446, 9 2011.
[50] M. Causley, A. Christlieb, B. Ong, and L. Van Groningen, “Method of lines transpose: An implicit solution to the wave equation,” Mathematics of Computation, vol. 83, no. 290, pp. 2763–2786, 2014.
[51] G.-S. Jiang and C.-W. Shu, “Efficient implementation of weighted ENO schemes,” Journal of Computational Physics, vol. 126, no. 1, pp. 202–228, 1996.
[52] A. Christlieb, W. Guo, and Y. Jiang, “A WENO-based method of lines transpose approach for Vlasov simulations,” Journal of Computational Physics, vol. 327, pp. 337–367, 2016.
[53] H. Kreiss, N. Petersson, and J. Yström, “Difference approximations of the Neumann problem for the second order wave equation,” SIAM Journal on Numerical Analysis, vol. 42, pp. 1292–1323, 3 2004.
[54] M. Thavappiragasam, “A-stable implicit rapid scheme and software solution for electromagnetic wave propagation,” Ph.D. dissertation, Michigan State University, 2019.
[55] M. F. Causley, H. Cho, A. J. Christlieb, and D. C.
Seal, “Method of lines transpose: High order L-stable O (𝑁) schemes for parabolic equations using successive convolution,” SIAM Journal on Numerical Analysis, vol. 54, no. 3, pp. 1635–1652, 2016.
[56] A. Christlieb, W. Guo, and Y. Jiang, “A kernel-based high order "explicit" unconditionally stable scheme for time dependent Hamilton–Jacobi equations,” Journal of Computational Physics, vol. 379, pp. 214–236, 2019.
[57] S. Gottlieb, C.-W. Shu, and E. Tadmor, “Strong stability-preserving high-order time discretization methods,” SIAM Review, vol. 43, no. 1, pp. 89–112, 2001.
[58] A. Salazar, M. Raydan, and A. Campo, “Theoretical analysis of the exponential transversal method of lines for the diffusion equation,” Numerical Methods for Partial Differential Equations, vol. 16, no. 1, pp. 30–41, 2000.
[59] M. Schemann and F. A. Bornemann, “An adaptive Rothe method for the wave equation,” Computing and Visualization in Science, vol. 1, no. 3, pp. 137–144, 1998.
[60] G. Biros, L. Ying, and D. Zorin, An embedded boundary integral solver for the unsteady incompressible Navier-Stokes equations (preprint), 2002.
[61] ——, “A fast solver for the Stokes equations with distributed forces in complex geometries,” Journal of Computational Physics, vol. 193, pp. 317–348, 1 2004.
[62] S.-H. Chiu, M. N. J. Moore, and B. Quaife, “Viscous transport in eroding porous media,” Journal of Fluid Mechanics, vol. 893, A3, 2020. doi: 10.1017/jfm.2020.228.
[63] B. D. Quaife and M. N. J. Moore, “A boundary-integral framework to simulate viscous erosion of a porous medium,” Journal of Computational Physics, vol. 375, pp. 1–21, 2018.
[64] H. Wang, T. Lei, J. Li, J. Huang, and Z. Yao, “A parallel fast multipole accelerated integral equation scheme for 3D Stokes equations,” International Journal for Numerical Methods in Engineering, vol. 70, pp. 812–839, 7 2007.
[65] L. Ying, G. Biros, and D.
Zorin, “A high-order 3D boundary integral equation solver for elliptic PDEs in smooth domains,” Journal of Computational Physics, vol. 219, pp. 247–275, 1 2006.
[66] O. P. Bruno and M. Lyon, “High-order unconditionally stable FC-AD solvers for general smooth domains I. Basic elements,” Journal of Computational Physics, vol. 229, pp. 2009–2033, 6 2010.
[67] ——, “High-order unconditionally stable FC-AD solvers for general smooth domains II. Elliptic, parabolic and hyperbolic PDEs. Theoretical considerations,” Journal of Computational Physics, vol. 229, pp. 3358–3381, 9 2010.
[68] J. Douglas Jr., “On the numerical integration of 𝜕𝑥𝑥𝑈 + 𝜕𝑦𝑦𝑈 = 𝜕𝑡𝑈 by implicit methods,” Journal of the Society for Industrial and Applied Mathematics, vol. 3, pp. 42–65, 1955.
[69] ——, “Alternating direction methods for three space variables,” Numerische Mathematik, vol. 4, pp. 41–63, 1 1962.
[70] D. W. Peaceman and H. H. Rachford Jr., “The numerical solution of parabolic and elliptic differential equations,” Journal of the Society for Industrial and Applied Mathematics, vol. 3, pp. 28–41, 1 1955.
[71] N. Albin and O. P. Bruno, “A spectral FC solver for the compressible Navier-Stokes equations in general domains I: Explicit time-stepping,” Journal of Computational Physics, vol. 230, pp. 6248–6270, 16 2011.
[72] T. G. Anderson, O. P. Bruno, and M. Lyon, “High-order, dispersionless "fast-hybrid" wave equation solver part I: O (1) sampling cost via incident-field windowing and recentering,” SIAM Journal on Scientific Computing, vol. 42, pp. 1348–1379, 2 2020.
[73] C.-W. Shu, “High order weighted essentially nonoscillatory schemes for convection dominated problems,” SIAM Review, vol. 51, no. 1, pp. 82–126, 2009.
[74] R. D. Hornung and J. A. Keasler, “The RAJA portability layer: Overview and status,” Lawrence Livermore National Laboratory (LLNL), Livermore, CA, United States, Tech. Rep., Sep. 2014. doi: 10.2172/1169830.
[75] P. Grete, F. W. Glines, and B. W.
O’Shea, “K-Athena: A performance portable structured grid finite volume magnetohydrodynamics code,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, pp. 85–97, 1 2020.
[76] C. J. White, J. M. Stone, and C. F. Gammie, “An extension of the Athena++ code framework for GRMHD based on advanced Riemann solvers and staggered-mesh constrained transport,” The Astrophysical Journal Supplement, vol. 225, 2 2016.
[77] P. Mardahl and J. Verboncoeur, “Charge conservation in electromagnetic PIC codes; spectral comparison of Boris/DADI and Langdon-Marder methods,” Computer Physics Communications, vol. 106, pp. 219–229, 1997.
[78] H. Qin, S. Zhang, J. Xiao, J. Liu, Y. Sun, and W. Tang, “Why is Boris algorithm so good?” Physics of Plasmas, vol. 20, 8 Aug. 2013. doi: 10.1063/1.4818428.
[79] L. Brieda. “Particle push in magnetic field (Boris method).” (2011), [Online]. Available: https://www.particleincell.com/2011/vxb-rotation/ (visited on 04/15/2022).
[80] P. Pihajoki, “Explicit methods in extended phase space for inseparable Hamiltonian problems,” Celestial Mechanics and Dynamical Astronomy, vol. 121, pp. 211–231, 2015. doi: 10.1007/s10569-014-9597-9.
[81] W. H. Bennett, “Magnetically self-focussing streams,” Phys. Rev., vol. 45, pp. 890–897, 12 1934. doi: 10.1103/PhysRev.45.890.
[82] J. A. Bittencourt, Fundamentals of Plasma Physics, Third. Springer-Verlag, 2010.
[83] J. Barnes and P. Hut, “A hierarchical 𝑂 (𝑁 log 𝑁) force-calculation algorithm,” Nature, vol. 324, pp. 446–449, 1986.
[84] L. Wang, R. Krasny, and S. Tlupova, “A kernel-independent treecode based on barycentric Lagrange interpolation,” Communications in Computational Physics, vol. 28, no. 4, pp. 1415–1436, 2020. doi: 10.4208/cicp.OA-2019-0177. [Online]. Available: http://global-sci.org/intro/article_detail/cicp/18106.html.
[85] N. Vaughn, L. Wilson, and R.
Krasny, “A GPU-accelerated barycentric Lagrange treecode,” in 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2020, pp. 701–710. doi: 10.1109/IPDPSW50202.2020.00125.
[86] L. Wilson, N. Vaughn, and R. Krasny, “A GPU-accelerated fast multipole method based on barycentric Lagrange interpolation and dual tree traversal,” Computer Physics Communications, vol. 265, 2021. doi: 10.1016/j.cpc.2021.108017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010465521001296.