NUMERICAL METHODS FOR THE EVOLUTION OF FIELDS WITH APPLICATIONS TO PLASMAS

By

William A. Sands

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science, and Engineering – Doctor of Philosophy

2022

ABSTRACT

In this dissertation, we present a collection of algorithms for evolving fields in plasmas, with applications to the Vlasov-Maxwell system. Maxwell’s equations are reformulated in terms of the Lorenz and Coulomb gauge conditions to obtain systems involving wave equations. These wave equations are solved using the methods developed in this thesis and are combined with a particle-in-cell method to simulate plasmas. The particle-in-cell methods developed in this work treat particles using several approaches, including the standard Newton-Lorentz equations, as well as a generalized momentum formulation that eliminates the need to compute time derivatives of the field data.

In the first part of this thesis, we develop and extend earlier methods for scalar wave equations, which are used to update the potentials in these formulations. Our developments are based on a class of algorithms known as the MOL^T (Method of Lines Transpose), which combines a dimensional splitting technique with a one-dimensional integral equation method. This results in methods that are unconditionally stable, can address geometry, and are O(N), where N is the number of mesh points. Our work contributes methods for constructing spatial derivatives of the potentials for this class of dimensionally split algorithms, which are used to evolve particles.

The second part of this thesis considers core algorithms used in the MOL^T and the related class of successive convolution methods in the context of high-performance computing environments.
We developed a novel domain decomposition approach that ultimately allows the method to be used on distributed memory computing platforms. Shared memory algorithms were developed using the Kokkos performance portability library, which permits a user to write a single code that can be executed on various computing devices, with the architecture-dependent details being managed by the library. We optimized the predominant loop structures in the code and developed a blocking pattern that prescribes parallelism at multiple levels and is also more cache-friendly. Moreover, the proposed iteration pattern is flexible enough to work with the shared memory features available on GPU systems.

The final part of this thesis presents the particle-in-cell method for the Vlasov-Maxwell system, which leverages the methods for fields and derivatives developed in this work. The proposed methods are applied to several test problems involving beams. Our results are generally encouraging and demonstrate the capabilities of the proposed field solvers in simulating basic plasma phenomena. Additionally, our results serve to validate the generalized momentum formulation, which will be the foundation of our future work.

Copyright by
WILLIAM A. SANDS
2022

To my family, friends, and educators who have always brought out the best in me.

ACKNOWLEDGEMENTS

Many thanks and acknowledgements are in order, as this work was influenced by many people in a collaborative environment. First, I would like to express my gratitude to my advisor, Professor Andrew Christlieb, who supported me as a graduate student over the past few years, encouraged me to take on challenging problems, and who “believed in me” at times when I did not. I am, however, most grateful for the warm friendship he has provided me and for the compassion and emotional support he provided at an incredibly difficult juncture in my personal life.
I am also grateful for my committee members, who took the time to write generous recommendation letters on my behalf during my time as a student. I would also like to thank Dr. John Luginsland for his insightful suggestions and comments regarding the plasma examples presented in this work and for always putting a humorous spin on everything! Discussions with Dr. Eric Wolf were also helpful in this regard, as he was always willing to take time to answer my questions. Additionally, I would like to thank Professor Michael Murillo and Dr. Jeffrey Haack, who provided me with an opportunity to work at Los Alamos National Laboratory in the summer of 2019. This experience significantly contributed to my growth, both as a person and as a scientist.

The Christlieb group, informally SPECTRE (yep, like the James Bond movie), has been my home for the past couple of years. Even though I am changing groups for my next position, I hope to continue working with many of you as you assemble your own research portfolios. I have had the great fortune of being in a position to share my experiences and knowledge with many of you, whose curiosity inspires my own work.

Lastly, but certainly not least, are my parents, who have constantly encouraged me to pursue the things I enjoy and always made themselves available at times when I needed help. We will see if the late night phone calls still happen after grad school... I am also grateful for the valuable friendships forged in the CMSE program. This includes the secretaries, who were always incredibly friendly and helpful; they are the unsung heroes! While many of us are moving on to different stages in our professional lives, we will always be connected by the time we spent together.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF SCHEMES
LIST OF ALGORITHMS

CHAPTER 1  INTRODUCTION
  1.1 Background and Literature Review
  1.2 Mathematical Models
    1.2.1 Vlasov-Maxwell System
    1.2.2 Problem Formulation
      1.2.2.1 Maxwell’s Equations with the Lorenz Gauge
      1.2.2.2 Maxwell’s Equations with the Coulomb Gauge
      1.2.2.3 Formulation for the Particles
  1.3 Non-dimensionalization
    1.3.1 Equations of Motion in E-B Form
    1.3.2 Equations of Motion for the Generalized Hamiltonian
    1.3.3 Maxwell’s Equations in the Lorenz Gauge
    1.3.4 Maxwell’s Equations in the Coulomb Gauge
  1.4 Contributions of This Thesis

CHAPTER 2  NUMERICAL METHODS FOR THE FIELD EQUATIONS
  2.1 Integral Equation Methods and Green’s Functions
  2.2 Semi-discrete Schemes for the Wave Equation
    2.2.1 The BDF Scheme
    2.2.2 Time-centered Scheme
    2.2.3 Splitting Method Used for Multi-dimensional Problems
  2.3 Inverting One-dimensional Operators
    2.3.1 Integral Solution
    2.3.2 Fast Summation Method
    2.3.3 Approximating the Local Integrals
  2.4 Applying Boundary Conditions
    2.4.1 BDF Method
      2.4.1.1 Dirichlet Boundary Conditions
      2.4.1.2 Neumann Boundary Conditions
      2.4.1.3 Periodic Boundary Conditions
      2.4.1.4 Outflow Boundary Conditions
    2.4.2 Centered Method
      2.4.2.1 Dirichlet Boundary Conditions
      2.4.2.2 Neumann Boundary Conditions
      2.4.2.3 Periodic Boundary Conditions
      2.4.2.4 Outflow Boundary Conditions
    2.4.3 Some Remarks for Multi-dimensional Problems
      2.4.3.1 Sweeping Patterns in Multi-dimensional Problems
      2.4.3.2 Periodic Boundary Conditions
      2.4.3.3 Dirichlet Boundary Conditions
      2.4.3.4 Neumann Boundary Conditions
      2.4.3.5 Outflow Boundary Conditions
  2.5 Extensions for High-order Accuracy
    2.5.1 Successive Convolution Methods
    2.5.2 BDF Methods
  2.6 Numerical Examples
    2.6.1 BDF and Time-centered Derivatives in One Spatial Dimension
    2.6.2 Periodic Boundary Conditions
    2.6.3 Dirichlet Boundary Conditions
    2.6.4 Outflow Boundary Conditions
  2.7 Conclusion

CHAPTER 3  PARALLEL ALGORITHMS FOR SUCCESSIVE CONVOLUTION
  3.1 Introduction
  3.2 Description of Numerical Methods
    3.2.1 Connections Among Different PDEs
    3.2.2 Representation of Derivatives
    3.2.3 Comment on Boundary Conditions
    3.2.4 Fast Convolution Algorithm and Spatial Discretization
    3.2.5 Coupling Approximations with Time Integration Methods
  3.3 Nearest-Neighbor Domain Decomposition Algorithm
    3.3.1 Nearest-Neighbor Criterion
    3.3.2 Enforcing Boundary Conditions for 𝜕𝑥
    3.3.3 Enforcing Boundary Conditions for 𝜕𝑥𝑥
    3.3.4 Additional Comments
  3.4 Strategies for Efficient Implementation on Parallel Systems
    3.4.1 Selecting a Shared Memory Programming Model
    3.4.2 Comment on Performance Metrics
    3.4.3 Benchmarking Prototypical Loop Patterns
    3.4.4 Shared Memory Algorithms
    3.4.5 Code Strategies for Domain Decomposition
    3.4.6 Some Remarks
  3.5 Numerical Results
    3.5.1 Description of Test Problems and Convergence Experiments
    3.5.2 Weak Scaling Experiments
    3.5.3 Strong Scaling Experiments
    3.5.4 Effect of CFL
  3.6 Conclusion

CHAPTER 4  DEVELOPING A PARTICLE-IN-CELL METHOD
  4.1 Introduction
  4.2 Moving from Point-particles to Macro-particles
  4.3 Methods for Controlling Divergence Errors
    4.3.1 A Classic Elliptic Projection Method Based on Gauss’ Law
    4.3.2 Elliptic Divergence Cleaning Based on Potentials
    4.3.3 Enforcing the Lorenz Gauge through Lagrange Multipliers
    4.3.4 Enforcing the Coulomb Gauge
    4.3.5 Analytical Maps for Enforcing Charge Conservation
  4.4 Conventional Methods for Pushing Particles
    4.4.1 Leapfrog Time Integration
    4.4.2 The Boris Push
  4.5 Time Integration with Non-separable Hamiltonians
    4.5.1 The Molei Tao Integrator
      4.5.1.1 Approach for Implicit Sources: Particles Lead the Fields
      4.5.1.2 Approaches for a Mixed Advance: Dealing with Explicit and Implicit Source Terms
    4.5.2 The Asymmetrical Euler Method
  4.6 Numerical Examples
    4.6.1 Motion of a Single Charged Particle
    4.6.2 The Cold Two-Stream Instability
    4.6.3 Numerical Heating Study
    4.6.4 The Bennett Equilibrium Pinch
    4.6.5 The Expanding Beam Problem
    4.6.6 A Narrow Beam Problem and the Effect of Particle Count
  4.7 Conclusion

CHAPTER 5  CONCLUSION AND FUTURE DIRECTIONS

APPENDICES
  APPENDIX A  APPENDIX FOR CHAPTER 3
  APPENDIX B  APPENDIX FOR CHAPTER 4

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Architecture and code configuration for the loop experiments conducted on the Intel 18 cluster at Michigan State University’s Institute for Cyber-Enabled Research. To leverage the wide vector registers, we encourage the compiler to use AVX-512 instructions. Hardware prefetching is not used, as initial experiments seemed to indicate that it hindered performance. Initially, we used GCC 8.2.0-2.31.1 as our compiler, but we found through experimentation that using an Intel compiler improved the performance of our application by a factor of ∼2 for this platform.
Authors in [75] experienced similar behavior for their application and attribute this to a difference in auto-vectorization capabilities between compilers. An examination of the source code for loop execution policies in Kokkos reveals that certain decorators, e.g., #pragma ivdep, are present, which help encourage auto-vectorization when Intel compilers are used. We are unsure whether similar hints are provided for GCC.

Table 3.2: Architecture and code configuration for the numerical experiments conducted on the Intel 18 cluster at Michigan State University’s Institute for Cyber-Enabled Research. As with the loop experiments in Section 3.4.3, we encourage the compiler to use AVX-512 instructions and avoid the use of prefetching. All available threads within the node (40 threads/node) were used in the experiments. Each node consists of two Intel Xeon Gold 6148 CPUs and at least 83 GB of memory. We wish to note that hyperthreading is not supported on this system. As mentioned in Section 3.4.3, when hyperthreading is not enabled, Kokkos::AUTO() defaults to a team size of 1. In cases where the base block size did not divide the problem evenly, this parameter was adjusted to ensure that blocks were nearly identical in size. The parameter 𝛽, which does not depend on Δ𝑡, is used in the definition of 𝛼. For details on the range of admissible 𝛽 values, we refer the reader to [56, 44], where this parameter was introduced. Lastly, recall that 𝜖 is the tolerance used in the NN constraints.

Table 4.1: Table of the physical constants (SI units) used in the numerical experiments.

Table 4.2: Summary of the algorithms explored for the two-stream instability example. Both time integration methods considered are second-order.

Table 4.3: Table of the plasma parameters used in the two-stream instability example.

Table 4.4: Table of the plasma parameters used in the numerical heating example.

Table 4.5: Table of the parameters used in the setup for the Bennett pinch problem.

Table 4.6: Table of the parameters used in the setup for the expanding beam problem.

Table 4.7: Table of the parameters used in the setup for the narrow beam problem.

LIST OF FIGURES

Figure 2.1: Spatial refinement study for the solution and its derivative obtained with second-order methods for the periodic test problem in Section 2.6.1. In Figure 2.1a, we plot the ℓ∞ errors for both the numerical solution and the derivative obtained with the time-centered method. Similarly, in Figure 2.1b, we show the same quantities, which are instead computed using the BDF method. The derivative for the time-centered method fails to refine in space, while the BDF derivative is as accurate as the numerical solution itself.

Figure 2.2: Time refinement study for the solution and its derivative obtained with second-order methods for the test problem in Section 2.6.1. In Figure 2.2a, we plot the ℓ∞ errors for both the numerical solution and the derivative obtained with the time-centered method. Similarly, in Figure 2.2b, we show the same quantities, which are instead computed using the BDF method. The derivative for the time-centered method initially converges together with the numerical solution, but at some point begins to diverge. In contrast, we can see that the errors for the derivatives obtained with the BDF method are aligned with those of the solution. Comparing the scales of the plots, we note that the BDF solution is slightly less accurate than the time-centered method.

Figure 2.3: Time refinement study for the solution and its derivative for the two-dimensional periodic example of Section 2.6.2 obtained with second-order methods. In Figure 2.3a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method.
Similarly, in Figure 2.3b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.4: Space refinement of the solution and its derivative for the two-dimensional periodic example of Section 2.6.2 obtained with second-order methods. In Figure 2.4a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.4b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.5: Time refinement study of the solution and its derivatives in the two-dimensional Dirichlet problem of Section 2.6.3 obtained with second-order methods. In Figure 2.5a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.5b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.6: Space refinement of the solution and its derivatives in the two-dimensional Dirichlet problem of Section 2.6.3 obtained with second-order methods. In Figure 2.6a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.6b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.7: Here we show the reflection observed between the implicit and explicit forms of outflow boundary conditions for the second-order BDF method in a one-dimensional outflow problem. We run with the same Gaussian initial condition until the final time 𝑇 = 4, at which point the wave data should no longer be in the simulation. What is left is the reflection at the artificial boundaries of the domain. The plot shown on the left shows the results obtained with the proposed implicit form of the outflow weights developed for the BDF-2 method, while the plot on the right uses the explicit form of the weights. We find that the explicit form of the weights is more effective at suppressing the spurious reflections at the artificial boundaries.

Figure 2.8: Time refinement study of the solution and its derivatives in the two-dimensional outflow problem of Section 2.6.4 obtained with second-order methods. In Figure 2.8a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.8b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.9: A comparison of the temporal refinement properties for the one-dimensional implicit methods. The weights for the time-centered method, shown in Figure 2.9a, are taken from the paper [34]. We compare this to the proposed implicit approach to outflow, shown on the right, in Figure 2.9b.

Figure 2.10: Space refinement of the solution and its derivatives in the two-dimensional outflow problem of Section 2.6.4 obtained with second-order methods. In Figure 2.10a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.10b, we show the same quantities, both of which are obtained using the BDF-2 method.

Figure 2.11: A comparison of the spatial refinement properties for the one-dimensional implicit methods. The weights for the time-centered method, shown in Figure 2.11a, are taken from the paper [34].
We compare this to the proposed implicit approach to outflow, shown on the right, in Figure 2.11b.

Figure 3.1: Stencils used to build the six-point quadrature [56, 44].

Figure 3.2: A six-point WENO quadrature stencil in 2-D.

Figure 3.3: Fast convolution communication stencil in 2-D based on N-Ns.

Figure 3.4: Heterogeneous platform targeted by Kokkos [46].

Figure 3.5: Plots comparing the performance of different parallel execution policies for the pattern in Scheme 3.1 using test cases in 2-D (left) and 3-D (right). Tests were conducted on a single node that consists of 40 cores using the code configuration outlined in Table 3.1. Each group consists of three plots, whose difference is the value selected for the team size. We note that hyperthreading is not enabled on our systems, so Kokkos::AUTO() defaults to a team size of 1. In each pane, we use “best” to refer to the best run for that configuration across different team sizes. Tile experiments used block sizes of 256² in 2-D problems and 32³ in 3-D. We observe that vectorized policies are generally faster than non-vectorized policies. Interestingly, among blocked/tiled policies, those that construct subviews appear to be faster than those that skip the subview construction, despite the additional work. As the problem size increases, the performance of blocked policies improves substantially. This can be attributed to the large number of idle thread teams when the problem size does not produce enough blocks. In such cases, increasing the size of the team does offer an improvement, as it reduces the number of idle thread teams. For non-blocked policies, we observe that increasing the team size generally results in minimal, if any, improvement in performance. In all cases, the use of blocking provides a more consistent update rate when enough work is introduced.

Figure 3.6: Task charts for the domain-decomposition algorithm under fixed (left) and adaptive (right) time stepping rules. The work overlap regions are indicated, laterally, using gray boxes. The work inside the overlap regions should be sufficiently large to hide the communications occurring in the background. To clarify, the overlap in calculations for 𝐼∗ is achieved by changing the sweeping direction during an exchange of the boundary data. As indicated in the adaptive task chart, the reduction over the “lagged” wave speed data can be performed in the background while building the various operators. Note the use of MPI_WAIT prior to performing the integrator step. This is done to prevent certain overwrite issues during the local reductions in the subsequent integrator step.

Figure 3.7: Convergence results for each of the 2-D example problems. Results were obtained using 9 MPI ranks with 40 threads/node. Also included is a first-order reference line (solid black). Our convergence results indicate first-order accuracy resulting from the low-order temporal discretization. The final reported 𝐿∞ errors for each of the applications, on a grid containing 5277² total zones, are 2.874 × 10⁻³ (advection), 4.010 × 10⁻⁴ (diffusion), and 2.674 × 10⁻⁴ (H-J).

Figure 3.8: Results on the N-N method for the linear advection equation using a fixed mesh with 5377² total DOF and a variable CFL number. In each case, we used the fastest time-to-solution collected from repeating each configuration a total of 20 times. This particular data was collected using an older version of the code, compiled with GCC, which did not use the blocking approach.
For larger block sizes, increasing the CFL has a noticeable improvement on the run time, but as the block sizes become smaller, the gains diminish. For example, if 9 MPI ranks are used, improvements are observed as long as CFL ≤ 4. However, when CFL = 5, the run times begin to increase, with a significant decrease in efficiency. As the blocks become smaller, Δ𝑡 needs to be adjusted (decreased) so that the support of the non-local convolution data does not extend beyond N-Ns.

Figure 4.1: Trajectories for the single particle test, which are obtained using the Boris method (Figure 4.1a) and the second-order integrator by Molei Tao, as presented in [47] (Figure 4.1b). Both methods produce identical trajectories under identical experimental conditions. The particles rotate about the magnetic field, which points in the 𝑧-direction.

Figure 4.2: Self-refinement for the single particle test using the Boris method (Figure 4.2a) and the second-order integrator by Molei Tao, as presented in [47] (Figure 4.2b). Second-order accuracy is achieved by both methods, but the ℓ∞ errors for the Boris method are nearly a factor of 2 larger than those produced by the Molei Tao method. While we have not presented timing results, it is worth noting that the run times for the Boris method were considerably faster than those of the Molei Tao method due to the latter’s additional “stages”. The final error measurements taken from the refinement study are 1.4728 × 10⁻⁷ (Boris) and 6.5592 × 10⁻⁸ (Tao).

Figure 4.3: Initial configuration of electrons used in the two-stream experiments.

Figure 4.4: A comparison of the Molei Tao particle integrator with and without averaging for the two-stream example with the Poisson model. Over time, the pairs of phase space data, including the associated fields, can grow apart, leading to vastly different potentials that kick particles off their smooth trajectories. Averaging appears to be fairly effective at controlling this behavior.

Figure 4.5: We present plots of the electrons in phase space obtained using the Poisson model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. The FFT is used to compute the scalar potentials in both methods. At later times, despite improvements from “averaging” the particle data, the Tao method causes particles to move off the stream lines. This phenomenon is a numerical artifact that is not present in the leapfrog method.

Figure 4.6: Time refinement of a tracer particle’s position for the two-stream instability using the Poisson model for the potential with leapfrog (a) and the Molei Tao integrator with averaging (b). We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. Both methods converge to second-order accuracy, with leapfrog generally displaying a larger absolute error than the Tao method. The exception to this is the smallest Δ𝑡 used in the leapfrog experiments.

Figure 4.7: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator.
The second-order (diffusive) BDF scheme (BDF-2) is used to compute the scalar potentials and their derivatives in for both methods. Unlike the results obtained with the Poisson model, which used the FFT as the field solver (shown in Figure 4.5), the particles at the later times in the Molei Tao method seem to stay attached to their trajectories. . . . . . . . . . . . . . . . . . . . . . 165 Figure 4.8: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second- order integrator based on Molei Tao and applies averaging. We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials are evolved using the second-order central scheme (central- 2), while the derivatives are computed at each step with the second-order BDF scheme (BDF-2). In the bottom row, which uses the Molei Tao method, we obtain results that are similar to the BDF-2 method (see 4.7) in the sense that particles do not seem to jump off of their trajectories. . . . . . . . . . . . . 166 xvi Figure 4.9: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second- order integrator based on Molei Tao and applies averaging. We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials are evolved using the second-order central scheme (central- 2), while the derivatives are computed at each step with the fourth-order BDF scheme (BDF-4). As with the other wave solver methods, the particles in the Molei Tao experiments seem to stay attached to their smooth trajectories, even at the later times. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
167
Figure 4.10: Time refinement of a tracer particle’s position for the two-stream instability. For the particle push, we consider both leapfrog and the Molei Tao method with averaging, in combination with different methods for the fields and their derivatives. We selected 𝜔 = 500 as the value of the coupling parameter in all of the Molei Tao integrator experiments. Each of the methods converges to second-order accuracy, with the error in the Tao method being smaller than in leapfrog. . . . . 168
Figure 4.11: Initial electron data in phase space used for the numerical heating tests. . . . . 170
Figure 4.12: We present results from the numerical heating tests based on the Poisson model. Plots show the average electron temperature as a function of the number of angular plasma periods using leapfrog (left) and the second-order integrator by Molei Tao with averaging (right). Fields and their derivatives are obtained using the FFT. . . . . 171
Figure 4.13: We display results from the numerical heating tests that use the wave model for the potentials. Plots show the average electron temperature as a function of the number of angular plasma periods using leapfrog (top) and the second-order integrator by Molei Tao with averaging (bottom). We selected 𝜔 = 500 as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials and derivatives are computed with the scheme indicated in the individual captions. . . . . 172
Figure 4.14: Initialization of the steady-state toroidal magnetic field in the Bennett problem computed with the BDF-2 wave solver after 1000 steps against a fixed current density. The derivatives of the vector potential 𝐴 (3) are also obtained with the BDF-2 method. . . . .
178
Figure A.1: Plots comparing the performance of different parallel execution policies for the pattern in Scheme 3.2 using test cases in 2-D (top) and 3-D (bottom). Tests were conducted on a single node that consists of 40 cores using the code configuration outlined in 3.1. Each group consists of three plots, which differ in the value selected for the team size. We note that hyperthreading is not enabled on our systems, so Kokkos::AUTO() defaults to a team size of 1. Tile experiments used a block size of 256², in 2-D problems, and 32³ in 3-D. A tiled MDRange was not implemented in the 2-D cases because the block size was larger than some of the problems. The results generally agree with those presented in 3.5. For smaller problem sizes, using the non-portable range_policy with OpenMP simd directives is clearly superior to the other policies. However, when enough work is available, we see that blocked policies with subviews and vectorization generally become the fastest. In both cases, MDRange seems to have fairly good performance. Tiling, when used with MDRange in the 3-D cases, seems to be slower than plain MDRange. Again, we see that the use of blocking provides a more consistent update rate if enough work is available. . . . . 199
Figure A.2: Weak scaling results, for each of the applications, using up to 49 nodes (1960 cores). For each of the applications, we have provided the update rate and weak scaling efficiency computed via the fastest time/step (top) and average time/step (bottom). Results for the advection and diffusion applications are quite similar, despite the use of different operators. The results for the H-J application seem to indicate that no major performance penalties are incurred by use of the adaptive time stepping method. Scalability appears to be excellent, up to 16 nodes (640 cores), then begins to decline.
While some loss in performance due to network effects is to be expected, this loss appears to be larger than was previously observed. The nodes used in the runs were not contiguous, which hints at a possible sensitivity to data locality. . . . . 200
Figure A.3: Weak scaling results obtained with contiguous allocations of up to 9 nodes (360 cores) for each of the applications. For comparison, the same information is displayed as in A.2. Data from the fastest trials indicates nearly perfect weak scaling, across all applications, up to 9 nodes, with a consistent update rate between 2–4 × 10⁸ DOF/node/s. A comparison of the fastest timings between the large and small runs supports our claim that data proximity is crucial to achieving the peak performance of the code. Note that the error bars are generally smaller than those in A.2. This indicates that the timing data collected from individual trials exhibits less overall variation. . . . . 201
Figure A.4: Strong scaling results for each of the applications obtained on contiguous allocations of up to 9 nodes (360 cores). Displayed among each of the applications are the update rate and strong scaling efficiency computed from the fastest time/step (top) and average time/step (bottom). This method does not contain a substantial amount of work, so we do not expect good performance for smaller base problem sizes, as the work per node becomes insufficient to hide the cost of communication. Larger base problem sizes, which introduce more work, are capable of saturating the resources, but the work per node eventually becomes insufficient as more nodes are added. Moreover, threads become idle when the work per node fails to introduce enough blocks. . . . . 202
Figure B.1: The state of the Bennett problem after 50 thermal crossing times using the Boris method with the steady-state Poisson model for the fields.
The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii, which uses a total of 50 bins. The plots on the bottom are cross-sections of the steady-state magnetic field 𝐵 (𝜃) , which are plotted against the analytical field. We see good agreement in the magnetic field with its analytical solution, which is enough to confine most of the particles within the beam. . . . . 206
Figure B.2: The state of the Bennett problem after 45 thermal crossing times obtained with the Molei Tao method (𝜔 = 500) using the steady-state Poisson model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii, which uses a total of 50 bins. The plots on the bottom are slices of the steady-state magnetic field 𝐵 (𝜃) , which is plotted against the analytical field. We observe a significant drift in the numerical field away from its steady-state that results in a loss of confinement of the particles to the beam. . . . . 207
Figure B.3: The state of the Bennett problem after 35 thermal crossing times using the Boris method with the wave model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii. Again, the beam radius is indicated as a reference. A total of 50 bins are used in the plot. The plots on the bottom are slices of the steady-state magnetic field 𝐵 (𝜃) , which is plotted against the analytical field. We see good agreement in the magnetic field with its analytical solution, which is enough to confine most of the particles within the beam. . . . .
. . . . 208
Figure B.4: A comparison of the time derivatives of the vector potentials after 1000 particle crossings for the expanding beam problem. This particular data was obtained using the Lorenz gauge formulation for the fields with the Boris method for particles. In the top row, the vector potentials are updated with the time-centered approach, which is purely dispersive and generates noisy time derivatives. The bottom row performs the same experiment, but uses the BDF method, which is purely dissipative. The differences in the quality of the results are quite apparent. This was discussed in [42], but results illustrating the severity of the dispersive effects were not shown. . . . . 209
Figure B.5: We plot the expanding beam after 1000 particle crossings obtained with the Lorenz gauge formulation that combines the Boris method with the BDF-2 field solver. In Figure B.5a, we plot the beam and the corresponding charge density. We observe some oscillations along the top edge of the beam, which also appear in the charge density. In Figure B.5b, we observe an increase in the size of violations of the Lorenz gauge condition, which indicates that the method will eventually fail. We plot the Lorenz gauge error as a surface in Figure B.5c using data from the final step. The most significant violations occur near the injection region and along the boundary where particles are removed. . . . . 210
Figure B.6: Here we show the potentials (and their derivatives) for the expanding beam problem after 1000 particle crossings. This data was obtained using the Lorenz gauge formulation which combines the Boris method with the BDF-2 wave solver. The first row plots the scalar potential 𝜓 and its partial derivatives.
Similarly, in the second row, we plot the derivatives of the vector potentials 𝐴 (1) and 𝐴 (2) , which are used to construct the magnetic field 𝐵 (3) (shown in the right-most plot). Note that the time derivative data for the vector potentials were plotted in Figure B.4b, so we exclude them here. . . . . 211
Figure B.7: We show the expanding beam after 2000 particle crossings obtained with the Coulomb gauge formulation, which uses the AEM for time stepping without a cleaning method. In Figure B.7a, we plot the beam and the corresponding charge density, which show visible striations and oscillations along the edge of the beam due to violations in the gauge condition. The growth of the errors in the gauge condition is reflected in Figure B.7b, which exhibits unbounded growth. The surface plot of the gauge condition at 2000 crossings shows large errors, especially near the injection region and along the boundary where particles are removed. . . . . 212
Figure B.8: We show the expanding beam after 3000 particle crossings obtained with the Coulomb gauge formulation that uses the AEM for time stepping with elliptic divergence cleaning. In Figure B.8a, we plot the beam and the corresponding charge density. The elliptic divergence cleaning seems effective at controlling the errors in the gauge condition, compared to the results shown in Figure B.7, which do not apply the cleaning method. The fluctuations of the gauge error away from the boundaries are now in the sixth decimal place, which is a notable improvement over the result shown in Figure B.7c. . . . . 213
Figure B.9: Here we show the potentials (and their derivatives) for the expanding beam problem after 3000 particle crossings. This data was obtained using the Coulomb gauge formulation which combines the AEM for time integration with the BDF-2 wave solver. Elliptic divergence cleaning was applied to the vector potential.
In each row, we plot a field quantity and its corresponding derivatives. The top row shows the scalar potential 𝜓 and its derivative, which are computed with finite differences. The middle and last rows show the vector potential components 𝐴 (1) and 𝐴 (2) , respectively, along with their derivatives, which are computed with the BDF method. . . . . 214
Figure B.10: We show the expanding beam after 3000 particle crossings obtained with the Lorenz gauge formulation that uses the AEM for time stepping along with a first-order BDF solver. No divergence cleaning is applied. In Figure B.10a, we plot the beam and the corresponding charge density. The beam surprisingly remains intact after many particle crossings without the use of a cleaning method. The fluctuations of the gauge error over time are quite small. We do not observe the growth in the gauge error shown earlier in Figure B.5b for the Boris method. . . . . 215
Figure B.11: Here we show the potentials (and their derivatives) for the expanding beam problem after 3000 particle crossings. This data was obtained using the Lorenz gauge formulation which combines the AEM for time integration with the BDF-1 wave solver. A divergence cleaning method is not used in this example. In each row, we plot a field quantity and its corresponding derivatives. The top row shows the scalar potential 𝜓 and its derivative, while the middle and last rows show the vector potential components 𝐴 (1) and 𝐴 (2) , respectively, along with their derivatives. . . . . 216
Figure B.12: Error in Gauss’ law for the Coulomb gauge formulation of the expanding beam problem which applies the AEM for time integration and uses elliptic divergence cleaning. On the left, we show the time evolution of an “averaged” residual in Gauss’ law. The plot on the right is a surface of the error in Gauss’ law taken after 3000 particle crossings.
Even though cleaning is used to control violations in the gauge condition, whose corresponding surface was shown in Figure B.8c, the metric based on point-wise violations in Gauss’ law seems to indicate a significant loss of conservation. On the other hand, the plot on the left implies that Gauss’ law is satisfied in an integral sense. . . . . 217
Figure B.13: Error in Gauss’ law for the Coulomb gauge formulation of the expanding beam problem which applies the AEM for time integration. Elliptic divergence cleaning is not used here. On the left, we show the time evolution of the “averaged” residual in Gauss’ law. The plot on the right is a surface of the error in Gauss’ law taken after 3000 particle crossings. The point-wise violations in Gauss’ law are much larger than we observed in Figure B.12b. Similarly, the time evolution of the average defect in Gauss’ law is roughly three orders of magnitude larger than in Figure B.12a. . . . . 218
Figure B.14: We show the narrow beam after 5 particle crossings obtained with the Coulomb gauge formulation that uses the AEM for time stepping with elliptic divergence cleaning. We injected 400 particles per time step. In Figure B.14a, we plot the beam and the corresponding charge density. The plot of the particles appears more solid due to the increased injection rate. The density itself is quite smooth due to the use of additional particles. As before, we see there are violations in the gauge condition along the boundaries due to the injection and removal of particles there. Additionally, the gauge error appears to be quite small away from the boundaries due to the increased smoothness offered by the use of additional particles. . . . . 219
Figure B.15: Here we show the potentials, as well as their derivatives, for the narrow beam problem after 5 particle crossings using an injection rate of 400 particles per step.
We used the Coulomb gauge formulation which combines the AEM for time integration with the BDF-2 wave solver for the fields. Elliptic divergence cleaning was applied to the vector potential. In each row, we plot a field quantity and its corresponding spatial derivatives. The top row shows the scalar potential 𝜓 and its derivative, which are computed with finite differences. The middle and last rows show the vector potential components 𝐴 (1) and 𝐴 (2) , respectively, along with their derivatives, which are computed with the BDF method. The structure of the fields and their derivatives is quite smooth here. . . . . 220
Figure B.16: Error in Gauss’ law for the narrow beam problem that uses an injection rate of 400 particles per step. On the left, we show the time evolution of an “averaged” residual in Gauss’ law. There is a jump in the “bulk” error for Gauss’ law at step 1000, which coincides with the beam’s first crossing, before the error stabilizes. The plot on the right is a surface of the error in Gauss’ law taken after 5 particle crossings. Even though cleaning is used to control violations in the gauge condition, whose corresponding surface was shown in Figure B.14c, the metric based on point-wise violations in Gauss’ law seems to indicate a loss of charge conservation similar to the previous example. . . . . 221
Figure B.17: We show the derivatives used to calculate the divergence of the electric field for the narrow beam problem at 5 particle crossings. We used an injection rate of 400 particles. Derivatives are computed with second-order finite differences. We note the appearance of small oscillations in the 𝑥 derivative, which is shown on the left. The plot to the right, which corresponds to the 𝑦-derivative, is largely uniform on the interior of the beam, but is sharp along the edge of the beam. . . . .
221
Figure B.18: We show the effect of the particle injection rate on the gauge error for the narrow beam problem at 5 particle crossings. In each row, we plot the error in the Coulomb gauge as a surface (left column) and as a slice in 𝑥 along the middle of the beam (right column). The rows correspond to injection rates of 100, 200, and 400 particles per time step, respectively, from top to bottom. We can see that the increase in particle count reduces the gauge error on the interior of the domain due to the smoothing effect on the particle data. . . . . 222
LIST OF SCHEMES
Scheme 3.1: Looping pattern used in the construction of local integrals, convolutions, and boundary steps. . . . . 103
Scheme 3.2: Another looping pattern used to build “resolvent” operators. With some modifications, this same pattern could be used for the integrator step. In several cases, this iteration pattern may require reading entries in memory that are separated by large distances (i.e., the data is strided). . . . . 103
Scheme A.1: An example of a coarse-grained parallel nested loop structure. . . . . 196
Scheme A.2: Kokkos kernel for the fast-convolution algorithm. . . . . 196
LIST OF ALGORITHMS
Algorithm 3.1: Distributed adaptive time stepping rule. . . . . 112
CHAPTER 1
INTRODUCTION
Plasmas are ubiquitous in nature. In fact, it is well known that they comprise a majority of the visible universe, appearing in settings that range from the electronic devices we regularly use as part of our daily lives to the sun, which sustains life on Earth. Consequently, the properties of plasmas span an enormous range of space and time scales, often many orders of magnitude in size.
These multi-scale features pose a significant challenge for model developers and computational scientists, as many of the models used to interrogate such systems are computationally intractable, even with the existing capabilities offered by supercomputers. This necessitates the development of new algorithms for plasmas which can successfully address the issue of scales and fit within the design constraints posed by new computational hardware. Mathematically, plasmas are systems of charged particles, which can be conveniently described in terms of probability distribution functions that characterize the probability of finding a particular charged particle at some point in phase space. The background electromagnetic fields, which are described by Maxwell’s equations, adapt to the motion of these charges, inducing changes in the probability distribution functions. This results in a complex system of partial differential equations known as the Vlasov-Maxwell system. A defining characteristic of Vlasov plasmas is that effects due to collisions arise only through the electromagnetic fields, so they are often called “collisionless” plasmas in an effort to distinguish them from more complex collision models such as the Boltzmann equation. Building on the concept of a plasma, the next section provides a review of the literature concerning techniques for evolving the electromagnetic fields in the context of particle-in-cell methods for the Vlasov-Maxwell system. Then, we discuss the specific details concerning the mathematical models adopted in this work, as well as the non-dimensionalization process. Lastly, we provide an overview of the work presented in this thesis, delineating it from some of the earlier work.
1.1 Background and Literature Review
In this section, we provide a review of the literature pertaining to algorithms employed in the simulation of Vlasov-type plasmas.
An emphasis is placed on the particle-in-cell (PIC) method, which is the plasma simulation technique adopted in this thesis. A comprehensive review of the literature for PIC methods up to 2005 can be found in the review article [1]. Much of the work highlighted by this reference is now largely considered standard, so, instead, we discuss articles from more recent years that are more aligned with the developments in this thesis. PIC methods [2, 3] have been extensively applied in the numerical simulation of plasmas and are an important class of techniques used in the design of experimental devices, including lasers, pulsed-power systems, and particle accelerators, among others. The earliest work involving these methods began in the 1950s and 1960s, and it remains an active area of research to this day. At its core, a PIC method combines an Eulerian approach for the electromagnetic fields with a Lagrangian method that evolves collections of samples taken from general distribution functions in phase space. In other words, the fields are evolved using a mesh, while the distribution function is evolved using particles whose equations of motion are set according to the characteristics of the PDEs for the distribution functions. Lastly, to combine the two approaches, an interpolation method is used to map data between the mesh and the particles. These maps are typically taken to be linear splines, with the multi-dimensional interpolation performed by taking tensor products of one-dimensional interpolants. While PIC methods are capable of simulating complex nonlinear processes in plasmas, it is worth mentioning their weaknesses. Firstly, a natural consequence of the statistical element in PIC is that “bulk” processes in plasmas will be well represented, while the tails of the distribution will be largely under-resolved, even with good sampling methods.
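To make the interpolation step concrete, the following minimal sketch (function names are our own illustration, not taken from a particular PIC code) computes the linear-spline (cloud-in-cell) weights in one dimension and gathers a mesh quantity to a particle position in two dimensions via a tensor product of the 1-D weights; the same weights would be reused when depositing particle charge back onto the mesh:

```python
import numpy as np

def cic_weights(xp, dx):
    """Cloud-in-cell (linear spline) weights for a particle at xp on a
    uniform 1-D grid with spacing dx. Returns the left grid index and
    the weights assigned to the (left, right) pair of grid points."""
    i = int(np.floor(xp / dx))   # index of the grid point left of xp
    frac = xp / dx - i           # normalized distance from that point
    return i, (1.0 - frac, frac)

def gather_2d(field, xp, yp, dx, dy):
    """Interpolate a mesh quantity to a particle position using the
    tensor product of the 1-D linear-spline weights."""
    i, (wx0, wx1) = cic_weights(xp, dx)
    j, (wy0, wy1) = cic_weights(yp, dy)
    return (wx0 * wy0 * field[i, j]     + wx1 * wy0 * field[i + 1, j]
          + wx0 * wy1 * field[i, j + 1] + wx1 * wy1 * field[i + 1, j + 1])
```

Because the splines are linear, this gather reproduces any linear function of position exactly, which is a convenient sanity check for an implementation.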
Another well-known consequence of the statistical element is the large number of simulation particles that are required in more systematic refinement studies due to certain numerical fluctuations. This consequence is critical to the computational efficiency of the method and, in most applications, warrants the use of supercomputers to perform the simulations. The goal of this work is to supply new algorithms which aim to improve the computational efficiency of existing simulation tools, such as PIC, in the numerical investigation of plasmas. Most PIC methods evolve the simulation particles explicitly using some form of leapfrog time integration along with the Boris rotation method [4] in the case of electromagnetic plasmas. The exploration of semi- and fully-implicit particle treatments in PIC methods began in the 1980s [5, 6, 7]. These approaches suffered from a number of unattractive features, including issues with numerical heating and cooling [8], slow nonlinear convergence, and inconsistencies between the fluid moments and particle data. Consequently, these approaches were abandoned in favor of explicit formulations which were more aligned with the computational hardware available at the time. In recent years, implicit PIC methods have seen a resurgence, beginning with [9], which addressed many of these issues in the case of the Vlasov-Poisson system. Nonlinear convergence and self-consistency were enforced using a Jacobian-free Newton-Krylov method [10] with a fluid preconditioner that enforced the continuity equation. The resulting solver demonstrated remarkable savings compared to explicit methods because it eliminated the need to resolve the charge separation in the plasma, allowing for a coarser mesh to be used. These techniques were later extended to curved geometries through the use of smooth grid mappings [11].
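For reference, the Boris rotation mentioned above splits the Lorentz force update into a half acceleration by E, a rotation about B, and a second half acceleration. A minimal non-relativistic sketch (variable names are our own, not drawn from a specific code) is:

```python
import numpy as np

def boris_push(v, E, B, q, m, dt):
    """One velocity update of the (non-relativistic) Boris scheme:
    half electric kick, rotation about B, second half electric kick."""
    qmdt2 = q * dt / (2.0 * m)
    v_minus = v + qmdt2 * E                  # first half acceleration
    t = qmdt2 * B                            # rotation vector
    s = 2.0 * t / (1.0 + np.dot(t, t))       # rescaled rotation vector
    v_prime = v_minus + np.cross(v_minus, t)
    v_plus = v_minus + np.cross(v_prime, s)  # completes rotation about B
    return v_plus + qmdt2 * E                # second half acceleration
```

A key property of this update, and a reason for its popularity, is that the rotation step preserves the particle's speed exactly when E = 0, so the magnetic field does no work on the particle at the discrete level.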
Recently, an effort has been made to extend these techniques to the full Vlasov-Maxwell system [12] to avoid the highly restrictive CFL condition posed by the gyrofrequency. While these contributions are significant in their own right, there are many opportunities for improvement. Many of these methods are fundamentally stuck at second-order accuracy in both space and time and may greatly benefit from more accurate field solvers. Additionally, applications of interest involve complex geometries which introduce additional complications with stability and are often poorly resolved with uniform Cartesian meshes. Lastly, there is the concern of scalability. Krylov subspace methods pose a massive challenge for scalability on large machines due to the various collective operations used in the algorithms. It seems that the scalability of these methods could be significantly improved if similar implicit methods could be developed which eliminate the inner Krylov solve altogether, though this is beyond the scope of the present work. A challenge associated with developing any solver for Maxwell’s equations is the enforcement of the involutions for the fields, namely ∇ · E = 𝜌/𝜖0 and ∇ · B = 0. In the case of a structured Cartesian grid, Maxwell’s equations can be discretized using a staggered grid technique introduced by Yee [13]. The use of a staggered mesh yields a structure-preserving discrete analogue of the integral form of Maxwell’s equations that automatically enforces the involutions for E and B without additional treatment. This is the basis of the well-known finite-difference time-domain method (FD-TD) [14]. While the staggering in both space and time used in the FD-TD method is second-order accurate, there exists a fourth-order extension of the spatial discretization that was developed as a way of dealing with certain dispersion errors known as Cerenkov radiation [15].
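To make the space-time staggering concrete, here is a minimal 1-D vacuum sketch of the Yee leapfrog update in normalized units (c = 1), with the electric field at integer grid points and times, and the magnetic field offset by half a cell and half a step; the function name and the fixed-endpoint boundary treatment are our own illustration, not a prescription from this thesis:

```python
import numpy as np

def yee_1d(ez, hy, courant):
    """One leapfrog step of the 1-D Yee scheme in normalized units (c = 1).
    ez (length N) lives at integer points/times; hy (length N-1) lives at
    half points/times. courant = c*dt/dx."""
    # advance hy from t - dt/2 to t + dt/2 using ez at time t
    hy -= courant * np.diff(ez)
    # advance ez from t to t + dt using the updated hy; the endpoint
    # values of ez are held fixed (e.g., conducting walls with ez = 0)
    ez[1:-1] -= courant * np.diff(hy)
    return ez, hy
```

A well-known property of this scheme in one dimension is that, at the "magic" time step c*dt/dx = 1, the update propagates a rightward-moving pulse exactly, shifting it one cell per step.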
While the use of a staggered mesh with finite differences is quite effective for Cartesian grids, issues arise in problems defined with geometry, such as curved surfaces, in which one resorts to stair-step boundaries [16]. To mitigate the effect of stair-step boundaries in explicit methods, the mesh resolution is increased, resulting in a highly restrictive time step to meet the CFL stability criterion. Recently, an approach for dealing with geometry in the Yee scheme, which avoids the stair-stepping along the boundary, was developed for two-dimensional problems [17]. The grid cells along the boundary in the method are replaced with cut-cells which use generalized finite-difference updates that account for different intersections with the boundary. While this scheme was shown to be energy conserving and, perhaps more remarkably, preserved the CFL condition of the Yee scheme, the convergence rates demonstrated by the method are suboptimal. The theory in this article established half-order accuracy, yet first-order accuracy was demonstrated in numerical experiments. While many electromagnetic PIC methods solve Maxwell’s equations on Cartesian meshes through the FD-TD method, other methods have been developed specifically for addressing issues posed by geometries through the use of unstructured meshes. In [18], a finite-element method (FEM) was coupled with PIC to solve the Darwin model in which the fields move significantly faster than the plasma. Explicit finite-volume methods (FVM), which can address geometry, were considered in [19], which also developed divergence cleaning methods suitable for applications to PIC simulations of the Vlasov-Maxwell system. Discontinuous Galerkin (DG) methods have also been used to develop high-order PIC methods with elliptic [20] and hyperbolic [21] divergence cleaning methods being employed to enforce Gauss’ law.
Other work in this area has explored more generalized FEM discretizations to enforce charge conservation on arbitrary grids [22], as well as certain structure-preserving discretizations [23, 24, 25]. In particular, [24, 25, 26] employed so-called Whitney basis sets taken from the de Rham sequences to automatically enforce involutions for the electric and magnetic fields. Despite the advantages afforded by the use of such bases, these implicit field solvers rely on the solutions of large linear systems, which need to be solved using GMRES [27]. Even with preconditioning, such methods can be slow, and scalability is difficult to achieve. In the case of explicit solvers, such as FVMs and DG methods, other challenges exist. The basic FVM, without additional reconstructions, is first-order in space. These methods can, of course, be improved to second-order accuracy by performing reconstructions based on a collection of cells. Beyond second-order accuracy, the reconstruction process becomes quite complicated due to the size of the interpolation stencils. DG methods, on the other hand, store cell-wise expansions in a basis, which eliminates the issue encountered in the FVM, typically at the cost of a highly restrictive condition on the size of a time step. Additionally, the significant amount of local work in DG methods makes them appealing for newer hardware, yet the restriction on the time step size is often left unaddressed. Notable exceptions to this restriction exist for the two-way wave equation, including staggered formulations [28] and Hermite methods [29], which allow for a much larger time step. It will be interesting to see the performance of such methods in plasma problems, especially in problems with intricate geometric features. Other methods for Maxwell’s equations have been developed with unconditional stability for the time discretization.
The first of these methods is the ADI-FDTD method [30, 31], which combined an ADI approach with a two-stage splitting to achieve an unconditionally stable solver. Time stepping in these methods was later generalized using a Crank-Nicolson splitting, and several techniques for enhancing the temporal accuracy were proposed [32]. Of particular significance to this work are methods based on successive convolution, also known as the method-of-lines-transpose (MOL𝑇 ) [33, 34]. These methods are unconditionally stable in time and can be obtained by reversing the typical order in which discretization is performed. By first discretizing in time, one can solve a resulting boundary-value problem by formally inverting the differential operator using a Green’s function in conjunction with a fast summation method. Mesh-free methods [35, 36] for Maxwell’s equations have also been explored in the Darwin limit under the Coulomb gauge. These formulations are in some ways similar to PIC in that they evolve particles with shapes, except no mesh is used in the simulation. The elliptic equations are solved using a Green’s function on an unbounded domain, and a fast summation method is used for efficiency. Green’s function methods have also been used to develop asymptotic preserving schemes [37]. This article utilized a boundary integral formulation with a multi-dimensional Green’s function to obtain a method that recovers the Darwin limit under appropriate conditions. The methods considered in this work incorporate dimensional splittings, which results in algorithms with unconditional stability, high-order accuracy [38], parallel scalability [39], and geometric flexibility [40]. In [41], a PIC method was developed based on the MOL𝑇 discretization combined with a staggered grid formulation of the Vlasov-Maxwell system. In this approach, the field equations were cast in terms of the Lorenz gauge, producing wave equations for the scalar and vector potentials.
Additionally, the wave equation for the scalar potential was replaced with an elliptic equation to control errors in the gauge condition. Since the particle equations were written in terms of E and B, additional finite-difference derivatives were required to compute the electric and magnetic fields from the potentials. While the contributions of this work differ significantly from the methods in [41], we consider the latter work as a baseline against which to focus our efforts. A fairly detailed outline of the contributions of this thesis can be found in section 1.4 at the end of this chapter.

1.2 Mathematical Models

In this section, we provide relevant details of the mathematical models employed for the plasma applications considered in this work. We begin with a discussion of the Vlasov-Maxwell system, which is the most general model considered in this work, in section 1.2.1. Then, once we have finished the discussion of the model, we discuss the mathematical formulation in section 1.2.2, which expresses Maxwell's equations in terms of potentials through the use of gauge conditions. More specifically, we discuss two formulations: one which employs the Lorenz gauge, and another which uses the Coulomb gauge. With the exception of the expanding beam problems in section 4.6, the numerical examples shall exclusively work with the Lorenz gauge. Since we have adopted a gauge formulation of Maxwell's equations, we also introduce the equations of motion for the particles in section 1.2.2, which are written entirely in terms of the potentials used for Maxwell's equations. Once we have presented the formulations, we discuss the non-dimensionalized systems used in the numerical experiments presented in section 4.6. Finally, we conclude the chapter with a summary of the original contributions presented in this thesis, which can be found in section 1.4.
1.2.1 Vlasov-Maxwell System

In this work, we develop numerical algorithms for plasmas described by the Vlasov-Maxwell (VM) system, which, in SI units, reads as

    ∂f_s/∂t + v·∇_x f_s + (q_s/m_s)(E + v×B)·∇_v f_s = 0,   (1.1)
    ∇×E = −∂B/∂t,   (1.2)
    ∇×B = μ₀ (J + ε₀ ∂E/∂t),   (1.3)
    ∇·E = ρ/ε₀,   (1.4)
    ∇·B = 0.   (1.5)

The first equation (1.1) is the Vlasov equation, which describes the evolution of a probability distribution function f_s(x, v, t) for particles of species s in phase space, which have mass m_s and charge q_s. More specifically, it describes the probability of finding a particle of species s at the position x, with a velocity v, at a given time t. Since the position and velocity data are vectors with 3 components, the distribution function is a scalar function of 6 dimensions plus time. While the equation itself has a fairly simple structure, the primary challenge in numerically solving it is its high dimensionality. This growth in the dimensionality has posed tremendous difficulties for grid-based discretization methods, where one often needs to use many grid points to resolve scales in the problem. The use of grids for problems involving beams poses additional challenges due to excessive dissipation along the edge of the beams. This difficulty is further compounded by the fact that many plasmas of interest contain multiple species. Despite the lack of a collision operator on the right-hand side of (1.1), collisions do exist in a certain mean-field sense, through the electric and magnetic fields, which appear as coefficients of the velocity gradient. Equations (1.2) - (1.5) are Maxwell's equations, which describe the evolution of the background electric and magnetic fields. Since the plasma is a collection of moving charges, any changes in the distribution function for each species will be reflected in the charge density ρ(x, t), as well as the current density J(x, t), which, respectively, are the source terms for Gauss' law (1.4) and Ampère's law (1.3).
For N_s species, the total charge density and current density are defined by summing over the species:

    ρ(x, t) = Σ_{s=1}^{N_s} ρ_s(x, t),    J(x, t) = Σ_{s=1}^{N_s} J_s(x, t),   (1.6)

where the species charge and current densities are defined through moments of the distribution function f_s according to

    ρ_s(x, t) = q_s ∫_{Ω_v} f_s(x, v, t) dv,    J_s(x, t) = q_s ∫_{Ω_v} v f_s(x, v, t) dv.   (1.7)

Here, the integrals are taken over the velocity components of phase space, which we have denoted by Ω_v. The remaining parameters ε₀ and μ₀ describe the permittivity and permeability of the media in which the fields propagate, which we take to be free-space. In free-space, solutions of Maxwell's equations propagate at the speed of light c, which leads to the useful relation c² = (μ₀ε₀)⁻¹. The last two equations (1.4) and (1.5) are constraints placed on the fields to maintain charge conservation and prevent the appearance of so-called "magnetic monopoles". It is imperative that numerical schemes for Maxwell's equations satisfy these conditions. This requirement is one of the reasons we adopt a formulation of Maxwell's equations in potential form, which is the subject of the next section.

1.2.2 Problem Formulation

In this work, we adopt a particle formulation of the Vlasov equation (1.1) and use a gauge formulation of Maxwell's equations. Here, we discuss the models that form the basis of the numerical methods presented in this work.

1.2.2.1 Maxwell's Equations with the Lorenz Gauge

Under the Lorenz gauge, Maxwell's equations transform to a system of decoupled wave equations of the form

    (1/c²) ∂²ψ/∂t² − Δψ = ρ/ε₀,   (1.8)
    (1/c²) ∂²A/∂t² − ΔA = μ₀ J,   (1.9)
    ∇·A + (1/c²) ∂ψ/∂t = 0,   (1.10)

where c is the speed of light, and ε₀ and μ₀ represent, respectively, the permittivity and permeability of free-space. Further, we have used ψ to denote the scalar potential and A to denote the vector potential.
In fact, under any choice of gauge condition, given ψ and A, one can recover E and B using the definitions

    E = −∇ψ − ∂A/∂t,    B = ∇×A,   (1.11)

where "×" denotes the vector cross product. The structure of equations (1.8) and (1.9) is appealing because the system, modulo the gauge condition (1.10), is essentially four "decoupled" scalar wave equations. Since this system is over-determined, the coupling manifests itself through the gauge condition, which should be thought of as a constraint. Moreover, Maxwell's equations (1.2) - (1.5) are equivalent to (1.8) and (1.9) as long as the Lorenz gauge condition (1.10) is satisfied by ψ and A. This formulation is appealing for several reasons. First, this system is purely hyperbolic, so it evolves in a local sense. Computationally, this means that a localized method can be used to evolve the system, which will be more efficient on parallel computers. Another attractive feature is that many of the methods developed for scalar wave equations, e.g., [38] and [34], can be applied to the system in a straightforward manner.

1.2.2.2 Maxwell's Equations with the Coulomb Gauge

If, instead, we impose the Coulomb gauge in Maxwell's equations, we obtain a coupled, mixed-type system

    −Δψ = ρ/ε₀,   (1.12)
    (1/c²) ∂²A/∂t² − ΔA = μ₀ J − (1/c²) ∂(∇ψ)/∂t,   (1.13)
    ∇·A = 0,   (1.14)

where, again, c is the speed of light, ε₀ and μ₀ represent, respectively, the permittivity and permeability of free-space, and ψ and A denote the scalar potential and vector potential, respectively. There are several noticeable differences in the system above when compared to the one obtained with the Lorenz gauge (1.8)-(1.10). The equations are no longer decoupled in the same way as in the case of the Lorenz gauge. Moreover, the equation for the scalar potential (1.12) is an elliptic equation, rather than a hyperbolic equation.
This requires elliptic solvers, which are more difficult to scale on parallel computers than hyperbolic solvers due to their global properties. As a consequence, additional parallel communications are required to coordinate the solves. Additionally, the Coulomb gauge introduces a somewhat unusual time derivative ∂_t∇ψ, which is connected to the steady-state equation (1.12). In our implementation of the solver with the Coulomb gauge, we use a Helmholtz decomposition of the vector fields A and J that separates the rotational and irrotational components according to

    A = A_rot + A_irrot ≡ A_rot − ∇ξ,   (1.15)
    J = J_rot + J_irrot ≡ J_rot − ∇η.   (1.16)

Here, ξ and η are scalar functions and, by definition, ∇·A_rot = 0 and ∇·J_rot = 0. Substituting this decomposition into equation (1.13), and separating the equations (by linearity), we obtain

    (1/c²) ∂²A_rot/∂t² − ΔA_rot = μ₀ J_rot,   (1.17)
    (1/c²) ∂²A_irrot/∂t² − ΔA_irrot = μ₀ J_irrot − (1/c²) ∂(∇ψ)/∂t.   (1.18)

The second equation is connected to the continuity equation. If we assume that the Coulomb gauge is satisfied, then it follows that ∇·A_irrot = −Δξ = 0. With this in mind, if we now take the divergence of equation (1.18), using (1.12) and (1.16), we find that

    0 = −(1/c²) ∂(Δψ)/∂t + μ₀ ∇·J_irrot
      = (1/(c²ε₀)) ∂ρ/∂t + μ₀ ∇·(J − J_rot)
      = μ₀ (∂ρ/∂t + ∇·J),

which is the continuity equation. In order to solve (1.17), which avoids the term involving the strange time derivative, we need to identify J_rot from J. One way of doing this is by appealing to equation (1.16). Taking its divergence and rearranging the terms, we obtain the elliptic equation

    −Δη = ∇·J.   (1.19)

Once η is identified, we can compute ∇η and set

    J_rot = J + ∇η.   (1.20)

Our implementation works off of equations (1.12) and (1.17), in addition to (1.19) and (1.20). The issue of enforcing the gauge condition will be addressed in section 4.3, where we discuss methods for controlling errors in the gauges.
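To make the decomposition step (1.19)-(1.20) concrete, the following sketch extracts the rotational component J_rot on a doubly periodic grid using FFTs. This is only an illustrative stand-in under a periodic-boundary assumption, not the solver used in this work; the function name and discretization choices are our own.

```python
import numpy as np

def rotational_part(Jx, Jy, dx):
    """Extract J_rot on a periodic 2D grid: solve -Δη = ∇·J spectrally,
    then set J_rot = J + ∇η, following (1.19)-(1.20)."""
    N = Jx.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(N, d=dx)       # integer wavenumbers on [0, N*dx)
    kx, ky = np.meshgrid(k, k, indexing='ij')
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                                # avoid 0/0 for the mean mode
    divJ = 1j * kx * np.fft.fft2(Jx) + 1j * ky * np.fft.fft2(Jy)
    eta_hat = divJ / k2                           # -Δη = ∇·J  →  η̂ = (∇·J)^ / |k|²
    eta_hat[0, 0] = 0.0
    Jx_rot = Jx + np.real(np.fft.ifft2(1j * kx * eta_hat))   # J_rot = J + ∇η
    Jy_rot = Jy + np.real(np.fft.ifft2(1j * ky * eta_hat))
    return Jx_rot, Jy_rot
```

By construction, the discrete divergence of the returned field vanishes mode-by-mode, which mirrors the role of (1.19)-(1.20) in isolating the source for the wave equation (1.17).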
1.2.2.3 Formulation for the Particles

A particle formulation of the Vlasov equation (1.1) can be developed by writing the species distribution function f_s as a collection of Dirac delta distributions over phase space:

    f_s(x, v, t) = Σ_{p=1}^{N_{p_s}} δ(x − x_p) δ(v − v_p).   (1.21)

Here, N_{p_s} symbolizes the number of particles of species s. Notice that we also have the relation

    ∫_{Ω_x} ∫_{Ω_v} f_s(x, v, t) dv dx = N_{p_s},

which holds at any time t. Furthermore, f_s can be converted into a proper distribution by including a normalization factor of 1/N_{p_s}. By combining the ansatz (1.21) with the definitions (1.6) and (1.7), we obtain the following definitions of the charge density and current density for a collection of N_p simulation particles:

    ρ(x) = Σ_{p=1}^{N_p} q_p δ(x − x_p),   (1.22)
    J(x) = Σ_{p=1}^{N_p} q_p v_p δ(x − x_p).   (1.23)

In the above equations, q_p, x_p, and v_p denote the charge, position, and velocity, respectively, of a particle whose label is p. In defining things this way, we have dropped the reference to the species altogether, since each particle can be thought of as its own entity. These particles move along characteristics of the equation (1.1), which are given by the system of ordinary differential equations

    ẋ_i = v_i,    v̇_i = (1/m_i) F(x_i).

The vector field F is the Lorentz force that acts on the particles and is defined by

    F = q (E + v×B),   (1.24)

where we have removed the subscript that refers to a specific particle for simplicity. Next, we write the fields in terms of their potentials. Using (1.11), we can obtain the equivalent expression

    F = q (−∇ψ − ∂A/∂t + v × (∇×A)).

This expression can be simplified with the aid of the vector identity

    ∇(a·b) = a × (∇×b) + b × (∇×a) + (a·∇)b + (b·∇)a.   (1.25)

Using a ≡ A and b ≡ v, along with the fact that the velocity v does not depend on x, we obtain the relation

    v × (∇×A) = ∇(A·v) − (v·∇)A.

Inserting this expression into the force yields

    F = q (−∇ψ − ∂A/∂t + ∇(A·v) − (v·∇)A).

Then we can use the definition of the total (convective) derivative to write

    dA/dt = ∂A/∂t + (v·∇)A,

which means the force is equivalent to

    F = q (−∇ψ − dA/dt + ∇(A·v)).

If we let p denote the classical momentum p = mv, so that dp/dt = F, then we can move the time derivative to the left side of the equation, which gives

    d/dt (p + qA) = q (−∇ψ + ∇(A·v)).

The expression on the left contains the canonical (generalized) momentum

    P := p + qA,   (1.26)

and the right side can be expressed as −∇U, if we let U = q(ψ − A·v). Therefore, we obtain the canonical momentum equation

    dP/dt = q (−∇ψ + ∇(A·v)).

This is the penultimate form of the equation we seek. Instead, we want the derivatives to appear on the vector potential A rather than on A·v. For this, we can use an equivalent form of the identity (1.25), namely

    ∇(a·b) = (∇b)·a + (∇a)·b.

Again, we select a ≡ A and b ≡ v, which shows that

    ∇(A·v) = (∇v)·A + (∇A)·v.

Equation (1.26) provides the connection between the linear momentum p = mv and the canonical momentum P, so that the velocity is given by

    v ≡ (1/m)(P − qA).   (1.27)

Since v is a function only of time, we have that (∇v)·A = 0. Combining this with the other term yields an expanded form for the canonical momentum update

    dP/dt = q (−∇ψ + (1/m)(∇A)·(P − qA)).   (1.28)

Note that since A is a vector, taking the gradient increases its rank by 1, which means that ∇A is a dyad. In component form, we can write ∇(a·b) as

    ∂_i (a_j b_j) = (∂a^(j)/∂x_i) b_j + (∂b^(j)/∂x_i) a_j,   (1.29)

where the summation convention over repeated indices has been used. For our case, the vector b in the above calculation does not depend on space. Therefore, one only requires computing the entries of the dyad

    (∂A^(j)/∂x_i) ≡ J_A^T,   (1.30)

where J_A is the Jacobian matrix associated with A.

Remark 1.2.1.
Another way to see that (1.29) and (1.30) are the correct expressions for (1.28) is to use the fact that A·v is a scalar, and then apply the usual gradient operator for scalar functions. In other words,

    ∇(A·v) = ∇(A^(1) v^(1) + A^(2) v^(2) + A^(3) v^(3))
           = ( ∂_x (A^(1) v^(1) + A^(2) v^(2) + A^(3) v^(3)),
               ∂_y (A^(1) v^(1) + A^(2) v^(2) + A^(3) v^(3)),
               ∂_z (A^(1) v^(1) + A^(2) v^(2) + A^(3) v^(3)) )^T.

The result follows by distributing the derivatives in each row, using the fact that the components of the velocity do not depend on space, followed by use of the definition (1.27).

To obtain the position equation, we note that dx/dt = v and that the classical momentum is given by p = mv. This implies that

    P = m dx/dt + qA,

which can be rearranged to obtain

    dx/dt = (1/m)(P − qA).   (1.31)

Since the transformed equations of motion given by (1.28) and (1.31) will be identical in structure among the particles (differing only by the particle labels), this results in the system

    dx_i/dt = (1/m_i)(P_i − q_i A),   (1.32)
    dP_i/dt = q_i (−∇ψ + (1/m_i)(∇A)·(P_i − q_i A)).   (1.33)

The complete formulation consists of evolving the fields with (1.8)-(1.10) and the particles with (1.32)-(1.33). In the next section, we provide details concerning the non-dimensionalization used for the models.

1.3 Non-dimensionalization

In this section, we discuss the scalings used to non-dimensionalize the models explored in this work. Our choice in exploring the normalized form of these models is simply to reduce the number of floating point operations involving small or large numbers. We first non-dimensionalize the models for the particles, then focus our efforts on the field equations obtained with both the Lorenz and Coulomb gauges. To minimize repetition, we shall illustrate the process for parts of the formulation and simply state the results for those that follow an identical pattern.
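Before turning to the scaling analysis, the particle update (1.32)-(1.33) can be sketched in code. The sketch below uses a plain forward-Euler step for a single particle, purely for illustration; the thesis instead uses integrators for non-separable Hamiltonian systems (see chapter 4). The field callables `grad_psi`, `A`, and `jac_A` are hypothetical interfaces, with the dyad (∇A)·w evaluated as J_A^T w per (1.30).

```python
import numpy as np

def euler_push(x, P, q, m, t, dt, grad_psi, A, jac_A):
    """One forward-Euler step of (1.32)-(1.33) for a single particle.
    grad_psi(x, t) -> (3,), A(x, t) -> (3,), jac_A(x, t) -> (3, 3) with
    J[i, j] = dA_i/dx_j.  (Illustrative only; not the integrator used here.)"""
    w = P - q * A(x, t)                   # kinetic momentum m*v, from (1.26)
    v = w / m                             # velocity via (1.27)
    dP = q * (-grad_psi(x, t) + jac_A(x, t).T @ w / m)   # right side of (1.33)
    return x + dt * v, P + dt * dP
```

Note that only ψ, A, and the spatial derivatives of the potentials enter; no time derivatives of field data are required, which is the point of the generalized momentum formulation.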
The setup for the non-dimensionalization process used in this section considers the following substitutions:

    x → L x̃,    t → T t̃,    P → P P̃,    n → n̄ ñ,
    ψ → ψ₀ ψ̃,    A → A₀ Ã,    ρ → Q n̄ ρ̃,    J → Q n̄ V J̃ ≡ (Q n̄ L / T) J̃.

Here, we use n̄ to denote a reference number density [m⁻³], Q is the scale for charge in [C], and we also introduce M, which represents the scale for mass [kg]. The values of Q and M are set according to the electrons, so that Q = |q_e| and M = m_e. Other scales, such as ψ₀ and A₀, shall be specified later. A natural choice of the scales for L and T are the Debye length and the angular plasma period, which are defined, respectively, by

    L = λ_D = sqrt( ε₀ k_B T̄ / (n̄ q_e²) )  [m],    T = ω_pe⁻¹ = sqrt( m_e ε₀ / (n̄ q_e²) )  [s/rad],

where k_B is the Boltzmann constant, m_e is the electron mass, q_e is the electron charge, and T̄ is an average macroscopic temperature for the plasma. We select these scales for all test problems considered in section 4.6, with the exception of the expanding beam problems, which are the last two test problems in that section. For those problems, the length scale L corresponds to the longest side of the simulation domain and T is the crossing time for a particle injected into the domain. Generally, the user will need to provide a macroscopic temperature T̄ [K] in addition to the reference number density n̄ in order to compute λ_D and ω_pe⁻¹. Note that, in some cases, we shall refer to the plasma period, which can be obtained from the angular plasma period T by multiplying by 2π. Having introduced the definitions for the normalized variables, we proceed to non-dimensionalize the models, beginning with the equations for the particles before addressing the field equations in both the Lorenz and Coulomb gauges.
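The reference scales above are straightforward to compute from user inputs. The small helper below (our own illustration, with standard CODATA values for the physical constants) evaluates λ_D and ω_pe⁻¹ from n̄ and T̄:

```python
import math

# Physical constants in SI units (standard CODATA values).
EPS0 = 8.8541878128e-12   # vacuum permittivity [F/m]
KB   = 1.380649e-23       # Boltzmann constant [J/K]
ME   = 9.1093837015e-31   # electron mass [kg]
QE   = 1.602176634e-19    # elementary charge [C]

def reference_scales(n_bar, T_bar):
    """Debye length L = λ_D [m] and inverse angular plasma frequency
    T = 1/ω_pe [s/rad] from a number density n̄ [m⁻³] and temperature T̄ [K]."""
    L = math.sqrt(EPS0 * KB * T_bar / (n_bar * QE**2))
    T = math.sqrt(ME * EPS0 / (n_bar * QE**2))
    return L, T
```

A useful sanity check on the pair is the identity L/T = λ_D ω_pe = sqrt(k_B T̄ / m_e), the electron thermal speed, which is independent of n̄.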
1.3.1 Equations of Motion in E-B Form

In our development of new field solvers with applications to PIC, it is helpful to benchmark the proposed methods against standard approaches for integrating particles, which use the Newton-Lorentz equations

    dx_i/dt = v_i,   (1.34)
    dv_i/dt = (q_i/m_i)(E(x_i) + v_i × B(x_i)).   (1.35)

After inserting the scales into the position equation (1.34), we obtain the equivalent non-dimensional form

    dx̃_i/dt̃ = ṽ_i.

Following the same process for the velocity equation, after some rearrangement, we obtain

    dṽ_i/dt̃ = (q̃_i/r_i) ( (Q E₀ T)/(M V) Ẽ + (Q B₀ T)/M ṽ_i × B̃ ).

Here, we have introduced the non-dimensional electric and magnetic fields Ẽ and B̃, which are normalized by E₀ and B₀, respectively, and r_i = m_i/M is a mass ratio. From equation (1.11), we can express these scales in terms of ψ₀ and A₀ as

    E₀ = ψ₀/L,    B₀ = A₀/L.

Therefore, the non-dimensionalized equation for the velocity can be expressed in terms of these scales as

    dṽ_i/dt̃ = (q̃_i/r_i) ( (Q ψ₀ T²)/(M L²) Ẽ + (Q A₀ T)/(M L) ṽ_i × B̃ )
            ≡ (q̃_i/r_i) ( α₁ Ẽ + α₂ ṽ_i × B̃ ),

where we have introduced the parameters

    α₁ = (Q ψ₀ T²)/(M L²),    α₂ = (Q A₀ T)/(M L).

We then select ψ₀ and A₀ so that α₁ = α₂ = 1, i.e.,

    ψ₀ = (M L²)/(Q T²),    A₀ = (M L)/(Q T),   (1.36)

which results in the non-dimensional system

    dx_i/dt = v_i,   (1.37)
    dv_i/dt = (q_i/r_i)(E + v_i × B).   (1.38)

Note that we have dropped the tildes for brevity. The next section provides the non-dimensionalized form of the analogous equations that evolve particles using the potentials ψ and A in the generalized momentum framework.

1.3.2 Equations of Motion for the Generalized Hamiltonian

Here, we non-dimensionalize the generalized momentum model for the particles, which is given by equations (1.32) and (1.33). For convenience, the system is

    dx_i/dt = (1/m_i)(P_i − q_i A),
    dP_i/dt = q_i (−∇ψ + (1/m_i)(∇A)·(P_i − q_i A)).

Substituting the scales introduced at the beginning of the section into the position equation, and rearranging terms, we obtain

    dx̃_i/dt̃ = (1/r_i) ( (T P)/(M L) P̃_i − (Q T A₀)/(M L) q̃_i Ã ).

This equation can be simplified further by noting that A₀ is chosen according to (1.36) and the scale for the momentum is P = M L T⁻¹. Therefore, we obtain the non-dimensionalized position equation

    dx̃_i/dt̃ = (1/r_i)(P̃_i − q̃_i Ã) ≡ ṽ_i.

Following an identical treatment for the generalized momentum equation, using the scales set according to (1.36), we obtain

    dP̃_i/dt̃ = q̃_i (−∇̃ψ̃ + (1/r_i)(∇̃Ã)·(P̃_i − q̃_i Ã)).

Therefore, the non-dimensional system is (again, dropping tildes for simplicity)

    dx_i/dt = (1/r_i)(P_i − q_i A) ≡ v_i,   (1.39)
    dP_i/dt = q_i (−∇ψ + (1/r_i)(∇A)·(P_i − q_i A)).   (1.40)

The next sections are focused on the non-dimensionalized form of the field equations cast under the Lorenz and Coulomb gauge conditions.

1.3.3 Maxwell's Equations in the Lorenz Gauge

Substituting the scales introduced at the beginning of the section into equations (1.8) - (1.10), we find that

    (1/c²)(ψ₀/T²) ∂²ψ̃/∂t̃² − (ψ₀/L²) Δ̃ψ̃ = (Q n̄/ε₀) ρ̃,
    (1/c²)(A₀/T²) ∂²Ã/∂t̃² − (A₀/L²) Δ̃Ã = (μ₀ Q n̄ L/T) J̃,
    (A₀/L) ∇̃·Ã + (1/c²)(ψ₀/T) ∂ψ̃/∂t̃ = 0.

The first equation can be rearranged to obtain

    (L²/(c²T²)) ∂²ψ̃/∂t̃² − Δ̃ψ̃ = (L² Q n̄)/(ε₀ ψ₀) ρ̃.

Similarly, with the second equation, we obtain

    (L²/(c²T²)) ∂²Ã/∂t̃² − Δ̃Ã = (L² Q n̄ V)/(c² ε₀ A₀) J̃,

where we have used V = L T⁻¹, as well as the fact that c² = (μ₀ε₀)⁻¹. Finally, the gauge condition becomes

    ∇̃·Ã + (ψ₀ V)/(c² A₀) ∂ψ̃/∂t̃ = 0.

Introducing the normalized speed of light κ = c/V, and selecting ψ₀ and A₀ from (1.36), we find that the above equations simplify to (dropping the tildes)

    (1/κ²) ∂²ψ/∂t² − Δψ = (1/σ₁) ρ,   (1.41)
    (1/κ²) ∂²A/∂t² − ΔA = σ₂ J,   (1.42)
    ∇·A + (1/κ²) ∂ψ/∂t = 0,   (1.43)

where we have introduced the new parameters

    σ₁ = (M ε₀)/(Q² T² n̄),    σ₂ = (Q² L² n̄ μ₀)/M.   (1.44)

These are nothing more than normalized versions of the permittivity and permeability constants in the original equations.

1.3.4 Maxwell's Equations in the Coulomb Gauge

Following an approach identical to the one used for the Lorenz gauge in the previous section, one finds that the non-dimensional form of Maxwell's equations in the Coulomb gauge is given by

    −Δψ = (1/σ₁) ρ,   (1.45)
    (1/κ²) ∂²A/∂t² − ΔA = σ₂ J − (1/κ²) ∂(∇ψ)/∂t,   (1.46)
    ∇·A = 0,   (1.47)

where σ₁ and σ₂ are defined according to (1.44) and κ = c/V is the normalized speed of light.

1.4 Contributions of This Thesis

The remainder of this thesis is devoted to the design of algorithms for solving the VM system using a PIC method. The particles are advanced in the proposed methods according to either the Newton-Lorentz equations (1.37)-(1.38) or the generalized formulation (1.39)-(1.40). In the latter approach, particles are evolved using the smooth potentials ψ and A, as well as their derivatives, which has the advantage of eliminating time derivatives from the gauge formulation. The former approach for moving particles is largely standard and serves not only as a useful benchmark, but also demonstrates ways in which our methods can be incorporated into existing methodologies. While the methods developed in this work are more aligned with the Lorenz gauge formulation of Maxwell's equations (1.41)-(1.43), we also include some results obtained with the formulation (1.45)-(1.47) based on the Coulomb gauge. Chapter 2 provides a discussion of the methods used to evolve the wave equations that appear in the gauge formulations considered in this work. The core algorithms for the fields are quite similar to those presented in the thesis [42], as well as the article [34]. The latter article analyzed stability and characterized other properties of the second-order solvers, including dissipation and dispersion.
This work extends these ideas by introducing new methods for the construction of analytical derivatives that can be obtained directly from the base solver with no degradation in the rate of convergence. This provides a more general approach to the calculation of derivatives, which can be leveraged in problems with non-uniform meshes and non-trivial geometries [40], and costs the same as the base solver. In contrast, the particle methods developed in [41] used centered finite-differences on staggered Cartesian grids, which reduce the accuracy of the fields by one order in space. We also revisit outflow boundary conditions, focusing primarily on the multi-dimensional setting, and propose a form of outflow that is based on extrapolation. Additional details concerning these approaches are provided to clarify ambiguities in some of the earlier presentations. Time and space refinement experiments are used to demonstrate the effectiveness of the proposed approaches. In chapter 3, we discuss parallel algorithms for successive convolution methods (see, e.g., [38, 43, 44, 45]). Our paper [39] developed a nearest-neighbor domain decomposition algorithm by leveraging certain decay properties of the methods, which ultimately allowed the algorithms to run on distributed memory systems. Using the Kokkos performance portability library [46], we assessed the performance of several different strategies for dispatching threads in the shared memory space, focusing on optimizing common looping patterns in the algorithms. Weak and strong scaling experiments were performed on both linear and nonlinear problems to study the efficiency of the algorithms. The focus of chapter 4 concerns the development of new PIC methods for the simulation of plasmas. We first introduce the time stepping methods used to evolve particles in the simulations, including leapfrog integration, as well as the popular Boris rotation method [4].
These methods are used to forge a comparison with the generalized momentum formulation (1.39)-(1.40), which is evolved using algorithms for non-separable Hamiltonian systems [36, 47]. Several techniques are proposed in an effort to control errors in the gauge conditions and enforce charge conservation. We demonstrate the performance of the methods on several test problems, beginning with a single particle test in fixed fields, before moving to more challenging beam problems. The last chapter of this thesis provides a high-level summary of the results. We also discuss several ideas for future work. This includes outlining possible improvements to the methods presented here, as well as other interesting ideas and test problems that there was not enough time to explore.

CHAPTER 2
NUMERICAL METHODS FOR THE FIELD EQUATIONS

In this chapter, we describe algorithms used for wave propagation, which will be used in the formulations of Maxwell's equations discussed in the previous chapter. We begin with a general discussion of Green's function methods and integral equations in section 2.1, which are the foundation of the approaches considered in this work. The discussion of the solvers is primarily focused on two types of second-order accurate (in time) methods, which are presented in section 2.2 in their corresponding semi-discrete forms. Using a dimensional splitting technique, which is presented at the end of the same section, we formulate the solution in terms of one-dimensional operators that can be inverted using the methods discussed in section 2.3. We then discuss the application of boundary conditions in section 2.4, which also includes caveats for multi-dimensional problems. Additionally, we show how to construct derivatives of the fields analytically, which retains the convergence rate of the original method. This section also discusses outflow boundary conditions, which have presented a challenge for this class of methods.
Following [38], we briefly discuss the extension of these methods to higher-order accuracy in time, and we take care to address the complications for boundary conditions in these methods. We present some numerical results in section 2.6, which demonstrate the accuracy of the proposed methods. Lastly, in section 2.7, we provide a brief summary of the contributions in this work.

2.1 Integral Equation Methods and Green's Functions

Integral equation methods or, more generally, Green's function methods, are a powerful class of techniques used in the solution of boundary value problems that occur in a range of applications, including mechanics, fluid dynamics, and electromagnetism. Such methods allow one to write an explicit solution of an elliptic PDE in terms of a fundamental solution or Green's function. While explicit, this solution can be difficult or impossible to evaluate analytically, so numerical quadrature is used to evaluate these terms. So-called layer potentials can then be introduced in the form of surface integrals, which are used to adjust the solution to satisfy the prescribed boundary data. This framework allows one to solve problems in complicated domains without resorting to the use of a mesh. We illustrate the basics of the method with an example. Suppose that we are solving the following modified Helmholtz equation

    (I − (1/α²) Δ) u(x) = S(x),  x ∈ Ω,   (2.1)

where Ω ⊂ R^n, I is the identity operator, Δ is the Laplacian operator in R^n, S is a source term, and α ∈ R is a parameter. While this method can be broadly applied to other elliptic PDEs, equation (2.1) is of interest to us because it can be obtained from the time discretization of a parabolic or hyperbolic PDE. In this case, the source function would include additional time levels of u, and the parameter α = α(Δt) would be connected to the time discretization of the original problem being solved. We shall not prescribe boundary conditions for this equation, and instead consider the most general solution.
To apply a Green's function method to equation (2.1), one first needs to identify a function G(x, y) that solves the equation

    (I − (1/α²) Δ) G(x, y) = δ(x − y),  x, y ∈ R^n,   (2.2)

over free-space, with δ(x − y) being the Dirac delta distribution. There are many ways to approach solving this equation. For example, it is common to take advantage of radial symmetry, so that the problem reduces to a single variable, which can be solved using a Fourier transform. Green's functions have been tabulated for many different operators, including the modified Helmholtz operator. Therefore, we shall not elaborate on this further and assume that the fundamental solution G(x, y) is readily available. We now show the connection between the fundamental solution G(x, y), which is defined on free-space and solves (2.2), and the original problem (2.1). First, let u be a solution of the problem (2.1). If we multiply equation (2.2) by u and integrate over Ω, then

    ∫_Ω (I − (1/α²) Δ) G(x, y) u(y) dV_y = ∫_Ω δ(x − y) u(y) dV_y = u(x),   (2.3)

using properties of the delta distribution. The left side of this equation can be addressed using integration by parts. First, we split the left side into two terms:

    ∫_Ω (I − (1/α²) Δ) G(x, y) u(y) dV_y = ∫_Ω G(x, y) u(y) dV_y − (1/α²) ∫_Ω ΔG(x, y) u(y) dV_y.   (2.4)

Using integration by parts and the divergence theorem, we find that the second integral can be expressed as

    ∫_Ω ΔG(x, y) u(y) dV_y = ∫_Ω ∇·(∇G(x, y) u(y)) dV_y − ∫_Ω ∇G(x, y)·∇u(y) dV_y
                           = ∫_{∂Ω} (∂G/∂n) u(y) dS_y − ∫_Ω ∇G(x, y)·∇u(y) dV_y.

Here, we have used ∂/∂n to denote the normal derivative. If we, again, apply integration by parts along with the divergence theorem to the second integral, we find that

    ∫_Ω ∇G(x, y)·∇u(y) dV_y = ∫_Ω ∇·(G(x, y) ∇u(y)) dV_y − ∫_Ω G(x, y) Δu(y) dV_y
                            = ∫_{∂Ω} G(x, y) (∂u/∂n) dS_y − ∫_Ω G(x, y) Δu(y) dV_y.

Combining each of these results with the relation (2.4), and using this in place of the left side of (2.3), we obtain, after some simplifications, the equation

    u(x) = ∫_Ω G(x, y) (I − (1/α²) Δ) u(y) dV_y + (1/α²) ∫_{∂Ω} ( G(x, y) (∂u/∂n) − (∂G/∂n) u(y) ) dS_y.

Finally, since u solves the PDE (2.1), the above equation is equivalent to

    u(x) = ∫_Ω G(x, y) S(y) dV_y + (1/α²) ∫_{∂Ω} ( G(x, y) (∂u/∂n) − (∂G/∂n) u(y) ) dS_y.   (2.5)

Since the volume integral term does not enforce boundary conditions, the surface integral contributions involving u are replaced with layer potentials to ensure that these conditions are satisfied. Therefore, the general solution, shown above, is replaced with the ansatz

    u(x) = ∫_Ω G(x, y) S(y) dV_y + ∫_{∂Ω} ( σ(y) G(x, y) + γ(y) (∂G/∂n) ) dS_y,   (2.6)

where σ(y) is the single-layer potential and γ(y) is the double-layer potential, which must now be determined. The names reflect the behavior of the Green's function associated with each of the terms. The Green's function itself is continuous, but its derivative will have a jump. Based on the boundary conditions, one selects either a single or double layer form as the ansatz for the solution. The single layer form is used in the Neumann problem, while the double layer form is chosen for the Dirichlet problem. Numerically evaluating the solution (2.6) first requires a discretization of the integrals using quadrature. The evaluation proceeds by computing the volume integral term, which is the particular solution to the problem (2.1). Using the particular solution, one obtains the homogeneous solution with modified boundary data that accounts for the particular solution's contributions along the boundary. This step for the homogeneous solution requires the identification of σ(y) or γ(y) at the quadrature points taken along the domain boundary, which results in a dense linear system.
In contrast to other classes of solvers, e.g., finite-element or finite-difference schemes, these linear systems are usually well-conditioned, so the application of an iterative solver, such as the GMRES method [27], converges in only a few iterations, with an iteration count that is independent of the number of quadrature points. For efficiency, the fast multipole method (FMM) [48] can be used to reduce the evaluation time required in the GMRES iterations [49]. The algorithms presented in the subsequent sections are essentially a one-dimensional analogue of these methods. Rather than invert the multi-dimensional operator corresponding to (2.6), the methods presented here, instead, factor the Laplacian and invert one-dimensional operators, dimension-by-dimension, using the one-dimensional form of (2.6). We will see that the resulting methods solve for something that looks like a layer potential, with the key difference being that the linear system is now a 2 × 2 matrix that can be inverted by hand rather than with an iterative method. Similarly, the particular solution along a given line segment can be rapidly computed with a lightweight, recursive, fast summation method, rather than a more complicated method, such as the FMM. Moreover, these methods retain geometric flexibility, since the domain can be represented using one-dimensional line segments with termination points specified by the geometry.

2.2 Semi-discrete Schemes for the Wave Equation

Here we provide a description of the second-order accurate wave solvers based on backward-difference (BDF) and time-centered discretizations. We show how to derive the semi-discrete equations associated with each of the methods, which take the form of modified Helmholtz equations (2.1). Then, we discuss the splitting technique that is used for multi-dimensional problems.
2.2.1 The BDF Scheme

To derive the BDF form of the wave solver, we start with the equation
$$\frac{1}{c^2} \frac{\partial^2 u}{\partial t^2} - \Delta u = S(\mathbf{x}, t), \quad (2.7)$$
where $c$ is the wave speed and $S$ is a source function. Then, using the notation $u(\mathbf{x}, t^n) = u^n$, we can apply a second-order accurate backward finite-difference stencil for the second derivative
$$\left. \frac{\partial^2 u}{\partial t^2} \right|_{t = t^{n+1}} = \frac{2u^{n+1} - 5u^n + 4u^{n-1} - u^{n-2}}{\Delta t^2} + \mathcal{O}(\Delta t^2),$$
where $\Delta t = t^k - t^{k-1}$, for any $k$, is the grid spacing in time. Evaluating the remaining terms in equation (2.7) at time level $t^{n+1}$, and inserting the above difference approximation, we obtain
$$\frac{1}{c^2 \Delta t^2} \left( 2u^{n+1} - 5u^n + 4u^{n-1} - u^{n-2} \right) - \Delta u^{n+1} = S^{n+1}(\mathbf{x}) + \mathcal{O}\left( \frac{1}{\alpha^2} \right),$$
which can be rearranged to obtain the semi-discrete equation
$$\left( \mathcal{I} - \frac{1}{\alpha^2}\Delta \right) u^{n+1} = \frac{1}{2}\left( 5u^n - 4u^{n-1} + u^{n-2} \right) + \frac{1}{\alpha^2} S^{n+1}(\mathbf{x}) + \mathcal{O}\left( \frac{1}{\alpha^4} \right), \quad (2.8)$$
where we have introduced the parameter $\alpha = \sqrt{2}/(c\Delta t)$. We note that the source term is treated implicitly in this method, which creates additional complications if the source function $S$ depends on $u$. This necessitates some form of iteration, which increases the cost of the method. Stability properties for the semi-discrete equation (2.8) were presented in the thesis [42], which showed that the method, above, is purely diffusive and unconditionally stable. Higher-order BDF methods can be obtained simply by using a wider finite-difference stencil to approximate the time derivative $\partial_{tt} u$. While moving to higher order reduces the (overly) diffusive nature of the second-order method, there are other concerns surrounding the stability of such methods.

2.2.2 Time-centered Scheme

In the semi-discrete form of the second-order BDF method, we saw that the source term was treated implicitly, which, in some cases, requires some form of iteration. If the iteration procedure converges slowly, this can increase the cost of the method significantly. Another approach to deal with this problem is to use a time-centered method, in which the source is treated explicitly.
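Before detailing the centered scheme, the backward stencil used in the BDF-2 method of the previous subsection can be verified numerically. The following sketch (Python with NumPy; the test function $u(t) = \sin t$ and the step sizes are illustrative choices, not taken from the thesis) evaluates the truncation error of the four-point stencil at two step sizes and estimates the observed order of accuracy.

```python
import numpy as np

def bdf2_second_derivative(u, t, dt):
    """Backward stencil (2u^{n+1} - 5u^n + 4u^{n-1} - u^{n-2}) / dt^2,
    evaluated at t = t^{n+1}, as in the BDF-2 semi-discrete scheme."""
    return (2*u(t) - 5*u(t - dt) + 4*u(t - 2*dt) - u(t - 3*dt)) / dt**2

u = np.sin
exact = -np.sin(1.0)                  # u''(t) for u = sin, at t = 1

# Halving dt should reduce the error by roughly a factor of four.
err = [abs(bdf2_second_derivative(u, 1.0, dt) - exact) for dt in (0.01, 0.005)]
order = np.log2(err[0] / err[1])
print(order)                          # close to 2
```

The stencil is exact for cubic polynomials, so the leading error term involves the fourth derivative of $u$, consistent with the stated $\mathcal{O}(\Delta t^2)$ accuracy.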
In this case, the second time derivative is approximated with a second-order centered difference 𝜕 2𝑢 𝑢 𝑛+1 − 2𝑢 𝑛 + 𝑢 𝑛−1 = + O (Δ𝑡 2 ), 𝜕𝑡 2 𝑡=𝑡 𝑛 Δ𝑡 2 where, again, Δ𝑡 = 𝑡 𝑘 − 𝑡 𝑘−1 , for any 𝑘, is the grid spacing in time. Evaluating the equation (2.7) at time level 𝑡 𝑛 , and using this difference approximation, we obtain 1  𝑛+1  𝑢 − 2𝑢 𝑛 + 𝑢 𝑛−1 − Δ𝑢 𝑛 = 𝑆 𝑛 (x) + O (Δ𝑡 2 ). (2.9) 𝑐 Δ𝑡 2 2 To obtain a semi-discrete equation of the form (2.1) for the data 𝑢 𝑛+1 , the Laplacian term is made implicit through the introduction of the term   𝑢 𝑛+1 − 2𝑢 𝑛 + 𝑢 𝑛−1 Δ , 𝛽 ∈ R, 𝛽2 which is added and subtracted from both sides of (2.9) to obtain (after some rearrangement) 𝛽2  𝑛+1 𝑛 𝑛−1   𝑛+1 2 𝑛 𝑛−1  𝑢 − 2𝑢 + 𝑢 − Δ 𝑢 − (2 − 𝛽 )𝑢 + 𝑢 𝑐2 Δ𝑡 2     = 𝛽2 𝑆 𝑛 (x) − Δ 𝑢 𝑛+1 − 2𝑢 𝑛 + 𝑢 𝑛−1 + O 𝛽2 Δ𝑡 2 . 28 𝛽4 To make the semi-discrete equation take the form (2.1), we add 𝑐2 Δ𝑡 2 𝑢𝑛 to both sides of the equation, so that  2  𝛽  𝑛+1  𝛽4 𝑛 − Δ 𝑢 − (2 − 𝛽 2 𝑛 )𝑢 + 𝑢 𝑛−1 = 𝑢 + 𝛽2 𝑆 𝑛 (x) 𝑐2 Δ𝑡 2 𝑐2 Δ𝑡 2     − Δ 𝑢 𝑛+1 − 2𝑢 𝑛 + 𝑢 𝑛−1 + O 𝛽2 Δ𝑡 2 . If we now write 𝛼 = 𝛽/(𝑐Δ𝑡), and multiply through by 1/𝛼2 , this equation can be written as    2 1  𝑛+1 2 𝑛 𝑛−1  2 𝑛 𝛽2 𝑛 𝛽 I − 2 Δ 𝑢 − (2 − 𝛽 )𝑢 + 𝑢 = 𝛽 𝑢 + 2 𝑆 (x) + O 4 , (2.10) 𝛼 𝛼 𝛼 where we have used the fact that the Laplacian term on the right side satisfies     𝑛+1 𝑛 𝑛−1 𝑛 2 4 2 1 Δ 𝑢 − 2𝑢 + 𝑢 = Δ (𝜕𝑡𝑡 𝑢 ) Δ𝑡 + O (Δ𝑡 ) = O (Δ𝑡 ) ≡ O 2 . 𝛼 In contrast to the semi-discrete equation (2.8), it was shown that the time-centered update (2.10) is purely dispersive [42]. Through stability analysis, it was shown that an unconditionally stable scheme could be obtained as long as 0 < 𝛽 ≤ 2. 2.2.3 Splitting Method Used for Multi-dimensional Problems The semi-discrete equations (2.8) and (2.10) are both modified Helmholtz equations of the form (2.1). Rather than appealing to (2.6), which formally inverts the multi-dimensional modified Helmholtz operator, we apply a factorization into a product of one dimensional operators. 
For example, in two-spatial dimensions the factorization writes    1 1 1 1 I − 2 Δ = I − 2 𝜕𝑥𝑥 I − 2 𝜕𝑦𝑦 + 4 𝜕𝑥𝑥 𝜕𝑦𝑦 , 𝛼 𝛼 𝛼 𝛼 1 ≡ L𝑥 L 𝑦 + 4 𝜕𝑥𝑥 𝜕𝑦𝑦 , 𝛼 where L𝑥 and L 𝑦 are one-dimensional operators and the last term represents the splitting error associated with the factorization step. For second-order accuracy in time, the coefficient of the splitting error is 1/𝛼4 = O (Δ𝑡 4 ), so we shall ignore this term. Therefore, the semi-discrete equation (2.8) and (2.10) can be written more compactly (dropping error terms) as √ 𝑛+1 1 𝑛 𝑛−1 𝑛−2  1 𝑛+1 2 L𝑥 L 𝑦 𝑢 = 5𝑢 − 4𝑢 +𝑢 + 2 𝑆 (x), 𝛼 := , (2.11) 2 𝛼 𝑐Δ𝑡 29 and  𝑛+1 2 𝑛 𝑛−1  2 𝑛 𝛽2 𝑛 𝛽 L𝑥 L 𝑦 𝑢 − (2 − 𝛽 )𝑢 + 𝑢 = 𝛽 𝑢 + 2 𝑆 (x), 𝛼 := , 0 < 𝛽 ≤ 2, (2.12) 𝛼 𝑐Δ𝑡 respectively. 2.3 Inverting One-dimensional Operators The choice of factoring the multi-dimensional modified Helmholtz operator means we now have to solve a sequence of one-dimensional boundary value problems (BVPs) of the form   1 I − 2 𝜕𝑥𝑥 𝑤(𝑥) = 𝑓 (𝑥), 𝑥 ∈ [𝑎, 𝑏], (2.13) 𝛼 where [𝑎, 𝑏] is a one-dimensional line and 𝑓 is a new source term that can be used to represent a time history or an intermediate variable constructed from the inversion of an operator along another direction. We also point out that the parameter 𝛼 depends on the choice of the semi-discrete scheme √ employed to solve the problem. For the BDF scheme 𝛼 = 2/(𝑐Δ𝑡), while the centered scheme uses 𝛼 = 𝛽/(𝑐Δ𝑡), with 0 < 𝛽 ≤ 2. We will show the process by which one obtains the general solution to the problem (2.13), deferring the application of boundary conditions to section 2.4. 2.3.1 Integral Solution Since the BVP (2.13) is linear, its general solution can be expressed using the one dimensional analogue of equation (2.5): ∫ 𝑦=𝑏 𝑏 1   𝑤(𝑥) = 𝐺 (𝑥, 𝑦) 𝑓 (𝑦) 𝑑𝑦 + 2 𝐺 (𝑥, 𝑦)𝜕𝑦 𝑢(𝑦) − 𝑢(𝑦)𝜕𝑦 𝐺 (𝑥, 𝑦) . (2.14) 𝑎 𝛼 𝑦=𝑎 A simple way to obtain the free-space Green’s function for this problem is to use a Fourier transform. In Fourier space, this equation reads   𝑘2 b 𝛼2 1 + 2 𝐺 = 1 =⇒ 𝐺 = 2 b . 
𝛼 𝛼 + 𝑘2 A closely related Fourier transform is obtained with the function h i 2𝜆 −𝜆|𝑥| F 𝑒 = , 𝜆2 + 𝑘2 30 from which, it follows that   𝜆 −𝜆|𝑥| 𝜆2 F 𝑒 = 2 . 2 𝜆 + 𝑘2 Therefore, matching transforms, it follows that the free-space Green’s function in one-dimension is 𝛼 −𝛼|𝑥−𝑦| 𝐺 (𝑥, 𝑦) = 𝑒 . (2.15) 2 To use the relation (2.14), we need to compute the derivatives in the Green’s function. We note that   𝛼𝑒 −𝛼(𝑥−𝑦) ,  𝑥 ≥ 𝑦,    𝜕𝑦 𝐺 (𝑥, 𝑦) =  −𝛼𝑒 −𝛼(𝑦−𝑥) ,    𝑥 < 𝑦.  Taking limits, we find that lim 𝜕𝑦 𝐺 (𝑥, 𝑦) = 𝛼𝑒 −𝛼(𝑥−𝑎) , 𝑦→𝑎 lim 𝜕𝑦 𝐺 (𝑥, 𝑦) = −𝛼𝑒 −𝛼(𝑏−𝑥) . 𝑦→𝑏 Combining these limits with (2.15) and (2.14), we obtain the general solution ∫ 𝛼 𝑏 −𝛼|𝑥−𝑦| 𝑤(𝑥) = 𝑒 𝑓 (𝑦) 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.16) 2 𝑎 where 𝐴 and 𝐵 are constants that are determined by boundary conditions. Comparing with (2.6), these terms serve the same purpose as the layer potentials. Further, we identify the general solution (2.16) as the inverse of the one-dimensional modified Helmholtz operator. In other words, we define L𝑥−1 so that 𝑤(𝑥) = L𝑥−1 [ 𝑓 ] (𝑥), (2.17) ∫ 𝛼 𝑏 −𝛼|𝑥−𝑦| ≡ 𝑒 𝑓 (𝑦) 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.18) 2 𝑎 ≡ I𝑥 [ 𝑓 ] (𝑥) + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) . (2.19) Section 2.4 will make repeated use of (2.17)-(2.19) to illustrate the application of boundary con- ditions. Additionally, the result (2.16) is general enough that it can be used for both the BDF and time-centered schemes. 31 2.3.2 Fast Summation Method In order to compute the inverse operators according to (2.17)-(2.19), suppose we have discretized the one-dimensional computational domain [𝑎, 𝑏] into a mesh consisting of 𝑁 + 1 grid points: 𝑎 = 𝑥0 < 𝑥1 < · · · < 𝑥 𝑁 = 𝑏, with the spacing defined by Δ𝑥 𝑗 = 𝑥 𝑗 − 𝑥 𝑗−1 , 𝑗 = 1, · · · 𝑁. If we directly evaluate the function 𝑤(𝑥) at each of the mesh points, according to (2.18), we obtain ∫ 𝑏 𝛼 𝑤(𝑥𝑖 ) = 𝑒 −𝛼|𝑥𝑖 −𝑦| 𝑓 (𝑦) 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥𝑖 −𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥𝑖 ) , 𝑖 = 0, · · · , 𝑁 + 1. 
Since the evaluation of the integral term in the variable $y$ with quadrature requires $\mathcal{O}(N)$ operations, this direct approach requires a total of $\mathcal{O}(N^2)$ operations. With the aid of a recursive fast summation algorithm, the cost of evaluating these terms can be reduced from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$. To this end, it is helpful to introduce the operators
$$\mathcal{I}_x^R[f](x) \equiv \alpha \int_a^x e^{-\alpha(x-y)} f(y)\, dy, \quad (2.20)$$
$$\mathcal{I}_x^L[f](x) \equiv \alpha \int_x^b e^{-\alpha(y-x)} f(y)\, dy, \quad (2.21)$$
so that the total integral over $[a,b]$ can be expressed as the average of these operators
$$\mathcal{I}_x[f](x) = \frac{1}{2}\left( \mathcal{I}_x^R[f](x) + \mathcal{I}_x^L[f](x) \right). \quad (2.22)$$
The task now reduces to computing the integrals (2.20) and (2.21) in an efficient manner. To develop a recursive expression, consider evaluating the integral (2.20) at a grid point $x_i$. Then, it follows that
$$\begin{aligned} \mathcal{I}_x^R[f](x_i) &= \alpha \int_a^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy, \\ &= \alpha \int_a^{x_{i-1}} e^{-\alpha(x_i - y)} f(y)\, dy + \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy, \\ &= e^{-\alpha \Delta x_i} \left( \alpha \int_a^{x_{i-1}} e^{-\alpha(x_{i-1} - y)} f(y)\, dy \right) + \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy, \\ &\equiv e^{-\alpha \Delta x_i}\, \mathcal{I}_x^R[f](x_{i-1}) + \mathcal{J}_x^R[f](x_i). \end{aligned}$$
In the last line, we have introduced the local integral
$$\mathcal{J}_x^R[f](x_i) = \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy.$$
We see that the integral (2.20) can be expressed through a recursive weighting of its previous values plus an additional term that is localized in space. In the next chapter, we use a variation of this method to develop a domain decomposition algorithm that allows the method to scale on parallel computers. To initialize the recursion, we set $\mathcal{I}_x^R[f](x_0) = 0$, which follows directly from the definition (2.20). Since the calculation of the local integrals $\mathcal{J}_x^R[f](x_i)$ employs quadrature over a collection of $M$ points, the cost of computing (2.20) is now of the form $\mathcal{O}(MN)$.
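The recursion can be checked against a closed form. For $f \equiv 1$, the right-moving integral is $\mathcal{I}_x^R[1](x) = 1 - e^{-\alpha(x - a)}$, and the exact local integrals are $\mathcal{J}_x^R[1](x_i) = 1 - e^{-\alpha \Delta x_i}$. The sketch below (Python/NumPy; the grid and $\alpha$ are illustrative choices) runs the recursion with these exact local values and compares against the closed form at every grid point.

```python
import numpy as np

alpha, a, b, N = 5.0, 0.0, 2.0, 200
x = np.linspace(a, b, N + 1)
dx = np.diff(x)                          # cell widths Delta x_i, i = 1..N

# Recursion: I^R(x_i) = e^{-alpha dx_i} I^R(x_{i-1}) + J^R(x_i),
# with local integrals J^R computed exactly for f = 1.
J = 1.0 - np.exp(-alpha * dx)
IR = np.zeros(N + 1)                     # I^R(x_0) = 0
for i in range(1, N + 1):
    IR[i] = np.exp(-alpha * dx[i - 1]) * IR[i - 1] + J[i - 1]

closed_form = 1.0 - np.exp(-alpha * (x - a))   # alpha * int_a^x e^{-alpha(x-y)} dy
print(np.max(np.abs(IR - closed_form)))        # near machine precision
```

Note that a single pass over the grid produces the integral at every mesh point, which is the source of the $\mathcal{O}(N)$ cost.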
Furthermore, since the number of localized quadrature points $M$ is independent of the mesh size $N$, and we additionally select $M \ll N$, the resulting approach scales as $\mathcal{O}(N)$. A similar argument is made for the second integral (2.21). In summary, the fast summation method computes the integrals (2.20) and (2.21) according to
$$\mathcal{I}_x^R[f](x_i) = e^{-\alpha \Delta x_i}\, \mathcal{I}_x^R[f](x_{i-1}) + \mathcal{J}_x^R[f](x_i), \quad \mathcal{I}_x^R[f](x_0) = 0, \quad i = 1, \cdots, N, \quad (2.23)$$
$$\mathcal{I}_x^L[f](x_i) = e^{-\alpha \Delta x_{i+1}}\, \mathcal{I}_x^L[f](x_{i+1}) + \mathcal{J}_x^L[f](x_i), \quad \mathcal{I}_x^L[f](x_N) = 0, \quad i = 0, \cdots, N-1, \quad (2.24)$$
where the local integrals are defined by
$$\mathcal{J}_x^R[f](x_i) = \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - y)} f(y)\, dy, \quad i = 1, \cdots, N, \quad (2.25)$$
$$\mathcal{J}_x^L[f](x_i) = \alpha \int_{x_i}^{x_{i+1}} e^{-\alpha(y - x_i)} f(y)\, dy, \quad i = 0, \cdots, N-1. \quad (2.26)$$
Next, we discuss the approximations used in the evaluation of the local integrals.

2.3.3 Approximating the Local Integrals

Here, we present the general process used to obtain quadrature rules for the local integrals defined by (2.25) and (2.26), in the case of a uniform grid, i.e., $\Delta x = x_j - x_{j-1}$, $j = 1, \cdots, N$. Rather than use numerical quadrature rules, e.g., Gaussian quadrature or Newton-Cotes formulas, for purposes of stability, it was discovered that a certain form of analytical integration was required [50]. In this approach, the operand $f(x)$ is approximated by an interpolating function, which is then analytically integrated against the kernel. We provide a sketch of the approach to illustrate the idea. Specific details can be found in a number of papers, e.g., [34, 43, 44]. First, it is helpful to transform the integrals (2.25) and (2.26) using a change of variables. Consider the integral (2.25) and let
$$y = (x_j - x_{j-1})\tau + x_{j-1} \equiv \Delta x\, \tau + x_{j-1}, \quad \tau \in [0, 1].$$
Then we can write
$$\mathcal{J}_x^R[f](x_i) = \alpha \Delta x\, e^{-\alpha \Delta x} \int_0^1 e^{\alpha \tau \Delta x} f(\tau \Delta x + x_{i-1})\, d\tau. \quad (2.27)$$
Next, we approximate $f$ in (2.27) using interpolation of some desired order of accuracy.
As an example, suppose that we want to use linear interpolation with the data { 𝑓𝑖−1 , 𝑓𝑖 } using the basis {1, 𝑥 − 𝑥𝑖−1 }, which is shifted for convenience to cancel with the shift in (2.27). A direct calculation shows that the interpolating polynomial is 𝑓𝑖 − 𝑓𝑖−1 𝑝(𝑥) = 𝑓𝑖−1 + (𝑥 − 𝑥𝑖−1 ) . Δ𝑥 34 By replacing 𝑓 in (2.27) with the above interpolant, and integrating the result analytically, we find that ∫ 1   −𝛼Δ𝑥 J𝑥𝑅 [ 𝑓 ] (𝑥𝑖 ) ≈ 𝛼Δ𝑥𝑒 𝑒 𝛼𝜏Δ𝑥 𝑓𝑖−1 + ( 𝑓𝑖 − 𝑓𝑖−1 ) 𝜏 𝑑𝜏, 0 = 𝑤 0 𝑣 𝑖−1 + 𝑤 1 𝑣 𝑖 , where the weights for integration are 1 − 𝑒 −𝛼Δ𝑥 − 𝛼Δ𝑥𝑒 −𝛼Δ𝑥 𝑤0 = , 𝛼Δ𝑥 (𝛼Δ𝑥 − 1) + 𝑒 −𝛼Δ𝑥 𝑤1 = . 𝛼Δ𝑥 Modifications of the above can be made to accommodate additional interpolation points, as well as techniques for shock capturing. In the latter case, methods have been devised following the idea of WENO reconstruction [51] to create quadrature methods that can address non-smooth features including shocks and cusps [44, 45, 52]. Additional details on the WENO quadrature, including the reconstruction stencils can be found in chapter 3. In [45], we developed a quadrature rule using the exponential polynomial basis, which offers additional flexibility in capturing localized features through a “shape" parameter introduced in the basis. These tools offer a promising approach to addressing problems with discontinuities in the material properties as well as more complex domains with non-smooth boundaries. Despite the notable differences in the type of approximating function used for the operand, the process is essentially identical to the example shown here. We also wish to point out that certain issues may arise when 𝛼 ≫ 1 (i.e., Δ𝑡 ≪ 1). In such circumstances, when the weights are computed on-the-fly, the kernel function can be replaced with a Taylor expansion [38]. Otherwise, this results in a “narrow" Green’s function that is vastly under-resolved by the mesh, which causes wave phenomena to remain stagnant. 
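The linear-interpolation weights $w_0$ and $w_1$ above can be sanity-checked numerically: because the rule integrates the linear interpolant against the kernel exactly, it must reproduce the local integral $\mathcal{J}_x^R[f](x_i)$ exactly whenever $f$ itself is linear. The sketch below (Python/NumPy; the values of $\alpha$, $\Delta x$, and the linear operand are illustrative) compares the weighted sum $w_0 f_{i-1} + w_1 f_i$ against a fine composite-midpoint approximation of the local integral.

```python
import numpy as np

alpha, dx = 3.0, 0.1
nu = alpha * dx

# Weights obtained by integrating the linear interpolant analytically.
w0 = (1.0 - np.exp(-nu) - nu * np.exp(-nu)) / nu
w1 = ((nu - 1.0) + np.exp(-nu)) / nu

# Linear operand on the cell [x_{i-1}, x_i]; exactness should hold.
xi = 1.0
f = lambda y: 2.0 + 3.0 * y
J_weights = w0 * f(xi - dx) + w1 * f(xi)

# Reference: fine midpoint rule for J = alpha * int e^{-alpha(x_i - y)} f(y) dy.
y = np.linspace(xi - dx, xi, 20001)
ym = 0.5 * (y[:-1] + y[1:])
J_ref = alpha * np.sum(np.exp(-alpha * (xi - ym)) * f(ym)) * dx / 20000
print(abs(J_weights - J_ref))       # small
```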
Our experience has found this situation to be quite rare, but it is something to be aware of when a small CFL number is used in a simulation. 35 2.4 Applying Boundary Conditions In this section, we discuss the application of boundary conditions for the schemes based on BDF and centered time discretizations. Boundary conditions are presented for these methods from the perspective of one-dimensional problems in sections 2.4.1 and 2.4.2. Lastly, in section 2.4.3, we describe how these conditions can be used in multi-dimensional problems. 2.4.1 BDF Method The update for second order BDF method, in one-spatial dimension, can be obtained by combining (2.16) with the semi-discrete equation (2.8). Defining the operand 1 𝑛  1 𝑅(𝑥) = 5𝑢 − 4𝑢 𝑛−1 + 𝑢 𝑛−2 (𝑥) + 2 𝑆 𝑛+1 (𝑥), 2 𝛼 we obtain ∫ 𝑏 𝛼 𝑢 𝑛+1 (𝑥) = 𝑒 −𝛼|𝑥−𝑦| 𝑅(𝑦) 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.28) 2 𝑎 ≡ I𝑥 [𝑅] (𝑥) + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.29) where we have used I𝑥 [·] to denote the term involving the convolution integral which is not to be confused with the identity operator. Applying different boundary conditions amounts to determining the values of 𝐴 and 𝐵 used in (2.29). In the description of boundary conditions for the method, we shall assume that the boundary conditions at the ends of the one-dimensional domain are the same. Using slight variations of the methods illustrated below, one can mix the boundary conditions at the ends of the line segments. In order to enforce conditions on the derivatives of the solution, we will also need to compute a derivative of the update (2.28) (equivalently (2.29)). For this, we observe that the dependency for 𝑥 appears only on analytical functions, i.e., the Green’s function (kernel) and the exponential functions in the boundary terms. To differentiate (2.28) we start with the definition (2.22), which splits the integral at the point 𝑦 = 𝑥 and makes the kernel easier to manipulate. 
Then, using the fundamental theorem of calculus, we can calculate derivatives of (2.20) and (2.21) to find that
$$\frac{d}{dx}\left( \mathcal{I}_x^R[f](x) \right) = \frac{d}{dx}\left( \alpha \int_a^x e^{-\alpha(x-y)} f(y)\, dy \right) = -\alpha \mathcal{I}_x^R[f](x) + \alpha f(x), \quad (2.30)$$
$$\frac{d}{dx}\left( \mathcal{I}_x^L[f](x) \right) = \frac{d}{dx}\left( \alpha \int_x^b e^{-\alpha(y-x)} f(y)\, dy \right) = \alpha \mathcal{I}_x^L[f](x) - \alpha f(x). \quad (2.31)$$
These results can be combined according to (2.22), which provides an expression for the derivative of the convolution term:
$$\frac{d}{dx}\left( \mathcal{I}_x[f](x) \right) = \frac{\alpha}{2}\left( -\mathcal{I}_x^R[f](x) + \mathcal{I}_x^L[f](x) \right). \quad (2.32)$$
Additionally, by evaluating this equation at the ends of the interval, we obtain the identities
$$\frac{d}{dx}\left( \mathcal{I}_x[f] \right)(a) = \alpha \mathcal{I}_x[f](a), \quad (2.33)$$
$$\frac{d}{dx}\left( \mathcal{I}_x[f] \right)(b) = -\alpha \mathcal{I}_x[f](b), \quad (2.34)$$
which are helpful in enforcing the boundary conditions. The relation (2.32) can be used to obtain a derivative for the solution at the new time level. From the update (2.29), a direct computation reveals that
$$\frac{du^{n+1}}{dx} = \frac{\alpha}{2}\left( -\mathcal{I}_x^R[R](x) + \mathcal{I}_x^L[R](x) \right) - \alpha A e^{-\alpha(x-a)} + \alpha B e^{-\alpha(b-x)}. \quad (2.35)$$
Notice that no additional approximations have been made beyond what is needed to compute $\mathcal{I}_x^R$ and $\mathcal{I}_x^L$, which are needed in the base method. For this reason, we think of equation (2.35) as an analytical derivative. The boundary coefficients $A$ and $B$ appearing in (2.35) will be calculated in the same way as in the update (2.29), as discussed in the remaining subsections. This treatment ensures that the discrete derivative will be consistent with the conditions imposed on the solution variable.

2.4.1.1 Dirichlet Boundary Conditions

Suppose we are given the function values along the boundary, which are represented by the data
$$u^{n+1}(a) = g_a\left(t^{n+1}\right), \quad u^{n+1}(b) = g_b\left(t^{n+1}\right).$$
If we evaluate the BDF-2 update (2.29) at the ends of the interval, we obtain the conditions
$$g_a\left(t^{n+1}\right) = \mathcal{I}_x[R](a) + A + \mu B,$$
$$g_b\left(t^{n+1}\right) = \mathcal{I}_x[R](b) + \mu A + B,$$
where we have defined $\mu = e^{-\alpha(b-a)}$. This is a simple linear system for the boundary coefficients $A$ and $B$, which can be inverted by hand.
Proceeding, we find that    𝑔𝑎 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑎) − 𝜇 𝑔𝑏 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑏) 𝐴= , 1 − 𝜇2    𝑔𝑏 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑏) − 𝜇 𝑔𝑎 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑎) 𝐵= . 1 − 𝜇2 2.4.1.2 Neumann Boundary Conditions We can also enforce conditions on the derivatives at the end of the domain. Given the Neumann data 𝑑𝑢 𝑛+1 (𝑎)   𝑑𝑢 𝑛+1 (𝑏)   = ℎ𝑎 𝑡 𝑛+1 , = ℎ 𝑏 𝑡 𝑛+1 , 𝑑𝑥 𝑑𝑥 we can evaluate the derivative formula for the update (2.35) and use the identities (2.33) and (2.34). Performing these evaluations, we obtain the system of equations 1   −𝐴 + 𝜇𝐵 = ℎ𝑎 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑎), 𝛼 1   𝑛+1 −𝜇𝐴 + 𝐵 = ℎ 𝑏 𝑡 + I𝑥 [𝑅] (𝑏), 𝛼 where, again, 𝜇 = 𝑒 −𝛼(𝑏−𝑎) . Solving this system, we find that     1 𝑛+1 − I [𝑅] (𝑎) − 𝜇 1 ℎ 𝑡 𝑛+1 + I [𝑅] (𝑏) 𝛼 ℎ 𝑎 𝑡 𝑥 𝛼 𝑏 𝑥 𝐴=− , 1 − 𝜇2       𝜇 𝛼1 ℎ𝑎 𝑡 𝑛+1 − I𝑥 [𝑅] (𝑎) − 𝛼1 ℎ 𝑏 𝑡 𝑛+1 + I𝑥 [𝑅] (𝑏) 𝐵=− . 1 − 𝜇2 We note that Robin boundary conditions, which combine Dirichlet and Neumann conditions can be enforced in a nearly identical way. 38 2.4.1.3 Periodic Boundary Conditions Periodic boundary conditions are enforced by taking 𝑢 𝑛+1 (𝑎) = 𝑢 𝑛+1 (𝑏), 𝜕𝑥 𝑢 𝑛+1 (𝑎) = 𝜕𝑥 𝑢 𝑛+1 (𝑏). Enforcing these conditions through the update (2.29) and its derivative (2.35), using the identities (2.33)-(2.34), leads to the system of equations (1 − 𝜇) 𝐴 + (𝜇 − 1)𝐵 = I𝑥 [𝑅] (𝑏) − I𝑥 [𝑅] (𝑎), (𝜇 − 1) 𝐴 + (𝜇 − 1)𝐵 = −I𝑥 [𝑅] (𝑏) − I𝑥 [𝑅] (𝑎), with 𝜇 = 𝑒 −𝛼(𝑏−𝑎) . The solution of this system, after some simplifications is given by I𝑥 [𝑅] (𝑏) 𝐴= , 1−𝜇 I𝑥 [𝑅] (𝑎) 𝐵= . 1−𝜇 2.4.1.4 Outflow Boundary Conditions In problems defined over free-space, we must allow for waves to exit the computational domain. Additionally, as the waves exit the domain, we would like to minimize the number of reflections, which are non-physical, along this boundary. Exit conditions can be formulated in one spatial dimension in the sense of characteristics, by requiring that 𝜕𝑢 𝜕𝑢 −𝑐 = 0, 𝑥 = 𝑎, (2.36) 𝜕𝑡 𝜕𝑥 𝜕𝑢 𝜕𝑢 +𝑐 = 0, 𝑥 = 𝑏, (2.37) 𝜕𝑡 𝜕𝑥 where 𝑐 > 0 is a wave speed. 
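Before pursuing the outflow case, the Dirichlet coefficients derived earlier in this section can be exercised end-to-end on a steady modified Helmholtz problem $(\mathcal{I} - \partial_{xx}/\alpha^2)w = f$. The sketch below (Python/NumPy; the manufactured solution $w = \sin x$, the grid size, and $\alpha$ are illustrative, and a dense trapezoidal rule stands in for the fast summation quadrature) builds the convolution term, solves the 2 × 2 system for $A$ and $B$, and checks the result against the manufactured solution.

```python
import numpy as np

alpha, a, b, N = 10.0, 0.0, 1.0, 400
x = np.linspace(a, b, N + 1)
h = (b - a) / N

# Manufactured problem: w(x) = sin(x) solves (I - dxx/alpha^2) w = f.
w_exact = np.sin(x)
f = (1.0 + 1.0 / alpha**2) * np.sin(x)

# Convolution I_x[f](x_i) = (alpha/2) int_a^b e^{-alpha|x_i - y|} f(y) dy,
# via the trapezoidal rule (the kernel kink at y = x_i lies on a node).
wts = np.full(N + 1, h)
wts[0] = wts[-1] = h / 2
K = 0.5 * alpha * np.exp(-alpha * np.abs(x[:, None] - x[None, :]))
I = K @ (wts * f)

# Dirichlet system: A + mu*B = g_a - I(a), mu*A + B = g_b - I(b).
mu = np.exp(-alpha * (b - a))
p, q = w_exact[0] - I[0], w_exact[-1] - I[-1]
A = (p - mu * q) / (1.0 - mu**2)
B = (q - mu * p) / (1.0 - mu**2)

w = I + A * np.exp(-alpha * (x - a)) + B * np.exp(-alpha * (b - x))
print(np.max(np.abs(w - w_exact)))      # small (quadrature-limited)
```

This dense evaluation costs $\mathcal{O}(N^2)$; it is the step that the recursive fast summation of section 2.3.2 reduces to $\mathcal{O}(N)$.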
To formulate the boundary conditions for the BDF-2 update (2.28), we follow the approach used in [34], which developed outflow boundary conditions for the central method, discussed in the next section. We start from the free-space solution that is defined over the real line: 𝛼 ∞ −𝛼|𝑥−𝑦| ∫ 𝑛+1 𝑢 (𝑥) = 𝑒 𝑅(𝑦) 𝑑𝑦. 2 −∞ 39 Here, 𝑅(𝑥) is the operand for the BDF-2 method and is defined as 1 𝑛  𝑅(𝑥) = 5𝑢 − 4𝑢 𝑛−1 + 𝑢 𝑛−2 . 2 The free-space solution can then be separated to isolate the computational domain over [𝑎, 𝑏], which we write as ∫ 𝑎 ∫ ∞ 𝛼 −𝛼|𝑥−𝑦| 𝛼 𝑢 𝑛+1 (𝑥) = I𝑥 [𝑅] (𝑥) + 𝑒 𝑅(𝑦) 𝑑𝑦 + 𝑒 −𝛼|𝑥−𝑦| 𝑅(𝑦) 𝑑𝑦. 2 −∞ 2 𝑏 If we use the definitions ∫ 𝛼 𝑎 −𝛼(𝑎−𝑦) 𝐴= 𝑒 𝑅(𝑦) 𝑑𝑦, (2.38) 2 −∞ 𝛼 ∞ −𝛼(𝑦−𝑏) ∫ 𝐵= 𝑒 𝑅(𝑦) 𝑑𝑦, (2.39) 2 𝑏 then the decomposed free-space solution can be written in the familiar form (2.29): 𝑢 𝑛+1 (𝑥) = I𝑥 [𝑅] (𝑥) + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) . Now assume that the support of 𝑢(𝑥, 0) and 𝑅(𝑥) is entirely contained within the domain [𝑎, 𝑏] initially. Using the finite speed of propagation 𝑐, it follows that the region of support at time 𝑡 𝑛 extends to [𝑎 − 𝑐𝑡 𝑛 , 𝑏 + 𝑐𝑡 𝑛 ], which means that the integrals (2.38) and (2.39) can be simplified to ∫ 𝛼 𝑎 𝑛 𝐴 = 𝑒 −𝛼(𝑎−𝑦) 𝑅(𝑦) 𝑑𝑦, (2.40) 2 𝑎−𝑐𝑡 𝑛 ∫ 𝑛 𝑛 𝛼 𝑏+𝑐𝑡 −𝛼(𝑦−𝑏) 𝐵 = 𝑒 𝑅(𝑦) 𝑑𝑦. (2.41) 2 𝑏 Since these integrals are defined over regions of space outside of the computational domain, the idea is now to exchange space with time, using the characteristics of equations (2.36) and (2.37). Along the right boundary, the waves propagate to the right so that for any 𝑥 > 𝑏, the solution to (2.37) is 𝑢(𝑥, 𝑡) = 𝑢(𝑥 − 𝑐𝑡). Tracing the ray backwards in time, we see that  𝑦 𝑢(𝑏 + 𝑦, 𝑡) = 𝑢 𝑏, 𝑡 − , 𝑦 > 0. (2.42) 𝑐 Note that on a space-time plot, the characteristic has a slope of 𝑐−1 . Similarly, for along the left boundary, we have that 𝑥 < 𝑎, and the solution to (2.36) is given by 𝑢(𝑥, 𝑡) = 𝑢(𝑥 + 𝑐𝑡). Here, the 40 characteristic has slope −𝑐−1 , so we obtain  𝑦 𝑢(𝑎 − 𝑦, 𝑡) = 𝑢 𝑎, 𝑡 − , 𝑦 > 0. 
$$u(a - y, t) = u\left(a,\, t - \frac{y}{c}\right), \quad y > 0. \quad (2.43)$$
In other words, since the data remains constant along characteristics, it follows that any point outside of the computational domain corresponds to a boundary point at some earlier time. We now illustrate the process for converting space integrals to recursive time integrals by considering the integral (2.41). The argument for (2.40) is identical and is therefore omitted. To use the ray tracing formula (2.42), we shift the integration variable so that
$$\begin{aligned} B^n &= \frac{\alpha}{2} \int_b^{b + ct^n} e^{-\alpha(y - b)} R(y)\, dy, \\ &= \frac{\alpha}{2} \int_0^{ct^n} e^{-\alpha y} R(b + y)\, dy, \\ &\equiv \frac{\alpha}{2} \int_0^{ct^n} e^{-\alpha y} \left[ \frac{1}{2}\left( 5u^n(b+y) - 4u^{n-1}(b+y) + u^{n-2}(b+y) \right) \right] dy, \\ &= \frac{\alpha}{2} \int_0^{ct^n} e^{-\alpha y} \left[ \frac{1}{2}\left( 5u\left(b, t^n - y/c\right) - 4u\left(b, t^{n-1} - y/c\right) + u\left(b, t^{n-2} - y/c\right) \right) \right] dy, \end{aligned}$$
where the last line applies (2.42) to the time history. If we now apply another transformation $y/c \mapsto y$, then the integral becomes
$$\begin{aligned} B^n &= \frac{\alpha c}{2} \int_0^{t^n} e^{-\alpha c y} \left[ \frac{1}{2}\left( 5u(b, t^n - y) - 4u\left(b, t^{n-1} - y\right) + u\left(b, t^{n-2} - y\right) \right) \right] dy, \\ &= \frac{\alpha c}{2} \int_0^{\Delta t} e^{-\alpha c y} \left[ \frac{1}{2}\left( 5u(b, t^n - y) - 4u\left(b, t^{n-1} - y\right) + u\left(b, t^{n-2} - y\right) \right) \right] dy \\ &\quad + \frac{\alpha c}{2} \int_{\Delta t}^{t^n} e^{-\alpha c y} \left[ \frac{1}{2}\left( 5u(b, t^n - y) - 4u\left(b, t^{n-1} - y\right) + u\left(b, t^{n-2} - y\right) \right) \right] dy, \\ &\equiv I_1 + I_2. \end{aligned}$$
The integral $I_2$ exhibits a recursive form. If we shift the integration bounds by $-\Delta t$, then it follows that $I_2$ is now given by
$$\begin{aligned} I_2 &= \frac{\alpha c}{2} \int_0^{t^{n-1}} e^{-\alpha c (y + \Delta t)} \left[ \frac{1}{2}\left( 5u\left(b, t^{n-1} - y\right) - 4u\left(b, t^{n-2} - y\right) + u\left(b, t^{n-3} - y\right) \right) \right] dy, \\ &= e^{-\alpha c \Delta t} \left( \frac{\alpha c}{2} \int_0^{t^{n-1}} e^{-\alpha c y} \left[ \frac{1}{2}\left( 5u\left(b, t^{n-1} - y\right) - 4u\left(b, t^{n-2} - y\right) + u\left(b, t^{n-3} - y\right) \right) \right] dy \right), \\ &\equiv e^{-\alpha c \Delta t} B^{n-1}. \quad (2.44) \end{aligned}$$
The integral term $I_1$, which is defined over $[0, \Delta t]$, can be mapped to the interval $[0, 1]$ using the transformation $y = z\Delta t$, $z \in [0, 1]$. Then it follows that
$$I_1 = \frac{\alpha c \Delta t}{2} \int_0^1 e^{-\alpha c \Delta t z} \left[ \frac{1}{2}\left( 5u(b, t^n - z\Delta t) - 4u\left(b, t^{n-1} - z\Delta t\right) + u\left(b, t^{n-2} - z\Delta t\right) \right) \right] dz. \quad (2.45)$$
To evaluate this local integral (in time), we use a stored set of time history data at the boundary point to construct an interpolating function.
After inserting this approximation into (2.45), the integration can be performed analytically to generate the weights for the time history data. This is done in the same spirit as the approach for computing quadrature weights to approximate the local integrals, which was discussed in section 2.3.3. We demonstrate this process using two approaches, which differ in the stencil used to interpolate the time history. A simple way to approximate the integral (2.45) is to build an "explicit" interpolating function based on the time history $\left\{ u^{n-2}(b), u^{n-1}(b), u^n(b) \right\}$. Using the algebraic polynomial basis and performing the integration of the resulting interpolant analytically, we obtain an approximation of the form
$$I_1 \approx \gamma_0 u^n(b) + \gamma_1 u^{n-1}(b) + \gamma_2 u^{n-2}(b),$$
where the outflow weights are
$$\gamma_0 = -\frac{5\nu + 2e^{-\nu} + \nu^2 e^{-\nu} - 5\nu^2 - 3\nu e^{-\nu} - 2}{4\nu^2},$$
$$\gamma_1 = -\frac{\nu^2 e^{-\nu} - 2e^{-\nu} - 4\nu + 2\nu^2 + 2\nu e^{-\nu} + 2}{2\nu^2},$$
$$\gamma_2 = \frac{e^{-\nu}(\nu - 1)\left(\nu - 2e^{\nu} + \nu e^{\nu} + 2\right)}{4\nu^2},$$
with $\nu = \sqrt{2}$. Combining this result with (2.44), we obtain the explicit update formula
$$B^n = e^{-\alpha c \Delta t} B^{n-1} + \gamma_0 u^n(b) + \gamma_1 u^{n-1}(b) + \gamma_2 u^{n-2}(b). \quad (2.46)$$
In the case of outflow boundary conditions for the second-order, time-centered update, which was presented in [34], the authors mentioned that including $u^{n+1}(b)$ in the interpolation stencil is necessary for a convergent outflow method (see Remark 3 in [34]). Numerical experiments, which we present in section 2.6, seem to suggest otherwise. This includes the BDF-2 method with the explicit form of outflow given by (2.46). An "implicit" form of the outflow procedure can be obtained by including time level $n+1$ data in the interpolation stencil used to approximate the integral (2.45).
Repeating the steps shown above, we obtain a corresponding implicit formula 𝐼1 ≈ 𝛾0 𝑢 𝑛+1 (𝑏) + 𝛾1 𝑢 𝑛 (𝑏) + 𝛾2 𝑢 𝑛−1 (𝑏) + 𝛾3 𝑢 𝑛−2 (𝑏), where the outflow weights are 𝜈 2 𝑒 −𝜈 − 6𝑒 −𝜈 − 12𝜈 − 3𝜈 3 𝑒 −𝜈 + 8𝜈 2 + 6𝜈𝑒 −𝜈 + 6 𝛾0 = − , 12𝜈 3 4𝜈 2 𝑒 −𝜈 − 6𝑒 −𝜈 − 10𝜈 − 4𝜈 3 𝑒 −𝜈 + 3𝜈 2 + 5𝜈 3 + 4𝜈𝑒 −𝜈 + 6 𝛾1 = , 4𝜈 3 5𝜈 2 𝑒 −𝜈 − 8𝜈 − 𝜈 3 𝑒 −𝜈 + 4𝜈 3 + 2𝜈𝑒 −𝜈 + 6 𝛾2 = − , 4𝜈 3 6𝜈 + 6𝑒 −𝜈 − 4𝜈 2 𝑒 −𝜈 + 𝜈 2 − 3𝜈 3 − 6 𝛾3 = − , 12𝜈 3 √ again, with 𝜈 = 2. To deal with the appearance of 𝑢 𝑛+1 (𝑏) in the approximation of (2.45), we appeal to the update for the BDF-2 scheme (2.29). Assuming that outflow is applied to the left boundary point, as well, we can repeat the above process to obtain a linear system, as with the other types of boundary conditions discussed earlier. This results in the linear system (1 − 𝛾0 ) 𝐴𝑛 − 𝛾0 𝜇𝐵𝑛 = 𝑒 −𝛼𝑐Δ𝑡 𝐴𝑛−1 + 𝛾0 I𝑥 [𝑅] (𝑎) + 𝛾1 𝑢 𝑛 (𝑎) + 𝛾2 𝑢 𝑛−1 (𝑎) + 𝛾3 𝑢 𝑛−2 (𝑎) ≡ 𝑓𝑎 , −𝛾0 𝜇 𝐴𝑛 + (1 − 𝛾0 )𝐵𝑛 = 𝑒 −𝛼𝑐Δ𝑡 𝐵𝑛−1 + 𝛾0 I𝑥 [𝑅] (𝑏) + 𝛾1 𝑢 𝑛 (𝑏) + 𝛾2 𝑢 𝑛−1 (𝑏) + 𝛾3 𝑢 𝑛−2 (𝑏) ≡ 𝑓𝑏 , 43 where 𝜇 = 𝑒 −𝛼(𝑏−𝑎) and we have used 𝑓𝑎 and 𝑓𝑏 to indicate the right side of the system for brevity. This system can be analytically inverted and leads to the solution (1 − 𝛾0 ) 𝑓𝑎 + 𝛾0 𝜇 𝑓𝑏 𝐴𝑛 = , (1 − 𝛾0 ) 2 − (𝛾0 𝜇) 2 𝛾0 𝜇 𝑓𝑎 + (1 − 𝛾0 ) 𝑓𝑏 𝐵𝑛 = . (1 − 𝛾0 ) 2 − (𝛾0 𝜇) 2 This finishes the introduction for outflow boundary conditions. In section 2.4.3, we mention some of the caveats in applying the aforementioned boundary conditions to problems in a multi- dimensional setting. Next, in section 2.4.2, we present the boundary conditions for the second-order, time-centered scheme. 2.4.2 Centered Method The update for second order time centered method, in one spatial dimension, can be obtained by combining (2.16) with the semi-discrete equation (2.10) to obtain ∫ 𝑏 " # ! 
𝛼 1 𝑢 𝑛+1 (𝑥) = (2 − 𝛽2 )𝑢 𝑛 − 𝑢 𝑛−1 + 𝛽2 𝑒 −𝛼|𝑥−𝑦| 𝑢 𝑛 + 2 𝑆 𝑛 𝑑𝑦 + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , 2 𝑎 𝛼 (2.47)   1 𝑛 ≡ (2 − 𝛽2 )𝑢 𝑛 − 𝑢 𝑛−1 + 𝛽2 I𝑥 𝑢 𝑛 + 𝑆 (𝑥) + 𝐴𝑒 −𝛼(𝑥−𝑎) + 𝐵𝑒 −𝛼(𝑏−𝑥) , (2.48) 𝛼2 where we have, again, used I𝑥 [·] to denote the convolution integral term. By specifying different conditions on the boundary data for the solution 𝑢, we can identify the corresponding values of 𝐴 and 𝐵 in the update (2.48). As with the BDF methods discussed in the previous sections, we will assume that the boundary conditions along the line are the same, though this can be generalized to mixed boundary conditions with minimal modifications. The techniques used to obtain boundary conditions are nearly identical to the BDF-2 method shown in the previous section, so we skip the details and simply state the results. As with the BDF method, we can obtain a method for constructing derivatives by directly differentiating the update (2.48). Making use of the identity (2.32), the derivative at the new time 44 level is found to be       𝑑𝑢 𝑛+1 2 𝑑𝑢 𝑛 𝑑𝑢 𝑛−1 𝛽2 𝛼 𝑅 𝑛 1 𝑛 𝐿 𝑛 1 𝑛 = (2 − 𝛽 ) − + −I𝑥 𝑢 + 2 𝑆 (𝑥) + I𝑥 𝑢 + 2 𝑆 (𝑥) 𝑑𝑥 𝑑𝑥 𝑑𝑥 2 𝛼 𝛼 − 𝛼𝐴𝑒 −𝛼(𝑥−𝑎) + 𝛼𝐵𝑒 −𝛼(𝑏−𝑥) (2.49) In contrast to the derivative (2.35) obtained with the BDF-2 method, we can see that the derivative of the time-centered method includes a time history for the derivative itself. We have not shown that computing derivatives with a recursive approach of this form is a stable process. Otherwise, no additional approximations have been made beyond what is needed to compute I𝑥𝑅 and I𝑥𝐿 . As with the BDF method, the boundary coefficients 𝐴 and 𝐵 appearing in (2.49) will be calculated in the same way as the update (2.48). 2.4.2.1 Dirichlet Boundary Conditions Dirichlet boundary conditions for which we are given the data     𝑢 𝑛+1 (𝑎) = 𝑔𝑎 𝑡 𝑛+1 , 𝑢 𝑛+1 (𝑏) = 𝑔𝑏 𝑡 𝑛+1 , can be enforced by solving the linear system that results from evaluating (2.48) at the ends of the domain. 
The solution to this system is
$$A = -\frac{f_a - \mu f_b}{1 - \mu^2}, \quad B = -\frac{f_b - \mu f_a}{1 - \mu^2},$$
where we have used $\mu = e^{-\alpha(b-a)}$ and
$$f_a = \mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](a) - g_a(t^n) - \frac{g_a\left(t^{n+1}\right) - 2g_a(t^n) + g_a\left(t^{n-1}\right)}{\beta^2},$$
$$f_b = \mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](b) - g_b(t^n) - \frac{g_b\left(t^{n+1}\right) - 2g_b(t^n) + g_b\left(t^{n-1}\right)}{\beta^2},$$
for brevity.

2.4.2.2 Neumann Boundary Conditions

The boundary coefficients associated with Neumann boundary conditions
$$\frac{du^{n+1}(a)}{dx} = h_a\left(t^{n+1}\right), \quad \frac{du^{n+1}(b)}{dx} = h_b\left(t^{n+1}\right),$$
can be determined using the derivative (2.49) with the aid of the identities (2.33) and (2.34). The resulting linear system has the solution
$$A = \frac{w_a - \mu w_b}{1 - \mu^2}, \quad B = \frac{w_b - \mu w_a}{1 - \mu^2},$$
where we have used the definitions
$$w_a = \mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](a) - \frac{1}{\alpha} h_a(t^n) - \frac{h_a\left(t^{n+1}\right) - 2h_a(t^n) + h_a\left(t^{n-1}\right)}{\alpha\beta^2},$$
$$w_b = \mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](b) + \frac{1}{\alpha} h_b(t^n) - \frac{h_b\left(t^{n+1}\right) - 2h_b(t^n) + h_b\left(t^{n-1}\right)}{\alpha\beta^2},$$
and $\mu = e^{-\alpha(b-a)}$. An approach for Robin boundary conditions follows a nearly identical path.

2.4.2.3 Periodic Boundary Conditions

For periodic boundary conditions, we assume that for any time level $n > 0$,
$$u^{n+1}(a) = u^{n+1}(b), \quad \partial_x u^{n+1}(a) = \partial_x u^{n+1}(b).$$
To enforce these conditions, we can appeal to the update (2.48), its derivative (2.49), and the identities (2.33) and (2.34). By solving the linear system obtained from the evaluation of these updates at the ends of the domain, we obtain the coefficients
$$A = \frac{\mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](b)}{1 - \mu}, \quad B = \frac{\mathcal{I}_x\left[ u^n + \frac{1}{\alpha^2} S^n \right](a)}{1 - \mu},$$
where, again, $\mu = e^{-\alpha(b-a)}$.

2.4.2.4 Outflow Boundary Conditions

The procedure used to derive outflow boundary coefficients for the time-centered method was originally presented in [34], so we shall skip many of the details here. In fact, the developments for the time-centered method can be treated as a simplification of the process used in section 2.4.1.4, which developed the outflow coefficients for the BDF-2 method.
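The periodic coefficients above, combined with the time-centered update, give a complete time-stepping loop. As a quick end-to-end check of the semi-discrete equation (2.10), the sketch below advances a standing wave of the 1-D wave equation with periodic boundary conditions and compares against the exact solution. For brevity, the inversion of $(\mathcal{I} - \Delta/\alpha^2)$ is done spectrally with an FFT rather than with the integral solver of section 2.3; the choices $\beta = 1$ (within the stable range $0 < \beta \le 2$), the grid, and the standing-wave data are illustrative.

```python
import numpy as np

c, beta, dt, N, steps = 1.0, 1.0, 0.01, 64, 100
alpha = beta / (c * dt)
x = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
k = np.fft.fftfreq(N, d=2.0 * np.pi / N) * 2.0 * np.pi   # integer wave numbers

# Exact standing wave u = cos(x) cos(c t); seed two time levels exactly.
u_prev = np.cos(x)                        # t = 0
u_curr = np.cos(x) * np.cos(c * dt)       # t = dt

# Centered update: (I - Lap/alpha^2)[u^{n+1} - (2 - beta^2) u^n + u^{n-1}] = beta^2 u^n,
# inverted mode-by-mode with the multiplier 1 / (1 + k^2/alpha^2).
minv = 1.0 / (1.0 + (k / alpha) ** 2)
for _ in range(steps - 1):
    rhs = beta**2 * np.fft.fft(u_curr)
    u_next = (2.0 - beta**2) * u_curr - u_prev + np.real(np.fft.ifft(minv * rhs))
    u_prev, u_curr = u_curr, u_next

t_final = steps * dt
err = np.max(np.abs(u_curr - np.cos(x) * np.cos(c * t_final)))
print(err)                                # small dispersive error, O(dt^2)
```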
Repeating the steps shown in section 2.4.1.4 with the central scheme, one finds that
\[
B^n = \frac{\alpha c}{2}\int_0^{t^n} e^{-\alpha c y}\, u(b, t^n - y)\,dy
= \frac{\alpha c}{2}\int_0^{\Delta t} e^{-\alpha c y}\, u(b, t^n - y)\,dy + \frac{\alpha c}{2}\int_{\Delta t}^{t^n} e^{-\alpha c y}\, u(b, t^n - y)\,dy
\equiv I_1 + I_2,
\]
where the second integral $I_2$ can be written in the form
\[
I_2 = \frac{\alpha c}{2}\int_0^{t^{n-1}} e^{-\alpha c(y+\Delta t)}\, u\left(b, t^{n-1} - y\right)dy
= e^{-\beta}\left(\frac{\alpha c}{2}\int_0^{t^{n-1}} e^{-\alpha c y}\, u\left(b, t^{n-1} - y\right)dy\right)
\equiv e^{-\beta}B^{n-1}. \tag{2.50}
\]
Note that the simplifications shown above use the definition $\alpha = \beta/(c\Delta t)$ for the central scheme. Likewise, the first integral $I_1$ can be expressed as
\[
I_1 = \frac{\beta}{2}\int_0^1 e^{-\beta z}\, u\left(b, t^n - z\Delta t\right)dz, \tag{2.51}
\]
using the same definition $\alpha = \beta/(c\Delta t)$. As with the BDF-2 method, this integral can be approximated in an explicit or implicit manner, depending on the choice of points used to create the interpolating function for $u(b, t^n - z\Delta t)$.

An explicit outflow method can be obtained if we approximate the function $u(b, t^n - z\Delta t)$ using algebraic polynomials with the stencil $\left\{u^{n-2}(b), u^{n-1}(b), u^n(b)\right\}$. Integrating this analytically and combining the result with (2.50), we obtain the explicit outflow method
\[
B^n = e^{-\beta}B^{n-1} + \gamma_0 u^n(b) + \gamma_1 u^{n-1}(b) + \gamma_2 u^{n-2}(b),
\]
where the integration weights for the second-order, time-centered method are given by
\[
\gamma_0 = \frac{2e^{-\beta} - \beta + 2\beta^2 + 3\beta e^{-\beta} - 2}{4\beta^2}, \qquad
\gamma_1 = -\frac{2e^{-\beta} - 2\beta + 3\beta^2 e^{-\beta} + 4\beta e^{-\beta} - 2}{6\beta^2}, \qquad
\gamma_2 = -\frac{\beta + 2e^{-\beta} + \beta e^{-\beta} - 2}{12\beta^2}.
\]
An implicit form of the outflow method for the central-2 scheme can be obtained using the stencil $\left\{u^{n-1}(b), u^n(b), u^{n+1}(b)\right\}$ to approximate $u(b, t^n - z\Delta t)$. Using the update (2.48) eliminates the variable $u^{n+1}(b)$, which is not yet available, from the interpolation formula.
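The explicit recursion above is easy to sanity-check numerically: for a constant boundary history the weights must integrate $e^{-\beta z}$ exactly, and the recursion must converge to the exact steady convolution value of $1/2$. A short sketch (Python, with an illustrative choice of $\beta$):

```python
import numpy as np

beta = 1.0
d = np.exp(-beta)

# explicit second-order outflow weights, as quoted above
g0 = (2*d - beta + 2*beta**2 + 3*beta*d - 2) / (4*beta**2)
g1 = -(2*d - 2*beta + 3*beta**2*d + 4*beta*d - 2) / (6*beta**2)
g2 = -(beta + 2*d + beta*d - 2) / (12*beta**2)

# consistency: for a constant history the weights must reproduce
#   I_1 = (beta/2) * int_0^1 e^{-beta z} dz = (1 - e^{-beta}) / 2
weight_sum = g0 + g1 + g2

# recursion B^n = e^{-beta} B^{n-1} + g0 u^n + g1 u^{n-1} + g2 u^{n-2},
# driven by u(b, t) == 1; the exact convolution tends to 1/2
B = 0.0
for _ in range(500):
    B = d * B + weight_sum
```

The first check verifies the quadrature weights against the exact integral of the kernel; the second verifies that the time-history recursion reproduces the full convolution in the steady limit.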
Repeating the steps for the other boundary point, we obtain the linear system
\[
\left(1 - \beta^2\gamma_0\right)A^n - \mu\beta^2\gamma_0\, B^n = f_a, \qquad
-\mu\beta^2\gamma_0\, A^n + \left(1 - \beta^2\gamma_0\right)B^n = f_b,
\]
with
\[
f_a = e^{-\beta}A^{n-1} + \gamma_0\beta^2\,\mathcal{I}_x\left[u^n + \frac{1}{\alpha^2}S^n\right](a) + \left((2-\beta^2)\gamma_0 + \gamma_1\right)u^n(a) + (\gamma_2 - \gamma_0)u^{n-1}(a),
\]
\[
f_b = e^{-\beta}B^{n-1} + \gamma_0\beta^2\,\mathcal{I}_x\left[u^n + \frac{1}{\alpha^2}S^n\right](b) + \left((2-\beta^2)\gamma_0 + \gamma_1\right)u^n(b) + (\gamma_2 - \gamma_0)u^{n-1}(b),
\]
and $\mu = e^{-\alpha(b-a)}$. The solution to this linear system is given by
\[
A^n = \frac{\left(1 - \beta^2\gamma_0\right)f_a + \mu\beta^2\gamma_0 f_b}{\left(1 - \beta^2\gamma_0\right)^2 - \left(\mu\beta^2\gamma_0\right)^2}, \qquad
B^n = \frac{\mu\beta^2\gamma_0 f_a + \left(1 - \beta^2\gamma_0\right)f_b}{\left(1 - \beta^2\gamma_0\right)^2 - \left(\mu\beta^2\gamma_0\right)^2},
\]
where the corresponding integration weights are defined as
\[
\gamma_0 = -\frac{\beta + 2e^{-\beta} + \beta e^{-\beta} - 2}{4\beta^2}, \qquad
\gamma_1 = \frac{2e^{-\beta} + \beta^2 + 2\beta e^{-\beta} - 2}{2\beta^2}, \qquad
\gamma_2 = -\frac{2e^{-\beta} - \beta + 2\beta^2 e^{-\beta} + 3\beta e^{-\beta} - 2}{4\beta^2}.
\]

2.4.3 Some Remarks for Multi-dimensional Problems

In this section, we briefly discuss issues concerning the application of boundary conditions for the multi-dimensional updates given by (2.11) for the BDF-2 method and (2.12) for the time-centered method. For convenience, these are given, respectively, by
\[
\mathcal{L}_x\mathcal{L}_y u^{n+1} = \frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) + \frac{1}{\alpha^2}S^{n+1}(\mathbf{x}), \qquad \alpha := \frac{\sqrt{2}}{c\Delta t},
\]
and
\[
\mathcal{L}_x\mathcal{L}_y\left[u^{n+1} - (2-\beta^2)u^n + u^{n-1}\right](x,y) = \beta^2\left[u^n + \frac{1}{\alpha^2}S^n\right](x,y), \qquad \alpha := \frac{\beta}{c\Delta t}, \quad 0 < \beta \le 2.
\]
By inverting the factored operator one direction at a time using the techniques presented in section 2.3, it follows that the solutions to these two equations are given, respectively, by
\[
u^{n+1} = \mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[\frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) + \frac{1}{\alpha^2}S^{n+1}\right]\right], \qquad \alpha := \frac{\sqrt{2}}{c\Delta t},
\]
and
\[
u^{n+1} = (2-\beta^2)u^n - u^{n-1} + \beta^2\,\mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[u^n + \frac{1}{\alpha^2}S^n\right]\right], \qquad \alpha := \frac{\beta}{c\Delta t}, \quad 0 < \beta \le 2.
\]
We wish to point out that solutions are assumed to be smooth, so the ordering conventions used for the operators are irrelevant, i.e., $\mathcal{L}_x\mathcal{L}_y = \mathcal{L}_y\mathcal{L}_x$.

2.4.3.1 Sweeping Patterns in Multi-dimensional Problems

In the two-dimensional case, we need to construct terms of the form
\[
\mathcal{L}_y\mathcal{L}_x w = f \implies w = \mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[f\right]\right],
\]
with boundary data being prescribed for the variable $w$. The construction is performed over two steps.
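The commutativity $\mathcal{L}_x\mathcal{L}_y = \mathcal{L}_y\mathcal{L}_x$ noted above is what allows the sweep order to be chosen freely. The sketch below illustrates this with a finite-difference stand-in for the one-dimensional operators (an assumption made purely for brevity; the actual solver inverts $\mathcal{L}$ through the integral formulation) on a doubly periodic grid:

```python
import numpy as np

# Finite-difference stand-in for the one-dimensional modified Helmholtz
# operator L = I - (1/alpha^2) d^2/dx^2 on a periodic grid.
def helmholtz_matrix(N, h, alpha):
    D2 = -2.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
    D2[0, -1] = D2[-1, 0] = 1.0          # periodic wrap
    return np.eye(N) - D2 / (alpha * h)**2

N, alpha = 64, 5.0
h = 2 * np.pi / N
x = np.arange(N) * h
f = np.sin(x)[:, None] * np.cos(x)[None, :]    # f(x, y) = sin(x) cos(y)
L = helmholtz_matrix(N, h, alpha)

# invert line-by-line: y-sweep then x-sweep, and the reverse order
w_yx = np.linalg.solve(L, np.linalg.solve(L, f.T).T)   # L_y^{-1}, then L_x^{-1}
w_xy = np.linalg.solve(L, np.linalg.solve(L, f).T).T   # L_x^{-1}, then L_y^{-1}

# f is a discrete eigenfunction of L in each direction, so the result is f / s^2
s = 1.0 + (2.0 - 2.0 * np.cos(h)) / (alpha * h)**2
```

Because the two solves act on different array axes, the two orderings agree to roundoff, and for the discrete eigenfunction $\sin(x)\cos(y)$ the result matches the closed-form eigenvalue computation.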
The first step inverts the $y$ operator, so we obtain
\[
\mathcal{L}_x w = \mathcal{L}_y^{-1}\left[f\right]. \tag{2.52}
\]
The step (2.52) requires boundary data for the intermediate variable $\mathcal{L}_x w$, when we are only given boundary data for $w$. From the definition of $\mathcal{L}_x$, we note that
\[
\mathcal{L}_x w \equiv \left(\mathcal{I} - \frac{1}{\alpha^2}\partial_{xx}\right)w = w + \mathcal{O}\left(\frac{1}{\alpha^2}\right). \tag{2.53}
\]
In other words, boundary conditions for $\mathcal{L}_x w$ can be approximated to second-order in time by those of $w$; however, unless we are dealing with outflow boundary conditions, we do not need to sweep along the boundary of the domain, so the approximation (2.53) is not necessary. Proceeding further, the second step of the inversion process leads to the solution $w$,
\[
w = \mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[f\right]\right], \tag{2.54}
\]
which simply enforces the known boundary data on $w$. For purposes of clarity, we separate the types of boundary conditions. In each case, we summarize the changes associated with moving to multi-dimensional problems, which includes any changes necessitated by the proposed methods for calculating derivatives.

2.4.3.2 Periodic Boundary Conditions

Periodic boundary conditions in multi-dimensional problems can be enforced in a straightforward way by directly applying the one-dimensional approaches outlined in sections 2.4.1.3 and 2.4.2.3 to each dimension. No modifications are required for either the scheme or the proposed derivative methods.

2.4.3.3 Dirichlet Boundary Conditions

In the case of Dirichlet boundary conditions, the values of the function are known along the boundary. Therefore, we only need to update the grid points corresponding to the interior of the domain. As mentioned earlier, rather than approximating the boundary conditions for the intermediate sweep, e.g., via (2.53), we simply avoid sweeping along the boundary points of the domain, since the values there are known. The direction corresponding to the intermediate sweep will now only use boundary data set by the solution, since the boundaries are left untouched.
In the case of homogeneous Dirichlet conditions, the sweeps can be performed on the boundary with no effect. When sweeping over different directions for the derivatives, we note that the derivative information is not known along the boundary. Therefore, the sweeps should extend all the way to the boundary; otherwise, the derivative will not be available there. In this case, the boundary data for the intermediate variable can be approximated according to (2.53).

2.4.3.4 Neumann Boundary Conditions

The treatment of Neumann boundary conditions in multi-dimensional problems is identical to the Dirichlet case discussed in the previous section for Cartesian grids. This also applies to the proposed methods for computing derivatives. In problems defined on complex geometries with embedded boundaries, based on the theory presented in [53], it was discovered that dissipation was necessary to obtain stable numerical solutions [34]. While this is not a problem for the BDF-2 method, which is dissipative, problems occur for the centered scheme. This motivated the introduction of a tunable dissipation term using the successive convolution framework, for which we provide an overview in section 2.5.1. We also wish to note that Robin boundary conditions follow an identical approach.

2.4.3.5 Outflow Boundary Conditions

In the case of outflow boundary conditions, one should pay careful attention to the structure of the time history used in the convolution integral for a particular scheme. For example, the integrand for the second-order BDF method (2.11) uses data from three time levels, $\{u^{n-2}, u^{n-1}, u^n\}$, while the time-centered method uses a single time level, i.e., $u^n$, for this term. This led to two different outflow procedures, which were presented in sections 2.4.1.4 and 2.4.2.4. When dealing with outflow boundaries, the sweeps can be performed along the boundary, since these values are unspecified. We note that this includes derivatives in addition to the fields.
Next, we focus on the structure of the sweeping patterns used in the methods, beginning with the time-centered scheme and then moving to the BDF method. In the time-centered approach (2.12), the structure of the time history used in the convolution integral follows a regular pattern in the sense that the first and second layers of sweeps defined by (2.52) and (2.54) operate on data from a single time level. Assuming the sweeping pattern (2.54), we first need to evaluate the term
\[
v^{(1)} = \mathcal{L}_y^{-1}\left[u^n + \frac{1}{\alpha^2}S^n\right],
\]
which requires a time history for the solution $u$ at the $x$ boundary points along the $y$-direction. The implicit approach to outflow for the time-centered method is constructed using the one-dimensional update (2.48), which is not consistent with what we have written here. Instead, we recommend that the explicit form of the outflow procedure be used for the calculation of the boundary data. For the second set of sweeps given by (2.54), we need to compute
\[
v^{(2)} = \mathcal{L}_x^{-1}\left[v^{(1)}\right],
\]
which similarly requires a time history for $v^{(1)}$, now at the $y$ boundary points along the $x$-direction. Since this is the last layer of sweeps, one is free to use either the implicit or explicit forms of outflow to construct the boundary terms. While we have proposed approaches for computing derivatives for the time-centered scheme, we find the structure of the BDF derivatives (2.35) more appealing. Therefore, we shall not discuss multi-dimensional aspects of the derivatives for the time-centered method any further. This will be clarified when we present numerical results in section 2.6.

The BDF-2 scheme (2.11), in contrast to the time-centered method, has other nuances which should be addressed. Again, if we assume a sweeping pattern of the form (2.54), we first need to evaluate
\[
v^{(1)} = \mathcal{L}_y^{-1}\left[\frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) + \frac{1}{\alpha^2}S^{n+1}\right], \tag{2.55}
\]
which requires a time history for the solution $u$ at the $x$ boundary points along the $y$-direction.
The construction of the boundary coefficients for this term follows the procedure presented in section 2.4.1.4; however, when one changes directions to perform the second layer of sweeps following (2.54), we see that we now need to construct a term of the form
\[
v^{(2)} = \mathcal{L}_x^{-1}\left[v^{(1)}\right],
\]
which is identical in structure to the term found with the central-2 scheme. Computing this term with outflow boundary conditions requires that we store a time history for $v^{(1)}$ at the $y$ boundary points along the $x$-direction. This requires us to change the reconstruction methods based on the order in which sweeps are performed, which creates additional complications if we wish to interchange the sweep ordering. Instead, for the multi-dimensional BDF-2 update, we Taylor expand the time history in the convolution integral (2.55) about time level $n$, so that it looks like the integrand for the time-centered scheme. Assuming the source is not present at the boundary, the integrand can be approximated as
\[
\frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) = u^n + \mathcal{O}(\Delta t).
\]
While this modification results in a loss of accuracy when the method is applied in outflow problems, it creates a regular structure that is easier to work with in the numerical implementation, since the outflow procedure for the time-centered method can now be used in both directions. A consequence of this decision is that the approach is only compatible with explicit outflow, because the implicit form of outflow uses knowledge of the particular update, which would be inconsistent.

Lastly, we remark on the treatment of derivatives obtained with the BDF method in the multi-dimensional case. In section 2.4.1, we obtained an equation for spatial derivatives through direct calculations involving the one-dimensional BDF-2 method, which resulted in (2.35). We would like to apply these one-dimensional derivatives in multi-dimensional problems, as well.
Since the directions over which sweeps occur are treated independently, the one-dimensional derivatives can also be applied in a dimensionally-split manner. When combined with outflow boundary conditions, care should be taken to ensure that the time history used in the reconstruction is consistent with the operand of a particular convolution integral. To illustrate, assuming a sweeping pattern of the form (2.54), we find that the BDF-2 update in two spatial dimensions has the form
\[
u^{n+1} = \mathcal{L}_x^{-1}\left[\mathcal{L}_y^{-1}\left[R\right]\right], \tag{2.56}
\]
where we used
\[
R(x,y) = \frac{1}{2}\left(5u^n - 4u^{n-1} + u^{n-2}\right) + \frac{1}{\alpha^2}S^{n+1}.
\]
Alternatively, if we swap the order in which the sweeps are performed, we obtain a similar update
\[
u^{n+1} = \mathcal{L}_y^{-1}\left[\mathcal{L}_x^{-1}\left[R\right]\right], \tag{2.57}
\]
with $R$ being unchanged. Now, consider a $y$-derivative of the schemes (2.56) and (2.57). Since the inverse operator $\mathcal{L}_x^{-1}$ varies only with respect to $x$ and remains constant in $y$, we identify two options to compute $y$-derivatives, namely
\[
\partial_y u^{n+1} = \mathcal{L}_x^{-1}\left[\partial_y\mathcal{L}_y^{-1}\left[R\right]\right], \tag{2.58}
\]
\[
\partial_y u^{n+1} = \partial_y\mathcal{L}_y^{-1}\left[\mathcal{L}_x^{-1}\left[R\right]\right]. \tag{2.59}
\]
Both options are valid for computing the derivatives in the multi-dimensional case, but, in problems with outflow boundary conditions, the data stored in these approaches is different. In the first approach (2.58), the outflow boundary conditions require the time history for the derivative of the intermediate variable. The second approach (2.59) proceeds in a manner which is more closely related to the update of the solution, since the derivative (2.35) does not change $A$ and $B$ for a given line. For this reason, we prefer the second option (2.59) in our implementation to compute the $y$-derivative. Similarly, the $x$-derivative works with the pattern (2.56).

2.5 Extensions for High-order Accuracy

Here we provide a brief discussion regarding high-order extensions of the aforementioned methods in the context of the two-way wave equation.
We include an overview of the successive convolution method for the wave equation, which is loosely taken from [38]. We also mention methods for increasing the accuracy of the BDF methods introduced in section 2.2, including ways to obtain more accurate spatial derivatives. While we are primarily focused on the developments with the second-order solvers, the purpose of this section is to demonstrate paths to high-order solvers. Therefore, we provide fewer details concerning the implementation of these methods compared to earlier sections.

2.5.1 Successive Convolution Methods

The first work on high-order extensions of the solvers presented in the previous sections appeared in the 2014 paper [38]. Rather than increasing the width of the stencil to approximate the second time derivative, these methods introduced additional spatial derivatives to retain a compact stencil in time, using the symmetries appearing in the truncation error for the time-centered method. To illustrate, suppose we are solving the homogeneous wave equation
\[
\frac{\partial^2 u}{\partial t^2} = c^2\Delta u.
\]
If the second time derivative is approximated with a second-order centered finite-difference about time level $n$, then
\[
\frac{\partial^2 u}{\partial t^2} = \frac{u^{n+1} - 2u^n + u^{n-1}}{\Delta t^2} + \mathcal{O}(\Delta t^2).
\]
Taylor expanding the terms in the numerator of the above approximation yields an expansion containing only even-order time derivatives, i.e.,
\[
u^{n+1} - 2u^n + u^{n-1} = 2\sum_{m=1}^{\infty}\frac{\Delta t^{2m}}{(2m)!}\frac{\partial^{2m}u^n}{\partial t^{2m}}.
\]
The time derivatives in the above equation were then replaced with spatial derivatives using the PDE to obtain the expansion
\[
u^{n+1} - 2u^n + u^{n-1} = 2\sum_{m=1}^{\infty}\frac{\beta^{2m}}{(2m)!}\left(\frac{\Delta}{\alpha^2}\right)^m u^n, \qquad \alpha := \frac{\beta}{c\Delta t}, \quad 0 < \beta \le \beta_{\max}, \tag{2.60}
\]
with the powers of the Laplacian being evaluated in a dimension-by-dimension fashion; the parameter $\beta$ was introduced to tune the stability of the method.
In [38], the powers of the Laplacian were constructed recursively using the convolution operators defined earlier in this chapter, so the approach became known as successive convolution. To approximate the Laplacian as a convolution, the authors introduce the one-dimensional modified Helmholtz operators
\[
\mathcal{L}_\gamma := \mathcal{I} - \frac{1}{\alpha^2}\partial_{\gamma\gamma}, \qquad \gamma = x, y, \cdots, \tag{2.61}
\]
and another operator
\[
\mathcal{D}_\gamma := \mathcal{I} - \mathcal{L}_\gamma^{-1}, \qquad \gamma = x, y, \cdots, \tag{2.62}
\]
which can be combined to form the Laplacian operator through the relation
\[
-\frac{1}{\alpha^2}\Delta = \mathcal{L}_x\mathcal{D}_x + \mathcal{L}_y\mathcal{D}_y + \cdots = -\frac{1}{\alpha^2}\partial_{xx} - \frac{1}{\alpha^2}\partial_{yy} - \cdots.
\]
The authors introduce a new operator $\mathcal{C}$, which, in two dimensions, is defined as
\[
\mathcal{C}_{xy} := \mathcal{L}_y^{-1}\mathcal{D}_x + \mathcal{L}_x^{-1}\mathcal{D}_y, \tag{2.63}
\]
so that the Laplacian can be expressed in the factored form
\[
-\frac{1}{\alpha^2}\Delta = \mathcal{L}_x\mathcal{L}_y\,\mathcal{C}_{xy}.
\]
To remove the term $\mathcal{L}_x\mathcal{L}_y$, they introduced yet another operator
\[
\mathcal{D}_{xy} := \mathcal{I} - \mathcal{L}_x^{-1}\mathcal{L}_y^{-1}, \tag{2.64}
\]
which can be rearranged to obtain the identity
\[
\mathcal{L}_x\mathcal{L}_y = \left(\mathcal{I} - \mathcal{D}_{xy}\right)^{-1}.
\]
With this last identity, the Laplacian becomes
\[
-\frac{1}{\alpha^2}\Delta = \left(\mathcal{I} - \mathcal{D}_{xy}\right)^{-1}\mathcal{C}_{xy}.
\]
Then, they expand the Laplacian into a power series to obtain the result
\[
\left(\frac{\Delta}{\alpha^2}\right)^m = (-1)^m\sum_{p=m}^{\infty}\binom{p-1}{m-1}\mathcal{C}_{xy}^m\mathcal{D}_{xy}^{p-m}. \tag{2.65}
\]
Note that the expansion above is valid because $\|\mathcal{D}_{xy}\| \le 1$ in the sense of operator norms. This can be seen from the one-dimensional analogue (2.62). In Fourier space, the operator $\mathcal{D}_x$ satisfies
\[
\mathcal{F}[\mathcal{D}_x] = 1 - \mathcal{F}\left[\mathcal{L}_x^{-1}\right] = 1 - \frac{1}{1 + (k/\alpha)^2} = \frac{(k/\alpha)^2}{1 + (k/\alpha)^2} \le 1.
\]
With slight modifications, one can similarly show that $\left|\mathcal{F}\left[\mathcal{D}_{xy}\right]\right| \le 1$ also holds. By inserting the identity (2.65) into the error expansion (2.60), they obtained a family of methods defined by
\[
u^{n+1} = 2u^n - u^{n-1} + 2\sum_{p=1}^{N}\sum_{m=1}^{p}(-1)^m\frac{\beta^{2m}}{(2m)!}\binom{p-1}{m-1}\mathcal{C}_{xy}^m\mathcal{D}_{xy}^{p-m}\left[u^n\right], \qquad \alpha := \frac{\beta}{c\Delta t}, \quad 0 < \beta \le \beta_{\max}, \tag{2.66}
\]
where the truncation of the outer sum to $N$ terms leads to a method of order $2N$ in time. The value of $\beta_{\max}$ in this expansion depends on the desired order of the scheme and decreases as the order of the scheme increases.
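Identity (2.65) is easy to verify in Fourier space, where all of the operators reduce to scalars. The sketch below sums the series (without the $(-1)^m$ prefactor) for illustrative wavenumbers and checks it against the symbol of $-\Delta/\alpha^2$:

```python
import math

# Fourier symbols of the operators at wavenumbers (kx, ky); the values are
# illustrative, chosen so that the symbol of D_xy is safely below 1
kx, ky, alpha = 1.0, 2.0, 3.0
Lx = 1.0 + (kx / alpha)**2
Ly = 1.0 + (ky / alpha)**2
C = (1.0 - 1.0 / Lx) / Ly + (1.0 - 1.0 / Ly) / Lx   # symbol of C_xy
Dxy = 1.0 - 1.0 / (Lx * Ly)                          # symbol of D_xy, < 1

def laplacian_power_symbol(m, P=400):
    # partial sum of (2.65) in symbol form, without the (-1)^m prefactor:
    #   sum_{p=m}^{P} binom(p-1, m-1) C^m Dxy^{p-m}
    return sum(math.comb(p - 1, m - 1) * C**m * Dxy**(p - m)
               for p in range(m, P + 1))

# the symbol of -Delta/alpha^2 is (kx^2 + ky^2)/alpha^2
target = (kx**2 + ky**2) / alpha**2
```

For $m = 1$ the series is geometric and sums to $\mathcal{C}_{xy}\mathcal{L}_x\mathcal{L}_y$, recovering the Laplacian symbol exactly; higher powers converge at the same geometric rate set by the symbol of $\mathcal{D}_{xy}$.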
For specific information on this particular topic, we refer the reader to the original paper [38], which analyzes the stability in great detail. To include source terms, the expansion (2.60) is modified to account for space and time derivatives of the source itself. We wish to point out that there is a typographical error in the fourth-order scheme with sources presented in the paper [38]. The correct form of the scheme is given by
\[
u^{n+1} = 2u^n - u^{n-1} - \beta^2\mathcal{C}_{xy}\left[u^n\right] - \left(\beta^2\mathcal{D}_{xy} - \frac{\beta^4}{12}\mathcal{C}_{xy}\right)\mathcal{C}_{xy}\left[u^n\right] + \frac{\beta^2}{12\alpha^2}\left(S^{n+1} + 10S^n + S^{n-1}\right) - \frac{\beta^4}{12\alpha^2}\mathcal{C}_{xy}\left[S^n\right]. \tag{2.67}
\]
The above scheme contains an implicit source term that arises from an approximation of the second time derivative about time level $n$. It should be noted that a different stencil can be used to make the source explicit. The approach used to apply boundary conditions in multi-dimensional problems with successive convolution is similar to the procedures presented for the second-order methods in section 2.4.3, with some caveats. Periodic boundary conditions can be applied level-by-level in the truncated operator expansions, directly following the procedure outlined in section 2.4.2.3 for the second-order time-centered scheme. Other boundary conditions, such as Dirichlet and Neumann, leverage the linearity of the wave equation being solved to enforce boundary conditions at the lowest level, with a homogeneous variant used for the remaining levels. Higher-order outflow methods with successive convolution were introduced in the thesis [54] and were later published in the article [40]. We wish to mention that the outflow methods proposed in the paper [40] showed large errors along the boundaries in numerical experiments, despite being fourth-order. These methods are not considered in this work, so we shall not elaborate on this any further.
The subject of boundary conditions, especially high-order outflow, shall be the focus of future work, once this situation is better understood for the second-order methods.

2.5.2 BDF Methods

A straightforward way of increasing the accuracy of the BDF method is to increase the size of the time history. In place of the second-order accurate approximation for $\partial_{tt}u^{n+1}$, we could instead use the third-order accurate difference given by
\[
\partial_{tt}u^{n+1} = \frac{\frac{35}{12}u^{n+1} - \frac{26}{3}u^n + \frac{19}{2}u^{n-1} - \frac{14}{3}u^{n-2} + \frac{11}{12}u^{n-3}}{\Delta t^2} + \mathcal{O}(\Delta t^3), \tag{2.68}
\]
or even
\[
\partial_{tt}u^{n+1} = \frac{\frac{15}{4}u^{n+1} - \frac{77}{6}u^n + \frac{107}{6}u^{n-1} - 13u^{n-2} + \frac{61}{12}u^{n-3} - \frac{5}{6}u^{n-4}}{\Delta t^2} + \mathcal{O}(\Delta t^4), \tag{2.69}
\]
which is fourth-order accurate. Then, to derive higher-order BDF methods, one can repeat the steps in section 2.2.1, replacing the second-order stencil with either the third-order stencil (2.68) or the fourth-order stencil (2.69). This leads to the respective semi-discrete schemes
\[
\left(\mathcal{I} - \frac{1}{\alpha^2}\Delta\right)u^{n+1} = \frac{12}{35}\left(\frac{26}{3}u^n - \frac{19}{2}u^{n-1} + \frac{14}{3}u^{n-2} - \frac{11}{12}u^{n-3}\right) + \frac{1}{\alpha^2}S^{n+1}(\mathbf{x}) + \mathcal{O}\left(\frac{1}{\alpha^5}\right), \qquad \alpha := \frac{\sqrt{35/12}}{c\Delta t}, \tag{2.70}
\]
and
\[
\left(\mathcal{I} - \frac{1}{\alpha^2}\Delta\right)u^{n+1} = \frac{4}{15}\left(\frac{77}{6}u^n - \frac{107}{6}u^{n-1} + 13u^{n-2} - \frac{61}{12}u^{n-3} + \frac{5}{6}u^{n-4}\right) + \frac{1}{\alpha^2}S^{n+1}(\mathbf{x}) + \mathcal{O}\left(\frac{1}{\alpha^6}\right), \qquad \alpha := \frac{\sqrt{15/4}}{c\Delta t}. \tag{2.71}
\]
The process for calculating derivatives is identical, with the only differences from (2.35) being the operand of the convolution integral and the particular definition of the parameter $\alpha$, so we shall not present these equations. Additionally, boundary conditions for this approach do not present any additional challenges beyond the base second-order scheme. This feature makes higher-order BDF methods simple to implement. Despite the advantages afforded by the BDF methods, there are two issues to address in this type of approach. First, there is the issue of stability. In the case of first-order ODEs, it is well known that BDF discretizations become unstable if the order is greater than 6.
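The stencil coefficients in (2.68) can be checked directly: they must annihilate constants and linear functions, reproduce $t^2$ exactly, and deliver third-order accuracy on smooth data. A small sketch (Python/NumPy):

```python
import numpy as np

# third-order backward stencil for u_tt at t^{n+1}, from (2.68);
# node offsets are 0, -1, -2, -3, -4 in units of dt
c = np.array([35/12, -26/3, 19/2, -14/3, 11/12])
k = np.arange(5)

def second_derivative(f, t, dt):
    # apply the stencil to samples f(t), f(t - dt), ..., f(t - 4*dt)
    return sum(ck * f(t - j * dt) for j, ck in zip(k, c)) / dt**2

# observed order on a smooth function: the error should scale like dt^3
t0 = 1.0
e1 = abs(second_derivative(np.sin, t0, 0.1)  - (-np.sin(t0)))
e2 = abs(second_derivative(np.sin, t0, 0.05) - (-np.sin(t0)))
rate = np.log2(e1 / e2)
```

The moment conditions below are exactly the constraints that define the stencil; the halving experiment then confirms the stated $\mathcal{O}(\Delta t^3)$ behavior.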
Similar issues will be encountered in these methods, and we plan to address the topic of stability in our later work. The second issue, which is less of a concern, is the splitting error in multi-dimensional problems that arises from the factorization used for the modified Helmholtz operator; however, it is possible to eliminate the splitting error by essentially "subtracting" it from the scheme with the aid of an iterative method [43].

2.6 Numerical Examples

In this section, we present numerical results for the field solvers introduced in this chapter. Using a one-dimensional test problem, we first compare the performance of the second-order BDF and time-centered derivatives. Then, we assess the performance of the proposed methods in multi-dimensional problems, focusing on the types of boundary conditions required in the problems considered in chapter 4. In the case of outflow boundary conditions, we also demonstrate the complications that arise when moving from one- to two-dimensional problems.

2.6.1 BDF and Time-centered Derivatives in One Spatial Dimension

As a first test, we perform a refinement study to compare the derivatives obtained with the second-order BDF and time-centered methods, which are given, respectively, by equations (2.35) and (2.49). The test problem we consider is the homogeneous two-way scalar wave equation
\[
\frac{1}{c^2}\partial_{tt}u - \partial_{xx}u = 0, \tag{2.72}
\]
with $c = 1$ and subject to periodic boundary conditions on $[0, 2\pi]$. The solution is evolved to the final time $T = 1$, and we use the initial data
\[
u(x, 0) = \sin(x), \qquad \partial_t u(x, 0) = 0. \tag{2.73}
\]
This equation has the solution
\[
u(x, t) = \frac{\sin(x - t) + \sin(x + t)}{2}, \tag{2.74}
\]
with the corresponding derivative
\[
\partial_x u(x, t) = \frac{\cos(x - t) + \cos(x + t)}{2}. \tag{2.75}
\]
Since these are multi-step methods, we need to supply values for the time history. While we can certainly use Taylor expansions to approximate this data, we use the analytical solution (2.74) and its corresponding spatial derivative (2.75).
This allows us to avoid any potential errors associated with the initialization. We also wish to point out that the splitting error is not present in one-dimensional problems, so this eliminates another source of error. Regarding the implementation, we note that, unlike the time-centered method (2.49), the BDF method (2.35) does not require additional time history for the derivative. In the experiment shown here, the derivatives for the time-centered method are stored in a time history, similar to the solution, which is updated in each time step of the simulation. A fifth-order spatial quadrature is used to compute the local integrals in both methods. In the time-centered method (2.48) and its derivative (2.49), we use $\beta = 2$, which is the largest allowable $\beta$ that retains the stability of the method [34].

Figure 2.1: Spatial refinement study for the solution and its derivative obtained with second-order methods for the periodic test problem in section 2.6.1. In Figure 2.1a (Central-2), we plot the $\ell_\infty$ errors for both the numerical solution and the derivative obtained with the time-centered method. Similarly, in Figure 2.1b (BDF-2), we show the same quantities, which are instead computed using the BDF method. The derivative for the time-centered method fails to refine in space, while the BDF derivative is as accurate as the numerical solution itself.

In the refinement experiments for space, we run each case with $N_t = 2^{12}$ time steps. We successively double the number of mesh points, beginning with $N_x = 32$ and finishing with $N_x = 2048$. In Figure 2.1, we compare the accuracy of the time-centered and BDF methods, and their proposed derivatives, against analytical solutions. The results indicate that the proposed spatial derivatives computed via (2.35) refine at the same rate as the BDF method (2.29).
In contrast, the proposed derivative based on the time-centered method fails to refine, even though the numerical solution demonstrates fifth-order accuracy in space.

The time refinement experiment uses a fixed spatial mesh consisting of $N_x = 256$ grid points and successively doubles the number of time steps from $N_t = 8$ to $N_t = 512$. In Figure 2.2, we show the results from the refinement study performed in time for the proposed methods. Based on Figure 2.2a, we can see that when fewer time steps are used in the time-centered method, the derivatives exhibit refinement properties similar to those of the numerical solution. As we use more time steps, the errors between the solution and its derivative grow, which suggests an issue with the accumulation of errors in time.

Figure 2.2: Time refinement study for the solution and its derivative obtained with second-order methods for the test problem in section 2.6.1. In Figure 2.2a (Central-2), we plot the $\ell_\infty$ errors for both the numerical solution and the derivative obtained with the time-centered method. Similarly, in Figure 2.2b (BDF-2), we show the same quantities, which are instead computed using the BDF method. The derivative for the time-centered method initially converges together with the numerical solution, but at some point begins to diverge. In contrast, the errors for the derivatives obtained with the BDF method are aligned with those of the solution. Comparing the scales of the plots, we note that the BDF solution is slightly less accurate than the time-centered solution.

The results obtained with the BDF method are displayed in Figure 2.2b and match the second-order accuracy of the base method. The results of the space and time refinement experiments for the proposed methods suggest that we should use (2.35) to construct derivatives.
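Throughout these studies, the observed convergence rates are read off from errors at successively halved resolutions; a small helper (illustrative) makes the computation explicit:

```python
import numpy as np

def observed_orders(errors):
    # successive halving of the mesh: rate_k = log2(e_k / e_{k+1})
    e = np.asarray(errors, dtype=float)
    return np.log2(e[:-1] / e[1:])

# synthetic check: errors behaving like C*h^2 under halving give rate 2
h = np.array([0.1, 0.05, 0.025, 0.0125])
rates = observed_orders(3.0 * h**2)
```

In practice the rate is reported between consecutive rows of a refinement table, and a clean scheme of order $p$ shows rates settling near $p$ as the mesh is refined.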
While derivatives computed with the BDF method require a time history, we can think of this method as a one-step application in the sense that the derivatives at any given time level depend on the solution and not on derivatives at other time levels. Moreover, the time history in the BDF derivatives does not necessarily have to come from the BDF method itself. In fact, as we will soon see, these methods work even if this time history is supplied by another solver, e.g., the time-centered method or perhaps a higher-order method based on successive convolution.

2.6.2 Periodic Boundary Conditions

The next problem we consider is the two-dimensional inhomogeneous scalar wave equation
\[
\frac{1}{c^2}\partial_{tt}u - \Delta u = S(x, y, t), \tag{2.76}
\]
where $c = 1$ and
\[
S(x, y, t) = 3e^{-t}\sin(x)\cos(y). \tag{2.77}
\]
We apply periodic boundary conditions in both directions on the domain $[0, 2\pi] \times [0, 2\pi]$ and use the initial data
\[
u(x, y, 0) = \sin(x)\cos(y), \qquad \partial_t u(x, y, 0) = -\sin(x)\cos(y). \tag{2.78}
\]
The problem (2.76) is associated with the manufactured solution
\[
u(x, y, t) = e^{-t}\sin(x)\cos(y), \tag{2.79}
\]
which defines the source function (2.77). The partial derivatives of this solution are calculated to be
\[
\partial_x u(x, y, t) = e^{-t}\cos(x)\cos(y), \tag{2.80}
\]
\[
\partial_y u(x, y, t) = -e^{-t}\sin(x)\sin(y). \tag{2.81}
\]
We performed temporal and spatial refinement studies using a mixed approach, as well as a pure BDF approach. The mixed approach uses the second-order time-centered method to evolve the solution $u$, and its partial derivatives in both variables are calculated using the second-order BDF scheme; this is denoted as "Central-2 + BDF-2" in the figures. Similarly, the pure BDF approach computes both the solution and its derivatives using the BDF scheme, i.e., "BDF-2 + BDF-2". In these experiments, we used a fifth-order spatial quadrature rule. In the temporal refinement study, the solution is computed until a final time of $T = 1$ using a fixed $256 \times 256$ mesh in space. We successively double the number of time steps from $N_t = 8$ to $N_t = 512$.
We use the analytical solution to initialize the method, since it is available. The results of the temporal refinement study are presented in Figure 2.3, in which all methods, including those for the derivatives, display the expected second-order convergence rate in time.

Figure 2.3: Time refinement study for the solution and its derivative for the two-dimensional periodic example of section 2.6.2 obtained with second-order methods. In Figure 2.3a (Central-2 + BDF-2), we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.3b (BDF-2 + BDF-2), we show the same quantities, both of which are obtained using the BDF-2 method.

For the space refinement experiment, we varied the spatial mesh in each direction from 16 points to 512 points. To keep the temporal error in the methods small during the refinement, we applied the methods for 1 time step using a step size of $\Delta t = 1 \times 10^{-4}$. Note that the disparity in the order between time and space (second-order versus fifth-order) necessitates a small time step here; however, the waves may fail to propagate if too small a time step is used. This can be fixed using a Taylor expansion of the Green's function in the quadrature rule, as mentioned in section 2.3.3.

Figure 2.4: Space refinement of the solution and its derivative for the two-dimensional periodic example of section 2.6.2 obtained with second-order methods. In Figure 2.4a (Central-2 + BDF-2), we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. Similarly, in Figure 2.4b (BDF-2 + BDF-2), we show the same quantities, both of which are obtained using the BDF-2 method.

The refinement plots in Figure 2.4 indicate fifth-order accuracy in space for all methods. We note that the derivatives in the methods begin to level off as the error approaches $1 \times 10^{-11}$. This
is likely due to a different error coefficient in time, which arises from the differentiation process. A smaller time step would be necessary to remove this feature, but this requires some modification of the quadrature.

2.6.3 Dirichlet Boundary Conditions

For the Dirichlet problem, we again consider the two-dimensional inhomogeneous scalar wave equation
\[
\frac{1}{c^2}\partial_{tt}u - \Delta u = S(x, y, t), \tag{2.82}
\]
where $c = 1$ and
\[
S(x, y, t) = 3e^{-t}\sin(x)\sin(y). \tag{2.83}
\]
We apply homogeneous Dirichlet boundary conditions on the domain $[0, 2\pi] \times [0, 2\pi]$ and use the initial data
\[
u(x, y, 0) = \sin(x)\sin(y), \qquad \partial_t u(x, y, 0) = -\sin(x)\sin(y). \tag{2.84}
\]
The problem (2.82) is associated with the manufactured solution
\[
u(x, y, t) = e^{-t}\sin(x)\sin(y), \tag{2.85}
\]
which defines the source function (2.83). The partial derivatives of this solution are calculated to be
\[
\partial_x u(x, y, t) = e^{-t}\cos(x)\sin(y), \tag{2.86}
\]
\[
\partial_y u(x, y, t) = e^{-t}\sin(x)\cos(y). \tag{2.87}
\]
We performed temporal and spatial refinement experiments using the same mixed and pure BDF approaches considered in the previous section 2.6.2. As a reminder, the mixed approach uses the second-order time-centered method to evolve the solution $u$, with its partial derivatives calculated using the second-order BDF scheme; this is denoted as "Central-2 + BDF-2" in the figures. The pure BDF approach computes both the solution and its derivatives with the BDF scheme, similarly denoted as "BDF-2 + BDF-2" in the figures. We use the same fifth-order spatial quadrature rule as in the periodic test case to perform the runs. In the temporal refinement study, the solution is computed until a final time of $T = 1$. We use a fixed $256 \times 256$ spatial mesh, and the number of time steps in each case is successively doubled from $N_t = 8$ to $N_t = 512$.
Errors can be measured directly against the analytical solution and its derivatives.

Figure 2.5: Time refinement study of the solution and its derivatives in the two-dimensional Dirichlet problem of section 2.6.3 obtained with second-order methods. (a) Central-2 + BDF-2; (b) BDF-2 + BDF-2. In Figure 2.5a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. In Figure 2.5b, we show the same quantities, both of which are obtained using the BDF-2 method.

The results of the temporal refinement study are presented in Figure 2.5, in which all methods, including those for the derivatives, display the expected second-order convergence rate. The behavior is essentially identical to the results obtained for the periodic problem, which were presented in Figure 2.3.

Figure 2.6: Space refinement of the solution and its derivatives in the two-dimensional Dirichlet problem of section 2.6.3 obtained with second-order methods. (a) Central-2 + BDF-2; (b) BDF-2 + BDF-2. In Figure 2.6a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. In Figure 2.6b, we show the same quantities, both of which are obtained using the BDF-2 method.

We performed the spatial refinement study by varying the number of mesh points in each direction from 16 points to 512 points. Again, to keep the temporal error in the methods small while space is refined, we applied the methods for only 1 time step with a step size of Δt = 1 × 10⁻⁴. The same remark about small time step sizes made for the spatial refinement experiment in the periodic problem applies to this case as well (see section 2.6.2). The refinement plots in Figure 2.6 indicate that the methods refine at fifth-order accuracy in space. In both the mixed and pure BDF approaches, the error in the derivatives behaves differently from what was observed in the periodic example.
In particular, we do not observe a flattening of the error when the spacing Δx is small.

2.6.4 Outflow Boundary Conditions

To test outflow boundary conditions, we use the homogeneous scalar wave equation

    (1/c²) ∂_tt u − Δu = 0,    (2.88)

where c = 1. We solve the problem (2.88) on the two-dimensional domain [−2, 2] × [−2, 2] with the initial data

    u(x, y, 0) = e^{−16(x² + y²)},    ∂_t u(x, y, 0) = 0,    (2.89)

which is a Gaussian centered at the origin. Note that the width of the Gaussian is chosen so that the data is essentially machine zero at points along the boundary. If the initial data were not zero outside of the domain, a specialized approach for initializing the boundary coefficients in the temporal recursion for outflow would need to be devised. Since that is not the case for this problem, we can initialize the recursion for the boundary coefficients with zeros. The refinement experiments consider the same mixed and BDF approaches as in the previous sections. Since we do not have an analytical solution for the problem (2.88), we use a reference solution computed on a sufficiently fine temporal or spatial mesh. In all tests, the time history data is initialized using Taylor expansions in time, keeping terms up to fourth-order accuracy. Recall that we presented two forms of outflow in sections 2.4.1.4 and 2.4.2.4. We chose to use the explicit forms of outflow because they are simpler to implement in multi-dimensional problems. Moreover, using the one-dimensional analogue of problem (2.88), we found that the explicit approaches were more effective at suppressing artificial reflections of the waves along the boundary.
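To make the "essentially machine zero" claim concrete, a quick check (hypothetical, not from the thesis) evaluates the Gaussian (2.89) at the nearest and farthest boundary points of [−2, 2] × [−2, 2]:

```python
import math

def u0(x, y):
    # initial condition (2.89): Gaussian centered at the origin
    return math.exp(-16.0 * (x**2 + y**2))

edge = u0(2.0, 0.0)    # boundary point closest to the origin: e^{-64}
corner = u0(2.0, 2.0)  # farthest boundary point (a corner): e^{-128}

# both values are far below double-precision unit roundoff (~2.2e-16)
print(edge, corner)
```

Since even the closest boundary point carries data of size e⁻⁶⁴ ≈ 10⁻²⁸, initializing the boundary recursion with zeros introduces no error at the level of double precision.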
Figure 2.7 shows an example of this comparison between the implicit and explicit forms of outflow for the BDF method.

Figure 2.7: Reflections produced by the implicit and explicit forms of the outflow boundary conditions for the second-order BDF method in a one-dimensional outflow problem. (a) BDF-2 with implicit weights (1-D); (b) BDF-2 with explicit weights (1-D). We run with the same Gaussian initial condition until the final time T = 4, at which point the wave data should no longer be in the simulation; what is left is the reflection at the artificial boundaries of the domain. The plot on the left shows the results obtained with the proposed implicit form of the outflow weights developed for the BDF-2 method, while the plot on the right uses the explicit form of the weights. We find that the explicit form of the weights is more effective at suppressing the spurious reflections at the artificial boundaries.

At the time shown in the plots, what remains of the original wave is due to reflections. In the refinement study for time, we evolve the solution on a fixed 512 × 512 mesh until the final time T = 1.0, which is before the wave leaves the domain. The number of time steps is successively doubled from N_t = 8 until N_t = 512. We note that the coarsest time discretization gives a CFL ≈ 16, which is much larger than we would typically use for the second-order schemes. The reference solution is obtained on the same spatial mesh with a total of N_t = 2048 time steps. Results obtained with the mixed and BDF approaches are presented in Figures 2.8a and 2.8b, respectively. The methods for the solution and the corresponding derivatives refine at second-order accuracy in time, with the mixed approach being the more accurate of the two. The overall error in these methods is notably larger than for the periodic and Dirichlet test problems.
One reason for this is that periodic and Dirichlet conditions can be enforced exactly, while outflow conditions are only approximately enforced. This error is further amplified in the case of the derivatives, which introduce an additional factor of O(1/Δt) that amplifies the overall size of the error.

Figure 2.8: Time refinement study of the solution and its derivatives in the two-dimensional outflow problem of section 2.6.4 obtained with second-order methods. (a) Central-2 + BDF-2 (2-D); (b) BDF-2 + BDF-2 (2-D). In Figure 2.8a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. In Figure 2.8b, we show the same quantities, both of which are obtained using the BDF-2 method.

Similar error properties can be observed in the analogous one-dimensional problem. Using the same problem setup for the one-dimensional case, we applied the implicit forms of outflow for the time-centered and BDF methods, which we show in Figure 2.9. While the outflow methods themselves are different, we can see that similar refinement properties are observed here as well.

Figure 2.9: A comparison of the temporal refinement properties for the one-dimensional implicit methods. (a) Central-2 with implicit weights (1-D); (b) BDF-2 with implicit weights (1-D). The weights for the time-centered method shown in Figure 2.9a are taken from the paper [34]. We compare this to the proposed implicit approach to outflow, shown on the right, in Figure 2.9b.

This comparison informs us that the issue is likely not related to dimensionality or the form of the outflow weights (e.g., implicit or explicit). The issue seems to be more closely related to the propagation of errors in the methods when only approximate boundary data is available. The spatial refinement study was performed by varying the number of mesh points in each direction from 17 points to 513 points.
The methods were applied for only 1 time step with a step size of Δt = 1 × 10⁻⁴, and the errors were measured against a reference solution computed on a 2049 × 2049 spatial mesh, also using Δt = 1 × 10⁻⁴. The refinement plots in Figure 2.10 show that each of these methods is approximately first-order in space. As alluded to in the time refinement study, this seems to be a consequence of using inexact boundary conditions. The particular form of the inverse operator (2.19) suggests that errors in A and B, which are used to enforce boundary conditions, can impact regions in the vicinity of the boundary. The size of these regions depends on the size of α, which, in turn, depends on the wave speed c and the value of Δt.

Figure 2.10: Space refinement of the solution and its derivatives in the two-dimensional outflow problem of section 2.6.4 obtained with second-order methods. (a) Central-2 + BDF-2 (2-D); (b) BDF-2 + BDF-2 (2-D). In Figure 2.10a, we plot errors for the numerical solution obtained with the central-2 method and the partial derivatives obtained with the BDF-2 method. In Figure 2.10b, we show the same quantities, both of which are obtained using the BDF-2 method.

As with the time refinement, we can look at the analogous one-dimensional problem and compare these results with the two-dimensional problem. The one-dimensional refinement results, obtained with the implicit form of the outflow weights, are presented in Figure 2.11. Again, we observe similar behavior in the refinement properties of the one- and two-dimensional test problems, which again hints at issues concerning the propagation of errors in the method.

Figure 2.11: A comparison of the spatial refinement properties for the one-dimensional implicit methods. (a) Central-2 with implicit weights (1-D); (b) BDF-2 with implicit weights (1-D). The weights for the time-centered method, shown in Figure 2.11a, are taken from the paper [34].
We compare this to the proposed implicit approach to outflow, shown on the right, in Figure 2.11b.

2.7 Conclusion

In this work, we developed new approaches for computing fields and their derivatives with applications to scalar wave equations. Our contributions build on prior developments for a class of algorithms known as the MOL𝑇, which combines a dimensional splitting technique with a one-dimensional integral equation method to yield algorithms with unconditional stability, geometric flexibility, and O(N) complexity. The proposed methods for derivatives use data that is already available through the base method, so they naturally inherit its properties. We also presented a treatment of outflow boundary conditions for the BDF method. The accuracy of the proposed methods was evaluated by performing a series of refinement experiments in both space and time, using several types of boundary conditions. In particular, we established some refinement properties for outflow boundary conditions, which were not presented in earlier work.

CHAPTER 3
PARALLEL ALGORITHMS FOR SUCCESSIVE CONVOLUTION

3.1 Introduction

In this chapter, we develop parallel algorithms using novel approaches to represent derivative operators for linear and nonlinear time-dependent partial differential equations (PDEs). We chose to investigate algorithms for these representations due to the stability properties observed for a wide range of linear and nonlinear PDEs. The approach considered here uses expansions involving integral operators to approximate spatial derivatives. We shall refer to this approach as the Method of Lines Transpose (MOL𝑇), though it can be more broadly categorized within a larger class of successive convolution methods. The name arises because the terms in the operator expansions, which we describe later, involve convolution integrals whose operand is recursively, or successively, defined.
Despite the use of explicit data in these integral terms, the boundary data remains implicit, which contributes to both the speed and stability of the representations. The inclusion of more terms in these operator expansions, when combined with a high-order quadrature method, allows one to obtain a high-order discretization in both space and time. Another benefit of this approach is that extensions to multiple spatial dimensions are straightforward, as operators can be treated in a line-by-line fashion. Moreover, the integral equations are amenable to fast-summation techniques, which reduce the overall computational complexity along a given dimension from O(N²) to O(N), where N is the number of discrete grid points along that dimension. High-order successive convolution algorithms have been developed to solve a range of time-dependent PDEs, including the wave equation [50], the heat equation (e.g., the Allen-Cahn [55] and Cahn-Hilliard [43] equations), Maxwell's equations [37], the Vlasov equation [52], the degenerate advection-diffusion (A-D) equation [44], and the Hamilton-Jacobi (H-J) equation [56, 45]. In contrast to these papers, this work focuses on the performance of the method in parallel computing environments, which is a largely unexplored area of research. Specifically, our work focuses on developing effective domain decomposition strategies for distributed memory systems and building thread-scalable algorithms using the low-order schemes as a baseline. By leveraging the decay properties of the integral representation, we restrict the calculations to localized, non-overlapping subsets of the spatial domain. The algorithms presented in this work consider dependencies between nearest neighbors (N-N), but, as we will see, this restriction can be generalized to include additional information at the cost of additional communication.
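The decay that justifies the N-N restriction can be quantified with a back-of-the-envelope estimate (illustrative only; the parameter α and subdomain width w below are hypothetical values, not taken from the thesis). The exponential kernel damps contributions from points a distance d away by e^{−αd}, so anything beyond the neighboring subdomain is negligible once αw is modestly large:

```python
import math

def tail_fraction(alpha, d):
    # fraction of the kernel mass alpha * exp(-alpha * s) lying beyond
    # distance d:  alpha * int_d^inf exp(-alpha * s) ds = exp(-alpha * d)
    return math.exp(-alpha * d)

alpha, w = 50.0, 1.0   # hypothetical kernel parameter and subdomain width
print(tail_fraction(alpha, w) < 2.2e-16)  # beyond-neighbor terms fall below roundoff
```

In this regime, truncating the convolution to nearest-neighbor subdomains discards only contributions below double-precision roundoff, which is what makes the localized, non-overlapping decomposition viable.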
Using a hybrid design that employs MPI and Kokkos [46] for the distributed and shared memory components of the algorithms, respectively, we show that our methods are efficient and can sustain an update rate > 1 × 10⁸ DOF/node/s. While experimentation on graphics processing units (GPUs) is left to future work, we believe choosing Kokkos will provide a path for a more performant and seamless integration of our algorithms with new computing hardware. Recent developments in successive convolution methods have focused on extensions to solve more general nonlinear PDEs, for which an integral solution is generally not applicable. This work considers discretizations developed for degenerate advection-diffusion (A-D) equations [44], as well as the Hamilton-Jacobi (H-J) equations [56, 45]. The key idea in these papers was to exploit the linearity of a given differential operator rather than the underlying equations, allowing derivatives in nonlinear problems to be expressed using the same representations developed for linear problems. For linear problems, it was demonstrated that one could couple these representations for the derivative operators with an explicit time-stepping method, such as the strong-stability-preserving Runge-Kutta (SSP-RK) methods [57], and still obtain schemes that maintain unconditional stability [56, 44]. To address shock capturing and control non-physical oscillations, the latter two papers introduced quadratures that use WENO reconstructions, along with a nonlinear filter to further control oscillations. In [45], the schemes for the H-J equations were extended to enable calculations on mapped grids. That paper also proposed a new WENO quadrature method built on a basis of exponential polynomials, which improves the shock-capturing capabilities. Our choice to discretize time first, before treating the spatial quantities, is not a new idea.
A well-known approach is Rothe's method [58, 59], in which a finite difference approximation is used for time derivatives and an integral equation solver is developed for the resulting sequence of elliptic PDEs (see e.g., [60, 61, 62, 49, 63, 64, 65]). The earlier developments for successive convolution methods, such as [50], are quite similar to Rothe's method in the treatment of the time derivatives. However, successive convolution methods differ considerably from Rothe's method in the treatment of spatial derivatives for nonlinear problems, such as those considered in more recent work on successive convolution (see e.g., [56, 44, 45]), as Newton iteration can be avoided on nonlinear terms. Additionally, these methods do not require solutions to linear systems. In contrast, Nyström methods, which are used to discretize the integral equations in Rothe's method, result in dense linear systems, which are typically solved using an iterative method such as GMRES [27]. Despite the fact that these linear systems are well-conditioned, the various collective operations that occur in distributed GMRES solves can become quite expensive on large computing platforms. Similarly, in [66, 67], Bruno and Lyon introduced a spectral method, based on the FFT, for computing spatial derivatives of general, possibly non-periodic, functions, known as Fourier-Continuation (FC). They combined this representation with the well-known Alternating-Direction (AD) methods, e.g., [68, 69, 70], dubbed FC-AD, to develop implicit solvers suitable for linear equations. This resulted in a method capable of computing spatial derivatives in a rapidly convergent and dispersionless fashion. A domain decomposition technique for the FC method is described in [71]: weak scaling was demonstrated up to 256 processors, using 4 processors per node, but larger runs were not considered. Another related transform approach was developed to solve the linear wave equation [72].
This work introduced a windowed Fourier methodology and combined it with a frequency-domain boundary integral equation solver to simulate long-time, high-frequency wave propagation. While that particular work does not focus on parallel implementations, the authors suggest several generic strategies, including a trivially parallelizable approach that solves a collection of frequency-domain integral equations in parallel; however, a purely parallel-in-time approach may not be appropriate for massively parallel systems, especially if few frequencies are required across time. This issue may be further complicated by the parallel implementation of the frequency-domain integral equation solvers, which, as previously mentioned, require the solution of dense linear systems. Therefore, it may be a rather difficult task to develop robust parallel algorithms that are capable of achieving their peak performance. This paper is organized as follows: In 3.2, we provide an overview of the numerical scheme used to formulate our algorithms. To this end, we first illustrate the connections among several characteristically different PDEs through the appearance of common operators in 3.2.1. Using these fundamental operators, we define the integral representation in 3.2.2, which is used to approximate spatial derivatives. Once the representations have been introduced, we briefly discuss the complications associated with boundary conditions and provide the relevant information, in 3.2.3, for implementing the boundary conditions used in our numerical tests. Sections 3.2.4 and 3.2.5 briefly review the spatial discretization process (including the fast-summation method and the quadrature) and the coupling with time integration methods, respectively. We provide the details of our new domain decomposition algorithm in 3.3, beginning with the derivation of the so-called N-N conditions in 3.3.1.
Using these conditions, we show how boundary conditions can be enforced locally for first and second derivative operators (3.3.2 and 3.3.3, respectively). Details concerning the implementation of the parallel algorithms are contained entirely in 3.4. This includes the introduction of the shared memory programming model (3.4.1), the definition of a certain performance metric (3.4.2) used in both loop optimization experiments and scaling studies (3.4.3), the presentation of shared memory algorithms (3.4.4), and, lastly, implementation details concerning the distributed memory algorithms (3.4.5). 3.5 contains the core numerical results, which confirm the convergence (3.5.1), as well as the weak and strong scalability (3.5.2 and 3.5.3, respectively), of the proposed algorithms. In 3.5.4, we examine the impact of the restriction posed by the N-N conditions. Finally, we summarize our findings with a brief conclusion in 3.6.

3.2 Description of Numerical Methods

In this section, we outline the approach used to develop unconditionally stable solvers by exploiting knowledge of linear operators. We start by demonstrating the connections between several different PDEs using operator notation, which allows us to reuse, or combine, approximations in several different ways. Once we have established these connections, we define an appropriate "inverse" operator and use it to develop the expansions used to represent derivatives. The representations we develop for derivative operators are motivated by the solution of simple 1-D problems. However, in multi-dimensional problems, these expressions are still valid approximations, in a certain sense, even though the kernels in the integral representation may not be solutions to the PDE in question.
While these approximations can be made high-order in both time and space, the focus of this work is strictly on the scalability of the method, so we limit ourselves to formulations that are first-order in time. Note that the approach described in 3.4, which considers first-order schemes, is quite general and can easily be extended to high-order representations. Once we have discussed our treatment of derivative terms, we describe the fast-summation algorithm and quadrature method in 3.2.4. Despite the fact that this work only considers smooth test problems, we include the relevant modifications required for non-smooth problems for completeness. In 3.2.5, we illustrate how the representation of derivative operators can be used within a time-stepping method to solve PDEs.

3.2.1 Connections Among Different PDEs

Before introducing the operators relevant to successive convolution algorithms, we establish the operator connections appearing in several linear PDE examples. This process helps identify key operators that can be represented with successive convolution. Specifically, we shall consider the following three prototypical linear PDEs:

• Linear advection equation: (∂_t − c ∂_x) u = 0,
• Diffusion equation: (∂_t − ν ∂_xx) u = 0,
• Wave equation: (∂_tt − c² ∂_xx) u = 0.

Next, we apply an implicit time discretization to each of these problems. For discussion purposes, we shall consider lower-order time discretizations, i.e., backward Euler for ∂_t u and a second-order central difference for ∂_tt u. If we identify the current time as t^n, the new time level as t^{n+1}, and Δt = t^{n+1} − t^n, then we obtain the corresponding set of semi-discrete equations:

• Linear advection equation: (I − Δt c ∂_x) u^{n+1} = u^n,
• Diffusion equation: (I − Δt ν ∂_xx) u^{n+1} = u^n,
• Wave equation: (I − Δt² c² ∂_xx) u^{n+1} = 2u^n − u^{n−1}.
Here, we use I to denote the identity operator, and, in all cases, each of the spatial derivatives is taken at time level t^{n+1} to keep the schemes implicit. The key observation is that the operator (I ± (1/α) ∂_x) arises in each of these examples. Notice that

    (I − (1/α²) ∂_xx) = (I − (1/α) ∂_x)(I + (1/α) ∂_x),

where α is a parameter that is selected according to the equation one wishes to solve. For example, in the case of diffusion, one selects

    α = β / √(νΔt),

while for the linear advection and wave equations, one selects

    α = β / (cΔt).

The parameter β, which does not depend on Δt, is then used to tune the stability of the approximations. For the test problems appearing in this paper, we always use β = 1. In 3.2.2, we demonstrate how the operator (I ± (1/α) ∂_x) can be used to approximate spatial derivatives. We remark that for second derivatives, one can also use (I − (1/α²) ∂_xx) to obtain a representation of second-order spatial derivatives, instead of factoring into "left" and "right" characteristics. Next, we introduce the following definitions to simplify the notation:

    L_L ≡ I − (1/α) ∂_x,    L_R ≡ I + (1/α) ∂_x.    (3.1)

Written in this manner, these definitions indicate the left- and right-moving components of the characteristics, respectively, as the subscripts are associated with the direction of propagation. For second derivative operators, which are not factored into first derivatives, we shall use

    L_0 ≡ I − (1/α²) ∂_xx.    (3.2)

In order to connect these operators with suitable expressions for spatial derivatives, we need to define the corresponding "inverse" for each of these linear operators on a 1-D interval [a, b]. These definitions are given as

    L_L^{-1}[·; α](x) ≡ α ∫_x^b e^{−α(s−x)} (·) ds + B e^{−α(b−x)}
                      ≡ I_L[·; α](x) + B e^{−α(b−x)},    (3.3)

    L_R^{-1}[·; α](x) ≡ α ∫_a^x e^{−α(x−s)} (·) ds + A e^{−α(x−a)}
                      ≡ I_R[·; α](x) + A e^{−α(x−a)}.    (3.4)

These definitions can be derived in a number of ways.
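The factorization of the second-derivative operator quoted above can be spot-checked on a concrete function (an illustrative verification, not from the thesis; the exact derivatives of sin stand in for the operators' action):

```python
import math

alpha = 3.0
x = 0.7

# action of the operators on v(x) = sin(x), using exact derivatives:
#   L_R v       = v + v'/alpha          = sin + cos/alpha
#   L_L (L_R v) = L_R v - (L_R v)'/alpha
#   L_0 v       = v - v''/alpha^2       = sin + sin/alpha^2
L_R_v = math.sin(x) + math.cos(x) / alpha
dL_R_v = math.cos(x) - math.sin(x) / alpha   # derivative of sin + cos/alpha
L_L_L_R_v = L_R_v - dL_R_v / alpha
L_0_v = math.sin(x) + math.sin(x) / alpha**2

# (I - alpha^{-2} d_xx) = (I - alpha^{-1} d_x)(I + alpha^{-1} d_x)
print(abs(L_L_L_R_v - L_0_v))  # ~ 0: the factorization holds
```

The two evaluations agree to machine precision, which is the algebraic fact that lets the second-derivative operator be built either directly from L_0 or as the composition of the "left" and "right" characteristic operators.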
In A.1, we demonstrate how these definitions can be derived for the linear advection equation using the integrating factor method. In these definitions, A and B are constants associated with the "homogeneous solution" of a corresponding semi-discrete problem and are used to satisfy the boundary conditions. In a similar way, one can compute the inverse operator for definition (3.2), which yields

    L_0^{-1}[·; α](x) ≡ (α/2) ∫_a^b e^{−α|x−s|} (·) ds + A e^{−α(x−a)} + B e^{−α(b−x)}
                      ≡ I_0[·; α](x) + A e^{−α(x−a)} + B e^{−α(b−x)}.    (3.5)

In these definitions, we refer to "·" as the operand and, again, α is a parameter selected according to the problem being solved. Although it is a slight abuse of notation, when it is not necessary to explicitly indicate the parameter or the point of evaluation, we shall place the operand inside a pair of parentheses. If we connect these definitions to each of the linear semi-discrete equations mentioned earlier, we can determine the update equation through an analytic inversion of the corresponding linear operator(s):

• Linear advection equation: u^{n+1} = L_R^{-1}(u^n), or u^{n+1} = L_L^{-1}(u^n),
• Diffusion equation: u^{n+1} = L_L^{-1}(L_R^{-1}(u^n)), or u^{n+1} = L_0^{-1}(u^n),
• Wave equation: u^{n+1} = L_L^{-1}(L_R^{-1}(2u^n − u^{n−1})), or u^{n+1} = L_0^{-1}(2u^n − u^{n−1}),

with the appropriate choice of α for the problem being considered. We note that each of these methods can be made high-order following the work in [56, 44, 45, 55, 38], where it was demonstrated that these approaches lead to methods that are unconditionally stable to all orders of accuracy for these linear PDEs, even with variable wave speeds or diffusion coefficients. Since the process of analytic inversion yields an integral term, a fast-summation technique should be used to reduce the computational complexity of a naive implementation, which would otherwise scale as O(N²).
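The unconditional stability claim can be illustrated for the simplest case, the diffusion update u^{n+1} = L_0^{-1}(u^n). On a Fourier mode e^{ikx}, with α² = 1/(νΔt) (i.e., β = 1), the symbol of L_0 = I − (1/α²) ∂_xx is 1 + νΔt k², so the update multiplies each mode by 1/(1 + νΔt k²) ≤ 1 for every wavenumber and every time step. A small sketch (illustrative only, not code from the thesis):

```python
def amp(k, nu, dt):
    # symbol of L_0^{-1} = (I - alpha^{-2} d_xx)^{-1} on the mode e^{ikx},
    # with alpha^2 = 1/(nu * dt) (beta = 1)
    return 1.0 / (1.0 + nu * dt * k**2)

# every mode is damped (or left unchanged), even at a huge time step
worst = max(amp(k, nu=1.0, dt=100.0) for k in range(0, 512))
print(0.0 < worst <= 1.0)  # True: no mode can grow
```

The same symbol calculus applied to the advection and wave updates gives amplification factors of modulus at most one as well, which is the sense in which analytic inversion yields unconditionally stable schemes.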
Some details concerning the spatial discretization and the O(N) fast-summation method are briefly summarized in 3.2.4 (for full details, please see [55]). Next, we demonstrate how the operator L∗ can be used to approximate spatial derivatives.

3.2.2 Representation of Derivatives

In the previous section, we observed that characteristically different PDEs can be described in terms of a common set of operators. The focus of this section shall be on manipulating these approximations to obtain a high-order discretization in time through certain operator expansions. The process begins by introducing an operator related to L∗^{-1}, namely,

    D∗ ≡ I − L∗^{-1},    (3.6)

where ∗ can be L, R, or 0. The motivation for these definitions will become clear soon. Additionally, we can derive an identity from the definitions (3.6). By manipulating the terms, we quickly find that

    L∗ ≡ (I − D∗)^{-1},    (3.7)

again, where ∗ can be L, R, or 0. The purpose of the identity (3.7) is that it connects the spatial derivative to an expression involving integrals of the solution rather than derivatives. In other words, it allows us to avoid having to use a stencil operation for derivatives. To obtain an approximation for the first derivative in space, we can use L_L, L_R, or both of them, which may occur as part of a monotone splitting. If we combine the definition of the left-propagating first derivative operator in equation (3.1) with the definition (3.6) and identity (3.7), we can define the first derivative in terms of the D_L operator. Observe that

    ∂_x⁺ = α (I − L_L)
         = α (L_L L_L^{-1} − L_L)
         = α L_L (L_L^{-1} − I)
         = −α L_L (I − L_L^{-1})
         = −α (I − D_L)^{-1} D_L
         = −α Σ_{p=1}^∞ D_L^p,    (3.8)

where, in the last step, we used the fact that the operator D_L is bounded by unity in an operator norm, so that (I − D_L)^{-1} can be expanded as a Neumann series. We use the + convention to indicate that this is a right-sided derivative.
Likewise, for the right-propagating first derivative operator, we find that the complementary left-biased derivative is given by

    ∂_x⁻ = α Σ_{p=1}^∞ D_R^p.    (3.9)

Additionally, the second derivative can be expressed as

    ∂_xx = −α² Σ_{p=1}^∞ D_0^p.    (3.10)

As the name implies, each power of D∗ is successively defined according to

    D∗^k ≡ D∗(D∗^{k−1}).    (3.11)

In previous work [44], for periodic boundary conditions, it was established that the partial sums for the left- and right-biased approximations to ∂_x satisfy

    ∂_x⁺ = −α (Σ_{p=1}^n D_L^p + O(1/α^{n+1})),    ∂_x⁻ = α (Σ_{p=1}^n D_R^p + O(1/α^{n+1})).    (3.12)

Similarly, for second derivatives with periodic boundaries, retaining n terms leads to a truncation error of the form

    ∂_xx = −α² (Σ_{p=1}^n D_0^p + O(1/α^{2n+2})).    (3.13)

In both cases, the relations can be obtained through a repeated application of integration by parts with induction. These approximations are still exact in space, but the integral operators nested in D∗ will eventually be approximated with quadrature. From these relations, we can also observe the impact of α on the size of the error in time. In particular, if we select α = O(1/Δt) in (3.12) and α = O(1/√Δt) in (3.13), each of the approximations has an error of the form O(Δt^n). The results concerning the consistency and stability of these higher-order approximations were established in [56, 44, 38]. As mentioned earlier, this work only considers approximations that are first-order in time. Therefore, we shall restrict ourselves to the following operator representations:

    ∂_x⁺ ≈ −α D_L,    ∂_x⁻ ≈ α D_R,    ∂_xx ≈ −α² D_0.    (3.14)

This is a consequence of retaining a single term from each of the partial sums in equations (3.8), (3.9), and (3.10). Consequently, computing higher powers of D∗ is unnecessary, so the successive property (3.11) is not needed.
However, it clearly indicates a possible path for higher-order extensions of the ideas presented here. For the moment, we shall delay prescribing the choice of α used in the representations (3.14), in order to avoid a problem-dependent selection. As alluded to at the beginning of this section, an identical representation is used for multi-dimensional problems, where the D∗ operators are now associated with a particular dimension of the problem. The operators along a particular dimension are constructed using data along that dimension of the domain, so that the directions remain uncoupled. This completes the discussion of the generic form of the representations used for spatial derivatives. Next, we provide some information regarding the treatment of boundary conditions, which determine the constants A and B appearing in the D∗ operators.

3.2.3 Comment on Boundary Conditions

The process of prescribing the values of A and B inside equations (3.3), (3.4), and (3.5), which are required to construct D∗, is highly dependent on the structure of the problem being solved. Previous work has shown how to prescribe a variety of boundary conditions for linear PDEs (see e.g., [50, 55, 43, 38]). For example, in linear problems, such as the wave equation, with either periodic or non-periodic boundary conditions, one can directly enforce the boundary conditions to determine the constants A and B. The situation can become much more complicated for problems that are both nonlinear and non-periodic. For approximations of at most third-order accuracy with nonlinear PDEs, one can use the techniques from [56] for non-periodic problems. To achieve high-order time accuracy, the partial sums for this case were modified to eliminate certain low-order terms along the boundaries.
We note that the development of high-order time discretizations, subject to non-trivial boundary conditions, for nonlinear operators, is still an open area of research for successive convolution methods. As this paper concerns the scalability of the method, we shall consider test problems that involve periodic boundary conditions. For periodic problems defined on the line interval [a, b], the constants associated with the boundary conditions for first derivatives are given by
\[
A = \frac{I_R[v;\alpha](b)}{1-\mu}, \qquad B = \frac{I_L[v;\alpha](a)}{1-\mu}, \tag{3.15}
\]
where \(I_L\) and \(I_R\) were defined in equations (3.3) and (3.4). Similarly, for second derivatives, the constants can be determined to be
\[
A = \frac{I_0[v;\alpha](b)}{1-\mu}, \qquad B = \frac{I_0[v;\alpha](a)}{1-\mu}, \tag{3.16}
\]
with the definition of \(I_0\) coming from (3.5). In the expressions (3.15) and (3.16) provided above, we use \(\mu \equiv e^{-\alpha(b-a)}\), where α is the appropriately chosen parameter. Note that the function v(x) denotes the generic operand of the operator \(\mathcal{D}_*\). This helps reduce the complexity of the notation when several applications of \(\mathcal{D}_*\) are required, since they are recursively defined. As an example, suppose we wish to compute \(\partial_{xx} h(u)\), where h is some known function. For this, we can use the first-order scheme for second derivatives (see (3.14)) and take v = h(u) in the expressions (3.16) for the boundary terms.

3.2.4 Fast Convolution Algorithm and Spatial Discretization

To perform a spatial discretization over [a, b], we first create a grid of N + 1 points, \(a = x_0 < x_1 < \cdots < x_N = b\), with local spacing \(\Delta x_i = x_{i+1} - x_i\). A naive approach to computing the convolution integral would lead to a method of complexity \(\mathcal{O}(N^2)\), where N is the number of grid points. However, using some algebra, we can write recurrence relations for the integral terms which comprise \(\mathcal{D}_*\):
\[
I_R[v;\alpha](x_i) = e^{-\alpha \Delta x_{i-1}} I_R[v;\alpha](x_{i-1}) + J_R[v;\alpha](x_i), \qquad I_R[v;\alpha](x_0) = 0,
\]
\[
I_L[v;\alpha](x_i) = e^{-\alpha \Delta x_i} I_L[v;\alpha](x_{i+1}) + J_L[v;\alpha](x_i), \qquad I_L[v;\alpha](x_N) = 0.
\]
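The recurrence above can be sketched in a few lines of code. The following is a minimal, self-contained example of the \(\mathcal{O}(N)\) sweep for \(I_R\) on a uniform grid; the local integrals \(J_R\) are approximated here with a simple trapezoidal rule purely for illustration, whereas the method described in this chapter uses a sixth-order WENO-based quadrature for that step. The function name and the quadrature choice are ours, not part of the thesis implementation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal sketch of the O(N) fast convolution sweep for I_R on a
// uniform grid, using the recurrence
//   I_R(x_i) = e^{-alpha*dx} I_R(x_{i-1}) + J_R(x_i),  I_R(x_0) = 0.
// The local integrals J_R are approximated here with a trapezoidal
// rule (a stand-in for the sixth-order WENO quadrature of the text).
std::vector<double> convolve_right(const std::vector<double>& v,
                                   double alpha, double dx) {
    const std::size_t n = v.size();
    std::vector<double> I(n, 0.0);           // I_R(x_0) = 0
    const double decay = std::exp(-alpha * dx);
    for (std::size_t i = 1; i < n; ++i) {
        // Trapezoid on alpha * e^{-alpha(x_i - s)} v(s) over [x_{i-1}, x_i]
        const double J = 0.5 * dx * alpha * (decay * v[i - 1] + v[i]);
        I[i] = decay * I[i - 1] + J;         // reuse the previous partial sum
    }
    return I;
}
```

For v ≡ 1 on [0, 1], the exact value is \(I_R(x) = 1 - e^{-\alpha x}\), which the sweep reproduces to quadrature accuracy; each grid point costs one multiply-add, which is the source of the \(\mathcal{O}(N)\) complexity.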
Here, we have defined the local integrals
\[
J_R[v;\alpha](x_i) = \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - s)} v(s)\,ds, \tag{3.17}
\]
\[
J_L[v;\alpha](x_i) = \alpha \int_{x_i}^{x_{i+1}} e^{-\alpha(s - x_i)} v(s)\,ds. \tag{3.18}
\]
By writing the convolution integrals this way, we obtain a summation method which has a complexity of \(\mathcal{O}(N)\). Note that the same algorithm can be applied to compute the convolution integral for the second derivative operator by splitting the integral at a point x. After applying the above algorithm to the left and right contributions, we can recombine them through "averaging" to recover the original integral. While a variety of quadrature methods have been proposed to compute the local integrals (3.17) and (3.18) (see e.g., [45, 50, 55, 43, 38, 33]), we shall consider the sixth-order quadrature methods introduced in [44], which use WENO interpolation to address both smooth and non-smooth problems. In what follows, we describe the procedure for \(J_R[v;\alpha](x_i)\), since the reconstruction for \(J_L[v;\alpha](x_i)\) is similar. This approximation uses a six-point stencil given by \(S(i) = \{x_{i-3}, \cdots, x_{i+2}\}\), which is then divided into three smaller stencils, each of which contains four points, defined by \(S_r(i) = \{x_{i-3+r}, \cdots, x_{i+r}\}\) for r = 0, 1, 2. We associate r with the shift in the stencil. A graphical depiction of this stencil is provided in Figure 3.1. The quadrature method is developed as follows:

1. On each of the small stencils \(S_r(i)\), we use the approximation
\[
J_R^{(r)}[v;\alpha](x_i) \approx \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - s)} p_r(s)\,ds = \sum_{j=0}^{3} c^{(r)}_{-3+r+j}\, v_{i-3+r+j}, \tag{3.19}
\]
where \(p_r(x)\) is the Lagrange interpolating polynomial formed from points in \(S_r(i)\) and the \(c^{(r)}_\ell\) are the interpolation coefficients, which depend on the parameter α and the grid spacing, but not v.

2. In a similar way, on the large stencil \(S(i)\) we obtain the approximation
\[
J_R[v;\alpha](x_i) \approx \alpha \int_{x_{i-1}}^{x_i} e^{-\alpha(x_i - s)} p(s)\,ds. \tag{3.20}
\]

3.
When the function v(x) is smooth, we can combine the interpolants on the smaller stencils so they are consistent with the high-order approximation obtained on the larger stencil, i.e.,
\[
J_R[v;\alpha](x_i) = \sum_{r=0}^{2} d_r J_R^{(r)}[v;\alpha](x_i), \tag{3.21}
\]
where the \(d_r > 0\) are called the linear weights, which form a partition of unity. The problems we consider in this work involve smooth functions, so this is sufficient for the final approximation. For instances in which the solution is not smooth, the linear weights can be mapped to nonlinear weights using the notion of smoothness. We refer the interested reader to previous work [56, 44, 45] for details concerning non-smooth data sets.

Figure 3.1: Stencils used to build the six-point quadrature [56, 44].

In A.3, we provide the expressions used to compute the coefficients \(c^{(r)}_\ell\) and \(d_r\) for a uniform grid, although the non-uniform grid case can be done as well [73]. In the case of a non-uniform mesh, the linear weights \(d_r\) would become locally defined in the neighborhood of a given point and would need to be computed on-the-fly. Uniform grids eliminate this requirement, as the linear weights, for a given direction, can be computed once per time step and reused in each of the lines pointing along that direction.

3.2.5 Coupling Approximations with Time Integration Methods

Here, we demonstrate how one can use this approach to solve a large class of PDEs by coupling the spatial discretizations in 3.2.4 with explicit time stepping methods. In what follows, we shall consider general PDEs of the form \(\partial_t U = F(t, U)\), where F(t, U) is a collective term for spatial derivatives involving the solution variable U. Possible choices for F might include generic nonlinear advection and diffusion terms \(F(t, U) = \partial_x g_1(U) + \partial_{xx} g_2(U)\), or even components of the HJ equations \(F(t, U) = H(U, \partial_x U)\).
To demonstrate how one can couple these approaches, we start by discretizing a PDE in time, but, rather than use backwards Euler, we use an s-stage explicit Runge-Kutta (RK) method, i.e.,
\[
u^{n+1} = u^n + \Delta t \sum_{i=1}^{s} b_i k_i,
\]
where the various stages are given by
\[
k_1 = F(t_n, u^n), \qquad
k_2 = F(t_n + c_2 \Delta t, u^n + \Delta t\, a_{21} k_1), \qquad \ldots, \qquad
k_s = F\Big(t_n + c_s \Delta t,\; u^n + \Delta t \sum_{j=1}^{s-1} a_{sj} k_j\Big).
\]
As with a standard Method-of-Lines (MOL) discretization, we would need to reconstruct derivatives within each RK stage. To illustrate, consider the nonlinear A-D equation, \(F(t, u) = \partial_x g_1(u) + \partial_{xx} g_2(u)\). For a term such as \(g_1\), we would use a monotone Lax-Friedrichs flux splitting, i.e., \(g_1 = g_1^+ + g_1^-\), where \(g_1^{\pm} = \frac{1}{2}(g_1(u) \pm r u)\) with \(r = \max_u |g_1'(u)|\). Hence, a particular RK stage can be approximated using
\[
F(t,u) \approx -\alpha \sum_{p=1}^{s} \mathcal{D}_L^p[g_1^+(u); \alpha]
+ \alpha \sum_{p=1}^{s} \mathcal{D}_R^p[g_1^-(u); \alpha]
- \alpha_\nu^2 \sum_{p=1}^{s} \mathcal{D}_0^p[g_2(u); \alpha_\nu].
\]
The resulting approximation to the RK stage can be shown to be \(\mathcal{O}(\Delta t^s)\) accurate. Another nonlinear PDE of interest to us is the H-J equation \(F(t, u) = H(\partial_x u)\). In a similar way, we would replace the Hamiltonian with a monotone numerical Hamiltonian, such as
\[
\hat{H}(v^-, v^+) = H\left(\frac{v^- + v^+}{2}\right) - r(v^-, v^+)\, \frac{v^+ - v^-}{2},
\]
where \(r(v^-, v^+) = \max_v |H'(v)|\). Then, the left and right derivative operators in the numerical Hamiltonian can be replaced with
\[
\partial_x^- u = \alpha \sum_{p=1}^{s} \mathcal{D}_R^p[u; \alpha], \qquad
\partial_x^+ u = -\alpha \sum_{p=1}^{s} \mathcal{D}_L^p[u; \alpha],
\]
which, again, yields an \(\mathcal{O}(\Delta t^s)\) approximation. In previous work [56, 44], for linear forms of F(t, u), it was shown that the resulting methods are unconditionally stable when coupled to explicit RK methods, up to order 3. Extensions beyond third-order are possible, but were not considered. For general, nonlinear problems, we typically couple an s-stage RK method to a successive convolution approximation of the same time accuracy, so that the error in the resulting approximation is \(\mathcal{O}(\Delta t^s)\).
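The flux splitting step can be illustrated concretely. The sketch below implements the monotone Lax-Friedrichs splitting \(g_1 = g_1^+ + g_1^-\) with \(g_1^{\pm}(u) = \frac{1}{2}(g_1(u) \pm r u)\), where r is taken as the maximum wave speed over the current data. The Burgers-type flux \(g_1(u) = u^2/2\) used here is our illustrative choice, not one fixed by the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Monotone Lax-Friedrichs flux splitting: g1 = g1^+ + g1^-, with
// g1^{±}(u) = (g1(u) ± r*u)/2 and r = max_u |g1'(u)| over the data.
// Illustration uses the Burgers flux g1(u) = u^2/2, so g1'(u) = u.
void lax_friedrichs_split(const std::vector<double>& u,
                          std::vector<double>& g_plus,
                          std::vector<double>& g_minus) {
    double r = 0.0;
    for (double ui : u) r = std::max(r, std::fabs(ui));  // r = max |g1'(u)|
    g_plus.resize(u.size());
    g_minus.resize(u.size());
    for (std::size_t i = 0; i < u.size(); ++i) {
        const double g = 0.5 * u[i] * u[i];      // g1(u) = u^2 / 2
        g_plus[i]  = 0.5 * (g + r * u[i]);       // fed to the left-biased operator
        g_minus[i] = 0.5 * (g - r * u[i]);       // fed to the right-biased operator
    }
}
```

By construction, the two pieces sum back to the original flux at every point, and each piece has a single-signed derivative, which is what makes the splitting monotone.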
3.3 Nearest-Neighbor Domain Decomposition Algorithm

In this section, we provide the relevant mathematical definitions of our domain decomposition algorithm, which are derived from the key operators used in successive convolution. Our goal is to establish and exploit data locality in the method, so that certain reconstructions, which are nonlocal, can be independently completed on non-overlapping blocks of the domain. This is achieved in part by leveraging certain decay properties of the integral representations. Once we have established some useful definitions, we use them to derive conditions in 3.3.1 which restrict the communication pattern to N-Ns. Then in sections 3.3.2 and 3.3.3, we illustrate how this condition can be used to enforce boundary conditions, in a consistent manner, for first and second derivative operators on each of the blocks. We then provide a brief summary of these findings, along with additional comments, in 3.3.4.

Maintaining a localized stencil is often advantageous in parallel computing applications. Code which is based on N-Ns is generally much easier to write and maintain. Additionally, messages used to exchange data owned by other blocks may not have to travel long distances within the network, provided that the blocks are mapped physically close together in hardware. Communication, even on modern computing systems, is far more expensive than computation. Therefore, an initial strategy for domain decomposition is to enforce N-N dependencies between the blocks. In order to decompose a problem into smaller, independent pieces, we separate the global domain into blocks that share borders with their nearest neighbors. For example, in the case of a 1-D problem defined on the interval [a, b], we can form N blocks by writing \(a = c_0 < c_1 < c_2 < \cdots < c_N = b\), with \(\Delta c_i = c_{i+1} - c_i\) denoting the width of block i.

Figure 3.2: A six-point WENO quadrature stencil in 2-D.
Multidimensional problems can be addressed in a similar way by partitioning the domain along multiple directions. Solving a PDE on each of these blocks, independently, requires an understanding of the various data dependencies. First, we address the local integrals \(J_*\). Depending on the quadrature method, the reconstruction algorithm might require data from neighboring blocks. Reconstructions based on previously described WENO-type quadratures require an extension of the grid in order to build the interpolant. This involves a "halo" region (see Figure 3.2), which is distributed amongst N-N blocks in the decomposition. On the other hand, more compact quadratures, such as Simpson's method [33], do not require this data. In this case, the quadrature communication phase can be ignored. The major task for this work involves efficiently communicating the data necessary to build each of the convolution integrals \(I_*\). Once the local integrals \(J_*\) are constructed through quadrature, we sweep across the lines of the domain to build the convolution integrals. It is this operation which couples the integrals across the blocks. To decompose this operation, we first rewrite the integral operators \(I_*\), assuming we are in block i, as
\[
I_L[v;\alpha](x) = \alpha \int_x^b e^{-\alpha(s-x)} v(s)\,ds
= \alpha \int_x^{c_{i+1}} e^{-\alpha(s-x)} v(s)\,ds + \alpha \int_{c_{i+1}}^{c_N} e^{-\alpha(s-x)} v(s)\,ds,
\]
and
\[
I_R[v;\alpha](x) = \alpha \int_a^x e^{-\alpha(x-s)} v(s)\,ds
= \alpha \int_{c_0}^{c_i} e^{-\alpha(x-s)} v(s)\,ds + \alpha \int_{c_i}^x e^{-\alpha(x-s)} v(s)\,ds.
\]
These relations, which assume x is within the interval \([c_i, c_{i+1}]\), elucidate the local and non-local contributions to the convolution integrals within block i.
Using simple algebraic manipulations, we can expand the non-local contributions to find that
\[
\int_{c_{i+1}}^{c_N} e^{-\alpha(s-x)} v(s)\,ds
= \sum_{j=i+1}^{N-1} \int_{c_j}^{c_{j+1}} e^{-\alpha(s-x)} v(s)\,ds
= \sum_{j=i+1}^{N-1} e^{-\alpha(c_j - x)} \int_{c_j}^{c_{j+1}} e^{-\alpha(s-c_j)} v(s)\,ds,
\]
and
\[
\int_{c_0}^{c_i} e^{-\alpha(x-s)} v(s)\,ds
= \sum_{j=0}^{i-1} \int_{c_j}^{c_{j+1}} e^{-\alpha(x-s)} v(s)\,ds
= \sum_{j=0}^{i-1} e^{-\alpha(x - c_{j+1})} \int_{c_j}^{c_{j+1}} e^{-\alpha(c_{j+1}-s)} v(s)\,ds,
\]
for the right- and left-moving data, respectively. With these relations, each of the convolution integrals can be formed according to
\[
I_L[v;\alpha](x) = \alpha \int_x^{c_{i+1}} e^{-\alpha(s-x)} v(s)\,ds
+ \alpha \sum_{j=i+1}^{N-1} e^{-\alpha(c_j - x)} \int_{c_j}^{c_{j+1}} e^{-\alpha(s-c_j)} v(s)\,ds, \tag{3.22}
\]
and
\[
I_R[v;\alpha](x) = \alpha \sum_{j=0}^{i-1} e^{-\alpha(x - c_{j+1})} \int_{c_j}^{c_{j+1}} e^{-\alpha(c_{j+1}-s)} v(s)\,ds
+ \alpha \int_{c_i}^{x} e^{-\alpha(x-s)} v(s)\,ds. \tag{3.23}
\]
From equations (3.22) and (3.23), we observe that both of the global convolution integrals can be split into a localized convolution with additional contributions coming from preceding or successive global integrals owned by other blocks in the decomposition. These global integrals contain exponential attenuation factors, the size of which depends on the respective distances between any pair of sub-domains. Next, we use this result to derive the restriction that facilitates N-N dependencies.

3.3.1 Nearest-Neighbor Criterion

Building a consistent block-decomposition for the convolution integral is non-trivial, since this operation globally couples unknowns along a dimension of the grid. Fortunately, the exponential kernel used in these reconstructions is pleasant in the sense that it automatically generates a region of compact support around a given block. Examining the exponential attenuation factors in (3.22) and (3.23), we see that contributions from blocks beyond N-Ns become small provided that (1) the distance between the blocks is large or (2) α is taken to be sufficiently large. Since we have less control over the block sizes, e.g., \(\Delta c_i\), we can enforce the latter criterion.
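The decomposition (3.23) can be verified numerically in the simplest possible setting. For v ≡ 1, the right-moving integral over [lo, x] has the closed form \(1 - e^{-\alpha(x - lo)}\), so the global integral over a two-block domain should equal the local piece plus the attenuated contribution of the preceding block. The two-block layout and function names below are our own illustrative choices.

```cpp
#include <cassert>
#include <cmath>

// Closed form of the right-moving convolution of v ≡ 1 over [lo, x]:
//   alpha * ∫_{lo}^{x} e^{-alpha(x-s)} ds = 1 - e^{-alpha (x - lo)}.
double I_R_const(double lo, double x, double alpha) {
    return 1.0 - std::exp(-alpha * (x - lo));
}

// Global integral over [a, x] reassembled per (3.23): the local piece
// on [c, x] plus the attenuated contribution of the block [a, c].
double I_R_blocked(double a, double c, double x, double alpha) {
    return I_R_const(c, x, alpha)
         + std::exp(-alpha * (x - c)) * I_R_const(a, c, alpha);
}
```

The attenuation factor \(e^{-\alpha(x - c)}\) multiplying the neighbor's contribution is exactly the term whose decay the nearest-neighbor criterion of the next subsection exploits.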
That is, we constrain α so that
\[
e^{-\alpha L_m} \leq \epsilon, \tag{3.24}
\]
where \(\epsilon \ll 1\) is some prescribed error tolerance, typically taken as \(1 \times 10^{-16}\), and \(L_m = \min_i \Delta c_i\) denotes the length of the smallest block. Taking logarithms of both sides and rearranging the inequality, we obtain the bound
\[
-\alpha \leq \frac{\log(\epsilon)}{L_m}.
\]
Our next step is to write this in terms of the time step Δt, using the choice of α. However, the bound on the time step depends on the choice of α. In 3.2.1, we presented two definitions for the parameter α, namely
\[
\alpha \equiv \frac{\beta}{c_{\max} \Delta t}, \qquad \text{or} \qquad \alpha \equiv \frac{\beta}{\sqrt{\nu \Delta t}}.
\]
Using these definitions for α, we obtain two conditions depending on the choice of α. For the linear advection equation and the wave equation, we obtain the condition
\[
-\frac{\beta}{c_{\max} \Delta t} \leq \frac{\log(\epsilon)}{L_m}
\quad \Longrightarrow \quad
\Delta t \leq -\frac{\beta L_m}{c_{\max} \log(\epsilon)}. \tag{3.25}
\]
Likewise, for the diffusion equation, the restriction is given by
\[
-\frac{\beta}{\sqrt{\nu \Delta t}} \leq \frac{\log(\epsilon)}{L_m}
\quad \Longrightarrow \quad
\Delta t \leq \frac{1}{\nu} \left( \frac{\beta L_m}{\log(\epsilon)} \right)^2. \tag{3.26}
\]
Depending on the problem, if the condition (3.25) or (3.26) is not satisfied, then we use the maximally allowable time step for a given tolerance ε, which is given by the equality component of the relevant condition. If several different operators appear in a given problem and are to be approximated with successive convolution, then each operator will be associated with its own α. In such a case, we should bound the time step according to the condition that is more restrictive among (3.25) and (3.26), which can be accomplished through the choice
\[
\Delta t \leq \min\left( -\frac{\beta L_m}{c_{\max} \log(\epsilon)}, \; \frac{1}{\nu} \left( \frac{\beta L_m}{\log(\epsilon)} \right)^2 \right). \tag{3.27}
\]
As before, when the condition is not met, we use the equality in (3.27). Restricting Δt according to (3.25), (3.26), or (3.27) ensures that contributions to the right- and left-moving convolution integrals, beyond N-Ns, become negligible. This is important because it significantly reduces the amount of communication, at the expense of a potentially restrictive time step. Note that in 3.5.4, we analyze the limitations of such restrictions for the linear advection equation.
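The combined restriction (3.27) reduces to a one-line computation. The following sketch evaluates the maximally allowable time step for given problem parameters; the function name and default tolerance are our own choices, with the tolerance matching the \(1 \times 10^{-16}\) value quoted in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Sketch of the combined nearest-neighbor time step restriction
// (3.27): the minimum of the wave/advection bound (3.25) and the
// diffusion bound (3.26). L_m is the width of the smallest block.
double nn_max_dt(double beta, double c_max, double nu,
                 double L_m, double eps = 1e-16) {
    const double ratio = beta * L_m / std::log(eps);  // log(eps) < 0, so ratio < 0
    const double dt_wave = -ratio / c_max;            // condition (3.25)
    const double dt_diff = ratio * ratio / nu;        // condition (3.26)
    return std::min(dt_wave, dt_diff);
}
```

Note that because the diffusion bound is quadratic in \(\beta L_m / \log(\epsilon)\), it is typically the binding constraint when that ratio is small, which is precisely the regime the N-N criterion enforces.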
In our future work, we shall consider generalizations of our approach which do not require (3.25), (3.26), or (3.27). In sections 3.3.2 and 3.3.3, we demonstrate how to formulate block-wise definitions of the global \(\mathcal{L}_*^{-1}\) operators using the derived conditions (3.25), (3.26), or (3.27).

3.3.2 Enforcing Boundary Conditions for \(\partial_x\)

In order to enforce the block-wise boundary conditions for the first derivative \(\partial_x\), we recall our definitions (3.3) and (3.4) for the left- and right-moving inverse operators:
\[
\mathcal{L}_L^{-1}[v;\alpha](x) = I_L[v;\alpha](x) + B e^{-\alpha(b-x)}, \tag{3.28}
\]
\[
\mathcal{L}_R^{-1}[v;\alpha](x) = I_R[v;\alpha](x) + A e^{-\alpha(x-a)}. \tag{3.29}
\]
We can modify these definitions so that each block contains a pair of inverse operators given by
\[
\mathcal{L}_{L,i}^{-1}[v;\alpha](x) = I_{L,i}[v;\alpha](x) + B_i e^{-\alpha(c_{i+1}-x)}, \tag{3.30}
\]
\[
\mathcal{L}_{R,i}^{-1}[v;\alpha](x) = I_{R,i}[v;\alpha](x) + A_i e^{-\alpha(x-c_i)}, \tag{3.31}
\]
where the \(I_{*,i}\) are defined as
\[
I_{L,i}[v;\alpha](x) = \alpha \int_x^{c_{i+1}} e^{-\alpha(s-x)} v(s)\,ds, \qquad
I_{R,i}[v;\alpha](x) = \alpha \int_{c_i}^x e^{-\alpha(x-s)} v(s)\,ds, \tag{3.32}
\]
and the subscript i denotes the block in which the operator is defined. As before, this assumes \(x \in [c_i, c_{i+1}]\). To address the boundary conditions, we need to determine expressions for the constants \(B_i\) and \(A_i\) on each of the blocks in the domain. First, substitute the definitions (3.22) and (3.23) into (3.28) and (3.29):
\[
\mathcal{L}_L^{-1}[v;\alpha](x) = \alpha \int_x^{c_{i+1}} e^{-\alpha(s-x)} v(s)\,ds
+ \alpha \sum_{j=i+1}^{N-1} e^{-\alpha(c_j-x)} \int_{c_j}^{c_{j+1}} e^{-\alpha(s-c_j)} v(s)\,ds
+ B e^{-\alpha(c_N - x)}, \tag{3.33}
\]
\[
\mathcal{L}_R^{-1}[v;\alpha](x) = \alpha \sum_{j=0}^{i-1} e^{-\alpha(x-c_{j+1})} \int_{c_j}^{c_{j+1}} e^{-\alpha(c_{j+1}-s)} v(s)\,ds
+ \alpha \int_{c_i}^x e^{-\alpha(x-s)} v(s)\,ds
+ A e^{-\alpha(x - c_0)}. \tag{3.34}
\]
Since we wish to maintain consistency with the true operator being inverted, we require that each of the block-wise operators satisfy
\[
\mathcal{L}_L^{-1}[v;\alpha](x) = \mathcal{L}_{L,i}^{-1}[v;\alpha](x), \qquad
\mathcal{L}_R^{-1}[v;\alpha](x) = \mathcal{L}_{R,i}^{-1}[v;\alpha](x),
\]
which can be explicitly written as
\[
B_i e^{-\alpha(c_{i+1}-x)} = \sum_{j=i+1}^{N-1} e^{-\alpha(c_j-x)} I_{L,j}[v;\alpha](c_j) + B e^{-\alpha(c_N-x)}, \tag{3.35}
\]
\[
A_i e^{-\alpha(x-c_i)} = \sum_{j=0}^{i-1} e^{-\alpha(x-c_{j+1})} I_{R,j}[v;\alpha](c_{j+1}) + A e^{-\alpha(x-c_0)}. \tag{3.36}
\]
Evaluating (3.35) at \(c_{i+1}\) and (3.36) at \(c_i\), we obtain
\[
B_i = \sum_{j=i+1}^{N-1} e^{-\alpha(c_j-c_{i+1})} I_{L,j}[v;\alpha](c_j) + B e^{-\alpha(c_N-c_{i+1})}, \tag{3.37}
\]
\[
A_i = \sum_{j=0}^{i-1} e^{-\alpha(c_i-c_{j+1})} I_{R,j}[v;\alpha](c_{j+1}) + A e^{-\alpha(c_i-c_0)}. \tag{3.38}
\]
Modifying Δt according to either (3.25) or, if necessary, (3.27), results in the communication stencil shown in Figure 3.3. More specifically, the terms representing the boundary contributions in each of the blocks are given by
\[
B_i =
\begin{cases}
B, & i = N-1, \\
I_{L,i+1}[v;\alpha](c_{i+1}), & i < N-1,
\end{cases}
\qquad
A_i =
\begin{cases}
A, & i = 0, \\
I_{R,i-1}[v;\alpha](c_i), & i > 0.
\end{cases}
\]
These relations generalize the various boundary conditions set by a problem. For example, with periodic problems, we can select
\[
B = I_{L,0}[v;\alpha](c_0), \qquad A = I_{R,N-1}[v;\alpha](c_N).
\]
This is the relevant strategy employed by domain decomposition algorithms in this work.

Figure 3.3: Fast convolution communication stencil in 2-D based on N-Ns.

3.3.3 Enforcing Boundary Conditions for \(\partial_{xx}\)

The enforcement of boundary conditions on blocks of the domain for the second derivative can be accomplished using an identical procedure to the one described in 3.3.2. First, we recall the inverse
operator associated with a second derivative (3.5):
\[
\mathcal{L}_0^{-1}[v;\alpha](x) = I_0[v;\alpha](x) + A e^{-\alpha(x-a)} + B e^{-\alpha(b-x)}, \tag{3.39}
\]
and define an analogous block-wise version of (3.39) as
\[
\mathcal{L}_{0,i}^{-1}[v;\alpha](x) = I_{0,i}[v;\alpha](x) + A_i e^{-\alpha(x-c_i)} + B_i e^{-\alpha(c_{i+1}-x)},
\]
with the localized convolution integral
\[
I_{0,i}[v;\alpha](x) = \frac{\alpha}{2} \int_{c_i}^{c_{i+1}} e^{-\alpha|x-s|} v(s)\,ds.
\]
Again, the subscript i denotes the block in which the operator is defined, and we take \(x \in [c_i, c_{i+1}]\). For the purposes of the fast summation algorithm, it is convenient to split this integral term into an average of left and right contributions, i.e.,
\[
\mathcal{L}_{0,i}^{-1}[v;\alpha](x) = \frac{1}{2}\Big( I_{L,i}[v;\alpha](x) + I_{R,i}[v;\alpha](x) \Big)
+ A_i e^{-\alpha(x-c_i)} + B_i e^{-\alpha(c_{i+1}-x)}, \tag{3.40}
\]
where the \(I_{*,i}\) are the same integral operators shown in equation (3.32) used to build the first derivative. As in the case of the first derivative, a condition connecting the boundary conditions on the blocks to the non-local integrals can be derived, which, if evaluated at the ends of the block, results in the 2 × 2 linear system
\[
A_i + B_i e^{-\alpha \Delta c_i}
= \frac{1}{2} \sum_{j=i+1}^{N-1} e^{-\alpha(c_j - c_i)} I_{L,j}[v;\alpha](c_j)
+ \frac{1}{2} \sum_{j=0}^{i-1} e^{-\alpha(c_i - c_{j+1})} I_{R,j}[v;\alpha](c_{j+1})
+ A e^{-\alpha(c_i - c_0)} + B e^{-\alpha(c_N - c_i)}, \tag{3.41}
\]
\[
A_i e^{-\alpha \Delta c_i} + B_i
= \frac{1}{2} \sum_{j=i+1}^{N-1} e^{-\alpha(c_j - c_{i+1})} I_{L,j}[v;\alpha](c_j)
+ \frac{1}{2} \sum_{j=0}^{i-1} e^{-\alpha(c_{i+1} - c_{j+1})} I_{R,j}[v;\alpha](c_{j+1})
+ A e^{-\alpha(c_{i+1} - c_0)} + B e^{-\alpha(c_N - c_{i+1})}. \tag{3.42}
\]
Equations (3.41) and (3.42) can be solved analytically to find that
\[
\begin{pmatrix} A_i \\ B_i \end{pmatrix}
= \frac{1}{1 - e^{-2\alpha \Delta c_i}}
\begin{pmatrix} 1 & -e^{-\alpha \Delta c_i} \\ -e^{-\alpha \Delta c_i} & 1 \end{pmatrix}
\begin{pmatrix} r_0 \\ r_1 \end{pmatrix},
\]
where we have used the variables \(r_0\) and \(r_1\) to denote the terms appearing on the right-hand sides of (3.41) and (3.42), respectively.
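The analytic 2 × 2 solve is cheap enough to perform once per block per sweep. A minimal sketch, with our own function name and with \(\mu = e^{-\alpha \Delta c_i}\) passed in precomputed:

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Analytic solve of the 2x2 system (3.41)-(3.42) for the block-wise
// boundary constants (A_i, B_i). r0 and r1 are the right-hand sides,
// and mu = e^{-alpha * dc_i} for a block of width dc_i, so the system
// matrix is [[1, mu], [mu, 1]] with determinant 1 - mu^2.
std::pair<double, double> boundary_constants(double r0, double r1, double mu) {
    const double det = 1.0 - mu * mu;
    const double A = (r0 - mu * r1) / det;
    const double B = (r1 - mu * r0) / det;
    return {A, B};
}
```

Applying the forward map to known constants and solving recovers them exactly, which is a convenient unit test for an implementation.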
Under the N-N constraints (3.26) or (3.27), many of the exponential terms can be neglected, resulting in the compact expressions
\[
A_i =
\begin{cases}
A, & i = 0, \\
\tfrac{1}{2} I_{R,i-1}[v;\alpha](c_i) \equiv I_{0,i-1}[v;\alpha](c_i), & i > 0,
\end{cases}
\]
and
\[
B_i =
\begin{cases}
B, & i = N-1, \\
\tfrac{1}{2} I_{L,i+1}[v;\alpha](c_{i+1}) \equiv I_{0,i+1}[v;\alpha](c_{i+1}), & i < N-1.
\end{cases}
\]

3.3.4 Additional Comments

In this section, we developed the mathematical framework behind our proposed domain decomposition algorithm. We derived a condition which reduces the construction of a nonlocal operator to a N-N dependency by leveraging the decay properties of the exponential term within the convolution integrals. We wish to reiterate that this condition is not strictly necessary. One could remove this condition by including contributions beyond N-Ns at the expense of additional communication. This change would certainly result in a loss of speed per time step, but the additional expense could be amortized by the ability to use a much larger time step, which would reduce the overall time-to-solution. As a first pass, we shall ignore these additional contributions, which may limit the scope of problems we can study, but we plan to generalize these algorithms in our future work via an adaptive strategy. This approach would begin using data from N-Ns, then gradually include additional contributions using information about the decay from the exponential. In the next section, we shall discuss details regarding the implementation of our methods and particular design choices made in the construction of our algorithms.

3.4 Strategies for Efficient Implementation on Parallel Systems

In this section, we discuss strategies for constructing parallel algorithms to solve PDEs. We provide the details related to our work on thread-scalable, shared memory algorithms, as well as distributed memory algorithms, where the problem is decomposed into smaller, independent problems that communicate necessary information via message-passing.
3.4.1 introduces the core concepts of the Kokkos performance portability library, which is used to develop our shared memory algorithms. Once we have introduced these ideas, we explore numerous loop-level optimizations for essential loop structures in 3.4.3, using the performance metrics discussed in 3.4.2. Building on the results of these loop experiments, we outline the structure of our shared memory algorithms in 3.4.4. We then discuss the implementation of the distributed memory component of our algorithms in 3.4.5, along with modifications which enable the use of an adaptive time stepping rule. Finally, we summarize the key findings and developments of the implementation which are used to conduct our numerical experiments.

3.4.1 Selecting a Shared Memory Programming Model

Many programming models exist to address the aspects of shared memory parallelization, such as OpenMP, OpenACC, CUDA, and OpenCL. The question of which model to use often depends on the target architecture on which the code is to run. However, given the recent trend towards deploying more heterogeneous computing systems, e.g., ones in which a given node contains a variety of CPUs with one, or many, accelerators (typically GPUs), the choice becomes far more complicated. Developing codes which are performant across many computing architectures is a highly non-trivial task. Due to memory access patterns, code which is optimized to run on CPUs is often not optimal on GPUs, so these models address portability rather than performance. This introduces yet another concern related to code management and maintenance: as new architectures are deployed, code needs to be tuned or modified to take advantage of new features, which can be time consuming. Additionally, enabling these abstractions almost invariably results in either multiple versions of the code or rather complicated build systems. In our work, we choose to adopt Kokkos [46], a performance portable, shared memory programming model.
Kokkos tries to address the aforementioned problem posed by rapidly evolving architectures through template-metaprogramming abstractions. Their model provides abstractions for common parallel policies (i.e., for-loops, reductions, and scans), memory spaces, and execution spaces. The architecture-specific details are hidden from users through these abstractions, yet the setup allows the application programmer to take advantage of numerous performance-related features. Given basic knowledge of templates, operator overloads, as well as functors and lambdas, one can implement a variety of program designs. Also provided are the so-called views, which are powerful multi-dimensional array containers that allow Kokkos iterators to map data onto various architectures in a performant way. Additionally, the bodies of iterators become either a user-defined functor or a KOKKOS_LAMBDA, which is just a functor that is generated by the compiler. This allows users to maintain one version of the code which has the flexibility to run on various architectures, such as the one depicted in Figure 3.4.

Figure 3.4: Heterogeneous platform targeted by Kokkos [46].

Other performance portability models, such as RAJA [74], work in a similar fashion to Kokkos, but they are less intrusive with regard to memory management. With RAJA, the user is responsible for implementing architecture-dependent details such as array layouts and policies. In this sense, RAJA emphasizes portability, with the user being responsible for handling performance.

3.4.2 Comment on Performance Metrics

In order to benchmark the performance of the algorithms, we need a descriptive metric that accounts for varying workloads among problem sizes. In our numerical simulations we use a time stepping rule so that problems with a smaller mesh spacing require more time steps, i.e., Δt ∼ Δx.
Therefore, one can either time the code for a fixed number of steps or track the number of steps in the entire simulation \(t \in (0, T]\) and compute the average time per time step. We adopt the former approach throughout this work. To account for the varying workloads attributed to varying numbers of cells/grid points, we define the update rate as Degrees-of-Freedom/node/s (DOF/node/s), which can be computed via
\[
\text{DOF/node/s} = \frac{\text{total variables} \times N^d}{\text{nodes} \times \dfrac{\text{total time (s)}}{\text{total steps}}}, \tag{3.43}
\]
where d is the number of spatial dimensions. This metric is a more general way of comparing the raw performance of the code, as it allows for simultaneous comparisons among linear or nonlinear problems with varying degrees of dimensionality and numbers of components. It also allows for a comparison, in terms of speed, against other classes of methods, such as finite element methods, where the workload on a given cell is allowed to vary according to the number of basis elements.¹ In 3.4.3, we shall use this performance metric to benchmark a collection of techniques for prescribing parallelism across predominant loop structures in the algorithms for successive convolution.

3.4.3 Benchmarking Prototypical Loop Patterns

Often, when designing shared memory algorithms, one has to make design decisions prescribing the way threads are dispatched to the available data. However, there are often many ways of accomplishing a given task. Kokkos provides a variety of parallel iteration techniques; the selection of a particular pattern typically depends on the structure of the loop (perfectly or imperfectly nested) and the size of the loops. In [75], the authors sought to optimize a recurring pattern, consisting of triple or quadruple nested for-loops, in the Athena++ MHD code [76]. Their strategy was to use a flexible loop macro to test various loop structures across a range of architectures.
Athena++ was already optimized to run on Intel Xeon Phi platforms, so they primarily focused on approaches for porting to GPUs which maintained this performance on CPUs. Our work differs in that we have not yet identified optimal loop patterns for CPUs or GPUs, and the algorithms used here contain at least two major prototypical loop patterns. At the moment, we are not focusing on optimizing for GPUs, but we do our best to keep in mind possible performance-related issues associated with various parallelization techniques. Some examples of recurring loop structures in successive convolution algorithms, for 3-D problems, are provided in Scheme 3.1 and Scheme 3.2. Technically, there are left- and right-moving operators associated with each direction, but, for simplicity, we will ignore this in the pseudo-code.

¹Note that the update frequency does not account for error in the numerical solution. Certainly, in order to compare the efficiency of various methods, especially those that belong to different classes, one must take into account the quality of the solution. This would be reflected in, for example, an error versus time-to-solution plot.

for(int ix = 0; ix < Nx; ix++){
    for(int iy = 0; iy < Ny; iy++){
        // Perform some intermediate calculations
        // ...
        // Apply 1-D algorithm to z-line data
        for(int iz = 0; iz < Nz; iz++){
            z_operator(ix, iy, iz) = ...
        }
    }
}

Scheme 3.1: Looping pattern used in the construction of local integrals, convolutions, and boundary steps.

Another important note we wish to make concerns the storage of the operator data on the mesh. Since the operations are performed "line-by-line" on potentially large multidimensional arrays, we
choose to store the data in memory so that the sweeps are performed on the fastest-changing loop variables. This allows us to avoid significant memory access penalties associated with reading and writing to arrays, as the entries of interest are now consecutive in memory.² For example, suppose we have an N-dimensional array with indices \(x_1, x_2, \cdots, x_N\), and we wish to construct an operator in the \(x_1\) direction. Then, we would store this operator in memory as operator(\(x_2, \cdots, x_N, x_1\)). The loops appearing in Scheme 3.1 and Scheme 3.2 can then be permuted accordingly. Note that the solution variable u(\(x_1, x_2, \cdots, x_N\)) is not transposed and is a read-only quantity during the construction of the operators.

// Looping pattern for the resolvent operators
for(int ix = 0; ix < Nx; ix++){
    for(int iy = 0; iy < Ny; iy++){
        for(int iz = 0; iz < Nz; iz++){
            z_operator(ix, iy, iz) = u(ix, iy, iz) - z_operator(ix, iy, iz);
            z_operator(ix, iy, iz) *= alpha_z;
        }
    }
}

Scheme 3.2: Another looping pattern used to build "resolvent" operators. With some modifications, this same pattern could be used for the integrator step. In several cases, this iteration pattern may require reading entries which are separated by large distances in memory (i.e., the data is strided).

In an effort to develop an efficient application, we follow the approach described in [75] to determine optimal loop iteration techniques for patterns such as Scheme 3.1 and Scheme 3.2. Our simple 2-D and 3-D experiments tested numerous combinations of policies, including naive as well as more complex parallel iteration patterns, using the OpenMP backend in Kokkos.

²This is true when the memory space is that of the CPU (host memory). In device memory, these entries will be "coalesced", which is the optimal layout for threading on GPUs. This mapping of indices, between memory spaces, is automatically handled by Kokkos.
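The transposed storage described above amounts to choosing which index has unit stride in the flat layout. The sketch below shows the index map for a row-major z-operator, where the sweep index iz varies fastest; the extents and function name are our own illustrative choices (a Kokkos View with the appropriate layout performs this mapping automatically).

```cpp
#include <cassert>
#include <cstddef>

// Flat, row-major index map for a z-direction operator stored so that
// the sweep index iz has unit stride in memory:
//   operator(ix, iy, iz) -> (ix*Ny + iy)*Nz + iz.
// Consecutive iz values are therefore adjacent, which is what makes
// the line sweeps cache-friendly on CPUs (and coalesced on GPUs).
inline std::size_t line_index(std::size_t ix, std::size_t iy, std::size_t iz,
                              std::size_t Ny, std::size_t Nz) {
    return (ix * Ny + iy) * Nz + iz;  // iz varies fastest in memory
}
```

An x-direction operator would instead be stored with the roles permuted, e.g., operator(iy, iz, ix), so that its sweep index ix becomes the unit-stride index.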
Our goals were to quantify possible performance gains attainable through the following strategies:

1. auto-vectorization via #pragma statements or ThreadVectorRange (TVR);
2. improving data reuse and caching behavior with loop tiling/blocking;
3. prescribing parallelism across combinations of team-type execution policies and team sizes.

Vectorization can offer substantial performance improvements for data that is contiguous in memory; however, several performance-critical operations in the algorithms involve reading data which is strided in memory. Therefore, it is not clear a priori whether vectorization would offer any improvement. Additionally, for larger problems, the line operations along certain directions involve reading strided data, so the benefits of caching are lost. The performance penalty of operating on data with the wrong layout depends on the architecture, with penalties on GPUs typically being quite severe compared to CPUs. The use of a blocked iteration pattern, such as the one outlined in Scheme A.1 (see A.2), is a step toward minimizing such performance penalties. In order to see a performance benefit from this approach, the algorithms must be structured in such a way as to reuse the data that is read into caches as much as possible. Naturally, one could prescribe one or more threads (of a team) to process blocks, so we chose to implement cache blocking using the hierarchical execution policies provided by Kokkos. From coarse-to-fine levels of granularity, these can be ordered as follows: TeamPolicy (TP), TeamThreadRange (TTR), and ThreadVectorRange (TVR). For perfectly nested loops, one can achieve similar behavior using MDRange and prescribing block sizes. During testing, we found that when block sizes are larger than or equal to the size of the view, a segmentation fault occurs, so this was avoided. The results of our loop experiments are provided in Figures 3.5 and A.1.
For tests employing blocking, we used a block size of 256² in 2-D, while 3-D problems used a block size of 32³. Information regarding the various choices, such as the compiler, optimization flags, etc., used to generate these results can be found in Table 3.1.

CPU Type: Intel Xeon Gold 6148
C++ Compiler: ICC 2019.03
Optimization Flags: -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qno-opt-prefetch
Thread Bindings: OMP_PROC_BIND=close, OMP_PLACES=threads

Table 3.1: Architecture and code configuration for the loop experiments conducted on the Intel 18 cluster at Michigan State University's Institute for Cyber-Enabled Research. To leverage the wide vector registers, we encourage the compiler to use AVX-512 instructions. Hardware prefetching is not used, as initial experiments seemed to indicate that it hindered performance.

Initially, we used GCC 8.2.0-2.31.1 as our compiler, but we found through experimentation that using an Intel compiler improved the performance of our application by a factor of ∼2 for this platform. The authors of [75] experienced similar behavior for their application and attribute this to a difference in auto-vectorization capabilities between compilers. An examination of the source code for loop execution policies in Kokkos reveals that certain decorators, e.g., #pragma ivdep, are present, which help encourage auto-vectorization when Intel compilers are used. We are unsure whether similar hints are provided for GCC. As part of our blocking implementation, we stored block information in views, which could then be accessed by a team of threads. After the information about the block is obtained, we compute indices for the data within the block and use these to extract the relevant grid data. Then, one can either create subviews (shallow copies) of the block data or proceed directly with the line calculations on the block data. We refer to these as tiling with and without subviews, respectively.
Intuitively, one would think that skipping the block subview creation step would be faster. However, among the blocked or tiled experiments, those that created subviews of the tile data were generally faster than those that did not. Using blocking for smaller problems typically resulted in a large number of idle threads, which significantly degraded the performance compared to non-blocked policies. In such situations, a user would need to take care to ensure that a sufficient number of blocks are used to generate enough work, i.e., each thread (or team) has at least one block to process. For larger problems, blocking was faster when compared to variants that did not use blocking. We observe that the performance of non-blocked policies begins to degrade once a problem becomes sufficiently large, whereas blocked policies maintained a consistent update rate, even as the problem size increased. By separating the key loop structures from the complexities of the application, we were able to expedite the experimental process for identifying efficient loop execution techniques. In 3.4.4, we use the results of these experiments to inform choices regarding the design of the shared memory algorithms.

[Figure 3.5 compares the execution policies Tiling w/subviews + TVR, Tiling w/o subviews + TVR, TP + TTR + TVR, and TP + TTR w/o TVR (each with a “best” variant), at team sizes Kokkos::AUTO(), 2, and 4, plotting DOF/s against the problem size N.]

Figure 3.5: Plots comparing the performance of different parallel execution policies for the pattern in Scheme 3.1 using test cases in 2-D (left) and 3-D (right). Tests were conducted on a single node that consists of 40 cores using the code configuration outlined in Table 3.1.
Each group consists of three plots, whose difference is the value selected for the team size. We note that hyperthreading is not enabled on our systems, so Kokkos::AUTO() defaults to a team size of 1. In each pane, we use “best” to refer to the best run for that configuration across different team sizes. Tile experiments used block sizes of 256² in 2-D problems and 32³ in 3-D. We observe that vectorized policies are generally faster than non-vectorized policies. Interestingly, among blocked/tiled policies, those that construct subviews appear to be faster than those that skip the subview construction, despite the additional work. As the problem size increases, the performance of blocked policies improves substantially. This can be attributed to the large number of idle thread teams when the problem size does not produce enough blocks. In such cases, increasing the size of the team does offer an improvement, as it reduces the number of idle thread teams. For non-blocked policies, we observe that increasing the team size generally results in minimal, if any, improvement in performance. In all cases, the use of blocking provides a more consistent update rate when enough work is introduced.

3.4.4 Shared Memory Algorithms

The line-by-line approach to operator reconstruction suggests that we employ a hierarchical design, which consists of thread teams. Rather than employ a fine-grained threading approach over loop indices, we use the coarse-grained, blocked iteration pattern devised in 3.4.3. In this approach, we divide the iteration space into blocks of nearly identical size and assign one or more blocks to a team of threads. The threads within a given team are then dispatched to one (or more) lines, with vector instructions being used within the lines.
As opposed to loop-level parallelism, coarse-grained approaches allow one to exploit the multiple levels of parallelism common to many modern CPUs and to load balance the computation across blocks by adjusting the loop scheduling policy. In our implementation, we provide the flexibility of setting the number of threads per block with a macro, but, in general, we let Kokkos choose the appropriate team size using Kokkos::AUTO(). If running on the CPU, this sets the team size to be the number of hyperthreads (if supported) on a given core. For GPU architectures, the team size is the size of the warp. A hierarchical design pattern is used because the loops in our algorithms are not perfectly nested, i.e., calculations are performed between adjoining loops. Information related to blocking can be precomputed to minimize the number of operations required to manipulate blocks. The process of subview construction consists of shallow copies involving pointers to vertices of the blocks, so no additional memory is required. With a careful choice of a base block size, one can fit these blocks into high-bandwidth memory, so that access costs are reduced. Furthermore, a team-based, hierarchical pattern seems to provide a large degree of flexibility compared to standard loop-level parallelism. In particular, we can fuse adjacent kernels into a single parallel region, which reduces the effect of kernel launch overhead and minimizes the number of synchronization points. The use of a team-type execution policy also allows us to exploit features present on other architectures, such as CUDA's shared memory feature, through scratchpad constructs. Performing a stenciled operation on strided data is associated with an architecture-dependent penalty. On CPUs, while one wishes to operate in a contiguous or cached pattern, various compilers can hide these penalties through optimizations, such as prefetching. GPUs, on the other hand, prefer to operate in a lock-step fashion.
Therefore, if a kernel is not vectorizable, then one pays a significant performance penalty for poor data access patterns. Shared memory, while slower than register accesses, does not require coalesced accesses, so the cost can be significantly reduced. The advantage of using square-like blocks of a fixed size, as opposed to long pencils, is that one can adjust the dimensions of the blocks so that they fit into the constraints of the high-bandwidth memory. Moreover, these blocks can be loaded once and reused for additional directions, whereas pencils would require numerous transfers as lines are processed along a given dimension. Such optimizations are not explored in this work, but algorithmic flexibility is something we must emphasize moving forward. The parallel nested loop structures, such as the one provided in A.2 (see Scheme A.1), are applied during reconstructions for the local integrals J∗ and inverse operators L∗⁻¹, as well as the integrator update. The current exception to this pattern is the convolution algorithm, shown in Scheme A.2, which is also provided in A.2. Here, each thread is responsible for constructing I∗ on one or more lines of the grid. Therefore, within a line, each thread performs the convolution sweeps, in serial, using our O(N) algorithm. Adopting the team-tiling approach for this operation requires that we modify our convolution algorithm considerably; this optimization is left to future work. Additionally, the benefit of this optimization is not large for CPUs, as profiling indicated that < 5% of the total time for a given run was spent inside this kernel. However, this kernel will likely consume more time on GPUs, so this will need to be investigated. If one wishes to use variable time stepping rules, where the time step is computed from a formula of the form

Δt = CFL · min(Δx/c_x, Δy/c_y, · · · ),    (3.44)

then one must supply parallel loop structures with simultaneous maximum reductions for each of the wave speeds c_i.
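In serial form, such a simultaneous reduction might look like the following sketch (our names, not the production code): a single pass over the mesh produces the maxima of |c_x| and |c_y| together, and the result feeds directly into rule (3.44). In Kokkos, the fused pass would be a parallel_reduce with a multi-value reducer.

```cpp
#include <vector>
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cassert>

// Pair of maximum wave speeds accumulated in one fused pass.
struct WaveSpeeds { double cx = 0.0, cy = 0.0; };

// One sweep over the mesh data; assumes cx and cy have equal length.
WaveSpeeds max_wave_speeds(const std::vector<double>& cx,
                           const std::vector<double>& cy) {
    WaveSpeeds s;
    for (std::size_t k = 0; k < cx.size(); ++k) {   // both maxima in one pass
        s.cx = std::max(s.cx, std::fabs(cx[k]));
        s.cy = std::max(s.cy, std::fabs(cy[k]));
    }
    return s;
}

// Equation (3.44): dt = CFL * min(dx/c_x, dy/c_y).
double adaptive_dt(double CFL, double dx, double dy, const WaveSpeeds& s) {
    return CFL * std::min(dx / s.cx, dy / s.cy);
}
```

Fusing the two maxima into one traversal is exactly what the custom reducer would buy in the parallel setting: one read of the mesh data instead of two.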
This can be implemented as a custom functor, but the use of blocking/tiling introduces some complexities, and more complex reducers that enable such calculations are not currently available. (We refer to a pencil as a generally long rectangle in 2-D or a rectangular prism in 3-D. The use of pencils, as opposed to square blocks, would require additional precomputing efforts and, possibly, restrictions on the problem size.) For this reason, problems that use time stepping rules such as (3.44) are constructed with symmetry in the wave speeds, i.e., c_x = c_y = · · · , to avoid an overly complex implementation with blocking/tiling. However, we plan to revisit this in later work as we begin targeting more general problems. Next, in 3.4.5, we discuss the distributed memory component of the implementation and the strategy used to employ an adaptive time stepping rule, such as (3.44).

3.4.5 Code Strategies for Domain Decomposition

One of the issues with distributed computing involves mapping the problem data in an intelligent way so that it best aligns with the physical hardware. Since the kernels used in our algorithms consume a relatively small amount of time, it is crucial that we minimize the time spent communicating data. Given that these schemes were designed to run on Cartesian meshes, we can use a “topology aware” virtual communicator supplied by MPI libraries. These constructs take a collection of ranks in a communicator (each of which manages a sub-domain) and, if permitted, attempt to reorganize them to best align with the physical hardware. This mapping might not be optimal, since it depends on a variety of factors related to the job allocation and the MPI implementation. Depending on the problem, these tools can greatly improve the performance of an application compared to a hand-coded implementation that uses the standard communicator.
Additionally, MPI’s Cartesian virtual communicator provides functionality to obtain neighbor references and derive additional communicator groups, say, along rows, columns, etc. The send and receive operations are performed with persistent communications, which remove some of the overhead of communication channel creation. Persistent communications require a regular communication pattern, which, for our purposes, is simply N-N. For more general problems with irregular communication patterns, standard send and receive operations can be used.

One strategy to minimize exposed communications is to use non-blocking communications. This allows the programmer to overlap calculations with communications, and is especially beneficial if the application can be written in a staggered fashion. If certain data required for a later calculation is available, then communication can proceed while another calculation is being performed. Once the data is needed, we can block progress until the message transfer is complete. However, we hope that the calculation done in between is sufficient to hide the time spent performing the communication. In the multi-dimensional setting, other operators may be needed, such as ∂_x and ∂_y. However, in our algorithms, directions are not coupled, which allows us to stagger the calculations. So, we can initialize the communications along a given direction and build pieces of other operators in the background.

[Figure 3.6 task charts. Fixed time stepping (left): Start → MPI_Isends: halo → interior reconstructions → MPI_Wait: halo → halo reconstructions → convolution sweeps → MPI_Isends: BC data → MPI_Wait: BC data → build “inverse” operators → build “D” operators → integrator. Adaptive time stepping (right): the same sequence, except an MPI_Iallreduce over the wave speeds is posted at the start, and an MPI_Wait on the wave speeds plus a local reduce accompany the integrator step.]

Figure 3.6: Task charts for the domain-decomposition algorithm under fixed (left) and adaptive (right) time stepping rules. The work overlap regions are indicated, laterally, using gray boxes. The work inside the overlap regions should be sufficiently large to hide the communications occurring in the background. To clarify, the overlap in calculations for I∗ is achieved by changing the sweeping direction during an exchange of the boundary data. As indicated in the adaptive task chart, the reduction over the “lagged” wave speed data can be performed in the background while building the various operators. Note the use of MPI_WAIT prior to performing the integrator step. This is done to prevent certain overwrite issues during the local reductions in the subsequent integrator step.

A typical complication that arises in distributed implementations of PDE solvers concerns the use of various expensive collective operations, such as “all-to-one” and “one-to-all” communications. For implicit methods, these operations occur as part of the iterative method used for solving distributed linear systems. The method employed here is “matrix-free”, which eliminates the need to solve such distributed linear systems. For explicit methods, these operations arise when an adaptive time stepping rule, such as equation (3.44), is employed to ensure that the CFL restriction is satisfied for stability purposes. At each time step, each of the processors, or ranks, must know the maximum wave speeds across the entire simulation domain. On a distributed system, transferring this information requires the use of certain collective operations, which typically have an overall complexity of O(log N_p), where N_p is the number of processors.
While the logarithmic complexity results in a massive reduction in the overall number of steps, these operations use a barrier, in which all progress stops until the operation is completed. This step cannot be avoided for explicit methods, as the most recent information from the solution is required to accurately compute the maximum wave speeds. In contrast, successive convolution methods do not require this information. However, implementations of the schemes developed in, e.g., [56, 44, 45] considered “explicit” time stepping rules given by equation (3.44) because they improved the convergence of the approximations. By exploiting the stability properties associated with successive convolution methods, we can eliminate the need for accurate wave speed information based on the current state and, instead, use approximations obtained with “lagged” data from the previous time step. We present two generic, distributed memory task charts in Figure 3.6. The algorithm shown in the left half of Figure 3.6, which is based on a fixed time step, contains less overall communication, as the local and global reductions for certain information used to compute the time step are no longer necessary. The second version, shown in the right half of Figure 3.6, illustrates the key steps used in the implementation of an adaptive rule, which can be used for problems with more dynamic quantities (e.g., wave speeds and diffusivity). In contrast to distributed implementations of explicit methods, our adaptive approach allows us to overlap expensive global collective operations (approximately) with the construction of derivative operators, resulting in a more asynchronous algorithm (see Algorithm 3.1).

3.4.6 Some Remarks

In this section, we introduced key aspects that are necessary in developing a performant application.
We began with a brief discussion on Kokkos, which is the programming model used for our shared memory implementation. Then we introduced one of the metrics, namely (3.43), used to characterize the performance of our parallel algorithms. Using this performance metric, we analyzed a collection of techniques for parallelizing prototypical loop structures in our algorithms. These techniques considered several different approaches to the prescription of parallelism through both naive and complex execution policies. Informed by these results, we chose to adopt a coarse-grained, hierarchical approach that utilizes the extensive capabilities available on modern hardware. In consideration of our future work, this approach also offers a large degree of algorithmic flexibility, which will be essential for moving to GPUs. Finally, we provided some details concerning the implementation of the distributed memory components of the parallel algorithms.

Goal: Approximate the global maximum wave speeds c_x, c_y, · · · using the corresponding “lagged” variables c̃_x, c̃_y, · · · .
1: Initialize the c_i's and c̃_i's via the initial condition (no lag has been introduced)
2: while timestepping do
3:   Update the N-N condition (i.e., (3.25), (3.26), or (3.27)) using the “lagged” wave speeds c̃_x, c̃_y, · · ·
4:   Compute Δt using the “lagged” wave speeds and check the N-N condition
5:   Start the MPI_Iallreduce over the local wave speeds c_x, c_y, · · ·
6:   Construct the spatial derivative operators of interest
7:   Post the MPI_WAIT (in case the reductions have not completed)
8:   Transfer the global wave speed information to the corresponding lagged variables: c̃_x ← c_x, c̃_y ← c_y, · · ·
9:   Perform the update step and compute the local wave speeds c_x, c_y, · · ·
10:  Return to step 3 to begin the next time step

Algorithm 3.1: Distributed adaptive time stepping rule.
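Stripped of the MPI calls, the lagged logic of Algorithm 3.1 reduces to a small state machine, sketched here in serial C++ for a single wave speed (the struct and its names are ours, not the production code): Δt for the current step is computed from the wave speed reduced during the previous step, which is what lets the global reduction overlap with operator construction.

```cpp
#include <cmath>
#include <cassert>

// Serial sketch of the "lagged" adaptive rule; MPI_Iallreduce and
// MPI_WAIT (steps 5 and 7 of Algorithm 3.1) are omitted.
struct LaggedStepper {
    double c_lag;   // global max wave speed from the previous step
    double CFL;
    double dx;

    // Returns dt for the current step using the lagged speed (step 4),
    // then accepts the freshly reduced speed for the next step (step 8).
    double step(double c_current) {
        double dt = CFL * dx / c_lag;
        c_lag = c_current;
        return dt;
    }
};
```

The stability of the underlying scheme is what tolerates the one-step-old wave speed; an explicit method could not safely make this substitution.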
We introduced two different approaches: one based on a fixed time step, with minimal communication, and another, which exploits the stability properties of the representations and allows for adaptive time stepping rules. The next section provides numerical results, which demonstrate not only the performance and scalability of these algorithms, but also their versatility in addressing different PDEs.

3.5 Numerical Results

This section provides the experimental results for our parallel algorithms using MPI and Kokkos together with the OpenMP backend. First, in 3.5.1, we define several test problems and verify the rates of convergence for the hybrid algorithms described in sections 3.3 and 3.4. Next, we provide both weak and strong scaling results obtained from each of the example problems discussed in 3.5.1. 3.5.4 provides some insight on issues faced by the distributed memory algorithms, in light of the N-N condition (3.25), which was derived in 3.3.1. Unless otherwise stated, the results presented in this section were obtained using the configurations outlined in Table 3.2. Timing data presented in Figures A.2, A.3, and A.4 was collected using 10 trials for each configuration (problem size and node count), with the update metric (3.43) being displayed relative to 10⁹ DOF/node/s. Each of these trials evolved the numerical solution over 10 time steps. Error bars, for data involving averages, were computed using the sample standard deviation.

3.5.1 Description of Test Problems and Convergence Experiments

Despite the fact that we are primarily focused on developing codes for high-performance applications, we must also ensure that the parallel algorithms produce reliable answers. Here, we demonstrate convergence of the 2-D hybrid parallel algorithms on several test problems, including a nonlinear example that employs the adaptive time stepping rule outlined in Algorithm 3.1.
The convergence results used 9 nodes, with 40 threads per node, assigning 1 MPI rank to each node, for a total of 360 threads. The quadrature method used to construct the local integrals is the fifth-order WENO quadrature rule, described in 3.2.4, which uses only the linear weights. The numerical solution, in each of the examples, remains smooth over the corresponding time interval of interest. Therefore, it is not necessary to transform the linear quadrature weights to nonlinear ones. According to the analysis of the truncation error presented in [56, 44], retaining a single term in the partial sums for D∗ should yield a first-order convergence rate, depending on the choice of α. Convergence results for each of the three test problems defined below are provided in Figure 3.7.

CPU Type: Intel Xeon Gold 6148
C++ Compiler: ICC 2019.03
MPI Library: Intel MPI 2019.3.199
Optimization Flags: -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qno-opt-prefetch
Thread Bindings: OMP_PROC_BIND=close, OMP_PLACES=threads
Team Size: Kokkos::AUTO()
Base Block Size: 256²
CFL: 1.0
β: 1.0
ε: 1 × 10⁻¹⁶

Table 3.2: Architecture and code configuration for the numerical experiments conducted on the Intel 18 cluster at Michigan State University's Institute for Cyber-Enabled Research. As with the loop experiments in 3.4.3, we encourage the compiler to use AVX-512 instructions and avoid the use of prefetching. All available threads within a node (40 threads/node) were used in the experiments. Each node consists of two Intel Xeon Gold 6148 CPUs and at least 83 GB of memory. We wish to note that hyperthreading is not supported on this system. As mentioned in 3.4.3, when hyperthreading is not enabled, Kokkos::AUTO() defaults to a team size of 1. In cases where the base block size did not divide the problem evenly, this parameter was adjusted to ensure that blocks were nearly identical in size. The parameter β, which does not depend on Δt, is used in the definition of α. For details on the range of admissible β values, we refer the reader to [56, 44], where this parameter was introduced. Lastly, recall that ε is the tolerance used in the N-N constraints.

Example 1: Linear Advection Equation

The first test problem considered in this work is the 2-D linear advection equation

∂_t u + ∂_x u + ∂_y u = 0,    (x, y) ∈ [0, 2π]²,
u_0(x, y) = (1/4)(1 − cos(x))(1 − cos(y)),

subject to two-way periodic BCs. We evolve the numerical solution to the final time T = 2π. In the experiments, we used the same number of mesh points in both directions, with α_x = α_y = β/Δt, with β = 1.

[Figure 3.7 plots the L∞ error against Δx for the advection, diffusion, and Hamilton-Jacobi applications, together with a first-order reference line.]

Figure 3.7: Convergence results for each of the 2-D example problems. Results were obtained using 9 MPI ranks with 40 threads/node. Also included is a first-order reference line (solid black). Our convergence results indicate first-order accuracy resulting from the low-order temporal discretization. The final reported L∞ errors for each of the applications, on a grid containing 5277² total zones, are 2.874 × 10⁻³ (advection), 4.010 × 10⁻⁴ (diffusion), and 2.674 × 10⁻⁴ (H-J).

While this problem is rather simple and does not highlight many of the important features of our algorithm, it is nearly identical to the code for a nonlinear example. For initial experiments, a simple test problem is preferable because it gives more control over quantities which are typically dynamic, such as wave speeds. Moreover, the error can be easily computed from the exact solution u(x, y, t) = u_0(x − t, y − t).

Example 2: Linear Diffusion Equation

The next test problem that we consider is the linear diffusion equation

∂_t u = ∂_xx u + ∂_yy u,    (x, y) ∈ [0, 2π]²,
u_0(x, y) = sin(x) + sin(y),

subject to two-way periodic BCs. The numerical solution is evolved over (0, T], with T = 1, in order to prevent substantial decay.
As with the previous example, we use an equal number of mesh points in both directions, so that Δx = Δy. The fixed time stepping rule was used, with Δt = Δx. Compared with the previous example, we used the parameter definitions α_x = α_y = 1/√Δt, which corresponds to β = 1 in the definition for second derivative operators. The exact solution for this problem is given by

u(x, y, t) = e^(−t)(sin(x) + sin(y)).

Characteristically, this example is different from the advection equation in the previous example, which allows us to illustrate some key features of the method. Firstly, code developed for advection operators can be reused to build diffusion operators, an observation made in 3.2.1. More specifically, to construct the left and right-moving local integrals, we used the same linear WENO quadrature as with the advection equation in Example 1. However, we note that this particular example could, instead, use a more compact quadrature to eliminate the halo communication, which would remove a potential synchronization point. The second feature concerns the time-to-solution and is related to the unconditional stability of the method. Linear diffusion equations, when solved by an explicit method, are known to incur a harsh stability restriction on the time step, namely Δt ∼ Δx², making long-time simulations prohibitively expensive. The implicit aspect of this method drastically reduces the time-to-solution, as one can now select time steps which are, for example, proportional to the mesh spacing. This benefit is further emphasized by the overall speed of the method, which can be observed in sections 3.5.2 and 3.5.3.

Example 3: Nonlinear Hamilton-Jacobi Equation

The last test problem we consider is the nonlinear H-J equation

∂_t u + (1/2)(1 + ∂_x u + ∂_y u)² = 0,    (x, y) ∈ [0, 2π]²,
u_0(x, y) = 0,

subject to two-way periodic BCs.
To prevent the characteristic curves from crossing, which would lead to jumps in the derivatives of the function u, the numerical solution is tracked over a short time, i.e., T = 0.5. We applied a high-order linear WENO quadrature rule to approximate the left and right-moving local integrals and used the same parameter choices for α_x, α_y, and β as with the advection equation in Example 1. However, since the wave speeds fluctuate based on the behavior of the solution u, we allow the time step to vary according to (3.44), which requires the use of the distributed adaptive time stepping rule outlined in Algorithm 3.1. Typically, an exact solution is not available for such problems. Therefore, to test the convergence of the method, we use a manufactured solution given by

u(x, y, t) = t (sin(x) + sin(y)),

with a corresponding source term included on the right-hand side of the equation. Methods employed to solve this class of problems are typically explicit, with a shock-capturing method being used to handle the appearance of “cusps” that would otherwise lead to jumps in the derivative of the solution. A brief summary of such methods is provided in our recent paper [45], where extensions of successive convolution were developed for curvilinear and non-uniform grids. The method follows the same structural format as an explicit method, with the ability to take larger time steps as in an implicit method. However, the explicit-like structure of this method does not require iteration for nonlinear terms and allows for a more straightforward coupling with high-order shock-capturing methods. We wish to emphasize that despite the fact that this example is nonlinear, the only major mathematical difference with Example 1 is the evaluation of a different Hamiltonian function.
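To make that last point concrete, the two Hamiltonians can be written as plain functions (the function names here are ours): writing each equation as ∂_t u + H(∂_x u, ∂_y u) = 0, Example 1 corresponds to H(p, q) = p + q, while Example 3 uses H(p, q) = (1/2)(1 + p + q)².

```cpp
#include <cassert>

// Hamiltonian for Example 1 (linear advection): H(p, q) = p + q,
// where p = du/dx and q = du/dy.
double hamiltonian_advection(double p, double q) {
    return p + q;
}

// Hamiltonian for Example 3 (nonlinear H-J): H(p, q) = 0.5*(1 + p + q)^2.
double hamiltonian_hj(double p, double q) {
    double s = 1.0 + p + q;
    return 0.5 * s * s;
}
```

Swapping one function for the other is, mathematically, the only change between the two applications; the surrounding reconstruction and time stepping machinery is shared.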
3.5.2 Weak Scaling Experiments

A useful performance property for examining the scalability of parallel algorithms describes how they behave when the compute resources are chosen proportionally to the size of the problem. Here, the amount of work per compute unit remains fixed, and the compute units are allowed to fluctuate. Weak scaling assumes ideal or best-case performance for the parallel components of algorithms and ignores the influence of bottlenecks imposed by the sequential components of a code. Therefore, for N compute units, we shall expect a speedup of N. This motivates the following definitions for speedup and efficiency in the context of weak scaling:

S_N = N T_1 / T_N,    E_N = S_N / N ≡ T_1 / T_N.

Therefore, with weak scaling, ideal performance is achieved when the run times for a fixed work size (or, equivalently, the DOF/node/s) remain constant as we vary the compute units. To scale the problem size, we take advantage of the periodicity of the test problems. The base problem on [0, 2π] × [0, 2π] can be replicated across nodes so that the total work per node remains constant. Provided in A.2 are plots of the weak scaling data (specifically, the update metric (3.43) and the corresponding efficiency) obtained from the fastest of 10 trials of each configuration, using up to 49 nodes (1,960 cores). These results generally indicate good performance, both in terms of the update frequency and efficiency, for a variety of problem sizes. Weak scalability appears to be excellent up to 16 nodes (640 cores), then begins to decline, most likely due to network effects. The performance behavior for the advection and diffusion applications is quite similar, which is to be expected, since the parallel algorithms used to construct the base operators are nearly identical. With regard to the Hamilton-Jacobi application, we see that the performance is similar to the other applications at larger node counts.
This seems to indicate that no major communication penalties are incurred by the use of the adaptive time stepping method shown in Algorithm 3.1, compared to fixed time stepping. Additionally, in the Hamilton-Jacobi application, we observe a sharp decline in the performance at 9 nodes in A.2. A closer investigation reveals that this is likely an artifact of the job scheduler for the system on which the experiments were conducted, as we were unable to secure a “contiguous” allocation of nodes. This has the unfortunate consequence of not being able to guarantee that data for a particular trial remain in close physical proximity. This could result in issues such as network contention and delays that exacerbate the cost of communication relative to the computation, as discussed in 3.4. A non-contiguous placement of data is problematic for codes with inexpensive operations, such as the methods shown here, because the work may be insufficient to hide this increased cost of communication. For this reason, we chose to include plots containing the averaged weak scaling data in A.2, which contain error bars calculated from the sample standard deviation. The noticeable size of the error bars in these plots generally indicates a large degree of variation in the timings collected from trials. To more closely examine the importance of data proximity on the nodes, we repeated the weak scaling study, but with node counts for which a contiguous allocation could be guaranteed. We have provided results for the fastest and averaged data in A.3. Data collected from the fastest trials indicates nearly perfect weak scaling, across all applications, up to 9 nodes, with a consistent update rate between 2 and 4 × 10⁸ DOF/node/s. For convenience, these results were plotted with the same markers and formats so that results from the larger experiments in A.2 can be compared directly.
A comparison of the fastest timings between the large and small runs supports our claim that data proximity is crucial to achieving the peak performance of the code. Furthermore, the error bars for the contiguous experiments displayed in A.3 show that the individual trials exhibit less overall variation in timings.

3.5.3 Strong Scaling Experiments

Another form of scalability considers a fixed problem size and examines the effect of varying the number of work units used to find the solution. In these experiments, we allow the work per compute unit to decrease, which helps identify regimes where sequential bottlenecks in algorithms become problematic, provided we are granted enough resources. Applications which are said to strong scale exhibit run times which are inversely proportional to the number of resources used. For example, when $N$ compute units are applied to a problem, one expects the run to be $N$ times faster than with a single compute unit. Additionally, if an algorithm's performance is memory bound, rather than compute bound, this will, at some point, become apparent in these experiments. Supplying additional compute units should not improve performance if more time is spent fetching data rather than performing useful computations. This motivates the following definitions for speedup and efficiency in the context of strong scaling:
\[
S_N = \frac{T_1}{T_N}, \qquad E_N = \frac{S_N}{N}.
\]
Here, $N$ is the number of nodes used, so that $T_1$ and $T_N$ correspond to the time measured using a single node and $N$ nodes, respectively. Results of our strong scaling experiments are provided in A.4. As with the weak scaling experiments, we have plotted the update metric 3.43 along with the strong scaling efficiency using both the fastest and averaged configuration data from a set of 10 trials.
In contrast to weak scaling, strong scaling does not assume ideal speedup, so one could plot this information; however, it can be ascertained from the efficiency data, so we refrain from plotting it. Results from these experiments show decent strong scalability for the N-N method. This method does not contain a substantial amount of work, so we do not expect good performance for smaller base problem sizes, as the work per node becomes insufficient to hide the cost of communication. On the other hand, larger base problem sizes, which introduce more work, are capable of saturating the resources, but will also, at some point, become insufficient. This behavior is apparent in our efficiency plots. Increasing the problem size generally results in an improvement of the efficiency and speedup for the method. Part of these problems can be attributed to the use of the blocking pattern for loop structures discussed in 3.4.4. Depending on the size of the mesh, it may be the case that the block size and the team size set by the user result in idle threads. One possible improvement is to simply increase the team size so that there are fewer idle threads within an MPI task. Alternatively, one can adjust the number of threads per task, so that each task is responsible for fewer threads. While these approaches can be implemented with no changes to the code, they will likely not resolve this issue. Profiling seems to indicate that the source of the problem is the low arithmetic intensity of the reconstruction algorithms. In other words, the method is memory bound because the calculations required in the reconstructions are inexpensive relative to the cost of retrieving data from memory. As part of our future work, we plan to investigate such limitations through the use of detailed roofline models. We also plan to consider test problems in 3-D, which will introduce additional work.
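As a concrete sketch of the strong-scaling definitions used above (the function and the sample timings are illustrative, not measured data):

```python
def strong_scaling_table(times_by_nodes):
    """Given {n: T_n} for a fixed total problem size, return
    {n: (S_N, E_N)} with speedup S_N = T_1/T_N and efficiency E_N = S_N/N.
    Requires an entry for n = 1 as the baseline."""
    t1 = times_by_nodes[1]
    return {n: (t1 / tn, t1 / (tn * n)) for n, tn in times_by_nodes.items()}

# A hypothetical run that scales perfectly to 4 nodes, then saturates:
table = strong_scaling_table({1: 100.0, 4: 25.0, 16: 12.5})
```

Here the 4-node run has efficiency 1.0, while the 16-node run achieves only a speedup of 8 (efficiency 0.5), the kind of saturation the efficiency plots in A.4 reveal.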
3.5.4 Effect of CFL

In order to enforce an N-N dependency for our domain decomposition algorithm, we obtained several possible restrictions on Δt, depending on the problem and the choice of α. In the case of linear advection, we would, for example, require that
\[
\Delta t \le -\frac{\beta L_m}{c_{\max} \log(\epsilon)},
\]
with the largest possible time step permitting N-N dependencies being set by the equality. Admittedly, such a restriction is undesirable. As mentioned in 3.3.1, this assumption can be problematic if the problem admits fast waves ($c_{\max}$ is large) and/or if the block sizes are particularly small ($L_m$ is small). In many applications, the former circumstance is quite common. However, our test problem contains fixed wave speeds, so this is less of an issue. The latter condition is a concern for configurations which use many blocks, such as a large simulation on many nodes of a cluster. Another potential circumstance is related to the granularity of the blocks. For example, in these experiments, we use 1 MPI rank per compute node. However, it may be advantageous to consider different task configurations, e.g., using 1 (or more) rank(s) per NUMA region of a compute node. A larger CFL parameter is generally preferable because it reduces the overall time-to-solution. Eventually, however, for a given CFL, there will be a crossover point, where the time step restriction causes the performance to drop due to the increasing number of sequential time steps. This experiment used a highly refined grid and varied the CFL number, using up to 9 nodes. Results from the CFL experiments are provided in 3.8. The data was obtained using an older version of our parallel algorithms, compiled with GCC 8.2.0-2.31.1, which does not use blocking. By plotting the behavior according to the number of nodes (ranks) used, we can fix the lengths of the blocks, hence $L_m$, and change the time step to identify the breakdown region.
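The time step bound above is simple enough to sketch directly; the function name is ours, and the parameters follow the notation of the restriction (β a scheme constant, L_m the smallest block length, c_max the fastest wave speed, ε the kernel truncation tolerance).

```python
import math

def max_dt_nearest_neighbor(beta, L_m, c_max, eps):
    """Largest time step for which the non-local convolution support stays
    within nearest-neighbor blocks: dt <= -beta*L_m / (c_max*log(eps)).
    Since 0 < eps < 1, log(eps) < 0 and the bound is positive."""
    assert 0.0 < eps < 1.0
    return -beta * L_m / (c_max * math.log(eps))
```

Note that halving L_m (more, smaller blocks) halves the permitted time step linearly, while tightening the tolerance from 1e-8 to 1e-16 only halves it, reflecting the logarithmic suppression of the tolerance discussed below.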
We observe a substantial decrease in performance for the 9 node configuration, specifically when the CFL number increases from 4 to 5. For more complex problems with dynamic wave behavior, this breakdown may be observed earlier. In response to this behavior, a user could simply increase (relax) the tolerance, but the logarithm tends to suppress the impact of large relaxations.

Figure 3.8: Results on the N-N method for the linear advection equation using a fixed mesh with 53772 total DOF and a variable CFL number, showing walltime (s) and efficiency against the CFL number for 1, 4, and 9 MPI ranks. In each case, we used the fastest time-to-solution collected from repeating each configuration a total of 20 times. This particular data was collected using an older version of the code, compiled with GCC, which did not use the blocking approach. For larger block sizes, increasing the CFL has a noticeable improvement on the run time, but as the block sizes become smaller, the gains diminish. For example, if 9 MPI ranks are used, improvements are observed as long as CFL ≤ 4. However, when CFL = 5, the run times begin to increase, with a significant decrease in efficiency. As the blocks become smaller, Δt needs to be adjusted (decreased) so that the support of the non-local convolution data does not extend beyond N-Ns.

Another option, which shall be considered in future work, is to include more information from neighboring ranks by either eliminating this condition or, at the least, communicating enough information to achieve a prescribed tolerance.

3.6 Conclusion

In this paper, we presented hybrid parallel algorithms capable of addressing a wide class of both linear and nonlinear PDEs.
To enable parallel simulations on distributed systems, we derived a set of conditions that use available wave speed (and/or diffusivity) information, along with the size of the sub-domains, to limit the communication through an adjustment of the time step. Although not considered here, these conditions, which are needed to ensure accuracy rather than the stability of the method, can be removed at the cost of additional communication. Using these restrictions, boundary conditions are enforced across sub-domains in the decomposition. Results were obtained for 2-D examples consisting of linear advection, linear diffusion, and a nonlinear H-J equation to highlight the versatility of the methods in addressing characteristically different PDEs. As part of the implementation, we used constructs from the Kokkos performance portability library to parallelize the shared memory components of the algorithms. We extracted essential loop structures from the algorithms and analyzed a variety of parallel execution policies in an effort to develop an efficient application. These experiments considered several common optimization techniques, such as vectorization, cache-blocking, and placement of threads. From these experiments, we chose to use a blocked iteration pattern in which threads (or teams thereof) are mapped to blocks of an array, with vector instructions being applied to 1-D line segments. These design choices offer a large degree of flexibility, which is an important consideration as we proceed with experimentation on other architectures, including GPUs, to leverage the full capabilities provided by Kokkos. By exploiting the stability properties of the representations, we also developed an adaptive time stepping method for distributed systems that uses "lagged" wave speed information to calculate the time step.
While the methods presented here do not require adaptive time stepping for stability, it was included as an option because of its ability to prevent excessive numerical diffusion, as observed in previous work. Convergence and scaling properties for the hybrid algorithms were established using at most 49 nodes (1,960 cores), with a peak performance > 10^8 DOF/node/s. Larger weak scaling experiments, which used up to 49 nodes (1,960 cores), initially performed reasonably well, with all applications later tending to 60% efficiency, corresponding (roughly) to 2 × 10^8 DOF/node/s. While some performance loss is to be expected from network-related complications, we found this to be much larger than what was observed in prior experiments. Later, it was discovered that the request for a contiguous allocation could not be accommodated, so data locality in the experiments was compromised. By repeating the experiments on a smaller collection of nodes, which granted this request, we discovered that data locality plays a pivotal role in the overall performance of the method. We observe that a large base problem size is required to achieve good strong scaling. Furthermore, when threads are prescribed work at a coarse granularity (i.e., across blocks, rather than entries within the blocks), one must ensure that the problem size is capable of saturating the resources to avoid idle threads. This approach introduces further complications for strong scaling, as the workload per node drops substantially while the block size remains fixed. Successive convolution is quite similar to finite difference methods, which generally do not generate a substantial amount of work; therefore, we do not expect excellent strong scalability. Certain aspects of the algorithms can be tuned to improve the arithmetic intensity, which will improve the strong scaling behavior. At some point, however, the algorithms will be limited by the speed of memory transfers rather than computation.
We also provided experimental results that demonstrate the limitations of the N-N condition in the context of strong scaling. While we have presented several new ideas with this work, there is still much untapped potential with successive convolution methods. Firstly, optimizations on GPU architectures, which shall play an integral role in the upcoming exascale era, need to be explored and compared with CPUs. A roofline model should be developed for these algorithms to help identify key limitations and bottlenecks and to formulate possible solutions. Although this work considered only first-order time discretizations, our future developments shall be concerned with evaluating a variety of high-order time discretization techniques in an effort to increase the efficiency of the method. Lastly, the parallel algorithms should be modified to enable the possibility of mesh adaptivity, which is a common feature offered by many state-of-the-art computing libraries.

CHAPTER 4

DEVELOPING A PARTICLE-IN-CELL METHOD

4.1 Introduction

This chapter concerns the development of a particle-in-cell (PIC) method for plasmas that is constructed using the solvers for fields developed in the preceding chapters. We begin by introducing the concept of a macro-particle, which is the foundation of all PIC methods, in section 4.2. Next, in section 4.3, we discuss techniques used in our methods to enforce the involutions of the computational model. We then discuss the methods employed to advance the particle data, including standard integrators, which can be found in section 4.4, as well as more recent integrators developed for problems with so-called non-separable Hamiltonians in section 4.5. In the sections on integrators, we propose modifications which allow for a coupling of these methods to the proposed field solvers. Numerical examples which demonstrate the performance of the proposed methods on several plasma test problems are presented in section 4.6.
We provide a summary of our key findings in section 4.7.

4.2 Moving from Point-particles to Macro-particles

One of the essential features of PIC methods is that the simulation particles are not physical particles. Instead, they represent a sample of particles collected from an underlying distribution function. For this reason, they are often called super-particles or macro-particles. Moreover, the "size" of this sample is reflected in the weight associated with a given macro-particle, $w_{mp}$, which can be calculated as
\[
w_{mp} = \frac{N_{\text{real}}}{N_{\text{simulation}}}.
\]
Here, we use $N_{\text{real}}$ to denote the number of physical particles contained within a simulation domain and $N_{\text{simulation}}$ to be the number of simulation particles. The calculation of $N_{\text{real}}$ is problem dependent, but it can be expressed in terms of the average macroscopic number density $\bar{n}$ that describes the plasma and a volume associated with either the domain or the beam being considered. Once the weight for each particle is calculated, it can be absorbed into properties of the particle species, e.g., the charge, so that $w_{mp} q_i$ is written as $q_i$. In section 1.2.2, the charge density and current density were defined in equations (1.22) and (1.23) using a linear combination of Dirac delta distributions for a collection of $N_p$ simulation particles. While PIC methods can certainly be developed to work with these point-particle representations, most PIC methods, including the ones developed in this work, represent particles using shape functions so that
\[
\rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p S\left(\mathbf{x} - \mathbf{x}_p(t)\right), \tag{4.1}
\]
\[
\mathbf{J}(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \mathbf{v}_p S\left(\mathbf{x} - \mathbf{x}_p(t)\right), \tag{4.2}
\]
where the shape function $S$ is now used to represent a simulation particle. The shape functions most often employed in PIC simulations are B-splines, which are compact (local), positive, and satisfy the partition of unity property. Furthermore, they can be easily extended to include additional dimensions using tensor products of univariate splines.
While higher-order splines produce smoother mappings to the mesh and possess higher degrees of continuity, the extended support regions create complications for plasmas on bounded domains. The methods developed in this work employ linear splines to represent particle shapes. The linear spline function that represents the particle $x_p$ on the mesh with spacing $\Delta x$ is given by
\[
S(x - x_p) =
\begin{cases}
1 - \dfrac{|x - x_p|}{\Delta x}, & 0 \le \dfrac{|x - x_p|}{\Delta x} \le 1, \\[6pt]
0, & \dfrac{|x - x_p|}{\Delta x} > 1.
\end{cases} \tag{4.3}
\]
The shape function (4.3) generally serves two purposes: (1) it provides a way to map particle data onto the mesh (the scatter operation) and (2) it can be used to interpolate mesh-based quantities to the particles during the time integration (the gather operation). For consistency in a PIC method, it is important that these maps be identical. In the next section, we address the issue of enforcing charge conservation with the proposed formulation. To this end, we discuss several approaches, including divergence cleaning methods, as well as a modification of the usual PIC charge mapping (4.1) that solves the continuity equation.

4.3 Methods for Controlling Divergence Errors

The formulation adopted in this work considers formulations of Maxwell's equations in the Lorenz gauge (1.8)-(1.10) as well as the Coulomb gauge (1.12)-(1.14). The benefit of adopting a gauge formulation is that the involution for the magnetic field, $\nabla \cdot \mathbf{B} = 0$, will be trivially satisfied, since $\mathbf{B} = \nabla \times \mathbf{A}$; however, it is important to recognize that this formulation is over-determined and only makes sense as long as the particular gauge condition (either (1.10) or (1.14)) is satisfied. The issue of enforcing the gauge condition is ultimately connected to charge conservation through the involution $\nabla \cdot \mathbf{E} = \rho/\epsilon_0$. When this condition is not satisfied or properly controlled in a numerical scheme, the solutions can become non-physical.
In [19], the authors proposed techniques for controlling the growth of errors in Gauss' law and demonstrated the behavior that results when these conditions are not properly enforced in PIC methods. Even if a Yee mesh [13] is used to represent the fields, this condition is not guaranteed to be satisfied in PIC codes due to errors introduced through the particle-to-mesh mappings [77]. In an effort to control this error, we examine two general classes of methods: (1) divergence cleaning through an auxiliary equation and (2) an analytical charge map for particles that is based on the continuity equation. The methods contained in the first class perform the cleaning through field solvers, which require a Poisson solve or a wave solve, the latter of which can be performed with the methods introduced in Chapter 2. The method in the second class enforces charge conservation through a path integral of the particle trajectories and is an analytical extension of the map presented in [25] for FEM-PIC. In contrast to [25], which computed the map numerically via quadrature, the map presented in this work is obtained by analytically integrating and differentiating the shape functions used to represent particles.

4.3.1 A Classic Elliptic Projection Method Based on Gauss' Law

In PIC methods based on the Boris method [4], which evolve particles using E and B fields, one can use a classic elliptic projection technique to reduce the errors in Gauss' law [2, 20, 77]. The idea is that one would like to enforce Gauss' law
\[
\nabla \cdot \mathbf{E} = \frac{1}{\sigma_1}\rho.
\]
During the simulation, a build-up of violations of the continuity equation introduces certain errors that change the irrotational part of the electric field. If we let $\mathbf{E}^*$ denote the electric field computed with a numerical method, we can see that these violations take the form
\[
\mathbf{E} = \mathbf{E}^* - \nabla \psi_E, \tag{4.4}
\]
where $\psi_E$ is some scalar function that should be determined.
Taking the divergence of both sides and using Gauss' law, we obtain the elliptic equation
\[
-\Delta \psi_E = \frac{1}{\sigma_1}\rho - \nabla \cdot \mathbf{E}^*. \tag{4.5}
\]
The divergence term for the numerical electric field requires some form of numerical derivative obtained using a difference approximation or a basis. If we assume that the error is zero at the boundary, which is a realistic assumption, this equation can be solved with homogeneous Dirichlet boundary conditions. One then takes a discrete gradient and corrects the numerical electric field using (4.4). For the case in which fields are expressed in terms of potentials, recall that the electric field is calculated as
\[
\mathbf{E} = -\nabla \psi - \frac{\partial \mathbf{A}}{\partial t},
\]
which is valid for any gauge. After solving equation (4.5), we can correct the scalar potential and its gradient according to
\[
\psi = \psi^* + \psi_E, \qquad \nabla \psi = \nabla \psi^* + \nabla \psi_E,
\]
where we have, again, used "*" to denote the initial approximation.

4.3.2 Elliptic Divergence Cleaning Based on Potentials

Another elliptic divergence cleaning method was developed in the thesis [42] and also appeared in [41] for PIC simulations of the VM system in the Lorenz gauge. The idea is to construct an elliptic equation by combining Gauss' law with the electric field, which is expressed in terms of the potentials. In other words, we can substitute the definition
\[
\mathbf{E} = -\nabla \psi - \frac{\partial \mathbf{A}}{\partial t}
\]
into Gauss' law to arrive at the elliptic equation
\[
-\Delta \psi = \frac{1}{\sigma_1}\rho + \frac{\partial (\nabla \cdot \mathbf{A})}{\partial t}.
\]
This equation was solved using a second-order centered finite-difference approximation of the Laplacian, and it was shown that a discrete form of Gauss' law will be satisfied as long as a staggered grid is used for the mesh data. This mesh staggering, which is problem dependent, is different from the more commonly used Yee method [13] and is largely motivated by the difference scheme used to approximate the Laplacian.
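As a minimal, hedged illustration of the projection in section 4.3.1, the sketch below solves the elliptic equation (4.5) spectrally on a 1-D periodic domain and applies the correction (4.4). The thesis uses homogeneous Dirichlet conditions and mesh-based derivatives, so the periodic FFT solve and the function name here are simplifications of ours for demonstration only.

```python
import numpy as np

def project_gauss_1d(E_star, rho, sigma1, L):
    """Elliptic projection on a periodic 1-D domain of length L:
    solve -psi'' = rho/sigma1 - dE*/dx in Fourier space, then return the
    corrected field E = E* - dpsi/dx, which satisfies Gauss' law.
    Assumes the right-hand side has zero mean (solvability on a torus)."""
    n = E_star.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)       # wavenumbers
    rhs_hat = np.fft.fft(rho / sigma1) - 1j * k * np.fft.fft(E_star)
    psi_hat = np.zeros(n, dtype=complex)
    nz = k != 0.0
    psi_hat[nz] = rhs_hat[nz] / (k[nz] ** 2)           # k^2 psi_hat = rhs_hat
    return E_star - np.real(np.fft.ifft(1j * k * psi_hat))
```

For example, polluting the field E = sin(x) (with rho = cos(x), sigma1 = 1) by the gradient 0.2 cos(2x) and projecting recovers sin(x) to machine precision, since the added error is purely irrotational.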
4.3.3 Enforcing the Lorenz Gauge through Lagrange Multipliers

In this section, we introduce a Lagrange multiplier to enable the enforcement of the Lorenz gauge condition (1.10), following [19], which proposed a divergence cleaning technique for the E-B form of Maxwell's equations using a hyperbolic model. The approach presented in this section considers the same system written in terms of the Lorenz gauge. To this end, we modify the Lorenz gauge condition by introducing a function $\phi$ which represents a residual:
\[
\frac{\phi}{d^2} = \nabla \cdot \mathbf{A} + \frac{1}{\kappa^2}\frac{\partial \psi}{\partial t}. \tag{4.6}
\]
As we will see, this function $\phi$ satisfies a wave equation whose purpose is to sweep away the residual in the Lorenz gauge. The non-physical parameter $d > 1$ is connected to the wave speed for $\phi$ and is selected to ensure that waves for $\phi$ propagate faster than the other waves of interest in the problem. This choice is often problem dependent, but $d \approx 10$ is often used [19, 20, 21]. We consider a modification of the original system (1.41)-(1.43), which has the form
\[
\frac{1}{\kappa^2}\frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1}\rho,
\]
\[
\frac{1}{\kappa^2}\frac{\partial^2 \mathbf{A}}{\partial t^2} - \Delta \mathbf{A} - \nabla \phi = \sigma_2 \mathbf{J},
\]
\[
\nabla \cdot \mathbf{A} + \frac{1}{\kappa^2}\frac{\partial \psi}{\partial t} = \frac{1}{d^2}\phi.
\]
We have left the particle equations unspecified for generality. Observe that this modified system is equivalent to the original system when $\phi \equiv 0$. To develop the hyperbolic auxiliary equation, we first create a coupling between the wave equation for the scalar potential $\psi$ and the modified gauge condition (4.6). This is accomplished by substituting $\frac{1}{\kappa^2}\partial_t \psi$ from the latter into the former to obtain
\[
\frac{1}{d^2}\frac{\partial \phi}{\partial t} - \nabla \cdot \frac{\partial \mathbf{A}}{\partial t} - \Delta \psi = \frac{1}{\sigma_1}\rho.
\]
We then take a time derivative of this equation to obtain
\[
\frac{1}{d^2}\frac{\partial^2 \phi}{\partial t^2} - \nabla \cdot \frac{\partial^2 \mathbf{A}}{\partial t^2} - \Delta\left(\frac{\partial \psi}{\partial t}\right) = \frac{1}{\sigma_1}\frac{\partial \rho}{\partial t}.
\]
Using the wave equation for $\mathbf{A}$ for the second time derivative in the above equation, we obtain, after rearranging the terms and with the aid of (4.6), a wave equation for $\phi$, namely
\[
\frac{1}{\kappa^2 d^2}\frac{\partial^2 \phi}{\partial t^2} - \left(1 + \frac{1}{d^2}\right)\Delta \phi = \sigma_2\left(\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J}\right).
\]
Hence, the system to be solved is
\[
\frac{1}{\kappa^2}\frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1}\rho,
\]
\[
\frac{1}{\kappa^2}\frac{\partial^2 \mathbf{A}}{\partial t^2} - \Delta \mathbf{A} - \nabla \phi = \sigma_2 \mathbf{J},
\]
\[
\frac{1}{\kappa^2 d^2}\frac{\partial^2 \phi}{\partial t^2} - \left(1 + \frac{1}{d^2}\right)\Delta \phi = \sigma_2\left(\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J}\right),
\]
where the new wave equation for $\phi$ is used to enforce the gauge condition. Since the function $\phi$ represents the gauge error, which should be "swept away", it is prescribed outflow boundary conditions. Initially, the data for the potentials should satisfy the gauge condition, so the initial data is $\phi(\mathbf{x}, 0) = 0$. Furthermore, notice that the source in this auxiliary equation contains
\[
\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J},
\]
which describes the evolution of charge in the domain. If the charge in the problem is conserved, then
\[
\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J} = 0,
\]
and the gauge condition will be naturally satisfied. Otherwise, local violations of this constraint act as sources that generate waves for the residual function $\phi$.

4.3.4 Enforcing the Coulomb Gauge

The Coulomb gauge (1.47) can be enforced using a projection technique that is nearly identical to the one discussed in section 4.3.1 that enforces Gauss' law. Any errors in the gauge condition manifest through irrotational components, which need to be removed. To this end, we apply a Helmholtz decomposition to the numerical vector potential (indicated with "*"):
\[
\mathbf{A}^* = \mathbf{A}_{\text{rot}} + \mathbf{A}_{\text{irrot}}.
\]
If the vector potential is purely rotational, then it is the curl of another function, so naturally $\nabla \cdot \mathbf{A}_{\text{rot}} = 0$. The irrotational or curl-free part of the vector potential can be expressed as the gradient of a scalar function, so the decomposition for the vector potential is given by
\[
\mathbf{A}^* = \mathbf{A}_{\text{rot}} - \nabla \eta, \tag{4.7}
\]
where $\eta$ is some scalar function. Taking the divergence of equation (4.7), we see that
\[
-\Delta \eta = \nabla \cdot \mathbf{A}^*.
\]
After solving this equation for $\eta$ and taking its gradient, we can obtain the rotational part of the vector potential by rearranging (4.7) so that
\[
\mathbf{A}_{\text{rot}} = \mathbf{A}^* + \nabla \eta. \tag{4.8}
\]
If we follow our formulation (1.45)-(1.47), then $\mathbf{A}^*$ will already be an approximation of the rotational part of the vector potential.
Moreover, the rotational part of the wave equation (1.46) (i.e., (1.17)) can be evolved using the field solvers developed in Chapter 2, which means that we can also compute these derivatives as well.

4.3.5 Analytical Maps for Enforcing Charge Conservation

A new approach to enforcing charge conservation was recently proposed in [25] in the context of finite-element particle-in-cell methods. The map for the charge density is based on the continuity equation, which is integrated in time:
\[
\rho^{n+1} = \rho^n - \int_{t^n}^{t^{n+1}} \nabla \cdot \mathbf{J} \, dt. \tag{4.9}
\]
Their map exchanges the time integral (4.9) for a spatial integral that effectively traces the motion of the particle on the mesh from $t^n$ to $t^{n+1}$. In [25] and [26], this mapping was enforced in a weak sense through the finite element basis. Here, we shall extend this technique a bit further to devise a discrete analogue of (4.9), which can be computed analytically and enforces charge conservation to machine precision. We summarize the key identities presented in the paper [26], offering additional details which, we believe, make the method easier to understand. We begin with the usual definitions for the charge density and current density, which were defined in equations (4.1) and (4.2) in terms of a shape function $S(\mathbf{x})$ as
\[
\rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p S\left(\mathbf{x} - \mathbf{x}_p(t)\right), \qquad
\mathbf{J}(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \mathbf{v}_p S\left(\mathbf{x} - \mathbf{x}_p(t)\right).
\]
Again, $N_p$ is the total number of macro-particles in the simulation, so that $q_p$ now represents the charge of a macro-particle. Additionally, the position at time $t$ for a given particle can be expressed in terms of its velocity using
\[
\mathbf{x}_p(t) = \mathbf{x}_p(0) + \int_0^t \mathbf{v}_p(\tau) \, d\tau. \tag{4.10}
\]
Various distributional identities are presented in [26]. In particular, they make repeated use of the Dirac delta distribution, which is defined in Fourier space as
\[
\delta(\mathbf{x} - \mathbf{a}) = \frac{1}{(2\pi)^3} \int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{a})} \, d^3\mathbf{k}. \tag{4.11}
\]
Additionally, the delta distribution obeys the "sifting" properties
\[
f(\mathbf{x}) = \int_{-\infty}^{\infty} f(\mathbf{y}) \, \delta(\mathbf{x} - \mathbf{y}) \, d^3\mathbf{y} \tag{4.12}
\]
and
\[
\delta(\boldsymbol{\eta} - \boldsymbol{\xi}) = \int_{-\infty}^{\infty} \delta(\boldsymbol{\eta} - \mathbf{y}) \, \delta(\mathbf{y} - \boldsymbol{\xi}) \, d^3\mathbf{y}, \tag{4.13}
\]
which shall be used in the discussion that follows. The first identity, given as equation (10) in [26], is
\[
\delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) = \delta\left(\mathbf{x} - \mathbf{x}_p(0)\right) * \delta\left(\mathbf{x} - \int_0^t \mathbf{v}_p(\tau)\, d\tau\right), \tag{4.14}
\]
where we use "*" to denote convolution. To see the equivalence, use the definition (4.10) to convert the particle position to velocity
\[
\delta\left(\mathbf{x} - \mathbf{x}_p\right) = \delta\left(\mathbf{x} - \mathbf{x}_p(0) - \int_0^t \mathbf{v}_p(\tau)\, d\tau\right),
\]
then appeal to identity (4.13), taking $\boldsymbol{\eta} = \int_0^t \mathbf{v}_p(\tau)\, d\tau$ and $\boldsymbol{\xi} = \mathbf{x} - \mathbf{x}_p(0)$ to obtain
\[
\int_{-\infty}^{\infty} \delta\left(\mathbf{y} - \int_0^t \mathbf{v}_p(\tau)\, d\tau\right) \delta\left(\mathbf{x} - \mathbf{x}_p(0) - \mathbf{y}\right) d^3\mathbf{y} \equiv \delta\left(\mathbf{x} - \mathbf{x}_p(0)\right) * \delta\left(\mathbf{x} - \int_0^t \mathbf{v}_p(\tau)\, d\tau\right).
\]
The next identity relates time and space derivatives of the delta distribution and is given (see equation (11) in [26]) as
\[
\partial_t \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) = -\nabla \cdot \left(\mathbf{v}_p(t)\, \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right)\right), \tag{4.15}
\]
where the divergence is taken over the spatial variable $\mathbf{x}$. This can be obtained from a direct calculation using the chain rule and the definition (4.11):
\[
\partial_t \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) = \partial_t \left( \frac{1}{(2\pi)^3} \int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(t))} \, d^3\mathbf{k} \right)
= -\frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} i\mathbf{k}\cdot\mathbf{v}_p(t)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(t))} \, d^3\mathbf{k}
= -\nabla \cdot \left(\mathbf{v}_p(t)\, \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right)\right).
\]
The next identity (equation (12) in [26]) is the distributional equivalent of the continuity equation
\[
\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J} = 0.
\]
This identity can be obtained by combining the definitions (4.1) and (4.2) with the identity (4.15). First, notice that
\[
\partial_t \rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \, \partial_t S\left(\mathbf{x} - \mathbf{x}_p(t)\right).
\]
Using the identity (4.15) along with the property (4.12), we can express the shape function as a convolution with a delta function, which yields
\[
\partial_t \rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \, \partial_t \left( S(\mathbf{x}) * \delta\left(\mathbf{x} - \mathbf{x}_p\right) \right)
= \sum_{p=1}^{N_p} q_p \, S(\mathbf{x}) * \partial_t \delta\left(\mathbf{x} - \mathbf{x}_p\right)
= \sum_{p=1}^{N_p} q_p \, S(\mathbf{x}) * \left( -\nabla \cdot \left( \mathbf{v}_p(t)\, \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) \right) \right).
\]
Next, observe that derivatives of convolutions commute when the integrand and its derivatives decay rapidly away from zero. The shape function $S$ is compactly supported so that these conditions will be easily satisfied. This permits one to write
\[
\partial_t \rho(\mathbf{x},t) = \sum_{p=1}^{N_p} q_p \, S(\mathbf{x}) * \left( -\nabla \cdot \left( \mathbf{v}_p(t)\, \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right) \right) \right)
= -\nabla \cdot \sum_{p=1}^{N_p} q_p \mathbf{v}_p(t)\, S(\mathbf{x}) * \delta\left(\mathbf{x} - \mathbf{x}_p(t)\right)
= -\nabla \cdot \sum_{p=1}^{N_p} q_p \mathbf{v}_p(t)\, S\left(\mathbf{x} - \mathbf{x}_p(t)\right)
= -\nabla \cdot \mathbf{J},
\]
which shows how the continuity equation can be obtained using distributions. The last identity (equation (13) in [26]), which builds on these intermediate results, describes how the charge density can be calculated on the mesh so that it satisfies the continuity equation. To start, they integrate the continuity equation from $0$ to $t$, which yields
\[
\rho(\mathbf{x},t) = \rho(\mathbf{x},0) - \int_0^t \nabla \cdot \mathbf{J}(\mathbf{x},\tau)\, d\tau.
\]
Using the definition (4.2), this is equivalent to writing
\[
\rho(\mathbf{x},t) = \rho(\mathbf{x},0) - \sum_{p=1}^{N_p} q_p \int_0^t \nabla \cdot \left( \mathbf{v}_p(\tau)\, S\left(\mathbf{x} - \mathbf{x}_p(\tau)\right) \right) d\tau
= \rho(\mathbf{x},0) - \sum_{p=1}^{N_p} q_p \int_0^t \nabla \cdot \left( \mathbf{v}_p(\tau)\, S(\mathbf{x}) * \delta\left(\mathbf{x} - \mathbf{x}_p(\tau)\right) \right) d\tau
= \rho(\mathbf{x},0) - \sum_{p=1}^{N_p} q_p \int_0^t \nabla \cdot \left( \mathbf{v}_p(\tau)\, S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d^3\mathbf{k} \right) d\tau.
\]
For convenience, we shall express the integrals in the last line in their component-wise form, which is given by the expression
\[
\partial_j \int_0^t \left( v_p^{(j)}(\tau)\, S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d^3\mathbf{k} \right) d\tau.
\]
Note that we adopt the usual summation convention in which repeated indices are summed, and we have used $\partial_j = \frac{\partial}{\partial x^{(j)}}$ for brevity. It is (hopefully) clear from the notation that these spatial derivatives act only on the mesh data rather than the particles. Next, we use the definition (4.10) and rearrange the integrals to obtain
\[
\partial_j \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} \int_0^t v_p^{(j)}(\tau)\, e^{i\mathbf{k}\cdot\left(\mathbf{x}-\mathbf{x}_p(0)-\int_0^\tau \mathbf{v}_p(s)\, ds\right)}\, d\tau\, d^3\mathbf{k} \right).
\]
This expression can be further simplified by using the fact that $\frac{d\mathbf{x}_p}{dt} = \mathbf{v}_p$, from which we deduce that
\[
\partial_j \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} \int_0^t v_p^{(j)}(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau\, d^3\mathbf{k} \right). \tag{4.16}
\]
The term above contains three separate contributions to the mesh corresponding to each index value $j$. We shall provide details for $j = 1$, and state the results for the remaining cases $j = 2, 3$. Focusing on the inner integral, we see that we need to evaluate the time integral
\[
\int_0^t v_p^{(1)}(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau.
\]
Next, we use the definition (4.10) to see that $v_p^{(1)}\, d\tau$ is the total change in the position over a time frame $d\tau$; in other words, $v_p^{(1)}\, d\tau = dx_p^{(1)}$. Then the time integral can be converted into a path integral that connects $x_p^{(1)}(0)$ and $x_p^{(1)}(t)$:
\[
\int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} e^{i\left(k^{(1)}\left(x^{(1)} - x_p^{(1)}\right) + k^{(2)}\left(x^{(2)} - x_p^{(2)}\right) + k^{(3)}\left(x^{(3)} - x_p^{(3)}\right)\right)}\, dx_p^{(1)}.
\]
Here, the variables for the remaining particle coordinates are transformed to $x_p^{(2)}$ and $x_p^{(3)}$, and the differential element runs over the first component of the particle position vector, leaving these remaining variables unaffected. Inserting this result into (4.16) (with $j = 1$) and interchanging the integrals, we obtain
\[
\partial_1 \left( S(\mathbf{x}) * \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty} e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p)}\, d^3\mathbf{k}\; dx_p^{(1)} \right),
\]
which, by (4.11), is equivalent to writing
\[
\partial_1 \left( S(\mathbf{x}) * \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} \delta\left(\mathbf{x} - \mathbf{x}_p\right) dx_p^{(1)} \right) = \partial_1 \left( \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} S\left(\mathbf{x} - \mathbf{x}_p\right) dx_p^{(1)} \right). \tag{4.17}
\]
The remaining indices ($j = 2, 3$) in (4.16) yield
\[
\partial_2 \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty}\int_0^t v_p^{(2)}(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau\, d^3\mathbf{k} \right) = \partial_2 \left( \int_{x_p^{(2)}(0)}^{x_p^{(2)}(t)} S\left(\mathbf{x} - \mathbf{x}_p\right) dx_p^{(2)} \right), \tag{4.18}
\]
\[
\partial_3 \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty}\int_0^t v_p^{(3)}(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau\, d^3\mathbf{k} \right) = \partial_3 \left( \int_{x_p^{(3)}(0)}^{x_p^{(3)}(t)} S\left(\mathbf{x} - \mathbf{x}_p\right) dx_p^{(3)} \right), \tag{4.19}
\]
which can be more succinctly expressed in vector notation as
\[
\nabla \cdot \left( S(\mathbf{x}) * \frac{1}{(2\pi)^3}\int_{-\infty}^{\infty}\int_0^t \mathbf{v}_p(\tau)\, e^{i\mathbf{k}\cdot(\mathbf{x}-\mathbf{x}_p(\tau))}\, d\tau\, d^3\mathbf{k} \right) = \nabla \cdot \left( \int_{\mathbf{x}_p(0)}^{\mathbf{x}_p(t)} S\left(\mathbf{x} - \mathbf{x}_p\right) d\mathbf{x}_p \right), \tag{4.20}
\]
with the vector integration bounds being interpreted in an entry-wise fashion. Once the shape function $S(\mathbf{x} - \cdot)$ is selected, one can analytically compute the derivatives of the shape functions in (4.20) via the equations (4.17)-(4.19). Actually, it is not necessary to introduce delta distributions to obtain the mapping (4.20). In [26], a convolution between the shape functions and delta distributions was performed, so that derivatives could be transferred directly onto the delta distributions.
These derivatives are to be interpreted in the sense of distributions. A far simpler approach can be obtained by starting from the continuity equation with the definition (4.2):
\[
\rho(x,t) = \rho(x,0) - \sum_{p=1}^{N_p} q_p\, \nabla \cdot \int_0^t v_p(\tau)\, S\big(x - x_p(\tau)\big)\, d\tau.
\]
Using the fact that $\frac{d x_p}{dt} = v_p$, we can immediately see that $dx_p = v_p\, d\tau$, hence
\[
\nabla \cdot \left( \int_0^t v_p(\tau)\, S\big(x - x_p(\tau)\big)\, d\tau \right) = \nabla \cdot \left( \int_{x_p(0)}^{x_p(t)} S\big(x - x_p\big)\, dx_p \right).
\]
The charge mapping involves derivatives of the shape function $S$ according to equations (4.17)-(4.19). The evaluation of these terms can be illustrated by considering a single component, such as the first one, i.e.,
\[
\frac{\partial}{\partial x^{(1)}} \left( \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} S\big(x - x_p\big)\, dx_p^{(1)} \right).
\]
Using the change of variables $z = x - x_p$, this integral can be expressed as
\[
\frac{\partial}{\partial x^{(1)}} \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} S\big(x - x_p\big)\, dx_p^{(1)} = -\frac{\partial}{\partial x^{(1)}} \int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}, z^{(2)}, z^{(3)}\big)\, dz^{(1)}.
\]
If we write the multivariate spline as a tensor product of univariate shape functions, then this is equivalent to writing
\[
\frac{\partial}{\partial x^{(1)}} \int_{x_p^{(1)}(0)}^{x_p^{(1)}(t)} S\big(x - x_p\big)\, dx_p^{(1)} = -\frac{\partial}{\partial x^{(1)}} \int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big) S\big(z^{(2)}\big) S\big(z^{(3)}\big)\, dz^{(1)}.
\]
Next, we define the anti-derivative of the univariate spline as
\[
I(x) := \int_0^x S(\xi)\, d\xi =
\begin{cases}
x + \dfrac{x^2}{2\Delta x}, & -\Delta x \le x \le 0, \\[1ex]
x - \dfrac{x^2}{2\Delta x}, & 0 < x \le \Delta x, \\[1ex]
0, & |x| > \Delta x.
\end{cases}
\]
Let us temporarily ignore the functions of $z^{(2)}$ and $z^{(3)}$, as the integral is taken only in $z^{(1)}$. At the end, we will reintroduce these shape functions with the time level of the coordinates in $z^{(2)}$ and $z^{(3)}$ being selected to match those of $z^{(1)}$. Using the anti-derivative, the integral reduces to
\[
\begin{aligned}
\int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big)\, dz^{(1)} &= \int_0^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big)\, dz^{(1)} - \int_0^{x^{(1)} - x_p^{(1)}(0)} S\big(z^{(1)}\big)\, dz^{(1)}, \\
&= I\big(x^{(1)} - x_p^{(1)}(t)\big) - I\big(x^{(1)} - x_p^{(1)}(0)\big).
\end{aligned}
\]
Taking the derivative with respect to $x^{(1)}$, we obtain
\[
\begin{aligned}
\frac{\partial}{\partial x^{(1)}} \int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big)\, dz^{(1)} &= \frac{\partial}{\partial x^{(1)}} I\big(x^{(1)} - x_p^{(1)}(t)\big) - \frac{\partial}{\partial x^{(1)}} I\big(x^{(1)} - x_p^{(1)}(0)\big), \\
&= S\big(x^{(1)} - x_p^{(1)}(t)\big) - S\big(x^{(1)} - x_p^{(1)}(0)\big).
\end{aligned}
\]
Re-introducing the functions of $x_p^{(2)}$ and $x_p^{(3)}$, and collecting the results, we get
\[
-\frac{\partial}{\partial x^{(1)}} \int_{x^{(1)} - x_p^{(1)}(0)}^{x^{(1)} - x_p^{(1)}(t)} S\big(z^{(1)}\big) S\big(z^{(2)}\big) S\big(z^{(3)}\big)\, dz^{(1)} = -S\big(x - x_p\big) \Big|_{x_p = x_p(0)}^{x_p = x_p(t)}. \tag{4.21}
\]
Again, the time levels for $x_p^{(2)}$ and $x_p^{(3)}$ in (4.21) should be the same as the one used for $x_p^{(1)}$ for consistency. It is interesting to note that when this process is repeated for the remaining derivatives, the result is identical to (4.21). The final form of the map is given by
\[
\rho(x,t) = \rho(x,0) - \sum_{p=1}^{N_p} q_p\, \nabla \cdot \int_{x_p(0)}^{x_p(t)} S\big(x - x_p\big)\, dx_p, \tag{4.22}
\]
where the divergence of the integral can be evaluated through repeated use of (4.21). While the map defined by (4.21) and (4.22) is conservative in the sense that the total charge in the domain is preserved in time, it introduces an image charge on the mesh that corresponds to the starting position along a particle path. This is problematic for the expanding beam problem considered in section 4.6, which injects an electron beam into a metal cavity: the image charge induces a large potential at the injection site that reverses the trajectory of the beam. The appearance of an image charge can be illustrated with a simple example that considers a single test particle whose macro-particle charge is $q = 1$. Its charge at any time level can be calculated with the mapping presented above, which, in 2-D, for a single particle, is given by
\[
\rho^{n+1} = \rho^n + 2q \left( S\big(x - x_p^{n+1}\big) - S\big(x - x_p^n\big) \right).
\]
In this example, we suppose that at time $t = 0$ the particle is in the center of the grid cell $(i,j)$ and that at time $t = \Delta t$ the particle has reached the cell boundary.
We use a matrix convention so that the grid points in question align with entries of the matrix. The linear spline shape functions at these grid points are found to be
\[
S\big(x - x_p^n\big) =
\begin{pmatrix}
\tfrac{1}{4} & \tfrac{1}{4} & 0 \\
\tfrac{1}{4} & \tfrac{1}{4} & 0
\end{pmatrix},
\qquad
S\big(x - x_p^{n+1}\big) =
\begin{pmatrix}
0 & \tfrac{1}{2} & 0 \\
0 & \tfrac{1}{2} & 0
\end{pmatrix}.
\]
Plugging these directly into the charge mapping, we find that
\[
\rho^{n+1} =
\begin{pmatrix}
\tfrac{1}{4} & \tfrac{1}{4} & 0 \\
\tfrac{1}{4} & \tfrac{1}{4} & 0
\end{pmatrix}
+ 2
\begin{pmatrix}
0 & \tfrac{1}{2} & 0 \\
0 & \tfrac{1}{2} & 0
\end{pmatrix}
- 2
\begin{pmatrix}
\tfrac{1}{4} & \tfrac{1}{4} & 0 \\
\tfrac{1}{4} & \tfrac{1}{4} & 0
\end{pmatrix}
=
\begin{pmatrix}
-\tfrac{1}{4} & \tfrac{3}{4} & 0 \\
-\tfrac{1}{4} & \tfrac{3}{4} & 0
\end{pmatrix}.
\]
From this, it is clear that the total charge is conserved in the sense that
\[
\sum_{i,j} \rho_{ij}^{n+1} = \sum_{i,j} \rho_{ij}^n,
\]
but the method introduces a non-physical image charge corresponding to the previous location of the particle. If we remove the factor of 2 that appears in the spline mapping, so that the charge map now reads
\[
\rho^{n+1} = \rho^n + q \left( S\big(x - x_p^{n+1}\big) - S\big(x - x_p^n\big) \right),
\]
then repeating the process yields
\[
\rho^{n+1} =
\begin{pmatrix}
0 & \tfrac{1}{2} & 0 \\
0 & \tfrac{1}{2} & 0
\end{pmatrix},
\]
which removes the oppositely signed image charge; however, this modification is equivalent to the usual spline mapping for charge. For this reason, we shall not consider this map in our numerical experiments.

4.4 Conventional Methods for Pushing Particles

In this section, we discuss standard methods used in the scientific computing community for pushing particles. These developments are important to the core task of this work because they not only provide a basis for comparison with other methods, but also describe how potential users may employ our field solvers in their own codes. Beginning with the well-known leapfrog method, we discuss how the proposed field solvers can be leveraged to develop solvers for electrostatic problems. We then address the electromagnetic case, which begins with a description of the Boris rotation algorithm.
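As a quick numerical check, the single test-particle example above can be reproduced on a small mesh. The following is a minimal sketch (the variable names are our own; a tensor product of linear hat splines with unit grid spacing is assumed):

```python
import numpy as np

# Nodes of a small 2-D mesh with unit spacing (dx = dy = 1).
nodes_x = np.array([0.0, 1.0, 2.0])
nodes_y = np.array([0.0, 1.0])

def spline_weights(xp, yp):
    """Tensor product of univariate linear (hat) splines, S(x - x_p)."""
    wx = np.maximum(0.0, 1.0 - np.abs(nodes_x - xp))
    wy = np.maximum(0.0, 1.0 - np.abs(nodes_y - yp))
    return np.outer(wx, wy)

q = 1.0
S_old = spline_weights(0.5, 0.5)  # particle at the center of a cell
S_new = spline_weights(1.0, 0.5)  # particle moved to the cell boundary

rho_old = q * S_old
# Conservative map: the factor of 2 comes from the repeated use of (4.21)
# in the 2-D divergence.
rho_new = rho_old + 2.0 * q * (S_new - S_old)

print(rho_new.T)  # rows contain the entries -1/4, 3/4, 0 from the text
```

Summing the entries of `rho_new` and `rho_old` confirms that the total charge is conserved, while the negative entries reproduce the image charge discussed above.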
4.4.1 Leapfrog Time Integration

Leapfrog time integration is a well-known technique for evolving particles due to its simplicity, long-time accuracy, and symplectic nature. While a comprehensive treatment of this integrator can be found in any of the classic texts on particle methods, including Birdsall and Langdon [2] and Hockney and Eastwood [3], we provide a few details here so that we can discuss the coupling of the method to the collection of solvers introduced in chapter 2. To this end, we consider Newton's second law of motion for a single particle, which can be written in the first-order form
\[
\dot{x} = v, \tag{4.23}
\]
\[
\dot{v} = \frac{1}{m} F(x, t), \tag{4.24}
\]
where $m$ is the mass of the particle and $x(t)$ and $v(t)$ denote, respectively, the position and velocity of the particle at time $t$. Further, $F$ is the force that is used to accelerate the particle, which depends only on the position data of the particle. The leapfrog method can be derived by integrating the position equation (4.23) from $t^n$ to $t^{n+1}$ and the velocity equation (4.24) from $t^{n-1/2}$ to $t^{n+1/2}$:
\[
x(t^{n+1}) = x(t^n) + \int_{t^n}^{t^{n+1}} v(\tau)\, d\tau,
\]
\[
v(t^{n+1/2}) = v(t^{n-1/2}) + \frac{1}{m} \int_{t^{n-1/2}}^{t^{n+1/2}} F(x(\tau), \tau)\, d\tau.
\]
Then, the integrals are approximated with a second-order accurate midpoint rule to obtain the fully discrete update
\[
v(t^{n+1/2}) = v(t^{n-1/2}) + \frac{\Delta t}{m} F(x(t^n), t^n), \tag{4.25}
\]
\[
x(t^{n+1}) = x(t^n) + \Delta t\, v(t^{n+1/2}). \tag{4.26}
\]
The $\frac{\Delta t}{2}$ offset in time between the position and velocity in this method gives the scheme its name. In practice, an initial offset in the velocity can be achieved by stepping the velocity backwards in time by $-\frac{\Delta t}{2}$ using an explicit Euler method, so that
\[
v(t^{-1/2}) = v(t^0) - \frac{\Delta t}{2m} F(x(t^0), t^0).
\]
Since the local truncation error for the explicit Euler method is second-order in time, this step will not degrade the rate of convergence. Once the initialization is complete, a given time step consists of the following ingredients:

1. Compute the force $F(x(t^n), t^n)$ at time $t = t^n$.

2.
Update the velocity by $\Delta t$ using equation (4.25).

3. Update the position by $\Delta t$ using equation (4.26).

For electrostatic problems, in the absence of external forces, the force term represents the acceleration caused by an electric field, i.e.,
\[
F(x(t^n), t^n) = q E(x(t^n), t^n) \equiv -q \nabla \psi(x(t^n), t^n),
\]
where $q$ is the charge of the particle, $E$ is the electric field, and $\psi$ is a scalar potential. Depending on the context, the scalar potential may be provided as part of the problem or require its own solve. In the electrostatic applications considered in this thesis, the scalar potential $\psi$ is the solution to either a Poisson equation
\[
-\Delta \psi = \frac{1}{\sigma_1} \rho, \tag{4.27}
\]
or the two-way wave equation
\[
\frac{1}{\kappa^2} \partial_{tt} \psi - \Delta \psi = \frac{1}{\sigma_1} \rho, \tag{4.28}
\]
with $\kappa$ being the speed at which the wave propagates and $\rho$ being the charge density deposited by particles on a mesh. In the case of the Poisson equation (4.27), the electric field can be computed using elliptic solvers, and the leapfrog time stepping procedure is unchanged. When the potential is described by a wave equation (4.28), some minor modifications are required depending on whether the source $\rho$ is to be treated explicitly or implicitly by our methods. In the latter case, both $\psi$ and its derivatives $\nabla \psi$ can be computed once the position update (4.26) is complete. Similarly, an advance for $\psi$ with an explicit source (e.g., the time-centered or central schemes) would take place prior to the position update (4.26). Then, once the positions have been updated, the derivatives can be computed using the now implicit sources. In the next section, we describe the Boris push, which is an extension of the leapfrog time integration scheme to include contributions from more general electromagnetic fields.

4.4.2 The Boris Push

The Boris push, introduced in a 1970 paper by Jay Boris [4], is an extension of the leapfrog method discussed in section 4.4.1 to address rotations supported by the more general Lorentz force
\[
F = q \left( E + v \times B \right).
\]
This technique is the standard method for pushing particles in electromagnetic fields and is widely adopted for its simplicity, speed, and long-time accuracy [2, 3]. While the method itself is not symplectic, its success has largely been attributed to its volume-preserving property [78]. Moreover, despite its lack of symplecticity, the fluctuations in the energy introduced by the method are generally known to remain small, even for problems that require integration over large time intervals. The setup for the Boris method begins with a discretization in time of the equations of motion for the particles that results in a leapfrog structure identical to (4.25) and (4.26). A key difference is that the velocity update equation for the particles involves the velocity itself, which is time averaged to obtain the implicit equation
\[
v^{n+1/2} = v^{n-1/2} + \frac{q \Delta t}{m} \left( E + \frac{v^{n+1/2} + v^{n-1/2}}{2} \times B \right).
\]
While it is possible to rearrange this equation and analytically solve for $v^{n+1/2}$, Boris realized that the electric and magnetic field contributions could be separated with a simple advance obtained through the use of rotations. Following the notation of [79], the Boris rotation method for the velocity can be performed with the following steps:

1. Compute $v^- = v^{n-1/2} + \frac{q \Delta t}{2m} E$.

2. Compute $v' = v^- + v^- \times t$, where $t = \frac{q \Delta t}{2m} B$.

3. Compute $v^+ = v^- + v' \times s$, where $s = \dfrac{2t}{1 + |t|^2}$.

4. Lastly, the velocity update is given as $v^{n+1/2} = v^+ + \frac{q \Delta t}{2m} E$.

The position data can then be updated using the newly acquired velocity $v^{n+1/2}$ following the usual leapfrog update (4.26). We now describe an approach that couples our field solvers to the Boris method. The proposed approach treats the fields at integer time levels, with the particle data being stored in the usual leapfrog format. First, the initialization begins with $x^0$ and $v^0$, along with the fields $E^0$ and $B^0$ (obtained from $\psi^0$ and $A^0$). To create the velocity staggering in time, we use the initial fields to push the velocities back by $\Delta t/2$ using the Boris rotation method, which is second-order and results in $v^{-1/2}$. Then, assuming we have the data $\left( E^n, B^n, x^n, v^{n-1/2} \right)$, the procedure for evolving the system from $t^n$ to $t^{n+1}$ consists of the following steps:

1. Apply the Boris rotation: $\left( E^n, B^n, v^{n-1/2} \right) \mapsto v^{n+1/2}$.

2. Advance the positions: $\left( x^n, v^{n+1/2} \right) \mapsto x^{n+1}$.

3. Time average the velocities $\left( v^{n-1/2}, v^{n+1/2} \right) \mapsto v^n$ and compute the current density $J^n$.

4. Advance $A$ explicitly with the central scheme: $\left( A^n, J^n \right) \mapsto A^{n+1}$.

5. Compute the charge density $\rho^{n+1}$.

6. Advance $\psi$ and its derivatives: $\left( \psi^n, \rho^{n+1} \right) \mapsto \left( \psi^{n+1}, \nabla \psi^{n+1} \right)$.

7. Compute the electric field $E^{n+1} := -\nabla \psi^{n+1} - \partial_t A^{n+1}$ using data from steps 4 and 6.

8. Iterate on the magnetic field data:

   a) Compute derivatives of $A$ with the BDF update: $\left( A^n, J^{n+1,[k]} \right) \mapsto \nabla A^{n+1,[k]}$.

   b) Approximate the magnetic field $B^{n+1,[k]} := \nabla \times A^{n+1,[k]}$ with $\nabla A^{n+1,[k]}$.

   c) Advance the particle velocities: $\left( E^{n+1}, B^{n+1,[k]}, v^{n+1/2} \right) \mapsto v^{n+3/2}$.

   d) Time average the velocities $\left( v^{n+1/2}, v^{n+3/2} \right) \mapsto v^{n+1}$ and compute $J^{n+1,[k+1]}$.

   e) Repeat for some prescribed number of iterations or until convergence.

9. Prepare for the next time step: $\left( E^{n+1}, B^{n+1}, x^{n+1}, v^{n+1/2} \right) \mapsto \left( E^n, B^n, x^n, v^{n-1/2} \right)$.

The fields $(\psi, A)$, the spatial derivatives $(\nabla \psi, \nabla A)$, and $\partial_t A$ all live at the integer time levels in this approach. The reason is that the particle velocity evolution step over $[t^{n-1/2}, t^{n+1/2}]$ requires the field data at the mid-point $t^n$. The particle position data is at the integer levels. We generally use variables with square brackets as an upper index to indicate that it is an iteration variable. For instance, in step 8a, we use the source data at the wrong time level, which means that the BDF scheme will be using the incorrect source.
By iterating on the current density $J^{n+1}$, generally with a few iterates, we can obtain a good approximation of this source. To start the iteration in step 8, we can apply a Taylor expansion to the current density so that it is centered about time $t = t^n$. For second-order accuracy in time, we can start the iteration with
\[
J^{n+1,[0]} = J^n + \left( J^n - J^{n-1} \right).
\]
In the next section, we discuss a time integration method for particles that are evolved using non-separable Hamiltonians.

4.5 Time Integration with Non-separable Hamiltonians

In this section, we introduce the time integration methods for the particles in formulations that evolve a certain non-separable Hamiltonian. An outline of the approach is provided in section 4.5.1. Once we have introduced the basic elements of the method, we propose several approaches for incorporating the field evolution into the scheme. We consider two perspectives. First, in section 4.5.1.1, we develop a naive implementation in which the particles "lead" the fields, i.e., the fields are modified through changes in the particles. The second approach, presented in section 4.5.1.2, is a variant of the first approach that allows one to utilize methods in which the sources are explicit.

4.5.1 The Molei Tao Integrator

For equations of motion of the form
\[
\dot{x}_i = \frac{1}{m_i} \left( P_i - q_i A \right) \equiv V(x_i, P_i),
\]
\[
\dot{P}_i = -q_i \nabla \psi + \frac{q_i}{m_i} (\nabla A) \cdot \left( P_i - q_i A \right) \equiv W(x_i, P_i),
\]
the phase space variables are non-separable, which means traditional integrators, such as those in the previous section, cannot be applied to the system. Recently, a paper by Tao [47] introduced methods, inspired by the work [80], to approximate non-separable Hamiltonians $H(x, P)$ using an augmented form
\[
\bar{H}(x, P, y, Q) := H_A + H_B + \omega H_C,
\]
with
\[
H_A := H(x, Q), \qquad H_B := H(y, P), \qquad H_C := \frac{1}{2} \| x - y \|^2 + \frac{1}{2} \| P - Q \|^2.
\]
By duplicating the phase space information, Tao was able to construct a family of explicit symplectic integration schemes of any even order of accuracy that do not require negative time steps. This latter point is significant for plasma problems modeled on bounded domains. In such problems, charged particles may be absorbed into the boundary within a single time step, and it is not clear how one should reverse this process. To build these methods, Tao introduces the following set of flow maps that evolve the system forward in $\Delta t$-time:
\[
\phi_{H_A}^{\Delta t} :
\begin{pmatrix} x \\ P \\ y \\ Q \end{pmatrix}
\mapsto
\begin{pmatrix} x \\ P - \Delta t\, \partial_x H(x, Q) \\ y + \Delta t\, \partial_Q H(x, Q) \\ Q \end{pmatrix},
\qquad
\phi_{H_B}^{\Delta t} :
\begin{pmatrix} x \\ P \\ y \\ Q \end{pmatrix}
\mapsto
\begin{pmatrix} x + \Delta t\, \partial_P H(y, P) \\ P \\ y \\ Q - \Delta t\, \partial_y H(y, P) \end{pmatrix},
\]
\[
\phi_{\omega H_C}^{\Delta t} :
\begin{pmatrix} x \\ P \\ y \\ Q \end{pmatrix}
\mapsto
\frac{1}{2}
\begin{pmatrix}
\begin{pmatrix} x + y \\ P + Q \end{pmatrix} + R(\omega, \Delta t) \begin{pmatrix} x - y \\ P - Q \end{pmatrix} \\[1ex]
\begin{pmatrix} x + y \\ P + Q \end{pmatrix} - R(\omega, \Delta t) \begin{pmatrix} x - y \\ P - Q \end{pmatrix}
\end{pmatrix}.
\]
Here, $R$ is the block rotation matrix
\[
R(\omega, \Delta t) =
\begin{pmatrix}
\cos(2 \omega \Delta t)\, I & \sin(2 \omega \Delta t)\, I \\
-\sin(2 \omega \Delta t)\, I & \cos(2 \omega \Delta t)\, I
\end{pmatrix},
\]
with $I$ being the $2 \times 2$ or $3 \times 3$ identity matrix. Various integrators of any even order of accuracy can be obtained through a composition of these mappings. We refer the interested reader to the paper [47] for details. As an example, the paper provides the following second-order method:
\[
\phi_2^{\Delta t} := \phi_{H_A}^{\Delta t/2} \circ \phi_{H_B}^{\Delta t/2} \circ \phi_{\omega H_C}^{\Delta t} \circ \phi_{H_B}^{\Delta t/2} \circ \phi_{H_A}^{\Delta t/2}.
\]
In this composition, it is important to note that the coupling map $\phi_{\omega H_C}^{\Delta t}$ does not evolve the particles but, instead, mixes the data in phase space. A key element of this method is the use of a binding constant $\omega$ that synchronizes the two sets of phase space variables. Tao establishes an estimate on the accuracy of these methods in the context of long time simulations for integrable systems.
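To make the composition concrete, the three flow maps and the second-order scheme $\phi_2^{\Delta t}$ can be sketched in code. This is a minimal sketch for a generic Hamiltonian supplied through its two partial derivatives; the function names and the test Hamiltonian below are our own, not taken from [47]:

```python
import numpy as np

def phi_HA(x, p, y, q, dt, dHdx, dHdp):
    # Flow of H_A = H(x, Q): (x, Q) are frozen; P and y are updated.
    return x, p - dt * dHdx(x, q), y + dt * dHdp(x, q), q

def phi_HB(x, p, y, q, dt, dHdx, dHdp):
    # Flow of H_B = H(y, P): (y, P) are frozen; x and Q are updated.
    return x + dt * dHdp(y, p), p, y, q - dt * dHdx(y, p)

def phi_C(x, p, y, q, dt, omega):
    # Flow of omega*H_C: rotates the differences by the angle 2*omega*dt,
    # while the sums (x + y, P + Q) are left unchanged.
    c, s = np.cos(2.0 * omega * dt), np.sin(2.0 * omega * dt)
    sx, sp = x + y, p + q
    dx, dp = x - y, p - q
    dx, dp = c * dx + s * dp, -s * dx + c * dp
    return 0.5 * (sx + dx), 0.5 * (sp + dp), 0.5 * (sx - dx), 0.5 * (sp - dp)

def tao_step(x, p, y, q, dt, omega, dHdx, dHdp):
    # Second-order composition: A(dt/2), B(dt/2), C(dt), B(dt/2), A(dt/2).
    z = phi_HA(x, p, y, q, 0.5 * dt, dHdx, dHdp)
    z = phi_HB(*z, 0.5 * dt, dHdx, dHdp)
    z = phi_C(*z, dt, omega)
    z = phi_HB(*z, 0.5 * dt, dHdx, dHdp)
    return phi_HA(*z, 0.5 * dt, dHdx, dHdp)

# Demo on the (separable) harmonic oscillator H = (x^2 + p^2)/2, so that
# dH/dx = x and dH/dp = p; the exact solution from (1, 0) is x(t) = cos(t).
x = y = 1.0
p = q = 0.0
for _ in range(1000):
    x, p, y, q = tao_step(x, p, y, q, 0.01, 10.0,
                          lambda a, b: a, lambda a, b: b)
print(x, np.cos(10.0))  # the two values agree to a few digits
```

Here the copies are initialized identically, $(y, Q) = (x, P)$, and the coupling map only mixes the small differences that the splitting introduces between them.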
For a method with order $\ell$, time step size $\Delta t$, coupling parameter $\omega$, and simulation time $T$, he shows that the error is of the form
\[
\mathcal{O}\left( T \Delta t^{\ell} \omega \right),
\]
as long as $T = \mathcal{O}\left( \min\left( \Delta t^{-\ell} \omega^{-\ell}, \omega^{1/2} \right) \right)$. Based on this bound, he recommends that $\Delta t \ll \omega^{-1/\ell}$, and notes that it is both more accurate and more efficient to increase the order $\ell$ under a fixed $\omega$. The test problems shown in Tao's paper evolve particles in fixed fields, which can be either static or non-static, but are known functions. In our applications, the electric and magnetic fields respond to changes in the plasma, which is represented using particles. Therefore, coupling these particle integration schemes to our methods for evolving fields is a non-trivial task, and we describe the necessary modifications in the sections that follow.

4.5.1.1 Approach for Implicit Sources: Particles Lead the Fields

The wave equations for the potentials $\psi$ and $A$ are non-linear due to the source terms that couple with the particle data. We break this coupling through the use of duplicate fields, analogous to Tao's approach for the particles. To this end, we let $\left( \psi_{xq}^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$ and $\left( \psi_{yp}^n, \nabla \psi_{yp}^n, A_{yp}^n, \nabla A_{yp}^n \right)$ denote two pairs of field data at time level $t^n$. Moreover, each pair of field data is associated with a set of particle data indicated by the subscripts. Assuming that we have the data $(x^n, Q^n)$, $(y^n, P^n)$, $\left( \psi_{xq}^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$, and $\left( \psi_{yp}^n, \nabla \psi_{yp}^n, A_{yp}^n, \nabla A_{yp}^n \right)$, the second-order update $\phi_2^{\Delta t}$ can be modified to include updates to fields as follows:

1. Push particles: $(y^n, P^n) \mapsto \left( y^{n+1/2}, P^{n+1/2} \right)$ using $\left( x^n, Q^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$.

2. Evolve the fields $\left( \psi_{yp}^n, \nabla \psi_{yp}^n, A_{yp}^n, \nabla A_{yp}^n \right) \mapsto \left( \psi_{yp}^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$ using the particle data $\left( y^{n+1/2}, P^{n+1/2} \right)$.

3. Push particles: $(x^n, Q^n) \mapsto \left( x^{n+1/2}, Q^{n+1/2} \right)$ using $\left( y^{n+1/2}, P^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$.

4.
Coupling step: $\left( x^{n+1/2}, P^{n+1/2}, y^{n+1/2}, Q^{n+1/2} \right) \mapsto \left( x^*, P^*, y^*, Q^* \right)$.

5. Recompute the field data $\left( \psi_{yp}^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$ using the particle data $(y^*, P^*)$.

6. Evolve the fields $\left( \psi_{xq}^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right) \mapsto \left( \psi_{xq}^{n+1/2}, \nabla \psi_{xq}^{n+1/2}, A_{xq}^{n+1/2}, \nabla A_{xq}^{n+1/2} \right)$ using the particle data $(x^*, Q^*)$.

7. Push particles: $(x^*, Q^*) \mapsto \left( x^{n+1}, Q^{n+1} \right)$ using $\left( y^*, P^*, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$.

8. Evolve the fields $\left( \psi_{xq}^{n+1/2}, \nabla \psi_{xq}^{n+1/2}, A_{xq}^{n+1/2}, \nabla A_{xq}^{n+1/2} \right) \mapsto \left( \psi_{xq}^{n+1}, \nabla \psi_{xq}^{n+1}, A_{xq}^{n+1}, \nabla A_{xq}^{n+1} \right)$ using the particle data $\left( x^{n+1}, Q^{n+1} \right)$.

9. Push particles: $(y^*, P^*) \mapsto \left( y^{n+1}, P^{n+1} \right)$ using $\left( x^{n+1}, Q^{n+1}, \nabla \psi_{xq}^{n+1}, A_{xq}^{n+1}, \nabla A_{xq}^{n+1} \right)$.

10. Evolve the fields $\left( \psi_{yp}^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right) \mapsto \left( \psi_{yp}^{n+1}, \nabla \psi_{yp}^{n+1}, A_{yp}^{n+1}, \nabla A_{yp}^{n+1} \right)$ using the particle data $\left( y^{n+1}, P^{n+1} \right)$.

There are several essential details embedded in the steps shown above which require further explanation. While it may not be apparent from the notation, the field updates involve additional time history, not just the most recent one. Quantities labeled with "$*$" live at time level $t^{n+1/2}$, but we use this notation to distinguish the data from $t^{n+1/2}$ pre/post-mixing when clarification is needed. The particle updates on pairs of coordinates may appear strange because of their implicit-like form, which merely reflects the coupling of phase space. As an example, in order to perform the particle push in step 3, we need to use the fields associated with the particle data $\left( y^{n+1/2}, P^{n+1/2} \right)$. This is reflected in step 2. Next, it would seem that we are forgetting to evolve fields between steps 3 and 4; however, the mixing step will modify the particle data regardless of the fields, so there is no need to perform this evolution. Instead, the fields are recomputed in step 5 because of the mixing that occurs in step 4.

Remark 4.5.1. This algorithm is quite costly both in terms of memory and computation.
It requires duplicates of both particle and field information, which can be problematic when large numbers of particles and fine meshes are required. Computationally, a single time step of this second-order scheme has a total of 10 steps in which fields and derivatives are updated. The total number of calls made to the wave solver methods depends entirely on the dimensionality of the problem, which will change the entries of the gradient vector, and the number of components retained for the vector potentials (if any).

4.5.1.2 Approaches for a Mixed Advance: Dealing with Explicit and Implicit Source Terms

This section presents a modification of the approach discussed in section 4.5.1.1 to develop a mixed approach in which the fields can be advanced using an explicit form of the source terms, while their corresponding derivatives use sources in an implicit form. Assuming that we have the data $(x^n, Q^n)$, $(y^n, P^n)$, $\left( \psi_{xq}^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$, and $\left( \psi_{yp}^n, \nabla \psi_{yp}^n, A_{yp}^n, \nabla A_{yp}^n \right)$, the second-order update $\phi_2^{\Delta t}$ can be modified to include updates to fields as follows:

1. Evolve the fields $\left( \psi_{yp}^n, A_{yp}^n \right) \mapsto \left( \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2} \right)$ using the particle data $(y^n, P^n)$.

2. Push particles: $(y^n, P^n) \mapsto \left( y^{n+1/2}, P^{n+1/2} \right)$ using $\left( x^n, Q^n, \nabla \psi_{xq}^n, A_{xq}^n, \nabla A_{xq}^n \right)$.

3. Compute the derivatives $\left( \nabla \psi_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$ using the particle data $\left( y^{n+1/2}, P^{n+1/2} \right)$.

4. Evolve the fields $\left( \psi_{xq}^n, A_{xq}^n \right) \mapsto \left( \psi_{xq}^{n+1/2}, A_{xq}^{n+1/2} \right)$ using the particle data $(x^n, Q^n)$.

5. Push particles: $(x^n, Q^n) \mapsto \left( x^{n+1/2}, Q^{n+1/2} \right)$ using $\left( y^{n+1/2}, P^{n+1/2}, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$.

6. Coupling step: $\left( x^{n+1/2}, P^{n+1/2}, y^{n+1/2}, Q^{n+1/2} \right) \mapsto \left( x^*, P^*, y^*, Q^* \right)$.

7. Recompute the derivatives $\left( \nabla \psi_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$ using the particle data $(y^*, P^*)$.

8. Compute the derivatives $\left( \nabla \psi_{xq}^{n+1/2}, \nabla A_{xq}^{n+1/2} \right)$ using the particle data $(x^*, Q^*)$.

9. Evolve the fields $\left( \psi_{xq}^{n+1/2}, A_{xq}^{n+1/2} \right) \mapsto \left( \psi_{xq}^{n+1}, A_{xq}^{n+1} \right)$ using the particle data $(x^*, Q^*)$.

10. Push particles: $(x^*, Q^*) \mapsto \left( x^{n+1}, Q^{n+1} \right)$ using $\left( y^*, P^*, \nabla \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2}, \nabla A_{yp}^{n+1/2} \right)$.

11. Compute the derivatives $\left( \nabla \psi_{xq}^{n+1}, \nabla A_{xq}^{n+1} \right)$ using the particle data $\left( x^{n+1}, Q^{n+1} \right)$.

12. Evolve the fields $\left( \psi_{yp}^{n+1/2}, A_{yp}^{n+1/2} \right) \mapsto \left( \psi_{yp}^{n+1}, A_{yp}^{n+1} \right)$ using the particle data $(y^*, P^*)$.

13. Push particles: $(y^*, P^*) \mapsto \left( y^{n+1}, P^{n+1} \right)$ using $\left( x^{n+1}, Q^{n+1}, \nabla \psi_{xq}^{n+1}, A_{xq}^{n+1}, \nabla A_{xq}^{n+1} \right)$.

14. Compute the derivatives $\left( \nabla \psi_{yp}^{n+1}, \nabla A_{yp}^{n+1} \right)$ using the particle data $\left( y^{n+1}, P^{n+1} \right)$.

Remark 4.5.2. Consideration of a mixed approach of this form, with regard to the field solvers developed in chapter 2, is useful in preventing excessive dissipation. More specifically, the time-centered field solvers, which use an explicit source, are known to be purely dispersive [42]. Using the time-centered update for the fields, which requires an explicit source, in conjunction with duplicate field and particle data may create a mismatch in time levels between the fields and the particles. This motivates us to consider a variation on the first approach where particles lead the fields, but the field advances apply the time-centered method in which the source is made implicit by averaging. For example, the charge density at time level $t^n$ can be approximated to second-order accuracy in time by
\[
\rho^n \approx \frac{1}{2} \left( \rho^{n+1} + \rho^{n-1} \right).
\]

4.5.2 The Asymmetrical Euler Method

We recently became aware of an alternative particle push, suitable for non-separable Hamiltonian systems, which was proposed in [36]. For context, this paper considered mesh-free methods for solving the Darwin limit of the VM system using the Coulomb gauge. Their adoption of a generalized Hamiltonian model for particles was largely motivated by the numerical instabilities associated with time derivatives of the vector potential in this particular limit. The resulting model, which is essentially identical to the formulation (1.39)-(1.40) used in this work, trades additional coupling of phase space for numerical stability.
They proposed a semi-implicit method, dubbed the asymmetrical Euler method (AEM), which has the form
\[
x_i^{n+1} = x_i^n + v_i^n \Delta t, \tag{4.29}
\]
\[
P_i^{n+1} = P_i^n + q_i \left( -\nabla \psi^{n+1} + \left( \nabla A^{n+1} \right) \cdot v_i^n \right) \Delta t, \tag{4.30}
\]
\[
v_i^n \equiv \frac{1}{m_i} \left( P_i^n - q_i A^n \right). \tag{4.31}
\]
This method, which is first-order in time, proceeds by first performing an explicit update of the particle positions using (4.29). Next, with the new positions, the fields are updated, and finally, the generalized momentum update (4.30) is performed. While it may appear that iteration is required to compute gradients of the vector potential $\nabla A^{n+1}$, since $A^{n+1}$ requires $v_i^{n+1}$ through the current, the authors use $v_i^n$, which results in a fully explicit update. It is recommended to avoid iteration, especially in the case that the vector potential is strong, to prevent the formation of numerical instabilities. Further, this lagging of the velocity is consistent with a first-order method. At the end of the particle update, the velocity is modified according to (4.31) for use in the next time step. In contrast to the method of Molei Tao, discussed in section 4.5.1, this method requires significantly less overall storage and fewer overall field solves. Despite the fact that this method is first-order, experiments performed in [36] demonstrated good energy conservation and good accuracy, in addition to computational efficiency. Additionally, the lagging of the velocity in the generalized momentum update (4.30) means that this method avoids the pitfall of circular logic that occurs in the Molei Tao method when the current needs to be mapped to the mesh. We apply this integrator to the expanding beam problems in section 4.6.

4.6 Numerical Examples

In this section, we apply the proposed PIC methods to several well-known benchmark problems in the literature. First, we test the particle methods and our formulation in the case of a single particle moving through known fields.
Then, we apply the methods to more interesting problems involving dynamic fields, which respond to the motion of the particles. We use the physical constants listed in Table 4.1 in the numerical experiments presented in this work.

Parameter | Value
Ion mass ($m_i$) [kg] | $9.108379025973462 \times 10^{-29}$
Electron mass ($m_e$) [kg] | $9.108379025973462 \times 10^{-31}$
Boltzmann constant ($k_B$) [kg m$^2$ s$^{-2}$ K$^{-1}$] | $1.38064852 \times 10^{-23}$
Permittivity of free space ($\epsilon_0$) [kg$^{-1}$ m$^{-3}$ s$^4$ A$^2$] | $8.854187817 \times 10^{-12}$
Permeability of free space ($\mu_0$) [kg m s$^{-2}$ A$^{-2}$] | $1.25663706 \times 10^{-6}$
Speed of light ($c$) [m/s] | $2.99792458 \times 10^8$

Table 4.1: Table of the physical constants (SI units) used in the numerical experiments.

4.6.1 Motion of a Single Charged Particle

We first compare the integrator proposed by Tao [47] with the Boris method [4], which are discussed in sections 4.5.1 and 4.4.2, respectively. This is a natural first step before applying the method to problems with dynamic fields that respond to particle motion. Here, we consider a simple model for the motion of a single charged particle that is given by
\[
\dot{x} = v, \qquad \dot{v} = \frac{q}{m} \left( E + v \times B \right).
\]
We use electro- and magneto-static fields here and suppose that the magnetic field lies along the $\hat{z}$ unit vector, so
\[
B = B_0 \hat{z}, \qquad E = E^{(1)} \hat{x} + E^{(2)} \hat{y} + E^{(3)} \hat{z},
\]
where $B_0$ is a constant. Again, component-based definitions have been used for the fields $E = \left( E^{(1)}, E^{(2)}, E^{(3)} \right)$ and $B = \left( B^{(1)}, B^{(2)}, B^{(3)} \right)$. Consequently, we have that
\[
v \times B = v^{(2)} B_0 \hat{x} - v^{(1)} B_0 \hat{y},
\]
so the full equations of motion are
\[
\begin{aligned}
\frac{dx^{(1)}}{dt} &= v^{(1)}, & \frac{dv^{(1)}}{dt} &= \frac{q}{m} \left( E^{(1)} + v^{(2)} B_0 \right), \\
\frac{dx^{(2)}}{dt} &= v^{(2)}, & \frac{dv^{(2)}}{dt} &= \frac{q}{m} \left( E^{(2)} - v^{(1)} B_0 \right), \\
\frac{dx^{(3)}}{dt} &= v^{(3)}, & \frac{dv^{(3)}}{dt} &= \frac{q}{m} E^{(3)}.
\end{aligned}
\]
We can then use the classical momentum $p = m v$ to obtain
\[
\begin{aligned}
\frac{dx^{(1)}}{dt} &= \frac{1}{m} p^{(1)}, & \frac{dp^{(1)}}{dt} &= q \left( E^{(1)} + \frac{1}{m} p^{(2)} B_0 \right), \\
\frac{dx^{(2)}}{dt} &= \frac{1}{m} p^{(2)}, & \frac{dp^{(2)}}{dt} &= q \left( E^{(2)} - \frac{1}{m} p^{(1)} B_0 \right), \\
\frac{dx^{(3)}}{dt} &= \frac{1}{m} p^{(3)}, & \frac{dp^{(3)}}{dt} &= q E^{(3)}.
\end{aligned}
\]
Next, we show how to convert the electric and magnetic fields to potentials for use in Molei Tao's method. Using the potentials $\psi$ and $A \equiv \left( A^{(1)}, A^{(2)}, A^{(3)} \right)$, one can compute the electric and magnetic fields via (1.11), which is equivalent to writing
\[
\begin{aligned}
E^{(1)} &= -\partial_x \psi - \partial_t A^{(1)}, & B^{(1)} &= \partial_y A^{(3)} - \partial_z A^{(2)}, \\
E^{(2)} &= -\partial_y \psi - \partial_t A^{(2)}, & B^{(2)} &= -\partial_x A^{(3)} + \partial_z A^{(1)}, \\
E^{(3)} &= -\partial_z \psi - \partial_t A^{(3)}, & B^{(3)} &= \partial_x A^{(2)} - \partial_y A^{(1)}.
\end{aligned}
\]
The time-independence of the magnetic field for this problem implies that $\partial_t A = 0$, so that
\[
E = -\nabla \psi \implies E^{(1)} = -\partial_x \psi, \quad E^{(2)} = -\partial_y \psi, \quad E^{(3)} = -\partial_z \psi.
\]
Therefore, for this problem, we can use
\[
\psi = -E^{(1)} x - E^{(2)} y - E^{(3)} z.
\]
Moreover, since the magnetic field lives only in the $z$-direction, the vector potential satisfies
\[
B = (0, 0, B_0) = \left( 0, 0, \partial_x A^{(2)} - \partial_y A^{(1)} \right).
\]
As the choice of potentials is not unique, it suffices to pick
\[
A^{(1)} \equiv 0, \qquad A^{(2)} = B_0 x, \qquad A^{(3)} \equiv 0.
\]
In summary, the values and required derivatives for the potentials are given by
\[
\begin{aligned}
-\partial_x \psi &= E^{(1)}, & -\partial_y \psi &= E^{(2)}, & -\partial_z \psi &= E^{(3)}, \\
A^{(1)} &= 0, & A^{(2)} &= B_0 x, & A^{(3)} &= 0, \\
\partial_x A^{(1)} &= 0, & \partial_y A^{(1)} &= 0, & \partial_z A^{(1)} &= 0, \\
\partial_x A^{(2)} &= B_0, & \partial_y A^{(2)} &= 0, & \partial_z A^{(2)} &= 0, \\
\partial_x A^{(3)} &= 0, & \partial_y A^{(3)} &= 0, & \partial_z A^{(3)} &= 0,
\end{aligned}
\]
which yield the simplified equations of motion
\[
\begin{aligned}
\frac{dx^{(1)}}{dt} &= \frac{1}{m} P^{(1)}, & \frac{dP^{(1)}}{dt} &= q E^{(1)} + \frac{q}{m} B_0 \left( P^{(2)} - q B_0 x^{(1)} \right), \\
\frac{dx^{(2)}}{dt} &= \frac{1}{m} \left( P^{(2)} - q B_0 x^{(1)} \right), & \frac{dP^{(2)}}{dt} &= q E^{(2)}, \\
\frac{dx^{(3)}}{dt} &= \frac{1}{m} P^{(3)}, & \frac{dP^{(3)}}{dt} &= q E^{(3)}.
\end{aligned}
\]
The setup for the test consists of a single particle with mass $m = 1.0$ and charge $q = -1.0$ whose initial position is at the origin of the domain, i.e., $x(0) = (0, 0, 0)$. Initially, the particle has momentum components only in the $x$ and $z$ directions to generate so-called "cyclotron" motion. We choose the initial momenta to be $p(0) = P(0) = \left( 1.0 \times 10^{-2}, 0, 1.0 \times 10^{-2} \right)$.
The strength of the magnetic field in the $z$ direction is selected to be $B_0 = 1.0$, and we ignore the contributions from the electric field, so that $E = (0, 0, 0)$. Both methods are run to a final time of $T = 30.0$, and a total of 1000 time steps are used so that $\Delta t = 0.03$. Lastly, we select the coupling parameter $\omega = 100$ for the Molei Tao integrator. The particle's position is tracked through time and plotted as a curve in 3-D. Figure 4.1 compares the particle trajectories obtained with both methods.

Figure 4.1: Trajectories for the single particle test, obtained using the Boris method (4.1a) and the second-order integrator by Molei Tao (as presented in [47]) (4.1b). Both methods produce identical trajectories under identical experimental conditions. The particles rotate about the magnetic field, which points in the $z$-direction.

Next, we perform a refinement study of the two methods to examine their error properties using the same experimental parameters from the cyclotron test. Errors are measured against reference solutions computed with $10^6$ time steps, so that $\Delta t = 3.0 \times 10^{-5}$. Errors are measured using the $\ell^{\infty}$ norm. The test starts with a total of 100 time steps and successively doubles the number of steps, using, at most, $1.28 \times 10^4$ steps. The results of the refinement study are shown in Figure 4.2. Both methods refine at second-order, with the Boris method producing a larger error in the solution compared to the Molei Tao integrator. While it may be possible to further reduce the size of the errors made by the Molei Tao method through better choices of the coupling parameter $\omega$, the Boris method will likely remain more efficient in terms of error for a given amount of compute time. Although we are not reporting timing results, the Boris method was notably faster than the Molei Tao integrator. This is not all that surprising given that the latter method requires more intermediate steps.
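The cyclotron test above is straightforward to reproduce. The following is a minimal sketch of the Boris rotation steps from section 4.4.2 applied to this setup (the variable names are our own):

```python
import numpy as np

def boris_push(x, v, q, m, E, B, dt):
    """One Boris step: rotate v^{n-1/2} to v^{n+1/2}, then advance x."""
    v_minus = v + (q * dt / (2.0 * m)) * E        # first half electric kick
    t = (q * dt / (2.0 * m)) * B
    v_prime = v_minus + np.cross(v_minus, t)
    s = 2.0 * t / (1.0 + np.dot(t, t))
    v_plus = v_minus + np.cross(v_prime, s)       # rotation about B
    v_new = v_plus + (q * dt / (2.0 * m)) * E     # second half electric kick
    return x + dt * v_new, v_new

# Setup from section 4.6.1: m = 1, q = -1, B = B0 z-hat, E = 0.
q, m, B0 = -1.0, 1.0, 1.0
E, B = np.zeros(3), np.array([0.0, 0.0, B0])
dt, steps = 0.03, 1000

x = np.zeros(3)
v = np.array([1.0e-2, 0.0, 1.0e-2])
# Stagger the velocity back by dt/2 (the returned position is discarded).
_, v = boris_push(x, v, q, m, E, B, -0.5 * dt)

for _ in range(steps):
    x, v = boris_push(x, v, q, m, E, B, dt)

# With E = 0, the rotation preserves |v| exactly, and the z-velocity is
# untouched, so the particle drifts a distance v_z * T = 0.3 along z.
print(np.linalg.norm(v), x[2])
```

Replacing `E` with a nonzero field exercises the half-step kicks as well; the same loop structure applies.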
4.6.2 The Cold Two-Stream Instability

We consider the motion of "cold" streams of ions and electrons restricted to a one-dimensional periodic domain by a sufficiently strong (uniform) magnetic field in the two remaining directions.

Figure 4.2: Self-refinement for the single particle test using the Boris method (4.2a) and the second-order integrator by Molei Tao (as presented in [47]) (4.2b). Second-order accuracy is achieved by both methods, but the $\ell^\infty$ errors for the Boris method are nearly a factor of 2 larger than those produced by the Molei Tao method. While we have not presented timing results, it is worth noting that the run times for the Boris method were considerably faster than those of the Molei Tao method due to the latter's additional "stages". The final error measurements taken from the refinement study are $1.4728 \times 10^{-7}$ (Boris) and $6.5592 \times 10^{-8}$ (Tao).

Ions are taken to be uniformly distributed in space and sufficiently heavy compared to the electrons that their motion can be ignored in the simulation. While the ions remain stationary, they act as a neutralizing background against the dynamic electrons. Mathematically, the electron velocities are represented as a sum of two Dirac delta distributions which are symmetric in velocity about the origin, i.e., the streams move in opposite directions with the same speed. A slight perturbation in the electron velocities is then introduced to force a charge imbalance, as some particles move faster than others. This, in turn, generates an electric field that attempts to restore the neutrality of the system, causing the streams to interact, or "roll up", creating regions of trapped particles.
In order to describe the models used in the simulation, let us denote the components of the position and momentum vectors for particle $i$ as $\mathbf{x}_i \equiv \left( x_i^{(1)}, x_i^{(2)}, x_i^{(3)} \right)$ and $\mathbf{P}_i \equiv \left( P_i^{(1)}, P_i^{(2)}, P_i^{(3)} \right)$, respectively. The equations for the motion of particle $i$ then assume the form
\[
\frac{dx_i^{(1)}}{dt} = \frac{1}{r_i} P_i^{(1)}, \qquad
\frac{dP_i^{(1)}}{dt} = -q_i \partial_x \psi.
\]
Therefore, the motion in this plane requires knowledge of $\psi$ and $\partial_x \psi$, which can be obtained by solving a two-way wave equation for the scalar potential:
\[
\frac{1}{\kappa^2} \frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1} \rho. \tag{4.32}
\]
As this is an electrostatic problem, the gauge condition can be safely ignored. In the limit where $\kappa \gg 1$, the characteristic thermal velocities of the particles become well-separated from the speed of light. Rather than solve the two-way wave equation, one instead solves the Poisson equation
\[
-\Delta \psi = \frac{1}{\sigma_1} \rho. \tag{4.33}
\]
Using asymptotic analysis, it can be shown that the approximation error made by employing the Poisson model for the scalar potential is $O(1/\kappa)$ [42].

Model | Time Integration | Fields + Derivatives
$-\Delta \psi = \frac{1}{\sigma_1} \rho$ | Leapfrog, Tao (with averaging) | FFT + FFT
$\frac{1}{\kappa^2} \partial_{tt} \psi - \Delta \psi = \frac{1}{\sigma_1} \rho$ | Leapfrog, Tao (with averaging) | BDF-2 + BDF-2, Central-2 + BDF-2, Central-2 + BDF-4

Table 4.2: Summary of the algorithms explored for the two-stream instability example. Both time integration methods considered are second-order.

We benchmark the performance of several combinations of algorithms (see Table 4.2) for time stepping particles and evolving fields by comparing with well-known methods. This helps establish the baseline properties of the methods and reduces the parameter space of viable methods. The setup for this test problem employs a spatial mesh defined on the interval $[-10\pi/3, 10\pi/3]$, which is discretized using 128 total grid points and supplied with periodic boundary conditions. The non-dimensional final time for the simulation is taken to be $T_f = 50.0$, with 4,000 time steps being used to evolve the system.
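The FFT field solver listed in Table 4.2 amounts to dividing the Fourier coefficients of the charge density by the squared wavenumber. The sketch below is our own minimal stand-in: a naive $O(N^2)$ DFT replaces the FFT for clarity, and the zero mode is fixed to zero (zero-mean potential), which is the usual convention on a periodic domain.

```python
import cmath, math

def poisson_periodic_1d(rho, L, sigma1=1.0):
    """Pseudo-spectral solve of -psi'' = rho / sigma1 with periodic BCs.
    A naive O(N^2) DFT stands in for the FFT; the k = 0 mode is set to
    zero, fixing the mean of psi."""
    N = len(rho)
    # forward transform
    rho_hat = [sum(rho[j] * cmath.exp(-2j * math.pi * k * j / N)
                   for j in range(N)) for k in range(N)]
    psi_hat = [0j] * N
    for k in range(1, N):
        m = k if k <= N // 2 else k - N      # signed wavenumber index
        kx = 2.0 * math.pi * m / L           # physical wavenumber
        psi_hat[k] = rho_hat[k] / (sigma1 * kx * kx)
    # inverse transform (real part; rho is real)
    return [sum(psi_hat[k] * cmath.exp(2j * math.pi * k * j / N)
                for k in range(N)).real / N for j in range(N)]
```

For $\rho = \cos(x)$ on $[0, 2\pi]$ with $\sigma_1 = 1$, the exact solution of $-\psi'' = \rho$ is $\psi = \cos(x)$, which the solver reproduces to round-off.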
The plasma is represented using a total of 20,000 macro-particles, which are split equally between ions and electrons. As mentioned earlier, the positions of the ions and electrons are taken to be uniformly spaced along the grid. Ions remain stationary in this problem, so we set their velocity to zero. The construction of the streams begins by splitting the electrons into two equally sized groups, whose respective (non-dimensional) drift velocities are set to $\pm 1$. To generate an instability, we add a perturbation to the electron velocities of the form
\[
\epsilon \sin \left( \frac{2 \pi k (x - a)}{L} \right).
\]
Here, $\epsilon = 5 \times 10^{-3}$ controls the perturbation strength, $k = 1$ is the wave number of the perturbation, $x$ is the position of the electron, $a$ is the left-most grid point, and $L$ is the length of the domain. In a more physically realistic simulation, the perturbation would be induced by some external force, which would also perturb the position data of the particles. Such a perturbation of the position data requires a self-consistent field solve to properly initialize the potentials. In our simulation, we assume that no spatial perturbation is present, so the fields are identically zero at the initial time step. A plot of the electron streams at the initial condition is shown in Figure 4.3.

Figure 4.3: Initial configuration of electrons used in the two-stream experiments.

The plasma parameters used in the non-dimensionalization for this test problem are displayed in Table 4.3. Note that under these scales, the normalized speed of light is $\kappa = 50$. In some sense, this value is close to the relativistic regime, but far enough away to avoid the need for relativistic time integration methods. Additionally, we find that this configuration is sufficient to resolve the Debye length ($\approx 6$ cells/$\lambda_D$) and the angular plasma period ($\approx 80$ steps/$\omega_{pe}$), and to keep the particle CFL $< 1$, all of which are necessary for maintaining stability in explicit PIC methods.
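The stream construction above can be sketched as follows. This is our own illustration; in particular, assigning alternating particles to the two streams is our choice (splitting the array into two halves works equally well), and the defaults mirror the stated parameters.

```python
import math

def init_two_stream(n_electrons, a, L, v_drift=1.0, eps=5.0e-3, k=1):
    """Initialize two cold counter-streaming electron beams on [a, a + L]
    with a sinusoidal velocity perturbation of strength eps and wave
    number k. Electrons are uniformly spaced; streams alternate in the
    sign of the drift velocity."""
    x, v = [], []
    for i in range(n_electrons):
        xi = a + (i + 0.5) * L / n_electrons           # uniform spacing
        drift = v_drift if i % 2 == 0 else -v_drift    # +1 / -1 streams
        vi = drift + eps * math.sin(2.0 * math.pi * k * (xi - a) / L)
        x.append(xi)
        v.append(vi)
    return x, v
```

Since $\epsilon \ll |v_{\mathrm{drift}}|$, the perturbation never changes the sign of a particle's velocity, so each stream keeps exactly half the electrons.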
In order to get a sense of the behavior attributed to the particle integrator, we first considered the Poisson model (4.33) for the scalar potential.

Parameter | Value
Average number density ($\bar{n}$) [m$^{-3}$] | $7.856060 \times 10^{1}$
Average temperature ($\bar{T}$) [K] | $2.371698 \times 10^{6}$
Debye length ($\lambda_D$) [m] | $1.199170 \times 10^{4}$
Inverse angular plasma frequency ($\omega_{pe}^{-1}$) [s/rad] | $2.000000 \times 10^{-3}$
Thermal velocity ($v_{th}$) [m/s] | $5.995849 \times 10^{6}$

Table 4.3: Table of the plasma parameters used in the two-stream instability example.

Since the combination of leapfrog time integration with an FFT field solver is such a commonly used approach to this problem, it allowed us to identify key differences attributable solely to the choice of time integration method used for the particles. We found that a direct application of the Molei Tao integrator to this problem, including the corresponding field solves, gave nonphysical results. As the streams began to develop interesting structures, the particles appeared to "jump" off their smooth trajectories, manifesting as a form of noise. This can be attributed to the duplicate field and particle data required by Molei Tao's method, which are not guaranteed to remain close; eventually, this leads to differences in the potentials used to move the particles. Adjustments to the coupling parameter in Molei Tao's method were unsuccessful at controlling this behavior and, in fact, exacerbated the phenomenon. A method parameter that can lead to such exceptionally nonphysical results is a highly undesirable feature. In an attempt to fix this problem, we chose to adjust the particle data at the end of the time step by replacing the values with the averages of $\left( \mathbf{x}^{n+1}, \mathbf{Q}^{n+1} \right)$ and $\left( \mathbf{y}^{n+1}, \mathbf{P}^{n+1} \right)$. In Figure 4.4, we compare the Molei Tao method with and without this averaging.
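The end-of-step averaging can be sketched as below. The copy naming $(\mathbf{x}, \mathbf{Q})$ and $(\mathbf{y}, \mathbf{P})$ follows the text; the function itself is our own illustration of the adjustment, not the thesis code.

```python
def average_tao_copies(x, Q, y, P):
    """Replace the two phase-space copies (x, Q) and (y, P) carried by
    Tao's splitting integrator with their componentwise averages, so
    both copies leave the step in the same state."""
    x_avg = [(xi + yi) / 2.0 for xi, yi in zip(x, y)]
    p_avg = [(qi + pi) / 2.0 for qi, pi in zip(Q, P)]
    # both copies are overwritten with the averaged state
    return x_avg, p_avg, list(x_avg), list(p_avg)
```

Collapsing the two copies onto their mean is what prevents the duplicated field and particle data from drifting apart over many steps, at the likely cost of the method's symplectic structure.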
This approach, while most likely not symplectic, seems to be reasonably effective at controlling this difference and the number of particles that leave the stream lines. This modification is therefore used in all subsequent experiments involving the Molei Tao integrator. Having (somewhat) resolved the issue posed by the Molei Tao method, we then compared this integrator against the leapfrog method using the Poisson model (4.33) for the scalar potential. We present snapshots of the electron streams obtained with both methods in Figure 4.5. We observe nearly identical behavior from the two methods, with the exception of later times, at which point several particles have moved off their trajectories.

Figure 4.4: A comparison of the Molei Tao particle integrator with and without averaging for the two-stream example with the Poisson model. Over time, the pairs of phase space data, including the associated fields, can grow apart, leading to vastly different potentials that kick particles off their smooth trajectories. Averaging appears to be fairly effective at controlling this behavior.

Figure 4.5: We present plots of the electrons in phase space obtained using the Poisson model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The FFT is used to compute the scalar potentials in both methods. At later times, despite improvements from "averaging" the particle data, the Tao method causes particles to move off the stream lines.
This phenomenon is a numerical artifact that is not present in the leapfrog method.

Figure 4.6: Time refinement of a tracer particle's position for the two-stream instability using the Poisson model for the potential with leapfrog (a) and the Molei Tao integrator with averaging (b). We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. Both methods converge to second-order accuracy, with leapfrog generally displaying a larger absolute error than the Tao method. The exception to this is the smallest $\Delta t$ used in the leapfrog experiments.

A time refinement experiment for the two-stream example with the Poisson model was also performed using the same numerical parameters outlined in Table 4.3. Errors were measured by following the trajectory of a single tracer particle, arbitrarily selected from the middle of the particle array, for several values of $\Delta t$. The location of this tracer particle, in each case, is stored at the non-dimensional final time $T_f = 25$, which we note is prior to the occurrence of the "roll-up" in the streams. The reference solution used to measure the error was computed using 16,384 time steps. We ran the code starting with 256 time steps and successively doubled this number until reaching 8,192 steps. Errors were measured using the $\ell^\infty$ norm, which in one dimension becomes the absolute value. We plot the errors against the time step size in Figure 4.6. The plots show second-order time accuracy for both methods, with cleaner refinement behavior observed in the leapfrog integrator despite a larger overall error. Additionally, the Molei Tao integrator displays a noticeable increase in the error when larger time steps are taken.
It is worth noting that the initial values of $\Delta t$ (i.e., where the jump in the error occurs) violate the particle CFL condition, which should be $< 1$ for explicit particle methods; however, this jump in the error is not observed in the leapfrog method, which is known for its long-time accuracy. Moreover, it is difficult to identify the exact time at which the breakdown in the Molei Tao method occurs. The accuracy guarantees are given by Tao in an asymptotic form that involves the duration of the simulation, the order of the method, and the value of the coupling parameter. Additionally, these estimates were obtained only in the case of particles moving through fields known at all positions in space and time, so they may no longer be valid for the dynamic fields considered in this work.

This same experiment was repeated using the two-way wave model (4.32) in place of the Poisson model (4.33) for the scalar potential. For problems which are strongly electrostatic (i.e., $\kappa \gg 1$), the wave model should produce results similar to those of the Poisson model shown in Figure 4.5. This setting allows us to benchmark the performance of the wave solvers and methods for derivatives discussed in chapter 2. We include the second-order time-centered and BDF field solvers, as well as derivative methods based on BDF-2 and BDF-4 discretizations, in this experiment. Recall that in section 2.5.1 we mentioned concerns about the stability of field solvers based on BDF-3 and BDF-4 discretizations. Here, we use the higher-order methods only to compute derivatives, which are not evolved in time, so stability is less of a concern. Moreover, since this problem is one-dimensional in space, we avoid the splitting error, so the derivatives will be more accurate. We considered three pairings of these methods: (1) BDF-2 with BDF-2, (2) central-2 with BDF-2, and (3) central-2 with BDF-4.
Results obtained with each of these pairings for the field solvers are shown in Figures 4.7, 4.8, and 4.9, respectively. A notable difference from the Poisson model is that particles stayed attached to their smooth trajectories. This can be attributed to the use of a wave model, which, due to the finite speed of propagation, responds more slowly to changes in the charge density. Among the results for the wave models, we observe excellent agreement between the leapfrog and Molei Tao integrators. In these experiments, the run parameters we selected gave a CFL $\approx 3.79$, which is not large enough to see noticeable improvements gained by moving to higher-order methods. Nevertheless, the overall consistency among the results is quite encouraging. We note that time averaging is applied to the charge density in solvers which combine Molei Tao with the central-2 scheme for the fields (see section 4.5.1.2 for details). Without this averaging, the streams interact at an accelerated rate, which is nonphysical.

A time refinement study of the proposed methods was also performed, following the same tracer-particle procedure used for the Poisson model. The results of the refinement study are presented in Figure 4.10. We (generally) observe second-order temporal accuracy with each of the methods. When leapfrog is used to move the particles, we observe fairly clean second-order accuracy. The Molei Tao method, in contrast, shows some irregularities in the refinement pattern, which in some cases appears to diverge when large time steps are used. This is likely due to the time step simply being too large for an explicit method, despite satisfying the CFL condition for the particles. Additionally, it is worth noting that the methods using Molei Tao display a smaller error. In terms of efficiency, however, the increased number of field solves required by the Molei Tao method may not be offset by this improvement in the error.
Figure 4.7: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The second-order (diffusive) BDF scheme (BDF-2) is used to compute the scalar potentials and their derivatives for both methods. Unlike the results obtained with the Poisson model, which used the FFT as the field solver (shown in Figure 4.5), the particles at later times in the Molei Tao method seem to stay attached to their trajectories.

Figure 4.8: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials are evolved using the second-order central scheme (central-2), while the derivatives are computed at each step with the second-order BDF scheme (BDF-2). In the bottom row, which uses the Molei Tao method, we obtain results that are similar to the BDF-2 method (see Figure 4.7) in the sense that particles do not seem to jump off their trajectories.
Figure 4.9: We present plots of the electrons in phase space obtained using the wave model for the two-stream example. Results obtained using leapfrog time integration are shown in the top row, while the bottom row uses the second-order integrator based on Molei Tao and applies averaging. We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials are evolved using the second-order central scheme (central-2), while the derivatives are computed at each step with the fourth-order BDF scheme (BDF-4). As with the other wave solver methods, the particles in the Molei Tao experiments seem to stay attached to their smooth trajectories, even at later times.

Figure 4.10: Time refinement of a tracer particle's position for the two-stream instability. For the particle push, we consider both leapfrog and the Molei Tao method with averaging, in combination with different methods for the fields and their derivatives. We selected $\omega = 500$ as the value of the coupling parameter in all of the Molei Tao integrator experiments. Each of the methods converges to second-order accuracy, with the error in the Tao method being smaller than that of leapfrog.

Parameter | Value
Average number density ($\bar{n}$) [m$^{-3}$] | $1.129708 \times 10^{14}$
Average temperature ($\bar{T}$) [K] | $2.371698 \times 10^{6}$
Debye length ($\lambda_D$) [m] | $1.000000 \times 10^{-2}$
Inverse angular plasma frequency ($\omega_{pe}^{-1}$) [s/rad] | $1.667820 \times 10^{-9}$
Thermal velocity ($v_{th}$) [m/s] | $5.995849 \times 10^{6}$

Table 4.4: Table of the plasma parameters used in the numerical heating example.
4.6.3 Numerical Heating Study

We now perform a numerical heating study using the same combination of models and algorithms shown in Table 4.2 for the two-stream problem. The primary purpose of this test is to characterize the effect of resolving the Debye length $\lambda_D$ in a steady-state problem for the Vlasov equation. Moreover, it allows us to benchmark the degree of heating observed under different selections of models, particle integrators, and field solvers. These numerical properties turn out to be connected to the symplecticity of the method. Explicit PIC methods are not symplectic because the fields and the particles are not self-consistent with one another. A consequence of this is that the grid should be sufficiently fine that a given particle can "see" the correct potential, which is otherwise screened by particles of opposite charge. In other words, with explicit PIC methods, one needs to resolve the charge separation inside the plasma, which is determined by the Debye length. A general rule of thumb for explicit PIC simulations is that the grid spacing $\Delta x$ should be chosen to satisfy $4 \Delta x < \lambda_D$. The phenomenon of heating occurs in simulations which do not adequately resolve this scale; in such cases, the system will increase the temperature of the plasma until it becomes adequately resolved on the given mesh. Fully-implicit methods, which break this restriction [9], allow a substantially coarser mesh to be used for a given calculation and will be the subject of future work.

The setup for this problem is slightly different from the two-stream example discussed earlier. Here, we provide, as input, a Debye length $\lambda_D$ as well as a normalized speed of light $\kappa$, which can be used to calculate the average number density $\bar{n}$ and macroscopic temperature $\bar{T}$ for the plasma.
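The resolution rules of thumb discussed above are easy to encode as a pre-flight check. The sketch below is our own; in particular, the threshold on radians of plasma oscillation per step is an assumed value, not one taken from the text.

```python
def explicit_pic_checks(dx, dt, lambda_D, omega_pe, v_max):
    """Heuristic resolution checks for explicit PIC: Debye resolution
    (4 dx < lambda_D) and a particle CFL number below one, as discussed
    in the text. The plasma-period threshold (0.2 rad per step) is an
    assumed value for illustration."""
    return {
        "debye_resolved": 4.0 * dx < lambda_D,
        "plasma_period_resolved": omega_pe * dt < 0.2,
        "particle_cfl_ok": v_max * dt / dx < 1.0,
    }
```

A run that fails the Debye check is exactly the regime probed by this heating study: the simulation still executes, but grid heating raises the temperature until the Debye length becomes resolved.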
The remaining parameters related to the plasma can be derived from these values and are shown in Table 4.4.

Figure 4.11: Initial electron data in phase space used for the numerical heating tests.

The non-dimensional grid used in this problem is taken to be periodic on the interval $[-25, 25]$. This grid is refined by successively doubling the number of mesh points from 16 to 256. In each case, the simulation uses $5 \times 10^5$ time steps to reach the non-dimensional final time $T_f = 1 \times 10^3$. A total of $5 \times 10^3$ particles are used for each species in the simulation, which consist of ions and electrons. As before, we assume that the ions remain stationary, since they are heavier than the electrons. Electrons are given uniform positions in space, and their velocities are initialized by sampling from the standard normal distribution; there is no drift velocity in this problem. This non-dimensional standard normal distribution corresponds to a Maxwellian distribution with mean zero and temperature $\bar{T}$. To ensure consistency across the runs, we seed the random number generator. A plot of the electrons in phase space at the initial condition is displayed in Figure 4.11.

In order to monitor heating during the simulations, we track the time history of the variance of the electron velocities, which is connected to the temperature of a Maxwellian distribution. Note that the variance data at $N + 1$ time levels $\left\{ \mathrm{var}(v^n) \right\}_{n=0}^{N}$ can be converted to a temperature history using
\[
\left\{ \bar{T}^n \right\}_{n=0}^{N} = \frac{m_e}{k_B} \left\{ \mathrm{var}(v^n) \right\}_{n=0}^{N}.
\]
The results of our heating study can be found in Figures 4.12 and 4.13, which correspond to the Poisson and wave models for the potential, respectively. In the case of the Poisson model, identical heating properties are observed when particles are integrated with either leapfrog or the averaged version of the Molei Tao integrator.
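The variance-to-temperature relation above can be sketched as follows. This helper is our own illustration; the defaults $m_e = k_B = 1$ reflect the non-dimensional setting, and dimensional constants can be passed in when needed.

```python
def temperature_history(velocity_snapshots, m_e=1.0, k_B=1.0):
    """Convert per-step electron velocity samples into a temperature
    history via T^n = (m_e / k_B) * var(v^n). Each entry of
    velocity_snapshots is the list of electron velocities at one step."""
    history = []
    for v in velocity_snapshots:
        mean = sum(v) / len(v)
        var = sum((vi - mean) ** 2 for vi in v) / len(v)  # population variance
        history.append(m_e * var / k_B)
    return history
```

Monitoring this history over the run is what reveals grid heating: on an under-resolved mesh the sequence drifts upward instead of fluctuating about the initial temperature.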
The degree of heating becomes noticeably larger as the grid is coarsened.

Figure 4.12: We present results from the numerical heating tests based on the Poisson model. Plots show the average electron temperature as a function of the number of angular plasma periods using leapfrog (left) and the second-order integrator by Molei Tao with averaging (right). Fields and their derivatives are obtained using the FFT.

Similar behaviors are observed in the wave model for the potential, with some caveats. Firstly, we observe a less substantial degree of heating in cases where the Debye length is underresolved, due to the finite speed of propagation in the wave model. Additionally, we see that two of the configurations, specifically the ones which combine Molei Tao with the time-centered method, become unstable in time; however, the same approaches with leapfrog behave as expected. A likely source of the problem is the time averaging applied to the source terms in the Molei Tao method, which is not applied in other methods that use either the Molei Tao integrator or the time-centered scheme for the scalar potential.

Figure 4.13: We display results from the numerical heating tests that use the wave model for the potentials. Plots show the average electron temperature as a function of the number of angular plasma periods using leapfrog (top) and the second-order integrator by Molei Tao with averaging (bottom). We selected $\omega = 500$ as the value of the coupling parameter in the Molei Tao integrator. The scalar potentials and derivatives are computed with the scheme labels provided in the individual captions.
4.6.4 The Bennett Equilibrium Pinch

To benchmark the performance of our method on electromagnetic problems, we first consider the Bennett equilibrium pinch, named after its discoverer W. H. Bennett, who first analyzed the problem [81]. In that work, Bennett constructed a steady-state solution of the ideal MHD equations in cylindrical coordinates. Electrons are modeled as a fluid that drifts along the $z$-direction, creating currents that generate magnetic fields with $x$ and $y$ components that confine or "squeeze" the plasma towards the axis of the cylinder. The fluid velocity along the axis of the cylinder is carefully chosen to create a proper equilibrium balance between the plasma and the confining magnetic field that surrounds it.

Our particle simulation of the Bennett pinch closely follows the description provided in chapter 13 of Bittencourt [82]. We consider the motion of electrons inside a cross-section of a cylindrical column of plasma of radius $R_b$ and suppose that the axis of the beam is centered at the origin of a bounding box $\Omega = [-2R_b, 2R_b] \times [-2R_b, 2R_b]$. Let $\mathbf{J} = \left( J^{(1)}, J^{(2)}, J^{(3)} \right)$ denote the components of the current density. In the $x$-$y$ plane, the particles are sampled from a Maxwellian distribution, so we ignore the contributions from $J^{(1)}$ and $J^{(2)}$ because they carry zero net current. Since the particles drift along the axis of the beam, we retain the third component of the current density, $J^{(3)}$. Consequently, we ignore the wave equations for $A^{(1)}$ and $A^{(2)}$, choosing to retain only $A^{(3)}$. Ions are spatially distributed by sampling the same distribution as the electrons and are assumed to be stationary within the cross-section, since their mass is taken to be much larger than that of the electrons.
If we denote the components of the position and momentum vectors for particle $i$ as $\mathbf{x}_i \equiv \left( x_i^{(1)}, x_i^{(2)}, x_i^{(3)} \right)$ and $\mathbf{P}_i \equiv \left( P_i^{(1)}, P_i^{(2)}, P_i^{(3)} \right)$, respectively, the equations for the motion of each particle assume the form
\[
\frac{dx_i^{(1)}}{dt} = \frac{1}{r_i} P_i^{(1)}, \qquad
\frac{dx_i^{(2)}}{dt} = \frac{1}{r_i} P_i^{(2)},
\]
\[
\frac{dP_i^{(1)}}{dt} = -q_i \partial_x \psi + \frac{q_i}{r_i} \partial_x A^{(3)} \left( P_i^{(3)} - q_i A^{(3)} \right), \qquad
\frac{dP_i^{(2)}}{dt} = -q_i \partial_y \psi + \frac{q_i}{r_i} \partial_y A^{(3)} \left( P_i^{(3)} - q_i A^{(3)} \right).
\]
Note that while we retain the last component of the momentum, we do not include an equation for $P_i^{(3)}$, since each of its terms involves $z$-derivatives, which are ignored. In a more realistic simulation, we would retain the full 3D-3P system to monitor changes in $P_i^{(3)}$. That said, this test is still useful because it serves as a benchmark to assess the quality of particle confinement inside the beam.

The particle equations of motion for this problem can also be formulated in terms of $\mathbf{E}$ and $\mathbf{B}$ and solved with the Boris method discussed in section 4.4.2. Given the potentials $\psi$ and $\mathbf{A}$, we can calculate the $\mathbf{E}$ and $\mathbf{B}$ fields using (1.11):
\[
\mathbf{E} = -\nabla \psi - \partial_t \mathbf{A}, \qquad \mathbf{B} = \nabla \times \mathbf{A}.
\]
These can be used to evolve particles through the non-dimensional model
\[
\frac{dx_i^{(1)}}{dt} = v_i^{(1)}, \qquad
\frac{dx_i^{(2)}}{dt} = v_i^{(2)},
\]
\[
\frac{dv_i^{(1)}}{dt} = \frac{q_i}{r_i} \left( E^{(1)} - v_i^{(3)} B^{(2)} \right), \qquad
\frac{dv_i^{(2)}}{dt} = \frac{q_i}{r_i} \left( E^{(2)} + v_i^{(3)} B^{(1)} \right), \qquad
\frac{dv_i^{(3)}}{dt} = \frac{q_i}{r_i} \left( v_i^{(1)} B^{(2)} - v_i^{(2)} B^{(1)} \right),
\]
where the last equation for the velocity is neglected so that this form is consistent with the generalized momentum formulation.

Parameter | Value
Beam radius ($R_b$) [m] | $1.0 \times 10^{-6}$
Average number density ($\bar{n}$) [m$^{-3}$] | $4.391989 \times 10^{19}$
Average temperature ($\bar{T}$) [K] | $5.929245 \times 10^{1}$
Debye length ($\lambda_D$) [m] | $8.019042 \times 10^{-8}$
Inverse angular plasma frequency ($\omega_{pe}^{-1}$) [s/rad] | $2.674864 \times 10^{-12}$
Thermal velocity ($v_{th}$) [m/s] | $2.997925 \times 10^{4}$
Electron drift velocity ($v_{\mathrm{drift}}$) [m/s] | $2.997925 \times 10^{7}$
Fraction of particles contained in the beam ($\alpha$) [non-dimensional] | $0.99$

Table 4.5: Table of the parameters used in the setup for the Bennett pinch problem.
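The Boris velocity update referenced above follows the classic kick-rotate-kick pattern. The non-relativistic sketch below is our own illustration and omits the position update and the field gather; only the velocity rotation and electric kicks are shown.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def boris_velocity_update(v, E, B, q, m, dt):
    """One non-relativistic Boris velocity update: half electric kick,
    exact-norm rotation about B, then a second half electric kick."""
    h = q * dt / (2.0 * m)
    v_minus = [v[i] + h * E[i] for i in range(3)]      # first half kick
    t = [h * B[i] for i in range(3)]                   # rotation vector
    t2 = sum(ti * ti for ti in t)
    s = [2.0 * ti / (1.0 + t2) for ti in t]
    vxt = cross(v_minus, t)
    v_prime = [v_minus[i] + vxt[i] for i in range(3)]
    vxs = cross(v_prime, s)
    v_plus = [v_minus[i] + vxs[i] for i in range(3)]   # rotated velocity
    return [v_plus[i] + h * E[i] for i in range(3)]    # second half kick
```

Because the magnetic step is a pure rotation, the kinetic energy is conserved exactly when $\mathbf{E} = 0$, which is the property that makes the Boris scheme so robust for gyromotion.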
Particles that leave the domain due to inadequate confinement by the fields are prescribed new positions by sampling the initial distribution, which essentially re-injects them into the beam. The velocity and momenta are left unchanged in an effort to keep the total current density constant. Next, we provide the experimental parameters for the simulations along with details regarding the initialization procedure for the problem.

During the initialization and time evolution phases of the simulation, we use an analytical solution for the toroidal magnetic field to verify the correctness of the numerical fields, which maintain the steady-state. To derive the analytical solution for the toroidal magnetic field $B^{(\theta)}$, we solve the differential equation (equation 13.3.5 in [82])
\[
\frac{d}{dr} \left( r B^{(\theta)} \right) = \mu_0 q_e v_e^{(3)} r \, n(r),
\]
where $\mu_0$ is the permeability of free space, $q_e$ is the electron charge, $v_e^{(3)}$ is the $z$-component of the electron velocity, and $n(r)$ is the Bennett distribution for the electrons (equation 13.3.8 in [82]), given as
\[
n(r) = \frac{n_0}{\left( 1 + n_0 b r^2 \right)^2}. \tag{4.34}
\]
The on-axis number density $n_0$ is calculated according to equation 13.3.14 of [82],
\[
n_0 = \frac{1}{b R_b^2} \left( \frac{\alpha}{1 - \alpha} \right),
\]
where $\alpha \in [0, 1)$ is a parameter specifying the fraction of particles in the cross-section contained within the beam. We use the constant $b$, whose form is given in equation 13.3.9 of [82], namely
\[
b = \frac{\mu_0 \left( q_e v_e^{(3)} \right)^2}{8 k_B \bar{T}}.
\]
Note that the above expression is equivalent to assuming that the ions are cold, so that $T_i = 0$ and $T_e = \bar{T}$. To solve this differential equation, we integrate both sides from $0$ to $r$:
\[
\int_0^r \frac{d}{dr'} \left( r' B^{(\theta)}(r') \right) dr' = \mu_0 q_e v_e^{(3)} \int_0^r \frac{n_0 r'}{\left( 1 + n_0 b r'^2 \right)^2} \, dr'.
\]
The left side simplifies to $r B^{(\theta)}(r)$, and the right side can be evaluated using the substitution $\tau = 1 + n_0 b r'^2$, which yields
\[
r B^{(\theta)}(r) = \frac{\mu_0 q_e v_e^{(3)}}{2b} \int_1^{1 + n_0 b r^2} \tau^{-2} \, d\tau
= \frac{\mu_0 q_e v_e^{(3)}}{2b} \left( 1 - \frac{1}{1 + n_0 b r^2} \right)
= \frac{\mu_0 q_e v_e^{(3)}}{2} \frac{n_0 r^2}{1 + n_0 b r^2}.
\]
Hence, the analytical solution for the toroidal magnetic field is
\[
B^{(\theta)}(r) = \frac{\mu_0 q_e v_e^{(3)}}{2} \frac{n_0 r}{1 + n_0 b r^2}. \tag{4.35}
\]
Numerically, we solve the problem in a Cartesian coordinate system instead of a cylindrical coordinate system. Therefore, we will have to convert $B^{(1)} \equiv B^{(x)}$ and $B^{(2)} \equiv B^{(y)}$ to $B^{(\theta)}$. The transformation that converts between these coordinate systems is
\[
\begin{pmatrix} B^{(r)} \\ B^{(\theta)} \\ B^{(z)} \end{pmatrix}
=
\begin{pmatrix} \cos(\theta) & \sin(\theta) & 0 \\ -\sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} B^{(x)} \\ B^{(y)} \\ B^{(z)} \end{pmatrix}.
\]
Note that this transformation can be further simplified with $\cos(\theta) = \frac{x}{r}$ and $\sin(\theta) = \frac{y}{r}$, for $r > 0$, so that we obtain for the $\theta$-component
\[
B^{(\theta)}(r) = -\frac{y}{r} B^{(x)} + \frac{x}{r} B^{(y)}, \qquad r = \sqrt{x^2 + y^2} > 0.
\]
For the case of $r = 0$, we can appeal to the analytical solution (4.35), so that $B^{(\theta)}(0) = 0$.

The setup of our PIC simulation requires an average, or macroscopic, number density $\bar{n}$. This can be obtained from a variation of equation 13.3.11 in [82]:
\[
\bar{n} = \frac{2\pi}{16 R_b^2} \int_0^{R_b} n(r) \, r \, dr = \frac{n_0 \pi R_b^2}{16 R_b^2 \left( 1 + n_0 b R_b^2 \right)}, \tag{4.36}
\]
where we have used the definition (4.34). A summary of the plasma parameters used in the simulation of the Bennett pinch is presented in Table 4.5.

Next, we describe the initialization procedure for the fields $\psi$ and $\mathbf{A}$ that recovers the steady-state solution with enough accuracy to achieve particle confinement. This problem is defined over free-space, so we prescribe outflow boundary conditions along the boundary of the box that encloses the beam. To initialize the steady-state with the wave solvers, we first set $\psi = 0$ and $\mathbf{A} = 0$, and then proceed by stepping the corresponding wave equations to steady-state, holding the sources $\rho$ and $\mathbf{J}$ fixed. Although it is more expensive than direct evaluation of the free-space integral solution, this technique allows for a self-consistent initialization of the data used by the outflow procedure for this problem.
Further, since the data is stepped to steady-state, this approach is general enough that it can also be used to initialize both the Poisson and wave models. In Figure 4.14, we show plots of the steady-state toroidal magnetic field obtained with this initialization procedure using a 128 × 128 mesh. The errors in the initial condition are primarily concentrated along the edge of the simulation domain and near the origin. Along the domain boundary, the errors are primarily due to the explicit outflow procedure. Errors occurring near the origin are due to the steep gradients created by the beam, which can be corrected by using a finer mesh. As a first test, we compare the Molei Tao integrator with the Boris method for the true steady-state problem, which is described by the elliptic system
\[
-\Delta \psi = \frac{1}{\sigma_1} \rho, \qquad -\Delta A^{(3)} = \sigma_2 J^{(3)},
\]
which is defined over free-space.

Figure 4.14: Initialization of the steady-state toroidal magnetic field in the Bennett problem computed with the BDF-2 wave solver after 1000 steps against a fixed current density. The derivatives of the vector potential 𝐴^(3) are also obtained with the BDF-2 method.

An approach based on Green's functions is quite natural, as the free-space solution of this decoupled elliptic system requires the evaluation of the volume integrals
\[
\psi(\mathbf{x}) = -\frac{1}{2\pi\sigma_1} \int_\Omega \ln\left( \|\mathbf{x} - \mathbf{x}'\| \right) \rho(\mathbf{x}') \, d\mathbf{x}',
\qquad
A^{(3)}(\mathbf{x}) = -\frac{\sigma_2}{2\pi} \int_\Omega \ln\left( \|\mathbf{x} - \mathbf{x}'\| \right) J^{(3)}(\mathbf{x}') \, d\mathbf{x}',
\]
with ‖·‖ being the usual Euclidean distance. The evaluation of these integrals can be performed efficiently with a fast summation method such as a tree-code [83] or a fast multipole method [48], which is beyond the scope of the present work. Instead, we solve this elliptic system using second-order finite-differences with a sparse direct solver that applies Dirichlet boundary conditions. Derivatives of the fields are computed with second-order finite-differences.
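For reference, a direct O(N²) evaluation of the logarithmic volume integral for 𝜓 looks like the following; this is the brute-force sum that a tree-code or FMM would accelerate. The source positions and weights are arbitrary test data, not values from the experiment.

```python
import math

def psi_direct(targets, sources, weights, sigma1=1.0):
    """psi(x) = -1/(2 pi sigma1) * sum_j ln|x - x'_j| * rho_j * dA_j.

    `weights` holds rho(x'_j) * dA_j, the charge carried by each source cell.
    Self-interactions (zero distance) are skipped, as in a point-mass sum.
    """
    out = []
    for (x, y) in targets:
        acc = 0.0
        for (xs, ys), w in zip(sources, weights):
            d = math.hypot(x - xs, y - ys)
            if d > 0.0:
                acc += math.log(d) * w
        out.append(-acc / (2.0 * math.pi * sigma1))
    return out

# A single unit charge at the origin reproduces -ln(r) / (2 pi sigma1).
vals = psi_direct([(1.0, 0.0), (2.0, 0.0)], [(0.0, 0.0)], [1.0])
assert abs(vals[0]) < 1e-14                              # ln(1) = 0
assert abs(vals[1] + math.log(2.0) / (2 * math.pi)) < 1e-14
```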
The Dirichlet data used for the elliptic solves is supplied by evolving the wave equations for the potentials with the BDF-2 method using outflow boundary conditions. While there are many possible approaches to this problem, it is important that this data be updated at each time step so that the boundary data is consistent with the source. We applied both the Boris and Tao solvers for 50 thermal crossings using a total of 1 × 10^6 steps, which gives a CFL ≈ 15.90. For each species, i.e., ions and electrons, we used 102,400 particles. In Figure B.1, we show the state of the beam and corresponding fields after 50 thermal crossings obtained with the Boris method. We observe satisfactory preservation of the steady-state fields for the problem. We note that the CFL for this test is quite large for the second-order wave solver; however, we only use the data from the wave solver along the boundary of the simulation domain. In these regions the fields are mostly flat, so this is an acceptable approximation, even with a diffusive solver. We also observe good agreement with the analytical solution around the axis of the beam, specifically in capturing the wells in the field. Results obtained with the Molei Tao integrator, using 𝜔 = 500, are shown in Figure B.2. The fields, which are shown before the final time, demonstrate a loss of the steady-state attributable to an excess of particles escaping from the beam. This causes the current density to spread outwards, resulting in changes in the potentials used to calculate the fields. We have not determined the source of this issue in the Tao method; however, given the other issues encountered with the approach, we ultimately decided that this integrator was no longer a viable option for experimentation and chose not to pursue it further.
Allowing for time variation in the fields requires that we now solve a system of two wave equations for the potentials,
\[
\frac{1}{\kappa^2} \frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1} \rho,
\qquad
\frac{1}{\kappa^2} \frac{\partial^2 A^{(3)}}{\partial t^2} - \Delta A^{(3)} = \sigma_2 J^{(3)},
\]
subject to outflow boundary conditions. In the Boris method, all derivatives for 𝜓 and 𝐴^(3), as well as 𝜓 itself, are computed with the BDF-2 scheme; however, the vector potential 𝐴^(3) is evolved explicitly using the central-2 method. To obtain the current density J^{𝑛+1}, which is required for the derivatives obtained with BDF-2, we use the iterative approach presented in section 4.4.2, which uses a Taylor expansion to create an initial guess for the current density. A total of 5 iterates are used in each time step. We ran the solver for 35 thermal crossings using a total of 3.5 × 10^6 steps, which gives a CFL ≈ 3.18. For each of the particle species, i.e., ions and electrons, we used 102,400 particles. In Figure B.3, we show the state of the beam and corresponding fields after 35 thermal crossings obtained with the Boris method. There is some slight dissipation in the regions surrounding the axis of the beam, which is caused by the BDF method. As mentioned earlier, the use of a finer mesh would improve the quality of the solution in these regions. Along the boundary, we can see a discrepancy with the analytical solution due to inaccuracies in the treatment of outflow boundary conditions by the wave solver. Future work will seek corrections to this behavior so that the fields and their derivatives will be more accurate along the boundary. Despite these slight inaccuracies, we observe satisfactory preservation of the steady-state fields for the problem.

4.6.5 The Expanding Beam Problem

We now apply the proposed methods to the expanding beam test problem [24]. This example is well-known for its sensitivity to issues concerning charge conservation, which makes it well-suited for evaluating methods used to enforce gauge conditions and involutions.
While this particular example is normally solved in cylindrical coordinates, our simulation is performed using a two-dimensional rectangular box that retains the fields 𝐸^(1) and 𝐸^(2), as well as 𝐵^(3). An injection zone, which is placed on one of the faces of the box, injects a steady beam of particles into the domain. The beam expands as particles move along the box due to the electric field and eventually settles into a steady-state. Particles are absorbed or "collected" once they reach the edge of the domain and are removed from the simulation. Based on the results of the previous test, we decided to abandon the Molei Tao integrator for this example; however, we recently became aware of another particle integrator that can be used to evolve the generalized momentum formulation used in this work. In [36], a semi-implicit Euler discretization was used in a mesh-free simulation of the VM system, in the Darwin limit. To avoid time derivatives of the potentials, the particle equations were cast in terms of a generalized momentum. This method, while only first-order accurate, is simple to implement, efficient, and, based on the results in [36], surprisingly accurate. Moreover, they point out that this approach can likely be generalized to obtain high-order extensions. Note that an overview of this integrator was presented in section 4.5.2. The generalized momentum formulation for this problem evolves the particle equations
\[
\frac{dx_i^{(1)}}{dt} = \frac{1}{r_i}\left( P_i^{(1)} - q_i A^{(1)} \right),
\qquad
\frac{dx_i^{(2)}}{dt} = \frac{1}{r_i}\left( P_i^{(2)} - q_i A^{(2)} \right),
\]
\[
\frac{dP_i^{(1)}}{dt} = -q_i \partial_x \psi + \frac{q_i}{r_i}\left[ \partial_x A^{(1)} \left( P_i^{(1)} - q_i A^{(1)} \right) + \partial_x A^{(2)} \left( P_i^{(2)} - q_i A^{(2)} \right) \right],
\]
\[
\frac{dP_i^{(2)}}{dt} = -q_i \partial_y \psi + \frac{q_i}{r_i}\left[ \partial_y A^{(1)} \left( P_i^{(1)} - q_i A^{(1)} \right) + \partial_y A^{(2)} \left( P_i^{(2)} - q_i A^{(2)} \right) \right].
\]
The formulation, shown above, is written in terms of scalar and vector potentials, which are obtained by solving Maxwell's equations in the Coulomb gauge.
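To illustrate the semi-implicit Euler update used for equations of this type, the sketch below steps a single particle in a static, hand-picked potential with A = 0 (so the generalized momentum reduces to 𝑟𝑣). This is a toy configuration with assumed values for 𝑞, 𝑟_𝑖, and 𝜓, not the solver's actual fields.

```python
q, r_i = 1.0, 1.0                 # toy charge and mass-like factor (assumed)
grad_psi = lambda x, y: (x, y)    # toy potential psi = (x^2 + y^2)/2

def step(x, y, px, py, dt):
    """Semi-implicit Euler: update the momentum first, then the position."""
    gx, gy = grad_psi(x, y)
    px -= dt * q * gx             # dP/dt = -q grad(psi) when A = 0
    py -= dt * q * gy
    x += dt * px / r_i            # dx/dt = (P - q A)/r with A = 0
    y += dt * py / r_i
    return x, y, px, py

def energy(x, y, px, py):
    return 0.5 * (px**2 + py**2) / r_i + q * 0.5 * (x**2 + y**2)

state = (1.0, 0.0, 0.0, 1.0)
e0 = energy(*state)
for _ in range(10000):
    state = step(*state, dt=1e-3)
# The semi-implicit update keeps the energy error bounded rather than drifting.
assert abs(energy(*state) - e0) / e0 < 1e-2
```

The momentum-then-position ordering is what makes the scheme semi-implicit; updating both with the old state (standard explicit Euler) produces secular energy growth on this same test.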
As shown in section 1.2.2.2, the complete system in the Coulomb gauge (1.45)-(1.47) can be simplified to obtain an equivalent system in which the vector potential A is purely rotational. The system in this form is given by
\[
-\Delta \psi = \frac{1}{\sigma_1} \rho, \tag{4.37}
\]
\[
\frac{1}{\kappa^2} \frac{\partial^2 A^{(1)}}{\partial t^2} - \Delta A^{(1)} = \sigma_2 J^{(1)}_{\text{rot}}, \tag{4.38}
\]
\[
\frac{1}{\kappa^2} \frac{\partial^2 A^{(2)}}{\partial t^2} - \Delta A^{(2)} = \sigma_2 J^{(2)}_{\text{rot}}. \tag{4.39}
\]
Note that we have abused notation here, since the vector potentials 𝐴^(1) and 𝐴^(2), as written above, are actually the rotational components of A. Along the boundary of the domain, the electric and magnetic fields are prescribed perfectly electrically conducting (PEC) boundary conditions, which, in two spatial dimensions, is equivalent to enforcing homogeneous Dirichlet boundary conditions on the potentials 𝜓, 𝐴^(1), and 𝐴^(2). The rotational part of the current density is obtained by solving the elliptic equation
\[
-\Delta \eta = \partial_x J^{(1)} + \partial_y J^{(2)}, \tag{4.40}
\]
which is then used to adjust the current density according to
\[
J^{(1)}_{\text{rot}} = J^{(1)} + \partial_x \eta, \tag{4.41}
\]
\[
J^{(2)}_{\text{rot}} = J^{(2)} + \partial_y \eta. \tag{4.42}
\]
Since the problem is PEC, there can be no currents or charges on the boundary. This suggests we enforce homogeneous Neumann boundary conditions for equation (4.40). Despite the projection (4.40)-(4.42) used for the current density, irrotational components of the vector potential can be introduced through the discretization of the wave equations (4.38) and (4.39). After solving the wave equations, we extract the rotational components of the vector potential by first solving the elliptic equation
\[
-\Delta \xi = \partial_x A^{(1)} + \partial_y A^{(2)}. \tag{4.43}
\]
In the second step, the gradient of 𝜉 is used to remove the irrotational parts of the potential via
\[
A^{(1)}_{\text{rot}} = A^{(1)} + \partial_x \xi, \tag{4.44}
\]
\[
A^{(2)}_{\text{rot}} = A^{(2)} + \partial_y \xi. \tag{4.45}
\]
As in the projection step for the current, we solve the elliptic equation (4.43) using homogeneous Neumann boundary conditions.
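The projection in (4.40)-(4.42) can be prototyped on a small grid. The sketch below is a simplified stand-in: it uses a periodic domain and a Gauss-Seidel Poisson solve purely for illustration (the thesis uses homogeneous Neumann conditions and a sparse direct solver), and the backward-difference divergence with a forward-difference gradient is chosen so that their composition is exactly the five-point Laplacian, letting the cleaning cancel the divergence to solver tolerance.

```python
import math

n, h = 16, 1.0 / 16  # small periodic grid, purely illustrative

def div(j1, j2):
    """Backward-difference divergence on a periodic grid."""
    return [[(j1[i][k] - j1[i - 1][k]) / h + (j2[i][k] - j2[i][k - 1]) / h
             for k in range(n)] for i in range(n)]

def clean(j1, j2, sweeps=1000):
    """Solve -Lap(eta) = div J with Gauss-Seidel, then set J += grad(eta)."""
    d = div(j1, j2)
    eta = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            for k in range(n):
                nb = (eta[(i + 1) % n][k] + eta[i - 1][k]
                      + eta[i][(k + 1) % n] + eta[i][k - 1])
                eta[i][k] = (nb + h * h * d[i][k]) / 4.0
    # Forward-difference gradient: the adjoint-compatible choice.
    for i in range(n):
        for k in range(n):
            j1[i][k] += (eta[(i + 1) % n][k] - eta[i][k]) / h
            j2[i][k] += (eta[i][(k + 1) % n] - eta[i][k]) / h
    return j1, j2

# A deliberately non-solenoidal current field.
j1 = [[math.sin(2 * math.pi * i * h) for k in range(n)] for i in range(n)]
j2 = [[math.cos(2 * math.pi * k * h) for k in range(n)] for i in range(n)]
before = max(abs(v) for row in div(j1, j2) for v in row)
j1, j2 = clean(j1, j2)
after = max(abs(v) for row in div(j1, j2) for v in row)
assert after < 1e-6 * before
```

The same two-step pattern (Poisson solve for a correction potential, then a gradient update) implements the vector potential cleaning (4.43)-(4.45).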
The sequence of corrections described by (4.43)-(4.45) is identical to the elliptic divergence cleaning method discussed in section 4.3.4 to enforce the gauge condition. As with the Bennett problem, we shall compare results obtained using the generalized momentum formulation with the formulation that employs the Boris method, in which the equations of motion for the particles are expressed in terms of E and B and take the form
\[
\frac{dx_i^{(1)}}{dt} = v_i^{(1)},
\qquad
\frac{dx_i^{(2)}}{dt} = v_i^{(2)},
\]
\[
\frac{dv_i^{(1)}}{dt} = \frac{q_i}{r_i}\left( E^{(1)} + v_i^{(2)} B^{(3)} \right),
\qquad
\frac{dv_i^{(2)}}{dt} = \frac{q_i}{r_i}\left( E^{(2)} - v_i^{(1)} B^{(3)} \right).
\]
A key difference with the generalized momentum formulation is that the fields in the Boris method use the Lorenz gauge, rather than the Coulomb gauge. Therefore, the Boris approach requires the fields E and B, which are obtained by solving the system
\[
\frac{1}{\kappa^2} \frac{\partial^2 \psi}{\partial t^2} - \Delta \psi = \frac{1}{\sigma_1} \rho,
\]
\[
\frac{1}{\kappa^2} \frac{\partial^2 A^{(1)}}{\partial t^2} - \Delta A^{(1)} = \sigma_2 J^{(1)},
\]
\[
\frac{1}{\kappa^2} \frac{\partial^2 A^{(2)}}{\partial t^2} - \Delta A^{(2)} = \sigma_2 J^{(2)}.
\]
Using equation (1.11), the potentials 𝜓, 𝐴^(1), and 𝐴^(2) can be used to obtain E and B for the particle updates. At the present time, we do not have a working method for enforcing the Lorenz gauge condition, so a cleaning method is not used in this approach. The formulation involving the Boris push could certainly be modified to work with the Coulomb gauge rather than the Lorenz gauge condition, and this will be explored in later work. To set up the simulation, we first create a box specified by the region [0, 1] × [−1/2, 1/2], which has been normalized by some length scale 𝐿. We shall further assume that the beam consists only of electrons, which are prescribed some injection velocity 𝑣_injection and travel along the 𝑥-axis of the box. An estimate of the crossing time for a particle can be obtained using the injection velocity and the length of the domain, which sets the time scale 𝑇 for the simulation. The duration of the simulation is given in terms of particle crossings, which are then used to set the time step Δ𝑡.
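A minimal sketch of one Boris velocity update for the 2D equations above follows (half electric kick, magnetic rotation, half kick). The values of 𝑞/𝑚, the time step, and the field samples are placeholders; the check exploits the fact that the Boris rotation preserves the particle speed exactly when E = 0.

```python
import math

def boris_step_2d(vx, vy, ex, ey, b3, q_over_m, dt):
    """One Boris velocity update in 2D with out-of-plane magnetic field b3."""
    # Half acceleration by E.
    vx += 0.5 * dt * q_over_m * ex
    vy += 0.5 * dt * q_over_m * ey
    # Rotation about the z-axis due to the magnetic field.
    t = 0.5 * dt * q_over_m * b3
    s = 2.0 * t / (1.0 + t * t)
    vpx = vx + vy * t          # v' = v- + v- x t  (t points along z)
    vpy = vy - vx * t
    vx += vpy * s              # v+ = v- + v' x s
    vy -= vpx * s
    # Second half acceleration by E.
    vx += 0.5 * dt * q_over_m * ex
    vy += 0.5 * dt * q_over_m * ey
    return vx, vy

# With E = 0 the Boris rotation preserves the particle speed exactly.
vx, vy = 1.0, 0.5
speed0 = math.hypot(vx, vy)
for _ in range(1000):
    vx, vy = boris_step_2d(vx, vy, 0.0, 0.0, 2.0, -1.0, 0.05)
assert abs(math.hypot(vx, vy) - speed0) < 1e-10
```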
At each time step, particles are initialized in an injection region specified by [−𝐿_ghost, 0) × [−𝑅_𝑏, 𝑅_𝑏], where 𝑅_𝑏 is the radius of the beam, and the width of the injection zone 𝐿_ghost is chosen such that 𝐿_ghost = 𝑣_injection Δ𝑡. This ensures that all particles initialized in the injection zone will be in the domain after one time step. Particle positions in the injection region are set according to samples taken from a uniform distribution, and the number of particles injected for a given time step is set by the injection rate. In each time step, the injection procedure is applied before the particle position update, so that, at the end of the time step, the injection zone is empty. To prevent the introduction of an impulse response in the fields due to the initial injection of particles, we apply a linear ramp function to the macro-particle weights whose duration is one particle crossing. A summary of the parameters used to set up the problem is presented in Table 4.6.

Parameter | Value
Beam radius (𝑅_𝑏) [m] | 8.0 × 10⁻³
Average number density (𝑛̄) [m⁻³] | 7.8025 × 10¹⁴
Largest box dimension (𝐿) [m] | 1.0 × 10⁻¹
Electron injection velocity (𝑣_injection) [m/s] | 5.0 × 10⁷
Electron crossing time (𝑇) [s] | 2.0 × 10⁻⁹
Electron macro-particle weight (𝑤_mp) [non-dimensional] | 1.67 × 10⁻⁵
Injection rate (per Δ𝑡) [1/s] | 10

Table 4.6: Table of the parameters used in the setup for the expanding beam problem.

At some point, the particles will reach the boundary and should be removed from the simulation. A list is a natural choice for managing the injection and deletion of particles, since particles can be easily added or removed. However, the types of problems we are considering require many particles, so having to constantly resize a list can become quite expensive. Moreover, implementations of lists are not cache-friendly, so we lose opportunities for vectorization.
Instead, we use arrays whose lengths are determined by (over)estimating the total number of particles in the domain at any given time. We estimate the number of particles by first calculating the number of time steps required for a particle to cross the box, which is then multiplied by the injection rate. Finally, we double this result for additional safety, since the beam will spread to some degree. We store a running total of the number of particles in the domain at any given time, which identifies the entries of the array to update, with entries beyond this value being considered "deleted." Consequently, at any given time step, we need to sort the particle arrays so that particles outside of the domain are placed after this counter variable. This sort step is performed by first creating a Boolean array which indicates whether each particle is outside the domain. A sorting method is then applied to the Boolean array, which is quite fast, as the data is mostly sorted. Then, the sorted Boolean array is used to remap the entries of the particle arrays. We first test the formulation which combines the Boris method with the fields in the Lorenz gauge (1.43). Initially, we applied the time-centered method to update the components of the vector potentials; however, the lack of dissipation in the time-centered approach led to noise in the fields due to dispersion error, which was further amplified by the application of the time derivative. Instead, we used the BDF-2 method, which is dissipative, to perform these calculations. The effect on the time derivatives, which is shown in Figure B.4, is quite apparent. We ran the simulation with this configuration to a final time of 1000 particle crossings. A mesh of 128 × 128 grid points was used in the calculation, which gave a CFL ≈ 0.761. The results for this experiment are shown in Figure B.5. The beam, itself, is surprisingly stable and does not display significant issues associated with violating the gauge condition.
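Returning briefly to the particle bookkeeping described above, the capacity estimate and the sort-and-remap step can be sketched as follows. This is a simplified stand-in using Python lists and a stable sort on the Boolean flags; the production code operates on preallocated arrays.

```python
def capacity(steps_per_crossing, injection_rate):
    """Overestimate of live particles: crossing steps x rate, doubled for safety."""
    return 2 * steps_per_crossing * injection_rate

def compact(xs, ys, alive_count, in_domain):
    """Stable partition: keep in-domain particles in the first `count` slots.

    `in_domain(x, y)` flags live particles; entries past the returned count
    are treated as deleted and may be overwritten by newly injected particles.
    """
    flags = [in_domain(xs[i], ys[i]) for i in range(alive_count)]
    # Stable sort on the Boolean key: in-domain entries first, in order.
    order = sorted(range(alive_count), key=lambda i: not flags[i])
    xs[:alive_count] = [xs[i] for i in order]
    ys[:alive_count] = [ys[i] for i in order]
    return sum(flags)  # new running total of in-domain particles

xs = [0.5, 1.2, 0.3, -0.1, 0.9]
ys = [0.0, 0.0, 0.0, 0.0, 0.0]
inside = lambda x, y: 0.0 <= x <= 1.0
count = compact(xs, ys, 5, inside)
assert count == 3
assert xs[:3] == [0.5, 0.3, 0.9]   # stable: relative order preserved
assert capacity(100, 10) == 2000
```

The stability of the sort is what keeps the surviving particles in their original relative order, so the remap disturbs memory locality as little as possible.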
Along the edge of the beam, we observe some small oscillations that will eventually grow over time, causing the beam to break apart. This is also reflected in the growth of the error in the Lorenz gauge toward the end of the run. In Figure B.6, we show the smooth potentials and their partial derivatives, which are constructed using the proposed BDF-2 wave solver. As mentioned earlier, we have not had success with enforcing the Lorenz gauge in this particular work, but it is something we plan to revisit in the future. Next, we test the Coulomb gauge formulation, which applies the AEM for time integration [36] (discussed in section 4.5.2). In this experiment, we ran the code for 3000 particle crossings without a cleaning technique to enforce the gauge condition. We used the same 128 × 128 mesh for the fields as in the Boris approach. The Poisson solves are performed using second-order finite-differences with a sparse linear solver. Derivatives of the data obtained from the Poisson solves are computed with second-order finite-differences. After 3000 particle crossings, the structure of the beam is largely destroyed due to violations in the gauge condition. The results presented in Figure B.7 show the beam at an earlier time, corresponding to 2000 crossings, at which point the striations and clumping in the beam are quite apparent. In Figure B.8, we show the beam after 3000 crossings, which uses the same approach but applies the cleaning procedure described by equations (4.43)-(4.45). The impact of the cleaning is quite remarkable, as the integrity of the beam is no longer compromised. The cleaning approach displays some violations in the gauge condition at the boundaries of the domain, which are expected because particles enter or leave the domain at these points. On the interior, the fluctuations are in the sixth decimal position, which can likely be improved through the use of a more accurate Poisson solver along with additional particles.
The smooth potentials and their derivatives, which were obtained with this formulation, are presented in Figure B.9. The derivatives used to evaluate the gauge condition show some jumps along the boundary where particles enter and leave, which we plan to investigate in greater detail. Comparing Figures B.6 and B.9, it is interesting to see the structural differences in the potentials (and their derivatives) obtained with the different formulations. As mentioned earlier, we have not yet constructed a functioning cleaning method for the Lorenz gauge formulation. In spite of this, we find the Lorenz gauge formulation to be quite appealing because it avoids the use of elliptic solvers. For this reason, we combined the AEM for the particles with a first-order BDF field solver. No cleaning method is used for the fields. We ran the simulation out to 3000 particle crossings on the same 128 × 128 mesh for the fields. The beam, which remains surprisingly intact without a cleaning method, is shown in Figure B.10a. The time trace of the Lorenz gauge error, which is displayed in Figure B.10b, shows some oscillations that appear to be bounded. The fields from the same experiment, at the final step, are presented in Figure B.11. The fields are quite similar to those obtained with the second-order BDF scheme combined with the Boris method presented in Figure B.6. Despite the low-order accuracy, it can be shown that a first-order time discretization of the fields is consistent with a discrete form of the Lorenz gauge (see B.1 for details). Furthermore, the low-order time accuracy of the fields does not introduce significant dissipation. While the goal of our work is to build higher-order field solvers for plasma applications, this result is interesting due to its practicality. More specifically, it demonstrates that it is possible to obtain a reasonable solution in an inexpensive manner. The last point we wish to mention concerns the metrics used to assess charge conservation.
One of the advantages of working with a gauge formulation is that the metrics for charge conservation are embedded in the gauge condition, which can be calculated using the derivatives of the potentials. Compare this with measurements based on Gauss' law, ∇ · E = 𝜌/𝜎_1, which utilizes particle data that may be under-resolved by the mesh. If either 𝜌 or the divergence term is not smooth, this could give the impression that the method is ineffective at conserving charge. As an example, in the last experiment, which enforced the Coulomb gauge through an elliptic method, we found that the error in the gauge condition was quite small, especially away from the boundary. For the same problem, the point-wise error in Gauss' law, which is shown in Figure B.12b, indicates that the method is not conserving charge, even in regions away from the boundaries. On the other hand, if we compute the residual
\[
\int_\Omega \left( \nabla \cdot \mathbf{E}(\mathbf{x}) - \frac{1}{\sigma_1} \rho(\mathbf{x}) \right) dV_{\mathbf{x}}
\approx \sum_{i,j} \left( \nabla \cdot \mathbf{E}(x_i, y_j) - \frac{1}{\sigma_1} \rho(x_i, y_j) \right) \Delta x_i \Delta y_j, \tag{4.46}
\]
then we can say whether or not the method conserves charge in a bulk sense. In the above definition, the divergence is interpreted in a discrete sense and the sum runs over the mesh points so that Δ𝑥_𝑖 Δ𝑦_𝑗 is the volume of the grid-cell (𝑖, 𝑗). In Figure B.12a, we plot the bulk error as a function of time, which shows far smaller violations and is symmetric about zero. The large violations in Gauss' law for the case of cleaning (shown in Figure B.12b) are likely the result of the treatment used for the divergence term. If we compare this with the same formulation that skips the cleaning step (shown in Figure B.13), then we can see that the point-wise violations occur on a much greater scale. A similar discrepancy was observed in [19], where the ℓ² error in Gauss' law was shown to be O(1), even when cleaning methods were applied.
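The bulk residual (4.46) takes only a few lines to compute. The sketch below uses central differences over the interior of a uniform grid and a manufactured field E = (x, y) with 𝜌 = 2𝜎₁, for which the residual should vanish to machine precision; the grid size and field values are illustrative.

```python
def bulk_gauss_residual(ex, ey, rho, dx, dy, sigma1=1.0):
    """Discrete form of (4.46): sum over interior points of (div E - rho/sigma1) dA."""
    nx, ny = len(ex), len(ex[0])
    total = 0.0
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            div = ((ex[i + 1][j] - ex[i - 1][j]) / (2 * dx)
                   + (ey[i][j + 1] - ey[i][j - 1]) / (2 * dy))
            total += (div - rho[i][j] / sigma1) * dx * dy
    return total

n, h = 32, 1.0 / 31
xs = [i * h for i in range(n)]
ex = [[xs[i] for j in range(n)] for i in range(n)]   # E = (x, y): div E = 2
ey = [[xs[j] for j in range(n)] for i in range(n)]
rho = [[2.0 for j in range(n)] for i in range(n)]
assert abs(bulk_gauss_residual(ex, ey, rho, h, h)) < 1e-12
```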
It is also important to note that if components of the electric field display non-smooth features, such as cusps or steep gradients, then finite-difference derivatives over modest stencils will show nonphysical oscillations, which directly impact the point-wise error in Gauss' law. One approach we have considered to deal with this is the application of WENO derivatives, which we plan to explore in future work.

4.6.6 A Narrow Beam Problem and the Effect of Particle Count

In this last test, we slightly modify our setup from the previous example in section 4.6.5 to construct a beam with a lower density, which will be used to conduct a certain type of refinement study. This modification primarily concerns the prescription of particle weights. In Table 4.6 for the previous problem, we provided a value for the number density 𝑛̄ in addition to a particle weight 𝑤_mp. The particle weighting from the previous example was obtained by essentially hard-coding an estimate for the number of simulation particles. While it is not incorrect to do this, the currents in the beam may become too large as the injection rate increases, causing particles to move back to the injection zone. To fix this problem, we compute the particle weight 𝑤_mp using an estimate for the number of particles in the beam that accounts for the injection rate. Since we discussed how to estimate this number in the previous section, we omit these details for brevity. Therefore, the particle weight no longer needs to be prescribed in the setup, and it naturally adjusts according to the injection rate specified by the user. This modification ultimately allows us to examine the effect of the particle count on the solution by fixing the number density and varying the particle injection rate. Aside from this modification, all other details, e.g., the models, injection procedure, etc., are identical to those provided in section 4.6.5, so we shall not describe them further.
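One way to realize this adaptive weighting is sketched below. The specific formula is our own assumption for illustration (physical particle count in the beam, per unit depth, divided by the injection-rate-based overestimate of simulation particles from the previous section); the thesis does not spell out the exact expression, and the inputs here are placeholder values.

```python
def particle_weight(n_bar, beam_area, steps_per_crossing, injection_rate):
    """Hypothetical weight estimate: physical particles per simulation particle.

    n_bar * beam_area stands in for the physical particle count (per unit
    depth); the denominator is the injection-rate-based overestimate of the
    number of simulation particles, so the weight adapts to the rate.
    """
    n_physical = n_bar * beam_area
    n_simulated = 2 * steps_per_crossing * injection_rate
    return n_physical / n_simulated

# Doubling the injection rate halves the weight, keeping total charge fixed.
w1 = particle_weight(1.55e14, 1.6e-3, 200, 100)
w2 = particle_weight(1.55e14, 1.6e-3, 200, 200)
assert abs(w1 - 2 * w2) < 1e-9 * w1
```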
We test the effect of an increased particle count on the numerical solution using the solver that combines the AEM for particles with the BDF-2 field solver in the Coulomb gauge, along with elliptic projections to enforce the gauge condition. We use the same 128 × 128 mesh as in the previous problem and an injection rate of 400 particles per time step. The remaining parameters are specified in Table 4.7.

Parameter | Value
Beam radius (𝑅_𝑏) [m] | 8.0 × 10⁻³
Average number density (𝑛̄) [m⁻³] | 1.552258 × 10¹⁴
Largest box dimension (𝐿) [m] | 1.0 × 10⁻¹
Electron injection velocity (𝑣_injection) [m/s] | 5.0 × 10⁷
Electron crossing time (𝑇) [s] | 2.0 × 10⁻⁹

Table 4.7: Table of the parameters used in the setup for the narrow beam problem.

We run the problem for a total of 5 particle crossings, which is sufficient for the refinement purposes of this problem, using the same CFL ≈ 0.761 as the previous problem. We plot the narrow beam and the gauge error in Figure B.14. The increased smoothness in the charge density due to the increase in the number of particles is apparent. Moreover, the error in the gauge condition seems to be quite small away from the boundary, where particles are injected and removed. The corresponding fields and derivatives, which are used to move the particles, are presented in Figure B.15. The fields appear to be smooth. We also plot the "bulk" error in Gauss' law as a function of time and show the point-wise error as a surface in Figure B.16. The bulk error shows a jump at the time that corresponds to the first crossing, at which point some particles begin to leave the domain. Shortly after this jump, the error settles. As before, the point-wise violations in Gauss' law seem to indicate a loss of charge conservation. For convenience, we show the derivatives of the electric field, which are used to measure the error in Gauss' law, in Figure B.17. We note that the derivative in the 𝑥-direction shows some oscillations in the interior of the
beam, and the 𝑦-derivative contains a mix of sharp and uniform features. Lastly, we show the effect of increasing the particle count on the gauge condition by considering injection rates of 100, 200, and 400 particles per time step. These results are presented in Figure B.18. While the error at the boundaries remains largely unchanged across the runs, there is a noticeable improvement in the error on the interior. Specifically, we see the more jagged features on the interior become smoother and smaller in size due to the increased particle count.

4.7 Conclusion

In this work, we developed a PIC method by coupling dimensionally-split integral equation solvers for the fields with standard and non-standard time integration methods for particles. After introducing the concepts of a general PIC method, we presented several approaches for enforcing gauges and charge conservation. We then introduced methods for integrating the particle equations of motion that are necessitated by the formulations considered in this work. The discussion was primarily focused on two methods designed for problems with non-separable Hamiltonians. While the time integration methods employed in this work are not new, the novel contribution of our work is that we demonstrated how existing methods for particles can effectively leverage the proposed field solvers in simulations of plasmas. This includes the construction of spatial derivatives, which can be obtained directly from the field solvers. To this end, we applied the proposed methods to several application problems involving beams. Results were compared with standard methods based on leapfrog time integration, and in nearly all examples, the proposed methods recovered similar behavior. The results not only validate the generalized momentum formulation, but also demonstrate the versatility and flexibility of the proposed field solvers in simulating plasma phenomena.
CHAPTER 5

CONCLUSION AND FUTURE DIRECTIONS

In this thesis, we have presented a collection of algorithms for evolving fields in plasmas with specific applications to the Vlasov-Maxwell system. Maxwell's equations are reformulated in terms of the Lorenz gauge, as well as the Coulomb gauge, to obtain systems involving wave equations. These wave equations are solved using the methods proposed in this work and are combined with a particle-in-cell method [2, 3] to simulate plasmas. This particle description of the Vlasov equation couples directly to the fields, which are solved using a mesh. We considered two formulations of the equations for the particles. First, a standard approach was presented, which is based on the Newton-Lorentz equations, while the other used a generalized Hamiltonian to write the particle equations in terms of the potentials (and their derivatives) used in the gauge formulation. The advantage offered by the generalized Hamiltonian framework is that it eliminates the need to compute time derivatives, which reduce the time accuracy of the fields and can lead to instabilities in certain limits [35, 36]. In the first part of this thesis, we developed and extended methods for scalar wave equations, which can be used to update the potentials in these formulations. Our developments are based on a class of algorithms known as the MOL𝑇, which combines a dimensional splitting technique with a one-dimensional integral equation method. The resulting methods are unconditionally stable, can address geometry, and are O(𝑁), where 𝑁 is the number of mesh points. Our work contributed methods to construct derivatives of the potentials for this class of dimensionally-split methods. These derivatives, which are used to evolve particles, were constructed directly from the data used in the one-dimensional integral solution. Consequently, the methods naturally inherit the speed, stability, and geometric flexibility offered by the base solver.
Moreover, we established, through refinement experiments, that these derivatives converge at the same rate, in both space and time, as the base method. We also presented a more systematic treatment of outflow boundary conditions for the second-order (in time) methods, including refinement studies, which were not presented in earlier work. While the outflow procedure used in this work is convergent, more work should be done to reduce the magnitude of the error and improve the rate of convergence. The core algorithms used in the MOL𝑇 and the related class of successive convolution methods were also explored in the context of high-performance computing environments. We developed a novel domain decomposition approach, which ultimately allowed the method to be used on distributed memory computing platforms. Shared memory algorithms were developed using the Kokkos performance portability library, which allows a user to write a single version of a code that can be executed on various computing devices with the architecture-dependent details being managed by the library. We optimized predominant loop structures in the code and settled on a blocking pattern that prescribed parallelism at multiple levels. Moreover, the proposed iteration pattern is flexible enough to work with shared memory features available on GPU systems. While the results indicated a high sensitivity to data locality, which is a feature of memory-bound algorithms, the methods were shown to be quite fast. Scaling experiments demonstrated that the proposed algorithms could sustain an update rate in excess of 2.5 × 10⁶ grid points per second, per physical core. On a shared memory system with 40 cores per node, this translates to an update rate of 1 × 10⁸ grid points per second, per node. This was true even for the largest experiment conducted in that same study, which used nearly 35 billion grid points.
We also presented particle-in-cell methods for the Vlasov-Maxwell system, which leveraged the methods for fields and derivatives developed in this work. We showed how to combine the proposed methods with standard and non-standard time integration methods for the particles and applied these methods to a variety of plasma test problems. The focus on beam problems is primarily motivated by the preference for particle methods over mesh-based discretizations, which are overly diffusive near the edge of the beam. Our results are generally encouraging and demonstrate the capabilities of the proposed field solvers in simulating plasma phenomena. Additionally, our results serve to validate the generalized Hamiltonian formulation, which will be the foundation of our future work. The results presented in this thesis suggest several interesting directions for future research. In terms of field solvers, the methods used for outflow boundary conditions should be reconsidered. This is especially true in the case of the more intricate methods based on successive convolution, which, with the current methods, require a fairly substantial amount of storage for the time history along the boundary. The success of the BDF methods suggests that higher-order approaches should be considered and analyzed in a rigorous fashion. In light of the stability issues associated with elongated time stencils, it may be worthwhile to construct higher-order methods using extrapolation or other correction techniques, which can simultaneously address the splitting error associated with multi-dimensional problems. Extensions of the parallel algorithms presented in this thesis should also be considered. First, the current approach should be evaluated on GPUs. A generalization of the decomposition, which extends beyond nearest-neighbors and eliminates the artificial CFL-like condition, should also be evaluated.
This would provide a clear path for addressing problems involving non-uniform mesh patches, which are frequently encountered in algorithms that support adaptivity. Future work should also evaluate other fast summation algorithms that are better aligned with the vectorization capabilities supported by new hardware. The research directions above also have implications for the development of new solvers for the Vlasov-Maxwell system. In the near term, we plan to revisit the formulation involving the Lorenz gauge, which avoids the use of elliptic solvers. This suggests a pairing with the hyperbolic divergence cleaning method discussed in this thesis, which requires effective approaches for enforcing outflow boundary conditions. On the other hand, we believe the Coulomb gauge formulation has great potential despite the three elliptic solves required to properly enforce the gauge condition. In order to support problems with geometry, the Poisson equations should be discretized as integral equations rather than with finite differences. This suggestion is motivated by the access to analytical derivatives offered by integral equation methods, which are aligned with the techniques used in this work. Furthermore, there have been several interesting developments involving fast summation methods using a technique known as barycentric Lagrange interpolation [84]. These approaches, which are kernel-independent, have also been explored on GPUs [85, 86] and show great promise in addressing challenges posed by new computational hardware.

APPENDICES

APPENDIX A

APPENDIX FOR CHAPTER 3

A.1 Example for Linear Advection

Suppose we wish to solve the 1D linear advection equation:

∂_t u + c ∂_x u = 0,  (x, t) ∈ (a, b) × ℝ⁺,  (A.1)

where c > 0 is the wave speed; we leave the boundary conditions unspecified. The procedure for c < 0 is analogous. Discretizing (A.1) in time with backward Euler yields a semi-discrete equation of the form

(u^{n+1}(x) − u^n(x))/Δt + c ∂_x u^{n+1}(x) = 0.

If we rearrange this, we obtain a linear equation of the form

L[u^{n+1}; α](x) = u^n(x),  (A.2)

where we have used

α := 1/(cΔt),  L := I + (1/α) ∂_x.

By reversing the order in which the discretization is performed, we have created a sequence of BVPs at discrete time levels. If we had discretized equation (A.1) using the MOL formalism, then L would be an algebraic operator. To solve equation (A.2) for u^{n+1}, we analytically invert the operator L. Notice that this equation is actually an ODE, which is linear, so the problem can be solved using methods developed for ODEs. If we apply the integrating factor method to the problem, we obtain

∂_x ( e^{αx} u^{n+1}(x) ) = α e^{αx} u^n(x).

To integrate this equation, we use the fact that characteristics move to the right, so integration is performed from a to x. After rearranging the result, we arrive at the update equation

u^{n+1}(x) = e^{−α(x−a)} u^{n+1}(a) + α ∫_a^x e^{−α(x−s)} u^n(s) ds,
           ≡ e^{−α(x−a)} A^{n+1} + α ∫_a^x e^{−α(x−s)} u^n(s) ds,
           ≡ L^{−1}[u^n; α](x).

This update displays the origins of the implicit behavior of the method. While convolutions are performed on data from the previous time step, the boundary terms are taken at time level n + 1. Now that we have obtained the update equation, we need to apply the boundary conditions. Clearly, if the problem specifies a Dirichlet boundary condition at x = a, then A^{n+1} = u^{n+1}(a). We can compute a variety of boundary conditions using the update equation

u^{n+1}(x) = e^{−α(x−a)} A^{n+1} + I[u^n; α](x),

where

I[u^n; α](x) = α ∫_a^x e^{−α(x−s)} u^n(s) ds.

For example, with periodic boundary conditions, we would need to satisfy

u^{n+1}(a) = u^{n+1}(b),  (A.3)
∂_x u^{n+1}(a) = ∂_x u^{n+1}(b).  (A.4)

Applying condition (A.3), we find that

A^{n+1} = e^{−α(b−a)} A^{n+1} + α ∫_a^b e^{−α(b−s)} u^n(s) ds.

Solving this equation for A^{n+1} shows that

A^{n+1} = I[u^n; α](b) / (1 − μ),  with μ = e^{−α(b−a)}.

Alternatively, we could have started with (A.4), which would give an identical solution.
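To make the periodic update above concrete, the following minimal sketch (an illustrative Python/NumPy script, not the solver developed in this work) advances the backward-Euler update for (A.1). The global convolution I[u^n; α] is evaluated with the O(N) exponential recursion, and, as one common second-order choice, the local integrals use weights obtained by integrating the kernel exactly against a piecewise-linear interpolant of u; the boundary coefficient A^{n+1} = I[u^n; α](b)/(1 − μ) follows the periodic derivation above.

```python
import numpy as np

def molt_advection_step(u, dx, alpha):
    """One backward-Euler MOLT step for u_t + c u_x = 0 with c > 0, periodic BCs.

    u holds nodal values on x_j = a + j*dx, j = 0, ..., N, with u[0] == u[N].
    """
    N = u.size - 1
    nu = alpha * dx
    d = np.exp(-nu)
    # Weights from integrating e^{-alpha(x_j - s)} exactly against a
    # piecewise-linear interpolant of u on [x_{j-1}, x_j] (second order).
    P = (nu - 1.0 + d) / nu
    Q = (1.0 - (1.0 + nu) * d) / nu
    # O(N) recursion: I_j = e^{-nu} I_{j-1} + (local integral on [x_{j-1}, x_j]).
    I = np.zeros(N + 1)
    for j in range(1, N + 1):
        I[j] = d * I[j - 1] + Q * u[j - 1] + P * u[j]
    mu = np.exp(-nu * N)                      # e^{-alpha(b - a)}
    A = I[N] / (1.0 - mu)                     # periodic boundary coefficient A^{n+1}
    x = dx * np.arange(N + 1)
    return np.exp(-alpha * x) * A + I         # u^{n+1} = e^{-alpha(x - a)} A^{n+1} + I

# Advect a sine wave on [0, 1]: unconditionally stable, first order in time.
N, c, dt = 200, 1.0, 0.01
dx = 1.0 / N
alpha = 1.0 / (c * dt)
x = dx * np.arange(N + 1)
u = np.sin(2 * np.pi * x)
for _ in range(10):
    u = molt_advection_step(u, dx, alpha)
err = np.max(np.abs(u - np.sin(2 * np.pi * (x - c * 10 * dt))))
```

Refining Δt recovers the expected first-order accuracy of backward Euler; a Dirichlet condition at x = a would simply replace A^{n+1} with the prescribed boundary data.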
While this particular procedure is only applicable to linear problems, this exercise motivates some of the choices made to define operators in the method.

A.2 Kokkos Kernels

This section provides listings that outline the general format of the Kokkos kernels used in this work. Specifically, we provide structures for the tiled/blocked algorithms (Scheme A.1) in addition to the kernel that executes the fast summation method along a line (Scheme A.2).

// Distribute tiles of the array to teams of threads dynamically
Kokkos::parallel_for("team loop over tiles",
    team_policy(total_tiles, Kokkos::AUTO()),
    KOKKOS_LAMBDA(team_type &team_member) {
        // Determine the flattened tile index via the team rank
        // and compute the unflattened indices of the tile T_{i,j}
        const int tile_idx = team_member.league_rank();
        const int tj = tile_idx % num_tiles_x;
        const int ti = tile_idx / num_tiles_x;
        // Retrieve tile sizes & offsets and
        // obtain subviews of the relevant grid data on tile T_{i,j}
        // ...
        // Use a team's thread range over the lines
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team_member, Ny_tile),
            [&](const int iy) {
                // Slice to extract a subview of my line's data and
                // call line methods which use vector loops
                // ...
            });
    });

Scheme A.1: An example of a coarse-grained parallel nested loop structure.

// Distribute the threads to lines
Kokkos::parallel_for("Fast sweeps along x",
    range_policy(0, Ny),
    KOKKOS_LAMBDA(const int iy) {
        // Slice to obtain the local integrals and apply
        // the convolution kernel to the entire line
        // ...
    });

Scheme A.2: Kokkos kernel for the fast-convolution algorithm.

A.3 WENO Quadrature

We provide the various expressions for the coefficients and smoothness indicators used in the reconstruction process for J_R^{(r)}.
Defining ν ≡ αΔx, the coefficients for the fixed stencils are given in [44] as follows:

c_{−3}^{(0)} = (6 − 6ν + 2ν² − (6 − ν²) e^{−ν}) / (6ν³),
c_{−2}^{(0)} = −(6 − 8ν + 3ν² − (6 − 2ν − 2ν²) e^{−ν}) / (2ν³),
c_{−1}^{(0)} = (6 − 10ν + 6ν² − (6 − 4ν − ν² + 2ν³) e^{−ν}) / (2ν³),
c_{0}^{(0)} = −(6 − 12ν + 11ν² − 6ν³ − (6 − 6ν + 2ν²) e^{−ν}) / (6ν³),

c_{−2}^{(1)} = (6 − ν² − (6 + 6ν + 2ν²) e^{−ν}) / (6ν³),
c_{−1}^{(1)} = −(6 − 2ν − 2ν² − (6 + 4ν − ν² − 2ν³) e^{−ν}) / (2ν³),
c_{0}^{(1)} = (6 − 4ν − ν² + 2ν³ − (6 + 2ν − 2ν²) e^{−ν}) / (2ν³),
c_{1}^{(1)} = −(6 − 6ν + 2ν² − (6 − ν²) e^{−ν}) / (6ν³),

c_{−1}^{(2)} = (6 + 6ν + 2ν² − (6 + 12ν + 11ν² + 6ν³) e^{−ν}) / (6ν³),
c_{0}^{(2)} = −(6 + 4ν − ν² − 2ν³ − (6 + 10ν + 6ν²) e^{−ν}) / (2ν³),
c_{1}^{(2)} = (6 + 2ν − 2ν² − (6 + 8ν + 3ν²) e^{−ν}) / (2ν³),
c_{2}^{(2)} = −(6 − ν² − (6 + 6ν + 2ν²) e^{−ν}) / (6ν³).

The corresponding linear weights are

d_0 = (6 − ν² − (6 + 6ν + 2ν²) e^{−ν}) / (3ν(2 − ν − (2 + ν) e^{−ν})),
d_2 = (60 − 60ν + 15ν² + 5ν³ − 3ν⁴ − (60 − 15ν² + 2ν⁴) e^{−ν}) / (10ν²(6 − ν² − (6 + 6ν + 2ν²) e^{−ν})),
d_1 = 1 − d_0 − d_2.

The expressions for the smoothness indicators are given in [56] as

β_0 = (13/12)(−v_{i−3} + 3v_{i−2} − 3v_{i−1} + v_i)² + (1/4)(v_{i−3} − 5v_{i−2} + 7v_{i−1} − 3v_i)²,
β_1 = (13/12)(−v_{i−2} + 3v_{i−1} − 3v_i + v_{i+1})² + (1/4)(v_{i−2} − v_{i−1} − v_i + v_{i+1})²,
β_2 = (13/12)(−v_{i−1} + 3v_i − 3v_{i+1} + v_{i+2})² + (1/4)(−3v_{i−1} + 7v_i − 5v_{i+1} + v_{i+2})².

To obtain the analogous expressions for J_L^{(r)}, we exploit the “mirror-symmetry” property of WENO reconstructions. That is, one can keep the left side of each of the expressions, then reverse the order of the expressions on the right. Expressions for calculating one particular smoothness indicator, if interested, can be found in [44].
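As a sanity check on these formulas, the short sketch below (an illustrative Python script, not code from this work) evaluates the linear weights at ν = 1/2 and the smoothness indicators on data containing a single jump confined to the leftmost stencil. Nonlinear weights formed with the usual d_r/(ε + β_r)² normalization, assumed here purely for illustration, then suppress the contribution of the stencil that crosses the jump.

```python
import numpy as np

def linear_weights(nu):
    """Linear weights (d0, d1, d2) for J_R as functions of nu = alpha*dx."""
    e = np.exp(-nu)
    d0 = (6 - nu**2 - (6 + 6*nu + 2*nu**2)*e) / (3*nu*(2 - nu - (2 + nu)*e))
    d2 = (60 - 60*nu + 15*nu**2 + 5*nu**3 - 3*nu**4
          - (60 - 15*nu**2 + 2*nu**4)*e) / (10*nu**2*(6 - nu**2 - (6 + 6*nu + 2*nu**2)*e))
    return d0, 1.0 - d0 - d2, d2

def smoothness(v):
    """beta_0, beta_1, beta_2 from the six values v = (v_{i-3}, ..., v_{i+2})."""
    b0 = 13/12*(-v[0] + 3*v[1] - 3*v[2] + v[3])**2 + 1/4*(v[0] - 5*v[1] + 7*v[2] - 3*v[3])**2
    b1 = 13/12*(-v[1] + 3*v[2] - 3*v[3] + v[4])**2 + 1/4*(v[1] - v[2] - v[3] + v[4])**2
    b2 = 13/12*(-v[2] + 3*v[3] - 3*v[4] + v[5])**2 + 1/4*(-3*v[2] + 7*v[3] - 5*v[4] + v[5])**2
    return np.array([b0, b1, b2])

d = linear_weights(0.5)
# Jump between v_{i-3} and v_{i-2}: only stencil 0 crosses the discontinuity,
# so beta_1 = beta_2 = 0 while beta_0 = 13/12 + 1/4 = 4/3.
beta = smoothness(np.array([0.0, 1.0, 1.0, 1.0, 1.0, 1.0]))
w = np.array(d) / (1e-6 + beta)**2
w /= w.sum()          # nonlinear weights: stencil 0 is strongly suppressed
```
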
A.4 Some Larger Figures from Experiments

[Figure A.1 panels: update rate (DOF/s) versus problem size N for 2-D (top) and 3-D (bottom) test cases, for team sizes Kokkos::AUTO(), 2, and 4, comparing tiling with/without subviews and TVR, TP + TTR without tiling, range_policy (with and without simd), MDRange, and tiled MDRange.]

Figure A.1: Plots comparing the performance of different parallel execution policies for the pattern in Scheme 3.2 using test cases in 2-D (top) and 3-D (bottom). Tests were conducted on a single node consisting of 40 cores using the code configuration outlined in 3.1. Each group consists of three plots, which differ in the value selected for the team size. We note that hyperthreading is not enabled on our systems, so Kokkos::AUTO() defaults to a team size of 1. Tile experiments used a block size of 256² in 2-D problems and 32³ in 3-D. A tiled MDRange was not implemented in the 2-D cases because the block size was larger than some of the problems. The results generally agree with those presented in 3.5. For smaller problem sizes, using the non-portable range_policy with OpenMP simd directives is clearly superior to the other policies. However, when enough work is available, we see that blocked policies with subviews and vectorization generally become the fastest. In both cases, MDRange seems to have fairly good performance. Tiling, when used with MDRange in the 3-D cases, seems to be slower than plain MDRange. Again, we see that the use of blocking provides a more consistent update rate if enough work is available.
[Figure A.2 panels: DOF/node/s and weak scaling efficiency versus node count (1-49) for the advection, diffusion, and Hamilton-Jacobi applications at DOF/node = 3361², 13441², and 26881².]

Figure A.2: Weak scaling results, for each of the applications, using up to 49 nodes (1960 cores). For each of the applications, we have provided the update rate and weak scaling efficiency computed via the fastest time/step (top) and average time/step (bottom). Results for the advection and diffusion applications are quite similar, despite the use of different operators. The results for the H-J application seem to indicate that no major performance penalties are incurred by use of the adaptive time stepping method. Scalability appears to be excellent up to 16 nodes (640 cores), then begins to decline. While some loss in performance, due to network effects, is to be expected, this loss appears to be larger than was previously observed. The nodes used in the runs were not contiguous, which hints at a possible sensitivity to data locality.

[Figure A.3 panels: DOF/node/s and weak scaling efficiency versus node count (1-9) for the advection, diffusion, and Hamilton-Jacobi applications at DOF/node = 1681², 3361², 6721², 13441², and 26881².]

Figure A.3: Weak scaling results obtained with contiguous allocations of up to 9 nodes (360 cores) for each of the applications.
For comparison, the same information is displayed as in A.2. Data from the fastest trials indicates nearly perfect weak scaling, across all applications, up to 9 nodes, with a consistent update rate between 2–4 × 10⁸ DOF/node/s. A comparison of the fastest timings between the large and small runs supports our claim that data proximity is crucial to achieving the peak performance of the code. Note that the error bars are generally smaller than those in A.2. This indicates that the timing data collected from individual trials exhibits less overall variation.

[Figure A.4 panels: DOF/node/s and strong scaling efficiency versus node count (1-9) for the advection, diffusion, and Hamilton-Jacobi applications at total DOF = 1681², 3361², 6721², 13441², and 26881².]

Figure A.4: Strong scaling results for each of the applications obtained on contiguous allocations of up to 9 nodes (360 cores). Displayed for each of the applications are the update rate and strong scaling efficiency computed from the fastest time/step (top) and average time/step (bottom). This method does not contain a substantial amount of work, so we do not expect good performance for smaller base problem sizes, as the work per node becomes insufficient to hide the cost of communication. Larger base problem sizes, which introduce more work, are capable of saturating the resources, but at some point even these become insufficient. Moreover, threads become idle when the work per node fails to introduce enough blocks.
APPENDIX B

APPENDIX FOR CHAPTER 4

B.1 Semi-discrete Time Consistency of the Lorenz Formulation with BDF-1

Here we show that the Lorenz gauge formulation of Maxwell's equations (1.8)-(1.10) satisfies a certain time consistency relation when a first-order BDF time discretization is applied. By time consistent, we mean that the semi-discrete system for the potentials induces both a gauge condition and a continuity equation at the semi-discrete level. In the treatment of the semi-discrete equations, we shall ignore the effects of dimensional splittings. Following the procedure used in section 2.2.1, we can derive the semi-discrete equations for the scalar and vector potentials using first-order backward differences for each of the time derivatives. Proceeding, one obtains the following semi-discrete equations for the Lorenz gauge formulation:

Lψ^{n+1} = 2ψ^n − ψ^{n−1} + (1/(α² ε₀)) ρ^{n+1},  (B.1)
LA^{n+1} = 2A^n − A^{n−1} + (μ₀/α²) J^{n+1},  (B.2)
(ψ^{n+1} − ψ^n)/(c² Δt) + ∇ · A^{n+1} = 0,  (B.3)

where we have used the usual operator notation

L := I − (1/α²) Δ,  α := 1/(c Δt).  (B.4)

We can verify that this semi-discrete system is time consistent in the sense of the semi-discrete Lorenz gauge (B.3) through a direct calculation. First, note that the linear operator L can be inverted in the equation for the scalar potential to obtain the update

ψ^{n+1} = L^{−1}[ 2ψ^n − ψ^{n−1} + (1/(α² ε₀)) ρ^{n+1} ].  (B.5)

If we evaluate this equation at time level n, we obtain

ψ^n = L^{−1}[ 2ψ^{n−1} − ψ^{n−2} + (1/(α² ε₀)) ρ^n ].  (B.6)

Next, we take the divergence of A in equation (B.2) and find that

L(∇ · A^{n+1}) = 2∇ · A^n − ∇ · A^{n−1} + (μ₀/α²) ∇ · J^{n+1}.

Formally inverting the operator L, we obtain the relation

∇ · A^{n+1} = L^{−1}[ 2∇ · A^n − ∇ · A^{n−1} + (μ₀/α²) ∇ · J^{n+1} ].  (B.7)

We now use equations (B.5), (B.7), and (B.6) to evaluate the semi-discrete Lorenz gauge (B.3). Using the linearity of the operator L, we obtain

(ψ^{n+1} − ψ^n)/(c² Δt) + ∇ · A^{n+1} = L^{−1}[ (2ψ^n − 3ψ^{n−1} + ψ^{n−2})/(c² Δt) + 2∇ · A^n − ∇ · A^{n−1} + (μ₀/α²)( (ρ^{n+1} − ρ^n)/Δt + ∇ · J^{n+1} ) ].

Note that we have used the relation c² = (μ₀ ε₀)^{−1}. From these calculations, we can see that the corresponding semi-discrete continuity equation appears as a residual for the gauge condition (B.3). The remaining terms in the operand for the inverse can also be expressed directly in terms of this semi-discrete gauge, since

(2ψ^n − 3ψ^{n−1} + ψ^{n−2})/(c² Δt) + 2∇ · A^n − ∇ · A^{n−1} = 2( (ψ^n − ψ^{n−1})/(c² Δt) + ∇ · A^n ) − ( (ψ^{n−1} − ψ^{n−2})/(c² Δt) + ∇ · A^{n−1} ).

This gives rise to an inductive argument for the time consistency. The initial data for the problem satisfies both the semi-discrete gauge condition and the continuity equation. If the discrete gauge condition

(ψ^{n+1} − ψ^n)/(c² Δt) + ∇ · A^{n+1} = 0

holds for any time level n, then it follows that the analogous semi-discrete continuity equation

(ρ^{n+1} − ρ^n)/Δt + ∇ · J^{n+1} = 0

holds as well. We briefly sketch the idea for both directions. The forward direction can be easily seen by assuming that the semi-discrete gauge condition holds up to time level n + 1, which is equivalent to writing

0 = L^{−1}[ (μ₀/α²)( (ρ^{n+1} − ρ^n)/Δt + ∇ · J^{n+1} ) ].

The result follows by applying the operator L to both sides. A similar argument can be used for the converse. The discrete gauge condition is assumed to be satisfied by the initial condition and all relevant earlier times, i.e.,

(ψ^{n+1} − ψ^n)/(c² Δt) + ∇ · A^{n+1} = 0,  n = −2, −1.

Now, we assume that the continuity equation

(ρ^{n+1} − ρ^n)/Δt + ∇ · J^{n+1} = 0

is true for any time level n. Then, the gauge condition at n = 0 is also satisfied because

(ψ¹ − ψ⁰)/(c² Δt) + ∇ · A¹ = L^{−1}[0] ≡ 0.

This argument can be iterated n more times to obtain the result.
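The inductive argument can also be checked numerically. For a single Fourier mode with wavenumber k, the operator L acts as multiplication by 1 + k²/α², so the semi-discrete updates reduce to scalar recurrences. The sketch below (an illustrative Python script in non-dimensional units ε₀ = μ₀ = c = 1, not code from this work) seeds ψ⁰ with one application of the update so that (B.5) holds at level 0, seeds ∇·A so the gauge holds at levels −1 and 0, drives the updates with sources satisfying the semi-discrete continuity equation, and confirms that the gauge residual stays at machine precision.

```python
import numpy as np

eps0 = mu0 = c = 1.0
dt, k = 0.1, 2.0
alpha = 1.0 / (c * dt)
ell = 1.0 + (k / alpha)**2            # Fourier symbol of L = I - (1/alpha^2)*Laplacian

def rho(n):                            # prescribed charge-density mode amplitude
    return np.cos(0.3 * n)

# Arbitrary starting history for psi; psi^0 is generated by the scheme itself
# so that (B.5) holds at level 0, and div(A) is seeded to satisfy the gauge.
psi_mm, psi_m = 0.2, -0.5                                      # psi^{-2}, psi^{-1}
psi = (2*psi_m - psi_mm + rho(0) / (alpha**2 * eps0)) / ell    # psi^{0}
divA_m = -(psi_m - psi_mm) / (c**2 * dt)                       # gauge at level -1
divA = -(psi - psi_m) / (c**2 * dt)                            # gauge at level 0

worst = 0.0
for n in range(50):
    # Source satisfying the semi-discrete continuity equation:
    # (rho^{n+1} - rho^n)/dt + div(J)^{n+1} = 0.
    divJ = -(rho(n + 1) - rho(n)) / dt
    psi_new = (2*psi - psi_m + rho(n + 1) / (alpha**2 * eps0)) / ell   # (B.1)
    divA_new = (2*divA - divA_m + (mu0 / alpha**2) * divJ) / ell       # div of (B.2)
    gauge = (psi_new - psi) / (c**2 * dt) + divA_new                   # residual of (B.3)
    worst = max(worst, abs(gauge))
    psi_m, psi = psi, psi_new
    divA_m, divA = divA, divA_new
```

Replacing divJ with a source that violates the discrete continuity equation makes the gauge residual nonzero on the very first step, mirroring the forward direction of the argument above.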
B.2 Some Larger Figures from Experiments

(a) The beam of electrons (left) and the corresponding distribution in terms of their radii (right). (b) Slices in x (left) and y (right) of the toroidal magnetic field B^{(θ)}(r).

Figure B.1: The state of the Bennett problem after 50 thermal crossing times using the Boris method with the steady-state Poisson model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii, which uses a total of 50 bins. The plots on the bottom are cross-sections of the steady-state magnetic field B^{(θ)}, which are plotted against the analytical field. We see good agreement of the magnetic field with its analytical solution, which is enough to confine most of the particles within the beam.

(a) The beam of electrons (left) and the corresponding distribution in terms of their radii (right). (b) Slices in x (left) and y (right) of the toroidal magnetic field B^{(θ)}(r).

Figure B.2: The state of the Bennett problem after 45 thermal crossing times obtained with the Molei Tao method (ω = 500) using the steady-state Poisson model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii, which uses a total of 50 bins. The plots on the bottom are slices of the steady-state magnetic field B^{(θ)}, which is plotted against the analytical field. We observe a significant drift in the numerical field away from its steady-state that results in a loss of confinement of the particles to the beam.
(a) The beam of electrons (left) and the corresponding distribution in terms of their radii (right). (b) Slices in x (left) and y (right) of the toroidal magnetic field B^{(θ)}(r).

Figure B.3: The state of the Bennett problem after 35 thermal crossing times using the Boris method with the wave model for the fields. The top figure shows the electrons in the non-dimensional grid and plots the radius of the beam as a reference. We also include a cumulative histogram of the electrons based on their radii. Again, the beam radius is indicated as a reference. A total of 50 bins are used in the plot. The plots on the bottom are slices of the steady-state magnetic field B^{(θ)}, which is plotted against the analytical field. We see good agreement of the magnetic field with its analytical solution, which is enough to confine most of the particles within the beam.

(a) Time-centered method. (b) BDF method.

Figure B.4: A comparison of the time derivatives of the vector potentials after 1000 particle crossings for the expanding beam problem. This particular data was obtained using the Lorenz gauge formulation for the fields with the Boris method for particles. In the top row, the vector potentials are updated with the time-centered approach, which is purely dispersive and generates noisy time derivatives. The bottom row performs the same experiment, but uses the BDF method, which is purely dissipative. The differences in the quality of the results are quite apparent. This was discussed in [42], but results were not shown to illustrate the severity of the effects due to dispersion.

Figure B.5: We plot the expanding beam after 1000 particle crossings obtained with the Lorenz gauge formulation that combines the Boris method with the BDF-2 field solver. In Figure B.5a, we plot the beam and the corresponding charge density. We observe some oscillations along the top edge of the beam, which also appear in the charge density.
In Figure B.5b, we observe an increase in the size of violations of the Lorenz gauge condition, which indicates that the method will eventually fail. We plot the Lorenz gauge error as a surface in Figure B.5c using data from the final step. The most significant violations occur near the injection region and along the boundary where particles are removed.

Figure B.6: Here we show the potentials (and their derivatives) for the expanding beam problem after 1000 particle crossings. This data was obtained using the Lorenz gauge formulation which combines the Boris method with the BDF-2 wave solver. The first row plots the scalar potential ψ and its partial derivatives. Similarly, in the second row, we plot the derivatives of the vector potentials A^{(1)} and A^{(2)}, which are used to construct the magnetic field B^{(3)} (shown in the right-most plot). Note that the time derivative data for the vector potentials were plotted in Figure B.4b, so we exclude them here.

Figure B.7: We show the expanding beam after 2000 particle crossings obtained with the Coulomb gauge formulation, which uses the AEM for time stepping without a cleaning method. In Figure B.7a, we plot the beam and the corresponding charge density, which show visible striations and oscillations along the edge of the beam due to violations in the gauge condition. The growth in the errors associated with the gauge condition is reflected in Figure B.7b, which exhibits unbounded growth. The surface plot of the gauge condition at 2000 crossings shows large errors, especially near the injection region and along the boundary where particles are removed.

Figure B.8: We show the expanding beam after 3000 particle crossings obtained with the Coulomb gauge formulation that uses the AEM for time stepping with elliptic divergence cleaning. In Figure B.8a, we plot the beam and the corresponding charge density.
The elliptic divergence cleaning seems effective at controlling the errors in the gauge condition, compared to the results shown in Figure B.7, which do not apply the cleaning method. The fluctuations of the gauge error away from the boundaries are now in the sixth decimal position, which is a notable improvement over the result shown in Figure B.7c.

Figure B.9: Here we show the potentials (and their derivatives) for the expanding beam problem after 3000 particle crossings. This data was obtained using the Coulomb gauge formulation which combines the AEM for time integration with the BDF-2 wave solver. Elliptic divergence cleaning was applied to the vector potential. In each row, we plot a field quantity and its corresponding derivatives. The top row shows the scalar potential ψ and its derivative, which are computed with finite differences. The middle and last rows show the vector potential components A^{(1)} and A^{(2)}, respectively, along with their derivatives, which are computed with the BDF method.

Figure B.10: We show the expanding beam after 3000 particle crossings obtained with the Lorenz gauge formulation that uses the AEM for time stepping along with a first-order BDF solver. No divergence cleaning is applied. In Figure B.10a, we plot the beam and the corresponding charge density. The beam surprisingly remains intact after many particle crossings without the use of a cleaning method. The fluctuations of the gauge error over time are quite small. We do not observe the growth in the gauge error shown earlier in Figure B.5b for the Boris method.

Figure B.11: Here we show the potentials (and their derivatives) for the expanding beam problem after 3000 particle crossings. This data was obtained using the Lorenz gauge formulation which combines the AEM for time integration with the BDF-1 wave solver. A divergence cleaning method is not used in this example.
In each row, we plot a field quantity and its corresponding derivatives. The top row shows the scalar potential ψ and its derivative, while the middle and last rows show the vector potential components A^{(1)} and A^{(2)}, respectively, along with their derivatives.

Figure B.12: Error in Gauss' law for the Coulomb gauge formulation of the expanding beam problem, which applies the AEM for time integration and uses elliptic divergence cleaning. On the left, we show the time evolution of an “averaged” residual in Gauss' law. The plot on the right is a surface of the error in Gauss' law taken after 3000 particle crossings. Even though cleaning is used to control violations in the gauge condition, whose corresponding surface was shown in Figure B.8c, the metric based on point-wise violations in Gauss' law seems to indicate a significant loss of conservation. On the other hand, the plot on the left implies that Gauss' law is satisfied in an integral sense.

Figure B.13: Error in Gauss' law for the Coulomb gauge formulation of the expanding beam problem, which applies the AEM for time integration. Elliptic divergence cleaning is not used here. On the left, we show the time evolution of the “averaged” residual in Gauss' law. The plot on the right is a surface of the error in Gauss' law taken after 3000 particle crossings. The point-wise violations in Gauss' law are much larger than we observed in Figure B.12b. Similarly, the time evolution of the average defect in Gauss' law is roughly three orders of magnitude larger than in B.12a.

Figure B.14: We show the narrow beam after 5 particle crossings obtained with the Coulomb gauge formulation that uses the AEM for time stepping with elliptic divergence cleaning. We injected 400 particles per time step. In Figure B.14a, we plot the beam and the corresponding charge density. The plot of the particles appears more solid due to the increased injection rate.
The density itself is quite smooth due to the use of additional particles. As before, we see there are violations in the gauge condition along the boundaries due to the injection and removal of particles there. Additionally, the gauge error appears to be quite small away from the boundaries due to the increased smoothness offered by the use of additional particles.

Figure B.15: Here we show the potentials, as well as their derivatives, for the narrow beam problem after 5 particle crossings using an injection rate of 400 particles per step. We used the Coulomb gauge formulation which combines the AEM for time integration with the BDF-2 wave solver for the fields. Elliptic divergence cleaning was applied to the vector potential. In each row, we plot a field quantity and its corresponding spatial derivatives. The top row shows the scalar potential ψ and its derivative, which are computed with finite differences. The middle and last rows show the vector potential components A^{(1)} and A^{(2)}, respectively, along with their derivatives, which are computed with the BDF method. The structure of the fields and their derivatives is quite smooth here.

Figure B.16: Error in Gauss' law for the narrow beam problem that uses an injection rate of 400 particles per step. On the left, we show the time evolution of an “averaged” residual in Gauss' law. There is a jump in the “bulk” error for Gauss' law at step 1000, since this coincides with the beam's first crossing, before stabilizing. The plot on the right is a surface of the error in Gauss' law taken after 5 particle crossings. Even though cleaning is used to control violations in the gauge condition, whose corresponding surface was shown in Figure B.14c, the metric based on point-wise violations in Gauss' law seems to indicate a loss of charge conservation similar to the previous example.
Figure B.17: We show the derivatives used to calculate the divergence of the electric field for the narrow beam problem at 5 particle crossings. We used an injection rate of 400 particles. Derivatives are computed with second-order finite differences. We note the appearance of small oscillations in the x derivative, which is shown on the left. The plot to the right, which corresponds to the y-derivative, is largely uniform on the interior of the beam, but is sharp along the edge of the beam.

Figure B.18: We show the effect of the particle injection rate on the gauge error for the narrow beam problem at 5 particle crossings. In each row, we plot the error in the Coulomb gauge as a surface (left column) and as a slice in x along the middle of the beam (right column). The rows correspond to injection rates of 100, 200, and 400 particles per time step, respectively, from top to bottom. We can see that the increase in particle count reduces the gauge error on the interior of the domain due to the smoothing effect on the particle data.

BIBLIOGRAPHY

[1] J. P. Verboncoeur, “Particle simulation of plasmas: Review and advances,” Plasma Physics and Controlled Fusion, vol. 47, A231–A260, 5A 2005.
[2] C. K. Birdsall and A. B. Langdon, Plasma Physics via Computer Simulation. McGraw-Hill Book Company, 1985.
[3] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, First. CRC Press, 1988.
[4] J. Boris, “Relativistic plasma simulation-optimization of a hybrid code,” in Proceedings of the Fourth Conference on Numerical Simulations of Plasmas, 1970, pp. 3–67.
[5] A. Langdon, B. Cohen, and A. Friedman, “Direct implicit large time-step particle simulation of plasmas,” Journal of Computational Physics, vol. 51, pp. 107–138, 1 1981.
[6] J. Brackbill and D. Forslund, “An implicit method for electromagnetic plasma simulation in two dimensions,” Journal of Computational Physics, vol. 46, pp. 271–308, 2 1982.
[7] R.
Mason, “An electromagnetic field algorithm for 2D implicit plasma simulation,” Journal of Computational Physics, vol. 71, pp. 429–473, 2 1987.
[8] B. Cohen, A. Langdon, D. Hewett, and R. Procassini, “An implicit method for electromagnetic plasma simulation in two dimensions,” Journal of Computational Physics, vol. 81, pp. 151–168, 1 1989.
[9] G. Chen, L. Chacón, and D. Barnes, “An energy- and charge-conserving, implicit, electrostatic particle-in-cell algorithm,” Journal of Computational Physics, vol. 230, pp. 7018–7036, 18 2011.
[10] D. Knoll and D. Keyes, “Jacobian-free Newton–Krylov methods: A survey of approaches and applications,” Journal of Computational Physics, vol. 193, pp. 357–397, 2 2004.
[11] L. Chacón, G. Chen, and D. Barnes, “A charge- and energy-conserving implicit, electrostatic particle-in-cell algorithm on mapped computational meshes,” Journal of Computational Physics, vol. 233, pp. 1–9, 2012.
[12] G. Chen, L. Chacón, L. Yin, B. Albright, J. Stark, and R. Bird, “A semi-implicit, energy- and charge-conserving particle-in-cell algorithm for the relativistic Vlasov-Maxwell equations,” Journal of Computational Physics, vol. 407, p. 109228, 2020.
[13] K. S. Yee, “Numerical solution of initial boundary value problems involving Maxwell’s equations in isotropic media,” IEEE Transactions on Antennas and Propagation, vol. 14, pp. 302–307, 3 1966.
[14] A. Taflove and S. C. Hagness, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Third. Artech House Publishers, 2005.
[15] A. D. Greenwood, K. L. Cartwright, J. W. Luginsland, and E. A. Baca, “On the elimination of numerical Cerenkov radiation in PIC simulations,” Journal of Computational Physics, vol. 201, pp. 665–684, 2 2004.
[16] J. P. Verboncoeur, “Aliasing of electromagnetic fields in stair step boundaries,” Computer Physics Communications, vol. 164, pp. 344–352, 1 2004.
[17] B. Engquist, J. Häggblad, and O.
Runborg, “On energy preserving consistent boundary conditions for the Yee scheme in 2D,” BIT Numerical Mathematics, vol. 52, pp. 615–637, 3 2012.
[18] E. Sonnendrücker, J. Ambrosiano, and S. Brandon, “A finite element formulation of the Darwin PIC model for use on unstructured grids,” Journal of Computational Physics, vol. 121, pp. 281–297, 2 1995.
[19] C.-D. Munz, P. Omnes, R. Schneider, E. Sonnendrücker, and U. Voss, “Divergence correction techniques for Maxwell solvers based on a hyperbolic model,” Journal of Computational Physics, vol. 161, no. 2, pp. 484–511, 2000.
[20] G. Jacobs and J. Hesthaven, “High-order nodal discontinuous Galerkin particle-in-cell method on unstructured grids,” Journal of Computational Physics, vol. 214, pp. 96–121, 1 2006.
[21] ——, “Implicit–explicit time integration of a high-order particle-in-cell method with hyperbolic divergence cleaning,” Journal of Computational Physics, vol. 180, pp. 1760–1767, 10 2009.
[22] M. Pinto, S. Jund, S. Salmon, and E. Sonnendrücker, “Charge-conserving FEM–PIC schemes on general grids,” Comptes Rendus Mécanique, vol. 342, pp. 570–582, 10-11 2014.
[23] M. Pinto, K. Kormann, and E. Sonnendrücker, Variational framework for structure-preserving electromagnetic particle-in-cell methods, 2021. doi: 10.48550/ARXIV.2101.09247. [Online]. Available: https://arxiv.org/abs/2101.09247.
[24] S. O’Connor, Z. D. Crawford, J. P. Verboncoeur, J. Luginsland, and B. Shanker, “A set of benchmark tests for validation of 3-D particle in cell methods,” IEEE Transactions on Plasma Science, vol. 49, pp. 1724–1731, 5 Apr. 2021.
[25] S. O’Connor, Z. D. Crawford, O. H. Ramachandran, J. Luginsland, and B. Shanker, “Time integrator agnostic charge conserving finite element PIC,” Physics of Plasmas, vol. 28, p. 092111, 9 Sep. 2021.
[26] Z. D. Crawford, S. O’Connor, J. Luginsland, and B.
Shanker, “Rubrics for charge conserving current mapping in finite element electromagnetic particle in cell methods,” IEEE Transactions on Plasma Science, vol. 49, pp. 3719–3732, 11 Nov. 2021.
[27] Y. Saad and M. H. Schultz, “GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems,” SIAM Journal on Scientific and Statistical Computing, vol. 7, pp. 856–869, 3 1986.
[28] D. Appelö, L. Zhang, T. Hagstrom, and F. Li, “An energy-based discontinuous Galerkin method with tame CFL numbers for the wave equation,” 2021. doi: 10.48550/ARXIV.2110.07099. [Online]. Available: https://arxiv.org/abs/2110.07099.
[29] O. Beznosov and D. Appelö, “Hermite-discontinuous Galerkin overset grid methods for the scalar wave equation,” Communications on Applied Mathematics and Computation, vol. 3, pp. 391–418, 3 2021.
[30] F. Zheng, Z. Chen, and J. Zhang, “A finite-difference time-domain method without the Courant stability conditions,” IEEE Microwave and Guided Wave Letters, vol. 9, pp. 497–523, 11 1999.
[31] ——, “Toward the development of a three-dimensional unconditionally stable finite-difference time-domain method,” IEEE Transactions on Microwave Theory and Techniques, vol. 48, pp. 1550–1558, 9 2000.
[32] J. Lee and B. Fornberg, “Some unconditionally stable time stepping methods for the 3D Maxwell’s equations,” Journal of Computational and Applied Mathematics, vol. 166, pp. 497–523, 2 2004.
[33] M. F. Causley, A. J. Christlieb, Y. Güçlü, and E. Wolf, “Method of lines transpose: A fast implicit wave propagator,” 2013. doi: 10.48550/ARXIV.1306.6902. [Online]. Available: https://arxiv.org/abs/1306.6902.
[34] M. Causley, A. Christlieb, and E. Wolf, “Method of lines transpose: An efficient unconditionally stable solver for wave propagation,” Journal of Scientific Computing, vol. 70, pp. 896–921, 2 2017.
[35] M. Mašek and P. Gibbon, “Mesh-free magnetoinductive plasma model,” IEEE Transactions on Plasma Science, vol. 38, pp. 2377–2382, 9 2010.
[36] L.
Siddi, G. Lapenta, and P. Gibbon, “Mesh-free Hamiltonian implementation of two dimensional Darwin model,” Physics of Plasmas, vol. 24, pp. 1–11, 8 2017.
[37] Y. Cheng, A. J. Christlieb, W. Guo, and B. Ong, “An asymptotic preserving Maxwell solver resulting in the Darwin limit of electrodynamics,” Journal of Scientific Computing, vol. 71, no. 3, pp. 959–993, 2017.
[38] M. F. Causley and A. J. Christlieb, “Higher order A-stable schemes for the wave equation using a successive convolution approach,” SIAM Journal on Numerical Analysis, vol. 52, no. 1, pp. 220–235, 2014.
[39] A. J. Christlieb, P. T. Guthrey, W. A. Sands, and M. Thavappiragasam, “Parallel algorithms for successive convolution,” Journal of Scientific Computing, vol. 86, pp. 1–44, 1 2021.
[40] M. Thavappiragasam, A. Christlieb, J. Luginsland, and P. Guthrey, “A fast local embedded boundary method suitable for high power electromagnetic sources,” AIP Advances, vol. 10, p. 115318, 11 2020.
[41] E. Wolf, M. Causley, A. Christlieb, and M. Bettencourt, “A particle-in-cell method for the simulation of plasmas based on an unconditionally stable field solver,” Journal of Computational Physics, vol. 326, pp. 342–372, 2016.
[42] E. Wolf, “A particle-in-cell method for the simulation of plasmas based on an unconditionally stable wave equation solver,” Ph.D. dissertation, Michigan State University, 2015.
[43] M. Causley, H. Cho, and A. Christlieb, “Method of lines transpose: Energy gradient flows using direct operator inversion for phase-field models,” SIAM Journal on Scientific Computing, vol. 39, no. 5, B968–B992, 2017.
[44] A. Christlieb, W. Guo, Y. Jiang, and H. Yang, “Kernel based high order explicit unconditionally-stable scheme for nonlinear degenerate advection-diffusion equations,” Journal of Scientific Computing, vol. 82:52, pp. 1–29, 3 2020.
[45] A. Christlieb, W. Sands, and H.
Yang, “A kernel-based explicit unconditionally stable scheme for Hamilton-Jacobi equations on nonuniform meshes,” Journal of Computational Physics, vol. 415, pp. 1–25, 2020, Art. No. 109543.
[46] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling manycore performance portability through polymorphic memory access patterns,” Journal of Parallel and Distributed Computing, vol. 74, pp. 3202–3216, 12 2014.
[47] M. Tao, “Explicit symplectic approximation of nonseparable Hamiltonians: Algorithm and long time performance,” Phys. Rev. E, vol. 94, p. 043303, 4 Oct. 2016. doi: 10.1103/PhysRevE.94.043303.
[48] L. Greengard and V. Rokhlin, “A fast algorithm for particle simulations,” Journal of Computational Physics, vol. 73, no. 2, pp. 325–348, 1987.
[49] M. C. A. Kropinski and B. D. Quaife, “Fast integral equation methods for Rothe’s method applied to the isotropic heat equation,” Computers and Mathematics with Applications, vol. 61, pp. 2436–2446, 9 2011.
[50] M. Causley, A. Christlieb, B. Ong, and L. Van Groningen, “Method of lines transpose: An implicit solution to the wave equation,” Mathematics of Computation, vol. 83, no. 290, pp. 2763–2786, 2014.
[51] G.-S. Jiang and C.-W. Shu, “Efficient implementation of weighted ENO schemes,” Journal of Computational Physics, vol. 126, no. 1, pp. 202–228, 1996.
[52] A. Christlieb, W. Guo, and Y. Jiang, “A WENO-based method of lines transpose approach for Vlasov simulations,” Journal of Computational Physics, vol. 327, pp. 337–367, 2016.
[53] H. Kreiss, N. Petersson, and J. Yström, “Difference approximations of the Neumann problem for the second order wave equation,” SIAM Journal on Numerical Analysis, vol. 42, pp. 1292–1323, 3 2004.
[54] M. Thavappiragasam, “A-stable implicit rapid scheme and software solution for electromagnetic wave propagation,” Ph.D. dissertation, Michigan State University, 2019.
[55] M. F. Causley, H. Cho, A. J. Christlieb, and D. C.
Seal, “Method of lines transpose: High order L-stable O (𝑁) schemes for parabolic equations using successive convolution,” SIAM Journal on Numerical Analysis, vol. 54, no. 3, pp. 1635–1652, 2016.
[56] A. Christlieb, W. Guo, and Y. Jiang, “A kernel-based high order "explicit" unconditionally stable scheme for time dependent Hamilton–Jacobi equations,” Journal of Computational Physics, vol. 379, pp. 214–236, 2019.
[57] S. Gottlieb, C.-W. Shu, and E. Tadmor, “Strong stability-preserving high-order time discretization methods,” SIAM Review, vol. 43, no. 1, pp. 89–112, 2001.
[58] A. Salazar, M. Raydan, and A. Campo, “Theoretical analysis of the exponential transversal method of lines for the diffusion equation,” Numerical Methods for Partial Differential Equations, vol. 16, no. 1, pp. 30–41, 2000.
[59] M. Schemann and F. A. Bornemann, “An adaptive Rothe method for the wave equation,” Computing and Visualization in Science, vol. 1, no. 3, pp. 137–144, 1998.
[60] G. Biros, L. Ying, and D. Zorin, An embedded boundary integral solver for the unsteady incompressible Navier-Stokes equations (preprint), 2002.
[61] ——, “A fast solver for the Stokes equations with distributed forces in complex geometries,” Journal of Computational Physics, vol. 193, pp. 317–348, 1 2004.
[62] S.-H. Chiu, M. N. J. Moore, and B. Quaife, “Viscous transport in eroding porous media,” Journal of Fluid Mechanics, vol. 893, A3, 2020. doi: 10.1017/jfm.2020.228.
[63] B. D. Quaife and M. N. J. Moore, “A boundary-integral framework to simulate viscous erosion of a porous medium,” Journal of Computational Physics, vol. 375, pp. 1–21, 2018.
[64] H. Wang, T. Lei, J. Li, J. Huang, and Z. Yao, “A parallel fast multipole accelerated integral equation scheme for 3D Stokes equations,” International Journal for Numerical Methods in Engineering, vol. 70, pp. 812–839, 7 2007.
[65] L. Ying, G. Biros, and D.
Zorin, “A high-order 3D boundary integral equation solver for elliptic PDEs in smooth domains,” Journal of Computational Physics, vol. 219, pp. 247–275, 1 2006.
[66] O. P. Bruno and M. Lyon, “High-order unconditionally stable FC-AD solvers for general smooth domains I. Basic elements,” Journal of Computational Physics, vol. 229, pp. 2009–2033, 6 2010.
[67] ——, “High-order unconditionally stable FC-AD solvers for general smooth domains II. Elliptic, parabolic and hyperbolic PDEs. Theoretical considerations,” Journal of Computational Physics, vol. 229, pp. 3358–3381, 9 2010.
[68] J. Douglas Jr., “On the numerical integration of 𝜕𝑥𝑥𝑈 + 𝜕𝑦𝑦𝑈 = 𝜕𝑡𝑈 by implicit methods,” Journal of the Society for Industrial and Applied Mathematics, vol. 3, pp. 42–65, 1955.
[69] ——, “Alternating direction methods for three space variables,” Numerische Mathematik, vol. 4, pp. 41–63, 1 1962.
[70] D. W. Peaceman and H. H. Rachford Jr., “The numerical solution of parabolic and elliptic differential equations,” Journal of the Society for Industrial and Applied Mathematics, vol. 3, pp. 28–41, 1 1955.
[71] N. Albin and O. P. Bruno, “A spectral FC solver for the compressible Navier-Stokes equations in general domains I: Explicit time-stepping,” Journal of Computational Physics, vol. 230, pp. 6248–6270, 16 2011.
[72] T. G. Anderson, O. P. Bruno, and M. Lyon, “High-order, dispersionless "fast-hybrid" wave equation solver part I: O (1) sampling cost via incident-field windowing and recentering,” SIAM Journal on Scientific Computing, vol. 42, pp. 1348–1379, 2 2020.
[73] C.-W. Shu, “High order weighted essentially nonoscillatory schemes for convection dominated problems,” SIAM Review, vol. 51, no. 1, pp. 82–126, 2009.
[74] R. D. Hornung and J. A. Keasler, “The RAJA portability layer: Overview and status,” Lawrence Livermore National Laboratory (LLNL), Livermore, CA, United States, Tech. Rep., Sep. 2014. doi: 10.2172/1169830.
[75] P. Grete, F. W. Glines, and B. W.
O’Shea, “K-Athena: A performance portable structured grid finite volume magnetohydrodynamics code,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, pp. 85–97, 1 2020.
[76] C. J. White, J. M. Stone, and C. F. Gammie, “An extension of the Athena++ code framework for GRMHD based on advanced Riemann solvers and staggered-mesh constrained transport,” The Astrophysical Journal Supplement, vol. 225, 2 2016.
[77] P. Mardahl and J. Verboncoeur, “Charge conservation in electromagnetic PIC codes; spectral comparison of Boris/DADI and Langdon-Marder methods,” Computer Physics Communications, vol. 106, pp. 219–229, 1997.
[78] H. Qin, S. Zhang, J. Xiao, J. Liu, Y. Sun, and W. Tang, “Why is Boris algorithm so good?” Physics of Plasmas, vol. 20, 8 Aug. 2013. doi: 10.1063/1.4818428.
[79] L. Brieda. “Particle push in magnetic field (Boris method).” (2011), [Online]. Available: https://www.particleincell.com/2011/vxb-rotation/ (visited on 04/15/2022).
[80] P. Pihajoki, “Explicit methods in extended phase space for inseparable Hamiltonian problems,” Celestial Mechanics and Dynamical Astronomy, vol. 121, pp. 211–231, 2015. doi: 10.1007/s10569-014-9597-9.
[81] W. H. Bennett, “Magnetically self-focussing streams,” Phys. Rev., vol. 45, pp. 890–897, 12 1934. doi: 10.1103/PhysRev.45.890.
[82] J. A. Bittencourt, Fundamentals of Plasma Physics, Third. Springer-Verlag, 2010.
[83] J. Barnes and P. Hut, “A hierarchical 𝑂 (𝑁 log 𝑁) force-calculation algorithm,” Nature, vol. 324, pp. 446–449, 1986.
[84] L. Wang, R. Krasny, and S. Tlupova, “A kernel-independent treecode based on barycentric Lagrange interpolation,” Communications in Computational Physics, vol. 28, no. 4, pp. 1415–1436, 2020. doi: 10.4208/cicp.OA-2019-0177. [Online]. Available: http://global-sci.org/intro/article_detail/cicp/18106.html.
[85] N. Vaughn, L. Wilson, and R.
Krasny, “A GPU-accelerated barycentric Lagrange treecode,” in 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2020, pp. 701–710. doi: 10.1109/IPDPSW50202.2020.00125.
[86] L. Wilson, N. Vaughn, and R. Krasny, “A GPU-accelerated fast multipole method based on barycentric Lagrange interpolation and dual tree traversal,” Computer Physics Communications, vol. 265, 2021. doi: 10.1016/j.cpc.2021.108017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010465521001296.