NUMERICAL SIMULATIONS OF PLASMAS IN GALAXY CLUSTERS

By

Forrest Wolfgang Glines

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Astrophysics and Astronomy – Doctor of Philosophy
Computational Mathematics, Science and Engineering – Dual Degree

2022

ABSTRACT

NUMERICAL SIMULATIONS OF PLASMAS IN GALAXY CLUSTERS

By Forrest Wolfgang Glines

As the largest gravitationally bound objects in the universe, galaxy clusters are a unique probe of large-scale cosmological structure. Determining the distribution of galaxy clusters and their virial masses may be key to constraining properties of dark energy and dark matter. Since 84% of a typical galaxy cluster's mass is composed of non-radiating dark matter, however, determining the virial mass of galaxy clusters depends on inference from the radiating baryonic matter. 84% of this baryonic matter is contained in the intracluster medium (ICM) – a hot, diffuse, magnetized plasma permeating the galaxy cluster. While the baryonic matter is the only source of observable electromagnetic emission from galaxy clusters, the complex behavior of the ICM as a turbulent magnetized plasma makes constraining the virial mass of the cluster with observable signatures difficult. Numerical simulations are essential tools for advancing understanding of the ICM and for tying galaxy cluster observables to virial masses. The goal of this dissertation is to explore and enable simulations of galaxy clusters and magnetized plasmas via a number of different avenues.

I first explore self-regulation of feedback from active galactic nuclei (AGN) preventing overcooling in cool-core (CC) clusters – galaxy clusters with anomalously high central thermal emission whose cores should cool on shorter timescales than they persist. In idealized galaxy cluster simulations with a thermal abstraction of AGN feedback, we find that the thermal-only heating kernels we test are unable to offset cooling while maintaining a realistic structure, suggesting the exploration of more complex AGN feedback mechanisms such as those including magnetic fields and turbulence.

We then explore how kinetic and magnetic energy thermalizes in the ICM by studying decaying magnetized turbulence with simulations of the magnetized compressible Taylor-Green vortex. Using a shell-to-shell energy transfer analysis, we find that the magnetic fields facilitate a significant amount of energy flux that is not seen in hydrodynamic turbulence. Although the full turbulent cascade will not be directly captured in ICM simulations for the foreseeable future, higher resolution simulations enabled by larger computational resources can diminish the effects of this limited resolution.

Different novel many-core architectures have emerged in recent years on the way toward larger supercomputers in the exascale era. Performance portability is required to prevent repeated non-trivial refactoring of a code for different architectures. To address the need for a performance portable magnetohydrodynamics (MHD) code, we combined Athena++, an existing MHD CPU code, with Kokkos, a performance portability framework, into K-Athena to allow efficient simulations on multiple architectures using a single codebase. K-Athena has also inspired the Parthenon performance portable adaptive mesh refinement (AMR) framework. Using this framework, we developed the performance portable AMR MHD code AthenaPK.

Galaxy clusters contain significant magnetic fields, although their origin and role are still under investigation.
Numerical modeling is essential for the inference of their properties. One open question is whether magnetic AGN feedback models can self-regulate. I present work-in-progress simulations with AthenaPK of magnetized galaxy clusters slated for exascale supercomputers later this year.

With the higher resolutions enabled by exascale systems, galaxy cluster simulations with relativistic jet velocities will be possible, and robust methods for relativistic plasmas will be needed. With this goal, I present a discontinuous-Galerkin (DG) method for relativistic hydrodynamics. We include an exploration of different methods to recover the primitive variables from conserved variables, a new operator for enforcing a physically permissible conserved state, and numerous tests of the method. This method has been used at Sandia National Laboratories to study terrestrial plasmas and will inform relativistic MHD methods for AthenaPK.

Finally, I cover the future directions of the work in this dissertation, including the many codes enabled by Parthenon, additions to the magnetized galaxy cluster simulations with AthenaPK, and the large body of projects at Los Alamos National Laboratory to explore binary black hole mergers embedded within AGN accretion disks as a possible formation channel of the massive black holes observed by LIGO. The work in this dissertation to develop performance portable plasma simulations will enable ground-breaking simulations for years to come.

Copyright by
FORREST WOLFGANG GLINES
2022

ACKNOWLEDGEMENTS

This work was completed at Michigan State University from 2016-2022 and at Sandia National Laboratories[1] from 2019-2022. This research has been made possible by the Michigan State University Distinguished Fellowship, by NASA through Astrophysics Theory Program grant No. NNX15AP39G and Hubble Theory grant HST-AR-13261.01-A, by the NSF through grant AST-1514700, by the NCSA through the 2019 Blue Waters Graduate Fellowship, and by the Michigan Institute of Plasma Science and Engineering through the MIPSE Graduate Fellowship. This dissertation has made extensive use of computing resources at the Michigan State University High Performance Computing Center (operated by the Institute for Cyber-Enabled Research), the NASA Pleiades supercomputer through allocation SMD-16-7720, the Blue Waters Supercomputer at the National Center for Supercomputing Applications (NCSA), the Texas Advanced Computing Center (TACC) at The University of Texas at Austin, at OLCF on Titan through allocation AST133, on Summit through allocation AST146, on ALCF Theta through allocation athena_performance, and on XSEDE Comet through allocation TG-AST090040. The work in this dissertation has benefited extensively from the work on AGN feedback by Greg Meece (Meece et al., 2017), the Athena++ code (Stone et al., 2020a), the Kokkos library (Carter Edwards et al., 2014; Trott et al., 2022), and the Parthenon library (Grete et al., 2022). I give thanks to the MSU Department of Physics and Astronomy and the Department of Computational Mathematics, Science and Engineering for fostering a supportive environment.
Thank you to the many graduate students for welcoming me to Lansing and making me feel comfortable at MSU, including the previous cohort of students who welcomed me to MSU: Austin Edmister, Rachel Frisbie, Dana Koeppe, and Jennifer Ranta; my fellow cohort, with whom I came to love MSU, including Jessica Maldonado and Carl Fields; and the cohorts following me who made MSU home and whom I wish the best, including Justin Grace, Adam Kawash, Claire Koppenhaufer, Michael Pajkos, Brandon Barker, Eric Britt, CJ Llorente, Teresa Panurach, and Joshua Shields. Thanks to Kim Crosslan for keeping the Physics and Astronomy department running smoothly and helping me through the administrative hurdles of graduate school.

Thanks to the postdoctoral researchers Elias Aydi, Brian Clark, Chelsea Harris, Sumit Sarbadhicary, and Abbie Stevens and professors Wolfgang Kerzendorf and Daniel Hayden, to whom I've gone for academic and professional guidance moving on to the postdoctoral stage and beyond. Thank you to the members of my committee, Sean Couch, Tyce DeYoung, Megan Donahue, and Mark Voit, for providing guidance and direction for my dissertation research.

Many thanks to the incredible computational structure research group, especially Deovrat Prasad, whose discussions and insights into contemporary galaxy cluster research have been essential for my understanding of the field, and Philipp Grete, whose contributions to computing have been instrumental to our successes with K-Athena, Parthenon, AthenaPK, and my future work with Los Alamos National Laboratory, and whose fervent pursuit of computational plasma research I aspire to match. Special thanks to Kristian Beckwith, who helped me have a fulfilling internship, stuck with my research projects, and brought them towards publication despite all obstacles. Special thanks to my newfound friends in and around Michigan who have helped me discover and embrace my identity. Special thanks to my family and parents, who have encouraged and supported me throughout my graduate career, especially when I felt inadequate. Special thanks above all to my advisor, Brian O'Shea, for his endless patience as I picked a topic of study to match all my interests in astrophysics, plasmas, and computing. Without his continual support in academic pursuits, professional aspirations, and personal growth, this dissertation would not have been possible.

[1] Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. This work was supported in part by LDRD project # 209240. SAND No. SAND2022-9057 T.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Galaxy Clusters
  1.2 Plasmas
    1.2.1 Plasma Regimes
    1.2.2 Turbulence in Plasmas
    1.2.3 The Simulation of Plasmas as a Research Tool
    1.2.4 Numerical Methods for Plasmas in the Fluid Approximation
  1.3 The Intracluster Medium – Plasma Physics Applied to Galaxy Clusters
    1.3.1 The cool core cluster problem
    1.3.2 Self-Regulating AGN Feedback via Precipitation
    1.3.3 The nature of AGN Feedback
    1.3.4 Simulation of Galaxy Clusters
  1.4 The Changing Supercomputer Architecture Landscape
    1.4.1 Performance Portability
  1.5 Outline of Dissertation
CHAPTER 2 TESTS OF AGN FEEDBACK KERNELS IN SIMULATED GALAXY CLUSTERS
  2.1 Chapter Abstract
  2.2 Introduction
  2.3 Methodology
    2.3.1 Simulation Setup
    2.3.2 AGN Feedback Kernels
  2.4 Results
    2.4.1 Categorization of Simulations
      2.4.1.1 Central Cooling
      2.4.1.2 Central Convective Zone
      2.4.1.3 Central Entropy Floor
    2.4.2 Important radii: r_L, r_H, r_-, r_+, and r_multi
    2.4.3 Condensation of Cold Gas
    2.4.4 Central Heating
  2.5 Discussion
    2.5.1 No Adequate Heating Kernel
    2.5.2 Robustness of Feedback Algorithm
    2.5.3 Comparison to Observations
    2.5.4 Comparison to Other Simulations
    2.5.5 Implications
    2.5.6 Other Models Investigated
    2.5.7 Future Models
  2.6 Summary
CHAPTER 3 MAGNETIZED DECAYING TURBULENCE IN THE WEAKLY COMPRESSIBLE TAYLOR-GREEN VORTEX
  3.1 Chapter Abstract
  3.2 Introduction
  3.3 Method
    3.3.1 MHD Equations and Numerical Method
    3.3.2 Magnetized TG Vortex
    3.3.3 Energy Transfer Analysis
  3.4 Results
    3.4.1 Bulk Properties
      3.4.1.1 Evolution of energy reservoirs
      3.4.1.2 Energy Spectra
      3.4.1.3 Spectral Index
    3.4.2 Energy Transfer
      3.4.2.1 Nonlocal Energy Transfer
      3.4.2.2 Inverted Turbulent Cascades
      3.4.2.3 Cross-Scale Flux
  3.5 Discussion
    3.5.1 Comparison to driven turbulence simulations
    3.5.2 Comparison to previous results
    3.5.3 Implication of results
    3.5.4 Limitations
  3.6 Conclusions
CHAPTER 4 K-ATHENA: A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MAGNETOHYDRODYNAMICS CODE
  4.1 Chapter Abstract
  4.2 Introduction
  4.3 Method
    4.3.1 Kokkos
    4.3.2 Athena++
    4.3.3 K-Athena = Kokkos + Athena++
  4.4 Results
    4.4.1 Profiling
    4.4.2 Performance portability
      4.4.2.1 Overview of architectures used
      4.4.2.2 Roofline model
      4.4.2.3 Performance portability metric
    4.4.3 Scaling
      4.4.3.1 Single CPU and GPU performance
      4.4.3.2 Weak scaling
    4.4.4 Strong scaling
  4.5 Current limitations and future enhancements
  4.6 Conclusions
CHAPTER 5 RELATIVISTIC DISCONTINUOUS-GALERKIN HYDRODYNAMICS
  5.1 Chapter Abstract
  5.2 Introduction
  5.3 Theoretical Background and Discretization
    5.3.1 Special Relativistic Hydrodynamics
    5.3.2 Equations of State
    5.3.3 Spatial and Temporal Discretizations
    5.3.4 Computation of the Surface Flux
    5.3.5 Physicality Enforcing Operator
  5.4 Recovery of Primitive Variables
    5.4.1 Ideal Gas Equation of State
    5.4.2 Taub-Matthews Equation of State
    5.4.3 Conserved to Primitive Solver Comparisons
  5.5 Tests of the Relativistic Hydrodynamics Scheme
    5.5.1 Linear Waves
    5.5.2 1D Riemann Problems
    5.5.3 1D Taub-Matthews Equation of State Test
    5.5.4 2D Riemann Problems
      5.5.4.1 2D Riemann Problems: Test 1
      5.5.4.2 2D Riemann Problems: Test 2
      5.5.4.3 2D Riemann Problems: Test 3
    5.5.5 Kelvin-Helmholtz Instability
      5.5.5.1 Linear Growth Phase
      5.5.5.2 Non-linear Evolution
    5.5.6 Performance
  5.6 Summary
CHAPTER 6 SIMULATIONS OF GALAXY CLUSTERS WITH MAGNETIC AGN JET FEEDBACK
  6.1 Motivation
  6.2 Methodology
    6.2.1 Simulation Setup
      6.2.1.1 Gravitational Potential
      6.2.1.2 Entropy Profile
      6.2.1.3 Initial Pressure and Density (Hydrostatic Equilibrium)
      6.2.1.4 Linearly Interpolated Tabular Cooling
      6.2.1.5 Precessing Jet Coordinates
      6.2.1.6 Magnetic tower
      6.2.1.7 AGN Feedback
      6.2.1.8 Thermal AGN Feedback
      6.2.1.9 Kinetic AGN Feedback
      6.2.1.10 Magnetic AGN Feedback
      6.2.1.11 AGN cold mass triggering
  6.3 Current State of Simulations
CHAPTER 7 SUMMARY AND FUTURE DIRECTIONS
  7.1 Summary of Dissertation Work
    7.1.1 Chapter 2: Tests of AGN Feedback Kernels in Simulated Galaxy Clusters
    7.1.2 Chapter 3: Magnetized Decaying Turbulence in the Weakly Compressible Taylor-Green Vortex
    7.1.3 Chapter 4: K-Athena: A Performance Portable Structured Grid Finite Volume Magnetohydrodynamics Code
    7.1.4 Chapter 5: Relativistic Discontinuous-Galerkin Hydrodynamics
  7.2 Ongoing and Future Work
    7.2.1 Parthenon and AthenaPK
    7.2.2 Relativistic DG Methods
    7.2.3 Simulations of Magnetized Galaxy Clusters
    7.2.4 AGN Accretion Disk Channel for Intermediate Mass Black Holes
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: List of combinations of inner smoothing radius r_s [kpc], outer cutoff radius r_c [kpc], and exponent α used. The rightmost column lists all values of α explored for the given combination of r_s and r_c in the leftmost and middle columns.

Table 2.2: Brief definitions of variables described in full in the text and used in later figures. "Median" here refers to the median of the distribution of a variable (e.g. entropy, cooling rate, etc.) at a given radius.

Table 4.1: Software Environment and Compiler Flags Used in Scaling Tests.

Table 4.2: Technical specifications for devices used in the performance portability metric. Cache sizes and core counts for CPUs specify the aggregate sizes and counts for a two-socket node, while numbers for GPUs show the aggregate for a single device. For the Tesla K80, the cache size and core count are for just one of the two GK210 chips in the GPU. For DRAM bandwidth (BW) we use the empirically measured bandwidth of the DRAM on CPUs and the global memory on GPUs. Data for Intel devices comes from Intel Corporation (2016) and data for NVIDIA devices comes from NVIDIA Corporation (2014, 2016, 2017); Jia et al. (2018).

Table 5.1: Values of N (number of wavelengths) and n (nth acceptable wavelength) for linear wave tests (see Eq. 5.78).

Table 5.2: Order of convergence for both primitive and conserved variables along the rows for each of the 5 eigenvalue/eigenvector pairs j ∈ {−, 0_(1,2,3), +} along the columns, all tested in 3D with non-grid-aligned waves, using a 2nd order basis with the SSPRK3 integrator. For all cases we expect a 3.0 rate of convergence. Entries with '-' denote variables where the eigenvector used for that test does not affect that variable.

LIST OF FIGURES

Figure 1.1: Galaxy Cluster Abell 1689 in X-ray (purple) as captured by Chandra with optical from Hubble underneath. The galaxy cluster has sufficient mass to bend light from background galaxies around the galaxy cluster core, smearing background sources into duplicated arcs around the galaxy cluster core. This strong gravitational lensing permits estimates of the galaxy cluster's mass (Kochanek, 2006; Hoekstra et al., 2013; Bartelmann, 2010). The Intracluster Medium – the hot, diffuse plasma comprising most of the baryonic mass but a relatively smaller portion of the total mass – is responsible for the majority of the X-ray emissions.
Figure 1.2: Charged particle number densities on the x-axis and temperatures on the y-axis for different astrophysical and terrestrial plasmas. The comparatively hot and diffuse plasma of the ICM is marked in yellow, with the Perseus cluster as seen in X-ray by Chandra. Diagram by https://www.cpepphysics.org.

Figure 1.3: Spectrum of appropriate plasma models for different regimes, as determined by the Knudsen number Kn, a measure of the relative importance of particle-particle interactions versus ensemble interactions defined in equation 1.1, and the charge separation distance Λ_d, a measure of the importance of electric fields in the plasma defined in equation 1.18. Fluid models appear to the left and kinetic models appear to the right, while models where electromagnetics are important appear towards the bottom and models where electromagnetics are unimportant appear towards the top. Systems and simulations explored within astrophysics typically use models from the four extremes: Euler, Boltzmann, ideal MHD, and Vlasov models. The plasma model best describing the ICM would be a non-ideal MHD model on the galaxy cluster scale and a Vlasov model on the scales of plasma instabilities and particle acceleration. Created by Uri Shumlak for a presentation at Sandia National Laboratories (Shumlak, 2015) and appearing in Kramer et al. (2020).

Figure 1.4: Schlieren photograph showing the thermal plume of a lit candle, showing the smooth rising flow starting from the base of the flame that transitions into turbulence at the top of the flame. As a gas, the viscosity in smoke and air is low; thus, the velocity of the uplifted heated gas is sufficient to create a high Reynolds number flow, with Re ≳ 10^3, which is prone to fluid instabilities. The laminar flow originating from the flame decays into turbulence as these instabilities grow further down the flow.

Figure 1.5: Photographs of a cylinder moving through a tank of water containing aluminum powder (van Dyke, 1982). The higher the velocity of the water flow relative to the cylinder, the higher the Reynolds number; flows with Re = 9.6, Re = 2,000, and Re = 10,000 are shown from top to bottom. As the Reynolds number is increased beyond ~10^3, the flow becomes prone to fluid instabilities which grow non-linearly as the flow moves past the cylinder. These instabilities develop into the turbulent flow beyond the cylinder, as best seen on the right hand side with the Re = 10,000 flow.

Figure 1.6: Diagram of the energy spectra of a turbulent plasma denoting the hydrodynamic turbulent cascade and the effects of magnetic fields and limited simulation resolution on the energy spectra. Wavenumber increases along the x-axis, with larger length scales to the left and smaller length scales to the right. Energy contained in the plasma at a certain wavenumber is plotted along the y-axis. The black solid line shows the kinetic energy spectrum of a plasma with no magnetic fields, where kinetic energy is introduced into the plasma at the production scale (marked by the leftmost vertical dashed black line) and dissipates into thermal heating at the dissipation scale (marked by the rightmost vertical dashed black line). Between these scales, turbulent plasmas follow a k^(-5/3) power law in the kinetic energy spectrum.
With the addition of magnetic fields, in the resulting kinetic energy spectrum (shown in red) the power law is flattened or broken, with more energy at smaller scales. In simulations without an explicit viscosity, the smallest cell size introduces a dissipation length scale (the vertical dashed blue line) potentially larger than the physical length scale, which truncates the energy spectrum (in solid blue). Increased resolution decreases the dissipation imposed by numerics.

Figure 1.7: Bubbles inflated by AGN jets in galaxy cluster MS0735.6+7421, as evidenced by X-ray cavities in the ICM and radio synchrotron emission from cosmic rays accelerated at the shock fronts around the bubbles. Image by NASA.

Figure 1.8: Diagram of the self-regulating AGN feedback precipitation model from Voit et al. (2017), where the left panel shows a diagram of AGN feedback in a galaxy cluster and the right panel shows the entropy K ≡ k_B T n_e^(-2/3), where n_e is the electron number density. In this model, cold gas condenses in the isentropic central region of the galaxy cluster and accretes onto the central SMBH, triggering feedback in the form of bipolar outflows that uplift condensed gas into the power-law zone of the entropy profile in the cluster outskirts, tempering the overcooling and condensation of gas. In this power-law zone, buoyancy suppresses condensation while uplift promotes condensation. Observationally, the transition between the isentropic and power-law zones of the entropy profile occurs where the ratio of cooling time to free fall time is t_cool/t_ff ~ 10, where the cooling time t_cool of a parcel of gas is the time it would take for it to radiate away all its energy at its current rate of radiative cooling and the free fall time t_ff of a parcel of gas is the time it would take to infall from rest to the cluster center due to gravity.

Figure 1.9: Relative clock speeds of single core (black) and multicore (gray, orange, blue, and red, in order of increasing core counts) processors relative to the Intel 80386 CPU using the SPECint benchmark. The green round dots show processor clock frequencies (the frequency at which a single core can execute a clock cycle to carry out one or several operations) relative to the Intel 80386. Although clock frequencies have stagnated since the mid 2000s, processors have increased performance by adding more cores. Future performance gains are increasingly dependent on higher core counts. Figure from Leiserson et al. (2020).

Figure 1.10: Example code to execute z[i]=a*x[i]+y[i] with different programming APIs. Even with this simple code example, there are significant differences in the implementation with different APIs. Each API also requires different code outside of this snippet to manage memory and execution on the GPU, along with a myriad of performance concerns.

Figure 2.1: Top: Local ratio of heating to cooling as a function of radius (r) at the beginning of several representative simulations. The dotted blue line shows a simulation with low central heating and heating kernel parameters α = 2.0, r_s = 8 kpc, and r_c = 1000 kpc. The dashed orange line shows a simulation with high central heating and heating kernel parameters α = 2.6, r_s = 1 kpc, and r_c = 150 kpc.
The solid green line shows a simulation with intermediate central heating and heating kernel parameters α = 2.6, r_s = 12 kpc, and r_c = 150 kpc. Bottom: Cumulative ratio of heating to cooling within r for the same simulations. At large radii, all of the cumulative heating curves converge to the cumulative cooling rate because the total heating is normalized to equal the total cooling rate at R = 1.5 Mpc.

Figure 2.2: Schematic illustrations of how different AGN heating kernels affect the entropy profile of a simulated galaxy cluster. In each case, the total heating rate is set equal to the total cooling rate. Top: Radial profiles of radiative cooling and AGN heating per unit volume, with the initial median cooling rate in black and the AGN heating kernel in color. Bottom: Response of the median entropy profile to heat input, with the initial median profile in black and the response in color. The left column shows a heating kernel with central heating that falls below central cooling. The entropy profile in this case tends to follow a power law down to the origin and eventually leads to a central cooling catastrophe. The center column shows a heating kernel with excessive central heating, which elevates central entropy, inverts the entropy profile, and produces a central convective zone. The right column shows a heating kernel with intermediate central heating, which slightly raises the central entropy and produces a flat core. Due to the high initial entropy and long cooling time at outer radii, the power law at the outer radii changes very slowly with under- and over-heating.

Figure 2.3: Mass density plots of cooling and heating rate (top) and entropy (bottom) versus radius, with color representing the total mass of all simulation cells from a 2D histogram of cooling rate and entropy versus radius. Across the three columns we show three simulations at different times that broadly represent the whole set of simulations, as differentiated by the behavior of the inner tens of kpc. The left column shows a simulation (with α = 2.0, r_s = 8 kpc, and r_c = 1000 kpc at t = 0.3 Gyr) with low central heating which allows excess central cooling that quickly undergoes a cooling catastrophe. The middle column shows a simulation (with α = 2.6, r_s = 1 kpc, and r_c = 150 kpc at t = 3.0 Gyr) with high central heating that maintains a convective zone in the inner 100 kpc with a high central entropy peak. The right column shows a simulation (with α = 2.6, r_s = 12 kpc, and r_c = 150 kpc at t = 8.0 Gyr) with an intermediate amount of central heating that holds a flat entropy floor slightly elevated from the initial conditions and from observational data on the entropy of the inner tens of kpc. On the entropy plots, observational entropy data of clusters from the ACCEPT data set are displayed in grayscale showing the range (light grey), 68% confidence interval (dark grey), and median (black line) of the dataset. The median entropy is also marked by a magenta line, and the minimum (K_L) and maximum (K_H) values of the entropy median within the inner 25 kpc are marked by stars. On the cooling rate plots, the heating rate is marked by a red line and the median cooling rate is marked by a blue line. The crossover radii r_- and r_+ as defined in the text are marked by stars in the simulations where they can be defined. The heating curve parameters r_s and r_c are also annotated with finely dashed and dashed gray lines.
Figure 2.4: Time dependence of the total cooling rate (solid lines) and total mass of condensed gas under 3 × 10^4 K (dashed lines) for the three simulations shown in Figure 2.3. The blue points show a simulation with low central heating and excess central cooling (α = 2.0, r_s = 8 kpc, r_c = 1000 kpc) that experiences an early cooling catastrophe. Orange points show a simulation with high central heating (α = 2.6, r_s = 1 kpc, r_c = 150 kpc) that forms a quasi-stable central convective zone. Green points show a simulation with intermediate central heating (α = 2.6, r_s = 12 kpc, r_c = 150 kpc) that maintains a flat entropy core for almost 10 Gyr before undergoing a late cooling catastrophe. In simulations that form a multiphase gas through a cooling catastrophe, the formation of cold gas is preceded by a rise and then a sharp peak in the total cooling rate.

Figure 2.5: Plots of relationships between r_-, the radius at which the gas switches from net heating to net cooling, and other features of the simulations. Top left: Time averaged radius of the minimum of the median entropy profile (r_L) versus the time average of r_- up to the formation of a multiphase gas. (Includes only simulations in which r_- can be defined for at least 50 Myr.) Top right: Radius at which multiphase gas first forms versus the time averaged r_-. (Includes only simulations in which r_- can be defined for more than one time step.) Bottom left: Radius at which multiphase gas first forms versus the time averaged value of r_L for all simulations. Bottom right: The time required for a simulation to form multiphase gas versus the time averaged value of the cooling time at r_-. (Includes only simulations that form multiphase gas and in which r_- can be defined for at least 50 Myr.) Shapes in each panel denote the general behavior of the central region of the simulation: blue highlighted triangles denote Central Cooling simulations, orange highlighted circles denote Central Convective Zone simulations, and green highlighted stars denote Entropy Floor simulations. Colors show the heating kernel parameter α, with greater α generally corresponding to heating that is more centrally concentrated.

Figure 2.6: Left: Time required to form multiphase gas in a simulation versus the ratio of heating to cooling within the inner 10 kpc at the first time step. Right: Maximum of the median entropy within the inner 25 kpc versus the ratio of heating to cooling within the inner 10 kpc at the first time step. In both panels, a solid line marks a heating to cooling ratio of 2, and a dashed line marks a heating to cooling ratio of 5. A ratio of at least 2 is required to avoid multiphase condensation within 1 Gyr. In the right panel, a dashed line marks the maximum central entropy that is observationally expected for a CC cluster.

Figure 2.7: Left: Relationship between the initial ratio of heating to cooling averaged over the inner 10 kpc and the time-averaged radius ⟨r_-⟩ beyond which cooling begins to dominate over heating. Only those simulations in which r_- can be defined for at least 50 Myr are included. The box in the lower right shows where hypothetical simulations with an average r_- over 30 kpc and an inner heating to cooling ratio under five would fall.
Right: Relationship between the time average of K_H (the maximum level of the median entropy profile within the inner 25 kpc) and the time t_multi until multiphase gas forms in the simulation. The plot includes all simulations, assigning t_multi = 16 Gyr to simulations that do not form cold gas by that time. An empty box in the lower right corner indicates where points representing heating kernels satisfying the adequacy criteria would fall, by persisting for more than 5 Gyr before forming multiphase gas while maintaining a maximum entropy level < 30 keV cm^2 within 25 kpc. However, no heating kernel we tested satisfies those criteria.

Figure 2.8: Top: Time-averaged median entropy profiles of the simulated cluster halos in Figure 2.3. The dotted line shows the simulation with low central heating (α = 2.0, r_s = 8 kpc, r_c = 1000 kpc), and the blue shaded region around it shows the 1σ dispersion of its median profile over time. The dashed line shows the simulation with high central heating (α = 2.6, r_s = 1 kpc, r_c = 150 kpc), and the orange shaded region around it shows its 1σ dispersion. The dot-dashed line shows the simulation with intermediate central heating (α = 2.6, r_s = 12 kpc, r_c = 150 kpc), and the green shaded region around it shows its 1σ dispersion. In each case, entropy is weighted by the X-ray luminosity in the 0.5–2.0 keV band, to mimic data obtainable with Chandra. The median, 1σ interval, and full extent of the entropy profiles of clusters with less than 30 keV cm^2 from ACCEPT are shown in grayscale, using the broken power law fits from Cavagnolo et al. (2009) for the entropy profiles. Bottom: X-ray surface brightness in the 0.5–2.0 keV band for the same simulated halos, with shaded regions showing the 1σ dispersion and black lines showing the median. The median, 1σ interval, and full extent of the profiles of CC clusters from ACCEPT are shown in grayscale, using surface brightness profiles derived from electron density and temperature profiles.

Figure 3.1: Slices of sonic Mach number (left) and magnetic pressure (right) at t = 0.77T and t = 5.16T in the xy-plane through z = πL/2, with streamlines on the left showing the direction of flow and streamlines on the right showing the direction of the magnetic fields, plotting only the 1st quadrant from the Ms0.2_Ma10 simulation and demonstrating the transition of the flow into turbulence.

Figure 3.2: Mean energies over time in the top row, with kinetic energy (solid blue), magnetic energy (solid orange), the sum of kinetic and magnetic energies (solid green), and the change in thermal energy since the simulation start (solid red), and dimensionless numbers over time in the bottom row, with RMS sonic Mach number M_s (blue), Alfvénic Mach number M_A (orange), and plasma beta β (green) for the Ms0.2 simulations. Energy over time from the simulation from Fig. 3a in Pouquet et al. (2010) (adjusted to the normalization used here), which matches the setup of the Ms0.2_Ma1 simulation, is shown with dashed lines in the upper left panel for reference. Energies and Mach numbers for all nine simulations are shown in the online supplements.

Figure 3.3: Kinetic energy spectra (in solid blue) and magnetic energy spectra (in solid orange) compensated by k^(4/3), with black dashed lines showing the power law fit to the spectra used to obtain a spectral index.
In the left column we show the Ms0.2_Ma1 simulation, in the middle column the Ms0.2_Ma3.2 simulation, and in the right column the Ms0.2_Ma10 simulation. In the top row we show all simulations at t = 0.77T; in the middle row we show the three simulations at different times (t = 1.29T, t = 1.81T, t = 1.81T) when the simulations are displaying interesting behavior discussed in sections 3.4.2.2 and 3.4.2.1; and in the bottom row we show all simulations at t = 5.16T, when the initial flow has completely decayed into turbulence and both energy spectra fluctuate around a k^(-4/3) spectrum.

Figure 3.4: The kinetic energy (top) and magnetic energy (bottom) at wavenumbers k = 8, 22, 64, 128 plotted separately in different colors versus time, where the energy at each wavenumber has been compensated by k^(4/3) to make them comparable. In the left column we show the Ms0.2_Ma1 simulation, in the middle column the Ms0.2_Ma3.2 simulation, and in the right column the Ms0.2_Ma10 simulation. Energy at the smallest length scales in both reservoirs saturates at t ≃ 1T, t ≃ 1.5T, and t ≃ 2.5T in the Ms0.2_Ma1, Ms0.2_Ma3.2, and Ms0.2_Ma10 simulations respectively, showing approximately when the turbulence has developed at all scales.

Figure 3.5: Evolution of the spectral indices of the kinetic (blue), magnetic (orange), and sum of kinetic and magnetic energy (green) spectra over time for the Ms0.2 simulations. The slope is computed from a least squares fit of the energy spectra limited to wavenumbers k ∈ [10, 32], which is approximately the inertial range. Shaded bands show how the fitted slope differs if a range k ∈ [8, 34], k ∈ [10, 32], or k ∈ [12, 30] is used. Note that the spectral index using the range k ∈ [10, 32] is not guaranteed to be bounded by the spectral indices obtained using k ∈ [8, 34] and k ∈ [12, 30], which is especially evident in the Ms0.2_Ma3.2 and Ms0.2_Ma10 simulations from t ≃ 2T to t ≃ 4T. Horizontal dashed lines show −4/3 and −5/3 spectral indices. The slope is only shown after t = 1T as the initial flow conditions dominate the spectra at early times, leading to steep spectra. We include the spectral indices versus time for all nine simulations in the online supplements.

Figure 3.6: Shell-to-shell energy transfer plots for the energy transfer within the kinetic (left) and magnetic (right) energy reservoirs via advection and compression at t = 0.77T (top) and t = 5.16T (bottom) from the Ms0.2_Ma1 simulation, showing the development of the kinetic and magnetic turbulent cascades. Annotations on the figure highlight key features of the energy transfer that are characteristic of a developing turbulent cascade. Each bin shows the flux of energy from shell Q to shell K, with orange with white circles showing a positive flux of energy, so that K is gaining energy, and purple with white x's showing a negative flux, so that K is losing energy. The energy flux in each bin is normalized by ε = max_{Q,K} |T_XY(Q, K)|, so that a higher ε means a higher energy flux. The solid black line shows equivalent scale transfers. As the turbulent cascade develops in the magnetic and kinetic energy reservoirs, more energy transfers along the diagonal fill out the energy spectrum down to numerical dissipation scales.
Figure 3.7: Shell-to-shell energy transfer plots for the energy transfer within the kinetic (top) and magnetic (bottom) energy reservoirs via advection and compression at t = 1.29T from the Ms0.2_Ma1 simulation, showing a transient inverse cascade within the magnetic energy reservoir (on all scales K, Q ≲ 100) and kinetic energy reservoir (on large scales K, Q ≲ 16). Annotations show where along the diagonal the inverse cascade is present.

Figure 3.8: Shell-to-shell energy transfer plots for the energy transfer from kinetic to magnetic energy via magnetic tension at t = 1.81T from the Ms0.2_Ma10 simulation, showing the nonlocal energy transfer from large kinetic scales to many smaller magnetic scales. Annotations show where the nonlocal transfer is present.

Figure 3.9: Integrated energy flux over time from kinetic to magnetic energy via tension from larger wavenumbers to smaller nonlocal wavenumbers (purple), from larger wavenumbers to smaller local wavenumbers (blue), between equivalent wavenumbers (green), from smaller wavenumbers to larger local wavenumbers (orange), and from smaller wavenumbers to larger nonlocal wavenumbers (red) in the Ms0.2 simulations. We normalize the energy flux in each panel so that the absolute maximum of all of the flux bins is 1.0, where ε is the normalization factor used in each panel. Comparisons of the relative strength of energy fluxes in different simulations must consider ε. The inset plot in the lower right panel shows the color coded regions that are integrated to calculate each line at a single time for the same shell-to-shell transfer from Figure 3.8. Solid lines show the integrated flux if "local" wavenumbers are defined as 5 logarithmic bins away from the equivalent wavenumber. The shaded regions show the integrated flux if 4 or 6 bins are used, showing that the behavior is robust if the range of "local" wavenumbers is defined closer to or further from the transfer between equivalent scales. We include the integrated flux from kinetic to magnetic energy via tension for all nine simulations in the online supplements.

Figure 3.10: Integrated energy flux over time within the kinetic energy (top) and within the magnetic energy (bottom) from larger wavenumbers to smaller nonlocal wavenumbers (purple), from larger wavenumbers to smaller local wavenumbers (blue), between equivalent wavenumbers (green), from smaller wavenumbers to larger local wavenumbers (orange), and from smaller wavenumbers to larger nonlocal wavenumbers (red) in the Ms0.2_Ma1 simulation. The inset plot in the lower middle panel demonstrates the color coded regions that are integrated to calculate each line at t = 1.29T from the shell-to-shell transfer from Figure 3.7. Solid lines show the integrated flux if "local" wavenumbers are defined as 5 logarithmic bins away from the equivalent wavenumber. The results change very little if 4 or 6 bins are used. We include the integrated flux within the kinetic energy and magnetic energy for all nine simulations in the online supplements.
Figure 3.11: Cross-scale flux within the kinetic energy (blue line), within the magnetic energy (orange line), and from kinetic to magnetic energy via tension (green line) in the three Ms0.2 simulations across columns, at dynamical time t = 0.77T (top) and later at dynamical time t = 5.16T (bottom). Note that the cross-scale fluxes at later times are an order of magnitude less than the early cross-scale fluxes. Positive values of this quantity denote energy transfer from larger to smaller scales.

Figure 4.1: Profiling results on a GPU (left) and CPU (right) for selected regions (x-axis) within the main loop of an MHD timestep using the algorithm described in Sec. 4.4. The different lines correspond to different loop structures (see Sec. 4.3.3), and the timings are normalized to the fastest Riemann region in each panel.

Figure 4.2: Roofline models of a 2-socket Intel Xeon Gold 6248 "Cascade Lake" CPU node on NASA's Aitken (4.2c) and a single NVIDIA Tesla V100 "Volta" GPU on MSU HPCC (4.2d). Theoretical L1 and DRAM bandwidths and theoretical peak throughputs according to manufacturer specifications are shown with dashed lines. For both cases shown here and all other architectures we tested, DRAM bandwidth (or MCDRAM bandwidth for KNLs) is the limiting bandwidth for K-Athena's performance.

Figure 4.3: Performance portability plot of several CPU and GPU machines with different architectures. Individual bars show the performance of K-Athena compared to the theoretical peak performance limited by the empirically measured DRAM and L1 bandwidths. Black bars with diamonds denote the theoretical performance limited by the manufacturer reported bandwidths. The performance portability metrics across all architectures for DRAM and L1 are shown with horizontal orange lines, where solid orange uses the empirically measured bandwidths and dashed orange uses the manufacturer reported bandwidths.

Figure 4.4: Raw performance for double precision MHD (algorithm described in Sec. 4.4) of K-Athena, Athena++, and GAMER on a single GPU (left) or CPU (right) for varying problem sizes. Volta refers to an NVIDIA V100 GPU, Pascal refers to an NVIDIA P100 GPU, BDW (Broadwell) refers to a 14-core Xeon E5-2680 CPU, and SKX (Skylake) refers to a 20-core Xeon Gold 6148 CPU. The GAMER numbers were reported in Zhang et al. (2018) for the same algorithm used here.

Figure 4.5: Weak scaling for double precision MHD (exact algorithm described in Sec. 4.4) on different supercomputers and architectures for K-Athena and the original Athena++ version. Numbers correspond to the 80th percentile of individual cycle performances of several runs in order to reduce effects of network variability. The top row shows the raw performance in number of cell-updates per second per node and can directly be compared between different systems and architectures. The bottom row shows the parallel efficiency normalized to the individual single node performance. The first column contains results for a workload of 64^3 and 128^3 cells per core on NASA's Electra system using two 20-core Intel Xeon Gold 6148 processors per node.
The second column shows results for a workload of 64^3 per core on ALCF's Theta system with one 64-core Intel Xeon Phi 7230 (Knights Landing) per node. HT-1, HT-2, and HT-4 refer to using 1, 2, and 4 hyperthreads per core, respectively. The third column shows results for a workload of 128^3 per CPU core and 192^3 per GPU on OLCF's Titan system with one AMD Opteron 6274 16-core CPU and one NVIDIA K20X (Kepler) GPU per node. The last column contains results for a workload of 64^3 per CPU core and 256^3 per GPU on OLCF's Summit system with two 21-core IBM POWER9 CPUs and six NVIDIA V100 (Volta) GPUs per node. On all systems the GPU runs used 1D loops and the CPU runs used simd-for loops, with the exception of the dashed purple line on Summit that used Kokkos nested parallelism; see Sec. 4.3.3 for more details.

Figure 4.6: Strong parallel scaling for double precision MHD (algorithm described in Sec. 4.4) of K-Athena on NVIDIA V100 GPUs (6 GPUs per node; green solid lines) and IBM POWER9 CPUs (42 cores per node; orange/red dash-dotted lines) on Summit. The top panel shows the raw performance in cell-updates per second per node and the bottom panel shows the parallel efficiency. The effective workload per GPU goes from 256^3 to 64^3 for the 1,536^3 domain and from 256^3 to 128^3 for the 3,072^3 domain. In the CPU case the effective workload per single POWER9 CPU (21 cores) goes from 353^3 to 88^3 for the 1,408^3 domain and from 353^3 to 177^3 for the 2,944^3 domain. The resulting effective workloads per node are comparable (within a few percent) between GPU and CPU runs.

Figure 5.1: Enthalpy (top), sound speed (middle), and equivalent adiabatic index (bottom) as a function of the temperature proxy Θ/c^2 for the Synge gas (solid blue), the ideal equation of state with a relativistic Γ = 4/3 (dashed orange) and a non-relativistic Γ = 5/3 (finely dashed green), and the Taub-Matthews approximation to the Synge gas (dot-dashed red). With the Synge and Taub-Matthews equations of state, each of the quantities shown here varies smoothly between the two extremes of the ideal equation of state as Θ/c^2 changes from non-relativistic to relativistic. The Taub-Matthews equation of state provides a reasonable approximation to the Synge gas while remaining simple for computation.

Figure 5.2: Map of the error of the conserved-to-primitive solvers, with the error using the analytical method in the left column, the error using varying numbers of iterations in the middle two columns, and the error of these configurations versus Lorentz factor in the right column. The top row shows results for the ideal gas, testing the iterative solver with 6 and 12 iterations, and the bottom row shows results for the Taub-Matthews equation of state, testing the iterative solver using 25 and 50 iterations. In all panels, 25 × 25 primitive states are tested with Lorentz factors varying from 1 to 100 on the x-axis and pressures varying from 10^5 to 10^10 N m^-2, using c = 3 × 10^8 m s^-1 and fixing D = 1 kg m^-3; these primitive states are first converted to conserved states and then converted back to a primitive state using the specified analytical or iterative solver. In the left three columns, the relative error is shown in color with the y-axis showing the pressure.
In the rightmost column, the median (solid line) and first to third quartile (shaded region) of the error, sampled using different pressures at a given Lorentz factor, are shown. All results in this figure use the Intel compiler on CPUs. The iterative solver for the ideal equation of state is more accurate than the analytic solver using just 12 iterations for high Lorentz factors and just 6 iterations for low Lorentz factors. For the Taub-Matthews equation of state, the analytical solver is almost always at least as accurate as, or more accurate than, the iterative solver.

Figure 5.3: Required iterations for the iterative solver to reach the same accuracy as the analytical solver using the same primitive states as Fig. 5.2, with results for the ideal gas in the top row and the Taub-Matthews equation of state in the bottom row. The left column shows the required iterations when compiling with the Intel compiler in color, with Lorentz factor on the x-axis and pressure on the y-axis. For two primitive states the ideal analytic solver recovers the velocity exactly, leading to the iterative solver being unable to reach the same accuracy, which we show in yellow. The right column shows the median (solid line) and first to third quartile (shaded region) of the error sampled using different pressures at a given Lorentz factor. Results with the GNU compiler on CPUs are shown in orange, with the Intel compiler on CPUs with the Kokkos OpenMP backend in blue, and with the Kokkos CUDA backend on GPUs in green.

Figure 5.4: Timing comparisons for the iterative solver to reach the same accuracy as the analytic solver, with comparisons as a color map in the left three panels and versus Lorentz factor in the rightmost panel, using the same primitive states as Fig. 5.2, with results for the ideal gas in the top row and the Taub-Matthews equation of state in the bottom row. In all panels we compare results using the metric Analytical Time/Iterative Time − 1, where a positive value shows how much slower the analytical solver is as a fraction of the time the iterative solver takes and a negative value shows the fraction by which the analytical solver is faster. The left three columns show the timing metric in color (blue shows where the iterative method is faster) with the Lorentz factor on the x-axis and the pressure on the y-axis, showing comparisons for the GNU and Intel compilers on CPUs with the Kokkos OpenMP backend and on GPUs with the Kokkos CUDA backend across the three columns. The rightmost column shows the median (solid line) and first to third quartile (shaded region) of the timing metric sampled using different pressures at a given Lorentz factor, showing results for all compilers tested (note that this does not compare timings between compilers, only the analytic against the iterative solver for each compiler). For the ideal equation of state, the iterative solver is faster than the analytic solver under a certain threshold of Lorentz factor that is compiler and architecture dependent. The iterative solver for the Taub-Matthews equation of state is almost always slower than the analytic method.
Figure 5.5: Aggregate performance of all methods and compilers tested, shown as box and whiskers of the primitive recoveries per second (higher is better) across the grid of primitive states used in Fig. 5.2. Red lines show medians, boxes show the interquartile range, and whiskers show the maximum and minimum values inside of 1.5 times the length of the interquartile range above the 3rd quartile and below the 1st quartile, as described by Tukey (1977). We exclude outlier timings from the figure, which range from 10¹¹ to 1.2 × 10¹² primitive recoveries per second for all methods and compilers. We show results for GNU on CPUs in orange, Intel on CPUs in blue, and CUDA on GPUs in green, for the ideal gas on the left and the Taub-Matthews equation of state on the right. Generally, on CPUs using the Intel compiler allows more primitive recoveries per second than the GNU compiler. The performance for recovery with the Taub-Matthews gas has a much larger spread than recovery with the ideal equation of state. Between the two equations of state, the solvers achieve roughly the same number of recoveries per second on each architecture, indicating that the choice of equation of state can have a limited impact on the full code’s performance.
Figure 5.6: Order of convergence for the relativistic mass density (in solid blue) for three resolutions along the x-axis, with the 5 eigenvalue/eigenvector pairs j ∈ {−, 0^(1,2,3), +} in different panels. For all tests here we test in 3D with non-grid-aligned waves, using a 2nd order basis with the SSPRK3 integrator. For all cases we expect a 3.0 rate of convergence, which we denote with a dashed black line.
Figure 5.7: Plots of the five 1D Riemann problems tested using the ideal equation of state. Each row shows the end state of a different Riemann problem. From top to bottom, the first row shows a mildly relativistic blast wave, the second a highly relativistic blast wave, the third a blast wave with transverse velocity, the fourth a Sod shock tube, and the fifth a planar shock reflection. The columns show, from left to right, the rest-mass density, the pressure, the velocity, and the Lorentz factor. In each panel we show the reference solution computed with a finite volume scheme (Stone et al., 2020a) with a solid line and the basis 0, 1, and 2 solutions with our method with a red dashed, green dot-dashed, and finely dashed yellow line respectively. Although the method can evolve these shocks with the help of the physicality-enforcing operator, small oscillations appear around shocks for higher order bases. These oscillations can be damped out by widening the limiting thresholds for the Moe limiter or by changing the minmod limiter, but this results in more diffusion and lower order convergence for basis order 2.
Figure 5.8: Convergence of the L1 error of the method presented here to a high resolution reference solution of the same Riemann problems from Fig. 5.7 computed with a finite volume scheme (Stone et al., 2020a). From top to bottom, the first row shows a mildly relativistic blast wave, the second a highly relativistic blast wave, the third a blast wave with transverse velocity, the fourth a planar shock reflection, and the fifth a Sod shock tube. The columns show, from left to right, the rest-mass density, the pressure, the velocity, and the Lorentz factor.
In each panel we show the L1 error of our method with dots, a fitted convergence rate using logarithmically weighted least squares with a solid line, and a 2/3 convergence rate for basis order 0 and a first order convergence rate for bases 1 and 2 with dashed lines. We use different colors to denote different basis orders, using blue for basis order 0, orange for basis order 1, and green for basis order 2. Due to the presence of shocks, we expect the L1 error of higher order bases to converge to first order at best, although sharp blasts prove difficult for convergence.
Figure 5.9: Blast wave with relativistic temperatures on the left and non-relativistic temperatures on the right, evolved to t = 0.7 using the Taub-Matthews equation of state (solid blue), ideal equation of state with adiabatic index Γ = 4/3 (dashed orange), and ideal equation of state with Γ = 5/3 (finely dashed green). In order of rows, we show the density ρ, longitudinal velocity u_x, transverse velocity u_y, pressure P, and equivalent adiabatic index Γ_eq = (h − c²)/(h − c² − P/ρ). The Taub-Matthews equation of state, as an approximation to the Synge gas, behaves apart from both the Γ = 5/3 and Γ = 4/3 ideal gases depending on the effective adiabatic index.
Figure 5.10: Plots of the 2D Riemann problem test 1 with two colliding shocks using the initial conditions in eq. 5.95, using a 1st order basis in the top row and a 2nd order basis in the bottom row. We show the rest-mass density in the left column and the pressure in the right column at t = 0.8/c on a grid with 1024 elements. Note the boundary effects where shocks traveling into the first quadrant intersect with the outflow boundaries when using the 2nd order basis.
Figure 5.11: Plots of the 2D Riemann problem test 2 with four vortex sheets using the initial conditions in eq. 5.99, using a 1st order basis in the top row and a 2nd order basis in the bottom row. We show the rest-mass density in the left column and the pressure in the right column at t = 0.8/c using a grid with 1024 elements. Note the boundary effects where the vortex sheets intersect with the outflow boundaries, which are subtle using the 1st order basis and more apparent when using the 2nd order basis, especially along the top boundary. Like the 1D test of a shock reflecting against a wall, this test highlights unresolved difficulties of higher order bases leading to boundary effects.
Figure 5.12: Plots of the 2D Riemann problem test 3 with intersecting rarefactions using the initial conditions in eq. 5.103. We show the rest-mass density in the left column and the pressure in the right column at t = 0.8/c using a 2nd order basis on a grid with 1024 elements.
Figure 5.13: Mean square of the transverse velocity v_y over time of the relativistic 2D Kelvin-Helmholtz instability using our DG method with 0th, 1st, and 2nd order bases respectively in the top three rows and using the finite volume code PLUTO with PLM and PPM reconstruction respectively in the bottom two rows. In the left column we show results including the contact discontinuity in the Riemann solver (using HLLC with our method and HLLD with PLUTO) and without the contact discontinuity using the HLL Riemann solver in the right column. The gray band from t = 1.5 to t = 3.0 shows the region over which we measure the growth rate shown in other plots.
Higher resolutions generally lead to faster growth rates, while the more diffusive HLL Riemann solver leads to steadier growth rates due to diminished secondary instabilities.
Figure 5.14: Growth rates of ⟨v_y²⟩ versus degrees of freedom from t = 1.5 to t = 3.0 of the relativistic 2D Kelvin-Helmholtz instability using our DG method and the finite volume code PLUTO. In the left column we show results including the contact discontinuity in the Riemann solver (using HLLC with our method and HLLD with PLUTO) and without the contact discontinuity using the HLL Riemann solver in the right column. Growth rates are measured by computing a least squares fit of a ⟨v_y²⟩ ∝ t^ω model to the data shown in Fig. 5.13, with error bars showing the standard deviation of the least squares fit.
Figure 5.15: The absolute difference in growth rate between the highest resolution simulation for each method and each of the lower resolution simulations, which serves as a rough measure of the error of the growth rate, plotted versus the degrees of freedom. The discontinuous-Galerkin simulations with a 1st order basis show the most effective convergence of the simulations explored here, with HLLC converging slightly faster at the highest resolutions and HLL converging faster at lower resolutions. The discontinuous-Galerkin simulations with a 2nd order basis do not converge below a 10⁻¹ difference even with high resolutions, which we attribute to the boundary effects that worsen with higher resolution. Otherwise, the other methods converge at varying rates, with the 0th order basis discontinuous-Galerkin methods converging the slowest.
Figure 5.16: Snapshots of the transverse velocity at t = 3.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 0th order basis. We show results using the HLL Riemann solver in the top row and with HLLC in the bottom row. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. With basis order zero, at this stage, using the HLL Riemann solver the method has difficulty growing the Kelvin-Helmholtz instability, although the structure of the perturbation resembles the simple structures seen when using higher orders. The HLLC Riemann solver generates secondary vortices that get worse with high resolutions, which leads to a climbing growth rate.
Figure 5.17: Snapshots of the transverse velocity at t = 3.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 1st order basis in the first and third rows and with the PLUTO finite volume MHD code with a first order method. We show results using the HLL Riemann solver in the top two rows and with HLLC for our code and with HLLD for PLUTO in the bottom two rows. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. Note that the DG method has 4 times as many degrees of freedom with the 1st order basis, meaning that our 512 × 1024 simulation is comparable in degrees of freedom to the 1024 × 2048 simulation using PLUTO. At this time and these resolutions, the results with our DG method have converged to a similar solution with a simple structure.
Results with PLUTO converge towards the DG method results, with secondary vortices present at lower resolutions that are more pronounced with HLLC.
Figure 5.18: Snapshots of the transverse velocity at t = 3.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 2nd order basis in the first and third rows and with the PLUTO finite volume MHD code with a second order method. We show results using the HLL Riemann solver in the top two rows and with HLLC for our code and with HLLD for PLUTO in the bottom two rows. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. Note that the DG method has 4 times as many degrees of freedom with the 1st order basis, meaning that our 512 × 1024 simulation has degrees of freedom between the 1024 × 2048 simulation and the 2048 × 4096 simulation using PLUTO. With this higher order basis at t = 3.0, we also see the results with our DG method converge quickly to simple structures, while the results with PLUTO require more resolution to suppress secondary vortices. However, in our results using 4096 × 8192 cells with basis order 2, we see anomalously high transverse velocities away from the interface, which is caused by boundary effects at high resolutions that will be addressed in future improvements to the method.
Figure 5.19: Snapshots of the transverse velocity at t = 5.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 0th order basis. We show results using the HLL Riemann solver in the top row and with HLLC in the bottom row. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. At late times into what should be the linear growth phase, our DG method with the HLL solver struggles to grow the instability at low resolutions. The HLLC method has developed some structures, but they do not resemble results at higher orders.
Figure 5.20: Snapshots of the transverse velocity at t = 5.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 1st order basis in the first and third rows and with the PLUTO finite volume MHD code with PLM reconstruction. We show results using the HLL Riemann solver in the top two rows and with HLLC for our code and with HLLD for PLUTO in the bottom two rows. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. Note that the DG method has 4 times as many degrees of freedom with the 1st order basis, meaning that our 512 × 1024 simulation is comparable in degrees of freedom to the 1024 × 2048 simulation using PLUTO. At this later time, once the instability has entered the nonlinear growth phase, the DG method shows clear roll-ups at all resolutions. Secondary vortices are suppressed with higher resolutions and by the more diffusive HLL solver. In contrast, the PLUTO results show secondary instabilities throughout the perturbation, although these diminish with resolution. Notably, the structures of the instabilities with the DG method and with the finite volume method are very different.
Figure 5.21: Snapshots of the transverse velocity at t = 5.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 2nd order basis in the first and third rows and with the PLUTO finite volume MHD code with PPM reconstruction. We show results using the HLL Riemann solver in the top two rows and with HLLC for our code and with HLLD for PLUTO in the bottom two rows. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 2048 × 4096 cells from left to right. Note that the DG method has 4 times as many degrees of freedom with the 1st order basis, meaning that our 512 × 1024 simulation has degrees of freedom between the 1024 × 2048 simulation and the 2048 × 4096 simulation using PLUTO. The suppression of secondary vortices with our DG method is enhanced with basis order 2 compared to basis order 1, requiring fewer cells and degrees of freedom. Secondary instabilities still appear with the finite volume method, largely unaffected by the increase in method order.
Figure 5.22: Performance of the code modeling the Kelvin-Helmholtz instability from Section 5.5.5, plotting updates to degrees of freedom per second versus degrees of freedom, using 1024 cores spread across 22 dual-socket nodes with Intel Xeon Platinum 8268 CPUs (approximately 88 TFLOPS in total) in the left column and using 32 NVIDIA Tesla V100-SXM2 GPUs (approximately 250 TFLOPS in total) spread across 8 nodes on the right, where the peak computational throughput of the GPUs used is roughly three times the peak computational throughput of the CPUs. The computational resources for both tests were chosen to accommodate the memory needed for the largest simulation in the suite. We show profiling results with the HLLC and HLL Riemann solvers and with the 0th, 1st, and 2nd order bases, between which we see little difference in performance. Comparing between the CPU and GPU runs, however, we see that the CPU performance becomes saturated at around 10⁶ DOFs while the GPUs have not saturated the performance, even with simulations using more than 10 times the degrees of freedom.
Figure 7.1: Electric field (top left) and pressure (top right) along the 1D warm diode with electron temperatures T_e = 1, 10, 100 eV using the relativistic two-fluid MHD DG method with my contributions (in red, green, and blue for each temperature) and the exact solutions from a semi-analytic model in black. L1 error in the electric field (bottom left) and pressure (bottom right) of the relativistic two-fluid MHD DG method relative to the exact solution, showing 2nd order convergence as expected for the second-order accurate fluid solver. Figures taken from Laity et al. (2021).

CHAPTER 1

INTRODUCTION

1.1 Galaxy Clusters

Galaxy clusters are the largest gravitationally bound objects in the universe, beyond which the expansion of space due to dark energy exceeds gravity (Longair, 2008; Mo et al., 2010). With virial masses on the order of 10¹⁴–10¹⁵ M⊙ and radii ∼1 Mpc, by mass they are primarily composed of dark matter – typically 84% of a galaxy cluster’s mass is contained in a dark matter halo.
The remaining 16% is baryonic matter, 84% of which is contained in the intracluster medium (ICM), a hot diffuse X-ray emitting plasma permeating the cluster with temperatures on the order of 1–10 keV (10⁷–10⁸ K) and particle densities on the order of 10⁻⁴–10⁻² cm⁻³. The remaining 10% of the baryonic matter, constituting 1% of the total galaxy cluster mass, is contained within 10–100 galaxies (Longair, 2008; Mo et al., 2010).

In their unique role as the largest gravitationally bound objects in the universe, galaxy clusters serve as key probes of cosmological properties of the universe (Lima et al., 2003; Wang et al., 2004; Basilakos et al., 2010; Pratt et al., 2019; Allen et al., 2011). Specifically, they trace the structure of the largest overdensities of dark matter, revealing the power spectrum of the mass distribution through the universe on the largest scales. Determining this structure is essential for characterizing an equation of state for dark energy (Lima et al., 2003), the observed but as yet relatively uncharacterized force that drives the accelerating expansion of the universe. More precisely, we need the number density of galaxy clusters as a function of their virial mass and redshift (see Voit, 2005; Allen et al., 2011, for a thorough review). However, virial masses are not directly measured, but instead must be inferred from observable properties such as gravitational lensing and the electromagnetic radiation emitted by the baryonic matter.

The most straightforward method to determine a galaxy cluster’s mass is via strong gravitational lensing, where the mass of the galaxy cluster (primarily its dark matter halo) bends the trajectory of light from a background source behind the galaxy cluster following General Relativity (Kochanek, 2006; Hoekstra et al., 2013; Bartelmann, 2010), creating multiple images of the background source around the galaxy cluster. Although strong gravitational lensing can be used to determine galaxy cluster masses with minimal assumptions, it requires a background source directly behind the cluster that is near enough to observe the multiple images, thus limiting its application to a small number of systems. Strong lensing is also only useful for estimating the mass near the cluster core, so the virial mass of the galaxy cluster still needs to be inferred (Hoekstra et al., 2013). In contrast, weak gravitational lensing from the deflection of light in the entire sky by multiple sources gives a statistical measure of the distribution of mass in the universe (Bartelmann & Schneider, 2001).

The virial masses and number densities of galaxy clusters can also be determined from multi-wavelength observations – the electromagnetic radiation emitted by the baryonic matter in galaxy clusters and observed in multiple wavelengths. Of particular interest to galaxy clusters and the ICM, the X-ray emission from the hot diffuse ICM measures the gas density and temperature (see Figure 1.1), which can be used to estimate the galaxy cluster mass assuming the gas is in hydrostatic equilibrium (HSE; Sarazin, 1988; Allen et al., 2011). However, the ICM is disrupted from HSE by AGN feedback, magnetic fields, turbulence, cosmic ray pressure, and any other non-thermal support (Fabian et al., 2003; Carilli & Taylor, 2002; Dennis & Chandran, 2005; Loewenstein et al., 1991). Likewise, the optical emissions from galaxies within the galaxy cluster can be used to estimate the cluster mass by assuming dynamical equilibrium (Binney & Tremaine, 1987; Carlberg et al., 1997).
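For concreteness, the X-ray hydrostatic estimate mentioned above follows from a standard textbook argument (not specific to this work): combining the HSE condition dP/dr = −ρ G M(<r)/r² with the ideal gas law for the ICM gives

M(<r) = -\frac{k_B T(r)\, r}{G\, \mu m_p}\left(\frac{d\ln\rho}{d\ln r} + \frac{d\ln T}{d\ln r}\right),

so that, when HSE holds, the X-ray-derived density and temperature profiles of the ICM directly yield a mass profile; here μ m_p is the mean particle mass.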
Galaxies can similarly be disrupted from dynamical equilibrium by large scale structure interactions in the universe (White et al., 2010). The Sunyaev–Zeldovich (SZ) effect – the upscattering of the cosmic microwave background (CMB) to higher energies via inverse Compton scattering with high-energy electrons (Sunyaev & Zel’dovich, 1980) – can also be used to estimate the gas density and temperature of the electron population of galaxy clusters. Although assumptions of HSE and dynamical equilibrium can be used to coarsely estimate galaxy cluster masses, more precise mass proxies relying on electromagnetic observables depend on a more precise understanding of the dark matter and ICM out of HSE and dynamical equilibrium.

Figure 1.1: Galaxy cluster Abell 1689 in X-ray (purple) as captured by Chandra with optical from Hubble underneath (X-ray: NASA/CXC/SAO/E. Bulbul et al.; optical: NASA/STScI). The galaxy cluster has sufficient mass to bend light from background galaxies around the galaxy cluster core, smearing background sources into duplicated arcs around the galaxy cluster core. This strong gravitational lensing permits estimates of the galaxy cluster’s mass (Kochanek, 2006; Hoekstra et al., 2013; Bartelmann, 2010). The intracluster medium – the hot, diffuse plasma comprising most of the baryonic mass but a relatively smaller portion of the total mass – is responsible for the majority of the X-ray emission.

Numerical simulations of both components of galaxy clusters, individually and simultaneously, have been essential for recent improvements of galaxy cluster mass proxies (Pratt et al., 2019). N-body simulations of the dark matter halos of galaxy clusters can inform more realistic halo mass profiles (Navarro et al., 2004; Gao et al., 2012). Galaxy cluster simulations including the complex plasma physics of the ICM, however, are rapidly evolving (Walker et al., 2019), with the magnetized plasma nature of the ICM and AGN feedback under particular recent scrutiny (Donnert et al., 2018; Morganti, 2017). Understanding the ICM is key to developing accurate mass proxies to discern the virial mass of galaxy clusters in large X-ray surveys that can characterize the structure of dark matter in the universe and the equation of state of dark energy. At present, the forefront of our understanding of galaxy clusters is limited by our understanding of the ICM as a complex plasma.

1.2 Plasmas

“Plasma” is a state of matter where a portion or all of the electrons are decoupled from the ions, creating a sea of charged particles (Chen & Chen, 1984; Bittencourt, 2004; Bellan, 2008). These charged particles facilitate currents and thus magnetic fields within the matter. The bulk kinetic motion of the charged particles exerts forces on these currents and magnetic fields and vice versa, leading plasmas to exhibit behaviors unseen in other states of matter. With these properties, plasmas can behave quite differently from un-ionized baryonic matter. Although rare on Earth, plasmas are ubiquitous in the universe, comprising the vast majority of baryonic matter.
We can divide most plasmas into two broad categories: terrestrial plasmas, which occur or are created on Earth, and astrophysical plasmas, which occur beyond the Earth’s atmosphere.¹ Most terrestrial plasmas, except for naturally occurring lightning, are either created for industrial applications or for a wide variety of scientific experiments (Chen & Chen, 1984). Chief among these experiments are prototype fusion devices, including magnetic confinement fusion (MCF; Ongena et al., 2016) devices such as the International Thermonuclear Experimental Reactor (ITER; Aymar et al., 2002), where the plasma is confined by self-sustaining magnetic fields, and inertial confinement fusion (ICF; Craxton et al., 2015) devices such as the National Ignition Facility (NIF; Miller et al., 2004; Zylstra et al., 2022) and the Z-Machine (Sinars et al., 2020), where the fusion target is unconfined but a burst of energy from lasers or large currents heats the plasma quickly enough to allow inertia to confine the plasma for long enough to attain pressures and temperatures sufficient to undergo fusion. MCF plasmas are typically maintained for seconds to tens of seconds (Ongena et al., 2016), with minutes-long plasmas expected from fusion devices in the near future, while ICF plasmas persist for picoseconds to microseconds (Zylstra et al., 2022).

Astrophysical plasmas, in comparison, are typically hotter, often more diffuse, and much longer lived (Chiuderi & Velli, 2015, see Figure 1.2 for examples of astrophysical and terrestrial plasmas). Stars, the interstellar medium (ISM) between them, and the ICM are all plasmas. Except for the dense plasmas found within compact objects such as stars, the majority of matter in the universe is within diffuse plasmas such as the ISM and ICM. These plasmas are also typically much longer lived than terrestrial plasmas, and present day temperatures of the ISM and ICM lead to partially or fully ionized plasmas. However, the dynamics of the fluid and the coupling with magnetic fields in astrophysical plasmas are governed by the same physical laws as terrestrial plasmas. Although not as physically accessible as a plasma created in a laboratory, the ubiquity and longevity of astrophysical plasmas allows convenient study of plasma physics via our observations of astrophysical plasmas. Thus, knowledge in plasma physics gained from studying astrophysical plasmas can improve understanding of terrestrial plasmas and vice versa.

¹ Plasmas in the upper Earth’s atmosphere and magnetosphere are known as space plasmas and have characteristics of both terrestrial plasmas and astrophysical plasmas, being more diffuse and longer lived than terrestrial plasmas but not as hot as astrophysical plasmas (Baumjohann & Treumann, 2012; Treumann & Baumjohann, 1997).

Figure 1.2: Charged particle number densities on the x-axis and temperatures on the y-axis for different astrophysical and terrestrial plasmas. The comparatively hot and diffuse plasma of the ICM is marked in yellow, with the Perseus cluster as seen in X-ray by Chandra (NASA/CXC/SAO/E. Bulbul et al.). Diagram by the Contemporary Physics Education Project (https://www.cpepphysics.org).

1.2.1 Plasma Regimes

The behavior of plasmas and the theories and equations that best describe them depend on many of the properties of the plasma system in question (see Figure 1.3; Kramer et al., 2020).
These properties include the particle composition, the degree of ionization, the thermodynamics, the kinematics, the electrodynamics, the scale of the system of interest relative to other scales in the system, and countless other properties. Fortunately, to simplify categorization, different plasma models can be broadly divided based on a few quantifiable properties. Different models of plasmas can be broadly divided into kinetic and fluid methods based on the Knudsen number Kn of the plasma system

\mathrm{Kn} = \frac{\lambda}{L},    (1.1)

which is the ratio of the mean free path of particles λ to the length scale of interest L.

Figure 1.3: Spectrum of appropriate plasma models for different regimes, as determined by the Knudsen number Kn, which is a measure of the relative importance of particle-particle interactions versus ensemble interactions, defined in equation 1.1, and the charge separation distance Λ_d, which is a measure of the importance of electric fields in the plasma, defined in equation 1.18. Fluid models appear to the left and kinetic models appear to the right, while models where electromagnetics are important appear towards the bottom and models where electromagnetics are unimportant appear towards the top. Systems and simulations explored within astrophysics typically use models from the 4 extremes: Euler, Boltzmann, ideal MHD, and Vlasov models. The plasma model best describing the ICM would be a non-ideal MHD model on the galaxy cluster scale and a Vlasov model on the plasma instability and particle acceleration scale. Created by Uri Shumlak for a presentation at Sandia National Laboratories (Shumlak, 2015) and appearing in Kramer et al. (2020).

The Knudsen number depends on the size of the system examined – i.e., Mpc for the plasma comprising galaxy clusters. Smaller size systems exist, however, within the larger system. For example, in the ICM plasma instabilities and particle acceleration across shocks and turbulence happen at a much smaller scale (on the order of km to pc), whereas the mean free path of particles is the same, resulting in a larger Knudsen number for a system studying plasma instabilities within the ICM versus studying the ICM of an entire galaxy cluster (Marcowith et al., 2020).² Although the particle acceleration physics with high Knudsen numbers still occurs within the larger physics of the ICM of a galaxy cluster, its effects are typically secondary to the larger scale dynamics. Even though the mean free paths and length scales of astrophysical and terrestrial plasmas can differ greatly, their ratios and thus Knudsen numbers can be quite similar, allowing them to be studied with shared models. For example, while the km-scale plasma instabilities in the ICM happen on much larger scales than plasma instabilities in magnetically insulated transmission lines (MITL) used to deliver power to accelerators (Ottinger & Schumer, 2006; Kramer et al., 2020; Luo et al., 2019), both systems have similar Knudsen numbers and thus can be studied with similar plasma models.

² The effective Knudsen number of the ICM is complex since the mean free path of particles via Thomson scattering in the ICM (10–1000 kpc) is significant compared to the size of the galaxy cluster, but the length scale on which plasma instabilities can introduce dissipation (∼ km) is quite small. Thus, the applicability of fluid models appropriate for small Knudsen numbers to the ICM is under debate. These issues may be addressed using non-ideal MHD (Kunz et al., 2011; Schekochihin et al., 2009).
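To make the role of the system scale concrete, the short Python sketch below (illustrative only; the mean free path and both length scales are order-of-magnitude assumptions taken from the discussion above) evaluates Kn for a cluster-scale system and for a much smaller instability-scale system embedded in the same plasma.

    # Illustrative Knudsen numbers Kn = lambda / L for two ICM systems.
    # The ~10 kpc mean free path and the two length scales are assumed,
    # order-of-magnitude values only.
    kpc = 3.086e21                # cm
    mfp = 10.0 * kpc              # assumed ICM mean free path, ~10 kpc
    L_cluster = 1000.0 * kpc      # cluster scale, ~1 Mpc
    L_instability = 1.0e-3 * kpc  # plasma-instability scale, ~1 pc
    # ~1e-2: nominally fluid-like (though the footnote notes this is debated)
    print("Kn (cluster scale)    :", mfp / L_cluster)
    # ~1e4: firmly kinetic
    print("Kn (instability scale):", mfp / L_instability)

The same plasma therefore calls for a fluid description at one scale and a kinetic description at another, which is the point of the regime diagram in Figure 1.3.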
High Knudsen number plasmas are best described with kinetic theory, where the plasma is described by a statistical distribution of particles in phase space. Each particle species of the plasma is described by a density function in phase space that evolves over time. Following Kramer et al. (2020), let N_s(r, v, t) be the phase density function containing every particle in the plasma of species s, where r is a position, v is a velocity, and t is a time. The microscopic evolution of this density function is exactly described by the Klimontovich equation (Klimontovich, 1994)

\frac{\partial N_s}{\partial t} + \mathbf{v} \cdot \frac{\partial N_s}{\partial \mathbf{r}} + \frac{q_s}{m_s}\left(\mathbf{E} + \mathbf{v} \times \mathbf{B}\right) \cdot \frac{\partial N_s}{\partial \mathbf{v}} = 0,    (1.2)

where q_s and m_s are the particle species charge and mass and where E and B are the local microscopic electric and magnetic fields, which are governed by Maxwell’s equations. The density function N_s encompasses every individual particle of the plasma, however, which is rarely useful or tractable for modeling whether by theory or numerical simulation. If we instead consider a probability distribution function (PDF) f_s(r, v, t) of each particle species and consider averaged macroscopic electric and magnetic fields, we obtain the Boltzmann equation (Chen & Chen, 1984; Bittencourt, 2004; Bellan, 2008)

\frac{\partial f_s}{\partial t} + \mathbf{v} \cdot \frac{\partial f_s}{\partial \mathbf{r}} + \frac{q_s}{m_s}\left(\mathbf{E} + \mathbf{v} \times \mathbf{B}\right) \cdot \frac{\partial f_s}{\partial \mathbf{v}} = \left.\frac{\partial f_s}{\partial t}\right|_{\mathrm{Coulomb}},    (1.3)

where the rightmost term is a source and sink term for Coulomb collisions. Specific operators for the Coulomb collision terms give the Fokker-Planck equation and the Vlasov equation. The particles are coupled to the electromagnetic fields via Maxwell’s equations, which can be written in Lorentz-Heaviside units as

\frac{1}{c}\frac{\partial \mathbf{E}}{\partial t} = \nabla \times \mathbf{B} - \frac{4\pi}{c}\mathbf{J}    (1.4)

\frac{1}{c}\frac{\partial \mathbf{B}}{\partial t} = -\nabla \times \mathbf{E}    (1.5)

\nabla \cdot \mathbf{E} = q    (1.6)

\nabla \cdot \mathbf{B} = 0,    (1.7)

where the current and charge densities are defined as

\mathbf{J} \equiv \sum_s q_s n_s \mathbf{v}    (1.8)

q \equiv \sum_s q_s n_s    (1.9)

and n_s is the zeroth moment of the distribution function,

n_s = \int f_s \, d\mathbf{v}.    (1.10)

Examples of high Knudsen number systems in astrophysics include the microphysics of particle acceleration via shocks and magnetized turbulence to create cosmic rays and the magnetospheres surrounding many planets.

Since the equations in kinetic theories have high dimension – 6D PDFs are needed for each species even with the statistical simplifications used for the Boltzmann and Vlasov equations – numerical approaches are often limited. Monte Carlo (MC) methods (Metropolis et al., 1953), which rely on random sampling to approximate solutions, are generally more useful for highly dimensional systems compared to other methods such as finite volume or finite element (see Section 1.2.4 and Humpherys et al., 2017). The most widely used method for kinetic theories is the Particle-in-Cell method (PIC), where the distributions of the species are randomly sampled by super-particles representing populations of particles that are then used to approximate electromagnetic fields across a mesh of cells (Harlow et al., 1955; Dawson, 1983; Tskhakaya et al., 2007). The electromagnetic fields are then used to update the positions and velocities of the super-particles, evolving the fields and then the particles with a leapfrog integration method. As an MC method, PIC converges slowly with increased super-particle count, improving accuracy only at an n^{1/2} rate, where n is the number of super-particles (Myers et al., 2016). For large systems or long evolution times this slow convergence makes PIC a resource-intensive, cumbersome, and sometimes infeasible computational method, depending on the system of interest (Harlow, 1962; Liu et al., 2019).
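The Monte Carlo convergence rate quoted above can be illustrated with a toy example (this is not the PIC algorithm itself, only the sampling error it inherits): estimating a velocity-space moment of a Maxwellian distribution from n random samples, the way a super-particle population samples f_s, gives an error that shrinks roughly as n^{-1/2}.

    import numpy as np

    # Estimate <v^2> of a 1D Maxwellian with unit thermal speed from n random
    # samples (exact value is 1). The error tracks the ~n**-0.5 Monte Carlo
    # scaling that limits PIC accuracy.
    rng = np.random.default_rng(42)
    for n in (10**2, 10**4, 10**6):
        v = rng.normal(0.0, 1.0, size=n)       # n sampled particle velocities
        err = abs(np.mean(v**2) - 1.0)
        print(f"n = {n:8d}  |error| = {err:.2e}  n**-0.5 = {n**-0.5:.2e}")

Gaining one decimal digit of accuracy therefore costs roughly one hundred times as many super-particles, which is why PIC becomes expensive for large or long-lived systems.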
Fortunately, larger systems necessarily mean smaller Knudsen numbers, for which more computationally amenable approaches exist. Low Knudsen number plasmas are best described with fluid theories, assuming continuum particle distributions in thermodynamic equilibrium (or close to thermodynamic equilibrium, with corrections). Although the kinetic theories and associated equations are still valid for low Knudsen number plasmas, their high dimensionality leads us to use approximations of these equations that are appropriate for a continuum particle distribution. We assume a thermodynamic distribution of the fluid, such as the Maxwell-Boltzmann distribution, which implies thermodynamic equilibrium, so that PDFs are not directly evolved. Taking the first three moments of the Boltzmann equation – multiplying Equation 1.3 by m_s, m_s v, and m_s v²/2 respectively and integrating over all velocity space – yields equations for conservation of mass, momentum, and energy (Kramer et al., 2020; Bittencourt, 2004). If we apply these to a single species non-relativistic fluid, ignoring other interactions such as viscosity and electromagnetic fields, we obtain the Euler equations (Toro, 2009; Chen & Chen, 1984; Bittencourt, 2004; Bellan, 2008):

\frac{\partial \rho}{\partial t} + \mathbf{v} \cdot \nabla \rho + \rho \nabla \cdot \mathbf{v} = 0    (1.11)

\frac{\partial \rho \mathbf{v}}{\partial t} + \nabla \cdot \left(\rho \mathbf{v} \otimes \mathbf{v} + p \mathbf{I}\right) = 0    (1.12)

\frac{\partial \varepsilon}{\partial t} + \nabla \cdot \left[\mathbf{v}\left(\varepsilon + p\right)\right] = 0    (1.13)

where ρ is the density, v is the flow velocity, p is the pressure, ε is the energy density including kinetic and thermal contributions, and I is the identity matrix. A viscous stress tensor can be added to the momentum equation to give the Navier-Stokes equations.

Electromagnetic fields can be coupled to the Euler equations via Maxwell’s equations to give models that better suit plasmas, where electromagnetic fields can influence the medium. In the ideal plasma limit, where currents are instantaneous and resistance is zero (leading to zero electric fields), we get the ideal magnetohydrodynamics (MHD) equations (Toro, 2009; Bittencourt, 2004; Bellan, 2008):

\frac{\partial \rho}{\partial t} + \mathbf{v} \cdot \nabla \rho + \rho \nabla \cdot \mathbf{v} = 0    (1.14)

\frac{\partial \rho \mathbf{v}}{\partial t} + \nabla \cdot \left(\rho \mathbf{v} \otimes \mathbf{v} - \mathbf{B} \otimes \mathbf{B}\right) + \nabla\left(p + B^2/2\right) = 0    (1.15)

\frac{\partial \varepsilon}{\partial t} + \nabla \cdot \left[\mathbf{v}\left(\varepsilon + p + B^2/2\right) - \mathbf{B}\left(\mathbf{B} \cdot \mathbf{v}\right)\right] = 0    (1.16)

with only the induction equation remaining from Maxwell’s equations due to the vanishing electric fields:

\frac{\partial \mathbf{B}}{\partial t} = \nabla \times \left(\mathbf{v} \times \mathbf{B}\right).    (1.17)

Although the ideal MHD equations provide a good model for many plasmas, they can be extended to include a variety of second order plasma effects. Including other electromagnetic effects leads to non-ideal MHD equation sets. Resistivity can be included in this model via Ohm’s Law to arrive at the resistive MHD equations, which support magnetic reconnection in the modeled plasma, while including anisotropic diffusion and thermal conduction along magnetic field lines gives Braginskii MHD (Braginskii, 1965).

The appropriate kinetic or fluid approximation depends both on the Knudsen number and on the charge separation distance

\Lambda_d \equiv \frac{k_D}{L},    (1.18)

which is the degree to which electric fields are relevant to the plasma, and where k_D is the Debye length

k_D^2 \equiv \frac{4\pi n q^2}{k_B T},    (1.19)

which measures how far the net electrostatic effect of the charged particles comprising the plasma persists, where n is the number density of the particles, q is their elementary charge, k_B is the Boltzmann constant, and T is their temperature.
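For a sense of scale, the following sketch evaluates the screening length 1/k_D implied by equation 1.19 (in cgs units) for ICM-like conditions; the density and temperature are the order-of-magnitude ICM values quoted in Section 1.1 and are assumptions chosen for illustration only.

    import numpy as np

    # Screening length 1/k_D from k_D^2 = 4*pi*n*q^2/(k_B*T), cgs units,
    # for ICM-like conditions. Values are illustrative assumptions.
    k_B = 1.38e-16      # erg/K
    e = 4.80e-10        # statC, elementary charge
    n = 1.0e-3          # cm^-3, ICM-like particle density
    T = 1.0e8           # K, ICM-like temperature (~10 keV)
    k_D = np.sqrt(4.0 * np.pi * n * e**2 / (k_B * T))   # cm^-1
    print("1/k_D ~ %.1e cm (~%.0f km)" % (1.0 / k_D, 1.0 / k_D / 1.0e5))
    # ~2e6 cm, i.e. tens of km -- vastly smaller than the ~Mpc cluster scale,
    # so the ICM is electrically well-screened on the scales of interest.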
If the Debye length is small compared to the system size then the system is electrically well-screened, so that the electric fields from discrete charges are unimportant compared to macroscale electromagnetic fields (Bittencourt, 2004; Bellan, 2008). In the extreme high Λ_d and low Knudsen number limit the fluid is neutral, and standard fluid dynamics governed by Euler’s equations are relevant (Kramer et al., 2020). As the Knudsen number is increased and the dissipation scale becomes closer to the system scale, the Navier-Stokes equations become more appropriate. In the low Λ_d and low Knudsen number limit the plasma is an ideal plasma where the ideal MHD equations are most applicable. As the system size is shrunk, dissipation scales and small scale plasma instabilities become more relevant, leading to non-ideal MHD approximations such as resistive MHD (Bonafede et al., 2011) and Braginskii MHD (St-Onge et al., 2020) becoming more applicable.

1.2.2 Turbulence in Plasmas

Turbulence is the chaotic flow, density, and pressure structures that form in all fluids when the kinetic or magnetic energy in the fluid exceeds damping due to viscosity, which is the internal friction or resistance to flow within the fluid (see Figure 1.4; McComb, 1990). Being formally chaotic, the evolution of turbulent flows cannot be predicted exactly, but it is better understood statistically on a macroscopic and microscopic level. The onset of turbulence in a fluid can be predicted by the dimensionless Reynolds number (Stokes, 1851; Sommerfeld, 1909; Reynolds, 1883; Rott, 1990), which is defined as

\mathrm{Re} \equiv \frac{v L}{\nu},    (1.20)

where v is the fluid velocity, L is the characteristic length scale that depends on the size of the system examined, and ν is the kinematic viscosity. Although the transition point is fuzzy and depends on the fluid, flow structure, and system in question, fluids with a Reynolds number above 10³–10⁴ exhibit instabilities in smooth (laminar) flows that disrupt them into turbulent flows (see Figure 1.5; Incropera & DeWitt, 1981). In terms of fluids encountered in everyday life, air has a low viscosity and thus higher Reynolds numbers for similar velocities and scales compared to water and honey, which have comparatively higher viscosities, and thus lower Reynolds numbers, and are less prone to turbulent flows. Viscosity in both liquids and gases arises from molecular interactions, but the origin of these forces can be quite different (Bird et al., 2006).

Figure 1.4: Schlieren photograph (©Dr. Gary Settles, CC BY-SA 3.0) showing the thermal plume of a lit candle, showing the smooth rising flow starting from the base of the flame that transitions into turbulence at the top of the flame. As a gas, the viscosity in smoke and air is low; thus, the velocity of the uplifted heated gas is sufficient to create a high Reynolds number flow, with Re ≳ 10³, which is prone to fluid instabilities. The laminar flow originating from the flame decays into turbulence as these instabilities grow further down the flow.

Figure 1.5: Photographs of a cylinder moving through a tank of water containing aluminum powder (van Dyke, 1982; ©Milton Van Dyke 1982). The higher the velocity of the water flow relative to the cylinder, the higher the Reynolds number, showing flows from top to bottom with Re = 9.6, Re = 2,000, and Re = 10,000. As the Reynolds number is increased beyond ∼10³, the flow becomes prone to fluid instabilities which grow non-linearly as the flow moves past the cylinder.
These instabilities develop into the turbulent flow beyond the cylinder, as best seen on the right hand side with the Re = 10,000 flow.

As relevant to this dissertation, viscosity within gases arises primarily from molecular diffusion, where the relevant scales are on the order of the mean free path of particles. This length scale at which dissipation becomes relevant is known as the dissipation scale or the Kolmogorov length scale. Since the system scales of gases studied are often much larger than the dissipation scale, the Reynolds number is often quite high for gas systems and thus they are usually turbulent. Since viscosity serves as a damping force against kinetic flow, it converts macroscopic kinetic energy in the fluid into thermal energy at the dissipation scale.

Figure 1.6: Diagram of the energy spectra of a turbulent plasma denoting the hydrodynamic turbulent cascade and the effects of magnetic fields and limited simulation resolution on the energy spectra. Wavenumber increases along the x-axis, with larger length scales to the left and smaller length scales to the right. The energy contained in the plasma at a certain wavenumber is plotted along the y-axis. The black solid line shows the kinetic energy spectrum of a plasma with no magnetic fields, where kinetic energy is introduced into the plasma at the production scale (marked by the leftmost vertical dashed black line) and dissipates into thermal heating at the dissipation scale (marked by the rightmost vertical dashed black line). Between these scales, turbulent plasmas follow a k^{−5/3} power law in the kinetic energy spectrum. With the addition of magnetic fields, in the resulting kinetic energy spectrum (shown in red) the power law is flattened or broken, with more energy at smaller scales. In simulations without an explicit viscosity, the smallest cell size introduces a dissipation length scale (the vertical dashed blue line) potentially larger than the physical length scale, which truncates the energy spectrum (in solid blue). Increased resolution decreases the dissipation imposed by numerics.

The energy distribution in a turbulent plasma can be further understood via its energy power spectrum (Taylor, 1938), as shown in Figure 1.6. With this approach, we examine the energy (which may be kinetic, magnetic, thermal, or a total energy) contained in the plasma at every length scale. This can be computed from a 3D plasma via a Fourier transform into spectral space or from structure functions, which are the real-space equivalent of the power spectrum (Arenas & Chorin, 2006), as used by Kolmogorov (1941). These methods give the energy power spectrum as a function of wavenumber k or wavelength λ = 2π/k. Smaller k pertains to larger length scales while higher k pertains to smaller length scales.
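As a minimal sketch of the Fourier-transform approach just described (illustrative only; a random velocity field stands in for actual simulation data, and the spherical shell binning is the simplest possible choice), a shell-averaged kinetic energy spectrum E(k) of a periodic 3D velocity field can be computed as follows.

    import numpy as np

    # Shell-averaged kinetic energy spectrum E(k) of a periodic 3D velocity
    # field via a Fourier transform. A random field stands in for real data,
    # so no particular power law should be expected from this example.
    N = 64
    rng = np.random.default_rng(0)
    v = rng.standard_normal((3, N, N, N))          # velocity components
    v_hat = np.fft.fftn(v, axes=(1, 2, 3)) / N**3  # normalized 3D FFT
    power = 0.5 * np.sum(np.abs(v_hat)**2, axis=0) # kinetic energy in k-space
    k = np.fft.fftfreq(N) * N                      # integer wavenumbers
    kmag = np.sqrt(k[:, None, None]**2 + k[None, :, None]**2
                   + k[None, None, :]**2)
    edges = np.arange(0.5, N // 2)                 # spherical shells in k
    E_k, _ = np.histogram(kmag.ravel(), bins=edges, weights=power.ravel())
    for kc, E in zip(0.5 * (edges[:-1] + edges[1:]), E_k):
        print(f"k ~ {kc:5.1f}   E(k) = {E:.3e}")

Applied to simulation data, the same binning of Fourier amplitudes is what produces the spectra sketched in Figure 1.6.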
The energy spectrum of a turbulent fluid reveals how turbulence transfers energy from larger scales to smaller scales via the Kolmogorov cascade model of turbulence (Richardson, 1922; Beresnyak, 2019; Kolmogorov, 1941). When energy is introduced to the plasma by external forces at a certain length scale, it produces eddies and flows at these injection scales. In the ICM, large scale production includes contraction from the initial conditions, galaxy cluster mergers, and, at a smaller scale (∼10–100 kpc), AGN feedback. These large scale energy injections lead to large eddies that break up into smaller eddies (higher wavenumber). As eddies break up, kinetic energy is transferred from the large eddies to the smaller eddies. Eventually, the eddies become small enough that viscous effects disallow smaller eddies and instead the kinetic energy dissipates into thermal heating, i.e., turbulent dissipation or turbulent heating. In the ICM this dissipation occurs due to plasma instabilities with length scales on the order of the cyclotron radius, which is typically on the order of 1 km. If the small-scale turbulent motions are statistically isotropic, then the energy spectrum between the injection scale and dissipation scale follows a power law E(k) ∝ k^{−γ} with spectral index γ = 5/3, as predicted by Kolmogorov (1941) for incompressible hydrodynamic turbulence.

The addition of magnetic fields to a turbulent plasma greatly complicates models of turbulence and has been under intense research and debate over the last decade (Beresnyak, 2019; Schekochihin, 2020). Not only do magnetic fields introduce an additional energy reservoir with its own energy spectrum apart from the kinetic energy spectrum, but magnetic fields also confine and collimate kinetic flows while the kinetic motions twist and wind magnetic fields, exchanging energy between these reservoirs (Grete et al., 2017, 2018, 2021b; Glines et al., 2021) and generally disrupting the assumptions of Kolmogorov turbulence.

Non-ideal MHD effects due to particle interactions near the particle scale lead to additional dissipation in plasmas, leading to the magnetic Reynolds number (Beresnyak, 2019)

\mathrm{Re}_m \equiv \frac{v L}{\eta},    (1.21)

where v and L are again the velocity and length scale of the scale of interest, and η is the magnetic diffusivity

\eta = \frac{c^2}{4\pi\sigma},    (1.22)

where c is the speed of light and σ is the conductivity. Magnetic fields likewise dissipate on small scales due to these same particle interactions, giving the Lundquist number (Beresnyak, 2019)

S \equiv \frac{v_A L}{\eta},    (1.23)

where v_A is the Alfvén speed

v_A \equiv \frac{B}{\sqrt{4\pi\rho}},    (1.24)

and B is the magnetic field strength. Similar to how high Reynolds numbers lead to fluids more prone to turbulence, high magnetic Reynolds numbers and Lundquist numbers (such as in the ICM) lead to plasmas that are more prone to magnetized turbulence (Beresnyak, 2019). In the presence of a strong mean-field magnetic field, meaning there is a significant large scale magnetic field with an associated Alfvén speed much greater than the velocity perturbations, perturbations with wavevectors perpendicular to the magnetic field are strongly favored over parallel wavevectors, producing anisotropic turbulent motions in conflict with the assumptions of Kolmogorov turbulence (Montgomery & Turner, 1981; Shebalin et al., 1983). Turbulence may also play a significant role in the amplification of magnetic fields in the ICM via the small-scale turbulent dynamo (Roh et al., 2019; Tobias, 2021). In this dynamo, the twisting and folding of magnetic fields by the turbulent motions in small eddies leads to an increase in the magnetic fields on small scales (Schekochihin et al., 2004; Steinwandel et al., 2021). Magnetic tension in the plasma in some cases can also accelerate or hinder the growth of turbulence in the magnetic and kinetic spectra at different rates (Glines et al., 2021; Bambic et al., 2018). It is currently unknown what the true spectral index of a magnetized plasma is, or whether the energy spectrum is even a power law between the production and dissipation scales (Grete et al., 2017, 2018; Glines et al., 2021; Grete et al., 2021b).
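As a concrete application of equation 1.24, the short sketch below evaluates the Alfvén speed for ICM-like conditions; the microgauss-level field strength and the density are assumed, order-of-magnitude values chosen only to illustrate how the formula is used, not measurements from this work.

    import numpy as np

    # Alfven speed v_A = B / sqrt(4*pi*rho) (eq. 1.24, cgs units) for
    # ICM-like conditions. Field strength and density are assumed values.
    m_p = 1.67e-24                 # g, proton mass
    B = 1.0e-6                     # G, assumed ~microgauss field
    n = 1.0e-3                     # cm^-3, ICM-like particle density
    rho = n * m_p                  # g cm^-3
    v_A = B / np.sqrt(4.0 * np.pi * rho)
    print("v_A ~ %.0f km/s" % (v_A / 1.0e5))   # ~70 km/s for these values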
Magnetized turbulence in the ICM (and the applicability of MHD to the ICM in general) is complicated by the ICM being weakly collisional: the mean free path in the ICM, on the order of 1–10⁵ pc,³ is not much smaller than the system scales of the ICM, which is a requirement of a collisional plasma and an assumption in most theories of turbulence. Small scale plasma instabilities may instead make up for the lost dissipation from collisions, although this is an area of open research (Lyutikov, 2007; Rosin et al., 2011; Berlok & Pessah, 2015). The pressure anisotropy of weakly collisional plasmas should be accounted for in models of the ICM and may have an effect on turbulence dissipation in the ICM (Kunz et al., 2011).

Previous theoretical studies have estimated the turbulent dissipation in galaxy clusters, suggesting that an RMS turbulence velocity within 100 to 300 km s⁻¹ can potentially produce sufficient turbulence to match cooling within clusters (Dennis & Chandran, 2005). Observational studies have estimated the turbulent heating by inferring a power spectrum of density fluctuations in cool core galaxy clusters imprinted on high-resolution Chandra images (Zhuravleva et al., 2014, 2019; Li et al., 2020; Vidal-García et al., 2021). Although these studies have shown that turbulent heating may be sufficient to counteract overcooling, they have approximated the turbulence within the ICM as non-magnetized. It is also unclear whether processes in the ICM such as AGN feedback are sufficient to drive this turbulence, or whether multiple cycles of jet feedback are required (Heinrich et al., 2021). Generally, a better understanding of magnetized turbulent dissipation within diffuse astrophysical plasmas such as the ICM is needed, and understanding of this phenomenon can be expanded via numerical simulations.

³ Mean free path of Coulomb collisions in the ICM (Spitzer, 1956, 1978).

1.2.3 The Simulation of Plasmas as a Research Tool

Although plasmas are ubiquitous throughout the universe and are often created in laboratories, recreating exact astrophysical plasma conditions (or their scaled-down equivalent) and observing them in a laboratory can be challenging and prohibitively expensive. Astrophysical plasmas span huge distances, both high and low densities, and extreme energies that are nearly impossible to recreate in a lab. Observing certain characteristics of astrophysical plasmas such as the magnetic fields and small scale turbulence can also be difficult due to the lack of direct electromagnetic emissions and the limited resolution of telescopes. The complex and often non-linear nature of the equations governing these plasmas also makes pen and paper theoretical work limited. In both terrestrial and astrophysical plasmas, numerical simulations bridge the gaps between theory, observations, and experimental design. Numerical simulations of plasmas serve as a simplified, affordable, and accessible experimental stand-in for real plasmas, giving insight to both observations and experiments.

Simulating turbulent plasmas comes with its own complexities. Numerical methods implicitly but unavoidably add a numerical viscosity, which introduces a dissipation scale on the order of the resolution of the simulation. If a system can be fully resolved with elements smaller than the physical dissipation scale, then the entire turbulent cascade can be directly captured with an explicitly included realistic viscosity.
Since turbulence in the ICM is driven on scales of kpc but dissipates on the scale of km, spanning several orders of magnitude, fully resolving the turbulent cascade of the magnetized plasma is infeasible for the foreseeable future due to the enormous volume of data that would be required to simulate a galaxy cluster down to km scales. As a result, the dissipation scale is artificially large and the turbulent dissipation is stronger in simulated clusters. This over-powered turbulent dissipation can be diminished by increasing the spatial resolution of simulations, although numerical dissipation will still exceed the true dissipation using supercomputing resources available in the near to intermediate future. This translates to a difference in Reynolds number between the simulated plasma and the target system. Simulations on current supercomputers can achieve Reynolds numbers up to Re ∼ 10³–10⁴ (Ritos et al., 2018), whereas Reynolds numbers in the ICM could be as high as Re > 10¹² (Miniati, 2014, 2015; Egan et al., 2016). Although larger supercomputers will enable higher resolution and lower dissipation, the achieved Reynolds number of simulations is unlikely to reach the true Reynolds number of the ICM for the foreseeable future.

1.2.4 Numerical Methods for Plasmas in the Fluid Approximation

At its core, simulating plasmas in the fluid approximation amounts to evolving approximate solutions to the partial differential equations describing the plasma. Plasmas in the fluid regime have been simulated via many classes of methods developed for computational fluid dynamics (CFD) but extended to include magnetic fields for MHD or non-ideal MHD (Trac & Pen, 2003; Lind et al., 2020; Ledvina et al., 2008). Although not an exhaustive list, these methods include:

• Finite difference (FD) methods, where the partial differential equations are approximated via finite differences on a mesh of cells (Trac & Pen, 2003; Brandenburg & Dobler, 2010)

• Finite volume (FV) methods, where the fluid equations are converted to surface integrals constituting fluxes between cells (Toro, 2009; Stone & Norman, 1992; Stone et al., 2008a; Bryan et al., 2014; White et al., 2016a)

• Finite element (FE) methods, which comprise a variety of other methods (including discontinuous-Galerkin methods, DG) where the plasma is also discretized into a mesh of cells (Meier, 1999)

• Smoothed particle hydrodynamics (SPH), where a mesh is forgone and the fluid is represented by particles with overlapping spatially smoothed density functions (Katz et al., 1996; Springel et al., 2001; Wadsley et al., 2004; Springel, 2005, 2010)

• Pseudo-spectral methods, where the equations are solved in a spectral basis (such as with Fourier transforms) and with an additional basis to quickly convert to a spatial grid (Simon, 1992; Burns et al., 2020)

These fluid methods can be broadly divided by their specification of the fluid flow into Eulerian and Lagrangian specifications. Lagrangian specifications follow along with a parcel of the fluid, whether that be a mass or volumetric discretization (see Hopkins, 2014, for a Lagrangian code that implements both mass and volumetric discretizations), whereas Eulerian specifications follow fluid motion as it moves through a discretization of space. In a simple analogy of the flow of a river, a Lagrangian specification would follow the water as a boat moving with the river while an Eulerian specification would follow the water from a bridge stationary above the river.
Codes using Lagrangian specifications typically discretize using particles representing discrete masses or volumes within the domain. SPH is historically the most used Lagrangian method within astrophysics (Katz et al., 1996; Springel et al., 2001; Springel, 2005), although recent methods have innovated beyond SPH by including corrections to better capture shocks like an Eulerian specification (Hopkins, 2014) or by using a moving mesh where a Godunov-like scheme (explained below) can be applied to a Lagrangian code (Weinberger et al., 2020). Codes using an Eulerian specification typically discretize the fluid domain into a mesh of cells within which properties of the fluid are tracked. In the case of FV (Toro, 2009; Stone & Norman, 1992; Stone et al., 2008a; Bryan et al., 2014; White et al., 2016a) and FD (Trac & Pen, 2003; Brandenburg & Dobler, 2010) methods, the cell averages of variables such as density, momentum, pressure, and energy are tracked. For other Eulerian methods such as DG methods, a linear combination of polynomials of these same variables is tracked, evolving quadratic, cubic, and higher order spatial terms in addition to the cell averages.

The theoretical basis for FV plasma methods begins with the strong form of the fluid equations, where the conservation laws for the conserved quantities such as density, momentum, energy, etc. are expressed in terms of divergences of fluxes and source terms, i.e.,

\frac{\partial \mathbf{U}}{\partial t} + \nabla \cdot \mathbf{F}\left(\mathbf{U}\right) = \mathbf{S},    (1.25)

where U are the conserved variables, F are flux terms, and S are source terms. This strong form of the equations holds pointwise for the plasma. It is converted to the weak form of the equation set using the divergence theorem, leading to a set of surface integrals to be satisfied (LeVeque, 2002), i.e.,

\int_\Omega \frac{\partial \mathbf{U}}{\partial t}\, d\Omega + \int_\Omega \nabla \cdot \mathbf{F}\left(\mathbf{U}\right) d\Omega = \int_\Omega \mathbf{S}\, d\Omega,    (1.26)

where Ω is the domain of a single cell from the discretized mesh. Assuming U and F are sufficiently smooth over Ω allows us to apply the divergence theorem to obtain the weak formulation

\int_\Omega \frac{\partial \mathbf{U}}{\partial t}\, d\Omega + \int_{\partial\Omega} \mathbf{F}\left(\mathbf{U}\right) \cdot \mathbf{n}\, dA = \int_\Omega \mathbf{S}\, d\Omega.    (1.27)

The advantage of the weak formulation is that it permits discontinuous solutions between cells or different Ω volumes where the divergence is not defined; i.e., the fluid can be approximated with a mesh of cells between which the fluid description is discontinuous. In a FV method, the cell averages of fluid quantities are tracked in each cell while these surface integrals become fluid fluxes between neighboring cells. Most FV methods for CFD are Godunov-like schemes (Godunov, 1959; Toro, 2009), where the fluxes are determined by solving or approximating a solution to a local Riemann problem at each cell interface. In a typical Godunov-like scheme, the fluid state at both sides of each cell interface is first reconstructed using an interpolation from the cell averages in surrounding cells. At each cell interface, the two fluid states from each side create a Riemann problem that can be solved to determine the fluid flux into each cell. This computed flux is then used in the numerical integration to advance the state of the fluid in time.
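To make the reconstruct–solve–update cycle above concrete, here is a minimal sketch (in Python, for illustration only; it is not the implementation used in this dissertation) of a first-order Godunov-type finite volume update for the 1D Euler equations on a Sod shock tube, using piecewise-constant reconstruction and the simple Rusanov (local Lax-Friedrichs) approximation to the interface Riemann problem.

    import numpy as np

    # First-order Godunov-type finite volume scheme for the 1D Euler
    # equations (Sod shock tube): piecewise-constant reconstruction plus a
    # Rusanov approximate Riemann solver at each cell interface.
    gamma = 1.4
    N, L = 200, 1.0
    dx = L / N
    x = (np.arange(N) + 0.5) * dx

    # Conserved variables U = (rho, rho*v, E) with Sod initial conditions.
    rho0 = np.where(x < 0.5, 1.0, 0.125)
    p0 = np.where(x < 0.5, 1.0, 0.1)
    U = np.array([rho0, np.zeros(N), p0 / (gamma - 1.0)])

    def prim(U):
        """Recover primitive variables (rho, v, p) from conserved U."""
        rho, mom, E = U
        v = mom / rho
        p = (gamma - 1.0) * (E - 0.5 * rho * v**2)
        return rho, v, p

    def flux(U):
        """Physical flux F(U) of the 1D Euler equations."""
        rho, v, p = prim(U)
        E = U[2]
        return np.array([rho * v, rho * v**2 + p, (E + p) * v])

    t, t_end, cfl = 0.0, 0.2, 0.5
    while t < t_end:
        rho, v, p = prim(U)
        c = np.sqrt(gamma * p / rho)               # sound speed
        dt = min(cfl * dx / np.max(np.abs(v) + c), t_end - t)
        # Piecewise-constant "reconstruction": left/right interface states
        UL, UR = U[:, :-1], U[:, 1:]
        FL, FR = flux(UL), flux(UR)
        smax = np.maximum((np.abs(v) + c)[:-1], (np.abs(v) + c)[1:])
        # Rusanov flux approximates the interface Riemann problem
        F_int = 0.5 * (FL + FR) - 0.5 * smax * (UR - UL)
        # Conservative update of interior cells; outermost cells held fixed
        U[:, 1:-1] -= dt / dx * (F_int[:, 1:] - F_int[:, :-1])
        t += dt

    rho, v, p = prim(U)
    print("density range at t = 0.2: %.3f to %.3f" % (rho.min(), rho.max()))

Higher order Godunov schemes replace the piecewise-constant reconstruction with piecewise-linear (PLM) or piecewise-parabolic (PPM) interpolation and use less diffusive approximate Riemann solvers such as HLLC, while a 0th order DG scheme carrying only cell averages reduces to exactly this kind of update, which connects to the DG discussion that follows.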
In a DG method, solutions to the weak form of the equation set take the form of linear combinations of polynomials (such as the Legendre polynomials), which allow higher order representations of the fluid compared to FV methods (Reed & Hill, 1973; Cockburn et al., 2005; Chen & Liu, 2013). A 0th-order DG method, which carries a constant contribution across each cell, is equivalent to a FV method carrying cell averages. The method order for DG can be increased arbitrarily, however, just by carrying more polynomial terms. Reconstruction of fluid states at cell interfaces is computed using the polynomials internal to each cell, while the Riemann problems solved in DG are equivalent to those solved in FV methods. Exact integration of surface integrals is facilitated by Gaussian quadrature. DG methods are also potentially better suited for upcoming hardware by being more arithmetically intensive, i.e., by executing more floating point operations per byte of data loaded or written from memory, which pairs well with hardware advances improving computational throughput faster than memory bandwidth (Klöckner et al., 2009; see discussion of changing supercomputer architectures in Section 1.4).

1.3 The Intracluster Medium – Plasma Physics Applied to Galaxy Clusters

The ICM, a hot diffuse plasma, comprises the majority of baryonic matter in galaxy clusters and is the primary emitter of cluster X-rays. Thus, the ICM has a profound effect on both how clusters evolve and how we observe them. Modeling and understanding the plasma physics governing the ICM allows better characterizations of galaxy clusters as a whole, one ultimate goal being to refine the luminosity-mass relation for galaxy clusters. This would enable surveys of galaxy cluster number densities that would reveal properties of dark matter and dark energy and the large scale structure of the universe.

Additionally, the ICM provides a unique plasma laboratory that can inform terrestrial plasmas. The high temperatures and low densities of the ICM are impractical to achieve on Earth, restricting their study to astrophysical observations, theory, and simulation. However, the ICM is likely very turbulent (Brüggen & Vazza, 2015; Zhuravleva et al., 2014; Simionescu et al., 2019), allowing study of magnetized turbulence that directly affects applications of plasmas on Earth. Turbulence triggered by the onset of plasma instabilities is a fundamental obstacle for achieving net power-generating fusion in both ICF (Casner, 2021) and MCF (Boozer, 2005; Sanchez & Newman, 2015), as it disrupts plasmas from being long-lived enough to achieve fusion. By studying the long-lived turbulent plasmas in astrophysical contexts via observations, we can better understand magnetized turbulence in laboratory plasmas and potentially develop more effective plasma devices (Ryutov & Remington, 2002; Chatterjee et al., 2017).

Conversely, since laboratory plasmas can be examined in closer detail and their experimental parameters changed, they can be used to study astrophysical plasmas as long as results are scaled appropriately (Ryutov & Remington, 2002). The magnetized supersonic flows, shocks, jets, and the development of plasma instabilities in these systems can be studied in laboratory high energy density plasmas (HEDP; Giuliani et al., 2012), which can inform understanding of these phenomena in the ICM (Beg, 2019). From a numerical perspective, methods, algorithms, and codes used for modeling laboratory plasmas can be repurposed for astrophysical plasmas (Howes et al., 2008) and vice versa (Beresnyak et al., 2018).
1.3.1 The cool core cluster problem

Approximately half of the galaxy clusters in the universe have high central X-ray surface brightnesses that would indicate significant radiative thermal losses within the inner several kpc (Fabian, 1994; Cavagnolo et al., 2009). Galaxy clusters with this property are known as cool-core (CC) clusters. Consequently, the centers of these galaxy clusters should quickly cool and collapse due to these energy losses within a few hundred million years in an event known as a "cooling catastrophe," which would be accompanied by massive rates of star formation. Historically, from a theoretical perspective these CC cluster centers would be replenished by massive inflows of gas known as cooling flows (Fabian, 1994). The large amount of cold gas implied by these cooling flows was never observed, however, nor were the elevated rates of star formation that would accompany the collapsing of the cold gas.

Although X-rays are being emitted and energy is being radiated away, CC clusters are not cooling down; although not in HSE, they are apparently quasi-stable. Thus, some mechanism must offset or disrupt this cooling. Many potential mechanisms for doing so have been proposed. Galaxy cluster mergers could disrupt this cooling, since a large scale interaction such as a merger can inject sufficient energy into a CC cluster to offset central cooling. However, galaxy cluster mergers are too infrequent to account for the abundance of quasi-stable CC clusters, occurring on timescales of ∼1 Gyr rather than the 10–100 Myr cooling times observed. Thermal conduction, where thermal energy from the cluster outskirts is conducted along magnetic field lines to the cluster center, can offset some cooling, but the effect is insufficient to offset all central cooling (Voigt et al., 2002; Ruszkowski & Begelman, 2002; Voigt & Fabian, 2004; Parrish et al., 2009). Stars collapsing into supernovae within the cluster can also inject heating but are likewise insufficient in power and frequency to offset cooling and also introduce metals, which promote cooling (Bregman & David, 1989; Domainko et al., 2004).

Figure 1.7: Bubbles inflated by AGN jets in galaxy cluster MS0735.6+7421, as evidenced by X-ray cavities in the ICM and radio synchrotron emission from cosmic rays accelerated at the shock fronts around the bubbles. Composite image: NASA, ESA, CXC, STScI, and B. McNamara (University of Waterloo); Very Large Array image: NRAO, L. Birzan and team (Ohio University).

AGN feedback via jets excited by gas infalling onto the accretion disk of the AGN's central SMBH, however, is widely agreed to be sufficient to offset cooling (Fabian et al., 2000; McNamara et al., 2000; Gitti et al., 2012; Fabian, 2012). The capability of AGN jets to inject sufficient energy into the ICM to offset cooling was realized by bubbles inflated by AGN feedback, which appear as X-ray cavities indicating evacuated gas and radio lobes where cosmic rays are accelerated across shocks and emit radio synchrotron radiation at the bubble shock front (Fabian et al., 2000; McNamara et al., 2000). Figure 1.7 shows said bubbles inflated by AGN jets in galaxy cluster MS0735.6+7421 as observed in X-ray and radio wavelengths.
The energy injected by the AGN into the cluster can be estimated by the work done on the gas to inflate these bubbles, W ∼ P dV, where W is the work done by the AGN, P is the pressure of the bubble, and dV is the volume of the bubble (McNamara et al., 2000; Churazov et al., 2002; Blanton et al., 2010). The work done by AGN feedback is sufficient to offset the central cooling in CC clusters. In our current understanding of CC clusters, AGN feedback is widely believed to be the dominant mechanism preventing cooling flows and cooling catastrophes.

Many aspects of AGN feedback are still poorly understood (Morganti, 2017), including how AGN feedback is triggered, how AGN feedback deposits energy into the ICM, and how these two factors of AGN feedback combine to apparently maintain CC clusters in a thermodynamically unstable multiphase state (Gaspari et al., 2012b; Tümer et al., 2019). The AGN feedback is sufficient to offset cooling, prevent cooling flows, and quench star formation, but it is not so powerful as to evacuate gas from CC clusters. Instead, the cluster centers are maintained in a thermodynamically unstable multiphase state, with blobs of cold condensed gas amongst hot, rapidly cooling X-ray bright gas. Thus, AGN feedback is believed to be self-regulating – i.e. increased AGN feedback diminishes AGN triggering, thereby tempering further feedback. The multiphase nature of the AGN environment may be key to the self-regulation of AGN feedback in CC cores, which is explored in the precipitation model of self-regulating AGN feedback (Voit et al., 2015, 2017).

1.3.2 Self-Regulating AGN Feedback via Precipitation

Given the thermodynamically unstable nature of the multiphase medium of the AGN environment that is maintained in CC clusters, it may play a significant role in the AGN triggering mechanism. In the precipitation model of self-regulating feedback shown in Figure 1.8, this multiphase medium leads to cold gas condensing out of the ICM and falling inwards, due to loss of buoyancy, onto the AGN accretion disk. Since most of the mass and energy that enters an SMBH accretion disk does not make it down to accrete onto the SMBH, much of the gravitational potential energy of this infalling mass is diverted into the jet driven by the accretion disk, feeding energy into the ICM. This feedback drives outflows that uplift condensed blobs of cold gas which would otherwise feed onto the AGN jet, regulating the feedback. Additionally, the energy deposited by the AGN into the outskirts of the cluster creates an entropy gradient sloping down towards the multiphase region of the cluster. As gas cools in this power-law zone of the entropy curve of the cluster, it loses buoyancy and falls into the isentropic zone, replenishing the gas (Voit et al., 2015, 2017).

Figure 1.8: Diagram of the self-regulating AGN feedback precipitation model from Voit et al. (2017), where the left panel shows a diagram of AGN feedback in a galaxy cluster and the right panel shows the entropy K ≡ k_B T n_e^(−2/3), where n_e is the electron number density. In this model, cold gas condenses in the isentropic central region of the galaxy cluster and accretes onto the central SMBH, triggering feedback in the form of bipolar outflows that uplift condensed gas into the power-law zone of the entropy profile in the cluster outskirts, tempering the overcooling and condensation of gas. In this power-law zone, buoyancy suppresses condensation while uplift promotes condensation. Observationally, the transition between the isentropic and power-law zones of the entropy profile occurs where the ratio of cooling time to free-fall time is t_cool/t_ff ∼ 10, where the cooling time t_cool of a parcel of gas is the time it would take to radiate away all its energy at its current rate of radiative cooling and the free-fall time t_ff of a parcel of gas is the time it would take to fall from rest to the cluster center due to gravity. Diagram from Voit et al. (2017).
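To make the two timescales defined in the caption above concrete, the sketch below estimates t_cool and t_ff for a parcel of ICM gas. All of the input values (density, temperature, cooling function, gravitational acceleration) are assumed, illustrative numbers; they do not come from the text or from any particular cluster.

```cpp
// Minimal sketch of estimating the cooling time and free-fall time of a
// parcel of ICM gas. All input values below are assumed, illustrative
// numbers, not values taken from the text or from any specific cluster.
#include <cmath>
#include <cstdio>

int main() {
  const double k_B = 1.380649e-16;  // erg/K
  // Assumed parcel properties (hypothetical cool-core-like values).
  const double n_e = 0.05;           // electron number density [cm^-3]
  const double n = 1.9 * n_e;        // approx. total particle density [cm^-3]
  const double T = 3.0e7;            // temperature [K]
  const double Lambda = 1.0e-23;     // cooling function [erg cm^3 s^-1], assumed
  const double r = 10.0 * 3.086e21;  // radius, 10 kpc in cm
  const double g = 5.0e-8;           // gravitational acceleration [cm s^-2], assumed

  // t_cool ~ thermal energy density / volumetric cooling rate (~ n_e n_i Lambda).
  const double e_th = 1.5 * n * k_B * T;                 // erg cm^-3
  const double cooling_rate = n_e * (n - n_e) * Lambda;  // erg cm^-3 s^-1
  const double t_cool = e_th / cooling_rate;             // s

  // t_ff = sqrt(2 r / g), free fall from rest at radius r.
  const double t_ff = std::sqrt(2.0 * r / g);            // s

  const double Gyr = 3.156e16;  // s
  std::printf("t_cool = %.2f Gyr, t_ff = %.3f Gyr, ratio = %.1f\n",
              t_cool / Gyr, t_ff / Gyr, t_cool / t_ff);
  return 0;
}
```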
From observations, the boundary between the isentropic zone and the power-law zone of the entropy profile is where the ratio of the cooling time t_cool, the time the plasma would take to cool to 0 K at its current rate of emission, to the free-fall time t_ff, the time the gas would take to fall to the cluster center from rest at its current radius, is approximately t_cool/t_ff ∼ 10 (Cavagnolo et al., 2008; Rafferty et al., 2008; McCourt et al., 2012; Meece et al., 2015).

In this model, the AGN feedback and triggering mechanisms are intrinsically tied to the multiphase nature of the AGN environment. However, the model is not specific about the details of how AGN feedback couples to the ICM – how the AGN jet thermalizes energy into the ICM (Ho, 2004; Kunz et al., 2011; Morganti, 2017).

1.3.3 The nature of AGN Feedback

As gas accretes onto the AGN accretion disk around the central SMBH, the charged particles comprising the plasma of the accretion disk wind up magnetic fields that collimate into jets that emanate from both poles of the SMBH. Although these jets are likely the primary mechanism by which the AGN deposits energy into the ICM, it is still under debate how the magnetized, relativistic, tightly collimated jet thermalizes into heating and large scale outflows that can quench cooling in such a way as to self-regulate AGN feedback and maintain a multiphase AGN environment (Young, 2010; Morganti, 2017).

One possible mechanism is turbulent dissipation incited by the AGN jet. Observational studies have estimated the turbulent heating by inferring a power spectrum of density fluctuations in cool core galaxy clusters imprinted on high-resolution Chandra images (Zhuravleva et al., 2014, 2019; Li et al., 2020; Vidal-García et al., 2021). By approximating the turbulence as purely hydrodynamic, velocity spectra can be inferred from these density perturbations and a k^(−5/3) turbulent energy spectrum can be fit to the velocity spectra. This gives an observational estimate of the turbulent heating in the cluster that is sufficient to offset cooling. This estimate, however, does not account for the magnetic fields within the ICM, which change the behavior of the turbulence.

This aspect of the ICM as a magnetized, potentially non-ideal MHD plasma may play a significant role in the thermalization of AGN feedback. The AGN accretion disk winds up strong magnetic fields that lead to the tight collimation of the AGN jet. These same fields may deposit significant energy into the ICM (Li et al., 2006). The AGN jet may also play a role in the amplification of existing magnetic fields within the galaxy cluster (Dubois et al., 2009) via a turbulent dynamo (Federrath, 2016). Anisotropic pressure in the ICM as a high-β plasma may trigger microscale instabilities in the plasma faster than if it were an ideal plasma, leading to higher turbulent dissipation that can more closely match radiative cooling (Kunz et al., 2011).
Numerical simulations are one cornerstone of our advancement in understanding AGN jets and how they interact with the ICM (Martí, 2019; Komissarov & Porth, 2021). Simulating the nature of AGN feedback is one of the ultimate goals of the methods presented in this dissertation. The current and future state of this work is explored in Chapter 6.

1.3.4 Simulation of Galaxy Clusters

The large dynamical range of the ICM requires vast computational resources to simulate accurately. The dynamical range of the ICM extends from the cluster scales on the order of 10 Mpc, down to the 1 pc scale of molecular clouds and star forming regions, and further down to the 1 km scale of plasma instabilities that drive dissipation in the diffuse plasma, spanning more than 20 orders of magnitude. Current world-class cosmological simulations can reach resolutions on the order of 100 pc (Pillepich et al., 2019), more than 15 orders of magnitude larger than the 1 km scale of plasma instabilities. In order to resolve said plasma instabilities directly in simulation, we would need on the order of (10^15)^3 = 10^45 times as many resolution elements as used presently, and thus a supercomputer at least 10^45 times larger than current supercomputers. Following the Courant–Friedrichs–Lewy (CFL) condition, the duration of timesteps Δt for this hypothetical simulation would need to satisfy

v Δt / Δx ≤ C_CFL,    (1.28)

where v is the velocity (unchanged), Δx is the cell size (now 10^15 times smaller than in current simulations), and C_CFL is a constant to maintain stability that depends on the method (unchanged). Thus Δt would need to be 10^15 times smaller than currently used timesteps and said simulation would require 10^15 times as many timesteps to complete. Since individual CPU core speeds have stagnated and are unlikely to increase in the near future (Leiserson et al., 2020), said supercomputer would need to be 10^15 times larger again to complete the simulation in the same human time, on the order of months. In totality, we would need a supercomputer 10^60 times larger than present supercomputers (20 orders of magnitude short of a "googolFLOP" supercomputer). Assuming a variant of Moore's Law holds true for the indefinite future – that supercomputers will double in computational throughput every 2 years – this computer will come online in ∼400 years (footnote: if energy consumption per operation is the same for this hypothetical computer as for current hardware, this supercomputer would need 10^61 MW ≈ 10^74 erg s^−1 of power; over one day it would consume ∼10^79 erg ∼ 10^24 M_⊙ c^2 in energy).

Since supercomputers in the foreseeable future are not capable of resolving the ICM down to plasma instability scales, all simulations of the ICM are necessarily an approximation. Unresolved key features of galaxy clusters such as the star forming regions and AGN must be approximated with subgrid model prescriptions that mimic the unresolved physics using a combination of observations and smaller scale simulations of plasmas and galaxy clusters. Within computational modeling, such simulations are referred to as multiphysics simulations, as they incorporate many physical descriptions and scales into a single simulation. At their most basic, simulations of galaxy clusters are comprised of a model for gravity and dark matter, a model of the plasma, and any number of additional physics, feedback mechanisms, and subgrid models. As the most massive component of the galaxy cluster, a treatment of the gravitational interactions of dark matter is essential for galaxy cluster simulations.
For computational efficiency for idealized isolated galaxy clusters, this dark matter profile can be a fixed gravitational potential such as a Navarro–Frenk–White profile (NFW; Navarro et al., 1996). The gold standard for dynamically evolving dark matter distributions, however, is to use an N-body method where the dark matter population is discretized into super-particles that can be evolved following gravity, including the expansion of the universe (Aarseth et al., 1979). N-body simulations of dark matter have a long history that pre-dates computers (Holmberg, 1941) and continues to be researched today (Rogers & Peiris, 2021; Ebisu et al., 2022).

To make robust predictions of the electromagnetic observations, however, requires coupling a treatment of the dark matter, whether that be a fixed gravitational potential or an N-body simulation, to the baryonic matter. This baryonic matter – the ICM – is a plasma that is reasonably approximated as a fluid. (Footnote: The ICM is weakly collisional, in that the mean free path of particle-particle interactions (via Coulomb collisions in the ICM) is long (1–10^5 pc) while the Debye length, λ_D^2 = k_B T / (4π n q^2), which is a measure of the scale on which the electric fields from individual charged particles in the plasma are relevant (Bellan, 2008), is short. The ICM is thus electrically well screened, in that macroscale electric fields dominate over the fields from individual particles, but particle-particle collisions are infrequent. Non-ideal MHD models including pressure anisotropy and thermal conduction are more appropriate for weakly collisional plasmas such as the ICM (Braginskii, 1965; Berlok & Pessah, 2015).) This plasma can be modeled using methods from CFD that may include magnetic fields for higher fidelity. As discussed in Section 1.2.4 there are a wide variety of methods, but also a range of additional plasma physics that can be included. The ICM is potentially a non-ideal MHD plasma, so including non-ideal MHD effects such as resistivity (Bonafede et al., 2011), anisotropic diffusion (Berlok & Pessah, 2015), and thermal conduction (Narayan & Medvedev, 2001; Jubelgas et al., 2004; Wagh et al., 2014) along magnetic field lines can provide a more realistic simulation of the ICM.

The ICM also loses significant energy over time via free-free emission and line emission. Free-free emission, or Bremsstrahlung emission, is caused by the deceleration of charged particles, namely the electrons of the plasma, by the electric field of larger charged particles, specifically the ions of the plasma. This radiative cooling rate depends on the temperature and ion density. In an H/He plasma with hydrogen number density n_H and hydrogen mass fraction X ≈ 0.76, such that n_H = X ρ/m_p where m_p is the proton mass, the volumetric free-free cooling rate is (Katz et al., 1996)

Λ_free-free ≈ 2.5 × 10^−23 n_H^2 (T / 10^8 K)^(1/2) erg cm^−3 s^−1.    (1.29)

Free-free emission only dominates cooling when the plasma is fully ionized, with ICM temperatures above ∼10^7 K. At lower temperatures other processes become more important. These processes are collisional ionization, where atoms are ionized by collisions with electrons; recombination, where electrons combine with an ion, emitting a photon; and collisional excitation, where atoms are excited by collisions with electrons and then decay to a lower state (Mo et al., 2010).
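As a quick check of how Equation 1.29 is applied, the sketch below evaluates the free-free cooling rate for an assumed parcel of hot ICM gas; the density and temperature are illustrative values, not taken from the text.

```cpp
// Minimal sketch of evaluating the free-free (Bremsstrahlung) cooling rate
// of Equation 1.29 for an assumed, illustrative parcel of ICM gas.
#include <cmath>
#include <cstdio>

// Volumetric free-free cooling rate [erg cm^-3 s^-1] following Eq. 1.29,
// applicable only to hot (T >~ 1e7 K), fully ionized gas.
double lambda_free_free(double n_H, double T) {
  return 2.5e-23 * n_H * n_H * std::sqrt(T / 1.0e8);
}

int main() {
  // Assumed values: rho ~ 1e-26 g/cm^3, X = 0.76, T = 5e7 K.
  const double m_p = 1.6726e-24;      // proton mass [g]
  const double X = 0.76;              // hydrogen mass fraction
  const double rho = 1.0e-26;         // gas density [g cm^-3] (assumed)
  const double n_H = X * rho / m_p;   // hydrogen number density [cm^-3]
  const double T = 5.0e7;             // temperature [K] (assumed)

  std::printf("n_H = %.2e cm^-3, Lambda_ff = %.2e erg cm^-3 s^-1\n",
              n_H, lambda_free_free(n_H, T));
  return 0;
}
```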
These processes depend on both the temperature and the ion species within the plasma, where larger nuclei, or metals, lead to more cooling due to the higher availability of electron orbitals. For numerical simulations these processes can be pre-computed for the ICM with a fixed metallicity (Schure et al., 2009) or with an evolving metallicity (Smith et al., 2017) combined with cooling tables to compute a radiative cooling rate (Ferland et al., 2013). These effects persist for temperatures down to 10^4 K, below which radiative losses are negligible for the dynamics of the ICM (Mo et al., 2010).

Beyond the basics of a gravitational or dark matter model and an ICM plasma model with radiative cooling, many important systems contributing to the dynamics of the ICM, such as the AGN, supernovae, and star forming molecular clouds, remain unresolved or underresolved due to limited computational resources. These phenomena can be included via subgrid models, which are prescriptions for the triggering and feedback of these systems on the ICM. For example, gas accretion onto the AGN accretion disk (which is approximately 10^−2 pc; Hawkins, 2007) occurs well below the 1 pc resolution of the current highest resolution isolated galaxy cluster simulations. AGN triggering can instead be included with a subgrid model following a Bondi-Hoyle accretion model (Bondi, 1952; Edgar, 2004), a boosted Bondi-Hoyle model (Booth & Schaye, 2009), or a cold-gas mass triggered model informed by the precipitation theory (Meece Jr, 2016). The accretion disk physics that generate the AGN jet are likewise underresolved, but various subgrid models of the AGN jet can be used to incorporate this feedback (Li et al., 2006; Meece Jr, 2016; Glines et al., 2020). Subgrid models for star formation, supernovae, turbulence (Schmidt & Federrath, 2011; Vlaykov et al., 2016; Grete et al., 2016), and cosmic rays can similarly improve the simulation of the galaxy cluster at the cost of complexity.

Despite these approximations, more resolution enabled by larger computational resources is always preferred for achieving higher fidelity simulations of the ICM, as it reduces the dependency on artificial models and their free parameters. More complex multiphysics – including magnetic fields, self-gravity, cosmic rays, plasma microphysics, cooling, and more complex subgrid models for turbulence and AGN feedback – all impose additional computational expense, resolution constraints, and time step constraints on galaxy cluster simulations. Astrophysics simulations, and especially simulations of the ICM, are always wanting for more computational resources. In order to gain access to such resources, astrophysical simulation codes must evolve with the changing landscape of supercomputing hardware.

1.4 The Changing Supercomputer Architecture Landscape

Limitations to semiconductor manufacture have led to the predicted end of Moore's law – the trend in computer chip manufacturing observed over the last 50 years that transistor density has doubled every two years – which has previously driven the growth of supercomputing resources. Transistor dimensions are reaching the physical limitations of semiconductor manufacturing, with microchip features reaching 3 nm in the coming years, where the atomic radius of silicon is 0.1 nm, meaning transistors in microchips now span tens of atoms (footnote: this limitation in the size of microchip features has long been predicted, including by Feynman in lectures on computation given during the 1980s; Feynman et al., 1998). Smaller microchip features, which allow higher clock speeds and thus faster computation, have become increasingly more difficult to develop over the last two decades (Iwai, 1999; Theis & Wong, 2017).
The power consumed by these higher density microchips is likewise becoming more of an issue, since this power needs to be transported away from the chip to prevent heat damage (Landauer, 1988). More recent designs often trade computing speed for power efficiency, further limiting increases in computing resources (Leiserson et al., 2020). Alternative materials to silicon and other technologies such as optical transistors (Nolte & Nolte, 2001) may extend Moore's law for a few years, but eventually microchip manufacture will reach the hard physical limits of atomic radii. Although useful in some contexts, it is unclear whether quantum computers will impact astrophysical simulations, since they have limited applications to CFD in general (Sammak et al., 2015; Steijl & Barakos, 2018).

Instead of relying on increasing clock speeds of processing cores to grow computing resources, supercomputer hardware has increased the size of processing chips by adding more cores or more parallelization to computer chips (see Figure 1.9; Leiserson et al., 2020). Whereas the Pentium Pro CPUs in ASCI Red (Top500, 2000), the fastest supercomputer in June 2000, had 1 core per CPU, the Xeon X5670 CPUs in Tianhe-1A (Top500, 2010), the fastest supercomputer in June 2010, had 12 cores per CPU, and the A64FX CPUs in Fugaku (Top500, 2020), the fastest supercomputer in June 2020 and at present, have 48 cores per CPU. Although individual core speeds have not improved since roughly 2005 (Leiserson et al., 2020), the increased core count permits higher computational throughput that is especially useful for CFD.

Figure 1.9: Relative clock speeds of single core (black) and multicore (gray, orange, blue, and red, in order of increasing core counts) processors relative to the Intel 80386 CPU using the SPECint benchmark. The green round dots show processor clock frequencies, the frequency at which a single core can execute a clock cycle to execute one or several operations, relative to the Intel 80386. Although clock frequencies have stagnated since the mid 2000s, processors have increased performance by adding more cores. Future performance gains are increasingly dependent on higher core counts. Figure from Leiserson et al. (2020), ©Science.

This trend in higher core counts on individual chips is taken to the extreme in hardware accelerators – computer chips designed for higher core counts and parallelization compared to traditional CPUs. Whereas a state-of-the-art Intel Xeon Platinum 8280 CPU used in Frontera, the current leading supercomputer where a majority of throughput is via traditional CPUs (Top500, 2021), has 28 cores per CPU with 2 threads per core (Intel, 2021) and provides over 2 × 10^12 floating point (64 bit) operations per second, or 2 TFLOPS, the state-of-the-art NVIDIA A100 graphics processing unit (GPU; Choquette et al., 2021) has 108 streaming multiprocessors (SMs) with a total of 6912 cores, providing 9.7 TFLOPS of computational throughput for comparable price and energy consumption. Among the different accelerators, GPUs, originally made to accelerate graphics rendering, have been especially well-suited for high performance scientific computing (Du et al., 2011; Afzal et al., 2017; HajiRassouliha et al., 2018).
GPU cores are designed for performing the same computational tasks simultaneously on large blocks of data, as opposed to the near complete independence between cores on CPUs. Although GPU cores are simpler than CPU cores, providing fewer features and less independence in execution, they are physically smaller in size, and thus more GPU cores than CPU cores can be fit onto the same silicon die for a similar cost. Thus, computational throughput can be expanded without depending on transistor manufacturing improvements, extending the growth of HPC past the end of Moore's Law (Leiserson et al., 2020). GPUs' high core counts make them remarkably well suited for highly parallelizable tasks such as the methods used for CFD and plasma simulations (Griebel & Zaspel, 2010; Xu et al., 2015). All of the largest upcoming supercomputers being built in the US will use GPUs for the vast majority of their computational throughput.

The US Department of Energy (DOE) is investing in new supercomputers to break the exascale barrier, executing 10^18 floating point operations per second (FLOPS), an exaFLOP, on a single supercomputer. The goal is encapsulated in the Exascale Computing Project (ECP), which funds both the software and hardware for an exascale supercomputer (Messina, 2017). All US exascale supercomputers planned for the near future – Frontier, Aurora, and El Capitan – will use GPUs to achieve an exaFLOPS.

These hardware accelerators can be difficult to program scientific applications for compared to traditional CPUs, however. This is not only because of their extreme vectorization and streamlined architecture that maximizes computational throughput, but also because they require different application programming interfaces (APIs). Traditional CPUs can be programmed using standard programming languages such as C, C++, and FORTRAN. GPUs, on the other hand, use APIs specific to each manufacturer (Patterson, 2010). NVIDIA, the historical leader in scientific computing with GPUs, uses the CUDA API, AMD uses ROCm and also provides the CUDA-like HIP interface, while Intel uses SYCL with its implementation named Data Parallel C++ (DPC++). Figure 1.10 shows a comparison between these different APIs.

This state of APIs for GPUs is detrimental for scientific computing, as it requires rewriting code for each new API to use new computing resources. Said rewrites may introduce new bugs in different versions of the software, while making algorithmic improvements and additions to the code requires updating the code for each API. New hardware architectures, such as Field Programmable Gate Arrays (FPGAs), would require additional versions and more development effort. Additionally, different architectures use different parallelization and memory layouts, which might lead a code design to perform optimally on one machine but underperform on others, wasting computing resources. The duplicated code for different hardware leads to higher development costs in terms of scientific researchers' time, which could otherwise be used to pursue science goals.
Figure 1.10: Example code to execute z[i] = a*x[i] + y[i] with different programming APIs: (a) a C/C++ example, where OpenMP is used for vectorization; (b) a CUDA example, where the arrays d_x, d_y, and d_z are allocated as CUDA arrays within GPU memory; (c) a HIP example, where the arrays d_x, d_y, and d_z are allocated as HIP arrays within GPU memory; and (d) a SYCL example, where the arrays d_x, d_y, and d_z are allocated within GPU memory. Even with this simple code example, there are significant differences in the implementation with different APIs. Each API also requires different code outside of this snippet to manage memory and execution on the GPU, along with a myriad of performance concerns.

As algorithmic and method changes are made and as bugs are found in the code, the different versions of the code written for the different architectures become out of sync, multiplying the development cost for each new architecture. Generally, needing to rewrite code with different APIs for each new hardware architecture limits scientific computing on these upcoming exascale supercomputers.

1.4.1 Performance Portability

Performance portability APIs have been developed to address the issue of different APIs for each hardware architecture (Reguly & Mudalige, 2020). Performance portability APIs provide portability – code written with the framework can be run on multiple hardware architectures without modification – and portable performance – the code executes with high performance, efficiently using hardware resources and features, on multiple architectures with differing memory and parallelization layouts. Within a performance portability framework, algorithms are written with more abstraction from parallelization and memory management details. This approach allows the API and the compiler to assemble a program for multiple hardware architectures from a single version of the code, vastly cutting down code duplication and software development for the scientist. The API can also vary the memory layout and parallelization strategy between different architectures, optimizing for each with minimal effort on the part of the scientist.

Recent performance portability solutions include the libraries OCCA (Medina et al., 2014), Kokkos (Carter Edwards et al., 2014; Trott et al., 2022), and RAJA (Beckingsale et al., 2019), the OpenMP API with the "target offloading" capabilities beginning with OpenMP 4.5, and, specifically for AMR applications, the AMReX library (Zhang et al., 2019). With a single code version using these APIs, the API backend can handle the execution of code and management of memory on both CPUs and GPUs from the different manufacturers currently producing the world's largest supercomputers.

The implementation of performance portability is an emerging field in scientific computing (Deakin et al., 2019). The construction of exascale supercomputers with each of the different GPU manufacturers necessitates developing new performance portable astrophysics codes that can adapt to these upcoming architectures as well as to future computers. Research into performance portability strategies as well as quantifying performance portability across different hardware architectures (Pennycook et al., 2016) is needed to better facilitate adoption of performance portability APIs in scientific computing.
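As a rough illustration of what a performance portability layer such as Kokkos abstracts away, the sketch below writes the z[i] = a*x[i] + y[i] loop of Figure 1.10 once; the library maps the same kernel body to the backend selected at compile time (e.g., OpenMP on CPUs or CUDA/HIP on GPUs). This is a minimal, illustrative example assuming a standard Kokkos installation, not code taken from K-Athena, Parthenon, or AthenaPK.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    const double a = 2.0;

    // Views abstract the memory space (host or device) and memory layout;
    // the default execution/memory space is chosen when Kokkos is built.
    Kokkos::View<double*> x("x", n), y("y", n), z("z", n);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);

    // The same parallel_for runs on CPUs (e.g., via OpenMP) or GPUs
    // (e.g., via CUDA or HIP) depending on how Kokkos was configured.
    Kokkos::parallel_for("z_equals_ax_plus_y", n, KOKKOS_LAMBDA(const int i) {
      z(i) = a * x(i) + y(i);
    });
    Kokkos::fence();  // wait for the kernel to finish before using z
  }
  Kokkos::finalize();
  return 0;
}
```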
1.5 Outline of Dissertation

The remaining chapters of this dissertation are composed of a series of four peer-reviewed papers where I am either the first author or an equal co-first author, one chapter consisting of current projects, and a final chapter for future directions of my work.

In Chapter 2 I explore the energy deposition requirements for self-regulating AGN feedback triggered by cold gas accretion using thermal-only abstractions of AGN feedback. This chapter originally appeared as the published paper Glines et al. (2020). In Chapter 3 I explore magnetized turbulence from decaying large scale flows, as might be created by large scale infrequent events in the ICM such as AGN outbursts and galaxy cluster mergers, using simulations of the magnetized Taylor-Green vortex. This chapter originally appeared as the published paper Glines et al. (2021). In Chapter 4 I present the implementation and profiling of the performance portable magnetohydrodynamics code K-Athena, which was used for the simulations in Chapter 3. This chapter originally appeared as the published paper Grete et al. (2021a), on which I am equal co-first author. In Chapter 5 I present a new DG method for relativistic hydrodynamics. This chapter originally appeared as Glines et al. (2022), which has been submitted to the Astrophysical Journal Supplements. In Chapter 6 I present in-progress simulations of magnetized AGN feedback in galaxy clusters, coming full circle to the nature of self-regulating AGN feedback. Finally, in Chapter 7 I summarize the dissertation and discuss future directions of the methods, codes, and scientific results presented in this dissertation.

CHAPTER 2

TESTS OF AGN FEEDBACK KERNELS IN SIMULATED GALAXY CLUSTERS

This chapter first appeared as the published paper Glines et al. (2020). I include the original abstract as the introduction to this chapter.

2.1 Chapter Abstract

In cool-core galaxy clusters with central cooling times much shorter than a Hubble time, condensation of the ambient central gas is regulated by a heating mechanism, probably an active galactic nucleus (AGN). Previous analytical work has suggested that certain radial distributions of heat input may result in convergence to a quasi-steady global state that does not substantively change on the timescale for radiative cooling, even if the heating and cooling are not locally in balance. To test this hypothesis, we simulate idealized galaxy cluster halos using the Enzo code with an idealized, spherically symmetric heat-input kernel intended to emulate AGN feedback. Thermal energy is distributed with radius according to a range of kernels, in which total heating is updated to match total cooling every 10 Myr. Some heating kernels can maintain quasi-steady global configurations, but no kernel we tested produces a quasi-steady state with central entropy as low as those observed in cool-core clusters. The general behavior of the simulations depends on the proportion of heating in the inner 10 kpc, with low central heating leading to central cooling catastrophes, high central heating creating a central convective zone with an inverted entropy gradient, and intermediate central heating resulting in a flat central entropy profile that exceeds observations. The timescale on which our simulated halos fall into an unsteady multiphase state is proportional to the square of the cooling time of the lowest entropy gas, allowing more centrally concentrated heating to maintain a longer lasting steady state.
2.2 Introduction

Cool-core (CC) clusters have X-ray surface brightness profiles with sharp central peaks produced by substantial radiative losses of thermal energy from gas within the central few tens of kpc (Fabian, 1994). Given the observed rates of energy loss, CC clusters should be capable of radiating away their central thermal energy in less than 1 Gyr. If uncompensated, such a rapid cooling rate would lead to a cooling catastrophe in which multiphase condensation of ambient gas into cold clouds fuels star formation rates much greater than those observed. However, CC clusters are generally not observed to experience such dramatic cooling catastrophes (McDonald et al., 2019). They apparently remain close to thermal balance for billions of years and are common, representing about half of all galaxy clusters at the present time. Consequently, some mechanism must be counteracting central radiative cooling, and active galactic nuclei (AGN) are currently believed to be the responsible energy sources (Fabian et al., 2000; McNamara et al., 2000; Fabian et al., 2006; McNamara & Nulsen, 2007; Panagoulia et al., 2014; Gaspari, 2015).

Many other heat sources have been explored, including galaxy cluster mergers (Roettiger et al., 1997; Gómez et al., 2002; ZuHone et al., 2010), supernovae (Ciotti & Ostriker, 1997; Wu et al., 1998; Voit & Bryan, 2001; Domainko et al., 2004; Short et al., 2013), thermal conduction (Chandran & Cowley, 1998; Narayan & Medvedev, 2001; Malyshkin & Kulsrud, 2001; Voigt et al., 2002; Jubelgas et al., 2004; Brüggen, 2003a; Smith et al., 2013), gravitational heating (Khosroshahi et al., 2004; Dekel & Birnboim, 2007), and gas sloshing (Ritchie & Thomas, 2002; Markevitch et al., 2001; ZuHone et al., 2010). Most either do not provide enough heat to offset the observed cooling or do not adjust to the radiative cooling rate on a short enough time scale. Core cooling times in many CC clusters are < 1 Gyr (Cavagnolo et al., 2009; Pratt et al., 2009), much less than the lifetimes of these clusters, suggesting that any heating mechanism coupled to cooling must react on shorter timescales. The gas accretion rate onto the central supermassive black hole (SMBH) would therefore need to couple to the radiative cooling rate with a lag time no greater than several hundred Myr.

Feedback from the central galaxy and AGN was explored numerically as early as Tabor & Binney (1993), Metzler & Evrard (1994), and Binney & Tabor (1995). More recently, Sijacki et al. (2007), Gaspari et al. (2011), Li et al. (2015), Meece et al. (2017), Prasad et al. (2015, 2017, 2018), and many others (Fabjan et al., 2010; Dubois et al., 2010; Short et al., 2013; Yang & Reynolds, 2016a) have demonstrated in hydrodynamic simulations of idealized galaxy clusters that AGN can plausibly regulate the high cooling rate in CC clusters. Simulated AGN self-regulate by coupling feedback energy output to the ambient gas density or cold-gas accretion rate around the AGN and inject that energy through either thermal deposition around the AGN or bipolar outflows from the AGN or a combination of the two. In addition to regulating the cooling rate and the condensation of cold gas clouds within the cluster, some of these AGN simulations produce temperature, density, and entropy profiles that resemble observations, including the multiphase cores observed in the central 100 kpc of galaxy clusters (Gaspari et al., 2012b; Meece et al., 2017; Prasad et al., 2018).
The simulations that most successfully resemble observations rely on cold-gas accretion to fuel the AGN and bipolar outflows to distribute the feedback energy (Gaspari et al., 2017; Gaspari & Sądowski, 2017; Voit et al., 2017; Meece et al., 2017). Ambient gas at the center of the system is nearly isentropic and therefore convectively unstable, resulting in the formation of a complex multiphase medium in which cold clumps of gas condense out of the ambient gas and precipitate onto the black hole. As the precipitation increases, so does the output of feedback energy, which raises the central cooling time and ultimately reduces the rate of precipitation. The resulting coupling suspends the ambient medium in a transitional state on the verge of a cooling catastrophe. Condensation outside of the isentropic center is marginally suppressed by buoyancy, and gas lifted out of the center by bipolar jets and buoyant bubbles forms multiphase filaments (Revaz et al., 2008; Li & Bryan, 2014a,b), in general agreement with observations (McDonald et al., 2010; Russell et al., 2016, 2017).

However, even these idealized simulations do not track all of the physical processes that might be transporting and thermalizing AGN feedback energy, which range from turbulent heat diffusion (Ruszkowski et al., 2011; Zhuravleva et al., 2014) and viscous dissipation of waves generated by the AGN (Ruszkowski et al., 2004) to cosmic rays created by the AGN heating the plasma via small scale fluid instabilities (Böehringer & Morfill, 1988; Loewenstein et al., 1991; Rephaeli & Silk, 1995; Colafrancesco et al., 2004; Pfrommer et al., 2007; Jubelgas et al., 2008). Incorporating all of these mechanisms and processes into a cosmological simulation of galaxy cluster formation is currently prohibitively complex. Typically, the minimum spatial resolution in simulations modeling hot jets that interact with the intracluster medium is 200 pc. The finer resolution of the gas along which the jet deposits energy leads the jet to drill a hole through the ICM, allowing energy from the AGN to be deposited at further radii (Meece et al., 2017; Li et al., 2015). These resolution constraints are not always feasible for large cosmological simulations, because the computational effort needed to model these AGN jets exerts unacceptable drag on the evolution of the entire system. Therefore, simplified subgrid models are still needed to represent AGN feedback in cosmological simulations.

The results we present here emerged from an effort to develop a simple heat-input kernel to serve as an acceptable proxy for the much more complex process of AGN feedback. We sought a kernel that would satisfy three criteria:

1. The simulated hot-gas atmospheres of clusters balanced by AGN feedback should remain nearly thermally steady, meaning that they should not dramatically change because of cooling and feedback for periods of several billion years.

2. The central entropy of the hot gas in such a quasi-steady cluster halo should not exceed the values observed in CC clusters.

3. The feedback process should be computationally efficient, requiring neither very high resolution nor extremely small time steps that would make implementation in a current cosmological simulation prohibitively costly.

The first criterion requires the heating kernel to prevent a cooling catastrophe, which we define for the purposes of this paper to be a factor of 10 increase in radiative cooling within 10 Myr, accompanied by a rapid increase in the amount of cold (10^4 K) gas.
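A minimal sketch of how the cooling-catastrophe criterion just defined might be checked in post-processing is shown below; the data layout, threshold for "rapid" cold-gas growth, and example numbers are hypothetical, not part of the Enzo setup described later.

```cpp
// Minimal sketch of checking the cooling-catastrophe criterion defined
// above: a factor of 10 increase in the total radiative cooling rate
// within 10 Myr, accompanied by a rapid increase in cold (~1e4 K) gas.
// The data layout and thresholds below are hypothetical.
#include <cstdio>
#include <vector>

struct Snapshot {
  double time_Myr;       // simulation time
  double total_cooling;  // total radiative cooling rate [erg/s]
  double cold_gas_mass;  // mass of gas near the 1e4 K floor [Msun]
};

// Returns true if any pair of time-ordered snapshots separated by <= 10 Myr
// shows a factor-of-10 jump in cooling together with growth in cold gas.
bool cooling_catastrophe(const std::vector<Snapshot>& history) {
  for (size_t i = 0; i < history.size(); i++) {
    for (size_t j = i + 1; j < history.size(); j++) {
      const double dt = history[j].time_Myr - history[i].time_Myr;
      if (dt > 10.0) break;
      const bool cooling_spike =
          history[j].total_cooling >= 10.0 * history[i].total_cooling;
      const bool cold_gas_growth =
          history[j].cold_gas_mass > 2.0 * history[i].cold_gas_mass;  // assumed
      if (cooling_spike && cold_gas_growth) return true;
    }
  }
  return false;
}

int main() {
  const std::vector<Snapshot> history = {{0.0, 1.0e45, 1.0e8},
                                         {5.0, 4.0e45, 1.5e8},
                                         {9.0, 1.2e46, 5.0e8}};
  std::printf("catastrophe: %s\n", cooling_catastrophe(history) ? "yes" : "no");
  return 0;
}
```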
As the central cooling time becomes short, compensating thermal feedback is needed to prevent runaway overcooling.

The second criterion requires that the kernel not overheat the central region, which would elevate or invert the central entropy profile. Such centrally concentrated AGN feedback can produce either non-cool-core (NCC) clusters or observationally unreasonable galaxy clusters with large central entropy peaks. Furthermore, buoyancy is unable to suppress runaway thermal instabilities in systems with centrally flattened entropy profiles, making them prone to multiphase condensation (e.g., Voit et al., 2017). Simultaneously satisfying both this criterion and the first one proved to be difficult, even though observations show that CC clusters can remain remarkably close to a cooling catastrophe without producing an overabundance of cold gas and young stars.

Finding a way to satisfy the third criterion along with the other two was the main motivator for this paper. Tracking the rapid formation of a complex multiphase medium approaching a cooling catastrophe requires high resolution and small time steps. Furthermore, if feedback energy output is directly linked to condensation of cold clouds, the approach of a cooling catastrophe leads directly to rapid central heating, further increasing the computational requirements. We therefore sought a simple method that would avert a cooling catastrophe while still allowing the ambient central gas to remain in a low-entropy state.

In our search for a numerically simple heating kernel that would satisfy these three criteria, we investigated kernels with a power-law radial distribution of thermal feedback, normalized so that feedback heating globally equals radiative cooling within the galaxy-cluster halo. Use of such a heating kernel implicitly assumes that the most consequential feature of more complex AGN feedback mechanisms is the radial distribution of heat input. Depositing heat into the gas according to a kernel that depends only on radius is numerically simple and efficient to incorporate into cosmological simulations, and it does not require high spatial resolution as long as the feedback method can maintain the hot halo gas in a thermally steady state without overcooling. In order to create a tunable model, we also modified the radial power law with an inner truncation radius to limit central feedback and an outer exponential cutoff radius to constrain the bulk of the AGN heating to gas with shorter and more relevant cooling times. These additional parameters gave us a numerically simple but tunable model to search for an adequate AGN feedback kernel. We heuristically explored different values of the inner truncation radius that avoided central entropy peaks and different values of the outer cutoff radius that kept the majority of the feedback inside the region of the halo where gas cools within a Hubble time. We discuss the model in more detail later in the paper.

Section 2.3 discusses the simulation setup and AGN feedback prescription and heating kernel in detail. Section 2.4 shows simulation results, describing in detail the results of three heating kernels that broadly represent the whole set of simulations, and examining the impact of different heating kernel parameters. Section 2.5 discusses the adequacy of the heating kernels tested, the robustness of the resulting feedback model, and the possible implications of these simulations for our understanding of AGN feedback in general.
Lastly, Section 2.6 summarizes the results and conclusions of this work.

2.3 Methodology

This work builds upon simulations by Meece et al. (2017), using the same initial conditions from that work, described in §2.3.1, but using an AGN feedback kernel that is adapted to deposit energy at large radii as described in §2.3.2.

2.3.1 Simulation Setup

We ran several simulations of idealized galaxy cluster halos with a simplified AGN heating model using the hydrodynamics code Enzo (Bryan et al., 2014). We used initial conditions approximating the Perseus Cluster, following the approach from Li & Bryan (2012) and Meece et al. (2017). The ICM begins as a hydrostatic sphere of gas in a fixed gravitational potential. The gravitational potential has two components: a dark matter halo profile and a BCG with a mass profile with parameters chosen to match the Perseus cluster. The dark matter follows the NFW profile (Navarro et al., 1997), using M_200c = 8.5 × 10^14 M_⊙ for the mass within the virial radius and a concentration parameter c = 6.81. The dark matter density from the NFW profile takes the form

ρ_NFW(r) = ρ^0_NFW / [ (r/R_s) (1 + r/R_s)^2 ],    (2.1)

where the scale density ρ^0_NFW is defined by

ρ^0_NFW = (200/3) c^3 / [ln(1 + c) − c/(1 + c)] ρ_c,    (2.2)

where ρ_c = 3H^2/(8πG) is the critical density, and the scale radius R_s can be found from

M_200c = 4π ρ^0_NFW R_s^3 [ln(1 + c) − c/(1 + c)].    (2.3)

The BCG mass profile, following Meece et al. (2017), has the form

M_*(r) = M_4 [ 2^(2−β_*) / ( (r/4 kpc)^(−α_*) (1 + r/4 kpc)^(α_*−β_*) ) ],    (2.4)

where M_4 = 7.5 × 10^10 M_⊙ is the stellar mass within 4 kpc and α_* = 0.1 and β_* = 1.43 are constants.¹ The initial pressure was computed from the temperature and density assuming an ideal gas with γ = 5/3 in hydrostatic equilibrium with the gravitational potential. Cosmological expansion is neglected in these simulations. We used a vanilla ΛCDM model to get the virial mass of the NFW halo and to set its gas temperature. We set redshift z = 0 at initialization with Ω_M = 0.3, Ω_Λ = 0.7, and H_0 = 70 km s^−1 Mpc^−1. We note that the precise details of the cosmological model do not impact the results presented in later sections of this paper, which pertain to baryonic physics in the halo core.

The entropy profile of the gas, using the form

K ≡ k_b T / n_e^(2/3)    (2.5)

for the specific entropy, where k_b is Boltzmann's constant, T is the temperature, and n_e is the electron density, was initialized to a power law

K(r) = K_0 + K_100 (r/100 kpc)^(α_K),    (2.6)

following the power law fits used in the ACCEPT database (Cavagnolo et al., 2009). Here, r is the radius from the halo center and K_0 = 19.38 keV cm^2, K_100 = 119.87 keV cm^2, and α_K = 1.74 are fitting parameters corresponding to the core entropy, the entropy normalization at 100 kpc, and the power-law slope, chosen to approximate the Perseus Cluster.

¹ Due to a programming error, the simulations use an incorrect initial mass profile for the BCG, which leads to the central 1 kpc being initialized out of hydrostatic equilibrium, with an absence of baryonic mass by less than a factor of two. However, the central halo gas either relaxes to hydrostatic equilibrium within 50 Myr or AGN feedback quickly drives it further from equilibrium, depending on the heating kernel parameters. Consequently, this error in the initial conditions does not substantially affect our results.
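A minimal sketch of evaluating the NFW profile of Equations 2.1-2.3 for the Perseus-like parameters quoted above is given below; the value used for the critical density is an assumed, approximate z = 0 number included only so the example runs.

```cpp
// Minimal sketch of evaluating the NFW dark matter profile of
// Equations 2.1-2.3 for the Perseus-like parameters quoted above
// (M_200c = 8.5e14 Msun, c = 6.81). The critical density is an assumed
// approximate z = 0 value, used only for illustration.
#include <cmath>
#include <cstdio>

int main() {
  const double pi = 3.14159265358979;
  const double Msun = 1.989e33;        // g
  const double kpc = 3.086e21;         // cm
  const double M200c = 8.5e14 * Msun;  // g
  const double c = 6.81;               // concentration parameter
  const double rho_crit = 9.2e-30;     // g cm^-3 (assumed, ~z = 0)

  const double mu = std::log(1.0 + c) - c / (1.0 + c);
  // Eq. 2.2: scale density rho0_NFW.
  const double rho0 = (200.0 / 3.0) * c * c * c / mu * rho_crit;
  // Eq. 2.3 rearranged for the scale radius R_s.
  const double Rs = std::cbrt(M200c / (4.0 * pi * rho0 * mu));

  // Eq. 2.1: NFW density at a sample radius of 100 kpc.
  const double x = 100.0 * kpc / Rs;
  const double rho = rho0 / (x * (1.0 + x) * (1.0 + x));

  std::printf("R_s = %.0f kpc, rho_NFW(100 kpc) = %.2e g cm^-3\n",
              Rs / kpc, rho);
  return 0;
}
```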
The simulations were run on a Cartesian grid in a cubic volume with side length of 3.2 Mpc, with 64^3 cells in the base grid of the AMR hierarchy and a maximum of 8 levels of refinement, making the resolution of the finest cells approximately 195 pc. The mesh was refined based on the magnitude of gradients in fluid quantities and high baryon density. Additionally, a cubic grid with side length 4 kpc around the simulation center was fixed at the maximum level of refinement with 195 pc resolution. Each simulation was allowed to run for 16 Gyr or until excessive AGN feedback during a cooling catastrophe either created unphysical cell values or led to intractably small timesteps (see Section 2.5.2). To give context to the simulation duration, consider that the sound speed of gas with a temperature of T = 2 × 10^7 K is c_s = √(γ k_B T / (μ m_H)) ≈ 0.70 Mpc Gyr^−1, where m_H is the mass of hydrogen and μ = 0.6 is the mean mass per particle in units of m_H, meaning that the approximate sound crossing time across the inner R = 0.5 Mpc, where the majority of the dynamics of the galaxy cluster halo evolves, is approximately 1.4 Gyr.

We used the ZEUS solver for hydrodynamics (Stone & Norman, 1992) due to its robustness in evolving through discontinuities in the fluid around the AGN caused by sharply peaked thermal injection. ZEUS is a relatively diffusive solver and requires an artificial viscosity, which may affect the accuracy of the hydrodynamics simulation (Stone & Norman, 1992; Meece Jr, 2016). Tabulated cooling was used to model radiative cooling following Schure et al. (2009), assuming a metallicity of 0.5 Z_⊙. The cooling table has a temperature floor of 10^4 K; any processes below this temperature will take place on a smaller scale than can be accurately explored with our spatial resolution. Simulation results were analyzed using yt (Turk et al., 2011).

2.3.2 AGN Feedback Kernels

In our simplified AGN feedback model, thermal energy is deposited in a spherically symmetric distribution around the halo center by an assumed AGN, with the total amount of heating set equal to the total cooling in the halo every 10 Myr. Heating per unit volume ė(r) is distributed following a power law in radius so that ė(r) ∝ r^−α. This basic power-law functional form has several numerical and practical issues.

Figure 2.1: Top: Local ratio of heating to cooling as a function of radius (r) at the beginning of several representative simulations. The dotted blue line shows a simulation with low central heating and heating kernel parameters α = 2.0, r_s = 8 kpc, and r_c = 1000 kpc. The dashed orange line shows a simulation with high central heating and heating kernel parameters α = 2.6, r_s = 1 kpc, and r_c = 150 kpc. The solid green line shows a simulation with intermediate central heating and heating kernel parameters α = 2.6, r_s = 12 kpc, and r_c = 150 kpc. Bottom: Cumulative ratio of heating to cooling within r for the same simulations. At large radii, all of the cumulative heating curves converge to the cumulative cooling rate because total heating is normalized to equal the total cooling rate at R = 1.5 Mpc.
Most critically, these issues are a volumetric heating rate that diverges to infinity at the halo center, a "long tail" of heating at the halo outskirts where cooling is too slow to be relevant, and an unrealistic hard cutoff at the simulation boundaries. These latter two issues are compounded by observations that suggest AGN feedback is generally constrained to be within a few hundred kpc of the halo center. To address these issues and to create a more tunable and effective heating kernel, we added two parameters: a minimum truncation radius r_s (effectively a smoothing length) and an exponential decay cutoff radius r_c. To avoid having the feedback stop at a simulation boundary at x, y, z = ±1.6 Mpc, the AGN feedback is contained within a radius of R = 1.5 Mpc and set to zero outside this radius. Since the heating leading up to R is negligible compared to the cooling at far radii and the cooling time of the gas is much longer than the simulation time at that radius, we do not expect the value of R to have an impact on the outcome of the simulation. The full form of the feedback kernel defining the heating rate per unit volume ė(r, t) [erg s^−1 cm^−3] is

ė(r, t) = (Ė(t)/A) × { (r_s/r_c)^−α exp(−r_s/r_c)  for r ≤ r_s;  (r/r_c)^−α exp(−r/r_c)  for r_s < r ≤ R;  0  for R < r }.    (2.7)

The scalar A [cm^3] is defined by

A = ∫_0^{r_s} 4πr^2 dr (r_s/r_c)^−α exp(−r_s/r_c) + ∫_{r_s}^{R} 4πr^2 dr (r/r_c)^−α exp(−r/r_c)    (2.8)
  = (4π/3) r_s^3 (r_s/r_c)^−α exp(−r_s/r_c) + 4π r_c^3 [ Γ(3 − α, r_s/r_c) − Γ(3 − α, R/r_c) ],    (2.9)

where Γ(s, x) = ∫_x^∞ t^{s−1} e^{−t} dt is the upper incomplete gamma function, and normalizes ė(r, t) so that the integral of ė over the volume of the simulation matches Ė(t). Higher values of α correspond to more centralized feedback around the AGN. Without the inner smoothing length, a heating kernel with α ≥ 3 is not normalizable, because integration over a volume containing the origin diverges.

The total heating rate Ė(t) is set to the total cooling rate within the cluster halo. Since the total cooling rate can be difficult to compute on-the-fly due to the nature of the AMR hierarchy's timestep update, it is recomputed only every 10 Myr. Although the cooling rate increases exponentially leading up to a cooling catastrophe, the increase is slow enough that the heating rate does not fall behind the true cooling rate by more than a few percent except immediately within a Myr before the catastrophe, at which point the simulation has already demonstrated that the particular heating kernel being tested is inadequate.

Note that the short time scale over which heating reacts to cooling in our model is not physical. Heat deposition resulting from AGN feedback does not instantaneously happen far from the AGN. We therefore probed heating kernels with a 50 Myr lag time between heating and cooling as well as averaging cooling over the same time period to smooth out jumps in heating. However, adding lag time led to more cold gas forming due to the lack of immediate feedback to counter condensation and more explosive feedback overall.

This study tested 91 different heating kernels with a range of parameters: different radial exponents α ∈ [2.0, 3.2], smoothing lengths r_s ∈ {1, 4, 8, 10, 12, 16, 20, 40} kpc, and exponential cutoff radii r_c ∈ {100, 150, 200} kpc.
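A minimal sketch of evaluating the heating kernel of Equation 2.7 is shown below; for simplicity the normalization A of Equation 2.8 is computed by straightforward numerical integration rather than through the incomplete gamma functions of Equation 2.9, and the parameter values are illustrative.

```cpp
// Minimal sketch of the radial heating kernel of Equation 2.7. The
// normalization A (Equation 2.8) is computed by simple numerical
// integration instead of the incomplete gamma functions of Equation 2.9.
// Parameter values are illustrative; lengths are in kpc.
#include <cmath>
#include <cstdio>

struct Kernel {
  double alpha, r_s, r_c, R;  // exponent, smoothing, cutoff, outer radii

  // Unnormalized radial weight w(r) from Eq. 2.7.
  double weight(double r) const {
    if (r > R) return 0.0;
    const double x = (r <= r_s ? r_s : r) / r_c;
    return std::pow(x, -alpha) * std::exp(-x);
  }

  // Normalization A = integral of 4 pi r^2 w(r) dr from 0 to R (Eq. 2.8),
  // evaluated with a midpoint rule.
  double normalization(int nbins = 100000) const {
    const double pi = 3.14159265358979;
    const double dr = R / nbins;
    double A = 0.0;
    for (int i = 0; i < nbins; i++) {
      const double r = (i + 0.5) * dr;
      A += 4.0 * pi * r * r * weight(r) * dr;
    }
    return A;
  }

  // Volumetric heating rate e_dot(r) for a total heating rate E_dot.
  double e_dot(double r, double E_dot, double A) const {
    return E_dot / A * weight(r);
  }
};

int main() {
  const Kernel k{2.6, 12.0, 150.0, 1500.0};  // alpha, r_s, r_c, R [kpc]
  const double A = k.normalization();        // [kpc^3]
  const double E_dot = 1.0;                  // arbitrary total heating rate
  std::printf("e_dot(1 kpc)/e_dot(100 kpc) = %.3g\n",
              k.e_dot(1.0, E_dot, A) / k.e_dot(100.0, E_dot, A));
  return 0;
}
```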
We began our exploration of the parameter space by setting r_s = 1 kpc and r_c = 1500 kpc and sampled the range of α before trying different values of r_s and r_c with a smaller number of α values, seeking parameter combinations that seemed closest to an optimal kernel. Figure 2.1 presents a representative sampling of heating kernels showing the initial ratio of heating to cooling as a function of radius, including both the local ratio at each radius and the cumulative ratio within each radius. Table 2.1 lists all combinations of parameters explored.

Table 2.1: List of combinations of inner smoothing radius r_s [kpc], outer cutoff radius r_c [kpc], and exponent α used. The rightmost column lists all values of α explored for the given combination of r_s and r_c in the leftmost and middle columns.

r_s [kpc]  r_c [kpc]  α
1          150        2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2
1          1000       2.0, 2.1, 2.2, 2.3, 2.35, 2.375, 2.4, 2.425, 2.45, 2.5, 2.525, 2.55, 2.575, 2.6, 2.65, 2.7, 2.8, 2.9, 3.0
4          150        2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2
8          150        2.0, 2.2, 2.4, 2.6, 2.8, 2.9, 2.95, 3.0, 3.2
16         150        2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2
10         150        2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2
12         150        2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2
16         100        2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2
16         150        2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2
20         100        2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2
40         150        2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2

2.4 Results

All the heating kernels we explored resulted either in cooling catastrophes within a few Gyr, central entropy levels greater than observations, or both. Simulations that eventually formed cold, condensed gas all went through cooling catastrophes. In those simulations, the minimum entropy drops over time, eventually leading to multiphase condensation. As cold clumps of gas form and runaway cooling begins, the requirement for total heating to match total cooling causes the heating rate to spike.

Figure 2.2: Schematic illustrations of how different AGN heating kernels affect the entropy profile of a simulated galaxy cluster. In each case, the total heating rate is set equal to the total cooling rate. Top: Radial profiles of radiative cooling and AGN heating per unit volume, with the initial median cooling rate in black and the AGN heating kernel in color. Bottom: Response of the median entropy profile to heat input. The initial median profile is in black and the response is in color. The left column shows a heating kernel with central heating that falls below central cooling. The entropy profile in this case tends to follow a power law down to the origin and eventually leads to a central cooling catastrophe. The center column shows a heating kernel with excessive central heating, which elevates central entropy, inverts the entropy profile, and produces a central convective zone. The right column shows a heating kernel with intermediate central heating, which slightly raises the central entropy and produces a flat core. Due to the high initial entropy and long cooling time at outer radii, the power law at the outer radii changes very slowly with under- and over-heating.
The time required for cold gas to form is roughly correlated with the smallest radius at which cooling exceeds heating. If central cooling exceeds central heating, the halo quickly forms cold gas and experiences a cooling catastrophe. Simulations with higher central heating tend to have high central entropy, similar to observations of NCC clusters. If the heating exceeds cooling out to radii of several tens of kpc, then the simulations persist for many Gyr without forming cold gas. Under- and over-heating at outer radii beyond 100 kpc is inconsequential, since the time scale of heating there is much longer than the dynamical time scale of the system due to the large specific energy and entropy at initialization. Figure 2.2 schematically shows the general behavior of the different heating kernels. The three heating kernel examples in Figure 2.1 have colors that match the corresponding schematics in Figure 2.2. Figure 2.3 shows mass density profiles of cooling rate, heating rate, and entropy at later moments in simulations employing the same three heating kernels as in Figure 2.1.

2.4.1 Categorization of Simulations

The results of our simulations can be grouped according to the morphology of the entropy profiles that develop within the central 100 kpc:

1. Central Cooling. The entropy profiles of simulated cluster halos with heating that is insufficient to balance radiative cooling at small radii develop central cooling flows with a positive entropy gradient at all radii. They undergo a central cooling catastrophe relatively quickly, in which runaway multiphase condensation at small radii brings the simulation to a halt.

2. Central Convective Zone. The entropy profiles of simulations with high central heating form an inner convective zone with high central entropy and a negative central entropy gradient. Those simulations persist the longest before undergoing cooling catastrophes.

3. Central Entropy Floor. Simulations with intermediate central heating can maintain a nearly flat entropy gradient within the central ∼10 to 20 kpc.

For the purposes of our analysis, we define these categories based on the entropy within the inner 25 kpc. We categorize as Central Cooling those simulations whose average minimum entropy remains below 12 keV cm² (2/3 of the initial minimum central entropy of 18 keV cm²). The Central Convective Zone simulations are defined to have maximum central entropy above 50 keV cm² (equal to the initial mean entropy of the inner 100 kpc). No simulation meets both of these criteria, so there is no overlap between these first two groups. The remaining simulations, which have minimum central entropies above 12 keV cm² and maximum central entropies below 50 keV cm², are categorized as Central Entropy Floor simulations. The schematic diagrams in Figure 2.2 illustrate the general behavior of the different categories. Figure 2.3 shows representative snapshots of both cooling rate and entropy versus radius. Some of our simulations exhibit behavior from multiple categories at different times in their evolution. The following subsections describe each category in more detail.

Figure 2.3: Mass density plots of cooling and heating rate (top) and entropy (bottom) versus radius, with color representing the total mass of all simulation cells from a 2D histogram of cooling rate and entropy versus radius. Across the three columns we show three simulations at different times that broadly represent the whole set of simulations, as differentiated by the behavior of the inner tens of kpc.
The left column shows a simulation (with α = 2.0, r_s = 8 kpc, and r_c = 1000 kpc, at t = 0.3 Gyr) with low central heating, which allows excess central cooling and quickly undergoes a cooling catastrophe. The middle column shows a simulation (with α = 2.6, r_s = 1 kpc, and r_c = 150 kpc, at t = 3.0 Gyr) with high central heating that maintains a convective zone in the inner 100 kpc with a high central entropy peak. The right column shows a simulation (with α = 2.6, r_s = 12 kpc, and r_c = 150 kpc, at t = 8.0 Gyr) with an intermediate amount of central heating that holds a flat entropy floor slightly elevated from the initial conditions and from observational data on the entropy of the inner tens of kpc. On the entropy plots, observational entropy data of clusters from the ACCEPT data set are displayed in grayscale showing the range (light grey), 68% confidence interval (dark grey), and median (black line) of the dataset. The median entropy is also marked by a magenta line, and the minimum (K_L) and maximum (K_H) values of the entropy median within the inner 25 kpc are marked by stars. On the cooling rate plots, the heating rate is marked by a red line and the median cooling rate is marked by a blue line. The crossover radii r_− and r_+ as defined in the text are marked by stars in the simulations where they can be defined. The heating curve parameters r_s and r_c are also annotated with finely dashed and dashed gray lines.

2.4.1.1 Central Cooling

Simulations with low α, large r_c, or large r_s tend to have central cooling exceeding central heating, which quickly leads to a cooling catastrophe. The left column in Fig. 2.3 shows an example of such a simulation. Within the inner 10 kpc, the heating rate ranges from half the cooling rate to more than an order of magnitude less than the cooling rate. Because the central heating is insufficient to counteract a growing mass of strongly cooling gas at the halo center, the simulation produces a cooling catastrophe within 2 Gyr. However, up to the moment at which a substantial quantity of cold gas forms, the entropy profile remains close to the initial state and similar to the cool-core clusters in the ACCEPT data set.

2.4.1.2 Central Convective Zone

Heating rates within the central ∼10 kpc of simulations with high α, small r_c, or small r_s tend to greatly exceed radiative cooling. The middle column in Fig. 2.3 shows an example. Excess central heating leads to a central entropy peak and an inverted entropy profile that drives convection. Low-entropy gas at the minimum entropy point sinks toward the center, but is reheated there and eventually rises to larger radii. Such a convective configuration can persist for many Gyr without producing multiphase condensation, because the minimum entropy and minimum cooling time are both large.

A few of the simulations in this category do form multiphase gas. When that happens, condensation first appears at the minimum of the entropy profile and rapidly leads to a cooling catastrophe. Although these simulations have large central heating rates, the heating rate still falls below cooling at intermediate radii (near the entropy minimum), allowing large clumps of cold gas to form there. In all cases in which a convective central zone forms, the central entropy is excessive compared with observed CC clusters, in some cases being more typical of an NCC.
2.4.1.3 Central Entropy Floor

Simulations with intermediate central heating, corresponding to a narrow range of combinations of α, r_s, and r_c, are able to maintain quasi-stable flat entropy profiles out to radii exceeding 10 kpc. The right column in Fig. 2.3 shows an example. Central heating within the inner 10 kpc of these simulations is typically several times the central cooling rate, sufficient to offset runaway cooling but not great enough to produce a large entropy inversion. Only some of these simulations form cold gas, and those that do typically form it at larger radii and later times than in the Central Cooling simulations. However, the central heating in these simulations is still great enough to elevate the central entropy above the values observed in CC clusters.

2.4.2 Important radii: r_L, r_H, r_−, r_+, and r_multi

To help with the analysis of the simulations, we identify several quantities that proved to be useful for interpreting their behavior. Those quantities are labeled in Figure 2.3.

The maximum and minimum entropy levels in the central regions turn out to be closely related to the time it takes for a cooling catastrophe to manifest. To quantify those extremes we first determine the median entropy at each radius, illustrated by the purple dotted lines in Figure 2.3. We then define K_L to be the minimum of the median entropy profile and r_L to be the radius at that point. Outside of r_L the median entropy profile is stable to convection, but inside of r_L it is convectively unstable. In simulations with low central heating, r_L is close to the center. We define K_H to be the maximum of the median entropy profile within 25 kpc of the simulation center and r_H to be the radius at that point. We use the 25 kpc cutoff to exclude cosmologically heated gas at large radii from the analysis, in order to focus on the effects of feedback heating. The initial entropy at 25 kpc is just below 30 keV cm², so a persistent K_H above 30 keV cm² indicates that heating has elevated the central entropy, making it too great for a CC cluster and possibly producing a central convective zone. The entropy extrema K_L and K_H and the corresponding radii r_L and r_H evolve over time as feedback alters the median entropy profile. We denote the cooling times at those radii by t_c(r_L) and t_c(r_H). The value of t_c(r_L) is closely linked to the time required for condensation to begin. How the heating kernel parameters affect K_H and K_L, along with the associated radii and cooling times, is explored in Sections 2.4.3, 2.4.4, and 2.5.1.

The radii at which heating equals cooling are special and come in two types. For one type, the net heating rate goes from positive to negative as r increases. We define r_− to be the smallest such radius. Excess heating within that radius tends to raise the median entropy while excess cooling at larger radii causes the median entropy to decline. The result is flattening and sometimes inversion of the median entropy profile, which drives convection and ultimately makes the system prone to condensation near r_−. However, if cooling dominates heating in the central regions, then r_− is undefined. Some relationships between r_− and the simulation outcomes are explored in Section 2.4.3. At the other type of heating-cooling equality radius, the net heating rate goes from negative to positive as r increases. We define r_+ to be the largest such radius. Outside of r_+, net heating raises the median entropy and suppresses condensation.
Within r_+, net cooling lowers the median entropy. Together, these effects produce a positive entropy gradient in the vicinity of r_+. While the median cooling rate may exceed the heating rate at very large radii (on the order of hundreds of kpc), cooling times at those radii are so long that cold gas does not form on an astrophysically significant time scale. During a given simulation, the radii r_− and r_+ do not stay fixed, but rather shift as heating and cooling change the median cooling rate. We denote the cooling times at those radii as t_c(r_−) and t_c(r_+).

The heating kernel parameters also affect when cold gas forms in the simulations and at what radius the cold gas first appears. We define t_multi to be the time from the beginning of the simulation to the moment when multiphase condensation produces cold gas. In our analysis, we use 10⁵ K as the temperature cutoff for cold gas, although gas around this temperature will rapidly cool to colder temperatures. Our temporal resolution of t_multi is limited by the frequency of output to disk, which is every 10 Myr. We define r_multi to be the radius at which cold gas first appears, using the innermost radius if cold gas appears simultaneously at multiple radii. The relationship between r_multi, r_H, t_multi, and t_c(r_−) is explored in Section 2.4.3. Table 2.2 summarizes the variables defined in this section. These variables are used in figures and analysis in later sections.

Table 2.2: Brief definitions of variables described in full in the text and used in later figures. "Median" here refers to the median of the distribution of a variable (e.g., entropy, cooling rate) at a given radius.

K_L        Lowest median entropy
K_H        Highest median entropy within 25 kpc of the simulation center
r_L        Radius of lowest median entropy
r_H        Radius of highest median entropy within 25 kpc of the simulation center
r_−        Inner radius within which median heating exceeds median cooling
r_+        Outer radius outside of which median heating exceeds median cooling
t_c(r_x)   Median cooling time at radius r_x
t_multi    Simulation time at which multiphase gas first forms
r_multi    Radius at which multiphase gas first forms

2.4.3 Condensation of Cold Gas

Multiphase condensation forms cold gas in many of the simulations, in each case leading to a cooling catastrophe. Cold gas starts forming near r_L, then falls toward the center, displacing buoyantly rising warmer gas. The location of r_L depends on the heating kernel parameters and is related to r_−. However, when gas at r_L cools enough to transition into the cold phase, it sharply raises the total cooling rate of the halo. That event immediately boosts the heating rate by the same factor, because our AGN feedback prescription forces the total heating rate to equal the total cooling rate. This heat is distributed across the halo and is not concentrated on the cooling gas, and thus the AGN feedback does not halt the cooling catastrophe.
In many cases, rapid heating of lower-density gas during the cooling catastrophe produces such great sound speeds and creates such large discontinuities in the fluid that the simulation becomes infeasible to continue due to the Courant condition. At that point the heating input greatly exceeds the AGN activity observed in real CC clusters, meaning that the chosen heating kernel has become physically unrealistic. In simulations that managed to evolve through this catastrophic event, the heat input leads to drastically elevated entropy in the ambient gas, which slowly reheats the embedded cold gas and prevents more cold gas from forming. After the cooling catastrophe, the core entropy is left much higher than before the catastrophe. Figure 2.4 illustrates the timeline of a catastrophe resulting from an increasing cooling rate that leads to the formation of cold gas.

Figure 2.4: Time dependence of total cooling rate (solid lines) and total mass of condensed gas under 3 × 10⁴ K (dashed lines) for the three simulations shown in Figure 2.3. The blue points show a simulation with low central heating and excess central cooling (α = 2.0, r_s = 8 kpc, r_c = 1000 kpc) that experiences an early cooling catastrophe. Orange points show a simulation with high central heating (α = 2.6, r_s = 1 kpc, r_c = 150 kpc) that forms a quasi-stable central convective zone. Green points show a simulation with intermediate central heating (α = 2.6, r_s = 12 kpc, r_c = 150 kpc) that maintains a flat entropy core for almost 10 Gyr before undergoing a late cooling catastrophe. In simulations that form a multiphase gas through a cooling catastrophe, the formation of cold gas is preceded by a rise and then a sharp peak in the total cooling rate.

Our simulation set generally demonstrates that the radii r_multi and r_L are both related to r_−. Figure 2.5 shows the relationships among the values of those three radii. We average these quantities over time from the simulation outputs, which have a 10 Myr frequency, in order to produce one data point per heating kernel. Larger ⟨r_−⟩ corresponds to larger ⟨r_L⟩, as shown in the top left panel, meaning that the radius of lowest entropy corresponds to the inner radius inside of which heating exceeds cooling. The top right panel shows that larger ⟨r_−⟩ corresponds to larger r_multi, meaning that the inner radius inside of which heating exceeds cooling roughly determines where cold gas first forms. In the bottom left panel, larger ⟨r_L⟩ also corresponds to larger r_multi, showing that multiphase gas typically first forms around the entropy minimum.

The relationship between r_− and the formation of cold gas is most apparent in the plot of t_c(r_−) versus t_multi in the bottom right panel. When r_− is larger, so that cooling first exceeds heating at a larger radius, the cooling time at r_− is longer, which leads to cold gas forming later in the simulation. The timescale on which cold gas forms is closely tied to the cooling time of this gas. Interestingly, the relationship is non-linear, following

$$
t_\mathrm{multi} = \frac{\langle t_c(r_\mathrm{multi}) \rangle^2}{200\ \mathrm{Myr}}.
\tag{2.10}
$$

This result is consistent with previous work by Meece et al. (2015) exploring the condensation of gas in the central ICM of galaxy clusters. Meece et al. (2015) found in thermally balanced ICM simulations with varying initial ratios of cooling time to freefall time that gas with a greater initial ratio remains nearly homogeneous for a larger number of cooling times before condensing into a multiphase gas, suggesting a non-linear relationship between cooling time and the formation of a multiphase medium.
Figure 2.5: Plots of relationships between r_−, the radius at which the gas switches from net heating to net cooling, and other features of the simulations. Top left: Time-averaged radius of the minimum of the median entropy profile (r_L) versus the time average of r_− up to the formation of a multiphase gas. (Includes only simulations in which r_− can be defined for at least 50 Myr.) Top right: Radius at which multiphase gas first forms versus the time-averaged r_−. (Includes only simulations in which r_− can be defined for more than one time step.) Bottom left: Radius at which multiphase gas first forms versus the time-averaged value of r_L for all simulations. Bottom right: The time required for a simulation to form multiphase gas versus the time-averaged value of the cooling time at r_−. (Includes only simulations that form multiphase gas and in which r_− can be defined for at least 50 Myr.) Shapes in each panel denote the general behavior of the central region of the simulation: blue highlighted triangles denote Central Cooling simulations, orange highlighted circles denote Central Convective Zone simulations, and green highlighted stars denote Entropy Floor simulations. Colors show the heating kernel parameter α, with greater α generally corresponding to heating that is more centrally concentrated.

Figure 2.6: Left: Time required to form multiphase gas in a simulation versus the ratio of heating to cooling within the inner 10 kpc at the first time step. Right: Maximum of the median entropy within the inner 25 kpc versus the ratio of heating to cooling within the inner 10 kpc at the first time step. In both panels, a solid line marks a heating-to-cooling ratio of 2 and a dashed line marks a heating-to-cooling ratio of 5. A ratio of at least 2 is required to avoid multiphase condensation within 1 Gyr. In the right panel, a dashed line marks the maximum central entropy that is observationally expected for a CC cluster.
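As a concrete illustration, the diagnostics defined in Section 2.4.2 and used in Figures 2.5–2.7 can be extracted from median radial profiles along the following lines (an illustrative Python sketch with hypothetical array names; it is not the analysis code used to produce the figures):

```python
import numpy as np

def central_diagnostics(r, K_median, net_heating, r_cut=25.0):
    """Extract the diagnostics of Table 2.2 from radial profiles.

    r            -- 1D array of radii [kpc], increasing
    K_median     -- median entropy profile [keV cm^2] at those radii
    net_heating  -- median heating minus median cooling at those radii
    r_cut        -- radius [kpc] bounding the 'central' region
    """
    inner = r <= r_cut

    # Entropy extrema: K_L anywhere, K_H restricted to the inner 25 kpc
    i_L = np.argmin(K_median)
    i_H = np.argmax(np.where(inner, K_median, -np.inf))
    K_L, r_L = K_median[i_L], r[i_L]
    K_H, r_H = K_median[i_H], r[i_H]

    # Crossover radii: sign changes of the net heating rate with radius
    crossings = np.flatnonzero(np.diff(np.sign(net_heating)) != 0)
    pos_to_neg = [i for i in crossings if net_heating[i] > 0]   # candidates for r_minus
    neg_to_pos = [i for i in crossings if net_heating[i] < 0]   # candidates for r_plus
    r_minus = r[pos_to_neg[0]] if pos_to_neg else np.nan   # innermost + -> - crossing
    r_plus = r[neg_to_pos[-1]] if neg_to_pos else np.nan   # outermost - -> + crossing

    return dict(K_L=K_L, r_L=r_L, K_H=K_H, r_H=r_H, r_minus=r_minus, r_plus=r_plus)
```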
2.4.4 Central Heating

The heating kernel parameters also affect the central entropy of the cluster halo, in some cases resulting in unreasonably high levels for a CC cluster and in other cases allowing cold gas to quickly condense and collect in the halo center. The central entropy and general behavior of the core are directly related to the amount of heating compared to cooling in the halo center. A certain amount of heating in the center is necessary to offset the central cooling, but an excess of heating in the halo center causes central entropies higher than observed in CC clusters.

To explore this behavior, we track the ratio of the total heating within the inner 10 kpc of the halo to the total cooling within the same volume. (The inner 10 kpc volume was chosen to coincide with the region within which the initial entropy profile is nearly flat; we also tested this analysis using the inner 20 kpc volume and found similar results.) Figure 2.6 shows t_multi and the time average of K_H versus the initial central heating-to-cooling ratio. A heating-to-cooling ratio of approximately two is needed to maintain quasi-stability for any significant amount of time, while a ratio greater than five always leads to high central entropies. Inside this range of heating-to-cooling ratios, different heating kernels produce all three categories of central entropy behavior.

When the integrated heating in the inner region is less than twice the cooling in the same region, a cooling catastrophe happens within 1 Gyr. For simulations with less heating than cooling in the central region, cooling quickly causes the central entropy profile to approximate a power law down to the halo center. Cooling gas then flows down the entropy gradient, collecting in the center and forming multiphase gas. In simulations with average heating one to two times the average cooling rate in the center, density inhomogeneities in the gas allow cooling to exceed heating in some locations. As the cooling of that gas increases, the total heating rate rises but is insufficient to counter the localized increase in cooling, thus leading to a runaway cooling catastrophe. Additionally, as central entropy falls and density increases in the lead-up to the catastrophe, central pressure increases and compresses clumps of cooling gas, further accelerating their cooling during the runaway catastrophe. In simulations with heating-to-cooling ratios above two in the central region, the central cooling is more successfully countered, so that the formation of multiphase gas happens on a longer timescale connected to t_c(r_L) and t_c(r_−), as discussed in Section 2.4.3. The left plot in Figure 2.6 also shows this distinction in behavior.

When central heating rates are more than two times greater than the cooling rate, excess heating leads to central entropies that are higher than what is observed for CC clusters. The right plot in Figure 2.6 shows the relationship between the ratio of central heating to cooling and the maximum entropy in the central region averaged over time. Some simulations with two to five times more heating than cooling in the center stay under the typical 30 keV cm² specific entropy for CC clusters, but all of the simulations with heating-to-cooling ratios greater than five produce unrealistically high entropies.

Figure 2.7: Left: Relationship between the initial ratio of heating to cooling averaged over the inner 10 kpc and the time-averaged radius ⟨r_−⟩ beyond which cooling begins to dominate over heating. Only those simulations in which r_− can be defined for at least 50 Myr are included. The box in the lower right shows where hypothetical simulations with an average r_− over 30 kpc and an inner heating-to-cooling ratio under five would fall. Right: Relationship between the time average of K_H (the maximum level of the median entropy profile within the inner 25 kpc) and the time t_multi until multiphase gas forms in the simulation. The plot includes all simulations, assigning t_multi = 16 Gyr to simulations that do not form cold gas by that time.
An empty box in the lower right corner indicates where points representing heating kernels satisfying the adequacy criteria would fall, by persisting for more than 5 Gyr before forming multiphase gas while maintaining a maximum entropy level < 30 keV cm² within 25 kpc. However, no heating kernel we tested satisfies those criteria.

With values of K_H above the 30 keV cm² specific entropy, where the isentropic entropy profile changes into a power law, these simulations form an inverted convective zone in which hot gas collects in the halo center and cold gas collects at r_L at intermediate radii.

2.5 Discussion

2.5.1 No Adequate Heating Kernel

None of the 91 heating kernels we simulated meet all three of the adequacy criteria specified in Section 2.2. The failure modes we observe in the simulations can be discussed in terms of the same behavioral categories listed in Section 2.4.1 for the central entropy profile:

1. Central Cooling. Heating kernels with low central heating fail to meet our first criterion by producing a cooling catastrophe within ∼1 Gyr that radically changes the structure of the ambient medium.

2. Central Convective Zone. Heating kernels with high central heating produce central convective zones that fail to meet our second criterion by producing central entropy levels greatly exceeding those observed among typical CC clusters. Some of the simulations in this group also fail our longevity criterion because the heating kernel is unable to prevent an early cooling catastrophe due to insufficient heating at intermediate radii.

3. Central Entropy Floor. The heating kernels closest to being adequate, according to our criteria, were those with intermediate central heating that exceeds central cooling, but not by a large factor. Those simulations maintain a quasi-stable entropy floor and prevent a cooling catastrophe for billions of years. However, the central entropy profiles of those simulations, while lower than those in the previous category, were still elevated compared to observed CC clusters and thus do not meet our second criterion. Lowering the central heating rates in an attempt to bring their entropy profiles more in line with observations also causes cold gas to form much more quickly. The simulation that provides results closest to a realistic cluster (with kernel parameters r_s = 12 kpc, r_c = 150 kpc, and α = 2.4) maintains a flat entropy core of 30 keV cm² and lasts for just under 4 Gyr, which may be sufficiently long to maintain a CC cluster between external heating events. No heating kernel we tested is able to maintain a low entropy floor close to observations of CC clusters for longer than 4 Gyr.

Figure 2.7 summarizes the failure modes of the heating kernels probed in this study. The right panel shows K_H versus t_multi, a measure of the longevity of a simulation before a cooling catastrophe strongly alters it. Some simulations prevent a multiphase cooling catastrophe for many Gyr while others maintain low central entropy, but no heating kernel accomplished both aims. The left panel shows the ratio of central heating to cooling versus r_−, the two parameters that most strongly influence the central entropy and longevity, respectively.

2.5.2 Robustness of Feedback Algorithm

The ultimate obstacle to finding an adequate thermal heating kernel is the difficulty of preventing gas in the halo center from overcooling while still maintaining a reasonably low entropy profile.
In order to prevent a cooling catastrophe, central heating must be sufficient to raise the median entropy profile enough to keep the lowest-entropy gas from undergoing runaway cooling. Our simulations show that an integrated central heating rate within the inner 10 kpc of approximately two times the cooling rate in that same region is necessary. Otherwise, too large a proportion of the gas within the central region ends up with cooling exceeding heating, causing a rapid increase in the total radiative cooling rate. The consequences of that rapid rise in cooling are dramatic, because the total heating rate is set equal to the radiative cooling rate and rises just as rapidly. However, that heat input is distributed more evenly across a large volume and cannot counteract radiative cooling of localized dense gas clumps. As a result, the ambient pressure sharply rises, compressing the dense clumps of low-entropy gas and causing both radiative cooling and the matching heating rate to increase. That coupling therefore causes the cooling/heating rate to spike to unphysically high levels during a cooling catastrophe (see Figure 2.4). Central internal energies and velocities then rapidly rise and create discontinuities in the fluid. Due to the Courant condition, the time steps sometimes become too small to continue evolving the simulations. In other cases, those discontinuities lead to negative densities and/or internal energies in the hydro solver, ultimately ending the simulation.

In reality, CC clusters can form cold gas (as is evident from observed star formation rates ranging from 1 to 100 M⊙ per year), and so a physically accurate model should accommodate the formation of moderate amounts of cold gas. However, a heating kernel that immediately responds by injecting compensating thermal energy with a fixed spatial distribution appears unable to accommodate multiphase condensation without causing excessive heating.

Figure 2.8: Top: Time-averaged median entropy profiles of the simulated cluster halos in Figure 2.3. The dotted line shows the simulation with low central heating (α = 2.0, r_s = 8 kpc, r_c = 1000 kpc), and the blue shaded region around it shows the 1σ dispersion of its median profile over time. The dashed line shows the simulation with high central heating (α = 2.6, r_s = 1 kpc, r_c = 150 kpc), and the orange shaded region around it shows its 1σ dispersion. The dot-dashed line shows the simulation with intermediate central heating (α = 2.6, r_s = 12 kpc, r_c = 150 kpc), and the green shaded region around it shows its 1σ dispersion. In each case, entropy is weighted by the X-ray luminosity in the 0.5–2.0 keV band to mimic data obtainable with Chandra. The median, 1σ interval, and full extent of the entropy profiles of clusters with less than 30 keV cm² from ACCEPT are shown in grayscale, using the broken power-law fits from Cavagnolo et al. (2009) for the entropy profiles. Bottom: X-ray surface brightness in the 0.5–2.0 keV band for the same simulated halos, with shaded regions showing the 1σ dispersion and black lines showing the median. The median, 1σ interval, and full extent of the surface brightness profiles of CC clusters from ACCEPT are shown in grayscale, derived from their electron density and temperature profiles.
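The ACCEPT comparison bands in Figure 2.8 are built from published per-cluster fits; assuming the standard ACCEPT parametrization K(r) = K0 + K100 (r/100 kpc)^α from Cavagnolo et al. (2009), the grayscale summary can be sketched as follows (illustrative Python; the fit table itself must be supplied and the values shown are hypothetical):

```python
import numpy as np

def accept_entropy_band(fits, r):
    """Evaluate ACCEPT-style entropy profiles K(r) = K0 + K100 * (r / 100 kpc)**alpha
    for a table of (K0, K100, alpha) fits [keV cm^2], keep only cool-core clusters
    with K0 < 30 keV cm^2, and return the median and 16th/84th percentiles versus r."""
    cc_fits = [(K0, K100, a) for (K0, K100, a) in fits if K0 < 30.0]
    profiles = np.array([K0 + K100 * (r / 100.0) ** a for (K0, K100, a) in cc_fits])
    return np.median(profiles, axis=0), np.percentile(profiles, [16, 84], axis=0)

r = np.logspace(0, 3, 50)   # radii from 1 kpc to 1 Mpc
# example usage with a hypothetical fit table:
# median_K, (K_lo, K_hi) = accept_entropy_band([(10.0, 120.0, 1.2), (25.0, 150.0, 1.1)], r)
```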
2.5.3 Comparison to Observations

Figure 2.8 shows the time-averaged median entropy profile and projected X-ray surface brightness profile, along with the 1σ dispersion in the median profiles. It also shows the median entropy profile of observed CC clusters in the ACCEPT dataset (Cavagnolo et al., 2009), along with the 1σ dispersion and the full range. The dispersion in the simulated profiles is computed in radial bins over the lifetime of each simulation, up until the formation of cold gas or the end of the simulation. The dispersion in the ACCEPT data is generated from a table of power-law fits to the entropy profiles. Only CC clusters from ACCEPT with K_0 < 30 keV cm² are used.

No quasi-stable simulation maintains a central entropy close to the majority of the CC clusters in the ACCEPT dataset. Heating kernels that keep entropies low, within the range of the ACCEPT CC clusters, are not steady for more than 1 Gyr, and all experience central cooling catastrophes. Heating kernels that form central convective regions have higher central entropies than the ACCEPT CC clusters. Simulations that form a central entropy floor have lower entropies than the central convective zone simulations and are steady for longer periods than the low central heating kernels, but still have higher central entropies than the majority of observed CC clusters in the ACCEPT dataset.

The differences among the X-ray surface brightness profiles are more subdued, with more centralized feedback corresponding to a lower central surface brightness. The median central surface brightness of the simulation shown here with a central catastrophe is within an order of magnitude of the simulations that form a convective zone. Additionally, the surface brightness profiles from the simulations fall inside the 1σ interval of the CC clusters from ACCEPT.

2.5.4 Comparison to Other Simulations

Thermal regulation of galaxy clusters by AGN jets has been studied previously through numerical simulation using many different models of AGN feedback. These approaches include injection of buoyant bubbles (Brüggen, 2003b; Hillel & Soker, 2016), magnetic fields (Li et al., 2006; Nakamura et al., 2006, 2007; Huarte-Espinosa et al., 2012), kinetic jets (Wu et al., 2015; Martizzi et al., 2016; Hahn et al., 2017; Meece et al., 2017), stochastic momentum feedback (Weinberger et al., 2017; Nelson et al., 2019), cosmic rays (Jubelgas et al., 2008; Butsky & Quinn, 2018), and turbulent heating (Gaspari et al., 2012a; Zhuravleva et al., 2014; Banerjee & Sharma, 2014), either explicitly or implicitly driven by the central SMBH. Some simulations have also used purely thermal feedback models like the model used in this work, to which we can compare.

Meece et al. (2017), the predecessor to this work, tested an AGN feedback model consisting of a precessing bipolar jet that injected kinetic and thermal energy. They tested different fractions of AGN feedback going into thermal heating versus the kinetic jet. For triggering the feedback they tested three different models: a cold gas triggering model from Li & Bryan (2014a), a boosted Bondi-like triggering, and a Booth and Schaye accretion model (Booth & Schaye, 2009). Like this work, Meece et al. (2017) found that AGN models with purely thermal feedback led to an overabundance of cold gas in the simulation core. However, their thermal feedback was limited to a small region around the AGN, less than 1 kpc in diameter.
In their simulations, hot bubbles inflated via AGN heating at the cluster center buoyantly rose a short distance out of the center, to 10–30 kpc, and created a flatter entropy profile that was unstable to multiphase condensation and therefore failed to suppress large accumulations of multiphase gas. Many of the heating kernels tested in this paper rectify the problem of overly centralized heating, but they result in elevated core entropy beyond what is reasonable for a CC cluster, and globally our heating prescription is no longer robust to the formation of cold gas.

The Rhapsody-G project explored cosmological zoom-in simulations of galaxy clusters with star formation and feedback (SFF) and supermassive black hole (SMBH) formation and feedback, using the Ramses Eulerian AMR code (Wu et al., 2015; Teyssier, 2002). In their AGN feedback prescription, mass accreted onto the SMBH following a density-boosted Bondi-Hoyle accretion rate (Booth & Schaye, 2009). Thermal energy was deposited into a small radius around the SMBH (Martizzi et al., 2016). Compared to CC cluster entropy profiles from the ACCEPT catalogue, CC clusters in the Rhapsody-G simulations had lower central entropies, showing overcooling in the inner tens of kpc (Hahn et al., 2017).

Tremmel et al. (2017) presented the Romulus galaxy simulations, which use the ChaNGa smoothed particle hydrodynamics code and include SMBH feedback and SFF models tuned to observations. Their SMBH feedback model had two free parameters: (1) the efficiency of the accretion rate onto the SMBH and (2) the gas coupling efficiency ε_c. These parameters were calibrated to produce galaxies with observed values of the stellar mass to halo mass ratio, HI gas fraction as a function of stellar mass, galaxy specific angular momentum versus stellar mass, and the SMBH to stellar mass relation. Their simulations used a thermal-only feedback model that deposited feedback energy into the 32 gas particles nearest to the SMBH. Mass accretion was governed by a modified Bondi accretion rate. Cooling of gas heated by the SMBH was suppressed for a time equal to the time step of the SMBH, which allowed energy to escape away from the SMBH, although it may not be physically realistic. This feedback model produced galaxies with regulated SFF compared to observation.

In the follow-up paper Tremmel et al. (2019) on the cosmological RomulusC simulations, the same SFF and SMBH feedback models were used in a zoom-in simulation of a single halo. In an isolated halo, purely thermal feedback from the SMBH led to a conic structure with a highly collimated jet-like outflow. The outflows evolved over time, changing in shape and direction with the angular momentum of the gas near the SMBH. Energy was carried out to large radii through the outflows, which suppressed cooling at large radii. Star formation rates were regulated and matched observed rates in clusters. Additionally, the entropy profile of the clusters was within the range of observed profiles in CC clusters. Although the outflows were not explicitly introduced by their feedback prescription, their ability to transport AGN feedback energy tens of kiloparsecs from the center, without inverting the large-scale entropy profile and overstimulating thermal instability, is the key to proper thermal regulation of their simulated CC cluster.

2.5.5 Implications

Since the heating kernels explored here failed to produce quasi-stable CC clusters with realistic entropy profiles, extrapolations to real CC clusters may not be accurate.
However, a few lessons can be drawn from these simulations:

• In the context of purely thermal AGN feedback, feedback that is highly centrally concentrated and tied directly to the global radiative cooling rate produces cores with entropy levels that greatly exceed those of observed CC clusters and in some cases are physically unreasonable.

• When the total heating rate is directly tied to the total cooling rate in the halo, rapid cooling of gas into cold clumps causes the heating rate to reach unphysically high levels. In comparison, in simulations using Bondi accretion or cold gas accretion, such as in Meece et al. (2017), AGN feedback increases more gradually with the formation of cold gas, allowing the feedback energy output to tune itself to physically reasonable values.

• The heating kernels considered here, in which heating per unit volume had a fixed radial distribution, were unable to maintain thermal stability of the cluster halo. In cases where a cold clump of gas formed, the purely thermal AGN feedback was insufficient to disrupt the clump without injecting unphysically high amounts of energy. The thermal heating in these simulations was unable to reproduce the effects caused by kinetic outflows from AGN jets, such as in Meece et al. (2017).

A spherically symmetric heating kernel for purely thermal feedback that satisfies all of our criteria may exist but would need to have different parameters than are explored here. Such an idealized heating kernel would be useful for efficiently including AGN feedback in cosmological simulations.

2.5.6 Other Models Investigated

In search of a satisfactory heating kernel, we investigated several extensions to the spherically symmetric kernels described in Section 2.3. First, we applied a polar angle dependence of cos²θ to mimic the conical distribution of heat from a kinetic jet, with total heating still linked to total cooling. However, the decreased heating near the equatorial plane leads to cold gas forming several tens of Myr sooner than for the corresponding spherical kernel and does not change the general behavior of the cooling catastrophe. Next, we tried a model in which cold gas was removed from the center of the simulation as it formed, to decrease the central density, potentially avoid discontinuities in the fluid solver, and allow robust simulations with the formation of cold gas. However, explosive heat input triggered by the formation of cold gas still causes the hydrodynamics solver to fail. We also tested equating total heating to total cooling of only the warm gas, testing separately temperature thresholds of 10^6.5 K and 10^7 K, to exclude the rapid cooling of cold gas and avoid explosive AGN feedback. However, this filtering of cold gas in the calculation of the heating rate leads to more cold gas forming and leaves the remaining warm gas with an elevated central entropy. In some cases the heat input is still great enough to halt the simulation because of the Courant condition. Lastly, we tried smoothing out the rise in AGN heating by setting the total feedback to the average of the cooling rate over the last 50 Myr, in essence implementing a temporal kernel as well as a spatial kernel. However, this approach also leads to high rates of cold gas formation due to the delayed heating response, as well as an eventual spike in AGN heating, since the cooling catastrophe ultimately is not counteracted.
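The last variant, which sets the total feedback power to a 50 Myr trailing average of the total cooling rate, can be sketched as follows (an illustrative Python sketch with hypothetical names; the implementation used in the simulations differs):

```python
from collections import deque

class SmoothedFeedback:
    """Set the total AGN heating rate to the trailing average of the total
    cooling rate over a fixed window (here 50 Myr), as in the temporal-kernel
    variant described in Section 2.5.6."""

    def __init__(self, window_myr=50.0):
        self.window = window_myr
        self.history = deque()   # (time [Myr], total cooling rate) samples

    def update(self, t_myr, total_cooling_rate):
        # Record the latest cooling measurement and drop samples older than the window
        self.history.append((t_myr, total_cooling_rate))
        while self.history and self.history[0][0] < t_myr - self.window:
            self.history.popleft()
        rates = [rate for _, rate in self.history]
        return sum(rates) / len(rates)   # total heating rate E_dot(t)
```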
2.5.7 Future Models

There remain conceivable modifications to this heating kernel approach that we did not investigate, but which could produce more physically realistic CC clusters. For example, total heating could be capped at a physically reasonable value to avoid the overheating that coincides with the formation of cold gas. Additionally, we could investigate a radially piecewise conic feedback kernel in which AGN heating is spherically symmetric at small radii and conical at large radii. Another alternative would be a kernel with a spatial distribution that depends on the total heat input, adjusting to spikes in heating/cooling by distributing increased heating over a larger volume, as would happen with an increase in total jet power.

2.6 Summary

We have presented simulation results for simplified models of AGN feedback using heating kernels for purely thermal feedback. In those kernels, heat input has a spatial dependence following a radial power law ė ∝ r^−α with a smoothing length r_s at small radii, an exponential cutoff radius r_c at large radii, and a total heating rate set equal to the total cooling rate measured within the cluster halo. This approach differs from previous simulations approximating feedback rates using Bondi and cold gas accretion models, which can temper the feedback response but are computationally more expensive. Our intention was to identify a heating kernel that would be both computationally inexpensive and able to maintain the hot atmosphere of a galaxy cluster in a realistic quasi-steady state.

All of the heating kernels we tested failed to maintain a quasi-steady state with an entropy profile consistent with those observed among cool-core clusters (see Figures 2.3 and 2.7). We compared entropy profiles from our simulations to observational data from the ACCEPT dataset. Some simulations exhibit small to large central peaks in entropy that differ significantly from the entropy profiles seen in the ACCEPT sample. The central entropy peaks are most pronounced in simulations with highly centralized feedback. Simplified AGN models with overly centralized thermal heating therefore do not produce realistic entropy profiles.

A few lessons can be drawn from this work. Thermalization of AGN feedback energy must occur over a large region in order for the entropy profiles of simulated clusters to agree with those of observed cool-core clusters. However, it is difficult to distribute thermal feedback over a large region while also preventing a cooling catastrophe. Also, requiring total heating to equal total cooling becomes particularly problematic near the onset of a cooling catastrophe, because the increased cooling rate during the formation of large clumps of cold gas raises the heating rate to very high levels. No configuration of purely thermal feedback explored here achieved thermal stability or prevented a runaway collapse into a cold clump, in contrast to simulations that introduce feedback energy in the form of kinetic jets. A heating kernel for purely thermal AGN feedback that produces realistic CC clusters may still exist but would need to differ significantly from the kernels we tested. Such a heating kernel, functioning as an accurate and efficient proxy for more complex AGN feedback physics, would allow larger cosmological simulations without increasing resolution.

CHAPTER 3

MAGNETIZED DECAYING TURBULENCE IN THE WEAKLY COMPRESSIBLE TAYLOR-GREEN VORTEX

This chapter first appeared as the published paper Glines et al. (2021).
I include the original abstract as the introduction to this chapter.

3.1 Chapter Abstract

Magnetohydrodynamic turbulence affects both terrestrial and astrophysical plasmas. The properties of magnetized turbulence must be better understood to more accurately characterize these systems. This work presents ideal MHD simulations of the compressible Taylor-Green vortex under a range of initial subsonic Mach numbers and magnetic field strengths. We find that regardless of the initial field strength, the magnetic energy becomes dominant over the kinetic energy on all scales after at most several dynamical times. The spectral indices of the kinetic and magnetic energy spectra become shallower than k⁻⁵/³ over time and generally fluctuate. Using a shell-to-shell energy transfer analysis framework, we find that the magnetic fields facilitate a significant amount of the energy flux and that the kinetic energy cascade is suppressed. Moreover, we observe nonlocal energy transfer from the large scale kinetic energy to intermediate and small scale magnetic energy via magnetic tension. We conclude that even in intermittently or singularly driven weakly magnetized systems, the dynamical effects of magnetic fields cannot be neglected.

3.2 Introduction

Magnetized turbulence is present in many terrestrial and astrophysical plasmas. Turbulence in magnetohydrodynamics (MHD) has been studied extensively over recent decades, from experimental, theoretical, and numerical perspectives, as the field continues to work towards a full understanding of magnetized turbulent plasmas. However, much of the theoretical and numerical work focuses on continuously driven plasmas, where a continuous (although potentially stochastic) force adds energy to the plasma, resulting in stationary turbulence. In many natural systems, the turbulence can instead be intermittently driven by infrequently occurring events or set by the initial conditions. For example, in the circumgalactic medium (CGM), the hot diffuse gas surrounding galaxies, or in the intracluster medium (ICM), the plasma in galaxy clusters that accounts for the majority of their baryonic mass, turbulence can be introduced by various mechanisms. These include mergers with other galaxies, brief increases in the birth rate of stars, temporary outflows from jets driven by gas accreting onto supermassive black holes, supernovae, and many more transient events (Norman & Bryan, 1999; Larson, 1981; Britzen et al., 2017; Korpi et al., 1999). In pulsed power plasmas, such as in a z-pinch, the plasma is driven by a single initial event and then allowed to decay into turbulence as kinetic and magnetic energy in the plasma dissipate into heat (Rudakov & Sudan, 1997; Kroupp et al., 2018). Therefore, to bridge the gap between observed, intermittently driven turbulent systems and theories of stationary MHD turbulence, we can study the behavior of decaying magnetized turbulence in an idealized environment.

In decaying turbulence, the turbulent flow arises purely from the initial conditions in the absence of a continuous driving force that injects energy. Essentially, the driving force is a delta function forcing at the initialization of the flow. The absence of external forces can avoid some of the shortfalls of driven turbulence simulations.
As an example of these shortfalls, previous studies have shown that seemingly unimportant driving parameters, such as the autocorrelation time and normalization of the driving field, can bias plasma properties in turbulence simulations, in some cases affecting the scaling of the energy spectra (Grete et al., 2018). In addition, the driving forces contaminate the driven scales, making studies of turbulent plasma properties on those scales difficult to interpret. Simulations of decaying turbulence with fixed initial conditions avoid these issues since there are no driving forces.

The Taylor-Green (TG) vortex provides a useful set of smooth initial conditions that devolve into a turbulent flow. It was first proposed by Taylor & Green (1937) as an early mathematical exploration of the development of the turbulent cascade in a three dimensional hydrodynamic fluid. In the modern era, it is a canonical transition-to-turbulence problem also used for validation and verification of numerical schemes (Wang et al., 2013). From a physics point of view, the TG vortex has been explored from numerous angles, including numerical simulations of inviscid and viscous incompressible hydrodynamics with an emphasis on the development of small scale structures through vortex stretching (Brachet et al., 1983). Multiple configurations for TG vortices with magnetic fields were proposed in Lee et al. (2008) in order to study decaying turbulence in incompressible MHD. The new magnetic field configurations maintain all of the symmetries of the original hydrodynamic flow (Lee et al., 2008), and later works (Lee et al., 2010; Pouquet et al., 2010; Brachet et al., 2013) used these symmetries to save computational resources and allow more highly resolved simulations of the vortex. These simulations produced differing k⁻², k⁻⁵/³, and k⁻³/² spectra depending on the initial magnetic field, where the k⁻² spectrum was speculated to be due to weak turbulence. Later work by Dallas & Alexakis (2013a,b) investigated the mechanism behind the different spectra. They concluded that the k⁻² spectrum produced by one configuration of the magnetic field was due to magnetic discontinuities in the plasma and not weak turbulence as previously thought. In Dallas & Alexakis (2013c), perturbations added to the initial conditions led the symmetries of the TG vortex to break and the k⁻² spectrum to give way to a shallower k⁻⁵/³ spectrum. A similar problem using the hydrodynamic initial configuration of the TG vortex but with an Orszag-Tang magnetic field was studied in incompressible resistive MHD by Vahala et al. (2008), where a k⁻⁵/³ energy spectrum was found in their simulations. All of these studies are concerned with incompressible turbulence, whereas many astrophysical systems (such as the interstellar, circumgalactic, intracluster, and intergalactic media) consist of compressible magnetized plasmas.

To our knowledge, the formulation of the TG vortex from Lee et al. (2008) remains unexplored in the compressible MHD regime. Moreover, there have been recent advances in analytical tools to study the transfer of energy between reservoirs in compressible MHD (Yang et al., 2016; Grete et al., 2017). Energy transfer analysis enables measurement of the flux of energies between length scales within and between the kinetic, magnetic, and thermal energies of the plasma. In a compressible ideal MHD plasma, energy can be redistributed within the kinetic and within the magnetic energy budget via advection and compression.
Moreover, magnetic tension can facilitate energy transfer between kinetic and magnetic energies, as vortical motion in the turbulent plasma contributes to magnetic fields and magnetic fields constrain the motion of the plasma. In turbulent flow, intra-budget energy transfers via advection and compression typically manifest from a larger scale to a smaller but similar scale (i.e., "down scale-local"), defining the turbulent cascade. Inter-budget energy transfer via, e.g., magnetic tension complicates the picture of a turbulent cascade, as it moves energy between reservoirs and potentially allows for nonlocal transfer of energy from large scales directly to much smaller scales. Given the transient nature of the TG vortex, we expect the energy transfers to change over time as, e.g., the ratio of kinetic to magnetic energy evolves or increasingly small-scale structure develops. This is in contrast to stationary turbulence, where the dynamics remain constant over time in a statistical sense.

For these reasons, we focus on a detailed study of the dynamics in the magnetized, weakly compressible Taylor-Green vortex. To explore magnetized decaying turbulence in different regimes, we present nine simulations of the TG vortex probing all combinations of three different initial ratios of kinetic to magnetic energy (1, 10, and 100, corresponding to initial Alfvénic Mach numbers of M_A = {1, 3.2, 10}) and three different initial fluid velocities (initial root mean squared, or RMS, sonic Mach numbers of M_s,0 = {0.1, 0.2, 0.4}). Thus, we explore strongly and weakly magnetized, subsonic plasmas in which density perturbations are present but limited.

To summarize our results, we find that magnetic fields significantly influence the decaying turbulence in the plasma regardless of the initial field strength. In all cases, we find that at late times the magnetic dynamics dominate the kinetic dynamics even if the initial magnetic energy is 100 times smaller than the kinetic energy. Moreover, the spectral indices of the kinetic and magnetic energies are not fixed in time but evolve from steep ≃ k⁻² spectra at earlier times to shallower ≃ k⁻⁴/³ spectra at later times. Using the energy transfer analysis, we see that most energy transfer is dominated by magnetic field dynamics. This includes both energy flux from kinetic to magnetic energy via magnetic tension and the flux of energy within the magnetic energy budget via compression and advection. Overall, the kinetic energy cascade is effectively absent, and the initial sonic Mach number (M_s,0) only weakly affects the observed dynamics. We also see several transient phenomena during the transition to turbulence, including temporary inverse turbulent cascades in both the magnetic and kinetic energies and large nonlocal energy transfers, between scales separated by up to two orders of magnitude, from the kinetic to the magnetic energy.

We organize the paper as follows. In Section 3.3, we describe the simulation and analysis setup, including numerical methods, detailed Taylor-Green vortex initial conditions, and the energy transfer analysis. In Section 3.4, we present results of the simulations (focusing on M_s,0 = 0.2) such as the bulk properties of the plasma, the evolution of the energy spectra, and the transient behaviors seen through the energy transfer analysis as the turbulence develops.
In Section 3.5, we discuss our findings in the broader context of magnetized turbulence and astrophysical plasmas, and we conclude in Section 3.6 with a summary of our key findings. The online supplementary materials for this paper contain detailed plots of the results for all initial M_𝑠,0.

3.3 Method

3.3.1 MHD Equations and Numerical Method

The equations for a compressible ideal MHD plasma can be written as a hyperbolic system of conservation laws. In differential form the ideal MHD equations are

∂_𝑡 𝜌 + ∇ · (𝜌u) = 0
∂_𝑡 (𝜌u) + ∇ · (𝜌u ⊗ u − B ⊗ B) + ∇(𝑝 + B²/2) = 0
∂_𝑡 B − ∇ × (u × B) = 0
∂_𝑡 𝐸 + ∇ · [(𝐸 + 𝑝 + B²/2) u − (B · u) B] = 0

where 𝜌 is the density, u is the flow velocity, B is the magnetic field (which includes a factor of 1/√(4𝜋)), 𝑝 is the thermal pressure, and 𝐸 is the total energy density. We close the system of equations with the equation of state for an adiabatic ideal gas, 𝑝 = 𝜌(𝛾 − 1)𝑒, where 𝛾 is the ratio of specific heats and 𝑒 is the specific internal energy, related to the total energy density by

𝐸 = ½ 𝜌 u · u + ½ B · B + 𝜌𝑒 .

We use the open source astrophysical MHD code K-Athena (Grete et al., 2021a), which is a performance portable version of Athena++ (Stone et al., 2020a) using the Kokkos performance portability library (Carter Edwards et al., 2014). K-Athena uses an unsplit finite volume Godunov scheme to evolve the ideal MHD equations, originally presented and implemented in Athena (Stone & Gardiner, 2009). The method consists of a second-order van Leer predictor-corrector integrator with piecewise linear reconstruction (PLM), an HLLD Riemann solver, and constrained transport to preserve a divergence-free magnetic field.

3.3.2 Magnetized TG Vortex

The TG vortex was first proposed by Taylor & Green (1937) as a mathematical exploration of the development of hydrodynamic turbulence in 3D. The initial flow was made to be periodic and symmetric in order to accommodate simple approximations to a solution. There exist a number of different formulations. We follow the setup described in Wang et al. (2013) for the hydrodynamic variables and Lee et al. (2008) for the initial magnetic field configuration.

The simplest hydrodynamic setup of a TG vortex begins with a periodic velocity field in the xy-plane and periodic pressure and density fields with constant sound speed throughout the domain. Using a cubic periodic domain with side length 2𝜋𝐿, the initial fluid velocity is set to

𝑢_𝑥 = 𝑢_0 sin(𝑥/𝐿) cos(𝑦/𝐿) cos(𝑧/𝐿)
𝑢_𝑦 = −𝑢_0 cos(𝑥/𝐿) sin(𝑦/𝐿) cos(𝑧/𝐿)
𝑢_𝑧 = 0

where 𝑢_0 is the maximum initial velocity. Note that in this formulation the initial flow velocity is confined to the xy-plane. The initial pressure and density are set to

𝑃 = 𝑃_0 + (𝜌_0 𝑢_0² / 16) [cos(2𝑥/𝐿) + cos(2𝑦/𝐿)] [cos(2𝑧/𝐿) + 2]
𝜌 = 𝜌_0 𝑃 / 𝑃_0

so that 𝑃 and 𝜌 are proportional to each other. This means that the sound speed

𝑐_𝑠 = √(𝛾𝑃/𝜌) = √(𝛾𝑃_0/𝜌_0)

is initially constant throughout the domain. The RMS of the initial Mach number is related to 𝑢_0 by

M_𝑠,0 = 𝑢_0 / (2𝑐_𝑠) .

For simplicity, we set 𝑃_0 = 1 and 𝜌_0 = 1. We assume the fluid is a monatomic ideal gas with an adiabatic index 𝛾 = 5/3. The resulting total initial kinetic energy is

𝐸_𝑈,0 = 𝜌_0 𝑢_0² (𝜋𝐿)³ . (3.1)

Magnetic fields were first added to the TG vortex in Lee et al. (2008) with the express constraint of preserving the same symmetries as the hydrodynamic flow. Here, we follow the proposed insulating configuration so that currents are confined to 𝜋𝐿 boxes, e.g., the cube [0, 𝜋𝐿]³ forms an insulating box.
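For concreteness, the hydrodynamic part of this setup translates directly into code. The following standalone C++ sketch (not taken from K-Athena; the grid size here is reduced purely for illustration) fills a uniform grid covering the [−0.5, 0.5]³ domain and 𝐿 = 1/(2𝜋) adopted later in this section with the velocity, pressure, and density fields defined above. The magnetic field, specified next, is instead initialized from a vector potential.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // Parameters follow Section 3.3.2 (P0 = rho0 = 1, gamma = 5/3, Ms0 = 0.2);
  // the grid size is reduced for illustration (the production runs use 1,024^3 cells).
  const double PI = std::acos(-1.0);
  const int N = 64;
  const double L = 1.0 / (2.0 * PI);                   // periodic box side 2*pi*L = 1
  const double P0 = 1.0, rho0 = 1.0, gamma_ad = 5.0 / 3.0;
  const double Ms0 = 0.2;                              // initial RMS sonic Mach number
  const double cs0 = std::sqrt(gamma_ad * P0 / rho0);  // constant initial sound speed
  const double u0 = 2.0 * Ms0 * cs0;                   // from Ms0 = u0 / (2 c_s)

  const double dx = 2.0 * PI * L / N;
  const std::size_t ncell = static_cast<std::size_t>(N) * N * N;
  std::vector<double> ux(ncell), uy(ncell), P(ncell), rho(ncell);

  for (int k = 0; k < N; ++k)
    for (int j = 0; j < N; ++j)
      for (int i = 0; i < N; ++i) {
        // Cell-centered coordinates in the [-0.5, 0.5]^3 domain
        const double x = -0.5 + (i + 0.5) * dx;
        const double y = -0.5 + (j + 0.5) * dx;
        const double z = -0.5 + (k + 0.5) * dx;
        const std::size_t idx = (static_cast<std::size_t>(k) * N + j) * N + i;

        // Taylor-Green velocity field; u_z = 0 initially
        ux[idx] = u0 * std::sin(x / L) * std::cos(y / L) * std::cos(z / L);
        uy[idx] = -u0 * std::cos(x / L) * std::sin(y / L) * std::cos(z / L);

        // Pressure and density proportional to each other (constant sound speed)
        P[idx] = P0 + rho0 * u0 * u0 / 16.0 *
                          (std::cos(2.0 * x / L) + std::cos(2.0 * y / L)) *
                          (std::cos(2.0 * z / L) + 2.0);
        rho[idx] = rho0 * P[idx] / P0;
      }

  std::printf("u0 = %.4f, c_s = %.4f, cells = %d^3\n", u0, cs0, N);
  return 0;
}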
The corresponding initial magnetic field is given by

𝐵_𝑥 = 𝐵_0 cos(𝑥/𝐿) sin(𝑦/𝐿) sin(𝑧/𝐿)
𝐵_𝑦 = 𝐵_0 sin(𝑥/𝐿) cos(𝑦/𝐿) sin(𝑧/𝐿)
𝐵_𝑧 = −2𝐵_0 sin(𝑥/𝐿) sin(𝑦/𝐿) cos(𝑧/𝐿)

where 𝐵_0 is the initial magnetic field strength. In practice, we initialize the magnetic field from the magnetic vector potential A,

𝐴_𝑥 = −𝐵_0 𝐿 sin(𝑥/𝐿) cos(𝑦/𝐿) cos(𝑧/𝐿)
𝐴_𝑦 = 𝐵_0 𝐿 cos(𝑥/𝐿) sin(𝑦/𝐿) cos(𝑧/𝐿)
𝐴_𝑧 = 0

using B = ∇ × A. This guarantees ∇ · B = 0 to machine precision in the initial conditions, which is then preserved by the constrained transport algorithm throughout the simulation. The total initial magnetic energy is

𝐸_𝐵,0 = 3𝐵_0² (𝜋𝐿)³ (3.2)

so that the initial ratio of kinetic to magnetic energy is

𝐸_𝑈,0 / 𝐸_𝐵,0 = 𝜌_0 𝑢_0² / (3𝐵_0²) . (3.3)

Since the magnetic field is zero in some regions of the domain, the Alfvénic Mach number M_𝐴 = 𝑢√𝜌/𝐵 is also undefined in some regions. For this reason, we use a proxy for the Alfvénic Mach number based on the mean energies,

M_𝐴 := √(⟨𝐸_𝑈⟩/⟨𝐸_𝐵⟩) , (3.4)

throughout the rest of the paper. We also adopt a similar proxy for the plasma 𝛽 (the ratio of thermal to magnetic pressure),

𝛽 := (2/𝛾) M_𝐴² / M_𝑠² , (3.5)

where M_𝑠 is the RMS sonic Mach number.

The hydrodynamic and magnetic initial conditions exhibit a number of symmetries that are maintained throughout the simulation. In each of the three dimensions there are two planes across which the fluid is antisymmetric. For our setup, these are planes through 𝑥 = 0 and 𝑥 = 𝜋𝐿; planes through 𝑦 = 0 and 𝑦 = 𝜋𝐿; and planes through 𝑧 = 0 and 𝑧 = 𝜋𝐿. Additionally, the flow is rotationally symmetric through a rotation of 𝜋 around the two axes 𝑥 = 𝑧 = 𝜋𝐿/2 and 𝑦 = 𝑧 = 𝜋𝐿/2, and rotationally symmetric through a rotation of 𝜋/2 around the axis 𝑥 = 𝑦 = 𝜋𝐿/2. These symmetries are more thoroughly explored in Lee et al. (2008).

We explore the transition to magnetized turbulence and the subsequent decay in different regimes with our simulation suite of TG vortices and focus on two parameters: the initial RMS Mach number, using M_𝑠,0 = {0.1, 0.2, 0.4}, and the initial ratio of kinetic to magnetic energy, using 𝐸_𝑈,0/𝐸_𝐵,0 = {1, 10, 100}, or alternatively, the initial RMS Alfvénic Mach number M_𝐴,0 = {1, 3.2, 10}. We simulate all nine combinations of the three values of these two parameters. Throughout the rest of the text, we use MsX to refer to simulations with M_𝑠,0 = 𝑋 and MaY to refer to simulations with M_𝐴,0 = 𝑌. The initial magnetic field amplitude 𝐵_0 is obtained from Equation 3.3 given specific values of M_𝑠,0 and M_𝐴,0.

All simulations employ a cubic [−0.5, 0.5]³ domain with periodic boundaries, with 𝐿 = 1/(2𝜋) to be consistent with the definition of the initial conditions presented above. We use a uniform Cartesian grid with 1,024³ cells. The characteristic length scale of the initial vortices is 𝜋𝐿, so we define

𝑇 = 𝜋𝐿/𝑢_0

as the dynamical time. (Note that other works such as Wang et al. (2013) and Pouquet et al. (2010) use a nondimensional time, 𝑡* = 𝐿/𝑢_0, in contrast to the dynamical time used here.) In order to evolve the simulations for sufficient time to allow a turbulent flow to form and evolve, we run each simulation for ≈ 6 dynamical times. In our results, we present all measurements of time in terms of the dynamical time 𝑇 and all measurements of wavenumber in terms of 1/𝐿. Unless otherwise noted, all other results are in terms of simulation units.

3.3.3 Energy Transfer Analysis

In order to probe the movement of energy between different energy reservoirs, we use the shell-to-shell energy transfer analysis from Grete et al. (2017), which extends the framework presented in Alexakis et al. (2005) to the compressible regime.
The total transfer of energy from some shell 𝑄 in energy reservoir 𝑋 to some shell 𝐾 in reservoir 𝑌 is denoted by

T_𝑋𝑌(𝑄, 𝐾) with 𝑋, 𝑌 ∈ [𝑈, 𝐵] (3.6)

where we use 𝑈 and 𝐵 to denote the kinetic and magnetic energy reservoirs, respectively. In this work we focus on the energy transfer within the kinetic and magnetic energy reservoirs via advection and compression, which are, respectively,

T_𝑈𝑈(𝑄, 𝐾) = −∫ w_𝐾 · (u · ∇) w_𝑄 𝑑x − ½ ∫ w_𝐾 · w_𝑄 (∇ · u) 𝑑x
T_𝐵𝐵(𝑄, 𝐾) = −∫ B_𝐾 · (u · ∇) B_𝑄 𝑑x − ½ ∫ B_𝐾 · B_𝑄 (∇ · u) 𝑑x

and the energy transferred from kinetic energy to magnetic energy via magnetic tension (and vice versa), given by

T_𝑈𝐵𝑇(𝑄, 𝐾) = ∫ B_𝐾 · ∇ · (v_𝐴 ⊗ w_𝑄) 𝑑x (3.7)
T_𝐵𝑈𝑇(𝑄, 𝐾) = ∫ w_𝐾 · (v_𝐴 · ∇) B_𝑄 𝑑x . (3.8)

Here we use the mass-weighted velocity w = √𝜌 u so that the spectral energy density is positive definite (Kida & Orszag, 1990), and v_𝐴 is the Alfvén velocity. The velocity w_𝐾 and magnetic field B_𝐾 in a shell 𝐾 (or 𝑄) are obtained using a sharp spectral filter in Fourier space. The shell bounds are logarithmically spaced and given by 1 and 2^(𝑛/4+2) for 𝑛 ∈ {−1, 0, 1, . . . , 32}. Shells (uppercase, e.g., 𝐾) and wavenumbers (lowercase, e.g., 𝑘) obey a direct mapping, i.e., 𝐾 = 24 corresponds to the logarithmic shell that contains 𝑘 = 24, i.e., 𝑘 ∈ (22.6, 26.9].

3.4 Results

In this section we present results of the Taylor-Green vortices we simulated, showing bulk properties of the fluid (Section 3.4.1), including the evolution of the different energy spectra. These results demonstrate that the kinetic, magnetic, and thermal energy reservoirs interact with each other in a manner that depends significantly on the initial strength of the magnetic field. The energy spectra evolve toward a turbulent cascade over 1-2 dynamical times and then remain there for the remainder of the simulation. In Section 3.4.2, we examine in detail the transfer of energy between different energy reservoirs, including the transient behaviors we observed in the simulations. We see robust transfer of energy at all scales within the kinetic and magnetic energy reservoirs when examined separately, as well as complex and time-varying nonlocal transfer of energy between the kinetic and magnetic energy reservoirs, including evidence for an intermittent inverse turbulent cascade. Since the initial Mach number had much less of an effect on the results compared to the initial ratio of kinetic to magnetic energy, we focus on results using only the three Ms0.2 simulations as reference. We provide more complete plots of all nine simulations spanning all Mach numbers in the online supplements.

Starting with a visual demonstration of the TG vortex, Figure 3.1 shows the sonic Mach number and magnetic pressure from the Ms0.2_Ma10 simulation after 0.77 dynamical times and after 5.16 dynamical times in a slice in the 𝑥𝑦-plane through the origin. Only one quadrant of the 𝑥𝑦-plane is shown, as the flow exhibits symmetry across the four quadrants in the 𝑥𝑦-plane. From the slice plot, we can see that the TG vortex begins as a smooth vortical flow and magnetic field. After several dynamical times, the smooth flow devolves into a chaotic magnetized turbulent flow. Kinetic and magnetic structures at all scales persist throughout the simulation, as will be shown in the energy spectra later in this work.
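To make the shell decomposition of Section 3.3.3 concrete before turning to the bulk properties, the short sketch below (an illustrative reimplementation, not the analysis code used for this work) builds the logarithmic shell bounds 1 and 2^(𝑛/4+2) and locates the shell containing a given wavenumber; it reproduces, for example, the bounds (22.6, 26.9] of the shell containing 𝑘 = 24 quoted above.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // Logarithmically spaced shell bounds: 1 and 2^(n/4 + 2) for n = -1, 0, ..., 32
  std::vector<double> bounds;
  bounds.push_back(1.0);
  for (int n = -1; n <= 32; ++n) bounds.push_back(std::pow(2.0, n / 4.0 + 2.0));

  // A wavenumber k belongs to the shell whose bounds satisfy bounds[i-1] < k <= bounds[i].
  auto shell_of = [&](double kval) {
    for (std::size_t i = 1; i < bounds.size(); ++i)
      if (kval > bounds[i - 1] && kval <= bounds[i]) return static_cast<int>(i);
    return -1;  // outside the binned range
  };

  // Shells are labeled by the wavenumbers they contain; here we just report the bounds.
  const int s = shell_of(24.0);
  std::printf("k = 24 lies in the shell with bounds (%.1f, %.1f]\n",
              bounds[s - 1], bounds[s]);
  return 0;
}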
3.4.1 Bulk Properties

3.4.1.1 Evolution of energy reservoirs

Figure 3.2 shows the total kinetic, magnetic, and thermal energies and the dimensionless RMS sonic Mach number M_𝑠, Alfvénic Mach number M_𝐴, and plasma beta 𝛽 of the Ms0.2 simulations as a function of time. In this figure, we can see that in all simulations kinetic and magnetic energy convert into thermal energy over time. This decay into thermal energy is not immediate; rather, it requires at least one dynamical time to begin (i.e., it is observed to occur at a minimum of 𝑡 = 1𝑇 in all simulations). In the Ma1 simulations, due to the initial conditions there is even a small transient transfer of thermal energy into kinetic and magnetic energies. After 𝑡 = 2𝑇, all simulations dissipate kinetic and magnetic energy into thermal energy. The sonic Mach number generally decreases by less than a factor of 4 over time from its initial 0.2 value, and 𝛽 remains high (from ≳ 20 for Ms0.2_Ma1 to ≳ 100 for Ms0.2_Ma10) throughout the simulations. In all cases, the flow becomes dominated by magnetic energy (i.e., becomes sub-Alfvénic with M_𝐴 < 1) at different dynamical times depending on the initial ratio of kinetic to magnetic energy and mostly independent of the initial Mach number. In other words, even for the simulations with initially 100 times more kinetic than magnetic energy (Ma10), in the final state the magnetic energy dominates over the kinetic energy. This already highlights the importance of kinetic to magnetic energy transfer.

The initial growth of magnetic energy is characteristic of the insulating magnetic field configuration and is seen in other works on the TG vortex (Lee et al., 2010). This behavior of the magnetic field is likely due to the magnetic fields and vorticity beginning parallel to each other everywhere. All simulations experience a peak in the magnetic energy evolution before 𝑡 = 3𝑇, at a time depending on the initial magnetic energy. At 𝑡 = 6𝑇, all simulations are still losing total kinetic and magnetic energy to thermal energy, although the rate of energy dissipation is slowing by the simulation end. The magnetic and kinetic energies also become similar in magnitude, cf. M_𝐴 ≃ 1.

The Ms0.2_Ma1 simulation displays notably different behavior than those where the kinetic energy initially dominates. In particular, we observe periodic exchanges of energy between these two reservoirs before the bulk of the energy is converted into heat, rather than a smooth transfer of energy from the kinetic to the magnetic reservoir followed by a decline of both as the flow thermalizes. At approximately 𝑡 = 1𝑇, more than five times as much energy is stored in the magnetic reservoir as compared to the kinetic reservoir, which is in stark contrast with the other calculations. These results suggest that the large initial magnetic field facilitates a more rapid transfer of kinetic energy, which will be examined in more detail later in this paper. For reference, we also plot the temporal evolution of the energies in the incompressible, magnetized Taylor-Green vortex with Ma=1 presented in Pouquet et al. (2010) in the top left panel of Fig. 3.2 next to our Ms0.2_Ma1 results. The evolution in Pouquet et al. (2010) covers the first oscillation and is in good agreement with our simulation. Finally, the oscillations observed in the energy reservoirs for the Ma1 simulations in general have a period that depends on the initial Mach number, which can be seen in the figures that we leave for the online supplements.
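The dimensionless diagnostics plotted in Figure 3.2 follow directly from the proxies of Equations (3.4) and (3.5). As a minimal illustration (the mean energies and Mach number below are made-up numbers, not simulation output):

#include <cmath>
#include <cstdio>

int main() {
  // Illustrative volume-averaged quantities (placeholder values):
  const double mean_EU = 0.02;   // mean kinetic energy
  const double mean_EB = 0.01;   // mean magnetic energy
  const double Ms = 0.15;        // RMS sonic Mach number
  const double gamma_ad = 5.0 / 3.0;

  // Eq. (3.4): Alfvenic Mach number proxy based on mean energies
  const double Ma = std::sqrt(mean_EU / mean_EB);
  // Eq. (3.5): plasma beta proxy
  const double beta = 2.0 / gamma_ad * Ma * Ma / (Ms * Ms);

  std::printf("M_A = %.3f, beta = %.1f\n", Ma, beta);
  return 0;
}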
3.4.1.2 Energy Spectra

Figure 3.3 shows the temporal evolution of the kinetic and magnetic energy spectra of the three Ms0.2 simulations, compensated by 𝑘^4/3, which demonstrates how both the kinetic and magnetic energy spectra change from the smooth initial large scale flow to fully developed turbulence. The top row shows the three simulations earlier in the evolution (𝑡 = 0.77𝑇), when the spectra are still steep with large scale structure from the initial conditions. In the case of the strongest initial magnetization (Ma1), the magnetic energy is larger than the kinetic energy on all scales and their spectral scaling is comparable. For Ma3.2 and Ma10 the kinetic energy spectrum is steeper than the magnetic one. The spectra cross at 𝑘 ≃ 7 and 𝑘 ≃ 20, respectively, so that the kinetic energy is still dominant on large scales. The middle row in Figure 3.3 shows intermediate times, with Ms0.2_Ma1 at 𝑡 = 1.29𝑇, which is the time discussed in Section 3.4.2.2, and the Ms0.2_Ma3.2 and Ms0.2_Ma10 simulations at 𝑡 = 1.81𝑇, which is the time discussed in Section 3.4.2.1. Note that the spectra are still evolving at this intermediate stage. In the Ms0.2_Ma10 simulation at 𝑡 = 1.81𝑇, the magnetic spectrum has reached a 𝑘^−4/3 scaling while the kinetic spectrum shows a broken power law with excess energy at larger length scales. In both Ma1 and Ma3.2 the magnetic energy is now dominant on effectively all scales (with the exception of the noisy part of the spectrum at the largest scales, 𝑘 ≲ 4). The bottom row shows all three Ms0.2 simulations at 𝑡 = 5.16𝑇. Here, the magnetic energy is effectively dominant on all scales in all simulations and the kinetic and magnetic spectra exhibit a scaling close to 𝑘^−4/3. The spectral indices still fluctuate, which we explore in Section 3.4.1.3.

In Figure 3.4 we show the kinetic and magnetic energy at specific wavenumbers, compensated by 𝑘^4/3, plotted over time. At early times (before 𝑡 = 2𝑇) the large scale (𝑘 = 8) kinetic energy shows the fastest growth rate compared to smaller scales, as expected from an initially entirely large scale configuration. The kinetic energy at 𝑘 = 8 peaks between 𝑡 = 1𝑇 and 𝑡 = 2𝑇, with a larger initial magnetic field leading to an earlier peak. The magnetic energy at 𝑘 = 8 in the Ms0.2_Ma1 simulation oscillates throughout the duration of the simulation, with the kinetic energy oscillating once. No oscillatory behavior is observed in Ms0.2_Ma3.2 and Ms0.2_Ma10 for these quantities. From this plot we can also see that the small scale (𝑘 = 128) energies saturate at 𝑡 ≃ 1𝑇, 𝑡 ≃ 1.5𝑇, and 𝑡 ≃ 2.5𝑇, respectively.

3.4.1.3 Spectral Index

We measured the spectral indices 𝛼 of the kinetic and magnetic energy spectra by fitting a power law 𝐸(𝑘) ∝ 𝑘^𝛼 to the energy spectra of each reservoir at each time step. For the inertial range of wavenumbers across which we fit the power law to the spectra, we used wavenumbers 𝑘 = 10 to 𝑘 = 32. We chose this inertial range because large scale structure from the initial flow still persists below 𝑘 = 10 and wavenumbers above 𝑘 = 32 are not entirely free of numerical dissipation. The kinetic and magnetic spectral indices measured across the inertial range are not fixed in time across the different simulations, with most of the variation being due to the initial magnetic energy. Figure 3.5 shows the spectral indices of the kinetic, magnetic, and sum of kinetic and magnetic energy spectra over time for the Ms0.2 simulations.
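The fitting procedure just described amounts to a least-squares fit of log 𝐸 versus log 𝑘 over the inertial range 𝑘 ∈ [10, 32]. The sketch below illustrates the procedure using a synthetic 𝑘^−5/3 spectrum as a sanity check rather than simulation data (an illustrative sketch, not the analysis code used for this work).

#include <cmath>
#include <cstdio>
#include <vector>

// Least-squares fit of log E = alpha * log k + const over a wavenumber range;
// the slope alpha is the spectral index.
double fit_spectral_index(const std::vector<double>& k, const std::vector<double>& E,
                          double kmin, double kmax) {
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  int n = 0;
  for (std::size_t i = 0; i < k.size(); ++i) {
    if (k[i] < kmin || k[i] > kmax) continue;
    const double x = std::log(k[i]), y = std::log(E[i]);
    sx += x; sy += y; sxx += x * x; sxy += x * y;
    ++n;
  }
  return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}

int main() {
  // Synthetic spectrum with a known -5/3 slope.
  std::vector<double> k, E;
  for (int i = 1; i <= 128; ++i) {
    k.push_back(i);
    E.push_back(std::pow(static_cast<double>(i), -5.0 / 3.0));
  }
  std::printf("fitted alpha = %.3f (expect -1.667)\n",
              fit_spectral_index(k, E, 10.0, 32.0));
  return 0;
}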
In all simulations, the spectral index evolves over time, decaying from the initial steep spectral index (𝛼 ≲ −2) as energy is transferred to small scales. The kinetic and magnetic spectral indices evolve separately in the calculations until the magnetic energy exceeds the kinetic energy, after which the spectral indices of the separate and combined reservoirs fluctuate within Δ𝛼 ≃ 0.2. The crossover of kinetic and magnetic energies happens immediately in the Ms0.2_Ma1 simulation, early in the Ms0.2_Ma3.2 simulation before 𝑡 = 2𝑇, and later in the Ms0.2_Ma10 simulation at 𝑡 ≃ 4𝑇. After the kinetic and magnetic spectral indices reach rough parity and the magnetic field becomes dominant, both spectral indices reach comparable values and become roughly constant 1-2 dynamical times later, although they continue to vary over time. Since the magnetic fields in the Ma1 simulations immediately become dominant, the spectral indices reach a rough constant at 𝑡 ≃ 2𝑇, while in the Ma3.2 simulations they reach a rough constant at 𝑡 ≃ 4𝑇 and in the Ma10 simulations this happens at 𝑡 ≃ 5𝑇.

The Ms0.2_Ma3.2 simulation experiences a brief peak in the spectral index around 𝑡 ≃ 1.5𝑇 while the flow is still in transition. This is also reflected in the large uncertainty of the spectral index during that time, e.g., the index of the kinetic energy spectrum varies between −1 and −2.25 when choosing slightly different fitting ranges (as indicated by the shaded blue bands in Fig. 3.5). Note that in the Ma10 case, the magnetic spectrum flattens and the spectral index reaches a roughly constant value much sooner than in the other two cases, at 𝑡 ≃ 2𝑇 when the kinetic energy still dominates. Later on in the Ma10 simulations, the kinetic spectral index becomes comparable to the magnetic spectral index. For the high initial magnetic field simulations, the spectral index levels out at about 𝛼 ≃ −5/3, while the initially kinetically dominated simulations level out at 𝛼 ≃ −4/3. The final spectral indices depend on the initial ratio of kinetic to magnetic energy, with more magnetic energy leading to steeper magnetic spectra. The Ma1 simulations end with 𝛼 ≃ −1.7 (close to −5/3), Ma3.2 ends with 𝛼 ≃ −1.3 (close to −4/3), and Ma10 ends with slightly shallower values of 𝛼 ≃ −1.2. In the presence of the stronger magnetic fields in the Ma1 simulations, the flattening of the spectra seems to be suppressed.

Before the kinetic and magnetic spectral indices become comparable in each simulation, there is also greater variance in the spectral slope when measured using different inertial ranges. This indicates that a power law might be a poor fit for the spectra at those early times, showing that the spectra are not fully developed until the magnetic energy is dominant. For example, as seen in Figure 3.3, the kinetic energy spectrum appears as a broken power law at intermediate times, which is especially evident in the Ms0.2_Ma10 simulation at 𝑡 = 1.81𝑇 and, to a lesser extent, in the Ms0.2_Ma1 simulation at 𝑡 = 1.29𝑇 and the Ms0.2_Ma3.2 simulation at 𝑡 = 1.81𝑇. Oscillations in the spectral index of the Ma1 simulations also appear, whose period seems to be linked to the initial Mach number, with larger Mach numbers leading to a smaller period of oscillation. We note that between the three values of M_𝐴, the simulations shown here exhibit a wide variety of behaviors, highlighted by the spectral indices in Fig. 3.5.
More simulations with intermediate values of M_𝐴 would be required to determine if the transition between these behaviors is smooth or abrupt.

3.4.2 Energy Transfer

While the total energy and spectra of the kinetic and magnetic reservoirs can broadly describe the isolated behavior of the different energy reservoirs, examining the energy transfer within and between reservoirs using the analysis described in Section 3.3.3 can provide deeper insights into the physical phenomena, including demonstrating the mechanisms that are responsible for the transfer of energy. The shell-to-shell energy transfer fluxes examined in this section show the flux from wavenumber 𝑄 to wavenumber 𝐾 within and between energy reservoirs via different pathways.

Figure 3.6 shows the energy transfer within the kinetic (left) and magnetic (right) energy reservoirs via advection and compression in the Ms0.2_Ma1 simulation at 𝑡 = 0.77𝑇 (top) and at 𝑡 = 5.16𝑇 (bottom). This plot encapsulates the energy transfer of a turbulent cascade. Near the beginning of the simulation, in the top panels, most of the energy is in large scale modes, with energy moving from larger scale 𝑄 modes to smaller scale 𝐾 modes. Note that the energy transfer is constrained to the diagonal because the bulk of the energy transfer is local, occurring between comparable scales 𝑄 and 𝐾. White space fills the off-diagonals because very little nonlocal energy transfer occurs internally within reservoirs. The energy transfer shown in this figure is solely within the kinetic and magnetic reservoirs – there is no energy transfer shown between these reservoirs (although it is occurring, as will be discussed below). In the simulation shown here, the magnetic energy transfer is larger in magnitude than the kinetic energy transfer. In all simulations, the magnetic energy transfer extends to higher wavenumbers more rapidly than the kinetic energy transfer. After the flow has decayed into turbulence (as shown in the bottom panels), energy transfer to smaller local scales happens across the resolved modes down to the numerical dissipation scales. At large wavenumbers (𝑄 > 16), the energy transfers are scale-local and of comparable magnitude. This phenomenon continues to at least 𝑄 ≃ 200 in both the kinetic and magnetic energy transfer – i.e., to much larger wavenumbers than an inertial range is observed (see, e.g., Figure 3.3). Thus, the effective (numerical) viscosity and resistivity are not affecting the turbulent cascade encoded by these transfers to a significant degree.

Figure 3.7 shows the energy transfer within the kinetic (top) and magnetic (bottom) energy reservoirs in the Ms0.2_Ma1 simulation at 𝑡 = 1.29𝑇 (just before the magnetic energy peaks). Energy transfer within the kinetic and magnetic reservoirs briefly reverses direction and moves energy from smaller local scales to larger local scales (note the purple color indicating energy loss above the diagonal and the orange color below the diagonal, in contrast to Fig. 3.6). This constitutes a transient inverse cascade. Additionally, the inverse cascade is present throughout most scales of the magnetic energy (𝐾, 𝑄 ≲ 100) but only apparent at large scales in the kinetic energy (𝐾, 𝑄 ≲ 16). As seen in Figure 3.4, at this early time the turbulent flow is just beginning to saturate the smallest scales while the large scale energy oscillates, so the energy transfer inversion lasts less than a dynamical time (see Section 3.4.2.2 for further exploration of the duration).
Figure 3.8 shows the energy transfer from the kinetic to the magnetic energy reservoir due to magnetic tension at 𝑡 = 1.81𝑇 in the Ms0.2_Ma10 simulation. This figure displays nonlocal transfer from kinetic to magnetic energy. Unlike the advection- and compression-driven transfers within the magnetic and kinetic energy reservoirs, energy transfers from the kinetic to the magnetic reservoir via tension can support nonlocal energy transfers. The nonlocal transfer happens from large kinetic scales to much smaller magnetic scales, spanning more than an order of magnitude downward in spatial scale from the largest kinetic modes. The nonlocal energy transfer between kinetic and magnetic energy was significant in simulations with lower initial magnetic energy, and especially in the Ma10 simulations, where the magnetic field is dynamically unimportant at early times. Kinetic energy moves significant energy to all magnetic scales from early times at 𝑡 ≃ 1.5𝑇 to intermediate times at 𝑡 ≃ 4𝑇 in these simulations, although some energy continues to flow via this mechanism at later times. Additionally, since the transfer of energy via tension is between two different reservoirs, energy can also be transferred at equivalent scales from one reservoir to the other. This is shown as non-zero transfer along the diagonal of the plot.

3.4.2.1 Nonlocal Energy Transfer

As in some driven turbulence simulations (Alexakis et al., 2005; Grete et al., 2017), these decaying turbulence simulations also demonstrate significant nonlocal energy transfer between the kinetic and magnetic energy reservoirs. Unlike in driven simulations, the energy transfers in this work are solely due to the fluid flow and not due to externally-applied driving forces. Figure 3.9 shows the total local, nonlocal, and equivalent-scale energy transfers via magnetic tension in the Ms0.2 simulations over time. We obtain these quantities by integrating the transfer functions over different sets of scales:

Nonlocal lower: Σ_𝑄 Σ_{𝐾 ∈ [1, 2^−ℓ 𝑄)} T_𝑋𝑌(𝑄, 𝐾)
Local lower: Σ_𝑄 Σ_{𝐾 ∈ [2^−ℓ 𝑄, 𝑄)} T_𝑋𝑌(𝑄, 𝐾)
Equivalent: Σ_𝑄 Σ_{𝐾 = 𝑄} T_𝑋𝑌(𝑄, 𝐾)
Local higher: Σ_𝑄 Σ_{𝐾 ∈ (𝑄, 2^ℓ 𝑄]} T_𝑋𝑌(𝑄, 𝐾)
Nonlocal higher: Σ_𝑄 Σ_{𝐾 ∈ (2^ℓ 𝑄, ∞)} T_𝑋𝑌(𝑄, 𝐾)

where ℓ is a parameter differentiating local from nonlocal separations of wavenumbers in log space. In Figure 3.9, we show the analysis using ℓ = 5/4 with a solid line, which corresponds to 5 logarithmic bins above or below 𝑄 (see Section 3.3.3 for the description of the binning), and show the extent of the fluxes if ℓ = 5/4 ± 1/4 is used as shaded regions.

As seen from the red line in this figure, the nonlocal energy transfer from large scale kinetic modes to small scale magnetic modes (“downscale” transfer) is present in all simulations but is only dominant when the initial kinetic energy exceeds the initial magnetic energy – this nonlocal energy transfer is more significant in the Ma3.2 and Ma10 simulations. The nonlocal downscale energy transfer (red line) peaks at a time that depends on the initial magnetic field, and in all cases before the total magnetic energy peaks. The nonlocal transfer helps fill out the magnetic energy spectrum faster than the kinetic energy spectrum, especially in the Ma10 simulations, which is consistent with the spectra shown in Figure 3.3 and the turbulent cascades shown in the shell-to-shell energy transfer in Figure 3.6. By the time the magnetic energy has exceeded the kinetic energy in the Ma3.2 and Ma10 simulations, the nonlocal energy transfer is largely diminished due to the lack of kinetic energy to feed the transfer.
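The five curves in Figure 3.9 correspond to the five categories defined above. As a concrete illustration of the binning (an illustrative sketch, not the analysis code used here), the following classifies and accumulates the entries of a shell-to-shell transfer matrix, treating shells as labeled by representative wavenumbers so that ℓ logarithmic bins correspond to a factor of 2^ℓ in wavenumber; the transfer values themselves are placeholders.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // Shell labels (representative wavenumbers) and a toy transfer matrix T[Q][K].
  // In the actual analysis T would come from the spectrally filtered simulation fields.
  std::vector<double> shells;
  for (int n = 0; n <= 20; ++n) shells.push_back(4.0 * std::pow(2.0, n / 4.0));
  const std::size_t N = shells.size();
  std::vector<std::vector<double>> T(N, std::vector<double>(N, 1e-3));  // placeholder values

  const double ell = 5.0 / 4.0;  // "local" = within 5 logarithmic bins, i.e. a factor 2^(5/4)
  double nonlocal_lo = 0, local_lo = 0, equiv = 0, local_hi = 0, nonlocal_hi = 0;

  for (std::size_t q = 0; q < N; ++q) {
    for (std::size_t k = 0; k < N; ++k) {
      const double Q = shells[q], K = shells[k], t = T[q][k];
      if (k == q)                           equiv       += t;  // K = Q
      else if (K < std::pow(2.0, -ell) * Q) nonlocal_lo += t;  // K in [1, 2^-l Q)
      else if (K < Q)                       local_lo    += t;  // K in [2^-l Q, Q)
      else if (K <= std::pow(2.0, ell) * Q) local_hi    += t;  // K in (Q, 2^l Q]
      else                                  nonlocal_hi += t;  // K in (2^l Q, inf)
    }
  }

  std::printf("nonlocal lower %g, local lower %g, equivalent %g, "
              "local higher %g, nonlocal higher %g\n",
              nonlocal_lo, local_lo, equiv, local_hi, nonlocal_hi);
  return 0;
}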
Local downscale energy transfer (orange line) depends more strongly on the initial magnetic field, with local transfer to smaller scales reaching double the nonlocal transfer in the Ma1 simulations and being less than half of it in the other cases. Local upscale energy transfer (blue line) is positive for some early times in the Ma1 and Ma3.2 simulations. The Ma1 simulations also display two different oscillatory behaviors, with a low frequency oscillation in the local energy transfer and a high frequency oscillation clearly visible in the equivalent-scale energy transfer but also present in the local and nonlocal downscale transfer.

3.4.2.2 Inverted Turbulent Cascades

At early times during the evolution of the Ma1 simulations, a temporary inverse cascade forms within the kinetic and magnetic energy reservoirs, where small scale energy transfers to larger spatial scales. Figure 3.10 shows the local and nonlocal energy transfers within the kinetic and magnetic energies to both smaller and larger length scales. In the Ma1 simulations, the local energy transfer from larger to smaller length scales temporarily reverses into an inverse cascade in both the kinetic and magnetic energy reservoirs shortly after peak magnetic energy is reached. The inversion appears for all three sonic Mach numbers simulated, with the longest inversion appearing in the Ms0.1_Ma1 simulation for ≃ 1𝑇 and the shortest in the Ms0.4_Ma1 simulation for ≃ 0.5𝑇. For the Ms0.1_Ma1 simulation, the kinetic energy reservoir briefly reverses to the normal configuration, moving energy from large scales to smaller scales while the magnetic energy is in an inverted cascade, before returning to the inverted cascade, lingering longer than the magnetic field in the inverted state, and finally transitioning into a turbulent cascade for the rest of the simulation. As seen in Figure 3.7, the movement of energy to larger scales is not limited to any region of the spectra – it is present at all length scales. The Ma1 simulations, which are the only simulations to exhibit an inverse cascade, are also the only ones in which the total kinetic energy increases during any period. After peak magnetic energy in the Ma1 simulations, the magnetic energy decreases while the kinetic energy increases for ≃ 1𝑇; the inverse cascade appears during this same period.

3.4.2.3 Cross-Scale Flux

With additional analysis of the shell-to-shell transfer, we can extract more insight into the movement of energy. We can measure the cross-scale flux of energy from scales below a wavenumber 𝑘 to scales above a wavenumber 𝑘 by integrating the transfer function

Π_{𝑋<}^{𝑌>}(𝑘) = Σ_{𝑄≤𝑘} Σ_{𝐾≥𝑘} T_𝑋𝑌(𝑄, 𝐾) . (3.9)

Figure 3.11 shows the cross-scale fluxes via different transfer mechanisms for the simulations with Ms0.2. The top row shows cross-scale fluxes early in the simulation at 𝑡 = 0.77𝑇, when the large scale flow is still decaying into smaller scales. The magnetic cross-scale flux at low wavenumbers predictably depends on the initial magnetic energy, while the kinetic energy cross-scale flux is largely the same between simulations at a given sonic Mach number. For example, for Ma10 the cross-scale flux is strongly dominated by Π_{𝑈<}^{𝑈>}, whereas for Ma3.2 it is still the most significant contribution to the cross-scale flux, but substantial contributions are also seen from Π_{𝑈<}^{𝐵>} (≃ 60% of Π_{𝑈<}^{𝑈>}(4)), Π_{𝐵<}^{𝐵>} (≃ 30%), and Π_{𝐵<}^{𝑈>} (≃ 20%).
For the strongest initial magnetization (Ma1), the early cross-scale flux is dominated by magnetic tension-mediated transfers from the kinetic to the magnetic budget (Π_{𝑈<}^{𝐵>}) on all scales having a non-zero cross-scale flux (𝑘 ≲ 64), with a similar contribution by the magnetic cascade on intermediate scales (9 ≲ 𝑘 ≲ 64). The kinetic cascade is suppressed on all scales, generally contributing less than 10% to the total cross-scale flux.

At later times (𝑡 = 5.16𝑇, bottom row of Fig. 3.11), magnetic energy dominates both the energy budget and the cross-scale energy flux. The cross-scale energy flux via kinetic interactions is near zero across the inertial range of the spectrum and thus does not significantly contribute to the total cross-scale energy flux. Only the magnetic fields facilitate downscale cross-scale flux at intermediate scales, both within the magnetic energy and from kinetic to magnetic energy. Moreover, the relative contributions of the individual transfers Π_{𝑈<}^{𝐵>}, Π_{𝐵<}^{𝐵>}, Π_{𝐵<}^{𝑈>}, and Π_{𝑈<}^{𝑈>} (in order of decreasing contribution) on intermediate scales (16 ≲ 𝑘 ≲ 64) are the same independent of the initial magnetization. This continuous cross-scale flux is consistent with the evolving spectral index discussed in Section 3.4.1.3. Cross-scale flux through large physical scales is irregular, variable, and sometimes negative due to the lack of structure and driving forces at large scales.

3.5 Discussion

3.5.1 Comparison to driven turbulence simulations

The Taylor-Green vortex provides an interesting study of a freely evolving transition to decaying turbulence. In other words, no external force is applied to the simulation as is the case in driven turbulence simulations. Such an external force may introduce unintended dynamics to the flow (Grete et al., 2018). For example, in a simulation that is mechanically driven at large scales, energy may still be injected on intermediate scales, both in the incompressible regime (Domaradzki et al., 2010) as well as in the compressible regime due to density coupling (Grete et al., 2017). Moreover, mechanical driving generally results in an excess of energy on the excited, kinetic scales that presents a barrier for magnetic field amplification on those scales in cases without a dynamically relevant mean magnetic field. This barrier is often expressed in the lack of a clear power law regime in the magnetic spectrum, which instead resembles an inverse parabolic shape. At the same time, the magnetic energy spectrum drops below the kinetic one on the driving scales (see, e.g., Figure 1 in Grete et al. (2021c) and references therein). In the simulations presented here no such barrier is observed. Both kinetic and magnetic energy spectra exhibit a (limited) regime where power law scaling is observed once a state of developed turbulence is reached.

Another important question raised by driven turbulence simulations pertains to the locality of energy transfers. While there is agreement that T_𝑈𝑈 and T_𝐵𝐵 mediated transfers, i.e., within a budget, are highly local, the energy transfers between budgets (here, T_𝑈𝐵𝑇) have been observed to be weakly local and/or to contain a nonlocal component from the driven scales (Alexakis et al., 2005; Yang et al., 2016; Grete et al., 2017). Here, we show that in the absence of a driving force the energy transfer mediated by magnetic tension contains both a local component as well as a nonlocal component. The latter directly transfers large-scale kinetic energy to large and intermediate scales in the magnetic energy budget.
Thus, the nonlocal component is not an artifact of an external driving force.

Finally, we recently showed that the kinetic energy spectra in driven turbulence simulations follow a scaling close to 𝑘^−4/3, i.e., shallower than Kolmogorov scaling, and explained this by the suppression of the kinetic energy cascade due to magnetic tension (Grete et al., 2021c). This is in agreement with our findings in the work presented here, where the same dynamics are observed at late times when turbulence is fully developed. Naturally, this does not demonstrate that the same physical mechanisms are causing the similar slopes. Nevertheless, the late time evolution of the simulations presented here is still comparable, to a limited degree, to driven simulations of stationary turbulence. For example, even at late times (see, e.g., 𝑡 = 5.16𝑇 in Fig. 3.6), energy is still cascading down from the largest scales (𝑘 ≲ 8), but the cascade is weaker than its initial magnitude. The reduction in strength of the cascade on large scales is directly linked to the decay of the large initial vortices. Nevertheless, even at late times the overall energy balance is still dominated by the largest scales, cf. the spectra shown in Fig. 3.3 when taking into account the 𝑘^4/3 compensation used in the plot. Overall, while the inertial range here shrinks and becomes weaker (to a limited degree) over time as the large scale modes lose energy, the dynamics within the inertial range are similar to driven turbulence simulations.

3.5.2 Comparison to previous results

In general, our results in the weakly compressible MHD regime are in agreement with the 𝛼 ≃ −2 spectrum reported by previous works on the TG vortex in the incompressible MHD regime using the insulating magnetic field configuration (Pouquet et al., 2010; Lee et al., 2010; Dallas & Alexakis, 2013a,b). We see the same 𝛼 ≃ −2 spectrum early in the evolution before 𝑡 = 2𝑇, which corresponds to the time period near maximum energy dissipation that these other studies focused on. In all cases that we simulated, the spectra became shallower at later times, independent of the initial magnetization (whereas these other works focused on 𝐸_𝑈/𝐸_𝐵 = 1, i.e., M_𝐴,0 = 1, configurations, which are in good agreement with the Ms0.2_Ma1 simulation presented here, see the top left panel of Fig. 3.2). As noted by Dallas & Alexakis (2013a), the 𝛼 ≃ −2 spectrum is likely due to discontinuities in a small volume of the flow that can be disrupted by symmetry breaking at either large or small scales (Dallas & Alexakis, 2013c). According to Dallas & Alexakis (2013c), a simulated Taylor-Green vortex with sufficiently high Reynolds number should show symmetry breaking at the small scales at late times in the evolution, causing a break from the −2 power law at large wavenumbers. Since our simulations do not impose symmetries on the flow, this is a possible explanation for the observed behavior. However, we see an 𝛼 ≃ −4/3 inertial range scaling at late times, instead of the 𝛼 ≃ −2 and 𝛼 ≃ −5/3 broken power law theorized by Dallas & Alexakis (2013c).

Finally, work done in Lee et al. (2010); Brachet et al. (2013); Dallas & Alexakis (2013b) shows that the behavior of the magnetic field and spectra changes with the initial magnetic field configuration. With the insulating initial magnetic fields that we use, the vorticity begins parallel to the magnetic field. This facilitates the early energy flux from kinetic to magnetic energy.
The insulating case tends towards stronger large-scale magnetic fields compared to the other magnetic field configurations. Both of the other initial magnetic field configurations result in different energy spectra, with the conducting magnetic field setup leading to a 𝑘^−3/2 spectrum and the alternative insulating field setup leading to spectra interpreted as either a 𝑘^−5/3 or a 𝑘^−2 spectrum, as argued by Lee et al. (2010) and Dallas & Alexakis (2013b), respectively.

3.5.3 Implication of results

In all of our simulations, we see magnetic fields, and effects facilitated by the magnetic fields, dominating the evolution of the decaying turbulence, even when the initial kinetic energy exceeds the magnetic energy by a factor of 100 in the Ma10 simulations. Energy transfer from kinetic to magnetic energy via tension and energy transfer within the magnetic energy far exceed the energy flux via the kinetic turbulent cascade at later times. Energy transfer from kinetic to magnetic energy at earlier times leads to the magnetic energy dominating over the kinetic energy in all cases, both in total magnitude as well as in terms of the scale-wise budget, cf. the magnetic versus kinetic energy spectra. This is similar to what has been found in incompressible (Alexakis et al., 2005) and compressible (Grete et al., 2017, 2021c) simulations of driven turbulence. Thus, even in intermittently-driven systems one can expect the magnetic field to significantly influence the dynamics after a few dynamical times.

Our simulations exhibit a magnetic energy spectrum with a measurable power law after the turbulent flow is realized. The inertial range is short, from approximately 𝑘 = 10 to 𝑘 = 32, due to the resolution of these simulations. Nevertheless, within this region we can reasonably fit a power law to both the kinetic and magnetic spectra, which is often not possible in driven turbulence simulations without a dynamically relevant mean magnetic field, cf. Sec. 3.5.1. Thus, freely evolving and driven turbulence simulations complement each other, and both are required to disentangle environmental from intrinsic effects.

From an observational point of view, we demonstrated that the spectral indices evolve over time and fluctuate even for similar parameters. Therefore, spectral indices derived from observations (e.g., velocity maps in astrophysics), which represent individual snapshots in time, need to be interpreted with care when trying to infer the “nature” of turbulence (e.g., Kolmogorov or Burgers) in the object of interest.

Finally, the observed nonlocal energy transfer has implications for the dynamical development of small scale structures from intermittent or singular energy injection events. Within the context of natural astrophysical and terrestrial plasmas, the nonlocal energy transfer from kinetic to magnetic energies suggests that small scale magnetic field structures develop before small scale kinetic structures.

3.5.4 Limitations

While our analysis showed that the results are generally robust (e.g., with respect to varying the fitting range for the spectral indices or varying the range in the definition of scale-local in the energy transfers), higher resolution simulations are desirable. With higher resolution in an implicit large eddy simulation (ILES) the dynamic range is increased and, thus, the effective Reynolds numbers of the simulated plasma are raised. Similarly, due to the nature of ILES the effective magnetic Prandtl number in all simulations is Pm ≃ 1.
However, in natural systems (both astrophysical and terrestrial/experimental) Pm is either ≫ 1 or ≪ 1, motivating the exploration of these regimes in the future as well.

All of our simulations started with subsonic initial conditions, leaving the supersonic regime unexplored. The additional shocks, discontinuities, and strong density variations that may arise in a supersonic flow could alter the energy transfer as the flow transitions into turbulence. In the simulations we present here, the Mach number generally did not significantly affect the growth and behavior of the turbulence. In a supersonic flow, however, the transitory effects such as the nonlocal energy transfer and inverse cascade may be altered or suppressed, in addition to generally richer dynamics related to compressive effects and the effective space-filling of turbulent structures (Federrath, 2013).

Figure 3.5 indicates that the spectral index of both the kinetic and magnetic energy cascades evolves as a function of magnetic field strength (i.e., the initial M_𝐴). It is unclear whether there is a threshold of M_𝐴 above which the spectra become shallower, or whether there is a continuum of behavior as the initial M_𝐴 is increased. While we would like to engage in a more thorough exploration of the dependence of these behaviors on M_𝐴, the simulations in question are computationally expensive and it is infeasible to do so at present. Exploration of this transition is a promising avenue for future work.

Finally, the shell decomposition used here to study energy transfer has been shown to violate the inviscid criterion for decomposing scales in the compressible regime (Zhao & Aluie, 2018). However, this only pertains to flows with significant density variations and, thus, is effectively irrelevant for the subsonic simulations presented here.

3.6 Conclusions

We have presented in this work nine simulations of the Taylor-Green vortex using the insulating magnetic field setup from Lee et al. (2008) to study magnetized decaying turbulence in the compressible ideal MHD regime using the finite volume code K-Athena. As a first for the Taylor-Green vortex, we have also presented an energy transfer analysis to show the movement of energy between scales and energy reservoirs as facilitated via different mechanisms. Our key results are as follows:

• Magnetic fields significantly affect the evolution of the decaying turbulence, regardless of the initial field strength. Energy flux from kinetic energy to magnetic energy leads to the magnetic energy dominating the energy budget, even in simulations where the magnetic energy is initially very small.

• The Taylor-Green vortex simulations explored here display a power law in both the kinetic and magnetic energy spectra with a measurable spectral index, which is in contrast with the lack of a power law in the magnetic energy spectrum seen in driven turbulence calculations without a significant mean field.

• Decaying turbulent flows do not exhibit a spectral index that is constant in time in either the kinetic or magnetic energy reservoirs – these spectra continually evolve over time. The spectral indices of the kinetic and magnetic energies become comparable and roughly constant around 1-2 dynamical times after the magnetic energy has become dominant. This can happen as early as 𝑡 = 2𝑇 when the initial magnetic energy equals the initial kinetic energy, and as late as 𝑡 = 5𝑇 when the initial kinetic energy exceeds the magnetic energy by a factor of 100.
For simulations with more initial kinetic energy than magnetic energy, the spectral indices reach a rough constant slightly steeper than 𝛼 ≃ −4/3.

• Before the turbulent flow fully develops, an inverse cascade within the kinetic and magnetic energy reservoirs is intermittently observed. This intermittent behavior moves energy from smaller scales to larger scales and is possible when the magnetic energy is comparable to the kinetic energy.

• Analysis of energy transfer within and between reservoirs indicates that within fully-developed turbulence, the cross-scale fluxes of energy in both the kinetic and magnetic cascades are dominated by energy transfer mediated by the magnetic field.

• Magnetic tension facilitates nonlocal transfer from larger scales in the kinetic energy to smaller scales in the magnetic energy, and is particularly prominent in simulations where the magnetic field is initially weak.

Figure 3.1: Slices of sonic Mach number (left) and magnetic pressure (right) at 𝑡 = 0.77𝑇 and 𝑡 = 5.16𝑇 in the 𝑥𝑦-plane through 𝑧 = 𝜋𝐿/2, with streamlines on the left showing the direction of the flow and streamlines on the right showing the direction of the magnetic fields, plotting only the first quadrant from the Ms0.2_Ma10 simulation, demonstrating the transition of the flow into turbulence.

Figure 3.2: Mean energies over time in the top row, with kinetic energy (solid blue), magnetic energy (solid orange), the sum of kinetic and magnetic energies (solid green), and the change in thermal energy since the simulation start (solid red), and dimensionless numbers over time in the bottom row, with RMS sonic Mach number M_𝑠 (blue), Alfvénic Mach number M_𝐴 (orange), and plasma beta 𝛽 (green) for the Ms0.2 simulations. Energy over time from the simulation from Fig. 3a in Pouquet et al. (2010) (adjusted to the normalization used here), which matches the setup of the Ms0.2_Ma1 simulation, is shown with dashed lines in the upper left panel for reference. Energies and Mach numbers for all nine simulations are shown in the online supplements.

Figure 3.3: Kinetic energy spectra (solid blue) and magnetic energy spectra (solid orange) compensated by 𝑘^4/3, with black dashed lines showing the power law fit to the spectra used to obtain a spectral index. In the left column we show the Ms0.2_Ma1 simulation, in the middle column the Ms0.2_Ma3.2 simulation, and in the right column the Ms0.2_Ma10 simulation. In the top row we show all simulations at 𝑡 = 0.77𝑇; in the middle row we show the three simulations at different times (𝑡 = 1.29𝑇, 𝑡 = 1.34𝑇, 𝑡 = 1.81𝑇) when the simulations are displaying interesting behavior discussed in Sections 3.4.2.2 and 3.4.2.1; and in the bottom row we show all simulations at 𝑡 = 5.16𝑇, when the initial flow has completely decayed into turbulence and both energy spectra fluctuate around a 𝑘^−4/3 spectrum.
Figure 3.4: The kinetic energy (top) and magnetic energy (bottom) at wavenumbers 𝑘 = 8, 22, 64, 128, plotted separately in different colors versus time, where the energy at each wavenumber has been compensated by 𝑘^4/3 to make them comparable. In the left column we show the Ms0.2_Ma1 simulation, in the middle column the Ms0.2_Ma3.2 simulation, and in the right column the Ms0.2_Ma10 simulation. Energy at the smallest length scales in both reservoirs saturates at 𝑡 ≃ 1𝑇, 𝑡 ≃ 1.5𝑇, and 𝑡 ≃ 2.5𝑇 in the Ms0.2_Ma1, Ms0.2_Ma3.2, and Ms0.2_Ma10 simulations respectively, showing approximately when the turbulence has developed at all scales.

Figure 3.5: Evolution of the spectral indices of the kinetic (blue), magnetic (orange), and sum of kinetic and magnetic energy (green) spectra over time for the Ms0.2 simulations. The slope is computed from a least squares fit of the energy spectra limited to wavenumbers 𝑘 ∈ [10, 32], which is approximately the inertial range. Shaded bands show how the fitted slope differs if a range 𝑘 ∈ [8, 34], 𝑘 ∈ [10, 32], or 𝑘 ∈ [12, 30] is used. Note that the spectral index using the range 𝑘 ∈ [10, 32] is not guaranteed to be bounded by the spectral indices obtained using 𝑘 ∈ [8, 34] and 𝑘 ∈ [12, 30], which is especially evident in the Ms0.2_Ma3.2 and Ms0.2_Ma10 simulations from 𝑡 ≃ 2𝑇 to 𝑡 ≃ 4𝑇. Horizontal dashed lines show −4/3 and −5/3 spectral indices. The slope is only shown after 𝑡 = 1𝑇, as the initial flow conditions dominate the spectra at early times, leading to steep spectra. We include the spectral indices versus time for all nine simulations in the online supplements.

Figure 3.6: Shell-to-shell energy transfer plots for the energy transfer within the kinetic (left) and magnetic (right) energy reservoirs via advection and compression at 𝑡 = 0.77𝑇 (top) and 𝑡 = 5.16𝑇 (bottom) from the Ms0.2_Ma1 simulation, showing the development of the kinetic and magnetic turbulent cascades. Annotations on the figure highlight key features of the energy transfer that are characteristic of a developing turbulent cascade. Each bin shows the flux of energy from shell 𝑄 to shell 𝐾, where orange bins with white circles show a positive flux of energy, so that 𝐾 is gaining energy, and purple bins with white x's show a negative flux, so that 𝐾 is losing energy. The energy flux in each bin is normalized by 𝜀 = max_{𝑄,𝐾} |T_𝑋𝑌(𝑄, 𝐾)|, so that a higher 𝜀 means a higher energy flux. The solid black line shows equivalent scale transfers.
As the turbulent cascade develops in the magnetic and kinetic energy reservoirs, more energy transfers along the diagonal fill out the energy spectrum down to the numerical dissipation scales.

Figure 3.7: Shell-to-shell energy transfer plots for the energy transfer within the kinetic (top) and magnetic (bottom) energy reservoirs via advection and compression at 𝑡 = 1.29𝑇 from the Ms0.2_Ma1 simulation, showing a transient inverse cascade within the magnetic energy reservoir (on all scales 𝐾, 𝑄 ≲ 100) and the kinetic energy reservoir (on large scales 𝐾, 𝑄 ≲ 16). Annotations show where along the diagonal the inverse cascade is present.

Figure 3.8: Shell-to-shell energy transfer plot for the energy transfer from kinetic to magnetic energy via magnetic tension at 𝑡 = 1.81𝑇 from the Ms0.2_Ma10 simulation, showing the nonlocal energy transfer from large kinetic scales to many smaller magnetic scales. Annotations show where the nonlocal transfer is present.

Figure 3.9: Integrated energy flux over time from kinetic to magnetic energy via tension, from larger wavenumbers to smaller nonlocal wavenumbers (purple), from larger wavenumbers to smaller local wavenumbers (blue), between equivalent wavenumbers (green), from smaller wavenumbers to larger local wavenumbers (orange), and from smaller wavenumbers to larger nonlocal wavenumbers (red), in the Ms0.2 simulations. We normalize the energy flux in each panel so that the absolute maximum of all of the flux bins is 1.0, where 𝜀 is the normalization factor used in each panel. Comparisons of the relative strength of energy fluxes in different simulations must consider 𝜀. The inset plot in the lower right panel shows the color coded regions that are integrated to calculate each line at a single time for the same shell-to-shell transfer from Figure 3.8. Solid lines show the integrated flux if “local” wavenumbers are defined as 5 logarithmic bins away from the equivalent wavenumber. The shaded regions show the integrated flux if 4 or 6 bins are used, showing that the behavior is robust whether the range of “local” wavenumbers is defined closer to or further from the equivalent-scale transfer.
We include the integrated flux from kinetic to magnetic energy via tension for all nine simulations in the online supplements.

Figure 3.10: Integrated energy flux over time within the kinetic energy (top) and within the magnetic energy (bottom), from larger wavenumbers to smaller nonlocal wavenumbers (purple), from larger wavenumbers to smaller local wavenumbers (blue), between equivalent wavenumbers (green), from smaller wavenumbers to larger local wavenumbers (orange), and from smaller wavenumbers to larger nonlocal wavenumbers (red), in the Ms0.2_Ma1 simulation. The inset plot in the lower panel demonstrates the color coded regions that are integrated to calculate each line at 𝑡 = 1.29𝑇 from the shell-to-shell transfer from Figure 3.7. Solid lines show the integrated flux if “local” wavenumbers are defined as 5 logarithmic bins away from the equivalent wavenumber. The results change very little if 4 or 6 bins are used. We include the integrated flux within the kinetic energy and magnetic energy for all nine simulations in the online supplements.

Figure 3.11: Cross-scale flux within the kinetic energy (blue line), within the magnetic energy (orange line), and from kinetic to magnetic energy via tension (green line) in the three Ms0.2 simulations across columns, at dynamical time 𝑡 = 0.77𝑇 (top) and later at dynamical time 𝑡 = 5.16𝑇 (bottom). Note that the cross-scale fluxes at later times are an order of magnitude smaller than the early cross-scale fluxes. Positive values of this quantity denote energy transfer from larger to smaller scales.

CHAPTER 4

K-ATHENA: A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MAGNETOHYDRODYNAMICS CODE

This chapter first appeared as the published paper Grete et al. (2021a), on which I am an equal co-first author. I include the original abstract as the introduction to this chapter.

4.1 Chapter Abstract

Large scale simulations are a key pillar of modern research and require ever-increasing computational resources. Different novel manycore architectures have emerged in recent years on the way towards the exascale era. Performance portability is required to prevent repeated non-trivial refactoring of a code for different architectures. We combine Athena++, an existing magnetohydrodynamics (MHD) CPU code, with Kokkos, a performance portable on-node parallel programming paradigm, into K-Athena to allow efficient simulations on multiple architectures using a single codebase. We present profiling and scaling results for different platforms including Intel Skylake CPUs, Intel Xeon Phis, and NVIDIA GPUs. K-Athena achieves > 10^8 cell-updates/s on a single V100 GPU for second-order double precision MHD calculations, and a speedup of 30 on up to 24,576 GPUs on Summit (compared to 172,032 CPU cores), reaching 1.94 × 10^12 total cell-updates/s at 76% parallel efficiency. Using a roofline analysis we demonstrate that the overall performance is currently limited by DRAM bandwidth and calculate a performance portability metric of 62.8%.
Finally, we present the implementation strategies used and the challenges encountered in maximizing performance. This will provide other research groups with a straightforward approach to prepare their own codes for the exascale era. K-Athena is available at https://gitlab.com/pgrete/kathena.

4.2 Introduction

The era of exascale computing is approaching. Different projects around the globe are working on the first exascale supercomputers, i.e., supercomputers capable of conducting 10¹⁸ floating point operations per second. This includes, for example, the Exascale Computing Initiative working with Intel and Cray on Aurora as the first exascale computer in the US in 2021, the EuroHPC collaboration working on building two exascale systems in Europe by 2022/2023, Fujitsu and RIKEN in Japan working on the Post-K machine to launch in 2021/2022, and China, which targets 2020 for its first exascale machine. While the exact architectural details of these machines are not announced yet and/or are still under active development, the overall trend in recent years has been manycore architectures. Here, manycore refers to an increasing number of (potentially simpler) cores on a single compute node and includes CPUs (e.g., Intel's Xeon Scalable Processor family or AMD's Epyc family), accelerators (e.g., the now discontinued Intel Xeon Phi line), and GPUs for general purpose computing.

MPI+OpenMP has been the prevailing parallel programming paradigm in many areas of high performance computing for roughly two decades. It is questionable, however, whether this generic approach will be capable of making efficient use of available hardware features such as parallel threads and vectorization across different manycore architectures and between nodes. In addition to extensions of the MPI standard such as shared-memory parallelism, several approaches beyond MPI+OpenMP exist and are being actively developed to address either on-node, inter-node, or both types of parallelism. These include, for example, partitioned global address space (PGAS) programming models such as UPC++ Zheng et al. (2014), or parallel programming frameworks such as Charm++ or Legion, which are based on message-driven migratable objects Kale & Krishnan (1993); Bauer et al. (2012).

Our main goal is a performance portable version of the existing MPI+OpenMP finite volume (general relativity) magnetohydrodynamics (MHD) code Athena++ White et al. (2016b); Stone et al. (2020b). This goal includes enabling GPU-accelerated simulations while maintaining CPU performance using a single code base. More generally, performance portability refers to achieving consistent levels of performance across heterogeneous platforms using as little architecture-dependent code as possible. Given the uncertainties in future architectures (and the broad availability of different architectures already today), performance portability is an active field of research in many areas Straatsma et al. (2017); Bennett et al. (2015). This includes (but is not limited to) idealized benchmarks and miniapps Heroux et al. (2009); Martineau et al. (2017); Deakin et al. (2018); Hammond & Mattson (2019), algorithm libraries Heroux & Willenbring (2012), structured mesh codes Holmen et al. (2019), or particle in cell codes Artigues et al. (2019). In order to keep the code changes minimal, and given the MPI+OpenMP basis of Athena++, we decided to keep MPI for inter-node parallelism and focus on on-node performance portability.
For on-node performance portability several libraries and programming language extensions exist. With version 4.5, OpenMP Dagum & Menon (1998) has been extended to support offloading to devices such as GPUs, but support and maturity are still highly compiler and architecture dependent. This similarly applies to OpenACC, which has been designed from the beginning to target heterogeneous platforms. While these two directives-based programming models are generally less intrusive with respect to the code base, they only expose a limited fraction of various platform-specific features. OpenCL Stone et al. (2010) is much more flexible and allows fine grained control over hardware features (e.g., threads), but this, on the other hand, adds substantial complexity to the code. Kokkos Edwards et al. (2014) and RAJA Hornung et al. (2015) try to combine flexibility with ease of use by providing abstractions in the form of C++ templates. Both Kokkos and RAJA focus on abstractions of parallel regions in the code, and Kokkos additionally provides abstractions of the memory hierarchy. At compile time the templates are translated to different (native) backends, e.g., OpenMP on CPUs or CUDA on NVIDIA GPUs. A more detailed description of these different approaches including benchmarking in more idealized setups can be found in, e.g., Martineau et al. (2017); Deakin et al. (2018).

We chose Kokkos for the refactoring of Athena++ for several reasons. Kokkos offers the highest level of abstraction without forcing the developer to use it by setting reasonable implicit platform defaults. Moreover, the Kokkos core developer team actively works on integrating the programming model into the C++ standard. New, upcoming features, e.g., in OpenMP, will replace manual implementations in the Kokkos OpenMP backend over time. Kokkos is already used in several large projects to achieve performance portability, e.g., the scientific software building block collection Trilinos Heroux et al. (2005) or the computational framework for simulating chemical and physical reactions Uintah Holmen et al. (2017). In addition, Kokkos is part of the DOE's Exascale Computing Project and we thus expect a backend for Aurora's new Intel Xe architecture when the system launches. Finally, the Kokkos community, including core developers and users, is very active and supportive with respect to handling issues, questions, and offering workshops.

The resulting K-Athena code successfully achieves performance portability across CPUs (Intel, AMD, and IBM), Intel Xeon Phis, and NVIDIA GPUs. We demonstrate weak scaling at 76% parallel efficiency on 24,576 GPUs on OLCF's Summit, reaching 1.94 × 10¹² total cell-updates/s for a double precision MHD calculation. Moreover, we calculate a performance portability metric of 62.8% across Xeon Phis, 6 CPU generations, and 3 GPU generations. We make the code available as an open source project (the project repository is located at https://gitlab.com/pgrete/kathena).

The paper is organized as follows. In Section 4.3 we introduce Kokkos, Athena++, and the changes made and approach chosen in creating K-Athena. In Section 4.4 we present profiling, scaling, and roofline analysis results. Finally, we discuss current limitations and future enhancements in Sec. 4.5 and make concluding remarks in Sec. 4.6.

4.3 Method

4.3.1 Kokkos

Kokkos is an open source C++ performance portability programming model Edwards et al. (2014) (see https://github.com/kokkos for the library itself, associated tools, tutorials, and a wiki). It is implemented as a template library and offers abstractions for parallel execution of code and data management.
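Before the individual abstractions are described below, the following minimal sketch (illustrative only, not code from K-Athena) shows the flavor of these abstractions: a multidimensional array allocated through Kokkos, a parallel kernel dispatched to the default execution space, and an explicit copy between device and host memory.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // A 3D array allocated in the default memory space (e.g., GPU memory
    // when compiled with the CUDA backend, host memory for OpenMP).
    Kokkos::View<double***> u("u", 64, 64, 64);

    // A parallel kernel dispatched to the default execution space.
    Kokkos::parallel_for(
        "init",
        Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 0}, {64, 64, 64}),
        KOKKOS_LAMBDA(int k, int j, int i) { u(k, j, i) = 1.0; });

    // Host mirrors and deep copies move data explicitly across memory spaces.
    auto u_host = Kokkos::create_mirror_view(u);
    Kokkos::deep_copy(u_host, u);
  }
  Kokkos::finalize();
  return 0;
}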
The core of the programming model consists of six abstractions. First, execution spaces define where code is executed. This includes, for example, OpenMP on CPUs or Intel Xeon Phis, CUDA on NVIDIA GPUs, or ROCm on AMD GPUs (which is currently experimental). Second, execution patterns, e.g., parallel_for or parallel_reduce, are the building blocks of any application that uses Kokkos. These parallel regions are often also referred to as kernels as they can be dispatched for execution on execution spaces (such as GPUs). Third, execution policies determine how an execution pattern is executed. There exist simple range policies that only specify the indices of the parallel pattern and the order of iteration (i.e., the fastest changing index for multidimensional arrays). More complicated policies, such as team policies, can be used for more fine-grained control over individual threads and nested parallelism. Fourth, memory spaces specify where data is located, e.g., in host/system memory or in device space such as GPU memory. Fifth, the memory layout determines the logical mapping of multidimensional indices to actual memory locations, cf., C family row-major order versus Fortran column-major order. Sixth, memory traits can be assigned to data and specify how data is accessed, e.g., atomic access, random access, or streaming access.

These six abstractions offer substantial flexibility in fine-tuning applications, but the application developer is not always required to specify all details. In general, architecture-dependent defaults are set at compile time based on the information on devices and architecture provided. For example, if CUDA is defined as the default execution space at compile time, all Kokkos::Views, which are the fundamental multidimensional array structure, will be allocated in GPU memory. Moreover, the memory layout is set to column-major so that consecutive threads in the same warp access consecutive entries in memory.

4.3.2 Athena++

Athena++ is a radiation general relativistic magnetohydrodynamics (GRMHD) code focusing on astrophysical applications White et al. (2016b); Stone et al. (2020b). It is a rewrite in modern C++ of the widely used Athena C version Stone et al. (2008b). Athena++ offers a wide variety of compressible hydro- and magnetohydrodynamics solvers including support for special and general relativistic (M)HD, flexible geometries (Cartesian, cylindrical, or spherical), and mixed parallelization with OpenMP and MPI. Apart from the overall feature set, the main reasons we chose Athena++ are a) its excellent performance on CPUs and KNLs due to a focus on vectorization in the code design, b) a generally well written and documented code base in modern C++, c) point releases are publicly available that contain many (but not all) features (our code changes are based on the public version, Athena++ 1.1.1, see https://github.com/PrincetonUniversity/athena-public-version), and d) a flexible task-based execution model that allows for a high degree of modularity.

Listing 4.1: Example triple for loop for a typical operation in a finite volume method on a structured mesh such as in a code like Athena++, where ks, ke, js, je, is, and ie are loop bounds and u is an athena_array object of, for example, an MHD variable.

for (int k = ks; k < ke; k++) {
  for (int j = js; j < je; j++) {
#pragma omp simd
    for (int i = is; i < ie; i++) {
      /* Loop Body */
      u(k,j,i) = ...
    }
  }
}
Athena++'s parallelization strategy revolves around so-called meshblocks. The entire simulation grid is divided into smaller meshblocks that are distributed among MPI processes and/or OpenMP threads. Each MPI process (or OpenMP thread) owns one or more meshblocks that can be updated independently after boundary information has been communicated. If hybrid parallelization is used, each MPI process runs one or more OpenMP threads that each are assigned one or more meshblocks. This design choice is often referred to as coarse-grained parallelization as threads are used at a block (here meshblock) level and not over loop indices. In general, Athena++ uses persistent MPI communication handles in combination with one-sided MPI calls to realize asynchronous communication. Moreover, each thread makes its own MPI calls to exchange boundary information. As a result, using more than one thread per MPI process may increase overall on-node performance due to hyperthreading but also increases both the number of MPI messages sent and the total amount of data sent. The latter may result in overall worse parallel performance and efficiency, as demonstrated in Sec. 4.4.3.2.

Given the coarse-grained OpenMP approach over meshblocks, the prevalent structures in the code base are triple (or quadruple) nested for loops that iterate over the content of each meshblock (and variables in the quadruple case). A prototypical nested loop is illustrated in Listing 4.1. Generally, all loops (or kernels) in Athena++ have been written so that OpenMP simd pragmas are used for the innermost loop. This helps the compiler in trying to automatically vectorize the loops, resulting in a more performant application.

4.3.3 K-Athena = Kokkos + Athena++

In order to combine Athena++ and Kokkos, four major changes in the code base were required: 1) making Kokkos::Views the fundamental data structure, 2) converting nested for loop structures to kernels, 3) converting "support" functions, such as the equation of state, to inline functions, and 4) converting communication buffer filling functions into kernels.

First, Views are Kokkos' abstraction of multidimensional arrays. Thus, the multidimensional arrays originally used in Athena++, e.g., the MHD variables for each meshblock, need to be converted to Views so that these arrays can transparently be allocated in arbitrary memory spaces such as device (e.g., GPU) memory or system memory. Athena++ already implemented an abstract athena_array class for all multidimensional arrays with an interface similar to the interface of a View. Therefore, we only had to add View objects as member variables and to modify the functions of athena_arrays to transparently use functions of those member Views. This included using View constructors to allocate memory, using Kokkos::deep_copy or Kokkos::subview for copy constructors and shallow slices, and creating public member functions to access the Views. The latter is required in order to properly access the data from within compute kernels.

Second, all nested for loop structures (see Listing 4.1) need to be converted to so-called kernels, i.e., parallel regions that can be dispatched for execution by an execution space. As described in Sec.
4.3.1, multiple execution policies are possible, such as a multidimensional range policy (see Listing 4.2), a one dimensional policy with manual index mapping (see Listing 4.3), or a team policy that allows for more fine-grained control and nested parallelism (see Listing 4.4). Generally, the loop body remained mostly unchanged. Given that it is not a priori clear what kind of execution policy yields the best performance for a given implementation of an algorithm, we decided to implement a flexible loop macro (note that in newer versions of the code we replaced the macro with a template). That macro allows us to easily change the execution policy for performance tests (see profiling results in Sec. 4.4.3.1 and discussion in Sec. 4.5), and this intermediate abstraction is similar to the approach chosen in other projects Holmen et al. (2019).

Listing 4.2: Example for loop using Kokkos. The loop body is reformulated into a lambda function and passed into Kokkos::parallel_for to execute on the target architecture. The class Kokkos::MDRangePolicy specifies the loop bounds. The array u is now a Kokkos::View, a Kokkos building block that allows transparent access to CPU and GPU memory. The loop body, i.e., the majority of the code, remains mostly unchanged.

parallel_for(MDRangePolicy<Rank<3>>({ks, js, is}, {ke, je, ie}),
  KOKKOS_LAMBDA(int k, int j, int i) {
    /* Loop Body */
    u(k,j,i) = ...
  });

Listing 4.3: Same as Listing 4.2 but using a one dimensional Kokkos::RangePolicy (implicit through the default template parameter) with explicit index calculation.

int nk = ke - ks, nj = je - js, ni = ie - is;
parallel_for(nk*nj*ni, KOKKOS_LAMBDA(int idx) {
  int k = idx / (nj*ni);
  int j = (idx - k*(nj*ni)) / ni;
  int i = idx - k*(nj*ni) - j*ni;
  /* Loop Body */
  u(k,j,i) = ...
});

Listing 4.4: Another approach using Kokkos' nested team-based parallelism through the Kokkos::TeamThreadRange and Kokkos::ThreadVectorRange classes. This interface is closer to the underlying parallelism used by the backend, such as CUDA blocks on GPUs and SIMD vectors on CPUs.

parallel_for(team_policy(nk, AUTO),
  KOKKOS_LAMBDA(member_type thread) {
    const int k = thread.league_rank() + ks;
    parallel_for(TeamThreadRange<>(thread, js, je),
      [&](const int j) {
        parallel_for(ThreadVectorRange<>(thread, is, ie),
          [=](const int i) {
            /* Loop Body */
            u(k,j,i) = ...
          });
      });
  });

Third, all functions that are called within a kernel need to be converted into inline functions (here, more specifically using the KOKKOS_INLINE_FUNCTION macro). This is required because if the kernels are executed on a device such as a GPU, the functions need to be compiled for the device (e.g., with a __device__ attribute when compiling with CUDA). In Athena++, this primarily concerned functions such as the equation of state and coordinate system-related functions.

Fourth, Athena++ uses persistent communication buffers (and MPI handles) to exchange data between processes. Originally, these buffers resided in the system memory and were filled directly from arrays residing in the system memory. In the case where a device (such as a GPU) is used as the primary execution space and the arrays should remain on the device to reduce data transfers, the buffer filling functions need to be converted too. Thus, we changed all buffers to be Views and converted the buffer filling functions into kernels that can be executed on any execution space.
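As an illustration of this fourth change, the sketch below shows what a converted buffer-filling function might look like; it is a simplified stand-in rather than K-Athena's actual boundary routines, and the function and variable names are hypothetical.

#include <Kokkos_Core.hpp>

// Pack one ghost-zone slab (the first `ng` planes in k) of a cell-centered
// variable into a flat communication buffer. Both Views live in the default
// memory space, so on a GPU build no host copy is needed before the MPI call.
void PackLowerKBoundary(Kokkos::View<double***> u,
                        Kokkos::View<double*> buffer, int ng) {
  const int nj = u.extent_int(1);
  const int ni = u.extent_int(2);
  Kokkos::parallel_for(
      "pack_lower_k",
      Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 0}, {ng, nj, ni}),
      KOKKOS_LAMBDA(int k, int j, int i) {
        buffer(k * nj * ni + j * ni + i) = u(k, j, i);
      });
  Kokkos::fence();  // make sure the buffer is ready before, e.g., MPI_Isend
}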
Converting the communication buffers to Views additionally allows for CUDA-aware MPI, i.e., GPU buffers can be directly copied between the memories of GPUs (both on the same node and on different nodes) without an implicit or explicit copy of the data to system memory.

In general, the first three changes above are required in refactoring any legacy code to make use of Kokkos. We note that the original Athena++ design made it mostly straightforward to implement those changes, e.g., because of the existence of an abstract array class and the prevailing tightly nested loops already optimized for vectorized instructions. More broadly, we expect that structured grid fluid codes will require similar changes and that other algorithms and applications may require more subtle refactoring in order to achieve good performance. The fourth change was required more specifically for Athena++ due to the existing MPI communication patterns.

Finally, for the purpose of the initial proof-of-concept, we only refactored the parts required for running hydrodynamic and magnetohydrodynamic simulations on static and adaptive Cartesian meshes. Running special and general relativistic simulations on spherical or cylindrical coordinates is currently not supported. However, the changes required to allow for these kinds of simulations are straightforward and we encourage and support contributions to re-enable this functionality.

Throughout the development process, we continuously measured the code performance in detail using so-called Kokkos profiling regions as well as the automated profiling of all Kokkos kernels. Moreover, we employed automated regression testing using GitLab's continuous integration features and included specific tests to address changes related to Kokkos (such as running on different architectures and testing different loop patterns).

4.4 Results

If not noted otherwise, all results in this section have been obtained using a double precision, shock-capturing, unsplit, adiabatic MHD solver consisting of Van Leer integration, piecewise linear reconstruction, a Roe Riemann solver, and constrained transport for the integration of the induction equation (see, e.g., Stone & Gardiner (2009) for more details). The test problem is a linear fast magnetosonic wave on a static, structured, three-dimensional grid. In GPU runs there is no explicit data transfer between system and GPU memory except during problem initialization, i.e., the exchange of ghost cells is handled either by direct copies between buffers in GPU memory on the same GPU or between buffers in GPU memory on different GPUs using CUDA-aware MPI. Similarly, there is also no implicit data transfer as unified memory was not used. Generally, we used the Intel compilers on Intel platforms, and gcc and nvcc on other platforms, as we found that (recent) Intel compilers are more effective in automatic vectorization than (recent) gcc compilers. We used the identical software environment and compiler flags for both K-Athena and Athena++ where possible. Details are listed in Table 4.1. We used Athena++ version 1.1.1 (commit 4d0e425) and K-Athena commit 73fec12d for the scaling tests. Additional information on how to run K-Athena on different machines can be found in the code's documentation.

Table 4.1: Software environment and compiler flags used in scaling tests.
Machine | Compiler | Compiler flags | MPI version
Summit GPU | GCC 6.4.0 & Cuda 9.2.148 | -O3 -std=c++11 -fopenmp -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -expt-extended-lambda -arch=sm_70 -Xcompiler | Spectrum MPI 10.2.0.11
Summit CPU | GCC 8.1.1 | -O3 -std=c++11 -fopenmp-simd -fwhole-program -flto -ffast-math -fprefetch-loop-arrays -fopenmp -mcpu=power9 -mtune=power9 | Spectrum MPI 10.2.0.11
Titan GPU | GCC 6.3.0 & Cuda 9.1.85 | -O3 -std=c++11 -fopenmp -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -expt-extended-lambda -arch=sm_35 -Xcompiler | Cray MPICH 7.6.3
Titan CPU | GCC 6.3.0 | -O3 -std=c++11 -fopenmp | Cray MPICH 7.6.3
Theta | ICC 18.0.0 | -O3 -std=c++11 -ipo -xMIC-AVX512 -inline-forceinline -qopenmp-simd -qopenmp | Cray MPICH 7.7.3
Electra | ICC 18.0.3 | -O3 -std=c++11 -ipo -inline-forceinline -qopenmp-simd -qopt-prefetch=4 -qopenmp -xCORE-AVX512 | HPE MPT 2.17

4.4.1 Profiling

In order to evaluate the effect on performance of the different loop structures presented in Sec. 4.3.3, we compare the timings of different regions within the main loop of the code. The results using both an NVIDIA V100 GPU and an Intel Skylake CPU for a selection of the computationally most expensive regions are shown in Fig. 4.1. The 1DRange loop structure refers to a one dimensional range policy over a single index that is explicitly unpacked to the multidimensional indices in the code (cf. Listing 4.3). While this 1DRange is the fastest loop structure for all regions on the GPU, it is the slowest for all regions on the CPU. According to the compiler report, this particular one dimensional mapping prevents automated vectorization optimizations. All other loop structures tested, i.e., simd-for (cf. Listing 4.1), MDRange (cf. Listing 4.2), and TeamPolicy (cf. Listing 4.4), logically separate the nested loops and, thus, make it easier for the compiler to automatically vectorize the innermost loop. This also explains why the results for simd-for, MDRange, and TeamPolicy are very close to each other for all regions except the Riemann solver. The Riemann solver is the most complex kernel in the chosen setup so that the compiler is not automatically vectorizing this loop despite the #pragma ivdep in Kokkos' MDRange and TeamPolicy. Only the more aggressive explicit #pragma omp simd results in a vectorized loop. The aggregate performance differences (all kernels of a cycle combined) relative to the fastest simd-for pattern are 0.78 (TeamPolicy), 0.71 (MDRange), and 0.51 (1DRange).

Figure 4.1: Profiling results on a GPU (left) and CPU (right) for selected regions (x-axis) within the main loop of an MHD timestep using the algorithm described in Sec. 4.4. The different lines correspond to different loop structures (see Sec. 4.3.3), and the timings are normalized to the fastest Riemann region in each panel.

On the GPU, MDRange is the slowest loop structure, being several times (2x-4x) slower than the 1DRange across all regions. TeamPolicy is on par with 1DRange for half of the regions shown. Here, the aggregate performance differences relative to the fastest 1DRange pattern are 0.75 (TeamPolicy) and 0.078 (MDRange). As discussed in more detail in Sec.
4.5, we expected these non-optimized raw loop structures to not cause any major differences in performance. The results shown here for V100 GPUs and Skylake CPUs equally apply to other GPU generations and other CPUs (and Xeon Phis), respectively. For all tests conducted in the following, we use the loop structure with the highest performance on each architecture, i.e., 1DRange on GPUs and simd-for on CPUs and Xeon Phis.

4.4.2 Performance portability

Our main objective for writing K-Athena is an MHD code that runs efficiently on any current supercomputer and possibly any future machines. A code that runs efficiently on more architectures is said to be performance portable. Determining what is meant by "efficient code" can be vague, especially when comparing performance across different architectures. The memory space sizes, bandwidths, instruction sets, and arrangement of cores on different architectures can all affect how efficiently a code can utilize the hardware. In order to make fair comparisons of K-Athena's performance across different machines (see Sec. 4.4.2.1), we used the roofline model Williams et al. (2009), described in Sec. 4.4.2.2, to compute on several architectures the architectural efficiency of K-Athena, i.e., the fraction of the performance achieved compared to the theoretical performance as limited by the hardware. We then used the architectural efficiencies to compute the performance portability metric from Pennycook et al. (2019), described in Sec. 4.4.2.3, to quantify the performance portability of K-Athena.

4.4.2.1 Overview of architectures used

In total, we created roofline models for six Intel CPUs, Intel Xeon Phis, and three NVIDIA GPUs. The CPU models roughly follow Intel's tick-tock production model and, thus, span pairs of three different instruction sets (AVX, AVX2, and AVX512), with one CPU introducing a new instruction set and the other an increase in cores and/or clock rate with the same instruction set. The Intel Xeon Phi (Knights Landing) also supports AVX512 instructions and differs from the CPUs at the highest level by an increased core count, lower clock rate, and access to MCDRAM. The three different NVIDIA GPUs span three different microarchitectures (Kepler, Pascal, and Volta), which also translates to an increased core count in the GPUs used. L1 data caches are also implemented differently across the three microarchitectures. On Kepler and Volta GPUs, the L1 cache is physically in the same memory device as CUDA "shared" memory, while on Pascal GPUs the L1 cache is combined with texture memory NVIDIA Corporation (2014, 2016, 2017). Load throughput to L1 cache on Pascal GPUs achieves lower bytes/cycle compared to Kepler and Volta
GPUs Jia et al. (2018), which led to K-Athena maintaining a higher fraction of peak L1 bandwidth. A comparative overview of the technical specifications for all architectures is given in Table 4.2.

Table 4.2: Technical specifications for devices used in the performance portability metric. Cache sizes and core counts for CPUs specify the aggregate sizes and counts for a two-socket node, while numbers for GPUs show the aggregate for a single device. For the Tesla K80, the cache size and core count are for just one of the two GK210 chips in the GPU. For DRAM bandwidth (BW) we use the empirically measured bandwidth of the DRAM on CPUs and the global memory on GPUs. Data for Intel devices comes from Intel Corporation (2016) and data for NVIDIA devices comes from NVIDIA Corporation (2014, 2016, 2017); Jia et al. (2018).

Device (microarchitecture) | Model | Instruction set / CUDA capability | Clock rate (GHz) | Num. cores | Max L1 cache (KB) | Total L2 cache (KB) | Total L3 cache (MB) | DRAM BW (GB/s)
Intel Xeon E5 (Sandy Bridge) | 2670 | AVX | 2.6 | 16 | 512 | 4096 | 40 | 97.9
Intel Xeon E5 (Ivy Bridge) | 2680v2 | AVX | 2.8 | 20 | 640 | 2560 | 50 | 121
Intel Xeon E5 (Haswell) | 2680v3 | AVX2 | 2.5 | 24 | 768 | 5120 | 60 | 139
Intel Xeon E5 (Broadwell) | 2680v4 | AVX2 | 2.4 | 28 | 896 | 7168 | 70 | 147
Intel Xeon Gold (Skylake) | 6148 | AVX512 | 2.4 | 40 | 1280 | 40000 | 55 | 246
Intel Xeon Gold (Cascade Lake) | 6248 | AVX512 | 2.5 | 40 | 1280 | 40000 | 55 | 247
Intel Xeon Phi (Knights Landing) | 7250 | AVX512 | 1.4 | 68 | 2176 | 34000 | – | 494
NVIDIA Tesla (Kepler) | K80 | CUDA capability 3.7 | 0.562 | 832 | 1456 | 1536 | – | 195
NVIDIA Tesla (Pascal) | P100 | CUDA capability 6.0 | 1.328 | 1792 | 1344 | 4096 | – | 521
NVIDIA Tesla (Volta) | V100 | CUDA capability 7.0 | 1.29 | 2560 | 10240 | 6144 | – | 782

4.4.2.2 Roofline model

The roofline model is a graphical tool to demonstrate the theoretical peak performance of an application on an architecture by condensing the performance limits imposed by the bandwidth of each memory space and the peak throughput of the device into a single plot. In a roofline model plot, peak throughputs and bandwidths of the hardware are plotted on a log performance [FLOPS] versus log arithmetic intensity [FLOP/B] axis, so that throughputs appear as horizontal lines and bandwidths as 𝑃 ∝ 𝐼 lines (since bandwidth-limited performance is 𝑃 = 𝐵 × 𝐼), where 𝑃 [FLOPS] is performance (in this work we consider double precision throughput and count FMA instructions as two FLOP on architectures that support it), 𝐼 [FLOP/B] is arithmetic intensity (the operations executed per byte read and written), and 𝐵 [B/s] is the bandwidth. The arithmetic intensities of each memory space for a specific application appear as vertical lines, extending up to where the bandwidth of the memory space limits performance.

Figure 4.2: Roofline models of a 2 socket Intel Xeon Gold 6248 "Cascade Lake" CPU node on NASA's Aitken (4.2c) and a single NVIDIA Tesla V100 "Volta" GPU on MSU HPCC (4.2d). Theoretical L1 and DRAM bandwidths and theoretical peak throughputs according to manufacturer specifications are shown as dashed lines. For both cases shown here and all other architectures we tested, DRAM bandwidth (or MCDRAM bandwidth for KNLs) is the limiting bandwidth for K-Athena's performance.

The maximum theoretical performance of an application is limited by the bandwidth and throughput ceilings displayed in the roofline model.
For the given device and application, the maximum obtainable performance in FLOPS is limited by

𝑃max(𝑎, 𝑝, 𝑖) ≤ min_{𝑚 ∈ 𝑀} min[ 𝑇Peak(𝑖), 𝐵(𝑖, 𝑚) × 𝐼(𝑎, 𝑝, 𝑖, 𝑚) ],    (4.1)

where 𝑃max(𝑎, 𝑝, 𝑖) [FLOPS] is the maximum possible FLOPS obtainable by application 𝑎 solving problem 𝑝 on architectural platform 𝑖, 𝑇Peak(𝑖) [FLOPS] is the peak throughput on the platform, 𝑀 is the set of all memory spaces on the device (L1 cache, L2 cache, DRAM, etc.), and 𝐼(𝑎, 𝑝, 𝑖, 𝑚) [FLOP/B] is the arithmetic intensity of the application solving the problem on the memory space 𝑚, i.e., the number of FLOP executed per number of bytes written to and read from 𝑚. We can also mark the actual performance of the application with a horizontal dashed line, indicating the actual average FLOPS achieved. Figures 4.2c and 4.2d show roofline models of K-Athena solving a 256³ linear wave on an Intel Cascade Lake CPU node on NASA's Aitken and a single NVIDIA Volta V100 GPU on MSU's HPCC.

Using the roofline model, we can quantify the architectural efficiency of K-Athena, i.e., the fraction of performance achieved compared to the theoretical maximum performance of the algorithm as limited by bandwidth. In this work, we further distinguish multiple architectural efficiencies per platform as limited by the bandwidth of different memory spaces. The architectural efficiency 𝑒(𝑎, 𝑝, 𝑖, 𝑚) of the application 𝑎 solving the problem 𝑝 on platform 𝑖 as limited by the bandwidth of the memory space 𝑚 on platform 𝑖 is

𝑒(𝑎, 𝑝, 𝑖, 𝑚) = 𝜀(𝑎, 𝑝, 𝑖) / min( 𝑇Peak(𝑖), 𝐵(𝑖, 𝑚) × 𝐼(𝑎, 𝑝, 𝑖, 𝑚) ),    (4.2)

where 𝜀(𝑎, 𝑝, 𝑖) is the achieved performance of the application 𝑎 for solving the problem 𝑝 on the platform 𝑖, 𝐵(𝑖, 𝑚) is the peak bandwidth of the memory space 𝑚 on the platform, and 𝐼(𝑎, 𝑝, 𝑖, 𝑚) is the arithmetic intensity of the application for solving the problem on that platform. For example, on Summit's Volta V100s, K-Athena achieves 0.82 TFLOPS while the DRAM bandwidth limits performance to 1.13 TFLOPS, giving a 72.5% architectural efficiency as limited by DRAM bandwidth.

Although bandwidths and throughputs can be obtained from vendor specifications and arithmetic intensities can be computed by hand, empirical testing more accurately reflects the actual performance. Acquiring these metrics requires a variety of performance profiling tools on the different architectures and machines. For gathering the bandwidths and throughputs on GPUs, we used GPUMembench Konstantinidis & Cotronis (2017) for measuring the L1 bandwidth and the Empirical Roofline Tool (Version 1.1.0) Lo et al. (2015) for measuring all other bandwidths and the peak throughput. For computing arithmetic intensities on GPUs, we used NVIDIA's nvprof (CUDA Toolkit 9.2.88 on MSU HPCC, 9.2.148 on SDSC Comet) to measure memory usage to calculate arithmetic intensities and total FLOP count to estimate FLOP per finite volume cell update. To measure memory usage of the different caches, we specifically measured total memory transactions from global memory to the SMs (gld_transactions and gst_transactions, as a rough proxy for L1 usage), transactions to and from L2 cache (l2_read_transactions and l2_write_transactions), and transactions to and from DRAM/HBM (dram_read_transactions and dram_write_transactions). Since we do not use atomic memory operations, texture memory, or shared memory, we measured zero transactions from these memory spaces.
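To make explicit how these measurements enter Eqs. (4.1) and (4.2), the following sketch combines a FLOP count, transaction counts, and measured bandwidth and throughput into an arithmetic intensity, a roofline bound for a single memory space, and an architectural efficiency. All numerical values are placeholders for illustration (the bandwidth and peak throughput are roughly those of a V100); real values come from the profilers and benchmarks described above, and the bytes-per-transaction factor depends on the specific counter and device.

#include <algorithm>
#include <cstdio>

int main() {
  // Placeholder measurements (not actual profiling output).
  const double flop_count = 1.1e12;           // total FLOP for the benchmark
  const double transactions = 2.4e10;         // e.g., dram_read + dram_write
  const double bytes_per_transaction = 32.0;  // depends on counter and device
  const double wall_time = 1.3;               // seconds for the benchmark
  const double B = 782e9;                     // measured DRAM/HBM bandwidth [B/s]
  const double T_peak = 7.79e12;              // measured peak throughput [FLOP/s]

  const double bytes = transactions * bytes_per_transaction;
  const double I = flop_count / bytes;               // arithmetic intensity
  const double P_achieved = flop_count / wall_time;  // achieved FLOP/s
  const double P_max = std::min(T_peak, B * I);      // Eq. (4.1), one memory space
  const double efficiency = P_achieved / P_max;      // Eq. (4.2)
  std::printf("I = %.2f FLOP/B, efficiency = %.1f%%\n", I, 100.0 * efficiency);
  return 0;
}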
For Intel CPUs and KNLs, we used Intel Advisor's (version 2019 update 5) built-in hierarchical roofline gathering tools to collect memory bandwidths, throughputs, and arithmetic intensities Marques et al. (2017), using the arithmetic intensity from the cache-aware roofline model for the roofline of the highest memory level. For both CPUs and GPUs, we use total memory transactions to cores and SMs as a surrogate for L1 cache usage due to limitations in the memory transaction metrics available. Although some of the memory transactions may not be through L1 cache, in a best case performance scenario the memory transactions to the registers are limited by the fastest cache bandwidth, which is the L1 cache bandwidth.

We used a 3D linear wave on a 256³ cell grid for benchmarking K-Athena's performance and arithmetic intensities for the roofline model. Our metric for CPU machines is for two sockets on a node, while the metric for KNLs and GPUs is for a single device, or a single GK210 chip for the Tesla K80. In all cases we found that K-Athena's performance is limited by the main memory space that accommodates the data for a single MPI task. For GPUs, this is the on-device DRAM/HBM, for CPUs this is the DDR3/DDR4 DRAM, and for KNLs this was the MCDRAM. This result is expected, since the finite volume MHD method in K-Athena is implemented as a series of simple triple or quadruple for-loop kernels that loop over the data in a task without explicitly caching data. Since the data can only fit in its entirety in DRAM, it must be loaded from and written to DRAM within each kernel. Future improvements can be made to K-Athena to explicitly cache data in smaller 1D arrays kept in higher level caches. This would raise the DRAM arithmetic intensity and facilitate faster throughput Glines et al. (2015). Similar improvements have already been implemented upstream in Athena++. A more complete solution would involve fusing consecutive kernels into one kernel to reduce DRAM accesses. Given the virtually identical performance between Athena++ and K-Athena on CPUs (cf. Sec. 4.4.3.1), we expect the roofline model of Athena++ to be practically indistinguishable from that of K-Athena on non-GPU platforms.

4.4.2.3 Performance portability metric

Performance portability is at present nebulously defined. It is generally held that a performance portable application can execute on a wide variety of architectures and achieve acceptable performance, preferably maintaining a single code base for all architectures. In order to make valid comparisons between codes, an objective metric of performance portability is needed. The metric proposed by Pennycook et al. (2019) quantifies performance portability by the harmonic mean of the performance efficiencies achieved on each platform, so that

𝑃(𝑎, 𝑝, 𝐻) = |𝐻| / Σ_{𝑖∈𝐻} [1/𝑒(𝑎, 𝑝, 𝑖)]  if 𝑖 is supported ∀𝑖 ∈ 𝐻, and 𝑃(𝑎, 𝑝, 𝐻) = 0 otherwise,    (4.3)

where 𝐻 is the space of all relevant platforms and 𝑒(𝑎, 𝑝, 𝑖) is the performance efficiency of application 𝑎 to solve the problem 𝑝 on a platform 𝑖. If an application does not support a platform, then it is not performance portable across the platforms and is assigned a metric of 0. The performance efficiency can be defined either as the application efficiency, the fraction of the performance of the fastest application that can solve the problem on the platform, or as the architectural efficiency, the achieved fraction of the theoretical peak performance limited by the hardware that we computed in Sec. 4.4.2.2.
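The metric itself reduces to a short calculation; the sketch below implements the harmonic mean of Eq. (4.3) with a hard zero for unsupported platforms. The efficiency values in the example are placeholders, not the measured per-architecture efficiencies behind the numbers reported below.

#include <cstdio>
#include <vector>

// Harmonic-mean performance portability metric of Eq. (4.3).
// Returns 0 if any platform in H is unsupported (efficiency <= 0).
double PerformancePortability(const std::vector<double>& efficiencies) {
  double inv_sum = 0.0;
  for (double e : efficiencies) {
    if (e <= 0.0) return 0.0;  // unsupported platform
    inv_sum += 1.0 / e;
  }
  return static_cast<double>(efficiencies.size()) / inv_sum;
}

int main() {
  // Placeholder architectural efficiencies for a hypothetical set of platforms.
  const std::vector<double> e = {0.72, 0.65, 0.55, 0.60};
  std::printf("P = %.1f%%\n", 100.0 * PerformancePortability(e));
  return 0;
}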
Since we did not have MHD codes implementing the same method as K-Athena on all architectures, we used the architectural efficiencies obtained from the roofline model to compute the performance portability metric. For completeness, we considered the architectural efficiencies as limited by both the L1 cache and DRAM bandwidths to compute separate performance portability metrics against both memory spaces.

Figure 4.3: Performance portability plot of several CPU and GPU machines with different architectures. Individual bars show the performance of K-Athena compared to the theoretical peak performance limited by the empirically measured DRAM and L1 bandwidths. Black bars with diamonds denote the theoretical performance limited by the manufacturer reported bandwidths. The performance portability metrics across all architectures for DRAM and L1 are shown with horizontal orange lines, where solid orange uses the empirically measured bandwidths and dashed orange uses the manufacturer reported bandwidths. The architectures shown were measured on the following machines: Pleiades, Electra, Aitken, Stampede 2, MSU HPCC, and Comet.

In Fig. 4.3, the architectural efficiencies as measured against the DRAM bandwidth and L1 cache bandwidth are shown with the computed performance portability metrics. K-Athena achieved 62.8% DRAM performance portability and 7.7% L1 cache performance portability, measured across a number of CPU and GPU architectures. In general, K-Athena achieved higher efficiencies on newer architectures. The high L1 efficiency on the NVIDIA Tesla Pascal P100 is due to a lower obtainable number of bytes loaded to L1 per cycle compared to the Kepler and Volta GPUs Jia et al. (2018, 2019); the lower L1 cache performance makes it easier to obtain a higher efficiency.

4.4.3 Scaling

4.4.3.1 Single CPU and GPU performance

Figure 4.4: Raw performance for double precision MHD (algorithm described in Sec. 4.4) of K-Athena, Athena++, and GAMER on a single GPU (left) or CPU (right) for varying problem sizes. Volta refers to an NVIDIA V100 GPU, Pascal refers to an NVIDIA P100 GPU, BDW (Broadwell) refers to a 14-core Xeon E5-2680 CPU, and SKX (Skylake) refers to a 20-core Xeon Gold 6148 CPU. The GAMER numbers were reported in Zhang et al. (2018) for the same algorithm used here.

In order to compare the degree to which the refactoring of Athena++ affected performance, we first compare Athena++ and K-Athena on a single CPU. The right panel of Fig. 4.4 shows the cell-updates/s achieved on an Intel Broadwell and an Intel Skylake CPU for both codes for varying problem size. Overall, the achieved cell-updates/s are practically independent of problem size, reaching ≈ 8 × 10⁶ on a single Broadwell CPU and ≈ 1.4 × 10⁷ on a single Skylake CPU. Moreover, without any additional performance optimizations (see discussion in Sec. 4.5), K-Athena is virtually on par with Athena++, reaching 93% or more of the original performance. For comparison, we also show the results of GAMER Zhang et al. (2018).
It is another recent (astrophysical) MHD code with support for CPU and (CUDA-based) GPU accelerated calculations and has directly been compared to Athena++ in Zhang et al. (2018). We also find that Athena++ (and thus K-Athena) is about 1.5 times faster than GAMER on the same CPU.

A slightly smaller difference (a factor of ≈ 1.25) is observed when comparing results for GPU runs, as shown in the left panel of Fig. 4.4. On a P100 Pascal GPU, K-Athena is about 1.3 times faster than GAMER, suggesting that the difference in performance is related to the fundamental code design and not to the implementation of specific computing kernels. On a single V100 Volta GPU, K-Athena reaches a peak performance of greater than 10⁸ cell-updates/s for large problem sizes. In general, the achieved performance in cell-updates/s is strongly dependent on the problem size. For small grids the performance is more than one order of magnitude lower than what is achieved for the largest permissible grid sizes that still fit into GPU memory. The plateau in performance on GPUs at larger grid sizes is due to DRAM bandwidth impeding K-Athena's performance, as discussed in Section 4.4.2.2.

4.4.3.2 Weak scaling

Weak scaling results (using the same test problem and algorithm as in Sec. 4.4.3.1) for K-Athena and the original Athena++ version on different systems and architectures are shown in Fig. 4.5. Note that the chosen problem setup (using a single meshblock per MPI process) effectively does not make use of the asynchronous communication capabilities to allow for overlapping computation and communication.

Figure 4.5: Weak scaling for double precision MHD (exact algorithm described in Sec. 4.4) on different supercomputers and architectures for K-Athena and the original Athena++ version. Numbers correspond to the 80th percentile of individual cycle performances of several runs in order to reduce effects of network variability. The top row shows the raw performance in number of cell-updates per second per node and can directly be compared between different systems and architectures. The bottom row shows the parallel efficiency normalized to the individual single node performance. The first column contains results for a workload of 64³ and 128³ cells per core on NASA's Electra system using two 20-core Intel Xeon Gold 6148 processors per node. The second column shows results for a workload of 64³ per core on ALCF's Theta system with one 64-core Intel Xeon Phi 7230 (Knights Landing) per node. HT-1, HT-2, and HT-4 refer to using 1, 2, and 4 hyperthreads per core, respectively. The third column shows results for a workload of 128³ per CPU core and 192³ per GPU on OLCF's Titan system with one AMD Opteron 6274 16-core CPU and one NVIDIA K20X (Kepler) GPU per node. The last column contains results for a workload of 64³ per CPU core and 256³ per GPU on OLCF's Summit system with two 21-core IBM POWER9 CPUs and six NVIDIA V100 (Volta) GPUs per node. On all systems the GPU runs used 1D loops and the CPU runs used simd-for loops, with the exception of the dashed purple line on Summit that used Kokkos nested parallelism; see Sec. 4.3.3 for more details.

Overall, the differences between K-Athena and Athena++ on CPUs and Xeon Phis are marginal. This is expected as K-Athena employed simd-for loops for all kernels that are similar to the ones already in Athena++. Therefore, the parallel efficiency is also almost identical between both codes, reaching ≈ 80% on NASA's Electra system with Skylake CPUs (first column in Fig. 4.5) and ≈ 70% on ALCF's Theta system with Knights Landing Xeon Phis (second column in Fig. 4.5) at 2,048 nodes each. Using multiple hyperthreads per core on Theta has no significant influence on the results given the intrinsic variations observed on that system (according to the ALCF support staff, system variability contributes around 10% to the fluctuations in performance between identical runs).

The first major difference is observed on OLCF's Titan (third column in Fig. 4.5), where results for K-Athena on GPUs are included. While the parallel efficiency for both codes remains at 94% up to 8,192 nodes using only CPUs, it drops to 72% when using GPUs with K-Athena. However, the majority of the loss in parallel efficiency already occurs going from 1 to 8 nodes using GPUs and
afterwards remains almost flat. This behavior is equally present for CPU runs but is less visible due to the higher parallel efficiency in general. The differences in parallel efficiency between CPU and GPU runs can be attributed to the vastly different raw performance of each architecture. On a single node the single Kepler K20X GPU is about 7 times faster than the 16-core AMD Opteron CPU. Given that the interconnect is identical for GPU and CPU communication, the effective ratio of computation to communication is worse for GPUs. Despite the worse parallel efficiency on GPUs, the raw per-node performance using GPUs is still about 5.5 times faster than using CPUs at 8,192 nodes, which is overall comparable to the ratio of theoretical peak performances in both FLOPS and DRAM bandwidth.

K-Athena on OLCF's Summit system (last column in Fig. 4.5) with six Volta V100 GPUs and two 21-core POWER9 CPUs exhibits a GPU weak scaling behavior similar to the one observed on Titan. Going from 1 to 8 nodes results in a loss of 15%, and afterwards the parallel efficiency remains almost flat, reaching 76% on 4,096 nodes. The CPU weak scaling results for both codes reveal properties of the interconnect. The weak scaling is almost perfect up to 256 nodes using 1 hyperthread per core and afterwards rapidly plummets. Using 2 hyperthreads per core (i.e., doubling the number of threads making MPI calls and doubling the number of MPI messages sent and received, as described in Sec. 4.3.2), the steep drop in parallel efficiency is already observed beyond 128 nodes. No such drop is observed using GPUs, which perform 42/6 = 7 times fewer MPI calls (compared to using 1 hyperthread per core) with larger message sizes in general. Naturally, this is tightly related to the existing communication pattern in Athena++, i.e., coarse grained threading over meshblocks with each thread performing one-sided MPI calls. Without making additional changes to the code base, we can evaluate the effect of reducing the number of MPI calls for a fixed problem size in a multithreaded CPU setup using Kokkos nested parallelism in K-Athena.
More specifically, we use the triple nested construct illustrated in Listing 4.4, allowing multiple threads to handle a single meshblock. As a proof of concept, the results for using 1 MPI process per 2 cores, each with one thread, are shown as the purple dashed line in the last column of Fig. 4.5. While the raw performance on a single node is slightly lower (by about 16%), the improved communication pattern results in a higher overall performance for > 1,024 nodes. Similarly, the sharp drop in parallel efficiency has been shifted to first occur at 2,048 nodes.

At the single node level the six GPUs on Summit are tightly connected via NVLink. The weak scaling efficiency from one GPU to six GPUs on a single node is ≈ 99% (cf. > 6 × 10⁸ cell-updates/s/node for a single node in the top right panel of Fig. 4.5). In addition, the host interconnect has a lower bandwidth and higher latency compared to NVLink. Thus, the intra-node parallel overhead is generally negligible in our analysis.

Finally, the raw per-node performance is overall comparable between Intel Skylake CPUs, Intel Knights Landing Xeon Phis, IBM POWER9 CPUs, and a single NVIDIA Kepler GPU, ranging between ≈ 1.5–3 × 10⁷ cell-updates/s/node. The latest NVIDIA Volta GPU is a notable exception, reaching more than 10⁸ cell-updates/s/GPU. This performance, in combination with six GPUs per node on Summit and a high parallel efficiency, results in a total performance of 1.94 × 10¹² cell-updates/s on 4,096 nodes.

4.4.4 Strong scaling

Strong scaling results for K-Athena on Summit on both CPUs and GPUs are shown in Fig. 4.6 (same test problem and algorithm as in Sec. 4.4.3.1).

Figure 4.6: Strong parallel scaling for double precision MHD (algorithm described in Sec. 4.4) of K-Athena on NVIDIA V100 GPUs (6 GPUs per node; green solid lines) and IBM POWER9 CPUs (42 cores per node; orange/red dash-dotted lines) on Summit. The top panel shows the raw performance in cell-updates per second per node and the bottom panel shows the parallel efficiency. The effective workload per GPU goes from 256³ to 64³ for the 1,536³ domain and from 256³ to 128³ for the 3,072³ domain. In the CPU case the effective workload per single POWER9 CPU (21 cores) goes from 353³ to 88³ for the 1,408³ domain and from 353³ to 177³ for the 2,944³ domain. The resulting effective workloads per node are comparable (within a few percent) between GPU and CPU runs.

Overall, strong scaling in terms of parallel efficiency is better on CPUs than on GPUs. For example, for a 1,408³ domain the parallel efficiency using CPUs remains > 83% going from 32 to 512 nodes, whereas it drops to 45% for the similar GPU case (a 1,536³ domain using 36 to 576 nodes). This is easily explained by comparing to the single CPU/GPU performance discussed in Sec. 4.4.3.1, which effectively corresponds to on-node strong scaling. The more pronounced decrease in parallel efficiency on the GPUs is a direct result of the decreased raw performance of GPUs with smaller problem sizes per GPU. The increased communication overhead of the strong scaling test plays only a secondary role. Therefore, the strong scaling efficiency of K-Athena in comparison to Athena++ is expected to be identical. Moreover, additional performance improvements, as discussed in the following section, will greatly benefit the strong scaling behavior of GPUs in general. Nevertheless, the raw performance of the GPUs
still outperforms the CPUs by a large multiple despite the worse strong scaling parallel efficiency. For example, in the case discussed above on Summit, the per-node performance of GPUs over CPUs is still about 14 times higher at > 512 nodes.

4.5 Current limitations and future enhancements

Our primary goal for the current version of K-Athena was to make GPU-accelerated simulations possible while maintaining CPU performance, and to do so with the smallest amount of code changes necessary. Naturally, this resulted in several trade-offs and leaves room for further (performance) improvements in the future. For example, we are currently not making use of the memory hierarchy abstraction provided by Kokkos. This includes more advanced hardware features such as scratch spaces on GPUs. Scratch space can be shared among threads of a TeamPolicy and allows for efficient reuse of memory. We could use scratch space to reduce the number of reads from DRAM in stenciled kernels (like the fluid solver's reconstruction step). We could also fuse consecutive kernels to further reduce reads and writes to DRAM, although this would also increase register and possibly spill store usage. Moreover, complex kernels such as a Riemann solver could be broken down further by using TeamThreadRange and ThreadVectorRange structures that are closer to the structure of the algorithm. This is in contrast to our current approach where all kernels are treated equally, with the same execution policies independent of the individual algorithms within the kernels. The Riemann solver could also be split into separate kernels to reduce the number of registers needed, eliminate the use of spill stores on the GPU, and allow higher occupancy on the GPU.

Similarly, on CPUs and Xeon Phis we are currently not using a Kokkos parallel execution pattern. The macro we introduced to easily exchange parallel patterns replaces the parallel region on CPUs and Xeon Phis with a simple nested for loop including a simd pragma, as shown in Listing 4.1. This is required for maximum performance, as the implicit #pragma ivdep hidden in the Kokkos templates is less aggressive than the explicit #pragma omp simd with respect to automated vectorization. We reported this issue, and future Kokkos updates will address it by either providing an explicit tightly nested vectorized loop pattern and/or adding support for a simd property to the execution policy template.

Another possible future improvement is an increase in parallel efficiency by overlapping communication and computation. While Athena++ is already built for asynchronous communication through one-sided MPI calls and a task based execution model, more fine-grained optimizations are possible. For example, spatial dimensions in the variable reconstruction step that occurs after the exchange of boundary information could be split, so that the kernel for the first dimension could run while the boundary information for the second and third dimensions is still being exchanged. In addition, the next major Kokkos release will contain more support for architecture-dependent task based execution and, for example, will allow for the transparent use of CUDA streams.

CUDA streams may also help in addressing another current limitation of K-Athena on GPUs. Our minimal implementation approach currently limits all meshblocks to be allocated in a fixed memory space.
This means that the total problem size that can currently be addressed with K-Athena is limited by the total amount of GPU memory available. An alternative approach is keeping the entire mesh in system memory, which is still several times larger than the GPU memory on most (if not all) current machines. For the execution of kernels, individual meshblocks would be copied back and forth between system memory and GPU memory. Here, CUDA streams could be used to hide these expensive memory transfers, as they would occur in the background while the GPU is executing different kernels. Theoretically, meshes larger than the GPU memory could already be used right now with the help of unified memory. However, given that the code is not optimized for efficient page migrations, the resulting performance degradation is large (more than a factor of 10). Thus, using unified memory with meshes larger than the GPU memory is not recommended.

4.6 Conclusions

We presented K-Athena, a Kokkos-based performance portable version of the finite volume MHD code Athena++. Kokkos is a C++ template library that provides abstractions for on-node parallel regions and the memory hierarchy. Our main goal was to enable GPU-accelerated simulations while maintaining Athena++'s excellent CPU performance using a single code base and with minimal changes to the existing code.

Generally, four main changes were required in the refactoring process. We changed the underlying memory management in Athena++'s multi-dimensional array class to transparently use Kokkos's equivalent multi-dimensional arrays, i.e., Kokkos::Views. We exchanged all (tightly) nested for loops with the Kokkos equivalent parallel region, e.g., a Kokkos::parallel_for, which are now kernels that can be launched on any supported device. We inlined all support functions (e.g., the equation of state) that are called within kernels. We changed the communication buffers to be Views so that MPI calls between GPU buffers are directly possible without going through system memory.

With all changes in place we performed both profiling and scaling studies across different platforms, including NASA's Electra system with Intel Skylake CPUs, ALCF's Theta system with Intel Xeon Phi Knights Landing, OLCF's Titan with AMD Opteron CPUs and NVIDIA Kepler GPUs, and OLCF's Summit machine with IBM Power9 CPUs and NVIDIA Volta GPUs. Using a roofline model analysis, we demonstrated that the current implementation of the MHD algorithms is memory bound by either the DRAM, HBM, or MCDRAM bandwidths on CPUs and GPUs. Moreover, we calculated a performance portability metric of 62.8% across Xeon Phis, and 6 CPU and 3 GPU generations. Detailed Kokkos profiling revealed that there is currently no universal Kokkos execution policy (how a parallel region is executed) that achieves optimal performance across different architectures. For example, a one-dimensional loop with manual index mapping from 1D to 3D/4D is fastest on GPUs (achieving > 10⁸ double precision MHD cell-updates/s on a single NVIDIA V100 GPU), whereas tightly nested for loops with simd directives are fastest on CPUs. This is primarily a result of Kokkos's specific implementation details and is expected to improve in future releases through more flexible execution policies. Strong scaling on GPUs is currently predominantly limited by individual GPU performance and not by communication. In other words, insufficient GPU utilization outweighs additional performance overhead with decreasing problem size per GPU.
Weak scaling is generally good, with parallel efficiencies of 80% and higher for more than 1,000 nodes across all machines tested. Notably, on Summit K-Athena achieves a total calculation speed of 1.94 × 10^12 cell-updates/s on 24,576 V100 GPUs at a speedup of 30 compared to using the available 172,032 CPU cores. Finally, there is still a great deal of untapped potential left, e.g., using more advanced hardware features such as fine-grained nested parallelism, scratch pad memory (i.e., fast memory that can be shared among threads), or CUDA streams. These are currently being addressed within the new Parthenon collaboration (https://github.com/lanl/parthenon), which is developing a performance portable adaptive mesh refinement framework based on the results presented here. Nevertheless, we achieved our primary performance portability goal of enabling GPU-accelerated simulations while maintaining CPU performance using a single code base. Moreover, we consider the current results highly encouraging and will continue with further development on the project's GitLab repository at https://gitlab.com/pgrete/kathena. Contributions of any kind are welcome! CHAPTER 5 RELATIVISTIC DISCONTINUOUS-GALERKIN HYDRODYNAMICS The work in this chapter was completed under an internship with Sandia National Laboratories. This chapter has been submitted to the Journal of Computational Physics with SAND # SAND2022-1601 O as Glines et al. (2022). This work was supported in part by LDRD project # 209240. I include the original abstract as the introduction to this chapter. 5.1 Chapter Abstract In this work, we present a discontinuous-Galerkin method for evolving relativistic hydrodynamics. We include an exploration of analytical and iterative methods to recover the primitive variables from the conserved variables for the ideal equation of state and the Taub-Matthews approximation to the Synge equation of state. We also present a new operator for enforcing a physically permissible conserved state at all basis points within an element while preserving the volume average of the conserved state. We implement this method using the Kokkos performance-portability library to enable running at performance on both CPUs and GPUs. We use this method to explore the relativistic Kelvin-Helmholtz instability compared to a finite volume method. Last, we explore the performance of our implementation on CPUs and GPUs. 5.2 Introduction Many high energy astrophysical and terrestrial plasmas attain relativistic velocities and temperatures. Examples from astrophysics include jets from active galactic nuclei (Blandford et al., 2019), accretion flows onto black holes (Villiers et al., 2003), and gamma-ray bursts (Kumar & Zhang, 2015). In terrestrial systems, relativistic flows can also play a crucial role in a broad range of accelerator systems, including magnetically insulated transmission lines (MITLs) utilized in (for example) the Z machine at Sandia National Laboratories (Sinars et al., 2020). In all of these plasmas, velocities close to the speed of light lead to an apparent increase of mass as measured by a stationary observer, while relativistic particle velocities at high temperatures lead to a non-linear increase in pressure. Non-relativistic hydrodynamics is insufficient to model such flows – a relativistic treatment of the fluid is required. Numerical solutions for relativistic hydrodynamics were first pioneered in the 1960s and 1970s by May & White (1966) and Wilson (1972).
High-resolution shock-capturing solutions followed suit, with an early review of those methods given by Martí & Müller (2003). When modeling complex systems with small time step constraints, higher order methods are advantageous for efficiently achieving high accuracy. Discontinuous Galerkin methods have become standard in fluid dynamics for enabling high-order methods in complex geometries. High-order discontinuous-Galerkin methods afford enhanced data locality when compared with finite volume methods of similar order (Fuhry et al., 2014). Given the trend in compute performance outpacing memory performance in newer architectures such as graphics processing units (GPUs), the higher arithmetic intensity of discontinuous-Galerkin methods will permit higher computational efficiency due to higher arithmetic intensity algorithms using more of the growing computational throughput while using less of the stagnant memory bandwidth, enabling higher fidelity simulations compared to finite volume simulations for equivalent computational resources. In this work, we present a robust, performance-portable discontinuous-Galerkin method for relativistic hydrodynamics. In §5.3.1 we present a formulation of the equations of relativistic hydrodynamics that allows for a range of equations of state; we present two such possibilities: (1) an ideal equation of state, which approximates a perfect gas but assumes a constant adiabatic index for a relativistic perfect gas, and (2) an approximation to the Synge gas from Mathews (1971), where the Synge equation of state models a relativistic perfect gas (Synge, 1957). We discuss the discretization of the system using a discontinuous-Galerkin technique and discuss strong-stability- preserving time discretization techniques. To enable robust higher order discretization, in §5.3.5 we present a new and novel physicality-enforcing operator for discontinuous-Galerkin methods for relativistic hydrodynamics. The method smooths conserved variables within individual cells to the cell volume averages until all basis points within the cell satisfy conditions for physicality (i.e. positive density and pressure and flow speed under the speed of light). We implement the method 140 for relativistic hydrodynamics using the Kokkos performance portability library to enable running on both CPUs and GPU (Carter Edwards et al., 2014). A key part of any algorithm for relativistic hydrodynamics is the method by which the non- linear relationship between primitive variables and the conserved state is solved. In §5.4, we compare analytical and iterative methods for recovering the primitive variables from the conserved variables for both equations of state, across a range of different hardware platforms and compilers as facilitated by Kokkos, finding that for the ideal gas our iterative method following Riccardi & Durante (2008) is faster, more robust, and more accurate than an analytical method, but the exact reverse is true for an approximation to the Synge gas. We proceed to validate the method using several tests (discussed in detail in §5.5), exploring convergence of the method to analytical solutions of relativistic linear waves, convergence to high resolution reference solutions of a range of 1D shock tubes, evolution of 2D Riemann problems, and growth rates of the relativistic Kelvin-Helmholtz instability with two different initial perturbations. Using a 0th order basis, we find that the method performs comparably to 1st order finite volume methods, as expected. 
Using higher order bases, we see the expected level of convergence for smooth flows. In fluid systems with shocks, the method requires the physicality-enforcing operator presented here and exhibits expected rates of convergence around shocks. Additionally, with the exploration of the growth rate of the Kelvin-Helmholtz problem, we show that using the more accurate HLLC Riemann solver (Mignone & Bodo, 2006) instead of the HLL solver (Schneider et al., 1993) has a greater impact on the growth rate than basis order or resolution. We further utilize this test problem to demonstrate a range of performance portability results in §5.5.6 before summarizing our results and conclusions in §5.6. 5.3 Theoretical Background and Discretization In this section, we describe our method for relativistic hydrodynamics in a discontinuous-Galerkin code, starting by reviewing the equations for relativistic hydrodynamics in §5.3.1, including a discussion of the equation of state. Then, in §5.3.3, we give the general discontinuous-Galerkin method for solving the relativistic hydrodynamics equations as a set of hyperbolic equations, with computation of fluxes given in §5.3.4. Last, in §5.3.5, we present a new operator that enforces physicality of all basis points within a cell while maintaining the volume average within the cell. 5.3.1 Special Relativistic Hydrodynamics The special relativistic hydrodynamics equations for a relativistic fluid are given by a set of hyperbolic conservation laws, $\partial_t \mathbf{U} + \nabla \cdot \mathbf{F}[\mathbf{W}(\mathbf{U})] = 0$ (5.1), where the conserved variables $\mathbf{U} = [D, \mathbf{M}, E]^T$ are the relativistic density, the relativistic momentum density, and the total energy density including energy from the rest mass. The flux is $$\mathbf{F}[\mathbf{W}(\mathbf{U})] = \left[\, \rho\mathbf{u},\ \ \frac{\rho h}{c^2}\,\mathbf{u}\otimes\mathbf{u} + P\,\mathbf{I},\ \ \gamma\rho h\,\mathbf{u} \,\right]^T, \qquad (5.2)$$ where the rest mass density $\rho$, the three spacelike components of the 4-velocity denoted here with $\mathbf{u}$, and the pressure $P$ constitute the primitive state $\mathbf{W}(\mathbf{U}) = [\rho, \mathbf{u}, P]^T$. The specific enthalpy $h$ is given by $$h = \frac{e + P}{\rho}, \qquad (5.3)$$ where $e$ is the internal energy density including the rest-mass energy. The conserved state $\mathbf{U}$ can be determined from the primitive state $\mathbf{W}$ by $$\mathbf{U} = \begin{bmatrix} \gamma\rho \\ \gamma (e + P)\,\mathbf{u}/c^2 \\ \gamma^2 (e + P) - P \end{bmatrix} = \begin{bmatrix} \gamma\rho \\ \gamma\rho h\,\mathbf{u}/c^2 \\ \gamma^2 \rho h - P \end{bmatrix} \equiv \begin{bmatrix} D \\ \mathbf{M} \\ E \end{bmatrix}, \qquad (5.4)$$ where $\gamma \equiv \sqrt{1 + |\mathbf{u}|^2/c^2}$ is the Lorentz factor and $D$, $\mathbf{M}$, and $E$ are the relativistic density, relativistic momentum density, and total energy density, respectively. We also find it convenient to use the three-velocity $\mathbf{v}$ at times, which relates to $\mathbf{u}$ following $\mathbf{u} = \gamma\mathbf{v}$ and to the Lorentz factor following $\gamma = 1/\sqrt{1 - |\mathbf{v}|^2/c^2}$. 5.3.2 Equations of State The relativistic hydrodynamics equations in Eq. 5.1 are not complete; an equation of state is used to close the system. Following Ryu et al. (2006), we express the equation of state by relating $h$ to the primitive variables, $h \equiv h(\rho, P)$ (5.5). The equation of state also determines the sound speed $c_s$, which is given by $$c_s^2 = -\frac{\rho}{n h}\frac{\partial h}{\partial \rho} \quad\text{with}\quad n = \rho\frac{\partial h}{\partial P} - 1, \qquad (5.6)$$ where $n$ is the polytropic index. In this work, we explore two choices of equation of state: the equation of state of an ideal gas and the Taub-Matthews approximation to the Synge equation of state described in Mathews (1971). In a relativistic perfect gas, the adiabatic index decreases with temperature, starting at $\Gamma = 5/3$ for non-relativistic temperatures when $P/\rho \ll c^2$ and decreasing to $\Gamma = 4/3$ for relativistic temperatures when $P/\rho \gg c^2$.
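Before specifying the equations of state, it may help to see the primitive-to-conserved map of Eq. (5.4) written out explicitly; the following minimal C++ sketch (illustrative only; function and variable names are placeholders) takes the specific enthalpy h from whichever equation of state is in use, and the ideal-gas enthalpy used in the example main() is an assumption made for that example alone.

#include <cmath>
#include <cstdio>

// Illustrative only: map a primitive state (rho, u, P) and a given specific
// enthalpy h (supplied by the chosen equation of state) to the conserved
// state (D, M, E) following Eq. (5.4).
struct Conserved { double D, Mx, My, Mz, E; };

Conserved prim_to_cons(double rho, double ux, double uy, double uz,
                       double P, double h, double c) {
  const double c2 = c * c;
  const double u2 = ux * ux + uy * uy + uz * uz;
  const double gamma = std::sqrt(1.0 + u2 / c2);  // Lorentz factor from the 4-velocity
  Conserved U;
  U.D  = gamma * rho;                   // relativistic density
  U.Mx = gamma * rho * h * ux / c2;     // relativistic momentum density
  U.My = gamma * rho * h * uy / c2;
  U.Mz = gamma * rho * h * uz / c2;
  U.E  = gamma * gamma * rho * h - P;   // total energy density
  return U;
}

int main() {
  // Example: u_x = 0.5 c, with h taken here as c^2 + Gamma/(Gamma-1) P/rho
  // for an ideal gas with Gamma = 5/3 (an assumption for this example only).
  const double c = 1.0, rho = 1.0, P = 0.1, Gamma = 5.0 / 3.0;
  const double h = c * c + Gamma / (Gamma - 1.0) * P / rho;
  const Conserved U = prim_to_cons(rho, 0.5 * c, 0.0, 0.0, P, h, c);
  std::printf("D = %.6f, Mx = %.6f, E = %.6f\n", U.D, U.Mx, U.E);
  return 0;
}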
The equation of state of the perfect gas is given by the Synge gas (Synge, 1957) :   𝐾3 𝑐2 /Θ ℎ = 𝑐2 (5.7) 𝐾2 𝑐2 /Θ  where 𝐾2 and 𝐾3 are modified Bessel functions of the second kind and Θ ≡ 𝑃/𝜌 is a temperature- like variable. From a computational standpoint, however, there are significant drawbacks, as these Bessel functions are both expensive to compute and can introduce inaccuracy due to limited machine precision. Even worse, the Bessel functions need to be inverted to recover the primitive variables from conserved variables, which greatly increases computational costs. Consequently, approximations to the equation of state are usually used in simulations. The simplest approximation to the relativistic perfect gas is the ideal equation of state, which assumes a constant adiabatic index. The enthalpy for the ideal gas is given by Γ ℎ = 𝑐2 + Θ (5.8) Γ−1 143 where the constant Γ is the adiabatic index (ratio of specific heats.) The corresponding speed of sound is then: 𝑐2𝑠 Θ =Γ . (5.9) 𝑐2 ℎ For non-relativistic temperatures when Θ ≪ 𝑐2 , an adiabatic index of Γ = 5/3 best approximates the perfect gas (consistent with non-relativistic theory) while for relativistic temperatures when Θ ≫ 𝑐2 and adiabatic index of Γ = 4/3 is appropriate. The ideal equation of state is common for relativistic hydrodynamics simulations. However, relativistic fluid systems can have relativistic and non-relativistic temperatures simultaneously at different locations within the fluid, throwing into question the use of a constant adiabatic index across the simulation. Additionally, Taub (1948) showed that Γ ≥ 4/3 becomes inconsistent with relativistic kinetic theory as Θ/𝑐2 → ∞, suggesting that adiabatic indices above 4/3 are unphysical for ultra-relativistic temperatures. A more accurate approximation to the Synge gas that is still computationally efficient is the Taub-Matthews approximation to the Synge gas, which we will refer to as the Taub-Matthews equation of state (Mathews, 1971). In this approximation, the enthalpy is given by: √︂ 5 3 4 ℎ = Θ+ Θ2 + 𝑐4 (5.10) 2 2 9 with the corresponding sound speed: √︃ 𝑐2𝑠 3Θ2 + 5Θ Θ2 + 49 𝑐4 = . (5.11) 𝑐2 12Θ2 + 2𝑐4 + 12Θ Θ2 + 4 𝑐4 √︃ 9 The Taub-Matthews equation of state satisfies the conditions for causality at high temperatures while correctly approximating the ideal gas equation of state for a subrelativistic gas at low temperatures (Mathews, 1971). As such, the Taub-Matthews equation of state effectively simulates an ideal gas with an adiabatic index that varies from Γ = 5/3 as Γ = 4/3 as Θ is taken from Θ → 0 to Θ → ∞. More formally, this can be seen through defining an equivalent adiabatic index1 (see, e.g. Mignone 1 Note that since we have not defined a canonical equation of state for the Taub-Matthews equation of state (i.e. ℎ(𝑆, 𝑃) where 𝑆 is entropy), we have not defined a relationship with temperature 𝑇, and we cannot compute specific heat capacities and subsequently Γ. Hence the need for the proxy Γeq . 144 & McKinney, 2007): ℎ − 𝑐2 Γeq = , (5.12) ℎ − 𝑐2 − Θ This relationship, along with the enthalpy and speed of sound, for ideal gases with Γ = 4/3 and 𝛾 = 5/3, the Synge gas, and the Taub-Matthews equation of state is shown in Fig. 5.1. 5.3.3 Spatial and Temporal Discretizations In this work, spatial discretization of the hyperbolic conservation law, Eq. 5.1, is performed using a discontinuous-Galerkin method in a similar fashion as was proposed by Núñez-de la Rosa & Munz (2018), following on the influential sequence Cockburn & Shu (1989); Cockburn et al. 
(1989, 1990); Cockburn & Shu (1998). The discontinuous-Galerkin method requires a mesh defined as the subdivision of the domain into non-overlapping hexahedral (3D) or quadrilateral (2D) cells denoted $\Omega_k \subset \Omega \subset \mathbb{R}^d$. The approximation of the conserved variables on cell $\Omega_k$ is written $$\mathbf{U}(\mathbf{x}) \approx \mathbf{U}^h(\mathbf{x}) = \sum_i \mathbf{U}_i\,\phi_i(\mathbf{x}), \qquad \mathbf{x} \in \Omega_k, \qquad (5.13)$$ where the set $\{\phi_i(\mathbf{x})\}$ is a linearly independent basis that spans a polynomial space of fixed order on element $\Omega_k$. Lagrange polynomials are employed here, where the nodal points are denoted as $\mathbf{x}_j$ such that $$\phi_i(\mathbf{x}_j) = \delta_{ij}, \qquad (5.14)$$ where $\delta$ is the Kronecker delta function. Globally, $\mathbf{U}^h$ is defined as a piecewise polynomial function with discontinuities permitted at cell boundaries. The restriction of the numerical solution to a cell $\Omega_k$ is denoted $\mathbf{U}^h_k$. On each cell, the approximate solution to Eq. 5.1 is computed by enforcing that the residual is orthogonal to the test space, defined in the Galerkin fashion. Practically, after integration by parts, this implies the satisfaction of the weak form $$\int_{\Omega_k} \frac{\partial \mathbf{U}^h}{\partial t}\,\phi(\mathbf{x})\,d\mathbf{x} + \oint_{\partial\Omega_k} \overline{\mathbf{F}[\mathbf{W}^h(\mathbf{U})]}\cdot\mathbf{n}\,\phi(\mathbf{x})\,ds - \int_{\Omega_k} \mathbf{F}[\mathbf{W}^h(\mathbf{U})]\cdot\nabla\phi(\mathbf{x})\,d\mathbf{x} = 0, \qquad \forall\,\phi \in \{\phi_i\} \qquad (5.15)$$ on each cell. Figure 5.1: Enthalpy (top), sound speed (middle), and equivalent adiabatic index (bottom) as a function of the temperature proxy $\Theta/c^2$ for the Synge gas (solid blue), the ideal equation of state with a relativistic $\Gamma = 4/3$ (dashed orange) and a non-relativistic $\Gamma = 5/3$ (finely dashed green), and the Taub-Matthews approximation to the Synge gas (dot-dashed red). With the Synge and Taub-Matthews equations of state, each of the quantities shown here varies smoothly between the two extremes of the ideal equation of state as $\Theta/c^2$ changes from non-relativistic to relativistic. The Taub-Matthews equation of state provides a reasonable approximation to the Synge gas while remaining simple for computation. The second term in Eq. 5.15 is the integral of the normal flux over the surface of an element. The solution at cell interfaces is double-valued, as indicated by the overline; one value corresponds to the data inside the cell, the other to the neighboring cell. As such, the solution is discontinuous and the flux must be computed using a Riemann solver in a fashion similar to the finite volume method. We have implemented two approximate Riemann solvers, HLL and HLLC, discussed in §5.3.4. Beyond the choice of Riemann solver, the discrete conservation law, Eq. 5.15, can admit a range of different basis orders. A first order basis (e.g. piecewise constant) will eliminate the contribution of $\int_{\Omega_k} \mathbf{F}[\mathbf{W}^h(\mathbf{U})]\cdot\nabla\phi(\mathbf{x})\,d\mathbf{x}$, resulting in a scheme equivalent to a first order finite volume discretization. Moving to higher order bases (e.g. piecewise linear, etc.) will introduce the need to provide additional stabilization (e.g. dissipation) at discontinuities and shocks. For this we use the Moe limiter from Moe et al. (2015) and the minmod limiter (van Leer, 1979), as well as the physicality-enforcing operator tailored for relativistic hydrodynamics that we discuss in detail in §5.3.5. Before the integrals in Eq. 5.15 can be computed, the primitive variables must be calculated for use in the numerical flux. There are two options for this computation: interpolate the conserved variables to quadrature points and compute the primitives there, or compute the primitives at the nodal points and interpolate those to quadrature points.
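As a concrete illustration of the nodal expansion in Eqs. (5.13)-(5.14), the following short sketch (illustrative only, restricted to one dimension with hypothetical node locations) evaluates the Lagrange basis at an arbitrary point and interpolates nodal coefficients there; the same machinery applies whether conserved or primitive values are stored at the nodes.

#include <cstdio>
#include <vector>

// Evaluate the 1D Lagrange basis function phi_i(x) defined by nodes xn,
// which satisfies phi_i(xn[j]) = delta_ij as in Eq. (5.14).
double lagrange_phi(const std::vector<double>& xn, std::size_t i, double x) {
  double phi = 1.0;
  for (std::size_t j = 0; j < xn.size(); ++j)
    if (j != i) phi *= (x - xn[j]) / (xn[i] - xn[j]);
  return phi;
}

// Interpolate nodal coefficients U_i to a point x, i.e. U_h(x) = sum_i U_i phi_i(x)
// as in Eq. (5.13), restricted to one dimension for clarity.
double interpolate(const std::vector<double>& xn,
                   const std::vector<double>& Ui, double x) {
  double Uh = 0.0;
  for (std::size_t i = 0; i < xn.size(); ++i) Uh += Ui[i] * lagrange_phi(xn, i, x);
  return Uh;
}

int main() {
  // Hypothetical order-2 element on [-1, 1] with equispaced nodes.
  const std::vector<double> xn = {-1.0, 0.0, 1.0};
  const std::vector<double> Ui = {1.0, 2.0, 1.5};  // nodal values of some field
  std::printf("U_h(0.3) = %.6f\n", interpolate(xn, Ui, 0.3));
  return 0;
}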
In Newtonian hydrodynamics, the primitive variables, W, can be recovered algebraically from the conserved state. As such, it is straightforward to interpolate the conserved quantities to the required quadrature point and recover the necessary primitive quantities to construct the flux. In relativistic hydrodynamics, such an algebraic recovery of the primitive quantities does not exist; prior work (see e.g. Beckwith & Stone, 2011) has demonstrated that, in the context of finite volume schemes, it is necessary to interpolate primitive variables (rather than conserved quantities) in order to ensure that the state remains physical (e.g. |v| 2 < 𝑐2 , 𝜌 > 0, 𝑃 > 0). Here, we follow a similar procedure: the primitive state is computed from the conserved state at the basis points and then interpolated to quadrature points in order to compute fluxes. In addition to enhanced stability, this minimizes the number of calls to the method that recovers the primitive variables from the conserved state, minimizing the impact that this routine has on overall algorithm performance (see §5.4 for further 147 discussion). Thus, the first step in the assembly is to compute the primitives at nodal points: W𝑖 = 𝑝(U𝑖 ) (5.16) where 𝑝 computes the primitive variables from the conserved (see Sec. 5.4 for specific details). With this expression, the primitives are easily interpolated to points within the cell using Eq. 5.13, yielding Í the primitive approximation W ℎ (x) = 𝑖 Wi 𝜙𝑖 (x). Thus a nonlinear conserved-to-primitive solve is required at each nodal point. The numerical quadrature for the volumetric contributions of the fluxes are computed as ∫ ∑︁ F [W ℎ (U)] · ∇𝜙(x)𝑑x ≈ 𝑤𝑞 F [W ℎ (x𝑞 )] · ∇𝜙(x𝑞 ) (5.17) Ω𝑘 𝑞 and the surface fluxes on the interface shared by Ω 𝑘 and Ω 𝑘 ′ are ∫ ∑︁ F [W ℎ (U)] · n𝜙(x)𝑑𝑠 ≈ 𝜔 𝑞 F (Wℎ𝑘 (x𝑞 ), W ℎ𝑘′ (x𝑞 )) · n𝜙(x𝑞 ). (5.18) 𝜕Ω 𝑘 ∩𝜕Ω𝑘 ′ 𝑞 Here it is understood that the quadrature rules are defined with respect to the domain of integration. The volumetric term (Eq. 5.17) requires evaluation of the flux at each quadrature point while the surface term (Eq. 5.18) requires evaluation of the numerical flux from cell 𝑘 and the neighbor 𝑘 ′ at each quadrature point. The temporal discretization we employ uses a multi-stage strong-stability preserving (SSP) Runge-Kutta time integrator similar to that described in Cockburn & Shu (1989); Cockburn et al. (1989, 1990); Cockburn & Shu (1998). SSP time discretization methods were designed to ensure nonlinear stability properties in the numerical solution of spatially discretized hyperbolic partial differential equations, such as Eq. 5.15. These methods assume that there is a time-step, Δ𝑡 𝐹𝐸 such that forward-Euler condition: ||U + Δ𝑡F [W(U)]|| ≤ ||U|| for 0 ≤ Δ𝑡 ≤ Δ𝑡 𝐹𝐸 (5.19) is satisfied for all U. An explicit Runge-Kutta (ERK) method is called SSP if the methods can be rewritten as a convex combination of forward Euler methods and the estimate ||U𝑛+1 || < ||U𝑛 || holds for the numerical solution of Eq. 5.15 whenever the condition given in Eq. 5.19 holds and 148 Δ𝑡 ≤ C𝑆𝑆𝑃 Δ𝑡 𝐹𝐸 , where C𝑆𝑆𝑃 is known as the SSP-coefficient. The convex combination above ensures that the strong stability property is also satisfied by the intermediate stages in a Runge-Kutta method ( see Gottlieb et al., 2011; Gottlieb, 2015). This may be desirable in many applications, notably in simulations that require positivity (Ferracina & Spijker, 2005, 2004; Higueras, 2004, 2005). 
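As a concrete example of such a convex combination of forward-Euler steps, the widely used optimal three-stage, third-order SSP Runge-Kutta scheme can be written as below; this is a generic sketch (the state container and the right-hand-side operator L are placeholders), not the specific implementation used in this chapter.

#include <cstdio>
#include <functional>
#include <vector>

// Optimal three-stage, third-order SSP Runge-Kutta step (Shu-Osher form),
// written as convex combinations of forward-Euler updates. L(u) stands for
// the spatially discretized right-hand side.
std::vector<double> ssprk3_step(
    const std::vector<double>& u, double dt,
    const std::function<std::vector<double>(const std::vector<double>&)>& L) {
  const std::size_t n = u.size();
  std::vector<double> u1(n), u2(n), unew(n);
  const std::vector<double> Lu = L(u);
  for (std::size_t i = 0; i < n; ++i) u1[i] = u[i] + dt * Lu[i];          // stage 1
  const std::vector<double> Lu1 = L(u1);
  for (std::size_t i = 0; i < n; ++i)
    u2[i] = 0.75 * u[i] + 0.25 * (u1[i] + dt * Lu1[i]);                   // stage 2
  const std::vector<double> Lu2 = L(u2);
  for (std::size_t i = 0; i < n; ++i)
    unew[i] = (1.0 / 3.0) * u[i] + (2.0 / 3.0) * (u2[i] + dt * Lu2[i]);   // stage 3
  return unew;
}

int main() {
  // Trivial usage example: du/dt = -u, one step from u = 1 with dt = 0.1.
  auto L = [](const std::vector<double>& u) {
    std::vector<double> r(u.size());
    for (std::size_t i = 0; i < u.size(); ++i) r[i] = -u[i];
    return r;
  };
  std::vector<double> u = {1.0};
  u = ssprk3_step(u, 0.1, L);
  std::printf("u after one SSPRK3 step: %.8f\n", u[0]);
  return 0;
}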
In this work, we make use of the second and third order schemes found in Shu & Osher (1989), which were proved to be optimal in Gottlieb & Shu (1998). 5.3.4 Computation of the Surface Flux The surface flux contributions on the interface shared by Ω 𝑘 and Ω 𝑘 ′ require the evaluation of (Eq. 5.18): ∑︁ 𝜔 𝑞 F (W ℎ𝑘 (x𝑞 ), W ℎ𝑘′ (x𝑞 )) · n𝜙(x𝑞 ) (5.20) 𝑞 In the method presented here, this is accomplished by use of an approximate Riemann solver, of which we have implemented the relativistic HLL and HLLC variants due to Schneider et al. (1993) and Mignone & Bodo (2005). Both of these approximate Riemann solvers require an estimate of the maximum and minimum wavespeeds on either side of the interface, which we compute through the maximum and minimum eigenvalues of 𝜕F/𝜕U (Mignone & Bodo, 2005): √︂   𝑣𝑥 ± 𝜎𝑠 𝑐2 − 𝑣𝑥2 + 𝑐2 𝜎𝑠 𝜆 ± (W) = (5.21) 1 + 𝜎𝑠 where h  i 𝜎𝑠 = 𝑐2𝑠 / 𝛾 2 𝑐2 − 𝑐2𝑠 . (5.22) We compute 𝜆 ± (W) for every Wℎ𝑘 (x𝑞 ) and W ℎ𝑘′ (x𝑞 )) to find the maximum and minimum wavespeeds at each surface quadrature point across interface:      𝜆 𝐿 = min 𝜆 − W ℎ𝑘 (x𝑞 ) , 𝜆− W ℎ𝑘′ (x𝑞 ) (5.23)      𝜆 𝑅 = max 𝜆 + W ℎ𝑘 (x𝑞 ) , 𝜆+ W ℎ𝑘′ (x𝑞 ) . (5.24) 149 5.3.5 Physicality Enforcing Operator While using 0th order polynomials for a relativistic hydrodynamics discontinuous-Galerkin method is guaranteed to produce a physical conserved state after every flux update even with shocks when using a local-extremum-diminishing numerical fluxes such as HLL, higher order bases can introduce spurious oscillations and non-physical conserved states within cells around shocks (see Wu & Tang (2016)). To resolve this issue, an operator is needed to smooth the solution within a cell. Taking inspiration from the limiter presented in Moe et al. (2015), we present here a smoothing procedure that enforces physical conserved states within a cell with a physical volume average. Following Riccardi & Durante (2008) and Wu & Tang (2016), a conserved state that satisfies √︃ 𝐷 > 0, 𝑞 (U) ≡ 𝐸/𝑐2 − 𝐷 2 − |M/𝑐| 2 > 0, (5.25) is a physically admissible state as long as the specific energy 𝑒(𝜌, 𝑝) is continuously differentiable under the chosen equation of state. If a conserved state satisfies Eq. 5.25, the state can be inverted for a primitive state with positive density and pressure with a velocity less than 𝑐. Since the space of permissible conserved states under Eq. 5.25 is convex (i.e. any conserved state interpolated between two physically permissible conserved states is also physically permissible (Wu & Tang, 2016)), we can use the same strategies from Moe et al. (2015) in a simple smoothing procedure to enforce physicality within a discontinuous-Galerkin cell. From a high level, we apply an operator to average nodal points within a cell towards a physical volume average. Before enforcing physicality within cells, we first screen for cells with non-physical nodal points by checking that all conserved states at the nodal points – U𝑖 – satisfy Eq. 5.25. If any point fails, we flag the cell as needing smoothing to ensure that all points are physical. We then check that the cell volume average U of the conserved state satisfies Eq. 5.25. As long as the cell volume average is physical, a smoothing factor can be found that ensures physicality without changing the global conserved quantities. If the cell volume average is not physical, then the nodal points cannot be made physical through the physicality-enforcing operator without changing the volume average. 
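The screening step described above amounts to evaluating the admissibility condition of Eq. (5.25) at every nodal point; a minimal sketch (illustrative names only) is:

#include <cmath>
#include <cstdio>

// Check the physical-admissibility condition of Eq. (5.25):
// D > 0 and q(U) = E/c^2 - sqrt(D^2 + |M/c|^2) > 0,
// which guarantees positive density and pressure and |v| < c on inversion.
bool is_physical(double D, double Mx, double My, double Mz, double E, double c) {
  if (D <= 0.0) return false;
  const double M2_over_c2 = (Mx * Mx + My * My + Mz * Mz) / (c * c);
  const double q = E / (c * c) - std::sqrt(D * D + M2_over_c2);
  return q > 0.0;
}

int main() {
  // Example: a state at rest with E > D c^2 is admissible (c = 1 here).
  std::printf("%s\n",
              is_physical(1.0, 0.0, 0.0, 0.0, 1.5, 1.0) ? "physical" : "non-physical");
  return 0;
}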
150 To enforce physicality within a cell, we first seek a smoothing factor 𝑠 ∈ [0, 1] such that the smoothed states Ũ𝑖 = 𝑠U𝑖 + (1 − 𝑠) U (5.26) at all nodal points in the cell satisfy Eq. 5.25. At each point in the cell, we find the largest smoothing factor such that √︃ 2 𝐷˜ 𝑖 > 0 𝑞˜𝑖 ≡ 𝐸˜𝑖 /𝑐2 − 𝐷˜ 𝑖2 + | M̃𝑖 |/𝑐 > 0. (5.27) If we assume that U is physical, then 𝑠 := 0 would lead to a physical Ũ, so we can assume that such a smoothing factor 𝑠𝑖 ≥ 0 exists. We find this factor in two stages. (1) In the first stage, we compute an intermediate stage smoothing factor 𝑠𝑖 for each nodal point that ensures a positive 𝐷 and 𝐸. We solve   ˜ (1) (1) (1) 𝐷 𝑖 = 𝑠𝑖,𝐷 𝐷 𝑖 + 1 − 𝑠𝑖,𝐷 𝐷 > 0 (5.28)   ˜ (1) (1) (1) 𝐸𝑖 = 𝑠𝑖,𝐸 𝐸𝑖 + 1 − 𝑠𝑖,𝐸 𝐸 > 0 (5.29) (1) (1) for the largest 𝑠𝑖,𝐷 , 𝑠𝑖,𝐸 ∈ [0, 1] that satisfies the constraints and compute an intermediate smoothing   (1) (1) (1) (1) factor 𝑠𝑖 = min 𝑠𝑖,𝐷 , 𝑠𝑖,𝐸 . We use 𝑠𝑖 to compute an intermediate smoothed state   (1) (1) (1) Ũ𝑖 = 𝑠𝑖 U𝑖 + 1 − 𝑠𝑖 U (5.30) so that we ensure that 𝐷˜ and 𝐸˜ are positive. (2) In the second stage, we compute a second stage smoothing factor 𝑠𝑖 ∈ [0, 1] such that √︂ (2) 2    2 (2) (2) 2 (2) 𝑞˜𝑖 = 𝐸˜𝑖 /𝑐 − 𝐷˜ 𝑖 + | M̃𝑖 |/𝑐 > 0. (5.31)   (2) (2) (1) (2) where Ũ𝑖 = 𝑠𝑖 U𝑖 + 1 − 𝑠𝑖 U is the second smoothed state. Note that since 𝑠 (2) := 0 leads to Ũ (2) := U, we know that an acceptable smoothing factor exists. Solving Eq. 5.31 can be (2) simplified by noting that 𝐸˜ (2) is positive for any choice of 𝑠𝑖 ∈ [0, 1] since 𝐸˜ (1) and 𝐸 are both positive (for the same reasons, 𝐷˜ (2) is also always positive). We can rewrite Eq. 5.31 as 2  (2) 2    2 (2) (2) 𝐸˜𝑖 /𝑐2 > 𝐷˜ 𝑖 + | M̃𝑖 |/𝑐 (5.32) (2) 2   (2) 𝑎 𝑠𝑖 + 𝑏𝑠𝑖 + 𝑐 > 0 (5.33) 151 where 1  ˜ (1) 2  2 1 2 (1) (1) 𝑎= 𝐸𝑖 − 𝐸 − 𝐷˜ 𝑖 − 𝐷 − M̃𝑖 − M (5.34) 𝑐4 𝑐2 2     2  2 (1) (1) (1) 𝑏 = 𝐸 𝐸˜𝑖 − 𝐸 − 2𝐷 𝐷˜ 𝑖 − 𝐷 − M · M̃𝑖 − M (5.35) 𝑐4 𝑐2 1 2 2 1 2 𝑐= 𝐸 −𝐷 − M . (5.36) 𝑐 4 𝑐2 (2) Since U is physical, 𝑠𝑖 := 0 must satisfy the inequality. Note that the quadratic can only have (2) at most one root within [0,1]; if it had two roots, then either 𝑠𝑖 := 0 and 𝑠 (2) do not satisfy the inequality, implying that U is unphysical, or that both satisfy the inequality and that some interior (2) 𝑠𝑖 ∈ [0, 1] do not satisfy the inequality, implying that the space of physical conserved states is (2) not convex, both of which are contradictions. If there are no roots within [0, 1], since 𝑠𝑖 := 0 (2) satisfies the inequality, 𝑠𝑖 := 1 must as well, so 1 would be the largest acceptable second stage smoothing factor. (2) In the case that there is just one root, then since 𝑠𝑖 := 0 satisfies the inequality, the coefficient 𝑎 must be negative or 0 (which is the simple linear case), and only the root √ (2) −𝑏 − 𝑏 2 − 4𝑎𝑐 𝑠𝑖 = (5.37) 2𝑎 can fall within [0, 1], and so we only need to compute this root to find the largest smoothing factor (1) (2) for this nodal point. The final smoothing factor for this nodal point is 𝑠𝑖 = 𝑠𝑖 𝑠𝑖 , which ensures that any 𝑠 ≤ 𝑠𝑖 chosen will satisfy Eq. 5.27. After computing 𝑠𝑖 for each nodal point in the cell, we compute the final smoothing factor for the cell using 𝑠 = min 𝑠𝑖 , which we use to compute ũ using Eq. 5.26. The procedure for our physicality-enforcing operator goes as follows 1. We flag cells with nodal points with conserved states that violate Eq. 5.25 as cells with non-physical nodal points. 2. We check that the volume average within a flagged cell satisfies equation 5.25, which guar- antees that the smoothing procedure will enforce physicality within the cell. 152 3. 
For each point in a flagged cell, we compute the largest smoothing factor 𝑠𝑖 that will guarantee that the new smoothed state will satisfy Eq. 5.27. For each nodal point, the procedure goes as: (1) (1) a) We compute the first stage smoothing factor 𝑠𝑖,𝐷 and 𝑠𝑖,𝐸 to ensure positivity of 𝐷 and 𝐸 by solving for them in Eq. 5.28. (1) (1) (1) b) We compute the first stage smoothing factor 𝑠𝑖 = min 𝑠𝑖,𝐷 , 𝑠𝑖,𝐸 and use this to compute the intermediate smoothed state Ũ (1) using Eq. 5.30. (1) c) We then check whether Ũ (1) satisfies equation 5.25, in which case we use 𝑠𝑖 = 𝑠𝑖 . (2) d) If not, we compute 𝑠𝑖 by solving the quadratic described in Eq. 5.32 and Eq. 5.34 (2) using the root for 𝑠 𝐼 in Eq. 5.37. The smoothing factor for this nodal point is then (1) (2) 𝑠𝑖 = 𝑠𝑖 𝑠𝑖 . 4. We compute a final smoothing factor for each cell using 𝑠 = min 𝑠𝑖 , which allows us to compute the smoothed state U𝑖 at each nodal point using Eq. 5.26. As long as the volume average conserved state U is physical, this procedure will produce the physical conserved state Ũ𝑖 . 5.4 Recovery of Primitive Variables Although the conservation laws in relativistic hydrodynamics are similar to those in Newtonian hydrodynamics, the inclusion of the Lorentz factor in conservation of mass, momentum, and energy adds complexity to the equation set in several ways that complicate recovery of primitive variables from conserved variables. Primarily, the Lorentz factor couples every conserved variable with the velocity in all directions. While adding a transverse velocity to a non-relativistic flow will not affect longitudinal evolution, in demonstration of Galilean invariance, a transverse velocity in a relativistic flow contributes to the apparent density, momentum, and energy, fundamentally modifying the dynamics. Additionally, the inclusion of the Lorentz factor leads to a non-linear relationship between the primitive and conserved variables. For even simple choices of equation of 153 state, recovering the primitive state from the conserved state (i.e. inverting Eq. 5.4) requires finding the roots of cubic or higher order polynomials. Last, the relativistic hydrodynamics equations (and causality) require the three-velocity to be bounded by the speed of light, with superluminal velocities leading to complex Lorentz factors. For highly relativistic flows close to the speed of light, we are often limited by machine precision when representing small changes in the three-velocity that equate to large changes in the Lorentz factor. For these reasons, the stability and fidelity of any scheme for relativistic hydrodynamics is fundamentally tied to that of the scheme used to compute primitive variables from conserved quantities. As a result, a wide variety of schemes, including but not limited to those presented in Schneider et al. (1993); Ryu et al. (2006); Riccardi & Durante (2008), have been described in the literature. Each of these options has its advantages and disadvantages from a physical fidelity, stability, and robustness standpoint; however, as far as we are aware, the performance of these different formulations has not previously been examined from a performance portability perspective, as we do here. We consider two different approaches to recovering the primitive variables from conserved quantities: an analytical approach and an iterative approach. We then develop both of these methods for the ideal gas and Taub-Matthews equations of state to give four algorithms in all. 
In formulating these, we use the dimensionless variables 𝑀 𝐸 𝜉= and 𝜂= . (5.38) 𝐷𝑐 𝐷𝑐2 This rescaling aids with reducing issues due to large differences in numbers, although this does not eliminate issues of near-speed-of-light velocities. 5.4.1 Ideal Gas Equation of State In the case of the ideal gas equations of state, the primitive variables can be recovered from the conserved quantities by solving the roots of a quartic equation. One approach demonstrated by Ryu et al. (2006) computes the analytic solution to a quartic polynominal in 𝛽 = 𝑣/𝑐. For completeness, we restate this method here in terms of the dimensionless parameters 𝜉 and 𝜂, which allows us to keep 𝑐 throughout the set of equations. 154 As shown in Schneider et al. (1993), the solution for the special relativistic velocity 𝛽 can be found from the roots of the quartic polynomial 𝑎 3 𝛽4 + 𝑎 2 𝛽2 + 𝑎 1 𝛽 + 𝑎 0 = 0 (5.39) where the coefficients are given by −2Γ(Γ − 1)𝜉𝜂 𝑎3 = (5.40) (Γ − 1) 2 (𝜉 2 + 1) Γ2 𝜂2 + 2(Γ − 1)𝜉 2 − (Γ − 1) 2 𝑎2 = (5.41) (Γ − 1) 2 (𝜉 2 + 1) −2Γ𝜉𝜂 𝑎1 = (5.42) (Γ − 1) 2 (𝜉 2 + 1) 𝜉2 𝑎0 = . (5.43) (Γ − 1) 2 (𝜉 2 + 1) Only one root of the polynomial provides a physical 𝛽 ∈ [0, 1). The root can be found using a root-finding method or analytically (Ryu et al., 2006) through: √ −𝐵 + 𝐵2 − 4𝐶 𝛽= (5.44) 2 where   1 √︃ 𝐵= 2 𝑎 + 𝑎 3 − 4𝑎 2 + 4𝑥 (5.45) 2 3   1 √︃ 𝐶= 𝑥 − 𝑥 2 − 4𝑎 0 (5.46) 2 We then have that:    2/3 h  √ i  2 1 −1 −𝑇 2 𝑅 + 𝑇 cos 3 tan − 𝑖1 /3 if 𝑇 < 0    𝑅 𝑥=  (5.47)  √  1/3  √  1/3  𝑅+ 𝑇   + 𝑅− 𝑇 − 𝑖1 /3 otherwise  where 𝑅, 𝑆, and 𝑇 are found from 1   𝑅= 9𝑖2𝑖2 − 27𝑖3 − 2𝑖13 (5.48) 54 1  𝑆= 3𝑖2 − 𝑎 22 (5.49) 9 𝑇 = 𝑅2 + 𝑆3 (5.50) 155 where 𝑖1 = −𝑎 2 (5.51) 𝑖2 = 𝑎 3 𝑎 1 − 4𝑎 0 (5.52) 𝑖3 = 4𝑎 2 𝑎 0 − 𝑎 21 − 𝑎 23 𝑎 0 . (5.53) (5.54) With a solution for 𝛽, the rest of the primitive variables can be recovered using √︃ 𝜌 = 𝐷 1 − 𝛽2 (5.55) 𝛽 v= M (5.56) 𝜉𝐷   𝑃 = (Γ − 1) 𝐸 − M · v − 𝜌𝑐2 . (5.57) An alternative strategy for recovering the primitive variables from conserved quantities is to utilize an iterative solver to find the roots. Exploring the iterative approach, we used an iterative solver following the recovery method presented in Riccardi & Durante (2008). This solver has two main advantages. First it uses a proxy for the velocity that scales more evenly from weakly to highly relativistic flows. Second, the resulting quartic polynomial can be solved using the Newton- Raphson method, which it typically more robust, accurate, and faster even using several iterations due to avoiding the slow and imprecise square roots and inverse tangents in the analytic solver. Instead of recovering the primitives by solving for velocity, Lorentz factor, or pressure, we instead solve for a proxy of the velocity, 𝑤, where 2𝑤 𝑢= . (5.58) 1 + 𝑤2 We solve for 𝑤 ∈ (0, 1) by finding the root within (0, 1) of the quartic polynomial 𝑃(𝑤) = (𝛼 − 1) 𝜉𝑤4 − 2 (𝛼𝜂 + 1) 𝑤3 + 2 (𝛼 + 1) 𝜉𝑤2 − 2 (𝛼𝜂 − 1) 𝑤 + (𝛼 − 1) 𝜉, (5.59) where 𝛼 = Γ/(Γ − 1). Within the range 𝑤 ∈ (0, 1), the equation 𝑃(𝑤) = 0 has only one root. While 𝑃(𝑤) = 0 could be solved analytically using the same method for our analytical solver, 156 the Newton-Raphson method is simpler and often quicker, since it only requires addition and multiplication and coefficients of the polynomial can be reused across iterations. We also find that the Newton-Raphson method always converges to the root in (0, 1) as long as the initial guess is in (0, 1), which is consistent with Riccardi & Durante (2008). This obviates the need for a bounded root solver. 
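A minimal sketch of this Newton-Raphson iteration on the quartic of Eq. (5.59) is given below; the fixed iteration count, initial guess, and example inputs are placeholders for illustration rather than the tuned values of the actual implementation.

#include <cmath>
#include <cstdio>

// Iteratively recover the velocity proxy w in (0, 1) from the dimensionless
// conserved quantities xi = |M|/(D c) and eta = E/(D c^2) by applying
// Newton-Raphson to the quartic P(w) of Eq. (5.59) for the ideal gas.
double recover_w_ideal(double xi, double eta, double Gamma, int iterations) {
  const double a = Gamma / (Gamma - 1.0);  // alpha in Eq. (5.59)
  double w = 0.5;                          // any starting guess in (0, 1)
  for (int n = 0; n < iterations; ++n) {
    const double P  = (a - 1.0) * xi * w * w * w * w - 2.0 * (a * eta + 1.0) * w * w * w
                    + 2.0 * (a + 1.0) * xi * w * w - 2.0 * (a * eta - 1.0) * w
                    + (a - 1.0) * xi;
    const double dP = 4.0 * (a - 1.0) * xi * w * w * w - 6.0 * (a * eta + 1.0) * w * w
                    + 4.0 * (a + 1.0) * xi * w - 2.0 * (a * eta - 1.0);
    w -= P / dP;
  }
  return w;
}

int main() {
  // Placeholder conserved state: xi = 0.5, eta = 1.5, Gamma = 5/3.
  const double w = recover_w_ideal(0.5, 1.5, 5.0 / 3.0, 8);
  const double u = 2.0 * w / (1.0 + w * w);  // velocity recovered via Eq. (5.58)
  std::printf("w = %.12f, u = %.12f\n", w, u);
  return 0;
}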
For reasonably relativistic flows with 𝛾 < 10, this may only take 5 iterations to recover 𝑤 to within double floating point machine precision (Δ𝑤 ∼ 10−16 ). When 𝜉 is very small, a cubic approximation for a solution for 𝑤 can be used 𝛼−1 (𝛼 − 1) 2 𝑤= 𝜉+ [(𝛼 + 3) (𝛼𝜂 + 1) − 4 (𝛼 + 1)] 𝜉 3 + 𝑂 (𝜉 5 ). (5.60) 2 (𝛼𝜂 − 1) 8 (𝛼𝜂 − 1) 4 Generally, the iterative solver for the ideal equation of state is more accurate than the analytical solver. Often, the iterative solver is also faster. Comparison between the solvers for the ideal equation of state and the solvers for the Taub-Matthews equation of state are explored in section 5.4.3. 5.4.2 Taub-Matthews Equation of State For the Taub-Matthews equation of state, the primitive state can be recovered from the conserved state by solving a cubic equation for 𝑊 = 𝛾 2 − 1. Following Ryu et al. (2006), we solve for 𝑊 from 𝑊 3 + 𝑐 1𝑊 2 + 𝑐 2𝑊 + 𝑐 3 = 0 (5.61) where  h    i 𝜂2 + 𝜉 2 4 𝜂2 + 𝜉 2 − 𝜉 2 + 1 − 14𝜉 2 𝜂2 𝑐1 = 2 (5.62) 2 𝜂2 − 𝜉 2 h    i 2 4 𝜂2 + 𝜉 2 − 𝜉 2 + 1 − 57𝜉 2 𝜂2 𝑐2 = 2 (5.63) 16 𝜂2 − 𝜉 2 9𝜉 2 𝜂2 𝑐3 = − 2 . (5.64) 16 𝜂2 − 𝜉 2 Eq. 5.61 can be solved analytically and iteratively. Analytically solving the cubic polynomial is straightforward compared to solving the quartic polynomial for the ideal equation of state. The 157 solution for 𝑊 depends on the discriminant of the cubic equation 𝑑 = 𝑄 3 + 𝑅2 (5.65) with 1  𝑄= 3𝑐 2 − 𝑐21 (5.66) 9 1   𝑅= 9𝑐 1 𝑐 2 − 27𝑐 3 − 2𝑐31 . (5.67) 54 (5.68) If 𝑑 < 0, then Eq. 5.61 has the solution 𝜄 𝑐 − 1 √︁ 𝑊 = 2 −𝑄 cos (5.69) 3 3 with ! 𝑅 𝜄= cos−1 √︁ . (5.70) −𝑄 3 Otherwise if 𝑑 ≥ 0, then Eq. 5.61 has the solution 𝑐 𝑊 = − 1 +𝑆 +𝑇 (5.71) 3 with  √  1/3 𝑆= 𝑅+ 𝑑 (5.72)  √  1/3 𝑇 = 𝑅− 𝑑 . (5.73) A root-finding method can also be used to recover 𝑊 from Eq. 5.61. As an alternative option to the analytic solution, we use the bracketed root solver Brent’s method (Brent, 1973) to recover 𝑊. For the Taub-Matthews equation of state, we use Brent’s method instead of the Newton-Raphson since Brent’s method allows us to bracket the one non-negative root. Unlike for the quartic polynomial solved for the ideal equation of state, the Newton-Raphson method is not guaranteed to converge to the positive root when using a positive initial guess, which leads to an incorrect and unphysical recovered velocity. We first bracket the root 𝑊 with the region corresponding to 158 𝛾 ∈ [1, 200], then iteratively expand the upper range if the root is not found. For the tests explored here 𝛾 = 200 is a sufficiently high upper bound that this rebracketing is not needed. With 𝑊 recovered, the Lorentz factor and relativistic velocity can be recovered via √︂ √ 𝑊 𝛾 = 𝑊 +1 𝛽= . (5.74) 𝑊 +1 The lab frame density 𝜌 and velocity v can be recovered via the same method as the ideal equation of state. The pressure with the Taub-Matthews equation of state is recovered via (𝐸 − M · v) 2 − 𝜌 2 𝑃= . (5.75) 3 (𝐸 − M · v) 5.4.3 Conserved to Primitive Solver Comparisons Fig. 5.2 shows the relative error in the recovered velocity in the ideal gas equation of state and Taub-Matthews equations of state using the analytical method and iterative methods using varying number of iterations. The plots are created by applying the methods on a grid of 252 primitive states with 𝐷 = 1 kg m−3 and 25 logarithmically spaced pressures from 105 to 1010 N m−2 and 25 logarithmically spaced Lorentz factors from 1 to 100, using 𝑐 = 3 × 108 m s−1 . Each pair of pressure and Lorentz factor is converted to a conserved state using Eq. 
5.4 that is converted back to a primitive state using the specified recovery method. We then compute the relative error of the velocity in the recovered primitive state with respect to the original velocity determined by the Lorentz factor. For the ideal gas using 64-bit floating point precision, the analytical solver recovers the velocity to within 10^-15 for Lorentz factors below 3, in some cases recovering it exactly to machine precision (10^-16 in this regime). The accuracy of the analytical method decreases roughly as a power law with increasing Lorentz factor, reaching about 10^-10 at γ = 100. At this high Lorentz factor, the relative error in the recovered Lorentz factor is 10^-6, which propagates into the other recovered primitives, highlighting the need for accurate recovery of the velocity for ultrarelativistic flows. In contrast, the iterative method for the ideal gas recovers the velocity exactly or near machine precision for Lorentz factors below 10 in only 6 iterations, past which the error increases rapidly with Lorentz factor. Owing to the flexibility of the accuracy of the iterative method, increasing the iteration count to 12 leads to recovering the velocity near machine precision for all Lorentz factors tested. Figure 5.2: Map of the error of the conserved-to-primitive solvers, with the error using the analytical method in the left column, the error using varying numbers of iterations in the middle two columns, and the error of these configurations versus Lorentz factor in the right column. The top row shows results for the ideal gas, testing the iterative solver with 6 and 12 iterations, and the bottom row shows results for the Taub-Matthews equation of state, testing the iterative solver with 25 and 50 iterations. In all panels, 25 × 25 primitive states are tested, with Lorentz factors varying from 1 to 100 on the x-axis and pressures varying from 10^5 to 10^10 N m^-2, using c = 3 × 10^8 m s^-1 and fixing D = 1 kg m^-3. These primitive states are first converted to conserved states and then converted back to a primitive state using the specified analytical or iterative solver. In the left three columns, the relative error is shown in color, with the y-axis showing the pressure. In the rightmost column, the median (solid line) and first-to-third quartile (shaded region) of the error, sampled over different pressures at a given Lorentz factor, are shown. All results in this figure use the Intel compiler on CPUs. The iterative solver for the ideal equation of state is more accurate than the analytic solver using just 12 iterations for high Lorentz factors and just 6 iterations for low Lorentz factors. For the Taub-Matthews equation of state, the analytical solver is almost always at least as accurate as, if not more accurate than, the iterative solver. At higher Lorentz factors, the iterative solver has relatively more difficulty in recovering the velocity because the method recovers the velocity from a proxy of the velocity, and the velocity varies slowly at high Lorentz factors. Small errors in the recovered velocity at high Lorentz factors amplify to large errors in other recovered primitives.
We also note that for very high pressures, at and above 10^20 ρc^2, the analytical method for the ideal gas encounters imaginary numbers and fails to recover the velocity at all, whereas the iterative solver does not fail at very high pressures. In comparison, the cubic analytic solver for the Taub-Matthews equation of state performs closer to machine precision across the domain of primitive states tested. The iterative solver for the Taub-Matthews equation of state requires many more iterations than for the ideal gas equation of state. We attribute this to the construction of the polynomial for the iterative solver for the ideal equation of state, which is designed to converge in a few iterations. The Taub-Matthews iterative solver performs worse at lower Lorentz factors since it recovers the velocity from a proxy of the Lorentz factor, and the Lorentz factor varies slowly at low velocities. Small errors in the recovered Lorentz factor at sub-relativistic velocities amplify to large errors in other recovered primitives. Generally, the iterative solver for the Taub-Matthews equation of state is less accurate than the analytical solver, and the high iteration counts required lead to slower performance. We next investigate the number of iterations required for the iterative solver to reach accuracy parity with the analytic solver in Fig. 5.3. In this figure, we test the same grid of primitive states used in Fig. 5.2, running the iterative solver with an increasing number of iterations until it achieves greater accuracy than the analytic solver. For some cases with the ideal gas, the analytic solver recovers the velocity exactly, which we mark with yellow. The number of iterations required for the iterative solvers to reach accuracy parity depends mostly on the Lorentz factor, with some variation in pressure. The iterative solver for the ideal gas requires more iterations at higher Lorentz factors. We attribute this to the iterative solver recovering the primitive state by first recovering a proxy for the velocity instead of the Lorentz factor, which requires less precision to recover at low Lorentz factors. Figure 5.3: Required iterations for the iterative solver to reach the same accuracy as the analytical solver, using the same primitive states as Fig. 5.2, with results for the ideal gas in the top row and the Taub-Matthews equation of state in the bottom row. The left column shows, in color, the required iterations when compiling with the Intel compiler, with the Lorentz factor on the x-axis and the pressure on the y-axis. For two primitive states the ideal analytic solver recovers the velocity exactly, leading to the iterative solver being unable to reach the same accuracy, which we show in yellow. The right column shows the median (solid line) and first-to-third quartile (shaded region) of the required iterations, sampled over different pressures at a given Lorentz factor. Results with the GNU compiler on CPUs are shown in orange, with the Intel compiler on CPUs with the Kokkos OpenMP backend in blue, and with the Kokkos CUDA backend on GPUs in green. For the primitive states tested here that the analytical solver does not recover exactly, the ideal iterative solver requires fewer than 10 iterations to achieve parity. We attribute the low iteration count to the one physical root of the
quartic always being the same root. The Taub-Matthews iterative solver requires comparatively more iterations, almost always more than 5 and upwards of 15 for low Lorentz factors. Generally, more iterations are required for lower Lorentz factors, possibly because the solver recovers a proxy of the Lorentz factor first, from which recovering the velocity is sensitive to precision. The required iterations form a sawtooth pattern with Lorentz factor due to the physical root switching positions. Depending on the architecture and compiler, the iterative solver for the ideal gas is usually faster than the analytic solver, while for the Taub-Matthews equation of state the iterative solver is almost always slower. We investigate the performance of the recovery methods in Fig. 5.4. Figure 5.4: Timing comparisons for the iterative solver to reach the same accuracy as the analytic solver, with comparisons as a color map in the left three panels and versus Lorentz factor in the rightmost panel, using the same primitive states as Fig. 5.2, with results for the ideal gas in the top row and the Taub-Matthews equation of state in the bottom row. In all panels we compare results using the metric Analytical Time/Iterative Time − 1, where a positive value shows how much slower the analytical solver is as a fraction of the time the iterative solver takes and a negative value shows the fraction by which the analytical solver is faster. The left three columns show the timing metric in color (blue shows where the iterative method is faster), with the Lorentz factor on the x-axis and the pressure on the y-axis, showing comparisons for the GNU and Intel compilers on CPUs with the Kokkos OpenMP backend and on GPUs with the Kokkos CUDA backend across the three columns. The rightmost column shows the median (solid line) and first-to-third quartile (shaded region) of the timing metric, sampled over different pressures at a given Lorentz factor, for all compilers tested (note that this does not compare timings between compilers, only the analytic against the iterative solver for each compiler). For the ideal equation of state, the iterative solver is faster than the analytic solver below a certain threshold of Lorentz factor that is compiler and architecture dependent. The iterative solver for the Taub-Matthews equation of state is almost always slower than the analytic method. Using the same grid of primitive states that we used in Fig. 5.2, we compare the run times of the analytical solvers and the iterative solvers with the number of iterations required to achieve accuracy parity, running each of the primitive states from Fig. 5.2 on 10^3 cells with 27 points per cell and taking the average runtime over 100 runs each. We compare timings using the metric Analytical Time/Iterative Time − 1, where the iterative time uses the number of iterations required to match the analytical accuracy, in order to highlight where the iterative solver is faster. Negative values show the fraction by which the analytical method is faster than the iterative method, while positive values show the fraction by which the analytical solver is slower.
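As a worked example of this metric with hypothetical timings (purely for illustration, not measured values):

\[
  \frac{t_{\rm analytic}}{t_{\rm iterative}} - 1
  = \frac{1.10\,\mu\mathrm{s}}{1.00\,\mu\mathrm{s}} - 1 = +0.10,
\]

i.e. the analytical solver takes 10% more time than the iterative solver, while a value of −0.10 would instead mean the analytical solver is 10% faster.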
For the ideal gas on CPUs using the Intel compiler, the iterative solver is about 10% faster than the analytical solver at Lorentz factors below 10 and about 10% slower at Lorentz factors above 10. For higher iteration counts, up to 10 iterations, the analytical solver begins to be faster than the iterative solver by several percent. However, it should be noted from Fig. 5.2 that in this regime the analytical method introduces more inaccuracy into the primitive state, while the iterative solver can recover the primitive state with much better accuracy at the cost of performance. A red line on the right hand side shows that the analytical solver more quickly identifies the zero-velocity case, whereas the iterative solver takes longer due to the layout of the code and its use of the cubic approximation from Eq. 5.60 for near-zero momenta. Using the GNU compiler on CPUs, the iterative solver is always faster than the analytical solver except for trivial cases. We attribute this slowdown with GNU to the slower math functions required in the analytic solver. For GPUs, the iterative solver for the ideal gas is faster than the analytical solver by several percent for all but the trivial case and Lorentz factors above 60. This is despite the potential for the kernel to branch at every point if different points require different numbers of iterations, although these timing tests do not exercise this possibility. The timing disparity may be due to the sqrt operation in the analytical solver, which is better optimized on CPUs than on GPUs. Considering the Taub-Matthews equation of state, the iterative solver is almost always slower than the analytical solver. This is expected from the larger number of iterations needed for the iterative solver to reach parity with the analytical solver. The performance difference is largest with the Intel compiler, where the optimized math functions allow good performance for the analytical solver. Figure 5.5: Aggregate performance of all methods and compilers tested, shown as box and whiskers of the primitive recoveries per second (higher is better) across the grid of primitive states used in Fig. 5.2. Red lines show medians, boxes show the interquartile range, and whiskers show the maximum and minimum values inside of 1.5 times the length of the interquartile range above the 3rd quartile and below the 1st quartile, as described by Tukey (1977). We exclude outlier timings from the figure, which range from 10^11 to 1.2 × 10^12 primitive recoveries per second for all methods and compilers. We show results for GNU on CPUs in orange, Intel on CPUs in blue, and CUDA on GPUs in green, for the ideal gas on the left and the Taub-Matthews equation of state on the right. Generally, on CPUs the Intel compiler allows more primitive recoveries per second than the GNU compiler. The performance for recovery with the Taub-Matthews gas has a much larger spread than recovery with the ideal equation of state. Between the two equations of state, the solvers achieve roughly the same number of recoveries per second on each architecture, indicating that the choice of equation of state has a limited impact on the full code's performance.
In Fig. 5.5 we show the performance of all methods on all architectures and compilers tested as a box-and-whisker plot of the attained primitive recoveries per second. Runs on CPUs with GNU and Intel and the Kokkos OpenMP backend were performed on a 2-socket node with Intel Xeon Platinum 8268 CPUs using a total of 48 OpenMP threads, compiled with AVX512 vectorization. Runs with the Kokkos CUDA backend were performed on an NVIDIA V100 SXM2 Tesla GPU. For the ideal gas, the analytic method is slower than the iterative method with GNU, slightly faster with Intel, and of nearly the same performance on GPUs. For the Taub-Matthews approximation to the Synge equation of state, the analytical method is generally faster on all architectures, with the performance difference being the greatest with Intel and the smallest with GNU. Between the two equations of state, the analytical solvers for both gases perform at about the same speed on each architecture. This suggests that, considering conserved-to-primitive updates alone, using the Taub-Matthews equation of state is about as fast as using an ideal equation of state, although the more complex computation of wavespeeds and enthalpies in the Taub-Matthews equation of state will lead to slowdowns elsewhere. Overall, these results demonstrate that, for the ideal gas equation of state, the iterative method to recover the primitive variables from the conserved variables is more flexible, robust, accurate, and in some cases faster than the analytical method. By contrast, for the Taub-Matthews equation of state, the characteristics of the analytic and iterative solvers are nearly the opposite, with the iterative solver performing generally worse. Nevertheless, the comparable speed and robustness of the analytical solver for the Taub-Matthews equation of state suggest that the higher fidelity of the Taub-Matthews equation of state comes at little cost to execution time and stability. 5.5 Tests of the Relativistic Hydrodynamics Scheme To verify the accuracy of the relativistic hydrodynamics scheme, we investigate several standard test problems in 1D and 2D with and without shocks. First, in §5.5.1, we demonstrate convergence of a set of relativistic linear waves in three dimensions. We then demonstrate the accuracy of the method for discontinuous solutions in §5.5.2 by demonstrating convergence for five different 1D Riemann problems to high resolution reference solutions generated from the publicly available finite volume code Athena++ (Stone et al., 2020a). Next, we demonstrate the scheme's ability to handle multi-dimensional shocks through a series of 2D Riemann problems previously established in the literature. Then, we measure the growth rate of the relativistic Kelvin-Helmholtz instability in 2D in §5.5.5, comparing to results using the finite volume code PLUTO (Mignone et al., 2011). Last, in §5.5.6, we show timing tests of the code evolving the Kelvin-Helmholtz instability. 5.5.1 Linear Waves Prior work in the literature (see, e.g. Stone et al., 2008a) has demonstrated that the convergence of linear waves in multi-dimensions is a sensitive test of algorithmic fidelity. As far as we are aware, however, linear wave convergence has not been utilized as a test of algorithms for relativistic hydrodynamics. Here, we elucidate how such a test can be established and demonstrate the performance of the algorithm presented here for such a test problem.
To generate the linear waves, a perturbation is made to the initial primitive state, W0 = [𝜌0 , v0 , 𝑃0 ] 𝑇 (using rest mass density, three-velocity, and pressure), in the form of W[𝑖] = W0 [𝑖] + 𝐴r 𝑗 [𝑖] sin(𝑘𝑥 − 𝜔𝑡) (5.76) where W is the perturbed primitive state, 𝐴 is the perturbation amplitude (typically 10−6 − 10−4 ), r 𝑗 [𝑖] is the j𝑡ℎ right eigenvector, the wavelength is equal to 1, 𝑘 = 2𝜋 and 𝜔 = 𝑘𝜆 𝑗 . Here, we have defined 𝜆 is the wavelength and 𝜆 𝑗 is the eigenvalue corresponding to the j𝑡ℎ right eigenvector of the Jacobian, 𝐴(V), given in Mignone et al. (2005). Each eigenvalue/vector pair corresponds to a different set of physics for linear wave testing, giving a total of 5 physically different linear wave tests, which we denote with 𝑗 ∈ {−, 0 (1,2,3) , +}. Once we have the perturbed primitives, we need to translate these to a perturbed conserved quantities state, U. This is done using the Jacobian 𝜕U/𝜕W in the following equation: 𝜕U U(𝑡 = 0) = W(𝑡 = 0) (5.77) 𝜕W W The Jacobian, 𝜕U/𝜕W must be constructed around a state, W such that the solution to the non- linear relationship W[𝑖]] (U) = W0 [𝑖] + 𝐴r 𝑗 [𝑖] sin(𝑘𝑥 − 𝜔𝑡) at 𝑡 = 0. If this condition is not fulfilled, then a different problem is initialized and the evolution of the system will depart from the linear dispersion relation. To fulfill this criteria, we have found that it is necessary to compute the Jacobian using the unperturbed state, W0 , but including the perturbation to the velocity in the Lorentz factor, in order to ensure that coupling between different components of the velocity is accurately captured. While we emphasize that this is done only to establish the initial condition 167 in the conserved quantities, this reinforces a fundamental difference between relativistic and non- relativistic hydrodynamics; in the relativistic case the primitive variables are always a non-linear function of the conserved quantities due to the presence of the Lorentz factor. Now that the 1D perturbed states U and W have been determined, we can rotate these for 2D and 3D non-grid-aligned cases. To do this, we first start with a desired number of wavelengths, 𝑁, and find the 𝑛𝑡ℎ acceptable angle, 𝜃, by Eq. 5.78, where 𝑛 < 𝑁. The values for 𝑁 and 𝑛 for the linear waves tests are shown in Tab. 5.1. √︂ ! 𝑁 𝜃 = tan−1 −1 (5.78) 𝑁 −𝑛 Table 5.1: Values of 𝑁 (no. of wavelengths) and 𝑛 (𝑛𝑡ℎ acceptable wavelength) for linear waves tests (see Eq. 5.78) Test Type 𝑁 𝑛 1D 1 0 2D Grid-Aligned 1 0 2D Non-Grid-Aligned 2 1 3D Grid-Aligned 1 0 3D Non-Grid-Aligned 3 2 From here, the base equations in the 1D form of Eq. 5.76 are rotated by the angle 𝜃. Which is done either about the 𝑦 axis, 𝑎 = (0, 1, 0), for 2D or about the 𝑎 = (0, −1, 1) axis for 3D. The rotation matrix, R, is generated via       1 0 0 𝑎 𝑥 𝑎 𝑥 𝑎 𝑥 𝑎 𝑦 𝑎 𝑥 𝑎 𝑧   0 −𝑎 𝑧 𝑎 𝑦              r1 = 0 1 0 , r2 = 𝑎 𝑦 𝑎 𝑥 𝑎 𝑦 𝑎 𝑦 𝑎 𝑦 𝑎 𝑧  , r3 =  𝑎 𝑧 0 −𝑎 𝑥  (5.79)       0 0 1 −𝑎 𝑦 𝑎 𝑥 0        𝑎 𝑧 𝑎𝑥 𝑎 𝑧 𝑎 𝑦 𝑎 𝑧 𝑎 𝑧        R = cos(𝜃)r1 + (1 − cos(𝜃))r2 + sin(𝜃)r3 . (5.80) R is then used to rotate the three-velocity vector, v, and the momentum vector, M, by left multiplying them by R. Next, the (𝑥, 𝑦, 𝑧) coordinates in each equation are substituted with rotated coordinates 168 (𝑥 ′, 𝑦′, 𝑧′), where       1 0 0       ′ ′ ′       𝑥 = R 0 , 𝑦 = R 1 , 𝑧 = R 0 . 
    (5.81)       0 0 1             Once these values have been substituted, the final, non-grid-aligned equations for U and W have been obtained. For all eigenvalue/eigenvector cases, 𝑗 = {−, 0 (1,2,3) , +}, tests are run for the rotation configu- rations in Table 5.1 with basis order and time integrator combinations of (0, RK1), (1, SSPRK2), and (2, SSPRK3). The domain, L, and number of elements in each direction, N, is calculated based on the rotation matrix, R:   e L = 𝑁R (5.82) |e|   e N= 𝑁𝑛elem 𝑥 𝑟𝜎 R (5.83) |e| where 𝑁 is the number of wavelengths, e is the direction vector for the default orientation of the h i𝑇  wave 1 0 0 , 𝑥 𝜎 is the refinement multiplier per refinement increment (default 𝑥 𝜎 = 2), 𝑟 is the refinement level, and 𝑛elem is the base number of elements, which varies for 1D, 2D, and 3D. h i𝑇 For these tests, the velocity was either set to v = 0 or v = 0.5𝑣max −0.3𝑣max 0.4𝑣max , where 𝑣max = 0.05𝑐 𝑠 . The base time step is determined by running the test with adaptive time stepping, which adjusts the time step to maintain a certain CFL during the test (0.2 in this case). The test is then run again 3 times, each time increasing the refinement in both space and time by a factor of 2 to maintain a constant CFL. The L1Error and L2Error are gathered for each test and are fitted against the results using the following equation: L1Error(𝑑𝑥) = 𝑝 0 + 𝑝 1 (𝑑𝑥) 𝑝 2 (5.84) where 𝑝 0 , 𝑝 1 , and 𝑝 2 are fitting constants. The exponent 𝑝 2 is the convergence order, which is expected to be 1, 2, and 3 for the time integrators RK1, SSPRK2, and SSPRK3 respectively. Results for the 3D, non-grid-aligned, zero velocity, basis order 2, SSPRK3, test case are shown in Tab. 5.2, while the L1Error is plotted against the expected values for the conserved quantity 𝐷 in Fig. 5.6. 169 Table 5.2: Order of convergence for both primitive and conserved variables along the rows for each of the 5 eigenvalue/eigenvector pairs 𝑗 ∈ {−, 0 (1,2,3) , +} along the columns, all tested in 3D with non-grid-aligned waves, using a 2nd order basis with the SSPRK3 integrator. For all cases we expect a 3.0 rate of convergence. Entries with ’-’ denote variables where the eignvector used for that test does not affect that variable. Eigenvalue/eigenvector Test Case Quantity - 0 (1) 0 (2) 0 (3) + 𝐷 3.099989 3.036570 2.561624 2.561624 3.099989 𝑀𝑥 3.079648 - 2.838988 2.838988 3.079648 𝑀𝑦 3.079648 - 2.879077 2.824568 3.079648 𝑀𝑧 3.079648 - 2.824568 2.879077 3.079648 𝐸 3.099989 3.036570 2.561652 2.561652 3.099989 𝜌 3.099989 3.036570 - - 3.099989 𝑢𝑥 3.079655 - 2.838988 2.838988 3.079655 𝑢𝑦 3.079655 - 2.879077 2.824568 3.079655 𝑢𝑧 3.079655 - 2.824568 2.879077 3.079655 𝑃 3.099989 - - - 3.099989 (a) Case: - (b) Case: 0 (1) (c) Case: 0 (2) (d) Case: 0 (3) (e) Case: + Figure 5.6: Order of convergence for the relativistic mass density (in solid blue) for three resolutions along the 𝑥-axis the 5 eigenvalue/eigenvector pairs 𝑗 ∈ {−, 0 (1,2,3) , +} in different panel. For all tests here we test in 3D with non-grid-aligned waves, using a 2nd order basis with the SSPRK3 integrator. For all cases we expect a 3.0 rate of convergence, which we denote with a dashed black line. 170 5.5.2 1D Riemann Problems We now investigate the accuracy of the relativistic hydrodynamics method through considering the evolution of a set of standard 1D Riemann problems in order to characterize how well the code handles shocks. 
For initial conditions, we use three standard blast waves and a reflecting wall test from Martí & Müller (2003, 2015) and one Sod shock tube, for a total of five different 1D Riemann problems. For each of the five problems, we use a [0, 1] grid with Dirichlet boundary conditions, except for the fifth test, in which we replace the boundary condition at x = 1 with a reflecting boundary. The first four tests begin divided into a primitive state on the left, W_L = (ρ, v_x, v_y, p)_L for x ∈ [0, 0.5), and on the right, W_R = (ρ, v_x, v_y, p)_R for x ∈ [0.5, 1], while the fifth test begins with a uniform initial primitive state throughout the domain. In all cases we set v_z = 0 and use the ideal equation of state, with γ = 5/3 for the first three tests and γ = 4/3 for the last two.

For reference data, we compute an n_x = 2^14 cell solution for each of the tested Riemann problems using an HLLC Riemann solver and a second order van Leer integrator from Athena++ (Stone et al., 2020a). We run each 1D Riemann problem with five resolutions in powers of two from n_x = 256 to n_x = 4096 cells with polynomial basis orders 0, 1, and 2, using the HLLC Riemann solver and the iterative primitive-recovery method for the ideal gas. For basis orders 1 and 2, we use the limiter from Moe et al. (2015) in addition to the physicality-enforcing operator from §5.3.5. The physicality-enforcing operator was necessary for all tests with basis orders above 0.

Fig. 5.7 shows the density, longitudinal velocity, pressure, and Lorentz factor from the five 1D Riemann problems using n_x = 128 with the three polynomial basis orders and the reference solution. Fig. 5.8 shows a log-log plot of the L1 error of the relativistic density, longitudinal relativistic momentum density, and total energy density compared to the reference solution, along with power-law fits to the convergence rate and the expected rate of convergence.

1D Riemann problem 1 is a mildly relativistic blast wave with initial conditions

    W_L = (10, 0, 0, (40/3) c^2)_L    W_R = (1, 0, 0, (2/3) \times 10^{-6} c^2)_R    (5.85)

where we have followed Núñez-de la Rosa & Munz (2018) and used a pressure close to zero for the right-side primitive state for numerical reasons. For this test, we use an adiabatic index Γ = 5/3. We evolve the shock until t = 0.4/c. For this first test we achieved the expected convergence rate in all variables except for the density for basis order 0, which suffers from slowly converging dissipation around the blast wave. We also see a small cusp in velocity and oscillations with basis order 2 at the trailing edge of the blast wave, which are more apparent in the Lorentz factor. The L1 errors of basis orders 1 and 2 are comparable, highlighting the difficulty of achieving high-order convergence with higher order methods when the problem contains shocks. However, since the basis order 2 test has more degrees of freedom than the basis order 1 test, the L1 error per degree of freedom is still lower for basis order 2, indicating that higher order bases can still be more efficient.
1D Riemann problem 2 is a highly relativistic blast wave with initial conditions     W 𝐿 = 1, 0, 0, (103 )𝑐2 W 𝑅 = 1, 0, 0, (10−2 )𝑐2 , (5.86) 𝐿 𝑅 using an adiabatic index Γ = 5/3 and evolved until 𝑡 = 0.4/𝑐. In this test, we see that the sharpness of the resolved density of the blast wave changes with resolution. We see it the sharpest with basis order 1, second with basis order 0, and most diffuse with basis order 2, although for each basis the sharpness improves with resolution. We see a slight cusp in the Lorentz factor for all basis orders just behind the blastwave where the velocity approaches 𝑐 but in the high resolution finite volume method the region has a flat Lorentz factor. The sharp blast wave in density causes problems for convergence at basis order 0 while higher order bases achieve the expected convergence. 1D Riemann problem 3 is also a highly relativistic blast wave but with a transverse velocity with initial conditions     W 𝐿 = 1, 0, 0, (103 )𝑐2 W 𝑅 = 1, 0, 0.99, (10−2 )𝑐2 , (5.87) 𝐿 𝑅 172 with an adiabatic index Γ = 5/3 and evolved until 𝑡 = 0.4/𝑐. With the addition of a relativistic transverse velocity, the blast wave widens into a square plateau in density, somewhat similar to problem 1. Like in problem 2, we find that basis order 1 best captures the blast wave, although resolution improves accuracy for all basis orders. In the Lorentz factor we see a small cusp at the rightmost edge of the rarefaction and some smearing across the blastwave. The wider blast wave allows basis order 0 to achieve the expected convergence rate. L1 error for basis order 2 is greater than the L1 error for basis order 1, although this is mostly due to more degrees of freedom in the summation of the L1 error for basis order 1. 1D Riemann problem 4 is a Sod shock with initial conditions     W 𝐿 = 1, 0.01𝑐, 0, 1.0𝑐2 W 𝑅 = 0.125, 0.01𝑐, 0, 0.1𝑐2 , (5.88) 𝐿 𝑅 using an adiabatic index Γ = 4/3 and evolving until 𝑡 = 0.4/𝑐. We see some diffusivity across the contact discontinuity and at the leftmost edge of the rarefaction. For the fifth 1D Riemann problem we study a highly relativistic flow moving to the right and reflecting against the right wall. We use the initial conditions   W = 1, 0.99999𝑐, 0, 0.01𝑐 , 2 (5.89) with an adiabatic index Γ = 4/3 and evolved until 𝑡 = 1.5/𝑐. We see a small cusp in the Lorentz factor at the left edge of the piled up stationary mass. For higher order bases, we see wall heating causing spurious oscillations in the reflected fluid. These leads to slow rates of convergence for basis order 2. 5.5.3 1D Taub-Matthews Equation of State Test We test the Taub-Matthews approximation to the Synge equation of state against the ideal equation of state using the fifth blast wave problem from Ryu et al. (2006), which highlights the differences between the Synge gas and ideal gas. 
The initial conditions for the test, using the same notation and domain as §5.5.2, are

    W_L = (1, 0, 0.9c, 10^{3} c^2)_L    W_R = (1, 0, 0.99c, 10^{-2} c^2)_R,    (5.90)

which evolve into a blast wave. In the initial state, the temperature stand-in Θ = P/ρ on the left-hand side is relativistic while Θ on the right-hand side is non-relativistic. As such, for an ideal equation of state, an adiabatic index of Γ = 4/3 is appropriate for the left-hand side while Γ = 5/3 is appropriate for the right-hand side. The Taub-Matthews approximation to the Synge equation of state allows accurate modeling of both sides with a single equation of state.

Figure 5.7: Plots of the five 1D Riemann problems tested using the ideal equation of state. Each row shows the end state of a different Riemann problem. From top to bottom, the first row shows a mildly relativistic blast wave, the second a highly relativistic blast wave, the third a blast wave with transverse velocity, the fourth a Sod shock tube, and the fifth a planar shock reflection. The columns show, from left to right, the rest-mass density, the pressure, the velocity, and the Lorentz factor. In each panel we show the reference solution computed with a finite volume scheme (Stone et al., 2020a) with a solid line and the basis 0, 1, and 2 solutions with our method with a red dashed, green dot-dashed, and yellow finely dashed line respectively.

Although the method can evolve these shocks with the help of the physicality-enforcing operator, small oscillations appear around shocks for higher order bases. These oscillations can be damped out by widening the limiting thresholds for the Moe limiter or by changing the minmod limiter, but this results in more diffusion and lower order convergence for basis order 2.

Figure 5.8: Convergence of the L1 error of the method presented here to a high resolution reference solution of the same Riemann problems from Fig. 5.7 computed with a finite volume scheme (Stone et al., 2020a). From top to bottom, the first row shows a mildly relativistic blast wave, the second a highly relativistic blast wave, the third a blast wave with transverse velocity, the fourth a Sod shock tube, and the fifth a planar shock reflection. The columns show, from left to right, the relativistic density D, the longitudinal momentum density M_x, and the total energy density E. In each panel we show the L1 error of our method with dots, a fitted convergence rate using logarithmically weighted least squares with a solid line, and a 2/3 convergence rate for basis order 0 and a first order convergence rate for bases 1 and 2 with dashed lines. We use different colors to denote different basis orders: blue for basis order 0, orange for basis order 1, and green for basis order 2. Due to the presence of shocks, we expect the L1 error of higher order bases to converge to first order at best, although sharp blasts prove difficult for convergence.
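The reason a single Taub-Matthews equation of state can cover both regimes is that its equivalent adiabatic index varies smoothly between the two ideal-gas limits. The short Python sketch below illustrates this; the closed-form specific enthalpy h = 2.5 Θ + sqrt(2.25 Θ² + 1) (in units with c = 1) is the form commonly quoted for the Taub-Matthews approximation in the literature (e.g., Ryu et al., 2006) and is assumed here rather than reproduced from the equation-of-state section of this chapter:

    import numpy as np

    def tm_enthalpy(theta):
        """Specific enthalpy (units of c^2) for the Taub-Matthews EOS, assuming the
        commonly used closed form h = 2.5*Theta + sqrt(2.25*Theta**2 + 1),
        with Theta = P / (rho c^2)."""
        return 2.5 * theta + np.sqrt(2.25 * theta**2 + 1.0)

    def gamma_eq(theta):
        """Equivalent adiabatic index Gamma_eq = (h - c^2) / (h - c^2 - P/rho),
        written here with c = 1 (see the caption of Fig. 5.9)."""
        h = tm_enthalpy(theta)
        return (h - 1.0) / (h - 1.0 - theta)

    for theta in [1e-4, 1e-2, 1.0, 1e2, 1e4]:
        print(f"Theta = {theta:8.1e}   Gamma_eq = {gamma_eq(theta):.4f}")
    # Limits: ~5/3 as Theta -> 0 (non-relativistic), ~4/3 as Theta -> infinity.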
We show results for the blast wave with the three different equations of state in Fig. 5.9. The Synge gas as approximated by the Taub-Matthews equation of state behaves like the relativistic Γ = 4/3 ideal gas on the left side of the blast wave (which is contained within [0.3, 0.4] at t = 0.7 as shown) and like the non-relativistic Γ = 5/3 ideal gas on the right side. This is most evident in the velocity and pressure profiles in the relativistic region that occupies most of the domain at this time. The equivalent adiabatic index Γ_eq of the Taub-Matthews equation of state is, as expected, 4/3 in the relativistic region and 5/3 in the non-relativistic region, and varies between these values across the blast wave. Within the blast wave, the peak density with the Taub-Matthews equation of state falls between the extremes of the two ideal gases. Notably, the blast wave with the Taub-Matthews equation of state travels slightly faster than either ideal gas, and the minimum transverse velocity is also lower. These results are consistent with the blast waves evolved with the Taub-Matthews equation of state in Ryu et al. (2006).

Figure 5.9: Blast wave with relativistic temperatures on the left and non-relativistic temperatures on the right, evolved to t = 0.7 using the Taub-Matthews equation of state (solid blue), the ideal equation of state with adiabatic index Γ = 4/3 (dashed orange), and the ideal equation of state with Γ = 5/3 (finely dashed green). In order of rows, we show the density ρ, longitudinal velocity u_x, transverse velocity u_y, pressure P, and equivalent adiabatic index Γ_eq = (h − c²)/(h − c² − P/ρ). The Taub-Matthews equation of state, as an approximation to the Synge gas, behaves differently from both the Γ = 5/3 and Γ = 4/3 ideal gases depending on the effective adiabatic index.

5.5.4 2D Riemann Problems

Next, we test the robustness of the method evolving intersecting shocks in 2D using the three 2D Riemann problems from Zanna & Bucciantini (2002) and Núñez-de la Rosa & Munz (2018). In each of the three problems, the problem is defined on a [−1, 1] × [−1, 1] domain divided into four quadrants with different initial states. Following Núñez-de la Rosa & Munz (2018), we denote these quadrants using

    Q1 := [0, 1] × [0, 1]    (5.91)
    Q2 := [−1, 0] × [0, 1]    (5.92)
    Q3 := [−1, 0] × [−1, 0]    (5.93)
    Q4 := [0, 1] × [−1, 0]    (5.94)

and denote the initial primitive states in each of these quadrants by W1, W2, W3, and W4 respectively. For all of these Riemann problems, we use an adiabatic index of Γ = 5/3, set v_z = 0 everywhere, and use transmissive boundary conditions on all sides. We evolve each Riemann problem to t = 0.8/c. For all 2D shock tests we use the Moe limiter (Moe et al., 2015) and the HLLC Riemann solver.

5.5.4.1 2D Riemann Problems: Test 1

In this test, the domain begins with a low density and pressure region in the upper right, a high density and pressure region in the lower left, and intermediate density and high pressure regions in the upper left and lower right, with initial velocities moving into the lower density region with β = 0.7.
    W1 := (0.035145216124503, 0.0, 0.0, 0.162931056509027 c^2)    (5.95)
    W2 := (0.1, 0.7c, 0.0, 1.0 c^2)    (5.96)
    W3 := (0.5, 0.0, 0.0, 1.0 c^2)    (5.97)
    W4 := (0.1, 0.0, 0.7c, 1.0 c^2)    (5.98)

Results from the first 2D Riemann problem are shown in Fig. 5.10 with the 1st and 2nd order bases. The system evolves with stationary contact discontinuities between the high density and moving intermediate density regions, planar shocks moving from the intermediate density regions into the low density regions, and curved shocks bowing into the intermediate density regions from the diagonal. A jet-like, low density structure forms into the high density region with gentle density and pressure gradients forming ahead of and behind it. Our method evolves the curved shocks with symmetric shock fronts using both low order and high order bases. When using bases above 0th order, the physicality-enforcing operator described in §5.3.5 is necessary to avoid negative densities, pressures, and otherwise unphysical states. With the 2nd order basis, we see subtle boundary effects where the shocks traveling transverse to the boundary into the first quadrant intersect with the outflow boundary conditions. Boundary effects with the 2nd order basis are seen again in §5.5.4.2 and §5.5.5.2.

Figure 5.10: Plots of the 2D Riemann problem test 1 with two colliding shocks using the initial conditions in Eq. 5.95, using a 1st order basis in the top row and a 2nd order basis in the bottom row. We show the rest-mass density in the left column and the pressure in the right column at t = 0.8/c on a grid with 1024 elements. Note the boundary effects where shocks traveling into the first quadrant intersect with the outflow boundaries when using the 2nd order basis.

5.5.4.2 2D Riemann Problems: Test 2

In this test, all four quadrants begin with different densities, equal pressures, and each moves diagonally clockwise around the origin.

    W1 := (0.5, 0.5c, −0.5c, 5.0 c^2)    (5.99)
    W2 := (1.0, 0.5c, 0.5c, 5.0 c^2)    (5.100)
    W3 := (3.0, −0.5c, 0.5c, 5.0 c^2)    (5.101)
    W4 := (1.5, −0.5c, −0.5c, 5.0 c^2)    (5.102)

Results from the second 2D Riemann problem are shown in Fig. 5.11 with the 1st and 2nd order bases. The system develops into four vortex sheets that expand from the origin. A low rest-mass density region forms at the center of the vortex sheets at the origin. The physicality-enforcing operator ensures positive densities and pressures in this region. With the 2nd order basis, we see subtle boundary effects where the shocks traveling transverse to the boundary into the first quadrant intersect with the outflow boundary conditions. These boundary effects are not apparent with the 1st order basis.

Figure 5.11: Plots of the 2D Riemann problem test 2 with four vortex sheets using the initial conditions in Eq. 5.99, using a 1st order basis in the top row and a 2nd order basis in the bottom row. We show the rest-mass density in the left column and the pressure in the right column at t = 0.8/c using a grid with 1024 elements. Note the boundary effects where the vortex sheets intersect with the outflow boundaries, which are subtle using the 1st order basis and more apparent when using the 2nd order basis, especially along the top boundary. Like the 1D test of a shock reflecting against a wall, this test highlights unresolved difficulties of higher order bases leading to boundary effects.
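As a concrete illustration of how such four-quadrant initial states are laid out, the Python sketch below assigns the primitive states of Eqs. 5.99-5.102 to the quadrants Q1-Q4 defined in Eqs. 5.91-5.94 on a uniform cell-centered grid; the actual solver initializes the corresponding conserved variables, so this is only a schematic of the geometry:

    import numpy as np

    # Primitive states (rho, vx, vy, P) for 2D Riemann problem test 2
    # (Eqs. 5.99-5.102), in units with c = 1.
    W = {1: (0.5,  0.5, -0.5, 5.0),
         2: (1.0,  0.5,  0.5, 5.0),
         3: (3.0, -0.5,  0.5, 5.0),
         4: (1.5, -0.5, -0.5, 5.0)}

    def quadrant(x, y):
        """Map a point in [-1, 1]^2 to its quadrant Q1..Q4 (Eqs. 5.91-5.94)."""
        if x >= 0.0:
            return 1 if y >= 0.0 else 4
        return 2 if y >= 0.0 else 3

    # Fill a uniform cell-centered grid with the initial rest-mass density.
    n = 256
    centers = np.linspace(-1.0, 1.0, n, endpoint=False) + 1.0 / n
    rho = np.empty((n, n))
    for j, y in enumerate(centers):
        for i, x in enumerate(centers):
            rho[j, i] = W[quadrant(x, y)][0]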
5.5.4.3 2D Riemann Problems: Test 3

This test begins with overdense first and third quadrants following

    W1 := (1.0, 0.0, 0.0, 1.0 c^2)    (5.103)
    W2 := (0.5771, −0.3529c, 0.0, 0.4 c^2)    (5.104)
    W3 := (1.0, −0.3529c, −0.3529c, 1.0 c^2)    (5.105)
    W4 := (0.5771, 0.0, −0.3529c, 0.4 c^2).    (5.106)

Rarefactions move from the second and fourth quadrants into the first and third quadrants, producing curved shocks where the rarefactions intersect. Results from the third 2D Riemann problem are shown in Fig. 5.12 with the 2nd order basis. The method evolves the curved shocks and rarefactions without issue. No boundary effects are apparent in this test.

Figure 5.12: Plots of the 2D Riemann problem test 3 with intersecting rarefactions using the initial conditions in Eq. 5.103. We show the rest-mass density in the left column and the pressure in the right column at t = 0.8/c using a 2nd order basis on a grid with 1024 elements.

5.5.5 Kelvin-Helmholtz Instability

The relativistic Kelvin-Helmholtz instability provides a useful benchmark with which to explore the performance of the scheme presented here for shear-flow type problems. Previous work (e.g., Mignone et al., 2009; Beckwith & Stone, 2011) has revealed significant differences in the performance of different numerical schemes for this classic fluid flow problem, and subsequent work (Lecoanet et al., 2016) has further elucidated the issues raised in prior works through the comparison of finite volume and spectral methods. Here, we compare the discontinuous-Galerkin scheme presented here with a finite volume method previously presented in the literature (Mignone et al., 2011), explore both the linear and non-linear regimes of the instability, and examine performance metrics for the scheme.

We simulate the Kelvin-Helmholtz instability on a [−0.5, 0.5] × [−1.0, 1.0] domain with a single interface along y = 0, specified with a smoothly varying profile, using a mesh of square cells with twice as many cells in y as in x, testing mesh sizes in powers of 2 from 256 × 512 to 4096 × 8192 for a total of 6 different mesh sizes. We tested using basis orders 0, 1, and 2; however, due to memory constraints and increasing execution time, we forgo the highest resolution mesh using basis order 1 and the two highest resolutions using basis order 2. We conduct separate tests using the HLLC and HLL Riemann solvers and using a shear velocity v_{x,0} = 0.25c. We run a total of 60 simulations exploring growth rates of the Kelvin-Helmholtz instability. In all these calculations, we use an ideal equation of state with adiabatic index γ = 4/3 with the iterative conserved-to-primitive solver, an initial density ρ_0 = 1, an initial pressure P_0 = c^2, a perturbation amplitude A = 0.05, and a shearing layer thickness a = 0.01. We use k = 2π so that the wavelength of the perturbations in x is 1, and each test is run until t = 5 to verify from the growth rate that the transverse velocity perturbations have saturated past the linear growth phase.

5.5.5.1 Linear Growth Phase

We explore the growth of the instability by examining the spatial average

    \langle v_y^2 \rangle = \frac{1}{|\Omega|} \int_\Omega v_y^2 \, dV    (5.107)

where Ω is the domain and |Ω| is the volume of the domain. Fig. 5.13 shows ⟨v_y²⟩ as a function of time for the Kelvin-Helmholtz instability simulations explored in this work, where Riemann solvers are grouped by column and basis order and reconstruction method are grouped by row.
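For reference, the diagnostic of Eq. 5.107 and the growth-rate measurement discussed in the following paragraphs (a least squares fit of log⟨v_y²⟩ = A + rt over t = 1.5 to t = 3.0) can be sketched in Python as below; the time series used here is hypothetical and stands in for actual simulation output:

    import numpy as np

    def mean_vy2(vy, cell_volume, domain_volume):
        """Volume average <vy^2> of Eq. 5.107 on a uniform mesh."""
        return np.sum(vy**2) * cell_volume / domain_volume

    def growth_rate(t, vy2, t_min=1.5, t_max=3.0):
        """Least squares fit of log<vy^2> = A + r*t over the linear-growth
        window, returning the growth rate r."""
        mask = (t >= t_min) & (t <= t_max)
        r, A = np.polyfit(t[mask], np.log(vy2[mask]), 1)
        return r

    # Hypothetical time series, for illustration only: exponential growth that
    # saturates past t ~ 3.5, mimicking the behavior seen in Fig. 5.13.
    t = np.linspace(0.0, 5.0, 101)
    vy2 = 1.0e-8 * np.exp(3.2 * np.minimum(t, 3.5))
    print(growth_rate(t, vy2))   # recovers ~3.2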
Except for the lowest resolution simulations, all simulations with the HLLC solver enter a linear growth phase by 𝑡 = 2.0 and display non-linear features by 𝑡 = 4.0. By contrast, simulations that utilize the HLL Riemann solver, especially with the 0th order basis, exhibit large levels of numerical diffusion and substantially reduced growth rates for all but the largest number of degrees of freedom. However, for basis order greater than zero, the HLL Riemann solver exhibits rapid convergence to a well-defined growth rate, while the reference finite volume schemes that utilize this same Riemann solver exhibit changing growth rates over this same range of degrees of freedom. We quantify this result by measuring the growth rate, 𝑟 of ⟨𝑣2𝑦 ⟩ by fitting log⟨𝑣2𝑦 ⟩(𝑡) = 𝐴 + 𝑟𝑡 to the measured ⟨𝑣2𝑦 ⟩ using a least squares curve fit in log space over 𝑡 = 1.5 to 𝑡 = 3.0. We measure the growth rate early in the linear growth phase from 𝑡 = 1.5 to 𝑡 = 3.0 before non-linear modes dominate. We perform the fit in log space so as to not favor the larger changes in ⟨𝑣2𝑦 ⟩ at later times. The growth rate of ⟨𝑣2𝑦 ⟩ for all simulations and methods versus the degrees of freedom is shown in Fig. 5.14. Here, the degrees of freedom for a given resolution 𝑛𝑥 × 𝑛 𝑦 and basis order 𝑝 is DOF = 𝑛𝑥 × 𝑛 𝑦 × ( 𝑝 + 1) 2 . Except for the discontinuous-Galerkin methods using the 0th 184 order basis, the growth rates using different methods converge to approximately the same value with higher resolutions. Generally, using higher order bases, using the HLL Riemann solver over the HLLC Riemann solver, and using the discontinuous-Galerkin method over the finite volume method lead to faster convergence of growth rate. Notably, the overall second order accurate discontinuous-Galerkin scheme (first order basis, second order time integration scheme) achieves a converged growth rate at lower numbers of degrees of freedom than a overall second order accurate finite volume scheme, using either the HLLC or HLL Riemann solver. This result is explored in more detail in Fig. 5.15. The data of this figure shows the difference in growth rate between the highest resolution simulation with a certain method and the lower resolution simulations with the same methods versus the degrees of freedom. The discontinuous-Galerkin simulations with a 1st order basis show the most effective convergence of the simulations explored here, with HLLC converging slightly faster at the highest resolutions and HLL converging faster at lower resolutions. By contrast, the overall second order accurate finite volume schemes exhibit slower convergence than this scheme, despite the equivalent order of accuracy, while the first order accurate discontinuous-Galerkin scheme exhibits similar convergence rates as the finite volume schemes when combined with the HLLC Riemann solver, but low convergence rates with the HLL solver. We also note that the discontinuous-Galerkin simulations with a 2nd order basis do not converge below a 10−1 difference even with high resolutions, which we attribute to interaction of the flow with outflow boundary conditions used here, highlighting the need for improved fidelity boundary conditions in order to realize the promise of higher order discontinuous-Galerkin methods. 5.5.5.2 Non-linear Evolution Fig. 5.16, 5.17, and 5.18 show the state of the Kelvin Helmholtz instability at 𝑡 = 3.0 using the method presented in this work and the reference finite volume scheme Mignone et al. (2011) with the 4 highest resolutions explored in this study. 
The different figures show results using 0th , 1st , and 2nd order bases or 1st , 2nd , and 3rd order methods respectively, where a 1st method is only available for our code. In Fig. 5.16 using our method with a 0th order basis or a 1st order method, 185 HLLC HLL p=0 p=0 10 2 10 4 10 6 10 8 p=1 p=1 10 2 10 4 10 6 10 8 p=2 p=2 10 2 10 4 vy210 6 10 8 10 2 FV PLM RK2 FV PLM RK2 10 4 10 6 ny = 26 ny = 28 ny = 210 ny = 212 10 8 ny = 27 ny = 29 ny = 211 ny = 213 10 2 FV PPM RK3 FV PPM RK3 10 4 10 6 10 8 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.00.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Time Figure 5.13: Mean square of the transverse velocity 𝑣 𝑦 over time of the relativistic 2D Kelvin Helmholtz instability using our DG method using a 0th , 1st , and 2nd order bases respectively in the top three rows and using the finite volume code PLUTO with PLM and PPM reconstruction respectively in the bottom two rows. In the left column we show results including the contact discontinuity in the Riemann solver (using HLLC with our method and HLLD with PLUTO) and without the contact discontinuity using the HLL Riemann solver in the right column. The gray band from 𝑡 = 1.5 to 𝑡 = 3.0 shows the region over which we measure the growth rate shown in other plots. Higher resolutions generally lead to faster growth rates while the more diffusive HLL Riemann solver leads to steadier growth rates due to diminished secondary instabilities. 186 HLLC HLL 4 3 Growth Rate of vy2 2 1 0 p=0 FV PLM RK2 p=1 FV PPM RK3 1 p=2 213 216 219 222 225 213 216 219 222 225 DOF Figure 5.14: Growth rates of ⟨𝑣2𝑦 ⟩ versus degrees of freedom from 𝑡 = 1.5 to 𝑡 = 3.0 of the relativistic 2D Kelvin Helmholtz instability using our DG method using the finite volume code PLUTO. In the left column we show results including the contact discontinuity in the Riemann solver (using HLLC with our method and HLLD with PLUTO) and without the contact discontinuity using the HLL Riemann solver in the right column. Growth rates are measured by computing least squares fit of a ⟨𝑣2𝑦 ⟩ ∝ 𝑡 𝜔 model to the data shown in Fig. 5.13, with error bars showing the standard deviation of the least squares fit. we see significant differences between the HLL and HLLC solutions; the HLL Riemann solver struggles to grow the instability, although the structure of the perturbation resembles results with simple structures when using higher orders. Secondary instabilities appear to be nonexistent. By contrast, the HLLC Riemann solver generates secondary vortices that increasing in amplitude with higher resolutions. Looking at Fig. 5.17 and 5.18, the 2nd and 3rd order methods from this work quickly converge to simple structures. The finite volume method also converges to a similar simple structure, although it requires more resolution compared to the discontinuous-Galerkin method presented here. Figs. 5.19, 5.20, and 5.21 show the state of the Kelvin Helmholtz instability at 𝑡 = 5.0, which is well into the non-linear phase, using the method presented in this work and with the reference finite volume scheme with the 4 highest resolutions explored in this study. The different figures show results using 1st , 2nd , and 3rd order methods respectively, where a 1st is only available for our code. In Fig. 
5.19, using our method with a 0th order basis or a 1st order method, we again see significant differences between the HLL and HLLC solutions. The HLLC solution grows faster than the HLL solution, but neither resembles the structures seen with higher order bases. Using the HLLC Riemann solver, secondary vortices are apparent during the non-linear phase, which become more defined with higher resolution. Examining Figs. 5.20 and 5.21, the 2nd and 3rd order methods from this work quickly converge with higher resolution to simple structures during the non-linear phase. Results with HLL over HLLC and with a 2nd order basis over a 1st order basis are generally smoother with fewer secondary vortices. The solution generated by the reference finite volume scheme also converges to roughly the same structures as the discontinuous-Galerkin method, although secondary instabilities are obvious along the interface between the primary vortices. Note that the mode of these secondary instabilities increases with resolution, with smaller but more numerous instabilities at higher resolutions.

Figure 5.15: The absolute difference in growth rate between the highest resolution simulation for each method and each of the lower resolution simulations, which serves as a rough measure of the error of the growth rate, plotted versus the degrees of freedom. The discontinuous-Galerkin simulations with a 1st order basis show the most effective convergence of the simulations explored here, with HLLC converging slightly faster at the highest resolutions and HLL converging faster at lower resolutions. The discontinuous-Galerkin simulations with a 2nd order basis do not converge below a 10^{-1} difference even with high resolutions, which we attribute to the boundary effects that worsen with higher resolution. Otherwise, the other methods converge at varying rates, with the 0th order basis discontinuous-Galerkin methods converging the slowest.
However, we stress that this is a single application on both methods and that the performance of either method may depend on details of the set up of the instability. Over the development of the method, we also explored the analytic growth rate of perturbations given the initial conditions from Bodo et al. (2004), where we found that the growth rate of the instability generally did not match the analytically predicted growth rate, and that an initial transient outgoing wave from the initial perturbation caused significant boundary effects with the 2nd order basis. Although our discontinuous-Galerkin method provides apparently better results in this case, more development especially around boundary conditions is required. 5.5.6 Performance To test the performance of the method on multiple architectures, we timed simulations of the Kelvin Helmholtz instability on CPUs and GPUs, using the perturbations described in §5.5.5. For 189 Figure 5.16: Snapshots of the transverse velocity at 𝑡 = 3.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 0th order basis. We show results using the HLL Riemann solver in the top row and with HLLC in the bottow row. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. With basis order zero, at this stage, using the HLL Riemann solver the method has difficulty growing the Kelvin Helmholtz instability, although the structure of the perturbation resembles results with simple structures when using higher orders. The HLLC Riemann solver generates secondary vortices that get worse with high resolutions, which leads to a climbing growth rate. both architectures, we time the performance of the code with 𝑣𝑥,0 = 0.25𝑐 using basis orders 0, 1, and 2 and resolutions of 256 × 512, 512 × 1024, and 1024 × 2048 with each basis order testing both HLLC and HLL for a total of 18 simulations for both architectures. We conduct CPU testing on 1024 cores spread across 22 dual socket nodes with Intel Xeon Platinum 8268 CPUs, comprising approximately ∼ 88TFLOPS in total. For GPU runs we use 32 NVidia Tesla V100-SXM2 GPUs spread across 8 nodes, comprising approximately ∼ 250TFLOPS in total. These computational resources were chosen to accommodate the memory needed for the largest simulation in the performance profiling suite. We show profiling results with the HLLC and HLL Riemann solvers and with the 0th , 1st , and 2nd order bases in Fig. 5.22. The degree of freedom updates per second is computed with DOF × steps × stages per step DOF per second = , (5.108) time to solution in seconds 190 Figure 5.17: Snapshots of the transverse velocity at 𝑡 = 3.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 1st order basis in the first and third row and with the PLUTO finite volume MHD code with a first order method. We show results using the HLL Riemann solver in the top two rows and with HLLC for our code and with HLLD for PLUTO in the bottow two rows. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. Note that DG method has 4 times as many degrees of freedom with the 1st order basis, meaning that our 512 × 1024 simulation is comparable in degrees of freedom to the 1024 × 2048 simulation using PLUTO. 
At this times and these resolutions, the results with our DG method have converged to a similar solution with a simple structure. Results with PLUTO converge towards the DG method results, with secondary vortices present at lower resolutions that are more pronounced with HLLC. 191 Figure 5.18: Snapshots of the transverse velocity at 𝑡 = 3.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 2nd order basis in the first and third row and with the PLUTO finite volume MHD code with a second order method. We show results using the HLL Riemann solver in the top two rows and with HLLC for our code and with HLLD for PLUTO in the bottom two rows. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. Note that DG method has 4 times as many degrees of freedom with the 1st order basis, meaning that our 512 × 1024 simulation has degrees of freedom between the 1024 × 2048 simulation and 2048 × 4096 simulation using PLUTO. With this higher order basis at 𝑡 = 3.0, we also see the results with our DG method converge quickly to simple structures while the results with PLUTO require more resolution to suppress secondary vortices. However, in our results using 4096 × 8912 cells with basis order 2, we see anomalously high transverse velocities away from the interface, which is caused by boundary effects at high resolutions that will be addressed in future improvements to the method. 192 Figure 5.19: Snapshots of the transverse velocity at 𝑡 = 5.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work using a 0th order basis. We show results using the HLL Riemann solver in the top row and with HLLC in the bottom row. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. At late times into what should be the linear growth phase, our DG method with the HLL solver struggles to growth the instability at low resolutions. The HLLC method has developed some structures but they do not resemble results at higher orders. which serves as a measure of computational efficiency. With the RK1, SSPRK2, and SSPRK3 integrators used for basis orders 0, 1, and 2 we use 1, 2, and 3 stages per step for the respective basis orders. We show profiling results with the HLLC and HLL Riemann solvers and with the 0th , 1st , and 2nd order bases, between which we see little difference in performance. Comparing between the CPU and GPU runs, we see that the CPU performance becomes saturated at around 106 DOF while the GPUs have not saturated the performance, even with simulations using more than 10 times the degrees of freedom. Simulations with more degrees of freedom would not fit within GPU memory here, indicating that our present implementation is unable to fully saturate GPU performance. Note that the theoretical peak throughput of the GPU resources using here is approximately three times the throughput for the CPU resources. 
Memory bandwidth, both between RAM and the registers on CPUs and between HBM memory and the registers on GPUs, is similarly greater on the GPUs. Since the CPUs and GPUs achieve roughly the same updates per second, this indicates underutilization

Figure 5.20: Snapshots of the transverse velocity at t = 5.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work with a 1st order basis in the first and third rows and with the PLUTO finite volume MHD code with PLM reconstruction. We show results using the HLL Riemann solver in the top two rows and with HLLC for our code and with HLLD for PLUTO in the bottom two rows. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 4096 × 8192 cells from left to right. Note that the DG method has 4 times as many degrees of freedom per cell with the 1st order basis, meaning that our 512 × 1024 simulation is comparable in degrees of freedom to the 1024 × 2048 simulation using PLUTO. At this later time, once the instability has entered the nonlinear growth phase, the DG method shows clear roll ups at all resolutions. Secondary vortices are suppressed with higher resolutions and by the more diffusive HLL solver. In contrast, the PLUTO results show secondary instabilities throughout the perturbation, although these diminish with resolution. Notably, the structure of the instabilities with the DG method versus the finite volume method is very different.

Figure 5.21: Snapshots of the transverse velocity at t = 5.0 from simulations of the relativistic Kelvin-Helmholtz instability using the method presented in this work with a 2nd order basis in the first and third rows and with the PLUTO finite volume MHD code with PPM reconstruction. We show results using the HLL Riemann solver in the top two rows and with HLLC for our code and with HLLD for PLUTO in the bottom two rows. We show the four highest resolution simulations across the columns, ranging from 512 × 1024 to 2048 × 4096 cells from left to right. Note that the DG method has 9 times as many degrees of freedom per cell with the 2nd order basis, meaning that our 512 × 1024 simulation has degrees of freedom between the 1024 × 2048 and 2048 × 4096 simulations using PLUTO. The suppression of secondary vortices with our DG method is enhanced with basis order 2 compared to basis order 1, requiring fewer cells and degrees of freedom. Secondary instabilities still appear with the finite volume method, largely unaffected by the increase in method order.

Figure 5.22: Performance of the code modeling the Kelvin-Helmholtz instability from §5.5.5, plotting updates to degrees of freedom per second versus degrees of freedom, using 1024 cores spread across 22 dual socket nodes with Intel Xeon Platinum 8268 CPUs (comprising approximately 88 TFLOPS in total) in the left column and using 32 NVidia Tesla V100-SXM2 GPUs (comprising approximately 250 TFLOPS in total) spread across 8 nodes on the right, where the peak computational throughput of the GPUs used is roughly three times the peak computational throughput of the CPUs. The computational resources for both tests were chosen to accommodate the memory needed for the largest simulation in the suite. We show profiling results with the HLLC and HLL Riemann solvers and with the 0th, 1st, and 2nd order bases, between which we see little difference in performance.
Comparing between the CPU and GPU runs, however, we see that the CPU performance becomes saturated at around 106 DOFs while the GPUs have not saturated the performance, even with simulations using more than 10 times the degrees of freedom. of GPU FLOPS. i.e. our implementation is failing to meet computation or memory bounds, where the arithmetic-intensity of discontinuous-Galerkin methods lead to typically memory bound algorithms. These performance characteristics are consistent with insufficient work within individual kernels to offset kernel launch overhead, as was the case in the K-Athena magnetohydrodynamics code presented in Grete et al. (2021a) and was resolved in the Parthenon adaptive-mesh refinement framework and AthenaPK magnetohydrodynamics code presented in Grete et al. (2022). We performed an informal profiling of our method evolving the Kelvin-Helmholtz instability on a single V100 GPU using nvprof. With a timeline trace, we verified for problem sizes that occupied the entirety of the HBM memory of a single GPU that a large percentage of compute time on the 196 GPU, > 70%, was dominated by short duration 4μs kernel calls. These kernel durations would be consumed by kernel launch overhead from within the CUDA API. With the launch of each kernel, between the APIs, drivers, and hardware a few microseconds are spent launching the kernel on the GPU. Unless sufficient work is done within each kernel, this launch overhead will dominate runtime. For our implementation, the work done within individual kernels can be increased with more degrees of freedom. However, the GPU has insufficient memory to allow enough work to hide kernel launch overhead, hence the underutilization of the GPU. In Parthenon and AthenaPK, this kernel launch overhead was hidden by fusing together the work from multiple kernels into fewer, larger kernelsGrete et al. (2021a). Similar improvements would be needed for our implementation in order to saturate GPU performance. 5.6 Summary We have presented a scheme to evolve the relativistic hydrodynamics equations using a discontinuous-Galerkin method. Within our scheme, we have developed a robust method for enforcing physicality of the conserved state via a operator. Our presentation of the method in- cludes relativistic HLL and HLLC Riemann solvers, multiple methods for recovering the primitive variables from conserved variables with the ideal equation of state, and the Taub-Matthews approx- imation to the Synge equation of state, using physical units that keep factors of 𝑐. We implement the method using the Kokkos performance portability library, which allows us to run CPUs and GPUs supported by Kokkos. The novel physicality-enforcing operator in the work allows evolution of shocks with high- order basis methods. The operator strictly enforces positive density and pressure and subluminal velocities on all basis points within a cell by smoothing nonphysical points towards the physical volume average. Additionally, the method conserves volume averages of conserved variables. In our exploration of methods to recover primitive variables from conserved variables when using an ideal equation of state, we found that the iterative method from Riccardi & Durante (2008) was faster, more robust, and more accurate than the analytical method from Ryu et al. (2006), consistent with findings from Riccardi & Durante (2008). The iterative method for ideal gases 197 presented here recovers the primitive variables by solving a quartic as described in Eq. 
5.58, which provides more digits of precision in simultaneously in sub-relativistic and ultra-relativistic regimes compared to solving in terms of the velocity or Lorentz factor. Additionally, the Newton-Raphson method as applied to Eq. 5.59 gives comparable accuracy to the analytic method in under 10 iterations, as is explored in Fig. 5.2. More iterations allow a more accurate recovery with the iterative method compared to the analytic method. In the case of our implementation, the iterative method is faster to compute for 𝛾 < 10 on CPUs and always faster on GPUs except in trivial cases. Conversely, in our exploration of methods to recover primitives variables from conserved variables with the Taub-Mathews equation of state, the analytical method detailed in Ryu et al. (2006) was faster than the iterative method implemented in this work. With the Taub-Mathews equation of state, recovering the primitives requires solving a cubic equation, which has a much simpler analytical solution compared to the quartic equation for the ideal gas. Solving this cubic equation iteratively requires a bounded root solver, where we use Brent’s method in this work. The iterative method we implemented for the Taub-Matthews equation of state requires many more iterations to achieve acceptable accuracy than the iterative solver for the ideal gas. As such, we found the analytic method for the Taub-Matthews equation of state to outperform the iterative method in terms of time to solution and accuracy on both CPUs and GPUs. With this method, we ran several standard test problems, including linear waves, 1D and 2D Riemann problems, and the relativistic Kelvin-Helmholtz problem. The iterative conserved- to-primitive solver facilitated more relativistic problems and the physicality-enforcing operator allowed stable evolution with higher order bases for problems with shocks. In some test problems with a shock moving transverse to an outflow boundary conditions, we saw some non-physical boundary effects when using a 2nd order basis. In our tests of the Kelvin-Helmholtz instability, comparing to results using a finite volume reference scheme (Mignone et al., 2011), the discontinuous-Galerkin method presented in this work can better suppress secondary vortices and instabilities compared to the finite volume method. Our method works best with a 1st order basis, which is a 2nd order method in space and time, since the 198 0th order basis is slow to grow the instability with low resolution while with the 2nd order basis boundary effects enter in at the outflow boundaries with high resolution. In the tests of the Kelvin-Helmholtz instability and some of the 2D Riemann problems, we saw numerical boundary effects enter at the outflow boundary conditions, which increased with higher resolutions. Further development of the outflow boundaries with higher order bases is required. Finally, in the exploration of the performance of our implementation evolving the Kelvin- Helmholtz instability, we found that our implementation is unable to saturate performance on GPUs before the problem size grows too large for the GPU memory. From these performance results and profiling using nvprof, we suspect that insufficient work inside individual kernels, leading to kernel launch overhead dominating runtime, is responsible for the lack of performance on GPUs. Combining the work from multiple kernels – as was done in the Parthenon framework presented in Grete et al. 
(2021a) – would be needed for our implementation in order to saturate GPU performance.

CHAPTER 6

SIMULATIONS OF GALAXY CLUSTERS WITH MAGNETIC AGN JET FEEDBACK

6.1 Motivation

The hot, diffuse plasma called the intracluster medium (ICM), which comprises the majority of the baryonic mass in galaxy clusters, is known to maintain significant magnetic fields (Carilli & Taylor, 2002; Govoni & Feretti, 2004; Donnert et al., 2018). These magnetic fields have been observed via a number of techniques, which include: inference from synchrotron emitting radio relics; the magnetic Sunyaev-Zeldovich (SZ) effect, where magnetic fields lead to modified electron energy distributions, which cause inverse Compton scattering of photons from the cosmic microwave background (CMB) off electrons in the ICM (Hu & Lou, 2004) and a distortion in the CMB; Faraday rotation, where the magnetic fields rotate the polarization of photons passing through the magnetic fields of the galaxy cluster (Clarke et al., 2001; Carilli & Taylor, 2002; Clarke, 2004); and cold fronts, where magnetic fields suppress the Kelvin-Helmholtz instability, preserving a sharp discontinuity in the gas (Vikhlinin et al., 2001a,b; Ghizzardi et al., 2010). These measurements of the magnetic fields in galaxy clusters, however, do not directly give the magnetic field strengths or geometry but instead inform inferences of these properties using assumptions of magnetic length scales and models. Although the magnetic fields are not dominant over gravitational forces or gas pressure in the ICM, they are nevertheless believed to be dynamically important, maintaining field strengths of ∼ 1-50 μG (Vacca et al., 2018; Donnert et al., 2018). The amplification of the cluster magnetic fields to their present values is likewise an open question, where shocks, turbulence, and jets launched by active galactic nuclei (AGN) are likely to play large roles (Donnert et al., 2018).

Computational modeling is a cornerstone for inferring magnetic field properties, for understanding their dynamical role in the ICM, and for determining how magnetic fields are created and amplified in galaxy clusters. One specific aspect that can be addressed by global galaxy cluster simulations is how the magnetized jets launched by AGN can affect magnetic fields and energy balance within the ICM (Li et al., 2006; Wang et al., 2020). The jets emitted by AGN are collimated by the magnetic fields generated by the AGN accretion disk and are thus inherently defined by their magnetic fields. Observations of these jets have inferred a helical magnetic tower structure of the AGN jet (Gabuzda, 2021). Simulations of these magnetic AGN jets, both in isolation and including their impact on the galaxy cluster, have been performed in the past (Li et al., 2006; Gan et al., 2017; Martí, 2019; Barniol Duran et al., 2017), although their role in the self-regulation of AGN feedback and cooling has been under-investigated. Kinetic jet models are able to self-regulate (Meece Jr, 2016; Meece et al., 2017), while thermal-only heating models have difficulty self-regulating while also maintaining a realistic galaxy cluster, as was explored in Chapter 2. To rectify this gap in exploration, my current work is focused on simulations comparing magnetized AGN feedback to kinetic jet and thermal feedback to ascertain how well magnetized AGN feedback triggered by cold gas accretion can self-regulate in galaxy clusters.
To best explore this question, we intend to perform the highest resolution simulations of galaxy clusters to date, using world-class supercomputers. Said supercomputers, however, use GPUs from a number of different manufacturers for the majority of their computational throughput. Thus, a performance portable magnetohydrodynamics code such as the K-Athena code presented in Chapter 4 is needed to utilize these GPU supercomputers. K-Athena only supports uniform grids, however, and simulating galaxy clusters with high resolution near the galaxy cluster core requires adaptive mesh refinement (AMR), where the resolution of the simulation grid is increased near fine features in the system and decreased where the flow is smooth. Resolving the entire ∼ 4 Mpc box of a galaxy cluster simulation down to 10 pc grid cells would require ∼ 10 EB (exabytes) of disk space to store one output, whereas the current largest supercomputer provides ∼ 100 PB (petabytes). AMR allows effectively the same accuracy for a fraction of the data volume, allowing us to resolve a central box of ∼ 40 kpc around the AGN down to 10 pc.

Although implementing AMR in K-Athena would be possible, such a project would be challenging for a small university team. Thus, we collaborated with Los Alamos National Laboratory and researchers at the Institute for Advanced Study to develop Parthenon, a performance portable AMR framework. Using this new framework we developed the performance portable AMR MHD code AthenaPK, a successor to K-Athena that can perform these AMR simulations of magnetized galaxy clusters with magnetized AGN feedback.

6.2 Methodology

6.2.1 Simulation Setup

The exascale galaxy cluster simulations we intend to run will use a Cartesian grid in a cubic volume with a side length of 3.2 Mpc, with 128³ cells in the base grid of a static mesh refinement hierarchy. We enforce 3 levels of refinement within [−400, 400]³ kpc (where the root grid is the 0th level), 5 levels of refinement within [−100, 100]³ kpc, and 11 levels of refinement within [−12.5, 12.5]³ kpc, giving us ∼ 12 pc resolution on the finest grid. We are currently testing simulations with the physics discussed below at lower resolutions that fit into local supercomputer resources. These test simulations will inform our upcoming simulation campaign on exascale systems.

Cosmological expansion is neglected in these simulations. We use a ΛCDM model to get the virial mass of the NFW halo and to set its gas temperature, following Meece et al. (2017). We set redshift z = 0 at initialization with Ω_M = 0.3, Ω_Λ = 0.7, and H_0 = 70 km s⁻¹ Mpc⁻¹. We note that the precise details of the cosmological model will not impact explorations of the baryonic physics in the halo core, which is our primary interest.

6.2.1.1 Gravitational Potential

The gravitational potential has three components: a dark matter halo profile, a brightest cluster galaxy (BCG) mass profile, and a supermassive black hole (SMBH). We chose parameters for each of these to reflect a typical galaxy cluster. The dark matter follows the NFW profile (Navarro et al., 1997), using M_NFW = 1 × 10^15 M_⊙ for the mass within the virial radius and a concentration parameter c_NFW = 6. The gravitational field from the NFW profile takes the form

    g_{NFW}(r) = \frac{G M_{NFW}}{r^2} \, \frac{\ln\left(1 + \frac{r}{R_{NFW}}\right) - \frac{r}{r + R_{NFW}}}{\ln\left(1 + c_{NFW}\right) - \frac{c_{NFW}}{1 + c_{NFW}}}.    (6.1)
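As a small numerical sketch of the NFW component (Eq. 6.1, together with the scale radius and density defined in Eqs. 6.2-6.4 below), the profile can be evaluated in Python with astropy handling the units; this is an illustration only and not part of the simulation code:

    import numpy as np
    from astropy import units as u, constants as const

    # Cluster parameters quoted in Sec. 6.2.1 and 6.2.1.1.
    M_nfw = 1.0e15 * u.Msun
    c_nfw = 6.0
    H0 = 70.0 * u.km / u.s / u.Mpc

    f_c = np.log(1.0 + c_nfw) - c_nfw / (1.0 + c_nfw)      # concentration term
    rho_crit = 3.0 * H0**2 / (8.0 * np.pi * const.G)       # Eq. 6.4
    rho_nfw = (200.0 / 3.0) * rho_crit * c_nfw**3 / f_c    # Eq. 6.3
    R_nfw = ((M_nfw / (4.0 * np.pi * rho_nfw * f_c))**(1.0 / 3.0)).to(u.kpc)  # Eq. 6.2

    def g_nfw(r):
        """NFW gravitational acceleration of Eq. 6.1."""
        x = (r / R_nfw).decompose()
        bracket = np.log(1.0 + x) - x / (1.0 + x)
        return (const.G * M_nfw / r**2 * bracket / f_c).to(u.cm / u.s**2)

    print(R_nfw, g_nfw(100.0 * u.kpc))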
6.2.1.1 Gravitational Potential

The gravitational potential has three components: a dark matter halo profile, a brightest cluster galaxy (BCG) mass profile, and a supermassive black hole (SMBH). We chose parameters for each of these to reflect a typical galaxy cluster. The dark matter follows the NFW profile (Navarro et al., 1997), using M_NFW = 1 × 10^15 M⊙ for the mass within the virial radius and a concentration parameter c_NFW = 6. The gravitational field from the NFW profile takes the form

    g_{NFW}(r) = \frac{G M_{NFW}}{r^2} \, \frac{\ln\left(1 + r/R_{NFW}\right) - r/(r + R_{NFW})}{\ln\left(1 + c_{NFW}\right) - c_{NFW}/(1 + c_{NFW})}.    (6.1)

The scale radius R_NFW for the NFW profile is computed from

    R_{NFW} = \left( \frac{M_{NFW}}{4\pi \rho_{NFW} \left[ \ln\left(1 + c_{NFW}\right) - c_{NFW}/(1 + c_{NFW}) \right]} \right)^{1/3}    (6.2)

where the scale density ρ_NFW is computed from

    \rho_{NFW} = \frac{200}{3} \, \frac{c_{NFW}^3}{\ln\left(1 + c_{NFW}\right) - c_{NFW}/(1 + c_{NFW})} \, \rho_{crit}.    (6.3)

The critical density ρ_crit is computed from

    \rho_{crit} \equiv \frac{3 H_0^2}{8 \pi G}.    (6.4)

We use a Hernquist BCG profile

    g_{BCG}(r) = G \, \frac{M_{BCG}}{R_{BCG}^2} \, \frac{1}{\left( 1 + r/R_{BCG} \right)^2}    (6.5)

with M_BCG = 5 × 10^11 M⊙ and R_BCG = 4 kpc. We include the gravitational field from a SMBH with M_SMBH = 4 × 10^8 M⊙ at the center of the cluster halo.

6.2.1.2 Entropy Profile

The initial entropy profile of the gas uses the specific entropy

    K \equiv \frac{k_b T}{n_e^{2/3}}    (6.6)

where k_b is Boltzmann's constant, T is the temperature, and n_e is the electron density, and is initialized following the power law

    K(r) = K_0 + K_{100} \left( r / 100\,\mathrm{kpc} \right)^{\alpha_K},    (6.7)

as introduced in the ACCEPT database (Cavagnolo et al., 2009). We use K_0 = 10.0 keV cm², K_100 = 150.0 keV cm², and α_K = 1.1 for the initial entropy profile, which is typical of a CC cluster.

6.2.1.3 Initial Pressure and Density (Hydrostatic Equilibrium)

We compute the initial pressure and density by requiring the initial cluster to be in hydrostatic equilibrium given the gravitational profile described above and the ACCEPT-like entropy profile, assuming an ideal gas with adiabatic index γ = 5/3. Additionally, to close the set of equations defining the initial gas profile, we follow Meece Jr (2016) and approximate the temperature of a hydrostatic ICM following Voit (2005)

    k_B T_{NFW} = \frac{\mu m_h}{2} \left[ 10\, G\, M_{NFW}\, H_0 \right]^{2/3}    (6.8)

where T_NFW is the fixed temperature at the virial radius R_NFW.

6.2.1.4 Linearly Interpolated Tabular Cooling

We use a sub-cycled cooling method with a linearly interpolated tabular cooling function. Over each hydrodynamic cycle, we integrate the internal energy with cooling using an RK45 method, where the difference between the fourth-order and fifth-order estimates is used to adjust the subcycle. When the relative error in the change in internal energy over a subcycle is greater than 10⁻⁵, we redo the subcycle. We limit the minimum subcycle time step to be 1/100 of the fluid time step. Additionally, we limit the fluid time step to be no greater than 1/10 of the cooling time within any cell. We use the cooling table from Schure et al. (2009) with solar metallicity. We use a helium mass fraction χ = 0.25, with the remaining baryonic mass in hydrogen, which allows the temperature T to be defined from the density ρ and pressure P following

    T = \frac{\mu m_h}{k_B} \frac{P}{\rho}    (6.9)

where m_h is the atomic mass of hydrogen, k_B is Boltzmann's constant, and μ is the mean particle mass per m_h, found from

    \mu = \left[ \frac{3}{4} \chi + 2 \left( 1 - \chi \right) \right]^{-1}.    (6.10)
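The sub-cycled cooling update can be summarized with a short, single-cell sketch. The snippet below integrates the internal energy of one cell over a fluid time step using SciPy's adaptive RK45 integrator with a relative tolerance of 10⁻⁵ and a cooling table interpolated linearly in log space. It is a simplified illustration, not the AthenaPK implementation, and the table values, density, and time step are placeholders.

```python
# Simplified single-cell illustration of error-controlled, sub-cycled
# tabulated cooling (not the AthenaPK implementation; values are placeholders).
import numpy as np
from scipy.integrate import solve_ivp

kb, mh, gamma, mu, X_H = 1.380649e-16, 1.6735e-24, 5.0/3.0, 0.6, 0.75  # cgs

# Toy cooling table: log10(T [K]) vs log10(Lambda [erg cm^3 / s]).
logT_tab = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
logL_tab = np.array([-23.0, -21.8, -22.4, -23.0, -22.7])

def cooling_rate(t, e, rho):
    """de/dt from tabulated cooling; e is internal energy density (erg/cm^3)."""
    T = (gamma - 1.0) * mu * mh * e[0] / (rho * kb)
    lam = 10.0 ** np.interp(np.log10(T), logT_tab, logL_tab)
    n_H = X_H * rho / mh
    return [-(n_H ** 2) * lam]          # simplified: n_e ~ n_H

rho, e0, dt_fluid = 1e-26, 1e-10, 3.15e13   # placeholder density, energy, ~1 Myr step
sol = solve_ivp(cooling_rate, (0.0, dt_fluid), [e0], args=(rho,),
                method="RK45", rtol=1e-5, atol=1e-30)
print(f"{sol.nfev} RHS evaluations, e: {e0:.3e} -> {sol.y[0, -1]:.3e} erg/cm^3")
```

SciPy's embedded RK45 plays the role of the fourth/fifth-order error estimate described above; the production code additionally enforces the minimum subcycle length and the 1/10 cooling-time limit on the fluid time step.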
6.2.1.5 Precessing Jet Coordinates

For injection of kinetic feedback by the AGN, initialization of the magnetic tower field, and feedback from the magnetic tower, we assume a precessing AGN jet and so employ coordinate transforms to convert Cartesian coordinates relative to the simulation frame into cylindrical coordinates relative to the precessing jet, and to transform cylindrical vector fields relative to the precessing jet into Cartesian coordinates relative to the simulation frame.

First, we define axes for Cartesian coordinates relative to the jet. Let φ_jet = φ_{0,jet} + ω_{φ,jet} t be the azimuthal angle of the jet (relative to x̂), where φ_{0,jet} is the initial azimuthal angle and ω_{φ,jet} is the precession frequency, and let θ_jet be the inclination angle of the jet. The axis of the jet points along the vector n̂ ≡ (1, θ_jet, φ_jet) in spherical coordinates relative to the simulation frame. Using SymPy to generate the coordinate transforms, a position with simulation Cartesian coordinates (x_sim, y_sim, z_sim) has the following Cartesian coordinates relative to the jet

    x_{jet} = x_{sim} \cos\phi_{jet} \cos\theta_{jet} + y_{sim} \sin\phi_{jet} \cos\theta_{jet} - z_{sim} \sin\theta_{jet}    (6.11)
    y_{jet} = -x_{sim} \sin\phi_{jet} + y_{sim} \cos\phi_{jet}    (6.12)
    z_{jet} = x_{sim} \cos\phi_{jet} \sin\theta_{jet} + y_{sim} \sin\phi_{jet} \sin\theta_{jet} + z_{sim} \cos\theta_{jet}    (6.13)

and the following cylindrical coordinates relative to the jet

    r = \sqrt{x_{jet}^2 + y_{jet}^2}    (6.14)
    \theta = \arctan\left( y_{jet} / x_{jet} \right)    (6.15)
    h = z_{jet}.    (6.16)

A vector in cylindrical coordinates relative to the jet (such as the magnetic vector potential, magnetic field, or kinetic jet velocity), denoted by (v_r, v_θ, v_h) at position (r, θ, h), can be expressed in Cartesian coordinates relative to the jet following

    v_{x,jet} = v_r \cos\theta - v_\theta \sin\theta    (6.17)
    v_{y,jet} = v_r \sin\theta + v_\theta \cos\theta    (6.18)
    v_{z,jet} = v_h.    (6.19)

This vector in simulation Cartesian coordinates can then be found by multiplying the vector (v_{x,jet}, v_{y,jet}, v_{z,jet}) by the direction cosine matrix that converts from jet Cartesian coordinates to simulation Cartesian coordinates:

    \begin{pmatrix} v_{x,sim} \\ v_{y,sim} \\ v_{z,sim} \end{pmatrix} =
    \begin{pmatrix}
      \cos\phi_{jet}\cos\theta_{jet} & -\sin\phi_{jet} & \cos\phi_{jet}\sin\theta_{jet} \\
      \sin\phi_{jet}\cos\theta_{jet} & \cos\phi_{jet} & \sin\phi_{jet}\sin\theta_{jet} \\
      -\sin\theta_{jet} & 0 & \cos\theta_{jet}
    \end{pmatrix}
    \begin{pmatrix} v_{x,jet} \\ v_{y,jet} \\ v_{z,jet} \end{pmatrix}.    (6.20)

6.2.1.6 Magnetic Tower

We initialize the galaxy cluster with a pre-existing magnetic tower following the form described in Li et al. (2006). The magnetic fields are described in cylindrical coordinates as

    B_r = B_0 \, \frac{2 h r}{\ell^2} \exp\left( \frac{-r^2 - h^2}{\ell^2} \right)    (6.21)
    B_\theta = B_0 \, \alpha \frac{r}{\ell} \exp\left( \frac{-r^2 - h^2}{\ell^2} \right)    (6.22)
    B_h = B_0 \, 2 \left( 1 - \frac{r^2}{\ell^2} \right) \exp\left( \frac{-r^2 - h^2}{\ell^2} \right)    (6.23)

where r is the distance from the jet axis aligned along ĥ, θ is the polar angle around ĥ, and h is the height above the assumed accretion disk. The parameter α controls the relative strength between poloidal and toroidal fields, where α ∼ 2.6 corresponds to roughly equal poloidal and toroidal flux. Following Li et al. (2006), we use α = 20, which corresponds to a strong toroidal flux consistent with a magnetic tower that is highly wound by a rotating accretion disk.

6.2.1.7 AGN Feedback

We include AGN feedback using thermal heating, kinetic jet, and magnetic tower models, exploring different relative strengths. We divide the AGN feedback between the three mechanisms following

    \dot{E}_{AGN} = \dot{E}_T + \dot{E}_K + \dot{E}_B = \left( f_T + f_K + f_B \right) \dot{E}_{AGN}    (6.24)

where Ė_AGN is the total AGN feedback rate; Ė_T, Ė_K, and Ė_B are the total thermal, kinetic, and magnetic AGN feedback rates; and f_T, f_K, and f_B are the thermal, kinetic, and magnetic fractions of the total AGN feedback rate.

6.2.1.8 Thermal AGN Feedback

In the thermal feedback model, thermal energy is deposited at a uniform heating rate per volume within a sphere around the center of the halo where the presumed AGN resides:

    \dot{e}_T(r) = \begin{cases} \dfrac{f_T \dot{E}_{AGN}}{(4/3)\pi R_T^3} & r \leq R_T \\ 0 & \text{otherwise} \end{cases}    (6.25)

where we use R_T = 0.5 kpc for the radius of thermal feedback.
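As a cross-check of the transforms above, the short script below builds the direction cosine matrix of Equation 6.20 from the jet angles, verifies that the jet axis maps onto n̂, and round-trips a point through the simulation-to-jet and cylindrical transforms. It is an illustrative NumPy sketch with made-up angles, not the SymPy-generated code used in the simulations.

```python
# Illustrative check of the precessing-jet coordinate transforms (Eqs. 6.11-6.20).
import numpy as np

def jet_rotation(theta_jet, phi_jet):
    """Direction cosine matrix taking jet Cartesian coordinates to simulation
    Cartesian coordinates (Eq. 6.20); its transpose maps simulation -> jet."""
    ct, st = np.cos(theta_jet), np.sin(theta_jet)
    cp, sp = np.cos(phi_jet), np.sin(phi_jet)
    return np.array([[cp * ct, -sp, cp * st],
                     [sp * ct,  cp, sp * st],
                     [-st,     0.0, ct]])

theta_jet, phi_jet = 0.3, 1.1          # arbitrary inclination and azimuth [rad]
R = jet_rotation(theta_jet, phi_jet)

# The jet axis (z_jet) expressed in simulation coordinates should be n_hat.
n_hat = np.array([np.sin(theta_jet) * np.cos(phi_jet),
                  np.sin(theta_jet) * np.sin(phi_jet),
                  np.cos(theta_jet)])
assert np.allclose(R @ np.array([0.0, 0.0, 1.0]), n_hat)

# Round trip: simulation -> jet Cartesian -> cylindrical -> back to simulation.
x_sim = np.array([1.0, -2.0, 0.5])
x_jet, y_jet, h = R.T @ x_sim
r, theta = np.hypot(x_jet, y_jet), np.arctan2(y_jet, x_jet)
back = R @ np.array([r * np.cos(theta), r * np.sin(theta), h])
assert np.allclose(back, x_sim)
print("jet axis and round-trip checks passed")
```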
6.2.1.9 Kinetic AGN Feedback

In the kinetic feedback model, kinetic energy and mass are injected above and below a presumed accretion disk inside a cylindrical jet. We align the jet along the z axis with a radius R_K = 1 kpc and extend it H_K = 10 kpc above and below the xy plane. The rate of mass injection by the jet is set proportional to the kinetic jet power f_K Ė_AGN divided by a kinetic jet efficiency parameter ε_K = 10⁻² (Meece et al., 2017), reflecting the low efficiency of the conversion of accretion into kinetic flows. The mass injection rate follows the form

    \dot{M}_K = \frac{f_K \dot{E}_{AGN}}{\epsilon_K c^2}    (6.26)

where c is the speed of light. The jet then injects a mass density

    \dot{\rho}_K = \frac{\dot{M}_K}{2 \pi R_K^2 H_K}    (6.27)

with a jet speed

    v_K = \sqrt{2 \epsilon_K}\, c    (6.28)

heading away from the xy plane. The momentum density injected into the cluster is then

    \dot{\mathbf{M}}_K(\mathbf{r}) = \begin{cases} \mathrm{sign}(h)\, \dot{\rho}_K v_K \hat{h} & r \leq R_K \text{ and } |h| \leq H_K \\ 0 & \text{otherwise} \end{cases}    (6.29)

where r here is the distance from the jet axis and h is the signed height above or below the accretion disk. The injected kinetic energy per volume is

    \dot{e}_K(\mathbf{r}) = \begin{cases} \frac{1}{2} \dot{\rho}_K v_K^2 & r \leq R_K \text{ and } |h| \leq H_K \\ 0 & \text{otherwise} \end{cases}    (6.30)

so that the total kinetic energy injected matches f_K Ė_AGN.

6.2.1.10 Magnetic AGN Feedback

In the magnetic feedback model, a magnetic field is deposited proportional to the magnetic tower field proposed in Li et al. (2006)

    \mathcal{B}_r = \mathcal{B}_0 \, \frac{2 h r}{\ell^2} \exp\left( \frac{-r^2 - h^2}{\ell^2} \right)    (6.31)
    \mathcal{B}_\theta = \mathcal{B}_0 \, \alpha \frac{r}{\ell} \exp\left( \frac{-r^2 - h^2}{\ell^2} \right)    (6.32)
    \mathcal{B}_h = \mathcal{B}_0 \, 2 \left( 1 - \frac{r^2}{\ell^2} \right) \exp\left( \frac{-r^2 - h^2}{\ell^2} \right)    (6.33)

where r is the distance from the jet axis, h is the signed height above or below the accretion disk, ℓ is the length scale, α controls the ratio of poloidal to toroidal fields, and 𝓑_0 is the strength of the magnetic field. A vector potential corresponding to this magnetic field can be written as

    \mathcal{A}_r = 0    (6.34)
    \mathcal{A}_\theta = \mathcal{B}_0 \, \ell \frac{r}{\ell} \exp\left( \frac{-r^2 - h^2}{\ell^2} \right)    (6.35)
    \mathcal{A}_h = \mathcal{B}_0 \, \ell \frac{\alpha}{2} \exp\left( \frac{-r^2 - h^2}{\ell^2} \right)    (6.36)

so that ∇ × 𝓐 = 𝓑. Constructing the magnetic fields from the vector potential is preferred in order to keep ∇ · 𝓑 as close to zero as possible. We apply magnetic fields from Equation 6.31 aligned to a precessing jet, and so the coordinate and vector transformations described in §6.2.1.5 are necessary to transform (x, y, z) → (r, θ, h) and (𝓑_r, 𝓑_θ, 𝓑_h) → (𝓑_x, 𝓑_y, 𝓑_z).
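As a consistency check on the feedback field, the snippet below evaluates the tower field of Equations 6.31–6.33 and the vector potential of Equations 6.34–6.36, and verifies numerically, via centered differences of the axisymmetric cylindrical curl, that ∇ × 𝓐 reproduces 𝓑 at a sample point. The parameter values are placeholders and the script is a standalone sketch, not the AthenaPK implementation.

```python
# Spot-check that the curl of the vector potential (Eqs. 6.34-6.36) reproduces
# the magnetic tower field (Eqs. 6.31-6.33). Placeholder parameters; sketch only.
import numpy as np

B0, alpha, ell = 1.0, 20.0, 1.0   # field strength, twist, length scale (placeholders)

def tower_B(r, h):
    e = np.exp((-r**2 - h**2) / ell**2)
    return (B0 * 2 * h * r / ell**2 * e,          # B_r
            B0 * alpha * r / ell * e,             # B_theta
            B0 * 2 * (1 - r**2 / ell**2) * e)     # B_h

def tower_A(r, h):
    e = np.exp((-r**2 - h**2) / ell**2)
    return (0.0, B0 * r * e, B0 * ell * alpha / 2 * e)   # (A_r, A_theta, A_h)

def curl_A(r, h, eps=1e-6):
    """Axisymmetric curl in cylindrical (r, theta, h) by centered differences."""
    dAth_dh = (tower_A(r, h + eps)[1] - tower_A(r, h - eps)[1]) / (2 * eps)
    dAh_dr = (tower_A(r + eps, h)[2] - tower_A(r - eps, h)[2]) / (2 * eps)
    d_rAth_dr = ((r + eps) * tower_A(r + eps, h)[1]
                 - (r - eps) * tower_A(r - eps, h)[1]) / (2 * eps)
    return (-dAth_dh, -dAh_dr, d_rAth_dr / r)

r, h = 0.7, 0.4
assert np.allclose(curl_A(r, h), tower_B(r, h), rtol=1e-5)
print("curl(A) matches the tower field at (r, h) =", (r, h))
```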
We use the magnetic field from Equation 6.31 both to apply the initial magnetic field and to apply magnetic feedback, which can be scaled to a specified field rate or power. Initializing the magnetic field is simple: to initialize with a field strength B_0, we set the initial magnetic field to B = 𝓑|_{𝓑_0 = B_0}. Injecting the field at a constant rate is also simple: to inject at a field rate Ḃ_0, we add a magnetic field at the rate 𝓑̇ = 𝓑|_{𝓑_0 = Ḃ_0}. Injecting magnetic energy at a specified power is more complicated, since the existing magnetic field must be considered, and both the linear and quadratic contributions from the injected magnetic field enter the energy. Given an existing magnetic field B^n and a timestep Δt, we inject a magnetic field following the magnetic tower model with a strength B_p that must be determined so that the new magnetic field is

    \mathbf{B}^{n+1} = \mathbf{B}^n + \Delta t \, \mathcal{B}|_{\mathcal{B}_0 = B_p}.    (6.37)

The change in total magnetic energy is then

    \Delta E_B = \int_\Omega \frac{1}{2} \mathbf{B}^{n+1} \cdot \mathbf{B}^{n+1} - \frac{1}{2} \mathbf{B}^{n} \cdot \mathbf{B}^{n} \, dV    (6.38)
    = \frac{1}{2} \int_\Omega \mathbf{B}^{n} \cdot \mathbf{B}^{n} + 2 \Delta t \, \mathbf{B}^{n} \cdot \mathcal{B}|_{\mathcal{B}_0 = B_p} + (\Delta t)^2 \, \mathcal{B}|_{\mathcal{B}_0 = B_p} \cdot \mathcal{B}|_{\mathcal{B}_0 = B_p} - \mathbf{B}^{n} \cdot \mathbf{B}^{n} \, dV    (6.39)
    = \frac{1}{2} \int_\Omega 2 \Delta t \, \mathbf{B}^{n} \cdot \mathcal{B}|_{\mathcal{B}_0 = B_p} + (\Delta t)^2 \, \mathcal{B}|_{\mathcal{B}_0 = B_p} \cdot \mathcal{B}|_{\mathcal{B}_0 = B_p} \, dV    (6.40)
    = \Delta t \, B_p \int_\Omega \mathbf{B}^{n} \cdot \mathcal{B}|_{\mathcal{B}_0 = 1} \, dV + \frac{(\Delta t)^2 B_p^2}{2} \int_\Omega \mathcal{B}|_{\mathcal{B}_0 = 1} \cdot \mathcal{B}|_{\mathcal{B}_0 = 1} \, dV    (6.41)

where Ω is the simulation domain. To determine the magnetic field strength B_p to be injected, the two integrals in Equation 6.41 corresponding to the linear and quadratic contributions must first be computed (via a reduction over the entire domain); B_p can then be determined from the quadratic formula (only one root should be positive). For the case of magnetic field injection by the AGN, the change in magnetic energy is set to ΔE_B = Δt f_B Ė_AGN and the corresponding B_p is determined from the reductions above.

Applying a magnetic tower field injects a finite total magnetic energy even when applied over all space, due to the exponential decay away from the AGN. The total magnetic energy E_B when applied over all space is given by

    E_B = \int_0^\infty \int_0^{2\pi} \int_{-\infty}^{\infty} \frac{1}{2} \mathcal{B} \cdot \mathcal{B} \; r \, dh \, d\theta \, dr = \mathcal{B}_0^2 \, \frac{\pi^{3/2} \left( \alpha^2 + 5 \right) \ell^3}{8 \sqrt{2}}.    (6.42)
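Once the two domain integrals in Equation 6.41 have been reduced over the mesh, solving for the injected field strength B_p is a small final step. The sketch below shows only that step, with placeholder values for the integrals and the target energy; the mesh reduction itself is not shown and the variable names are hypothetical.

```python
# Solve Eq. 6.41 for the injected tower field strength B_p given the target
# energy input dE = dt * f_B * E_dot_AGN and the two precomputed domain
# integrals (placeholder values; in practice these come from mesh reductions).
import numpy as np

I_lin = 3.2e-4    # \int B^n . B|_{B0=1} dV        (placeholder)
I_quad = 7.5e-2   # \int B|_{B0=1} . B|_{B0=1} dV  (placeholder)
dt, dE = 1.0e-3, 2.0e-6

# dE = dt*I_lin*B_p + 0.5*dt^2*I_quad*B_p^2  ->  a*B_p^2 + b*B_p - dE = 0
a, b = 0.5 * dt**2 * I_quad, dt * I_lin
disc = b**2 + 4.0 * a * dE
B_p = (-b + np.sqrt(disc)) / (2.0 * a)   # the positive root

# Verify the injected energy matches the target.
assert np.isclose(dt * I_lin * B_p + 0.5 * dt**2 * I_quad * B_p**2, dE)
print(f"B_p = {B_p:.6g}")
```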
6.2.1.11 AGN Cold Mass Triggering

AGN feedback is triggered by cold mass around the presumed AGN. AGN triggering occurs within an accretion zone of radius r_acc = 10 kpc around the AGN. Within the accretion zone, gas with a temperature below the user-defined threshold T_cold = 5 × 10⁴ K triggers AGN feedback. The mass accretion rate onto the AGN follows

    \dot{M}_{AGN} = \int_{r < r_{acc}} \rho_{cold}(\mathbf{r}) / t_{acc} \, dV    (6.43)

where ρ_cold(r) is equal to ρ(r) in cells where T(r) ≤ T_cold and 0 otherwise, and t_acc = 100 Myr is the accretion time scale. The total AGN feedback rate is then set to

    \dot{E}_{AGN} = \epsilon_{AGN} \dot{M}_{AGN} c^2    (6.44)

where ε_AGN = 10⁻³ is the cold mass triggering efficiency.

The accreted mass is removed from the simulation. Mass is only removed from cells within the accretion zone with a temperature below the cold gas temperature threshold. The density removed follows the rate

    \dot{\rho}(\mathbf{r}) = \begin{cases} \rho(\mathbf{r}) / t_{acc} & T(\mathbf{r}) < T_{cold} \\ 0 & \text{otherwise.} \end{cases}    (6.45)

6.3 Current State of Simulations

Each of the components needed to initialize the galaxy cluster simulation and to evolve the cluster with triggered AGN feedback has been individually tested and verified to work as expected. Integrated tests of all of these components are underway. Analysis pipelines using yt (Turk et al., 2011) are also in development. Testing of the full magnetized galaxy cluster simulation setup is expected to be completed by early summer 2022, in time for exascale simulations later in the summer.

CHAPTER 7

SUMMARY AND FUTURE DIRECTIONS

The ultimate goal of this dissertation is to better understand the behavior of diffuse astrophysical plasmas, especially as applied to the intracluster medium (ICM), and to develop better numerical tools and methods to explore these plasmas. In this final chapter, I first summarize in Section 7.1 each of the chapters comprising peer-reviewed or near-submission work, which includes Chapters 2, 3, 4, and 5. I then describe the ongoing and future work of these projects in Section 7.2, including the many projects spawned by Parthenon and AthenaPK, the work at Sandia National Laboratories enabled by the relativistic discontinuous-Galerkin (DG) method I presented in Chapter 5, the ongoing magnetized galaxy cluster simulations and future additions to those simulations, and finally the work on magnetized jets in AGN accretion disks that I plan to explore as a postdoctoral fellow at Los Alamos National Laboratory.

7.1 Summary of Dissertation Work

7.1.1 Chapter 2: Tests of AGN Feedback Kernels in Simulated Galaxy Clusters

In Chapter 2, we explored the energy deposition from active galactic nuclei (AGN) feedback that is necessary to prevent cooling catastrophes within the cluster while maintaining a realistic entropy profile (Glines et al., 2020). To this end, we ran 91 simulations of idealized galaxy clusters with a simplified model of AGN feedback, abstracting the thermalization of AGN jets and magnetic fields as a spherically symmetric heating kernel balanced against the cooling within the cluster, and testing a range of heating kernel profiles with varying degrees of central peaking (see Figure 2.2 and Table 2.1). We did not find a spherical heating kernel that produced a quasi-stable galaxy cluster that both avoided a cooling catastrophe and kept an observationally realistic entropy profile. We did find that sharply centrally peaked heating kernels prevented cooling catastrophes by severely exceeding radiative cooling in the cluster core, where cooling times are short compared to the lifetime of the cluster. These centrally overpowered heating kernels led to centrally inverted entropy profiles in which the high central entropy was resistant to overcooling but was incongruent with the entropy profiles of observed galaxy clusters. We also found that weakly centrally peaked heating kernels kept realistic entropy profiles but failed to offset central cooling, leading to cooling catastrophes well under the observationally expected lifetimes of these clusters (see Figure 2.3).

Although these simulations did not conclusively rule out a thermal-only abstraction of AGN feedback, they do point towards more complex mechanisms than pure heating at play in self-regulating cool-core (CC) clusters. To explore such questions, we would like to perform high fidelity simulations with enough primary and secondary physics to realistically model the magnetized ICM and AGN jet, including at least magnetic fields and potentially non-ideal MHD effects, cosmic ray pressure, and possibly relativistic AGN jet velocities. Such simulations would also need at least an order of magnitude increase in computing resources in order to include the additional physics and increase the resolution of the ICM. Enabling such simulations was the goal of the work presented in Chapter 4 developing K-Athena, a performance portable MHD code that can utilize upcoming supercomputers. This goal of high resolution magnetized galaxy cluster simulations is coming to fruition with the in-progress work presented in Chapter 6.

7.1.2 Chapter 3: Magnetized Decaying Turbulence in the Weakly Compressible Taylor-Green Vortex

In Chapter 3, we explored the development of magnetized turbulence from the decay of a large scale flow, performing 9 simulations of the magnetized Taylor-Green vortex.
The decaying turbulent plasma scenario models the growth of turbulence in the ICM due to large scale, infrequent perturbations, such as a galaxy cluster merger or an AGN outburst. These simulations are distinct from the more commonly performed driven turbulence simulations, where a stochastic force is applied to the plasma, continually injecting energy at the injection scale. In this respect, once turbulence is well developed in our simulations, the energy spectrum is uncontaminated by injected energy and is purely a result of the energy already present in the turbulence.

In these simulations, magnetic energy came to dominate over kinetic energy even when the initial magnetic field was small. We found that the magnetized turbulence developed from this decaying flow scaled following a k^{−4/3} power law, flatter than the k^{−5/3} power law of hydrodynamical turbulence, with comparatively more energy at smaller scales in the magnetized turbulence, confirming results from related driven turbulence simulations (see Figure 3.4; Grete et al., 2018, 2021b). Using the energy transfer analysis developed by Grete et al. (2017), we explored the development of the energy spectrum through the energy transfers between kinetic and magnetic energy at different scales. The buildup of energy at smaller scales was aided by non-local energy transfer from large scale kinetic energy to all scales of magnetic energy via magnetic tension, a mechanism absent in hydrodynamical turbulence. In general, the magnetized turbulence behaved differently from hydrodynamical turbulence and thus should not be ignored in explorations of turbulence in the ICM.

7.1.3 Chapter 4: K-Athena: A Performance Portable Structured Grid Finite Volume Magnetohydrodynamics Code

In Chapter 4, we present the performance portable magnetohydrodynamics (MHD) code K-Athena, which is designed to enable computational astrophysics on the next generation of supercomputers while maintaining performance on traditional supercomputers. These new supercomputers will use graphics processing units (GPUs) from a number of different vendors for the majority of their computational throughput. At the same time, supercomputers using traditional CPUs will persist for the near future. Astrophysics codes capable of efficiently utilizing both GPUs from all manufacturers and traditional central processing units (CPUs) are needed to enable simulations on whatever hardware a computational astrophysicist might have access to. Performance portable codes provide this high performance on multiple architectures with a single code base, eliminating the development cost involved in creating and maintaining multiple versions for different architectures. K-Athena fulfills this need for a performance portable MHD code for uniform grids.

We demonstrated K-Athena running simulations on many of the largest supercomputers
This roofline analysis also allowed us to compute for K-Athena a 62.8% performance portability metric as measuring against theoretical performance limited by the DRAM bandwidth (Pennycook et al., 2016). The K-Athena code has been used for two papers to date (Grete et al., 2021b; Glines et al., 2021), including for the magnetized turbulence work shown in Chapter 3. 7.1.4 Chapter 5: Relativistic Discontinuous-Galerkin Hydrodynamics In Chapter 5, we presented a robust method for evolving the special relativistic hydrodynamics equations using a DG method. In a DG method, the fluid within individual elements of the simulation domain is represented by linear combinations of polynomials instead of just a cell average, allowing linear, quadratic, and higher order spatial contributions to be carried in each element. The methods have the dual advantage of being easily raised to arbitrary spatial orders by just increasing the order of the polynomial basis and for non-Cartesian mesh boundary conditions since stencils span a single cell (for which there exists an internal Cartesian-like grid) rather than multiple cells. These traits make them extremely valuable for terrestrial plasma simulations, where non-rectangular apparatuses introduce irregular boundaries that are best handled with unstructured meshes but still benefit from higher order methods. The relativistic hydrodynamics method we present includes a unique exploration of solvers for the primitive recovery step – the non-trivial inversion of a transcendental equation to get the primitive variables from the conserved variables, which is an essential step to compute fluxes in both FV and DG methods. We show accuracy and speed performance for analytic and iterative solvers for both the ideal equation of state and the Taub-Matthews approximation to the Synge gas equation of state. In this exploration we show that the iterative methods for finding the roots of the 215 quartic polynomial to invert the ideal equation of state is both faster and more accurate than the analytic method while the analytic methods for finding the roots of the cubic polynomial to invert the Taub-Matthews equation of state is conversely faster and more accurate. This reversed result may be a consequence of the high complexity in analytic quartic solvers and the simplicity of the Newton-Raphson method for the ideal equation of state versus the lower complexity of analytic cubic solvers and higher complexity in the bisection method needed for the Taub-Matthews solver. For the method we developed a novel operator for maintaining the physicality of conserved states when running DG with higher orders (see Section 5.3.5). When shocks arise in DG simulations (and in FV simulations), non-physical conserved variables can be introduced after integrating fluxes around the discontinuity, especially when using spatial orders higher than 0th order. In a special relativistic hydrodynamics method these non-physical conserved variables can correspond to negative pressures or densities, superluminal velocities, or may correspond to imaginary or complex variables. These non-physical conserved variables can be screened using Equation 5.25. Within the algorithm we developed, when non-physical conserved variables are detected at interior points in a DG cell the physicality enforcing operator smooths all points within the cell towards the volume average so that all points are made physical as long as the volume average is physical. 
In practice, this operator was necessary to run any simulation with shocks with a spatial 1st order basis and higher. We also compared the accuracy of this method relative to a FV scheme at evolving the relativistic Kelvin-Helmholtz instability using both the HLL and HLLC Riemann solvers and with 0th , 1st , and 2nd order bases. We found that the DG method we developed better suppresses non-physical secondary vortices and instabilities in the linear growth phase of the Kelvin-Helmholtz instability when compared to the FV method, especially when using higher order bases (see Figures 5.16, 5.17, 5.18, 5.19, 5.20, and 5.21). Non-physical boundary effects entered into the outflow boundary conditions with increased resolution, however, indicating that more development of higher order outflow boundary conditions is needed. 216 7.2 Ongoing and Future Work 7.2.1 Parthenon and AthenaPK The K-Athena MHD code, however, was only capable of running performance portable uniform grid simulations, where the resolution of the simulation is the same across the domain. This limited the applicability of K-Athena for studying galaxy clusters, a class of systems requiring high resolution near the central AGN and AGN jet but also including an expansive halo that can be simulated with low resolution. Although possible, the design of K-Athena would make implementation of performance portable AMR difficult for a small team. A performance portable MHD code with adaptive mesh refinement (AMR) was needed for galaxy cluster simulations. In fact, many simulated systems from both astrophysical and terrestrial plasma physics would benefit from performance portable AMR capabilities. Such an AMR code would impact a large cross-section of the computational plasma community. To fulfill this need, we founded a collaboration to develop the Parthenon AMR framework (https://github.com/lanl/parthenon): a performance portable AMR framework based on the AMR implementation in Athena++ but tuned for performance portability on GPUs and CPUs (Grete et al., 2022). Originally conceived as an AMR capable successor to K-Athena, Parthenon has gone on to be the basis for many codes in the computational plasma astro- physics community. These include AthenaPK (https://gitlab.com/theias/hpc/jmstone/ athena-parthenon/athenapk, Athena-Parthenon-Kokkos) an AMR-capable successor to K-Athena; Phoebus (https://github.com/lanl/phoebus), a general relativistic (GRMHD) code with neutrino radiation for modeling neutron star mergers and intermediate mass black hole candidates; KHARMA, another GRMHD code to be used for interpretation of black hole imaging via the Event Horizon Telescope as part of a 2022 INCITE award; RIOT, a Los Alamos National Laboratory-based multiphysics code (Grete et al., 2022); and likely more codes yet to be developed. This collaboration, which K-Athena inspired, will enable exascale simulations of the ICM, galaxy clusters, magnetized turbulence, the formation of intermediate mass black holes, AGN accretion 217 disks, black hole imaging, planet formation, and many terrestrial plasma systems. The success of K-Athena and later Parthenon has also inspired Kokkos integration into other codes, includ- ing Athena-K, a GRMHD code in development at the Institute of Advanced Study, and Enzo-E (Bordner & Norman, 2018). 
7.2 Ongoing and Future Work

7.2.1 Parthenon and AthenaPK

The K-Athena MHD code, however, was only capable of running performance portable uniform grid simulations, where the resolution of the simulation is the same across the domain. This limited the applicability of K-Athena for studying galaxy clusters, a class of systems requiring high resolution near the central AGN and AGN jet but also including an expansive halo that can be simulated with low resolution. Although possible, implementing performance portable AMR within K-Athena's design would be difficult for a small team. A performance portable MHD code with adaptive mesh refinement (AMR) was needed for galaxy cluster simulations. In fact, many simulated systems in both astrophysical and terrestrial plasma physics would benefit from performance portable AMR capabilities. Such an AMR code would impact a large cross-section of the computational plasma community.

To fulfill this need, we founded a collaboration to develop the Parthenon AMR framework (https://github.com/lanl/parthenon): a performance portable AMR framework based on the AMR implementation in Athena++ but tuned for performance portability on GPUs and CPUs (Grete et al., 2022). Originally conceived as an AMR-capable successor to K-Athena, Parthenon has gone on to be the basis for many codes in the computational plasma astrophysics community. These include AthenaPK (https://gitlab.com/theias/hpc/jmstone/athena-parthenon/athenapk, Athena-Parthenon-Kokkos), an AMR-capable successor to K-Athena; Phoebus (https://github.com/lanl/phoebus), a general relativistic MHD (GRMHD) code with neutrino radiation for modeling neutron star mergers and intermediate mass black hole candidates; KHARMA, another GRMHD code to be used for interpretation of black hole imaging via the Event Horizon Telescope as part of a 2022 INCITE award; RIOT, a Los Alamos National Laboratory-based multiphysics code (Grete et al., 2022); and likely more codes yet to be developed. This collaboration, which K-Athena inspired, will enable exascale simulations of the ICM, galaxy clusters, magnetized turbulence, the formation of intermediate mass black holes, AGN accretion disks, black hole imaging, planet formation, and many terrestrial plasma systems. The success of K-Athena and later Parthenon has also inspired Kokkos integration into other codes, including Athena-K, a GRMHD code in development at the Institute for Advanced Study, and Enzo-E (Bordner & Norman, 2018).

7.2.2 Relativistic DG Methods

The merit of the algorithm we developed has already been demonstrated in two upcoming papers from Sandia National Laboratories on which I am a co-author (Roberds et al., 2022; Hamlin et al., 2022), which rely on this method and my implementation for different flavors of extended relativistic MHD methods. Both papers use this relativistic method as a basis for a relativistic two-fluid MHD scheme. In the relativistic two-fluid MHD equations, the electrons and ions of the plasma comprise two relativistic fluids with distinct densities, flow velocities, and pressures. These two fluids are coupled together via Maxwell's equations and Ohm's law (Amano, 2016). Roberds et al. (2022) uses this two-fluid method to study electron emission across a warm diode – a scenario where the electron species is accelerated across a gap via injected kinetic energy and is injected with thermal energy that is non-negligible compared to the injected kinetic energy. The solution using the two-fluid MHD method was compared against a semi-analytic model for the 1D warm diode problem and found to converge at 2nd order accuracy, as expected for this problem (see Figure 7.1). Preliminary results are reported in Laity et al. (2021).

Figure 7.1: Electric field (top left) and pressure (top right) along the 1D warm diode with electron temperatures T_e = 1, 10, 100 eV using the relativistic two-fluid MHD DG method with my contributions, shown in red, green, and blue for each temperature, with the exact solutions from a semi-analytic model shown in black. L1 error in electric field (bottom left) and pressure (bottom right) of the relativistic two-fluid MHD DG method relative to the exact solution, showing 2nd order convergence as expected for the second-order accurate fluid solver. Figures taken from Laity et al. (2021). © Sandia National Laboratories.

Hamlin et al. (2022) uses the two-fluid method for 2D numerical simulations of a magnetron, a device that converts electrostatic potential energy in the electron population into microwave energy. That work compares the fluid approach to the PIC method conventionally used for that system. Results are awaiting publication.

Lessons learned from developing the relativistic hydrodynamics method for DG will be applied to relativistic MHD algorithms implemented in AthenaPK for future projects. These methods and implementations yet to be developed will be the basis for studies and simulations of relativistic AGN jet feedback, to determine whether the relativistic nature of the jet has an impact on energy deposition and thermalization within a magnetized ICM.

7.2.3 Simulations of Magnetized Galaxy Clusters

In Chapter 6 I share my current work developing simulations of magnetized AGN jets within a magnetized galaxy cluster. In order to achieve resolutions higher than previously possible in galaxy cluster simulations with AGN feedback, we developed Parthenon, a performance portable AMR framework, and on top of it AthenaPK, a performance portable MHD code with AMR. These developments in performance portability will allow us to run on more architectures, and specifically on the GPU supercomputers comprising the highest echelons of current and near-future computing resources. The unique scale of these computing resources will enable higher fidelity galaxy cluster simulations than previously explored, giving us better tools to examine the thermalization of AGN feedback within the ICM.
AthenaPK currently implements all the necessary components to explore high resolution simulations studying the magnetic aspect of AGN feedback, which is discussed in Chapter 6. These components include a new precessing magnetic tower injection model for AGN feedback, which allows exploration of whether precessing magnetic towers can self-regulate a CC cluster like a precessing kinetic jet (Meece Jr, 2016). These simulations and their analysis will run in summer 2022, with publication of results expected later this year.

With this performance portable MHD code as a base, we can add a variety of additional physics while keeping the simulations computationally feasible. The first addition will be cosmic rays, which may play an important role in self-regulating AGN feedback. The non-thermal, relativistically moving electrons and especially protons comprising these cosmic rays have long been suspected to play a key role in offsetting cooling and preventing cooling flows in the ICM, providing additional heating and pressure (Loewenstein et al., 1991; Ando & Nagai, 2008). Although the cosmic ray energy density is low compared to the thermal energy density in the ICM (Dunn & Fabian, 2004), cosmic rays may be a key factor in AGN feedback by elongating and inflating AGN-created bubbles through the anisotropic pressure they exert along magnetic field lines (Guo & Oh, 2008; Guo & Mathews, 2011). AGN feedback itself injects cosmic rays into the ICM by creating shocks and turbulence where charged particles can be accelerated to relativistic velocities via the first and second order Fermi processes (Krymskii, 1977; Bell, 1978; Bustamante et al., 2010). Thus, cosmic ray injection may be an important component of a self-regulating AGN. The implementation of cosmic rays in AthenaPK will follow Jiang & Oh (2018), which has a publicly available implementation in AthenaPK's parent code Athena++. We will then extend the current simulation campaign exploring magnetized AGN feedback to include the injection of cosmic rays, to see how they affect self-regulation of the AGN.

Beyond the addition of cosmic ray pressure, we will also investigate how a non-ideal MHD model that better describes the plasma of the ICM might affect AGN feedback and self-regulation. Specifically, we will explore Braginskii MHD, which includes anisotropic transport of particles not present in ideal MHD, introducing anisotropic heat conduction and anisotropic pressure along magnetic field lines (Braginskii, 1965). This model of the plasma better reflects the weakly collisional nature of the ICM (Reynolds, 2018). In the ICM, these non-ideal effects may play a role in magnetized turbulence (Kunz et al., 2011; Ruszkowski & Oh, 2011) and the amplification of magnetic fields (St-Onge et al., 2020), and may bring a small but non-negligible amount of heating from the cluster outskirts into the core (Voigt et al., 2002; Voigt & Fabian, 2004; Ruszkowski & Oh, 2011; Yang & Reynolds, 2016b). This more accurate representation of the ICM has been of keen interest in the last decade (Ruszkowski & Oh, 2011; Berlok et al., 2020) and will be a key feature for high fidelity simulations of galaxy clusters.

7.2.4 AGN Accretion Disk Channel for Intermediate Mass Black Holes

Being one of the only performance portable AMR frameworks available, the Parthenon library is poised to impact many computational studies of astrophysical and terrestrial plasmas. AthenaPK is also likely to be applied to many different systems in the near future.
With more and more GPU supercomputers coming online, there are ample computational resources available with diverse GPU architectures but few codes that can use them. One such area of exploration, headed by scientists at Los Alamos National Laboratory and to which I will be contributing, is the formation of intermediate mass black holes via AGN accretion disks.

With the advent of gravitational wave observatories such as the Laser Interferometer Gravitational-Wave Observatory (LIGO), we now have unprecedented access to the masses of previously unobservable black holes via binary black hole (BBH) mergers. In the observation of GW190521 by LIGO, we observed an 85 M⊙ black hole merge with a 66 M⊙ black hole, creating a 142 M⊙ black hole, the heaviest BBH merger to date (LIGO Scientific Collaboration and Virgo Collaboration et al., 2020). This merger poses theoretical inconsistencies, since black hole masses in the ∼ 60 − 120 M⊙ mass gap are excluded by conventional theories of black hole formation via pair instability supernovae, despite both progenitors in the BBH falling into this black hole mass gap (Woosley, 2017). The mechanism by which these BBHs may have formed is as yet poorly understood (Koliopanos, 2018), although several formation channels have been proposed, including primordial black holes (Lacroix & Silk, 2018), Population III stars (massive stars formed from metal poor gas in the early universe; Lacroix & Silk, 2018), mergers of stellar-mass black holes in dense environments (Rose et al., 2021), and super-Eddington accretion (accreting faster than the traditional limit where radiation pressure emitted by accreting gas balances the gravitational pull) in dense environments (Ogawa et al., 2017; Toyouchi et al., 2021). One channel of particular interest is the AGN channel, where stellar-mass black holes can accrete at super-Eddington rates in the gas-rich environment of an AGN accretion disk (McKernan et al., 2012, 2014). Such regions in the AGN accretion disk could form multiple > 50 M⊙ black holes that could produce mergers such as GW190521. Of benefit to observational verification, such super-Eddington accretion and the jets emitted by mergers in an accretion disk should have a signature as the jet breaks out of the AGN accretion disk (Zhu et al., 2021). However, there are multiple aspects of the AGN channel that still need to be studied to determine how to identify the combined gravitational wave and electromagnetic signature of a BBH merger embedded in an AGN accretion disk. Many of these aspects will be studied with performance portable AMR simulations built upon the Parthenon library that I helped enable, including GRMHD simulations of the BBH using Phoebus and jet simulations within the AGN accretion disk using AthenaPK.

As a Metropolis fellow at Los Alamos National Laboratory, I will perform simulations of relativistic magnetized jets emerging from an AGN accretion disk to better characterize the electromagnetic signature of a BBH merger for heavy black holes formed in the AGN channel. These simulations will use the magnetized AGN jet physics implemented as part of my PhD work and re-contextualize it within an AGN accretion disk using the "shearing box" approximation. This approximation reformulates the MHD equations in a local, Cartesian reference frame co-rotating with a disk (Hawley et al., 1995; Sharma et al., 2006; Stone & Gardiner, 2010).
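For reference, in the standard unstratified shearing box the ideal MHD momentum equation gains Coriolis and tidal source terms in the frame co-rotating at the local orbital frequency Ω_0 (Hawley et al., 1995; Stone & Gardiner, 2010). The form below is quoted from that standard literature as a reminder of the approximation, not as the exact equations that will be used in the planned work:

    \frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot \left[ \rho \mathbf{v}\mathbf{v} - \mathbf{B}\mathbf{B} + \left( P + \tfrac{1}{2} |\mathbf{B}|^2 \right) \mathsf{I} \right] = 2 q \rho \Omega_0^2 x \, \hat{\mathbf{x}} - 2 \Omega_0 \hat{\mathbf{z}} \times \rho \mathbf{v}

where q ≡ −d ln Ω / d ln R is the shear parameter (q = 3/2 for a Keplerian disk), x is the local radial coordinate, vertical gravity is omitted, and the radial boundaries are shear-periodic.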
The flow introduced by the shearing box will approximate the AGN disk environment surrounding jets emanating from mergers. After exploring these simulations of a magnetized jet escaping from a shearing box, we will explore coupling radiative transfer to the magnetohydrodynamics in order to better model the dense environment of the AGN accretion disk. This work will use the same algorithm as the cosmic ray solver we will use in the magnetized galaxy cluster simulations (Jiang et al., 2014; Jiang & Oh, 2018). The inclusion of radiative transfer will enable better predictions of electromagnetic observations of jets escaping AGN accretion disks. The next step will be to implement relativistic MHD jets in order to explore high velocity jets from mergers, to see if relativistic effects impact the jet structure and electromagnetic signature. That work will be informed by the relativistic DG method developed in Chapter 5.

BIBLIOGRAPHY

Aarseth, S. J., Gott, III, J. R., & Turner, E. L. 1979, The Astrophysical Journal, 228, 664 Afzal, A., Ansari, Z., Faizabadi, A. R., & Ramis, M. K. 2017, Archives of Computational Methods in Engineering, 24, 337 Alexakis, A., Mininni, P. D., & Pouquet, A. 2005, Phys. Rev. E, 72, 046301 Allen, S. W., Evrard, A. E., & Mantz, A. B. 2011, Annual Review of Astronomy and Astrophysics, 49, 409 Amano, T. 2016, The Astrophysical Journal, 831, 100 Ando, S., & Nagai, D. 2008, Monthly Notices of the Royal Astronomical Society, 385, 2243 Arenas, A., & Chorin, A. J. 2006, Proceedings of the National Academy of Sciences, 103, 4352 Artigues, V., Kormann, K., Rampp, M., & Reuter, K. 2019, arXiv e-prints, arXiv:1911.08394 Aymar, R., Barabaschi, P., & Shimomura, Y. 2002, Plasma Physics and Controlled Fusion, 44, 519 Bambic, C. J., Morsony, B. J., & Reynolds, C. S. 2018, The Astrophysical Journal, 857, 84 Banerjee, N., & Sharma, P. 2014, MNRAS, 443, 687 Barniol Duran, R., Tchekhovskoy, A., & Giannios, D. 2017, Monthly Notices of the Royal Astronomical Society, 469, 4957 Bartelmann, M. 2010, Classical and Quantum Gravity, 27, 233001 Bartelmann, M., & Schneider, P. 2001, Physics Reports, 340, 291 Basilakos, S., Plionis, M., & Lima, J. A. S. 2010, Physical Review D, 82, 083517 Bauer, M., Treichler, S., Slaughter, E., & Aiken, A. 2012, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12 (Los Alamitos, CA, USA: IEEE Computer Society Press), 66:1–66:11 Baumjohann, W., & Treumann, R. A. 2012, Basic Space Plasma Physics (Revised Edition) (World Scientific Publishing Company) Beckingsale, D. A., Burmark, J., Hornung, R., et al. 2019, in 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 71–81 Beckwith, K., & Stone, J. M. 2011, The Astrophysical Journal Supplement Series, 193, 6 Beg, F. 2019, From Interstellar Cloud to Star to Laboratory: Frontier HEDP Studies of Magnetized Colliding Plasma Flows with Strong Radiative Cooling, Tech. Rep. DOE-UCSD-14493, Univ. of California, San Diego, CA (United States), doi:10.2172/1500122 Bell, A. R. 1978, Monthly Notices of the Royal Astronomical Society, 182, 147 Bellan, P. M. 2008, Fundamentals of Plasma Physics (Cambridge University Press) Bennett, J. C., Baker, G. M., Bettencourt, M. T., et al. 2015, doi:10.2172/1432926 Beresnyak, A. 2019, Living Reviews in Computational Astrophysics, 5, 2 Beresnyak, A., Giuliani, J. L., Jackson, S. L., et al.
2018, IEEE Transactions on Plasma Science, 46, 3881 Berlok, T., Pakmor, R., & Pfrommer, C. 2020, Monthly Notices of the Royal Astronomical Society, 491, 2919 Berlok, T., & Pessah, M. E. 2015, The Astrophysical Journal, 813, 22 Binney, J., & Tabor, G. 1995, Monthly Notices of the Royal Astronomical Society, 276, 663 Binney, J., & Tremaine, S. 1987, Galactic Dynamics Bird, R. B., Stewart, W. E., & Lightfoot, E. N. 2006, Transport Phenomena (John Wiley & Sons) Bittencourt, J. A. 2004, Fundamentals of Plasma Physics (New York, NY: Springer New York), doi:10.1007/978-1-4757-4030-1 Blandford, R., Meier, D., & Readhead, A. 2019, Annual Review of Astronomy and Astrophysics, 57, 467 Blanton, E. L., Clarke, T. E., Sarazin, C. L., Randall, S. W., & McNamara, B. R. 2010, Proceedings of the National Academy of Sciences, 107, 7174 Bodo, G., Mignone, A., & Rosner, R. 2004, PHYSICAL REVIEW E, 4 Böehringer, H., & Morfill, G. E. 1988, The Astrophysical Journal, 330, 609 Bonafede, A., Dolag, K., Stasyszyn, F., Murante, G., & Borgani, S. 2011, Monthly Notices of the Royal Astronomical Society, 418, 2234 Bondi, H. 1952, Monthly Notices of the Royal Astronomical Society, 112, 195 Booth, C. M., & Schaye, J. 2009, MNRAS, 398, 53 Boozer, A. H. 2005, Reviews of Modern Physics, 76, 1071 Bordner, J., & Norman, M. L. 2018, arXiv:1810.01319 [astro-ph, physics:physics], arXiv:1810.01319 Brachet, M. E., Bustamante, M. D., Krstulovic, G., et al. 2013, Physical Review E, 87, 013110 Brachet, M. E., Meiron, D. I., Orszag, S. A., et al. 1983, Journal of Fluid Mechanics, 130, 411 Braginskii, S. I. 1965, Reviews of Plasma Physics, 1, 205 226 Brandenburg, A., & Dobler, W. 2010, Astrophysics Source Code Library, ascl:1010.060 Bregman, J. N., & David, L. P. 1989, The Astrophysical Journal, 341, 49 Brent, R. P. 1973, Algorithms for Minimization Without Derivatives (Englewood Cliffs, New Jersey: Prentice-Hall) Britzen, S., Fendt, C., Eckart, A., & Karas, V. 2017, Astronomy & Astrophysics, 601, A52 Brüggen, M. 2003a, The Astrophysical Journal, 593, 700 —. 2003b, The Astrophysical Journal, 592, 839 Brüggen, M., & Vazza, F. 2015, in Magnetic Fields in Diffuse Media, ed. A. Lazarian, E. M. de Gouveia Dal Pino, & C. Melioli (Berlin, Heidelberg: Springer), 599–614 Bryan, G. L., Norman, M. L., O’Shea, B. W., et al. 2014, The Astrophysical Journal Supplement Series, 211, 19 Burns, K. J., Vasil, G. M., Oishi, J. S., Lecoanet, D., & Brown, B. P. 2020, Physical Review Research, 2, 023068 Bustamante, M., Jez, P., Monroy Montañez, J. A., et al. 2010, High-Energy Cosmic-Ray Acceler- ation, https://cds.cern.ch/record/1249755, doi:10.5170/CERN-2010-001.533 Butsky, I. S., & Quinn, T. R. 2018, The Astrophysical Journal, 868, 108 Carilli, C. L., & Taylor, G. B. 2002, Annual Review of Astronomy and Astrophysics, 40, 319 Carlberg, R. G., Yee, H. K. C., & Ellingson, E. 1997, The Astrophysical Journal, 478, 462 Carter Edwards, H., Trott, C. R., & Sunderland, D. 2014, Journal of Parallel and Distributed Computing, 74, 3202 Casner, A. 2021, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379, 20200021 Cavagnolo, K. W., Donahue, M., Voit, G. M., & Sun, M. 2008, ApJL, 683, L107 —. 2009, ApJS, 182, 12 Chandran, B. D. G., & Cowley, S. C. 1998, Physical Review Letters, 80, 3077 Chatterjee, G., Schoeffler, K. M., Kumar Singh, P., et al. 2017, Nature Communications, 8, 15970 Chen, F. F., & Chen, F. F. 1984, Introduction to Plasma Physics and Controlled Fusion, 2nd edn. (New York: Plenum Press) Chen, J., & Liu, Q. 
H. 2013, Proceedings of the IEEE, 101, 242 Chiuderi, C., & Velli, M. 2015, Basics of Plasma Astrophysics, UNITEXT for Physics (Milano: Springer Milan), doi:10.1007/978-88-470-5280-2 227 Choquette, J., Gandhi, W., Giroux, O., Stam, N., & Krashinsky, R. 2021, IEEE Micro, 41, 29 Churazov, E., Sunyaev, R., Forman, W., & Böhringer, H. 2002, Monthly Notices of the Royal Astronomical Society, 332, 729 Ciotti, L., & Ostriker, J. P. 1997, The Astrophysical Journal, 487, L105 Clarke, T. E. 2004, Journal of The Korean Astronomical Society, 37, 337 Clarke, T. E., Kronberg, P. P., & Böhringer, H. 2001, The Astrophysical Journal, 547, L111 Cockburn, B., Hou, S., & Shu, C.-W. 1990, Mathematics of Computation, 54, 545 Cockburn, B., Kanschat, G., & Schötzau, D. 2005, Computers & Fluids, 34, 491 Cockburn, B., Lin, S.-Y., & Shu, C.-W. 1989, Journal of computational Physics, 84, 90 Cockburn, B., & Shu, C.-W. 1989, Mathematics of computation, 52, 411 —. 1998, Journal of Computational Physics, 141, 199 Colafrancesco, S., Dar, A., & De Rújula, A. 2004, Astronomy and Astrophysics, 413, 441 Craxton, R. S., Anderson, K. S., Boehly, T. R., et al. 2015, Physics of Plasmas, 22, 110501 Dagum, L., & Menon, R. 1998, IEEE Comput. Sci. Eng., 5, 46 Dallas, V., & Alexakis, A. 2013a, Physical Review E, 88, 053014 —. 2013b, Physics of Fluids, 25, 105106 —. 2013c, Physical Review E, 88, 063017 Dawson, J. M. 1983, Reviews of Modern Physics, 55, 403 Deakin, T., McIntosh-Smith, S., Price, J., et al. 2019, in 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 1–13 Deakin, T., Price, J., Martineau, M., & McIntosh-Smith, S. 2018, International Journal of Compu- tational Science and Engineering, 17, 247 Dekel, A., & Birnboim, Y. 2007, Monthly Notices of the Royal Astronomical Society, 383, 119 Dennis, T. J., & Chandran, B. D. G. 2005, The Astrophysical Journal, 622, 205 Domainko, W., Gitti, M., Schindler, S., & Kapferer, W. 2004, Astronomy & Astrophysics, 425, L21 Domaradzki, J. A., Teaca, B., & Carati, D. 2010, Physics of Fluids, 22, 051702 Donnert, J., Vazza, F., Brüggen, M., & ZuHone, J. 2018, Space Science Reviews, 214, doi:10.1007/s11214-018-0556-8 228 Du, P., Weber, R., Luszczek, P., et al. 2011, From CUDA to OpenCL: Towards a Performance- portable Solution for Multi-platform GPU Programming Dubois, Y., Devriendt, J., Slyz, A., & Silk, J. 2009, Monthly Notices of the Royal Astronomical Society: Letters, 399, L49 Dubois, Y., Devriendt, J., Slyz, A., & Teyssier, R. 2010, Monthly Notices of the Royal Astronomical Society, 409, 985 Dunn, R. J. H., & Fabian, A. C. 2004, Monthly Notices of the Royal Astronomical Society, 355, 862 Ebisu, T., Ishiyama, T., & Hayashi, K. 2022, Physical Review D, 105, 023016 Edgar, R. 2004, New Astronomy Reviews, 48, 843 Edwards, H. C., Trott, C. R., & Sunderland, D. 2014, Journal of Parallel and Distributed Comput- ing, 74, 3202 , domain-Specific Languages and High-Level Frameworks for High-Performance Computing Egan, H., O’Shea, B. W., Hallman, E., et al. 2016, arXiv:1601.05083 [astro-ph], arXiv:1601.05083 Fabian, A. 2012, Annual Review of Astronomy and Astrophysics, 50, 455 Fabian, A. C. 1994, Annual Review of Astronomy and Astrophysics, 32, 277 Fabian, A. C., Sanders, J. S., Allen, S. W., et al. 2003, Monthly Notices of the Royal Astronomical Society, 344, L43 Fabian, A. C., Sanders, J. S., Taylor, G. B., et al. 2006, Monthly Notices of the Royal Astronomical Society, 366, 417 Fabian, A. C., Sanders, J. S., Ettori, S., et al. 
2000, MNRAS, 318, L65 Fabjan, D., Borgani, S., Tornatore, L., et al. 2010, Monthly Notices of the Royal Astronomical Society, 401, 1670 Federrath, C. 2013, Mon. Not. R. Astron. Soc., 436, 1245 —. 2016, Journal of Plasma Physics, 82, doi:10.1017/S0022377816001069 Ferland, G. J., Porter, R. L., van Hoof, P. A. M., et al. 2013, arXiv:1302.4485 [astro-ph], arXiv:1302.4485 Ferracina, L., & Spijker, M. 2005, Mathematics of Computation, 74, 201 Ferracina, L., & Spijker, M. N. 2004, SIAM Journal on Numerical Analysis, 42, 1073 Feynman, R. P., Hey, J. G., & Allen, R. W. 1998, Feynman Lectures on Computation (USA: Addison-Wesley Longman Publishing Co., Inc.) 229 Fuhry, M., Giuliani, A., & Krivodonova, L. 2014, International Journal for Numerical Methods in Fluids, 76, 982 Gabuzda, D. C. 2021, Galaxies, 9, 58 Gan, Z., Li, H., Li, S., & Yuan, F. 2017, The Astrophysical Journal, 839, 14 Gao, L., Navarro, J. F., Frenk, C. S., et al. 2012, Monthly Notices of the Royal Astronomical Society, 425, 2169 Gaspari, M. 2015, Proceedings of the International Astronomical Union, 11, 17 Gaspari, M., Brighenti, F., & Temi, P. 2012a, Monthly Notices of the Royal Astronomical Society, 424, 190 Gaspari, M., Melioli, C., Brighenti, F., & D’Ercole, A. 2011, Monthly Notices of the Royal Astronomical Society, 411, 349 Gaspari, M., Ruszkowski, M., & Sharma, P. 2012b, The Astrophysical Journal, 746, 94 Gaspari, M., & Sądowski, A. 2017, The Astrophysical Journal, 837, 149 Gaspari, M., Temi, P., & Brighenti, F. 2017, Monthly Notices of the Royal Astronomical Society, 466, 677 Ghizzardi, S., Rossetti, M., & Molendi, S. 2010, Astronomy & Astrophysics, 516, A32 Gitti, M., Brighenti, F., & McNamara, B. R. 2012, Advances in Astronomy, 2012, e950641 Giuliani, J. L., Beg, F. N., Gilgenbach, R. M., et al. 2012, IEEE Transactions on Plasma Science, 40, 3246 Glines, F. W., Anderson, M., & Neilsen, D. 2015, in 2015 IEEE International Conference on Cluster Computing, 611–618, iSSN: 2168-9253 Glines, F. W., Grete, P., & O’Shea, B. W. 2021, Physical Review E, 103, 043203 Glines, F. W., O’Shea, B. W., & Voit, G. M. 2020, The Astrophysical Journal, 901, 117 Glines, F. W., Beckwith, K. R. C., Braun, J. R., et al. 2022, The Astrophysical Journal Supplement Series Godunov, S. K. 1959, Matematicheskii Sbornik, 89, 271 Gómez, P. L., Loken, C., Roettiger, K., & Burns, J. O. 2002, The Astrophysical Journal, 569, 122 Gottlieb, S. 2015, in Spectral and High Order Methods for Partial Differential Equations ICOSA- HOM 2014, ed. R. M. Kirby, M. Berzins, & J. S. Hesthaven, Lecture Notes in Computational Science and Engineering (Cham: Springer International Publishing), 17–30 Gottlieb, S., Ketcheson, D. I., & Shu, C.-W. 2011, Strong Stability Preserving Runge-Kutta and Multistep Time Discretizations (World Scientific) 230 Gottlieb, S., & Shu, C.-W. 1998, Mathematics of Computation, 67, 73 Govoni, F., & Feretti, L. 2004, International Journal of Modern Physics D, 13, 1549 Grete, P., Glines, F. W., & O’Shea, B. W. 2021a, IEEE Transactions on Parallel and Distributed Systems, 32, 85 Grete, P., O’Shea, B. W., & Beckwith, K. 2018, The Astrophysical Journal, 858, L19 —. 2021b, The Astrophysical Journal, 909, 148 —. 2021c, The Astrophysical Journal, 909, 148 Grete, P., O’Shea, B. W., Beckwith, K., Schmidt, W., & Christlieb, A. 2017, Physics of Plasmas, 24, 092311 Grete, P., Vlaykov, D. G., Schmidt, W., & Schleicher, D. R. G. 2016, Physics of Plasmas, 23, 062317 Grete, P., Dolence, J. C., Miller, J. M., et al. 
2022, arXiv:2202.12309 [astro-ph], arXiv:2202.12309
Griebel, M., & Zaspel, P. 2010, Computer Science - Research and Development, 25, 65
Guo, F., & Mathews, W. G. 2011, The Astrophysical Journal, 728, 121
Guo, F., & Oh, S. P. 2008, Monthly Notices of the Royal Astronomical Society, 384, 251
Hahn, O., Martizzi, D., Wu, H.-Y., et al. 2017, Monthly Notices of the Royal Astronomical Society, 470, 166
HajiRassouliha, A., Taberner, A. J., Nash, M. P., & Nielsen, P. M. F. 2018, Signal Processing: Image Communication, 68, 101
Hamlin, N. D., Smith, T., Roberds, N., Glines, F., & Beckwith, K. 2022, 26
Hammond, J. R., & Mattson, T. G. 2019, in Proceedings of the International Workshop on OpenCL, IWOCL'19 (New York, NY, USA: Association for Computing Machinery)
Harlow, F. H. 1962, The Particle-in-Cell Method for Numerical Solution of Problems in Fluid Dynamics, Tech. Rep. LADC-5288, Los Alamos National Lab. (LANL), Los Alamos, NM (United States), doi:10.2172/4769185
Harlow, F. H., Evans, M., & Richtmyer, R. D. 1955, A Machine Calculation Method for Hydrodynamic Problems (Los Alamos Scientific Laboratory of the University of California)
Hawkins, M. R. S. 2007, Astronomy & Astrophysics, 462, 581
Hawley, J. F., Gammie, C. F., & Balbus, S. A. 1995, The Astrophysical Journal, 440, 742
Heinrich, A. M., Chen, Y.-H., Heinz, S., Zhuravleva, I., & Churazov, E. 2021, Monthly Notices of the Royal Astronomical Society, stab1557
Heroux, M. A., & Willenbring, J. M. 2012, Scientific Programming, 20, doi:10.1155/2012/408130
Heroux, M. A., Bartlett, R. A., Howle, V. E., et al. 2005, ACM Trans. Math. Softw., 31, 397
Heroux, M. A., Doerfler, D. W., Crozier, P. S., et al. 2009, doi:10.2172/993908
Higueras, I. 2004, Journal of Scientific Computing, 21, 193
—. 2005, SIAM Journal on Numerical Analysis, 43, 924
Hillel, S., & Soker, N. 2016, Monthly Notices of the Royal Astronomical Society, 455, 2139
Ho, L. C. 2004, Coevolution of Black Holes and Galaxies: Volume 1, Carnegie Observatories Astrophysics Series (Cambridge University Press)
Hoekstra, H., Bartelmann, M., Dahle, H., et al. 2013, Space Science Reviews, 177, 75
Holmberg, E. 1941, The Astrophysical Journal, 94, 385
Holmen, J. K., Humphrey, A., Sunderland, D., & Berzins, M. 2017, in Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, PEARC17 (New York, NY, USA: ACM), 27:1–27:8
Holmen, J. K., Peterson, B., & Berzins, M. 2019, in 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 36–49
Hopkins, P. F. 2014, Astrophysics Source Code Library, ascl:1410.003
Hornung, R., Jones, H., Keasler, J., et al. 2015, ASC Tri-lab Co-design Level 2 Milestone Report 2015, Tech. Rep. LLNL-TR-677453, LLNL
Howes, G. G., Dorland, W., Cowley, S. C., et al. 2008, Physical Review Letters, 100, 065004
Hu, J., & Lou, Y.-Q. 2004, The Astrophysical Journal, 606, L1
Huarte-Espinosa, M., Frank, A., Blackman, E. G., et al. 2012, The Astrophysical Journal, 757, 66
Humpherys, J., Jarvis, T. J., & Evans, E. J. 2017, Foundations of Applied Mathematics
Incropera, F. P., & DeWitt, D. P. 1981, Fundamentals of Heat Transfer (Wiley)
Intel. 2021, Xeon Platinum 8280 Specs, https://www.intel.com/content/www/us/en/products/sku/192478/intel-xeon-platinum-8280-processor-38-5m-cache-2-70-ghz.html
Intel Corporation. 2016, Intel 64 and IA-32 Architectures Optimization Reference Manual
Iwai, H. 1999, IEEE Journal of Solid-State Circuits, 34, 357
Jia, Z., Maggioni, M., Smith, J., & Scarpazza, D. P. 2019, arXiv:1903.07486 [cs], arXiv:1903.07486
Jia, Z., Maggioni, M., Staiger, B., & Scarpazza, D. P. 2018, arXiv:1804.06826 [cs], arXiv:1804.06826
Jiang, Y.-F., & Oh, S. P. 2018, The Astrophysical Journal, 854, 5
Jiang, Y.-F., Stone, J. M., & Davis, S. W. 2014, The Astrophysical Journal Supplement Series, 213, 7
Jubelgas, M., Springel, V., & Dolag, K. 2004, Monthly Notices of the Royal Astronomical Society, 351, 423
Jubelgas, M., Springel, V., Enßlin, T., & Pfrommer, C. 2008, Astronomy and Astrophysics, 481, 33
Kale, L. V., & Krishnan, S. 1993, CHARM++: A Portable Concurrent Object Oriented System Based on C++, Tech. rep., Champaign, IL, USA
Katz, N., Weinberg, D. H., & Hernquist, L. 1996, The Astrophysical Journal Supplement Series, 105, 19
Khosroshahi, H. G., Jones, L. R., & Ponman, T. J. 2004, Monthly Notices of the Royal Astronomical Society, 349, 1240
Kida, S., & Orszag, S. A. 1990, Journal of Scientific Computing, 5, 85
Klimontovich, Y. L. 1994, Physics-Uspekhi, 37, 737
Klöckner, A., Warburton, T., Bridge, J., & Hesthaven, J. S. 2009, Journal of Computational Physics, 228, 7863
Kochanek, C. S. 2006, in Gravitational Lensing: Strong, Weak and Micro, ed. P. Schneider, C. S. Kochanek, & J. Wambsganss (Berlin, Heidelberg: Springer), 91–268
Koliopanos, F. 2018, arXiv:1801.01095 [astro-ph], arXiv:1801.01095
Kolmogorov, A. 1941, Akademiia Nauk SSSR Doklady, 30, 301
Komissarov, S., & Porth, O. 2021, New Astronomy Reviews, 92, 101610
Konstantinidis, E., & Cotronis, Y. 2017, Journal of Parallel and Distributed Computing, 107, 37
Korpi, M. J., Brandenburg, A., Shukurov, A., Tuominen, I., & Nordlund, A. 1999, The Astrophysical Journal, 514, L99
Kramer, R. M. J., Cyr, E. C., Miller, S. T., et al. 2020, A Plasma Modeling Hierarchy and Verification Approach, Tech. Rep. SAND-2020-3576, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States), doi:10.2172/1608511
Kroupp, E., Stambulchik, E., Starobinets, A., et al. 2018, Physical Review E, 97, 013202
Krymskii, G. F. 1977, Akademiia Nauk SSSR Doklady, 234, 1306
Kumar, P., & Zhang, B. 2015, Physics Reports, 561, 1
Kunz, M. W., Schekochihin, A. A., Cowley, S. C., Binney, J. J., & Sanders, J. S. 2011, Monthly Notices of the Royal Astronomical Society, 410, 2446
Lacroix, T., & Silk, J. 2018, The Astrophysical Journal, 853, L16
Laity, G., Robinson, A., Cuneo, M., et al. 2021, Towards Predictive Plasma Science and Engineering through Revolutionary Multi-Scale Algorithms and Models, Final Report, Tech. Rep. SAND2021-0718, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories, SNL California, doi:10.2172/1813907
Landauer, R. 1988, Nature, 335, 779
Larson, R. B. 1981, Monthly Notices of the Royal Astronomical Society, 194, 809
Lecoanet, D., McCourt, M., Quataert, E., et al. 2016, Monthly Notices of the Royal Astronomical Society, 455, 4274
Ledvina, S. A., Ma, Y.-J., & Kallio, E. 2008, Space Science Reviews, 139, 143
Lee, E., Brachet, M. E., Pouquet, A., Mininni, P. D., & Rosenberg, D. 2008, Physical Review E, 78, 066401
—. 2010, Physical Review E, 81, 016318
Leiserson, C. E., Thompson, N. C., Emer, J. S., et al. 2020, Science, 368, eaam9744
LeVeque, R. J. 2002, Finite Volume Methods for Hyperbolic Problems (Cambridge; New York: Cambridge University Press)
Li, H., Lapenta, G., Finn, J. M., Li, S., & Colgate, S. A. 2006, The Astrophysical Journal, 643, 92
Li, Y., & Bryan, G. L. 2012, The Astrophysical Journal, 747, 26
—. 2014a, The Astrophysical Journal, 789, 54
—. 2014b, The Astrophysical Journal, 789, 153
Li, Y., Bryan, G. L., Ruszkowski, M., et al. 2015, The Astrophysical Journal, 811, 73
Li, Y., Gendron-Marsolais, M.-L., Zhuravleva, I., et al. 2020, The Astrophysical Journal Letters, 889, L1
LIGO Scientific Collaboration and Virgo Collaboration, Abbott, R., Abbott, T. D., et al. 2020, Physical Review Letters, 125, 101102
Lima, J. A. S., Cunha, J. V., & Alcaniz, J. S. 2003, Physical Review D, 68, 023510
Lind, S. J., Rogers, B. D., & Stansby, P. K. 2020, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 476, 20190801
Liu, C., Zhou, G., Shyy, W., & Xu, K. 2019, Shock Waves, 29, 1083
Lo, Y. J., Williams, S., Van Straalen, B., et al. 2015, in High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, ed. S. A. Jarvis, S. A. Wright, & S. D. Hammond (Springer International Publishing), 129–148
Loewenstein, M., Zweibel, E. G., & Begelman, M. C. 1991, The Astrophysical Journal, 377, 392
Longair, M. S. 2008, Galaxy Formation, 2nd edn., Astronomy and Astrophysics Library (Berlin; New York: Springer)
Luo, W., Li, Y., Wang, H., et al. 2019, Laser and Particle Beams, 37, 301
Lyutikov, M. 2007, The Astrophysical Journal, 668, L1
Malyshkin, L., & Kulsrud, R. 2001, The Astrophysical Journal, 549, 402
Marcowith, A., Ferrand, G., Grech, M., et al. 2020, arXiv:2002.09411 [astro-ph], arXiv:2002.09411
Markevitch, M., Vikhlinin, A., & Mazzotta, P. 2001, The Astrophysical Journal, 562, L153
Marques, D., Duarte, H., Ilic, A., et al. 2017, in 2017 International Conference on High Performance Computing Simulation (HPCS), 898–907
Martí, J.-M. 2019, Galaxies, 7, 24
Martí, J. M., & Müller, E. 2003, Living Reviews in Relativity, 6, doi:10.12942/lrr-2003-7
—. 2015, Living Reviews in Computational Astrophysics, 1, 3
Martineau, M., McIntosh-Smith, S., & Gaudin, W. 2017, Concurrency and Computation: Practice and Experience, 29, e4117
Martizzi, D., Hahn, O., Wu, H.-Y., et al. 2016, Monthly Notices of the Royal Astronomical Society, 459, 4408
Mathews, W. G. 1971, The Astrophysical Journal, 165, 147
May, M. M., & White, R. H. 1966, Physical Review, 141, 1232
McComb, W. D. 1990, The Physics of Fluid Turbulence
McCourt, M., Sharma, P., Quataert, E., & Parrish, I. J. 2012, Monthly Notices of the Royal Astronomical Society, 419, 3319
McDonald, M., Veilleux, S., Rupke, D. S. N., & Mushotzky, R. 2010, The Astrophysical Journal, 721, 1262
McDonald, M., McNamara, B. R., Voit, G. M., et al. 2019, The Astrophysical Journal, 885, 63
McKernan, B., Ford, K. E. S., Kocsis, B., Lyra, W., & Winter, L. M. 2014, Monthly Notices of the Royal Astronomical Society, 441, 900
McKernan, B., Ford, K. E. S., Lyra, W., & Perets, H. B. 2012, Monthly Notices of the Royal Astronomical Society, 425, 460
McNamara, B. R., & Nulsen, P. E. J. 2007, Annual Review of Astronomy and Astrophysics, 45, 117
McNamara, B. R., Wise, M., Nulsen, P. E. J., et al. 2000, The Astrophysical Journal Letters, 534, L135
Medina, D. S., St-Cyr, A., & Warburton, T. 2014, arXiv:1403.0968 [cs], arXiv:1403.0968
Meece, G. R., O'Shea, B. W., & Voit, G. M. 2015, The Astrophysical Journal, 808, 43
Meece, G. R., Voit, G. M., & O'Shea, B. W. 2017, The Astrophysical Journal, 841, 17pp
Meece Jr, G. R. 2016, AGN Feedback and Delivery Methods for Simulations of Cool-Core Galaxy Clusters (Michigan State University)
Meier, D. L. 1999, The Astrophysical Journal, 518, 788
Messina, P. 2017, Computing in Science Engineering, 19, 63
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. 1953, The Journal of Chemical Physics, 21, 1087
Metzler, C. A., & Evrard, A. E. 1994, The Astrophysical Journal, 437, 564
Mignone, A., & Bodo, G. 2005, Monthly Notices of the Royal Astronomical Society, 364, 126
—. 2006, Monthly Notices of the Royal Astronomical Society, 368, 1040
Mignone, A., & McKinney, J. C. 2007, Monthly Notices of the Royal Astronomical Society, 378, 1118
Mignone, A., Plewa, T., & Bodo, G. 2005, The Astrophysical Journal Supplement Series, 160, 199
Mignone, A., Ugliano, M., & Bodo, G. 2009, Monthly Notices of the Royal Astronomical Society, 393, 1141
Mignone, A., Zanni, C., Tzeferacos, P., et al. 2011, The Astrophysical Journal Supplement Series, 198, 7
Miller, G. H., Moses, E. I., & Wuest, C. R. 2004, Optical Engineering, 43, 2841
Miniati, F. 2014, The Astrophysical Journal, 782, 21
—. 2015, The Astrophysical Journal, 800, 60
Mo, H., Van den Bosch, F., & White, S. 2010, Galaxy Formation and Evolution (Cambridge; New York: Cambridge University Press)
Moe, S. A., Rossmanith, J. A., & Seal, D. C. 2015, arXiv:1507.03024 [math], arXiv:1507.03024
Montgomery, D., & Turner, L. 1981, The Physics of Fluids, 24, 825
Morganti, R. 2017, Frontiers in Astronomy and Space Sciences, 4
Myers, A., Colella, P., & Straalen, B. V. 2016, The Astrophysical Journal, 816, 56
Nakamura, M., Li, H., & Li, S. 2006, The Astrophysical Journal, 652, 1059
—. 2007, The Astrophysical Journal, 656, 721
Narayan, R., & Medvedev, M. V. 2001, The Astrophysical Journal, 562, L129
Navarro, J. F., Frenk, C. S., & White, S. D. M. 1996, The Astrophysical Journal, 462, 563
Navarro, J. F., Frenk, C. S., & White, S. D. M. 1997, The Astrophysical Journal, 490, 493
Navarro, J. F., Hayashi, E., Power, C., et al. 2004, Monthly Notices of the Royal Astronomical Society, 349, 1039
Nelson, D., Pillepich, A., Springel, V., et al. 2019, Monthly Notices of the Royal Astronomical Society, 490, 3234
Nolte, D. D. 2001, Mind at Light Speed: A New Kind of Intelligence (Simon and Schuster)
Norman, M. L., & Bryan, G. L. 1999, in The Radio Galaxy Messier 87, ed. H.-J. Röser & K. Meisenheimer, Lecture Notes in Physics (Berlin, Heidelberg: Springer), 106–115
Núñez-de la Rosa, J., & Munz, C.-D. 2018, Computer Physics Communications, 222, 113
NVIDIA Corporation. 2014, NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110/210
—. 2016, NVIDIA Tesla P100
—. 2017, NVIDIA Tesla V100 GPU Architecture
Ogawa, T., Mineshige, S., Kawashima, T., Ohsuga, K., & Hashizume, K. 2017, Publications of the Astronomical Society of Japan, 69, 33
Ongena, J., Koch, R., Wolf, R., & Zohm, H. 2016, Nature Physics, 12, 398
Ottinger, P. F., & Schumer, J. W. 2006, Physics of Plasmas, 13, 063109
Panagoulia, E. K., Fabian, A. C., & Sanders, J. S. 2014, Monthly Notices of the Royal Astronomical Society, 438, 2341
Parrish, I. J., Quataert, E., & Sharma, P. 2009, The Astrophysical Journal, 703, 96
Patterson, D. 2010, IEEE Spectrum, 47, 28
Pennycook, S., Sewall, J., & Lee, V. 2019, Future Generation Computer Systems, 92, 947
Pennycook, S. J., Sewall, J. D., & Lee, V. W. 2016, arXiv:1611.07409 [cs], arXiv:1611.07409
Pfrommer, C., Enßlin, T. A., Springel, V., Jubelgas, M., & Dolag, K. 2007, Monthly Notices of the Royal Astronomical Society, 378, 385
Pillepich, A., Nelson, D., Springel, V., et al. 2019, Monthly Notices of the Royal Astronomical Society, 490, 3196
Pouquet, A., Lee, E., Brachet, M. E., Mininni, P. D., & Rosenberg, D. 2010, Geophysical and Astrophysical Fluid Dynamics, 104, 115
Prasad, D., Sharma, P., & Babul, A. 2015, The Astrophysical Journal, 811, 108
—. 2017, Monthly Notices of the Royal Astronomical Society, 471, 1531
—. 2018, The Astrophysical Journal, 863, 62
Pratt, G. W., Arnaud, M., Biviano, A., et al. 2019, Space Science Reviews, 215, 25
Pratt, G. W., Croston, J. H., Arnaud, M., & Böhringer, H. 2009, Astronomy and Astrophysics, 498, 361
Rafferty, D. A., McNamara, B. R., & Nulsen, P. E. J. 2008, The Astrophysical Journal, 687, 899
Reed, W. H., & Hill, T. R. 1973, Triangular Mesh Methods for the Neutron Transport Equation, Tech. Rep. LA-UR-73-479; CONF-730414-2, Los Alamos Scientific Lab., N.Mex. (USA)
Reguly, I. Z., & Mudalige, G. R. 2020, Computers & Fluids, 199, 104425
Rephaeli, Y., & Silk, J. 1995, The Astrophysical Journal, 442, 91
Revaz, Y., Combes, F., & Salomé, P. 2008, Astronomy & Astrophysics, 477, L33
Reynolds, C. 2018, The Micro- and Macro-Physics of Thermal Conduction in the ICM, 49
Reynolds, O. 1883, Philosophical Transactions of the Royal Society of London, 174, 935
Riccardi, G., & Durante, D. 2008, in International Mathematical Forum, Vol. 42, 2081–2111
Richardson, L. F. 1922, Weather Prediction by Numerical Process (Cambridge: Cambridge University Press)
Ritchie, B. W., & Thomas, P. A. 2002, Monthly Notices of the Royal Astronomical Society, 329, 675
Ritos, K., Kokkinakis, I. W., & Drikakis, D. 2018, Computers & Fluids, 173, 307
Roberds, N. A., Cartwright, K. L., Sandoval, A. J., et al. 2022, 9
Roettiger, K., Loken, C., & Burns, J. O. 1997, The Astrophysical Journal Supplement Series, 109, 307
Rogers, K. K., & Peiris, H. V. 2021, Physical Review D, 103, 043526
Roh, S., Ryu, D., Kang, H., Ha, S., & Jang, H. 2019, The Astrophysical Journal, 883, 138
Rose, S. C., Naoz, S., Sari, R., & Linial, I. 2021, arXiv:2201.00022 [astro-ph], arXiv:2201.00022
Rosin, M. S., Schekochihin, A. A., Rincon, F., & Cowley, S. C. 2011, Monthly Notices of the Royal Astronomical Society, 413, 7
Rott, N. 1990, Annual Review of Fluid Mechanics, 22, 1
Rudakov, L. I., & Sudan, R. N. 1997, Physics Reports, 283, 253
Russell, H. R., McNamara, B. R., Fabian, A. C., et al. 2016, Monthly Notices of the Royal Astronomical Society, 458, 3134
Russell, H. R., McDonald, M., McNamara, B. R., et al. 2017, The Astrophysical Journal, 836, 130
Ruszkowski, M., & Begelman, M. C. 2002, The Astrophysical Journal, 581, 223
Ruszkowski, M., Brüggen, M., & Begelman, M. C. 2004, The Astrophysical Journal, 611, 158
Ruszkowski, M., Lee, D., Brüggen, M., Parrish, I., & Oh, S. P. 2011, The Astrophysical Journal, 740, 81
Ruszkowski, M., & Oh, S. P. 2011, Monthly Notices of the Royal Astronomical Society, 414, 1493
Ryu, D., Chattopadhyay, I., & Choi, E. 2006, The Astrophysical Journal Supplement Series, 166, 410
Ryutov, D. D., & Remington, B. A. 2002, Plasma Physics and Controlled Fusion, 44, B407
Sammak, S., Nouri, A. G., Ansari, N., & Givi, P. 2015, in Mathematical Modeling of Technological Processes, ed. N. Danaev, Y. Shokin, & A.-Z. Darkhan, Communications in Computer and Information Science (Cham: Springer International Publishing), 124–132
Sanchez, R., & Newman, D. E. 2015, Plasma Physics and Controlled Fusion, 57, 123002
Sarazin, C. L. 1988, X-Ray Emission from Clusters of Galaxies
Schekochihin, A. A. 2020, arXiv:2010.00699 [astro-ph, physics:nlin, physics:physics], arXiv:2010.00699
Schekochihin, A. A., Cowley, S. C., Dorland, W., et al. 2009, The Astrophysical Journal Supplement Series, 182, 310
Schekochihin, A. A., Cowley, S. C., Taylor, S. F., Maron, J. L., & McWilliams, J. C. 2004, The Astrophysical Journal, 612, 276
Schmidt, W., & Federrath, C. 2011, Astronomy & Astrophysics, 528, A106
Schneider, V., Katscher, U., Rischke, D. H., et al. 1993, Journal of Computational Physics, 105, 92
Schure, K. M., Kosenko, D., Kaastra, J. S., Keppens, R., & Vink, J. 2009, Astronomy & Astrophysics, 508, 751
Sharma, P., Hammett, G. W., Quataert, E., & Stone, J. M. 2006, The Astrophysical Journal, 637, 952
Shebalin, J. V., Matthaeus, W. H., & Montgomery, D. 1983, Journal of Plasma Physics, 29, 525
Short, C. J., Thomas, P. A., & Young, O. E. 2013, Monthly Notices of the Royal Astronomical Society, 428, 1225
Shu, C.-W., & Osher, S. 1989, Journal of Computational Physics, 83, 32
Shumlak, U. 2015, High Fidelity Physics Using the Multi-Fluid Plasma Model
Sijacki, D., Springel, V., Di Matteo, T., & Hernquist, L. 2007, Monthly Notices of the Royal Astronomical Society, 380, 877
Simionescu, A., ZuHone, J., Zhuravleva, I., et al. 2019, Space Science Reviews, 215, 24
Simon, H. D. 1992, Parallel Computational Fluid Dynamics - Implementations and Results, Tech. rep., Cambridge, MA (United States); MIT Press
Sinars, D. B., Sweeney, M. A., Alexander, C. S., et al. 2020, Physics of Plasmas, 27, 070501
Smith, B., O'Shea, B. W., Voit, G. M., Ventimiglia, D., & Skillman, S. W. 2013, The Astrophysical Journal, 778, 152
Smith, B. D., Bryan, G. L., Glover, S. C. O., et al. 2017, Monthly Notices of the Royal Astronomical Society, 466, 2217
Sommerfeld, A. 1909, Ein Beitrag zur hydrodynamischen Erklärung der turbulenten Flüssigkeitsbewegungen
Spitzer, L. 1956, Physics of Fully Ionized Gases
—. 1978, Physical Processes in the Interstellar Medium, doi:10.1002/9783527617722
Springel, V. 2005, Monthly Notices of the Royal Astronomical Society, 364, 1105
—. 2010, Annual Review of Astronomy and Astrophysics, 48, 391
Springel, V., Yoshida, N., & White, S. D. M. 2001, New Astronomy, 6, 79
St-Onge, D. A., Kunz, M. W., Squire, J., & Schekochihin, A. A. 2020, arXiv e-prints, 2003, arXiv:2003.09760
Steijl, R., & Barakos, G. N. 2018, Computers & Fluids, 173, 22
Steinwandel, U. P., Boess, L. M., Dolag, K., & Lesch, H. 2021, arXiv:2108.07822 [astro-ph], arXiv:2108.07822
Stokes, G. G. 1851, Transactions of the Cambridge Philosophical Society, 9, 8
Stone, J. E., Gohara, D., & Shi, G. 2010, Computing in Science Engineering, 12, 66
Stone, J. M., & Gardiner, T. 2009, New Astronomy, 14, 139
Stone, J. M., & Gardiner, T. A. 2010, The Astrophysical Journal Supplement Series, 189, 142
Stone, J. M., Gardiner, T. A., Teuben, P., Hawley, J. F., & Simon, J. B. 2008a, The Astrophysical Journal Supplement Series, 178, 137
—. 2008b, The Astrophysical Journal Supplement Series, 178, 137
Stone, J. M., & Norman, M. L. 1992, The Astrophysical Journal Supplement Series, 80, 753
Stone, J. M., Tomida, K., White, C. J., & Felker, K. G. 2020a, The Astrophysical Journal Supplement Series, 249, 4
—. 2020b, arXiv:2005.06651
Straatsma, T. P., Antypas, K. B., & Williams, T. J. 2017, Exascale Scientific Applications: Scalability and Performance Portability, 1st edn. (Chapman & Hall/CRC)
Sunyaev, R. A., & Zel'dovich, Y. B. 1980, Annual Review of Astronomy and Astrophysics, 18, 537
Synge, J. 1957, The Relativistic Gas, Series in Physics (North-Holland Publishing Company)
Tabor, G., & Binney, J. 1993, Monthly Notices of the Royal Astronomical Society, 263, 323
Taub, A. H. 1948, Physical Review, 74, 328
Taylor, G. I. 1938, Proceedings of the Royal Society of London. Series A - Mathematical and Physical Sciences, 164, 476
Taylor, G. I., & Green, A. E. 1937, Proceedings of the Royal Society of London Series A, 158, 499
Teyssier, R. 2002, Astronomy & Astrophysics, 385, 337
Theis, T. N., & Wong, H.-S. P. 2017, Computing in Science Engineering, 19, 41
Tobias, S. M. 2021, Journal of Fluid Mechanics, 912, doi:10.1017/jfm.2020.1055
Top500. 2000, ASCI Red | TOP500, https://www.top500.org/system/168753/
—. 2010, Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050 | TOP500, https://www.top500.org/system/176929/
—. 2020, Supercomputer Fugaku - Supercomputer Fugaku, A64FX 48C 2.2GHz, Tofu Interconnect D | TOP500, https://www.top500.org/system/179807/
—. 2021, Frontera - Dell C6420, Xeon Platinum 8280 28C 2.7GHz, Mellanox InfiniBand HDR | TOP500, https://www.top500.org/system/179607/
Toro, E. F. 2009, Riemann Solvers and Numerical Methods for Fluid Dynamics: A Practical Introduction, 3rd edn. (Dordrecht; New York: Springer)
Toyouchi, D., Inayoshi, K., Hosokawa, T., & Kuiper, R. 2021, The Astrophysical Journal, 907, 74
Trac, H., & Pen, U.-L. 2003, Publications of the Astronomical Society of the Pacific, 115, 303
Tremmel, M., Karcher, M., Governato, F., et al. 2017, Monthly Notices of the Royal Astronomical Society, 470, 1121
Tremmel, M., Quinn, T. R., Ricarte, A., et al. 2019, Monthly Notices of the Royal Astronomical Society, 483, 3336
Treumann, R. A., & Baumjohann, W. 1997, Advanced Space Plasma Physics (published by Imperial College Press and distributed by World Scientific Publishing Co.), doi:10.1142/p020
Trott, C. R., Lebrun-Grandié, D., Arndt, D., et al. 2022, IEEE Transactions on Parallel and Distributed Systems, 33, 805
Tskhakaya, D., Matyash, K., Schneider, R., & Taccogna, F. 2007, Contributions to Plasma Physics, 47, 563
Tukey, J. W. 1977, Exploratory Data Analysis (Reading, Mass.: Addison-Wesley Pub. Co.)
Tümer, A., Tombesi, F., Bourdin, H., et al. 2019, Astronomy & Astrophysics, 629, A82
Turk, M. J., Smith, B. D., Oishi, J. S., et al. 2011, The Astrophysical Journal Supplement Series, 192, 9
Vacca, V., Murgia, M., Govoni, F., et al. 2018, Galaxies, 6, 142
Vahala, G., Keating, B., Soe, M., et al. 2008, Commun. Comput. Phys., 23
van Dyke, M. 1982, NASA STI/Recon Technical Report A, 82, 36549
van Leer, B. 1979, Journal of Computational Physics, 32, 101
Vidal-García, A., Falgarone, E., Arrigoni Battaia, F., et al. 2021, Monthly Notices of the Royal Astronomical Society, 506, 2551
Vikhlinin, A., Markevitch, M., & Murray, S. S. 2001a, The Astrophysical Journal, 549, L47
—. 2001b, The Astrophysical Journal, 551, 160
Villiers, J.-P. D., Hawley, J. F., & Krolik, J. H. 2003, The Astrophysical Journal, 599, 1238
Vlaykov, D. G., Grete, P., Schmidt, W., & Schleicher, D. R. G. 2016, Physics of Plasmas, 23, 062316
Voigt, L. M., & Fabian, A. C. 2004, Monthly Notices of the Royal Astronomical Society, 347, 1130
Voigt, L. M., Schmidt, R. W., Fabian, A. C., Allen, S. W., & Johnstone, R. M. 2002, Monthly Notices of the Royal Astronomical Society, 335, L7
Voit, G. M. 2005, Reviews of Modern Physics, 77, 207
Voit, G. M., & Bryan, G. L. 2001, Nature, 414, 425
Voit, G. M., Donahue, M., Bryan, G. L., & McDonald, M. 2015, Nature, 519, 203
Voit, G. M., Meece, G., Li, Y., et al. 2017, The Astrophysical Journal, 845, 80
Wadsley, J., Stadel, J., & Quinn, T. 2004, New Astronomy, 9, 137
Wagh, B., Sharma, P., & McCourt, M. 2014, Monthly Notices of the Royal Astronomical Society, 439, 2822
Walker, S., Simionescu, A., Nagai, D., et al. 2019, Space Science Reviews, 215, 7
Wang, C., Ruszkowski, M., Pfrommer, C., Oh, P., & Yang, H. 2020, 236, 124.02
Wang, S., Khoury, J., Haiman, Z., & May, M. 2004, Physical Review D, 70, 123008
Wang, Z. J., Fidkowski, K., Abgrall, R., et al. 2013, International Journal for Numerical Methods in Fluids, 72, 811
Weinberger, R., Springel, V., & Pakmor, R. 2020, The Astrophysical Journal Supplement Series, 248, 32
Weinberger, R., Springel, V., Hernquist, L., et al. 2017, Monthly Notices of the Royal Astronomical Society, 465, 3291
White, C. J., Stone, J. M., & Gammie, C. F. 2016a, The Astrophysical Journal Supplement Series, 225, 22
—. 2016b, The Astrophysical Journal Supplement Series, 225, 22
White, M., Cohn, J. D., & Smit, R. 2010, Monthly Notices of the Royal Astronomical Society, 408, 1818
Williams, S., Waterman, A., & Patterson, D. 2009, Commun. ACM, 52, 65
Wilson, J. R. 1972, The Astrophysical Journal, 173, 431
Woosley, S. E. 2017, The Astrophysical Journal, 836, 244
Wu, H.-Y., Evrard, A. E., Hahn, O., et al. 2015, Monthly Notices of the Royal Astronomical Society, 452, 1982
Wu, K., & Tang, H. 2016, The Astrophysical Journal Supplement Series, 228, 3
Wu, K. K. S., Fabian, A. C., & Nulsen, P. E. J. 1998, Monthly Notices of the Royal Astronomical Society, 301, L20
Xu, Z., Zhao, H., & Zheng, C. 2015, Journal of Computational Physics, 281, 844
Yang, H.-Y. K., & Reynolds, C. S. 2016a, The Astrophysical Journal, 829, 90
—. 2016b, The Astrophysical Journal, 818, 181
Yang, Y., Shi, Y., Wan, M., Matthaeus, W. H., & Chen, S. 2016, Physical Review E, 93, 061102
Young, D. S. D. 2010, The Astrophysical Journal, 710, 743
Zanna, L. D., & Bucciantini, N. 2002, Astronomy & Astrophysics, 390, 1177
Zhang, U.-H., Schive, H.-Y., & Chiueh, T. 2018, The Astrophysical Journal Supplement Series, 236, 50
Zhang, W., Almgren, A., Beckner, V., et al. 2019, Journal of Open Source Software, 4, 1370
Zhao, D., & Aluie, H. 2018, Physical Review Fluids, 3, 054603
Zheng, Y., Kamil, A., Driscoll, M. B., Shan, H., & Yelick, K. 2014, in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 1105–1114
Zhu, J.-P., Zhang, B., Yu, Y.-W., & Gao, H. 2021, The Astrophysical Journal, 906, L11
Zhuravleva, I., Churazov, E., Schekochihin, A. A., et al. 2019, Nature Astronomy, 3, 832
—. 2014, Nature, 515, 85
ZuHone, J. A., Markevitch, M., & Johnson, R. E. 2010, The Astrophysical Journal, 717, 908
Zylstra, A. B., Hurricane, O. A., Callahan, D. A., et al. 2022, Nature, 601, 542