MEMORY-EFFICIENT EMULATION OF PHYSICAL TABULAR DATA USING QUADTREE DECOMPOSITION

By

Jared Carlson

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computational Mathematics, Science, and Engineering – Master of Science

2022

ABSTRACT

MEMORY-EFFICIENT EMULATION OF PHYSICAL TABULAR DATA USING QUADTREE DECOMPOSITION

By

Jared Carlson

Computationally expensive functions are sometimes replaced in simulations with an emulator that approximates the true function (e.g., equations of state, wavelength-dependent opacity, or composition-dependent materials properties). For functions that have a constrained domain of interest, this can be done by discretizing the domain and performing a local interpolation on the tabulated function values of each local domain. For these so-called tabular data methods, the method of discretizing the domain and mapping the input space to each subdomain can drastically influence the memory and computational costs of the emulator. This is especially true for functions that vary drastically in different regions. We present a method for domain discretization and mapping that utilizes quadtrees, which results in significant reductions in the size of the emulator with minimal increases to computational costs or loss of global accuracy. We apply our method to the electron-positron Helmholtz free energy equation of state and show over an order of magnitude reduction in memory costs for reasonable levels of numerical accuracy.

Copyright by JARED CARLSON 2022

ACKNOWLEDGEMENTS

Jared Carlson acknowledges Matt Piekenbrock for assistance with the compact mapping scheme, and Bronson Messer, Adam Alessio, and Saiprasad Ravishankar for their input. BWO was supported in part by MSU’s Office of Research and Innovation through the Institute of Cyber-Enabled Research, by NSF grants no. OAC-1835213 and AST-1908109, and by NASA ATP grants NNX15AP39G and 80NSSC18K1105. SMC is supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics, Early Career Research Program under Award Number DE-SC0015904. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research and Office of Nuclear Physics, Scientific Discovery through Advanced Computing (SciDAC) program under Award Number DE-SC0017955. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) that are responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

TABLE OF CONTENTS

LIST OF FIGURES vi
CHAPTER 1 INTRODUCTION 1
CHAPTER 2 METHODS 4
2.1 Emulating the Electron-Positron Helmholtz Free Energy 4
2.2 Defining Error 7
2.3 Combining Model Classes into a Single Emulator 8
2.4 Domain Decomposition using a Quadtree 10
2.5 Training the Quadtree Emulator 11
2.6 Creating the Memory Compact Quadtree Emulator 11
2.7 Emulating the Helmholtz Free Energy 14
CHAPTER 3 RESULTS 17
CHAPTER 4 DISCUSSION 23
CHAPTER 5 CONCLUSIONS 28
BIBLIOGRAPHY 29

LIST OF FIGURES

Figure 2.1: An example of various aspects of the quadtree decomposition with a max depth Dmax = 6 and a relative error threshold εth = 10−3.
Four quadtrees, placed side-by-side along the x-axis of the table, were trained to cover this domain. a) The log10 of the absolute value of the electron-positron Helmholtz free energy f. b) The domain decomposition. c) The mapping of the linear-space and log-space model classes to the decomposed domain. d) The relative error of the resulting emulator. 5

Figure 2.2: The log10 relative error of different model classes at a fixed uniform cell size. (a) linear-space model class, (b) log-space model class, and (c) linear-space and log-space model classes. For (c), the best model class is chosen on a cell-by-cell basis such that the predicted error is minimized. 9

Figure 2.3: An example of a quadtree. Each leaf node is characterized by a number, which corresponds to a region in the 2D decomposition. The ordering of the numbers follows how the leaf nodes are laid out in memory. Cells close together in 2D space tend to be close together in memory as well. 10

Figure 2.4: An example of the compact mapping scheme of a quadtree with Dmax =2. (a) The 2D domain is discretized into a uniform grid corresponding to a quadtree that is fully refined at Dmax =2, with indices matching the quadtree index space of its leaf nodes. (b) The domain of each model after the quadtree emulator has been trained. (c) The mapping array stores the index mapping of the quadtree index space to the model array. (d) The mapping array is compactly represented using run-length encoding. 13

Figure 3.1: The error and memory cost for different emulators with Dmax =7. The dashed lines correspond to a quadtree emulator with just the linear-space model class and the solid lines correspond to a quadtree emulator using both linear-space and log-space model classes. For each type of error, the points going left to right correspond to the following εth of [10−1, 10−2, 10−3, 10−4, 10−5, 10−6, 10−7]. The stars correspond to TS00 and the triangles to TS00 but with log-space and linear-space model classes being used. 19

Figure 3.2: The error and memory cost for different emulators with Dmax =9. The dashed lines correspond to a quadtree emulator with just the linear-space model class and the solid to a quadtree emulator using both linear-space and log-space model classes. For each type of error, the points going left to right correspond to an εth of [10−1, 10−2, 10−3, 10−4, 10−5, 10−6, 10−7, 10−8, 10−9]. The stars correspond to TS00 and the triangles to TS00 but with log-space and linear-space model classes being used. 20

Figure 3.3: An estimate of the log10 (ε) density over the domain for four different cases: TS00, linear-space model class with Dmax =7, and log-space and linear-space model classes with a Dmax of 7 and 9. 21

Figure 3.4: Spatial relative error for three different cases: (a) TS00’s error, (b) Quadtree emulator that has the same norm error as TS00 (Dmax =7, εth =10−4), (c) Quadtree emulator that has the same size as TS00 (Dmax =9, εth =10−8). 22

Figure 4.1: An example of a quadtree refined region with a discontinuity crossing the (a) corner and (b) diagonal. (c) The compression ratio between using a tree-based grid refinement method vs. a uniform refinement for the cases shown in (a) and (b) but in varying dimensions. The red stars in (c) show the compression of the Dmax =7, εth =10−4 and Dmax =9, εth =10−7 cases. The εth is chosen for each Dmax such that the L2 error has mostly stopped decreasing.
26 vii CHAPTER 1 INTRODUCTION In scientific computing, we often deal with tabular data as an effective way of emulating (i.e., approximating) computationally expensive functions over a constrained domain. This is especially beneficial in cases where function evaluations happen frequently, are relatively close in the input domain, and/or are relatively smooth in the output space. In cases like these, it is often faster to evaluate many values of the function offline and then use them for interpolation during runtime [1]. Although this introduces a degree of error, tabulating more function values can reduce this error to the desired threshold. The trade-off for the reduction in computational costs that are gained by doing an interpolation instead of evaluating the true function during runtime is the memory cost of storing the tabulated data. The tabular data used by the emulator must be held in memory and portions of it must be loaded to the compute device whenever an evaluation of the function is made. To address these issues, we propose a method of domain decomposition over the emulator’s domain that uses a tree structure, as opposed to a uniform domain decomposition [2]. Although using a uniform decomposition has the benefit of O(1) computational complexity for mapping inputs to the correct cell, the error of the interpolation within each cell can vary drastically. As a consequence, the choice of cell size leads to under-refinement in some areas and/or over-refinement in others, which is especially evident when discontinuities are present or when the function changes rapidly in some regions of parameter space but varies slowly in others. Thus for a uniform table decomposition, the trade-off becomes either a higher error than desired or larger memory cost than needed. By using a tree structure, we show that this issue can be remedied with only a small additional cost of mapping inputs to the correct cell, i.e., traversing the tree. The use of a tree structure to deal with areas of different optimal resolutions is a problem that has long been addressed in the field of image compression [3, 4]. Images are generally represented as a set of uniformly spaced cells, or pixels. One way to compress an image is to use quadtree 1 compression, which reduces neighboring groups of pixels that are close to the same color into a single larger pixel. Our method uses a similar technique but combines regions for which the expected interpolation error is low. Our method is also analogous to the adaptive mesh refinement technique, which is frequently used in large-scale physics simulations [5], and a variety of techniques in computer science [6]. Another constraint of common tabular methods is the requirement that all cells do the same type of interpolation. This restriction limits the type of interpolation scheme that can be used, as they must behave well throughout the emulator’s domain. We show that adaptively selecting the type of interpolation scheme used on a cell-by-cell basis can decrease the resulting interpolation error, further allowing for smaller tables. As a proxy problem, we will use interpolation of the electron-positron Helmholtz free energy [7, 8], which is used in modeling the equation of state of stellar plasmas. 
Although this free energy and resulting equation of state are relatively simple, they have characteristics features that are representative of (and thus can be generalized to) other, more complex, equations of state and opacities, examples of which are shown in [9, 10, 11, 12, 13, 14]. In these cases, the tabular data is supplied in a uniform grid and often without any interpolation scheme included [8, 15, 16, 17, 12, 18]. This leaves the interpolation of table values up to the application using that data, which typically resorts to an N-dimensional linear interpolation scheme [19, 13], or less commonly a more sophisticated higher-order polynomial-based interpolation [8]. As a consequence, the method described here can therefore be readily applied to existing tabular data sets. We will show that by using our method on the electron-positron Helmholtz free energy we can reduce the memory consumption of the table by at least an order of magnitude, while still maintaining comparable levels of accuracy to standard interpolating schemes. This paper is organized as follows. In Section 2, we present our new method for the construction of an emulator using quadtree decomposition. We then present the results obtained from an implementation of this method in Section 3. Finally, we discuss our findings and present our conclusions in Sections 4 and 5, respectively. 2 The code for the quadtree emulator is available at https://doi.org/10.5281/zenodo.4770200. The data sets used for training and testing the quadtree emulator are available at https://doi.org/10.5281/zenodo.4739173. An N-dimensional version of the code is publicly available at https://github.com/Carlson-J/ND- tree tabular data emulator. This code utilizes additional performance optimizations, which is documented in the repository. 3 CHAPTER 2 METHODS 2.1 Emulating the Electron-Positron Helmholtz Free Energy The electron-positron Helmholtz free energy f = f (ρ, T ) is a function of density ρ and temperature T and is used in astrophysical simulations of stellar phenomena to calculate the contribution of electrons and positrons of arbitrary degeneracy and relativity to the total fluid equation of state [20], and as a consequence its value is needed multiple times per volume element per time step. A visualization of f is shown in Figure 2.1(a). Because the inputs and outputs of f vary over orders of magnitude, Figure 2.1(a) shows log10 (| f |), with the ρ, T domain also transformed by log10 (·). Due to the high computational cost of evaluating this function, values of f are computed prior to simulation execution and tabulated. At runtime, the tabulated values of f are interpolated between to estimate f at a given density and temperature. The domain over which the values of f are tabulated depends on the simulation being run. Figure 2.1(a) shows a typical range in density and temperature. Once the domain of interest is determined it is decomposed into a complete disjoint set of cells, i.e., there is no overlap between the domains of the cells and the union of the cells’ domains covers the entire domain of interest. The standard method, developed by Timmes & Swesty 2000 [8], referred to hereafter as TS00, does the decomposition by creating a grid of cells over the domain that are evenly spaced in log-space. This allows a relatively small number of cells to cover densities and temperatures that vary over many orders of magnitude. Each cell has a corresponding model that approximates f within the cell’s domain. 
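To make the baseline concrete, a uniform, log-spaced table lookup of the kind just described can be sketched as follows. This is only an illustration of the idea, not the TS00 source code; the grid sizes and ranges below are placeholders.

```python
import numpy as np

# A minimal sketch of a uniform, log-spaced table lookup. Because the cells
# are evenly spaced in log10(rho) and log10(T), the cell containing an input
# can be found with simple arithmetic (O(1) cost). Values are illustrative.

log_rho_min, log_rho_max, n_rho = -10.6, 15.0, 512   # cells along density
log_T_min, log_T_max, n_T = 4.4, 10.8, 128           # cells along temperature
d_rho = (log_rho_max - log_rho_min) / n_rho
d_T = (log_T_max - log_T_min) / n_T

def find_cell(rho, T):
    """Return the (i, j) indices of the uniform cell containing (rho, T)."""
    i = int((np.log10(rho) - log_rho_min) // d_rho)
    j = int((np.log10(T) - log_T_min) // d_T)
    # Clamp inputs on the upper boundary into the last cell.
    return min(i, n_rho - 1), min(j, n_T - 1)

# Each (i, j) cell would own one set of interpolation coefficients (its "model").
print(find_cell(1.0e3, 1.0e7))
```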
Here we define a model as a fully defined mapping (no free parameters) from the input domain R² to the output space R⁶, where the outputs are f and its first-order and second-order derivatives. The TS00 method uses a model class of biquintic polynomials to perform the interpolation, which we will refer to as the linear-space model class because we are interpolating in the linear-space, i.e., we do not do any log transforms on the domain or function values. This is opposed to the log-space model class, which will be introduced later, where both the domain and function values are transformed using the log function.

Figure 2.1: An example of various aspects of the quadtree decomposition with a max depth Dmax = 6 and a relative error threshold εth = 10−3. Four quadtrees, placed side-by-side along the x-axis of the table, were trained to cover this domain. a) The log10 of the absolute value of the electron-positron Helmholtz free energy f. b) The domain decomposition. c) The mapping of the linear-space and log-space model classes to the decomposed domain. d) The relative error of the resulting emulator.

We define a model class as an under-defined mapping between the input domain and output space. A model class is trained for each cell, during which its parameters are fixed such that the resulting model best emulates f within the cell’s domain. For the linear-space model class, these free parameters consist of the value of f and several derivatives at four points, one at each corner of the cell,

\[
f, \quad \frac{\partial f}{\partial \rho}, \quad \frac{\partial f}{\partial T}, \quad
\frac{\partial^2 f}{\partial \rho^2}, \quad \frac{\partial^2 f}{\partial T^2}, \quad
\frac{\partial^2 f}{\partial \rho \partial T}, \quad
\frac{\partial^3 f}{\partial \rho^2 \partial T}, \quad
\frac{\partial^3 f}{\partial \rho \partial T^2}, \quad
\frac{\partial^4 f}{\partial \rho^2 \partial T^2},
\]

where the partial derivatives of f are in terms of the density ρ and temperature T. All values except for the third and fourth-order derivatives can be computed exactly from values obtained through the equations of state,

\begin{align*}
f &= E - TS, && (2.1)\\
\frac{\partial f}{\partial \rho} &= \frac{P}{\rho^2}, && (2.2)\\
\frac{\partial f}{\partial T} &= -S, && (2.3)\\
\frac{\partial^2 f}{\partial \rho^2} &= -\frac{2}{\rho}\frac{\partial f}{\partial \rho} + \frac{1}{\rho^2}\frac{\partial P}{\partial \rho}, && (2.4)\\
\frac{\partial^2 f}{\partial T^2} &= -\frac{\partial S}{\partial T}, && (2.5)\\
\frac{\partial^2 f}{\partial \rho \partial T} &= -\frac{\partial S}{\partial \rho}, && (2.6)
\end{align*}

where E, S, and P are the electron-positron energy, entropy, and pressure, respectively, and their explicit dependence on ρ and T has been removed for clarity. To compute the higher-order terms, ∂³f/∂ρ²∂T, ∂³f/∂ρ∂T², and ∂⁴f/∂ρ²∂T², we use a sixth-order-accurate finite difference scheme [21], using ∂²f/∂ρ∂T as an input.

We now present a new model class, a biquintic interpolation in the log-space, which we will refer to as the log-space model class. This requires taking the log10(·) of the inputs and of |f|, and transforming the derivative terms as follows,

\begin{align*}
s &= \mathrm{sign}(f), && (2.7)\\
f^* &= \log_{10}(|f|), && (2.8)\\
\rho^* &= \log_{10}(\rho), && (2.9)\\
T^* &= \log_{10}(T), && (2.10)\\
\frac{\partial f^*}{\partial \rho^*} &= \frac{\rho s}{f}\frac{\partial f}{\partial \rho}, && (2.11)\\
\frac{\partial f^*}{\partial T^*} &= \frac{T s}{f}\frac{\partial f}{\partial T}, && (2.12)\\
\frac{\partial^2 f^*}{\partial \rho^{*2}} &= \frac{\ln(10)\,\rho}{f^2}\left( f s \left(\frac{\partial f}{\partial \rho} + \rho \frac{\partial^2 f}{\partial \rho^2}\right) - \left(\frac{\partial f}{\partial \rho}\right)^2 \rho \right), && (2.13)\\
\frac{\partial^2 f^*}{\partial T^{*2}} &= \frac{\ln(10)\,T}{f^2}\left( f s \left(\frac{\partial f}{\partial T} + T \frac{\partial^2 f}{\partial T^2}\right) - \left(\frac{\partial f}{\partial T}\right)^2 T \right), && (2.14)\\
\frac{\partial^2 f^*}{\partial \rho^* \partial T^*} &= \frac{\ln(10)\,\rho T}{f^2}\left( f s \frac{\partial^2 f}{\partial T \partial \rho} - \frac{\partial f}{\partial T}\frac{\partial f}{\partial \rho} \right). && (2.15)
\end{align*}

The higher-order derivatives are computed the same way as in the linear-space model class, but on the transformed values, i.e., using ∂²f*/∂ρ*∂T*, T*, and ρ*. This model class requires an additional parameter, s, that holds the sign of f before it is transformed.
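For reference, the transformation of a single training point into the log-space model class (Equations 2.7–2.15) can be written as a short sketch. This is my reading of the equations above with illustrative variable names, not the released implementation.

```python
import numpy as np

# A minimal sketch of the log-space transformation of one training point.
# Inputs are the linear-space values f, df/drho, df/dT, d2f/drho2, d2f/dT2,
# and d2f/drho dT evaluated at (rho, T).

def to_log_space(rho, T, f, f_r, f_T, f_rr, f_TT, f_rT):
    s = np.sign(f)                      # Eq. 2.7
    f_star = np.log10(np.abs(f))        # Eq. 2.8
    rho_star = np.log10(rho)            # Eq. 2.9
    T_star = np.log10(T)                # Eq. 2.10
    dfs_dr = rho * s / f * f_r          # Eq. 2.11
    dfs_dT = T * s / f * f_T            # Eq. 2.12
    ln10 = np.log(10.0)
    dfs_drr = ln10 * rho / f**2 * (f * s * (f_r + rho * f_rr) - f_r**2 * rho)  # Eq. 2.13
    dfs_dTT = ln10 * T / f**2 * (f * s * (f_T + T * f_TT) - f_T**2 * T)        # Eq. 2.14
    dfs_drT = ln10 * rho * T / f**2 * (f * s * f_rT - f_T * f_r)               # Eq. 2.15
    return s, f_star, rho_star, T_star, dfs_dr, dfs_dT, dfs_drr, dfs_dTT, dfs_drT

# Transforming a prediction back to linear space is f = s * 10**f_star.
```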
2.2 Defining Error

For the electron-positron Helmholtz free energy, the error of interest is the relative error (also called fractional error), defined as

\[
\varepsilon = \frac{f - \hat{f}}{\hat{f}}, \qquad (2.16)
\]

where f̂ is the true value and f is the predicted value. We choose this error because the magnitude of f changes by many orders of magnitude throughout the domain, so it is the relative error that is of primary interest when using this quantity in simulations.

To estimate the error in a region, we first compute ε at a set of points within the region. The points at which we compute ε will either be from our training set or testing set, depending on whether we are training the emulator or predicting the error of a trained emulator, respectively (see Section 2.7 for details). We then compute three different norms, defined as

\begin{align*}
L_1 &= \frac{\|\varepsilon\|_1}{N} = \frac{1}{N}\sum_i^N |\varepsilon_i|, && (2.17)\\
L_2 &= \frac{\|\varepsilon\|_2}{\sqrt{N}} = \frac{1}{\sqrt{N}}\left(\sum_i^N |\varepsilon_i|^2\right)^{1/2}, && (2.18)\\
L_\infty &= \|\varepsilon\|_\infty = \max_i(|\varepsilon_i|). && (2.19)
\end{align*}

When training the emulator, we use the L∞ norm to estimate the error in a region. When the prediction is done in the log-space, e.g., using the log-space model class, we first transform it back into the linear-space before doing the error calculation. This is because the value that is used in simulations is the linear-space value. Doing the transformation before computing the error also takes into account any numerical error that is introduced due to the transformations. The transform of f* back into linear-space is simply f = s·10^f*.

2.3 Combining Model Classes into a Single Emulator

Due to the changing characteristics of f throughout the domain covered by the emulator, certain model classes perform better in different regions. Figures 2.2(a) and (b) show the error of f using the linear-space and log-space model classes, respectively. As seen in the figure, the log-space models perform better than the linear-space models in most, but not all, regions. The simplicity of using a single model class over the entire domain comes at the cost that it must be able to capture all features of the function. This eliminates the possibility of using only the log-space model class as it fails to emulate portions of the domain, specifically where f crosses zero. Although the linear-space model class can capture all of the features, it has a higher error in regions where the log-space model class excels.

At the cost of additional complexity, we use both model classes throughout the domain. We do this by choosing the model class on a cell-by-cell basis, where the model class with the lowest estimated error over the cell’s domain is used (see Section 2.2). The error over the domain when using both model classes in the same emulator is shown in Figure 2.2(c).

Figure 2.2: The log10 relative error of different model classes at a fixed uniform cell size. (a) linear-space model class, (b) log-space model class, and (c) linear-space and log-space model classes. For (c), the best model class is chosen on a cell-by-cell basis such that the predicted error is minimized.

Figure 2.3 (panels: 2D Decomposition; Quadtree Representation): An example of a quadtree. Each leaf node is characterized by a number, which corresponds to a region in the 2D decomposition. The ordering of the numbers follows how the leaf nodes are laid out in memory. Cells close together in 2D space tend to be close together in memory as well.
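The cell-by-cell selection of Section 2.3 can be summarized in a short sketch. This is a minimal illustration, not the thesis code; `fit_linear`, `fit_log`, and the returned models' `predict` method are hypothetical placeholders standing in for the biquintic linear-space and log-space fits.

```python
import numpy as np

# A minimal sketch of per-cell model-class selection: fit each candidate
# class on the cell, estimate its maximum relative error on points inside
# the cell, and keep the class with the smallest estimate.

def relative_error(pred, truth):
    # Magnitude of Eq. 2.16, used for the error norms.
    return np.abs((pred - truth) / truth)

def choose_model(cell_points, truth, fit_linear, fit_log):
    """cell_points: (n, 2) array of (rho, T); truth: (n,) array of true f values."""
    best_name, best_model, best_err = None, None, np.inf
    for name, fit in (("linear-space", fit_linear), ("log-space", fit_log)):
        model = fit(cell_points, truth)        # returns an object with .predict()
        err = relative_error(model.predict(cell_points), truth).max()  # L-infinity estimate
        if err < best_err:
            best_name, best_model, best_err = name, model, err
    return best_name, best_model, best_err
```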
2.4 Domain Decomposition using a Quadtree

Although using multiple model classes decreases the error in many regions, the magnitude of the error still changes drastically throughout the domain. This places a constraint on the size of the cells, which has to be chosen such that the region with the highest error is within some threshold. This leads to regions where the cell spacing is much smaller than needed, creating excess memory requirements.

To reduce the memory requirements, we use a quadtree to decompose the domain into a complete disjoint set of cells, an example of which is shown in Figure 2.3. A quadtree is a recursive data structure consisting of nodes. Each node has either four or zero child nodes. Nodes that have no children are called leaf nodes. All nodes have a single parent node with the exception of the root node, which has no parent. To traverse the tree, an input is first given to the root node. The root node then does a comparison and sorts it to one of its child nodes. This process is repeated recursively until a leaf node is reached, at which point data in the leaf node is returned. In our case, the returned value would be the parameters defining the model for the leaf node’s corresponding cell.

To decompose the 2D input space (density and temperature), each node of the quadtree is defined by four boundaries (top, bottom, left, and right), forming a rectangle in the 2D space. The input for the quadtree is a pair of density and temperature values (T, ρ) that are within the domain of the root node. The root node sorts inputs into one of the four Cartesian quadrants of its domain, with the origin being the midpoint of all four boundaries. The ordering of the cells follows the Morton coding scheme [22]. We use the Morton ordering to reduce memory lookup costs, as it tends to keep cells that are close in the domain near each other in index space and therefore near each other in memory. The input value is then passed to the child node that corresponds to that region. This process is repeated recursively until a leaf node is reached. Each leaf node corresponds to a rectangular cell in the domain. The quadtree, therefore, defines a surjective function from the root node’s domain to the set of leaf nodes. Each cell/leaf node has a corresponding model that is defined over its domain.

2.5 Training the Quadtree Emulator

Training the quadtree emulator entails recursively refining cells until the estimated error is below a given threshold εth or the maximum depth Dmax of the tree has been reached. As discussed in Section 2.2, the error within a cell is estimated by computing the error at multiple points within the cell’s domain and taking the max of those errors. To refine a cell, the cell is divided into four Cartesian quadrants with the origin centered at the midpoint of its four boundaries. A new leaf node is then created for each quadrant. An example of how the domain is decomposed after training is shown in Figure 2.1(b), where four quadtrees have been put side-by-side along the x-axis to cover a domain of interest.

The determination of which model class to use for each cell is the same as described in Section 2.3. However, because the cells are no longer uniformly distributed, two density and two temperature values that specify the domain of each cell must be stored with the weights of the model, as it can no longer be determined based solely on its index. An example of how different model classes are mapped to different regions is shown in Figure 2.1(c).
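The refinement loop of Sections 2.4 and 2.5 can be sketched as follows. This is a minimal, hedged sketch rather than the released implementation; `estimate_error` and `fit_best_model` are placeholder callables standing in for the error estimate of Section 2.2 and the model-class selection of Section 2.3.

```python
# A minimal sketch of training by recursive refinement: starting from a root
# cell, split any cell whose estimated L-infinity relative error exceeds
# eps_th, stopping at depth d_max.

class Cell:
    def __init__(self, rho_lo, rho_hi, T_lo, T_hi, depth):
        self.bounds = (rho_lo, rho_hi, T_lo, T_hi)
        self.depth = depth
        self.children = []      # empty for leaf nodes
        self.model = None

def refine(cell, eps_th, d_max, estimate_error, fit_best_model):
    model = fit_best_model(cell)                       # best model class for this cell
    if estimate_error(cell, model) <= eps_th or cell.depth >= d_max:
        cell.model = model                             # keep as a leaf node
        return
    rho_lo, rho_hi, T_lo, T_hi = cell.bounds
    rho_mid, T_mid = 0.5 * (rho_lo + rho_hi), 0.5 * (T_lo + T_hi)
    # Split into four Cartesian quadrants about the cell midpoint.
    for r0, r1 in ((rho_lo, rho_mid), (rho_mid, rho_hi)):
        for t0, t1 in ((T_lo, T_mid), (T_mid, T_hi)):
            child = Cell(r0, r1, t0, t1, cell.depth + 1)
            cell.children.append(child)
            refine(child, eps_th, d_max, estimate_error, fit_best_model)
```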
2.6 Creating the Memory Compact Quadtree Emulator

Once the quadtree has been built, it is saved in a way that is memory efficient. Since the tree is static once built, it is converted into a set of arrays holding the information about the models and how to map the input space to these models. For each model class, the model parameters are stored in a 2D array. Each row corresponds to a single model and consists of the values needed for the model to perform predictions. For the linear-space model class, this requires 40 64-bit floats (4 more variables than used by TS00, since the density and temperature ranges of the cell must also be stored). For the log-space model class, 41 64-bit floats are required, the extra float being the sign variable. The order of the rows follows the ordering of the quadtree, shown in Figure 2.3, which tends to keep cells close in the domain close in memory.

Mapping an input to the correct model requires determining which cell the input is in and where the corresponding model is stored in memory. To do this we first map the input to its corresponding Cartesian index space. This is done by creating a grid of cells over the domain that are uniformly spaced in each dimension. The dimension of the grid is 2^Dmax by 2^Dmax, i.e., a fully refined quadtree of depth Dmax. The iρ, jT cell that contains the density and temperature input can then be computed by

\begin{align*}
i_\rho &= \mathrm{floor}\!\left(\frac{\rho - \rho_{\min}}{\Delta\rho}\right), &\text{where } \Delta\rho &= \frac{\rho_{\max} - \rho_{\min}}{2^{D_{\max}}},\\
j_T &= 2^{D_{\max}} - 1 - \mathrm{floor}\!\left(\frac{T - T_{\min}}{\Delta T}\right), &\text{where } \Delta T &= \frac{T_{\max} - T_{\min}}{2^{D_{\max}}}.
\end{align*}

Note that the difference in jT compared to iρ is because we started our Morton ordering at Tmax instead of Tmin. These indices can then be mapped to the quadtree index space by constructing an index at the bit level. Let iρ = x1 x2 x3 ... xN and jT = y1 y2 y3 ... yN be N-bit unsigned integers corresponding to the density and temperature indices respectively, with each xn or yn corresponding to the nth bit. The index in the quadtree index space is computed by k = x1 y1 x2 y2 x3 y3 ... xN yN, yielding a 2N-bit unsigned integer [23]. Figure 2.4(a) shows an example of the quadtree indices for a uniform grid of cells.

We then create a mapping from the quadtree index space to the correct model, e.g., mapping the cell index in Figure 2.4(a) to the correct model in Figure 2.4(b), such that the cell’s domain in (a) is within the model’s domain in (b). To do this, we first create a mapping array M of size 4^Dmax (Dmax is the max depth of the quadtree). Each entry’s index in this array corresponds to a cell on a fully refined quadtree, shown in Figure 2.4(c). The value of each entry is an integer that gives the index of the model in the model array. When multiple model classes are being used at once, offsets are saved that separate their model arrays. For example, if one model array has 5 models and the second one has 2, then the offset would be 5; if the integer in M is 6, it would map to index 1 in the second model array, e.g., b1 in the example in Figure 2.4(c).

Figure 2.4: An example of the compact mapping scheme of a quadtree with Dmax =2. (a) The 2D domain is discretized into a uniform grid corresponding to a quadtree that is fully refined at Dmax =2, with indices matching the quadtree index space of its leaf nodes. (b) The domain of each model after the quadtree emulator has been trained. (c) The mapping array stores the index mapping of the quadtree index space to the model array. (d) The mapping array is compactly represented using run-length encoding.

To reduce the size of M, which is 4^Dmax, we use run-length encoding to reduce the redundancy in the mapping array. This results in two arrays. The first is an array of indices corresponding to where values in M change, i.e., the indices where M[k − 1] ≠ M[k]. We call this the encoding array. The second array contains the values of M, which we call the index array. An example of this is shown in Figure 2.4(d).
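The index construction and run-length-encoded mapping described above can be illustrated with a short sketch. This is not the released code: the example mapping array is invented purely to show the encoding, and numpy's `searchsorted` stands in for the sorted-array search.

```python
import numpy as np

# A minimal sketch of the compact mapping scheme: map a cell index pair to the
# index space of a fully refined quadtree by bit interleaving (Morton order),
# then recover the model index through the run-length-encoded mapping array.

D_MAX = 2

def interleave(i_rho, j_T, d_max):
    """Build the quadtree index k by interleaving the bits of i_rho and j_T."""
    k = 0
    for n in range(d_max):
        k |= ((i_rho >> n) & 1) << (2 * n + 1)
        k |= ((j_T >> n) & 1) << (2 * n)
    return k

# Example mapping array M (length 4**D_max = 16, in quadtree/Morton order):
# one refined quadrant, two leaf quadrants, and another refined quadrant.
M = np.array([0, 1, 2, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 7, 8, 9])
change = np.flatnonzero(np.diff(M)) + 1
encoding = np.concatenate(([0], change))   # quadtree indices where a new run starts
index = M[encoding]                        # model index of each run

def model_index(k):
    """O(log2 m) lookup: find the run containing quadtree index k."""
    run = np.searchsorted(encoding, k, side="right") - 1
    return index[run]

i_rho, j_T = 2, 1
print(model_index(interleave(i_rho, j_T, D_MAX)))   # -> 5, the model covering cell k = 9
```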
Using run-length encoding eliminates all repeated values in our mapping scheme since it is laid out with the same ordering as the quadtree. This gives both the index and encoding arrays the same number of entries as the total number of models m. Using this encoding scheme adds an additional cost of searching over the encoding array. Since the encoding array is sorted, this is an O(log2 m) search. To further reduce the memory cost of our mapping scheme, the integer size of each array is determined by the number of models and the maximum number of cells. For the encoding array, this is the smallest unsigned int such that 4^Dmax can be represented. For the index array, the smallest unsigned int is used such that m can be represented.

2.7 Emulating the Helmholtz Free Energy

We choose a density ρ range of [10^−10.6, 10^15] (g/cm3) and a temperature T range of [10^4.4, 10^10.8] (K). This is a subset of the ρ and T ranges spanned by the default TS00 implementation, [10^−12, 10^15] (g/cm3) and [10^3, 10^13] (K) respectively. We choose these ranges because they capture the different characteristic regions of f and because we can directly use the function values and derivatives used in TS00. The latter allows for direct comparison between our method and TS00, as we can use the values stored in TS00’s data file as our training data. This choice of ranges also allows us to separate the table into four square sections, each of which is decomposed by a quadtree.

Two different Dmax values are used on this domain, 7 and 9. With Dmax =7, we have the same cell size and locations as TS00 in areas where the quadtree is fully refined. When Dmax =9, the cell edges are four times smaller than TS00’s cells. When doing cell refinement and mapping, the domain is viewed in log-space.

The training data that is used for the quadtree emulator with Dmax =7 is obtained using the exact equation of state [20] for up to and including the second-order derivatives, with the higher-order derivatives being computed using a finite difference method as described in Section 2.1. At full refinement and using the linear-space model class, this quadtree emulator is equivalent to TS00, allowing for direct comparison. The spacing in both dimensions is 1/20 (0.05) (log10 (g/cm3) and log10 (K)). Similarly, the training data for the quadtree emulator with Dmax =9 is generated in the same way, except the spacing of data in both dimensions is 1/160 (0.00625) (log10 (g/cm3) and log10 (K)), which is half the size of a cell at max refinement (1/80) (log10 (g/cm3) and log10 (K)).

For each Dmax, we run at an εth of 10−1, 10−2, 10−3, 10−4, 10−5, 10−6, 10−7, 10−8, and 10−9. At each refinement step, we first select the points from the training set that lie within the cell’s domain. If more than 100 data points are selected, we randomly choose 100 of them. The predicted values for these points are computed and compared to the true values. The maximum relative error of these predictions is then computed and used as an estimate of the cell’s error.
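The per-cell error estimate used at each refinement step, as just described, might look like the following sketch. It is illustrative only; `model.predict` is a placeholder for a trained model of either class.

```python
import numpy as np

# A minimal sketch of the per-cell error estimate: take the training points
# inside the cell, keep at most 100 of them (chosen at random), and use the
# maximum relative error of the model's predictions as the cell's error.

rng = np.random.default_rng(0)

def estimate_cell_error(model, cell_bounds, train_xy, train_f, max_points=100):
    rho_lo, rho_hi, T_lo, T_hi = cell_bounds
    inside = ((train_xy[:, 0] >= rho_lo) & (train_xy[:, 0] <= rho_hi) &
              (train_xy[:, 1] >= T_lo) & (train_xy[:, 1] <= T_hi))
    pts, truth = train_xy[inside], train_f[inside]
    if len(truth) > max_points:
        keep = rng.choice(len(truth), size=max_points, replace=False)
        pts, truth = pts[keep], truth[keep]
    pred = model.predict(pts)                        # linear-space prediction
    return np.max(np.abs((pred - truth) / truth))    # L-infinity relative error
```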
Each combination of Dmax and εth produces four different emulators, one for each of the four sections of the domain (the domain we wish to model is rectangular, so stacking four quadtrees side-by-side in the density axis covers the domain of interest). These emulators are then saved in HDF5 files, without any data compression, using the mapping and encoding scheme discussed in Section 2.6. The memory cost for each Dmax and εth combination is computed as the sum of the four corresponding HDF5 files’ sizes; this is a slight overestimate of the memory cost because it includes metadata stored in the HDF5 files as part of their self-describing file format, which is not part of the table itself. For TS00 we compute the memory cost analytically by examining the number of double precision variables used for the electron-positron Helmholtz free energy interpolation, which are stored in a plain text file.

To estimate the error over the domain, we create a test set of data. This data is generated over the same domain except it is offset by (1/480) and has a spacing of (1/240) (log10 (K) and log10 (g/cm3)). This guarantees that the test and training sets are disjoint sets and that there are 9 test points in each cell for the smallest possible cell. These 9 points also sample near the midpoints between the cells and at the cell centers, which are observed to be the regions of the highest error. We then compute the error at each point and use three different norms, defined in Equations 2.17-2.19, to estimate the global error.

CHAPTER 3 RESULTS

Using the setup outlined in Section 2.7, we compare the expected relative norm errors between different emulators. Figures 3.1 and 3.2 show the estimated norm relative error vs. total table size in megabytes (MBs). Figure 3.1 shows the results for a quadtree emulator with Dmax =7. When at full refinement the domain decomposition is equivalent to TS00, i.e., the domain decomposition is a uniform grid of cells with edge lengths of 1/2^7 the size of the full domain. Three different norm errors, as defined by Equations (2.16)-(2.19), are represented by different colors. The solid lines correspond to using both the log-space and linear-space model classes, where the model class that has the lowest expected error is chosen on a cell-by-cell basis. The dashed lines correspond to just using the linear-space model class, which is equivalent to the TS00 interpolation scheme. Each point along these lines corresponds to a different εth = [10−1, 10−2, 10−3, 10−4, 10−5, 10−6, 10−7], from left to right respectively. The stars correspond to the memory and error of TS00. The triangles correspond to an emulator that has the same uniform domain decomposition as TS00 but uses both the log-space and linear-space model classes (the slight increase in table size of these over TS00 is due to the extra sign factor that must be stored for the log-space models). Figure 3.2 is similar to Figure 3.1 except Dmax =9.

In both Figures 3.1 and 3.2, the error stops decreasing at a given table size. This is due to the maximum refinement being reached in many regions before εth is achieved, i.e., there are regions where the error cannot be reduced further given the maximum depth of the quadtree Dmax. In Figure 3.1, this happens early, where the data size has reached ∼ 2 MBs. In Figure 3.2 the error stops decreasing when the data size has reached ∼ 10 MBs. After these points, the error in the regions that have reached maximum refinement dominates. Lowering the εth further thus has little effect on the norm errors, but does increase the size of the table.

The normalized density distribution of the log10(|·|) error is shown in Figure 3.3 (the integration of the distribution is unity).
Each row corresponds to a different εth, shown by the vertical red dashed line. The normalized error density distribution is shown for TS00 and for quadtree emulators with Dmax of 7 and 9, using linear-space and combined linear-space & log-space model classes. As εth decreases, so does the majority of the error. The cases where the error exceeds εth are due to two factors: 1) inaccuracies in our estimated maximum error based on our training data, which is expected, and 2) cells having reached maximum refinement before the error is reduced below εth. In the first row, the errors above εth are due mainly to 1), while in the last rows they are mainly due to 2). The Dmax =7 combined linear-space and log-space model class has noticeably less error above εth compared to TS00, while the Dmax =9 quadtree emulator offers drastic improvements in errors above εth. The improvement of the Dmax =7 linear-space model class above εth is due to the increased accuracy in computing the higher-order derivatives compared to TS00. In the case where the data used in the quadtree emulator is the same as used in TS00, the results are equivalent.

Figure 3.4 shows the log10(|ε|) over the domain for three different cases: (a) TS00, (b) a quadtree emulator that has the same norm error as TS00 (Dmax =7 and εth =10−4, corresponding to the fourth point from the left on the solid line in Figure 3.1), and (c) a quadtree emulator that is the same size as TS00 (Dmax =9 and εth =10−8, corresponding to the eighth point from the left in Figure 3.2). This figure shows that the quadtree emulator can maintain roughly consistent accuracy over its domain, leading to either significant memory savings or increased accuracy when compared to interpolation on a standard data table that is uniformly spaced in the relevant independent variables.

Figure 3.1: The error and memory cost for different emulators with Dmax =7. The dashed lines correspond to a quadtree emulator with just the linear-space model class and the solid lines correspond to a quadtree emulator using both linear-space and log-space model classes. For each type of error, the points going left to right correspond to the following εth of [10−1, 10−2, 10−3, 10−4, 10−5, 10−6, 10−7]. The stars correspond to TS00 and the triangles to TS00 but with log-space and linear-space model classes being used.

Figure 3.2: The error and memory cost for different emulators with Dmax =9. The dashed lines correspond to a quadtree emulator with just the linear-space model class and the solid to a quadtree emulator using both linear-space and log-space model classes. For each type of error, the points going left to right correspond to an εth of [10−1, 10−2, 10−3, 10−4, 10−5, 10−6, 10−7, 10−8, 10−9]. The stars correspond to TS00 and the triangles to TS00 but with log-space and linear-space model classes being used.

Figure 3.3: An estimate of the log10(ε) density over the domain for four different cases: TS00, linear-space model class with Dmax =7, and log-space and linear-space model classes with a Dmax of 7 and 9.

Figure 3.4: Spatial relative error for three different cases: (a) TS00’s error, (b) Quadtree emulator that has the same norm error as TS00 (Dmax =7, εth =10−4), (c) Quadtree emulator that has the same size as TS00 (Dmax =9, εth =10−8).
CHAPTER 4 DISCUSSION

By increasing the complexity of the domain decomposition over the emulator’s domain we can avoid excessive refinement in regions where sparse refinement produces acceptable errors as compared to the “truth” – i.e., the calculation that generates the tabular data. Similarly, we can increase the refinement in under-resolved regions that have relatively high error. This allows for the desired accuracy to be achieved throughout the domain without the excess memory requirements needed by a uniformly refined data table. Since the needed refinement of f varies drastically over the domain, we achieve substantial memory reductions from the increased domain decomposition complexity – a factor of about 20× at a similar level of error to TS00.

The cost of increasing the complexity of the domain decomposition is an additional O(log2(m)) search that must be performed for each input. Because the size of the array being searched is relatively small, and inputs close together in density-temperature space are typically close together in memory, this operation will likely not suffer from excess memory movement from DRAM but will benefit greatly from cache reuse. Thus we predict the increased computational cost to be negligible compared to the memory savings. An added benefit to the quadtree decomposition producing cells that are closer to their maximum size, while still being within an error tolerance, is that it effectively increases the arithmetic intensity by reducing the number of cells that need to be loaded for a given set of input values. Since simulations using tabular data are generally memory-bound, this should have the effect of increasing overall performance. Examining the extent of this improvement and how it should influence the optimal size of cells is a topic of future research.

We have shown that allowing different model classes to be used throughout the table has increased the accuracy of the table while also decreasing memory cost. Although considerations of thermodynamic consistency constrain the types of model classes that can be used, the quadtree emulator framework allows for an arbitrary number of different model classes. Our future research will investigate a variety of different model classes that may perform well in different regions of the domain.

This paper has focused on emulating a 2D domain using a quadtree. However, extending our framework to higher dimensions simply requires using a higher-order tree, e.g., an octree for 3D. The memory savings in higher dimensions, compared to a uniform grid, could be even more significant. As a best-case scenario in terms of memory savings, assume a D-dimensional domain defined by a hypercube with corners at the origin and at (1, ..., 1), with a (D − 1)-dimensional hyperplane discontinuity intersecting the corner at the origin. Furthermore, assume that to resolve the discontinuity each cell that contains the discontinuity needs to be a certain (small) size and that all other cells can be as big as possible. For the uniform case, this would require all cells to be small, i.e., 2^(ND) cells, where D is the dimension and N is the maximum depth of the tree needed to reach the smallest cell size. Using a tree decomposition, we find the cell count by recursively adding 2^D − 1 cells for every refinement except for the last one, in which case we add 2^D. Thus the total cell count is N(2^D − 1) + 1 cells. A 2D example is shown in Figure 4.1(a). In this case, increasing the dimension of the problem further increases the memory savings, as shown in Figure 4.1(c).
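The corner-case counts can be checked numerically. The short script below evaluates the two expressions above for a few dimensions; the printed numbers are my own arithmetic, not results taken from the thesis.

```python
# Numerical check of the corner-case counting argument: a uniform grid needs
# 2**(N*D) cells, while the tree needs N*(2**D - 1) + 1 cells.

def uniform_cells(N, D):
    return 2 ** (N * D)

def tree_cells_corner(N, D):
    return N * (2 ** D - 1) + 1

N = 7
for D in (2, 3, 4):
    Nu, Nc = uniform_cells(N, D), tree_cells_corner(N, D)
    print(f"D={D}, N={N}: uniform={Nu}, tree={Nc}, ratio={Nc / Nu:.2e}")
# For D=2, N=7 this gives 22 tree cells versus 16384 uniform cells.
```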
For a prototypical case, assume the same hypercube is bisected by a (D − 1)-dimensional hyperplane representing a discontinuity in the data values, which intersects the hypercube at half of its corners, and has similar cell resolution constraints as before. A 2D example is shown in Figure 4.1(b). In this case, we compute the number of cells recursively. Doing a single refinement on this hypercube using a 2^D-tree, we have 2^D cells, with 2^(D−1) being bisected by the hyperplane and 2^(D−1) having their corners intersected by the hyperplane. We have already done the case where a hyperplane intersects the corner of a hypercube, in which case N(2^D − 1) + 1 cells are needed. We have 2^(D−1) of these, each of which will have N − 1 more refinements. Therefore, we have 2^(D−1)((N − 1)(2^D − 1) + 1) new cells. The 2^(D−1) cells that do not have their corner intersected but are bisected by the discontinuity can be treated recursively. With each refinement we decrease N by 1 and increase the number of cells that have their corners intersected by a multiplicative factor of 2^(D−1). Thus we have

\[
\sum_{i=1}^{N} 2^{i(D-1)} \left( (N-i)(2^D - 1) + 1 \right)
\]

cells. On the last refinement, i = N, we need to add in the bisected cells, which is 2^(N(D−1)). All together we require a total of

\[
N_c = \sum_{i=1}^{N} \left[ 2^{i(D-1)} \left( (N-i)(2^D - 1) + 1 \right) \right] + 2^{N(D-1)}
\]

cells in the quadtree to refine the discontinuity, in comparison with the N_u = 2^(ND) cells required to uniformly resolve the D-dimensional space with 2^N cells per edge, for a reduction in memory by a factor of N_c/N_u. We find that in this case, the memory savings as we increase the dimensions would be similar to those of the corner case, as shown in Figure 4.1(c). This figure also includes the two quadtree emulators presented in Section 3 (Dmax =7, εth =10−4 and Dmax =9, εth =10−7), which are chosen at the points where the L2 error has stopped decreasing. The behavior of the emulator for these cases is roughly in line with the prototypical case, which is consistent with the fact that the data tables have clearly visible discontinuities (see, e.g., Figure 2.1(a)). Data tables with multiple discontinuities would be expected to have somewhat higher memory consumption, but still have significant memory savings using the quadtree emulator compared to uniformly-spaced tabular data. We predict that the memory savings that come from using this framework to emulate higher dimensional data tables will typically result in progressively more significant savings in terms of memory consumption, with the same caveat regarding data discontinuities as in the 2D examples shown in this work.

We have presented a proof-of-concept result that highlights the memory savings that can be gleaned from using tree-based methods to emulate tabular data. There is still room for further optimization to reduce memory costs and optimize the speed with which data can be retrieved from the emulator. In practice, modern multi- and many-core CPU and GPU architectures make a variety of tradeoffs between memory bandwidth, cache size, cache structure, cache bandwidth, and per-core arithmetic capabilities and speed. These details need to be considered when optimizing a tree-based emulator. Further optimizations could take into account the structure of the emulated data, as well as the simulation data itself (e.g., pre-loading branches of the emulator’s tree into cache based on local simulation quantities in order to save lookup time).

Figure 4.1: An example of a quadtree refined region with a discontinuity crossing the (a) corner and (b) diagonal. (c) The compression ratio between using a tree-based grid refinement method vs. a uniform refinement for the cases shown in (a) and (b) but in varying dimensions. The red stars in (c) show the compression of the Dmax =7, εth =10−4 and Dmax =9, εth =10−7 cases. The εth is chosen for each Dmax such that the L2 error has mostly stopped decreasing.
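The diagonal-case formula for N_c can likewise be evaluated and compared against N_u. The numbers below are my own evaluation of the expressions above, intended only as a sanity check of the trend plotted in Figure 4.1(c).

```python
# Evaluate the diagonal-case tree count N_c and the uniform count N_u.

def tree_cells_diagonal(N, D):
    total = sum(2 ** (i * (D - 1)) * ((N - i) * (2 ** D - 1) + 1)
                for i in range(1, N + 1))
    return total + 2 ** (N * (D - 1))

def uniform_cells(N, D):
    return 2 ** (N * D)

for D in (2, 3):
    for N in (5, 7, 9):
        Nc, Nu = tree_cells_diagonal(N, D), uniform_cells(N, D)
        print(f"D={D}, N={N}: N_c={Nc}, N_u={Nu}, N_c/N_u={Nc / Nu:.3e}")
# For D=2, N=7 this evaluates to N_c=1102 versus N_u=16384.
```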
We expect that the increased flexibility provided by the tree-based method presented in this paper will ultimately allow for the utilization of all of these factors in lowering both the overall runtime and memory consumption of simulations utilizing tabular data during runtime. We plan to explore this in future work.

In the present work, we focus on the application of our quadtree emulator to the recovery of the equation of state of stellar plasmas from tabulated data, but our approach can be generalized to other table-based recovery schemes. Potential applications include other tabular equations of state, recovery of complex opacity data for radiative transfer calculations, or materials properties.

CHAPTER 5 CONCLUSIONS

We have presented a scheme for tabular data that uses a quadtree decomposition to reduce memory costs and keep the global accuracy consistent throughout the domain. For our proxy problem, the electron-positron Helmholtz free energy equation of state, we have achieved a memory reduction of 20× without loss of accuracy, or a four-orders-of-magnitude increase in accuracy for a table of the same size. Although outlined in 2D, this method is easily modified for higher-dimensional problems and has the flexibility of using many types of interpolation schemes in coordination.

BIBLIOGRAPHY

[1] A. S. Jermyn, J. Schwab, E. Bauer, F. X. Timmes, A. Y. Potekhin, Skye: A differentiable equation of state, arXiv:2104.00691 [astro-ph] (2021). arXiv:2104.00691. URL http://arxiv.org/abs/2104.00691

[2] R. A. Finkel, J. L. Bentley, Quad trees: a data structure for retrieval on composite keys, Acta Informatica 4 (1) (1974) 1–9. doi:10.1007/BF00288933. URL https://doi.org/10.1007/BF00288933

[3] T. Markas, J. Reif, Quad tree structures for image compression applications, Information Processing & Management 28 (6) (1992) 707–721, publisher: Pergamon. doi:10.1016/0306-4573(92)90063-6. URL http://www.sciencedirect.com/science/article/pii/0306457392900636

[4] A. Gersho, R. M. Gray, Vector quantization and signal compression, Vol. 159, Springer Science & Business Media, 2012.

[5] B. Fryxell, K. Olson, P. Ricker, F. X. Timmes, M. Zingale, D. Q. Lamb, P. MacNeice, R. Rosner, J. W. Truran, H. Tufo, FLASH: An adaptive mesh hydrodynamics code for modeling astrophysical thermonuclear flashes, The Astrophysical Journal Supplement Series 131 (1) (2000) 273, publisher: IOP Publishing. doi:10.1086/317361. URL http://iopscience.iop.org/article/10.1086/317361/meta

[6] Z. He, C. Wu, G. Liu, Z. Zheng, Y. Tian, Decomposition tree: a spatio-temporal indexing method for movement big data, Cluster Computing 18 (4) (2015) 1481–1492. doi:10.1007/s10586-015-0475-3. URL https://doi.org/10.1007/s10586-015-0475-3

[7] F. D. Swesty, Thermodynamically consistent interpolation for equation of state tables, Journal of Computational Physics 127 (1) (1996) 118–127. doi:10.1006/jcph.1996.0162. URL http://www.sciencedirect.com/science/article/pii/S002199919690162X

[8] F. X. Timmes, F. D.
Swesty, The Accuracy, Consistency, and Speed of an Electron-Positron Equation of State Based on Table Interpolation of the Helmholtz Free Energy, The Astrophysi- cal Journal Supplement Series 126 (2000) 501–516. doi:10.1086/313304. URL http://adsabs.harvard.edu/abs/2000ApJS..126..501T [9] M. J. Cawkwell, M. Zecevic, D. J. Luscher, K. J. Ramos, Complete equations of state for cyclotetramethylene tetranitramine, Propellants, Explosives, Pyrotechnics (2021). doi:https: //doi.org/10.1002/prep.202000274. URL http://onlinelibrary.wiley.com/doi/abs/10.1002/prep.202000274 [10] J. D. Coe, S. P. Rudin, B. Maiorov, Multiphase equation of state and thermoelastic data for polycrystalline beryllium, AIP Conference Proceedings 2272 (1) (2020) 070009, publisher: American Institute of Physics. doi:10.1063/12.0000902. URL http://aip.scitation.org/doi/abs/10.1063/12.0000902 30 [11] A. L. Kuhl, J. B. Bell, D. Grote, Diffusion effects near discontinuities in explosions, AIP Conference Proceedings 2272 (1) (2020) 070024, publisher: American Institute of Physics. doi:10.1063/12.0001094. URL http://aip.scitation.org/doi/abs/10.1063/12.0001094 [12] M. Oertel, M. Hempel, T. Klähn, S. Typel, Equations of state for supernovae and compact stars, Reviews of Modern Physics 89 (2017) 015007. doi:10.1103/RevModPhys.89.015007. URL http://adsabs.harvard.edu/abs/2017RvMP...89a5007O [13] D. Pochik, B. L. Barker, E. Endeve, J. Buffaloe, S. J. Dunham, N. Roberts, A. Mezzacappa, thornado-hydro: A discontinuous galerkin method for supernova hydrodynamics with nuclear equations of state, The Astrophysical Journal Supplement Series 253 (1) (2021) 21. doi: 10.3847/1538-4365/abd700. URL https://doi.org/10.3847/1538-4365/abd700 [14] B. Paxton, L. Bildsten, A. Dotter, F. Herwig, P. Lesaffre, F. Timmes, Modules for experiments in stellar astrophysics (MESA), The Astrophysical Journal Supplement Series 192 (2011) 3. doi:10.1088/0067-0049/192/1/3. URL http://adsabs.harvard.edu/abs/2011ApJS..192....3P [15] H. Shen, H. Toki, K. Oyamatsu, K. Sumiyoshi, Relativistic equation of state for core-collapse supernova simulations, The Astrophysical Journal Supplement Series 197 (2011) 20. doi: 10.1088/0067-0049/197/2/20. URL http://adsabs.harvard.edu/abs/2011ApJS..197...20S [16] H. Shen, F. Ji, J. Hu, K. Sumiyoshi, Effects of symmetry energy on the equation of state for simulations of core-collapse supernovae and neutron-star mergers, The Astrophysical Journal 891 (2) (2020) 148, publisher: American Astronomical Society. doi:10.3847/1538-4357/ ab72fd. URL https://doi.org/10.3847/1538-4357/ab72fd [17] S. P. Lyon, Sesame: the los alamos national laboratory equation of state database, Los Alamos National Laboratory report LA-UR-92-3407 (1992). [18] A. W. Steiner, M. Hempel, T. Fischer, CORE-COLLAPSE SUPERNOVA EQUATIONS OF STATE BASED ON NEUTRON STAR OBSERVATIONS, The Astrophysical Journal 774 (1) (2013) 17, publisher: IOP Publishing. doi:10.1088/0004-637X/774/1/17. URL http://iopscience.iop.org/article/10.1088/0004-637X/774/1/17/meta [19] S. W. Bruenn, J. M. Blondin, W. R. Hix, E. J. Lentz, O. E. B. Messer, A. Mezzacappa, E. Endeve, J. A. Harris, P. Marronetti, R. D. Budiardja, M. A. Chertkow, C.-T. Lee, Chimera: A massively parallel code for core-collapse supernova simulations, The Astrophysical Journal Supplement Series 248 (1) (2020) 11, publisher: IOP Publishing. doi:10.3847/1538-4365/ ab7aff. URL http://iopscience.iop.org/article/10.3847/1538-4365/ab7aff/meta [20] F. X. Timmes, D. 
Arnett, The Accuracy, Consistency, and Speed of Five Equations of State for Stellar Hydrodynamics, The Astrophysical Journal Supplement Series 125 (1999) 277–294. 31 doi:10.1086/313271. URL http://adsabs.harvard.edu/abs/1999ApJS..125..277T [21] M. Baer, findiff, https://github.com/maroba/findiff (2021). [22] G. M. Morton, A computer oriented geodetic data base and a new technique in file sequencing (1966). [23] N. Mamoulis, Spatial data management, Synthesis Lectures on Data Management 3 (6) (2011) 1–149. 32