EFFICIENT AND PORTABLE SPARSE SOLVERS FOR HETEROGENEOUS HIGH PERFORMANCE COMPUTING SYSTEMS

By Md Fazlay Rabbi

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2022

ABSTRACT

EFFICIENT AND PORTABLE SPARSE SOLVERS FOR HETEROGENEOUS HIGH PERFORMANCE COMPUTING SYSTEMS

By Md Fazlay Rabbi

Sparse matrix computations arise in the form of the solution of systems of linear equations, matrix factorization, linear least-squares problems, and eigenvalue problems in numerous computational disciplines ranging from quantum many-body problems and computational fluid dynamics to machine learning and graph analytics. The scale of problems in these scientific applications typically necessitates execution on massively parallel architectures. Moreover, due to the irregular data access patterns and low arithmetic intensities of sparse matrix computations, achieving high performance and scalability is very difficult. These challenges are further exacerbated by the increasingly complex deep memory hierarchies of modern architectures, which typically integrate several layers of memory storage. Data movement is a major bottleneck for both performance and energy consumption in large-scale sparse matrix computations. Minimizing data movement across layers of the memory hierarchy and overlapping data movement with computation are keys to achieving high performance in sparse matrix computations. My thesis work contributes towards systematically identifying the algorithmic challenges of sparse solvers and providing optimized, high-performance solutions for both shared memory and heterogeneous architectures by minimizing data movement between different memory layers. For this purpose, we first introduce a shared memory task-parallel framework that focuses on optimizing entire solvers rather than a specific kernel. As most recent (and upcoming) supercomputers are equipped with Graphics Processing Units (GPUs), we then evaluate the efficacy of directive-based programming models (i.e., OpenMP and OpenACC) in offloading computations to GPUs to achieve performance portability. Encouraged by the promising results of this work, we port and optimize our shared memory task-parallel framework to GPU-accelerated systems so that it can execute problem sizes that exceed device memory.

Copyright by MD FAZLAY RABBI 2022

This thesis is dedicated to my family.

ACKNOWLEDGEMENTS

This thesis would not have been possible without the help and support of many people. First and foremost, I would like to thank my advisor Dr. Hasan Metin Aktulga for his guidance and support throughout my Ph.D. study. Dr. Aktulga not only taught me how to become a good researcher but also provided me with a perfect example of balancing work and family: being a researcher, a husband, and a father. Dr. Aktulga is also a constant source of ideas and insights. He is always someone I can look up to, and I was very fortunate to have him as a mentor. I will emulate his humility and kindness in every step of my life. Besides my advisor, I would like to thank Dr. Sandeep Kulkarni, Dr. Brian O'Shea, and Dr. Ümit V. Çatalyürek for taking the time to serve on my thesis committee and providing me with useful feedback. I am thankful to our collaborators, Dr. Nicholas Wright and Christopher Daley from the National Energy Research Scientific Computing Center (NERSC). Our frequent exchanges of ideas helped shape my research.
I would like to thank Michigan State University and my program for giving me the opportunity to be a proud Spartan. I hope our football team keeps beating the University of Michigan until the end of the world! Surviving graduate school would have been impossible without the support and encouragement of many fellow graduate students. Over the years, the friends I made here at MSU became my family. I consider myself lucky to have such incredible people around me, both inside and outside the lab. Without these friends, my years at MSU would not have been nearly as enjoyable. I thank my labmates for the interesting discussions we had about life, research, the job search, and everything else. I am also grateful to Abdullah for his time in reviewing this dissertation. I would like to thank my parents (Ansar Ali and Fatema Khatun) and family for their constant support throughout my educational journey. Without them, I would not have been where I am now. My elder brother Altaf Hossain has always been there for me whenever I needed him. A special thanks to my son Taahir Ruzaif for the immense joy he has brought me. I will be forever grateful to my family. Last, because she deserves a special place, I would like to thank my wife Tanzila for waiting patiently for me to graduate. She is a true superwoman in my life who has been managing everything for me just to give me enough time to cross the finish line. All my accomplishments would have been impossible without her support. Her love and unwavering support kept me going during challenging times. This Ph.D. is as much hers as it is mine. I will miss the beautiful MSU campus. Forever and always - Go Green! Go White!!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Contributions of This Thesis
  1.3 DeepSparse Framework
  1.4 Evaluation of Directive-based Programming Models on GPUs
  1.5 DeepSparseGPU Framework
  1.6 Related Work
CHAPTER 2 DEEPSPARSE: A TASK-PARALLEL FRAMEWORK FOR SPARSE SOLVERS ON SHARED MEMORY ARCHITECTURES
  2.1 DeepSparse Overview
    2.1.1 Primitive Conversion Unit (PCU)
      2.1.1.1 Task Identifier (TI)
      2.1.1.2 Task Dependency Graph Generator (TDGG)
    2.1.2 Task Executor
    2.1.3 Illustrative Example
    2.1.4 Limitations of the Task Executor
  2.2 Benchmark Applications
    2.2.1 Lanczos
    2.2.2 LOBPCG
  2.3 Performance Evaluation
    2.3.1 Experimental Setup
    2.3.2 LOBPCG Evaluation
    2.3.3 Lanczos Evaluation
    2.3.4 Compiler Comparison
  2.4 Summary of the Chapter
CHAPTER 3 EVALUATION OF DIRECTIVE-BASED GPU PROGRAMMING MODEL ON A BLOCK ITERATIVE SOLVER
  3.1 Methodology
    3.1.1 The LOBPCG algorithm
    3.1.2 Baseline CPU implementation
    3.1.3 A GPU implementation of LOBPCG
    3.1.4 Tiling LOBPCG kernels to fit in GPU memory capacity
      3.1.4.1 SpMM kernel
      3.1.4.2 Inner product kernel
    3.1.5 Hardware and software environment
    3.1.6 Experiments
      3.1.6.1 Performance of the LOBPCG solver
      3.1.6.2 Performance of X^T Y and SpMM kernels for large matrices
      3.1.6.3 Performance of tiled and Unified Memory versions of SpMM
  3.2 Results
    3.2.1 Performance of the LOBPCG solver
    3.2.2 Performance of X^T Y and SpMM kernels for large matrices
    3.2.3 Performance of tiled and Unified Memory versions of SpMM
  3.3 Discussion
  3.4 Summary of the Chapter
CHAPTER 4 A PORTABLE SPARSE SOLVER FRAMEWORK FOR LARGE MATRICES ON HETEROGENEOUS ARCHITECTURES
  4.1 DeepSparseGPU Overview
    4.1.1 Primitive Conversion Unit (PCU)
      4.1.1.1 Task Identifier (TI)
      4.1.1.2 Task Dependency Graph Generator (TDGG)
    4.1.2 Task Executor (TE)
      4.1.2.1 Kernel Launcher (KL)
      4.1.2.2 Memory Manager (MM)
  4.2 Performance Evaluation
    4.2.1 Experimental Setup
      4.2.1.1 Benchmark Application
      4.2.1.2 Dataset
      4.2.1.3 Baseline Implementation
      4.2.1.4 Incremental Implementation of DeepSparseGPU
    4.2.2 LOBPCG Evaluation
      4.2.2.1 Data Movement between Host and Device
      4.2.2.2 Execution Time
      4.2.2.3 Effect of Pinned Memory
      4.2.2.4 Comparison with a CPU version
  4.3 Discussion
    4.3.1 Usage of Data Prefetching and Multiple Streams
  4.4 Summary of the Chapter
CHAPTER 5 EXPLORING FINE-GRAINED TASK PARALLELISM USING CUDA GRAPH
  5.1 CUDA Graph
  5.2 Implementing DeepSparseGPU Task Graph using CUDA Graph
  5.3 Performance Evaluation
    5.3.1 Experimental Setup
    5.3.2 Dataset
    5.3.3 Baseline Implementation
    5.3.4 LOBPCG Evaluation
  5.4 Summary of the Chapter
CHAPTER 6 CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Major data structures after parsing the third line.
Table 2.2: Matrices used in our evaluation.
Table 3.1: Test Matrices.
Table 4.1: Matrices used to evaluate problem sizes > 16 GB.
Table 4.2: Matrices used to evaluate problem sizes < 16 GB.
Table 5.1: Matrices used in our evaluation.

LIST OF FIGURES

Figure 1.1: Memory hierarchy in deep memory architectures.
Figure 1.2: Applications of sparse linear algebra.
Figure 2.1: Schematic overview of DeepSparse.
Figure 2.2: Overview of input/output matrix partitioning for the task-based matrix multiplication kernel.
Figure 2.3: Overview of matrix partitioning for the task-based SpMM kernel.
Figure 2.4: Overview of matrix partitioning for the task-based inner product kernel.
Figure 2.5: Task graph for the pseudocode in Listing 2.2.
Figure 2.6: A sample task graph for the LOBPCG algorithm using a small sparse matrix.
Figure 2.7: Comparison of L1, L2, LLC misses and execution times between DeepSparse, libcsb and libcsr for the LOBPCG solver.
Figure 2.8: LOBPCG single iteration execution flow graph of dielFilterV3real.
Figure 2.9: Comparison of L1, L2, LLC misses and execution times between DeepSparse, libcsb and libcsr for the Lanczos solver.
Figure 2.10: Comparison of execution time for different compilers between DeepSparse, libcsb and libcsr for the Lanczos Algorithm. (Blue/Left: GNU, Red/Middle: Intel, Green/Right: Clang compiler.)
Figure 2.11: Cache miss comparison between compilers for HV15R.
Figure 3.1: The use of is_device_ptr to avoid memory copies. Error checking is omitted for brevity.
Figure 3.2: Overview of tiling SpMM operation.
Figure 3.3: Overview of tiling Inner Product kernel.
Figure 3.4: The time spent in LOBPCG on Cori-GPU and Summit when using various compilers with either OpenMP or OpenACC.
Figure 3.5: The time spent in LOBPCG on Cori-GPU when using matrix Nm7.
Figure 3.6: LOBPCG GPU speedup on Cori-GPU for each test matrix.
Figure 3.7: Time spent in X^T Y kernel on Cori-GPU and Summit when the memory footprint exceeds GPU memory capacity.
Figure 3.8: Time spent in SpMM kernel on Cori-GPU and Summit when the memory footprint exceeds GPU memory capacity.
Figure 3.9: Time spent in tiled and Unified Memory versions of the SpMM kernel on Cori-GPU and Summit. The memory footprint is less than GPU memory capacity.
Figure 3.10: Time spent in tiled and Unified Memory versions of the SpMM kernel on Cori-GPU and Summit. The memory footprint exceeds GPU memory capacity. We use a logarithmic scale on the Time (sec) axis to capture the slow run time for the Unified Memory configuration on Summit.
Figure 3.11: Unified Memory nvprof profile of the XY microbenchmark on Cori-GPU (top) and Summit (bottom).
Figure 4.1: Schematic overview of DeepSparseGPU framework.
Figure 4.2: Data structures used in managing tile information in Memory Manager.
Figure 4.3: Illustrative example of how MM works.
Figure 4.4: Code snippet of mat_add kernel in DeepSparse_MAP version.
Figure 4.5: Code snippet of mat_add kernel in DeepSparse_UM version.
Figure 4.6: Code snippet of mat_add kernel in DeepSparseGPU framework.
Figure 4.7: Average H2D data transfer per iteration.
Figure 4.8: Average D2H data transfer per iteration.
Figure 4.9: Average execution time per iteration.
Figure 4.10: OpenMP vs CUDA execution time comparison for most expensive kernels.
Figure 4.11: Performance of DeepSparseGPU using pinned memory.
Figure 4.12: CPU vs DeepSparseGPU execution time comparison (problem size > 16 GB).
Figure 4.13: CPU vs DeepSparseGPU execution time comparison with different CPU-GPU interconnect.
Figure 4.14: CPU vs DeepSparseGPU execution time comparison using problem size < 16 GB.
Figure 5.1: Toy task dependency graph.
Figure 5.2: libcsr (CUDA) vs DeepSparseGPU (CUDA Graph) execution time comparison.

LIST OF ALGORITHMS

Algorithm 1: SpMM Kernel
Algorithm 2: Lanczos Algorithm in Exact Arithmetic
Algorithm 3: LOBPCG Algorithm (for simplicity, without a preconditioner)
Algorithm 4: Tiled SpMM (cusparseDcsrmm) kernel
Algorithm 5: Tiled Inner Product (cublasDgemm) Kernel

CHAPTER 1 INTRODUCTION

1.1 Motivation

Sparse matrix computations arise in the form of the solution of systems of linear equations, matrix factorizations, linear least-squares problems, and eigenvalue problems in numerous computational disciplines ranging from quantum many-body problems [1] and computational fluid dynamics [2] to machine learning [3] and graph analytics [4]. Most scientific applications in these domains involve large-scale sparse matrices and require running on massively parallel computing architectures. Moreover, sparse matrices are often irregularly structured and come in different forms and with different properties depending on the application area. Achieving high performance and scalability in sparse matrix computations is challenging due to their irregular data access patterns and low arithmetic intensities. As a consequence of these irregular access patterns, computation times are often bound by data movement speeds between memory layers or between processors rather than by the available compute resources. These challenges become even more pronounced on modern computing architectures, whose memory hierarchies are becoming increasingly complex. Figure 1.1 shows an abstract view of the assumed underlying memory hierarchy in modern architectures. This deep memory hierarchy offers different bandwidths and latencies depending on the proximity between the cores and the storage devices. As we go farther away from the processor, memory capacity increases at the expense of increased latency and reduced bandwidth. Therefore, minimizing data movement across layers of the memory hierarchy and overlapping data movement with computations are keys to achieving high performance in sparse matrix computations.

Figure 1.1: Memory hierarchy in deep memory architectures.

As we move into the exascale computing era [5], leadership supercomputers are becoming increasingly heterogeneous. More than 25% of TOP500 [6] supercomputers adopt accelerators such as Graphics Processing Units (GPUs) due to their massive parallel computing capabilities. GPUs have been widely used to accelerate applications, including sparse matrix computations, in various domains such as machine learning, graph analytics, and climate modeling. However, the more complex memory layers in the memory hierarchy of modern heterogeneous systems further complicate the programming, migration, and optimization of sparse matrix computations. Given NVIDIA's dominance in GPU computing, most GPU acceleration efforts have focused on CUDA, an NVIDIA-hardware-specific programming model. Considering the increasing number of computing systems with GPUs from different vendors, choosing a portable programming model that allows applications to adapt to the diversity in heterogeneous computing is crucial from software maintenance and sustainability perspectives.
OpenMP 4.5+ and OpenACC have emerged as portable directive-based programming models that provide a unified programming model for CPUs and GPUs. The higher level of abstraction adopted in these directive-based programming models improves application portability and developer productivity. Moreover, recent studies show that OpenMP and OpenACC are robust and able to provide good performance on different hardware [7, 8, 9]. Whether an application uses a device-specific (e.g., CUDA) or a portable (e.g., OpenMP) programming model, the amount of data moved between the CPU and the GPU is a key concern in heterogeneous systems due to the wide gap between the GPU's computing power and memory bandwidth. For best performance, programmers have to explicitly control the data movement between the host (CPU) and the device (GPU), but this comes at the expense of programmer productivity. As a general solution, Unified Virtual Memory (UVM) has been available since CUDA 8.0 and the Pascal architecture (2016). The key idea behind UVM is that programmers no longer need to think about the device and the host memory as two distinct memory spaces. Instead, UVM creates a pool of managed memory that is shared between the CPU and GPU and is accessible from both, with memory pages being migrated on demand by the CUDA runtime system automatically. Even though UVM dramatically reduces developer effort in managing data movement, it typically causes a significant hit in performance. While the size of the data needed by scientific applications and machine learning/data analytics workloads increases rapidly, the memory available on GPUs increases at a much more modest pace in comparison. For example, NVIDIA's Tesla V100 "Volta" GPUs have only 16 GB of device memory available; one could have up to 80 GB of memory on the newest generation NVIDIA A100s, but they come with a price tag not affordable by most customers or even HPC centers. To illustrate this issue, take, for instance, the Many Fermion Dynamics - nuclei (MFDn) code, which is a quantum many-body code based on the configuration interaction model. MFDn is a memory-bound application, i.e., scientific studies using this code typically utilize all available memory (DRAM) space, thus easily exceeding the total device memory available [10, 11]. When an application's working set size exceeds the device memory size, the resulting data movement becomes a critical design and performance bottleneck [12]. Considering the complex memory hierarchies of modern heterogeneous computing systems and the increasing problem sizes of large-scale scientific applications, there is a dire need for new approaches at the algorithmic and runtime system levels for sparse matrix computations.

1.2 Contributions of This Thesis

In this thesis, we aim to design and develop a sparse linear algebra framework to accelerate sparse solver codes on modern architectures with deep memory hierarchies. The contributions of this thesis are divided into three parts. In the first part, we design and implement a sparse solver framework named DeepSparse on shared memory architectures to support a wide variety of sparse solvers. We use OpenMP task parallelism to implement DeepSparse to ensure its easy integration and adaptation in real-world applications. We evaluate the performance of the DeepSparse framework on two important eigensolvers widely used in large-scale scientific computing applications: the Lanczos eigensolver [13] and the Locally Optimal Block Preconditioned Conjugate Gradient algorithm (LOBPCG) [14].
In Chapter 2, we present the design and implementation details of the DeepSparse framework. In the second part, encouraged by the results on shared memory systems, we begin porting and optimizing the DeepSparse framework for GPU-accelerated computing systems. As a feasibility study, we evaluate the efficacy of directive-based GPU programming models for the LOBPCG solver and consider the performance implications for large sparse matrices. We implement the LOBPCG eigensolver using two popular directive-based programming models, OpenMP and OpenACC, for GPU-accelerated systems. We also consider creating an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels in LOBPCG (the inner product and SpMM kernels) and evaluate their performance using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. In Chapter 3, we present the findings of our preliminary studies on directive-based GPU programming models. In the third part, we extend our task-parallel sparse linear algebra framework described in Chapter 2 to GPUs, which we name DeepSparseGPU. The main goal of the DeepSparseGPU framework is to ease the development of performant and portable sparse solvers for large problems on GPU-accelerated architectures. Building upon our DeepSparse framework [15], which runs on shared memory architectures, DeepSparseGPU leverages OpenMP's target offloading functionality. As we aim to optimize DeepSparseGPU for GPUs, several changes were made to the original DeepSparse framework, especially to support applications with memory requirements that exceed the available device memory capacity. We adopt the same task-based approach, which serves the double purpose of enabling data locality optimizations and facilitating the management of application data so that it can be processed in batches that fit into the device memory. In Chapter 4, we present the design and implementation details of the DeepSparseGPU framework. In Chapter 5, we present
Our proposed DeepSparse framework differs from existing work in two ways. First, we propose a holistic approach that targets all computational steps in a sparse solver rather than narrowing the problem into a single kernel, e.g., sparse matrix vector multiplication (SpMV) or sparse matrix multiple vector multiplication (SpMM), as is commonly done. Second, we adopt a fully integrated task-parallel approach which is well suited for the high degrees of parallelism available in modern hardware. DeepSparse provides a GraphBLAS plus BLAS/LAPACK-like frontend for domain scientists to express their algorithms without having to worry about the architectural details (e.g., mem- ory hierarchy) and parallelization considerations (i.e., determining the individual tasks and their scheduling) [16, 17, 18]. DeepSparse automatically generates and expresses the entire computation 5 as a task dependency graph (TDG) where each node corresponds to a specific part of a com- putational kernel and edges denote control and data dependencies between computational tasks. We chose to build DeepSparse on top of OpenMP[19] because OpenMP is the most commonly used shared memory programming model, but more importantly it supports task-based data-flow programming abstraction. As such, DeepSparse relies on OpenMP for parallel execution of the TDG. We anticipate two main advantages of DeepSparse over a conventional bulk synchronous parallel (BSP) approach where each kernel relies on loop parallelization and is optimized independently. First, DeepSparse would be able to expose better parallelism as it creates a global task graph for the entire sparse solver code. Second, since the OpenMP runtime system has explicit knowledge about the TDG, it may be possible to leverage a pipelined execution of tasks that have data dependencies, thereby leading to better utilization of the hardware cache. 1.4 Evaluation of Directive based Programming Models on GPUs Heterogeneous computing architectures are becoming more popular among the high perfor- mance computing (HPC) communities as more than a quarter of TOP500 and Green500 supercom- puters employ GPU accelerators [6, 20]. Also, there is a pressing need to migrate and optimize applications for execution on GPUs and other accelerators in order to fully utilize their computing capabilities. Most of the existing GPU based heterogeneous computing systems are equipped with CPUs and GPUs from different vendors for accelerating workloads. Most recent and future planned systems for the Department of Energy Office of Advanced Scientific Computing Research (DOE ASCR) include Perlmutter at NERSC (AMD CPU + NVIDIA GPU nodes), Aurora at ALCF (Intel CPU + Intel Xe accelerator nodes), and Frontier at OLCF (AMD CPU + AMD GPU nodes). The full capability of these systems can only be realized by making efficient use of the accelerators on the compute nodes. Most efforts to use accelerators to date have involved scientists using the CUDA programming language that only targets NVIDIA GPUs. The success of these efforts, the expected marginal gains in general-purpose CPU performance, and the understanding that special purpose 6 accelerators are the best way to obtain significant performance gains within a fixed financial and power budget convinced HPC facilities to invest in accelerator-based systems. However, CUDA is not a universal method to target accelerators produced by different vendors, e.g. 
NVIDIA, AMD, Intel, and Xilinx, although there are efforts by AMD to use the HIP framework to convert CUDA to a more portable style of C++ [21]. To address this issue, OpenMP 4.5+ and OpenACC have emerged as portable directive-based programming models targeting hardware accelerators. The higher level of abstraction adopted in these directive-based programming models increases application portability and application developer productivity. These directive-based methods have lowered the barrier of entry for application developers to target accelerators and are anticipated to be a key enabler for HPC users to efficiently use forthcoming supercomputers. However, there needs to be wider testing of OpenMP and OpenACC in scientific applications to address any shortcomings in the language specifications, improve the robustness and performance of vendor compilers, and continue to refine our understanding of best practices for migrating applications to accelerators. At the same time, the most efficient way to use accelerators is often through optimized math and scientific libraries, e.g., cuBLAS and TensorFlow. Therefore, it is very likely that non-trivial applications will increasingly need to combine optimized library calls with directives to obtain the highest performance on a given application.

In the second part of this thesis, we demonstrate the porting and optimization of an iterative block eigensolver for GPUs using a combination of directives and optimized library calls to evaluate the efficacy of directive-based programming models. We choose the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [14, 22] algorithm as the representative block eigensolver, as it is a popular sparse eigensolver with computational characteristics that fit well to GPU architectures. The LOBPCG algorithm requires a fairly complex implementation; as such, it serves as a good evaluation case for our purposes. An important issue in large scientific computing and data analysis workloads is that an application's data usage often exceeds the available device memory space. As such, our evaluation extends into such scenarios, and we present remedies for the significant performance degradation observed due to large data transfers between host and device memories.

1.5 DeepSparseGPU Framework

Managing data migration between the CPU and the GPU is one of the significant challenges in GPU programming. Kernels running on the GPU can only access data that is available in GPU memory. Thus, programmers have to explicitly control the data movement between CPU and GPU for best performance. Unified Virtual Memory (UVM) allows treating the address space as a single memory address space accessible from both CPUs and GPUs, with memory pages migrated on demand. However, even though enabling UVM is simple, it provides poor performance compared with custom methods that manually offload and prefetch data. Furthermore, GPUs in modern supercomputers have limited device memory, and the scale of data needed in important scientific computing and data analytics applications often exceeds the available device memory space. After seeing promising results from our work on evaluating directive-based programming models on GPU-accelerated systems, we began porting and optimizing our task-parallel linear algebra framework, DeepSparse, to GPUs using OpenMP target offload. We name this GPU version DeepSparseGPU.
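To make the distinction between managed memory and explicit data management concrete, the following minimal sketch shows a simple kernel offloaded with OpenMP target directives and explicit map clauses. The kernel and names are illustrative only and do not come from DeepSparseGPU; with explicit mapping, the programmer decides exactly which arrays are staged on the device and when results are copied back, whereas a managed-memory version would simply let the runtime migrate pages on demand.

/* Illustrative sketch: an AXPY-style kernel offloaded via OpenMP target.
 * The map() clauses stage x on the device and copy y both ways, making the
 * host-device traffic explicit instead of relying on on-demand page
 * migration through managed (unified) memory. */
void axpy_offload(double a, const double *x, double *y, int n)
{
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
}

In DeepSparseGPU, the analogous decisions about which data tiles to transfer, and when, are made automatically by the memory manager described in Chapter 4.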
As we aim to port the DeepSparse framework to GPUs, we make several changes to the shared memory design of DeepSparse to support problem sizes that exceed the available GPU memory. In DeepSparseGPU, we aim to minimize data movement between the CPU and the GPU by exploiting data dependencies between the computational tasks. Towards this end, we adopt the same task-based tiling approach in DeepSparseGPU, as tiling is one of the most commonly used techniques for optimizing applications for data locality. In addition, decomposing data into tiles enables the programmer to run applications on GPUs even when the application data does not entirely fit into the device memory. We describe a memory manager that automatically keeps track of the data that is available on the CPU and the GPU and automatically migrates data between them. We evaluate the performance of DeepSparseGPU, again using the LOBPCG block eigensolver.

Figure 1.2: Applications of sparse linear algebra.

1.6 Related Work

Figure 1.2 shows an overview of the applications of sparse linear algebra in data analytics and scientific computing. Sparse matrix-vector multiplication (SpMV) and sparse matrix dense matrix multiplication (SpMM) are two essential operations in eigenvalue decomposition as well as in solving sparse linear systems and similar problems, and they are well studied in the literature. Sparse matrix-sparse matrix multiplication (SpGEMM) and sparse matrix sparse vector multiplication (SpMSV) are, on the other hand, less studied in the literature, but they find important applications in machine learning methods that are becoming popular. In its simplest form, an SpMV operation computes y = Ax, where A is a sparse matrix, and x and y are dense vectors. The usage of SpMV is typically repetitive, and it constitutes the main computational kernel for several applications such as the solution of partial differential equations [23] and the Schrödinger equation [24] in scientific computing, spectral clustering [25] and dimensionality reduction [26] in machine learning, and the PageRank algorithm [27] in graph analytics. The Roofline model by Williams et al. [28] suggests that memory bandwidth ultimately bounds the performance of the SpMV kernel. Consequently, performance optimizations to increase cache utilization and reduce data access latencies for SpMV have drawn significant interest [29, 30, 31, 32, 33, 34, 35, 36, 37, 28], which is only an incomplete list of related work on this topic. The sparse matrix dense matrix multiplication (SpMM) operation has long been the workhorse of block iterative solvers, e.g., the block Krylov subspace methods [23, 38] and the block Jacobi-Davidson method. SpMM has much higher arithmetic intensity than SpMV and can efficiently leverage wide vector execution units. As a result, SpMM-based solvers have recently drawn significant interest in scientific computing [11, 39, 40, 41]. SpMM also finds applications naturally in machine learning, where several features (or eigenvectors) of sparse matrices are needed [26, 25]. Although SpMM has a significantly higher arithmetic intensity than SpMV, a recently proposed extension of the Roofline model suggests that cache bandwidth, rather than memory bandwidth, can still be a critical performance-limiting factor for SpMM [11]. Multiplication of sparse matrices (SpGEMM) and sparse matrix times sparse vector (SpMSV) operations also find applications in important problems.
For example, SpGEMM is the main kernel in the algebraic multigrid method [42] and the Markov Clustering algorithm, while SpMSV is the main building block for breadth-first search, bipartite graph matching, and maximal independent set algorithms [43]. Overall, related work in the literature generally examines each of these sparse matrix computation kernels in isolation. Our holistic approach, which tries to improve the overall performance of a sparse solver through the utilization of a global task graph executed in an asynchronously parallel fashion, represents a distinguishing aspect of my work.

CHAPTER 2 DEEPSPARSE: A TASK-PARALLEL FRAMEWORK FOR SPARSE SOLVERS ON SHARED MEMORY ARCHITECTURES

This work has been published in IEEE HiPC 2019 [15]. This work is a collaborative effort between Fazlay Rabbi and Md Afibuzzaman. Both contributed equally to this work.

Dense linear algebra libraries (i.e., BLAS [18], LAPACK [17] and ScaLAPACK [44]) are well established, and several vendor-optimized implementations exist for dense linear algebra (i.e., Intel MKL, Cray LibSci). However, unlike its dense matrix analogue, the state of the art for sparse matrix computations is lagging far behind. Standardization for sparse linear algebra started with the NIST Sparse Basic Linear Algebra Subprograms (SparseBLAS), but SparseBLAS defines only a small subset of the sparse matrix operations [45] that arise in finite element simulations. The main reason behind the slow progress in the standardization of sparse linear algebra and the development of libraries for it is that sparse matrices come in very different forms and with very different properties depending on the application area. The widening gap between memory system and processor performance, irregular data access patterns, and the low arithmetic intensities of sparse matrix computations have effectively made them "memory-bound" computations. Furthermore, the downward trend in memory space and bandwidth per core in high performance computing (HPC) systems [46] has paved the way for a deepening memory hierarchy. Thus, there is a dire need for new approaches both at the algorithmic and runtime system levels for sparse matrix computations. While sparse solver libraries such as Trilinos [47], PETSc [48] and Hypre [49] provide a wide variety of sparse solver implementations, they do not explicitly address the performance bottlenecks associated with deep memory hierarchies or emerging NVRAM technologies. In this context, the difficulties faced by domain scientists in building high performance data analytics and scientific computing applications provide our main motivation. Our goal is to develop a computational infrastructure to enable large-scale data analytics and scientific computing on massively parallel architectures with deep memory hierarchies.

We propose a novel, fully integrated task-parallel sparse linear algebra framework, DeepSparse, which aims to accelerate sparse solver codes on modern architectures with deep memory hierarchies. Our goal is to accelerate sparse matrix computations on shared memory architectures by minimizing data movement between different memory layers. In order to do so, we express sparse solvers as a directed acyclic graph (DAG) by exploiting the data dependencies between computational kernels.
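The following toy sketch, which is not DeepSparse code and uses deliberately simplified kernels, illustrates how such data dependencies translate into DAG edges under OpenMP tasking: the dot-product task declares the vector produced by the SpMV task as an input, so the runtime orders the two tasks while leaving unrelated tasks free to run concurrently.

/* Toy sketch (not DeepSparse code): two simplified kernels chained by a data
 * dependency. The depend clauses tell the OpenMP runtime that the dot
 * product consumes the vector produced by the CSR SpMV, creating one edge
 * of a task dependency graph; tasks without such edges may run concurrently. */
void two_kernel_dag(const double *Aval, const int *Acol, const int *Arow,
                    const double *x, double *y, double *alpha, int n)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: y[0:n])
        {
            /* producer: y = A * x (CSR SpMV) */
            for (int i = 0; i < n; i++) {
                double s = 0.0;
                for (int k = Arow[i]; k < Arow[i + 1]; k++)
                    s += Aval[k] * x[Acol[k]];
                y[i] = s;
            }
        }

        #pragma omp task depend(in: y[0:n]) depend(out: alpha[0:1])
        {
            /* consumer: alpha = x . y, scheduled only after y is ready */
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += x[i] * y[i];
            *alpha = s;
        }
    }
}

DeepSparse generates such depend clauses automatically and at the granularity of matrix and vector blocks rather than whole arrays, as described in the remainder of this chapter.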
In DeepSparse, we adopt a holistic approach that aims to optimize all computational steps in a sparse solver rather than narrowing the problem to a single kernel, e.g., sparse matrix vector multiplication (SpMV) or sparse matrix multiple vector multiplication (SpMM). As mentioned, there exist several well-optimized libraries for dense linear algebra. One of the most popular of such libraries is LAPACK [17], which is a linear algebra library for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The workhorse for LAPACK is the Basic Linear Algebra Subprograms (BLAS). PLASMA aims to overcome the shortcomings of the LAPACK library in efficiently solving dense linear algebra problems on multicore processors [50, 51]. MAGMA is a dense linear algebra library (like LAPACK) for heterogeneous systems, i.e., systems with GPUs [52, 53, 54], designed to fully exploit the computational power that each of the heterogeneous components offers. MAGMA provides functionality very similar to LAPACK and makes it easy for users to port their code from LAPACK to MAGMA. However, none of these libraries support sparse matrices or the solution of sparse linear systems, eigenvalue problems, or singular value decompositions, highlighting the lack of general-purpose software for sparse matrix computations. In regards to task-based parallelization of HPC workloads, Barrera et al. [55] use computational dependencies and a dynamic graph partitioning method to minimize NUMA effects on shared memory architectures. StarPU [56] is a runtime system that facilitates the execution of parallel tasks on heterogeneous computing platforms and incorporates multiple scheduling policies. However, the application developer has to create the computational tasks themselves in order to use StarPU.

As indicated above, while the concept of task parallelism based on data flow dependencies is not new, exploration of the benefits of this idea in the context of sparse solvers constitutes a novel aspect of the present work. Additionally, to the best of our knowledge, related work on task parallelism has not explored its impact on cache utilization compared to the BSP model as we do in this work. To summarize, our main contributions are:

• Introduction of a novel task-parallel sparse solver framework with complete details in regards to the front-end API and a description of how a solver code expressed through this API is automatically converted into a task graph,

• An extensive evaluation of the performance of DeepSparse on two applications using a variety of sparse matrices from different domains,

• Demonstration that the proposed task-parallel framework can significantly reduce misses across all cache levels (up to 16×) as well as improve the execution time (by up to 3.9×) compared to highly optimized library implementations based on the BSP model.

2.1 DeepSparse Overview

Figure 2.1 illustrates the architectural overview of DeepSparse. As shown, DeepSparse consists of two major components: i) the Primitive Conversion Unit (PCU), which provides a front-end for domain scientists to express their application at a high level; and ii) the Task Executor, which creates the actual tasks based on the abstract task graph generated by the PCU and hands them over to the OpenMP runtime for execution.
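A simplified sketch of this division of labor is shown below. The descriptor fields, kernel functions, and operation codes are illustrative only and do not correspond to the actual DeepSparse implementation (the real TaskInfo structure and Task Executor are described in the following subsections); the sketch only shows the general pattern of walking an array of task descriptors produced by a front-end, dispatching on each descriptor's operation code, and spawning one OpenMP task per descriptor.

#include <stdio.h>

/* Illustrative executor pattern (hypothetical names, not DeepSparse code):
 * the front-end emits an array of task descriptors; the executor turns each
 * descriptor into an OpenMP task by dispatching on its operation code.
 * Dependencies between tasks would be attached via depend clauses keyed on
 * the operand tiles. */
typedef struct {
    int op;    /* which kernel this vertex represents        */
    int tile;  /* which partition of the operands it touches */
} TaskNode;

static void spmm_tile(int tile)  { printf("SpMM task on tile %d\n", tile); }
static void inner_tile(int tile) { printf("inner product task on tile %d\n", tile); }

void run_task_graph(const TaskNode *nodes, int num_nodes)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int t = 0; t < num_nodes; t++) {
            int tile = nodes[t].tile;
            switch (nodes[t].op) {
            case 0: {
                #pragma omp task firstprivate(tile)
                spmm_tile(tile);
                break;
            }
            case 1: {
                #pragma omp task firstprivate(tile)
                inner_tile(tile);
                break;
            }
            }
        }
        #pragma omp taskwait   /* wait for all spawned tasks */
    }
}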
As sparse matrix related computations represent the most expensive calculations in many large-scale scientific computing applications, we define tasks in our framework based on the decomposition of the input sparse matrices. For most sparse matrix operations, both 1D (block row) and 2D (sparse block) partitioning are suitable options. A 2D partitioning is ideal for exposing high degrees of parallelism and reducing data movement across memory layers [57]; as such, 2D partitioning is the default scheme in DeepSparse. For a 2D decomposition, DeepSparse defines tasks based on the Compressed Sparse Block (CSB) [30] representation of the sparse matrix, which is analogous to the tiles that are commonly used in task-parallel implementations of dense linear algebra libraries. However, CSB utilizes much larger block dimensions (on the order of thousands) due to sparsity [11, 58, 30]. Consequently, DeepSparse starts out by decomposing the sparse matrix (or matrices) of a given code into CSB blocks (which eventually correspond to the tasks during execution, with each kernel producing a large number of tasks). Note that the decomposition of a sparse matrix dictates the partitioning of the input and output vectors (or vector blocks) in the computation as well, effectively inducing a decomposition of all data structures used in the solver code.

Figure 2.1: Schematic overview of DeepSparse.

DeepSparse creates and maintains fine-grained dependency information across the different kernels of a given solver code based on the result of the above decomposition scheme. As such, instead of simply converting each kernel into its own task graph representation and concatenating them, DeepSparse generates a global task graph, allowing for more optimal data access and task scheduling decisions based on global information. Since the global task graph depends on the specific algorithm and input sparse matrix, DeepSparse explicitly generates the corresponding task dependency graph. While this incurs some computational and memory overheads, such overheads are negligible. The main reason the computational overhead is negligible is that sparse solvers are typically iterative, and the same task dependency graph is used for several iterations. The reason the memory overhead is negligible is that each vertex in the task graph corresponds to a large set of data in the original problem. After this brief overview, we explain the technical details of DeepSparse.

2.1.1 Primitive Conversion Unit (PCU)

The PCU is composed of two parts: i) the Task Identifier, and ii) the Local Task Dependency Graph (TDG) Generator.

2.1.1.1 Task Identifier (TI)

The application programming interface (API) for DeepSparse is a combination of the recently proposed GraphBLAS interface [16] (for sparse matrix related operations) and BLAS/LAPACK [17, 18] (for vector and occasional dense matrix related computations).
This allows application developers to express their algorithms at a high level without having to worry about architectural details (e.g., memory hierarchy) or parallelization considerations (e.g., determining the individual tasks and their scheduling). The Task Identifier parses a code expressed using the DeepSparse API to identify the specific BLAS/LAPACK and GraphBLAS calls, as well as the input/output of each call. It then passes this information to the local task dependency graph generator. The TI builds two major data structures:

(i) ParserMap: ParserMap is an unordered map that holds the parsed data information in the form of (Key, Value) pairs. As the TI starts reading and processing the DeepSparse code, it builds a ParserMap from the function calls. To uniquely identify each call in the code, the Key class is made up of three components: opCode, which is specific to each type of operation used in the code; id, which keeps track of the order of the same function call in the code (e.g., if there are two matrix addition operations, then the first call will have id = 1 and the second one will have id = 2); and timestamp, which stores the line number of the call in the code and is used to detect the input dependencies of this call on upstream calls. For each key, the corresponding Value object stores the input and output variable information. It also stores the dimensions of the matrices involved in the function call.

(ii) Keyword & idTracker: Keyword is a vector of strings that holds the unique function names (i.e., cblas_dgemm, dsygv, mkl_dcsrmm, etc.) that have been found in the given code, and idTracker keeps track of the number of times that function (Keyword) has been called so far. The Keyword and idTracker vectors are synchronized with each other. When the TI finds a function call, it searches for the function name in the Keyword vector. If found, the corresponding idTracker index is incremented. Otherwise, the Keyword vector is expanded with a corresponding initial idTracker value of 1.

2.1.1.2 Task Dependency Graph Generator (TDGG)

The output of the Task Identifier (TI) is a dependency graph at a very coarse level, i.e., at the function call level. For efficient parallel execution and tight control over data movement, tasks must be generated at a much finer granularity. This is accomplished by the Task Dependency Graph Generator (TDGG), which goes over the input/output data information generated by the TI for each function call and starts decomposing these data structures. As noted above, the decomposition into finer granularity tasks starts with the first function call involving the sparse matrix (or matrices) in the solver code, which is typically an SpMV, SpMM or SpGEMM operation. After the tasks for this function call are identified by examining the non-zero pattern of the sparse matrix, tasks for prior and subsequent function calls are generated accordingly. As part of the task dependency graph generation procedure, the TDGG also generates the dependencies between individual fine-granularity tasks by examining the function call dependencies determined by the TI. Note that the dependencies generated by the TDGG may (and often do) span function boundaries, and this is an important property of DeepSparse that separates it from a bulk synchronous parallel (BSP) program, which effectively imposes barriers at the end of each function call.
struct TaskInfo {
    int opCode;            // type of operation
    int numParamsCount;
    int *numParamsList;    // tile id, dimensions, etc.
    int strParamsCount;
    char **strParamsList;  // i.e., buffer name
    int taskID;            // analogous to id of the Key class
};
Listing 2.1: TaskInfo structure.

The resulting task dependency graph generated by the TDGG is essentially a directed acyclic graph (DAG) representing the data flow in the solver code, where vertices denote computational tasks, incoming edges represent the input data, and outgoing edges represent the output data of each task. The TDGG also labels the vertices in the task dependency graph with the estimated computational cost of each task, and the directed edges with the name and size of the corresponding data, respectively. During execution, such information can be used for load balancing among threads and/or ensuring that active tasks fit in the available cache space. In this initial version of DeepSparse, though, such information is not yet used because we rely on OpenMP's default task execution algorithms, as explained next.

2.1.2 Task Executor

To represent a vertex in the task graph, the TDGG uses an instance of the TaskInfo structure (Listing 2.1), which provides all the necessary information for the Task Executor to properly spawn the corresponding OpenMP task. The Task Executor receives an array of TaskInfo structures (Listing 2.1) from the PCU that represents the full computational dependency graph, picks each node from this array one by one, and extracts the corresponding task information. DeepSparse implements OpenMP task based functions for all computational kernels (represented by opCode) it supports. Based on the opCode, the partition id of the input/output data structures, and other required parameters (given by numParamsList and strParamsList) found in the TaskInfo structure at hand, the Task Executor calls the necessary computational function found in the DeepSparse library, effectively spawning an OpenMP task.

Algorithm 1: SpMM Kernel
Input: X[i,j] (β × β sparse CSB block), Y[j] (β × b dense vector block)
Output: Z[i] (β × b dense vector block)
1: #pragma omp task depend(in: X[i,j], Y[j], Z[i]) depend(out: Z[i])
2: {
3:   foreach val ∈ X[i,j].nnz do
4:     r = X[i,j].row_loc[val]
5:     c = X[i,j].col_loc[val]
6:     for k = 0 to b do
7:       Z[r × b + k] = Z[r × b + k] + val × Y[c × b + k]
8:     end
9:   end
10: }

In DeepSparse, the master thread spawns all OpenMP tasks one after the other and relies on OpenMP's default task scheduling algorithms for the execution of these tasks. The OpenMP runtime environment then determines which tasks are ready to be executed based on the provided task dependency information. When ready, those tasks are executed by any available member of the current thread pool (including the master thread). Note that OpenMP supports task-parallel execution with named dependencies, and better yet, these dependencies can be specified as variables. This feature is fundamental for DeepSparse to be able to generate TDGs based on different problem sizes and matrix sparsity patterns. This is exemplified in Algorithm 1, where the SpMM task for the compressed sparse block at row i and column j is simply invoked by providing the X[i,j] sparse matrix block along with the Y[j] input vector block and the Z[i] output vector block in the depend clause.

An important issue in a task-parallel program is the data race conditions involving the output data that is being generated. Fortunately, the task-parallel execution specification of OpenMP requires only one thread to be active among the threads writing into the same output data location.
While this ensures a race-condition-free execution, it might hinder performance due to a lack of parallelism. Therefore, for data flowing into tasks with a high incoming degree, DeepSparse allocates temporary output data buffers based on the number of threads and the available memory space. Note that this also requires the creation of an additional task to reduce the data in the temporary buffer space before it is fed into the originally intended target task.

2.1.3 Illustrative Example

We provide an example to demonstrate the operation of DeepSparse using the simple code snippet provided in Listing 2.2.

cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, 1.0, A, k, B, n, 0, C, n);
SpMM(X, C, D, m, n);
cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans, n, n, m, 1.0, D, n, C, n, 0, E, n);
Listing 2.2: An example pseudocode.

As the TI parses the sample solver code, it discovers that the first cblas_dgemm in the solver corresponds to a linear combination operation (see Figure 2.2), the second line is a sparse matrix vector block multiplication (SpMM, see Figure 2.3), and the second cblas_dgemm at the end is an inner product of two vector blocks (see Figure 2.4). These function calls, their parameters, as well as their dependencies are captured in the ParserMap, Keyword, and idTracker data structures as shown in Table 2.1.

Table 2.1: Major data structures after parsing the third line.
Data Structure | Content
ParserMap      | <{XY, 1, 1}, {…}>, <{SpMM, 1, 2}, {…}>, <{XTY, 1, 3}, {…}>
Keyword        | …
idTracker      | <1, 1, 1>

The Task Dependency Graph (TDG) generator receives the necessary information from the TI and determines the tasks corresponding to the partitionings of the operand data structures of each operation, as well as their origins (whether the necessary data are coming from another task or from a variable). The TDGG then builds the DAG of each computational kernel and appends it to the global DAG with the proper edge connectivity (i.e., dependencies). While generating the DAG, the TDGG also encodes the values of the TaskInfo structure instance that represents each vertex into the vertex name.

Figure 2.2: Overview of input/output matrix partitioning for the task-based matrix multiplication kernel.
Figure 2.3: Overview of matrix partitioning for the task-based SpMM kernel.
Figure 2.4: Overview of matrix partitioning for the task-based inner product kernel.
Figure 2.5: Task graph for the pseudocode in Listing 2.2.

Figure 2.5 shows the task dependency graph of the solver code in Listing 2.2 (assuming m = 100, k = 8, n = 8, and a CSB tile/block size of 50, so each input matrix is partitioned into 2 chunks). The Task Executor receives an array of TaskInfo structures that contains the node information shown in Figure 2.5 and goes over each of the tasks in the array. At first, it reads the nodes (<5,1,0,0,1>, <5,1,1,0,1>) of the first operation and spawns two matrix multiplication (XY) tasks with the proper input/output matrices.
The Task Executor then reads the task information for all SpMM tasks {<2,2,0,0,0,1>, <2,2,0,1,0,1>, <2,2,1,0,0,1>, <2,2,1,1,0,1>} and spawns four SpMM tasks with the proper input/output matrix blocks. Finally, the Task Executor reads <3,2,0,0,0,1>, <3,2,1,1,0,1>, and <4,1,0,1,EBUF,-1> and spawns two inner product (X^T Y) tasks and one partial output buffer reduction task for the inner product operation.

2.1.4 Limitations of the Task Executor

Despite the advantages of an asynchronous task-parallel execution, the Task Executor has the following limitations:

• Synchronization at the end of an iteration: Most computations involving sparse matrices are based on iterative techniques. As such, the TDG generated for a single iteration can be reused over several steps (until the algorithm reaches convergence). However, it is necessary to introduce a #pragma omp taskwait at the end of each solver iteration and force all tasks of the current iteration to be completed to ensure computational consistency among different iterations of the solver. For relatively simple solvers, the taskwait clause adds some overhead to the total execution time due to threads idling at taskwaits.

• Limited number of temporary buffers: While OpenMP allows the use of program variables in the dependency clauses, it does not allow dynamically changing the variable lists of the depend clauses. As such, the number of buffers used by the partial output reduction tasks needs to be fixed to overcome this issue. Depending on the available memory, there are at most nbuf partial output buffers for a reduction operation. If nbuf is less than the total number of threads, then there might be frequent read-after-write (RAW) contentions on the partial output buffers. This could have been avoided if the list of variables in the depend clause could be changed dynamically.

2.2 Benchmark Applications

We demonstrate the performance of the DeepSparse framework on two important eigensolvers widely used in large-scale scientific computing applications: the Lanczos eigensolver [13] and the Locally Optimal Block Preconditioned Conjugate Gradient algorithm (LOBPCG) [14].

2.2.1 Lanczos

The Lanczos algorithm finds eigenvalues of a symmetric matrix by building a matrix Qk = [q1, . . . , qk] of orthogonal Lanczos vectors [59]. The eigenvalues of the sparse matrix A are then approximated by the Ritz values. As shown in Algorithm 2, it is a relatively simple algorithm consisting of a sparse matrix-vector multiplication (SpMV) along with some vector inner products for orthonormalization.

Algorithm 2: Lanczos Algorithm in Exact Arithmetic
1  q1 = b/||b||2, β0 = 0, q0 = 0
2  for j = 1 to k do
3    z = A qj
4    αj = qj^T z
5    z = z − αj qj − βj−1 qj−1
6    βj = ||z||2
7    if βj = 0, quit
8    qj+1 = z/βj
9    Compute eigenvalues, eigenvectors, and error bounds of Tk
10 end

2.2.2 LOBPCG

LOBPCG is a commonly used block eigensolver based on the SpMM kernel [14]; see Algorithm 3 for a pseudocode. Compared to Lanczos, LOBPCG comprises high arithmetic intensity operations (SpMM and Level-3 BLAS). In terms of memory, while the Ĥ matrix takes up considerable space, when a large number of eigenpairs are needed (e.g.,
dimensionality reduction, spectral clustering, or quantum many-body problems), the memory needed for the block vector Ψ can be comparable to or even greater than that of Ĥ. In addition, other block vectors (the residual R, the preconditioned residual W, the previous direction P), block vectors from the previous iteration, and the preconditioning matrix T must be stored and accessed at each iteration. Figure 2.6 shows a sample task graph for LOBPCG generated by TDGG using a very small matrix. Clearly, orchestrating the data movement in a deep memory hierarchy to obtain an efficient LOBPCG implementation is non-trivial.

Algorithm 3: LOBPCG Algorithm (for simplicity, without a preconditioner)
Input: Ĥ, matrix of dimensions N × N
Input: Ψ0, a block of randomly initialized vectors of dimensions N × m
Output: Ψ and E such that ||ĤΨ − ΨE||F is small, and Ψ^T Ψ = Im
1  Orthonormalize the columns of Ψ0
2  P0 ← 0
3  for i = 0, 1, . . . , until convergence do
4    Ei = Ψi^T ĤΨi
5    Ri ← ĤΨi − Ψi Ei
6    Apply the Rayleigh–Ritz procedure on span{Ψi, Ri, Pi}
7    Ψi+1 ← argmin_{S ∈ span{Ψi, Ri, Pi}, S^T S = Im} trace(S^T ĤS)
8    Pi+1 ← Ψi+1 − Ψi
9    Check convergence
10 end
11 Ψ ← Ψi+1

Figure 2.6: A sample task graph for the LOBPCG algorithm using a small sparse matrix.

2.3 Performance Evaluation

2.3.1 Experimental Setup

We conducted all our experiments on Cori Phase I, a Cray XC40 supercomputer at NERSC, mainly using the GNU compiler. Each Cori Phase I node has two sockets, each with a 16-core Intel Xeon Processor E5-2698 v3 (Haswell) CPU. Each core has a 64 KB private L1 cache (32 KB instruction and 32 KB data cache) and a 256 KB private L2 cache. Each CPU has a 40 MB shared L3 cache (LLC). We use thread affinity to bind threads to cores and use a maximum of 16 threads to avoid NUMA issues.

We test DeepSparse using five matrices with different sizes, sparsity patterns, and domains (see Table 2.2). The first four matrices are from the SuiteSparse Matrix Collection and the Nm7 matrix is from the nuclear no-core shell model code MFDn.

Table 2.2: Matrices used in our evaluation.
Matrix | Rows | Columns | Nonzeros
inline1 | 503,712 | 503,712 | 36,816,170
dielFilterV3real | 1,102,824 | 1,102,824 | 89,306,020
HV15R | 2,017,169 | 2,017,169 | 283,073,458
Queen4147 | 4,147,110 | 4,147,110 | 316,548,962
Nm7 | 4,985,422 | 4,985,422 | 647,663,919

We compare the performance of DeepSparse with two other library-based implementations: i) libcsr is an implementation of the benchmark solvers using thread-parallel Intel MKL library calls (including SpMV/SpMM) with CSR storage of the sparse matrix; ii) libcsb is an implementation again using Intel MKL calls, but with the matrix stored in the CSB format. Performance data for LOBPCG is averaged over 10 iterations, while the number of iterations is set to 50 for the Lanczos runs. Our performance comparison criteria are L1, L2, and LLC misses and execution times for both solvers. All cache miss data was obtained using the Intel VTune software.

The performance of the DeepSparse and libcsb implementations depends on the CSB block sizes used. Choosing a small block size creates a large number of small tasks. While this is preferable on a highly parallel architecture, the large number of tasks may lead to significant task execution overheads, in terms of both cache misses and execution times. Increasing the block size reduces such overheads, but this may then lead to increased thread idle times and load imbalances. Therefore, the CSB block size is a parameter to be optimized based on the specific problem.
The block sizes we experimented with are 1K, 2K, 4K, 8K, and 16K.

2.3.2 LOBPCG Evaluation

In Figure 2.7, we show the number of cache misses at all three levels (L1, L2, and L3) and the execution time comparison between all three versions of the LOBPCG algorithm compiled using the GNU compiler. LOBPCG is a complex algorithm with a number of different kernel types; its task graph results in millions of tasks for a single iteration. As shown in Figure 2.7, except for the Nm7 matrix, the libcsb and libcsr versions incur a similar number of cache misses; for Nm7, libcsb achieves significant cache miss reductions over the libcsr version. On the other hand, DeepSparse achieves 2.5× - 10.7× fewer L1 misses, 6.5× - 16.2× fewer L2 misses, and 2× - 7× fewer L3 cache misses compared to the libcsr version. As the last row of Figure 2.7 shows, even with the implicit task graph creation and execution overheads of DeepSparse, the significant reduction in cache misses leads to a 1.2× - 3.9× speedup over the execution times of libcsr. Given the highly complex DAG of LOBPCG and the abundant data re-use opportunities available, we attribute these improvements to the pipelined execution of tasks which belong to different computational kernels (see Figure 2.8) but use the same data structures. We note that the Task Executor in DeepSparse solely relies on the default scheduling algorithm used in the OpenMP runtime environment. By making use of the availability of the entire global task graph and the labeling information on vertices/edges, it might be possible to improve the performance of DeepSparse even further.

Figure 2.7: Comparison of L1, L2, LLC misses and execution times between DeepSparse, libcsb and libcsr for the LOBPCG solver.
Figure 2.8: LOBPCG single iteration execution flow graph of dielFilterV3real.

2.3.3 Lanczos Evaluation

In Figure 2.9, cache misses and execution time comparisons for the different Lanczos versions are shown. The Lanczos algorithm is much simpler than LOBPCG; it has far fewer types and numbers of tasks (basically, one SpMV and one inner product kernel at each iteration). As such, there are not many opportunities for data re-use. In fact, we observe that DeepSparse sometimes leads to increases in cache misses for smaller matrices. However, for the Nm7 and HV15R matrices, which are the largest matrices in our benchmark set, we observe an improvement in cache misses, achieving up to 2.4× fewer L1 cache misses, 3.1× fewer L2 misses, and 4.5× fewer L3 misses than libcsr. But most importantly, DeepSparse achieves up to a 1.8× improvement in terms of execution time. We attribute the execution time improvement observed across the board to the increased degrees of parallelism exposed by the global task graph of DeepSparse, which is in fact highly critical for smaller matrices.

Figure 2.9: Comparison of L1, L2, LLC misses and execution times between DeepSparse, libcsb and libcsr for the Lanczos solver.

2.3.4 Compiler Comparison

For all of our experiments, we use OpenMP as our backend. To explore the impact of different task scheduling approaches in different OpenMP implementations, we have experimented with three compilers: Intel, GNU, and Clang/LLVM. In Figure 2.10, we show the comparison in execution time among the different compilers for the three implementations. We see that the execution time for the Clang/LLVM compiler is significantly higher compared to the GNU and Intel compilers for all matrices. However, cache miss counts remain largely unchanged when one moves to a different compiler. We show the cache miss comparison between the three compilers in Figure 2.11 for one matrix, HV15R. All other matrices follow a cache miss pattern similar to HV15R. Here, we can clearly see that regardless of the compiler, DeepSparse achieves fewer cache misses than the libcsb and libcsr implementations. We can see that Clang/LLVM shows fewer cache misses for DeepSparse as well, but it nevertheless has a poor running time. We believe that this is because Clang/LLVM is not able to schedule tasks as efficiently as GNU and Intel. Compared to the Intel compiler, the GNU compiler sometimes shows more L1 and L2 misses, but the execution time is higher with Intel. This may be due to differences in the scheduling strategies and the implementation of task scheduling points in the compilers. Overall, GNU achieves the best running times among the three compilers, and the Intel compiler does not do well with the library-based solver implementations.

Figure 2.10: Comparison of execution time for different compilers between DeepSparse, libcsb and libcsr for the Lanczos algorithm. (Blue/Left: GNU, Red/Middle: Intel, Green/Right: Clang compiler.)
Figure 2.11: Cache miss comparison between compilers for HV15R.
2.4 Summary of the Chapter

This chapter introduced a novel task-parallel sparse solver framework that targets all computational steps in sparse solvers. We demonstrated that our DeepSparse framework achieves significantly fewer cache misses across different cache layers and improves the execution time over optimized library versions. Inspired by the performance of DeepSparse on shared memory architectures, our major goal is to port and optimize DeepSparse on heterogeneous architectures. In doing so, we would like to use directive-based GPU programming models (i.e., OpenMP or OpenACC) to make it a portable and performant framework. Evaluating the performance of directive-based programming models in offloading the main kernels involved (such as SpMV, SpMM, and vector operations) constitutes a preliminary step towards our major goal, which we discuss in detail in the next chapter.

CHAPTER 3
EVALUATION OF DIRECTIVE-BASED GPU PROGRAMMING MODEL ON A BLOCK ITERATIVE SOLVER

Most of the TOP500 and Green500 supercomputers employ CPUs and accelerators like GPUs [6, 20] from different vendors. There is a pressing need to migrate and optimize applications for execution on GPUs and other accelerators in order to utilize the full capabilities of GPU-accelerated supercomputers. As most of these supercomputers are equipped with GPUs from different vendors, we need a common programming model that will allow scientists and application developers to develop their code once and run it everywhere. In recent years, OpenACC and OpenMP have emerged as portable, base-language independent, and increasingly robust and performant ways to target accelerators. These directive-based methods have lowered the barrier of entry for application developers to target accelerators and are anticipated to be a key enabler for HPC users to efficiently use forthcoming supercomputers. However, there needs to be wider testing of OpenMP and OpenACC in scientific applications to address any shortcomings in the language specifications, improve the robustness and performance of vendor compilers, and continue to refine our understanding of best practices to migrate applications to accelerators.

OpenMP and OpenACC enable a programmer to run application kernels on a GPU. Multiple compilers support these directives and can generate GPU code. The quality of GPU support in OpenMP and OpenACC compilers is evaluated in [60] on a suite of four mini-applications. There, the authors find issues with all compilers as well as challenges in creating a single portable code which compiles and executes efficiently for all compilers. The interoperability of CUDA and OpenACC is evaluated in [61]. The author successfully combines hand-written CUDA with OpenACC when using the PGI compiler. Our work evaluates the performance of OpenMP and OpenACC implementations of a block eigensolver, as well as the interoperability of these runtime systems with optimized CUDA libraries, for three different compilers.

Unified Memory (UM) is a programming feature which provides a single memory address space accessible by all processors in a compute node. It greatly simplifies GPU programming because the same single pointer to data can be used on both the CPU and the GPU. The NVIDIA Volta V100 GPU provides a page migration engine to move memory pages between CPU and GPU when a page is not in the memory of the processor accessing the data. NVIDIA evaluated UM performance using the PGI OpenACC compiler in [62].
The authors created UM versions of the OpenACC applications in the SPEC ACCEL 1.2 benchmark suite. They ran the applications on the Piz Daint supercomputer and found that the UM versions ran at 95% of the performance of the original explicit data management versions. In [63], NVIDIA showed that the Gyrokinetic Toroidal Code (GTC) has almost identical performance on an x86+V100 system whether OpenACC data directives are used or not. Our work also compares UM against explicit data management, but additionally considers problems whose memory requirements significantly exceed the device memory capacity. The performance of oversubscribing UM is evaluated in [64]. The authors find that UM can be up to 2x slower than explicit data management in several applications on an x86+V100 system. Our work considers performance on both x86 and Power GPU-accelerated systems.

In this work, we port and optimize a block eigensolver for GPUs using a combination of directives and optimized library calls. The main appeal of block eigensolvers (e.g., the LOBPCG algorithm) as opposed to the more conventional SpMV-based solvers is their high arithmetic intensity, which is especially important to reap the full benefits of GPUs. The main computational kernels involved in block iterative solvers are the multiplication of a sparse matrix with multiple vectors and Level-3 BLAS operations on dense vector blocks. Optimizing the SpMM kernel on GPUs has been studied in several research works. Yang et al. [65] propose two novel algorithms for the SpMM operation on GPUs that take the sparse matrix input in compressed sparse row (CSR) format and focus on latency hiding with instruction-level parallelism and load balancing. They identify a memory access pattern that allows efficient access to both the input and output matrices, which is the main enabler of their excellent SpMM performance. A common optimization strategy for SpMM is to rely on a special sparse matrix representation to exploit the nonzeros efficiently. The most commonly used sparse matrix storage variants other than the CSR format are a variant of ELLPACK called ELLPACK-R [66] and a variant of Sliced ELLPACK called SELL-P [67]. Hong et al. [68] separate the sparse matrix into heavy and light rows in order to perform dynamic load balancing. They process the heavy rows with CSR and the light rows with doubly compressed sparse row (DCSR) in order to take advantage of tiling. However, these special matrix storage formats incur some additional computational and format conversion cost in the full computational pipeline. Anzt et al. [67] optimize the performance of SpMM using the ELLPACK format [69] and compare the performance of their CPU-GPU implementation with the multithreaded CPU implementation of LOBPCG provided in the BLOPEX [70] package. All of their kernels were written in CUDA 5.5, and they evaluated performance on two Intel Sandy Bridge CPUs and one NVIDIA K40 GPU. Dziekonski et al. [71] implement the LOBPCG method with an inexact nullspace filtering approach to find eigenvalues in electromagnetics analysis.

Most of the prior work focused on optimizing either the SpMV or the SpMM operation on GPUs with the ultimate goal of accelerating the iterative solver used in a scientific application. A distinguishing aspect of this work is that we adopt a holistic approach that includes all computational kernels required for the LOBPCG solver. We use directive-based programming models to achieve portability.
We also investigate the scenario where the total memory footprint exceeds the device memory capacity and propose a solution that addresses the performance degradations seen with NVIDIA's generic "Unified Memory" approach. Our contributions in this part can be summarized as follows:

• We demonstrate that a complex block eigensolver can be implemented efficiently using a mix of accelerator directives (in both the OpenMP and OpenACC frameworks) and optimized library functions. We obtain up to a 4.3x speedup over a well-optimized CPU implementation.

• We show that the performance of the Unified Memory version of SpMM, the dominant kernel in LOBPCG, depends on the supercomputer used and apparently on the underlying CPU-to-GPU interconnect when the application working set exceeds GPU memory capacity. We measure a 13.4x performance loss when migrating from a supercomputer with a PCIe Gen3 CPU-to-GPU interconnect to one with NVLink 2.0.

• We address the Unified Memory performance portability issue by tiling the dominant kernels in LOBPCG. This obtains the highest performance on both supercomputers, which have different CPU-to-GPU interconnects.

3.1 Methodology

In this section, we provide an overview of the LOBPCG algorithm, our baseline CPU implementation, and the steps we took to port and optimize the CPU implementation to run efficiently on GPU-accelerated systems using OpenMP and OpenACC. We then describe our pathfinding activities for creating an efficient LOBPCG algorithm which can operate on matrices exceeding the device memory capacity. In particular, we discuss how we tiled the two most expensive kernels in LOBPCG and created microbenchmarks that enable a performance comparison of programmer-controlled and system-controlled (i.e., Unified Memory) data movement schemes between the CPU and GPU. Finally, we describe the experimental platforms used for evaluating the performance of our LOBPCG and microbenchmark implementations on GPU-accelerated systems.

3.1.1 The LOBPCG algorithm

Algorithm 3 shows the pseudocode of the LOBPCG algorithm; we explain the algorithm in Section 2.2.2 of Chapter 2. The LOBPCG algorithm is a good test application for our experiments as it comprises high arithmetic intensity operations (SpMM and Level-3 BLAS) and some tall-skinny matrix addition, subtraction, multiplication, and reduction operations.

3.1.2 Baseline CPU implementation

We implemented the baseline CPU version of LOBPCG using OpenMP and OpenACC directives in C/C++. We adopted the Compressed Sparse Row (CSR) format to store the sparse matrix and used the mkl_dcsrmm routine from the Intel MKL library for the SpMM kernel. We also implemented a custom SpMM kernel in both OpenMP and OpenACC, again based on the CSR format, and used it with the PGI and IBM compilers. For all LAPACK and BLAS routines needed, we used Intel MKL, the PGI-packaged LAPACK and BLAS libraries, and IBM ESSL for the Intel, PGI, and IBM compilers, respectively.

3.1.3 A GPU implementation of LOBPCG

The most expensive kernels in the baseline CPU version are the SpMM operation and the inner product of vector blocks (X^T Y). The cuSPARSE [72] and cuBLAS CUDA libraries provide tuned versions of these kernels. We used cusparseDcsrmm for the SpMM operation and replaced the cblas_dgemm routine with cublasDgemm for the vector block operations. We allocated the device data for these routines using cudaMalloc. We ported the remaining application kernels using OpenMP and OpenACC offloading pragmas.
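As a hedged sketch of this step (not our exact implementation: the buffer names are ours, dense blocks are assumed to be stored column-major, and the legacy csrmm interface shipped with CUDA 10.x is used), the SpMM call on device-resident CSR data might look roughly as follows:

#include <cuda_runtime.h>
#include <cusparse.h>

/* C (nrows x b) = A (nrows x nrows, CSR) * B (nrows x b).
 * d_val/d_rowPtr/d_colInd, d_B and d_C are device buffers previously
 * allocated with cudaMalloc and filled with cudaMemcpy. */
void spmm_device(cusparseHandle_t handle, int nrows, int b, int nnz,
                 const double *d_val, const int *d_rowPtr, const int *d_colInd,
                 const double *d_B, double *d_C)
{
    const double one = 1.0, zero = 0.0;
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);   /* general matrix, zero-based indexing by default */

    cusparseDcsrmm(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   nrows, b, nrows, nnz, &one, descr,
                   d_val, d_rowPtr, d_colInd, d_B, nrows, &zero, d_C, nrows);

    cusparseDestroyMatDescr(descr);
}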
The application kernels are grouped together inside device data regions to avoid data movement between successive application kernels. However, the performance of this implementation was still poor because significant time was spent moving data between CPU and GPU. This happened because the application and library kernels were operating on distinct data on the GPU. OpenMP and OpenACC provide a clause to enable the application kernels to operate on data already resident on the device. The clause is named is_device_ptr in OpenMP and deviceptr in OpenACC. We used the pointer returned by cudaMalloc in our OpenACC implementation. This approach caused a run time error in our OpenMP implementation compiled with LLVM/Clang. We therefore replaced cudaMalloc with omp_target_alloc in our OpenMP implementation because the OpenMP 5.0 specification [73] states that "Support for device pointers created outside of OpenMP, specifically outside of the omp_target_alloc routine and the use_device_ptr clause, is implementation defined." Figure 3.1 shows an example of the structure of most of our application kernels after using this clause. It enabled us to remove multiple OpenMP/OpenACC data regions and thus considerable data movement between the CPU and GPU.¹

Before using is_device_ptr:

// d_R and other device arrays allocated with omp_target_alloc
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, b, numrows, b,
            &cudaAlpha, d_lambda, b, d_X, b, &cudaBeta, d_R, b);
// Copy output array d_R to the host array R
omp_target_memcpy(R, d_R, R_size * sizeof(double), 0, 0, h, t);
// Copy host array R to the device in an OpenMP target data region
#pragma omp target data map(tofrom: newX[0:X_size]) \
                        map(to: X[0:X_size], R[0:R_size])
{
    mat_mult(X, R, newX, numrows, b);
}

void mat_mult(double *src1, double *src2, double *dst, int row, int col)
{
    #pragma omp target teams distribute parallel for collapse(2)
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            dst[i * col + j] = src1[i * col + j] * src2[i * col + j];
}

After using is_device_ptr:

// d_R and other device arrays allocated with omp_target_alloc
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, b, numrows, b,
            &cudaAlpha, d_lambda, b, d_X, b, &cudaBeta, d_R, b);
// Pointers to device arrays passed into the mat_mult function
mat_mult(d_X, d_R, d_newX, numrows, b);

void mat_mult(double *src1, double *src2, double *dst, int row, int col)
{
    // Use is_device_ptr because data is already on the device
    #pragma omp target is_device_ptr(src1, src2, dst)
    #pragma omp teams distribute parallel for collapse(2)
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            dst[i * col + j] = src1[i * col + j] * src2[i * col + j];
}

Figure 3.1: The use of is_device_ptr to avoid memory copies. Error checking is omitted for brevity.
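For reference, a sketch of how the same element-wise kernel might be written in the OpenACC variant is shown below; deviceptr plays the role of is_device_ptr, and the pointers are assumed to come from cudaMalloc, as described above (the exact loop directives in our code may differ):

/* OpenACC counterpart of mat_mult in Figure 3.1 (sketch): deviceptr tells
 * the compiler that src1, src2 and dst already reside in device memory,
 * so no implicit data transfers are generated. */
void mat_mult_acc(const double *src1, const double *src2, double *dst,
                  int row, int col)
{
    #pragma acc parallel loop collapse(2) deviceptr(src1, src2, dst)
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            dst[i * col + j] = src1[i * col + j] * src2[i * col + j];
}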
All kernels run on the GPU except for some LAPACK routines, i.e., LAPACKE_dpotrf and LAPACKE_dsygv, which are not available in the CUDA toolkit math libraries. This causes 10 small matrices to move between CPU and GPU in each iteration of the LOBPCG method. As the sizes of those matrices are very small, we find that the overhead associated with these data movements is insignificant compared to the total execution time.

¹ Alternatively, we could have copied the data to the device using OpenMP/OpenACC and then passed the device pointer to the CUDA library functions using OpenMP's use_device_ptr clause or OpenACC's use_device clause. We did not use this approach because we wanted the option to use cudaMallocManaged to allocate data in managed memory.

3.1.4 Tiling LOBPCG kernels to fit in GPU memory capacity

The LOBPCG GPU implementation described in Section 3.1.3 allocated the tall-skinny matrices and the sparse matrix in GPU memory. This approach is limited to cases where the aggregated matrix memory footprint is less than the GPU memory capacity. However, a major challenge in many scientific domains [74, 75, 11] (such as configuration interaction in MFDn) is the massive size of the sparse matrix, which can have several billions of rows and columns, and the total number of nonzeros can easily exceed trillions. In this subsection, we explain how we tiled the SpMM and inner product (X^T Y) kernels to operate on problems larger than the GPU memory capacity. We extracted each kernel into a standalone microbenchmark to check for correctness and enable performance evaluation. Although not described in this paper, we have also implemented and evaluated the linear combination kernel (XY), which has similar characteristics to the inner product kernel (X^T Y) but involves the multiplication of a tall-skinny vector block (X) with a small square matrix (Y).

3.1.4.1 SpMM kernel

The SpMM kernel is typically the most expensive operation in LOBPCG. Figure 3.2 shows the tiling idea for the SpMM kernel for cases when the LOBPCG data is too large to fit into the GPU memory. For a given tile size β, we divide the sparse matrix into blocks of rows. Algorithm 4 describes the steps in our tiled SpMM kernel. In short, we copy the Y matrix to the GPU at the beginning, and it resides there until all sparse matrix tiles are processed. Then, we extract the CSR format of each of the tiles and copy it to GPU memory. Next, we apply the cusparseDcsrmm routine to the sparse matrix block and Y. This produces the corresponding row block of the final output matrix Z. After processing each tile, we copy the partial output back to the corresponding tile of the Z matrix.

Figure 3.2: Overview of the tiled SpMM operation.

Algorithm 4: Tiled SpMM (cusparseDcsrmm) kernel
Input: X (m × m) sparse matrix in CSR format (val, rowPtr, colIndex), Y (m × b), β (tile size)
Output: Z (m × b)
1  nrowblk = ⌈m/β⌉
2  for i = 0 to nrowblk - 1 do
     // extract_CSR_tile() extracts the CSR format of the i-th tile from the given sparse matrix
3    [rowPtrTile, colIndxTile, valTile, nnz_Tile] = extract_CSR_tile(val, rowPtr, colIndex, i)
4    cusparseDcsrmm(β, b, m, nnz_Tile, 1.0, valTile, rowPtrTile, colIndxTile, Y, m, 0.0, AR, β)
5    cudaDeviceSynchronize()
6    cudaMemcpy(Z[i-th tile], AR, cudaMemcpyDeviceToHost)
7    cudaDeviceSynchronize()
8  end

3.1.4.2 Inner product kernel

One of the most frequently invoked and expensive kernels in LOBPCG is the inner product operation (Z = X^T Y) between two tall-skinny matrices.
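To make the operation concrete, a plain (non-tiled) reference version of this inner product might be written as below; this is a sketch for illustration only, assuming row-major storage, and is not the kernel we actually use, which relies on cblas_dgemm / cublasDgemm as described in the text:

/* Reference inner product Z = X^T * Y for tall-skinny blocks X, Y (m x b),
 * assuming row-major storage.  Illustration only; the real kernels call
 * optimized BLAS routines. */
void xty_reference(const double *X, const double *Y, double *Z, long m, int b)
{
    for (int i = 0; i < b; i++) {
        for (int j = 0; j < b; j++) {
            double sum = 0.0;
            for (long r = 0; r < m; r++)            /* column i of X dot column j of Y */
                sum += X[r * b + i] * Y[r * b + j];
            Z[i * b + j] = sum;                     /* Z is b x b */
        }
    }
}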
Hence, a well-performing tiled inner product kernel is crucial for large problem sizes. Figure 3.3 shows the overview of the matrix tiling idea for the inner product kernel. X and Y are of size m × b where m ≫ b. Both matrices are partitioned into n = ⌈m/β⌉ tiles. In our custom inner product kernel, we transfer each tile of X and Y from the CPU to the GPU and apply the cublasDgemm routine to each tile. We keep accumulating the partial output into a b × b matrix on the GPU. After processing all tiles, we copy the final result back to Z. Algorithm 5 gives an overview of our custom inner product kernel.

Figure 3.3: Overview of the tiled inner product kernel (a cublasDgemm call on each tile, with the partial output accumulated in devZ on the GPU).

Algorithm 5: Tiled Inner Product (cublasDgemm) Kernel
Input: X (m × b), Y (m × b), β (tile size)
Output: Z (b × b)
1  nrowblk = ⌈m/β⌉
2  cudaMemset(devZ, 0, b * b * sizeof(double))
3  for i = 0 to nrowblk - 1 do
4    cudaMemcpy(devX, X[i-th block], β * b * sizeof(double), cudaMemcpyHostToDevice)
5    cudaMemcpy(devY, Y[i-th block], β * b * sizeof(double), cudaMemcpyHostToDevice)
6    cudaDeviceSynchronize()
7    cublasDgemm(b, b, β, 1.0, devY, β, devX, β, 1.0, devZ, b)
8    cudaDeviceSynchronize()
9  end
10 cudaMemcpy(Z, devZ, b * b * sizeof(double), cudaMemcpyDeviceToHost)

3.1.5 Hardware and software environment

We conducted all of our experiments on the Cori-GPU testbed at the National Energy Research Scientific Computing Center (NERSC) [76] and the Summit supercomputer at the Oak Ridge Leadership Computing Facility (OLCF) [77]. Cori-GPU is a Cray CS-Storm 500NX consisting of 18 compute nodes. Each compute node has two 20-core Skylake processors clocked at 2.4 GHz and 8 NVIDIA Tesla V100 "Volta" GPUs with 16 GBs of HBM per GPU. The V100 GPU model has a peak double precision performance of 7.0 TFLOP/s. There is a total of 384 GB of DDR4 DRAM space on each node. The CPUs are connected to the GPUs via four PCIe 3.0 switches, and the GPUs are connected to each other via NVIDIA's NVLink 2.0 interconnect. The Summit supercomputer is an IBM AC922 system consisting of 4608 compute nodes [78]. Each compute node has two 22-core IBM Power9 processors clocked at 3.1 GHz and 6 NVIDIA Tesla V100 "Volta" GPUs with 16 GBs of HBM per GPU. The V100 GPU model is based on the SXM2 form factor and has a peak double precision performance of 7.8 TFLOP/s. There is a total of 512 GB of DDR4 DRAM space per node. Unlike Cori-GPU, the CPUs and GPUs in a Summit compute node are all connected with the high-bandwidth NVLink 2.0 interconnect. This also provides cache coherence between CPUs and GPUs and enables system-wide atomics. The theoretical peak uni-directional bandwidth between one CPU and one GPU is 16 GB/s on Cori-GPU and 50 GB/s on Summit. However, the highest pageable bandwidth we measured from CPU to GPU was 5.2 GB/s on Cori-GPU and 25.0 GB/s on Summit.

The Cori-GPU and Summit supercomputers provide extensive software environments to compile OpenMP and OpenACC programs. Here, we list the software environment used in this paper. The software used on the Cori-GPU system consisted of the Intel compiler v19.0.3 (OpenMP for CPU), the LLVM/Clang compiler v9.0.0-git (OpenMP for GPU), and the PGI compiler v19.5 (OpenACC for CPU and GPU). We used Intel MKL with the Intel and LLVM/Clang compilers and PGI's version of LAPACK with the PGI compiler. The GPU-accelerated libraries were cuSPARSE and cuBLAS provided with CUDA v10.1.168.
The software used on Summit consisted of the IBM XLC compiler v16.1.1-3 (OpenMP for CPU and GPU) and the PGI compiler v19.5 (OpenACC for CPU and GPU). We used IBM ESSL with the IBM XLC compiler and PGI's version of LAPACK with the PGI compiler. Once again, the GPU-accelerated libraries were cuSPARSE and cuBLAS provided with CUDA v10.1.168.

3.1.6 Experiments

In this section we explain the experiments conducted. The first set of experiments is used to evaluate the LOBPCG GPU implementation. The second set of experiments is used to evaluate our microbenchmarks on problems exceeding the GPU memory capacity.

3.1.6.1 Performance of the LOBPCG solver

The CPU and GPU implementations of LOBPCG are evaluated using a series of real-world matrices with different sizes, sparsity patterns, and application domains, as shown in Table 3.1. The first two matrices are from the SuiteSparse Matrix Collection [79], and the Nm7 and Nm8 matrices are extracted from two very large Hamiltonian matrices that arise in nuclear structure calculations with MFDn. Note that the test matrices have millions of rows and hundreds of millions of nonzeros. The memory footprint of these matrices varies from 2 GB to 7.8 GB using the CSR matrix format.

Table 3.1: Test Matrices.
Matrix | Rows | Columns | Nonzeros | Size (GB) | Domain
Queen_4147 | 4,147,110 | 4,147,110 | 166,823,197 | 2.018 | 3D Structural Problem
HV15R | 2,017,169 | 2,017,169 | 283,073,458 | 3.405 | Computational Fluid Dynamics
Nm7 | 4,985,422 | 4,985,422 | 647,663,919 | 7.792 | MFDn
Nm8 | 7,579,303 | 7,579,303 | 592,099,416 | 7.136 | MFDn

We measured the runtime of the LOBPCG CPU implementation on a single CPU socket on Cori-GPU and Summit nodes. The configurations used one thread per core and used the appropriate slurm, jsrun, and OpenMP/OpenACC environment variables to bind the process and child threads. We did not use hyperthreading/SMT because our kernels are memory-bandwidth bound. We measured the runtime of the LOBPCG GPU implementation on a single CPU socket and one GPU on Cori-GPU and Summit nodes. Our configurations only ever used a single CPU socket to avoid potential performance issues associated with non-uniform memory access times. We evaluated the compiler combinations described in Section 3.1.5 and measured runtime with application timers.

3.1.6.2 Performance of X^T Y and SpMM kernels for large matrices

Our next experiment evaluated the X^T Y microbenchmark and the SpMM microbenchmark on input problems exceeding GPU memory capacity on Cori-GPU and Summit. This experiment is designed to inform our future sparse solver implementations. We tested the tiled versions of the microbenchmarks so that we could easily separate how much time is spent in computation versus data movement between the CPU and GPU. If more time is spent in computation, then data movement costs can potentially be hidden. In the X^T Y microbenchmark, we chose to multiply two matrices of size 67,108,864 × 48, leading to a memory footprint of 51.54 GB. We set the tile size (β) to 131,072 for the X^T Y microbenchmark and 2,597,152 for the SpMM microbenchmark, as this gives us the best performance. The tile size (β) is an optimization parameter, and one can vary it as long as the memory footprint required to process a single tile is less than the GPU memory capacity. In the SpMM microbenchmark, we used a synthetic input matrix of 24 GB, leading to a memory footprint of 35.1 GB. The dimension of the synthetic sparse matrix is 14,957,833 × 14,957,833 with 1,946,671,770 nonzeros.
We multiplied this sparse matrix with a dense matrix of dimension 14,957,833 × 48. We used a multi-threaded RMAT graph generator [80] to generate our synthetic sparse matrix. We measured compute and data movement time using the nvprof profiler.

3.1.6.3 Performance of tiled and Unified Memory versions of SpMM

Our final experiment evaluated the Unified Memory version of the SpMM microbenchmark. The Unified Memory version was written in OpenACC and compiled with the PGI compiler and the compiler option -ta:tesla:managed to replace regular system memory allocations with managed memory allocations. We compared runtime against the tiled version of SpMM on Cori-GPU and Summit for two input matrices. The first input matrix is Nm7 (see Table 3.1) and leads to a microbenchmark memory footprint of 11.7 GB. The second input matrix is the synthetic sparse matrix (14,957,833 × 14,957,833 with 1,946,671,770 nonzeros) and leads to a microbenchmark memory footprint of 35.1 GB. The matrices are chosen to create one problem smaller than and one problem larger than the GPU memory capacity. In both cases, we multiplied these sparse matrices with a dense block of 48 vectors. We set the tile size (β) to 2,597,152 for both matrices, as it is the largest tile size that we can use without overflowing the GPU memory, and it gives the best performance. The nvprof profiler is used to collect compute time, data movement time, and Unified Memory data movement and page fault time.

3.2 Results

In this section we show performance results on the Cori-GPU and Summit supercomputers. Section 3.2.1 shows the performance of the CPU and GPU versions of LOBPCG when parallelized with either OpenMP or OpenACC. We then consider how we could use the LOBPCG solver on matrices larger than the GPU memory capacity. Section 3.2.2 shows performance results when tiling the dominant X^T Y and SpMM kernels so that each tile fits within the GPU memory capacity. Finally, Section 3.2.3 compares the performance of the tiled implementation of the SpMM kernel against a naive Unified Memory implementation.

3.2.1 Performance of the LOBPCG solver

We compared the performance of the LOBPCG solver when using a suite of different compilers. The compilers can all generate code for the host CPU and sometimes also for the GPU. In the following, we place CPU or GPU in parentheses to indicate whether we used the compiler to generate code for the CPU or the GPU. The OpenMP compilers were Intel (CPU) and Clang (GPU) on Cori-GPU and IBM (CPU and GPU) on Summit. The OpenACC compiler was always PGI (CPU and GPU). In all cases we used a hand-written portable SpMM kernel, except for our Intel compiler experiment, which used mkl_dcsrmm from Intel MKL. We did this to obtain the best possible CPU time to more transparently show the value of our GPU implementation. The performance results for the Nm7 matrix are shown in Figure 3.4. The execution time of the LOBPCG solver is averaged over 10 iterations.

Figure 3.4: The time spent in LOBPCG on Cori-GPU and Summit when using various compilers with either OpenMP or OpenACC.

The results show that the execution time of our GPU implementation is almost independent of the directive-based programming model and the evaluation platform.
Our reasoning is that the OpenMP and OpenACC configurations use the same GPU math libraries, the GPUs are nearly identical on Cori-GPU and Summit (different V100 models), and our LOBPCG implementation has been highly tuned to minimize data movement between CPU and GPU. The best GPU performance is 3.05x faster than the best CPU performance for the Nm7 matrix. The CPU versions show more variable performance for different combinations of compilers and math libraries used on Cori-GPU and Summit. The highest performance is obtained with the OpenMP version when compiled with the Intel compiler on Cori-GPU. The performance differences can mostly be attributed to the host CPU and SpMM performance: mkl_dcsrmm is 1.4x faster than our hand-written SpMM kernel in OpenMP, and the hand-written SpMM kernel is 1.5 - 3.0x faster when using OpenMP rather than OpenACC. We did not investigate the host CPU performance in any more detail because it is not the focus of our work.

Figure 3.5 shows how time is spent in the best configurations on CPU and GPU when using the Nm7 matrix. Execution time is divided into library time, application kernel time, and unaccounted CUDA API time. The library time is spent in cuBLAS and cuSPARSE in the GPU implementation and in Intel MKL in the CPU implementation. The application kernel time is spent in user-defined functions in both the CPU and GPU implementations. The CUDA API time includes GPU data allocation and data movement between CPU and GPU and is calculated by subtracting the time spent in application and library kernels from the total run time. The library and application kernels speed up by 3.7x and 5.0x, respectively, when using GPUs. Application kernel time is a relatively small fraction of total run time on the GPU. However, the offload is a key optimization step needed to keep total run time low. Total run time would be significantly slower if we decided to use host application kernels because of the unnecessary data movement between CPU and GPU.

Figure 3.5: The time spent in LOBPCG on Cori-GPU when using matrix Nm7.

Figure 3.6 shows the GPU speedup over the best LOBPCG CPU implementation for all the test matrices in Table 3.1. The LOBPCG GPU implementation achieves a 2.8x - 4.3x speedup over the best CPU implementation. The GPU implementation therefore performs well over a range of matrices from different domains with different sparsity patterns.

Figure 3.6: LOBPCG GPU speedup on Cori-GPU for each test matrix.

3.2.2 Performance of X^T Y and SpMM kernels for large matrices

Figure 3.7 shows the time spent in the inner product (X^T Y) kernel on Cori-GPU and Summit when the total memory footprint is 51.54 GB. The tile size is 131,072. The total time is divided into host-to-device (HtoD) data transfer time and computation time in the inner product kernel (device-to-host (DtoH) data transfer times are negligible for this kernel). We measured data transfer and computation time using nvprof. The results show that total run time is dominated by data transfers. Run time is lower on Summit because of the high-bandwidth NVLink 2.0 interconnect. We obtained data transfer rates of 4 GB/s on Cori-GPU and 13 GB/s on Summit in this kernel. The results indicate that data transfer time cannot be hidden behind computation when the matrix exceeds the GPU memory capacity.
Figure 3.7: Time spent in the X^T Y kernel on Cori-GPU and Summit when the memory footprint exceeds GPU memory capacity.

Figure 3.8 shows the time spent in the SpMM kernel. The input sparse matrix is 24 GB and the total memory footprint is 35.1 GB. This time, the results show that computation time is greater than the data movement time. This indicates that data movement time could be completely hidden behind computation. It would therefore be possible to obtain nearly the same computational throughput as one would get using matrices completely resident in GPU memory. However, an actual block eigensolver alternates between SpMM and vector block operations, so this may not be easy to realize in practice.

Figure 3.8: Time spent in the SpMM kernel on Cori-GPU and Summit when the memory footprint exceeds GPU memory capacity.

Figure 3.9: Time spent in the tiled and Unified Memory versions of the SpMM kernel on Cori-GPU and Summit. The memory footprint is less than GPU memory capacity.

3.2.3 Performance of tiled and Unified Memory versions of SpMM

Figure 3.9 shows the performance of the tiled SpMM kernel compared to the Unified Memory version of the SpMM kernel when the memory footprint is less than the GPU memory capacity. The total memory footprint of this experiment is 11.7 GB. The tiled version is fastest on both platforms. nvprof shows that the tiled version is faster on Summit because of less time spent in CUDA memcpy. Interestingly, the Unified Memory version performs similarly on both platforms.

Figure 3.10 shows the performance of the two SpMM kernels when the memory footprint exceeds the GPU memory capacity. We used the same tile size (β) for the tiled experiments in Figures 3.9 and 3.10. There are now significant differences between the performance of the tiled and Unified Memory versions. The most surprising result is the 48.2x performance difference between the tiled and Unified Memory versions on Summit. This amounts to a 13.4x performance difference between Cori-GPU and Summit when using Unified Memory. This is unexpected given the high-bandwidth NVLink 2.0 interconnect and hardware-managed cache coherence on the Summit IBM system. Although not shown, there is a similar performance difference on Summit for the X^T Y and XY kernels. Unified Memory performance is therefore poor and depends on the machine used.

Figure 3.10: Time spent in the tiled and Unified Memory versions of the SpMM kernel on Cori-GPU and Summit. The memory footprint exceeds GPU memory capacity. We use a logarithmic scale on the Time (sec) axis to capture the slow run time for the Unified Memory configuration on Summit.

Figure 3.11 shows the nvprof output for the Unified Memory version of the XY kernel on Cori-GPU and Summit. The results show that the total count of page faults and the total data moved are the same on both systems. As expected, the data transfer is 3x faster on Summit, consistent with the bandwidth of the CPU-to-GPU interconnect. However, the metric named "Gpu page fault groups" takes 30x more time on Summit compared to Cori-GPU for unknown reasons.
This explains the poor performance on Summit. We observed a similar performance difference without nvprof (nvprof added a performance overhead of about 10% on both machines). We are currently in contact with OLCF and NVIDIA staff to understand our performance observations.

Cori-GPU
Device "Tesla V100-SXM2-16GB (0)"
Count    Avg Size   Min Size   Max Size   Total Size   Total Time   Name
196608   170.67KB   4.0000KB   0.9961MB   32.00000GB   3.326868s    Host To Device
8526     1.9993MB   4.0000KB   2.0000MB   16.64655GB   1.368811s    Device To Host
98304    -          -          -          -            10.668444s   Gpu page fault groups
Total CPU Page faults: 98305

Summit
Device "Tesla V100-SXM2-16GB (0)"
Count    Avg Size   Min Size   Max Size   Total Size   Total Time   Name
163840   204.80KB   64.000KB   960.00KB   32.00000GB   1.078612s    Host To Device
8525     1.9998MB   64.000KB   2.0000MB   16.64850GB   396.9533ms   Device To Host
98304    -          -          -          -            313.43688s   Gpu page fault groups
8524     2.0000MB   2.0000MB   2.0000MB   16.64844GB   -            Remote mapping from device
Total CPU Page faults: 98305
Total remote mappings to CPU: 8524

Figure 3.11: Unified Memory nvprof profile of the XY microbenchmark on Cori-GPU (top) and Summit (bottom).

3.3 Discussion

In this section we discuss the key learnings from the results in Section 3.2.

The results show that we have successfully ported the LOBPCG solver to NVIDIA GPUs using directives and optimized CUDA library calls. We obtained similar performance for the OpenMP implementation using the Clang and XLC compilers as we did for the OpenACC implementation using the PGI compiler. The quality of OpenMP compilers for GPUs has often been criticized over the past few years [60]; however, our experience provides evidence that OpenMP compilers are becoming more robust and are capable of generating high-performance code. We found that the key enabler of performance was to keep data resident on the GPU between calls to optimized CUDA math functions. We were able to do this trivially by adding OpenMP/OpenACC accelerator directives to the large number of kernels in the LOBPCG solver. In the past, this would have been much more challenging and time-consuming because the remaining application kernels would have needed to be ported to CUDA. Our related work section shows that earlier attempts to port a LOBPCG solver to GPUs by other scientists were generally focused on optimizing only the SpMM kernel on the GPU, whereas we focus on optimizing the full solver on the GPU. This highlights the productivity gains from using directives and the importance of interoperability between the code generated by the OpenMP/OpenACC compilers and CUDA. This interoperability is not required in the OpenMP specification and is only recommended as a note to implementors in the OpenACC specification. However, we have highlighted the importance of interoperability, and believe that the HPC community should strongly request this support from compilers, as we have done for LLVM/Clang (https://bugs.llvm.org/show_bug.cgi?id=42643).

We have shown that our LOBPCG microbenchmarks can be tiled to solve problems larger than the GPU memory capacity. We found that the time spent in cublasDgemm for the inner product (X^T Y) microbenchmark is shorter than the time spent moving data to and from the GPU.
This indicates that it is not possible to write a tiled cublasDgemm for larger problems which achieves the same computational throughput as a problem which fits in the GPU memory capacity. The tiled cublasDgemm performance was mostly determined by the bandwidth of the CPU-to-GPU interconnect. This will remain a challenge in many CPU+GPU systems in the coming years because PCIe Gen4 has lower bandwidth than NVLink 2.0. The SpMM microbenchmark showed the opposite behavior to X^T Y in that more time was spent in computation than in data movement. This indicates that data movement costs could be hidden, i.e., computation on one tile could occur concurrently with the data movement for the next tile. The full LOBPCG solver includes X^T Y and SpMM operations. Therefore, the amount of computation on the GPU relative to data movement between CPU and GPU is more than what is shown in our microbenchmarks. This indicates that it should be possible to write an efficient LOBPCG solver for GPUs which can solve problems larger than the GPU memory capacity.

We had mixed success when using a Unified Memory implementation of the SpMM kernel. The performance was a little worse than the tiled implementation when the memory footprint was less than the GPU memory capacity. This could be acceptable to many application programmers because we obtained this performance with much simpler code. This would be a huge productivity win for the application programmer because there is no need to manage separate host and device copies of data; there is just a single pointer to the data which can be used on both host and device. We found that the performance of the Unified Memory implementation was much worse than the tiled implementation when the memory footprint exceeded the GPU memory capacity. It was so bad on Summit that it would have been more efficient to use a CPU implementation and leave the GPUs idle. We are still working to understand why Unified Memory performance was so poor on Summit. However, our early experience serves as a warning to application programmers that they should not rely on Unified Memory when the application memory footprint is larger than the GPU memory capacity. It is also useful information to HPC system providers that the success of their users strongly depends on purchasing GPUs with sufficient memory capacity.

We recommend that tiling be used in large-memory-footprint applications on CPU+GPU systems. This can deliver both high performance and predictable performance across different CPU+GPU systems. However, it can be a significant amount of work to tile and overlap data transfers with computation in an application. This may become easier in the future with enhancements to the OpenMP standard providing directive-based partitioning and pipelining [81]. Alternatively, middleware for sparse solvers on GPUs could abstract away these programming challenges.

3.4 Summary of the Chapter

In this work, we have described our approaches to mix CUDA library calls with OpenMP/OpenACC offloading pragmas in order to implement and optimize the full LOBPCG eigensolver on GPU-accelerated systems. We successfully used both OpenMP and OpenACC, and achieved a speedup of 2.8x - 4.3x over a baseline CPU implementation. Our experiments with the SpMM and inner product microbenchmarks showed that tiling is the preferred approach for larger problem sizes. We found that a naive Unified Memory implementation had worse performance than a tiled implementation by up to an order of magnitude, depending on the target supercomputing platform.
With these observations, we gained a better idea of how to design and implement a portable sparse solver framework on heterogeneous architectures to solve problem sizes that exceed the total device memory. In the next chapter, we discuss the details of our portable sparse solver framework for heterogeneous architectures.

CHAPTER 4
A PORTABLE SPARSE SOLVER FRAMEWORK FOR LARGE MATRICES ON HETEROGENEOUS ARCHITECTURES

Heterogeneous computing architectures employing GPUs have become the mainstream approach in recent leadership supercomputers [6]. GPUs have been used to accelerate applications in various domains, such as graph analytics, machine learning, computational finance, climate modeling, and multimedia. Recently, GPUs have improved significantly in terms of programmability. GPUs are typically programmed using a device-specific (e.g., CUDA) or a portable (e.g., OpenMP) programming model. Regardless of whether an application uses a device-specific or a portable programming model, the amount of data moved between CPU and GPU is a key concern in heterogeneous systems due to the wide gap between computational performance and data movement. For best performance, programmers have to explicitly control the data movement between the host and the device, but this comes at the expense of programmer productivity. As a general solution, Unified Virtual Memory (UVM) has been available since CUDA 8.0 and the Pascal architecture (2016). UVM provides a single and consistent view of the host and device memories on a system. Even though UVM dramatically reduces developer effort in regard to managing data movement, it typically causes a significant hit in performance. While the size of the data needed by scientific applications and machine learning/data analytics workloads increases at a rapid pace, the memory available on GPUs increases at a much more modest pace in comparison. For example, NVIDIA's Tesla V100 "Volta" GPUs have only 16 GBs of device memory available; one could have up to 80 GBs of memory on the newest generation NVIDIA A100s, but they come with a price tag not affordable by most customers or even HPC centers. When an application's working set size exceeds the device memory size, the resulting data movement becomes a critical design and performance bottleneck [12]. In this work, we present a tiling-based task-parallel sparse linear algebra framework, named
We adopt the same task-based tiling approach, which serves the double purpose of enabling data locality optimizations and facilitating the management of application data so that it can be processed in batches that fit into the device memory. For this, DeepSparseGPU automatically generates and expresses the entire computation as a task dependency graph (TDG), which is then executed using OpenMP's tasking and target offloading functionalities. A memory manager developed as a part of DeepSparseGPU keeps track of the data residing in host and device memories, and automatically migrates data between the two whenever needed. Given the data dependencies between computational tasks, DeepSparseGPU employs a topological-sort based heuristic to minimize data movement and increase application performance.

Given the importance of sparse solvers in scientific computing and machine learning, several optimization techniques have been proposed for sparse matrix-vector multiplication (SpMV) on GPUs [29, 82, 83, 84]. However, the performance of SpMV is bounded by memory bandwidth [85]. Since sparse matrix-matrix multiplication (SpMM) has a much higher arithmetic intensity than SpMV and can efficiently leverage the performance benefits of GPUs, SpMM-based solvers have recently drawn significant interest in scientific computing in the form of block linear solvers and eigensolvers [11]. As such, several groups have studied the optimization of the SpMM kernel on GPUs [65, 66, 67, 68, 69, 71].

As can be seen, most prior work on accelerating iterative solvers for GPUs has focused only on optimizing the SpMV/SpMM kernels, with a few exceptions. In this work, we present a holistic framework that includes all computational kernels required for block eigensolvers (LOBPCG is used as a case study). Another distinguishing aspect of our work compared to the work reviewed above is the support we provide for applications with memory demands significantly larger than the available device memory capacity. In addition to the support for large applications, we adopt a directive-based programming model to achieve portability. Several existing works have studied the efficacy of the GPU offload support in OpenMP and/or OpenACC in the context of individual kernels [7, 8, 9], mini-applications [60] or proxy applications [86]. In this regard, our evaluation of OpenMP's offloading support in the context of a real eigensolver applied to matrices from several domains also fills a gap. In doing so, for the benefit of the community, we also discuss the compiler support issues we faced, as OpenMP offloading implementations are constantly evolving and are advertised as having only partial support in different compilers.

In evaluating DeepSparseGPU, we use as a baseline a Unified Virtual Memory (UVM) based implementation built on GPU-accelerated cuSPARSE and cuBLAS library kernels. The current data communication strategy in UVM is full page migration, which enables oversubscription of GPU memory and system-wide automatic memory operations [12, 87]. However, UVM carries the risk of poor performance due to large-scale page migration [62, 64]. Our work compares UVM against explicit data management and considers problems whose memory requirements significantly exceed the device's memory capacity. Consequently, our contributions can be summarized as follows:

• We demonstrate that a complex block eigensolver can be implemented efficiently using OpenMP target offload directives by minimizing data movement between CPU and GPU.
We obtain up to 2.93x speedup and up to 2.18x less data transfer between CPU and GPU over a well-optimized UVM based implementation.

• We designed and developed a Memory Manager (MM) that automatically keeps track of data on both CPUs and GPUs, and migrates it whenever needed. The Memory Manager makes it possible to execute problem sizes that exceed device memory.

• We use a topological sort on the DAG to obtain a custom schedule of the computational tasks. Empirically, we find that the topological sort helps minimize data movement compared to the plain baseline schedule.

• We share and discuss the experiences and issues we faced during the development process so that the community can benefit from them.

4.1 DeepSparseGPU Overview

Figure 4.1 illustrates the architectural overview of the DeepSparseGPU framework. DeepSparseGPU consists of two major components: i) the Primitive Conversion Unit (PCU), which provides a front-end to domain scientists to express their application, such as the LOBPCG solver, at a high level and generates the task dependency graph (TDG); and ii) the Task Executor (TE), which receives a DAG from the PCU and performs a topological sort on the DAG to get a task schedule. The task executor then launches the proper kernels that correspond to computational nodes in the TDG according to the task schedule.

Figure 4.1: Schematic overview of DeepSparseGPU framework.

4.1.1 Primitive Conversion Unit (PCU)

The Primitive Conversion Unit (PCU) is composed of two parts: i) Task Identifier and ii) Task Dependency Graph (TDG) Generator.

4.1.1.1 Task Identifier (TI)

The PCU allows the application developer to express their algorithms at a high level without having to worry about architectural details (e.g., memory hierarchy) or parallelization considerations (e.g., determining the individual tasks and their scheduling). The Task Identifier parses the given application code to identify the specific BLAS/LAPACK and GraphBLAS function calls and the input/output of each function call. It then passes this information to the local task dependency graph generator.

4.1.1.2 Task Dependency Graph Generator (TDGG)

The output of the Task Identifier (TI) is a dependency graph at a very coarse level, i.e., at the function call level. Tasks must be generated at a much finer granularity for efficient parallel execution of larger problems on the GPU by carefully planning the data movement between the CPU and GPU. This is accomplished by the Task Dependency Graph Generator (TDGG), which goes over the input/output data information generated by TI for each function call and starts decomposing/tiling these kernels and data structures. As noted above, the decomposition into finer granularity tasks starts with the first function call involving the sparse matrix (or matrices) in the solver code, which is typically an SpMV, SpMM, or SpGEMM operation.
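The pseudocode embedded in Figure 4.1 suggests the kind of input the PCU consumes: the solver is written as a plain sequence of dense and sparse primitives, and the PCU/TDGG turn those calls into tiled tasks. The sketch below rephrases that pseudocode as compilable stubs; the function names follow the figure, but all types, signatures, and semantics shown here are illustrative assumptions rather than the actual DeepSparseGPU API.

    // Illustrative stubs only: each call would be identified by the TI and
    // decomposed into fine-grained tasks by the TDGG.
    struct Mat {};  // placeholder for a sparse or dense (block-vector) operand

    void SpMM(Mat &out, const Mat &A, const Mat &X) {}   // sparse kernel; drives the tiling
    void dot(Mat &out, const Mat &X, const Mat &Y) {}    // block inner product
    void daxpy(Mat &out, const Mat &X, const Mat &Y) {}  // block update
    void dsyevd(Mat &evecs, const Mat &A) {}             // small dense eigensolve (LAPACK)
    bool converged() { return true; }                    // stub convergence test

    void solver_loop(Mat &H, Mat &psi) {
        Mat Hpsi, E, Epsi, R, W, Tinv, Wmat, S;
        do {
            SpMM(Hpsi, H, psi);
            dot(E, psi, psi);
            daxpy(Epsi, E, psi);
            daxpy(R, Hpsi, Epsi);
            dot(W, Tinv, R);
            dot(Wmat, W, W);
            dsyevd(S, Wmat);
        } while (!converged());
    }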
For example, SpMM is the main sparse matrix kernel in the LOBPCG solver. We use the compressed sparse row (CSR) matrix format to store the sparse matrix. First, the TDGG decomposes the sparse matrix using a 1D decomposition. Then, the decomposition of the main sparse matrix kernel induces the decomposition of all kernels above and below the sparse matrix operation (i.e., the full solver). TDGG also generates the dependencies between individual fine-granularity tasks by examining the function call dependencies determined by TI as part of the task dependency graph generation procedure.

The resulting task dependency graph generated by TDGG is a directed acyclic graph (DAG) representing the data flow in the solver code, where vertices denote computational tasks, incoming edges represent the input data, and outgoing edges represent the output data for each task. TDGG also labels the vertices in the TDG with the estimated computational cost of each task, and the directed edges with the name and size of the corresponding data, respectively. In particular, TDGG uses an instance of TaskInfo [Listing 2.1] as the name of each vertex in the TDG to store the necessary information to execute each task in the DAG. During execution, such information can be used for load balancing among threads and/or ensuring that active tasks fit in the available GPU memory.

4.1.2 Task Executor (TE)

The Task Executor (TE) in DeepSparseGPU is composed of two parts as well: i) Kernel Launcher, and ii) Memory Manager.

4.1.2.1 Kernel Launcher (KL)

The KL receives a DAG from the PCU that represents the full execution flow of the given application. Then, the KL performs a depth-first topological sort based on the dependencies in the DAG to get an execution schedule of tasks and saves the tasks in an array of TaskInfo structures according to the task schedule. The topological ordering of the DAG respects the data dependencies and guarantees the numerical correctness of the given application. Furthermore, it is expected that the depth-first strategy used in the topological ordering helps minimize the data movement between the CPU and GPU by placing tasks with data dependencies together in the task schedule.

    // main.cpp
    [task, numTask] = topologicalSort(DAG);
    #pragma omp parallel
    {
      #pragma omp master
      {
        while (!converged) {
          for (i = 0; i < numTask; i++) {
            // extract task information
            taskinfo = extractTaskInfo(task[i]);
            // prepare operands on the GPU (or CPU) using the Memory Manager
            operands = memoryManager(taskinfo);
            // launch the proper kernel on the GPU (or CPU) based on taskinfo
            kernel(operands);
          }
        }
      }
    }

    // an example GPU application kernel
    void mat_add(double *device_memory, int offset1 /* src1 */, int offset2 /* src2 */,
                 int offset3 /* dst */, int blksz, int col)
    {
      int sz = blksz * col;
      #pragma omp target is_device_ptr(device_memory) \
          depend(in: device_memory[offset1:sz], device_memory[offset2:sz]) \
          depend(out: device_memory[offset3:sz]) nowait
      #pragma omp teams distribute parallel for collapse(2)
      for (int i = 0; i < blksz; i++)
        for (int j = 0; j < col; j++)
          device_memory[offset3 + i * col + j] =
              device_memory[offset1 + i * col + j] + device_memory[offset2 + i * col + j];
    }

Listing 4.1: Kernel Launcher skeleton code in OpenMP

Once the topological ordering of the DAG is done, the KL picks each node from the task array one by one and launches a kernel for it on the GPU (or on the CPU, if the corresponding GPU kernel is not available).
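For reference, a depth-first topological ordering of the kind the KL performs can be sketched as follows; the adjacency-list DAG representation and the function names below are assumptions made for illustration, not the actual KL data structures. A post-order depth-first traversal followed by a reversal yields a valid topological order, and successors of a task tend to be scheduled close to it, which is the locality effect described above.

    #include <algorithm>
    #include <vector>

    struct TaskInfo { int id; /* kernel type, operands, tile ids, ... */ };

    static void dfs(int v, const std::vector<std::vector<int>> &succ,
                    std::vector<char> &visited, std::vector<int> &post) {
        visited[v] = 1;
        for (int w : succ[v])
            if (!visited[w]) dfs(w, succ, visited, post);
        post.push_back(v);                       // post-order position
    }

    // succ[v] lists the tasks that depend on task v (outgoing edges of the DAG).
    std::vector<int> topological_sort(const std::vector<std::vector<int>> &succ) {
        int n = (int)succ.size();
        std::vector<char> visited(n, 0);
        std::vector<int> post;
        for (int v = 0; v < n; ++v)
            if (!visited[v]) dfs(v, succ, visited, post);
        std::reverse(post.begin(), post.end());  // reverse post-order = topological order
        return post;                             // indices into a TaskInfo array
    }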
The KL first extracts the task information from the task name. Then, it consults with the Memory Manager (MM) to check whether the operands of the task are ready in GPU memory (or in CPU memory). Once all operands are ready, the kernel can be launched. Currently, the only exceptions to GPU-executable kernels are a few LAPACK operations that are not yet implemented in the GPU libraries. As shown in Listing 4.1, we use target teams distribute parallel for directives to offload computations to the GPU. The nowait clause is used for asynchronous execution on the GPU. Whenever possible, the collapse clause is used to merge nested loops to obtain higher degrees of parallelism in the kernels. All directives use the depend clause to respect the input/output dependencies among tasks.

4.1.2.2 Memory Manager (MM)

The memory manager (MM) is responsible for keeping track of all data tiles in the host and device memory. The KL consults with the MM before launching each kernel on the GPU (or CPU). The MM makes sure that the latest copy of each operand of a kernel is available on the GPU (or CPU) before launching the kernel. The MM moves data tiles between host and device automatically whenever needed and implements a First-In-First-Out (FIFO) eviction policy to evict blocks when the device memory becomes full.

Data Structures used in MM: The MM maintains a few data structures to fulfill its operations.

Figure 4.2: Data structures used in managing tile information in Memory Manager.

Figure 4.2 shows the important data structures that the MM maintains to manage all data tiles between the host and device. The purposes of these data structures are as follows:

• device_memory: This is an array of double-precision numbers that spans the total available device memory on our target machine. We reserve a small fraction of the device memory to store some matrices if the computational model requires it. The rest of the device memory is allocated as this bigger array. We divide the device_memory array into blocks of memory chunks according to a user-provided computational granularity. If the size of the device_memory array is m bytes and the memory granularity is β bytes, then we divide the full device_memory array into n = ⌈m/β⌉ memory blocks. We use the omp_target_alloc function in our OpenMP implementation to allocate the needed memory on the device.

• memory_view: This is an array of n ⟨name, id⟩ pairs. It contains the name and id of the data tile stored in the i-th memory block of the device_memory array.

• memory_map: This is an unordered map that holds detailed information on each data tile stored in device_memory in the form of (Key, Value) pairs. The Key is a ⟨name, tile id⟩ pair so that each tile of all data structures associated with the computation has a unique key. The Value is an array of 4 numbers: ⟨index, block count, tile size, isModified⟩. The first number in the Value field is the index of device_memory where the data tile is stored on the device. The second number in the array is the number of memory blocks occupied on the device by the stored tile. Depending on the size of the tile, it may require more than one device memory block for storage. The third element of the array is the actual size of the tile that is stored at that index location in device_memory. The fourth element of the array indicates whether the tile data is modified on the device or not.
If isModified = 0, the matrix tile data has not been modified on the device; if isModified = 1, the matrix tile data has been modified on the device. The MM always checks the isModified field before evicting a matrix tile from the device memory. If the tile data is modified, the MM copies the latest matrix tile data to the proper host location before evicting it from the device memory.

• evictionQueue: The MM implements a First-In-First-Out (FIFO) eviction policy using evictionQueue to execute larger sparse matrix problems that exceed device memory. As the MM moves a tile from host to device, it adds the ⟨name, id⟩ of that tile to the end of evictionQueue. The MM always tries to evict from the beginning of evictionQueue.

• freeblock_map: This is an unordered map that keeps track of the free memory blocks in the device_memory array. The Key of the freeblock_map is the number of contiguous free memory blocks in the device_memory array. The Value of the freeblock_map is an array of start indices of free memory blocks in the device_memory array. The MM starts tracking the free memory blocks once the eviction policy is activated, and the MM always tries to copy a matrix block from the host into the free memory blocks on the device. This helps minimize memory fragmentation on the device.

Eviction/Replacement Policy: Data movement between the host and device becomes the main performance bottleneck when application working sets exceed the physical memory capacity of the device. The MM uses a software-managed First-In-First-Out (FIFO) eviction policy to efficiently manage data movement between the host and the device. From the beginning of the execution of the application, the MM adds a matrix block to the end of the evictionQueue whenever it moves a matrix block from host to device; an eviction is performed when the device memory becomes full. The MM evicts the front element of the evictionQueue to make space for newly required data on the device. If the data tile at the front of evictionQueue is an operand of the current operation, it is removed from the front and reinserted at the back of evictionQueue.

List of MM methods/functions: Here is the list of important MM functions/methods:

• isOnDevice(mtx, tile_id): This function returns true if the tile_id-th block of the mtx matrix is available on the device; otherwise it returns false.

• copyToDevice(mtx, tile_id): This function is used to copy the tile_id-th block of the mtx matrix from the host to suitable available memory blocks in device memory. It throws an exception if there is not enough space or no suitable memory block on the device. We use the omp_target_memcpy function to copy data between host and device in DeepSparseGPU.

• reserveOnDevice(mtx, tile_id): This function reserves space for the tile_id-th block of mtx on the device. This function is beneficial when an operation produces new data. In such cases, we do not need to copy the output matrix tile from host to device; reserving sufficient device space for that tile is enough. This function throws an exception if there is not enough space or no suitable memory block exists on the device.

• copyToHost(mtx, tile_id): This function is used to copy the tile_id-th block of the mtx matrix from the device to the host. It throws an exception if it is unable to copy the matrix block. We use the omp_target_memcpy function to copy data from device to host.

How MM works: We provide an example in Figure 4.3 to demonstrate the operations of the MM.
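Before walking through that example, the bookkeeping described above can be summarized in the following hypothetical sketch. The names follow the text, but the fields, signatures, and stubbed helpers are assumptions rather than the actual implementation, and the rule that reinserts the front tile when it is an operand of the current operation is omitted for brevity.

    #include <cstddef>
    #include <deque>
    #include <map>
    #include <string>
    #include <utility>

    using TileKey = std::pair<std::string, int>;          // <matrix name, tile id>
    struct TileInfo { int start_block; int num_blocks; std::size_t bytes; bool isModified; };

    struct MemoryManagerSketch {
        std::map<TileKey, TileInfo> memory_map;           // tiles currently on the device
        std::deque<TileKey> evictionQueue;                // FIFO order of arrival on device

        bool isOnDevice(const TileKey &k) const { return memory_map.count(k) != 0; }

        // Evict the tile at the front of the queue, writing it back to the host
        // first if it was modified on the device (the isModified check).
        void evictOne() {
            TileKey victim = evictionQueue.front();
            evictionQueue.pop_front();
            TileInfo info = memory_map[victim];
            if (info.isModified) copyToHost(victim, info);   // e.g. via omp_target_memcpy
            releaseBlocks(info.start_block, info.num_blocks);
            memory_map.erase(victim);
        }

        // Stubs standing in for the real data movement and free-block bookkeeping.
        void copyToHost(const TileKey &, const TileInfo &) {}
        void releaseBlocks(int /*start*/, int /*count*/) {}
    };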
Figure 4.3: Illustrative example of how MM works: (a) device memory becomes full; (b) evicting and copying.

Let us assume we have 4 matrices (A, B, C and D) associated with a computation and we have a total of 7 memory blocks on the device. Each matrix is tiled into 2 × 1 blocks in this example. Each tile of matrix A requires 2 memory blocks on the device, whereas the tiles of the other matrices require 1 memory block each. Due to the nature of the computation, let us assume that the matrix tiles are moved to the device in a fixed order of seven tiles. As shown in Figure 4.3a, the device memory becomes full before moving the last tile in this order due to the limited device memory capacity. Figure 4.3a shows the status of all the data structures of the MM at this stage, and the MM activates the eviction policy since the device memory is now full. The MM now needs to move the last tile to the device. In order to do so, the MM evicts the tile at the front of evictionQueue. The evicted tile occupies 2 memory blocks on the device, whereas the incoming tile requires only 1 memory block. So the MM copies the incoming tile to index 0 of the device_memory array and adds index 1 to the freeblock_map. The MM also updates memory_view, memory_map and evictionQueue accordingly. As shown in Figure 4.3a, isModified = 1 in memory_map for the evicted tile, which means it was modified on the device. The MM therefore copies its latest value from device to host before evicting it in Figure 4.3b.

4.2 Performance Evaluation

4.2.1 Experimental Setup

We conducted all of our experiments on the Cori-GPU testbed at the National Energy Research Scientific Computing Center (NERSC) [76]. Each compute node has two 20-core Skylake processors clocked at 2.4 GHz and 8 NVIDIA Tesla V100 "Volta" GPUs with ~16 GB of HBM per GPU. The V100 GPU model has a peak double-precision performance of 7.0 TFLOP/s. There is a total of 384 GB of DDR4 DRAM space on each node. The CPUs are connected to the GPUs via four PCIe 3.0 switches and the GPUs are connected to each other via NVIDIA's NVLink 2.0 interconnect. Cori-GPU provides extensive software environments to compile OpenMP (and OpenACC) programs. We use the Cray compilers ('CCE') to compile our implementation. The GPU-accelerated cuSPARSE and cuBLAS libraries provided with CUDA v11.1.1 are used to compile our baseline implementation. We use thread affinity to bind threads to cores, and we use 20 CPU threads and 1 GPU (1 CPU socket + 1 GPU) for our experiments to avoid NUMA issues. We take a full Cori-GPU node for our experiments in order to avoid the noisy environment created by sharing the same node with other users.

4.2.1.1 Benchmark Application

We demonstrate the performance of the DeepSparseGPU framework on the LOBPCG eigensolver, which is widely used in large-scale scientific computing applications [14]. We give the pseudocode for the LOBPCG solver in Algorithm 3 in Chapter 2. It involves kernels with high arithmetic intensities such as SpMM and several level-3 BLAS kernels. The total memory needed for the block vectors Ψ, R, Q and others can easily exceed the space the sparse matrix H takes up. Figure 2.6 shows a sample task graph for LOBPCG generated by TDGG using a very small matrix.
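A back-of-the-envelope estimate makes this point concrete. The numbers below are purely illustrative (they are not the matrices or block sizes used in our experiments), but they show how quickly a handful of dense tall-skinny block vectors can outgrow the CSR matrix itself.

    #include <cstdio>

    int main() {
        // Illustrative parameters only, not the experimental configuration:
        const double n   = 5.0e6;   // matrix rows
        const double nnz = 6.5e8;   // matrix nonzeros
        const double b   = 32;      // columns per block vector
        const double nbv = 12;      // block vectors kept by the solver (Psi, R, Q, temporaries, ...)

        // CSR storage: 8-byte values, 4-byte column indices, 4-byte row pointers.
        double csr_gb  = (nnz * (8.0 + 4.0) + (n + 1) * 4.0) / 1e9;
        // Dense tall-skinny block vectors: n x b doubles each.
        double vecs_gb = (n * b * 8.0 * nbv) / 1e9;
        std::printf("sparse matrix (CSR): ~%.1f GB\n", csr_gb);
        std::printf("block vectors:       ~%.1f GB\n", vecs_gb);
        return 0;
    }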
Clearly, minimizing the data movement between CPU and GPU to obtain an efficient LOBPCG implementation is non-trivial.

4.2.1.2 Dataset

We selected 11 square matrices with varying sizes, sparsity patterns, and domains from the SuiteSparse Matrix Collection in addition to the Nm7 matrix, which is from a nuclear shell model code (see Tables 4.1 and 4.2) [79]. The matrices in Table 4.1 are used to evaluate bigger problem sizes that do not fit in GPU memory; these problem sizes range from 32.06 GB to 84.54 GB. The matrices in Table 4.2 are used to evaluate problem sizes that fit in GPU memory; these problem sizes range from 2.64 GB to 13.87 GB. Performance data for LOBPCG is averaged over five iterations.

Table 4.1: Matrices used to evaluate problem size > 16 GB.
Matrix         #Rows         #Non-zeros      Problem Size (GB)
Nm7            4,985,944     648,890,590     32.06
nlpkkt200      16,240,000    448,225,632     44.98
nlpkkt240      27,993,600    760,648,352     77.32
it_2004        41,291,594    1,150,725,436   64.18
sk_2005        50,636,154    1,949,412,601   54.38
webbase_2001   118,142,155   1,019,903,190   84.54

Table 4.2: Matrices used to evaluate problem size < 16 GB.
Matrix            #Rows        #Non-zeros     Problem Size (GB)
inline_1          503,712      36,816,342     2.64
dielFilterV3real  1,102,824    89,306,020     6.07
Flan_1565         1,564,794    117,406,044    5.22
HV15R             2,017,169    281,419,743    8.29
Bump_2911         2,911,419    127,729,899    8.63
Nm7               4,985,944    648,890,590    13.87
nlpkkt160         8,345,600    229,518,112    12.94

4.2.1.3 Baseline Implementation

We compare the performance of DeepSparseGPU with a baseline implementation that we call Libcsr_UM, which is an implementation of the LOBPCG solver using the GPU-accelerated cuSPARSE and cuBLAS libraries for the SpMM (with CSR storage of the sparse matrix), inner product, and linear combination kernels. The application kernels are implemented on the GPU using OpenMP parallel for loop and target offload pragmas. Application kernels include tall-skinny matrix operations such as addition, subtraction, element-wise multiplication, etc. We use NVIDIA's unified memory technology (using the cudaMallocManaged function) in this implementation. As such, the runtime system automatically takes care of the data movement between the host and device.

4.2.1.4 Incremental Implementation of DeepSparseGPU

We incrementally implemented DeepSparseGPU. We implemented two intermediate working versions of our tile-based DeepSparseGPU framework. We adopted two different data movement schemes in these two intermediate versions, which helped us design and develop our memory manager. We include the performance data from these intermediate versions to show the margin of improvement obtained with the latest version of DeepSparseGPU:

    void mat_add(double *X, double *Y, double *Z, int row, int col,
                 int block_width, int block_id)
    {
      int i, j, blksz;
      blksz = (block_id * block_width + block_width) > row ?
              row - block_id * block_width : block_width;
      int offset = block_id * block_width * col;
      int sz = blksz * col;
      #pragma omp target firstprivate(blksz, sz, offset) map(to: X[offset:sz]) \
          map(to: Y[offset:sz]) map(tofrom: Z[offset:sz]) \
          depend(in: X[offset:sz]) depend(in: Y[offset:sz]) depend(inout: Z[offset:sz])
      #pragma omp teams distribute parallel for private(j) collapse(2)
      for (i = 0; i < blksz; i++) {
        for (j = 0; j < col; j++) {
          Z[offset + i * col + j] = X[offset + i * col + j] + Y[offset + i * col + j];
        }
      }
    }

Figure 4.4: Code snippet of mat_add kernel in DeepSparse_MAP version.

• DeepSparse_MAP: This version relies on the OpenMP map(to: ...), map(tofrom: ...) and map(from: ...) clauses for transferring data between host and device. Each offload pragma moves the necessary data tiles on-the-fly right before launching its GPU kernel, essentially relying on the OpenMP runtime system for data movement. Figure 4.4 shows the code snippet of the mat_add kernel in the DeepSparse_MAP version.

    // in main program
    int size = row * col * sizeof(double);
    cudaMallocManaged(&X, size);
    cudaMallocManaged(&Y, size);
    cudaMallocManaged(&Z, size);

    // mat_add kernel
    void mat_add(double *X, double *Y, double *Z, int row, int col,
                 int block_width, int block_id)
    {
      int i, j, blksz;
      blksz = (block_id * block_width + block_width) > row ?
              row - block_id * block_width : block_width;
      int offset = block_id * block_width * col;
      int sz = blksz * col;
      #pragma omp target firstprivate(blksz, sz, offset) map(to: X[offset:sz]) \
          map(to: Y[offset:sz]) map(tofrom: Z[offset:sz]) \
          depend(in: X[offset:sz]) depend(in: Y[offset:sz]) depend(inout: Z[offset:sz])
      #pragma omp teams distribute parallel for private(j) collapse(2)
      for (i = 0; i < blksz; i++) {
        for (j = 0; j < col; j++) {
          Z[offset + i * col + j] = X[offset + i * col + j] + Y[offset + i * col + j];
        }
      }
    }

Figure 4.5: Code snippet of mat_add kernel in DeepSparse_UM version.

• DeepSparse_UM: Like DeepSparseGPU, we tile all computational kernels and express them as a DAG in DeepSparse_UM. But instead of using the MM, we rely on unified memory for automatic data movement between the host and the device. All associated matrices are allocated using cudaMallocManaged. They are accessed using is_device_ptr in the target offload pragmas, which allows the kernels to treat the data tiles as device pointers, and the runtime system automatically makes the data available on the GPU whenever needed. Figure 4.5 shows the code snippet of the mat_add kernel in the DeepSparse_UM version and Figure 4.6 shows the code snippet of the same kernel in our final DeepSparseGPU version with the memory manager.
    // in main program
    taskinfo = extractTaskInfo(task[i]);
    // preparing operands on the GPU (or CPU) using the Memory Manager
    operands = memoryManager(taskinfo);
    mat_add(operands);

    // mat_add
    void mat_add(double *device_memory, int offset1 /* src1 */, int offset2 /* src2 */,
                 int offset3 /* dst */, int blksz, int col)
    {
      int sz = blksz * col;
      #pragma omp target is_device_ptr(device_memory) \
          depend(in: device_memory[offset1:sz], device_memory[offset2:sz]) \
          depend(out: device_memory[offset3:sz]) nowait
      #pragma omp teams distribute parallel for collapse(2)
      for (int i = 0; i < blksz; i++)
        for (int j = 0; j < col; j++)
          device_memory[offset3 + i * col + j] =
              device_memory[offset1 + i * col + j] + device_memory[offset2 + i * col + j];
    }

Figure 4.6: Code snippet of mat_add kernel in DeepSparseGPU framework.

4.2.2 LOBPCG Evaluation

Our performance comparison criteria include the amount of Host-to-Device (H2D) and Device-to-Host (D2H) data transfer, and the average execution time per iteration, among the Libcsr_UM, DeepSparse_UM, DeepSparse_MAP and DeepSparseGPU implementations of the LOBPCG solver. The Memory Manager quantifies and keeps track of the H2D and D2H data transfer in the DeepSparseGPU framework. We use the NVIDIA nvprof profiler tool to measure the amount of H2D and D2H data transfer while using unified memory. In DeepSparse_MAP, we manually measure the total H2D and D2H data transfer based on the data mentioned in the map clauses.

The performance of DeepSparseGPU, DeepSparse_UM, and DeepSparse_MAP depends on the tile size. Choosing a small tile size creates a large number of fine-granularity tasks. This means we have to launch more kernels on the GPU, and there are overheads associated with launching kernels on the GPU. Also, a smaller tile size may lead to poor H2D and D2H transfer rates. Increasing the tile size reduces GPU execution overhead, as GPU execution heavily depends on data parallelism. Therefore, we use the tile size as an optimization parameter based on matrix dimensions and sparsity patterns. We experimented with different tile sizes of 32K, 64K, 128K, and 256K. We report the results of the best-performing tile size for DeepSparseGPU, DeepSparse_UM, and DeepSparse_MAP. In Libcsr_UM, we are not required to tile the input matrices as we used unified memory with cuSPARSE and cuBLAS library kernels from CUDA and OpenMP target offloaded application kernels.

4.2.2.1 Data Movement between Host and Device

Figure 4.7 shows the Host-to-Device (H2D) data transfer comparison among all four versions of the LOBPCG algorithm. LOBPCG is a complex algorithm, as it has several different kernel types, and its task graph results in a vast number of tasks to be launched on the GPU in each iteration. As can be seen in Figure 4.7, DeepSparseGPU always transfers less data from the host to the device compared to the other three versions. DeepSparseGPU transfers 1.18× - 1.94× less H2D data compared to the Libcsr_UM version, except for the sk_2005 matrix. DeepSparseGPU also transfers 1.25× - 2.59× less H2D data compared to the DeepSparse_UM version.

Figure 4.7: Average H2D data transfer per iteration.

Figure 4.8 shows the Device-to-Host (D2H) data transfer comparison between all four versions of the LOBPCG algorithm.
As can be seen in Figure 4.8, DeepSparseGPU always outperforms the other three versions in terms of the amount of D2H data transfer. DeepSparseGPU transfers 1.84× - 2.69× less D2H data compared to Libcsr_UM, 2.39× - 3.70× less D2H data compared to DeepSparse_UM and 1.95× - 3.12× less D2H data compared to DeepSparse_MAP. Considering the total amount of data transfer (H2D + D2H), DeepSparseGPU transfers 1.39× - 2.18× less data compared to Libcsr_UM, 1.89× - 2.79× less data compared to DeepSparse_UM and 2.92× - 5.01× less data compared to DeepSparse_MAP.

Figure 4.8: Average D2H data transfer per iteration.

From Figures 4.7 and 4.8, we can see that the main reason DeepSparseGPU transfers significantly less data is the explicit data management by the Memory Manager based on a good task scheduling heuristic. The task scheduling heuristic helps to maximize the utilization of the data while it resides in GPU memory. DeepSparse_MAP is the worst among all four versions with regard to H2D and D2H data transfer. This is expected because DeepSparse_MAP moves all inputs and outputs of a kernel in both the H2D and D2H directions during each kernel launch. Figures 4.7 and 4.8 show that the design of our task scheduling scheme and memory manager is robust and helps to minimize the data movement between host and device.

4.2.2.2 Execution Time

As can be seen in Figure 4.9, the significant reduction in H2D and D2H data transfer of DeepSparseGPU over the other three versions leads to significant execution time speedups in general. In particular, DeepSparseGPU achieves 1.37× - 1.86× speedup compared to the Libcsr_UM version, except for the Nm7, nlpkkt200 and sk_2005 matrices. We further investigated the reasons for the slower execution of DeepSparseGPU for these matrices. We found that our CSR format-based custom SpMM runs significantly slower than the highly optimized cusparseSpMM (a cuSPARSE library routine) used for the SpMM operation in the Libcsr_UM version. To be specific, our custom SpMM routine runs 1.47× - 2.58× slower for these three matrices compared to the cusparseSpMM routine in Libcsr_UM.

Figure 4.9: Average execution time per iteration.

We experimented with OpenMP target offload and CUDA implementations of the most expensive kernels. Figure 4.10 shows the performance comparison between the OpenMP target offload implementations and the CUDA implementations (cuBLAS, cuSPARSE) of the most expensive kernels, including the inner product (X^T Y), linear combination (XY), SpMM, and column-wise matrix reduction. The input size of all of these kernels was less than 16 GB. As we can see, the CUDA versions of these kernels are 2.13× - 12.85× faster compared to the OpenMP target offload versions. We use nvprof to get the execution time of the cuBLAS and cuSPARSE library kernels. OpenMP target offload implementations in different vendor-provided compilers are still at an early stage. Most compilers support only a subset of the OpenMP specification. Support should become more robust with time, and the performance of DeepSparseGPU should also improve on upcoming supercomputers with faster interconnects.

Figure 4.10: OpenMP vs CUDA execution time comparison for most expensive kernels.

4.2.2.3 Effect of Pinned Memory

It is important to note that even though DeepSparseGPU transfers less H2D and D2H data, the overall H2D data transfer rate in DeepSparseGPU is 4.49 GB/sec, whereas the average H2D rate is 6.34 GB/sec in Libcsr_UM for all test matrices.
The overall D2H data transfer rates are more striking: Libcsr_UM (12.13 GB/sec) achieves a 2.82× higher data transfer rate than DeepSparseGPU (4.29 GB/sec). Despite this lower data transfer rate and the worse SpMM performance mentioned above, DeepSparseGPU runs faster than the Unified Memory based Libcsr_UM. We had used pageable memory with DeepSparseGPU for the results shown in Figures 4.7, 4.8 and 4.9; therefore, we also experimented with pinned memory to achieve better bandwidth and see its effect on execution time. Figure 4.11 compares the execution time of DeepSparseGPU using pinned memory with the results shown in Figure 4.9. As can be seen, we achieve 1.45× - 1.98× execution time speedup using pinned memory with DeepSparseGPU compared to its pageable memory version, and up to 2.93× speedup compared to Libcsr_UM.

Figure 4.11: Performance of DeepSparseGPU using pinned memory.

4.2.2.4 Comparison with a CPU version

Figure 4.12 shows the execution time comparison between DeepSparseGPU (using pinned memory) and the baseline version running on the CPU for the bigger problem sizes. The CPU baseline implementation uses thread-parallel Intel MKL library calls (including SpMM) with CSR storage of the sparse matrix. As we can see, DeepSparseGPU runs slower in comparison. However, it should be noted that even with the pinned memory optimization above, we get only about 12 GB/sec of bandwidth between the CPU and GPU using a PCIe 3.0 interconnect. In contrast, the bandwidth between the CPU and DRAM is approximately 90 GB/sec. Another reason behind the poor GPU performance is that there are no mechanisms to use multiple streams in the OpenMP implementation yet, and asynchronous execution support through the nowait clause does not allow a robust and portable implementation. As such, data movement and computation cannot be overlapped by utilizing separate streams and nowait clauses in OpenMP pragmas.

Figure 4.12: CPU vs DeepSparseGPU execution time comparison (problem size > 16 GB).

Figure 4.13 shows the theoretical execution time of DeepSparseGPU with different types of interconnect, such as PCIe-Gen3 (12 GB/sec), PCIe-Gen4 (32 GB/sec), PCIe-Gen5 (64 GB/sec), and NVLINK-4 (900 GB/sec) [88]. For this experiment, we took the compute time of the V100 from Figure 4.12, and we calculated the data movement time by dividing the total amount of data transfer in Figure 4.12 by 80% of the peak bandwidth for each type of interconnect, to simulate the complete overlap of data movement and computation. As we can see from Figure 4.13, the theoretical execution time with PCIe-Gen3 is still on par with the CPU time, which means that no amount of software optimization can make DeepSparseGPU performance beat the CPU performance. However, we can see that the theoretical times of DeepSparseGPU with PCIe-Gen4, PCIe-Gen5 and NVLINK-4 beat the CPU time. The above simulation indicates that with the availability of platforms with faster interconnects (some such platforms currently exist, but that hardware has not been available for us to experiment with), DeepSparseGPU would have real merit in accelerating actual machine learning or scientific computing workloads. Figure 4.14, which shows the execution time comparison between DeepSparseGPU and the CPU baseline when the application working set fits into GPU memory (Table 4.2), provides further evidence in this direction.
Figure 4.13: CPU vs DeepSparseGPU execution time comparison with different CPU-GPU interconnects.

As can be seen, DeepSparseGPU achieves 1.77× - 4.87× speedup compared to the CPU baseline implementation when the total memory footprint is less than 16 GB, i.e., when data movement is not a big bottleneck. We anticipate that the availability of better interconnects and better GPU acceleration support in OpenMP will lessen the impact of the data movement bottleneck we observed in this study.

Figure 4.14: CPU vs DeepSparseGPU execution time comparison using problem size < 16 GB.

4.3 Discussion

OpenMP target offload implementations in different vendor-provided compilers are still at an early stage. Most compilers support only a subset of the OpenMP specification. We noticed a big performance gap between the OpenMP target offload and CUDA library based implementations of the most expensive kernels.

4.3.1 Usage of Data Prefetching and Multiple Streams

The reason behind the much higher H2D and D2H rates in the Libcsr_UM version is that unified memory internally uses multiple streams and prefetches data. Such support is not available in OpenMP target offloading implementations. As such, we see in Section 4.2 that the H2D and D2H transfer rates are significantly higher in Libcsr_UM compared to DeepSparseGPU. If such support becomes available in OpenMP implementations, the performance of DeepSparseGPU would improve even further.

4.4 Summary of the Chapter

In this work, we introduced a tiling-based sparse solver framework for heterogeneous architectures that aims to minimize data transfer between host and device to achieve better performance. We showed that our framework transfers significantly less data between host and device. Our framework also improves the execution time over the UM-based baseline implementation when using pinned memory. In our future work, we will focus on improving the efficiency of data transfers in DeepSparseGPU and optimizing the performance of sparse matrix kernels. To this end, we plan to design and implement a DAG partitioner that would help minimize the data movement between CPU and GPU. We also plan to use a 2D decomposition of the SpMM operation, which would expose more parallelism in the computation. We also plan to explore the feasibility of implementing DeepSparseGPU using CUDA Graphs, as it seems a good fit for our design. In the next chapter, we will discuss the details of our exploration of task parallelism using CUDA Graph.

CHAPTER 5
EXPLORING FINE-GRAINED TASK PARALLELISM USING CUDA GRAPH

The computing capabilities of GPUs have been increasing continuously over the last decade to address the growing computational needs of HPC and the data science landscape. Modern GPUs are so fast that, in many cases, the time to submit an operation (e.g., a kernel or memory copy) to the GPU dominates the actual time (measured in microseconds) taken by the operation itself. For example, in DeepSparseGPU, we submit thousands of operations on the tiled data to the GPU at each iteration. Each operation is launched on the GPU separately and typically completes quickly. Empirically, we found that the combined work submission overheads cause an overall degradation in performance. Recently, CUDA has introduced an asynchronous task-graph programming model, CUDA Graph, to enable efficient launch and execution of GPU work [89]. Unlike conventional CUDA kernel execution, CUDA Graph separates the initialization from the execution [90].
CUDA Graph reduces the overall overhead by doing the device set-up, device initialization, kernel dependency analysis, and other expensive operations during the initialization phase. Furthermore, CUDA Graph significantly improves performance by launching any number of kernels in a single operation [91, 92].

After experiencing the poor performance of the OpenMP target offload implementations of the most expensive computational kernels (Figure 4.10 in Chapter 4) in DeepSparseGPU, we chose to implement our task-parallel DeepSparseGPU framework using the CUDA programming model. In Chapter 4, we saw that our topological sorting heuristic with the global task graph worked well in DeepSparseGPU. However, we envision using parallel decomposition techniques (instead of only a heuristic) that aim to maximize data locality. Our vision is to partition the global task graph and execute it in chunks to maximize data locality and minimize data movement. However, the 1D tiling of the sparse matrix does not give the partitioning algorithms enough flexibility. Therefore, we envision decomposing the sparse matrix using 2D tiling. With 2D tiling, the number of kernel launches becomes a problem due to the many small kernels, and that is where we anticipate the CUDA Graph technology to be immensely useful.

In this work, we have performed preliminary work on the use and merits of CUDA Graph to implement and optimize DeepSparseGPU on GPUs, as it seems to be a good fit for the DeepSparseGPU framework with all CUDA kernels. We left the 2D tiling and graph partitioning on the tiles as future work. Our contributions in this study can be summarized as follows:

• We demonstrate that the global task graph of the DeepSparseGPU framework can be implemented using CUDA Graph.

• We show that CUDA Graph helps to reduce the kernel launch overhead in the DeepSparseGPU framework. Our CUDA Graph based DeepSparseGPU implementation runs up to 1.13× faster than a baseline version built on CUDA kernels and the cuBLAS and cuSPARSE library routines.

5.1 CUDA Graph

CUDA Graph is a new programming model introduced in CUDA 10 that allows a define-once, run-repeatedly execution flow by following a three-stage execution model: graph definition, graph instantiation, and graph execution.

Figure 5.1: Toy task dependency graph

The CUDA Graph API offers two modes of defining the graph: manual mode and stream capture mode. Graph creation starts with the cudaGraphCreate routine, which creates an empty graph. In manual mode, the user creates nodes (representing GPU operations) and specifies dependencies and parameters. For example, cudaGraphAddKernelNode creates a node that is associated with a CUDA kernel and its input and output parameters. cudaGraphAddMemcpyNode creates a memory copy node that performs data movement between host and device. A graph is defined by adding nodes with their corresponding dependencies. CUDA Graph exploits the explicit dependency information in manual mode for optimizing and scheduling the graph execution. Listing 5.1 depicts the CUDA Graph implementation of the toy task graph in Figure 5.1 in manual mode.
    cudaGraph_t G;
    std::vector<cudaGraphNode_t> nodeDependencies;
    cudaGraphNode_t h2d_1, h2d_2, d2h, kernelNode;
    cudaKernelNodeParams kernelNodeParams = {0};

    // Step-1: Defining graph G
    checkCudaErrors(cudaGraphCreate(&G, 0));
    checkCudaErrors(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

    // creating H2D (a -> d_a) node, assuming params is set properly
    checkCudaErrors(cudaGraphAddMemcpyNode(&h2d_1, G, NULL, 0, &params));
    nodeDependencies.push_back(h2d_1);

    // creating H2D (b -> d_b) node, assuming params is set properly
    checkCudaErrors(cudaGraphAddMemcpyNode(&h2d_2, G, NULL, 0, &params));
    nodeDependencies.push_back(h2d_2);

    // creating addKernel node (d_c = d_a + d_b), assuming kernelNodeParams is set properly
    kernelNodeParams.func = (void *)addKernel;
    void *kernelArgs[4] = {(void *)&dev_c, (void *)&dev_a, (void *)&dev_b, &size};
    kernelNodeParams.kernelParams = kernelArgs;
    checkCudaErrors(cudaGraphAddKernelNode(&kernelNode, G, nodeDependencies.data(),
                                           nodeDependencies.size(), &kernelNodeParams));
    nodeDependencies.clear();

    // creating D2H (d_c -> c) node, assuming params is set properly
    checkCudaErrors(cudaGraphAddMemcpyNode(&d2h, G, &kernelNode, 1, &params));

    // Step-2: Instantiating G
    cudaGraphExec_t graphExec;
    checkCudaErrors(cudaGraphInstantiate(&graphExec, G, NULL, NULL, 0));

    // Step-3: Executing G
    for (int i = 0; i < loopCount; ++i) {
        checkCudaErrors(cudaGraphLaunch(graphExec, stream));
        checkCudaErrors(cudaStreamSynchronize(stream));
    }

    // cleaning up
    checkCudaErrors(cudaGraphExecDestroy(graphExec));
    checkCudaErrors(cudaGraphDestroy(G));

Listing 5.1: Snippet of CUDA Graph corresponding with the DAG shown in Figure 5.1

In manual mode, the user needs to create nodes and edges explicitly and know the parameters upfront to create the graph. However, kernel execution parameters are not available beforehand for cuBLAS and cuSPARSE library routines. To address this problem, CUDA Graph offers a stream capture mode that creates the graph implicitly. In stream capture mode, a graph is created by capturing an existing CUDA stream-based implementation. This is realized by placing the cudaStreamBeginCapture and cudaStreamEndCapture routines at the beginning and end of the existing implementation, respectively. In this case, node dependencies are obtained from stream and event synchronization. In stream capture mode, kernels that run on the same stream are serialized, and kernels that run on different streams can be executed in parallel. Once the graph is defined, the following two steps instantiate the constructed graph and run the executable graph on the GPU using a single function call. The CUDA Graph API also allows users to update a GPU kernel's execution parameters or a node in the graph.

5.2 Implementing DeepSparseGPU Task Graph using CUDA Graph

We construct the global task graph of the DeepSparseGPU framework using CUDA Graph. We take the LOBPCG eigensolver as an example application. We give the pseudocode for the LOBPCG solver in Algorithm 3. It involves kernels with high arithmetic intensities, such as SpMM and several level-3 BLAS kernels, and other application kernels (e.g., matrix addition, subtraction, multiplication).
We tile each of the data structures and operations according to the user-provided tile size to build the LOBPCG task graph. As LOBPCG involves SpMM along with multiple linear combination and inner product operations, we use both manual and stream capture modes to create the global task graph of the LOBPCG algorithm. We use manual mode to create the nodes and dependencies related to the application kernels, as their parameters are known beforehand. We use cuBLAS and cuSPARSE library routines for the linear combination, inner product, and SpMM operations. However, the kernel launch parameters of these vendor libraries are not available. To overcome this issue, we use stream capture mode to create a standalone graph for each of the cuBLAS and cuSPARSE library routine calls and add the standalone graph to the main graph as a sub-graph node using the cudaGraphAddChildGraphNode routine, with proper dependencies. Listing 5.2 shows an example snippet of how we mix stream capture and manual mode while defining the LOBPCG global task graph.

    // mainGraph
    cudaGraph_t mainGraph;

    // capturing cuBLAS routine call using stream capture mode
    vector<cudaGraph_t> xyGraph_1(nrowblk);
    vector<cudaGraphNode_t> xyNode_1(nrowblk);
    for (block_id = 0; block_id < nrowblk; block_id++) {
        blksz = block_width;
        if (block_id * block_width + blksz > M)
            blksz = M - block_id * block_width;

        // stream capture call starts
        checkCudaErrors(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
        // cuBLAS (cublasDgemm) call starts
        cubstat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, blocksize, blksz,
                              blocksize, &cudaAlpha, lambda, blocksize,
                              X + (block_id * block_width * blocksize), blocksize,
                              &cudaBeta, R + (block_id * block_width * blocksize), blocksize);
        // stream capture ends and creation of standalone XY graph
        checkCudaErrors(cudaStreamEndCapture(stream1, &xyGraph_1[block_id]));
        // Adding the standalone graph to the mainGraph as a sub-graph node
        // with proper dependencies
        checkCudaErrors(cudaGraphAddChildGraphNode(&xyNode_1[block_id], mainGraph,
                                                   &setZeroNode, 1, xyGraph_1[block_id]));
    }

    // creating nodes for subtraction operation using manual mode
    vector<cudaGraphNode_t> subNode_1(nrowblk);
    vector<cudaKernelNodeParams> params(nrowblk);
    for (block_id = 0; block_id < nrowblk; block_id++) {
        // creating sub nodes, assuming params is set properly
        // adding incoming edges from the stream captured XY sub-graph node
        checkCudaErrors(cudaGraphAddKernelNode(&subNode_1[block_id], mainGraph,
                                               &xyNode_1[block_id], 1, &params[block_id]));
    }

Listing 5.2: Snippet of CUDA Graph showing mixture of stream capture and manual mode

Once the global task graph is defined, we instantiate it and run it for every iteration.

Table 5.1: Matrices used in our evaluation
Matrix            Rows        Columns     Nonzeros
inline_1          503,712     503,712     36,816,342
Queen_4147        4,147,110   4,147,110   329,499,284
dielFilterV3real  1,102,824   1,102,824   89,306,020
HV15R             2,017,169   2,017,169   281,419,743
Bump_2911         2,911,419   2,911,419   127,729,899

5.3 Performance Evaluation

5.3.1 Experimental Setup

We conducted all of our experiments on Perlmutter Phase I at the National Energy Research Scientific Computing Center (NERSC) [93]. Each compute node has a single 64-core AMD EPYC 7763 (Milan) processor clocked at 2.0 GHz and 4 NVIDIA Ampere A100 GPUs with ~40 GB of HBM per GPU.
The A100 GPU model has a peak double-precision performance of 9.7 TFLOP/s. There is a total of 256 GB of DDR4 DRAM space on each node. The CPUs are connected to the GPUs via four PCIe 4.0 switches, and the GPUs are connected to each other via NVIDIA's NVLink-3 interconnect. Perlmutter provides the NVIDIA HPC SDK 21.11 suite to compile CUDA programs. In addition, the GPU-accelerated cuSPARSE and cuBLAS libraries provided with NVIDIA HPC SDK 21.11 are used to compile our baseline implementation. We take a full Perlmutter node for our experiments to avoid a noisy environment created by sharing the same node with other users.

5.3.2 Dataset

We selected 5 square matrices with varying sizes, sparsity patterns, and domains from the SuiteSparse Matrix Collection (see Table 5.1). For our preliminary study, we chose an appropriate number of vector blocks to fit the full problem size in the GPU memory. The problem size ranges from 2.89 GB to 9.01 GB. Performance data for LOBPCG is averaged over the last three of five iterations.

5.3.3 Baseline Implementation

We compare the performance of the LOBPCG CUDA Graph implementation with a baseline implementation that we call libcsr, which is an implementation of the LOBPCG solver using the GPU-accelerated cuSPARSE and cuBLAS libraries for the SpMM (with CSR storage of the sparse matrix), inner product, and linear combination kernels. The application kernels are implemented on the GPU using CUDA. Application kernels include tall-skinny matrix operations such as addition, subtraction, element-wise multiplication, etc. We use NVIDIA's unified memory technology (using the cudaMallocManaged function) in both (CUDA Graph and libcsr) implementations. As such, the runtime system automatically takes care of the data movement between the host and device. We use a single stream for both implementations.

5.3.4 LOBPCG Evaluation

Our performance comparison criterion is the average execution time per iteration of libcsr (CUDA) and the CUDA Graph based DeepSparseGPU with unified memory. For the LOBPCG global task graph, we experimented with different tile sizes of 8K, 16K, and 32K. We report the results of the best-performing tile size for DeepSparseGPU. In libcsr, we are not required to tile the input matrices as we used unified memory with cuSPARSE and cuBLAS library kernels from CUDA, and CUDA C++ for the application kernels. Figure 5.2 shows the execution time comparison between both implementations. As we can see, the CUDA Graph based DeepSparseGPU implementation runs 1.08× - 1.13× faster than the libcsr implementation. We perform thousands of kernel launches in the CUDA Graph based DeepSparseGPU implementation. Our preliminary result shows that CUDA Graph execution effectively minimizes the kernel launch overhead.

Figure 5.2: libcsr (CUDA) vs DeepSparseGPU (CUDA Graph) execution time comparison.

5.4 Summary of the Chapter

In this work, we explored the feasibility of using CUDA Graph to implement DeepSparseGPU with unified memory. We used both implicit and explicit graph creation in order to create the LOBPCG global task graph. Our preliminary result shows that CUDA Graph effectively minimizes kernel launch overhead. Our future work includes incorporating our custom memory manager into the CUDA Graph-based DeepSparseGPU implementation, finding an optimal task scheduling order with a 2D decomposition of the SpMM operation, and choosing the optimal number of streams.
CHAPTER 6
CONCLUSION AND FUTURE WORK

In this thesis, we aimed to accelerate sparse matrix computations in several ways by minimizing data movement between memory layers. First, we introduced a novel task-parallel framework named DeepSparse, aiming to optimize the computational steps in a sparse solver rather than a single kernel such as SpMV or SpMM. DeepSparse automatically generates and expresses the entire computation as a task dependency graph (TDG) and relies on OpenMP tasks to execute this TDG. We observed a reduction in runtime and a significant reduction in cache misses at all levels of memory using our task-parallel DeepSparse framework.

With the success of the single shared memory node implementation of DeepSparse, we aimed to port and optimize it on GPU-accelerated systems using OpenMP target offload. However, before diving straight into the implementation, we wanted to evaluate the performance of directive-based GPU programming models (i.e., OpenMP target offload and OpenACC) on the LOBPCG block eigensolver with consideration of large sparse matrices. We achieved good speedups on the GPU using OpenMP and OpenACC over a baseline CPU implementation. Inspired by the promising results of the evaluation study, we ported and optimized the DeepSparse framework on the GPU using OpenMP target offload. To this end, we designed and implemented a Memory Manager to process input sizes that exceed GPU memory limits. We have seen that our GPU implementation of DeepSparse with the memory manager greatly reduces the data movement between CPU and GPU using topological-sort based task scheduling. We plan to use a 2D decomposition of the sparse matrix to create the DAG, and we believe that 2D decomposition would give better results as it exposes more parallelism than 1D decomposition. We also plan to use a graph partitioner to partition the DAG with a tight memory bound on the active data so that an entire partition fits into GPU memory. We would also like to incorporate our custom memory manager and optimal scheduling order into our CUDA Graph-based DeepSparseGPU implementation.

BIBLIOGRAPHY

[1] S. Bogner, A. Bulgac, J. Carlson, J. Engel, G. Fann, R. J. Furnstahl, S. Gandolfi, G. Hagen, M. Horoi, C. Johnson et al., "Computational nuclear quantum many-body problem: The unedf project," Computer Physics Communications, vol. 184, no. 10, pp. 2235–2250, 2013.

[2] D. Keyes, A. Ecer, J. Periaux, N. Satofuka, and P. Fox, "Towards realistic performance bounds for implicit cfd codes," in Parallel Computational Fluid Dynamics: Towards Teraflops, Optimization, and Novel Formulations: Proceedings of the Parallel CFD'99 Conference. North-Holland, 2000, p. 241.

[3] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in python," Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.

[4] N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey, "Navigating the maze of graph analytics frameworks using massive graph datasets," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014, pp. 979–990.

[5] J. Dongarra, S. Gottlieb, and W. T. Kramer, "Race to exascale," Computing in Science & Engineering, vol. 21, no. 1, pp. 4–5, 2019.

[6] TOP500, "Top500 supercomputers," http://www.top500.org.

[7] J. M. Diaz, S. Pophale, K. Friedline, O. Hernandez, D. E. Bernholdt, and S.