EFFICIENT AND PORTABLE SPARSE SOLVERS FOR HETEROGENEOUS HIGH
               PERFORMANCE COMPUTING SYSTEMS
                                     By
                              Md Fazlay Rabbi
                             A DISSERTATION
                                 Submitted to
                         Michigan State University
                 in partial fulfillment of the requirements
                              for the degree of
                Computer Science – Doctor of Philosophy
                                    2022


                                            ABSTRACT
      EFFICIENT AND PORTABLE SPARSE SOLVERS FOR HETEROGENEOUS HIGH
                            PERFORMANCE COMPUTING SYSTEMS
                                                 By
                                          Md Fazlay Rabbi
Sparse matrix computations arise in the form of the solution of systems of linear equations, matrix
factorization, linear least-squares problems, and eigenvalue problems in numerous computational
disciplines ranging from quantum many-body problems, computational fluid dynamics, machine
learning and graph analytics. The scale of problems in these scientific applications typically
necessitates execution on massively parallel architectures. Moreover, due to the irregular data access
patterns and low arithmetic intensities of sparse matrix computations, achieving high performance
and scalability is very difficult. These challenges are further exacerbated by the increasingly
complex deep memory hierarchies of the modern architectures as they typically integrate several
layers of memory storage. Data movement is an important bottleneck against efficiency and energy
consumption in large-scale sparse matrix computations. Minimizing data movement across layers
of the memory and overlapping data movement with computations are keys to achieving high
performance in sparse matrix computations.
    My thesis work contributes towards systematically identifying algorithmic challenges of the
sparse solvers and providing optimized and high performing solutions for both shared memory
architectures and heterogeneous architectures by minimizing data movements between different
memory layers. For this purpose, we first introduce a shared memory task-parallel framework
focusing on optimizing the entire solvers rather than a specific kernel. As most of the recent (or
upcoming) supercomputers are equipped with Graphics Processing Unit (GPU), we decided to
evaluate the efficacy of the directive-based programming models (i.e., OpenMP and OpenACC)
in offloading computations on GPU to achieve performance portability. Being inspired by the
promising results of this work, we port and optimize our shared memory task-parallel framework
on GPU accelerated systems to execute problem sizes that exceed device memory.


Copyright by
MD FAZLAY RABBI
2022


This thesis is dedicated to my family.
                  iv


                                   ACKNOWLEDGEMENTS
This thesis would not have been possible without the help and support of many people. First and
foremost, I would like to thank my advisor Dr. Hasan Metin Aktulga for his guidance and support
throughout my Ph.D. study. Dr. Aktulga not only taught me how to become a good researcher
but also provided me with a perfect example of balancing work and family: being a researcher, a
husband, and a father. Dr. Aktulga is also a constant source of ideas and insights. He is always
someone I can look up to, and I was very fortunate to have him as a mentor. I will emulate his
humility and kindness in every step of my life.
    Besides my advisor, I would like to thank Dr. Sandeep Kulkarni, Dr. Brian O’Shea, and Dr.
Ümit V. Çatalyürek for taking their time to serve on my thesis committee and providing me with
useful feedback. I am thankful to our collaborators, Dr. Nicholas Wright and Christopher Daley
from National Energy Research Scientific Computing Center (NERSC). Their frequent exchange of
ideas helped shape my research. I would like to thank Michigan State University and my program
for giving me the opportunity to be a proud Spartan. I hope our football team keeps beating the
University of Michigan until the end of the world!
    Surviving graduate school would have been impossible without the support and encouragement
of many fellow graduate students. Over the years, the friends I made here at MSU became my
family. I consider myself lucky to have such incredible people around me, both inside and outside
the lab. Without these friends, my years at MSU would not have been nearly as enjoyable. I thank
my labmates for the interesting discussions we had about life, research, job search and everything
else. I am also grateful to Abdullah for his time in reviewing this dissertation.
    I would like to thank my parents (Ansar Ali and Fatema Khatun) and family for their constant
support throughout my educational journey. Without them, I would not have been where I am now.
My elder brother Altaf Hossain has always been there for me whenever I need him. A special
thanks to my son Taahir Ruzaif for the immense joy he brought me. I will be forever grateful to my
family.
                                                  v


    Last, because she deserves a special place, I would like to thank my wife Tanzila for waiting
patiently for me to graduate. She is a true superwoman in my life who has been managing everything
for me just to give me enough time to cross the finish line. All my accomplishments would have
been impossible without her support. Her love and unwavering support kept me going during my
challenging time. This Ph.D. is as much hers as it is mine.
    I will miss the beautiful MSU campus. Forever and always - Go Green! Go White!!
                                                  vi


                                 TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   xi
LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . .            . . . . . . . . . . .  1
   1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  1
   1.2 Contributions of This Thesis . . . . . . . . . . . . . . . . . .   . . . . . . . . . . .  3
   1.3 DeepSparse Framework . . . . . . . . . . . . . . . . . . . . .     . . . . . . . . . . .  5
   1.4 Evaluation of Directive based Programming Models on GPUs .         . . . . . . . . . . .  6
   1.5 DeepSparseGPU Framework . . . . . . . . . . . . . . . . . .        . . . . . . . . . . .  8
   1.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . .   . . . . . . . . . . .  9
CHAPTER 2    DEEPSPARSE: A TASK-PARALLEL FRAMEWORK FOR SPARSE
             SOLVERS ON SHARED MEMORY ARCHITECTURES . . . . . . . .                         . . 11
   2.1 DeepSparse Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    . . 13
       2.1.1 Primitive Conversion Unit (PCU) . . . . . . . . . . . . . . . . . . . . .      . . 15
              2.1.1.1 Task Identifier (TI) . . . . . . . . . . . . . . . . . . . . . . .    . . 15
              2.1.1.2 Task Dependency Graph Generator (TDGG) . . . . . . . . .              . . 16
       2.1.2 Task Executor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    . . 17
       2.1.3 Illustrative Example . . . . . . . . . . . . . . . . . . . . . . . . . . . .   . . 19
       2.1.4 Limitations of the Task Executor . . . . . . . . . . . . . . . . . . . . .     . . 22
   2.2 Benchmark Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     . . 22
       2.2.1 Lanczos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    . . 23
       2.2.2 LOBPCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       . . 23
   2.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   . . 24
       2.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . .     . . 24
       2.3.2 LOBPCG Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . .        . . 26
       2.3.3 Lanczos Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . .     . . 26
       2.3.4 Compiler Comparison . . . . . . . . . . . . . . . . . . . . . . . . . .        . . 28
   2.4 Summary of the Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     . . 30
CHAPTER 3    EVALUATION OF DIRECTIVE-BASED GPU PROGRAMMING MODEL
             ON A BLOCK ITERATIVE SOLVER . . . . . . . . . . . . . . . . . . .                . 31
   3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    . 34
       3.1.1 The LOBPCG algorithm . . . . . . . . . . . . . . . . . . . . . . . . . .         . 34
       3.1.2 Baseline CPU implementation . . . . . . . . . . . . . . . . . . . . . . .        . 34
       3.1.3 A GPU implementation of LOBPCG . . . . . . . . . . . . . . . . . . . .           . 35
       3.1.4 Tiling LOBPCG kernels to fit in GPU memory capacity . . . . . . . . . .          . 37
              3.1.4.1 SpMM kernel . . . . . . . . . . . . . . . . . . . . . . . . . . .       . 37
              3.1.4.2 Inner product kernel . . . . . . . . . . . . . . . . . . . . . . .      . 38
                                              vii


      3.1.5   Hardware and software environment . . . . . . . . . . . . . . . . . . .       . . 40
      3.1.6   Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   . . 41
              3.1.6.1 Performance of the LOBPCG solver . . . . . . . . . . . . . .          . . 41
              3.1.6.2 Performance of X T Y and SpMM kernels for large matrices .            . . 42
              3.1.6.3 Performance of tiled and Unified Memory versions of SpMM              . . 42
  3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
      3.2.1 Performance of the LOBPCG solver . . . . . . . . . . . . . . . . . . .          . . 43
      3.2.2 Performance of X T Y and SpMM kernels for large matrices . . . . . .            . . 46
      3.2.3 Performance of tiled and Unified Memory versions of SpMM . . . . . .            . . 48
  3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . 49
  3.4 Summary of the Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      . . 52
CHAPTER 4   A PORTABLE SPARSE SOLVER FRAMEWORK FOR LARGE MA-
            TRICES ON HETEROGENEOUS ARCHITECTURES . . . . . . . . .                         . . 53
  4.1 DeepSparseGPU Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . .        . . 56
      4.1.1 Primitive Conversion Unit (PCU) . . . . . . . . . . . . . . . . . . . . .       . . 56
              4.1.1.1 Task Identifier (TI) . . . . . . . . . . . . . . . . . . . . . . .    . . 56
              4.1.1.2 Task Dependency Graph Generator (TDGG) . . . . . . . . .              . . 57
      4.1.2 Task Executor (TE) . . . . . . . . . . . . . . . . . . . . . . . . . . . .      . . 58
              4.1.2.1 Kernel Launcher (KL) . . . . . . . . . . . . . . . . . . . . .        . . 58
              4.1.2.2 Memory Manager (MM) . . . . . . . . . . . . . . . . . . . .           . . 60
  4.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    . . 65
      4.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . .      . . 65
              4.2.1.1 Benchmark Application . . . . . . . . . . . . . . . . . . . .         . . 65
              4.2.1.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     . . 66
              4.2.1.3 Baseline Implementation . . . . . . . . . . . . . . . . . . . .       . . 66
              4.2.1.4 Incremental Implementation of DeepSparseGPU . . . . . . .             . . 67
      4.2.2 LOBPCG Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . .         . . 69
              4.2.2.1 Data Movement between Host and Device . . . . . . . . . . .           . . 70
              4.2.2.2 Execution Time . . . . . . . . . . . . . . . . . . . . . . . . .      . . 71
              4.2.2.3 Effect of Pinned Memory . . . . . . . . . . . . . . . . . . .         . . 73
              4.2.2.4 Comparison with a CPU version . . . . . . . . . . . . . . . .         . . 74
  4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . 76
      4.3.1 Usage of Data Prefetching and Multiple Streams . . . . . . . . . . . . .        . . 76
  4.4 Summary of the Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      . . 77
CHAPTER 5   EXPLORING FINE-GRAINED TASK PARALLELISM USING CUDA
            GRAPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     . . 79
  5.1 CUDA Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      . . 80
  5.2 Implementing DeepSparseGPU Task Graph using CUDA Graph . . . . . . . .                . . 83
  5.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    . . 85
      5.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . .      . . 85
      5.3.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     . . 85
      5.3.3 Baseline Implementation . . . . . . . . . . . . . . . . . . . . . . . . .       . . 86
      5.3.4 LOBPCG Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . .         . . 86
                                              viii


   5.4 Summary of the Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
CHAPTER 6  CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . 88
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
                                            ix


                                       LIST OF TABLES
Table 2.1: Major data structures after parsing third line. . . . . . . . . . . . . . . . . . . . 19
Table 2.2: Matrices used in our evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Table 3.1: Test Matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Table 4.1: Matrices used to evaluate problem size > 16 GB. . . . . . . . . . . . . . . . . . 66
Table 4.2: Matrices used to evaluate problem size < 16 GB. . . . . . . . . . . . . . . . . . 66
Table 5.1: Matrices used in our evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 85
                                                 x


                                        LIST OF FIGURES
Figure 1.1: Memory hierarchy in deep memory architectures. . . . . . . . . . . . . . . . .          2
Figure 1.2: Applications of sparse linear algebra. . . . . . . . . . . . . . . . . . . . . . . .    9
Figure 2.1: Schematic overview of DeepSparse. . . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 2.2: Overview of input output matrices partitioning of task-based matrix multipli-
             cation kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Figure 2.3: Overview of matrices partitioning of task-based SpMM kernel. . . . . . . . . . 20
Figure 2.4: Overview of matrices partitioning of task-based inner product kernel. . . . . . . 21
Figure 2.5: Task graph for the psudocode in listing 2.2. . . . . . . . . . . . . . . . . . . . . 21
Figure 2.6: A sample task graph for the LOBPCG algorithm using a small sparse matrix. . . 24
Figure 2.7: Comparison of L1, L2, LLC misses and execution times between Deepsparse,
             libcsb and libcsr for the LOBPCG solver. . . . . . . . . . . . . . . . . . . . . . 27
Figure 2.8: LOBPCG single iteration execution flow graph of dielFilterV3real. . . . . . . . 27
Figure 2.9: Comparison of L1, L2, LLC misses and execution times between Deepsparse,
             libcsb and libcsr for the Lanczos solver. . . . . . . . . . . . . . . . . . . . . . . 28
Figure 2.10: Comparison of execution time for different compilers between Deepsparse
             , libcsb and libcsr for Lanczos Algorithm. (Blue/Left: GNU, Red/Middle:
             Intel, Green/Right: Clang compiler.) . . . . . . . . . . . . . . . . . . . . . . . 29
Figure 2.11: Cache Miss comparison between compilers for HV15R . . . . . . . . . . . . . 29
Figure 3.1: The use of is_device_ptr to avoid memory copies. Error checking is
             omitted for brevity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 3.2: Overview of tiling SpMM operation. . . . . . . . . . . . . . . . . . . . . . . . 38
Figure 3.3: Overview of tiling Inner Product kernel . . . . . . . . . . . . . . . . . . . . . . 39
Figure 3.4: The time spent in LOBPCG on Cori-GPU and Summit when using various
             compilers with either OpenMP or OpenACC . . . . . . . . . . . . . . . . . . . 44
                                                 xi


Figure 3.5: The time spent in LOBPCG on Cori-GPU when using matrix Nm7 . . . . . . . 45
Figure 3.6: LOBPCG GPU speedup on Cori-GPU for each test matrix . . . . . . . . . . . . 46
Figure 3.7: Time spent in X T Y kernel on Cori-GPU and Summit when the memory
             footprint exceeds GPU memory capacity. . . . . . . . . . . . . . . . . . . . . . 47
Figure 3.8: Time spent in SpMM kernel on Cori-GPU and Summit when the memory
             footprint exceeds GPU memory capacity . . . . . . . . . . . . . . . . . . . . . 47
Figure 3.9: Time spent in tiled and Unified Memory versions of the SpMM kernel on
             Cori-GPU and Summit. The memory footprint is less than GPU memory capacity. 48
Figure 3.10: Time spent in tiled and Unified Memory versions of the SpMM kernel on Cori-
             GPU and Summit. The memory footprint exceeds GPU memory capacity.
             We use a logarithmic scale on the Time (sec) axis to capture the slow run time
             for the Unified Memory configuration on Summit. . . . . . . . . . . . . . . . . 49
Figure 3.11: Unified Memory nvprof profile of the XY microbenchmark on Cori-GPU
             (top) and Summit (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Figure 4.1: Schematic overview of DeepSparseGPU framework. . . . . . . . . . . . . . . . 57
Figure 4.2: Data structures used in managing tile information in Memory Manager.         . . . . 61
Figure 4.3: Illustrative example of how MM works . . . . . . . . . . . . . . . . . . . . . . 64
Figure 4.4: Code snippet of mat_add kernel in DeepSparse_MAP version. . . . . . . . . . 67
Figure 4.5: Code snippet of mat_add kernel in DeepSparse_UM version. . . . . . . . . . . 68
Figure 4.6: Code snippet of mat_add kernel in DeepSparseGPU framework. . . . . . . . . 69
Figure 4.7: Average H2D data transfer per iteration. . . . . . . . . . . . . . . . . . . . . . 70
Figure 4.8: Average D2H data transfer per iteration. . . . . . . . . . . . . . . . . . . . . . 71
Figure 4.9: Average execution time per iteration. . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 4.10: OpenMP vs CUDA execution time comparison for most expensive kernels. . . . 73
Figure 4.11: Performance of DeepSparseGPU using pinned memory . . . . . . . . . . . . . 74
Figure 4.12: CPU vs DeepSparseGPU execution time comparison (problem size > 16 GB) . 75
                                                xii


Figure 4.13: CPU vs DeepSparseGPU execution time comparison with different CPU-GPU
             interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Figure 4.14: CPU vs DeepSparseGPU execution time comparison using problem size < 16 GB 77
Figure 5.1: Toy task dependency graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Figure 5.2: libcsr (CUDA) vs DeepSparseGPU (CUDA Graph) execution time comparison . 87
                                                xiii


                                 LIST OF ALGORITHMS
Algorithm 1: SpMM Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Algorithm 2: Lanczos Algorithm in Exact Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Algorithm 3: LOBPCG Algorithm (for simplicity, without a preconditioner) . . . . . . . . . . . . 24
Algorithm 4: Tiled SpMM (cusparseDcsrmm) kernel . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Algorithm 5: Tiled Inner Product (cublasDgemm) Kernel . . . . . . . . . . . . . . . . . . . . . . . . 39
                                                xiv


                                           CHAPTER 1
                                        INTRODUCTION
1.1    Motivation
    Sparse matrix computations arise in the form of the solution of systems of linear equations,
matrix factorizations, linear least-squares problems, and eigenvalue problems in numerous com-
putational disciplines ranging from quantum many-body problems [1] and computational fluid
dynamics [2] to machine learning [3] and graph analytics [4]. Most of the scientific applications of
these domains involve large-scale sparse matrices and require running them on massively parallel
computing architectures. Moreover, sparse matrices are often irregularly structured and come in
different forms and properties depending on the application area. Achieving high performance and
scalability in sparse matrix computations is challenging due to the irregular data access patterns
and low arithmetic intensities. Furthermore, due to the irregular access patterns of the sparse
matrix computations, the computation times are often bound by data movement speeds between
different memory layers or different processors rather than the compute resources. These chal-
lenges become more complicated on modern computing architectures as the memory hierarchies
on these architectures are becoming increasingly complex. Figure 1.1 shows an abstract view of
the assumed underlying memory hierarchy in modern architectures. This deep memory hierarchy
brings different bandwidths and latency depending on the proximity between the cores and storage
devices. As we go farther away from the processor, memory capacity increases at the expense
of increased latency and reduced bandwidth. Therefore, minimizing data movement across lay-
ers of the memory and overlapping data movement with computations are keys to achieving high
performance in sparse matrix computations.
    As we move forward to the exascale computing era [5], leadership supercomputers are becoming
increasingly heterogeneous. More than 25% of TOP500 [6] supercomputers adopt accelerators like
Graphics Processing Units (GPUs) due to their massive parallel computing capabilities. GPUs
                                                 1


                   Figure 1.1: Memory hierarchy in deep memory architectures.
have been widely used to accelerate applications, including sparse matrix computations, in various
domains, such as machine learning, graph analytics, and climate modeling. However, more complex
memory layers in the memory hierarchy of modern heterogeneous systems further complicate
the programming, migration, and optimization of sparse matrix computations. Given NVIDIA’s
dominance in GPU computing, most GPU acceleration efforts have focused on CUDA, an NVIDIA-
hardware specific programming model. Considering the increasing number of computing systems
with GPUs from different vendors, choosing a portable programming model that allows applications
to adapt to the diversity in heterogeneous computing is crucial from software maintenance and
sustainability perspectives. OpenMP 4.5+ and OpenACC have emerged as portable directive-based
programming models to provide a unified programming model for CPUs and GPUs. Higher-level
abstraction adopted in these directive-based programming models yields application portability
and developer productivity. Moreover, recent studies show that OpenMP and OpenACC are robust
and able to provide good performance on different hardware [7, 8, 9].
    Whether an application uses a device-specific (e.g., CUDA) or a portable (e.g., OpenMP)
programming model, the amount of data moved between CPU and GPU is a key concern in
heterogeneous systems due to the wide gap between the GPU’s computing power and memory
bandwidth. For best performance, programmers have to explicitly control the data movement
between the host (CPU) and the device (GPU), but this comes at the expense of programmer
productivity. As a general solution, Unified Virtual Memory (UVM) has become available since
CUDA 8.0 and the Pascal architecture (2016). The key idea behind UVM is that programmers
                                                2


no longer need to think about the device and the host memory as two distinct memory spaces.
Instead, UVM creates a pool of managed memory that is shared between the CPU and GPU, and
is accessible from both, with memory pages being migrated on-demand by the CUDA runtime
system automatically. Even though UVM dramatically reduces developer effort in managing data
movement, it typically causes a significant hit in performance.
    While the size of data needed by scientific applications and machine learning/data analytics
workloads increases rapidly, the memory available on GPUs increases at a much more modest pace
in comparison. For example, NVIDIA’s Tesla V100 “Volta" GPUs have only 16 GBs of device
memory available; one could have up to 80 GBs of memory on the newest generation NVIDIA
A100s, but they come with a price tag not affordable by most customers or even HPC centers.
To illustrate this issue, take, for instance, the Many Fermion Dynamics - nuclei (MFDn) code,
which is a quantum many-body code based on the configuration interaction model. MFDn is a
total memory-bound application, i.e., scientific studies using this code typically utilize all memory
(DRAM) space available, thus easily exceeding the total device memory available [10, 11]. When
an application’s working set size exceeds the device memory size, the resulting data movement
becomes a critical design and performance bottleneck [12].
    Considering the complex memory hierarchies of modern heterogeneous computing systems
and increased problem sizes of large-scale scientific applications, there is a dire need for new
approaches at the algorithmic and runtime system levels for sparse matrix computations.
1.2     Contributions of This Thesis
    In this thesis, we aim to design and develop a sparse linear algebra framework to accelerate
sparse solver codes on modern architectures with deep memory hierarchies. The contributions of
this thesis are divided into three parts:
    In the first part, we design and implement a sparse solver framework named DeepSparse
on shared memory architectures to support a wide variety of sparse solvers. We use OpenMP
task parallelism to implement DeepSparse to ensure its easy integration and adaptation in real-
                                                  3


world applications. We evaluate the performance of the DeepSparse framework on two important
eigensolvers widely used in large-scale scientific computing applications: Lanczos eigensolver [13]
and Locally Optimal Block Preconditioned Conjugate Gradient algorithm (LOBPCG) [14]. In
Chapter 2, we present the design and implementation details of the DeepSparse framework.
    In the second part, we start porting and optimizing the DeepSparse framework on GPU-
accelerated computing systems after seeing the encouraging results on shared memory systems.
As a feasibility study, we evaluate the efficacy of directive-based GPU programming models for
the LOBPCG solver and consider the performance implications for large sparse matrices. We
implement the LOBPCG eigensolver using two popular directive-based programming models,
OpenMP and OpenACC, for GPU-accelerated systems. We also consider creating an efficient
LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create
microbenchmarks representing the two dominant kernels (inner product and SpMM kernels) in
LOBPCG and evaluate their performances using two different programming approaches: tiling the
kernels and using Unified Memory with the original kernels. In Chapter 3, we present the findings
of our preliminary studies on directive-based GPU programming models.
    In the third part, we extend our task-parallel sparse linear algebra framework described in
Chapter 2 to GPUs, which we name DeepSparseGPU. DeepSparseGPU aims to make developing
sparse solvers for large problems on GPU-accelerated architectures easier. The main goal of the
DeepSparseGPU framework is therefore to ease the development of performant and portable sparse
solvers. Hence, building upon our DeepSparse framework [15] which runs on shared memory
architectures, DeepSparseGPU leverages OpenMP’s target offloading functionality. As we aim
to optimize DeepSparseGPU for GPUs, several changes were made to the original DeepSparse
framework, especially to support applications with memory requirements that exceed the available
device memory capacity. We adopt the same task-based approach which serves the double purpose
of enabling data locality optimizations and facilitates the management of application data so that
it can be processed in batches that fit into the device memory. In Chapter 4, we present the
design and implementation details of the DeepSparseGPU framework. In Chapter 5, we present
                                                   4


the preliminary results of our exploration of fine-grained task parallelism using CUDA Graph.
    I briefly present some motivations and background studies for all the thesis projects in the
following sections.
1.3    DeepSparse Framework
    Eigenvalue decomposition is at the heart of many scientific calculations, including nuclear
astrophysics, molecular dynamics, machine learning, and life sciences. However, the eigenvalue
decomposition is computationally expensive. For example, the computational complexity of finding
eigenvalues of a n × n square matrix is O(n3 ). However, the matrix dimension can easily grow
to thousands, millions or even billions in scientifically relevant problems. Thus, finding the
eigenvalues in large-scale computations becomes very challenging. Several iterative eigenvalue
decomposition algorithms have been developed over the years to deal with large matrices. These
methods work by repeatedly refining approximations to the eigenvectors and eigenvalues and can be
terminated whenever the approximations reach to the desired degree of accuracy. Iterative methods
form the basis of much of modern-day eigenvalue computations.
    In the first part of this thesis, we propose a novel sparse linear algebra framework, named
DeepSparse, which aims to accelerate sparse solver codes on modern architectures with deep
memory hierarchies. Our proposed DeepSparse framework differs from existing work in two ways.
First, we propose a holistic approach that targets all computational steps in a sparse solver rather
than narrowing the problem into a single kernel, e.g., sparse matrix vector multiplication (SpMV)
or sparse matrix multiple vector multiplication (SpMM), as is commonly done. Second, we adopt
a fully integrated task-parallel approach which is well suited for the high degrees of parallelism
available in modern hardware.
    DeepSparse provides a GraphBLAS plus BLAS/LAPACK-like frontend for domain scientists
to express their algorithms without having to worry about the architectural details (e.g., mem-
ory hierarchy) and parallelization considerations (i.e., determining the individual tasks and their
scheduling) [16, 17, 18]. DeepSparse automatically generates and expresses the entire computation
                                                  5


as a task dependency graph (TDG) where each node corresponds to a specific part of a com-
putational kernel and edges denote control and data dependencies between computational tasks.
We chose to build DeepSparse on top of OpenMP[19] because OpenMP is the most commonly
used shared memory programming model, but more importantly it supports task-based data-flow
programming abstraction. As such, DeepSparse relies on OpenMP for parallel execution of the
TDG.
    We anticipate two main advantages of DeepSparse over a conventional bulk synchronous parallel
(BSP) approach where each kernel relies on loop parallelization and is optimized independently.
First, DeepSparse would be able to expose better parallelism as it creates a global task graph for the
entire sparse solver code. Second, since the OpenMP runtime system has explicit knowledge about
the TDG, it may be possible to leverage a pipelined execution of tasks that have data dependencies,
thereby leading to better utilization of the hardware cache.
1.4     Evaluation of Directive based Programming Models on GPUs
    Heterogeneous computing architectures are becoming more popular among the high perfor-
mance computing (HPC) communities as more than a quarter of TOP500 and Green500 supercom-
puters employ GPU accelerators [6, 20]. Also, there is a pressing need to migrate and optimize
applications for execution on GPUs and other accelerators in order to fully utilize their computing
capabilities. Most of the existing GPU based heterogeneous computing systems are equipped with
CPUs and GPUs from different vendors for accelerating workloads. Most recent and future planned
systems for the Department of Energy Office of Advanced Scientific Computing Research (DOE
ASCR) include Perlmutter at NERSC (AMD CPU + NVIDIA GPU nodes), Aurora at ALCF (Intel
CPU + Intel Xe accelerator nodes), and Frontier at OLCF (AMD CPU + AMD GPU nodes). The
full capability of these systems can only be realized by making efficient use of the accelerators on
the compute nodes. Most efforts to use accelerators to date have involved scientists using the CUDA
programming language that only targets NVIDIA GPUs. The success of these efforts, the expected
marginal gains in general-purpose CPU performance, and the understanding that special purpose
                                                  6


accelerators are the best way to obtain significant performance gains within a fixed financial and
power budget convinced HPC facilities to invest in accelerator-based systems.
     However, CUDA is not a universal method to target accelerators produced by different vendors,
e.g. NVIDIA, AMD, Intel, Xilinx, although there are efforts by AMD to use the HIP framework
to convert CUDA to a more portable style of C++ [21]. To address this issue, OpenMP 4.5+
and OpenACC have emerged as portable directive-based programming models targeting hardware
accelerators. The higher level of abstraction adopted in these directive-based programming models
increases application portability and application developer productivity. These directive-based
methods have lowered the barrier of entry for application developers to target accelerators and
are anticipated to be a key enabler for HPC users to efficiently use forthcoming supercomputers.
However, there needs to be a wider testing of OpenMP and OpenACC in scientific applications to
address any shortcomings in the language specifications, improve the robustness and performance of
vendor compilers, and continue to refine our understanding of best practices to migrate applications
to accelerators. At the same time, the most efficient way to use accelerators is often achieved using
optimized math and scientific libraries, e.g. cuBLAS and Tensorflow. Therefore, it is very likely to
be the case that non-trivial applications will increasingly need to combine optimized library calls
with directives to obtain the the highest performance on a given application.
     In the second part of this thesis, we demonstrate the porting and optimization of an itera-
tive block eigensolver for GPUs using a combination of directives and optimized library calls to
evaluate the efficacy of the directive-based programming models. We choose the Locally Opti-
mal Block Preconditioned Conjugate Gradient (LOBPCG) [14, 22] algorithm as the representative
block eigensolver as it is a popular sparse eigensolver with computational characteristics that fit
well to GPU architectures. LOBPCG algorithm requires a fairly complex implementation, as such
it serves a good evaluation case for our purposes.
     An important issue in large scientific computing and data analysis workloads is that applications’
data usage often exceeds the available device memory space. As such, our evaluation extends into
such scenarios, and we present remedies for the significant performance degradation observed due
                                                   7


to large data transfers between host and device memories.
1.5     DeepSparseGPU Framework
    Managing data migration between CPU and GPU is one of the significant challenges in GPU
programming. Kernels that are running on GPU can only access data that are available on GPU
memory. Thus, programmers have to explicitly control the data movement between CPU and
GPU for best performance. The Unified Virtual Memory (UVM) allows treating the address
space as a single memory address space accessible from both CPUs and GPUs with memory pages
migrated on-demand. However, even though enabling UVM is simple, it provides poor performance
compared with custom methods that manually offload and prefetch data. Furthermore, GPUs in
modern supercomputers have limited device memory. The scale of data needed in important
scientific computing and data analytic applications often exceeds the available device memory
space.
    After seeing promising results from our work on evaluating directive-based programming
model on GPU accelerated systems, we start porting and optimizing our task parallel linear algebra
framework, DeepSparse, on GPUs using OpenMP target offload. We name this GPU version as
DeepSparseGPU. As we aim to port the DeepSparse framework to GPUs, we make several changes
to the shared memory based design of DeepSparse to support problem sizes that exceed the available
GPU memory. In DeepSparseGPU, we aim to minimize data movement between the CPU and the
GPU by exploiting data dependencies between the computational tasks. Towards this end, we adopt
the same task-based tilling approach in DeepSparseGPU as tiling is one of the most commonly
used techniques for optimizing applications for data locality. In addition, decomposing data into
tiles enables the programmer to run applications on GPUs even when the application data does not
entirely fit into the device memory. We describe a memory manager that automatically keeps track
of the data that is available on the CPU and the GPU. The proposed memory manager automatically
migrates data between CPU and GPU. We evaluate the performance of DeepSparseGPU also using
the LOBPCG block eigensolver.
                                                  8


                          Figure 1.2: Applications of sparse linear algebra.
1.6    Related Work
    Figure 1.2 shows an overview of the applications of sparse linear algebra in data analytics and
scientific computing. Sparse matrix-vector multiplication (SpMV) and sparse matrix dense matrix
multiplication (SpMM) are two essential operations in eigenvalue decomposition as well as in
solving sparse linear systems and similar problems, and are well studied in the literature. Sparse
matrix-sparse matrix multiplication (SpGEMM) and sparse matrix sparse vector multiplication
(SpMSV) are, on the other hand, less studied in the literature, but they find important applications
in machine learning methods that started becoming popular.
    In its simplest form, an SpMV operation computes y = Ax where A is a sparse matrix,
and x and y are dense vectors. The usage of SpMV is typically repetitive and it constitutes
the main computational kernel for several applications such as the solution of partial differential
equations [23] and the Schrodinger’s Equation [24] in scientific computing, spectral clustering [25]
and dimensionality reduction [26] in machine learning, and the Page Rank algorithm [27] in
graph analytics. The Roofline Model by Williams et al. [28] suggests that the memory bandwidth
ultimately bounds the performance of the SpMV kernel. Consequently, performance optimizations
to increase cache utilization and reduce data access latencies for SpMV have drawn significant
interest [29, 30, 31, 32, 33, 34, 35, 36, 37, 28], which is an incomplete list of related work on this
topic.
    The sparse matrix dense matrix multiplication (SpMM) operation has long been the workhorse of
                                                   9


block iterative solvers, e.g., the block Krylov subspace methods [23, 38] and block Jacobi-Davidson
method. SpMM has much higher arithmetic intensity than SpMV and can efficiently leverage wide
vector execution units. As a result, SpMM-based solvers have recently drawn significant interest in
scientific computing [11, 39, 40, 41]. SpMM also finds applications naturally in machine learning
where several features (or eigenvectors) of sparse matrices are needed [26, 25]. Although SpMM
has a significantly higher arithmetic intensity than SpMV, the extended Roofline model that was
recently proposed suggests that cache bandwidth, rather than the memory bandwidth, can still be a
critical performance-limiting factor for SpMM [11].
    Multiplication of sparse matrices (SpGEMM) and sparse matrix times sparse vector (SpMSV)
operation also find applications in important problems. For example, SpGEMM is the main kernel
in the algebraic multi-grid method [42] and the Markov Clustering algorithm, while SpMSV is the
main building block for breadth-first search, bipartite graph matching, and maximal independent
set algorithms [43].
    Overall, related work in the literature generally examines each of these sparse matrix computa-
tion kernels in isolation. Our holistic approach that tries to improve the overall performance of a
sparse solver through the utilization of a global task graph executed in an asynchronously parallel
fashion represents a distinguishing aspect of my work.
                                                   10


                                            CHAPTER 2
    DEEPSPARSE: A TASK-PARALLEL FRAMEWORK FOR SPARSE SOLVERS ON
                            SHARED MEMORY ARCHITECTURES
This work has been published in IEEE HiPC 2019 [15]. This work is a collaborative effort from
Fazlay Rabbi and Md Afibuzzaman. Both contributed equally in this work.
    Dense linear algebra libararies (i.e., BLAS [18], LAPACK [17] and ScaLAPACK [44]) are well
established and several vendor optimization implementations exist for dense linear algebra (i.e.,
Intel MKL, Cray LibSci). However, unlike its dense matrix analogue, the state of the art for sparse
matrix computations is lagging far behind. Standardization for sparse linear algebra has started
with the NIST Sparse Basic Linear Algebra Subprograms (SparseBLAS), but SparseBLAS defines
only a few subset of sparse matrix operations [45] that arise in finite element simulations.
    The main reason behind the slow progress in the standardization of sparse linear algebra and
development of libraries for it is that sparse matrices come in very different forms and properties
depending on the application area. The widening gap between the memory system and processor
performance, irregular data access patterns and low arithmetic intensities of sparse matrix compu-
tations have effectively made them “memory-bound” computations. Furthermore, the downward
trend in memory space and bandwidth per core in high performance computing (HPC) systems [46]
has paved the way for a deepening memory hierarchy. Thus, there is a dire need for new approaches
both at the algorithmic and runtime system levels for sparse matrix computations. While sparse
solver libraries such as Trilinos [47], PETSc [48] and Hypre [49] provide a wide variety of sparse
solver implementations, they do not explicitly address the performance bottlenecks associated with
deep memory hierarchies or emerging NVRAM technologies. In this context, difficulties faced by
domain scientists in building high performance data analytics and scientific computing application
provide our main motivation. Our goal is to develop a computational infrastructure to enable
large-scale data analytics and scientific computing on massively parallel architectures with deep
memory hierarchies.
                                                 11


    We propose a novel fully integrated task-parallel sparse linear algebra framework, DeepSparse,
which aims to accelerate sparse solver codes on modern architectures with deep memory hierarchies.
Our goal is to accelerate sparse matrix computations on shared memory architecture by minimizing
date movement between different memory hierarchies. In order to do so, we express the sparse
solvers as a directed acyclic graph (DAG) by exploiting date dependencies between computational
kernels. In DeepSparse, we adopt a holistic approach that aims to optimize all computational steps
in a sparse solver rather than narrowing the problem into a single kernel, e.g., sparse matrix vector
multiplication (SpMV) or sparse matrix multiple vector multiplication (SpMM).
    As mentioned, there exists several well-optimized libraries for dense linear algebra. One of
the most popular of such libraries is LAPACK [17], which is a linear algebra library for solving
systems of simultaneous linear equations, least-squares solutions of linear systems of equations,
eigenvalue problems, and singular value problems. The workhorse for LAPACK is the Basic Linear
Algebra Subprograms (BLAS). PLASMA aims to overcome the shortcomings of the LAPACK
library in efficiently solving the problems in dense linear algebra on multicore processors [50, 51].
MAGMA is a dense linear algebra library (like LAPACK) for heterogeneous systems, i.e., systems
with GPUs [52, 53, 54], to fully exploit the computational power that each of the heterogeneous
components would offer. MAGMA provides very similar functionality like LAPACK and makes
it easier for the user to port their code from LAPACK to MAGMA. However, none of these
libraries support sparse matrices nor the solution of sparse linear systems, eigenvalue problems or
singular value decompositions, highlighting the lack of general purpose software for sparse matrix
computations.
    In regards to task-based parallelization of HPC workloads, Barrera et. al. [55] use computational
dependencies and dynamic graph partitioning method to minimize NUMA effect on shared memory
architectures. StarPU [56] is a runtime system that facilitates the execution of parallel tasks on
heterogeneous computing platforms, and incorporates multiple scheduling policies. However, the
application developer has to create the computational tasks by themselves in order to use StarPU.
    As indicated above, while the concept of task parallelism based on data flow dependencies
                                                  12


is not new, exploration of the benefits of this idea in the context of sparse solvers constitutes a
novel aspect of the present work. Additionally, to the best of our knowledge, related work on task
parallelism has not explored its impact on cache utilization compared to the BSP model as we do
in this work.
    To summarize, our main contributions are:
     • Introduction of a novel task-parallel sparse solver framework with complete details in regards
       to the front-end API and description of how a solver code expressed through this API is
       automatically converted into a task graph,
     • An extensive evaluation of the performance of DeepSparse on two applications using a variety
       of sparse matrices from different domains,
     • Demonstration that the proposed task-parallel framework can significantly reduce misses
       across all cache levels (up to 16×) as well as improve the execution time (by up to 3.9×)
       compared to highly optimized library implementations based on the BSP model.
2.1     DeepSparse Overview
    Figure 2.1 illustrates the architectural overview of DeepSparse. As shown, DeepSparse consists
of two major components: i) Primitive Conversion Unit (PCU) which provides a front-end to
domain scientists to express their application at a high-level; and ii) Task Executor which creates
the actual tasks based on the abstract task graph generated by PCU and hands them over to the
OpenMP runtime for execution.
    As sparse matrix related computations represent the most expensive calculations in many large-
scale scientific computing, we define tasks in our framework based on the decomposition of the
input sparse matrices. For most sparse matrix operations, both 1D (block row) and 2D (sparse
block) partitioning are suitable options. A 2D partitioning is ideal for exposing high degrees of
parallelism and reducing data movement across memory layers [57], as such 2D partitioning is the
default scheme in DeepSparse. For a 2D decomposition, DeepSparse defines tasks based on the
                                                   13


  Primitive Conversion Unit (PCU)                                               Task Executor
             Task Identifier (TI)
                                                                                  In In        In In
          do {                                                    Core 0         Data Data    DataData
          SpMM(Hpsi, H, psi)
          dot(E, psi, psi)                                                        SpMM          dot
          daxpy(Epsi, E, psi)
          daxpy(R, Hpsi,Epsi)
          dot(W,Tinv, R)                                                          Out Out     Out Out
                                                                   Partition 0
          dot(Wmat, W, W)                                                         Data Data   Data Data
          dsyevd(S, Wmat).
          ..
          } while(!converged)
                                                                                   In In        In In
                                                                   Core 1         Data Data    Data Data
                                                                                   SpMM          dot
                                                                    Partition 0    Out Out     Out Out
                                                                                   Data Data   Data Data
              TDG Generator
                                                                                   In In         In In
                                                                   Core 2         Data Data    Data Data
                                                                                    SpMM         dot
                                                                    Partition 0    Out Out     Out Out
                                                                                   Data Data   Data Data
                                  Figure 2.1: Schematic overview of DeepSparse.
Compressed Sparse Block (CSB) [30] representation of the sparse matrix, which is analogous to tiles
that are commonly used in task parallel implementation of dense linear algebra libraries. However,
CSB utilizes much larger block dimensions (on the order of thousands) due to sparsity [11, 58, 30].
Consequently, DeepSparse starts out by decomposing the sparse matrix (or matrices) for a given
code into CSB blocks (which eventually corresponds to the tasks during execution with each kernel
producing a large number of tasks). Note that the decomposition of a sparse matrix dictates
partitioning of the input and output vectors (or vector blocks) in the computation as well, effectively
inducing decomposition of all data structures used in the solver code.
    DeepSparse creates and maintains fine-grained dependency information across different kernels
of a given solver code based on the result of the above decomposition scheme. As such, instead
of simply converting each kernel into its own task graph representation and concatenating them,
DeepSparse generates a global task graph, allowing for more optimal data access and task scheduling
                                                        14


decisions based on global information. Since the global task graph depends on the specific algorithm
and input sparse matrix, DeepSparse will explicitly generate the corresponding task dependency
graph. While this incurs some computational and memory overheads, such overheads are negligible.
The main reason for computational overheads to be negligible is that sparse solvers are typically
iterative, and the same task dependency graph is used for several iterations. The reason why memory
overheads is negligible is that each vertex in the task graph corresponds to a large set of data in the
original problem. After this brief overview, we explain the technical details in DeepSparse.
2.1.1    Primitive Conversion Unit (PCU)
PCU is composed of two parts: i) Task Identifier, and ii) Local Task Dependency Graph (TDG)
Generator.
2.1.1.1    Task Identifier (TI)
The application programming interface (API) for DeepSparse is a combination of the recently pro-
posed GraphBLAS interface [16] (for sparse matrix related operations) and BLAS/LAPACK [17,
18] (for vector and occasional dense matrix related computations). This allows application devel-
oper to express their algorithms at a high-level without having to worry about architectural details
(e.g., memory hierarchy) or parallelization considerations (e.g., determining the individual tasks
and their scheduling). Task identifier parses a code expressed using the DeepSparse API to identify
the specific BLAS/LAPACK and GraphBLAS calls, as well as the input/output of each call. It then
passes this information to the local task dependency graph generator.
    TI builds two major data structures:
  (i) ParserMap: ParserMap is an unordered map that holds the parsed data information in the
       form of (Key, Value) pairs. As TI starts reading and processing the DeepSparse code, it
       builds a ParserMap from the function calls. To uniquely identify each call in the code, Key
       class is made up of three components: opCode which is specific to each type of operation
                                                  15


       used in the code, id which keeps track of the order of the same function call in the code (e.g.,
       if there are two matrix addition operations, then the first call will have id = 1 and the second
       one will have id = 2), and timestamp which stores the line number of the call in the code and
       is used to detect the input dependencies of this call to the ones upstream. For each key, the
       corresponding Value object stores the input and output variable information. It also stores
       the dimensions of the matrices involved in the function call.
  (ii) Keyword & idTracker: Keyword is a vector of strings that holds the unique function names
       (i.e., cblas_dgemm, dsygv, mkl_dcrmm, etc.) that have been found in the given code,
       and the idTracker keeps track of the number of times that function (Keyword) has been
       called so far. Keyword and idTracker vectors are synchronized with each other. When TI
       finds a function call, it searches for the function name in the Keyword vector. If found, the
       corresponding idTracker index is incremented. Otherwise, the Keyword vector is expanded
       with a corresponding initial idTracker value of 1.
2.1.1.2     Task Dependency Graph Generator (TDGG)
The output of Task Identifier (TI) is a dependency graph at a very coarse-level, i.e., at the function
call level. For an efficient parallel execution and tight control over data movement, tasks must
be generated at a much finer granularity. This is accomplished by the Task Dependency Graph
(TDGG), which goes over the input/output data information generated by TI for each function
call and starts decomposing these data structures. As noted above, the decomposition into finer
granularity tasks starts with the first function call involving the sparse matrix (or matrices) in
the solver code which is typically an SpMV, SpMM or SpGEMM operation. After tasks for
this function call are identified by examining the non-zero pattern of the sparse matrix, tasks for
prior and subsequent function calls are generated accordingly. As part of task dependency graph
generation procedure, TDGG also generates the dependencies between individual fine-granularity
tasks by examining the function call dependencies determined by TI. Note that the dependencies
generated by TDGG may (and often do) span function boundaries and this is an important property
                                                    16


struct TaskInfo
{
   int opCode ; // type of operation
   int numParamsCount ;
   int * numParamsList ; // tile id , dimensions etc .
   int strParamsCount ;
   char ** strParamsList ; // i . e . buffer name
   int taskID ; // analogous to id of Key Class
}
                                     Listing 2.1: TaskInfo Structure
of DeepSparse that separates it from a bulk synchronous parallel (BSP) program which effectively
imposes barriers at the end of each function call.
    The resulting task dependency graph generated by TDGG is essentially a directed acyclic graph
(DAG) representing the data flow in the solver code where vertices denote computational tasks,
incoming edges represent the input data and outgoing edges represent the output data for each task.
TDGG also labels the vertices in the task dependency graph with the estimated computational cost
of each task, and the directed edges with the name and size of the corresponding data, respectively.
During execution, such information can be used for load balancing among threads and/or ensuring
that active tasks fit in the available cache space. In this initial version of DeepSparse though, such
information is not yet used because we rely on OpenMP’s default task execution algorithms, as
explained next.
2.1.2   Task Executor
To represent a vertex in the task graph, TDGG uses an instance of the TaskInfo structure [listing 2.1]
which provides all the necessary information for the Task executor to properly spawn the corre-
sponding OpenMP task. The task executor receives an array of TaskInfo structures [listing 2.1] from
the PCU that represents the full computational dependency graph, picks each node from this array
one by one and extracts the corresponding task information. DeepSparse implements OpenMP
task based functions for all computational kernels (represented by opCode) it supports. Based on
                                                   17


  Algorithm 1: SpMM Kernel
    Input: X[i,j] (β × β, Sparse CSB block), Y[j] (β × b)
    Output: Z[i] (Dense vector block,  β × b)    
  1 #pragma omp task depend(in: X i, j , Y j , Z i ) depend(out: Z i )
                                                                       
  2 {
  3 foreach val ∈ X i, j .nnz do
                        
  4     r = X[i,j].row_loc[val]
  5     c = X[i,j].col_loc[val]
  6     for k = 0 to b do
  7         Z[r × b + k] = Z[r × b + k] + val × Y[c × b + k]
  8     end
  9 end
 10 }
the opCode, partition id of the input/output data structures and other required parameters (given
by numParamsList and strParamsList) found in the TaskInfo structure at hand, Task Executor
calls the necessary computational function found in the DeepSparse library, effectively spawning
an OpenMP task.
    In DeepSparse, the master thread spawns all OpenMP tasks one after the other, and relies on
OpenMP’s default task scheduling algorithms for execution of these tasks. OpenMP’s Runtine
Environment then determines which tasks are ready to be executed based on the provided task
dependency information. When ready, those tasks are executed by any available member of
the current thread pool (including the master thread). Note that OpenMP supports task parallel
exeuction with named dependencies, and better yet these dependencies can be specified as variables.
This feature is fundamental for DeepSparse to be able to generate TDGs based on different problem
sizes and matrix sparsity patterns. This is exemplified in Algorithm 1, where SpMM tasks for the
compressed sparse block at row i and j is simply invoked by providing the X i, j sparse matrix
                                                                                  
block along with Y j input vector block and Z i output vector block in the depend clause.
                                                
    An important issue in a task parallel program is the data race conditions involving the output
data that is being generated. Fortunately, the task-parallel execution specifications of OpenMP
requires only one thread to be active among threads writing into the same output data location.
While this ensures a race-condition free execution, it might hinder performance due to a lack
                                                18


cblas_dgemm ( CblasRowMajor , CblasNoTrans , CblasNoTrans , m , n , k ,
                 1.0 , A , k , B , n , 0 , C , n ) ;
SpMM (X , C , D , m , n ) ;
cblas_dgemm ( CblasRowMajor , CblasTrans , CblasNoTrans , n , n , m ,
                 1.0 , D , n , C , n , 0 , E , n ) ;
                                 Listing 2.2: An example pseudocode
of parallelism. Therefore, for data flowing into tasks with a high incoming degree, DeepSparse
allocates temporary output data buffers based on the number of threads and the available memory
space. Note that this also requires the creation of an additional task to reduce the data in temporary
buffer space before it is fed into the originally intended target task.
2.1.3   Illustrative Example
We provide an example to demonstrate the operation of DeepSparse using the simple code snippet
provided in Listing 2.2. As TI parses the sample solver code, it discovers that the first cblas_dgemm
in the solver corresponds to a linear combination operation (see Figure 2.2), the second line is a
sparse matrix vector block multiplication (SpMM, see Figure 2.3) and the second cblas_dgemm
at the end is an inner product of two vector blocks (see Figure 2.4). These function calls, their
parameters as well as dependencies are captured in the ParserMap, Keyword, and idTracker data
structures as shown in Table 2.1.
                       Table 2.1: Major data structures after parsing third line.
                  Data Structure                         Content
                                        <{XY, 1, 1},{<A, B>, <C>, <m, n, k>}>
                    ParserMap
                                     <{SpMM, 1, 2},{<X, C>, <D>, <m, m, n>}>
                                     <{XTY, 1, 3},{<D, C>, <E>, <m, n, n>}>
                      keyword                      <XY, SpMM, XTY>
                     idTracker                          <1, 1, 1>
    Task Dependency Graph (TDG) generator receives the necessary information from TI and
determines the tasks corresponding to partitionings of operand data structures of each operation,
                                                    19


                      0           task0                                   task0
                      1           task1                                   task1
                      2           task2                                   task2
                     …
                     …
                     …              X              Y                        Z
                                           b
                  n-1            taskn-1                                 taskn-1
                                    b                                       b
Figure 2.2: Overview of input output matrices partitioning of task-based matrix multiplication
kernel.
        0 task0,0 task0,1 task0,2                        task0,n-1
          task1,0 task1,1 task1,2
        1
        2 task2,0 task2,1 task2,2
       …
       …
       …                                 X                             Y             Z
     n-1                                                 taskn-1,n-1
                                                                       b              b
             Figure 2.3: Overview of matrices partitioning of task-based SpMM kernel.
as well as their origins (whether the necessary data are coming from another task or from a
variable). TDGG then builds the DAG of each computational kernel and appends it to the global
DAG with proper edge connectivity (i.e., dependencies). While generating the DAG, the TDGG
also encodes the value of the TaskInfo structure instance that represent each of the vertices into
the vertex name. The vertex naming convention is <opCode, numParamsCount, numParamsList,
strParamsCount, strParamsList, taskID>. Figure 2.5 shows the task dependency graph of the solver
                                                20


                            cblas_dgemm call on block vector                            Partial result buffer reduction
                                                                     task0                  t0
     b   task0   task1    task2      XT              taskn-1         task1
         0       1        2       ….                                 task2                  t1
                                  ….
                                  ….                 n-1
                                                                      Y                     t2                  Z
                                                                                            tp
                                                                    taskn-1
                                                                       b
         Figure 2.4: Overview of matrices partitioning of task-based inner product kernel.
                         5,1,0,0,1                                                 5,1,1,0,1
                         2,2,0,0,0,1          2,2,1,0,0,1           2,2,0,1,0,1                   2,2,1,1,0,1
                                           3,2,0,0,0,1                    3,2,1,1,0,1
                                                  4,1,0,1,EBUF,-1
                                  Figure 2.5: Task graph for the psudocode in listing 2.2.
code in Listing 2.2 (assuming m = 100, k = 8, n = 8, CSB tile/block size = 50, so each input matrix
is partitioned into 2 chunks).
   The task executor receives an array of TaskInfo structures that contains the node information
as shown in Figure 2.5. The task executor goes over each of the tasks in the array of TaskInfo
structure. At first, it reads the nodes (<5,1,0,0,1>, <5,1,1,0,1>) of the first operation and spawns
two matrix multiplication (xY) tasks with proper input output matrices. The task executor then
reads all the task information for all SpMM tasks {<2,2,0,0,0,1>, <2,2,0,1,0,1>, <2,2,1,0,0,1>,
<2,2,1,1,0,1>} and spawns four SpMM tasks with proper input/output matrix blocks. Finally, the
                                                               21


task executor reads <3,2,0,0,0,1>, <3,2,1,1,0,1> and <4,1,0,1,EBUF,-1> and spawns two inner
product (XT Y) tasks and one partial output buffer reduction task for the inner product operation.
2.1.4    Limitations of the Task Executor
Despite the advantages of an asynchronous task-parallel execution, the Task Executor has the
following limitations:
     • Synchronization at the end of an iteration: Most computations involving sparse matrices
       are based on iterative techniques. As such, the TDG generated for a single iteration can be
       reused over several steps (until the algorithm reaches convergence). However, it is necessary
       to introduce a #pragma omp taskwait at the end of each solver iteration and force all
       tasks of the current iteration to be completed to ensure computational consistency among
       different iterations of the solver. For relatively simple solvers, the taskwait clause adds
       some overhead to the total execution time due to threads idling at taskwaits.
     • Limited number of temporary buffers: While OpenMP allows the use of program variables
       in the dependency clauses, it does not allow dynamically changing the variable lists of the
       depend clauses. As such, the number of buffer lists in the partial output reduction tasks need
       to be fixed to overcome this issue. Depending on the available memory, there are at most
       nbuf number of partial output buffers for a reduction operation. If nbuf is less than the
       total number of threads, then there might be frequent read after write (RAW ) contentions on
       partial output buffers. This could be have been potentially avoided, if the list of variables in
       the depend clause could have been dynamically changed.
2.2     Benchmark Applications
    We demonstrate the performance of the DeepSparse framework on two important eigensolvers
widely used in large-scale scientific computing applications: Lanczos eigensolver [13] and Locally
Optimal Block Preconditioned Conjugate Gradient algorithm (LOBPCG) [14].
                                                   22


2.2.1   Lanczos
Lanczos algorithm finds eigenvalues of a symmetric matrix by building a matrix Qk = q1 , . . . , qk
of orthogonal Lanczos vectors [59]. The eigenvalues of the sparse matrix A is then approximated
by the Ritz values. As shown in Algorithm 2, it is a relatively simple algorithm consisting
of an Sparse Matrix Vector Multiplication (SpMV) along with some vector inner products for
orthonormalization.
  Algorithm 2: Lanczos Algorithm in Exact Arithmetic
  1 q1 = b||b||2 , β0 = 0, q0 = 0
  2 for j = 1 to k do
  3     z = Aqj
  4     αj = qj T z
  5     z = z − αj qj − βj−1 qj−1
  6     βj = ||z||z
  7     if βj = 0, quit
  8     qj+1 = zβj
  9     Compute eigenvalues, eigenvectors, and error bounds of Tk
 10 end
2.2.2   LOBPCG
LOBPCG is a commonly used block eigensolver based on the SpMM kernel [14], see Algorithm 3
for a pseudocode. Compared to Lanczos, LOBPCG comprises high arithmetic intensity operations
(SpMM and Level-3 BLAS). In terms of memory, while the H     c matrix takes up considerable space,
when a large number of eigenpairs are needed (e.g. dimensionality reduction, spectral clustering or
quantum many-body problems), memory needed for block vector Ψ can be comparable to or even
greater than that of H.c In addition, other block vectors (residual R, preconditioned residual W,
previous direction P), block vectors from the previous iteration and the preconditioning matrix T
must be stored, and accessed at each iteration. Figure 2.6 shows a sample task graph for LOBPCG
generated by TDGG using a very small matrix. Clearly, orchestrating the data movement in a deep
memory hierarchy to obtain an efficient LOBPCG implementation is non-trivial.
                                                 23


  Algorithm 3: LOBPCG Algorithm (for simplicity, without a preconditioner)
    Input: Ĥ , matrix of dimensions N × N
    Input: Ψ0 , a block of randomly initialized vectors of dimensions of N × m
    Output: Ψ and E such that kĤΨ − Ψ EkF is small, and Ψ T Ψ = Im
  1 Orthonormalize the columns of Ψ0
  2 P0 ← 0
  3 for i = 0, 1, . . . , until convergence do
  4     Ei = ΨiT ĤΨi
  5     Ri ← ĤΨi − Ψi Ei
  6     Apply the Rayleigh–Ritz procedure on span{Ψi , Ri , Pi }
  7     Ψi+1 ←                  argmin          traceS T ĤS
                  S∈span{Ψi .Ri ,Pi }, S T S=Im
  8     Pi+1 ← Ψi+1 − Ψi
  9     Check convergence
 10 end
 11 Ψ ← Ψi+1
     Figure 2.6: A sample task graph for the LOBPCG algorithm using a small sparse matrix.
2.3    Performance Evaluation
2.3.1   Experimental Setup
We conducted all our experiments on Cori Phase I, a Cray XC40 supercomputer at NERSC, mainly
using the GNU compiler. Each Cori Phase I node has two sockets with a 16-core Intel Xeon
                                                     24


Processor E5-2698 v3 Haswell CPUs. Each core has a 64 KB private L1 cache (32 KB instruction
and 32 KB data cache) and a 256 KB private L2 cache. Each CPU has a 40 MB shared L3 cache
(LLC). We use thread affinity to bind threads to cores and use a maximum of 16 threads to avoid
NUMA issues. We test DeepSparse using five matrices with different size, sparsity patterns and
domains (see Table 2.2). The first 4 matrices are from The SuitSparse Matrix Collection and the
Nm7 matrix is from nuclear no-core shell model code MFDn.
                             Table 2.2: Matrices used in our evaluation.
                          Matrix           Rows        Columns      Nonzeros
                          inline1         503,712       503,712    36,816,170
                     dielFilterV3real    1,102,824     1,102,824   89,306,020
                          HV15R          2,017,169     2,017,169  283,073,458
                        Queen4147        4,147,110     4,147,110  316,548,962
                           Nm7           4,985,422     4,985,422  647,663,919
    We compare the performance of DeepSparse with two other library implementations: i) libcsr is
implementation of the benchmark solvers using thread-parallel Intel MKL Library calls (including
SpMV/SpMM) with CSR storage of the sparse matrix, ii) libcsb is an implementation again using
Intel MKL calls, but with the matrix being stored in the CSB format. Performance data for LOBPCG
is averaged over 10 iterations, while the number of iterations is set to 50 for Lanczos runs. Our
performance comparison criteria are L1, L2, LLC misses and execution times for both solvers. All
cache miss data was obtained using the Intel VTune software.
    Performance of the DeepSparse and libcsb implementations depends on the CSB block sizes
used. Choosing a small block size creates a large number of small tasks. While this is preferable
on a highly parallel architecture, the large number of tasks may lead to significant task execution
overheads, in terms of both cache misses and execution times. Increasing the block size reduces such
overheads, but this may then lead to increased thread idle times and load imbalances. Therefore,
the CSB block size is a parameter to be optimized based on the specific problem. Different block
sizes we experimented with have been 1K, 2K , 4K, 8K and 16K.
                                                 25


2.3.2   LOBPCG Evaluation
In Figure 2.7, we show the number of cache misses at all three levels (L1, L2 and L3) and execution
time comparison between all three versions of the LOBPCG algorithm compiled using the GNU
compiler. LOBPCG is a complex algorithm with a number of different kernel types; its task graph
results in millions of tasks for a single iteration. As shown in Figure 2.7, except for the Nm7
matrix, libcsb and libcsr versions achieve similar number of cache misses; for Nm7, libcsb has
important cache miss reductions over the libcsr version. On the other hand, DeepSparse achieves
2.5× - 10.7× fewer L1 misses, 6.5× - 16.2× fewer L2 misses and 2× - 7× fewer L3 cache misses
compared to the libcsr version. As the last row of Figure 2.7 shows, even with the implicit task
graph creation and execution overheads of DeepSparse, the significant reduction in cache misses
leads to 1.2× - 3.9× speedup over the execution times of libcsr. Given the highly complex DAG
of LOBPCG and abundant data re-use opportunities available, we attribute these improvements to
the pipelined execution of tasks which belong to different computational kernels (see Figure 2.8)
but use the same data structures. We note that the Task Executor in DeepSparse solely relies on
the default scheduling algorithm used in the OpenMP runtime environment. By making use of the
availability of the entire global task graph and labeling information on vertices/edges, it might be
possible to improve the performance of DeepSparse even further.
2.3.3   Lanczos Evaluation
In Figure 2.9, cache misses and execution time comparisons for different Lanczos versions are
shown. Lanczos algorithm is much simpler than LOBPCG, it has much fewer types and numbers of
tasks than LOBPCG (basically, one SpMV and one inner product kernel at each iteration). As such,
there are not many opportunities for data re-use. In fact, we observe that DeepSparse sometimes
leads to increases in cache misses for smaller matrices. However, for the Nm7 and HV15R
matrices, which are the largest matrices among our benchmark set, we observe an improvement in
cache misses, achieving up to 2.4× fewer L1 cache misses, 3.1× fewer L2 misses and 4.5× fewer
L3 misses than libcsr. But most importantly, DeepSparse achieves up to 1.8× improvement in
                                                  26


       0.7                   2.1                    3.5                 7              14
       0.6                   1.8                    3.0                 6              12
       0.5                   1.5                    2.5                 5              10
       0.4                   1.2                    2.0                 4               8
Billions
       0.3                   0.9                    1.5                 3               6
       0.2                   0.6                    1.0                 2               4
       0.1                   0.3                    0.5                 1               2
       0.0                   0.0                    0.0                 0               0
                                                            L1 Misses
       0.42                  1.4                    2.1                 4.9            14
       0.36                  1.2                    1.8                 4.2            12
       0.30                  1.0                    1.5                 3.5            10
       0.24                  0.8                    1.2                 2.8             8
Billions
       0.18                  0.6                    0.9                 2.1             6
       0.12                  0.4                    0.6                 1.4             4
       0.06                  0.2                    0.3                 0.7             2
       0.00                  0.0                    0.0                 0.0             0
                                                            L2 Misses
           0.035             0.28                   0.14                0.42           3.5
           0.030             0.24                   0.12                0.36           3.0
           0.025             0.20                   0.10                0.30           2.5
           0.020             0.16                   0.08                0.24           2.0
Billions
           0.015             0.12                   0.06                0.18           1.5
           0.010             0.08                   0.04                0.12           1.0
           0.005             0.04                   0.02                0.06           0.5
           0.000             0.00                   0.00                0.00           0.0
                                                           LLC Misses
           0.14              0.63                   0.77                1.4            2.8
           0.12              0.54                   0.66                1.2            2.4
           0.10              0.45                   0.55                  1              2
Time(s)
           0.08              0.36                   0.44                0.8            1.6
           0.06              0.27                   0.33                0.6            1.2
           0.04              0.18                   0.22                0.4            0.8
           0.02              0.09                   0.11                0.2            0.4
           0.00              0.00                   0.00                  0              0
                                                           Execution Time
                   inline1             dielFilter          HV15R          Queen_4147         Nm7
Figure 2.7: Comparison of L1, L2, LLC misses and execution times between Deepsparse, libcsb
and libcsr for the LOBPCG solver.
                   Figure 2.8: LOBPCG single iteration execution flow graph of dielFilterV3real.
                                                           27


           0.28               0.7                 2.8                   2.1                 14
           0.24               0.6                 2.4                   1.8                 12
           0.20               0.5                 2.0                   1.5                 10
           0.16               0.4                 1.6                   1.2                  8
Billions
           0.12               0.3                 1.2                   0.9                  6
           0.08               0.2                 0.8                   0.6                  4
           0.04               0.1                 0.4                   0.3                  2
           0.00               0.0                 0.0                   0.0                  0
                                                            L1 Misses
           0.14               0.42                2.1                   1.4                 7
           0.12               0.36                1.8                   1.2                 6
           0.10                0.3                1.5                   1.0                 5
           0.08               0.24                1.2                   0.8                 4
Billions
           0.06               0.18                0.9                   0.6                 3
           0.04               0.12                0.6                   0.4                 2
           0.02               0.06                0.3                   0.2                 1
           0.00                  0                0.0                   0.0                 0
                                                            L2 Misses
           0.08               0.21                0.7                   0.8                 3.5
           0.07               0.18                0.6                   0.7                 3.0
           0.06               0.15                0.5                   0.6                 2.5
           0.04               0.12                0.4                   0.4                 2.0
Billions
           0.03               0.09                0.3                   0.3                 1.5
           0.02               0.06                0.2                   0.2                 1.0
           0.01               0.03                0.1                   0.1                 0.5
           0.00               0.00                0.0                   0.0                 0.0
                                                           LLC Misses
           0.028              0.07                0.14                  0.21                0.7
           0.024              0.06                0.12                  0.18                0.6
            0.02              0.05                0.10                  0.15                0.5
Time(s)
           0.016              0.04                0.08                  0.12                0.4
           0.012              0.03                0.06                  0.09                0.3
           0.008              0.02                0.04                  0.06                0.2
           0.004              0.01                0.02                  0.03                0.1
               0              0.00                0.00                  0.00                  0
                                                         Execution Time
                    inline1          dielfilter             HV15R              Queen_4147         Nm7
Figure 2.9: Comparison of L1, L2, LLC misses and execution times between Deepsparse, libcsb
and libcsr for the Lanczos solver.
terms of execution time. We attribute the execution time improvement observed across the board
to the increased degrees of parallelism exposed by the global task graph of DeepSparse, which is
in fact highly critical for smaller matrices.
2.3.4              Compiler Comparison
For all of our experiments, we use OpenMP as our backend. To explore the impact of different task
scheduling approaches in different OpenMP implementations, we have experimented with three
compilers: Intel, GNU and Clang/LLVM compilers. In Figure 2.10, we show the comparison in
execution time among different compilers for the three implementations. We see that the execution
                                                           28


          0.7                                1.4                                 4.2                                   4.2                                 10.5
          0.6                                1.2                                 3.6                                   3.6                                   9
          0.5                                    1                               3.0                                    3                                   7.5
Time(s)
          0.4                                0.8                                 2.4                                   2.4                                   6
          0.3                                0.6                                 1.8                                   1.8                                  4.5
          0.2                                0.4                                 1.2                                   1.2                                   3
          0.1                                0.2                                 0.6                                   0.6                                  1.5
              0                                  0                               0.0                                    0                                    0
                  Sp ar
                              sb      sr             Sp ar
                                                                    sb      sr         Sp ar
                                                                                                    sb      sr               Sp            sb         sr          Sp          sb           sr
                       se   lib c   lib c                 se   lib c     lib c              se    lib c   lib c                 arse     lib c   lib c               arse   lib c        lib c
          De                                 De                                  De                                    De                                   De
            ep                                 ep                                  ep                                    ep                                   ep
                        Inline1                                DielFilter                         HV15R                                Queen_4147                           Nm7
Figure 2.10: Comparison of execution time for different compilers between Deepsparse , libcsb
and libcsr for Lanczos Algorithm. (Blue/Left: GNU, Red/Middle: Intel, Green/Right: Clang
compiler.)
                                DeepSparse                                                       Libcsb                                                    Libcsr
                  1.1                                                    3.5                                                       3.5
                  0.9                                                    3.0                                                       3.0
                  0.8                                                    2.5                                                       2.5
                  0.6                                                    2.0                                                       2.0
   Billions
                  0.5                                                    1.5                                                       1.5
                  0.3                                                    1.0                                                       1.0
                  0.2                                                    0.5                                                       0.5
                  0.0                                                    0.0                                                       0.0
                               L1           L2                 L3                L1                 L2            L3                             L1               L2                L3
                                    Figure 2.11: Cache Miss comparison between compilers for HV15R
time for the Clang/LLVM compiler is significantly higher compared to GNU and Intel compilers
for all matrices. However, cache misses stay pretty much the same when one moves to a different
compiler. We show the cache miss comparison between the three compilers in Figure 2.11 for one
matrix, HV15R. All other matrices follow a similar cache miss pattern like HV15R. Here, we can
clearly see that regardless of the compiler, DeepSparse achieves fewer cache misses over libcsb and
libcsr implementations. We can see that Clang/LLVM shows fewer cache misses for DeepSparse
as well, but it eventually has a poor running time. We believe that this is because Clang/LLVM
is not able to schedule tasks as efficiently as GNU and Intel. Compared to Intel compiler , GNU
compiler sometimes shows more L1 and L2 misses. But the execution time is higher in Intel. This
may be due to the scheduling strategy and and the implementation of task scheduling points in
the compilers. Overall, GNU does best with running times among the three compilers, and Intel
compilers do not do well with the library based solver implementations.
                                                                                                   29


2.4     Summary of the Chapter
    This chapter introduced a novel task-parallel sparse solver framework that targets all com-
putational steps in sparse solvers. We demonstrated that our DeepSparse framework achieves
significantly fewer cache misses across different cache layers and improves the execution time over
optimized library versions. Inspired by the performance of DeepSparse on shared memory architec-
tures, our major goal is to port and optimize DeepSparse on heterogeneous architectures. In doing
so, we would like to use directive-based GPU programming models (i.e., OpenMP or OpenACC)
to make it a portable and performant framework. Evaluating the performance of directive-based
programming models to offload the main kernels involved (such as SpMV, SpMM, vector opera-
tions) constitutes a preliminary step towards our major goal which we discuss in detail in the next
chapter.
                                                 30


                                           CHAPTER 3
EVALUATION OF DIRECTIVE-BASED GPU PROGRAMMING MODEL ON A BLOCK
                                      ITERATIVE SOLVER
Most of the of TOP500 and Green500 supercomputers employ CPUs and accelerators like GPUs [6,
20] from different vendors. There is a pressing need to migrate and optimize applications for execu-
tion on GPUs and other accelerators in order to utilize the full capabilities of the GPU accelerated
supercomputers. As most of the supercomputers are equipped with GPUs from different vendors,
we need a common programming model that will allow scientists and application developers to
develop their code once and run it everywhere. In recent years, OpenACC and OpenMP have
emerged as portable, base-language independent, and an increasingly robust and performant way to
target accelerators. These directive-based methods have lowered the barrier of entry for application
developers to target accelerators and are anticipated to be a key enabler for HPC users to efficiently
use forthcoming supercomputers. However, there needs to be wider testing of OpenMP and Ope-
nACC in scientific applications to address any shortcomings in the language specifications, improve
the robustness and performance of vendor compilers, and continue to refine our understanding of
best practices to migrate applications to accelerators.
    OpenMP and OpenACC enable a programmer to run application kernels on a GPU. Multiple
compilers support these directives and can generate GPU code. The quality of GPU support in
OpenMP and OpenACC compilers is evaluated in [60] on a suite of 4 mini applications. Here, the
authors find issues with all compilers as well as challenges in creating a single portable code which
compiles and executes efficiently for all compilers. The interoperability of CUDA and OpenACC
is evaluated in [61]. The author successfully combines hand-written CUDA with OpenACC
when using the PGI compiler. Our work evaluates the performance of OpenMP and OpenACC
implementations of a block eigensolver, as well the interoperability of these runtime systems with
optimized CUDA libraries for 3 different compilers.
    Unified Memory (UM) is a programming feature which provides a single memory address space
                                                  31


accessible by all processors in a compute node. It greatly simplifies GPU programming because
the same single pointer to data can be used on both CPU and GPU. The NVIDIA Volta V100 GPU
provides a page migration engine to move memory pages between CPU and GPU when the page
is not in the memory of the processor accessing the data. NVIDIA evaluated UM performance
using the PGI OpenACC compiler in [62]. The authors created UM versions of the OpenACC
applications in the SPEC ACCEL 1.2 benchmark suite. They ran the applications on the Piz-Daint
supercomputer and found that the UM versions ran at 95% of the performance of the original
explicit data management versions. In [63], the NVIDIA presenter shows that the Gyrokinetic
Toroidal Code (GTC) has almost identical performance on a x86+V100 system whether OpenACC
data directives are used or not. Our work also compares UM against explicit data management,
but additionally considers problems whose memory requirements are significantly over the device
memory capacity. The performance of oversubscribing UM is evaluated in [64]. The authors
find that UM can be up to 2x slower than explicit data management in several applications on
an x86+V100 system. Our work considers performance on both x86 and Power GPU-accelerated
systems.
    In this work, we port and optimize a block eigensolver for GPUs using a combination of
directives and optimized library calls. The main appeal of block eigensolvers (e.g., the LOBPCG
algorithm) as opposed to the more conventional SpMV based solvers is their high arithmetic
intensity which is especially important to reap the full benefits of GPUs. The main computational
kernels involved in block iterative solvers are the multiplication of a sparse matrix with multiple
vectors and level-3 BLAS operations on dense vector blocks. Optimizing the SpMM kernel on
GPUs has been studied in several research works. Yang et al. [65] propose two novel algorithms
for SpMM operation on GPUs that take the sparse matrix input in compressed-sparse-row (CSR)
format and focus on latency hiding with instruction-level parallelism and load-balancing. They find
out a memory access pattern that allows efficient access into both input and output matrices which
is the main enabler for their excellent performance on SpMM. A common optimization strategy
of SpMM is to rely on a special sparse matrix representation to exploit the nonzeros efficiently.
                                                 32


Most commonly used sparse matrix storage variants other than CSR format are ELLPACK called
ELLPACK-R [66] and a variant of Sliced ELLPACK called SELL-P [67]. Hong et al. [68] separates
the sparse matrix into heavy and light rows in order to perform dynamic load-balancing. They
process the heavy rows by CSR and the light rows by doubly compressed sparse row (DCSR)
in order to take advantage of tiling. However, these special matrix storage formats incur some
additional computational and format conversion cost in the full computational pipeline.
    Anzt et. al [67] optimize the performance of SpMM using ELLPACK format [69] and compare
the performance of their CPU-GPU implementation with the multithreaded CPU implementation
of LOBPCG provided in the BLOPEX [70] package. All of their kernels were written in CUDA 5.5
and they evaluated the performance experiment on two Intel Sandy Bridge CPUs and one NVIDIA
K40 GPU. Dziekonski et. al [71] implement LOBPCG method with an inexact nullspace filtering
approach to find eigenvalues in electromagnetics analysis.
    Most of the prior work focused on optimizing either the SpMV or the SpMM operation on
GPUs with the ultimate goal of accelerating the iterative solver used in a scientific application. A
distinguishing aspect of this work is that we adopt a holistic approach that includes all computa-
tional kernels required for the LOBPCG solver. We use directive based programming models to
achieve portability. We also investigate the scenario where the total memory footprint exceeds the
device memory capacity and propose a solution that addresses performance degradations seen with
NVIDIA’s generic “Unified Memory” approach.
    Our contributions in this part can be summarized as follows:
     • We demonstrate that a complex block eigensolver can be implemented efficiently using a mix
       of accelerator directives (in both OpenMP and OpenACC frameworks) and optimized library
       functions. We obtain up to a 4.3x speedup over a well optimized CPU implementation.
     • We show that the performance of the Unified Memory version of SpMM, the dominant kernel
       in LOBPCG, depends on the supercomputer used and apparently the underlying CPU to GPU
       interconnect, when application working set exceeds GPU memory capacity. We measure a
                                                 33


       13.4x performance loss when migrating from a supercomputer with a PCIe Gen3 CPU to
       GPU interconnect to one with NVLink 2.0.
     • We address the Unified Memory performance portability issue by tiling the dominant kernels
       in LOBPCG. This obtains the highest performance on both supercomputers which have
       different CPU to GPU interconnects.
3.1     Methodology
    In this section, we provide an overview of the LOBPCG algorithm, our baseline CPU imple-
mentation, and the steps we took to port and optimize the CPU implementation to run efficiently
on GPU-accelerated systems using OpenMP and OpenACC. We then describe our pathfinding
activities for creating an efficient LOBPCG algorithm which can operate on matrices exceeding the
device memory capacity. In particular, we discuss how we tiled the two most expensive kernels
in LOBPCG and created microbenchmarks that enable performance comparison of programmer-
controlled and system-controlled (i.e. Unified Memory) data movement schemes between the CPU
and GPU. Finally, we describe the experimental platforms used for evaluating the performance of
our LOBPCG and microbenchmark implementations on GPU-accelerated systems.
3.1.1    The LOBPCG algorithm
Algorithm 3 shows the pseudocode of the LOBPCG algorithm. We explain the LOBPCG algorithm
in Section 2.2.2 of Chapter 2. LOBPCG algorithm is a good test application for our experiments as
it comprises high arithmetic intensity operations (SpMM and Level-3 BLAS) and some tall-skinny
matrix addition, subtraction, multiplication, and reduction operations.
3.1.2    Baseline CPU implementation
We implemented the baseline CPU version of LOBPCG using OpenMP and OpenACC directives
in C/C++. We adopted the Compress Sparse Row (CSR) format to store the sparse matrix and used
                                                 34


the mkl_dcsrmm routine from Intel MKL library for the SpMM kernel. We also implemented a
custom SpMM kernel in both OpenMP and OpenACC, again based on the CSR format, and used
it with the PGI and IBM compilers. For all LAPACK and BLAS routines needed, we used Intel
MKL, the PGI-packaged LAPACK and BLAS libraries, and IBM ESSL for Intel, PGI and IBM
compilers, respectively.
3.1.3   A GPU implementation of LOBPCG
The most expensive kernels in the baseline CPU version are the SpMM operation and the inner
product of vector blocks (X T Y ). The cuSPARSE [72] and cuBLAS CUDA libraries provide
tuned versions of these kernels. We used cusparseDcsrmm for the SpMM operation and replaced
cblas_dgemm routine with cublasDgemm for the vector block operations. We allocated the device
data for these routines using cudaMalloc. We ported the remaining application kernels using
OpenMP and OpenACC offloading pragmas. The application kernels are grouped together inside
device data regions to avoid data movement between successive application kernels. However, the
performance of this implementation was still poor because significant time was spent moving data
between CPU and GPU. This happened because the application and library kernels were operating
on distinct data on the GPU.
    OpenMP and OpenACC provide a clause to enable the application kernels to operate on data
already resident on the device. The clause is named is_device_ptr in OpenMP and deviceptr
in OpenACC. We used the pointer returned by cudaMalloc in our OpenACC implementation. This
approach caused a run time error in our OpenMP implementation compiled with LLVM/Clang. We
therefore replaced cudaMalloc with omp_target_alloc in our OpenMP implementation because
the OpenMP 5.0 specification [73] states that “Support for device pointers created outside of Open-
MP, specifically outside of the omp_target_alloc routine and the use_device_ptr clause, is
implementation defined.”. Figure 3.1 shows an example of the structure of most of our application
kernels after using this clause. It enabled us to remove multiple OpenMP/OpenACC data regions
                                                  35


                                                         Before using is_device_ptr
/ / d_R and o t h e r d e v i c e a r r a y s a l l o c a t e d w i t h o m p _ t a r g e t _ a l l o c
cublasDgemm ( h a n d l e , CUBLAS_OP_N , CUBLAS_OP_N , b , numrows , b ,
   &cudaAlpha , d_lambda , b , d_X , b , &c u d a B e t a , d_R , b ) ;
/ / Copy o u t p u t a r r a y d_R t o t h e h o s t a r r a y R
omp_target_memcpy ( R , d_R , R _ s i z e ∗ s i z e o f ( d o u b l e ) , 0 , 0 , h , t ) ;
/ / Copy h o s t a r r a y R t o t h e d e v i c e i n OpenMP t a r g e t d a t a r e g i o n
# pragma omp t a r g e t d a t a map ( t o f r o m : newX [ 0 : X _ s i z e ] ) \
   map ( t o : X[ 0 : X _ s i z e ] , R[ 0 : R _ s i z e ] )
{
   m a t _ m u l t (X, R , newX , numrows , b ) ;
}
void mat_mult ( double ∗ src1 , double ∗ src2 , double ∗ dst ,
                        i n t row , i n t c o l )
{
# pragma omp t a r g e t t e a m s d i s t r i b u t e p a r a l l e l f o r c o l l a p s e ( 2 )
   f o r ( i n t i = 0 ; i < row ; i ++)
        f o r ( i n t j = 0 ; j < c o l ; j ++)
            dst [ i ∗ col + j ] = src1 [ i ∗ col + j ] ∗ src2 [ i ∗ col + j ] ;
}
                                                           After using is_device_ptr
/ / d_R and o t h e r d e v i c e a r r a y s a l l o c a t e d w i t h o m p _ t a r g e t _ a l l o c
cublasDgemm ( h a n d l e , CUBLAS_OP_N , CUBLAS_OP_N , b , numrows , b ,
   &cudaAlpha , d_lambda , b , d_X , b , &c u d a B e t a , d_R , b ) ;
/ / P o i n t e r s t o d e v i c e a r r a y s passed i n t o mat_mult f u n c t i o n
m a t _ m u l t ( d_X , d_R , d_newX , numrows , b ) ;
void mat_mult ( double ∗ src1 , double ∗ src2 , double ∗ dst ,
                        i n t row , i n t c o l )
{
    / / Use i s _ d e v i c e _ p t r b e c a u s e d a t a i s a l r e a d y on t h e d e v i c e
# pragma omp t a r g e t i s _ d e v i c e _ p t r ( s r c 1 , s r c 2 , d s t )
# pragma omp t e a m s d i s t r i b u t e p a r a l l e l f o r c o l l a p s e ( 2 )
   f o r ( i n t i = 0 ; i < row ; i ++)
        f o r ( i n t j = 0 ; j < c o l ; j ++)
            dst [ i ∗ col + j ] = src1 [ i ∗ col + j ] ∗ src2 [ i ∗ col + j ] ;
}
Figure 3.1: The use of is_device_ptr to avoid memory copies. Error checking is omitted for
brevity.
and thus considerable data movement between the CPU and GPU 1.
       All kernels run on the GPU except for some LAPACK routines, i.e., LAPACKE_dpotrf and
      1 Alternatively,         we could have copied the data to the device using OpenMP/OpenACC and then
passed the device pointer to the CUDA library functions using OpenMP’s use_device_ptr clause
or OpenACC’s use_device clause. We did not use this approach because we wanted the option to
use cudaMallocManaged to allocate data in managed memory.
                                                                               36


LAPACKE_dsygv which are not available in the CUDA toolkit math libraries. This causes 10 small
matrices to move between CPU and GPU in each iteration of the LOBPCG method. As the sizes
of those matrices are very small, we find that the overhead associated with these data movements
are insignificant compared to the total execution time.
3.1.4    Tiling LOBPCG kernels to fit in GPU memory capacity
The LOBPCG GPU implementation described in Section 3.1.3 allocated the tall skinny matrices
and the sparse matrix in GPU memory. This approach is limited to cases where the aggregated
matrix memory footprint is less than the GPU memory capacity. However, a major challenge in
many scientific domains [74, 75, 11] (such as configuration interaction in MFDn) is the massive
size of the sparse matrix, which can have several billions of rows and columns and the total number
of nonzeros can easily exceed trillions. In this subsection, we explain how we tiled the SpMM
and inner product kernels (X T Y ) to operate on problems larger than the GPU memory capacity.
We extracted each kernel into a standalone microbenchmark to check for correctness and enable
performance evaluation. Although not described in this paper, we have also implemented and
evaluated the linear combination kernel (XY ) which has similar characteristics to the inner product
kernel (X T Y ), but involves the multiplication of a tall-skinny vector block (X) with a small square
matrix (Y ).
3.1.4.1   SpMM kernel
The SpMM kernel is typically the most expensive operation in LOBPCG. Figure 3.2 shows the
tiling idea for the SpMM kernel for cases when the LOBPCG data is too large to fit into the GPU
memory. For a given tile size β, we divide the sparse matrix into block of rows. Algorithm 4
describes the steps in our tiled SpMM kernel. In short, we copy the Y matrix to the GPU at the
beginning and it resides there until all sparse matrix tiles are processed. Then, we extract the CSR
format of each of the tiles and copy that to GPU memory. Then we apply the cusparseDcsrmm
routine on the sparse matrix block and Y. This produces the corresponding row blocks of the final
                                                  37


output matrix Z. After processing each tile, we copy back the partial output to the corresponding
tile of the Z matrix.
             Tile0
             Tile1
             Tile2
                 …
                 …
                 …                    X                              Y             Z
            Tilen-1
                                                                     b              b
                          Figure 3.2: Overview of tiling SpMM operation.
   Algorithm 4: Tiled SpMM (cusparseDcsrmm) kernel
     Input: X(m × m) sprase matrix in CSR format (val, rowPtr, colIndex), Y(m × b), β(tile
              size)
     Output: Z(m × b)
  1 nrowblk = d e
                    m
                    β
  2 for i = 0 to nrowblk - 1 do
         // extract_CSR_tile() method extracts the CSR format of the i-th tile from the
             given sparse matrix
  3      [rowPtrTile, colIndxTile, valTile, nnz_Tile] = extract_CSR_tile(val, rowPtr, colIndex,
          i)
  4      cusparseDcsrmm(β, b, m, nnz_tile, 1.0, valTile, rowPtrTile, colIndxTile, R, m, 0.0,
          AR, β)
  5      cudaDeviceSynchronize()
  6      cudaMemcpy(Z[i-th tile], AR, cudaMemcpyDeviceToHost)
  7      cudaDeviceSynchronize()
  8  end
3.1.4.2    Inner product kernel
One of the most frequently invoked and expensive kernels in LOBPCG is the inner product operation
(Z = X T Y ) between two tall skinny matrices. Hence, a well performing tiled inner product kernel
is crucial for large problem sizes. Figure 3.3 shows the overview of the matrix tiling idea for the
                                                 38


                                cublasDgemm call on each tile
                                                                    tile0
          b   tile0   tile1   tile2    XT                tilen-1    tile1
              0       1       2       ….                            tile2      Accumulating partial
                                      ….
                                      ….                 n-1                   output in devZ on GPU
                                                                                                 b
                                                                     Y                   devZ
                                                                                     b
                                                                   tilen-1
                                                                      b
                               Figure 3.3: Overview of tiling Inner Product kernel
 Algorithm 5: Tiled Inner Product (cublasDgemm) Kernel
   Input: X(m × b), Y(m × b), β(tile size)
   Output: Z(b × b)
 1 nrowblk = d e
                 m
                 β
 2 cudaMemset(devZ, 0.0, b*b*sizeof(b))
 3 for i = 0 to nrowblk - 1 do
 4     cudaMemcpy(devX, X[i-th block], β * b, cudaMemcpyHostToDevice);
 5     cudaMemcpy(devY, Y[i-th block], β * b, cudaMemcpyHostToDevice);
 6     cudaDeviceSynchronize();
 7     cublasDgemm(b, b, β, 1.0, devY, β, devX, β, 1.0, devZ, β);
 8     cudaDeviceSynchronize()
 9 end
10 cudaMemcpy(Z, devZ, b * b, cudaMemcpyDeviceToHost);
inner product kernel. X and Y are of size m × b where m  b. Both matrices are partitioned into
n = dm
     β e tiles. In our custom inner product kernel, we transfer each tile of X and Y from CPU to
GPU and apply cublasDgemm routine on each tile. We keep accumulating the partial output to a
b × b matrix on the GPU. After processing all tiles, we copy back the final result to Z. Algorithm 5
gives an overview of our custom inner product kernel.
                                                          39


3.1.5   Hardware and software environment
We conducted all of our experiments on the Cori-GPU testbed at the National Energy Research
Scientific Computing Center (NERSC) [76] and the Summit supercomputer at the Oak Ridge
Leadership Computing Facility (OLCF) [77]. Cori-GPU is a Cray CS-Storm 500NX consisting of
18 compute nodes. Each compute node has two 20-core Skylake processors clocked at 2.4 GHz and
8 NVIDIA Tesla V100 "Volta" GPUs with 16 GBs of HBM per GPU. The V100 GPU model has
a peak double precision performance of 7.0 TFLOP/s. There is a total of 384 GB DDR4 DRAM
space on each node. The CPUs are connected to the GPUs via four PCIe 3.0 switches and the GPUs
are connected to each other via NVIDIA’s NVLink 2.0 interconnect. The Summit supercomputer
is an IBM AC922 system consisting of 4608 compute nodes [78]. Each compute node has two
22-core IBM Power9 processors clocked at 3.1 GHz and 6 NVIDIA Tesla V100 "Volta" GPUs with
16 GBs of HBM per GPU. The V100 GPU model is based on the SXM2 form factor and has a peak
double precision performance of 7.8 TFLOP/s. There is a total of 512 GB DDR4 DRAM space per
node. Unlike Cori-GPU, the CPUs and GPUs in a Summit compute node are all connected with the
high bandwidth NVLink 2.0 interconnect. This also provides cache coherence between CPUs and
GPUs and enables system-wide atomics. The theoretical peak uni-directional bandwidth between 1
CPU and 1 GPU is 16 GB/s on Cori-GPU and 50 GB/s on Summit. However, the highest pageable
bandwidth we measured from CPU to GPU was 5.2 GB/s on Cori-GPU and 25.0 GB/s on Summit.
    The Cori-GPU and Summit supercomputers provide extensive software environments to com-
pile OpenMP and OpenACC programs. Here, we list the software environment used in this paper.
The software used on the Cori-GPU system were Intel Compiler v19.0.3 (OpenMP for CPU),
LLVM/Clang compiler v9.0.0-git (OpenMP for GPU), and PGI compiler v19.5 (OpenACC for
CPU and GPU). We used Intel MKL with the Intel and LLVM/Clang compilers and PGI’s version
of LAPACK with the PGI compiler. The GPU accelerated libraries were cuSPARSE and cuBLAS
provided with CUDA v10.1.168. The software used on Summit were IBM XLC Compiler v16.1.1-
3 (OpenMP for CPU and GPU) and PGI compiler v19.5 (OpenACC for CPU and GPU). We used
IBM ESSL with the IBM XLC Compiler and PGI’s version of LAPACK with the PGI compiler.
                                              40


Once again, the GPU accelerated libraries were cuSPARSE and cuBLAS provided with CUDA
v10.1.168.
3.1.6   Experiments
In this section we explain the experiments conducted. The first set of experiments are used to
evaluate the LOBPCG GPU implementation. The second set of experiments are used to evaluate
our microbenchmarks on problems exceeding the GPU memory capacity.
3.1.6.1   Performance of the LOBPCG solver
The CPU and GPU implementations of LOBPCG are evaluated using a series of real-world matrices
with different sizes, sparsity patterns and application domains as shown in Table 3.1. The first 2
matrices are from the SuitSparse Matrix Collection [79] and the Nm7 and Nm8 matrices are
extracted from two very large Hamiltonian matrices that arise in nuclear structure calculations with
MFDn. Note that the test matrices have millions of rows and hundreds of millions of nonzeros.
The memory footprint of these matrices vary from 2 GB to 7.8 GB using the CSR matrix format.
                                      Table 3.1: Test Matrices.
     Matrix         Rows      Columns      Nonzeros     Size (GB)              Domain
  Queen_4147      4,147,110   4,147,110   166,823,197     2.018         3D Strctural Problem
     HV15R        2,017,169   2,017,169   283,073,458     3.405    Computational Fluid Dynamics
      Nm7         4,985,422   4,985,422   647,663,919     7.792                 MFDn
      Nm8         7,579,303   7,579,303   592,099,416     7.136                 MFDn
    We measured the runtime of the LOBPCG CPU implementation on a single CPU socket on Cori-
GPU and Summit nodes. The configurations used 1 thread per core and used the appropriate slurm,
jsrun and OpenMP/OpenACC environment variables to bind the process and child threads. We
did not use hyperthreading / SMT because our kernels are memory bandwidth bound. We measured
the runtime of the LOBPCG GPU implementation on a single CPU socket and one GPU on Cori-
GPU and Summit nodes. Our configurations only ever used a single CPU socket to avoid potential
                                                 41


performance issues associated with non-uniform memory access time. We evaluated the compiler
combinations described in Section 3.1.5 and measured runtime with application timers.
3.1.6.2   Performance of X T Y and SpMM kernels for large matrices
Our next experiment evaluated the X T Y microbenchmark and SpMM microbenchmark on input
problems exceeding GPU memory capacity on Cori-GPU and Summit. This experiment is de-
signed to inform our future sparse solver implementations. We tested the tiled versions of the
microbenchmarks so that we could easily separate how much time is spent in computation versus
data movement between the CPU and GPU. If more time is spent in computation then data move-
ment costs can potentially be hidden. In the X T Y microbenchmark, we chose to multiply two
matrices of size 67, 108, 864 × 48 leading to a memory footprint of 51.54 GB. We set the tile size
(β) to 131, 072 for the X T Y microbenchmark and 2, 597, 152 for the SpMM microbenchmark as
this gives us the best performance. The tile size (β) is an optimization parameter and one can vary it
as long as the memory footprint required to process a single tile is less than GPU memory capacity.
In the SpMM microbenchmark, we used a synthetic input matrix of 24 GB, leading to a memory
footprint of 35.1 GB. The dimension of the synthetic sparse matrix is 14, 957, 833 × 14, 957, 833
with 1, 946, 671, 770 nonzeros. We multiplied this sparse matrix with a dense matrix of dimension
14, 957, 833 × 48. We used a multi-threaded RMAT graph generator [80] to generate our synthetic
sparse matrix. We measured compute and data movement time using the nvprof profiler.
3.1.6.3   Performance of tiled and Unified Memory versions of SpMM
Our final experiment evaluated the Unified Memory version of the SpMM microbenchmark. The
Unified Memory version was written in OpenACC and compiled with the PGI compiler and the
compiler option -ta:tesla:managed to replace regular system memory allocations with managed
memory allocations. We compared runtime against the tiled version of SpMM on Cori-GPU
and Summit for two input matrices. The first input matrix is Nm7 (see table 3.1) and leads to a
microbenchmark memory footprint of 11.7 GB. The second input matrix is the synthetic sparse
matrix (14, 957, 833 × 14, 957, 833 with 1, 946, 671, 770 nonzeros) and leads to a microbenchmark
                                                   42


memory footprint of 35.1 GB. The matrices are chosen to create problems less than GPU memory
capacity and greater than GPU memory capacity. In both cases, we multiplied these sparse matrices
with a dense matrix of 48 vector blocks. We set the tile size (β) to 2, 597, 152 for both of matrices
as it is the highest tile size that we can use without overflowing the GPU memory and it gives the
best performance. The nvprof profiler is used to collect compute time, data movement time, and
Unified Memory data movement and page fault time.
3.2     Results
    In this section we show performance results on the Cori-GPU and Summit supercomputers.
Section 3.2.1 shows the performance of the CPU and GPU versions of LOBPCG when parallelized
with either OpenMP or OpenACC. We then consider how we could use the LOBPCG solver on
matrices larger than GPU memory capacity. Section 3.2.2 shows performance results when tiling
the dominant X T Y and SpMM kernels so that each tile fits within GPU memory capacity. Finally,
Section 3.2.3 compares the performance of the tiled implementation of the SpMM kernel against a
naive Unified Memory implementation.
3.2.1    Performance of the LOBPCG solver
We compared the performance of the LOBPCG solver when using a suite of different compilers.
The compilers can all generate code for the host CPU and sometimes also for the GPU. In the
following sentences, we place CPU or GPU in parenthesis to indicate whether we used the compiler
to generate code for the CPU or GPU. The OpenMP compilers were Intel (CPU) and Clang (GPU)
on Cori-GPU and IBM (CPU and GPU) on Summit. The OpenACC compiler was always PGI (CPU
and GPU). In all cases we used a hand-written portable SpMM kernel except for our Intel compiler
experiment which used mkl_dcsrmm from Intel MKL. We did this to obtain the best possible CPU
time to more transparently show the value of our GPU implementation. The performance results
for the Nm7 matrix are shown in Figure 3.4. The execution time of the LOBPCG solver is averaged
over 10 iterations.
                                                   43


              12.00
                                           9.82
              10.00
               8.00
 Time (sec)
               6.00                                     5.38
                      4.87
               4.00
                                  2.25
               2.00
                                                                  0.77        0.74     0.76        0.74
               0.00
                      OpenACC     OpenMP   OpenACC      OpenMP    OpenACC     OpenMP    OpenACC    OpenMP
                       Cori-GPU (CPU)         Summit (CPU)       Cori-GPU (CPU+GPU)    Summit (CPU+GPU)
Figure 3.4: The time spent in LOBPCG on Cori-GPU and Summit when using various compilers
with either OpenMP or OpenACC
       The results show that the execution time of our GPU implementation is almost independent of
directive based programming model and evaluation platform. Our reasoning is that the OpenMP
and OpenACC configurations use the same GPU math libraries, the GPUs are nearly identical in
Cori-GPU and Summit (different V100 models), and that our LOBPCG implementation has been
highly tuned to minimize data movement between CPU and GPU. The best GPU performance
is 3.05x faster than the best CPU performance for Nm7 matrix. The CPU versions show more
variable performance for different combinations of compilers and math libraries used on Cori-GPU
and Summit. The highest performance is obtained with the OpenMP version when compiled with
Intel compiler on Cori-GPU. The performance differences can mostly be attributed to the host
CPU and SpMM performance: mkl_dcsrmm is 1.4x faster than our hand-written SpMM kernel in
OpenMP and the hand-written SpMM kernel is 1.5 - 3.0x faster when using OpenMP rather than
OpenACC. We did not investigate the host CPU performance in any more detail because it is not
the focus of our work.
       Figure 3.5 shows how time is spent in the best configurations on CPU and GPU when using the
Nm7 matrix. Execution time is divided into library time, application kernel time, and unaccounted
                                                         44


CUDA API time. The library time is spent in cuBLAS and cuSPARSE in the GPU implementation
and Intel MKL in the CPU implementation. The application kernel time is spent in user defined
functions in both the CPU and GPU implementations. The CUDA API time includes GPU data
allocation and data movement between CPU and GPU and is calculated by subtracting time spent
in application and library kernels from the total run time. The library and application kernels
speedup by 3.7x and 5.0x, respectively, when using GPUs. Application kernel time is a relatively
small fraction of total run time on GPU. However, the offload is a key optimization step needed
to keep total run time low. Total run time would be significantly slower if we decided to use host
application kernels because of unnecessary data movement between CPU and GPU.
                                  Library kernels   CUDA API calls   Application kernels
              3.5
               3
              2.5                                                                     2.25
 Time (sec)
               2
              1.5
               1                      0.74
              0.5
               0
                                 OpenMP+GPU                                      OpenMP+CPU
                    Figure 3.5: The time spent in LOBPCG on Cori-GPU when using matrix Nm7
       Figure 3.6 shows GPU speedup over the best LOBPCG CPU implementation for all the test
matrices in Table 3.1. The LOBPCG GPU implementation achieves 2.8x - 4.3x speedup over
the best CPU implementation. The GPU implementation therefore performs well over a range of
matrices from different domains with different sparsity patterns.
                                                          45


            5
           4.5                            4.31
            4
           3.5         3.36
                                                               3.05
            3                                                                        2.80
 Speedup
           2.5
            2
           1.5
            1
           0.5
            0
                   Queen_4147           HV15R                 Nm7                   Nm8
                  Figure 3.6: LOBPCG GPU speedup on Cori-GPU for each test matrix
3.2.2        Performance of X T Y and SpMM kernels for large matrices
Figure 3.7 shows the time spent in the inner product (X T Y ) kernel on Cori-GPU and Summit
when total memory footprint is 51.54 GB. The tile size is 131,072. The total time is divided into
host-to-device (HtoD) data transfer time and computation time in the inner product kernel (device-
to-host (DtoH) data transfer times are negligible for this kernel). We measured data transfer and
computation time using nvprof. The results show that total run time is dominated by data transfers.
Run time is lower on Summit because of the high bandwidth NVLink 2.0 interconnect. We obtained
data transfers of 4 GB/s on Cori-GPU and 13 GB/s on Summit in this kernel. Results indicate that
data transfer time cannot be hidden behind computation when the matrix exceeds the GPU memory
capacity.
     Figure 3.8 shows the time spent in the SpMM kernel. The input sparse matrix is 24 GB and the
total memory footprint is 35.1 GB. This time, results show that computation time is greater than the
data movement time. This indicates that data movement time could be completely hidden behind
computation. It would therefore be possible to obtain nearly the same computational throughput
as one would get using matrices completely resident in the GPU memory. However, an actual
                                                 46


                               Compute time       CUDA memcpy HtoD time
               14.00
               12.00
               10.00
  Time (sec)
                8.00
                             11.98
                6.00
                4.00
                                                                            3.82
                2.00
                0.00          0.36                                          0.42
                            Cori-GPU                                      Summit
Figure 3.7: Time spent in X T Y kernel on Cori-GPU and Summit when the memory footprint
exceeds GPU memory capacity.
block eigensolver alternates between SpMM and vector block operations, so this may not be easy
to realize in practice.
                                   Compute time     CUDA memcpy time
               30
               25
                            7.24
               20                                                          2.62
  Time (sec)
               15
               10          18.15                                           18.06
                5
                0
                          Cori-GPU                                        Summit
Figure 3.8: Time spent in SpMM kernel on Cori-GPU and Summit when the memory footprint
exceeds GPU memory capacity
                                                  47


                5                              4.76
               4.5                                                                    4.34
                           4.15
                4
               3.5
                3                                               2.81
  Time (sec)
               2.5
                2
               1.5
                1
               0.5
                0
                          Tiling          Unified Memory       Tiling           Unified Memory
                                   Cori GPU                              Summit
Figure 3.9: Time spent in tiled and Unified Memory versions of the SpMM kernel on Cori-GPU
and Summit. The memory footprint is less than GPU memory capacity.
3.2.3            Performance of tiled and Unified Memory versions of SpMM
Figure 3.9 shows the performance of the tiled SpMM kernel compared to the Unified Memory
version of the SpMM kernel when the memory footprint is less than GPU memory capacity. The
total memory footprint of this experiment is 11.7 GB. The tiled version is fastest on both platforms.
nvprof shows that the tiled version is faster on Summit because of less time in CUDA memcpy.
Interestingly, the Unified Memory version performs similarly on both platforms.
        Figure 3.10 shows the performance of the two SpMM kernels when the memory footprint
exceeds GPU memory capacity. We used the same tile size (β) for the tiled experiments in Figures
3.9 and 3.10. There are now significant differences between the performance of the tiled and
Unified Memory versions. The most surprising result is the 48.2x performance difference between
tiled and Unified Memory versions on Summit. This is a performance difference of 13.4x between
Cori-GPU and Summit when using Unified Memory on different machines. This is unexpected
given the high bandwidth NVLink 2.0 interconnect and hardware managed cache coherency on the
Summit IBM system. Although not shown, there is a similar performance difference on Summit
                                                      48


for the X T Y and XY kernels. Unified Memory performance is therefore poor and depends on the
machine used.
                                                                                           997.68
              1000.00
               100.00                           74.56
 Time (sec)
                            25.39
                                                                    20.68
                10.00
                 1.00
                            Tiling         Unified Memory           Tiling             Unified Memory
                                     Cori GPU                                Summit
Figure 3.10: Time spent in tiled and Unified Memory versions of the SpMM kernel on Cori-GPU
and Summit. The memory footprint exceeds GPU memory capacity. We use a logarithmic scale on
the Time (sec) axis to capture the slow run time for the Unified Memory configuration on Summit.
       Figure 3.11 shows nvprof output for the Unified Memory version of the XY kernel on Cori-
GPU and Summit. The results show that the total count of page faults and the total data moved is
the same on both systems. As expected, the data transfer is 3x faster on Summit according to the
bandwidth of the CPU to GPU interconnect. However, the metric named "Gpu page fault groups"
takes 30x more time on Summit compared to Cori-GPU for unknown reasons. This explains
the poor performance on Summit. We observed similar performance difference without nvprof
(nvprof added a performance overhead of about 10% on both machines). We are currently in
contact with OLCF and NVIDIA staff to understand our performance observations.
3.3            Discussion
       In this section we discuss the key learnings from the results in Section 3.2.
       The results show that we have successfully ported the LOBPCG solver to NVIDIA GPUs using
                                                        49


Cori-GPU
D e v i c e " T e s l a V100−SXM2−16GB ( 0 ) "
     Count Avg S i z e Min S i z e Max S i z e                 Total Size        T o t a l Time     Name
   196608 1 7 0 . 6 7KB 4 . 0 0 0 0KB 0 . 9 9 6 1MB            3 2 . 0 0 0 0 0GB   3.326868 s       H o s t To D e v i c e
        8526 1 . 9 9 9 3MB 4 . 0 0 0 0KB 2 . 0 0 0 0MB         1 6 . 6 4 6 5 5GB   1.368811 s       D e v i c e To H o s t
     98304                  −               −                −                 − 10.668444 s        Gpu p a g e f a u l t g r o u p s
T o t a l CPU Page f a u l t s : 98305
Summit
D e v i c e " T e s l a V100−SXM2−16GB          (0) "
     Count Avg S i z e Min S i z e               Max S i z e   Total Size        T o t a l Time     Name
   163840 2 0 4 . 8 0KB 6 4 . 0 0 0KB            9 6 0 . 0 0KB 3 2 . 0 0 0 0 0GB   1.078612 s       H o s t To D e v i c e
        8525 1 . 9 9 9 8MB 6 4 . 0 0 0KB         2 . 0 0 0 0MB 1 6 . 6 4 8 5 0GB 3 9 6 . 9 5 3 3 ms D e v i c e To H o s t
     98304                  −               −                −                 − 313.43688 s        Gpu p a g e f a u l t g r o u p s
        8524 2 . 0 0 0 0MB 2 . 0 0 0 0MB         2 . 0 0 0 0MB 1 6 . 6 4 8 4 4GB                  − Remote mapping from d e v i c e
T o t a l CPU Page f a u l t s : 98305
T o t a l r e m o t e m a p p i n g s t o CPU :  8524
Figure 3.11: Unified Memory nvprof profile of the XY microbenchmark on Cori-GPU (top) and
Summit (bottom).
directives and optimized CUDA library calls. We obtained similar performance for the OpenMP
implementation using Clang and XLC compiler as we did for the OpenACC implementation using
the PGI compiler. The quality of OpenMP compilers for GPUs have often been criticized over
the past few years [60], however, our experience provides evidence that OpenMP compilers are
becoming more robust and are capable of generating high performance code.
      We found that the key enabler of performance was to keep data resident on the GPU between calls
to optimized CUDA math functions. We were able to do this trivially by adding OpenMP/OpenACC
accelerator directives to the large number of kernels in the LOBPCG solver. In the past, this would
have been much more challenging and time-consuming because the remaining application kernels
would need to be ported to CUDA. Our related work section shows that earlier attempts to port
a LOBPCG solver to GPUs by other scientists was generally focused on optimizing the SpMM
kernel only on GPU whereas we focus on optimizing the full solver on GPU. This highlights the
productivity gains from using directives and the importance of interoperability between the code
generated by the OpenMP/OpenACC compilers and CUDA. This interoperability is not required
in the OpenMP specification and is only recommended as a note to implementors in the OpenACC
specification. However, we have highlighted the importance of interoperability, and believe that
the HPC community should strongly request this support from compilers as we have done for
                                                                              50


LLVM/Clang (https://bugs.llvm.org/show_bug.cgi?id=42643).
    We have shown that our LOBPCG microbenchmarks can be tiled to solve problems larger than
GPU memory capacity. We found that the time spent in cublasDgemm for the inner product (X T Y )
microbenchmark is shorter than the time spent moving data to and from the GPU. This indicates
that it is not possible to write a tiled cublasDgemm for larger problems which achieves the same
computational throughput as a problem which fits in GPU memory capacity. The tiled cublasDgemm
performance was mostly determined by the bandwidth of the CPU to GPU interconnect. This will
remain a challenge in many CPU+GPU systems in the coming years because PCIe Gen4 has lower
bandwidth than NVLink 2.0. The SpMM microbenchmark showed the opposite to X T Y in that
more time was spent in computation than data movement. This indicates that data movement costs
could be hidden, i.e., computation on one tile could occur concurrently with the data movement
for the next tile. The full LOBPCG solver includes X T Y and SpMM operations. Therefore, the
amount of computation on the GPU relative to data movement between CPU and GPU is more than
what is shown in our microbenchmarks. This indicates that it should be possible to write an efficient
LOBPCG solver for GPUs which can solve problems larger than the GPU memory capacity.
    We had mixed success when using a Unified Memory implementation of the SpMM kernel. The
performance was a little worse than the tiled implementation when the memory footprint was less
than GPU memory capacity. This could be acceptable to many application programmers because
we obtained this performance with much simpler code. This would be a huge productivity win for
the application programmer because there is no need to manage separate host and device copies
of data; there is just a single pointer to the data which can be used on both host and device. We
found that the performance of the Unified Memory implementation was much worse than the tiled
implementation when the memory footprint exceeded GPU memory capacity. It was so bad on
Summit that it would have been more efficient to use a CPU implementation and leave the GPUs
idle. We are still working to understand why Unified Memory performance was so poor on Summit.
However, our early experience serves as a warning to application programmers that they should not
rely on Unified Memory when application memory footprint is larger than GPU memory capacity.
                                                  51


It is also useful information to HPC system providers that the success of their users strongly depends
on purchasing GPUs with sufficient memory capacity.
     We recommend that tiling be used in large memory footprint applications on CPU+GPU
systems. This can deliver both high performance and predictable performance across different
CPU+GPU systems. However, it can be a significant amount of work to tile and overlap data
transfers with computation in an application. This may become easier in future with enhancements
to the OpenMP standard providing directive-based partitioning and pipelining [81]. Alternatively,
middleware for sparse solvers on GPUs could abstract away these programming challenges.
3.4      Summary of the Chapter
     In this work, we have described our approaches to mix CUDA library calls with OpenMP/Ope-
nACC offloading pragmas in order to implement and optimize the full LOBPCG eigensolver on
GPU-accelerated systems. We successfully used both OpenMP and OpenACC, and achieved a
speedup of 2.8x - 4.3x over a baseline CPU implementation. Our experiments with SpMM and
inner product microbenchmarks showed that tiling is the preferred approach for larger problem
sizes. We found that a naive Unified Memory implementation had worse performance than a
tiled implementation by up to an order of magnitude depending on the target supercomputing
platform. With these observations, we got a better idea of how to design and implement a portable
sparse solvers framework on heterogeneous architectures to solve problem sizes that exceed total
device memory. In the next chapter, we will discuss the details of our portable sparse solvers for
heterogeneous architecture.
                                                 52


                                           CHAPTER 4
     A PORTABLE SPARSE SOLVER FRAMEWORK FOR LARGE MATRICES ON
                           HETEROGENEOUS ARCHITECTURES
Heterogeneous computing architectures employing GPUs have become the mainstream approach
in recent leadership supercomputers [6]. GPUs have been used to accelerate applications in various
domains, such as graph analytics, machine learning, computational finance, climate modeling, and
multimedia. Recently, GPUs have improved significantly in terms of programmability. GPUs are
typically programmed using device-specific (e.g., CUDA) or a portable (e.g., OpenMP) program-
ming model.
    Irregardless of whether an application uses a device-specific or a portable programming model,
the amount of data moved between CPU and GPU is a key concern in heterogeneous systems due
to the wide gap between computational performance and data movement. For best performance,
programmers have to explicitly control the data movement between the host and the device, but this
comes at the expense of programmer productivity. As a general solution, Unified Virtual Memory
(UVM) has become available since CUDA 8.0 and the Pascal architecture (2016). UVM provides
a single and consistent view of the host and device memories on a system. Even though UVM
dramatically reduces developer effort in regards to managing data movement, it typically causes a
significant hit in performance.
    While the size of data needed by scientific applications and machine learning/data analytics
workloads increase at a rapid pace, the memory available on GPUs increase at a much modest pace
in comparison. For example, NVIDIA’s Tesla V100 “Volta" GPUs have only 16 GBs of device
memory available; one could have up to 80 GBs of memory on the newest generation NVIDIA
A100s, but they come with a price tag not affordable by most customers or even HPC centers. When
an application’s working set size exceeds the device memory size, the resulting data movement
becomes a critical design and performance bottleneck [12].
    In this work, we present a tiling-based task-parallel sparse linear algebra framework, named
                                                 53


DeepSparseGPU, which aims to make it easy to develop sparse solvers for large problems on GPU-
accelerated architectures. Due to the irregular data access patterns and low arithmetic intensities
associated with sparse matrix computations, achieving a high performance of the peak processor
percentage, especially on GPUs, is challenging. These challenges are further exacerbated due to
the increasing number of hardware accelerators by different vendors (i.e., NVIDIA, AMD, Intel,
etc.). The main goal of the DeepSparseGPU framework is therefore to ease the development of
performant and portable sparse solvers. Hence, building upon our DeepSparse framework [15]
which runs on shared memory architecture, DeepSparseGPU leverages OpenMP’s target offload
functionality. As we aim to optimize DeepSparseGPU for GPUs, several changes were made to the
original DeepSparse framework, especially to support applications with memory requirements that
exceed the available device memory capacity. We adopt the same task-based tilling approach which
serves the double purpose of enabling data locality optimizations and facilitates the management
of application data so that it can be processed in batches that fit into the device memory. For this,
DeepSparseGPU automatically generates and expresses the entire computation as a task dependency
graph (TDG), which is then executed using OpenMP’s tasking and target offloading functionalities.
A memory manager developed as a part of DeepSparseGPU keeps track of the data residing on host
and device memories, and automatically migrates data between the two whenever needed. Given the
data dependencies between computational tasks, DeepSparseGPU employs a topological-sorting
based heuristic to minimize the data movement and increase application performance.
    Given the importance of sparse solvers in scientific computing and machine learning, several
optimization techniques have been proposed for sparse matrix-vector multiplication (SpMV) on
GPUs [29, 82, 83, 84]. However, the performance of SpMV is bounded by memory bandwidth [85].
Since sparse matrix-matrix multiplication (SpMM) has a much higher arithmetic intensity than
SpMV and can efficiently leverage the performance benefits of GPUs, SpMM-based solvers have
recently drawn significant interest in scientific computing in the form of block linear solvers and
eigensolvers [11]. As such several groups have studied the optimization of the SpMM kernel on
GPUs [65, 66, 67, 68, 67, 69, 71].
                                                  54


    As can be seen, most prior work on accelerating iterative solvers for GPUs has focused only
on optimizing the SpMV/SpMM kernels with a few exceptions. In this work, we present a holistic
framework that includes all computational kernels required for block eigensolvers (LOBPCG is
used as a case study). Another distinguishing aspect of our work compared to the work reviewed
above is the support we provide for applications with memory demands significantly larger than
the available device memory capacity.
    In addition to the support for large applications, we adopt a directive-based programming model
to achieve portability. Several existing work studied the efficacy of the GPU offload support in
OpenMP and/or OpenACC in the context of individual kernels [7, 8, 9], mini-applications [60]
or proxy applications [86]. In this regard, our evaluation of OpenMP’s offloading support in the
context of a real eigensolver applied to matrices from several domains also constitutes a niche. In
doing so, for the benefit of the community, we also discuss the compiler support issues we faced,
as OpenMP offloading implementations are constantly evolving and are advertised as having only
partial support in different compilers.
    In evaluating DeepSparseGPU, we use a GPU accelerated cuSPARSE and cuBLAS library
kernels with Unified Virtual Memory (UVM) based implementation as a baseline. The current data
communication strategy in UVM is full page migration. Oversubscription of GPU memory and
system-wide automatic memory operations are enabled by full-page migration [12, 87]. There are
some poor performance risks in UVM regarding large page migration [62, 64]. Our work compares
UVM against explicit data management and considers problems whose memory requirements
significantly exceed the device’s memory capacity.
    Consequently, our contributions can be summarized as follows:
    • We demonstrate that a complex block eigensolver can be implemented efficiently using
       OpenMP target offload directives by minimizing data movement between CPU and GPU. We
       obtain up to 2.93x speedup and up to 2.18x less data transfer between CPU and GPU over a
       well-optimized UVM based implementation.
                                                   55


    • We designed and developed a Memory Manager (MM) that automatically keeps track of
      data on both CPUs and GPUs, and migrates it whenever needed. Memory manager helps to
      execute problem sizes that exceed device memory.
    • We use topological sort on the DAG to get a custom schedule of the computing tasks.
      Empirically, we find that topological sort helps minimize data movement compared to the
      pain baseline schedule.
    • We share and discuss our experiences and issues that we faced during the development
      process so that the community could benefit from it.
4.1    DeepSparseGPU Overview
    Figure 4.1 illustrates the architectural overview of the DeepSparseGPU framework. DeepSparseGPU
consists of two major components: i) Primitive Conversion Unit (PCU), which provides a front-end
to domain scientists to express their application, such as the LOBPCG solver, at a high-level and
generates the task dependency graph (TDG); and ii) Task Executor (TE), which receives a DAG
from the PCU and performs a topological sort on the DAG to get a task schedule. The task executor
then launches proper kernels that correspond to computational nodes in the TDG according to the
task schedule.
4.1.1   Primitive Conversion Unit (PCU)
The Primitive Conversion Unit (PCU) is composed of two parts: i) Task Identifier and ii) Task
Dependency Graph (TDG) Generator.
4.1.1.1   Task Identifier (TI)
The PCU allows the application developer to express their algorithms at a high level without having
to worry about architectural details (e.g., memory hierarchy) or parallelization considerations (e.g.,
determining the individual tasks and their scheduling). Task identifier parses the given application
                                                    56


  Primitive Conversion Unit (PCU)                                                  Task Executor
            Task Identifier (TI)                        Memory Manager
                                                                                                   In In        In In
                                                                                            SM 0  Data Data    DataData
         do {                                                       Device Memory
         SpMM(Hpsi, H, psi)
                                                                                                   SpMM          dot
         dot(E, psi, psi)
         daxpy(Epsi, E, psi)
         daxpy(R, Hpsi,Epsi)                                     D2H           H2D
                                                                                                   Out Out     Out Out
         dot(W,Tinv, R)
                                                                                                   Data Data   Data Data
         dot(Wmat, W, W)
         dsyevd(S, Wmat).
                                                 Tile DB
         ..                                                          Host Memory
         } while(!converged)                                                                        In In        In In
                                                                                             SM 1  Data Data    Data Data
                                                                                                    SpMM          dot
                                                                                                    Out Out     Out Out
                                                                                                    Data Data   Data Data
             TDG Generator                              Kernel Launcher
                                                     do {
                                                                                                    In In         In In
                                                     gpuKernel1(…..);                        SM 2
                                                     gpuKernel2(…..);                              Data Data    Data Data
                                                     …………………………
                                                     hostKernel3(…..);                               SpMM         dot
                                                     …………………………
                                                     gpuKernel4(…..);
                                                     …………………………                                     Out Out     Out Out
                                                     …………………………                                     Data Data   Data Data
                                                     } while(!converged)
                                                                                                           GPU
                             Figure 4.1: Schematic overview of DeepSparseGPU framework.
code to identify the specific BLAS/LAPACK and GraphBLAS function calls and the input/output
of each function call. It then passes this information to the local task dependency graph generator.
4.1.1.2        Task Dependency Graph Generator (TDGG)
The output of Task Identifier (TI) is a dependency graph at a very coarse-level, i.e., at the function
call level. Tasks must be generated at a much finer granularity for efficient parallel execution of
larger problems on the GPU by carefully planning the data movement between the CPU and GPU.
This is accomplished by the Task Dependency Graph Generator (TDGG), which goes over the
input/output data information generated by TI for each function call and starts decomposing/tiling
these kernels and data structures.
     As noted above, the decomposition into finer granularity tasks starts with the first function call
involving the sparse matrix (or matrices) in the solver code, which is typically an SpMV, SpMM,
or SpGEMM operation. For example, SpMM is the main sparse matrix kernel in the LOBPCG
solver. We use compressed sparse row (CSR) matrix format to store the sparse matrix. First, the
                                                                       57


TDGG decomposes the sparse matrix using a 1D decomposition. Then, the main sparse matrix
kernel decomposition induces the decomposition of all kernels above and below (i.e., the full
solver) the sparse matrix operation. TDGG also generates the dependencies between individual
fine-granularity tasks by examining the function call dependencies determined by TI as part of
the task dependency graph generation procedure. The resulting task dependency graph generated
by TDGG is a directed acyclic graph (DAG) representing the data flow in the solver code where
vertices denote computational tasks, incoming edges represent the input data, and outgoing edges
represent the output data for each task. TDGG also labels the vertices in the TDG with the estimated
computational cost of each task, and the directed edges with the name and size of the corresponding
data, respectively. Particularly, TDGG uses an instance of Taskinfo [listing 2.1] as the name of
each vertex in TDG to store the necessary information to execute each task in the DAG. During
execution, such information can be used for load balancing among threads and/or ensuring that
active tasks fit in the available GPU memory.
4.1.2   Task Executor (TE)
The Task Executor (TE) in DeepSparseGPU is composed of two parts as well: i) Kernel Launcher,
and ii) Memory Manager.
4.1.2.1   Kernel Launcher (KL)
The KL receives a DAG from the PCU that represents the full execution flow of the given application.
Then, KL performs a depth-first topological sort based on the dependencies in the DAG to get an
execution schedule of tasks and saves the tasks in an array of TaskInfo structures according to the
task schedule. The topological ordering of the DAG respects the data dependencies and guarantees
the numerical correctness of the given application. Furthermore, it is expected that the depth-first
strategy used in the topological ordering would help minimize the data movement between the CPU
and GPU by placing tasks with data dependencies together in the task schedule.
                                                 58


// main . cpp
[ task , numTask ] = topologicalSort ( DAG ) ;
# pragma omp parallel
{
   # pragma omp master
   {
      while (! converged )
      {
         for ( i = 0 ; i < numTask ; i ++)
         {
           // extract task information
           taskinfo = extractTaskInfo ( task [ i ]) ;
           // preparing operands on gpu ( or cpu ) using
           // Memory Manager
           operands = memoryManger ( taskinfo ) ;
           // launch the proper kernel on GPU
           // ( or CPU ) based on taskinfo
           kernel ( operands ) ;
         }
      }
   }
}
// an example gpu application kernel
void mat_add ( double * device_memory , int offset1 /* src1 */ , int offset2
     /* src2 */ , int offset3 /* dst */ , int blksz , int col )
{
   int sz = blksz * col ;
   # pragma omp target is_device_ptr ( device_memory ) \
   depend ( in : device_memory [ offset1 : sz ] , device_memory [ offset2 : sz ]) \
   depend ( out : device_memory [ offset3 : sz ]) nowait
   # pragma omp teams distribute parallel for collapse (2)
   for ( int i = 0; i < row ; i ++)
                                         59


          for ( int j = 0 ; j < col ; j ++)
            device_memory [ offset3 + i * col + j ] = device_memory [ offset1 + i
                 * col + j ] + device_memory [ offset2 + i * col + j ];
    }
                    Listing 4.1: Kernel Launcher skeleton code in OpenMP
    Once the topological ordering of the DAG is done, the KL picks each node from the task array
one by one and launches a kernel for them on the GPU (or in CPU, if the corresponding GPU
kernel is not available). The KL first extracts the task information from the task name. Then, it
consults with the Memory Manager (MM) to check if the operands of the task are ready on the
GPU memory (or on the CPU memory). Once all operands are ready, the kernel can be launched.
Currently, the exceptions to GPU executable kernels are only a few LAPACK operations that are
not yet implemented in the GPU libraries.
    As shown in listing 4.1, we use target teams distribute parallel for directives to
offload computations to the GPU. The nowait clause is used for asynchronous execution on
the GPU. Whenever possible, the collapse clause is used to merge nested loops to obtain higher
degrees of parallelism in the kernels. All directives use the depend clause to respect the input/output
dependencies among tasks.
4.1.2.2   Memory Manager (MM)
The memory manager (MM) is responsible for keeping track of all data tiles on the host and device
memory. The KL consults with MM before launching each kernel on the GPU (or CPU). MM
makes sure that the latest copy of each of the operands of a kernel is available on GPU (or CPU)
before launching the kernel. MM moves data tiles between host and device automatically whenever
needed and implements a First-In-First-Out (FIFO) eviction policy to evict blocks when the device
memory becomes full.
    Data Structures used in MM: The MM maintains a few data structures to fulfill its operations.
                                                   60


                                                    memory blocks on device
 device_memory
   memory_view       <mtx1, 0>   <mtx2, 2>    <mtx8, 1>  <mtx7, 0>     <mtx1, 3>   <mtx2, 3> <mtx1, 4>
                        0            1           2           3              4         5         6
                                Key                                          Value
   memory_map
                     <double *mtx, int tile_id> <index, no. of occupied blocks, size, isModified>
                                 Key                                         Value
  freeblock_map
                            <# of blocks>           <index1, index2, index3, index4, ….…………….….>
  evictionQueue      <mtx1, 0>   <mtx8, 1>    <mtx2, 2>  <mtx1, 3>     <mtx2, 3>   <mtx7, 0> <mtx1, 4>
                          Out                            FIFO Queue                                    In
        Figure 4.2: Data structures used in managing tile information in Memory Manager.
Figure 4.2 shows the important data structures that MM maintains to manage all data tiles between
the host and device. The purposes of these data structures are as follows:
     • device_memory: This is an array of double-precision numbers that spans the total available
       device memory in our target machine. We reserve a small fraction of the device memory
       to store some matrices if the computational model requires it. The rest of the device
       memory is allocated as this bigger array. We divide the device_memory array into blocks
       of memory chunks according to user-provided computational granularity. If the size of the
       device_memory array is m bytes and memory granularity is β bytes, then we divide the full
       device_memory array into n = d m   β e memory blocks. We use omp_target_alloc function in
       our OpenMP implementation to allocate the needed memory on device.
     • memory_view: This is an array of n elements (<data structure name, tile id> pair). It
       contains name and id of the data tiles stored in the i-th memory block of device_memory
       array.
                                                   61


• memory_map: This is an unordered map that holds detailed information on each data tiles
  stored in device_memory in the form of (Key, Value) pairs. The Key is a <data structure
  name, tile id> pair so that each tile of all data structures associated with the computation
  has a unique key. The Value is an array of 4 numbers (<index, # of occupied blocks, size,
  isModified>). The first number in the Value field is the index of device_memory where the
  data block (<data structure name, tile id>) is stored on the device. The second number in the
  array is the number of memory blocks occupied on the device by the stored tile. Depending
  on the size of the tile, it may require more than one device memory block for storage. The
  third element of the array is the actual size of the tile that is stored at index location on
  device_memory. The fourth element of the array indicates whether the tile data is modified
  on the device or not. If isModified = 0, then the matrix tile data is not modified on the device.
  If isModified = 1, then it indicates that the matrix tile data is modified on the device. MM
  always checks the isModified field before evicting a matrix tile from the device memory. If
  tile data is modified, the MM copies the latest matrix tile data to the proper host location
  before evicting it from the device memory.
• evictionQueue: The MM implements a First-In-First-Out (FIFO) eviction policy using
  evictionQueue to execute larger sparse matrix problems that exceed device memory. As MM
  moves a tile from host to device, it adds the <data structure name, tile id> of that block to
  the end of evictionQueue. MM always tries to evict from the beginning of evictionQueue.
• freeblock_map: This is an unordered map that keeps track of the free memory blocks on
  the device_memory array. The Key of the freeblock_map is the number of contiguous free
  memory blocks on the device_memory array. The Value of the freeblock_map is an array of
  start indices of free memory blocks on the device_memory array. The MM starts tracking
  the free memory blocks once the eviction policy is activated, and MM always tries to copy
  a matrix block from the host to the free memory blocks on the device. This helps minimize
  memory fragmentation on the device.
                                              62


    Eviction/Replacement Policy: Data movement between the host and device becomes the main
performance bottleneck when application working sets exceed the physical memory capacity of the
device. The MM uses a software-managed First-In-First-Out (FIFO) eviction policy to efficiently
manage data movement between the host and device algorithm. From the beginning of the execution
of the application, MM adds a matrix block at the end of the evictionQueue whenever it moves a
matrix block from host to device; an eviction is performed when the device memory becomes full.
MM evicts the front element of the evictionQueue to make space for newly required data on the
device. If the data tile at the front of evictionQueue is an operand of the current operation, it is
removed from the front and reinserted at the back of evictionQueue.
    List of MM methods/functions: Here are the list of important functions/methods of MM:
    • isOnDevice(mtx, tile_id): This function returns true if the tile_id-th block of mtx matrix is
      available on device, otherwise it returns false.
    • copyToDevice(mtx, tile_id): This function is used to copy the tile_id-th block of mtx matrix
      from host to the suitable available memory blocks on device memory. It throws an excep-
      tion if there is not enough space or any suitable memory blocks on the device. We use
      omp_target_memcpy function to copy data between host and device in DeepSparseGPU.
    • reserveOnDevice(mtx, tile_id): This function reserves spaces for the tile_id-th block of mtx
      on the device. This function is beneficial when any operation produces new data. In such
      cases, we do not need to copy the output matrix tile from host to device; reserving sufficient
      device space for that matrix tile is enough in this case. This function throws an exception if
      there is not enough space or no suitable memory blocks exist on the device.
    • copyToHost(mtx, tile_id): This function is used to copy the tile_id-th block of mtx matrix
      from the device to the host. It throws an exception, if it is unable to copy the matrix block.
      We use omp_target_memcpy function to copy data between from device to host.
    How MM works: We provide an example in Figure 4.3 to demonstrate the operations of MM.
                                                  63


                                          memory blocks on device                                                              memory blocks on device
device_memory     <A, 0>    <A, 0>      <B, 0>   <C, 0>    <B, 1>  <C, 1>    <D, 0>   device_memory    <D, 1>    <FREE>      <B, 0>   <C, 0>    <B, 1>    <C, 1>    <D, 0>
                     0          1           2        3         4       5        6                         0          1           2        3         4         5        6
  memory_view     <A, 0>    <A, 0>      <B, 0>   <C, 0>    <B, 1>  <C, 1>    <D, 0>    memory_view     <D, 1>    <FREE>      <B, 0>   <C, 0>    <B, 1>    <C, 1>    <D, 0>
                     0          1           2        3         4       5        6                         0          1           2        3         4         5        6
  evictionQueue   <A, 0>    <B, 0>     <C, 0>   <B, 1>     <C, 1>  <D, 0>              evictionQueue  <B, 0>     <C, 0>     <B, 1>   <C, 1>     <D, 0>    <D, 1>
                        Out                                               In                                 Out                    FIFO Queue                   In
                                               FIFO Queue
                   <A, 0>     <0, 2, 8, 1>                                                              <B, 0>     <2, 1, 4, 0>
                                                                                                                                                       <1>            <1>
                   <B, 0>     <2, 1, 4, 0>                                                              <C, 0>     <3, 1, 4, 1>
                   <C, 0>     <3, 1, 4, 1>                        freeblock_map                         <B, 1>     <4, 1, 4, 0>                          freeblock_map
                   <B, 1>     <4, 1, 4, 0>                                                              <C, 1>     <5, 1, 4, 1>
                   <C, 1>     <5, 1, 4, 1>                                                              <D, 0>      <6, 1, 4, 1>
                   <D, 0>      <6, 1, 4, 1>                                                             <D, 1>      <0, 1, 4, 1>
                    memory_map                                                                            memory_map
              (a) Device memory becomes full                                                   (b) Evicting <A,0> and copying <D,1>
                                           Figure 4.3: Illustrative example of how MM works
Let us assume we have have 4 matrices (A, B, C and D) associated with a computation and we have a
total of 7 memory blocks on the device. Each matrix is tiled into 2 × 1 blocks in this example. Each
tile of matrix A requires 2 memory blocks on the device, whereas tiles for other matrices require 1
memory block on the device. Due to the nature of the computation, let us assume that the order of
moving matrix tiles to device is - {<A, 0>, <B, 0>, <C, 0>, <B, 1>, <C, 1>, <D, 0>, <D, 1>}.
As shown in Figure 4.3a, device memory becomes full before moving the last block (<D, 1>) in
due to the limited device memory capacity. Figure 4.3a shows the status of all the data structures
of MM at this stage and MM activates the eviction policy, as device memory is full now. The MM
needs to move <D, 1> tile to device. In order to do so, MM evicts the front of evictionQueue
which is <A, 0> matrix tile. (<A, 0>) occupies 2 memory blocks on device and <D, 1> requires
only 1 memory block to be stored on device. So MM copies <D, 1> at 0 index of device_memory
array and adds index 1 in the freeblock_map. MM also updates memory_view, map_map and
evictionQueue accordingly. As shown in Figure 4.3a, isModified = 1 in memory_map for <A, 0>
which means <A, 0> is modified on the device. MM copies the latest value of <A, 0> from device
to host before evicting it in Figure 4.3b.
                                                                                    64


4.2     Performance Evaluation
4.2.1   Experimental Setup
We conducted all of our experiments on the Cori-GPU testbed at the National Energy Research
Scientific Computing Center (NERSC) [76]. Each compute node has two 20-core Skylake proces-
sors clocked at 2.4 GHz and 8 NVIDIA Tesla V100 “Volta" GPUs with ∼16 GBs of HBM per
GPU. The V100 GPU model has a peak double-precision performance of 7.0 TFLOP/s. There
is a total of 384 GB DDR4 DRAM space on each node. The CPUs are connected to the GPUs
via four PCIe 3.0 switches and the GPUs are connected to each other via NVIDIA’s NVLink 2.0
interconnect. The Cori-GPU provides extensive software environments to compile OpenMP (and
OpenACC) programs. We use the Cray compilers (‘CCE’) for compiling our implementation.
The GPU accelerated cuSPARSE and cuBLAS libraries provided with CUDA v11.1.1 are used to
compile our baseline implementation. We use thread affinity to bind threads to cores and we use
20 CPU threads and 1 GPU (1 CPU socket + 1 GPU) for our experiments to avoid NUMA issues.
We take a full Cori-GPU node for our experiments in order to avoid noisy environment created by
sharing the same node with other users.
4.2.1.1    Benchmark Application
We demonstrate the performance of the DeepSparseGPU framework on LOBPCG eigensolver
which is widely used in large-scale scientific computing applications [14]. We give the pseudocode
for the LOBPCG solver in Algorithm 3 in Chapter 2. It involves kernels with high arithmetic
intensities such as SpMM and several level-3 BLAS kernels. The total memory needed for block
vectors Ψ, R, Q and others can easily exceed the space matrix H
                                                              c takes up. Figure 2.6 shows a sample
task graph for LOBPCG generated by TDGG using a very small matrix. Clearly, minimizing the data
movement between CPU and GPU to obtain an efficient LOBPCG implementation is non-trivial.
                                                  65


4.2.1.2    Dataset
We selected 11 square matrices with varying sizes, sparsity patterns, and domains from the SuiteS-
parse Matrix Collection in addition to the Nm7 matrix, which is from a nuclear shell model code
(see Tab. 4.1 and 4.2) [79]. Matrices in Table 4.1 are used to evaluate bigger problem sizes that do
not fit in GPU memory. The bigger problem sizes range from 32.06 GB to 84.54 GB. Matrices in
Table 4.2 are used to evaluate problem sizes that fit in the GPU memory. The smaller problem sizes
range from 2.64 GB to 13.87 GB. Performance data for LOBPCG is averaged over five iterations.
                     Table 4.1: Matrices used to evaluate problem size > 16 GB.
                   Matrix           #Rows         #Non-zeros      Problem Size (GB)
                     Nm7          4,985,944       648,890,590             32.06
                 nlpkkt200        16,240,000      448,225,632             44.98
                 nlpkkt240        27,993,600      760,648,352             77.32
                   it_2004        41,291,594     1,150,725,436            64.18
                  sk_2005         50,636,154     1,949,412,601            54.38
               webbase_2001      118,142,155     1,019,903,190            84.54
                     Table 4.2: Matrices used to evaluate problem size < 16 GB.
                       Matrix         #Rows       #Non-zeros     Problem Size (GB)
                      inline_1       503,712       36,816,342            2.64
                dielFilterV3real    1,102,824      89,306,020            6.07
                    Flan_1565       1,564,794     117,406,044            5.22
                      HV15R         2,017,169     281,419,743            8.29
                   Bump_2911        2,911,419     127,729,899            8.63
                        Nm7         4,985,944     648,890,590           13.87
                    nlpkkt160       8,345,600     229,518,112           12.94
4.2.1.3    Baseline Implementation
We compare the performance of DeepSparseGPU with a baseline implementation that we call
Libcsr_UM, which is an implementation of the LOBPCG solver using GPU accelerated cuSPARSE
and cuBLAS libraries for SpMM (with CSR storage of the sparse matrix), inner product, and
linear combination kernels. The application kernels are implemented on the GPU using OpenMP
                                                  66


parallel for loop and target offload pragmas. Application kernels include tall skinny
matrix operations such as addition, subtraction, element-wise multiplication, etc. We use NVIDIA’s
unified memory technology (using the cudaMallocManaged function) in this implementation. As
such, the runtime system automatically takes care of the data movement between the host and
device.
4.2.1.4           Incremental Implementation of DeepSparseGPU
We incrementally implemented DeepSparseGPU. We implemented two intermediate working ver-
sions of our tile-based DeepSparseGPU framework. We adopted two different data movement
schemes in these two intermediate versions, which helped us design and develop our memory
manager. We include the performance data from these intermediate versions to show the margin of
improvement using the latest version of DeepSparseGPU:
v o i d mat_add ( d o u b l e ∗X, d o u b l e ∗Y, d o u b l e ∗Z , i n t row , i n t c o l , i n t b l o c k _ w i d t h , i n t b l o c k _ i d )
{
    int i , j , blksz ;
    b l k s z = ( b l o c k _ i d ∗ b l o c k _ w i d t h + b l o c k _ w i d t h ) > row ?
                    row − b l o c k _ i d ∗ b l o c k _ w i d t h : b l o c k _ w i d t h ;
    i n t o f f s e t = block_id ∗ block_width ∗ col ;
    i n t sz = blksz ∗ col ;
   # pragma omp t a r g e t f i r s t p r i v a t e ( b l k s z , sz , o f f s e t ) map ( t o : X[ o f f s e t : s z ] ) \
   map ( t o : Y[ o f f s e t : s z ] ) map ( t o f r o m : Z [ o f f s e t : s z ] ) \
   d e p e n d ( i n : X[ o f f s e t : s z ] ) d e p e n d ( i n : Y[ o f f s e t : s z ] ) d e p e n d ( i n o u t : Z [ o f f s e t : s z ] )
   # pragma omp t e a m s d i s t r i b u t e p a r a l l e l f o r p r i v a t e ( j ) c o l l a p s e ( 2 )
    f o r ( i = 0 ; i < b l k s z ; i ++)
   {
        f o r ( j = 0 ; j < c o l ; j ++)
        {
          Z [ o f f s e t + i ∗ c o l + j ] = X[ o f f s e t + i ∗ c o l + j ] +Y[ o f f s e t + i ∗ c o l + j ] ;
        }
   }
}
                       Figure 4.4: Code snippet of mat_add kernel in DeepSparse_MAP version.
        • DeepSparse_MAP: This version relies on OpenMP map(to:                                                                 <list>), map(tofrom:
            <list>) and map(from:                                <list>) clauses for transferring data between host and device.
            Each offload pragma moves the necessary data tiles on-the-fly right before launching its GPU
                                                                                       67


            kernel, essentially relying on the OpenMP runtime system for data movement. Figure 4.4
            shows the code snippet of mat_add kernel in DeepSparse_MAP version.
/ / i n main p r o g r a m
i n t s i z e = row ∗ c o l ∗ s i z e o f ( d o u b l e ) ;
cudaMallocManaged (&X, s i z e ) ;
cudaMallocManaged (&Y, s i z e ) ;
cudaMallocManaged (&Z , s i z e ) ;
/ / mat_add k e r n e l
v o i d mat_add ( d o u b l e ∗X, d o u b l e ∗Y, d o u b l e ∗Z , i n t row , i n t c o l , i n t b l o c k _ w i d t h , i n t b l o c k _ i d )
{
    int i , j , blksz ;
    b l k s z = ( b l o c k _ i d ∗ b l o c k _ w i d t h + b l o c k _ w i d t h ) > row ?
                    row − b l o c k _ i d ∗ b l o c k _ w i d t h : b l o c k _ w i d t h ;
    i n t o f f s e t = block_id ∗ block_width ∗ col ;
    i n t sz = blksz ∗ col ;
    # pragma omp t a r g e t f i r s t p r i v a t e ( b l k s z , sz , o f f s e t ) map ( t o : X[ o f f s e t : s z ] ) \
   map ( t o : Y[ o f f s e t : s z ] ) map ( t o f r o m : Z [ o f f s e t : s z ] ) \
    d e p e n d ( i n : X[ o f f s e t : s z ] ) d e p e n d ( i n : Y[ o f f s e t : s z ] ) d e p e n d ( i n o u t : Z [ o f f s e t : s z ] )
    # pragma omp t e a m s d i s t r i b u t e p a r a l l e l f o r p r i v a t e ( j ) c o l l a p s e ( 2 )
    f o r ( i = 0 ; i < b l k s z ; i ++)
    {
        f o r ( j = 0 ; j < c o l ; j ++)
        {
          Z [ o f f s e t + i ∗ c o l + j ] = X[ o f f s e t + i ∗ c o l + j ] +Y[ o f f s e t + i ∗ c o l + j ] ;
        }
    }
}
                        Figure 4.5: Code snippet of mat_add kernel in DeepSparse_UM version.
        • DeepSparse_UM: Like DeepSparseGPU, we tile all computational kernels and express them
            as DAG in DeepSparse_UM. But instead of using MM, we rely on using unified memory
            for automatic data movement between the host and the device. All associated matrices are
            allocated using cudaMallocManaged. They are accessed using is_device_ptr in target
            offload pragmas which allows the kernels to treat the data tiles as device pointers, and the
            runtime system automatically makes the data available on the GPU whenever needed. Fig-
            ure 4.5 shows the code snippet of mat_add kernel in DeepSparse_UM version and Figure 4.6
            shows the code snippet of the same kernel in our final DeepSparseGPU version with memory
            manager.
                                                                                       68


 / / i n main p r o g r a m
taskinfo = extractTaskInfo ( task [ i ]) ;
 / / p r e p a r i n g o p e r a n d s on gpu ( o r cpu ) u s i n g Memory Manager
o p e r a n d s = memoryManger ( t a s k i n f o ) ;
mat_add ( o p e r a n d s ) ;
 / / mat_add
v o i d mat_add ( d o u b l e ∗ device_memory , i n t o f f s e t 1 / ∗ s r c 1 ∗ / , i n t o f f s e t 2 / ∗ s r c 2 ∗ / ,
                         i n t o f f s e t 3 /∗ dst ∗/ , i n t blksz , i n t col )
{
    i n t sz = blksz ∗ col ;
    # pragma omp t a r g e t i s _ d e v i c e _ p t r ( device_memory ) \
    d e p e n d ( i n : device_memory [ o f f s e t 1 : s z ] , device_memory [ o f f s e t 2 : s z ] ) \
    d e p e n d ( o u t : device_memory [ o f f s e t 3 : s z ] ) n o w a i t
    # pragma omp t e a m s d i s t r i b u t e p a r a l l e l f o r c o l l a p s e ( 2 )
    f o r ( i n t i = 0 ; i < row ; i ++)
        f o r ( i n t j = 0 ; j < c o l ; j ++)
            device_memory [ o f f s e t 3 + i ∗ c o l + j ] = device_memory [ o f f s e t 1 + i ∗ c o l + j ]
                                                                        + device_memory [ o f f s e t 2 + i ∗ c o l + j ] ;
}
                     Figure 4.6: Code snippet of mat_add kernel in DeepSparseGPU framework.
4.2.2          LOBPCG Evaluation
Our performance comparison criteria include the amount of Host-to-Device (H2D), Device-to-Host
(D2H) data transfer, and average execution time per iteration among Libcsr_UM, DeepSparse_UM,
DeepSparse_MAP and DeepSparseGPU implementations for the LOBPCG solver. The Memory
Manager quantifies and keeps track of the H2D and D2H data transfer in the DeepSparseGPU
framework. We use NVIDIA nvprof profiler tool to measure the amount of H2D and D2H data
transfer while using unified memory. In DeepSparse_MAP, we manually measure the total H2D
and D2H data transfer based on the data mentioned in the map clauses. The performance of
DeepSparseGPU, DeepSparse_UM, and DeepSparse_MAP depends on the tile size. Choosing a
small tile size creates a large number of fine granularity tasks. This means we have to launch more
kernels on GPU. There are overheads associated with launching kernels on GPU. Also, a smaller tile
size may lead to poor H2D and D2H transfer rates. Increasing the tile size reduces GPU execution
overhead as GPU execution heavily depends on data parallelism. Therefore, we use the tile size
as an optimization parameter based on matrix dimensions and sparsity patterns. We experimented
with different tile sizes of 32K, 64K, 128K, and 256K. We report the results of the best-performing
tile size for DeepSparseGPU, DeepSparse_UM, and DeepSparse_MAP. In Libcsr_UM, we are not
required to tile the input matrices as we used unified memory with cuSPARSAE and CUBLAS
                                                                              69


library kernels from CUDA and OpenMP target offloaded application kernels.
4.2.2.1    Data Movement between Host and Device
Figure 4.7 shows the amount of Host-to-Device (H2D) data transfer comparison among all four
versions of the LOBPCG algorithm. LOBPCG is a complex algorithm, as it has several different
kernel types, and its task graph results in a vast number of tasks to be launched on GPU for each
iteration. As can be seen in Fig 4.7, DeepSparseGPU always transfers less data from the host to the
device compared to the other three versions. DeepSparseGPU transfers 1.18× - 1.94× less H2D
data transfer compared to Libcsr_UM version except for the sk_2005 matrix. DeepSparseGPU also
transfers 1.25× - 2.59× less H2D data compared to DeepSparse_UM version.
                         Figure 4.7: Average H2D data transfer per iteration.
    Figure 4.8 shows the amount of Device-to-Host (D2H) data transfer comparison between all four
versions of the LOBPCG algorithm. As can be seen in Fig 4.8, DeepSparseGPU always outperforms
the other three versions when it comes to the amount of D2H data transfer. DeepSparseGPU transfers
1.84× - 2.69× less D2H data compared to Libcsr_UM, 2.39× - 3.70× less D2H data compared to
DeepSparse_UM and 1.95× - 3.12× less D2H data compared to DeepSparse_MAP. Considering
                                                  70


the total amount of data transfer (H2D + D2H), DeepSparseGPU transfers 1.39× - 2.18× less
data compared to Libcsr_UM, 1.89× - 2.79× less data compared to DeepSparse_UM and 2.92× -
5.01× less data compared to DeepSparse_MAP.
                         Figure 4.8: Average D2H data transfer per iteration.
    From Figure 4.7 and 4.8, we can see that the main reason why DeepSparseGPU transfers
significantly less data is due to the explicit data management by the Memory Manager based on a
good task scheduling heuristic. The task scheduling heuristic helps to maximize the utilization of
the data while it resides in GPU memory. DeepSparse_MAP is the worst regarding H2D and D2H
data transfer performance among all four versions. This is expected because DeepSparse_MAP
moves all inputs and outputs of a kernel in both H2D and D2H directions during each kernel launch.
Figure 4.7 and 4.8 show that the design of our task scheduling scheme and memory manager is
robust and helps to minimize the data movement between host and device.
4.2.2.2   Execution Time
As can be seen in Figure 4.9, the significant reduction in H2D and D2H data transfer of DeepSparseGPU
over the other three versions leads to a significant execution time speedups in general. In partic-
                                                   71


ular, DeepSparseGPU achieves 1.37× - 1.86× speedup compared to Libcsr_UM version except
for Nm7, nlpkkt200 and sk_2005 matrices. We further investigated the reasons for the slower
execution of DeepSparseGPU for these matrices. We found that our CSR format-based custom
SpMM runs significantly slower than the highly optimized cusparseSpMM (a cuSPARSE library
routine) used for SpMM operation in the Libcsr_UM version. To be specific, our custom SpMM
routine runs 1.47× - 2.58× slower in case of these 3 matrices compared to the cusparseSpMM
routine in Libcsr_UM.
                         Figure 4.9: Average execution time per iteration.
    We experimented with OpenMP target offload, and CUDA implemented versions of the most
expensive kernels. Figure 4.10 shows the performance comparison between OpenMP target offload
implementation and CUDA implementation (cuBLAS, cuSPARSE) of most expensive kernels in-
cluding inner product (X 0 Y ), linear combination (XY ), SpMM and Column-wise matrix reduction.
The input size of all of these kernels was less than 16 GB. As we can see, the CUDA versions
of these kernels are 2.13× - 12.85× faster compared to the OpenMP target offload versions. We
use nvprof to get the execution time of cuBLAS and cuSPARSE library kernels. OpenMP target
offload implementations in different vendor-provided compilers are still in the early stage. Most
                                                  72


compilers support only a subset of the OpenMP specification. It would be robust with time, and
the performance of DeepSparseGPU would also improve in the upcoming supercomputers with a
faster interconnect.
      Figure 4.10: OpenMP vs CUDA execution time comparison for most expensive kernels.
4.2.2.3    Effect of Pinned Memory
It is important to note that even if our DeepSparseGPU transfers less H2D and D2H data, the
overall H2D data transfer rate in DeepSparseGPU is 4.49 GB/sec, whereas the average H2D rate
is 6.34 GB/sec in Libcsr_UM for all test matrices. The overall D2H data transfer rates are more
striking, Libcsr_UM (12.13 GB/sec) achieves 2.82× higher data transfer rates DeepSparseGPU
(4.29 GB/sec). Despite this lower data transfer rate and the worse SpMM performance mentioned
above, DeepSparseGPU runs faster than Libcsr_UM with the Unified Memory. We had used
pageable memory with DeepSparseGPU for the results shown in Figure 4.7, 4.8 and 4.9, therefore
we also experimented with pinned memory to achieve a better bandwidth and see its effect on
execution time. Figure 4.11 shows the execution time performance comparison of DeepSparseGPU
using pinned memory with the results shown in Figure 4.9. As can be seen, we achieve 1.45×
                                                73


- 1.98× execution time speedup using pinned memory with DeepSparseGPU compared to its
pageable memory version and up to 2.93× speedup compared to Libcsr_UM.
               Figure 4.11: Performance of DeepSparseGPU using pinned memory
4.2.2.4   Comparison with a CPU version
Figure 4.12 shows the execution time performance comparison between DeepSparseGPU (using
pinned memory) and the baseline version running on CPU for bigger problem sizes. The CPU
baseline implementation uses thread-parallel Intel MKL Library calls (including SpMM) with CSR
storage of the sparse matrix. As we can see, DeepSparseGPU is running slower in comparison.
However, it should be noted that even with the pinned memory optimization above, we get only
about 12 GB/sec bandwidth between the CPU and GPU using a PCIe express 3.0 interconnect.
In contrast, the bandwidth between the CPU and DRAM is approximately 90 GB/sec. Another
reason behind the poor GPU performance is that there are no mechanisms to use multiple streams
in OpenMP implementation yet and asynchronous execution support through the nowait clause
does not allow a robust and portable implementation. As such data movement and computation
cannot be overlapped by utilizing separate streams and nowait clauses in OpenMP pragmas.
                                                74


    Figure 4.12: CPU vs DeepSparseGPU execution time comparison (problem size > 16 GB)
    Figure 4.13 shows the theoretical execution time of DeepSparseGPU with different types of
interconnect such as PCIe-Gen3 (12 GB/Sec), PCIe-Gen4 (32 GB/Sec), PCIe-Gen5 (64 GB/Sec),
and NVLINK-4 (900 GB/Sec) [88]. For this experiment, we took the compute time of V100 from
Figure 4.12 and we calculated the data movement time by dividing the total amount of data transfer
in Figure 4.12 by the 80% of peak bandwidth for each type of interconnect to simulate the complete
overlap of data movement and computation. As we can see from Figure 4.13, the theoretical
execution time with PCIe-Gen3 is still on par with the CPU time, which means that no amount
of software optimization can make DeepSparseGPU performance beat the CPU performance.
However, we can see that the theoretical time of DeepSparseGPU with PCIe-Gen4, PCIe-Gen5 and
NVLINK-4 beat the CPU time.
    The above simulation indicates that with the availability of platforms with faster interconnects
(some such platforms currently exist, but those hardware have not been available for us to experiment
with), DeepSparseGPU would have real merit in accelerating actual machine learning or scientific
computing workloads. Figure 4.14 which shows the execution time performance comparison
between DeepSparseGPU when the application working set fits into GPU memory (Table 4.2)
provides further evidence in this direction.
                                                 75


Figure 4.13: CPU vs DeepSparseGPU execution time comparison with different CPU-GPU inter-
connect
    As can be seen, DeepSparseGPU achieves 1.77× - 4.87× speedup compared to the CPU baseline
implementation when the total memory footprint is less than 16GB, i.e., when data movement is
not a big bottleneck. We anticipate that the availability of better interconnects and better GPU
acceleration support in OpenMP will lessen the impact of such data movement bottleneck we
observed in this study.
4.3    Discussion
    OpenMP target offload implementations in different vendor provided compilers are still in early
stage. Most compilers support only a subset of the OpenMP specification. We noticed a big
performance gap between OpenMP target offload and CUDA library implemented versions of the
most expensive kernels.
4.3.1   Usage of Data Prefetching and Multiple Streams
The reason behind much higher H2D and D2H rates in Libcsr_UM version is that the unified memory
internally uses multiple streams and prefetches data. Such supports are not available in OpenMP
                                                76


  Figure 4.14: CPU vs DeepSparseGPU execution time comparison using problem size < 16 GB
target offloading implementations. As such, we see in section 4.2 that H2D and D2H transfer rates
are significantly higher in the Libcsr_UM compared to DeepSparseGPU. With the possibility of
getting those supports in OpenMP implementation, the performance of DeepSparseGPU would
improve even further.
4.4     Summary of the Chapter
    In this work, we introduced a tiling-based sparse solver framework for heterogeneous architec-
tures that aims to minimize data transfer between host and device to achieve better performance.
We showed that our framework transfers significantly less data between host and device. Our
framework also improves the execution time over the UM-based baseline implementation using
pinned memory. In our future work, we will focus on improving the efficiency of data transfers in
DeepSparseGPU and optimizing the performance of sparse matrix kernels. To this end, we plan
to design and implement a DAG partitioner that would help minimize the data movement between
CPU and GPU. We also plan to use 2D decomposition of the SpMM operation, which would ex-
pose more parallelism in the computation. We also plan to explore the feasibility of implementing
DeepSparseGPU using CUDA Graphs as it seems a good fit for our design. In the next chapter, we
                                                 77


will discuss the detail of our exploration of task-parallelism using CUDA Graph.
                                                  78


                                              CHAPTER 5
       EXPLORING FINE-GRAINED TASK PARALLELISM USING CUDA GRAPH
The computing capabilities of GPUs have been increasing continuously over the last decade to
address the growing computational needs of HPC and the data science landscape. Modern GPUs
are so fast that, in many cases, the time to submit an operation (e.g., kernel or memory copy) to GPU
dominates the actual time (measured in microseconds) taken by the operation itself. For example, in
DeepSparseGPU, we submit thousands of operations to GPU on the tiled data at each iteration. Each
operation is launched to the GPU separately and typically completes quickly. Empirically, we found
that the combined work submission overheads cause overall degradation in performance. Recently,
CUDA has introduced an asynchronous task-graph programming model, CUDA Graph, to enable
efficient launch and execution of GPU work [89]. Unlike the conventional CUDA kernel execution,
CUDA Graph separates the initialization from the execution [90]. CUDA Graph reduces the overall
overhead by doing the device set-up, device initialization, kernel dependency analysis, and other
expensive operations during the initialization phase. Furthermore, CUDA Graph significantly
improves the performance by launching any number of kernels in a single operation [91, 92].
    After experiencing the poor performance of the OpenMP target offload implementations of the
most expensive computational kernels (Figure 4.10 in Chapter 4) in DeepSparseGPU, we chose to
implement our task-parallel DeepSparseGPU framework using the CUDA programming model. In
Chapter 4, we have seen that our topological sorting heuristic with the global task graph worked
well in DeepSparseGPU. However, we envision using parallel decomposition techniques (instead
of only a heuristic) that aim to maximize data locality. Our vision is to partition the global task
graph and execute it in chunks to maximize data locality and minimize data movement. However,
the 1D tiling of the sparse matrix does not give the partitioning algorithms flexibility. Therefore,
we envision decomposing the sparse matrix using 2D tiling. Hence, the number of kernel launches
becomes a problem due to several small kernels, and that is where we anticipate the CUDA graph
technology to be immensely useful.
                                                   79


    In this work, we have performed preliminary work on the use and merits of CUDA Graph to im-
plement and optimize DeepSparseGPU on GPUs as it seems to be a good fit for the DeepSparseGPU
framework with all CUDA kernels. We left the 2D tiling and graph partitioning on the tiles as
future work. Our contributions in this study can be summarized as follows:
    • We demonstrate that the global task graph of DeepSparseGPU framework can be implemented
       using CUDA Graph.
    • We show that CUDA Graph helps to reduce the kernel launch overhead in DeepSparseGPU
       framework. Our CUDA Graph based DeepSparseGPU implementation runs up to 1.13×
       faster compared to a CUDA kernels, cuBLAS and cuSPARSE library routines based baseline
       version.
5.1     CUDA Graph
    CUDA Graph is a new programming model introduced in CUDA 10 that allows define-once-
run-repeatedly execution flow by following a three-stage execution model: graph definition, graph
instantiation, and graph execution.
                              Figure 5.1: Toy task dependency graph
                                                80


    CUDA Graph API offers two modes of defining the graph, manual mode and stream capture
mode. Graph creation starts with cudaGraphCreate routine which creates an empty graph. In
manual mode, user creates nodes (representing GPU operations) and specifies dependencies and
parameters. For example, cudaGraphAddKernelNode creates a node that is associated with a
CUDA kernel and its input and output parameters. cudaGraphAddMemcpyNode creates memory
copy node that performs data movement between host and device. A graph is defined by adding
nodes with their corresponding dependencies. CUDA Graph exploits the explicit dependency
information in manual mode for optimizing and scheduling the graph execution. Listing 5.1
depicts the CUDA Graph implementation of the toy task graph in Figure 5.1 in manual mode.
    cudaGraph_t G ;
    std :: vector < cudaGraphNode_t > nodeDependencies ;
    cudaGraphNode_t h2d_1 , h2d_2 , d2h , kernelNode ;
    c u d a K e r n e l N o d e P a r a m s kernelNodeParams = { 0 };
    // Step -1: Defining graph G
    checkCudaErrors ( cudaGraphCreate (& G , 0) ) ;
    checkCudaErrors ( c u d a S t r e a m C r e a t e W i t h F l a g s (& stream ,
          cudaStreamNonBlocking ));
    // creating H2D (a - > d_a ) node , assuming params is set properly
    checkCudaErrors ( c u d a G r a p h A d d M e m c p y N o d e (& h2d_1 , G , NULL , 0 , & params ) ) ;
    nodeDependencies . push_back ( h2d_1 ) ;
    // creating H2D (b - > d_b ) node , assuming params is set properly
    checkCudaErrors ( c u d a G r a p h A d d M e m c p y N o d e (& h2d_2 , G , NULL , 0 , & params ) ) ;
    nodeDependencies . push_back ( h2d_2 ) ;
    // creating addKernel node ( d_c = d_a + d_b ) , assuming kernelNodeParams
            is set properly
    kernelNodeParams . func = ( void *) addKernel ;
    void * kernelArgs [4] = { ( void *) & dev_c , ( void *) & dev_a , ( void *) & dev_b , &
          size };
    kernelNodeParams . kernelParams = kernelArgs ;
    checkCudaErrors ( c u d a G r a p h A d d K e r n e l N o d e (& kernelNode , G , nodeDependencies
                                                         81


         . data () , nodeDependencies . size () , & kernelNodeParams ) ) ;
    nodeDependencies . clear () ;
    // creating D2H ( d_c - > c ) node , assuming params is set properly
    checkCudaErrors ( c u d a G r a p h A d d M e m c p y N o d e (& d2h , G ,1 ,& kernelNode ,& params ) ) ;
    // Step -2: Instantiating G
    cudaGraphExec_t graphExec ;
    checkCudaErrors ( c u d a G r a p h I n s t a n t i a t e (& graphExec , G , NULL , NULL , 0) ) ;
    // Step -3: Executing G
    for ( int i = 0; i < loopCount ; ++ i ) {
      checkCudaErrors ( cudaGraphLaunch ( graphExec , stream ) ) ;
      checkCudaErrors ( c u d a S t r e a m S y n c h r o n i z e ( stream ) ) ;
    }
    // cleaning up
    checkCudaErrors ( c u d a G r a p h E x e c D e s t r o y ( graphExec ) ) ;
    checkCudaErrors ( cudaGraphDestroy ( G ) ) ;
    Listing 5.1: Snippet of CUDA Graph corresponding with the DAG shown in Figure 5.1
    In manual mode, the user needs to create nodes and edges explicitly and know the parameters
upfront to create the graph. However, kernel execution parameters are not available beforehand
for cuBLAS and cuSPARSE library routines. To address this problem, CUDA Graph offers
stream capture mode that creates the graph implicitly. In stream caputure mode, a graph is
created by capturing an existing CUDA stream-based implementation. This is realized by placing
cudaStreamBeginCapture and cudaStreamEndCapture routine at the beginning and end of the
existing implementation respectively. In this case, node dependencies are obtained from stream and
event synchronization. In stream capture mode, kernels that run on the same stream are serialized,
and the kernels that run on different streams can be executed in parallel.
    Once the graph is defined, the following two steps instantiate the constructed graph and run
the executable graph on GPU using a single function call. CUDA Graph API also allows users to
update the GPU kernel’s execution parameters or a node in the graph.
                                                         82


5.2     Implementing DeepSparseGPU Task Graph using CUDA Graph
    We construct the global task graph of DeepSparseGPU framework using CUDA Graph. We
take LOBPCG eigensolver as an example application. We give the pseudocode for the LOBPCG
solver in Algorithm 3. It involves kernels with high arithmetic intensities, such as SpMM and
several level-3 BLAS kernels, and other application kernels (i.e., matrix addition, subtraction,
multiplication). We tile each of the data structures and operations according to the user provided
tile size to build the LOBPCG task graph. As LOBPCG involves SpMM and multiple linear
combination operations and inner product operations, we use both manual and stream capture
mode to create the global task graph of the LOBPCG algorithm. We use manual to create the nodes
and dependencies related to application kernels as their parameters are known beforehand. We
use cuBLAS and cuSPARSE library routines for the linear combination, inner product, and SpMM
operations. However, the kernel launch parameters of these vendor libraries are not unavailable.
To overcome this issue, we use stream capture mode to create a standalone graph for each of the
cuBLAS and cuSPARSE library routines and add the standalone graph as a sub-graph node using
cudaGraphAddChildGraphNode routine in the main graph with proper dependencies. Listing 5.2
shows an example snippet on how we mix stream capture and manual mode while defining the
LOBPCG global task graph.
     // mainGraph
     cudaGraph_t mainGraph ;
     // capturing cuBLAS routine call using stream capture mode
     vector < cudaGraph_t > xyGraph_1 ( nrowblk ) ;
     vector < cudaGraphNode_t > xyNode_1 ( nrowblk ) ;
     for ( block_id = 0 ; block_id < nrowblk ; block_id ++) {
        blksz = block_width ;
        if ( block_id * block_width + blksz > M )
           blksz = M - block_id * block_width ;
           // stream capture call starts
                                                 83


     checkCudaErrors ( c u d a S t r e a m B e g i n C a p t u r e ( stream1 ,
         cudaStreamCaptureModeGlobal ));
     // cuBLAS ( cublasDgemm ) call starts
     cubstat = cublasDgemm ( handle , CUBLAS_OP_N , CUBLAS_OP_N , blocksize ,
         blksz , blocksize , & cudaAlpha , lambda , blocksize , X + ( block_id *
         block_width * blocksize ) , blocksize , & cudaBeta , R + ( block_id *
         block_width * blocksize ) , blocksize ) ;
     // stream capture ends and creation of standalone XY graph
     checkCudaErrors ( c u d a S t r e a m E n d C a p t u r e ( stream1 , & xyGraph_1 [ block_id ]) ) ;
     // Adding the standalone graph to the mainGraph as a sub - graph node
         with proper dependencies
     checkCudaErrors ( c u d a G r a p h A d d C h i l d G r a p h N o d e (& xyNode_1 [ block_id ] ,
         mainGraph , & setZeroNode , 1 , xyGraph_1 [ block_id ]) ) ;
  }
  // creating nodes for subtraction operation using manual mode
  vector < cudaGraphNode_t > subNode_1 ( nrowblk ) ;
  vector < cudaKernelNodeParams > params ( nrowblk ) ;
  for ( block_id = 0 ; block_id < nrowblk ; block_id ++) {
     // creating sub nodes , assuming params is set properly
     // adding incomin edges from stream captured XY sub - graph node
     checkCudaErrors ( c u d a G r a p h A d d K e r n e l N o d e (& subNode_1 [ block_id ] ,
         mainGraph , & xyNode_1 [ block_id ] , 1 , & params [ block_id ]) ) ;
  }
Listing 5.2: Snippet of CUDA Graph showing mixture of stream capture and manual mode
  Once the the global task graph is defined, we instantiate it and run it for every iteration.
                                                      84


                               Table 5.1: Matrices used in our evaluation
                            Matrix          Rows       Columns        Nonzeros
                           inline_1        503,712     503,712       36,816,342
                        Queen_4147        4,147,110   4,147,110     329,499,284
                       dielFilterV3real   1,102,824   1,102,824      89,306,020
                           HV15R          2,017,169   2,017,169     281,419,743
                         Bump_2911        2,911,419   2,911,419     127,729,899
5.3     Performance Evaluation
5.3.1    Experimental Setup
We conducted all of our experiments on Perlmutter Phase I at the National Energy Research
Scientific Computing Center (NERSC) [93]. Each compute node has a single 64-core AMD EPYC
7763 (Milan) processor clocked at 2.0 GHz and 4 NVIDIA Ampere A100 GPUs with ∼40 GBs of
HBM per GPU. The A100 GPU model has a peak double-precision performance of 9.7 TFLOP/s.
There is a total of 256 GB DDR4 DRAM space on each node. The CPUs are connected to the GPUs
via four PCIe 4.0 switches, and the GPUs are connected to each other via NVIDIA’s NVLink-3
interconnect. The Perlmutter provides NVIDIA HPC SDK 21.11 suite to compile CUDA programs.
In addition, the GPU accelerated cuSPARSE and cuBLAS libraries provided with NVIDIA HPC
SDK 21.11 are used to compile our baseline implementation. We take a full Perlmutter node for
our experiments to avoid a noisy environment created by sharing the same node with other users.
5.3.2    Dataset
We selected 5 square matrices with varying sizes, sparsity patterns, and domains from the SuiteS-
parse Matrix Collection (see Table 5.1). For our preliminary study, we chose an appropriate number
of vector blocks to fit the full problem size in the GPU memory. The problem size ranges from
2.89 GB to 9.01 GB. Performance data for LOBPCG is averaged over the last three iterations of
five iterations.
                                                  85


5.3.3   Baseline Implementation
We compare the performance of LOBPCG CUDA Graph implementation with a baseline imple-
mentation that we call libcsr, which is an implementation of the LOBPCG solver using GPU
accelerated cuSPARSE and cuBLAS libraries for SpMM (with CSR storage of the sparse matrix),
inner product, and linear combination kernels. The application kernels are implemented on the
GPU using CUDA. Application kernels include tall skinny matrix operations such as addition, sub-
traction, element-wise multiplication, etc. We use NVIDIA’s unified memory technology (using
the cudaMallocManaged function) in both (CUDA Graph and libcsr) implementations. As such,
the runtime system automatically takes care of the data movement between the host and device. We
use single stream for both implementations.
5.3.4   LOBPCG Evaluation
Our performance comparison criteria includes the average execution time per iteration between
libcsr (CUDA) and CUDA Graph based DeepSparseGPU with unified memory. For LOBPCG
global task graph, we experimented with different tile sizes of 8K, 16K, 32K. We report the results
of the best performing tile size for DeepSparseGPU. In libcsr, we are not required to tile the input
matrices as we used unified memory with cuSPARSAE and CUBLAS library kernels from CUDA
and CUDA C++ for application kernels.
    Figure 5.2 shows the execution time comparison between both implementations. As we can
see CUDA Graph based DeepSparseGPU implementation runs 1.08× - 1.13× faster than the
libcsr implementation. We do thousands of kernel launch in CUDA Graph based DeepSparseGPU
implementation. Our preliminary result shows that CUDA Graph execution effectively minimizes
the kernel launch overhead.
5.4     Summary of the Chapter
    In this work, we explored the feasibility of using CUDA Graph to implement DeepSparseGPU
with unified memory. We used both implicit and explicit graph creation in order to create the
                                                  86


   Figure 5.2: libcsr (CUDA) vs DeepSparseGPU (CUDA Graph) execution time comparison
LOBPCG global task graph. Our preliminary result shows that CUDA Graph effectively minimizes
kernel launch overhead. Our future work includes incorporating our custom memory manager into
CUDA Graph-based DeepSparseGPU implementation, finding an optimal task scheduling order
with 2D decomposition of the SpMM operation, and choosing the optimal number of streams.
                                              87


                                             CHAPTER 6
                             CONCLUSION AND FUTURE WORK
In this thesis, we aimed to accelerate sparse matrix computation in several ways by minimizing
data movement between memory layers. At first, we introduced a novel task-parallel framework
named DeepSparse, aiming to optimize computational steps in a sparse solver rather than a single
kernel such as SpMV or SpMM. DeepSparse automatically generates and expresses the entire
computation as a task dependency graph (TDG) and relies on the OpenMP task to execute this
TDG. We observed a reduction in runtime and a significant reduction of cache misses in all levels
of memory using our task-parallel DeepSparse framework.
    With the success of a single shared memory node implementation of DeepSparse, we aimed to
port and optimized it on GPU-accelerated systems using OpenMP target offload. However, before
diving straight into the implementations, we wanted to evaluate the performance of directive-based
GPU programming models (i.e., OpenMP target offload, OpenACC) on LOBPCG block eigensolver
with consideration of large sparse matrices. We achieved good speed on GPU using OpenMP and
OpenACC over a baseline CPU implementation.
    Inspired by the promising results of the evaluation study, we ported and optimized the DeepSparse
framework on GPU using OpenMP target offload. To this end, we designed and implemented a
Memory Manager to process input sizes that exceed GPU memory limits. We have seen that our
GPU implementation of DeepSparse with memory manager greatly reduce the data movements
between CPU and GPU using a topological sort based task scheduling. We plan to use 2D de-
composition of the sparse matrix to create the DAG and we believe with 2D decomposition would
give better results as 2D decomposition exposes more parallelism compared to 1D decomposition.
We also plan to use a graph partitioner to partition the DAG with a tight memory bound for active
memories to that an entire partition fits into GPU memory. We would like to incorporate our cus-
tom memory manager and optimal scheduling order in our CUDA Graph-based DeepSparseGPU
implementation.
                                                  88


BIBLIOGRAPHY
      89


                                          BIBLIOGRAPHY
[1] S. Bogner, A. Bulgac, J. Carlson, J. Engel, G. Fann, R. J. Furnstahl, S. Gandolfi, G. Hagen,
     M. Horoi, C. Johnson et al., “Computational nuclear quantum many-body problem: The
     unedf project,” Computer Physics Communications, vol. 184, no. 10, pp. 2235–2250, 2013.
[2] D. Keyes, A. Ecer, J. Periaux, N. Satofuka, and P. Fox, “Towards realistic performance bounds
     for implicit cfd codes,” in Parallel Computational Fluid Dynamics: Towards Teraflops,
     Optimization, and Novel Formulations: Proceedings of the Parallel CFD’99 Conference.
     North-Holland, 2000, p. 241.
[3] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
     P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,”
     Journal of machine learning research, vol. 12, no. Oct, pp. 2825–2830, 2011.
[4] N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta,
     Z. Yin, and P. Dubey, “Navigating the maze of graph analytics frameworks using massive
     graph datasets,” in Proceedings of the 2014 ACM SIGMOD international conference on
     Management of data. ACM, 2014, pp. 979–990.
[5] J. Dongarra, S. Gottlieb, and W. T. Kramer, “Race to exascale,” Computing in Science &
     Engineering, vol. 21, no. 1, pp. 4–5, 2019.
[6] TOP500, “Top500 supercomputers,” http://www:top500:org.
[7] J. M. Diaz, S. Pophale, K. Friedline, O. Hernandez, D. E. Bernholdt, and S. Chandrasekaran,
     “Evaluating support for openmp offload features,” in Proceedings of the 47th International
     Conference on Parallel Processing Companion, 2018, pp. 1–10.
[8] F. Rabbi, C. S. Daley, H. M. Aktulga, and N. J. Wright, “Evaluation of directive-based gpu
     programming models on a block eigensolver with consideration of large sparse matrices,” in
     International Workshop on Accelerator Programming Using Directives. Springer, 2019, pp.
     66–88.
[9] C. Daley, H. Ahmed, S. Williams, and N. Wright, “A case study of porting hpgmg from
     cuda to openmp target offload,” in International Workshop on OpenMP. Springer, 2020, pp.
     37–51.
[10] P. Maris, H. M. Aktulga, M. A. Caprio, Ü. V. Çatalyürek, E. G. Ng, D. Oryspayev, H. Potter,
     E. Saule, M. Sosonkina, J. P. Vary et al., “Large-scale ab initio configuration interaction
     calculations for light nuclei,” in Journal of Physics: Conference Series, vol. 403, no. 1. IOP
     Publishing, 2012, p. 012019.
[11] H. M. Aktulga, A. Buluç, S. Williams, and C. Yang, “Optimizing sparse matrix-multiple
     vectors multiplication for nuclear configuration interaction calculations,” in Parallel and
     Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014, pp. 1213–
     1222.
                                                  90


[12] T. Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, and S. W. Keckler, “Towards high perfor-
     mance paged memory for gpus,” in 2016 IEEE International Symposium on High Performance
     Computer Architecture (HPCA). IEEE, 2016, pp. 345–357.
[13] C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential
     and integral operators. United States Governm. Press Office Los Angeles, CA, 1950.
[14] A. V. Knyazev, “Toward the optimal preconditioned eigensolver: Locally optimal block
     preconditioned conjugate gradient method,” SIAM journal on scientific computing, vol. 23,
     no. 2, pp. 517–541, 2001.
[15] M. Afibuzzaman, F. Rabbi, M. Y. Özkaya, H. M. Aktulga, and U. V. Çatalyürek, “Deepsparse:
     A task-parallel framework for sparsesolvers on deep memory architectures,” in 2019 IEEE
     26th International Conference on High Performance Computing, Data, and Analytics (HiPC).
     IEEE, 2019, pp. 373–382.
[16] J. Kepner, D. Bade, A. Buluç, J. Gilbert, T. Mattson, and H. Meyerhenke, “Graphs, matrices,
     and the graphblas: Seven good reasons,” arXiv preprint arXiv:1504.01039, 2015.
[17] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Green-
     baum, S. Hammerling, A. McKenney et al., “Lapack users’ guide, vol. 9,” Society for Industrial
     Mathematics, vol. 39, 1999.
[18] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic linear algebra subprograms
     for fortran usage,” ACM Transactions on Mathematical Software (TOMS), vol. 5, no. 3, pp.
     308–323, 1979.
[19] A. OpenMP, “Openmp application program interface version 4.0,” 2013.
[20] G. 500, “Green500 supercomputers,” http://www.green500.org.
[21] “HIP : Convert CUDA to Portable C++ Code,” https://github.com/ROCm-Developer-Tools/
     HIP; accessed 4 September 2019, 2019.
[22] A. V. Knyazev and M. E. Argentati, “Implementation of a preconditioned eigensolver using
     hypre,” 2005.
[23] Y. Saad, Iterative methods for sparse linear systems.     siam, 2003, vol. 82.
[24] M. Feit, J. Fleck Jr, and A. Steiger, “Solution of the schrödinger equation by a spectral
     method,” Journal of Computational Physics, vol. 47, no. 3, pp. 412–433, 1982.
[25] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in
     Advances in neural information processing systems, 2002, pp. 849–856.
[26] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and
     intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.
[27] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing
     order to the web.” Stanford InfoLab, Tech. Rep., 1999.
                                                  91


[28] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse
     matrix-vector multiplication on emerging multicore platforms,” in SC’07: Proceedings of the
     2007 ACM/IEEE Conference on Supercomputing. IEEE, 2007, pp. 1–12.
[29] N. Bell and M. Garland, “Implementing sparse matrix-vector multiplication on throughput-
     oriented processors,” in Proceedings of the conference on high performance computing net-
     working, storage and analysis. ACM, 2009, p. 18.
[30] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, “Parallel sparse
     matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,”
     in Proceedings of the twenty-first annual symposium on Parallelism in algorithms and archi-
     tectures. ACM, 2009, pp. 233–244.
[31] E.-J. Im and K. A. Yelick, Optimizing the performance of sparse matrix-vector multiplication.
     University of California, Berkeley, 2000.
[32] B. C. Lee, R. W. Vuduc, J. W. Demmel, and K. A. Yelick, “Performance models for evaluation
     and automatic tuning of symmetric sparse matrix-vector multiply,” in Parallel Processing,
     2004. ICPP 2004. International Conference on. IEEE, 2004, pp. 169–176.
[33] R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick, “When cache blocking of sparse
     matrix vector multiply works and why,” Applicable Algebra in Engineering, Communication
     and Computing, vol. 18, no. 3, pp. 297–311, 2007.
[34] A. Pinar and M. T. Heath, “Improving performance of sparse matrix-vector multiplication,” in
     SC’99: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing. IEEE, 1999,
     pp. 30–30.
[35] Y. Saad, “Sparskit: A basic tool kit for sparse matrix computations,” 1990.
[36] R. Vuduc, J. W. Demmel, and K. A. Yelick, “Oski: A library of automatically tuned sparse
     matrix kernels,” in Journal of Physics: Conference Series, vol. 16, no. 1. IOP Publishing,
     2005, p. 071.
[37] S. W. Williams, Auto-tuning performance on multicore computers. University of California,
     Berkeley, 2008.
[38] A. El Guennouni, K. Jbilou, and A. Riquet, “Block krylov subspace methods for solving large
     sylvester equations,” Numerical Algorithms, vol. 29, no. 1-3, pp. 75–96, 2002.
[39] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, “Toward realistic performance
     bounds for implicit cfd codes,” in Proceedings of parallel CFD, vol. 99. Citeseer, 1999, pp.
     233–240.
[40] X. Liu, E. Chow, K. Vaidyanathan, and M. Smelyanskiy, “Improving the performance of
     dynamical simulations via multiple right-hand sides,” in Parallel & Distributed Processing
     Symposium (IPDPS), 2012 IEEE 26th International. IEEE, 2012, pp. 36–47.
[41] A. E. Sarıyüce, E. Saule, K. Kaya, and Ü. V. Çatalyürek, “Regularizing graph centrality
     computations,” Journal of Parallel and Distributed Computing, vol. 76, pp. 106–119, 2015.
                                                 92


[42] A. Buluç, H. Meyerhenke, I. Safro, P. Sanders, and C. Schulz, “Recent advances in graph
     partitioning,” Algorithm engineering, pp. 117–158, 2016.
[43] A. Azad and A. Buluc, “A work-efficient parallel sparse matrix-sparse vector multiplication
     algorithm,” Proceedings - 2017 IEEE International Parallel and Distributed Processing
     Symposium (IPDPS), vol. 2017, 7 2017.
[44] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra,
     S. Hammarling, G. Henry, A. Petitet et al., ScaLAPACK users’ guide. SIAM, 1997.
[45] K. Remington, R. Pozo et al., “Nist sparse blas: User’s guide,” Citeseer, Tech. Rep., 1996.
[46] P. Kogge and J. Shalf, “Exascale computing trends: Adjusting to the" new normal" for
     computer architecture,” Computing in Science & Engineering, vol. 15, no. 6, pp. 16–26,
     2013.
[47] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq,
     K. R. Long, R. P. Pawlowski, E. T. Phipps et al., “An overview of the trilinos project,” ACM
     Transactions on Mathematical Software (TOMS), vol. 31, no. 3, pp. 397–423, 2005.
[48] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout,
     W. Gropp, D. Kaushik et al., “Petsc users manual revision 3.8,” Argonne National Lab.(ANL),
     Argonne, IL (United States), Tech. Rep., 2017.
[49] R. D. Falgout and U. M. Yang, “hypre: A library of high performance preconditioners,” in
     International Conference on Computational Science. Springer, 2002, pp. 632–641.
[50] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek,
     and S. Tomov, “Numerical linear algebra on emerging architectures: The plasma and magma
     projects,” in Journal of Physics: Conference Series, vol. 180, no. 1. IOP Publishing, 2009,
     p. 012037.
[51] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, “A class of parallel tiled linear algebra
     algorithms for multicore architectures,” Parallel Computing, vol. 35, no. 1, pp. 38–53, 2009.
[52] S. Tomov, J. Dongarra, and M. Baboulin, “Towards dense linear algebra for hybrid GPU
     accelerated manycore systems,” Parallel Computing, vol. 36, no. 5-6, pp. 232–240, Jun. 2010.
[53] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, “Dense linear algebra solvers for multicore
     with GPU accelerators,” in Proc. of the IEEE IPDPS’10. Atlanta, GA: IEEE Computer
     Society, April 19-23 2010, pp. 1–8, DOI: 10.1109/IPDPSW.2010.5470941.
[54] J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki, “Ac-
     celerating numerical dense linear algebra calculations with gpus,” Numerical Computations
     with GPUs, pp. 1–26, 2014.
[55] I. S. Barrera, M. Moretó, E. Ayguadé, J. Labarta, M. Valero, and M. Casas, “Reducing data
     movement on large shared memory systems by exploiting computation dependencies,” in
     Proceedings of the 2018 International Conference on Supercomputing. ACM, 2018, pp.
     207–217.
                                                93


[56] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “StarPU: A Unified Platform
     for Task Scheduling on Heterogeneous Multicore Architectures,” in Euro-Par - 15th
     International Conference on Parallel Processing, ser. Lecture Notes in Computer Science,
     vol. 5704. Delft, The Netherlands: Springer, Aug. 2009, pp. 863–874. [Online]. Available:
     http://hal.inria.fr/inria-00384363
[57] E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, and Ü. V. Çatalyürek, “An out-of-core task-
     based middleware for data-intensive scientific computing,” in Handbook on Data Centers.
     Springer, 2015, pp. 647–667.
[58] A. Buluç and J. R. Gilbert, “Parallel sparse matrix-matrix multiplication and indexing: Im-
     plementation and experiments,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp.
     C170–C191, 2012.
[59] J. W. Demmel, Applied Numerical Linear Algebra.        SIAM, 1997.
[60] V. G. V. Larrea, R. Budiardja, R. Gayatri, C. Daley, O. Hernandez, and W. Joubert, “Expe-
     riences porting mini-applications to OpenACC and OpenMP on heterogeneous systems,” in
     Cray User Group (CUG), May 2019.
[61] Y. Wang, “Research on matrix multiplication based on the combination of openacc and cuda,”
     in Geo-informatics in Sustainable Ecosystem and Society, Y. Xie, A. Zhang, H. Liu, and
     L. Feng, Eds. Singapore: Springer Singapore, 2019, pp. 100–108.
[62] S. Deldon, J. Beyer, and D. Miles, “OpenACC and CUDA Unified Memory,” in Cray User
     Group (CUG), May 2018.
[63] N. Sakharnykh, “EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEM-
     ORY.” Presented at GPU Technology Conference (GTC) 2018. Also available
     at http://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-
     know-about-unified-memory.pdf, 3 2018.
[64] M. Knap and P. Czarnul, “Performance evaluation of unified memory with prefetching and
     oversubscription for selected parallel cuda applications on nvidia pascal and volta gpus,” The
     Journal of Supercomputing, pp. 1–21, 2019.
[65] C. Yang, A. Buluç, and J. D. Owens, “Design principles for sparse matrix multiplication on
     the gpu,” in European Conference on Parallel Processing. Springer, 2018, pp. 672–687.
[66] G. Ortega, F. Vázquez, I. García, and E. M. Garzón, “Fastspmm: An efficient library for
     sparse matrix matrix product on gpus,” The Computer Journal, vol. 57, no. 7, pp. 968–979,
     2014.
[67] H. Anzt, S. Tomov, and J. Dongarra, “Accelerating the lobpcg method on gpus using a
     blocked sparse matrix vector product,” in Proceedings of the Symposium on High Performance
     Computing. Society for Computer Simulation International, 2015, pp. 75–82.
                                                94


[68] C. Hong, A. Sukumaran-Rajam, B. Bandyopadhyay, J. Kim, S. E. Kurt, I. Nisa, S. Sabhlok,
     Ü. V. Çatalyürek, S. Parthasarathy, and P. Sadayappan, “Efficient sparse-matrix multi-vector
     product on gpus,” in Proceedings of the 27th International Symposium on High-Performance
     Parallel and Distributed Computing. ACM, 2018, pp. 66–79.
[69] H. Anzt, S. Tomov, and J. Dongarra, “Implementing a sparse matrix vector product for the
     sell-c/sell-c-σ formats on nvidia gpus,” University of Tennessee, Tech. Rep. ut-eecs-14-727,
     2014.
[70] A. V. Knyazev, M. E. Argentati, I. Lashuk, and E. E. Ovtchinnikov, “Block locally optimal
     preconditioned eigenvalue xolvers (blopex) in hypre and petsc,” SIAM Journal on Scientific
     Computing, vol. 29, no. 5, pp. 2224–2239, 2007.
[71] A. Dziekonski, M. Rewienski, P. Sypek, A. Lamecki, and M. Mrozowski, “Gpu-accelerated
     lobpcg method with inexact null-space filtering for solving generalized eigenvalue problems
     in computational electromagnetics analysis with higher-order fem,” Communications in Com-
     putational Physics, vol. 22, no. 4, pp. 997–1014, 2017.
[72] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, “Cusparse library,” in GPU Technol-
     ogy Conference, 2010.
[73] “Openmp              specification,”            https://www.openmp.org/wp-content/uploads/
     OpenMP-API-Specification-5.0.pdf.
[74] P. Maris, M. Sosonkina, J. P. Vary, E. Ng, and C. Yang, “Scaling of ab-initio nuclear physics
     calculations on multicore computer architectures,” Procedia Computer Science, vol. 1, no. 1,
     pp. 97–106, 2010.
[75] P. Sternberg, E. G. Ng, C. Yang, P. Maris, J. P. Vary, M. Sosonkina, and H. V. Le, “Acceler-
     ating configuration interaction calculations for nuclear structure,” in Proceedings of the 2008
     ACM/IEEE conference on Supercomputing. IEEE Press, 2008, p. 15.
[76] NERSC, “Cori-gpu system configuration,” https://docs-dev.nersc.gov/cgpu/.
[77] ORNL, “Summit system configuration,” https://www.olcf.ornl.gov/summit/.
[78] S. S. Vazhkudai, B. R. de Supinski, A. S. Bland, A. Geist, J. Sexton, J. Kahle, C. J. Zimmer,
     S. Atchley, S. Oral, D. E. Maxwell et al., “The design, deployment, and evaluation of the coral
     pre-exascale systems,” in Proceedings of the International Conference for High Performance
     Computing, Networking, Storage, and Analysis. IEEE Press, 2018, p. 52.
[79] T. Davis, Y. Hu, and S. Kolodziej, “The suitesparse matrix collection,” http://faculty.cse.tamu.
     edu/davis/suitesparse.html, 2018.
[80] F. Khorasani, R. Gupta, and L. N. Bhuyan, “Scalable simd-efficient graph processing on
     gpus,” in 2015 International Conference on Parallel Architecture and Compilation (PACT).
     IEEE, 2015, pp. 39–50.
                                                 95


[81] X. Cui, T. R. W. Scogland, B. R. d. Supinski, and W. Feng, “Directive-based partitioning and
     pipelining for graphics processing units,” in 2017 IEEE International Parallel and Distributed
     Processing Symposium (IPDPS), May 2017, pp. 575–584.
[82] M. Garland, “Sparse matrix computations on manycore gpu’s,” in Proceedings of the 45th
     annual Design Automation Conference. ACM, 2008, pp. 2–6.
[83] J. W. Choi, A. Singh, and R. W. Vuduc, “Model-driven autotuning of sparse matrix-vector
     multiply on gpus,” in ACM sigplan notices, vol. 45, no. 5. ACM, 2010, pp. 115–126.
[84] X. Yang, S. Parthasarathy, and P. Sadayappan, “Fast sparse matrix-vector multiplication on
     gpus: implications for graph mining,” Proceedings of the VLDB Endowment, vol. 4, no. 4,
     pp. 231–242, 2011.
[85] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance
     model for floating-point programs and multicore architectures,” Communications of the As-
     sociation for Computing Machinery, 2009.
[86] J. H. Davis, C. Daley, S. Pophale, T. Huber, S. Chandrasekaran, and N. J. Wright, “Per-
     formance assessment of openmp compilers targeting nvidia v100 gpus,” arXiv preprint
     arXiv:2010.09454, 2020.
[87] Y. Kim, J. Lee, D. Kim, and J. Kim, “Scalegpu: Gpu architecture for memory-unaware gpu
     programming,” IEEE Computer Architecture Letters, vol. 13, no. 2, pp. 101–104, 2013.
[88] NVIDIA,           “Nvlink-4        bandwidth,”           https://nvidianews.nvidia.com/news/
     nvidia-announces-cpu-for-giant-ai-and-high-performance-computing-workloads.
[89] P. Ramarao, “CUDA 10 Features Revealed: Turing, CUDA Graphs, and More,” https://
     devblogs.nvidia.com/cuda-10-features-revealed/, in NVIDIA GPU Technology Conference
     (GTC), 2021.
[90] S. Jones, “Cuda: New features and beyond,” http://on-demand.gputechconf.com/gtc/2018/
     presentation/s8278-cuda-new-features-and-beyond.pdf, 2018.
[91] C. Li and J. Yao, “Cuda graph in tensorflow,” https://www.nvidia.com/en-us/on-demand/
     session/gtcspring21-s31312/, in NVIDIA GPU Technology Conference (GTC), 2021.
[92] R. V. der Wijngaart, “Effortless cuda graphs,” https://www.nvidia.com/en-us/on-demand/
     session/gtcspring21-s32082/, in NVIDIA GPU Technology Conference (GTC), 2021.
[93] NERSC, “Perlmutter system configuration,” https://docs.nersc.gov/systems/perlmutter/
     system_details/.
                                                 96