OPTIMIZATION OF LARGE SCALE ITERATIVE EIGENSOLVERS

By

Md Afibuzzaman

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Computer Science - Doctor of Philosophy

2021

ABSTRACT

OPTIMIZATION OF LARGE SCALE ITERATIVE EIGENSOLVERS

By

Md Afibuzzaman

Sparse matrix computations, in the form of solvers for systems of linear equations, eigenvalue problems or matrix factorizations, constitute the main kernel in problems from fields as diverse as computational fluid dynamics, quantum many-body problems, machine learning and graph analytics. Iterative eigensolvers are preferred over direct methods because direct methods are not feasible for matrices of industrial size. Although dense linear algebra libraries such as BLAS, LAPACK and ScaLAPACK are well established, and vendor-optimized implementations such as Intel MKL or Cray LibSci exist, the state of sparse linear algebra lags far behind. The main reason for the slow progress in standardizing sparse linear algebra and developing libraries for it is that sparse matrices come in very different forms and with different properties depending on the application area. The situation is worsened on the deep memory hierarchies of modern architectures due to low arithmetic intensities and memory-bound computations. Minimizing data movement and providing fast access to the matrix are critical in this setting, since current technology is driven by deep memory architectures in which capacity increases at the expense of increased latency and decreased bandwidth as we move further from the processor. The key to achieving high performance in sparse matrix computations on a deep memory hierarchy is to minimize data movement across the layers of the memory and to overlap data movement with computations. My thesis work contributes towards addressing the algorithmic challenges and developing a computational infrastructure to achieve high performance in scientific applications for both shared memory and distributed memory architectures. For this purpose, I started by optimizing a blocked eigensolver, targeting specific computational kernels that use a new storage format. Using this optimization as a building block, we introduce a shared memory task-parallel framework that focuses on optimizing entire solvers rather than a specific kernel. Before extending this shared memory implementation to distributed memory architectures, I simulated the communication pattern and overheads of a large-scale distributed memory application, and then introduced communication tasks into the framework to overlap communication and computation. Additionally, I explored a custom scheduler for the tasks using a graph partitioner. To get acquainted with high performance computing and parallel libraries, I began my PhD journey by optimizing a DFT code named Sky3D using dense matrix libraries. Although there may not be a single solution to this problem, I aimed to find an optimized one. While the large distributed memory application MFDn is the driver project of this thesis, the framework we developed is not confined to MFDn; it can be used for other scientific applications as well. The output of this thesis is the task-parallel HPC infrastructure that we envisioned for both shared and distributed memory architectures.

Copyright by
MD AFIBUZZAMAN
2021

I would like to dedicate this thesis to my parents Md.
Abdul Motin and Afrin Sultana, my wife Mayeesha Farzana and my son Taaif Abdullah, for supporting me endlessly towards achieving this goal.

ACKNOWLEDGMENTS

I would like to thank my advisor Dr. Hasan Metin Aktulga, who has supported me throughout my PhD career with research grants and enriched my knowledge with innovative ideas. From the very beginning he taught me the fundamentals of research. He offered constructive criticism throughout my PhD journey and motivated me to keep going when things were not going as I expected. I would also like to thank my friends at Michigan State University, who made my time enjoyable and did not let me fall victim to research-related fatigue. I sincerely thank my group mates for their ideas and support. My parents and my brother Md Ahiduzzaman always encouraged me and supported me while I lived in a different country. My son Taaif, who was born in the final year of my PhD, motivated me to push my work forward with his amazing smile. And most importantly, my wife Mayeesha deserves the most credit for her constant support and presence in my life.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  INTRODUCTION AND MOTIVATION
  1.1 Background And Related Work
  1.2 Emergence Of Deep Memory Hierarchies
  1.3 Optimizing Sky3D
  1.4 Optimization Of Blocked Eigensolver For Sparse Matrix
  1.5 Introducing the DeepSparse Framework
  1.6 Exploring Custom Schedule For Tasks Using Graph Partition
  1.7 Simulating Communication Behavior Of A Real World Distributed Application
  1.8 Introducing Communication for DeepSparse

Chapter 2  OPTIMIZATION IN LARGE SCALE DISTRIBUTED DENSE MATRICES
  2.1 Nuclear Density Functional Theory With Skyrme Interaction
  2.2 Sky3D Software
  2.3 Distributed Memory Parallelization With MPI
    2.3.1 The 1D and 2D partitionings
    2.3.2 Parallelization across Neutron and Proton groups
    2.3.3 Calculations with 2D distributions
      2.3.3.1 Matrix construction (step 3a)
      2.3.3.2 Diagonalization and Orthonormalization (steps 3b & c)
      2.3.3.3 Post-processing (steps 3d & e)
    2.3.4 Calculations with a 1D distribution
    2.3.5 Switching between different data distributions
    2.3.6 Memory considerations
  2.4 Shared Memory Parallelization with OpenMP
  2.5 Performance Evaluation
    2.5.1 Experimental setup
    2.5.2 Scalability
    2.5.3 Comparison between MPI-only and MPI/OpenMP hybrid parallelization
    2.5.4 Load balancing
    2.5.5 Conclusion of this work

Chapter 3  OPTIMIZATION IN LARGE SCALE DISTRIBUTED SPARSE MATRICES
  3.1 Eigenvalue Problem in CI Calculations
  3.2 Motivation and CI Implementation
  3.3 Multiplication of the Sparse Matrix with Multiple Vectors (SpMM)
  3.4 Matrix Storage Formats
  3.5 Methodology and Optimization
  3.6 An Extended Roofline Model for CSB
  3.7 Kernels with Tall and Skinny Matrices
  3.8 Performance Evaluation
    3.8.1 Experimental setup
    3.8.2 Performance of SpMM and SpMMᵀ
    3.8.3 Improvement with using CSB
    3.8.4 Tuning for the Optimal Value of β
    3.8.5 Combined SpMM/SpMMᵀ performance
    3.8.6 Performance analysis
    3.8.7 Performance of tall-skinny matrix operations
    3.8.8 Performance summary
  3.9 Evaluation on Xeon Phi Knights Corner (KNC)
    3.9.1 Conclusions of this work

Chapter 4  ON NODE TASK PARALLEL OPTIMIZATION
  4.1 DeepSparse Overview
    4.1.1 Primitive Conversion Unit (PCU)
      4.1.1.1 Task Identifier (TI)
      4.1.1.2 Task Dependency Graph Generator (TDGG)
    4.1.2 Task Executor
    4.1.3 Illustrative Example
    4.1.4 Limitations of the Task Executor
  4.2 Benchmark Applications
    4.2.1 Lanczos
    4.2.2 LOBPCG
  4.3 Performance Evaluation
    4.3.1 Experimental setup
    4.3.2 LOBPCG evaluation
    4.3.3 Lanczos evaluation
    4.3.4 Compiler comparison
    4.3.5 Conclusions of this work

Chapter 5  SCHEDULING TASKS WITH A GRAPH PARTITIONER
  5.1 Coarsening
  5.2 Initial Partitioning
  5.3 Uncoarsening/Refinement
  5.4 Partitioned Graphs
  5.5 Coarsening A Block of SpMM Nodes Into One Block
  5.6 Issues with the Partitioner
    5.6.1 Upper bounds
    5.6.2 Refinement
    5.6.3 Graph structure
    5.6.4 Edgecut
  5.7 PowerLaw Graph Partitioning Attempts
    5.7.1 Lowest Common Ancestor
    5.7.2 Hierarchical partitioning attempts
  5.8 Memory Bound Implementation
  5.9 Experimental Results
    5.9.1 Experiment setup
    5.9.2 Performance of the partitioner
  5.10 Future work on the partitioner

Chapter 6  SIMULATING THE COMMUNICATION PATTERNS IN A LARGE SCALE DISTRIBUTED APPLICATION
  6.1 MFDn Communication Motif
  6.2 Simulation of a Distributed communication
  6.3 Simulation Framework and Implementation
    6.3.1 Ember
    6.3.2 FireFly
    6.3.3 Merlin
  6.4 Implementation
    6.4.1 Random Distribution of Processes
  6.5 Evaluation and Results
    6.5.1 Hardware and Software
    6.5.2 Benchmark problems
    6.5.3 SST parameters for Cori-KNL simulations
    6.5.4 Simulation results in Cori-KNL
  6.6 Simulation for A Future Network
    6.6.1 Conclusions of this work

Chapter 7  OPTIMIZING A DISTRIBUTED MEMORY APPLICATION USING DEEPSPARSE
  7.1 Motivation
    7.1.1 Introducing communication tasks
    7.1.2 Better pipelining of matrix and vector operations
  7.2 Methodology
    7.2.1 Issue with blocking MPI calls
    7.2.2 Issue with absence of TAG fields in MPI collectives
    7.2.3 Distributed SpMM
    7.2.4 Blocked communication
    7.2.5 Custom reduction
  7.3 Experiments and Results
    7.3.1 Experimental setup
    7.3.2 Impact of blocked communications
    7.3.3 Improvement with custom reduction
    7.3.4 Breakdown of individual kernel performance
    7.3.5 Expensive matrix multiplication compared to vector operations
    7.3.6 Conclusions of this work

Chapter 8  CONCLUSION AND FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Hardware specifications for a single socket on Cori, a Cray XC40 supercomputer at NERSC. Each node consists of two sockets.
Table 2.2: Scalability of MPI-only version of Sky3D for the L = 32 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.
Table 2.3: Scalability of MPI-only version of Sky3D for the L = 48 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.
Table 2.4: Scalability of MPI/OpenMP parallel version of Sky3D for the L = 32 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.
Table 2.5: Scalability of MPI/OpenMP parallel version of Sky3D for the L = 48 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.
Table 2.6: Scalability of MPI-only version of Sky3D for the 5000 neutrons and 1000 protons system using the L = 48 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.
Table 3.1: MFDn matrices (per-process sub-matrix) used in our evaluations. For the statistics in this table, all matrices were cache blocked using β = 6000.
Table 3.2: Overview of Evaluated Platforms. 1 With hyper threading, but only 12 threads were used in our computations. 2 Based on the saxpy1 benchmark in [1]. 3 Memory bandwidth is measured using the STREAM copy benchmark.
Table 3.3: Statistics for the full MFDn matrices used in distributed memory parallel Lanczos/FO and LOBPCG executions.
Table 4.1: Major data structures after parsing third line.
Table 4.2: Matrices used in our evaluation.
Table 6.1: Matrices used in this study, the dimensions and number of nonzero matrix elements of each matrix.
Table 6.2: Router and NIC Parameters used for Simulating Cori-KNL.
Table 6.3: MPI ranks, number of diagonals, number of ranks per custom communicators, Message Size during Broadcast and Reduce and Message Size during Allgather and Reduce_Scatter.
Table 6.4: Router and NIC Parameters used for simulating Perlmutter's predicted network.
Table 7.1: Overview of Evaluated Platforms. 1 With hyper threading, but only 12 threads were used in our computations.
2 Based on the saxpy1 benchmark in [1]. 3 Memory bandwidth is measured using the STREAM copy benchmark.
Table 7.2: Matrices used in this experiment. Number of MPI ranks, dimensions and number of nonzeroes per rank.

LIST OF FIGURES

Figure 1.1: Memory hierarchy in deep memory architectures.
Figure 2.1: Flowchart of the parallelized Sky3D code. Parts in the 1d distribution are marked in green, parts in full 2d distribution are marked in blue, parts in divided 2d distribution are marked in yellow and collective non-parallelized parts are marked in red. ψα^(n) denotes the orthonormal and diagonal wave function at step n with index α; |ϕα^(n+1)⟩ = ϕα^(n+1) denotes the non-diagonal and non-orthonormal wave function at step n+1. ĥ is the one-body Hamiltonian.
Figure 2.2: 2D block cyclic partitioning example with 14 wave functions using a 3×2 processor topology. Row and column block sizes are set as NB=MB=2. The blue shaded area marks the lower triangular part.
Figure 2.3: Scalability of MPI-only version of Sky3D for the 3000 neutron and 3000 proton system using the L = 32 fm grid.
Figure 2.4: Scalability of MPI-only version of Sky3D for the 3000 neutron and 3000 proton system using the L = 48 fm grid.
Figure 2.5: Scalability of MPI/OpenMP parallel version of Sky3D for the 3000 neutron and 3000 proton system using the L = 32 fm grid.
Figure 2.6: Comparison of the execution times for the MPI-only and MPI/OpenMP parallel versions of Sky3D for the 3000 neutron and 3000 proton system using the L = 32 fm grid.
Figure 2.7: Scalability of MPI/OpenMP parallel version of Sky3D for the 3000 neutron and 3000 proton system using the L = 48 fm grid.
Figure 2.8: Comparison of the execution times for the MPI-only and MPI/OpenMP parallel versions of Sky3D for the 3000 neutron and 3000 proton system using the L = 32 fm grid.
Figure 2.9: Times per iteration for neutron and proton processor groups, illustrating the load balance for the 3000 neutrons and 3000 protons (a), 4000 neutrons and 2000 protons (b), and 5000 neutrons and 1000 protons (c) systems using the L = 48 fm grid. Due to memory constraints the latter two cases cannot be calculated using 32 CPU.
Figure 2.10: A detailed breakdown of per iteration times for neutron and proton processor groups, illustrating the load balance for the 5000 neutrons and 1000 protons system.
Figure 3.1: The dimension and the number of non-zero matrix elements of the various nuclear Hamiltonian matrices as a function of the truncation parameter Nmax. While the bottom panel is specific to 16O, it is also representative of a wider set of nuclei [2, 3].
Figure 3.2: Overview of the SpMM operation with P = 4 threads. The operation proceeds by performing all P β × β local SpMM operations Y=AX+Y one blocked row at a time. The operation AᵀX is realized by permuting the blocking (β × Pβ blocks).
Figure 3.3: Performance in GFlop/s for vector block inner product, VᵀW, and vector block scaling, VX, kernels using Intel MKL and Cray libsci libraries on a Cray XC30 system (Edison @ NERSC).
Figure 3.4: Sparsity structure of the local Nm6 matrix at process 1 in an MFDn run with 15 processes. A block size of β = 6000 is used. Each dot corresponds to a block with nonzero matrix elements in it. Darker colors indicate denser nonzero blocks.
Figure 3.5: Optimization benefits on Edison using the Nm6 matrix for SpMM (top) and SpMMᵀ (bottom) as a function of m (the number of vectors).
Figure 3.6: Performance benefit on the combined SpMM and SpMMᵀ operation from tuning the value of β for the Nm8 matrix.
Figure 3.7: SpMM and SpMMᵀ combined performance results on Edison using the Nm6, Nm7 and Nm8 matrices (from top to bottom) as a function of m (the number of vectors). We identify 3 Rooflines (one per level of memory) as per our extended roofline model for CSB.
Figure 3.8: Performance in GFlop/s for inner product VᵀW (top), and linear combination VC (bottom) operations, using Intel MKL and Cray LibSci libraries, as well as our custom implementations on Edison. Tall-skinny matrix sizes are l × m, where l = 1 M.
Figure 3.9: Comparison and detailed breakdown of the time-to-solution using the new LOBPCG implementation vs. the existing Lanczos/FO solver. Nm6, Nm7 and Nm8 testcases executed on 15, 66, and 231 MPI ranks (6 OpenMP threads per rank), respectively, on Edison.
Figure 3.10: SpMM and SpMMᵀ combined performance results on Babbage using the Nm6, Nm7 and Nm8 matrices (from top to bottom) as a function of m (the number of vectors).
Figure 3.11: Performance of VᵀW (top) and VC (bottom) kernels, using the MKL library, as well as our custom implementations on an Intel Xeon Phi processor. Local vector blocks are l × m, where l = 1 M.
Figure 4.1: Schematic overview of DeepSparse.
Figure 4.2: Overview of input/output matrices partitioning of the task-based matrix multiplication kernel.
Figure 4.3: Overview of matrices partitioning of the task-based SpMM kernel.
Figure 4.4: Overview of matrices partitioning of the task-based inner product kernel.
Figure 4.5: Task graph for the pseudocode in listing 4.2.
Figure 4.6: A sample task graph for the LOBPCG algorithm using a small sparse matrix.
Figure 4.7: Comparison of L1, L2, LLC misses and execution times between DeepSparse, libcsb and libcsr for the LOBPCG solver.
Figure 4.8: LOBPCG single iteration execution flow graph of dielFilterV3real.
Figure 4.9: Comparison of L1, L2, LLC misses and execution times between DeepSparse, libcsb and libcsr for the Lanczos solver.
Figure 4.10: Comparison of execution time for different compilers between DeepSparse, libcsb and libcsr for the Lanczos Algorithm. (Blue/Left: GNU, Red/Middle: Intel, Green/Right: Clang compiler.)
Figure 4.11: Cache miss comparison between compilers for HV15R.
Figure 5.1: Matching example.
Figure 5.2: Simple LOBPCG DAG with 3×3 blocks.
Figure 5.3: LOBPCG graph after pre-processing, where every 3 nodes in the same topological level are coarsened into one single node and the edges are kept intact.
Figure 5.4: Coarsened graph partition assignment example; blue is part 0 and green is part 1.
Figure 5.5: Sparse matrix block access patterns in different parts with matching.
Figure 5.6: Sparse matrix block access patterns in different parts with pre-processing before matching.
Figure 5.7: A small part of the original DAG that is generated. The matched edges are shown in green color.
Figure 5.8: Step 2 of the matching.
Figure 5.9: Step 3 of the matching.
Figure 5.10: Step 4 of the matching.
Figure 5.11: Step 5 of the matching.
Figure 5.12: Blocking of CSB blocks to coarsen multiple nodes into one node.
Figure 5.13: Sparse matrix block access patterns in different parts.
Figure 5.14: Sparse matrix block access patterns in different parts.
Figure 5.15: A cycle is created using the GRLG approach.
Figure 5.16: Partitioning of the Z5 matrix with 64K block size.
Figure 5.17: Hierarchical blocking scheme.
Figure 5.18: Partitioning of the Z5 matrix with 1K block size.
Figure 5.19: Memory management in partitions.
Figure 5.20: Performance comparisons between different cache levels and execution time of the Nm7 matrix in Haswell nodes.
Figure 5.21: Performance comparisons between different cache levels and execution time of the Nm7 matrix in KNL nodes.
Figure 6.1: Processor topology with 15 processors numbered from 0-15, distributed in an efficient manner where each row and column has the same number of processors.
Figure 6.2: Processor distribution in MFDn: MPI_COMM_WORLD (top) and custom column (left) and row (right) communicator groups.
Figure 6.3: Communication pattern for distributed SpMV (Lanczos) or SpMM (LOBPCG) during the iterative solver. In our actual implementation, we have replaced the initial Gather + Broadcast along the columns by a single call to AllGatherV, and similarly, the final Reduce + Scatter along the columns by a single call to ReduceScatter. Also, the Broadcast and Reduce along the rows is overlapped with the local SpMV and SpMVᵀ. (Figure adapted from Ref. [4].)
Figure 6.4: Random selection of the ranks.
Figure 6.5: Illustration of a general Dragonfly topology with a single group shown on the left and the optical all-to-all connections of each group in a system shown on the right. Per the original definition of a dragonfly [5], the design within a group is not strictly specified.
Figure 6.6: Total Execution time per iteration in Cori-KNL for real application using default MPI_SUM, custom OMP_SUM and SST Simulation.
Figure 6.7: Ratio of different MPI communication routines between SST simulation and communication skeleton runs with MPI_SUM, i.e. SST_time / Real_run_time.
Figure 6.8: Communication time breakdown for a real run and SST simulation for dimension = 1,343,536,728.
Figure 6.9: Timing comparison of the simulation of the MFDn motif in Cori-KNL and the soon-to-be-installed Perlmutter machine with our predicted parameters.
Figure 7.1: LOBPCG two iteration execution flow graph of nlpkkt240 matrix. SpMM is represented using orange color, XY operation is Maroon and XᵀY is using green color palette.
Figure 7.2: Example code for a blocking mpi call as an OpenMP task.
Figure 7.3: Example code for a non blocking mpi call.
Figure 7.4: Matrix and Vector distribution in MPI ranks and efficient processor topology.
Figure 7.5: Blocked Communication along the processes in the same row communicator.
Figure 7.6: Hierarchical blocked communication.
Figure 7.7: Blocked broadcast code.
Figure 7.8: Blocked SpMM code.
Figure 7.9: Blocked Reduction code.
Figure 7.10: Custom Reduction depending on the local vector distribution.
Figure 7.11: Comparison of execution time per iteration in Haswell 16 threads between loop parallel, task parallel and task parallel with custom reduce-scatter.
Figure 7.12: Comparison of execution time per iteration in Haswell 32 threads between loop parallel, task parallel and task parallel with custom reduce-scatter.
Figure 7.13: Comparison of execution time per iteration in knl 32 threads between loop parallel, task parallel and task parallel with custom reduce-scatter.
Figure 7.14: Comparison of execution time per iteration in knl 64 threads between loop parallel, task parallel and task parallel with custom reduce-scatter.
Figure 7.15: Breakdown of communication and computation operations with 6 mpi ranks in Haswell nodes.
Figure 7.16: Breakdown of communication and computation operations with 6 mpi ranks in KNL nodes.
Figure 7.17: Breakdown of communication and computation operations with 45 mpi ranks in haswell nodes.
Figure 7.18: Breakdown of communication and computation operations with 45 mpi ranks in knl nodes.
Figure 7.19: Change in SpMM dimension and local dimensions with the increase of mpi ranks.
Figure 7.20: Ratio of LOBPCG compared to SpMM in Haswell nodes with 16 threads.
Figure 7.21: Ratio of LOBPCG compared to SpMM in Haswell nodes with 32 threads.
Figure 7.22: Ratio of LOBPCG compared to SpMM in knl nodes with 32 threads.
Figure 7.23: Ratio of LOBPCG compared to SpMM in knl nodes with 64 threads.

LIST OF ALGORITHMS

Algorithm 1: SpMM kernel
Algorithm 2: Lanczos Algorithm in Exact Arithmetic
Algorithm 3: LOBPCG Algorithm (for simplicity, without a preconditioner)
Algorithm 4: Ember pseudocode for the MFDn communication motif

Chapter 1

INTRODUCTION AND MOTIVATION

1.1 Background And Related Work

Eigenvalue calculation is one of the most important applications of numerical linear algebra. In almost all kinds of scientific research, whether in nuclear astrophysics or molecular dynamics, applied machine learning or the life sciences, computing eigenvalues is often one of the primary theoretical requirements. With the emergence of Artificial Intelligence and Machine Learning in the current computing world, eigenvalue calculations play a big part as well. The naive way to find the eigenvalues of a matrix is to find all the roots of its characteristic polynomial. But in large scale analyses, where the matrix dimensions are in the thousands or millions, finding the eigenvalues in this way is impractical. Hence a number of iterative algorithms have been developed over the years. These methods work by repeatedly refining approximations to the eigenvectors or eigenvalues, and can be terminated whenever the approximations reach a suitable degree of accuracy. Iterative methods form the basis of much of modern day eigenvalue computation.

Since a graph represented in an adjacency matrix format is essentially a sparse matrix, graph algorithms can also be expressed in the language of sparse linear algebra. In fact, recent studies have shown that graph algorithms expressed in this way achieve significantly better performance than alternative abstractions [6]. To simplify the presentation, we use a unified nomenclature for both, and use the term sparse matrix to also refer to the network structure in a graph. The term nonzeros will refer to the nonzero matrix elements in sparse matrices or edge meta-data (e.g., weights) in graphs. Finally, we overload the term vector in sparse matrix computations to also refer to the array of vertex properties in graph computations.

Arguably the most fundamental operation in sparse linear algebra is the multiplication of a sparse matrix with a vector (SpMV), as it forms the main computational kernel for several applications (e.g., the solution of partial differential equations (PDE) [7] and the Schrodinger equation [8] in scientific computing, spectral clustering [9] and dimensionality reduction [10] in machine learning, and the PageRank algorithm [11] in graph analytics). The Roofline model by Williams et al. [12] suggests that the performance of the SpMV kernel is ultimately bounded by memory bandwidth. Consequently, performance optimizations to increase cache utilization and reduce data access latencies for SpMV have drawn significant interest [13, 14, 15, 16, 17, 18, 19, 20, 21], which is a rather incomplete list of related work on this topic.
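To make the memory-bound character of SpMV concrete, the sketch below shows a plain CSR-based SpMV kernel in C++ with OpenMP. It is an illustrative example only; the data structure and names are generic and are not taken from any of the libraries cited above. Each nonzero contributes two floating point operations while at least twelve bytes (an 8-byte value plus a 4-byte column index) must be streamed from memory, so the arithmetic intensity is roughly 1/6 flop per byte, far below the machine balance of modern processors.

#include <cstdint>
#include <vector>

// Generic CSR storage: row pointers, column indices and values.
struct CsrMatrix {
  int64_t nrows = 0;
  std::vector<int64_t> rowptr;   // size nrows + 1
  std::vector<int32_t> colidx;   // size nnz
  std::vector<double>  val;      // size nnz
};

// y = A * x. Two flops per nonzero, but every nonzero (value plus index) is
// streamed from memory and the accesses into x are irregular, which is why
// the kernel is bounded by memory bandwidth rather than by compute.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
  #pragma omp parallel for schedule(dynamic, 64)
  for (int64_t i = 0; i < A.nrows; ++i) {
    double sum = 0.0;
    for (int64_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
      sum += A.val[k] * x[A.colidx[k]];
    y[i] = sum;
  }
}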
A closely related kernel is the multiplication of a sparse matrix with multiple vectors (SpMM), which constitutes the main operation in block solvers, e.g., the block Krylov subspace methods and the block Jacobi-Davidson method. SpMM has a much higher arithmetic intensity than SpMV and can efficiently leverage wide vector execution units. As a result, SpMM-based solvers have recently drawn significant interest in scientific computing [22, 23, 24, 25, 26]. SpMM also finds applications naturally in machine learning, where several features (or eigenvectors) of sparse matrices are needed [10, 9]. Although SpMM has a significantly higher arithmetic intensity than SpMV, the extended Roofline model that we recently proposed suggests that cache bandwidth, rather than memory bandwidth, can still be an important performance limiting factor for SpMM [22].

Multiplication of sparse matrices (SpGEMM) and the sparse matrix times sparse vector (SpMSV) operation also find applications in important problems. SpGEMM is the main kernel in the algebraic multigrid method [27] and the Markov Clustering algorithm, while SpMSV is the main building block for breadth-first search, bipartite graph matching, and maximal independent set algorithms.

1.2 Emergence Of Deep Memory Hierarchies

Given the widening gap between memory system and processor performance [28], irregular data access patterns and low arithmetic intensities of sparse matrix computations have effectively made them "memory-bound" computations. Furthermore, the downward trend in memory space and bandwidth per core in high performance computing (HPC) systems [29] has paved the way for a deepening memory hierarchy. For example, many-core processors (i.e., GPUs and Xeon Phis) have their own high-bandwidth (but limited size) device memories (HBM). NVRAM storage has recently emerged to alleviate issues (such as cost, capacity, energy consumption and resiliency) associated with the DRAM technology. Consequently, flash memory and 3D XPoint memory have already found wide adoption as a storage-class cache between DRAM and disk systems, and they are being adopted as memory-class storage complementing DRAM in modern HPC systems.

In Fig. 1.1, we give an abstract view of the assumed underlying memory hierarchy, along with some hardware specifications based on current technology. Our target architectures are many-core processors such as GPUs and Xeon Phis, which are essentially the cornerstones of big data analytics and scientific computing. While exact specifications and the number of layers change as architectures evolve, the underlying principle of the memory hierarchy stays the same: going further away from the processor, memory capacity increases at the expense of increased latency and reduced bandwidth. Thus, minimizing data movement across layers of the memory and overlapping data movement with computations are keys to achieving high performance in sparse matrix computations.

Figure 1.1: Memory hierarchy in deep memory architectures

While we mainly focus on parallelism and performance on a single node, the developed techniques and software will be complementary to those aimed at enabling distributed memory parallelism for sparse matrix computations and graphs (such as ParMETIS, XtraPuLP, etc.), or to manual partitionings specified by a user. Hence, the footprint of applications that can benefit from this project will be significant.

In data analytics and scientific computing, total available memory is often a limiting factor.
Hence, data management is important, due to both the size of the data involved and the complexity of the program flow. As an example, in Algorithm 3 we give the pseudocode for the locally optimal block preconditioned conjugate gradient algorithm (LOBPCG), a widely used block eigensolver [30]. The SpMM of the sparse matrix H with the block of vectors, despite being an expensive step, is only one part of the computation in line 4. In terms of memory, while the H matrix takes up considerable space, when a large number of eigenpairs are needed (e.g., dimensionality reduction, spectral clustering or quantum many-body problems), the memory needed for the block of vectors can be comparable to or even greater than that of H. In addition, other block vectors (residual R, preconditioned residual W, previous direction P), block vectors from the previous iteration and the preconditioning matrix T (if available) must be stored and accessed at each iteration. Clearly, orchestrating the data movement in a deep memory hierarchy to attain an efficient implementation can become a daunting task for a domain scientist.

In fact, considering the complete range of solvers in which sparse matrix computations arise, the LOBPCG algorithm is a relatively simple case. For instance, when performing Singular Value Decompositions (SVD), each node would also need to perform operations pertaining to the transpose Mᵀ (or conjugate transpose) of the sparse partition that it owns [38], which must be applied to the result of the application of M to the source vector. The interior eigenvalue problem is another example illustrating complexities in a real application. An effective way to accelerate convergence to eigenpairs in a desired range is to build polynomial filters [65] (e.g., [a₃M³ + a₂M² + a₁M]x is a 3rd order matrix polynomial), which require several applications of a sparse matrix to the source vector in each iteration. Finally, an SpGEMM kernel or a sparse LU factorization represents a significantly more complex computation, as the computations associated with nonzeros form a complex dependency graph in these cases.

We propose a comprehensive framework that can efficiently handle the complex and irregular (due to sparsity) task dependencies arising in a wide variety of applications. In addition to conventional applications involving static sparse matrices, we envision our framework to be generic enough to support incremental algorithms used to tackle streaming (or online) problems [34, 69, 93]. Such applications are common in data analytics, e.g., to analyze dynamic web graphs or social networks, or to incrementally incorporate user feedback.

For our work, we first target an application called Sky3D, which works on dense matrices. Then we shift our interest to sparse matrix iterative eigensolvers. We start with the Sky3D application,
We wanted to extend the framework for a distributed solver like MFDn. Hence we first studied the communication behavior of MFDn using a simulator called SST. Looking at the observations, we introduced communication tasks in our framework which overlaps the communication and computations. Now I will present some motivations, background studies for all the projects. 1.3 Optimizing Sky3D Compact objects in space such as neutron stars are great test laboratories for nuclear physics as they contain all kinds of exotic nuclear matter which are not accessible in experiments on earth [31, 32]. Among interesting kinds of astromaterial is the so called nuclear "pasta" phase [33, 34] which consists of neutrons and protons embedded in an electron gas. The naming arises from the shapes, e.g. rods and slabs, which resemble the shapes of the Italian pasta (spaghetti and lasagna). Nuclear pasta is expected in the inner crust of neutron stars in a layer of about 100 m at a radius of about 10 km, at sub-nuclear densities. Since the typical length scale of a neutron or a proton is on the order of 1 fm, it is impossible to simulate the entire system. The usual 6 strategy is to simulate a small part of the system in a finite box with periodic boundary conditions. While it is feasible to perform large simulations with semi-classical methods such as Molecular Dynamics (MD) [35, 36, 37, 38, 39, 40, 41] involving 50,000 or even more particles or the Thomas-Fermi approximation [42, 43, 44, 45], quantum mechanical (QM) methods which can yield high fidelity results have been limited to about 1000 nucleons due to their immense computational costs [46, 47, 48, 49, 50, 51, 52, 53]. The side effects of using small boxes in QM methods are twofold: First, the finite size of the box causes finite-volume effects, which have an observable influence on the results of a calculation. Those effects have been studied and can be suppressed by introducing the twist-averaged boundary conditions [51]. More importantly though, finite boxes limit the possible resulting shapes of the nuclear pasta because the unit cell of certain shapes might be larger than the maximum box size. For instance, in MD simulations [54], slabs with certain defects have been discovered. Those have not been observed in QM simulations because they only manifest themselves in large boxes. To observe such defects, we estimate that it is necessary to increase the number of simulated particles (and the corresponding simulation box volume) by about an order of magnitude. In this work, we focus on the microscopic nuclear density functional theory (DFT) ap- proach to study nuclear pasta formations. The nuclear DFT approach is a particularly good choice for nuclear pasta. The most attractive property is the reliability of its answers over the whole nuclear chart [55, 56], and yet it is computationally feasible for applications involv- ing the heaviest nuclei and even nuclear pasta matter, because the interaction is expressed through one-body densities and the explicit n-body interactions do not have to be evaluated. In contrast to finite nuclei that are usually calculated employing a harmonic oscillator finite-range basis [57, 58] using mostly complete diagonalization of the basis functions to solve 7 the self-consistent equations, nuclear pasta matter calculations have to be performed in a suitable basis with an infinite range. 
We use the DFT code Sky3D [59], which represents all functions on an equidistant grid and employs the damped gradient iteration steps, where fast Fourier transforms (FFTs) are used for derivatives, to reach a self-consistent solution. This code is relatively fast compared to its alternatives and incorporates all features necessary to perform DFT calculations with modern functionals, such as the Skyrme functionals (as used here). Since Sky3D can be used to study static and time-dependent systems in 3d without any symmetry restrictions, it can be applied to a wide range of problems. In the static domain it has been used to describe a highly excited torus configuration of 40 Ca [60] and also finite nuclei in a strong magnetic field as present in neutron stars [61]. In the time-dependent context, it was used for calculations on nuclear giant resonances [62, 63], and on the spin excitation in nuclear reactions [64]. The Wigner function, a 6 dimensional distribution function, and numerical conservation properties in the time-dependent domain have also been studied using Sky3D [65, 66]. For the case of time-dependent problems, the Sky3D code has already been parallelized using MPI. The time-dependent iterations are simpler to parallelize, because the treatment of the single particles are independent of each other and can be distributed among the nodes. Only the mean field has to be communicated among the computational nodes. On the other hand, accurate computation of nuclear ground states, which we are interested in, requires a careful problem decomposition strategy and organization of the communication and computation operations as discussed below. However, only a shared memory parallel version of Sky3D (using OpenMP) exists to this date. In this work, we present algorithms and techniques to achieve scalable distributed memory parallelism in Sky3D using MPI. 8 1.4 Optimization Of Blocked Eigensolver For Sparse Matrix We found out that for dense matrices there are some very optimized numerical libraries for distributed memory. Being a relatively well studied area, these libraries have been serving the scientific community for quite a while. In our work we observed strong scaling using the highly optimized ScLAPACK library. But we also noticed that for sparse matrices there are a lot of room for improvement. The choice of numerical algorithms and how efficiently they can be implemented on high performance computer (HPC) systems critically affect the time-to-solution for large-scale scientific applications. Several new numerical techniques or adaptations of existing ones that can better leverage the massive parallelism available on modern systems have been developed over the years. Although these algorithms may have slower convergence rates, their high degree of parallelism may lead to better time-to-solution on modern hardware [24]. In the next work, we consider the solution of the quantum many- body problem using the configuration interaction (CI) formulation. We present algorithms and techniques to significantly speed up eigenvalue computations in CI by using a block eigensolver and optimizing the key computational kernels involved. The quantum many-body problem transcends several areas of physics and chemistry. The CI method enables computing the wave functions associated with discrete energy levels of these many-body systems with high accuracy. 
Since only a small number of low energy states are typically needed to compute the physical observables of interest, a partial diagonalization of the large CI many-body Hamiltonian is sufficient. More formally, we are interested in finding a small number of extreme eigenpairs of a 9 large, sparse, symmetric matrix: xi = λxi , i = 1, . . . , m, m  N. (1.1) Iterative methods such as the Lanczos and Jacobi–Davidson [67] algorithms, as well as their variants [68, 69, 70], can be used for this purpose. The key kernels for these meth- ods can be crudely summarized as (repeated) sparse matrix–vector multiplications (SpMV) and orthonormalization of vectors (level-1 BLAS). As alternatives, block versions of these algorithms have been developed [71, 72, 73] which improves the arithmetic intensity of com- putations at the cost of a reduced convergence rate and increased total number of matrix– vector operations [74]. In block methods, SpMV becomes a sparse matrix multiple vector multiplication (SpMM) and vector operations become level-3 BLAS operations. Performance of SpMV is ultimately bounded by memory bandwidth [75]. The widening gap between processor performance and memory bandwidth significantly limits the achiev- able performance in several important applications. On the other hand, in SpMM, one can make use of the increased data locality in the vector block and attain much higher FLOP rates on modern architectures. Gropp et al. was the first to exploit this idea by using mul- tiple right hand sides for SpMV in a computational fluid dynamics application [23]. SpMM is one of the core operations supported by the auto-tuned sequential sparse matrix library OSKI [20]. OSKI’s shared memory parallel successor, pOSKI, currently does not support SpMM [76]. More recently, Liu et al. [24] investigated strategies to improve the performance of SpMM 1 using SIMD (AVX/SSE) instructions on modern multicore CPUs. Their driv- 1 Liu et al. actually uses the name GSpMV for “generalized” SpMV. We refrain from doing so because the same name has been used in conflicting contexts such as SpMV for graph algorithms where the scalar operations can be arbitrarily overloaded. 10 ing application is the motion simulation of biological macromolecules in solvent using the Stokesian dynamics method. Röhrig-Zöllner et al. [25] discuss performance optimization techniques for the block Jacobi–Davidson method to compute a few eigenpairs of large-scale sparse matrices, and report reduced time-to-solution using block methods instead of single vector counterparts, particularly for problems in quantum mechanics and PDEs. Our work differs from previous efforts substantially, in part due to the immense size of the sparse matrices involved. We exploit symmetry to reduce the overall memory footprint, and offer an efficient solution to perform SpMM on a sparse matrix and its transpose (SpMMT ) with roughly the same performance [22]. This is achieved through a novel thread parallel SpMM implementation, CSB/OpenMP, which is based on the compressed sparse block (CSB) framework [77] (Sect. 3.3). We demonstrate the efficiency of CSB/OpenMP on a series of CI matrices where we obtain 3–4× speedup over the commonly used compressed sparse row (CSR) format. To estimate the performance characteristics and better understand the bottlenecks of the SpMM kernel, we propose an extended Roofline model to account for cache bandwidth limitations (Sect. 3.3). 
In this work, we extend a previous work (presented in [22]) by considering an end-to-end optimization of a block eigensolver implementation. As will be discussed in Sect. 3.7, the performance of the tall-skinny matrix operations in block eigensolvers is critical for an excellent overall performance. We observe that the implementations of these level-3 BLAS operations in optimized math libraries perform significantly below expectations for typical matrix sizes encountered in block eigensolvers. We propose a highly efficient thread parallel implementation for inner product and linear combination operations that involve tall-skinny matrices and analyze the resulting performance.

To demonstrate the merits of the proposed techniques, we incorporate the CSB/OpenMP implementation of SpMM and the optimized tall-skinny matrix kernels into a LOBPCG [73] based solver in MFDn, an advanced nuclear CI code [2, 78, 3]. We demonstrate through experiments with real-world problems that the resulting block eigensolver can outperform the widely used Lanczos algorithm (based on single vector iterations) on modern multicore architectures (Sect. 3.8.8). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor to assess the feasibility of our implementations for future architectures.

While we focus on nuclear CI computations, the impact of optimizing the performance of key kernels in block iterative solvers is broader. For example, spectral clustering, one of the most promising clustering techniques, uses the eigenvectors associated with the smallest eigenvalues of the Laplacian of the data similarity matrix to cluster vertices in large symmetric graphs [79, 80]. Due to the size of the graphs, it is desirable to exploit the symmetry, and for a k-way clustering problem, k eigenvectors are needed, where typically 10 ≤ k ≤ 100, an ideal range for block eigensolvers. Block methods are also used in solving large-scale sparse singular value decomposition (SVD) problems [81], with the most popular methods being subspace iteration and block Lanczos. SVDs are critical for dimensionality reduction in applications like latent semantic indexing [82]. In SVD, singular values are obtained by solving an associated symmetric eigenproblem that requires subsequent SpMM and SpMMᵀ computations in each iteration [83]. Thus, our techniques are expected to have a positive impact on the adoption of block solvers in closely related applications.
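To make the SpMM/SpMMᵀ requirement in SVD computations explicit, consider one common formulation (shown here purely for illustration; it is not necessarily the exact method used in [83]): the right singular vectors of a sparse matrix A are eigenvectors of the symmetric matrix AᵀA,

    A = UΣVᵀ,    (AᵀA) vᵢ = σᵢ² vᵢ,

and a block eigensolver applied to AᵀA never forms AᵀA explicitly. Instead, each iteration evaluates Y = AX (an SpMM) followed by AᵀY (an SpMMᵀ) on the current block of vectors X, which is exactly the pair of kernels optimized in this work.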
1.5 Introducing the DeepSparse Framework

Sparse matrix computations, in the form of solvers for systems of equations, eigenvalue problems or matrix factorizations, constitute the main kernel in fields as diverse as computational fluid dynamics (CFD), quantum many-body problems, machine learning and graph analytics. The scale of problems in these scientific applications typically necessitates execution on massively parallel architectures. Moreover, sparse matrices come in very different forms and with different properties depending on the application area. However, due to the irregular data access patterns and low arithmetic intensities of sparse matrix computations, achieving high performance and scalability is very difficult. These challenges are further exacerbated by the increasingly complex deep memory hierarchies of modern architectures, as they typically integrate several layers of memory storage. While exact specifications and the number of layers change as architectures evolve, the underlying principle of the memory hierarchy stays the same: going farther away from the processor, memory capacity increases at the expense of increased latency and reduced bandwidth. As such, minimizing data movement across layers of the memory and overlapping data movement with computations are keys to achieving high performance in sparse matrix computations.

Unlike its dense matrix analogue, the state of the art for sparse matrix computations is lagging far behind. The widening gap between memory system and processor performance, irregular data access patterns and low arithmetic intensities of sparse matrix computations have effectively made them "memory-bound" computations. Furthermore, the downward trend in memory space and bandwidth per core in high performance computing (HPC) systems [29] has paved the way for a deepening memory hierarchy. Thus, there is a dire need for new approaches at both the algorithmic and runtime system levels for sparse matrix computations.

In this work, we propose a novel sparse linear algebra framework, named DeepSparse, which aims to accelerate sparse solver codes on modern architectures with deep memory hierarchies. Our proposed framework differs from existing work in two ways. First, we propose a holistic approach that targets all computational steps in a sparse solver rather than narrowing the problem down to a single kernel, e.g., sparse matrix vector multiplication (SpMV) or sparse matrix multiple vector multiplication (SpMM). Second, we adopt a fully integrated task-parallel approach while utilizing commonly used sparse matrix storage schemes.

In a nutshell, DeepSparse provides a GraphBLAS plus BLAS/LAPACK-like frontend for domain scientists to express their algorithms without having to worry about architectural details (e.g., the memory hierarchy) and parallelization considerations (i.e., determining the individual tasks and their scheduling) [84, 85, 86]. DeepSparse automatically generates and expresses the entire computation as a task dependency graph (TDG), where each node corresponds to a specific part of a computational kernel and edges denote control and data dependencies between computational tasks. We chose to build DeepSparse on top of OpenMP [87] because OpenMP is the most commonly used shared memory programming model, but more importantly because it supports a task-based data-flow programming abstraction. As such, DeepSparse relies on OpenMP for the parallel execution of the TDG.

We anticipate two main advantages of DeepSparse over a conventional bulk synchronous parallel (BSP) approach where each kernel relies on loop parallelization and is optimized independently. First, DeepSparse would be able to expose better parallelism as it creates a global task graph for the entire sparse solver code. Second, since the OpenMP runtime system has explicit knowledge of the TDG, it may be possible to leverage a pipelined execution of tasks that have data dependencies, thereby leading to better utilization of the hardware cache.

1.6 Exploring Custom Schedule For Tasks Using Graph Partition

In the DeepSparse framework we generate a global task graph. In our executor, we use appropriate OpenMP task dependencies with proper memory offsets and sizes to make the task parallel implementation coherent across the entire iteration. OpenMP inspects the in/out dependencies of the tasks and internally builds a directed acyclic graph (DAG) after resolving those dependencies, as sketched below.
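The snippet below is a minimal sketch, in C++ with OpenMP, of how such in/out dependencies can be declared; the kernel and buffer names are placeholders and are not part of DeepSparse's actual API. The runtime derives the task DAG from the depend clauses and schedules each task once its inputs are available.

#include <omp.h>

// Placeholder kernels, for illustration only (not DeepSparse's interface).
void spmm_block(const double* A_blk, const double* X, double* Y_blk);
void inner_product_block(const double* X, const double* Y_blk, double* G);

// One fragment of a solver iteration expressed as dependent OpenMP tasks.
void one_solver_step_sketch(double* A_blk, double* X, double* Y_blk, double* G) {
  #pragma omp parallel
  #pragma omp single
  {
    // Task 1 produces the tile Y_blk = A_blk * X.
    #pragma omp task depend(out: Y_blk[0])
    spmm_block(A_blk, X, Y_blk);

    // Task 2 consumes Y_blk and produces G = X^T * Y_blk; the runtime
    // orders it after Task 1 because of the matching depend clauses.
    #pragma omp task depend(in: Y_blk[0]) depend(out: G[0])
    inner_product_block(X, Y_blk, G);
  } // tasks are guaranteed to complete at the barrier closing the single region
}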
When the dependencies of a task are resolved, that task can be pulled from the task pool by the OpenMP runtime. Although we observed that OpenMP does a good job with memory utilization across all levels of the memory hierarchy, this behavior remains beyond our control: OpenMP builds the DAG and resolves the dependencies itself. Whenever the data dependencies of a task are satisfied and it does not depend on any other task for its execution, it can be pulled and executed immediately. However, this is not necessarily optimal from a memory usage perspective. A task that has no relation to the tasks currently active can be executed as soon as a thread becomes free, regardless of its memory inputs and outputs. As a result, a task whose inputs already reside in the lower levels of the memory hierarchy, and which would therefore improve cache reuse, may be bypassed; the probability of cache misses increases with this kind of scheduling. This motivated us to develop novel graph-partition-based schedulers that use the global data-flow graphs generated by the PCU to minimize data movement in a deep memory hierarchy.

Graph partitioners have been studied extensively, but existing approaches do not meet our needs as they typically handle DAGs by converting them to undirected graphs. In our case, however, the directed nature of the task graph must be respected. In this regard, the acyclic partitioning heuristics for DAGs recently introduced by Dr. Catalyurek's group provide a great starting point. Our sparse solver DAGs contain a fair number of vertices with high "fan-in/fan-out" degrees due to operations such as SpMVs, SpMMs, inner products and vector reductions. There are two immediate issues that call for an alternative scheme. First, the presence of such vertices requires an excessive number of coarsening iterations (which rely on the "edge matching" technique), and slows down partitioning to the extent of making it unusable for large graphs. Second, during refinement, the high-degree fan-in/fan-out vertices effectively yield 1D partitionings, because they drag their incoming/outgoing vertices into the same partition as themselves.

We wanted to develop a novel scheduler through the following tasks. We will adopt problem-specific coarsening/refinement techniques where vertex matchings are identified by recursively doubling the CSB block (in 2D) or block-row (in 1D) dimensions. This allows the original DAG structure to be preserved at the coarsest level, while enabling the partitioning of extremely large DAGs. Like other graph partitioners, the objective function for the acyclic heuristics of [44] is minimization of the edge cut (defined as the sum of all edges crossing partitions). For our purposes, however, each partition is an "execution phase" denoting the set of tasks that must be brought together into the higher level of memory. In fact, maximizing the edge cut between successive execution phases would be desirable in this context, because cut edges correspond to the sharing of input data between successive stages, or to the reuse of output from one stage as input in the subsequent stage. In our heuristics, the objective function will take into account the ordering among partitions (which is not a consideration at all for regular graph partitioners) and try to maximize the (cumulative) size of the input/output data overlaps between successive phases, so that the overall data movement is reduced; see the sketch after this paragraph.
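A minimal sketch of the kind of objective just described, assuming each phase is summarized by sorted, duplicate-free lists of the data-block IDs it reads and writes; the data structures and function names are illustrative only, not part of the actual scheduler.

```c
#include <stddef.h>

/* Illustrative phase summary. In a real scheduler these lists would be
 * derived from the task graph produced by the partitioner. */
typedef struct {
    const int    *in_ids;   /* sorted IDs of data blocks read by the phase    */
    const size_t *in_sz;    /* size in bytes of each input block              */
    int           n_in;
    const int    *out_ids;  /* sorted IDs of data blocks written by the phase */
    const size_t *out_sz;
    int           n_out;
} Phase;

/* Bytes shared between two sorted ID lists (merge-style intersection). */
static size_t overlap_bytes(const int *a, const size_t *asz, int na,
                            const int *b, int nb) {
    size_t bytes = 0;
    for (int i = 0, j = 0; i < na && j < nb; ) {
        if (a[i] == b[j])     { bytes += asz[i]; i++; j++; }
        else if (a[i] < b[j]) i++;
        else                  j++;
    }
    return bytes;
}

/* Cumulative objective: data produced or consumed by phase p that is reused
 * as input by phase p+1, summed over all consecutive phase pairs. */
size_t cumulative_phase_overlap(const Phase *ph, int nphases) {
    size_t total = 0;
    for (int p = 0; p + 1 < nphases; p++) {
        total += overlap_bytes(ph[p].out_ids, ph[p].out_sz, ph[p].n_out,
                               ph[p + 1].in_ids, ph[p + 1].n_in);
        total += overlap_bytes(ph[p].in_ids,  ph[p].in_sz,  ph[p].n_in,
                               ph[p + 1].in_ids, ph[p + 1].n_in);
    }
    return total;
}
```

A partitioner guided by this objective would prefer phase orderings in which consecutive phases touch the same blocks, so that data brought into fast memory for one phase is reused by the next.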
Constraints in graph partitioners are generally geared towards ensuring load balance among partitions. In our case, however, the constraint is the storage limit of the fast memory, which the Scheduler imposes by estimating the maximum "active memory size" during the execution of a phase. This is different from, but in the worst case equal to, the sum of edge weights in a partition. We anticipate that such memory constraints, in conjunction with the objective functions described above, will enable our Scheduler to discover better partitionings than those obtainable with existing techniques (for instance, 2D-shaped partitionings are known to yield better data locality than 1D-like partitionings). To facilitate execution on a deep memory architecture, we designed the Scheduler to be hierarchical; this can, for example, be achieved by recursively applying our partitioning algorithms. To support incremental algorithms for streaming/online problems, the Scheduler will be dynamic, i.e., able to make real-time decisions regarding the placement of new tasks for incoming data and the removal of tasks for deleted data. This can be achieved simply by greedily placing new tasks or deleting old ones to ensure real-time response, and periodically recreating the entire schedule to avoid the suboptimal performance that may result from several consecutive greedy decisions. Note that during the execution of a given phase, depending on the available fast memory space, it is possible to start loading the input data for the subsequent phase. This way, the Scheduler can overlap the (already minimized) data movement with computations for improved performance. The DAG structure of our data-flow execution model greatly facilitates the scheduling of independent tasks to the available cores on a node. In fact, scheduling heuristics can readily be used to determine the assignment of tasks to individual cores so as to maximize cache locality.

1.7 Simulating Communication Behavior Of A Real World Distributed Application

Large-scale real-world scientific applications typically need high-performance computing (HPC) platforms to perform their simulations, not only because of the compute power necessary for large-scale calculations, but also because of the aggregate memory needed. Often, one needs to store large amounts of data in main memory, which requires the use of a large number of nodes, with communication between the nodes over high-speed interconnection networks. Typically, the amount of data as well as the number of nodes increases with the size of the simulation. Communication between nodes can, and often will, become a bottleneck, in particular for iterative sparse eigensolvers due to their low arithmetic intensities. When preparing and optimizing scientific applications for specific HPC platforms, one therefore has to take the potential communication overhead into consideration. However, it is far from trivial to estimate the actual communication overhead based on HPC design specifications such as peak bisection bandwidth, network topology, individual node and link bandwidths, and latencies. On existing HPC platforms, one can in principle run a skeleton of the scientific application code, simulating only the communication, and thus empirically measure the communication overhead. Unfortunately, this tends to be computationally expensive, and is only possible if one has access to the specific HPC platform.
For a future machine, one would have to wait until it is deployed and access has been granted before being able to realistically measure the actual communication overhead. Ideally, one would like to gain insight into the communication overhead without actually performing such a communication-only skeleton run. In this work we use the Ember library, which is part of the Structural Simulation Toolkit (SST) [88], to model the communication costs of a large-scale nuclear physics application [89, 90, 4, 91] on a current HPC platform (for validation) and a future one (for prediction). This application uses iterative distributed eigensolvers (specifically, the Lanczos [92] and LOBPCG [4, 93] algorithms) to obtain the lowest eigenvalues and eigenvectors of a large sparse symmetric matrix, and runs on tens of thousands of nodes. It is known that for large-scale runs, the communication can indeed become a bottleneck; most of the communication cost, however, can be hidden behind local computation. We first compare our SST motif with timings from communication skeleton runs of our application on up to a thousand nodes on Cori-KNL, a Knights Landing based cluster at the National Energy Research Scientific Computing Center (NERSC), followed by simulations aimed at Perlmutter, a new machine to be installed at the same facility later this fall.

1.8 Introducing Communication for DeepSparse

In this work we implemented two different algorithms, Lanczos and LOBPCG, and executed them using our DeepSparse framework. The implementation is based on task parallelism and targets on-node optimization. We observed that DeepSparse achieves 2x-16x fewer cache misses across the different cache layers (L1, L2 and L3) compared to implementations of the same solvers based on optimized library function calls. We also achieve a 2x-3.9x improvement in execution time when using DeepSparse over the same library versions.

As discussed in Chapter 6, the distributed matrix multiplication in MFDn involves an allgather, a broadcast, a reduction and a reduce-scatter operation; a detailed explanation is given in Section 6.2. In practice, when MFDn is run on an architecture like KNL, communication tends to dominate computation for very large simulations involving a very high number of MPI ranks spread over many compute nodes. We simulated the performance of these communication patterns and tried to identify a possible cause using the SST simulator. We observed that the broadcast operation takes much longer in practice because of possible network congestion, and the very large message sizes further fuel this behavior. Since DeepSparse is a task-parallel framework where each task performs a particular matrix or vector operation on a matrix or vector block, we were motivated to use blocked communication tasks. Our idea is to introduce custom communication tasks, each of which communicates with other nodes and transmits only a block of the matrix or a vector. This helps in multiple ways. Blocking the communication makes the messages much smaller than in the original code, which we expect to help in the case of network congestion.
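A minimal sketch of the kind of blocked, non-blocking broadcast such communication tasks could issue; the block size, function names and the consumption pattern are illustrative assumptions, not MFDn's or DeepSparse's actual implementation.

```c
#include <mpi.h>
#include <stdlib.h>

/* Broadcast a large vector in fixed-size blocks with MPI_Ibcast so that
 * downstream work on a block can start as soon as that block has arrived,
 * instead of waiting for the whole vector. */
void bcast_in_blocks(double *vec, long n, long blk, int root, MPI_Comm comm) {
    long nblocks = (n + blk - 1) / blk;
    MPI_Request *reqs = malloc((size_t)nblocks * sizeof(MPI_Request));

    /* post one non-blocking broadcast per block */
    for (long b = 0; b < nblocks; b++) {
        long off = b * blk;
        long cnt = (off + blk <= n) ? blk : (n - off);
        MPI_Ibcast(vec + off, (int)cnt, MPI_DOUBLE, root, comm, &reqs[b]);
    }

    /* consume blocks in whatever order they complete */
    for (long done = 0; done < nblocks; done++) {
        int idx;
        MPI_Waitany((int)nblocks, reqs, &idx, MPI_STATUS_IGNORE);
        long off = (long)idx * blk;
        long cnt = (off + blk <= n) ? blk : (n - off);
        /* compute_on_block(vec + off, cnt);  <- hypothetical downstream kernel */
        (void)off; (void)cnt;
    }
    free(reqs);
}
```

In the task-parallel setting, each completed block would simply release the dependencies of the tasks waiting on it, which is the overlap effect discussed next.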
The other motivation was to overlap the communication with the computations in the matrix multiplication. Whenever a particular block is received and ready for computation, the kernels waiting for that block of the matrix or vector can start immediately, rather than waiting for the entire matrix or vector to be transmitted first.

In the shared memory implementation of DeepSparse we observed a nicely pipelined execution of the different kinds of kernels: the matrix multiplication SpMM and the vector operations, such as vector-vector multiplication and vector-vector transpose multiplication. In Figure 4.8 we can see the pipelined execution of an actual LOBPCG iteration, where the tasks are different kernels that operate on the same data structures, which ultimately improves performance. This test was done on a Haswell architecture. We also ran similar tests on Broadwell machines with a different matrix to validate the framework and the pipelined execution. In Figure 7.1 we show this pipelined execution for the nlpkkt240 matrix on a Broadwell architecture; the SpMM tasks are shown in orange, the XY operations in maroon and the XTY operations in green. We can clearly observe that the matrix and vector operations are well pipelined. Another interesting observation is that the fractions of time spent on the SpMM and on the vector operations are in a similar range. Since the matrix and vector operations take comparable amounts of time during an iteration, a well pipelined execution of these kernels improves the cache performance, and indeed we saw that the execution time improved on a shared memory architecture. This motivated us to extend the idea to a distributed application like MFDn, which has similar matrix and vector operations; with the introduction of communication tasks, we carry the idea over from shared memory to distributed memory.

Chapter 2

OPTIMIZATION IN LARGE SCALE DISTRIBUTED DENSE MATRICES

2.1 Nuclear Density Functional Theory With Skyrme Interaction

Unlike in classical calculations, where a point particle is defined by its position and its momentum, quantum particles are represented as complex wave functions. The square modulus of the wave function in real space is interpreted as the probability density of finding a particle at a certain point. Wave functions in real space and momentum space are related via the Fourier transform. In the Hartree-Fock approximation used in nuclear DFT calculations, the nuclear N-body wave function is restricted to a single Slater determinant consisting of N orthonormalized one-body wave functions ψα, α = 1..N. Each of these one-body wave functions has to fulfill the one-body Schrödinger equation

    \hat{h}_q \, \psi_\alpha = \varepsilon_\alpha \, \psi_\alpha ,    (2.1)

when convergence is reached, i.e., when

    \Delta\varepsilon = \sqrt{ \frac{ \sum_\alpha \left( \langle \psi_\alpha | \hat{h}^2 | \psi_\alpha \rangle - \langle \psi_\alpha | \hat{h} | \psi_\alpha \rangle^2 \right) }{ \sum_\alpha 1 } }    (2.2)

is small. In nuclear DFT, the interaction between nucleons (i.e., neutrons and protons) is expressed through a mean field. In this work, we utilize the Skyrme mean field [56]:

    E_{\mathrm{Sk}} = \sum_{q=n,p} \left( C_q^{\rho}(\rho_0)\, \rho_q^2 + C_q^{\Delta\rho}\, \rho_q \Delta\rho_q + C_q^{\tau}\, \rho_q \tau_q + C_q^{\nabla J}\, \rho_q \nabla \cdot \vec{J}_q \right) ,    (2.3)

where the parameters C_q^i have to be fitted to experimental observables.
The mean field is determined by the nucleon densities and their derivatives:

    \rho_q(\vec{r}) = \sum_{\alpha \in q} \sum_{s} v_\alpha^2 \, | \psi_\alpha(\vec{r}, s) |^2    (2.4a)

    \vec{J}_q(\vec{r}) = -i \sum_{\alpha \in q} \sum_{s s'} v_\alpha^2 \, \psi_\alpha^*(\vec{r}, s) \, \nabla \times \vec{\sigma}_{s s'} \, \psi_\alpha(\vec{r}, s')    (2.4b)

    \tau_q(\vec{r}) = \sum_{\alpha \in q} \sum_{s} v_\alpha^2 \, | \nabla \psi_\alpha(\vec{r}, s) |^2 ,    (2.4c)

where ρ_q is the number density, J⃗_q is the spin-orbit density and τ_q is the kinetic density for q ∈ (protons, neutrons), all of which are calculated from the wave functions. We assume a time-reversal symmetric state in these equations. The interaction is explicitly isospin dependent. The parameters v_α² are either 0 for non-occupied states or 1 for occupied states for calculations without the pairing force. With those occupation probabilities, the calculation can also be performed using more wave functions than the number of particles. The sum \sum_\alpha v_\alpha^2 determines the particle number. A detailed description of the Skyrme energy density functional can be found in references [94, 95].

In DFT, the ground state associated with a many-body system is found in a self-consistent way, i.e., iteratively. The self-consistent solution can be approached through direct diagonalization or, as in this work, through the damped gradient iteration method described below. While stable finite nuclei as present on earth typically do not contain more than a total of 300 nucleons, nuclear pasta matter in neutron stars is quasi-infinite on the scales of quantum simulations. It is therefore desirable to simulate as large volumes as possible to explore the varieties of nuclear pasta matter. Furthermore, in contrast to finite nuclei, which are approximately spherical, pasta matter covers a large range of shapes and deformations, and thus many more iterations are needed to reach convergence. Since larger volumes and consequently more nucleons require very intensive calculations, a high performance implementation of nuclear DFT codes is desirable.

DFT is also widely used for electronic structure calculations in computational chemistry. While the DFT approaches used in computational chemistry can efficiently diagonalize matrices associated with a large number of basis sets, we need to rely on different iteration techniques in nuclear DFT. The most important reason for this is that in computational chemistry, electrons are present in a strong external potential, so iterations can converge relatively quickly. In nuclear DFT, however, the problem must be solved in a purely self-consistent manner because nuclei are self-bound. As a result, the mean field can change drastically from one iteration to the next, since no fixed outer potential is present. Especially for nuclear pasta spanning a wide range of shapes, a few thousand iterations are necessary for the solver to converge. Therefore, nuclear DFT iterations have to be performed relatively quickly, making it infeasible to employ the electronic DFT methods, which are expensive for a single iteration.

2.2 Sky3D Software

Sky3D is a nuclear DFT solver, which has frequently been used for finite nuclei as well as for nuclear pasta (both static and time-dependent) simulations. The time-dependent version of Sky3D is relatively simpler to parallelize than the static version, because properties like the orthonormality of the wave functions are implicitly conserved due to the fact that the time-evolution operator is unitary. Therefore the calculation of a single nucleon is independent of the others. The only interaction between nucleons takes place through the mean field.
Thus only the nuclear densities from which the mean field is constructed have to be communicated. In the static case, however, orthonormality has to be ensured and the Hamiltonian matrix must be diagonalized to obtain the eigenvalues and eigenvectors of the system at each iteration. In this paper, we describe the parallelization of the more challenging static version (which previously was only shared memory parallel).

Sky3D operates on a three dimensional equidistant grid in coordinate space. Since nuclear DFT is a self-consistent method requiring an iterative solver, the calculation has to be initialized with an initial configuration. In Sky3D, the wave functions are initialized with a trial state, using either harmonic oscillator wave functions for finite nuclei or plane waves for periodic systems. The initial densities and the mean field are calculated from those trial wave functions. After initialization, iterations are performed using the damped gradient iteration scheme [96]

    \psi_\alpha^{(n+1)} = \mathcal{O} \left\{ \psi_\alpha^{(n)} - \frac{\delta}{\hat{T} + E_0} \left( \hat{h}^{(n)} - \langle \psi_\alpha^{(n)} | \hat{h}^{(n)} | \psi_\alpha^{(n)} \rangle \right) \psi_\alpha^{(n)} \right\} ,    (2.5)

where O denotes the orthonormalization of the wave functions, T̂ denotes the kinetic energy operator, ψ_α^(n) and ĥ^(n) denote the single-particle wave function and the Hamiltonian at step n, respectively, and δ and E_0 are constants that need to be tuned for fast convergence. The Hamiltonian consists mainly of the kinetic energy, the mean field contribution (Eq. 2.3) and the Coulomb contribution. We use FFTs to compute the derivatives of the wave functions. The Coulomb problem is solved in momentum space, also employing FFTs. The basic flow chart of the static Sky3D code is shown in Fig. 2.1. Since the damped gradient iterations of Eq. 2.5 do not conserve the diagonality of the wave functions with respect to the Hamiltonian, i.e. ⟨ψ_α | ĥ | ψ_β⟩ ∝ δ_αβ, they have to be diagonalized after each step to obtain the eigenfunctions. Subsequently, single-particle properties, e.g. single-particle energies, are determined. If the convergence criterion (Eq. (2.2)) is fulfilled at the end of the current iteration, the properties of the states and the wave functions are written to a file and the calculation is terminated.

2.3 Distributed Memory Parallelization With MPI

There are basically two approaches for distributed memory parallelization of the Sky3D code. The first approach would employ a spatial decomposition where the three dimensional space is partitioned and the corresponding grid points are distributed over different MPI ranks. However, computations like the calculation of the density gradients ∇ρ_q(r⃗) are global operations that require 3D FFTs, which are known to have poor scaling due to their dependence on all-to-all interprocess communications [97]. Hence, this approach would not scale well. The second approach would be to distribute the single-particle wave functions among processes. While communications are unavoidable, by carefully partitioning the data at each step it is possible to significantly reduce the communication overheads in this scheme.

[Figure 2.1 (flowchart of the parallelized Sky3D code): 1: initialize ψ^(0) and mean field; 2: damped gradient step ψ_α^(n) → φ_α^(n+1) and calculate ĥ|φ_α⟩; 3a: calculate matrices I_αβ = ⟨φ_α|φ_β⟩, H_αβ = ⟨φ_α|ĥ|φ_β⟩; 3b: diagonalization of H_αβ; 3c: orthogonalization; 3d: combine orthonormalization and diagonalization matrices; 3e: build orthonormalized and diagonalized wave functions; 4a: calculate densities; 4b: calculate mean field V^(n+1); 5: calculate single-particle properties; if not converged, return to step 2; 6: finalize the calculation.]
Figure 2.1: Flowchart of the parallelized Sky3D code. Parts in the 1D distribution are marked in green, parts in the full 2D distribution are marked in blue, parts in the divided 2D distribution are marked in yellow and collective non-parallelized parts are marked in red. ψ_α^(n) denotes the orthonormal and diagonal wave function at step n with index α, |φ_α⟩ = φ_α^(n+1) denotes the non-diagonal and non-orthonormal wave function at step n+1. ĥ is the one-body Hamiltonian.

In what follows, we present our parallel implementation of Sky3D using this second approach. The iterative steps following the initialization phase constitute the computationally expensive part of Sky3D. Hence, our discussion will focus on the implementation of steps 2 through 5 of Fig. 2.1. Computations in these steps can be classified into two groups: i) those that work with matrices (and require 2D distributions to obtain good scaling), and ii) those that work on the wave functions themselves (and utilize 1D distributions, as it is more convenient in this case to have the wave functions fully present on a node). Our parallel implementation progresses by switching between these 2D partitioned steps (marked in blue and yellow in Fig. 2.1) and 1D partitioned steps (marked in green) in each iteration. Steps marked in red are not parallelized. As discussed in more detail below, an important aspect of our implementation is that we make use of optimized scientific computing libraries such as ScaLAPACK [98] and FFTW [99] wherever possible. Since ScaLAPACK and FFTW are widely used and well optimized across HPC systems, this approach allows us to achieve high performance on a wide variety of architectures without the added burden of fine-tuning Sky3D. This may even extend to future architectures with decreased memory space per core and possibly multiple levels in the memory hierarchy. As implementations of the ScaLAPACK and FFTW libraries evolve for such systems, we anticipate that it will be relatively easy to adapt our implementation to such changes in HPC systems.

2.3.1 The 1D and 2D partitionings

The decisions regarding 1D and 2D decompositions are made around the wave functions, which represent the main data structure in Sky3D and are involved in all the key computational steps. We represent the wave functions using a two dimensional array psi(V,A), where V = nx × ny × nz × 2 includes the spatial degrees of freedom and the spin degree of freedom, with nx, ny and nz being the grid sizes in the x, y and z directions, respectively, and the factor 2 originating from the two components of the spinor.

In the case of the 1D distribution, full wave functions are distributed among processes in a block cyclic way. The block size Nψ determines how many consecutive wave functions are given to each process in each round. In round one, the first Nψ wave functions are given to the first process P0, then the second process P1 gets the second batch and so on. When all processes have been assigned a batch of wave functions in a round, the distribution resumes with P0 in the subsequent round until all wave functions are exhausted.

In the 2D partitioning case, single-particle wave functions as well as the matrices constructed using them (see Sect. 2.3.3.1) are divided among processes using a 2D block cyclic distribution.
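For illustration, the owner computations for the two distributions just described can be sketched as below; 0-based indices and a row-by-row numbering of the process grid (which reproduces the layout of the example in Fig. 2.2) are assumptions made for the sketch.

```c
#include <stdio.h>

/* 1D block-cyclic: wave function j with block size npsi over nproc processes. */
static int owner_1d(int j, int npsi, int nproc) {
    return (j / npsi) % nproc;
}

/* 2D block-cyclic: matrix element (i, j) with row/column block sizes NB/MB
 * on a prow x pcol process grid numbered row by row. */
static int owner_2d(int i, int j, int NB, int MB, int prow, int pcol) {
    return ((i / NB) % prow) * pcol + (j / MB) % pcol;
}

int main(void) {
    /* 14x14 matrix on a 3x2 grid with NB = MB = 2, as in Fig. 2.2:
     * element (7,3) in 1-based notation, i.e. (6,2) here, belongs to P1. */
    printf("owner of element (7,3): P%d\n", owner_2d(6, 2, 2, 2, 3, 2));
    /* 1D example: with Npsi = 2 and 3 processes, wave function 5 goes to P2. */
    printf("owner of wave function 5: P%d\n", owner_1d(4, 2, 3));
    return 0;
}
```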
In Fig. 2.2, we provide a visual example of a square matrix distributed in a 2D block cyclic fashion, where processes are organized into a 3×2 grid topology and the row block size NB and the column block size MB have been set equal to 2. The small rectangular boxes in the matrix show the arrangement of processes in the 3×2 grid topology; the number of rows is in general not equal to the number of columns in the process grid. For symmetric or Hermitian matrices, only the (blue marked) lower triangular part is needed, as it defines the matrix fully. In this particular case, P0 is assigned the matrix elements in rows 1, 2, 7, 8, 13 and 14 in the first column, as well as those in rows 2, 7, 8, 13 and 14 in the second column; P1 is assigned the matrix elements in rows 7, 8, 13 and 14 in columns 3 and 4, and so on. Single-particle wave functions, which are stored as rectangular (non-symmetric) matrices with significantly more rows than columns (due to the large grid sizes needed), are also distributed using the same strategy. The 2D block cyclic distribution provides very good load balance in computations associated with the single-particle wave functions and the matrices formed using them, as all processes are assigned approximately the same number of elements.

[Figure 2.2: 2D block cyclic partitioning example with 14 wave functions using a 3×2 processor topology. Row and column block sizes are set as NB=MB=2. The blue shaded area marks the lower triangular part.]

2.3.2 Parallelization across Neutron and Proton groups

An important observation for the parallelization of Sky3D is that neutrons and protons interact only through the mean field. Therefore, the only communication needed between these two species takes place within step 4b. To achieve better scalability, we leverage this fact and separate the computations associated with neutrons and protons into different processor groups, while trying to preserve the load balance between them as explained in detail below.

Suppose A = N + Z is the total number of wave functions, such that N is the number of neutron wave functions and Z is the number of proton wave functions. To distribute the array psi, we split the available processors into two groups: one group of size PN for neutrons, and another group of size PP for protons. We note that in practical simulations, the number of neutrons is often larger than the number of protons (for example, simulations of neutron star matter are naturally neutron rich). The partitioning of processors into neutron and proton groups must account for this situation to ensure good load balance between the two groups. As will be discussed in Section 2.3.3, the Sky3D execution time is mainly determined by the time spent on the 2D partitioned steps, which is primarily dominated by the construction of the overlap and Hamiltonian matrices, and to a lesser degree by the eigenvalue computations associated with these matrices. Since the computational cost of the matrix construction step is proportional to the square of the number of particles in a group (see Section 2.3.3.1), we choose to split the processors into two groups by quadratically weighing the number of particles in each species. More precisely, if the total processor count is given by C, then

    P_N = \frac{N^2}{N^2 + Z^2} \, C  and  P_P = \frac{Z^2}{N^2 + Z^2} \, C

according to this scheme.
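A small sketch of this splitting rule, combined with the grid creation and the rounding safeguard described in the following paragraphs; the function name, the rounding details and the default of 2D grids are illustrative assumptions.

```c
#include <mpi.h>
#include <stdio.h>

/* Split C cores into neutron/proton groups with quadratic weights
 * P_N = N^2/(N^2+Z^2) * C, rounded to a multiple of `mult` to avoid
 * tall-and-skinny process grids, then derive a near-square 2D grid
 * for each group with MPI_Dims_create. */
static void split_groups(int C, long N, long Z, int mult,
                         int *PN, int *PP, int dimsN[2], int dimsP[2]) {
    double wN = (double)(N * N) / (double)(N * N + Z * Z);
    *PN = (int)(wN * C + 0.5);
    *PN = ((*PN + mult / 2) / mult) * mult;   /* round to nearest multiple */
    if (*PN <= 0) *PN = mult;
    if (*PN >= C) *PN = C - mult;
    *PP = C - *PN;

    dimsN[0] = dimsN[1] = 0;                  /* let MPI choose both factors */
    dimsP[0] = dimsP[1] = 0;
    MPI_Dims_create(*PN, 2, dimsN);           /* near-square factorization */
    MPI_Dims_create(*PP, 2, dimsP);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, PN, PP, dN[2], dP[2];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* the 5000 neutron / 1000 proton example on 256 cores discussed below */
    split_groups(256, 5000, 1000, 16, &PN, &PP, dN, dP);
    if (rank == 0)
        printf("PN=%d (%dx%d), PP=%d (%dx%d)\n", PN, dN[0], dN[1], PP, dP[0], dP[1]);
    MPI_Finalize();
    return 0;
}
```

For the quoted example this reproduces PN = 240 and PP = 16, i.e., the 16×15 and 4×4 grids mentioned below.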
It is well known that 2D partitioning is optimal for scalable matrix-matrix multiplications [100]. Therefore, once the number of processors within the neutron and proton groups is set, we determine the number of rows and columns for the 2D process topologies of each of these groups through MPI's MPI_DIMS_CREATE function. This function ensures that the number of rows and columns is as close as possible to the square root of PN and PP for the neutron and proton groups, respectively, thus yielding a good 2D process layout.

As will be demonstrated through numerical experiments, the scheme described above typically delivers computations with good load balance, but it has a caveat. Under certain circumstances, this division might yield a 2D process grid with a tall and skinny layout, which is essentially closer to a 1D partitioning and has led to significant performance degradations for our 2D partitioned computations. For instance, in a system with 5000 neutrons and 1000 protons, when we use 256 processors, PN will be 246 and PP will be 10 according to our scheme. For PN = 246, the corresponding process grid is 41×6, which is much closer to a 1D layout than a 2D layout. To prevent such circumstances, we require the number of processors within each group to be a multiple of certain powers of 2 (i.e., 2, 4, 8, 16 or 32, depending on the total core count). For the above example, by requiring the number of cores within each group to be a multiple of 16, we determine PN to be 240 and PP to be 16. This results in a process grid of size 16×15 for neutrons, which is almost square, and a perfect square grid of size 4×4 for protons.

2.3.3 Calculations with 2D distributions

As a result of the split, the neutron and proton processor groups asynchronously advance through the steps that require 2D partitionings, i.e., steps 3a to 3e. The main computational tasks here are the construction of the overlap and Hamiltonian matrices (using 1D distributed wave functions) and the eigendecompositions of these matrices. These tasks are essentially accomplished by calling suitable ScaLAPACK routines. The choice of a 2D block cyclic distribution maximizes the load balance with ScaLAPACK for steps 3a through 3e.

2.3.3.1 Matrix construction (step 3a)

Construction of the overlap matrix I and the Hamiltonian matrix H,

    I_{\alpha\beta} = \langle \varphi_\alpha | \varphi_\beta \rangle ,    (2.6)
    H_{\alpha\beta} = \langle \varphi_\alpha | \hat{h} | \varphi_\beta \rangle ,    (2.7)

where |φ_α⟩ marks the non-orthonormalized and non-diagonal wave functions, constitutes the most expensive part of the iterations in Sky3D, because the cost of these operations scales quadratically with the number of particles. More precisely, calculating these matrices costs O(I²V), where I ∈ {N, Z} is the number of wave functions. Since these two operations are simply inner products, we use ScaLAPACK's complex matrix-matrix multiplication routine PZGEMM to construct these two matrices. One subtlety here is that prior to the start of the steps with 2D partitionings, the wave functions are distributed in a 1D scheme. To achieve good performance and scaling with PZGEMM, we first switch both the wave functions |φ_α⟩ (psi) and ĥ|φ_α⟩ (hampsi) into a 2D cyclic layout which uses the same process grid created through the MPI_DIMS_CREATE function. The PZGEMM calls then operate on the conjugate transpose of psi together with these two matrices. The resulting matrices I and H are distributed over the same 2D process grid as well.
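For illustration, a single-process sketch of the inner products behind Eqs. (2.6) and (2.7), written with plain loops over a local block; the actual code performs the same contraction with ScaLAPACK's PZGEMM on 2D-distributed data, and the array layout below is a simplifying assumption.

```c
#include <complex.h>
#include <stddef.h>

/* Local sketch of I[a][b] = <phi_a | phi_b> and H[a][b] = <phi_a | h phi_b>
 * for nwf wave functions of length V stored column-wise:
 * psi[v + a*V] is component v of wave function a, hampsi holds h|phi_a>. */
void build_overlap_and_hamiltonian(const double complex *psi,
                                   const double complex *hampsi,
                                   size_t V, size_t nwf,
                                   double complex *I, double complex *H) {
    for (size_t a = 0; a < nwf; a++) {
        for (size_t b = 0; b < nwf; b++) {
            double complex s_i = 0.0, s_h = 0.0;
            for (size_t v = 0; v < V; v++) {
                s_i += conj(psi[v + a * V]) * psi[v + b * V];
                s_h += conj(psi[v + a * V]) * hampsi[v + b * V];
            }
            I[a * nwf + b] = s_i;   /* overlap matrix, Eq. (2.6) */
            H[a * nwf + b] = s_h;   /* Hamiltonian matrix, Eq. (2.7) */
        }
    }
}
```

The O(I²V) cost quoted above is visible directly in the loop structure: nwf² inner products of length V.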
Since I and H are square matrices, we set the row and column block sizes, NB and MB, respectively, to be equal (i.e., NB = MB). Normally, in a 2D matrix computation one would expect a trade-off in choosing the exact value of NB and MB, as small blocks lead to a favorable load balance, but large blocks reduce communication overheads. However, for the typical problems we are interested in, i.e., more than 1000 particles using at least a few hundred cores, our experiments with different NB and MB values such as 2, 4, 8, 32 and 64 have shown negligible performance differences. Therefore, we have empirically set NB = MB to 32 for large computations.

2.3.3.2 Diagonalization and Orthonormalization (steps 3b & c)

After the matrix H is constructed according to Eq. (2.7), its eigenvalue decomposition is computed to find the eigenstates of the Hamiltonian. Since H is a Hermitian matrix, we use the complex Hermitian eigenvalue decomposition routine PZHEEVR in ScaLAPACK, which first reduces the input matrix to tridiagonal form and then computes the eigenspectrum using the Multiple Relatively Robust Representations (MRRR) algorithm [101]. The local matrices produced by the 2D block cyclic distribution of the matrix construction step can readily be used as input to the PZHEEVR routine. After the eigenvectors of H are obtained, the diagonal set of wave functions ψ_α can be obtained through the matrix-vector multiplication

    \psi_\alpha = \sum_\beta Z^{H}_{\alpha\beta} \, \varphi_\beta    (2.8)

over all φ_β, where Z^H is the matrix containing the eigenvectors of H.

Orthonormalization is commonly accomplished through the modified Gram-Schmidt (mGS) method, a numerically more stable version of the classical Gram-Schmidt method. Unlike the original version of Sky3D, we did not opt for mGS for a number of reasons. First, mGS is an inherently sequential process where the orthonormalization of wave function n + 1 can start only after the first n vectors are orthonormalized. Second, the main computational kernel in this method is a dot product, which is a Level-1 BLAS operation and has low arithmetic intensity. Finally, a parallel mGS with a block cyclic distribution of the wave functions |φ_α⟩, as used by the matrix construction and diagonalization steps, would incur significant synchronization overheads, especially due to the small blocking factors needed to load balance the matrix construction step. An alternative approach to orthonormalize the wave functions is the Löwdin method [102], which can be stated for our purposes as:

    C = I^{-1/2}    (2.9a)
    \psi_\alpha = \sum_\beta C_{\beta\alpha} \, \varphi_\beta .    (2.9b)

The Löwdin orthonormalization is well known in quantum chemistry and has the property that the orthonormalized wave functions are those closest to the non-orthonormalized wave functions in a least-squares sense. Note that since I is a Hermitian matrix, it can be factorized as I = XΛX^H, where the columns of X are its eigenvectors and Λ is a diagonal matrix composed of I's eigenvalues. Consequently, I^{-1/2} in Eq. 2.9a can be computed simply by taking the inverses of the square roots of I's eigenvalues, i.e., C = I^{-1/2} = XΛ^{-1/2}X^H. Applying the Löwdin method to our problem is thus equivalent to computing an eigendecomposition of the overlap matrix I, which can also be implemented using the PZHEEVR routine in ScaLAPACK. Note that exactly the same distribution of wave functions and blocking factors as in the matrix construction step can be used for this step, too.
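A single-node sketch of the Löwdin construction C = XΛ^{-1/2}X^H via a dense Hermitian eigendecomposition, using LAPACKE's zheev purely for illustration; the production code performs the equivalent operation on the distributed overlap matrix with ScaLAPACK's PZHEEVR, and positive definiteness of I is assumed here.

```c
#include <complex.h>
#include <math.h>
#include <stdlib.h>
#include <lapacke.h>

/* Given the Hermitian overlap matrix I (n x n, column-major) in `ovl`,
 * fill `C` with I^{-1/2} = X diag(1/sqrt(lambda)) X^H (Eq. 2.9a).
 * Returns 0 on success. */
int lowdin_inverse_sqrt(lapack_int n, const double complex *ovl, double complex *C) {
    double complex *X = malloc((size_t)n * n * sizeof *X);  /* eigenvectors */
    double *lam = malloc((size_t)n * sizeof *lam);          /* eigenvalues  */
    if (!X || !lam) { free(X); free(lam); return -1; }

    for (lapack_int k = 0; k < n * n; k++) X[k] = ovl[k];

    /* zheev overwrites X with the eigenvectors (column j = eigenvector j) */
    lapack_int info = LAPACKE_zheev(LAPACK_COL_MAJOR, 'V', 'L', n,
                                    (lapack_complex_double *)X, n, lam);
    if (info == 0) {
        for (lapack_int i = 0; i < n; i++)
            for (lapack_int j = 0; j < n; j++) {
                double complex s = 0.0;
                for (lapack_int k = 0; k < n; k++)
                    s += X[i + k * n] * (1.0 / sqrt(lam[k])) * conj(X[j + k * n]);
                C[i + j * n] = s;   /* column-major C(i,j) */
            }
    }
    free(X); free(lam);
    return (int)info;
}
```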
Detailed performance analyses reveal that the eigendecomposition routine PZHEEVR does not scale well to large P with the usual number of wave functions in nuclear pasta calculations. However, the eigendecompositions of the I and H matrices (within both the neutron and proton groups) are independent of each other, and their construction is similar. Therefore, to gain additional performance, we perform steps 3b and 3c in parallel, each using half the number of MPI ranks available in a group, i.e., PN/2 and PP/2 for neutrons and protons, respectively.

2.3.3.3 Post-processing (steps 3d & e)

The post-processing operations in the diagonalization and orthonormalization steps are matrix-vector multiplications acting on the current set of wave functions φ. As opposed to applying these operations one after the other, i.e., C^T (Z^H {φ}), we combine diagonalization and orthonormalization by performing (C^T Z^H) {φ}, where {φ} = (φ_1, φ_2, ..., φ_n)^T denotes a vector containing the single-particle wave functions. While both sequences of operations are arithmetically equivalent, the latter has a benefit in terms of computational cost, as it reduces the number of multiply-adds from 2I²V to I³ + I²V. This is almost half the cost of the first sequence of operations, since we have I ≪ V for both neutrons and protons, because the number of wave functions has to be significantly smaller than the size of the basis to prevent any bias due to the choice of the basis. We describe this optimization in the form of pseudocode in Fig. ??. By computing the overlap matrix I together with the Hamiltonian matrix H, and performing their eigendecompositions, we can combine the update and orthonormalization of the wave functions. Lines shown in red in Fig. ?? mark those affected by this optimization. Consequently, the matrix-matrix multiplication C^T Z^H can be performed prior to the matrix-vector multiplication involving the wave functions {φ(r⃗_ν)}. As a result, the overall computational cost is significantly reduced, and efficient level-3 BLAS routines can be leveraged. The C^T Z^H operation is carried out in parallel using ScaLAPACK's PZGEMM routine (step 3d). Then the resulting matrix is multiplied with the vector of wave functions for all spatial and spin degrees of freedom in step 3e, using another PZGEMM call.

It should be noted that with this method, we introduce small errors during the iterations, because the matrix C^T is calculated from non-orthogonalized wave functions. However, since we take small gradient steps, these errors disappear as we reach convergence, and we ultimately arrive at the same results as an implementation which computes the matrix C^T from orthogonalized wave functions.

2.3.4 Calculations with a 1D distribution

Steps 2, 4 and 5 use the 1D distribution, because for a given wave function ψ_α, the computations in these steps are independent of all other wave functions. Within the neutron and proton processor groups, we distribute the wave functions using a 1D block cyclic distribution with block size Nψ. Such a distribution further ensures good load balance and facilitates the communication between the 1D and 2D distributions. The damped gradient step (as shown in the curly brackets in Eq. 2.5) is performed in step 2. Here, the operator T̂ shown in Eq. 2.5 is computed using the FFTW library.
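A compact sketch of applying the kinetic-energy operator to a single spinor component with FFTW, in the spirit of step 2; the grid layout, the constant h2m (standing in for ħ²/2m) and the use of per-call plans are illustrative simplifications, not Sky3D's actual implementation.

```c
#include <complex.h>   /* must come before fftw3.h so fftw_complex = double complex */
#include <math.h>
#include <fftw3.h>

/* Apply T psi = -(hbar^2/2m) laplacian(psi) on an nx*ny*nz periodic grid of
 * box length L: forward FFT, multiply by h2m * k^2, inverse FFT. The factor
 * 1/(nx*ny*nz) compensates FFTW's unnormalized transforms. */
void apply_kinetic(fftw_complex *psi, int nx, int ny, int nz, double L, double h2m) {
    int n = nx * ny * nz;
    fftw_plan fwd = fftw_plan_dft_3d(nx, ny, nz, psi, psi, FFTW_FORWARD,  FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_3d(nx, ny, nz, psi, psi, FFTW_BACKWARD, FFTW_ESTIMATE);

    fftw_execute(fwd);
    for (int ix = 0; ix < nx; ix++)
        for (int iy = 0; iy < ny; iy++)
            for (int iz = 0; iz < nz; iz++) {
                double kx = 2.0 * M_PI * ((ix <= nx / 2) ? ix : ix - nx) / L;
                double ky = 2.0 * M_PI * ((iy <= ny / 2) ? iy : iy - ny) / L;
                double kz = 2.0 * M_PI * ((iz <= nz / 2) ? iz : iz - nz) / L;
                double fac = h2m * (kx * kx + ky * ky + kz * kz) / n;
                psi[(ix * ny + iy) * nz + iz] *= fac;
            }
    fftw_execute(bwd);

    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
}
```

The same FFT-based differentiation pattern underlies the density derivatives and the Coulomb solve mentioned earlier.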
Since the Hamiltonian has to be applied to the wave functions in this step, ĥ|ψ_α⟩ is saved in the array hampsi, which is distributed in the same way as psi and is reused in step 3a. In step 4a, the partial densities given in Eqs. (2.4a)-(2.4c) are calculated on each node for the local wave functions separately and subsequently summed up with the MPI routine MPI_ALLREDUCE. We use FFTs to compute the derivatives of the wave functions as needed in Eqs. (2.4a)-(2.4c). The determination of the mean field in step 4b does not depend on the number of particles and is generally not expensive. Consequently, this computation is performed redundantly by each MPI rank to avoid synchronizations. The check for convergence is also performed on each MPI rank separately. Both are marked in red in Fig. 2.1. Finally, in step 5, single-particle properties are calculated, and the partial results are aggregated on all MPI ranks using an MPI_ALLREDUCE.

2.3.5 Switching between different data distributions

As described above, our parallel Sky3D implementation uses three different distributions: the 1D distribution, which is defined separately for neutrons and protons and used in the green marked steps of Fig. 2.1; the 2D distribution, which is again defined separately for neutrons and protons and used in the blue marked steps; and the 2D distribution for diagonalization and orthogonalization (used in the steps marked in yellow) within subgroups of size PN/2 and PP/2, respectively, for neutrons and protons. For the calculation of the overlap and Hamiltonian matrices, the wave functions psi and hampsi need to be switched from the 1D distribution into the 2D distribution after step 2. After step 3a, the matrices I and H must be transferred to the subgroups. After the eigendecompositions in steps 3b and 3c are completed, the matrices Z and C, which contain the eigenvectors, need to be redistributed back to the full 2D groups. After step 3e, only the updated array psi has to be switched back to the 1D distribution from the 2D distributions. While these operations require complicated interprocess communications, they are easily carried out with the ScaLAPACK routine PZGEMR2D, which can transfer distributed matrices from one processor grid to another, even if the involved grids are totally different in shape and formation. As we will demonstrate in the performance evaluation section, the time required by PZGEMR2D is insignificant, including in large scale calculations.

2.3.6 Memory considerations

Beyond performance optimizations, memory utilization is of great importance for large-scale nuclear pasta calculations. The data structures that require major memory space in Sky3D are the wave functions stored in the matrices psi and hampsi. The latter matrix was not needed in the original Sky3D code, as H was calculated on the fly, but this is not an option for a distributed memory implementation. As such, the total memory need increases by roughly a factor of 2. Furthermore, we store both arrays in both the 1D and 2D distributions, which contributes another factor of 2. Besides the wave functions, another major memory requirement is the storage of matrices such as H and I. These data structures are much smaller because the total matrix size grows only as N² for neutrons and Z² for protons. In our implementation, we store these matrices twice for the 2D distribution and twice for the 2D distributions within subgroups.
To give an example of the actual memory utilization, the largest calculations we conducted in this work are nuclear pasta calculations with a cubic lattice of 48 points and 6000 wave functions. In this case, the aggregate size of a single matrix to store the wave functions in double precision complex format is 48³ × 2 × 6000 × 16 bytes ≈ 21 GB, and all four arrays required in our parallelization amount to about 84 GB of memory. The Hamiltonian and overlap matrices occupy a much smaller footprint, roughly 144 MB per matrix. As this example shows, it is feasible to carry out large scale nuclear pasta calculations using the developed parallel Sky3D code. Even if we choose a bigger grid, e.g., of size 64³, with typical compute nodes in today's HPC systems having ≥ 64 GB of memory and the memory need per MPI rank decreasing linearly with the total number of MPI ranks in a calculation (there is little to no duplication of data structures across MPI ranks), such calculations would still be feasible using our implementation.

2.4 Shared Memory Parallelization with OpenMP

In addition to the MPI parallelization, we also implemented a hybrid MPI/OpenMP parallel version of Sky3D. The rationale behind a hybrid parallel implementation is that it maps more naturally to today's multi-core architectures, and it may also reduce the amount of inter-node communication over MPI. For the 1D distribution calculations, similar to the MPI implementation, we distribute the wave functions over threads by parallelizing loops using OpenMP. Since no communication is needed for step 2, we do not expect a performance gain from the shared memory implementation in this step. Step 4a, however, involves major communications, as the partial sums of the densities have to be reduced across all threads. The OpenMP implementation reduces the number of MPI ranks when the total core count P is kept constant, which reduces the amount of inter-node communication. Similarly, step 5 involves the communication of single-particle properties; since these quantities are mainly scalars or small vectors, though, the inter-node communications are not as expensive. The steps with a 2D distribution, i.e., steps 3a-3e, largely rely on ScaLAPACK routines. In this part, shared memory parallelization is introduced implicitly via the use of multi-threaded ScaLAPACK routines. Consequently, we rely mostly on the ScaLAPACK implementation and its thread optimizations for the steps with 2D data distributions.

2.5 Performance Evaluation

2.5.1 Experimental setup

We conducted our computational experiments on Cori - Phase I, a Cray XC40 supercomputing platform at NERSC, which contains two 16-core Xeon E5-2698 v3 Haswell CPUs per node (see Table 2.1). Each of the 16 cores runs at 2.3 GHz and is capable of executing one fused multiply-add (FMA) AVX (4×64-bit SIMD) operation per cycle. Each core has a 64 KB L1 cache (32 KB instruction and 32 KB data cache) and a 256 KB L2 cache, both of which are private to each core. In addition, each CPU has a 40 MB shared L3 cache. The Xeon E5-2698 v3 CPU supports hyperthreading, which would essentially allow the use of 64 processes or threads per Cori-Phase I node. Our experiments with hyperthreading led to performance degradation for both the MPI-only and MPI/OpenMP hybrid parallel implementations. As such, we have disabled hyperthreading in our performance tests.

For the performance analysis, we choose a physically relevant setup. In practice, the grid spacing is about Δx = Δy = Δz ∼ 1 fm.
This grid spacing gives results with the desired accuracy [53]. For nuclear pasta matter, very high mean number densities (0.02 fm^-3 - 0.14 fm^-3) have to be reached. We choose two different cubic boxes of L = 32 fm and L = 48 fm and a fixed number of nucleons N + Z = 6000. This results in mean densities of 0.122 fm^-3 and 0.036 fm^-3, respectively. To eliminate any load balancing effects, we begin our performance and scalability tests with a symmetric system of 3000 neutrons and 3000 protons. As the systems for neutron star applications are neutron rich, we also test systems with 4000 neutrons and 2000 protons, and with 5000 neutrons and 1000 protons.

Table 2.1: Hardware specifications for a single socket on Cori, a Cray XC40 supercomputer at NERSC. Each node consists of two sockets.

  Platform                 Cray XC40
  Processor                Xeon E5-2698 v3
  Core                     Haswell
  Clock (GHz)              2.3
  Data Cache (KB)          64 (32+32) + 256
  Memory-Parallelism       HW-prefetch
  Cores/Processor          16
  Last-level L3 Cache      40 MB
  SP TFlop/s               1.2
  DP TFlop/s               0.6
  STREAM BW                120 GB/s
  Available Memory/node    128 GB
  Interconnect             Cray Aries (Dragonfly)
  Global BW                5.625 TB/s
  MPI Library              MPICHv2
  Compiler                 Intel/17.0.2.174

2.5.2 Scalability

First, we consider a system with 3000 neutrons and 3000 protons with two different grid sizes, L = 32 fm and L = 48 fm. On Cori, each "Haswell" node contains 32 cores on two sockets. For both grid sizes, we ran simulations on 1, 2, 4, 8, 16, 32 and 48 nodes using 32, 64, 128, 256, 512, 1024 and 1536 MPI ranks, respectively. In all our tests, all nodes are fully packed, i.e., one MPI rank is assigned to each core, exerting full load on the memory/cache system. Performance results for the L = 32 fm and L = 48 fm cases are shown in Fig. 2.3 and Fig. 2.4, respectively. In both figures, the total execution time per iteration is broken down into the time spent in the individual steps of Fig. 2.1. In addition, "communication" represents the time needed by ScaLAPACK's PZGEMR2D routine for data movement during switches between different distributions (i.e., 1D, 2D, and 2D subgroups). For each step, we use a single line to report the time for the neutron and proton computations by taking their maximum.

[Figure 2.3: Scalability of the MPI-only version of Sky3D for the 3000 neutron and 3000 proton system using the L = 32 fm grid. (Time per iteration in seconds vs. number of CPU cores from 32 to 1536, broken down by steps 2, 3a, 3b-3c, 3d, 3e, 4a, 4b, 5 and communication.)]

As seen in Fig. 2.3, the calculation of the matrices (step 3a) is the most expensive step. Another expensive step is step 3e, where the diagonalized and orthonormalized wave functions are built. The diagonalization of the Hamiltonian H and the Löwdin orthonormalization (steps 3b-3c, which are combined into a single step as they are performed in parallel) also take a significant amount of time. It can be seen that step 4a does not scale well, because it consists mainly of the communication of the densities. We also note that step 4b is not parallelized; it is rather performed redundantly on each process because it takes an insignificant amount of time. The damped gradient step (step 2) and the computation of single-particle properties (step 5) scale almost perfectly.
Table 2.2: Scalability of the MPI-only version of Sky3D for the L = 32 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.

           calc. matrix     recombine      diag+Löwdin        Total
  cores    time    eff      time    eff    time    eff      time    eff
     32    23.8    100      12.2    100     7.4    100      59.8    100
     64    11.8  101.5       6.0  102.0     2.9  128.7      28.4  105.1
    128     6.3   94.8       3.2   96.8     2.0   93.6      16.0   93.5
    256     3.2   92.2       1.5   99.4     1.6   58.4       9.2   81.4
    512     1.9   79.8       0.9   82.4     0.8   61.6       5.3   70.5
   1024     1.0   73.0       0.5   79.1     0.6   36.8       3.4   55.0
   1536     0.8   63.6       0.4   64.4     0.8   18.6       3.5   36.1

Steps 3a and 3e, which are compute-intensive kernels, also exhibit good scaling. While ScaLAPACK's eigensolver routine PZHEEVR performs well for smaller numbers of cores, it does not scale to a high number of cores. In fact, the internal communications in this routine become a significant bottleneck, to the extent that steps 3b and 3c become the most expensive part of the calculation on 1536 cores. In Table 2.2, we give the strong scaling efficiencies for the most important parts of the iterations for the L = 32 fm grid. Overall, we observe good scaling up to 512 cores, where we achieve 70.5% efficiency. However, this number drops to 36.1% on 1536 cores, and steps 3b-3c are the main reason for this drop.

In Fig. 2.4, the strong scaling results for the L = 48 fm grid are shown. In this case, the numbers of neutrons and protons are the same, but the wave functions are larger than in the previous case. As a result, the compute-intensive steps 3a, 3e and 4b, which work directly on these wave functions, are significantly more expensive than in the corresponding runs for the L = 32 fm case. As is clearly visible in the figure, on small numbers of nodes the overall iteration time is dominated by these compute-intensive kernels. This changes in larger scale runs, where the time spent in the diagonalization and Löwdin orthonormalization (steps 3b-3c), along with the communication operations, also becomes significant.

[Figure 2.4: Scalability of the MPI-only version of Sky3D for the 3000 neutron and 3000 proton system using the L = 48 fm grid. (Time per iteration in seconds vs. number of CPU cores from 32 to 1536, broken down by steps 2, 3a, 3b-3c, 3d, 3e, 4a, 4b, 5 and communication.)]

The increased size of the wave functions and the increased computational intensity actually result in better scalability. As shown in Table 2.3, we observe 93.1% efficiency during matrix construction and 95.5% efficiency for the recombine step on 1024 cores. We note that the calculate matrix step's efficiency can actually be greater than 100% owing to perfect square core counts like 64 and 256 cores. However, as in the previous case, the diagonalization and orthonormalization steps do not scale well for larger numbers of cores. Overall, Sky3D shows good scalability for small to moderate numbers of nodes, but this decreases slightly with increased core counts. This decrease in efficiency is mainly due to the poor scaling of ScaLAPACK's eigensolver used in the diagonalization and orthonormalization steps, and partially due to the cost of having to communicate large wave functions at each Sky3D iteration.
Table 2.3: Scalability of the MPI-only version of Sky3D for the L = 48 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.

           calc. matrix     recombine      diag+Löwdin        Total
  cores    time    eff      time    eff    time    eff      time    eff
     32    87.3    100      42.1    100     7.4    100      192     100
     64    43.1  101.2      21    100.3     2.9    127      95.1  100.9
    128    24.7   88.5      11     95.9     2.0   92.6      53.6   89.6
    256    10.6    102       4.9  107.2     1.2   80.3      25.51  94.1
    512     5.8     94       2.8   95.2     0.9   48.9      15     80.2
   1024     2.9   93.1       1.4   95.5     0.6   35.3       8.4   71.3
   1536     2.5   71.7       1.2   70.5     0.8   19.1       7.54    53

2.5.3 Comparison between MPI-only and MPI/OpenMP hybrid parallelization

On Cori, "Haswell" compute nodes contain two sockets with 16 cores each. To prevent any performance degradation due to non-uniform memory accesses (NUMA), we performed our tests using 2 MPI ranks per node, with each MPI rank having 16 OpenMP threads executed on a single socket. Since we group the available cores into neutron and proton groups, which are further divided in half for running the diagonalization and orthonormalization tasks in parallel, we need a minimum of 4 MPI ranks in each test. In Figures 2.5 and 2.7, we show strong scalability test results similar to those of the MPI-only implementation discussed earlier. In this case, the labels along the x-axis denote the total core counts. For example, 128 means that we are running this test on 4 nodes with 8 MPI ranks and 16 OpenMP threads per rank. For the L = 32 fm grid, we have tested the MPI/OpenMP hybrid parallel version with 4, 8, 16, 32, 64 and 96 MPI ranks. For the L = 48 fm grid, we have a larger number of wave functions, for which ScaLAPACK's data redistribution routine PZGEMR2D runs out of memory on low node counts. Hence, we tested this case with 16, 32, 64 and 96 MPI ranks only.

[Figure 2.5: Scalability of the MPI/OpenMP parallel version of Sky3D for the 3000 neutron and 3000 proton system using the L = 32 fm grid. (Time per iteration in seconds vs. number of CPU cores from 64 to 1536, broken down by steps 2, 3a, 3b-3c, 3d, 3e, 4a, 4b, 5 and communication.)]

Table 2.4: Scalability of the MPI/OpenMP parallel version of Sky3D for the L = 32 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.

           calc. matrix     recombine      diag+Löwdin        Total
  cores    time    eff      time    eff    time    eff      time    eff
     64    14.2    100      10.1    100     7.8    100      50.7    100
    128     8.2   86.8       3.34 151.2     6.2   62.3      30.7   82.6
    256     3.8   94.2       1.7  146.8     3.4   56.8      16     79.4
    512     2.2     80       1.1  110.1     2.2   43.5       9.6     66
   1024     1.2   76.4       0.6  106.3     2.4   20.5       6.8   46.7
   1536     1       61       0.4   99.7     1.1   29.8       4.4   48.2

In Fig. 2.5, we show the strong scaling results for the L = 32 fm case, along with detailed efficiency numbers for the computationally expensive steps in Table 2.4. Similar to the MPI-only case, the compute-intensive matrix construction and recombine phases show good scalability, but the diagonalization and Löwdin orthonormalization part does not perform as well as these two parts. While the strong scaling efficiency numbers in this case look better than those of the MPI-only case (see Table 2.2), we note that this is due to the inferior performance of the MPI/OpenMP parallel version for its base case of 64 cores. In fact, the recombine part performs so poorly on 64 cores that its strong scaling efficiency stays over 100% almost all the way up to 1536 cores. Comparing the total execution times, however, we see that the MPI-only code takes 28.4 seconds on average per iteration, while the MPI/OpenMP parallel version takes 50.7 seconds for the same problem on 64 cores. This performance issue in the MPI/OpenMP version persists through all test cases, as shown in Figure 2.6. In particular, for smaller numbers of cores, the performance of the MPI/OpenMP version is poor compared to the MPI-only version.
With increasing core counts, the performance difference lessens slightly, but the MPI-only version still performs better. In general, we observe that the MPI-only version outperforms the MPI/OpenMP version by a factor of about 1.5x to 2x. The main reason behind this is that the thread-parallel ScaLAPACK routines used in the MPI/OpenMP implementation perform worse than the MPI-only ScaLAPACK routines, which is contrary to what one might naively expect given that our tests are performed on a multi-core architecture.

[Figure 2.6: Comparison of the execution times for the MPI-only and MPI/OpenMP parallel versions of Sky3D for the 3000 neutron and 3000 proton system using the L = 32 fm grid. (Time per iteration in seconds vs. number of CPU cores from 64 to 1536.)]

In Fig. 2.7 and Table 2.5, we show the strong scaling results for the L = 48 fm grid. Again, in this case the parts working directly with the wave functions, i.e., the calculation of the matrices (step 3a) and the building of the orthonormalized and diagonalized wave functions (step 3e), become significantly more expensive compared to the diagonalization and Löwdin orthonormalizations (steps 3b & 3c). Of particular note here are the more pronounced communication times during switches between different data distributions, which are mainly due to the larger size of the wave functions. Overall, we obtain 64.5% strong scaling efficiency using up to 96 MPI ranks with 16 threads per rank (1536 cores in total). In terms of total execution times, though, the MPI/OpenMP parallel version still underperforms compared to the MPI-only version (see Fig. 2.8).

[Figure 2.7: Scalability of the MPI/OpenMP parallel version of Sky3D for the 3000 neutron and 3000 proton system using the L = 48 fm grid. (Time per iteration in seconds vs. number of CPU cores from 256 to 1536, broken down by steps 2, 3a, 3b-3c, 3d, 3e, 4a, 4b, 5 and communication.)]

Table 2.5: Scalability of the MPI/OpenMP parallel version of Sky3D for the L = 48 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.

           calc. matrix     recombine      diag+Löwdin        Total
  cores    time    eff      time    eff    time    eff      time    eff
    256    17.3    100       5.6    100     3.4    100      43.7    100
    512    12.5   69.3       3.9   72.5     2.2   77.3      28.1   77.9
   1024     6.2   69.6       1.9   72.5     1.4   60.5      16.2   67.4
   1536     3.5   82.5       1.5   63.6     1.1   50.5      11.3   64.5

[Figure 2.8: Comparison of the execution times for the MPI-only and MPI/OpenMP parallel versions of Sky3D for the 3000 neutron and 3000 proton system using the L = 48 fm grid. (Time per iteration in seconds vs. number of CPU cores from 256 to 1536.)]

2.5.4 Load balancing

In this section, we analyze the performance of our load balancing approach, which divides the available cores into neutron and proton groups for parallel execution. For better presentation, we break down the execution time into three major components: calculations in steps using a 2D data distribution, calculations using a 1D distribution of wave functions, and communication times. In Fig. 2.9(a), we show the time taken by the cores in the neutron and proton groups for the 3000 neutron and 3000 proton system using the L = 48 fm grid; this is essentially the same plot as in the previous section, but with the timings for neutrons and protons shown separately. As this system has an equal number of neutrons and protons, the available cores are divided equally into two groups. As can be seen in Fig. 2.9(a), the time needed for the different steps in this case is almost exactly identical for neutrons and protons.
In Figure 2.9(b), we present the results for a system with 4000 neutrons and 2000 protons.

Figure 2.9: Times per iteration for neutron and proton processor groups, illustrating the load balance for the 3000 neutrons and 3000 protons (a), 4000 neutrons and 2000 protons (b), and 5000 neutrons and 1000 protons (c) systems using the L = 48 fm grid. Due to memory constraints, the latter two cases cannot be calculated using 32 CPU cores.

In this case, according to our load balancing scheme, the number of cores in the neutron group will be roughly 4x larger than the number of cores in the proton group, because we distribute the cores based on the ratio of the squares of the number of particles in each group. We observe that all three major parts are almost equally balanced for up to 1024 cores, but the 2D calculations for neutrons are slightly more expensive on 1536 cores. A more detailed examination reveals that this difference is due to the eigendecomposition times in steps 3b-3c. However, it is relatively minor compared to the total execution time per iteration. Note that the times for the steps with a 1D distribution show some variation for the system with 4000 neutrons and 2000 protons. This is because we split the available cores into neutron and proton groups based on the cost of the steps with 2D data distributions. Consequently, the 1D distributed steps take more time on the proton processor group, but this difference is negligible in comparison to the cost of the 2D distributed steps.

In Figure 2.9(c), results for a more challenging case with 5000 neutrons and 1000 protons are presented. In this case, the majority of the available cores are assigned to the neutron group - more precisely, the ratio between the sizes of the two groups is roughly 25. We observe that the 1D calculations take significantly more time for protons in this case, but any potential load imbalances are compensated by the reduced 2D calculation times for protons. A further inspection of the execution time of each step for the 5000 neutron and 1000 proton system is given in Figure 2.10. This inspection reveals that the time needed for neutrons and protons mainly differs for steps 3b-3c and step 3d due to the large difference between the neutron and proton counts. But these differences are not significant compared to the other computationally heavy steps, which are well load balanced. As shown in Table 2.6, our implementation still achieves about 50% strong scaling efficiency on 1536 cores for this challenging case with 5000 neutrons and 1000 protons.

Figure 2.10: A detailed breakdown of per iteration times for neutron and proton processor groups, illustrating the load balance for the 5000 neutrons and 1000 protons system.

Table 2.6: Scalability of MPI-only version of Sky3D for the 5000 neutrons and 1000 protons system using the L = 48 fm grid. Time is given in seconds, and efficiency (eff) is given in percentages.
        calc. matrix    recombine       diag+Löwdin     Total
cores   time    eff     time    eff     time    eff     time    eff
64      72.9    100     38.6    100     2.3     100     170     100
128     37.1    98.1    18.6    103.4   1.35    85      87      97.6
256     15.7    115.8   7.1     136.9   2.75    20.82   35.9    118.3
512     10      91.3    4.6     105.7   2.1     13.6    23.5    90.3
1024    5.4     83.9    2.4     99.2    2.7     5.3     16.3    65.2
1536    3.6     84      1.7     96.7    2.5     3.8     12.4    57.2

2.5.5 Conclusion of this work

In this work, we described efficient and scalable techniques used to parallelize Sky3D, a nuclear DFT solver that operates on an equidistant grid, in a pure MPI framework as well as in a hybrid MPI/OpenMP framework. By carefully analyzing the computational motifs in each step and the data dependencies between different steps, we used a 2D decomposition scheme for the Sky3D kernels working with matrices, while using a 1D scheme for those performing computations on wave functions. We presented load balancing techniques which can efficiently leverage high degrees of parallelism by splitting the available processors into neutron and proton groups. We also presented algorithmic techniques that reduce the total execution time by overlapping the diagonalization and orthogonalization steps using subgroups within each processor group. Detailed performance analysis on a multi-core architecture (Cori at NERSC) reveals that parallel Sky3D can achieve good scaling to a moderately large number of cores. Contrary to what one might naively expect, the MPI-only implementation outperforms the hybrid MPI/OpenMP implementation, mainly because ScaLAPACK's eigendecomposition routines perform worse in the hybrid parallel case. For larger core counts, the disparity between the two implementations seems to be less pronounced. As a result of these detailed performance evaluations, we have observed that 256 to 1024 processors are reasonably efficient for nuclear pasta simulations, and we consider these core counts for production runs, depending on the exact calculation.

Using the new MPI parallel Sky3D code, we expect that pasta phases can be calculated for over 10,000 nucleons in a fairly large box using a quantum mechanical treatment. As a result, we expect to reach an important milestone in this field. We plan to calculate properties of more complicated pasta shapes and investigate defects in pasta structures which occur in large systems.

This work motivated us to work with large scale sparse matrices on newer architectures. We first started with a blocked eigensolver and tried to optimize its performance. I discuss this in more detail in the next chapter.

Chapter 3
OPTIMIZATION IN LARGE SCALE DISTRIBUTED SPARSE MATRICES

3.1 Eigenvalue Problem in CI Calculations

Nuclear physics faces the multiple hurdles of a very strong interaction, three-nucleon interactions, and complicated collective motion dynamics. The eigenvalue problem arises in nuclear structure calculations because the nuclear wave functions Ψ are solutions of the many-body Schrödinger equation expressed as a Hamiltonian matrix eigenvalue problem, HΨ = EΨ. In the CI approach, both the wave functions Ψ and the Hamiltonian H are expressed in a finite basis of Slater determinants (anti-symmetrized products of single-particle states, typically based on harmonic oscillator wave functions). Each element of this basis is referred to as a many-body basis state.
The representation of H within an A-body basis space, using up to k-body interactions with k ≤ A, results in a sparse symmetric matrix Ĥ. Thus, in CI calculations, Schrödinger's equation becomes an eigenvalue problem, where one is interested in the lowest eigenvalues (energies) and their associated eigenvectors (wave functions). A specific many-body basis state corresponds to a specific row and column of the Hamiltonian matrix. A nonzero in the Hamiltonian matrix indicates the presence of an interaction between either the same or different many-body basis states. Both the total number of many-body states N (the dimension of Ĥ) and the total number of nonzero matrix elements in Ĥ are controlled by the number of nuclear particles, by the truncation parameter Nmax, which is the maximum number of HO quanta above the minimum for a given nucleus (see Fig. 3.1), and by the maximum number of particles allowed to interact simultaneously in the Hamiltonian in Eq. (3). Higher Nmax values yield more accurate results, but at the expense of an exponential growth in problem size. Many nuclear applications seek to reach at least an Nmax of 10 in order to establish a sequence of values of observables as a function of Nmax, from which exact answers can be estimated through extrapolations to infinite Nmax.

3.2 Motivation and CI Implementation

As the load balancing issues and communication overheads on distributed memory systems have been addressed in our previous work [78, 103, 104], here we mainly focus on the performance of the thread-parallel computations within a single MPI rank. Conventionally, in MFDn as well as in other CI codes, the Lanczos algorithm is used due to its excellent convergence properties. However, the locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm [30], a block eigensolver, is an attractive alternative for a number of reasons. First, the LOBPCG algorithm allows many-body wave functions from closely related model spaces (e.g., a smaller basis or different single-particle wave functions) to be used as good initial guesses. Second, the LOBPCG algorithm can easily incorporate an effective preconditioner, which can often be constructed based on physics insights to significantly improve convergence. Third, and most relevant to our focus in this paper, the LOBPCG algorithm naturally leads to an implementation with a high arithmetic density, as the main computational kernels involved are the multiplication of a sparse matrix with multiple vectors, and level-3 BLAS operations on dense vector blocks, as opposed to the SpMVs and level-1 BLAS operations that are the building blocks of Lanczos. Finally, although not studied here, we note that the potential benefits of a block eigensolver can be even more significant for CI implementations based on on-the-fly computation of the Hamiltonian.

Figure 3.1: The dimension and the number of non-zero matrix elements of the various nuclear Hamiltonian matrices as a function of the truncation parameter Nmax. While the bottom panel is specific to 16O, it is also representative of a wider set of nuclei [2, 3].

Alg. 3 gives the pseudocode for a simplified version of the LOBPCG algorithm without preconditioning. LOBPCG is a subspace iteration method that starts with an initial guess of the eigenvectors (Ψ_0) and refines its approximation at each iteration (Ψ_i). R_i denotes the residual associated with each eigenvector, and P_i contains the direction information from the previous step. Hence, in Alg. 3, Ψ_i, R_i and P_i correspond to dense blocks of vectors. To ensure good convergence, the dimension of the initial subspace, m, is typically set to 1.5 to 2 times the number of desired eigenpairs nev. For numerical stability, the converged eigenpairs are locked, i.e., m gets smaller as the algorithm progresses.
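Since Alg. 3 itself is not reproduced in this section, a textbook-style sketch of one unpreconditioned LOBPCG iteration is given here only to fix the notation used above; it is a generic formulation (with Ψ_i kept orthonormal) and is not necessarily identical to Alg. 3:

\begin{align*}
E_i &= \Psi_i^T H \Psi_i, \qquad R_i = H\Psi_i - \Psi_i E_i && \text{(Rayleigh quotients and residuals)}\\
S_i &= [\,\Psi_i \;\; R_i \;\; P_i\,] && \text{(search subspace; $P_0$ is empty)}\\
(S_i^T H S_i)\,C &= (S_i^T S_i)\,C\,\Theta && \text{(Rayleigh--Ritz; keep the $m$ lowest Ritz pairs)}\\
\Psi_{i+1} &= S_i\,C, \qquad P_{i+1} = [\,0 \;\; R_i \;\; P_i\,]\,C && \text{(new iterates and directions)}
\end{align*}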
In the rest of this paper, we present our techniques to improve the efficiency of the sparse matrix computations and the dense vector block operations that constitute the key kernels of LOBPCG. We then evaluate the impact of our techniques on real-world problems and compare the performance of our optimized LOBPCG implementation with a Lanczos-based solver.

The CI method is implemented in MFDn [2, 3]. A major challenge in CI is the massive size of the matrix Ĥ ∈ R^{N×N}, where N can be in the range of several billions and the total number of nonzeros can easily exceed trillions. Since only the low-lying eigenpairs are of interest, iterative eigensolvers are used to tame the computational cost [103, 104]. As the identification of the nonzeros in Ĥ and the calculation of their values is very expensive, MFDn constructs the sparse matrix only once and preserves it throughout the computation. To accelerate matrix construction and reduce the memory footprint, only half of the symmetric Ĥ matrix is stored in the distributed memory available. A unique 2D triangular processor grid is then used to carry out the computations in parallel [103, 104]. In this scheme, a "diagonal" processor stores only the lower triangular part of a sub-matrix along the diagonal of Ĥ. Each "non-diagonal" processor, a processor that owns a sub-matrix from either the lower or the upper half of Ĥ, is also assigned the operations related to the transpose of that sub-matrix. A well-balanced distribution of the nonzeros among processors is ensured through efficient heuristics [78]. Exploiting symmetry in MFDn demands SpMV^T (SpMM^T) in addition to the SpMV (SpMM) operations, and thus data structures that efficiently implement both operations. The accuracy of single-precision arithmetic is in general sufficient to calculate the physical observables. Hence, in MFDn, the Hamiltonian matrix is stored in single precision to further reduce the memory footprint.

3.3 Multiplication of the Sparse Matrix with Multiple Vectors (SpMM)

To exploit symmetry in a block eigensolver, each process must perform a conventional SpMM (Y = AX), as well as a transpose operation SpMM^T (Y = A^T X), where A corresponds to the local partition of Ĥ and X to a row partition of Ψ_i. Y is the output vector block in each case. The number of rows and columns of A are typically very close to each other; therefore, for simplicity, we take A to be a square matrix of size n × n. Both X and Y are dense vector blocks of dimensions n × m. As SpMM and SpMM^T are performed in separate phases of the MPI parallel algorithm [104], we use the same input/output vectors to simplify the presentation.

Naively, one can realize SpMM by storing the vector blocks in column-major order and applying one SpMV to each column of X. However, to exploit spatial locality, a row-major layout should be preferred for the vector blocks X and Y. This format also ensures good data locality for the tall-skinny matrix operations of LOBPCG. Thus, the simplest SpMM implementation can be written as an extension of SpMV, where the operation on scalar elements y_i = Σ_j A_{i,j} x_j becomes an operation on m-element vectors Y_i = Σ_j A_{i,j} X_j. The input and output vectors can be aligned to 32-byte boundaries for efficient vectorization of the m-element loops. This operation can be implemented by looping over each nonzero A_{i,j}.
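As a concrete illustration of this row-major, nonzero-by-nonzero formulation, below is a minimal CSR-based SpMM sketch in C++; the array names (rowptr, colidx, vals) and the compile-time block width M are illustrative assumptions rather than MFDn's actual data structures:

#include <cstddef>
#include <vector>

// Minimal CSR SpMM sketch: Y = A * X, with X and Y stored row-major and
// M values per row (the "m-element vectors" described above).
// Assumed layout: X[i*M + k] holds element k of row i of the vector block.
template <int M>
void spmm_csr(std::size_t n,
              const std::vector<std::size_t>& rowptr,  // length n + 1
              const std::vector<int>& colidx,          // length nnz
              const std::vector<float>& vals,          // length nnz
              const std::vector<float>& X,             // n*M, row-major
              std::vector<float>& Y)                   // n*M, row-major
{
  #pragma omp parallel for schedule(dynamic)
  for (std::size_t i = 0; i < n; ++i) {
    float acc[M] = {0.0f};                             // accumulator for row i of Y
    for (std::size_t p = rowptr[i]; p < rowptr[i + 1]; ++p) {
      const float a = vals[p];
      const float* xj = &X[(std::size_t)colidx[p] * M];
      for (int k = 0; k < M; ++k)                      // vectorizable m-element loop
        acc[k] += a * xj[k];
    }
    for (int k = 0; k < M; ++k)
      Y[i * M + k] = acc[k];
  }
}

With M known at compile time, the inner m-element loop can be unrolled and vectorized by the compiler, which is the effect that the manual unrolling described later in Sect. 3.5 aims for.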
3.4 Matrix Storage Formats

The most common sparse matrix storage format is compressed sparse rows (CSR), in which the nonzeros of each matrix row are stored consecutively as a list in memory. One maintains an array of pointers (which are simply integer offsets) into the list of nonzeros in order to mark the beginning of each row. An additional index array is used to keep the column index of each nonzero. Nonzero values and column indices are stored in separate arrays of length nnz, and the row pointers array is of length n + 1. For single-precision sparse matrices whose local row and column indices can be addressed with 32-bit integers (i.e., n ≤ 2^32 − 1), the storage cost of the CSR format is 8·nnz + 4(n + 1) bytes. One may reuse matrices stored in the CSR format for the SpMM^T operation by reinterpreting row pointers and column indices as column pointers and row indices, respectively. Such an interpretation corresponds to a compressed sparse column (CSC) representation, in which one operates on columns rather than rows to implement the SpMM^T operation.

Large vector blocks (with 4 ≤ m ≤ 50 and n > 10^6) can potentially prevent a CSR based SpMM implementation from taking full advantage of the locality in the vector blocks, depending on the matrix sparsity structure. After a few rows, it is likely that vector data will have been evicted from the L2 cache, while after a few hundred rows, it is very likely that data will have been evicted from even the last level L3 cache. Moreover, in a thread-parallel SpMM^T, CSC's scatter operation on thread-private output vectors (necessary to prevent race conditions), coupled with the reduction required for the partial thread results, can significantly impede performance [2]. Thus, it is imperative that we adopt a data structure that can attain good locality for the vector blocks and does not suffer from the performance penalties associated with the CSR and CSC implementations.

Our data structure for storing sparse matrices is a variant of the compressed sparse blocks (CSB) format [77]. For a given block size parameter β, CSB nominally partitions the n × n local matrices into β × β blocks. When β is on the order of √n, we can address the nonzeros within each block using half the bits needed to index into the rows and columns of the full matrix (16 bits instead of 32 bits). Therefore, for β = √n, the storage cost of CSB matches the storage cost of traditional formats such as CSR. In addition, CSB automatically enables cache blocking [105]. In the CSB format, each β × β block is independently addressable through a 2D array of pointers. The SpMM operation can then be performed by processing this 2D array by rows, while SpMM^T can simply be realized by processing it by columns. The formal CSB definition does not specify how the nonzeros are stored within a block.
An existing implementation of CSB for sparse matrix–vector (SpMV) and transpose sparse matrix–vector (SpMV^T) multiplication stores the nonzeros within each block using a space filling curve to exploit data locality and to enable efficient parallelization of the blocks themselves [77].

3.5 Methodology and Optimization

CSR/OpenMP: Our baseline SpMM implementation uses the CSR format. The SpMM operation was threaded using an OpenMP parallel for loop with dynamic scheduling over the matrix rows. The SpMM^T operation was threaded over columns (which are simply reinterpretations of the CSR rows for the transpose), where each thread uses a private copy of the output vector block to prevent race conditions. The private copies are then reduced (using thread parallelism) to complete the SpMM^T operation.

Rowpart/OpenMP: On multi-core CPUs with several cores, the CSR implementation above is certainly not suitable for performing SpMM^T on large sparse matrices. Thread-private copies of the output vector require an additional O(nmP) storage, where P denotes the number of threads. In fact, more storage space than the sparse matrix itself could be needed even for small values of m for matrices with only tens of nonzeros per row. In terms of performance, thread-private output vectors may adversely affect data reuse in the last level of cache, and they require an expensive post-processing step. Therefore, we implemented the Rowpart algorithm. It is identical to our baseline CSR implementation for SpMM, but for a memory efficient and load balanced SpMM^T, it preprocesses the columns of the transpose matrix and determines row indices for each thread such that the row partitions assigned to threads contain a (roughly) equal number of nonzeros. Each thread then maintains the starting and ending indices of its row partition boundaries per column. The extra storage cost of Rowpart is only O(nP), and the preprocessing overheads are insignificant when used in an iterative solver.

CSB/OpenMP: Our new parametrized implementation for SpMM and SpMM^T, CSB/OpenMP, is based on the CSB format. Like the other implementations, CSB/OpenMP is written in Fortran using OpenMP. As shown in Fig. 3.2, the matrix is partitioned into β × β blocks that are stored in coordinate format (COO) with 16-bit indices and 32-bit single-precision values. SpMM is threaded over individual rows of blocks (corresponding to β × n slices of the matrix), which creates block rows of size Pβ × n. In SpMM^T, threads sweep through block columns of size n × Pβ and use the COO's row indices as column indices and vice versa. We tune for the optimal value of β for each value of m for a given matrix.

Figure 3.2: Overview of the SpMM operation with P = 4 threads. The operation proceeds by performing all P β × β local SpMM operations Y = AX + Y one blocked row at a time. The operation A^T X is realized by permuting the blocking (β × Pβ blocks).
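A rough C++ sketch of the blocked storage and the block-row-threaded SpMM loop described above is given below. The struct and field names are assumptions (the thesis's CSB/OpenMP kernel itself is written in Fortran), and the per-block nonzeros are simply traversed in their stored COO order:

#include <cstdint>
#include <vector>

// One beta x beta block, stored in COO with 16-bit local indices
// and 32-bit single-precision values, as described above.
struct CsbBlock {
  std::vector<uint16_t> row;   // local row index within the block
  std::vector<uint16_t> col;   // local column index within the block
  std::vector<float>    val;   // nonzero values
};

// Illustrative CSB container: a 2D array of blocks (nbr x nbc).
struct CsbMatrix {
  std::size_t beta, nbr, nbc;          // block size, #block rows, #block columns
  std::vector<CsbBlock> blocks;        // blocks[bi * nbc + bj]
};

// Y += A * X with X, Y stored row-major and m values per row (matching the
// Y = AX + Y formulation in Fig. 3.2). Threading over block rows keeps each
// beta-row slice of Y private to one thread; SpMM^T would loop over block
// columns instead and swap the roles of row[] and col[].
void spmm_csb(const CsbMatrix& A, int m,
              const std::vector<float>& X, std::vector<float>& Y)
{
  #pragma omp parallel for schedule(dynamic)
  for (std::size_t bi = 0; bi < A.nbr; ++bi) {
    for (std::size_t bj = 0; bj < A.nbc; ++bj) {
      const CsbBlock& b = A.blocks[bi * A.nbc + bj];
      const std::size_t r0 = bi * A.beta, c0 = bj * A.beta;
      for (std::size_t p = 0; p < b.val.size(); ++p) {
        const float a = b.val[p];
        float*       y = &Y[(r0 + b.row[p]) * m];
        const float* x = &X[(c0 + b.col[p]) * m];
        for (int k = 0; k < m; ++k)      // vectorizable m-element update
          y[k] += a * x[k];
      }
    }
  }
}

Dynamic scheduling over block rows is what provides load balance here, mirroring the design choice described for CSB/OpenMP.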
CSB/Cilk: For comparisons with the original Cilk-based CSB, we extended the fully parallel SpMV and SpMV^T algorithms [77] in CSB to operate on multiple vectors. We used a vector of std::array's, a compile-time fixed-size variant of the built-in arrays, for storing X and Y. This effectively creates tall-skinny matrices in row-major order. CSB heuristically determines the block parameter β, considering the parallel slackness, the size of the L2 cache, and the addressability by 16-bit indices. The parameter β chosen for the single vector cases presented in Sect. 3.8 was 16,384 or 8,192 (depending on the matrix), and it got progressively smaller, all the way down to β = 1024, as m increases (due to increased L2 working set limitations).

SpMM and SpMM^T implemented using CSB/Cilk employ three levels of parallelism. For SpMM (the transpose case is symmetric), it first parallelizes across rows of blocks, then within dense rows of blocks using temporary vectors, and finally within sufficiently dense blocks if needed. The additional parallelization costs of the second and third levels are amortized by applying them only to sufficiently dense rows of blocks and individual blocks that threaten load balance. Such blocks and rows of blocks can be shown to have enough work to amortize the parallelization overheads. Our CSB/OpenMP implementation differs from the CSB/Cilk implementation in that CSB/OpenMP does not parallelize within individual rows/columns of blocks or within dense blocks. Rather, CSB/OpenMP partitions the sparse matrix into a sufficiently large number of rows/columns of blocks by choosing an appropriate β. Dynamic scheduling is leveraged to ensure load balance among threads.

In all implementations (CSR, Rowpart, CSB/OpenMP, CSB/Cilk), the innermost loops (Y_i = Σ_j A_{i,j} X_j for SpMM and Y_j = Σ_i A_{i,j} X_i for SpMM^T) were manually unrolled for each m value. In Fortran, !dir$ simd directives, and in C, #pragma simd always pragmas were used for vectorization. We inspected the assembly code to ensure that packed SIMD/AVX instructions were generated for best performance. To minimize TLB misses, we used large pages during compilation and at runtime.

3.6 An Extended Roofline Model for CSB

Conventional wisdom suggests that SpMV performance is a function of the STREAM bandwidth and the data movement from compulsory misses on matrix elements. Then the simplified Roofline model [75] provides a lower bound on SpMV time of 8 · nnz/BW_stream for single-precision CSR matrices [106]. This simple analysis may lead one to conclude that performing SpMVs on multiple right-hand sides (SpMM) is essentially no more expensive than performing one SpMV. Unfortunately, this is premised on three assumptions: (i) compulsory misses for the vectors are small compared to the matrix, (ii) there are few capacity misses associated with the vectors, and (iii) cache bandwidth does not limit performance. The first premise is certainly invalidated once the number of right-hand sides reaches half the average number of nonzeros per row (assuming an 8-byte total space for single-precision nonzeros, 4-byte single-precision vector elements, and a write-allocate cache). The second would be true for low-bandwidth matrices with working sets smaller than the last level cache. The final assumption is highly dependent on the microarchitecture, the matrix sparsity structure, and the value of m. We observe that for MFDn matrices and moderate values of m, this conventional wisdom fails to provide a good performance bound.

In this paper, we construct an extended Roofline performance model that captures how cache locality and bandwidth interact to tighten the performance bound for CSB-like sparse kernels. Let us consider three progressively more restrictive cases: vector locality in the L2, vector locality in the L3, and vector locality in DRAM. As it is highly unlikely that a β × β block acting on multiple vectors attains good vector locality in the tiny L1 caches, we ignore this case.
Although it is potentially an optimistic assumption, we assume that we may always hit the peak L2, L3, or DRAM bandwidth, with the caveat that, on average, we overfetch 16 bytes.

First, if we see poor L1 locality for the block of vectors but good L2 locality, then for each nonzero, CSB must read 8 bytes of nonzero data, 4m bytes of the source vector, and 4m bytes of the destination vector. It may then perform 2m flops and write back 4m bytes of destination data. Thus we perform 2m flops and must move 8 + 12m bytes, ideally at the peak L2 bandwidth. Ultimately, this would limit SpMM performance to 6.6 GFlop/s per core, or about 80 GFlop/s per chip, on Edison, which has an L2 cache bandwidth of 40 GB/s per core (see Sect. 3.8.1). One should observe that we have assumed high locality in L2. As this is unlikely, this bound is rather loose.

Unfortunately, static analysis of sparse matrix operations has its limits. In order to understand how locality in the L3 and the L3 bandwidth constrain performance, we implemented a simplified L2 cache simulator to calculate the number of capacity misses associated with accessing X and Y. For each β × β block, the simulator estimates the size of the L2 working set based on the average number of nonzeros per column. When the average number of nonzeros per column is less than one, the working set is bounded by (8m + 32) · nnz bytes: each nonzero requires a block of the source vector and a block of the destination vector, plus overfetch. When the average number of nonzeros per column reaches one, we saturate the working set at 8mβ bytes, i.e., full blocks of the source and destination vectors. If the working set is less than the L2 cache capacity, we must move 8 · nnz + 4mβ bytes when the number of nonzeros per column is equal to or greater than 1, and (8 + 4m + 16) · nnz bytes (but never more than 8 · nnz + 4mβ bytes) when the number of nonzeros per column is less than 1 (a miss on the nonzero and the source vector). If the working set exceeds the cache capacity, then we forgo any assumptions on the reuse of X or Y in the L2 and incur (8 + 4m + 16) · nnz + 8mβ bytes of data movement. So, this bound on data movement depends on both m and the input matrix.

Finally, let us consider the bound due to a lack of locality in the L3 and finite DRAM bandwidth. As shown in Fig. 3.2, CSB matrices are partitioned into blocks of size β × β, and P threads stream through block rows (or block columns for SpMM^T) performing local SpMM operations on blocks of size Pβ × β. If one thread (a β × β block) gets ahead of the others, then it will likely run slower, as it is reading X from DRAM while the others are reading X from the last level cache. Thus, we created a second simplified cache simulator to track DRAM data movement, which tracks how a chip processes each Pβ × β block, rather than tracking how individual cores process their β × β blocks. Our model streams through the block rows of a matrix (as in Fig. 3.4) and, for each nonzero Pβ × β block, examines its cache to determine whether the corresponding block of X is present. If it misses, then it fetches the entire block and increments the data movement tally. If the requisite cache working set exceeds the cache capacity, then we evict a block (LRU policy). Finally, we add the nonzero data movement and the read-modify-write data movement associated with the output vector block Y (8nm bytes). Ultimately, the combined estimates for DRAM, L2, and L3 data movement provide us with a narrow range of expected SpMM performance as a function of m.
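As a worked example of the L2 case above, using the 40 GB/s per-core L2 bandwidth quoted for Edison, the bound can be written as

\[
P_{L2}(m) \;=\; \frac{2m}{8 + 12m}\,\mathrm{BW}_{L2},
\qquad
P_{L2}(16) \;=\; \frac{32}{200}\times 40\ \mathrm{GB/s} \;\approx\; 6.4\ \mathrm{GFlop/s},
\qquad
\lim_{m\to\infty} P_{L2}(m) \;=\; \frac{2}{12}\times 40 \;\approx\; 6.7\ \mathrm{GFlop/s}
\]

per core, which recovers the roughly 6.6 GFlop/s per core (about 80 GFlop/s for the 12-core chip) stated above.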
For low arithmetic intensity (small m), the Roofline suggests that we would be DRAM-bound; when the Roofline plateaus, it is likely to do so because of either the L2 or the L3 bandwidth, rather than the peak FLOP rate. In the future, we plan to use this lightweight simulator as a model-based replacement for the expensive empirical tuning of β.

3.7 Kernels with Tall and Skinny Matrices

Besides sparse matrix operations, all block methods require operations on the dense blocks of vectors themselves, which we denote as tall-skinny matrix computations owing to the shape of the multiple-vector structures involved. The LOBPCG algorithm mainly involves inner product and linear combination operations. Performance in these kernels is critical for the overall eigensolver performance for three reasons. First, an optimized SpMM algorithm incurs a significantly reduced cost on a per-SpMV basis. Second, while the per-iteration cost of vector operations is O(N) for Lanczos-like solvers, in block methods these operations cost O(Nm^2), which grows quickly with m. Finally, and most importantly, the LOBPCG algorithm involves several of these operations in each iteration. For example, computing E_i and updating the residual R_i before the Rayleigh–Ritz procedure in Alg. 3 requires an inner product and a linear combination, respectively. The Rayleigh–Ritz procedure itself requires computing the overlap matrix between each pair of the current Ψ_i, R_i, P_i vectors, as well as their overlap with the vector blocks from the previous iteration, leading to a total of 18 inner product operations. Following the Rayleigh–Ritz procedure are the updates to the Ψ_i, R_i, P_i blocks of vectors for the next iteration, which require computing linear combinations. There are a total of 10 such linear combination operations per Rayleigh–Ritz procedure.

Note that the Hamiltonian matrix in MFDn is partitioned into a 2D triangular grid, and during parallel SpMM the X and Y vector blocks are shared/aggregated among the processes in the row/column groups of this triangular grid [103, 104]. Efficient parallelization of the LOBPCG operations requires further partitioning X and Y among the processes in the same row groups, resulting in smaller local blocks of vectors of size l × m (l = n/p_row, where p_row is the number of processes in a row of the triangular grid). In fact, as shown in Alg. 3, each process has to keep several matrices of size l × m due to the need for storing the residual R and previous direction P information locally. Since we are mainly interested in the performance of the kernels, we will generically denote the local l × m blocks of vectors as V and W. We observe that achieving numerical stability with LOBPCG requires using double-precision arithmetic for most MFDn calculations. Therefore, we convert the result of the sparse matrix operations into double precision before starting the LOBPCG computations.

We denote the inner product of two blocks of vectors V and W as V^T W, and the linear combination of the vectors in a block by a small square coefficient matrix C as V C. Both V^T W and V C have high arithmetic intensities. Specifically, for both kernels the number of flops is O(lm^2) and the total data movement is O(lm), yielding an arithmetic intensity of O(m). These kernels can be implemented as level-3 BLAS operations using optimized math libraries such as Intel's MKL or Cray's LibSci. While one would expect to achieve a high percentage of the peak performance (especially for large m), as demonstrated in Sect. 3.8.7,
both MKL and LibSci perform poorly for these kernels. This is most likely due to the unusual shape of the matrices involved (typically l ≫ m for large-scale computations). To eliminate the performance bottlenecks in the V^T W and V C computations, we developed custom thread-parallel implementations for them.

Fig. 3.3 gives an overview of our V^T W implementation. We store V and W in row-major order, consistent with the storage of the vector blocks in the sparse matrix computations. We partition V and W into small row blocks of size s × m, and compute the inner product V^T W by aggregating the results of (vendor tuned) dgemm operations between a row block in V and the corresponding one in W. The loop over the s × m blocks is thread parallelized using OpenMP. To achieve load balance with minimal scheduling overheads, we use the guided scheduling option. Race conditions in the output matrix are resolved by keeping a thread-private buffer matrix of size m × m. We perform a reduction, which is also thread-parallel, over the buffer matrices to compute the final overlap matrix.

Figure 3.3: Performance in GFlop/s for vector block inner product, V^T W, and vector block scaling, V X, kernels using the Intel MKL and Cray LibSci libraries on a Cray XC30 system (Edison @ NERSC).

Our custom V C kernel is implemented similarly, by partitioning V into row blocks. In this case, C is a square matrix of size m × m which is read-shared by all threads. Again, the loop over the s × m blocks of V is thread parallelized with guided scheduling. To prevent race conditions, we let each thread perform the computation using the full C matrix, i.e., a dgemm between matrices of size s × m and m × m. Each thread then uniquely produces the corresponding set of s output rows.
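A compact OpenMP sketch of this blocked inner-product scheme is given below. It is a simplified illustration under stated assumptions: the per-block product is written as plain loops (whereas the implementation described above aggregates vendor dgemm calls on each s × m block), and the function and parameter names are hypothetical:

#include <omp.h>
#include <vector>

// C = V^T * W for tall-skinny V, W of size l x m (row-major), following the
// row-blocked scheme described above: each thread accumulates partial m x m
// results for its s x m blocks into a private buffer, and the buffers are
// reduced at the end. The block size s is a tuning parameter (assumption).
void vtw(std::size_t l, int m, int s,
         const std::vector<double>& V, const std::vector<double>& W,
         std::vector<double>& C)                       // C is m*m, row-major
{
  const int nthreads = omp_get_max_threads();
  std::vector<double> buf((std::size_t)nthreads * m * m, 0.0);

  #pragma omp parallel
  {
    double* local = &buf[(std::size_t)omp_get_thread_num() * m * m];

    // Loop over s x m row blocks; guided scheduling as described in the text.
    #pragma omp for schedule(guided)
    for (std::size_t r0 = 0; r0 < l; r0 += s) {
      const std::size_t r1 = (r0 + s < l) ? r0 + s : l;
      // In the thesis this block product is a vendor dgemm call;
      // plain loops are used here to keep the sketch self-contained.
      for (std::size_t r = r0; r < r1; ++r)
        for (int i = 0; i < m; ++i)
          for (int j = 0; j < m; ++j)
            local[i * m + j] += V[r * m + i] * W[r * m + j];
    }
  }

  // Thread-parallel reduction of the private buffers into C.
  C.assign((std::size_t)m * m, 0.0);
  #pragma omp parallel for
  for (int ij = 0; ij < m * m; ++ij)
    for (int t = 0; t < nthreads; ++t)
      C[ij] += buf[(std::size_t)t * m * m + ij];
}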
3.8 Performance Evaluation

We now present a comprehensive evaluation of our methods for sparse and tall-skinny matrix computations.

3.8.1 Experimental setup

We use a series of computations with MFDn. As the overall execution time is dominated by on-node computations, we begin with single-socket performance evaluations of SpMM (Sect. 3.8.2) and LOBPCG computations (Sect. 3.8.7). In Sect. 3.8.8, we inspect the resulting solver's performance in a distributed memory setting.

MFDn matrices: We identified three test cases, "Nm6", "Nm7" and "Nm8", which are matrices corresponding to the 10B Hamiltonian at the Nmax = 6, 7, and 8 truncation levels, respectively. The actual Hamiltonian matrices are very large and therefore are nominally distributed across several processes in the actual calculations. For a given nucleus, the sparsity of Ĥ is determined by (i) the underlying interaction potential, and (ii) the Nmax parameter. We used a 2-body interaction potential; a 3-body or higher order interaction potential would result in denser matrices, presenting more favorable conditions for achieving computational efficiency. For a given nucleus and interaction potential, increasing the Nmax value reduces the density of nonzeros in each row, thereby allowing us to evaluate the effectiveness of our techniques on a range of matrix sparsities. Each process in a distributed memory execution receives a different sub-matrix of the Hamiltonian, but these sub-matrices have similar sparsity structures. For simplicity and consistency, we use the first off-diagonal processor's sub-matrix as our input for the single-socket evaluations. Table 3.1 enumerates the test matrices used in this paper.

Note that the test matrices have millions of rows and hundreds of millions of nonzeros. As discussed in Sect. 3.3, we use the compressed sparse block (CSB) format [77] in our optimized SpMM implementation. Therefore a sparse matrix is stored in blocks of size β × β. When blocked with β = 6000, we observe that both the number of block rows and the average number of nonzeros per nonzero block remain high. Fig. 3.4 gives a sparsity plot of the Nm6 matrix at the block level, where each nonzero block is marked by a dot whose intensity represents the density of nonzeros in the corresponding block. For our test matrices, 41–64% of these blocks are nonzero. We observe a high variance in the number of nonzeros per nonzero block.

Figure 3.4: Sparsity structure of the local Nm6 matrix at process 1 in an MFDn run with 15 processes. A block size of β = 6000 is used. Each dot corresponds to a block with nonzero matrix elements in it. Darker colors indicate denser nonzero blocks.

Table 3.1: MFDn matrices (per-process sub-matrix) used in our evaluations. For the statistics in this table, all matrices were cache blocked using β = 6000.

Matrix                   Nm6           Nm7           Nm8
Rows                     2,412,566     4,985,944     7,583,735
Columns                  2,412,569     4,985,944     7,583,938
Nonzeros (nnz)           429,895,762   648,890,590   592,325,005
Blocked Rows             403           831           1264
Blocked Columns          403           831           1264
Average nnz per Block    7991          4191          2311

Vector block sizes: In nuclear physics applications, up to 20–25 eigenvalues are needed, and often 5–15 eigenpairs will be sufficient. In LOBPCG, to ensure a rich representation of the subspace and good convergence, the number of basis vectors m needs to be set to 1.5 to 2 times the number of desired eigenvectors nev. As the algorithm proceeds and eigenvectors converge, the converged eigenpairs are deflated (or locked) and the subspace shrinks. Therefore, we examine the performance of our optimized kernels for values of m in the range 1 to 48.

Computing platforms: We primarily use high-end multi-core processors (Intel Xeon) for our performance studies. However, the energy efficiency requirements of HPC systems point to an outlook where many-core processors will play a prominent role. To guide our future efforts in this area, we conduct performance evaluations on an Intel Xeon Phi processor as well.

Our multi-core results come from the Cray XC30 MPP at NERSC (Edison), which contains more than 10 thousand 12-core Xeon E5 CPUs. Each of the 12 cores runs at 2.4 GHz, is capable of executing one AVX (8×32-bit SIMD) multiply and one AVX add per cycle, and includes both a private 32 KB L1 data cache and a 256 KB L2 cache. Although the per-core L1 bandwidth exceeds 75 GB/s, the per-core L2 bandwidth is less than 40 GB/s. There are two 128-bit DDR3-1866 memory controllers that provide a sustained STREAM bandwidth of 52 GB/s per processor. The cores, the 30 MB last level L3 cache and the memory controllers are interconnected with a complex ring network-on-chip which can sustain a bandwidth of about 23 GB/s per core.

Our many-core results have been obtained on the Babbage testbed at NERSC. Intel Xeon Phi (MIC) cards are connected to the host processor through the PCIe bus and contain 60 cores running at 1 GHz, with 4 hardware threads per core. Each MIC card has an on-device 8 GB of high bandwidth memory (320 GB/s).
Cores are interconnected by a high-speed bidirectional ring. Each core has a 32 KB L1 data cache and a 512 KB L2 cache locally, with high speed access to all other L2 caches to implement a fully coherent cache. Note that there is no shared last level L3 cache on the MIC cards. Each core supports a 512-bit wide AVX-512 vector ISA that can execute 8 double-precision (or 16 single-precision or integer) operations per cycle. With Fused Multiply-Add (FMA), this amounts to 16 DP or 32 SP FLOPS/cycle. Peak performance for each MIC card is 1 TFlop/s (in DP). The characteristics of both processors are summarized in Table 3.2.

Table 3.2: Overview of Evaluated Platforms.

Processor              Xeon E5-2695 v2    Xeon Phi 5110P
Core                   Ivy Bridge         Pentium P54c
Clock (GHz)            2.4                1.05
Data Cache (KB)        32 + 256           32 + 512
Memory-Parallelism     HW-prefetch        SW-prefetch + MT
Cores/Processor        12                 60
Threads/Processor      24 (1)             4
Last-level L3 Cache    30 MB              --
SP GFlop/s             460.8              2,022
DP GFlop/s             230.4              1,011
Aggregate L2 BW        480 GB/s           960 GB/s (2)
Aggregate L3 BW        276 GB/s           --
STREAM BW (3)          52 GB/s            320 GB/s
Available Memory       32 GB              8 GB

(1) With hyper-threading, but only 12 threads were used in our computations. (2) Based on the saxpy benchmark in [1]. (3) Memory bandwidth is measured using the STREAM copy benchmark.

We use the Intel Fortran compiler with the flags -fast -openmp. For comparison with the original CSB using Intel Cilk Plus, we use the Intel C++ compiler with the flags -O2 -no-ipo -parallel -restr. As Intel Cilk Plus uses dynamically loaded libraries not natively supported by the Cray operating system, we use Cray's cluster compatibility mode, which causes only a small performance degradation. The Xeon Phi's performance was evaluated in native mode, enabled through the -mmic flag in the Intel compilers.

Figure 3.5: Optimization benefits on Edison using the Nm6 matrix for SpMM (top) and SpMM^T (bottom) as a function of m (the number of vectors).

3.8.2 Performance of SpMM and SpMM^T

We first present the SpMM and SpMM^T performance results for our optimized implementations. We report the average performance over five iterations, where the number of requisite floating-point operations is 2 · nnz · m.

3.8.3 Improvement from using CSB

Fig. 3.5 presents SpMM (Y = AX) and SpMM^T (Y = A^T X) performance for the Nm6 matrix as a function of m (the number of vectors). For m = 1, a conventional CSR SpMV implementation does about as well as can be expected. However, as m increases, the benefit of the CSB variants' blocking on cache locality is manifested. The CSB/OpenMP version delivers noticeably better performance than the CSB/Cilk implementation. This may be due in part to performance issues associated with Cray's cluster compatibility mode, but most likely due to the additional parallelization overheads of the Cilk version, which uses temporary vectors to introduce parallelism at the levels of block rows and blocks. This additional level of parallelism is eliminated in CSB/OpenMP by noting that the work associated with each nonzero increases significantly as m increases, and we leverage the large dimensionality of the input vectors for load balancing among threads. Ultimately, we observe that CSB/OpenMP's performance saturates at around 65 GFlop/s for m > 16. This represents a roughly 45% increase in performance over CSR, and a 20% increase over CSB/Cilk.
CSB truly shines when performing SpMM^T. The ability to efficiently thread the computation, coupled with the improvements in locality, allows CSB/OpenMP to realize a 35% speedup over CSR for SpMV and nearly a 4× improvement for SpMM with m ≥ 16. The row partitioning scheme has only a minor benefit, and only at very large m. Moreover, CSB ensures that SpMM and SpMM^T performance are now comparable (67 GFlop/s vs 62 GFlop/s with OpenMP) — a clear requirement, as both computations are required for MFDn. As an important note, we point out that the increase in arithmetic intensity introduced by SpMM allows for more than a 5× increase in performance over SpMV. This should be an inspiration to explore algorithms that transform numerical methods from being memory bandwidth-bound (SpMV) to compute-bound (SpMM).

3.8.4 Tuning for the Optimal Value of β

As discussed previously, we wish to maintain the working set for the X and Y vector blocks as close to the processor as possible in the memory hierarchy. Each β × β block demands a working set of size βm in the L2 for X and Y. Thus, as m increases, we are motivated to decrease β. Fig. 3.6 plots the performance of the combined SpMM and SpMM^T operation using CSB/OpenMP on the Nm8 matrix as a function of m for varying β. For small m, there is either sufficient cache capacity to maintain locality on the block of vectors, or other performance bottlenecks are pronounced enough to mask any capacity misses. However, for large m (shown up to m = 96 for illustrative purposes), we clearly see that progressively smaller β are the superior choice, as they ensure that a constrained resource (e.g., L3 bandwidth) is not flooded with cache capacity miss traffic. Still, note in Fig. 3.6 that no matter what β value is used, the maximum performance obtained for m > 48 is lower than the peak of 45 GFlop/s achieved for lower values of m. This suggests that for large values of m, it may be better to perform SpMM and SpMM^T as batches of tasks with narrow vector blocks. In the following sections, we always use the best value of β for a given m.

Figure 3.6: Performance benefit on the combined SpMM and SpMM^T operation from tuning the value of β for the Nm8 matrix.

3.8.5 Combined SpMM/SpMM^T performance

Our ultimate goal is to include the LOBPCG algorithm as an alternative eigensolver in MFDn. As discussed earlier, the computation of both SpMM and SpMM^T is needed for this purpose. We are therefore interested in the performance benefit for the larger (and presumably more challenging) MFDn matrices. Fig. 3.7 presents the combined performance of SpMM and SpMM^T as a function of m for our three test matrices. Clearly, the CSB variants deliver extremely good performance for the combined operation, with CSB/OpenMP delivering the best performance. We observe that as one increases the number of vectors m, performance increases up to a point at which it saturates. A naive understanding of locality would suggest that, regardless of matrix size, the ultimate SpMM performance should be the same. However, as one moves to the larger and sparser matrices, performance saturates at lower values. Understanding these effects and providing possible remedies requires introspection using our performance model.
3.8.6 Performance analysis

Given the complex memory hierarchies of varying capacities and bandwidths in highly parallel processors, the ultimate bottlenecks to performance can be extremely non-intuitive and require performance modeling. In Fig. 3.7, we provide three Roofline performance bounds based on the DRAM, L3, and L2 data movements and bandwidth limits, as described in Sect. 3.6. In all cases, we use the empirically determined optimal value of β for each m as a parameter in our performance model. The L2 and L3 bounds take the place of the traditional in-core (peak flop/s) performance bounds. Bounding data movement for small m (where compulsory data movement dominates) is trivial and thus accurate. However, as m increases, capacity and conflict misses start dominating. In this regime, quantifying the volume of data movement in a deep cache hierarchy with an unknown replacement policy and unknown reuse pattern is non-trivial. As Fig. 3.4 clearly demonstrates, the matrices in question are not random (the worst case), but exhibit structure. We note that these Roofline curves for large m are not strict performance bounds but rather guidelines.

Clearly, for small m, performance is highly correlated with DRAM bandwidth. As we proceed to larger m, we see an inversion for the sparser matrices, where the L3 bandwidth can surpass the DRAM bandwidth as the preeminent bottleneck. We observe that for the denser Nm6 matrix, performance is close to our optimistic L2 bound. Nevertheless, the model suggests that the L3 bandwidth is the true bottleneck, while the DRAM bandwidth does not constrain performance for m ≥ 8. Conversely, the sparser Nm8's performance is well correlated with the DRAM bandwidth bound for m ≤ 16, at which point the L3 and DRAM bottlenecks reach parity. Ultimately, our Roofline model tracks the performance trends well and highlights the potential bottlenecks — DRAM, L3, and L2 bandwidths and capacities — as one transitions to larger m or to larger and sparser matrices.

Figure 3.7: SpMM and SpMM^T combined performance results on Edison using the Nm6, Nm7 and Nm8 matrices (from top to bottom) as a function of m (the number of vectors). We identify 3 Rooflines (one per level of memory) as per our extended Roofline model for CSB.

3.8.7 Performance of tall-skinny matrix operations

In Fig. 3.8, we analyze the performance of our custom inner product (V^T W) and linear combination (V C) operations, proposed as alternatives to the BLAS library calls for tall-skinny matrices. As mentioned in Sect. 3.7, our custom implementations still rely on the library dgemm calls to perform the multiplications between small matrix blocks. Hence, we report the performance of two different versions: Custom/MKL, based on the MKL dgemm, and Custom/LibSci, based on the LibSci dgemm.

As shown in Fig. 3.8, we obtain significant speedups over MKL and LibSci in computing V^T W. Both of our custom V^T W kernels exhibit a similar performance profile and outperform their library counterparts significantly.
The speedups we obtain range from about 1.7× (for larger values of m) up to 10× (for m ≈ 16). As small m values are common in an application like MFDn, this represents a significant performance improvement for LOBPCG over using the library dgemm. However, for the V C kernel, we do not observe speedups from our custom implementations (Fig. 3.8); they closely track the performance of their library counterparts. In fact, for larger values of m, the Custom/MKL implementation is slightly outperformed by MKL. While l was fixed at 1 M for these tests, we observed similar results in other cases (l = 10 K, 100 K, and 500 K).

A key observation based on Fig. 3.8 is that the overall performance of both kernels is significantly higher for larger m values. Since the LOBPCG algorithm contains operations with multiple blocks of vectors, i.e., Ψ_i, R_i, P_i, one potential optimization is to combine these three blocks of vectors into a single l × 3m matrix. In this case, the 18 inner product operations with l × m matrices would be turned into 2 separate inner products with l × 3m matrices. As the Custom bundled curve shows in Fig. 3.8, the additional performance gains are significant for V^T W, achieving up to a 15× speedup compared to the library counterparts. However, the linear combination operation V C does not benefit from bundling as much as the inner product does. For V C, the improvements we observe are limited to a factor of 1.5 for values of m ranging from 12 to 48. The main reason behind the limited performance gains in this case is that the tall-skinny matrix products can only be converted into dgemms of dimensions l × 3m and 3m × m, as opposed to being extended to 3m in all of the shorter dimensions as in the case of V^T W.
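To unpack the bundling argument above (S_i is an assumed name for the bundled block):

\[
S_i = [\,\Psi_i \;\; R_i \;\; P_i\,] \in \mathbb{R}^{l \times 3m},
\qquad
S_i^T S_i,\;\; S_i^T S_{i-1} \in \mathbb{R}^{3m \times 3m},
\]

so a single product S_i^T S_i yields all nine m × m overlap blocks among the current Ψ_i, R_i, P_i at once, and S_i^T S_{i-1} yields their overlaps with the previous iteration's blocks, replacing the 18 separate l × m inner products with 2 larger ones.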
Full details and analyses of initial guesses and preconditioning techniques used are beyond the scope of this paper and will be discussed in a subsequent publication—we describe these techniques briefly below: Initial Guesses: CI models with truncations smaller than the target Nmax result in signifi- cantly smaller problem sizes. Roughly speaking, reducing Nmax by 2 (subsequent trunca- tions must be evenly separated) gives an order of magnitude reduction in matrix dimensions. We observe that eigenvectors computed with smaller Nmax values provide good approxima- tions to the eigenvectors in the target model space, so we solve the eigenvalue problem of the smaller Nmax first and use these results as initial guesses to our target problem. This idea can be applied recursively for additional performance benefits. Preconditioning: Preconditioners transform a given problem into a form that is more favor- able for iterative solvers, often by reducing the condition number of the input matrix. To be effective in large-scale computations, a preconditioner must be easily parallelizable and computationally inexpensive to compute and apply, while still providing a favorable trans- formation. We build such a preconditioner in MFDn by computing crude approximations to the inverses of the diagonal blocks of the Hamiltonian (easy to parallelize). The diagonal blocks in CI typically contain very large nonzeros (important for a quality transformation). Bundling Blocks of Vectors: While bundling all three blocks of vectors into a single, but thicker tall-skinny matrix is favorable for LOBPCG (see Sect. 3.7), this would harm the performance of the SpMM and SpMMT operations as the locality between consecutive rows of input and output vectors would be lost. A work around to this problem is to copy the input vectors Ψi from the vector bundle into a separate block of vectors at the end of 85 LOBPCG (in preparation for SpMM), and then copy HΨi back into the vector bundle after SpMMT (in preparation for LOBPCG). Our experiments show that overheads associated with such copies are small compared to the gains obtained from bundled V T W and V C operations, hence we opt to bundle the blocks of vectors in LOBPCG into a single one in our implementation. Alignment/Padding: If vector rows are not aligned with the 32-bit word boundaries, the overall performance of the solver is significantly reduced due to the presence of unpacked vector instructions. Hence we set the initial basis dimension m for LOBPCG such that m is a multiple of 4 to ensure vectorization at least with SSE instructions. When m is a multiple of 8, AVX instructions are automatically used, but forcing m to always be a multiple of 8 introduces computational overheads not compensated by AVX vectorization. As the converged eigenvectors need to be locked, the basis size would slowly shrink during LOBPCG iterations. To maintain good performance throughout, we shrink the basis only when the number of active vectors decreases to a smaller multiple of 4, and replace the converged vectors with 0 vectors in the meantime. Our testcases to compare the eigensolvers are the full 10 B problems with Nmax trun- cations of 6, 7 and 8, seeking 8 eigenpairs in all cases. Table 3.3 gives more details for the testcases and the distributed memory runs. 
Our testcases to compare the eigensolvers are the full 10B problems with Nmax truncations of 6, 7 and 8, seeking 8 eigenpairs in all cases. Table 3.3 gives more details on the testcases and the distributed memory runs. We executed the MPI/OpenMP parallel solvers using 6 threads per rank (despite having 12 cores per socket), because Intel's MPI library currently supports only serialized MPI communications by multiple threads (i.e., the MPI_THREAD_SERIALIZED mode). Using 12 threads per rank resulted in not being able to fully saturate the network injection bandwidth, and therefore in increased communication overheads in both cases.

Table 3.3: Statistics for the full MFDn matrices used in the distributed memory parallel Lanczos/FO and LOBPCG executions.

Problem                            10B, Nm6    10B, Nm7    10B, Nm8
Dimension (in millions)            12.06       44.87       144.06
Nonzeros (in billions)             5.48        26.77       108.53
nev (m for LOBPCG)                 8 (12)      8 (12)      8 (12)
Residual tolerance                 1e-6        1e-6        1e-6
MPI ranks (w/ 6 OpenMP threads)    15          66          231
Lanczos: Iterations                240         320         240
Lanczos: SpMVs                     240         320         240
Lanczos: Inner Products            28,800      51,200      28,800
LOBPCG: Iterations                 28          48          38
LOBPCG: SpMVs                      324         548         428
LOBPCG: Inner Products             72,656      121,296     93,936

Figure 3.9: Comparison and detailed breakdown of the time-to-solution using the new LOBPCG implementation vs. the existing Lanczos/FO solver. Nm6, Nm7 and Nm8 testcases executed on 15, 66, and 231 MPI ranks (6 OpenMP threads per rank), respectively, on Edison.

In Fig. 3.9, we break down the overall timing into the following parts: sparse matrix computations (SpMV/SpMM), application of the preconditioner (Precond), orthonormalization for Lanczos and the Rayleigh–Ritz procedure for LOBPCG (Ortho/RR), communications among the column groups of the 2D processor grid (T_col-comm) — row communications are fully overlapped with SpMV/SpMM [104] — and finally the MPI_Allreduce calls needed for the reductions during the Ortho/RR step (T_allreduce). The solve times for the smaller Nmax truncations used to obtain the initial LOBPCG guesses are included in the execution times in Fig. 3.9. We observe that our LOBPCG implementation consistently outperforms the existing Lanczos solver, by factors of 1.7×, 1.8×, and 1.4× for the Nm6, Nm7 and Nm8 cases, respectively. Although LOBPCG requires more SpMVs overall for convergence (see Table 3.3), the main reason for the improved time-to-solution is the high performance SpMM and SpMM^T kernels we presented. Since two threads in a socket (one per MPI rank) are used to overlap communications with the SpMM computations, the peak SpMM flop rate during the eigensolver iterations is slightly lower than those in Fig. 3.7. For the LOBPCG computations though, we observe that the V^T W and V C kernels execute at rates in line with those in Fig. 3.8. We note that without the preconditioner and initial guess techniques we adopted, LOBPCG's slow convergence rate leads to similar or worse solution times compared to Lanczos/FO, wiping out the gains from replacing SpMVs with SpMMs. In this regard, inexpensive solves with smaller Nmax truncations and low cost preconditioners are crucial for the performance benefits obtained with LOBPCG.

In Table 3.3, we also show the total number of inner products required by both solvers. The LOBPCG algorithm relies heavily on vector operations, as discussed in Sect. 3.7 and evidenced by the total number of inner products reported here. So despite the use of Lanczos with full orthogonalization, LOBPCG requires more inner products overall. In addition, a smaller but still significant number of linear combination operations, as well as solutions of small eigenvalue problems, are required for LOBPCG.
Cumulatively, these factors lead to a computationally expensive Ortho/RR part for the LOBPCG solver. So despite using highly optimized BLAS 3 based kernels in LOBPCG, we observe that the time spent in the Ortho/RR part is comparable for Lanczos/FO and LOBPCG. Finally, in the larger runs, i.e., Nm7 and Nm8, we observe that the Lanczos/FO solver incurs significant T_allreduce times, possibly due to slight load imbalances exacerbated by frequent synchronizations. On the other hand, T_allreduce times are much lower for LOBPCG (an expected consequence of the lower iteration counts), but LOBPCG's T_col-comm times are significantly higher due to the larger volume of communication required in this part.

3.9 Evaluation on Xeon Phi Knights Corner (KNC)

Our performance evaluation on Xeon Phi is limited to the isolated SpMM and tall-skinny matrix kernels due to MFDn's large memory requirements and the limited device memory available on the KNC (which was used in native mode as a proxy for the upcoming Knights Landing architecture). We used the same test cases as before, and experimented with 30, 60, 120, 180, and 240 threads to determine the ideal thread count. We obtained the best performance with 120 threads for both SpMM and the tall-skinny matrix kernels, and we report those results. Comparing the average SpMM Gflop rates on KNC (given in Fig. 3.10) with the Ivy Bridge (IB) Xeon, we see that the peak performance on KNC is much lower (as much as 3x for Nm6, m = 12, for instance) than that on IB for all cases. This is likely due to the significantly smaller cache space available per KNC thread. In any case, we can clearly see that the CSB/OpenMP implementation delivers significantly better performance than the traditional CSR and Rowpart implementations, as it did on IB. On KNC, we also observe the pattern of increasing performance with increasing values of m. But unlike IB, where the use of SSE vs. AVX vectorization did not make a significant difference (see the similar Gflop rates for m = 12 and m = 16 in Fig. 3.7), on KNC utilizing packed AVX-512 vectorization is crucial, as indicated by the sharp performance drop in going from m = 16 to m = 24. We utilize the same extended Roofline model as before, but KNC does not offer a shared L3 cache; as such, the L3 Roofline of Fig. 3.7 is replaced with the device memory to L2 bandwidth limit, i.e., the L2-DRAM Roofline. With KNC's high bandwidth memory and limited cache space, the original DRAM Roofline gives a very loose bound, and so does the plain L2 Roofline, but the L2-DRAM Roofline provides a tight envelope.

Figure 3.10: SpMM and SpMM^T combined performance results (Gflop/s as a function of the number of vectors m) on Babbage using the Nm6, Nm7 and Nm8 matrices (from top to bottom), comparing CSB/OpenMP, Rowpart/OpenMP and CSR/OpenMP against the DRAM, L2-DRAM and L2 Rooflines.

For the tall-skinny matrix operations, we only present results using the MKL library, as LibSci was not available on KNC. In Fig. 3.11, we observe that our custom V^T W kernel significantly outperforms MKL's dgemm (up to 25x). The Custom/Bundled implementation gives important performance gains for 8 < m < 48.
For the VC kernel, however, our custom implementations provide only slight improvements over MKL. Overall, the performance achieved for both kernels is very low compared to the peak performance predicted by the Roofline model. The poor performance observed for SpMM and tall-skinny matrix operations on KNC suggests that further optimizations are necessary to achieve good eigensolver performance on future systems.

Figure 3.11: Performance (Gflop/s as a function of the number of vectors m) of the V^T W (top) and VC (bottom) kernels, using the MKL library as well as our custom and bundled implementations, on an Intel Xeon Phi processor, compared against the Roofline bound. Local vector blocks are l x m, where l = 1 M.

3.9.1 Conclusions of this work

In this work, we observed that block eigensolvers are favorable alternatives to SpMV-based iterative eigensolvers for modern architectures with a widening gap between processor and memory system performance. In this study, we focused on the sparse matrix multiple vector (SpMM, SpMM^T) and tall-skinny matrix operations (V^T W, VC) that constitute the key kernels of a block eigensolver. Using many-body nuclear Hamiltonian test matrices extracted from MFDn, we demonstrated that the use of the compressed sparse blocks (CSB) format, in conjunction with manual unrolling for vectorization and tuning, can improve SpMM and SpMM^T performance by up to 1.5x and 4x, respectively, on modern multi-core processors. As block eigensolvers are sufficiently compute-intensive, the DRAM bandwidth may be relegated to a secondary bottleneck. We presented an extended Roofline model that captures the effects of L2 and L3 bandwidth limits in addition to the original DRAM bandwidth limit. This extended model highlighted how the performance bottleneck transitions from DRAM to the L3 bandwidth for large m values or sparser matrices. Contrary to common wisdom, we observe that simply calling dgemm in optimized math libraries does not suffice to attain high flop rates for tall-skinny matrix operations in block eigensolvers. Through custom thread-parallel implementations of the inner product and linear combination operations, and by bundling separate vector blocks into a single large block, we obtained 1.2x to 15x speedups (depending on m and the type of operation) over the MKL and LibSci libraries. Building on the ideas from this work, we wanted to extend this approach to an entire solver rather than focusing only on a single kernel. Hence we started working on a task parallel framework, which we discuss next in detail.

Chapter 4
ON NODE TASK PARALLEL OPTIMIZATION

This work has been published in IEEE HiPC 2019 [107]. It is a collaborative effort between Md Afibuzzaman and Fazlay Rabbi; both contributed equally to this work. The most fundamental operation in sparse linear algebra is arguably the multiplication of a sparse matrix with a vector (SpMV), as it forms the main computational kernel for several applications, such as the solution of partial differential equations (PDE) [7] and the Schrödinger equation [8] in scientific computing, spectral clustering [9] and dimensionality reduction [108] in machine learning, the PageRank algorithm [11] in graph analytics, and many others. The Roofline model by Williams et al.
[12] suggests that the performance of SpMV kernel is ultimately bounded by the memory bandwidth. Consequently, performance optimizations to increase cache utilization and reduce data access latencies for SpMV has drawn significant interest, e.g., [13, 14, 15, 16]. A closely related kernel is the multiplication of a sparse matrix with multiple vectors (SpMM) which constitutes the main operation in block solvers, e.g., the block Krylov sub- space methods [109, 7] and block Jacobi-Davidson method. SpMM has much higher arith- metic intensity than SpMV and can efficiently leverage wide vector execution units. As a result, SpMM-based solvers has recently drawn significant interest in scientific comput- 93 ing [22, 23, 24, 26, 25, 110, 111, 112]. SpMM also finds applications naturally in machine learning where several features (or eigenvectors) of sparse matrices are needed [108, 9]. Al- though SpMM has a significantly higher arithmetic intensity than SpMV, the extended Roofline model that we recently proposed suggests that cache bandwidth, rather than the memory bandwidth, can still be an important performance limiting factor for SpMM [22]. LAPACK [85] is a linear algebra library for solving systems of simultaneous linear equa- tions, least-squares solutions of linear systems of equations, eigenvalue problems, and sin- gular value problems. LAPACK routines mostly exploit Basic Linear Algebra Subprograms (BLAS) to solve these problems. PLASMA aims to overcome the shortcomings of the LA- PACK library in efficiently solving the problems in dense linear algebra on multicore pro- cessors [113, 114]. PLASMA can solve dense general systems of linear equations, symmetric positive definite systems of linear equations and linear least squares problems, using LU, Cholesky, QR and LQ factorizations and supports both single precision and double preci- sion arithmetic. However, PLASMA does not support general sparse matrices and does not solve sparse eigenvalue or singular value problems. PLASMA supports only shared-memory machines. MAGMA is a dense linear algebra library (like LAPACK) for heterogeneous systems, i.e., systems with GPUs [115, 116, 117], to fully exploit the computational power that each of the heterogeneous components would offer. MAGMA provides very similar functionality like LAPACK and makes it easier for the user to port their code from LAPACK to MAGMA. MAGMA supports both CPU and GPU interfaces. The users do not have to know details of GPU programming to use MAGMA. Barrera et. al. [118] use computational dependencies and dynamic graph partitioning method to minimize NUMA effect on shared memory architectures. StarPU [119] is a runtime 94 system that facilitates the execution of parallel tasks on heterogeneous computing platforms, and incorporates multiple scheduling policies. However, the application developer has to create the computational tasks by themselves in order to use StarPU. While the concept of task parallelism based on data flow dependencies is not new, explo- ration of the benefits of this idea in the context of sparse solvers constitutes a novel aspect of this work. Additionally, to the best of our knowledge, related work on task parallelism has not explored its impact on cache utilization compared to the BSP model as we do in this work. 4.1 DeepSparse Overview Figure 4.1 illustrates the architectural overview of DeepSparse. 
As shown, DeepSparse consists of two major components: i) the Primitive Conversion Unit (PCU), which provides a front-end for domain scientists to express their application at a high level; and ii) the Task Executor, which creates the actual tasks based on the abstract task graph generated by the PCU and hands them over to the OpenMP runtime for execution. As sparse matrix related computations represent the most expensive calculations in many large-scale scientific computing applications, we define tasks in our framework based on the decomposition of the input sparse matrices. For most sparse matrix operations, both 1D (block row) and 2D (sparse block) partitionings are suitable options. A 2D partitioning is ideal for exposing high degrees of parallelism and reducing data movement across memory layers [120]; as such, 2D partitioning is the default scheme in DeepSparse. For a 2D decomposition, DeepSparse defines tasks based on the Compressed Sparse Block (CSB) [14] representation of the sparse matrix, which is analogous to the tiles commonly used in task parallel implementations of dense linear algebra libraries. However, CSB utilizes much larger block dimensions (on the order of thousands) due to sparsity [22, 27, 14]. Consequently, DeepSparse starts out by decomposing the sparse matrix (or matrices) of a given code into CSB blocks, which eventually correspond to the tasks during execution, with each kernel producing a large number of tasks (a small illustrative sketch of this decomposition follows below). Note that the decomposition of a sparse matrix dictates the partitioning of the input and output vectors (or vector blocks) in the computation as well, effectively inducing a decomposition of all data structures used in the solver code.

Figure 4.1: Schematic overview of DeepSparse, showing the Primitive Conversion Unit (Task Identifier and TDG Generator) on the left and the Task Executor distributing SpMM, dot and other tasks, together with their input/output data partitions, across cores on the right.

DeepSparse creates and maintains fine-grained dependency information across the different kernels of a given solver code based on the result of the above decomposition scheme. As such, instead of simply converting each kernel into its own task graph representation and concatenating them, DeepSparse generates a global task graph, allowing for more optimal data access and task scheduling decisions based on global information. Since the global task graph depends on the specific algorithm and input sparse matrix, DeepSparse explicitly generates the corresponding task dependency graph. While this incurs some computational and memory overheads, such overheads are negligible. The computational overheads are negligible because sparse solvers are typically iterative and the same task dependency graph is reused over several iterations. The memory overheads are negligible because each vertex in the task graph corresponds to a large set of data in the original problem. After this brief overview, we explain the technical details of DeepSparse.

4.1.1 Primitive Conversion Unit (PCU)

The PCU is composed of two parts: i) the Task Identifier, and ii) the Local Task Dependency Graph (TDG) Generator.
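Before turning to the individual PCU components, the following minimal C++ sketch illustrates the CSB-style decomposition described in the overview above (hypothetical names and layout, assuming a CSR input; this is not the actual DeepSparse code). Each nonempty β x β block would become one unit of task creation, and the same β induces the partitioning of the vector blocks.

    #include <vector>

    // Coordinates and values of the nonzeros that fall inside one beta x beta block.
    // Each nonempty CSBBlock later corresponds to one SpMM task.
    struct CSBBlock {
        std::vector<int> row_loc, col_loc;   // offsets within the block
        std::vector<double> val;
    };

    // Decompose an n x n CSR matrix (rowptr/colidx/val) into a 2D grid of CSB blocks.
    // nblocks = ceil(n / beta); vector blocks are partitioned with the same beta.
    std::vector<std::vector<CSBBlock>>
    csb_decompose(int n, int beta,
                  const std::vector<int>& rowptr,
                  const std::vector<int>& colidx,
                  const std::vector<double>& val) {
        int nblocks = (n + beta - 1) / beta;
        std::vector<std::vector<CSBBlock>> blocks(nblocks, std::vector<CSBBlock>(nblocks));
        for (int i = 0; i < n; ++i) {
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
                int j = colidx[k];
                CSBBlock& b = blocks[i / beta][j / beta];
                b.row_loc.push_back(i % beta);
                b.col_loc.push_back(j % beta);
                b.val.push_back(val[k]);
            }
        }
        return blocks;
    }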
4.1.1.1 Task Identifier (TI) The application programming interface (API) for DeepSparse is a combination of the recently proposed GraphBLAS interface [84] (for sparse matrix related operations) and BLAS/LA- PACK [85, 86] (for vector and occasional dense matrix related computations). This allows application developer to express their algorithms at a high-level without having to worry about architectural details (e.g., memory hierarchy) or parallelization considerations (e.g., determining the individual tasks and their scheduling). Task identifier parses a code ex- pressed using the DeepSparse API to identify the specific BLAS/LAPACK and GraphBLAS calls, as well as the input/output of each call. It then passes this information to the local task dependency graph generator. TI builds two major data structures: ParserMap: ParserMap is an unordered map that holds the parsed data information in the form of (Key, Value) pairs. As TI starts reading and processing the DeepSparse code, it builds a ParserMap from the function calls. To uniquely identify each call 97 in the code, Key class is made up of three components: opCode which is specific to each type of operation used in the code, id which keeps track of the order of the same function call in the code (e.g., if there are two matrix addition operations, then the first call will have id = 1 and the second one will have id = 2), and timestamp which stores the line number of the call in the code and is used to detect the input dependencies of this call to the ones upstream. For each key, the corresponding Value object stores the input and output variable information. It also stores the dimensions of the matrices involved in the function call. Keyword & idTracker: Keyword is a vector of strings that holds the unique function names (i.e., cblas_dgemm, dsygv, mkl_dcrmm, etc.) that have been found in the given code, and the idTracker keeps track of the number of times that function (Keyword) has been called so far. Keyword and idTracker vectors are synchronized with each other. When TI finds a function call, it searches for the function name in the Keyword vector. If found, the corresponding idTracker index is incremented. Otherwise, the Keyword vector is expanded with a corresponding initial idTracker value of 1. 4.1.1.2 Task Dependency Graph Generator (TDGG) The output of Task Identifier (TI) is a dependency graph at a very coarse-level, i.e., at the function call level. For an efficient parallel execution and tight control over data movement, tasks must be generated at a much finer granularity. This is accomplished by the Task Dependency Graph (TDGG), which goes over the input/output data information generated by TI for each function call and starts decomposing these data structures. As noted above, the decomposition into finer granularity tasks starts with the first function call involving 98 the sparse matrix (or matrices) in the solver code which is typically an SpMV, SpMM or SpGEMM operation. After tasks for this function call are identified by examining the non- zero pattern of the sparse matrix, tasks for prior and subsequent function calls are generated accordingly. As part of task dependency graph generation procedure, TDGG also gener- ates the dependencies between individual fine-granularity tasks by examining the function call dependencies determined by TI. 
Note that the dependencies generated by the TDGG may (and often do) span function boundaries; this is an important property of DeepSparse that separates it from a bulk synchronous parallel (BSP) program, which effectively imposes barriers at the end of each function call. The resulting task dependency graph generated by the TDGG is essentially a directed acyclic graph (DAG) representing the data flow in the solver code, where vertices denote computational tasks, incoming edges represent the input data, and outgoing edges represent the output data of each task. The TDGG also labels the vertices in the task dependency graph with the estimated computational cost of each task, and the directed edges with the name and size of the corresponding data, respectively. During execution, such information can be used for load balancing among threads and/or for ensuring that active tasks fit in the available cache space. In this initial version of DeepSparse, though, such information is not yet used because we rely on OpenMP's default task execution algorithms, as explained next.

4.1.2 Task Executor

To represent a vertex in the task graph, the TDGG uses an instance of the TaskInfo structure (Listing 4.1), which provides all the necessary information for the Task Executor to properly spawn the corresponding OpenMP task. The Task Executor receives an array of TaskInfo structures (Listing 4.1) from the PCU that represents the full computational dependency graph, picks each node from this array one by one, and extracts the corresponding task information.

Listing 4.1: TaskInfo Structure
    struct TaskInfo {
        int opCode;           // type of operation
        int numParamsCount;
        int *numParamsList;   // tile id, dimensions etc.
        int strParamsCount;
        char **strParamsList; // i.e. buffer name
        int taskID;           // analogous to id of the Key class
    };

DeepSparse implements OpenMP task based functions for all computational kernels (represented by opCode) it supports. Based on the opCode, the partition id of the input/output data structures, and the other required parameters (given by numParamsList and strParamsList) found in the TaskInfo structure at hand, the Task Executor calls the corresponding computational function in the DeepSparse library, effectively spawning an OpenMP task. In DeepSparse, the master thread spawns all OpenMP tasks one after the other and relies on OpenMP's default task scheduling algorithms for the execution of these tasks. OpenMP's runtime environment then determines which tasks are ready to be executed based on the provided task dependency information. When ready, those tasks are executed by any available member of the current thread pool (including the master thread). Note that OpenMP supports task parallel execution with named dependencies, and better yet, these dependencies can be specified as variables. This feature is fundamental for DeepSparse to be able to generate TDGs for different problem sizes and matrix sparsity patterns. This is exemplified in Algorithm 1, where the SpMM task for the compressed sparse block at row i and column j is invoked simply by providing the X[i,j] sparse matrix block along with the Y[j] input vector block and the Z[i] output vector block in the depend clause.

Algorithm 1: SpMM Kernel
Input: X[i,j] (β x β sparse CSB block), Y[j] (β x b)
Output: Z[i] (dense vector block, β x b)
    #pragma omp task depend(in: X[i,j], Y[j], Z[i]) depend(out: Z[i])
    {
        foreach val in X[i,j].nnz do
            r = X[i,j].row_loc[val]
            c = X[i,j].col_loc[val]
            for k = 0 to b do
                Z[r x b + k] = Z[r x b + k] + val x Y[c x b + k]
    }
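The following is a compilable C++/OpenMP counterpart of Algorithm 1, given only as a sketch: the CSBBlock layout, the function name and the calling context are illustrative assumptions rather than the actual DeepSparse code. As in Algorithm 1, the depend clauses let the OpenMP runtime order this task against producers of Y[j] and other updates of Z[i].

    #include <omp.h>
    #include <cstddef>
    #include <vector>

    struct CSBBlock {                       // one beta x beta sparse block (hypothetical layout)
        std::vector<int> row_loc, col_loc;  // positions of nonzeros within the block
        std::vector<double> val;
    };

    // Spawn one SpMM task: Zi += Xij * Yj, where Yj and Zi are beta x b dense
    // row-major blocks. Intended to be called by the master thread from inside an
    // "omp parallel" / "omp single" region, as described in the text.
    void spmm_task(const CSBBlock *Xij, const double *Yj, double *Zi, int b) {
        #pragma omp task depend(in: Xij[0], Yj[0]) depend(inout: Zi[0]) \
                         firstprivate(Xij, Yj, Zi, b)
        {
            // Tasks updating the same Zi are serialized by the inout dependence,
            // mirroring the in+out dependence on Z[i] in Algorithm 1.
            for (std::size_t t = 0; t < Xij->val.size(); ++t) {
                int r = Xij->row_loc[t];
                int c = Xij->col_loc[t];
                double v = Xij->val[t];
                for (int k = 0; k < b; ++k)
                    Zi[r * b + k] += v * Yj[c * b + k];
            }
        }
    }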
An important issue in a task parallel program is the data race conditions involving the output data that is being generated. Fortunately, the task-parallel execution specifications of OpenMP requires only one thread to be active among threads writing into the same output data location. While this ensures a race-condition free exeuction, it might hinder performance due to a lack of parallelism. Therefore, for data flowing into tasks with a high incoming degree, DeepSparse allocates temporary output data buffers based on the number of threads and the available memory space. Note that this also requires the creation of an additional task to reduce the data in temporary buffer space before it is fed into the originally intended target task. 4.1.3 Illustrative Example We provide an example to demonstrate the operation of DeepSparse using the simple code snippet provided in Listing 4.2. As TI parses the sample solver code, it discovers that the first cblas_dgemm in the solver corresponds to a linear combination operation (see Fig. 4.2), the second line is a sparse matrix vector block multiplication (SpMM, see Fig. 4.3) and 101 0 task0 task0 1 task1 task1 2 task2 task2 … … … X Y Z b n-1 taskn-1 taskn-1 b b Figure 4.2: Overview of input output matrices partitioning of task-based matrix multiplication ker- nel. 0 task0,0 task0,1 task0,2 task0,n-1 task1,0 task1,1 task1,2 1 2 task2,0 task2,1 task2,2 … … … X Y Z n-1 taskn-1,n-1 b b Figure 4.3: Overview of matrices partitioning of task-based SpMM kernel. the second cblas_dgemm at the end is an inner product of two vector blocks (see Fig. 4.4). These function calls, their parameters as well as dependencies are captured in the ParserMap, Keyword, and idTracker data structures as shown in Table 4.1. Listing 4.2: An example pseudocode cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, 1.0, A, k, B, n , 0, C, n); SpMM(X, C, D, m, n); cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans, n, n, m, 1.0, D, n, C, n, 0, E, n); Task Dependency Graph (TDG) generator receives the necessary information from TI and determines the tasks corresponding to partitionings of operand data structures of each 102 cblas_dgemm call on block vector Partial result buffer reduction task0 t0 b task0 task1 task2 XT taskn-1 task1 0 1 2 …. task2 t1 …. …. n-1 Y t2 Z tp taskn-1 b Figure 4.4: Overview of matrices partitioning of task-based inner product kernel. 5,1,0,0,1 5,1,1,0,1 2,2,0,0,0,1 2,2,1,0,0,1 2,2,0,1,0,1 2,2,1,1,0,1 3,2,0,0,0,1 3,2,1,1,0,1 4,1,0,1,EBUF,-1 Figure 4.5: Task graph for the psudocode in listing 4.2. operation, as well as their origins (whether the necessary data are coming from another task or from a variable). TDGG then builds the DAG of each computational kernel and appends it to the global DAG with proper edge connectivity (i.e., dependencies). While generating the DAG, the TDGG also encodes the value of the TaskInfo structure instance that represent each of the vertices into the vertex name. The vertex naming convention is . Figure 4.5 shows the task dependency graph of the solver code in Listing 4.2 (assuming m = 100, k = 8, n = 8, CSBtile/blocksize = 50, so each input matrix is partitioned into 2 chunks). The task executor receives an array of TaskInfo structures that contains the node infor- mation as shown in Figure 4.5. The task executor goes over each of the tasks in the array of TaskInfo structure. 
At first, it reads the nodes (<5,1,0,0,1>, <5,1,1,0,1>) of the first operation and spawns two matrix multiplication (XY) tasks with the proper input and output matrices. The Task Executor then reads the task information for all SpMM tasks {<2,2,0,0,0,1>, <2,2,0,1,0,1>, <2,2,1,0,0,1>, <2,2,1,1,0,1>} and spawns four SpMM tasks with the proper input/output matrix blocks. Finally, the Task Executor reads <3,2,0,0,0,1>, <3,2,1,1,0,1> and <4,1,0,1,EBUF,-1> and spawns two inner product (X^T Y) tasks and one partial output buffer reduction task for the inner product operation.

Table 4.1: Major data structures after parsing the third line.

Data Structure        Content
ParserMap             <{XY, 1, 1}, { , , }>  <{SpMM, 1, 2}, { , , }>  <{XTY, 1, 3}, { , , }>
Keyword / idTracker   <1, 1, 1>

4.1.4 Limitations of the Task Executor

Despite the advantages of an asynchronous task-parallel execution, the Task Executor has the following limitations:

Synchronization at the end of an iteration: Most computations involving sparse matrices are based on iterative techniques. As such, the TDG generated for a single iteration can be reused over several steps (until the algorithm reaches convergence). However, it is necessary to introduce a #pragma omp taskwait at the end of each solver iteration and force all tasks of the current iteration to complete, to ensure computational consistency among different iterations of the solver. For relatively simple solvers, the taskwait clause adds some overhead to the total execution time due to threads idling at taskwaits.

Limited number of temporary buffers: While OpenMP allows the use of program variables in the dependency clauses, it does not allow dynamically changing the variable lists of the depend clauses. As such, the number of buffers in the partial output reduction tasks needs to be fixed to overcome this issue. Depending on the available memory, there are at most nbuf partial output buffers for a reduction operation. If nbuf is less than the total number of threads, there may be frequent read after write (RAW) contentions on the partial output buffers. This could potentially have been avoided if the list of variables in the depend clause could be changed dynamically.

4.2 Benchmark Applications

We demonstrate the performance of the DeepSparse framework on two important eigensolvers widely used in large-scale scientific computing applications: the Lanczos eigensolver [121] and the Locally Optimal Block Preconditioned Conjugate Gradient algorithm (LOBPCG) [122].

4.2.1 Lanczos

The Lanczos algorithm finds eigenvalues of a symmetric matrix by building a matrix Qk = [q1, . . . , qk] of orthogonal Lanczos vectors [123]. The eigenvalues of the sparse matrix A are then approximated by the Ritz values. As shown in Algorithm 2, it is a relatively simple algorithm consisting of a sparse matrix vector multiplication (SpMV) along with some vector inner products for orthonormalization.

Algorithm 2: Lanczos Algorithm in Exact Arithmetic
    q_1 = b / ||b||_2,  β_0 = 0,  q_0 = 0
    for j = 1 to k do
        z = A q_j
        α_j = q_j^T z
        z = z − α_j q_j − β_{j−1} q_{j−1}
        β_j = ||z||_2
        if β_j = 0, quit
        q_{j+1} = z / β_j
    Compute eigenvalues, eigenvectors, and error bounds of T_k

4.2.2 LOBPCG

LOBPCG is a commonly used block eigensolver based on the SpMM kernel [122]; see Algorithm 3 for a pseudocode. Compared to Lanczos, LOBPCG comprises high arithmetic intensity operations (SpMM and Level-3 BLAS).
In terms of memory, while the Ĥ matrix takes up considerable space, when a large number of eigenpairs are needed (e.g., in dimensionality reduction, spectral clustering or quantum many-body problems), the memory needed for the block vector Ψ can be comparable to or even greater than that of Ĥ. In addition, other block vectors (the residual R, the preconditioned residual W, the previous direction P), block vectors from the previous iteration, and the preconditioning matrix T must be stored and accessed at each iteration. Figure 4.6 shows a sample task graph for LOBPCG generated by the TDGG using a very small matrix. Clearly, orchestrating the data movement in a deep memory hierarchy to obtain an efficient LOBPCG implementation is non-trivial.

Figure 4.6: A sample task graph for the LOBPCG algorithm using a small sparse matrix.

Algorithm 3: LOBPCG Algorithm (for simplicity, without a preconditioner) used to solve ĤΨ = EΨ
Input: Ĥ, a matrix of dimensions N x N
Input: Ψ_0, a block of vectors of dimensions N x m
Output: Ψ and E such that ||ĤΨ − ΨE||_F is small and Ψ^T Ψ = I_m
    Orthonormalize the columns of Ψ_0
    P_0 ← 0
    for i = 0, 1, . . . , until convergence do
        E_i = Ψ_i^T Ĥ Ψ_i
        R_i ← Ĥ Ψ_i − Ψ_i E_i
        Apply the Rayleigh–Ritz procedure on span{Ψ_i, R_i, P_i}
        Ψ_{i+1} ← argmin_{S ∈ span{Ψ_i, R_i, P_i}, S^T S = I_m} trace(S^T Ĥ S)
        P_{i+1} ← Ψ_{i+1} − Ψ_i
        Check convergence

4.3 Performance Evaluation

4.3.1 Experimental setup

We conducted all our experiments on Cori Phase I, a Cray XC40 supercomputer at NERSC, mainly using the GNU compiler. Each Cori Phase I node has two sockets, each with a 16-core Intel Xeon E5-2698 v3 (Haswell) CPU. Each core has a 64 KB private L1 cache (32 KB instruction and 32 KB data cache) and a 256 KB private L2 cache. Each CPU has a 40 MB shared L3 cache (LLC). We use thread affinity to bind threads to cores and use a maximum of 16 threads to avoid NUMA issues. We test DeepSparse using five matrices with different sizes, sparsity patterns and domains (see Table 4.2). The first four matrices are from the SuiteSparse Matrix Collection, and the Nm7 matrix is from the nuclear no-core shell model code MFDn. We compare the performance of DeepSparse with two other library implementations: i) libcsr is an implementation of the benchmark solvers using thread-parallel Intel MKL library calls (including SpMV/SpMM) with CSR storage of the sparse matrix; ii) libcsb is an implementation again using Intel MKL calls, but with the matrix stored in the CSB format. Performance data for LOBPCG is averaged over 10 iterations, while the number of iterations is set to 50 for the Lanczos runs. Our performance comparison criteria are L1, L2 and LLC misses and execution times for both solvers. All cache miss data was obtained using the Intel VTune software. The performance of the DeepSparse and libcsb implementations depends on the CSB block sizes used. Choosing a small block size creates a large number of small tasks. While this is preferable on a highly parallel architecture, the large number of tasks may lead to significant task execution overheads, in terms of both cache misses and execution times. Increasing the block size reduces such overheads, but may then lead to increased thread idle times and load imbalances. Therefore, the CSB block size is a parameter to be optimized for the specific problem. The block sizes we experimented with are 1K, 2K, 4K, 8K and 16K.

Table 4.2: Matrices used in our evaluation.
Matrix            Rows        Columns     Nonzeros
inline1           503,712     503,712     36,816,170
dielFilterV3real  1,102,824   1,102,824   89,306,020
HV15R             2,017,169   2,017,169   283,073,458
Queen4147         4,147,110   4,147,110   316,548,962
Nm7               4,985,422   4,985,422   647,663,919

4.3.2 LOBPCG evaluation

In Fig. 4.7, we show the number of cache misses at all three levels (L1, L2 and L3) and the execution time comparison between all three versions of the LOBPCG algorithm compiled using the GNU compiler. LOBPCG is a complex algorithm with a number of different kernel types; its task graph results in millions of tasks for a single iteration. As shown in Fig. 4.7, except for the Nm7 matrix, the libcsb and libcsr versions achieve a similar number of cache misses; for Nm7, libcsb has important cache miss reductions over the libcsr version. On the other hand, DeepSparse achieves 2.5x - 10.7x fewer L1 misses, 6.5x - 16.2x fewer L2 misses and 2x - 7x fewer L3 cache misses compared to the libcsr version. As the last row of Fig. 4.7 shows, even with the implicit task graph creation and execution overheads of DeepSparse, the significant reduction in cache misses leads to a 1.2x - 3.9x speedup over the execution times of libcsr. Given the highly complex DAG of LOBPCG and the abundant data re-use opportunities available, we attribute these improvements to the pipelined execution of tasks which belong to different computational kernels (see Fig. 4.8) but use the same data structures. We note that the Task Executor in DeepSparse solely relies on the default scheduling algorithm used in the OpenMP runtime environment. By making use of the availability of the entire global task graph and the labeling information on vertices/edges, it might be possible to improve the performance of DeepSparse even further.

Figure 4.7: Comparison of L1, L2, LLC misses and execution times between DeepSparse, libcsb and libcsr for the LOBPCG solver.
Figure 4.8: LOBPCG single-iteration execution flow graph of dielFilterV3real.
Figure 4.9: Comparison of L1, L2, LLC misses and execution times between DeepSparse, libcsb and libcsr for the Lanczos solver.

4.3.3 Lanczos evaluation

In Fig. 4.9, the cache miss and execution time comparisons for the different Lanczos versions are shown. The Lanczos algorithm is much simpler than LOBPCG; it has far fewer types and numbers of tasks (basically, one SpMV and one inner product kernel at each iteration). As such, there are not many opportunities for data re-use. In fact, we observe that DeepSparse sometimes leads to increases in cache misses for smaller matrices. However, for the Nm7 and HV15R matrices, which are the largest matrices in our benchmark set, we observe an improvement in cache misses, achieving up to 2.4x fewer L1 cache misses, 3.1x fewer L2 misses and 4.5x fewer L3 misses than libcsr. Most importantly, DeepSparse achieves up to a 1.8x improvement in execution time. We attribute the execution time improvement observed across the board to the increased degrees of parallelism exposed by the global task graph of DeepSparse, which is in fact highly critical for smaller matrices.

Figure 4.10: Comparison of execution times for different compilers between DeepSparse, libcsb and libcsr for the Lanczos algorithm (Blue/Left: GNU, Red/Middle: Intel, Green/Right: Clang compiler).
Figure 4.11: Cache miss (L1, L2, L3) comparison between compilers for HV15R.

4.3.4 Compiler comparison

For all of our experiments, we use OpenMP as our backend. To explore the impact of the different task scheduling approaches in different OpenMP implementations, we experimented with three compilers: Intel, GNU and Clang/LLVM. In Figure 4.10, we show the comparison of execution times among the different compilers for the three implementations. We see that the execution time for the Clang/LLVM compiler is significantly higher compared to the GNU and Intel compilers for all matrices. However, cache misses stay largely the same when moving to a different compiler. We show the cache miss comparison between the three compilers in Figure 4.11 for one matrix, HV15R. All other matrices follow a cache miss pattern similar to HV15R.
Here, we can clearly see that regardless of the compiler, DeepSparse achieves fewer cache misses than the libcsb and libcsr implementations. Clang/LLVM shows fewer cache misses for DeepSparse as well, but it still has a poor running time. We believe this is because Clang/LLVM is not able to schedule tasks as efficiently as GNU and Intel. Compared to the Intel compiler, the GNU compiler sometimes shows more L1 and L2 misses, but the execution time is higher with Intel. This may be due to the scheduling strategy and the implementation of task scheduling points in the compilers. Overall, GNU delivers the best running times among the three compilers, and the Intel compiler does not do well with the library based solver implementations.

4.3.5 Conclusions of this work

This work introduces a novel task-parallel sparse solver framework which targets all computational steps in sparse solvers. We show that our approach achieves significantly fewer cache misses across the different cache layers and also improves the execution time over the library versions. Future work will aim to further reduce cache misses and execution time over the current versions by experimenting with partitioning and scheduling algorithms that are more advanced than the default schemes in OpenMP. In this work we used the scheduler provided by OpenMP, but we also wanted to partition the tasks using a memory heuristic to generate a custom schedule. We discuss this in detail in the next chapter.

Chapter 5
SCHEDULING TASKS WITH A GRAPH PARTITIONER

In the DeepSparse framework, we saw that OpenMP does a good job with memory utilization across all memory levels. But this behavior is still beyond our control: OpenMP resolves the task dependencies itself, and whenever the data dependencies of a task are satisfied and it no longer waits on any other task, it can be immediately pulled and executed. This is not necessarily the optimal scenario from a memory usage perspective. A task which has no relation to the tasks that are currently active can be executed as soon as a thread becomes free, regardless of its memory inputs and outputs. As a result, a task whose inputs already reside in the lower levels of the memory hierarchy, and which would therefore improve memory reuse and cache hits, may be passed over; the probability of cache misses increases with this kind of scheduling. This motivated us to build a novel graph partition based scheduler that uses the global data flow graphs generated by the PCU to minimize data movement in a deep memory hierarchy. Graph partitioners have been studied extensively, but existing approaches do not meet our needs, as they typically handle DAGs by converting them to undirected graphs; in our case, the directed nature of the task graph must be respected. Our sparse solver DAGs contain a fair number of vertices with high fan-in/fan-out degrees due to operations such as SpMVs, SpMMs, inner products and vector reductions. In this work we take the graph generated by the task generator, create the appropriate data structures for the graph in accordance with the partitioner code, and then partition the graph using the following algorithm steps.

5.1 Coarsening

Experimental DAGs are usually very large; often they have millions or tens of millions of nodes. The partitioning algorithms they use have different complexities, the worst being O(n^2) for the Kernighan algorithm, where n is the number of vertices.
This makes the partitioning phase really slow and impractical if we want to use this partitioner as a supporting tool. This is why the original graph is coarsened into successively smaller graphs until a minimum size is reached. In the original code we received, the graph is coarsened using a matching algorithm. At each coarsening step, it computes how many vertices can be matched; the edges are considered one by one and put into the matching if they respect the acyclicity property. Let us see how the matching is actually done in this implementation on a simple graph, shown in Figure 5.1, which contains almost all of the cases that might arise in our LOBPCG DAG. Here we see a graph with 15 vertices; vertex 5, for instance, has multiple incoming edges and multiple outgoing edges. The algorithm first creates a topological order of the nodes visited in a DFS traversal. It then goes through each of the vertices and checks the constraints for acyclicity. Node 15 is visited first according to the topological order. A sorted-neighbor pass finds the outgoing edge with the least cost; in this case, node 4 is chosen rather than node 3, so nodes 15 and 4 are matched. Node 3 is then visited according to the topological order.

Figure 5.1: Matching example.

Likewise, node 5 is its only outgoing neighbor, and thus nodes 3 and 5 are selected for the next matching. Similarly, nodes 14 and 2, and nodes 1 and 12, are matched. In the end, nodes 9 and 11 are matched. After these nodes are matched, a new graph is generated from the matched nodes. If (u, v) is matched, then u becomes the leader of v; that is how 14 is the leader of 2 and 3 is the leader of 5. The nodes are then numbered consecutively to form the new coarse graph. We can see that acyclicity is preserved in the new graph, although the node numbering has changed. Once the pair 2,5 is matched in this graph, none of the other nodes 3, 4, 5, 7 are considered. Eventually this algorithm runs into cases like the last graph in the figure, where only one matching is selected at each step. The problem we faced with this matching algorithm was that the graph eventually reaches this last-graph stage, where each pass yields very few matchings, making the entire matching process really slow. There is a threshold value for stopping the matching process, but after the initial coarsening the whole graph becomes one big graph with small parts, just like this last graph. To get rid of this issue, we added an extra pre-processing step before the actual matching (a small code sketch of this step is given at the end of this discussion). As we saw in the previous case, for nodes which have many incoming and outgoing edges, the topological levels of the nodes that are the sources of these high degree nodes are most likely the same; similarly, the children of these high degree nodes will most likely share the same topological level. That is why, during the pre-processing phase, we do an additional coarsening of the graph: we coarsen every 3 nodes in the same topological level into one node. Since the nodes already maintain a topological order in a DAG, this coarsening does not lead to a cycle. The motivation for this change is to reduce the number of incoming and outgoing neighbors of the high degree nodes. Let us look at a simple LOBPCG DAG and its pre-processed version. In Figure 5.2 we can see an actual LOBPCG DAG generated by the DAG generator code. This DAG is for a matrix which is divided into 3x3 CSB blocks. We can immediately see the high degree nodes in this graph.
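Before looking at the pre-processed DAG, here is a minimal C++ sketch of the level-based pre-coarsening pass (hypothetical data structures, not the partitioner's code; it assumes a topological level has already been computed for every vertex). Grouping vertices only within a level cannot create a cycle, because every edge of the DAG goes from a strictly lower level to a higher one.

    #include <cstddef>
    #include <vector>

    // Group every `group_size` vertices that share the same topological level into one
    // coarse vertex; the first member of each group acts as the leader.
    std::vector<int> level_precoarsen(const std::vector<int>& level,   // level[v] for each vertex
                                      int num_levels, int group_size = 3) {
        int n = static_cast<int>(level.size());
        std::vector<int> leader(n);
        std::vector<std::vector<int>> by_level(num_levels);
        for (int v = 0; v < n; ++v) by_level[level[v]].push_back(v);

        for (const auto& verts : by_level)
            for (std::size_t i = 0; i < verts.size(); ++i)
                // first vertex of each group of `group_size` leads the others
                leader[verts[i]] = verts[i - (i % static_cast<std::size_t>(group_size))];
        return leader;   // leader[v] == v marks a coarse-vertex representative
    }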
Even in such a simple graph, there are multiple high degree nodes which might lead us to the issue we previously discussed. In Figure 5.3 we see the pre-processed graph, where every 3 nodes in the same topological level are coarsened into one single node. Although the edges are shown as they were in the previous graph, in my opinion they could be merged into a single edge whose cost is the sum of their individual costs.

Figure 5.2: Simple LOBPCG DAG with 3x3 blocks.
Figure 5.3: LOBPCG graph after pre-processing, with every 3 nodes in the same topological level coarsened into one single node and the edges kept intact.

5.2 Initial Partitioning

After the coarse graph is generated, it is fed to the initial partitioning function. The partitioner provides several partitioning algorithms from which we can choose. Although we initially used the Kernighan algorithm for partitioning, we found that it takes a lot of time because it traverses all the nodes. In their paper, the authors discuss a greedy approach in which they define free vertices and eligible vertices: free vertices are those that have not yet been put in a part, and eligible vertices are those whose predecessors are all not free. A gain is computed for each eligible vertex to decide into which part it can be moved. The gain is computed according to the algorithm in their paper, and the objective function is the edge cost, i.e., the minimization of the edge cut. In the end, all the vertices are divided into two parts (a rough sketch of this greedy growing scheme appears below). There was another approach, named BFSGrowing, which was actually a DFS traversal of the nodes; from that traversal, one vertex at a time is selected and added to the part currently being filled, and the weight of that part is then adjusted accordingly. But this algorithm does not compute any gains on the edge cut; it only fills a part as long as there is space under its upper bound. We modified the DFS into a BFS traversal, but it did not have a major influence on the results.
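A rough C++ sketch of the greedy growing idea as we understand it is shown below (hypothetical data structures; the gain is simplified to the weight of edges from already-assigned predecessors, and the additional constraints the partitioner enforces to keep the quotient graph acyclic are omitted):

    #include <queue>
    #include <utility>
    #include <vector>

    // A directed task DAG in adjacency form (hypothetical layout, not the partitioner's).
    struct DAG {
        std::vector<std::vector<std::pair<int, double>>> out;  // (successor, edge weight)
        std::vector<std::vector<std::pair<int, double>>> in;   // (predecessor, edge weight)
        std::vector<double> vweight;                            // vertex weights
    };

    // A vertex becomes eligible once all of its predecessors have been assigned; it is
    // then placed in the part holding the larger weight of its already-assigned
    // predecessors (a crude stand-in for the gain in the paper), unless that part is full.
    std::vector<int> greedy_grow(const DAG& g, double ub0, double ub1) {
        const int n = static_cast<int>(g.out.size());
        std::vector<int> part(n, -1), remaining(n, 0);
        double load[2] = {0.0, 0.0};
        std::queue<int> ready;
        for (int v = 0; v < n; ++v) {
            remaining[v] = static_cast<int>(g.in[v].size());
            if (remaining[v] == 0) ready.push(v);
        }
        while (!ready.empty()) {
            int v = ready.front();
            ready.pop();
            double gain[2] = {0.0, 0.0};
            for (const auto& e : g.in[v]) gain[part[e.first]] += e.second;
            int p = (gain[0] >= gain[1]) ? 0 : 1;
            if (load[p] + g.vweight[v] > (p == 0 ? ub0 : ub1)) p = 1 - p;  // respect upper bounds
            part[v] = p;
            load[p] += g.vweight[v];
            for (const auto& e : g.out[v])
                if (--remaining[e.first] == 0) ready.push(e.first);
        }
        return part;
    }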
It is only natural that algorithms like the Kernighan DFS and Greedy Graph Growing, which compute gains and take the edge cuts into account, might not produce an even partition. Hence, a forced balancing function is called afterwards to make the parts balanced according to the upper bounds. We did not modify anything in the forced balancing function.

5.3 Uncoarsening/Refinement

After the partitioning algorithm returns the coarsened graph with two partitions, this graph is uncoarsened and refined. Previously we stated that during coarsening, in each step the new node number in the coarse graph is mapped to an old node number in the graph before coarsening; these records are kept at each coarsening step, along with the leader information. The uncoarsening step is mainly a projection back from a later graph to an older graph. The partitioning returns two partitions of the coarse graph, with some vertices in part 0 and some in part 1. While projecting back, at each coarsening step the matched node is assigned the part of its leader. For example, assume that in a coarsening step (u, v) was matched, u became the leader of v, and node u was renumbered in the next step. Suppose u is assigned to part 1 by the partitioning algorithm. While projecting back, v is assigned to the same part as its leader, which is part 1. This is how the entire graph is projected back to a previous, less coarsened version. In Figure 5.4 we can see a simplified example of how the project-back function assigns the part numbers. The third graph is the most coarsened one and is fed to the partitioning algorithm, which returns a partitioning with {1, 6, 7} in part 0 and {2, 3, 4, 5} in part 1. While projecting back, the nodes which were matched in the previous coarsening step are assigned to the same part as their leader node. Eventually, among the 15 nodes, only 4 go into part 1, and node 11 is assigned to part 0.

Figure 5.4: Coarsened graph partition assignment example; blue is part 0 and green is part 1.

After this uncoarsening phase, a refinement step is called. We can clearly see that even in our small example, one part gets 11 nodes whereas the other part gets only 4. According to the vertex weight scheme, there is an imbalance here, and that is why the refinement step is called. The refinement step moves some nodes from one part to the other in a way that does not violate the acyclicity constraint. First, it computes where each node can be moved; since only two partitions are generated in the actual code, it checks whether a node can be moved from its part to the other part. A list of boundary vertices is created and their gains are computed accordingly. A heap is maintained to extract the node with the highest gain. This moving of nodes continues as long as one part is larger than its upper bound. After the refinement phase, some nodes have been moved to the other part.

5.4 Partitioned Graphs

To analyze the quality of the partitioning we obtain from this partitioner, we concentrated on the CSB blocks that are accessed during a task. The sparse matrix is only accessed during the SpMM tasks, and it is the dominating data structure that occupies the majority of the memory. We wanted to see how the matrix blocks are traversed within a part, and whether there is any specific pattern in their access order.
Let us examine the CSB block access order in the different partitions. In Figure 5.5 we show the block access pattern for a sparse matrix with dimensions 500,000 x 500,000 and 1,000,000 nonzeros. This sparse matrix was generated with the ParMAT library. In this case we assume each CSB block to have dimensions 512 x 512, which makes the entire sparse matrix a block matrix with 977 x 977 blocks. We ran the LOBPCG algorithm on this sparse matrix, generated a DAG, and then fed this DAG to the partitioner to obtain a partition with 32 parts. For convenience, we color only parts 1, 2 and 3 in this figure. We immediately start seeing a pattern here: within a part, the sparse blocks are accessed in column order, and almost all the blocks of a particular block column fall in the same part. Another interesting observation is that adjacent columns do not necessarily fall into the same part; for example, the blocks in columns 897, 709, 699 and 479 fall in the same part (part 1). Let us also look at the partitioning result for our pre-processed graph for a 61 x 61 sparse block matrix in Figure 5.6, where we show the block access pattern for parts 20, 21 and 22. In this figure we can see that the sparse blocks are accessed in the same column-ordered way, although the pattern is now more scattered for some parts. Since we are coarsening every 3 nodes in the same topological level without any relation among themselves, the access pattern becomes a bit more scattered. The reason behind this kind of pattern lies in the kind of graph that we have: as we have already mentioned, there are some nodes with high degrees. Let us look at a small part of the DAG that is actually generated, and at how the actual matching takes place. In Figure 5.7 we can see that the INV,RBR node has a large number of outgoing edges; to be precise, if the block dimension is nblocks x nblocks, then it has nblocks outgoing edges. Also, each DLACPY node has nblocks outgoing edges to all the SPMM tasks in which the corresponding column blocks of the sparse matrix are accessed. We can also see that there are nblocks SPMMRED nodes, each of which has incoming edges from nblocks SPMM nodes.

Figure 5.5: Sparse matrix block access patterns in different parts with matching.
Figure 5.6: Sparse matrix block access patterns in different parts with pre-processing before matching.
Figure 5.7: A small part of the original DAG that is generated; the matched edges are shown in green.

As we previously assumed, once these high degree nodes are matched, all the other edges incident to them are not considered at all. This results in a lot of nodes being left out from consideration.
Hence, in the coarsened graph, the number of nodes is not reduced by much; it becomes saturated and the matching process slows down. Another very important phenomenon is that, when the edge (u, v) is matched, u becomes the leader of v in the coarsened graph. In this graph we show the matched edges of the first coarsening level with thick green arrows. We can see that the CHOL,RBR node becomes the leader of the INV,RBR node, and all the XY nodes become the leaders of their subsequent DLACPY nodes. Since all the DLACPY nodes are now matched, none of the edges from DLACPY to SPMM nodes are considered; instead, only the nblocks edges from SPMM to SPMMRED are added to the matching set. In the next coarsening level, shown in Figure 5.8, we start seeing a pattern where the clustered XY nodes (XY combined with DLACPY) have outgoing edges to SPMM nodes, and some of those edges are included in the matching set. For example, the edge from XY,2,3 to SPMM,0,2 is added to the matching set, making XY,2,3 the leader of SPMM,0,2 and causing all other edges incident to XY,2,3 not to be considered. In Figures 5.9, 5.10 and 5.11 we show the later coarsening stages. Here the pattern continues: the edges from XY,2,3 to SPMM,1,2, then to SPMM,2,2, and then to SPMM,3,2 are matched in consecutive coarsening levels, clustering all the SPMM tasks along the same column into the same XY node. This also happens with the other XY nodes, where all the SPMM tasks along the same column become clustered in the same node. Moreover, a large cluster forms with CHOL,RBR as the leader. This continues to happen until we reach our step size.

Figure 5.8: Step 2 of the matching.
Figure 5.9: Step 3 of the matching.
Figure 5.10: Step 4 of the matching.
Figure 5.11: Step 5 of the matching.

5.5 Coarsening A Block of SPMM Nodes Into One Block

Since the SPMM tasks along the same column get clustered into the same node, all the column blocks end up in the same partition. So we tried a different pre-processing approach: rather than coarsening nodes in the same topological level, we concentrated only on the tasks which access the main sparse matrix blocks. Before sending the graph to the regular matching algorithm, we coarsen the nodes which access CSB blocks that together form a small block.

Figure 5.12: Blocking of CSB blocks to coarsen multiple nodes into one node.
Assuming we coarsen b × b nodes into one node, we divide the entire block matrix into (nblocks/b) × (nblocks/b) small blocks; this is the maximum possible number of nodes after coarsening. The block number of each SPMM node is computed, and the first SPMM node visited in a row-major traversal of that block is set as the leader. If a block contains no nonzeros, the corresponding SPMM node is not present in the DAG in the first place and is never considered.

Let us illustrate the method with Figure 5.12. Assume a sparse block matrix with 8 × 8 CSB blocks, where an x denotes that a block contains at least one nonzero, and take b = 2, so that there are 4 × 4 small blocks. The blue block contains the SPMM(0,0), SPMM(0,1), SPMM(1,0) and SPMM(1,1) nodes. In row-major order, SPMM(0,0) is the first node visited from this block, so we make SPMM(0,0) the leader of the other nodes in this block. Some small blocks contain CSB blocks with no nonzeros; in the figure, the orange block has only two CSB blocks with nonzeros. In such cases the first node traversed from the block, SPMM(4,5), becomes the leader of the other node, and only 2 nodes are coarsened.

The motivation behind trying this scheme was to find out whether we would actually see any difference in the 1D partitioning results, since this time we are coarsening a small block into one node. In Figure 5.13 we see the results for a matrix similar to that of Figure 5.14, with b = 5, i.e., 5 × 5 nodes coarsened into a single node. The thickness of the band of CSB blocks accessed has increased somewhat, but overall it is still a 1D kind of partitioning. This makes sense: some adjacent columns are now included in the same part, which is why the band becomes thicker, but the DLACPY nodes still have outgoing edges to (nblocks/b) nodes, which behave the same way as before.

Figure 5.13: Sparse matrix block access patterns in different parts

Figure 5.14: Sparse matrix block access patterns in different parts

5.6 Issues with the Partitioner

5.6.1 Upper bounds

One issue is that the partitioning algorithms rely on upper and lower bounds while partitioning. These bounds are decided up front based on the vertex weights, and the initial examples had a vertex weight of one. For example, if we have 100 vertices and want to create 4 parts with the partitioner, each part gets an upper bound of ∼27. BFSGrowing keeps filling a part as long as it stays under this upper bound, regardless of the gains.

5.6.2 Refinement

In my opinion this is an important issue that makes the CSB block access pattern more scattered. When we project back, each node is assigned the part number of its leader. Because of the regular matching algorithm, and also with the pre-processing that we tried, one part inevitably becomes larger than the other. In that case the refinement function calls a forced-balance function: boundary vertices are selected and the gain of moving them to the other parts is calculated, but currently nodes are only moved from the big part to the small part, so some nodes get moved to the other part regardless. At each uncoarsening step, the refinement is called after every project-back call. Since the SPMM nodes of a column are usually clustered into one DLACPY node, different DLACPY nodes fall into the same part in a coarsened graph.
When projecting back, the SPMM nodes associated with these DLACPY nodes land in the same part, which puts non-adjacent columns into the same part. Moreover, the refinement process moves some nodes to the smaller part, making the access pattern even more scattered.

5.6.3 Graph structure

Our graph structure clearly leads to several kinds of issues, centered on nodes with high in-degree or high out-degree. The matching becomes slower because of such nodes, and if we want to avoid this saturation in the matching, we need to increase the coarse graph size, which in turn may increase the partitioning time. Also, because of the regular matching, the column blocks get clustered into one node, which leads to a 1D partitioning, and a 1D partitioning is not the best choice for memory usage. One could instead view this graph as a hypergraph, although that would require substantial changes to the codebase; with hypergraphs we might be able to devise agglomerative approaches for matching.

5.6.4 Edgecut

The edge cut is calculated on the entire graph at different stages of the algorithm: the weight of every edge whose endpoints lie in different parts is added to the edge cut. While generating the partitioning, the gains are considered and, at the moment, the minimum edge cut is preferred. For our purpose, however, a larger edge cut means more data reuse between the parts.

5.7 Power-Law Graph Partitioning Attempts

There are two main issues with the DAGs we are working with. As we have already seen, their structure follows a specific pattern in which some nodes have many outgoing edges and some nodes have many incoming edges, so the degree distribution resembles that of a power-law graph. Abou-Rjeili and Karypis discuss a multilevel partitioning approach for irregular graphs that follow a power-law degree distribution, and they describe problems very similar to those we face with our graph structure and matching: once the high-degree nodes get matched, few edges can be hidden, which makes the whole matching process slow. The coarsening steps then become slower and slower, and the memory required to store the intermediate coarsened graphs grows. In our current method, a vertex is not considered at all once it is matched; they instead propose a scheme in which a vertex can still be considered for inclusion in the matching set even after it has been matched. They describe two edge-visiting strategies: a globally greedy strategy (GG) and a globally random, locally greedy strategy (GRLG). In GG, the edges are ordered according to some criterion and then taken greedily; in GRLG, some vertices are chosen at random and the edges incident to those vertices are then considered greedily. Since our graphs are directed and we need to maintain acyclicity throughout our approach, the GG strategy will not work for us, so we tried the GRLG approach. We selected a random order for visiting the vertices and ordered the edges incident to each vertex by edge weight, giving the highest-weighted edge the highest priority; the goal is to cluster nodes with high data movement between them into the same cluster so that we can gain some data reuse. Although this approach worked for some random orderings, it did not work for all of them: it generated a cycle after some matching steps. Here we present an example of such a case.
In Figure 5.15 we show how a cycle can be created with the approach described above, using an actual portion of our graph that exhibits the typical structure of one node having many outgoing edges; for ease of explanation the nodes are simply numbered. In the first graph, when node 8 is chosen, all of its adjacent edges are selected for matching. In our earlier scheme, only one of these edges would be chosen and the rest would simply be replicated in the coarsened graph without being hidden; here, all the similar edges incident to nodes 8, 10, 12 and 14 are matched, which was one of the main motivations behind the aforementioned paper. The second graph shows the coarsened graph after the first stage with updated vertex numbers. In the next stage, some vertices and their adjacent edges are again matched in a random order, but this time a problem appears: since vertex numbers are updated to the leader of the matched edge, the third graph contains an edge from node 1 to node 9 as well as an edge from node 9 to node 1. This is a cycle, so this approach will not work for us. For undirected graphs these schemes work quite well, as the authors showed, but for our case we unfortunately cannot implement the ideas as they were stated.

Figure 5.15: A cycle is created using the GRLG approach

5.7.1 Lowest Common Ancestor

To get rid of this cycle issue, we came up with a different method based on the lowest common ancestor (LCA) of two nodes in the graph. During the matching process, when we consider matching a node with one of its children, we check whether the other parents of that child share a lowest common ancestor with the node under consideration. If they do, a cycle might be created later: in some later coarsening step the vertex we are considering might be matched with that common ancestor, which would give the other sibling an edge to the common ancestor and thus create a cycle. Hence, if the other parents of this child have a lowest common ancestor with the vertex we are considering, we do not match this edge; we apply this test for all vertices. For example, in the second graph of Figure 5.15, when we consider node 7, its child is node 8, whose other parents 9, 11 and 13 all have the lowest common ancestor 2, which is also an ancestor of node 7. This means that if node 7 is matched with node 2 in some coarsening step, a cycle will be created, so in this case we do not match nodes 7 and 8.

Although this scheme never creates a cycle in the coarsened graph, it unfortunately stops the coarsening process too early. The main reason is that in our graph structure many nodes share a common ancestor at a very high level of the graph, even though those nodes sit far down in the graph and might never actually be matched together. In the earlier coarsening steps some matching is done, but after a few steps every node has a common ancestor with some other node, so no further matching is possible and the coarsening stops. This scenario does not serve our purpose. Another important goal for us is to match in such a way that the matrix can be traversed in a blocked shape, and that purpose is not served either.
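To make the rule above concrete, here is a minimal Python sketch of the shared-ancestor test (our own illustration, not the partitioner's code; the helper names are ours):

from collections import defaultdict

# Before matching an edge (u, child), reject it if any other parent of the
# child shares an ancestor with u, since that ancestor could later absorb u
# and close a cycle.
def ancestors(parents, node, cache):
    """All ancestors of `node` in a DAG given a parent adjacency list."""
    if node not in cache:
        anc = set()
        for p in parents[node]:
            anc.add(p)
            anc |= ancestors(parents, p, cache)
        cache[node] = anc
    return cache[node]

def safe_to_match(parents, u, child, cache):
    """True if matching edge (u, child) passes the shared-ancestor test."""
    u_anc = ancestors(parents, u, cache)
    for other in parents[child]:
        if other == u:
            continue
        if ancestors(parents, other, cache) & u_anc:
            return False        # shared ancestor -> possible future cycle
    return True

if __name__ == "__main__":
    # toy DAG mirroring the Figure 5.15 discussion: node 2 is an ancestor of
    # node 7 and of the other parents (9, 11, 13) of node 8
    parents = defaultdict(list)
    for p, c in [(2, 7), (2, 9), (2, 11), (2, 13),
                 (7, 8), (9, 8), (11, 8), (13, 8)]:
        parents[c].append(p)
    print(safe_to_match(parents, 7, 8, {}))   # False: do not match 7 and 8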
5.7.2 Hierarchical partitioning attempts

We have already seen that our graph has a structure that leads to a column-wise access pattern over the large matrix within a partition. This kind of 1D partitioning is not ideal for cache utilization, since the entire output block vector has to be kept in memory for all the columns, and hence for almost all the partitions. We would instead like to access the matrix in a 2D shape to improve cache utilization. With this in mind, we tried another approach that hierarchically breaks a large task down into small tasks.

We start the partitioning with tasks defined on a large matrix block size, e.g. 64K or 32K. With a large block size the number of nodes in the DAG is relatively small, so our regular partitioning algorithms, which struggle with large numbers of nodes, can partition this graph with ease; we therefore generate a partitioning using the large block size. The problem with a large block size is that it incurs a large number of unnecessary calculations: in a 64K × 64K block there might be only a few nonzeros, yet we would still have to apply all operations to the whole block, where almost all values are zero. In Figure 5.16 we show the partitioning obtained on the Z5 graph; there are clearly fewer blocks, and they mostly retain the 1D pattern.

Figure 5.16: Partitioning of Z5 matrix with 64K block size

For efficient computation, however, we would like small matrix blocks, mainly because we can then skip the many unnecessary calculations we were doing with large blocks, and because cache utilization increases with smaller block sizes. Hence we divide each large block into small blocks and create a completely new graph with these small blocks. Every task that executes on a block, whether a matrix block or a vector block, is replicated into newly constructed task nodes operating on the small blocks, with the appropriate incoming and outgoing edges. In Figure 5.17 we show an example of how the hierarchical blocking is actually done. Here the matrix is divided into 4 × 4 blocks, and we use this graph for partitioning. After we obtain the partitioning, we refine the graph into a different graph with a smaller block size: each of the large blocks is divided further into 4 × 4 small blocks, so for each large matrix block we get 4 × 4 = 16 new nodes, renamed appropriately. Note that when we refine the large blocks into small blocks, many of the small blocks have no nonzeros in them; we consciously skip those nodes in the refined graph, since they would not incur any valid computation. The original spmm,0,0 node is thus divided into 16 new SpMM tasks, from spmm,0,0 to spmm,3,3. All the vector blocks are also divided into small blocks, so a dgemm,0 task is divided into 4 dgemm tasks, from dgemm,0 to dgemm,3, and so on. All the edges in the original graph are converted into the appropriate number of edges to the appropriate nodes.

Figure 5.17: Hierarchical blocking scheme

As we already know, the partitioner assigns a partition number to each task; a minimal sketch of the refinement step is given below.
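The sketch below is our own Python illustration of this refinement step (the name refine_partition and the toy input are hypothetical, not the actual DeepSparse code): each coarse SPMM task is split into b × b fine tasks, fine tasks whose small block holds no nonzeros are dropped, and each surviving fine task inherits the part number of its coarse task.

def refine_partition(coarse_part, nonzero_fine_blocks, b):
    """coarse_part: dict mapping coarse (row, col) SPMM tasks to a part id.
    nonzero_fine_blocks: set of fine (row, col) blocks that contain nonzeros.
    Returns a dict mapping fine SPMM tasks to the inherited part id."""
    fine_part = {}
    for (crow, ccol), part in coarse_part.items():
        for dr in range(b):
            for dc in range(b):
                fine = (crow * b + dr, ccol * b + dc)
                if fine in nonzero_fine_blocks:      # skip all-zero blocks
                    fine_part[fine] = part
    return fine_part

if __name__ == "__main__":
    coarse_part = {(0, 0): 0, (0, 1): 1}             # two coarse SPMM tasks
    nonzeros = {(0, 0), (1, 1), (0, 4), (3, 7)}      # fine blocks with data
    print(refine_partition(coarse_part, nonzeros, b=4))

The vector tasks and the DAG edges would be split in an analogous way to complete the refined graph, as described in the text.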
Hence, the part number of each task in the original graph is replicated to the corresponding tasks in the refined graph with the small block size. Since each node in the original graph is divided into small blocks, we expect to see a more 2D-like shape in the refined graph. In Figure 5.18 we show the refined graph in which each block of the original graph has been divided into 64 × 64 blocks of size 1K, and we can see how the matrix is accessed in the refined graph: there is still a slight column-based access, but now with a blocked shape, which should provide the data reuse we are looking for.

Figure 5.18: Partitioning of Z5 matrix with 1K block size

5.8 Memory Bound Implementation

One of the most important aspects of our partitioner is that it strictly maintains a constraint in all partitions; in our case the constraint is the amount of fast memory, and all our partitions must respect this memory bound. There are two important considerations in managing this constraint across the partitions.

First, we must correctly identify all the active memory that will be needed during the execution of a part. Since we design the partitioner under the assumption that all tasks in a partition will execute on a single node, every piece of active memory allocated for the inputs and outputs of the individual tasks has to contribute to the active memory calculation. Not only the input and output memory that is read or written, but also the internal memory allocated while executing a task (such as temporaries allocated on the heap through malloc whose scope is limited to that task) must be accounted for in the active memory calculation.

Second, an important and motivating aspect behind our partitioner is to capture the data reuse between tasks whenever possible. The generated DAG has dependencies between tasks where an edge represents data flowing from one task to another, and the memory a task needs as input has already been written as output by another task. We can therefore safely say that the memory chunk written by the producer will be present in the fast memory, and a task that needs that chunk as input will find it there; such a chunk is active and should not be counted multiple times.

Let us illustrate these two considerations with an example. In Figure 5.19, assume a partition consisting of tasks A, B, C, D, E and F, represented as nodes; the edges represent the memory chunks the tasks need as input and the memory they generate as output, which other tasks consume as input. For example, task D has incoming edges from tasks A and B and an outgoing edge to task E: it receives the memory chunks a and b as input from A and B, respectively, and generates an output d that is forwarded to E as its input. During execution, tasks B and C both use a common memory chunk p2 as input. If B is executed first, p2 is loaded into the fast memory, so when C executes, p2 is already there; consequently, when calculating the active memory we should not count p2 again. We keep adding the memory needed by each task to the active memory unless it is already part of it. When task E is to be executed, the inputs it needs are d, b and x (the latter coming from some other part).
Since both d and b are already in the fast memory, these memory chunks are not added again to the active memory calculation; x, however, has not yet been included, so it is added.

Figure 5.19: Memory management in partitions

We also include all the temporary memory allocated inside each task, which has scope only within that task; since it is nevertheless allocated, it is part of the active memory of the part. Once the active memory calculation is complete, we check whether the total active memory needed by the partition is below the amount of fast memory. If it is, we can go ahead with executing the tasks of this part, since they will fit in the fast memory; if the active memory exceeds the fast memory, the part needs to be partitioned further.

5.9 Experimental Results

5.9.1 Experiment setup

We conducted all our experiments on Cori Phase I, a Cray XC40 supercomputer at NERSC, mainly using the GNU compiler. Each Cori Phase I node has two sockets, each with a 16-core Intel Xeon E5-2698 v3 (Haswell) CPU. Each core has a 64 KB private L1 cache (32 KB instruction and 32 KB data) and a 256 KB private L2 cache, and each CPU has a 40 MB shared L3 cache (LLC). We use thread affinity to bind threads to cores and use at most 16 threads to avoid NUMA issues. We test DeepSparse using five matrices with different sizes, sparsity patterns and application domains (see Table 4.2). The first four matrices are from the SuiteSparse Matrix Collection, and the Nm7 matrix is from the nuclear no-core shell model code MFDn. We compare the performance of the partitioned schedule with two library-based implementations: i) libcsr, an implementation of the benchmark solvers using thread-parallel Intel MKL calls (including SpMV/SpMM) with CSR storage of the sparse matrix; and ii) libcsb, an implementation again using Intel MKL calls but with the matrix stored in the CSB format; as well as with our DeepSparse implementation. Performance data for LOBPCG is averaged over 10 iterations, while the number of iterations is set to 50 for the Lanczos runs. Our comparison criteria are the L1, L2 and LLC misses and the execution times of both solvers; all cache miss data was obtained using the Intel VTune software.

5.9.2 Performance of the partitioner

We ran our custom scheduler on both Haswell and KNL nodes and measured the L1, L2 and last-level (on KNL, MCDRAM) misses using VTune, in a similar way to our previous project. In Figure 5.20 we show the cache miss and execution time results for the Nm7 matrix on Haswell nodes for different block sizes. For L1 we do not see much improvement with the custom scheduler, but for the L2 and L3 caches the custom scheduler achieves better cache performance than the libcsr version. The improvement over DeepSparse, however, was not consistent, and often the custom scheduler could not beat the DeepSparse performance; we also noticed that the execution time was not what we expected for the custom scheduler. In Figure 5.21 we show the cache miss and execution time results for the Nm7 matrix on KNL nodes for different block sizes. We see similar trends here, and additionally we get better L1 cache behavior in this case.

Figure 5.20: Performance comparisons between different cache levels and execution time of Nm7 matrix in Haswell nodes
Likewise, we could not outperform DeepSparse on the KNL nodes either.

Figure 5.21: Performance comparisons between different cache levels and execution time of Nm7 matrix in KNL nodes

5.10 Future work on the partitioner

We tested the performance of the custom scheduler on other matrices on both the Haswell and KNL machines and saw similar trends in all of our experiments: we achieve an improvement over the library versions, as expected, but we cannot outperform the DeepSparse performance for all matrices and all block sizes. This means that the OpenMP default scheduler is doing a very good job, and that our custom scheduler is not performing as we expected. One aspect that holds us back is having to use a taskwait after each part: if we did not put a taskwait between partitions, then, since we still rely on OpenMP to execute all the tasks of a partition together, tasks from a different part could be scheduled before a task of the current part as soon as their in/out dependencies are resolved, which would hinder our performance. Hence we believe we would need additional heuristics to use the partitioner with a better scheduler. One idea is to use hypergraphs and partition the hypergraph, but converting the graphs and every related data structure to hypergraphs would be a cumbersome undertaking. We still hope to continue improving the partitioner. At this point, however, we decided to move on with the existing OpenMP scheduler and work towards implementing the distributed MFDn. Before doing that, we wanted to understand the communication patterns and their behavior in MFDn, so we use a simulator called SST to simulate the networking topology used by MFDn, which we discuss in detail in the next chapter.

Chapter 6
SIMULATING THE COMMUNICATION PATTERNS IN A LARGE SCALE DISTRIBUTED APPLICATION

Our target application, MFDn [89, 90, 4], is used for ab initio calculations of the structure of atomic nuclei. The structure of an atomic nucleus with A nucleons can be described by solutions of the many-body Schrödinger equation

Ĥ Ψ(~r1 , . . . , ~rA ) = E Ψ(~r1 , . . . , ~rA )   (6.1)

where Ĥ is the nuclear Hamiltonian acting on the A-body wavefunction Ψ(~r1 , . . . , ~rA ), ~rj the single-nucleon coordinates, and E the energy. For stable nuclei, the low-lying spectrum is discrete, and the solution associated with the algebraically smallest eigenvalue is the ground state. The nuclear Hamiltonian Ĥ contains the kinetic energy operator K̂ and the potential term V̂, which describes the strong interactions between nucleons as well as the Coulomb repulsion between protons. Solving for nuclear properties with realistic nucleon–nucleon (NN) potentials, supplemented by three-nucleon forces (3NF) as needed, for more than a few nucleons is recognized to be computationally hard [124]. Obtaining highly accurate predictions for properties of light nuclei using the No-Core Configuration Interaction (NCCI) approach requires computing the lowest eigenvalues and associated eigenvectors of a very large sparse symmetric many-body Hamiltonian matrix Ĥ. If A is the number of nucleons in a nucleus, this matrix is the projection of the nuclear many-body Hamiltonian operator into a subspace (configuration space) spanned by Slater determinants of the form

\Phi_a(\vec{r}_1, \ldots, \vec{r}_A) = \frac{1}{\sqrt{A!}} \det \begin{pmatrix} \phi_{a_1}(\vec{r}_1) & \cdots & \phi_{a_A}(\vec{r}_1) \\ \vdots & \ddots & \vdots \\ \phi_{a_1}(\vec{r}_A) & \cdots & \phi_{a_A}(\vec{r}_A) \end{pmatrix},   (6.2)
where φa are orthonormal single-particle wavefunctions indexed by a generic label a. The dimension of the subspace or basis spanned by these many-body wavefunctions Φa depends on (1) the number of nucleons A; (2) the number of single-particle states; and (3) the (optional) many-body truncation. In the NCCI approach, one typically works in a basis of harmonic oscillator single-particle states, where the number of single-particle states is implicitly determined by the many-body truncation Nmax, which imposes a limit on the sum of the single-particle energies (oscillator quanta) included in each Slater determinant of A nucleons. In the limit of a complete (but infinite-dimensional) basis, this approach would give the exact bound state wave functions; in practice, increasingly accurate approximations to both the ground state and the narrow (low-lying) excited states of a given nucleus often require increasingly large values of Nmax. The dimension D of Ĥ increases rapidly both with the number of nucleons, A, and with the truncation parameter, Nmax. The sparsity of the matrix Ĥ (and hence the memory requirements and computational load) depends on the nuclear potential: with NN-only potentials this matrix is extremely sparse, whereas with 3NFs there are significantly more nonzero matrix elements, and with a (hypothetical) A-nucleon force one would get a dense matrix [91].

Many-Fermion Dynamics—nuclear, or MFDn, is an NCCI code for nuclear structure calculations using realistic NN and 3NF forces [125, 126, 127, 128, 129], written in Fortran 90 using a hybrid OpenMP/MPI programming model. A typical calculation consists of constructing the many-body Hamiltonian matrix in the chosen basis, obtaining the lowest eigenpairs, and calculating a set of observables from those eigenpairs. Efficiently utilizing the aggregate memory available in a cluster is essential, because a typical basis dimension is several billion, corresponding to a very sparse matrix with tens of trillions of nonzero elements. The lowest few eigenvalues and eigenvectors of the very large, real, sparse, symmetric Hamiltonian matrix are found with iterative solvers, using either the Lanczos [92] or the LOBPCG [93] algorithm. The key kernels in these iterative eigensolvers are the Sparse Matrix–Vector (SpMV) and sparse transposed Matrix–Vector (SpMVT) products, as only half of the symmetric matrix is stored in order to reduce the memory footprint. The sparse matrix is stored in the CSB_COO format [22], which allows for efficient linear algebra operations on very sparse matrices, improved cache reuse on multicore architectures, and thread scaling even when the same structure is used for both SpMV and SpMVT (as is the case in this application).

6.1 MFDn Communication Motif

On a distributed-memory system, we would like to distribute the matrix among different processing units in such a way that each processing unit performs roughly the same number of operations in parallel sparse matrix computations. To achieve this, we split the sparse matrix into nd × nd approximately square submatrices, each of approximately the same dimension and with the same number of nonzero matrix elements, for both CPU time and memory load-balancing purposes. As mentioned above, since the matrix Ĥ is symmetric and the number of nonzero elements in Ĥ increases rapidly with increasing problem size, we store only half of the symmetric matrix. This results in np = nd (nd + 1)/2 submatrices for, e.g., the lower-triangle part of the Hamiltonian.
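Since only one triangle of Ĥ is stored, every stored submatrix has to serve both the regular and the transposed product. The following minimal sketch illustrates that idea on plain COO triplets; it is our own illustration, not MFDn's Fortran CSB_COO implementation.

import numpy as np

# The strictly lower triangle of a symmetric matrix is stored once as COO
# triplets, and the same triplets drive both y = L x (SpMV) and z = L^T x
# (SpMVT); their sum gives the product with the full symmetric matrix when
# the diagonal is handled separately.
def spmv_and_spmvt(rows, cols, vals, x):
    y = np.zeros_like(x)   # y = L  x
    z = np.zeros_like(x)   # z = L^T x
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]   # regular product
        z[c] += v * x[r]   # transposed product from the same stored element
    return y, z

if __name__ == "__main__":
    # strictly lower-triangular part of a small symmetric matrix (zero diagonal)
    rows, cols, vals = [1, 2, 2], [0, 0, 1], [3.0, 4.0, 5.0]
    x = np.array([1.0, 2.0, 3.0])
    y, z = spmv_and_spmvt(rows, cols, vals, x)
    print(y + z)           # equals A @ x for the full symmetric A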
One can then distribute these submatrices over np different processing units, perform local SpMV and SpMVT operations with these local submatrices, and synchronize after every iteration along both the columns and the rows of this grid of nd × nd submatrices. However, in addition to CPU and memory load-balancing, we also have to consider communication load-balancing, and the naive distribution of just the lower triangle (or, equivalently, just the upper triangle) leads to highly imbalanced communication patterns [90]; in particular, because the number of processing units per column (row) ranges from 1 to nd for different columns (rows).

6.2 Simulation of a Distributed Communication

To create a more efficient mapping for load-balanced communication, one can start from the nd × nd square grid of submatrices and, taking into account that Ĥ is symmetric, require that each column (row) in the nd × nd grid has the same number of submatrices (specifically (nd + 1)/2) assigned to one of the np = nd (nd + 1)/2 processing units.
In our actual implementation, we have replaced the initial Gather + Broadcast along the columns by a single call to AllGatherV, and similarly, the final Reduce + Scatter along the columns by a single call to ReduceScatter. Also, the Broadcast and Reduce along the rows is overlapping with the local SpMV and SpMVT . (Figure adapted from Ref. [4] 157 Next, the diagonal processors broadcast their (block of) sub-vectors of length D/nd both along the column-communicators and along the row-communicators; Each processing unit performs a local SpMV/SpMM and SpMVT /SpMMT ; The outputs of both the local SpMV/SpMM and SpMVT /SpMMT are reduced along the column and row communicators, respectively, onto the diagonal processors; Finally, each diagonal processor scatters the final result among the (nd + 1)/2) pro- cessing units within its column-communicator, for further processing in preparation for the next iteration. In practice, we perform the initial gathering of the vector segments onto the diagonals fol- lowed by the broadcast along the column-communicators in a single collective MPI call, namely MPI_AllGatherV. Similarly, the reduction along the column-communicators fol- lowed by the final scatter of the vector segments at the end can also be done in a sin- gle collective MPI call, namely MPI_ReduceScatter. Thus the entire SpMV/SpMM re- quires only four collective MPI routines: MPI_AllGatherV, MPI_Bcast, MPI_Reduce, and MPI_ReduceScatter. It has been shown that this distribution of the data and implementation of the communi- cation performs efficiently on current HPC platforms; furthermore, on multicore processors it allows us to hide the broadcast and reduction along the row-communicators during the SpMV/SpMM phase behind computation [22], depending on the local CPU performance, the actual network performance, and the problem size. However, for large-scale production runs, using thousands of processing units, and with a vector dimension in the tens of billions, the communication overhead does become a major factor, surpassing the time required for the local sparse matrix computations. For an application scientist preparing for the next 158 generation of HPC platforms, it is therefore very useful to be able to simulate just the com- munication overhead before the system becomes available so that any algorithm changes or optimizations can be worked out in advance. 6.3 Simulation Framework and Implementation To evaluate the communication pattern we utilize the Structural Simulation Toolkit [88]. SST is a flexible simulation framework designed to explore trade-offs in the parallel and distributed system design space. SST is widely used within the academic, national labora- tory and vendor communities [130, 131, 132]. Components included in SST represent the different subsystems of a supercomputer, including, CPUs, accelerators, memory, networks and software. These components are connected together by the core SST framework which can then simulate discrete events in parallel across multiple nodes using MPI. In this work, we utilize three existing modules within SST, namely Ember, Firefly and Merlin. These represent the workload/communication pattern, communication software stack (e.g., MPI) and the network fabric (routers, cables, topology, etc.), respectively. 6.3.1 Ember Ember is one of the components of the SST libraries. This is a state-machine based event engine. This event engine replicates communication patterns in a scientific application at a simulation endpoint. 
Sets of application communication patterns are called motifs. A collection of motifs can be created within each point to create a complex workflow. A sequence of events containing primitive communications and also collective commu- nications, computations, timings, barriers are created in a motif. A queue is maintained 159 where these events are pushed and executed one by one until the queue is empty. Motifs are prompted to refill the queue with additional events once emptied. The communication events are sent to the Firefly layer. A communication event is coded in Ember but converted into actual operations by Firefly component which keeps track of the parameters associated with the encoded Ember communication motif code. Ember uses short sprints of events. Ember component can scale to very large simulated node counts regardless of any constraints on the amount of memory or simulation related processing. 6.3.2 FireFly Firefily is another component in SST which implements a state machine based data move- ment stack. It is the "MPI" equivalent in SST. The main purpose of Firefly is to help testing different network topologies at a larger scale than actual simulations would be allowed to run on by a network stack. Firefly can not be run stand alone. It requires a network component which in this case is Merlin and a driver component which in our case is Zodiac. This component provides library supports for point-to-point communications like send, receive, wait etc. It also provides support for the collectives like alltoall, reduce etc opera- tions. It follows an eager/rendezvous protocol model. The state machine functionality moves the data between hosts over a bus. The motifs are written in Ember component and sent to Firefly component where the network parameters along with the message sizes, bandwidth, latencies are processed. 160 6.3.3 Merlin Merlin is a combination of low level networking components that be used to simulate high speed networks or on chip networks. Merline comprises of a range of different network topologies. Merlin also provides a set of tunable parameters like buffer size, latencies, routing models which can be used to create a new architecture that is not yet released. Merlin library currently supports DragonFly topology which we have used in our simulation, Torus, Fat Tree topologies. All topologies use deterministic routing. 6.4 Implementation For comparisons with SST simulations, we implemented a communication only version of MFDn, which we call communications-skeleton code, because for very large matrices where large numbers of compute nodes are needed, communication overheads dominate the exe- cution time of the MFDn eigensolver. Hence, our SST/MFDN motif does not simulate any computations either. We implemented MFDn’s communication motif (without any computations) using SST. We show the pseudocode in Algorithm 4. MPI communication routines implemented in SST/Ember have the following common syntax: enQ_ < M P I_communication_routine_name >. We tried to replicate the actual application code with only communication routines in our MFDn motif, but we could not use the exact Ember/SST equivalent of the MPI routines in a few cases where they lacked support as detailed below. MFDn uses MPI_Comm_Split for creating its row and column communication groups. 
However, the corresponding enQ_Comm_Split function was not producing the correct communication groups for our motif, so we used the enQ_Comm_Create function instead, which creates a custom communicator for a given set of MPI ranks.

Algorithm 4: Ember pseudocode for the MFDn communication motif.
1  enQ_Comm_Create(. . . , Comm_world, . . . , Col_Comm);
2  enQ_Comm_Create(. . . , Comm_world, . . . , Row_Comm);
3  for iter = 1 to maxIteration do
4      enQ_Allgather(. . . , vector_segments, . . . , Col_comm);
5      enQ_bcast(. . . , sub_vector, . . . , Row_comm);
6      enQ_reduce(. . . , SpMM_output, . . . , Row_comm);
7      enQ_reduce(. . . , final_result, . . . , Col_comm);
8      enQ_scatter(. . . , final_result, . . . , Col_comm);

MFDn uses MPI_Reduce_Scatter along the column communicator in the final phase, as pointed out in Figure 6.3, because MPI_Reduce_Scatter significantly reduces the execution time compared to an MPI_Reduce followed by an MPI_Scatter. Unfortunately, SST currently does not have an equivalent enQ_Reduce_Scatter function, so we used an enQ_Reduce followed by an enQ_Scatter in the MFDn motif. In all results in the following section, where we compare the timings from the SST simulation and the real application, we add the simulation times of the reduction and the scatter operations together; we acknowledge that this does not entirely capture the effect of MPI_Reduce_Scatter in the real runs.

Another important difference between MPI and Ember/SST is that, unlike the MPI_Reduce operation, no local aggregation is applied in the enQ_Reduce function. In MFDn, reductions are actually performed in two different ways. One uses the default MPI_SUM aggregation, but this default option does not make use of all available cores in a hybrid MPI-OpenMP code. Since the dimensions of the vectors in MFDn are very large, local aggregations during reductions are potentially very time consuming; therefore MFDn also uses a customized reduction that employs OpenMP multithreading for the aggregation, which we refer to as OMP_SUM. For most moderate-size problems the OMP_SUM version outperforms the default MPI_SUM, but for very large numbers of ranks, or if we use 16 or
When we run the exact process distribution using SST, it assumes that the first np cores/nodes will be used for the simulation and the custom communicators are created accordingly, thus deviating from the random node allocation scheme. Therefore, we have introduced a similar random node selection in the SST simulation code for MFDn. In Figure 6.4, we show a small example where we need 6 nodes with one MPI ranks per node for our simulation. We assume the machine has 32 nodes in total. When we ask for 6 nodes, a random set of 6 nodes are given. Then in the SST simulations, the custom 163 Figure 6.4: Random selection of the ranks. communicators will be generated accordingly as shown in Figure 6.4. If all MPI ranks in the SST simulation are consecutive and start from 0, then it does not portray a real world scenario and it is possible that it does not catch the actual communication bottlenecks. Introducing this randomness helps us making the SST simulation close to the real world simulations. Note that when we use more than one MPI rank per node, some of the MPI ranks will be bundled under the same node. We have created the random set keeping this practical issue in mind. Whenever we need more MPI ranks per node for our simulation, first a set of random nodes will be selected. Then the set of MPI ranks will be created using those nodes and the number of MPI ranks bundled together. 6.5 Evaluation and Results For our experiments, we chose two different clusters. We used the Cori-KNL cluster as an existing machine for validation and another machine similar to the upcoming NERSC Perlmutter system for prediction. Although the detailed configuration information for this 164 system is not yet public, we use a reasonable approximation. 6.5.1 Hardware and Software We conducted all our validation experiments on Cori Phase II (Cori-KNL), a Cray XC40 supercomputer at NERSC. Each Cori-KNL node is a single-socket Intel Xeon Phi Processor 7250 ("Knights Landing") processor with 68 cores per node @ 1.4 GHz. Each node has 96 GB DDR4 2400 MHz memory with 102 GiB/s peak bandwidth and also a 16 GB MCDRAM (multi-channel DRAM). Although not all details of the Perlmutter system are released yet, it is scheduled to be delivered in two phases. Phase 1 will have 12 GPU-accelerated cabinets and 35 PB of all- Flash storage and Phase 2 will have 12 CPU cabinets. Each of Phase 1’s GPU-accelerated nodes will have 4 NVIDIA A100 Tensor Core GPUs based on the NVIDIA Ampere GPU architecture, along with 256GB of memory for a total of over 6000 GPUs. In addition, the Phase 1 nodes will each have a single AMD Milan CPU. Each of Phase 2’s CPU nodes will have 2 AMD Milan CPUs with 512 GB of memory per node. The system will contain over 3000 CPU-only nodes. For simulating a Perlmutter like machine, we adjusted our parameters accordingly since it will have larger memory. We use SST_9.1.0 version for all our simulations. While installing SST, we used Open- MPI_4.0.2 version as a supporting flag. For compiling SST, we use NERSC’s default pro- gramming environment modules, namely PrgEnv-Intel/6.0.5. 165 Table 6.1: Matrices used in this study, the dimensions and number of nonzero matrix elements of each matrix. 
Nucleus            Dimension        Number of Nonzeros
11Be, Nmax = 8     196,861,465      146,137,030,364
10Be, Nmax = 9     430,062,264      409,045,051,874
10Be, Nmax = 10    1,343,536,728    1,600,272,603,633

6.5.2 Benchmark problems

For normal production runs of MFDn, the dimension of the matrices ranges from a few hundred to tens of billions; the largest runs to date, on nearly the full Cori-KNL machine, have a dimension of about 35 billion. For practical reasons, we restrict ourselves here to three problem sizes, as listed in Table 6.1. We also list the number of nonzero matrix elements in half of the symmetric matrix – this is what dominates the computational load.

It has already been reported that the wall time needed to simulate motifs using SST varies depending on several factors, including congestion, adaptive routing, the number of events and the distribution of cores across physical nodes. Simple motifs take little wall time, but complicated motifs require a large amount of wall time across a large number of nodes due to limited memory. For example, for the dimension 430,062,264 it took more than 33 hours to finish 2 iterations of our motif using 16 Cori-KNL nodes with 68 cores each. For comparison, a communication-skeleton run of MFDn, performing only the communication during the eigensolver, took less than half an hour on 71 nodes for 5 runs with different numbers of MPI ranks and 20 iterations per run, i.e., an aggregate of about 30 node-hours for the communication-skeleton runs compared to 512 node-hours for the SST simulation. Of course, the advantage of the SST simulation is that one can simulate the performance of machines that do not (yet) exist. Because of this issue, we could not compare the cases with larger dimensions requiring larger numbers of nodes. We believe that our motif being complicated, with different communicators, increased the SST wall time compared to simpler motifs.

Figure 6.5: Illustration of a general Dragonfly topology with a single group shown on the left and the optical all-to-all connections of each group in a system shown on the right. Per the original definition of a dragonfly [5], the design within a group is not strictly specified.

6.5.3 SST parameters for Cori-KNL simulations

In SST there are a number of network topologies that we can use to simulate our motifs. Since both Cori-KNL and Perlmutter are based on the Dragonfly topology [5], we use the Dragonfly option in SST. Dragonfly networks combine high-radix routers into virtual routers, called groups, which are fully connected to the other groups by optical links. Local ports connect a router to a compute node or NIC; group ports connect routers within the same group; and global ports facilitate inter-group traffic, using optical links so that they can span larger distances than is practical for electrical cables. For a visual reference the reader may refer to Fig. 6.5. As mentioned before, because of our interest in communication overheads, we have elected to use the communication-only skeleton code and its SST implementation. Specifically, we utilize SST to accurately represent the packet-level routing, buffering and internal switch characteristics of the Dragonfly network, as well as the MPI semantics and message matching.

Below, we specify the Merlin and Firefly parameters we use to simulate the Cori-KNL interconnect.

Network Topology: For simulating the Cori-KNL system, we use a network with 12056 nodes. In SST, we need to provide the shape of the network we want to use as a parameter.
We use 4 local ports and 96 as the group port value. To replicate the same peak bisection bandwidth as Cori-KNL, we use 10 global optical links among groups. Note that we limit our runs to the minimum number of "switches" required: there are (4 × 96) = 384 nodes in each Cori-KNL switch, so, for example, for a simulation needing 1000 nodes we use 3 groups in the corresponding SST simulation.

Router and NIC Parameters: In Table 6.2 we present the router and NIC parameters we used for simulating the Cori-KNL cluster. Each link has a bandwidth of 8 GB/s. The port input and output latency values are taken from the Cray documentation [133] where possible, or otherwise estimated (as is the case with the input and output buffer sizes). In practice, there are subtle differences in the real system that Merlin does not currently capture: for example, in a real Cray Aries router, traffic from a single 8 GB/s NIC is divided across 48 router tiles, and depending on whether a tile is optical or electrical, its bandwidth may vary between 4.7 and 5.25 GB/s. Though these architectural subtleties are more complex than Merlin can currently simulate, SST is widely used within industry to simulate the performance of large systems, and we use parameters that provide as close an approximation as possible to the NERSC Cori network.

Table 6.2: Router and NIC parameters used for simulating Cori-KNL

Parameter Name         Value
flitSize               6 Byte
port input latency     150 ns
port output latency    150 ns
link latency           150 ns
packetSize             64 Byte
link BW                8 GB/s
input buffer size      16 KB
output buffer size     16 KB

NERSC uses SLURM for job scheduling. With the "--switches" flag in SLURM, we can limit our runs to a certain number of Dragonfly groups when we need a small number of nodes. This significantly reduces the communication time, as the global optical links are not used in these cases, and we used the appropriate parameters in the SST simulations in accordance with this. We have also implemented the random distribution of processes in SST in a way that accounts for this, so that the random set is created in the same range as in the production runs. The matrices considered in this study all fit within a switch of 384 KNL nodes.

6.5.4 Simulation results in Cori-KNL

We run the simulations with the 3 matrices from Table 6.1. On Cori-KNL the total memory per node is 96 GB; to use half the memory of a node (48 GB) for the sparse matrix, we try to keep the number of nonzeros around 6 billion per node. We use different numbers of MPI ranks per node (1, 2, 4, 8, 16) for both the communication-skeleton runs of MFDn and the SST simulations. Thus, for each of the matrices, the number of nodes required remains similar, but with more MPI ranks per node the total number of MPI ranks increases (see Table 6.3), and the communication volume between ranks decreases for all MPI communication routines. In Table 6.3 we show the number of MPI ranks (np), the number of diagonal processors (nd), which determines how many column and row communicators we have, and the number of processors in each custom communicator ((nd + 1)/2); see also Figs. 6.2 and 6.3 for the different custom communicators and the communication pattern.

Table 6.3: MPI ranks, number of diagonals, number of ranks per custom communicator, message size during Broadcast and Reduce, and message size during Allgather and Reduce_Scatter.
Dimension = 196,861,465
np     nd    (nd+1)/2    Bcast & Reduce    AllGather & ReduceScatter
28     7     4           112 MB            28 MB
45     9     5           88 MB             18 MB
91     13    7           60 MB             8.7 MB
190    19    10          40 MB             4.2 MB
378    27    14          29 MB             2.1 MB

Dimension = 430,062,264
np     nd    (nd+1)/2    Bcast & Reduce    AllGather & ReduceScatter
66     11    6           144 MB            26 MB
120    15    8           116 MB            14 MB
276    23    12          76 MB             6.2 MB
496    31    16          56 MB             3.5 MB
1128   47    24          36 MB             1.5 MB

Dimension = 1,343,536,728
np     nd    (nd+1)/2    Bcast & Reduce    AllGather & ReduceScatter
276    23    12          232 MB            20 MB
496    31    16          172 MB            11 MB
1128   47    24          116 MB            4.8 MB
2016   63    32          84 MB             2.7 MB
4560   95    48          56 MB             1.2 MB

In Fig. 6.6 we show the execution times per iteration for the communication skeleton of our application with the default MPI_SUM and the custom OMP_SUM, and also for the SST simulation. There is a clear difference between the real application results and the simulation results. We attribute this difference to a combination of (a) the non-negligible reduction operation, which is not present in the SST simulation but is included in the communication skeleton; (b) the lack of an MPI_Reduce_scatter equivalent in the SST implementation; and (c) possible network congestion in the communication-skeleton runs due to the communication pattern of the workload.

Figure 6.6: Total execution time per iteration on Cori-KNL for the real application using the default MPI_SUM, the custom OMP_SUM, and the SST simulation

Figure 6.7: Ratio of different MPI communication routines between the SST simulation and the communication-skeleton runs with MPI_SUM, i.e. SST_time / Real_run_time

In Fig. 6.7 we show the ratio of the SST simulation results to the real application runs for all communication routines in MFDn. As mentioned before, we add the times of the Reduce and Scatter calls in SST and compare them with the Reduce_Scatter time of the communication-skeleton runs. In all cases the broadcasts in SST are predicted to be much faster than in the real runs. Our hypothesis is that this is a result of how the Aries network manages congestion compared to SST Merlin. It has been pointed out in recent work that congestion on the Aries system can severely degrade performance in ways that Merlin does not capture; for example, in a real system, if congestion reaches a threshold, network quiesce operations may occur that are not captured in the SST simulations, and in severe cases the slowdowns on real Cray XC systems have resulted in a 99% reduction in bandwidth [134].

Network congestion on real systems vs. simulation: The message sizes and communication patterns of our motif mean that the motif is largely bisection-bandwidth bound. To better understand the differences between simulated and real network congestion, we created a simple benchmark that communicates across the bisection bandwidth of the real Cori network and compares the congested vs. peak performance. The benchmark divides the network into two partitions and then creates pairs of nodes between pairs of groups such that no node pair shares a group or partition; each pair of nodes then measures the achievable bandwidth. We ran this across the entire Haswell and KNL partition of Cori during a maintenance window and observed the following: peak bandwidth between any set of nodes was 3893 MBps, while average bandwidth was 621 MBps, a roughly 6X difference in bandwidth.
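The stall estimate used in the next paragraph follows directly from these two measured bandwidths:

stalled fraction ≈ 1 − (621 MBps / 3893 MBps) ≈ 0.84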
This suggests that router ports on NERSC Cori were spending approximately 84% of their time stalled, on average, during the evaluation. This slowdown factor due to congestion is very similar to the difference we observe between the SST-simulated and real broadcast operations in Figure 6.8. Existing work has examined the impact of congestion in similar bisection-bound motifs (3DFFT and AllPingPong) for dragonfly topologies simulated in SST [130]; in that work the peak percentage of time that router ports were stalled was approximately 20%. This analysis suggests that there is a gap between how congestion is simulated in SST and how it arises on production systems; improving the models of congestion within SST would be valuable for future studies.

6.6 Simulation for a Future Network

We also used SST to simulate the communication motifs on the Perlmutter machine, which will be installed at NERSC later this year. Below, we specify the interconnect parameters that we anticipate for this system.

Network Topology: Perlmutter will have a Dragonfly topology like Cori-KNL. However, for this network we assume 16 local ports and 32 group ports, and we assume the number of global links among groups to be 4. As in the Cori-KNL runs, simulations requiring fewer than (16 × 32) = 512 nodes are limited to one group, and likewise larger runs use the minimum number of groups required.

Memory per node: We assume this new machine to have a much larger memory per node, likely 4 times the memory of a Cori-KNL node. For the same simulations that we ran on Cori-KNL, we therefore reduce the required number of nodes to one quarter and increase the MPI ranks per node to 4 times the values we used on Cori.

Figure 6.8: Communication time breakdown for a real run and SST simulation for dimension = 1,343,536,728

Table 6.4: Router and NIC parameters used for simulating Perlmutter's predicted network.

Parameter Name         Value
flitSize               6 Byte
port input latency     150 ns
port output latency    150 ns
link latency           150 ns
packetSize             64 Byte
link BW                25 GB/s
input buffer size      64 KB
output buffer size     64 KB

Figure 6.9: Timing comparison of the simulation of the MFDn motif on Cori-KNL and on the soon-to-be-installed Perlmutter machine with our predicted parameters.

Router and NIC Parameters: In Table 6.4 we give the router and NIC parameters we used for simulating Perlmutter with SST. We assume the network to have a link bandwidth of 25 GB/s, and we assume the input/output buffer sizes for this new system to be 4 times those of Aries to account for the increase in the bandwidth-delay product. In Figure 6.9 we show the results obtained using the parameters we predict for the soon-to-be-installed Perlmutter system and compare them with the SST simulations of Cori-KNL. Since we assume higher bandwidth in the new machine, along with larger in-node memory and increased buffer sizes, we predict that the future machine will bring communication overhead improvements over Cori-KNL in most cases.

Expected impact of congestion in future systems: Network congestion has received greater attention in the design of future networks such as the Cray Slingshot system. Existing work has shown that the impact of congestion is significantly reduced on Slingshot networks compared to the Cori Aries network. Because of this, we expect the SST-simulated performance of Perlmutter to be a closer match to the real performance than was observed in Figure 6.7.
6.6.1 Conclusions of this work

In this work, we introduced a new application motif which corresponds to the communication operations in the distributed eigensolver algorithm used in the MFDn code. For validation, we compared the simulations of the SST motif with actual runs, on an existing architecture, of a skeleton code written using only communication routines. We pointed out the differences between real runs and simulation results, and gave possible reasons behind those differences. We also evaluated our motif on a future architecture and compared its results with those on the existing architecture. We also discussed the features we used and some shortcomings of SST. Moreover, we contributed to the introduction of new Ember motifs by the developers in the SST open source community. With these observations, we got a better idea of how to go ahead and implement the distributed MFDn using communication tasks, which we address in the next chapter.

Chapter 7

OPTIMIZING A DISTRIBUTED MEMORY APPLICATION USING DEEPSPARSE

7.1 Motivation

In the quest for a task parallel implementation of an iterative eigensolver, we first dived into the shared memory architecture. We developed the framework DeepSparse, which we described in Chapter 4. In that work, we implemented two different algorithms, Lanczos and LOBPCG, and executed them using our DeepSparse framework. The implementation is based on task parallelism and is an on-node optimization. We observed that DeepSparse achieves 2× - 16× fewer cache misses across different cache layers (L1, L2 and L3) over implementations of the same solvers based on optimized library function calls. We also achieve a 2× - 3.9× improvement in execution time when using DeepSparse over the same library versions.

In the shared memory implementation of DeepSparse, we only have computational kernels; each kernel performs some kind of computation or assignment operation. Given the success of the framework on a shared memory architecture, we were motivated to extend the framework to a distributed memory architecture and into an actual scientific application. With my experience working on the MFDn code, which is a distributed memory CI code, I decided to take this application to build a distributed memory version of DeepSparse.

Our target application, MFDn [89, 90, 4], is used for ab initio calculations of the structure of atomic nuclei. The structure of an atomic nucleus with A nucleons can be described by solutions of the many-body Schrödinger equation

\hat{H} \, \Psi(\vec{r}_1, \ldots, \vec{r}_A) = E \, \Psi(\vec{r}_1, \ldots, \vec{r}_A)    (7.1)

where \hat{H} is the nuclear Hamiltonian acting on the A-body wavefunction \Psi(\vec{r}_1, \ldots, \vec{r}_A), the \vec{r}_j are the single-nucleon coordinates, and E is the energy. This code already contains the LOBPCG algorithm that we explored, and its main kernel is the Sparse Matrix Multiple Vector Multiplication (SpMM), which we had optimized in a prior project; both facts motivated us to use this application.

7.1.1 Introducing communication tasks

In Chapter 6, we discussed the communication pattern of MFDn during the distributed matrix multiplication. We noticed that we have an allgather, a broadcast, a reduction and a reduce-scatter operation in the MFDn code. The detailed explanation is given in Section 6.2 of Chapter 6.
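For orientation, the sketch below outlines this bulk-synchronous pattern in plain MPI, as we understand it from Chapter 6. The function and buffer names (local_spmm, x_local, diag_rank, etc.), and the exact placement of the calls on diagonal versus off-diagonal ranks, are simplifying assumptions for illustration; they are not the MFDn implementation itself.

    #include <mpi.h>

    /* Hypothetical local kernels; the real MFDn kernels have different interfaces. */
    void local_spmm(const void *A_local, const double *x, double *y);
    void local_spmm_t(const void *A_local, const double *x, double *y);

    /* One SpMM iteration, bulk-synchronous view: allgather the input segments
     * along the column communicator, broadcast the assembled block along the
     * row communicator, run the local SpMM and transpose SpMM, then reduce the
     * partial results along the row and reduce-scatter them along the column
     * so every rank ends up with its own output segment for the next iteration. */
    void spmm_iteration_bulk(const void *A_local,
                             const double *x_local, int n_local,
                             double *x_full, int n_full,
                             double *y_partial, double *y_row, double *y_next,
                             int *counts, int *displs, int *recvcounts,
                             int diag_rank, MPI_Comm row_comm, MPI_Comm col_comm)
    {
        MPI_Allgatherv(x_local, n_local, MPI_DOUBLE,
                       x_full, counts, displs, MPI_DOUBLE, col_comm);
        MPI_Bcast(x_full, n_full, MPI_DOUBLE, diag_rank, row_comm);

        local_spmm(A_local, x_full, y_partial);    /* y_partial += A   * x */
        local_spmm_t(A_local, x_full, y_partial);  /* y_partial += A^T * x */

        MPI_Reduce(y_partial, y_row, n_full, MPI_DOUBLE, MPI_SUM, diag_rank, row_comm);
        MPI_Reduce_scatter(y_row, y_next, recvcounts, MPI_DOUBLE, MPI_SUM, col_comm);
    }

Each of these calls operates on the full assembled block; it is exactly this coarse granularity that the blocked communication tasks introduced below break up.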
In practice, for MFDn running on an architecture like KNL, communication usually dominates computation for very large simulations involving a high number of MPI ranks spread over many compute nodes. We simulated the performance of the communication patterns and tried to find possible causes using the SST simulator. We observed that the broadcast operation takes much longer in real life because of possible network congestion, and the very large message sizes further fuel this behavior.

Since our DeepSparse framework is a task-based parallel framework where each task performs a particular matrix or vector operation on a matrix or vector block, we were motivated to use blocked communication tasks. Our motivation was to introduce custom communication tasks, where each communication task communicates with other nodes and only transmits a single block of the matrix or of a vector. This helps us in multiple ways.

Reduced Message Size: Blocked communication reduces the size of the messages to much less than in the original code, which we expected would help in case of network congestion.

Overlapping Communication and Computation: The other motivation was to overlap communication with the computations in the matrix multiplication. Whenever a particular block is received and ready to compute on, the kernels waiting for this particular block of the matrix or of a vector can start immediately, rather than waiting for the entire matrix or vector to be transmitted before starting their execution.

7.1.2 Better pipelining of matrix and vector operations

In the shared memory implementation of DeepSparse, we observed a nicely pipelined execution of different kinds of kernels: the matrix multiplication SpMM and the vector operations, such as vector-vector multiplication and vector-vector transpose multiplication. In Figure 4.8, we can see a pipelined execution of an actual iteration of LOBPCG, where the tasks are different kernels but they use the same data structure, which ultimately improves the performance. That test was done on a Haswell architecture. We also did similar tests on Broadwell machines with a different matrix to validate the framework and the pipelined execution.

Figure 7.1: LOBPCG two-iteration execution flow graph (thread ID vs. time in seconds) for the nlpkkt240 matrix. SpMM is shown in orange, the XY operation in maroon, and XTY in a green color palette.

In Figure 7.1, we show this pipelined execution for the nlpkkt240 matrix on a Broadwell architecture. Here SpMM is shown in orange, the XY operation in maroon, and XTY in a green color palette. We can clearly observe that the matrix and vector operations are well pipelined. Another interesting observation is that the ratio of time spent on SpMM to time spent on the vector operations is close to one. Since the matrix and vector operations take similar amounts of time during an iteration, a well pipelined execution of these kernels improves the cache performance, which it did, and we saw that the execution time actually improved on a shared memory architecture. We were motivated to extend this idea to a distributed application like MFDn, which also has similar matrix and vector operations. With the introduction of communication tasks, we were motivated to carry the idea from shared memory over to distributed memory.
7.2 Methodology

In our previous work, we implemented our custom kernels for a task parallel implementation of the LOBPCG and Lanczos algorithms on a shared memory architecture. I also worked on the actual MFDn code, which is a distributed memory application. For this work, the goal was to add communication tasks, since the other kernels were already implemented.

Using MPI routines as OpenMP tasks is not a very common practice, for a few reasons. It is a tricky exercise, as features of both programming models might easily interact in unexpected ways, resulting in deadlocks, incorrect results, fatal errors or performance issues. The INTERTWinE project from the Barcelona Supercomputing Center has been trying to merge MPI routines with their OmpSs programming model [135]. OmpSs uses a task-based parallel programming model that is very similar to OpenMP. They also report some pitfalls of using MPI with OpenMP.

MPI programs start with a function call which initializes the message passing library. The MPI standard defines two different functions for this purpose: MPI_Init(), used when no multi-threading support is needed, and MPI_Init_thread(), which specifies a desired level of multi-threading support. These routines must be called by one thread only. That thread is called the main thread and must be the thread that calls MPI_Finalize(). The programmer must provide a desired level of multi-threading support to MPI_Init_thread(), which, in turn, may return a value lower than requested. This is because different library implementations may be restricted to different levels (e.g., the absence of locking mechanisms for efficiency in single-threaded programs). We used MPI_Init_thread with the MPI_THREAD_MULTIPLE level, since multiple threads may call MPI with no restrictions if we use this value.

7.2.1 Issue with blocking MPI calls

An important pitfall is using blocking MPI routines as OpenMP tasks. In Figure 7.2 we show a small example of this case. Figure 7.2 shows a case where a specific order of task scheduling can cause a single-threaded execution to hang: all processes execute task C first, making the thread wait for a message that is never sent. The thread enters the MPI routine and cannot leave it until the communication is completed; thus a deadlock is created. Using non-blocking MPI calls, we can get rid of this kind of scenario. In Figure 7.3 we show the kind of call that we make inside an OpenMP task. To stop the thread from waiting inside the task, we can suspend the execution of this task and allow other tasks to run using the taskyield directive.

Figure 7.2: Example code for a blocking MPI call as an OpenMP task

Figure 7.3: Example code for a non-blocking MPI call

Although this approach helps us get rid of deadlocks, there is a caveat. With this technique, modifying every single MPI routine of an application is cumbersome. Also, tasks can be resumed even though the requests they are waiting for have not completed yet; in that case they are resumed only to check the completion of the communication, which is not efficient because there is a chance that they may be suspended again.

7.2.2 Issue with the absence of TAG fields in MPI collectives

Another known problem is the absence of a TAG field in MPI collective routines. I tried to use collectives like MPI_Ibcast and MPI_Ireduce inside OpenMP tasks, but eventually deadlocks and segmentation faults occurred, because each individual task needs to know exactly which message it is sending or receiving via a specific tag.
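To make these two ingredients concrete (the threading level requested at initialization and a non-blocking call wrapped in a task that yields while waiting), the following is a minimal sketch. It assumes a toy two-rank exchange; the buffer size, peer ranks, and the tag value 42 are placeholders rather than MFDn code.

    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* Request full multi-threading support so any thread may call MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);   /* library cannot support threaded tasks */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[1024] = {0};
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task
            {
                if (rank == 1) {
                    MPI_Send(buf, 1024, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD);
                } else if (rank == 0) {
                    MPI_Request req;
                    int done = 0;
                    /* Non-blocking receive with an explicit tag, tested in a loop. */
                    MPI_Irecv(buf, 1024, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD, &req);
                    while (!done) {
                        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
                        if (!done) {
                            #pragma omp taskyield   /* let other tasks run meanwhile */
                        }
                    }
                }
            }
            /* ... computational tasks that consume buf would be created here ... */
        }
        MPI_Finalize();
        return 0;
    }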
Hence, I had to use point-to-point MPI routines like MPI_Isend and MPI_Irecv for the communications. Here we are working with blocks of the vectors. During a collective call, either all processes send messages to a specific destination process, or the root process sends messages to all other processes. The blocks in our vectors are not uniform in size: one thread might send a particular block to the destination while another thread sends another block in a separate task. In the destination process, the threads are waiting in receive tasks. Without a proper tag field, the destination will start receiving a message, but because of the non-uniform block sizes, this creates a size mismatch and eventually a segmentation fault. This is why I had to use point-to-point sends and receives instead of collective routines. In the point-to-point send and receive routines, the TAG field is set to the starting offset of the block. The receive routine also expects the message with this particular tag and size. Hence the size and offset of the messages match and the operation completes as expected.

7.2.3 Distributed SpMM

We use an efficient mapping for load-balanced communication. For this, we start from the nd × nd square grid of submatrices, and, taking into account that Ĥ is symmetric, require that each column (row) in the nd × nd grid has the same number of submatrices (specifically (nd + 1)/2) assigned to one of the np = nd(nd + 1)/2 processing units. There are many different ways to achieve this; the implementation that is used in MFDn [22, 90] is illustrated in the top panel of Fig. 7.4 for 15 processing units on a 5 × 5 grid of submatrices. After each local SpMV and SpMVT, we have to perform two reductions: one along the processing units in the same column, and one along the processing units in the same row, as indicated by the lower panels of Fig. 6.2.

With the mapping of Fig. 7.4, all column- and row-communicator groups contain (nd + 1)/2 processing units per communicator and have essentially the same communication volume as well. Thus, with this distribution the communication load associated with the SpMV or SpMM in the iterative solver is almost the same for all processing units and communicator groups, both in message sizes and in the number of processing units in each communicator, provided that the dimensions of each submatrix are approximately the same.

The additional steps in the iterative solvers, namely orthogonalization of (blocks of) vectors, preparation of the input vector (block) for the next iteration, and, in the case of LOBPCG, applying the preconditioner, are all distributed evenly over all np processors. For this purpose, we further divide each of the nd (blocks of) vectors into (nd + 1)/2 segments, and distribute these evenly over the column-communicator groups. Thus each processing unit deals with a (block of) vectors of length D/(nd(nd + 1)/2) for the orthogonalization and preparation of the input for the next iteration, where D is the dimension of the Hamiltonian. These steps also involve additional communication, mainly reduce or all_reduce over all processing units, but the message sizes are small and the communication overhead during this step is negligible compared to the communication overhead during the SpMV/SpMM phase. On the right side of Figure 7.4, we show how the vectors are also divided into subvectors and stored on each processor.
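Plugging the 5 × 5 example of Fig. 7.4 into these formulas gives the concrete counts:

\[
n_p = \frac{n_d (n_d + 1)}{2} = \frac{5 \cdot 6}{2} = 15, \qquad
\frac{n_d + 1}{2} = 3, \qquad
\frac{D}{n_d (n_d + 1)/2} = \frac{D}{15},
\]

i.e., 15 processing units, 3 of them in each column (or row) communicator, and a vector segment of length D/15 on each processing unit.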
7.2.4 Blocked communication

In Figure 6.3, we showed the nature of the communication done during the SpMM, how the local SpMM and its transpose version are executed, and how the results are eventually accumulated. Our goal is to make these communications blocked as well.

Figure 7.4: Matrix and vector distribution over MPI ranks and the efficient processor topology

Since both the SpMM and the SpMMT are executed according to the block sizes of the matrix blocks, the offsets of these blocks are kept as metadata which is used during the multiplications to calculate the correct addresses. The block sizes are not uniform. Since our SpMM kernel has specific OpenMP input/output dependencies based on these block offsets and custom block sizes, the same offsets and sizes need to be used when we introduce the communication tasks.

In Figure 7.5, we show how we implement the blocked communication tasks. Since our vectors are divided among the MPI ranks across the column communicator, we need to do an allgather across the column communicator to gather the subvectors on the diagonal processor. Then we do the broadcast along the row communicator. The offsets and sizes being exactly the same among the processors, I use the starting offset as the TAG field for the point-to-point communication routines, which helps us avoid any possible deadlock or message size mismatch that might result in a segmentation fault. In Figure 7.5, we can see that whenever the communication for a particular block is done, we can immediately execute a local SpMM or a local SpMMT on that block, since we have already received it and it is now safe to run an SpMM kernel on its data. After the local SpMM is finished, it is safe to execute the reduction tasks. Hence, whenever an SpMM task is completed, the reduction task which depends on the output of this task is called (or expected to get called by OpenMP).

First, I implemented a version where each individual block was its own task. But we noticed that with this implementation a lot of communication tasks are generated and the performance is poor. Hence, I implemented a hierarchical blocking of tasks. In Figure 7.6, I show how we divide the entire matrix into blocks of sparse matrix blocks. In this example, I show a sparse matrix with 16x16 blocks. We use a higher-level block consisting of 4x4 of the original blocks and use its offsets and sizes for the kernel OpenMP dependencies. Doing so reduced the high number of communication tasks and eventually improved the overall performance. For our runs, a task block size of 8 or 16 is used because these performed the best.

Figure 7.5: Blocked communication along the processes in the same row communicator

Figure 7.6: Hierarchical blocked communication

Figure 7.7: Blocked broadcast code

The blocked broadcast is shown in code snippet 7.7. Since we only use non-blocking MPI point-to-point routines, the root process sends the block to all the other processes serially. All the other processes receive that message from the root process with the same TAG that the message was sent with. In code snippet 7.8, the task dependencies are shown for the SpMM code. Here it can be seen that we use the same kind of memory footprint for the OpenMP tasks. These dependencies need to be exactly coherent among the tasks, otherwise there will be deadlocks, or race conditions might not give us the correct result.
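As a rough illustration of such matching dependencies, the sketch below shows a receive task that produces one vector block and the local SpMM task that consumes it. The buffer names, the helper kernel local_spmm_block, and the use of a single array element to stand in for the whole block in the depend clause are assumptions of this sketch, not the actual MFDn wrapper code.

    #include <mpi.h>
    #include <omp.h>

    /* Hypothetical per-block kernel; shown only to complete the sketch. */
    void local_spmm_block(const void *A_block, const double *x, double *y, int nrows);

    void post_block_tasks(const void *A_block, double *xbuf, double *ybuf,
                          int offset, int nrows, int src_rank, MPI_Comm row_comm)
    {
        /* Receive task: produces xbuf[offset .. offset+nrows-1]. */
        #pragma omp task depend(out: xbuf[offset])
        {
            MPI_Request req;
            int done = 0;
            MPI_Irecv(&xbuf[offset], nrows, MPI_DOUBLE, src_rank,
                      offset /* tag = starting offset of the block */, row_comm, &req);
            while (!done) {
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
                if (!done) {
                    #pragma omp taskyield   /* keep the thread available for other tasks */
                }
            }
        }

        /* Compute task: may start as soon as this particular block has arrived. */
        #pragma omp task depend(in: xbuf[offset]) depend(out: ybuf[offset])
        {
            local_spmm_block(A_block, &xbuf[offset], &ybuf[offset], nrows);
        }
    }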
Figure 7.8: Blocked SpMM code

For example, if a dependency of an MPI_Recv task is not coherent with its prior dependencies, this task will get pulled from the task pool and OpenMP will try to execute it regardless of whether an actual send has been posted for the receive or not. This results in a deadlock. Resolving the dependencies appropriately was one of the most important challenges of this implementation. An example code snippet for the reduction is shown in Figure 7.9, and it can be clearly seen that the same dependencies are kept coherent with the SpMM kernels. Since it is a reduction operation and the values need to be added to the values on the root processor, a buffer needs to be kept to receive the results from the other processors. When all the OpenMP dependencies are resolved appropriately, the result matches and the MFDn code converges. OpenMP looks at the dependencies of each individual task, pulls the tasks whose dependencies have been resolved from a task pool, and creates its own DAG underneath, with its own scheduler.

Figure 7.9: Blocked reduction code

7.2.5 Custom reduction

After the SpMM and SpMMT have been executed, their results are reduced and then scattered to the individual processors for the next iteration. This is done using an MPI_Reduce_scatter call in the actual MFDn code, which results in better performance. The displacements used while doing the reduce-scatter are actually not uniform. During the reduce-scatter, we move from dimensions related to the offsets and sizes of the sparse matrix blocks to a local dimension for the vectors on each individual process. Since we need the memory chunk sizes to be coherent for a successful and efficient task parallel implementation, implementing the reduce-scatter as a reduce followed by a scatter would not be efficient.

For this reason, we implement a custom reduction. Rather than doing a reduction followed by a scatter, only a reduction is done in each process on specific blocks. Since we have a displacement array for the scatter, which keeps track of which vector blocks reside in which process, each individual block can determine, using this information, which process it belongs to. If a block belongs to a particular MPI rank, that rank will receive the locally calculated results for this block from the other processors and accumulate the result. If the block does not belong to this MPI rank but to a different one, the rank sends its locally calculated values for this block to that MPI rank, which will eventually accumulate the value on its own.

Figure 7.10: Custom reduction depending on the local vector distribution

In Figure 7.10, an example of this custom reduction is shown. In this example, the entire subvector on the diagonal processor is divided into 4 blocks with a local distribution of 1, 2 and 1 blocks, respectively. When the result of block 0, which belongs to MPI rank 0, is calculated, rank 0 will receive the local results for that task from the other processors. For blocks 1 and 2, it will send the locally calculated results to MPI rank 1, and for block 3, it will send the result to process 3.
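A compact serial sketch of this ownership-driven reduction is given below. The owner lookup owner_of, the buffer layout, and the use of blocking MPI_Send/MPI_Recv are simplifications for illustration; in our implementation these operations are non-blocking and wrapped in OpenMP tasks with the same tag-equals-offset convention.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: maps a vector block to the rank that owns it,
     * based on the scatter displacement array. */
    int owner_of(int block, const int *displs, int nranks);

    /* Every rank walks the blocks in the same order; the owner of a block
     * receives and accumulates the partial results of the others, while every
     * other rank sends its partial result for that block. */
    void custom_reduce(const double *partial, double *result,
                       const int *block_off, const int *block_len, int nblocks,
                       const int *displs, MPI_Comm col_comm)
    {
        int rank, nranks;
        MPI_Comm_rank(col_comm, &rank);
        MPI_Comm_size(col_comm, &nranks);

        for (int b = 0; b < nblocks; b++) {
            int owner = owner_of(b, displs, nranks);
            int off = block_off[b], len = block_len[b];
            if (owner == rank) {
                /* Start from my own partial result, then add everyone else's. */
                memcpy(&result[off], &partial[off], (size_t)len * sizeof(double));
                double *tmp = malloc((size_t)len * sizeof(double));
                for (int r = 0; r < nranks; r++) {
                    if (r == rank) continue;
                    MPI_Recv(tmp, len, MPI_DOUBLE, r, off /* tag = block offset */,
                             col_comm, MPI_STATUS_IGNORE);
                    for (int i = 0; i < len; i++)
                        result[off + i] += tmp[i];
                }
                free(tmp);
            } else {
                MPI_Send(&partial[off], len, MPI_DOUBLE, owner,
                         off /* tag = block offset */, col_comm);
            }
        }
    }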
Table 7.1: Overview of evaluated platforms.
   Processor            Xeon E5-2698 v3     Xeon Phi 7250
   Core                 Haswell             Knights Landing
   Clock (GHz)          2.3                 1.4
   Data Cache (KB)      64 (32+32) + 256    64 (32+32) + 512
   Memory-Parallelism   HW-prefetch         HW-prefetch
   Cores/Node           32                  68
   Threads/Node         64                  272
   Last-level Memory    40 MB L3            16 GB MCDRAM
   SP TFlop/s           1.2                 3
   DP TFlop/s           0.6                 1.5
   Available Memory     128 GB              96 GB
   Interconnect         Aries (Dragonfly)   Aries (Dragonfly)
   Global BW            120 GB/s            490 GB/s
Notes: 1 With hyperthreading, but only 12 threads were used in our computations. 2 Based on the saxpy benchmark in [1]. 3 Memory bandwidth is measured using the STREAM copy benchmark.

7.3 Experiments and Results

7.3.1 Experimental setup

We conducted all our experiments on Cori Phase I, a Cray XC40 supercomputer at NERSC, mainly using the GNU compiler. Each Cori Phase I node has two sockets, each with a 16-core Intel Xeon E5-2698 v3 Haswell CPU. Each core has a 64 KB private L1 cache (32 KB instruction and 32 KB data cache) and a 256 KB private L2 cache. Each CPU has a 40 MB shared L3 cache (LLC). We use thread affinity to bind threads to cores and use a maximum of 16 threads to avoid NUMA issues. We also conducted all our validation experiments on Cori Phase II (Cori-KNL), a Cray XC40 supercomputer at NERSC. Each Cori-KNL node has a single-socket Intel Xeon Phi 7250 ("Knights Landing") processor with 68 cores @ 1.4 GHz. Each node has 96 GB of DDR4 2400 MHz memory with 102 GiB/s peak bandwidth, and also 16 GB of MCDRAM (multi-channel DRAM).

Table 7.2: Matrices used in this experiment: number of MPI ranks, dimension, average local dimension, and number of nonzeros per rank.
   MPI ranks   Dimension   Local dim   Nonzeros per rank
        6      3354349     1677320     1063773871
       15      3369600     1123336     1025614613
       45      5771535     1154406     1044593229
       66      6189415     1031593      958230159
      120      8161289     1020212      968138528

For our initial implementation, we received from our collaborators a standalone code consisting of the LOBPCG algorithm, the preconditioning, and the actual SpMM code. We did our implementation in a C wrapper code and merged it with the existing FORTRAN code. We received 5 matrices of different sizes from our collaborators. The matrices mainly vary in the number of MPI ranks they use and in the number of diagonals. In Table 7.2, we show the dimensions for the SpMM multiplication, the average local dimensions of the vectors in each MPI rank, and the number of nonzeros per rank. We can see that the number of nonzeros is kept close to 1 billion per rank; hence we expect weak scaling in our experiments. Since we have 128 GB per Haswell node and 96 GB per KNL node, if we want to use half the memory of a node for the matrix, we can have 8 MPI ranks per Haswell node (8 * 8 GB = 64 GB). But empirically we have seen that the code performs best when we use 2 MPI ranks per node. Hence, for our matrices, we use 3, 8, 23, 33 and 60 Haswell and KNL nodes with 2 MPI ranks per node. Since we have 32 cores in a Haswell node and 68 cores in a KNL node, we can use 16 and 32 OpenMP threads for the Haswell and KNL runs, respectively. We have also used hyperthreading to run with 32 and 64 OpenMP threads, respectively, on Haswell and KNL.

Figure 7.11: Comparison of execution time per iteration on Haswell with 16 threads between loop parallel, task parallel, and task parallel with custom reduce-scatter

7.3.2 Impact of blocked communications

In Figures 7.11, 7.12, 7.13 and 7.14, we show the performance per LOBPCG iteration of the MFDn code.
We compare the execution times of the loop parallel version and our task parallel implementation. We show the results for both Haswell nodes and KNL nodes, with and without hyperthreading. We observe that the task parallel implementation is slightly slower than the loop parallel version for almost all the experiments. The huge number of tasks generated, and the dependencies being resolved continuously, might be generating a tasking overhead which results in slightly slower performance.

Figure 7.12: Comparison of execution time per iteration on Haswell with 32 threads between loop parallel, task parallel, and task parallel with custom reduce-scatter

Figure 7.13: Comparison of execution time per iteration on KNL with 32 threads between loop parallel, task parallel, and task parallel with custom reduce-scatter

Figure 7.14: Comparison of execution time per iteration on KNL with 64 threads between loop parallel, task parallel, and task parallel with custom reduce-scatter

7.3.3 Improvement with custom reduction

We also implemented the reduce-scatter with a custom reduction, measured the results, and show them in the same plots. In this case, the tasks of the reduce-scatter are also included in the single OpenMP task region, so there is more overlap between computation and communication. This is also evident in our results: in almost all the experiments, the version with the custom reduction is faster than the version without it. But even this improved task parallel implementation could not always beat the loop parallel times.

7.3.4 Breakdown of individual kernel performance

In MFDn, we mainly have four expensive and important kernels: the broadcast, the SpMM, the reduction, and the SpMMT. We broke down the time needed to execute these 4 kernels and compared it between the loop parallel version and the task parallel version. We broke down the timings for the 6 MPI ranks and 45 MPI ranks cases. In the loop parallel version, the communications are done by only one thread, and all the other threads take part in the SpMM and SpMMT, whereas our implementation is entirely task parallel and every thread takes part in all four kernels. In Figures 7.15, 7.16, 7.17 and 7.18, we can see the comparison. It is clear that the communication times improved in the task parallel version over the loop parallel version, since more threads now take part in the communication. Since the number of threads taking part in the SpMM and SpMMT is now balanced against the communication in the task parallel version, their execution time increases slightly. Eventually, with this increase, and also because of the tasking overhead, the final execution time per iteration increases slightly.

Figure 7.15: Breakdown of communication and computation operations with 6 MPI ranks on Haswell nodes

Figure 7.16: Breakdown of communication and computation operations with 6 MPI ranks on KNL nodes

Figure 7.17: Breakdown of communication and computation operations with 45 MPI ranks on Haswell nodes

Figure 7.18: Breakdown of communication and computation operations with 45 MPI ranks on KNL nodes

7.3.5 Expensive matrix multiplication compared to vector operations

As mentioned earlier, one of our motivations behind this work is to overlap computation with communication. Since we saw that a better pipelined execution flow results in performance improvements on a shared memory node, where the matrix and vector operations are balanced, we thought this would lead to better performance in MFDn too.
Figure 7.19: Change in the SpMM dimension and the local dimensions as the number of MPI ranks increases

As seen in Table 7.2, and shown pictorially in Figure 7.19, the local dimensions of the vectors actually decrease as the number of MPI ranks increases. This means the individual matrix dimensions actually increase, and since the number of nonzeros per MPI rank is kept similar, the matrices become sparser and sparser as the number of MPI ranks increases. Also, with the decrease of the local vector dimensions, the vector operations become smaller compared to the matrix multiplications. We can see in the breakdown section that with the increase in dimensions, the communication within SpMM also becomes expensive, making the SpMM more and more expensive compared to the vector operations.

Figure 7.20: Ratio of LOBPCG compared to SpMM on Haswell nodes with 16 threads

Figure 7.21: Ratio of LOBPCG compared to SpMM on Haswell nodes with 32 threads

Figure 7.22: Ratio of LOBPCG compared to SpMM on KNL nodes with 32 threads

Figure 7.23: Ratio of LOBPCG compared to SpMM on KNL nodes with 64 threads

In Figures 7.20, 7.21, 7.22 and 7.23, we show the ratio between the SpMM execution time and the LOBPCG execution time, which consists of the vector operations. Unlike in our shared memory version, the SpMM is almost 9 times more expensive than the collective vector-vector and vector-transpose multiplications, which was not the case for shared memory. This was an interesting observation for us, and it kept the improvement from being as large as we had expected.

7.3.6 Conclusions of this work

To conclude this work, we introduced the communication tasks. While we faced several issues with merging OpenMP tasks and MPI routines, we eventually succeeded. The MPI collective routines do not lend themselves to the use of blocked tasks as we use them in DeepSparse; we had to use point-to-point MPI communication routines as a result. We also had to introduce our custom blocked reduce-scatter routine. Eventually, we observed that the bulk synchronous implementation performs well compared to the task parallel implementation. Better adaptability to OpenMP tasks in newer MPI libraries would improve the implementation, and using the relatively new MPI+Threads approach might actually improve the performance compared to the bulk synchronous approach.

Chapter 8

CONCLUSION AND FUTURE WORK

In this thesis, we presented techniques and algorithms to optimize large scale iterative eigensolvers on deep memory architectures using a task parallel approach. Although we started with a dense matrix code, eventually our work focused on sparse matrix iterative eigensolvers. At first, we worked with dense matrices in the Sky3D code and optimized its performance by setting up the 1D and 2D partitionings accordingly. With the help of the already rich ScaLAPACK library, we observed great scaling. We also compared the performance of a pure MPI implementation and an MPI+OpenMP hybrid implementation. Dense matrices already have an impressive collection of optimized multithreaded libraries which are widely used across different scientific fields. But when it comes to sparse iterative solvers, we do not see as many optimized libraries as for dense matrices. There is a lot of scope for optimizing sparse eigensolvers. We discussed one such approach, in which we implemented a blocked version of the sparse matrix matrix multiplication, where the matrix is stored in a novel blocked format rather than the traditional compressed row or compressed column formats.
We discussed the implementation and the benefits of using the blocked storage format. We also discussed the roofline model and the performance achieved by our implementation on Intel Xeon Phi and Intel Ivy Bridge machines. We only concentrated on a single kernel in that case, but in our next work we targeted an entire eigensolver and looked at all of its steps rather than at a single kernel. We introduced a novel task-parallel framework named DeepSparse, which takes all the steps of an eigensolver, distributes them into tasks on blocked matrices, and executes these tasks in parallel using OpenMP tasks. We observed a reduction in runtime and a huge reduction of cache misses at all levels of the memory hierarchy using our task parallel approach. In this work, we depended on the scheduler provided by OpenMP. We also wanted to use our own custom scheduler. With this goal, we used a graph partitioner to partition the task graphs with a tight memory bound on the active memory used as inputs and outputs in a particular phase. After creating those phases, we created partitions by sorting topologically. Although we convincingly improved over the library implementations, we could not outperform the OpenMP scheduler; we need to find a different heuristic to use with the partitioner.

With the success of the single-node implementation of DeepSparse, we targeted the distributed MFDn algorithm with a view to optimizing distributed iterative eigensolvers. Before diving straight into the implementation, we wanted to simulate the behaviour of the communications in MFDn. We created a communication motif and simulated it using the Structural Simulation Toolkit (SST). In the process, we found a network congestion issue present on the actual machines for the large message sizes of the communication routines, which motivated us to introduce task-based blocked communication routines with the expectation of improving the performance of the distributed eigensolver.

Eventually, we implemented a distributed memory version of DeepSparse and implemented a task parallel SpMM in the MFDn code. For the distributed memory application, communication plays a big role in SpMM being dominant over the vector operations, which was not the case for the shared memory version. The vector operations still need to be executed in the same OpenMP task region as the SpMM. Also, at the moment, the communication tasks are point-to-point and the collective communications are carried out serially; a tree-based communication scheme can be introduced for further improvement. An MPI library that is more adaptable to OpenMP tasks might improve the performance of the distributed MFDn implementation. The recent MPI+Threads approach might also make the implementation both faster and simpler. We will keep looking for better approaches to improve the custom scheduler using partitioning, and to improve the distributed implementation of the communication routines.

BIBLIOGRAPHY

[1] J. Fang, H. Sips, L. Zhang, C. Xu, Y. Che, and A. L. Varbanescu, “Test-driving Intel Xeon Phi,” in 5th ACM/SPEC International Conference on Performance Engineering. ACM, 2014, pp. 137–148.

[2] P. Maris, M. Sosonkina, J. P. Vary, E. Ng, and C. Yang, “Scaling of ab-initio nuclear physics calculations on multicore computer architectures,” Procedia Computer Science, vol. 1, no. 1, pp. 97–106, 2010.

[3] P. Sternberg, E. G. Ng, C. Yang, P. Maris, J. P. Vary, M. Sosonkina, and H. V.
Le, “Accelerating configuration interaction calculations for nuclear structure,” in Proceed- ings of the 2008 ACM/IEEE conference on Supercomputing. IEEE Press, 2008, pp. 1–15. [4] M. Shao, H. Aktulga, C. Yang, E. G. Ng, P. Maris, and J. P. Vary, “Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver,” Computer Physics Communications, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0010465517302904 [5] J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-driven, highly-scalable drag- onfly topology,” in 2008 International Symposium on Computer Architecture. IEEE, 2008, pp. 77–88. [6] N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sen- gupta, Z. Yin, and P. Dubey, “Navigating the maze of graph analytics frameworks using massive graph datasets,” in Proceedings of the 2014 ACM SIGMOD interna- tional conference on Management of data. ACM, 2014, pp. 979–990. [7] Y. Saad, Iterative methods for sparse linear systems. siam, 2003, vol. 82. [8] M. Feit, J. Fleck Jr, and A. Steiger, “Solution of the schrödinger equation by a spectral method,” Journal of Computational Physics, vol. 47, no. 3, pp. 412–433, 1982. [9] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in neural information processing systems, 2002, pp. 849–856. [10] I. Jollife, Principal Component Analysis. Wiley Online Library, 2002. [11] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford InfoLab, Tech. Rep., 1999. [12] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual perfor- mance model for floating-point programs and multicore architectures,” Communica- tions of the Association for Computing Machinery, 2009. 207 [13] N. Bell and M. Garland, “Implementing sparse matrix-vector multiplication on throughput-oriented processors,” in Proceedings of the conference on high performance computing networking, storage and analysis. ACM, 2009, p. 18. [14] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, “Paral- lel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,” in Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures. ACM, 2009, pp. 233–244. [15] E.-J. Im and K. A. Yelick, Optimizing the performance of sparse matrix-vector multi- plication. University of California, Berkeley, 2000. [16] B. C. Lee, R. W. Vuduc, J. W. Demmel, and K. A. Yelick, “Performance models for evaluation and automatic tuning of symmetric sparse matrix-vector multiply,” in Parallel Processing, 2004. ICPP 2004. International Conference on. IEEE, 2004, pp. 169–176. [17] R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick, “When cache blocking of sparse matrix vector multiply works and why,” Applicable Algebra in Engineering, Communication and Computing, vol. 18, no. 3, pp. 297–311, 2007. [18] A. Pinar and M. T. Heath, “Improving performance of sparse matrix–vector multipli- cation,” in Proceedings of the 1999 ACM/IEEE conference on Supercomputing. ACM, 1999, p. 30. [19] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Philadelpha, PA: SIAM, 2003. [20] R. Vuduc, J. W. Demmel, and K. A. Yelick, “OSKI: A library of automatically tuned sparse matrix kernels,” in Journal of Physics: Conference Series, vol. 16, no. 1. IOP Publishing, 2005, p. 521. [21] S. Williams, L. Oliker, R. Vuduc, J. 
Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix-vector multiplication on emerging multicore platforms,” in SC’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. IEEE, 2007, pp. 1–12. [22] H. M. Aktulga, A. Buluç, S. Williams, and C. Yang, “Optimizing sparse matrix- multiple vectors multiplication for nuclear configuration interaction calculations,” in Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014, pp. 1213–1222. [23] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, “Toward realistic perfor- mance bounds for implicit cfd codes,” in Proceedings of parallel CFD, vol. 99. Citeseer, 1999, pp. 233–240. 208 [24] X. Liu, E. Chow, K. Vaidyanathan, and M. Smelyanskiy, “Improving the performance of dynamical simulations via multiple right-hand sides,” in Parallel & Distributed Pro- cessing Symposium (IPDPS), 2012 IEEE 26th International. IEEE, 2012, pp. 36–47. [25] M. Röhrig-Zöllner, J. Thies, M. Kreutzer, A. Alvermann, A. Pieper, A. Basermann, G. Hager, G. Wellein, and H. Fehske, “Increasing the performance of the jacobi– davidson method by blocking,” SIAM Journal on Scientific Computing, vol. 37, no. 6, pp. C697–C722, 2015. [26] A. E. Sarıyüce, E. Saule, K. Kaya, and Ü. V. Çatalyürek, “Regularizing graph cen- trality computations,” Journal of Parallel and Distributed Computing, vol. 76, pp. 106–119, 2015. [27] A. Buluç and J. R. Gilbert, “Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. C170–C191, 2012. [28] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach. Elsevier, 2011. [29] P. Kogge and J. Shalf, “Exascale computing trends: Adjusting to the" new normal" for computer architecture,” Computing in Science & Engineering, vol. 15, no. 6, pp. 16–26, 2013. [30] A. V. Knyazev, “Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method,” SIAM Journal on Scientific Computing, vol. 23, no. 2, pp. 517–541, 2001. [31] H. A. Bethe, “Supernova mechanisms,” Reviews of Modern Physics, vol. 62, no. 4, pp. 801–866, 10 1990. [Online]. Available: http://link.aps.org/doi/10.1103/RevModPhys. 62.801 [32] H. Suzuki, Physics and Astrophysics of Neutrinos, M. Fukugita and A. Suzuki, Eds. Springer, Tokyo, 1994. [33] D. G. Ravenhall, C. J. Pethick, and J. R. Wilson, “Structure of Matter below Nuclear Saturation Density,” Phys. Rev. Lett., vol. 50, no. 26, pp. 2066–2069, 6 1983. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevLett.50.2066 [34] M. Hashimoto, H. Seki, and M. Yamada, “Shape of Nuclei in the Crust of Neutron Star,” Prog. Theor. Phys., vol. 71, no. 2, pp. 320–326, 1984. [Online]. Available: http://ptp.ipap.jp/link?PTP/71/320/ [35] A. S. Schneider, C. J. Horowitz, J. Hughto, and D. K. Berry, “Nuclear “pasta” formation,” Phys. Rev. C, vol. 88, no. 6, p. 65807, 2013. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevC.88.065807 209 [36] A. S. Schneider, D. K. Berry, C. M. Briggs, M. E. Caplan, and C. J. Horowitz, “Nuclear “waffles”,” Phys. Rev. C, vol. 90, no. 5, p. 55805, 2014. [37] C. J. Horowitz, D. K. Berry, C. M. Briggs, M. E. Caplan, A. Cumming, and A. S. Schneider, “Disordered nuclear pasta, magnetic field decay, and crust cooling in neu- tron stars,” Phys. Rev. Lett., vol. 114, no. 3, 2015. [38] G. Watanabe, K. Sato, K. Yasuoka, and T. 
Ebisuzaki, “Microscopic study of slablike and rodlike nuclei: Quantum molecular dynamics approach,” Phys. Rev. C, vol. 66, no. 1, p. 6, 2002. [39] G. Watanabe and others, “Structure of cold nuclear matter at subnuclear densities by quantum molecular dynamics,” Phys. Rev. C, vol. 68, no. 3, p. 35806, 2003. [40] G. Watanabe, T. Maruyama, K. Sato, K. Yasuoka, and T. Ebisuzaki, “Simulation of transitions between "pasta" phases in dense matter,” Phys. Rev. Lett., vol. 94, no. 3, 2005. [41] G. Watanabe, H. Sonoda, T. Maruyama, K. Sato, K. Yasuoka, and T. Ebisuzaki, “Formation of Nuclear “Pasta” in Supernovae,” Phys. Rev. Lett., vol. 103, no. 12, p. 121101, 2009. [42] R. D. Williams and S. E. Koonin, “Sub-saturation phases of nuclear matter,” Nucl. Phys., vol. 435, no. 3?4, pp. 844–858, 1985. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0375947485901915 [43] K. Oyamatsu, “Nuclear shapes in the inner crust of a neutron star,” Nuclear Physics A, vol. 561, no. 3, pp. 431–452, 8 1993. [Online]. Available: http: //linkinghub.elsevier.com/retrieve/pii/037594749390020X [44] M. Okamoto, T. Maruyama, K. Yabana, and T. Tatsumi, “Nuclear “pasta” structures in low-density nuclear matter and properties of the neutron-star crust,” Phys. Rev. C, vol. 88, p. 25801, 2013. [45] H. Pais, S. Chiacchiera, and C. Providência, “Light clusters, pasta phases, and phase transitions in core-collapse supernova matter,” Phys. Rev. C, vol. 91, no. 5, p. 55801, 2015. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevC.91.055801 [46] P. Magierski and P.-H. Heenen, “Structure of the inner crust of neutron stars: Crystal lattice or disordered phase?” Phys. Rev. C, vol. 65, no. 4, p. 45804, 4 2002. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevC.65.045804 [47] P. Gögelein and H. Müther, “Nuclear matter in the crust of neutron stars,” Phys. Rev. C, vol. 76, no. 2, p. 24312, 8 2007. [Online]. Available: http: //link.aps.org/doi/10.1103/PhysRevC.76.024312 210 [48] W. G. Newton and J. R. Stone, “Modeling nuclear “pasta” and the transition to uni- form nuclear matter with the 3D Skyrme-Hartree-Fock method at finite temperature: Core-collapse supernovae,” Phys. Rev. C, vol. 79, no. 5, p. 55801, 2009. [49] H. Sonoda, G. Watanabe, K. Sato, K. Yasuoka, and T. Ebisuzaki, “Phase diagram of nuclear “pasta” and its uncertainties in supernova cores,” Phys. Rev. C, vol. 77, no. 3, p. 35806, 2008. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevC. 77.035806 [50] B. Schuetrumpf, K. Iida, J. A. Maruhn, and P. G. Reinhard, “Nuclear "pasta matter" for different proton fractions,” Phys. Rev. C, vol. 90, no. 5, 2014. [51] B. Schuetrumpf and W. Nazarewicz, “Twist-averaged boundary conditions for nuclear pasta Hartree-Fock calculations,” Phys. Rev. C, vol. 92, no. 4, 2015. [52] I. Sagert, G. I. Fann, F. J. Fattoyev, S. Postnikov, and C. J. Horowitz, “Quantum simulations of nuclei and nuclear pasta with the multiresolution adaptive numerical environment for scientific simulations,” Phys. Rev. C, vol. 93, no. 5, 2016. [53] F. J. Fattoyev, C. J. Horowitz, and B. Schuetrumpf, “Quantum nuclear pasta and nuclear symmetry energy,” Phys. Rev. C, vol. 95, no. 5, p. 055804, 5 2017. [Online]. Available: http://arxiv.org/abs/1703.01433http://dx.doi.org/10.1103/PhysRevC.95. 055804http://link.aps.org/doi/10.1103/PhysRevC.95.055804 [54] C. J. Horowitz, D. K. Berry, C. M. Briggs, M. E. Caplan, A. Cumming, and A. S. 
Schneider, “Disordered Nuclear Pasta, Magnetic Field Decay, and Crust Cooling in Neutron Stars,” Phys. Rev. Lett., vol. 114, no. 3, p. 31102, 1 2015. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevLett.114.031102 [55] J. Erler, N. Birge, M. Kortelainen, W. Nazarewicz, E. Olsen, A. M. Perhac, and M. Stoitsov, “The limits of the nuclear land- scape,” Nature, vol. 486, no. 7404, pp. 509–512, jun 2012. [Online]. Available: http://dx.doi.org/10.1038/nature11188http://www.nature.com/nature/ journal/v486/n7404/abs/nature11188.html{#}supplementary-information [56] M. Bender, P.-H. Heenen, and P.-G. Reinhard, “Self-consistent mean-field models for nuclear structure,” Reviews of Modern Physics, vol. 75, no. 1, pp. 121–180, 1 2003. [Online]. Available: http://link.aps.org/doi/10.1103/RevModPhys.75.121 [57] M. V. Stoitsov, N. Schunck, M. Kortelainen, N. Michel, H. Nam, E. Olsen, J. Sarich, and S. Wild, “Axially deformed solution of the Skyrme-Hartree-Fock-Bogoliubov equa- tions using the transformed harmonic oscillator basis (II) hfbtho v2.00d: A new version of the program,” Computer Physics Communications, vol. 184, no. 6, pp. 1592–1604, 2013. [58] N. Schunck, J. Dobaczewski, J. McDonnell, W. Satuła, J. A. Sheikh, A. Staszczak, M. Stoitsov, and P. Toivanen, “Solution of the Skyrme– 211 Hartree–Fock–Bogolyubov equations in the Cartesian deformed harmonic-oscillator basis.: (VII) hfodd (v2.49t): A new version of the program,” Computer Physics Communications, vol. 183, no. 1, pp. 166–192, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0010465511002852 [59] J. A. Maruhn, P.-G. Reinhard, P. D. Stevenson, and A. S. Umar, “The {TDHF} code Sky3D,” Comput. Phys. Commun., vol. 185, no. 7, pp. 2195–2216, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0010465514001313 [60] T. Ichikawa, J. A. Maruhn, N. Itagaki, K. Matsuyanagi, P.-G. Reinhard, and S. Ohkubo, “Existence of an Exotic Torus Configuration in High-Spin Excited States of 40Ca,” Phys. Rev. Lett., vol. 109, p. 232503, 2012. [Online]. Available: http://dx.doi.org/10.1103/PhysRevLett.109.232503 [61] M. Stein, J. A. Maruhn, A. Sedrakian, and P.-G. Reinhard, “Carbon-oxygen-neon mass nuclei in superstrong magnetic fields,” Phys. Rev. C, vol. 94, p. 35802, 2016. [Online]. Available: 10.1103/PhysRevC.94.035802 [62] P.-G. Reinhard, L. Guo, and J. A. Maruhn, “Nuclear Giant Resonances and Linear Response,” Eur. Phys. J. A, vol. 32, p. 19, 2007. [Online]. Available: http://dx.doi.org/10.1140/epja/i2007-10366-9 [63] B. Schuetrumpf, W. Nazarewicz, and P. G. Reinhard, “Time-dependent density functional theory with twist-averaged boundary conditions,” 3 2016. [Online]. Available: http://arxiv.org/abs/1603.03743http://dx.doi.org/10.1103/PhysRevC.93. 054304 [64] J. A. Maruhn, P.-G. Reinhard, P. D. Stevenson, and M. R. Strayer, “Spin- excitation mechanisms in Skyrme-force time-dependent Hartree-Fock calculations,” Phys. Rev. C, vol. 74, no. 2, p. 027601, 8 2006. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevC.74.027601 [65] N. Loebl, J. A. Maruhn, and P.-G. Reinhard, “Equilibration in the time-dependent Hartree-Fock approach probed with the Wigner distribution function,” Phys. Rev. C, vol. 84, p. 34608, 2011. [Online]. Available: http://link.aps.org/doi/10.1103/ PhysRevC.84.034608 [66] L. Guo, P.-G. Reinhard, and J. A. Maruhn, “Conservation Properties in the Time-Dependent {H}artree {F}ock Theory,” Phys. Rev. C, vol. 77, p. 41301, 2008. [Online]. 
Available: http://link.aps.org/doi/10.1103/PhysRevC.77.041301 [67] G. H. Golub and C. F. Van Loan, Matrix computations, 4th ed. Johns Hopkins University Press, 2013. [68] R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK Users’ Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, 1998. 212 [69] K. Wu and H. Simon, “Thick-restart Lanczos method for large symmetric eigenvalue problems,” SIAM Journal on Matrix Analysis and Applications, vol. 22, no. 2, pp. 602–616, 2000. [70] A. Stathopoulos and Y. Saad, “Restarting techniques for the (Jacobi–)Davidson sym- metric eigenvalue methods,” Electronic Transactions on Numerical Analysis, vol. 7, pp. 163–181, 1998. [71] G. H. Golub and R. Underwood, “The block Lanczos method for computing eigenval- ues,” Mathematical Software, vol. 3, pp. 361–377, 1977. [72] A. Stathopoulos and J. R. McCombs, “A parallel, block, Jacobi–Davidson implemen- tation for solving large eigenproblems on coarse grain environment,” in PDPTA, 1999, pp. 2920–2926. [73] A. V. Knyazev, “Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method,” SIAM Journal on Scientific Computing, vol. 23, no. 2, pp. 517–541, 2001. [74] Y. Saad, “On the rates of convergence of the Lanczos and the block-Lanczos methods,” SIAM Journal on Numerical Analysis, vol. 17, no. 5, pp. 687–706, 1980. [75] S. Williams, A. Watterman, and D. Patterson, “Roofline: An insightful visual perfor- mance model for floating-point programs and multicore architectures,” Comm. of the ACM, April 2009. [76] J.-H. Byun, R. Lin, K. A. Yelick, and J. Demmel, “Autotuning sparse matrix–vector multiplication for multicore,” Technical report, EECS Department, University of Cal- ifornia, Berkeley, Tech. Rep., 2012. [77] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and E. Leiserson, “Parallel sparse matrix–vector and matrix-transpose–vector multiplication using compressed sparse blocks,” in SPAA, 2009, pp. 233–244. [78] P. Maris, H. M. Aktulga, S. Binder, A. Calci, Ü. V. Çatalyürek, J. Langhammer, E. Ng, E. Saule, R. Roth, J. P. Vary et al., “No-Core CI calculations for light nuclei with chiral 2-and 3-body forces,” Journal of Physics: Conference Series, vol. 454, no. 1, p. 012063, 2013. [79] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, and E. Y. Chang, “Parallel spectral clus- tering in distributed systems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 568–586, 2011. [80] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007. 213 [81] M. W. Berry, “Large-scale sparse singular value computations,” International Journal of Supercomputer Applications, vol. 6, no. 1, pp. 13–49, 1992. [82] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, “Indexing by latent semantic analysis,” JASIS, vol. 41, no. 6, pp. 391–407, 1990. [83] H. Zha, O. Marques, and H. D. Simon, “Large-scale SVD and subspace methods for information retrieval,” in Solving Irregularly Structured Problems in Parallel. Springer, 1998, pp. 29–42. [84] J. Kepner, D. Bade, A. Buluç, J. Gilbert, T. Mattson, and H. Meyerhenke, “Graphs, matrices, and the graphblas: Seven good reasons,” arXiv preprint arXiv:1504.01039, 2015. [85] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammerling, A. McKenney et al., “Lapack users’ guide, vol. 9,” Society for Industrial Mathematics, vol. 39, 1999. [86] C. L. 
Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic linear algebra sub- programs for fortran usage,” ACM Transactions on Mathematical Software (TOMS), vol. 5, no. 3, pp. 308–323, 1979. [87] A. OpenMP, “Openmp application program interface version 4.0,” 2013. [88] A. R. Gilbert Hendry, “SST: A simulator for exascale co-design,” in n ASCR/ASC Exascale Research Conf., 2012. [89] P. Maris, M. Sosonkina, J. P. Vary, E. Ng, and C. Yang, “Scaling of ab-initio nuclear physics calculations on multicore computer architectures,” Procedia Computer Science, vol. 1, no. 1, pp. 97 – 106, 2010, iCCS 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S187705091000013X [90] H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, and J. P. Vary, “Improving the scalability of a symmetric iterative eigensolver for multi-core platforms,” Concurrency Computat. Pract. Exper., vol. 26, no. 16, pp. 2631–2651, 2014. [91] B. R. Barrett, P. Navratil, and J. P. Vary, “Ab initio no core shell model,” Prog. Part. Nucl. Phys., vol. 69, pp. 131–181, 2013. [92] C. Lanczos, “An iteration method for the solution of the eigenvalue problem of linear differential and integral operators,” J. Res. Nat’l Bur. Std., vol. 45, pp. 255–282, 1950. [93] A. V. Knyazev, “Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method,” SIAM J. Sci. Comput., vol. 23, no. 2, pp. 517–541, 2001. 214 [94] M. Kortelainen, J. McDonnell, W. Nazarewicz, P.-G. Reinhard, J. Sarich, N. Schunck, M. V. Stoitsov, and S. M. Wild, “Nuclear energy density optimization: Large deformations,” Phys. Rev. C, vol. 85, no. 2, p. 24304, 2 2012. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevC.85.024304 [95] P. Klüpfel, P. G. Reinhard, T. J. Bürvenich, and J. A. Maruhn, “Variations on a theme by Skyrme: A systematic study of adjustments of model parameters,” Phys. Rev. C, vol. 79, no. 3, 2009. [96] V. Blum, G. Lauritsch, J. Maruhn, and P.-G. Reinhard, “Comparison of coordinate-space techniques in nuclear mean-field calculations,” Journal of Computational Physics, vol. 100, no. 2, pp. 364–376, 6 1992. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/002199919290242Q [97] A. Sunderland, S. Pickles, M. Nikolic, A. Jovic, J. Jakic, V. Slavnic, I. Girotto, P. Nash, and M. Lysaght, “An analysis of fft performance in prace application codes,” PRACE whitepaper, 2012. [98] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet et al., ScaLAPACK users’ guide. SIAM, 1997. [99] M. Frigo and S. G. Johnson, “Fftw: An adaptive software architecture for the fft,” in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE Interna- tional Conference on, vol. 3. IEEE, 1998, pp. 1381–1384. [100] R. A. Van De Geijn and J. Watts, “Summa: Scalable universal matrix multiplication algorithm,” Concurrency-Practice and Experience, vol. 9, no. 4, pp. 255–274, 1997. [101] I. S. Dhillon, B. N. Parlett, and C. Vömel, “The design and implementation of the mrrr algorithm,” ACM Transactions on Mathematical Software (TOMS), vol. 32, no. 4, pp. 533–560, 2006. [102] P. Löwdin, “On the Non-Orthogonality Problem Connected with the Use of Atomic Wave Functions in the Theory of Molecules and Crystals,” The Journal of Chemical Physics, vol. 18, no. 3, 1950. [103] H. M. Aktulga, C. Yang, E. Ng, P. Maris, and J. Vary, “Topology-aware mappings for large-scale eigenvalue problems,” in Euro-Par 2012 Parallel Processing, ser. 
Lecture Notes in Computer Science (LNCS), vol. 7484. Springer Berlin/Heidelberg, 2012, pp. 830–842. [104] H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, and J. P. Vary, “Improving the scal- ability of symmetric iterative eigensolver for multi-core platforms,” Concurrency and Computation: Practice and Experience, vol. 26, pp. 2631–2651, 2013. 215 [105] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimiza- tion of sparse matrix–vector multiplication on emerging multicore platforms,” in Proc. SC2007: High Performance Computing, Networking, and Storage Conference, 2007. [106] S. Williams, “Auto-tuning performance on multicore computers,” Ph.D. dissertation, University of California, Berkeley, 2008. [107] M. Afibuzzaman, F. Rabbi, Y. Ozkaya, H. M. Aktulga, and U. V. Çatalyürek, “DeepSparse: A Task-parallel Framework for Sparse Solvers on Deep Memory Ar- chitectures,” pp. 373–382, 12 2019. [108] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987. [109] A. El Guennouni, K. Jbilou, and A. Riquet, “Block krylov subspace methods for solving large sylvester equations,” Numerical Algorithms, vol. 29, no. 1-3, pp. 75–96, 2002. [110] Z. Zhou, E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, and U. V. Catalyürek, “An out-of-core eigensolver on ssd-equipped clusters,” in 2012 IEEE International Conference on Cluster Computing. IEEE, 2012, pp. 248–256. [111] H. M. Aktulga, M. Afibuzzaman, S. Williams, A. Buluç, M. Shao, C. Yang, E. G. Ng, P. Maris, and J. P. Vary, “A high performance block eigensolver for nuclear configura- tion interaction calculations,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 6, pp. 1550–1563, 2016. [112] M. Shao, H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, and J. P. Vary, “Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver,” Computer Physics Communications, vol. 222, pp. 1–13, 2018. [113] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, “Numerical linear algebra on emerging architectures: The plasma and magma projects,” in Journal of Physics: Conference Series, vol. 180, no. 1. IOP Publishing, 2009, p. 012037. [114] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, “A class of parallel tiled linear algebra algorithms for multicore architectures,” Parallel Computing, vol. 35, no. 1, pp. 38–53, 2009. [115] S. Tomov, J. Dongarra, and M. Baboulin, “Towards dense linear algebra for hybrid GPU accelerated manycore systems,” Parallel Computing, vol. 36, no. 5-6, pp. 232–240, Jun. 2010. [116] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, “Dense linear algebra solvers for mul- ticore with GPU accelerators,” in Proc. of the IEEE IPDPS’10. Atlanta, GA: IEEE Computer Society, April 19-23 2010, pp. 1–8, DOI: 10.1109/IPDPSW.2010.5470941. 216 [117] J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki, “Accelerating numerical dense linear algebra calculations with gpus,” Numerical Com- putations with GPUs, pp. 1–26, 2014. [118] I. S. Barrera, M. Moretó, E. Ayguadé, J. Labarta, M. Valero, and M. Casas, “Reducing data movement on large shared memory systems by exploiting computation dependen- cies,” in Proceedings of the 2018 International Conference on Supercomputing. ACM, 2018, pp. 207–217. [119] C. Augonnet, S. Thibault, R. Namyst, and P.-A. 
Wacrenier, “StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures,” in Euro-Par 2009 - 15th International Conference on Parallel Processing, ser. Lecture Notes in Computer Science, vol. 5704. Delft, The Netherlands: Springer, Aug. 2009, pp. 863–874. [Online]. Available: http://hal.inria.fr/inria-00384363
[120] E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, and Ü. V. Çatalyürek, “An out-of-core task-based middleware for data-intensive scientific computing,” in Handbook on Data Centers. Springer, 2015, pp. 647–667.
[121] C. Lanczos, An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators. United States Government Press Office, Los Angeles, CA, 1950.
[122] A. V. Knyazev, “Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method,” SIAM Journal on Scientific Computing, vol. 23, no. 2, pp. 517–541, 2001.
[123] J. W. Demmel, Applied Numerical Linear Algebra. SIAM, 1997.
[124] S. Bogner et al., “Computational Nuclear Quantum Many-Body Problem: The UNEDF Project,” Comput. Phys. Commun., vol. 184, pp. 2235–2250, 2013.
[125] P. Maris, J. P. Vary, P. Navratil, W. E. Ormand, H. Nam, and D. J. Dean, “Origin of the anomalous long lifetime of 14C,” Phys. Rev. Lett., vol. 106, no. 20, p. 202502, 2011.
[126] S. Binder, A. Calci, E. Epelbaum, R. J. Furnstahl, J. Golak, K. Hebeler, H. Kamada, H. Krebs, J. Langhammer, S. Liebig, P. Maris, U.-G. Meißner, D. Minossi, A. Nogga, H. Potter, R. Roth, R. Skibiński, K. Topolnicki, J. P. Vary, and H. Witała, “Few-nucleon systems with state-of-the-art chiral nucleon-nucleon forces,” Phys. Rev. C, vol. 93, no. 4, p. 044002, 2016.
[127] A. Shirokov, I. Shin, Y. Kim, M. Sosonkina, P. Maris, and J. Vary, “N3LO NN interaction adjusted to light nuclei in ab exitu approach,” Phys. Lett. B, vol. 761, pp. 87–91, 2016.
[128] E. Epelbaum et al., “Few- and many-nucleon systems with semilocal coordinate-space regularized chiral two- and three-body forces,” Phys. Rev. C, vol. 99, no. 2, p. 024313, 2019.
[129] M. Caprio, P. Fasano, P. Maris, A. McCoy, and J. Vary, “Probing ab initio emergence of nuclear rotation,” Eur. Phys. J. A, vol. 56, no. 4, p. 120, 2020.
[130] T. Groves, R. E. Grant, S. Hemmer, S. Hammond, M. Levenhagen, and D. C. Arnold, “(SAI) stalled, active and idle: Characterizing power and performance of large-scale dragonfly networks,” in 2016 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2016, pp. 50–59.
[131] J. J. Wilke, J. P. Kenny, S. Knight, and S. Rumley, “Compiler-assisted source-to-source skeletonization of application models for system simulation,” in International Conference on High Performance Computing. Springer, 2018, pp. 123–143.
[132] T. Connors, T. Groves, T. Quan, and S. Hemmert, “Simulation framework for studying optical cable failures in dragonfly topologies,” in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2019, pp. 859–864.
[133] B. Alverson, E. Froese, L. Kaplan, and D. Roweth, “Cray XC series network,” Cray Inc., White Paper WP-Aries01-1112, 2012.
[134] S. Chunduri, T. Groves, P. Mendygral, B. Austin, J. Balma, K. Kandalla, K. Kumaran, G. Lockwood, S. Parker, S. Warren, N. Wichmann, and N. Wright, “GPCNeT: Designing a benchmark suite for inducing and measuring contention in HPC networks,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19.
New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3295500.3356215
[135] BSC INTERTWinE Project, “Best practice guide for writing MPI + OmpSs interoperable programs,” 2017. [Online]. Available: http://www.intertwine-project.eu/sites/default/files/images/INTERTWinE_Best_Practice_Guide_MPI%2BOmpSs_1.0.pdf