ACCELERATING SPARSE EIGENSOLVERS THROUGH ASYNCHRONY, HYBRID ALGORITHMS, AND HETEROGENEOUS ARCHITECTURES By Abdullah Alperen A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy 2024 ABSTRACT Sparse matrix computations comprise the core component of a broad base of scientific applications in fields ranging from molecular dynamics and nuclear physics to data mining and signal processing. Among sparse matrix computations, the eigenvalue problem has a significant place due to its common use in the area of high performance scientific computing. In nuclear physics simulations, for example, one of the most challenging problems is solv- ing large-scale eigenvalue problems arising from nuclear structure calculations. Numerous iterative algorithms have been developed to solve this problem over the years. Lanczos and locally optimal block preconditioned conjugate gradient (LOBPCG) are two of such popular iterative eigensolvers. Together, they present a good mix of the com- putational motifs encountered in sparse solvers. With this work, we describe our efforts to accelerate large-scale sparse eigensolvers by employing asynchronous runtime systems, the development of hybrid algorithms and the utilization of GPU resources. We first evaluate three task-parallel programming models, OpenMP, HPX and Regent, for Lanczos and LOBPCG. We demonstrate these asynchronous frameworks’ merit on two architectures, Intel Broadwell (a multicore processor) and AMD EPYC (a modern manycore processor). We achieve up to an order of magnitude improvement both in execution time and cache performance. We then examine and compare a few iterative methods for solving large-scale eigen- value problems arising from nuclear structure calculations. In particular, besides Lanczos and LOBPCG, we discuss the possibility of using block Lanczos method and the residual minimization method accelerated by direct inversion of iterative subspace (RMM-DIIS). We show that RMM-DIIS can be effectively combined with either block Lanczos and LOBPCG to yield a hybrid eigensolver that has several desirable properties. We finally demonstrate the challenges posed by the emergence of accelerator-based com- puter architectures to achieve high performance for large-scale sparse computations. We particularly focus on the scalability of sparse matrix vector multiplication (SpMV) and sparse matrix multi-vector multiplication (SpMM) kernels of Lanczos and LOBPCG. We scale their performance up to hundreds of GPUs by improving their computation through hand-optimized CUDA kernels and communication aspect through asynchronous point-to- point calls and optimized NVIDIA Collective Communications Library (NCCL) collectives. Copyright by ABDULLAH ALPEREN 2024 I dedicate this thesis to my parents, Kadir and Nihal, my brother, Ahmet, and my sisters, Ay¸senur and Bet¨ul, for their endless support from afar in helping me achieve this goal. v ACKNOWLEDGEMENTS Reaching this milestone would not have been possible without the support of many. First and foremost, I express my genuine gratitude to my PhD advisor, Hasan Metin Aktulga. His valuable feedback and guidance have been instrumental throughout my PhD journey. His constant support through research grants helped me greatly in focusing on my studies. Besides my advisor, I want to thank my dear committee members, Brian O’Shea and Sandeep Kulkarni, for their valuable time and feedback. 
I must also express my sincere appreciation to Pieter Maris, Khaled Ibrahim, and Nan Ding for their valuable collaboration. I particularly extend my profound gratitude to Chao Yang for his precious research ideas and his service on my committee. Finally, my heartfelt gratitude goes to my fellow lab mates, Kurt, Anik, Fazlay, Çağrı, Doğa, and Ömer, for the collaborative exchange of ideas and camaraderie. This work was supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under Grant DE-SC0023175, and the National Science Foundation Office of Advanced Cyberinfrastructure under Grant 1845208.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . 1
1.1 Motivation . . . 1
1.2 Contributions of This Thesis . . . 4
CHAPTER 2 AN EVALUATION OF TASK-PARALLEL FRAMEWORKS FOR SPARSE SOLVERS ON MULTICORE AND MANYCORE CPU ARCHITECTURES . . . 7
2.1 Related Work . . . 9
2.2 Implementation and Optimizations . . . 11
2.3 Benchmark Applications . . . 19
2.4 Performance Evaluation . . . 20
2.5 Conclusion of This Work . . . 32
CHAPTER 3 PERFORMANCE OF THE HPX FRAMEWORK FOR SPARSE EIGENSOLVERS ON DISTRIBUTED MEMORY ARCHITECTURES . . . 35
3.1 Implementation . . . 35
3.2 Results . . . 37
3.3 Discussion . . . 41
3.4 Conclusion of This Work . . . 45
CHAPTER 4 HYBRID EIGENSOLVERS FOR NUCLEAR CONFIGURATION INTERACTION CALCULATIONS . . . 46
4.1 Existing Algorithms . . . 49
4.2 Hybrid Algorithms . . . 61
4.3 Numerical examples . . . 62
4.4 Conclusion of This Work . . . 81
CHAPTER 5 SCALING EIGENSOLVERS TO HUNDREDS OF GPUS . . . 83
5.1 MFDn's Proprietary Distributed SpMV Algorithm . . . 86
5.2 Suggested Improvements and Proposed Schemes . . . 92
5.3 Lanczos Experiments . . . 95
5.4 LOBPCG Experiments . . . 114
5.5 Conclusion of This Work . . . 122
CHAPTER 6 CONCLUSION AND FUTURE WORK . . . 124
BIBLIOGRAPHY . . . 126

CHAPTER 1 INTRODUCTION

1.1 Motivation

Sparse matrix computations manifest themselves in many forms such as the solution of systems of linear equations, matrix factorizations, linear least squares problems, and eigenvalue problems [24]. As such, they comprise the core component of a broad base of scientific applications in fields ranging from molecular dynamics and nuclear physics to data mining and signal processing.
In the presence of large-scale data, sparse matrix computations become quite challenging as they demand massive parallelism but cannot effectively utilize compute resources. This underutilization stems from the memory-bound nature of the com- putations, which is not only the result of low arithmetic intensity but also irregular data access patterns. The latter factor becomes more evident in modern computer architectures where cache performance holds a greater significance within deep memory hierarchies with the accumulated cost of going farther away from the processor. To effectively perform dense matrix computations on parallel systems, there exist dense linear algebra libraries such as i.e., Basic Linear Algebra Subprograms (BLAS) [37], Linear Algebra Package (LAPACK) [6] and Scalable Linear Algebra Package (ScaLAPACK) [10]), which are well established. These dense linear algebra libraries also have their own vendor optimizated implementations (i.e., Intel Math Kernel Library – MKL, Cray Scientific Li- braries – LibSci). However, unlike its dense matrix analogue, the state of the art for sparse matrix computations is lagging far behind. The main reason behind the slow progress in the standardization of sparse linear algebra and development of libraries for it is that sparse matrices come in very different forms and properties depending on the application area. Among sparse matrix computations, the eigenvalue problem has a significant place due to its common use in the area of high performance scientific computing. In nuclear physics simulations, for example, one of the most challenging problems is solving large-scale eigen- value problems arising from nuclear structure calculations. The conventional approach to 1 compute eigenvalues involves finding all the roots of the matrix’s characteristic polynomial, which is not feasible for large-scale analysis where the matrix dimensions are in millions or even billions. Therefore, numerous iterative algorithms have been developed to solve this problem over the years. These methods work by repeatedly refining approximations to the eigenvectors or eigenvalues and can be terminated whenever the approximations reach a sufficient degree of accuracy. Lanczos and locally optimal block preconditioned conjugate gradient (LOBPCG) are two of such popular iterative eigensolvers. While the Lanczos algorithm is representative of the relatively simple sparse solvers such as the Conjugate Gradient method and Page Rank, the LOBPCG algorithm is a complex one with several steps that involve tall-skinny matrix operations in addition to the sparse matrix computations. Together, they present a good mix of the computational motifs encountered in sparse solvers. This is particularly true given the fact that the core component of the Lanczos algorithm is the sparse matrix vector multiplication (SpMV) kernel whereas the main kernel in the LOBPCG algorithm is the sparse matrix multi-vector multiplication (SpMM). Both SpMV and SpMM operations are widely used [20, 42] and well studied [48] in the literature. It is known that SpMV has notoriously low arithmetic intensity since only two floating point operations (one addition, one multiplication) are performed per nonzero ele- ment of the sparse matrix. Therefore, the Roofline Model by Williams et al. [63] suggests that the memory bandwidth ultimately bounds the performance of the SpMV kernel. 
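To make this bound concrete, consider a rough, back-of-the-envelope estimate (our own illustration, assuming double-precision values, 32-bit column indices and CSR-like storage; the 100 GB/s memory bandwidth is a placeholder for a typical multicore node, not a measured value). Each nonzero contributes 2 flops against at least 12 bytes of traffic (an 8-byte value plus a 4-byte column index), ignoring row pointers and any reuse of the vectors:

    \[
      I_{\mathrm{SpMV}} \;\lesssim\; \frac{2\ \text{flops}}{12\ \text{bytes}} \;\approx\; 0.17\ \text{flops/byte},
      \qquad
      P_{\mathrm{SpMV}} \;\le\; I_{\mathrm{SpMV}} \times B_{\mathrm{mem}} \;\approx\; 0.17 \times 100\ \text{GB/s} \;\approx\; 17\ \text{GFLOP/s},
    \]

which is far below the double-precision peak of a typical multicore node, so adding cores alone does little for SpMV once the memory bus is saturated.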
Although SpMM has significantly higher arithmetic intensity than SpMV, the extended Roofline model that was later proposed suggests that cache bandwidth, rather than the memory bandwidth, can still be a critical performance-limiting factor for SpMM [2]. Challenges posed by large-scale sparse matrix computations as in SpMV and SpMM kernels of Lanczos and LOBPCG algorithms are not particularly well addressed by a bulk synchronous parallel (BSP) approach, which is a type of coarse-grained parallelism, and imposes a barrier synchronization at the end of each computational kernel. The two main 2 factors that limit the performance of the BSP approaches are (i) poor cache performance that can be attributed to coarse-grained tasks which do not fit into the last level cache (LLC) and (ii) high synchronization costs that may be exacerbated by load imbalances given the skewed distribution of nonzero values within the matrices. Therefore, a fundamentally new approach is needed to tackle these issues and improve the performance of sparse eigensolvers, which validate the emergence and increased use of asynchronous many-task (AMT) programming models. Regardless of the approach taken to extract parallelism within large-scale sparse eigen- solvers, Lanczos and LOBPCG algorithms have their own limitations and shortcomings. For a long time, the Lanczos algorithm was the default algorithm to use in sparse eigensolvers because it is easy to implement and because it is quite robust. In recent work [51], it is shown that the low lying eigenvalues can be computed efficiently by using the LOBPCG method [32]. The advantages of the LOBPCG algorithm over the Lanczos algorithm include (i) performing SpMM instead of SpMV, which introduces an additional level of concurrency in the computation and enables exploiting data locality better, and (ii) allowing to take ad- vantage of a preconditioner that can be used to accelerate convergence. However, LOBPCG can become unstable near convergence. Although methods for stabilizing the algorithm has been developed and implemented [26, 19], they do not completely eliminate the problem. As such, one could potentially combine LOBPCG with another solver to yield a hybrid eigensolver that can effectively eliminate the numerical stability as well as outperform both Lanczos and LOBPCG. From the hardware point of view, the challenges in limiting the power consumption and addressing the heat dissipation of chips became the bottleneck that limits the clock frequency, thereby preventing the improvement of single thread performance of central processing units (CPUs). Reaching the power wall has forced CPU manufacturers to look for alternative so- lutions, rather than increasing the clock speed of processors, for enhancing the performance. This seeking led to the introduction of multi- and many-core processors. In the past, these 3 multi- and many-core central processing units (CPUs) equipped on supercomputers were the default hardware to accelerate high performance computations. Also, this seeking simulta- neously gave way to the integration of accelerators such as graphics processing units (GPUs) on supercomputing nodes. GPUs have been used to address problems CPUs struggle to overcome for many years. Rendering and processing images, analyzing big data, training neural networks are some of the tasks that unlike CPUs, GPUs easily tackle as the huge number of small cores they have enable to achieve high level of data parallelism. 
Nonetheless, it should be noted that GPUs are not supposed to replace CPUs by any means. For most cases, CPU loads compute- intensive parts to the GPU and once the GPU finishes the computation, results are loaded back to the CPU. Having high memory bandwidth is a significant feature of GPUs at this point to allow it to still be efficient overall with the data movements involved. Besides exploring new runtime systems and hybrid algorithms to accelerate the eigen- solvers, the increased availability of high performance computing platforms equipped with general purpose GPUs has motivated us to consider modifying the implementation of the Lanczos and LOBPCG algorithms to enable them to run efficiently on accelerator based sys- tems. In particular, we want to focus on the efficiency of the distributed SpMV and SpMM algorithms, which constitute a significant portion of the overall solver time. There exist performance issues of the GPU-parallel Lanczos and LOBPCG solver implementations that leverage MPI collectives for communications and OpenACC directives for the computations. As such, we want to overcome scalability bottlenecks by improving both computational and communication aspects of the SpMV and SpMM kernels. 1.2 Contributions of This Thesis In this thesis, we describe our efforts to accelerate large-scale sparse eigensolvers by employing asynchronous runtime systems, the development of hybrid algorithms and the utilization of GPUs on distributed architectures. As explained below, the contributions of this thesis are divided into four parts. 4 In the first part, we evaluate three task-parallel programming models, OpenMP, HPX and Regent, in terms of performance and ease of implementation, and compare them against the traditional BSP model for two popular eigensolvers, Lanczos and LOBPCG. We give a general outline in regards to achieving parallelism using these runtime systems, and present a heuristic for tuning their performance to balance tasking overheads with the degree of parallelism that can be exposed. We then demonstrate their merits on two architectures, Intel Broadwell (a multicore processor) and AMD EPYC (a modern manycore processor). In Chapter 2, we present the details of our exploration and evaluation of these AMT models in the context of sparse matrix computations. In the second part, we explore the capabilities of HPX runtime system on distributed architectures after seeing its merit on shared memory systems in order to assess whether we could use HPX as the backbone of a large-scale sparse solver and graph analytics framework. We compare its performance to hybrid MPI+OpenMP model up to 512 cores. In Chapter 3, we present the details of our evaluation of HPX’s distributed memory performance and lack thereof. In the third part, we examine and compare a few iterative methods for solving large- scale eigenvalue problems arising from nuclear structure calculations. In particular, besides Lanczos and LOBPCG, we discuss the possibility of using block Lanczos method, a Cheby- shev filtering based subspace iterations and the residual minimization method accelerated by direct inversion of iterative subspace (RMM-DIIS). We show that RMM-DIIS can be effectively combined with either the block Lanczos and LOBPCG methods to yield a hybrid eigensolver that has several desirable properties. In Chapter 4, we present the details of a few practical issues that need to be addressed to make the hybrid solver efficient and robust. 
In the fourth and last part, we work towards scaling the Lanczos and LOBPCG solvers to hundreds of GPUs. To that end, we particularly focus on the costly SpMV/SpMM algo- rithm on distributed memory architectures with multiple GPUs. We propose performance optimizations for the compute kernels, which are originally implemented using OpenACC 5 directives, via hand-tuned CUDA imeplementations. We also examine scaling issues of cer- tain MPI collectives used in the distributed SpMV/SpMM algorithm. We then propose improving the communication aspect by replacing these MPI collectives with non-blocking point-to-point calls and NVIDIA Collective Communications Library (NCCL) routines. We demonstrated improved performance with these new approaches up to 1128 GPUs. We also propose a pipelining technique that can better hide communication overheads by overlapping them with computations. In Chapter 5, we present the findings of our work on the efficient computation and communication schemes to tackle the distributed SpMV and SpMM al- gorithm in order to improve the overall solver time for Lanczos and LOBPCG solvers on hundreds of GPUs. 6 CHAPTER 2 AN EVALUATION OF TASK-PARALLEL FRAMEWORKS FOR SPARSE SOLVERS ON MULTICORE AND MANYCORE CPU ARCHITECTURES This chapter has been published in ACM ICPP 2021 on 05 October 2021 [4], available at: https://doi.org/10.1145/3472456.3472476. Challenges posed by large-scale sparse matrix computations are not particularly well addressed by a bulk synchronous parallel (BSP) approach. This is due to the high synchro- nization cost and load imbalance issues as well as data movement overheads of the traditional BSP models arising in modern shared memory architectures. Therefore, a fundamentally new approach is needed to tackle these issues, which validate the emergence and increased use of asynchronous many-task (AMT) programming models. In this work, we explore the AMT programming models to evaluate their ability to address these issues in the context of sparse solvers. We namely evaluate the performance of OpenMP, HPX and Regent runtime sys- tems that allow task-parallel programming over the BSP model for two popular eigensolvers, Lanczos and LOBPCG. OpenMP’s task parallelism has been commonly used and well studied since 2013 [43] as it allows extracting parallelism via scheduling and asynchronous execution of fine-grained tasks. This model has the potential to remedy both deficiencies of the BSP model with regard to cache performance and load balancing issues. There are other runtime systems that enable fine-grained task parallelism such as HPX [29]. HPX is an advanced runtime system and a programming API that conforms to the C++11/14/17/20 standards while supporting lightweight task scheduling to expose new levels of parallelism. It also extends the standard to the distributed case by employing a global address space, which renders efficient utilization of inter-node parallelism when combined with runtime adaptive resource management. HPX has demonstrated promising results in many projects as diverse as astrophysics simulations, n-body problems and storm surge forecasting [16, 25]. 7 Another runtime system that adopts the AMT model is Regent [52], a programming language and compiler designed for HPC. Regent programs appear to be sequential codes with calls to tasks, i.e., functions eligible for parallel execution. 
Regent runtime system discovers implicit dataflow parallelism in the code by internally computing the task depen- dency graph, eliminating the need for explicit synchronization. Moreover, through its ability to schedule and run tasks on distributed machines, Regent frees programmers from low level distributed memory programming. Regent is shown to achieve performance comparable to OpenMP and MPI+X for a variety of applications [57, 52]. Recently, using the AMT model in OpenMP has been shown to offer important advan- tages over its BSP model in the context of sparse solvers with the DeepSparse framework [1]. DeepSparse adopts a fully integrated task-parallel approach that targets all computational steps in a sparse solver rather than a single kernel such as SpMV or SpMM. DeepSparse automatically generates and expresses the entire computation as a task dependency graph (TDG) and relies on OpenMP for the execution of this TDG. Despite the larger number of tasks that must be generated and managed, DeepSparse achieved significant improvements in terms of cache misses with little overheads through pipelined execution of tasks. Having seen the success of OpenMP’s task parallelism on sparse solvers, and the lack of work on evaluation of other AMT models in this area, this work aims to discern how OpenMP, HPX and Regent compare as well as what they offer over BSP models. Our main contributions can be summarized as follows: • a novel task-parallel implementation of two sparse solvers with different characteristics, i.e., Lanczos and LOBPCG algorithms, using the HPX and Regent runtime systems by highlighting key factors to obtain an optimized code, • an extensive performance evaluation of DeepSparse, HPX and Regent on multicore and manycore CPU architectures using a variety of sparse matrices from different domains, • empirical demonstration of the significant cache miss reduction across all cache levels 8 and the execution time improvement by up to 9.9× and 7.5× compared to highly optimized library implementations for Lanczos and LOBPCG, respectively. • presentation of a practical rule of thumb for determining the ideal task granularity for each runtime system. After reviewing the related work in Section 2.1, we discuss how we leverage the dataflow model in each framework in the context of sparse matrix computations and point out the key factors for optimized implementations in Secton 2.2. Section 2.3 describes the solvers used for benchmarking. Section 2.4 shows the impact of the optimizations applied, evaluates the frameworks in terms of execution time and cache performance, and provides a heuristic for choosing the task granularity. 2.1 Related Work There are several other AMT frameworks as we try to review below. Unfortunately, we cannot examine all of them in this work, but to the best of our knowledge, cross-examination of end-to-end sparse solver performances of some important AMT frameworks constitutes a unique aspect of our work. PaRSEC is a framework and a runtime system that aims to manage tasks through ar- chitecture aware scheduling [11]. D-PLASMA library [12] is developed using PaRSEC, and shows that PaRSEC can improve the performance of dense linear algebra algorithms by expressing them as a directed acyclic graph (DAG) of tasks. However, PaRSEC provides a conservative data-flow model in both shared and distributed memory as the runtime system works on the data distribution and task graph information specified by the user [56]. 
It also offers limited work stealing on distributed memory as inter-node scheduling relies on remote completion notifications. StarPU is a runtime system with a unified execution model at high level [7]. Its main goal is to facilitate the generation and execution of parallel tasks on heterogeneous architectures using multiple scheduling algorithms. Nevertheless, it requires the explicit data distribution 9 and task creation by the developer and depends on MPI communications on distributed memory [56]. Legion runtime system extracts parallelism by dynamically identifying nested parallelism and independent tasks on account of logical (definition of objects) and physical regions (actual copies of objects) [9]. Regent compiler is essentially built on top of Legion, making it simpler to program without sacrificing performance [52]. Charm++ is a C++-based, message-driven, and portable parallel programming frame- work and language [30]. It can expose both task and data level parallelism through the execution of parallel processes called “chares”. Charm++ is similar to HPX in that they both (i) mitigate load imbalances in a distributed system by migrating part of the data between nodes, (ii) prefer to execute the code close to where the data resides, (iii) adopt message&data-driven approach, and (iv) employ a global address space (GAS) environment. There are several other task-based parallel programming models such as Intel Threading Building Blocks (TBB) [34], Qthreads [62] and Intel Cilk Plus [46]. What makes HPX and Regent more appealing and convenient than these models listed above is that both can employ the same task parallel approach on distributed memory systems with automatic data migration, which improves the programmability on exascale architectures. Thoman et. al. [56] provide a task-focused taxonomy for HPC technologies, includ- ing HPX, Charm++, Legion, OpenMP, StarPU, Intel Cilk Plus and TBB. Kulkarni and Lumsdaine [35] theoretically compare AMT runtimes along programming model, execution model, and implementation characteristics bases. Stpiczy´nski [55] evaluates the perfor- mance of OpenMP, TBB and Cilk Plus and advises how to improve performance using the Belman-Ford algorithm as an example. Wang [59] provides a guideline to help programmers select an appropriate task model between Cilk, OpenMP and HPX by drawing conclusions from six benchmarks: Fibonacci, Knight, Pi, Sort, N-Queens, and Unbalanced-Tree-Search. To our knowledge, our work is the first to compare the empirical performance of three AMT models; OpenMP, HPX and Regent to an optimized BSP approach within the context of 10 sparse solvers on both manycore and multicore architectures. 2.2 Implementation and Optimizations In many large-scale scientific applications, sparse matrix computations are the most ex- pensive kernels. As such, in all three frameworks (DeepSparse, HPX and Regent), we define tasks based on the decomposition of the input sparse matrices. Compared to a 1D (block row) partitioning, a 2D (sparse block) partitioning is known to expose a higher degree of parallelism while potentially reducing data movement [50]. Therefore, we adapt a 2D parti- tioning scheme where tasks are defined based on the Compressed Sparse Block (CSB) [13] representation of the sparse matrix. All three task-parallel versions start by partitioning the sparse matrix into CSB blocks, which also dictates the decomposition of all other data structures involved, such as input/output vectors and/or vector blocks. 
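As a concrete, hypothetical illustration of this decomposition, the sketch below shows a minimal CSB-style block container and a single SpMM block task in C++. The struct layout and function names are ours; they are not the actual DeepSparse, HPX or Regent data structures, only a sketch of the 2D partitioning idea used throughout this chapter.

    #include <cstdint>
    #include <vector>

    // Hypothetical CSB-style storage: an m x m matrix is cut into nb x nb tiles of
    // size b x b; each nonzero stores only its (row, col) offset *within* its tile.
    struct CsbEntry {
        uint16_t rloc, cloc;   // local coordinates inside a b x b block (b <= 65536)
        double   val;
    };

    struct CsbMatrix {
        int nb;                                  // number of blocks per dimension, ceil(m / b)
        int b;                                   // block size
        std::vector<std::vector<CsbEntry>> blk;  // blk[i * nb + j] holds block (i, j); empty if no nonzeros
    };

    // One SpMM "task": Y_i += A_{i,j} * X_j for a single tile pair.
    // The dense blocks X and Y are stored row-major with n columns each.
    void spmm_block_task(const CsbMatrix& A, int i, int j,
                         const double* x, double* y, int n) {
        const auto& block = A.blk[i * A.nb + j];
        const double* xj = x + static_cast<long>(j) * A.b * n;  // block j of X
        double*       yi = y + static_cast<long>(i) * A.b * n;  // block i of Y
        for (const CsbEntry& e : block)
            for (int c = 0; c < n; ++c)
                yi[static_cast<long>(e.rloc) * n + c] += e.val * xj[static_cast<long>(e.cloc) * n + c];
    }

Creating tasks only for the non-empty blk[i * nb + j] entries, and letting each Y block be written by exactly one task at a time, mirrors the dependency-based scheme adopted later in this section.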
Listing 2.1 An example pseudocode.

    SpMM(A, X, Y, m, n);                                  // A * X  = Y
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, n, 1.0, Y, n, Z, n, 0, Q, n);       // Y * Z  = Q
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                n, n, m, 1.0, Y, n, Q, n, 0, P, n);       // Y' * Q = P

Consider the simple code snippet in Listing 2.1, which we will utilize to illustrate the salient technical details of the Lanczos and LOBPCG algorithms and our optimization ideas. This code snippet includes three of the most common sparse solver kernels. Suppose that the input matrix A is of size m×m, the block size is denoted by b, and the vector width by n ≥ 1:

• The SpMM kernel is partitioned into tasks where each operates on a b×b block of the matrix A and b×n blocks of X and Y, as shown in Figure 2.1. Tasks are created only for non-empty blocks.

• The second kernel, which is a linear combination operation and will be referred to as the XY kernel, spawns tasks that each operate on a b×n block of Y, the entire Z matrix (a single n×n block) and a b×n block of Q.

• The third one is the inner product kernel, which will be referred to as the XTY kernel; it spawns tasks as shown in Figure 2.2, computing partial results from the multiplication of an n×b block of Y^T and a b×n block of Q. A final task reduces the partial results.

Notice that there are two different ways to implement the task-parallel SpMM kernel: (i) launch all SpMM tasks asynchronously and keep a partial output vector on each thread/core, or (ii) set up dependencies between tasks to ensure that no two threads/cores access the same portion of the output vector. With the latter option, the maximum degree of concurrency in SpMM is still equal to the number of blocks in the output vector. Therefore, as long as the number of blocks in the output vector is greater than the number of threads, it avoids the memory cost and processing overhead of the reduction required for the first option. As shown in Section 2.4, the degree of parallelism yielded by the optimal block sizes exceeds the thread count for DeepSparse and HPX. Experimental results on Regent pointed out that the dependency based solution achieves better performance than the buffer based solution even when there are not enough tasks to keep all threads occupied. As a result, we adopt the latter approach in all three frameworks.

With this crucial detail in mind, for b = m/3, the computational DAG for Listing 2.1 is shown in Figure 2.3. Regardless of the underlying representation of the DAG by the runtime system, the correctness of the computation depends on a valid execution order with respect to the DAG topology, whereas the performance relies on exploiting the maximum parallelism available while determining a schedule that reduces data movement. Next, we discuss how each framework accomplishes these conflicting goals in detail.

Figure 2.1 Task partitioning of the SpMM kernel.
Figure 2.2 Task partitioning of the inner product kernel.
Figure 2.3 Task graph for the pseudocode in Listing 2.1.

2.2.1 DeepSparse

DeepSparse [1] consists of two major components:

2.2.1.1 Primitive Conversion Unit (PCU)

PCU essentially provides a high-level front end for scientific application development. Task Identifier (TI), the first subcomponent of PCU, parses GraphBLAS [31] and BLAS/LAPACK [6, 37] function calls, which are similar to those in Listing 2.1, expressed through the DeepSparse API. The output of TI is a dependency graph at the function call level, whereas tasks must be created at a much finer granularity to expose parallelism and to allow control over data movement. Task Dependency Graph Generator (TDGG), the second subcomponent, accomplishes that by going over the input/output data information generated by TI for each function call and decomposing the corresponding data structures. TDGG then generates the dependencies between individual fine-granularity tasks by examining the function call dependencies determined by TI, while taking into consideration the non-zero pattern of the sparse matrix.

2.2.1.2 Task Executor

DeepSparse provides the OpenMP task-based implementation of all computational kernels it supports. As such, the Task Executor picks each node from the output of TDGG one by one and extracts the corresponding task information. Based on the kernel id, the partition id of the input/output data structures and other required parameters, the Task Executor calls the corresponding function found in the DeepSparse library, effectively spawning an OpenMP task. In DeepSparse, the master thread spawns all OpenMP tasks in a depth-first topological order, and relies on OpenMP's default task scheduling algorithms for the execution of these tasks.

DeepSparse will explicitly generate the task dependency graph for each algorithm and input sparse matrix combination, but the overhead of this graph generation is negligible for two reasons. First, each vertex in the graph corresponds to a task operating on a large set of data in the original problem. Secondly, sparse solvers are typically iterative, and the same task dependency graph is used for several iterations. Therefore, such overhead would be insignificant in comparison to the actual problem size. This is also why we will not account for this overhead when making a performance comparison.

2.2.2 HPX

HPX attains asynchronous parallelism through asynchronous function execution (async) and future instances: asynchronous execution of a function results in it being scheduled as a new HPX thread and returns a new future instance, as HPX threads in the queue are dynamically managed by the runtime system [29]. A dataflow object, on the other hand, triggers a predefined function when a set of futures becomes ready. Combining a dataflow object with asynchronous execution provides a powerful mechanism for maintaining data dependencies and constructing an execution tree.

We show HPX's dataflow model in Listing 2.2, which implements the pseudocode in Listing 2.1. Each future defined (lines 1-4) indicates whether a task has been executed rather than carrying a return value, and thus the futures are of void type. We have four functions that are not given in the code snippet: (i) SpMM for a single block of the matrix, (ii) f_dgemm and (iii) f_dgemm_t, which are wrapper functions for cblas_dgemm calls to execute the XY and XTY tasks, and (iv) reduce_buf to accumulate the partial results of P. We define a proxy function for each of these four functions (lines 9-12) whose sole purpose is to unwrap the futures when ready before passing them to the actual function. That enables programmers to write functions with the same ease as the equivalent sequential code.
We launch an asynchronous SpMM task (line 17) that operates on sparse matrix block A_{i,j} and block X_j to update block Y_i. The dataflow returns a future to the result of the SpMM task, which is assigned to Y_ftr[i] (line 17). Since the use of a buffer for the output Y is avoided through the dependency based approach, this future depends on itself. To compute block Q_i within the XY kernel, the computation of Y_i should be finished, which is indicated by the readiness of Y_ftr[i] (line 20). Likewise, in the XTY kernel, the task responsible for the ith buffer of P will be triggered when both Y_ftr[i] and Q_ftr[i] are ready (line 23). Note that checking Y_ftr[i] is redundant there, as Q_ftr[i] already depends on Y_ftr[i]. HPX allows a vector of futures to be provided as a parameter in dataflow to set the dependencies; we use this feature for P: reduce_buf will be invoked once every future in the vector P_prtl_ftr is ready (line 24). Last but not least, we skip the empty matrix blocks (line 16) since they do not contribute to the output, in order to lighten the burden on the runtime system and improve the performance.

Listing 2.2 HPX code for the pseudocode in Listing 2.1.

     1  std::vector<hpx::shared_future<void>> Y_ftr(np);
     2  std::vector<hpx::shared_future<void>> Q_ftr(np);
     3  std::vector<hpx::shared_future<void>> P_prtl_ftr(np);
     4  hpx::shared_future<void> P_rdcd_ftr;
     5  // np (number of partitions) = ceil(m / blocksize)
     6  for (int i = 0; i != np; ++i)
     7      Y_ftr[i] = hpx::make_ready_future();
     8  // to unwrap futures passed to functions
     9  auto OpSpMM    = hpx::util::unwrapping(&SpMM);
    10  auto OpDGEMV   = hpx::util::unwrapping(&f_dgemm);
    11  auto OpDGEMV_T = hpx::util::unwrapping(&f_dgemm_t);
    12  auto OpRed     = hpx::util::unwrapping(&reduce_buf);
    13  // Y = A * X
    14  for (int i = 0; i != np; ++i)
    15      for (int j = 0; j != np; ++j)
    16          if (A[i * np + j].nnz > 0)
    17              Y_ftr[i] = hpx::dataflow(hpx::launch::async, OpSpMM, Y_ftr[i], A, X, Y, i, j);
    18  // Q = Y * Z
    19  for (int i = 0; i != np; ++i)
    20      Q_ftr[i] = hpx::dataflow(hpx::launch::async, OpDGEMV, Y_ftr[i], Y, Z, Q, i);
    21  // P = Y' * Q
    22  for (int i = 0; i != np; ++i)
    23      P_prtl_ftr[i] = hpx::dataflow(hpx::launch::async, OpDGEMV_T, Y_ftr[i], Q_ftr[i], Y, Q, Pbuf, i);
    24  P_rdcd_ftr = hpx::dataflow(hpx::launch::async, OpRed, P_prtl_ftr, Pbuf, P);

2.2.3 Regent

Regent is a language, runtime system, and compiler that exerts implicit dataflow parallelism through two key abstractions: tasks and logical regions (or simply regions) [52]. Tasks are functions that are marked as eligible for parallel execution by the programmer, and regions are collections of structured objects that can be recursively partitioned to render parallel execution possible. Tasks in Regent are forced to describe how they interact with each region they take as an argument by declaring privileges: read, write, read/write or reduce, which in return allows Regent to discover parallelism in seemingly sequential code.

To illustrate this, in Listing 2.3, we provide the Regent code for the pseudocode presented in Listing 2.1. A region (lines 16-17) is the cross-product between an index space (lines 14-15) and a field space (lines 1-3, like structs in C). A region can be disjointly partitioned into subregions with a simple use of the partition function (line 20). An index space (line 19) must be provided to name the respective subregions. The CSB format requires each block (a subregion in Regent) to dynamically allocate memory based on its number of non-zero entries, but Regent does not allow such allocation. As a workaround, we create a region that contains all entries (lines 14 & 16) in advance, where the entries falling into the same block are kept contiguous to better utilize the cache.

The SpMM task implementation is similar to that of HPX, but here, rather than passing the pointer to the entire X and Y data and making sure we access and update the appropriate portion, we directly pass the corresponding subregion (i.e., X_j and Y_i in line 26). By analyzing the privileges defined on each passed region/subregion (line 8), the runtime system extracts parallelism for SpMM tasks as shown in Figure 2.3. Moreover, an index launch represents a loop of tasks that are non-interfering and is a compiler-level optimization; it helps Regent launch those tasks without any dependency checks. It is not required, but the programmer can use it to ensure the implementation is sound (lines 31 & 36), as we do for the f_dgemm and f_dgemm_t tasks (lines 33 & 38), for instance. Although we do not share it due to space constraints, f_dgemm declares read privilege on Y_i and Z and write privilege on Q_i. Slightly different from HPX, f_dgemm_t declares reduce privilege on P, which is convenient in contrast to using reduce on Y_i for SpMM since P is a much smaller matrix with a lower overhead.

Listing 2.3 Regent code for the pseudocode in Listing 2.1.

     1  fspace csb_entry {
     2      { rloc, cloc }: uint16,
     3      val : double,
     4  }
     5  task SpMM(rA : region(ispace(int1d), csb_entry),
     6            rX : region(ispace(int1d), double),
     7            rY : region(ispace(int1d), double), s : int, e : int)
     8  where reads(rA, rX), reads writes(rY) do
     9      -- ... (SpMM implementation)
    10  end
    11  -- ... (other tasks)
    12  task main()
    13      -- ... np (num partitions) = ceil(m / blksize)
    14      var sparse_matrix_is = ispace(int1d, nnz)
    15      var vector_block_is  = ispace(int1d, m * n)
    16      var Alr = region(sparse_matrix_is, csb_entry)
    17      var Xlr = region(vector_block_is, double)
    18      -- ... (init)
    19      var part = ispace(int1d, np)
    20      var Xlp = partition(equal, Xlr, part)
    21      -- ... (Y and Q partitionings, etc.)
    22      -- Y = A * X
    23      for i = 0, np do
    24          for j = 0, np do
    25              if blkptrs[i * np + j] < blkptrs[i * np + j + 1] then
    26                  SpMM(Alr, Xlp[j], Ylp[i], blkptrs[i * np + j], blkptrs[i * np + j + 1])
    27              end
    28          end
    29      end
    30      -- Q = Y * Z
    31      __demand(__index_launch)
    32      for i = 0, np do
    33          f_dgemm(Ylp[i], Zlr, Qlp[i], m, n, blksize, i)
    34      end
    35      -- P = Y' * Q
    36      __demand(__index_launch)
    37      for i = 0, np do
    38          f_dgemm_t(Ylp[i], Qlp[i], Plr, m, n, blksize, i)
    39      end

2.3 Benchmark Applications

We evaluate the performance of the task parallel frameworks on two popular eigensolvers with different characteristics: Lanczos [36], which is SpMV based, and Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [33], which is SpMM based.

Lanczos computes the k algebraically largest (or smallest) eigenvalues of a symmetric matrix by building the Krylov subspace Q, a block of orthogonal vectors. As k ≪ m, it is a relatively simple algorithm (see Alg. 2.1) where SpMV is the main kernel.

Algorithm 2.1 Lanczos Algorithm.
 1: b ← initial vector
 2: Q_0 ← b / ||b||_2
 3: for i = 1 to k do
 4:     z ← A Q_{i-1}
 5:     α_i ← Q_{i-1}^T z
 6:     q ← [Q_0, . . . , Q_{i-1}]
 7:     z ← z − q q^T z
 8:     β_i ← ||z||
 9:     Q_i ← z / β_i
10: end for
We give the pseudocode for the LOBPCG solver in Alg. 2.2. It involves kernels with much higher arithmetic intensities (such as SpMM and several level-3 BLAS calls) compared to Lanczos. The total memory needed for the block vectors Ψ, R, Q and others can easily exceed the space the matrix Ĥ takes up. [1] shows a sample task graph for LOBPCG for a toy problem, which demonstrates the difficulty of creating a schedule to attain an efficient execution.

Algorithm 2.2 LOBPCG Algorithm solving ĤΨ = ΨM.
Input: Ĥ, a matrix of dimensions m × m
Input: Ψ_0, a block of vectors of dimensions m × n
Output: Ψ and M such that ||ĤΨ − ΨM||_F < ε, and Ψ^T Ψ = I_n
 1: Orthonormalize the columns of Ψ_0
 2: Q_0 ← 0
 3: for i = 0, 1, . . . , until convergence do
 4:     M_i ← Ψ_i^T Ĥ Ψ_i
 5:     R_i ← ĤΨ_i − Ψ_i M_i
 6:     Apply the Rayleigh–Ritz procedure on span{Ψ_i, R_i, Q_i}
 7:     Ψ_{i+1} ← argmin_{V ∈ span{Ψ_i, R_i, Q_i}, V^T V = I_n} trace(V^T Ĥ V)
 8:     Q_{i+1} ← Ψ_{i+1} − Ψ_i
 9: end for
10: Ψ = Ψ_{i+1}

DeepSparse uses the DAG constructed for a single iteration with barriers in between because the number of iterations until convergence is unknown. HPX and Regent form the DAG internally on-the-fly, so they might proceed between iterations without a barrier, but this is hard to achieve in practice due to the convergence check at each iteration. Critical path lengths in Lanczos and LOBPCG are 5 and 29, respectively. The number of tasks depends on the block and matrix sizes, and ranges from 56 to 6,570,446 per iteration.

2.4 Performance Evaluation

Performance evaluations were carried out on two systems, an Intel Broadwell-based multicore cluster and an AMD EPYC-based manycore cluster. Each Broadwell cluster node has two 14-core Intel Xeon E5-2680v4 Broadwell 2.4 GHz processors with a 64 KB L1 cache (32 KB instruction, 32 KB data) and a 256 KB L2 cache per core, in addition to a 35 MB shared L3 cache. Each EPYC cluster node has two 64-core AMD EPYC 7H12 2.6 GHz processors. Each EPYC core has a 64 KB L1 cache (32 KB instruction, 32 KB data) and a 512 KB L2 cache, and a 16 MB L3 cache is shared between every four cores.

We compare the performance of the DeepSparse, HPX and Regent implementations with two library-based BSP versions: (i) libcsr is the implementation of the benchmark solvers using thread-parallel Intel MKL library calls (including SpMV/SpMM) with CSR storage of the sparse matrix, and (ii) libcsb also uses Intel MKL calls, but with the matrix being stored in the CSB format. We use MKL calls within each task of the AMT models whenever possible for a fair comparison. Finally, we do not claim to have the best possible implementation for all solver versions, although we did our best to optimize them.

We utilized an entire node for our runs on both clusters, i.e., 28 cores on Broadwell and 128 cores on EPYC. For DeepSparse, libcsr and libcsb, we bind OpenMP threads to cores. For HPX, the number of OS threads spawned through the --hpx:threads argument is the same as the number of cores. For Regent, the number of cores to be used for executing the application tasks is specified using -ll:cpu, and -ll:util determines the number of cores allocated to the runtime system. Empirically, we found that -ll:cpu 24 -ll:util 4 on Broadwell and -ll:cpu 110 -ll:util 18 on EPYC yield (near-)optimal results on both benchmark applications. We selected 14 matrices with varying sizes, sparsity patterns, and domains from the SuiteSparse Matrix Collection in addition to the Nm7 matrix, which is from a nuclear shell model code (see Table 2.1) [17].
Since both solvers require the input matrix to be symmetric, the matrices that are not symmetric (shown in bold in Table 2.1) are made so by copying the transpose of the lower triangular part over the upper triangular part: A_new = L + L^T − D. Matrices shown in italics were originally binary matrices, and hence were filled with random values without breaking the symmetry.

Table 2.1 Matrices used in our evaluation.

Matrix                   #Rows          #Non-zeros
inline_1                     503,712        36,816,170
dielFilterV3real           1,102,824        89,306,020
Flan_1565                  1,564,794       117,406,044
HV15R                      2,017,169       281,419,743
Bump_2911                  2,911,419       127,729,899
Queen_4147                 4,147,110       329,499,284
Nm7                        4,985,422       647,663,919
nlpkkt160                  8,345,600       229,518,112
nlpkkt200                 16,240,000       448,225,632
nlpkkt240                 27,993,600       774,472,352
it-2004                   41,291,594     1,120,355,761
twitter7                  41,652,230       868,012,304
sk-2005                   50,636,154     1,909,906,755
webbase-2001             118,142,155     1,013,570,040
mawi_201512020130        128,568,730       270,234,840

All presented performance data come solely from the solver iteration parts, excluding any I/O, initialization and setup parts. Performance data were averaged over multiple iterations (20 for Lanczos, 10 for LOBPCG). For the last two matrices, the number of iterations was 10 for Lanczos and 5 for LOBPCG due to their size. Our comparison criteria are L1, L2 and LLC (L3) misses (the latter unavailable on EPYC due to the root access requirement) and execution times for both solvers and architectures. Cache misses were normalized with respect to those of libcsr, and speedups were calculated over libcsr. Cache miss data was obtained using the "perf stat" command.

We note that performance data from the case where a single socket is used on both architectures (14 cores on Broadwell and 64 cores on EPYC) are similar to the case presented here, where both sockets are used. The only difference is that, on EPYC, the task parallel frameworks seem to be affected less by NUMA-related performance issues, as their speedup numbers improve going from a single socket to the entire node. The single socket results are not presented due to space constraints.

The impact of the degree of parallelism is measured by conducting tests for several different block sizes. However, for a given block size, all AMT models are presented essentially the same DAG, i.e., the available degree of parallelism is identical across all runtimes. Since all runtimes are executing the same DAG, we believe their performance differences are due to the different scheduling algorithms employed, as this directly impacts thread idling and cache utilization.

Scheduling policies of the runtimes studied here are either opaque or not well documented, making a detailed comparison of the impact of the different scheduling policies difficult. For instance, OpenMP's task scheduling is left to the implementation and not well documented; task priorities are only ignorable hints. For HPX, NUMA-aware scheduling made a big difference in cache utilization and performance, but its scheduling algorithms are not well documented either. Regent gives task mapping options, which are, however, mainly recommended for heterogeneous computing.

In Sections 2.4.2 and 2.4.3, we share the results from the experiments where the optimal block size is employed, which depends on the solver, architecture, runtime system and matrix type. Then, in Section 2.4.4, we present a practical rule of thumb for determining the ideal task granularity by choosing the ideal block size for each runtime system.
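For reference, the sketch below is a minimal illustration of the symmetrization preprocessing described above (A_new = L + L^T − D) applied to a COO-style triplet list. The Triple type and the function name are ours; this is not the preprocessing code actually used in our experiments, only a sketch of the transformation.

    #include <vector>

    // Hypothetical COO triplet; not the storage format used elsewhere in this chapter.
    struct Triple { int r, c; double v; };

    // Symmetrize a square matrix as described above: keep the lower triangle L
    // (including the diagonal D) and mirror its strictly lower part, i.e.
    // A_new = L + L^T - D, so the diagonal is not counted twice.
    std::vector<Triple> symmetrize_lower(const std::vector<Triple>& A) {
        std::vector<Triple> out;
        out.reserve(2 * A.size());
        for (const Triple& t : A) {
            if (t.r >= t.c) {                        // lower-triangular (or diagonal) entry
                out.push_back(t);                    // contributes to L (and D)
                if (t.r != t.c)
                    out.push_back({t.c, t.r, t.v});  // mirrored entry from L^T
            }
            // strictly upper-triangular entries of the original matrix are discarded
        }
        return out;
    }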
22 2.4.1 The Effect of Optimizations We tried to optimize the code at the same level for each task parallel system using the techniques discussed below. Although each framework benefits from all optimizations, due to space constraints we only share the results of every optimization for a certain framework, solver and architecture combination where the impact is the most evident. In all following optimization plots, compared implementations incorporate all other optimization techniques so that the only variable is the selected optimization at hand. 2.4.1.1 First-Touch Placement Policy This policy refers to allocation of a data page in the memory closest to the thread accessing it first. When a single thread initializes all data structures, the data ends up residing in the memory of a single NUMA node, which increases access times, consequently hurting the performance. Leveraging this policy would simply require the initialization of vector blocks and the sparse matrix in parallel in the case of sparse solvers. The 128 cores on an EPYC node are internally organized into 8 NUMA subregions, 4 per socket. As shown in Figure 2.4 for DeepSparse, this optimization is vital for good performance (up to 2.5 fold) for the small and mid-sized matrices on the EPYC system. We also utilize it in the BSP versions (libcsr and libcsb) for a fair comparison. Figure 2.4 Execution time of DeepSparse for Lanczos on EPYC wrt first-touch policy. 2.4.1.2 Skipping Empty Tasks Depending on the sparsity pattern of a matrix and chosen CSB block size, there are empty blocks which do not contribute to the result of SpMV/SpMM operation. Spawning 23 inline_1dielFilterFlan_1565HV15RBump_2911Queen_4147Nm7nlpkkt160nlpkkt200nlpkkt240it-2004twitter7sk-2005webbasemawiExecution Time (s) 0.028 0.047 0.061 0.114 0.075 0.129 0.207 0.091 0.146 0.270 0.439 0.697 0.540 0.515 0.869 0.019 0.028 0.026 0.046 0.039 0.061 0.135 0.089 0.169 0.281 0.435 0.754 0.533 0.500 0.872Sequential InitializationLeveraging First Touch tasks for those blocks creates scheduling overheads for the runtime system. Figure 2.5 shows that skipping such tasks may speed up the execution time by 30% on average, albeit not as effective on some matrices. We attribute the lack of improvement in those cases to the fact that both implementations use the optimal block size for each matrix, which in general does not yield too many empty blocks. Figure 2.5 Execution time of HPX for Lanczos on Broadwell wrt skipping empty tasks. 2.4.1.3 Eliminating Reduction for SpMV/SpMM Output As discussed in Section 2.2, the dependency based approach is used on all frameworks to eliminate the reduction overheads. In Figure 2.6, we show the empirical advantage of this decision on Regent for the LOBPCG algorithm. We observe that the reduce-based approach yields an extremely poor performance on large matrices, and we believe this is due to large buffers that need to be allocated by each core. Furthermore, Regent runtime system manages the reduce operation internally (recall that one of the region privileges was reduce). Given the problematic scaling behavior of Regent with regard to the number of tasks (discussed in Section 2.4.4), poor performance of the reduce based approach on Regent is not a surprise. 2.4.1.4 Other Attempts We also attempted several framework-specific optimizations. For instance, HPX allows passing scheduling hints to their scheduler in an effort to create executors that target a specific NUMA domain or to pin HPX threads to a particular core. 
We employed schedul- ing hints to achieve a locality-aware scheduling for both solvers. This technique improved HPX’s both Lanczos and LOBPCG performance significantly on EPYC, where there exist 24 inline_1dielFilterFlan_1565HV15RBump_2911Queen_4147Nm7nlpkkt160nlpkkt200nlpkkt240it-2004twitter7sk-2005webbasemawiExecution Time (s) 0.042 0.048 0.058 0.109 0.073 0.122 0.232 0.168 0.317 0.580 0.959 1.359 1.613 1.103 1.248 0.013 0.040 0.048 0.081 0.062 0.103 0.253 0.140 0.268 0.469 0.831 1.223 1.116 1.161 1.251Launching Empty TasksSkipping Empty Tasks Figure 2.6 Execution time of Regent for LOBPCG on Broadwell wrt two SpMV/SpMM computation approaches. 8 NUMA domains (16 cores each). Regent provides a technique called dynamic tracing [38] to reduce the task management overhead for iterative solvers. This technique relies on cap- turing the task graph in the first iteration and replaying it for subsequent iterations through memoization to avoid the dependence analysis. However, this last attempt did not yield any significant performance improvement. 2.4.2 Lanczos Evaluation Lanczos algorithm is a relatively simple algorithm in the sense that it has much fewer types and number of tasks than LOBPCG because it essentially consists of one SpMV and one inner product kernel at each iteration. As such, scheduling decisions are simpler and there are fewer data reuse opportunities. Consequently, we observe that the task parallel systems often lead to little to no improvement in terms of cache misses. This can be seen in Figure 2.7 where the cache misses comparison on EPYC for different Lanczos versions is shown. No framework achieves consistent reduction in cache misses on L1 level. Moreover, the improvements on L2 level can be attributed to the matrices being stored in the CSB format since libcsb, the other BSP version, yields similar improvements. Most importantly, all three task-parallel versions give decent speedups on both archi- tectures as shown in Figure 2.8. On Broadwell (the top subplot), DeepSparse, HPX and Regent achieve up to 2.3×, 4.3× and 2.0× improvement although on average, the speedup achieved is somewhat modest (1.5×, 2.2× and 1.1×, respectively). Task parallel versions perform better when we go from a multicore (Broadwell) to a manycore (EPYC) architec- 25 inline_1dielFilterFlan_1565HV15RBump_2911Queen_4147Nm7nlpkkt160nlpkkt200nlpkkt240it-2004twitter7sk-2005webbasemawiExecution Time (s) 0.39 0.79 0.66 1.6 1.0 1.6 12 3.2 6.0 10 208 347 248 141 55 0.19 0.34 0.37 0.62 0.51 0.85 1.70 1.42 2.51 4.64 7.10 11.0 11.0 9.54 9.87Reduce BasedDependency Based Figure 2.7 L1 and L2 misses of different Lanczos versions on EPYC normalized wrt libcsr. ture. DeepSparse achieves as high as 6.5× speedup, HPX up to 9.9× speedup and Regent up to 2.7× speedup. On average, DeepSparse, HPX and Regent achieve 3.3×, 4.9× and 1.6× speedup, the majority of which comes from the large matrices. Figure 2.8 Speedup of different Lanczos versions on Broadwell (top) and EPYC (bottom) over libcsr. We attribute the speedups observed across the board to the increased parallelism with tasking and reduced synchronization overheads. In fact, a further investigation of the ex- ecution flow graph of tasks in Figure 2.9 shows that the manycore architecture provides a greater level of parallelism for the task parallel systems to fill the gap resulting from load imbalances of SpMV with the succeeding tasks. 
Therefore, each iteration is completed not long after the execution of the last SpMV task on EPYC, providing the AMT approaches with a greater success. 2.4.3 LOBPCG Evaluation LOBPCG is a complex algorithm with several different kernel types; its task graph may result in millions of tasks depending on the block size. In LOBPCG, vector blocks have only 8-16 columns, hence there is no tiling in the column dimension. Block sizes refer to 26 0.00.51.01.52.02.53.03.5L1 Misses(Better)libcsrlibcsbDeepSparseHPXRegentinline_1dielFilterV3realFlan_1565HV15RBump_2911Queen_4147Nm7nlpkkt160nlpkkt200nlpkkt240it-2004twitter7sk-2005webbase-2001mawi0.00.51.01.52.02.53.0L2 Misses(Better)0.00.51.01.52.02.53.03.54.04.5Broadwell Speedup(Better)libcsrlibcsbDeepSparseHPXRegentinline_1dielFilterV3realFlan_1565HV15RBump_2911Queen_4147Nm7nlpkkt160nlpkkt200nlpkkt240it-2004twitter7sk-2005webbase-2001mawi0246810EPYC Speedup(Better) (a) libcsr on Broadwell (b) DeepSparse on Broadwell (c) libcsr on EPYC (d) HPX on EPYC Figure 2.9 Execution flow graph of nlpkkt240 from first three iterations of Lanczos for different versions and architectures. the number of rows in each chunk, they ranged from 1K to 16M. Since LOBPCG requires several vector operations consecutively, there are plenty of data reuse opportunities for vector chunks. In Figure 2.10, we show the cache misses and speedup comparison on Broadwell for all five LOBPCG versions. The libcsr and libcsb versions achieve similar number of cache misses, while the task-parallel versions demonstrate an outstanding cache performance: • DeepSparse yields a consistent improvement throughout all cache levels: It achieves 3.0× - 10.4× fewer L1 misses, 3.8× - 12.0× fewer L2 misses and 1.4× - 4.7× fewer L3 cache misses than libcsr. 27 00.511.52Time (Sec.)0 4 8 12162024Thread IDRESETSPMVDOT1XTYXYSUBDOT2NORMCOPY00.511.5Time (Sec.)0 4 8 12162024Thread ID00.511.52Time (Sec.)0 16 32 48 64 80 96 112Thread ID00.10.20.30.4Time (Sec.)0 16 32 48 64 80 96 112Thread ID • HPX’s cache performance is on par with DeepSparse: It achieves 2.8× - 13.7× fewer L1 misses, 3.7× - 13.1× fewer L2 misses and 1.4× - 5.2× fewer L3 cache misses than libcsr. • Regent has competitive cache utilization too, as it produces 4.3× - 9.6× fewer L1 misses, 4.0× - 12.3× fewer L2 misses and 1.6× - 6.2× fewer L3 cache misses compared to libcsr. Figure 2.10 L1, L2 and LLC (L3) misses of different LOBPCG versions on Broadwell normalized wrt libcsr. As the top subplot of Figure 2.11 shows that on Broadwell, even with the implicit task graph creation and execution overheads of the runtime systems, this significant reduction in cache misses leads to 1.8× - 3.0× speedup for DeepSparse, 1.5× - 4.4× speedup for HPX and 0.8× - 1.9× speedup for Regent (slowdown occurring on a few smaller matrices) over the execution times of libcsr. Given the highly complex underlying task dependency graph of LOBPCG and abundant data re-use opportunities available, we attribute these improvements to the pipelined execution of tasks which belong to different computational kernels but use the same data structures. AMT models continue their superior performance in terms of execution time on EPYC as shown in the bottom subplot of Figure 2.11. 
As a matter of fact, DeepSparse and HPX improve their performance further compared to Broadwell: DeepSparse achieves 1.2x-5.5x speedups and HPX achieves 1.7x-7.5x speedups over libcsr. Regent, however, demonstrates a similar performance on this architecture, achieving 0.8x-2.3x speedup, where the performance degradation is again observed on the smaller matrices.

AMT models achieve up to 99% L1 hits for LOBPCG, compared to the 85-90% hit ratio of the loop-parallel versions. Considering the AMT models' outstanding cache miss performance as well, we conclude that cache utilization is an important factor for LOBPCG due to its data reuse opportunities. The improved performance observed in HPX by switching to NUMA-aware scheduling, which is around 50%, also supports this view.

Figure 2.11 Speedup of different LOBPCG versions on Broadwell (top) and EPYC (bottom) over libcsr.

The pipelined execution of tasks in DeepSparse and HPX in comparison to libcsr can be observed in Figure 2.12. The performance of the XTY kernel accounts for the main difference in timing (see Figure 2.12a & Figure 2.12b). Data parallel execution of this kernel in the BSP model considerably hurts the performance, which seems to be avoided in task parallel execution to a great extent through the re-use of the involved matrix blocks in kernels such as XY or SpMM after the execution of XTY tasks. There also exist characteristic differences within the task parallel models, as seen in Figure 2.12c & Figure 2.12d: both DeepSparse and HPX give the best result on the nlpkkt240 matrix with a 256K CSB block size, whereas their execution flow graphs do not show much resemblance to each other. HPX in general seems to place less value on prioritization of the tasks that are launched earlier, which consequently produces a more shuffled graph where the overlap in each kernel's start and finish time increases. Regardless of that difference, their execution times are similar in this example (approximately 3.0 sec).

Figure 2.12 Execution flow graph of nlpkkt240 from two iterations of LOBPCG for different versions and architectures: (a) libcsr on Broadwell, (b) DeepSparse on Broadwell, (c) DeepSparse on EPYC, (d) HPX on EPYC.

2.4.4 Block Size Selection

The CSB block size has a significant effect on the performance of the task parallel models, as factors such as task granularity, degree of parallelism, and scheduling overhead of the runtime systems are directly shaped by the block size for a given matrix and solver type. This is because we use this block size as a uniform partitioning factor, whether it is for a 2D (e.g., SpMV and SpMM) or 1D (e.g., all vector operations) kernel. Therefore, by the "block count", we simply mean the number of tiles/blocks in each dimension, which is determined by the row count of matrices/vectors and the CSB block size.
Choosing a small block size creates a large number of small tasks, which is preferable on a parallel architecture, but the large number of tasks may lead to significant scheduling overheads. Increasing the block size reduces such overheads, but this may then lead to increased thread idle times and load imbalances. Therefore, finding the sweet spot between these two extremes is important. We found the optimal block size, which differs for each matrix, solver, architecture, and runtime system combination, by trying a variety of sizes from 2^10 to 2^24. This brute-force search may not always be practical. Thus, analyzing the data further, we noticed that the optimal block size always yields a block count between 8 and 511 regardless of the case. As such, picking the optimal one boils down to a comparison of six block sizes, namely the ones that result in 8-15, 16-31, 32-63, 64-127, 128-255 or 256-511 block counts.

We show the performance profiles in Figure 2.13 to compare these six block count ranges for each runtime system and architecture. Note that our findings for the LOBPCG solver match the ones from the Lanczos solver, so to avoid redundancy, we only share LOBPCG results here. In the performance profiles, we plot the percentage of the instances in which a block count yields an execution time on a matrix that is no longer than τ times the best execution time found by any block count for that matrix. Therefore, the higher a profile is at a given τ, the better the heuristic. We observe the following:

• For DeepSparse, the 32-63 block count is the best option and it is always within 1.15x of the best option on Broadwell. On EPYC, the 64-127 block count takes the top spot and the 32-63 block count provides comparable performance.

• For HPX, the 64-127 block count gives the optimal performance in general on both Broadwell and EPYC. However, the 32-63 and 128-255 block count configurations perform similarly well regardless of the architecture.

• Regent prefers more coarse-grained tasks, as the 16-31 block count performs the best on both architectures. The fact that the bottom three spots belong to the highest three options suggests that Regent has scaling issues with regard to the creation or scheduling of a large number of tasks. In fact, going beyond a block count of 64 can cause 5x-10x slowdowns, although we cannot see it here as τ is shown between 1.0 and 2.0 for all systems.

Figure 2.13 The relative performance of block counts for DeepSparse, HPX and Regent on the LOBPCG solver: (a) Broadwell DeepSparse, (b) EPYC DeepSparse, (c) Broadwell HPX, (d) EPYC HPX, (e) Broadwell Regent, (f) EPYC Regent.

As a practical rule of thumb, we can say that the 32-63 block count on Broadwell and the 64-127 block count on EPYC are good choices for DeepSparse and HPX. Even though we have slightly less than a task per thread per kernel with the 64-127 block count, there are many kernels in LOBPCG to be executed in parallel. In fact, the execution flow graphs verify that this kernel-level parallelism might be just enough to keep all threads occupied by exposing more than a task per thread. We know from the speedup plots that DeepSparse and HPX are the best two versions by far. Taking these block counts into account, they achieve high performance by over-decomposing the work to yield more than one task per thread for load balancing purposes while limiting that task-to-thread ratio to avoid scheduling overhead. Tuning the block size is very important for best results.
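To make the construction of the performance profiles in Figure 2.13 concrete, the sketch below computes, for each candidate block-count range, the fraction of matrices on which its execution time is within a factor τ of the best time observed for that matrix. It is a minimal illustration only; the dictionary layout, the function name and the toy timings are assumptions, not data from our benchmark scripts.

    import numpy as np

    def performance_profile(times, taus):
        # times[cand][mat] = execution time of candidate block-count range
        # `cand` on matrix `mat`
        candidates = list(times)
        matrices = list(next(iter(times.values())))
        # best time achieved by any candidate on each matrix
        best = {m: min(times[c][m] for c in candidates) for m in matrices}
        profile = {}
        for c in candidates:
            ratios = np.array([times[c][m] / best[m] for m in matrices])
            # fraction of matrices within a factor tau of the best, per tau
            profile[c] = [(ratios <= t).mean() for t in taus]
        return profile

    taus = np.linspace(1.0, 2.0, 101)
    times = {"32-63":  {"nlpkkt240": 2.9, "twitter7": 3.4},   # toy numbers
             "64-127": {"nlpkkt240": 3.1, "twitter7": 3.2}}
    prof = performance_profile(times, taus)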
However, this is a complicated choice that depends on the specific problem, architecture and compiler. We note that even for the easier-to-characterize dense linear algebra kernels, auto-tuning is necessary (e.g., the ATLAS library [61]). Hence, for sparse solvers this is a very difficult problem. Nevertheless, we tried to illustrate the trade-offs and give some insights which we hope could be helpful to others.

2.5 Conclusion of This Work

Several AMT frameworks have emerged as we move towards exascale, and there is a lack of comparative studies, by third-party users in particular. We believe a fair evaluation on various application domains would benefit readers in helping them make a well-informed decision in preparing for exascale. This is precisely the motivation for our evaluation of three runtime systems in the context of sparse solvers. To our knowledge, this is the first such comparative work.

We introduced optimized implementations of the LOBPCG and Lanczos eigensolvers in the task-parallel paradigm using three runtime systems: OpenMP (through the DeepSparse framework), HPX and Regent. We show that these task-parallel systems achieve significantly fewer cache misses across different cache layers for LOBPCG, a fairly complex solver. They also provide promising improvements in execution time over a traditional optimized BSP implementation for both solvers on two different architectures: Broadwell, an Intel-based multicore system, and EPYC, an AMD-based manycore system. Moreover, they allow achieving such performance improvements without sacrificing the ease of high-level programming.

We conclude that OpenMP tasking and HPX set themselves apart from the others. OpenMP's great performance is commendable, but so is HPX's because it generates the DAG itself as it goes along and is extensible to distributed memory architectures. Future work will be in the direction of testing HPX in a distributed memory environment using large-scale sparse solvers and graph analytics kernels, and comparing these to hybrid MPI+OpenMP solutions.

CHAPTER 3
PERFORMANCE OF THE HPX FRAMEWORK FOR SPARSE EIGENSOLVERS ON DISTRIBUTED MEMORY ARCHITECTURES

After seeing the success of HPX in our work described in Chapter 2, and given that HPX extends its asynchronous dataflow model to distributed memory environments, our next goal became testing HPX in a distributed memory environment using large-scale sparse solvers and graph analytics kernels, and comparing these to hybrid MPI+OpenMP solutions. To this end, we have written the distributed Lanczos algorithm using both the HPX and MPI+OpenMP programming models.
Both implementations are simple in the sense that the matrix is 1D row partitioned, which results in all the communications being global and straightforward. However, the HPX code runs significantly slower than the plain MPI+OpenMP code on average on up to 512 AMD EPYC cores. Moreover, a careful investigation of HPX's performance with respect to the matrix block (tile) size, which dictates the partitioning of the input and output column vectors and of a tall-and-skinny matrix used in this algorithm, shows that HPX prefers considerably large block sizes where localities (a locality in HPX's terminology is the same as a rank in MPI) and threads clearly suffer from starvation. This behavior contradicts our shared memory results for the same algorithm, where HPX achieves high performance by over-decomposing the work to yield more than one task per thread for load balancing purposes.

3.1 Implementation

The pseudocode of the distributed Lanczos algorithm can be seen in Listing 3.1. Lanczos computes the k algebraically largest (or smallest) eigenvalues of a symmetric matrix by building the Krylov subspace Q, a block of orthogonal vectors. As k ≪ n, it is a relatively simple algorithm where SpMV is the main kernel. In the distributed implementation of Lanczos, there are four global communication calls: two all_reduce calls on the local dot product results (lines 18 and 24), one all_reduce call on the small vector that serves as the input of the linear combination kernel (line 20), and a rather expensive all_gather call (line 15), at the end of which each locality gets the global input vector for the SpMV operation. Note that this all_gather is needed because the matrix is 1D row partitioned and, in order to compute the local output vector of the SpMV, each locality needs the full input vector. As such, all the column vectors q, z, QQpZ and the Krylov subspace Q are naturally 1D partitioned.

Listing 3.1 The pseudocode of the distributed Lanczos algorithm.

     1  // n = number of rows/columns (of a square matrix)
     2  // p = number of MPI ranks or HPX localities
     3  // n_local = number of matrix rows a locality owns (roughly n/p)
     4  // k = number of largest eigenvalues to be computed
     5  // A = n_local x n local sparse matrix
     6  // q, z, QQpZ = n_local x 1 column vectors
     7  // q_global = n x 1 column vector
     8  // Q = n_local x k tall-and-skinny matrix
     9  // QpZ = k x 1 small vector
    10
    11  q = a random unit vector;
    12  Q = q;
    13
    14  for (it = 0; it != maxIterations; ++it) {
    15      q_global = all_gather(q);
    16      z = A * q_global;                      // SpMV
    17      alpha = q^T z;
    18      alpha_global = all_reduce(alpha);
    19      QpZ = Q^T z;                           // inner product kernel
    20      QpZ_global = all_reduce(QpZ);
    21      QQpZ = Q * QpZ_global;                 // linear combination kernel
    22      z = z - QQpZ;
    23      beta = ||z||^2;
    24      beta_global = sqrt(all_reduce(beta));
    25      q = z / beta_global;
    26      Q = [Q q];
    27  }

Both MPI+OpenMP (or simply MPI) and HPX implement the pseudocode provided in Listing 3.1. The biggest difference, however, is that we adopt a 2D partitioning scheme for HPX, where tasks are defined based on the Compressed Sparse Block (CSB) representation of the sparse matrix, whereas we simply adopt a fork-join model for the MPI code. The HPX code starts by partitioning the sparse matrix into CSB blocks, which are then 1D row block partitioned among localities.
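The block-count and ownership arithmetic behind this partitioning can be summarized in a few lines. The sketch below is illustrative only (the function name is ours), but it reproduces the numbers worked out in the example that follows.

    from math import ceil

    def partition_block_rows(n_rows, block_size, n_localities):
        # number of CSB blocks per dimension for a square matrix
        n_blocks = ceil(n_rows / block_size)
        # 1D row-block partitioning: spread the block rows as evenly as
        # possible, giving the first `extra` localities one more block row
        base, extra = divmod(n_blocks, n_localities)
        owned = [base + (1 if loc < extra else 0) for loc in range(n_localities)]
        return n_blocks, owned

    # webbase-2001: 118,142,155 rows, block size 2**20, 4 localities
    n_blocks, owned = partition_block_rows(118_142_155, 2**20, 4)
    print(n_blocks, owned)   # 113 [29, 28, 28, 28]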
To give an example based on real experiments, when the block size is chosen to be 2^20 for the matrix webbase-2001, which has 118,142,155 rows and columns, there will be 113 x 113 CSB matrix blocks since ceil(118,142,155 / 2^20) = 113. When there are four localities, for instance, the first locality will own the first 29 x 113 CSB blocks whereas the rest will own 28 x 113 CSB blocks each. Since the column vectors are 1D block partitioned as well, the first locality will get 29 x 1 blocks of q, z and QQpZ, where the block size is again 2^20.

There are other significant differences between the HPX and MPI codes. For example, HPX can overlap the all_gather operation in line 15 with the SpMV tasks in line 16 that only need the local blocks of the input vector, as the all_gather operation is non-blocking. Furthermore, as the SpMV kernel is the most expensive kernel, we try to keep the threads in each locality as occupied as possible by using buffers for the output vector of the SpMV operation. Whenever a buffer is available, a thread will be assigned to an SpMV task to fill in the buffer, assuming that not all tasks are complete. And whenever the block of the output vector which corresponds to that SpMV task is available, meaning there is no task updating it, a thread will add the partial result in that buffer to that block of the output vector. Then, a task will be created to empty that buffer. The full implementation can be seen here.

3.2 Results

We used 5 matrices to assess the performance of the distributed Lanczos algorithm, as shown in Table 3.1. All matrices besides Nm7 are from the SuiteSparse matrix collection. Those that are not originally symmetric are made so by copying the transpose of the lower triangular part over the upper triangular part. Also, we assigned random weights to the originally binary matrices.

    Matrix             #Rows         #Non-zeros
    inline_1           503,712       36,816,170
    dielFilterV3real   1,102,824     89,306,020
    Nm7                4,985,422     647,663,919
    twitter7           41,652,230    868,012,304
    webbase-2001       118,142,155   1,013,570,040

Table 3.1 Matrices used in our evaluation.

We conduct experiments on the AMD EPYC cluster at HPCC using up to 4 nodes. Each EPYC cluster node has two 64-core AMD EPYC 7H12 2.6 GHz processors. Each EPYC core has a 64 KB L1 cache (32 KB instruction, 32 KB data) and a 512 KB L2 cache, and a 16 MB L3 cache is shared between every four cores. We utilize the entire node for our runs, i.e., all 128 cores on EPYC. For HPX, the number of OS threads spawned per locality is the same as the number of cores per locality. The same goes for MPI regarding the number of OpenMP threads per MPI process.

We ran each matrix for HPX with varying CSB block sizes defining the task granularity. In the first plots (Figures 3.1, 3.2 & 3.3), where we compare the performance of HPX and MPI, we report the execution times obtained using the optimal block size in the case of HPX. On the other hand, MPI uses the CSR matrix format without any blocking and, as such, the block size is not a parameter there. For both HPX and MPI, we report the average iteration time over 20 iterations on the first four matrices and over 10 iterations on the last matrix, webbase-2001.

In Figures 3.1, 3.2 & 3.3, each row corresponds to a set of results for the same matrix whereas each column corresponds to a set of results for the same processor affinity. In Figure 3.2, for example, as we are using two EPYC nodes and there are 256 cores in total, each MPI process or HPX locality will be assigned 16 cores in the case of 16 ranks.
The middle column in Figure 3.2 shows the experiment results with such processor affinity.

3.2.1 1-Node Results

We see in Figure 3.1 that HPX is slower than MPI for every matrix and thread configuration combination. Although the slowdown is closer to 5x on inline_1, the smallest matrix, it is never higher than 2x on the largest two matrices, twitter7 and webbase-2001. In fact, HPX delivers a competitive performance for certain combinations, such as on twitter7 with 4 ranks.

Figure 3.1 Execution times of the distributed Lanczos algorithm on a single EPYC node.

3.2.2 2-Node Results

In Figure 3.2, we see a similar trend to that of the single node experiments. That is, HPX is slower than MPI in 14 out of 15 combinations and the difference is more apparent on the small matrices. Among the three rank-core configurations, 64 ranks with 4 cores each is the bottom-performing one. Also, neither HPX nor MPI scales well when we go from one node to two nodes in general. For the largest matrix, webbase-2001, with 4 ranks, for example, MPI and HPX take 0.68 and 1.14 sec on a single node, and 0.66 and 1.28 sec on two nodes, respectively.

Figure 3.2 Execution times of the distributed Lanczos algorithm on two EPYC nodes.

3.2.3 4-Node Results

Figure 3.3 shows that for the large matrices, the configuration with 16 ranks of 32 cores each (middle column) is where HPX performs similarly to MPI. However, in the first column, for the experiments that finished, we see that HPX is 24x and 7x slower than MPI. Also, on twitter7, HPX is, for the first time, more than 2x slower (the right column). Nonetheless, we observe some speedup when going from two nodes to four nodes on webbase-2001 for both codes, although it is not nearly ideal: considering the 64-rank configuration, MPI and HPX take 1.04 and 1.87 sec on two nodes, respectively, and 0.80 and 1.44 sec on four nodes.

Figure 3.3 Execution times of the distributed Lanczos algorithm on four EPYC nodes.

3.3 Discussion

While HPX achieves excellent parallelism and performance on shared memory systems in comparison to both OpenMP's task-parallel and loop-parallel models, it fails to do so on distributed memory systems in comparison to the plain MPI+OpenMP model. As such, we performed some analysis to figure out why this is the case.
We limit our analysis to webbase-2001, the largest matrix, as it should provide plenty of work to keep the threads occupied in an ideal scenario. However, we must state that the trends we see on this matrix are also observed on the others.

Table 3.2 shows the average execution time of a single Lanczos iteration for the shared and distributed implementations of HPX with varying block sizes. Both versions use an entire EPYC node. Looking at the raw data and comparing the distributed HPX results to the shared HPX results, we observe that HPX's distributed implementation does a poor job, performing around 25x worse than the shared version (13.7 sec vs 0.57 sec) at the finest task granularity. What is more, the distributed version performs the best when there is not nearly enough degree of parallelism. Specifically, when the block size is 2^23, the CSB block count is 15 x 15, which is 1D row partitioned among 16 localities. As such, the first 15 localities get 1 x 15 matrix blocks while the last one gets virtually nothing. Moreover, while the last locality gets no block of any column vectors, the rest get one such block each. This means that there are 15 tasks spawned for the SpMV kernel and a single task spawned for each of the other kernels (such as the dot product or linear combination kernel). Despite the fact that there are 8 threads per locality waiting for a task, it is surprising to realize that this task granularity yields the top performance. The results make much more sense for the shared version: with a block size of 2^19, there are 226 x 226 SpMV tasks and 226 tasks per other kernel spawned, which are then executed by 128 threads. And this task granularity performs better than the others. Such a disparity between the shared and distributed versions makes us wonder whether this is simply an implementation issue or a problem in the runtime system's core.

    Matrix    Block Size   Block Count   Shared HPX   Distributed HPX
    webbase   2^19         226 x 226     0.57         13.73
              2^20         113 x 113     0.77         6.82
              2^21         57 x 57       1.38         3.71
              2^22         29 x 29       0.74         3.55
              2^23         15 x 15       0.99         1.38
              2^24         8 x 8         1.29         1.47

Table 3.2 Execution times of shared and distributed HPX codes per Lanczos iteration on a single node (128 threads in the shared run and 16 localities with 8 threads each in the distributed run).

We also wondered if the problematic trends we observed with HPX could be attributed to the lack of new software packages and libraries available to us on HPCC at MSU. As such, we repeated some of the experiments at NERSC using Haswell nodes but, as seen in Table 3.3, we observed the same outcome there too: first, HPX is considerably slower than MPI (it is not given in the table, but MPI takes 0.73 sec per Lanczos iteration on 4 nodes of Haswell whereas HPX cannot get faster than 1.69 sec). Second, HPX performs the best when there are not nearly enough tasks to utilize all the available localities and threads. HPX somehow prefers coarse-grained tasks on distributed runs, which is the opposite of what we see on shared memory runs.

    Matrix    Block Size   Block Count   Haswell   EPYC
    webbase   2^19         226 x 226     15.14     5.42
              2^20         113 x 113     7.70      2.94
              2^21         57 x 57       4.09      1.44
              2^22         29 x 29       2.15      0.96
              2^23         15 x 15       2.27      0.74
              2^24         8 x 8         1.69      0.89

Table 3.3 Execution times of distributed HPX codes per Lanczos iteration on four nodes of Haswell and EPYC (16 localities with 8 threads each on Haswell and 16 localities with 32 threads each on EPYC).
We also tried a couple of possible optimization techniques, such as calling the dgemv function of sequential MKL from the inner product and linear combination kernels, disabling work stealing, and/or forcing the [] operator, which is a member function of the data-holding classes, to be inlined by the compiler. Nonetheless, the original code ran faster without any of these possible optimization efforts.

In addition, we thought the lack of improvement we see in the distributed code might be due to the overhead of using an object-oriented approach with the client and server classes. In the shared memory counterpart of Lanczos, we had simply used flat arrays to hold vector and matrix data while relying on futures to void variables to determine the availability of vector chunks. As such, we wondered if we could improve the performance of the distributed runs by simply adopting the same approach, except for the global communication part, of course. This code can be seen here. However, this version without any object creation or destruction overheads worked slower than the first version with the client-server approach.

Finally, we wanted to estimate the time the global communications take for both HPX and MPI+OpenMP. To do that, we made tasks return immediately without any computation for all kernels. This way, the entire time would be spent on either global communications or task creation/execution/deletion overheads. The execution times of normal runs and those with empty tasks can be seen in Table 3.4. We know from the shared memory runs that HPX does not yield any significant runtime system overheads. Therefore, we believe that most of the time in the empty runs is spent on the communication part. Although both MPI+OpenMP and HPX communicate the same data, HPX spends much more time on it: MPI+OpenMP takes 0.65 sec for communication whereas it varies between 0.73 and 10.67 sec for HPX. It is important to note here that 0.73 sec occurs when only half of the localities participate in communication, as there are only 8 row blocks and only the localities with an even locality ID do all the communication and computation.

    Matrix    BS     BC          MPI    MPI empty   HPX     HPX empty
    webbase   2^19   226 x 226   0.97   0.65        13.31   10.67
              2^20   113 x 113   0.98   0.65        6.62    6.81
              2^21   57 x 57     0.97   0.65        3.50    2.77
              2^22   29 x 29     0.98   0.66        2.14    1.41
              2^23   15 x 15     0.98   0.65        1.58    1.03
              2^24   8 x 8       0.98   0.64        1.33    0.73

Table 3.4 Execution times of MPI+OpenMP and HPX codes with normal and empty tasks per Lanczos iteration on a single node of EPYC (16 localities with 8 threads each), highlighting the global communication overhead.

To confirm our suspicions regarding HPX's network communication issues, we measured the bandwidth and latency achieved within HPX and compared them to pure MPI numbers. Note that HPX itself enables the use of MPI as well as TCP for the networking operations in the HPX runtime, but since TCP was considerably slower than MPI, we opted for MPI within HPX as the communication layer. Therefore, whenever a communication call is used with HPX, it is simply a wrapped MPI call inside HPX's runtime system.

To assess the network bandwidth, we simply create a vector of doubles of varying sizes between 8 KB and 512 MB and synchronously send it from rank/locality 0 to rank/locality 1. For the latency, we only send a single integer in the same way. We report the average over 100 tries. The benchmark numbers are reported in Tables 3.5 and 3.6.
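Before presenting those numbers, the sketch below shows a ping-pong variant of this measurement using mpi4py. The actual benchmarks were written against the MPI and HPX communication layers directly, so this Python version, including the message size chosen, is purely illustrative.

    # run with: mpiexec -n 2 python pingpong.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    n_trials = 100
    msg_bytes = 8 * 1024          # one of the sizes between 8 KB and 512 MB
    buf = np.zeros(msg_bytes // 8, dtype=np.float64)

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(n_trials):
        if rank == 0:
            comm.Send([buf, MPI.DOUBLE], dest=1, tag=0)
            comm.Recv([buf, MPI.DOUBLE], source=1, tag=1)
        elif rank == 1:
            comm.Recv([buf, MPI.DOUBLE], source=0, tag=0)
            comm.Send([buf, MPI.DOUBLE], dest=0, tag=1)
    elapsed = MPI.Wtime() - t0

    if rank == 0:
        # each trial moves the message twice (there and back)
        bandwidth = 2 * n_trials * msg_bytes / elapsed / 1e6   # MB/s
        print(f"avg round trip: {elapsed / n_trials * 1e6:.2f} us, "
              f"bandwidth: {bandwidth:.1f} MB/s")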
    Intra-node      min Bandwidth   max Bandwidth   avg Bandwidth   min Latency   max Latency   avg Latency
    pure MPI        202.4           18857.6         13177.1         0.08          0.12          0.09
    HPX using MPI   39.2            2170.5          1148.6          81.80         122.29        91.63

Table 3.5 Network benchmark on AMD EPYC for intra-node communication. The bandwidth is reported in MB/s and the latency in µs (microseconds).

    Inter-node      min Bandwidth   max Bandwidth   avg Bandwidth   min Latency   max Latency   avg Latency
    pure MPI        241.5           12075.1         10468.2         0.30          1.33          0.37
    HPX using MPI   10.87           704.8           332.8           254.71        406.91        312.35

Table 3.6 Network benchmark on AMD EPYC for inter-node communication. The bandwidth is reported in MB/s and the latency in µs (microseconds).

3.4 Conclusion of This Work

It seems that distributed HPX yields poor and questionable numbers in the 1-node, 2-node and 4-node experiments. That is, it is not possible to achieve a speedup over the plain MPI+OpenMP implementation regardless of the blocking factor. Also, it is interesting to see HPX performing best when the tasks are too coarse grained. Seeing how slow HPX is when run with empty tasks suggests that communication becomes the bottleneck as the tasks become finer grained. Moreover, the bandwidth and latency numbers observed suggest that HPX introduces a significant overhead when a communication call is used, despite using MPI as the networking layer. After seeing the performance of HPX in the distributed case for the Lanczos solver, we decided not to pursue our initial intention, which was to develop our own large-scale sparse solver and graph analytics framework using HPX as the backbone.

CHAPTER 4
HYBRID EIGENSOLVERS FOR NUCLEAR CONFIGURATION INTERACTION CALCULATIONS

This chapter has been published in Computer Physics Communications in November 2023 [5], available at: https://doi.org/10.1016/j.cpc.2023.108888.

So far, in Chapters 2 and 3, we were targeting a broad base of matrices with varying sizes, sparsity patterns, and domains from the SuiteSparse Matrix Collection. The goal was to explore the acceleration of sparse matrix computations through asynchronous dataflow models on shared and distributed memory architectures. This kind of general exploration allowed us to experiment with different runtime systems and different matrices. Although we showed these runtime systems' merit on the Lanczos and LOBPCG algorithms, we only approached these sparse eigensolvers as a collection of linear algebra kernels, where we ran each solver for a fixed number of iterations using those matrices.

In this work, we take an interest in the convergence behavior of these algorithms in the context of nuclear physics, where the eigenvalue problem has particular significance. We namely explore whether we can improve the convergence of the iterative eigensolvers through hybrid algorithms. While doing that, we also attempt to tackle the limitations of the eigensolvers with respect to the large-scale sparse systems that arise in nuclear physics. Therefore, while observing the iteration counts and the total number of SpMVs and/or SpMMs performed as we employ both pure and hybrid algorithms, we confine ourselves to test problems from the study of nuclear physics. Moreover, although we use MATLAB and Fortran for the experiments of this work, the iteration and SpMV counts of the algorithms are agnostic to programming languages and runtime systems.
The computational study of the structure of atomic nuclei involves solving the many-body Schrödinger equation for a nucleus consisting of Z protons and N neutrons, with A = Z + N the total number of nucleons,

    \hat{H}\,\Psi_i(\vec{r}_1, \ldots, \vec{r}_A) = E_i\,\Psi_i(\vec{r}_1, \ldots, \vec{r}_A),    (4.1)

where \hat{H} is the nuclear Hamiltonian, E_i are the discrete energy levels of the low-lying spectrum of the nucleus, and Ψ_i the corresponding A-body wavefunctions. A commonly used approach to address this problem is the no-core Configuration Interaction (CI) method (or No-Core Shell Model), in which the many-body Schrödinger equation, Eq. (4.1), becomes an eigenvalue problem

    H x = \lambda x,    (4.2)

where H is an n x n square matrix that approximates the many-body Hamiltonian \hat{H}, λ is an eigenvalue of H, and x is the corresponding eigenvector [8, 53, 58]. The size n of the symmetric matrix H grows rapidly with the number of nucleons A and with the desired numerical accuracy, and can easily be several billion or more; however, this matrix is extremely sparse, at least for nuclei with A ≥ 6. Furthermore, we are typically interested in only a few (5-10) eigenvalues at the low end of the spectrum of H. An iterative method that can make use of an efficient Hamiltonian-vector multiplication procedure is therefore often the preferred method to solve Eq. (4.2) for the lowest eigenpairs.

For a long time, the Lanczos algorithm [36] with full orthogonalization was the default algorithm to use because it is easy to implement and because it is quite robust, even though it requires storing hundreds of Lanczos basis vectors. Indeed, there are several software packages in which the Lanczos algorithm is implemented for nuclear structure calculations. Here we focus on the software MFDn (Many-Fermion Dynamics for nuclear structure), which is a hybrid MPI/OpenMP code that is being used at several High-Performance Computing centers; it has recently also been ported to GPUs using OpenACC.

In recent work [51], we have shown that the low-lying eigenvalues can be computed efficiently by using the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) algorithm [32]. The advantages of the LOBPCG algorithm, which we will describe in some detail in the next section, over the Lanczos algorithm include

• The algorithm is a block method that allows us to multiply H with several vectors
Although methods for stabilizing the algorithm has been developed and implemented [26, 19], they do not completely eliminate the problem. • Even though the algorithm in principle only requires storing three blocks of vectors, in practice, many more blocks of vectors are needed to avoid performing additional SpMMs in the Rayleigh-Ritz procedure. This is a problem for machines on which high bandwidth memory is in short supply (such as GPUs). In this work, we examine several alternative algorithms for solving large-scale eigenvalue problems in the context of nuclear configuration interaction calculations. In particular, we 48 will examine the block Lanczos algorithm [21] and the Chebyshev filtered subspace itera- tion. Both are block algorithms that can benefit from an efficient implementation of the SpMM operation and can take advantage of good initial guesses to several eigenvectors, if they are available. Neither one of these algorithms can incorporate a preconditioner, which is a main drawback. However, as we will show in Section 4.3, in the early iterations of these algorithms, good approximations to the desired eigenpairs emerge quickly, even though the total number of SpMVs required to obtain accurate approximations can be higher compared to the Lanczos and LOBPCG algorithms. This observation suggests that these algorithms can be combined with algorithms that are effective in refining existing eigenvector approxi- mations. One such refinement algorithm is the residual minimization method (RMM) with direct inversion of iterative subspace (DIIS) correction. This algorithm has an additional feature that it can reach convergence to a specific eigenpair without performing orthogonal- ization against approximations to other eigenpairs as long as a sufficiently accurate initial guess is available. Therefore, this algorithm can also be used to compute (or refine) different eigenpairs independently. This feature introduces an additional level of concurrency in the eigenvalue computation that enhances the parallel scalability. The chapter is organized as follows. In the next section, we give an overview of the Lanc- zos, block Lanczos, LOBPCG as well as the Chebyshev filtered subspace iteration (ChebFSI) algorithms. We also describe the RMM-DIIS algorithm and discuss how it can be combined with (block) Lanczos, LOBPCG and ChebFSI to form a hybrid algorithm to efficiently com- pute the desired eigenpairs. In Section 4.3, we give several numerical examples to demon- strate the effectiveness of the hybrid algorithm and compare it with the standard algorithms. We discuss both the convergence of the algorithm and some implementation issues. 4.1 Existing Algorithms We review several algorithms for computing a few algebraically smallest eigenvalues and the corresponding eigenvectors. We denote the eigenvalues of the n × n nuclear CI Hamil- tonian H arranged in an increasing order by λ1 ≤ λ2 ≤ · · · ≤ λn. Their corresponding 49 eigenvectors are denoted by x1, x2, ..., xn. We are interested in the first nev ≪ n eigen- values and eigenvectors. If we define X = [x1, x2, . . . , xnev ] and Λ = diag {λ1, λ2, . . . , λnev }, respectively, we have HX = XΛ. 4.1.1 Lanczos Algorithm The Lanczos algorithm is a classical algorithm for solving large scale eigenvalue problems. The algorithm generates an orthonormal basis of a k-dimensional Krylov subspace K(H; v1) = {v1, Hv1, ..., H k−1v1}, (4.3) where v1 is an appropriately chosen and normalized starting guess. 
Such a basis is produced by a Gram-Schmidt process in which the key step of obtaining the (j+1)st basis vector is

    w_j = (I - V_j V_j^T) H v_j, \qquad v_{j+1} = w_j / \|w_j\|,    (4.4)

where Vj is a matrix that contains all previous orthonormal basis vectors, i.e., Vj = (v1, v2, ..., vj). The projection of H into the k-dimensional subspace spanned by the columns of Vk is a tridiagonal matrix Tk that satisfies

    H V_k = V_k T_k + w_k e_k^T,    (4.5)

where ek is the last column of the k x k identity matrix. Approximate eigenpairs of H are obtained by solving the k x k eigenvalue problem

    T_k q = \theta q.    (4.6)

It follows from (4.5), (4.6) and the fact that V_k^T V_k = I_k, V_k^T w_k = 0 that the relative residual norm associated with an approximate eigenpair (θ, Vk q) can be estimated by

    \frac{\|H(V_k q) - \theta (V_k q)\|}{|\theta|} = \frac{\|w_k\| \cdot |e_k^T q|}{|\theta|}.    (4.7)

4.1.2 Block Lanczos

One of the drawbacks of the standard Lanczos algorithm is that it is not easy for the algorithm to take advantage of good initial guesses to several desired eigenvectors. Although we can take a simple linear combination or average of these initial guesses as the initial vector v1, the Lanczos algorithm tends to converge to one of the eigenvectors much faster than the others. An algorithm that can take advantage of multiple starting guesses to different eigenvectors is the block Lanczos algorithm. The block Lanczos algorithm generates an orthonormal basis of the block Krylov subspace

    \mathcal{K}(H; V_1) = \{V_1, HV_1, \ldots, H^{k-1}V_1\},    (4.8)

where V1 is a matrix that contains nb orthonormal basis vectors that are often good initial guesses to the desired eigenvectors. Note that, in practice, nb can be chosen to be slightly larger than the number of desired eigenvalues nev. The Gram-Schmidt process used to generate an orthonormal basis in Lanczos is replaced by a block Gram-Schmidt step characterized by

    W_j = (I - \mathcal{V}_j \mathcal{V}_j^T) H V_j,    (4.9)

where the matrix \mathcal{V}_j contains the first j orthonormal basis blocks, i.e.,

    \mathcal{V}_j = (V_1, V_2, \ldots, V_j).    (4.10)

The normalization step in (4.4) is simply replaced by a QR factorization step, i.e., W_j = V_{j+1} R_{j+1}, where V_{j+1}^T V_{j+1} = I_{n_b} and R_{j+1} is an n_b x n_b upper triangular matrix. The projection of H into the subspace spanned by the columns of \mathcal{V}_k is a block tridiagonal matrix T_k that satisfies

    H \mathcal{V}_k = \mathcal{V}_k T_k + W_k E_k^T,    (4.11)

where E_k consists of the last n_b columns of the n_b k x n_b k identity matrix. Approximate eigenpairs of H are obtained by solving the n_b k x n_b k eigenvalue problem

    T_k q = \theta q.    (4.12)

It follows from (4.11), (4.12) and the fact that \mathcal{V}_k^T \mathcal{V}_k = I_{n_b k}, \mathcal{V}_k^T W_k = 0 that the relative residual norm associated with an approximate eigenpair (θ, \mathcal{V}_k q) can be estimated by

    \frac{\|H(\mathcal{V}_k q) - \theta (\mathcal{V}_k q)\|}{|\theta|} = \frac{\|W_k\|_F \cdot \|E_k^T q\|}{|\theta|}.    (4.13)

Algorithm 4.1 outlines the main steps of the block Lanczos algorithm.

Both the Lanczos and block Lanczos algorithms produce approximations to the desired eigenvectors in the form z = p_d(H) v_0, where v0 is some starting vector and d is the degree of the polynomial. However, for the same number of multiplications of the sparse matrix H with a vector (SpMVs), denoted by m, the degree of the polynomial generated in the block Lanczos algorithm is d = m/nb, whereas in the standard Lanczos algorithm the degree of the polynomial is d = m. Because the accuracy of the approximate eigenpairs obtained from the Lanczos and block Lanczos methods is directly related to d, we expect more SpMVs to be used in a block Lanczos algorithm to reach convergence.
On the other hand, in a block Lanczos method, one can perform nb SpMVs as a matrix-matrix multiplication (SpMM) on a tall skinny matrix consisting of a block of nb vectors, which is generally more efficient than performing nb SpMVs in succession. As a result, the block Lanczos method can take less time even if it performs more SpMVs.

Algorithm 4.1 The block Lanczos algorithm.
Input: The sparse matrix H
Input: The number of desired eigenvalues nev
Input: An initial guess to the eigenvectors associated with the lowest nb ≥ nev eigenvalues, X^(0) ∈ R^{n x nb}
Input: Convergence tolerance (tol)
Input: Maximum number of iterations allowed (maxiter)
Output: (Λ, X), where Λ is an nev x nev diagonal matrix containing the desired eigenvalues, and X ∈ R^{n x nev} contains the corresponding eigenvector approximations
1: Generate V_1 ∈ R^{n x nb} that contains an orthonormal basis of X^(0)
2: \mathcal{V}_1 ← (V_1)
3: T_1 ← \mathcal{V}_1^T H \mathcal{V}_1
4: for i = 1, 2, ..., maxiter do
5:   W_i ← (I − \mathcal{V}_i \mathcal{V}_i^T) H V_i
6:   Generate V_{i+1} that contains an orthonormal basis of W_i
7:   \mathcal{V}_{i+1} ← (\mathcal{V}_i V_{i+1})
8:   Update T_{i+1} ← \mathcal{V}_{i+1}^T H \mathcal{V}_{i+1}
9:   Solve the projected eigenvalue problem T_{i+1} U = U Θ, where U^T U = I and Θ is a diagonal matrix containing the eigenvalues of T_{i+1} in ascending order
10:  X_i = \mathcal{V}_{i+1} U(:, 1:nev)
11:  Determine the number of converged eigenpairs nc by checking the Ritz residual estimate (4.13)
12:  if nc ≥ nev then
13:    exit loop
14:  end if
15: end for
16: Λ ← Θ
17: X ← X_i

4.1.3 LOBPCG

It is well known that the invariant subspace associated with the smallest nev eigenvalues and spanned by the columns of X ∈ R^{n x nev} is the solution to the trace minimization problem

    \min_{X^T X = I} \operatorname{trace}(X^T H X).    (4.14)

The LOBPCG algorithm developed by Knyazev [32] seeks to solve (4.14) by using the updating formula

    X^{(i+1)} = X^{(i)} C_1^{(i+1)} + W^{(i)} C_2^{(i+1)} + P^{(i-1)} C_3^{(i+1)},    (4.15)

to approximate the eigenvectors corresponding to the nev leftmost eigenvalues of H, where W^{(i)} ∈ R^{n x nev} is the preconditioned gradient of the Lagrangian

    L(X, \Lambda) = \tfrac{1}{2}\operatorname{trace}(X^T H X) - \tfrac{1}{2}\operatorname{trace}\big[(X^T X - I)\Lambda\big]    (4.16)

associated with (4.14) at X^{(i)}, P^{(i-1)} is the search direction obtained in the (i−1)st iterate of the optimization procedure, and C_1^{(i+1)}, C_2^{(i+1)}, C_3^{(i+1)} are a set of coefficient matrices of matching dimensions that are obtained by minimizing (4.16) within the subspace S^{(i)} spanned by

    S^{(i)} \equiv \big(X^{(i)} \; W^{(i)} \; P^{(i-1)}\big).    (4.17)

To improve the convergence of the LOBPCG algorithm, one can include a few more vectors in X^{(i)} so that the number of columns in X^{(i)}, W^{(i)} and P^{(i)} is nb ≥ nev. The preconditioned gradient W^{(i)} can be computed as

    W^{(i)} = K^{-1}(H X^{(i)} - X^{(i)} \Theta^{(i)}),    (4.18)

where \Theta^{(i)} = X^{(i)T} H X^{(i)}, and K is a preconditioner that approximates H in some way. The subspace minimization problem that yields the coefficient matrices C_1^{(i+1)}, C_2^{(i+1)}, C_3^{(i+1)}, which are the three block rows of a 3nb x nb matrix C^{(i+1)}, can be solved as a generalized eigenvalue problem

    \big(S^{(i)T} H S^{(i)}\big) C^{(i+1)} = \big(S^{(i)T} S^{(i)}\big) C^{(i+1)} D^{(i+1)},    (4.19)

where D^{(i+1)} is an nb x nb diagonal matrix containing the nb leftmost eigenvalues of the projected matrix pencil (S^{(i)T} H S^{(i)}, S^{(i)T} S^{(i)}). The procedure that forms the projected matrices S^{(i)T} H S^{(i)} and S^{(i)T} S^{(i)} and solves the projected eigenvalue problem (4.19) is often referred to as the Rayleigh-Ritz procedure [44]. Note that the summation of the last two terms in (4.15) represents the search direction followed in the ith iteration, i.e.,

    P^{(i)} = W^{(i)} C_2^{(i+1)} + P^{(i-1)} C_3^{(i+1)}.    (4.20)
Algorithm 4.2 outlines the main steps of the basic LOBPCG algorithm. The most computationally costly step of Algorithm 4.2 is the multiplication of H with a set of vectors. Although it may appear that we need to perform such calculations in steps 8 (where the projected matrix S^{(i)T} H S^{(i)} is formed) and 10, the multiplication of H with X^{(i)}, X^{(i+1)} and P^{(i)} can be avoided because H X^{(i+1)} and H P^{(i)} satisfy the following recurrence relationships

    H X^{(i+1)} = H X^{(i)} C_1^{(i+1)} + H W^{(i)} C_2^{(i+1)} + H P^{(i-1)} C_3^{(i+1)},    (4.21)

    H P^{(i)} = H W^{(i)} C_2^{(i+1)} + H P^{(i-1)} C_3^{(i+1)}.    (4.22)

Therefore, the only SpMM we need to perform is H W^{(i)}. For the nuclear CI calculations of interest, the dimension n of the sparse symmetric matrix H can be several billion, whereas W^{(i)} is a tall skinny n x nb matrix with nb typically of the order of 8 to 16.

Algorithm 4.2 The basic LOBPCG algorithm.
Input: The sparse matrix H
Input: A preconditioner K
Input: An initial guess to the eigenvectors associated with the lowest nb ≥ nev eigenvalues, X^(0) ∈ R^{n x nb}
Input: Number of desired eigenvalues (nev)
Input: Convergence tolerance (tol) and maximum number of iterations allowed (maxiter)
Output: (Λ, X), where Λ is an nev x nev diagonal matrix containing the desired eigenvalues, and X ∈ R^{n x nev} contains the corresponding eigenvector approximations
1: [C^(1), Θ^(1)] ← RayleighRitz(H, X^(0))
2: X^(1) ← X^(0) C^(1)
3: R^(1) ← H X^(1) − X^(1) Θ^(1)
4: P^(0) ← ∅
5: for i = 1, 2, ..., maxiter do
6:   W^(i) ← K^{-1} R^(i)
7:   S^(i) ← [X^(i), W^(i), P^(i−1)]
8:   [C^(i+1), Θ^(i+1)] ← RayleighRitz(H, S^(i))
9:   X^(i+1) ← S^(i) C^(i+1)
10:  R^(i+1) ← H X^(i+1) − X^(i+1) Θ^(i+1)
11:  P^(i) ← W^(i) C_2^(i+1) + P^(i−1) C_3^(i+1)
12:  Determine the number of converged eigenpairs nc by comparing the relative norms of the leading nev columns of R^(i+1) against the convergence tolerance tol
13:  if nc ≥ nev then
14:    exit loop
15:  end if
16: end for
17: Λ ← Θ^(i)(1:nev, 1:nev)
18: X ← X^(i)(:, 1:nev)

4.1.4 Chebyshev filtering

An mth-degree Chebyshev polynomial of the first kind can be defined recursively as

    T_m(t) = 2t\,T_{m-1}(t) - T_{m-2}(t),    (4.23)

with T_0(t) = 1 and T_1(t) = t. The magnitude of T_m(t) is bounded by 1 within [−1, 1] and grows rapidly outside of this interval.

By mapping the unwanted eigenvalues (e.g., the unoccupied states) of the many-body Hamiltonian H enclosed by [λF, λub] to [−1, 1] through the linear transformation (t − c)/e, where c = (λF + λub)/2 and e = (λub − λF)/2, we can apply \hat{T}_m(H) = T_m((H − cI)/e) to a vector v to amplify the eigenvector components in v that correspond to eigenvalues outside of [λF, λub]. Figure 4.1 shows a 10th degree Chebyshev polynomial defined on the spectrum of a Hamiltonian matrix and how the leftmost eigenvalues λi, i = 1, 2, ..., 8 are mapped to T10(λi). Applying T_m((H − cI)/e) repeatedly to a block of vectors V filters out the eigenvectors associated with eigenvalues in [λF, λub]. The desired eigenpairs can then be obtained through the standard Rayleigh-Ritz procedure [44].

Figure 4.1 Chebyshev polynomials of the first kind.

To obtain an accurate approximation to the desired eigenpairs, a high degree Chebyshev polynomial may be needed. Instead of applying a high degree polynomial once to a block of vectors, which can be numerically unstable, we apply Chebyshev polynomial filtering within a subspace iteration to iteratively improve approximations to the desired eigenpairs. We will refer to this algorithm as the Chebyshev filtering based subspace iteration (CheFSI).
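To make the filtering step itself concrete, the sketch below applies the degree-d filter T_d((H − cI)/e) to a block of vectors via the standard first-kind three-term recurrence. The function, the toy matrix and the chosen bounds are illustrative assumptions only (the production implementations live in MATLAB and MFDn), and some implementations fold additional scaling factors into the recurrence.

    import numpy as np
    import scipy.sparse as sp

    def chebyshev_filter(H, V, degree, lam_F, lam_ub):
        e = (lam_ub - lam_F) / 2.0       # half-width of the unwanted interval
        c = (lam_ub + lam_F) / 2.0       # center of the unwanted interval
        Q = V                             # T_0 applied to V
        W = (H @ V - c * V) / e           # T_1 applied to V
        for _ in range(2, degree + 1):
            Y = 2.0 * (H @ W - c * W) / e - Q   # three-term recurrence (4.23)
            Q, W = W, Y
        return W

    # toy example: small random sparse symmetric matrix, block of 8 vectors
    n, nb = 1000, 8
    A = sp.random(n, n, density=1e-3, format="csr")
    H = (A + A.T) / 2
    V = np.random.rand(n, nb)
    W = chebyshev_filter(H, V, degree=10, lam_F=0.0, lam_ub=2.0)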
The basic steps of this algorithm are listed in Algorithm 4.3.

Algorithm 4.3 The Chebyshev filtering based subspace iteration (CheFSI).
Input: The sparse matrix H
Input: An initial guess to the eigenvectors associated with the lowest nb ≥ nev eigenvalues, X^(0) ∈ R^{n x nb}
Input: Number of desired eigenvalues (nev)
Input: The degree of the Chebyshev polynomial d
Input: Convergence tolerance (tol)
Input: Maximum number of subspace iterations allowed (maxiter)
Output: (Λ, X), where Λ is an nev x nev diagonal matrix containing the desired eigenvalues, and X ∈ R^{n x nev} contains the corresponding eigenvector approximations
1: e ← (λub − λF)/2
2: c ← (λub + λF)/2
3: [Q, R] ← CholeskyQR(H, X^(0))
4: for i = 1, 2, ..., maxiter do
5:   W ← 2(HQ − cQ)/e
6:   for j = 2, ..., d do
7:     Y ← 2(HW − cW)/e − Q
8:     Q ← W
9:     W ← Y
10:  end for
11:  [Q, R] ← CholeskyQR(Y)
12:  T ← Q^T H Q
13:  Solve the eigenvalue problem T S = S Θ, where Θ is diagonal, and update Q by Q ← QS
14:  Determine the number of converged eigenpairs nc by comparing the relative norms of the leading nev columns of R = HQ − QΘ against the convergence tolerance tol
15:  if nc ≥ nev then
16:    exit loop
17:  end if
18: end for
19: Λ ← Θ(1:nev, 1:nev)
20: X ← Q(:, 1:nev)

Owing to the three-term recurrence in (4.23), W = \hat{T}_m(H) V can be computed recursively without forming \hat{T}_m(H) explicitly in advance. Lines 5 to 9 of Algorithm 4.3 illustrate how this step is carried out in detail. To maintain numerical stability, we orthonormalize the vectors in W. The orthonormalization can be performed by a (modified) Gram-Schmidt process or by a Householder transformation based QR factorization [22].

In Algorithm 4.3, the required inputs are a filter degree d, an estimated upper bound of the spectrum of H, λub, and an estimated spectrum cutoff level, λF. The upper bound λub can be estimated by running a few Lanczos iterations [18, 65, 39], and λF can often be set to 0 or to the estimate of the (nb+1)st leftmost eigenvalue of H obtained from the Lanczos algorithm. In the subsequent subspace iterations, λF can be modified based on more accurate approximations to the desired eigenvalues. See the work of Saad [49] and Zhou et al. [66] for more details on Chebyshev filtering.

4.1.5 RMM-DIIS

The residual minimization method (RMM) [64, 28] accelerated by direct inversion of iterative subspace (DIIS) [45] was developed in the electronic structure calculation community to solve a linearized Kohn-Sham eigenvalue problem in each self-consistent field (SCF) iteration. Given a set of initial guesses to the desired eigenvectors, {x_j^{(0)}}, j = 1, 2, ..., nev, the method produces successively more accurate approximations by seeking an optimal linear combination of previous approximations to the jth eigenvector that minimizes the norm of the corresponding sum of residuals. To be specific, let x_j^{(i)}, i = 0, 1, ..., ℓ−1 be the approximations to the jth eigenvector of H obtained in the previous ℓ−1 steps of the RMM-DIIS algorithm, and θ_j^{(i)} be the corresponding eigenvalue approximations. In the ℓth iteration (for ℓ > 1), we first seek an approximation of the form

    \tilde{x}_j = \sum_{i=\max\{0,\,\ell-s\}}^{\ell-1} \alpha_i x_j^{(i)},    (4.24)

where

    \sum_{i=\max\{0,\,\ell-s\}}^{\ell-1} \alpha_i = 1,    (4.25)

for some fixed 1 ≤ s ≤ smax.
The coefficients αi are obtained by solving the constrained least squares problem

    \min_{\alpha} \Big\| \sum_{i=\max\{0,\,\ell-s\}}^{\ell-1} \alpha_i r_j^{(i)} \Big\|_2,    (4.26)

where r_j^{(i)} = H x_j^{(i)} − θ_j^{(i)} x_j^{(i)} is the residual associated with the approximate eigenpair (θ_j^{(i)}, x_j^{(i)}), subject to the constraint defined by (4.25). The constrained minimization problem can be turned into an unconstrained minimization problem by substituting α_{ℓ−1} = 1 − \sum_{i=\max\{0,\,\ell-s\}}^{\ell-2} α_i into (4.26).

Once we solve (4.26), we compute the corresponding residual \tilde{r}_j = H \tilde{x}_j − \tilde{θ}_j \tilde{x}_j, where \tilde{θ}_j = ⟨\tilde{x}_j, H \tilde{x}_j⟩ / ⟨\tilde{x}_j, \tilde{x}_j⟩. A new approximation to the desired eigenvector is obtained by projecting H into the two-dimensional subspace W_j spanned by \tilde{x}_j and \tilde{r}_j, and solving the 2 x 2 generalized eigenvalue problem

    (W_j^T H W_j)\, g = \theta\, (W_j^T W_j)\, g.    (4.27)

Such an approximation can be written as

    x_j^{(\ell)} = W_j g,    (4.28)

where g is the eigenvector associated with the smaller eigenvalue of the matrix pencil (W_j^T H W_j, W_j^T W_j). Although the Rayleigh-Ritz procedure defined by (4.27) and (4.28) is often used to compute the lowest eigenvalue of H, the additional constraint specified by (4.24) and (4.25) keeps x_j^{(ℓ)} close to the initial guess of the jth eigenvector. Therefore, if the initial guess is sufficiently close to the jth eigenvector, x_j^{(ℓ)} can converge to this eigenvector instead of the eigenvector associated with the smallest eigenvalue of H. Algorithm 4.4 outlines the main steps of the RMM-DIIS algorithm.

Algorithm 4.4 The RMM-DIIS algorithm.
Input: The sparse matrix H
Input: An initial guess to the nev desired eigenvectors, {x_j^(0)}, j = 1, 2, ..., nev
Input: Maximum dimension of the DIIS subspace s
Input: Convergence tolerance (tol)
Input: Maximum number of iterations allowed (maxiter)
Output: {(θ_j, x_j)}, j = 1, 2, ..., nev, where θ_j is the approximation to the jth lowest eigenvalue, and x_j is the corresponding approximate eigenvector
1: for j = 1, 2, ..., nev do
2:   x_j^(0) ← x_j^(0) / ||x_j^(0)||
3:   θ_j^(0) ← ⟨x_j^(0), H x_j^(0)⟩
4:   r_j^(0) ← H x_j^(0) − θ_j^(0) x_j^(0)
5:   x̃_j^(1) ← x_j^(0)
6:   r̃_j^(1) ← r_j^(0)
7:   for i = 1, 2, ..., maxiter do
8:     if i > 1 then
9:       Solve the residual minimization least squares problem (4.26)
10:      Set x̃_j^(i) ← Σ_{ℓ=max{0,i−s}}^{i−1} α_ℓ x_j^(ℓ) according to (4.24)
11:      x̃_j^(i) ← x̃_j^(i) / ||x̃_j^(i)||
12:      Set the residual r̃_j^(i)
13:    end if
14:    Set W_j ← (x̃_j^(i), r̃_j^(i))
15:    Solve the Rayleigh-Ritz problem (4.27) and obtain (θ_j^(i), x_j^(i))
16:    Compute the residual r_j^(i) ← H x_j^(i) − θ_j^(i) x_j^(i)
17:    if ||r_j^(i)|| / |θ_j^(i)| < tol then
18:      exit loop
19:    end if
20:  end for
21: end for

4.1.6 Comparison summary

The computational cost of all the iterative methods discussed above is dominated by the number of sparse Hamiltonian matrix vector multiplications (SpMVs). In the Lanczos algorithm, the number of SpMVs is the same as the number of iterations. In block algorithms such as the block Lanczos algorithm and the LOBPCG algorithm, the number of SpMVs is the product of the block size and the number of iterations. The number of SpMVs used in the Chebyshev filtering subspace iteration is the product of the number of subspace iterations, the block size and the degree of the Chebyshev polynomial used. The number of SpMVs used in RMM-DIIS is the sum of the RMM-DIIS iterations over all desired eigenpairs.

In addition to SpMVs, some dense linear algebra operations are performed in these algorithms to orthonormalize basis vectors and to perform the Rayleigh-Ritz calculations; a minimal sketch of this projection step is given below.
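The sketch below illustrates the Rayleigh-Ritz projection that the block methods above rely on: H is projected onto the subspace spanned by the columns of a basis block S, the small generalized eigenproblem is solved, and the Ritz pairs are formed. It assumes a symmetric H that supports the @ operator (e.g., a numpy array or scipy sparse matrix); the function name and the toy usage are illustrative only.

    import numpy as np
    import scipy.linalg as la

    def rayleigh_ritz(H, S, nev):
        HS = H @ S                     # one multiplication of H with the basis block
        A = S.T @ HS                   # projected Hamiltonian S^T H S
        B = S.T @ S                    # Gram matrix S^T S
        theta, C = la.eigh(A, B)       # small generalized eigenproblem, cf. (4.19)
        theta, C = theta[:nev], C[:, :nev]
        X = S @ C                      # Ritz vectors
        # relative residual norms ||H x - theta x|| / |theta| for each Ritz pair
        rel_res = np.linalg.norm(HS @ C - X * theta, axis=0) / np.abs(theta)
        return theta, X, rel_res

    # toy usage: diagonal H and a random 6-dimensional subspace
    H = np.diag(np.arange(1.0, 101.0))
    S = np.linalg.qr(np.random.rand(100, 6))[0]
    theta, X, rel_res = rayleigh_ritz(H, S, nev=3)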
The cost of orthonormalization can become large if too many Lanczos or block Lanczos iterations are performed. Such cost is relatively small in LOBPCG, RMM-DIIS and the Chebyshev polynomial filtering subspace iteration.

In Table 4.1, we compare the memory usage of each method discussed above. Note that the first term for Lanczos, block Lanczos, LOBPCG and CheFSI in this table is generally the dominant term. We use O(c) to denote a small multiple (i.e., typically 2 or 3) of c. The number of iterations taken by block Lanczos, k_blocklan, is typically smaller than the number of Lanczos iterations k_Lan when the same number of eigenpairs is computed by these methods. However, k_blocklan · nb is often larger than k_Lan. It is possible to use a smaller amount of memory in LOBPCG and CheFSI at the cost of performing more SpMMs. For example, if we were to explicitly compute H X^{(i+1)} and H P^{(i)} in the LOBPCG algorithm to perform the Rayleigh-Ritz calculation, instead of updating these blocks according to (4.21) and (4.22) respectively, we could reduce the LOBPCG memory usage to 4n·nb + O(9nb^2). For the RMM-DIIS algorithm, we assume that we compute one eigenpair at a time. The parameter smax is the maximum dimension of the DIIS subspace constructed to correct an approximate eigenvector. This parameter is often chosen to be between 10 and 20. If we batch the refinement of several eigenvectors together to make use of SpMMs, as we will discuss below, the memory cost of RMM-DIIS increases by a factor of nev.

    Method          Memory cost
    Lanczos         n · (k_Lan + nev) + O(k_Lan^2)
    block Lanczos   n · nb · (k_blocklan + nev) + O((nb · k_blocklan)^2)
    LOBPCG          7n · nb + O(9 nb^2)
    CheFSI          4n · nb + O(nb^2)
    RMM-DIIS        n · (3 nev + smax)

Table 4.1 A comparison of the memory footprint associated with the Lanczos, block Lanczos, LOBPCG, CheFSI and RMM-DIIS methods.

4.2 Hybrid Algorithms

Among all the methods discussed above, the Lanczos, block Lanczos and LOBPCG methods as well as the Chebyshev polynomial filtered subspace iteration can all proceed from an arbitrary starting guess of the desired eigenvectors, although they all benefit from the availability of a good starting guess. As an eigenvector refinement method, the RMM-DIIS method requires a relatively more accurate approximation of the desired eigenvectors. Therefore, a more effective way to use the RMM-DIIS method is to combine it with the other methods, i.e., we can start with the Lanczos, block Lanczos, LOBPCG or CheFSI method and switch to RMM-DIIS when the approximate eigenvectors become sufficiently accurate.

The basis orthogonalization cost as well as the memory requirement become progressively higher in the Lanczos and block Lanczos methods. Therefore, a notable benefit of switching from the Lanczos or block Lanczos methods to RMM-DIIS is to lower the orthogonalization cost and memory requirement. Although the orthogonalization cost and memory requirement of the LOBPCG method are fixed throughout all LOBPCG iterations, the subspace (4.17) from which eigenvalue and eigenvector approximations are drawn becomes progressively more ill-conditioned as the norms of the vectors in (4.18) become smaller. The ill-conditioned subspace can make the LOBPCG algorithm numerically unstable even after the techniques proposed in [26, 19] are applied. Therefore, it is desirable to switch from LOBPCG to RMM-DIIS when the condition number of the subspace is not too large. The orthogonalization cost and memory requirement for CheFSI are also fixed.
The method is generally more efficient in the early subspace iterations, when T_n(H) is applied to a block of vectors in each iteration. As the approximate eigenvectors converge, applying T_n(H) to a block of vectors in a single iteration becomes more costly than RMM-DIIS, which can refine each approximate eigenvector separately. Therefore, it also appears plausible to switch to RMM-DIIS when the approximate eigenvectors become sufficiently accurate in CheFSI.

4.3 Numerical examples
In this section, we compare and analyze the performance of the algorithms presented in Section 4.1 using numerical examples. Through these examples, we demonstrate the advantage of using a hybrid LOBPCG/RMM-DIIS or block Lanczos/RMM-DIIS algorithm to solve nuclear configuration interaction eigenvalue problems. We also discuss how to address some of the practical problems involved in making the hybrid algorithm more efficient. We started with MATLAB experiments, in which it is much simpler to assess the effectiveness of the algorithms in terms of the raw number of SpMVs required. Having seen the merit of these algorithms, we then implemented and experimented with them in MFDn.

4.3.1 Test problems and computing platform
The test problems we use are the many-body Hamiltonian matrices associated with four different types of nuclei, Li6, Li7, B11 and C12, where the subscripts indicate the number of nucleons (protons and neutrons) in the nuclei. These Hamiltonian matrices are constructed in different configuration interaction model spaces labeled by the Nmax parameter. In Table 4.2, we list the dimension of each matrix as well as the number of nonzero matrix elements.

System   Nmax   Matrix dimension   #Non-zeros
Li6      6      197,822            106,738,802
Li7      6      663,527            421,938,629
B11      4      814,092            389,033,682
C12      4      1,118,926          555,151,572
Table 4.2 Test problems used in the numerical experiments.

Before solving eigenvalue problems for the Hamiltonians listed in Table 4.2, we first construct a good initial guess of the desired eigenvectors by computing the lowest few eigenvalues and the corresponding eigenvectors of smaller Hamiltonian matrices constructed from a lower dimensional configuration interaction model space labelled by smaller Nmax values. Table 4.3 shows the dimensions and numbers of nonzeros of these smaller Hamiltonians. The initial guesses to the desired eigenvectors of the Hamiltonian matrices listed in Table 4.2 are obtained by padding the eigenvectors of the smaller Hamiltonian matrices with zeros to match the dimension of the original problems to be solved. Since the dimensions of the problems listed in Table 4.3 are one to two orders of magnitude smaller than those of the corresponding problems listed in Table 4.2, they can be solved relatively easily and quickly by almost any method.

System   Nmax   Matrix size   #Non-zeros
Li6      4      17,040        4,122,448
Li7      4      48,917        14,664,723
B11      2      16,097        2,977,735
C12      2      17,725        3,365,099
Table 4.3 The dimensions and numbers of nonzeros of the Hamiltonian matrices for Li6, Li7, B11 and C12 constructed in a lower dimensional configuration model space labelled by a smaller Nmax value.

All algorithms presented in Section 4.1 have been implemented in MATLAB, which is ideal for prototyping new algorithms. We perform the numerical experiments presented below on a single AMD EPYC 7763 socket on a Perlmutter CPU node maintained at the National Energy Research Scientific Computing (NERSC) Center.
The EPYC socket contains 64 cores and we disabled hyperthreading so 64 OpenMP threads were used. For each test problem, we typically perform two sets of experiments for each algorithm. In the first set of experiments, we compute nev = 5 lowest eigenvalues and their corresponding eigenvectors. In the second set, we increase the number of eigenpairs to be computed to nev = 10. All calculations are performed in double precision arithmetic. 4.3.2 The performance of single method solvers In this section, we report and compare the performance of the Lanczos, block Lanczos, LOBPCG, CheFSI and RMM-DIIS methods when they are applied to the test problems listed in Table 4.2. For block methods such as the block Lanczos, LOBPCG and CheFSI methods, we set the block size, i.e., the number of vectors in the matrix Vj in (4.9), the matrix X (i) in (4.15), to nb = 8 when computing the nev = 5 lowest eigenpairs of H, or to nb = 16 when computing the nev = 10 lowest eigenpairs of H. SpMM is performed to multiply H with 8 or 16 vectors all at once. Even though RMM-DIIS is not a block method, nev SpMVs performed in the algorithm can be fused together as a single SpMM as we explain below. For block methods, we choose the starting guess for each method as the eigenvectors of the Hamiltonian constructed in a smaller configuration space (with a smaller Nmax value) listed in Table 4.3 padded with zeros to match dimensions of the Hamiltonian in the larger 64 configuration space (with a larger Nmax value) as mentioned earlier. This is also used in the RMM-DIIS method which only requires a starting guess for each of the desired eigenpairs. For the Lanczos algorithm, we take the initial guess v0 to be the linear combination of augmented eigenvectors associated with the lowest nev eigenvalues of the Hamiltonian constructed from the smaller configuration space, i.e., nev(cid:88) v0 = ( i=1 ˆzi)/nev, where ˆzi is the zero padded eigenvector associated with the ith eigenvalue of the Hamiltonian constructed from the smaller configuration space. All methods are terminated when the relative residual norms or estimated residual norm associated with all desired eigenpairs are below the threshold of τ = 10−6. A relative residual norm for an approximate eigenpair (θ, z) is defined to be ∥Hz − zθ∥/|θ|. For the Lanczos and block Lanczos methods, we use (4.7) and (4.13) to estimate the relative residual norm without performing additional Hamiltonian matrix and vector multi- plications. We use a 10th degree Chebyshev polynomial in the CheFSI method. The convergence of the algorithm depends on the choice of several parameters. The upper bound of the spectrum λub is determined by first running 10 Lanczos iterations and using Rayleigh Ritz approximation to the largest eigenpairs (θ10, u10) to set λub to θ10+∥r10∥, where r10 = Hu10− θ10u10. We set the parameter λF simply to 0 because the desired eigenvalues are bound states of the nucleus of interest and are expected to be negative. We apply the technique of deflation for converged eigenvectors, i.e., once the relative residual norm of an approximate eigenpair falls below the convergence tolerance of 10−6, we “lock” the approximate eigenvector in place and do not apply H to this vector in subsequent computations. These vectors will still participate in the Rayleigh-Ritz caculation performed in steps 11 and 12 of Algorithm 4.3 and be updated as part of the Rayleigh-Ritz procedure. 
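The estimate of the spectral upper bound λub described earlier in this subsection amounts to a short Lanczos run followed by a residual-norm safeguard. A minimal numpy sketch of that estimate is shown below; the function and variable names are illustrative, k = 10 steps are used as in the text, no reorthogonalization is performed, the starting vector is random, and H is any operator supporting the @ product.

    import numpy as np

    def estimate_upper_bound(H, n, k=10, seed=0):
        # Run k Lanczos steps and return theta_k + ||r_k|| as a bound on the
        # largest eigenvalue, following the strategy described in the text.
        rng = np.random.default_rng(seed)
        V = np.zeros((n, k + 1))
        alpha = np.zeros(k)
        beta = np.zeros(k)
        v0 = rng.standard_normal(n)
        V[:, 0] = v0 / np.linalg.norm(v0)
        for j in range(k):
            w = H @ V[:, j]
            if j > 0:
                w -= beta[j - 1] * V[:, j - 1]
            alpha[j] = V[:, j] @ w
            w -= alpha[j] * V[:, j]
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
        T = np.diag(alpha) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)
        theta, Y = np.linalg.eigh(T)
        u = V[:, :k] @ Y[:, -1]          # Ritz vector for the largest Ritz value
        r = H @ u - theta[-1] * u
        return theta[-1] + np.linalg.norm(r)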
In Tables 4.4 and 4.5, we compare the performance of Lanczos, block Lanczos, LOBPCG, CheFSI and RMM-DIIS in terms of the total number of SpMVs performed in each of these methods. It is clear from these tables that the Lanczos method uses the smallest number of SpMVs. However, the number of SpMVs used by the block Lanczos, LOBPCG and RMM-DIIS methods is within a factor of three of that of Lanczos when nev = 5 eigenpairs are computed. Because 8 SpMVs can be fused into a single SpMM, which is more efficient, in the block Lanczos and LOBPCG methods, the total wall clock time used by these methods can be less than that used by the Lanczos method. Five SpMVs can also be fused in the RMM-DIIS method, even though the algorithm targets each eigenvalue separately. Because different eigenvalues may converge at different rates, we may switch to using SpMVs once some of the eigenpairs converge; we discuss whether switching to SpMVs is beneficial later in this section.
Table 4.5 shows that the number of SpMVs used in the Lanczos algorithm increases only slightly when we compute nev = 10 eigenpairs, whereas the number of SpMVs required by the block Lanczos, LOBPCG, CheFSI and RMM-DIIS methods increases at a higher rate. The slow growth for Lanczos is mainly due to the fact that once a sufficiently large Krylov subspace has been constructed, approximations to additional eigenpairs can be obtained without enlarging the subspace much further.

System   Lanczos   Block Lanczos   LOBPCG   CheFSI   RMM-DIIS
Li6      95        208             184      480      271
Li7      109       280             240      960      291
B11      82        240             192      950      152
C12      106       248             192      890      181
Table 4.4 SpMV count for different algorithms in MATLAB to compute the five lowest eigenvalues.

System   Lanczos   Block Lanczos   LOBPCG   CheFSI   RMM-DIIS
Li6      114       464             464      690      686
Li7      192       512             464      1350     884
B11      180       480             432      2470     Convergence failure
C12      164       480             400      2300     499
Table 4.5 SpMV count for different algorithms in MATLAB to compute the ten lowest eigenvalues.

Figures 4.2, 4.3, 4.4, 4.5 and 4.6 show the convergence history of the Lanczos, block Lanczos, LOBPCG, CheFSI and RMM-DIIS methods. In these figures, we plot the relative residual norm of each approximate eigenpair against the iteration number. We can clearly see that accurate approximations to some of the eigenpairs start to emerge in the Lanczos method when the dimension of the Krylov subspace (i.e., the iteration number) is sufficiently large. Eigenpairs do not converge at the same rate. The smallest eigenvalue converges first, followed by the second, third, fourth and fifth eigenvalues. However, these eigenvalues do not necessarily have to converge in order. Although the relative residual for each eigenpair eventually drops below the convergence threshold of 10−6, the reduction of the relative residual norm is not monotonic with respect to the Lanczos iteration number. The relative residual can sometimes increase after the Krylov subspace becomes sufficiently large and new spectral information becomes available.

Figure 4.2 The convergence of the lowest five eigenvalues of the Li6 Hamiltonian in the Lanczos algorithm.
Figure 4.3 The convergence of the lowest five eigenvalues of the Li6 Hamiltonian in the block Lanczos algorithm.
Figure 4.4 The convergence of the lowest five eigenvalues of the Li6 Hamiltonian in the LOBPCG algorithm.

All approximate eigenpairs appear to converge at a similar rate in the block Lanczos and
LOBPCG methods. This is one of the advantages of a block method. The LOBPCG method performs slightly better in terms of the total number of SpMMs (or the number of iterations) used. This is due to the fact that LOBPCG makes use of an effective preconditioner.

Figure 4.5 The convergence of the lowest five eigenvalues of the Li6 Hamiltonian in the Chebyshev algorithm.
Figure 4.6 The convergence of the lowest five eigenvalues of the Li6 Hamiltonian in the RMM-DIIS algorithm.

CheFSI is also a block method. Figure 4.5 shows that only five subspace iterations are required to obtain converged λ2, λ4 and λ5; two more subspace iterations are required to obtain converged λ1 and λ3. The difference in the convergence rates for different eigenvalues is likely due to the variation in the contributions from different eigenvectors in the initial subspace constructed from the eigenvectors of the Hamiltonian associated with a smaller configuration space. Although the number of subspace iterations used in CheFSI is relatively small, each subspace iteration needs to perform n_b · m SpMVs, where n_b is the number of vectors in the initial subspace and m is the degree of the Chebyshev polynomial. When n_b = 8 and m = 10, a total of 480 SpMVs are used to compute the lowest five eigenpairs, as reported in Table 4.4. Note that this count is less than 8 × 10 × 7 = 560 because a deflation scheme that locks converged eigenvectors is used in CheFSI. When n_b = 16 and m = 10, a total of 690 SpMVs are used to compute the 10 lowest eigenpairs, as reported in Table 4.5. Because these SpMV counts are significantly higher than those of the other methods, CheFSI does not appear to be competitive for solving this type of eigenvalue problem. Therefore, from this point on, we will not discuss this method any further.
The convergence of the RMM-DIIS method is interesting. We observe from Figure 4.6 that the first three eigenvalues of the Li6 Hamiltonian converge relatively quickly. The number of RMM-DIIS iterations required to reach convergence is 31 for the first eigenvalue, 25 for the third eigenvalue and 23 for the second eigenvalue. Altogether, 79 SpMVs are used to obtain accurate approximations to the three smallest eigenvalues and their corresponding eigenvectors, which is almost the same as the 78 SpMVs used in the Lanczos algorithm to obtain these three eigenpairs. However, the fourth and fifth eigenvalues take much longer to converge. Furthermore, we also observe that the eigenvalues do not converge in order. In fact, the second smallest eigenvalue converges first, followed by the third and the first smallest eigenvalues. The RMM-DIIS procedure that starts with the initial guess to the fourth smallest eigenvalue ends up converging to the fifth eigenvalue, whereas the procedure that starts with the initial guess to the fifth eigenvalue converges to the fourth eigenvalue. However, it is important to note that RMM-DIIS iterations that start with different initial guesses all converge to different eigenpairs, even though no orthogonalization is performed between the approximate eigenvectors produced in different RMM-DIIS iterations.
Table 4.5 shows that more RMM-DIIS iterations are needed to obtain accurate approximations to the larger eigenvalues deeper inside the spectrum of the Li6 Hamiltonian, partly because the initial guesses to these interior eigenvalues are poor in this case.

4.3.3 Hybrid methods
As we discussed in Section 4.1, the RMM-DIIS algorithm can be very effective when a good initial guess to the target eigenvector is available. Figure 4.7 shows that RMM-DIIS converges rapidly when the initial guesses to the eigenvectors associated with the lowest 5 eigenvalues are chosen to be the approximate eigenvectors produced by 10 LOBPCG iterations. As a result, we can combine RMM-DIIS with either the block Lanczos or the LOBPCG algorithm to devise a hybrid algorithm that first runs a few block Lanczos or LOBPCG iterations and then uses the RMM-DIIS method to simultaneously refine the approximate eigenvectors returned by the block Lanczos or LOBPCG algorithm.
Although it is possible to combine the RMM-DIIS algorithm with the Lanczos algorithm, such a hybrid algorithm is less attractive, partly because the convergence rates of different eigenpairs in the Lanczos algorithm differ. As we can see from Figure 4.2, the leftmost eigenvalues typically converge much faster than the larger eigenvalues. As a result, the RMM-DIIS method can only be used effectively to refine the eigenvectors associated with the larger eigenvalues once they become sufficiently accurate, at which point the leftmost eigenvalues are likely to have converged already. Furthermore, because the Lanczos algorithm tends to be less efficient than the block Lanczos or LOBPCG algorithms when the multiplication of the sparse Hamiltonian H with several vectors can be implemented efficiently, we will not consider a hybrid Lanczos/RMM-DIIS method here.

Figure 4.7 The convergence of the lowest five eigenvalues of the Li6 Hamiltonian in the RMM-DIIS algorithm after 10 iterations of LOBPCG.

4.3.3.1 When to switch
A practical question we need to address in order to implement a hybrid block Lanczos/RMM-DIIS or LOBPCG/RMM-DIIS eigensolver is when to switch from one algorithm to the other. Running more block Lanczos or LOBPCG iterations produces more accurate approximations to the desired eigenvectors, which can then be refined quickly by the RMM-DIIS algorithm, but the cost of running block Lanczos or LOBPCG may then dominate the overall cost of the computation. On the other hand, running fewer block Lanczos or LOBPCG iterations may produce a set of approximate eigenvectors that require more RMM-DIIS iterations. There is clearly a trade-off.
Table 4.6 shows the total number of SpMVs required in a hybrid block Lanczos/RMM-DIIS algorithm in which 5, 10, 15 or 20 block Lanczos iterations are performed first and followed by the RMM-DIIS procedure. We observe that this hybrid scheme uses fewer total SpMVs for all test problems than a plain block Lanczos or RMM-DIIS run. A similar observation can be made from Table 4.7 when the 10 lowest eigenvalues are computed, except that for Li7 the hybrid algorithm that starts with 5 block Lanczos iterations uses more SpMVs than the non-hybrid block Lanczos algorithm. The optimal number of block Lanczos iterations to perform (in terms of the total SpMV count) before switching to the RMM-DIIS procedure is problem dependent.
Because good approximations to interior eigenvalues only emerge when the Krylov subspace produced by the block Lanczos iteration is sufficiently large, more block Lanczos iterations are required in the hybrid algorithm to yield an optimal SpMV count when 10 eigenvalues and the corresponding eigenvectors are needed, as we can see in Table 4.7.

System   5 block Lan   10 block Lan   15 block Lan   20 block Lan   all block Lan
Li6      183           163            173            185            208
Li7      205           218            234            221            280
B11      170           178            195            199            240
C12      179           180            179            201            248
Table 4.6 SpMV count in hybrid block Lanczos/RMM-DIIS for computing the five lowest eigenvalues.

System   5 block Lan   10 block Lan   15 block Lan   20 block Lan   all block Lan
Li6      422           361            347            367            464
Li7      544           422            418            402            512
B11      419           422            392            382            480
C12      430           415            369            369            480
Table 4.7 SpMV count in hybrid block Lanczos/RMM-DIIS for computing the 10 lowest eigenvalues.

A hybrid LOBPCG/RMM-DIIS algorithm also performs better when a sufficiently large number of LOBPCG iterations are performed first and followed by the RMM-DIIS procedure, as can be seen in Tables 4.8 and 4.9.

System   5 LOBPCG   10 LOBPCG   15 LOBPCG   20 LOBPCG   all LOBPCG
Li6      191        158         163         165         184
Li7      211        200         190         194         240
B11      260        162         159         165         192
C12      186        157         153         166         192
Table 4.8 SpMV count in hybrid LOBPCG/RMM-DIIS for computing the five lowest eigenvalues.

System   5 LOBPCG   10 LOBPCG   15 LOBPCG   20 LOBPCG   all LOBPCG
Li6      432        344         337         356         464
Li7      529        398         364         365         464
B11      758        397         334         355         423
C12      523        340         317         334         400
Table 4.9 SpMV count in hybrid LOBPCG/RMM-DIIS for computing the 10 lowest eigenvalues.

From Tables 4.6, 4.7, 4.8 and 4.9, we observe that the total SpMV count is best when the number of block Lanczos or LOBPCG iterations is sufficiently large, but not too large. However, the optimal number of block Lanczos/LOBPCG iterations is problem dependent. One practical way to determine when to switch from block Lanczos or LOBPCG to RMM-DIIS is to examine the change in the approximate eigenvalues. Because RMM-DIIS can be viewed as an eigenvector refinement method, it works well when an approximate eigenvalue has become sufficiently accurate while the corresponding approximate eigenvector still needs to be corrected. Since we do not know the exact eigenvalues in advance, we use the averaged relative difference between the approximate eigenvalues obtained in two consecutive iterations (e.g., the (k−1)st and the kth iterations), defined as

    τ = sqrt( (1/nev) Σ_{j=1}^{nev} [ (θ_j^(k) − θ_j^(k−1)) / θ_j^(k) ]² ),    (4.29)

as a metric to decide when to switch from block Lanczos or LOBPCG to RMM-DIIS.
Tables 4.10 and 4.11 show the optimal SpMV counts of the hybrid block Lanczos/RMM-DIIS and LOBPCG/RMM-DIIS algorithms when computing the five or 10 lowest eigenvalues, respectively. Unlike Tables 4.6, 4.7, 4.8 and 4.9, where we show the results only for 5, 10, 15 or 20 iterations, these optimal numbers are found by exhaustively trying every possible iteration count of block Lanczos and LOBPCG before switching to RMM-DIIS. Along with the optimal numbers, Tables 4.10 and 4.11 also show the SpMV counts of the hybrid algorithms when the block Lanczos and LOBPCG procedures are stopped once τ falls below 10−4 or 10−7. Table 4.11 shows that the SpMV counts of both the hybrid block Lanczos/RMM-DIIS and LOBPCG/RMM-DIIS algorithms appear to be nearly optimal when τ ≤ 10−7.
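The switching test based on (4.29) is inexpensive to evaluate. A minimal numpy sketch is shown below, where theta_prev and theta_curr hold the nev approximate eigenvalues from two consecutive block Lanczos or LOBPCG iterations (the names are illustrative):

    import numpy as np

    def ready_to_switch(theta_prev, theta_curr, threshold=1e-7):
        # Averaged relative change of the approximate eigenvalues, eq. (4.29).
        rel = (theta_curr - theta_prev) / theta_curr
        tau = np.sqrt(np.mean(rel ** 2))
        return tau < threshold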
However, when fewer eigenvalues are needed, we can possibly stop the block Lanczos procedure in the hybrid block Lanczos/RMM-DIIS algorithm already when τ drops below 10−4, as shown in Table 4.10. The τ ≤ 10−7 criterion still appears to be a good one for the hybrid LOBPCG/RMM-DIIS algorithm even when fewer eigenpairs are needed.

System   Opt Hyb Block Lan   τ ≤ 10−4   τ ≤ 10−7   Opt Hyb LOBPCG   τ ≤ 10−4   τ ≤ 10−7
Li6      157                 194        169        156              187        159
Li7      205                 211        224        190              210        190
B11      166                 176        195        158              188        159
C12      172                 180        179        152              169        153
Table 4.10 SpMV count for the hybrid algorithms with respect to the stopping threshold in MATLAB to compute the five lowest eigenvalues.

System   Opt Hyb Block Lan   τ ≤ 10−4   τ ≤ 10−7   Opt Hyb LOBPCG   τ ≤ 10−4   τ ≤ 10−7
Li6      340                 384        346        330              393        335
Li7      401                 515        418        363              478        363
B11      379                 525        384        332              625        333
C12      362                 408        368        312              437        317
Table 4.11 SpMV count for the hybrid algorithms with respect to the stopping threshold in MATLAB to compute the 10 lowest eigenvalues.

4.3.3.2 Optimal implementation
As we indicated earlier, the SpMV count may not be the best metric to evaluate the performance of a block eigensolver that performs SpMVs on a block of vectors as an SpMM operation. It has also been shown in [2, 51] that on many-core processors the LOBPCG algorithm outperforms the Lanczos algorithm even though its SpMV count is much higher. The reason a block algorithm can outperform a single-vector algorithm such as the Lanczos method on a many-core processor is that SpMM can take advantage of high concurrency and memory locality. The performance of SpMV and SpMM can be measured in terms of the number of floating point operations performed per second (GFLOPS). Tables 4.12 and 4.13 show that the SpMM GFLOPS measured in the LOBPCG algorithm is much higher than the SpMV GFLOPS measured in the Lanczos algorithm.
As a result, we can evaluate the performance of a block eigensolver by dividing the actual SpMV count by the ratio between the SpMM and SpMV GFLOPS to obtain an "effective" SpMV count. For example, because the SpMM to SpMV GFLOPS ratio is 24.2/4.1 ≈ 5.9 for Li6, the effective SpMV count for performing 23 iterations of the LOBPCG algorithm with a block size of 8 is 23 × 8/5.9 ≈ 31, which is much lower than the 95 SpMVs performed in the Lanczos method, even though the actual number of SpMVs performed in the LOBPCG iteration is 23 × 8 = 184 > 95.

System   SpMV GFLOPS   SpMM GFLOPS   Ratio
Li6      4.1           24.2          5.9
Li7      6.7           36.0          5.4
B11      6.4           33.5          5.2
C12      6.0           28.0          4.7
Table 4.12 GFLOPS achieved by the SpMV/SpMM kernels within MFDn on one AMD EPYC node with 64 threads. The sparse Hamiltonian is applied to 8 vectors in SpMM.

System   SpMV GFLOPS   SpMM GFLOPS   Ratio
Li6      4.1           38.9          9.5
Li7      6.7           56.0          8.4
B11      6.4           50.3          7.9
C12      6.0           47.5          7.9
Table 4.13 GFLOPS achieved by the SpMV/SpMM kernels within MFDn on one AMD EPYC node with 64 threads. The sparse Hamiltonian is applied to 16 vectors in SpMM.

Strictly speaking, RMM-DIIS is a single-vector method, i.e., in each RMM-DIIS run we refine one specific eigenvector associated with a target eigenvalue that has become sufficiently accurate, as discussed earlier. However, because the refinements of different eigenvectors can be performed independently of each other, we can perform several RMM-DIIS refinements simultaneously. The simultaneous RMM-DIIS runs can be implemented by batching the SpMVs in each RMM-DIIS iteration together as a single SpMM. This step constitutes the major cost of the RMM-DIIS method. The least squares problems given in Eqs.
4.26- 4.28 for different eigenvectors can be solved in sequence in each step, since they do not cost much computation. Because different eigenvectors may converge at different rates as we have already seen in Figure 4.6, we need to decide what to do when one or a few eigenvectors have converged. One possibility is to decouple the batched RMM-DIIS method after a certain number of eigenvectors have converged, and switch from using a single SpMM in a coupled RMM- DIIS implementation to using several SpMVs in a decoupled RMM-DIIS implementation. Another possibility is to just keep using the coupled RMM-DIIS with a single SpMM in each step without updating the eigenvectors that have already converged. In this case, the SpMM calculation performs more floating point operations than necessary. However, because an SpMM can be carried out at a much higher GFLOPs than an SpMV, the overall performance of the computation may not be degraded even with the extra computation. In Figure 4.8, we show the effective number of SpMVs performed in several hybrid block Lanczos and RMM-DIIS runs for the Li6 testcase. The horizontal axis represents the number of block Lanczos iterations performed before we switch to RMM-DIIS. The blue dots show the number of effective SpMVs used in the hybrid method when we switch from SpMM to SpMV after one eigenvector has converged. Similarly, the red, orange and magenta dots show the effective SpMV counts when we switch from SpMM to SpMV after two, three and four eigenvectors have converged respectively. The green dots show the effective number of SpMVs when we always use SpMM regardless of how many eigenvectors have converged. As we can see from this figure, the number of effective SpMVs is relatively high when we switch to RMM-DIIS after a few block Lanczos iterations. This is because it will take RMM-DIIS longer to converge if the initial eigenvector approximations produced from the block Lanczos iterations are not sufficiently accurate. Regardless of when we switch from block Lanczos to RMM-DIIS, using SpMM throughout the RMM-DIIS algorithm appears to almost always yield the lowest effective SpMV count. We can also see that the difference in the effective SpMV count is relatively large when we switch from block Lanczos to RMM- 77 DIIS too early. This is understandable because some of the approximate eigenvectors are more accurate than others when we terminate the block Lanczos iteration too early. As a result, the number of RMM-DIIS steps required to reach convergence may vary significantly from one eigenvector to another. The difference in the rate of convergence prevents us from batching several SpMVs into a single SPMM. If we switch after more block Lanczos iterations have been performed, this difference becomes quite small as all approximate eigenvectors are sufficiently accurate and converge more or less at the same rate. The relative difference (τ ) between eigenvalue approximations from two consecutive RMM-DIIS iterations falls below 10−7 at the 13th block Lanczos iteration. If we switch to RMM-DIIS at that point and use SpMM throughout the RMM-DIIS iteration, the number of effective SpMVs used in the hybrid scheme is about 40, which is slightly higher than the optimal 32 effective SpMVs required if we were to switch to RMM-DIIS after 23 block Lanczos iterations. Figure 4.8 Total effective SpMV cost of the block Lanczos/RMM-DIIS algorithm to compute the lowest five eigenvalues for Li6 w.r.t. switching strategy from SpMM to SpMV within RMM-DIIS. 
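As discussed above, the nev independent RMM-DIIS refinements can share a single SpMM per iteration, with an optional fall-back to per-vector SpMVs once some eigenpairs have converged (the strategies compared in Figure 4.8). A minimal numpy sketch of this batching step is given below; H is the sparse Hamiltonian, X holds the current approximate eigenvectors as columns, converged is a boolean mask, and the names are illustrative. Keeping all columns in the product corresponds to the "SpMM only" strategy.

    import numpy as np

    def batched_matrix_apply(H, X, converged, spmm_only=True):
        # One Hamiltonian application for all simultaneous RMM-DIIS refinements.
        if spmm_only or not converged.any():
            return H @ X                      # a single SpMM over all nev columns
        # Decoupled variant: one SpMV per still-active column only.
        HX = np.zeros_like(X)
        for j in np.flatnonzero(~converged):
            HX[:, j] = H @ X[:, j]
        return HX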
An early switch allows us to keep the basis orthogonalization cost of the block Lanczos algorithm as low as possible. As we can see from Figure 4.9, the cost of orthogonalization as a percentage of the SpMM cost becomes significantly higher as we perform more block Lanczos iterations. In particular, for all test problems, the orthogonalization cost exceeds 50% of the SpMM cost after 20 block Lanczos iterations. We note that the performance shown in Figure 4.9 is measured from the wall clock time of an implementation of the block Lanczos algorithm in the MFDn software executed on a single node of Perlmutter using 64 threads.

Figure 4.9 The percentage of cumulative time spent on orthogonalization compared to SpMM in the block Lanczos algorithm for all test problems.

Figure 4.10 shows that it is beneficial to use SpMM throughout the hybrid LOBPCG/RMM-DIIS algorithm even after some of the eigenvectors have converged. When a sufficient number of LOBPCG iterations have been performed, the total SpMV count does not vary much. In this case, we should terminate LOBPCG as early as possible to avoid potential numerical instabilities that can be introduced by the numerical rank deficiency of the preconditioned residual vectors [19]. Figure 4.11 shows that the estimated condition number of the subspace from which approximate eigenvalues and eigenvectors are extracted in the LOBPCG algorithm increases rapidly as we perform more LOBPCG iterations. Although the optimal SpMV count is attained when we switch to RMM-DIIS after 20 LOBPCG iterations, it is not unreasonable to terminate LOBPCG sooner, for example after 12 iterations, when the estimated condition number of the LOBPCG subspace is around 10^9. This is also the point at which the average relative change in the approximations to the desired eigenvalues, τ, just moves below 10−7. Therefore, the previously discussed strategy of using τ < 10−7 to decide when to switch to RMM-DIIS works well. Even though this strategy leads to a slight increase in the number of effective SpMV operations compared with the optimal effective SpMV count achieved when we switch to RMM-DIIS after 20 iterations, it makes the hybrid algorithm more robust and stable. We should also note that in a practical calculation the optimal effective SpMV count, and the switching point at which it is achieved, are unknown a priori; this optimum is both problem dependent and architecture dependent.

Figure 4.10 Total effective SpMV cost of the LOBPCG/RMM-DIIS algorithm to compute the lowest five eigenvalues for Li6 w.r.t. the switching strategy from SpMM to SpMV within RMM-DIIS.
Figure 4.11 The condition number of the LOBPCG subspace from which approximate eigenpairs are extracted to compute the lowest 5 eigenvalues of the Li6 Hamiltonian.
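The condition number reported in Figure 4.11 can be monitored cheaply from the Gram matrix of the current LOBPCG basis. The exact estimator used is not spelled out here, so the following numpy sketch is only one plausible way to compute such an estimate, with S holding the blocks of the LOBPCG subspace (approximate eigenvectors, preconditioned residuals and search directions) as columns:

    import numpy as np

    def lobpcg_subspace_condition(S):
        # cond(S) estimated via the eigenvalues of the Gram matrix S^T S:
        # cond(S) = sqrt(lambda_max / lambda_min).
        ev = np.linalg.eigvalsh(S.T @ S)
        return np.sqrt(ev[-1] / max(ev[0], np.finfo(S.dtype).tiny))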
4.4 Conclusion of This Work
In this work, we examine and compare a few iterative methods for solving large-scale eigenvalue problems arising from nuclear structure calculations. We observe that the block Lanczos and LOBPCG methods are generally more efficient than the standard Lanczos method and the stand-alone RMM-DIIS method in terms of the effective number of SpMVs. The Chebyshev filtering based subspace iteration method is not competitive with the other methods, even though it requires the least amount of memory and has been found to be very efficient for other applications. We show that by combining the block Lanczos or LOBPCG algorithm with the RMM-DIIS algorithm, we obtain a hybrid solver that can outperform existing solvers. The hybrid LOBPCG/RMM-DIIS method is generally more efficient than block Lanczos/RMM-DIIS when a good preconditioner is available. The use of RMM-DIIS in the block Lanczos/RMM-DIIS hybrid algorithm allows us to limit the orthogonalization cost of the block Lanczos iterations. In the LOBPCG/RMM-DIIS hybrid algorithm, the use of RMM-DIIS allows us to avoid the numerical instability that may arise in LOBPCG when the residuals of the approximate eigenpairs become small. We discuss the practical issue of deciding when to switch from block Lanczos or LOBPCG to RMM-DIIS; a strategy based on monitoring the averaged relative changes in the desired approximate eigenvalues has been found to work well. Although the RMM-DIIS method is a single-vector refinement scheme, we show that the SpMVs in multiple independent RMM-DIIS iterations targeting different eigenpairs can be batched together and implemented as a single SpMM. Such a batching scheme significantly improves the performance of the hybrid solver and is found to be useful even after some of the approximate eigenpairs have converged.

CHAPTER 5
SCALING EIGENSOLVERS TO HUNDREDS OF GPUS

In previous chapters, we focused on accelerating our sparse eigensolvers through asynchronous execution models and hybrid algorithms. Now, we turn our attention to accelerating them on heterogeneous architectures through an efficient use of GPU accelerators and inter-GPU communication calls. This efficient use allows us to scale Lanczos and LOBPCG to hundreds of GPUs. For this work, we focus in particular on the implementation of these eigensolvers within the distributed memory version of Many-body Fermion Dynamics for nuclei (MFDn).
The direct solution of the quantum many-body problem transcends several areas of physics and chemistry. MFDn is a state-of-the-art Configuration Interaction (CI) code that aims to solve for the structure of light nuclei by direct diagonalization of the nuclear many-body Hamiltonian in a harmonic oscillator single-particle basis [54]. In MFDn, the nuclear Hamiltonian is evaluated in a large harmonic oscillator single-particle basis and diagonalized by iterative techniques to obtain the low-lying eigenvalues and eigenvectors.
One key feature of a CI calculation is the large dimension of the Hamiltonian matrix it can produce. The dimension of the matrix characterizes the size of the many-body basis used to represent the nuclear many-body Hamiltonian. This problem size depends strongly on the Nmax value, which is the maximum number of oscillator quanta above the minimum for a given nucleus. Higher Nmax values yield more precise results, but at the expense of an exponential growth in problem size.
The dimension of the matrices can easily exceed a billion, and the number of nonzeros a trillion, with a higher number of Nmax and with more realistic interactions [2]. The Lanczos and LOBPCG algorithms have been implemented in MFDn to obtain the low-lying eigenpairs. Since the growing problem size makes it infeasible for the Hamiltonian to be stored on a single node, these algorithms have been implemented for distributed mem- ory architectures. To be precise, the implementation of Lanczos and LOBPCG uses a hybrid 83 MPI/OpenMP parallelization scheme, and has been optimized [14] to achieve scalable per- formance for distributed memory many-core CPU systems such as the Cori KNL system, who retired in 2022 after 13 years at the National Energy Research Scientific Computing (NERSC) Center. The increased availability of high performance computing platforms equipped with gen- eral purpose GPUs has motivated the MFDn developers to consider modifying the many-core CPU implementation, which is written in Fortran 90, to enable it to run efficiently on ac- celerator based systems. Rewriting the code using, e.g., CUDA Fortran [47] or OpenCL programming models [41] would take a substantial amount of work. Therefore, the develop- ers decided to use the OpenACC directive based programming model in combination with CUDA-aware MPI [60, 23]. They show how to incorporate OpenACC for sparse eigensolvers in [40]. In their work, they also present numerical examples that demonstrate the performance improvement achieved by using OpenACC directives. They obtained a significant speedup with their MPI/OpenACC approach over their previous hybrid MPI/OpenMP scheme. Al- though their work showed promising initial results in regards to utilizing GPUs for eigen- solvers, further analysis indicated that there was room for improvement on large-scale to capitalize GPU resources to a better extent. With this work, we aim to show the limitations of this initial GPU-parallel version that uses MPI/OpenACC scheme. We then propose numerous schemes to improve the perfor- mance of Lanczos and LOBPCG on large-scale multi-node multi-GPU experiments. For these experiments, we use Perlmutter nodes equipped with NVIDIA A100 GPUs at NERSC. We note that we apply the proposed improvements to SpMV and SpMM, which are by far the most time-consuming portions of these algorithms. 
The contribution of this work can be listed as follows: • demonstrating a factor of two speedup in the expensive SpMV kernel of Lanczos by switching from OpenACC to CUDA, 84 • benchmarking the cost of all four MPI collectives used in the distributed SpMV algo- rithm: Allgatherv, Bcast, Reduce and Reduce scatter, • showing M P I Reduce and M P I Reduce scatter to be the performance bottleneck of Lanczos on multi-GPU experiments due to local reductions needed by these two calls being performed on the host side, • proposing a Hybrid/CUDA scheme by employing non-blocking MPI point-to-point (P2P) communication for Reduce and Reduce scatter, and MPI collectives for All- gatherv and Bcast to overcome the bottleneck, • proposing a NCCL/CUDA scheme by replacing all four MPI calls with NVIDIA Col- lective Communications Library (NCCL) collectives to overcome the same bottleneck, • applying optimizations on the NCCL/CUDA scheme to truly overlap communication with computation and to utilize the memory bandwidth to a better extend, • evaluating the performance of Hybrid/CUDA and optimized NCCL/CUDA schemes for large matrices with up to a trillion nonzeros on large-scale experiments using up to 1128 NVIDIA A100 GPUs, • achieving significant performance improvements in overall Lanczos solver time with Hybrid/CUDA (up to 2.9×) and NCCL/CUDA (up to 4.9×) schemes over MPI/Ope- nACC, • analyzing the strong-scaling efficiency of the default and proposed schemes to highlight the impact of increasing communication volume on the overall performance, • showing the performance of the proposed schemes on LOBPCG and investigating the lack of speedup observed. This chapter is organized as follows. In Section 5.1, we explain in detail the existing distributed GPU-parallel SpMV algorithm used in MFDn. In Section 5.2, we describe our 85 proposed schemes. Next, in Section 5.3, we share the performance results of the default as well as the proposed schemes on Lanczos. We then give the performance of these schemes on LOBPCG in Section 5.4. We conclude our work by giving final remarks in Section 5.5. 5.1 MFDn’s Proprietary Distributed SpMV Algorithm SpMV is by far the most time-consuming part of the Lanczos algorithm. An efficient implementation of this step is crucial in order to obtain a high-performance Lanczos code. This is particularly true on distributed-memory where the communication part can easily become bottleneck and hinder performance if not addressed properly. In this section, we first describe the custom data distribution of the sparse matrix, referred to as H, as well as the input and output vectors, referred to as W and U . We then outline the steps of the proprietary SpMV algorithm employed in MFDn to accommodate this custom data distribution. 5.1.1 Data Distribution Since the size of the sparse matrix H in a nuclear CI calculation, both in terms of dimension and number of nonzeros, can be extremely large, H is partitioned and distributed across multiple processes [54]. Additionally, in MFDn, only half of the symmetric matrix H is stored, utilizing a specialized data distribution scheme, as described below. We first divide the rows and columns of H into nd × nd matrices with 2D partitioning. Since the matrix is symmetric, we then only map nd × (nd + 1)/2 of these submatrices to n × (nd + 1)/2 processes where each process is assigned to a submatrix. When choosing nd × (nd + 1)/2 out of nd × nd submatrices, ensuring that we select exactly (nd + 1)/2 submatrices on each row and column group is vital. 
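One balanced selection rule that reproduces the example shown later in Figure 5.1 is to let block row i keep the (nd + 1)/2 columns i, i−1, ..., wrapped modulo nd. The sketch below illustrates this rule; it is consistent with the figure, but the exact rule used inside MFDn is not spelled out here, so treat it as an illustration.

    def selected_blocks(nd):
        # For a symmetric matrix split into nd x nd blocks (nd odd), return the
        # nd*(nd+1)//2 stored (row, col) block indices such that every block row
        # and every block column appears exactly (nd+1)//2 times.
        keep = []
        for i in range(1, nd + 1):
            for k in range((nd + 1) // 2):
                j = (i - 1 - k) % nd + 1      # columns i, i-1, ... wrapped mod nd
                keep.append((i, j))
        return keep

    # selected_blocks(5) reproduces the pattern of Figure 5.1: block row 1 keeps
    # H(1,1), H(1,5), H(1,4), and every block column also appears exactly 3 times.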
Balance matters because the processes are later organized into (sub)communicators based on the row and column indices of their respective submatrices, and well-balanced row and column (sub)communicators are key to achieving good performance for the distributed SpMV algorithm described below. In Figure 5.1, we show an example of this partition and selection process. Here, H is partitioned into 5 × 5 submatrices, and the selected submatrices, each of which is assigned to its own process, are highlighted. Notice that each row and each column holds 3 submatrices. This means there are 5 row communicators and 5 column communicators, each containing 3 MPI processes.

Figure 5.1 The partition of the symmetric sparse matrix H into 5 × 5 submatrices. We store 15 of them while achieving load balance.
Figure 5.2 The partition of the input vector W into 5 sub-vectors on each column. We partition each sub-vector further into 3 segments.

We also partition the input vector W and the output vector U among the nd × (nd + 1)/2 processes in two stages. First, we partition W and U into nd sub-vectors and distribute them among the column groups. Then, we further partition each sub-vector into (nd + 1)/2 segments and distribute them among the processes within each column group. In this way, each process is assigned 1/(nd × (nd + 1)/2) of W and U, which provides load balance. We show the partition of W in Figure 5.2 for the same example given in Figure 5.1.

5.1.2 Distributed SpMV
A specialized SpMV procedure has been developed to fit this unique data distribution scheme when multiplying H with W. Here, we refer to the jth block of sub-vectors of W as Wj. In MFDn, the distributed-memory parallel multiplication of H and W is performed as follows (a simplified code sketch is given after the list):

1. We first gather the distributed segments of the sub-vector Wj onto each process within the jth column communicator by using an MPI_Allgatherv call. This step is shown in Figure 5.3.

Figure 5.3 First step of the SpMV algorithm in MFDn where we gather the distributed input segments onto each process.

2. The ith diagonal process then broadcasts the gathered sub-vector Wj, where i = j for the diagonal processes, across the ith row communicator. This is done in preparation for the distributed SpMV-transpose computations of the next step. We overlap this MPI_Bcast call with the local SpMV, Ui = Hi,j Wj, which uses the input sub-vector gathered in the previous step. This step is shown in Figure 5.4.

Figure 5.4 Second step of the proprietary SpMV algorithm employed in MFDn where we overlap the broadcast of the gathered input segments with the local SpMV.

3. By the end of the previous step, each process within the ith row communicator holds a partial result of the output sub-vector Ui. Now, we reduce the output sub-vectors Ui along each row communicator onto the ith diagonal process. We overlap this MPI_Reduce call with the local SpMV-transpose, Uj = Hi,j^T Wi, which uses the input sub-vector broadcast in the previous step. This step is shown in Figure 5.5.

Figure 5.5 Third step of the proprietary SpMV algorithm employed in MFDn where we overlap the reduce of the partial SpMV outputs with the local SpMV-transpose.

4. After the local SpMV-transpose, on the diagonal processes, we add the reduced output sub-vector Ui to the local output sub-vector Uj. Finally, we reduce and scatter the sub-vectors Uj into (nd + 1)/2 segments within the jth column communicator by using MPI_Reduce_scatter. At the end of this step, each process holds the corresponding segment of the output vector as intended, which concludes the SpMV algorithm. This step is shown in Figure 5.6.

Figure 5.6 Fourth and last step of the proprietary SpMV algorithm employed in MFDn where we reduce and scatter the partial SpMV-transpose outputs after adding the reduced SpMV output on the diagonal processes.
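The following sketch shows the communication skeleton of these four steps using mpi4py, with a dense numpy block H_ij standing in for the local sparse submatrix and with all sub-vectors assumed to have equal length for simplicity. In MFDn itself the code is Fortran with CUDA-aware MPI, the local products run on the GPU, and the Bcast/Reduce calls are overlapped with them rather than executed sequentially as here; the argument names (counts, displs, diag_root, is_diag) are illustrative.

    import numpy as np
    from mpi4py import MPI

    def distributed_spmv(H_ij, w_seg, row_comm, col_comm, diag_root, is_diag,
                         counts, displs):
        # Step 1: assemble the full sub-vector W_j within the column group.
        w_j = np.empty(counts.sum())
        col_comm.Allgatherv(w_seg, [w_j, counts, displs, MPI.DOUBLE])

        # Step 2: the diagonal process broadcasts its gathered W_i across the
        # row group; the local SpMV U_i = H_ij * W_j uses the gathered W_j.
        w_i = w_j.copy() if is_diag else np.empty_like(w_j)
        row_comm.Bcast(w_i, root=diag_root)
        u_i = H_ij @ w_j

        # Step 3: local transpose product U_j = H_ij^T * W_i, then reduce the
        # partial U_i contributions onto the diagonal process of the row group.
        u_j = H_ij.T @ w_i
        u_i_sum = np.empty_like(u_i) if is_diag else None
        row_comm.Reduce(u_i, u_i_sum, op=MPI.SUM, root=diag_root)

        # Step 4: fold the reduced U_i into U_j on the diagonal, then
        # reduce-scatter U_j back into per-process segments in the column group.
        if is_diag:
            u_j += u_i_sum
        u_seg = np.empty(counts[col_comm.Get_rank()])
        col_comm.Reduce_scatter(u_j, u_seg, recvcounts=counts, op=MPI.SUM)
        return u_seg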
5.1.3 Task-to-Processor Mapping
At large scale, communication overhead can limit the scalability of our eigensolvers, so investigating methods to reduce this overhead is worthwhile. Topology-aware mapping of computational tasks to physical processors is one such method, since a good mapping on large-scale clusters can significantly improve efficiency. When mapping the nd × (nd + 1)/2 submatrices of H and the nd × (nd + 1)/2 segments of W and U to nd × (nd + 1)/2 MPI processes, we use a column-major mapping scheme developed in [3]. Figure 5.7 shows this column-major mapping for the same example where nd = 5.

Figure 5.7 Using a column-major order for process ordering. Tasks mapped to the same column (row) of the grid belong to the same column (row) communication groups.

The reasoning behind the column-major order is as follows. The cost of MPI_Bcast and MPI_Reduce can be partially or entirely hidden by overlapping them with the local SpMV and SpMV-transpose, and both of these collectives are performed along the row communicators. However, there is no compute task to overlap
However, [40] uses the same heuristic for the GPU-parallel MPI/OpenACC scheme as well due to the same underlying principles. As such, we will continue using this heuristic in our work as well. 5.2 Suggested Improvements and Proposed Schemes MFDn uses the MPI/OpenACC scheme to implement the distributed SpMV algorithm outlined here. That is, it currently uses OpenACC as the GPU programming model to per- form local SpMV and SpMV-transpose, and MPI collectives to perform Allgatherv, Bcast, Reduce and Reduce scatter calls. Our objective with this work is to improve the Lanczos performance by optimizing the distributed SpMV algorithm. We limit our improvements to SpMV mainly because we spend 75−90% of the solver time on this algorithm as we will show later. Also, the orthogonalization kernel, where we spend the rest of the solver time, does not offer any room for improvement since it consists of load-balanced dense linear algebra operations. Before defining our proposed schemes, we first list the improvements we suggest. 5.2.1 Improving Local SpMV with CUDA OpenACC was the first choice to port the distributed solver code to GPUs within MFDn. Although OpenACC is easy to program with, this ease of programming comes at the expense 92 of efficiency. That is, one can easily implement a GPU-parallel code using OpenACC di- rectives but this does not usually produce the most optimal performance. Using a low-level GPU programming language such as CUDA can improve the GPU kernel efficiency. This is mostly because we have more control over how the parallel loops, for example, are mapped to the software or hardware resources at the GPU stack. Also, the main advantage of OpenACC is the portability. One can easily compile the code for different architectures to have a production-ready version quickly. Offering portability intrinsically means losing performance to a certain degree. CUDA is exclusively used for NVIDIA GPUs, meaning it offers no portability. This lack of portability allows the CUDA programs to exploit the exclusive features offered for NVIDIA GPUs. Since we were running our experiments on NVIDIA GPUs at Perlmutter, we propose to improve the compute aspect of our proprietary SpMV algorithm by switching from OpenACC to CUDA in order to gain performance. 5.2.2 Improving Communication with P2P As we will show later, M P I Reduce and M P I Reduce scatter calls inside the SpMV algorithm become the bottleneck of our Lanczos solver on large-scale. This is due to the local arithmetic needed by these two calls being performed on the host side. To overcome this problem, we propose using non-blocking point-to-point (P2P) commu- nication. All MPI collectives can be implemented with simple M P I Send and M P I Recv calls between two processes. MPI also provides a non-blocking send and receive mecha- nism with M P I Isend and M P I Irecv. As such, we propose implementing these two MPI collectives using M P I Isend on the send side with M P I Recv on the receive side. This mechanism adds asynchrony since it allows the send side to start the P2P call as early as possible and continue the work. It also allows the receive side to stall the blocking P2P call as much as possible while doing useful work. Moreover, the expensive additions performed by MPI with Reduce and Reduce scatter collectives can be done locally on the GPU side when these collectives are converted to P2P calls. 93 There are two ways to implement the local reductions on the receive side. 
First, we can allocate a receive buffer large enough to store all the incoming data and perform the reduction once, after all MPI_Recv calls are complete. Second, we can allocate a receive buffer only large enough to store the incoming data from a single process and perform a reduction after each MPI_Recv. We choose to implement the second option due to its space efficiency. We also couple this implementation with MPI_ANY_SOURCE on the receive side so that we do not impose any order on the incoming data, which may help with time efficiency.

5.2.3 Improving Communication with NCCL
The NVIDIA Collective Communications Library (NCCL, pronounced "Nickel") is a library offering topology-aware inter-GPU communication primitives that can be seamlessly integrated into applications [15]. NCCL provides both point-to-point send/receive primitives and collective communication. While it is not a comprehensive parallel programming framework, it is a library specifically designed to accelerate inter-GPU communication. NCCL has been widely adopted in deep learning frameworks, where the AllReduce collective is extensively used for training neural networks. The multi-GPU and multi-node communication capabilities provided by NCCL enable efficient scaling of neural network training.
Regarding the use of NCCL within MFDn, the collective communication primitives supported by NCCL include all four collectives needed by our SpMV algorithm, with minor differences and/or adjustments:

• There is no Allgatherv collective provided by NCCL. Instead, NCCL provides ncclAllGather, where the aggregation is performed on equal-sized vectors. This is only a minor issue because the scattered input vectors that are accumulated by MPI_Allgatherv within MFDn have similar sizes; in fact, their sizes vary by less than 1%. As such, padding the participating vectors with zeros to accommodate the lack of an Allgatherv in NCCL introduces only a small overhead, both in memory and in compute time.

• Although NCCL provides a collective that implements the same functionality as MPI's Reduce_scatter, namely ncclReduceScatter, this collective expects equal-sized vectors from all participating processes, as in ncclAllGather. Likewise, this issue is overcome with only minor overhead, for the same reason.

5.2.4 Proposed Schemes
Based on these potential improvements, we propose three new schemes.
MPI/CUDA: This scheme uses the MPI collectives as is, but replaces the OpenACC kernels for the local SpMV and SpMV-transpose with corresponding CUDA kernels.
Hybrid/CUDA: This scheme uses CUDA as well. It also replaces the MPI_Reduce and MPI_Reduce_scatter collectives with non-blocking MPI P2P calls and local GPU reduction kernels.
NCCL/CUDA: This scheme uses CUDA and replaces all MPI collectives with NCCL collectives.
We summarize the proposed schemes in Table 5.1 along with the default MPI/OpenACC scheme.

                  MPI/OpenACC   MPI/CUDA   Hybrid/CUDA   NCCL/CUDA
Local SpMV        OpenACC       CUDA       CUDA          CUDA
Local SpMVT       OpenACC       CUDA       CUDA          CUDA
Allgatherv        MPI Coll      MPI Coll   MPI Coll      NCCL Coll
Bcast             MPI Coll      MPI Coll   MPI Coll      NCCL Coll
Reduce            MPI Coll      MPI Coll   MPI P2P       NCCL Coll
Reduce_scatter    MPI Coll      MPI Coll   MPI P2P       NCCL Coll
Table 5.1 Summary of the default MPI/OpenACC scheme as well as the proposed MPI/CUDA, Hybrid/CUDA and NCCL/CUDA schemes.

5.3 Lanczos Experiments
In this section, we share the experimental results of all four schemes shown in Table 5.1 for the Lanczos eigensolver.
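To make the Hybrid/CUDA idea concrete, the sketch below shows how a row-group Reduce can be replaced by non-blocking sends and a receive-side accumulation loop with MPI.ANY_SOURCE, as described in Section 5.2.2. It uses mpi4py with host numpy arrays for brevity; in the actual scheme the buffers live in GPU memory and each accumulation is a small CUDA kernel, and the function and argument names are illustrative.

    import numpy as np
    from mpi4py import MPI

    def reduce_via_p2p(u_partial, row_comm, diag_root, tag=7):
        # Replaces MPI_Reduce along the row group (Hybrid/CUDA scheme).
        if row_comm.Get_rank() != diag_root:
            # Non-root ranks ship their partial result and keep working;
            # the caller should Wait() on the returned request later.
            return row_comm.Isend(u_partial, dest=diag_root, tag=tag), None
        # Root (diagonal process): accumulate contributions in arrival order,
        # using a receive buffer sized for a single message.
        total = u_partial.copy()
        buf = np.empty_like(u_partial)
        for _ in range(row_comm.Get_size() - 1):
            row_comm.Recv(buf, source=MPI.ANY_SOURCE, tag=tag)
            total += buf          # local reduction (a GPU kernel in MFDn)
        return None, total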
5.3.1 Experimental Setup
We report the performance of Lanczos in MFDn when it is run for 100 iterations on the Perlmutter GPU system at the National Energy Research Scientific Computing (NERSC) Center. Each Perlmutter GPU compute node consists of a single AMD EPYC 7763 processor with 64 cores and 4 NVIDIA A100 GPUs. We give the detailed hardware and system specifications of the Perlmutter GPU system in Table 5.2, report the versions of the relevant software used in this work in Table 5.3, and show the node architecture of the Perlmutter GPU nodes in Figure 5.8.

CPUs per node          1 x AMD EPYC 7763
GPUs per node          4 x NVIDIA A100
NICs per node          4 x HPE Slingshot 11 NICs
GPU-GPU interconnect   NVLink 3.0
CPU-GPU interconnect   PCIe 4.0
CPU-NIC interconnect   PCIe 4.0
Table 5.2 System specifications of Perlmutter GPU nodes.

NVIDIA HPC SDK   23.9
CUDA             12.2
MPI              Cray-MPICH (8.1.28)
NCCL             2.18.3
Table 5.3 Software versions used in the performance experiments on Perlmutter.

Figure 5.8 Node architecture of a Perlmutter GPU node.

We use four test problems that correspond to realistic Hamiltonian matrices of different nuclei (with up to 12 nucleons) represented in different configuration interaction spaces. The dimensions of these matrices as well as the number of nonzero matrix elements (nnz) in half of each of these symmetric matrices are listed in Table 5.4. The dimensions of these test problems range from 1.2 × 10^7 to 9.8 × 10^8, whereas their nnz varies between 5.4 × 10^9 and 1.3 × 10^12.
The smallest problem, referred to as XSmall, fits within one NVIDIA A100 80GB GPU on a single node. As the problem size becomes larger and the memory footprint of the Lanczos algorithm exceeds the available high-bandwidth memory (HBM2e), we need to distribute the Hamiltonian matrix and the input/output vectors across multiple GPUs. As such, the Lanczos computations for the Small, Medium and Large test cases have to be run on at least 28, 91 and 276 A100 80GB GPUs, respectively.

Test case   System   Nmax   Matrix dimension (×10^6)   # of nonzero elements (nnz) (×10^9)   # of A100 80GB GPUs needed
XSmall      10B      6      12.0                        5.47                                  1
Small       10B      8      165                         129                                   28
Medium      12C      8      575                         475                                   91
Large       9Li      11     978                         1305                                  276
Table 5.4 The dimensions of the sparse matrices used in the performance tests and the number of nonzero matrix elements in each matrix.

5.3.2 MPI/OpenACC Results
In Table 5.5, we show the overall solver time of Lanczos for 100 iterations with the MPI/OpenACC scheme. For each test problem, we start with the minimum GPU count needed and increase this number to assess the scheme's strong-scaling performance. For example, on Medium, we first use 91 GPUs, on which 100 solver iterations take 44.5 s, and go up to 496 GPUs, where the solver time drops to 18.5 s.

Test Case   # of Nodes   # of GPUs   Solver Time
Small       7            28          38.4
            12           45          26.1
            23           91          15.5
            48           190         11.8
            69           276         10.4
Medium      23           91          44.5
            48           190         30.4
            69           276         23.5
            95           378         21.5
            124          496         18.5
Large       69           276         43.0
            95           378         35.9
            124          496         30.0
            195          780         27.3
            282          1128        25.6
Table 5.5 Solver time in seconds for 100 Lanczos iterations with MPI/OpenACC.

We will discuss the strong-scaling efficiency of MPI/OpenACC on these matrices later in Section 5.3.7, along with the other schemes. For now, we turn our focus to the significance of the distributed SpMV algorithm in Lanczos. In Table 5.6, we share the time spent in the SpMV algorithm for a subset of the experiments.
We also compute the percentage of time spent in this algorithm relative to the entire Lanczos solver time. We notice that SpMV accounts for 75-90% of the solver time. Regardless of the scale of the experiment or the test problem used, SpMV is the most time-consuming part of Lanczos. This is the main reason why we invest our efforts into improving the SpMV time with our proposed schemes. The performance of MPI/OpenACC shown in Table 5.5 will be our baseline to compare the new schemes against.

| Test Case | # of Nodes | # of GPUs | SpMV Time | Solver Time | Percentage |
| Small     | 12 | 45  | 23.5 | 26.1 | 90.1 |
| Small     | 23 | 91  | 13.2 | 15.5 | 85.2 |
| Small     | 48 | 190 | 8.8  | 11.8 | 75.0 |
| Medium    | 23 | 91  | 40.4 | 44.5 | 90.9 |
| Medium    | 48 | 190 | 27.9 | 30.4 | 92.0 |
| Medium    | 95 | 378 | 17.6 | 21.5 | 81.9 |

Table 5.6 Percentage of time spent in SpMV for 100 Lanczos iterations with MPI/OpenACC.

5.3.3 MPI/CUDA Results

We give the performance results of the MPI/CUDA scheme in Table 5.7 for the same set of experiments. In this table, we also show the speedup achieved over MPI/OpenACC.

We see that by changing the local SpMV and SpMV-transpose kernels, originally implemented using OpenACC directives, to hand-tuned CUDA implementations, we achieve up to a 2.00× speedup. However, we always observe the highest speedup on the lowest GPU count for each problem. For the Small problem, for example, the 2.00× speedup obtained on 28 GPUs gradually diminishes to only a 3% improvement on 276 GPUs. We essentially see no advantage of using CUDA over OpenACC on the highest GPU count in any test case.

| Test Case | # of Nodes | # of GPUs | Solver Time | Speedup |
| Small     | 7   | 28   | 19.2 | 2.00 |
| Small     | 12  | 45   | 15.2 | 1.71 |
| Small     | 23  | 91   | 11.5 | 1.34 |
| Small     | 48  | 190  | 10.2 | 1.15 |
| Small     | 69  | 276  | 10.1 | 1.03 |
| Medium    | 23  | 91   | 32.8 | 1.35 |
| Medium    | 48  | 190  | 25.5 | 1.19 |
| Medium    | 69  | 276  | 22.0 | 1.07 |
| Medium    | 95  | 378  | 20.6 | 1.04 |
| Medium    | 124 | 496  | 18.2 | 1.02 |
| Large     | 69  | 276  | 34.0 | 1.27 |
| Large     | 95  | 378  | 37.0 | 1.17 |
| Large     | 124 | 496  | 26.8 | 1.12 |
| Large     | 195 | 780  | 26.6 | 1.03 |
| Large     | 282 | 1128 | 25.5 | 1.00 |

Table 5.7 Solver time in seconds for 100 Lanczos iterations with MPI/CUDA and speedup achieved over MPI/OpenACC.

We suspected that this is due to the MPI collectives becoming the bottleneck as we increase the number of GPUs in our strong-scaling experiments. To verify that, we measured how much time each process spends on each collective throughout the 100 iterations of Lanczos. For an accurate measurement, we placed an MPI_Barrier before these collectives (a minimal sketch of this measurement methodology follows Table 5.8). In Table 5.8, we share the maximum time spent on each collective by any process.

| Test Case | # of Nodes | # of GPUs | Allgatherv | Bcast | Reduce | Reduce_scatter |
| Small     | 12 | 45  | 0.51 | 1.13 | 5.56 | 3.89 |
| Small     | 23 | 91  | 0.36 | 1.07 | 3.85 | 3.36 |
| Small     | 48 | 190 | 0.23 | 0.87 | 2.84 | 3.36 |
| Medium    | 23 | 91  | 1.42 | 3.64 | 13.4 | 8.59 |
| Medium    | 48 | 190 | 0.76 | 3.07 | 9.61 | 6.94 |
| Medium    | 95 | 378 | 0.56 | 2.21 | 7.01 | 6.91 |

Table 5.8 Communication time spent on each collective in 100 Lanczos iterations with MPI collectives. Each collective is isolated with an MPI_Barrier. All processes take measurements separately and we report the maximum time among them for each collective.

These numbers suggest that the Reduce and Reduce_scatter collectives are much more expensive than Allgatherv and Bcast. On the Small problem with 190 GPUs, for instance, we spend 6.2s in total on Reduce and Reduce_scatter, while Allgatherv and Bcast sum up to 1.1 seconds. Considering that the MPI/CUDA scheme spends 10.2s for this problem and GPU count (see Table 5.7), spending 6.2s on Reduce and Reduce_scatter is an issue we need to address.
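A minimal sketch of the barrier-isolated timing just described is given below; the buffer, count and communicator arguments are placeholders, not MFDn's actual variables. Each rank times the collective locally and the slowest rank defines the reported cost.

// Sketch: timing one collective in isolation. The MPI_Barrier ensures that no
// rank starts the timer while others are still busy; the per-rank elapsed time
// is then reduced with MPI_MAX so that the maximum over all ranks is reported.
#include <mpi.h>

double timed_reduce(const float* sendbuf, float* recvbuf, int count,
                    int root, MPI_Comm comm) {
  MPI_Barrier(comm);                        // isolate the collective
  double t0 = MPI_Wtime();
  MPI_Reduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, root, comm);
  double elapsed = MPI_Wtime() - t0;
  double maxElapsed = 0.0;                  // slowest rank's time
  MPI_Reduce(&elapsed, &maxElapsed, 1, MPI_DOUBLE, MPI_MAX, root, comm);
  return maxElapsed;                        // meaningful on the root rank
}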
Moreover, although we overlap MPI_Bcast with the local SpMV and MPI_Reduce with the local SpMV-transpose, it is useful to know how much speedup is attainable when the MPI collectives take this much time, as this puts the MPI/CUDA performance into perspective. For that reason, in Table 5.9 we calculate the total time spent on MPI collectives as a percentage of the overall solver time. We then use this percentage to arrive at the maximum theoretical speedup that can be achieved by optimizing only the local GPU kernels.

| Test Case | # of Nodes | # of GPUs | Comm. Time | % of Solver | Max. Speedup |
| Small     | 12 | 45  | 11.1 | 42.5 | 2.35 |
| Small     | 23 | 91  | 8.64 | 55.7 | 1.79 |
| Small     | 48 | 190 | 7.30 | 61.8 | 1.61 |
| Medium    | 23 | 91  | 27.1 | 60.8 | 1.64 |
| Medium    | 48 | 190 | 20.4 | 67.1 | 1.49 |
| Medium    | 95 | 378 | 16.7 | 77.6 | 1.28 |

Table 5.9 Total communication time spent on collectives based on Table 5.8, percentage of communication time relative to the solver time given in Table 5.5, and maximum theoretical speedup possible by keeping the communication cost fixed.

These numbers suggest that there is only so much speedup we can get by simply improving the local computations. For example, we can never see more than a 1.28× speedup with MPI/CUDA for the Medium problem on 378 GPUs when the MPI collectives are this costly. This realization motivates us to understand the underlying issue with the MPI_Reduce and MPI_Reduce_scatter calls and to address it in order to achieve further improvements.

5.3.4 The Issue with Reduce and Reduce_scatter

Realizing that MPI_Reduce and MPI_Reduce_scatter take considerably longer than the other two collectives got us thinking about the compute part of these collectives, i.e., the local reductions. We suspected that these slow collectives might be performing their reductions on the host side despite using CUDA-aware MPI. To investigate this further, we profiled the application with Nsight Systems. Our profiling analysis reveals the following:

• Regarding MPI_Reduce, we observe a single device-to-host data movement (DtoH memcpy) in the earlier stages of this call.

• For the same call, we observe a single host-to-device data movement (HtoD memcpy) in the later stages, only on the root process.

• Regarding MPI_Reduce_scatter, we observe multiple device-to-host and host-to-device memory copy operations on all processes throughout this call.

• To give an example, when there are three processes participating in this call, we observe four device-to-host and two host-to-device copies. For seven processes, these numbers go up to nine device-to-host and six host-to-device copies.

In Figure 5.9, we give a snapshot of our Nsight Systems profile, where the bottom part shows the blocking nature of the MPI_Reduce and MPI_Reduce_scatter calls on the CPU side. The top part, where we can see the activity on the CUDA hardware, reveals the copy operations between host and device.

Figure 5.9 Nsight Systems profile showing the device-to-host and host-to-device copy operations performed during the MPI_Reduce and MPI_Reduce_scatter calls (GPU activity on top, CPU activity on the bottom). Lanczos is run on the XSmall test case with 15 GPUs in this experiment.

Moreover, the size of these copy operations, which occur concurrently with these collectives, matches the size of the buffers used in these calls. These observations made it clear to us that the reductions are indeed happening on the host side. That is why these two collectives were markedly slower than the other two, as shown in Table 5.8. Please note that we use the CUDA-aware MPI implementation provided by Cray-MPICH (8.1.28), which is the default MPI library on Perlmutter.
After realizing that the reductions take place on the CPU with this implementation, we wanted to test whether it is a single-threaded or multi-threaded reduction operation. In Table 5.10, we share the time MPI_Reduce and MPI_Reduce_scatter take in 100 Lanczos iterations for varying numbers of CPU threads. We see that using one, two, eight or 32 threads (first allocated by Slurm's --cpus-per-task flag and then set via export OMP_NUM_THREADS) makes no difference in performance.

| Test Case | # of Nodes | # of GPUs | # of CPU Threads | Reduce | Reduce_scatter |
| XSmall    | 4 | 15 | 1  | 0.84 | 0.54 |
| XSmall    | 4 | 15 | 2  | 0.82 | 0.60 |
| XSmall    | 4 | 15 | 8  | 0.84 | 0.54 |
| XSmall    | 4 | 15 | 32 | 0.82 | 0.56 |

Table 5.10 Reduce and Reduce_scatter time spent in 100 Lanczos iterations with MPI/OpenACC for varying numbers of CPU threads.

Given that the reductions needed by these two collectives are very likely performed by a single-threaded CPU implementation, we explored ways to improve the overall solver time by addressing this bottleneck through the Hybrid/CUDA and NCCL/CUDA schemes.

5.3.5 Hybrid/CUDA Results

In Table 5.11, we give the speedup achieved on the Reduce and Reduce_scatter operations by replacing the MPI collectives with a set of non-blocking point-to-point MPI calls and local reductions on the GPU (see Section 5.2.2 for more details). The table shows that we can improve the Reduce time by up to 2.3× and the Reduce_scatter time by up to 5.9× with this technique. However, we notice a decline in the speedup numbers for Reduce as we go from Small to Medium and/or increase the number of GPUs. In fact, we observe no speedup at all for the Medium problem on 378 GPUs.

| Test Case | # of Nodes | # of GPUs | Reduce | Speedup | Reduce_scatter | Speedup |
| Small     | 12 | 45  | 2.44 | 2.28 | 0.74 | 5.26 |
| Small     | 23 | 91  | 2.39 | 1.61 | 0.75 | 4.48 |
| Small     | 48 | 190 | 1.99 | 1.43 | 0.57 | 5.89 |
| Medium    | 23 | 91  | 8.18 | 1.64 | 2.61 | 3.29 |
| Medium    | 48 | 190 | 8.80 | 1.09 | 2.05 | 3.39 |
| Medium    | 95 | 378 | 7.00 | 1.00 | 1.61 | 4.29 |

Table 5.11 Reduce and Reduce_scatter time spent in seconds for 100 Lanczos iterations with MPI point-to-point communication, and speedup achieved over MPI collective communication based on Table 5.8. Both calls are isolated with an MPI_Barrier. All processes take measurements separately and we report the maximum time among them for each call.

Recall that the Reduce call is performed along a row communicator, overlapped with the local SpMV-transpose. The processes on row communicators are physically farther away from each other than those on column communicators, on which the Reduce_scatter call takes place. Due to the non-scaling nature of point-to-point communication, which is exacerbated by the distance between the communicating processes, our point-to-point Reduce implementation loses its advantage over the slow MPI_Reduce collective at scale.

We now give the performance results of the Hybrid/CUDA scheme along with the speedup achieved over MPI/OpenACC in Table 5.12. In this table, we consistently observe around 2.8× speedup for the Small problem. For Medium, we start with a 2.3× speedup on 91 GPUs and end up with a 1.8× speedup on 496 GPUs. For the Large problem, we can still see a 62% improvement on 1128 GPUs. These numbers indicate the significance of performing the local arithmetic needed by Reduce and Reduce_scatter on the device side. By simply getting rid of these two MPI collectives, which perform a CPU-side reduction, in the distributed SpMV algorithm, we achieve between 1.6× and 2.9× speedup in the overall solver time.

| Test Case | # of Nodes | # of GPUs | Solver Time | Speedup |
| Small     | 7   | 28   | 13.8 | 2.77 |
| Small     | 12  | 45   | 9.45 | 2.76 |
| Small     | 23  | 91   | 5.66 | 2.74 |
| Small     | 48  | 190  | 4.12 | 2.86 |
| Small     | 69  | 276  | 3.60 | 2.89 |
| Medium    | 23  | 91   | 19.7 | 2.25 |
| Medium    | 48  | 190  | 14.4 | 2.11 |
| Medium    | 69  | 276  | 12.4 | 1.90 |
| Medium    | 95  | 378  | 11.4 | 1.88 |
| Medium    | 124 | 496  | 10.6 | 1.75 |
| Large     | 69  | 276  | 22.5 | 1.91 |
| Large     | 95  | 378  | 20.4 | 1.76 |
| Large     | 124 | 496  | 19.1 | 1.57 |
| Large     | 195 | 780  | 17.1 | 1.59 |
| Large     | 282 | 1128 | 15.7 | 1.62 |

Table 5.12 Solver time in seconds for 100 Lanczos iterations with Hybrid/CUDA and speedup achieved over MPI/OpenACC.
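As a concrete illustration of the replacement described in Section 5.2.2 and evaluated above, the following is a minimal sketch, not MFDn's implementation, of a Reduce carried out with point-to-point messages and a GPU-side accumulation. The accumulate kernel is illustrative; the actual Hybrid/CUDA scheme additionally overlaps these transfers with non-blocking calls, while the sketch uses blocking receives for clarity.

// Sketch: replacing MPI_Reduce with P2P messages plus a GPU reduction.
// Non-root ranks send their device buffer; the root receives the contributions
// one at a time (MPI_ANY_SOURCE, so arrival order does not matter) and folds
// each one into its own device buffer with a small CUDA kernel.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void accumulate(float* acc, const float* contrib, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) acc[i] += contrib[i];
}

void p2p_reduce(float* d_acc, size_t n, int root, MPI_Comm comm) {
  int rank, nRanks;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nRanks);
  if (rank != root) {
    // CUDA-aware MPI: device pointers can be passed directly.
    MPI_Send(d_acc, (int)n, MPI_FLOAT, root, /*tag=*/0, comm);
    return;
  }
  float* d_recv;                            // single-contribution staging buffer
  cudaMalloc(&d_recv, n * sizeof(float));
  for (int k = 0; k < nRanks - 1; ++k) {    // one message per non-root rank
    MPI_Recv(d_recv, (int)n, MPI_FLOAT, MPI_ANY_SOURCE, 0, comm,
             MPI_STATUS_IGNORE);
    int threads = 256;
    int blocks  = (int)((n + threads - 1) / threads);
    accumulate<<<blocks, threads>>>(d_acc, d_recv, n);
    cudaDeviceSynchronize();                // d_recv is reused in the next iteration
  }
  cudaFree(d_recv);
}

Keeping only one staging buffer is what makes this variant space-efficient, while MPI_ANY_SOURCE lets the root process contributions in whatever order they arrive.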
5.3.6 NCCL/CUDA Results

In Table 5.13, we give the speedup achieved on the Reduce and Reduce_scatter operations by using NCCL instead of MPI collectives. The Reduce_scatter time given here includes the process of zero-padding the send buffer before calling ncclReduceScatter (see Section 5.2.3 for more details). This table shows that on the Small and Medium test cases, we can improve the Reduce time by up to 11.1× and the Reduce_scatter time by up to 33.6× with this change.

We again attribute the higher speedup numbers achieved on the Reduce_scatter call to the fact that the processes participating in the Reduce call are more distant from each other. When the communicating processes are farther apart, the communication part of the Reduce call takes a bigger portion of this collective. Consequently, we lose some of the gains obtained by using NCCL's GPU-side reduction over MPI's CPU-side reduction in the compute part.

| Test Case | # of Nodes | # of GPUs | Reduce | Speedup | Reduce_scatter | Speedup |
| Small     | 12 | 45  | 0.50 | 11.1 | 0.34 | 11.4 |
| Small     | 23 | 91  | 0.39 | 9.87 | 0.24 | 14.0 |
| Small     | 48 | 190 | 0.30 | 9.47 | 0.10 | 33.6 |
| Medium    | 23 | 91  | 2.75 | 4.89 | 1.90 | 4.52 |
| Medium    | 48 | 190 | 1.96 | 4.90 | 0.32 | 21.6 |
| Medium    | 95 | 378 | 1.62 | 4.33 | 0.24 | 28.7 |

Table 5.13 Reduce and Reduce_scatter time in seconds for 100 Lanczos iterations with NCCL and speedup achieved over MPI based on Table 5.8. Both collectives are isolated with barriers. All processes take measurements separately and we report the maximum time among them for each collective.

In Table 5.14, we give the Lanczos solver time for four different versions of NCCL/CUDA to highlight our optimization efforts. The description of each version is as follows:

NCCL-v1: This is our initial version, where we use as few synchronization calls as possible. That is, we do not use any blocking MPI or NCCL call, since both the CUDA kernels and the NCCL calls are assigned to asynchronous CUDA streams. We also only synchronize streams when there is a data dependency.

NCCL-v2: This version is similar to NCCL-v1, except that before each NCCL call we add an MPI_Barrier on the corresponding communicator and a cudaDeviceSynchronize(). We notice in Table 5.14 that adding barriers improves the solver time in 14 out of 15 cases (three test cases, five runs each). For Medium on 496 GPUs, for example, adding such synchronization leads to a 60% improvement (11.5s vs 7.1s).

We originally implemented this version to time the NCCL calls accurately. Achieving an improvement with such additional synchronizations and barriers seemed counter-intuitive. As such, we examined the Nsight profiling data to understand the underlying reasons. According to the profiling analysis, there are two reasons for the observed improvements. First, we did not truly overlap communication and computation in NCCL-v1 to begin with. This is because NCCL calls are blocking either on the CPU or the GPU side, even when we configure the NCCL communicators to be non-blocking. Since communication and computation are already serialized, adding a CUDA synchronization call incurs no penalty. Second, for reasons unknown to us, adding an MPI_Barrier before NCCL calls appears to accelerate the communication in our experiments. That is, the overall cost of an MPI_Barrier followed by a NCCL call is lower than that of a stand-alone NCCL call.
This seems to be the case even when there is no GPU kernel running concurrently with the NCCL call. As such, adding an MPI_Barrier before NCCL calls turns out to be helpful in our case.

NCCL-v3: We began exploring ways to achieve overlap with NCCL after realizing that NCCL-v1 did not actually achieve it. We found that we can truly overlap a GPU kernel with a NCCL call when we (i) launch the GPU kernel first and follow it with the NCCL call on the CPU side, and (ii) assign a higher priority to the stream used by NCCL. This way, the NCCL call cannot block the GPU kernel, while the CUDA runtime can still allocate the resources needed by the NCCL call due to its higher priority. Note that, after seeing its merit in NCCL-v2, we use an MPI_Barrier before each NCCL call in this version as well. We see in Table 5.14 that this version performs better than the previous version in 13 out of 15 cases. For example, we see a 33% improvement over NCCL-v2 for the Large problem on 276 GPUs (22.9s vs 17.2s).

NCCL-v4: Our last optimization effort with NCCL/CUDA is to adjust the number of thread blocks used by the NCCL communicators ourselves. Whenever we create a NCCL communicator, NCCL by default selects a thread block count that can saturate the memory bandwidth. NCCL then assigns that many thread blocks to the communicator each time it is used for a NCCL call throughout the Lanczos iterations. However, this default number may not always yield the best performance. In this version, we start with the optimizations used by NCCL-v3. We then configure the NCCL communicators to try five different thread block counts: two, four, eight, 12 and 16 (a minimal communicator-configuration sketch follows Table 5.14). We run our Lanczos solver separately for each configuration and report the optimal solver time in Table 5.14. This version performs better than the previous version in 14 out of 15 cases.

We summarize the impact of our optimization efforts in Table 5.15, where we show the speedup achieved over our baseline with NCCL/CUDA for both NCCL-v1 and NCCL-v4. We see that NCCL-v1, our initial version, offers a speedup ranging from 1.6× to 2.2×. As we introduce new optimizations with each version, our final speedup reaches a point between 2.6× and 4.9×.

A question that naturally arises when using NCCL-v4, the final version, is what thread block count to use in each experiment. To answer this question, we show which thread block count yields the optimal performance in Table 5.16. In this table, we also give the percentage slowdown incurred when we use a non-optimal thread block count.

| Test Case | # of Nodes | # of GPUs | NCCL-v1 | NCCL-v2 | NCCL-v3 | NCCL-v4 |
| Small     | 7   | 28   | 18.4 | 16.2 | 13.8 | 11.6 |
| Small     | 12  | 45   | 14.9 | 10.5 | 9.11 | 7.67 |
| Small     | 23  | 91   | 9.38 | 5.66 | 4.71 | 4.00 |
| Small     | 48  | 190  | 6.21 | 3.08 | 2.62 | 2.43 |
| Small     | 69  | 276  | 4.86 | 2.38 | 2.37 | 2.39 |
| Medium    | 23  | 91   | 29.5 | 25.9 | 20.1 | 14.6 |
| Medium    | 48  | 190  | 17.8 | 13.2 | 10.1 | 8.86 |
| Medium    | 69  | 276  | 13.3 | 10.4 | 10.1 | 7.86 |
| Medium    | 95  | 378  | 13.3 | 8.69 | 8.79 | 6.96 |
| Medium    | 124 | 496  | 11.5 | 7.12 | 8.33 | 6.84 |
| Large     | 69  | 276  | 23.2 | 22.9 | 17.2 | 13.5 |
| Large     | 95  | 378  | 22.0 | 19.4 | 15.4 | 12.1 |
| Large     | 124 | 496  | 13.4 | 16.4 | 13.7 | 10.3 |
| Large     | 195 | 780  | 15.2 | 11.5 | 10.2 | 9.39 |
| Large     | 282 | 1128 | 13.6 | 10.1 | 10.0 | 9.93 |

Table 5.14 Solver time in seconds for 100 Lanczos iterations with NCCL/CUDA for all versions.
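The sketch below shows one way to pin the thread block (CTA) count of a NCCL communicator, as explored in NCCL-v4. It assumes NCCL 2.17 or newer, where ncclConfig_t exposes the minCTAs/maxCTAs fields (our experiments use NCCL 2.18.3); it is an illustration of the idea rather than MFDn's actual code, and error checking is omitted.

// Sketch: creating a NCCL communicator restricted to a fixed number of CTAs.
#include <nccl.h>

ncclComm_t make_capped_comm(int nRanks, int rank, ncclUniqueId id, int ctas) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.minCTAs = ctas;   // lower bound on thread blocks per NCCL kernel
  config.maxCTAs = ctas;   // upper bound; equal values pin the count
  ncclComm_t comm;
  ncclCommInitRankConfig(&comm, nRanks, id, rank, &config);
  return comm;
}

In our setup, each of the five candidate counts (2, 4, 8, 12 and 16) is tried by re-creating the communicators with the corresponding value and re-running the solver.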
| Test Case | # of Nodes | # of GPUs | Initial Speedup | Final Speedup |
| Small     | 7   | 28   | 2.08 | 3.31 |
| Small     | 12  | 45   | 1.75 | 3.40 |
| Small     | 23  | 91   | 1.65 | 3.88 |
| Small     | 48  | 190  | 1.90 | 4.85 |
| Small     | 69  | 276  | 2.14 | 4.34 |
| Medium    | 23  | 91   | 1.51 | 3.04 |
| Medium    | 48  | 190  | 1.71 | 3.43 |
| Medium    | 69  | 276  | 1.78 | 3.00 |
| Medium    | 95  | 378  | 1.61 | 3.09 |
| Medium    | 124 | 496  | 1.60 | 2.71 |
| Large     | 69  | 276  | 1.85 | 3.19 |
| Large     | 95  | 378  | 1.63 | 2.98 |
| Large     | 124 | 496  | 2.24 | 2.93 |
| Large     | 195 | 780  | 1.79 | 2.90 |
| Large     | 282 | 1128 | 1.88 | 2.57 |

Table 5.15 Speedup achieved with NCCL/CUDA for the initial (NCCL-v1) and the most optimized (NCCL-v4) versions over MPI/OpenACC for 100 Lanczos iterations.

| Test Case | # of GPUs | 2 Blocks | 4 Blocks | 8 Blocks | 12 Blocks | 16 Blocks |
| Small     | 28   | 4.8%  | 1.8%  | 0.5% | Opt   | 0.3% |
| Small     | 45   | 1.9%  | Opt   | 0.5% | 0.7%  | 1.0% |
| Small     | 91   | 2.0%  | Opt   | 1.0% | 1.3%  | 1.8% |
| Small     | 190  | 4.6%  | Opt   | 1.9% | 1.6%  | 2.2% |
| Small     | 276  | 16.1% | 0.5%  | Opt  | 3.3%  | 6.4% |
| Medium    | 91   | 33.4% | 12.5% | 3.0% | 3.5%  | Opt  |
| Medium    | 190  | 35.3% | 14.2% | 7.0% | 3.6%  | Opt  |
| Medium    | 276  | 30.3% | 2.2%  | 0.3% | Opt   | 0.3% |
| Medium    | 378  | 48.7% | 20.0% | Opt  | 11.8% | 5.9% |
| Medium    | 496  | 23.0% | 13.8% | 3.6% | Opt   | 6.9% |
| Large     | 276  | 53.5% | 15.9% | Opt  | 4.0%  | 0.1% |
| Large     | 378  | 39.0% | 4.6%  | 0.1% | Opt   | 0.3% |
| Large     | 496  | 43.8% | 15.0% | 6.2% | Opt   | 7.3% |
| Large     | 780  | 29.6% | 12.9% | 3.4% | 2.2%  | Opt  |
| Large     | 1128 | 13.5% | 2.4%  | 2.1% | Opt   | 0.7% |

Table 5.16 Percentage of slowdown observed over the optimal performance when the NCCL communicators are configured to use 2, 4, 8, 12 and 16 thread blocks for NCCL-v4.

These numbers suggest that configuring the communicators to use two thread blocks yields the worst performance in general. It is worth noting that this configuration can be up to 54% slower on the Large matrix. Using four thread blocks, on the other hand, seems to work well on the Small test case, although it can be up to 20% and 16% slower than the best option on Medium and Large, respectively. Regarding the remaining three options, there is probably a trade-off when we go from using eight to 16 thread blocks for NCCL calls. Using eight blocks might correspond to slightly slower NCCL communication, but it leaves more streaming multiprocessors (SMs) free to complete the concurrent GPU kernels slightly faster. This might be the reason why the overall solver time does not vary much when we go from eight to 16.

We find the eight, 12 and 16 thread block configurations to be safe choices for our experiments, since all three configurations are at most 12% slower than the optimal configuration in any case. However, we note that the optimal thread block count with which to configure a NCCL communicator depends on the network architecture and topology, the problem size and the communication pattern.

5.3.7 Discussion

In Figures 5.10, 5.11 and 5.12, we show the strong-scaling plots of MPI/OpenACC, Hybrid/CUDA and NCCL/CUDA, respectively. In these figures, the y-axis denotes the solver time and the x-axis indicates the GPU count. Along with the observed solver time for each experiment, we also plot dashed lines to indicate ideal scaling, which is what we would observe if the speedup achieved matched the scale-up in GPUs used.

Figure 5.10 Strong-scaling plot of MPI/OpenACC for 100 Lanczos iterations. Solid lines correspond to solver time with respect to the GPU count for each problem whereas dashed lines show ideal scaling.

Figure 5.11 Strong-scaling plot of Hybrid/CUDA for 100 Lanczos iterations. Solid lines correspond to solver time with respect to the GPU count for each problem whereas dashed lines show ideal scaling.

Figure 5.12 Strong-scaling plot of NCCL/CUDA (NCCL-v4) for 100 Lanczos iterations. Solid lines correspond to solver time with respect to the GPU count for each problem whereas dashed lines show ideal scaling.

These plots show that we slowly deviate from ideal scaling with each scheme. To support this, we show in Table 5.17 the strong-scaling efficiency attained with each scheme for each test case, where the percentage numbers shown in this table correspond to the ratio between the speedup achieved and the ideal speedup. We see that our efficiency never goes below 33%, but we cannot ever reach 50% either, which is far from ideal.
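For concreteness, the efficiency entries are computed from the scheme's own solver times at its minimum and maximum GPU counts. Using the Medium MPI/OpenACC numbers from Table 5.5 (44.5 s on 91 GPUs, 18.5 s on 496 GPUs):

efficiency = (observed speedup) / (ideal speedup) = (44.5 / 18.5) / (496 / 91) = 2.41 / 5.45 ≈ 44%,

which matches the corresponding entry in Table 5.17.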
We know that going from MPI/OpenACC to MPI/CUDA yields only a marginal return, whereas changing both the communication and the compute aspects produces significant speedups with Hybrid/CUDA and NCCL/CUDA. The strong-scaling efficiency numbers, on the other hand, suggest that the proposed schemes do not scale as well as the baseline. On the Large problem, for instance, MPI/OpenACC attains a 41% efficiency against the 35% and 33% of Hybrid/CUDA and NCCL/CUDA. The higher efficiency numbers observed on Medium and Large with our baseline can be attributed to the compute aspect of the MPI_Reduce and MPI_Reduce_scatter collectives being their main bottleneck, and this compute aspect does not necessarily scale poorly: an increase in the number of processes also means an increase in the number of CPU threads performing reductions. This is not quite the case for the Hybrid/CUDA and NCCL/CUDA schemes, because we expect the communication aspect of these collectives to dominate their compute aspect in the first place.

| Test Case | MPI/OpenACC | Hybrid/CUDA | NCCL/CUDA |
| Small     | 37%         | 38%         | 49%       |
| Medium    | 44%         | 34%         | 39%       |
| Large     | 41%         | 35%         | 33%       |

Table 5.17 Strong-scaling efficiency of the MPI/OpenACC, Hybrid/CUDA and NCCL/CUDA schemes on the Small, Medium and Large matrices.

Apart from the comparison between these schemes, we believe the lack of efficiency across the board is due to the communication aspect of the distributed SpMV algorithm scaling worse than its compute aspect in our strong-scaling experiments. To understand this issue, we numerically quantify the total number of floating point operations (FLOP) and the communication volume observed in the distributed SpMV. We specifically define these two quantities for the parts where computation and communication can be overlapped as follows:

Total FLOP: This refers to the total number of floating point operations performed in the local SpMV and SpMV-transpose kernels across all GPUs for a given symmetric matrix H. Regardless of the GPU count,

Total FLOP = 2 × 2 × nnz_H    (5.1)

where nnz_H is the number of nonzero elements in half of the symmetric matrix. This is because we perform one addition and one multiplication per nonzero element in both kernels.

Total Communication Volume: This refers to the total communication volume of Bcast and Reduce, the overlapped collectives, across all GPUs for a given vector W. For the sake of simplicity, we calculate this number for the ncclBroadcast and ncclReduce calls using the ring algorithm. Under these assumptions,

Total Communication Volume = 2 × 4 × ((n_d − 1)/2) × S_W    (5.2)

where S_W is the number of elements in W. Recall that W is distributed among n_d column groups, where each group has (n_d + 1)/2 processes, or GPUs (see Section 5.1.1). The first two factors in this equation come from the number of collectives, which is two, and the bytes per element, which is four since we store numbers in single precision in MFDn. The remaining factor accounts for the ring algorithm: each sub-vector of W is partitioned into (n_d + 1)/2 segments of S_W / (n_d × (n_d + 1)/2) elements each, and each process sends or receives (n_d − 1)/2 of them, so that, summed over all n_d × (n_d + 1)/2 GPUs, (n_d − 1)/2 × S_W elements are moved per collective.
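The short program below is a minimal sketch that evaluates this model for the Small case, with nnz_H and S_W taken from Table 5.4; it is provided only to make the model concrete and reproduces the corresponding row of Table 5.18.

// Sketch: evaluating the FLOP-per-byte model of Eqs. (5.1)-(5.3).
// Example inputs correspond to the Small case: nnz_H = 129e9 nonzeros,
// S_W = 165e6 vector elements, P = 28 GPUs.
#include <cmath>
#include <cstdio>

int main() {
  const double nnz = 129e9;   // nonzeros in half of the symmetric matrix
  const double S_W = 165e6;   // elements in the distributed vector W
  const double P   = 28;      // number of GPUs
  const double n_d   = (std::sqrt(8.0 * P + 1.0) - 1.0) / 2.0;    // Eq. (5.3)
  const double flop  = 2.0 * 2.0 * nnz;                           // Eq. (5.1)
  const double bytes = 2.0 * 4.0 * ((n_d - 1.0) / 2.0) * S_W;     // Eq. (5.2)
  std::printf("n_d = %.0f, FLOP = %.2e, bytes = %.2e, FLOP/B = %.0f\n",
              n_d, flop, bytes, flop / bytes);
  // Expected output roughly: n_d = 7, FLOP = 5.16e+11, bytes = 3.96e+09, FLOP/B = 130
  return 0;
}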
| Test Case | # of GPUs | Total FLOP | Total Comm Vol | Ratio (FLOP/B) |
| Small     | 28   | 5.16E11 | 3.97GB | 130  |
| Small     | 276  | 5.16E11 | 14.6GB | 35.4 |
| Medium    | 91   | 1.90E12 | 27.6GB | 68.9 |
| Medium    | 496  | 1.90E12 | 69.1GB | 27.6 |
| Large     | 276  | 5.22E12 | 86.1GB | 60.7 |
| Large     | 1128 | 5.22E12 | 180GB  | 29.0 |

Table 5.18 Number of floating point operations performed per byte communicated in the distributed SpMV algorithm.

In Table 5.18, we show the ratio between the total FLOP and the total communication volume, which corresponds to the number of operations performed per byte communicated. We see that this FLOP/B ratio goes down in our strong-scaling experiments. This is because our compute volume stays the same while our communication volume grows linearly with respect to n_d according to Eq. 5.2. Let P denote the number of processes or GPUs used. Then

P = n_d × (n_d + 1)/2  ⟹  n_d = (√(8P + 1) − 1)/2    (5.3)

This means our communication volume grows with the square root of the number of GPUs used. Given that the compute volume stays constant, all evaluated schemes eventually become communication-bound. Therefore, we believe it is reasonable to see these schemes deviate from the ideal solver time in our strong-scaling experiments.

Moreover, we have so far only considered the overlapped communication calls, i.e., Bcast and Reduce. We also perform an Allgatherv and a Reduce_scatter that are not overlapped at all. Considering that these collectives have the same growth factor as the other two, their presence is another factor that keeps us from reaching ideal scaling.

Lastly, we perform our experiments on Perlmutter, where we share the network with other jobs that are scheduled to run concurrently as Slurm sees fit. Using a higher number of nodes, or GPUs, makes our runs more prone to network contention and noise, because we request more nodes and the allocated nodes may be scattered around the cluster. As such, the observed network bandwidth is likely to become lower as we increase the number of GPUs in the strong-scaling scenario, which is yet another factor to consider when interpreting the efficiency numbers in Table 5.17.

5.4 LOBPCG Experiments

In this section, we share the experiment results of all four schemes shown in Table 5.1 for the LOBPCG eigensolver.

5.4.1 Experimental Setup

We report the performance of LOBPCG in MFDn when it is run for 20 iterations on Perlmutter. Please refer to Section 5.3.1 for the hardware and system specifications of Perlmutter, the software versions and the test cases used in our experiments.

5.4.2 Results

In Table 5.19, we show the experiment results for 20 LOBPCG iterations to compute eight eigenvalues with MPI/OpenACC and MPI/CUDA. These numbers suggest that changing the local SpMM kernels from OpenACC to CUDA does not correspond to any improvement in the execution time, as the speedup achieved is never more than 5%. In fact, we observe occasional slowdowns of up to 2% with the CUDA code.

| Test Case | # of Nodes | # of GPUs | MPI/OpenACC | MPI/CUDA |
| XSmall    | 4  | 15  | 4.93  | 4.69  |
| XSmall    | 12 | 45  | 2.41  | 2.38  |
| XSmall    | 23 | 91  | 2.29  | 2.20  |
| Small     | 12 | 45  | 21.56 | 21.41 |
| Small     | 23 | 91  | 15.09 | 15.09 |
| Small     | 48 | 190 | 10.64 | 10.82 |
| Medium    | 23 | 91  | 50.15 | 50.05 |
| Medium    | 48 | 190 | 35.51 | 36.03 |
| Medium    | 95 | 378 | 25.82 | 26.08 |

Table 5.19 Solver time in seconds for 20 LOBPCG iterations to compute eight eigenvalues with MPI/OpenACC and MPI/CUDA.

The Hybrid/CUDA and NCCL/CUDA schemes proved to be useful in Section 5.3. As such, we used these schemes to improve the communication aspect, and thereby the performance, of the LOBPCG solver. In Table 5.20, we show the speedup numbers achieved with these schemes over MPI/OpenACC.
| Test Case | # of Nodes | # of GPUs | Hybrid/CUDA | NCCL/CUDA |
| XSmall    | 4  | 15  | 3.02 | 3.16 |
| XSmall    | 12 | 45  | 1.80 | 2.08 |
| XSmall    | 23 | 91  | 1.78 | 2.01 |
| Small     | 12 | 45  | 2.06 | 1.51 |
| Small     | 23 | 91  | 1.87 | 1.27 |
| Small     | 48 | 190 | 1.59 | 1.57 |
| Medium    | 23 | 91  | 1.89 | 1.33 |
| Medium    | 48 | 190 | 1.67 | 1.60 |
| Medium    | 95 | 378 | 1.40 | 1.40 |

Table 5.20 Speedup achieved for 20 LOBPCG iterations to compute eight eigenvalues with Hybrid/CUDA and NCCL/CUDA over MPI/OpenACC based on Table 5.19.

We observe that these schemes are not as effective as they are on Lanczos. For the Small problem on 190 GPUs, for example, we saw nearly a 4.9× speedup with NCCL/CUDA in Table 5.15. However, switching from SpMV-based Lanczos to SpMM-based LOBPCG brings NCCL's improvement in solver time down to 1.6×. We also notice a similar trend with Hybrid/CUDA for this problem, where our speedup decreases from 2.9× to 1.6× (see Table 5.12).

Note that SpMM with eight vectors yields an eight-fold increase in communication volume compared to SpMV. To understand the effect of this change on performance, we show in Table 5.21 the Reduce time observed with MPI, P2P and NCCL for 20 LOBPCG iterations. We see that NCCL still appears to be faster than MPI on all problem sizes. However, we no longer see over an order of magnitude improvement with NCCL over MPI, as was shown in Table 5.13.

| Test Case | # of Nodes | # of GPUs | MPI  | P2P  | NCCL |
| Small     | 12 | 45  | 8.64 | 3.06 | 2.36 |
| Small     | 23 | 91  | 6.05 | 2.84 | 1.99 |
| Small     | 48 | 190 | 4.43 | 2.79 | 1.40 |
| Medium    | 23 | 91  | 20.7 | 9.77 | 8.13 |
| Medium    | 48 | 190 | 15.2 | 9.77 | 5.84 |
| Medium    | 95 | 378 | 10.5 | 9.74 | 4.36 |

Table 5.21 Reduce time in seconds for 20 LOBPCG iterations to compute eight eigenvalues with MPI, P2P and NCCL. Reduce calls are isolated with barriers. All processes take measurements separately and we report the maximum time among them for each collective.

In Table 5.22, we compare the average Reduce time in SpMV and in SpMM with NCCL. The computed ratio indicates that SpMM's Reduce is 13.4×-25.5× costlier than SpMV's for the same problem. That is, increasing the communication volume by 8× slows down the Reduce call by up to 25.5×. This corresponds to a 3.2× slowdown in the bandwidth achieved with NCCL. We believe this is the main reason why we only see a modest improvement with NCCL/CUDA over MPI/OpenACC for the LOBPCG solver.

The NCCL/CUDA scheme presented here for LOBPCG lacks the optimizations incorporated in NCCL-v3 and NCCL-v4 as described in Section 5.3.6. That is, we do not truly overlap communication and computation, and we do not configure the NCCL communicators to use the optimal thread block count. With these improvements, we may potentially achieve higher speedup numbers.

| Test Case | # of Nodes | # of GPUs | Reduce in SpMV | Reduce in SpMM | Ratio |
| Small     | 12 | 45  | 0.005 | 0.118 | 23.60 |
| Small     | 23 | 91  | 0.004 | 0.100 | 25.51 |
| Small     | 48 | 190 | 0.003 | 0.070 | 23.33 |
| Medium    | 23 | 91  | 0.028 | 0.407 | 14.78 |
| Medium    | 48 | 190 | 0.020 | 0.292 | 14.78 |
| Medium    | 95 | 378 | 0.016 | 0.218 | 13.46 |

Table 5.22 Reduce time in seconds for SpMV in a single Lanczos iteration, Reduce time for SpMM in a single LOBPCG iteration, and the ratio of the Reduce time in SpMM over the Reduce time in SpMV with NCCL/CUDA.
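The 3.2× bandwidth figure quoted above follows directly from these two ratios: roughly 8× the data is moved in up to 25.5× the time, so the effective bandwidth seen by the Reduce call drops by about

(time ratio) / (volume ratio) = 25.5 / 8 ≈ 3.2.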
5.4.3 Pipelined SpMM to Hide Communication Costs

In our distributed SpMM implementation, which follows the same steps as the distributed SpMV outlined in Section 5.1.2, we overlap the local SpMM with Bcast and the local SpMM-transpose with Reduce. This means there is no compute task to overlap Allgatherv or Reduce_scatter with. Our experiments with MPI suggest that, although the processes participating in Reduce_scatter are in close proximity to each other, this collective is still quite costly. We believe that with the right pipelining technique, this Reduce_scatter cost can in fact be hidden.

We highlight in Section 5.1.2 that Reduce_scatter is performed on the output of the local SpMM-transpose. However, Reduce_scatter calls can be replaced with multiple Reduce calls, each of which has a different root process. Turning a single Reduce_scatter into multiple Reduces gives us the freedom to commence a Reduce call earlier. That is, as soon as the local SpMM-transpose operations are completed for a partition of the output vectors, we can call MPI_Reduce.

We can partition the local SpMM-transpose with respect to the row dimension of the output vectors, which maps to partitioning the input matrix along the column dimension. Then, launching a separate local SpMM-transpose kernel for each partition allows us to pipeline these kernels with Reduce calls.

To better illustrate this pipelining technique in our SpMM algorithm, we provide two drawings, Figures 5.13 and 5.14, on a small example. In these figures, three processes perform the local SpMM-transpose and participate in the Reduce_scatter call. Figure 5.13 shows the strictly ordered nature of this kernel and collective call, where we have to complete the local SpMM-transpose first before calling the Reduce_scatter. Figure 5.14, on the other hand, demonstrates that by partitioning the SpMM-transpose into smaller kernels and the Reduce_scatter into multiple Reduce calls, we can pipeline these two operations. For instance, while the Reduce call for the first partition is taking place, GPU threads can work on the second partition.

Figure 5.13 Local SpMM-transpose followed by Reduce_scatter in distributed SpMM. The SpMM-transpose is performed on all three ranks first; then Reduce_scatter is called.

Figure 5.14 Partitioned local SpMM-transpose followed by Reduce calls in pipelined distributed SpMM. The SpMM-transpose kernel for each partition is launched in order; when a kernel completes, the corresponding Reduce is called.

Notice that we can use a similar idea to partition the local SpMM kernel into multiple kernels and pipeline these kernels with partitioned Reduce calls. This means that the local SpMM can be overlapped with MPI_Bcast and pipelined with multiple MPI_Reduce calls, while the local SpMM-transpose can be pipelined with MPI_Reduce_scatter as described above. We refer to the implementation achieving such overlap with this pipelining technique as Pipe/CUDA.

We have shown in Table 5.11 how costly an MPI_Reduce call can become and how effective it is to replace it with P2P calls. As such, we replace every Reduce collective with corresponding P2P calls in our Pipe/CUDA scheme.

In Table 5.23, we share the experiment results achieved with Pipe/CUDA for 20 LOBPCG iterations, and we compare this scheme's performance to MPI/CUDA and Hybrid/CUDA. The reason we compare Pipe/CUDA to Hybrid/CUDA is that we are effectively using the same collective and P2P combination; the only difference is that Pipe/CUDA uses fine-grained calls for Reduce and Reduce_scatter in an effort to achieve better overlap.
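The following is a minimal sketch of the partition-by-partition pipeline described above. It assumes CUDA-aware MPI; the spmmT_partition kernel is an illustrative placeholder (left as a comment), and non-blocking MPI_Ireduce calls are used here for brevity, whereas the actual Pipe/CUDA scheme replaces each Reduce with P2P calls.

// Sketch: pipelining partitioned local SpMM-transpose kernels with
// per-partition reductions. Each partition's output block d_out[p] is reduced
// to a different root, so the set of Reduces together replaces Reduce_scatter.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

void pipelined_spmmT_reduce(int nParts, int countPerPart, float** d_out,
                            MPI_Comm colComm) {
  int rank, nRanks;
  MPI_Comm_rank(colComm, &rank);
  MPI_Comm_size(colComm, &nRanks);
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  std::vector<MPI_Request> reqs(nParts);
  for (int p = 0; p < nParts; ++p) {
    // Illustrative kernel launch computing partition p of the SpMM-transpose:
    // spmmT_partition<<<grid, block, 0, stream>>>(p, d_out[p], ...);
    cudaStreamSynchronize(stream);   // d_out[p] must be ready before its reduction
    int root = p % nRanks;           // each partition's result lands on one rank
    if (rank == root)                // CUDA-aware MPI: device pointers throughout
      MPI_Ireduce(MPI_IN_PLACE, d_out[p], countPerPart, MPI_FLOAT, MPI_SUM,
                  root, colComm, &reqs[p]);
    else
      MPI_Ireduce(d_out[p], nullptr, countPerPart, MPI_FLOAT, MPI_SUM,
                  root, colComm, &reqs[p]);
    // The reduction for partition p now progresses while the kernel for
    // partition p+1 runs, which is the overlap the pipeline aims for.
  }
  MPI_Waitall(nParts, reqs.data(), MPI_STATUSES_IGNORE);
  cudaStreamDestroy(stream);
}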
| Test Case | # of Nodes | # of GPUs | MPI/CUDA | Hybrid/CUDA | Pipe/CUDA |
| XSmall    | 4  | 15  | 4.77 | 1.63 | 1.62 |
| XSmall    | 12 | 45  | 2.49 | 1.34 | 1.31 |
| XSmall    | 23 | 91  | 2.20 | 1.29 | 1.29 |
| Small     | 12 | 45  | 21.5 | 10.4 | 10.3 |
| Small     | 23 | 91  | 10.9 | 8.02 | 8.14 |
| Small     | 48 | 190 | 10.9 | 6.71 | 6.93 |
| Medium    | 23 | 91  | 49.8 | 26.5 | 26.9 |
| Medium    | 48 | 190 | 36.9 | 21.2 | 22.3 |
| Medium    | 95 | 378 | 30.9 | 18.4 | 19.7 |

Table 5.23 Solver time in seconds for 20 LOBPCG iterations to compute eight eigenvalues with MPI/CUDA, Hybrid/CUDA and Pipe/CUDA.

These results show that Pipe/CUDA performs similarly to Hybrid/CUDA while obtaining a significant improvement over MPI/CUDA. Therefore, replacing the Reduce collectives with the P2P calls in Pipe/CUDA does improve performance. However, partitioning the local SpMM and SpMM-transpose operations to overlap with Bcast, Reduce and Reduce_scatter yields virtually no additional improvement.

We see the performance of Pipe/CUDA as underwhelming, and we believe the lack of a meaningful speedup over Hybrid/CUDA calls for further investigation. The number of partitions used in the current implementation is equal to the number of processes in each row or column communication group. Tuning the pipelined code by experimenting with different partition counts may potentially improve the performance. Moreover, pipelining NCCL instead of MPI may provide greater benefits, because our experiments with Lanczos and LOBPCG show that NCCL seems to prefer smaller communication volumes, which our pipelining technique produces. Lastly, we have to keep in mind the growing kernel launch overheads with pipelining, which may eventually become the bottleneck at large scale.

5.4.4 Discussion

In Chapter 4, regarding hybrid eigensolvers, we examine the GFLOPS numbers for SpMV and SpMM on many-core shared memory architectures. We see in Figure 4.12 that SpMM yields a much higher GFLOPS value than SpMV for two reasons. First, SpMM has a higher arithmetic intensity than SpMV, which is a significant factor given the memory-bound nature of both kernels. Second, the inner-most loop that updates all output vectors for a given nonzero element can be executed within a single instruction cycle by using SIMD instructions on CPUs. SpMM therefore yields around a 5× higher GFLOPS rate than SpMV on a many-core CPU architecture such as a 64-core AMD EPYC processor. This ratio gives a great advantage to block eigensolvers that utilize SpMM, such as LOBPCG, in comparison to SpMV-based eigensolvers such as Lanczos.

We do a similar comparison between the distributed SpMV of Lanczos and the SpMM of LOBPCG for the MPI/OpenACC scheme within MFDn on GPU architectures. In Table 5.24, we show the overall execution time of SpMV in 100 Lanczos iterations and of SpMM in 20 LOBPCG iterations for the Small and Medium problems. In 20 LOBPCG iterations, we effectively perform 160 SpMVs. We define the ratio as how expensive an SpMV of Lanczos is compared to an effective SpMV of LOBPCG, which gives us the effective GFLOPS ratio between SpMV and SpMM in distributed memory.

| Test Case | # of Nodes | # of GPUs | SpMV Time | SpMM Time | Ratio |
| Small     | 12 | 45  | 22.8 | 18.5 | 1.97 |
| Small     | 23 | 91  | 12.8 | 12.9 | 1.59 |
| Small     | 48 | 190 | 8.37 | 8.76 | 1.53 |
| Medium    | 23 | 91  | 40.3 | 44.6 | 1.44 |
| Medium    | 48 | 190 | 27.7 | 31.6 | 1.40 |
| Medium    | 95 | 378 | 20.5 | 22.7 | 1.45 |

Table 5.24 SpMV time in seconds for 100 Lanczos iterations, SpMM time in seconds for 20 LOBPCG iterations with MPI/OpenACC, and the ratio between 1/100 of the SpMV time and 1/160 of the SpMM time. SpMV and SpMM times are based on the experiment results shown in Tables 5.5 and 5.19.

These numbers indicate that LOBPCG no longer possesses the same advantage with SpMM on distributed GPU architectures. We believe this is because the inner-most loop can no longer be executed in a single instruction cycle; instead, each update to an output vector in this loop is done through atomic operations [40].
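To make this update pattern concrete, the following is a minimal, illustrative CUDA kernel (not MFDn's kernel) for a COO-style list of nonzeros: every nonzero touches all nVecs output vectors through atomicAdd, which is the contended operation referred to above.

// Sketch: SpMM update pattern on a GPU. For every nonzero A(r,c), all nVecs
// output vectors are updated; since many threads may touch the same output
// row, each update goes through atomicAdd, unlike the CPU version where the
// inner loop over vectors maps onto SIMD lanes.
__global__ void spmm_coo(const int* row, const int* col, const float* val,
                         int nnz, int nVecs,
                         const float* X,   // input vectors, column-major [n x nVecs]
                         float* Y,         // output vectors, column-major [m x nVecs]
                         int n, int m) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per nonzero
  if (k >= nnz) return;
  int r = row[k], c = col[k];
  float a = val[k];
  for (int v = 0; v < nVecs; ++v)                   // inner loop over vectors
    atomicAdd(&Y[v * m + r], a * X[v * n + c]);     // contended atomic update
}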
The expensive atomic operations are likely to take away the advantage of SpMM's higher arithmetic intensity over SpMV as well, since the atomics, rather than the memory accesses to the sparse matrix or the input and output vectors, may become the bottleneck. Lastly, the communication cost in these distributed algorithms is another reason that brings this GFLOPS ratio down: the collectives play an equalizing role, negating the performance gains achieved by the local SpMM and SpMM-transpose kernels. The highest ratio that we see in Table 5.24 is still below two. Given that these low ratios render an average LOBPCG iteration relatively more expensive, Lanczos has been the default solver in MFDn's production code since the code was ported to GPUs.

5.5 Conclusion of This Work

In this work, we started with MFDn, which already had these eigensolvers developed for GPUs using MPI collectives for communication and OpenACC directives for computation. Next, we investigated the custom data distribution and the proprietary SpMV algorithm conformal to this distribution within the MFDn solver. Then, we showed the scaling issues observed with this initial MPI/OpenACC scheme and examined their potential root causes. Finally, we offered an improvement in the overall solver time for both Lanczos and LOBPCG by optimizing the compute and communication aspects of SpMV.

To improve the compute aspect of SpMV, we rewrote the existing OpenACC kernels in CUDA, a low-level language native to NVIDIA GPUs. Although we achieved decent speedups on lower GPU counts in our Perlmutter experiments, we saw that the advantage of using CUDA disappeared on higher GPU counts. We attributed the vanishing improvement to the communication aspect of SpMV becoming the bottleneck at large scale.

To improve the communication aspect, we proposed two approaches. First, we utilized asynchronous point-to-point (P2P) communication in place of the expensive MPI collectives. Then, we switched to NCCL collectives in place of MPI's with minimal change. We showed significant performance improvements across the board for Lanczos with these changes. For LOBPCG, on the other hand, we observed more modest improvements, which we attributed to the eight-fold increase in communication volume.

We analyzed the strong-scaling efficiency of our schemes on Lanczos. For this analysis, we modeled the compute and communication volumes of the SpMV algorithm to calculate the number of floating point operations performed per byte communicated. We then realized that the total communication volume grows with the square root of the process count while the compute volume stays constant in the strong-scaling scenario. As such, we concluded that communication would eventually become the bottleneck at large scale with our current SpMV algorithm.

Lastly, we compared the overall SpMV and SpMM times within Lanczos and LOBPCG to gain an understanding of the GFLOPS rates achieved by SpMV and SpMM.
We noticed that although the ratio between these values favors SpMM on shared-memory CPU architectures, this is no longer the case on distributed GPU architectures. We then attributed this change in the GFLOPS ratio to (i) the difference between CPU and GPU architectures and (ii) the presence of communication overheads in the distributed scenario. We finally concluded that, for these reasons, SpMM-based block eigensolvers such as LOBPCG lose their advantage over SpMV-based eigensolvers such as Lanczos in multi-GPU experiments.

CHAPTER 6
CONCLUSION AND FUTURE WORK

In this thesis, we described different ways to accelerate large-scale sparse eigensolvers. Towards this end, we explored asynchronous runtime systems, hybrid algorithms and heterogeneous architectures equipped with GPUs.

In the first part, we evaluated the performance of three task-parallel programming models, OpenMP, HPX and Regent, on shared memory architectures. We compared their performance against the traditional BSP model for two iterative sparse eigensolvers: Lanczos and LOBPCG. We then demonstrated their merits on two architectures, Intel Broadwell (a multicore processor) and AMD EPYC (a modern manycore processor). We observed that these frameworks achieve up to 13.7× fewer cache misses over an efficient BSP implementation across the L1, L2 and L3 cache layers. They also obtained up to a 9.9× improvement in execution time over the same BSP implementation.

In the second part, after seeing HPX's success in our first work, we wanted to use HPX as the backbone of the large-scale sparse solver and graph analytics framework we planned to develop. However, our experiments with HPX for the distributed Lanczos solver using up to 512 cores suggested that HPX was slower than the hybrid MPI+OpenMP model. Investigating this lack of improvement showed that HPX incurs considerable runtime overhead whenever a network communication call is used. As such, we gave up on utilizing HPX's asynchronous execution model on distributed memory architectures and pivoted our work towards accelerating sparse eigensolvers at the algorithmic level, which laid the foundation for our next work.

In the third part, we examined and compared a few iterative methods for solving large-scale eigenvalue problems arising from nuclear structure calculations besides Lanczos and LOBPCG. In particular, we discussed the possibility of using the block Lanczos method, Chebyshev-filtered subspace iteration and the residual minimization method accelerated by direct inversion of the iterative subspace (RMM-DIIS). Although the RMM-DIIS method does not exhibit rapid convergence when the initial approximations to the desired eigenvectors are not sufficiently accurate, we showed that it can be effectively combined with either the block Lanczos or the LOBPCG method to yield a hybrid eigensolver that has several desirable properties. We described a few practical issues that need to be addressed to make the hybrid solver efficient and robust.

In the fourth part, we accelerated the Lanczos and LOBPCG solvers by improving the communication and computation parts of their distributed SpMV/SpMM algorithms on heterogeneous architectures. We worked on problem sizes with up to 1.3 trillion nonzeros by using up to 1128 GPUs. With these large-scale experiments, we first obtained up to a 2.0× improvement by going from an optimized OpenACC code to a hand-optimized CUDA code for the local SpMM and SpMM-transpose kernels.
We then improved the cost of communication by understanding the limitations of MPI collectives and employing asynchronous point-to-point (P2P) and NCCL calls instead. On Lanczos, the P2P and NCCL improvements paired with CUDA kernels achieved up to 2.89× (P2P) and 4.85× (NCCL) speedup over the MPI/OpenACC scheme. On LOBPCG, which was more demanding due to the higher communication volume of SpMM over SpMV, these P2P and NCCL improvements led to up to 3.0× and 3.1× speedup in the overall solver time.

As future work, we aim to optimize the GPU-parallel solver codes further by overlapping communication and computation correctly via NVSHMEM. We believe NVSHMEM has the potential to hide the communication costs to a better degree through explicit thread block specialization. NVSHMEM can also help with the kernel launch overheads given the iterative nature of our solvers, where GPU kernels are launched and torn down every iteration. Last but not least, we are currently working towards porting our work to AMD GPUs on Frontier. As such, we consider utilizing HIP (Heterogeneous-Compute Interface for Portability) on the compute part and RCCL (ROCm Communication Collectives Library) on the communication part. We want to see how well HIP and RCCL perform on hundreds of AMD MI250x GPUs against OpenACC and MPI-based implementations.

BIBLIOGRAPHY

[1] Md Afibuzzaman, Fazlay Rabbi, M Yusuf Özkaya, Hasan Metin Aktulga, and Ümit V Çatalyürek. Deepsparse: A task-parallel framework for sparse solvers on deep memory architectures. In 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 373-382. IEEE, 2019.

[2] Hasan Metin Aktulga, Aydin Buluç, Samuel Williams, and Chao Yang. Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 1213-1222. IEEE, 2014.

[3] Hasan Metin Aktulga, Chao Yang, Esmond G Ng, Pieter Maris, and James P Vary. Topology-aware mappings for large-scale eigenvalue problems. In European Conference on Parallel Processing, pages 830-842. Springer, 2012.

[4] Abdullah Alperen, Md Afibuzzaman, Fazlay Rabbi, M Yusuf Özkaya, Ümit Çatalyürek, and Hasan Metin Aktulga. An evaluation of task-parallel frameworks for sparse solvers on multicore and manycore CPU architectures. In Proceedings of the 50th International Conference on Parallel Processing, pages 1-11, 2021.

[5] Abdullah Alperen, Hasan Metin Aktulga, Pieter Maris, and Chao Yang. Hybrid eigensolvers for nuclear configuration interaction calculations. Computer Physics Communications, 292:108888, 2023.

[6] Edward Anderson, Zhaojun Bai, Christian Bischof, Susan Blackford, James Demmel, Jack Dongarra, Jeremy Du Croz, Anne Greenbaum, S Hammerling, Alan McKenney, et al. LAPACK users' guide, vol. 9. Society for Industrial Mathematics, 39, 1999.

[7] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. In Euro-Par - 15th International Conference on Parallel Processing, volume 5704 of Lecture Notes in Computer Science, pages 863-874, Delft, The Netherlands, August 2009. Springer.

[8] Bruce R. Barrett, Petr Navratil, and James P. Vary. Ab initio no core shell model. Prog. Part. Nucl. Phys., 69:131-181, 2013.

[9] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken.
Legion: Expressing locality and independence with logical regions. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1-11. IEEE, 2012.

[10] L Susan Blackford, Jaeyoung Choi, Andy Cleary, Eduardo D'Azevedo, James Demmel, Inderjit Dhillon, Jack Dongarra, Sven Hammarling, Greg Henry, Antoine Petitet, et al. ScaLAPACK users' guide. SIAM, 1997.

[11] George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Hérault, and Jack J Dongarra. PaRSEC: Exploiting heterogeneity to enhance scalability. Computing in Science & Engineering, 15(6):36-45, 2013.

[12] George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Piotr Luszczek, and Jack Dongarra. Dense linear algebra on distributed heterogeneous hardware with a symbolic DAG approach. Technical report, Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States), 2012.

[13] Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, and Charles E Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, pages 233-244. ACM, 2009.

[14] Brandon Cook, Pieter Maris, Meiyue Shao, Nathan Wichmann, Marcus Wagner, John O'Neill, Thanh Phung, and Gaurav Bansal. High performance optimizations for nuclear physics code MFDn on KNL. In International Conference on High Performance Computing, pages 366-377. Springer, 2016.

[15] NVIDIA Corporation. Overview of NCCL. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html, 2024. Accessed: 2024-07-21.

[16] Gregor Daiß, Parsa Amini, John Biddiscombe, Patrick Diehl, Juhan Frank, Kevin Huck, Hartmut Kaiser, Dominic Marcello, David Pfander, and Dirk Pflüger. From Piz Daint to the stars: simulation of stellar mergers using high-level abstractions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-37, 2019.

[17] Timothy A Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1-25, 2011.

[18] Jos L. M. Van Dorsselaer, M. E. Hochstenbach, and H. A. Van der Vorst. Computing probabilistic bounds for extreme eigenvalues of symmetric matrices with the Lanczos method. SIAM J. Matrix Anal. Appl., 22(3):837-852, 2000.

[19] Jed A Duersch, Meiyue Shao, Chao Yang, and Ming Gu. A robust and efficient implementation of LOBPCG. SIAM Journal on Scientific Computing, 40(5):C655-C676, 2018.

[20] MD Feit, JA Fleck Jr, and A Steiger. Solution of the Schrödinger equation by a spectral method. Journal of Computational Physics, 47(3):412-433, 1982.

[21] G. H. Golub and R. Underwood. The block Lanczos method for computing eigenvalues. In J. R. Rice, editor, Mathematical Software III, pages 361-377, 1977.

[22] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press, 1996.

[23] Khaled Hamidouche, Ammar Ahmad Awan, Akshay Venkatesh, and Dhabaleswar K Panda. CUDA M3: Designing efficient CUDA managed memory-aware MPI by exploiting GDR and IPC. In 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), pages 52-61. IEEE, 2016.

[24] M. T. Heath. Sparse matrix computations. In The 23rd IEEE Conference on Decision and Control, pages 662-665, 1984.

[25] Thomas Heller, Patrick Diehl, Zachary Byerly, John Biddiscombe, and Hartmut Kaiser.
HPX - an open source C++ standard library for parallelism and concurrency. Proceedings of OpenSuCo, page 5, 2017.

[26] U. Hetmaniuk and R. Lehoucq. Basis selection in LOBPCG. J. Comput. Phys., 218:324-332, 2006.

[27] Torsten Hoefler and Marc Snir. Generic topology mapping strategies for large-scale parallel architectures. In Proceedings of the international conference on Supercomputing, pages 75-84, 2011.

[28] Z. Jia and G. W. Stewart. An analysis of the Rayleigh-Ritz method for approximating eigenspaces. Math. Comput., 70:637-647, 2001.

[29] Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, and Dietmar Fey. HPX: A task based programming model in a global address space. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, pages 1-11, 2014.

[30] Laxmikant V Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications, pages 91-108, 1993.

[31] Jeremy Kepner, David Bade, Aydın Buluç, John Gilbert, Timothy Mattson, and Henning Meyerhenke. Graphs, matrices, and the GraphBLAS: Seven good reasons. arXiv preprint arXiv:1504.01039, 2015.

[32] A. V. Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput., 23(2):517-541, 2001.

[33] Andrew V Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM journal on scientific computing, 23(2):517-541, 2001.

[34] Alexey Kukanov and Michael J Voss. The foundations for scalable multi-core software in Intel Threading Building Blocks. Intel Technology Journal, 11(4), 2007.

[35] Abhishek Kulkarni and Andrew Lumsdaine. A comparative study of asynchronous many-tasking runtimes: Cilk, Charm++, ParalleX and AM++. arXiv preprint arXiv:1904.00518, 2019.

[36] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office, Los Angeles, CA, 1950.

[37] Chuck L Lawson, Richard J. Hanson, David R Kincaid, and Fred T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software (TOMS), 5(3):308-323, 1979.

[38] Wonchan Lee, Elliott Slaughter, Michael Bauer, Sean Treichler, Todd Warszawski, Michael Garland, and Alex Aiken. Dynamic tracing: Memoization of task graphs for dynamic task-based runtimes. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 441-453. IEEE, 2018.

[39] R. Li and Y. Zhou. Bounding the spectrum of large Hermitian matrices. Linear Algebra and its Applications, 435:480-493, 2011.

[40] Pieter Maris, Chao Yang, Dossay Oryspayev, and Brandon Cook. Accelerating an iterative eigensolver for nuclear structure configuration interaction calculations on GPUs using OpenACC. Journal of Computational Science, 59:101554, 2022.

[41] Aaftab Munshi, Benedict Gaster, Timothy G Mattson, and Dan Ginsburg. OpenCL programming guide. Pearson Education, 2011.

[42] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14, 2001.

[43] OpenMP ARB. OpenMP application program interface version 4.0, 2013.

[44] B. N. Parlett. The Symmetric Eigenvalue Problem. Prentice Hall, Englewood Cliffs, 1980.

[45] P. Pulay.
Convergence acceleration of iterative sequences: The case of SCF iteration. Chem. Phys. Lett., 73:393-398, 1980.

[46] Arch D Robison. Cilk Plus: Language support for thread and vector parallelism. Talk at HP-CAST, 18:25, 2012.

[47] Gregory Ruetsch and Massimiliano Fatica. CUDA Fortran for scientists and engineers: best practices for efficient CUDA Fortran programming. Elsevier, 2013.

[48] Y. Saad. Numerical Methods for Large Eigenvalue Problems: Revised Edition. Number 66 in Classics in Applied Mathematics. SIAM, Philadelphia, 2011.

[49] Youcef Saad. Chebyshev acceleration techniques for solving nonsymmetric eigenvalue problems. Math. Comp., 42:567-588, 1984.

[50] Erik Saule, Hasan Metin Aktulga, Chao Yang, Esmond G Ng, and Ümit V Çatalyürek. An out-of-core task-based middleware for data-intensive scientific computing. In Handbook on Data Centers, pages 647-667. Springer, 2015.

[51] Meiyue Shao, H. Metin Aktulga, Chao Yang, Esmond G. Ng, Pieter Maris, and James P. Vary. Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver. Computer Physics Communications, 222:1-13, 2018.

[52] Elliott Slaughter, Wonchan Lee, Sean Treichler, Michael Bauer, and Alex Aiken. Regent: a high-productivity programming language for HPC with logical regions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12, 2015.

[53] Philip Sternberg, Esmond G Ng, Chao Yang, Pieter Maris, James P Vary, Masha Sosonkina, and Hung V Le. Accelerating configuration interaction calculations for nuclear structure. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC08), pages 1-12. IEEE, 2008.

[54] Philip Sternberg, Esmond G Ng, Chao Yang, Pieter Maris, James P Vary, Masha Sosonkina, and Hung Viet Le. Accelerating configuration interaction calculations for nuclear structure. In SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1-12. IEEE, 2008.

[55] Przemysław Stpiczyński. Language-based vectorization and parallelization using intrinsics, OpenMP, TBB and Cilk Plus. The Journal of Supercomputing, 74(4):1461-1472, 2018.

[56] Peter Thoman, Kiril Dichev, Thomas Heller, Roman Iakymchuk, Xavier Aguilar, Khalid Hasanov, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, et al. A taxonomy of task-based parallel programming technologies for high-performance computing. The Journal of Supercomputing, 74(4):1422-1434, 2018.

[57] Hilario Torres, Manolis Papadakis, and Lluís Jofre Cruanyes. Soleil-X: turbulence, particles, and radiation in the Regent programming language. In SC'19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-4, 2019.

[58] James P. Vary, Robert Basili, Weijie Du, Matthew Lockner, Pieter Maris, Dossay Oryspayev, Soham Pal, Shiplu Sarker, Hasan M. Aktulga, Esmond Ng, Meiyue Shao, and Chao Yang. Ab Initio No Core Shell Model with Leadership-Class Supercomputers. In A.M. Shirokov and A.I. Mazur, editors, Proc. of Int. Conf. 'Nuclear Theory in the Supercomputing Era — 2016', pages 15-35, Khabarovsk, Russia, 2018.

[59] Chun-Kun Wang. Selection of parallel runtime systems for tasking models. In 2017 International Conference on Computational Science and Computational Intelligence (CSCI), pages 1091-1096. IEEE, 2017.

[60] Hao Wang, Sreeram Potluri, Miao Luo, Ashish Kumar Singh, Sayantan Sur, and Dhabaleswar K Panda.
MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Computer Science - Research and Development, 26(3):257-266, 2011.

[61] R Clint Whaley, Antoine Petitet, and Jack J Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001.

[62] Kyle B Wheeler, Richard C Murphy, and Douglas Thain. Qthreads: An API for programming with millions of lightweight threads. In 2008 IEEE International Symposium on Parallel and Distributed Processing, pages 1-8. IEEE, 2008.

[63] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC'07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1-12. IEEE, 2007.

[64] D. M. Wood and A. Zunger. A new method for diagonalising large matrices. Journal of Physics A: Mathematical and General, 18(9):1343, June 1985.

[65] Y. Zhou, Y. Saad, M. L. Tiago, and J. R. Chelikowsky. Parallel self-consistent-field calculations via Chebyshev-filtered subspace acceleration. Phys. Rev. E, 74:066704, 2006.

[66] Yunkai Zhou, James R. Chelikowsky, and Yousef Saad. Chebyshev-filtered subspace iteration method free of sparse diagonalization for solving the Kohn-Sham equation. J. Comput. Phys., 274:770-782, 2014.