DISSERTATION: NOVEL PARALLEL ALGORITHMS AND PERFORMANCE OPTIMIZATION TECHNIQUES FOR THE MULTI-LEVEL FAST MULTIPOLE ALGORITHM By Michael Lingg A DISSERTATION Michigan State University in partial fulfillment of the requirements Submitted to for the degree of Computer Science – Doctor of Philosophy 2020 ABSTRACT DISSERTATION: NOVEL PARALLEL ALGORITHMS AND PERFORMANCE OPTIMIZATION TECHNIQUES FOR THE MULTI-LEVEL FAST MULTIPOLE ALGORITHM By Michael Lingg Since Sir Issac Newton determined that characterizing orbits of celestial objects required consid- ering the gravitational interactions among all bodies in the system, the N-Body problem has been a very important tool in physics simulations. Expanding on the early use of the classical N-Body problem for gravitational simulations, the method has proven invaluable in fluid dynamics, molec- ular simulations and data analytics. The extension of the classical N-Body problem to solve the Helmholtz equation for groups of particles with oscillatory interactions has allowed for simulations that assist in antenna design, radar cross section prediction, reduction of engine noise, and medical devices that utilize sound waves, to name a sample of possible applications. While N-Body simu- lations are extremely valuable, the computational cost of directly evaluating interactions among all pairs grows quadratically with the number of particles, rendering large scale simulations infeasible even on the most powerful supercomputers. The Fast Multipole Method (FMM) and the broader class of tree algorithms that it belongs to have significantly reduced the computational complexity of N-body simulations, while providing controllable accuracy guarantees. While FMM provided a significant boost, N-body problems tackled by scientists and engineers continue to grow larger in size, necessitating the development of efficient parallel algorithms and implementations to run on supercomputers. The Laplace variant of FMM, which is used to treat the classical N-body problem, has been extensively researched and optimized to the extent that Laplace FMM codes can scale to tens of thousands of processors for simulations involving over trillion particles. In contrast, the Multi-Level Fast Multipole Algorithm (MLFMA), which is aimed for the Helmholtz kernel variant of FMM, lags significantly behind in efficiency and scaling. The added complex- ity of an oscillatory potential results in much more intricate data dependency patterns and load balancing requirements among parallel processes, making algorithms and optimizations developed for Laplace FMM mostly ineffective for MLFMA. In this thesis, we propose novel parallel al- gorithms and performance optimization techniques to improve the performance of MLFMA on modern computer architectures. Proposed algorithms and performance optimizations range from efficient leveraging of the memory hierarchy on multi-core processors to an investigation of the benefits of the emerging concept of task parallelism for MLFMA, and to significant reductions of communication overheads and load imbalances in large scale computations. Parallel algorithms for distributed memory parallel MLFMA are also accompanied by detailed complexity analyses and performance models. We describe efficient implementations of all proposed algorithms and opti- mization techniques, and analyze their impact in detail. 
In particular, we show that our work yields significant speedups and much improved scalability compared to existing methods for MLFMA in large geometries designed to test the range of the problem space, as well as in real world problems. ACKNOWLEDGEMENTS I would like to thank all of the staff at Grand Valley State University for starting me on this journey, and the staff of Michigan State University for guiding me to its present conclusion. Drs Greg Wolffe, Paul Jorgensen, Jonathan Engelsma and Christian Trefftz at GVSU were particularly instrumental in giving me the confidence to learn new skills above and beyond the ones they thought. At MSU, Dr Metin Aktulga was tireless in his efforts being my primary advisor while creating this PhD thesis. Dr Aktulga was always available to discuss any question regarding any of the work and provided very calm and measured guidance. The entire process felt more like a series of minor steps, rather than major hurdles to overcome. Dr Shanker Balasubramaniam provided an excellent example of what an expert in the field should strive for. His depth of knowledge seemed endless, yet he could make the most technical of topics make sense. Dr Stephen Hughey laid the groundwork for a majority of my research at MSU. After completing his PhD, Dr Hughey continued to provide his knowledge for continued research. Of particular values was his explanations of not only how the program worked but how the parts of the program and the theory behind the algorithm fit together. Finally I would not be here without my wife Sara’s support. While putting up with the long hours, she made sure I could do what needed to be done, and stood by me over four long years. iv TABLE OF CONTENTS . . . . . . . . . . . . . . INTRODUCTION . LIST OF TABLES . . . LIST OF FIGURES . . . LIST OF ALGORITHMS . CHAPTER 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii . xi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1 Building a Tree Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Tree Traversal 8 . 10 1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 2 OPTIMIZATION OF THE SPHERICAL HARMONICS TRANSFORM . . . . . . . . . . . . 2.1 Introduction . 2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . BASED TREE TRAVERSALS IN THE HELMHOLTZ FMM ALGORITHM 12 . 12 . 14 2.2.1 Globally Interpolated Tree Traversal (M2M/L2L) in MLFMA . . . . . . . 14 2.2.2 Spherical Harmonics based Tree Traversal . . . . . . . . . . . . . . . . . . 15 2.3 Algorithms & Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.1 Implementation of the Spherical Harmonics Transform Method . . . . . . . 16 2.3.2 Vector Reuse in Forward/Backward Matrix Multiplications . . . . . . . . . 21 2.3.3 Combining Forward/Backward Matrices . . . . . . . . . . . . . . . . . . . 21 2.3.4 Level by Level Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4 Numerical Results and Performance Evaluation . . . . . . . . . . . . . . . . . . . 27 2.4.1 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
33 CHAPTER 3 HIGH PERFORMANCE EVALUATION OF HELMHOLTZ POTEN- . . . . . . . . . . . . Introduction . 3.1 3.2 Background . 3.3 Algorithms . . . . . . . TIALS USING THE MULTI-LEVEL FAST MULTIPOLE ALGORITHM . 34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.1 Tree Construction and Setup . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.2 Parallel Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.2.1 C2M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.2.2 M2M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.2.3 M2L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.3.2.4 . 3.3.2.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.4.1 3.4.2 Translation (M2L) 3.4.3 Anterpolation (L2L) Interpolation (M2M) L2L . L2O . 3.4 Numerical Methods . . . . v Scalability . . 3.5 Numerical Results and Performance Evaluation . . . . . . . . . . . . . . . . . . . 55 3.5.1 Load Balance with the Fine Grain Parallel Algorithm . . . . . . . . . . . . 56 . 58 3.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.5.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.5.5 Memory Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Performance Comparison with Other Codes . . . . . . . . . . . . . . . . . 65 3.5.6 3.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Process Alignment . . . CHAPTER 4 EXPLORING TASK PARALLELISM FOR THE MULTILEVEL FAST- . . . . . . . . . . . . . . . . Introduction . 4.3 Methods . 4.1 4.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 4.2.2 Related Work . 4.2.3 Contributions . . MULTIPOLE ALGORITHM . . . . . . . . . . . . . . . . . . . . . . . . . 67 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 . 69 Fast Multipole Method (FMM) . . . . . . . . . . . . . . . . . . . . . . . . 69 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.3.1 MLFMA with BSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.3.1.1 Near-field Computations (NF) . . . . . . . . . . . . . . . . . . . 72 4.3.1.2 Upward Tree Traversal (C2M and M2M) . . . . . . . . . . . . . 73 4.3.1.3 Translations (M2L) . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.3.1.4 Downward Tree Traversal (L2L and L2O) . . . . . . . . . . . . . 74 4.3.2 Task Parallel MLFMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3.2.1 Near-field Computations (NF) . . . . . . . . . . . . . . . . . . . 74 4.3.2.2 Upward Tree Traversal (C2M and M2M) . . . . . . . . . . . . . 75 4.3.2.3 Translations (M2L) . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.3.2.4 Downward Tree Traversal (L2L and L2O) . . . . . . . . . . . . . 78 Task ordering . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . 78 4.3.2.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.4.1 Tuning the Task-Parallel MLFMA Implementation . . . . . . . . . . . . . 80 Task Generation Ordering . . . . . . . . . . . . . . . . . . . . . 80 4.4.1.1 Task Bundling for M2L . . . . . . . . . . . . . . . . . . . . . . 81 4.4.1.2 Performance Comparison between BSP and Task Parallel Implementations . 82 4.4.2.1 Performance on a Multicore Architecture (Cori-Haswell) . . . . . 83 4.4.2.2 Manycore Architecture (Cori-KNL) . . . . . . . . . . . . . . . . 83 4.4.3 Understanding the Reasons behind Observed Differences . . . . . . . . . . 85 4.4.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.4.3.2 Cache Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.5 Conclusions . Timeline Analysis 4.4 Results . 4.4.2 . . . . . . . . . . . CHAPTER 5 FUTURE WORK . BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 . 93 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi LIST OF TABLES Table 2.1: Speedup times for each Spherical Harmonics Transform implementation rel- ative to the base code. Each column gives relative speedup for a particular level, the last column gives the overall speedup results. The first line for the base implementation reports the total execution time in seconds. . . . . . . . . . 28 Table 3.1: Performance of the MLFMA algorithm on the 512 grid geometry. . . . . . . . 59 Table 3.2: Performance of the MLFMA algorithm on the 32 volumetric geometry. . . . . . 59 Table 3.3: Performance of the MLFMA algorithm on the 384 diameter sphere geometry. . 60 Table 3.4: Comparison of the number of packets sent between Rank Ordered and Process Aligned schemes for the 512 grid geometry in millions of packets sent. . . . . . 63 Table 3.5: Total memory utilization (in GBs) by the three largest data structures for the 512 grid geometry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Table 3.6: Total memory utilization (in GBs) by the three largest data structures for the 32 volume geometry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Table 3.7: Comparison of BEMFMM vs our parallel MLFMA implementation (referred to as “this work"). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 vii LIST OF FIGURES Figure 1.1: 2 interaction scaling, assuming 1 billion interactions per second. . . . . . . . . Figure 1.2: Point to point vs hub interactions. . . . . . . . . . . . . . . . . . . . . . . . . . Figure 1.3: Point to point interactions shown on the top, while farfield interactions are shown in the middle, and nearfield interactions on the bottom. . . . . . . . . . . Figure 1.4: An FMM tree corresponding with a 2 dimensional problem space square. . . . . Figure 1.5: Multi-level interactions. The source node is red, nodes with nearfield inter- actions are grey and nodes with farfield interactions are blue. . . . . . . . . . . Figure 1.6: Top-down view of the lowest two levels of an FMM tree, illustrating the key computational kernels. Particles (green dots) are mapped onto leaf nodes (small squares) in an octree partitioning. . . . . . . . . . . . . . . . . . . . . . 3 4 5 6 7 8 Figure 2.1: Spherical Harmonics Filter Step 1. . . . . . . . . . . . . . . . . . . . . . . . . 17 Figure 2.2: Spherical Harmonics Filter Steps 2 and 3. . . . . . . . . . . . . . . . . . . . . 
19 Figure 2.3: Spherical Harmonics Filter Step 4. . . . . . . . . . . . . . . . . . . . . . . . . 20 Figure 2.4: Spherical Harmonics Filter with combined Forward/Backward matrices. . . . . 22 Figure 2.5: Reorganizing data for level by level processing. . . . . . . . . . . . . . . . . . . 24 Figure 2.6: Spherical Harmonics Transform operation applied on a level by level basis (combined forward/backward version is shown). . . . . . . . . . . . . . . . . . 25 Figure 2.7: Execution time for different Spherical Harmonics Transform implementations for an 11 level surface geometry. . . . . . . . . . . . . . . . . . . . . . . . . . 29 Figure 2.8: DGEMM Vs ZGEMM optimization. . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 2.9: Analysis of blocking data around Forward/Backward FFTs. . . . . . . . . . . . 32 Figure 2.10: Analysis of DGEMM blocking on nodes (note, updated graph needed) . . . . . 32 Figure 3.1: Spatial vs Direction Partitioning. . . . . . . . . . . . . . . . . . . . . . . . . . 35 Figure 3.2: Parallel Farfield MLFMA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 viii Figure 3.3: Graphical illustration of the transposition and folding operations during fine grained parallel interpolation of multipole data of child node  to parent node  for  = 3,  = 4 (for ) and  = 5,  = 6 (for ) using 3 processes each of which owns a column of the initial multipole data as indicated by the hashmarks. Multipole data from (a) is interpolated along the  dimension locally, leading to the multipole expansions in (b). The folding operation acting on the interpolated data is shown by the repositioning of the data as in (c). The hash marks show how the folded data is stored on the wrong processes, and must be communicated to the correct process as shown in (d). With the entire multipole data columns in the  direction being available on each process, another set of interpolations are performed locally (e), which is then transposed and folded to yield the final multipole expansions (f). A final communication step is needed to send each  vector to their owner processes (g) which can then be shifted to the center of the parent box and added to the parent’s multipole expansions in accordance with the spherical symmetry condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Figure 3.4: This image shows how multipole samples, ordered clockwise starting at 12 o’clock, are assigned to processes owning the children nodes (C1-C5) overlap with processes owning the parent node (P1-P5) when assigned in process rank order on the left, and with our realignment scheme on the right. The darker portions on the parent samples show the regions where the parent node data overlap with the child node data, and are essentially local data that do not require communication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 3.5: A translation operation between two plural nodes shared by different numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 of processes. . . . Figure 3.6: Process execution times for a grid and a sphere geometry. . . . . . . . . . . . . 57 Figure 3.7: Actual vs. estimated computational complexity. The left subfigure shows results for the surface geometry, while the right subfigure is for the volume geometry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 . . . . . . Figure 3.8: Actual vs estimated communication volume. 
The left subfigure shows results for the surface geometry, while the right subfigure is for the volume geometry. . 61 Figure 3.9: Message counts vs expected Big-O message counts. The left diagram shows analysis of a surface geometry, while the right diagram shows analysis of a volume geometry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 . . Figure 4.1: Direction Vs Plane Wave Partitioning. . . . . . . . . . . . . . . . . . . . . . . 68 Figure 4.2: Interactions with 16 processes (green dashed lines) Vs 1 process and 16 threads (lighter green dotted lines). . . . . . . . . . . . . . . . . . . . . . . . . 69 ix Figure 4.3: Dependencies between boxes within an FMM octree due to the nearfield and farfield computation process. . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Figure 4.4: Impact of task order on execution time. . . . . . . . . . . . . . . . . . . . . . . 81 Figure 4.5: Impact of bundling on execution time. Performance of M2M implementation with all children bundled into a single task (M2M with Bundling) and M2L with various bundling factors (none, 9, 27, and 189) are shown for different thread counts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 . . . . Figure 4.6: Task vs Loop (BSP) parallel runs on Haswell compute nodes for four different geometries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 . . . . . . . . . . Figure 4.7: Task vs Loop (BSP) parallel runs on KNL compute nodes for four different geometries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Figure 4.8: BSP timeline on grid geometry. . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Figure 4.9: Task parallel timeline on grid geometry. . . . . . . . . . . . . . . . . . . . . . 87 Figure 4.10: BSP timeline on the sphere geometry. . . . . . . . . . . . . . . . . . . . . . . . 87 Figure 4.11: Task parallel timeline on the sphere geometry. . . . . . . . . . . . . . . . . . . 88 Figure 4.12: BSP timeline on the airplane geometry. . . . . . . . . . . . . . . . . . . . . . . 88 Figure 4.13: Task parallel timeline on the airplane geometry. . . . . . . . . . . . . . . . . . 89 Figure 4.14: Comparison of the cache performance of BSP and task parallel implementa- tions on a Cori-Haswell node. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 x LIST OF ALGORITHMS Algorithm 3.1: Multipole-to-multipole interpolation . . . . . . . . . . . . . . . . . . . . 41 Algorithm 3.2: M2L Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Algorithm 4.3: Task-based upward tree traversal . . . . . . . . . . . . . . . . . . . . . . 76 Algorithm 4.4: Parallel Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Algorithm 4.5: Task-parallel translations . . . . . . . . . . . . . . . . . . . . . . . . . . 77 xi CHAPTER 1 INTRODUCTION Simulations of many many physics phenomena can be broken down into the interaction between two or more bodies. In the 17th century, Sir Issac Newton discovered that knowing the trajectory of a planet or moon was not sufficient to determine the orbit of the body. He found that the gravitational interaction of each body with all other bodies in the system had to be considered. Newton’s discovery is one of the earliest known definitions of the N-Body problem. An N-Body simulation might range from gravitational interaction between two small bodies at the most basic, to simulations of interactions between a trillion separate particles Potter et al. (2017). 
From the early use of the N-Body problem to compute movement of bodies in the solar system, use of the N-Body problem has extended to design and development of many modern technologies. A small subset of technologies includes sonar and radar distance measuring equipment, medical imaging technologies, and wireless communication. One of the earliest N-Body simulations was performed by Erik Holmberg using lamps and photo cells to simulate gravity interactions between stars. In 1956, Sebastian von Hoerner be- gan developing computer simulations for gravitational interactions, von Hoerner (2001). Typical gravitational simulations solve Laplace’s equation for N-Body interactions, Greengard & Rokhlin (1987). This case computes the pairwise interactions of all particles in the form:  =1 ࣔ  ( ) =  |  − | . (1.1) Beyond gravitational interaction, applications of Eq. 1.1 exist in the field of fluid dynamics (Salmon & Warren (1994)), molecular simulation (Shimada et al. (1994)) and atomic simulations (Ding et al. (1992)), among many others. Another class of N-Body problems uses the Helmholtz equation to explore oscillatory interactions between particles, Greengard (1988). For the Helmholtz N-Body problem, Vikram & Shanker (2009), pairwise interactions of all particles are computed in 1 the form:  =1 ࣔ  () (cid:17) ( − ), where  is Green’s function for the Helmholtz equation: () = −  || 4|| . (1.2) (1.3) In Eq. 1.3,  = 2/ is the wave number in rad/m and  is the wavelength in meters. The need to track the waveform of the oscillatory interactions adds additional complexity to the Helmholtz N-Body problem, that is not found in Laplace kernels, where only the magnitude needs to be considered. The impact of this added complexity will be discussed in detail at the end of section 1.2. Some applications of electromagnetic wave modelling include microwave medical devices, radar wave prediction, and studying how antennas interact with their environment, for example a small antenna on a large naval vessel made of metal. Uses of Helmholtz kernels are not limited to electromagnetic fields, but are applicable to any N-Body problem dealing with oscillatory interactions. Acoustic applications can range from medical imaging, Huttunen et al. (2005), to modelling sound waves generated in automotive, Chaigne et al. (2007), or aviation, LóPez-PortuguéS et al. (2012), and other applications, Ergul (2011). Potential applications can range as far as predicting how light interacts with certain materials as investigated by Burresi et al. (2014). Within the N-Body simulation, more bodies allows for a larger and more complex scenario, or can be used to increase the precision of the simulation. In gravity simulations, for example, simulating large galaxies by only simulating larger stars, or clusters of stars, allows for simulations that run in less time, but loses some of the complex interactions of a larger simulation. Simulating each star in the entire Milky Way galaxy would require 100 billion particles, and the Andromeda Galaxy would require 1 trillion particles, requiring much more computational power to simulate in a reasonable amount of time. Helmholtz simulations have similar considerations that lead to a large 2 800 600 400 200 0 s y a D 0 0.5 1 1.5 Bodies 2 2.5 ·108 Figure 1.1: 2 interaction scaling, assuming 1 billion interactions per second. number of particles interacting. 
Precisely simulating small features on large wavelength surfaces, such as the previously mentioned antenna on a naval vessel, requires small wavelengths to precisely model the antenna, but ship itself is very large in terms of the same wavelengths. In addition to the potential problem space being very large, the (2 ) cost of a complete simulation, where  is the number of bodies, or degrees of freedom, means the computation cost increases quadratically with problem size. Figure 1.1 shows how computing 2 interactions, even at an assumed 1 billion interactions computed per second, quickly increases from a little over 16 minutes to compute 1 million particles, to over two years to compute 256 million interactions. To reduce the amount of computation necessary, an approximation can be computed for groups of bodies, and interactions can be computed between these approximations. This approach is very similar to how hub and spoke communication can be used to reduce the number of connections required vs point to point communication, as shown in figure 1.2. 3 Figure 1.2: Point to point vs hub interactions. 1.1 Building a Tree Algorithm Farfield vs Nearfield The primary method of producing group interactions for N-Body solution has been to separate the interactions into nearfield and farfield interactions. As shown by Appel (1985), interactions between any group of particles that is well separated from another group can be computed between approximations of the groups with controllable accuracy, Dembart & Yip (1998). While Appel (1985) used monopole approximations, since Greengard & Rokhlin (1987), the approximations are typically created using multipole expansions, thus the common name for the algorithm, the Fast Multipole Method (FMM). The multipole expansion is a mathematical series created by sampling the outgoing radiation pattern of all particles contained within a unit sphere, Engheta et al. (1985). Interactions using these multipole expansions provide controllable accuracy based on the number of multipole samples, more samples means more accuracy, but at an increased computational cost. The result of the group interaction is then distributed to the particles within the group, while bodies within each group still interact with each other directly. The top of figure 1.3 illustrates how all bodies would interact with a single body in a standard N-Body solution. The middle of figure 1.3 illustrates the farfield interaction process: 1. The group of source particles (on the left) is approximated. 2. The source group interacts with the observer group (on the right). 3. The observer group distributes the farfield interaction to the particles in the group. 4 Figure 1.3: Point to point interactions shown on the top, while farfield interactions are shown in the middle, and nearfield interactions on the bottom. The bottom of figure 1.3 the illustrates how the nearfield interactions of the bodies within each well separated group would be calculated. The farfield and nearfield interactions are summed together to compute the potential observed by each particle due to the interaction with all other particles. If this approach is constrained to a single level of approximations, it is shown to reduce the complexity to () 4 3 for a volume and () 3 2 for a surface, Coifman et al. (1993b). This method of computing interactions via hub-like group approximations can be extended further. 
Interactions with a group of approximations that are well separated from another group of approximations can themselves be approximated as a single radiation pattern, providing mul- tiple levels of approximations. The multi-level approach allows the farfield approximations to be organized into a tree. The leaf level contains nodes representing the approximation of the particles located in the area of the problem space occupied by the leaf node, and higher levels contain nodes representing radiation pattern approximations of groups of child nodes. 5 Tree Construction While higher level approximations are created from groups of lower level approximations, the tree structure is typically created from top down, before approximations, and interactions between approximations, are calculated, Barnes & Hut (1986). First the entire problem space is placed inside of a cube. This cube forms the root node of the FMM tree. The cube is evenly divided into 8 smaller cubes, forming the second level of the tree. Cubes are further subdivided, creating more levels and producing an octree. The same method is used with both volume and surface geometries, but a 2-d plane of particles cutting across the cube will typically populate 4 nodes per level with particles, rather than 8, creating a quadtree. Figure 1.4: An FMM tree corresponding with a 2 dimensional problem space square. Figure 1.4 illustrates a 2 dimensional representation of the tree construction. The entire geometry is contained by the root node in red. The second level evenly divides each dimension in half, producing four orange nodes in level 2. The subdivision continues, producing 16 green nodes in level 3 and 64 blue nodes in level 4 (all nodes are not displayed in the tree as it would exceed the page width). In the example in figure 1.4, level 4 would represent the leaf level, but an FMM tree could continue subdividing the geometry. Creation of lower levels of the tree typically 6 ends when the lowest level nodes are around 0.25, the point where the multipole expansion begins to appear non-oscillatory, however tree memory use and leaf node particle density also need to be considered. Too many points per node increases the computation time of nearfield, while too few points per node increases the computation time of farfield. With the tree constructed, the tree traversal process can be used to compute the farfield interactions, while nearfield interactions are computed directly. Multi-Level FMM Figure 1.5 illustrates an example of 2 dimensional multi-level node interac- tions. At the highest level, Level 3, the source node in red performs farfield interactions with all other nodes that are not an immediate neighbor. Moving down a level, farfield interactions can be performed with nodes which are not immediate neighbors of the child node. Child nodes of nodes which were farfield nodes for the parent have already computed farfield interactions at the higher level, so do not need computation at the lower level. This process continues until the lowest level of computation is reached, and all opportunities for farfield interactions have been exhausted. At the lowest level, the leaf level of the tree, direct interactions must be computed between particles within the source box, as well as between these particles and the particles within immediate neigh- bor boxes. The multi-level approach reduces the computation cost to ( log ) for a surface geometry, and () for a volume geometry. 
While the nearfield interactions have been limited to particles only interacting within their own box and with particles in the immediate neighboring boxes, and now cost () to process. Figure 1.5: Multi-level interactions. The source node is red, nodes with nearfield interactions are grey and nodes with farfield interactions are blue. 7 Figure 1.6: Top-down view of the lowest two levels of an FMM tree, illustrating the key computational kernels. Particles (green dots) are mapped onto leaf nodes (small squares) in an octree partitioning. 1.2 Tree Traversal The stages of farfield computation during tree traversal are illustrated in Fig. 1.6 and defined as follows: 1. Charge-to-multipole (C2M): This stage calculates leaf node multipole expansions. Particles inside each leaf node are summarized into the multipole expansion for that node. 2. Multipole-to-multipole interpolation (M2M): M2M stage calculates the multipole expan- sions (approximations) in parent nodes. Starting at the leaf nodes and traversing up the tree, child nodes interpolate and shift their multipole expansions to the centers of their parent boxes. Once all of its children nodes are interpolated and shifted, the parent node aggregates the interpolated multipole expansions of its children. 3. Multipole-to-local translation (M2L): M2L stage translates the multipole expansions of a source box to the center of observer boxes within its far field as defined according to the well separateness criteria. Multipole contributions at each observer box is then aggregated to account for the effect of distant particles. 4. Local-to-local anterpolation (L2L): L2L distributes the far field contributions in parent nodes (as calculated by the M2L stage) to each of its children. L2L is essentially the reverse of the M2M stage; it is also called the downward tree traversal or the anterpolation stage. 8 5. Local-to-observer (L2O): Once multipole expansions are percolated all the way down to the leaf nodes, they must be mapped back onto the particles. This is done in the L2O stage, which involves calculation of the potential observed by each particle based on the multipole information summed up in its enclosing box. Nearfield The nearfield interactions can be computed during any phase of farfield computations as this operation computes the point to point interaction of particles in the same, or neighboring, leaf boxes. The only consideration is that the results of nearfield interactions must be summed with the results of L2O for each particle. Laplace vs Helmholtz FMM As discussed earlier, common FMM variants can be grouped into two categories: Laplace FMM (L-FMM), where interactions are between particles are non- oscillatory particle interactions, such as gravitational and electrostatic particles, and Helmholtz FMM (MLFMA), Dembart & Yip (1995); Nishimura (2002); Shanker & Huang (2007), where interactions are oscillatory (i.e., wave problems such as electromagnetic and acoustic particles). In L-FMM, since the interaction is non-oscillatory, only the magnitude information needs to be approximated in the multipole expansion. As such, the number of terms, or samples, of the multipole expansion is constant for all nodes across all levels of the tree. Consequently, most of the work for L-FMM is at the lowest level of the tree, Ying & Zorin (2004); Sundar et al. (2008); Agullo et al. (2014); Salmon (1991); Wang et al. (2019); Ying et al. (2004), because a majority of the tree nodes are located at the leaf level. 
In contrast, in MLFMA, both magnitude and phase information must be stored due to the oscillatory nature of the interaction. To ensure controllable accuracy, the amount of multiple expansion data in a tree node must be approximately quadrupled as one ascends the tree. In addition, most MLFMA problems involve surface geometries (which implies that the number of tree nodes decreases roughly by a factor of 4 at each level). When combined with the increase in node size, all levels of the MLFMA tree have similar computational costs. In practice, higher tree levels often require more computations in MLFMA. 9 1.3 Thesis Outline Existing work on FMM, and specifically MLFMA, has provided significant performance im- provements by converting the (2 ) N-Body problem into an ( log ) or () (depending on the problem geometry) tree algorithm. The continued growth of the size of N-Body problems has led to the need for efficient parallel FMM algorithms. Laplace FMM methods have shown excellent scalability, Hamada et al. (2009); Ishiyama et al. (2012); Rahimian et al. (2010); Lashuk et al. (2012), but methods to parallelize Helmholtz MLFMA continue to lag behind due to the added complexity of the oscillatory interactions, Ergül (2011); Ergul & Gurel (2013); Michiels et al. (2013a); Taboada et al. (2013); Waltz et al. (2007). In the remainder of this report, we present a number of algorithms that provide performance and parallel scalability improvements for MLFMA applications. The content is organized as follows. Chapter 2 describes a step by step method to improve the performance of the spherical har- monics transform. These improvements focus primarily on data locality and cache optimizations of the matrix vector operations. Numerical results show the performance improvements of the optimizations nearing 10x speedups over the original algorithm. Chapter 3 describes a fine grain parallel method that divides the samples and work of the high level nodes of the MLFMA tree evenly among the processes. This method improves the balance of process loads when the number of nodes in a level is less than the number of process, and balances tree storage across processes. The chapter also includes a detailed complexity model of the algorithm. Finally, numerical results are included that show the load balancing effect of the algorithm and the accuracy of the complexity model, and a comparison showing up to 164x speedup when compared with with an existing MLFMA algorithm. Chapter 4 performs a comparison of bulk synchronous parallel and task parallel approaches for implementing MLFMA. Task parallel implementation is expected to best handle process communi- cation if integrated with the distributed memory parallel code, and this comparison demonstrates that 10 task parallel implementation is efficient. The chapter includes a description of the two implemen- tations, as well as considerations of performance trade offs with the task parallel implementation. The chapter wraps up with a performance comparison of the two shared memory parallel methods, comparing thread activity, cache utilization analysis, and execution time comparisons showing task parallel outperforming the bulk synchronous parallel version under most scenarios, with up to 1.4x speedup. Chapter 5 looks at possible future research of balancing global and local interpolation, integrating thread based parallel methods into the MLFMA algorithms and improving load balancing for unbalanced trees. 
11 CHAPTER 2 OPTIMIZATION OF THE SPHERICAL HARMONICS TRANSFORM BASED TREE TRAVERSALS IN THE HELMHOLTZ FMM ALGORITHM 2.1 Introduction In this chapter, a step by step method of improving the performance of the spherical harmonics transform is provided. A naive implementation of spherical harmonics transforms results in data alignment that is very inefficient for data locality. An updated spherical harmonics transform algorithm will be provided that arranges the data to optimize cache use. Vertical Tree Traversal (M2M/L2L) in MLFMA Multipole expansions are simply functions defined on the unit sphere (, ) ∈ [0, 2] × [0, ] and sampled at locations (,  ) for  = 1, . . . , , and  = 1, . . . , . The sampling rates for a given tree node at some level  are governed √ 3 ()(cid:99) + 1, where () is the box diameter at level by its bandwidth (), defined by () = (cid:98)  ,  is the wave number defined in (1.3),  ≥ 1.0 is a parameter that can be tuned to control the accuracy of the MLFMA evaluation (higher  values yield more accurate simulations at the expense of higher computational cost). Two main techniques used in traversing up/down the MLFMA tree are local interpolation Song & Chew (1995); Chew et al. (1997); Zhao & Chew (2000); Ergul & Gurel (2006) and global interpolation techniques Jakob-Chien & Alpert (1997); Shanker et al. (2003); Sarvas (2003). Local Local interpolation focuses on interpolating a local region of the multipole data. This method requires a boundary samples from neighboring regions, usually stored on other processes, to perform the interpolation. The boundary samples can be as few as only the samples immediately neighboring the local region. This means that it is relatively easy to limit the bandwidth, as shown in Michiels et al. (2013b) during interpolation. Also the he number of messages between processes can be reduced, as shown in Yang et al. (2019), during interpolation. A further possible benefit 12 is processes are assigned a slice of samples from all node of a given level, of the Yang et al. then during translation the process owns all samples that must be translated so no communication is required. The trade off is that local interpolation introduces errors in the interpolation result. This error can be reduced by increasing the number of multipole samples in both the  and  direction. The error can also be reduced by calculating the local interpolation using boundary samples beyond the immediate neighboring samples. The increase of boundary samples can result in needing to communicate with processes beyond the neighboring ones. The best results are found by combining these two methods, but any improvement in error here is at the cost of additional memory or communication requirements. In contrast to global interpolation requires all of the samples during interpolation. This Global means if the samples are partitioned between processes, all processes storing some samples of a node must be involved in communication to interpolate the node. Thus partitioning options such as assigning slices of samples from all nodes of a level is less efficient as interpolation with this partitioning would require communication between all processes to perform interpolation. The benefit of global interpolation is due to interpolating with all of the samples, the interpolation result can be exact with a minimal number of multipole samples. For this reason, we focus on global interpolation methods. Problem Statement and Contribution In this chapter, we focus on the MLFMA algorithm. 
While the literature on optimizing the performance of L-FMM implementations is abundant, existing work in Laplace FMM can not readily be transferred to an MLFMA implementation due to the reasons outlined above. In MLFMA, interpolation and anterpolation operations (which together are also referred to as tree traversal operations) related to the creation of multipole expansions across the tree constitute one of the most expensive parts. Therefore optimization of these tree traversal operations is important, and provides the main motivation for this work. We begin by describing the fast spherical filter based inter/anterpolation operators in Section 2.2. Then in Section 2.3, we provide a basic implementation of tree traversals with spherical filtering and present 13 a series of techniques that successively leads up to our fully optimized tree traversal implementation for MLFMA. In Section 2.4, we demonstrate the impact of the presented techniques through an extensive set of tests. Related Work The MLFMA literature is quite extensive, and the works discussed in this chapter represent only a sampling of those discussing the relevant issues. The single-level MLFMA (without inter/anterpolation) was introduced in Coifman et al. (1993a), providing a prescription for its implementation. Implementation issues for the multilevel version are discussed, e.g., in Song et al. (1997); Darve (2000); Vikram et al. (2009). The problem of inter/anterpolation is addressed in depth in Jakob-Chien & Alpert (1997); Shanker et al. (2003); Sarvas (2003); Ergul & Gurel (2006); Cecka & Darve (2013a), yielding a number of different methods of varying cost and accuracy. In fact, the techniques described in these studies provide the background for the present work. Wedi et al. Wedi et al. (2013) describe performance optimization techniques for a similar problem the field of numerical weather prediction. However, the algorithm presented in Wedi et al. (2013) involves a complicated compression technique which trades accuracy for lower computational complexity, but is only efficient at regimes different than those tackled in this chapter. Additionally, the optimization techniques we present do not compromise the accuracy of the underlying operations. 2.2 Background 2.2.1 Globally Interpolated Tree Traversal (M2M/L2L) in MLFMA Previously mentioned, our chosen method of interpolation is global interpolation. Among the global techniques, one can choose the spherical harmonics transforms (SHT) Jakob-Chien & Alpert (1997) or the fast Fourier transforms (FFT) Sarvas (2003) based inter/anterpolations. Spherical harmonics and Fourier transforms require O(()3) and O(()2 log ()) operations, respectively. Spher- ical harmonics based inter/anterpolations, while asymptotically more expensive, utilize optimal (minimal) sampling rates on the unit sphere by sampling at Gauss-Legendre nodes in the  direc- tion, and picking uniformly spaced samples in the  direction Jakob-Chien & Alpert (1997). More 14 specifically, it requires () = () + 1 samples in the  direction, and () = 2() + 1 samples in the  direction for a given level . As such, spherical harmonics transforms are advantageous in terms of memory (up to a factor of 2) and computational costs (by about a factor of 4 to 8 depend- ing on inputs and hardware) for several levels above the leaf nodes, and therefore our MLFMA implementation is based on this technique. Also as discussed in our recent work Hughey et al. 
(2018), one can easily transition from spherical harmonics sampling to the full uniform sampling as required by the fast Fourier transforms (FFT), when the asymptotic costs of spherical harmonics transforms start taking over. 2.2.2 Spherical Harmonics based Tree Traversal Before we move to implementation and optimization details, we review the steps of the spherical harmonics transform (SHT) method to perform interpolation (M2M) and anterpolation (L2L) in the MLFMA tree. Interpolation is used to increase the sampling rate of multipole expansions to that of their parents’ so that diagonal shift operators may be employed as one moves up the tree. In anterpolation, the reverse is done, i.e. the parent box multipole expansions are shifted to their children and decimated to the child’s sampling rate. The steps outlined below focus on the upward traversal (M2M) operation; we note the nuances for downward traversal (L2L) at the end of this subsection. 1. Compute Fourier coefficients  () by performing FFTs for each set of samples in the  direction from 1, . . . , (): f() = √ 2 () () =1  (, ) (2.1) where,  = 2/() for  = 1, . . . , () and  (, ) is the multipole expansion for a tree node along the  direction for a given . 15 2. Compute the spherical harmonic coefficients    for |  |≤  ≤ (cid:48) = () by applying Legendre transforms to the Fourier coefficients calculated in the previous step: f  =   ¯  (cos  )  ( ), (2.2) where 1, . . . ,   denote the corresponding Gauss-Legendre quadrature weights, and ¯ denote the normalized associated Legendre functions. 3. Calculate the filtered Fourier coefficients(cid:101) ( ) from spherical harmonic coefficients (    ) up  (cos ) to degree (cid:48) by a backward Legendre transform: () =1 (cid:101)f() = (cid:48) =||  (cos ). ¯    (2.3) 4. Finally, the filtered multipole values (cid:101) (,  ) are calculated by an inverse FFT, using Fourier coefficients up to degree (cid:48):(cid:101)f(, ) = (cid:101) () (−1) √ 2 (2.4) =−(−1) The L2L operation consists of the same steps but with (cid:48) = ( + 1) < () to accomplish the filtering. In M2M, the shift operator is applied post-interpolation, whereas in L2L the shift is applied before the anterpolation. 2.3 Algorithms & Numerical Methods 2.3.1 Implementation of the Spherical Harmonics Transform Method First, we discuss in detail the base algorithm used to implement the SHT method. This will provide a detailed explanation of why certain design decisions were made, allowing for a step by step description of the improvements from this base algorithm. Overview In SHT, for a given tree level , the multipole expansion has () samples in the  direction and () samples in the  direction. To aid in the presentation, () will be referred to 16 Figure 2.1: Spherical Harmonics Filter Step 1. as  and () will be referred to as . The multipole expansion information is stored as a matrix of size  × ; we denote this matrix by M. To exploit data locality due to reasons that will be discussed below, our base code (written in FORTRAN) stores  samples as columns of M. During interpolation, the number of samples is increased to ( − 1) samples (i.e., roughly doubled) in the  direction, and to (−1) samples (i.e., again roughly doubled) in the  direction. Recall that anterpolation reduces the number of samples in the  and  directions by about half to ( + 1) and ( + 1), respectively. In both cases the target number of  samples will be referred to as ˆ and the target number of  samples will be referred to as ˆ. 
Thus in both interpolation and anterpolation, a multipole expansion matrix M of size  ×  is resampled to a multipole expansion matrix ˆM of size ˆ × ˆ. SHT performs calculations over  = −, . . . , . In our implementation, we use  indexes from 0, . . . ,  to represent the range of  as shown in the formula below. The reverse can be used to find the  index corresponding to a given  index:   =  − 1 if  ≤  + 1  − 2 ×  if  >  + 1 (2.5) Precalculated Data In step 2 of the SHT method, the normalized associated Legendre functions and the Gaussian quadrature weights are multiplied together. Because the tree structure is static throughout the entire simulation in Helmholtz problems, the bandwidth at each level and the resulting sampling rates remain constant. 17 As such, we can precalculate the operators effecting this operation, yielding what we call the forward transform matrices. At each level there are −, . . . ,  forward matrices, corresponding with the range of . Each forward matrix is ( − ||) ×  in size. Additionally, as described in Jakob-Chien & Alpert (1997), when  < 0, −  . This allows us to only store the forward matrices only in the range  = 0, . . . , .  =  In step 3 of SHT, the normalized associated Legendre functions are used again. To save important computational time, these can also be precalculated and stored as the backward transform matrices. The backward matrices project the number of  samples for a given level  to the number of  samples of a target level ˆ. Thus the matrix size is ˆ × ( − ||). Finally, to save space, both the backward matrix for interpolation and the backward matrix for anterpolation to the target level are given by the same matrix. With the combination, there are − ˆ, . . . , ˆ backward matrices, again corresponding with the range of  for the target level. Each backward matrix is ˆ × ( ˆ − ||) in size. Step 1 Figure 2.1 shows the first step in the SHT algorithm used in our base MLFMA implemen- tation. In this step the Fourier coefficients are calculated by performing FFTs on the  samples in each column, indexed from 0, . . . , , of the multipole expansion. Each FFT produces a new column of  sample Fourier coefficients. In the second step of the SHT method, calculations will be performed over  samples. Since calculations would be most efficient by placing the  samples along columns (in FORTRAN), the resulting vector from each FFT along  samples is transposed and stored in rows of the new Fourier coefficients matrix. This produces a  ×  matrix of -Fourier coefficients. In our implementation, all FFTs use the FFTW library Frigo & Johnson (2005) for best performance. Step 2 The right side of Fig. 2.2 shows the second step in our base SHT implementation where the spherical harmonic coefficients are calculated from the Fourier coefficients matrix (of step 1) by using the precalculated Legendre transform matrices. More specifically, the base implementation loops over  indexes from  = 0, . . . ,  and performs a matrix-vector multiplication between the 18 Figure 2.2: Spherical Harmonics Filter Steps 2 and 3. forward matrix corresponding to index  according to Eq. (2.5), and the th column of the Fourier coefficients matrix (produced in step 1). Note that in step 2 of the SHT, Legendre transforms are applied to Fourier coefficient vectors from  to (cid:48). Figure 2.2 represents the vectors that are NOT used in matrix vector multiplication as light grey cells. 
While these values are not calculated, their inclusion in the figure helps show how the matrices and vectors align during multiplication. Step 3 The left side of Fig. 2.2 shows the third step in our base SHT implementation where the filtered Fourier coefficients are calculated from the spherical harmonics coefficients matrix by performing backward Legendre transforms. This is done by looping over  indexes of  = 1, . . . ,  and performing a matrix-vector multiplication between the backward matrix of the target level corresponding to  according to (2.5), and the th column of the spherical harmonics coefficients matrix (produced in step 2). Because the backward matrix for M2M and L2L have been combined, when performing M2M the backward matrix will have more columns than the number of rows in the filtered Fourier coefficients matrix. To perform the matrix vector multiplication, only the number of columns in the backward matrix up to the number of rows in the filtered Fourier coefficients matrix is used. This adjusts the use of the backward matrix to match the format shown in Fig. 2.2. The number of rows of each backward matrix in the target level is ˆ, projecting the  length columns of spherical harmonics coefficients to ˆ length columns of filtered Fourier coefficients. The initial number of columns of filtered Fourier coefficients matrix is , the number of columns of the spherical harmonics coefficients. When the target node has more  samples than the source 19 Figure 2.3: Spherical Harmonics Filter Step 4. node, the filtered Fourier coefficients matrix must be padded to the desired ˆ columns for the target level. To do this, the filtered Fourier coefficients matrix is split vertically down the middle, columns 1, . . . ,  on the left and columns  + 1, . . . ,  on the right. The matrix is then padded in the middle with columns of 0s to widen the matrix to ˆ, where ˆ is the number of  samples of the target node. When the target node has fewer  samples than the source node, (2.5) is used to identify any column index producing an  value greater than ˆ and these columns are ignored. Step 4 Fig. 2.3 shows the fourth and final step in our base SHT implementation where inverse FFTs are performed on each  index row, from   = 0, . . . , , of the filtered Fourier coefficients matrix for the node. To properly provide the data to the FFTW library, each row is transposed into a contiguous vector. The FFT results can then be stored in columns of the target multipole matrix, M. Operational Costs As evident from the above discussion, main computational costs associated with SHT are a series of FFTs (regular and inverse) and matrix-vector multiplications. More specifically, for a given node,  forward FFTs (each of which cost  log ) and ˆ inverse FFTs (each of which cost ˆ log ˆ) must be performed. In terms of matrix-vector multiplications, there are a total of  matrix-vector multiplications involving forward matrices of size ( − )×  depending on the value of  corresponding to the  index. The portion of the backward matrices used are of size ˆ × ( − ) (where ˆ denotes the number of samples in the target level), and there again are a total of  matvecs involving backward matrices. Consequently, the cost of matrix vector multiplications for a particular  is  × ( − ) ×  +  × ˆ × ( − ). Given that  ranges from 0, . . . , , the total 20 matvec cost can be approximated as  1 1 2  ×  ×  + 1 2 ˆ ×  × . 
(2.6) 2.3.2 Vector Reuse in Forward/Backward Matrix Multiplications A simple optimization that can be applied to the base SHT implementation is reuse of vectors between forward and backward matrix multiplications. In the base implementation of SHT that we started with in our performance optimization efforts, all  forward matrix multiplications are performed, producing a set of  vectors containing spherical harmonic coefficients for a given node. Due to the total size of the forward matrices that one must go through before starting the backward matrix-vector multiplications, it is highly likely that the intermediate vectors with spherical harmonic coefficients will be evicted from cache (even for leaf nodes or nodes close to the leaf level). In this variant of the SHT implementation, instead of storing the spherical harmonics coefficients as an intermediate matrix, each vector, once calculated, is immediately multiplied by the corresponding backward matrix and the result is stored in the filtered Fourier coefficients matrix for the node. This avoids the need for intermediate storage of the entire spherical harmonics coefficients matrix for the node, but more importantly it allows reuse of the spherical harmonics coefficient vector data. As will be demonstrated through numerical results in Sect. 2.4, this optimization gives improved performance over the base version for lower level tree nodes. However, towards the top of the tree, it actually leads to slightly worse performance than the base version, as the individual spherical harmonics coefficient vectors themselves grow larger than the L1 cache, wiping off any benefits. 2.3.3 Combining Forward/Backward Matrices The next optimization we pursue is combining forward and backward matrices in the SHT method. Note that the base code applies the forward and backward matrices one after the other in the form of back to back matrix-vector multiplications. In the proposed optimization, the first step is 21 Figure 2.4: Spherical Harmonics Filter with combined Forward/Backward matrices. multiplying the forward and backward matrices together, and then applying the combined matrices to Fourier coefficient vectors in a node, directly calculating the filtered Fourier coefficient vectors at the target node level. A comparison of the computational costs of the combined method and the base method reveals that for a given node, by itself this optimization would actually increase computational costs. More precisely, for a given  index, we first need to multiply a backward matrix of size ˆ ×( − ) with a forward matrix of size ( − ) × , and the resulting matrix must be applied to a Fourier coefficient vector of size . The total operational cost then is ˆ × ( − ) ×  + ˆ × , which is more than the cost of the base algorithm. However, a key observation made when combining the forward and backward matrices is that the forward and backward matrices are defined according to the spacial relationship between the nodes in the tree, and the structure of an MLFMA tree does not change over the course of a simulation as discussed previously. Moreover, the set of forward and backward matrices used in SHT is the same for all tree nodes at a given level . This allows us to precalculate the combination of the forward and backward matrices. With precalculation, we only have to multiply the combined matrices with the corresponding Fourier coefficient vectors at a cost of ˆ ×  (independent of the value of ), opening a door to save on the total number of calculations. 
Note that the combined forward/backward matrix optimization works well only for certain values of $m$, i.e., when $m$ is small. As $m$ grows, the forward and backward matrices have more zero-filled columns, but omitting calculations with the zero-filled columns, as in the base algorithm, is not possible in the combined approach. By comparing the operational costs based on the value of $m$, it is possible to identify the value of $m$ for which the number of operations in the base implementation becomes larger than that of the combined matrices approach:

$\hat{N}_\theta \times (N_\theta - m) \times 2 + (N_\theta - m) \times N_\theta \times 2 = \hat{N}_\theta \times N_\theta \times 2$   (2.7)

As mentioned previously, for M2M the number of $\phi$ samples for the current level, $N_\phi$, and the number of $\theta$ samples for the next level up, $\hat{N}_\theta$, are approximately two times the number of $\theta$ samples for the current level, $N_\theta$. We can use this relation to simplify and solve for the value of $m$ where the base method becomes costlier than using the combined matrix approach:

$2N_\theta \times (N_\theta - m) \times 2 + (N_\theta - m) \times N_\theta \times 2 = 2N_\theta \times N_\theta \times 2 \;\Rightarrow\; m = \frac{2N_\theta}{6} = \frac{N_\theta}{3}$   (2.8)

This provides an estimate of when the base algorithm outperforms the combined matrices for a given tree level. In fact, one can utilize the combined matrices approach only for the values of $m$ in a level where it provides an advantage over the base implementation. Instead of relying on operation counts to judge which approach is better for a specific $m$ value, an alternative is empirical tuning, where one runs a number of matrix multiplications of the appropriate sizes before starting the MLFMA solver and determines, for the given architecture, at which point it becomes advantageous to switch from the base implementation to the combined matrix approach.

One additional trade-off with the combined matrices approach is memory usage. In the base code, each level has a set of forward and backward matrices, each of size $N_\theta \times N_\theta$. With combined matrices, we need a matrix for the M2M stage and another one for the L2L stage, each of size $\hat{N}_\theta \times N_\theta$, where $\hat{N}_\theta$ is the number of $\theta$ samples of the target node. The $\theta$ samples are roughly doubled when moving up the tree and roughly halved when moving down the tree, which gives an approximate memory ratio of

$\frac{2N_\theta^2 + \frac{1}{2}N_\theta^2}{2N_\theta^2} = 1.25$,   (2.9)

or 25% more memory usage for the combined matrices over the separate forward and backward matrices. In order to use the faster of the base and combined matrix versions for each $m$, we need to keep the forward and backward matrices used by the base code and additionally store the combined matrices. This results in approximately 125% more memory than the base version for storage of these matrices.

2.3.4 Level by Level Processing

The optimizations presented in the two preceding subsections focus on performing the SHT operations for tree traversal on a node-by-node basis. While they try to improve data locality (through vector reuse) and reduce total computational costs (combined forward/backward matrices), their effectiveness is still limited. As a final optimization, we investigated level by level processing to perform the SHT operations. The key observation here is that a given forward/backward matrix pair (or their combined version) must be applied to the Fourier coefficient vectors of several nodes found in the same tree level (according to the $\phi$-index to $m$ mapping described in Eq. (2.5)).
Moving from node-by-node processing to level-by-level processing is expected to yield significantly improved performance because i) the set of forward/backward matrices must now be read from memory once per tree level as opposed to once per node (hence significantly improving data locality), and ii) the matrix-vector operations of the node-by-node processing scheme can now be performed as matrix-matrix multiplications (hence significantly improving arithmetic intensity). That being said, level-by-level processing introduces new challenges in terms of data reorganization/movement as well as the size of the intermediate storage space needed. Below we discuss the implementation details related to level-by-level processing and address the associated challenges.

Figure 2.6: Spherical Harmonics Transform operation applied on a level by level basis (combined forward/backward version is shown).

For the SHT method to utilize matrix-matrix operations, the Fourier coefficient matrices belonging to all nodes within the same tree level can be arranged into a 3D matrix, where the $\phi$-Fourier index is moved to the outer-most index so that, for all nodes, all $\theta$ vectors can be multiplied with the same forward/backward matrix pair (or their combined matrix version). The $\theta$ index is left as the inner index to improve data locality during matrix-matrix multiplications, leaving the node index as the middle index. This data rearrangement is shown in Fig. 2.5.

Figure 2.5: Reorganizing data for level by level processing.

With such a rearrangement of the Fourier coefficient vectors, the matrix operations in SHT have far more opportunity for data reuse. As shown in Fig. 2.6, each $\phi$-indexed slice of the Fourier coefficients matrix can be multiplied by the corresponding single combined forward/backward matrix (or a single forward followed by a backward matrix, if that is more efficient for the given value of $m$ and tree level). This allows the forward/backward matrices to be loaded once per tree level, multiplied by the corresponding $\theta$ vectors of all nodes, and the results stored in the filtered Fourier coefficients matrices. How the algorithm loops through the $\phi$ indexes to perform the matrix multiplications can be further optimized in this scheme. Per Eq. (2.5), there are two $\phi$ indexes that map to the same $m$ index, with the exception of the first one. This means that when looping through all $\phi$ indexes, all but one combined forward/backward matrix must be loaded twice. Instead of looping through $\phi$ indexes, the algorithm can loop over $m = 0, \ldots, N_\theta$ and perform a matrix multiplication for each of the $\phi$ indexes that correspond to the current $m$ index. This allows each combined forward/backward matrix to be loaded from memory only once.

Matrix Multiplication Since matrix-matrix multiplication is a very common kernel in scientific computing and data analytics, there exist highly optimized library functions for this operation. Two options for matrix multiplication in BLAS are DGEMM (for multiplying real matrices) and ZGEMM (for multiplying complex matrices). Neither fits perfectly with the SHT method, as the Fourier coefficient data consists of complex doubles, while the forward and backward matrices are real doubles. To multiply a complex matrix with a real matrix, one method is to split the complex matrix into two matrices, one containing the real parts and the other containing the imaginary parts. Each of these can then be multiplied against the real-valued combined matrix using two DGEMMs. The results of the two multiplications then need to be combined back into a complex matrix for later parts of the SHT operation; a minimal sketch of this approach is given below.
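The sketch below illustrates the split-and-recombine idea under stated assumptions: hypothetical routine and variable names, column-major storage, and a real matrix A (e.g., a combined forward/backward matrix) applied to a complex slice Z of the Fourier coefficient data. It is an illustration, not the dissertation's implementation.

! Compute W = A * Z, where A is real (m x k) and Z is complex (k x n),
! by splitting Z into real and imaginary parts and calling DGEMM twice.
subroutine real_times_complex(m, n, k, A, Z, W)
   implicit none
   integer, intent(in)                    :: m, n, k
   double precision, intent(in)           :: A(m, k)
   complex(kind=kind(0.0d0)), intent(in)  :: Z(k, n)
   complex(kind=kind(0.0d0)), intent(out) :: W(m, n)
   double precision :: Zr(k, n), Zi(k, n), Wr(m, n), Wi(m, n)

   Zr = dble(Z)                        ! real parts of the complex input
   Zi = aimag(Z)                       ! imaginary parts of the complex input
   call dgemm('N', 'N', m, n, k, 1.0d0, A, m, Zr, k, 0.0d0, Wr, m)
   call dgemm('N', 'N', m, n, k, 1.0d0, A, m, Zi, k, 0.0d0, Wi, m)
   W = cmplx(Wr, Wi, kind=kind(0.0d0)) ! recombine into the complex output
end subroutine real_times_complex

The temporary arrays Zr, Zi, Wr and Wi make the extra memory and copy costs of this option explicit, which is the trade-off discussed next.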
This option has a benefit that the DGEMM multiplication is being used exactly as intended, multiplying two full real matrices. The trade off is that extra time must be spent splitting the complex matrix apart and putting the results back together, and extra memory must be used to store the complex data in real matrices. The second method to multiply a complex matrix with a real valued matrix would be to cast the real matrix into complex, with all imaginary values set to 0. Now the two matrices can by multiplied as complex matrices using ZGEMM. With all data stored as complex matrices, no extra work needs to be done to split apart the real and imaginary components or putting them back together. The downside to this method is that the full complex matrix multiplication must be performed, even though all of the imaginary data of one matrix is set to 0, resulting in extra processing, and the combined forward/backward matrices now require twice as much memory to store the imaginary 0s. As shown in Sect. 2.4, we observe that the DGEMM version in general outperforms the ZGEMM option, and therefore our optimized level-by-level implementation uses DGEMMs. 26 Cache Blocking in Data Reorganization Because FFTs are performed across  samples, the optimal multipole expansion data arrangement is to place  data in columns (in FORTRAN) for SHT steps 1 and 4. This prevents data striding while performing the FFTs. On the other hand, the optimal data layout for matrix multiplication is columns of  data, nodes in the rows and  as the outer-most index (slices of 2D matrices). This conflict between FFTs and matrix multiplication means there is no one optimal data layout. For this reason, we perform the FFTs in the same manner as outlined in Section 2.3.1. After performing FFTs, the resulting block of  vectors with Fourier coefficient must be moved to the three dimensional matrix layout for DGEMM/ZGEMMs. While performing this copy operation, we utilize cache blocking on  indexes to limit the amount of memory striding; otherwise the overheads associated with this copy operation can easily wipe out the benefits from reorganizing data in the ideal form for FFTs, then DGEMMs, and then again inverse FFTs. Node Blocking Due to the rearrangement of of the matrices to to be  ×  × , with  as the inner most index, each 2D slice of the matrix is now has a width of the number of nodes, which can be on the order of millions for lower tree levels in large-scale computations. These larger 2D matrices can exceed the available cache. By passing a limited number of nodes to the SHT algorithm, the size of the 2D matrix can be limited to an amount of data that better fits in cache. This provides a performance improvement when there is a large number of nodes, however matrix multiplication is not as efficient with extremely narrow matrices. The node blocking is kept balanced by limiting how small the blocks can be, in order to keep the matrix multiplication efficient. For all performance tests presented in the next section, level-by-level implementations include FFTW and node blocking were used as these were the optimal scenarios found below. 2.4 Numerical Results and Performance Evaluation In this section, the performance of the presented spherical harmonics transform optimizations are analyzed. Performance results were gathered using the Cori supercomputer in the National Energy Research Scientific Computing Center (NERSC). 
Runs were performed on the two-socket Intel Xeon "Haswell" processor nodes with a 2.3 GHz clock rate and 128 GB of DDR4 2133 MHz memory. Code was compiled using the Intel Fortran compiler with the flags -O3, -no-vec and -mkl, and run as a single thread. Intel's MKL library was set to utilize a single thread for ZGEMM and DGEMM operations to obtain fair comparisons against the other versions described.

Table 2.1: Speedups for each Spherical Harmonics Transform implementation relative to the base code. Each column gives the relative speedup for a particular level; the last column gives the overall speedup. The first row, for the base implementation, reports the total execution time in seconds.

L(l)                     3      6      12     24     48     96     192     384      768     Total
Base (sec)             3.10   1.93   1.62   3.11   3.93   4.01   17.79   40.01   104.75   180.25
Vector Pipeline        1.13   1.13   1.11   1.04   1.04   1.03    0.99    0.98     1.01     1.01
Matrix Pre Calculated  1.37   1.29   1.21   0.99   0.89   0.70    0.58    0.57     0.55     0.58
Optimal Combined       1.46   1.38   1.34   1.13   1.08   1.15    1.92    1.40     1.17     1.27
Level by Level         2.03   1.05   0.91   0.92   1.02   1.02    2.45    2.98     3.99     2.85
LbyL/DGEMM             1.48   0.97   0.68   1.00   1.28   1.71    3.99    5.30     9.32     4.71

2.4.1 Overall Performance

To provide a consistent comparison of the optimizations of the SHT method, all methods were provided the same input data and verified to produce results identical to those of the base implementation (described in Sect. 2.3.1, which is known to produce accurate results). The base implementation's execution time provides the baseline against which each optimized version is compared. In the results shown in Fig. 2.7 and Table 2.1, a full 11 level quad tree (4 children per node) was created to simulate a surface geometry. In all figures and tables presented, we provide measurements for each level of the tree to analyze the impact of changing sample sizes on the methods used. These different levels are marked with the bandwidth $L(l)$ of that level, from which the numbers of $\theta$ and $\phi$ samples are derived. In our synthetic quad tree, each leaf node contains different generated multipole data to ensure realistic memory usage and different interpolation/anterpolation results for each node, helping pinpoint any code errors that produce results differing from the base code.

Figure 2.7: Execution time for different Spherical Harmonics Transform implementations for an 11 level surface geometry.

The initial optimization was to calculate the spherical harmonics coefficients immediately followed by the filtered Fourier coefficients for each $\phi$-indexed vector, rather than calculating all of the spherical harmonics coefficients before calculating the filtered Fourier coefficients. We denote this version as "vector pipelining" in Fig. 2.7. At lower levels of the tree (denoted by smaller $L(l)$ values), this optimization provides a minor improvement, as it reduces intermediate storage and uses fewer operations by not having to store off the intermediate results. At higher tree levels, vector pipelining does not perform as well, once the forward and backward matrices no longer fit in cache. The next optimization is precalculating the combined forward/backward matrices, denoted as "Matrix Precalculated" in Fig. 2.7 and Table 2.1. As discussed in the algorithms section, this precalculation is expected to outperform the base version only for lower $m$ values. As we move up the tree, i.e., as the numbers of required samples $N_\theta$ and $N_\phi$ increase, the value of $m$ where matrix-vector multiplications using the precalculated matrix outperform the base code decreases.
In other words, an implementation which uses the combined matrix approach for all values of  starts significantly underperforming against the base implementation. Adopting a hybrid approach where the best performing parts of the base code are intermixed with the best performing parts of using 29 the combined matrix approach, the improvement over the base code is higher in all tested cases. We refer to this hybrid approach as “Optimal Combined", and as can be seen in Figure 2.1 this hybrid approach delivers a 1.27x overall speedup over the base code. Performance is further improved by moving to “Level by Level" computation method. This optimization which uses a naive 3-loop matrix matrix multiplication scheme still outperforms the base code overall by 2.85x. Rearranging the data in the ideal form for FFTs and then for matrix- matrix computations does impose an overhead cost. This is best seen for () from 6 through 96, where the overhead cost at times erases all the performance gains obtained from converting matrix- vector operations to matrix-matrix multiplications. At very low tree levels, the rearrangement cost is not significant with smaller numbers of  and  samples in the multipole data, and for higher levels of the tree, the gains from matrix-matrix multiplications far outweigh the cost of data rearrangements. For instance for () = 768, the “Level by Level" scheme achieves 3.99x speedup over the base SHT implementation. Finally, given that the level by level scheme uses matrix/matrix multiplications, optimized library kernels can be used for this purpose. We observe that the best performance is obtained when using DGEMM to perform the matrix multiplication, as opposed to ZGEMMs. In our tests, our best implementation, named “LbyL/ DGEMM" achieves an overall speedup of 4.71x In fact, for ()=768, we observe speedups over 9x as a over the base SHT implementation. result of significantly improved matrix-matrix multiplication performance. At higher tree levels the performance improvements of “LbyL/DGEMM" increase at a faster rate. While the “LbyL/DGEMM" approach shows significant speedup starting at ()=48, and continuing to increase as () increases, the performance at () from 6 to 24 significantly under- performs. The “Level by Level" performance shows the initial overhead is due to rearrangement of the data for matrix-matrix computations. The second overhead, as noted in Sect. 2.3.4, is the conversion of complex data matrices into one matrix of the real data, and a second of the imaginary data, so that the DGEMM multiplication can be used. While this overhead cost is significant at lower levels, the increase in performance results in “LbyL/DGEMM" being the fastest algorithm 30 Figure 2.8: DGEMM Vs ZGEMM optimization. in overall time. A probably faster approach could be to use the “Optimal Combined" method and “LbyL/DGEMM" together, using the method that is fastest for the current (). DGEMM vs ZGEMM Finally, in Section 2.3.4, it was discussed that both DGEMM and ZGEMM could be an option for matrix multiplication given that the multipole data is complex data and the forward/backward matrices are real data. Fig. 2.8 shows the resulting run times for DGEMM and ZGEMM based scheme as compared to the base code. At lower tree levels, ZGEMM appears to outperform DGEMM slightly due ZGEMM not having to split the multipole data into two real arrays. At higher levels though, DGEMM begins to notably outperform ZGEMM when the size of the matrices being multiplied become larger. 
Due to the minimal improvement of ZGEMM at lower levels and the significant improvement of DGEMM over ZGEMM at higher levels, we have chosen to use DGEMM in the other parts of the chapter. Impact of Cache Blocking Recall that rearranging the data to be more optimal for matrix multiplication conflicted with the data arrangement needed by the FFTW library. This overhead was limited by cache blocking the data movement during the rearrangements back/forth. In particular, we perform a set of FFTs (8-16) at once for each  index of the data, and move them in chunks to contiguous locations in the 3D matrix of Fourier coefficients. Figure 2.9 shows that 31 Figure 2.9: Analysis of blocking data around Forward/Backward FFTs. Figure 2.10: Analysis of DGEMM blocking on nodes (note, updated graph needed) the improvement obtained from cache blocking the data rearrangement stage continues increasing as () increases, with a minor exception at ()=96. At this level, all methods used show a decrease in FFTW execution time, indicating that 96 is a sweet spot for the FFTW library. If cache blocking had not been used, the benefits of the LbyL/DGEMM scheme would have been significantly reduced. 32 Node Blocking Rearranging for optimal matrix multiplication also results in the potential of extremely large multipole matrices being multiplied against the combined forward/backward ma- trices. At the leaf level of an 11 level surface geometry for instance, there are 1 million nodes. Instead of passing all nodes to the SHT algorithm, they can be passed in blocks to save on the total memory requirements. Fig. 2.10 compares the LbyL/DGEMM scheme’s execution time with no-blocking, when nodes are grouped into 16 block and when using an adaptive blocking scheme. As seen in the figure, dividing the total nodes into 16 blocks provides a significant improvement over unblocked at lower levels, but blocking at higher tree levels limits the improvement as this shrinks the size of the multipole matrix too much. At higher levels, passing all nodes to the SHT algorithm provides the best performance. 2.5 Conclusions and Future Work Due to the need for representing both magnitude and phase information, Helmholtz variant of the FMM algorithm is significantly challenging compared to the commonly studied Laplace variant. In this chapter, due to its computational and storage cost advantages over alternatives, we adopted a Spherical Harmonic Transform based tree traversal scheme for the computationally expensive M2M and L2L stages of the MLFMA algorithm. We presented a series of optimization techniques to improve the performance of the SHT method. We demonstrated that the performance of SHT performance can be significantly improved by moving to a level-by-level scheme from the commonly used node-by-node processing scheme, and by carefully considering data rearrangement and cache utilization issues. This improvement has been shown in detail, step by step, with the performance improvements laid out. The performance improvements obtained (up to 9.3x) provide a clear demonstration of benefit of the optimizations described here. 33 CHAPTER 3 HIGH PERFORMANCE EVALUATION OF HELMHOLTZ POTENTIALS USING THE MULTI-LEVEL FAST MULTIPOLE ALGORITHM 3.1 Introduction In the previous chapter we focused on a single node optimization to Spherical Harmonics Transform interpolation. In practice, full Helmholtz FMM simulations need to be executed on large computer clusters due to the high amounts of computation and memory required. 
As mentioned previously, Laplace FMM has a large portion of the tree data stored in the leaf levels, allowing for excellent scaling with leaf level optimizations. The Helmholtz FMM higher level tree costs mean data partitioning and communication of these higher levels are important bottlenecks to scaling. In this chapter, we address data partitioning and communication issues for high level tree node through novel algorithms, and provide an algorithmic complexity analysis of computation and communication overheads. Data Partitioning How the tree is constructed, as discussed in section 1.1, has implications for parallel implementation trade offs. In Helmholtz FMM, the lowest levels of the tree have a large number of nodes with a small number of samples per node, while the highest levels of the tree have a small number of nodes with a large number of samples per node, such that the total data stored at any level is close to equivalent, or slightly larger at the high levels for a surface geometry. Looking at the low levels, even a fairly small surface geometry that produces a 10 level tree has 262k leaf nodes. It is easy to assign a portion of the nodes to each of the 64, 8192 or even more processes. This method is known as spatial partitioning. If the data can be divided such that all descendant nodes of a high level node reside on a single process, all of these descendants can be interpolated using only local data. The primary question is how much communication is required when child nodes are split between processes, requiring communication when the interpolated data must be aggregated together, and what the performance impact will be. This question will be 34 investigated further in this chapter. Figure 3.1: Spatial vs Direction Partitioning. At the high levels, the number of nodes per level is far lower, fewer nodes in the level than the number of processes that might be assigned to the execution. At tree level 4 of any surface geometry, more than 64 processes means there will be more processes than nodes. The more processes assigned, the lower the level where we have more processes than nodes. At these high levels we can use a method known as direction partitioning. Figure 3.1 illustrates the how spatial partitioning assigns whole nodes to processes and directional partitioning divides nodes in a given level up among unique groups of processes. This method assigns each node in the level to unique groups of processes so that every process is assigned. Melapudi et al. (2011) describe an adaptive direction partitioning algorithm, where the switch from spatial partitioning to direction partitioning can occur at different levels, for different partitions to account for non-uniform trees. See section 3.3.1 for a more detailed description of this partitioning method. The primary weakness of the described method is how the native processes, we will refer to them as resident processes, must be sent the entire interpolated node data to be aggregated and then communicated back to the other processes sharing the node. In this chapter we will investigate a method of distributing all of the work of the shared nodes among the processes that share the node and analyze the complexity of this approach. 35 3.2 Background Related Work The aforementioned work profile of the H-FMM octree suggests that any efficient parallelization must strike a balance between distributing the many lightweight boxes at lower levels and distributing the work of the few heavyweight boxes at higher levels across processes. 
Furthermore, the mathematics used to effect each stage of the process dictate the intricacies of the algorithms developed for parallelization. The existing algorithms address these different scenarios with different trade-offs. For the purposes of the ensuing discussion, we note that multipole and local expansions may be viewed as two-dimensional arrays of sampled function values. At scale, the multipole and local expansions of octree boxes at the uppermost levels of the tree must be distributed across processes to achieve load balance Velamparambil et al. (2000). To reduce the costs of communication in M2M and L2L for these distributed nodes, several authors have employed local interpolation techniques Chew et al. (2001); Ergül & Gürel (2008); Michiels et al. (2013b), in which only a small "halo" of nearby samples are required to calculate each new sample. However, despite a slightly lower M2M and L2L asymptotic complexity of O( log ), when compared to global interpolation’s complexity of O( log2 ) Chien & Alpert (1997); Sarvas (2003); Cecka & Darve (2013b), this approach increases memory and computational costs stemming from the need to significantly oversample multipole and local expansions, in addition to reducing the numerical accuracy of the H-FMM. Alternatively, we propose the use of a parallel version of the global interpolation method, which has typically remained restricted to the serial M2M/L2L operations at lower levels of the tree. Though the global algorithm obviously has higher communication costs, it does not introduce additional numerical errors, and it facilitates optimal (minimum) sampling rates for multipole/local expansions Hughey et al. (2019); Lingg et al. (2018). Local interpolation based hierarchical partitioning (HiP) approach distributes expansions hi- erarchically at the uppermost levels in block columns, or strips Ergül & Gürel (2008). However, the M2M and L2L communication costs scale as O(√ ) per process, hampering the scalability of the H-FMM evaluation Ergül & Gürel (2013). The blockwise HiP (B-HiP) strategy Michiels et al. (2011, 2013b) alleviates this bottleneck by distributing expansions in blocks, whose much 36 lower surface-to-volume ratio results in O(1) communication costs per process, hence improving scalability Michiels et al. (2015). In both methods, M2L operations are carried out in parallel by collecting on each process samples of the remote multipole expansions required to compute the local expansions it owns. It should also be noted that the increased sampling required with local interpolations hampers the scalability of the M2L phase, as collecting remote multipole expansions requires a significantly higher communication bandwidth compared to a global scheme with optimal sampling rates. Building on the HiP approach, Yang et al. Yang et al. (2019) transition from hierarchical partitioning to plane-wave partitioning (PWP) Velamparambil et al. (2000) for the highest levels of the tree, using a binary tree decomposition of the MPI communicator to flexibly load balance the computation. The PWP approach achieves zero communication overhead for M2L operations by distributing expansions at the uppermost levels of the tree by assigning each process a fixed window of samples for all expansions at a given level. 
However, the transition from HiP to PWP requires expensive communications in M2M and L2L phases to rearrange the expansions, though this cost may be justified by recognizing that each node interacts with at most 8 other nodes to perform the M2M/L2L operations, while the maximum number of nodes for M2L operations is 189 (with a volumetric problem). As previously stated, local interpolation methods are convenient for parallelization but result in an H-FMM that is not strictly error-controllable. The principal challenge to a scalable H-FMM with error control is the communication cost of distributed global (exact) interpolation. In Melapudi et al. (2011), Melapudi et al. describe an error-controllable H-FMM based on global interpolation using a bottom-up partitioning which gives great flexibility for load balanced partitioning of the tree. This work has continued to this day with the most recent research done by Hughey et al Hughey et al. (2019), which discusses the current state of the art in globally interpolated H- FMM. Scalability of this implementation is hampered by the coarse-grained parallelization of the M2L phase and redundant M2M/L2L calculations associated with high-level tree nodes shared by multiple processes. 37 Contribution In this chapter, we build upon our earlier efforts Melapudi et al. (2011); Hughey et al. (2019) toward a scalable, error-controllable H-FMM based on global interpolation. We address several challenges regarding parallelization and communication, and we demonstrate an efficient and scalable method for evaluation of the Helmholtz potential. In particular, our contributions can be summarized as follows: 1. Development of a fine-grain parallel algorithm with bottom-up partitioning that enables scalable evaluation of deep uniform MLFMA trees, 2. maintain the high level of controllable accuracy shown in previous global interpolation implementations, 3. a detailed analytical model to characterize the complexity and memory use of the parallel algorithm for far-field interactions, 4. and, demonstration of the overall algorithm performance and validation of this performance against our analytical model for different test scenarios. 3.3 Algorithms In what follows, we describe details of each stage of the algorithm. For completeness, we replicate some of the concepts introduced in Melapudi et al. (2011); Hughey et al. (2019), before delving into details of our specific contributions. Specifics on mathematical formulae and operators used in this algorithm can be found in Melapudi et al. (2011); Hughey et al. (2019). 3.3.1 Tree Construction and Setup Let  denote the number of processes used in the computation. We initially distribute the  particles evenly across all processes and determine the diameter 0 of the cube bounding the entire computational domain. Given the finest box diameter (), the number of levels in the tree is calculated as the smallest integer  such that  ≥ log2(0/()) + 1. Once the number of levels 38 and therefore the position of the leaf nodes are known, every particle is assigned a key based on the Morton-Z order traversal of the MLFMA tree Warren & Salmon (1993). A parallel bucket sort on the Morton keys is then used to roughly equally distribute particles across processes at the granularity of leaves. This is done by selecting  − 1 Morton keys, or “splitters”, which chop the Morton Z-curve into  contiguous segments. Whole leaves are uniquely assigned to processes using these splitters. 
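As an illustration of the Morton-Z ordering step, the following sketch (hypothetical code assuming 64-bit integer keys, not the dissertation's implementation) interleaves the bits of a leaf box's integer grid coordinates to form its Morton key; the parallel bucket sort and splitter selection described above then operate on these keys:

! Minimal sketch: 3D Morton (Z-order) key from integer box coordinates.
! Assumes each coordinate fits in nbits bits and 3*nbits <= 63.
function morton_key(ix, iy, iz, nbits) result(key)
   implicit none
   integer, intent(in) :: ix, iy, iz, nbits
   integer(kind=8)     :: key
   integer             :: b

   key = 0_8
   do b = 0, nbits - 1
      if (btest(ix, b)) key = ibset(key, 3*b)      ! x bit -> position 3b
      if (btest(iy, b)) key = ibset(key, 3*b + 1)  ! y bit -> position 3b+1
      if (btest(iz, b)) key = ibset(key, 3*b + 2)  ! z bit -> position 3b+2
   end do
end function morton_key

Because interleaved bits order boxes along a space-filling Z-curve, contiguous key ranges correspond to spatially compact groups of leaves, which is what makes the splitter-based partitioning effective.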
Given a contiguous segment of leaf nodes, each process determines all ancestor keys of its leaves up to the root. The leaf through ancestor nodes are used to construct the local subtree. A simple method of storing the local subtree is as a post-order traversal array. To quickly access a random node, we use an indexer into this local subtree array. Plural Nodes Despite the non-overlapping partitioning of leaf nodes, overlaps among different processes at the higher level nodes are inevitable (and in fact, are desirable) as illustrated in Fig. 3.2. Details and associated proofs on such partitioning can be found in Melapudi et al. (2011). We call such nodes shared by multiple processes as plural nodes. While there are no limitations to the number of processes that can share a plural node, we designate a particular process, i.e., the right-most process sharing the plural node in the MLFMA tree, as its resident process. We refer to the resident process’ copy of a plural node as a shared node and all other copies of this node residing on other processes as duplicate nodes. We call the set of processes that own these duplicate nodes as users of the shared node, denoted by (), where  is the shared node. One notable advantage provided by plural nodes is that storage of the node is split between multiple processes. In this case, the indexer additionally stores which slice of a tree node the current process actually stores in its local subtree array. As the information content for a node is not available to any single process in its entirety, a fine grain parallelization is necessary to perform computations associated with plural nodes. We note that a process can have at most two plural nodes per level in its local tree (essentially one to the “left”, and another to the “right”); for proof see Melapudi et al. (2011) 39 Figure 3.2: Parallel Farfield MLFMA. 3.3.2 Parallel Evaluation 3.3.2.1 C2M C2M is unchanged from our previous works Melapudi et al. (2011); Hughey et al. (2019). As each process is assigned a contiguous segment of whole leaf nodes, each process handles the C2M phase for its assigned leaf nodes in parallel independently. 3.3.2.2 M2M In a nutshell, M2M creates multipole expansions of non-leaf boxes from the multipole expansions of their children. This first requires multipole expansions of all children to be interpolated to the size of the parent box. Next, each interpolated child box is shifted from the center of the child box to the center of the parent box. Finally, multipole contributions of every shifted child box are aggregated to form the multipole expansion for the parent box. M2M computations start at the leaf level and proceed upwards in the tree following a post-order traversal of the local tree. Our parallelization of M2M depends on the level of the node and is described in Alg. 4.3. The approach is as follows: i) Non-plural tree nodes (that are typically found at lower levels of the tree) are handled by their owner processes in parallel independently, ii) for plural nodes without any plural children, interpolation and shift steps for child nodes are performed 40 sequentially, and the aggregation step is performed as a reduce-scatter operation among processes sharing the plural node, iii) plural nodes with plural children (which typically are located at the higher levels of the tree and incur significant computational and storage costs) are processed using a fine-grained parallel algorithm that we discuss in more detail below. 
Algorithm 3.1 Multipole-to-multipole interpolation
Require: b.x, the coordinates of the parent box center
Ensure: M_b is the parent's multipole representation
1:  for each box b in post-order traversal do
2:     for each child box c do
3:        if c is plural then
4:           T[c] <- parallel_interpolation(M_c)
5:        else
6:           T[c] <- serial_interpolation(M_c)
7:        end if
8:        T[c] <- shift(T[c], b.x)
9:     end for
10:    for each child box c do
11:       aggregate(M_b, T[c])
12:    end for
13:    if b is plural then
14:       reduce_scatter(M_b, U(b))
15:    end if
16: end for

Fine-grained Parallel Interpolation For plural nodes that necessitate fine-grained parallelization of M2M, the multipole data of the child nodes needed for the FFTs are themselves split among multiple processes, as indicated in Alg. 3.1. Before describing our parallel algorithm, we note that our M2M implementation uses Fast Fourier Transform (FFT)-based interpolation over the uniformly spaced multipole expansions of the child nodes Sarvas (2003). FFT-based interpolation on the Fourier sphere requires equispaced samples along the $\theta$ (vertical) and $\phi$ (horizontal) directions. Due to the inter-dependencies of the FFT algorithm, there is no way to partition the data so as to avoid communication. Our approach is as follows. First, each process is assigned a (roughly) equal number of contiguous columns of multipole data (which correspond to groups of samples along the $\theta$ direction). The operation begins with a set of independent FFTs along these columns for interpolation in the $\theta$ direction, performed the same way as in the basic Sarvas approach. Then, the interpolated columns are shifted into rows (see Fig. 3.3), transposing and folding the samples in the $\phi$ direction into individual columns. The next step in serial processing would be FFT interpolation in the $\phi$ direction, but here the data is split between the processes sharing the plural node. Therefore, each process is sent the samples it needs to complete its assigned columns using an MPI_Alltoallv collective call. Now that each process stores full columns of samples, these multipole data can be interpolated. The fully interpolated multipole data is transposed and folded back to its original form ($\theta$ samples along columns, $\phi$ samples along rows). The data are then communicated back to the processes that own the corresponding multipole samples via another MPI_Alltoallv. These major steps are illustrated in Figure 3.3.

Figure 3.3: Graphical illustration of the transposition and folding operations during fine-grained parallel interpolation of the multipole data of child node c to parent node b, for $N_\theta = 3$, $N_\phi = 4$ (for c) and $N_\theta = 5$, $N_\phi = 6$ (for b), using 3 processes, each of which owns a column of the initial multipole data as indicated by the hash marks. The multipole data in (a) is interpolated along the $\theta$ dimension locally, leading to the multipole expansions in (b). The folding operation acting on the interpolated data is shown by the repositioning of the data in (c). The hash marks show how the folded data is stored on the wrong processes and must be communicated to the correct process, as shown in (d). With the entire multipole data columns in the $\phi$ direction available on each process, another set of interpolations is performed locally (e), which is then transposed and folded to yield the final multipole expansions (f). A final communication step is needed to send each vector to its owner process (g), which can then be shifted to the center of the parent box and added to the parent's multipole expansions in accordance with the spherical symmetry condition.
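The contiguous column assignment used above can be expressed with a small helper (a hypothetical sketch, not the dissertation's code); the same bookkeeping determines the send/receive counts for the MPI_Alltoallv exchanges:

! Contiguous block of columns [lo, hi] owned by process r (0-based rank)
! out of p processes sharing a node whose data has ncol columns (1-based).
subroutine column_block(ncol, p, r, lo, hi)
   implicit none
   integer, intent(in)  :: ncol, p, r
   integer, intent(out) :: lo, hi

   lo = (r * ncol) / p + 1        ! first column owned by rank r
   hi = ((r + 1) * ncol) / p      ! last column owned by rank r
end subroutine column_block

When the number of processes sharing a node exceeds the number of columns, some ranks simply receive an empty block (lo > hi).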
Shifting of Interpolated Data Shifting of multipole data is simply the translation of the interpo- lated samples from the child node’s center to the parent node’s center. Translation of each multipole 42 data is independent of others and trivially parallelizable even in the case of fine-grained parallel M2M. Aggregation Aggregation requires adding all corresponding samples from each interpolated and shifted child node together to form the multipole expansion of a parent node (step (g) in Fig. 3.3). When children are owned by separate processes (as is the case for plural nodes), reduction communications are required. Note that in fine-grained parallel M2M for a plural node, each process owns only a portion of the parent node’s multipole data. A reduce/scatter operation (for instance using MPI_Reduce_scatter) could perform both the aggregation and distribution of the appropriate portions of the aggregated multipole data to the processes sharing a plural node. One complication here is that the reduce/scatter operation would require memory to be allocated to the full-size of the parent node by each user process through padding the parts not owned by a process with 0s. Clearly, this would lead to significant memory and computational overheads, especially at the highest levels of the H-FMM tree (due to the large sizes of the plural nodes there). Therefore, we opt for a custom point-to-point aggregation scheme where the interpolated and shifted multipole samples from child nodes are communicated directly (via MPI_Send and MPI_Recvs) to the process that owns the corresponding samples of the parent node. If the source and destination are the same process, obviously no communication is performed. Each process sharing the parent node then adds up the corresponding multipole samples it receives from each child node, local or communicated. This method reduces both the amount of temporary memory necessary for aggregation and the overall communication volume. In fine-grained parallel M2M, performance impact of how the multipole data Process Alignment of child and parent plural nodes are mapped to processes sharing those nodes also needs to be considered. From the description of our custom point-to-point aggregation scheme above, it is evident that increasing the overlap of the multipole sample regions owned by a process in the child and parent nodes is critical for reducing the communication volume. As an extreme example, if a process owns no samples in the parent node that correspond with any of the samples it owns in the 43 Figure 3.4: This image shows how multipole samples, ordered clockwise starting at 12 o’clock, are assigned to processes owning the children nodes (C1-C5) overlap with processes owning the parent node (P1-P5) when assigned in process rank order on the left, and with our realignment scheme on the right. The darker portions on the parent samples show the regions where the parent node data overlap with the child node data, and are essentially local data that do not require communication. child node, all its interpolated and shifted child node samples would have to be communicated to another process for aggregation. In fact, this extreme example is not uncommon when multipole data of a node is simply partitioned into blocks and mapped to user processes according to their process ranks. This situation is illustrated on the left side of Fig. 3.4; some processes own samples of a parent node that has no overlap with the samples they own in the child node. 
Specifically, while process 1 owns overlapping samples in the child and parent nodes, process 2 and 3 own no overlapping samples. As a heuristic to minimize the communication volume, we order processes within a parent node such that the parent node samples are assigned by following the priorities below to ensure maximal overlap with their child node samples: 1. Index of the lowest sample they own in the child nodes (lower comes first), 2. number of samples they own in the child nodes (fewer comes first), 3. process rank. In the example given in Fig. 3.4, both process 1 and process 3 own samples with index 0 in the 44 child nodes, but process 3 has a smaller number of samples so it is assigned the first portion of the parent node samples with process 1 being assigned the second portion. Following processes 3 and 1, process 4 owns the multipole sample with the lowest index in the child nodes, followed by process 2, and then process 5. As such, remaining portions of the parent node samples are assigned in this order. As can be seen in the figure, all samples each process owns in the parent node fully overlap with samples that they own in the child nodes, despite the non-uniform layout. With the proposed process alignment scheme, process 3 will still need to communicate some samples to process 1 for aggregation, but over half of the child samples interpolated by process 3 remains local. Note that with the straight-forward ordering of processes by their ranks, the entire child data interpolated by process 3 would have to be communicated to processes 1 and 2. In the new scheme, all processes use interpolated samples from their part of a child node without having to communicate. While this scheme would work best with a perfectly balanced tree, this approach will still be effective in reducing the communication volume during aggregation with any tree structure. 3.3.2.3 M2L The M2M step builds the multipole expansions of all tree nodes owned/shared by a process, starting from the leaves all the way up to the root (or the highest level of computation). During M2L each observer node loops through all source nodes in its far-field and translates the source multipole data to its locale, aggregating the effects from all its far-field interactions in the process. When the source-observer node pair is on the same process, this interaction is handled purely locally. However, when the source node data is on another process (or a set of processes), one needs a load balanced algorithm that is communication efficient. To understand the scope of the problem, consider Figure 3.5. Here, the source node is a plural node shared by three processes (S1, S2 and S3); the observer node is shared by two processes (O1 and O2). Multipole samples for both are shared starting at the top of each circle and increasing clockwise (consistent with the process alignment scheme utilized during M2M). Process S1 and O1 both own multipole samples with the lowest indices, with S1 having less samples than O1. For 45 Figure 3.5: A translation operation between two plural nodes shared by different numbers of processes. this far-field interaction, S1 would need to send all samples that it owns to O1. S3 and O2 both own samples with the highest indices; here S3 would need to send all of its samples to O2. Finally, S2 owns samples that are needed by both O1 and O2, therefore S2 must send half of its samples to O1 and the other half to O2. 
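The S1/S2/S3-to-O1/O2 bookkeeping in this example reduces to intersecting sample ranges. The sketch below (hypothetical names and 1-based sample indices; an illustration rather than the dissertation's code) computes, for one source process, the slice of its samples destined for each process sharing the observer node:

! For a source process owning samples [s_lo, s_hi] of a plural source node,
! determine the sample range to send to each of the nobs processes owning
! slices [o_lo(i), o_hi(i)] of the observer node.
subroutine m2l_send_ranges(s_lo, s_hi, nobs, o_lo, o_hi, send_lo, send_hi)
   implicit none
   integer, intent(in)  :: s_lo, s_hi, nobs
   integer, intent(in)  :: o_lo(nobs), o_hi(nobs)
   integer, intent(out) :: send_lo(nobs), send_hi(nobs)
   integer :: i

   do i = 1, nobs
      send_lo(i) = max(s_lo, o_lo(i))   ! overlap with observer slice i
      send_hi(i) = min(s_hi, o_hi(i))   ! empty if send_lo(i) > send_hi(i)
   end do
end subroutine m2l_send_ranges

An empty intersection means no message is needed for that observer process, which is exactly how the pre-calculated communication list described next avoids unnecessary sends.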
Since each node in the H-FMM tree interacts with several others (≈27 for surfaces and up to 189 for volumes), each of which may be shared by a varying number of different processes, it is evident that coordinating all communications that must take place during an H-FMM evaluation is non-trivial. In an initialization step before the actual M2L operations, all processes discover the owner process(es) of the tree nodes (i.e., observers) which will need the multipole data for the source nodes they own, given the partitioning of leaf nodes (for load balancing purposes) and the process alignments for plural nodes. This pre-calculated list is formed to carry out the actual communications that will take place during the ensuing H-FMM potential evaluations. If a source node or the corresponding observer node is plural, then the pre-calculated communication list will include only the intersection of the multipole data owned by both the source and target processes. When a source node on a process is needed by multiple observers on another process, it is sufficient to include the source node in the communication list to that other process only once. Also, multipole data for multiple source nodes residing on one process that are needed by another can be combined into a single message in this communication list, even if the source nodes are at different levels. Note that the tree structure in most H-FMM applications is fixed. As such, the overhead associated with such an initialization stage is minimal.

Algorithm 3.2 M2L Translation
1:  procedure M2L
2:     Determine the intersection of the data owned by both source and target.
3:     for each source box do
4:        for each far-field interaction target do
5:           if the target box is not local then
6:              for each process sharing the target box do
7:                 Add the source multipole data in the intersection of the data owned by both source and target to the buffer.
8:              end for
9:           end if
10:       end for
11:    end for
12:    Communicate buffers between processes.
13:    for each target box do
14:       for each far-field interaction source do
15:          if the source box is local then
16:             Load the source multipole data from local memory.
17:             Translate the source multipole data to the target and add to the existing translated data for the target node.
18:          else
19:             for each process sharing the source box do
20:                Read the source multipole data from the communication buffer.
21:                Translate the source multipole data to the target and add to the existing translated data for the target node.
22:             end for
23:          end if
24:       end for
25:    end for
26: end procedure

While plural node to plural node far-field interactions (which constitute the most expensive communications in an H-FMM evaluation) could actually be carried out using all-to-all communications that involve only the users of the two corresponding plural nodes, due to the excessive number of M2L interactions present in large-scale computations (and hence the large number of different communicators that would have to be created), we choose to perform these communications using non-blocking point-to-point send/recv operations (i.e., MPI_Isend and MPI_Irecv) in the default global communicator. Another reason for opting for a point-to-point scheme is that there are significant differences in the amount of data that must be sent to one process compared to another. As part of the initialization step, each process then allocates a message buffer for every other process that it will communicate with. The size of the send buffer is limited to avoid excessive memory use and to maximize communication overlap.
To perform communications during M2L, every process first fills their send buffers for each process based on the pre-calculated communication list and initiates the message transmission using MPI_Isends. Immediately after initiating the sends, each process starts waiting for their ex- pected messages using MPI_Irecvs. The status of these communications are checked periodically. Computations are overlapped with M2L communications in two ways. First, blocks of translations that are entirely local (which is actually common at the lower tree levels) are processed while the non-blocking send/recvs are taking place. Second, translation data that is detected as received during the periodic checks are applied immediately, overlapping the corresponding computational task with communications underway. Due to the limit we impose on the message buffer sizes, communications with processes that involve a large amount of data transfer need to be performed in multiple phases. Hence, upon reception/delivery of a message from another process, if there is more data to be transferred, a new non-blocking recv/send operation is initiated. Translation Operators Source node data is translated to the target node by multiplying it with a translation operator. The translation operators can be pre-calculated to reduce computational costs. As these can potentially take significant memory, we limit such memory use by each process by having them store the pre-calculated operators only for translation of the local and remote samples that they will actually need. This information can be determined from the pre-calculated M2L communication list. In case the memory available is not sufficient to store the needed operators, we use techniques outlined in Hughey et al. (2019) to sample and interpolate translation operators. 3.3.2.4 L2L To anterpolate and distribute the translated local expansions down to the child nodes, L2L applies the operations in M2M in reverse order. First, local expansions at the parent node are shifted to 48 the center of each child node, they are then anterpolated and percolated down the tree. Finally, the anterpolated data is aggregated with local expansions previously translated to the child node during the M2L stage. Much like M2M, L2L operation is parallelized in three different ways: i) Non-plural parent tree nodes at the lower tree levels are processed independently in parallel by their owner processes, ii) for a plural node with a non-plural child, shifts involve communications, but the ensuing anterpolation and aggregation (with translated local expansions) are performed sequentially by the process owning the child node, iii) plural nodes with plural children require fine-grained parallelization. In a parallel shift operation, parent node samples corresponding to those of the child node Shift must be communicated by the processes sharing the parent node to the process(es) owning the child node. This is most easily done before the data has been shifted, as the parent will not need to know the position of the child node. In case of plural parent and non-plural child, this communication would essentially be a gather, and in case of plural parent with a plural child, it would be an all-to-all, in both cases involving all processes sharing the parent node. However, only a subset of the processes sharing the parent node will actually share the child node. 
To avoid non-trivial issues that would arise from having to coordinate several collective calls among different subsets of processes, we again resort to point-to-point communications instead. Consequently, messages are only sent from processes owning a piece of the parent node to a process owning the corresponding piece of the child node. Once a process gathers all of the necessary samples of the parent node, it applies the shift operation to all of its multipole samples independently.

Anterpolation Anterpolation has to be performed in parallel only if the child node is a plural node. The procedure for parallel anterpolation is exactly the same as that of the parallel interpolation, except that the number of multipole samples is reduced (rather than increased).

Aggregation Aggregating the shifted and anterpolated parent data with the translated local expansions is trivial. Even in the case of plural child nodes, all required data is already available locally.

3.3.2.5 L2O

As in C2M, each process handles the L2O computations of its assigned leaf nodes in parallel independently.

3.4 Numerical Methods

In this section, we analyze the asymptotic computational and communication complexity of the parallel H-FMM algorithm described above. To simplify the analysis, we focus on two extreme cases: a 2D surface represented by points on a regularly spaced planar grid (dimension $d = 2$) and a 3D volumetric structure represented by points on a regularly spaced cubic grid ($d = 3$). These represent extreme cases, and hence are ideal for asymptotic analysis.

Let $K(l)$ denote the number of samples in the $\theta$ and $\phi$ directions for a node at level $l$. All FMM algorithms are constructed such that the only operators that depend on the particles are C2M and L2O; the operations of M2M, M2L and L2L only depend on the existence of the leaf node Greengard et al. (1998); Coifman et al. (1993b); Ergin et al. (1998, 1999). As such, we assume that each leaf node contains O(1) samples. It follows that the number of leaf nodes is $\propto N$, the number of source points. For simplicity and with no loss of generality, we assume that the constant of proportionality is 1. Next, we denote the number of nodes at level $l$ by $M(l)$. The total number of levels is given by $L$. As one moves up the octree, we observe that the number of groups per level is reduced by roughly 4 times for the 2D surface and 8 times for the 3D volume. Leveraging the relation between the dimensionality of a structure and the rate of decrease in the number of nodes per level, one can write $M(l)$ as follows:

$M(l) = \frac{N}{(2^d)^{l-1}}$.   (3.1)

Since $K(l)$ doubles at each level, given $K(1) = 1$, it follows that

$K(l) = 2^{l-1}$.   (3.2)

Finally, we define $P$ as the number of processes, $l_c$ as the level where (almost) all nodes in a level start becoming plural, and $P(l)$ as the average number of processes sharing a plural node at level $l$. Equivalently, $l_c$ is the level where $M(l) < P$ for the first time, and this remains true from there on up to the root. Given $P$, $M(l)$ and $N$,

$P(l) = \frac{P}{M(l)} = \frac{P\,(2^d)^{l-1}}{N}$.   (3.3)

3.4.1 Interpolation (M2M)

Computational Complexity M2M is performed for each node, starting from the leaf level up to the highest level $L$. The dominant component in the computational complexity of M2M is the FFT-based interpolation. Shifting and aggregation are $O(K(l)^2)$ operations each, while interpolation for a given node costs $O(K(l)^2 \log_2 K(l))$.
This gives a total computational complexity of

$T_{M2M} \propto \sum_{l=1}^{L} M(l)\,K(l)^2 \log_2(K(l))$.   (3.4)

Plugging in equations (3.1) and (3.2) and simplifying the summation, we obtain the computational complexity for a surface to be

$T_{M2M} \propto O(N \log^2 N)$,   (3.5)

and for a volume to be

$T_{M2M} \propto O(N)$.   (3.6)

Number of Messages (Latency) Communication in M2M happens during the aggregations of both coarse-grained and fine-grained parallel M2M, as well as during the FFTs of the fine-grained parallel M2M. As described in Sect. 3.3.2.2, we perform aggregations (which are effectively reduce-scatter operations) using point-to-point communications. In an ideal tree, every source and observer node will be divided among the same number of processes. This means the portion of a source node owned by any process will be needed by only a single process in the observer node, limiting the communication for each source node process to one process in each observer node. Since this is done for each group at each level, the total message count for aggregations can be written as:

$n_{msg}^{agg} \propto \sum_{l=l_c}^{L} M(l)\,P(l+1)$.   (3.7)

Using the expressions for $P(l+1)$ and $M(l)$ yields the number of messages for aggregation:

$n_{msg}^{agg} = O(2^d\,P \log_{2^d}(P))$.   (3.8)

Here, we ignore the aggregations that would be needed for plural nodes (located at process boundaries) below level $l_c$. Note that there may only be two such plural nodes per level for each process, and these aggregations involve only two processes. As such, their contribution to the number of messages during aggregations is a lower order term. Next, consider the parallel FFTs in fine-grained parallel M2M. In this case, an all-to-all communication is performed after each of the two fold and transpose operations. As we implement these all-to-all communications using point-to-point calls, the message count for the FFTs is then:

$n_{msg}^{fft} = \sum_{l=l_c}^{L} M(l)\,P(l)^2 \propto O(P^2)$.   (3.9)

Consequently, the total number of messages for M2M is:

$n_{msg}^{M2M} = O(P^2 + 2^d\,P \log_{2^d}(P))$.   (3.10)

Note that the second term of the equation above only matters when $\log_{2^d}(P)$ is greater than the number of levels in the tree; in all other cases, increasing the height of the tree does not increase the number of levels with plural nodes. Given that it is practically useless to have more processes than the number of leaf nodes (which is the condition required for $\log_{2^d}(P)$ to exceed the tree height), the message count can be simplified to $n_{msg}^{M2M} = O(P^2)$.

Communication Volume (Bandwidth) The bandwidth cost of interpolation is due to the all-to-all communications during interpolation and the reduce-scatter during aggregation. Each of these operations communicates up to the entire node, resulting in a bandwidth cost that can be written as:

$V_{M2M} \propto \sum_{l=l_c}^{L} M(l)\,K(l)^2$.   (3.11)

Applying the previous definitions of $M(l)$ and $K(l)$ yields a communication bandwidth of $V_{M2M} \propto N \log P$ for the surface geometry, and $V_{M2M} \propto N$ for the volume geometry.

3.4.2 Translation (M2L)

Computational Complexity The complexity of the translation operation at a given level is directly proportional to the number of multipole samples per node, the average number of interactions per node (denoted by $I(l)$ for level $l$), and the number of nodes at that level. Summing these costs across all levels, we obtain:

$T_{M2L} \propto \sum_{l=1}^{L} K(l)^2\,I(l)\,M(l)$.   (3.12)

While the number of interactions for a node changes based on its exact position in the geometry (for instance, corner or edge nodes), the upper limit is the constant $6^d - 3^d$. Using the equations for $M(l)$ and $K(l)$, the computational complexity of the translation step can be simplified to $O(N \log N)$ for the surface structure, and to $O(N)$ for the volume structure.
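For concreteness, the simplifications above can be checked by direct substitution of (3.1) and (3.2); the following is a brief derivation sketch with constant factors suppressed (the intermediate expressions are added here for illustration):

For the surface ($d = 2$): $M(l)K(l)^2 = \frac{N}{4^{l-1}} \cdot 4^{l-1} = N$, so
$\sum_{l=1}^{L} M(l)K(l)^2 \log_2 K(l) = N \sum_{l=1}^{L} (l-1) = O(NL^2) = O(N \log^2 N)$, since $L = O(\log N)$.
For the volume ($d = 3$): $M(l)K(l)^2 = \frac{N}{8^{l-1}} \cdot 4^{l-1} = \frac{N}{2^{l-1}}$, so
$\sum_{l=1}^{L} \frac{N}{2^{l-1}} (l-1) = O(N)$, because $\sum_{l \ge 1} (l-1)\,2^{-(l-1)}$ converges to a constant.
The same substitutions in (3.12), together with $I(l) \le 6^d - 3^d = O(1)$, give $O(N \log N)$ for the surface and $O(N)$ for the volume.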
Number of Messages (Latency). At level l_c or above, a process can have multipole samples for only one node. Since a process owns at most part of a single node, each of its interactions will require a separate communication, because the nodes in its far-field will all reside on different processes. Assuming an ideal tree partitioning where the source and target nodes are shared among the same number of processes, the i-th process of the target node will only need the source node data from the i-th process of the source node. As we limit the size of each translation message, the number of messages will then be proportional to the communication volume between a pair of processes divided by the message buffer size B. At levels below l_c, a process can own multiple nodes. Here, groups of nodes can be communicated to the same process if all source nodes reside on one process and all observer nodes reside on another. In this case, the interaction count is based on the total amount of data communicated between the two interacting processes, divided by the message buffer size, summed up over all interacting processes. Considering the contributions at/above l_c and below l_c gives a total message count of

    m_M2L ∝ Σ_{l=l_c}^{L} M(l) P(l) I(l) ⌈K(l)^2 / (P(l) B)⌉ + Σ_{l=1}^{l_c-1} I(l) M_∂(l) ⌈K(l)^2 / B⌉,    (3.13)

where B is the size of the message buffers and M_∂(l) is the number of boundary nodes defined below. For the surface geometry, this can be simplified to m_M2L ∝ O(P log P) + O(N/B) (where the first term is for levels ≥ l_c and the second term is for levels < l_c), and for the volume geometry it can be simplified to m_M2L ∝ O(N/B) (with levels both below and above l_c having the same impact).

Communication Volume (Bandwidth). Similarly, the communication volume can be analyzed in two parts as well. At and above l_c, all multipole data of every source node must essentially be communicated to every target node, as no process contains any multipole data other than its own. Even if the number of processes increases, the same amount of data still needs to be transmitted, just among an increased number of processes. Therefore, for levels at or above l_c, the communication volume is independent of the number of processes:

    V_M2L^{≥l_c} ∝ Σ_{l=l_c}^{L} K(l)^2 I(l) M(l).    (3.14)

This expression simplifies to O(N log N) for the surface geometry and to O(N) for the volume geometry.

Below l_c, each process will own more than one node, and nodes will be interacting with nodes on the same process or with multiple nodes owned by a neighboring process. In fact, only nodes within two nodes off the edge of process boundaries will require communications with other processes. The total communication volume can then be expressed as

    V_M2L^{<l_c} ∝ Σ_{l=1}^{l_c-1} K(l)^2 M_∂(l),    (3.15)

where M_∂(l) is the number of nodes that have nodes in their far-field on at least one (out of the 8 possible neighboring processes for the surface and 26 for the volume) other process touching them, and can be represented as M_∂(l) = M(l) / (M(l)/P)^(1/d). With this definition of M_∂(l), the total communication volume for M2L below l_c becomes

    V_M2L^{<l_c} ∝ (N P)^(1/2) 2^(l_c-1) = O(N)    (3.16)

for the surface, and

    V_M2L^{<l_c} ∝ O((N^2 P)^(1/3))    (3.17)

for the volume, due to the lower portion of the tree dominating.

3.4.3 Anterpolation (L2L)

As mentioned before, L2L is the reverse operation of M2M. Similar to M2M, anterpolation dominates the computational complexity of L2L. The computational complexity of anterpolations is the same as that of interpolations, so L2L's computational complexity is the same as M2M's. Likewise, the communications performed are the same but in reverse order. Therefore, the latency and bandwidth costs of L2L are the same as those of M2M.
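The latency estimates above assume that the reduce-scatter and all-to-all steps among the processes sharing a plural node are realized with point-to-point calls. A minimal sketch of such an exchange is given below, under the assumption that each of the np sharing processes needs one block of samples from every peer; the buffer and argument names (my_block, recv_blocks, peer_ranks, and so on) are hypothetical and are not taken from the actual code.

subroutine exchange_node_samples(my_block, recv_blocks, peer_ranks, np, blk_len, comm)
  use mpi
  implicit none
  integer, intent(in)     :: np, blk_len, comm
  integer, intent(in)     :: peer_ranks(np)
  complex(8), intent(in)  :: my_block(blk_len)
  complex(8), intent(out) :: recv_blocks(blk_len, np)
  integer :: i, ierr
  integer :: reqs(2*np)

  ! post all receives first, then send this process's block to every peer;
  ! this realizes the all-to-all with 2*np point-to-point operations per process
  do i = 1, np
     call MPI_Irecv(recv_blocks(1, i), blk_len, MPI_DOUBLE_COMPLEX, &
                    peer_ranks(i), 0, comm, reqs(i), ierr)
  end do
  do i = 1, np
     call MPI_Isend(my_block, blk_len, MPI_DOUBLE_COMPLEX, &
                    peer_ranks(i), 0, comm, reqs(np + i), ierr)
  end do
  call MPI_Waitall(2*np, reqs, MPI_STATUSES_IGNORE, ierr)
end subroutine exchange_node_samples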
3.5 Numerical Results and Performance Evaluation

In this section, we evaluate the performance of the parallel H-FMM algorithm described. All results were obtained on the Cori-Haswell supercomputer at the National Energy Research Scientific Computing Center (NERSC). Each node on this system contains two sockets, populated by Intel Xeon E5-2698 v3 (Haswell) processors with a clock speed of 2.3 GHz. Each node has 32 cores and 128 GB of 2133 MHz DDR4 RAM. The Haswell system uses the Cray Aries interconnect with a Dragonfly topology. The code is implemented in Fortran 90 using only MPI parallelization and was compiled with the Intel compiler, version 19.0.3.199. The Cray FFTW library, version 3.3.8.4, is used for all FFT operations.

The runs here focus on the timing of the M2M, M2L and L2L phases of the tree traversal. As discussed in Section 3.4, the operations of these phases depend only on the existence of the leaf nodes. Thus, each leaf node is populated with only a single unknown, effectively bypassing the near-field, C2M and L2O processing steps. This also makes the number of unknowns being processed much smaller than what M2M, M2L and L2L could process in the same amount of time when analyzing actual physics, as in Hughey et al. (2019). In a typical 0.25λ leaf box (as we use in the runs below) with a 0.1λ discretization rate, one could potentially have anywhere between 100 and 180 particles per box. The largest tree we process has 14 levels and 42 million points with one point per leaf box. If the leaf nodes were fully populated, this tree would be equivalent to processing a tree with 4.2 to 7.5 billion points. Populating the leaf nodes would increase the time of the C2M and L2O steps, but would have no impact on the execution times of the M2M, M2L and L2L phases. Additionally, all runs use the same θ and φ discretization as set by the 0.25λ leaf box size and an oversampling factor of 1.1; see Lingg et al. (2018); Vikram et al. (2011) for the use of this factor.

For all runs testing an increasing number of processes, the first run is performed with the lowest number of nodes that can execute the geometry without an out-of-memory exception using 32 processes per node (1 process per core). Processes are assigned to cores using srun, with --cpus-per-task set to 1 and --nodes set to the number of processes divided by 32. The number of processes is increased by increasing the number of nodes used, while maintaining 32 processes per node. Numerical results from all runs were compared against results from our previous work Hughey et al. (2019) to verify that the only differences were due to floating point precision; the previous work showed the results to be error controllable with respect to the analytical solution.

3.5.1 Load Balance with the Fine Grain Parallel Algorithm

The intent of the fine-grain parallel algorithm is to provide improved balance at the upper levels of the tree, where a smaller number of much larger nodes reside. First, we look at the performance of a planar grid of particles (in the z = 0 plane) of dimensions 512λ × 512λ with a grid spacing of λ/4 and 4,194,304 particles in total. The box size is chosen to be 0.25λ, resulting in a 12-level tree with 10 levels of computation. As can be seen in the left subfigure of Fig. 3.6, the resulting execution profile is very balanced across process ranks. The execution time from the fastest to the slowest process varies by only 1.43%.
Balance of the total time can be a little misleading, as M2L cannot progress until all of the processes that a given process interacts with have completed their M2M processing. However, the M2M execution times are also very balanced, varying from the slowest to the fastest process by 5.23%.

Figure 3.6: Process execution times for a grid and a sphere geometry.

Next, we look at the performance for a sphere of diameter 384λ discretized using 4,542,208 dipoles on the surface with a leaf box size of 0.25λ, yielding an 11-level tree. This geometry is less balanced than the grid geometry (see the right subfigure of Fig. 3.6), as high-level nodes can range from having no children, due to no particles being in that part of the geometry at the leaf level, up to having a completely filled quadtree from the leaf level up to a high-level node. This results in notable imbalance during M2M, which, as discussed before, results in delays in the M2L execution. Note that while there are no explicit barriers between the phases, there is an implicit barrier at the beginning of M2L processing, as an M2L interaction communication cannot proceed until both interacting processes have completed their M2M phases (though the faster process can perform local translations while waiting). Another (less significant) implicit barrier occurs at the beginning of L2L, where nodes that are fully owned by a process must have all of the data from the translated parent node in order to perform anterpolation on this parent node. The M2M execution time from the fastest to the slowest process varies by as much as 84.5%. Despite this noticeable imbalance, the long execution times are not clustered in a small group of processes, as one would see if each of the highest level tree nodes were handled by a single process without fine-grain parallelization. So even in this unbalanced geometry, the fine-grain parallel algorithm is helping to maintain a good load balance across process ranks.

3.5.2 Scalability

Next, we investigate the strong scaling efficiency of our parallel Helmholtz FMM algorithm, first on a surface and then on a volumetric structure. For the 2D surface structure, we use the same 512λ × 512λ planar grid as above. As our base case for strong scaling efficiency, we use the performance on 128 cores, because this is the smallest number of cores that this problem can be executed on due to its memory requirements. As seen in Table 3.1, both the interpolation and anterpolation phases (M2M and L2L) perform very well with increasing process counts, while M2L's performance falls off rapidly (down to 25% efficiency on 2048 cores). There are a couple of factors that contribute to this difference in scaling characteristics. The first factor is that M2M and L2L incur significant communications only at the highest level nodes, while M2L communications occur at every level where the source and observer nodes are on separate processes, which may essentially happen all the way down to the leaf nodes. Secondly, and more importantly, M2M and L2L computations involve relatively computation-heavy FFTs in between their communication steps. When the number of processes exceeds the number of nodes in a level, no process can own both the source and observer node of any translation, so all node data must be communicated. As the number of processes approaches the number of leaf nodes, the M2L communication bandwidth asymptotically approaches the worst case estimate.
This means the increase in M2L bandwidth exceeds the worst case estimate increase as the number of processes approaches the number of leaf nodes. Despite M2L not scaling very well, the fine grained parallel algorithm presented still provides good speedups, nearly an 8x speedup when going from 128 to 2048 processes without showing any performance stagnation. Next, we examine strong scaling on a 32 × 32 × 32 volumetric structure (Table 3.2). From 128 to 512 processes, we observe very good scaling (80% overall efficiency), but then parallel efficiency drops off quickly (down to 50% overall at 2048 cores). In a volumetric problem, each 58 Speedup Grid (s) Tot  M2M M2L L2L 1.00 128 5.30 0.83 2.66 256 0.72 1.31 512 0.57 1024 0.69 2048 0.49 0.37 Table 3.1: Performance of the MLFMA algorithm on the 512 grid geometry. Par Eff. (%) Tot M2M M2L L2L 1.00 1.00 0.99 1.65 1.01 2.89 4.58 0.95 0.89 7.78 Tot 18.55 11.21 6.42 4.05 2.38 1.00 0.64 0.48 0.31 0.25 1.00 0.95 0.95 0.89 0.84 5.80 3.06 1.52 0.81 0.43 5.30 4.13 2.76 2.12 1.31  M2M M2L 2.15 128 256 1.10 0.679 512 0.380 1024 2048 0.271 0.527 0.266 0.14 0.079 0.051 Volume (s) L2L 0.526 0.263 0.135 0.084 0.058 Tot 3.26 1.68 0.99 0.574 0.406 Speedup Par Eff. (%) Tot M2M M2L L2L 1.00 1.00 1.94 0.99 0.97 3.27 0.78 5.68 8.03 0.57 1.00 0.99 0.93 0.83 0.65 1.00 0.97 0.79 0.71 0.50 Tot 1.00 0.97 0.82 0.71 0.50 Table 3.2: Performance of the MLFMA algorithm on the 32 volumetric geometry. tree node has a large number of nodes in its far-field (up to 189). Therefore the overall execution time is largely dominated by the M2L stage which does not manifest good scaling. The ideal scenario for our fine-grained parallel algorithm is when the nodes of a given level are distributed evenly among the processes, i.e., when the number of processes divides evenly into the number of nodes in a level or vice versa. This does not occur at 1024 or 2048 processes for this particular volumetric problem. Nevertheless, the overall speedup remains at around 8x when going from 128 to 2048 processes. Finally, we look at scaling on the 384 sphere (Table 3.3). As seen in the load balance analysis of the previous subsection, load imbalances result in the faster processes having to wait for slower processes. This results in a noticeable drop in scaling efficiency of the M2M phase, where the imbalance has the greatest impact, as well as the M2L phase, where some processes that are already in their M2L phase have to wait for others that are still in their M2M phase. This also has an impact 59  M2M M2L 128 13.54 10.05 256 6.20 512 1024 4.32 3.19 2048 6.71 3.86 2.34 1.23 0.92 Sphere (s) L2L 5.29 2.7 1.46 0.72 0.416 Tot 26.76 18.00 11.04 7.70 5.58 Speedup Par Eff. (%) Tot M2M M2L L2L 1.00 1.00 0.97 1.49 0.91 2.42 3.47 0.92 0.79 4.79 1.00 0.67 0.55 0.39 0.26 1.00 0.87 0.72 0.68 0.46 Tot 1.00 0.74 0.61 0.43 0.30 Table 3.3: Performance of the MLFMA algorithm on the 384 diameter sphere geometry. on the overall speedup. While increasing the number of processes continues to improve execution times, the speedup when going from 128 to 2048 processes is just under 5x in the sphere case. 3.5.3 Complexity Analysis To help validate the complexity analysis presented in Sect. 3.4, the software was instrumented to report the computational cost, the number of messages sent and the size of these messages. In accordance with the geometries analyzed in Sect. 
3.4, data was collected on the grid geometries ranging from 64λ to 1024λ and volume geometries ranging from 16λ to 256λ, as these geometries produce perfect quadtrees with heights ranging from 9 to 13 levels and octrees with heights ranging from 7 to 11 levels, respectively. As the complexity estimates are asymptotic, they are scaled by a least-squares fit to help visualize how well the estimates match the actual measurements.

Figure 3.7 shows the actual vs. the estimated overall computational complexities for the surface and volume geometries. The actual complexities match the estimates very closely. This indicates that the implementation of this algorithm does not have any unnecessary computational overhead, as the measured computation is close to the ideal for Helmholtz FMM.

Figure 3.7: Actual vs. estimated computational complexity. The left subfigure shows results for the surface geometry, while the right subfigure is for the volume geometry.

Figure 3.8 shows the actual vs. the estimated communication volumes for each phase separately. Of note is how the measured communication volume drops off relative to the estimate. We believe this is due to the number of samples producing a tree with more nodes at the lower levels than the number of processes. Hence, many nodes are fully owned by a single process and require no communication during M2M or L2L. Increasing the number of processes would lead to more levels with plural nodes, bringing the communication volume closer to our estimates. M2L does not show the same communication volume falloff relative to the estimated volume as M2M and L2L, because fully owned nodes still require data from the source nodes to be communicated to the process owning the observer node. Such communications will be required all the way down to the leaf nodes.

Figure 3.8: Actual vs. estimated communication volume. The left subfigure shows results for the surface geometry, while the right subfigure is for the volume geometry.

Figure 3.9 shows the measured worst case message counts vs. the estimated worst case message counts for M2M and L2L. The M2M and L2L message counts are dominated by the O(P^2) complexity of the all-to-all communications, and the actual message counts reflect this. The M2L estimate simplifies a very complex process that determines the number of M2L messages that are communicated.

Figure 3.9: Message counts vs. expected Big-O message counts. The left diagram shows analysis of a surface geometry, while the right diagram shows analysis of a volume geometry.

3.5.4 Process Alignment

In Table 3.4, we compare the number of packets sent between the Rank Ordered and Process Aligned schemes during the M2M and L2L phases for the 512λ grid geometry.
We observe a notable reduction in the number of messages exchanged, and hence in the overall bandwidth, for lower process counts, and a continued reduction at higher process counts, as expected. This reduction likely contributes to the relatively good scaling characteristics of the M2M and L2L phases.

P                         128    256    512    1024   2048
Rank Ordered M2M          1714   1872   3009   3126   4246
Rank Ordered L2L          1206   1166   2429   2382   3649
Rank Ordered Combined     2920   3038   5438   5508   7895
Process Aligned M2M       1468   1662   2711   3272   4412
Process Aligned L2L       960    955    2131   2104   3338
Process Aligned Combined  2428   2617   4842   5376   7750
Delta                     -492   -421   -596   -132   -145

Table 3.4: Comparison of the number of packets sent between the Rank Ordered and Process Aligned schemes for the 512λ grid geometry, in millions of packets sent.

3.5.5 Memory Utilization

Table 3.5 shows the total program memory utilization of the three data structures with the largest memory needs, with increasing process counts. As expected, the memory used for tree storage (Tree Mem) does not increase with process count, despite some fluctuations due to different partitionings of the leaf nodes. This shows that the tree data structure is being nicely partitioned across processes. The size of the translation operators (Trans Ops) increases slowly with process count, slightly more than doubling going from 128 to 2048 processes. This is due to the spatial distribution of the tree nodes: multiple source-observer pairs with the same translation operator may belong to different processes, particularly as the process count increases and the number of nodes per process decreases. As a result, some processes store copies of the same translation operators as other processes. The greatest memory increase is in the message buffers (S/R Buffs). The translation send and receive buffers (S/R Buffs) are used to communicate the data of source nodes that interact with nodes on another process. Sending a separate message for each source and observer pair would eliminate the need for this buffer, but would result in drastically more translation messages, which would degrade performance. The translation message buffers are therefore sized to use any remaining memory in order to limit the number of translation messages that must be sent.

P           128     256     512     1024    2048
S/R Buffs   1.7     4.5     102.7   257.0   431.8
Trans Ops   107.5   141.0   162.2   199.0   221.4
Tree Mem    52.7    62.0    52.8    62.0    52.8

Table 3.5: Total memory utilization (in GBs) by the three largest data structures for the 512λ grid geometry.

At the other end of what can be computed with H-FMM is the volume geometry (Table 3.6). Here, the number of nodes per level is significantly increased due to the underlying full octree structure (as opposed to a quadtree for a surface geometry), but the maximum height of a tree that can be computed is reduced. Most of the memory use is reduced due to the shorter height of the tree, which reduces the size of the nodes at the top of the tree. However, the translation message buffers still use as much memory as possible to improve translation communication performance.

P           128     256     512     1024    2048
S/R Buffs   3.1     8.4     22.3    31.9    51.7
Trans Ops   6.6     10.1    16.3    21.4    30.4
Tree Mem    5.0     5.2     5.5     5.1     5.4

Table 3.6: Total memory utilization (in GBs) by the three largest data structures for the 32λ volume geometry.
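A minimal sketch of the buffer-sizing policy described above is given below (illustrative only; the variable names and the byte-level bookkeeping are hypothetical, not taken from the actual code): after the tree and the translation operators are allocated, the remaining per-process memory budget is handed to the translation send/receive buffers, which in turn bounds how many bundled translation messages are needed.

subroutine size_translation_buffers(mem_budget, tree_bytes, ops_bytes, sample_bytes, buf_samples)
  implicit none
  integer(8), intent(in)  :: mem_budget    ! per-process memory budget in bytes
  integer(8), intent(in)  :: tree_bytes    ! bytes already used by the tree (Tree Mem)
  integer(8), intent(in)  :: ops_bytes     ! bytes used by translation operators (Trans Ops)
  integer(8), intent(in)  :: sample_bytes  ! size of one multipole sample in bytes
  integer(8), intent(out) :: buf_samples   ! capacity of each S/R buffer, in samples
  integer(8) :: remaining

  remaining = mem_budget - tree_bytes - ops_bytes
  ! split the remaining memory between the send and receive buffers, and keep
  ! room for at least one sample so a message can always be formed
  buf_samples = max(remaining / (2_8 * sample_bytes), 1_8)
end subroutine size_translation_buffers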
64 Processes BEMFMM (s) This work (s) Speedup 2 108.92 8.29 13.1 4 100.85 4.28 23.6 8 148.69 2.37 62.7 16 91.03 1.26 72.2 32 60.45 0.68 88.9 64 35.72 0.39 91.6 128 29.42 0.23 127.9 256 26.72 0.17 157.1 512 24.72 0.15 164.8 Table 3.7: Comparison of BEMFMM vs our parallel MLFMA implementation (referred to as “this work"). 3.5.6 Performance Comparison with Other Codes Finally, we seek to compare our approach against those code that are available in the public domain. We note that open-source H-FMM codes are almost non-existent, with Abduljabbar et al. Abduljabbar et al. (2019) being a recent exception. In their 2019 paper, Abduljabbar et al. (2019), BEMFMM is tested with a 1 meter sphere, up to 17.9  and 2.3 billion unknowns, which suggests a 7 or 8 level tree, with 5 or 6 levels of computation. Our code has been run on a 14 level tree with 12 levels of computation, which covers geometries up to 2048 . If the leaf nodes were fully populated, this tree would be equivalent to processing a tree with 4.2 to 7.5 billion points. Their BEMFMM code discretizes a mesh; the discretized points are used as particle inputs to our H-FMM code in order to compare processing of the same geometry. We ran a spherical geometry with 240 thousand mesh elements that produces 1.44 million points. Both codes are configured to produce an 8 level tree with exactly 65471 leaf nodes with this sphere geometry. With this configuration, our fine grain parallel Helmholtz FMM algorithm shows significantly faster performance in comparison as seen in Table 3.7. 3.6 Conclusions and Future Work We have demonstrated a novel method for parallel computation of large, upper level tree nodes in large-scale Helmholtz FMM which helps alleviate a key performance bottleneck associated with node dependency. The complexity of this method has been characterized. The results presented support the provided characterization and show the balance provided by this method. The performance of the algorithm has been shown to compare favorably with an existing Helmholtz 65 FMM implementation. Beyond the improvements presented, further work can be performed to improve memory usage, as well as the M2L communication overhead. One possible method is a hybrid approach of MPI parallel with thread parallel. Using thread parallel within a given node provides the opportunity to exploit shared memory parallelism. All cores within a node can be assigned to shared memory threads, rather than MPI processes, eliminating the need to communicate between these threads, and reducing duplicate memory allocation. 66 CHAPTER 4 EXPLORING TASK PARALLELISM FOR THE MULTILEVEL FASTMULTIPOLE ALGORITHM 4.1 Introduction In the previous chapter, we showed how using a fine grain parallel distribution of the larger, higher level nodes helps to balance the overall work of traversing the tree. Traversal up and down (M2M and L2L) the tree showed excellent scalability, while translating the node data (M2L) quickly saw reduced performance as the number of processes increased. An analysis of the performance of translations found that computation scaled well as the number of processes increased, but adding more processes did not lead to a reduction in communication time as process counts increased to one thousand and beyond. One way of reducing M2L communications is to partition data among multiple processes, such that each process owns the same range of samples for all nodes at a given level, which Yang et al. (2019) call plane wave partitioning. 
Figure 4.1 shows how direction partitioning splits each node between unique subsets of processes, while plane wave partitioning divides each node up between all processes equally. This partitioning method eliminates the need to communicate any data while processing translations. With local interpolation, this method can perform well because there is limited impact to interpolation and anterpolation. In our chosen method of global interpolation, the entire node data is needed for interpolation and anterpolation. As a result, this data partitioning will perform more poorly when using global interpolation, as each process will have to communicate with all other processes during interpolation and anterpolation, rather than a subset of other processes. 67 Figure 4.1: Direction Vs Plane Wave Partitioning. Rather than changing how the node data is partitioned to reduce translation communication, we can instead look into maximizing the number of nodes stored in a given process. In a pure MPI scheme, increasing the number of processes reduces the work for each process, but at the same time, the number of source and observer node pairs residing on the same process decreases. Each source and observer node pair that reside in the memory of the same process requires no communication. Thus if we can divide up the workload further, while keeping a higher number of source and observer node pairs in the memory of the same process, we can reduce communication costs. Shared memory parallel adds an excellent complement to the MPI parallel approach. While assigning each core of a node maximizes the amount of physical processing power, it also means processes running on the same core must communicate with each other to share data. Instead, a shared memory parallel scheme can be used where each core within a node executes a separate thread, and each MPI process is assigned to a separate node. Now the shared memory threads can 68 process all of the nodes that previously resided on separate cores for a given node, without any communications necessary. Figure 4.2 illustrates how M2L interactions, when partitioning 16 cores each to a separate process, with the red node would require communicating with 8 other processes. While partitioning the 16 cores to 16 threads, and 1 process, would not require any communication, as all nodes are stored in shared memory. In this chapter we compare two common methods of shared memory parallel, Bulk Synchronous Parallel (BSP) and task parallel to determine what method would be preferable to integrate with the existing MPI parallel implementation. Figure 4.2: Interactions with 16 processes (green dashed lines) Vs 1 process and 16 threads (lighter green dotted lines). 4.2 Background and Related Work 4.2.1 Fast Multipole Method (FMM) Figure 4.3 shows how the interaction information flows from the multipole expansion tree on the left side through the local expansion tree on the right side, through different stages of the FMM algorithm (for illustration purposes, only a small subset of interactions/information flow is shown). In MLFMA, memory and computation associated with each node quadruples at each level as one ascends in the tree. Consequently, for surface geometries that are typical in electromagnetics and acoustics applications, each level has approximately the same amount of memory and computation costs. Note that all nodes within a given level can be processed independently, while traversing up (M2M) or down (L2L) the tree. 
Therefore, it is relatively straight-forward to apply the BSP model to MLFMA, as one can loop through the tree level by level and divide up the nodes at each level 69 Figure 4.3: Dependencies between boxes within an FMM octree due to the nearfield and farfield computation process. among threads using parallel-for loops. This method may run into a bottleneck as there are fewer (but significantly heavier) nodes available while moving up the tree, leading to the possibility of more threads being available to work than the number of nodes above a certain level. In levels where this occurs, one can parallelize over operations within each tree node at the expense of finer-grained synchronization overheads among threads. The information flow shown in Fig. 4.3 nicely illustrates the dependencies among different computational steps associated with the tree nodes. Dependencies among these tasks form a directed acyclic graph (DAG) which can easily be expressed through a runtime system with dataflow dependency support. A task parallel approach is less prone to thread idling as it can “fill in" any voids with useful work from other stages of the computation, and finer parallelization of heavier few nodes towards the top of the tree does not necessarily require participation (and synchronization) by all threads. In this sense, task parallelism provides a flexible and potentially effective solution. 70 4.2.2 Related Work To the best of our knowledge, task parallelism has not been explored in detail in the context of MLFMA before, but there are several prior works on task parallel L-FMM. Of those, studies by Agullo et al. Agullo et al. (2014) and Yokota et al. Yokota (2013) are similar to this work. Agullo et al. evaluate multiple methods of thread parallel approaches; in the first method they split all tree nodes for each level between threads using a parallel-for, they then expand this method by investigating a single thread only processing of some of a parent node’s children or a portion of a node’s far-field interactions. Finally, they interleave different steps of FMM by using tasks with different DAG orderings and priorities. Each approach shows good strong scaling of up to 91% efficiency on a shared memory architecture, when a geometry with a large number of particles is chosen. The efficiency falls off when using a smaller number of particles. This high efficiency is in part due to a majority of L-FMM processing being at the lowest level of the tree where there are a large number of tree nodes that can be parallelized independently. Yokota presents an L-FMM implementation using a dual tree traversal scheme and task based parallelism Yokota (2013). The dual tree approach provides greater flexibility in tree partitioning and consequently in load balancing. The implementation is shown to scale well on a shared memory system, and performs better than other algorithms on the same hardware. Pi et al. analyze a BSP implementation for MLFMA Pi et al. (2010). The implementation simply loop parallelizes the creation of the near-field interaction matrix, and uses parallel-loops to process nodes during each level of the far-field tree traversal. With runs up to 16 threads on the Deep-Comp 7000 HPC at the Chinese Academy of Sciences, the near-field parallel portion shows efficiencies above 95%, while the far-field parallel portion shows lower efficiencies of under 75%. Abduljabbar et al. describe a solver for low-frequency 3D Helmholtz soft body acoustic problems Abduljabbar et al. 
(2019), which is probably the closest work reported in the literature to our work. They outline the shared memory optimizations they have performed on MLFMA to maximize node performance. They break down these optimizations into two categories: Data-level and thread-level parallelism. In the context of data-level parallelism, they exploit the vectorization 71 units in modern multi-core processors, mostly through compiler-aided techniques. Their thread- level parallelization scheme extends the task based dual-tree approach proposed by Yokota, but it lacks details in regards to how they adopt the dual-tree approach to MLFMA. Even though theirs is a distributed memory parallel implementation, it is also not detailed if/how communications are performed along with computational tasks being performed by multiple threads. For these reasons, effective task parallelization strategies for MLFMA warrant further in-depth analysis. 4.2.3 Contributions Our contributions in this work can be summarized as follows: 1. We develop an efficient task parallel implementation of MLFMA, 2. we explore ideal task orderings and task granularities for optimal performance, and 3. we present an in-depth comparison of BSP and task parallel MLFMA implementations on modern shared memory architectures. 4.3 Methods 4.3.1 MLFMA with BSP Applying the BSP model in MLFMA is relatively straight-forward, as it mainly amounts to par- allelizing over tree nodes for each phase of MLFMA using parallel-for loops. Nevertheless, we provide some details to facilitate the performance analysis and comparison discussions presented in the next section. As the base MLFMA implementation is written in Fortran, OpenMP was used for thread parallel development for both BSP and task parallel. 4.3.1.1 Near-field Computations (NF) In this phase, point-to-point interactions for all particles in a given leaf box with particles in nearby leaf boxes are processed using direct interactions. In doing so, we choose to sweep through all pairs 72 in an observer-first parallel loop, i.e., the effects of all source particles on an individual observer particle is calculated by a single thread. This avoids the write-after-write contention that would have risen had we chosen to sweep through all pairs in a source-first way. 4.3.1.2 Upward Tree Traversal (C2M and M2M) For the upward tree traversal, we choose a level-by-level approach over a post order traversal approach because it 1) can easily exploit the independent parallel processing opportunity among nodes in a particular level, and 2) does not suffer from load imbalances among threads as all nodes in a level have similar computational costs. In the upward tree traversal, first all leaf nodes are processed in parallel, performing the C2M operations for each leaf node. Then during M2M, the multipole information of previously processed child nodes is interpolated, shifted and aggregated to form the multipole information of their parent nodes. This process is repeated moving up one level at a time until all levels have been processed. This scheme requires synchronization among all threads at the end of each tree level. 4.3.1.3 Translations (M2L) M2L is very similar to near-field computation, after all, these are the two MLFMA phases where actual interactions take place. Observer boxes are looped over in parallel and the translations from each source box which has far-field interactions with the current observer are computed and aggregated to the observer boxes. 
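A schematic of the BSP-style, observer-first parallel loops described in this subsection and in Sect. 4.3.1.1 is sketched below (the box counts and the compute_near_field/translate_far_field routines are hypothetical stand-ins for the actual kernels). Because each observer box is written by exactly one thread, no synchronization is needed on the accumulated potentials or local expansions.

subroutine bsp_observer_loops(num_leaf_boxes, nlevels, boxes_at_level)
  implicit none
  integer, intent(in) :: num_leaf_boxes, nlevels
  integer, intent(in) :: boxes_at_level(nlevels)
  integer :: iobs, lvl

  ! near-field: all sources in neighboring leaves act on observer leaf iobs
  !$omp parallel do schedule(dynamic)
  do iobs = 1, num_leaf_boxes
     call compute_near_field(iobs)
  end do
  !$omp end parallel do

  ! M2L: translate every far-field source box onto observer box iobs, level by level
  do lvl = 1, nlevels
     !$omp parallel do schedule(dynamic)
     do iobs = 1, boxes_at_level(lvl)
        call translate_far_field(lvl, iobs)
     end do
     !$omp end parallel do
  end do
end subroutine bsp_observer_loops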
In this phase, observer boxes are processed in a post-order traversal order as our implementation has evolved from a serial implementation. For M2L, there is no clear advantage of level-by-level processing over post-order processing or vice versa, because all nodes across the entire tree are fully independent of each other. The only dependency for any observer box is that the upward traversal phase (C2M and M2M) must be completed for all source boxes before the M2L translation can safely be performed. 73 4.3.1.4 Downward Tree Traversal (L2L and L2O) The downward tree traversal is almost the reverse operation of the upward tree traversal. We loop though the tree level-by-level in a top-down manner, and perform a parallel loop over nodes in each level. 4.3.2 Task Parallel MLFMA Creation of tasks in Helmholtz FMM requires a balance between task granularity versus flexibility. For instance, for a coarse granularity partitioning, a geometry with 16 nodes to compute at its highest level of computation could have each of the 16 nodes along with all their children defined as a task and have them assigned to one of 16 threads available. While such a partitioning provides coarse grained tasks, an unbalanced tree would result in some threads completing their tasks at much different times from others. Conversely, tasks can be limited in scope to the interpolation of a single child node, or the translation of one source to observer node. Tasks of this scale would be fine-grained, but would have far fewer dependencies within the tree. The reduced dependencies mean more tasks would be available to threads for execution at any given time. However, this would also mean more scheduling overheads at runtime. As a guiding principle, we try to balance between the flexibility of fine-grained tasks vs. their scheduling overheads. 4.3.2.1 Near-field Computations (NF) We have chosen to keep the task parallel near-field implementation simple and straightforward. Much like the loop parallel implementation which performs a parallel loop through all observer nodes, we make near-field computations of each observer node a task. Near-field computations implemented in this way only has output dependencies with the L2O phase, thus they can be executed at any other time. This provides great scheduling flexibility and potential performance improvements as near-field computations can help fill-in the thread idlings during execution of the far-field interactions that have complex dependencies. 74 4.3.2.2 Upward Tree Traversal (C2M and M2M) The C2M step generates the multipole expansion of a leaf box from all particles within it. We create a task for the C2M operation of each leaf node. Even for a small geometry, the number of leaves far exceeds the number of threads available on a typical shared memory architecture. Thus, there is little point in making C2M tasks finer grained than creating the entire multipole expansion for a single leaf. Creating a task from groups of leaves would yield larger granularity tasks, but it would also increase the number of M2M and M2L tasks dependent on each C2M task, restricting parallelism up the tree. The M2M step generates the multipole expansion of a parent node from all its children. We create a separate task for each child being interpolated, shifted and aggregated to create a parent node. This means each task has a single input dependency on the child node’s multipole data being ready and a single output dependency on the parent node. 
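A sketch of this per-child tasking scheme using OpenMP task dependences is shown below (the ready(:) sentinel array, the box numbering and the interpolate_shift_aggregate kernel are hypothetical; boxes are assumed to be numbered in post-order so that the tasks of a node's children are generated before its own). Each task reads one child and updates its parent, so children of different parents can run concurrently, while updates to a common parent are serialized by their dependences.

subroutine m2m_tasks(nboxes, parent_of, ready)
  implicit none
  integer, intent(in)    :: nboxes
  integer, intent(in)    :: parent_of(nboxes)  ! 0 for the root box
  integer, intent(inout) :: ready(nboxes)      ! one dependence sentinel per box
  integer :: c, p

  !$omp parallel private(c, p)
  !$omp single
  do c = 1, nboxes                             ! post-order numbering assumed
     p = parent_of(c)
     if (p > 0) then
        ! C2M tasks (omitted here) would carry depend(out: ready(c)) for leaf boxes
        !$omp task firstprivate(c, p) depend(in: ready(c)) depend(out: ready(p))
        call interpolate_shift_aggregate(c, p) ! one child's M2M contribution
        !$omp end task
     end if
  end do
  !$omp end single
  !$omp end parallel
end subroutine m2m_tasks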
The M2M operation to produce the entire multipole data for a parent node could be a single task as well, but then such a task would depend on the multipole data of all child nodes being ready, instead of just one. As we demonstrate in Section 4.4, coarse-graining the M2M tasks does not perform as well as the fine-grain approach we adopt. We provide the pseudocode for this initial version of our task-parallel upward tree traversal algorithm in Alg. 4.3.

Algorithm 4.3 Task-based upward tree traversal
Require: b.c, the coordinates of the parent box center
Ensure: M_b is the parent's multipole representation
1:  for each box b in post-order traversal do
2:    if b is a leaf box then
3:      task Depend Out box b
4:        M_b ← C2M(b)
5:      end task
6:    else
7:      for each child box c do
8:        task Depend In child box c, Depend Out box b
9:          T[c] ← interpolation(M_c)
10:         T[c] ← shift(T[c], b.c)
11:         aggregate(M_b, T[c])
12:       end task
13:     end for
14:   end if
15: end for

One of the drawbacks of the above described upward tree traversal scheme is the interpolation of the nodes at the higher levels of the tree. For instance, in a typical surface geometry, there are likely to be 16 nodes at the highest level. Due to output dependencies, only up to 16 threads can be actively working on the interpolation of these high-level nodes. Therefore, we apply a further refinement of the M2M tasks for the high level tree nodes. All samples within a node are fully independent during the shifting and aggregating operations; therefore, we split these operations into many tasks for individual nodes. Interpolation is more complex though. While it is beyond the scope of this chapter to go into too much detail, in MLFMA the multipole data take the form of functions sampled in two angular dimensions; the data can be viewed as a rectangular array of function samples which can be partitioned into block columns or rows. FFT-based interpolations of these partitions are also independent of each other Hughey et al. (2019); Lingg et al. (2020). Hence, we create tasks for the interpolations of partitions. We illustrate the fine-grained task parallel M2M method used for high level nodes in Alg. 4.4.

Algorithm 4.4 Parallel Interpolation
Ensure: c is the child box being interpolated
1:  V ← partition(M_c)
2:  for each partition p in V do
3:    task
4:      for each sample vector v in p do
5:        W[v] ← interpolate(v)
6:        W1[v] ← transposeandfold(W[v])
7:      end for
8:    end task
9:  end for
10: TaskWait
11: V ← partition(W1)
12: for each partition p in V do
13:   task
14:     for each sample vector v in p do
15:       W[v] ← interpolate(v)
16:       W1[v] ← transposeandfold(W[v])
17:     end for
18:   end task
19: end for
20: TaskWait

4.3.2.3 Translations (M2L)

The M2L phase translates the multipole expansion of each source box to the local expansions of all observer boxes in its far-field. Following our previous strategy of minimal dependencies would mean that each translation of a source to an observer box should be a separate task, as in the fine-grained parallelization of the M2M phase. On the other extreme, all translations for a source node could be defined as a single task, which could potentially reduce the number of times a source node needs to be loaded from memory. We have found that a middle ground between the two, i.e., performing translations in chunks, is the most efficient approach for M2L. In MLFMA, the number of translations (interactions) required for a node changes significantly from one geometry to another - while the average number of translations per node is about 27 for a surface geometry, this number goes up to 189 for a volume geometry (which is not common in practice).
Therefore, we experimented with various bundling factors (bf) for M2L; see Section 4.4 for further details. We provide the pseudocode for our task-parallel M2L implementation with bundling in Alg. 4.5.

Algorithm 4.5 Task-parallel translations
Ensure: bf is the translation bundling factor
Ensure: L_b is the local expansion of box b
1: for each box b do
2:   for each box s interacting with b, in groups of bf, do
3:     task Depend In box s, Depend Out box b
4:       T ← compute_interaction(s, b)
5:       L_b ← add_interaction(T)
6:     end task
7:   end for
8: end for

4.3.2.4 Downward Tree Traversal (L2L and L2O)

As mentioned before, the L2L and L2O steps are almost the reverse of the M2M and C2M operations, respectively. As such, their task parallelization follows the same strategy as the upward tree traversal outlined above, albeit with some simplifications. For L2L, the highest level nodes are read-only. Output dependencies are on the nodes one level down, of which there are at least 64. This represents a sufficient degree of parallelism for existing multi-core and many-core architectures; therefore, we have not adopted the fine-grained parallelization method of M2M here, but it certainly can be done relatively easily.

4.3.2.5 Task ordering

A further consideration is the impact of the order of tasks. Being able to influence the scheduling of tasks is important for performance reasons, because tasks from different phases of the computation that do not have dependencies between them may "fill in" the voids encountered during execution. Most task-based runtime systems, including OpenMP which we have used to implement the ideas described above, allow programmers to specify task priorities. In OpenMP though, task priorities are only suggestions to the runtime system, and we have observed in general that these priorities have little to no effect on the scheduling of tasks; at least, that has been the case for our task-parallel implementation. However, we have found that the order in which tasks are generated affects their execution order, and that is what we have used to modify the scheduling of tasks. In this regard, near-field tasks provide the greatest flexibility because they can only conflict with the L2O tasks writing the tree-generated potential values. Therefore, near-field tasks can be performed without race conditions at any time before or after L2O. In our algorithm, the chosen time to perform near-field processing is after the translations (M2L) and before starting the downward traversal (L2L). At the top of the MLFMA tree, the number of nodes is typically smaller than the number of threads, but each node is very large and requires a significant amount of computation. As a result, there is a good chance that some threads will be left without tasks to perform until the upward traversal (M2M) and the translations (M2L) of these highest level nodes are completed. Performing near-field computations during this time-frame fills in any potential gaps. The remaining stages of the tree traversal have more dependencies to deal with. A node cannot start its M2M computations until the M2M computations of its child nodes have finished. A node cannot perform its M2L translations until its own M2M computations are completed. Finally, a node cannot start its L2L phase until its parents have completed theirs and the node has completed its M2L interactions. This limits task ordering, but still allows some flexibility.
The simplest implementation is generating all upward traversal tasks first, then all far-field interaction tasks, and finally all downward traversal tasks. Alternatively, one can do the same thing but at the level of individual nodes. As soon as a node has interpolated, shifted and aggregated all of its children, farfield interactions can be computed for that node. On the opposite end, a high level node can perform its L2L operations as soon as M2L has translated all of its source nodes, but before any of its children have performed M2L translations. This approach can be repeated, computing anterpolations before translations where possible. The first method was chosen as it was empirically found to perform better. 4.4 Results In this section, we evaluate the performance of the task-parallel MLFMA algorithm described. All results were obtained on the Cori supercomputer at National Energy Research Scientific Com- puting Center (NERSC). Each Haswell node on this system contains two sockets, populated with Intel Xeon E5-2698 v3 (Haswell) processors with a clock speed of 2.3 GHz. Each node has 32 cores, plus hyperthreading, 128 GB 2133MHz DDR4 RAM, and 40M Cache. The code is implemented in Fortran 90 using only OpenMP parallelization and was compiled with the Intel compiler version 19.0.3.199. The Cray FFTW library version 3.3.8.4 is used for all FFT operations. Performance was also measured using Cori’s KNL nodes. Each KNL node contains a single socket, populated with an Intel Xeon Phi Processor 7250 ("Knights Landing") processor with a clock speed of 1.4 GHz. Each node has 68 cores, with 4 hardware threads per node, 96 GB 2400 79 MHz DDR4 RAM, and 64 KB L1 cache per core, plus 1MB L2 cache per tile (2 cores per tile). The lower processor speed vs Haswell leads to longer execution times. 4.4.1 Tuning the Task-Parallel MLFMA Implementation As mentioned in 4.3, there are two optimizations we used for our task-parallel MLFMA implemen- tation. These are ordering of the creation of tasks, which in turn alters the scheduling of tasks, and bundling of tasks. For tuning our implementation, we chose a 7-level sphere geometry, as spheres are a commonly used benchmark for MLFMA codes. Our tuning is empirical, certainly relying on the specific architecture and geometry. However, we note that in applications, the MLFMA is used as an inner kernel in long-running iterative solvers that can take hundreds to thousands of iterations to converge for large problems. Since our tuning parameter space is relatively small, it is practical to tune the performance for the particular geometry and architecture before the actual solver is launched. 4.4.1.1 Task Generation Ordering Since the near-field (NF) phase is the most flexible phase within MLFMA, we created different flavors of task-parallel MLFMA where NF tasks are generated between all tree computation phases. Starting with “NF First", these are “NF after M2M", “NF after M2L", “NF Last". There are two other finer grain reorderings; they interleave the execution of M2L with M2M (“M2L during M2M") or M2L with L2L (“M2L after L2L"), rather than executing each phase entirely separately. As can be seen in Fig. 4.4, for most thread counts, generating NF tasks at different phases has minimal impact, but for 64 threads “NF after M2L" results in a 5% performance improvement over the others. Executing L2L wherever possible before M2L produces good scaling, but poor execution times overall. 
Executing M2L as soon as possible during the M2M execution shows a slight improvement in performance. Finally, combining the best of the two task orders, “NF after M2L" and “M2L during M2M", produces a 4% execution improvement at 32 threads and over 18% 80 102 ) s ( e m T i 101 NF First NF after M2M NF after M2L NF Last M2L after L2L M2L during M2M Optimal 100 101 Thread Counts Figure 4.4: Impact of task order on execution time. improvement at 64 threads. This method is labeled on the graph as “Optimal", and is used for the task-parallel MLFMA results reported. 4.4.1.2 Task Bundling for M2L The second optimization we implemented is bundling tree operations together in each task. For the same sphere geometry, we experimented with different bundling schemes. This included the extreme cases of bundling all M2M operations of children of a single parent node together on one side and creating a separate task for each child on the other side. Both methods performed on par with each other for small thread counts, but we observed that bundling all children into a single task during M2M performed significantly worse at 64 threads (see Fig. 4.5). This is likely due to the small number of tasks created at higher tree levels which contain computationally expensive nodes. Therefore, we define all children during M2M as individual tasks. For M2L interactions, we experimented with different bundling factors such as 9, 27 (which is the expected number of interactions for surface geometries), 189 (theoretical maximum for M2L for any geometry) and compared them with regular (non-bundled) M2L in terms of performance, see Fig. 4.5. Grouping the translations of 9 observer nodes with a common source node into a single task provides a notable benefit. Any impact is barely noticeable through 16 threads, but at 32 and 81 M2M Block M2L Block None M2L Block 9 M2L Block 27 M2L Block 189 101.5 ) s ( e m T i 101 100.5 100 101 Thread Counts Figure 4.5: Impact of bundling on execution time. Performance of M2M implementation with all children bundled into a single task (M2M with Bundling) and M2L with various bundling factors (none, 9, 27, and 189) are shown for different thread counts. 64 threads, the performance improvement over the non-bundled version is nearly 33%. Increasing the bundling factor to 27 interactions of a common source node decreases the performance slightly, and the extreme case of bundling 189 interactions results in a significant performance falloff even at small number of threads. All results presented in the rest of this manuscript uses a bundling factor of 9 for the M2L phase. 4.4.2 Performance Comparison between BSP and Task Parallel Implementations In this subsection, we compare the performance of our task parallel MLFMA implementation against the BSP version on a number of geometries. Both versions use the same tree construction methods (so the amount of work performed by both methods is identical) and they use the same OpenMP compilation and runtime settings. The potentials computed by both versions were compared to ensure that the only differences are due to floating point arithmetic precision. For benchmarking, we used four different geometries. The first geometry is a simple planar grid of particles (in the  = 0 plane). The grid dimensions are 128 × 128, with 5,242,880 points uniformly distributed over the geometry and smallest FMM box size of /4. This produces a 10-level tree with 20 points in each leaf box. 
The second geometry is a sphere whose radius is 128λ, with 7,264,954 points uniformly distributed over the geometry and a smallest FMM box size of λ/4. This also produces a 10-level tree, with an average of 18 points in each leaf box. The third geometry is a 3D volumetric distribution of particles. The box dimensions are 8λ × 8λ × 8λ, with 1,048,576 points randomly distributed over the cube and a smallest FMM box size of λ/4. This produces a 6-level tree with an average of 32 points in each leaf box. The last geometry is an airplane model which is 256λ in length. It is discretized with over 4,459,776 points and the smallest FMM box size is λ/4. This produces an 11-level tree with an average of 15 points in each leaf box, albeit with a highly non-uniform distribution of points across leaves.

4.4.2.1 Performance on a Multicore Architecture (Cori-Haswell)

Figure 4.6 compares the execution times of the BSP and task-parallel MLFMA versions using 1 to 64 threads. Note that the Haswell processors only have 32 physical cores (on two sockets), so 64-thread executions use hyperthreading. The airplane model, which is a real application, shows the strongest performance advantage for task-parallel MLFMA, as it attains as much as a 1.35x speedup over the BSP version. Both the grid and volume geometries also show that the task parallel version achieves consistently increasing speedups over the BSP version with an increasing number of threads. While we initially observe significant gains with task parallelism over the BSP version for the sphere geometry as well, to our surprise these gains fade away at high thread counts. We try to provide more detailed insight into these results in the next subsection.

Figure 4.6: Task vs Loop (BSP) parallel runs on Haswell compute nodes for four different geometries.

4.4.2.2 Manycore Architecture (Cori-KNL)

We performed the same performance analysis using Cori-KNL nodes, which have a significantly different architecture than Cori-Haswell nodes. We observe that for 2 to 32 threads, task parallel MLFMA shows performance gains similar to those of the Cori-Haswell experiments (see Fig. 4.7). However, its scalability falls off slightly at 64 cores, which is potentially due to two cores sharing the L2 cache on a tile when the number of threads is increased from 32 to 64. Beyond 64 threads, KNL effectively employs hyperthreading. In this regime (not shown in the plots), while the BSP implementation is able to keep performing at a similar level, the performance of the task parallel MLFMA actually starts dropping. This is likely because the scheduling of tasks, which must be done sequentially, starts becoming a bottleneck as the number of threads increases. The use of many slow cores on KNL (as opposed to fewer, higher-performance cores as on Xeon CPUs) can have a compounding effect on this bottleneck, too.
Figure 4.7: Task vs Loop (BSP) parallel runs on KNL compute nodes for four different geometries.

4.4.3 Understanding the Reasons behind Observed Differences

To understand the performance benefits of task parallel MLFMA over the BSP version, we conducted timeline and cache performance analyses, for which we used the perf-stat tool.

4.4.3.1 Timeline Analysis

Figure 4.8 shows the order of execution of threads in the BSP version for a 7-level grid geometry using 64 threads (hyperthreaded) on a Cori-Haswell node. NF, C2M and L2O all perform well. Each of these operations has a very large number of nodes that can all be processed in parallel. M2M begins showing load balance issues, which become very significant at the highest level where only 16 nodes can be processed. M2L shows thread idleness to a lesser extent, likely due to thread dependencies, as there are a large number of M2L nodes that can be processed in parallel up to the highest level, where we again see an issue with only 16 nodes being available at the top level. Finally, L2L shows thread inactivity similar to M2M, but in reverse.

Figure 4.8: BSP timeline on the grid geometry.

Figure 4.9 shows the order of execution of the tasks during task parallel MLFMA. Unlike the BSP version, the C2M, M2M and M2L tasks are mixed together as the dependencies allow. Further, NF is mixed in with the other tasks, filling in some of the empty space during the M2M computation of the top level, with M2L helping to fill in more of the rest. The start of L2L also shows the benefit of fine-grain parallelism in the top-level interpolation and anterpolation operations, where more threads are able to participate.

Figure 4.9: Task parallel timeline on the grid geometry.

Figures 4.10 and 4.11 show how the BSP and task parallel timelines change for the sphere geometry. The sphere fills more of the highest level tree nodes. This means the BSP approach has more nodes to process at the top level and is more efficient at keeping all threads active. The dependencies caused by lower level nodes having fewer child boxes than higher level nodes, as the sphere acts more like a surface, mean that the task parallel approach has more tasks that cannot be executed until the tasks they depend on complete. As a result, the task parallel approach is not as efficient for this example.

Figure 4.10: BSP timeline on the sphere geometry.

Alternatively, Figures 4.12 and 4.13 show how the BSP and task parallel timelines behave for the airplane geometry. Unlike the previous examples, this geometry is non-uniform, so many of the particles are clustered in fewer nodes. The impact of this can be seen in Figure 4.12, where the top levels of M2M and L2L have fewer nodes that can be processed.
Furthermore, the next level down still has a limited number of nodes to process. Figure 4.13 shows that the task-parallel version keeps more threads active by performing M2L and NF tasks during the times when there are not enough high-level nodes to occupy all threads.

Figure 4.12: BSP timeline on the airplane geometry.

Figure 4.13: Task parallel timeline on the airplane geometry.

4.4.3.2 Cache Analysis

To look further into why the task-parallel approach is more efficient, we analyzed the cache utilization of the two versions. The cache analysis was performed using VTune on 64-thread executions of the grid geometry on Cori-Haswell nodes. The results in Fig. 4.14 show that the ratio of cache hits to misses is not always more favorable for the task-parallel version than for the BSP version. However, since the L1 cache hit ratio is as high as 99.8%, any differences are effectively a rounding error. As such, we conclude that while task-parallel MLFMA makes less effective use of the cache, this does not negatively impact its performance to a significant degree.

Figure 4.14: Comparison of the cache performance of the BSP and task parallel implementations on a Cori-Haswell node (L1 and L2 hits and misses for the grid, sphere and plane geometries, BSP vs. task).

4.5 Conclusions

Because the amount of processing per level remains nearly constant while the number of nodes per level decreases when moving up the tree, Helmholtz FMM presents parallelization challenges that are not present in Laplace FMM. In this chapter, we presented a task-parallel MLFMA implementation to address these challenges. Results on various geometries have shown that in most cases, and particularly for the real-world application case of the airplane geometry, the task-parallel implementation delivers improved performance and scalability on shared memory architectures compared to a bulk synchronous parallel MLFMA implementation. Our study provides evidence that task parallelism is a promising approach for MLFMA, and it can be even more useful in a hybrid shared and distributed memory parallel context, because it would allow great flexibility in overlapping the execution of the communication-intensive parts of MLFMA with its computation-intensive parts so as to minimize idle times and achieve scaling to a large number of compute nodes.

CHAPTER 5

FUTURE WORK

When examining the results of Chapter 3, we noticed in Table 3.1 that the scaling of M2M and L2L remained quite strong through 2048 processes, while M2L's performance degraded much faster. The most likely cause is excessive data communication during translation. At the upper levels of the tree, a process owns at most one full node for a given level, or only part of a node. To perform translations, the observer node needs all samples from the source node, or, if the node is split across processes, the samples of the source node corresponding to the samples the process owns for the observer node. Since no process can own more than one node at these levels, all of the data for every node must be communicated between the processes owning source nodes and the processes owning observer nodes.
Worse, every observer node is guaranteed to be owned by a separate process, so each source node's data must be communicated to up to the 27 processes owning observer nodes for a surface geometry, or up to 189 for a volume. At lower levels, multiple observer nodes may reside on a single process, or a process may own both a source and an observer node, reducing or eliminating some of the communication at those levels. However, as the number of processes increases, the level at which all node data must be communicated moves down the tree, increasing the communication requirements.

A possible improvement to the communication requirements lies in local interpolation, as shown in Yang et al. (2019). In that paper, the samples of the nodes at the highest levels of the tree are divided into groups based on the number of processes, and each group of samples from every node at that level is assigned to the corresponding process. Because each process owns the corresponding samples of every node at a level, translation requires no communication. With this approach, global interpolation could still be used at the lower levels, taking advantage of its exact interpolation results and lower sampling rates, while local interpolation is used only at the higher levels, where the translation communication costs are the highest. In addition to managing the errors induced by local interpolation and the increased sampling costs, the impact on the M2M and L2L stages would need to be considered.

A method with more immediate improvements would be to test the performance benefits of combining the MPI parallel approach of Chapter 3 with the shared memory parallel implementation of Chapter 4. In shared memory parallelism, the ratio of nodes in a level to the number of workers would be the same as in MPI parallelism, but all workers on the same processor can access the same memory. In pure MPI parallelism, when a source node is owned by one process and an observer node is owned by another process, the source data must be communicated to the process owning the observer. With a hybrid of MPI and shared memory parallelism, a single process would reside on each compute node, while each core of that node runs a separate thread. All threads of the same process can then use shared memory, rather than more expensive communication, during operations where the interacting tree nodes are handled by separate threads of the same process.

Further work can also be done to improve load balancing for non-uniform trees. Hughey (2018) describes a bottom-up load balancing method. The work done in Chapter 3, which provides a fine-grained parallel algorithm at the top levels of the tree, results in a different workload than that of the cited thesis, so a new method would be necessary. In an unbalanced tree, there would likely be a higher level at which all nodes must be fully processed, but some of these nodes may have few descendants with any particles. The imbalance could range from all high-level nodes having the same number of descendants, to one high-level node having a full subtree while the other nodes have only a unary subtree. Bottom-up partitioning could properly balance the spatial partitioning at the low levels, but would assign only a single process to the previously mentioned high-level node with a unary subtree. Top-down partitioning could assign a large number of processes to that same high-level node, but all of these processes would then have to share a single leaf node. Instead, the data could be partitioned among processes in a two-stage process.
In the first stage, full high-level nodes at an appropriate level would be assigned to groups of processes. The size of each group could be balanced between the number of descendants of the high-level node and the amount of work needed to process the high-level node itself along with its share of the node's parents. The second stage would assign the descendants of each shared node to the processes sharing that ancestor node. This method would balance the direction partitioning among the process groups, and then balance the spatial partitioning within each group.
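As a rough illustration of this two-stage idea, the sketch below first sizes process groups in proportion to an estimated per-node weight (the node's own work plus that of its descendants) and then greedily assigns descendant subtrees to the least-loaded process within each group. All structures, names and the weighting heuristic (HighLevelNode, two_stage_partition, and so on) are hypothetical placeholders used only to make the idea concrete; they are not a proposed implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical description of one high-level node: the (direction-space) work of
// processing the node itself, plus the work of each of its descendant subtrees.
struct HighLevelNode {
    double node_work;
    std::vector<double> descendant_work;
};

struct Partition {
    std::vector<int> group_size;                     // processes assigned per high-level node
    std::vector<std::vector<int>> descendant_owner;  // rank within the group for each descendant
};

// Stage 1: split processes into groups proportional to each node's total weight.
// Stage 2: greedily assign descendant subtrees to the least-loaded process of the group.
Partition two_stage_partition(const std::vector<HighLevelNode>& nodes, int num_procs) {
    const std::size_t n = nodes.size();
    std::vector<double> weight(n);
    for (std::size_t i = 0; i < n; ++i)
        weight[i] = nodes[i].node_work +
                    std::accumulate(nodes[i].descendant_work.begin(),
                                    nodes[i].descendant_work.end(), 0.0);
    const double total = std::accumulate(weight.begin(), weight.end(), 0.0);

    Partition part;
    part.group_size.assign(n, 1);                 // every high-level node gets at least one process
    int spare = num_procs - static_cast<int>(n);  // assumes num_procs >= number of high-level nodes
    for (std::size_t i = 0; i < n && spare > 0; ++i) {
        int extra = std::min(spare, static_cast<int>((num_procs - n) * weight[i] / total));
        part.group_size[i] += extra;
        spare -= extra;
    }
    // Processes left over from rounding could be handed to the heaviest groups (omitted).

    part.descendant_owner.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<double> load(part.group_size[i], 0.0);
        for (double w : nodes[i].descendant_work) {
            int target = static_cast<int>(
                std::min_element(load.begin(), load.end()) - load.begin());
            part.descendant_owner[i].push_back(target);  // spatial balancing within the group
            load[target] += w;
        }
    }
    return part;
}
```

A production version would, at a minimum, need to fold the shared high-level node's own work into the per-process load measure and map the resulting groups onto MPI sub-communicators.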
BIBLIOGRAPHY

Abduljabbar, Mustafa, Mohammed Al Farhan, Noha Al-Harthi, Rui Chen, Rio Yokota, Hakan Bagci & David Keyes. 2019. Extreme scale fmm-accelerated boundary integral equation solver for wave scattering. SIAM Journal on Scientific Computing 41(3). C245–C268.
Agullo, Emmanuel, Bérenger Bramas, Olivier Coulaud, Eric Darve, Matthias Messner & Toru Takahashi. 2014. Task-based fmm for multicore architectures. SIAM Journal on Scientific Computing 36(1). C66–C93.
Appel, A. 1985. An efficient program for many-body simulations. SIAM J. Sci. Comput. 6. 85–103.
Barnes, J. & P. Hut. 1986. A hierarchical O(N log N) force-calculation algorithm. Nature 324. 446–449.
Burresi, Matteo, Lorenzo Cortese, Lorenzo Pattelli, Mathias Kolle, Peter Vukusic, Diederik S Wiersma, Ullrich Steiner & Silvia Vignolini. 2014. Bright-white beetle scales optimise multiple scattering of light. Scientific reports 4.
Cecka, Cris & Eric Darve. 2013a. Fourier-based fast multipole method for the helmholtz equation. SIAM Journal on Scientific Computing 35(1). A79–A103.
Cecka, Cris & Eric Darve. 2013b. Fourier-based fast multipole method for the helmholtz equation. SIAM Journal on Scientific Computing 35(1). A79–A103.
Chaigne, Sebastien, Guillaume Sylvand, Eric Duceau & Julien Simon. 2007. On the use of the fast multipole method for accurate automotive body panel acoustic load predictions. Tech. rep. SAE Technical Paper.
Chew, W.C., V. Jandala, C.C. Lu, E. Michielssen, B. Shanker, J.M. Song & J.S. Zhao. 1997. Fast multilevel techniques for solving integral equations in electromagnetics. In Microwave conference proceedings, 1997. apmc '97., 1997 asia-pacific, vol. 1, 457–460. doi:10.1109/APMC.1997.659422.
Chew, Weng Cho, Eric Michielssen, JM Song & Jian-Ming Jin. 2001. Fast and efficient algorithms in computational electromagnetics. Artech House, Inc.
Chien, R. J. & B. K. Alpert. 1997. A fast spherical filter with uniform resolution. J. Comput. Phys. 136. 580–584.
Coifman, R., V. Rokhlin & S. Wandzura. 1993a. The fast multipole method for the wave equation: A pedestrian prescription. IEEE Antennas Propagat. Mag. 35(3). 7–12.
Coifman, Ronald, Vladimir Rokhlin & Stephen Wandzura. 1993b. The fast multipole method for the wave equation: A pedestrian prescription. IEEE Antennas and Propagation magazine 35(3). 7–12.
Darve, Eric. 2000. The fast multipole method: numerical implementation. Journal of Computational Physics 160(1). 195–240.
Dembart, B. & E. Yip. 1995. A 3d fast multipole method for electromagnetics with multiple levels. In Proceedings of the 11th annual conference on applied computational electromagnetics, vol. 1, 621–628. Monterey, CA.
Dembart, B. & E. Yip. 1998. The accuracy of fast multipole methods for maxwell's equations. IEEE Computational Science and Engineering 5. 48–56.
Ding, Hong-Qiang, Naoki Karasawa & William A Goddard III. 1992. Atomic level simulations on a million particles: The cell multipole method for coulomb and london nonbond interactions. The Journal of chemical physics 97(6). 4309–4315.
Engheta, Nader, William D Murphy, Vladimir Rokhlin & Marius Vassiliou. 1985. The fast multipole method for electromagnetic scattering computation. IEEE Transactions on Antennas and Propagation 40. 634–641.
Ergin, A Arif, Balasubramaniam Shanker & Eric Michielssen. 1998. Fast evaluation of three-dimensional transient wave fields using diagonal translation operators. Journal of Computational Physics 146(1). 157–180.
Ergin, A Arif, Balasubramaniam Shanker & Eric Michielssen. 1999. The plane-wave time-domain algorithm for the fast analysis of transient wave phenomena. IEEE Antennas and Propagation Magazine 41(4). 39–52.
Ergül, Ö & L Gürel. 2008. Hierarchical parallelisation strategy for multilevel fast multipole algorithm in computational electromagnetics. Electronics Letters 44(1). 3–5.
Ergul, Ozgur. 2011. Parallel implementation of MLFMA for homogeneous objects with various material properties. Progress In Electromagnetics Research 121. 505–520.
Ergül, Özgür. 2011. Solutions of large-scale electromagnetics problems involving dielectric objects with the parallel multilevel fast multipole algorithm. JOSA A 28(11). 2261–2268.
Ergul, Ozgur & Levent Gurel. 2006. Enhancing the accuracy of the interpolations and anterpolations in mlfma. IEEE Antennas and Wireless Propagation Letters 5(1).
Ergul, Ozgur & Levent Gurel. 2013. Accurate solutions of extremely large integral-equation problems in computational electromagnetics. Proceedings of the IEEE 101(2). 342–349.
Ergül, Özgür & Levent Gürel. 2013. Fast and accurate analysis of large-scale composite structures with the parallel multilevel fast multipole algorithm. JOSA A 30(3). 509–517.
Frigo, Matteo & Steven G. Johnson. 2005. The design and implementation of FFTW3. Proceedings of the IEEE 93(2). 216–231. Special issue on "Program Generation, Optimization, and Platform Adaptation".
Greengard, L. 1988. The rapid evaluation of potential fields in particle systems. Cambridge, MA: MIT Press.
Greengard, L. & V. Rokhlin. 1987. A fast algorithm for particle simulations. Journal of Computational Physics 20. 63–71.
Greengard, Leslie, Jingfang Huang, Vladimir Rokhlin & Stephen Wandzura. 1998. Accelerating fast multipole methods for the helmholtz equation at low frequencies. IEEE Computational Science and Engineering 5(3). 32–38.
Hamada, Tsuyoshi, Tetsu Narumi, Rio Yokota, Kenji Yasuoka, Keigo Nitadori & Makoto Taiji. 2009. 42 tflops hierarchical n-body simulations on gpus with applications in both astrophysics and turbulence. In Proceedings of the conference on high performance computing networking, storage and analysis SC '09, 62:1–62:12. New York, NY, USA: ACM. doi:10.1145/1654059.1654123. http://doi.acm.org/10.1145/1654059.1654123.
von Hoerner, S. 2001. How it all started. In Dynamics of star clusters and the milky way, vol. 228, 11.
Hughey, S., H. M. Aktulga & B. Shanker. 2018. Scalable and accurate tree traversal operators for helmholtz fmm. Under submission.
Hughey, S., H. M. Aktulga, M. Vikram, M. Lu, B. Shanker & E. Michielssen. 2019. Parallel wideband mlfma for analysis of electrically large, nonuniform, multiscale structures. IEEE Transactions on Antennas and Propagation 67(2). 1094–1107. doi:10.1109/TAP.2018.2882621.
Hughey, Stephen Michael. 2018. Efficient parallelization of non-uniform fast multipole algorithms.
Huttunen, Tomi, Matti Malinen, Jari P Kaipio, Phillip Jason White & Kullervo Hynynen. 2005. A full-wave helmholtz model for continuous-wave ultrasound transmission. IEEE transactions on ultrasonics, ferroelectrics, and frequency control 52(3). 397–409.
Ishiyama, Tomoaki, Keigo Nitadori & Junichiro Makino. 2012. 4.45 pflops astrophysical n-body simulation on k computer: The gravitational trillion-body problem. In Proceedings of the international conference on high performance computing, networking, storage and analysis SC '12, 5:1–5:10. Los Alamitos, CA, USA: IEEE Computer Society Press. http://dl.acm.org/citation.cfm?id=2388996.2389003.
Jakob-Chien, R & B K Alpert. 1997. A fast spherical filter with uniform resolution. J. Comp. Phys. 136. 580–584.
Lashuk, Ilya, Aparna Chandramowlishwaran, Harper Langston, Tuan-Anh Nguyen, Rahul Sampath, Aashay Shringarpure, Richard Vuduc, Lexing Ying, Denis Zorin & George Biros. 2012. A massively parallel adaptive fast multipole method on heterogeneous architectures. Communications of the ACM 55(5). 101–109.
Lingg, Michael P, Stephen M Hughey & Hasan Metin Aktulga. 2018. Optimization of the spherical harmonics transform based tree traversals in the helmholtz fmm algorithm. In Proceedings of the 47th international conference on parallel processing, 1–11.
Lingg, Michael P., Stephen M. Hughey, Hasan Metin Aktulga & Balasubramaniam Shanker. 2020. High performance evaluation of helmholtz potentials using the multi-level fast multipole algorithm.
López-Portugués, Miguel, Jesús A López-Fernández, Jonatan Menéndez-Canal, Alberto Rodríguez-Campa & José Ranilla. 2012. Acoustic scattering solver based on single level fmm for multi-gpu systems. Journal of Parallel and Distributed Computing 72(9). 1057–1064.
Melapudi, Vikram, Balasubramaniam Shanker, Sudip Seal & Srinivas Aluru. 2011. A scalable parallel wideband MLFMA for efficient electromagnetic simulations on large scale clusters. Antennas and Propagation, IEEE Transactions on 59(7). 2565–2577.
Michiels, Bart, Jan Fostier, Ignace Bogaert & Daniel De Zutter. 2013a. Performing large full-wave simulations by means of a parallel MLFMA implementation. In Antennas and propagation society international symposium (apsursi), 2013 ieee, 1880–1881. IEEE.
Michiels, Bart, Jan Fostier, Ignace Bogaert & Daniël De Zutter. 2013b. Weak scalability analysis of the distributed-memory parallel MLFMA. Antennas and Propagation, IEEE Transactions on 61(11). 5567–5574.
Michiels, Bart, Jan Fostier, Ignace Bogaert & Daniel De Zutter. 2015. Full-wave simulations of electromagnetic scattering problems with billions of unknowns. Antennas and Propagation, IEEE Transactions on 63(2). 796–799.
Michiels, Bart, Jan Fostier, Ignace Bogaert, Piet Demeester & Daniël De Zutter. 2011. Towards a scalable parallel MLFMA in three dimensions. In Computational electromagnetics international workshop (cem), 2011, 132–135. IEEE.
Nishimura, Naoshi. 2002. Fast multipole accelerated boundary integral equation methods. Applied mechanics reviews 55(4). 299–324.
Pi, Wei-Chao, Xiao-Min Pan & Xin-Qing Sheng. 2010. A parallel multilevel fast multipole algorithm based on openmp. In 2010 international conference on microwave and millimeter wave technology, 1356–1359. IEEE.
Potter, Douglas, Joachim Stadel & Romain Teyssier. 2017. Pkdgrav3: beyond trillion particle cosmological simulations for the next era of galaxy surveys. Computational Astrophysics and Cosmology 4(1). 2.
Rahimian, Abtin, Ilya Lashuk, Shravan Veerapaneni, Aparna Chandramowlishwaran, Dhairya Malhotra, Logan Moon, Rahul Sampath, Aashay Shringarpure, Jeffrey Vetter, Richard Vuduc, Denis Zorin & George Biros. 2010. Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures. In Proceedings of the 2010 acm/ieee international conference for high performance computing, networking, storage and analysis SC '10, 1–11. Washington, DC, USA: IEEE Computer Society. doi:10.1109/SC.2010.42. http://dx.doi.org/10.1109/SC.2010.42.
Salmon, John K. 1991. Parallel hierarchical n-body methods: California Institute of Technology dissertation.
Salmon, John K & Michael S Warren. 1994. Fast parallel tree codes for gravitational and fluid dynamical n-body problems. The International Journal of Supercomputer Applications and High Performance Computing 8(2). 129–142.
Sarvas, J. 2003. Performing interpolation and anterpolation by the fast fourier transform in the 3d multilevel fast multipole algorithm. SIAM J. Numer. Anal. 41. 2180–2196.
Shanker, B., A.A. Ergin, Mingyu Lu & E. Michielssen. 2003. Fast analysis of transient electromagnetic scattering phenomena using the multilevel plane wave time domain algorithm. Antennas and Propagation, IEEE Transactions on 51(3). 628–641. doi:10.1109/TAP.2003.809054.
Shanker, B. & H. Huang. 2007. Accelerated cartesian expansions - a fast method for computing of potentials of the form r^{-ν} for all real ν. Journal of Computational Physics 226. 732–753.
Shimada, Jiro, Hiroki Kaneko & Toshikazu Takada. 1994. Performance of fast multipole methods for calculating electrostatic interactions in biomacromolecular simulations. Journal of Computational Chemistry 15(1). 28–43.
Song, J. M. & W. C. Chew. 1995. Multilevel fast-multipole algorithm for solving combined field integral equations of electromagnetic scattering. Microwave and Optical Technology Letters 10(1). 14–19.
Song, J. M., C. C. Lu & W. C. Chew. 1997. Mlfma for electromagnetic scattering by large complex objects. IEEE Transactions on Antennas and Propagation 45. 1488–1493.
Sundar, Hari, Rahul S Sampath & George Biros. 2008. Bottom-up construction and 2:1 balance refinement of linear octrees in parallel. SIAM Journal on Scientific Computing 30(5). 2675–2708.
Taboada, Jose Manuel, Marta G Araujo, Fernando Obelleiro Basteiro, José Luis Rodríguez & Luis Landesa. 2013. MLFMA-FFT parallel algorithm for the solution of extremely large problems in electromagnetics. Proceedings of the IEEE 101(2). 350–363.
Velamparambil, S, Jiming Song & Weng Cho Chew. 2000. On the parallelization of electrodynamic multilevel fast multipole method on distributed memory computers. In Innovative architecture for future generation high-performance processors and systems, 1999 international workshop, 3–11. IEEE.
Vikram, M., He Huang, B. Shanker & T. Van. 2009. A novel wideband fmm for fast integral equation solution of multiscale problems in electromagnetics. Antennas and Propagation, IEEE Transactions on 57(7). 2094–2104. doi:10.1109/TAP.2009.2019926.
Vikram, M., C. Knowles, B. Shanker & L.C. Kempel. 2011. An ultra-wideband fmm for multi-scale electromagnetic simulations. 27th Annual Review of Progress in Applied Computational Electromagnetics.
Vikram, M. & B. Shanker. 2009. An incomplete review of fast multipole methods from static to wideband as applied to problems in computational electromagnetics. Applied Computational Electromagnetics Society Journal 27. 79.
Waltz, Caleb, Kubilay Sertel, Michael Carr, Brian C Usner, John L Volakis & Others. 2007. Massively parallel fast multipole method solutions of large electromagnetic scattering problems. Antennas and Propagation, IEEE Transactions on 55(6). 1810–1816.
Wang, Ruoxi, Chao Chen, Jonghyun Lee & Eric Darve. 2019. Pbbfmm3d: a parallel black-box fast multipole method for non-oscillatory kernels. arXiv preprint arXiv:1903.02153.
Warren, M.S. & J.K. Salmon. 1993. A parallel hashed oct-tree n-body algorithm. In Proc. supercomputing, 1–12.
Wedi, Nils P, Mats Hamrud & George Mozdzynski. 2013. A fast spherical harmonics transform for global nwp and climate models. Monthly Weather Review 141(10). 3450–3461.
Yang, Ming-Lin, Bi-Yi Wu, Hong-Wei Gao & Xin-Qing Sheng. 2019. A ternary parallelization approach of mlfma for solving electromagnetic scattering problems with over 10 billion unknowns. IEEE Transactions on Antennas and Propagation.
Ying, Lexing, George Biros & Denis Zorin. 2004. A kernel-independent adaptive fast multipole algorithm in two and three dimensions. Journal of Computational Physics 196(2). 591–626.
Ying, Lexing & Denis Zorin. 2004. A simple manifold-based construction of surfaces of arbitrary smoothness. ACM Transactions on Graphics 23. 271–275. doi:10.1145/1015706.1015714. http://doi.acm.org/10.1145/1015706.1015714.
Yokota, Rio. 2013. An fmm based on dual tree traversal for many-core architectures. Journal of Algorithms & Computational Technology 7(3). 301–324.
Zhao, Jun-Sheng & Weng Cho Chew. 2000. Integral equation solution of maxwell's equations from zero frequency to microwave frequencies. IEEE Transactions on Antennas and Propagation 48. 1635–1645.