EFFICIENT PARALLELIZATION OF NON-UNIFORM FAST MULTIPOLE ALGORITHMS

By

Stephen Michael Hughey

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical and Computer Engineering – Doctor of Philosophy
Computational Mathematics, Science, and Engineering – Dual Major

2018

ABSTRACT

EFFICIENT PARALLELIZATION OF NON-UNIFORM FAST MULTIPOLE ALGORITHMS

By Stephen Michael Hughey

Many applications of the N-body problem today involve distributions of bodies that are (i) very large and (ii) highly non-uniform. A variety of fast multipole algorithms have been devised to reduce the cost from O(N²) to O(N log N) or O(N) for oscillatory and non-oscillatory problems, respectively. The issue of non-uniformity, however, presents significant challenges in parallelization, requiring a much more nuanced approach. Compounding this challenge, oscillatory N-body problems arising from wave physics (electromagnetics, acoustics, etc.) are burdened with capturing both phase and amplitude information as opposed to just the amplitude; non-uniformity complicates things even further. As a result, the algorithm and underlying data structures become extremely complicated, and parallelization becomes quite difficult.

This thesis aims to develop novel parallel fast multipole methods for both oscillatory and non-oscillatory problems that (i) are controllably accurate to arbitrary precision, (ii) are capable of efficiently handling highly non-uniform distributions, and (iii) scale well up to extremely large problem sizes and numbers of CPU cores. The accelerated Cartesian expansion (ACE) method and the wideband multilevel fast multipole algorithm (MLFMA) are modified to accurately and efficiently accommodate non-uniform, and in the case of the MLFMA extremely large, distributions in parallel. Several parallel algorithms for efficiently building the distributed non-uniform tree data structures are developed. Effective, novel algorithms are introduced to reduce load imbalances arising from non-uniformity and from certain idiosyncrasies of the parallel wideband MLFMA which hamper scalability. The algorithms presented here meet each of the stated goals, enabling computations involving several hundred million degrees of freedom on 2048 cores for an electromagnetics problem and several billion particles on 16,384 cores for non-oscillatory problems.

ACKNOWLEDGEMENTS

I am sincerely grateful for the many people who have so generously supported me in the course of my education and have contributed to my development as a researcher, as a professional, and as a person. From my family and friends I have received boundless love and encouragement, without which I could scarcely have succeeded. From my mentors, invaluable insights on matters ranging from the academic to the interpersonal that continue to prove useful. From my committee, sound advice and a better perspective on communicating with those outside my own area of research. My lab-mates' ready willingness to explore and discuss ideas helped strengthen my grasp of both electromagnetics and computational science. My advisers' patience, encouragement, support, advice, and seriousness have been central to all of my successes. Finally, my fiancée has for years been an unflagging source of love, empathy, compassion, wisdom, and measured criticism whenever I needed any of these things. I cannot adequately express the depth of my gratitude to her in writing.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
KEY TO ABBREVIATIONS
CHAPTER 1  INTRODUCTION
    1.1 N-body problems
    1.2 Fast multipole methods (FMMs)
    1.3 Parallelization of FMMs
    1.4 Thesis structure
CHAPTER 2  PARALLEL ALGORITHM FOR FAST EVALUATION OF STOKES POTENTIALS
    2.1 Introduction
    2.2 Problem statement
    2.3 Generalization of ACE to L∇{ψ(r)}
        2.3.1 Framework for fast N-body computations
        2.3.2 Brief background on ACE
        2.3.3 Modifications for evaluating Φ(n)(r)
        2.3.4 Cost of evaluating L∇{Φ}
    2.4 Parallel ACE algorithm
        2.4.1 Construction of the distributed octree
        2.4.2 Adaptive tree
            2.4.2.1 Evaluation of ACE interactions in adaptive trees
            2.4.2.2 2:1 balance constraint
        2.4.3 Load balancing
        2.4.4 Evaluation of the potential in parallel
    2.5 Results
        2.5.1 Error convergence
        2.5.2 Cost comparison of evaluating L∇{ψ} vs. ψ
        2.5.3 Parallel performance
    2.6 Conclusion
CHAPTER 3  PARALLEL NON-UNIFORM WIDEBAND MLFMA FOR MULTISCALE ELECTROMAGNETIC SIMULATION
    3.1 Introduction
    3.2 Problem Statement
    3.3 Challenges and remedies for parallel multiscale analysis
        3.3.1 Non-uniform trees
        3.3.2 Low-frequency analysis
        3.3.3 MLFMA
            3.3.3.1 Inter/anterpolation and spectral integration
            3.3.3.2 Subtleties for vector fields
        3.3.4 Cross-level interactions
        3.3.5 Interpolation of MLFMA translation operators
    3.4 Summary of parallel wideband MLFMA algorithm
        3.4.1 Tree construction
        3.4.2 Load balancing
        3.4.3 Parallel evaluation
    3.5 Numerical Results
        3.5.1 Accuracy analysis
            3.5.1.1 Hybrid sampling MLFMA error control
            3.5.1.2 Trade-offs for hybrid sampling
            3.5.1.3 Uniform-tree wideband MLFMA error control
            3.5.1.4 Non-uniform-tree wideband MLFMA error control
        3.5.2 Parallel CFIE solver evaluation
            3.5.2.1 Uniform trees
            3.5.2.2 Non-uniform trees
    3.6 Conclusion
CHAPTER 4  PARALLELIZATION TECHNIQUES FOR THE NON-UNIFORM WIDEBAND MLFMA
    4.1 Introduction
    4.2 Problem Statement
        4.2.1 Summary of non-uniform wideband MLFMA
            4.2.1.1 Low-frequency regime
            4.2.1.2 High-frequency regime
            4.2.1.3 Interactions in the adaptive algorithm
    4.3 Parallelization of the MLFMA
        4.3.1 Tree construction and setup
        4.3.2 Plural nodes
        4.3.3 Adaptive tree
        4.3.4 Interaction lists
        4.3.5 Evaluation of the potential
        4.3.6 Bottlenecks
    4.4 Overcoming bottlenecks associated with plural nodes
        4.4.1 Scheduling computations and communications for plural nodes
        4.4.2 Parallelization of Fourier-based interpolation/anterpolation operators
    4.5 Load balancing
        4.5.1 MLFMA vs. static FMM
        4.5.2 Parallel load balancing strategy for adaptive wideband MLFMA
    4.6 Results
    4.7 Conclusion
CHAPTER 5  CONCLUSION AND FUTURE DIRECTIONS
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: L2O timings (in seconds) for Laplace potential (scalar) and Stokeslet (vector)
27 Table 3.1: Comparison of L2 relative error data for translation operator interpolation with and without compensation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Table 3.2: Memory consumption and timings for different hybrid sampling transitions . . . 55 Table 3.3: Comparison of uniform and non-uniform trees for the conesphere geometry . . . 56 Table 4.1: Tabular illustration of Algorithm 4. Each row from top to bottom represents a step. Tildes denote the shared nodes on their resident processors. Boldface denotes that a node has been scheduled. . . . . . . . . . . . . . . . . . . . . . . 73 Table 4.2: Comparison of original algorithm with PFFT for 320λ diameter sphere geometry. 79 Table 4.3: Comparison of PFFT with original algorithm for 512λ grid geometry. . . . . . . 82 vii LIST OF FIGURES Figure 2.1: Figure 2.2: Illustration of the operations in the ACE algorithm and notation used in this section. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 . . . . . . . (a) Graphical illustration of the interaction lists in the adaptive tree for a leaf box b; (b) example of non-uniform distribution with non-uniform tree in two dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 . . . . . Figure 2.3: Far-field only error convergence vs. ACE expansion order for different kernel functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Figure 2.4: Convergence of the Stokes potentials for uniform and non-uniform volume distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Figure 2.5: Strong scaling and per-process timings for Stokes kernel evaluation on a uniformly-distributed volume geometry with 5 billion particles. . . . . . . . . . 28 Figure 2.6: Strong scaling and per-process timings for Stokes kernel evaluation on a uniform spherical distribution of 1.024 billion points. . . . . . . . . . . . . . . 29 Figure 2.7: Strong scaling and per-process timings for Stokes kernel evaluation on a non-uniform spherical distribution of 1.024 billion points. . . . . . . . . . . . . 29 Figure 3.1: Illustration of the target computational strategy for representing interactions in a general multiscale geometry. Each T represents operations performed with the subscripted method, e.g. ACE, spherical and uniform interpola- tion/anterpolation, and local band-limited interpolation/anterpolation. . . . . . . 37 Figure 3.2: Graphical illustration of the addition theorem and some notation. . . . . . . . . 40 Figure 3.3: Error convergence (left) and matvec timings (right) vs. oversampling param- eter χ for each sampling/interpolation method in the 8λ cube geometry. . . . . . 53 Figure 3.4: Error convergence vs oversampling parameter χ for (a) uniform tree ACE- MLFMA for 64,000 dipoles in a flattened box, and (b) non-uniform tree cone-sphere. The number next to each data point indicates the order P of ACE expansions used. Note that the example in (a) uses three buffer boxes for the far-field, while that in (b) uses only one. . . . . . . . . . . . . . . . . . . 54 Figure 3.5: Comparison of bistatic RCS calculated by parallel MLFMA vs. analytical for a sphere of diameter 256λ within ±5o of the main lobe. . . . . . . . . . . . . . 57 viii Figure 3.6: Comparison of bistatic RCS calculated by parallel MLFMA vs. analytical for a sphere of diameter 512λ within ±2o of the main lobe. . . . . . . . . . . . . . 58 Figure 3.7: Comparison of bistatic RCS calculated by parallel MLFMA vs. 
analytical for the 20λ sphere with densely discretized pole.
Figure 3.8: RCS calculated with parallel MLFMA for 470λ arrowhead geometry.
Figure 3.9: Memory utilization histograms for (a) 470λ arrowhead and (b) 755λ airplane geometry.
Figure 3.10: RCS calculated with parallel MLFMA for 755λ airplane geometry.
Figure 4.1: Illustration of distributed tree.
Figure 4.2: Graphical illustration of the transposition and folding operation within the parallel FFT interpolation of radiation pattern a to A for Nθ = 3, Nφ = 4, Mθ = 5, Mφ = 6.
Figure 4.3: Illustration of the cost-percolation algorithm. Each pair of boxes represents a node in the tree; the number in the left box is the intrinsic cost, while the number in the right box is the partial percolated cost of all ancestors.
Figure 4.4: Timing comparison of M2M and L2L operations for 320λ diameter sphere on 2048 cores with and without PFFT.
Figure 4.5: Timing breakdown for M2M/L2L with and without PFFT for the 320λ sphere.
Figure 4.6: Timing comparison of M2M and L2L operations for 1024λ diameter grid on 2048 cores with and without PFFT.
Figure 4.7: Timing breakdown for M2M/L2L with and without PFFT for the 1024λ grid.
Figure 4.8: Per-process timings for far-field matvec stages for the airplane geometry (a) with uniform vs. non-uniform tree and (b) with and without load balancing (LB) for non-uniform tree on 1024 processes.
Figure 4.9: Parallel efficiency of the complete matvec for the 286M-point arrow geometry with reference to 256 processes. Colorization of geometry solely to illustrate depth.

LIST OF ALGORITHMS

Algorithm 1: Parallel algorithm for merging the tree with the option to impose 2:1 balance constraint
Algorithm 2: Local multipole-to-multipole (M2M) computation with interleaved update of plural nodes
Algorithm 3: Local-to-local (L2L) computation with asynchronous update of plural nodes
Algorithm 4: SchedulePluralNodes: Scheduling algorithm for communication and computation involving plural nodes

KEY TO ABBREVIATIONS

ACE Accelerated Cartesian expansion
CFIE Combined-field integral equation
FMM Fast multipole method
IE Integral equation
Matvec Matrix-vector multiplication
MLFMA Multilevel fast multipole algorithm
RCS Radar cross-section

CHAPTER 1

INTRODUCTION

1.1 N-body problems

The N-body problem refers to a common computational pattern arising in a significant number of areas of computational physics and related fields. Consider a collection of N particles (bodies) located at points ri, i = 1, . . . , N and with strengths ui, i = 1, . . . , N. Broadly speaking, the N-body problem may be defined in the following manner. Given the source strengths and locations ui, ri and a kernel function ψ encoding action at a distance, the task is to compute

Ψ(ri) = Σ_{j=1}^{N} ψ(ri − rj) uj,   i = 1, . . . , N,   (1.1)
where the function Ψ is referred to as the potential function associated with the kernel ψ. While it is sometimes the case that the target locations (i.e., those points at which Ψ is evaluated) do not coincide with the source locations, this thesis assumes the source and target locations are always coincident. The principal bottleneck in evaluating (1.1) is the O(N²) computational complexity.

The N-body problem (1.1) appears in a number of applications too large to list here comprehensively; an abridged list will suffice to highlight its ubiquity. The N-body framework is used for gravitational force calculation in galaxy formation problems in computational astrophysics, electrostatic force calculation in molecular dynamics and physical electronics, kernel classification methods in machine learning, integral equation solvers in electromagnetics, acoustics, elastodynamics, fluid mechanics, and many more.

Given the importance and number of applications of this problem, it is not surprising that a significant amount of research effort has been expended to reduce the cost of evaluating (1.1), particularly in light of the success of the fast Fourier transforms developed in the 1960s. The algorithms of Appel [4] and Barnes and Hut [8] reduced the cost of (1.1) for gravitational force calculations from O(N²) to O(N log N). At the same time, Rokhlin and Greengard proposed the elegant Fast Multipole Method (FMM) [59, 34] for the same problem, reducing the cost to the optimal O(N).

1.2 Fast multipole methods (FMMs)

The FMM idea quickly gained traction in the computational physics community. Its optimal runtime, control over accuracy, and geometry adaptivity make it a highly attractive option for large-scale simulation. The idea was soon generalized for a number of other kernels, including Green's functions in elastodynamics [17], Stokes flow [31], and electromagnetics [21, 67]. The electromagnetic variant of the FMM, deemed the multilevel fast multipole algorithm (MLFMA), is different from the other methods listed here in a very fundamental way. The electromagnetic Green's function imparts a notion of phase to go along with amplitude information, as opposed to the amplitude-only FMM. Adequately resolving these oscillations requires satisfaction of a sampling theorem over a domain proportional to the size of the computational domain. As a result, the MLFMA's memory requirements are significantly higher than those of its non-oscillatory counterparts, and the algorithm scales as O(N log N) rather than O(N).

Historically, these methods relied on the existence of an analytic factorization of the kernel function into a sum-product of functions that depend separately on the source and observer coordinate frames; however, the FMM has also been extended to a general kernel-independent [2, 47, 30] or almost kernel-independent [61] setting. The latter method, deemed the accelerated Cartesian expansion (ACE) method, is almost kernel-independent because, while it requires the derivatives of the kernel function, only one stage of the algorithm requires any information about the kernel, and the method can be applied to any non-oscillatory kernel.

The basic mechanism of acceleration is the separation of interactions into “near” and “far” field interactions classified by some separation criterion, and the approximation of the far-field interactions using intermediate bulk interactions between clusters of particles.
In other words, Ψ(r) = ΨNF(r) + ΨFF(r), (1.2) where the superscripts denote the near- and far-field interaction potentials. Rather than classify each 2 point-to-point interaction directly by comparison of the interaction distance with some threshold, the simulation domain is broken up into a multi-level hierarchy of boxes organized into a tree data structure, and points are assigned to their containing box. Near- and far-field interactions are classified on a box-by-box basis on this grid using some number of near-field buffer boxes. The near-field interactions, i.e. those between points in neighboring boxes, are evaluated directly at O(N) cost. Far-field interactions are typically approximated using some variant of the following: 1. For each box at the finest level of refinement, compute the multipole expansion representing the sources inside the box; 2. For each coarser-level box, form multipole expansions by shifting and combining those from the finer levels; 3. For each box, form local expansions of the observed field within by applying translation operators to multipole expansions of all far-field boxes and combining the results; 4. For each finer-level box, combine local expansion with a shifted local expansion of containing upper-level box; 5. Evaluate the field at each observation point by evaluating the local expansion within each box at the finest level. The FMM, ACE, and MLFMA approximations are error-controllable, meaning that arbitrarily high accuracy can be achieved by increasing the order of approximation, at the expense of extra computational time. 1.3 Parallelization of FMMs While these methods allow the fast evaluation of large N-body sums, the size of the problems that can be solved is limited by the computational resources of a single computer. Hence, parallelization of these algorithms is necessary for solving truly large-scale problems. However, parallelization is not at all straightforward. An efficient parallel algorithm relies on an equipartitioning of the 3 overall workload associated with the tree structure. In the FMM for non-oscillatory kernels, most of the work to be done lies at the finest level of refinement, or the leaf level, greatly simplifying the problem. In this case, a simple partitioning of the boxes at this level along a locality-preserving space-filling curve suffices to form the base of highly scalable parallel algorithms [64], though employing an adaptive form of the FMM complicates the parallelization. In the MLFMA, however, the necessity of capturing both phase and amplitude information, as opposed to strictly amplitude in non-oscillatory FMMs, introduces significant challenges to parallelization. The MLFMA can be viewed as a decomposition of the kernel into a sum of plane waves traveling in all directions. The sampling theorem requires the number of plane waves in the sum to be proportional to the surface area of the sphere enclosing each collection of sources/observers, and so the information content per box increases as one ascends the tree. Consequently, in MLFMA simulations involving electrically large objects, not only is the cost per tree level approximately constant, but the storage and computational costs of nodes at the uppermost levels of the tree can become prohibitive without an intelligent parallelization strategy. Work on parallelization of the MLFMA has continued for almost two decades. 
The basic technique common to the most successful approaches is that of partitioning over space and then over the plane waves at some chosen granularity [28, 48, 50]; however, as will be discussed in a later chapter, the necessity for accurate interpolation and downsampling in the upward and downward tree traversal stages of the matvec, respectively, has important consequences for the partitioning scheme. While some progress has been made toward a robust, scalable, and accurate MLFMA for solving large-scale problems in EM, there remains plenty of work to be done. In particular, efficient parallelization of adaptive wideband variants remains an open problem.

1.4 Thesis structure

The remainder of this thesis aims to put forth a set of algorithms advancing the state of the art in efficient parallel, non-uniform, fast multipole-like algorithms. We will concern ourselves first with non-oscillatory potentials. We will then turn our attention to the electromagnetic case.

Chapter 2 describes a scalable parallel ACE algorithm for evaluating linear operators acting on arbitrary non-oscillatory potentials. We first present an example of potentials defined in terms of linear operators acting on a common kernel function and then provide an algorithmic prescription for evaluating such potentials. Next, we present a suite of parallel algorithms for tree construction, load balancing, and potential evaluation. Finally, a set of numerical experiments demonstrates error control of the method for several kernel functions and good parallel scalability for problems as large as 5 billion particles on 16,384 processes.

Chapter 3 focuses on the solution of EM scattering problems using the non-uniform wideband MLFMA. We introduce methods which greatly increase the scale of EM problems that can be solved using the wideband MLFMA presented originally in [73]. We first present a numerical method for breaking a significant bottleneck of the original algorithm that limited the size of problems to which it could be applied. We then present an adaptive form of the algorithm which significantly improves performance for multiscale geometries, i.e., those with highly non-uniform spatial distributions of unknowns. We demonstrate the error control of these methods and the performance of the parallel algorithm on thousands of processes for several electrically large scatterers.

Chapter 4 focuses on the parallelization details of the non-uniform wideband MLFMA used in the previous chapter. We identify several computational bottlenecks and their causes in the original parallel wideband MLFMA presented in [48], upon which this work is based. We then propose and implement remedies for their resolution. We also present explicit algorithms for building the non-uniform tree in parallel, computing interaction lists in the non-uniform tree, and load balancing the potential evaluation stage for highly non-uniform distributions. The benefits of the proposed methods are then proven through a series of numerical experiments.

CHAPTER 2

PARALLEL ALGORITHM FOR FAST EVALUATION OF STOKES POTENTIALS

2.1 Introduction

A number of physical systems rely on the computation of pair potentials to analyze/understand the relevant physics. By pair potentials, we imply the means through which two entities interact, viz., the relevant Green's function that depends on the relative position of the two particles (and possibly the relative times). Examples of such potentials are abundant.
For instance, the Coulomb potential is used in a number of different fields ranging from gravitational physics to electro-statics to molecular dynamics to density functional theory and so on. Likewise, we see application of potentials such as Lennard-Jones, Yukawa, Buckingham, and so on in molecular dynamics, lattice potential in electronic structure calculations, Stokes potentials in low Reynolds number fluid flows, potentials arising from the Helmholtz equation, wave equation, heat equation, diffusion equation, Klein-Gordon equation, and more. Often, the pair potentials necessary are tensorial, obtained via action of linear differential operators on simpler potentials. A rather simple example is the force on a particle due to electrostatic/gravitational interactions. More generally, the pair potential can be expressed as L∇ {ψ(r)} where L∇ {·} is a linear operator. Evaluation of these potentials is expensive and scales as O(N2) where N is the number of degrees of freedom. Reduction of this cost complexity, from O(N2) −→ O(N α) with α < 2 for evaluation of the Coulomb potential for gravitational physics was the origin of the classical N-body problem. The earliest efforts toward its resolution can be traced back to Barnes and Hut [8], which is the ancestor of a class of methods called tree algorithms. A slightly different approach, the fast multipole method (FMM), introduced by Rokhlin [59], Zhao [82], and then Greengard [32], has been celebrated as one of the top ten algorithms of the 20th century [25]. For volumetric distributions, tree algorithms and FMM reduce the computational costs from O(N2) to O(N log N) and O(N), respectively. Despite similarities between the two methods, there are fundamental 6 differences between their asymptotic complexities; these have been elucidated in [32, 35, 61]. The principal difference between the two is in how one traverses the tree. The bottom line, however, is that both approaches effectively approximate the pair potential in terms of low-order polynomials and their error bounds are well understood. While the most extensive body of work on N-body problems has been on Coulomb potentials [34, 35, 18, 24], there has been significant interest and effort in other areas as well. These include Yukawa [13, 38, 72], Gauss [36, 77], radial basis functions [43, 30], lattice gas [61], Klein-Gordon and diffusion [74], Helmholtz [21, 66], retarded [62, 72], and periodization of some of these [7, 16, 44]. In general, methods used for potentials arising from the Helmholtz and wave equations need to be treated differently [21, 27] than the others due to the need to capture phase information as well as amplitude. But a unifying theme to the methods developed for all other potentials is representation of these in an observation domain in terms of low order polynomials. This can be effected either using Taylor series (or an optimal variant in the case of FMMs) [34, 82], kernel independent FMMs [79], or the black-box approaches [30]. A fast multipole-like algorithm that uses Cartesian tensors in an optimal manner (hence called accelerated Cartesian expansions, or ACE) was introduced in 2007 [61] for potentials of the form r−ν. 
The salient features of the ACE algorithm are (a) the ability to accelerate smooth, non- oscillatory potentials, ψ(r), of arbitrary form and can be extended to linear operators acting on ψ, or L∇ {ψ(r)}, with minimal change in cost; (b) exact traversal up and down the tree in that if the multipole expansion were to be computed at a given level directly from the actual particles, it would be identical –to machine precision– to that computed from its grand-children; (c) mapping for traversal up the tree from level l −→ l + 1 and l + 1 −→ l + 2 differ only by a multiplicative constant and is independent of the potential ψ(r); (d) the same is true for traversing down the tree, as well as across if ψ(r) = |r|−ν, ∀ν ∈ R. The complexity of ACE scales as O(αNP6) (and can be reduced trivially to O(αNP5)) where P is the order of representation and α = 1/(720s). Here, s is the average number of particles in a leaf box and can be tuned to optimize the cost. There also exits a specialization of ACE for r−1 using traceless Cartesian tensors that (a) can 7 be mapped directly onto the classical fast multipole algorithm, and (b) has a complexity that scales as O(βNP4) with β = 1/(2s). This drew upon the relationship between traceless Cartesian tensors and spherical harmonics. The ACE algorithm has been extended to Yukawa, low-frequency Helmholtz, diffusion, Klein-Gordon, dispersive and low-frequency retarded potentials (and some of their periodic variations) [74, 6]. Given these features of ACE, in this paper we will build upon this framework to accelerate the evaluation of non-oscillatory potentials defined by linear operators. Since acceleration methods are commonly used for the Coulomb potential, the effort spent on developing effective parallel algorithms has been equally impressive; see [33, 11, 63, 76, 56] and references therein. More recently, the emergence of kernel independent FMM and potential applications of non-oscillatory potentials to PDEs alluded to earlier in this section as well as to machine learning [14, 10] has resulted in a body of work on parallelization that is most current and pertinent [79, 45, 1, 47]. These methods typically employ a local essential tree (LET) structure, which compartmentalizes communication and computation, allowing the overlapping of communication with computations from different stages of the FMM algorithm. However, this approach requires redundant computation and storage of remote source data. A different parallel algorithm that uses a bottom-up partitioning of the tree based on a post-order traversal sequence and eliminates the need for redundant computations at the higher levels of the tree has been reported by Melapudi et al. [75, 48]. As demonstrated through the evaluation of multiple pair potentials in a particle dynamics simulation using the ACE algorithm, the implementation by Melapudi et al. exhibits excellent strong scaling properties. In this paper, our goal is to develop parallel algorithms to evaluate ψ(r) and L∇ {ψ(r)} efficiently. A few examples of potentials of this form are the set of Stokes potential (Stokeslet, rotlet and stresslet) [42], force or field calculations in electrostatics/gravitation, as well as those encountered in low frequency electromagnetics wherein one transitions across scales (quantum to classical) [3]. Given the landscape of both fast methods and the attendant parallelization algorithms mentioned above, it is apparent that each has its (dis)advantages. 
However, in keeping with our goal, ACE is an ideal framework, as will be demonstrated through both mathematical and numerical analyses. To this end, the main contributions of this paper are (a) development of a modified ACE framework to enable evaluation of L∇{ψ(r)} with minimal cost overhead compared to that for ψ(r), (b) an efficient parallel implementation of ACE for highly non-uniform particle distributions, and (c) demonstration of the efficiency and scalability of the described methods for several different potentials.

The rest of this paper is organized as follows: In the next section, we define the problem addressed by this work and state the Stokes kernels in terms of an L∇-operation. Section 2.3 reviews the ACE algorithm, introduces the necessary modifications to effect such operators for the Stokes potentials, and presents a complexity estimate for the modified ACE algorithm. Section 2.4 describes our parallel algorithm in detail, including tree construction, interaction lists, load balancing, and parallel potential evaluation. Finally, in Section 2.5, we present a number of numerical results demonstrating the accuracy and efficiency of our algorithm.

2.2 Problem statement

Consider a collection of N sources distributed randomly in R³, described by u(r) = Σ_{j=1}^{N} uj δ(r − rj). The effect of these sources at a point r can be written as

Φ(r) = ψ(r) ⋆s u(r),   (2.1)

where r = |r| and ⋆s represents a spatial convolution. In several applications, one needs to compute

Φ^(n)(r) = L∇{ψ(|r|)} ⋆s u(r),   (2.2)

where the superscript (n) denotes a tensor of rank n as required by the operator L∇. In the exposition, we have restricted u(r) to a scalar source. It is, however, trivial to generalize the ideas presented here to vector sources.

In what follows, we will illustrate our ideas in the context of Stokes kernels. These potentials can be expressed as a set of linear operators acting on non-oscillatory pair potentials. Stokes potentials arise in boundary integral formulations in fluid dynamics for low Reynolds numbers, and are denoted by the Stokeslet S^(n), rotlet Ω^(n), and stresslet σ^(n) kernels which, respectively, can be expressed in index notation as

S_ij(r) = δ_ij/r + r_i r_j/r³,   (2.3a)
Ω_ij(r) = ε_ijk r_k/r³,   (2.3b)
σ_ijk(r) = −6 r_i r_j r_k/r⁵.   (2.3c)

Alternatively, the above potentials can be rearranged and expressed in terms of derivatives on r [57, 29, 42]:

S_ij(r) = (δ_ij ∇² − ∂_i ∂_j) ψ(r),   (2.4a)
Ω_ij(r) = (−ε_ijk ∂_k ∇²) ψ(r),   (2.4b)
σ_ijk(r) = [(δ_ij ∂_k + δ_jk ∂_i + δ_ki ∂_j) ∇² − 2 ∂_i ∂_j ∂_k] ψ(r),   (2.4c)

where ψ(r) = r, δ_ij is the Kronecker delta, ∇² denotes the Laplacian, ∂_i represents the derivative along the ith direction (from the set {x, y, z}), and ε_ijk denotes the Levi-Civita symbol. From these equations, one can arrive at the appropriate L∇{·} for each of the tensors. Assuming N particles, the cost of evaluating (2.2) scales as O(3^n N²). In what follows, we introduce modifications to the ACE algorithm which bring the complexity of evaluating all 3^n components down to O(NP⁶).

2.3 Generalization of ACE to L∇{ψ(r)}

In this section, we discuss the overall computational framework, provide a brief summary of the ACE algorithm, present the modifications that permit efficient computation of L∇{ψ(r)} using ACE, and conclude with a computational complexity analysis.
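Throughout this chapter, fast evaluations are validated against direct summation. The following minimal sketch (the function and variable names stokeslet_direct, points, and forces are illustrative, not part of the thesis code) shows such an O(N²) reference evaluation of (2.2) for the Stokeslet of (2.3a), applied to vector strengths; it assumes the convention S_ij(r) = δ_ij/r + r_i r_j/r³ with physical prefactors omitted, and it is not the ACE algorithm itself.

```python
# Direct O(N^2) reference evaluation of the Stokeslet potential (a sketch).
import numpy as np

def stokeslet_direct(points, forces):
    """points: (N, 3) coincident source/observer locations; forces: (N, 3) strengths.
    Returns u_i = sum_{j != i} S(r_i - r_j) f_j with S_ij = delta_ij/r + r_i r_j / r^3."""
    n = points.shape[0]
    phi = np.zeros((n, 3))
    for i in range(n):
        d = points[i] - points              # displacement vectors r_i - r_j, shape (N, 3)
        r2 = np.einsum("jk,jk->j", d, d)    # squared distances
        r2[i] = np.inf                      # exclude the self interaction
        r = np.sqrt(r2)
        df = np.einsum("jk,jk->j", d, forces)        # (r_i - r_j) . f_j
        phi[i] = (forces / r[:, None] + d * (df / (r * r2))[:, None]).sum(axis=0)
    return phi

# Example usage: 1000 random sources in a unit cube with random vector strengths.
rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
f = rng.standard_normal((1000, 3))
u = stokeslet_direct(pts, f)
```

Such a brute-force routine is only practical for modest N, but it provides the ground truth against which the accuracy of the fast evaluation can be measured.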
2.3.1 Framework for fast N-body computations

To begin, let us assume that the entire computational domain Ω ∈ R³ can be embedded within a cube. All FMM-like fast algorithms start with building an oct-tree data structure by hierarchically partitioning the cube encompassing the entire computational domain. At each level, the cube is subdivided into eight equi-sized sub-cubes, and this process continues recursively until the desired level of refinement (e.g., defined by a threshold on the number of sources/observers contained within each sub-cube) is reached; an L-level scheme implies L − 1 recursive divisions of the domain. At any level, the (sub)domain that is being partitioned is called the parent of the eight children into which it is partitioned. At the lowest level, all sources/observers are mapped onto the leaf boxes.

Interactions between all source/observer pairs are now computed using the tree structure. Specifically, near and far fields are identified for all boxes/subdomains at any level in the tree according to the following criteria: two subdomains are classified as being in the near field of each other if they share a common corner, edge, or surface; otherwise, two subdomains are in the far field of each other if the distance between their centers is at least twice the side length of the subdomain and their parents are in the near field of each other. Once the interaction lists are built for all levels, the computation proceeds as follows. At the leaf level, interactions between elements of boxes that are in the near field of each other are computed directly, i.e., using ψ(r). Far-field interactions are computed using a three-stage algorithm: (i) compute multipoles of the sources that reside in each box (C2M); (ii) convert these to local expansions at all boxes that are in its far field (M2L); (iii) from the local expansion, compute the field at each observer (L2O). It is apparent that one can cut down computational costs by embedding this scheme within itself. That is, if two domains interacting with each other are far away, then these clusters may be combined to form larger clusters that then interact with each other at a higher level, and so on. As has been shown elsewhere [34, 61], this computational strategy considerably mitigates the overall cost. Additional theorems exist that enable (a) shifting the origins of multipoles so that the effects of small clusters can be grouped together to form larger clusters (M2M) and (b) moving the origin of a local expansion so that the expansion at the origin of the parent may be disaggregated to those of its children (L2L). The overall computational scheme is illustrated in Fig. 2.1 and proceeds as follows: compute near-field interactions directly at the leaf level, and traverse the tree (C2M → M2M → M2L → L2L → L2O) as appropriate to compute the remaining interactions. Theorems to effect the tree traversal are summarized in the remainder of this section.

Figure 2.1: Illustration of the operations in the ACE algorithm and notation used in this section.
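For two boxes of equal size at the same level, the near/far criterion above reduces to a simple adjacency test on integer box indices. The sketch below is illustrative only (the helper names adjacent, parent, and classify are not from the thesis code); for equal-size boxes, non-adjacency is one common restatement of the center-distance criterion stated above.

```python
# Near/far classification of two same-level boxes given by integer index triples (a sketch).
def adjacent(a, b):
    """True if boxes a and b touch (share a corner, edge, or face), including a == b."""
    return all(abs(x - y) <= 1 for x, y in zip(a, b))

def parent(a):
    """Index triple of the parent box one level up in the octree."""
    return tuple(x // 2 for x in a)

def classify(a, b):
    """'near': handled directly; 'far': handled by M2L at this level;
    'coarser': the interaction is already accounted for at a coarser level."""
    if adjacent(a, b):
        return "near"
    if adjacent(parent(a), parent(b)):
        return "far"
    return "coarser"

# Example: (0,0,0)/(1,1,0) are near; (0,0,0)/(3,0,0) are far (their parents touch);
# (0,0,0)/(6,0,0) are deferred to a coarser level.
print(classify((0, 0, 0), (1, 1, 0)), classify((0, 0, 0), (3, 0, 0)), classify((0, 0, 0), (6, 0, 0)))
```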
2.3.2 Brief background on ACE

To begin describing ACE, consider two boxes Ωs and Ωo that contain sources and observers, respectively, in the oct-tree decomposition of Ω, and let the parents of Ωs and Ωo be denoted using Ω_s^p and Ω_o^p. The centers of these boxes are denoted using r_s^c, r_o^c, r_{s,p}^c, and r_{o,p}^c. To set the stage for the relevant descriptions, we start with some remarks and notation: (i) an n-th rank totally symmetric tensor A^(n) contains (n + 1)(n + 2)/2 independent components as opposed to 3^n components; (ii) in compressed form this tensor can be represented using A^(n)(n1, n2, n3), where n = n1 + n2 + n3; (iii) an example of such a tensor is the polyadic associated with r, given by the n-fold product r r · · · r = r^n, which can be represented in compressed form as r^(n) = x^{n1} y^{n2} z^{n3}; (iv) an m-fold contraction between two tensors A^(n) and B^(m) is denoted using A^(n) · m · B^(m) = C^(n−m); and (v) a direct product between two tensors can be written as C^(n+m) = A^(n) B^(m).

The foundation of ACE is a Taylor series expansion, which provides a natural framework for developing addition theorems. A Taylor series expansion for ψ(|r − r′|) about the origin,

ψ(r − r′) = Σ_{n=0}^{∞} ((−1)^n / n!) (r′)^n · n · ∇^n ψ(r),   (2.5)

valid for |r| > |r′|, results in a series of theorems summarized next; this is provided only for completeness, and proofs for all can be found in [61].

Theorem 1 (Charge-to-multipole (C2M)). Assuming that k sources exist in the domain Ωs, the potential function Φ(r) at any point r significantly away from Ωs is given by

Φ(r) = Σ_{n=0}^{∞} M^(n)(r_s^c) · n · ∇^n ψ(r),   with   M^(n)(r_s^c) = Σ_{i=1}^{k} ((−1)^n / n!) (r_i − r_s^c)^n u_i,   (2.6)

where M^(n)(r_s^c) is the rank-n multipole tensor about r_s^c. Next, these multipoles can be re-expressed about r_{s,p}^c using

Theorem 2 (Multipole-to-multipole (M2M)). The multipole expansion M(r_s^c) may be re-centered about the center r_{s,p}^c of a domain Ω_s^p ⊃ Ωs via

M_p^(n)(r_{s,p}^c) = Σ_{m=0}^{n} (m!/n!) Σ_{P(n,m)} (r_{s,p}^c − r_s^c)^{n−m} M^(m)(r_s^c),   (2.7)

where P(n, m) is the permutation of all partitions of n into sets of size n − m and m, and M_p(r_{s,p}^c) is the re-centered multipole expansion. Next, we translate these multipoles about Ω_s^p to Ω_o^p.

Theorem 3 (Multipole-to-local (M2L)). Assume that a domain Ω_o^p centered at r_{o,p}^c exists at some distance from Ω_s^p, and Ω_o^p ∩ Ω_s^p = ∅. Then, given a multipole expansion M_p^(n)(r_{s,p}^c) centered at r_{s,p}^c, the potential Φ(r) may be alternatively expressed in the form

Φ(r) = Σ_{n=0}^{∞} (r − r_{o,p}^c)^n · n · L_p^(n)(r_{o,p}^c),
L_p^(n)(r_{o,p}^c) = (1/n!) Σ_{m=n}^{∞} M_p^(m−n)(r_{s,p}^c) · (m − n) · ∇^m ψ(|r_{o,p}^c − r_{s,p}^c|),   (2.8)

where L_p(r_{o,p}^c) is referred to as the local expansion for Ω_o^p. Next, we map the local expansions in Ω_o^p to Ωo.

Theorem 4 (Local-to-local (L2L)). Given a local expansion L^(n)(r_{o,p}^c) within the domain Ω_o^p centered at r_{o,p}^c, the local expansion within Ωo ⊂ Ω_o^p about the center r_o^c is given by

L^(n)(r_o^c) = Σ_{m=n}^{∞} \binom{m}{m−n} L^(m)(r_{o,p}^c) · (m − n) · (r_o^c − r_{o,p}^c)^{m−n}.   (2.9)

Finally, the local expansions are mapped to observers.

Theorem 5 (Local-to-observer (L2O)). The potential at a point r ∈ Ωo can be obtained from the local expansion within Ωo centered at r_o^c via

Φ(r) = Σ_{n=0}^{∞} (r − r_o^c)^n · n · L^(n)(r_o^c).   (2.10)

Given these theorems, we next specialize the framework to Stokes potentials.
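Before doing so, it is worth noting that the C2M stage of Theorem 1 is simple to state concretely. The sketch below builds the compressed multipole coefficients of (2.6) for a single leaf box, storing each rank-n tensor by its multi-indices (n1, n2, n3) as in remarks (ii) and (iii). The function name c2m is hypothetical; the (−1)^n/n! prefactor follows (2.6) literally, and the multiplicity weights needed when these coefficients are later contracted are deliberately left out of this sketch.

```python
# C2M sketch: compressed Cartesian multipole coefficients of a leaf box per eq. (2.6).
import math
import numpy as np

def c2m(points, weights, center, P):
    """Return M[(n1, n2, n3)] about `center` for all orders n = n1 + n2 + n3 <= P."""
    M = {}
    d = np.asarray(points) - np.asarray(center)      # offsets r_i - r_s^c, shape (k, 3)
    for n in range(P + 1):
        pref = (-1.0) ** n / math.factorial(n)
        for n1 in range(n + 1):
            for n2 in range(n - n1 + 1):
                n3 = n - n1 - n2
                comp = d[:, 0] ** n1 * d[:, 1] ** n2 * d[:, 2] ** n3   # component of (r_i - r_c)^n
                M[(n1, n2, n3)] = pref * np.dot(comp, weights)
    return M

# Example: 100 unit-strength sources in a leaf box of edge 0.1 centered at the origin.
rng = np.random.default_rng(1)
src = rng.random((100, 3)) * 0.1 - 0.05
M = c2m(src, np.ones(100), (0.0, 0.0, 0.0), P=4)
print(len(M))   # 35 compressed components for P = 4, i.e. sum of (n+1)(n+2)/2 for n = 0..4
```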
2.3.3 Modifications for evaluating Φ^(n)(r)

Thus far, we have developed a method to evaluate Φ(r). Examination of (2.2) provides a methodology to evaluate Φ^(n)(r) as a minor modification of that for Φ(r); given Thm. 5, it follows that one can write

Φ^(n)(r) = L∇{ψ(r)} ⋆s u(r) = Σ_{n=0}^{∞} L∇{(r − r_o^c)^n} · n · L^(n)(r_o^c).   (2.11)

This implies that the only change (both procedurally as well as in the computational cost) arises from a change in the L2O stage. Specifically, the Stokes potentials can be written as

S_ij(r) ∗ u(r) = Σ_{n=0}^{∞} [(δ_ij ∇² − ∂_i ∂_j)(r − r_o^c)^n] · n · L^(n),   (2.12a)
Ω_ij(r) ∗ u(r) = Σ_{n=0}^{∞} [(−ε_ijk ∂_k ∇²)(r − r_o^c)^n] · n · L^(n),   (2.12b)
σ_ijk(r) ∗ u(r) = Σ_{n=0}^{∞} [((δ_ij ∂_k + δ_jk ∂_i + δ_ki ∂_j) ∇² − 2 ∂_i ∂_j ∂_k)(r − r_o^c)^n] · n · L^(n).   (2.12c)

For each of the Stokes potentials, the above operation requires contracting a local expansion of rank n with another of rank (n + 2), (n + 2), and (n + 3) for the Stokeslet, rotlet, and stresslet, respectively. Furthermore, examination of these equations alongside (2.4) suggests that we can precompute and store the 6 + 10 = 16 unique second and third derivatives of the tensor (r − r_o^c)^n, combining them as needed to construct each disaggregation operator (the bracketed terms in (2.12)). Moreover, if the source u(r) excites all three potentials, the simple mapping between ψ(r) = r and the Stokes potentials enables significant optimization. That is, only three tree traversals are required to construct the 19 independent components of the Stokeslet, rotlet, and stresslet collectively at the observer. Finally, it is easily shown that the error in the evaluated potentials is commensurate with the representation polynomial order; the error bounds for this approximation follow from what was derived for ACE in [61].

2.3.4 Cost of evaluating L∇{Φ}

The cost of computing the action of L∇{Φ(r)} by taking derivatives on the disaggregation tensors r^n, n = 0, . . . , P, is the same as that of simply computing Φ(r). To show this, we invoke Theorem 2.4 of [61], which gives

∇^k Φ(r) = Σ_{n=k}^{∞} (n!/(n − k)!) L^(n) · (n − k) · (r − r_o^c)^{n−k},   k ≤ n.   (2.13)

Truncating at P terms to make use of all the information contained in L and evaluating the sum clearly requires at most the same amount of work as evaluation of (2.10) truncated at the same number of terms P. The resulting rank-k tensor contains all derivatives of degree k, from which the elements of L∇Φ are assembled. It follows that the cost of evaluating all components of the tensor Φ^(n) is not significantly different from that of evaluating Φ by itself and scales as O(NP⁶).

2.4 Parallel ACE algorithm

In this section, we outline our parallel algorithms for constructing, adapting, and load balancing the tree, as well as for parallel potential evaluation. While parallelization of FMMs is well documented in the literature [76, 80], as we will demonstrate in Sect. 3.5, redundant M2L computations in the commonly used locally essential tree (LET) based algorithms would significantly hamper the scaling of our proposed framework. For this reason, our parallel evaluation strategy differs from the usual LET based implementations, and therefore we present our parallel implementation in detail. In the ensuing discussion, let L denote the number of levels in the octree as determined by the parameter d0, which represents the edge length of the smallest boxes in the tree. Level L denotes the level of finest refinement, i.e., where boxes are of diameter d0, while level 1 denotes the root, or the box containing the entire computational domain.
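The text does not spell out how L follows from d0; one consistent convention, in which level 1 is the root box with the edge length of the bounding cube and level-L boxes have edge length at most d0, is sketched below (the helper num_levels is hypothetical).

```python
# One possible way to derive the number of octree levels L from the leaf edge d0 (a sketch).
import math

def num_levels(domain_edge, d0):
    """Smallest L such that domain_edge / 2**(L - 1) <= d0 (level 1 is the root)."""
    return max(1, math.ceil(math.log2(domain_edge / d0)) + 1)

print(num_levels(1.0, 0.05))   # a unit cube with d0 = 0.05 gives L = 6 (leaf edge 1/32)
```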
2.4.1 Construction of the distributed octree Construction of the distributed octree starts by partitioning the input particles equally among all processes. On each process, the equidistributed particles are first hashed into Morton keys, which is a binary encoding of the leaf boxes according to their space-filling Morton-Z curve ordering. A parallel bucket sort is then used to assign each process a distinct set of leaves contiguous in the Morton-Z ordering such that each process gets a roughly equal number of particles. This initial partitioning is performed at the finest refinement level, i.e., at level L, in preparation for the ensuing adaptive tree coarsening and load balancing stages, and is concluded with a local postorder traversal tree construction on each process from leaves all the way up to the root. The above partitioning scheme necessarily incurs duplicate copies of internal tree nodes, cor- responding to common ancestors of leaf nodes residing on different processes. Such nodes are referred to as plural nodes. Treatment of computations associated with plural nodes is one aspect of our parallelization scheme that differentiates it from the LET based implementations. Hence, we provide some terminology to aid the discussion. For any plural node, the process with the highest rank which owns a copy is designated as the resident process, and as such is deemed responsible for all its tree-based interactions. Other processes with a copy of the plural node are called users of the node, and are directed only to send and receive data to and from the resident process. A plural node is referred to as a shared node on its resident process, and the users’ copies are called 16 duplicate nodes. As a consequence of the postorder traversal tree construction, it can be shown that all duplicate nodes in a process’ local subtree appear consecutively at the end of the post-order traversal sequence. Shared nodes may appear anywhere within the post-order sequence, though they tend to appear toward the beginning. As we will discuss, these properties allow effective overlapping of communication and computation in the potential evaluation stage. 2.4.2 Adaptive tree Random (or homogeneous) distributions of source or observer particles are well-represented by a uniform tree structure. Within such distributions, the number of particles per box is approximately constant across the entire simulation domain, facilitating the linear scaling of the ACE algorithm or any other FMM implementation for that matter. For non-uniform distributions though, i.e., those with subregions of relatively high particle density, the minimum box size must be decreased to prevent near-field costs from dominating the computation. Hence, in a uniform tree framework, the densest region of discretization dictates the leaf box size throughout the entire tree. The result is large swaths of space with a poor work-to-particle ratio in sparsely populated regions, degrading the cost scaling of parallel ACE computations. To prevent such inefficiencies, we adapt the tree structure to the particle distribution so as to restore the approximate uniformity in leaf box population throughout the tree. Adaptation of trees for N-body simulation has been well-studied and has become a standard feature of modern tree-based simulation methods for non-oscillatory kernels [63, 64, 68, 47]. We employ a bottom-up scheme for both constructing and merging the tree to minimize communication costs. 
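Both the initial partitioning of Sect. 2.4.1 and the merging algorithm discussed next address leaf boxes through their Morton keys. A minimal sketch of that encoding is given below; the function morton_key and the particular bit ordering are illustrative choices, not the thesis implementation.

```python
# Morton (Z-order) key of the leaf box containing a point at tree level L (a sketch).
def morton_key(point, lower_corner, domain_edge, L):
    """Quantize each coordinate to the 2**(L-1) leaf grid and interleave the bits."""
    nbits = L - 1                               # leaf grid has 2**(L-1) boxes per side
    cells = 1 << nbits
    idx = []
    for x, lo in zip(point, lower_corner):
        i = int((x - lo) / domain_edge * cells)
        idx.append(min(max(i, 0), cells - 1))   # clamp points on the upper boundary
    key = 0
    for bit in reversed(range(nbits)):          # one bit plane at a time: x, y, z
        for c in idx:
            key = (key << 1) | ((c >> bit) & 1)
    return key

# Example: these two nearby points fall in the same leaf box and therefore share a key.
print(morton_key((0.10, 0.20, 0.30), (0, 0, 0), 1.0, L=6),
      morton_key((0.11, 0.21, 0.31), (0, 0, 0), 1.0, L=6))
```

Sorting particles by such keys produces the space-filling-curve ordering along which the parallel bucket sort assigns contiguous ranges of leaves to processes.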
Our parallel algorithm for merging the distributed octree is given in Algorithm 1. Simply put, this algorithm starts with a uniform tree constructed according to the description in Sect. 2.4.1 and merges each set of sibling leaf boxes into their parent if the total number of particles in these leaf boxes does not exceed a pre-determined particle threshold M. This scheme ensures that leaf boxes in more sparsely populated regions contain roughly as many points as those in the dense regions, thereby avoiding excessive tree interactions (in sparse regions) or expensive near-field computations (in dense regions).

Algorithm 1: Parallel algorithm for merging the tree with the option to impose 2:1 balance constraint
1: Define:
2:   Tloc: local subtree (including plural nodes)
3:   T_loc^ℓ: local subtree at level ℓ
4:   M: maximum number of points per leaf box
5:   S(b): set of siblings of a box b
6:   n(b): number of points contained within the complete subtree rooted at box b
7:   N(b): set of box b's near-neighbors at the same level
8:   Votes: "1": vote to merge; "0": indeterminate; "−1": veto
9: Votes ← 0
10: for ℓ = L, L − 1, . . . , 4 do
11:   C ← ∅
12:   for each b ∈ T_loc^ℓ do
13:     if Votes(b) == 0 then
14:       if n(P(b)) ≤ M then
15:         Votes(S(b)) ← 1
16:       else
17:         Votes(S(b)) ← −1
18:       end if
19:     end if
20:     if Votes(b) ≤ 0 and 2:1 balance desired then
21:       C ← C ∪ N(P(b))
22:     end if
23:   end for
24:   Votes(S(C)) ← −1 {Global step; involves communication of vetoes to resident processes of boxes in C}
25:   Synchronize votes of level ℓ plural nodes and their siblings; vetoes dominate
26: end for
27: Prune each b ∈ Tloc and re-assign particles accordingly

2.4.2.1 Evaluation of ACE interactions in adaptive trees

In adaptive octree algorithms, it is necessary to evaluate interactions between boxes of different sizes. As usual, these interactions are classified into U-, V-, W-, and X-lists based on the adjacency and relative size of source and observer box pairs [15, 32]. For the sake of completeness, these lists are defined as follows:

Figure 2.2: (a) Graphical illustration of the interaction lists in the adaptive tree for a leaf box b; (b) example of non-uniform distribution with non-uniform tree in two dimensions.

• U-list: source and observer boxes are adjacent and are both leaves, i.e., they are within the near field of each other;
• V-list: source and observer boxes are not adjacent but their parents are, i.e., they are inside the far field of each other;
• X-list: source and observer boxes are not adjacent, but the observer box is (i) adjacent to the parent of the source box, (ii) a leaf, and (iii) larger than the source box, corresponding to a cross-level interaction with an observer node higher up in the octree;
• W-list: reciprocal interactions of those in the X-list, corresponding to a cross-level interaction with an observer node lower down in the octree.

These interaction lists are illustrated in Fig. 2.2a. U-list interactions are handled by directly computing the interactions between particles contained within each box pair. The V-list interactions correspond to the M2L operations defined by Theorem 3. The X- and W-list interactions, however, require special treatment. Consider the evaluation of a potential function φ with kernel ψ using ACE for an X-list interaction between two boxes Ωo, Ωs, centered at ro, rs, respectively. If the diameter of the sphere enclosing Ωo is greater than the diameter of that for Ωs, the local expansion about ro formed from the source multipole expansion M^(n)(rs) is not valid everywhere inside Ωo. However, M^(n)(rs) can be used to construct a different representation that is valid in this region. Consequently, the contribution φX(r) to the potential φ(r) observed at a point r within the observer box is given by direct evaluation of the Taylor series. For the Stokes potentials, the derivative on the observer coordinate frame in L2O moves directly onto the kernel function ψ. For integer derivative order d and truncation number P, it is clear that

∇^d φX(r) = Σ_{n=0}^{P} M^(n)(rs) · n · ∇^{n+d} ψ(r − rs)   (2.14)

yields a symmetric rank-d tensor whose entries contain the full complement of derivatives of order d. All three Stokes potentials can be constructed using d = 2, 3 for each particle within Ωo.

The W-list interactions have a somewhat simpler implementation. Here, the multipole expansion of the source box Ωs is not quickly convergent anywhere within the observer box Ωo, and therefore a local expansion must be formed about ro directly from the particles contained in Ωs. Viewing the Taylor series from the perspective of the observer coordinate frame, the contribution φW of the W-list sources to φ at the point r ∈ Ωo is given by

φW(r) = Σ_{n=0}^{P} (r − ro)^n · n · Σ_{i=1}^{ns} ∇^n ψ(ro − r′_i),   (2.15)

where r′_i is the location of the ith of ns particles in the source box. The inner summation constitutes the local expansion about ro of the sources within Ωs. Derivatives for constructing the Stokes potentials may then be taken in the usual manner in the L2O stage.

2.4.2.2 2:1 balance constraint

For highly non-uniform distributions, unconstrained merging of the tree results in a tree with a matching degree of non-uniformity. While this is normally desirable from the perspective of reducing computational costs (fewer tree nodes means fewer interactions overall), it has a downside in terms of memory utilization. In a uniform tree where no X-list or W-list interactions exist, there can be at most 6³ (total number of boxes in the neighborhood of a box) − 3³ (number of near-field boxes) = 189 boxes in the far field of a box. Leveraging symmetries and scale invariance of certain kernels can significantly reduce the number of translation operators needed for V-list interactions. However, in regions containing sharp discontinuities in leaf box size, the number of X- and W-list interactions can become quite large, and the number of unique translation operators needed to perform the corresponding X- and W-list interactions can also be very large. The 2:1 balance constraint [68, 47] remedies this storage problem by disallowing adjacent leaf boxes to differ in size by more than a factor of two, significantly reducing the number of different X and W interactions and the memory overhead for storing their particle-specific translation operators. Algorithm 1 for merging the tree includes the option to impose this constraint.

2.4.3 Load balancing

After the tree merging procedure under the 2:1 balance constraint is complete, each process is left with a compressed representation of the original uniform tree. As such, the computational profile is sufficiently different from the original tree that re-balancing the computational load on processes is necessary.
To accomplish this, we follow an empirical load balancing strategy where we estimate the work attendant to each node in the distributed tree and aim to assign them in a load-balanced manner. The cost per leaf is determined as follows. As part of the initialization operations, we first time a set of dummy operations to obtain cost estimates for each of the U, V, W, and X-list interactions. The cost for each node in the tree is then determined by multiplying the number of each interaction by its corresponding cost estimate, and summing these costs across all interaction types. As with the initial partitioning, we partition the distributed tree by determining a set of separators between contiguous chunks of leaf boxes. Therefore we account for the costs of the interior nodes by percolating their estimated costs down to the leaf boxes and adding them to the extant cost estimate at each leaf. A variant of Algorithm 1 from [68] is then used to determine P − 1 locations at which to split the leaf level Morton curve for a re-balancing of the computational load. While the overall 21 strategy is similar to the costzones approach of [63, 64], our strategy accounts for the work of the entire tree instead of just the leaves. This is important for surface geometries with non-uniform distributions, as the amount of work above the leaves is not negligible. 2.4.4 Evaluation of the potential in parallel The parallel potential evaluation is performed in three stages: The upward pass (M2M), translation (M2L), and the downward pass (L2L). Algorithm 2 describes the M2M (upward pass) stage of our implementation, which essentially entails shifting the multipole data of each tree node to the center of its parent box and aggregating the shifted multipole data from all siblings. Each process starts processing the nodes in its local subtree from right-to-left in the post-order traversal sequence. By choosing to go from right-to-left, we ensure that duplicate nodes are encountered toward the beginning of the M2M stage, and shared nodes are encountered toward the end of this stage. Using MPI’s non-blocking Ireduce collective calls, this pattern facilitates overlapping the communication operations for plural nodes whose children, by definition, reside on different processes with M2M computations of other nodes. The translation stage commences on each process once the local upward pass is complete. This stage includes three substages for computations of X, V and W lists. First, X-list interactions are handled by sending the required multipole expansions to processes with X-list observers, where the observed potentials are computed. Next, V-list interactions are carried out. Typically, this is the most expensive stage of any fast multipole-like method. To hide communication overheads, the source multipole expansions to be exchanged are divided up into packets and communicated to processes that need this information. As each source expansion is received in full, all its V-list interactions are computed, making good use of temporal locality. Once all source expansions in the current packet are exhausted of work, the next packet is constructed and communicated using non-blocking primitives to facilitate overlapping of communication and computation. This process continues until all remote V-list interactions are completed. Local V-list interactions, i.e., those in which both the source and observer boxes belong to the same process, are computed next. 
Finally, 22 W-list interactions are computed by exchanging the source weights, i.e. ui in (2.6), and applying the appropriate translation operators to form the local expansions of these sources. To complete the evaluation, Algorithm 3 describes the L2L (downward pass) stage which involves re-centering and adding parents’ local expansions to their childrens’. Contrary to the M2M stage, here we employ a pre-order traversal of the local subtree so that the shared nodes are now encountered at the beginning and the duplicates are encountered toward the end of the L2L stage. This way, by using MPI’s non-blocking IBcast primitive, we ensure that broadcast for a shared node can be completed in the background, while L2L computations of other nodes are being performed. Also, the result of the V-list interaction evaluations for duplicate nodes is communicated to users during the downward pass, again using non-blocking primitives for overlapping communications with computations. As described above, updating multipole and local expansions for nodes that are shared between processes are performed efficiently in our implementation using asynchronous communications. We note that at any level, the local subtree of any process can have at most two plural nodes – one shared, one duplicate. It follows, then, that we may create a total of 2L MPI communicators, each with groupings of shared nodes and their users, so that we can take advantage of non-blocking variants of MPI’s reduce and broadcast operations. We note that our evaluation algorithm differs from that of the common LET-based implementa- tions in that we avoid duplicating V-list interactions, which are typically the most computationally expensive part of any fast multipole-like algorithm (see Sect 3.5). Each V-list interaction is com- puted exactly once, and the resident process is responsible for such computations. 2.5 Results In this section, we present an array of results demonstrating error convergence and parallel performance of the present algorithm. 23 if b is a leaf then else required for plural nodes Calculate M(b) from particles of b for each child c of b do M(b) ← M(b) + M2M(c, b) Algorithm 2: Local multipole-to-multipole (M2M) computation with interleaved update of plural nodes 1: Define pid as the process rank 2: Define M(b) as the multipole expansion for node b 3: Define T(b) as a temporary storage space of the same size as M(b) unique to node b; only 4: Define Reqs(·) as the array of request handles for asynchronous comms. 5: r ← 0 {r is the request counter} 6: R ← ∅ {Set of update nodes} 7: for each b ∈ Tloc in right-to-left post-order do 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: end if 21: 22: end for 23: MPI_Waitall(r,Reqs(1 : r)) 24: for each b ∈ R do 25: M(b) ← T(b) 26: end for r ← r + 1 MPI_Ireduce(M(b),T(b),Reqs(r),Root(b),MPI_SUM) if pid == Root(b) then end for if b is a plural node then R ← R ∪ b end if end if 2.5.1 Error convergence We first examine the error convergence of the ACE algorithm applied to the r−ν, Yukawa, and Stokes potentials. The first two are computed using scalar sources, while the Stokes potentials are calculated using vector-valued sources. The error for Stokes potentials is compared to the analytical by taking an inner product of the tensor-valued potential with each observer’s polarization vector. We first evaluate the r−ν and Yukawa potentials for 3.125 million points within a 0.64 m cube. 
The distribution is mapped onto a tree with a leaf box diameter of d0 = 0.01 m, yielding a 7-level tree with 12 points per box on average. A single buffer box is used for the far-field. Fig. 2.3 24 else if b is a leaf then r ← r + 1 MPI_Ibcast(L(b),Reqs(r),Root(b)) if b is a duplicate node then Calculate potentials due to L(b) for each particle in b if b is a plural node then Algorithm 3: Local-to-local (L2L) computation with asynchronous update of plural nodes 1: Define L(b) as the local expansion for node b 2: r ← 0 3: for each b ∈ T in pre-order do 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: end if 19: 20: end for if c is not a duplicate node then L(c) ← L(c) + L2L(c, b) MPI_Wait(Reqs(r)) end if end if for each child c of b do end if end for shows convergence in the L2 error for the far-field with increasing expansion order P = 1, 3, . . . , 19 for both kernels. Different parameters are used to control the growth or decay of the kernels as r increases. We next consider the evaluation of error in Stokes potentials for a collection of points within a cube with both uniform and non-uniform spatial distributions. Both distributions are characterized by 3.125 million points within a 0.64 m cube. In the case of the uniform distribution, we employ a uniform tree with increasing box size as P is increased to minimize runtime. The non-uniform distribution is generated by generating random points within the unit cube and raising the generated x, y, z positions to the powers of 1.2, 0.7, and 1.7, respectively, before rescaling the results to fit into the 0.64 m cube. In this case, we use a non-uniform tree with 2:1 balance and increase the minimum box size with P to approximately minimize runtime according to the uniform-tree cost estimate. The error convergence with increasing P for both distributions is shown in Fig. 2.4. The error metric used here relies on randomly sampling the analytical fields at one observer on each 25 (a) r−ν potential (b) Yukawa potential e−γr/r, γ > 0 Figure 2.3: Far-field only error convergence vs. ACE expansion order for different kernel functions. Figure 2.4: Convergence of the Stokes potentials for uniform and non-uniform volume distributions. process, ensuring a good spatial distribution of observers, and comparing it with the computed potential using ACE. For each selected particle, the error in the potential relative to the analytic solution is computed, and the average of these errors is reported. 2.5.2 Cost comparison of evaluating L∇ {ψ} vs. ψ Next, we demonstrate the low overhead for evaluating L∇ {ψ} using our method. Consider the evaluation of the Laplace potential and the Stokes potentials for a collection of 100,000 particles 26 0510152010−910−710−510−310−1ExpansionorderPL2relativeerrorν=−1ν=−2.70510152010−710−610−510−410−310−210−1ExpansionorderPL2relativeerrorγ=0.4γ=1.0γ=2.3510152010−710−510−310−1StokesletRotletStressletExpansionorderPAvg.relativeerrorUniformNon-uniform Table 2.1: L2O timings (in seconds) for Laplace potential (scalar) and Stokeslet (vector) P Stokeslet Laplace Ratio 6 0.118 0.046 2.57 8 0.198 0.072 2.75 10 0.333 0.104 3.20 12 0.570 0.156 3.65 14 0.769 0.229 3.36 16 1.125 0.317 3.55 18 1.561 0.418 3.73 distributed randomly inside a cube of diameter 1 m. Note, for the Stokes potentials we use vector sources. 
It follows from the arguments thusfar, the principal difference in cost between evaluating the Laplace and Stokes potentials should be a factor of three arising from the difference between the nature of the source distributions (scalar vs. vector). For the Stokes potentials, each of the three vector components of the local expansion must be disaggregated with the derivative operations defined in (2.12) and assembled to form the observed field. Table 2.1 presents timings of the L2O stage for both the Laplace and Stokes potentials for increasing values of P and the ratio of the Stokeslet time to the Laplace time. The increase hovers around the expected factor of 3, albeit with additional overhead for larger P which could potentially be due in part to cache effects associated with the O(P3) arrays involved in these computations. 2.5.3 Parallel performance Finally, we examine the performance of the algorithm in evaluating all three Stokes potentials simultaneously at the L2O stage. These results were obtained on the Haswell partition of the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC). This cluster comprises 2388 compute nodes with two sockets each, populated by 16-core Intel Xeon E5-2698 v3 “Haswell” CPUs running at 2.3 GHz and 64 GB DDR4 RAM at 2133 MHz per socket. The algorithm was implemented in Fortran 90 using double-precision arithmetic and parallelized strictly in distributed-memory fashion using MPI. The code was compiled using the Intel compiler version 18 with optimization and architecture-specific instructions using the -O3 -xHost flags. For these runs we use vector-valued sources in R3 and evaluate the far-field only, selecting P = 7 to give O(10−5) accuracy for the Stokeslet and O(10−3) accuracy for the rotlet and stresslet. We next consider a uniform random distribution of 5 billion randomly-oriented vector sources 27 (a) Strong scaling (b) Per-process timings Figure 2.5: Strong scaling and per-process timings for Stokes kernel evaluation on a uniformly- distributed volume geometry with 5 billion particles. with unit norm inside a cube of diameter 8 m. The minimum box size was set to 0.015 m, yielding an 11-level uniform tree with 33 particles per leaf box on average. Fig. 2.5a shows the parallel efficiency (strong scaling) of this distribution up to 16,384 processes with respect to 1024 processes. The increase in efficiency from 2048 to 4096 processes is due to the fact that the distribution is randomly re-generated for each run. The efficiency is over 90% for all cases. Fig. 2.5b shows the timings for each process for the 16,384 process case. The C2M stage requires on average 0.54 s while the L2O stage, modified for the Stokes potentials, requires 6.05 s, about an 11.2X increase. Most of this increase is due to the calculation of the stresslet potential. The overall computational time is dominated by the V-list calculation, taking over 45 s of the 54 s, underscoring the importance of avoiding redundant V-list calculations related to duplicate nodes at upper levels of the tree. We now consider uniformly and non-uniformly distributed sources on the surface of a sphere. Both distributions comprise 1.024 billion particles. For the uniform distribution, the sphere diameter is 2 m. A uniform tree is used with a leaf box size of 5 × 10−4 m, yielding a 13-level tree with leaves containing 15 particles per box on average. The strong scaling and per-process timings on 16,384 processes for the uniform distribution are shown in Fig. 2.6. 
A mock distribution is also shown inset in the strong scaling plot. In this case, the efficiency is over 77%. The choppiness of 28 1,0242,0484,0968,19216,3845060708090100#processesParallelefficiencyη(%) 0 10 20 30 40 50 60 0 2000 4000 6000 8000 10000 12000 14000 16000Time (s)Process rankFar fieldV-listL2LM2M (a) Strong scaling (b) Per-process timings Figure 2.6: Strong scaling and per-process timings for Stokes kernel evaluation on a uniform spherical distribution of 1.024 billion points. (a) Strong scaling (b) Per-process timings Figure 2.7: Strong scaling and per-process timings for Stokes kernel evaluation on a non-uniform spherical distribution of 1.024 billion points. 29 1,0242,0484,0968,19216,3845060708090100#processesParallelefficiencyη(%) 0 1 2 3 4 5 6 7 8 9 10 0 2000 4000 6000 8000 10000 12000 14000 16000Time (s)Process rankFar fieldV-listL2LM2M1,0242,0484,0968,19216,38460708090100#processesParallelefficiencyη(%) 0 2 4 6 8 10 0 2000 4000 6000 8000 10000 12000 14000 16000Time (s)Process rankFar fieldX,V,W-listL2LM2M the L2L stage timings is due to the fact that the number of particles assigned to each process is uneven; instead, the load balancing algorithm aims to balance the overall computational load. As the bulk of the work lies in the V-list evaluation, Fig. 2.6 suggests the algorithm does a reasonably good job. In the case of the non-uniform distribution, the sphere has a diameter of 8 m, and the particles are clustered around the north and south (±z) poles (as depicted inset in Fig. 2.7a). The minimum box size is chosen as 6.25 × 10−7 m in diameter, resulting in a 25-level tree. We set smax = 50 particles per box, resulting in about 18 particles per box on average and distributing leaves over the bottom 16 levels of the tree. Despite the extremely non-uniform distribution of particles, the strong scaling shown in Fig. 2.7a is still as high as 78% efficient on 16,384 processes. In addition, the per-process timings presented in Fig. 2.7 for the far-field evaluation suggest that the load balancing algorithm does its job reasonably well. Again, the choppiness here is due to the fact that particles are not evenly distributed across processes, and processes with high particle counts must spend a lot of time computing X- and W-list interactions. Those with the most particles stick out in these plots. 2.6 Conclusion We have presented an efficient parallel algorithm for evaluating arbitrary non-oscillatory po- tentials and those defined in terms of gradients on a non-oscillatory potential. We demonstrated the accuracy of the numerical method for several different potential functions and showed that the parallel algorithm scales well for surface, volume, uniform, and non-uniform distributions of billions of particles, achieving nearly 78% parallel efficiency on 16,384 processes. 30 CHAPTER 3 PARALLEL NON-UNIFORM WIDEBAND MLFMA FOR MULTISCALE ELECTROMAGNETIC SIMULATION 3.1 Introduction Over the past several decades, computer simulation has become an indispensable tool for design- ing and prototyping in electromagnetic (EM) engineering. This includes designing microwave/RF circuits and antennas, electromagnetic interference mitigation, stealth aircraft profile reduction, THz devices, and much more. Many of these problems, however, pose significant challenges for existing computational methods. 
The confluence of perpetual increase in frequencies of engineer- ing interest, the widespread availability of quality computational resources, and the demand for realism in computer models is driving a massive increase in the number of degrees of freedom, Ns, required to solve these problems. While the literature is filled with means to tackle these problems, over the years, surface integral equations have emerged as the main tool of analysis [54, 20, 41]. The main bottleneck to widespread use of these methods was their computational complexity; since the 1990s, this has largely been overcome thanks to accelerators such as the multilevel fast multipole algorithm (MLFMA) that have reduced the computational complexity from O(N2 s ) to O(Ns log Ns) for surfaces. As a result of these significant benefits, research into advancing nuances and application of these approaches is extensive and perhaps too numerous to enumerate; a partial list can be found in [71, 52] and references therein. As is evident from these citations, there is still considerable interest in exploring and expanding the range of these techniques at either end, i.e., extending these methods to problems with high disparity in discretization scales that are then embedded in electrically large objects. At the same time, there has been a concerted effort to develop parallel algorithms to further exploit the capabilities of these algorithms. Both these issues are explored in more detail next. The literature is rich with efforts to bridge the length scale between the Helmholtz regime and 31 the low-frequency or Laplace regime. The principal challenge is the low-frequency breakdown of the classic MLFMA. The earliest work on this problem was done by Greengard [35, 37]. Soon after, Zhao and Chew [83] introduced a modification of the original addition theorem to stabilize the MLFMA at low frequencies. Since these, other methods have emerged, including [19, 26, 12, 5, 69]. Another method introduced in 2007 is based on accelerated Cartesian expansions [61, 73] blends seamlessly and intuitively with the MLFMA, and is error-controllable to arbitrary accuracy with no bound on the discretization density. It is apparent that to extend the reach of fast multipole methods, development of parallel algorithms is a necessity [70]. The earliest efforts were based on extending methods developed for the Laplace (electrostatic) fast multipole method (FMM) to MLFMA with little success. Recent effort focused on building an algorithm based on hierarchical partitioning (HiP-MLFMA) [28] and more recently, its blockwise variant [51]. This approach works by partitioning in space at lower levels and directions at higher levels; the blockwise scheme optimizes the manner in which the directions are partitioned. In these methods, one uses a number of processors that is either a power of 2 (for hierarchical partitioning) or a power of 4 (for blockwise hierarchical partitioning). Another algorithm [48] that has also shown excellent scalability and timings takes a different approach; using post-order traversal, the resulting self-similarity of the tree as well as direction partitioning yields an algorithm that scales very well [48]. 
The salient features of this algorithm are (i) wideband MLFMA for analysis from low to high frequencies, (ii) global (exact) interpolation and anterpolation, (iii) demonstrated error control (L2 error norm against analytical data), and (iv) adaptive direction partitioning that is dictated by the number of duplicate nodes; for the purposes of this discussion it will be referred to as AP-MLFMA, for “adaptive partitioning”. It was shown that the matrix vector product (matvec) times of AP-MLFMA are highly competitive with others. To a large extent, both methods assume a uniform tree. Despite this progress, several open problems remain, particularly for wideband and multiscale MLFMA systems. These can be summarized as accuracy in deep non-uniform trees and their relative trade-offs in developing a efficient parallel algorithms. To wit, both HiP- and AP-MLFMA 32 offer a solution to a portion of the problem. The former relies on local inter/anterpolation, which has linear asymptotic complexity but requires significant oversampling, resulting in an increase in cost. It is also inaccurate, as one needs a filter at the anterpolation stage [27, 62, 19]. On the positive side, HiP-MLFMA is easier to parallelize. The AP-MLFMA uses global (exact) inter/anterpolations so the accuracy can be well controlled, and the sampling is optimal and predetermined. However, developing parallel algorithms for this approach is a challenge; in addition, memory costs required to store matrices for inter/anter-polation on each node limit the height of the tree. Both HiP- and AP-MLFMA do not include non-uniform trees within their parallel frameworks. Algorithms that address both wideband and multiscale nature of the problems implies that one needs both the capability of handling low frequencies as well as adaptive non-uniform trees [73, 15, 19]; while such algorithms exist in serial, the literature on parallelization is extremely sparse with each [48, 9] addressing a portion of the problem. The principal contribution of this chapter is to address these extant deficiencies; our objective is to provide a parallel computational framework for computing fields arising from realistic distri- butions with the following properties: (a) it is a methodology whose accuracy can be controlled to desired precision, (b) it includes transitions to low frequencies and non-uniform distributions, (c) the sampling of the spectrum is optimal and overcomes memory bottlenecks associated with inter/anterpolations, and (d) it is efficient in terms of scalability and execution times. We present a number of results to demonstrate the accuracy of the numerical methods and the performance of the parallel algorithms presented here. The methodology extensively extends the work [73, 48] on ACE-AP-MLFMA. The rest of the chapter is organized as follows: Section 3.2 lays out the EM scattering problem to be solved. In Section 3.3, we review the wideband MLFMA, establish the need for filters for anterpolation, develop transitions between spherical harmonics and Fourier series for interpola- tion/anterpolation, non-uniform trees for multiscale features, and an interpolation method for saving memory. Section 3.4 outlines the parallel algorithm, and finally, Section 3.5 demonstrates the error control and performance of the serial and parallel algorithms. 33 3.2 Problem Statement Consider an object residing in a region D ∈ R3 bounded by the surface ∂D, the perfectly- conducting portion of which is denoted by S. 
Let(cid:8)Ei(r), Hi(r)(cid:9) denote electric and magnetic fields impedance of the medium is denoted by η = (cid:112)µ/ε. The field scattered by the object may be that are incident on this object, µ and ε denote the magnetic permeability and electric permittivity µε denote the wavenumber for frequency ω. The intrinsic of the medium, respectively, and k = ω √ obtained using a combined field integral equation (CFIE) formulated in terms of the unknown electric surface current density J(r) on S. For r ∈ S, α ˆn × ˆn × Ei(r) + (1 − α)ˆn × Hi(r) = −αL{J}(r) + (1 − α)K{J}(r) Þ G(cid:0)r − r(cid:48)(cid:1) · J(r(cid:48)) dS(cid:48) Þ G(cid:0)r − r(cid:48)(cid:1) · J(r(cid:48)) dS(cid:48) (cid:20) G(cid:0)r − r(cid:48)(cid:1) (cid:17) −jkη L{J}(r) (cid:17) ˆn × ˆn × K{J}(r) (cid:17) ˆn × 1 jkη g(r − r(cid:48)) S ∇ × (cid:21) S I + ∇∇ k2 g(r − r(cid:48)) (cid:17) e−jk|r−r(cid:48)| 4π|r − r(cid:48)|, (3.1) (3.2) (3.3) (3.4) where (3.5) is the scalar Helmholtz Green’s function and ˆn (cid:17) ˆn(r) denotes the unit normal to S at the point r. Solving for J via the method of moments (MoM) involves the discretization of (3.1), typically by expansion and testing with the classic RWG basis functions [58] defined on a triangular mesh. To do so, we expand the unknown surface current as J(r) = Infn(r), (3.6) where fn(r) denotes the RWG function associated with the nth edge, and Ns is the total number of edges in the mesh. The Ns × Ns matrix equation resulting from the MoM procedure may be expressed as Ns n=1 Z I = V, 34 (3.7) where I =(cid:8)I1, . . . , INs (cid:9)T is the vector of unknown coefficients, and the elements of the matrix Z and vector V are given by mn = − (cid:104)fm(r), L{fn}(r)(cid:105) Z E mn = (cid:104)fm(r), K{fn}(r)(cid:105) Z H Zmn = αZ E mn + (1 − α)Z H mn (cid:28) Þ S jη k fm(r), jkη ∇ · fm(r), − g(r − r(cid:48))fn(r(cid:48))dS(cid:48)(cid:29) (cid:28) g(r − r(cid:48))∇(cid:48) · fn(r(cid:48))dS(cid:48)(cid:29) Þ (cid:28) ∇g(r − r(cid:48)) × fn(r(cid:48))dS(cid:48)(cid:29) Þ (cid:68)fm(r), α ˆn × ˆn × Ei(r) + (1 − α)ˆn × Hi(r)(cid:69) fm(r) × ˆn, S S = = (3.8) (3.9) (3.10) , Vm = (3.11) where (cid:104)·, ·(cid:105) denotes the usual inner product. Here, the EFIE is represented in mixed-potential form. Solutions to this equation are typically effected using an iterative solver, for instance, the Generalized Minimal Residual (GMRES) method. Coupling this equation with a accelerator amortizes the cost of the matrix vector product; the most popular of the accelerators being the multilevel fast multipole method (MLFMA) [67]. Next we discuss challenges and their resolution when these methods are applied for multiscale analysis. 3.3 Challenges and remedies for parallel multiscale analysis All acceleration methods for accelerating the solution of (3.7) rely on the fact that the matrix Z may be partitioned into a matrix Z NF of near interactions and a matrix Z FF of far interactions, or Z = Z NF + Z FF . (3.12) This partitioning is founded on the notion that matvecs with Z FF can be computed in O(Ns log Ns) time, and those with Z NF can be computed in O(Ns) time. Developing criteria for such a partitioning for multiscale problems poses challenges from both a computational as well as parallelization perspectives. These challenges arise from the non-uniformity of discretization and their distribution. To set the stage, assume that the spatial distribution of unknowns can be mapped onto a uniform 35 octree data structure. 
The root of the tree contains all unknowns, and the leaves of the tree correspond to clusters of unknowns, serving as the interface between the tree and the unknowns. The matvec with Z FF is effected by using appropriate operators to traverse up, down and across the tree. Assume that the minimum size of a leaf is around 0.2λ. In a multiscale scenario a leaf box may contain too many unknowns, thus dramatically increasing computational costs of near- field interactions and overwhelming the cost complexity of the MLFMA. This can be overcome using a judicious choice of representations for the Green’s function that enable a low-frequency decomposition, as in [37, 19, 12]; here we employ the accelerated Cartesian expansion (ACE) method [61, 73]. Enabling this results in a tree that could potentially have MLFMA leaves that are then roots of subtrees that extend downwards. From a computational perspective one needs to develop operators to efficiently transfer information between nodes (both at the same level and between levels). Another bottleneck is accurate traversal up and down the tree. As will be shown later, a provably accurate approach is to use a global representation of radiated field or use a local bandlimited representation [46, 49]. For the latter, the price that one pays is the oversampling required for the same accuracy. However, the downside of using global sampling is the challenges it poses in terms of memory requirements for storing interpolation coefficients at higher levels in the tree as well as parallelization. As a result, an efficient strategy for deep trees would use a spherical filter that uses spherical sampling up to a certain level, switches to a FFT-based filter using uniform sampling [60] to save on memory, and then a local bandlimited filter with linear complexity to leverage parallelism. This concept is illustrated in Fig. 3.1. While we leave the discussion of local bandlimited interpolation and modification of parallel algorithms to a subsequent paper, we shall extensively discuss spherical, uniform sampling and transitions between the two, and parallelization strategies that one can employ. In what follows, we will explore each of these issues in greater detail. Note, the exposition will focus on evaluation of the scalar Green’s function; changes necessary to effect these operations for the dyadic Green’s function are well known and only those subtleties that are not well known are elaborated upon. 36 Figure 3.1: Illustration of the target computational strategy for representing interactions in a gen- eral multiscale geometry. Each T represents operations performed with the subscripted method, e.g. ACE, spherical and uniform interpolation/anterpolation, and local band-limited interpola- tion/anterpolation. 3.3.1 Non-uniform trees MLFMA relies on construction of oct-trees to partition near and far interactions. Here we briefly discuss creating such list for non-uniform distributions. Consider a box b at any level in the tree; we denote its parent by P(b) and its grandparent by P2(b) and so on up the tree. Boxes that share a spatial locations (vertex, edge or face) with b are identified as being in its near field. If two boxes are of the same size (and share a spatial location), they belong to the U-list. Likewise if two boxes are of the same size, do not share a spatial location, and their parents are in the near field of each other, they are in each other’s farfield or V-list. 
To handle non-uniform distributions, we develop an adaptive tree [15] starting with a uniform tree representation where all leaf boxes reside precisely at the same level, merging siblings subject to the rule that their parent does not contain more than some specified smax DoF. An important consequence of adaptive methods is the introduction of far-field interactions between boxes at different levels of the tree or cross-level interactions. These interactions, deemed the X- and W-lists, are discovered during the construction of the near and far interaction lists in the 37 1101001000100001000001000000100000110000010100000111000011000010100001001000010110000111000110001010001110011001010011101101010111111011001101111111011100111011110101110111110110111011111101110111011111111111101111001111000111100111110111111111110111111TACETsphTuniTlocal following manner. Consider a box b. To construct its interaction lists, we generate all hypothetically interacting boxes with b at the same level according to the scheme described earlier, and we perform a top-down search of the octree to locate each such hypothetical box h. If h itself is found in the tree, it is obviously added to the appropriate U- or V-list. If the search for h dead-ends on an ancestor Pn(h) that is not a leaf, the interaction pair (b, h) is discarded; however, if Pn(h) is a leaf and it satisfies the conditions of the present interaction list, we retain (b, Pn(h)) in the X-list and its reciprocal interaction pair (Pn(h), b) in the W-list. The computational procedure is as follows: • Construct the non-uniform tree and delineate near and far field interactions • Precompute and store near field interactions • For any matvec, – Compute contributions due to the near-field – Compute contributions due to the far-field: ∗ For all leaves, compute charge to multipole (C2M); ∗ Construct multipole information for all boxes from those of their children (M2M). This operation is level dependent, and one needs the following operators: MACE −→ MACE, MACE −→ Msph, Msph −→ Msph, Msph −→ Muni and Muni −→ Muni; ∗ Translate from multipole to local expansion (M2L). For non-uniform trees, this would involve constructing operators for both in- and across-level translations; ∗ Construct local expansions for all boxes from those of its parent (L2L). This is a conceptual inverse of the multipole operation and one needs to develop identical operators to traverse down the tree; ∗ From local expansions at leaves, construct far field information at all observers (L2O); – Sum the contributions of near- and far-fields. 38 3.3.2 Low-frequency analysis Over distances or length scales which are small relative to a wavelength (< 0.25λ or so), the oscillatory behavior of g over larger distances gives way to quasi-static behavior which can be accurately modeled using localized Taylor series expansions. This is the foundation of the ACE method [61, 73] for the low-frequency Helmholtz problem. Consider the evaluation of the potential Ψ(r) due to a source distribution u(r) Ψ(r) = ung(r − r(cid:48) n). (3.13) Ns n=1 To compute the potential potentials between regions that are sufficiently separated, ACE [61, 73] dictates that the potential due to s(cid:48) sources with location and strength {rm, um}, m = 1, . . . 
, s(cid:48) within the region Ωs centered at rc s) · n · ∇(n) g(r − rc s), (3.14) ∞ s can be written as M(n)(rc ΨFF(r) = n=0 where ·n· denotes an n-fold tensor contraction, M(n) is the multipole tensor representing the sources within Ωs and the tensor ∇(n)g(r − rc s) is an n-fold tensor operand on the Green’s function. Details for these expression, as well as aggregation operators MACE −→ MACE and MACE −→ Msph and their counterparts for disaggregation are available in [73]. Operators addressing translations across levels are presented later in this section. 3.3.3 MLFMA The derivation of FMM and its multilevel variant are well known [21, 67, 66, 22, 60]. Given the wealth of papers in this area over the past few decades, it seems somewhat presumptuous to contend that there is significant new information to be added to the literature. The purpose of this section is somewhat different. In keeping with our goal to have a methodology with controllable accuracy, we will describe methods that permit both better understanding and error control. We start with a 2-level description of MLFMA that is based on the integral representation of the Green’s function g(X + d) ≈ − jk (4π)2 Þ S2 e−jk·(ds+do)T(k, X)d2 ˆk, 39 (3.15) Figure 3.2: Graphical illustration of the addition theorem and some notation. where the translation operator T is given by T(k, X) (cid:17) (−j)n(2n + 1)h(2) n (k X)Pn (cid:16) ˆk · ˆX (cid:17) , (3.16) ∞ n=0 with k = k ˆk. Here, S2 denotes the unit sphere, parametrized by (θ, φ) ∈ [0, π] × [0, 2π]. We note that k = k(θ, φ), and use these notations interchangeably. Fig. 3.2 illustrates the decomposition of |r − r(cid:48)| into |X + d(1)|, where d(l) denotes the sum of particle-to-center vectors at level l with l = 1 being leaf level. We denote the centers of boxes containing r(cid:48) and r as rc o(1), s(1) − r(cid:48). The rules to evaluate the spectral integral respectively. It follows that d = r − rc are well-understood and in place [27, 66, 60, 19, 73]. The far-field part of the potential (3.13) is evaluated using s(1) and rc o(1) + rc Þ S2 e−jk·doU1 (θ, φ) d2 ˆk, ΨFF(r) ≈ − jk (4π)2 where the local expansion U1 of the observer box is U1 (θ, φ) (cid:17) T(k, X)V1(k), and V1 denotes the multipole expansion of the source box V1 (θ, φ) = −jk·(rc s−r(cid:48) i) . uie s(cid:48) i=1 40 (3.17) (3.18) (3.19) Xdodsrcsrcorr0 In what follows, we discuss the intimate connection between integration, interpolation (Vl+1 (θ, φ) −→ Vl (θ, φ)) and anterpolation (Ul+1 (θ, φ) −→ Ul (θ, φ)). 3.3.3.1 Inter/anterpolation and spectral integration Let us revisit (3.15). It is well known that the plane wave expansion gives us e−jk·(ds(1)+do(1)) ≈ S(kd, θ, φ) Nh(1) n = n=0 m=−n anmYnm (θ, φ) (3.20) where d = |d(1)| = |ds(1) + do(1)|, and anm are known coefficients of the normalized spherical harmonics, Ynm (θ, φ), used to represent the plane wave. Assuming that T(k, X) can be written as nm (θ, φ), it follows that the (3.20) filters the above representa- n=0n T(k, X) =∞ m=−n bnm(k|X|)Y∗ The filter due to S(kd, θ, φ) follows from the fact that, if higher order terms (n > Nh(1)) are present, they integrate to zero analytically. In other words, S(kd, θ, φ) effectively band-limits the spectrum of the incoming wave. As a result, the integration rules (and number of harmonics of the translation operator) are chosen to exactly integrate polynomials of order 2Nh(1) present in (3.15). If polynomials of order higher than 2Nh(1) were present, then, while they should theoretically integrate to zero. 
But they would not do so numerically if the integration rule is not chosen appropriately. 41 tion via (3.15), i.e., = = Nh(1) n (cid:34) ∞ n Þ Þ S2 S(kd, θ, φ)T(k, X)d2 ˆk S2 d2 ˆk × Þ m=−n S2 d2 ˆk S(kd, θ, φ) · bnmY∗ m=−n n=0 n=0  anmYnm (θ, φ) (cid:35) n nm (θ, φ) ∞Nh(1) n=0 m=−n (3.21) bnmY∗ nm (θ, φ)  Given the bandwidth of S(kd, θ, φ), the number of quadrature points/samples required to com- pletely represent S(kd, θ, φ), i.e, obtain all coefficients anm, is (Nh(1)+1)×(2Nh(1)+1). To evaluate the integral (3.15) numerically, one should be able to recover bnm from samples of T(k, X), and vice- versa. An integration rule to be used can be defined using Nh(1) = (cid:100) χskD(1)(cid:101), Nθ(1) = Nh(1) + 1, and Nφ(1) = 2Nh(1) + 1, where D(1) is the diameter of a sphere that encloses the leaf box, and χs is an oversampling factor. In a multilevel scenario, the same logic holds at every level. That is, both the contributions from that level and those from its parents should be representable in terms of spherical harmonics up to a chosen order. This implies that one needs to develop an anterpolation operator to effect a filter of the incoming spectra to enable a transition from parent to child, or in terms of representation, reduce the maximum degree of spherical harmonics used from Nh(i + 1) −→ Nh(i). To control errors arising from this transition, it necessary to understand the spectral properties of the anterpolation operator used; local algebraic inter/anter-polation operators does not accomplish the necessary filtering, and errors due to lack of filtering is exacerbated in deep trees. As a result, we will use bandlimited anterpolants. As discussed in Section 3.4, using such global operations poses bottlenecks in parallelization. Thus far, we have established the need for filters to perform interpolation on outgoing expansions (Vl (θ, φ)) and more importantly, anterpolation on incoming expansions Ul (θ, φ). In order to treat both expeditiously, we denote both using Λl (θ, φ). The two variations that we have used are spherical and Fourier transforms. Spherical filters as used in [40, 62] result in optimal sampling of the far field data. An alternative approach is to use Fourier transforms in both angular directions [60], requiring uniformly-spaced samples. Obviously, the tradeoff is O(Nh(l)3) memory required to effect the Legendre transforms vs. doubling of the sample count required for the latter. The approach we espouse is to choose a transition level at which the representation switches from spherical to uniform sampling. 42 Spherical filters Spherical filters are based on projecting an incoming or outgoing field pattern Λl onto spherical harmonics as Λl (θ, φ) = λl,nmYnm (θ, φ) (3.22) and then performing the various down- and up-sampling operations. We shall not delve into details, as these are well known and obtainable from references [40]. Nh(l) n n=0 m=−n Fourier filters The concept of using Fourier transforms was first introduced into the FMM literature by Sarvas [60]. The key to this approach was the realization that the angular domain of the associated Legendre function can be extended from [0, π] → [0, 2π] [78]. 
To summarize, Λl (θ, φ) can be extended to the entire sphere as  ˜Λl (θ, φ) = Λ(θ, φ) Λ(2π − θ, φ + π) Λ(2π − θ, φ − π) (θ, φ) ∈ [0, π] × [0, 2π] (θ, φ) ∈ [π, 2π] × [0, π] (θ, φ) ∈ [π, 2π]2 Nh(l) Nh(l) n=−Nh(l) m=−Nh(l) ˜Λl (θ, φ) = ˜λl,nme−jnθe−jmφ (3.23) (3.24) If the spectrum of Λl (θ, φ) is bandlimited to Nh(l) harmonics, then it can be written as It follows that one needs to sample at (2Nh(l) + 1)×(2Nh(l) + 1) uniformly on the sphere to recover the coefficients ˜λl,nm. Using the above representation we can write (3.17) as ΨFF(r) ≈ − jk 32π 2 dθ |sin θ| dφe−jk·do(1) ˜Λ1 (θ, φ) (3.25) In the above expression, |sin θ| is not bandlimited. But as discussed earlier, ˜Λ filters |sin θ| such that one can represent it in terms of a finite Fourier series. Specifically, Þ 2π 0 |sin θ| ≈ ˜s(θ) = sne−jnθ (3.26) As a result, it follows that the highest order of harmonics in the θ is 2Nh(1), and the integration rules must be designed such that one can evaluate (3.25) using (4Nh(1) + 1) × (2Nh(1) + 1). This Þ 2π 0 Nh(1) n=−Nh(1) 43 Þ 2π Þ 2π is done using uniform sampling in both θ and φ directions. Further, we note that as is the usual practice, we can embed |sin θ| within the translation operator. As a result, one needs to evaluate an integral of the form ΨFF(r) ≈ − jk 32π 2 dθ 0 0 dφe−jk·do(1) ˜Υ1 (θ, φ) (3.27) where ˜Υ1 (θ, φ) = ˜s(θ)T(k, X). As before, these ideas can be trivially extended to a multilevel setting. Spherical to uniform and vice-versa Both spherical and uniform interpolation have their draw- backs in terms of costs. Transitioning from one method to the other yields a computationally efficient scheme. To facilitate this, in what follows we present the means to do so. Consider the transition from spherical to uniform. The procedure is straightforward, i.e. Λl (θ, φ) sph−−−→ λl,nm uni−−−→ ˜Λl+1 (θ, φ) , (3.28) where the stacked symbols indicate the sampling regimes and embed a shift operation for upward traversal. The above equation indicates that one traverses from the spherical samples of Λl (θ, φ) to coefficients λl,nm and then using these to construct the requisite samples at uniform points in the [0, 2π]×[0, 2π] grid, and then shift these from the center of the child box to that of its parent. Note that one can exploit symmetry to store field samples only in the [0, π] × [0, 2π] grid by using an even number of samples in φ. Next, to effect the transition from uniform to non-uniform sampling we consider (3.27): Þ 2π Þ 2π dθ 0 sin θdθ Þ 2π Þ π 0 0 dφe−jk·do(1) ˜Υ1 (θ, φ) dφe−jk·do(1) ˜Υ1 (θ, φ) sin θ 0 2 ΨFF(r) ≈ − jk 32π = − jk (4π)2 (3.29) As is evident from the above expression, the integration has been reduced back to the unit sphere. However, the integration is challenging as ˜Υ1 (θ, φ)/sin θ is not bandlimited in terms of spherical harmonics. To effect the integral, we have to resort to the notion that the the effective bandwidth 44 of e−jk·do filters the rest of the integrand. 
That is, 1 sin θ ˜Υ1 (θ, φ) = υ1,nmYnm (θ, φ) n m=−n Nh(1) Þ 2π n=0 0 Þ π 0 dθ To obtain υ1,nm we evaluate υ1,nm = dφ ˜Υ1 (θ, φ)Y∗ nm (θ, φ) The above equation can be evaluated exactly by defining an extended spherical harmonic as (3.30) (3.31) (3.32)  ˆYnm (θ, φ) = (θ, φ) ∈ [0, π] × [0, 2π] Ynm (θ, φ) Ynm(2π − θ, φ + π) (θ, φ) ∈ [π, 2π] × [0, π] Ynm(2π − θ, φ − π) (θ, φ) ∈ [π, 2π]2 The periodic extension of the normalized spherical harmonics can be represented in terms of a Fourier series, and as a result, the integral Þ 2π 0 Þ 2π dθ 0 υ1,nm = 1 2 dφ ˜Υ1 (θ, φ) ˆY∗ nm (θ, φ) (3.33) can evaluated exactly using uniform samples of ˜Υ1 (θ, φ) that are available to us with a trapezoid rule in both dimensions. 3.3.3.2 Subtleties for vector fields The algorithm of this chapter uses four trees; one each for component of the magnetic vector potential and one for the scalar potential. This is the point form of MLFMA that addresses integration issues [23]. Alternatively, one can develop MLFMA with either two or three trees. The two tree version is the usual dyadic MLFMA and requires either vector spherical harmonics [62] or the Fourier method modified to account for the anti-symmetry of the transverse fields [53], whereas the three tree version requires mapping onto the three components of electric fields, Ex, Ey, Ez, and employing the scalar filters espoused here. Multiplication of fields of the form (3.22) by the dyad ˆθ ˆθ + ˆφ ˆφ increases the polynomial order by two, requiring a commensurate increase in the size of the integration rule. 45 3.3.4 Cross-level interactions Next, non-uniform tree introduces the notion of cross-level interactions. Addition theorems used derive either MLFMA of ACE are invalid for these interactions. To circumvent this issue, we subsume either do or ds, depending on the interaction, into the translation heading X. If the source box Ωs is smaller than the observer box Ωo, i.e. |Ωs| < |Ωo|, the source multipole expansion is valid everywhere within Ωo, meaning X ← X + do. The source multipole expansion is consequently translated from its expansion center rc s directly to each particle within Ωo. Such interactions are tabulated in the X-list. Conversely, if |Ωs| > |Ωo|, we take X ← X + ds, translating each particle within Ωs to the center rc o of Ωo, adding to the local expansion. These interactions are tabulated in the W-list. Both of these forms of the addition theorem are always valid for well-separated interactions. The essential criterion is that the two boxes are separated by at least one box of the same size as the smaller interacting box. The interaction “type” is determined by the field representation (ACE or MLFMA) of the smaller of the two boxes. Consider evaluation of (3.13) between source and observation points in Ωs and Ωo, respectively. Let M(n) be the ACE multipole tensor representing the sources within Ωs centered at rc s. Then we may write ∞ ∞ n=0 M(n) · n · ∇ng(r − rc s), o)n · n · L(n), n=0(r − rc L(n) = ui n!∇ng(rc o − ri). Þ Þ S2 T(k, r − rc S2 e−jk·(r−rc s)V(k, rc o)U(k, rc s)d2 ˆk, o)d2 ˆk, N i=1 46 Ψ(r) = where In the MLFMA, we have Ψ(r) =   |Ωo| > |Ωs| |Ωo| < |Ωs| |Ωo| > |Ωs| |Ωo| < |Ωs| (3.34) (3.35) (3.36) where V(k, rc s) is the source multipole expansion, and N i=1 U(k, rc o) = uiT(k, rc o − ri) (3.37) represents the local expansion within Ωo of the sources. As with the interaction type, the band limit for the translation operator T is determined by the smaller of the two boxes. 
It should be noted that the ACE translation operator is valid only over electrically-small distances [73], so care should be taken to ensure that translations from a box at the coarsest ACE level to points within MLFMA boxes fall within this radius. For this purpose, 2:1 balancing of the octree [68] at this level suffices. We also note that the minimum translation distance for X- and W-list interactions falls from 2D(l) to 1.5D(l) if all particles exist inside the interacting boxes, limiting the range of the MLFMA truncation parameter Nh for such interactions. Interpolation of MLFMA translation operators 3.3.5 To reduce the storage overhead associated with translation operators, interpolation in ˆk· ˆX is typically used [65]. However, this scheme is untenable for large numbers of X- and W-list interactions, as it requires the number of unique translation distances to be small. To ameliorate this cost, we can also interpolate in the argument of the spherical Hankel function to further reduce the storage without significantly reducing the accuracy of the MLFMA. Using the well-known series representation (−j)k−n−1(n + k)! 2k k!(n − k)! z−k (3.38) n k=0 ˜h(2) n (z) h(2) n (z) = = e−jz z e−jz z n (z) = n (z). It is then common to all elements of the series T, and we may implicitly define a It is well known that we may pull out a phase-magnitude term independent of n, obtaining h(2) (e−jz/z)˜h(2) smoother, phase- and amplitude-compensated translation operator ˜T: T(k, X) = e−jk X k X ˜T(k, X). 47 (3.39) We employ the Lagrange interpolating polynomials (cid:96)n(x) of order p. For a given level, let k Xmin, k Xmax denote the minimum and maximum translation distances, respectively. The in- terpolation nodes are chosen to give a high density near k Xmin and a lower density near k Xmax because i) the function to be interpolated has O((k X)−K) behavior, and ii) this is typically good practice for polynomial interpolation. The nodes are chosen as follows. Let a = k Xmin − ∆1 and b = k Xmax + ∆2 be the beginning and end of the sampling interval respectively, where ∆1, ∆2 are small extensions of the interval, and J be the number of k X values to interpolate over. Then, for j = 1, . . . , J, we let (cid:20) (cid:19)(cid:21) (cid:18)(j − 1)π 2(J − 1) xj = a + (b − a) 1 − cos denote the jth interpolation node. Let p denote the order of interpolation polynomials used. The interval extensions are chosen as ∆1 = p(k Xmax − k Xmin) ∆2 = p(k Xmax − k Xmin) cos 1 − cos (cid:20) (cid:19)(cid:21) (cid:18) (cid:18)(J − 2)π 2(J − 1) π (cid:19) 2(J − 1) (3.40) (3.41) so that the extensions are the size of only p subintervals at the beginning and end. This prevents extension of the interpolation interval into the explosion zone of the spherical Hankel functions. We store samples of the compensated translator ˜T for level l in a matrix Tl defined as where {βm} is a set of uniformly-spaced samples on(cid:2)−q∆β, π + q∆β [Tl]mn (cid:17) ˜T xn k , (cid:3), with interpolation order q (3.42) and spacing ∆β. Then, for each translation distance k X in the interaction list at level l, we pre- compute the interpolation weights {wj = (cid:96)j(k X)}, j = 0, . . . , p for reconstructing the compensated translator ˜T from the columns of Tl as (cid:16)cos βm, (cid:17) ˜T(cos βm, k X) ≈ wj [Tl]m(Q+j) , (3.43) p j=0 where Q is the index of the first-occurring sample xQ used in the interpolation, determined by binary search. 
During the evaluation phase, (3.43) is evaluated to obtain the samples for the 48 Table 3.1: Comparison of L2 relative error data for translation operator interpolation with and without compensation Sampling Uniform Cosine Uniform+comp. Cosine+comp. max 7.38e-3 2.35e-3 3.28e-3 2.05e-4 min 0 2.39e-7 0 1.10e-7 avg 1.43e-3 4.16e-4 5.59e-4 7.50e-5 k X/D(l) at max 1.81411 3.52377 1.81411 2.3004 interpolation in ˆk · ˆX. The result of this interpolation is then multiplied by the phase-magnitude factor e−jk X/k X to complete the process. The complexity is therefore increased only by a constant factor of (p + 1) ∼ O(1). Table 3.1 summarizes the reconstruction error over a range of translation distances normalized by box size using 2Nh(l) samples and p = 3. The average error is reduced by a factor of two when switching from uniform sampling to cosine sampling, and another factor of two is gained by phase-amplitude compensation. The maximum error at any point within the interval is reduced significantly by the approach described here. 3.4 Summary of parallel wideband MLFMA algorithm In this section, we provide a brief outline of the parallel ACE-AP-MLFMA algorithm used to obtain the numerical results presented in the next section (more details will be provided in a subsequent publication). The parallel algorithm we use is based on that presented in [48] with a number of important improvements. We would like to point out that this algorithm is effectively a bottom-up tree construction/partitioning algorithm, and compared to the top-down hierarchical partitioning schemes used in [28, 50], gives us important advantages in terms of data partitioning and load balancing as discussed below. 3.4.1 Tree construction Our bottom-up tree construction starts by assigning points to leaf boxes according to a prescribed leaf level box size (essentially creating a uniform tree to begin with). Using a parallel bucket- sort algorithm, these leaf boxes are then partitioned into contiguous chunks with the objective 49 of balancing the number of points assigned to different processes. After this initial partitioning, each process constructs the upper levels of its own tree based on its leaf boxes. This scheme will obviously result in sparsely populated leaf nodes given the non-uniform particle distributions in our target problems. Therefore, starting at the leaf level and working up the tree, each process merges its sparsely populated tree nodes. To be precise, if the total number of particles owned by the children of an internal tree node is below a predetermined threshold smax, then those leaf nodes are merged at the parent node, making the parent a new leaf node. Since the octree is distributed across processes, such merge operations are performed recursively at the higher levels in coordination with neighboring processes. Interaction lists are subsequently computed on the merged tree. 3.4.2 Load balancing Even distribution of the computational work is crucial to achieve scaling to a large number of processes. Two major issues make load balancing for the MLFMA algorithm challenging: i) Electrically large objects result in deep trees that contain a significant amount of work across a small number of high level tree nodes, and ii) the non-uniform octree structure which is a natural consequence of the non-uniform point distribution has a work profile with high variability. We address the first issue by inheriting the adaptive direction partitioning idea, first presented in [48]. 
Note that the bottom-up partitioning approach described above results in overlapped tree regions between processes, i.e., internal tree nodes at process boundaries may appear in the partial trees of multiple processes. We call such nodes as duplicate nodes, and to balance the load in deep trees, we distribute the multipole expansion data for these nodes as well as the computations (inter/anterpolations, translations) associated with them evenly across all duplicating processes. As will be discussed in our subsequent paper, the present work significantly improves the implementation of the basic direction partitioning idea by leveraging better parallelism, non- blocking collective communication primitives, and an efficient distributed execution schedule. The load balancing issue associated with non-uniform trees is addressed using an empirical cost evaluation technique. Specifically, each process computes the interaction lists for non-duplicate 50 nodes within its own partial tree and evaluates how much time it takes to perform a sample set of kernel operations (inter/anterpolations and translations) on the given hardware. It then assigns a computational load to each tree node, starting from the highest level by taking into account the empirical kernel costs and the number of interactions each tree node has. These costs are then percolated all the way down to the leaf nodes. Finally, the non-uniform MLFMA tree is repartitioned such that the empirical costs of the contiguous chunks of leaf nodes assigned to each process is evenly distributed. 3.4.3 Parallel evaluation Evaluation of the matvec consists of 3 main phases: i) M2M, upward traversal of the tree to compute the multipole data for each tree node, ii) M2L, lateral translation of multipole data to effect far-field interactions, and iii) L2L, downward traversal of the tree to disaggregate local expansions to the leaf nodes and evaluate fields at observers. Parallelism is applied in three different forms in the M2M phase. Each process first performs the serial computations for their own unique interior nodes, which are nodes that strictly belong to a single process. Secondly, plural nodes in the ACE and spherically sampled levels are handled using coarse-grained parallelism, where all processes sharing a plural node compute the interpolations for the children they own sequentially and reduce or reduce-scatter, for ACE and MLFMA respectively, the results among all sharing processes using non-blocking MPI collectives. As the third form of parallelism, duplicate nodes above the spherical-to-uniform transition level are interpolated using a fine-grained parallel version of the FFT-based interpolation algorithm. The partial multipole data for each duplicate node in this category is shifted, summed, and partitioned over the duplicating processes in parallel. Translations in the M2L phase follow a similar strategy as above, where all processes begin with sequentially computing translations within their own trees, then exchange multipole data with neighboring processes and process them as they arrive – effectively overlapping computation and communication. 
Due to the presence of cross-level interactions in non-uniform trees, these com- 51 putations are organized into three subphases where X-list translations are performed by evaluating the spectral integral at each basis function in the first subphase, V-list translations are computed in the second subphase, and W-list translation contributions to the local expansions are computed in the third subphase. We note that one important shortcoming of the present parallel M2L imple- mentation is that all translation data for duplicate nodes need to be communicated to the resident process, i.e., the designated receiving process for each duplicate node, which then applies them one by one – this can potentially create performance bottlenecks for high level nodes in deep trees. Finally, the L2L phase is performed by essentially mimicking M2M in reverse order. As an artifact of the limited parallelization of duplicate nodes mentioned above, L2L phase starts with broadcasting the local expansions of duplicate nodes from the resident process to the duplicating processes. After this initial communication, anterpolation and disaggregation of the duplicate nodes above the spherical-to-uniform transition level are performed in distributed fashion. Once all parallel anterpolations are completed, all processes can independently perform the sequential L2L operations for their internal nodes. 3.5 Numerical Results 3.5.1 Accuracy analysis In this section, we demonstrate the controllable accuracy of each interpolation/anterpolation method discussed in this paper, namely spherical, uniform, and hybrid. We also discuss trade-offs between these methods in terms of memory and run time. The examples in this subsection were run using a single core on a desktop computer with an Intel Xeon E5-2630 CPU with clock speed of 2.3 GHz and 64 GB RAM. All computations use double precision arithmetic, and FFTW 3.3.5 is used for performing FFTs in the interpolation and anterpolation stages. 3.5.1.1 Hybrid sampling MLFMA error control To illustrate the accuracy of the proposed hybrid sampling and interpolation/anterpolation method, we consider a uniformly random distribution of 256,000 randomly-oriented dipoles of unit strength 52 Figure 3.3: Error convergence (left) and matvec timings (right) vs. oversampling parameter χ for each sampling/interpolation method in the 8λ cube geometry. within an 8λ × 8λ × 8λ cubical domain. The box size is chosen to be 0.25λ, resulting in a 6-level uniform octree. Neither ACE nor interpolation of the MLFMA translation operator is used in this experiment. For the upper levels of the tree, a cap was enforced on the bandwidth to prevent numerical breakdown of the translation operator. Using three buffer boxes, we study the error convergence of the far-field contributions for each of the spherical, uniform, and hybrid methods outlined in the previous section. For the hybrid scheme, the transition was chosen so that only the bottom two levels use the spherical scheme. This data is shown in Fig. 3.3a. As the oversampling parameter χ is increased, each method converges in a nearly identical manner, demonstrating that the hybrid method introduces no error. Fig. 3.3b shows the time taken per matvec for each method. The hybrid scheme only adds a few seconds to the baseline time taken by the purely spherical scheme, while the uniform-only scheme takes roughly twice as long on average, as expected. For contrast, direct evaluation of the far-field matvec takes 5128.029 s on the same machine. 
3.5.1.2 Trade-offs for hybrid sampling

We have already seen the effect of purely uniform sampling on runtime due to the doubled sampling rate throughout the tree. The results for the hybrid sampling scheme suggest that the extra cost of fully-uniform sampling comes principally from performing M2L operations while oversampling at the finest levels of the tree where most boxes exist, as expected. Conversely, the optimal sampling rate for the spherical scheme gives the best run-time, but at the expense of precomputing and storing samples of associated Legendre functions P^m_n to perform interpolation and anterpolation via the spherical harmonics transform.

Figure 3.4: Error convergence vs. oversampling parameter χ for (a) uniform tree ACE-MLFMA for 64,000 dipoles in a flattened box, and (b) non-uniform tree cone-sphere. The number next to each data point indicates the order P of ACE expansions used. Note that the example in (a) uses three buffer boxes for the far-field, while that in (b) uses only one.

To study the trade-offs, we examine the case of 256,000 dipoles distributed on a 256λ × 256λ square in the x–y plane. Using leaf-level boxes of diameter λ/4, the uniform tree is 11 levels deep, and a one-buffer box rule is used. The oversampling rate is fixed at χ = 1.0. We examine memory consumption and matvec time for different values of the finest spherical sampling level Lsu. Table 3.2 shows the memory required for storing the P^m_n samples, the multipole and local expansions, and the M2L operators, as well as the serial matvec time vs. Lsu. While this data is collected using a serial kernel, we must emphasize that in a parallel setting the multipole and local data, which consume the majority of the overall memory, are distributed across processes in a largely non-redundant fashion, while the P^m_n storage is duplicated on every process. M2L operator data must also be stored local to each process, though it is not identical on each process. In contemporary HPC environments, the maximum amount of RAM per core is typically around 4 GB, underscoring the importance of finding an acceptable balance between memory consumption and runtime to make most efficient use of computational resources.

Table 3.2: Memory consumption and timings for different hybrid sampling transitions

  Lsu   Uni. levels   Sph. levels   P^m_n (GB)   M2L (GB)   Multipole (GB)   Matvec time (s)
  3     0             9             6.0          13.1       0.8              730
  4     1             8             2.0          14.5       1.4              957
  5     2             7             0.3          15.9       1.5              1364
  6     3             6             0.04         17.4       1.6              1377
  7     4             5             0.005        18.8       1.6              1589

3.5.1.3 Uniform-tree wideband MLFMA error control

We now examine the error in the hybrid-sampling MLFMA with ACE, using a uniform tree. We consider the case of 64,000 dipoles distributed non-uniformly within an 8λ × 8λ × 1/16λ box. The point locations were generated using (x, y, z) = (8r1^1.5, 8r2^1.5, r3/16), where r1, r2, r3 are random numbers on [0, 1]. The smallest box was chosen as λ/16, yielding an 8-level tree with 2 levels of ACE, 2 levels of spherical MLFMA, and 2 levels of uniform MLFMA. Fig. 3.4a shows error convergence of the far-field as both χ and P are increased, demonstrating fine-grained control over accuracy in both the low- and mid-frequency regimes.
3.5.1.4 Non-uniform-tree wideband MLFMA error control

We now examine the error convergence for the non-uniform wideband ACE-MLFMA applied to a multiscale object and compare with the uniform-tree algorithm. Consider a "cone-sphere" surface geometry comprising a long cone terminated on its wide end by a hemispherical surface. The cone portion is 36.33λ long, and the spherical portion has a radius of 4.49λ. The mesh contains 47,367 triangles, and we placed 6 randomly oriented dipoles within each triangle, totaling 284,202 dipoles. The smallest box size is chosen to be λ/128, and the tree is merged using smax = 30, yielding a 14-level tree with 5 levels of ACE, 3 levels of spherical MLFMA, and 4 levels of uniform MLFMA. Table 3.3 summarizes some of the stark differences between the uniform and non-uniform trees.

Table 3.3: Comparison of uniform and non-uniform trees for the cone-sphere geometry

                 Uniform     Non-uniform
  Nodes          1058881     29562
  Leaf level     14          8-14
  Dipoles/box    1           12
  Inter. lists   1149.16 s   25.05 s

Notably, the pre-computation time required for computing interaction lists decreases by a factor of over 45 as a consequence of the 97% reduction in overall tree nodes. Fig. 3.4b shows convergence in the relative L2 error of the far-field using the non-uniform tree with a single buffer box rule.

3.5.2 Parallel CFIE solver evaluation

In this section, the kernel evaluation is wrapped in a parallel MoM solver for the CFIE (3.1), and we present results demonstrating its capabilities, all with α = 0.5. Parallel GMRES is used as the iterative solver. We note that our use of a four-tree mixed-potential MLFMA as opposed to a two- or three-tree dyadic MLFMA directly results in an increase in runtime.

3.5.2.1 Uniform trees

First, we consider the evaluation of the radar cross section (RCS) of two electrically-large conducting spheres of diameter 256λ and 512λ, discretized with 84,934,656 and 339,738,624 unknowns, respectively. A 7-point integration rule within each patch was used for both source and testing far-field integrals. In each case, the leaf-level box is chosen as 0.25λ, resulting in 11- and 12-level trees. The bottom six levels employ spherical sampling before transitioning to uniform sampling, and χ = 1.0. The incident plane wave is polarized along the x̂ direction and traveling in the +z direction. Parallel GMRES with tolerance 10−3 and a restart value of 30 is used to solve the resulting matrix system. Both systems converged within two outer iterations. For the 256λ case, the solve took 31 minutes using 2048 processes, requiring 27.8 seconds per matvec. The 512λ solve on 2048 processes took 8.5 hours, requiring 5 minutes 44 seconds per matvec; in this case we had to resort to Lagrange interpolation of translation operators to alleviate memory bottlenecks associated with storing them directly at the uppermost levels. This resulted in approximately 76% of the matvec time being spent in V-list evaluation as opposed to 36% in the 256λ case, hence the increase in runtime being larger than the increase in unknown count. The calculated RCS of the spheres are shown in Figs. 3.5 and 3.6 along with the analytical Mie series solution, with which the agreement is very good.

Figure 3.5: Comparison of bistatic RCS calculated by parallel MLFMA vs. analytical for a sphere of diameter 256λ within ±5° of the main lobe.

3.5.2.2 Non-uniform trees

Finally, we consider evaluation of RCS from geometries with non-uniform distributions using non-uniform trees.
In these cases, the far-field integration rule is set to one point per patch to ease memory demands for storing X- and W-list operators. First, we demonstrate scattering from a small 20λ diameter sphere with a λ/10 discretization everywhere except for the θ = 0 pole, where a much finer discretization is used, using 2,053,947 unknowns overall. Using a minimum box size of 1.95 × 10−3 λ and smax = 35 results in a 15-level tree with 12 unknowns per leaf box on average. The simulation parameters for MLFMA and ACE were χ = 1.5 and P = 4. Using 28 processes and a GMRES tolerance of 10−3, the solution took 38 minutes at about 15 s per matvec. The calculated RCS is shown in Fig. 3.7 and compared with the Mie series solution, with which it shows excellent agreement.

Figure 3.6: Comparison of bistatic RCS calculated by parallel MLFMA vs. analytical for a sphere of diameter 512λ within ±2° of the main lobe.

Figure 3.7: Comparison of bistatic RCS calculated by parallel MLFMA vs. analytical for the 20λ sphere with densely discretized pole.

Next, we consider electrically-large objects with sharp corners or other features with regions of high density of unknowns. Our first example is an arrowhead-shaped geometry measuring 470λ × 154λ × 79λ, discretized using 77,257,728 unknowns. A 14-level non-uniform tree is used with minimum leaf box size of 0.0574λ and smax = 35, resulting in 10 unknowns per leaf, on average. The RCS is shown in Fig. 3.8, along with an illustration of the geometry. The incident electric field is given by Ei = x̂ e^{jkz}. Solution with tolerance 5 × 10−3 took 19.3 minutes on 2048 processes, at 23.6 s per matvec. A histogram of peak memory utilization on each process is shown in Fig. 3.9.

Figure 3.8: RCS calculated with parallel MLFMA for 470λ arrowhead geometry.

Figure 3.9: Memory utilization histograms for (a) 470λ arrowhead and (b) 755λ airplane geometry.

Our last example concerns scattering from the 755λ airplane geometry depicted inset in Fig. 3.10. A λ/10 discretization of the surface yields 285,425,664 unknowns. The incident field is given as Ei = x̂ e^{−jkz}. Using a minimum leaf size of 0.0231λ, the non-uniform tree spans 16 levels and is merged with smax = 35, increasing the average number of unknowns per box from 1 to 18 and distributing leaves across the bottom four tree levels. ACE (P = 3) is used for boxes smaller than λ/4, and uniform MLFMA is used at the three uppermost levels. The densest box at the finest level contains 40 unknowns. Using GMRES with a tolerance of 5 × 10−3, the solution on 2048 processes took 3.17 hours, using 1 minute 8 seconds per matvec. Memory utilization for this computation is depicted in Fig. 3.9. The RCS in the φ = 0°, 90° cuts is shown in Fig. 3.10.

Figure 3.10: RCS calculated with parallel MLFMA for 755λ airplane geometry.
3.6 Conclusion

We have presented an extremely wide-band parallel MLFMA with fine-grained control over the error in field evaluation and methods for reducing both storage and computational cost while sacrificing nothing in terms of accuracy. We introduced rigorous operators for computing interactions at any level on a non-uniform octree which is adapted in parallel to fit the geometry. We also introduced a rigorous transition from spherical harmonics-based to Fourier-based inter/anterpolation of multipole expansions to optimize storage and solution time. An array of numerical examples demonstrates the accuracy and efficiency of the algorithm, and several scattering examples demonstrated the ability of the solver to accurately solve problems on large, complicated objects with non-uniform spatial distributions of unknowns.

CHAPTER 4

PARALLELIZATION TECHNIQUES FOR THE NON-UNIFORM WIDEBAND MLFMA

4.1 Introduction

In this chapter, we detail an efficient parallelization strategy for the non-uniform wideband MLFMA presented in the previous chapter. The algorithm is based on that presented in [48], which we will show suffers from several serious computational bottlenecks when applied to progressively larger problems. This work clearly identifies the bottlenecks and their causes, and offers algorithmic remedies for overcoming them.

4.2 Problem Statement

Consider a collection of N point sources un ∈ C located at points rn ∈ R3, n = 1, . . . , N. The Helmholtz potential Φ(r) at some observation point r is then given by

    Φ(r) = Σ_{n=1}^{N} g(r − rn) un,                              (4.1)

where the Green's function g for the Helmholtz equation is given by

    g(r) = e^{−ik|r|} / (4π|r|),                                   (4.2)

where k = 2π/λ denotes the wavenumber in rad/m and λ denotes the wavelength in meters. Sums of the form (4.1) often arise in the discretization of integral equations in electromagnetics and acoustics. Typically, the evaluation of Φ(rm), m = 1, . . . , N is required, taking the form of a classical N-body problem with kernel g. As such, the O(N^2) cost of direct (exact) evaluation of these sums becomes prohibitively expensive for large N. Instead, one often employs the multilevel fast multipole algorithm (MLFMA) [66] to approximate these quantities in O(N log N) time with the ability to control the accuracy of the approximation to arbitrary precision.

Similar in structure to the fast multipole method for the Laplace potential [34], the MLFMA employs a hierarchical tree data structure for rapidly computing interactions between clusters of particles via alternate representations called multipole and local expansions. First, the computational domain is recursively subdivided into cubes (boxes) and each particle is mapped to the box in which it resides. The hierarchy of this recursive subdivision procedure is modeled as a tree (see Fig. 4.1), specifically an octree in which each subdivision results in the creation of eight boxes of half the diameter. The evaluation of (4.1) using the MLFMA is broken up into several stages of computation:

1. Particle-to-particle (P2P): For each pair of spatially adjacent leaf (childless) boxes, evaluate interactions between all contained particles.

2. Charge-to-multipole (C2M): Multipole expansions are created for each leaf box from their constituent particles.

3. Multipole-to-multipole (M2M): Multipole expansions of non-leaf boxes are created by combining multipole expansions of their children.

4.
Multipole-to-local (M2L): The local expansion for each box is created by converting mul- tipole expansions of boxes in the far field (defined momentarily) into local expansions and summing the results. The far field of a box b is defined as the set of children of b’s parent’s adjacent neighbors which are not adjacent to b. 5. Local-to-local (L2L): The local expansion of each box b is updated with information from its ancestors in a top-down manner so that the influence of all particles in non-adjacent boxes is present in the final local expansion of b. 6. Local-to-observer (L2O): For each box, the local expansion is used to calculate the potential at the location of each particle in that box. Large values of N are most often encountered in problems that are either densely discretized to capture fine geometric features or electrically-large, i.e. the principal dimension of the geometry is hundreds or thousands of wavelengths long. Multiscale problems, i.e. problems featuring both of these characteristics, are very difficult to solve and require a robust computational approach for 63 the efficient evaluation of (4.1). The key to such an approach is to understand the behavior of the kernel g over a hierarchy of length scales, and to then design seamless transitions between methods exploiting this behavior in each regime. Realistic problems in modern computer-aided design and analysis are indeed multiscale, requiring discretizations with N in the millions to potentially billions. In such large problems, operations on the octree at the coarsest levels can become very expensive, underscoring the need for an efficient parallel algorithm. 4.2.1 Summary of non-uniform wideband MLFMA In this section, we briefly summarize at a high level the non-uniform wideband MLFMA presented in the previous chapter. To begin, we assume that there exists an octree structure with L levels enveloping the computational domain. Level 1 refers to the root of the octree, i.e. the box containing the entire computational domain, and level L is the level of finest refinement. The box diameter at level l is denoted D(l), and D(L) is specified (in units of wavelengths) as an input parameter to the algorithm. The root box diameter is 2L−1D(L). The octree is non-uniform, i.e. leaf (childless) nodes may exist at any level of the tree, and each leaf contains approximately the same number of particles per box up to a maximum of smax. 4.2.1.1 Low-frequency regime At small length scales relative to a wavelength, e.g. k|r| < π/2 or |r| < λ/4, the Helmholtz kernel g does not exhibit oscillatory behavior and is instead quite smooth. Away from the singularity, g is well-approximated over electrically short distances by a low-order polynomial expansion. The accelerated Cartesian expansion (ACE) method exploits this fact to evaluate the potential in these regimes in O(N) time. Within a domain Ωo sufficiently far from a cluster of sources Ωs, the potential Φ due to the sources within Ωs is represented using an order-P Taylor series about the center of Ωo. As length scales increase, the ability to accurately compute the potential via ACE diminishes and one must transition to MLFMA, as defined in [73]. 64 4.2.1.2 High-frequency regime For boxes of diameter larger than λ/4 the usual MLFMA is employed. Multipole and local expansions are functions defined on the unit sphere (θ, φ) ∈ [0, π]×[0, 2π] and sampled at locations (θi, φ j), i = 1, . . . , Nθ, j = 1, . . . , Nφ. 
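Before going further into the high-frequency machinery, it is worth keeping in view that the quantity the MLFMA approximates is simply the dense sum (4.1) with the kernel (4.2). The Python/NumPy sketch below evaluates it directly at O(N^2) cost; it is an illustration only (the solver described in this work is implemented in Fortran with MPI), and the random point cloud in the example is hypothetical.

import numpy as np

def helmholtz_potential_direct(r_obs, r_src, u, wavelength):
    """Direct O(N^2) evaluation of the sum (4.1) with the Green's function (4.2)."""
    k = 2.0 * np.pi / wavelength
    diff = r_obs[:, None, :] - r_src[None, :, :]            # (M, N, 3) separations
    dist = np.linalg.norm(diff, axis=-1)                    # (M, N) distances
    with np.errstate(divide="ignore", invalid="ignore"):
        g = np.exp(-1j * k * dist) / (4.0 * np.pi * dist)
    g[dist == 0.0] = 0.0                                    # drop the singular self term
    return g @ u                                            # potential at each observer

rng = np.random.default_rng(0)
pts = rng.random((2000, 3)) * 8.0                           # sources filling an 8-wavelength cube
phi = helmholtz_potential_direct(pts, pts, np.ones(2000, dtype=complex), 1.0)
print(phi[:3])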
The sampling rates for a given box at some level l are governed by its bandwidth K(l), defined by

    K(l) = ⌊χ √3 k D(l)⌋ + 1,

where χ ≥ 1.0 is a parameter that to a large extent controls the accuracy of the MLFMA. The MLFMA requires interpolation and anterpolation (decimation) of the sampled multipole and local expansions in the M2M and L2L stages, respectively. Three approaches to this task dominate the MLFMA landscape: i) spherical harmonics transforms, ii) fast Fourier transforms, and iii) local interpolation. Methods i) and ii) are exact and thus preferable to method iii), which requires significant oversampling and a properly band-limited interpolation function to achieve relative accuracy beyond a digit or two. The downside to the exact methods is their higher computational complexity; spherical harmonics and Fourier transforms require O(K(l)^3) and O(K(l)^2 log K(l)) operations, whereas local interpolation is O(K(l)^2), albeit with a large constant proportional to the square of the one-dimensional oversampling rate.

The method based on spherical harmonics transforms, while computationally expensive as K → ∞, facilitates optimal (minimal) sampling rates on the unit sphere, or Nθ(l) = K(l) + 1, Nφ(l) = 2K(l) + 1, and is used for several levels immediately above the ACE regime. Samples in θ are located at Gauss-Legendre nodes, and samples in φ are uniformly spaced. While the optimal sampling rate helps to minimize costs associated with M2L, the O(K(l)^3) storage and complexity become untenable in the upper reaches of the tree. Instead, above some level lT, a switch is made to the Fourier-based method, which has no memory overhead [60, 39]. For these boxes, the samples are spaced uniformly in both variables, and Nθ(l) = 2K(l) + 1, Nφ(l) = 2K(l) + 2. The storage and M2L cost are doubled per node in this regime, so one should seek a value of lT with the best trade-off between memory and computational time.

4.2.1.3 Interactions in the adaptive algorithm

Using a non-uniform tree complicates matters somewhat. Using the classic FMM terminology [34], the list of box pairs involved in P2P calculations is called the U-list, and the list of box pairs involved in M2L calculations is called the V-list. In the adaptive algorithm, there are two more lists which tabulate interactions between boxes of different size. The X-list tabulates interactions wherein the source box is smaller than the observer box, and the W-list is its transpose. Interactions in the X-list are carried out by directly computing the potential due to particles in the source box at each particle location in the observer box using the source box's multipole expansion; W-list interactions involve computing local expansions directly from particles in the source box. Obviously, observer boxes in the X-list (sources in the W-list) must be leaf boxes.

4.3 Parallelization of the MLFMA

4.3.1 Tree construction and setup

Let Np denote the number of processes used in the computation. We begin by distributing the N particles evenly across all processes and determining the diameter D0 of the cube bounding the entire computational domain. The number of levels in the tree with finest box diameter D(L) is then calculated as the smallest integer L such that L ≥ log2(D0/D(L)) + 1. Once the number of levels is known, every particle is assigned a Morton key [76]. A parallel bucket sort on the Morton keys is then used to approximately evenly distribute particles across processes at the granularity of leaves.
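For readers unfamiliar with Morton keys, the short Python sketch below shows the essential operation (interleaving the bits of a box's integer coordinates so that sorting by key traverses the space-filling Z-curve) together with the splitter-based ownership lookup used in the partitioning described next. It is an illustration only; the bit width, the sample boxes, and the splitter choice are hypothetical.

import bisect

def morton_key(ix, iy, iz, bits=21):
    """Interleave the bits of three integer box coordinates into a single Morton key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b + 2)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b)
    return key

def owner(key, splitters):
    """Rank owning a leaf key: the number of splitters at or below it."""
    return bisect.bisect_right(splitters, key)

boxes = [(0, 0, 0), (0, 1, 1), (3, 2, 1), (7, 7, 7)]
keys = sorted(morton_key(*b) for b in boxes)
splitters = [keys[1], keys[3]]          # hypothetical splitters defining 3 contiguous segments
print(keys, [owner(k, splitters) for k in keys])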
This is done by selecting Np − 1 Morton keys, or “splitters”, which chop the Morton Z-curve into Np contiguous segments. Leaves are uniquely assigned to processes using these splitters. On each process, all ancestor keys of the leaves up to the root are determined, and the leaves and ancestors are stored in post-order, comprising the local subtree. The distributed tree resulting from this bottom-up partitioning scheme generally reflects the geometry’s inherent work distribution and provides the flexibility of modifying the partitioning if the workload is imbalanced. As will be discussed, this is often the case for multiscale geometries, 66 Figure 4.1: Illustration of distributed tree. wherein the density of particles (and consequently the workload) varies significantly throughout the computational domain. This stands in contrast to the top-down partitioning scheme employed in the hierarchical partitioning (HiP) versions of the MLFMA [28, 51]. While the workload is balanced by design at the uppermost levels of the tree, the distribution of nodes at the lower levels is inherited from this rigid distribution of large nodes. As a result, the HiP methods perform well for relatively uniform particle distributions, but they may produce highly skewed trees and work distributions when applied to multiscale geometries. 4.3.2 Plural nodes Once consequence of this partitioning strategy is the appearance of plural nodes, i.e. nodes which appear in the local subtrees of multiple processes. For a plural node b, the process with the highest rank is called the resident process and as such is designated the receiver for V-list operations, and consequently the root for broadcast operations prior to the L2L stage. The resident process’ copy of b is called a shared node. The processes which own copies of b but are not the resident process are called users of b (denoted by the set U(b)), and their copies are called duplicate nodes. Within the local subtree of any process, its duplicate nodes are guaranteed to appear consecutively at the end of the post-order sequence, while shared nodes may in general appear anywhere in the sequence (except within a duplicate node chain) but tend to appear toward the beginning. As discussed later, these properties are helpful for overlapping communication with computation. 67 To mitigate the duplication of such nodes and their data, some distribution of the work associated with the node should be undertaken. In the low-frequency/ACE regime, the workload associated with a single plural node is considered negligible; however, in the upper-mid and high frequency regimes the workload may be substantial. The adaptive direction partitioning scheme for plural nodes was introduced in [48] to balance communication load in V-list evaluation for plural nodes in the MLFMA regime. Multipole expansions of plural nodes are sent to their resident processes, where they are summed and equidistributed back to all owners. This method was shown via numerical experiments to improve the load balance of the potential evaluation. 4.3.3 Adaptive tree In the case of a spatially non-uniform distribution of particles it is computationally advantageous to prune regions of the tree where the number of particles per box is close to 1, as discussed in previous chapters. This process is called merging the tree. To begin, the number of particles contained within each non-leaf node in the local subtree is calculated in level-by-level fashion. 
At this stage, the plural nodes contain only the local particles counts, so a parallel sum is required to account for the particles contained within remote descendants. The tree is then merged from the bottom up, level by level, using a voting mechanism. All tree nodes start out with an indeterminate vote. A maximum number smax of particles per box is chosen. For each node at level L, the number of particles contained within the node is compared to smax. If this number exceeds smax, the node’s vote is changed to ‘veto’ the merge, and its siblings follow suit. This vetoing influence is called preclusion. Otherwise, the box’s vote is changed to ‘merge’ unless it has been precluded. As a result, all siblings must share the same vote. Here, the votes of duplicate nodes and siblings straddling process boundaries are synchronized across processes, with vetoes taking precedence. Moving to the next coarsest level L − 1, this process is repeated level by level until level 3 is reached, where merging cannot take place. The result of this process is a pruned version of the distributed tree where leaves (childless boxes) can exist at any level and contain approximately smax particles each. 68 4.3.4 Interaction lists Once the tree is fully formed, the U, V, W, X-lists [32, 15] may be determined. First, define the neighbors of a box b as those boxes at the same level as b which share a face, edge, or vertex with b. For each node b in the local subtree, two sets of Morton keys are formed: i) the near-field, consisting of all possible neighbors (33 = 27 in total) of b, and ii) the far-field, comprising all children of the neighbors of b’s parent which are not neighbors of b (63 − 33 = 189 in total). The different lists are computed as follows. 1. U-list. For each local node b, the keys in b’s near-field are sent along with the Morton key representing b to their resident processes, as determined by the splitters. For each received key r generated by b, a top-down search of the local subtree is performed. If r is found, the interaction (b, r) is added to the U-list. If the search dead-ends on an ancestor Ar of r, and Ar is a leaf, then (b, Ar) is added to the U-list, and a flag is placed on this interaction. Otherwise, the interaction is discarded. For each flagged interaction, the reciprocal is added by sending the keys Ar, b to the resident process for b and adding (Ar, b) to its U-list. A return-to-sender approach suffices here, since leaves cannot be plural nodes. 2. V, W, X-lists. In the same manner as the near-field, far-field keys are exchanged and sought using top-down search. Consider the potential interaction (b, r). If r is found, then (b, r) is added to the V-list of r’s resident process. If an ancestor Ar of r is found and it is both a leaf and not adjacent to b, then (b, Ar) is added to the X-list. Otherwise, the interaction is discarded. The W-list is determined by transposing the X-list, though a simple return-to- sender approach is not sufficient for this purpose. In W-list interactions, the node Ar serves as the source and b is the observer, so the interaction (Ar, b) must be sent to the resident process for b, which in general may not be the originator of the interaction (b, Ar). 69 4.3.5 Evaluation of the potential Here, we describe the non-uniform version of the wideband MLFMA presented in [48] and identify several performance issues in the original implementation which will be addressed in the following sections. 
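The near- and far-field key sets used in the interaction-list construction of Section 4.3.4 above can be generated in a few lines. The Python sketch below builds, for a box given by its integer coordinates at some level, the 27 possible same-level neighbors and the 189 far-field candidates (children of the parent's neighbors that are not adjacent to the box); boundary handling, Morton encoding, and the parallel key exchange are omitted, and the example coordinates are arbitrary.

from itertools import product

def neighbors(ix, iy, iz):
    """The 27 possible same-level neighbors of box (ix, iy, iz), the box itself included."""
    return [(ix + dx, iy + dy, iz + dz) for dx, dy, dz in product((-1, 0, 1), repeat=3)]

def far_field(ix, iy, iz):
    """Children of the parent's neighbors that are not adjacent to the box:
    the 189 candidate far-field keys exchanged when building the V/W/X-lists."""
    px, py, pz = ix // 2, iy // 2, iz // 2                  # parent box coordinates
    near = set(neighbors(ix, iy, iz))
    candidates = set()
    for nx, ny, nz in neighbors(px, py, pz):                # parent's neighborhood
        for cx, cy, cz in product((0, 1), repeat=3):        # its eight children
            child = (2 * nx + cx, 2 * ny + cy, 2 * nz + cz)
            if child not in near:
                candidates.add(child)
    return candidates

print(len(neighbors(4, 4, 4)), len(far_field(4, 4, 4)))     # 27 and 189 for an interior box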
The parallel evaluation of (4.1) using the non-uniform wideband MLFMA comprises three stages: i) an upward pass through the tree, ii) translation of multipole expansions into local expansions, and iii) a downward pass of the tree to compute the potential at each observer. First, we discuss the upward pass. On each process, the local subtree is traversed in post-order as multipole expansions are created via either C2M or M2M operations as described above. All operations are performed serially. When this is done, the multipole expansions of plural nodes in any local subtree are incomplete, as they are missing contributions from particles owned by remote processes. For plural ACE nodes, user processes send their multipole expansions to their resident processes, which sum the expansions and take full computational responsibility for their shared nodes until the L2L stage. For plural MLFMA nodes, resident processes similarly aggregate and sum multipole expansions from their users, but the results are distributed over all owning processes according to the adaptive direction partitioning scheme. This is referred to as the multipole update phase. Next, we consider the translation (M2L) stage. A round-robin communication scheme uses packets of fixed size to exchange segments of multipole expansions from source processes to observer processes, where this data is used to carry out V- and X-list interactions and is then overwritten with the next segment, until all interactions are exhausted. Interactions where both source and observer boxes exist in a single process’ local subtree are then carried out serially. Communication is overlapped with computation by making use of asynchronous MPI primitives. Next, particle strengths un are exchanged from sources to observers for the evaluation of W-list interactions. The final stage is the downward pass. Because resident processes act as a “magnet” for all V, W-list interactions (all X-list observer boxes are leaves and thus cannot be plural nodes), only the resident process of a plural node possesses its complete local expansion. The complete expansions 70 are first sent to all users in what is called the local update phase, and the remaining work of the downward pass is then completely serial. The local subtree is traversed top-down in reverse post- order, updating local expansions via the L2L operations discussed above. Finally, the complete local expansions at leaf boxes are used to compute the potential observed at each particle via the L2O operation. 4.3.6 Bottlenecks While the uniform version of this algorithm exhibited good performance for problems of modest size in [48], there are several performance bottlenecks which must be overcome if similar performance metrics are to be obtained for larger problems on larger clusters. First, the multipole and local update phases employ simple point-to-point communication, requiring communication proportional to Nuser Nθ Nφ, where Nuser is the number of users. Since each of Nuser, Nθ, Nφ at the uppermost levels of the tree grow with problem size and increasing process count, this cost can become prohibitive. As a remedy, we propose the use of MPI primitives which implement these communication patterns much more efficiently using tree-like algorithms. This topic is addressed in detail in the next section. The second bottleneck arises in the serial portions of the upward and downward passes. 
Especially for electrically-large particle distributions, plural nodes at the uppermost levels of the tree contain a significant amount of data. Not only is the redundant serial computation of M2M and L2L operations for these nodes unnecessary, it also results in significant load imbalances caused by the fact that resident processes for these nodes typically have an additional node at the same level. These additional expensive calculations impede the progress of processes depending on these results. We discuss in the next section a method for parallelizing the redundant M2M and L2L operations to largely eliminate this bottleneck.

Finally, multiscale geometries often result in an imbalanced workload distribution. The initial distribution aims only to apportion approximately the same number of particles on each process, not accounting for the density of the distribution or the actual work to be done; as a result, processes that own regions of dense discretization have fewer nodes, and thus less work to be done, than other processes. To alleviate this problem, an algorithm to approximately balance the workload is proposed in a later section.

4.4 Overcoming bottlenecks associated with plural nodes

4.4.1 Scheduling computations and communications for plural nodes

In order to make use of efficient MPI collective communication operations for plural nodes, it is necessary to create local communicators. It can be shown that at any level, the local subtree of any process has at most one duplicate node and at most one shared node. Therefore, on any process only two communicators are required per level. The use of blocking MPI calls for communication in the algorithm described in the following section, however, requires a consistent execution schedule that is free of deadlocks. It suffices to determine such a schedule for each level independently. An algorithm for scheduling a single level is presented in Algorithm 4, with an illustrative example in Table 4.1. We begin by assigning shared nodes into communication slot 1 and duplicate nodes into communication slot 2. The communication slot of a node b is given by A(b). On process i, the shared node is denoted si and the duplicate node di. From the initial configuration, we proceed backward from the end of the process list and, for each process in sequence, impose the communication slot of the current shared node upon its users, swapping slots where necessary. By the definition of resident process, each such imposition only affects processes of lower rank than the present process, so at step i the schedule is finalized for all processes of rank j ≥ i. Once the schedule is determined for each level, one MPI communicator is created for each communication slot using MPI_Comm_split for a total of 2L communicators, well beneath the limits imposed by standard MPI implementations.

Algorithm 4: SchedulePluralNodes: Scheduling algorithm for communication and computation involving plural nodes
 1: for i ← Np − 1, . . . , 0 do
 2:   if i == rank then
 3:     for k ∈ U(si) do
 4:       Send(A(si), k)
 5:     end for
 6:   else if rank ∈ U(si) then
 7:     Recv(a, i)
 8:     A(d_rank) ← a
 9:     A(s_rank) ← (a mod 2) + 1
10:   end if
11: end for

Table 4.1: Tabular illustration of Algorithm 4. Each row from top to bottom represents a step. Tildes denote the shared nodes on their resident processors. Boldface denotes that a node has been scheduled.

  Step   P0       P1       P2       P3       P4       P5       P6
  1      (·, a)   (ã, b)   (·, b)   (b̃, c)   (c̃, d)   (·, d)   (d̃, ·)
  2      (·, a)   (ã, b)   (·, b)   (b̃, c)   (d, c̃)   (d, ·)   (d̃, ·)
  3      (·, a)   (ã, b)   (·, b)   (b̃, c)   (d, c̃)   (d, ·)   (d̃, ·)
  4      (·, a)   (b, ã)   (b, ·)   (b̃, c)   (d, c̃)   (d, ·)   (d̃, ·)
  5      (·, a)   (b, ã)   (b, ·)   (b̃, c)   (d, c̃)   (d, ·)   (d̃, ·)
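Once Algorithm 4 has fixed the slot schedule, the communicator creation amounts to one MPI_Comm_split call per level and slot. The mpi4py sketch below illustrates only this step, not the Fortran implementation used in this work; the schedule data layout (schedule[level][slot] → node identifier) and the example values are hypothetical.

from mpi4py import MPI

def build_slot_communicators(comm, schedule, n_levels):
    """Every rank performs the same 2L Split calls in the same order; ranks that hold
    no plural node in a given slot pass MPI.UNDEFINED and receive MPI.COMM_NULL."""
    comms = {}
    for level in range(n_levels):
        for slot in (1, 2):
            node = schedule.get(level, {}).get(slot)
            color = node if node is not None else MPI.UNDEFINED
            comms[(level, slot)] = comm.Split(color=color, key=comm.Get_rank())
    return comms

comm = MPI.COMM_WORLD
schedule = {2: {1: 7}}        # hypothetical: this rank shares plural node 7 at level 2 in slot 1
sub_comms = build_slot_communicators(comm, schedule, n_levels=4)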
4.4.2 Parallelization of Fourier-based interpolation/anterpolation operators

While the adaptive direction partitioning scheme helps to balance the load within the V-, W-, and X-list evaluations, interpolation and anterpolation operations in the M2M and L2L stages for plural nodes present perhaps an even greater computational bottleneck at scale. For a fixed problem size, the number of plural nodes is proportional to the number of processes used in the computation. Attendant to this increase is an increase in duplicated work, particularly at the uppermost levels of the tree where i) the amount of work per node is the highest and ii) nodes are duplicated on the greatest number of processes. Consider the following. If a node b at level ℓ is a plural node which exists on nb processes, then its parent P(b) and all its ancestors up to the root must also be plural nodes existing on n ≥ nb processes. It follows, then, that the serial operation P(b) ← P(b) + SI_b is performed independently on each process using its own part b_i, where b = b_1 + · · · + b_{nb}. As a consequence, the runtime for M2M and L2L is bounded from below by the time required to perform a serial interpolation or anterpolation operation between the two coarsest levels of the tree, independent of the number of processes used. Obviously, this represents an enormous bottleneck to scalability.

To break it, we parallelize plural-to-plural Fourier interpolation and anterpolation operations over the processes which possess a copy of the child. In large-scale computations, the size of the nodes in the high-frequency regime dominates the rest, so we allow duplicate work to be done in the mid-frequency regime as it does not make an appreciable difference in comparison. During the upward pass, plural-to-plural M2M operations are deferred until after the rest of the M2M operations are complete. The parallel M2M operations are then performed level by level, beginning with the finest level and ascending the tree, according to the schedule outlined in the previous subsection. At the outset of the L2L stage, the parallel plural-to-plural L2L operations are carried out according to the same schedule from top to bottom, after which the serial L2L work is executed.

Multipole and local expansions are arranged in the form of a 2-D array with the most rapidly varying (row) index representing samples in φ and the other representing samples in θ. In the M2M stage, partial multipole expansions of plural nodes are evenly distributed across processes at the granularity of columns via a reduce-scatter operation. The parallel interpolation operation illustrated in Fig. 4.2 proceeds as follows: first, Fourier interpolation of the column (φ-varying) data is performed locally on each process via FFT, zero-padding, and inverse FFT (4.2b); this data is then folded according to the spherical symmetry condition described in [60] and transposed across the processes for contiguous access in the following step (4.2c); next, Fourier interpolation is performed on each column (θ-varying) of the folded and transposed data (4.2d). Finally, the data is unfolded back to the original unit sphere, shifted to the parent box, and distributed according to the adaptive direction partitioning scheme.
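The one-dimensional building block of the parallel interpolation just described is ordinary trigonometric (Fourier) interpolation: transform the samples, zero-pad the spectrum, and inverse transform. The Python sketch below shows only this serial φ-direction step; the folding, transposition, and distribution over processes are omitted, and the band-limited test signal at the end is an arbitrary example.

import numpy as np

def fourier_interpolate(samples, n_out):
    """Exact trigonometric interpolation of a band-limited periodic signal:
    FFT, zero-pad the spectrum, inverse FFT (the phi-direction step of Fig. 4.2)."""
    n_in = samples.size
    spec = np.fft.fft(samples)
    half = n_in // 2
    padded = np.zeros(n_out, dtype=complex)
    padded[:half + 1] = spec[:half + 1]          # zero and positive frequencies
    neg = n_in - half - 1
    if neg:
        padded[-neg:] = spec[half + 1:]          # negative frequencies
    if n_in % 2 == 0:                            # split the Nyquist bin symmetrically
        padded[half] *= 0.5
        padded[-half] += padded[half]
    return np.fft.ifft(padded) * (n_out / n_in)

phi_coarse = np.linspace(0.0, 2.0 * np.pi, 9, endpoint=False)
phi_fine = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
f = lambda t: np.cos(3.0 * t) + np.sin(t)
print(np.abs(fourier_interpolate(f(phi_coarse), 16) - f(phi_fine)).max())   # ~1e-15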
The L2L stage proceeds in largely the same fashion, with the exception that the initial data distribution is achieved using a broadcast operation so that L2L operations from plural nodes to interior nodes can be performed serially. 4.5 Load balancing 4.5.1 MLFMA vs. static FMM Load balancing for the MLFMA is significantly more complicated than for static FMMs. Broadly speaking, load balancing the FMM for the Laplace kernel is simple in that work is concentrated 74 Figure 4.2: Graphical illustration of the transposition and folding operation within the parallel FFT interpolation of radiation pattern a to A for Nθ = 3, Nφ = 4, Mθ = 5, Mφ = 6. at the leaves as work per level falls off rapidly above the leaf level. Sufficient load balance can be achieved by considering the work model for the leaves only. Obviously, the problem is complicated by non-uniform distributions of particles, but the fact remains that the work per node is constant no matter the level. In the adaptive wideband MLFMA, the work associated with each node depends on the level and regime. In the low-frequency regime, the work per node is constant irrespective of level, as in the static FMM. Here, V-list and M2M/L2L operations dominate the cost. In the mid- and high-frequency regimes, the work for a node at level l is proportional to (K(l))3 or (K(l))2 log K(l), respectively. In these regimes, M2M and L2L dominate the complexity estimate. Consequently, and especially for surface distributions, the work per level is constant and must therefore be accounted for in the work profile used for load balancing. 4.5.2 Parallel load balancing strategy for adaptive wideband MLFMA Whether or not the adaptive tree is employed, the basic strategy of equipartitioning the particles over processes fails to give good load balance for spatially non-uniform distributions. A more 75 NφNθa1,1a2,1a3,1a4,1a1,2a2,2a3,2a4,2a1,3a2,3a3,3a4,3MφNθ˜a1,1˜a2,1˜a3,1˜a4,1˜a5,1˜a6,1˜a1,2˜a2,2˜a3,2˜a4,2˜a5,2˜a6,2˜a1,3˜a2,3˜a3,3˜a4,3˜a5,3˜a6,32Nθ−1Mφ/2˜a1,1˜a1,2˜a1,3˜a6,1˜a5,1˜a6,2˜a5,2˜a6,3˜a5,3˜a4,3˜a3,3˜a2,3˜a4,2˜a3,2˜a2,22Mθ−1Mφ/2A1,1A1,2A1,3A2,1A2,2A2,3A3,1A3,2A3,3A4,1A4,2A4,3A5,1A5,2A5,3A6,1A6,2A6,3A7,1A7,2A7,3A8,1A8,2A8,3A9,1A9,2A9,3Step1:IφStep2:Fold(comm.)Step3:Iθ(a)(b)(c)(d) effective re-partitioning of the tree to balance the computational load is then in order. If a new, more optimal set of splitters is determined, the original partitioning algorithm can re-partition the tree using these splitters. The splitters are determined as follows. A cost model is established based on timings of dummy M2M, L2L, and U, V, W, X-list operations. For each node in the tree, each of these operations is counted, multiplied by the appropriate cost, and summed to obtain an intrinsic node cost. Next, the costs for plural nodes are summed in parallel to obtain a consistent account across resident and user processes. Then, for each node b, the number of leaf nodes contained within the subtree rooted at b is determined and synchronized in the same fashion as the particle count in the merging algorithm described above. This is denoted by Nl(b). Now, the cost per leaf is determined. First, define the partial percolated cost (PPC) of a node as a bucket for the cost of its ancestors scaled by the number of leaves beneath them, and initialize this to zero. Starting at the root node and progressing downward level by level, for each node b, the intrinsic cost for b is divided by Nl(b) and added to the PPC of each of its children, along with b’s PPC. 
This process goes on until the leaf level, where the cost for each leaf is computed as the sum of its intrinsic and partial percolated costs. This procedure is perhaps best explained via illustration, as in the exemplary Fig. 4.3. The purpose of the PPC scheme is to evenly distribute the cost of upper-level nodes to children in a manner that i) is not influenced by the initial partitioning, and ii) does not overwhelm the costs of lower-level nodes. Lastly, a parallel prefix sum of the leaf costs in the Morton sequence is computed, and the total leaf-level cost C is broadcast from the highest-rank process. Denoting the number of processes used by Np, splitters are determined to be the Np − 1 leaf keys closest to each of mC/Np, m = 1, . . . , Np − 1. All splitter keys are then shifted back to their leftmost descendants at level L for use in re-partitioning.

Figure 4.3: Illustration of the cost-percolation algorithm. Each pair of boxes represents a node in the tree; the number in the left box is the intrinsic cost, while the number in the right box is the partial percolated cost of all ancestors.

The greatest limitation of the proposed algorithm in terms of load balancing ability is that the existence of plural nodes renders the true load balancing problem non-linear, as the associated costs are a function of the splitter locations. Ideally, this problem could be addressed via an iterative balancing and cost re-evaluation procedure, but in this context doing so would be prohibitively expensive. Because the interaction list communication graph depends on the tree partitioning, it would need to be re-computed after each iteration to calculate the load.
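A serial sketch may make the bookkeeping of the cost-percolation and splitter-selection procedure clearer. In the Python fragment below, each node's intrinsic cost divided by its leaf count is pushed into its children's partial percolated cost together with the node's own PPC, the resulting leaf costs are prefix-summed, and cut points are chosen near mC/Np. The Node class, the field names, and the small example tree are hypothetical, and the actual code performs these steps on a distributed tree with a parallel prefix sum.

import numpy as np

class Node:
    def __init__(self, cost, children=None):
        self.cost = cost                  # intrinsic (empirically weighted) cost
        self.children = children or []
        self.ppc = 0.0                    # partial percolated cost

def count_leaves(node):
    return 1 if not node.children else sum(count_leaves(c) for c in node.children)

def percolate(node):
    """Push (intrinsic cost / leaf count) plus the node's own PPC down to each child;
    at the leaves, return cost = intrinsic + PPC, in left-to-right (Morton) order."""
    if not node.children:
        return [node.cost + node.ppc]
    share = node.cost / count_leaves(node)
    leaf_costs = []
    for child in node.children:
        child.ppc = node.ppc + share
        leaf_costs += percolate(child)
    return leaf_costs

def choose_splitters(leaf_costs, n_proc):
    """Cut the leaf sequence where the running cost crosses m*C/n_proc."""
    cum = np.cumsum(leaf_costs)
    targets = cum[-1] * np.arange(1, n_proc) / n_proc
    return np.searchsorted(cum, targets)

root = Node(4.0, [Node(8.0, [Node(2.0), Node(6.0)]), Node(1.0, [Node(1.0)])])
costs = percolate(root)
print(costs, choose_splitters(costs, 2))   # leaf costs sum to the total intrinsic cost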
4.6 Results

In this section, we present the results of numerical experiments demonstrating the gains in performance resulting from the parallelization techniques described in the previous sections. All results presented here were obtained on a cluster comprising nodes with two sockets each, populated by Intel Xeon E5-2698 v3 ("Haswell") processors with a clock speed of 2.3 GHz. Each node has 32 cores and 128 GB of 2133 MHz DDR4 RAM. The code is implemented in Fortran 90 using only MPI parallelization and compiled with Intel compilers. The FFTW library is used for all FFTs.

We will first examine the performance gains in the M2M and L2L stages from parallelizing the Fourier interpolation/anterpolation operators for plural nodes. We first consider a sphere of diameter 320λ discretized using 241,920,000 dipoles on the surface. The leaf box is chosen as d0 = 0.2λ, yielding a 12-level tree. The transition level is chosen such that the finest five levels of computation use spherical sampling and the top five levels use uniform sampling. Timings for M2M and L2L for different numbers of processes Np up to 2048 are given in Table 4.2. As with the grid geometry considered below, performance gains using PFFT become increasingly evident with increasing Np, culminating in speedups of over 3× on 2048 processes. The scalability of the traversal operations thus improves dramatically. A more intimate perspective is offered by Figs. 4.4-4.5, which present per-process timings for M2M and L2L in both the original and PFFT-augmented algorithms. Unlike the grid geometry, a sphere does not conform directly to the octree structure, so imbalances in these operations are to be expected. This is evident from the timings shown in Fig. 4.5a), which shows timings of individual components of M2M and L2L for the original algorithm. Here, a large amount of time is spent waiting in M2M at an MPI_Waitall as other processes finish their portion of the serial interpolations. L2L also suffers from a load imbalance resulting from uneven distribution of high-level serial anterpolations. These effects are highlighted by the extreme imbalance of composite serial interpolation and anterpolation (int/ant) timings. Parallelization of these operations, as shown in Fig. 4.5b), significantly reduces both the computation and wait times. The serial interpolation and anterpolation are reduced to about one second overall in the worst case, and the overall distribution is much more balanced. The difference in the shapes of the M2M and L2L timings is due to the fact that M2M has a synchronization barrier at the end, while L2L does not.

Table 4.2: Comparison of original algorithm with PFFT for 320λ diameter sphere geometry

        Orig (s)      PFFT (s)      Speedups     Orig Eff. (%)   PFFT Eff. (%)
  Np    M2M    L2L    M2M    L2L    M2M   L2L    M2M    L2L      M2M    L2L
  32    78.6   62.6   73.9   71.9   1.06  0.87   100    100      100    100
  64    43.7   34.4   41.3   38.5   1.06  0.89   89.9   91.0     89.5   93.4
  128   28.9   22.0   26.2   20.0   1.10  1.10   68.0   71.1     70.5   89.9
  256   16.7   12.4   13.9   10.5   1.20  1.18   58.8   63.1     66.5   85.6
  512   12.5   9.16   9.08   6.87   1.38  1.33   39.4   42.7     50.9   65.4
  1024  18.6   14.7   6.38   4.78   2.92  3.08   13.2   13.3     36.2   47.0
  2048  12.7   9.22   3.80   2.97   3.34  3.10   9.67   10.6     30.4   37.8

We next consider a 512λ × 512λ grid of particles in the z = 0 plane with a grid spacing of λ/32, with 268,435,456 points. The box size is chosen to be 0.25λ, resulting in a 12-level tree with 10 levels of computation. For the hybrid implementation, we empirically set the bottom 6 levels of the tree to employ the spherical scheme, while the top 4 levels of computation employ the uniform scheme. The oversampling parameter χ is chosen to be 1. The regularity of the grid geometry helps to highlight load imbalances inherent to the original algorithm and to illustrate the benefits of the PFFT algorithm. A study of the timings for M2M and L2L for both the original and PFFT schemes is shown in Table 4.3. We observe bigger gains in performance relative to the original scheme as the number of processes Np increases. This is due to the increasing number of plural nodes to be parallelized. PFFT also facilitates significant improvements in the M2M and L2L parallel efficiency (strong scaling), defined as

    Eff.[Nq] (%) = (Np TNp) / (Nq TNq) × 100%,                     (4.3)

where Np, TNp are the reference process count and computation time, and Nq, TNq are the process count and computation time for which the efficiency is being calculated.

We now consider the case of a 1024λ × 1024λ grid lying in the z = 0 plane with a regular spacing between points of λ/32 in each dimension. The total number of particles in this geometry is 1,073,741,824. The leaf box size is chosen as λ/4, resulting in a 13-level tree (with 11 levels of computation). The finest six levels of the tree employ spherical interpolation/anterpolation and the coarsest five levels employ uniform interpolation/anterpolation. Figure 4.6 provides a comparison of tree traversal timings for 2048 processes both with and without the PFFT scheme. Specific timings for the run without the PFFT scheme are shown in Figure 4.7a), while those for the run with the PFFT scheme are given in Figure 4.7b).
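As a quick arithmetic check of (4.3) against Table 4.2: taking the 32-process PFFT M2M time as the reference, the efficiency at 2048 processes works out to the tabulated value. The two-line computation below is included only to make the definition concrete.

# Spot check of (4.3) against Table 4.2: reference 32 processes at 73.9 s (PFFT M2M),
# 2048 processes at 3.80 s gives the tabulated efficiency.
eff = (32 * 73.9) / (2048 * 3.80) * 100.0
print(round(eff, 1))    # 30.4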
As with the previous grid example, the apparently well-balanced load in M2M for the former case is shown to be misleading, as most processes spend approximately 23 seconds in M2M waiting for nodes with larger than average amounts of serial interpolations to finish their serial computations. The load imbalance is also clear in the timings for L2L, which lacks an explicit synchronization step at the end. The PFFT scheme, in contrast, results in much better load balance in both M2M and L2L stages, along with reduced run-times due to the parallelization of some previously serial computations. The time spent in M2M is reduced from 46.2 seconds for the original algorithm to 7.65 seconds with the PFFT scheme, a speedup of over 6×. Similarly, the maximum time spent in L2L decreases from 35.0 seconds to 5.44 seconds for a speedup of 4.6×.

Figure 4.4: Timing comparison of M2M and L2L operations for 320λ diameter sphere on 2048 cores with and without PFFT.

Figure 4.5: Timing breakdown for M2M/L2L with and without PFFT for the 320λ sphere. (a) Without PFFT; (b) with PFFT.

Figure 4.6: Timing comparison of M2M and L2L operations for 1024λ diameter grid on 2048 cores with and without PFFT.

Figure 4.7: Timing breakdown for M2M/L2L with and without PFFT for the 1024λ grid. (a) Without PFFT; (b) with PFFT.

Table 4.3: Comparison of PFFT with original algorithm for 512λ grid geometry

        Orig (s)      PFFT (s)      Speedups     Orig Eff. (%)   PFFT Eff. (%)
  Np    M2M    L2L    M2M    L2L    M2M   L2L    M2M    L2L      M2M    L2L
  32    45.4   38.0   44.5   37.4   1.02  1.02   100    100      100    100
  64    25.8   21.2   24.9   20.3   1.04  1.04   88.0   89.6     89.4   92.1
  128   18.5   14.3   12.9   10.5   1.43  1.36   61.4   66.4     86.2   89.0
  256   13.5   9.9    7.4    5.7    1.82  1.74   42.0   48.0     75.2   82.0
  512   12.1   8.5    4.4    3.4    2.75  2.5    23.5   27.9     63.2   68.8
  1024  11.4   7.7    3.0    2.3    3.80  3.35   12.4   15.4     46.4   50.8

Figure 4.8: Per-process timings for far-field matvec stages for the airplane geometry (a) with uniform vs. non-uniform tree and (b) with and without load balancing (LB) for non-uniform tree on 1024 processes.

Finally, we examine the effects of merging the tree for non-uniform distributions and load balancing. We consider a geometry containing 175,764,666 points on the surface of an aircraft which fits into a bounding box of dimensions 693.5λ × 200.2λ × 754.8λ. The minimum leaf box diameter was set to λ/32, resulting in a 16-level tree, and the densest box contains 56 points. The four coarsest levels of computation use uniform sampling, and the remaining seven MLFMA levels use spherical sampling. ACE is used for boxes of diameter smaller than λ/4. The non-uniform tree is merged with smax = 40, resulting in leaves containing 20 points on average distributed over the bottom five levels of the tree, and an almost twenty-fold reduction in leaf boxes. For accuracy on the order of 10−3, we set χ = 1.0, P = 3 [73].
This supposition is backed up by randomly selecting 82 Uniformfar-fieldNon-uniformfar-fieldUniformM2LNon-uniformV,W,X02004006008001,0002030405060ProcessrankTime(s)Far-fieldw/oLBFar-fieldwithLBV,W,Xw/oLBV,W,XwithLBM2M+L2Lw/oLBM2M+L2LwithLB02004006008001,00020406080ProcessrankTime(s) Figure 4.9: Parallel efficiency of the complete matvec for the 286M-point arrow geometry with reference to 256 processes. Colorization of geometry solely to illustrate depth. a point on each process to ensure good spatial distribution, computing the exact observed field, and comparing with the computed value; the average relative error at these observers is 3.98 × 10−3. With this setup, we examine the trade-offs for non-uniform trees vs. uniform trees and the beneficial effects of load balancing for parallel matvecs. Fig. 4.8a shows the time taken per process for the far-field matvec using 1024 processes using a uniform tree and a non-uniform tree. The time drops from 61 seconds on average for the uniform tree to 38 seconds for the non-uniform tree. Load balancing is employed in both instances. Much of the speedup is achieved in the translation stages. The flatness of portions of the M2L timings for the uniform tree are caused by synchronization barriers. Fig. 4.8b details the effectiveness of load balancing, showing per-process timings for the tree traversal (M2M+L2L) and translation (V, W, X−lists) both with and without load balancing with exactly the same parameters. Without load balancing, the translation stage alone takes ten seconds longer than the entire matvec with load balancing. Next, we consider a collection of 286,312,650 points distributed uniformly on a surface geometry with an arrow-like shape, yielding high variation in point density over the surface. The bounding box for the arrow measures 234.9λ × 77λ × 39.32λ, and we choose the minimum box size to be λ/64, resulting in a 15-level tree. As always, ACE is used for boxes smaller than λ/4. The three coarsest MLFMA levels of computation employ uniform sampling, while the rest employ spherical 83 2565121,0242,048020406080100234.9λ#processesParallelefficiencyη(%) sampling. The tree is merged using smax = 25, resulting in 9 points per leaf box on average. The parallel efficiency of the complete matvec (both near- and far-field together) is given in Fig. 4.9 with reference to 256 processes. Memory limitations prevented us from running the code under the same memory-per-core conditions beneath 256 processes. Up to 1,024 processes the matvec exhibits good scaling, but drops precipitously for 2,048 processes. This appears to be associated with synchronization steps within the parallel interpolation/anterpolations. 4.7 Conclusion In this work, we have proposed and demonstrated several remedies to fundamental bottlenecks in the parallel wideband MLFMA of [48] updated with the adaptive version of the same algorithm presented in [39]. We provided a parallel algorithmic prescription for constructing non-uniform trees and interaction lists and for load balancing these trees. Our methods enable significant improvements in runtime and scalability for problems up to 1024 wavelengths in diameter. 84 CHAPTER 5 CONCLUSION AND FUTURE DIRECTIONS The works presented in this thesis demonstrably advance the state of the art in parallelization of the ACE and MLFMA. These developments facilitate efficient simulations involving highly non-uniform distributions of particles. 
A novel parallel non-uniform ACE algorithm for evaluating linear operators acting on arbitrary non-oscillatory potentials was shown to scale up to 16,384 processes with little loss in performance, even for highly non-uniform surface distributions of billions of particles with up to 25 levels in the octree. The error convergence of this method was demonstrated for several different kernel functions. While good results were obtained, further investigation into reducing the cost of the translation operators (M2M, M2L, L2L) from O(P^6) should be pursued. These operators each possess a quasi-convolutional structure that may potentially be exploited to reduce the complexity to something like O(P^3 log P), as can be done for true convolutions using fast Fourier transforms. Solving this problem could significantly boost the appeal of the algorithm.

The bulk of this thesis deals with the non-uniform wideband MLFMA and its parallelization. The contributions presented accomplish several important goals: i) a key computational and memory bottleneck is broken, greatly increasing the maximum problem size; ii) a controllably accurate, provably convergent, adaptive form of the algorithm is introduced, improving the runtime for multiscale distributions; iii) an efficient parallel algorithm, from tree construction to potential evaluation, is detailed, breaking two other important computational bottlenecks; and iv) a novel load balancing mechanism for multiscale distributions is presented.

There are, however, still many open topics for research. First, investigating flexible direction partitioning schemes (i.e., not based on powers of 2 or 4 as in [28, 50]) and bandlimited local interpolation methods could facilitate the analysis of extremely electrically large geometries. To that same end, the development of error-controllable “windowed” translation operators in the high-frequency regime may significantly reduce both computation and communication for very large problems. Finally, extending the PEC algorithm to formulations for penetrable objects such as the JMCFIE [55] or PMCHWT [81] will require the development of a novel load balancing mechanism, as variations in subdomain size and material properties lead to severe inefficiencies when each subdomain is processed one at a time.

BIBLIOGRAPHY

[1] Emmanuel Agullo et al. “Task-based FMM for multicore architectures”. In: SIAM Journal on Scientific Computing 36.1 (2014), pp. C66–C93.
[2] Christopher R Anderson. “An implementation of the fast multipole method without multipoles”. In: SIAM Journal on Scientific and Statistical Computing 13.4 (1992), pp. 923–947.
[3] David L Andrews. “On the conveyance of angular momentum in electronic energy transfer”. In: Physical Chemistry Chemical Physics 12.27 (2010), pp. 7409–7417.
[4] Andrew W Appel. “An efficient program for many-body simulation”. In: SIAM Journal on Scientific and Statistical Computing 6.1 (1985), pp. 85–103.
[5] Jonatan Aronsson, Ian Jeffrey, and Vladimir Okhmatovski. “Generalization of the Barnes-Hut algorithm for the Helmholtz equation in three dimensions”. In: IEEE Antennas and Wireless Propagation Letters 8 (2009), pp. 425–428.
[6] AD Baczewski and B Shanker. “An O(n) method for the rapid analysis of periodic problems using accelerated Cartesian expansions (ACE)”. In: Antennas and Propagation Society International Symposium (APSURSI), 2010 IEEE. IEEE, 2010, pp. 1–4.
[7] Andrew D Baczewski, Daniel L Dault, and Balasubramaniam Shanker. “Accelerated Cartesian expansions for the rapid solution of periodic multiscale problems”. In: IEEE Transactions on Antennas and Propagation 60.9 (2012), pp. 4281–4290.
[8] Josh Barnes and Piet Hut. “A hierarchical O(N log N) force-calculation algorithm”. In: Nature 324.6096 (1986), pp. 446–449.
[9] Austin R Benson et al. “A parallel directional fast multipole method”. In: SIAM Journal on Scientific Computing 36.4 (2014), pp. C335–C352.
[10] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[11] John A Board et al. “Accelerated molecular dynamics simulation with the parallel fast multipole algorithm”. In: Chemical Physics Letters 198.1-2 (1992), pp. 89–94.
[12] Ignace Bogaert et al. “NSPWMLFMA: A low frequency stable formulation of the MLFMA in three dimensions”. In: Antennas and Propagation Society International Symposium, 2008. AP-S 2008. IEEE, 2008, pp. 1–4.
[13] Alexander H Boschitsch, Marcia O Fenley, and Wilma K Olson. “A fast adaptive multipole algorithm for calculating screened Coulomb (Yukawa) interactions”. In: Journal of Computational Physics 151.1 (1999), pp. 212–241.
[14] Martin D Buhmann. Radial Basis Functions: Theory and Implementations. Vol. 12. Cambridge University Press, 2003.
[15] J Carrier, Leslie Greengard, and Vladimir Rokhlin. “A fast adaptive multipole algorithm for particle simulations”. In: SIAM Journal on Scientific and Statistical Computing 9.4 (1988), pp. 669–686.
[16] Matt Challacombe, Chris White, and Martin Head-Gordon. “Periodic boundary conditions and the fast multipole method”. In: The Journal of Chemical Physics 107.23 (1997), pp. 10131–10140.
[17] YH Chen, WC Chew, and S Zeroug. “Fast multipole method as an efficient solver for 2D elastic wave surface integral equations”. In: Computational Mechanics 20.6 (1997), pp. 495–506.
[18] Hongwei Cheng, Leslie Greengard, and Vladimir Rokhlin. “A fast adaptive multipole algorithm in three dimensions”. In: Journal of Computational Physics 155.2 (1999), pp. 468–498.
[19] Hongwei Cheng et al. “A wideband fast multipole method for the Helmholtz equation in three dimensions”. In: Journal of Computational Physics 216.1 (2006), pp. 300–325.
[20] Weng Cho Chew et al. Fast and Efficient Algorithms in Computational Electromagnetics. Artech House, 2001.
[21] Ronald Coifman, Vladimir Rokhlin, and Stephen Wandzura. “The fast multipole method for the wave equation: A pedestrian prescription”. In: IEEE Antennas and Propagation Magazine 35.3 (1993), pp. 7–12.
[22] Eric Darve. “The fast multipole method I: Error analysis and asymptotic complexity”. In: SIAM Journal on Numerical Analysis 38.1 (2000), pp. 98–128.
[23] Daniel Dault and B Shanker. “A mixed potential MLFMA for higher order moment methods with application to the generalized method of moments”. In: IEEE Transactions on Antennas and Propagation 64.2 (2016), pp. 650–662.
[24] Walter Dehnen. “A hierarchical O(N) force calculation algorithm”. In: Journal of Computational Physics 179.1 (2002), pp. 27–42.
[25] Jack Dongarra and Francis Sullivan. “Guest editors’ introduction: The top 10 algorithms”. In: Computing in Science & Engineering 2.1 (2000), pp. 22–23.
[26] Tommi Dufva and Jukka Sarvas. “Broadband MLFMA with plane wave expansions and optimal memory demand”. In: IEEE Transactions on Antennas and Propagation 57.3 (2009), pp. 742–753.
[27] Michael A Epton and Benjamin Dembart. “Multipole translation theory for the three-dimensional Laplace and Helmholtz equations”. In: SIAM Journal on Scientific Computing 16.4 (1995), pp. 865–897.
[28] Özgür Ergül and Levent Gürel. “Efficient parallelization of the multilevel fast multipole algorithm for the solution of large-scale scattering problems”. In: IEEE Transactions on Antennas and Propagation 56.8 (2008), pp. 2335–2345.
[29] X-J Fan, Nhan Phan-Thien, and Rong Zheng. “Completed double layer boundary element method for periodic suspensions”. In: Zeitschrift für angewandte Mathematik und Physik ZAMP 49.2 (1998), pp. 167–193.
[30] William Fong and Eric Darve. “The black-box fast multipole method”. In: Journal of Computational Physics 228.23 (2009), pp. 8712–8725.
[31] Yuhong Fu and Gregory J Rodin. “Fast solution method for three-dimensional Stokesian many-particle problems”. In: International Journal for Numerical Methods in Biomedical Engineering 16.2 (2000), pp. 145–149.
[32] Leslie Greengard. The Rapid Evaluation of Potential Fields in Particle Systems. MIT Press, 1988.
[33] Leslie Greengard and William D Gropp. “A parallel version of the fast multipole method”. In: Computers & Mathematics with Applications 20.7 (1990), pp. 63–71.
[34] Leslie Greengard and Vladimir Rokhlin. “A fast algorithm for particle simulations”. In: Journal of Computational Physics 73.2 (1987), pp. 325–348.
[35] Leslie Greengard and Vladimir Rokhlin. “A new version of the fast multipole method for the Laplace equation in three dimensions”. In: Acta Numerica 6 (1997), pp. 229–269.
[36] Leslie Greengard and John Strain. “The fast Gauss transform”. In: SIAM Journal on Scientific and Statistical Computing 12.1 (1991), pp. 79–94.
[37] Leslie Greengard et al. “Accelerating fast multipole methods for the Helmholtz equation at low frequencies”. In: IEEE Computational Science and Engineering 5.3 (1998), pp. 32–38.
[38] Leslie F Greengard and Jingfang Huang. “A new version of the fast multipole method for screened Coulomb interactions in three dimensions”. In: Journal of Computational Physics 180.2 (2002), pp. 642–658.
[39] Stephen Hughey et al. “Parallel Wideband MLFMA for Analysis of Electrically Large, Non-Uniform, Multiscale Structures”. In: ().
[40] Rüdiger Jakob-Chien and Bradley K Alpert. “A fast spherical filter with uniform resolution”. In: Journal of Computational Physics 136.2 (1997), pp. 580–584.
[41] Jian-Ming Jin. Theory and Computation of Electromagnetic Fields. John Wiley & Sons, 2011.
[42] Ludvig af Klinteberg, Davoud Saffar Shamshirgar, and Anna-Karin Tornberg. “Fast Ewald summation for free-space Stokes potentials”. In: Research in the Mathematical Sciences 4.1 (2017), p. 1.
[43] Robert Krasny and Lei Wang. “Fast evaluation of multiquadric RBF sums by a Cartesian treecode”. In: SIAM Journal on Scientific Computing 33.5 (2011), pp. 2341–2355.
[44] Konstantin N Kudin and Gustavo E Scuseria. “A fast multipole algorithm for the efficient treatment of the Coulomb problem in electronic structure calculations of periodic systems with Gaussian orbitals”. In: Chemical Physics Letters 289.5 (1998), pp. 611–616.
[45] Ilya Lashuk et al. “A massively parallel adaptive fast multipole method on heterogeneous architectures”. In: Communications of the ACM 55.5 (2012), pp. 101–109.
[46] M Lu and E Michielssen. “A local filtering scheme for FMM/PWTD algorithms”. In: Antennas and Propagation Society International Symposium, 2004. Vol. 4. IEEE, 2004, pp. 4523–4526.
[47] Dhairya Malhotra and George Biros. “Algorithm 967: A Distributed-Memory Fast Multipole Method for Volume Potentials”. In: ACM Transactions on Mathematical Software (TOMS) 43.2 (2016), p. 17.
[48] Vikram Melapudi et al. “A scalable parallel wideband MLFMA for efficient electromagnetic simulations on large scale clusters”. In: IEEE Transactions on Antennas and Propagation 59.7 (2011), pp. 2565–2577.
[49] Bart Michiels. “Parallel fast multipole methods for the simulation of extremely large electromagnetic scattering problems”. PhD thesis. Ghent University, 2013.
[50] Bart Michiels et al. “Full-wave simulations of electromagnetic scattering problems with billions of unknowns”. In: IEEE Transactions on Antennas and Propagation 63.2 (2015), pp. 796–799.
[51] Bart Michiels et al. “Weak scalability analysis of the distributed-memory parallel MLFMA”. In: IEEE Transactions on Antennas and Propagation 61.11 (2013), pp. 5567–5574.
[52] Naoshi Nishimura. “Fast multipole accelerated boundary integral equation methods”. In: Applied Mechanics Reviews 55.4 (2002), pp. 299–324.
[53] Joris Peeters. “Efficient simulation of 3D electromagnetic scattering problems using boundary integral equations”. PhD thesis. Ghent University, 2010.
[54] Andrew F Peterson et al. Computational Methods for Electromagnetics. IEEE Press, 1998.
[55] Andrew J Poggio and Edmund K Miller. Integral Equation Solutions of Three-Dimensional Scattering Problems. MB Assoc., 1970.
[56] Douglas Potter, Joachim Stadel, and Romain Teyssier. “PKDGRAV3: Beyond trillion particle cosmological simulations for the next era of galaxy surveys”. In: Computational Astrophysics and Cosmology 4.1 (2017), p. 2.
[57] C Pozrikidis. “Computation of periodic Green’s functions of Stokes flow”. In: Journal of Engineering Mathematics 30.1-2 (1996), pp. 79–96.
[58] Sadasiva Rao, D Wilton, and Allen Glisson. “Electromagnetic scattering by surfaces of arbitrary shape”. In: IEEE Transactions on Antennas and Propagation 30.3 (1982), pp. 409–418.
[59] Vladimir Rokhlin. “Rapid solution of integral equations of classical potential theory”. In: Journal of Computational Physics 60.2 (1985), pp. 187–207.
[60] Jukka Sarvas. “Performing interpolation and anterpolation entirely by fast Fourier transform in the 3-D multilevel fast multipole algorithm”. In: SIAM Journal on Numerical Analysis 41.6 (2003), pp. 2180–2196.
[61] B Shanker and H Huang. “Accelerated Cartesian expansions–A fast method for computing of potentials of the form R^−ν for all real ν”. In: Journal of Computational Physics 226.1 (2007), pp. 732–753.
[62] Balasubramaniam Shanker et al. “Fast analysis of transient electromagnetic scattering phenomena using the multilevel plane wave time domain algorithm”. In: IEEE Transactions on Antennas and Propagation 51.3 (2003), pp. 628–641.
[63] Jaswinder Pal Singh et al. “A parallel adaptive fast multipole method”. In: Proceedings of the 1993 ACM/IEEE Conference on Supercomputing. ACM, 1993, pp. 54–65.
[64] Jaswinder Pal Singh et al. “Load balancing and data locality in adaptive hierarchical N-body methods: Barnes-Hut, fast multipole, and radiosity”. In: Journal of Parallel and Distributed Computing 27.2 (1995), pp. 118–141.
[65] Jiming Song and Weng Cho Chew. “Interpolation of translation matrix in MLFMA”. In: Microwave and Optical Technology Letters 30.2 (2001), pp. 109–114.
[66] Jiming Song, Cai-Cheng Lu, and Weng Cho Chew. “Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects”. In: IEEE Transactions on Antennas and Propagation 45.10 (1997), pp. 1488–1493.
[67] JM Song and Weng Cho Chew. “Multilevel fast-multipole algorithm for solving combined field integral equations of electromagnetic scattering”. In: Microwave and Optical Technology Letters 10.1 (1995), pp. 14–19.
[68] Hari Sundar, Rahul S Sampath, and George Biros. “Bottom-up construction and 2:1 balance refinement of linear octrees in parallel”. In: SIAM Journal on Scientific Computing 30.5 (2008), pp. 2675–2708.
[69] Manouchehr Takrimi, Özgür Ergül, and Vakur B Ertürk. “A novel broadband multilevel fast multipole algorithm with incomplete-leaf tree structures for multiscale electromagnetic problems”. In: IEEE Transactions on Antennas and Propagation 64.6 (2016), pp. 2445–2456.
[70] Sanjay Velamparambil, Weng Cho Chew, and Jiming Song. “10 million unknowns: Is it that big? [Computational electromagnetics]”. In: IEEE Antennas and Propagation Magazine 45.2 (2003), pp. 43–58.
[71] Melapudi Vikram and Balasubramaniam Shanker. “An incomplete review of fast multipole methods - from static to wideband - as applied to problems in computational electromagnetics”. In: Applied Computational Electromagnetics Society Journal 24.2 (2009), pp. 79–108.
[72] Melapudi Vikram and Balasubramaniam Shanker. “Fast evaluation of time domain fields in sub-wavelength source/observer distributions using accelerated Cartesian expansions (ACE)”. In: Journal of Computational Physics 227.2 (2007), pp. 1007–1023.
[73] Melapudi Vikram et al. “A novel wideband FMM for fast integral equation solution of multiscale problems in electromagnetics”. In: IEEE Transactions on Antennas and Propagation 57.7 (2009), pp. 2094–2104.
[74] Melapudi Vikram et al. “Accelerated Cartesian expansion (ACE) based framework for the rapid evaluation of diffusion, lossy wave, and Klein–Gordon potentials”. In: Journal of Computational Physics 229.24 (2010), pp. 9119–9134.
[75] Melapudi Vikram et al. “Parallel accelerated Cartesian expansions for particle dynamics simulations”. In: Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009, pp. 1–11.
[76] Michael S Warren and John K Salmon. “A parallel hashed oct-tree N-body algorithm”. In: Proceedings of the 1993 ACM/IEEE Conference on Supercomputing. ACM, 1993, pp. 12–21.
[77] Changjiang Yang et al. “Improved fast Gauss transform and efficient kernel density estimation”. In: IEEE, 2003, p. 464.
[78] Samuel YK Yee. “Studies on Fourier series on spheres”. In: Monthly Weather Review 108.5 (1980), pp. 676–678.
[79] Lexing Ying, George Biros, and Denis Zorin. “A kernel-independent adaptive fast multipole algorithm in two and three dimensions”. In: Journal of Computational Physics 196.2 (2004), pp. 591–626.
[80] Lexing Ying et al. “A new parallel kernel-independent fast multipole method”. In: Supercomputing, 2003 ACM/IEEE Conference. IEEE, 2003, pp. 14–14.
[81] Pasi Ylä-Oijala and Matti Taskinen. “Application of combined field integral equation for electromagnetic scattering by dielectric and composite objects”. In: IEEE Transactions on Antennas and Propagation 53.3 (2005), pp. 1168–1173.
[82] Feng Zhao. “An O(N) algorithm for three-dimensional N-body simulations”. In: (1987).
[83] Jun-Sheng Zhao and Weng Cho Chew. “Three-dimensional multilevel fast multipole algorithm from static to electrodynamic”. In: Microwave and Optical Technology Letters 26.1 (2000), pp. 43–48.