THE FAST MULTIPOLE METHOD IN THE DIFFERENTIAL ALGEBRA
FRAMEWORK FOR THE CALCULATION OF 3D SPACE CHARGE FIELDS
By
He Zhang

A DISSERTATION
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Physics
2013

ABSTRACT
THE FAST MULTIPOLE METHOD IN THE DIFFERENTIAL ALGEBRA
FRAMEWORK FOR THE CALCULATION OF 3D SPACE CHARGE
FIELDS
By
He Zhang
The space charge eﬀect is one of the most important collective eﬀects in beam dynamics
studies. In many cases, numerical simulations are inevitable in order to get a clear understanding of this eﬀect. The particle-particle interaction algorithms and the particle-in-cell
algorithms are widely used in space charge eﬀect simulations. But they both have diﬃculties
in dealing with highly correlated beams with abnormal distributions or complicated geometries. We developed a new algorithm to calculate the three dimensional self-ﬁeld between
charged particles by combining the diﬀerential algebra (DA) techniques with the fast multipole method (FMM). The FMM hierarchically decomposes the whole charged domain into
many small regions. For each region it uses multipole expansions to represent the potential/ﬁeld contributions from the particles far away from the region and then converts the
multipole expansions into a local expansion inside the region. The potential/ﬁeld due to the
far away particles is calculated from the expansions and the potential/ﬁeld due to the nearby
particles is calculated from the Coulomb force law. The DA techniques are used in the calculation, translation and converting of the expansions. The new algorithm scales linearly with
the total number of particles and it is suitable for any arbitrary charge distribution. Using
the DA techniques, we can calculate both the potential/ﬁeld and its high order derivatives,
which will be useful for the purpose of including the space charge eﬀect into transfer maps
in the future.

We ﬁrst present the single level FMM, which decomposes the whole domain into boxes of
the same size. It works best for charge distributions that are not overly non-uniform. Then
we present the multilevel fast multipole algorithm (MLFMA), which decomposes the whole
domain into diﬀerent sized boxes according to the charge density. Finer boxes are generated
where the higher charge density exists; thus the algorithm works for any arbitrary charge
distribution. A Message Passing Interface (MPI) based parallel version of the MLFMA is
developed, so that we can take advantage of cluster machines and enhance our simulation
ability. The algorithms are described in details and the numerical experimental results about
the eﬃciency and accuracy of the algorithm are presented and discussed. In the end, we
give an example of using this algorithm in the photo emission process simulation. Some
simulation related topics are discussed, such as: how to choose the proper units for the
variables in the beam dynamics equations, how to transform the space charge ﬁelds from
the bunch frame to the laboratory frame, and how to avoid artiﬁcial collisions between the
charged particles.

To my parents and my grandpa

iv

ACKNOWLEDGMENTS

I have spent the past six years happily and gratefully on studying and doing research in
the Physics and Astronomy Department at Michigan State University. Looking back, ﬁrst
I want to thank my adviser Professor Martin Berz. Without his continuous support and
guidance, I would not be able to ﬁnish my research and my thesis. I also want to thank
Professor Kyoko Makino and Professor Chong-Yu Ruan. I have beneﬁted a lot from their
kind help in the work. I truly appreciate Professor C.-P. Yuan, Professor Gary Westfall, and
Professor Brad Sherrill serving in my committee.
I am grateful to my friends, Dr. He Huang and Dr. Jianghao Yu. When I started to work
on the fast multipole method, I discussed a lot with Dr. Huang. Dr. Yu is always able to give
suggestions when I have diﬃculties in physics. I also thank Ravi Jagasia, Alexander Wittig,
Zhensheng Tao and Jenni Portman, for the valuable discussions and smooth cooperation we
had in the work.
My appreciation also goes to the former and current beam physics group members, Johannes Grote, Pavel Snopok, Shashikant Manikonda, Youn-Kyung Kim, Alexey Poklonskiy,
Pierluigi Di Lizia, Roberto Armellin, Robert Hipple, and Ben Loseth, and my friends, Hui
Wang, Kaijie Xu, Kaiyu He, Feng Wu, and Chao Cheng, for their friendship.
I also want to give my acknowledgment to Debbie Barratt and Brenda Wenzlick, secretaries of the Physics and Astromony Department at Michigan State University, for their
help in all these years.
Finally I want to thank my parents, who are always supporting me at any time.

v

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Chapter 1
1.1
1.2

A Brief Review on the Algorithms for Self-Field Calculation
between Charged Particles . . . . . . . . . . . . . . . . . . . . . . .
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A Glimpse into the Space Charge Calculation Routines for Beam Dynamics
Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.1 The Particle-Particle-Interaction Method . . . . . . . . . . . . . . . .
1.2.2 The Particle-In-Cell Method . . . . . . . . . . . . . . . . . . . . . . .
1.2.3 The Fast Multipole Method . . . . . . . . . . . . . . . . . . . . . . .

ix

2
2
6
7

Chapter 2
2.1
2.2
2.3

2.4

2.5
2.6

Single Level Fast Multipole Method in the Diﬀerential Algebra
Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The Algorithm of the single level FMM . . . . . . . . . . . . . . . . . . . . .
A Brief Review of Diﬀerential Algebra and COSY . . . . . . . . . . . . . . .
FMM in the DA Framework . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3.1 The Far Multipole Expansion . . . . . . . . . . . . . . . . . . . . . .
2.3.2 Translation of a Multipole Expansion . . . . . . . . . . . . . . . . . .
2.3.3 Conversion of a Far Multipole Expansion into a Local Expansion . . .
2.3.4 Translation of a Local Expansion . . . . . . . . . . . . . . . . . . . .
2.3.5 Representation of Potential and the Field as Polynomials . . . . . . .
2.3.6 Gaussian Macroparticles Instead of Point Charges . . . . . . . . . . .
2.3.7 Example Calculations Exhibiting the Linear Scaling Property . . . .
Use of the FMM in tracking simulations . . . . . . . . . . . . . . . . . . . .
2.4.1 Strategy of cutting boxes . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.2 Frame rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Examples of Tracking Simulations . . . . . . . . . . . . . . . . . . . . . . . .
Discussion Concerning the Eﬃciency . . . . . . . . . . . . . . . . . . . . . .

1
1

Chapter 3
3.1

Multiple Level Fast Multipole Algorithm in the Diﬀerential
Algebra Framework . . . . . . . . . . . . . . . . . . . . . . . . . . .
Analytical Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1 The Far Multipole Expansion from the Charges . . . . . . . . . . . .
3.1.2 Translation of The Far Multipole Expansion . . . . . . . . . . . . . .
3.1.3 Convert The Far Multipole Expansion into a Local Expansion . . . .

10
10
15
17
17
20
22
24
25
26
28
29
30
33
34
37

vi

39
41
41
42
43

3.2
3.3

3.4
3.5

3.1.4 The Local Expansion from the Charges . . . .
3.1.5 Translate The Local Expansion . . . . . . . .
3.1.6 Calculate the Potential and the Field from the
Error Analysis . . . . . . . . . . . . . . . . . . . . . .
The Multiple Level Fast Multipole Algorithm . . . .
3.3.1 Strategy of the MLFMA . . . . . . . . . . . .
3.3.2 Notation . . . . . . . . . . . . . . . . . . . . .
3.3.3 Brief Description of the Algorithm . . . . . .
3.3.4 Data Structure . . . . . . . . . . . . . . . . .
Numerical Results . . . . . . . . . . . . . . . . . . . .
Increasing of The Accuracy . . . . . . . . . . . . . .

. . . . . . .
. . . . . . .
Expansions
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .

Chapter 4 Parallel Version of the Multiple Level Fast Multipole
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Hierarchical Tree Structure . . . . . . . . . . . . . . . . . . . . . .
4.2.1 Octree and Compressed Octree . . . . . . . . . . . . . . .
4.2.2 Storage of the Compressed Octree . . . . . . . . . . . . . .
4.2.3 Tree Construction . . . . . . . . . . . . . . . . . . . . . . .
4.3 Calculation of the Far Multipole Expansions . . . . . . . . . . . .
4.3.1 Tree Partitioning . . . . . . . . . . . . . . . . . . . . . . .
4.3.2 Parallel Calculation of the Far Multipole Expansions . . .
4.4 Calculation of the Local Expansions and the Fields . . . . . . . .
4.4.1 Relation between Boxes . . . . . . . . . . . . . . . . . . .
4.4.2 Parallel Calculation of the Local Expansions and the ﬁelds
4.5 Numerical Results and Discussion . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

Algorithm
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .

45
45
46
47
53
53
54
58
59
62
65
69
69
71
71
74
78
80
80
82
84
84
86
89

Chapter 5 Using the Fast Multipole Method in Beam Dynamics Simulations 92
5.1 Beam Dynamics Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 The Lorentz Transformation of the Space Charge Field . . . . . . . . . . . . 95
5.3 Collisions Between Charged Particles in the Near Region . . . . . . . . . . . 96
5.4 An Example of Photoemission Process Simulation . . . . . . . . . . . . . . . 102
5.4.1 Model of the Photoemission Process . . . . . . . . . . . . . . . . . . . 102
5.4.2 Model of the Positive Residual Field . . . . . . . . . . . . . . . . . . 105
5.4.3 Comparison of the Simulation Results with the Experiment Data . . 113
5.5 Beneﬁt of Using the Grid-free Fast Multipole Method . . . . . . . . . . . . . 137
APPENDIX

. . . . . . . . . . . . . . .

BIBLIOGRAPHY

. . . . . . . . . . . . .

vii

. . . . . . . . . . . . . . . . . . . 139
. . . . . . . . . . . . . . . . . 149

LIST OF TABLES

Table 1.1

Some PPI programs . . . . . . . . . . . . . . . . . . . . . . . . . . .

3

Table 2.1

The relative errors of the potential calculated from the multipole
expansions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

Table 2.2

The relative errors of the potential calculated from the local expansions 24

Table 2.3

A local expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Table 2.4

Computational expenses for the Gaussian/uniform distribution bunches 37

Table 3.1

Variables and their sizes . . . . . . . . . . . . . . . . . . . . . . . . .

62

Table 3.2

Comparison of the computation time for the uniform distribution
bunches of n particles by direct calculation (td ) and MLFMA calculation (tM ) with DA order p and each childless box holding at most s
particles, given the relative errors (Err1 ,Err2 , and Err3 ) of the ﬁeld
in three diﬀerent direction . . . . . . . . . . . . . . . . . . . . . . .

64

Comparison of the computation time for the Gaussian distribution
bunches of n particles by direct calculation (td ) and MLFMA calculation (tM ) with DA order p and each childless box holding at most s
particles, given the relative errors (Err1 ,Err2 , and Err3 ) of the ﬁeld
in three diﬀerent direction . . . . . . . . . . . . . . . . . . . . . . .

64

Comparison of the computation time for the Gaussian distribution
bunch groups of n particles by direct calculation (td ) and MLFMA
calculation (tM ) with DA order p and each childless box holding at
most s particles, given the relative errors (Err1 ,Err2 , and Err3 ) of
the ﬁeld in three diﬀerent direction . . . . . . . . . . . . . . . . . .

64

Computation time tM and the relative error Err for a Gaussian bunch
with 1000,000 electrons for diﬀerent DA order p and diﬀerent na , i.e.
diﬀerent deﬁnition of the ”adjacent boxes” . . . . . . . . . . . . . .

67

Relations between two boxes and the respective actions . . . . . . .

87

Table 3.3

Table 3.4

Table 3.5

Table 4.1

viii

26

LIST OF FIGURES

Figure 2.1

Sequential cutting of an original box . . . . . . . . . . . . . . . . . .

11

Figure 2.2

The propagation of the far multipole expansions . . . . . . . . . . .

12

Figure 2.3

The behavior of the interaction list of boxes at three levels . . . . .

13

Figure 2.4

The translation of the local expansions . . . . . . . . . . . . . . . .

14

Figure 2.5

The absolute value of the diﬀerent order coeﬃcients of the local expansion. For interpretation of the references to color in this and all
other ﬁgures, the reader is referred to the electronic version of this
thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

27

Figure 2.6

Computation time for diﬀerent number of electrons

. . . . . . . . .

29

Figure 2.7

The box cutting strategy . . . . . . . . . . . . . . . . . . . . . . . .

32

Figure 2.8

Frame rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

33

Figure 2.9

Longitudinal bunch size during the free expansion of an electron bunch 34

Figure 2.10

Transverse bunch size during the free expansion of an electron bunch

35

Figure 2.11

Computational expense for the simulations for diﬀerent numbers of
macroparticles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

Figure 3.1

Far multipole expansion of the potential of a point charge . . . . . .

47

Figure 3.2

Local expansion of the potential of a point charge . . . . . . . . . .

51

Figure 3.3

The hierarchical structure of the boxes . . . . . . . . . . . . . . . .

54

Figure 3.4

Box (b) and its colleagues (c) . . . . . . . . . . . . . . . . . . . . . .

55

Figure 3.5

Case 1. Adjacent boxes . . . . . . . . . . . . . . . . . . . . . . . . .

56

ix

Figure 3.6

Case 2. Separated boxes of the same size . . . . . . . . . . . . . . .

56

Figure 3.7

Case 3 and case 4. Separated boxes of diﬀerent sizes . . . . . . . . .

57

Figure 3.8

Box b and the associated lists 1-5 . . . . . . . . . . . . . . . . . . .

58

Figure 3.9

Sequenced boxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

60

Figure 3.10

Storage of the hierarchical structure of the boxes . . . . . . . . . . .

61

Figure 3.11

Relation between computation time and particle number . . . . . .

63

Figure 3.12

Box b and the associated lists 1-5 with na = 2 . . . . . . . . . . . .

65

Figure 3.13

Relation between the error and the DA order for diﬀerent na , i.e. the
diﬀerent deﬁnitions of the ”adjacent boxes” . . . . . . . . . . . . .

67

Computation time of diﬀerent accuracies for diﬀerent na , i.e. the
diﬀerent deﬁnitions of the ”adjacent boxes” . . . . . . . . . . . . .

68

Figure 4.1

Full tree structure . . . . . . . . . . . . . . . . . . . . . . . . . . . .

72

Figure 4.2

Partial tree structure . . . . . . . . . . . . . . . . . . . . . . . . . .

73

Figure 4.3

From a normal tree to a compressed tree . . . . . . . . . . . . . . .

73

Figure 4.4

The maximum number of boxes generated in the worst case . . . . .

75

Figure 4.5

Sequenced boxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

76

Figure 4.6

Storage of the compressed tree . . . . . . . . . . . . . . . . . . . . .

77

Figure 4.7

Parallel tree construction . . . . . . . . . . . . . . . . . . . . . . . .

79

Figure 4.8

Tree partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81

Figure 4.9

Parallel multipole expansion calculation . . . . . . . . . . . . . . . .

82

Figure 4.10

Relations between two boxes . . . . . . . . . . . . . . . . . . . . . .

85

Figure 4.11

Parallel local expansion and ﬁeld calculation . . . . . . . . . . . . .

88

Figure 3.14

x

Figure 4.12

Computation time for 1,000,000 electrons with diﬀerent number of
processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

90

Parallel eﬃciency for 1,000,000 electrons with diﬀerent number of
processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

91

Figure 5.1

The artiﬁcial collision . . . . . . . . . . . . . . . . . . . . . . . . . .

97

Figure 5.2

Discontinuity of the momentum dispersion due to the near region
collision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

99

Figure 4.13

Figure 5.3

Electrons in z − pz phase space . . . . . . . . . . . . . . . . . . . . .

Figure 5.4

Electron density in z − pz phase space . . . . . . . . . . . . . . . . . 100

Figure 5.5

Momentum dispersion after applying the ﬁrst method . . . . . . . . 100

Figure 5.6

Momentum dispersion after applying the second method . . . . . . . 101

Figure 5.7

Momentum dispersion after applying the third method . . . . . . . . 101

Figure 5.8

Longitudinal ﬁeld . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

Figure 5.9

Radial ﬁeld . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

Figure 5.10

Experiment setup, from Zhensheng Tao . . . . . . . . . . . . . . . . 114

Figure 5.11

Number of particles as a function of time . . . . . . . . . . . . . . . 115

Figure 5.12

Raw data of the shadow patterns at 70 ps . . . . . . . . . . . . . . . 117

Figure 5.13

Comparison of the experiment with the simulation at 70 ps . . . . . 117

Figure 5.14

Comparison of the experiment with the simulation at 80 ps . . . . . 118

Figure 5.15

Comparison of the experiment with the simulation at 90 ps . . . . . 118

Figure 5.16

Comparison of the experiment with the simulation at 100 ps . . . . 119

Figure 5.17

Comparison of the bunch sizes in the experiment and the simulation 119

xi

99

Figure 5.18

The momentum dispersion as a function of time . . . . . . . . . . . 120

Figure 5.19

The transverse bunch sizes evolve with time . . . . . . . . . . . . . . 121

Figure 5.20

Charge distribution in x − z phase space in the ﬁrst 130 fs . . . . . 122

Figure 5.21

Charge distribution in x − pz phase space in the ﬁrst 130 fs . . . . . 124

Figure 5.22

Charge distribution in x − z phase space in the following 120 ps . . 128

Figure 5.23

Charge distribution in x − y phase space in the following 100 ps . . 131

Figure 5.24

Charge distribution in x − z phase space in the following 120 ps . . 134

Figure A.1

Diagram of stage 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

Figure A.2

Diagram of description (2) . . . . . . . . . . . . . . . . . . . . . . . 146

Figure A.3

Diagram of description (3) . . . . . . . . . . . . . . . . . . . . . . . 148

xii

Chapter 1
A Brief Review on the Algorithms for
Self-Field Calculation between
Charged Particles

1.1

Introduction

The Coulomb interaction between charged particles inside a bunch is one of the most important collective eﬀects in the study of beam dynamics. As scientists try to approach higher
beam intensity, this eﬀect becomes more signiﬁcant in modern high brilliance particle accelerators or free electron laser devices. This eﬀect may also be dominant in contemplated
time resolved electric microscopes where the beam energy is relatively low [45, 48]. Although
many analytical models and discussions have been made, numerical simulation is inevitable
to have a good understanding of this eﬀect in many cases.
Usually we assume all the charged particles have very small velocities in the beam frame,
so that their interaction can be treated as an electrostatic ﬁeld, which can be calculated
from the Coulomb force law. However, the computation cost of the pairwise formula of
the Coulomb force law scales with the square of the particle number N ; this makes it not
applicable in many cases because we often have a huge number of charged particles inside a

1

bunch. To increase the eﬃciency many fast calculation algorithms have been developed. In
this chapter, we will ﬁrst give a brief historic review of the algorithms developed for beam
dynamics simulations, then concentrate on our algorithm, the fast multipole method (FMM)
based on the diﬀerential algebra (DA) , whose computation cost scales linearly with N .

1.2

A Glimpse into the Space Charge Calculation Routines for Beam Dynamics Simulations

Beam physicists have been using numerical simulations to study the space charge eﬀect
for decades. Many fast algorithms and programs have been developed; most of them can
be classiﬁed into two categories: the particle-particle-interaction (PPI) method and the
particle-in-cell (PIC) method [51].

1.2.1

The Particle-Particle-Interaction Method

The basic idea of the PPI method is to use macroparticles, each of which represents a
group of real particles, calculate the ﬁeld on the macroparticles and move them in the
simulation [60, 51, 35, 52, 50]. In order to calculate the ﬁeld, one makes some assumptions
concerning the bunch shape and charge distribution, such as an elliptical bunch with uniform
charge distribution inside [17, 18], or ﬁt the real charge distribution by a group of analytical
functions such as a Gaussian [52, 51, 50]. Table 1.1 lists some typical PPI programs with
the assumptions of their algorithm. The programs work well as long as the assumptions are
valid.
MAPRO2 uses macroparticles, each representing 105 to 107 real particles, in simulation.

2

Table 1.1: Some PPI programs
Programs
MAPRO2
SC3DELP
SCHERM
Improved SCHERM

Year
1971
1991
1996
1996

Assumptions
Ellipsoidal shape and Gaussian charge distribution
Ellipsoidal shape
Ellipsoidal symmetry in transverse directions
NONE

The charge distribution inside the bunch is assumed to be a three dimensional Gaussian
distribution such as[60]

n(x, y, z) = n0 exp −

1
2

x2 y 2 z 2
+ 2 + 2
a2
b
c

,

(1.1)

where a, b, and c are the root mean square (rms) bunch sizes and n0 = 4N/((2π)3/2 a · b · c).
Then the ﬁeld on each macroparticle can be calculated by numerical integration in Eq. (1.2).

q·a·b·c·x
Ex =
·
2ε0
0
q·a·b·c·y
·
Ey =
2ε0
0
q·a·b·c·z
·
Ez =
2ε0
0

1
∞ n0 exp − 2

(a2 + s)

1
∞ n0 exp − 2

(b2 + s)
1
∞ n0 exp − 2

(c2 + s)

x2 + y 2 + z 2
a2 +s
b2 +s
c2 +s

ds
,

(a2 + s)(b2 + s)(c2 + s)
y2
z2
x2
2 +s + b2 +s + c2 +s
a

ds
,

(a2 + s)(b2 + s)(c2 + s)
y2
z2
x2
2 +s + b2 +s + c2 +s
a

(a2 + s)(b2 + s)(c2 + s)

(1.2)

ds
,

where q is the total charge of the macroparticle, ε0 is the permittivity. MARPO2 assumes
that the bunch has an elliptical shape and a Gaussian distribution of charges, which is not
necessarily true in practice, especially for high density beams, and may lead to inaccurate
results.
SC3DELP (1991) expands the charge distribution from the Gaussian distribution to any
3

arbitrary distribution, although it still assumes an elliptical shape. If the density is described
as[35]
n(t0 ) = n

x2 y 2 z 2
+ 2 + 2
a2
b
c

,

(1.3)

where a, b, and c are the rms bunch sizes and t0 deﬁnes the isodensity contours of the
distribution, the electric ﬁeld can be expressed as[35]
q·a·b·c·x ∞
n(t)ds
Ex =
,
2 + s)3/2 (b2 + s)1/2 (c2 + s)1/2
2ε0
0 (a
q·a·b·c·y ∞
n(t)ds
Ey =
,
2ε0
0 (a2 + s)1/2 (b2 + s)3/2 (c2 + s)1/2
q·a·b·c·z ∞
n(t)ds
Ez =
,
2 + s)3/2 (b2 + s)1/2 (c2 + s)3/2
2ε0
0 (a

(1.4)

where q is the total charge of the macroparticle, ε0 is the permittivity, and t = t(x, y, z, s) =
x2 /(a2 + s) + y 2 /(b2 + s) + z 2 /(c2 + s). The charge density can be described as an expansion
of the Fourier series, in which the coeﬃcients can be approximated as a summation over all
macroparticles. For each macroparticle, plug in the coordinates x, y, and z in Eq. (1.4) and
the ﬁeld can be calculated by performing the integration over s.
SCHERM (1996) only assumes the ellipsoidal symmetry in the transverse directions. The
longitudinal shape is described by the overlap of two or three Gaussian functions [51]. The
charge distribution can be expressed by a Hermite series expansion or a Cesaro-Fejer series
expansion [50]. Since the bunch is decomposed into two or three ellipsoidal bunches, the ﬁeld
is the summation of the contribution from each bunch, which can be numerically calculated
by Eq. (1.4). The improved SCHERM removes all the assumptions on symmetry, based on
the knowledge that the three dimensional charge distribution of a bunch can be expressed

4

by a Hermite series expansion as[52]

x y z
ρ( , , ) =
a b c

Ai,j,k Hi
i,j,k

x2
z
x
y
y2
z2
Hj
Hk
exp − 2 − 2 − 2
a
b
c
2a
2b
2c

,

(1.5)

with
Ai,j,k =

q

P

(2π)3/2 i!j!k! p=1

Hi

xp
yp
zp
Hj
Hk
.
a
b
c

The charge distribution in a bunch can be considered as the Gaussian distribution

ρ0 (x, y, z) = A000 exp −

x2
y2
z2
− 2− 2
2a2 2b
2c

with some modiﬁcations. The ﬁeld due to the Gaussian distribution can be numerically
calculated by Eq. (1.4) as before and the ﬁeld due to the other terms with Ai,j,k = A000
can be approximately calculated by Eq. (1.6) as follows [52].

x
Hj
Ex = −αAi,j,k Hi−1
a
x
Hj−1
Ey = −βAi,j,k Hi
a
y
x
Hj
Ez = −γAi,j,k Hi
a
b

x2
z
y
y2
z2
Hk
exp − 2 − 2 − 2 ,
b
c
2a
2b
2c
x2
z
y
y2
z2
Hk
exp − 2 − 2 − 2 ,
b
c
2a
2b
2c
2
2
x
z
y
z2
Hk−1
exp − 2 − 2 − 2 ,
c
2a
2b
2c

(1.6)

where α, β, and γ are parameters that need to be optimized for each term and that satisﬁes
α + β + γ = 1.
The PPI codes have been developed for several decades. In the beginning one can only
treat some simple cases with the ellipsoidal symmetry in shape and the uniform distribution
or the Gaussian distribution of charges. Now one can treat bunches with any arbitrary shape
and any arbitrary charge distribution. The most time consuming part is to ﬁt the charge

5

distribution and numerically calculate the electric ﬁeld. Except for a few special cases, there
is no explicit formula for the self-ﬁeld inside a charged particle bunch. In most cases the ﬁeld
calculation depends on numerical methods. Although new algorithms are being developed
[66, 73], it is always challenging to do it in a fast and accurate way. In some cases, it may be
very challenging to ﬁt the charge distribution. From Eq. (1.5) one can see it will be easier
to ﬁt the charge distribution if it is close to the Gaussian distribution. However this is not
necessarily true in practice. The charge distribution of a high intensity bunch under external
ﬁelds may be far away from a Gaussian distribution, which makes the ﬁtting diﬃcult.

1.2.2

The Particle-In-Cell Method

The PIC method is the most popular method to calculate the space charge eﬀect in beam
dynamics simulations, which has been developed for several decades. In the earlier days,
assumptions on symmetry were used to help simplify the calculation and solve the problem.
For example, SCHEFF, a two dimensional PIC code, assumes the bunch shape has cylindrical
symmetry [69]. Thanks to the development of algorithms and computer hardware, the
contemporary PIC codes can deal with the bunches containing billions of macroparticles
with any arbitrary shape and charge distribution. In the latest two decades, many PIC
codes, for example, WARP [43], GPT [71], VORPAL [63], and IMPACT [74], have been
developed and successfully applied in beam dynamics simulation in diﬀerent areas such as
laser-plasma accelerators, electron guns, high intensity ion or electron storage rings, etc
[42, 63]. Some PIC codes can simulate millions of particles on a PC. Parallel version of
many contemporary PIC codes has also been developed [63, 74, 41], which allows them to
run on the modern cluster machines. Taking advantage of supercomputers, some PIC codes
can simulate billions of particles [77, 62].
6

The strategy of the PIC method can be described in three steps. First, one needs to set
up a grid on the space with charges and distribute the charge density onto the mesh points.
Then, solve the Poisson equation to obtain the electric ﬁeld on the mesh points, with which
the electric ﬁeld on the charged particles can be calculated by interpolation. Finally solve
the dynamics equations of the charged particles or solve the Vlasov equation of the ﬂow.
So the ﬁrst challenge is how to set up the grid, which has eﬀects on accuracy and eﬃciency.
Usually, in practice one needs to ﬁnd a balance between the accuracy and the eﬃciency. In
earlier codes, an equidistant grid is often used. But by now to eﬃciently deal with more
complicated charge density distributions, adaptive griding techniques with nonequidistant
grids have been developed, which generate a ﬁner grid where higher charge density exists and
a coarser grid where lower charge density exists [72, 87]. The second challenge is to solve the
Poisson equation. This problem has been addressed by physicists and mathematicians for
many years, and there are many well developed methods that can be selected such as the fast
Fourier transformation based solver used in WARP [43], the Green’s function based solver
used in IMPACT [74], the ﬁnite diﬀerence time-domain solver used in VORPAL[63] , etc.
Finally, one has to solve the dynamics equations, which is also a classical problem. Leap-frog
integrators and Runge-Kutta integrators are often employed [43, 63, 71, 74]. In one sentence
the PIC method is a mature and promising method to solve the self-ﬁeld between charged
particles.

1.2.3

The Fast Multipole Method

Since so many simulation codes for the space charge eﬀect calculation have been developed
and successfully applied, a question naturally raised is why we need a new algorithm? Why
do we not simply use the existing codes? To answer this question, we have to make it
7

clear what we want to calculate. From the beginning, we plan to include the space charge
eﬀect into the transfer map, which is widely used in the beam dynamics simulation for
single particle dynamics studies. The space charge may have signiﬁcant eﬀects on the beam
dynamics, which can be reﬂected by the transfer map [16, 65]. In the long run, we want
to develop a method that can generate the transfer map for any arbitrary ﬁeld. Our group
has developed a method to generate the map for any given external ﬁeld [57, 58, 59]. To
generate the transfer map for the self-ﬁeld inside a charged particle bunch is the question in
front of us. For this purpose we need to calculate not only the ﬁeld but also the derivatives
of the ﬁeld on the reference particle. Using the PIC method, the ﬁeld on the macroparticle
is calculated by interpolation, as a result it is very diﬃcult to calculate the derivatives of the
ﬁeld. Using the PPI method, it is possible to calculate the derivatives of the ﬁeld if we use
a DA based numerical integrator to calculate the ﬁeld, but there always exists the danger
that the ﬁtting algorithm for the charge density distribution may fail if the charge density is
too complicated. So we need an algorithm that can treat any arbitrary charge distribution
and that can calculate both the ﬁeld and the high order derivatives of the ﬁeld.
The year 1986 brought the publication of the tree code algorithm, also known as BarnesHut algorithm [8], in which the potentials of the particles far away from the observer are
represented by the multipole expansions in powers of 1/r, covering larger and larger boxes
further and further away from the point of interest. In this way the original point to point
interaction is replaced by the point to box (that contains the multipole expansion) interaction. It could be shown that the cost of the method scales with O(N log N )[8, 9]. In 1987,
Greengard and Roklin published the fast multipole method (FMM), in which multipole expansions are converted into the local expansions in the near region of the observer. Thus
the point to box interaction is converted into the box (that contains the multipole expan8

sion) to box (that contains the local expansion) interaction, and the eﬃciency is further
increased to O(N )[39, 19].The original FMM focused on the Coulomb (1/r) potential, and
the three dimensional potential of a point charge is expanded in terms of the spherical harmonic functions in spherical coordinates[10]. Later on, other algorithms in the FMM family
were developed for diﬀerent problems, such as the screened Coulomb interaction[40, 55], electromagnetic scattering (Helmholtz equation)[32, 25], 1/rν potential for any real ν [80], and
potentials that have no explicit expression and can only be expressed numerically[5, 56, 34].
Some of the algorithms express their multipole expansions and local expansions in Cartesian
coordinates. Both FMM and tree code have been widely applied in may areas and have also
gained some attentions in beam dynamics simulations. The two dimensional FMM algorithm
has been employed in Accsin to calculate the transverse space charge ﬁeld [46, 47]. The tree
code has been used in some electron beam emittance studies [26]. We build a new algorithm
that ﬁts our needs by combining the FMM with our diﬀerential algebra (DA) method. The
FMM in the DA framework can calculate not only the three dimensional self-ﬁeld, but also
its high order derivatives between charged particles in any distribution. In the following
chapters, the details of the DA based FMM will be presented and discussed.

9

Chapter 2
Single Level Fast Multipole Method
in the Diﬀerential Algebra Framework

2.1

The Algorithm of the single level FMM

The key idea of the FMM is to represent the potential of suitable groups of source particles
that are far away from the observer in terms of expansions involving powers of 1/r, which
we refer to as a far multipole expansion, and use the fact that far enough away, higher
powers of 1/r are less and less signiﬁcant. This makes it possible to compute the action of
source particles on observer particles in a more eﬃcient manner. Furthermore, far multipole
expansions corresponding to certain groups can be translated and combined. And ﬁnally,
far multipoles can also be locally expanded involving powers of r, which we call a local
expansion. This local expansion thus allows the treatment of groups of nearby observer
particles in a combined manner.
In practice, one ﬁrst encloses all the particles in a group of cube boxes, the zero level
boxes. Then each of these cube boxes is cut into eight equal small cube boxes, which we
call the ﬁrst level boxes. Then each ﬁrst level box is cut in the same way into eight smaller
boxes, leading to the second level boxes. This process is continued until a pre-speciﬁed level
is reached. In practice, the number of levels is determined such that the average number
of particles in the ﬁnest boxes is near a pre-speciﬁed value, the size of which will aﬀect the
10

Figure 2.1: Sequential cutting of an original box

l=1

l=2

3

4

5

2

1

6

9

8

7

eﬃciency of the method.
We call boxes of the same level neighbors if they touch. For a given box A, we denote the
region made of all same level neighbors and A itself as the near region of A, and everything
else as the far region A. For a box A, a next higher level box B containing A is called a
parent box of A, and A is called one of the child boxes of B. Apparently each child box has
only one parent box, and each parent box has eight child boxes.
A two dimensional example about how to cut the boxes is shown in Figure 2.1. In the
ﬁrst level, the square box is cut into four square boxes, and in the second level, 16 boxes are
cut out, and so on. Furthermore, boxes 2 through 9 are all neighbors of box 1, and together
they describe the near region of 1. All the unnumbered boxes are in the far region of the
box 1. There is no diﬀerence between the two dimensional case and the three dimensional
case in principle, except that each box is cut into four child boxes in the two dimensional
case rather than eight in the three dimensional case.
The FMM algorithm consists of two parts, the ﬁrst of which we now describe.
1. Cut the box of interest into the desired levels.
2. In each box of the ﬁnest level, calculate the far multipole expansion of the particles
11

inside that box around the center of the box
3. From the ﬁnest level to the second ﬁnest level, translate the multipole expansions from
the center of the children boxes into the center of their parent box, and then add them
up to obtain the far multipole expansion of the parent box, as Figure 2.2 shows.
4. Repeat this at all levels.
After this part is completed, we now have a far multipole expansion for each box in each
level.
Figure 2.2: The propagation of the far multipole expansions
l=3

l=2

l=1

We now proceed to the second part of the FMM. First, for any box A having a parent
B, we deﬁne the interaction list to be the collection of those boxes of the same level that
belong to the far region of A, but to the near region of its parent B. These boxes of “medium
distance” to A will be important in the subsequent algorithm because they require special
treatment. To illustrate the concept of the interaction list, consider Figure 2.3, which shows
a box of interest in black, its near region in white, and its interaction list by hatching of
diagonal lines from lower left to upper right, for three diﬀerent levels.

12

Figure 2.3: The behavior of the interaction list of boxes at three levels

l=2

Near region

l=3

Interaction list

l=4

Already considered

The second part of the FMM algorithm now begins at the ﬁrst level at which boxes have
interaction lists.
1. For each box with an interaction list, compute a local expansion around its center point
from the far expansions of the boxes making up the interaction list
2. At the next ﬁner level, again compute a local expansion around the center of each
box from the far expansion of the boxes making up the interaction list. To this, add
the local expansion of the parent’s interaction list by re-expanding the parent’s local
expansion to its own center, as shown in Figure 2.4
3. Proceed until the lowest level is reached.

13

Figure 2.4: The translation of the local expansions
l=1

l=2

l=3

After the completion of the second step, each box at the ﬁnest level now possesses a local
expansion of all contributions to the potential from outside its near region. The gradient of
this local expansion consequently represents the ﬁeld contributions from everything in the
far region.
The third step of the method now consists of determining the ﬁeld on each particle of
interest.
1. For each particle, determine the ﬁnest level box to which the particle belongs, and
evaluate the local expansion of the ﬁeld of that box at the particle’s coordinates. This
results in an approximation of the ﬁeld at the particle due to all other particles outside
the near region of the box in which the particle lies.
2. To the far ﬁeld, explicitly add the Coulomb ﬁeld of all other particles in the near region.
We note that the accuracy of this approximation depends on the orders used for the far
expansions and the local expansions that occur, and in principle can be made as high as
desired.
In the remainder of the paper we will discuss how to perform all the relevant far and
local expansions using the diﬀerential algebraic method.
14

2.2

A Brief Review of Diﬀerential Algebra and COSY

Before we continue to present our work on the FMM in the DA framework, we want to
give a brief introduction of the DA method. The basic concepts and the deductions will be
presented directly without proof, please refer to [14] for details.
Consider the vector space of the inﬁnitely diﬀerentiable functions C ∞ (Rv ), in which we
can deﬁne an equivalence relation “=n ” between two functions a, b ∈ C ∞ (Rv ) via a =n b if
a(0) = b(0) and if all the partial derivatives of a and b at 0 agree up to the order n. Note
that the point 0 is selected for convenience, and any other point could be chosen as well.
The set of all b that satisﬁes b =n a is called the equivalence class of a, which is denoted
by [a]n . We denote all the equivalence classes with respect to =n on C ∞ (Rv ) as n Dv . The
addition, scalar multiplication and multiplication on n Dv can be deﬁned as eq. (2.1)

[a]n + [b]n := [a + b]n ,
c · [a]n := [c · a]n ,

(2.1)

[a]n · [b]n := [a · b]n ,
where a, b ∈ n Dv and c is a scalar, so that n Dv is an algebra. We can also deﬁne the
derivation operator ∂v as eq. (2.2)

∂v [a]n :=

∂
a
,
∂xv n−1

15

(2.2)

where xv is the v th variable of the function a. The operator ∂v satisﬁes

∂v ([a] · [b]) = [a] · (∂v [b]) + (∂v [a]) · [b]

(2.3)

An algebra with a derivation is called a diﬀerential algebra. There are v special classes
dv = [xv ], whose elements are all inﬁnitely small. According to the ﬁxed point theorem[14],
the inverse and the roots of any element that is not inﬁnitely small in n Dv exist and can
be calculated easily. Furthermore, all real power series can be extend to the DA within
their radius of convergence. If a function a in n Dv has all the derivatives cJ1 ,...Jv =
J

∂ J1 +...+Jv a/∂x1 1 · ... · ∂xJv , then [a] can be written as
v
[a] =

J

cJ1 ,...Jv · d1 1 · ... · dJv .
v

(2.4)

J

Thus d1 1 · ... · dJv is a basis of the vector space of n Dv . The eq. (2.4) reminds us of the
v
Taylor expansion of a function. Actually if we have a function f in C ∞ (Rv ) and fT is its
Taylor expansion up to order n, obviously we have f =n fT in n Dv . In practice this means
we can express f by its Taylor expansion up to an arbitrary order n as an element in n Dv ,
and we can calculate the derivative classes of f and any other function that can be derived
by applying the elemental operations, divisions, roots, and power series on f . [11]
A beam optical system can be described in terms of the transfer map method, which
relates the ﬁnal positions and velocities of the particles to the initial conditions, so that it
makes the tracking more eﬃcient than solving the dynamics equations element by element.
The power of the DA make it possible to calculate the map up to any arbitrary order[12].
COSY Inﬁnity is a program developed for high performance modern scientiﬁc computing,

16

which is used in beam optical system design. COSY supports various advanced data types,
such as DA, TM and VE[15]. In our following work on the FMM, we will use the DA data
type. By evaluating a function f in DA data type in COSY, one can obtain its Taylor
expansion fT up to an arbitrary predetermined order n , because a DA vector carries all its
coeﬃcients cJ1 ,...,Jv , which are exactly the coeﬃcients of fT . If we consider two functions
f, g ∈ n Dv which can also be viewed as two maps Mf and Mg , the function g(f ) or the
composition of the two maps Mg ◦Mf can be calculated by the command POLVAL in COSY.
If f is a linear function or a linear map, a more eﬃcient command DATRN can be used.[15]
Furthermore, the TM data type can be used for the rigorous calculation. In the future we
will use the TM in our algorithm to obtain rigorous error bounds.[13] COSY also supports
parallel calculation. Our current code can be easily revised for the parallel calculation and
run in a cluster machine.

2.3
2.3.1

FMM in the DA Framework
The Far Multipole Expansion

Given a unit point charge located at position ri , its potential at the location r of an observer
is given by

φ=

q
,
|r − ri |

(2.5)

The key idea of the FFM method lies in grouping together particles that are suﬃciently near
and treat them with a combined expansion. As it turns out, it is not immediately obvious
what the best choice of variables is to achieve this end. It is known that using spherical

17

coordinates, φ can be expressed as series of the spherical harmonic functions, which are
composed of products of cos θ, eiψ and 1/r [10]. The desire to operate directly in Cartesian
coordinates ultimately leads to the inspiration of expanding the function with respect to 1/r,
x/r2 , y/r2 , and z/r2 . As it turns out this is one variable more than necessary and contains
redundancy because (x, y, z) already determines r. But the use of four variables leads to a
particularly transparent algorithm.
Back to the FMM, we consider a cube box centered at the origin, whose side length is
a, and there are n particles inside the box whose coordinates are (xi , yi , zi ). At any point
(x, y, z), whose distance to the origin r ≫ a, the potential can be expressed as

n

qi

φ =
i

(xi − x)2 + (yi − y)2 + (zi − z)2

n

qi /

=
i

1+

2 2
x2 +yi +zi
i
x2 +y 2 +z 2

−

2xi x
x2 +y 2 +z 2

n

=
i

with

x2 + y 2 + z 2
−

2yi y
x2 +y 2 +z 2

−

2zi z
x2 +y 2 +z 2

d1
2
2
1 + (x2 + yi + zi )d2 − 2xi d2 − 2yi d3 − 2zi d4
1
i

(2.6)

x
x
1
= 2,
= , d2 = 2
2 + z2
r
x +y
r
x2 + y 2 + z 2
y
y
z
z
d3 = 2
= 2 , d4 = 2
= 2.
2 + z2
2 + z2
x +y
r
x +y
r

d1 =

1

If we consider points (x, y, z) whose distance to the origin r ≫ a, the expression above can
apparently expanded in powers of the quantities d1 , d2 , d3 and d4 , since these quantities go
to zero as r becomes larger and larger. The resulting expansion mathematically represent a

18

Taylor series in d1 , d2 , d3 and d4 , and in common physical terminology a multipole expansion.
To avoid confusion with the common deﬁnition of multipole expansions in particle optics,
we call this expansion the far multipole expansion if it is not clear from the context.
Using the variables d1 , d2 , d3 and d4 as the expansion variables in the DA algorithm,
COSY can now readily compute the far multipole expansions to any order. Furthermore,
quite importantly, the same method can be applied to other types of the potential, for
example the two dimensional Coulomb potential, the common van der Waals potentials,
or the potentials of certain macroparticle charge distributions, and in the DA formalism,
without much additional diﬃculty far multipole expansions can be calculated by evaluation
in DA arithmetic.
We now use the formula to calculate the multipole expansions in the ﬁnest level boxes.
To assess the performance of the method, we performed some test calculations of the far
multipole expansion. One group of results are presented in Table 2.1. We put 50 electrons
into each of the two cube boxes, whose centers are (0, 0, 0) and (4, 0, 0) and whose side length
are both 2. In terms of the terminology introduced above, this means the second box lies
in the far region of the ﬁrst and is hence subject to the far multipole expansion; but it
represents the closes such box, where thus the convergence of the method is particularly
diﬃcult to achieve.
The potential on the electrons in the second box are calculated and repeated for 1000
times with diﬀerent random numbers, so that we have 50000 data points together. The
relative error is calculated by

N
2
i=1 (∆φ/φ) with N = 50000. We can see as the DA

order increases, the relative error decreases, which shows the convergence of this multipole
expansion.

19

Table 2.1: The relative errors of the potential calculated from the multipole expansions
DA order
1
2
3
4
5
6

2.3.2

Error
2.088 × 10−2
3.816 × 10−3
1.524 × 10−3
9.681 × 10−4
1.381 × 10−4
5.758 × 10−5

Translation of a Multipole Expansion

If we have a multipole expansion in the cube box centered at the origin (0, 0, 0), we can
′ ′
translate it into another frame whose origin is at (x′ , yo , zo ). The distance from the observer
o

to the origin of the new frame is assumed to be much larger than the side length of the box
′ ′
centered at (x′ , yo , zo ), which assures that the new DA variables d′ ...d′ in eq. (2.7) are small
o
1
4

enough to still have convergence. Speciﬁcally, we have

d′ =
1

1

′
′
(x − x′ )2 + (y − yo )2 + (z − zo )2
o
x − x′
x′
o
d′ =
= ′2
2
r′2
r
′
y′
y − yo
= ′2
d′ =
3
r′2
r
′
z − zo
z′
d′ =
= ′2
4
r′2
r

=

1
= ′
r
x′2 + y ′2 + z ′2
1

(2.7)

′ ′
In eq. (2.7), (x′ , y ′ , z ′ ) is the position of the observer in the new frame of (x′ , yo zo ), and r′
o
′ ′
is the distance from the observer to the new origin (x′ , yo , zo ). The new DA variables and
o

20

the old DA variables in eq. (2.6) are related via

d1 =
=

1
=
r

1
′
′
′
′
(x − x′ + x′ )2 + (y − yo + yo )2 + (z − zo + zo )2
o
o
1

′
′2
′2
(x − x′ )2 + (y − yo )2 + (z − zo )2 + x′2 + yo + zo
o
o
′ ′
′ ′
+2(x − x′ )x′ + 2(y − yo )yo + 2(z − zo )zo
o o

=

=

1
′
′
′2
′2
1/d′2 + x′2 + yo + zo + 2x′ d′ /d′2 + 2yo d′ /d′2 + 2zo d′ /d′2
o
o 2 1
4 1
3 1
1

d′
1

′
′
′2
′2
1 + (x′2 + yo + zo )d′2 + 2x′ d′ + 2yo d′ + 2zo d′
o 2
o
4
3
1
√
′ · R,
= d1

(2.8)

d2 = xd2 = (x − x′ + x′ )d2 = (d′ + x′ d′2 ) · R,
o
o 1
o 1
1
2
′
′
′
d3 = yd2 = (y − yo + yo )d2 = (d′ + yo d′2 ) · R,
1
3
1
1
′
′
′
d4 = zd2 = (z − zo + zo )d2 = (d′ + zo d′2 ) · R,
1
4
1
1

with
R =

1
.
′
′
′2
′2
1 + (x′2 + yo + zo )d′2 + 2x′ d′ + 2yo d′ + 2zo d′
o 2
o
4
3
1

If we substitute eq. (2.8) into eq. (2.6), we obtain the multipole expansion in the new
frame. If one considers eq. (2.6) as a map Mc2m (d1 , d2 , d3 , d4 ) and eq. (2.8) as a map
M1 (d′ , d′ , d′ , d′ ), the substitution is actually merely the composition of the two maps
1 2 3 4

Mm2m = Mc2m ◦ M1,

(2.9)

which is a common operation in DA methods, and which here yields the expression of φ
in the new frame with the new DA variables. With eq. (2.8) and eq. (2.9), to calculate
21

the multipole expansion in a box not in the ﬁnest level, we can translate all the multipole
expansions of all its children boxes to its center, then take the summation.

2.3.3

Conversion of a Far Multipole Expansion into a Local Expansion

The potential of a far multipole expansion at the frame at (0, 0, 0) on an observer (x, y, z)
can be converted into a local expansion in the near region of the observer. Since the new
′ ′
frame has its origin (x′ , yo , zo ) close to the observer, it is natural to choose the coordinates
o

of the observer (x′ , y ′ , z ′ ) in the new frame as the new DA variables, as eq. (2.12) shows.

d ′ = x′ ,
1
d′ = y ′ ,
2

(2.10)

d′ = z ′ .
3

So, in contrast to the far multipole expansion, which is a Taylor expansion in the variable
d1 , d2 , d3 and d4 , which vanish at inﬁnity but diverge in close proximity, we now generate
a Taylor expansion in the variables x′ , y ′ and z ′ , which vanish at the origin. The old DA
variables in eq. (2.6) and the new DA variables in eq. (2.12) have the relation as shown in

22

eq. (2.11).

d1 =

1
x2 + y 2 + z 2

=

1
′
′
(x′ + d′ )2 + (yo + d′ )2 + (zo + d′ )2
o
3
2
1

=

√

R,

x′ + d ′
x
o
′
′
1
= ′
′ )2 + (y ′ + d′ )2 + (z ′ + d′ )2 = (xo + d1 ) · R,
2 + y2 + z2
x
(xo + d1
o
o
3
2
′
yo + d ′
y
′
′
2
= ′
d3 = 2
′ )2 + (y ′ + d′ )2 + (z ′ + d′ )2 = (yo + d2 ) · R,
2 + z2
x +y
(xo + d1
o
o
3
2
d2 =

d4 =

′
zo + d ′
z
′
′
3
= ′
′ )2 + (y ′ + d′ )2 + (z ′ + d′ )2 = (zo + d3 ) · R.
2 + y2 + z2
x
(xo + d1
o
o
3
2

with
R =

(2.11)
1

.
′
′
(x′ + d′ )2 + (yo + d′ )2 + (zo + d′ )2
o
1
2
3

If we call eq. (2.11) the map M2 (d′ d′ d′ ), the local expansion can again be written as a
1, 2, 3
composition of Mc2m and M2 as

Mm2l = Mc2m ◦ M2 .

(2.12)

Having eq. (2.11) and eq. (2.12), we can convert the multipoles in the interaction list of
each box into the local expansion inside the box itself. The result of a test calculation to
show the convergence of the local expansion is presented in Table 2.2. The set up of the
boxes and electrons, the number of the data points, and the deﬁnition the relative error are
the same with those of Table 2.1. We can see that the relative error decreases as we increase
the DA order.

23

Table 2.2: The relative errors of the potential calculated from the local expansions
DA order
1
2
3
4
5
6

2.3.4

Error
3.136 × 10−2
8.348 × 10−3
3.024 × 10−3
1.165 × 10−3
2.086 × 10−4
8.036 × 10−5

Translation of a Local Expansion

Consider two local frames whose origins are both close to the observer. If we know the local
expansion of the frame at (0, 0, 0) is as eq. (2.12) shows, we can translate it into the other
′ ′
local frame at (x′ , yo , zo ). Since the new frame’s origin is also close to the observer, we can
o

still choose the coordinates of the observer in the new frame as the new DA variables d′ , d′ ,
1 2
and d′ . Obviously the old DA variables d1 , d2 , and d3 , which are the coordinates of the
3
observer in the local frame at (0, 0, 0), have the simple relation shown in eq. (2.13) with the
new DA variables.

d 1 = x′ + d ′ ,
o
1
′
d 2 = yo + d ′ ,
2

(2.13)

′
d3 = zo + d′ .
3

We call eq. (2.13) the map M3 , then the new local expansion can be calculated by composing
Mm2l with M3 as
Ml2l = Mm2l ◦ M3 .

24

(2.14)

We can translate the local expansion of a box into its children boxes using eq. (2.13) and
eq. (2.14).

2.3.5

Representation of Potential and the Field as Polynomials

In order to calculate the high order transfer map of a beam optical system, one needs not only
the ﬁeld on the reference orbit but also the derivatives of the ﬁeld[12]. If we have the ﬁeld
of the Coulomb interaction and its derivatives, we can include the Coulomb interaction into
the map. Eq. (2.14) provides the expansion of the potential φ with respect to the position
of the observer up to the order p. This expansion converges inside the corresponding ﬁnest
box. The constant part of the expansion is the value of the potential at the center of the
box, and the coeﬃcient of each monomial is the value of the corresponding partial derivative
of the potential.
Table 2.3 shows an example of the local expansion. In a large cube box, we put 10000
electrons. The cube box is cut into the third level. And the third level box size is 2, which
means if we put a frame at the center of one ﬁnest box, we have x, y, z ∈ [−1, 1]. The local
expansion of each 3rd level box is calculated up to the 10th order, and the coeﬃcients till
the third order of one 3rd level box are presented in Table 2.3. The forth column and the
eighth column show the exponents of each DA variables in each monomial. For example,
the index triple 2 1 0 denotes the monomial x2 y 1 z 0 = x2 y. The second and sixth columns
show the coeﬃcient of each monomial. From the table, we observe the decreasing trend
of the coeﬃcients according to the increase of the order. To obtain suﬃciently meaningful
statistics, we repeat the above process for 40 times with diﬀerent groups of electrons. Each
time we ﬁnd the maximum absolute value of the coeﬃcients for each order and calculate the
average absolute value for the coeﬃcients for each order. Then we take the average over the
25

Table 2.3: A local expansion
I
1
2
3
4
5
6
7
8
9
10

COEFFICIENT
-10173.27230228843
77.96202591617063
84.83337126367421
94.59868392563654
-3.199969762547595
4.505077944915953
2.824067476091086
1.929372636609234
-3.396989424915588
0.3759022864564996

p
0
1
1
1
2
2
2
2
2
2

EXP.
000
100
010
001
200
110
020
101
011
002

I
11
12
13
14
15
16
17
18
19
20

COEFFICIENT
-.3770785081465575
0.3065719640561415
1.430687924061862
0.1167018056511110
-1.761289530718459
0.7359490919857412
-1.758649785663027
-.2994523996221934
-.6566773810094656
1.173313105460495

p
3
3
3
3
3
3
3
3
3
3

EXP.
300
210
120
030
201
111
021
102
012
003

40 processes and present the result in Figure 2.5. The green dots show the maximum value
of the coeﬃcients, and the red dots show the average value of the coeﬃcients. It is clear that
the coeﬃcients decrease as the order increase, which suggests the convergence of the local
expansion.
If we want to calculate the expansion of the Coulomb interaction ﬁeld on a reference
point, we only need to determine which ﬁnest box the point belongs to, then by Eq. (2.13)
and Eq. (2.14) we can translate the local expansion of the potential from the center of the
ﬁnest box to the reference point. Taking the derivative of the position (x, y, z), we obtain
the expansion of the ﬁeld (Ex , Ey , Ez ) up to the order p − 1. The expansion of the ﬁeld
only includes the contribution of the electrons in the far region. As to the electrons in the
near region, we obtain the extension of their ﬁeld by representing each of them by Gaussian
distribution, which will be discussed in the following.

2.3.6

Gaussian Macroparticles Instead of Point Charges

Sometimes it is important to use a ﬁnite number of particles to represent a continuous
distribution. There are two important cases where this arises; the ﬁrst is when the number of
26

Figure 2.5: The absolute value of the diﬀerent order coeﬃcients of the local expansion. For
interpretation of the references to color in this and all other ﬁgures, the reader is referred to
the electronic version of this thesis
1000

Avg. Coef.
Max. Coef.

100

Coefficient

10

1

0.1

0.01

0.001
1

2

3

4

5
6
order

7

8

9

10

actual particles is substantially larger than the number than what can practically be handled
in a simulation. The second case is the desire to compute a transfer map and aberrations
of the system at hand, which requires the knowledge of high-order ﬁeld expansions near the
reference point, which cannot be done reliably if point charges are near. To avoid this, one
can use three dimensional spherical Gaussian distributions instead of point charges as
x2
r2
q
y2
z2
1
· exp(− 2 − 2 − 2 ) = √
· exp(− 2 ).
f (x, y, z) = √
σ
σ
σ
σ
( 2πσ)3
( 2πσ)3

27

(2.15)

It is easy to see from Gauss’s law that the ﬁeld along the radius can be expressed as

Er =

√
r
q
r2
√
2πerf( √ )σ 3 − 2rσ 2 exp(− 2 ) ,
2σ
r2 2πσ 3
2σ

(2.16)

√
where erf(r/ 2σ) is the error function. By setting a proper value for σ, we can use the
Gaussian distribution in eq. (2.15) to represent a point charge. When r ≫ σ, Er is equal to
the ﬁeld of a point charge. When r → 0, Er → 0. To calculate the expansion of the error
function, we need to ﬁnd all the derivatives of it. Considering the deﬁnition of the error
function
x
2
2
erf(x) = √
e−t dt,
π 0

(2.17)

it is easily to determine its derivative

2
2
derf(x)
= √ e−x .
dx
π

(2.18)

For any given point x, we calculate erf(x) by the rational Chebyshev approximation[24] and
derf(x)/dx by eq. (2.18), which is inﬁnitely diﬀerentiable and allows the computation of all
higher derivatives.

2.3.7

Example Calculations Exhibiting the Linear Scaling Property

To show the performance of the method, for reasons of space we limit ourselves here to the
most important characteristic, namely the scaling of the computation time to the number
of particles. We perform a series of simulations with expansion order 5 and with varying
numbers of electrons N on our computer with 4 core Intel Xeon Processor X5677 running
28

Figure 2.6: Computation time for diﬀerent number of electrons
250

Time (s)

200

150

100

50

0
0

0.2

0.4

0.6

0.8

1

N (106)

at 3.5 GHz. The single process program uses only one core in the computation. Figure 2.6
shows the required computation time for a simulation with varying particle numbers N. We
observe the nearly linear behavior, with a slight deviation around N = 0.4 × 106 , at which
point the number of particles crosses a threshold for transitions from a three-level scheme
to a four level scheme, which leads to slightly improved performance after the transition.
Overall, the expected linearity is well achieved.

2.4

Use of the FMM in tracking simulations

The FMM described above always starts from a cubical box, then cuts it into eight child
boxes and keeps all of them. The method works well when the bunch sizes in all three
dimensions are close to each other. However, in practice this is not always the case. In fact,
29

usually, the bunch sizes evolve with time, and the bunch may have an oblate shape where
one of the dimensions is much smaller than the others, or a prolate shape where one of the
dimensions is signiﬁcantly larger than the other two. In these cases, the FMM above will
lose some of its eﬃciency because a large number of the resulting boxes are empty.
However, what plays the critical role in the FMM algorithm is the hierarchical relation
between the boxes, while the details of the cutting can be modiﬁed as necessary. We will now
discuss a more eﬃcient way of cutting boxes which still maintains the necessary hierarchical
relation.

2.4.1

Strategy of cutting boxes

Assume we know the length of the bunch in all the three dimensions (Lx , Ly , Lz ). Without
losing generality, we assume Lx ≥ Ly ≥ Lz . We also know the total number of the particles
N , and the average number of particles inside each box of the ﬁnest level n, from which
we can obtain the least number of boxes needed Nb . The new idea of box cutting can be
described as follows.
1) Enclose the bunch by a cubic box whose size is slightly greater or equal to the longest
side of the bunch Lx . We say this box is of level zero. Without losing generality we assume
this box is centered at (0, 0, 0).
2) Try to cut this cube box from the center into eight equal size cube boxes. Compare
the coordinates (cx , cy , cz ) of the center of each child box with (Lx /2, Ly /2, Lz /2). If a child
box whose center satisﬁes |ci | < Li /2 for all the three directions, it is accepted. Otherwise,
if in any direction, |ci | < Li /2 is not satisﬁed, ci is set to zero to obtain a new center point
with the coordinates of the other directions unchanged, and the child box is replaced by a
new box that is centered at the newly obtained point. In this way the number of child boxes
30

is decreased. For example, assuming two child boxes with centers (x1 , y1 , z1 ) and (x1 , y1 , z2 ),
if in x and y directions, |ci | < Li /2 is satisﬁed, but the absolute values of both z1 and z2
are greater than Lz /2, they will be replaced by one box whose center is (x1 , y1 , 0). The side
length of the child boxes is set to be half of the parent box.
3) Cut each box of the current ﬁnest level in the same way as 3) to obtain the boxes of
a ﬁner level.
4) Repeat 3) until the box number at the ﬁnest level is greater than or equal to Nb .
To make the idea clearer, let us consider a two dimensional example as shown in Figure
2.7. The hatching part shows the dimensions of the bunch. First, we enclose the bunch by a
square box. Then we try to cut the square box into four child boxes, but we notice that the
y coordinates of all the centers of the child boxes do not satisfy |cy | < Ly /2. So we set the
y coordinates of them zero, and the four center points fall into two new points as Figure 2.7
shows. We use these two points as the centers of the new child boxes whose side length is
half of their parent box, so that we obtain two child boxes of the ﬁrst level instead of four.
Then we repeat the same cutting process for each box of the ﬁrst level, we obtain two child
boxes for each of them. So we have four boxes of the second level. When we cut the boxes
of the second level, we ﬁnd all the centers of the child boxes satisfy |ci | < Li /2, so we accept
all of them and obtain 16 boxes of the third level. As shown in Figure 2.7, when we use one
square box to enclose the bunch in the beginning, there is a lot of blank place in the box.
But after we ﬁnd the proper cutting, the dimension of the boxes ﬁts the bunch well. In this
way, we avoid the problem of a large number of empty boxes, which negatively aﬀects the
performance of the FMM method.
Although the strategy is described as above, in practice we do not need to check the
centers of all child boxes for each cutting. What we really need to know is how many times
31

Figure 2.7: The box cutting strategy

we need to cut each dimension for each level. For this purpose, we can simply compare the
size of each dimension to a scale. For the ﬁrst level the scale is set to be the size of the largest
dimension, and for each ﬁner level the scale is divided by two. Each dimension that is larger
than half of the scale will be cut. If a dimension is cut in some level, it will deﬁnitely be cut
in all the ﬁner levels. We label each box by four numbers (l, nx , ny , nz ). l is the level, and ni
is the index of box in the ith direction. ni = 0 if the ith direction is not cut in the lth level,
or ni ∈ [1, 2li ] if the ith direction is cut li times in the lth level. Assuming the zero level box
is centered at the origin (0, 0, 0), the center position of each box can be simply calculated
from its label as
clni = sl · ni + 0.5 · sl · (1 − 2li )

(2.19)

where clni is the center position in the ith dimension of the nth box of the lth level, and sl
is the box size of the lth level boxes. The center positions are needed when we translate or
convert the far multipole expansions and the local expansions.

32

Figure 2.8: Frame rotation

y

y′

x
x′

2.4.2

Frame rotation

Another method to avoid empty boxes is to set the frame consistent with the principal
axes of the bunch as shown in Figure 2.8 for a two dimensional case. We start from a
frame whose origin is the center of mass of the bunch. The position of the ith particle is
Xi = (xi1 , xi2 , xi3 ). If the coordinate axes are denoted by xj , j = 1, 2, 3, then the moment
of inertia coeﬃcient matrix element Ijk can be written as

Ijk =
i

2
mi (ri δjk − xij xik ),

(2.20)

2
where i is the index of the particle and ri = x2 + x2 + x2 . In the principal axes frame, only
i1
i2
i3

the diagonal elements of the matrix I are nonzero. Assuming a matrix P that diagonalizes I
′
′
′
as P−1 IP = I′ = diag{I11 , I22 , I33 }, the position of the particle in the principal axes frame

is X′ = P−1 X. If the electric ﬁeld in the principal axes frame is E′ , the ﬁeld in the original
frame is E = PE′ .
So when we have a bunch in the lab frame, we ﬁrst shift the origin of the frame to the
center of mass of the bunch, and rotate the frame so that it is consistent with the principal

33

axes of the bunch. We then cut the boxes as stated in the previous subsection (2.4.1) and
calculate the potential or/and the ﬁeld. Finally we translate the ﬁeld back to the lab frame.
This approach automatically determines an initial enclosing box of smallest possible
volume.

2.5

Examples of Tracking Simulations

Figure 2.9: Longitudinal bunch size during the free expansion of an electron bunch
120

P2P
FMM

Standard deviation of z (µm)

100

80

60

40

20

0
20

40

60

80

100

Time (ps)

We used the FMM in tracking simulations, and compared the results with those obtained
by the much more expensive use of a pairwise Coulomb formula. One example is shown in
Figure 2.9 and Figure 2.10. We have a bunch of 2,082,000 electrons with a three dimensional
Gaussian distribution. The initial size of the bunch is (115µm, 81µm, 1µm). We use 100,000
macroparticles, each of which represents 20.82 electrons, and a fourth order Runge-Kutta
34

Figure 2.10: Transverse bunch size during the free expansion of an electron bunch
160

Standard deviation of x/y (µm)

150

P2P x
P2P y
FMM x
FMM y

140
130
120
110
100
90
80
20

40

60

80

100

Time (ps)

integrator with ﬁxed step size to simulate the free expansion of the bunch without any
external ﬁeld. The step size is 1ps, and the simulation runs for 100 steps to 100ps. The
results of the FMM with a ﬁfth order DA expansion are presented in dots, and the results
of the pairwise Coulomb formula are presented in lines. The evolution of the bunch size
with respect to time in z direction is shown on the left, and those in x and y directions
are shown on the right. The ﬁnal bunch sizes are (155.21µm, 127.08µm, 103.15µm) and
(155.29µm, 127.06µm, 103.13µm) respectively. The relative diﬀerence is less than 0.0515%.
Figure 2.11 shows the computational expense for the simulations of the free expansion of
a proton bunch. The bunch has 2,082,000 protons with a uniform distribution in all dimensions, and its initial size is (66.45µm, 46.76µm, 0.58µm). We use a fourth order Runge-Kutta
integrator with ﬁxed step size 1ps and simulate to 100ps on our computer with 4 Core (8

35

Figure 2.11: Computational expense for the simulations for diﬀerent numbers of macroparticles
3000

2500

Time (mins)

2000

1500

1000

500

0
0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Number of macroparticles (× 1e6)

Hyperthreaded) Intel Xeon Processor X5677 running at approximately 3.5GHz. Our simulation program is a single process program, although the FMM can be parallelized in principle.
The ﬁnal bunch size is (66.52µm, 46.84µm, 29.57µm). The macroparticle numbers range from
100,000 to 1,000,000, and the computational expenses for the simulations with diﬀerent numbers of macroparticles are presented in Figure 2.11. Apparently the computational expense
does indeed increase linearly with the number of macroparticles, as expected from the above
theoretical arguments.

36

Table 2.4: Computational expenses for the Gaussian/uniform distribution bunches
Distr.
GS
U
GS
U
GS
U

2.6

dx (µm) dy(µm) dz(µm)
99.91
99.95
100.03
99.91
99.96
100.00
1.00
99.95
100.03
1.00
99.96
100.00
1.00
1.00
100.03
1.00
1.00
100.00

lx
5
5
0
0
3
3

ly
5
5
6
6
3
3

lz time (min)
5
16.10
5
5.93
6
10.64
6
3.00
9
10.44
9
6.53

Discussion Concerning the Eﬃciency

We have shown that the FMM can be used in the tracking simulation, and its computational
expense scales with the macroparticle number. However, we also notice that the eﬃciency of
treating a bunch with the Gaussian distribution is worse than that of a bunch with a uniform
distribution when they have the same number of macroparticles and similar shapes. Some
examples are presented in Table 2.4. We calculated the electric ﬁeld for a bunch of 1,000,000
electrons with either the Gaussian distribution or the uniform distribution of varying bunch
shapes. (dx, dy, dz) is the r.m.s. size of the bunch, and (lx, ly, lz) is how many times the
bunch is divided in (x, y, z) direction. The single process program runs on our computer
with 4 Core (8 Hyperthreaded) Intel Xeon Processor X5677 running at 3.5 GHz and uses
only one core in the computation. We can see the computational expense of a bunch with
a uniform distribution is much less than that of a bunch with the Gaussian distribution of
similar shape. This is because the FMM has the best eﬃciency when all of the boxes of the
ﬁnest level have the same number of particles inside. On the other hand, when a bunch has
a Gaussian distribution, its center has a much higher charge density than its edge, which
results in the boxes at the edge being either empty or enclosing few particles and the boxes
at the center enclosing many more particles than expected. To solve this problem, we need
to use the adaptive FMM, also called multiple level fast multipole algorithm (MLFMA),
37

by which we cut the boxes into ﬁner levels where the charge density is higher, so that we
can make sure all the boxes of the ﬁnest levels (diﬀerent lowest levels at diﬀerent positions)
enclose similar numbers of particles, which guarantees the eﬃciency [19, 20].

38

Chapter 3
Multiple Level Fast Multipole
Algorithm in the Diﬀerential Algebra
Framework
The single level FMM cuts the whole domain of charges into the same level boxes everywhere.
It works well when the charged particle distribution is uniform or close to it. However in
many cases the charged particles have a more complicated distribution. The loss of ability
to deal with heavily non-uniform distribution limits the use of the single level FMM. For
example, a Gaussian distribution has much higher charge density at the center than at the
edge. In this case, if we cut the whole domain to a very deep level so that the boxes at the
center enclose a small number of particles, we will end up with many empty boxes or boxes
with few particles at the edge, which reduces the calculation speed. A deeper level also leads
to a larger number of boxes and thus a larger amount of memory is needed, because of which
the program may fail due to lack of memory. If we do not cut to such a deep level, the boxes at
the center will enclose many more particles than they should, and this will also signiﬁcantly
lower the eﬃciency. In some cases the charge distribution may be more complicated. For
example, an FFAG accelerator has several bunches distributed along a helix. The single
level FMM has diﬃculties to deal with such kind of complicated distributions. However,

39

the multiple level fast multipole algorithm (MLFMA) can treat these complicated charge
distributions of N particles with an eﬃciency of O(N ), by cutting the whole domain into
diﬀerent levels at diﬀerent places according to the charge density there, which means higher
level boxes are created where higher density is and lower level boxes are created where lower
density is. So one can make sure the number of particles inside the boxes in the lowest levels
at diﬀerent places are close to each other, which improves the eﬃciency.[19, 20] To expand
the usage of the FMM in the beam dynamics simulation, we developed the MLFMA in the
DA framework, which we are going to present in the following.

40

3.1
3.1.1

Analytical Tools
The Far Multipole Expansion from the Charges

Suppose that n particles with charge qi located at ri (xi , yi , zi ) with ri < r0 , the electrostatic
potential at a point r(x, y, z) with r > r0 can be expressed as
n

qi

φ =
i=1

(x − xi )2 + (y − yi )2 + (z − zi )2

n

qi /

=
i=1

i=1

(3.1)

x2 +y 2 +z 2
2xi x
2yi y
2zi z
1+ i i i −
−
−
x2 +y 2 +z 2
x2 +y 2 +z 2
x2 +y 2 +z 2
x2 +y 2 +z 2

n

=

x2 + y 2 + z 2

d r · qi

2
2
1 + (x2 + yi + zi )d2 − 2xi dx − 2yi dy − 2zi dz
r
i

¯
= dr · φM
with

y
x
, dy = 2
,
dx = 2
x + y2 + z2
x + y2 + z2
z
dz = 2
, dr = d2 + d2 + d2 ,
x
y
z
x + y2 + z2
n

¯
φM =

qi /
i=1

2
2
1 + (x2 + yi + zi )d2 − 2xi dx − 2yi dy − 2zi dz .
r
i

¯
If we choose dx , dy , and dz as DA variables, φM can be expressed as a DA vector, which
¯
can be considered as the Taylor expansion of φM with dx , dy , and dz at inﬁnity. dr does
not exist in the DA framework, because dx , dy , and dz are all inﬁnitesimal, so is the sum of
their squares, and the square root operation on an inﬁnitesimal is not well deﬁned[14]. So
¯
we can not express φ in the DA framework. But it is enough to have the expansion of φM
in the DA framework and keep dr in mind. We will explain how it works in the following.
41

3.1.2

Translation of The Far Multipole Expansion

The potential φ at (x, y, z) of a far multipole expansion at the origin (0, 0, 0) can be translated
′ ′
into another far multipole expansion at the point (x′ , yo , zo ). The new DA variables can be
o

chosen as
x − x′
x′
o
=
= ′2 ,
r′2
r
′
y′
y − yo
= ′2 ,
d′ =
y
r′2
r
′
z − zo
z′
d′ =
= ′2 ,
z
r′2
r

d′
x

with r′ =

x′2 + y ′2 + z ′2 . With some direct algebra, one can ﬁnd the relation between the

new DA variables and the old DA variables as shown in Eq. (3.2).

dx = (d′ + x′ · (d′2 + d′2 + d′2 )) · R,
x
o
x
y
z
′
dy = (d′ + yo · (d′2 + d′2 + d′2 )) · R,
y
x
y
z

(3.2)

′
dz = (d′ + zo · (d′2 + d′2 + d′2 )) · R,
z
x
y
z

with
R =

1
′2
′2
′
′
1 + (x′2 + yo + zo )(d′2 + d′2 + d′2 ) + 2x′ d′ + 2yo d′ + 2zo d′
o
x
y
z
o x
y
z

.

Eq. (3.2) can be considered as the map between the new DA variables and the old DA
¯
˜
variables, which we refer to M1 . If we translate φM in the old frame into φM in the new
frame by
˜
¯
φ M = φ M ◦ M1 ,

42

where ◦ means the composition of two maps, the potential in the new frame can be written
as
˜
φ′ = φM · d ′ ·
r

√

¯
R = d ′ · φ′ ,
r
M

(3.3)

knowing that dr in the old frame can be translated into the new frame as

dr =

1
′
′
′
′
(x − x′ + x′ )2 + (y − yo + yo )2 + (z − zo + zo )2
o
o

1

=

′
′2
′2
(x − x′ )2 + (y − yo )2 + (z − zo )2 + x′2 + yo + zo
o
o
′ ′
′ ′
+2(x − x′ )x′ + 2(y − yo )yo + 2(z − zo )zo
o o

=

1
′2
′2
′
′
1/d′2 + x′2 + yo + zo + 2x′ d′ /d′2 + 2yo d′ /d′2 + 2zo d′ /d′2
r
o
o x r
y r
z r

d′
r

=

′2
′2
′
′
1 + (x′2 + yo + zo )d′2 + 2x′ d′ + 2yo d′ + 2zo d′
o
r
o x
y
z
√
= d′ · R,
r

with
d′ =
r

d′2 + d′2 + d′2 .
x
y
z

Same as dr , d′ does not exist in the DA framework, but for now we only need to calculate
r
¯
φ′ and keep the existence of d′ in mind.
r
M

3.1.3

Convert The Far Multipole Expansion into a Local Expansion

′ ′
Given a far multipole expansion at the origin (0, 0, 0), a local expansion at (x′ , yo , zo ), which
o

creates the same potential on the observers, can be found. It is called “local” because

43

′ ′
(x′ , yo , zo ) is close to the observer (x, y, z). So it is natural to choose the new DA variables
o

as

d ′ = x − x′ = x′ ,
x
o
′
d ′ = y − yo = y ′ ,
y

(3.4)

′
d ′ = z − zo = z ′ .
z

′ ′
(x′ , y ′ , z ′ ) are the new coordinates of the observer (x, y, z) if we shift the origin to (x′ , yo , zo ).
o

The old and the new DA variables have the relation as Eq. (3.5) shows
x
x′ + d ′
o
x
= ′
= (x′ + d′ ) · R,
o
x
′
′
x2 + y 2 + z 2
(xo + d′ )2 + (yo + d′ )2 + (zo + d′ )2
x
y
z
′
yo + d ′
y
y
′
= ′
= (yo + d′ ) · R,
dy = 2
y
′
′
x + y2 + z2
(xo + d′ )2 + (yo + d′ )2 + (zo + d′ )2
x
y
z

dx =

dz =

′
zo + d ′
z
z
′
= ′
= (zo + d′ ) · R.
z
′
′
x2 + y 2 + z 2
(xo + d′ )2 + (yo + d′ )2 + (zo + d′ )2
x
y
z

with
R =

(3.5)
1
.
′
′
(x′ + d′ )2 + (yo + d′ )2 + (zo + d′ )2
o
x
y
z

We note Eq. (3.5) as M2 . To convert the far multipole expansion in Eq. (3.3) into a local
¯
˜
expansion, we need to work on the two parts separately. φ′ can be converted into φL in the
M
new frame as
˜
¯
φ L = φ ′ ◦ M2 ,
M
√
and d′ can be converted into R in the new frame. Therefore we have the local expansion
r
as
˜
φL = φM ·
44

√

R.

(3.6)

Note that by now we have gotten the expansion of the potential, not just a component of it,
as a DA vector.

3.1.4

The Local Expansion from the Charges

Consider an observer point r(x, y, z) within a spherical region of the radius r0 centered at
′ ′
(x′ , yo , zo ), the electrostatic potential on the observer from n particles outside the spherical
o
′ ′
region can be expressed as a local expansion at (x′ , yo , zo ). We choose the DA variables as
o

d ′ = x − x′ = x′ ,
x
o
′
d ′ = y − yo = y ′ ,
y

(3.7)

′
d ′ = z − zo = z ′ .
z

Assuming the ith source particle has charge qi and located at ri (xi , yi , zi ), the local
expansion of the potential is
n

φL =
i=1
n

=
i=1

3.1.5

qi
(x − xi )2 + (y − yi )2 + (z − zi )2
qi
′
′
(x′ − xi + d′ )2 + (yo − yi + d′ )2 + (zo − zi + d′ )2
o
x
y
z

(3.8)

Translate The Local Expansion

′ ′
A local expansion at the origin (0, 0, 0) can be translated to (x′ , yo , zo ), assuming both points
o

are close to the observer (x, y, z). Choosing the new DA variable as the same as Eq. (3.7),

45

the relation between the old and the new DA variables is just a linear shift.

d x = x′ + d ′ ,
o
x
′
d y = yo + d ′ ,
y

(3.9)

′
dz = zo + d′ .
z

We call eq. (3.9) the map M3 , then the new local expansion can be calculated by

φ ′ = φ L ◦ M3 .
L

3.1.6

(3.10)

Calculate the Potential and the Field from the Expansions

It is straightforward to calculate the potential from a local expansion or a multipole expansion. Since we have the potential expressed as an pth order polynomial, we only need to plug
in the value of (dx , dy , dz ) for each particle to obtain the potential on it.
The local expansion of the potential is a polynomial of the observer’s coordinates. Taking
the derivative of the potential with respect to a coordinate, one can obtain the (p − 1)th
order polynomial of the ﬁeld on the respective direction.
To calculate the ﬁeld by the multipole expansion is more complicated, because the DA
variables are not the coordinates, but some analytical functions of the coordinates. We need
to take the derivative with respect to the coordinates by the chain rule. After some direct

46

algebra, we obtain the expression of the ﬁeld as
¯
¯
¯
∂φ
∂φ
∂φ
¯
· (d2 − 2d2 ) + 2
· dx dy + 2
· dx dz + φ · dx } · dr
r
x
∂dx
∂dy
∂dz
¯
¯
¯
∂φ
∂φ 2
∂φ
¯
Ey = {2
· dy dx −
(dr − 2d2 ) + 2
· dy dz + φ · dy } · dr
y
∂dx
∂dy
∂dz
¯
¯
¯
∂φ
∂φ
∂φ
¯
· dz dx + 2
· dz dy −
· (d2 − d2 ) + φ · dz } · dr
Ez = {2
r
z
∂dx
∂dy
∂dz

Ex = {−

with
dr =

(3.11)
d2 + d2 + d2 .
x
y
z

When we calculate the ﬁeld using Eq. (3.11), the part inside the brackets and dr needs to
be treated separately, because dr does not exist in the DA framework. Note that d2 can
r
be expressed as a DA vector, so the part inside the brackets can be calculated in the DA
framework. We ﬁrst calculate the value of it. Then we calculate dr and the multiplication
for each particle.

3.2

Error Analysis
Figure 3.1: Far multipole expansion of the potential of a point charge

rd

rs
θ

r

In order to give an error estimation of the far multipole expansion of the charges, let us
ﬁrst consider a source charge inside a spherical region of radius a as shown in Figure 3.1.
47

Without loss of the generality, we assume the center of the sphere is at the origin. The source
charge is at the position rs (xs , ys , zs ), the observer point is r(x, y, z), the distance between
them is rd = |rs − r| and the angle between r and rs is θ. The potential on the observer can
be expressed as a far multipole expansion as shown in Eq. (3.1).
1/rd can be expressed as an Taylor expansion shown in Eq. (3.12),
1
1
=
=
rd
|r − rs |

∞
n=0

1
1
n
· (−1)n · rs · n · ∇n .
n!
r

(3.12)

In the above ·n· is the n-fold tensor contraction operation, which is deﬁned as [7]
n
An · n · B n = An
α1...αn Bαn...α1

where the convention of implied summation over repeated Greek subscripts is followed. The
gradients of 1/r can be calculated as Eq. (3.13) shows according to the Maxwell Cartesian
tensors[7, 80],

n

n

n

∂α 1 ∂β 2 ∂γ 3

1
r

= (−1)n r−2n

F(n1 ) F(n2 ) F(n3 )









 n1   n2   n3 
(−1)m 



m1 =0 m2 =0 m3 =0
m1
m2
m3

×r2m f (n − m − 1)xn1 −2m1 y n2 −2m2 z n3 −2m3

(3.13)

with n = n1 + n2 + n3 , m = m1 + m2 + m3 , f (v) = 1 × 3 × 5 × ... × (2v + 1), F(x) = x and
2




n!
 n 
.

= m
2 m!(n − 2m)!
m
48

To compare with Eq. (3.12), we can also expand 1/rd in powers of r using the Legendre
Polynomials[94, 6].






1
=
rd 




n
rs
∞
n=0 rn+1 Pn (cos θ)

s
if rr < 1

∞
rn
n=0 rn+1 Pn (cos θ)
s

s
if rr > 1

(3.14)

where Pn (cos θ) are Legendre Polynomials. Comparing Eq. (3.12) with Eq. (3.14), one can
see they have a term-by-term correspondence, which indicates

Pn (cos θ) =

(−1)n n+1
·r
·
n!

1
rs n
· n · ∇n .
rs
r

(3.15)

Notice that the Legendre Polynomials Pn (cos θ) only depend on the angle θ.
¯
The Taylor expansion of φM in Eq. (3.1) can be expressed explicitly by tensors, such as
N

qi

¯
φM =

2
2
1 + (x2 + yi + zi )d2 − 2xi dx − 2yi dy − 2zi dz
r
i

i=1
N

=

=

∞

i=1 n=0
N ∞
i=1 n=0

qi n1 n2 n3 ¯
β γ
∂α ∂β ∂γ φM |(0,0,0) · dα dy dz
x
n!

(3.16)

An · n · dn .
r
i

where ∂α ,∂β , and ∂γ refer to the partial derivatives of dx , dy , and dz , dr = r/r2 , and
q n n n ¯
An n n = i ∂α1 ∂β 2 ∂γ 3 φM |(0,0,0)
i,n1 2 3
n!
=

qi
n!

F(n1 ) F(n2 ) F(n3 )









 n1   n2   n3 
(−1)m 



m1 =0 m2 =0 m3 =0
m3
m2
m1
n −2m1 n2 −2m2 n3 −2m3
yi
zi
.

2m
×ri f (n − m − 1) · xi 1

49

(3.17)

Comparing Eq. (3.17) with Eq. (3.13) we obtain
N

¯
φM =
=

∞

i=1 n=0
N ∞

rn (−1)n n+1
· ri
·
qi i ·
rn
n!
qi

i=1 n=0

r n
1
· n · ∇n
r
ri

(3.18)

ri n
· Pn (cos θi ),
r

where Eq. (3.15) is used to get the result. Hence the far multipole expansion of the potential
is
N

φ=

∞

i=1 n=0

n
ri
qi · n+1 · Pn (cos θi ).
r

(3.19)

If we keep the ﬁrst p terms in the expansion, The error can be estimated as follows.
N

|ǫ| =

∞

i=1 n=p+1
N

≤
=

|qi | ·

n
ri

qi · n+1 · Pn (cos θi )
r
∞

an

rn+1
n=p+1
∞
a n
ap+1
C · p+2 ·
r
r
n=0
1
ap+1
i=1

(3.20)

= C · p+2 ·
1− a
r
r
1
a p+1
,
·
= C·
r
r−a
where C =

N
i=1 |qi | and ri ≤ a for any i.

When we translate a far multipole expansion from a child box’s center to its parent
box’s center, because the transfer map between the DA variables in diﬀerent frame does not
contain a constant part, as shown in Eq. (3.2), the polynomial we obtain in the parent box
frame should be the same with the one calculated directly from the charges in the parent

50

Figure 3.2: Local expansion of the potential of a point charge

θ

r′

rs
box frame by Eq. (3.1), so that the error has the same expression with Eq. (3.20) with r
and a deﬁned in the parent box frame.
The local expansion from the charges is calculated as Eq. (3.8), which can be written as
Eq. (3.21) using Eq. (3.14).
N

φL =
i=1
N

=

=

qi
′
′
(x′ − xi + dx )2 + (yo − yi + dy )2 + (zo − zi + dz )2
o

qi
rs,i − r′

i=1
N ∞

i=1 n=0

(3.21)

r′n

qi · n+1 · Pn (cos θi ),
rs,i

′
where rs is the position of a source charge in the observer’s local frame with rs,i = ri − ro ,

r′ is the position of the observer in its local frame, and θ is the angle between rs and r′ as

51

shown in Figure 3.2. If we keep the ﬁrst p terms, then
N

|ǫ| =

∞

r′n
qi · n+1 · Pn (cos θi )
rs,i
i=1 n=p+1
N

≤

i=1

∞

|qi | ·

n=p+1
∞
′p+1

r
= C · p+2 ·
b

r′n
bn+1

n=0

r′
b

n

(3.22)

1
r′p+1
= C · p+2 ·
′
b
1− r
b
= C·
where C =

p+1

r′
b

·

1
,
b − r′

N
i=1 |qi | and rs,i ≥ b for any i.

When we convert the far multipole expansion into a local expansion, the polynomial we
obtain for the local expansion is diﬀerent with the one that is calculated directly from the
charges by Eq. (3.8). That’s because the transfer map between the DA variables contain a
constant term, so that the truncated terms of p+1 or higher order contribute to the constant
term of the local expansion, which is missing. From Eq. (3.5), we can see the constant term
′ ′2
′
of the transfer map is ro /ro where ro is the position of the local frame’s origin in the far

multipole frame, hence the missing terms are
N

Constmissing =

∞

i=1 n=p+1

An · n ·
i

′
ro
′2
ro

n

,

(3.23)

where An is deﬁned in Eq. (3.17). Similar to Eq. (3.20), the error of the missing constant
i
part is
ǫ′ ≤ C ·

1
a p+1
· ′
,
′
ro
ro − a
52

(3.24)

where C =

N
i=1 |qi | and ri ≤ a for any i. Then the error of the local expansion converted

from a far multipole expansion is

|ǫ| ≤ C ·

1
a p+1
· ′
+C ·
′
ro
ro − a

r′
b

p+1

·

1
.
b − r′

(3.25)

In the translation of the local expansion of a parent box into its child boxes, since the
transfer map is linear, there is no truncation error created in this process.

3.3
3.3.1

The Multiple Level Fast Multipole Algorithm
Strategy of the MLFMA

In 1988 Carrier, Greengard and Rokhlin introduced the adaptive FMM [19] , which is often
referred as the MLFMA. The general strategy of the MLFMA is similar to the uniform
level FMM. In order to evaluate the electrostatic potential and/or ﬁeld due to an arbitrary
distribution of charges, we divide the charges into groups and evaluate the interactions
between the groups far away enough by multipole expansions and calculate the interactions
between nearby particles directly.
To be more speciﬁc, consider a two dimensional example as shown in Figure 3.3. We want
to calculate the interaction of a bunch of charges with non-uniform distribution. Without
losing generality, we can enclose the charges with a square box, which we call the level zero
box. We select a number s. If the number of charges inside the box is larger than s, the
box is a parent box. It will be cut from the center into four equal square boxes, which are
its child boxes. All the child boxes form a new level. If a child box contains more than s
charges, it will be cut again and will have its child boxes. If the parent box is in level l, the
53

child boxes are in level l + 1. This process results in the hierarchical structure of the boxes
that reﬁne the level zero box into smaller and smaller regions. Diﬀerent with the uniform
level FMM in [93], we do not use the same number of levels for the whole domain. For
each level of reﬁnement, we only subdivide the boxes that contains more than s charges.
Generally where the higher charge density is, it is reﬁned with the higher level boxes, as
shown in Figure 3.3, where s = 1. At each level of reﬁnement, we only keep the nonempty
boxes. The empty boxes are discarded and completely ignored by the subsequent processes.
The three dimensional case can be treated in the same way; the only diﬀerence is the cube
box has eight child boxes.
Figure 3.3: The hierarchical structure of the boxes

3.3.2

Notation

We choose our notation following reference [19] as shown below. Note that all the boxes here
are nonempty boxes.
A child box is a nonempty box resulting from the division of a larger box.
54

If a box has one or more child boxes, it is called a parent box. Otherwise it is called a
childless box.
The ancestor boxes of box b include b’s parent box, the parent box of the parent box,
and so on till the root box.
If two boxes share at least one vertex, they are called adjacent. Otherwise they are well
separated.
Colleagues are adjacent boxes of the same size (at the same level). A given box has at
most 27 colleagues including itself. A two dimensional example is given in Figure 3.4.
Figure 3.4: Box (b) and its colleagues (c)

c

c

c

c

b

c

c

c

Because of the hierarchical structure, any box b only directly interacts with the boxes
who are colleagues of b’s parent box and their descents, and b inherits the contribution
from other boxes from its parent box. There are four kinds of relations between b and the
boxes who directly interacts with it. Based on their sizes and positions, which determine the
55

Figure 3.5: Case 1. Adjacent boxes

c

c

b

b

b

c

Figure 3.6: Case 2. Separated boxes of the same size

b

c

b

c

interactions, we can deﬁne four lists for box b as follows:
The ﬁrst kind of relation is shown in Figure 3.5. Box b and box c touch each other.
If both of them are childless, the ﬁeld between the charged particles in b and c should be
calculated directly, regardless of the box sizes of them. If any of them is a parent box, no
actions can be taken on this stage. Hence we can deﬁne list 1 of the box b, denoted by Ub .
If b is a parent box, Ub is empty. If b is childless, any box that is childless and adjacent to b
is contained in Ub .
The second kind of relation is shown in Figure 3.6. Box b and box c have the same
size. They are separated, and the distance between them is equal to or larger than their
side length. In this case, regardless whether they are childless boxes or parent boxes, we
can always translate the far multipole expansions of one box into the local expansions in
the other box, according to Eq. (3.25) with Eq. (3.20). Box b and the boxes in list 2 of b,
denoted by Vb , have this kind of relation. Any box whose parent is a colleague of b’s parent
and who itself is well separated from b is contained in Vb .
The third kind of relation is shown in Figure 3.7 on the left. Box b and box c are

56

Figure 3.7: Case 3 and case 4. Separated boxes of diﬀerent sizes

b

c

b

c

separated, box b is larger than box c, and the distance in between is equal to or greater than
the side length of c, but less than the side length of b. If b is childless, because the distance
between any particle inside b and the center of c is larger than the side length of c, according
to Eq. (3.20) the ﬁeld on the particles in b due to the particles in c could be calculated by the
far multipole expansion of c. For the same reason, according to Eq. (3.22) the ﬁeld inside c
due to the particles in b could represented by a local expansion in c calculated directly from
the particles in b using Eq. (3.8). If b is a parent box, no action can be taken on this stage.
Hence we can deﬁne list 3 of the box b , denoted by Wb . If b is a parent box, Wb is empty.
If b is a childless box, any box whose parent is adjacent to b and who itself is not adjacent
to b is contained in Wb .
The forth kind of relation is shown in Figure 3.7 on the right. This is the same with the
third kind of relation, except that box b and box c are replaced by each other. Hence we can
deﬁne the List 4 of the box b, denoted by Xb . Any box whose list 3 includes b is contained
in Xb .
We can add an extra list, List 5, for the box b, denoted by Yb . Any box that is not
adjacent to b’s parent is contained in Yb .
A two dimensional example of the above lists is given in Figure 3.8.

57

Figure 3.8: Box b and the associated lists 1-5

4

4

2

2

5

2

2

2

1

1

1

2

2

5

2
2

1

b

1

2

1 11 1
33
3 3 3

3
3

2

2

4

5

3.3.3

1

5

Brief Description of the Algorithm

Same with the single level FMM, MLFMA can be divided into a upward process and a
downward process. The algorithm can be described as follows.
(1) Build up the hierarchical tree structure of boxes by ﬁrstly selecting a number s, then
cutting the boxes until the number of particles in each childless box is less than or equal to
s.
(2) Upward process. From the ﬁnest level to the coarsest level, calculate the multiple

58

expansion for each box. If the box is childless, calculate its multipole expansion from the
particles inside using Eq. (3.1). If the box is a parent box, calculate its multipole expansion
by shifting and adding up the multipole expansions of its child boxes using Eq. (3.2) and
Eq. (3.3).
(3) Downward process. From the coarsest level to the ﬁnest level. For each box check
the boxes in its colleague list and in its parent box’s colleague list and the descents of these
boxes if needed, and take actions according to their relations. Considering box b, if c is
in Ub , calculate the ﬁeld inside b due to particles in c by the Coulomb force law if they
are both childless boxes. If c is in Vb , convert the multiple expansion of c into the local
expansion of b. If c is in Wb , calculate the ﬁeld inside b from the multipole expansion of c
using Eq. (3.11). If c is in Xb , calculate the local expansion of b due to the particles in c
using Eq. (3.8). Also translate the local expansion from the parent box into its child boxes.
After the downward process, all the boxes have a local expansion, which may come from
three sources, the particles inside other boxes, the multipole expansions of other boxes, and
the local expansions from the parent box. For all childless boxes, the ﬁeld inside has been
partially calculated from the particles inside the near boxes and/or the multipole expansions
from some other boxes.
(4) Field calculation. For each childless box, calculate the ﬁeld inside from the local
expansions and add the results to what have been calculated. By now we have calculated
the ﬁeld for all the particles.

3.3.4

Data Structure

We use several arrays to save the necessary information for the boxes. The data is saved in
such a way that (1) only the non-empty boxes are saved; (2) Boxes are saved consecutively;
59

(3) the coarser level boxes are saved before the ﬁner level boxes, as Figure 3.9 shows; and (4)
diﬀerent kinds of information are saved into diﬀerent arrays, using the index of the element
in the arrays as the identiﬁcation of the boxes, which means the elements with the same
index in diﬀerent arrays store diﬀerent information for the same box and the elements in the
same array with diﬀerent indexes store the same information for diﬀerent boxes. The data
structure is described with details in the following.
Figure 3.9: Sequenced boxes

Level 0

Level 1

Level 2

...

We use the linked list structure to save the hierarchical structure of the boxes. Two one
dimensional arrays, box and nchld, are used, both having the same length as the number of
boxes. The ith element of box saves the index of the ﬁrst child box of the box i, and the ith
element of nchld saves the number of child boxes of the box i. The ith element of both box
and nchld are zero if the box i is childless. So if the ith element of box has the value j, and
the corresponding element of nchld has the value k, we know that the box from j to j + k − 1
are all child boxes of the box i. The value of each element from j to j + k − 1 in the array
box tells us where its ﬁrst child box is, and the value of each box from j to j + k − 1 in the
array nchld tells us how many child boxes it has. Using box and nchld we can search the
hierarchical structure downwards, as shown in Figure 3.10. We use another one dimensional
array prnt with the same length as the number of boxes to save the index of the parent box
for each box. If the ith element of prnt has the value k, it means the parent box of the box i
60

is the box k. The ﬁrst element of prnt is zero, because the level zero box has no parent box.
Together with the array prnt, we can search the hierarchical structure upwards.
Figure 3.10: Storage of the hierarchical structure of the boxes

The child boxes of box i The child boxes of box j
Index of the box
Index of
the 1st child box

i

j j + 1j + 2

k

j

k

l

0

childless box
Number of
the child boxes

3

2

4

0

The one dimensional array lvl saves the level for each box. The two dimensional array
cntr saves the coordinates of the center for each box. The one dimensional array mltp and
lcl save the far multipole expansion and the local expansion for each box respectively.
The coordinates of all the particles are saved in three vectors, x, y and z, and the ﬁelds
on them are saved in three other vectors Ex , Ey , and Ez . Some operations on the vector
data type can improve the eﬃciency [15]. The linked list structure is used again to save the
information about the particles inside each childless box. If the ith box is childless, the index
of the ﬁrst particle inside it is saved in the ith element of the one dimensional array idx, and
the total number of particles inside is saved in the ith element of the one dimensional array
nptcl. For a parent box, the corresponding elements in idx and nptcl are both zero. There
is another one dimensional array link, whose length equals the number of particles. Each
element of link saves the index of the next particle, or zero if there are no more particles, in
the same childless box. In order to ﬁnd all the particles inside any childless box i, we ﬁrst
ﬁnd the ﬁrst particle from idx[i], then ﬁnd the following from link[i], link[link[i]] , ... until
61

we reach the zero. nptcl[i] gives the number of particles inside the box i, which is useful
when we need to create a vector to save the particles’ coordinates.
Each box may have at most 27 colleagues including itself. We use one dimensional
array clg to save the colleagues of all boxes. Each element of clg contains a vector, whose
ﬁrst element is the total number of the saved colleagues for the corresponding box and the
following elements are the indexes of those colleagues. The vector length is at most 28. Note
that for a box b, only the colleagues whose indexes are behind b are saved.
Table 3.1: Variables and their sizes
Array
Size
Array
Size
Vector
Size

box
1 × Nb
mltp
Nda × Nb
x
1 × Np

nchld
1 × Nb
lcl
Nda × Nb
y
1 × Np

lvl
1 × Nb
idx
1 × Nb
z
1 × Np

prnt
1 × Nb
nptcl
1 × Nb
Ex
1 × Np

cntr
3 × Nb
link
1 × Np
Ey
1 × Np

clg
28 × Nb
Ez
1 × Np

Table 3.1 shows the variables and their sizes. Nb means the number of boxes, Np means
the number of particles, and Nda means the size of a DA vector, which is determined by
the order of the DA vector and the number of DA variables. If a DA vector is an nth order
polynomial of v variables, Nda = C(n + v)/(C(n) · C(v)), where C(n) means factorial n [14].

3.4

Numerical Results

In the following we present some numerical results from experimental calculations. All the
calculations are performed on a computer with a Four Core (Eight Hyperthreaded) Intel
Xeon Processor X5677 running at approximately 3.5 GHz, but our program is a single
process program. In Table 3.2 we list some calculation results of an electron bunch with
the uniform distribution. N is the total number of the electrons, s is the largest number of
62

electrons that each childless box can hold, tM is the computation time using MLFMA, td
is the computation time using direct pair-to-pair formula, and Err1 , Err2 , and Err3 are
relative errors of Ex , Ey , and Ez , which is deﬁned as

Err =

2
i (EM,i − Ed,i )
,
2
i Ed,i

(3.26)

where EM is the electric ﬁeld calculated by MLFMA and Ed is the electric ﬁeld by direct pairto-pair formula. td for 400,000 or more electron cases is estimated according to the values
for less electron cases. Same results for an electron bunch with the Gaussian distribution
and a group of ten electron bunches, each one holding one tenth of the total electrons with
the Gaussian distribution, are listed in Table 3.3 and Table 3.4. From these results, we can
see the MLFMA has much better eﬃciency than the direct pair-to-pair calculation. If we
plot the computation time with the particle numbers, as shown in Figure 3.11, we can see
the computation time shows a roughly linear increase with the number of particles.
Figure 3.11: Relation between computation time and particle number
o

&RPSXWDWLRQoWLPHoePLQV(

s
28QLIRUP2EXQFK
2*DXVVLDQ2EXQFK
2*DXVVLDQ2EXQFKHV

a
G
h
c
u
b
m
r
r

bHUrh cHUrh GHUrh sHUrh mHUrG mfbHUrG
3DUWLFOHoQXPEHUo1

63

Table 3.2: Comparison of the computation time for the uniform distribution bunches of n
particles by direct calculation (td ) and MLFMA calculation (tM ) with DA order p and each
childless box holding at most s particles, given the relative errors (Err1 ,Err2 , and Err3 ) of
the ﬁeld in three diﬀerent direction
N
1e5
2e5
4e5
6e5
8e5
1e6

s
400
400
400
400
400
400

p
5
5
5
5
5
5

tM (min)
0.292
1.655
1.801
2.395
3.026
3.785

td (min)
75.315
300.030
1.205×103
2.711×103
4.820×103
7.532×103

Err1 (×10−4 )
1.241
1.498
1.482
1.538
1.522
1.520

Err2 (×10−4 )
1.388
1.504
1.519
1.550
1.497
1.402

Err3 (×10−4 )
1.271
1.498
1.554
1.524
1.537
1.526

Table 3.3: Comparison of the computation time for the Gaussian distribution bunches of n
particles by direct calculation (td ) and MLFMA calculation (tM ) with DA order p and each
childless box holding at most s particles, given the relative errors (Err1 ,Err2 , and Err3 ) of
the ﬁeld in three diﬀerent direction
N
1e5
2e5
4e5
6e5
8e5
1e6

s
400
400
400
400
400
400

p
5
5
5
5
5
5

tM (min)
0.693
1.448
2.624
4.325
5.932
7.420

td (min)
75.315
300.030
1.205×103
2.711×103
4.820×103
7.532×103

Err1 (×10−4 )
4.074
4.174
4.051
4.189
4.122
3.995

Err2 (×10−4 )
3.181
3.260
3.281
3.268
3.307
3.200

Err3 (×10−4 )
3.509
3.370
3.509
3.501
3.668
3.464

Table 3.4: Comparison of the computation time for the Gaussian distribution bunch groups
of n particles by direct calculation (td ) and MLFMA calculation (tM ) with DA order p and
each childless box holding at most s particles, given the relative errors (Err1 ,Err2 , and
Err3 ) of the ﬁeld in three diﬀerent direction
N
1e5
2e5
4e5
6e5
8e5
1e6

s
400
400
400
400
400
400

p
5
5
5
5
5
5

tM (min)
0.449
1.256
2.475
3.374
5.321
7.412

td (min)
75.315
300.030
1.205×103
2.711×103
4.820×103
7.532×103

Err1 (×10−4 )
2.864
5.276
1.505
7.115
7.693
2.972

64

Err2 (×10−4 )
4.257
4.954
3.852
3.332
4.574
6.436

Err3 (×10−4 )
3.262
4.040
3.124
3.765
4.163
3.652

3.5

Increasing of The Accuracy

From the discussion about the error in section 3.2, one can see the error is determined by
the truncation order p and the distance between the observers and the sources. To increase
the accuracy, one natural way is to keep more terms in the multipole expansions and the
local expansions. Besides that, one can also increase the distance between the source box, in
which the charges are approximately represented by the multipole expansion, to the observer
box. One only needs to change the deﬁnition of the “adjacent boxes” as follows. For a box b
Figure 3.12: Box b and the associated lists 1-5 with na = 2

1

1

2

1

4

2

1

2

1

1

1

1

2

4

1
1

1

b

1

1

1 11 1
11
3333
1 3333

1
1

1

1

1

1

65

2

2

1

2

2

with the side length sb , a box c with the side length sc is adjacent to b if the distance between
the center of b and the center of c is less than 0.5 · sb + na · sc , where na is a predeﬁned
positive integer. When na = 1, the deﬁnition is the same with what in section 3.3.2. One
can choose a larger na for better accuracy. One example of na = 2 in two dimensional case
is shown in Figure (3.12). Comparing with Figure (3.8), one can see in Figure (3.8) the
distance between observer box b and any source box c in Vb , Wb or Xb is at least one times
the side length of c, referred to sc ; however in Figure (3.12) this distance is at least 2 · sc .
It is easy to see for a given na , if a box c is well separated with box b, then there are at
least na boxes with the same size of c in between. So more boxes are included in the near
region of b, if we increase na . In Table 3.5, we present some numerical results for a bunch of
1,000,000 electrons with the Gaussian distribution. s is set to be 400. The electric ﬁelds are
calculated with diﬀerent p and diﬀerent na . Err is the average of Err1 , Err2 , and Err3 .
Figure 3.13 shows the error goes down faster with the DA order p for a larger na , which is as
expected. From Eq. (3.20), Eq. (3.22) and Eq. (3.25) one can see the error is in proportion
to K p . Considering two well separated boxes, K is determined by the rate of the box size
to their distance, and k is always less than one. When we increase na , K decreases, so that
the error converges faster. This explains Figure 3.13, in which when the error is plotted in
logarithm scale, it shows a linear relation with p, and the slope is larger when na is larger.
In practice, we care more about how much time is needed for a given accuracy. We plot the
error with the computation time for diﬀerent na in Figure 3.14, in which the error and the
computation time shows a roughly linear relation in the logarithm scale and the larger na
leads to the larger slope. This fact suggests a larger na may be a wise choice if we want
to achieve very high accuracy, for example 1/100,000,000 or even better. For clarity, in the
following chapters we assume na = 1, if na is not speciﬁed.
66

Table 3.5: Computation time tM and the relative error Err for a Gaussian bunch with
1000,000 electrons for diﬀerent DA order p and diﬀerent na , i.e. diﬀerent deﬁnition of the
”adjacent boxes”
p
3
4
5
6
7
8
9
10
3
4

na
1
1
1
1
1
1
1
1
2
2

tM (min)
3.275
4.916
8.528
16.866
32.620
61.569
115.170
212.149
13.270
20.947

Err
5.805 × 10−3
1.314 × 10−3
3.553 × 10−4
1.008 × 10−4
3.124 × 10−5
8.566 × 10−6
3.243 × 10−6
1.032 × 10−6
1.113 × 10−3
1.434 × 10−4

p
5
6
7
8
3
4
5
6
7
8

na
2
2
2
2
3
3
3
3
3
3

tM (min)
36.631
69.087
150.910
282.467
33.953
53.489
93.360
176.112
377.196
697.151

Err
2.267 × 10−5
3.592 × 10−6
6.099 × 10−7
1.160 × 10−7
3.823 × 10−4
3.453 × 10−5
3.914 × 10−6
4.369 × 10−7
5.250 × 10−8
8.368 × 10−9

Figure 3.13: Relation between the error and the DA order for diﬀerent na , i.e. the diﬀerent
deﬁnitions of the ”adjacent boxes”
r

3
3

(UURU

3
H=
H=
H=

QD- -
QD- -
QD- -

H=
H=





'$rRUGHUrS

67





Figure 3.14: Computation time of diﬀerent accuracies for diﬀerent na , i.e. the diﬀerent
deﬁnitions of the ”adjacent boxes”
r

3
3

(UURU

3
H2
H2
H2

QD- -
QD- -
QD- -

H2
H2




&RPSXWDWLRQrWLPHriPLQn

68

=

Chapter 4
Parallel Version of the Multiple Level
Fast Multipole Algorithm

4.1

Introduction

To simulate modern accelerators with high intensity beams, one may have to calculate the
interaction between billions of particles, which makes it necessary to take the advantage of
the parallel calculation and run the code in cluster machines with thousands of processors.
The single level fast multiple algorithm for uniformly distributed charged particles is easier
to parallelize. The MLFMA is relatively diﬃcult to parallelize, because its hierarchical box
structure depends on the distribution of charged particles, which keeps changing during the
simulation and makes it diﬃcult to predict how many boxes will be created and how many
boxes will have an interaction with a given box.
Our parallel approach is based on COSY Inﬁnity 9.1 with MPI support. The oﬃcial
release of COSY Inﬁnity 9.1 does not have MPI support built in, which is likely being rolled
out in the oﬃcial version 10. However, users can easily revise the source ﬁles and compile
COSY with MPI support following the instruction in COSY 9.1 programmer manual[15].
COSY supports the communication between the processes in a distributed cluster machine
by PLOOP, which has the following structure:
PLOOP I 1 NP ;
69

...
STATEMENTS ;
...
ENDPLOOP PMAP ;
where NP is the number of processes and PMAP is an array whose last dimension is equal to
or greater than NP. PLOOP distributes the STATEMENTS over all processes. Each process
takes actions and writes its result into the respective element of PMAP to share with the
other processes. After PLOOP, each process receives the results from all the others. As
one can see, PLOOP is an all-to-all communication, which is the only MPI communication
that is currently supported by COSY. In our parallel MLFMA code, all the communications
between processes use PLOOP.
There are some good discussions on how to parallelize MLFMA in both the shared memory machine and the distributed memory machine. The topics cover the creation, storage,
and sort of the partial octree that represents the hierarchical box structure, the calculation
of the multipole expansions, the calculation of the local expansions and the ﬁelds in parallel.
In 1987, Zhao developed a parallel algorithm for the single level FMM in the distributed
memory machine [94], soon after which Greengard and Gropp developed the parallel algorithm for the MLFMA in the shared memory machine [37]. The most challenging is the
parallel algorithm for the MLFMA in the distributed memory machine, which is highly desired since most modern super computers with thousands of processors have the distributed
memory structure and support MPI. The hierarchical box structure of the MLFMA is highly
unbalanced. It is diﬃcult to predict the total number of boxes and the number of interaction boxes for each single box, so that it is diﬃcult to predict the total calculation cost and
also diﬃcult to decompose and distribute the work load to each process. The MLFMA is
70

actively applied in solving Helmholtz equation related problems, such as the reﬂection of
electromagnetic waves, and the electrostatic potential related problems. Diﬀerent parallel
algorithms have been presented in these two areas [79, 78, 4, 44, 89, 88, 29, 33] and many
discussions have been made [91, 81, 82], while people in other areas also made their contribution [21, 92]. Teng [86] and Aluru [79] independently developed diﬀerent parallel algorithms
with provably good partitioning and load balancing. We borrowed some ideas, for example
the compressed tree from Aluru [3, 2], from the previous work. Now COSY only supports the
all-to-all communication, which is the most eﬃcient way to communicate with all processes.
We develop our own algorithm based on the all-to-all communication, and this has an eﬀect
on our algorithm, which will be explained in the following sections.

4.2

Hierarchical Tree Structure

4.2.1

Octree and Compressed Octree

It is natural to use a tree to present the hierarchical structure of the boxes. The largest
box that includes all the particles is the root of the tree. The child boxes of the largest box
construct the next level nodes connected to the root. A child box may also have child boxes,
which construct the following level nodes. In the three dimensional case, the hierarchical
structure of the boxes composes an octree. The data structure of a single level FMM is a
full tree. A two dimensional example is shown in Figure 4.1. As to a full tree, the number
of boxes (nodes) can be calculated as

L
L
n=0 4 for the two dimensional case and

L
L
n=0 8

for the three dimensional case, where L is the depth of the tree. If the charged particles
have a uniform distribution, so that all the childless boxes hold the same or very similar
number of particles, L should satisfy 4L > N/s > 4L−1 or 8L > N/s > 8L−1 for two
71

Figure 4.1: Full tree structure

dimensional and three dimensional, where N is the total number of particles and s is the
largest number of particles each childless box can hold. Thus L = ⌈log(N/s) − log 4⌉ or
L = ⌈log(N/s) − log 8⌉. The number of boxes inside the interaction list is also ﬁxed, 27 for
the two dimensional and 189 for the three dimensional. With all these information, it is
easier to parallelize the single level FMM. However, the hierarchical structure of the boxes
for the MLFMA should be represented as a partial octree. A two dimensional example of
the partial tree and the corresponding hierarchical box structure is shown in Figure 4.2.
Comparing with the full tree for single level FMM in Figure 4.1, some notes are missing
in the partial tree, which is decided by the distribution of the charged particles. And it is
impossible to predict the number of boxes only from N and s, if the charge distribution is
arbitrary and unpredictable.
In a normal tree, a box could have only one child box, if the child box is not childless.
In this case, the box and its only child box contains the same group of particles. A path
as shown in Figure 4.3 could happen. All the boxes of (b, d, ..., e, f ) contain the same group

72

Figure 4.2: Partial tree structure

Figure 4.3: From a normal tree to a compressed tree

a

a

b

f

c

d

g

c
h

e
f
g

h

of particles, although they have diﬀerent sizes. If we use the normal tree structure, all the
boxes of (b, d, ..., e, f ) will be stored in the memory, and their multipole expansions and local
expansions will be calculated, which is actually not necessary. To avoid this, a compressed
tree structure can be employed. The diﬀerence between a compressed tree and a normal
tree is any box in the compressed tree has at least two child boxes. As Figure 4.3 shows, to
change a normal tree into a compressed tree, one simply needs to remove all the boxes who
has only one child box, and links the remaining part of the branch to the parent box of the
removed boxes. We used the compressed octree in our algorithm.
Even if the compressed octree structure is used, it is still diﬃcult to predict the number of
73

boxes. In the worst case, the number of boxes could be very close to the number of particles.
Consider an two dimensional example in Figure 4.4. Assuming we have a bunch of charged
particles, which we enclosed by a square box, when we cut the box into the next level, we
ﬁnd one particle in each box, except that the box in the lower right corner has more than s
particles inside. In the next step we only need to cut the box at the lower right corner. After
we cut it, the same thing appears and only the box in the lower right corner is left to be cut
again. We repeat this process until the box in the lower right corner encloses s − 2 particles.
In this way the number of boxes we generate is N − s + 1, which is very close to the number
of particles N , since in general N is a very large number, from hundreds of thousands to
millions, while s is much smaller number, a few hundreds or less. In our code we do not want
to use the array with the length up to N to save the hierarchical box structure, because in
most cases we do not have so many boxes and it will be a waste of memory if we do so.
Instead we will estimate the length of the array according to our experience. If it turns out
the array is not long enough, the algorithm will quit and restart automatically with a larger
array.

4.2.2

Storage of the Compressed Octree

We use several arrays to save the octree and some related information. The data is saved in
such a way that (1) boxes are saved consecutively; (2) the higher level boxes are saved before
the lower level boxes, as Figure 4.5 shows; and (3) diﬀerent kinds of information are saved
into diﬀerent arrays, using the index of the element in the arrays as the identiﬁcation of
the boxes, which means the elements with the same index in diﬀerent arrays store diﬀerent
information for the same box and the elements with diﬀerent indexes in the same array
store the same kind of information for diﬀerent boxes. Please note in the compressed tree
74

Figure 4.4: The maximum number of boxes generated in the worst case

1

1

1

1

1
1

75

1 1
1

s-2

Figure 4.5: Sequenced boxes
Level 0

Level 1

Level 2

...

structure, there are two kinds of “level”. One is related to the position of a box inside the
tree, and the other is related to the size of a box. We use “level”, as in the above (2),
referring to the tree position level , and use “cutting level” referring to the box size level.
In a normal tree, all the boxes in the same level of the tree have the same box size, so there
is no diﬀerence between the two kinds of “level”. However in the compressed tree, since we
removed those boxes that have only one child each, we can not relate the box size directly
to its position in the tree. We have to use an extra number “cutting level” to represent the
box size. Take the compressed tree in Figure 4.3 as an example, box f and c are in the same
level, but f has a smaller size, hence a ﬁner cutting level. The data structure is described
with details in the following.
We use the linked list structure to save the hierarchical structure of the boxes. Two one
dimensional arrays, box and nchld, are used, both having the same length as the number of
boxes. The ith element of box saves the index of the ﬁrst child box of the box i, and the
ith element of nchld saves the number of child boxes of the box i. The ith element of both
box and nchld are zero if the box i is childless. So if the ith element of box has the value
j, and the corresponding element of nchld has the value k, we know that the box from j to
j + k − 1 are all child boxes of the box i. The value of each element from j to j + k − 1
in the array box tells us where its ﬁrst child box is, and the value of each box from j to
j + k − 1 in the array nchld tells us how many child boxes it has. Using box and nchld
76

Figure 4.6: Storage of the compressed tree

The child boxes of box i The child boxes of box j
Index of the box
Index of
the 1st child box

i

j j + 1j + 2

k

j

k

l

0

childless box
Number of
the child boxes

3

2

4

0

we can search the compressed tree structure downwards, as shown in Figure 4.6. We use
another one dimensional array prnt with the same length as the number of boxes to save the
index of the parent box for each box. If the ith element of prnt has the value k, it means
the parent box of the box i is the box k. The ﬁrst element of prnt is zero, because the level
zero box has no parent box. Together with the array prnt, we can search the compressed
tree structure upwards.
The one dimensional array lvl saves the level for each box. The two dimensional array
cntr saves the coordinates of the center for each box. Each processor save the coordinates
of npp particles in three vectors x, y, and z. The total particle number N = npp · np , if
our program runs on np nodes. Another three vectors, xx, yy and zz are used to save the
coordinates of all the particles, in order to decrease the communication time.
The linked list structure is used again to save the information about the particles inside
each childless box. If the ith box is childless, the index of the ﬁrst particle inside it is
saved in the ith element of the one dimensional array idx, and the total number of particles
inside is saved in the ith element of the one dimensional array nptcl. For a parent box, the
corresponding elements in idx and nptcl are both zero. There is another one dimensional
array link, whose length equals the number of particles. Each element of link saves the index
77

of the next particle, or zero if no more particles, in the same childless box. In order to ﬁnd
all the particles inside any childless box i, we ﬁrst ﬁnd the ﬁrst particle from idx[i], then
ﬁnd the following from link[i], link[link[i]] , ... until we reach the zero. nptcl[i] gives the
number of particles inside the box i, which is useful when we need to create a vector to save
the particles’ coordinates.

4.2.3

Tree Construction

In the parallel algorithm, each process saves the coordinates of npp particles. The basic idea
of the tree construction in parallel is to let each process only deal with the npp particles saved
in itself. To determine whether a box should be split, we only need to know the number
of particles inside the box, which can be calculated by gathering the information from all
processes and take the summation. During the tree construction procedure, each process
uses a temporary array plink with the length npp to save which box each particle belongs
to. In the end we will gather plink from all processes, and by linking the paths saved in
diﬀerent plink for each childless box, we can construct the array link. We will also gather
x, y, and z from all the processes and construct xx, yy, and zz for future computation. The
procedure of the parallel tree construction is shown in Figure 4.7.
The algorithm of the tree construction in parallel can be described as follows.
(1) Each process saves the coordinates of npp particles.
(2) Calculate the range of the coordinates in each process, gather the results from all
processes to determine the proper size of the root box that contains all the particles.
(3) Set the value of the ﬁrst element in cntr, lvl and nptcl, which represents the root
box.
(4) Use two pointers to record the range of the boxes to treat. Initialize them as pnt1 = 1
78

Figure 4.7: Parallel tree construction

Hold particles

Hold particles

Find the
root box

Find the
root box

Cut the box

Cut the box

Y

Y
Any
parent boxes?

Any
parent boxes?

N

N

Gather all
particles and
save them

Gather all
particles and
save them

79

and pnt2 = 1.
(5) For all the boxes between pnt1 and pnt2, if it contains more than s particles, split
it into eight child boxes, distribute the particles into the child boxes and for each child box
save the number of particles inside it and save the particle list in idx and plink.
(6) For all the boxes between pnt1 and pnt2, gather the number of particles in their child
boxes. Check each box between pnt1 and pnt2 in sequence as follows. For each child box
take the summation of all processes to obtain the total particle number. If a box has only
one child box, replace the information of the parent box by its child box and split the child
box. Repeat until two or more child boxes are found. If a box has more than one child
boxes, set the respective values in box, nchld, and nptcl for the parent box, and in cntr, lvl,
and prnt for the child boxes.
(7) Set pnt1 = pnt2 + 1, and then make pnt2 point to the last element in the box array.
(8) Repeat (5) to (7) until all the childless boxes contain less than s particles.
(9) Gather the idx value for each childless box and plink from all processes, and set
proper values for idx and link.
(10) Scale x, y, and z so that the side length of the smallest childless box is 2. Gather
the scaled x, y, and z from all the processes and construct xx, yy, and zz, which save the
coordinates of all particles.

4.3
4.3.1

Calculation of the Far Multipole Expansions
Tree Partitioning

In order to increase the eﬃciency we want each process to deal with only a part of the tree.
We partition the whole tree in such a way that each process is assigned the same number
80

Figure 4.8: Tree partitioning

Process 1

Process 2

Process 3

Process 4

of childless boxes and all the ancestor boxes of them. Take the tree in Figure 4.8 as an
example. We have totally 18 childless boxes in the tree. Assume we want to distribute this
tree into four processes. First, we sequence all the childless boxes from left to the right, no
matter which level the box belongs. The we assign the ﬁrst ﬁve childless boxes and their
ancestors to the ﬁrst process, the following ﬁve to the second process, the following four to
the third process, and ﬁnal four to the forth process. In this way we try to put as many
childless boxes who share the same ancestors as possible to the same process. In Figure 4.8
the belongings to the diﬀerent processes are presented by diﬀerent colors. One can see that
some ancestor boxes belong to multiple processes, which is presented by multiple colors. It
is clear that two consecutive processes can share at most L − 1 ancestor boxes, with L being
the depth of the tree. Since each parent box has at least two child boxes in the compressed
tree, if there are nc childless boxes, the length of the local tree is limited by 2l − 1, where
l = 1 + (log nc − log np )/ log 2 .

81

Figure 4.9: Parallel multipole expansion calculation

M for
childless boxes

UPWARDS
M for
parent boxes

UPWARDS
M for
parent boxes

Save all M
In local MEM

4.3.2

M for
childless boxes

Save all M
In local MEM

Parallel Calculation of the Far Multipole Expansions

The local tree is saved in two arrays similar to box and prnt, which let us search upwards
and downwards the tree, in a sequence that a child box is always before its parent box. For
an arbitrary tree, the boxes can be sequenced by labeling them in the following way.
(1) Check all the boxes from the root of the tree.
(2) If the box is childless or all the child boxes of it have been labeled, label this box and
go back to its parent box. Otherwise check the ﬁrst unlabeled child box of it.
(3) Continue until all the boxes are labeled.
The procedure of calculating the far multipole expansions in parallel is shown in Figure
4.9, and the algorithm can be described as follows.
(1) Each process calculates the far multipole expansions for the boxes in its own local
tree from the beginning to the end of the array. If the box is childless, the far multipole
expansion is calculated from the particles inside the box. If the box is a parent box, the far
82

multipole expansion is calculated by shifting the far multipole expansions of its child boxes
to the center of the parent box and taking the summation. Since any box is sequenced before
its parent box, when we need to calculate the far multipole expansion of a parent box, the
far multipole expansions of all its child boxes have been calculated.
(2) Gather the far multipole expansions from all processes and save them in the array
mltp for future calculation. If a box belongs to multiple processes, which means its far
multipole expansion has been calculated by multiple processes, take the summation.
When using PLOOP to gather all the far multipole expansions in (2), the length of the
element in PMAP is determined by the longest local tree. The total memory needed is the
production of the memory needed to save the far multipole expansions of the longest local
tree and the number of processes, which is often more than the memory needed to save all
the far multipole expansions of the whole octree. In some case, when the local trees are
high unbalanced, the overestimation may drain too much memory and cause the failure of
the algorithm. An alternative that avoids this is to calculate the far multipole expansions
according to the diﬀerent level in the whole octree, which can be described as follows.
(1) Find the boxes in the ﬁnest level of the whole octree and distribute them evenly to
all processes.
(2) Each process calculates the far multipole expansion for the boxes it gets.
(3) Communicate between all processes to share the far multipole expansions that are
calculated in (2).
(4) Find the boxes in the coarser level and repeat (2) and (3), until all the boxes are
treated.
In this way we will need L times of communication, where L is the depth of the whole
octree, but in each communication the memory needed is much less than the total memory
83

needed to store the far multipole expansions of the whole octree. The overestimation in
PMAP is less than the memory needed to save np far multipole expansions. Note that the
sequenced local tree is still useful in the following calculation, although not used in the
calculation of far multipole expansions.

4.4
4.4.1

Calculation of the Local Expansions and the Fields
Relation between Boxes

To calculate the local expansions, we have to consider the relation between the boxes, which
is a little more complicated in a compressed tree than in a normal tree. For any arbitrary
two boxes, there are three kinds of relations between them according to their positions and
their sizes. Consider two boxes b and c. Without loss of the generality, we assume b is ether
larger than or have the same size with c, referred to as Sb ≥ Sc , where Sb and Sc are the
side length of b and c. d is the largest value of the projections of the distance between the
centers of b and c in x, y, and z direction. If d > (0.5 + na ) · Sb , where na is a predetermined
positive integer, b and c are well separated; if (0.5+na )·Sb > d > 0.5·Sb +na ·Sc , they are ill
separated; otherwise they are adjacent. Roughly say, if we can put na larger boxes between
the two boxes, they are well separated; if we can’t do that but we can put na smaller boxes
in between, they are ill separated; otherwise they are adjacent.Note that the value of d is
discrete. if d > (0.5 + na ) · Sb , the smallest value of d is (0.5 + na ) · Sb + 0.5 · Sc . Figure 4.10
shows examples for the three kinds of relations between two boxes for na = 1 and na = 2.
Note that when na > 1, the two boxes that are adjacent to each other are not necessarily
touching each other.
According to the strategy of the FMM, for any box b, the ﬁeld or potential of those
84

Figure 4.10: Relations between two boxes

na=1

Well separated

Ill separated

Adjacent

na=2

Well separated

Ill separated

Adjacent

particles close to it is calculated directly, the contribution of those particles further away
is calculated by the multipole expansions or the local expansions, and as to those particles
more further away, their contributions are represented by the local expansions in the ancestor
boxes of b and will be transferred to b. So every box only interacts with those boxes who
do not have interactions with its ancestor boxes. For each parent box b, we set up an action
list. The list contains the following four kinds of boxes: (1) the boxes in the same level
of b , who are smaller than b and adjacent to or ill separated with b, (2) the boxes in the
same level of b, who are larger than b and adjacent to b, (3) the boxes in the lower level of
b, who are childless and adjacent to b, and (4) the box b itself. These boxes in the action
list of b and/or their descendants enclose the particles whose contributions are not included
in the local expansions of b or its ancestor boxes. And, these boxes will interact with the
child boxes of b. A childless box has no action list. The length of the action list is limited
by (2 · na + 1)D , where D is the dimension of the problem. D = 3 for a three dimensional
problem.

85

The action lists for all the parent boxes can be generated in parallel as follows.
(1) Each process initializes the action list for the root, which only contains the root box
itself.
(2) Each process checks the local tree from the last element to the ﬁrst element. If the
element b is a parent box, do step (3) to (5) to generate its action list. Note b is stored before
its parent box, so the action list of b’s parent box is always generated before b gets checked,
if we check the local tree backwards.
(3) Find the parent box of b. Check each box in the action list of b’s parent box. If the
box is childless and adjacent to b, add it into b’s action list. If the box is a parent box, check
all of its child boxes as step (4) states.
(4) If the box is larger than b and adjacent to b, add it to b’s action list. If the box is
smaller than or in the same size with b and adjacent to or ill separated with b, add it to b’s
action list.
(5) Add b itself into b’s action list.

4.4.2

Parallel Calculation of the Local Expansions and the ﬁelds

Assume we want to calculate the local expansion or the ﬁeld insight the box b, and w is a
box in the action list of b’s parent box or its descendant. The diﬀerent actions we can take
depend on the diﬀerent conditions of b and w, which are shown in the Table 4.1. The
in the “Box size” column means both Sb < Sw and Sb ≥ Sw , and the

in the “box b”

and “Box w” columns means both “childless” and “parent”. Cw → Lb means to calculate
the local expansion in b by the charges in w; Mw → Cb means to calculate the ﬁeld on the
charges inside b from the multipole expansion in w; Cw → Cb means to calculate the ﬁeld
on the charges inside b from the charges inside w by the Coulomb’s theory; Mw → Lb means
86

Table 4.1: Relations between two boxes and the respective actions
Relation

Box size

Box b

Sb < Sw
ill separated
Sb ≥ Sw
Adjacent

childless
parent
childless
parent

well separated

Box w
childless
parent

Action
C w → Lb
Check w’s descendants
Mw → C b
Add w into b’s action list
childless
Cw → Cb
parent
Check w’s descendants
Add w into b’s action list
Mw → L b

to translate the multipole expansion in w into the local expansion in b.
The procedure to calculate the local expansions and the ﬁelds inside each box in parallel
is shown in Figure 4.11 and the algorithm can be described as follows.
(1) Each process creates the action list for each box in its local tree.
(2) Each process creates three arrays pex, pey, and pez to save the ﬁeld in x, y, and z
direction of the particles inside the childless boxes in its local tree.
(3) Each process checks the local tree from the last element to the ﬁrst element, and
calculates the local expansion and the ﬁeld as steps (4) and (5) show.
(4) If a box b in the local tree is a parent box, we only need to calculate the local expansion
in it. The local expansion comes from three sources, inherited from b’s parent box, Cw → Lb ,
and Mw → Lb , where w is a box in the action list of b’s parent box or its descendant. So
to calculate the local expansion of b, ﬁrstly check each box in the action list of b’s parent
box. If the box is a childless box, take actions according to Table 4.1. If the box is a parent
box, check all of its child boxes one by one and take actions according to Table 4.1. Then
translate the local expansion of b’s parent box into b and add it to the local expansion that
has been calculated from Cw → Lb , and Mw → Lb .
(5) If a box b in the local tree is a childless box, we need to calculate the ﬁeld on the

87

Figure 4.11: Parallel local expansion and ﬁeld calculation

Action lists
for all boxes

Action lists
for all boxes

DOWNWARDS
L for
all boxes

DOWNWARDS
L for
all boxes

Field/Potential
Calculation

Field/Potential
Calculation

Field/Potential
Distribution

Field/Potential
Distribution

88

charges inside it. The ﬁeld comes from three sources, local expansion of b to the charges
inside it, Cw → Cb and Mw → Cb . Check each box in the action list of b’s parent box.
If the box is a childless box, take actions according to Table 4.1. If the box is a parent
box, check all of its child boxes one by one and take actions according to Table 4.1. Then
translate the local expansion of b’s parent box into b and add it to the local expansion that
has been calculated from Cw → Lb , and Mw → Lb . Finally calculate the ﬁeld from the local
expansion of b and add it to the ﬁeld that has been calculated from Cw → Cb and Mw → Cb .
(6) Gather the ﬁeld data from all the processes. Each process searches for the respective
ﬁeld for its own particles. Note that the particles whose ﬁelds are calculated by a process
depends on the childless boxes in which they are enclosed, so generally they are not the same
particles stored in the process.

4.5

Numerical Results and Discussion

To test the performance of the parallel algorithm, we performed some calculations on our
small cluster machine with 48 nodes, each running at 2.2 GHz. Figure 4.12 shows the computational cost to calculate the three dimensional electric ﬁeld between 1,000,000 electrons
of the Gaussian distribution with diﬀerent numbers of processes. It cost 894 seconds if using
only one node. As the number of processes increases, the cost of time decreases. And it cost
only 30 seconds when using all the 48 nodes. In these calculations, the DA order is kept up
to ﬁve and the relative error is less than 0.001. The parallel eﬃciency can be deﬁned as

Ef f =

Tseq (N )
,
q · Tp (N, np )

89

(4.1)

Figure 4.12: Computation time for 1,000,000 electrons with diﬀerent number of processes
900
800
700

time (s)

600
500
400
300
200
100
0
0

5

10

15

20
25
30
35
number of processes

40

45

50

where np is the number of processes, N is the number of unknowns, which is 1,000,000,
the number of particles, in this case, Tseq is the time cost of the sequential algorithm, and
Tp is the time cost of the parallel algorithm.

As Figure 4.13 shows the parallel eﬃciency

decreases with the number of processes. When we use all the 48 nodes, the parallel eﬃciency
is still above 60%.
Currently we can calculate the three dimensional ﬁeld for at most about 10,000,000
charged particles with the parallel algorithm. Using 96 nodes from the cluster machine of
MSU HPCC, it takes 167 seconds with the DA order up to ﬁve and relative error less than
0.001. The parallel eﬃciency is also about 60%. The bottleneck of the current parallel
algorithm is the memory cost. Without a proper compiler, for now we can only compile
COSY Inﬁnity 9.1 as a 32 bit Fortran program; so each process can only assess no more than
2 GB memory. We save the coordinates of all particles and all the multipole expansions in
the local memory of each process. The memory cost increases with the number of particles,
90

Figure 4.13: Parallel eﬃciency for 1,000,000 electrons with diﬀerent number of processes
1
0.95

parallel efficiency

0.9
0.85
0.8
0.75
0.7
0.65
0.6
0

5

10

15

20
25
30
35
number of processes

40

45

50

no matter how many processes we use. Since the memory that each process can assess is
limited by 2 GB, the number of particles that we can treat is also limited. If we do not save
these data in the local memory of each process, one process may need to obtain some data
stored in another process in the calculation. Because we only have all-to-all communication,
all the processes have to stop and wait, even if the communication is only needed between
two processes. Since this kind of communication will happen frequently, the eﬃciency will
be heavily aﬀected. In one sentence, we save this data in local memory to avoid too many
unnecessary all-to-all communications in order to keep a better eﬃciency. The price we paid
is the memory.

91

Chapter 5
Using the Fast Multipole Method in
Beam Dynamics Simulations
The essential purpose of developing the DA based MLFMA is to simulate the space charge
eﬀects in the high density system. In this chapter it will be presented how the algorithm
can be used in beam dynamics simulations. As an example, we simulated the photoemission
process of applying a laser beam on a metal surface, in which the space charge eﬀect plays
an important role. The charge distributions are illustrated in diﬀerent phase space plots in
the early stage soon after the electrons are created. Some related topics are discussed, such
as how to choose the proper units for the variables in the beam dynamics equations, how to
transform the space charge ﬁelds from the bunch frame to the laboratory frame, and how to
avoid artiﬁcial collisions between the charged particles.

5.1

Beam Dynamics Equations

The ﬁrst step of tracking the motion of a charged particles in electromagnetic ﬁeld is to ﬁnd
the proper dynamics equations of (p, x), where p represents the momentum and x represents
the position of the particle. The diﬀerential equation for x can be simply written down as

p
dx
=
,
dt
γm
92

(5.1)

where γ is the Lorentz factor and m is the static mass of the particle. The diﬀerential
equation of v can be derived from the Lorentz force formula and Newton’s second law.

p
dp
= F = q E + qv × B = q E + q
× B,
dt
γm

(5.2)

where t is the time, and E and B are the electric and magnetic ﬁelds experienced by the
particle. γ in Eq.(5.1) and Eq.(5.2) can be expressed as a function of p, such as

γ=

1+

p2
,
m2 c2

(5.3)

where c is the speed of light. Using Eq.(5.3), catastrophic cancellation of signiﬁcant digits
can be avoided in the computation. Then, the dynamics equations are
p

dx
=
dt
dp
= qE +
dt

m

2
1+ p 2
2c
m

qp
2
1+ p 2
m2 c

m

,
(5.4)

× B.

In order to avoid too small or too large numbers in calculation, an appropriate choice of
the units is necessary. For example, in most cases time t is a small number because of the
fast speed of the particles, but we can multiple with a constant number, the speed of light
c, to the time t and use meter as the unit, so that the scale of time is in a proper range.
Divided by c in both sides, Eq.(5.4) becomes

93

dx
=β=
dτ
dp
q
= E+
dτ
c

p
mc

2
1+ p 2
2c
m

qp
mc

2
1+ p 2
2c
m

,
(5.5)

× B,

where τ = tc and β = v/c is the dimensionless Lorentz factor. If one chooses the following
units for the other variables,

x = x[m] p = p[λeV/c], q = q[e], m = m[λeV/c2 ],
(5.6)
E = E[λV/m], B = B[T] = B[Vs/m2 ], c = c[λm/s].
where λ is one of the SI preﬁxes such as k,M, G, etc., substituting them into Eq.(5.5), all
the units are canceled and one gets the dimensionless dynamics equations as Eq.(5.7) shows.
dx
=β=
dτ
dp
= qE + qc
dτ

p
m

2
1 + p2
m

p
m

2
1 + p2
m

=

p
,
γm
(5.7)

× B = q E + cβ × B .

In the above, one coeﬃcient λ is used to control the scale. In our simulation, we choose
λ = 106 . Then the units are

x = x[m] p = p[MeV/c], q = q[e], m = m[MeV/c2 ],
(5.8)
E = E[MV/m], B = B[T] = B[Vs/m2 ], c = c[Mm/s].

94

5.2

The Lorentz Transformation of the Space Charge
Field

In Eq.(5.7), all the physical quantities, including E and B, are measured in the laboratory
frame. E and B usually have two components, the external ﬁeld and the space charge ﬁeld.
The external ﬁeld is often calculated in the laboratory frame. But the space charge ﬁeld
is calculated as the static Coulomb ﬁeld in the bunch frame, which moves together with
the bunch, under the assumption that all the particles have similar momenta. If the bunch
moves with a speed close to the speed of light, we have to consider the relativistic eﬀects
and convert the space charge ﬁeld from the bunch frame into the laboratory frame by the
Lorentz transform.
We refer to the laboratory frame as Σ and the bunch frame Σ′ . All the physical quantities
measured in Σ are represented by an alphabet without prime and all that measured in Σ′
with a prime. For example, x, y, and z are position coordinates in Σ and x′ , y ′ , and z ′ are
position coordinates in Σ′ . Assume in Σ the bunch moves with velocity vz in z direction,
which in practice can be chosen as the average velocity of all particles in z direction, then
we have

x′ = x,
y ′ = y,

(5.9)

z ′ = γ · (z − vz t),
where γ =

2
1 − vz /c2 . (x′ , y ′ , z ′ ) is used to calculate the Coulomb ﬁeld between the particles

using the DA based MLFMA in Σ′ . It is apparent that B ′ = 0 in Σ′ . Then we convert E ′
95

in Σ′ into E and B in Σ by Eq. (5.10).

′
Ex = γEx ,
′
Ey = γEy ,
′
Ez = Ez ,

vz
′
Bx = −γ 2 · Ey ,
c
vz
′
By = γ 2 · Ex ,
c

(5.10)

Bz = 0,

where c is the speed of light. Knowing E and B we can integrate the beam dynamics
equations in the laboratory frame.

5.3

Collisions Between Charged Particles in the Near
Region

Since the dynamics equations are numerically solved on discrete time steps, an artiﬁcial
collision between charged particles, as shown in Figure 5.1, could happen in the near region,
where the ﬁeld is calculated directly by the the Coulomb force law. In the real world, the
charged particles feel a stronger repulsive force between them when they move closer and
thus they follow the traces like the green curves in the picture. But when we integrate the
dynamics equations in a discrete time step, the force between the pair of charged particles
is calculated at the very beginning and is kept as a constant during the whole time step.
The force is underestimated when the pair of particles move closer to each other. It could
happen that in some extreme case the two particles get very close to each other, as the black
lines show in the picture, and they will suddenly feel a very strong force in the next step,
which leads to a sudden increasing of their momentum. And later they will leave the bunch.
Actually when we simulate millions of particles, this artiﬁcial collision will not aﬀect the
96

Figure 5.1: The artiﬁcial collision

collective behavior if it only happens on a few particles. But it may aﬀect some statistical
quantities, which are used to describe the collective behavior. For example, the r.m.s. value
of the positions, which is often used to describe the bunch size, may be increased simply
because the pair of particles in collision have left the bunch and contribute a huge distance
to the center of the bunch. So we need to prevent this collision from happening in the
simulation, or detect it when it happens and ﬁx its eﬀect. Another reason that we want to
prevent this artiﬁcial collision is due to the use of macroparticles. In many cases, we still use
one macroparticle to represent a group of particles to increase the eﬃciency, although the
MLFMA is applied. The macroparticles are treated as point charges in the near region ﬁeld
calculation. However, as a group of particles, they in fact should have volume. When two
macroparticles go extremely close to each other, the two groups of real particles overlap each
other and the ﬁeld between them will not go as high as the Coulomb force law calculates for
two point charges.
One easy way to detect the collision is to check the momentum dispersion of the bunch.

97

The sudden increasing of the momentum of the particles experiencing collision often leads to
the discontinuity in the momentum dispersion curve. Figure 5.2 shows an example, in which
we simulate the free expansion of an electron bunch of 10,000 electrons for 10 ps in 200 steps.
No external ﬁeld is applied and only the space charge ﬁeld makes eﬀects in the expansion
process. The dynamics equations are solved by the fourth order Runge-Kutta integrator
with ﬁxed step size. The r.m.s. value of the momentum and the minimum distance between
all pairs of electrons are plotted in Figure 5.2. There are two jumps of the momentum
dispersion curve and for each one a local minimum value of the distance between electrons
can be found. We plot all the electrons in z − pz phase space, as shown in Figure 5.3, then
we found one pair of electrons that have much larger momenta than all the others. These
two electrons reached a very close distance in the last time step. Then in this time step, the
ﬁeld between them goes extremely high, thus they feel a much large impulse than the other
electrons, which gives them much larger momenta than all the other electrons. While from
the charge density plot in Figure 5.4 we can see most electrons have very small momenta and
stay in the center of the space. The behavior of the two electrons do not aﬀect the collective
behavior of the whole bunch and they will leave the bunch soon if they do not collide with
other electrons. But the values of their momenta are large enough to aﬀect the r.m.s of the
momenta of the whole bunch.
We have three methods to avoid this. First, when we calculate the statistical quantities
of the whole bunch, we do not count the contribution of these two electrons. Second, we can
set a cut-oﬀ rate γ. When we calculate the ﬁeld by the DA based MLFMA, we know exactly
how many particles each childless box holds. For each childless box, we calculate the average
distance ra . If the distance between two particles r is less than γ · ra , we use γ · ra instead of
r to calculate the ﬁeld between these two particles. In this way we actually set a maximum
98

Figure 5.2: Discontinuity of the momentum dispersion due to the near region collision
0.159

100

r.m.s. of p (keV/c)

0.157
10

0.156
0.155
0.154

1

0.153
rms of px
rms of py
rms of pz
min of r

0.152
0.151
0

1

2

3

4

5
6
t (ps)

7

8

9

Figure 5.3: Electrons in z − pz phase space
5
4
3

Pz(keV/c)

2
1
0
-1
-2
-3
-4
-5
-25 -20 -15 -10 -5

0
5
z(µm)

99

10 15 20 25

0.1
10

minimum distance (nm)

0.158

pz(keV/c)

Figure 5.4: Electron density in z − pz phase space

5
4
3
2
1
0
-1
-2
-3
-4
-5

1
0.8
0.6
0.4
0.2
0
-25-20-15-10 -5 0 5 10 15 20 25
z(µm)

Figure 5.5: Momentum dispersion after applying the ﬁrst method
0.157

r.m.s. of p (keV/c)

0.156

0.155

0.154

0.153

0.152

rms of px
rms of py
rms of pz

0.151
0

1

2

3

4

100

5
6
t (ps)

7

8

9

10

Figure 5.6: Momentum dispersion after applying the second method
0.157

r.m.s. of p (keV/c)

0.156

0.155

0.154

0.153

0.152

rms of px
rms of py
rms of pz

0.151
0

1

2

3

4

5
6
t (ps)

7

8

9

10

Figure 5.7: Momentum dispersion after applying the third method
0.157

r.m.s. of p (keV/c)

0.156

0.155

0.154

0.153

0.152

rms of px
rms of py
rms of pz

0.151
0

1

2

3

4

101

5
6
t (ps)

7

8

9

10

force between each pair of particles for each childless box. Once the distance between a pair
of particles is less than the cut-oﬀ distance γ · ra , we use the maximum force to replace the
real force. Third, similarly to the second, when the distance between two particles r is less
than γ · ra , we use a three dimensional Gaussian distribution model, instead of the point
charge model, to represent the source particle. When the observer particle gets closer to the
center of the Gaussian distribution, the force it feels goes to zero. Figure 5.5, Figure 5.6
and Figure 5.7 respectively show the momentum dispersion curve after applying the above
method. In all the three cases the discontinuity disappears. In the last two simulations, γ
is chosen to be 0.001. The time step size need to be considered for the proper choice of γ,
because the change of the momentum depends on the impulse. Besides the above methods,
an integrator with automatic step size control can be employed to avoid this issue. One
needs to carefully choose the criteria of the step size control.

5.4
5.4.1

An Example of Photoemission Process Simulation
Model of the Photoemission Process

When an ultrafast laser pulse is applied on a gold ﬁlm, electrons are generated. The the
motion of the electrons are dominated by the space charge ﬁeld between them. We simulated
this process and the DA based MLFMA was used to calculate the space charge ﬁeld. The
following model is used in the simulation. We let the gold ﬁlm lie in the x − y plane, and z
direction is perpendicular to the ﬁlm. The incident laser pulse is applied on the surface of
the gold ﬁlm and the electrons come out of the surface after colliding with the photons and
absorbing their energy. Assume we know the Fermi energy of the gold ﬁlm Ef , the work
function of the gold ﬁlm W and the photon energy Eph = hν with ν the laser frequency.
102

The highest possible energy of an electron inside the gold ﬁlm is the Fermi energy Ef . We
assume the electron absorb all the energy of the photon after their collision. If the initial
energy of an electron is Ei , after the collision its energy is Ei + Eph . If this electron gets
enough energy to overcome the work function and escape from the gold ﬁlm, we should
have Ei + Eph − W > Ef . So the lower bound of the energy for an electron to escape is
Ef − Eph + W . The higher bound is of course Ef . We assume the electrons have a uniform
distribution within the energy band of [Ef − Eph + W, Ef ]. After the collision, the energy of
the electron Ee is within the range of [Ef + W, Ef + Eph ]. And the velocity of the electron
satisﬁes
v=

2Ee
,
m

(5.11)

where m is the electron mass. To overcome the work function, the electron’s velocity in z
direction should be at least mv 2 /2 = Ef + W , so that

vz,min =

2(Ef + W )
.
m

(5.12)

This implies that for a given energy Ee ∈ [Ef + W, Ef + Eph ], only the electrons whose
velocities are inside an incident cone of the angle θmax with respect to the z direction can
escape, where
cos θmax =

vz,min
=
v

Ef + W
.
Ee

(5.13)

From Eq. (5.13) we can see a smaller Ee leads to a smaller θ, and therefore a smaller cone.
If we assume the electron moves in a random direction after its collision with the photon,
Eq. (5.13) indicates the electrons with lower energies have lower possibilities to escape. The
possibility is in proportion to the solid angle of the cone, which is 2π(1 − cos θmax ). The
103

solid angle of a sphere is 4π, so we have the possibility to escape as

Pesc. =

2π(1 − cos θmax )
= 0.5 · (1 − cos θmax ),
4π

where θmax is always a sharp angle. With a given energy Ee and a given outgoing angle θ,
the speed of the electron can be calculated by Eq. (5.11), and then the velocity in z direction
is
vz = v · cos θ.

(5.14)

Given another angle φ ∈ [0, 2π], in the transverse directions we have
vx = v · sin θ · cos φ, and vy = v · sin θ · sin φ.

(5.15)

Eq.(5.14) and Eq. (5.15) are the velocities on the metal side. If we assume the work function
is abrupt at the surface, the initial velocities of the electrons in the air side are

vxi = vx , vyi = vy , and vzi =

2
vz − 2(Ef + W )/m.

(5.16)

The algorithm of how we sample the electrons and simulate their motion is described
as follows. Note that if we do not specify the distribution of a random number, it has the
uniform distribution.
1. Assume the laser pulse has a Gaussian distribution with standard deviation σ in the
time domain. We cut the range of [−3σ, +3σ] into N even parts. The ith part covers from
t1 = −3σ + (i − 1) · dt to t2 = −3σ + i · dt, with dt = 6σ/N . If the total number of electrons
created during the photoemission process is Ne , the number of electrons created in the ith

104

√
√
time interval is (erf(t2 / 2) − erf(t1 / 2))/(2 × 0.997).
2. For each outgoing electron, we generate its initial velocities and positions in the
following way. (a) Generate a random number Ee ∈ [Ef + W, Ef + Eph ], calculate cos θmax
by Eq. (5.13), and then generate a random number c ∈ [0, 1]. (b) If c > 1 − cos θ, discard
Ee and repeat step (a). Otherwise accept Ee and continue. (c) Generate a random number
a ∈ [0, cos θmax ] and another random number φ ∈ [0, 2π], let θ = arccos a, and calculate
the initial velocities by Eq. (5.14), Eq. (5.15) and Eq. (5.16). (d) Generate two random
numbers xi and yi of the Gaussian distributions with the standard deviations σx and σy
respectively for the initial coordinates in x and y direction. (e) The initial coordinate in z
direction is set to be zero.
3. For each outgoing electron, there is a positive residual left behind, assuming that other
electrons do not have enough time to compensate it. For each electron coming out of the
surface, there are three kinds of interactions: space charge ﬁeld between the electrons, ﬁeld
from the positive holes and the external ﬁeld that is applied to help extract the electrons out.
The space charge ﬁeld will be calculated by the DA based MLFMA. The external extracting
ﬁeld is constant. The model of the positive residual ﬁeld will be described in Section 5.4.2.
4. Solve the dynamics equations for the outgoing electrons. Once the z coordinate of an
electron is less than zero, remove it from our calculation.

5.4.2

Model of the Positive Residual Field

If we assume the incident laser pulse has a Gaussian distribution with standard deviation
σxy in the transverse plane, the outgoing electrons have the same transverse distribution and
so do the positive holes on the surface. Assume all positive residuals lie in a circular area
whose radius is 6σxy . We cut this area into Ns circular stripes whose width are all 6σxy /Ns .
105

The radius of the inner side and the outer side of the ith stripe is r1i = (i − 1) · 6σxy /Ns and
r2i = i · 6σxy /Ns respectively. If the positive residuals have a distribution on the surface of
the form
f (r) =

1
r2
exp − 2
2
2πσxy
2σxy

(5.17)

and the total charges are Q, then the charges inside a circular area with radius r1 are
r1

2π

dθ
0

0

f (r) · rdr = 2π

r1
0

1
r2
exp − 2
2
2πσxy
2σxy

2
r1
· rdr = 1 − exp − 2
2σxy

. (5.18)

So that the charges inside the ith stripe are
2
r1i
Qi = Q exp − 2
2σxy

2
r2i
− exp − 2
2σxy

.

(5.19)

Then the whole area can be approximately represented by Ns rings with uniform charge
distribution. The radius of the ith ring is ri = 0.5 · (r1i + r2i ) and the charge density of it
is ρi = Qi /2πri . The explicit formula for the three dimensional ﬁeld of a uniformly charged
ring can be derived as follows.
Assume we have a uniformly charged ring, whose center is (0,0,z0 ) and whose radius is
r0 , lying in the z = z0 plate. Its charge density is a constant number ρ. Using cylindrical

106

coordinate, the electric ﬁeld on an arbitrary point (r, θ, z) can be written as
2π

ρr0 dθ0

φ =

2
(z − z0 )2 + r2 + r0 − 2rr0 cos(θ0 − θ)

0
2π

ρr0 dψ

=
0
π

= 2
0

(5.20)

2
(z − z0 )2 + r2 + r0 − 2rr0 cos ψ

ρr0 dψ

2
(z − z0 )2 + r2 + r0 − 2rr0 cos ψ

,

where ψ = (θ0 − θ) and the symmetry is used in the last step. Let
R2 = (r + r0 )2 + (z − z0 )2 ,

(5.21)

k 2 = 4rr0 /R2 .

(5.22)

We have

2
(z − z0 )2 + r2 + r0 − 2rr0 cos ψ

=

R2 − 2rr0 − 2rr0 cos ψ

=

R2 − 2rr0 (1 + cos ψ)

=

R2 − 4rr0 sin2 ψ ′

=

R2 (1 − k 2 sin2 ψ ′ ),

(5.23)

where ψ = π − 2ψ ′ and 1 + cos ψ = 2 sin2 ψ ′ . Then the potential in Eq. (5.20) can be

107

expressed as follows.
π

φ = 2

ρr0 dψ
1 − k 2 sin2 ψ ′
ρr0 (−2)dψ ′

R

0
0

= 2
= 4

ρr0 dψ ′

1 − k 2 sin2 ψ ′

R

0

=

1 − k 2 sin2 ψ ′

π/2 R
π/2

4ρr0
K(k),
R

(5.24)

where K(k) is the complete elliptic integral of the ﬁrst kind, which is deﬁned as
π/2

K(k) =
0

dα
1 − k 2 sin2 α

.

(5.25)

The complete elliptic integral of second kind E(k) can be deﬁned as
π/2

E(k) =
0

1 − k 2 sin2 αdα,

(5.26)

and K(k) and E(k) have the following relations in Eq. (5.27) and Eq. (5.28), which is going
to be used in our derivation.

E
dK
=
,
dk
k(1 − k 2 )
E K
dE
=
− .
dk
k
k

(5.27)
(5.28)

In cylindrical coordinates, we have

∇φ =

∂φ
1 ∂φ ˆ ∂φ
ρ+
ˆ
z.
ˆ
θ+
∂ρ
ρ ∂θ
∂z
108

(5.29)

So that

∂ 4ρr0
∂φ
=−
K(k)
∂z
∂z
R
√
1 ∂R
1 ∂K ∂
4rr0
−4ρr0 − 2 K
+
∂z
R ∂k ∂z
R
R
1 ∂K
∂R
4ρr0
K+
4rr0
R ∂k
∂z
R2
E
z − z0
4ρr0
K+
−K
2
2
R
R
1−k
4ρr0 E
(z − z0 ),
R3 1 − k 2

Ez = −
=
=
=
=

(5.30)

and

∂φ
∂ 4ρr0
=−
K(k)
∂r
∂r
R
∂ K(k)
−4ρr0
∂r
R
√
4rr0
1 ∂R
1 ∂K ∂
−4ρr0 − 2 K
+
∂r
R ∂k ∂k
R
R
∂R
4ρr0
∂R
∂K r0
1 ∂K
K
4rr0
−
+
2
∂r
R ∂k
∂r
∂k
r
R
1 ∂K
4ρr0
∂R ∂K r0
K+
4rr0
−
2
R ∂k
∂r
∂k
r
R
E
4ρr0
K
r0
E r + r0
−
−
r (1 − k 2 )k
k
R2 1 − k 2 R
4ρr0
R
E
E r + r0
−
−K
2
2 R
2r 1 − k 2
R
1−k
E
E r2 + rr0 R
4ρr0
−
−K
.
2 1 − k2
R
2 1 − k2
rR

Er = −
=
=
=
=
=
=
=

(5.31)

In Eq. (5.30) and Eq. (5.31), relations as shown in Eq. (5.27), Eq. (5.28) and Eq. (5.32-5.35)

109

are used.

z − z0
∂R
=
∂z
R

(5.32)

RK =

(5.33)

4rr0

∂R
r + r0
=
∂r
R
1 r0
R
=
k r
2r

(5.34)
(5.35)

Now we can calculate the three dimensional ﬁeld of a uniformly charged ring by Eq. (5.30)
and Eq. (5.31), in which the elliptic integral K(k) and E(k) can be calculated numerically
by Chebyshev polynomial expansions[23]. To check the validity of Eq. (5.30) and Eq. (5.31),
we let r = 0 and derive the formula for Ez and Er on the axis, which is well known. When
r = 0, k = 0, R =

2
r0 + (z − z0 )2 , and E(k) = E(0) = π/2. From Eq. (5.30), we get

4ρr0

π
· (z − z0 )
2
(r0 + (z − z0 ))3/2 2
2πρr0 (z − z0 )
=
.
2
(r0 + (z − z0 ))3/2

Ez |r=0 =

·

(5.36)

From Eq. (5.31), we obtain

E r + r0 R
4ρr0
+
2
R2 1 − k 2 R
4ρr0
=
[I + II] ,
R2

Er =

110

K
E
−
r
r(1 − k 2 )
(5.37)

where

E r + r0
,
1 − k2 R
R K
E
II =
−
2 r
r(1 − k 2 )

(5.38)

I =

.

(5.39)

When r = 0,
I|r=0 =

πr0
2

2
r0 + (z − z0 )2

.

(5.40)

The other part is[49]
π/2

II =
0

dα

π/2

−

0
1 − k 2 sin2 α
π/2
k 2 sin2 α − k 2

r

1 − k 2 sin2 α
dα
r(1 − k 2 )

=

1
r 0

=

π/2
4r0
sin2 α − 1
dα.
(r + r0 )2 + (z − z0 )2 0
(1 − k 2 ) 1 − k 2 sin2 α

(1 − k 2 )

1 − k 2 sin2 α

dα

(5.41)

When r = 0, We have
π/2
4r0
(sin2 α − 1)dα
2 + (z − z )2
r0
0
0
πr0
= − 2
.
r0 + (z − z0 )2

II|r=0 =

(5.42)

Then
Er |r=0 = I|r=0 + II|r=0 = 0.

(5.43)

The expressions of Er and Ez on the axis in Eq.(5.43) and Eq.(5.36) are as expected.
In our simulation, we use 1000 stripes (Ns = 1000) to approximately calculate the ﬁeld
of the positive residuals on a 101 × 101 grid, which covers the area of [0, 579µm] × [0, 579µm].
111

Ez(V/m)

Figure 5.8: Longitudinal ﬁeld

0.16
0.14
0.12
0.1
0.08
0.06
0.04
0.02
0

0

100

200 300
z(µm)

400

500

6000

600
500
400
300
r(µm)
200
100

Er(V/m)

Figure 5.9: Radial ﬁeld

0.07
0.06
0.05
0.04
0.03
0.02
0.01
0

0

100

200 300
z(µm)

400

500

6000

600
500
400
300
r(µm)
200
100

The longitudinal ﬁeld and the radial ﬁeld on the grid of one unit charge with Gaussian
distribution on the surface are shown respectively in Figure 5.8 and Figure 5.9. We assume
the charge distribution of the positive residuals does not charge, only the number of charges
changes and it is always equal to the number of electrons outside the surface. So at any time,

112

ﬁelds always have the same shape as Figure 5.8 and Figure 5.9 show. But their strength
changes with time. If we have N particles outside the surface, which leaves N positive
residuals on the surface, the ﬁeld on the grid of N residuals is just N times the ﬁeld of a
unit charge. The ﬁelds on the particles are calculated by linear interpolation.

5.4.3

Comparison of the Simulation Results with the Experiment
Data

Our colleagues performed experiments to measure the ultrafast electron pulse dynamics immediately following photoemission[84, 85] using a point projection imaging techniques[75].
The setup of the measurement system is shown in Figure 5.10. A 50 fs laser pulse is applied
on the gold photo cathode surface with a incidence of 45◦ to trigger the photoemission.
The laser has a Gaussian proﬁle with an elliptical cross-section having σx = 115µm and
σy = 81µm. The wavelength of the laser is 266 nm, and the photon energy is 4.66 eV, which
is slightly higher than the work function of gold ﬁlm that ranges from 4.0-4.6 eV[83], allowing
photoemission with a small energy spread. The exciting ﬂuence of the laser pulse can be adjusted from 1-10 mJ/cm2 . A positive electrode (anode) is separated 5 mm from the cathode,
providing an applied ﬁeld (Fa ) to facilitate the photoemission. The potential on the anode
can be adjusted from 0-2400 V, which leads to the range of Fa 0-0.4 MV/m calculated by
Field Precision, an electromagnetic ﬁeld calculation program. The laser pulse together with
the the applied extracting ﬁeld can produce a wide range from 104 to 108 electrons/pulse. A
point laser source is synchronized to the exciting laser beam with a well deﬁned delay with
respect to photoemission. Once the electron pulse is emitted, the point electron source costs
a shadow of the electron pulse onto a metalized phosphor screen connected to an intensiﬁed

113

Figure 5.10: Experiment setup, from Zhensheng Tao

CCD camera. The position and the size of the bunch can be obtained by measuring the
shadow patterns and ﬁtting the experimental data with an analytical expression describing
the projection geometry[75].
Here we compare a group of experimental data with simulation results. In experiment we
set the extracting ﬁeld Fa to be 0.32 MV/m and the ﬂuence of the exciting laser 5mJ/cm2 ,
the electrons generated by photoemission is predicted to be in the order of 102 [31] and the
electrons that survived after 100 ps are measured to be in the order of 106 . There is a two
order’s drop of the electron population. In the simulation, we set the Fermi energy to be 5
eV, photon energy 4.65 eV, work function 4.45 eV. The FWHM of the laser pulse is 50 fs,
which means the standard deviation σ is 21.23 fs. To model the creation of the electrons,
we divide the time region of 6σ into 100 each pieces. In total 130 million electrons are
created, which are presented by 1.3 million macro-particles. Each macro-particle represents
100 electrons. Then we simulate the following 120 ps in 620 steps with various step sizes.

114

The dynamics equations are solved by COSY’s intrinsic eighth order Runge-Kutta integrator.
The space charge ﬁeld between the macro-particles are calculated by the DA based MLFMA.
A constant external ﬁeld of 0.28 MV/m is applied to help extract the electrons. We save the
number of particles, the bunch size, and the momentum dispersion in each time step, and
save the positions and the momenta of all macro-particles in some selected steps. Figure
5.11 shows how the number of particles changes with respect to time. All the particles are
created in the ﬁrst 120 fs. The peak value of the percentage is slightly less than one, which
means some particles go back to the surface almost immediately after they jump out. The
number of particles keeps decreasing. In the ﬁrst 10 ps, the percentage decreases to 10%.
Then the curve becomes ﬂatter. After 80 ps, the number of particles is almost a constant.
After 100 ps there are 7.26 million electrons, which agrees with the experiment in order.
Figure 5.11: Number of particles as a function of time
100
90
p (% of 130M electrons)

80
70
60
50
40
30
20
10
0
0

20

40

60
80
t (ps)

115

100

120

140

The shadow pattern on the CCD screen can give us the information about the bunch
size and the position of the bunch. The raw data of the the shadow pattern at 70 ps is
shown in Figure 5.12. The horizontal axis represents the longitudinal position on the CCD
screen and the vertical axis the depth of the shadow. The peak at about 0.008 m tells us
where the bunch is in longitudinal direction and the shape of the peak give us information
about the longitudinal bunch size. In the simulation, we divide the 600 µm distance in the
longitudinal direction into 600 slots, each of which has the width of 1 µm. Then we count
the particle number in each slot. To compare with the simulation result, the position data of
′

′

the experiment is scaled as x = (x − x(1))/33, where x is the data before scaling, x is the
data after scaling and x(1) is the value of the ﬁrst point before scaling. The shadow depth
′

′

data is scaled as y = 5000 · y, where y is the data before scaling and y is the data after
scaling. The simulation result and the experiment data at 70 ps, 80 ps, 90 ps and 100 ps are
compared in Figure 5.13, Figure 5.14, Figure 5.15 and Figure 5.16. Ignoring the data points
before 0.14 mm, the peak position, amplitude, and shape agree well in all cases. Figure 5.17
shows how the longitudinal bunch size evolves as time goes, in which the simulation and the
experiment agree too. The quadratic curve, instead of a linear line, is as expected due to
the space charge eﬀect.

116

Figure 5.12: Raw data of the shadow patterns at 70 ps

'HSWKeRIeWKHeVKDGRZ

e
151c
1
3151c
315P
315Pc
315i
315ic
315n
315nc
1

1511c 151P 151Pc 151i 151ic
3RVLWLRQeRQeWKHe&&'eVFUHHQedPw

151n

Figure 5.13: Comparison of the experiment with the simulation at 70 ps

1XPEHUmRImSDUWLFOHV

z00
m6LPXODWRQ
m([SHULPHQW

300

200

0

s200
s0-0002

0

0-0002 0-0003 0-000z 0-000( 0-000) 0-000. 0-0004 0-0005
]maPt

117

Figure 5.14: Comparison of the experiment with the simulation at 80 ps

z00
6LPXODWLRQ
([SHULPHQW

1XPEHUmRImSDUWLFOHV

300

200

0

s200
s0-0002

0

0-0002 0-0003 0-000z 0-000( 0-000) 0-000. 0-0004 0-0005
]aPt

Figure 5.15: Comparison of the experiment with the simulation at 90 ps

1XPEHUmRImSDUWLFOHV

z00
m6LPXODWLRQ
m([SHULPHQW

300

200

0

s200
s0-0002

0

0-0002 0-0003 0-000z 0-000( 0-000) 0-000. 0-0004 0-0005
]aPt

118

Figure 5.16: Comparison of the experiment with the simulation at 100 ps

1XPEHUmRImSDUWLFOHV

z00
m6LPXODWLRQ
m([SHULPHQW

300

200

0

s200
s0-0002

0

0-0002 0-0003 0-000z 0-000( 0-000) 0-000. 0-0004 0-0005
]maPt

UPVsRIs]sePLFURPHWHUt

Figure 5.17: Comparison of the bunch sizes in the experiment and the simulation

1S8
1p8
1E8
188
l8
S8
p8
E8
8
0E8
0E8

s([SHULPHQW
s6LPXODWLRQ

8

E8

p8 S8 l8
]sePLFURPHWHUt

119

188 1E8 1p8

Figure 5.18: The momentum dispersion as a function of time
0.45

rms of px/py/pz (MeV/c)

0.4
0.35
0.3
0.25
0.2
0.15
0.1
rms of px
rms of py
rms of pz

0.05
0
0

20

40

60
80
time (ps)

100

120

140

Simulations can give us some information that is diﬃcult to measure directly in the experiments, for example the momentum dispersion. Figure 5.18 shows the momentum dispersion
at diﬀerent time in all the three directions. We are glad not to see any discontinuities. The
curves for the two transverse directions are almost identical because the whole system has
rotational symmetry in the simulation. For the same reason we also obtain two identical
curves of the transverse bunch sizes in Figure. 5.19.
More interesting is the charge distribution in diﬀerent normal space and phase space.
Figure 5.20 shows the charge distribution in x − z phase space in the ﬁrst 130 fs, during
which the laser pulse is applied on the surface and the electrons leave the surface and form
the bunch. Diﬀerent color means diﬀerent densities, and the brighter the color is, the higher
the density is. We can see how the electrons go out little by little, how they are accumulated
and how the bunch is generated. The center of the bunch has a higher density than the
edge has, which is reasonable since the Gaussian laser pulse has higher photon density at the
120

Figure 5.19: The transverse bunch sizes evolve with time
190
180
170
rms of x/y (µm)

160
150
140
130
120
110
100

rms of x
rms of y

90
0

20

40

60
80
time (ps)

100

120

140

center too. Figure 5.21 shows the charge distribution in z − pz phase space in the ﬁrst 130
fs. After the bunch is created, the charged particles form a triangle which covers the upper
left area in the z − pz phase space. The discontinuity in the ﬁrst a few ﬁgures is due to the
discreteness in time of the integration.

121

Figure 5.20: Charge distribution in x − z phase space in the ﬁrst 130 fs
(a) t=13fs

(b) t=25fs

200

100

100
x(µm)

300

200

x(µm)

300

0

0

-100

-100

-200

-200

-300

-300
0

0.01

0.02

0.03

0

0.01

0.02

0.02

z(µm)

(d) t=51fs
300

200

200

100

100
x(µm)

300

x(µm)

0.03

z(µm)

(c) t=38fs

0.03

0

0

-100

-100

-200

-200

-300

-300
0

0.01

0.02

0.03

z(µm)

0

0.01
z(µm)

122

Figure 5.20: (cont’d)
(e) t=63fs

(f) t=76fs

200

100

100
x(µm)

300

200

x(µm)

300

0

0

-100

-100

-200

-200

-300

-300
0

0.01

0.02

0.03

0

0.01

0.02

0.02

z(µm)

(h) t=102fs
300

200

200

100

100
x(µm)

300

x(µm)

0.03

z(µm)

(g) t=89fs

0.03

0

0

-100

-100

-200

-200

-300

-300
0

0.01

0.02

0.03

z(µm)

0

0.01
z(µm)

123

Figure 5.20: (cont’d)
(i) t=115fs

(j) t=127fs

200

100

100
x(µm)

300

200

x(µm)

300

0

0

-100

-100

-200

-200

-300

-300
0

0.01

0.02

0.03

0

0.01

z(µm)

0.02

0.03

z(µm)

Figure 5.21: Charge distribution in x − pz phase space in the ﬁrst 130 fs
(a) t=13fs

(b) t=25fs

0.02
pz(MeV/c)

0.04

0.02
pz(MeV/c)

0.04

0

0

-0.02

-0.02

-0.04

-0.04
0

0.01

0.02

0.03

z(µm)

0

0.01

0.02
z(µm)

124

0.03

Figure 5.21: (cont’d)
(c) t=38fs

(d) t=51fs

0.02
pz(MeV/c)

0.04

0.02
pz(MeV/c)

0.04

0

0

-0.02

-0.02

-0.04

-0.04
0

0.01

0.02

0.03

0

0.01

0.02

0.02

z(µm)

(f) t=76fs
0.04

0.02

0.02
pz(MeV/c)

0.04

pz(MeV/c)

0.03

z(µm)

(e) t=63fs

0.03

0

0

-0.02

-0.02

-0.04

-0.04
0

0.01

0.02

0.03

z(µm)

0

0.01
z(µm)

125

Figure 5.21: (cont’d)
(g) t=89fs

(h) t=102fs

0.02
pz(MeV/c)

0.04

0.02
pz(MeV/c)

0.04

0

0

-0.02

-0.02

-0.04

-0.04
0

0.01

0.02

0.03

0

0.01

0.02

0.02

z(µm)

(j) t=127fs
0.04

0.02

0.02
pz(MeV/c)

0.04

pz(MeV/c)

0.03

z(µm)

(i) t=115fs

0.03

0

0

-0.02

-0.02

-0.04

-0.04
0

0.01

0.02

0.03

z(µm)

0

0.01
z(µm)

126

Figure 5.22 shows the charge distribution in x − z phase space at diﬀerent times in the
following 120 ps, and Figure 5.23 shows the charge distribution in x − y phase space. In the
ﬁrst 20 ps, the bunch is attached to the surface. During this period of time, the number of
particles decreases very fast as shown in Figure 5.11. From the x−y phase space distribution
we can see that the bunch has the highest density in the center at 1 ps and then the charge
density at the center is decreasing till 20 ps. This is because the charges at the center feel
a stronger space charge ﬁeld and a stronger ﬁeld from the positive residuals. After 60 ps,
the bunch starts to leave the surface and the number of particles is almost constant. In the
x − y phase space, most particles gather at the center again. Figure 5.24 shows the charge
distribution in z − pz phase space. At 10 ps, a linear chirp has been formed. As time goes,
the momentum diﬀerence between the two ends of the chirp increases due to the space charge
ﬁeld, which is as expected.
From this example, we see the DA based MLFMA works well although the charge density keeps changing. The simulation helps us understand the experimental data. Similar
simulations can be used to study how the laser shape aﬀects the photo electron distributions
and how the electron bunch evolves under diﬀerent initial conditions and diﬀerent external
ﬁelds.

127

Figure 5.22: Charge distribution in x − z phase space in the following 120 ps
(a) t=10ps

(b) t=20ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600

-600
0 100 200 300 400 500 600 700 800

0
0 100 200 300 400 500 600 700 800

z(µm)

z(µm)

(c) t=30ps

(d) t=40ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600

-600
0 100 200 300 400 500 600 700 800
z(µm)

128

0 100 200 300 400 500 600 700 800
z(µm)

Figure 5.22: (cont’d)
(e) t=50ps

(f) t=60ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600

-600
0 100 200 300 400 500 600 700 800

0 100 200 300 400 500 600 700 800

z(µm)

z(µm)

(g) t=70ps

(h) t=80ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600

-600
0 100 200 300 400 500 600 700 800
z(µm)

129

0 100 200 300 400 500 600 700 800
z(µm)

Figure 5.22: (cont’d)
(i) t=90ps

(j) t=100ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600

-600
0 100 200 300 400 500 600 700 800

0 100 200 300 400 500 600 700 800

z(µm)

z(µm)

(k) t=110ps

(l) t=120ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600

-600
0 100 200 300 400 500 600 700 800
z(µm)

130

0 100 200 300 400 500 600 700 800
z(µm)

Figure 5.23: Charge distribution in x − y phase space in the following 100 ps
(a) t=1ps

(b) t=5ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600
-600 -400 -200

0

200

400

-600
-600 -400 -200

600

0

200

400

600

0
200
y(µm)

400

600

y(µm)

y(µm)

(c) t=10ps

(d) t=20ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600
-600 -400 -200

0
200
y(µm)

400

600

131

-600
-600 -400 -200

0

Figure 5.23: (cont’d)
(e) t=30ps

(f) t=40ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600
-600 -400 -200

0

200

400

-600
-600 -400 -200

600

0

200

400

600

0
200
y(µm)

400

600

y(µm)

y(µm)

(g) t=50ps

(h) t=60ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600
-600 -400 -200

0
200
y(µm)

400

600

132

-600
-600 -400 -200

Figure 5.23: (cont’d)
(i) t=70ps

(j) t=80ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600
-600 -400 -200

0

200

400

-600
-600 -400 -200

600

0

200

400

600

0
200
y(µm)

400

600

y(µm)

y(µm)

(k) t=90ps

(l) t=100ps

400

200

200
x(µm)

600

400

x(µm)

600

0

0

-200

-200

-400

-400

-600
-600 -400 -200

0
200
y(µm)

400

600

133

-600
-600 -400 -200

Figure 5.24: Charge distribution in x − z phase space in the following 120 ps
(a) t=10ps

(b) t=20ps

1.5

1

1

pz(MeV/c)

2

1.5

pz(MeV/c)

2

0.5

0.5

0

0

-0.5

-0.5

-1

0

-1

0 100 200 300 400 500 600 700 800

0
0 100 200 300 400 500 600 700 800

z(µm)

z(µm)

(c) t=30ps

(d) t=40ps

1.5

1

1

pz(MeV/c)

2

1.5

pz(MeV/c)

2

0.5

0.5

0

0

-0.5

-0.5

-1

-1
0 100 200 300 400 500 600 700 800
z(µm)

134

0 100 200 300 400 500 600 700 800
z(µm)

Figure 5.24: (cont’d)
(e) t=50ps

(f) t=60ps

1.5

1

1

pz(MeV/c)

2

1.5

pz(MeV/c)

2

0.5

0.5

0

0

-0.5

-0.5

-1

-1
0 100 200 300 400 500 600 700 800

0 100 200 300 400 500 600 700 800

z(µm)

z(µm)

(g) t=70ps

(h) t=80ps

1.5

1

1

pz(MeV/c)

2

1.5

pz(MeV/c)

2

0.5

0.5

0

0

-0.5

-0.5

-1

-1
0 100 200 300 400 500 600 700 800
z(µm)

135

0 100 200 300 400 500 600 700 800
z(µm)

Figure 5.24: (cont’d)
(i) t=90ps

(j) t=100ps

1.5

1

1

pz(MeV/c)

2

1.5

pz(MeV/c)

2

0.5

0.5

0

0

-0.5

-0.5

-1

-1
0 100 200 300 400 500 600 700 800

0 100 200 300 400 500 600 700 800

z(µm)

z(µm)

(k) t=110ps

(l) t=120ps

1.5

1

1

pz(MeV/c)

2

1.5

pz(MeV/c)

2

0.5

0.5

0

0

-0.5

-0.5

-1

-1
0 100 200 300 400 500 600 700 800
z(µm)

136

0 100 200 300 400 500 600 700 800
z(µm)

5.5

Beneﬁt of Using the Grid-free Fast Multipole Method

The particle-in-cell (PIC) method is currently the most popular algorithm for the space
charge eﬀect calculation in the beam community. There are two important diﬀerences that
distinguish the grid-free fast multipole method from the PIC method. First, the PIC method
depends on a grid, which makes it diﬃcult to deal with systems with strongly nonuniform
charge distribution or/and complex geometry[68, 22]. Advanced techniques are often required to treat curved geometries by the PIC method. The fast multipole method is grid-free
and it treats any charge distribution and any geometry in a natural way. Some PIC programs can solve problems with regular boundaries, such as rectangular, circular, or elliptical
boundary. But they have diﬃculties to solve problems with irregular boundaries. The fast
multipole method can be combined with the boundary integral equation method (BIEM)
[76, 38, 64] to solve problems with complicated geometries. Second, the PIC method artiﬁcially smooths the ﬁeld. In a highly correlated system, it is diﬃcult to accurately calculate
the local interaction to simulate the ﬁne structures by the PIC method without using a
very ﬁne grid, which greatly increases the computation cost[68, 90, 22]. In the research on
advanced accelerators, sometimes a high density beam with a certain proﬁle is desired[70].
For example, a ﬂat beam may help to easily generate the accelerating wake ﬁeld[61] and
a three dimensional uniformly-ﬁlled ellipsoidal electron bunch is good for high-power free
electron lasers[70]. The grid-free fast multipole method is an ideal simulation tool for these
problems since it accurately calculates the nearby interactions, which aﬀect the ﬁne structure
of the charge distribution. Besides the above-mentioned beneﬁts, one can easily increase the
accuracy using the grid-free fast multipole method, whenever the high accuracy is needed.
As is described in the previous chapters, the accuracy can be increased by simply increasing

137

the order of the multipole expansions or increasing the distance between a box and its near
neighbors. It does not depend on how we decompose the whole region into boxes. While
using the PIC algorithm, the accuracy is limited by the grid size. To increase the accuracy,
which requires ﬁner grid size, is hard, especially for three dimensional problems.
Nowadays, accelerator physicists are trying to provide higher and higher intensity beams,
so that it becomes more and more important to understand the interaction between charged
particles. For example, muon cooling is an essential part in Fermilab’s Neutrino Factory and
Muon Collider program[36, 95], and some experiments have been launched in the international MICE project[28]. Electron cooling for the ion beam is also a challenge in Jeﬀerson
Lab’s MEIC project[1]. The electron cloud eﬀect is being studied in the main injector of
Fermilab’s Project X[53, 54] and Cornell University’s CesrTA program[27, 67]. The space
charge eﬀect, which is dominant in the photoemission process, has to be understood and
controlled in Cathode R&D for free electron laser (FEL) or electron microscope design[30].
Numerical simulation is a very useful tool for studies on collective eﬀects, and it is inevitable
when the collective eﬀects are strong and highly nonlinear. Because of its grid-free property
and good eﬃciency, the fast multipole method will be either a good cross-check to the classical PIC method or a challenger in high density highly correlated systems with complicated
geometries.

138

APPENDIX

139

Detailed Description of the DA-Based
MLFMA Algorithm
Choose DA order and s, the maximum number of particles inside a childless box.
Stage 1. Hierarchical structure of boxes.
Create the zero level box and put it into the array box. Update lvl, prnt, cntr, idx, nptcl,
and link if applicable.
if box[1] contains more than s particles
i1 ←the index of the last element in box. a ← b means giving the value of b to a.
subdivide box[1] into eight boxes, add the nonempty boxes into box, update nchld, lvl,
prnt, cntr, idx, nptcl, and link if applicable.
i2 ←the index of the last element in box.
while i2 > i1
loop i = i1 + 1, i2 , 1
if box[i] contains more than s particles
subdivide box[i] into eight boxes, add the nonempty boxes into box, update nchld, lvl, prnt, cntr, idx, nptcl, and link if applicable.
endif
endloop

140

i1 ← i2
i2 ←the index of the last element in box.
endwhile
endif
Comment We denote by nbox the total number of boxes formed in Stage 1. And we put
the in line comments inside the “{}”.
Stage 2. Multipole expansion.
loop i = nbox, 2, −1
if box[i] is childless
Calculate the multipole expansion around the center of box[i] according to the
particles inside by Eq. (3.1).
else {box[i] is a parent box}
Calculate the multipole expansion of box[i] by shifting the multipole expansions
of its child boxes to its center and take the summation using Eq. (3.2) and Eq.
(3.3). {Note again that in Eq. (3.3) we do not calculate d′ . }
r
endif
endloop
Comment Since we sort backwards from the last box, the multipole expansions of the child
boxes are always calculated before that of their parent box.
Stage 3. Fields.
141

loop i = 2, nbox, 1
b ← box[i].
clg[i] ← 1&i. {Initialize the colleague list for b.}
if lvl[i] > 1
Inherit the local expansion from b’s parent box using Eq. (3.9) and Eq. (3.10).
endif
if b is childless
Calculate the interaction between the particles inside b directly, and save the
resulting ﬁeld in E(b).
endif
loop j = 1, clg[prnt[i]]|1, 1
pclg ← clg[prnt[i]]|(j + 1).{Sort the colleague list of b’s parent box.}
loop k = box[pclg], box[pclg] + nchld[pclg] − 1, 1
w ← box[k]. {w is a child box of a colleague of b’s parent box.}
if k > i
if w is adjacent to b
Add w into b’s colleague list.
For the operations here, see description (1) in the following .
else
w is in the list Vb . Convert the multipole expansion of w into the local
expansion of b using Eq. (3.5) and Eq. (3.6), and add it to lcl[i]. In the

142

same way convert the multipole expansion of b into the local expansion
of w and add it to lcl[k].
endif
endif
endloop
endloop
if b is childless
Calculate the ﬁeld on the particles inside b according to the local expansion of b,
and add the result to E(b).
Add E(b) to the corresponding elements in Ex , Ey , and Ez .
endif
endloop
Comment We only calculate w that is behind b in the array box, because the interactions
between b and those boxes before b have already been calculated. The diagram of stage
3 is shown in Figure A.1.

Description (1)
if b is childless
if w is childless
w is in the list Ub . Calculate the contribution of the particles in w directly and
add the results to E(b). Use Newton’s third law to calculate the ﬁeld on the
143

Figure A.1: Diagram of stage 3

From the coarest level to the ﬁnest level
Do the following for each box b
Initialize clg of b: Put b into clg
and set the colleague number to 1
Inherit the local
expansion
from the parent

Y

l>1
N
is b childless?
Y

N

Ub : Cb → Cb
ﬁnd clg of b’s parent
check every child box w
of the boxes in b’s parent’s clg
Y
is w behind b?

Vb :Mb → Lw
Mw → Lb

is w
childless?
Y
check the
descent of b

∗

Y
N
w adjacent
to b?
Y
put w into b’s clg
Y
is b
N
childless?
Y
is w
Y Ub : Cw → Cb
childless?
Cb → Cw (N 3)
N
check the descent of w ∗∗
is b
childless?
144

Y

Lb → Cb

Save
the
ﬁeld

particles inside w, and add the results to the corresponding elements in Ex , Ey ,
and Ez .
else {w is a parent box.}
Check the descent of w. See description (2).
endif
elseif w is childless
Check the descent of b. See description (3).
endif
Description (2)
Create a stack.
Put all child boxes of w into the stack.
pnt ←length of the stack.
while pnt > 0
d ← stack[pnt].
pnt ← pnt − 1. {Remove the last element from the stack.}
if d is not adjacent to b
d is in the list Wb . Calculate the ﬁelds on the particles inside b according to the
multipole expansion of d using Eq. (3.11), and add the results to E(b). Calculate
the local expansion around the center of d according to the charges inside b using
Eq. (3.7) and Eq. (3.8), and add the result to the corresponding element of lcl.

145

Figure A.2: Diagram of description (2)

Check the descent of w

∗∗

Create a stack
Put all child boxes of w into the stack
Stack empty?

Y

Stop

N
Remove the last box, call it d, out of the stack
d adjacent
to b?

Y
Add all child boxes
of b into the stack

N

d childless?

Y

Ub : C d → C b
Cb → Cd (N 3)

146

N

Wb : Md → Cb
Xd : Cb → Ld

else {d is adjacent to b.}
if d is childless
d is in the list Ub . Calculate the contribution of the particles in d directly
and add the results to E(b). Use Newton’s third law to calculate the ﬁeld on
the particles inside d, and add the results to the corresponding elements in
Ex , Ey , and Ez .
else {d is a parent box.}
Add all the child boxes of d into the stack.
pnt ←the length of the stack.
endif
endif
endwhile
Comment The diagram of description (2) is shown in Figure A.2.
Description (3)
The algorithm is the same with description (2). One only needs to replace all w by b and
all b by w, and ﬁnally add one more step to add E(w) to the corresponding elements in Ex ,
Ey , and Ez .

147

Figure A.3: Diagram of description (3)

Check the descent of b

∗

Create a stack
Put all child boxes of b into the stack
Stack empty?

Y

Stop

N
Remove the last box, call it d, out of the stack
d adjacent
to w?

Y
Add all child boxes
of w into the stack

N

d childless?

Y

Uw : Cd → Cw
Cw → Cd (N 3)

148

N Ww

: Md → Cw
Xd : Mw → Lb

BIBLIOGRAPHY

149

BIBLIOGRAPHY
[1] S. Ahmed, A. Bogacz, Y. Derbenev, A. Hutton, G. Kraﬀt, R. Li, V. Morozov, and
F. Pilat. Conceptual design of a polarized medium energy electron-ion collider at jlab.
Polarization, 70:80.
[2] S. Aluru, J. Gustafson, G.M. Prabhu, and F.E. Sevilgen. Distribution-independent hierarchical algorithms for the n-body problem. The Journal of Supercomputing, 12:303–
323, 1998. 10.1023/A:1008047806690.
[3] S. Aluru, G.M. Prabhu, and J. Gustafson. Truly distribution-independent algorithms
for the n-body problem. In Proceedings of the 1994 ACM/IEEE conference on Supercomputing, pages 420–428. ACM, 1994.
[4] S. Aluru and F. Sevilgen. Dynamic compressed hyperoctrees with application to the
n-body problem. In Foundations of Software Technology and Theoretical Computer
Science, pages 21–33. Springer, 1999.
[5] C.R. Anderson. An implementation of the fast multipole method without multipoles.
SIAM Journal on Scientiﬁc and Statistical Computing, 13:923, 1992.
[6] L.C. Andrews. Special functions for engineers and applied mathematicians. Macmillan,
1985.
[7] J. Applequist. Traceless cartesian tensor forms for spherical harmonic functions: new
theorems and applications to electrostatics of dielectric media. Journal of Physics A:
Mathematical and General, 22:4303, 1989.
[8] J. Barnes and P. Hut. A hierarchical O(N log N ) force-calculation algorithm. nature,
324:4, 1986.
[9] J. E Barnes. A modiﬁed tree code: don’t laugh; it runs. Journal of Computational
Physics, 87(1):161–170, 1990.
[10] R. K. Beatson and L. Greengard. A short course on fast multipole methods. Wavelets,
Multilevel Methods and Elliptic PDEs, pages 1–37, 1997.

150

[11] M. Berz. Arbitrary order description of arbitrary particle optical systems. Nuclear
Instruments and Methods, A298:426, 1990.
[12] M. Berz. Modern map methods for charged particle optics. Nuclear Instruments and
Methods, 363:100, 1995.
[13] M. Berz. From Taylor series to Taylor models. AIP CP, 405:1–20, 1997.
[14] M. Berz. Modern Map Methods in Particle Beam Physics. Academic Press, San Diego,
1999. Also available at http://bt.pa.msu.edu/pub.
[15] M. Berz and K. Makino. COSY INFINITY Version 9.1 programmer’s manual. Technical report, Department of Physics and Astronomy, Michigan State University, East
Lansing, MI 48824, 2011. See also http://cosyinﬁnity.org.
[16] T. Bountis and C. Skokos. Space charges can signiﬁcantly aﬀect the dynamics of
accelerator maps. Physics Letters A, 358(2):126–133, 2006.
[17] D.L. Bruhwiler and M.F. Reusch. High-order optics with space charge: The topkark
code. In AIP Conference Proceedings, volume 297, page 524, 1993.
[18] D.L. Bruhwiler and M.F. Reusch. The eﬀect of various particle distribution functions
on the modeling of space charge eﬀects in a high-order optics code. In Proceeding of
Fourth European Particle Accelerator Conference 1994, page 1168, 1994.
[19] J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for
particle simulations. SIAM Journal on Scientiﬁc and Statistical Computing, 9:669,
1988.
[20] H. Cheng, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm in three
dimensions. Journal of Computational Physics, 155(2):468 – 498, 1999.
[21] C.H. Choi, K. Ruedenberg, and M.S. Gordon. New parallel optimal-parameter fast
multipole method (opfmm). Journal of Computational Chemistry, 22(13):1484–1501,
2001.
[22] A.J. Christlieb, R. Krasny, J.P. Verboncoeur, J.W. Emhoﬀ, and I.D. Boyd. Grid-free
plasma simulation techniques. Plasma Science, IEEE Transactions on, 34(2):149–165,
2006.

151

[23] W.J. Cody. Chebyshev polynomial expansions of complete elliptic integrals. Math.
Comp, 19:249–259, 1965.
[24] W.J. Cody. Rational chebyshev approximations for the error function. Mathematics
of Computation, 23(107):631–637, 1969.
[25] R. Coifman, V. Rokhlin, and S. Wandzura. The fast multipole method for the wave
equation: A pedestrian prescription. Antennas and Propagation Magazine, IEEE,
35(3):7–12, 1993.
[26] C. Coleman-Smith, H.A. Padmore, and W. Wan. Limits to electron beam emittance
from stochastic coulomb interactions. Technical report, Ernest Orlando Lawrence
Berkeley National Laboratory, Berkeley, CA (US), 2008.
[27] J.A. Crittenden, J.R. Calvey, G.F. Dugan, D.L. Kreinick, Z. Leong, J.A. Livezey,
M.A. Palmer, D.L. Rubin, D.C. Sagan, R.L. Holtzapple, et al. Progress in studies of
electron-cloud-induced optics distortions at CesrTA. Technical report, Ernest Orlando
Lawrence Berkeley National Laboratory, Berkeley, CA (US), 2010.
[28] A. Dobbs, D. Adams, K. Long, J. Pasternak, and M. Apollonio. The MICE muon
beam: Status and progress. In Proceedings of 1st International Particle Accelerator
Conference, 2010.
[29] J. Dong, S.L. Chai, and J.J. Mao. An automatic load-balanced parallel multilevel fast
multipole method for large scale electromagnetic scattering problem. In Microwave
Conference Proceedings, 2005. APMC 2005. Asia-Paciﬁc Conference Proceedings, volume 3, pages 4, IEEE, 2005.
[30] D.H. Dowell, I. Bazarov, B. Dunham, K. Harkay, C. Hernandez-Garcia, R. Legg,
H. Padmore, T. Rao, J. Smedley, and W. Wan. Cathode R&D for future light sources.
Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 622(3):685–697, 2010.
[31] D.H. Dowell and J.F. Schmerge. Quantum eﬃciency and thermal emittance of metal
photocathodes. Physical Review Special Topics-Accelerators and Beams, 12(7):074201,
2009.
[32] N. Engheta, W.D. Murphy, V. Rokhlin, and M.S. Vassiliou. The fast multipole method
(fmm) for electromagnetic scattering problems. Antennas and Propagation, IEEE
Transactions on, 40(6):634–641, 1992.

152

[33] O. Ergul and L. Gurel. Eﬃcient parallelization of the multilevel fast multipole algorithm for the solution of large-scale scattering problems. Antennas and Propagation,
IEEE Transactions on, 56(8):2335–2345, 2008.
[34] W. Fong and E. Darve. The black-box fast multipole method. Journal of Computational Physics, 228(23):8712–8725, 2009.
[35] R.W. Garnett and T.P. Wangler. Space-charge calculation for bunched beams with 3-D
ellipsoidal symmetry. In Proceedings of the 1991 IEEE Particle Accelerator Conference
(APS Beam Physics), page 330, 1991.
[36] S. Geer. Muon cooling R&D. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment,
503(1):64–69, 2003.
[37] L. Greengard and W.D. Gropp. A parallel version of the fast multipole method.
Computers & Mathematics with Applications, 20(7):63–71, 1990.
[38] L. Greengard and J.Y. Lee. A direct adaptive poisson solver of arbitrary order accuracy.
Journal of Computational Physics, 125(2):415–424, 1996.
[39] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of
Computational Physics, 73(2):325–348, 1987.
[40] L.F. Greengard and J. Huang. A new version of the fast multipole method for
screened coulomb interactions in three dimensions. Journal of Computational Physics,
180(2):642–658, 2002.
[41] D.P. Grote, A. Friedman, and I. Haber. Methods used in warp3d, a three-dimensional
pic/accelerator code. In AIP Conference Proceedings, volume 391, page 51, 1997.
[42] D.P. Grote, A. Friedman, I. Haber, W. Fawley, and J.L. Vay. New developments in
warp: Progress toward end-to-end simulation. Nuclear Instruments and Methods in
Physics Research-Section A Only, 415(1):428–432, 1998.
[43] D.P. Grote, A. Friedman, J.L. Vay, and I. Haber. The warp code: Modeling high
intensity ion beams. In 16th international Workshop on ECR Ion Sources, edited by
M. Leitner, AIP, Berkeley, CA, 2004.
[44] B. Hariharan and S. Aluru. Eﬃcient parallel algorithms and software for compressed
octrees with applications to hierarchical methods. Parallel computing, 31(3-4):311–331,
2005.
153

[45] G.H. Jansen. Coulomb Interactions in Particle Beams (Advances in Electronics and
Electron Physics Supplement 21). Academic Press, 1990.
[46] F.W. Jones. A hybrid fast-muitipole technique for space-charge tracking with halos. In AIP Conference Proceedings, pages 359–370. IOP INSTITUTE OF PHYSICS
PUBLISHING LTD, 1998.
[47] F.W. Jones and H.O. Schonauer. New space-charge methods in accsim and their
application to injection in the cern ps booster. In Particle Accelerator Conference,
1999. Proceedings of the 1999, volume 4, pages 2933–2935. IEEE, 1999.
[48] P. Kruit and G.H. Jansen. Space charge and statistical Coulomb eﬀects, second edition.
Handbook of Charged Particle Optics, pages 341–391, 2009.
[49] K. Kuns. Calculation of magnetic ﬁeld inside plasma chamber. UCLA report, 2(3):1–11,
2007.
[50] P. Lapostolle, J.M. Lagniel, S. Nath, N. Pichoﬀ, E. Tanke, and S. Valero. The scherm
space charge routine, limitations and solutions. In Linac 1998 conference.
[51] P. Lapostolle, A.M. Lombardi, E. Tanke, S. Valero, R.W. Garnett, and T.P. Wangler. A
modiﬁed space charge routine for high intensity bunched beams. Nuclear Instruments
and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors
and Associated Equipment, 379(1):21–40, 1996.
[52] P.M. Lapostolle, A.M. Lombardi, S. Nath, E. Tanke, S. Valero, and T.P. Wangler. A
new approach to space charge for linac beam dynamics codes. In Proceedings of the
18th International Linear Accelerator Conference, page 375, 1996.
[53] P. Lebrun, P. Spentzouris, J.R. Cary, P. Stoltz, and S.A. Veitzer. Accurate simulation
of the electron cloud in the fermilab main injector with VORPAL. In Proceedings of
1st International Particle Accelerator Conference: IPAC10, 2010.
[54] P.L.G. Lebrun, P. Spentzouris, J.R. Cary, P. Stoltz, and S.A. Veitzer. Accurate simulation of the electron cloud in the fermilab main injector with VORPAL. Technical
report, Fermi National Accelerator Laboratory (FNAL), Batavia, IL, 2011.
[55] P. Li, H. Johnston, and R. Krasny. A cartesian treecode for screened coulomb interactions. Journal of Computational Physics, 228(10):3858–3868, 2009.
[56] J. Makino. Yet Another Fast Multipole Method without Multipoles–Pseudoparticle
Multipole Method. Journal of Computational Physics, 151(2):910–920, 1999.
154

[57] S.L. Manikonda and M. Berz. An accurate high-order method to solve the Helmholtz
boundary value problem for the 3D Laplace equation. International Journal of Pure
and Applied Mathematics, 23,3:365–378, 2005.
[58] S.L. Manikonda and M. Berz. Multipole expansion solution of the Laplace equation
using surface data. Nuclear Instruments and Methods, 558,1:175–183, 2006.
[59] S.L. Manikonda, M. Berz, and K. Makino. High-order veriﬁed solution of the 3D
Laplace equation. Transactions on Computers, 4-11:1604–1610, 2005.
[60] M. Martini and M. Prome. Computer studies of beam dynamics in a proton linear
accelerator with space charge. Particle Accelerators, 2:289–299, 1971.
[61] D. Mihalcea, P. Piot, P. Stoltz, and B. Cowan. Wakeﬁeld generation in compact rectangular dielectric-loaded structures using ﬂat beams. In Proceedings of 2011 Particle
Accelerator Conference, 2011.
[62] S. Nath, J. Qiang, R. Ryne, J. Stovall, H. Takeda, L. Young, K.R. Crandall, N. Pichoﬀ,
and D. Uriot. Comparison of linac simulation codes. In Particle Accelerator Conference,
2001. PAC 2001. Proceedings of the 2001, volume 1, pages 264–266. IEEE, 2001.
[63] C. Nieter and J.R. Cary. VORPAL: a versatile plasma simulation code. Journal of
Computational Physics, 196(2):448–473, 2004.
[64] N. Nishimura. Fast multipole accelerated boundary integral equation methods. Applied
Mechanics Reviews, 55:299, 2002.
[65] E. Nissen and B. Erdelyi. A new paradigm for modeling, simulation, and analysis of
intense beams. In Proccedings of the 46th ICFA Advanced Beam Dynamics Workshop
on High-Intensity and High Brightness Hadron Beams, 2010.
[66] A. Orzhekhovskaya and G. Franchetti. A space charge algorithm for the bunches of
elliptical crossection with arbitrary beam size and particle distribution. Proceedings of
International Computational Accelerator Physics, ICAP, pages 106–109, 2006.
[67] M.A. Palmer, J.P. Alexander, M.G. Billing, J.R. Calvey, C. Conolly, J.A. Crittenden,
J.A. Dobbins, G.F. Dugan, N. Eggert, E. Fontes, et al. Electron cloud at low emittance in CesrTA. In Proceedings of 1st International Particle Accelerator Conference:
IPAC10, 2010.
[68] S. Pfalzner and P. Gibbon. Many-body tree methods in physics. Cambridge University
Press, 2005.
155

[69] N. Pichoﬀ, J.M. Lagniel, and S. Nath. Simulation results with an alternate 3D space
charge routine, PICNIC. In Proceedings of the XIX International Linac Conference,
volume 141, pages 23–28, 1998.
[70] P. Piot, C. Behrens, C. Gerth, F. Lemery, D. Mihalcea, and M. Vogt. Generation and
characterization of electron bunches with ramped current proﬁle at the ﬂash facility.
[71] G. P¨plau, U. Van Rienen, J. Staats, and T. Weiland. Fast algorithms for the tracking
o
of electron beams. energy, 1(5):0–001, 2000.
[72] G. Poplau, U. van Rienen, B. van der Geer, and M. de Loos. Multigrid algorithms for
the fast calculation of space-charge eﬀects in accelerator design. IEEE Transactions
on Magnetics, 40(2 Part 2):714–717, 2004.
[73] D. Potts, G. P¨plau, and U. van Rienen. Calculation of 3d space-charge ﬁelds of
o
bunches of charged particles by fast summation. Scientiﬁc Computing in Electrical
Engineering, page 30, 2004.
[74] J. Qiang, R.D. Ryne, S. Habib, and V. Decyk. An object-oriented parallel particle-incell code for beam dynamics simulation in linear accelerators. Journal of Computational
Physics, 163(2):434–451, 2000.
[75] R.K. Raman, Z. Tao, T.R. Han, and C.Y. Ruan. Ultrafast imaging of photoelectron
packets generated from graphite surface. Applied physics letters, 95(18):181108–181108,
2009.
[76] V. Rokhlin. Rapid solution of integral equations of classical potential theory. Journal
of Computational Physics, 60(2):187–207, 1985.
[77] R.D. Ryne. Advanced computing tools and models for accelerator physics. Technical
report, Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA (US),
2008.
[78] F. Sevilgen and S. Aluru. A unifying data structure for hierarchical methods. In
Supercomputing, ACM/IEEE 1999 Conference, pages 24–24. IEEE, 1999.
[79] F.E. Sevilgen, S. Aluru, and N. Futamura. A provably optimal, distributionindependent parallel fast multipole method. In Proceedings of the 14th International
Parallel and Distributed Processing Symposium, pages 77–84. IEEE, 2000.

156

[80] B. Shanker and H. Huang. Accelerated cartesian expansions-A fast method for computing of potentials of the form r−ν for all real ν. Journal of Computational Physics,
226(1):732–753, 2007.
[81] J.P. Singh, C. Holt, J.L. Hennessy, and A. Gupta. A parallel adaptive fast multipole
method. In Proceedings of the 1993 ACM/IEEE conference on Supercomputing, pages
54–65. ACM, 1993.
[82] J.P. Singh, C. Holt, T. Totsuka, A. Gupta, and J.L. Hennessy. Load balancing and
data locality in adaptive hierarchical N-body methods: Barnes-Hut, fast multipole,
and radiosity. Journal of Parallel and Distributed Computing, 27(2):118–141, 1995.
[83] T. Srinivasan-Rao, J. Fischer, and T. Tsang. Photoemission studies on metals using
picosecond ultraviolet laser pulses. Journal of applied physics, 69(5):3291–3296, 1991.
[84] Z. Tao, H. Zhang, P. Duxbury, M. Berz, and C. Ruan. Quantitative imaging of ultrashort photoelectron pulse dynamics. Bulletin of the American Physical Society, 56,
2011.
[85] Z. Tao, H. Zhang, P.M. Duxbury, M. Berz, and C.Y. Ruan. Space charge eﬀects in
ultrafast electron diﬀraction and imaging. Journal of Applied Physics, 111:044316,
2012.
[86] S.H. Teng. Provably good partitioning and load balancing algorithms for parallel
adaptive n-body simulation. SIAM Journal on Scientiﬁc Computing, 19(2):635–656,
1998.
[87] S.B. van der Geer, M.J. de Loos, G. P¨plau, and U. van Rienen. Adaptive space-charge
o
meshing in the general particle tracer code. In Proceedings of PAC, 2011.
[88] S. Velamparambil and W.C. Chew. Analysis and performance of a distributed memory
multilevel fast multipole algorithm. Antennas and Propagation, IEEE Transactions on,
53(8):2719–2727, 2005.
[89] S. Velamparambil, W.C. Chew, and J. Song. 10 million unknowns: is it that big?
Antennas and Propagation Magazine, IEEE, 45(2):43–58, 2003.
[90] J.P. Verboncoeur. Particle simulation of plasmas: review and advances. Plasma physics
and controlled fusion, 47:A231, 2005.

157

[91] M.S. Warren and J.K. Salmon. A parallel hashed oct-tree n-body algorithm. In
Proceedings of the 1993 ACM/IEEE conference on Supercomputing, pages 12–21. ACM,
1993.
[92] L. Ying, G. Biros, D. Zorin, and H. Langston. A new parallel kernel-independent fast
multipole method. In Supercomputing, 2003 ACM/IEEE Conference, pages 14–14.
IEEE, 2003.
[93] H. Zhang and M. Berz. The fast multipole method in the diﬀerential algebra framework.
Nuclear Instruments and Methods A 645, pages 338–344,, 2011.
[94] F. Zhao. An O (N) Algorithm for Three-Dimensional N-Body Simulations. Technical
report, MASSACHUSETTS INST OF TECH CAMBRIDGE ARTIFICIAL INTELLIGENCE LAB, 1987.
[95] M.S. Zisman. The muon collider. Technical report, Ernest Orlando Lawrence Berkeley
National Laboratory, Berkeley, CA (US), 2011.

158