. u. . hfiwedmi £3. w. :1 . 3Q: It; .. . ”my; 'A. . RE..." 4 v .5. As“ 5.5,: s; 3... Vxfimuwnm .Wm...uuh as... , .. . It}! fl. .. :. .{Pvfl I3... .114... J .5 it! .I. .a I. .511... Lang: 4 .nQv.i.é§..~G=.y.uuwmx.: . a V :84“.ha tan. 4% WWVIHWlehmJaM,..wa . $173 an...” twat.“ 1mm Michigan-State 3 _ University 2000! This is to certify that the dissertation entitled FAST COMPUTATIONAL TECHNIQUES FOR MULTISCALE ELECTROMAGNETIC SIMULATIONS presented by Vikram Melapudi has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical ErgineerinL {Ami/é "\ Major Professor’s Signature MM Vi, 200? Date MSU is an Affinnative Action/Equal Opportunity Employer 4 —.-n-I-I-I-O-O-O-O-O-l-l-l-I-O-I-l-O--o—I0-'-—--—-—-—-—-—-—-—-— — —- - PLACE IN RETURN BOX to remove this checkout from your record. To AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 5/08 K:/Prolecc&Pres/ClRC/DateDue indd FAST COMPUTATIONAL TECHNIQUES FOR MULTISCALE ELECTROMAGNETIC SIMULATIONS By Vikram Melapudi A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Electrical Engineering 2009 ABSTRACT FAST COMPUTATIONAL TECHNIQUES FOR MULTISCALE ELECTROMAGNETIC SIMULATIONS By Vikram Melapudi Multiscale electromagnetic simulations contain features with multiple length or frequency scales or both. Multiscale features are characteristic of realitic simulations as large degrees of freedom (N) are required to capture the minute physical details. Though integral equation (IE) approaches are well-suited for electromagnetic simu- lations, they require repeated evaluation of pair-wise potentials - also referred to as N -body problems. It is well known that the direct computation of these potentials scales as 0(N2) both in terms of computer memory and time. Even with the rapid advancements in computer technology, this places severe limitation on the size of the problem (N) that can be analyzed in a realistic time frame. Further, multiscale sim- ulations produce badly-conditioned systems of equations that require large number of iterations when using Krylov-subspace solvers. The main goal of this thesis is to develop a suite of computational techniques that enables multiscale electromagnetic simulations in a fast, efficient and stable fashion. In this work, the accelerated Carte- sian expansion (ACE) algorithm is used to overcome the quadratic cost-scaling of N -body problems. ACE was intially developed for the fast evaluation of polynomial potentials and here it is extended to the fast computation of retarded and Helmholtz potentials. These algorithms are shown to be stable and efficient for computation of electromagnetic potentials at sub-wavelength or low—frequency scales Hybrid com- bination of these algorithms with existing fast methods leads to the development of multiscale electromagnetic solvers that are stable and efficient across length and fre- quency scales. Since the fast algorithms only reduce the time spent in each iteration, a new integral equation formulation is developed that yields better conditioned systems of equations. This is achieved by reformulating the augmented field integral equations such that the resulting Operators are bounded and compact. Further, the widespread availability of parallel distributed or cluster computers combined with the memory and speed restriction of single processor computers necessitates the development of efficient parallel implementation of the sophisticated fast algorithms. The parallel algorithms developed in this work are provably scalabale and enables simulation of problems with several millions of unknowns on large scale clusters, with hundreds of processors and beyond. In this thesis, ACE algorithm is also extended to rapid computation of time domain diffusion potentials. ACKNOWLEDGMENT During the past six years of my Ph.D. life at Michigan State University, I have been influenced by many people. I wish to acknowledge the select few who have left an everlasting impression on me. I start by thanking my advisor Prof. Shanker Balasubramaniam. More than an advisor, he has been a good friend and his contagious enthusiasm and rigor in research continues to amaze me. It is difficult to encompass my gratitude and appreciation towards him in few words. I thank my committee members Prof. Lalita Udpa, Prof. Satish Udpa and Prof. Neil Wright for all the support, advice and time they have provided me. Special regards to Prof. Lalita Udpa, my first advisor, for showing the support and confidence in me to proceed with Ph.D. The frank discussions with both Prof. Udpa(s) have helped be at home in MSU. I also thank professors E. Rothwell, A. Tamburrino and C. Y. Wang (Math Dept.) for their influential research perspectives and exciting courses. I carry with me special memories of all the colleagues and friends in NDEL and BSer. I thank them for granting me a fun-filled research life. I am indebted to specially mention Kavitha, Naveen, Sridhar, Gokul and Yiming. I add my heart filled gratitude toward my friends Kavitha and Sanketh for their support and affection. Finally, I would like to thank my family Mr. Karunakara Reddy (father), Mrs. Bhargavi (mother) and Mr. Vinod (brother). Their unconditional love, moral support and appreciation of my passion continues to help me lead a satisfactory life, while maintaining my unconventional perspectives. iv TABLE OF CONTENTS List of Tables ................................. viii List of Figures . . . ., ............................ x Introduction 1 1.1 Background ................................ 1 1.2 Hierarchical Computation Scheme .................... 3 1.2.1 Preliminaries ........................... 4 1.3 Static fast multipole method ....................... 7 1.3.1 Single level scheme ........................ 8 Spherical harmonics ....................... 8 1.3.2 Multilevel FMM algorithm .................... 10 1.3.3 Diagonalized translation Operators ............... 13 1.4 FMM for Helmholtz equations ...................... 15 Single Level FMM ........................ 17 Multilevel FMM .......................... 20 1.4.1 Other FMMs ........................... 25 1.4.2 Wideband FMM ......................... 27 Scaled expansions ......................... 27 Spectral representation based plane wave expansions ..... 28 1.5 Applications ................................ 30 1.6 Thesis Objectives and Outline ...................... 32 Accelerated Cartesian Expansions (ACE) 39 2.1 Introduction ................................ 39 2.2 Cartesian Tensors ............................. 41 2.3 ACE: Definitions and Theorems ..................... 41 2.4 Multi-level Tree Computational Framework ............... 45 2.5 Algorithmic Improvements ........................ 48 2.5.1 Reduced Interaction List ..................... 48 2.5.2 Compressed Oct-tree ....................... 49 Fast Evaluation of Time Domain Retarded Potential in Sub-Wavelength Structures 51 3.1 Introduction ................................ 52 3.2 Problem Statement ............................ 54 3.3 Single Time Step Geometries ....................... 56 3.4 Multiple time step interaction ...................... 58 V 3.5 Results and Discussion .......................... 61 Wideband FMM and Multiscale Electromagnetic Solver in Frequency Domain 73 4.1 Introduction ................................ 74 4.2 Integral Equation and FMM for Helmholtz Equations ................................. 76 4.2.1 Sub—wavelength breakdown of Helmholtz FMM ........ 78 4.3 ACE translation operator for Helmholtz potential ........... 79 4.4 Hybrid algorithm for multiscale problems ................ 81 4.4.1 Combining ACE and FMM ................... 82 4.4.2 Implementation details ...................... 84 4.5 Results ................................... 85 4.5.1 Helmholtz potential evaluation ................. 85 4.5.2 Multiscale scattering problems .................. 87 A Well Conditioned Formulation of Augmented Electric Field Inte- gral Equation (AEFIE) 96 5.1 Introduction ................................ 97 5.2 Preliminaries ............................... 99 5.2.1 Interior Resonance and Augmented IE ............. 100 5.2.2 Operator and Eigenvalue Analysis ................ 101 5.3 Well-conditioned Formulation for AEFIE ................ 105 5.3.1 Eigenvalue analysis in 2D .................... 109 5.3.2 3D Problems ........................... 110 5.4 Results ................................... 112 Algorithms for Implementation of Hierarchical Computations on Parallel, Distributed Computers 121 6.1 Introduction ................................ 122 6.2 Preliminaries ............................... 124 6.2.1 The Fast Multipole Method ................... 126 6.2.2 Accelerated Cartesian Expansion (ACE) ............ 127 6.2.3 Hybrid algorithm for multiscale problems ............ 129 6.3 Parallel Algorithm for FMM ....................... 130 6.3.1 Parallel Construction of the Oct-tree .............. 131 6.3.2 Distribution of ACE and FMM harmonic data ......... 133 6.3.3 Construction of Interaction Lists ................ 134 6.3.4 Multipole and local expansion computation ........... 135 6.3.5 Translation Operation ...................... 136 6.3.6 Evaluation of Potential ...................... 137 6.3.7 Parallel Electromagnetic (EM) Solver .............. 138 6.4 Results ................................... 138 6.4.1 Kernel Evaluation ......................... 139 6.4.2 EM Simulations .......................... 141 vi 7 Summary and Future Work 152 7.1 Summary ................................. 152 7.2 Future Work ................................ 154 7.2.1 Well conditioned formulation for EM solver .......... 154 7.2.2 Fast algorithms .......................... 155 7.2.3 Numerical solution procedures .................. 156 A A combined accelerated Cartesian expansion (ACE) and fast Fourier transform (FFT) acceleration scheme for rapid evaluation of diffu- sion potentials 157 A.1 Introduction ................................ 158 A2 Mathematical Preliminaries ....................... 160 A3 Acceleration Schemes ........................... 162 A31 ACE for Spatial Acceleration .................. 162 A32 FFT based Temporal Acceleration ................ 164 A.4 Results ................................... 168 B Comprehensive Exam Problem: Integral Equation Based Eddy Cur- rent Model for Defects in Layered Media 174 B.l Introduction ................................ 175 B2 Integral Equation Formulation ...................... 176 B.2.1 Formulation ............................ 176 3.2.2 Planar Media Green’s Function ................. 177 B3 Numerical Implementation ........................ 180 B.3.1 Evaluation of Green’s function .................. 181 B.3.2 Solution to integral equations .................. 183 BA Results and Discussion .......................... 184 C Curriculam Vitae 190 Bibliography ..................................................... 194 vii 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.1 4.2 4.3 4.4 LIST OF TABLES Comparison between Old and New (reduced) scheme for interaction list for different distribution sizes (N3) and {P, K} pairs. ....... Exact multipole to multipole and local to local operators of ACE Error far in single time step interaction case (C3 = 0.5), for various N3 and {P, K} pairs ............................ Comparison of run-time in single time step interaction case (Cs = 0.5). Error far in multiple time step interaction case, for various combina- tion of N 3, N and {P, K} pairs. N is the number of distinct time steps involved ................................... Comparison of run-time in multiple time step interaction case Error far for non-uniform geometry configuration 1 ........... Comparison of run-time for non-uniform geometry configuration 1. . . Comparison of run-time for non-uniform geometry configuration 2. . . Error convergence of ACE algorithm with random points within a A / 2 size domain ................................ Error in FMM multipoles computed from ACE multipoles using Tmap in (4.23) .................................. Time for hybrid algorithm as applied to uniform and non-uniform ge- ometries .................................. Multiscale problem 1 : Cone—sphere geometry ............. viii 67 68 70 70 71 71 71 72 72 90 93 94 94 4.5 4.6 5.1 5.2 6.1 6.2 A.1 A2 A3 A.4 Multiscale problem 2: Almond ...................... Multiscale problem 3: Toy-aircraft .................... 3D AEFIE condition numbers with loop—star decomposition ..... 3D AEFIE condition numbers with orthogonal, quasi-Helmholtz de- composition ................................ Average time spent by an individual processor at different stages of hierarchical tree computation for N =40 million. ............ Average time spent by an individual processor at different stages of hierarchical tree computation for N =20 million points uniformly dis- tribution within a cube of side-length 20A. ............... Exact translation operator in ACE algorithm, P denotes the number of ACE harmonics. ............................ Error convergence for different number of ACE harmonics (P) and different source/ observer configuration (N3, (1) ............. Time for different problem size (N3) within a cube of sidelength d = 0.5m. In all cases Nt = 256, P = 3 (6 = 0(1E — 5)) .......... Tfast for different Nt size. In all cases N5 = 8,000, d = 0.5, P = 3 (e = 0(1E — 5) .............................. 95 95 115 142 143 172 172 173 173 1.1 1.2 1.3 1.4 1.5 1.6 2.1 3.1 3.2 3.3 3.4 3.5 LIST OF FIGURES Hierarchical decomposition of a 2D computational geometry ..... Representation of 2D computational geometry using quad-trees. Boxes at different levels and corresponding nodes in tree are represented using binary keys. ................................ Illustration of interaction list; dark boxes are contained in the interac- tion list of source box. .......................... Illustration of computational load in single- and multi-level FMMs. Dark nodes correspond to actual sources while light shaded nodes rep— resent centers of multipole and local expansions ............. Various operators involved in a multilevel FMM ............ Re—grouped boxes in original interaction list, in figure 1.3, for applica- tion of diagonal translation operator (1.15) .............. An example of compressed-quadtree with binary key representation used to label the tree nodes. ....................... Example of antenna feed geometry with low- and high-frequency regimes denoted by (2 L F and 0 H F respectively. Smallest wavelength of inci- dent pulse is also shown for reference ................... Definition for domains interacting over multiple time steps ...... Map of N in equation 3.11 for an example single level interaction. . . Non-uniform geometry configuration 1, resembling interconnect in elec- tronic chips (Ns=12000) .......................... Non-uniform geometry configuration 2 (N329600) ............ X 36 36 37 37 37 38 50 66 66 67 68 69 3.6 3.7 4.1 4.2 4.3 4.4 4.5 4.6 5.1 5.2 5.3 5.4 5.5 5.6 log(N3) vs. log(Tfa,.) for single interaction case and uniform geometry 69 log(N3) vs. log(Tfa,.) for single interaction case and non-uniform ge- ometry ................................... An example non-uniform (a) point distribution and (b) its tree repre- sentation. ................................. Time vs. no. of unknown in log-log plot when hybrid scheme is applied to uniform and non-uniform (fig 4.1) geometries ............. Bi—static RCS of cone-sphere geometry, corresponding to Run 3 in table 4.4. Inset figure shows the incident excitation and magnitude Of surface current. .................................. Bi-static RCS of cone-sphere geometry, corresponding to Run 2 in table 4.4. Inset figure shows the incident excitation and magnitude Of surface current. .................................. Bi-static RCS of NASA fat almond (multiscale geometry 2) correspond- ing to Run 2 in table 4.5. Inset figure shows the'incident excitation and magnitude of surface current .................... Bi-static RCS of Toy-aircraft geometry (multiscale geometry 3) corre- sponding to Run 1 in table 4.6. Inset figure shows the incident excita- tion and magnitude of surface current .................. r Eigenvalue spectrum of J,’,(na)Hn(2)(na) corresponding to 2D—tEFIE operator (5.20). .............................. I Eigenvalue spectrum of Jn(rsa)Hn(2)(rsa) corresponding to 2D—tMFIE operator (5.24). .............................. Eigenvalue spectrum of Jn(r<.a)Hn(na) corresponding to some of the AEFIE Operators .............................. Surface current on 2D circular cylinder computed using 2D AEFIE and analytical solution. ............................ Singular values Of 2D-AEFIE formulation before and after deflation. . 2D AEFIE condition number vs. frequency for circular cylinder. . . . xi 70 90 91 91 92 92 93 116 116 117 117 118 118 5.7 5.8 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 2D AEFIE condition number vs. frequency for elliptical cylinders with different aspect ratios. .......................... (a) Shows the 2D AEFIE condition number vs. frequency for a 3D multiscale geometry and (b) shows the geometry ............ An example compressed tree used in ACE+FMM hybrid approach. The Z—space filling curves or Morton ordering formed by the sorting the nodes Of the tree at a particular level. ............... Illustration of the tree partitioning scheme proposed in this work. The subsequent distribution of nodes and duplicate nodes in each processor is also shown. ............................... Efficiency Of the parallel-ACE algorithm for computation of Helmholtz potential between N uniformly distributed random point within a cu- bical volume. ............................... Computational complexity of the parallel-ACE algorithm for the case of uniformly distributed random points in a cubical volume. The slope of linear line fits, shown by dotted lines, are close to unity and indicates the linear complexity of parallel-ACE algorithm ............. Time spent by individual processors of a 128 processor set at different steps of tree computation for N =40 million using ACE expansions. Efficiency Of the parallel-ACE algorithm for computation of Helmholtz potential between N uniformly distributed random points placed on the surface of a sphere ........................... Time spent on the Helmholtz FMM translation operation by individual processors of a 128 processor set with and without adaptive direction partition strategy .............................. Time spent by individual processors Of a 128 processor set at different steps of the tree computation for N =40 million using the Helmholtz FMM expansions .............................. 6.10 Efficiency of the parallel-FMM algorithm for evaluation of Helmholtz potential between N uniformly distributed random points within a cubical volume. .............................. 119 120 143 143 144 144 145 145 146 146 147 6.11 Computational complexity of parallel-FMM algorithm as a function of number of unknown N for different processor sets P. The slope of linear line fits are shown by dotted lines. ................ 148 6.12 Efficiency of the parallel-FMM algorithm for evaluation of Helmholtz potential between uniformly distributed random points on the surface of a sphere. ................................ 148 6.13 Comparison of ROS due to plane wave scattering from a 64A PEC sphere computed using the parallel EM solver and Mie series solution. Only a portion of the RC8 is shown for clarity .............. 149 6.14 Comparison of RCS due to plane wave scattering from 3 128A PEC sphere, discretized with 14 million unknowns computed using the par- allel EM solver and Mie series solution .................. 150 6.15 Efficiency of the parallel EM solver for the scattering from PEC sphere with different number of unknowns N and as number Of processors was varied from 64 to 512. .......................... 150 6.16 Induced surface currents on multiscale geometries (a) toy-aircraft with sharp edges and (b) tetrahedron shaped arrow .............. 151 A.1 Illustration of the Block-Toeplitz computational scheme ........ 172 A2 log Tfast vs. log N, from Table A3, slope of linear fit = 1.1 ..... 173 B.1 Different domains in eddy current simulation .............. 185 B2 DCIM path from existing literature for zero conductivity ....... 185 B.3 Comparison between existing and PrOposed DCIM path with finite conductivity ................................. 186 B4 Tfansmitted field at various depth .................... 186 B5 Tfansmitted field with varying frequency ................ 187 B6 Z vs. coil lift-Off .............................. 187 B7 Z vs. excitation frequency ........................ 188 B8 Experiment to validate the integral equation model. Measurement of impedance as the coil is scanned across the defect ............ 188 xiii B.9 Absolute value of coil impedance from the IE model. ......... 189 B.10 Phase value Of coil impedance from the IE model. ........... 189 Chapter 1 Introduction This chapter provides a comprehensive introduction to the development of fast mul- tipole methods (F MM) within the context of electromagnetics. Section 1.1 gives a brief account of the various developments in FMM that are elucidated in greater de- tail in rest of the chapter. Section 6.2 provides a general introduction to hierarchical algorithms, which forms the backbone of this thesis work. Section 1.3 and 1.4 details the development of various versions Of FMM for Laplace and Helmholtz equation, respectively. Section 1.6 gives a preview of the developments made in this thesis work along with the outline of the thesis. 1. 1 Background The numerical solution of Maxwell’s equations has typically proceeded along two different paths. The first, and perhaps the more popular, is the direct discretization of Maxwell’s equations [1, 2]. Finite difference and finite element methods belong to this class of solvers. Their popularity stems from two salient features; (i) they are typically simpler to program and (ii) their memory and CPU cost scales as 0(N), where N denotes the number of degrees of freedom. The second methodology for 1 solving Maxwell’s equations are based on developing integral equations (IE) derived by evoking the Green’s identity/ equivalence theorems. While the latter was introduced in electromagnetics more than four decades ago [3], they were not a popular option for electromagnetic analysis. The bottlenecks to their adOption was due to both the memory and CPU complexity, both of which scale as 0(N2). This is despite some of the inherent advantages Of integral equations for analyzing open region problems, viz., better condition numbers, possibility of using surface integral equations and incorporation of the radiation boundary condition in the Green’s function. The introduction of the fast multipole methods (and tree codes) significantly al- tered the landscape. Both these methods were developed in response to accelerating pairwise potential evaluations in N -bOdy problems in fields ranging from biophysics to computational chemistry to astrophysics, etc. Here, it is necessary to compute long-range Coulombic potentials repeatedly between N randomly distributed parti- cles. The tree methods [4, 5] and the fast multipole method (FMM) [6, 7, 8, 9] reduced the computational complexity of computing these pairwise potentials from 0(N2) to 0(N). FMM and tree codes are based on a hierarchical decomposition of the com- putational domain, and using multipole/ local expansions to compute the influence between sub-domains that are sufficiently separated. The FMM, as introduced in [7], exploits the representation of the potential in terms of spherical harmonics. As we shall see, this is a consequence of using addition theorems to represent the potential as a series wherein each term is a product of two functions. These functions depend either on the coordinates of the source or the observer only. The separation between source and Observer is crucial to creating a fast scheme. At about the same time, an algorithm that achieves the same reduction on complexity, albeit using Cartesian tensors was introduced [10]. This derivation relies on using Taylor expansion Of the potential function to provide the necessary addition theorems [11]. Cartesian expan- sions have been used extensively in tree codes. More recently, FMM codes based on 2 Cartesian expansions have used recurrence relations to avoid derivatives [12]. Typi- cally, FMMs derived using the Cartesian expansion are more expensive as spherical harmonics are optimal in representing Coulombic potentials. However, it was recently shown that it is possible to develop a FMM using Maxwell-Cartesian harmonics that are as optimal as using spherical harmonics with one singular advantage; it avoids the need for special functions [13]. Both FMM and tree codes have revolutionized analysis in various application domains ranging from molecular dynamics [14], elas- tostatics [15, 16], elastic wave equations [17], flow problems [18], capacitance [19] and impedance [20] extraction in micro-electronic circuits, evaluation of splines [21] and spherical harmonics [22, 23]. The FMM framework has also been extended to the solution of potentials resulting from parabolic equations [24, 25, 26]. However, direct extension Of FMM to the solution of potentials arising from hy- perbolic equations is not as straightforward. The first solution to this problem was presented in two dimensions [27, 28], and soon extended to three dimensions [29, 30]. The crux to developing these algorithms was the derivation of a diagonalized form of the translation operator [30, 31, 32]. Since then, there has been a virtual ex— plosion in research in application Of these methods to various problems in electro— magnetics; see [33, 34, 35] and references therein. The state of art is such that problems Of the order of tens million spatial degrees Of freedom have been solved [36, 37, 38, 39, 40, 41]. However, the development of FMM based method continues on many fronts [42, 43, 44, 45, 46, 47, 48, 49, 50]. This paper reviews progress in FMM technology since its inception and details current trends in FMM research. 1.2 Hierarchical Computation Scheme The purpose Of this section is to outline the structure of fast multipole methods and introduce notation that will be used in the rest of the paper. 3 h .‘r K‘s“: .‘ ' ‘I " \\ \‘L .., \x. «A. An» v'- F. “AA. 59" -) F; 1.2. 1 Preliminaries Consider a source distribution q(r) such that supp {q(r)} = 52 C R3. Likewise, it is assumed that the Observers are also distributed in $2. With no loss of generality, it is assumed that q(r) = Sill qz-J (r — r,), where N is the number of degrees of freedom. The field due to this source constellation at any point r E R3 is given by k 45(1’) = 9(Irl) * (10‘) = :90]? - 130% (M) i=1 where g(|r|) is the appropriate Green’s ftmction, and * denotes a spatial convolution. It is apparent from this expression that the field evaluation scales as 0(N2) for N observation points. Ideas introduced by [4] to ameliorate this cost for static problems relies on exploiting the fact that the field at a point due to a cluster of sources is rank deficient, where the rank depends on the distance between the point and the center of the cluster. In other words, for a given accuracy, potential at an Obser- vation point sufficiently separated from a cluster Of sources can be computed with few multipole expansions. Similarly, for given accuracy, few local expansions can be used to compute potential at a cluster Of Observation point due to a well-separated source. These ideas were cast in a more formal framework as tree-codes [5] and FMM [6]. At this point, we note that there is rampant confusion in terminology; both FMM and tree codes are used interchangeably. While the two methods are closely related, there are subtle but significant differences between the two [51]. Tree codes compute interactions between source pairs using one of three methods: (i) directly, (ii) evaluating fields at each Observation point using multipole expansion due to a cluster of sources, or (iii) using local expansion at Observation clusters to find fields. The decision on the Operation used depends on which one is computationally eflicient. On the other hand, the algorithmic structure of FMM enables the computation of potentials in an Optimal manner [51]. Two additional operations that permit this are 4 ..,‘.... is» 9.7.- .II “I... ‘r L“ “ k. aggregation and disaggregation functions. These permit the computation Of informa- tion at coarser (or finer) levels using information at finer (or coarser) levels. Thus, FMM relies on a hierarchical decomposition of the computational domain. This is achieved using the following strategy [8]; the computational domain 52 is embedded in a fictitious cube that is then divided into eight sub-cubes, and so on. This prO- cess continues recursively until the desired level of refinement is reached; an Nl-level scheme implies N1 — 1 recursive divisions of the domain, see figure 1.1. At any level, the (sub)domain that is being partitioned is called the parent of all the eight children that it is being partitioned into. At the lowest level, all source/Observers are mapped onto the smallest boxes. This hierarchical partitioning of the domain is referred to as a regular oct-tree data structure. Regular oct-tree representations are optimal only for geometries with uniform distribution [52]; non-uniform distributions can be repre- sented using compressed oct-trees [51, 39]. In compressed oct-trees, sub-division- of a domain is stopped when number Of source/Observer in that domain becomes less than a pre-fixed limit. While many algorithm exist for constructing a tree, the one that we have found to be efficient is the use of key data-structure to represent the nodes Of a tree. In this approach the root box enclosing the entire geometry is represented with integer value 1; each of the eight (four) children of a parent is identified with a three (two) bit code which is appended to the parent box key to Obtain their global unique key. Figure 1.2 shows an example compressed oct—tree where each box is rep- resented using key-codes. This representation has several advantages: the nodes of tree at each level automatically follow Morton ordering and it plays an important role when partitioning the boxes among processors in parallel algorithm, all antecedents of a box and essential information like size Of box, center position, level etc. can be readily recovered from its key-code using bit manipulations [53, 38, 54]. Mapping the computational domain onto a tree facilitates partition/ classification of interactions as being either in the near or farfield. This is done using the following rule: at any level in the tree, all boxes/sub—domains are classified as being either in the near or far field of each other using the following dictum: two sub-domains are classified as being in the farfield of each other if the distance between the centers is at least twice the side length of the domain, and their parents are in the near field of each other; see figure 1.3 for an illustration of these classification. Once, the interaction list have been built for all levels, the computation proceeds as follows; at the lowest level, interaction between the elements of boxes that are in the near-field of each other is computed directly, i.e., using (1.1). All other interactions are computed using a three stage algorithm: (i) compute multipoles of sources that reside in each box; (ii) convert these to local expansion at all boxes that are in its far field; (iii) from the local ex- pansion, compute the field at each Observer. This simple three stage scheme is called a one-level scheme, and necessitates the development of theorems for (i) computation Of multipoles at leaf (or smallest boxes), (ii) translate multipole expansion to local expansion and (iii) finally, aggregate the local expansions in a box to compute the field at all the Observers. It is apparent that one can derive a more efficient compu- tational scheme by embedding this scheme within itself as shown in 1.4. That is, if two sets of sub-domains that interact with each Other are sufl'iciently far away, then these clusters may be combined to form large clusters that then interact with each other at a higher level and so on; this is referred to here as a multilevel scheme. This implies that it is necessary to develop additional theorems that enable (i) shifting the origins of multipole so that effects of small clusters can be grouped together to form larger clusters and (ii) move the origin of local expansion so that expansions at the origin of the parent may be disaggregated to those of its children. In concert, these theorems enable one to traverse up and down the tree, and are presented next. This said the various steps involved in the hierarchical computing are shown in Algorithm 1. Note that in single level algorithm the upward and downward tree traversal (steps 6 Algorithm 1 Hierarchical tree computation 1: Construct the tree representation for the given geometry (distribution Of discrete points). 2: Build interaction list using the above definition, for all boxes in the tree and the near-field list for leafless boxes. 3: NF: Use direct method for computation of nearfield potential at observation points in each leafless box from sources contained in its near-field boxes. 4: $2M: compute multipole expansions for each leafless boxes from sources con- tained within it. 5: M2M (upward traversal): for all parent boxes compute the multipole expansion by combining the multipole expansions at their children boxes. 6: M2L (translation): for all boxes in the tree convert the multipole expansions to local expansions about centers of boxes in their interaction list. 7: L2L (downward traversal): update the local expansion information at a child box using the local expansion of their parent box. 8: L20: use the local expansions about each leafless box to compute the farfield potential at its Observation points. 5 and 7) are absent. Next, we will detail these Operations for diflerent FMMs. Starting with well known static FMM to those for Helmholtz and finally to those for Wideband FMM. Details are presented for the first two despite the fact that they are well known. The rationale for doing so is two fold (i) it is important to understand when FMM for Helmholtz fails and (ii) techniques developed for static FMM and some of the new FMM approaches find their way into the development of Wideband FMM. 1.3 Static fast multipole method This section provides the appropriate theorems for fast evaluation of potential defined in terms of g(|r|) = 1/|r|. Such potentials are commonly used in study of plasma dynamics, magnetostatic problems, eddy currents etc. While on first glance, one might be inclined to exclude methods developed for rapid evaluation of the Coulomb potential but these play an important role in developing fast methods for Wideband problems. 1.3.1 Single level scheme Consider two domains (23 E R3 and {20 E R3 that comprises Of randomly located source and Observer points, respectively. With no loss of generality, it is assumed that the number of sources and observers are k, these domains can be embedded in spheres of radius a. The centers Of 93 and 90 are denoted by rs and r0, respectively. It is assumed that (23 C Q; and (20 C {-20 and (2305—20 2 Q), and the domains of 93 and (20 are sufficiently separated. In what follows, the domains Os and {—20 will be called parents Of {23 and 90, respectively. The parent domains can be embedded in a sphere Of radius 2a, and their center are denoted by rg and r5, respectively. Next, we will present a single level FMM constructed using two methods; (i) spherical harmonics and (ii) Cartesian tensors. Spherical harmonics The. theorems for a single and multilevel FMMs using spherical coordinates were introduced in a series of papers [7, 8], and have found extensive application in various disciplines; a sampling of these can be found in [6, 8, 55, 19, 20, 56, 57]. The genesis of the method is the well known generating function for Legendre polynomials [58], 1 1 i ——’"'" = = P (cos'y) (1.2) +1 n R r\/1 — 2% cosy + ("—I)2 n=0 Tn 7' with cos'y = cos (9 cos 0' + sindsin 0' cos(¢ — (1),) (1.3) where Pn(u) represents Legendre polynomial of degree n, r’ = (r’, 0’, 45') and r = (r, 0, d2). Legendre polynomials in (1.2) can be represented in terms of spherical 8 harmonics Ynm(9, ¢) using the addition theorem [59], nc(o)sry 2:: Yn(9m m(9’ ,gb) (1.4) m=-n where the superscript * represents complex conjugate. Using (1.4) in (1.2) results in complete separation of source and Observation quantities, _ _ -2 2 ”WM 02¢ ,)K.’:_T[__+,1¢) (1-5) n=0 m=—n These expressions enable the derivation of the following theorems necessary for steps 4, 6 and 8 in Algorithm 1. Theorem 1.3.1 (Multipole Expansion ($2M): spherical). Let 19 charges of strengths {q,-,i = 1,...,k} be located at r,- E 93 with Ir,- — r3] < a. Then for any 1‘ 6 Do, the potential ¢ is given by, 00 n =2 2 Mr,—’"—3’:f.f..":’. m» n=0 m=—n where k = Z QiIri — rsI”YI:"(0.-, 9152') (1.7) i=1 where the parameters {9;,¢,-} and {9, <15} are spherical coordinates of r,- and r w.r.t the origin at r3. In Theorem 1.3.1, MI," is the multipole expansion at r3 constructed from the source quantities q;(r,-). Proofs for the error bounds in the above expressions can be obtained from [8, 9]. Next, these multipoles are translated from r3 to to. Theorem 1.3.2 (Multipole to Local 'D'anslation Operator (M2L): spherical). Given a multipole expansion 07'? about r3, it can be mapped to local expansion L59 at 9 n; . I Thec its; r0 using " orHit-ml-'ml-Ik'ArAtliT;k(6.i) Left 2 n m-k i+n+1 (1'8) n=0m=-n (_1) Ai+n Irs _ rOI _ n where {9, 45} are the spherical coordinates of the r3 w. r.t r0, and AI," — ( 1) _ \/(n—m)!(n+m)!' Finally, the local expansions at any leaf node may be mapped onto the observers using the theorem presented next. Theorem 1.3.3 (Local expansions to Observer (L20): spherical). The potential at a point r E 90 due to local expansion Lg about origin is given by, ¢=2 Z LT'I’Ir-rolnYJ"(9,¢) (1.9) n=0 m=-n As before, the parameters {9, (1)} are the spherical coordinates of r with respect to the origin at r0. The above theorems, in a one level setting, permit the rapid computation of potentials at all points in 90 due to sources in {23. It is evident that this scheme can be embedded within itself to create a multilevel scheme. But prior to doing so, it is instructive to reexamine the fundamentals of FMM from a Cartesian perspective. 1.3.2 Multilevel FMM algorithm It is apparent that the 0(N4/3) cost of single level algorithm can be further reduced by embedding this scheme within itself, as is evident from figure 1.4. To implement such a scheme it is necessary to develop methods that enable one to construct multipole expansions at a parent level from those at their children. These are effected using the following theorems. Theorem 1.3.4 (Multipole to Multipole (M2M): spherical). A multipole ex- 10 pension 0;,” about r3 can be mapped onto one that exists around r3 using " 0i“: .1"( )lkl—ImI-I’c-m'ArAm(swerve) =2 2 m) k ( n=0 mz—n Ai where rs =|r3 2 rs — fl; and {9, (15} are the polar coordinates of rs w.r. t. r3. Theorem 1.3.5 (Local to Local (L2L): spherical). Given a local expansion 0;? about 1'5, it can be mapped to one around r0 using 11 n 0,7,"(—j)ImI-IkI‘Im-klAxgkAfyflgkw, (1)) (rgC)n—i = Z Z . (1.11) n=i m=—n (— 1)fl+‘lA1T{l 'whererpc =|rgc|- — Iro— r5], and {9, rt} are the polar coordinates of re w.r. t. r5. The equivalent theorems for Cartesian expansion likewise follow. Theorem 1.3.6 (Multipole to Multipole (M2M): Cartesian). A traceless mul- tipole tensor Oém) at r3 is related to Mlm) that is centered at r? via m m ("llnvn 11376 n m—n M] )=Z_:O n! (2n(—1]!!OI ) (1'12) wherergc=r§—r3. Theorem 1.3.7 (Local to Local (L2L): Cartesian). Given a local expansion 0]”) that exist in the domain (20 centered around 110’, it can be shifted to the domain (20 centered at r0 using m m m+n m Tl. n L] l: E o] + ).(m).(r2p)t (1.13) 1120 m where r3” = r0 — r? 11 ._- t)! (I‘ ‘1" 8. ,4 11!: J C r1) __7' ( real. 03‘. -I r». Ge): 9.17 These theorems, in concert, permit traversing up and down the oct-tree, see figure 1.5. While these theorems are the bare-bones presentation of the steps required, there have been several attempts to make these more efficient [7, 8, 14, 60]. As both methods are based on Taylor expansions the upperbounds in using these approaches can be readily derived. Such a derivation is presented in. [8, 13]. Alternatively, another interesting algorithm was introduced in [13] that permits exact evaluation of the multipole expansion at the parent given the multipole expansion at the children—this has been shown both analytically and numerically for different potential functions. However, in order to get this exact expression, one has to abandon the use of traceless tensors. It follows that the cost Of using exact multipole to multipole translations is higher. But in our experience, we have found that we need a smaller number of multipoles for the same precision, and this can significantly affect the total cost, especially for large data sets [13]. Abandoning the use of traceless Operators has three salient benefits; (1) the algorithms can be used for any potential function whose Taylor’s series converges rapidly, (ii) it does not depend on special functions and (iii) only the translation Operator depends on the potential function which implies that multiple potentials may be easily combined [61]. In all the above expressions, it was assumed that the number Of multipoles used was infinite. The analytical estimates regarding truncation of this sum for both the spherical and Cartesian form can be found in [8, 13]. The cost analysis for multilevel approach is as follows: the total number of boxes in the tree is 0(N/s) and the cost for 82M and L2O operations remains the same; the cost of applying M2L translation Operation across levels scales as 0(P4N/s). In addition the cost of applying M2M and L2L Operations for all boxes scales as 0(N/sP4). Thus, the overall computational cost associated with both schemes scales as 0(P4N). This cost is largely dominated by the time for multipole to local translation (M2L) and considerable research effort has been expended on reducing this cost. A closer examination of the M2L Operation reveals 12 r... . uu. V (1.. . U. It)? : that (i) the number of translations per box is 189 and (ii) the cost per translation scales as 0(P4). The latter is due to the fact that this operation is not diagonal. Greengard et. al. [9] remedied this deficiency by introducing a novel algorithm that diagonalizes the translation Operator. Additional modifications to the overall algorithm introduced there [42, 62] further reduces the number of translations, making the “revamped” FMM extremely efficient. Ideas behind this diagonalization can be exploited by either both varieties of FMM; spherical and Cartesian. It also plays a key role in FMMs for low-frequency, and consequently, will be presented in some detail next. An FFT based implementation of above un-diagonalized form results in a overall cost that scale as (9(NP2 log P) [60], but will not be dwelt here. 1.3.3 Diagonalized translation Operators A diagonal translation Operator may be derived using a spectral representation of the Green’s function [9], viz., 1 1 00 _A 21r _, , _ = _ d/\e z/ dae JA(xcos(a)+ysm(a) 1.14 R 21f 0 0 ( ) for z > 0. It is apparent that the inner integral is in fact a zeroth order Bessel function. The computation Of potentials using the above expressions hinge on the existence Of an integration rule that is efficient to a given precision and scale invariant if this formula is to be used at different levels in the FMM tree. Given the existence of such a rule [63], the potential at any point can be written as [9] 8(6) M (k) ¢(r) = Z Z W(k,i)e—)‘kze_3’\k($cosai+y31nail + 0(6) (1.15) k=l i=1 where the coefficients W(k, i) are a combination of the charges q,- and integration weights wk, 8(6) and M (1:) denotes the number of integration points for 6 accuracy. 13 Evidently, in above discrete representation, the number of integration points M (k) for evaluating a integral depends on k to account for the varying bandwidth, Ak, of its integrand. The advantages of above scheme are immediately apparent in that it readily permits translation of the origin; translation Of the origin is quite simply a shift in the exponentials. The similarity between (1.15) and those in Theorems (1.3.1), and (2.3.3) are readily apparent. The mapping from spherical harmonic multipole coefficients MI," onto exponential expansions W(k, j) is given as [9], wa, i) = fill)“ §(j)ImIe-jmai ”gill \/ n _ 71:14)?” + ml}; (1.16) and given W(k, i) coefficients the spherical harmonic local expansion L? can be computed with, (j)|m| 8(6) M(k) . I"? = —A " W k,i e-‘Jmai , x/(n-m)!(n+m)!kE=:1( k) Z31 ( l (117) The multipole to local translation Operation, with diagonalized translation forms, can be computed as a three stage process: multipole coefficients are mapped to W(k, i), translate W(lc, i), and then map the translated coefficients back to local expansions, and then proceed as usual. It is evident that cost of all operators involving exponen- tial expansions scale as 0(P2). Various symmetry considerations in implementation reduces the number of total translation count from 189 to 40. Additionally, one can exploit symmetry in the expressions involved to further reduce the overall cost, if not the asymptotic complexity [56]. Thus, properly modifying and augmenting ei- ther spherical or Cartesian multipole based algorithms with plane wave translation Operators can considerably ameliorate the cost. However, a couple of issues must be noted; (i) the plane wave expression is valid for z > 0, this implies that the interaction list must be modified [9]; (ii) additional Operators must be introduced to rotate the 14 multipole Operators along the required axis; (iii) the operator developed should be scale invariant for the scheme to be efficient. In implementation the spherical har- monic multipole coefficient is converted into six plane wave expansions corresponding to each face of the cube and the interaction list definition is changed accordingly. For example, exponential expansions corresponding to +2 cube face is valid only for boxes present above X — Y plane, as illustrated in figure 1.6. Boxes in original interaction list are divided into six new sets termed as up-list,dovm-list,north-list,south-list,east- list and west-list corresponding to +2, —z, +y, —y, +x and —x cube faces respectively [9]. Overall, the diagonalized version of the translation operator reduces both the to— tal number of translation operation and per translation cost leading to a much faster algorithm. This approach is very similar to spectral approaches developed for al- ternative derivation of Helmholtz FMM [42, 64] and is the crux of many methods developed for Wideband FMM. 1.4 FMM for Helmholtz equations Thus far, we have seen that cascaded Taylor expansions can be used to develop static FMM. While these ideas are readily extended to the solution of parabolic equations as well [24], they are not readily extendable to Helmholtz equation kernels, especially at high frequencies. Furthermore, as was evident from last section, the scheme developed should be diagonal. Consider a problem setting that is identical to what was described in Section 6.2. We shall seek development of methods to accelerate the evaluation of the potential integral in (1.1) with g(|r|) = exp[—jrs:|r|]/|r|. One expression that readily suggests itself is the Gegenbauer addition theorem [59, 65, 31], _- oo (27%? = —jn§%(—1)l(2l+1)jl(nd)h[2)(nX)Pl(d-i) (1.18) 15 '7' l.” ‘4‘ {re , 4‘“), W511? ‘1' If?) r, ‘- where X and d are position vectors such that r = X+d and |X| > |d|, j; and him are lth order spherical Bessel and Hankel function of second kind, X = IX] and d = [d]. Augmenting this theorem with another addition theorem for Legendre polynomials in (1.4) completes the separation between the source and Observer coordinates. e —jK]X+d]— (2 ) l—-——x + dl =-JI€Z ZI:(- (21+1)jz(fid)hl (EX)Yzm(9X.h§21atfi - a.) (1.27) l=0 Where r03 = r0 — r3 Finally, the potential at any point r E 00 can be constructed using «mag flMfl4wflWWmWWM (me While these equations are readily derived from (1.23). More insight into the derivation of these equations can be Obtained by realizing that the farfield (and local expansions) can be represented in terms of spherical harmonics. In turn, this interpretation leads to expressions that reveal convergence rates Of these and error bounds as a function radius a and the separation distance. More importantly, this insight leads to the type of quadrature rules that must be used to implement these schemes numerically. In 18 of. other words, the continuous integral is evaluated using . L P "".7 = T Z” 22pm“; ,—kpq, r0 — r) T(kpq, r03)M(r3, kpq) (1.29) where L is the order Of the Gauss Legendre rule, mm are the integration weights, p and q are the integration points in 9 and (b axis, 27rq 7‘1 " 2L + 1 9,, is the (p + 1)th zero Of PL+1 (cos 9) 4r (1 — cos2 9p) (1‘30) "1m: (2L + 1) [(L + 1)PL(cos 9,,)]2 km 2 xsin9pcos¢q +ysin9psin¢q + 2cos9p As is apparent form the above equations, uniform sampling is used to evaluate the integral along ()5. Other applicable rules may be found in [68]. We have yet to elaborate the underlying factors that decide the order of Gauss-Legendre rule that is used along 9. A number Of formulae exist for choosing the number Of Gauss-Legendre quadrature point [30, 65, 69]. However, examination Of (1.23) yields interesting insight. If only the exponential terms are considered in this integral, it is apparent that these expressions can be represented using L= 0(rid)= 0(2na) harmonics. This, in turn, implies that the summation is also truncated using L terms. Though the reasoning here is based on economical means to discretize the integral a deeper reason, arriving at same conclusion, exists for choice of L based on original multipole expansion [65]. Choice of L should be large enough for the series (1.19) to converge, but not too large to cause numerical instability due to the asymptotic behavior of spherical Bessel and Hankel functions. Given that only a finite number of terms are being used, one can explicitly derive error bounds that, in turn, depend on the translation distance also [30]. Deriving rigorous error bounds has been a focus Of considerable work 19 [43, 69, 70, 71, 72], and the behavior of error is well understood [73, 74] as are the means to overcome these. A simple choice for truncation limit L applicable to most practical problems is, L = rid + Clog(nd + 7r) (1.31) where C is a number that depends on the desired accuracy 6; typically the choice of C is {3,5,10} for an accuracy c = {10-3,10-5, 10-14}, respectively [65, 75]. This estimate is semi-empirical and assumes that the two boxes are well separated if they are one box apart. Other estimates [76, 73, 69] based on approximation of Bessel and Hankel function exists both in two- and three-dimensions and can account for multiple box separation between interacting boxes [72, 74]. Cost of this scheme can be computed in the same manner as in the static with P = L and the diagonalized form of translation operator implies 0(P2) cost per Operation. However choice of L depends on size of box kd, which in turn dictates the number of unknowns per box 3 (assuming uniform discretization). It can be show that the Optimal cost of the above scheme scales as 0(N3/2) for surface problems. Multilevel FMM While the above exposition details the necessary mathematics for implementing a single level scheme, nesting these in a hierarchical setting is the next logical extention. The first robust attempts to do so are [7 7, 78, 79]. Extension to multilevel is different from that encountered for the Laplace FMM; there, the number of multipoles at all level of the tree was constant. But as is evident from ( 1.31) and (1.29) as the size of the source/ receiver boxes increases, the bandwidth increase increases by a factor of two, which implies that the number of directions increase by a factor of four. This then creates a need for developing robust methods for going up and down the tree for the stages of aggregation and disaggregation. These Operators can be thought of 20 as filters. But before we proceed into intricate details of the methods to implement these, the theorems that help achieve these are as follows: Theorem 1.4.3 (Translation of farfield signatures). If the farfield signature M(r3,k) around the point rs E 93 is known, then the farfield signature M (r§,k) around the point r? E 5212 is given by M(r§, k) = M(rs, k)e_jkl(rg_rs) (1.32) An identical theorem for can be derived for translating local expansion at the parent level to that Of its child. Numerical implementation of these theorems is not as simple as it seems. To maintain uniform accuracy across levels, employing (1.31), the L for parent is approximately twice that of its child. This implies that the num- ber of direction for parent box is approximately four times that of its child; thus the multipole expansions for the child and parent box are defined on different grids. This process of computing a higher bandwidth representation from lower bandwidth farfield signature is referred to as interpolation and anterpolation is its inverse ana- logue applied during downward tree traversal. Implementing the above theorems calls for efficient methods to interpolate (or anterpolate). Several methods that exist have been elaborated upon in [33] and summarized as well in [69]. An eflicient and exact algorithm can be devised using the forward and inverse farfield transform for both interpolation and anterpolation [79, 23, 80, 35]. This algorithm relies on the fact that at any level the farfield signature can be represented in terms of spherical harmonics, VlZ., OO M(-.k)=Z Z aannm(6.¢) (1.33) n=0 m=-n As is well known, the farfield signature of a source constellation is bandlimited to 0(na) harmonics. This implies that the above expression can be truncated. Further- 21 more, since an Lth order rule is chosen to evaluate the spectral integral in (1.23), it follows that the upper limit in the summation over n can be chosen to be L. This said, direct computation of anm is expensive. Alternate methods both exact and approx- imate have been discussed in [23, 81]. Consider the computation of anm from child farfield signature M (r3, kkp) represented using (2L2+ 1) coefficients, i.e. p = 1, - - - , L andq= 1,--- ,(2L+ 1), “ma = [d2"M(r31 k)Y;m(61 45) L (2L+1) . = 2: prnm (COS 6p) 2: M(r81km)ejn¢q (1.34) 1):] q=l L = 2 prnm (COS 6p)am (6p) p=1 where wp are numerical quadrature weights. Since, the integration along (b is per- formed using uniform sampling, fast Fourier transform (FFT) can be used for summa- tion inside the brackets. These coefficients are then used to compute samples along new polar coordinates (91,421) with p = 1, - -- ,L and q = 1, - -- ,(2L + 1) as, ~ ~ L ~ L M (r3,kpq) = Z e‘jm‘bq Z anman(cos 919) (1.35) m=—L "=1 Again, FFT can be used to evaluate the outer summation. In interpolation, L > L to accommodate for the increase in bandwidth and km represents the discrete direc— tions of the farfield signature corresponding to the parent. The required multipole coefficients about parent origin r3 can be obtained using a simple shifting Operation, Mag, 1%) = M(r,,12pq)e*jkm-(”s’—rs) (1.36) An inverse procedure is performed when translating local expansions from parent to 22 child where anterpolation is used in place of interpolation. First, the parent local expansion about 1‘3 is shifted about child origin r0; then in anterpolation, the forward and inverse farfield transform are performed to reduce the bandwidth in an exact manner as described above but with L < L, where L represents the number of harmonics in parent domain. Above procedure for interpolation/anterpolation can be further accelerated with the use Of fast Legendre transform [23] where the coefficients anm are not computed explicitly. Though this approach scales favorably the break- even point is large and not suitable for most practical applications [35]. This can be overcome to some extent using the 1D FMM for fast Legendre transforms [81]. Cost of Interpolation / anterpolation using this approach scales as O (Q log Q), where Q denotes the number of directions in farfield signature. This said it can be shown that overall cost of the multilevel algorithm scales as 0(N log2 N) [35]. Other methods used for interpolation and anterpolation have been presented in detail in [78, 33, 69]. These include the use of polynomials and approximate prolate spheroidal wave functions. The singular advantage of these methods is their cost scales linearly with the number of samples, thus the overall cost scales as 0(N log N). However, while interpolation is sufficiently accurate, one has to be more careful when anterpolating functions as it is necessary to remove higher order harmonics. While we have not digressed into implementation Of these schemes for vector electromagnetic problems, we must caution that it is not a trivial extension. It is important to realize that the farfield component represented in terms of polar components in not bandlimited [82], whereas they are bandlimited when represented in terms of Cartesian components. This means that one either uses a fast scheme based on vector spherical harmonics [82] or converts these to Cartesian before interpolation / anterpolation. Another intriguing method for interpolation and anterpolation was introduced by Sarvas [48], wherein he introduced modifications that enabled the use of FFTs. In other words, bandlimited 23 farfield signatures can be represented in terms of Fourier basis as P Q—l Made 23 :3 access”) 637) ' p=-Pq=-Q where, P Q-l ' atp.q)=DFT{M(k>}= Z Z e" 0, using small argument approximation for spherical functions and with t = n, it is a straightforward exercise to show that the normalized expansions reduces to the expansions (1.2) used in static case. While the normalized forms ensures numerical stability, the low-frequency nature Of the problem implies that one can choose the number of multipoles to be same at every level. This in turn implies that the multilevel version of this approach scales as 0(N) [46]. A constant normalization factor is sufficient when the geometry is uniformly discretized. How- ever to accommodate wide variation in domain sizes and maintain the stability Of expansion different normalization factor should be chosen in different parts [33]. This approach has been successfully used in integral equation solution for scattering from sub-wavelength structures [46, 87]. Spectral representation based plane wave expansions An alternate approach, inspired by the diagonalized form for static FMM, was in- troduced in [42] and later implemented in [84, 49, 50]. It is based on the following well-known spectral representation of solution to Helmholtz equation [88], —jnR e R _2_1_7r0 8-1923122]: e—jA(xcosa+ysina)—___n dad). (1.43) this relation is valid for z > 0. Further it is straight-forward to identify the purely propagating part of spectrum as 0 S A S K. and the evanescent part as n: < A S 00; 28 with simple change of variables, above expression can be written as [42], e—st ( e-st) ( e—st) = + (1.44) R R evanescent R propagating where, e—jrtR 1 oo 2 2 2r . . ___ _/ e—V A —rc z/ e—JA(xcosa+ysma) R evanescent 27f K 0 A dadA x/A2 — 52 = i [00 e—oz f2“ e—jvo§+n§(xcosa+ysina)dado. 271' 0 0 _- R 2 3 JR = i "e_ [\2_,€2z)‘/ We—jA(xcosa+ysina) R propagating 2W 0 0 A \/A§——Is2 _ ij 1r/2 — 271’ 0 dadA 2 e—jncos9/ We-jrtsin9(xcosa+ysina)dad6 0 Notice that with It -—> 0 the propagating part vanishes and the evanescent part reduces to the diagonalized form (1.14) used in static FMM. Now it remains to discretize the above integrals for numerical evaluation and generalized Gaussian quadratures can be employed for this. However, unlike in static case, the integrand cannot be ren- dered scale independent and this means quadrature points and weights should be pre-computed for all possible translation distances at all levels. It is worthwhile to recount that the multipole and local expansions are computed and stored as they appear in original spherical harmonics expansion (1.23); they are converted to ex- ponential expansions back and forth during multipole to local translation only and these relations can be found in [50]. This approach avoids the floating point over- flow as all the computed quantities and operations are regular and numerically sta- ble. Other approaches based on above spectral representation have been presented 29 [84, 49, 89, 90] and they differ significantly in their numerical implementation and structure. In all these methods the multipole and local expansion are represented directly in terms of exponential expansion coefficients; hence they require new inter— polation/anterpolation Operators for multilevel implementation. In [84], an extension of FIPWA as introduced for multi-layered structures, the integrand is sampled along the steepest descent path (SDP) and extrapolation techniques to estimate the evanes- cent portion Of the spectrum from samples of the propagating portion. However, one has to treat “shallow” evanescent waves differently from “deep” evanescent waves. In [49], the evanescent integrand is sampled along the traditional Sommerfeld inte- gral path (SIP) and singular value decomposition (SVD) of the integrand is used to obtain expressions for multipole coefficient and multilevel translation operators. An interpolation matrix approach is presented in [90] to relate exponential expansions at different levels. Using sample points in child and parent domain an overdetermined system Of equation is formed and solved for the interpolation matrix entries in a least square sense. The advantage of latter approaches is that they avoid the spherical harmonic to exponential expansion and reverse mapping Operations. 1.5 Applications This section provides an overview on application of above discussed algorithms in different contexts. As mentioned in introduction, FMM and other fast methods, e.g. FFT and tree code based, were develOped primarily to accelerate the evaluation of potential or field in N body problems. Integral equation solutions, a common choice in simulation Of many electromagnetic applications, sought through iterative solvers requires repeated evaluation of potential or field at source points itself. Thus fast algorithms play a significant role in solving real world problems within realistic time duration. The literature referenced here is only selective and not exhaustive as the 30 use of these algorithms have become more common during recent years. Also, only topics related to electromagnetics are listed here; for applications in other research field refer to introduction. First, electromagnetic application of static or Laplace FMM was evaluation of electrostatic potential in 2D [6, 91]. The extension to 3D has seen lot Of applications, particularly, in plasma dynamics [8, 92]. FMM based FastCap and FastHenry are widely popular tools for extraction of equivalent capacitance and impedance among multi-connects in micro-electronic components [19, 20]. Static FMM is also used in integral equation solution of magnetostatic problems predominantly for analysis and design of electric machines [93]. Simulations with non-linear materials have benefited much as they demand multiple solution before attaining stability [94, 95, 56]. It has also been applied to quasi-static case especially in simulation Of eddy-current phenomena [96, 97] and micro-magnetics is another area Of practical interest [98, 99]. The recently published book on fast methods in electromagnetics is a virtual treasure house Of FMM methods and their applications to various problems in high fi'equency electromagnetics [33]. As is to be expected, Helmholtz FMM has been applied to accelerate iterative solution of surface and volume integral equations. The means to modify Helmholtz equation such that they are applicable to vector elec- tromagnetics problems was first presented in [7 8]. More detailed description can be found [100, 75, 33]. Since their introduction, they have been applied extensively to scattering and radiation problems of different flavors; for instance, scattering from perfect electrically conducting surfaces [28, 65, 78, 100, 69, 101, 102, 103, 104, 105], scattering from dielectric/ composite bodies [106, 107, 108, 109, 110, 111], volume integral equations [112, 70, 113, 114], anisotropic Objects [115, 116], scattering from rough surfaces [117, 118, 119], application to microstrips [120], EMC/EMI analy- sis [121, 122, 123], antennas [124, 125, 126]. Efficient implementation of FMM in solvers with higher order geometry and basis function representations have led to the 31 development of fast and accurate solvers [107, 127, 128]. [129, 130] Multipole accelerated algorithms have also been employed in various hybrid meth- ods where solution is Obtained with use of moment method combined with one or more of following techniques: to impose global radiation boundary conditions in finite e1- ement solvers [131, 132, 133], ray tracing and diffraction methods [134], multi-grid methods [135] and physical optics [136, 137]. These techniques are primarily used in applications with multi-scale scatterers like antenna interactions [138] and field pre- dictions for urban mobile communications [139]. Implementation of FMM was also modified to accommodate perfectly matched layer (PML) assisted integral equation methods used in simulation of monolithic microwave integrated circuit (MMIC) and photonic crystals [140, 141, 142]. Fast inhomogeneous plane wave (FIPWA) method and other forms of FMM have been used to accelerate solution of scattering simu- lations involving layered media structures with applications in design of microstrip antennas [129, 130, 143, 144, 145, 146, 147, 148] and geophysical investigations for sub-surface scatterers [64, 149, 150, 151, 152, 153, 154, 155, 156, 157]. A combined FMM-FFT algorithm [158, 159] and SDFMM have been used in electromagnetic analysis Of general quasi-planar structures with applications to rough surface scat- tering, grating structure design in quantum devices and radiation from microstrip patch antenna [118, 160, 161, 162]. The principle of FMM has also been extended to accelerate potential employed in time domain integral equations. Plane wave time domain (PWTD) is the time domain analogue of Helmholtz—FMM that has been used to accelerate time domain IE (TDIE) [163, 82, 164]. 1.6 Thesis Objectives and Outline As mentioned in the preceding exposition, the primary downside of conventional FMM is that they are specific to the form of Green’s function. In other words, one is 32 required to develop new set of FMM formulaes for each form of the Green’s function. The Helmholtz FMM, as detailed in Section 1.4.2, suffers from numerical instability for low excitation frequency. Analogously, plane wave time domain (PWTD), the time domain counterpart Of Helmholtz FMM, also suffers from a similar breakdown when large number of unknowns are concentrated in small size domains. Consequently, the existing state of the art fast methods face severe limitations when applied to multiscale problems. These are realistic problems where certain regions are very densely discretized to accurately capture the physical details. The main goal of this thesis is develop mathematical techniques to overcome the limitations of the existing fast methods for electromagnetic simulation of multiscale problems. This is achieved with the aid of accelerated Cartesian expansion (ACE) algorithm. ACE is a recently developed, hierarchical tree computation algorithm in the vein Of FMM. Unlike FMM, ACE algorithm relies on Cartesian harmonics and Taylor series expansion to derive FMM like algorithm for arbitrary, non-oscillatory potentials. In this thesis, diflerent aspects of ACE algorithm are exploited to develop fast algorithms to overcome the low-frequency breakdown in both time and frequency domain. Parallelization of FMM can be classified as a fairly recent work, with most of the literature concentrated in the last decade. The hierarchical framework of FMM qualifies it as one of the diflicult algorithms to parallelize. Most of the existing algorithms are based on heuristics. Such algorithms, though successful, provide only modest scalability with maximum at 64 processors. This is a severe limitation of the FMM algorithm when considering the ever growing size of cluster computers. In this work, a parallel version of FMM algorithm is introduced that is scalable up to hundreds of processors and beyond. The prOposed algorithm is provably scalable and hence allows for large scale parallelization. This work leads to the development of the state of art parallel multiscale electromagnetic solver for very large scale simulations. The rest of the thesis is organized as follows: 33 Chapter 2, provides a detailed review of ACE algorithm. Here the definitions and theorems of ACE algorithm are presented in a kernel independent fashion for easy reference in later chapters. Also presented here is a detailed overview of the multi-level tree computation scheme along with the proposed modifications. Chapter 3, addresses the sub-wavelength breakdown of PWTD method, the time domain counterpart of Helmholtz FMM. The smallest domain size used in PWTD is restricted for reasons Of numerical stability. Thus, when large number of unknowns are concentrated within a sub-wavelength structure, the computational advantage Offered by the PWTD algorithm is overshadowed by the direct computation cost. Here, the almost kernel independent framework of ACE algorithm is exploited to develop an algorithm that is stable and efficient for evaluation Of retarded potentials within sub-wavelegnth structures. Chapter 4, addresses the low-frequency breakdown of Helmholtz FMM. Here the FMM algorithm is presented in sufficient detail to identify the root cause of the breakdown. ACE expansions of the Helmholtz kernel is developed and the stability and convergence of these expansions at low-frequency limits is shown in a rigorous manner. This leads to the development Of a Wideband FMM algorithm, Obtained by seamlessly combining the ACE and FMM algorithms. This hybrid algorithm, that is stable and efficient across length and frequency scales, is then augmented with an existing electromagnetic solver for simulation of multiscale geometries. Chapter 5, take a slight detour from fast methods and explores the possibility of developing a new integral equation formulation that yields well conditioned systems of equations for multiscale simulation. Here the augmented electric field integral equa- tions (AEFIE), an existing IE formulation, is considered for modification. Included here is a succinct review of the operator theory analysis of EM integral equations. These tools are used to establish the behaviour of the new formulation when applied to low—frequency and multiscale problems. The new formulation is first developed for 34 2D problems and then extended, with appropriate modifications, to 3D problems. Chapter 6, presents the development of parallel implementation of above hierar- chical tree computation algorithms on distributed computers using message passing interface (MPI). Here the emphasis is laid on developing a parallel framework that is provably scalable on large number, in orders of thousands, Of processors. The novel schemes developed here results in a implicitly load balanced parallel algorithm. The resulting framework can be viewed as a seamless combination of different schemes already in existence. Detailed description on development of a parallel electromag- netics solver is also provided and its high efficiency is demonstrated on hundreds Of processors and beyond. Chapter 7, summarizes the various contributions Of this thesis work in a succinct manner. Several possible future works are also mentioned here. Appendix A details the development Of a ACE based algorithm for rapid compu- tation of time domain diffusion potentials and Appendix B provides a quick review of the comprehensive exam problem “Integral equation methods to model eddy current inspection of plates ”. 35 / W/w / 11o // m / 10° // 101 / / t00107 L_/'°°°' mom Level 4 $010101 / . . / ° / Legend: / :7? . // ' . . [Leaf Boxl I Parent Box] Figure 1.2: Representation of 2D computational geometry using quad-trees. Boxes at different levels and corresponding nodes in tree are represented using binary keys. 36 Source ' Near-field Interaction list I Box Box I Box Figure 1.3: Illustration of interaction list; dark boxes are contained in the interaction list of source box. ' Figure 1.4: Illustration of computational load in single- and multi-level FMMs. Dark nodes correspond to actual sources while light shaded nodes represent centers of multipole and local expansions. M2L A p 's" r° L2L r, l'o L20 , 02M 09 as DI O Figure 1.5: Various operators involved in a multilevel FMM 37 South-list North-Ii Down-list Figure 1.6: Re-grouped boxes in original interaction list, in figure 1.3, for application of diagonal translation operator (1.15) 38 Chapter 2 Accelerated Cartesian Expansions (ACE) This Chapter, provides a detailed treatment of the accelerated Cartesian expansion (ACE) algorithm and a general framework for hierarchical computations. Though ACE is not the primary development of this thesis work, it is lays the foundations for the advancements made in this thesis. Section 2.1, provides a brief overview of the ACE algorithm. Section 2.2, presents the requisite introduction to definitions and notations of Cartesian tensors used in rest of this thesis. In Section 2.3, the defini- tions and theorems of ACE algorithm are stated. Section 2.4, provides the details Of the different procedures involved in a hierarchical tree computation algorithm. Sec- tion 2.5 describes some of the algorithmic developements introduced to reduce the computational cost by half. 2.1 Introduction Accelerated Cartesian Expansion (ACE) is a fast computational technique, in the vein of FMM, in the sense that it employs tree structure for hierarchical computation 39 and derives rigorous error and cost estimates. A common feature of these hierarchical computational scheme is the use of divide and conquer strategy to offer the computa- tional advantage with a prescribed loss in accuracy. This loss Of precision is justified by the fact that the numerical simulation are constrained to finite precision by other factors; and evaluation Of potentials beyond this limit does not Offer any advantage. Further, these schemes accelerate the computation of far-field potentials only. The dominant contribution to the total potentials arise from the near-field interactions that are evaluated exactly using direct computation. ACE is the mathematical engine behind the fast method discussed in this thesis. It employs Taylor’s series expansion to derive addition theorem for arbitrary, non- oscillatory functions. It is worth noting that the use Of Taylor expansion for fast computation have been developed earlier also [10, 11, 12]. Typically, these FMMs derived using the Cartesian expansions were more expensive as spherical harmonics are optimal in representing Coulombic potentials. However, it was recently shown that it is possible to develop a FMM using Maxwell-Cartesian harmonics that are as optimal as using spherical harmonics with one singular advantage; it avoids the need for special functions [13]. Here, the entire algorithm is cast within the framework Of Cartesian tensors and exploits the fact that these tensors are totally symmetric to provide an optimal representation of Cartesian harmonics. Another salient feature of ACE algorithm is that it derives exact forrnulaes for traversing up and down the tree, which in turn implies lesser source of error. The use Of Taylor’s expansion implies that the potential or its modified form be non-oscillatory for rapid convergence of these expansions. This technique, as presented here, was introduced for kernels of the form R”. ACE has been extended to several other forms Of potentials, some Of them as part of this thesis work, for e.g. Helmholtz potential [165], Yukawa (or shielded Coulomb) potentials, retarded potential, diffusion potentials, solutions to Klein-Gordon and lossy wave equations. 40 2.2 Cartesian Tensors Tensor analysis is an integral tool used in development of ACE algorithm. A Cartesian tensor Of rank n is denoted by A(") or in component notation by Ag;l..an, and is an array of 3" components, for points in R3. A totally symmetric tensor is one that is independent of the permutation Of indices (11 - - - an and in compressed form it contains (n + 1)(n + 2) / 2 independent components. Alternatively, they can be represented in compressed form as A(nl(n1, n2, 72.3) where n1 + n2 + n3 = n, and n,- is the number of times the index i is repeated. An n-fold contraction between two tensors A("+ml and B01) is represented using C(m) = A("+m) . n . BI"). The contraction of two totally symmetric Cartesian tensors can be written using the compressed notation as 17.! CW) (m1, m2,m3) = Z WA(n+m)(n1 + m1,712 + m2,713 + m3) "1,712,713 1' ' 3' (2.1) 3(n)(n1,n2,n3) An extensive exposition of theorems and formulae pertinent to the properties of com- pressed tensors, their application to the ACE algorithm, can be found in [166]. 2.3 ACE: Definitions and Theorems In this section the theorems and definitions that permit the fast evaluation of functions are outlined briefly. To this end, assume that domains as and (20 are sufficiently separated, and comprise Of sources and observers, respectively. Also, 93 C 913, 90 C 03 and (2’3 0 913 = (l). The centers of the domains 03, 90, at: and 93 are denoted by r3, r0, r? and r3, respectively. Further, denote the potential function that maps the effects of these sources on the observation points as 1,0(R), where R = ||r — r’ I] and 7: sources exist in (23. Here, the function ¢(R) can stand for any interpolation function T(t) convolved with the retarded potential and Observed at time t = 0. An addition 41 theorem for this function may be Obtained using Taylor’s expansion. Theorem 2.3.1 (Taylor Expansion). The function 11) (r — r’) can be expressed about the origin using w(r — r') = Z -(_71!)—n-rm . n . Vntb(r) (2.2) n=0 mer>H This theorem gives rise to the following corollary. Corollary 2.3.2. The function 112(r — r’) takes the form 00— M(n) . I n 1/1 (r _ r') = 211.0 n V w(r) for r > r’ (2.3) Egozor'n.n.L("l forr’>r where M("l and L(") are the multipole and local expansions. These theorems may be used in concert to derive/ prove the following five theorems that form the crux of ACE [166]. Theorem 2.3.3 (Multipole Expansion). The total potential at any point r E 00 due to k sources q;, i = 1, - - - ,1: located at points r,- E 93 is given as oo ¢(r) = 2 MW -n - war) ":0 (2.4) Mini = 24236.- - e)" where M(") is the multipole tensor. Theorem 2.3.4 (Multipole to Multipole Expansion). Given a multipole expan- 42 sion of k sources about rs 0(")=Z(—1)"q—'(r,—r,)" (2.5a) i=1 then the multipole expansion about the point rg can be expressed in terms of ( 2. 5a ) as M‘"’= 21- 1)"1;) m=0 P(m, n) It is evident that one can repeatedly use this theorem to translate the multipole expansion from rs to rg. This expression is exact [166]. Theorem 2.3.5 (Multipole to Local Translation). Assume that the domains 93 and 910’ are sufliciently separated, and the distance between their centers r53 = |r53| = Irg — rgl is greater that diam {913’} and diam {123 } If a multipole expansion M01) is located at rg, then another expansion Ll") that produces the same field Vr 6 523 is given by w(=r) 2p" ..n L(") no=00 (2.6) L(n) = Z a34071-71) , (m — n) . Vmwrgs) where p = r — r3 and V is the derivative with respect to rig. Theorem 2.3.6 (Local to Local Expansion). A local expansion 0(") that exists in the domain of; centered around r’o7 can be shifted to the domain {to centered at r0 using 00 L("') = Z m=n m—n m 0(m) . (m — n) . (r?)”"” (2.7) It can be shown that this expression is exact as well. Finally, the fields at a set of 43 observation points can be computed using the following theorem. 46) = Z Li") -n- (pm-r (2.8) n=0 Proofs for these theorems for MR) = R” can be found in [166] and may be trivially extended to functions of the form ’l/J(R) = span{R"”} for u = —1,0, 1, ~ -- ,K, or any other non-oscillatory function. Note, that when ¢(R) = R"", evaluating the multipole to local expansion using Theorem 2.3.5 implies the computation Of VnR'V which can be efficiently eflected through 11311 13221 11311 n n. n 1 _2 _ "'2 6.12-26.3 (17) = <-1>"R " " Z Z 2 H)": m1 =0 m2 =0 m3=0 m1 m2 (2.9) ”3 X Rme (V, Tl _ m _ 1)xn1-2m1yn2-2m22n3—2m3 m3 where R2 = x2 + y2 + 22. As was pointed out in [166], a computation scheme based on these theorems have the following characteristics: 1. The multipoles are independent of the function being translated. Only the translation operator depends on V. This fact will be of use in developing fast methods for evaluating the retarded potential. 2. The multipole to multipole expansion (or the local tO local expansion) is exact. This implies that the errors obtained do not depend on the height of the tree. 3. The formulation in terms of totally symmetric tensors permits the realization of CPU cost savings of a factor of 1 / 720 over a simplistic implementation. 4. Finally, since only the translation function depends on the potential function being used, it follows that the proposed methodology can be readily altered, 44 with very little change in the overall algorithm, for other potential functions. 2.4 Multi-level Tree Computational Framework These theorems permit rapid evaluation of potential using either a standard or com- pressed oct-tree decomposition of the domain. A standard oct-tree is constructed by first embedding the entire domain in a fictitious cube that is then divided into eight sub-cubes, and so on. This process continues recursively until the desired level of refinement is reached; an Nl-level scheme implies N1 — 1 recursive divisions of the do- main. At any level, the domain that is being partitioned is called the parent of all the eight children that it is being partitioned into. At the lowest level, all source / observers are mapped onto the smallest boxes, leaf boxes. This hierarchical partitioning of the domain is referred to as a regular oct-tree data structure. At any level in the tree, all boxes/ domains are classified as being either in the near or far field of each other using the following dictum: two subdomains are classified as being in the far field of each other if the distance between the centers is at least twice the sidelength of the domain, and their parents are in the near field of each other. This definition will be used unless it is specially stated that an alternate definition is necessary. The interactions between all source and Observation points are now computed using traversal up and down the tree structure in the following manner. First the multipole expansions are computed at the lowest level for leaf boxes. Parent box multipoles, at all levels, are computed from its children multipoles using multipole- to-multipole translation Operator, theorem 2.3.4. This process is called upward tree traversal. Second, local expansions are computed at every box from multipole expan- sion of the boxes in its far-field using multipole-to—local translation Operator, theorem 2.3.6. Next, the local expansions of all child boxes are updated with the local ex- pansion of its parent using local-tO-local translation Operator, theorem 2.3.6. This 45 procedure is referred to as downward tree traversal. Finally the potential at observer points are computed from the local expansion of leaf boxes using theorem 2.8. This the far-field potentials that accounts for contribution from all sources except from the sources in near-field region of the corresponding leaf box. The total or complete potentials is Obtained by accounting for contributions from near-field sources for leaf boxes only through direct evaluation. Cost of this scheme can be computed in the following manner. The cost associated with each Operation will be denoted by Cop where op E {N F, 02M, M 2M, M 2L, L2L} that stand for (i) near field (ii) charge to multipole (iii) multipole to multipole (iv) multipole to local (v) local to local and (vi) local to Observer. In the following analysis the total number of interaction pairs is denoted by N, number of harmonics by P, total number Of levels in the tree by N]. Let NM denote the number of boxes at each level and assume that the number of unknowns in each leaf boxes, on average, is s. It follows that NM = N/s, N,,,,_1 = 8N“ and 2:1, Nb, o< N/s. With these preliminaries the cost of each Operation can be computed in the following manner. 1. Near field evaluation, C N F1 This computation is carried out only at the lowest level I = 1, at leaf boxes. The cost of direct evaluation between two leaf boxes scales as s2 and for each leaf boxes can atmost have 27 boxes in its near-field. The total cost of this operation can be written as CNF OC N/s X 2752 (2.10) or 27Ns 2. Multipole expansion, 002M: In this Operation multipoles expansion in form of totally symmetric Cartesian tensors are computed from s charges per leaf box. The distinct elements in a totally symmetric Cartesian tensor of rank p is 46 + + . e cost 0 eva uatm mu t1 0 e tensors u to ran IS, (19 1)(p 2)/2 Th f 1 'g 1°p1 p Pth k' P 062M 1%.. s x Ze+1>tp+2v2 p=o (2.11) _ NP3 - T . Multipole to multipole expansion translation, C M2 M3 The multipoles of parent boxes, at any level, is computed from its eight children multipoles. The number of Operations to translate all P + 1 multipole from a child to its parent is 6 H (P + i) / i. Since the total number of boxes in the tree scales as N / s the cost i=1 for this Operation can be expressed as (P+i) N 6 CM2M = —S- x H (2.12) i=1 . Multipole to local translation, C M2 L3 This operation is performed on all boxes of the tree. For any box the maximum number of far-field boxes is 189. The cost of translation between two tensors is same as in previous case. (P + i) N 6 CM2L=189:-H i (2.13) i=1 . Local to local expansion translation, C L2 L1 As mentioned before this operation is exactly the same as multipole-to—multipole translation operation i.e. C L2 L = CM2M- . Local to observer, 01,20: cost Of this Operation is exactly the same as that for mapping charges to multipole expansion i.e. 0120 = 002M- 47 The total cost of the scheme is the sum Of all individual cost, Ctotal = CNF + 002M + CM2M + CM2L + CL2L + CL20 NP6 NP3 = 7 __ __ 2Ns+1913720+ 3 (2.14) It is readily apparent that optimal number of unknowns per box is s or P3/10. 2.5 Algorithmic Improvements The above discussion pertains to the classical multi-level computational framework introduced in [167]. Since its introduction several modifications and additions have been suggested to reduce the computational time and extend its applicability to general geometric distributions. Following are some of the developed in this research work for Optimal implementation Of ACE algorithm. 2.5.1 Reduced Interaction List From the cost estimate of ACE algorithm it is evident that C M2 L, cost of multipole-to- local translation Operation, is the dominant part. Both the per-translation evaluation cost and number of interaction pairs (typically 189) are very high. This observation is common to all FMM like methods [168, 169]. In FMM based on expansions in terms of spherical harmonics both the factors can be reduced with the use of plane wave basis representation and exploiting the resulting symmetry. Alternatively, in this work a new definition is introduced to classify far-field pairs: if box a (at level 1 + 1) interacts with all the children of box b ( at level I) and box a, box b are in far-field of each other then box a interacts with box b. Interaction between boxes at two consecutive levels is easily effected using Cartesian tensors. In fully populated oct-tree this results in a reduction in the number of translation operations by half with minimal increase in 48 error. Numerical results that support this claim is presented in chapter 3. 2.5.2 Compressed Oct-tree Classical multi—level FMM loses its 0(N3) scaling when applied to geometries with non-uniformly distributed sources / observers. The root cause of this breakdown is the use Of uniform or regular oct-trees where all branches of tree is grown till the lowest level. This implies that the number of source / Observer per leaf box varies drastically between regions Of low and high source / Observer concentrations. For leaf boxes with very low number of unknowns, evaluation of potential at its far-field boxes through {CZM, M 2L, L20} Operaion can be costlier than direct evaluation. To overcome this shortcoming an adaptive version of the multi-level computational scheme was pre- sented in [169] with compressed oct-tree representation for non-uniform geometries. In compressed oct-tree representation only boxes, at any level, with source/Observer pairs greater than a predetermined number is sub-divided into child boxes. Further, in adaptive version always the optimal form of FMM is used based on the number Of points in a leaf box. The implementation Of ACE algorithm closely follows the work in [169], the main deviation is that the smallest box is used to enclose some pre-fixed number of points per box, 3. While this approach is not significantly diflerent in terms of cost when compared with [169], it does provide the possibility of improving error with certain geometries as the error in multipole evaluation is reduced. On downside this method may produce large number of single child parent which in turn increases the number of tree traversal Operations, however this can be remedied by eliminating these redundant parent boxes. With the elimination Of single child parent nodes, the resulting oct-tree would have the same structure as in [169] except the leaf box size would be smaller here as shown in Figure 2.1. 49 101101\1/ ch 1010110 L101“), / . - /__/’ ° / / " - / / 5—7/ / / 7- ' / W Figure 2.1: An example Of compressed-quadtree with binary key representation used to label the tree nodes. 50 Chapter 3 Fast Evaluation of Time Domain Retarded Potential in Sub-Wavelength Structures In this chapter, a computational scheme is presented for fast evaluation of time do- main retarded potentials in sub-wavelength structure, whose principle dimension is less than or only a few orders of the maximum wavelength. Section 3.1 provides a brief review of the existing fast algorithms for evaluation of retarded potentials and their limiations when applied to sub-wavelength structures. Section 3.2 describes the problem of computing retarded potentials. Here these computation are reduced to evaluation of polynomial potentials Of different orders. Section 3.3 details a fast algo— rithm when principal dimension of the domain is less than the maximum wavelength and Section 3.4 generalizes this to arbitrary size domains. Section 3.5 presents re- sults and discussion of the proposed method when applied to arbitrary uniform and non-uniform geometry distributions. 51 3. 1 Introduction Time domain solutions to scattering problems is preferred over frequency domain methods when the analysis spans a wide range of frequencies. Examples of such anal- ysis include characterization of Wideband antennas and analysis radar signatures. In— tegral equation based methods for scattering from electrically large objects in time do- main has been made possible via the development of acceleration techniques like plane wave time domain (PWTD) and time domain adaptive integral method (TDAIM). These methods ameliorate the computational cost when the size of overall Object is several wavelengths long and the smallest feature scale is a fraction of the wavelength. However, analysis Of structures that contain a mix of feature scales, poses problems for acceleration techniques in both frequency and time domain. Here, it is the geo- metric constraint that dictate the computational complexity. For instance, to model fine features, it is necessary to discretize that domain at a considerably higher rate than that is dictated by the smallest wavelength to capture the geometric details. These features occur in the analysis of practical problems in applied electromagnet- ics, ranging from EMI/EMC applications to antenna topologies to feed structures to signal integrity analysis in high speed interconnects, etc. The solution to this problem is typically sought by devising a methodology that works at sub-wavelength scales, and developing a transition to higher frequencies so that it can be integrated with existing acceleration methodologies. The problem encountered herein is not very different from those addressed in the frequency domain fast multipole method (FMM). The PWTD algorithm is a time domain analogue of FMM, with one significant difference; the field due to a quasi-time limited and bandlimited source can be reconstructed to arbitrary accuracy using a discrete set of propagating plane waves provided certain separation conditions between the source and observers are met [170]. The separation criterion ensures that 52 time gating can be employed to yield causal results. Unlike in the frequency domain, the cause of breakdown is not the expansions used in the algorithm; all functions used in the expansion are regular at zero. The breakdown occurs because domains that interact with each other via the PWTD algorithm are determined indirectly by the time step size. As the time step depends only on the maximum frequency of excitation and not on the smallest discretization, it implies that PWTD breaks down as an acceleration tool because most Of the interactions would fall under the “near” field classification. However, these arguments suggest an approach for overcoming this hurdle; develop an acceleration procedure using adaptive time stepping. The main advantage Of this procedure is the seamless manner in which it can be integrated with the classical PWTD scheme for high frequencies, resulting in an acceleration scheme that is valid at all length scales [171, 172]. Alternatively, one can modify existing frequency domain low frequency algorithm to construct time domain information [173]. This implies that one needs to develop the mechanism to transition from frequency to time domain and vice versa such that the resulting system can still be cast within the framework that permits transient analysis within a marching on in time framework. It has been shown that the latter approach is considerably faster than the former [173]. This work presents an alternate method to arrive at the same objective and is founded on using Taylor expansions in a Cartesian framework, detailed in previous chapter. More specifically, the methodology presented herein will rely on the recently develOped fast kernels for evaluating potential of the form R” for l/ E R, and is very competitive in terms of speed for a given accuracy with the other two methods that exist [171, 172, 173]. integrated with PWTD. Thus, the main contribution in this work are 0 Development of an acceleration technique to compute retarded potentials in the sub-wavelength regime. The method presented relies on representing the 53 retarded potential as a function of potentials of the form R” , and then acceler- ating this function. The presented method can be extended to other functional representations as well. 0 Development of the requisite algorithmic structure to seamlessly extend this (with very little cost overhead) to multiple time steps. Extension to multiple time steps is done with the sole aim Of integrating with the PWTD algorithm. 0 Application to sub-wave legnth problems with non-uniform geometric distribu- tions 3.2 Problem Statement Consider a set of N3 sources that are randomly distributed in a domain 9. The location of these sources will be denoted using rn and their time signatures by fn(rn, t) for n = 1, - -- ,Ns. It is assumed that these functions are bandlimited to an angular frequency 62m and all sources are approximately quiescent for t < 0. As in all time domain solvers, the source functions fn(rn, t) are known only at evenly spaced time steps tk = kAt for k = 1, - - - ,Nt where At = rr/(xwmax), MA), is the total simulation time and x is an oversampling factor. Typically, x > 1 and chosen between 5 to 20 to accurately reconstruct functions fn (rn, t) from its samples. The field at any point r due to these sources is given by ”8 6(t - 124/c) (r, 1) = Z T * fn(1‘n, 1) (3-1) n=1 where c is the speed Of light, * denotes convolution in time and Rn = ||r — rnll. It is apparent that the cost of computing (3.1) scales as 0(NtN3). Finally, in keeping with the definition of sub-wavelength regime, the size of domain is diam(Q) = 0(cAt). Given the size of the domain, it is apparent that the PWTD scheme cannot be readily 54 used; it has to be substantially modified in order to evaluate these potentials efficiently [172). In developing this scheme, it is necessary that the source signatures in (3.1) be known so as to facilitate the integration of the proposed algorithm with existing marching-on-in—time solvers for time domain integral equations. The starting point of the proposed method arises from the representation of the source signal. Assume that the source function can be represented in terms of fn(rn, t) = 2k Inka(t), where Tk(t) = T(t — tk) is a time basis function and I): are the samples of the function at the discrete time step tk. It follows from this representation that 0% 00 (rt) = Z 2: 111,71: Tk(t - RnR’n/C) (32) n=lk=0 This implies that to realize a fast algorithm, one needs to rapidly compute functions of the form T), (t - Rn/c). To illustrate the development of a fast algorithm the temporal basis functions are chosen to be backward Lagrange polynomials. Note, however, that the methodology presented herein is not restricted to polynomials. To this end, the K th order basis functions is defined as hk(tng—k(t) for (k — 1)At S t S kAt; k = 0, - - - ,K T(t) = (3.3a) 0 otherwise where, 1 k = 0 Mt) = k t _ m, (3.3b) , k 75 0 i=1 ‘1At and K k __ _ t + lAt arr—11(1) — t—H1 mi (336) It follows, from the above equation that T(t) = 0 for t 9? (—At, K At), T (0) = 1 and 55 T(t) = 0 for t 2 —At, At, 2At, - - - , (K— 1)At. Using the functions in (3.2), and point testing in time, results in the potential function that is a polynomial of R". It means that one can directly exploit acceleration methods developed for kernels Of the form R" [166]. 3.3 Single Time Step Geometries The field at any observation point r c Q, at time instance iAt, due to sources at rn E 9 for n = 1, - -- ,Ns can be obtained from (3.1) (iAt, r): giftéh 72an/6) fn(rn, iAt — r)dr (3.4) n=1 0 where Rn = “r — rn|| and fn(rn, t) is the transient source strength at the nth spatial point. The limits [0, At], on above time integral is possible because Rn/c E [0, At]. Employing time-domain basis function from (3.2) and evaluating the time integral in (3.4) results in, (iAt,r) = i Z 1",,- [nu-”2; Rn/c) (3.5) n=1j=i— —K Ab If .___ ZZIni-jT(jAtRnEI/C) (36) n=1j=0 where, K is the order of temporal basis function T(t). Since T(t) is chosen to be a backward Lagrange polynomial, (3.6) can be expressed in terms of powers of Rn/c as NsKK (iAt,r) 2:21,, ,_Joz (h ,j )R,,_1 (3.7) n=1 j=0 h: 0 56 In (3.7), a(h, j) is the coefficient corresponding to the polynomial of degree (h — 1) for the basis function at (i — j)-th time step, they also depend on At and c. Evaluating these polynomials of form R" can be performed at 0(N3) cost using the ACE algorithm. Thus, the overall cost of this scheme scales as 0 (K N3). Error bounds for using ACE to evaluate (3.7) can be obtained from the bounds derived in [166] and it can be proven that the upper bound of the error is determined by that for R‘l. Note, that the above derivation is not specific to using polynomials as temporal basis functions. Other basis functions may be dealt with in one of two ways; either by finding the appropriate translation functions, or by mapping these onto a space of polynomials. Using polynomials is fairly trivial as the framework for the R” kernel is readily available [166]. The 0(KN3) reduction in cost, specified above, is for brute force implementa- tion of the ACE algorithm. It is important to recognize that the above formula- tion demands evaluation of the kernel 12'” for different V’s. However, most of the steps in the proposed algorithm are kernel independent. In that, Theorems 2.3.3 and 2.3.4 (multipole expansion and multipole-to—multipole translation) do not depend on the kernel. Similar observation holds for local-to-local translation and evaluation of potential from local expansion, Theorem 2.3.6 and equation (2.8). Thus, only the multipole-to—local translation, Theorem 2.3.5, depends on the kernel and requires the evaluation of VnR" for different V values. Therefore, evaluation of polynomials of form 2,, cuR" involves (almost) one tree traversal (up and down) irrespective of the kernel, only the multipole-to—local translations need to be done separately for each kernel or polynomials of different degrees. Thus, a careful implementation of the ACE algorithm rasults in an adaptable and significantly lower cost algorithm. Applying 57 the multipole to local translation (Theorem 2.3.5) in (3.7), N3 K K (P(iAtJ) = ZZZIn,i—ja(hvj)R’ri-l n=1j=0h=0 N3 K K P = Z z Z In,i—ja(haj) Z VpRgh—l) 'P ' R’n(p) n=1j=0h=0 p=0 P K K N3 = Z: Za(h,j)VpR£h‘l) ~p- ZIni—JRW) p=0j=0 h=0 n=1 K P = 2273“”) 'P-M?) (3.8) J'=0p=0 where R0 = ||r — roll, Rf, = Hro — rnll and r0 is the center of sphere enclosing all sources. 73.0)) and M?) are the optimal tensor representation of multipoles and translation Operation of the ACE algorithm. Equation (3.8) implies that upward tree traversal, i.e., multipole-to-multipole translation and multipole-to—local transla- tion should be performed K times. This is to preserve the transient information, In,,-_j associated with each basis function for every source. However, downward tree- traversal which include local—to-local translation and potential evaluation needs to be performed only once. 3.4 Multiple time step interaction The above exposition was geared towards developing a scheme for computing inter- actions when diamm) < cAt. Next, the generic case of diamQ > CA; is approached through modifications to above methodology. Consider two domains 91 and {22 such that, ‘v’ r1 6 $21 and r2 6 {22 satisfies (N —1)At g ||r1 — r2||/c S NAt, where N is any positive integer. Then, the field at any point 1'] at i-th time step, (P(iAt, r1), 58 due to Nsfiz sources at rn E (22 can be written as N3 92 NA . ’ 6 'r — Rn c . (iAt,r1)= ”21 f ( Rn / )fn(zAt —T,rn)drdr (3.9) _ (N—1)At where Rn = ||r1 — rnll. Repeating the derivation presented for single time step interaction, N3 K . . T J + N — 1 At — Rn C ‘I’(2At,l‘1) = Z ZIn,i—j—(N—l) (( R: / ) (3°10) n=1 j=0 When N = 1, (3.10) reduces to the case for interaction within one time step (3.6). It is important to preserve R/c argument of the basis function in (3.10), as a polynomial representation is necessary for acceleration using the ACE algorithm. Thus, the key in multiple time step interaction is to identify groups (21 and {22, and it can be done using the following argument; find dmin Z NAt and dmag: S (N + ”At (3.11) where, dmax and dmz-n are the maximum and minimum distance between any two points in $21 and {22, see Figure 3.2. For example, consider spherical domains of radii 1'1 and r2 whose centers are separated by R0; then dmax = R0 + T1 + T2 and dmin = R0 — r1 — r2. From (3.10) and (3.8) it can be inferred that the number of upward tree traversals (multipole-to-multipole and multipole-to—local translations) equals NWK, where Nmacht is the diameter of the sphere encompassing the entire low-frequency region Q. These constraints mandate a new definition be used when developing the interaction list in the oct-tree as follows: Definition Interaction list rule: Consider two child boxes whose parent boxes are in near-field. They are in each other’s far-field if the distance between their centers is 59 at least twice the sidelength of the domain and they satisfy (3.11). Otherwise, they are in each other’s near-field. Some boxes may be well-separated in space and still not satisfy the temporal constraint in (3.11). For example, consider two spheres of radius r1 = r2 = cAt / 8 whose centers are separated by R0 = N cAt, now dmaz = c(N + 1/2)At and dmin = c(N — 1/4)At which do not satisfy (3.11). In such cases one can choose either of the following options: (i) sub-divide the domains and perform interaction at next level (with smaller domain size); (ii) consider the domains to be in near-field of each other and use direct evaluation. Sub-dividing the domain without limit has two disadvan- tages. First, the number of unknowns per smallest box, with increasing levels, can fall below the limit for optimal computational cost. Second, sub—division into smaller size boxes does not always ensure compliance with constraint in (3.11); it can be shown that boxes who’s centers are separated by multiples of CA1; (N cAt), without regard to their size, will not follow the temporal constraint (3.11) and interaction between such boxes should be evaluated using direct methods. Further, using the second Option on short trees can increase the total number of near-field interactions and dominate the overall computational cost. In this work, an optimal implementation is obtained by combining both, i.e., sub-dividing up-to a certain level and beyond this level domains violating (3.11) are placed in near-field interaction of each other. It is essential to note that the number of levels up-to which sub-division is used can be geometry de- pendent. In essence, this procedure overcomes the multiple time step interaction with a slight cost overhead that should be optimized. Further observations are presented in next section. 60 3.5 Results and Discussion In this section results presented will substantiate the above claims and demonstrate the efficacy of the algorithm presented herein. As in all illustration of FMM meth- ods, the goal is to demonstrate considerable speed—up with predetermined accuracy. Consequently, the results presented will demonstrate convergence as well as 0(N3) per time step CPU cost scaling. In all numerical experiments, the source/ observer locations are randomly distributed. The corresponding standard / compressed oct-tree data structures (including interaction lists) are generated using the algorithmic pro- cedure outlined in the Appendix. The accuracy of the prOposed algorithm is validated against analytical data for all casae where the unknown count is numerically small. The relative error at nth observer is evaluated as _ llq)fast,far(nat) - (panalytical,far(nit)ll2 ) Error (n — (3.12) f” H....:,,..-c..z,f..nz where, “”2 represents L2-norm, fast, far (t) and (panalytical,far(t) represent the time history of the fields produced by the sources evaluated using proposed algorithm and analytical procedure, respectively. The error reported in this work is the average error over all observers [172] when the number of observers N 3 < 32, 000. For larger number of unknowns, the analytical data (and hence the error) is computed for randomly distributed unknowns (approximately 150). Hence, the reported data is an estimate of the expected error. These value are denoted using a 1‘. Finally, as is usually done for all fast algorithms, analytical data is computed only for the source/ observation pairs that are in the far-field of each other, and is consequently representative of an upper bound or worst-case error. The CPU timings (in seconds) are those taken for evaluating the field at a single time step using a 2.3 GHz Intel Pentium processor with 2GB RAM running Linux OS. In all experiments that follow, the time signature 61 that is associated with the nth source is given by (A.16) —(t—t )2/202 fn(t, rn) = Fine 1) (3.13) where Kn is the magnitude of the source randomly chosen between [0,1], 0 = 6.366 x 10’8 s and tp = 60 s. The effective highest frequency and minimum wavelength associated with these signal parameters are fmax = 3/7ro = 15MHz and Am,” 2: 20m, respectively. As prescribed in MOT solvers, the time step is chosen as At = 1 / (20 fmax) = 3.334 ns and is independent of geometric feature size and only a func- tion of fmax. The above parameters are chosen such that cAt = 1m, thus, all geometric features smaller than 1 m would fall in the sub-wavelength category. In rest of the section, P denotes the number of ACE harmonics used and K denotes the order of the time basis function. The first set of numerical simulation is performed to demonstrate the validity of the improvements made in the kernel that reduce the number of translations by approximately a factor of two without significantly affecting the order of the error (see Appendix 2.5.1 for details). The numerical experiment performed is as follows; source points were randomly distributed within a cube of side-length 0.5 m, i.e., all points interact within one time step. The number of source / observation points is varied (as is the height of the tree), the number of unknowns at a leaf box is approximately 64, and error is obtained for the “Old” and “New” schemes. The results presented in Table 3.1 indicate what is expected, viz., the computational cost is reduced approximately by a factor of two while the increase in error is almost always marginal (the order of magnitude of the error is unchanged). Next, set of results demonstrate that the multipole-to-multipole and local-to—local operations are exact. An important ramification of this is that the error is independent of the height of the tree. This experiment is effected as follows: consider two cubical 62 domains {21 = (0, 1/4) x(0,1/4)x(0, 1/4) In3 and 522 = (1/2, 3/4) x (0, 1/4) x (0, 1/4) m3. Each domain contains 4000 randomly distributed source and observation points. In constructing interaction lists, it is ensured that only sources / observers in (21 and (22 interact, all others are ignored. Thus, as the number of levels in the tree are increased, the change in the error norm can be attributed solely to the multipole-to- multipole and local-to-local operations. Table A.1 shows error computed for different {P, K} pairs and different levels in tree, where dxo is the size (in meters) of smallest box. It is evident from Table A.1 that, for a given {P, K} pair, the variation in error obtained from using different levels in the tree is accurate to double precision. This is a consequence of the fact that Theorems 2.3.4 and 2.3.6 are exact, i.e., they produce the multipole (or local) expansion had the box size at that level been the leaf box. Consequently, the error bounds are much tighter. Details and proofs can be found in [166]. Next, results are presented for distribution wherein all source/ observation pairs are distributed within a domain (I < cAt and distribution sizes ranging from 8000 to 4,000,000 points. The number of unknowns per leaf box, on average, is chosen to lie between 60 and 70. From Table 3.1, it can be inferred that number of harmonics and order of time basis function are closely coupled, i.e., for a given K, arbitrarily increasing P does not improve the error and vice-versa. This is true because the two sources for error (A.17) reported here are (a) approximation of a time signal with polynomial basis function of order K, and (b) error in evaluating a polynomial through ACE (limited P) due to far-field approximation. Hence, the results for time comparison are presented only for the optimal pairs {P, K}. For example, {4, 2} indicates simulation run with 4th order harmonic in ACE and 2nd order temporal basis functions. In general first, second and third order temporal basis function can provide up to 0(10'4), 0(10-5) and 0(10’7) accuracies respectively, for the given source signal parameters (A.16). Table 3.3 shows the relative error for different 63 {P, K} pairs and distribution sizes, N3. It can be seen that for increasing {P, K} combination the error decreases consistently. Table 3.4 presents the per-time—step computation time involved in both direct and proposed algorithm, the order of error corresponding to different {P, K} pairs can be inferred from Table 3.3. Similar results are presented for multiple time step interaction in Tables 3.5 and 3.6, where N denotes the number of distinct time step interactions and Cs denotes the sidelength of cube enclosing all sources / observers in meters. In Tables 3 to 6 empty entries, pertaining to large N, and {P, K} values, are due to insuflicient computer memory on the chosen computer platform. Figure 3.6 shows N3 vs. Tfar graph in log scale for data in Table 3.4. The lines plotted in the graph corresponds to a least square error linear fit for different {P, K} pairs. Slope of these line for different {P, K} values was approximately 1.06, thus, validating the 0(N3) scaling of algorithm presented here. The evident mismatch between timings in Tables 3.6 and 3.4 is explained as fol- lows. In the case of single-tirne—step interaction, the size of smallest box was chosen to accommodate 60 to 70 unknowns per box on average. However the largest box, at top of the tree (level 1), is within cAt dimensions; therefore, the height of the tree increases as distribution size is increased. In case of multiple time-step interactions one can keep the leaf box size constant and increase the level-1 box size for higher distribution size, to achieve % 64 unknowns per leaf box. However, this does not imply a direct increase in tree height because the interactions at larger boxes also need to obey (3.11). For example, for two spheres of radius r, to interact, the limit- ing condition based on (3.11), is r, S cAt / 4. Boxes greater than this size interact only through their child. This is the only limitation of the algorithm presented here, however, in practice the algorithm can be strictly used to compute field interacting in few time steps only and PWTD will interface with this method when [R/ (cAt)J is beyond a certain number of time steps. Thus, an ideal algorithm should switch 64 between the proposed algorithm and PWTD seamlessly. Finally, results for an adaptive version of the algorithm introduced here is shown on two types of non-uniformly distributed geometries. The first closely resembles interconnects in electronic chips as shown in Figure 3.4. The distribution of points between top and bottom planes and two interconnects were approximately the same. In applying the adaptive version, the number of unknowns per leaf node was ap proximately 64, was tested for source / observer distributions ranging from 8,000 to 1,000,000. Table 3.7 presents the error obtained using the proposed algorithm, and was generated for different combinations of ACE harmonics (P) and order of time basis function (K). The rate of error convergence exhibited here is fast in compar- ison to those in Tables 3.3 and 3.5. This outcome is primarily attributed to the consideration of smallest box enclosure and stricter enforcement of error criteria in building the interaction list; see previous chapter. The timing result for this geometry configuration is presented in Table 3.8. As explained above in uniform distribution, the timing results are presented only for certain combinations of {P, K}, each pair corresponding to different orders of accuracy given in Table 3.7. Figure 3.7 shows N3 vs. Tfa, graph in log scale. The slope of the linear fit was approximately 1.06 for different pairs of {P, K}, exhibiting the 0(N3) scaling produced by the adaptive version of the algorithm. The second geometry configuration considered is made of three circles with points non-unifome distributed in each of them as shown in Figure 3.5. Each circle is 0.15 m in radius and the points were distributed so that density of points is inversely proportional to the radius. The adaptive version is applied on five different distribution sizes varying from 9,600 to 1,000,000 and the results are shown in Table 3.9. As before, it can be verified that the time scaling is 0(N3). 65 Figure 3.1: Example of antenna feed geometry with low- and high-frequency regimes denoted by Q L F and 9 H F respectively. Smallest wavelength of incident pulse is also shown for reference. dmin 92 Figure 3.2: Definition for domains interacting over multiple time steps 66 N |:I 3 2 - 1 - Direct AT Source box Single interaction case Figure 3.3: Map of N in equation 3.11 for an example single level interaction. Table 3.1: Comparison between Old and New (reduced) scheme for interaction list and P for different distribution sizes 67 3': I . . I. i. 0'..- - I...- II - n 'l ‘0 I :l I ‘f. . . I’ .‘ :‘1 .. A . I - '. .'. 1.: I I‘.; '.: :. e u. : ‘8 I I'. 0‘ . ...I :' . :‘. I * E I y. " I ~I I. 0 I. .~ . I a ' I .0. .M. . . ‘ a}' 3' I I - U. I ..' I I "t... -' ~ ..0 .fi'Iu'". 0 I '7 .. I. .. .: O. :t '0'.‘ .0 ~' ‘? mg}- \ E... ‘ . r .§:3-«’i-r’,__‘ag'._,g_.. :4. II . .o .e .' I ~I ." e I . J... . :.:."-5;'§'L‘:'-“~‘". . I... :. I -. o -I ’ I "' I If. .3” . :. I.. ' ‘ - ' - ' -' '. Z" I a s I I a. " ,..: ." .f‘ ‘ ' I' '5 z. s' o .2 -- - ‘ I 0‘. 'fio I . 42.; , ‘.-. . ._II n. . ~' -I. ’4'. e . . \- ‘ ' .’ I... :0 ’. I I . e . .. I. ' :::.v'-. ~ ' - 5 a. _ . ..‘ 1.1: .‘A . I .004..'. I. I 4 . .- ’ . ~54" \ . - I . I ..n . II I. a. ' I... a." g ._ I... ..' I . i O. ' 3‘. ‘ I : . .. II ~- I O . .I' ' I ' .3 s \ .5:- ..' . Figure 3.4: N on-uniform geometry configuration 1, resembling interconnect in elec- tronic chips (N3=12000). Table 3.2: Exact multipole to multipole and local to local operators of ACE ditto {P, K} A=0.0625 Levels {1,1} {4,2} A 4 1.8800972191556 69E—2 7.03463843261 4828E-4 A/ 2 5 1.8800972191556 66E—2 7.03463843261 3739E-4 A/ 8 7 1.8800972191556 70E—2 7.03463843261 3831E-4 A/32 9 1.8800972191556 70E—2 7.03463843261 3819E—4 68 '? n . .-n...o' .. . . . n . .. .‘QE. .. 0. ':;?":‘-.'....‘-. f T ' .1. ‘fi ‘0..- }" I a .I... ‘, ‘.' F . I ‘99.” 'e ‘vl ’5 - h. - . o' a ..:--v--r--.. -=" - - .: a. .I s It. a ..n '. ' O .0 .0. e a N ..I "‘ I ‘.:...‘l ' . '- ;.O‘I. . 0" .'II . . . ... - . :t'.:{."} 5 I: "i: u I o u i - I ~ 0 " g‘ :. . ‘ ~ . f 'I o o ‘ ~- .0 . ‘I I ". O . ’. ‘I {7. ' I ‘ .a. '. . :’- . ‘ I... .z. a T : o‘- I: .- I I Figure 3.5: Non-uniform geometry configuration 2 (N3=9600). 25 ‘ O {1 ,1} 2: um} 15- .{42} ——Llnoer Fit 9 Cl l 4.6 q -2 . . . . . . 3.7 4.2 4.7 6.2 5.7 6.2 6.7 L00 (NI) Figure 3.6: log(N3) vs. log(Tfa,.) for single interaction case and uniform geometry 69 1.1 0.6 - E 0.1 - §0A, onn 09 4 - {2.1} A {4.3} .14 . -—LineIr Fit '19 I l I I 3.7 4.2 4.7 5.2 6.7 L09 (N!) Figure 3.7: log(N3) vs. log(Tfa,.) for single interaction case and non-uniform geome- try Table 3.3: Error far in single time step interaction case (Cs 2 0.5), for various N 3 and {P, K} pairs. N3 8000 12000 32000 640001 soooooi levels 4 4 4 5 6 {P,K} Error jar {1,1} 0.00581 0.00995 0.00474 0.00808 0.0015 {0&1} 0.000938 0.00155 0.000637 0.000944 0.00176 {ELI} 0.000341 0.000596 0.000355 0.000638 0.00118 '982} 7.87E—05 0.000143 6.54E—05 0.000174 0.000406 '012} 2.09E—05 4.87E—05 2.59E—05 4.01E—05 2.48E—05 {95$} 4.97E—06 4.52E—06 2.31E—06 8E—06 8.34E—06 {13,3} 8.85E-07 1.62E—06 1.03E—06 2.43E—06 Table 3.4: Comparison of run-time in single time step interaction case (C's = 0.5). fl:fasta {P, K} N3 TDirect {131} {2’1} {4’2} {953? {13:3} 8000 4.47 1.40E—2 3.18E—2 0.14 2.17 10.96 12000 11.02 2.27E—2 4.6lE-2 0.23 3.60 18.50 32000 97.59 8.87E—2 0.18 0.91 14.42 85.2 64000 - 0.20 0.444 1.90 27.95 173.38 500000 - 1.94 3.82 15.67 245.37 - 1000000 - 3. 78 7.19 30.98 498.03 - 2000000 - 7.71 13.33 60.18 742.21 - 4000000 - 16.06 27.46 121.72 1940.15 - 70 Table 3.5: Error far in multiple time step interaction case, for various combination of N3, N and {P, K} pairs. N is the number of distinct time steps involved. 8000 12000 32000 32000 128000T 4 4 4 5 6 1 1 1 2 2 2 2 2 3 3 ETTOTfar 2.92E—03 2.97E—03 3.50E—03 2.11E—03 3.80E-03 5.79E-04 5.42E—04 7.06E—04 4.07E—04 3.33E—04 3.15E—04 3.07E-04 4.53E-04 3.22E—04 3.30E-04 6.27E—05 5.67E—05 8.14E—05 5.03E—05 7.19E—05 2.67E-05 2.45E—05 2.99E—05 1.54E—05 1.62E-05 1.97E-06 1.90E—06 3.32E—06 9.15E—07 Table 3.6: Comparison of run-time in multiple time step interaction case TFast: {P, K} N8 03 N TDirect {1,1} {2’1} {4’2} {933} 8000 1.0 2 2.02 0.03 0.06 0.31 4.85 12000 1.0 2 4.69 0.04 0.10 0.49 7.68 32000 1.0 2 61.93 0.19 0.44 2.52 43.97 32000 2.0 3 34.31 0.17 0.32 1.67 28.49 64000 2.0 3 - 0.67 1.52 8.76 165.45 500000 2.0 4 - 27.89 55.47 294.06 - 1000000 1.0 2 - 45.52 83.37 433.34 - Table 3.7: Error far for non-uniform geometry configuration 1. N3 8000 12000 32000 levels 4 5 6 {P, K} Error far {1,1} 3.81E—3 3.58E-3 3.23E—3 {2,1} 9.46E—4 8.62E—3 6.75E—4 {3,2} 1.83E-4 1.64E-4 1.14E—4 {4,2} 3.78E-5 3.43E—5 2.69E—5 {5,2} 1.62E—5 1.51E—5 1.32E—5 {6,3} 3.06E-6 2.59E—6 2.02E—6 {8,3} 1.18E-6 1.10E—7 9.46E—7 {9,3} 7.52E—7 7.27E-7 6.93E—7 71 Table 3.8: Comparison of run-time for non-uniform geometry configuration 1. TFast’ {P, K} Na {1,1} {2,1} {4,2} {63} 8000 0.03 0.05 0.14 0.49 16000 0.05 0.09 0.32 1.03 32000 0.11 0.20 0.63 2.28 64000 0.25 0.43 1.42 4.8 250000 1.19 1.75 5.33 18.86 500000 2.61 3.58 10.86 40.08 1000000 6.11 7.8 23.25 79.63 Table 3.9: Comparison of run-time for non-uniform geometry configuration 2. TFasta {Pa K} N3 {1,1} {2,1} {4,2} {63} 9600 0.03 0.05 0.19 0.51 38400 0.13 0.23 0.79 2.88 105000 0.4 0.72 2.35 8.42 450000 2.16 3.18 10.52 37.34 1000000 5.17 7.07 24.33 82.46 72 Chapter 4 Wideband FMM and Multiscale Electromagnetic Solver in Frequency Domain This chapter addresses the development of a fast algorithm for electromagnetic sim- ulation of multiscale structures in frequency domain. Section 4.1 provides a brief review of the multiscale problem in electromagnetics and the limitation of the ex- isting fast algorithms. Section 4.2 presents a general problem setting followed by a brief exposition on the sub-wavelength breakdown of FMM algorithm. In Section 4.3, ACE algorithm is employed for fast evaluation of Helmholtz potential in sub- wavelength scenarios; rigorous proofs are provided to establish the stability of these expansions. Section 4.4 describes the details of the hybrid algorithm, combining ACE and FMM, that is applicable to multiscale problems. Section 4.5 presents results on error and timing of the proposed schemes to demonstrate their numerical stability and efficiency. The hybrid algorithm was also integrated with an EM solver to analyze scattering from electrically large structures. 73 4. 1 Introduction Integral equation based methods are used extensively in scattering analysis. However, it is well known that they produce dense matrices that increase the cost and limit the size of the problem that can be solved. In past two decades, significant research effort has been dedicated to the development of efficient and accurate techniques to amor- tize this cost. These advances have had a widespread impact in variety of applications ranging from scattering and radiation analysis to micro-electronic packaging; an in- complete compendium of applications is presented in [88, 174]. The increased power and availability of computational resources and acceleration schemes have enabled so- lution of problems with very large number of unknowns, varying from few thousands to few millions [37, 40]. Another class of problems arise when analyzing structures which require a high local density of unknowns to capture geometric features. This class of problems, hereafter, referred to as multiscale problems exhibit multiple scales in frequency or length or both. For example, small length scale discretizations are required to capture sharp geometric features that are embedded within large and smooth geometries discretized at a coarser length scale. Similarly, multiple frequency scales is vital to analysis and design of ultra Wideband (UWB) antennas embedded in structures [17 5]. In general the characteristics of a multiscale problem is the con- centration of large number of unknowns in electrically small domains. Akin to the breakdown of time domain fast mathods for wave equation, existing techniques for Helmholtz equations also face limitations when applied to multiscale problems as their cost scaling is poor [46, 42] and mixed discretization also lead to badly-conditioned matrix systems [176, 177, 178, 179, 175}. The development of a fast algorithm that is stable and efficient for multiscale structures is addressed in this chapter. The latter problem of badly-conditioned system is remedied with the use of an alternate integral equation formulation, whose development is detailed in Appendix A. 74 919 I: app: PIC“! less see. he CU; As mentioned in Chapter 1, FMM has become an indispensable for large scale electromagnetic analysis [174, 34]. However, FMM becomes numerically unstable and inefficient when applied to multiscale problems [42]. This is a consequence of the fact that Helmholtz FMM does not smoothly transition to Laplace FMM as frequency tends to zero. This was first remedied by introducing a suitable scaling factor [46] that ensures the computed quantities are stable and the transition is smooth. However this approach is not suitable for problems with multiple length scales. An alternative approach based on spectral representation of free space Green’s function in terms of propagating and evanescent plane waves was proposed in [42] . This approach seam— lessly transitions from high to low frequency kernel for both spatial and frequency scaling [50], but it requires the evaluation of an infinite integral in k—space; general- ized Gaussian quadratures [50] and other approaches based on contour integration in complex plane [84, 49, 180] have been explored for this purpose. The main contributions of this chapter are, a development of a low frequency fast method based on Accelerated Cartesian Expansion (ACE) algorithm 0 development of a hybrid scheme by combining ACE with FMM for multiscale problems 0 derivation of convergence proofs and bounds o integrating the hybrid scheme with integral (equation solvers and demonstrate its application to practical problems. Though the overall structure of the hybrid algorithm developed here bears some similarity with [50], it should be emphasized that they are two different algorithms. 75 4.2 surfa l. I cum lllSlC (CF be! 1hr 4.2 Integral Equation and FMM for Helmholtz Equations Let S denote the surface of a closed PEC object that resides in free-space. This surface is excited by a plane wave characterized by {E2 (r), Hi (r)} with wavelength A. The scattered fields are denoted by {E3 (r), H3 (r)} and are radiated by equivalent currents J (r) on the surface S. Let 8‘ denote a surface that is conformal and just inside 5 and let Et(r) = E3(r) + E’(r) and Ht(r) = H3(r) + Hi(r) denote the total electric and magnetic fields, respectively. The combined field integral equation (CFIE) formulation for solution of J (r) is, as x a x Et(r) + (1 — a)fi x Ht(r) = 0 Vr e 5* (4.1) where fl is the outward pointing normal and a is an arbitrary scalar constant chosen between 0 and 1. The scattered electric and magnetic fields are related to J (r) through the dyadic Green’s function, " x n X E3(r) = £e{J(r)} (4.2) = fixfix/SdsGdnr) J(r) n x H8 (r) = ICm{J(r)} (4.3) . 1 = I I = an—n; Sdst [Gn(r,r)-J(r)] fink, r') = —-jmy (T + Z—ZV) g(r, 1") (4.4) e— 'nlr—r’] g(nr') = Vii—If] (4.5) In above relations It is the wavenumber, 77 is the characteristic impedance of free space and I: is. the identity dyad. The CFIE formulation is chosen to eliminate spurious 76 4‘ solutions current 1 l'sing Ga where. is is term: he re ah the (lit TEX 1h i0 tc tl solutions corresponding to interior resonance problem. As is normally done, the current J (r) is represented using a space of local vector basis functions fm(r) [181]. Using Galerkin testing results in a system of equations that may be expressed as ZI = V (4.6) where, an = (010'), —a£e{fm(r)} + (1 - a)’Cm{fm(r)}> (4-7) (as), seams)» = —jm + €< fm(r)> (4-9) 12,, = (1,,(r), an x a x Ei(r) + (1 — a)fi x Him) (4.10) As is evident from these equations, the evaluation of each element may be recast in terms of evaluation of scalar potentials. Thus, to better analyze the problem, it can be reposed as follows. Find the potential 111(r) due to a set of N sources N e—jKIr—ril W) = 2 i=1 [1' _ ril wnw, (4.11) where 111,, and w,- represent the appropriate testing and source strengths, respectively, that include numerical quadrature weights and other constants. It is evident that a direct evaluation of potential at N observation points yields an 0(N2) method. FMM reduces this cost to 0(N log N) by utilizing spherical harmonic expansion [30, 65] of the scalar Green’s function. Very briefly, the classical FMM algorithm proceeds as follows: the computational domain is embedded in a fictitious cube that is then used to construct an oct-tree. At the lowest level, interaction between the elements of boxes that are in the nearfield of each other is computed directly. All other interactions 77 are mmpd that read] , operator 1 the field 1 a single]. \llile 111 bar um} that are “he: Elma are computed using a three stage algorithm: (i) compute multipoles M from sources that reside in each box; (ii) convert these to local expansion L, using the translation operator T , at all boxes that are in its far field; (iii) from the local expansions compute the field at observer points within the box. This simple three stage scheme is called a single-level algorithm and suffices to discuss the limitation of these expansions. While multilevel variants of this scheme exist [35, 69], the limitations of FMM are best understood by examining (6.8). Consider K closely spaced sources located at r,- that are well-separated from the testing point r, W) = if?” d2RM(r3,k)T(ro—r3,k)wne—jk'(r°_r) (4.123) = '73? [dzkwne—jk'(r0‘r)£(ro,k) (4.12b) K . M(r3,k) = Zwie—Jk'(r3-ri) (4.120) i=1 T(r, k) = Z(—1)’(21 + 1)hl(2)(n|r|)Pl(k . f) (4.12d) l=0 £(r0,k) = M(rs,k)T(ro—r3,k) (4.12e) where [r0 — r3] > 2d, |r — r0] < d, k = 5k, r3 (r0) is the center of multipole (local) expansion for source (observation) cluster, ’1' is the translation operator, hp and P] denotes an order I spherical Hankel function of second kind and Legendre polynomial, respectively. 4.2.1 Sub-wavelength breakdown of Helmholtz FMM In FMM the interaction between source and observation clusters is evaluated using the translation operator T in (6.8d) that contains a spherical Hankel function. The singular behaviour of spherical Hankel function implies restrictions on the size of its argument nlrl. For numerical stability, neither the translation distance [r] nor the 78 wavenuzz. has beer; submar- or order high den] the owl problem: 4.3 111.10 form ( w} wavenumber K. can be arbitrarily small [73, 76]. Limitations on these parameters has been reported in detail in [74]. As a result, FMM is inefficient when applied to sub-wavelength problems where the principal dimension of the domain is less than or order of a wavelength only. In such problems, some of the leaf boxes have a very high density of unknowns and the overall computational complexity is dominated by the nearfield cost. Thus FMM algorithm is inefficient when applied to multiscale problems where the discretization rate is either non-uniform or uniformly dense. 4.3 ACE translation operator for Helmholtz po- tential In ACE, the translation operator is the only kernel dependent quantity. The analytical form of the translation function in case of Helmholtz potential can be written as, n n e_,.,,R 1111 1121 12331 Vn R (n1,n2,n3) = Z (_1)n+mR2m—2n—1 m1=0m2=0m3= n1 n2 n3 (4-13) X m1 m2 m3 $n1—2m1yn2-2m22n3—2m3g(n _ m, KR) where g(n,,.11) = Whom)(Humanism) n n! = 2mm!(n — 2m)! m In above expressions, n = n1 + n2 + n3, m = m1 + mg + m3, Kn(-) represents the modified Hankel function of order n, R = |r| and [-j is the floor operation. It is well 79 1.110111 size 0: traml for m known that above Taylor’s series expansion is convergent when either the domain size or frequency is small. Consider the limiting case of low frequencies or small translation distance i.e. KR —> 0. Using the small argument asymptotic expansion for modified Hankel function [59] and comparing with the translation operator for 1 / R potential given in [13], (6.14) can be written as, P 0.5 n+0.5 gwm z ¢2/«<1«R><“+°5>(—”;’—l (TIER) for 0 (4.30) p=0p' oo 1 s 2 —,[4.4—oi”)-p-vpsin(—k~(rs—r4))l p=P+lp' = 5%mo 83 Since both real and imaginary parts satisfy same error bound, the Tmap Operator preserves both the amplitude and phase of FMM multipoles to desired precision. The error bound for FMM-to—ACE local expansion translation operation is identical to ACE-to-FMM multipole translation as they use the same mapping operator Tmap. Furthermore, let d be the side length of the cube where ACE harmonics are defined. Then in (4.29), a = \/3d and for convergent error d S /\ / 2.65. This limit is within the range [0.2, 2.0]A where both ACE and FMM expansions are stable, hence the error in transition from ACE to FMM and vice versa can be controlled to arbitrary accuracy. 4.4.2 Implementation details As shown in figure 4.1 the multiscale geometry is mapped onto a non-uniform oct- tree. This ensures that the number of unknowns per leaf-level box is approximately same [50, 61]. In figure 4.1(b), the dark and light nodes indicate ACE and FMM computational domains respectively. This classification of tree-nodes is based on the size of the domain they represent and introduces a transition level such that nodes at and above this level are of FMM type and nodes below this level are of ACE type. The hierarchical tree code computation starts with the evaluation of appropriate multipole expansion at leaf boxes. During upward tree traversal the parent multipoles are computed from their children multipoles using the multipole-to-multipole translation operator. At transition level alone the parent box FMM multipoles are computed from their children ACE multipoles using the mapping operator in (4.23). Next the appropriate multipole-to—local expansion translation operation is performed. Then the children local expansions are updated with their parent local expansion using the local-to-local expansion translation operator. Again at the transition level, the child box ACE expansion is updated with its parent box FMM local expansion using the mapping operation in (6.16). Finally the local expansion coefficients at leaf- level boxes are used to compute the farfield potential at their respective observation 84 points. .- contrihzd 4.5 This sec] hybrid 5’ to erahr SCheme algor'rth 4.51 is r poi cor Op rel oi points. As in all tree algorithms the complete potential is obtained after the near-field contributions are accounted through direct evaluation. 4.5 Results This section presents plethora of results that exhibit the accuracy and efficiency of the hybrid scheme when applied to multiscale problems. First few set of results pertain to evaluation of Helmholtz potential given a set of random points. Later the hybrid scheme is integrated with an existing solver and it effectiveness, over FMM-only algorithm, is shown for several problems. 4.5.1 Helmholtz potential evaluation First, the accuracy and stability of of ACE-only algorithm when applied to sub- wavelength problems is demonstrated. Consider the evaluation of Helmholtz potential at N source/ observation pairs that are randomly distributed within a domain of size A/ 2. The error incurred in computing only the far-field potentials using ACE are listed in Table 4.1. As is evident, the error decreases uniformly with increase in the number of harmonics. Note, that the error presented here does not include nearfield contributions; in general the total error including the nearfield contribution, that are computed exactly, is less by two orders of magnitude. Next, the convergence of the mapping operators ACE to FMM (FMM to ACE is reciprocal) prescribed in (4.23) is demonstrated. Given some arbitrary number of points confined within a domain of size A, the FMM multipoles for a given box size are computed both directly and from their children’s ACE harmonics using the mapping operator Tmap. Let 2d be the side-length of the FMM box, Table 4.2 shows the relative error in computation of FMM multipoles for various values of d and number of Cartesian harmonics P used in mapping. As expected the mapping error uniformly 85 decrees To d hiring . random 8.000.001 ‘2) to 1‘3 stnrcted oi doma points is cordigur the cen' average ACE 211 This we domly g of fan“) Pentlur Li M .11 dlSirlbr figure dislrlbr UllilOm Scheme terill0n dislribu the loo to him 1 decreases to double precision as P increases or as d decreases. To demonstrate the efficiency of the proposed algorithm, the hybrid scheme com- bining ACE and FMM is applied to both uniformly and non-uniformly distributed random points. In both cases the number of unknowns N is varied from 64,000 to 8,000,000. In case of uniform distributions the size of the domain is increased from 2) to 12A as the number of unknowns is increased. The non-uniform geometry is con- structed of three overlapping thin disks, as shown in figure 4.1(a), and the overall size of domain was fixed at 12A. In each disk the points were distributed so that density of points is inversely proportional to radius and linear in z—axis. Note that this geometry configuration closely resembles a multiscale scenario as the discretization rate, near the centers of disk, can be as high as x\/ 1000. In all cases it was ensured that the average number of points per leaf-level box is approximately 64 and the number of ACE and FMM harmonics were chosen so as to maintain an accuracy of 0(10‘4). This was verified by performing direct computation on few, typically 50 to 100, ran- domly selected points. Table 4.3 shows the time taken, in seconds, for computation of far-field potential using the hybrid algorithm on a desktop computer with 2.3 GHz Pentium IV processor 4GB RAM running Linux. In uniform distribution LACE and L F M M denotes the number of ACE and FMM levels respectively. In non-uniform distribution LACE and L F M M were constant at 5 and 3 respectively for all cases. Figure 4.2 shows the Log-Log plot of time vs. N for both uniform and non-uniform distributions. The linear line fit with slope one, indicated inside the figure, for both uniform and non-uniform case shows that the cost scaling of the proposed hybrid scheme is irrespective of how the points are distributed. It is important to draw at- tention to the overlapping linear line fits corresponding to uniform and non-uniform distribution of points. This, in particular, highlights the fact that the time taken by the hybrid algorithm depends purely on the number of unknowns N without regard to how the points are distributed. 86 4.5.2 Multiscale scattering problems Next, the performance of integral equation solver augmented with ACE+FMM with that augmented with only FMM. The solvers employs RWG basis ftmctions for surface currents J(r). GMRES iterative solver with a restart value of 30 was used with tolerance and maximum number of iteration fixed at 1E—3 and 1000, respectively. In these numerical experiments geometries with different overall size and number of discretizations were considered along with different excitation frequencies. In rest of the results FMM harmonics were used when the box size is greater than or equal to A / 4 and ACE harmonics for rest of the domains. For each configuration, of chosen geometry and frequency, the CFIE solver was executed in two modes (i) FMM-only: where the leaf-level box size is fixed at A/ 4 to ensure that only FMM harmonics are utilized (ii) ACE+FMM: the non-uniform tree is constructed such that the average number of unknowns per leaf-level box was approximately 10 to 20. Note that in ACE+FMM runs the smallest domain size can be as small as /\/40. In all cases FMM harmonics were used when box size was equal to or greater than /\ / 4 and ACE harmonics for rest of the domains - this defines the transition level in the hybrid algo- rithm. The ACE and FMM harmonics were chosen such that they yield an accuracy of 0(1E — 3). The maximum run time for each simulation was limited to 6 days and any unfinished data is denoted by *. The following values are reported in table for each simulation: near-time and solve-time are the time, in seconds, for computation of sparse near-field matrix and iterative solution respectively, speed-up is the ratio of total time spent by the solver using FMM-only and ACE+FMM algorithms, Avg/boa: and M an: denotes the average and maximum number of source / observer pairs per leaf- level box respectively. The ratio of maximum to minimum edge length serves as good measure of the multiscale nature of the problem as it is close to one for uniformly discretized geometries and high for discretization with multiple length scales. 87 First multiscale problem considered here is the cone-sphere geometry. Here the cone’s tip is densely discretized in comparison to the smooth sphere part of the structure. Table 4.4 shows the results of solver using FMM-only and ACE+FMM algorithm when applied to three different cases. The geometry fits within a cuboid with aspect ratio 1 : 1 : 5 and the maximum dimension for each case is given in terms of the incident wavelength in table 4.4. Let 2 be the axis of rotation of cone- sphere. The propagation direction of incident plane wave E = ie‘j’df'r was k = g for first two runs and k = 2 for Run 3, as shown in figures 4.4 and 4.3. As is evident the solver with hybrid scheme offers speed-up as high as 7 times over that using FMM-only algorithm. Essentially this speed-up is achieved by reducing the number of near-field interactions as indicated by the near-time in table. This is due to the fact that ACE+FMM algorithm allows domain size to be as small as A/ 40 which in turn reduces the average number of unknowns per box considerably in comparision to FMM-only case. As expected, the speed-up offered by ACE+FMM algorithm reduces as the problem size increases as most interactions fall under FMM; only few number of interactions exist in sub-wavelength domains and ACE algorithm does not offer much advantage over their direct computation. Figure 4.3 and 4.4 shows the bi-static RCS corresponding to Run 3 and Run 2 in table 4.4. The RCS computed using both solvers exhibit excellent match to given order of accuracy. Table 4.5 shows results from second multiscale problem - NASA almond. The entire structure fits within a cuboid with aspect ratio 1 : 6 : 4 and the maximum dimension is given in terms of incident wavelength for each case in the table 4.5. In all cases the direction of incident plane polarized along 2?: was k = 2 as shown in figure 4.5. Three different meshes were considered with the number of unknowns varying from 62,000 to 250,000, the increasing multiscale nature of the problem is indicated by the max/ min edge length. Here again the hybrid scheme offers speed-up as high as 7 times over FMM-only approach. Notice that in Run 3 the solver with ACE+FMM 88 has completed its run while the large number of near-field interactions in FMM-only consumes almost the entire computational time. Figure 4.5 shows the bi-static RCS computed using both the solvers for Run 2 in table and they agree with each other. Table 4.6 shows results from third multiscale problem which is a toy-aircraft ge- ometry with many sharp features. The structure fits within a cuboid with aspect ratio 3 : 1.5 : 1 and the maximum dimension in terms of incident wavelength is reported in table. In all cases the direction of incident plane wave polarized along :5 was it = 2 as shown in figure 4.6. The number of miknowns was varied from few thousands to millions as the maximum dimension was increased from 1.5 to 20 A. The maximum to minimum edge length ratio for this geometry was apprOximately 20 in all cases, indicative of a uniformly dense discretization. The solver with ACE+FMM exhibits speed-up as high as 14 times over the solver with FMM-only algorithm. Notice that in Run 3 with 1.7 million unknowns, the large number of unknowns per box, indi- cated by average and maximum unknowns per box in table, in FMM-only case results in large number of near-field interactions which consumes the entire computational time. In comparision, with smaller domains in ACE+FMM algorithm the number of unknowns per box is considerably smaller and entire computation is completed within the limited time. Figure 4.6 shows the bi-static RCS corresponding to Run 2 in table 4.6. 89 (b) Figure 4.1: An example non-uniform (a) point distribution and (b) its tree represen- tation. Table 4.1: Error convergence of ACE algorithm with random points within a A / 2 size dome Lin P 1 3 5 7 9 l2 Error 1.172E—2 4.804E—4 3.936E—5 1.236E—5 4.380E—6 7.029E—7 90 st.T (Error0(1.053)) 3 ' 121. 0 Uniform; y- 1.0011 -4.22 2.5: I Non-Unifonn; y- 1.001: -4.18 1°“ w 2 ‘ 811 E 1 5 O _J 1 1 0.5 ‘ 2A 0 . . . . . . 4 4.5 5 5.5 6 6.5 7 7.5 Log (N) Figure 4.2: Time vs. no. of unknown in log-log plot when hybrid scheme is applied to uniform and non-uniform (fig 4.1) geometries. _FMM+ACE ‘ .___FMM-only o . g -5 > Ex .5 . ~- 8 \A M. 1 60 80 .100 120 140 160 180 Thetamdeg. Figure 4.3: Bi—static RCS of cone-sphere geometry, corresponding to Run 3 in table 4.4. Inset figure shows the incident excitation and magnitude of surface current. 91 RCS in (if) F i811MB to RM Surfaflr 20- m u .5 1o . (I) 0 m I — ACE+FMM — - FMM-only ~10 - q .200 -100 100 200 0 Theta in dog Figure 4.4: Bi-static RCS of cone-sphere geometry, corresponding to Run 2 in table 4.4. Inset figure shows the incident excitation and magnitude of surface current. 20 — ACE+FMM — — — FMM-only RCSindB -150 -100 -50 0_ 50 100 150 Theta 111 deg. Figure 4.5: Bi-static RCS of NASA fat almond (multiscale geometry 2) corresponding to Run 2 in table 4.5. Inset figure shows the incident excitation and magnitude of surface current 92 “ACE+FMM ---FMM-only m- “ ’ 5 I . 10»- ’ at? "..‘s I .E I (I) _, .. O 1 a: I I 1200 450 400 -so 0 so 100 150 200 Theta in dog. Figure 4.6: Bi-static RCS of Toy-aircraft geometry (multiscale geometry 3) corre- sponding to Run 1 in table 4.6. Inset figure shows the incident excitation and mag- nitude of surface current Table 4.2: Error in FMM multipoles computed from ACE multipoles using Tmap in (4.23) ACE harmonics d P=3 P=6 P=9 P=12 0.5 2.13 5.58E-3 9.62E—6 5.90E—09 0.25 2.58E—2 8.04E-6 1.51E—9 1.27E—13 0.125 3.49E—4 1.30E—8 1.55E—13 2.24E—15 0.0625 1.04E—5 5.34E—11 1.41E—15 1.41E—15 93 11th 4 tries Table 4.3: Time for hybrid algorithm as applied to uniform and non-uniform geome- tries Uniform Non-uniform N Size(A) Time LACE L F M M Time 64,000 2 3.77 3 5 7.48 125,000 2 6.29 3 5 8.13 250,000 2 14.39 3 5 13.93 500,000 4 34.57 3 6 35.71 1,000,000 4 68.7 3 6 79.53 2,000,000 4 125.26 3 6 135.43 4,000,000 8 310.22 3 7 263.7 8,000,000 10 588.94 3 7 484.02 Table 4.4: Multiscale problem 1 : Cone-sphere geometry Near-Time Solve—Time Speed-up Avg/ box Max Run 1 800 MHZ, Size ft: 2A with 19,000 basis ACE+F MM 832.47 449.87 6.21 14 2,288 FMM 7843.59 117.87 1,329 3,424 Run 2 76 MHz, Size n 7A with 19,000 basis ACE+FMM 593.56 344.96 1.39 17 1,666 FMM 1028 277.13 100 1,926 Run 3 10 GHz, Size z 21A with 72,000 basis ACE+FMM 1100.28 1107.1 0.79 3 2,084 FMM 1440.7 300.28 41 2,224 94 Table 4.5: Multiscale problem 2: Almond Near-Time Solve-Time Speed-up Avg/ box Max Run 1 1 GHz, Size = 5A with 62,550 basis Max/min edge len.= 160.34 ACE+FMM 1063.88 795.31 7.71 5 270 FMM-only 13532 800 4 242 6,118 Run 2 1.5 GHz , Size = 8A with 107,400 basis Max/min edge len.= 193.42 ACE+FMM 1569.43 6713.38 2.66 4 256 FMM-only 20548.77 1475.01 175 5,464 Run 3 2 GHz , Size = 10.6A with 269,100 basis Max/min edge len.= 474.61 ACE+FMM 22247.25 26428.42 2: 11 2180 FMM-only 96297.28 * 265 13,110 Table 4.6: Multiscale problem 3: Toy-aircraft N ear-Time Solve-Time Speed-up Avg/ box Max Run 1 76.2 MHz , Size = 1.53A with 9,727 unknowns ACE+FMM 196.15 286.07 14.29 7 36 FMM-only 6491.35 400.66 2,784 5,178 Run 2 300 MHz, Size = 6.06A with 26,145 unknowns ACE+FMM 62.54 1343.4 6.4 2 12 FMM-only 7800.79 1203.73 697 2,646 Run 3 1 GHz, Size = 20.27A with 1,754,814 unknowns ACE+FMM 237767.37 58500.89 * 98 1,174 FMM-only * * 691 4,508 95 Ch A); BC. This lead sub: tion dev tears a C1 cor. 0ft Chapter 5 A Well Conditioned Formulation of Augmented Electric Field Integral Equation (AEFIE) This chapter addresses the develOpment of integral equation (IE) formulations that lead to well conditioned systems of equations. Typically iterative solvers, like Krylov- subspace solvers, are used for solution of large systems of equation and well condi- tioned systems of equation require fewer number iterations for solution. Thus the developments presented here are complementary to the discussions in previous chap- ters where the focus was on reducing the cost of a single iteration. Section 5.1 provides a concise account of the recent research work on the theory and development of well conditioned formulation for electromagnetic simulations. Section 5.2 introduces some of the analysis tools and the insights they provide in understanding the IE operators of EM. Section 5.3 introduces a new formulation of the augmented electric field IE (AEFIE) that leads to both better conditioned systems of equation and unique so- lutions at all frequencies. The new formulation is first developed for 2D and then extended to 3D case with appropriate modifications. Section 5.4 presents plethora of 96 result: 5.1 Comp). oomerl he prir of com louse; have 8 Int ‘llatio; electr. detail linear that 1 depe) trie fl and ( and 1 [184. f017m r€480 the e: HlfiCa 11011-5, Only 5 results that exhibit the well-conditioned nature of the new formulation. 5. 1 Introduction Computational electromagnetic (CEM) is the field of research that concerns with numerical solution of Maxwell’s equation. The rapid development in this field can be primarily attributed to the simultaneous development in power and availability of computers along with the advancements in mathematical research. This chapter focuses on the latter aspect, the mathematical developments in the past decade that have considerably altered the landscape of CEM research. Integral equations (IE) is one of the widely adopted numerical techniques for sim- ulation of electromagnetic problems [183]. The distinct advantage of IE approach for electromagnetic simulations over their differential equation counterparts have been detailed in the previous chapters. Typically, the IE formulations result in a set of linear system Of equation, which are solved using an iterative solver. It is well known that the convergence rate of an iterative solver, the number of iterations for solution, depends directly on the condition number of the numerical system of equation. Elec— tric field IE (EFIE) is one the most widely used formulation as it is valid for both Open and closed problems. It is well known that EFIE is an integral equation of first kind and the condition number of these numerical system is not assured to be bounded [184, 185]. Further, EFIE also suffers from the low-frequency breakdown where the formulation is inherently ill-conditioned for low excitation frequencies. The physical reasons for this breakdown of EFIE at low-frequencies is well known [17 6]. Consider the electric field produced by an arbitrary electric current source, there exists a sig- nificant disparity in the magnitude of electric field produced by the solenoidal and non-solenoidal part of the current source. Thus, using the electric field equations, only some parts of the source can be computed in a stable and robust fashion. It 97 0V8! quir err. 118 ite co tic to £11 f0 31 can be analytically shown that the disparity increases as the frequency tends to zero and this is known as the low-frequency breakdown of EFIE. Several computational strategies, based on quasi-Helmholtz decomposition of surface electric currents, have been proposed to overcome this limitation of EFIE. Loop-star, loop-tree and tree/co- tree are different forms of the same type of solution approach [179, 178, 87, 186]. Apart from this, recent mathematical analysis of boundary integral equations in EM have lead to the development of analytical preconditioners that modifies the EFIE into a well conditioned, second kind integral equations [187, 188, 189]. However, the resulting formulation suffers from the interior resonance problem and hence produce non-unique solution at resonance frequencies. Conventional techniques, like combined field IE (CFIE), also fail when applied to these new formulations. Further, the nu- merical implementation of these modified formulation demands careful considerations that has been the focus of several recent research works [190, 191, 192, 193]. This work explores the development of a new IE formulation for electromagnetic simulation that is both well-conditioned and resonance free. This formulation is based on the augmented field integral equations (AFIE), which were initially proposed to overcome the interior resonance problem [194]. AFIE, as originally proposed, re- quires the solution of an over-determined system of equations using a least-square- error approach. In this work, both the electric charges and currents are considered as unknowns and the resulting new AFIE formulation is amenable to conventional iterative solvers. However, this requires that the imposition of continuity and charge conservation conditions separately. Based on operator theory analysis, these addi- tional constraints are imposed in a manner such that the operators in the resulting formulation are bounded and compact; leading to well—conditioned systems of equa- tions. The validity of these formulation is shown both analytically and numerically for 2D problems. Since some of the observations in 2D case does not hold good for 3D problems, the relevant modifications are also discussed here. 98 5.2 let S 1: surface it. The current {or sold let S lntegr Ina elec‘ In Sp. 5.2 Preliminaries Let S denote the surface of a closed PEC object that resides in free-space. This surface is excited by a plane wave characterized by {Ez (r), Hi(r)} with wavelength A. The scattered fields are denoted by {E3 (r), H3(r)} and are radiated by equivalent currents J (r) on the surface S. The electric field integral equation (EFIE) formulation for solution of J (r) is, f1 X f1 x (Ez(r) + E3(r)) = 0 Vr E S (5.1) Let S — denote a surface that is conformal and just inside S, then the magnetic field integral equation (MFIE) is written as f1 x (H3(r) + Him) = 0 ‘v’r e s- (5.2) In above equations, fl is the outward pointing normal on surface S. The scattered electric and magnetic fields are related to J (r) through the dyadic Green’s fimction, f1 x ii x E3 (r) = £t{J(r)} (5.3) = fixfixfgdan(r,r)-J(r) (5.4) —f1 x H3(r) = ICt{J(r)} (5.5) . 1 = I I = nxj-TT7 Sdst [Gn(r,r)-J(r)] E1403 r’ = —jm] (I + %) g(r,r’) (5.6) I e—jn|r-r’| g(r,r) = m (5.7) In above relations K. = w/c is the wavenumber, w is the angular frequency, c is the speed of light in free space, 77 is the characteristic impedance of free space and I is the 99 identity dyad. Since the above formulations are based on the boundary conditions on tangential fields, the IE operators [It and IQ are referred to as tEFIE and tMFIE, respectively. Typically, in numerical solutions, the current J (r) is represented using a space of local vector basis functions fm (r) [181]. Galerkin testing results in a system of equations that may be expressed as 21 = v (5.8) For example, in case of tEFIE, the elements of Z and V can be written as an = (5.9) (fn(r),£1{fm(r)}> = —jnn+% (5-10) 12,, = (fn(r),fixfixE’(r)) (5.11) 5.2.1 Interior Resonance and Augmented IE It is well known that, for closed geometries, both tEFIE and tMFIE Operators have non-empty null space at excitation frequencies corresponding to interior resonance. Thus, at these frequencies the solution of the equations (51,52) is not unique [195]. Combined field IE (CFIE) is a popular alternative, where both EFIE and MFIE are solved simultaneously. Other proven approaches to overcome interior resonance problems are combined source IE (CSIE) [196] and dual surface IE [197]. All these alternatives demand additional computation in one form or the other, for e.g. CFIE requires computation of magnetic field, and CSIE doubles the number of unknowns. Augmented field integral equations (AFIE), proposed by Yaghjian [194], is an alter- native approach to overcome the interior resonance problem with the computation of either electric or magnetic field only. In AFIE, a unique solution to currents J (r) is obtained by simultaneously satisfying both the tangential and normal boundary 100 condit] Simllzt lo the IESpor boom All? it is . Syste appr tint-r AFli iSne 5.2 In ti to st elect condition of electric or magnetic field. Consider the. augmented EFIE (AEFIE) mi} = —fi x a x 132(1) Vr e s (5.12) 5,,{J} = 50?- — a - Ei(r) ‘v’r e s (5.13) 5,,{1} = a . / 1135,01, r’) -J(r') (5.14) S Similarly, the augmented MFIE (AMFIE) can be written as, ICt{J} = J — a x Hi(r) Vr e s (5.15) 1c..{J} = —fi - Hi(r) Vr e s (5.16) . 1 = I I 1cm} = n - 1'77 SdsV x [0,,(r,r ) -J(r )] (5.17) In the above equations Ln and [Cu are the nEFIE and nMFIE operators that cor- respond to the boundary condition on electric and magnetic fields normal to the boundary surface S, respectively. It has been rigorously shown that both AEFIE and AMFIE produce unique solutions for any closed geometry except spheres. Moreover, it is evident that discretization of both the formulations lead to an overdetermined system of equations, which can be solved only in a least squares sense. Hence this approach has been relatively less popular when compared to CFIE or CSIE. A dis- tinct feature of AFIE, also noted in the seminal work of Yaghjian [194], is that the AFIE formulation is similar to a second kind IE. However, exploiting this advantage is not trivial and forms the main focus of the work presented here. 5.2.2 Operator and Eigenvalue Analysis In the past decade, rigorous mathematical analysis techniques have been employed to study the different boundary integral equation formulations used in computational electromagnetics. These theoretical analysis of boundary IE operators depend on the 101 form and size of the computational geometry and they are restricted to the study of canonical geometries such as circular cylinder and sphere in 2D and 3D, respectively. Though analytical solution exist for these geometries, the theoretical analysis have provided valuable insights that led to a better understanding of the behaviour of these IE Operators. It is well known that an IE Operator with finite and bounded spectrum, when discretized, yields a well-conditioned system of equations [185, 191]. In formal terms, a formulation is well-conditioned if all its operators take the form (I + M) , where I is the identity operator and .M is a compact operator. Intuitively, compact Operators have bounded spectrum and the presence of identity operator ensures that the spectrum of overall operator is offset from origin . Hence, the spectral radius of these operators are finite and bounded, see [184, 185] for rigorous treatment on these topics. In rest of this section, for the sake of completion and clarity, these recent develOpments are presented in requisite detail for the 2D case. Similar observations hold for 3D case also, however with more involved derivations beyond the scope of this thesis, and these are just stated with ample references. Consider the 2D problem of traverse electric (TEz) scattering from a PEC circular cylinder Of radius a, with axis of rotation aligned along the Z-axis. The tEFIE formulation (5.1) for this problem can be written as, 27r VV -jl€7)f‘ X /ad¢, (7 + it?) H32)(KIR) - J¢(¢I) = —f‘ X Ez(¢) (5.18) 0 where H (2) is the Hankel function of second kind, R = a [r—r’ | , r = a cos ¢i+a sin 33) 0 and r’ = a cos (15’ a”:+a sin (1’37. For this canonical case, it is well known that {43(3an n = 0,1, . . .} forms a complete set of eigenfunctions for the surface current J [183]. Then 102 the or. where eigenf‘. The; 8155 0. lli( Th1 wjd for the unknown surface current J ¢ can be represented as sum of these eigenfunctions, 00 J¢ = Z Inasejnd’ (5.19) n=0 where In are the unknown coefficient to be solved for. Using the orthogonality of the eigenfunctions, the eigenvalues of Operator Lt are given as 1 I AgE’tEFIE = 5(n7ma)J,’,(I~za)Hn(2)(Ica) (5.20) The plot of few of these eigenvalues for different orders 77. and size of the object a is shown in figure 5.1. Evidently, these eigenvalues are zero whenever J;,(I~za) = 0, indicating the non-trivial, finite dimension null space of the TE—tEFIE Operator. These are also the frequencies corresponding to the interior resonance. Further, from figure 5.1, it is seen that the spread between eigenvalues is large , especially, for small values of 15a. Consider the asymptotic limits when KG. —> 0, TE,tEFIE ~ .nna AZEJEFIE z 39%] —» 00 ;n 55 o (5.22) This suggests that the tEFIE Operator is unstable for electrically small scatterer. The widely spread eigenvalue spectrum also suggests that tEFIE Operator is an unbounded operator. This is particularly a consequence of the double derivatives in (5.18) that leads to hyper-singular terms. The same eigenvalue analysis can be extended to tMFIE operators. The tMFIE for 2D TEz scattering from a PEC cylinder of radius a can be written as, 1 . 2“ I" I I (2) zctrJ¢<411=J¢+Er x [0 cake 4(4 )J¢(¢ 1H. (.11) (5.23) 103 Sam; . Here. tions null s] narree analxd Thus 3CC'UI whet 3580c As in the case of tEFIE, {43ejnd’} form a complete set of eigenfunctions and the corresponding eigenvalues are given as, 1 I ZEWFIE = §(j7ma)Jn(Ica)Hn(2)(na) (5.24) Samples of these eigenvalues are plotted in figure 5.2 for different values of n and 14a. Here, the eigenvalues are zero whenever Jn(1ca) = 0 and the corresponding eigenfimc- tions form the finite, non-trivial null space of the tMFIE Operator. Notice that the null space of tEFIE and tMFIE are not the same, in other words, the interior reso- nance for both operators occur at different frequencies. Performing the asymptotic analysis for I430. —> 0, Ag’EJMFIE z 1 (5.25) AgEttMF’E z § ;n ,4 o (5.26) Thus the spectrum of tMFIE operator is bounded and for n —+ 00 the eigenvalues accumulate at (0.5+j0.0). Similar analysis can be carried out in 3D for the canonical problem of scattering from a PEC sphere of radius a. Here, the vector tesseral harmonics Xnm and Unm form the complete set of eigenfunctions for representation of the vector current fields on surface of the sphere. These are also known as the surface Helmholtz decomposition [187] on sphere and given by, x5484) = i- x VYJ"(6,¢) (5.27) Unm(9,¢) = 1" x xnm(o,¢) (5.28) where 14:" (0, (1)) = P,’,"(cos 0)ejm¢ is the spherical harmonics and Pi," denotes the associate Legendre function of order {n, m}. The eigenvalues of tEFIE and tMFIE 104 opera unho{ about 261063 hueri probl these QXlQl fine tion mre The our ati H10 In t 001] Operator in 3D can be derived as, xnm 2 .n (1) Unm 4. = (8a) 1 (841117184) (5.29) Um, (Ica)2j;,(na)hn( )(Ica)Xnm Xnm —' 2 " hi.” Unm 1C1 = 1(5a)1n(na)’1 (.5) (530) Um 1(na)27n(na)hn( ltna>xnm Similar to 2D case, the spectrum of tEFIE operator is widespread, indicating an unbounded Operator and the spectrum of tMFIE Operator is bounded and accumulate about (0.5 + 30.0). Here again, the interior resonance frequencies, corresponding to zero eigenvalues, are not same for tEFIE and tMFIE operators. The above eigenvalue analysis offers more insight than just understanding the interior resonance problem and analytic nature of the IE operators. Since practical problems cannot be approximated as above canonical problems, discretized version of these IE operators are employed. In such cases, the above eigenvalue analysis can be extended as follows: assuming a uniform discretization of the geometry, increasingly fine discretization size corresponds to better representation of higher order eigenfunc- tions. Thus, employing a dense discretization with tEFIE Operator leads to larger spread of eigenvalues and hence results in a badly-conditioned system of equations. The same geometry, when considered with tMFIE operator would lead to a well con- ditioned system of equation as their eigenvalues are bounded at all frequencies, except at interior resonance. These insights are used in develOpment and investigation of the modified AEFIE formulation presented in the next section. 5.3 Well-conditioned Formulation for AEFIE In this section, the AEF IE (5.12) is posed in a manner such that it results in a well conditioned system of equations; that is amenable for use with conventional iterative 105 solvers. The particular case of AEFIE was chosen so that the formulation, in future, can be extended to Open geometries as well. Similar to the discussion in previous section, an eigenvalue analysis of the proposed formulation is presented in detail for the 2D problem. Extension to 3D is not trivial and requires careful consideration. As mentioned before, if N basis function are used to represent the current, AEFIE requires the solution to satisfy 2N contraint equations of tEFIE and nEFIE. In this work, electric charges are also considered as unknowns so that there are 2N unknowns to be solved with 2N equations. Charge unknowns have been previously employed in MOM formulations to overcome the low-frequency breakdown of EFIE. Here we consider this choice specifically to reformulate AEFIE as A3 A], J 11in = — , (5.31) A} A]; p fi-E’ At, = fix / dr'g(r,r’)J(r’) (5.32) S A3 = a. / dr'g(r,r')J(r’) (5.33) S A; = -fix deHVg(r,r’)p3(r’) (5.34) A2 = —£—fi-/dr'Vg(r,r’)ps(r’) (5.35) 60 3 Another motivation for employing charge unknowns is that the scalar basis function used to represent charges forms a suitable set of testing function for nEFIE. In nu- merical implementation, the vector basis function for currents and scalar charge basis function are used to test the tangential and normal electric field boundary conditions, respectively. However, considering both currents and charges as unknowns demands development of methodologies to impose the continuity and total-charge conservation 106 60nd; Here fune' molt liere thee: This the r] Thus fine appr to ir lead is tl Thu knot tiOr) el‘Te: SfStt the i conditions, V-J ’ijs (5.36) 2,0,, = 0 (5.37) Here, the continuity condition is imposed as an external constraint using the penalty function method. In this approach, the discretized form of the differential equation is multiplied by a scalar factor a and added to the tangential field boundary condition. Here, a is a predetermined constant chosen to be as large as possible, as per the theory of penalty functions, but within the range of available numerical precision. This causes a obvious numerical imbalance between the two equations of (5.31), hence the normal field boundary condition is also scaled by a to ameliorate this disparity. Thus, the AEFIE formulation satisfying the continuity condition is given as, A3 + (IV A}, + ozjw J¢ = _ 1‘1 x Ez (5.38) 01.743 01.743 p3 an - Ez Finally, the charge conservation can be ensured either through a penalty function approach or through the deflation procedure. Since penalty function is already used to impose the continuity condition, employing it to impose charge conservation can lead to numerical overflow. Further, note that the constant current and charge vector is the only non-trivial element in the null space of the AEFIE operator in (5.31). This observation favors the use of deflation technique as it requires an approximate knowledge of the null space. Deflation is a well known procedure used in the solu- tion of badly conditioned numerical systems. Consider an arbitrary matrix M with eigenvectors {en} and corresponding eigenvalues {An}. M is a badly conditioned system of equation if one of the eigenvalues , say A0, is very small. However, with the knowledge of corresponding eigenvector, one can consider a modified system of 107 equations M’ that is a rank one update of the original system M’ = M + (q — 10150453" (5.39) where q is the average eigenvalue of the original system. Notice that q becomes the new eigenvalue of eigenvector e0 and all other eigenvalues are unchanged. M'eo = (M+ (q-A0)eoeo)eo (540) = qeo (5.41) M'en = MennyéO (5.42) = Anen (5.43) Thus the defective eigenvalue is deflated from the original system using the rank one update. Multiple deflations or rank one updates can be used to improve the condition number of an arbitrary system with more than one defective eigenvalues. However, the deflation procedure requires the knowledge of eigenvectors corresponding to these defective eigenvalues. Also, additional evaluations need to be performed on solution of modified system to remove the effects of deflation and obtain the correct solution. Since the null space of the AEFIE operator (5.31) is known, the deflation technique is employed to impose the charge conservation condition. At the limiting case of w —» 0, the above AEFIE equations reduces to solving the currents and charges using the normal electric field boundary condition while imposing the zero divergence constraint on currents. It is well known that nEFIE is equivalent to tMFIE and hence is well conditioned at low-frequencies also. Also, the zero divergence is the required and physically correct behaviour of currents at low-frequencies. Hence the proposed AEFIE formulation does not suffer from the low-frequency breakdown and is expected to be well conditioned across the frequency range. 108 5.3.1 Eigenvalue analysis in 2D An immediate observation, looking at (5.31), is that none of the AEF IE operators contain double derivatives, which is the reason for hypersingularity that leads to unbounded operators. The following theoretical analysis is performed to investigate the analytic nature of each of the AEFIE Operators. Consider the evaluation of electric field, using the above operators, produced by the current and charge sources residing on the surface of a circular cylinder of radius a. As mentioned in previous section, {die-j 11¢} form a complete set of eigenfunctions for vector surface currents J ,5. Similarly, {ejnib} forms a complete set of eigenfunctions for the scalar surface charges p3. Thus the mixed set {(iejnd’, ejnib} is a complete set of basis functions for the AEFIE operator (5.31). It is a fairly straight-forward exercise to obtain the following result, 5in _ z Z “cine _ . = wéma 11 12 45. (5.44) mi - Ez 221 222 e3"¢ 1 2 2 '71. Zn = 594418421415.htm)—Jn-1(8a>H.‘._’1(m>>—a’; 2d, Ir — r0| _<_ d, |r’ — r3| S d, k = nk, r3 (r0) is the center of multipole (local) expansion M (L) for source (observation) cluster, T is the transla- tion Operator, him and P1 denotes an order l spherical Hankel function of second kind and Legendre polynomial, respectively. The translation Operator contains a spherical Hankel function which is singular at the origin. Thus, for numerical stability, neither the translation distance Ix] nor the wavenumber K. can be arbitrarily small [73, 76]. Hence, the classical FMM is ineflicient or numerically unstable when applied to sub- wavelength problems, where the domain is discretizated at a rate higher than /\/ 10 to capture the geometric details. These limitations have been reported in detail in [74] and several alternatives have been proposed to overcome them [46, 49, 50, 174]. In this work, the stable accelerated Cartesian expansion (ACE) [198] described in Chapters 2 and 4 is used for the sub-wavelength problems. Following section briefly presents the ACE algorithm and its integration with FMM to create a Wideband FMM. 126 by . 0p,. W}. ha: 6.2.2 Accelerated Cartesian Expansion (ACE) ACE is a hierarchical tree computation algorithm that employs the generalized Tay- lor’s expansion to derive alternate representation Of the Green’s function. The con- struction of ACE algorithm is similar to FMM in that it uses the oct-tree for geometry processing and derives equivalent operators for tree computation. In contrast, ACE employs Cartesian harmonics as multipole and local expansions. Use of Taylor’s series expansion for fast computation has been explored previously albeit with severe limi- tation on accuracy and performance. ACE provides a generic framework by adopting a tensorial formulation to exploit the full power of Taylor’s series expansion for fast computation. In rest Of the paper, M(”) denotes a tensor Of rank n, the polyadz'c associated with r = {725, ry, rz} is given by r(") = {r21r32r2’3} where n = 23:1 n,- and n,- > 0, an m fold contraction between two tensors A(”) and B(m) is denoted by A(”) . m . B(m) = C("-m) when n > m; for more details on these definitions and operations see [182, 13]. The ACE expansions for computing the potential in (6.7) can be written as Mr) = anM(n)(r3).n.T(")(r3,r) (6.9) n=0 = an(r0—r)(").n.L(n)(ro) (6.10) n=0 K w- M(n)(r3) = Z(-1)"#(ri—rs)n (6.11) i=1 ' (n) ne_jnlr0—r3[ T (r3,ro) = V W (6.12) L(”)(ro) = Z $T(")(r3,ro).(m—n).M(m_") (6.13) where M(") and L(") denote the n—th order multipole and local expansion Cartesian harmonics and T(") is the ACE translation Operator. ACE is an almost kernel in- 127 Wllf W]; I fill. SlZl’ dependent method as (a) all quantities of the ACE algorithm, except the translation Operator, are independent of the form of the kernel and (b) the ACE expansions are rapidly converging for any non-oscillatory function [13, 61, 174, 214]. Readers inter- ested in the details and other salient features of the ACE algorithm are referred to [13]. In the case of the Helmholtz potential, the ACE translator Operator is given as [174], (,K,, 19211 11321 12331 T(")(r3,ro)=V" R (”1,712,710 2 Z Z(_)1)n+mR2m— —-2n 1 m1 =0 m2=0 m3=0 n1 n2 113 (6-14) x m1 m2 m3 1;"1 —2m1yn2-2mzzn3—2m3g(n _ m, KR) where 902. ..R) = x/Q/WUKR)("+0'5’Kn+0.5(jflR) n n! = 2mm!(n — 2m)! m where, n = n1 + n2 + 713, m 2 m1 + mg + m3, Kn(-) represents the modified Hankel function of order n, R = \/:I:2 + y2 + z2 and [J is the floor Operation. It is well known that above Taylor’s series expansion is convergent when either the domain size or frequency is small or both. Further, it has been rigorously shown that as frequency tends to zero the ACE expansion seamlessly transitions to Laplace FMM [198]. 128 6.2.3 Hybrid algorithm for multiscale problems Multiscale problems, by definition, contains a mixture of sub-wavelength and large- wavelength problems. From above discussions it can be seen that the ACE algorithm for evaluation of Helmholtz potential is stable and efficient for sub-wavelength prob- lems; while, FMM algorithm is eflicient and Optimal for large-wavelength problems. Thus, individually neither of the two algorithm is efficient for multiscale problems. A hybrid approach, where both ACE and FMM expansions are used in an Optimal and seamless fashion, is required to achieve full efliciency with multiscale geometries. This implies that one needs to develop transition Operators to switch from ACE to FMM and vice versa [198]. These maps are given by 00 M(r.,k) = ZM

(rA).p.T$£2.p(k,r.-r.4) (6.15) p=0 Loan”) 2 i‘jr—n/dszlggp(k,ro—rf1)£(k,ro) (6.16) where r A denotes the center Of ACE multipole expansion Ml?) and the mapping Operator Tmap. The derivation of Tmap and the proof Of convergence can be found in [198]. The overall Wideband FMM algorithm proceeds as shown in Algorithm 2. The computational geometry is represented using a compressed oct-tree [51, 61]. This is constructed by first embedding a cube enclosing the computational domain and recursively sub-dividing the large parent cubes into eight smaller, non-overlapping children cubes The boxes at the lowest level of the tree, beyond which no sub-division occurs, are referred tO as the leaf boxes. In rest Of the paper, box and nodes are used inter-changeably. Interaction list is constructed for all nodes and nearfield list is constructed for all leaf nodes only [9, 61]. Next, the tree nodes are classified as ACE or FMM, based on the side length of the domain they represent. All nodes 129 representing domains less than a certain predetermined size, typically A / 4 or /\ / 2, are classified as ACE nodes and rest Of the nodes are labeled as FMM. The ACE to FMM multipole transition Operator in (6.15) is used in step 4 and FMM to ACE local expansion transition Operator in (6.16) is used in step 6 of the Algorithm 2 [198]. Algorithm 2 Wideband Multilevel Fast Multipole Algorithm 1: Construct the tree representation for the given geometry (distribution of discrete points). 2: Build interaction list for all tree nodes and the near-field list for leaf nodes only. 3: 82M: compute multipole expansions at each leaf node from sources contained within it. 4: M2M (upward traversal): compute the parent node multipole by combining the multipole expansions at their children node. 5: M2L (translation): for all nodes in the tree convert the multipole expansions to local expansions of the nodes in their interaction list. 6: L2L (downward traversal): update the local expansion information at a child node using the local expansion of their parent node. 7: L20: use the local expansions about each leaf node to compute the farfield po- tential at its Observation points. 8: NF: use direct method for computation Of nearfield potential at observation points in each leaf node from sources contained in its near-field nodes. 6.3 Parallel Algorithm for FMM This section presents the details of the parallel implementation of the ACE (PACE), FMM (PFMM) and the Wideband FMM (PACEFMM) algorithm outlined in the previous section. First, a scheme for constructing and partitioning the oct-tree data structure in parallel environment is presented. This is followed by the details on parallel implementation of the individual tree computation steps in Algorithm 2. As mentioned in the introduction, the emphasis of this work is on reducing the latency among processors to ensure the scalability of the algorithm. 130 6.3.1 Parallel Construction of the Oct-tree Although the construction of oct-tree is a one time effort and takes up a negligible fraction of the overall parallel run-time, it is important because (a) tree partitioning among the processors directly affects the load balancing of the rest of the algorithm and (b) creation of various interaction lists at this stage are communication-intensive. In our implementation, the tree is stored in postorder traversal order. It will be shown that the resulting ordering of nodes enables load balanced computation Of various tree Operations, obviating the need for explicit load balancing. Let N denote the total number Of points (sources and observers) distributed within a cubical domain of side length D and P be the number of processors. The average number of points per processor is denoted by n = N / P. Given the smallest side length do associated with leaf boxes, the total number Of levels or height of the tree is H = log2 (D / d0). Integer coding scheme [204] is used to uniquely represent a node in the tree. This has several advantages as (a) the keys encode a wealth of information such as the center position Of the box represented by the node, level of the node, its entire ancestral lineage etc., and (b) the sorted keys conform to Morton ordering [215]. Morton ordering of the sorted leaf nodes distributed across processors results in a self- similar structure in each processor [199, 216] as shown in figure 6.2. Self-similarity is critical to parallel processing as it ensures that each processor has an identical number of tree-Operation. This leads to an implicitly load balanced scheme. The full post-order tree is constructed in a recursive fashion, in each processor, by generating the parent nodes from children nodes. Next, some comments on the distribution of tree-nodes among the processors are provided. Each processor contains only a part of the tree with nodes at every level as shown in figure 6.3. It is evident that some nodes can occur in multiple processors. When considering the global postorder traversal tree, across processors, each such node is 131 associated with a processor where its occurrence is appr0priate (the processor which has the rightmost leaf box in the subtree of the node). This processor is referred to as the native processor for that node and every tree-node has an unique native processor. All other occurrences of the node are termed duplicate nodes and the following Lemma 6.3.1 provides a bound on number of such nodes. Lemma 6.3.1. The number of duplicate nodes in each processor is bounded by the height of the tree, and will appear sequentially at the end of the local postorder traversal tree. Proof. Let H denote the height of the tree. A processor can have at most one duplicate nodes per level in the tree. The rationale for this statement is as follows: assume that a processor has at least two duplicate nodes at the same level in the tree. Let v1 and v2 be two such nodes, with v2 occurring to the right of v1 in the tree. A processor has a node in its local tree only if at least one of the leaf boxes in the subtree under the node falls in the same processor. Also, all the leaf boxes in a processor are consecutive in Morton ordering. Taken together, these two Observations imply that ' the rightmost leaf box under v1 must reside in the same processor. Thus, v1 is native to this processor and cannot be a duplicated node. This argument demonstrates that a processor can have at most one duplicate node per level, shared with the next processor. Similarly, one can show that the number of multiply occurring nodes that are native to a processor are limited to one per level. The proof that the duplicate nodes will appear sequentially at the end Of local postorder traversal tree follows from the fact that the postorder sequencing always places nodes before their parents. The parent of a duplicate node is also a duplicate node in the same processor. Hence all duplicate nodes in a processor appear in sequence at the end Of the local postorder traversal tree. El Next section provides details on the parallel implementation of each step of the 132 tree computations shown in Algorithm 2. In rest of the Chapter, given any two nodes A and B, A is said to be less than B if node A appears earlier than node B in the postorder traversal sequence. This notion Of comparison between tree nodes simplifies implementation of several Of the processes detailed below. 6.3.2 Distribution Of ACE and FMM harmonic data The above tree partitioning scheme ensures that the nodes are unifome partitioned among the processors. If the size of data associated with each node is same, then data is also uniformly distributed amongst the processors. This is true only for the ACE portion of the tree; in FMM, the number Of expansions depends on the level Of the node. Thus the total FMM expansions data contained in the native processors is considerably high, leading to severe load imbalance during tree computations. Though this load imbalance is bounded, due to Lemma 1, it undermines the scalability of the algorithm. A possible remedy is to redistribute the tree nodes such that the FMM expansions data is uniformly distributed across the processors. This results in the following unfavorable scenarios. First, the Optimal redistribution of nodes may be such that some processors, especially the native processors, contain only higher level nodes. This induces latency during upward and downward tree traversals, steps 5 and 6 in Algorithm 2. Second the redistributed tree is not self-similar across processors and affects the parallel efficiency during tree computations. As will be evident, this results in highly scalable parallel scheme. An alternative approach developed in this is paper is the adaptive direction parti- tion. To achieve uniform distribution Of FMM expansion data, direction partitioning [37] is employed for duplicate nodes only. The FMM expansions data of a duplicate node is partitioned such that each copy contains an equal and distinct portion. This scheme and the self-similar distribution of the tree automatically ensures that each processor has an equal quantity Of FMM harmonics data. The nodes to be parti- 133 tioned and the number Of partitions are decided automatically, resulting in implicit load balancing. This approach bears some similarity to the recently introduced hi- erarchical partitioning approach, where the multipole data of all nodes except the leafs are partitioned [211, 212]. This imposes a strict relation between N and P for scalability. In contrast, the adaptive direction partition scheme is flexible and differs in the following manner: (a) it combines the spatial and direction partitioning in a seamless manner; (b) direction partitioning is used only when its Optimal; and (c) it provides a means of preserving the self similarity of tree computations. 6.3.3 Construction of Interaction Lists Tree computation requires the construction of interaction and nearfield lists. Inter- action lists are built for all the nodes in the local tree except duplicate nodes. This operation is split into serial and parallel portions. In the serial portion, the interac- tion list of each node is built assuming that the full tree is constructed [199]. Given a node’s key code, straightforward bit manipulation yields its parent node, the parent’s neighbor nodes and their children. This information is used tO construct the interac- tion list of each local node. Due to locality there is no communication cost associated with this Operation. In the parallel portion, the non-existent nodes are eliminated using one time communication. At this stage, different communication maps are contructed for information exchange during tree traversal. The entire process is effi- ciently implemented with the use Of a binary tree search algorithm to identify nodes in postorder traversal. A similar procedure is used to construct the nearfield list Of local leaf nodes. 134 6.3.4 Multipole and local expansion computation In each processor, the multipole expansions are computed at every node in the local postorder traversal tree. The postorder traversal order ensures that a parent node appears only after its children nodes (in case Of duplicate nodes, all children that reside in the same processor). Thus, when a parent node occurs the necessary children multipoles are already computed. Multipole expansiond are computed for all the local nodes, including the duplicate nodes. Note that the multipole expansions at the duplicate nodes are only partially filled as they account for sources in that processor only. Thus, after the local computation, all processors with duplicate nodes send their multipole expansions to the appr0priate native processors of the duplicate nodes they host. The native processor of a node simply adds the received multipole expansion data to the appropriate local node. This algorithm is a one step update process with the following bound on communication overhead. Lemma 6.3.2. Total number of nodes received by a processor during multipole com- putation is bounded by (P — 1)H. Proof. This follows from the fact that the number Of duplicate nodes in a processor is bounded by H (see Lemma 6.3.1). Since only the duplicate nodes are exchanged during multipole computations, the maximum number of nodes received by any processor will be no more than (P — 1)H. D The computation of local expansion is a reverse analogue of multipole computa- tion. In the downward tree traversal, the child node local expansions are updated with the local expansion of their parent node. First, the processors with the duplicate nodes obtain their local expansion from its native processor. Then, the downward tree traversal is performed locally in each processor by traversing the local postorder tree from right to left. 135 Cost analysis: Each processor has at least one node from every level of the tree and their multipole expansions are computed in every processor by traversing the local postorder traversal tree. Thus, this part of the process is load balanced if every processor has the same number of leaf nodes. This is true even in the case of FMM where the number of multipole harmonics increases as the level increases. Since the number of duplicate nodes per processor is bounded, the communication overhead involved in exchange Of their multipole information is also bounded. Hence the overall process is load balanced. 6.3.5 Translation Operation At each node in the global postorder traversal tree, local expansions are computed using multipole expansions of the nodes in its interaction list. This process is divided into a parallel and serial portion. In the initial parallel portion, multipole information is exchanged between processors. While building the interaction lists for each node in the local postorder traversal tree, the set Of processors that require their multipole expansions are identified. This list of local nodes and processors is sorted accord- ing to the processor-ID. At every processor, the requisite information is send to the appropriate processors by traversing through this list. In implementation, this data is exchanged in blocks whose size is defined by the user. This serves two purposes (a) the number of communication calls can be greatly reduced when compared to a scheme where the multipole data is exchanged one node at a time, and (b) the block size can be adjusted according to the communication architecture Of the distributed environment to ensure an optimum performance. Once the required multipole expan- sion data is received, the actual translation operation is performed in a serial manner to compute the local expansion of nodes in the local tree. In case Of FMM transla- tions, the duplicate nodes exchange and compute only part Of their FMM harmonics data in accordance with the adaptive direction partitioning strategy. This is possi- 136 ble due to the diagonal translation operation of the Helmholtz FMM algorithm. In actual implementation the serial and parallel parts are performed in an intertwined fashion such that the translation Operation is performed as and when the data is received. Further, asynchronous communications can be used to minimize the wait time between the parallel and serial portion. Cost analysis: The translation Operation is reciprocal. Thus, if two interacting nodes are in different processors, then both processors need to exchange same amount Of information. This process would be load balanced if all processors receive and process the same amount Of multipole data. In case of ACE computations, where the size of harmonics data is constant for all nodes, an uniform partitioning of tree nodes automatically ensures that the data is also uniformly partitioned. The same argument is not true in the case of FMM computation where the number of harmonics is a function of the level of the node. However, the use of adaptive direction partition ensures that the data is distributed uniformly across the processors. Hence, this part of the algorithm is also load balanced. 6.3.6 Evaluation of Potential The farfield potential at the Observation points are evaluated from the leaf node local expansion they reside in. However, the evaluation Of the potential is completed only after accounting for the nearfield interactions. These are interactions only among leaf boxes, as specified by the nearfield list. Similar to translation operation, for each leaf node a list of processors that require its information is created and then sorted by processor-ID. At every processor this list is used to communicate the leaf box infor- mation, in blocks, to appropriate processors. The nearfield potential is computed in a serial manner from the received data. This completes the evaluation of potential at every point across all processors. As in the case of translation Operation, the commu- nication and computation parts are intertwined and asynchronous communication is 137 used to minimize wait time. 6.3.7 Parallel Electromagnetic (EM) Solver Next, a brief description is provided on the use Of the above described parallel ACE and FMM (PACEFMM) algorithm within the framework of EM solvers. When the discretization size is in the orders of millions, the geometry processing to create basis functions, etc. becomes computationally intensive. Though this is a one time prO- cess, an efficient parallel implementation is necessary to justify the algorithmic gains Obtained with the PACEFMM algorithm to reduce the overall solution time. We assume that the input to the parallel EM solver is a simple mesh file with a list Of :0, y, z position of each node and the element-to-nOde connectivity table. Each of the P processors reads an equal share of the Na nodes and Ne elements from the input mesh file. Using the local element-tenode connectivity, a list of edges is created in each Of the P processors. Each edge is represented by the two global node numbers that make the edge and this two element integer array is sorted in parallel. Thus every processor has approximately equal number of edges. The global node numbers allows one to gather the {:0, y, 2} data of the nodes as they are sequentially distributed across the processors. This allows us to compute the centers of each edge which is then used to construct the oct-tree. Based on the distribution Of leaf boxes the edge data are exchanged among the processors and the necessary {33, y, 2} data of related nodes are gathered from their global node number. 6.4 Results In this Section, we present plethora of results that exhibit scalability Of the parallel algorithm presented here. All the results were obtained on a IBM Blue Gene / L cluster with 1024 processors and 512 MB RAM per processor. The message passing interface 138 (MPI) was used for communication between the processors. In our implementation of the classical FMM, the spherical harmonic filters [23] are used for interpolation and anterpolation during upward and downward tree traversal, respectively. With 512 MB RAM per processor, precomputation of these filter coefficients places memory constraints and restricts the maximum number Of FMM levels to 10 or the overall domain size to 128 A. First set of results correspond to evaluation of kernel only which helps to study the various aspects of the parallel algorithm. This is followed by the use Of these algorithm for solving electromagnetic scattering problems. In all cases, the timings are reported in seconds and the parallel efliciency of the algorithm P6” is computed using (6.17) where Tm and Nm respectively denote the average time taken for evaluation of potential a processor and number Of processors in the m-th processor set, m E {32, 64, 128, 256,512, 1024} and ref is the smallest size processor set for a given N. 6.4.1 Kernel Evaluation The PACE and PFMM algorithms were separately employed to evaluate the scalar potential (6.7) at N random, uniformly distributed source / Observer pairs within a volume and surface. For volume distribution, we fill a cube of sidelength a and in case of surface distribution the points were placed on a sphere of radius a. In evaluating the PACE algorithm the overall size Of the domain was fixed at /\ and the leaf box size was the chosen such that the average number of sources / Observers per leaf box is approximately 60. The order of ACE harmonic p = 3 was chosen so as to evaluate the potential to an accuracy of 0(10-3) [198]. The number of points N was varied from 1 million to 80 million and in each case the number Of processors was varied from 32 to 512. Table 6.1 shows, as a representative sample, 139 the time spent at different stages Of the tree computation for the case of N = 40 million on diflerent processor sets. It is evident that the time taken for parallel part of multipoletO-multipole (M2M) and local-to—local (L2L) evaluation are negligible. This is in accordance with the theoretical reasoning presented in Section 6.3.4, where Lemma 6.3.2 shows that the number of communications at this stage is bounded and small. Notice that the timing data for rest of the process is proportional to the number Of processors. This is a direct consequence of the self-similar tree partition algorithm that ensures load balanced tree computation. The parallel efficiency of PACE algorithm for different cases is shown in the figure 6.4. In all cases ref was chosen to be 32 except for N = 80 million where ref =64 was used. The presented algorithm exhibits efficiency as high as 98% on 512 processors. Figure 6.6 shows the time spent by individual processors of a 128 processor set for N =40 million at different steps of tree computation. This exhibits the excellent load balance Of the prescribed algorithm as all the processors spend almost the same amount at every step Of the tree computation. Next, in figure 6.5, we plot Tm as function of N, for each processor set, to measure the cost complexity of our parallel implementation. The slope of the linear line fits are close to unity which indicates the 0(N) scaling of our parallel implementation. Figure 6.7 plots the efficiency of our PACE algorithm for the case of spherical distribution. Here again the efliciency is as high as 96% on 512 processors. This indicates the scalability of our parallel algorithm on large number Of processors. In evaluating the PACEFMM algorithm, the sidelength of the leaf box was fixed at A / 4 and overall size Of the domain was chosen such that the average number of points per leaf box was approximately 60. This choice implies that the number of ACE levels is zero. Table 6.2 shows, as a representative sample, the average time taken by one processor at different stages of hierarchical tree computation for N = 10 million on different processor sets. As in the case Of PACE algorithm, the parallel 140 part of upward and downward tree traversal time are negligible and this is attributed to the bounded number Of communications. Figure 6.8 shows the time taken by the individual processors in a 128 processor set during translation with and with- out adaptive direction partition strategy. Without the adaptive direction partition, the native processors that host the duplicate nodes spend more time in communica- tion and computation than others. The load balance among the processors improves with adaptive direction partition strategy and helps to improve the scalability of the PFMM algorithm. The efiiciency in case Of volume distribution of points is shown in figure 6.10. The PFMM algorithm Offers efficiency as high as 96% on 512 processors. Figure 6.9 shows the time taken by individual processors of a 128 processor set at different stages of tree computation. The negligible variations in time taken by differ- ent processors indicate the excellent load balance of the algorithm; which stems from the self-similar partitioning of tree nodes. In figure 6.11, the average time taken for potential evaluation Tm is plotted as a function of N for different processor sets m. The slope of the linear line fits are close to unity which indicates the linear complexity Of the proposed PFMM algorithm. The efficiency of the PFMM algorithm for surface distribution of points is in shown in figure 6.12. The algorithm Offers high efficiency on 512 processors even for relatively small number of points N = 10 million. 6.4.2 EM Simulations The CFIE formulation is used to solve for electromagnetic scattering from closed PEC objects. The numerical system is solved by a parallel GMRES solver with diagonal preconditioner. First, the parallel solver is validated by computing the plane wave scattering from a PEC sphere and comparing the RCS with analytical results from Mie series. Figure 6.13 shows the comparison of RCS from a PEC sphere Of radius 64A computed using the parallel solver and Mie series. As is evident there is a excellent agreement between 141 Table 6.1: Average time spent by an individual processor at different stages of hier- archical tree computation for N =40 million. Proc. Local- Parallel- Trans- Parallel- Local- Direct Multipole Multipole lation Local-exp Local-exp 32 1.97 0.00 241.35 0.01 43.50 1307.55 64 0.99 0.00 121.31 0.00 21.75 660.91 128 0.49 0.01 60.76 0.01 10.88 332.82 256 0.25 0.01 30.72 0.01 5.43 167.19 512 0.12 0.01 15.62 0.01 2.72 83.75 the two solutions and validates our parallel implementation. Figure 6.14 shows the comparison of RCS of a PEC sphere of radius 128A, discretized with 14 million un- knowns and both the solutions exhibit excellent agreement. To compute the efficiency of the parallel solver, scattering from PEC spheres of different radius (and different number of unknowns) was considered on different processors sets. Consider spheres Of radius {16, 32,64} with {500,1500, 3240} thousand unknowns. These simulations were performed on 64, 128, 256 and 512 processors. When computing the efficiency using (6.17 ), the time denotes the solution time averaged across the processors and 64 processor set as reference, ref =64. As shown in figure 6.15, the parallel solver exhibits efliciency as high as 90% on 512 processors with 3.24 million unknowns. The parallel solver was applied to two realistic geometries. The first geometry is a PEC toy-aircraft with fine edges that is densely discretized with 1.75 million unknowns. At 3GHz the principal dimension of the geometry was 64): and the figure 6.16a shows the induced surface currents and the computed RCS. The total simulation time was less than 4 hours when executed on a 256 processor cluster. The second geometry is 3 PEG sharp arrow discretized with 3.24 million unknowns. The principal dimension was 64): long at 3 GHz and the induced surface currents and RCS are shown in figure 6.16b. The simulation time was less than 2 hours on a 256 processor cluster. 142 Table 6.2: Average time spent by an individual processor at different stages of hi- erarchical tree computation for N =20 million points uniformly distribution within a cube of sidelength 20A. Proc. Local- Parallel- Trans- Parallel- Local- Direct Multipole Multipole lation Local-exp Local-exp 32 249.98 0.04 39.01 0.02 130.99 2488.38 64 125.31 0.07 16.94 0.03 66.17 1250.01 128 62.94 0.13 9.85 0.06 33.71 626.89 256 31.77 0.28 5.09 0.11 17.52 314.1 512 16.17 0.59 3.23 0.24 9.39 157.22 5E iii 310 51:1 ”f” *1) ill 9' M""'f§l3§iii(r, t) = f(r,t) for r x t e 0 (A.1) (r,0) = u0(r) for r x 0 E Q where (r, t) is the dissipative wave potential at r and time t, 9 is a domain of finite volume, u0(r) is the initial condition, a and c 6 IR are problem dependent constants, and 6t denotes a temporal derivative. In the limit c —> 00, we recover the diffusion equation. This corresponds to the limit of an infinite velocity of propagation for solutions, i.e. changes in the spatial profile Of a solution can influence the behavior of the solution at all points in space after an infinitesimal period of time. (r, t) = 1(r, t) + 2(r, t) (A.2a) 510-, t) = [a G(r — r’, t)u0(r’)dr’ (A.2b) t 2(r, t) = f(r’, t')G(r - r',t — t')dr’dt' (A.2c) ( l where G (r, t) is the Green’s function for the diffusion equations given by, _ a -02||r||2 G(r, t) — (47rt)3/2 exp (T) O(t) (A.3) where 9(t) is the Heaviside distribution. The literature on the application of these Green’s function to integral equation based solvers is extensive [233]. However, it is apparent that the computational cost is the principal bottleneck to the adOp- 160 tion of these methods, despite their several advantages [230, 232]. Indeed, it can be shown from a straightforward discretization of these integrals that the cost scales as 0(N3Nt + N 3N3). In what follows, we will prescribe methods to tackle the latter integral. As will be evident, evaluating the former will be a trivial use of the methods presented here. Assume that the temporal sample size is uniform, and is denoted by At. Without loss of generality, the discrete version Of (A.2c) can be written as lt/AtJ Ns 52(r, t) = Z Z G(r — r;,t — zAt)f,(r;, 1A.) (A.4) l=0 i=1 where [o] denotes the floor Operation, r; and f,- are the position and strength asso- ciated with the it” spatial source respectively. Testing the discretized form at N. spatial points at time t = tk results in the following matrix equation, k—l Pk = Z Zk-zSz (A-5) l=0 where the vectors are given as Pk = {¢2(r1:kAt):¢2(r2:kAt)a°'°:‘D2(rNsakAt)} Sl = {f(r11 LAt)1 f(r21LAt)7 ' ° ' a f(rNsalAt)} and Z1 is N. x N3 matrix whose elements are Zl(i, j) = G(r,- — rj,lAt). In what follows, it is assumed that S) is known only at t = IA); and not before. Other IE formulations [230, 232, 229] can be reduced to the matrix equation of the form (B.20). As done in [234], this equation can be cast as a space time matrix equation and it ' is readily apparent that cost for evaluating this system scales as 0(N3Nt2). In the next section fast methods are developed to reduce this cost to either 0(N3Nt log M) 161 0r 0(N3Nt). A.3 Acceleration Schemes In this section we introduce the spatial and temporal acceleration schemes that form the crux of this paper. First we employ the ACE algorithm to rapidly evaluate po- tentials at a particular Observation time due to spatially distributed sources excited at a single time step. This corresponds to the inner summation in (A.4) and the evaluation of one matrix vector product in (B.20). Given the ACE expansions for spatial acceleration, an FFT based scheme is proposed for rapid evaluation of sum- mation over temporal basis functions corresponding to the outer summation in (A.4). The temporal acceleration schemes presented here primarily exploit the fact that the harmonic expansion Of ACE preserves the temporal convolution in (A.4). Then the Toeplitz structure among ACE expansions is identified, for which the standard FFT scheme is employed to reduce the complexity of temporal convolution from 0 (N132 ) to 0(Nt log Nt). Further, the proposed schemes are cast in a block Toeplitz fashion to conform with the marching on in time (MOT) framework of existing integral equation solvers. A.3.1 ACE for Spatial Acceleration As mentioned in the previous section, each summation term in (B20) corresponds to a fixed source and observation time which is denoted in the subscript. The individual matrix vector product Zk_lS( corresponds to the evaluation of inner sum in (A.4). 162 This evaluation involves only spatial contributions and can be written as, V1: 1 = Zk—lsl (A-G) N. vgl, = me...) = Z G(rm — rn, (k — l)At)fn(rn, 1A,) (A.7) n=1 where Vkl is an intermediate vector of dimension N 3 introduced for the convenience of the following discussion. It is evident that cost of computing the entire vector V“ scales as 0(N3). In this work, the accelerated Cartesian expansion (ACE) algorithm is adopted to accelerate computation Of vector V“. Rapid evaluation of ij through the ACE algorithm requires the definition of V"G (R, t) in evaluating the multipole to local expansion using Theorem 2.3.5. It is evaluated using a recursive expression as mm) = 651652653211...immense») (As) afoul-1:) 24—L:Ia{°-10(R,,t) — 2(k—1)8[°‘2G(Rl,t) le{a:,y,z} 3 where p,- e {0, 1, ..p} and 2 p,- = p. i=1 The ACE procedure for Gaussian kernels bears similarity to Fast Gauss Trans- form (FGT) [226].The use Of generalized Taylor expansion and Cartesian tensors for Gaussian kernels is essentially a reformulation of FGT, where Hermite polynomials and expansions are employed. The cost and storage savings achieved with the use of totally symmetric tensors is identical to that of the graded lexicographic order representation in improved FGT[228]. An additional advantage with ACE algorithm, as detailed in Chapter 3, is the use of exact translation operators for multipoleto— multipole and local-tO-local expansions. 163 A.3.2 FFT based Temporal Acceleration The above description details the use Of ACE for the rapid evaluation of a potential at an arbitrary Observation time due to spatially distributed sources all excited at a particular time instant. This corresponds to the evaluation Of one Of the sum terms in (B20), hence only a part of total potential at time step (kAt) is computed. As men- tioned in the introduction, diffusive, lossy wave and Klein-Gordon potentials exhibit an infinite temporal tail (long history) and the computation of the potential at the kth time step would involve k such partial potential evaluations. Consequently, the cost of a scheme with only spatial acceleration scales as 0(N3Nt2). This complexity is undesirable when the number Of time steps in the simulation is large, as would be expected for any time domain simulation Of merit. A FFT based temporal accelera- tion scheme is deverlOped here to ameliorate this cost. This scheme are formulated in a manner such that causality is not violated i.e. evaluation of a potential at time step (kAt) assumes the knowledge of sources at time steps (mAt), m < It only. Thus the proposed acceleration schemes are in conformance with the existing solver framework and can be readily integrated. Consider the convolution in (B.20), this can be written as one matrix vector product as illustrated in Fig. A.1, which illustrates the evaluation Of the vector Pk at all time instants k = 1, . . . , Nt. Note that each term in the figure are themselves either a matrix or vector quantity and depend on the spatial discretization. It is evident, from Fig. A.1, that the matrices Z;_j form a Toeplitz matrix and beckons the use of fast Fourier transform to perform the matrix vector product in 0(N82Nt log Nt) cost. However causality allows one the knowledge of past sources only. In other words, one cannot assume the knowledge of sources 8),, Sk+la . . . , S Nt when evaluating the potential at time step It. To overcome this, the matrix is divided into sub-matrices Tk as indicated in Fig. A.1 . The evaluation of the potential Pk involves different 164 submatrices Tl that are multiplied by past sources only at different time instances [234, 235]. In this scheme, computation at the kth time step involves 10 past time source vectors { S k— N: . . . , Sk_1} multiplied with the appropriate Toeplitz sub-matrix T N to produce potential Pk at the kth time step and partially evaluated potentials {Pk+1,.. ~1Pk+N—1} at future time steps, where N depends on It. For example, evaluation of P2 involves the multiplication of vector {80,31} with sub-matrix T2, however this also results in partial computation Of P3. In the next time step, P3 is completely evaluated with the computation of the matrix vector product T1 S2. Thus the sub-division scheme shown in figure A.1 also provides a means to compute potential within the MOT framework. Further each Of the submatrices is Toeplitz and hence FFT can be employed to accelerate the computation. It is important to note that the this acceleration methodology does not involve any approximations and is exact. Consequently, this scheme is used to accelerate both near and far time interactions. Next, consider the above acceleration scheme within the context of ACE algorithm introduced in previous section. The following two observations that forms the basis Of integrating the Block-Toeplitz based temporal and ACE based spatial acceleration acceleration schemes, 1. Except for the multipoletO-local translation Operators all other operations in ACE depend only on either the source (1) or Observer (k) time. 2. The multipoletO-local translation Operation in Theorem 2.3.5 preserves the convolution in time. Consider the evaluation of potential at Observation points in domain (20 at time kAt due to sources located in domain {23 and excited at (At. Using theorem 2.3.5 and 165 (2.8) Of ACE algorithm in (A.4) we get 2(r,kAt) = 202...)“: n Ll") (A9) n=0 H H ) Lk" =. 2ng (A.10) (=0 °° m n 1 m = Z (21w )_(m_n).mH§c_]) (A.11) m=n (:0 where, Hg’g = VmG(rgs,(k—I)At) From above, we infer that the evaluation Of the local expansions at the kth time step is equivalent to the evaluation of the potential at the kth time step. Considering the above equivalence we rewrite the discrete form (B.20) as k— 1 00 P), = 2:09,, -.n)("> .[Z (M[m ") .—(m n). —H§,"‘)) (A.12) l=0 n=0 m=n . °° H ( —) H< ) l = 2(pm.)("l .11.. Z: Z (Mlm n .(m— n).— nngml) (A.13) n=0 "1:" 1:0 _ In above equation, the time dependence of translation Operator and multipole ex— pansion corresponding to a particular tensor component Of ACE harmonic can be written in the block Toeplitz matrix form as illustrated in Fig A.1. This immediately suggests the evaluation of the temporal convolution in (A.13) using FFT. In addition, as described above, the block-Toeplitz structure is utilized to conform with the MOT framework of solvers. TO complete the discussion we define the Fourier transform Of a nth rank Cartesian tensor M(") as follows M(")(n1,n2,n3;w) = f{Ml(n)(n1,n2,n3)} ,where n1 + 112 + n3 = n (A14) 166 where .7: denotes the forward Fourier transform Operator (t —+ w) and M("’) denotes the nth rank Cartesian tensor in Fourier space. In essence the Fourier transform of a tensor is evaluated in a term-by-term (n1, n2, n3) fashion. Given this definition the local expansion at kth time step Lk due to all past sources is evaluated as, Lin)(n1,n2,n3) = Z: $.74 {211070010 . (m — n) . M(m—n) (01)} (A15) where n1 + n2 + 113 = n, .7-"‘1 is the inverse Fourier transform Operator (or —-> t). Since the Fourier transform is applied only in the time domain (t H w) the tensor contraction definition is valid in the Fourier domain as well. Computation of the time domain diffusion potential using the multi-level tree representation Of the spatial domain requires attention to the following details. The multipole expansions for all boxes should be evaluated as we march on in time and stored at all time discretizations, this implies Nt upward tree traversals. The mul- tipole expansions M(") are represented as a set of Nt x 1 column vector for each tensor component as M[n)(n1, n2, n3). In a similar fashion the translation operator for Observation time kAt is represented as a set of k x k matrix for each tensor com- ponent. In actual implementation only the (2k - 1) unique entries Of this Toeplitz matrix is stored for computation with FFT. For interacting boxes, the multipole to local translation Operation involves the temporal convolution (A.11) and are evalu- ated in a rapid manner using FFT as in (A.15). This evaluation is carried out for each of the (n + 1)(n + 2) / 2 tensor components. As is evident, from figure A.1 the size of the Toeplitz system N depends on the block size which in turn depends on the observation time step It and can vary between 1 and N; / 2. The outcome of this process are the N local expansions from which the potentials at N future time steps are evaluated by downward tree traversal. The cost of this scheme is computed in the following manner. The cost of one up- 167 ward and downward tree traversal scales as 0(N3). The cost of computing multipole expansions at all time instants scales as 0(N3Nt). The cost multipoleto-local expan- sion and downward tree traversal for N size Toeplitz system scales as 0(N3N log N) and 0(N3N) respectively. In the entire scheme, Toeplitz systems of size Nt / 2 occurs once, Nt / 4 occurs twice and so on. Thus, the total cost Of this scheme scales as 0(N3Nt+Ns (% (10g_1§£+1) +2% (log%+1) +...)) z o (NsNt log2 N.) The error in the acceleration scheme is only due to approximations in the ACE algo- rithm. A.4 Results In this section, results are presented to substantiate the above claims and demon- strate the efficacy Of the algorithm presented here. The goal here is to demonstrate considerable speed-up with predetermined accuracy. Consequently, the results pre sented will demonstrate convergence in error as well as linear CPU cost scaling. In all numerical experiments, the source / Observer locations are randomly (uniform dis- tribution) chosen. The time signature associated with the nth source is given by (A. 16) M0 = WWW/2.2 (A.16) where an is the source strength, randomly chosen between [0,1], 0 = 6.366 x 10-8 s and tp = 60 s. In all simulations At was chosen as 1 second. These parameters were chosen arbitrarily.Error is conditioned primarily by the number of ACE har- monics used, P. While there is an error associated with the UV decomposition of the translation Operator, it is conditioned to be at or below that of the error incurred in 168 truncating the expansion of the translation operator to P harmonics. The accuracy Of the proposed algorithm is validated against direct computation for all cases where the unknown count is numerically small. The relative error at nth Observer is evaluated as (I) n,t — (I) - n,t Errorfar(n) = H fast,far( ) direct,far( )ll2 (A.17) llq>direct,far(n1 L) l [2 where, I] - [[2 represents Lg-norm, (I) fast, for (t) and (I’dz‘rect,far(t) represent the time history of the fields produced by the sources evaluated using the proposed algorithm and a direct procedure, respectively. The final error reported throughout this work is the average error over all observers. Error is computed only with the far-field potential, i.e. direct data is computed only for the source/Observation pairs that are in the far-field Of each other, and is consequently representative Of an upper bound or worst-case error. With the exception Of the temporal scaling experiments, the computational time reported here is the total run-time in seconds using a 2.3 GHz Intel Pentium processor with 2GB RAM running Linux OS. High Performance Computing Center at Michigan State University was utilized to extend into very long time scales. All estimated or projected time values are marked by I. The first set Of results demonstrates the exact multipoletO-multipole and local- tO-local translation Operators of the ACE algorithm, in the context Of evaluating Gaussian kernels. An important implication of this feature of the ACE algorithm is that the error does not increase as the height of the tree is increased. Consider two domains :21 and $22 of size (0, 0.5) x (0,05) x (0, 0.5)m3 and (1,15) x (1,15) x (1,1.5)m3 respectively. In each domain 4000 source/Observer points are randomly distributed and we consider interaction between these two domains only, all other interactions are neglected. As the number of levels in tree is increased, the change in the error norm can be attributed solely to the error in multipoletO-multipole and local-to-local translations. Table A.1 shows error computed for different P and 169 different levels in tree. For a given accuracy (fixed P) it is evident that variation in error Obtained from using different levels in the tree is accurate to double precision. The next set of results pertain to the evaluation of the time domain diffusion po- tential in (A.4). Table A.2 presents errors for different ACE harmonics P, numbers of unknowns N3, and domain sizes- represented by d - sidelength Of cube enclosing all sources/ observers. In all cases the number Of time discretizations was maintained at a constant, Nt = 256 and the FFT based MOT scheme was utilized. As expected, an increase in the number of ACE harmonics, P, leads to a uniform decrease in the error for all cases. The speed-ups provided by the proposed algorithm, utilizing the FFT scheme, are exhibited in Tables A.3 and A.4. Table A.3 presents run-times for evalu- ating time domain diffusion potential for different size Of spatial discretizations while the following parameters were kept as constants: total number Of time discretization Nt = 256, size Of domain d = 0.5 and number of ACE harmonics P = 3 corresponding to an error Of 0(1E — 5). It can be seen that the proposed algorithm is 230 times faster than the direct method even for a small problem size N. = 4000 and Nt = 256. Figure A.2 shows a log-scale graph of Tfast vs N, for values in Table A.3. The line in the graph corresponds to a least-square linear fit whose slope was found to be 1.1. This validates the 0(Ns) cost scaling of the proposed algorithm. In Table A.4 the run-time of the proposed algorithm is shown for different num- bers Of time discretizations. Aside from Nt, all other parameters were kept constant at: N. = 8, 000, d = 0.5m and P = 3. For the FFT experiments, the expected 0(NtlogNt) scaling cannot be verified because of the following implementation de tails. In this workthe Open source library FFTW3 package [236] was used which does not exhibit uniform NtlogNt scaling. This is the case with many other performance oriented FFT packages as well. Also, it was Observed that it is efficient (faster) to use direct evaluation than using FFT procedure when the size Of the Toeplitz system was smaller. This is important in terms Of overall performance as smaller size Toeplitz 170 system occur more frequently ex. the Toeplitz system of size 1 x 1 and 2 x 2 occurs Nt / 2 and N); / 4 times. In this work direct evaluation was used for any Toeplitz matrix of size 5 16 x 16, this may vary based on the computer platform and FFT library in use. 171 I P1 . S0 P3 S1 P3 32 P4 = 33 P5 S4 P3 35 P7 ‘50—." Se Figure A.1: Illustration Of the Block-Toeplitz computational scheme Table A.1: Exact translation operator in ACE algorithm, P denotes the number Of ACE harmonics. Levels P = 3 3 5.343762051614 770E—006 4 5343762051614 130E—006 5 5.343762051614 279E—006 P = 6 7.0123317207 40178E—008 7.0123317207 70545E—008 7.0123317207 63711E—008 Table A.2: Error convergence for different number Of ACE harmonics (P) and different source/Observer configuration (N3, d) N3, d P || 8000, 1.0 | 4000, 1.0 | 8000, 0.5 | 4000, 0.5 0 4.92E—02 5.01E-02 1.39E—02 1.74E—02 1.12E—02 1.04E—02 3.71E—03 4.70E—03 9.61E—05 9.39E—05 1.22E—05 2.21E—05 1.20E—06 1 .36E—06 3.54E—08 9.04E-08 1.87E—08 2.28E—08 1.20E-10 3.30E—10 19'4me 2.77E—10 3.44E—10 6.40E—13 1.21E—12 172 4.4 I 4.8 5.2 Log(Ns) Figure A.2: log Tfast vs. log N3 from Table A.3, slope of linear fit = 1.1 5.6 Table A.3: Time for different problem size (N3) within a cube Of sidelength d = 0.5m. In all cases Nt = 256, P = 3 (e = 0(1E — 5)) N3 TFFT TDirect 4000 71.98 17195.35 8000 98.32 68781.25 I 16000 226.8 275125.60 I 32000 472.66 110050240 1 64000 978.07 440200960 I 128000 2282.01 17608038.40l 256000 4652.78 7043215360l Table A.4: T ast for different Nt size. In all cases N3 = 8, 000, d = 0.5, P = 3 (e = 0(1E — 5) I Nt I TFFT I TDirect 256 94.17 44169.36 512 231.95 176677.44 T 1024 554.57 706709.76 l 2048 1395.52 282683904 1 4096 3438.00 1130735616 1 8192 9494.96 4522942464 T 16384 28410.84 180917698.56l 17 3 Appendix B Comprehensive Exam Problem: Integral Equation Based Eddy Current Model for Defects in Layered Media The goal of this chapter is to develop an efficient simulation scheme for eddy current testing Of defects in layered media. Finite element method is the popular technique used to model eddy current testing. However, they require discretization of entire domain and is not favorable for modeling layered media. Alternatively, the less used Integral equation approach, with layered medium Green’s function, demands discretization Of the small defect region only. Such a formulation demands the eval- uation of infinite integrals which are efficiently handled using the recently developed Discrete Complex Image Method (DCIM). A volume integral equation solver, based on tetrahedral elements, is developed here. Full details of the formulation along with results demonstrating the efficiency and accuracy of the method are presented. 174 B.1 Introduction Eddy current testing (ECT) relies on the principle of magnetic induction to interro— gate materials under inspection [237] at very low frequencies, of the order Of kilo—Hertz (kHz). Since its inception, ECT has become an indispensable tool and covers a wide spectrum Of industries from pipeline inspection to aircraft health monitoring to bio- medical inspections. ECT can be used for any applications where the material (to be inspected is a conducting medium. Qualitatively, the principle of Operation Of ECT is as follows: time varying magnetic field (produced usually by a time vary- ing impressed currents) induces eddy currents in the conducting region, this in turn produces a measurable field outside the material region. Change in conductivity of the material affects the characteristic Of the induced eddy currents—magnitude and direction of flow—causing a change in the measured field. Eddy currents are induced in regions only near the impressed currents, thus it is a local phenomena and aids in precise location of defects with appropriate signal processing. Numerical models for ECT have played a significant role in its development and application. These models have been used to design the geometry Of the excitation and measurement probes, determine Optimal testing parameters and development Of efficient signal processing techniques. Eddy current phenomena is governed by Maxwell’s equation. Under low-frequency and high conductivity assumptions, the coupled Maxwell’s equations is reduced to a single diffusion equation. Finite ele ment methods (FEM) has been the popular choice in NDE community to model this phenomena. This is primarily due to ease of implementation and fast solution time. However, FEM requires discretization of a large domain and employs artificial bound— ary condition to truncate fields at boundary. Alternatively, integral equation solution have also been developed tO model ECT, however these models need to be formulated appropriately to compete with the speed of FEM solutions. 175 In this work, an integral equation solver is developed to model ECT for analyzing defects that are embedded in a multilayered environment, e. g., cracks under rivets etc. Contrary to the existing literature in NDE [237], the low-frequency approximation is not used here. A detailed discussion Of the various techniques used in this develop- ment is provided in the following sections. Section B.2 describes in detail the theory required for the development of the integral equation model. Section B.3 presents the numerical techniques used in implementation. Finally, Section B.4 provides some results to substantiate the theoretical claims. B.2 Integral Equation Formulation B.2. 1 Formulation Consider the homogeneous domains {21 and (22 in figure B.1. Assuming e—i’L’L time convention the harmonic electromagnetic field in each region are given by Maxwell’s equation, V X E1 = inl (B.1) V X H1 = -’iw€1E1 + 0'1E1 + Je = —iw€~1E1 + Je V X E2 = in2 (B.2) V X H2 = —iw82E2 + 02E2 = —iw€~2E2 where E, and Hk represent the total electric and magnetic fields, respectively, in the domain 52),, J c is the impressed or excitation source, {Eb/1.190%} are the electromag- netic material constants permittivity, permeability and conductivity, respectively, Of the domain 9k, 87,. = 8k — ok/iw denotes the complex permittivity. Straight forward 176 manipulations reduces (B.2) to the form in (B.1) V X H2 = —’iw§1E2 + J3 (B.3) where J 5 = —iw(§2 — €1)E2 is the equivalent source that depends on the field in that region. The equivalent sources introduced here allows one to treat 92 same as 521 but with additional sources. The total electric field, EL in any region is divided into E8 and E3 corresponding to the contribution from impressed source J 3 and equivalent source J 3 respectively, Et = E8 + E3 (BA) The above integral equation is used to solve for the equivalent source strengths to uniquely determine the fields everywhere. Evaluation Of E3 due to the equivalent sources J 3 requires the prescription of the Green’s function, G, for a source radiating in presence of the homogeneous media (21. Thus, (BA) is written as E = E6 +341, (B5) In ECT simulations, the layered media without any defect is chosen as {21 and the region corresponding to defect only is taken as 92. Thus the Green’s function in (B5) corresponds to the layered media Green’s function with sources outside and inside the layered media. The next section, provides a detailed treatment on the derivation of the layered media Green’s function. B.2.2 Planar Media Green’s Function In deriving the Green function, no approximation is made to reduce the governing wave equations of ECT to a diffusion equation. This is done intentionally to lever- age on the existing literature in the wave prOpagation community and introduce the 177 appropriate modifications at a later stage. A vast source of literature address the derivation Of Green’s function in presence of layered planar media obeying the wave equation and radiation boundary condition. In the last decade, several significant de velopments have been made in terms of formulation and evaluation of these Green’s function in a manner suitable to numerical models. The electric and magnetic field produced by an infinitesimal electric dipole d = 6(r)d in homogeneous media can be written in terms of spherical waves as _ ikr E(r) -_- —iwu (I+ :_2v) er *d (B.6) eikr H(r) = V x T *d (B.7) where r = Mr“ and * denotes spatial convolution Operation. Sommerfeld Identity [238] is used to represent spherical waves as a product of plane waves in z direction and cylindrical waves in p direction, ezkr i k (1) 'k — - dk —pH I: 6z zlzl B.8 1‘ 2./SIP pkz ”(pp) ( ) where 16,, = [#02 — kg, 16 = w/c, c is the speed Of light in the homogeneous medium and w is the angular frequency. Now consider the eflect Of a layered media, stacked in z direction, whose properties vary only as a function Of 2. From the expansion in (B8), it is evident that only the plane waves in z direction will get affected. In other words, the effect Of introducing a layered medium in 2 can be accounted in terms of the reflection and transmission of the plane waves travelling in z direction. As an example, consider a vertically oriented dipole placed at origin in a homogeneous media, (1,, = 0. Since Ep = 0, this corresponds to transverse magnetic (TM) wave excitation that is fully characterized by the Ez fields. Combining (B6) and (B8), we 178 get 0 k . E2 =iwp(1+a§/k2)l / dkp—pHél)(kpp)e’kZIzl (13.9) 2 SIP k2 When a planar layered media in z direction is introduced, its effect can be accounted for using the TM 2 reflection and transmission coefficients depending on the region where field is computed. Fields on the same side of the source are called the re flected fields and is denoted by superscript R. Fields on the Opposite side are called as the transmitted fields denoted by superscript T. In this case, the reflected and transmitted fields are given as, __ k . . EzR = (1+ 83/k2)% [SIP dkp—:Hél)(kpp)e’kzlzl(1 + RTMe_’2"ZIzI) (13.10) _ k . E3 = (1 + 6.3/9%? [51p dkp—:H((,1)(kpp)ezk2zlzl(1 + TTM) (3.11) where RT M and TT M denotes the reflection and transmission coefficient for TM z plane waves, 11:22 = 10% — 10,2, and k2 is the propagation constant in the layered medium. Similarly, one can perform the same derivation for TE 2 fields produced by a dipole oriented in horizontal or :r — y plane. Since any field can be broken into TM 2 and TE; polarization, superposition of the two forms is required when considering arbitrary excitation. Depending on the intended use, several different variations of the layered media Green’s function can be found in the literature. In this work, the symmetric form of the Green’s fimction provided in [239] is used as it allows for efficient treatment Of singularities in the Green’s function. In addition to the source vector, this form requires a testing vector to attain symmetry. The derivation is 179 straight forward but laborious [239], hence only the final form is presented here where, R 9TE,TM,EM REM Cl” T gTE,TM,TM1 TTMl ' 1 ’2’? (68’ + pa avvo ’)g P (B.12a) b iw Ti? (0:... a’ngE + azdngM) + (B.12b) iwuo 471-132 __(a. VV al’gTM + 2613.V3V3.d;g§M) rid—#0 l 1 Tu b643( angE + k2 —as. VSVS. .aggTM + kg, -—aza;gTM1)(B. 12C) 0 k . = 3/ dkpRTE,TM,EM_iH([1)(kpp)e—ikbz(z+z’) (13.133) 2 SIP kbz 2 _b RTE + RTM) = 2k§( = —a’,+6; = i / dkaTE’TM’TM 11931131)(kpp)e""bzz+”‘“zz’)(13-13b) 2 SIP kbz _ szTM p where 67 is the testing vector, 6’ is the source dipole vector, subscript t and 2 denotes the traverse and 2 components of vectors. B.3 Numerical Implementation This section provides details on the numerical implementation of the above theoretical formulation for ECT simulation. This includes methods to efficiently evaluate the infinite integrals Of Green’s function (B.12) are presented. Followed by details on the solution to the integral equation in (B5). 180 B.3.1 Evaluation of Green’s function Several methods have been develOped for numerical evaluation of the infinite integrals arising with the use of Sommerfeld identity in (B8). These vary from asymptotic expansion that yield closed form, approximate solution to specialized quadrature rules. In this work, the recently developed Discrete Complex Image Method (DCIM) is used evaluate these infinite integrals in closed form. In DCIM, the integrand is approximated as a sum of exponential functions in In; and the Sommerfeld identity is used to represent these individual infinite integrals as spherical waves. Consider the evaluation of the following infinite integral, - k 2' z (W) = $1911: éHél)(ka)e kz 13092) (314) Note that all integrals in (B.12) can be reduced to the above form with simple ma- nipulations. Let the coefficient R(kz) in (BM) be represented as sum of exponentials N . R(kz) = Z Ane‘kzon (13.15) n=0 Using the Sommerfeld identity (B.8), the individual integrals in the above summation is represented as spherical waves, N ¢ = :3 Aneik‘rfi/r: , rs. = \/p2 + (z + on)? (13.16) n=0 These spherical waves can be interpreted as emanating from image sources of strength An placed at complex positions rfi. There exists several different approach to determine A,- and ci, for approximating 120%) in terms of exponential functions. Matrix Pencil Method (MPM) is a relatively new technique for representing arbitrary function in terms of exponentials. MPM is 181 based on singular valued decomposition (SVD) and provides a means to control the error in the resulting approximation. A major requirement of MPM is that the integrand be sampled at uniform intervals. In case of planar media Green’s function, this sampling should be done respecting the Sommerfeld integral path (SIP) shown in figure B.2. In practice, this is achieved by using an intermediate parametric variable t. The integrand is sampled at uniform intervals of t and the relation between t and k2 is chosen to conform with the SIP, k2 = 7k[it + 1] 0 S t S T0/7 (B.17) where To defines the finite domain of integration and is chosen large enough until oscillations in integrand have damped sufficiently, 7 controls the separation of SIP with real axis as shown in figure B.2. The relation (B.17) prescribed in literature is suitable only for scenarios where the conductivity of the medium is zeros or negligible. When conductivity is finite and very large, as in case of ECT, k2 determined through above relation (B.17) fails to follow the SIP as shown in figure B.3. A modified relation (B.18) is introduced in this work to account for materials with high conductivity and still respect the SIP [9; = 7Re{k}[it + 1] + iIm{k} (B.18) where Re{-} and I m{} represent the real and imaginary part respectively. Note that when k is a pure real quantity, i.e. when conductivity is zero, the modified relation reduces to the original relation (B.17). Figure B.3 shows the path evaluated using the modified relation for high conductivity case. Other advantages of high conductivity are (a) the poles and branch cuts of the integrand are well separated from real axis and they do not affect the numerical integration; and (b) the integrands 182 are less oscillatory, which in turn implies less number of approximation terms. The maximum number of DCIM terms encountered throughtout this work did not exceed 20. B.3.2 Solution to integral equations For numerical solution of the integral equation in (B.5), the volume region {22 is discretized into M tetrahedrons. The unknown electric field density D in this domain is represented in terms of divergence conforming vector basis functions defined on each face of tetrahedron [240], Me D(r) = Z f,(r)D,- (3.19) n=1 where Me is the total number of distinct tetrahedron faces, fz- is the vector basis function defined on the ith face and D,- is the unknown coefficient corresponding to this basis function. Several properties of these basis function are discussed in detail in [240]. Employing this representation in (B5) and using Galerkin testing leads to the following matrix equation, zv = H (3.20) where V and H are vectors of size Me and Z is a matrix of size Me x Me. The elements the matrix and vectors in (B.20) are given as follows, Z(m,n) = [V dr [V dr fm(r).G .1.,(r) (3.21) V(m) = Dm (B22) H(m) = [V drEi"C(r)fm(r) (3.23) where V is the volume of domain 92. The symmetric form of the Green’s function used in this work, allows the transfer of derivatives of the Green’s function onto the testing and sourcing vector basis Thus the maximum singularity encountered in this 183 work is of the type 1/ | Ir—r’ II which can be evaluated in closed form using the methods provided in [241]. BA Results and Discussion Several results that demonstrate the efficacy of above described scheme are presented in this section. First the theoretical modifications introduced to the DCIM method for high-conductivity materials are verified. Simple semi-analytical models exists for axi-symmetric geometries. Such models exist for circular coils place over defect free layered media. The numerical implementation of DCIM and MPM procedure with the proposed modifications are compared with results these semi-analytical models for various tact configuration. First, the results for evaluation of the transmitted field at various depths inside the layered medium using both the methods are shown in figure B.4. Figure B.5 shows the evaluation of transmitted fields at different frequencies using both the methods. Next, impedance of the coil evaluated using DCIM+MPM method and semi-analytical model is shown in figures B.6 and B7 corresponding to different coil lift-off and excitation frequency, respectively. In all cases, there is an excellent match between DCIM+MPM method and semi-analytical and this implies that the proposed modifications to DCIM are valid and essential. Next set of results corresponds to the case of a defect in the layered medium. A semi-infinite medium with finite conductivity of a = 30M S with a rectangular defect of size 13 x 0.3 x 5 mm3 is considered. The frequency of excitation is 900 MHz and the solution from the integral equation solver are compared with experimental measurements. Impedance of the coil is computed at various .1: position, see figure B.8, with the origin at center of defect region and a: position of coil is measured with respect to coil center from origin. The amplitude of coil impedance evaluated from Integral equation solution is shown in figure B.9 and the phase values of impedance is shown in 184 91 02:0 Figure B.1: Different domains in eddy current simulation knlkpl §§§§§§§9o o1obozoooaoooaooosoooeooomoeooo 3°!ka , Figure B.2: DCIM path from existing literature for zero conductivity figure B.10. Qualitatively comparing with the measurement results, provided in [242], there is a good match with results from the integral equation solver. Quantitatively, the error between the model results and measurement was evaluated to be about 10%. 185 ‘00-! '1 New path for complex k 300 P , — — - Old path for complex It 200 H - - - - Old path forreal k I '2 m J x: I v o ' E ‘ 0‘ .-.-o-o-ona-e-e-e- ‘ L 6 2000 4000 0000 8000 Re{Kp} .-.-.-.-.. l A Figure B.3: Comparison between existing and Proposed DCIM path with finite con- ductivity. .3 4.4 x10 4.2 4 . .. “3.8 - 9 m 3.6 > - g 3.4- DCIM - 3.2 _ 0 Analytical . 3 . . 2'80 oh 0) 0:6 018 1 Re {E } x10'7 Figure B.4: Transmitted field at various depth 186 —— DCIM 0 Semi-Analy. 2000 4000 0000 Freq. ( Hz ) 8000 10000 Angle{E