SOFTWARE TOOL METHODOLOGIES ON A GPU FOR FINITE ELEMENT
OPTIMIZATION IN MAGNETICS
By
Sivamayam Sivasuthan
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Electrical Eng
ineering
Doctor of Philosophy
2015
ii
ABSTRACT
SOFTWARE TOOL METHODOLOGIES ON A GPU FOR FINITE ELEMENT
OPTIMIZATION IN MAGNETICS
By
Sivamayam Sivasuthan
The design of magnetic devices requires optimization coupled with finite element
analysis (FEA). Thi
s involves a massive computational load and requires a specialized mesh
generator. It is therefore not practicable. This thesis therefore presents i) a parameterized
iterative mesh generator for two
-
dimensional and three
-
dimensional finite element optimiza
tion;
ii) fast and low memory finite element solvers using a graphics processing unit (GPU). In
particular we introduce element by element finite element computations on a GPU with a
speedup of 102 while the best competing method gives only 10; and iii) a
n examination of
parallelizing such matrix computations on already parallelized genetic algorithm threads using
new GPU architectures. The resulting system is reliable and yields solutions in practicable times
with massive speedup. Example inverse optimiza
tion problems are presented. These software
tools are written in C/C++ and CUDA C/C++. The system is shown to be applicable to the
synthesizing of two
-
dimensional and three
-
dimensional electromagnetic devices and to non
-
destructive evaluation (NDE) probl
ems.
Several finite element mesh generators exist in the public domain, some even based on a
parametric device description. But for optimization we need a parametrically described mesh
dynamically evolving through the iterations without user input. The fe
w that exist are
commercial and their methodology is not known. In this thesis the mesh generator that we
describe is in open source code with parametric mesh generation that runs nonstop and
seamlessly through optimization iterations to convergence withou
t user intervention. Such
mesh
iii
generators as do exist are rare, commercial and not easily available to researchers except at great
cost and never with the code to modify it to suit needs individual. Besides, the typical mesh
generator requires some man
-
mac
hine interaction to define the mesh points and boundary
conditions and does not work for nonstop optimization iterations. We take two regular open
source mesh generators, one for two
-
dimensional systems and the other for three
-
dimensional
systems, and writ
e a script
-
based interface as open source code to run nonstop for optimization.
and use it equally adaptively for machine design. A simple scheme of averaging
neighbor heights
gives us a smooth geometry without having to use Bezier curves.
This thesis also points out using a literature survey issues in GPU computation which
result in erratic speedup and explain why in some instances GPU solutions are arithmetic
ally a
slight improvement on CPU solutions.
iv
This thesis is dedicated to my parents; Sinnathurai Sivamayam and Sivamayam Sivakumary, and
my mentors . . .
v
ACKNOWLEDGMENTS
I would like to address my heartily profound gratitude and ap
preciation to my advisor Dr.
S. Ratnajeevan H. Hoole for taking me on as a research student under his valuable guidance
with funding. His advice and insights were the real encouragement to complete this work. It has
been an honor to work with him. My sinc
ere thanks also go to the rest of my PhD committee
members, Dr. Lalita Udpa, Dr. William Punch and Dr. Nihar Mahapatra, for their suggestions
and encouraging comments throughout this work.
ch and Development
Center (TARDEC) for funding our research under Contract Number W911NF
-
11
-
D
-
0001 and
W56HZV
-
07
-
2
-
001. I am thankful to the MSU Graduate School and College of Engineering for
awarding a summer graduate excellence fellowship thrice, a gradu
ate office fellowship and a
graduate teaching assistantship for one semester. This support was instrumental in facilitating
the completion of my degree.
I would also like to thank the faculty
-
and staff
-
members of the Department of Electrical
and Compute
r Engineering at Michigan State University, especially its College of Engineering,
for their support in various forms and their friendship. I have been lucky to share a lab with so
many friendly colleagues, namely V. U. Karthik, M. R. Rawashdeh and T. Mat
hiyalakan. I
express my sincere gratitude for their unstinted friendship. My sincere thanks go to all the
academic and nonacademic staff members and friends at University of Jaffna, Sri Lanka, where
my career as a student in computer science started.
Fin
ally, I am forever indebted to my parents and other members of my family, who have
supported and encouraged me through their kindness and affection, so that I could concentrate on
my studies. They touched me more deeply than I could have ever expected.
vi
TAB
LE OF CONTENTS
LIST OF TABLES
................................
................................
................................
.......................
i
x
LIST OF FIGURES
................................
................................
................................
.......................
x
LIST OF ALGORITHMS
................................
................................
................................
...........
x
v
Chapter 1
Introduction
................................
................................
................................
..................
1
1.1
Motivation
................................
................................
................................
............................
1
1.2
The Finite Element Method
................................
................................
................................
.
5
1.2.1
Introduction
................................
................................
................................
................
5
1.2.2
Two
-
Dimensional Problems
................................
................................
.......................
5
1.2.3
Trial Function
................................
................................
................................
..............
6
1.2.4
Solving Magneto
-
Stat
ic Problems
................................
................................
............
11
1.2.5
Boundary Conditions
................................
................................
................................
14
1.2.6
Three
-
Dimensional
................................
................................
................................
...
15
1.3
Invers
e Optimization Problems
................................
................................
..........................
17
Chapter 2
Parameter Based Unstructured Mesh Generator for Two and
Three Dimensional Problems for Seamless Optimization
................................
......
21
2.1
Background
................................
................................
................................
.......................
21
2.2
Mesh Generation
................................
................................
................................
................
27
2.2.1
Introduction
................................
................................
................................
...............
27
2.2.2
Delaunay Based Methods
................................
................................
.........................
29
2.2.3
Delaunay Triangulation and Constrained
Delaunay Triangulation
..........................
30
2.2.3.1
Introduction
................................
................................
................................
......
30
2.2.4
Algorithms for Constructing a Delaunay Triangulation
................................
...........
31
2.2.4.1
Introduction o
f Constructing a Delaunay Triangulation
............................
31
2.2.4.2
Divide
-
and
-
Conquer Algorithm
................................
................................
.
31
2.2.4.3
Sweep
-
line Algorithm
................................
................................
................
32
2.2.4.4
Incremental Insertion Algorithm
................................
................................
33
2.2.5
Mesh Refinement
................................
................................
................................
......
33
2.2.6
Three Dimensional Mesh Generation
................................
................................
.......
34
2.3
Parameterized Mesh Generat
ion
................................
................................
........................
36
2.4
New Approach to Parameterized Mesh
Generation
................................
...........................
36
2.5
Data Structure and User Interface
................................
................................
......................
38
2.5.1
Data Structure
................................
................................
................................
..........
38
2.5.2
User Interface
................................
................................
................................
............
41
vii
2.5.2.1
Introduction
................................
................................
...............................
41
2.5.2.2
Defining the Geometrical Shape
................................
................................
41
2.5.3
Post
-
proc
essing of Meshing
................................
................................
......................
44
2.5.4
Approach to
Renumbering
................................
................................
.......................
46
2.5.5
Merge Sort
................................
................................
................................
................
48
2.5.6
Modified Form of Merge Sort for Renumberi
ng
................................
......................
50
Chapter 3
Low
Memory High
Speed FEM Solvers Using the GPU
................................
........
52
3.1
Introduction
................................
................................
................................
.......................
52
3.2
General Purpose Computing on a Graphics P
rocessing Unit (GPGPU)
............................
54
3.3
Related Works
................................
................................
................................
....................
57
3.4
Element by Element Solvers
................................
................................
..............................
59
3.4.1
Element by Element with Jacobi Algorithm
................................
.........................
.
59
3.4.2
Element by Element Conjugate Gradients Algorithm
................................
............
64
3.4.3
Element by Element Biconjugat
e Gradient Algorithm
................................
............
67
3.4.4
Element by Element with Bi
-
Conjugate Gradient Stabilized
method.
....................
70
3.5
Conjugate Gradients Algorithm with Sparse Storage Schemes
................................
.........
73
3.5.1
Conjugate Gradient Algorithm for Matrix Solution
................................
..............
.
73
3.5.2
Matrix Storage Schemes
................................
................................
..........................
75
3.5.2.1
Introduction
................................
................................
................................
....
75
3.5.2.2
Profile Storage
................................
................................
...............................
75
3.5.2.3
Sparse Storage Scheme
................................
................................
..................
77
Chapter 4
Test and Validation Problems
................................
................................
..................
79
4.1
Device design inverse
-
optimization problem: Design of the Pole Face of an
Electric
al
motor
................................
................................
................................
................................
..
79
4.1.1
Problem Definition
................................
................................
................................
...
79
4.1.2
Problem Model
................................
................................
................................
..........
81
4.2
Inverse
-
optimization for Device Design: Determining the Rotor Contour of a
Salient Pole Synchronous Generator
................................
................................
................
87
4.2.1
Problem Definition
................................
................................
................................
....
87
4.2.2
Problem M
odel
................................
................................
................................
..........
88
4.3
NDE benchmark problem:
Characterizing Interior Defects
................................
..............
93
4.3.1
Problem Definition
................................
................................
................................
....
93
4.3.2
Problem
Model
................................
................................
................................
..........
94
4.4
A Simple Three
-
dimensional Problem
................................
................................
...............
98
4.5
NDE benchmark problem in 3D: Characterizing Interior Defects
................................
..
101
Chapter 5
Results and
Analysis
................................
................................
................................
105
5.1
Memory Limitation
................................
................................
................................
..........
105
5.2
Element
-
by
-
Element Solve
rs
................................
................................
...........................
108
viii
Chapter 6
Conclusion and
Future Works
................................
................................
...............
123
APPENDICES
................................
................................
................................
............................
126
Appendix A
:
Publications Raised
from This Research
................................
.........................
127
Appendix
B
:
Sample
Input File
2D
................................
................................
..........................
130
Appendix
C
:
Sample
Input File
3D
................................
................................
..........................
132
BIBLIOGRAPHY
................................
................................
................................
......................
137
ix
LIST OF TABLES
Table 4.1
The flux
distribution from the un
-
const
rained optimization
................................
..........
84
Table 4.2
The flux distribution from the constrained optimization
................................
...............
86
Table 4.3
The flux distribution after the optimi
zation without smoothened shape
.......................
91
Table 4.4
The flux distribution after optimizat
ion with smoothened shape
................................
..
92
Table 4.5
The solut
ion of defect characterization
................................
................................
..........
96
Table 4.6
Real and binary
solutions time need to compute
................................
...........................
97
Table 4.7
Real and binary solutions time need to comp
ute
................................
...........................
97
Table 4.8
Potentials at measuring points
................................
................................
.....................
100
Tabl
e 4.8
Potentials at measuring points
................................
................................
.....................
101
Table 4.9
The solution of 3D defect characteriz
ation
................................
................................
..
104
Table 5.1
Number of elements (NE) and storage (in MB) with matrix size for different storage
schemes . . . . . . . . . . . . . . . . . . . . . . . .
................................
................................
......
106
T
able 5.2
Projected Memory (in MB) Needs
................................
................................
...............
107
Table 5.3
The CGEbE solution
................................
................................
................................
....
109
Table 5.4
The BiCGSTABEbE solution
................................
................................
......................
11
2
Table 5.5
The Jacobi solution
................................
................................
................................
......
11
5
Table 5.6
Speedup ratio betw
een single and double precision
................................
....................
120
x
LIST OF FIGURES
Figure 1.1
R
egula
r finite element analyses
.
................................
................................
....................
2
Figure 1.2
F
EM with optimization algorithms
................................
................................
................
4
Figure 1.3
Definiti
on of H1 and h1 for a triangle
................................
................................
............
6
Figure 1.4
Integr
ation of triangular coordinates
................................
................................
............
10
Figure 1.5
Known elements and unknown elements in
matrix
................................
......................
14
Figure 1.6
Traditi
onal engineering design process
................................
................................
........
16
Figure 1.7
Modern design process
................................
................................
................................
.
18
Figure 1.8
Equipotentials of vector potential at the optimum
................................
.......................
18
Figure 1.9
Equipotentials of vector potential
at the optimum with B
-
spline in
terpolation
...........
18
Figure 2.1
Design
cycle for an geometric optimization
................................
................................
.
22
Figure 2.2
Design cycle
for an inverse problem
................................
................................
............
22
Figure 2.3
Performance comparison of triangulation using CPU
-
DT and GPU
-
DT for different number of points
................................
................................
................
24
Figure 2.4
Problem specific parametri
c mesh generato
rs
................................
..............................
25
Figure 2.5
Elastically deformed problem specific
NDE mesh: As defect moves
..........................
26
Figure 2.6
3D mesh for NDE problem: As parameter
s change
................................
.....................
27
Figure 2.7
3D Mesh for motor problem
................................
................................
.........................
28
Figure 2.8
Applying
Delaunay
triangulation
................................
................................
.................
30
Figur
e 2.9
Example for constrained
Delaunay
triangula
tion
................................
.........................
31
Figure 2.10
Divide and conquer algorithm
................................
................................
....................
32
Figure 2.11
Sweep
-
line algorithm
................................
................................
................................
..
32
Figure 2.12
Incremental insertion algorithm
................................
................................
.................
33
xi
Figure 2.13
Delaunay mesh refinement
................................
................................
.........................
34
Figure 2.14
Delaunay
mesh re
finement between two regions
................................
......................
34
Figure 2.15
My approach to
parameterized mesh
gene
ration
................................
........................
37
Figure 2.16
Sample input file for mesh generator
................................
................................
........
41
Figure 2.17
Simple example problem
................................
................................
............................
46
Figure 2.18
Numbering and renumbered nodes
................................
................................
............
47
Figure 2.19
Renumber
ed nodes
................................
................................
................................
.....
47
Figure 2.20
Sorted version of (b) and corresponding i
ndex changes
................................
............
47
Figure 2.21
Merge Sort
................................
................................
................................
..................
48
Figure 3.1
Finite element optim
ization using genetic algorithm
................................
...................
53
Figure 3.2
Floating
-
point operations per second and memory bandwidth for the
CPU and GPU
................................
................................
................................
...............
54
Figure 3.3
The GPU devotes more
transistors to
dat
a processing
................................
.................
55
Figure
3.4
Steps
in the classic finite element method (FEM) and the proposed changes for
the FEM
-
SES method e
nclosed within the dashed line
................................
...............
58
Figure 3.5
Proposed method in flow chart
................................
................................
.....................
60
Figure 3.6
A. Sparse full matrix, B. Sparse lower triangular matrix (
because
of symmetry
)
.......
76
Figure 3.7
Data structures for the symmetric profile storage
corresponding to
Figure 3.6 B
................................
................................
................................
.................
76
Figure
3.8
A
. Sparse full matrix, B. Sparse upper triangular matrix (because
of symmetry
)
.......
7
7
Figure 3.9
Data structures for the symmetric profile storage
corresponding to
Figure 3.8 B
................................
................................
................................
.................
77
Figure 4.1
Pole face of electrical motor
................................
................................
.........................
79
Fi
gure 4.2
Geometry, boundary conditions and the material properties of the sample problem
...
80
Figure 4.3
Defining the problem using our tool
................................
................................
............
81
xii
Figure 4.4
Generated mesh using our tool
................................
................................
.....................
82
Figure 4.5
Finite element solution
................................
................................
................................
.
82
Figure 4.6
Results of the un
-
constrai
ned optimization of the problem
................................
..........
83
Figure 4.7
Results of the constrained optimization
without
smoothening
................................
.....
83
Figure 4.8
Results of the constrained optimization
with smoothening
................................
..........
84
Figure 4.9
The flux distribution from t
he un
-
constrained optimiz
ation
................................
........
85
Figure 4.10
Averaging technique for
manufacturable shape
................................
........................
86
Figure 4.11
The flux distribution from the constrained
optimization
................................
............
86
Figure 4.12
A synchronous Generator
(A) two pole and (B) four pole
................................
.........
87
Figure 4.13
Parame
trized geometry of sali
ent pole
................................
................................
.......
88
Figure 4.14
Defining the problem
................................
................................
................................
.
89
Figure 4.15
Initial mesh
................................
................................
................................
.................
89
Figure 4.16
Flux line of a salient pole synchron
ous Generator
................................
.....................
90
Figure 4.17
Optimized shape without smoothening constrained
by rising pole
heights from
le
ft to right
................................
................................
................................
...................
90
Figure 4.18
The flux distribution after the optimization wit
hout smoothened shape
....................
91
Figure 4.19
The flux distribution after the optimi
zation with smoothened shape
.........................
91
Figure 4.20
Final smoothened shape
................................
................................
.............................
92
Figure 4.21
Inspection of an ar
my vehicle after impro
vis
ed explosive device
............................
93
Figure 4.22
Parametrically defined crack in plate from Triangle
................................
..................
94
Figure 4.23
Defining the problem . . . . . . . . . . . . . . . . . . . . . . . . . .
................................
.............
95
Figure 4.24
Generated Mesh for NDE problem
................................
................................
.............
95
F
igure 4.25
Flux line for NDE problem
................................
................................
.........................
96
Figure 4.26
Optimum shape of the reconstructed defect
................................
..............................
96
xiii
Figure 4.27
Square conductor problem
................................
................................
..........................
98
Figure 4.28
Mesh for square conductor problem
................................
................................
...........
99
Figure 4.29
Potential at measuring points
................................
................................
.....................
99
Figure 4.30
Potenti
al at measuring points for a simple cube
................................
.........................
99
Figure 4.31
Three
-
dimensional NDE problem
................................
................................
............
102
Figure 4.32
Defining variable: top: side view, bottom: s
ide view
................................
...............
103
Figure 4.33
Three
-
dimensional mesh
for NDE prob
lem: As parameters change
........................
103
Figure 5.1
Memor
y vs matrix size
................................
................................
...............................
107
Figure 5.2
Speed
-
up versus matrix size: Jacobi
preconditioned conjugate gradi
ents algorithm.
108
Figure 5.3
Number of iterations vs number of unkn
owns for CGEbE
................................
........
110
Figure 5.4
Number of iterations vs number of eleme
nts for CGEbE
................................
.........
110
Figure 5.5
Speedup vs number of unknowns for CGEbE
................................
...........................
111
Figure 5.6
Speedup
vs number of elements for CGEbE
................................
..............................
111
Figure 5.7
Number of unknowns vs numbe
r of iterations for BiCGSTABEbE
..........................
112
Figure 5.8
Number of elements vs number of iter
ations for EbEBiCG
STAB
...........................
113
Figure 5.9
Speedup vs number of unknowns for B
iCGSTABEbE
................................
.............
113
Figure 5.10
Speedup vs number of elements for BiC
GSTABEbE
................................
.............
114
Figure 5.11
Number of unknowns vs number of iterat
ions for JacobiCG
................................
...
114
Figure 5.12
Number of elements vs number of i
teratio
ns for JacobiCG
................................
....
115
Figure 5.13
Speedup vs nu
mber of unknowns for Jacobi EbE
................................
....................
116
Figure 5.14
Speedup vs number of elements for Jacobi
EbE
................................
......................
116
Figure 5.15
Convergence rate of CG in CPU
................................
................................
..............
117
Figure 5.16
Convergence rate of CG in GPU
................................
................................
..............
117
xiv
Figure 5.
17
Convergence rate of BiCGSTABEbE in CPU
................................
.........................
117
Figure 5.18
Convergence rate of BiCGSTABEbE in GP
U
................................
.........................
118
Figure 5.19
Convergence rate of Jacobi in CPU
................................
................................
..........
118
Figure 5.20
Convergence rate of Jacobi in GPU
................................
................................
.........
118
Figure 5.21
Speedup comparison between CGEbE, BiCGS
TABEbE and Jacobi
EbE algorithms
................................
................................
................................
.........
119
Figure 5.22
Speed
-
up versus matrix size of Kiss
et al
................................
................................
.
120
Figure 5.23
Serial addition losing precision. Numbers surrounded by a box rep
-
resent the actual result floating point
value
with 7 digits
................................
.......
121
Figure 5.24
Erratic behavior of gain for various methods
................................
...........................
122
xv
LIST OF
ALGORITHMS
Algorithm 2.
1
Renumbering
................................
................................
................................
..........
45
Algorithm 2.2
Merge Sort
................................
................................
................................
..............
48
Algorithm 2.3
Merge
................................
................................
................................
.....................
49
Algorithm 2.4
Merge Sort
-
Modified
................................
................................
..............................
50
Alg
orithm 2.
5
Merge
-
Modified
................................
................................
................................
.....
50
Algorithm 3.1
Element by Element Jacobi Algorithm
................................
................................
..
62
Algorithm 3.2
Computing
the diagonal
vector {D}
and the right hand side
vector {Q}
..............
63
Algorithm 3.3
Element
by Element Conjugate Gradients Algorithm
................................
............
65
Algorithm 3.4
Elem
ent by Element Biconjugate Gradient Algorithm
................................
..........
68
Algorithm 3.
5
Element by Element with Bi
-
Conjugate Gradient Stabilized method
....................
71
Algorithm 3.
6
Preconditioned Conjugate Gradient
................................
................................
.......
74
1
Chapter 1
Introduction
1.1
Motivation
Existing finite element analy
sis software for electromagnetic fields provides advanced
features for design or analysis problems. However engineering design involves inverse problem
solving. That is, once the requirements are given, we have to find the geometrical shape or the
material
properties, which will satisfy the requirements as closely as possible (if not exactly).
This is done basically using optimization algorithms. The deviation of the performance result of
the analysis of the present shape from the requirements must be form
ulated as an error function,
which must be minimized to get the optimum solution [1]. Most finite element analysis softwar
e
packages do not support this.
Therefore this must be done using a trial and error process.
However a trial and error process alone
will be so inefficient that it is not practically possible to
obtain a solution. Therefore an expert is required to guide this trial and error process using his
experience. However the success is directly dependent on the ability of the human expert and
t
herefore the solution obtained may deviate from the most optimum solution possible (Note that
computer based optimization solutions also do not guarantee the global minimum, however they
can reach a far better solution than a human expert).
There are some
finite element analysis software packages that support optimization.
However such software can only be used for the optimization of special classes of problems. A
totally new program has to be developed in order to solve a new type of problem. This is an
e
xpensive and time consuming task. This forces the designers who do not have the resources to
2
develop a separate optimization program for their specific need, to use the old trial and error
approach with traditional finite element analysis programs. Actuall
y, it is usually not economical
to develop a separate program if we have to solve a given problem only once or twice.
Figure 1.1: Regular finite element analysis
Figure 1.1 shows the regular finite element analysis process. Figure 1.2 shows finite
e
lement optimization using the zeroth order genetic algorithm (GA) and gradients based
algorithms. If we use GA optimization we can speed
-
up our solution time through parallelization
on a GPU. In the genetic algorithm, the design parameter vector {
h
} is bin
ary
encoded in general
3
[2]. A chromosome is a vector {
h
}. Its fitness score f is defined in
terms
of the object function
Tho
ugh GA is practicable and gives a faster solution when parallelized [3], it is slow in
our experience as a single process when compared with the gradient optimization methods. In
sequential CPU computing, the fitness value is calculated for each chromosome
one by one.
When the population is high it takes a very long time to converge. We use GPU computation to
overcome this problem [3]. We launched GPU kernels for computing the fitness value. So the
fitness value will be calculated simultaneously for each
chromosome in the population (Figure
1.2) [4]. Genetic algorithm based finite element analysis in magnetics has been carried out by
many researchers [5, 6, 7, 8, 9, 10] over the past 20 year period.
In gradients based optimization, the changes in parameter
s of device description {
h
}
are
against the gradient of the object function F because in one
-
dimensional analogy the minimum
point is to the right of locations with negative gradient and to the left of those with positive
gradient:
where the amount of change
is determined by a line search [11]. The computation of the
gradient
F
h
}) was previously by finite difference, computing
F
through a finite
element solution corresponding to a given {
h
} and then in turn changing each component
h
i
by
an infinitesimal amount and re
-
F
h
i
F
h
i
. Thus the component of
F
at each iterative step with n components of {
h
}
took n + 1 finite element solutions and then
once the direction of change of {
h
F
, is established several more finite element solutions
progressively increased and the
4
problem iterativel
y solved until the minimum of
F
in that direction is identified [11]. Each
changed {
h
} means a new geometry and therefore a new mesh. For a seamless iterative process,
automatic mesh generators are required that can yield a mesh corresponding to a given {
h
}.
However with gradient based algorithms the matrix solution may be parallelized whereas with
the GA, both matrix solution and optimization
may be parallelized through forking within a fork
[12].
Figure
1.2:
FEM
with
optimization
algorithms
Therefore
, we have developed general
-
purpose two
-
dimensional and three
-
dimensional finite
5
element analysis inverse optimization procedure using GA software tools on a GPU for analysis
and geometrical shape optimization in design and NDE problems.
1.2
The Finite Elemen
t Method
1.2.1
Introduction
The finite element method is a method used for solving partial differential equations
numerically. It is widely used to solve electromagnetic problems [13], structural design
problems [14] etc. using computers. This method is gene
rally used to find the distribution of a
certain field (e.g. magnetic vector potential, electric scalar potential, tension, fluid velocity, etc.)
in space governed by a given differential equation called the governing equation.
In order to solve the proble
m numerically, the solution region is divided into a finite
n
um
ber of elements (which are not needed to be uniform). The potentials are assumed to have a
known mathematical variation called the trial function (e.g. linear variation, quadratic variation,
et
c.) over each individual element. The field is postulated to be interpolations from its values at
certain nodes. Note that these elements must be small enough for this assumption to be valid
[13]. The problem is then solved to get the potentials at the in
terp
o
lation nodes of these elements.
Then the potential at any given point in the solution space can be found by interpolating these
now known potentials using the trial function.
1.2.2
T
w
o
-
Dimensional
Problems
In most two dimensional finite element analyses
, the space is divided into a mesh of
triangles. Points in these triangles are called the interpolation nodes of the mesh. The variation of
6
the potential over the triangles is assumed to be defined by a given trial function (most often a
first order trial
function). The objective is to find the potentials at the nodes of the mesh so that
the potential at any given point inside a triangle can be found using the trial function. To do this,
we develop one equation per unknown node in the mesh. Then, this set o
f equations must be
solved to find the potentials at each node. Since these equations alone are not sufficient to get a
unique solution, some boundary conditions must also be considered.
1.2.3
Trial Function
Since we use only first order trial functions in ou
r software tools, let us consider them in
detail. In two
-
dimensional cartesian coordinates, a first order trial function can be expressed as
follows [13, 15]:
where x and y are cartesian coordinates and
a, b
and
c
are constants for the triangle. Triangular
coordinates are used in finite element analysis, as these normalize
d coordinates
pr
o
vide
se
v
eral
Figure 1.3: Definition of H
1
and h
1
for a triangle
7
advantages in
analyzing
properties inside triangles. Triangular coordinates
point P inside a triangle are defined as follows (see Figure 1.3).
where
H
1
is the shortest distance from
N
1
to
N
2
N
3
and
h
1
is the shortest distance from P
to
N
2
N
3
,
and so on (see Figure 1.3). If
S
is the area of the triangle,
Considering the three triangular areas in the triangle separated by dotted lines, (Figure 1.3)
F
Using 1.7
Using linear interpolation, the Cartesian coordinates of a point within the triangle can be written
as follows [15].
8
These are exact because x and y are linear. Solving 1.10, 1.11 and 1.12,
Where
i1
= i
mod 3 + 1, and
i2 = i1
mod 3 + 1 Equation 1.13 can be re
-
written as,
w
here
)
and
Using these triangular coord
inates, the first order trial function A is expressed as follows
where
is the potential at the point (
) and
A
1
, A
2
and
A
3
a
re the potentials at
the node
points 1, 2 and 3 of the triangle.
This can be verified by considering that the triangular coordinates of the three nodes of
the triangle are (1,0,0),
(0,1,0) and (0,0,1). By substituting these points, one will get
A
1
, A
2
and
9
A
3
as the potentials at the three nodes and a linear variation of potentials along
any given line
inside a triangle. This trial function also provides a continuous variation of potentials from
triangle to triangle and a continuous first derivative from tri
angle to triangle along the tangential
direction of the boundary. Equation 1.19 can be re
-
written as,
where
and
From 1.16 and 1.17,
From 1.19,
10
Let us examine another property of these triangular coordin
ates, referring to Figure 1.4
Figure 1.4: Integration of triangular coordinates
w
here S is the area of the trainable. That is,
Where
is a metric tensor
11
1.2.4
Solving Magneto
-
Static Problems
This solution us
es first order triangular elements and materials with linear magnetic
prop
erties at low frequency for simplicity. The following differential equation (From
w
here
is the mag
netic field intensity and
is the current density. Since
and
where
is the reluctivity
From 1.31, the energy functional can be derived. Since the energy is at its minimum at the stable
state, this function should achieve
its minimum at the point of the solution. This
f
unction is
called the Lagrange function:
The solution region has been divided into triangles. Therefore now the total energy can be
written as the sum of the energies of each individual triangle.
12
Let
us integrate this by parts,
From 1.28,
From 1.33, 1,34 and 1.35
where
and
13
From 1.33 and 1.36 we can get
To get the solution we have to minimize
L{A}
Therefore the final solution can be obtained by solving,
This can be re
-
written in
the form,
and the equation solved for {A}.
14
Figure 1.5: Known elements and unknown elements in matrix
1.2.5
Boundary Conditions
We basically use two types of boundary conditions, namely Neumann and Dirichlet [13].
Dirichlet boundary conditions mea
n that the potential along the boundary is fixed at a given
value and Neumann boundary conditions mean the derivative of the unknown potential at the
boundary along the normal direction is zero. Dirichlet boundary conditions can be implemented
by consider
ing the node points on the boundary to have known values. Neumann boundary
conditions are implemented automatically if Dirichlet conditions are not used at a boundary [13].
They are said to be natural. If Neumann boundary conditions are used, it means the
potentials at
some points along the boundary are known. Therefore the vector A can be broken into two as
potentials at nodes with known potentials
and potentials at nodes with unknown
potentials
. Therefore Equation 1.43 can be written as (see Figure 1.5),
This is the matrix equation used by this software to find the final solution of finite
element analysis. There is no minimizati
on of L with respect to
15
1.2.6
Three
-
Dimensional Problems
Corresponding to the triangle in two dimensions, the tetrahedron is a convenient element
to use in three dimensions. For the tetrahedral coordinates
[13]:
where now
is the height of a point from the opposite triangular face of a tetrahedron and
is
the height of the vertex opposite that face. First order interpolation,
Solving the preceding four equations for the four
s, we have
16
Wh
ere,
where
or a cycle permutation of them. A is the potential at the
point
(x, y, z) and
A
1
, A
2
, A
3
and
A
4
are the potentials at the node points 1, 2, 3 and 4 of
a tetrahedron
[13].
Figure 1.6: Traditional engi
neering design process
17
1.3
Inverse
Optimization Problems
The inverse problem, the more practically realistic problem, is synthesis. That is, wanting a
performance, computing t
he system description from it.
Thus the computational design
assignment may be t
his: compute the size and other descriptions of a motor that can produce so
much torque. Figure 1.6 shows the traditional engineering design process. An expert decides
which device to use and for that device assigns parameters to use and then checks the
performance by making and testing. Finally, an expert has to make changes in parameters if
needed. This process repeats until we get the desired performance. In the 1960s, the analysis
phase was introduced in place of make and test before checking the per
formance. Thereafter the
expert who makes the entire decision about the design process is replaced by powerful software.
Figure 1.7 shows the modern design process using AI techniques, knowledge base etc. to make
device selection. Optimization
algorithms a
re
used to select parameters in order to get the desired
performance. In the modern design process, the analysis phase is replaced with synthesis. The
earliest persons to automate this cycle in magnetics were Marrocco and Pironneau in 1978 [16].
In 1976 a
parallel work with [16] by Arora and Hang [17] also established finite element
optimization in magnetics.
An erratic undulating shape with sharp edges arose when Pironneau [18] optimized a
recording
-
head to achieve a constant magnetic flux density and thi
s was overcome through
constraints [19]. Haslinger and Neittaanmaki [20] suggest Bezier curves to keep the shapes
smooth with just a few variables to be optimized, while Preis, Magele, and Biro [21] have
suggested fourth
-
order polynomials which when we tri
ed gave us smooth but undulating shapes
because of the higher order. Most of the required shape changes can be achieved with linear
variations [22]. Figure 1.8 shows that without regularity constraints, sharp corners
and jagged
18
Figure 1.7: Modern design
process
contours arise in designing a pole
-
face for constant vertical flux density. Figure
1.9 shows the
shape is smoothened further and there is no sharp corner when B
-
spline curves are used but the
undulation is mathematically correct though not pract
icable.
Figure 1.8: Equipotentials of vector potential at
the optimum
Figure 1.9: Equipotentials of vector p
o
tential at
the optimum with B
-
spline
in
terpolation
In optimization problems like NDE of steel plates, besides the detection of cracks, what
is
also important is their characterization. Characterization is necessary for determining whether
any discovered crack demands withdrawal of the part from service [23, 24]. In eddy current
19
crack identification the response of a part to an eddy current te
st coil is compared to the response
without a crack [4]. When different, the presence of the crack is flagged. But to characterize the
defect, the computed response from eddy current analysis with a crack described by parameters
is optimized to match measu
rements with computations. When the two match, the parameters
describe the defect [4]. In inverse electromagnetic problem solutions by the finite element
method, we require three tools. They are a special mesh generator, efficient matrix equation
solvers
and optimization algorithms. These are for the 3 major steps of finite element
optimization. They are as follows,
1.
Preprocessing: The essential operation for optimization is involved with this step. The
design for optimization is parameterized before mesh
generation. As the geometry
defined by parameters is optimized, it changes shape, and a new finite element mesh must
be created without stopping the optimization iterations to create a new mesh. Several
finite element mesh generators exist in the public do
main [25, 26, 27, 28, 29, 30, 31, 32],
a few even based on a parametric device description. The required mesh generator must
therefore support parameter based mesh generation and be completely automatic once the
optimization process begins. That means we
must be able to change the physical shape
of the problem during run time and generate the mesh without stopping. Such mesh
generators as do exist are rare, commercial and not easily available to researchers except
at a great cost and never with the code t
o modify them to suit individual needs. We
propose taking a regular open source mesh generator and writing a script
-
based interface
as open source to run nonstop for two and three dimensional optimization problems. In
this thesis we are going to develop pa
rametric mesh generators that run nonstop and
seamlessly through optimization iterations to convergence.
20
2.
Solution: The biggest load in finite element field computation is in matrix solution.
Recently GPU computing has had great success in many very large n
umerical
computations (For example [33, 34, 35, and 36]). GPU
-
based finite element computation
offers massive parallelization. This thesis will investigate speeding up using the GPU in
sparse matrix computation. We will also examine the memory needs. For
this purpose we
will investigate parallel EbE processing by Gauss iterations [24] and preconditioned
conjugate gradient [23].
3.
Optimization: The parameters need to be optimized in NDE as well as synthesis to make
the computed fields match the desired perf
ormance. As we discussed in the introduction
section, we can use zeroth order optimization methods like the genetic algorithm, bee
colony algorithms [37] etc. Moreover, if gradient methods are to be used in the mesh
topology the nodal connections need t
o be held fixed to preserve C
1
continuity of the
object function lest the mesh
-
induced minima are seen by the optimization algorithm as
from the physics of the problem
This thesis mainly focuses on the first two steps because several open source
optimizat
ion algorithms are available on the web (for example, [38]) and a colleague in the group
is using GPU computations to parallelize genetic algorithm optimization [4]. However his code
will be used for the test problems.
21
Chapter 2
Parameter Based
Unstructu
red Mesh
Generator for
Two
and
Three Dimensional Problems for Seamless Optimization
2.1
Background
Figure 2.1 shows the design cycle for a geometric optimization problem. In the
beginning, the
initial geometric
positions are either selected by the subjec
t expert or in the
absence of the expert, randomly selected. In the next step we generate the mesh for the current
geometry, measure the object value by a finite
element solution
and check whether it is minimum
or not. If this is a minimum we terminate th
e loop;
otherwise we
change the geometric
parameters and do the same to procedure again.
Mesh generation is therefore a very important part of finite element analysis. But mesh
generators do not support parameter
based mesh
generation for optimization. F
or real world
inverse problems we need a mesh
generator as
a library where the design is described
by
parameters
and it takes
as input and returns the mesh from iteration to iteration.
Figure 2.2 shows the design cycle for an inverse problem. In the first step the design
parameter set
is randomly selected (or estimated by a subject expert) and thereupon we
generate the corresponding mesh, get the finite element solution and measure the object
value
(often conveniently defined as a least square difference between design objects desired and
22
Figure 2.1: Design cycle for
a
geometric optimization
Figure
2.2: Design cycle for an inverse problem
23
computed) and check whether it is minimum or not. If this is a minimum we terminate the loop;
otherwise we change the design parameters and do the same procedure again. This procedure
repeats until the object value
is acceptably small. For optimization to go on
non
-
stop, the mesh
needs to be generated for the new parameters without user intervention. In NDE the only
difference is that the object function compares measured values with those
computed from
presumed val
ues of
being sought.
In this section of this thesis we describe the necessity of a parametric mesh generator that
runs nonstop and seamlessly through optimization iterations to convergence. Such mesh
generators as
do exist are rare, commercial and not easily
available to researchers except at great
cost and never with the code to modify them to suit individual needs. Besides, the typical mesh
generator requires some man
-
machine interaction
to define the points and boundary conditions
and does not work for nons
top optimization iterations. We will take a regular open source mesh
generator and write a script
-
based interface as
open source to run nonstop for optimization.
There are many mesh generators available on the web [25, 26, 27, 28, 29, 30, 31, 32] and
in th
e literature; some packages are open source software and others commercial. But they
usually do not support
parametric mesh
generation. However they do support features we would
like in a
parametric mesh
generator. To summarize some notable mesh generato
rs, Triangle,
which we use, generates exact
Delaunay triangulations
, constrained Delaunay triangulations,
conforming Delaunay triangulations, and Voronoi diagrams to yield high quality triangular
meshes without large angles as suited to finite element anal
ysis [25]. AUTOMESH2D
gen
erates
CGAL [
28], ADMESH [30], and Delaundo [39] are all notable for special features. Indeed
there
are
parametric mesh
generators, for example
[40]. However it is not publicly available. Another
24
such mesh generator is, CEDRATs suite Flux
whereas
parameters are
changed, the mesh is
generated and the device analyzed to study the effect of parameters on performance [41]. The
same approach has been
taken in NDE studies [42]. However, the works of [40, 41] are not
intended for non
-
stop optimization. For that CEDRAT uses a script based scheme called GOT
-
It
[
43] which passes
parameters to
the program Flux and gets the results back for the optimization.
GOT
-
It, named FGot, is offered free to students although, but there again, the code is not
accessible.
Figure 2.3 shows the performance comparison of triangulation
using CPU
-
Delaunay tri
-
angulation (DT) and GPU
-
DT for different number of points [44]. We can see the gain is
nominal compared to finite element solvers which are given in Chapter 5. Therefore it is not
worthwhile parallelizing mesh generation
Figure 2
.3: Performance comparison of triangulation u
sing CPU
-
DT and GPU
-
DT for dif
ferent
number of points
Moreover, if gradient methods are to be used, the mesh topology given by the nodal
connections need to be held to preserve C
1
continuity of the object fu
nction lest the mesh
25
induced fictitious minima are seen by the optimization algorithm as from the physics of the
Figure 2.4: Problem specific parametric mesh generators
problem [45].
For these reasons very problem specific mesh generators are constru
cted by
researchers. As an example, when an armored vehicle is targeted by an improvised explosive
device, the armor is inspected by an eddy current test probe to characterize the interior damage to
determine if the vehicle should be withdrawn from deplo
yment.
Figure 2.4 shows a problem
specific parametrically described crack in steel excited by an eddy current probe, where P1
-
P6
are the lengths that represent the position and shape of the crack, J is current density and µr is
relative permeability. I
n this NDE exercise the parameters need to be optimized to make the
computed fields match the measurements. The mesh has been
constructed for
the specific
problem. As the
parameters change
, the mesh topology is fixed, pulling and crunching triangles
as sho
wn
in Figure 2.5. Such problem specific meshes are a headache because they restrict the
geometry, lack flexibility and take time for modifications. Hence the need for general
-
purpose
parametric mesh generators. We can use zeroth order optimization m
ethod
s for which C
1
continuity is irrelevant, such as the genetic algorithm,
bee colony
algorithms etc.
without pulling
and crunching meshes for inverse problems; e.g. in reconstructing cracks to characterize interior
26
defects, or designing power devices. For
non
-
stop optimization, the
commercial code
ANSYS
offers a gradients
-
based optimization suite [46], but gives
little information
on the techniques
employed. That is, although these methods are known within the companies, they are rarely
published. There ar
e other companies, particularly from structural engineering, that also offer
gradients
-
based optimization. A huge lacuna is how
they address
the problem of mesh
-
induced
minima. These artificial minima are seen as physics
-
based object
function minima and t
he code
tends to get stuck at these. Other approaches like a mathematical distance function to model the
geometry lie in the domain of specialized efforts [47].
Figure 2.5: Elastically deformed problem specific NDE mesh: As defect moves
Like in 2D ther
e are also many 3D
-
mesh generators available on the web and in the literature
[29, 43, 48, 49]. Some are open source software and others commercial. TetGen [48, 49] is for
tetrahedral mesh generation and is more effective than previous methods at removing
slivers
from and producing Delaunay optimal tessellations [50]. Each of these mesh generators has its
own merits but none of these mesh generators supports parametric mesh generation.
We will take the freely available, widely published, nonparametric, open
source 3
-
D
27
mesh generator TetGen [48, 49] which like all published mesh ge
nerators involves user input in
the pro
cess
of mesh generation. Here also we use a script file which uses a parametric
description of the system to start the mesh from initial para
meters and thereafter runs it
seamlessly without stopping as the parameters are updated by the optimization process. Sample
3D meshes are shown in Figures 2.6 and 2.7
Figure 2.6: 3D mesh for NDE problem: As parameters change
2.2
Mesh Generation
2.2.1
Introducti
on
The finite element method requires the problem space to be split into a finite number of
finer elements. The preferred element shape for two
-
dimensional problems is the triangle and for
three
-
dimensional problems it is the tetrahedron [13]. This set o
f triangles/tetrahedrons is called
the mesh. If the mesh is finer, it will produce a better result in FEA [13], although it will increase
the processing time. If we can have a finer mesh only at the places where we want a more
accurate result, then we can
reduce the processing time considerably. This is called
adaptive
28
mesh generation [13]. Apart from this, the shape of the triangles/tetrahedrons in the mesh has a
great effect on the final solution of finite element analysis. If we have very obtuse angles i
n the
triangles of the mesh, they will introduce considerable errors into the final solution. Therefore,
all these facts have to be considered when generating a mesh for FEA problems.
Figure 2.7: 3D Mesh for motor problem
There are many mesh generation
algorithms available. Basically we can divide them into
29
two categories. They are,
1
Algorithms which generate a crude mesh to define the basic geometry and then refine it
to get a good quality mesh
2
Algorithms which generate a fine mesh from the beginning,
The Advancing Front Algorithm [51] and Quad
-
tree Algorithm [52], Delaunay based
Algorithms [25] are the examples of the second category. These methods can produce very good
quality meshes. Delaunay based algorithms are well
-
known and commonly used algorit
hms for
quality mesh generations and therefore we use them.
2.2.2
Delaunay Based Methods
A Delaunay based meshing approach is a concept which consists of two tasks:
1
The mesh points are first created by a variety of techniques; e.g. advancing front, octree,
or
structured methods
2
The Delaunay triangulation is first computed for the boundary without internal points.
The mesh points are then inserted incrementally into the triangulation/tetrahedralization
and the topology is updated according to the Delaunay defin
ition.
There are many Delaunay triangulation algorithms; the incremental insertion algorithm,
the divide and conquer algorithm, the plane
-
sweep algorithm etc. In this work, we take the
freely available, widely published, non
-
parametric, open source two
-
d
imensional (2D) mesh
generator Triangle [25]. These three mentioned algorithms have been implemented in the
Triangle [25] mesh generator. We then adapted it for seamless optimization [24].
30
Figure 2.8: Applying Delaunay triangulation
2.2.3
Delaunay Triangul
ation and Constrained Delaunay Triangulation
2.2.3.1 Introduction
Delaunay triangulation [13] is a technique used to improve the quality of the mesh by
simply rearranging the nodal connections that make triangles. This algorithm ensures that there
wi
ll be no obtuse angles in the mesh other than in the triangles at boundaries. This is done by
rearranging the triangles, if the uncommon point of the
neighboring
triangle lies inside the
inscribing circle of one of the triangles, as
shown in
Figure 2.8. Th
is can be identified by
calculating the two angles
corresponding to
the uncommon points. By the properties of cyclic
quadrilaterals, when the sum of these angles is greater than 180
, the triangles must be
rearranged.
In Figure 2.8, the triangle QRS lies inside the inscribing circle of the triangle PQS. This
can be recognized by summing the two opposite angles, A and B (These are the angles
corresponding to the uncommon points for th
e two triangles, P and R). Since the sum of A and B
is greater than 180
the two triangles are rearranged as PQR and PRS as shown in the figure.
Now the point of the opposite triangles is not inside the inscribing circles.
Constrained Delaunay
triangulatio
n is a generalization of the Delaunay triangulation that
forces certain required
31
segments into the triangulation. An example is shown in Figure 2.9. Both triangles are in
different regions where each may have different properties. So they cannot be flipped
like the
previous case
Figure 2.9: Example for constrained Delaunay triangulation
2.2.4
Algorithms for Constructing a Delaunay Triangulation
2.2.4.1 Introduction of Constructing a Delaunay Triangulation
There are many Delaunay triangulation algorithms
; for example Divide
-
and
-
Conquer [53],
Sweepline [54], Incremental insertion [55] etc. As Su and Drysdale [56] found, the divide
-
and
-
conquer algorithm is fastest; the second is the sweepline algorithm. The incremental insertion
algorithm performs poorly,
spending most of its time in point location. Su and Drysdale
introduced a better incremental insertion implementation by using bucketing to perform point
location, but it still ranks third. A very important development in the divide and conquer
algorith
m is partitioning the vertices with vertical and
horizontal
cuttings [53].
2.2.4.2 Divide
-
and
-
Conquer Algorithm
The point set
v
is divided into halves until we are left with two or three points in each
subset. Then these smaller subsets can be linked
with edges or triangles which is called a
32
Voronoi diagram. Now we have a set of Voronoi
diagrams because
we have a set of smaller
subsets. In the conquer step, we merge the subsets to get the whole Voronoi diagram (see Figure
2.10). The dual of the Vorono
i diagram is the mesh [25].
Figure 2.10: Divide and conquer algorithm
Figure 2.11: Sweep
-
line algorithm
2.2.4.3 Sweep
-
line Algorithm
The sweep
-
line algorithm uses a sweep
-
line which divides a working area into two sub
-
areas. This process const
ructs the Voronoi diagram
-
the dual graph of Delaunay triangulation
(shown in Figure 2.11). This algorithm was introduced by Fortune [54]. Shewchuk [25] pre
-
sented a successful algorithm for constructing a higher
-
dimensional Delaunay triangulation.
Figur
e 2.11 explains how the sweep
-
line algorithm works. There is a vertical line which is called
a sweep line in the Figure 2.11. When this line passes a point this algorithm creates a Voronoi
diagram with other points which are already passed.
33
Figure 2.12:
Incremental insertion algorithm
2.2.4.4
Incremental Insertion
Algorithm
Here we generate a fictitious triangle containing all points of V in its interior. The points
are then added one by one. Figure 2.12 (b) shows the mesh after the first point is
added. Figure
2.12 (c) shows how to handle the insertion of the second or subsequent point. The idea is to draw
circumcircles of a particular triangle where the new point is located and neighboring triangles,
select the triangles whose
circumcircles cover
the new points, remove the interior edges of
selected triangles
; and
finally, a new point is connected with every point of a created polygon.
This algorithm always maintains the Delaunay triangulation of the points.
2.2.5
Mesh Refinement
There are many mesh r
efinement algorithms available. Most of them are based on
are finer
geometrical shapes
and fewer nodes at other regions. The basic idea of the algorithm is
to maint
ain a triangulation, making local improvements in order to remove the skinny triangles.
34
Figure 2.13 shows the basic idea of avoiding skinny angles.
Figure 2.13: Delaunay mesh refinement
Figure 2.14: Delaunay mesh refinement between two regions
Both
triangles are in different regions (shown in Figure 2.14); each may have different
properties. So they cannot be refined like in the previous case. Here the algorithm follows the
same idea without violating the boundary of separate regions
2.2.6
Three Dimensi
onal Mesh Generation
Since almost every real world problem is three
-
dimensional, we extend our two
-
dimensional
work to three
-
dimensional geometric parameterized mesh generation for optimization
prob
lems
in design and NDE. Mesh generators in electrical e
ngineering commonly use Delaunay
35
tetrahedralization and constrained Delaunay tetrahedralization for quality meshes. The
In
cremental Insertion algorithm is a well
-
known algorithm for tetrahedralization [48]. The worst
case runtime of this algorithm is of
O
(n
2
),
but the expected runtime for this
algo
rithm is of
O(n
log n)
[25]. Constrained Delaunay tetrahedralization was first considered by Shewchuk [25].
Gmsh [29] and TetGen [48] are the better
-
known free, open source 3D mesh generators. TetGen
[48] uses
a constrained Delaunay refinement algorithm which
guar
antees termination and good
mesh quality. A three
-
dimensional Delaunay
triangulation
is called a Delaunay
tetrahedralization. Ideas for Delaunay operation, constrained Delaunay triangulations, and mesh
refinements are
the same but only the dimension is different. 3D objects are usually represented
by Piecewise Linear Complexes (PLCs) [48]. The design goal of TetGen is to provide a fast,
light and user
-
friendly meshing tool with parametric input and advan
ced visualization
capabilities. Even though TetGen and Gmsh [29] are great open source mesh generators, from an
inspection of the code, it is very hard to use for non
-
stop optimization problems. For the non
-
stop
optimization that ANSYS offers [46], it give
s little information on the techniques employed.
CEDRAT uses a script based scheme called GOTIt [58]. FGot, is offered free to students
although the code is not accessible and therefore will not permit modification nor work for
industry
-
scale problems [58]
. Here TetGen [48] is used as a backend for parameterized meshes
for optimization. Therefore we will develop it on our own and make it
available as
open source.
It is always possible to tetrahedralization a polyhedron if points are not vertices of the
poly
hedron. Two types of points are used in TetGen:
1
The first type of points are used in creating an initial tetrahedralization of PLC.
2
The second type of points are used in creating quality tetrahedral meshes of PLCs.
The first type of points is mandatory in
order to create a valid tetrahedralization. While
36
the second
type of points is
optional, they may be necessary in order to improve the mesh
quality.
2.3
Parameterized Mesh Generation
Parametric mesh generation is a very important part of finite element opt
imization
problems. In optimization problems, parameters describe the device in terms of materials,
currents, and dimension. During optimization, as these parameters are changed to minimize an
object function, a new mesh has to be generated and a new fin
ite
element solution
obtained to re
-
evaluate the object function. At each iteration of an optimization algorithm, given the variables
as input, the mesh is generated without user intervention. Finite element mesh generators exist
in the public domain, a
few even based on a parametric device description. The typical mesh
generator requires some man
-
machine interaction to define the points and boundary conditions,
and does not work for non
-
stop optimization iterations for which we need a mesh dynamically
ev
olving through the iterations with optimization variables as changing parameters. Such mesh
generators as do exist are rare, commercial, and not easily available to researchers except at great
cost and never with the code to modify them to suit individual
needs.
2.4
New Approach to Parameterized Mesh Generation
We take the freely available, widely published, nonparametric, open source 2
-
D mesh
gen
erator Triangle [25] and 3
-
D mesh generator TetGen [48, 49] which like all published mesh
generators (with th
e exception of commercially restricted ones whose methodology is not
published) involves user input in the process of mesh generation. But for use in optimization we
37
cannot stop the iterations to make input [24]. To address these problems we use a script
file
which uses a parametric description of the system to start the mesh from initial parameters and
thereafter runs it seamlessly without stopping as the parameters are updated
by the optimization
process [24]. The script file provides the user input whi
le the code is iteratively running, input
that is normally made in the mesh generator being used, but for which the optimization iterations
cannot stop [24]. Figure 2.15 explains our approach to parametrized mesh generation. In the first
step, the initial
input which is described in the following sections is given to our mesh generator.
Next, the mesh generator calls our chosen open source mesh generator to generate the mesh.
After that, the FEM solver uses the mesh to solve the problem. Next, the optimiza
tion algorithm
updates the parameters which are accepted by our new mesh generator to generate the new mesh.
Figure 2.15: My approach to parameterized mesh generation
38
2.5
Data Structure and User Interface
2.5.1
Data Structure
The data
-
structure used in t
hese mesh generator software suites contains the following col
-
lections of objects:
1.
Points list
2.
Regions list
3.
Properties list
4.
Variable points list
5.
Measuring points list
6.
Segments list
7.
Mesh details list (triangles/ tetrahedrons)
8.
Holes list
9.
Boundary condition
s
Points List
:
The point collection contains all the points used in the prob
lem definition and
solving pro
cess. Each point contains the coordinates of the relevant fini
te element node.
For the three
-
dimensional mesh
generator the list has x, y and z coord
inates. For the two
-
dimensional mesh generator, it should have x and y coordinates only.
Regions List
:
The region list contains all the regions used in the problem definition.
Region here means a material
-
source
combination which
has different physical
characteristics.
39
Properties List
:
Properties list contains all the properties of each region. Each region has
a set of properties.
Variable points list
:
The variable points list contains the information about the points to
be moved according to the chang
es of the parameters, during optimization.
Measuring points list
:
The measuring points list contains the points in the solution space
where we want to find the potentials, flux density, etc.
Segments list
:
The segments list contains the edges of the prob
lem model. Each problem
may have many segments.
Mesh details list
:
The mesh details list
contains the
triangles/
tetrahedrons of
the mesh.
This collection is empty until the mesh is generated. For a two
-
dimensional mesh
generator each
triangle contains ref
erences to its three vertex points. For a three
-
dimensional mesh
generator each tetrahedron contains references to its four vertex points.
Holes list
:
Holes are a special kind of region where we do not need to generate the mesh.
A hole list contains all t
he holes used in the problem definition.
40
Boundary conditions
:
There are two types of boundary conditions that are usually used
in FEA problems. They are
1.
Neumann boundary conditions
2.
Dirichlet boundary conditions
A Dirichlet boundary condition means the
potential along the given boundary is fixed and a
Neumann boundary condition means the derivative of the potential along the given boundary is
fixed, and usually zero. Dirichlet boundary conditions can be
implemented by
keeping the
potential of all the poi
nts on the given boundary to be fixed at their given value. The user can
select any segment, and define the potential of that segment. If the potential of the segment is
set, then all the points which will be added onto that line will get this potential a
utomatically.
The boundaries that do not
implement Dirichlet
conditions will automatically act as
Neumann
boundaries during
the FEA process. This is because it is natural to the finite element formulation
[13]. Therefore no special
provisions are
needed
to define Neumann boundaries.
41
Figure 2.16: Sample input file for mesh generator
2.5.2
User Interface
2.5.2.1
Introduction
A proper user interface is very important for good software. If the user interface is not
friendly to use, even if it is very
powerful, most users will not be able to use it effectively.
Therefore the user interface is carefully designed and used in this software as described in the
next subsection.
2.5.2.2
Defining the Geometrical Shape
Since we are providing the code as open sourc
e, it is necessary to describe it for other
users to re
-
engineer the code. This software is made to import drawings from the text file format.
Figure 2.16 shows the sample input file of our mesh
generator
adapting Triangle using a script
file to run non
-
st
op from iteration to iteration without manual intervention.
42
1
The mesh generator code does not care about lines which start with #. We can write
comments using the # sign.
2
First interpreted row:
-
The first number represents the num
-
ber
of nodes in the domain; the second number represents the number of variable points.
These variable points are also nodes but their coordinates may vary with the optimization
iterations.
3
From the second row to row number 10 (that is, 9+1) in the domain, th
ere are, associated
with that row,
The integer represents the node number. These must be numbered consecutively, start
-
ing from one. The two floating point numbers represent the coordi
nates of this node. For
the three
-
dimensional mesh
generator, the three floating point numbers represent the
coordinates of this node.
4
The next segment of this input file represents variable points. These are also in the same
format as the previous.
5
In the
third part of the input file we have segment details. The first row of this file has the
number of segments. From the second row onwards it has 4 columns. The first column
involves the segment numbers which must be numbered consecutively, starting from on
e.
The next two columns are node numbers. Each row represents a segment. The fourth
column is a marker. A marker has different integer values; it can be used to define the
boundary condition. Here
-
1 means it is not on a boundary. If we have to define the
boundary condition we have to give any positive integer to the marker. Then we can
assign the boundary value using these markers. For the three
-
dimensional mesh
43
generator
, the
first row of this segment of this file has the number of faces and
a
boundary ma
rker. Each face has a list. The first part of the list has the number of
polygons in the face, number of holes on the surface and the boundary marker. From the
second row onwards, polygons and holes of the surface are defined.
6
The next segment of this file
is the definition of boundaries. The first row contains the
number of boundary conditions. From the second row onwards the first column
represents the numbering; the second column is a marker number which has been already
defined in the previous part (the
segment part). The third column represents the
boundary value for a particular marker.
7
The subsequent segment is a definition of the regions of the problem. Here a different
region means different
materials so
it has different properties. The first row r
epresents
the
number of regions in the domain. From the second row, each row has five columns. The
first column represents the
numbering as
usual. The second and the third columns
represent the coordinates. These coordinates are used to identify the regio
n. The point
may be any point in the relevant particular region. The fourth column is an integer which
starts from 1. It can be used to assign
properties to
these regions. The next column is not
used here because it is an area constraint coming from Triang
le but, as mentioned, it is
not used by us.
8
The next segment of this input file represents the number of holes. The hole is a region in
which we do not want to generate the mesh. In the first step we define the number
of
holes in the problem domain. Form t
he next row, there are three columns. The first
column represents the hole number. The second and third columns represent the x and y
coordinates of any point within that hole.
44
9
The segment thereafter is for the measuring point list for object function eval
uation.
The
first part is the number of measuring points where we are going to calculate the solution
to get the target solution. Our source code helps users to identify the errors in the input
file. It works very efficiently for any shape of problem domai
n. The sample inputs files
are attached in this thesis as
Appendices B
and C. Users customize their own problem
very efficiently as tried out in our lab [4, 3, 24]. This software is easy to use.
This
software
is well supported in any operating system, i.e.
, Linux/Unix, Windows etc.
2.5.3
Post
-
processing of Meshing
Once we triangulate/tetrahedralize the problem domain, we have to define the boundaries
and boundary values. Upon triangulation/tetrahedralization, we have an element list (node
numbers), the propert
ies of regions and
point list
(for FE solution).
We do
not need to calculate
the solution
of known
nodes so
we have to
separate the known and unknown nodes. This step is
known as
renumbering the
nodes which is also to reduce the profile of the matrix [13]
. In this
process, we
1.
Define the boundary (generally using segment numbers in 2D, faces in 3D)
2.
Get the boundary values (different boundaries may have different boundary values)
3.
Get all nodes which are on boundaries (We used the segment marker list to deter
mine the
boundary nodes)
4.
Separate boundary elements from non
-
boundary elements; and separate unknown nodes
from known nodes
5.
Give the first set of numbers for the unknown nodes and the last set of numbers for the
known nodes
45
6.
Renumber the whole point lis
t based on new numbering.
7.
Renumber the node entries in the triangle list based on the new numbering system.
8.
Get all properties for the particular regions
9.
Assign these properties to all corresponding triangles.
Since real world problem size is typically ve
ry large, this
renumbering process
takes
a very long
time. Algorithm 2
.1 describes the regular
renumbering process
. This algorithm is very inefficient
because each node will be searched for in an index array. The order of this algorithm
is
O
(n
3
).
This s
tep can be improved. In this work the traditional algorithms have been improved based on
the merge sort technique.
46
2.5.4
Approach to Renumbering
Figure 2.17: Simple example problem
Instead of searching for every element from the whole list, this thes
is tracks the node
number changes and updates the node numbers of the mesh. Figure 2.17 shows a simple ex
-
ample finite element problem. The boundary elements are circled. Figure 2.18 (a) shows the
node numbers which are assigned in the mesh generation pro
cess. Figure 2.18 (b) presents the
rearranged nodes which are separated based on whether the nodes are of known or un
-
known
values. Figure 2.18 (c) shows the new numbering system. The boundary elements are shown in
the gray boxes in Figure 2.18. The corre
sponding old number list (c) is renumbered in (b).
Figure 2.19 shows the new numbering of the nodes. Let us take the array of Figure 2.18 (a) and
use a new index which has 1, 2, up to the number of elements. We sort the array of Figure 2.18
(b) using the
merge sort algorithm which will be described in the following section. We apply
every operation of the sort algorithm to a new index array. The resultant arrays are shown in
Figure 2.20. Now if we want to get a renumbered value of a particular node number
we can
directly access it from the updated index array. For example 1 is replaced by 4, 2 by 1, 3 by 5,
and so on.
47
Figure 2.18: Numbering and renumbered nodes
Figure 2.19: Renumbered nodes
Figure 2.20: Sorted version of (b) and corresponding inde
x changes
48
Figure 2.21: Merge Sort
2.5.5
Merge Sort
Sorting is a technique that arranges the elements in a certain order. There are many
sorting algorithms such
as counting
sort [59], bucket sort [59], radix sort [59], merge sort [59],
heapsort [59], qu
icksort [59] etc. Each algorithm has it own advantages. Merge sort is one of the
best sort algorithms which has
n log (
n
) time complexity for each average, best and worst case
time complexities [59]. Another very important reason for selecting merge sort
is that this
algorithm is easily parallelizable
-
parallelization on the GPU is the main theme of this thesis.
Figure 2.21 [60] describes the merge sort in a graphical way. First the list is split into two pieces
and then each is split into two further pie
ces. This process is repeated until arriving at the
singleton list. Then we work our way back up the recursion by merging together the short arrays
in
to larger arrays. Algorithms 2.
2 and 2.
3 describe merge sort.
49
Algorithm 2.
50
Al
gorithm 2.
2.5.6
Modified Form of Merge Sort for Renumbering
As we described in section 2.5.4, we track the node number changes using a form of
merge sort instead of searching for every element from the whole list. This new algorithm has
n
log
(
n)
time complexity but the traditional method has time complexity of O(n3). We define an
array of size given by the number of nodes in the mesh. The array has 0, 1, 2, up to (number of
nodes
-
1). We applied every operation of the sort algorithm to thi
s newly
define
d array.
Algorithms 2.
4
and 2.
5
describe the idea underlying what we have used.
51
52
Chapter 3
Low Memory
High Speed
FEM Solvers Using the GPU
3.1
I
ntroduction
It has been more than 20 years since genetic algori
thm (GA) based optimization was first
used in finite element optimization [5, 6, 7, 8, 9, 10]. Since GA is practicable and gives a faster
solution when parallelized [3], GA has been used for optimization in FE. The object function
corresponding to every me
mber
of a population has to be computed many times to find the
minimum. The many members
form the genetic search space. Since
consists of dimensions
and materials of a particular design being examined for its goodness [3], for those dimensions a
mesh
is constructed, the finite element problem solved and the object function evaluated (see
figure 3.1). The object function itself is computed from a finite element solution involving a
matrix equation. Thus we may treat the object function computation as
a kernel and launch it on
multiple threads, each for a different member of the population. Then within that kernel, we can
parallelize the matrix equation solution. In genetic algorithm based finite element optimization
[61, 62], several copies of the mat
rix are held on the GPU and the corresponding solutions
attempted. Limited memory is also a very big issue in GPUs [23]. This part of the work mainly
focuses on low memory and high speed finite element solvers.
As we already mentioned, the GA based optimiz
ation method presents a huge
computational load. Powerful PCs are capable today of solving large matrix equations in a few
seconds, sometimes using a Graphics Processing Unit (GPU). GPU based finite element
53
computation offers massive parallelization [63].
The finite element solution of field problems requires the solution of large
-
sized matrix
equations leading to large waiting times [13]. To address these parallel computations were used
at one time [64, 65]. But the speedup was limited by the fact that on
the shared memory parallel
computers, there was a memory bottleneck which typically then allowed 4, 8, 16 or rarely 32
processors with more computing elements meaning exorbitant cost. For an n
-
processor
machine with one processor dedicated to book
-
keepin
g on what the other processors
were
doing, the best speedup was n
-
1 and always a little less because of communication and waiting
issues. Recently the graphics processing unit (GPU) has been shown to speed up the matrix
solution part in the finite element
solution [34, 66]. This was a major advance in finite element
computational efficiency.
Figure 3.1: Finite element optimization using genetic algorithm
54
Figure 3.2: Floating
-
point operations per second and memory bandwidth for the CPU and GPU
Parall
elization is the best approach to speeding up as bigger problems are tackled in field
computation [13, 64]. However as noted, the need for shared memory between processors was a
bottleneck because machines with more than 8 processors were very expensive.
3.2
General Purpose Computing on a Graphics Processing Unit (GPGPU)
The GPU is a single chip processor with integrated transform, lighting, triangle
setup/clipping, and rendering engines [67]. In
November 2006
, Nvidia group introduced CUDA
which is a gen
eral purpose parallel computing platform and programming model. The GPU has
its own
memory, up to 24 GB in current configurations [68]. This device (GPU) memory
supports a very high data path using a wide data path. The CPU has a few cores which have been
used for optimized sequential processes. In contrast, the GPU has thousands of small more
efficient cores which have been used for massive parallel processes. The GPU has tremendous
55
computational horsepower and a very high memory bandwidth (shown in Figure
3.2 [69]).
Figure 3.3: The GPU devotes more transistors to data processing
GPUs are used for highly parallel computation. Therefore they are designed using more
transistors for data processing (ALUs) rather than data caching and flow control (shown in
Figure 3.3 [69]). The GPU is well suited for data
-
parallel processing. In GPU FE
computa
tion,
data
-
parallel processing maps data elements to parallel processing threads.
GPU computing uses the GPU to accelerate the computational speed of very large
scient
ific and engineering problems. Generally a GPU has thousands of cores, For example the
Tesla K40 GPU has 2880 cores [68]. Here cores mean a number of computing components.
Nowadays there are so many multi
-
core processors available in the market but the num
ber of
cores is very limited in CPUs.
The GPU for general purpose calculations instead of graphics rendering is called
gen
eral
purpose computing on graphics processing unit (GPGPU). The Nvidia GPU CUDA
architecture
is composed of streaming multiprocessors
(SMs), each containing a number of streaming
processor cores (SPs) with on
-
chip memory, and a single instruction unit. All SMs have access
56
to global memory, the off
-
chip memory (DRAM), which has a latency of several hundred clock
cycles. Thus unlimited sp
eedup is possible unlike with shared memory machines. The until
-
recent restriction that a kernel launched on parallel threads cannot launch another kernel in
further forking from a fork [12] is no serious shortcoming since with multi
-
processor systems
,
even though we can fork from a fork, we usually do not have spare processors to assign.
However in a recent development CUDA dynamic parallelism has been made available on the
SM 3.5 architecture GPU [70] and this is available on PCs now.
In finite eleme
nt analysis the coefficient matrix is very large when dealing with real
world problems; for example in reconstructing cracks in inverse non
-
destructive evaluation
problems and device design problems. We use two different techniques to overcome this
memory
problem:
1.
Use sparse storage schemes to store the coefficient matrix and solve the matrix equation
using the GPU [23]
2.
Use element by element FEM
-
performs operations on the local finite element matrix
[
P
]
L
that corresponds to operations on the global mat
rix and therefore does not require
storage for the larger global matrix([
P
]) [65]
However, the need for shared memory
between processors
was a bottle neck because machines
with more than 8 processors were very expensive. With n processors we could at best
get a
speedup of n
-
1. Using the GPU of PCs is a new alternative [71, 72]. The NVIDIA GPU CUDA
architecture is composed of streaming multiprocessors (SMs), each containing a number of
57
streaming processor
cores (SPs) with on
-
chip memory, and a single instru
ction
unit.
All SMs
have access
to global memory, the off
-
chip memory (DRAM)
, which
has a latency of several
hundred clock cycles. Thus unlimited speedup is possible
unlike with
shared memory machines.
We will use well
known storage
schemes such as the p
rofile (sky line) [13] storage
schemes and the sparse
storage scheme
[13, 73] (also known as the compressed sparse row
scheme). We revive the old element by element finite element solvers from the early 1980s for
working on a highly memory limited IBM PC 2
82 to launch thousands of CUDA threads on the
GPU architecture. We will examine
different numerical
techniques such as
conjugate gradients
(CG), preconditioned conjugate gradients (PCG), Jacobi, bi
-
conjugate gradients etc. to get the
high speed solution we
seek. These ideas are explained below.
3.3
Related Works
To Wu and Heng [71] should go the credit for first exploiting as far back as in 2004 the
CUDA architecture in parallelizing FEM computations. They focused their attention on the
conjugate gradients
matrix solver where the most gains could be made. It took until 2010 for
CUDA FEM computations to enter seminal electrical engineering works and that too without
Kiss
et al.
[34] have recently applied EbE p
rocessing to solve their finite element
equations from a first order tetrahedral mesh using the conjugate gradients algorithms. Their EbE
method is different from EbE in references [65, 74]. They implement the bi
-
conjugate gradient
technique [75]. Since th
e matrix is not formed, they use the diagonal to implement Jacobi
preconditioning. Fernandez
et al.
[33] decoupled the solution of a single element from that of the
whole mesh (Figure 3.4 [33]), thus exposing parallelism at the element level. Individual
e
lement
58
solutions are then superimposed node
-
wise using a weighted sum over concurrent nodes. They
used Jacobi iterations to calculate the solution of each local matrix in parallel and then couple
local solutions using weighted average enforcing continuity.
For example node number 1 in
Figure 3.4 [33] is replaced by numbers 1 and 6 in new numbering system; number 1 is in element
e1 and 6 in e2. Once calculated the local solutions of
elements e1
and e2, need to form the
global solution using the following f
ormula [33],
where
and
are local matrix elements (shown in Figure 3.4 [33]),
is the solution
of element e2 and
is the solution of element e1 [33]. Fernandez
et al
. compared
different generations of GPUs getting speedups with conjugate gradients of up to
14 times and
111 times for the 8800GT and the GTX48
0 GPUs respectively, compared to optimized CPU
results.
Figure 3.4: Steps in the classic finite element method (FEM) and the proposed changes for the
FEM
-
SES method enclosed within the dashed line
59
3.4
Element by Element Solvers
3.4.1
Element by Element with Jaco
bi Algorithm
To overcome the memory limitation of 612 kB on the IBM PC 286 of the early 1980s,
Hughes
et al
. [74] introduced EbE processing. The then available memory of 612 kB was not
enough to hold even a trivial finite element matrix. What we used to
do following Hughes
et al
.
[74] was not form the coefficient matrix [
P
]. Instead, recognizing that [
P
] is assembled from the
small element matrices [
P
]
L
(of size 3 × 3 in magnetics with triangular first order elements),
according to
all the operations meant for [
P
] we performed multiple times on each [
P
]
L
. We took an element,
computed its
[
P
]
L
, did the operation meant to be done on [
P
], added the resul
t to that from other
elements already dealt with as indicated in equation 3.2, threw away
[
P
]
L
,
went to the next
element and so on. The non
-
forming of the coefficient matrix meant that only iterative schemes
of solution without decomposing [
P
] in any way c
ould be done. This restriction therefore
excluded the powerful iterative incomplete
C
holesky conjugate gradients (ICCG) scheme of
solution usually preferred in finite elements work because an incomplete
Cholesky factor of [
P
]
is required [13]. In solving
the Gauss
-
Seidel iterations
commonly used
by power
engineers, is an improvement on the older Gauss iterations [13]. In Gauss
-
Seidel, in each
iteration
m + 1
we use the latest available values of the
unknowns
, using equation
i
of
to compute
treating only
as the unknown and all the other
variables as
known and given by their latest values in the iteration cycle:
60
with obvious modifications for
i =1
and
i = n
. In this algorithm
must be
computed
before
. Here at iteration
m + 1
, computing
in the order i=1 to n,
is at values of iteration
m
+ 1
up to the
th
component of {
} and at the value of the previous iteration m for
values after i. It is therefore necessarily a sequential algorithm.
Figure 3.5: Proposed method in flow chart
When EbE processing was developed for the Gauss
algorithm [65], in solving
the older displaced Gauss iterations [65] use the values of the old iteration m for computing
every {
} in iteration
m + 1
. Therefore the computation of a particular {
} value is
61
independent of the co
mputation of all other {
} values for
that iteration
and therefore
parallelizable:
This is inefficie
nt in the context of sequential computations. But in the case of
paral
lelization it is highly efficient as was laid out by Mahinthakumar and Hoole [65] and Carey
et al
. [76] for shared memory systems. Speedups were just below (n
-
1) where n is the number of
processors. If [
D
] is the matrix [
P
] with all off diagonal elements eliminated, then the Gauss
iterations yield
Our proposed implementation first computes the diagonal vector
[
D
]
of the unformed
global
matrix [
P
], the right hand side of the finite element equation {
Q
} and initi
al solution
vector, and
saves them as two one dimensional arrays. This computation will be done only once. We can
parallelize this computation. In the second step, our algorithm calculates the residual components
ri, i = 1 to n, of {
r
} in parallel:
Algorithm 3.1
shows the element
-
by
-
element Gauss iterations. In that
n
is the total number of
unknowns, {
V
} is a vector which
contains the vertices of an element, {
r
} is the
residual vector,
[
P
]
L
is the local coefficient matrix, {
Q
} is global right hand side vector, {
D
}
is the diagonal
vector of the global matrix [
P
], and
} is the current solution vector.
NC
a
nd
NR
represent
the
62
column number and row number respectively of the unformed global matrix in which the local
matrix term [
P
]
L
(i, j)
stored as the column vector
[P ]
L
(i
3 + j).
The first step of this algorithm
computes the diagonal matrix [
D
] and solution vector {
Q
}.
Al
gorithm 3.2
explains how to
compute the diagonal matrix [
D
] and solution vector {
Q
}. This algorithm is common for all
following algorithms which are described in this chapter. Moreover, this step is also parallelized.
Steps 4 to 19 of the Algorithm 3.
1
whi
ch compute the residual vector of this process also can be
parallelized. A problem is in updating the vector {
r
}. Here more than 2 processes may access
the same memory location at the same time while updating a particular
r
i
. This is called the race
cond
predefined functions. The
arch = sm
_
20
[72] flag was used to enable FERMI advanced
architecture features which support them [72]. Kiss
et al
. used the coloring te
chnique [34] to
avoid the race condition and a two numbering system [33] was used by Fernandez
et al.
Both
methods take a lot of memory and extra computations. So we avoided these methods.
63
Al
g
orithm
3.
1
(C
)
64
Al
g
orithm
3.2
(C
)
3.4.2
El
ement by Element Conjugate Gradients Algorithm
The linear system equation from the finite element formulation can be solved by a
minimization method [77]. Consider a quadratic equation
where [
A
] is a symmetric matrix, {
b
} and {
x
} are vectors, and
is a scalar constant.
It is the square of the residual of the equation
[A]{x} = {b}.
[
A
] is symmetric and positive
-
definite for
our finite element equation,
f ({x})
is minimized by the solution to
[A]{x} = {b}
The
iterates {
x}
(i)
are updated i
n each iteration by a multiple
(i
)
of the search direction vector
{p}
(i)
[78]:
65
Corresponding residual
are updated as
where
is chos
en so as to minimize
f({x})
along
.
The search directions
{p}
(i)
are updated using the residuals
{r}
(i)
i.e.,
where
is defined by
The conjugate gradients iterative solver can also be decomposed into an element by
element process [79]. Here we have used the element
-
by
-
element Jacobi preconditioner [80,
63]
because traditional preconditioners
cannot
be used without the assembled global matrix. In this
algorithm we use the Jacobi preconditioner [78] which is weak but offers precon
dition
ing.
Algorithm 3.4.8 shows the element by element process using the c
onjugate gradients algorithm
[79]. Step
s 6 to 18 of the Algorithm 3.3
compute the residual vector
(v
)
this
process also can be
parallelized.
66
Al
g
orithm
3.
3
(C
)
67
3.4.3
Element by Element Biconjugate Gradient Algorithm
Kiss
et al.
[34] have recently
applied EbE processing to solve their finite element
equations from a first order tetrahedral mesh using the bi
-
conjugate gradients algorithm.
Algorithm
3.4
shows their element by element process using the biconjugate gradient algorithm.
The CG method is n
ot suitable for a non
-
symmetric system since the residual vectors cannot be
made with short recurrence [81, 82]. The biconjugate gra
dients method takes another ap
proach,
replacing the orthogonal sequence of residuals by two mutually orthogonal sequences, a
t the
price of no longer providing a minimization. In the biconjugate gradients method,
we update the
two sequences of residuals and search directions for
[A]
and
[A]
T
. Steps 4 to
19 of the
Algorithm 3.4
compute the residual vector, steps 31 to 34, updat
e the vectors
{xt}, {r}
and
{rd}
and steps 41 to 47 update the vectors
{d}, {dd}, {q}
and
{qd}.
These processes also can be
parallelized. In this process the residuals may be computed using,
The search directions are
given by,
and the distance to
be moved along a direction by,
68
69
Al
g
orithm
3.4
(C
)
70
3.4.4
Element by Element with Bi
-
Conjugate Gradient Stabilized method
The BiCGSTAB iterative method had been used to solve the steady Navier
-
Stokes
equations by Wang
et al.
[83]. Sheu
et al.
[84] used the BiCGST
AB method for solving the
monotonic finite element model. The BiCGSTAB method is suitable for non
-
symmetric systems
but can be applied to symmetric systems too [75]. This method avoids the irregular convergence
pattern of the conjugate gradient squared met
hod. In the biconjugate gradient method, the
residual vector
{r}
(i)
can be written as the product of
{r}
(0)
and an
i
th
degree polynomial in [A];
This same po
lynomial satisfies
So that
This re
sult suggests that if
P
i
(A)
reduces
{r
}
(0)
to a smaller vector
{r}
(i)
, then it might
be
advantageous to apply this
P
i
operator twice, and compute
[85].
This
algorithm
is known as the conjugate gradient squared method. The bicon
jugate gradient
stabilized method
(BiCGSTAB) avoids the irregular convergence of the conjugate gradient
squared method. The
BiCGSTAB computes
instead of
[86].
The BiCGSTAB method is
computationally ex
pensive per iteration compared to the CG
algorithm [87].
Algorithm 3.
5
shows the element by element process using bi
-
conjugate gradient
71
sta
bilized algorithm [84, 83]. BiCGSTAB has two stopping tests which are shown in Algorithm
3. 5
. Steps 13 to 27 and 36
to 50 of the Algorithm 3.
5
compute the residual vectors.
These
processes also can be parallelized.
72
Al
g
orithm
3.
5
(C
)
73
Al
g
orithm
3.5
(C
)
3.5
Conjugate Gradients Algorithm with Sparse Storage Schemes
3.5.1
Conjugate Gradient Algorithm for Mat
rix Solution
There are several works, which attempt to solve finite element problems using GPU(s)
[66, 88, 89, 90, 91, 92] but they are not based on using the element by element technique as in
this thesis. The authors of [92] tested the conjugate gradi
ent (CG), the biconjugate gradient
(BiCG), and the biconjugate gradient stabilized (BiCGSTAB) algorithms with popular
preconditioning techniques; for example the algebraic multigrid, diagonal, shifted incomplete
Cholesky, and shifted incomplete LU methods
. The best performance was found for conjugate
gradients with preconditioning [92]. The authors of [88] also implemented CG with
74
preconditioning algorithms. The reported speedup is between 2 and 325. The speedup varies with
different applications [88]. The
reported speedup for the BiCGSTAB algorithm is on average 8
to 10 times faster [91]. The authors of [89] reported that the speedup obtained with the
preconditioned conjugate gradients (PCG) on a GPU, with respect to the CPU implementation of
the CG algori
thm, is between 8 and 10 (depending on the sparse matrix
-
vector multiplication
used).
The PCG algorithm is considered more efficient for large matrix equations and therefore
usually used for solving a symmetric, sparse, positive definite system of linear e
quations as from
finite element analysis. It identifies the residual in successive orthogonal directions and for n
equations is guaranteed to converge in no more than n iterations [93]. We use it because it is
cheap. Algorithm 3
.6
describes the PCG method
for the system of linear equations
; where [
P
] is a real, positive definite, symmetric matrix from finite element discretization and
is the initial solution of the system which is improved in each iteration
k.
Preconditioning by
the matr
ix M is used to replace the original system
by
. The Jacobi preconditioner is one of the simplest forms of preconditioning, in which the
preconditioner is chosen to be the diagonal of the matrix P =
{diag}(A). In th
e case of the
implementation of this algorithm, we use CUDA C for parallel implementation and C++ for
sequential implementation.
75
Al
g
orithm
3.6
(C
)
3.5.2
Matrix Storage Schemes
3.5.2.1
Introduction
In finite element analysis and optimization the coefficient
matrix is very large when
dealing with real world problems but very sparse [13]. For a symmetric matrix, we need to store
only the diagonal and upper or lower triangular part of the matrix. The sparsity property brings
storage down to
O(n)
for the finite
element method [13]. The elimination of unnecessary
multiplications with 0.0 also speeds up computations significantly. The following sections give
an overview of matrix storage schemes.
3.5.2.2 Profile Storage
Profile storage is also known
as its
e
quivalent skyline
storage which
reduces the
storage
requirement for
a matrix. The matrix would be stored in three one dimensional floating point
76
number arrays. Space is allocated for every number to the right of the first non
-
zero
on a row, up
to the diag
onal term. Therefore renumbering is used first to reduce storage, to band the matrix.
Figure 3.6: A. Sparse full matrix, B. Sparse lower triangular matrix (because of symmetry)
Figure 3.7: Data structures for the symmetric profile storage correspondi
ng to Figure 3.6 B
The matrix of Figure 3.6 A is reduced first to its lower triangle part in Figure 3.6 B. It is then
stored as the vectors
Diag
, giving the diagonal element location,
FC
giving the first column on a
row occupied by a non
-
zero and
V
, giv
ing the coefficient of the matrix which now has several
zeros which are between the first non
-
zero column and the diagonal. An example matrix and
profile storage scheme vectors are shown in Figures 3.6 and 3.7 respectively.
The data structure consists of t
hree one
-
dimensional arrays: A real type array
V
;
the size
of this array is equal to the number of non
-
zero elements plus the number of zeros between two
non
-
zero elements in the array, an integer type array
FC
; the size of this array is equal to the
numbe
r of non
-
zero elements in the array, an integer type array
Diag
; the size of this array is n ;
where n is the number of rows/columns in the array. When we are dealing with real world
77
problems, the coefficient matrix is a very large sparse matrix [13] and w
e can
use the profile
storage scheme to reduce memory consumption
Figure 3.8: A. Sparse full matrix, B. Sparse upper triangular matrix (because of symmetry)
Figure 3.9: Data structures for the symmetric profile storage corresponding to Figure 3.8 B
3.5.2.3
Sparse
Storage
S
c
heme
The sparse storage scheme is also known as the compressed sparse row (CSR) scheme.
The sparse storage scheme is a row
-
wise (or alternatively column
-
wise) representation of the
nonzero entries in the coefficient matrix
of the linear system. For a symmetric matrix, computer
memory can be saved by storing only the nonzero entries in each row on and before the main
diagonal. The associated column numbers are stored in an integer
-
valued array JA such that
JA(K) is the colum
n number for the coefficient A(K). A mapping vector IA is used to denote the
starting location of each row. An example matrix and its sparse storage scheme
vectors are
shown in Figures 3.8 and 3.9 respectively. The data structure consists of three one
-
dim
ensional
arrays: A real
type array
A, contains all the non
-
zero elements of a matrix. The size of this array
78
is equal to the number of non
-
zero elements in the array, an integer type array JA, contains the
matrix column indices of the elements of
A. The s
ize of this array is equal to the number of non
-
zero elements in the array. Another integer type array IA contains the index of each row in the
arrays A and JA. The size of this array is n +1; where n is the number of rows/columns in the
array. When we ar
e dealing with real world problems, the coefficient matrix is a very large
sparse matrix [13] and we can use CSR storage scheme to reduce the memory consumption and
speedup the computations.
79
Chapter 4
Test and Validation Problems
4.1
Device design
invers
e
-
optimization problem:
De
sign of the Pole Face of
an Electrical Motor
4.1.1
Problem Definition
Our mesh generators and solvers will be
demonstrated on
two examples from design.
First let us consider the following sample problem. The
objective is
to achieve
a uniform flux
density distribution in the vertical direction in the air gap of a pole face (see Figure
4.1)
. Since
the air gap in turbo
-
alternators compared to the radius of the machine is small, the shape of the
pole face can be
approximated by
a straig
ht line
.
Figure 4.1: Pole face of electrical motor
Figure 4.2 gives the related dimensions, material properties and field excitation values
used. The symmetry of the magnetic fields with respect to the pole axis allows the modeling of
just half the p
ole pitch, where the pole axis, which is the line of symmetry, is located
at
the right
boundary to the finite element solution domain. The relative permeability of 20 for the magnetic
80
circuit is deliberately set this low, so that the leakage flux through a
ir at the left edge of the pole
face is larger than for higher and more realistic permeability. Our requirement is to have a
uniform vertical flux density of 1 Tesla along measuring points on the stator. That means all our
measuring points must have their
y direction flux density (x direction derivative of the vector
potential) of 1 Tesla. The influence of this leakage flux requires significant correction in the
shape of the pole face close to the left edge in order to achieve the desired constant flux den
sity
in the air gap. This example is frequently used in the demonstration of electromagnetic
optimization methods and it can be considered as a standard demonstration example [13]
Figure 4.2: Geometry, boundary conditions and the material properties of t
he sample
prob
lem
81
4.1.2
Problem Model
Figure 4.3 shows how to model the problem using our software. There are 11 fixed points
(shown in Figure 4.3), 10 variable heights (h1....h10) to be optimized, 8 measuring points
(purple
-
dots), and 4 materials (stator
, air, coil, back ion) in this problem. Figure 4.4 shows the
generated mesh. Figure 4.5 shows the
equi
potential lines of the starting design of the pole face of
a motor. Since this is a 2D magneto
-
static problem, these lines represent the flux lines as we
ll.
As we discussed in Chapters 1 and 3, we use the genetic algorithm for optimization. In
genetic algorithm optimization [61, 62], several copies of the matrix are held on the GPU and the
corresponding solutions attempted. We have tried different problems
using the GA on a GPU
[12, 3].
Figure 4.3: Defining the problem using our tool
When we optimize directly this problem by defining an independent parameter for the
displacement of each point in the pole face, the shape we get is given in Figure 4.6. Th
is shape is
not a manufacturable shape but the solution really gives a very good result (see Table 4.1 and
82
Figure 4.9). However it is clear that this type of pole face cannot be practically constructed.
Figure 4.4: Generated mesh using our tool
Figur
e 4.5: Finite element solution
There are two solutions for this problem. We can add some constraint to the solution on
force the variable points to be arranged in a curve function of known mathematical form.
Both
methods had been tested by Wijesinghe [22
]. And he claimed that even though the second
83
method is easy because of the known equation, the result is not as good as the result by the first
method.
Figure 4.6: Results of the un
-
constrained optimization of the problem
Figure 4.7: Results of th
e constrained optimization without smoothening
An erratic undulating shape with sharp edges arose when Pironneau [18] optimized a pole face to
achieve a constant magnetic flux density and this was overcome through constraints [19].
84
Figure 4.8: Result
s of the constrained optimization with smoothening
Table 4.1: The flux distribution from the un
-
constrained optimization
Haslinger and
Neittaanmaki [94] suggest Bezier curves to keep
the shapes smooth with just a
few variables to be optimized, while Preis
et al
. [95] have suggested fourth order polynomials
which when we tried gave us smooth but undulating shapes. As such we follow Sub
ramaniam
et al
. [19] and extend their principle, so as to maintain a non
-
undulating shape by imposing the
constraints:
Measuring points
Flux density (in Tesla)
1
1.001
2
1.001
3
1.002
4
0.996
5
1.001
6
1.001
7
0.999
8
0.999
85
to ensure a smooth shape Even this gives a non
-
smooth shape (Figure 4.7) but we use averaging
of neighboring heights which is shown in Figure 4.10, to obtain a very manufacturable shape a
s
demonstrated in Figure 4.8. The final results are given in Table 4.2 and Figure
4.11. Average
error percentage of un
-
constrained optimization is 0.15% while the constrained optimization
method has an error percentage of 1.0625%. Even
-
though un
-
constrain
ed
opti
mization gives a
more accurate solution than constrained optimization, the resulting shape from the un
-
constrained problem is not practicably manufacturable.
Figure 4.9: The flux distribution from the un
-
constrained optimization
The average erro
r percentage was calculated using the following formula:
where n is the number of measuring points,
Ep
is the error percentage,
is the calculated flux
density and
is the target flux density. We have reported solutions for a population size of
100. Figure 4.10 shows the averaging
techniques which have
been used to get a smooth
manufacturable shape
. This figure
shows a 3
-
neighbor averaging technique. We have a mask that
covers 3 elements. The mask moves the left most
elements
to the right most element and updates
86
the middle element with the average of the three values which are covered by the mask. We
introduced
2 temporary elements at the front and end to calculate the average of
the first and last
elements. If we wish to use a 5
-
neighbor technique, we need to introduce 2 elements at the front
and 2 elements at the end to calculate the average of the first and l
ast elements.
Figure 4.10: Averaging technique for manufacturable shape
Table 4.2: The flux distribution from the constrained optimization
Measuring points
Flux density (in Tesla)
1
0.991
2
1.016
3
1.008
4
1.008
5
0.977
6
0.978
7
0.993
8
0.988
Figure 4.11: The flux distribution from the constrained optimization
87
4.2
Inverse
-
optimization for Device Design: Determining the Rotor Contour
of a Salient Pole Synchronous Generator
4.2.1
Problem Definition
The second example is about determining the pole fac
e contour of a salient pole
synchronous generator to demonstrate the parametrized mesh generator and matrix solution
software as applied to constrained optimization. The current density in the excitation coil and
the geometric parameters that define the s
hape of the pole piece have to be predicted in order to
achieve a sinusoidal distribution of the airgap flux with a peak value of 1.0 T and reduce the flux
leakage while the airgap is constrained to a minimum to prevent the motor from hitting the stator
(F
igure 4.12).
Figure 4.12: A synchronous Generator (A) two pole and (B) four pole
Figure 4.13 gives the related dimensions, material properties and field excitation values used.
The symmetry of the magnetic fields with respect to the pole axis allows th
e modeling of just
half the pole pitch, where the pole axis, which is the line of symmetry, is located at the right
88
boundary to the finite element solution domain. The stator is
idealized as
a solid steel region
without slots, and both stator and rotor are
made of linear steel with a relative permeability of
A/mm
2
which is
the limit for copper windings, air gap between stator and rotor x < 2 cm and flux go through the
points A
1
and A
2
< 0.3 × flux go through the points A
3
and
A
1
which means allowable leakage
flux is 30%,
Figure 4.13: Parametrized geometry of salient pole
4.2.2
Problem Model
Figure 4.13 explains the parameters of the problem and how to model this problem. There
are 14 fixed points, 16 variable heights (h1....h16), 8 measuring points and 4 materials in this
problem (see Figure 4.14). Figure 4.15 shows the corresponding starting mesh for this problem.
Figure 4.16 shows the flux lines of this salient pole synchrono
us Generator at starting. When we
89
optimize directly this problem by defining an independent parameter for the displacement of
each point in the rotor like in the previous example, the shape we get is given in Figure 4.17.
This shape is also not a manufacto
rable shape but the solution really
gives very good result in
-
terms of a sinusoidal distribution of the airgap flux with a peak value of 1 T (Table 4.3 and
Figure 4.17).
Figure 4.14: Defining the problem
Figure 4.15: Initial mesh
90
Figure 4.16: Flux
line of a salient pole synchronous Generator
Figure 4.17: Optimized shape without smoothening constrained by rising pole heights from left
to right
Figure 4.17 shows the flux lines for the optimum solution with a constraint like in (4.1)
but the heig
ht of the shaped surface having to go up from left to right. It has sharp corners but is
reasonably smooth. We then use an averaging technique to remove sharp bends. We took five
neighboring values of a height and calculated the mean for every variable sol
ution
with suitable
91
modification for boundary variables to get Figure 4.20.
Table 4.3: The flux distribution after the optimization without smoothened shape
Measuring points
Flux density (in Tesla)
Target Flux density (in Tesla)
1
0.0202
0.0000
2
0.1869
0.1736
3
0.3721
0.3420
4
0.4987
0.5000
5
0.6321
0.6428
6
0.7453
0.7660
7
0.8769
0.8660
8
0.9215
0.9397
9
0.9709
0.9848
10
1.0091
1.0000
Figure 4.18: The flux distribution after the optimization without smoothened shape
Figure 4.19: The flux
distribution after the optimization with smoothened shape
92
The final results are given in Table 4.4 and Figure 4.19. The average error percentage of un
-
smoothened optimization is 2.96% while the smoothened optimization method has an error
percentage of 2.2
4%. The average error percentage is calculated using Equation 4.2 (the first
point is not
included because
of the zero
denominators in
Equation 4.2). Since GA is a stochastic
optimization algorithm [96, 97], the optimization value is not always perfect. Th
e optimum value
depends on the initial population and search space [96, 97].
Table 4.4: The flux distribution after optimization with smoothened shape
Measuring points
Flux density (in Tesla)
Target Flux density (in Tesla)
1
0.0083
0.0000
2
0.1596
0.1736
3
0.3560
0.3420
4
0.4964
0.5000
5
0.6434
0.6428
6
0.7517
0.7660
7
0.8549
0.8660
8
0.9183
0.9397
9
0.9986
0.9848
10
1.0035
1.0000
Figure 4.20: Final smoothened shape
93
4.3
NDE benchmark problem: Characterizing Interior
Defects
4.3.1
Problem Definition
As an example, when an armored vehicle is targeted by an improvised explosive device
(Figure 4.21), the armor is inspected by an eddy current probe to determine whether there is
damage or not. But we wish to characterize the interior damage to determine i
f the vehicle
should be withdrawn from deployment. The same system is also intended for regular rust
dollars [98, 99, 100]
Figure 4.21: Inspection of an army v
ehicle after improvised explosive device
Figure 4.22 presents crack shapes, both shown through the meshes, to make the
com
puted
field match the measured field. The normalized least
-
square mismatch of nodes from the
midpoint between the measured and compu
ted shapes (Figure 2.4) shows a
parametri
cally
described crack in steel excited by an eddy current probe. In this NDE exercise the parameters
94
need to be optimized to make the computed fields match the measurements. An objective
function F is defined as the
sum of the squares of the difference between computed
and desired
performance values: at measurement points
i,
where
is the calcula
ted mag
netic flux density and
is the measured flux density.
By
minimizing the objective function F by the optimization method, the characteristics of the defect
can be estimated since F is the function of the parameters.
Figure 4.22: Parametrical
ly defined crack in plate from Triangle
4.3.2
Problem Model
Figure 4.23 shows the parameters of the described NDE problem and how to model it to
using our tools. There are 14 fixed points, and 6 variable points (for example); the x coordinates
are fixed [10
1]. There are also 4 materials (air, crack, steel plate and coil) in this problem. Figure
4.24 shows the generated mesh for this problem. Different materials are shown in different colors
(although in black and white in this printout). Figure 4.25 shows t
he flux lines from this
example. The flux lines are shown in Figure 4.25. Figure 4.26 shows the true defect and
constructed defect.
95
Figure 4.23: Defining the problem
Figure 4.24: Generated Mesh for NDE problem
Table 4.5 shows the variable points (
defect coordinates), their x and y coordinates,
euclidean distance between the centroid of that the crack and a point (d1 and d2 respectively)
and normalized distance between d1 and d2. That means centroid difference between the true
profile and the constr
ucted profile, d1 and d2, is calculated using the Equation 4.4. Let us say,
(x1, y1) and (x2, y2) are the two points; the euclidean distance between the two points is
defined by the following formula
:
96
Figure 4.25: Flux line for NDE problem
Table 4.5: The solution of defect characterization
T
rue
profile
Reconstructed
profile
Norm error
V
ariable
p
oi
n
ts
x
y
d1
x
y
d2
((
d
1
d
2)
/d
1)
2
1
2
3
4
5
6
7
8
8
2.0
1.5892
9
1
.2
1.4162
10
2.4
0.5153
11
2.2
1.5348
11
2.7
1.5101
10
3.0
0.6896
9
3.2
0.8400
8
3.5
1.7890
8
2.24
1.5268
9
2.34
0.5331
10
1.96
0.7544
11
2.15
1.5461
11
2.54
1.5000
10
2.94
0.6497
9
3.42
1.0251
8
3.55
1.8167
0.001541
0.388815
0.215191
0.000054
0.000044
0.0
03342
0.048597
0.000240
Figure 4.26: Optimum shape of the reconstructed defect
97
The normalized least
-
square match of nodes from the midpoint between the measured
and computed shapes (Figure 4.26) was close to 90%. The error in location was 4.65%. A bet
ter
match would require the use of more parameters. We tried with several population
sizes and
multiple times of iterations. These results are reported in [102]. This thesis presents the
particular solution for a population of 200. Tests were carried ou
t for the different population
sizes and different number of iterations and we measured the time taken to compute the solution
and tabulated these in Table 4.6. The best fitness score for different population sizes and
different number of iterations is rep
orted in Table 4.7 for real and binary representations of GA
solutions [102]. We can see that solutions from parameters represented by real numbers are
obtained faster than the solutions that are represented by binary numbers [102].
Table 4.6: Real and bin
ary solutions time need to compute
Population Size
30 iterations
40 iterations
50 iterations
Time Taken(s)
Time Taken(s)
Time Taken(s)
Real
Binary
Real
Binary
Real
Binary
10
20
30
40
50
60
274.33
540.68
861.79
1319.36
1583.73
1757.78
311.97
661.79
10
24.65
1431.34
1769.6
2131.94
346.58
695.28
1095.82
1666.79
1994.77
2199.26
408.29
891.06
1258.17
1950.36
2302.99
2534.36
413.16
866.47
1588.32
2024.03
2323.93
2656.53
503.43
1105.42
1749.1
2390.25
2891.94
3303.85
Table 4.7: Real and binary solutions time
need to compute
Population Size
30 iterations
40 iterations
50 iterations
Best Fitness Score
Best Fitness Score
Best Fitness Score
Real
Binary
Real
Binary
Real
Binary
10
20
30
40
50
60
0.0123
0.0084
0.0040
0.0105
0.0111
0.0127
0.0019
0.0012
0.0009
0
.0017
0.0017
0.0017
0.0091
0.0084
0.0040
0.0105
0.0068
0.0127
0.0020
0.0008
0.0009
0.0017
0.0017
0.0016
0.0091
0.0084
0.0040
0.0105
0.0068
0.0127
0.0015
0.0008
0.0009
0.0017
0.0017
0.0000
98
4.4
A Simple Three
-
dimensional Problem
For testing purposes, we took
a small cube which is made of a material with relative
permittivity 1. Another inner cube with a relative permittivity of 1 and charge density of 1C/m3
condit
ion). The measuring points (along a line) are shown in Figure 4.27. Then we solved the
Poisson equation to calculate the potential (
) using FEM:
Figure 4.27: Square
conductor problem
There are 16 fixed points, 2 materials (air and material) and 100 measuring points (along
the line) in this problem. Since this is not an optimization problem, there are no variable points.
Figure 4.28 shows the generated mesh for this
problem. Different materials are given in different
colors; light green for material 1 and red for material 2.
Figure 4.29 shows the potential at the measuring points.
We may
recognize that the
potential within the inner cube is constant because the inne
r cube has a constant uniform
charge.
99
Figure 4.28: Mesh for square conductor problem
Figure 4.29: Potential at measuring points
Figure 4.30: Potential at measuring points for a simple cube
100
Table 4.8 shows the numerical solution for this problem.
The first column gives the measuring
point
number;
the second and third columns give the electric potential value and tetrahedron
number in which the particular measuring point exists, respectively. Figure 4.30 shows the
potential at the measuring points
for a simple cube problem (a simple cube with
t
he
same
dimension as
above, charge density is 1C/m
3
-
no inner outer cubes and no change
in
permittivity).
Table 4.8: Potentials at measuring poin
t
Measuring points
(in V)
Element#
Measuring points
(in V)
Ele
ment#
1
0.0000
19962
51
3.4708
10058
2
0.2159
13483
52
3.4711
12276
3
0.4318
13483
53
3.4714
12276
4
0.6477
13483
54
3.4717
12276
5
0.8636
13483
55
3.4721
12276
6
1.0795
13483
56
3.4727
7335
7
1.2954
13483
57
3.4736
7335
8
1.4938
982
58
3.4745
7335
9
1.6796
982
59
3.4755
7335
10
1.8616
12016
60
3.4763
1321
11
2.0228
18763
61
3.4767
1321
12
2.1673
18763
62
3.4772
18264
13
2.3118
18763
63
3.4777
18264
14
2.4446
5702
64
3.4793
15140
15
2.5760
5702
65
3.4844
15136
16
2.6948
21367
66
3.4897
15136
17
2.8076
21367
67
3.4950
15136
18
2.9204
21367
68
3.5003
15136
19
3.0161
21366
69
3.5056
15136
20
3.0922
18694
70
3.5109
15136
21
3.1629
15809
71
3.4988
15448
22
3.2293
5610
72
3.4792
15448
23
3.2924
5610
73
3.4596
15448
24
3.3532
14667
74
3.4400
15448
25
3.3950
25126
75
3.4205
15448
26
3.4169
25126
76
3.4009
15448
27
3.4388
25126
77
3.3562
15446
28
3.4607
25126
78
3.2909
15446
29
3.4826
25126
79
3.2257
15446
30
3.5045
25126
80
3.1591
8028
101
Table 4.8: Potentials at measuring poi
nts (C
)
Measuring points
(in V)
Element#
Measuring points
(in V)
Element#
31
3.5181
23548
81
3.0841
19985
32
3.5122
23548
82
3.0092
19985
33
3.5064
23548
83
2.9343
19985
34
3.5005
23548
84
2.8152
18420
35
3.4947
23548
85
2.6758
18420
36
3.4888
23548
86
2.5
364
18420
37
3.4830
23548
87
2.3971
18422
38
3.4772
23548
88
2.2578
18422
39
3.4723
22059
89
2.1185
18422
40
3.4719
22059
90
1.9794
8298
41
3.4712
19315
91
1.8409
8298
42
3.4708
19049
92
1.6655
17398
43
3.4707
19049
93
1.4562
17398
44
3.4708
4482
9
4
1.2470
17398
45
3.4707
4482
95
1.0383
14912
46
3.4705
4482
96
0.8303
14912
47
3.4703
4482
97
0.6222
14912
48
3.4701
4482
98
0.4142
14912
49
3.4704
17128
99
0.2072
13685
50
3.4706
5560
100
0.0001
13685
This is a simple problem. We will test the me
sh generator on the more complex system of
the next section. These results are verified with the mesh generator Gmsh [29] with Matlab
functions. This simple problem verifies that our parameterized mesh generator and FEM solver
work perfectly.
4.5
NDE benchmark
problem in 3D: Characterizing
Interior Defects
As a more complex example for testing our parameterized mesh generator, this is a three
dimensional version of example 3 (Section 4.3). We wish characterize the interior damage to
a
land vehicle hull to
determine if the vehicle should be withdrawn from deployment. Figure
4.31
frame, air (upper half of the Figure 4.31) and steel plate (lower half of the Figure
4.31).
102
Figure 4.31: Three
-
dimensional NDE problem
Figure 4.32 shows how to define a crack whose outline coordinates and location
coordinates are changing as the optimization algorithm runs. We define two surfaces; the outer
co
or
dinates of both surfaces
are the same. We made some constraints on the y coordinates to
ensure that this is truly a volume without the surfaces crossing each other. In Figure 4.32 top,
there are two surfaces. Each has a few variables shown in green, while the outer
com
mon
coordi
nates are given in blue. In Figure 4.32 the lower part of the figure shows the top view of
the top surface which has 13 variables in addition to its outer common coordinates. Similarly the
bottom surface also has variable points.
Figure 4.33 shows the gene
rated mesh for this problem. Different materials are shown in
different colors; green
-
air, red
-
steel, blue
-
coil and yellow
-
crack (made by 2 surfaces)
although this printout is black and white. For this test problem we took 5 variable points on the
upper surface, 8 variable points on the common interface and 4 variable points on the lower
surface. There are a total of 17 × 3 variables associated with the 17 variable points. Here we
made another assumption that the points are varying in the y directio
n only; the x and z
103
coordinates are fixed. Table 4.9 shows the true and characterized profile coordinates and the
normalized distance between results.
Figure 4.32: Defining variable: top: side view, bottom: side view
Figure 4.33: Three
-
dimensional me
sh for NDE problem: As parameters change
The last column of Table 4.9 shows the norm error (the formula is also given in the Table 4.9).
Average error is 2.5%. Average error is calculated using the following formula:
104
where n is the number of variables,
is the target value and
is the reconstructed value. This
thesis presents
the particular solution for a population of 200. This demonstrates the validity of
our 3D optimization tools.
Table 4.9: The solution of 3D defect characterization
True profile
Reconstructed profile
Norm error
#
x
y
z
x
y
z
|z
t
re
| /z
t
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
52.853
44.000
50.927
51.763
42.000
52.427
48.237
45.000
52.427
47.147
42.000
50.927
48.237
45.000
47.573
50.000
44.000
47.000
51.763
43.000
47.573
52.853
43.000
49.073
51.427
46.000
50.464
50.000
46.00
0
51.500
48.573
47.000
50.464
50.882
46.500
48.786
50.000
48.000
50.000
51.427
40.000
50.464
48.573
40.000
50.464
49.118
40.000
48.786
50.882
40.000
48.786
50.000
38.000
49.000
52.853
42.628
50.927
51.763
43.892
52.427
48.237
43.682
52
.427
47.147
41.873
50.927
48.237
44.670
47.573
50.000
41.893
47.000
51.763
44.735
47.573
52.853
42.444
49.073
51.427
44.239
50.464
50.000
47.193
51.500
48.573
46.317
50.464
50.882
46.483
48.786
50.000
47.219
50.000
51.427
37.999
50.46
4
48.573
39.835
50.464
49.118
40.022
48.786
50.882
38.213
48.786
50.000
36.903
49.000
0.0312
0.0451
0.0293
0.0030
0.0073
0.0479
0.0403
0.0129
0.0383
0.0259
0.0145
0.0004
0.0163
0.0500
0.0041
0.0005
0.0447
0.0289
105
Chapter 5
Results and Analy
sis
5.1
Memory Limitation
Recently, Graphics Processing Unit (GPU) computing has had great success in many
very large numerical computations. Software developers, researchers, and scientists have been
using the GPU for speeding up their computations. Appl
ications taking advantage of this new
technology have ranged from quantum chemistry [66] and molecular dynamics [103, 104] to
fluid dynamics [105, 106] and cloth simulation [107]. In this work we discuss the often
undiscussed GPU memory limitation in finit
e element optimization. In GPU computing the
memory of the Nvidia GPU is limited. This part of this thesis assesses the memory limits in
terms of matrix size in light of the various ways to store a large matrix in order to overcome
these limits.
We took a
4 cm
2
square conductor with current density = 1 A/mm2 and relative
p
er
meability 1. Then we solved the Poisson equation to calculate the magnetic vector potential
(A) using FEM
In this experiment we defined different number of progressively refined triangles; i.e.
288,
768, 1408 etc. Therefore the final solution can be obtained from the equati
on, [
P
] {
A
} =
{
Q
};
where [
P
] is the matrix and {
A
} and {
Q
} are vectors. In this experiment we used both
106
storage schemes to reduce the storage capacity. Matrix
[
P
]
is a symmetric positive definite sparse
matrix; each row has approximately 3 to 5 elements in
a symmetric half [13], because of the first
order mesh.
Table 5.1 shows the total number of matrix elements and storage with matrix size for
different storage schemes. Clearly, we can store very large matrices using the profile or sparse
storage scheme be
cause in FEM, the matrix [
P
] is a symmetric sparse matrix [13]. According to
our result, memory
-
wise, the sparse storage scheme is much better than the profile storage
scheme as to be expected because of fill
-
in during decomposition with the latter. Althou
gh well
known, we repeat this investigation to obtain memory limits with CUDA. In inverse problems
where many equations need to be solved, this is limiting. Figure 5.1 (A) shows the memory
required for the matrix versus matrix elements. Figure 5.1 (B) sh
ows the memory required for
profile and sparse storage schemes.
Table 5.1: Number of elements (NE) and storage (in MB) with matrix size for different storage
schemes
Matrix Size
NE
Regular
NE(Profile)
Profile
NE(Sparse)
Sparse
100
10500
0.040054
1161
0.0
04429
1709
0.006519
400
18000
0.068665
10819
0.041271
4421
0.016865
900
814500
3.107071
33329
0.127140
9521
0.036320
1600
2568000
9.796143
75239
0.287014
18421
0.070271
2500
6262500
23.889542
142549
0.543781
29801
0.113682
6000
36030000
137.443543
404
459
1.542889
71681
0.273441
8000
64040000
244.293213
697679
2.661434
94221
0.359425
10000
100050000
381.660461
1070099
4.082104
118021
0.450214
Using curve fitting projection we determined the maximum size of the problem that may
be attempted within th
e 4 GB memory limit of the GetForce GTX 970 GPU card that we worked
with. Table 5.2 shows the matrix size and corresponding memory size that we need to store the
variables. From these calculations we can say that the sparse storage scheme
can be used for v
ery
large problems of size 50,000K × 50,000K. It takes much lower memory
than the profile storage
107
scheme as remarked. Using the sparse storage scheme we can solve bigger than a 50,000K ×
50,000K matrix size before running into memory limits (4 GB limit).
With the 24 GB Kepler
K80 GPU cards [68] now available, we can solve problems much bigger than 50,000K ×
50,000K matrix
Figure 5.1: Memory vs matrix size
Table 5.2: Projected Memory (in MB) Needs
Size
Regular
Profile
Sparse
20K
1525.2
14.73
0.90
30K
3430.5
32.13
1.35
50K
9526.5
87.06
2.26
100K
38097.0
341.79
4.52
500K
952240.0
8417.70
22.58
1000K
3808900.0
33608.00
45.17
5000K
95220000.0
838940.00
225.85
10000K
380880000.0
3355100.00
451.70
50000K
9522000000.0
83865000.00
2258.50
100000K
38088
000000.0
335450000.00
4517.00
Our conjugate gradients method with Jacobi preconditioning gives the results shown in
Figure
5.2. We stopped our experiment with size of 10000 × 10000 because if we have a population of
5000 in GA, theoretically we can go up
to 10000 × 10000 with sparse storage schemes because
of the memory limits as discussed above. If we want a larger population,
the problem size should
108
be decreased. Even
-
though the sparse storage scheme reduces the memory consumption, we can
not solve prob
lems larger than 10000 × 10000 with larger populations than 50. But with the new
GPU cards like Kepler K80 [68] that came out as
this thesis was completed, memory will not be
a problem
Figure 5.2: Speed
-
up versus matrix size: Jacobi preconditioned conju
gate gradients
algo
rithm
5.2
Eleme
n
t
-
b
y
-
Eleme
n
t
Sol
v
ers
Again we solved the
Poisson
equation for the magnetic vector potential A with
progressively refined meshes using the finite element method for the Poisson equation for a test
problem from magnetics.
We defined different number of progressively refined tetrahedrons to
obtain results for different matrix sizes. Run time statistics obtained for different mesh sizes is
given in Table 5.3 which compares the CPU and GPU calculation times for different sizes
of
problems for the conjugate gradient element by element (CGEbE) algorithm. This table gives the
speedup which is defined as CPU calculation time/GPU calculation time and the speedup per
iteration which is defined as CPU calculation time for one iteratio
n/CPU
calculation time per
iteration. It shows a high speedup of 51.23 for 2,828,782 unknowns which is equivalent to a
109
matrix size of 2, 828, 782 × 2, 828, 782 in regular first order finite element analysis.
Table 5.3: The CGEbE solution
CPU
GPU
Elemen
ts
Unknown
Itns
Time
Itns
Time
Speedup (S)
S per itn
26595
3072
32682
3843
45328
5440
106587
13312
210024
27852
273008
36704
399928
54672
978905
137639
1940736
279340
2569403
372149
3830869
559546
9521059
141314
2
18879664
2828782
51
1.15
55
1.63
60
2.39
79
7.38
119
22.34
130
36.02
148
54.32
206
189.24
294
524.26
329
953.45
413
1542.75
645
11120.06
903
46348.39
66
0.33
69
0.33
69
0.33
90
0.45
126
0.73
134
0.91
151
1.33
201
7.11
271
25.19
310
41.27
346
75.06
514
30
8.25
733
904.54
3.4848
4.5098
5.0938
6.3903
7.2424
8.3288
16.4000
18.6835
30.6027
32.4029
39.5824
40.8003
40.8421
41.6700
26.6160
25.9700
20.8122
19.1841
23.1027
21.7685
20.5536
17.2192
36.0748
28.7480
51.2397
41.5933
This number of iterations depends on both number of unknown elements and total
number of elements (see the algorithms in Chapter 3). That is why both iterations and speedup
with number of unknown elements and total number of elements are plotted. Figur
e 5.3 shows
the number of iterations vs number of unknowns. Figure 5.4 shows the number of elements vs
number of unknowns. In both graphs, the number of iterations increases with problem size. We
can see that the number of iterations increases with problem
size for both CPU and GPU. The
reason for the difference between number of iterations in the GPU and in the CPU is as
explained later in this section.
The Figure 5.5 shows the speedup vs number of unknowns for the CGEbE. Figure 5.6
shows the speedups vs n
umber of elements for the CGEbE. The speedups vary with problem
size. We can see an erratic nature in the speedups in Figures 5.5 and 5.6. The causes of the
erratic
speedup need to be investigated [108].
110
Figure 5.3: Number of iterations vs number of unk
nowns for CGEbE
Figure 5.4: Number of iterations vs number of elements for CGEbE
Since we have been working with the regular finite element method for the Poisson
equations for magnetics, the bi
-
conjugate gradient method [34] also works as the conjug
ate gradient
method because [
A
] = [
A
]
T
[34, 77]. In the situation where the convection effect when dealing with
temperature problems is significant, finite element matrix equations take antisymmetric form [83].
Therefore for a generalized finite element p
ackage, antisymmetric solvers also should be included
with the package.
Table 5.4 shows the CPU and GPU calculation times for the different sizes of problems
for
111
the biconjugate gradient stabilized element by element (BiCGSTABEbE) method. Even
-
though it
g
ives higher speedup than the CGEbE method, this BiCGSTABEbE method takes a long time to
converge. We can see the differences in Tables 5.3 and 5.4. The BiCGSTABEbE method shows a
higher speedup of 80.1364 for 54,674 unknowns while the speedup is only 20.5
5 for 2,828,782
unknowns. So BiCGSTABEbE method can be used to solve small problems (smaller than 54,674
unknowns problems
).
Figure 5.5: Speedup vs number of unknowns for CGEbE
Figure 5.6: Speedup vs number of elements for CGEbE
112
The Figure 5.7 show
s the number of iterations vs number of unknowns for BiCGSTABEbE.
Figure 5.8 shows the number of elements vs number of unknowns for BiCGSTABEbE. In
both
graphs, the number of iterations increases with problem size. This is as to be expected for a w
ell
-
cond
itioned problem, such as
this.
Table 5.4: The BiCGSTABEbE solution
CPU
GPU
Elements
Unknown
Itns
Time
Itns
Time
Speedup (S)
S per itn
26595
3072
32682
3843
45328
5440
106587
13312
210024
27852
273008
36704
39
9928
54672
978905
137639
1940736
279340
2569403
372149
3830869
559546
9521059
1413142
18879664
2828782
71
3.44
72
4.27
83
6.84
107
21.24
143
56.68
145
74.81
162
123.41
219
421.69
263
1051.64
361
1785.89
371
5158.10
466
11147.04
553
28191.35
69
0.34
72
0.34
72
0.35
96
0.50
129
0.88
134
1.08
146
1.54
198
6.41
237
21.24
260
33.43
375
82.78
569
366.80
884
1371.98
10.1176
9.8326
12.5588
12.5588
19.5429
16.9528
42.4800
38.1129
64.4091
58.1033
69.2685
64.0137
80.1364
72.2217
65.7863
59.4780
49.5122
44.6175
53.4218
38.4755
62.3109
62.9828
30.3900
37.1071
20.5479
32.8470
Figure 5.7: Number of unknowns vs number of iterations for BiCGSTABEbE
113
The Figure 5.9 shows the speedups vs number of unknowns for BiCGSTABEbE. Figure
5.10 shows the speedup vs number of element
s for BiCGSTABEbE. The speedups vary with
problem size. The speedup is decreasing after 54,674 unknowns in contrast to the speedup
increasing for CGEbE after 54,674 unknowns. Again the causes of the erratic speedup need
to be
investigated [108].
Table 5.
4 compares the CPU and GPU calculation times for different sizes of problems for
CGEbE algorithm. This table gives speedup and speedup per iteration. It shows high speedup of
102.12 for 54,672 unknowns which is equivalent to a matrix size of 54,672 ×
54,6
72 in regular first
order finite element analysis.
Figure 5.8: Number of elements vs number of iterations for EbEBiCGSTAB
Figure 5.9: Speedup vs number of unknowns for BiCGSTABEbE
114
Figure 5.11 shows the number of iterations vs number of unknowns for J
acobi EbE. Figure
5.12 shows the number of iterations vs number of elements for Jacobi EbE. In both graphs, the
number of iterations increases with problem size as to be expected. We can see that the number of
iterations increases with problem size for bo
th CPU and GPU
The Figure 5.13 shows the speedups vs number of unknowns for Jocobi EbE. Figure 5.14
shows the speedups vs number of elements for Jacobi EbE. The speedup is decreasing after
372,149
elements (see Table 5.5).
Figure 5.10: Speedup vs number
of elements for BiCGSTABEbE
Figure 5.11: Number of unknowns vs number of iterations for JacobiCG
115
Figures 5.15 and 5.16 show the convergence of the CGEbE algorithm for CPU and GPU
respectively. We can see that both convergence patterns are almost the s
ame and that the
convergence rate is very high in both figures.
Figures 5.17 and 5.18 show the convergence of the BiCGSTABEbE algorithm for CPU
and
GPU respectively. We can see that both convergence patterns are slightly different and the
convergence rate
is high in both figures
Table 5.5: The Jacobi solution
CPU
GPU
Elements
Unknown
Itns
Time
Itns
Time
Speedup (S)
S per itn
26595
3072
32682
3843
45328
5440
106587
13312
210024
27852
273008
36704
399928
54672
9
78905
137639
1940736
279340
2569403
372149
3830869
559546
9521059
1413142
18879664
2828782
806
17.24
889
23.58
1019
37.01
1557
134.42
2371
409.10
2712
619.42
3320
1129.47
5376
6219.27
7826
19026.61
9070
31069.48
11185
56743.83
17853
217808.09
24904
488881.
06
807
0.70
889
0.78
1018
0.95
1557
1.96
2374
4.67
2713
6.52
3314
11.06
5374
66.84
7819
269.75
9063
452.96
11176
930.97
17856
4364.02
24908
12852.42
24.6286
24.6591
30.2308
30.2308
38.9579
38.9197
68.5816
68.5816
87.6017
87.7126
95.0031
95.0381
102.1221
10
1.9375
93.0471
93.0125
70.5342
70.4711
68.5921
68.5392
60.9513
60.9023
49.9100
49.9184
38.0381
38.0441
Figure 5.12: Number of elements vs number of iterations for JacobiCG
116
Figures 5.19 and 5.20 show the convergence of the Jacobi EbE algorithm for CPU
and
GPU respectively. We can see that both CPU and GPU convergence rates are very slow
compared to CGEbE and BiCGSTABEbE. But convergence patterns in CPU and GPU are the
same and the convergence pattern is smooth. Here we can see there is not that much
dif
ference
between number of iterations between GPU and CPU compared to CGEbE and BiCGSTABEbE
algorithms. In CGEbE and BiCGSTABEbE methods many statements including element by
element process are parallelized in GPU but in the Jacobi method only one stateme
nt with
element by element executes in the GPU. This is the possible reason for this difference in the
number of iterations in CPU and GPU
Figure 5.13: Speedup vs number of unknowns for Jacobi EbE
Figure 5.14: Speedup vs number of elements for Jacobi
Eb
E
117
Figure 5.15: Convergence rate of CG in CPU
Figure 5.16: Convergence rate of CG in GPU
Figure 5.17: Convergence rate of BiCGSTABEbE in CPU
118
Figure 5.18: Convergence rate of BiCGSTABEbE in GPU
Figure 5.19: Convergence rate of Jacobi in CP
U
Figure 5.20: Convergence rate of Jacobi in GPU
119
Figure 5.21 compares the speedup of Jacobi EbE, CGEbE and BiCGSTABEbE methods.
In terms of speedup, the Jacobi CG method gives a higher speedup for small problems (less than
372,149 unknowns), the
second higher speedup is for BiCGSTABEbE and then CGEbE. But
convergence time is much less for CGEbE, followed by BiCGSTABEbE and then Jacobi EbE.
For large problems (bigger than 372,149 unknowns, see Tables 5.3, 5.4, 5.5), CGEbE gives a
higher speedup, f
ollowed by Jacobi EbE and then BiCGSTABEbE. But convergence time is
much less for CGEbE, second less for BiCGSATBEbE and then Jacobi EbE (see Tables 5.3, 5.4,
5.5).
Figure 5.21: Speedup comparison between CGEbE, BiCGSTABEbE and Jacobi EbE algorithms
Fi
gure 5.22 graphically summarizes Kiss
et al
than those by the same authors, Kiss
et al
., in [34] even though both use the same hardware. Kiss
et al
. [109] got 10.01 as their maximum speed up for computations on a sin
gle GPU
(corresponding to our results which shows a maximum speedup of 102.12). In terms of speedup
we got better results while at the same time they got a higher convergence rate than ours with an
astoundingly fast rate in terms of iterations (e.g., 79 it
erations for a 91,000
X
91,000 matrix on a
CPU) but the results of Fernandez
et al.
[33] are more comparable to our Jacobi EbE. Kiss
et al
120
an erratic up and do
wn speedup that needs to be investigated [108].
Figure 5.22: Speed
-
up versus matrix size of Kiss
et al
.
Table 5.6: Speedup ratio between single and double
Matrix size
Jacobi
Gauss
-
Seidel
GMRES(35)
BiCGSTAB
2000
1.9148
1.9197
1.9773
1.9712
4000
1.5568
1.8194
1.9854
1.9976
8000
2.3471
1.9148
1.9255
1.8784
12000
1.9851
1.8095
2.0099
1.9157
16000
2.2012
1.9276
1.8528
1.9545
20000
2.0449
1.8841
1.8819
1.8850
Table 5.6 shows the speedup ratio of the single and double precision GPU implementation of th
e
direct method for linear systems [110]. Theoretically for memory bound algorithms, double
precision work on the GPU has been shown to take twice as much time than single precision
arithmetic [111]. However, there are papers where this is not so [108, 110
] (see Table
5.6).
Communications is a factor but the exact nature is still not known as pointed out in our paper
[108]. Devon Yablonski analyzes numerical accuracy issues that are found in many
scientific
121
GPU applications using floating point computation
[112]. As he puts it, two widely held myths
about floating
-
and that computations on the GPU are unavoidably different from the same computations on a
CPU [112]. He appears to have
dispelled both myths by studying specific
applications.
Figure 5.23: Serial addition losing precision. Numbers surrounded by a box represent the actual
result floating point value with 7 digits
Accumulating values serially (Figure 5.23 A) will someti
mes result in a large value that each
successive small value is added to, resulting in diminished accuracy of the results. The reduction
style of computation (Figure 5.23 B) avoids the issue in accumulating floating
-
point values in a
way that is similar to
binning [112]. Binning describes collecting and computing small groups of
values and then computing the final result by combining each result. There is sometimes an
erratic up and down speedup [113] like in Figures 5.5, 5.6, 5.9, 5.10, 5.13, 5.14. The ca
uses need
122
to be investigated as we have pointed out [108]. In a study we did of GPU computation for finite
element optimization by the genetic algorithm [3], the speedup showed an unexpected erratic up
and down trajectory [108]. This result is seen in othe
r works too such as of Krawezik and Pool
[113] as shown in Figure 5.24 and Kiss
et al
. [34] as shown in Figure 5.22. In the absence of an
explanation we carry on but a real understanding of the method to obtain the best speedup,
requires some investigation
.
Figure 5.24: Erratic behavior of gain for various methods
While expert programmers have programmed their problems in CUDA C to reap the
benefits of speedup, it is to be noted that the compiler is very difficult to work with as pointed
out by us [108]
. Error messages are still difficult to use in debugging. When memory is violated
in one function, the program crashes in another without a proper error message. Debugging is
therefore more difficult in CUDA C, particularly because we cannot print intermed
iate outputs
directly from the GPU
123
Chapter 6
Conclusion and Future Works
The first part of this thesis describes a successful script
-
based
, parameterized mesh
generator library for design and NDE developed in C for seamless finite element optimization.
Unlike other such systems written in inaccessible code and designed for use with specific
software, this system is provided as open
-
source code [114, 115] so that it may be modified by
anyone and used with his or her own finite element (FE) optimization
system. This system with
the finite element analysis and optimization on the GPU makes for quick NDE assessments and
other finite element optimization problems in the field [3]. The modified open source codes
Triangle and TetGen we have developed for optim
ization are CPU
-
based because mesh
generation takes little time compared to finite element matrix solution and optimization. As such
the massive effort to port them to a GPU is not justified. Moreover they are too complex for
porting to the GPU. A large te
am of engineers can translate that too to the GPU but we do not
think the effort is warranted by the gains to be had. The mesh generator has been demonstrated
through a successful NDE system for testing army ground vehicle hulls with damage from
corrosion
or IEDs and some associated optimization problems.
In the second contribution of this thesis, we have presented our GPU
-
based EbE matrix
solution routine that is very fast and takes little memory. It has been specifically developed for
finite element optim
ization. Several finite element solutions are done on parallel GPU threads
for use where memory is critical and solution times long. The GPU
-
parallelized pre
-
conditioned
conjugate gradient (PCG) algorithm with sparse
-
stored matrix formation is the
best w
ay for
124
PCG. With 24 GB GPU cards now available, EbE processing is not necessary for most practical,
single, forward problems. This element by element (EbE) method, how
-
ever, becomes a must
for genetic algorithm optimization where, whether in NDE or device
design, several genetic
algorithm threads are launched in parallel and memory capabilities are challenged. The EbE
Jacobi iterations give better speed
-
up than the conjugate gradients element by element (CGEbE)
for which incomplete cholesky conjugate gradi
ent (ICCG) is not possible because the matrix is
never formed in EbE processing.
In our work we use the genetic algorithm where the object function corresponding to
every member
of a population has to be computed many times to find the minimum. The many
members
form the genetic search space. Since
consists of dimensions and materials of a
particular design being examined for its goodness [116], for those dimensions a me
sh is
constructed, the finite element problem solved and the object function evaluated. The object
function itself is computed from a finite element solution involving a matrix equation. Thus we
may treat the object function computation as a kernel and lau
nch it on multiple threads, each for
a different member of the population. Then within that kernel, we can parallelize the matrix
equation solution at a speedup which we shall refer to a
SP
, which depends on problem size.
Alternatively, we may do the objec
t function evaluation for each member of the population in
sequence and in that process parallelize the matrix computations. Let the population number be
n
. Say the object function evaluation for each member of the population takes
in time
where
is the time for the matrix solution and
is
the time for other operations. Therefore if
we parallelize the operations for the different
members of the population, the time for evaluating
all the object functions corresponding
to the ent
ire population would still be, neglecting
coordination time,
125
since these are done simultaneously. Here we have assumed that the work for each member of
the population is
being done in parallel, and that the time for combining results and other
communications is negligible.
On the other hand, if we parallelized the matrix computation, the evaluation of the object
function has to be in sequence since we cannot have forking
from a parallelized kernel. The total
time would then be the number of members in the population multiplied by the time
for
computing the object function for each member of the population
where SP is the matrix equation solution speedup, tm is the time for the matrix solution, n is
population number and t
0
is the time for other operations. A deci
sion on which of the processes
is to be parallelized would depend on considerations like this. However we
have not seen such
considerations in the literature [12]. Further in a recent development CUDA Dynamic
Parallelism has been made available on the SM
3.5 architecture GPU [72], we can parallelize
both the genetic algorithm and the FE calculations. Suggested future work includes,
Develop the GPU based parametrized mesh generator for two and three dimensions.
Develop dynamic parallelization based optimi
zation tools.
Develop a user friendly PC
-
based optimization tools for portable NDE use in the field.
126
APPENDICES
127
Appendix A
: Publications Raised from
T
his Research
Journals
1.
S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, Ravi
Thyagarajan, Lalita
Nonde
structive
Defect Characterization: Element
-
by
-
Journal of Non
-
destructive Evaluation
, vol. 34(9), 2015
2.
S. Ratnajeevan H. Hoole, Sivamay
am Sivasuthan, Victor U. Karthik and Paul R.P. Hoole
-
teaching Engineering Optimization, Electromagnetic Product Design and
Computer Applications in Engi
-
neering Education Journal(CAEE)
, vol. 23, pp.
374
-
382, 2015
3.
Victor U. Karthik, Sivamayam Sivasuthan Arunasalam Rahunanthan, Ravi S. Thya
-
accurate, parallelized inversion for shape optimization in electroheat prob
lems on a
graphics processing unit (GPU) with the real
-
COMPEL
International Journal of Computations and Mathematics in Electrical
, vol. 34, no. 1, pp.
344
-
356, Jan. 2015.
4.
S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakuma
r, Ravi Thyagarajan, Lalita
-
based, Parameterized Mesh Generator for Machine
IETE Technical
Review
, Vol. 32( 2), pp. 94
-
103, 2015.
5.
S. Ratnajeevan H. Hoole ,
Victor U. Karthik, S. Sivasuthan, A. Rahunanthan, R. Thya
-
-
destructive
Evaluation: A Review in Magnetics, and Future Directions in GPU
-
based, Element
-
by
-
Element Coupled Optimization a
International Journal of Applied Elec
-
tromagnetics and Mechanics
, vol. 47(3), pp. 607
-
627, 2015
6.
S. Ratnajeevan H. Hoole, Sivamayam Sivasuthan Victor U. Karthik, Arunasalam Rahu
-
netic Device
Optimization: The Forking of Already Parallelized Threads on Graphics Processing
Applied Computational Electromagnetics Society (ACES) Journal
, vol. 29(9), pp.
677
-
684, 2014
128
7.
Sivamayam Sivasuthan, Victor U. Karthik, Arunasalam Rahu
nanthan, Ravi S. Thya
-
Revue roumaine des
sciences techniques
-
Serie Electrotechnique et Energetique
(accepted)In
press Vol. 60,
No. 3, 2015.
Peer
-
r
eviewed Conference Papers
1.
3D Mesh Generator for Optimization in NDE and Shape Design on a GPU
International Review of P
rogress in Applied Computational Electromagnetics
, 22
-
26
March 2015, in Williamsburg, VA
2.
ISFEE Conference 2014
(Ac
cepted)
November 28
-
29 2014, Bucharest, Romania
3.
S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, Ravi Thyagarajan, Lalita
Con
jugate
16th Biennial IEEE Conference on Electromag
netic Field
Compu
tation
(CEFC),
May 25
-
28 2014, Annecy, France.
4.
S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, Ravi Thyagarajan, Lalita
-
based, Parameterized Mesh Generator Li
-
brary for
CoupledGradient Desi
16th Biennial IEEE Conference on Elec
-
tromagnetic Field Computation (CEFC)
, May 25
-
28 2014, Annecy, France
5.
In proceedings of 2014 American Society
For Engineering Education, NCS
Conference
, April 4 and 5, 2014, Oakland University, Rochester, MI.
6.
S. Sivasuthan Victor U. Karthik, Arunasalam Rahunanthan, Ravi S. Thyagarajan,
d in
Optimization: The Forking of Already Parallelized Threads on Graphics Pro
cess
ing
The 30th International Review of Progress in Applied
Computational Electromagnetics
, March 23th
-
27th, 2014 in Jacksonville, Florida
7.
S. Sivasu
129
Review of Progress in Quantitative Nondestructive Evaluation, edited by Dale E.
Chimenti, Leonard J. Bond, and
Donald O. Thompson,
AIP Conference Proceedings
1581
, 1967
-
1974 , American Institute of Physics, Melville, NY.
8.
the Genetic Algorithm on NVIDIA GPU Architecture for S
40th Annual Review of Progress in Quantitative Nondestructive Evaluation, edited by
Dale E. Chimenti, Leonard J. Bond, and Donald O. Thompson,
AIP Conference
Proceedings 1581
, 1991
-
1998 , American Institute of Physics, Melville
, NY.
9.
Method in Electrical Engineering Optimization: Parallelization on Graphics Pro
cess
ing
ICPAM
-
LAE satellite
c
onference
, in Port Moresby, Papua New Guinea, 2013.
10.
Victor U. Karthik, S. Sivasuthan, A. Rahunanthan, P. Jayakumar, R. Thyagarajan, S.
Graphics Processing Unit for Grou
nd Vehicle Hull Inspection
Ground Vehicle Systems Engineering and Technology Symposium
, Troy, MI, August
2013.
11.
Inverse problem symposium
, East Lansin
g, 2012 (Digest)
130
Appendix B
: Sample Input File 2D
#A set of pointsin 2D(* WITHOUT VARIABLE POINTS).
# Number of nodes is 9 number of variables is 5
9 5
#And here are the nine points.
1 0.0 0.0
2
10. 0
0. 0
3
20.0
0. 0
4
10. 0
10. 0
5
0. 0
10. 0
6
2.
0
2. 0
7
4. 0
2. 0
8
4. 0
4. 0
9
2.0
4.0
# variable points
# number of points in first draw. Then coordinates
10 20.0 1.85
11 18 1.90
12
15.5 2.10
13
13 2.40
14
10 3
#segments, 1st line
--
number of segments following lines are segments (node
numbers, eac
h
segment has two node numbers and a marker to identify the boundary elements.
#number of segments
15
#segments (two nodes) and a marker
1
3
10
1
2
4
5
2
3
6
7
-
1
4
7
8
-
1
5
1
2
-
1
6
2
3
-
1
7
5
1
-
1
8
14
4
-
1
9
10
11
-
1
10
11
12
-
1
11
12
13
-
1
12
13
14
-
1
13
2
14
-
1
14
8
9
-
1
15
9
6
-
1
#segment markers used to identify the boundaries and set boundary
#conditions. do not give 0 or 1 to segment marker because already
#fixed as a default, 3rd column is boundary condition value
2
#asdf
131
1 2 0
2 3 0
#regions, num
ber of regions x y coordinates of region, regional
#attribute (for whole mesh), Area constraint that will not be used
#we can leave one region without any assignments we have to assign for
#this case 0 0 0 0 but we can give properties to this region
2
1 1 0.5 1 0.1
2 12 3 2 0.9
#
#properties of regions, first number of properties then property
#values
2
1 1 1.32
2 1.90 9.312
# holes, number of holes x y coordinates of the hole
1
1 3 3
#type
0
#measuring points
10
#point coordinates
1 1.1 1.
10
2 2.1 1.19
3 3.1 1.18
4 3.3 1.17
5 3.6 1.16
6 3. 9 1.15
7 4 .1 1.14
8 4. 4 1.13
9 4. 9 1.12
10 5.1 1.1
132
Appendix C: Sample Input File 3D
#3DMesh Input File
#Number of points <
--
> Number of variable points
36
22
#13
3
0
1
1
0
0
0
2
100
0
0
3
100
0
100
4
0
0
100
5
0
50
0
6
100
50
0
7
100
50
100
8
0
50
100
9
0
100
0
10
100
100
0
11
100
100
100
12
0
100
100
#coil
13
40.0 52.0 48.0
14
44.0 52.0 48.0
15
48.0 52.0 48.0
16
52.0 52.0 48.0
17
56.0 52.0 48.0
18
60.0 52.0 48.0
19
40.0
52.0 52.0
20
44.0 52.0 52.0
21
48.0 52.0 52.0
22
52.0 52.0 52.0
23
56.0 52.0 52.0
24
60.0 52.0 52.0
25
44.0 56.0 48.0
26
48.0 56.0 48.0
27
52.0 56.0 48.0
28
56.0 56.0 48.0
29
44.0 56.0 52.0
30
48.0 56.0 52.0
31
52.0 56.0 52.0
32
56.0 56.0 52.0
33
40.0 60.0 48.0
34
60.0 60.0 48.0
35
60.0 60.0 52.0
36
40.0 60.0 52.0
#variables
37
52.85316955
44
50.92705098
38
51.76335576
42
52.42705098
39
50
43
53
133
40
48.23664424
45
52.42705098
41
47.14683045
42
50.92705098
42
47.14683045
43
49.07294902
43
48.
23664424
45
47.57294902
44
50
44
47
45
51.76335576
43
47.57294902
46
52.85316955
43
49.07294902
47
51.42658477
46
50.46352549
48
50
46
51.5
49
48.57341523
47
50.46352549
50
49.11832212
46
48.78647451
51
50.88167788
46.5 48.7864745
1
52
50
48
50
53
51.42658477
40
50.46352549
54
50
40
51.5
55
48.57341523
40
50.46352549
56
49.11832212
40
48.78647451
57
50.88167788
40
48.78647451
58
50
38
49
#number of faces, boundary markers
53
1
# no of polygons> no of h
oles, boundary marker
1
0
3
4
1
2
3
4
1
0
3
4
5
6
7
8
1
0
3
4
9
10
11
12
1
0
3
4
1
2
6
5
1
0
3
4
5
6
10
9
1
0
3
4
2
3
7
6
1
0
3
4
6
7
11
10
1
0
3
4
1
4
8
5
1
0
3
4
5
8
12
9
1
0
3
4
8
7
11
12
1
0
3
4
4
3
7
8
1
0
-
1
12
13
14
25
26
15
16
27
28 17
18
34
33
134
1
0
-
1
12
19
20
29
30
21
22
31
32
23
24
35
36
1
0
-
1
4
13
19
36
33
1
0
-
1
4
18
24
35
34
1
0
-
1
4
33
34
35
36
1
0
-
1
4
13
14
20
19
1
0
-
1
4
14
20
29
25
1
0
-
1
4
25
26
30
29
1
0
-
1
4
15
21
30
26
1
0
-
1
4
15
16
22
21
1
0
-
1
4
16
22
31
27
1
0
-
1
4
27
28
32
31
1
0
-
1
4
17
23
32
28
1
0
-
1
4
#new
1
17
0
18
-
1
24
23
3
37
38
47
1
0
-
1
4
38
39
48
47
1
0
-
1
4
39
40
49
48
1
0
-
1
3
40
41
49
1
0
-
1
4
41
42
50
49
1
0
-
1
3
42
43
50
1
0
-
1
4
43
44
51
50
1
0
-
1
3
44
45
51
135
1
0
-
1
4
45
46
47
51
1
0
-
1
3
46
37
47
1
0
-
1
4
47
48
52
51
1
0
-
1
3
48
49
52
1
0
-
1
3
49
50
52
1
0
-
1
3
50
51
52
1
0
-
1
3
37
38
53
1
0
-
1
4
38
39
54
53
1
0
-
1
4
39
40
5
5
54
1
0
-
1
3
40
41
55
1
0
-
1
4
41
42
56
55
1
0
-
1
3
42
43
56
1
0
-
1
4
43
44
57
56
1
0
-
1
3
44
45
57
1
0
-
1
4
45
46
53
57
1
0
-
1
3
46
37
53
1
0
-
1
4
53
54
58
57
1
0
-
1
3
54
55
58
1
0
-
1
3
55
56
58
1
0
-
1
3
56
57
58
#bounday
conditions
1
#conditions
1
3
0
# 2 regions
136
3
1
10
10
10
1
0.1
2
46
50
50
2
0.01
3
-
1
-
1
-
1
3
1.2
#number of properties
2
1
1.90 2.20
2
2.213.30
3
3.214.30
#number of holes
0
#measuring points
5
1
52.85316955
51
50.92705098
2
51.76335576
51
52.42705098
3
50
51
53
4
48.23664424
51
52.42705098
5
47.14683045
51
50.92705098
#mesh area constraint
10
137
BIBLIOGRAPHY
138
BIBLIOGRAPHY
[1]
problems in electromagnetic NDE
using finite element methods
Magnetics
, vol. 34, no. 5, pp. 2924
2927, 1998.
[2]
Computers Mathematics with
Applications
, vol. 32, p. 133, 1996.
[3]
V. U. Karthik
, S
. Sivasuthan, A. Rah
unanthan, R. S. Thyagarajan, P. Jayakumar, L. Udpa,
Faster,
More Accurate Parallelized Inversion for Shape Optimization
in Electroheat Problems on a Graphics Processing Unit (GPU) with the Real
-
Coded
Genetic Algorithm,
COMPEL
, vol. 34
, no. 1, pp. 344
356, 2015.
[4]
AIP Conf. Proc
.,
pp. 1991
1998, 2014.
[5]
G. F. Uler, O. A. Mohammed
IEEE Transactions on Magnetics
, vol. 30, no. 6, pp.
4296
4298, 1994.
[6]
eg
uide
dev
IEEE Transactions on Magnetics
, vol. 32, no. 3 (2), pp. 1250
1253, 1996.
[7]
S. Dong
-
Joon, C. Dong
-
Hyeok, C. Jang
-
Sung, J. Hyun
-
Kyo, and C. Tae
-
Kyoung,
ficiency optimization of interior permanent magnet synchronous motor using genetic
I
EEE Transactions on Magnetics
, vol. 33, pp. 1880
1883, Mar. 1997.
[8]
Optimization of
loss in
orthogo
nal bend
Alexandria Engineering Journal
, vol. 52, no.
3, pp. 525
530,
2013.
[9]
IEEE
Transactions on Magnetics
, vol. 34, no. 5, 1998.
[10]
H. Enomoto, K. Harada, Y.
Optimal design
of linear
IEEE Tr
ans.
on Mag
.
, vol. 34, no. 5, 1998.
139
[11]
G. N. Vanderplaats,
Numerical Optimization Techniques for Engineering Design: With
Applications (Mcgraw
Hill Series in Mech
.
Engineering)
. Mcgraw
-
Hill College,
1984.
[12]
S. R. H. Hoole, S. Sivasuthan, V. U. Karthik, A. Rahunanthan, R. S. Thyagarajan, and P.
Threads on Graph
ACES Journal
, vol. 29, no. 9, pp. 677
684, 2014.
[13]
S. R. H. Hoole
, Computer
-
aided Analysis and Design of
Electromagnetic Devices
. New
York: Elsevier, 1989.
[14]
Finite Element Analysis and Design of Metal Structures
, pp. 182
205, 2014.
[15]
M. V. K. Chari and P. P. Silvester,
Finite Elements in Electrical and Magnetic Field
Problems (Wiley Series in Numerical Methods in Engineering).
New York: John Wiley
&
Son
s Inc, 1980.
[16]
-
sign
Computer Methods in Applied Mechanics and Engineering
, vol. 15,
no. 3, pp. 277
308, 1978.
[17]
ptimal design of structures by generalized steepest
International Journal for Numerical Methods in
En
gineering
, vol.
10, no. 4, pp. 747
766, 1976.
[18]
O. Pironneau,
Optimal Shape Design for Elliptic Systems
. Berlin, Heidelberg: Springer
Berlin Heidelberg, 1984.
[19]
IEEE Transactions on
Magnetics
, vol. 30, no. 5, pp. 3455
3458, 1994.
[20]
J. Haslinger an
d P. Neittaanmaki,
Finite Element Approximation for Optimal Shape,
Material and Topology Design
, 2nd Edition. Wiley, 1996.
[21]
IEEE Tran
.
on Mag
netics
, vol. 26, no. 5, pp. 2181
2183, 1990.
[22]
Annual Transaction of IESL
, pp. 141
149, 2003.
140
[23]
S. Sivasuthan
, V
memory limitation in finite
in AIP Conf. Proc
. 1581, pp. 1967
1974,
2014.
[24]
S. Sivasuthan, V. U. Karthik
, A
. Rahunanthan, P. Jayakumar, R. Thyagarajan, L. Udpa,
A Script
-
Based,
Parameterized Fin
ite Element Mesh
for Design and NDE
on a GPU
, vol. 32, pp. 94
103, Mar. 2015.
[25]
J. R. Shewchuk
Geometry
towards
Geometric Engineer
Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulat
or,
vol.
1148, pp. 203
222, 1996.
[26]
J. R. Shewchuk
Delaunay
refinement algorithms
Computational Geometry
, vol. 22, pp. 21
74, May 2002.
[27]
J. Sch
-
mesh generator
based on abstract
rules,
Computing and Visualization in Science
, vol. 1, pp. 41
52, July 1997.
[28]
CGAL Editorial Board
4.2
,
2013.
[29]
-
D finite element mesh generator with built
-
in
pre and post
-
International Journal for Numerical
Methods
in
Engineering
, pp. 1
24, 2009.
[30]
unstructured mesh
generator for
shallow water
Ocean Dynamics
, vol. 6
2, pp.
1503
1517, Nov. 2012.
[31]
T. Chen, J. Johnson, and W. D. Robert,
A Vector Level Control Function for Generalized
Octree Mesh Generation
. Vienna: Springer Vienna, 1995.
[32]
-
erato
Materials Research Innovations
, vol
. 15, pp. s482
s486, Feb. 2011.
[33]
Processing Approach for FEM,
IEEE
Trans
.
on Mag
.
, v
ol. 48, pp. 39
9
402, Feb. 2012.
[34]
I. Kiss, S. Gyimothy, Z. Badics, and J. P
-
by
-
141
IEEE Trans
.
on Magnetics
, vol. 48, no. 2, pp. 507
510, 2012.
[35]
T. Okimura, T. Sasayama, N. Takahashi, and S. Ikuno,
rallelization of Finite
El
ement
Analysis
of Nonlinear Magnetic Fields Using GPU,
IEEE Transactions
on Magnetics
, vol.
49, pp. 1557
1560, May 2013.
[36]
graphical processing u
IEEE Trans
.
on Magnetics
, vol. 46, pp. 2303
2306, 2010.
[37]
W. Khamsen, A. Aurasopon, and W. Sa
-
Voltage Harmonics Reduction in Pulse Width Modulation AC Chopper Using Bee Colony
IETE Technical R
eview
, vol. 30, no. 3, p. 173, 2013.
[38]
[39]
J. Mu
1996.
[40]
technique for finite
element magnetic field computation and its application to optimal designs of
IEEE Trans
.
on Magnetics
, vol. 47, pp. 2943
2946,
2011.
[41]
10 2D
http://www.jaew
oo.com/material/
magazinefolder/jwnews/0710/10
\
_New
\
_Features.pdf. [online accessed
2013
-
08
-
01].
[42]
t
ing
Magnetic Field Eddy
-
IEEE Transactions on Magnetics
, vol. 4
7, pp. 1070
1073, May 2011.
[43]
http://forge
-
mage.g2elab
.grenoble
-
inp.fr/project/got.
[o
nline accessed
14
-
07
-
26].
[44]
-
tiple
14th International Conferennce on C
omputing in Civil and Building
Engineering
, pp. 1
8, Publishing House ASV, 2012.
[45]
titious minima of object func
tions,
IEEE
T
ransactions on Magnetics
, vol. 27, no. 6, pp. 5214
5216, 1991.
[46]
-
142
Based Design Optimization
, vol.
34, pp. 157
180, July
2006.
[47]
C. Talischi, G. H. Paulino, A. Pereira, and I. F. M. Menez
im
plementation of a general topology optimization fram
ework using unstructured
polyg
onal finite
element meshes
Structural and Multidisciplinary Optimization
, vol. 45
,
pp. 329
357, Jan. 2012.
[48]
-
ACM Trans
-
actions on Mathematical Software
, vol. 41, pp. 1
36, Feb. 2015.
[49]
-
dimensional Delaunay
http://tetgen.
berlios.
de, 2006.
[50]
-
Dimensional Improved
-
Quality Triangulations Using
Local
SIAM Journal on Scientific Computing
, vol. 16, pp. 1292
1307, Nov.
1995.
[51]
D. J. Mavriplis,
Journal of Computational Physics
, vol. 117, pp. 90
101, Mar. 1995.
[52]
Proceedings of the VKI Le
cture series on Computational Fluid Dynamic
, pp. 2000
2004,
2000.
[53]
structing a Delaunay triangula
International Journal of Computer & Information Sciences
, vol. 9, pp. 219
242,
1980.
[54]
S. Fortune
A sweepline algorithm
Algorithmica
, vol. 2, pp. 153
174, 1987.
[55]
C. L. Lawson,
Software for C1 Surface Interpolation. Ma
thematical Software III
.
Aca
demic p ed., 1977.
[56]
A comparison
of sequenti
al Delaunay
triangulation algo
Computational Geometry
, vol. 7, pp. 361
385, Apr. 1997.
[57]
-
Dimensional Mesh
Journal of Algorithms
, vol. 18, pp. 548
585, May 1995.
143
[58]
http://www.cedra
t.com/en/software/got
-
it.html.
[online accessed
2014
-
04
-
21].
[59]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein
, Introduction to Algorithms
.
The
MIT Press, 2011.
[60]
http://webdocs.cs.ualberta.ca/~holte/T26/merge
-
sort.html.
[online
accessed 2015
-
02
-
11].
[61]
D. Robilliard, V. Marion
-
processing units,
Genetic
Programming and Evolvable Machines
, vol
. 10, pp. 447
471,
Oct. 2009.
[62]
Genetic Algorithms on
Graphics
Intelligent and Evolutionary Systems
, vol. 187, pp. 197
216, 2009.
[63]
S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, R. S. Thyagarajan, L. Udpa,
d Spee
d Problems in Nondestruc
tive Defect
Characterization: Element
-
by
-
Element Processing on a GPU
Evaluation
, vol. 34, no. 2, 2015.
[64]
IEEE
Transactions on Ma
gnetics
, vol. 27, no. 5, pp. 4146
4149, 1991.
[65]
Journal of Applied Physics
, vol. 67, no. 9, p.
5818, 1990.
[66]
C. Cecka, A. J
-
ics
International Journal for Numerical Methods in Engineering
, vol. 85, pp.
640
669, Feb. 2011.
[67]
http://www.nvidia.com/object/gpu.html.
[online accessed 2014
-
07
-
26].
[68]
http://www.nvidia.com/object/tesla
-
servers.html.
[ac
cessed 2015
-
07
-
26].
[69]
programming
http://docs.nvidia.com/cuda/
cuda
-
c
-
programming
-
guide/index.html
\
#axzz3gSPnvaGC.
[online
accessed
2013
-
08
-
01]
[70]
CUDA
Applications
f
or
http://www.nvidia.com/object/
cudaget.html. [online accessed 2014
-
07
-
26].
144
[71]
in Computer Animation and Virtual World
s,
vol.
15, pp. 219
227, 2004.
[72]
CUDA
Toolkit
Release
http://docs.nvidia
.com/cuda/
cuda
-
toolkit
-
release
-
notes/index.html. [online accessed 2013
-
08
-
01].
[73]
Algorithm 586
: IT
-
PACK
2C: A FORTRAN Package for Solving Large Sparse Linear Systems by Adaptive
Accelerated Iterative Methods
, vol. 8, pp.
302
322, Sept. 1982.
[74]
-
by
-
element solution a
lgorithm for
Computer Methods in Applied Mechanics and
Engineering
, vol. 36, pp. 241
254, Feb. 1983.
[75]
Y. Saad, Iterative Methods for Sparse Linear Systems, vol. 3. 2003.
[76]
G. F. Carey, E. Barragy, R. McLay, and M
-
by
-
element vector and
Communications in Applied Numerical Methods
, vol. 4, pp. 299
307, May 1988.
[77]
C
.
Heusser
Conjugate gradient
-
type algorithms for a finite
-
element discretization of the
Jour
nal of Computational and Applied Mathematics
, vol. 39(1), pp. 23
37, Feb. 1992
.
[78]
R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo,
C. Romine, and H. van der Vorst,
Templates for the Solution of Linear Systems: Build
ing
Blocks for Iterative Methods
. 1994.
[79]
-
by
-
element preconditioned conju
-
Parallel Computing
, vol. 17, pp.
1051
1065, Nov. 1991.
[80]
lysis of some element
-
by
-
Computer Methods in
Applied Mechanics and Engineering
, vol. 74, pp. 271
287, Sept. 1989.
[81]
SIAM
J. Numer. Anal.,
vol. 21, pp. 352
362, 1984.
145
[82]
-
self
-
adjoint generalization of the conjugate
gradient
method
USSR Comput. Maths. Math. Phys
., vol. 23, pp. 143
144,
1983.
[83]
-
by
-
element BICGSTAB iterative method for
three
-
dimensional steady Navier
-
Stokes equations
Mathematics
, vol. 79, pp. 147
165, Mar. 1997.
[84]
Application of
an element
-
by
-
element BiCGSTAB
i
terative
Computers & Mathematics with Applications
,
vol. 37, pp. 57
70, Feb. 1999.
[85]
-
SIAM Journal on Scientific and Statistical Com
puting
, vol. 10, no. 1, pp. 36
52, 1989.
[86]
-
CGSTAB: A Fast and Smoothly Converging Variant of Bi
-
CG for
SIAM Journal on Scientific and Statistical
Computing
, vol. 13, no. 2, pp. 631
644,
1992.
[87]
H. A. van der Vorst,
Iterative Krylov Methods for Large Linear Systems
, vol. 13. Cam
-
bridge: Cambridge University Press, 2003.
[88]
A. F. P. de Camargos, V. C. Silva, J.
-
Preconditioned Conjugate Gradient S
olver on GPU
for FE Modeling of Electromag
netic
IEEE Transactions on Magnetics
,
vol. 50, pp. 569
572, Feb. 2014.
[89]
J
.
of Compu
tational and Applied Mathematics
, vol. 236, pp. 3584
3590,
2012.
[90]
M. Ament, G. Knit
tel, D. Weiskopf, and W. Strab
gradient solver for the
Poisson
problem on
a multi
-
Pro
c
.
of the 18th
Euromicro Con
f
.
on Parallel, Distributed and Net
.
-
Based Processing
, pp. 583
592, 2010.
[91]
-
Conjugate Gradient Stabilized solver
J.
of Computers (Finland),
vol. 7, no. 12, pp. 3088
3095, 2012.
[92]
-
GPU
Implementations of Krylov
-
Subspace Methods Applied to FEA of Electromagnetic
IEEE Transactions on Magnetics
, vol. 51, no. 3, pp. 1
4, 2015.
146
[93]
C. A. J. Fletcher,
Computati
onal Techniques for Fluid Dynamics 1
. Berlin, Heidelberg:
Springer Berlin Heidelberg, 1998.
[94]
J. Haslinger and P. Neittaanm
aki,
Finite Element Approximation for Optimal Shape,
Material and Topology Design
, 2nd Edition. Wiley, 1997.
[95]
K. Preis, C. Magele, and
.
on Magnetics, vol. 26, no. 5, pp. 2181
2183, 1990.
[96]
Journal
of the American Statistical Association
, vol. 95, p. 347, Mar. 2000.
[97]
Genetic Algorithms
and the Optimal Allocation of Trials,
SIAM
Journal
on Computing
, vol. 2, no. 2, pp. 88
105, 1973.
[98]
[99]
[100]
in DoD Corrosion Conference
, pp.
1
11, 2009.
[101]
ructing and Classifying Damage in a 2D Steel Plate Using Non
-
in ASEE
-
NCS, (Oakland), ASEE
, 2014.
[102]
V. U. Karthik, Shape optimization using finite element analysis in eddy current testing and
electro
-
thermal coupled
pr
oblems. PhD Thesis, Michigan State University, 2015.
[103]
L. Vogt, R. Olivares
-
Amaya, S. Kermes, Y. Shao, C. Amador
-
Bedolla, and A. Aspuru
-
-
of
-
the
-
Identity Second
-
Order M
øller
-
Plesset Quan
tum
Chemistry Calculations with Grap
Journal of Physical Chemistry
A
, vol. 112, no. 10, pp. 2049
2057, 2008.
[104]
Journal of
Computational
Physics
, vol. 227, pp. 5342
5359, May 2008.
[105]
Parallel Computing
, vol. 35, no. 3, pp. 164
177, 2009.
147
[106]
E. Elsen, P. Le
Journal of Comput
.
Physics
, vol. 227, pp. 10148
10161, Dec. 2008.
[107]
-
ified paral
lel finite element Navier
-
2009 International Conference on
High Performance Computing & Simulation,
pp. 12
21, IEEE, June 2009.
[108]
S. Sivasuthan, V. U. Karthik, R. Arunasalam, R. S. Thyagarajan, P. Jayakumar, and S. R.
utations for Finite Element Optimization: Some Issues to be
aine des sciences techniques
-
S
erie electrotechnique et
energ
etique, vol. 60(3), no. (accepted
-
in press), 2015.
[109]
ity and increased intra
-
node
parallelism for solving finite
element models
on GPUs by novel element
-
by
-
element
2012 IEEE Conf
.
on High Perf
.
Extreme Computing, HPEC 2012
.
[110]
mance of the
Challenges for the Knowledge Society
, pp. 2036
2045, 2012.
[111]
\
_2010/
CUDA
\
_Tutorial/SC10
\
_Accelerating
\
_GPU
\
_Computation
\
_Through
\
_Mixed
-
Precision
\
_Methods.pdf. [online accessed 2014
-
07
-
26].
[112]
D. Yablonski, Numerical accuracy differences in CPU and GPGPU codes. MS Thesis
-
Northeastern University, 2011.
[113]
in Symp
.
on App
.
Accelerators in High Perf
.
Computing, (Urbana
-
Champaign, IL), 2009.
[114]
www.egr.msu.edu/~hoole/FE2DMesh.
[onl
ine accessed 2014
-
05
-
25].
[115]
www.egr.msu.edu/~hoole/FE3DMesh.
[online accessed 2015
-
07
-
26].
[116]
E. Laithwaite,
Electronics and Power
, vol. 11, no. 3, pp.
101
103, 1965.