SOFTWARE TOOL METHODOLOGIES ON A GPU FOR FINITE ELEMENT OPTIMIZATION IN MAGNETICS

By Sivamayam Sivasuthan

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering Doctor of Philosophy 2015

ABSTRACT

SOFTWARE TOOL METHODOLOGIES ON A GPU FOR FINITE ELEMENT OPTIMIZATION IN MAGNETICS

By Sivamayam Sivasuthan

The design of magnetic devices requires optimization coupled with finite element analysis (FEA). This involves a massive computational load and requires a specialized mesh generator, and is therefore not practicable with regular finite element tools. This thesis therefore presents i) a parameterized iterative mesh generator for two-dimensional and three-dimensional finite element optimization; ii) fast, low-memory finite element solvers using a graphics processing unit (GPU), in particular element-by-element finite element computations on a GPU with a speedup of 102 where the best competing method gives only 10; and iii) an examination of parallelizing such matrix computations on already parallelized genetic algorithm threads using new GPU architectures. The resulting system is reliable and yields solutions in practicable times with massive speedup. Example inverse optimization problems are presented. These software tools are written in C/C++ and CUDA C/C++. The system is shown to be applicable to the synthesis of two-dimensional and three-dimensional electromagnetic devices and to non-destructive evaluation (NDE) problems.

Several finite element mesh generators exist in the public domain, some even based on a parametric device description. But for optimization we need a parametrically described mesh that evolves dynamically through the iterations without user input. The few such generators that exist are commercial and their methodology is not known.
In this thesis the mesh generator we describe is open source code with parametric mesh generation that runs nonstop and seamlessly through optimization iterations to convergence without user intervention. Such mesh generators as do exist are rare, commercial and not easily available to researchers except at great cost, and never with the code to modify to suit individual needs. Besides, the typical mesh generator requires some man-machine interaction to define the mesh points and boundary conditions and does not work for nonstop optimization iterations. We take two regular open source mesh generators, one for two-dimensional systems and the other for three-dimensional systems, write a script-based interface as open source code to run nonstop for optimization, and use it equally adaptively for machine design. A simple scheme of averaging neighbor heights gives us a smooth geometry without having to use Bezier curves. Through a literature survey, this thesis also points out issues in GPU computation that result in erratic speedup, and explains why in some instances GPU solutions are arithmetically only a slight improvement on CPU solutions.

This thesis is dedicated to my parents, Sinnathurai Sivamayam and Sivamayam Sivakumary, and my mentors . . .

ACKNOWLEDGMENTS

I would like to express my profound gratitude and appreciation to my advisor Dr. S. Ratnajeevan H. Hoole for taking me on as a research student under his valuable guidance with funding. His advice and insights were the real encouragement to complete this work. It has been an honor to work with him. My sincere thanks also go to the rest of my PhD committee members, Dr. Lalita Udpa, Dr. William Punch and Dr. Nihar Mahapatra, for their suggestions and encouraging comments throughout this work. I thank the U.S. Army Tank Automotive Research and Development Center (TARDEC) for funding our research under Contract Numbers W911NF-11-D-0001 and W56HZV-07-2-001.
I am thankful to the MSU Graduate School and College of Engineering for awarding a summer graduate excellence fellowship thrice, a graduate office fellowship and a graduate teaching assistantship for one semester. This support was instrumental in facilitating the completion of my degree. I would also like to thank the faculty and staff members of the Department of Electrical and Computer Engineering at Michigan State University, especially its College of Engineering, for their support in various forms and their friendship. I have been lucky to share a lab with so many friendly colleagues, namely V. U. Karthik, M. R. Rawashdeh and T. Mathiyalakan. I express my sincere gratitude for their unstinted friendship. My sincere thanks go to all the academic and nonacademic staff members and friends at the University of Jaffna, Sri Lanka, where my career as a student in computer science started. Finally, I am forever indebted to my parents and other members of my family, who have supported and encouraged me through their kindness and affection, so that I could concentrate on my studies. They touched me more deeply than I could have ever expected.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1 Introduction
1.1 Motivation
1.2 The Finite Element Method
1.2.1 Introduction
1.2.2 Two-Dimensional Problems
1.2.3 Trial Function
1.2.4 Solving Magneto-Static Problems
1.2.5 Boundary Conditions
1.2.6 Three-Dimensional
1.3 Inverse Optimization Problems

Chapter 2 Parameter Based Unstructured Mesh Generator for Two and Three Dimensional Problems for Seamless Optimization
2.1 Background
2.2 Mesh Generation
2.2.1 Introduction
2.2.2 Delaunay Based Methods
2.2.3 Delaunay Triangulation and Constrained Delaunay Triangulation
2.2.3.1 Introduction
2.2.4 Algorithms for Constructing a Delaunay Triangulation
2.2.4.1 Introduction to Constructing a Delaunay Triangulation
2.2.4.2 Divide-and-Conquer Algorithm
2.2.4.3 Sweep-line Algorithm
2.2.4.4 Incremental Insertion Algorithm
2.2.5 Mesh Refinement
2.2.6 Three Dimensional Mesh Generation
2.3 Parameterized Mesh Generation
2.4 New Approach to Parameterized Mesh Generation
2.5 Data Structure and User Interface
2.5.1 Data Structure
2.5.2 User Interface
2.5.2.1 Introduction
2.5.2.2 Defining the Geometrical Shape
2.5.3 Post-processing of Meshing
2.5.4 Approach to Renumbering
2.5.5 Merge Sort
2.5.6 Modified Form of Merge Sort for Renumbering

Chapter 3 Low Memory High Speed FEM Solvers Using the GPU
3.1 Introduction
3.2 General Purpose Computing on a Graphics Processing Unit (GPGPU)
3.3 Related Works
3.4 Element by Element Solvers
3.4.1 Element by Element with Jacobi Algorithm
3.4.2 Element by Element Conjugate Gradients Algorithm
3.4.3 Element by Element Biconjugate Gradient Algorithm
3.4.4 Element by Element with Bi-Conjugate Gradient Stabilized Method
3.5 Conjugate Gradients Algorithm with Sparse Storage Schemes
3.5.1 Conjugate Gradient Algorithm for Matrix Solution
3.5.2 Matrix Storage Schemes
3.5.2.1 Introduction
3.5.2.2 Profile Storage
3.5.2.3 Sparse Storage Scheme

Chapter 4 Test and Validation Problems
4.1 Device Design Inverse-Optimization Problem: Design of the Pole Face of an Electrical Motor
4.1.1 Problem Definition
4.1.2 Problem Model
4.2 Inverse-Optimization for Device Design: Determining the Rotor Contour of a Salient Pole Synchronous Generator
4.2.1 Problem Definition
4.2.2 Problem Model
4.3 NDE Benchmark Problem: Characterizing Interior Defects
4.3.1 Problem Definition
4.3.2 Problem Model
4.4 A Simple Three-Dimensional Problem
4.5 NDE Benchmark Problem in 3D: Characterizing Interior Defects

Chapter 5 Results and Analysis
5.1 Memory Limitation
5.2 Element-by-Element Solvers
Chapter 6 Conclusion and Future Works

APPENDICES
Appendix A: Publications Raised from This Research
Appendix B: Sample Input File 2D
Appendix C: Sample Input File 3D

BIBLIOGRAPHY

LIST OF TABLES

Table 4.1 The flux distribution from the unconstrained optimization
Table 4.2 The flux distribution from the constrained optimization
Table 4.3 The flux distribution after the optimization without smoothened shape
Table 4.4 The flux distribution after optimization with smoothened shape
Table 4.5 The solution of defect characterization
Table 4.6 Real and binary solutions: time needed to compute
Table 4.7 Real and binary solutions: time needed to compute
Table 4.8 Potentials at measuring points
Table 4.8 Potentials at measuring points
Table 4.9 The solution of 3D defect characterization
Table 5.1 Number of elements (NE) and storage (in MB) with matrix size for different storage schemes
Table 5.2 Projected Memory (in MB) Needs
Table 5.3 The CGEbE solution
Table 5.4 The BiCGSTABEbE solution
Table 5.5 The Jacobi solution
Table 5.6 Speedup ratio between single and double precision

LIST OF FIGURES

Figure 1.1 Regular finite element analyses
Figure 1.2 FEM with optimization algorithms
Figure 1.3 Definition of H1 and h1 for a triangle
Figure 1.4 Integration of triangular coordinates
Figure 1.5 Known elements and unknown elements in matrix
Figure 1.6 Traditional engineering design process
Figure 1.7 Modern design process
Figure 1.8 Equipotentials of vector potential at the optimum
Figure 1.9 Equipotentials of vector potential at the optimum with B-spline interpolation
Figure 2.1 Design cycle for a geometric optimization
Figure 2.2 Design cycle for an inverse problem
Figure 2.3 Performance comparison of triangulation using CPU-DT and GPU-DT for different numbers of points
Figure 2.4 Problem specific parametric mesh generators
Figure 2.5 Elastically deformed problem specific NDE mesh: As defect moves
Figure 2.6 3D mesh for NDE problem: As parameters change
Figure 2.7 3D mesh for motor problem
Figure 2.8 Applying Delaunay triangulation
Figure 2.9 Example for constrained Delaunay triangulation
Figure 2.10 Divide and conquer algorithm
Figure 2.11 Sweep-line algorithm
Figure 2.12 Incremental insertion algorithm
Figure 2.13 Delaunay mesh refinement
Figure 2.14 Delaunay mesh refinement between two regions
Figure 2.15 My approach to parameterized mesh generation
Figure 2.16 Sample input file for mesh generator
Figure 2.17 Simple example problem
Figure 2.18 Numbering and renumbered nodes
Figure 2.19 Renumbered nodes
Figure 2.20 Sorted version of (b) and corresponding index changes
Figure 2.21 Merge Sort
Figure 3.1 Finite element optimization using genetic algorithm
Figure 3.2 Floating-point operations per second and memory bandwidth for the CPU and GPU
Figure 3.3 The GPU devotes more transistors to data processing
Figure 3.4 Steps in the classic finite element method (FEM) and the proposed changes for the FEM-SES method enclosed within the dashed line
Figure 3.5 Proposed method in flow chart
Figure 3.6 A. Sparse full matrix, B. Sparse lower triangular matrix (because of symmetry)
Figure 3.7 Data structures for the symmetric profile storage corresponding to Figure 3.6 B
Figure 3.8 A. Sparse full matrix, B. Sparse upper triangular matrix (because of symmetry)
Figure 3.9 Data structures for the symmetric profile storage corresponding to Figure 3.8 B
Figure 4.1 Pole face of electrical motor
Figure 4.2 Geometry, boundary conditions and the material properties of the sample problem
Figure 4.3 Defining the problem using our tool
Figure 4.4 Generated mesh using our tool
Figure 4.5 Finite element solution
Figure 4.6 Results of the unconstrained optimization of the problem
Figure 4.7 Results of the constrained optimization without smoothening
Figure 4.8 Results of the constrained optimization with smoothening
Figure 4.9 The flux distribution from the unconstrained optimization
Figure 4.10 Averaging technique for manufacturable shape
Figure 4.11 The flux distribution from the constrained optimization
Figure 4.12 A synchronous generator (A) two pole and (B) four pole
Figure 4.13 Parametrized geometry of salient pole
Figure 4.14 Defining the problem
Figure 4.15 Initial mesh
Figure 4.16 Flux lines of a salient pole synchronous generator
Figure 4.17 Optimized shape without smoothening constrained by rising pole heights from left to right
Figure 4.18 The flux distribution after the optimization without smoothened shape
Figure 4.19 The flux distribution after the optimization with smoothened shape
Figure 4.20 Final smoothened shape
Figure 4.21 Inspection of an army vehicle after improvised explosive device
Figure 4.22 Parametrically defined crack in plate from Triangle
Figure 4.23 Defining the problem
Figure 4.24 Generated mesh for NDE problem
Figure 4.25 Flux lines for NDE problem
Figure 4.26 Optimum shape of the reconstructed defect
Figure 4.27 Square conductor problem
Figure 4.28 Mesh for square conductor problem
Figure 4.29 Potential at measuring points
Figure 4.30 Potential at measuring points for a simple cube
Figure 4.31 Three-dimensional NDE problem
Figure 4.32 Defining variable: top: side view, bottom: side view
Figure 4.33 Three-dimensional mesh for NDE problem: As parameters change
Figure 5.1 Memory vs matrix size
Figure 5.2 Speed-up versus matrix size: Jacobi preconditioned conjugate gradients algorithm
Figure 5.3 Number of iterations vs number of unknowns for CGEbE
Figure 5.4 Number of iterations vs number of elements for CGEbE
Figure 5.5 Speedup vs number of unknowns for CGEbE
Figure 5.6 Speedup vs number of elements for CGEbE
Figure 5.7 Number of unknowns vs number of iterations for BiCGSTABEbE
Figure 5.8 Number of elements vs number of iterations for EbEBiCGSTAB
Figure 5.9 Speedup vs number of unknowns for BiCGSTABEbE
Figure 5.10 Speedup vs number of elements for BiCGSTABEbE
Figure 5.11 Number of unknowns vs number of iterations for JacobiCG
Figure 5.12 Number of elements vs number of iterations for JacobiCG
Figure 5.13 Speedup vs number of unknowns for Jacobi EbE
Figure 5.14 Speedup vs number of elements for Jacobi EbE
Figure 5.15 Convergence rate of CG in CPU
Figure 5.16 Convergence rate of CG in GPU
Figure 5.17 Convergence rate of BiCGSTABEbE in CPU
Figure 5.18 Convergence rate of BiCGSTABEbE in GPU
Figure 5.19 Convergence rate of Jacobi in CPU
Figure 5.20 Convergence rate of Jacobi in GPU
Figure 5.21 Speedup comparison between CGEbE, BiCGSTABEbE and Jacobi EbE algorithms
Figure 5.22 Speed-up versus matrix size of Kiss et al.
Figure 5.23 Serial addition losing precision. Numbers surrounded by a box represent the actual result floating point value with 7 digits
Figure 5.24 Erratic behavior of gain for various methods

LIST OF ALGORITHMS

Algorithm 2.1 Renumbering
Algorithm 2.2 Merge Sort
Algorithm 2.3 Merge
Algorithm 2.4 Merge Sort - Modified
Algorithm 2.5 Merge - Modified
Algorithm 3.1 Element by Element Jacobi Algorithm
Algorithm 3.2 Computing the diagonal vector {D} and the right hand side vector {Q}
Algorithm 3.3 Element by Element Conjugate Gradients Algorithm
Algorithm 3.4 Element by Element Biconjugate Gradient Algorithm
Algorithm 3.5 Element by Element with Bi-Conjugate Gradient Stabilized method
Algorithm 3.6 Preconditioned Conjugate Gradient

Chapter 1 Introduction

1.1 Motivation

Existing finite element analysis software for electromagnetic fields provides advanced features for design or analysis problems. However, engineering design involves inverse problem solving: once the requirements are given, we have to find the geometrical shape or the material properties that satisfy the requirements as closely as possible (if not exactly). This is done basically using optimization algorithms. The deviation of the analyzed performance of the present shape from the requirements must be formulated as an error function, which is minimized to get the optimum solution [1]. Most finite element analysis software packages do not support this, so it must be done by trial and error. However, a trial and error process alone is so inefficient that it is not practically possible to obtain a solution; therefore an expert is required to guide this trial and error process using his experience.
However, success is then directly dependent on the ability of the human expert, and the solution obtained may deviate from the best possible solution. (Note that computer based optimization also does not guarantee the global minimum; however, it can reach a far better solution than a human expert.) There are some finite element analysis software packages that support optimization, but such software can only be used for the optimization of special classes of problems. A totally new program has to be developed in order to solve a new type of problem. This is an expensive and time consuming task. It forces designers who lack the resources to develop a separate optimization program for their specific need to use the old trial and error approach with traditional finite element analysis programs. Indeed, it is usually not economical to develop a separate program if we have to solve a given problem only once or twice.

Figure 1.1: Regular finite element analysis

Figure 1.1 shows the regular finite element analysis process. Figure 1.2 shows finite element optimization using the zeroth order genetic algorithm (GA) and gradient based algorithms. If we use GA optimization we can speed up our solution time through parallelization on a GPU. In the genetic algorithm, the design parameter vector {h} is in general binary encoded [2]. A chromosome is a vector {h}. Its fitness score f is defined in terms of the object function F({h}). Though GA is practicable and gives a faster solution when parallelized [3], in our experience it is slow as a single process compared with the gradient optimization methods. In sequential CPU computing, the fitness value is calculated for each chromosome one by one. When the population is large it takes a very long time to converge. We use GPU computation to overcome this problem [3]. We launched GPU kernels for computing the fitness value.
The fitness values are thus calculated simultaneously for all chromosomes in the population (Figure 1.2) [4]. Genetic algorithm based finite element analysis in magnetics has been carried out by many researchers [5, 6, 7, 8, 9, 10] over the past 20 years. In gradients based optimization, the changes in the parameters of the device description {h} are made against the gradient of the object function F, because in the one-dimensional analogy the minimum point is to the right of locations with negative gradient and to the left of those with positive gradient:

{h}_new = {h}_old − t ∇F({h}),

where the amount of change t is determined by a line search [11]. The computation of the gradient ∇F({h}) was previously by finite difference: F is computed through a finite element solution corresponding to a given {h}, and then each component h_i is in turn changed by a small amount Δh_i and F recomputed, giving

∂F/∂h_i ≈ [F({h} + Δh_i ê_i) − F({h})] / Δh_i.

Thus computing ∇F at each iterative step with n components of {h} took n + 1 finite element solutions; and once the direction of change of {h}, against ∇F, is established, several more finite element solutions are required as {h} is progressively changed along that direction until the minimum of F along it is identified [11]. Each changed {h} means a new geometry and therefore a new mesh. For a seamless iterative process, automatic mesh generators are required that can yield a mesh corresponding to a given {h}. With gradient based algorithms only the matrix solution may be parallelized, whereas with the GA both the matrix solution and the optimization may be parallelized through forking within a fork [12].

Figure 1.2: FEM with optimization algorithms

Therefore, we have developed a general-purpose two-dimensional and three-dimensional finite element inverse optimization procedure using GA software tools on a GPU for analysis and geometrical shape optimization in design and NDE problems.
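The n + 1 evaluation cost of the finite-difference gradient described above can be made concrete with a short sketch. The objective F here is a cheap stand-in function; in the thesis's setting every call to F is a full finite element solution, which is exactly why the n + 1 count matters.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Forward-difference gradient of an object function F({h}).  One evaluation
// at the base point plus one per component: n + 1 evaluations in all, each a
// complete finite element solution in the thesis's setting.
std::vector<double> fd_gradient(
        const std::function<double(const std::vector<double>&)>& F,
        std::vector<double> h, double dh = 1e-6) {
    std::vector<double> grad(h.size());
    const double F0 = F(h);                 // base-point solution
    for (std::size_t i = 0; i < h.size(); ++i) {
        h[i] += dh;                         // perturb one parameter
        grad[i] = (F(h) - F0) / dh;         // one further solution per i
        h[i] -= dh;                         // restore the parameter
    }
    return grad;
}
```

A line search then moves {h} against this gradient, re-evaluating F along the descent direction until the one-dimensional minimum is bracketed.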
1.2 The Finite Element Method

1.2.1 Introduction

The finite element method is a method for solving partial differential equations numerically. It is widely used to solve electromagnetic problems [13], structural design problems [14], etc. using computers. The method is generally used to find the distribution of a certain field (e.g. magnetic vector potential, electric scalar potential, tension, fluid velocity, etc.) in space governed by a given differential equation called the governing equation. In order to solve the problem numerically, the solution region is divided into a finite number of elements (which need not be uniform). The potentials are assumed to have a known mathematical variation called the trial function (e.g. linear variation, quadratic variation, etc.) over each individual element. The field is postulated to be an interpolation from its values at certain nodes. Note that these elements must be small enough for this assumption to be valid [13]. The problem is then solved to get the potentials at the interpolation nodes of these elements. The potential at any given point in the solution space can then be found by interpolating these now known potentials using the trial function.

1.2.2 Two-Dimensional Problems

In most two dimensional finite element analyses, the space is divided into a mesh of triangles. Points in these triangles are called the interpolation nodes of the mesh. The variation of the potential over the triangles is assumed to be defined by a given trial function (most often a first order trial function). The objective is to find the potentials at the nodes of the mesh so that the potential at any given point inside a triangle can be found using the trial function. To do this, we develop one equation per unknown node in the mesh. Then this set of equations must be solved to find the potentials at each node.
Since these equations alone are not sufficient to get a unique solution, some boundary conditions must also be considered.

1.2.3 Trial Function

Since we use only first order trial functions in our software tools, let us consider them in detail. In two-dimensional Cartesian coordinates, a first order trial function can be expressed as follows [13, 15]:

A = a + b x + c y,

where x and y are Cartesian coordinates and a, b and c are constants for the triangle. Triangular coordinates are used in finite element analysis, as these normalized coordinates provide several advantages in analyzing properties inside triangles.

Figure 1.3: Definition of H1 and h1 for a triangle

The triangular coordinates of a point P inside a triangle are defined as follows (see Figure 1.3):

ζ_1 = h_1 / H_1,

where H_1 is the shortest distance from N1 to N2N3 and h_1 is the shortest distance from P to N2N3, and so on for ζ_2 and ζ_3. If S is the area of the triangle, the three sub-triangles into which the dotted lines of Figure 1.3 divide it satisfy

Area(P N2 N3) + Area(P N3 N1) + Area(P N1 N2) = S,

and since each sub-triangle area is the corresponding ζ_i times S, dividing through by S gives

ζ_1 + ζ_2 + ζ_3 = 1.

Using linear interpolation, the Cartesian coordinates of a point within the triangle can be written as follows [15]:

x = ζ_1 x_1 + ζ_2 x_2 + ζ_3 x_3,
y = ζ_1 y_1 + ζ_2 y_2 + ζ_3 y_3,
1 = ζ_1 + ζ_2 + ζ_3.

These are exact because x and y are linear. Solving these three equations,

ζ_i = (a_i + b_i x + c_i y) / (2S),

where i1 = i mod 3 + 1, i2 = i1 mod 3 + 1, and

a_i = x_{i1} y_{i2} − x_{i2} y_{i1},   b_i = y_{i1} − y_{i2},   c_i = x_{i2} − x_{i1}.

Using these triangular coordinates, the first order trial function A is expressed as

A = ζ_1 A_1 + ζ_2 A_2 + ζ_3 A_3,

where A is the potential at the point (ζ_1, ζ_2, ζ_3) and A_1, A_2 and A_3 are the potentials at the node points 1, 2 and 3 of the triangle. This can be verified by considering that the triangular coordinates of the three nodes of the triangle are (1,0,0), (0,1,0) and (0,0,1): substituting these points recovers A_1, A_2 and A_3 as the potentials at the three nodes, and the variation of the potential along any given line inside the triangle is linear.
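The triangular coordinates above can be computed directly from the a_i, b_i, c_i formulas; a minimal sketch (0-based indices standing in for the 1-based i1, i2 of the text):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Triangular (area) coordinates of a point (x, y) in a triangle with vertices
// (xv[i], yv[i]), using zeta_i = (a_i + b_i x + c_i y) / (2S) with
// a_i = x_i1*y_i2 - x_i2*y_i1, b_i = y_i1 - y_i2, c_i = x_i2 - x_i1.
std::array<double, 3> tri_coords(const std::array<double, 3>& xv,
                                 const std::array<double, 3>& yv,
                                 double x, double y) {
    // Twice the signed area, 2S, is the sum of the a_i.
    double twoS = 0.0;
    for (int i = 0; i < 3; ++i) {
        int i1 = (i + 1) % 3, i2 = (i + 2) % 3;
        twoS += xv[i1] * yv[i2] - xv[i2] * yv[i1];
    }
    std::array<double, 3> zeta{};
    for (int i = 0; i < 3; ++i) {
        int i1 = (i + 1) % 3, i2 = (i + 2) % 3;
        double a = xv[i1] * yv[i2] - xv[i2] * yv[i1];
        double b = yv[i1] - yv[i2];
        double c = xv[i2] - xv[i1];
        zeta[i] = (a + b * x + c * y) / twoS;
    }
    return zeta;
}
```

At a vertex the result is (1,0,0) (or a permutation), at the centroid (1/3, 1/3, 1/3), and the three coordinates always sum to 1, as the derivation requires.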
This trial function also provides a continuous variation of potentials from triangle to triangle and a continuous first derivative from triangle to triangle along the tangential direction of the boundary. From the triangular coordinates,

∇ζ_i = (1/2S)(b_i x̂ + c_i ŷ),

so that

∇A = Σ_{i=1..3} A_i ∇ζ_i = (1/2S) Σ_{i=1..3} A_i (b_i x̂ + c_i ŷ).

Let us examine another property of these triangular coordinates, referring to Figure 1.4:

Figure 1.4: Integration of triangular coordinates

∫∫ ζ_1^m ζ_2^n ζ_3^p dS = 2S m! n! p! / (m + n + p + 2)!,

where S is the area of the triangle. That is, for the products of pairs of coordinates,

∫∫ {ζ}{ζ}^t dS = (S/12) [2 1 1; 1 2 1; 1 1 2] = [T],

where [T] is a metric tensor.

1.2.4 Solving Magneto-Static Problems

This solution uses first order triangular elements and materials with linear magnetic properties at low frequency, for simplicity. The governing differential equation follows from Maxwell's equation

∇ × H = J,

where H is the magnetic field intensity and J is the current density. Since B = ∇ × A and H = νB, where ν is the reluctivity, in two dimensions

−∇ · (ν ∇A) = J.

From this the energy functional can be derived. Since the energy is at its minimum at the stable state, this functional achieves its minimum at the solution. It is called the Lagrange functional:

L{A} = ∫∫ [ (ν/2) |∇A|² − J A ] dS.

The solution region has been divided into triangles, so the total energy can be written as the sum of the energies of the individual triangles. Carrying out the integration triangle by triangle, using the gradient of the trial function and the integration formula above, each triangle contributes a local matrix [P] and source vector {Q} with

P_ij = ν (b_i b_j + c_i c_j) / (4S),   Q_i = J S / 3.

To get the solution we have to minimize L{A}, setting

∂L/∂A_i = 0

for every unknown A_i. Therefore the final solution can be obtained by solving the assembled system, which can be written in the form

[P]{A} = {Q},

and the equation solved for {A}.

Figure 1.5: Known elements and unknown elements in matrix

1.2.5 Boundary Conditions

We basically use two types of boundary conditions, namely Neumann and Dirichlet [13]. Dirichlet boundary conditions mean that the potential along the boundary is fixed at a given value, and Neumann boundary conditions mean that the derivative of the unknown potential along the normal direction at the boundary is zero.
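The per-triangle contributions of Section 1.2.4 can be sketched directly from the b_i, c_i constants; a minimal sketch of the local matrices (the global [P]{A} = {Q} is assembled by summing these over all triangles):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Local 3x3 matrix [P] and source vector {Q} for one first-order triangle:
//   P_ij = nu (b_i b_j + c_i c_j) / (4S),   Q_i = J S / 3,
// with b_i = y_i1 - y_i2 and c_i = x_i2 - x_i1 as in the triangular
// coordinate formulas (0-based indices here).
struct TriElement {
    std::array<std::array<double, 3>, 3> P;
    std::array<double, 3> Q;
};

TriElement element_matrices(const std::array<double, 3>& x,
                            const std::array<double, 3>& y,
                            double nu, double J) {
    std::array<double, 3> b{}, c{};
    for (int i = 0; i < 3; ++i) {
        int i1 = (i + 1) % 3, i2 = (i + 2) % 3;
        b[i] = y[i1] - y[i2];
        c[i] = x[i2] - x[i1];
    }
    double S = 0.5 * (b[0] * c[1] - b[1] * c[0]);  // triangle area
    TriElement e{};
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j)
            e.P[i][j] = nu * (b[i] * b[j] + c[i] * c[j]) / (4.0 * S);
        e.Q[i] = J * S / 3.0;
    }
    return e;
}
```

A quick sanity check on any element: each row of [P] sums to zero, since a constant potential produces no field and hence no energy.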
Dirichlet boundary conditions can be implemented by considering the node points on the boundary to have known values. Neumann boundary conditions are implemented automatically if Dirichlet conditions are not used at a boundary [13]; they are said to be natural. If Dirichlet boundary conditions are used, the potentials at some points along the boundary are known. Therefore the vector {A} can be broken into two parts: the potentials {A_k} at nodes with known potentials and the potentials {A_u} at nodes with unknown potentials. Equation 1.43 can then be written as (see Figure 1.5)

[ [P_uu] [P_uk] ; [P_ku] [P_kk] ] { {A_u} ; {A_k} } = { {Q_u} ; {Q_k} },

so that the unknowns are obtained from

[P_uu]{A_u} = {Q_u} − [P_uk]{A_k}.

This is the matrix equation used by this software to find the final solution of the finite element analysis. There is no minimization of L with respect to the known potentials {A_k}.

1.2.6 Three-Dimensional Problems

Corresponding to the triangle in two dimensions, the tetrahedron is a convenient element to use in three dimensions. For the tetrahedral coordinates [13],

ζ_i = h_i / H_i,

where now h_i is the height of a point above the triangular face opposite vertex i of the tetrahedron and H_i is the height of that vertex above the same face. First order interpolation gives

x = Σ ζ_i x_i,   y = Σ ζ_i y_i,   z = Σ ζ_i z_i,   1 = Σ ζ_i.

Solving the preceding four equations for the four ζs, we have

ζ_i = (a_i + b_i x + c_i y + d_i z) / (6V),

where V is the volume of the tetrahedron and the constants a_i, b_i, c_i and d_i are determined by the vertex coordinates, or a cyclic permutation of them. The trial function is A = ζ_1 A_1 + ζ_2 A_2 + ζ_3 A_3 + ζ_4 A_4, where A is the potential at the point (x, y, z) and A_1, A_2, A_3 and A_4 are the potentials at the node points 1, 2, 3 and 4 of a tetrahedron [13].

Figure 1.6: Traditional engineering design process

1.3 Inverse Optimization Problems

The inverse problem, the more practically realistic problem, is synthesis: wanting a performance, computing the system description from it. Thus the computational design assignment may be this: compute the size and other descriptions of a motor that can produce so much torque. Figure 1.6 shows the traditional engineering design process. An expert decides which device to use, assigns parameters for that device, and then checks the performance by making and testing. Finally, the expert makes changes in the parameters if needed.
This process repeats until we get the desired performance. In the 1960s, the analysis phase was introduced in place of make-and-test before checking the performance. Thereafter the expert who makes the entire decision about the design process was replaced by powerful software. Figure 1.7 shows the modern design process, using AI techniques, a knowledge base, etc. to make the device selection. Optimization algorithms are used to select parameters in order to get the desired performance. In the modern design process, analysis is replaced with synthesis. The earliest to automate this cycle in magnetics were Marrocco and Pironneau in 1978 [16]. In 1976, parallel work with [16] by Arora and Hang [17] also established finite element optimization in magnetics. An erratic undulating shape with sharp edges arose when Pironneau [18] optimized a recording head to achieve a constant magnetic flux density, and this was overcome through constraints [19]. Haslinger and Neittaanmaki [20] suggest Bezier curves to keep the shapes smooth with just a few variables to be optimized, while Preis, Magele, and Biro [21] have suggested fourth-order polynomials, which when we tried them gave us smooth but undulating shapes because of the higher order. Most of the required shape changes can be achieved with linear variations [22].

Figure 1.7: Modern design process

Figure 1.8 shows that without regularity constraints, sharp corners and jagged contours arise in designing a pole face for constant vertical flux density. Figure 1.9 shows that the shape is smoothened further and there is no sharp corner when B-spline curves are used, but the undulation, though mathematically correct, is not practicable.
Figure 1.8: Equipotentials of vector potential at the optimum

Figure 1.9: Equipotentials of vector potential at the optimum with B-spline interpolation

In optimization problems like NDE of steel plates, besides the detection of cracks, what is also important is their characterization. Characterization is necessary for determining whether any discovered crack demands withdrawal of the part from service [23, 24]. In eddy current crack identification, the response of a part to an eddy current test coil is compared to the response without a crack [4]. When the two differ, the presence of the crack is flagged. But to characterize the defect, the computed response from eddy current analysis with a crack described by parameters is optimized to match measurements with computations. When the two match, the parameters describe the defect [4].

In inverse electromagnetic problem solutions by the finite element method, we require three tools: a special mesh generator, efficient matrix equation solvers, and optimization algorithms. These correspond to the three major steps of finite element optimization:

1. Preprocessing: The essential operation for optimization is involved in this step. The design is parameterized before mesh generation. As the geometry defined by the parameters is optimized, it changes shape, and a new finite element mesh must be created without stopping the optimization iterations. Several finite element mesh generators exist in the public domain [25, 26, 27, 28, 29, 30, 31, 32], a few even based on a parametric device description. The required mesh generator must therefore support parameter based mesh generation and be completely automatic once the optimization process begins. That means we must be able to change the physical shape of the problem at run time and generate the mesh without stopping.
Such mesh generators as do exist are rare, commercial, and not easily available to researchers except at great cost, and never with the code to modify them to suit individual needs. We propose taking a regular open source mesh generator and writing a script-based interface, as open source, to run nonstop for two and three dimensional optimization problems. In this thesis we develop parametric mesh generators that run nonstop and seamlessly through optimization iterations to convergence.

2. Solution: The biggest load in finite element field computation is in matrix solution. Recently GPU computing has had great success in many very large numerical computations (for example [33, 34, 35, 36]). GPU-based finite element computation offers massive parallelization. This thesis will investigate speedup using the GPU in sparse matrix computation. We will also examine the memory needs. For this purpose we will investigate parallel element-by-element (EbE) processing by Gauss iterations [24] and the preconditioned conjugate gradient method [23].

3. Optimization: The parameters need to be optimized, in NDE as well as in synthesis, to make the computed fields match the desired performance. As discussed in the introduction, we can use zeroth order optimization methods like the genetic algorithm, bee colony algorithms [37], etc. Moreover, if gradient methods are to be used, the mesh topology given by the nodal connections needs to be held fixed to preserve C1 continuity of the object function, lest mesh-induced minima be seen by the optimization algorithm as arising from the physics of the problem.

This thesis mainly focuses on the first two steps, because several open source optimization algorithms are available on the web (for example, [38]) and a colleague in the group is using GPU computations to parallelize genetic algorithm optimization [4]. His code will be used for the test problems.
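The element-by-element (EbE) idea mentioned in step 2 can be sketched with its core operation, the matrix-vector product: the global sparse matrix is never assembled or stored; instead y = [P]{x} is accumulated from each triangle's local 3x3 matrix. This is a serial CPU sketch of the arithmetic; on a GPU each element's contribution runs in its own thread (with atomic or colored accumulation to avoid write conflicts).

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// One triangle: its three global node numbers and its local 3x3 matrix.
struct Elem {
    std::array<int, 3> node;
    std::array<std::array<double, 3>, 3> P;
};

// Element-by-element product y = [P]{x}: the global matrix is implicit in
// the sum of local products, so only local 3x3 matrices are ever stored.
std::vector<double> ebe_matvec(const std::vector<Elem>& elems,
                               const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (const Elem& e : elems)
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                y[e.node[i]] += e.P[i][j] * x[e.node[j]];
    return y;
}
```

Iterative solvers such as Jacobi, CG, BiCG and BiCGStab (Algorithms 3.1-3.5) need only this product, which is what makes the low-memory EbE formulation possible.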
Chapter 2 Parameter Based Unstructured Mesh Generator for Two and Three Dimensional Problems for Seamless Optimization

2.1 Background

Figure 2.1 shows the design cycle for a geometric optimization problem. In the beginning, the initial geometric positions are either selected by the subject expert or, in the absence of the expert, randomly selected. In the next step we generate the mesh for the current geometry, measure the object value by a finite element solution, and check whether it is a minimum or not. If it is a minimum we terminate the loop; otherwise we change the geometric parameters and repeat the procedure. Mesh generation is therefore a very important part of finite element analysis. But most mesh generators do not support parameter based mesh generation for optimization. For real world inverse problems we need a mesh generator as a library where the design is described by parameters: it takes the parameters as input and returns the mesh from iteration to iteration. Figure 2.2 shows the design cycle for an inverse problem. In the first step the design parameter set is randomly selected (or estimated by a subject expert); thereupon we generate the corresponding mesh, get the finite element solution, measure the object value (often conveniently defined as a least square difference between design objects desired and computed), and check whether it is a minimum or not.

Figure 2.1: Design cycle for a geometric optimization

Figure 2.2: Design cycle for an inverse problem

If it is a minimum we terminate the loop; otherwise we change the design parameters and repeat the procedure. This repeats until the object value is acceptably small. For optimization to go on non-stop, the mesh needs to be generated for the new parameters without user intervention. In NDE the only difference is that the object function compares measured values with those computed from presumed values of the parameters being sought.
In this section of the thesis we describe the necessity of a parametric mesh generator that runs nonstop and seamlessly through optimization iterations to convergence. Such mesh generators as do exist are rare, commercial, and not easily available to researchers except at great cost, and never with the code to modify them to suit individual needs. Besides, the typical mesh generator requires some man-machine interaction to define the points and boundary conditions, and does not work for nonstop optimization iterations. We will take a regular open source mesh generator and write a script-based interface, as open source, to run nonstop for optimization. There are many mesh generators available on the web [25, 26, 27, 28, 29, 30, 31, 32] and in the literature; some packages are open source software and others commercial. But they usually do not support parametric mesh generation, although they do support features we would like in a parametric mesh generator. To summarize some notable mesh generators: Triangle, which we use, generates exact Delaunay triangulations, constrained Delaunay triangulations, conforming Delaunay triangulations, and Voronoi diagrams to yield high quality triangular meshes without large angles, as suited to finite element analysis [25]. AUTOMESH2D, CGAL [28], ADMESH [30], and Delaundo [39] are all notable for special features. Indeed there are parametric mesh generators, for example [40]; however, it is not publicly available. Another such mesh generator is CEDRAT's suite Flux, where as parameters are changed, the mesh is generated and the device analyzed to study the effect of the parameters on performance [41]. The same approach has been taken in NDE studies [42]. However, the works of [40, 41] are not intended for non-stop optimization. For that, CEDRAT uses a script based scheme called GOT-It [43], which passes parameters to the program Flux and gets the results back for the optimization.
A version of GOT-It, named FGot, is offered free to students, but there again the code is not accessible. Figure 2.3 shows the performance comparison of triangulation using CPU Delaunay triangulation (DT) and GPU-DT for different numbers of points [44]. We can see the gain is nominal compared to that for finite element solvers, given in Chapter 5. Therefore it is not worthwhile parallelizing mesh generation.

Figure 2.3: Performance comparison of triangulation using CPU-DT and GPU-DT for different numbers of points

Moreover, if gradient methods are to be used, the mesh topology given by the nodal connections needs to be held fixed to preserve C1 continuity of the object function, lest mesh-induced fictitious minima be seen by the optimization algorithm as arising from the physics of the problem [45]. For these reasons, very problem specific mesh generators are constructed by researchers. As an example, when an armored vehicle is targeted by an improvised explosive device, the armor is inspected by an eddy current test probe to characterize the interior damage, to determine if the vehicle should be withdrawn from deployment.

Figure 2.4: Problem specific parametric mesh generators

Figure 2.4 shows a problem specific parametrically described crack in steel excited by an eddy current probe, where P1-P6 are the lengths that represent the position and shape of the crack, J is the current density and µr is the relative permeability. In this NDE exercise the parameters need to be optimized to make the computed fields match the measurements. The mesh has been constructed for the specific problem. As the parameters change, the mesh topology is kept fixed, pulling and crunching triangles as shown in Figure 2.5. Such problem specific meshes are a headache because they restrict the geometry, lack flexibility and take time for modifications. Hence the need for general-purpose parametric mesh generators.
We can use zeroth order optimization methods, for which C1 continuity is irrelevant, such as the genetic algorithm, bee colony algorithms, etc., without pulling and crunching meshes for inverse problems; e.g. in reconstructing cracks to characterize interior defects, or in designing power devices. For non-stop optimization, the commercial code ANSYS offers a gradients-based optimization suite [46], but gives little information on the techniques employed. That is, although these methods are known within the companies, they are rarely published. There are other companies, particularly from structural engineering, that also offer gradients-based optimization. A huge lacuna is how they address the problem of mesh-induced minima: these artificial minima are seen as physics-based object function minima, and the code tends to get stuck at them. Other approaches, like a mathematical distance function to model the geometry, lie in the domain of specialized efforts [47].

Figure 2.5: Elastically deformed problem specific NDE mesh: As defect moves

As in 2D, there are also many 3D mesh generators available on the web and in the literature [29, 43, 48, 49]. Some are open source software and others commercial. TetGen [48, 49] is for tetrahedral mesh generation and is more effective than previous methods at removing slivers and producing Delaunay optimal tessellations [50]. Each of these mesh generators has its own merits, but none of them supports parametric mesh generation. We will take the freely available, widely published, nonparametric, open source 3-D mesh generator TetGen [48, 49], which like all published mesh generators involves user input in the process of mesh generation. Here also we use a script file which uses a parametric description of the system to start the mesh from initial parameters and thereafter runs it seamlessly without stopping as the parameters are updated by the optimization process.
Sample 3D meshes are shown in Figures 2.6 and 2.7.

Figure 2.6: 3D mesh for NDE problem: As parameters change

2.2 Mesh Generation

2.2.1 Introduction

The finite element method requires the problem space to be split into a finite number of finer elements. The preferred element shape for two-dimensional problems is the triangle, and for three-dimensional problems it is the tetrahedron [13]. This set of triangles/tetrahedra is called the mesh. If the mesh is finer, it will produce a better result in FEA [13], although it will increase the processing time. If we have a finer mesh only at the places where we want a more accurate result, then we can reduce the processing time considerably. This is called adaptive mesh generation [13]. Apart from this, the shape of the triangles/tetrahedra in the mesh has a great effect on the final solution of the finite element analysis: very obtuse angles in the triangles of the mesh introduce considerable errors into the final solution. All these facts have to be considered when generating a mesh for FEA problems.

Figure 2.7: 3D Mesh for motor problem

There are many mesh generation algorithms available. Basically we can divide them into two categories:

1. Algorithms which generate a crude mesh to define the basic geometry and then refine it to get a good quality mesh.
2. Algorithms which generate a fine mesh from the beginning.

The Advancing Front Algorithm [51], the Quad-tree Algorithm [52] and Delaunay based algorithms [25] are examples of the second category. These methods can produce very good quality meshes. Delaunay based algorithms are well-known and commonly used for quality mesh generation, and therefore we use them.

2.2.2 Delaunay Based Methods

A Delaunay based meshing approach consists of two tasks:

1. The mesh points are first created by a variety of techniques; e.g.
advancing front, octree, or structured methods.
2. The Delaunay triangulation is first computed for the boundary without internal points. The mesh points are then inserted incrementally into the triangulation/tetrahedralization, and the topology is updated according to the Delaunay definition.

There are many Delaunay triangulation algorithms: the incremental insertion algorithm, the divide and conquer algorithm, the plane-sweep algorithm, etc. In this work, we take the freely available, widely published, non-parametric, open source two-dimensional (2D) mesh generator Triangle [25]. These three algorithms have all been implemented in the Triangle [25] mesh generator. We then adapted it for seamless optimization [24].

Figure 2.8: Applying Delaunay triangulation

2.2.3 Delaunay Triangulation and Constrained Delaunay Triangulation

2.2.3.1 Introduction

Delaunay triangulation [13] is a technique used to improve the quality of the mesh by simply rearranging the nodal connections that make triangles. This algorithm ensures that there will be no obtuse angles in the mesh other than in the triangles at boundaries. This is done by rearranging two neighboring triangles if the uncommon point of one triangle lies inside the circumscribing circle of the other, as shown in Figure 2.8. This can be identified by calculating the two angles at the uncommon points: by the properties of cyclic quadrilaterals, when the sum of these angles is greater than 180°, the triangles must be rearranged. In Figure 2.8, the uncommon point R of triangle QRS lies inside the circumscribing circle of triangle PQS. This is recognized by summing the two opposite angles A and B (the angles at the uncommon points P and R of the two triangles). Since the sum of A and B is greater than 180°, the two triangles are rearranged as PQR and PRS, as shown in the figure. Now neither uncommon point lies inside the circumscribing circle of the opposite triangle.
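The edge-flip test just described can be sketched directly from the cyclic-quadrilateral property; a minimal sketch for two triangles PQS and QRS sharing the edge QS:

```cpp
#include <cassert>
#include <cmath>

struct Pt { double x, y; };

// Interior angle a-v-b at vertex v.
static double angle_at(Pt v, Pt a, Pt b) {
    double ax = a.x - v.x, ay = a.y - v.y;
    double bx = b.x - v.x, by = b.y - v.y;
    return std::acos((ax * bx + ay * by) /
                     (std::hypot(ax, ay) * std::hypot(bx, by)));
}

// Delaunay edge-flip test for triangles PQS and QRS sharing edge QS: by the
// cyclic-quadrilateral property, the shared diagonal QS should be flipped to
// PR when the angles at the uncommon vertices P and R sum to more than 180
// degrees (equivalently, R lies inside the circumcircle of PQS).
bool should_flip(Pt P, Pt Q, Pt R, Pt S) {
    const double pi = 3.14159265358979323846;
    return angle_at(P, Q, S) + angle_at(R, Q, S) > pi;
}
```

In production mesh generators such as Triangle this test is done with an exact-arithmetic incircle determinant rather than with `acos`, which is numerically fragile near the degenerate (cocircular) case; the angle form above simply mirrors the textbook argument in the text.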
Constrained Delaunay triangulation is a generalization of the Delaunay triangulation that forces certain required segments into the triangulation. An example is shown in Figure 2.9. The two triangles are in different regions, where each may have different properties, so they cannot be flipped as in the previous case.

Figure 2.9: Example for constrained Delaunay triangulation

2.2.4 Algorithms for Constructing a Delaunay Triangulation

2.2.4.1 Introduction

There are many Delaunay triangulation algorithms; for example Divide-and-Conquer [53], Sweepline [54], Incremental Insertion [55], etc. As Su and Drysdale [56] found, the divide-and-conquer algorithm is fastest and the sweepline algorithm second. The incremental insertion algorithm performs poorly, spending most of its time in point location. Su and Drysdale introduced a better incremental insertion implementation using bucketing to perform point location, but it still ranks third. A very important development in the divide and conquer algorithm is partitioning the vertices with vertical and horizontal cuttings [53].

2.2.4.2 Divide-and-Conquer Algorithm

The point set V is divided into halves until we are left with two or three points in each subset. These smaller subsets can then be linked with edges or triangles, forming a Voronoi diagram; we thus have one Voronoi diagram per subset. In the conquer step, we merge the subsets to get the whole Voronoi diagram (see Figure 2.10). The dual of the Voronoi diagram is the mesh [25].

Figure 2.10: Divide and conquer algorithm

Figure 2.11: Sweep-line algorithm

2.2.4.3 Sweep-line Algorithm

The sweep-line algorithm uses a sweep line which divides the working area into two sub-areas. The process constructs the Voronoi diagram, the dual graph of the Delaunay triangulation (shown in Figure 2.11). This algorithm was introduced by Fortune [54].
Shewchuk [25] presented a successful algorithm for constructing a higher-dimensional Delaunay triangulation. Figure 2.11 explains how the sweep-line algorithm works: there is a vertical line, called the sweep line, and when this line passes a point the algorithm extends the Voronoi diagram over the points already passed.

Figure 2.12: Incremental insertion algorithm

2.2.4.4 Incremental Insertion Algorithm

Here we generate a fictitious triangle containing all points of V in its interior. The points are then added one by one. Figure 2.12 (b) shows the mesh after the first point is added. Figure 2.12 (c) shows how the insertion of the second or a subsequent point is handled. The idea is to draw the circumcircles of the triangle in which the new point is located and of its neighboring triangles, select the triangles whose circumcircles contain the new point, remove the interior edges of the selected triangles, and finally connect the new point to every vertex of the polygon so created. This algorithm always maintains the Delaunay triangulation of the points.

2.2.5 Mesh Refinement

There are many mesh refinement algorithms available. Most of them aim at finer elements near intricate geometrical shapes and fewer nodes in other regions. The basic idea of the algorithm is to maintain a triangulation, making local improvements in order to remove the skinny triangles. Figure 2.13 shows the basic idea of avoiding skinny triangles.

Figure 2.13: Delaunay mesh refinement

Figure 2.14: Delaunay mesh refinement between two regions

The two triangles of Figure 2.14 are in different regions; each may have different properties. So they cannot be refined as in the previous case.
Here the algorithm follows the same idea without violating the boundary between separate regions.

2.2.6 Three Dimensional Mesh Generation

Since almost every real world problem is three-dimensional, we extend our two-dimensional work to three-dimensional geometrically parameterized mesh generation for optimization problems in design and NDE. Mesh generators in electrical engineering commonly use Delaunay tetrahedralization and constrained Delaunay tetrahedralization for quality meshes. The Incremental Insertion algorithm is a well-known algorithm for tetrahedralization [48]. The worst case runtime of this algorithm is O(n²), but its expected runtime is O(n log n) [25]. Constrained Delaunay tetrahedralization was first considered by Shewchuk [25]. Gmsh [29] and TetGen [48] are the better-known free, open source 3D mesh generators. TetGen [48] uses a constrained Delaunay refinement algorithm which guarantees termination and good mesh quality. A three-dimensional Delaunay triangulation is called a Delaunay tetrahedralization. The ideas of the Delaunay operation, constrained Delaunay triangulations, and mesh refinement are the same; only the dimension is different. 3D objects are usually represented by Piecewise Linear Complexes (PLCs) [48]. The design goal of TetGen is to provide a fast, light and user-friendly meshing tool with parametric input and advanced visualization capabilities. Even though TetGen and Gmsh [29] are great open source mesh generators, from an inspection of the code they are very hard to use for non-stop optimization problems. For the non-stop optimization that ANSYS offers [46], little information is given on the techniques employed. CEDRAT uses a script based scheme called GOT-It [58]. FGot is offered free to students, but the code is not accessible and therefore will permit neither modification nor work on industry-scale problems [58].
Here TetGen [48] is used as a backend for parameterized meshes for optimization. Therefore we develop the parameterized front end on our own and make it available as open source. It is not always possible to tetrahedralize a polyhedron using only its own vertices; additional points that are not vertices of the polyhedron may be required. Two types of points are used in TetGen:

1 The first type of points is used in creating an initial tetrahedralization of the PLC.
2 The second type of points is used in creating quality tetrahedral meshes of PLCs.

The first type of points is mandatory in order to create a valid tetrahedralization. The second type of points is optional, but may be necessary in order to improve the mesh quality.

2.3 Parameterized Mesh Generation

Parametric mesh generation is a very important part of finite element optimization problems. In optimization problems, parameters describe the device in terms of materials, currents, and dimensions. During optimization, as these parameters are changed to minimize an object function, a new mesh has to be generated and a new finite element solution obtained to re-evaluate the object function. At each iteration of an optimization algorithm, given the variables as input, the mesh is generated without user intervention. Finite element mesh generators exist in the public domain, a few even based on a parametric device description. The typical mesh generator requires some man-machine interaction to define the points and boundary conditions, and does not work for non-stop optimization iterations, for which we need a mesh dynamically evolving through the iterations with optimization variables as changing parameters. Such mesh generators as do exist are rare, commercial, and not easily available to researchers except at great cost, and never with the code to modify them to suit individual needs.
2.4 New Approach to Parameterized Mesh Generation

We take the freely available, widely published, nonparametric, open source 2-D mesh generator Triangle [25] and 3-D mesh generator TetGen [48, 49], which, like all published mesh generators (with the exception of commercially restricted ones whose methodology is not published), involve user input in the process of mesh generation. But for use in optimization we cannot stop the iterations to make input [24]. To address these problems we use a script file with a parametric description of the system to start the mesh from initial parameters and thereafter run it seamlessly, without stopping, as the parameters are updated by the optimization process [24]. The script file provides the user input while the code is iteratively running, input that is normally made in the mesh generator being used, but for which the optimization iterations cannot stop [24].

Figure 2.15 explains our approach to parametrized mesh generation. In the first step, the initial input, which is described in the following sections, is given to our mesh generator. Next, the mesh generator calls our chosen open source mesh generator to generate the mesh. After that, the FEM solver uses the mesh to solve the problem. Next, the optimization algorithm updates the parameters, which are accepted by our new mesh generator to generate the new mesh.

Figure 2.15: My approach to parameterized mesh generation

2.5 Data Structure and User Interface

2.5.1 Data Structure

The data structure used in these mesh generator software suites contains the following collections of objects:

1. Points list
2. Regions list
3. Properties list
4. Variable points list
5. Measuring points list
6. Segments list
7. Mesh details list (triangles/tetrahedrons)
8. Holes list
9. Boundary conditions

Points List: The point collection contains all the points used in the problem definition and solving process.
Each point contains the coordinates of the relevant finite element node. For the three-dimensional mesh generator the list has x, y and z coordinates; for the two-dimensional mesh generator, x and y coordinates only.

Regions List: The region list contains all the regions used in the problem definition. Region here means a material-source combination which has distinct physical characteristics.

Properties List: The properties list contains the properties of each region. Each region has a set of properties.

Variable points list: The variable points list contains information about the points to be moved, during optimization, according to the changes of the parameters.

Measuring points list: The measuring points list contains the points in the solution space where we want to find the potentials, flux density, etc.

Segments list: The segments list contains the edges of the problem model. Each problem may have many segments.

Mesh details list: The mesh details list contains the triangles/tetrahedrons of the mesh. This collection is empty until the mesh is generated. For the two-dimensional mesh generator each triangle contains references to its three vertex points; for the three-dimensional mesh generator each tetrahedron contains references to its four vertex points.

Holes list: Holes are a special kind of region in which we do not need to generate the mesh. The hole list contains all the holes used in the problem definition.

Boundary conditions: Two types of boundary conditions are usually used in FEA problems:

1. Neumann boundary conditions
2. Dirichlet boundary conditions

A Dirichlet boundary condition means the potential along the given boundary is fixed; a Neumann boundary condition means the derivative of the potential along the given boundary is fixed, and usually zero.
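The collections above might be declared as follows in C. All type and field names here are illustrative assumptions for exposition, not the identifiers used in the thesis software.

```c
#include <stdlib.h>

/* Illustrative declarations for the mesh generator's collections.
 * Names and layout are hypothetical, not the thesis code's own. */
typedef struct { double x, y, z; } Point;            /* z unused in 2-D  */
typedef struct { int n1, n2; int marker; } Segment;  /* marker -1 = free */
typedef struct { int v[4]; int region; } Element;    /* 3 or 4 vertices  */
typedef struct { double x, y; int property; } Region;
typedef struct { int marker; double value; } Boundary; /* Dirichlet      */

typedef struct {
    Point    *points;    int npoints;
    Point    *varPoints; int nvar;      /* moved by the optimizer       */
    Point    *measure;   int nmeasure;  /* object-function sampling     */
    Segment  *segments;  int nsegments;
    Region   *regions;   int nregions;
    Point    *holes;     int nholes;    /* regions left unmeshed        */
    Element  *elements;  int nelements; /* empty until meshing          */
    Boundary *bcs;       int nbcs;
} MeshModel;
```

The mesh details array stays empty (NULL) until the backend mesher runs, matching the description above.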
Dirichlet boundary conditions can be implemented by keeping the potential of all the points on the given boundary fixed at their given value. The user can select any segment and define the potential of that segment. If the potential of the segment is set, then all the points which are added onto that line get this potential automatically. The boundaries that do not implement Dirichlet conditions automatically act as Neumann boundaries during the FEA process, because the Neumann condition is natural to the finite element formulation [13]. Therefore no special provisions are needed to define Neumann boundaries.

Figure 2.16: Sample input file for mesh generator

2.5.2 User Interface

2.5.2.1 Introduction

A proper user interface is very important for good software. If the user interface is not friendly, most users will not be able to use the software effectively, however powerful it may be. Therefore the user interface of this software is carefully designed, as described in the next subsection.

2.5.2.2 Defining the Geometrical Shape

Since we are providing the code as open source, it is necessary to describe it for other users to re-engineer the code. This software imports drawings from a text file format. Figure 2.16 shows a sample input file of our mesh generator adapting Triangle, using a script file to run non-stop from iteration to iteration without manual intervention.

1 The mesh generator code ignores lines which start with #. We can write comments using the # sign.
2 First interpreted row: the first number represents the number of nodes in the domain; the second number represents the number of variable points. These variable points are also nodes, but their coordinates may vary with the optimization iterations.
3 From the second row to row number 10 (that is, 9 + 1), the nodes in the domain are listed. In each such row, the integer represents the node number.
These must be numbered consecutively, starting from one. The two floating point numbers represent the coordinates of this node. For the three-dimensional mesh generator, three floating point numbers represent the coordinates of this node.
4 The next segment of this input file represents the variable points. These are in the same format as the previous segment.
5 The third part of the input file has the segment details. Its first row has the number of segments. From the second row onwards there are 4 columns. The first column holds the segment numbers, which must be numbered consecutively, starting from one. The next two columns are node numbers. Each row represents a segment. The fourth column is a marker. A marker takes different integer values and can be used to define the boundary condition. Here -1 means the segment is not on a boundary. If we have to define a boundary condition we give the marker any positive integer; we can then assign the boundary value using these markers. For the three-dimensional mesh generator, the first row of this part of the file has the number of faces and a boundary marker. Each face has a list: the first part of the list has the number of polygons in the face, the number of holes on the surface and the boundary marker; from the second row onwards, the polygons and holes of the surface are defined.
6 The next segment of this file is the definition of boundaries. The first row contains the number of boundary conditions. From the second row onwards, the first column represents the numbering; the second column is a marker number which has already been defined in the previous (segment) part; and the third column represents the boundary value for that particular marker.
7 The subsequent segment is the definition of the regions of the problem. Here a different region means a different material, so it has different properties. The first row represents the number of regions in the domain.
From the second row, each row has five columns. The first column represents the numbering as usual. The second and third columns represent the coordinates. These coordinates are used to identify the region; the point may be any point within the relevant region. The fourth column is an integer which starts from 1; it can be used to assign properties to these regions. The last column is not used here because it is an area constraint coming from Triangle but, as mentioned, it is not used by us.
8 The next segment of this input file represents the holes. A hole is a region in which we do not want to generate the mesh. First we define the number of holes in the problem domain. From the next row, there are three columns. The first column represents the hole number. The second and third columns represent the x and y coordinates of any point within that hole.
9 The segment thereafter is the measuring point list for object function evaluation. The first part is the number of measuring points at which we are going to calculate the solution to get the target solution.

Our source code helps users to identify errors in the input file. It works very efficiently for any shape of problem domain. Sample input files are attached to this thesis as Appendices B and C. Users can customize their own problems very efficiently, as tried out in our lab [4, 3, 24]. This software is easy to use and runs on common operating systems, i.e., Linux/Unix, Windows, etc.

2.5.3 Post-processing of Meshing

Once we triangulate/tetrahedralize the problem domain, we have to define the boundaries and boundary values. Upon triangulation/tetrahedralization, we have an element list (node numbers), the properties of regions and a point list (for the FE solution). We do not need to calculate the solution at known nodes, so we have to separate the known and unknown nodes.
This step is known as renumbering the nodes, which also serves to reduce the profile of the matrix [13]. In this process, we

1. Define the boundary (generally using segment numbers in 2D, faces in 3D)
2. Get the boundary values (different boundaries may have different boundary values)
3. Get all nodes which are on boundaries (we use the segment marker list to determine the boundary nodes)
4. Separate boundary elements from non-boundary elements, and separate unknown nodes from known nodes
5. Give the first set of numbers to the unknown nodes and the last set of numbers to the known nodes
6. Renumber the whole point list based on the new numbering
7. Renumber the node entries in the triangle list based on the new numbering system
8. Get all properties for the particular regions
9. Assign these properties to all corresponding triangles

Since real world problem sizes are typically very large, this renumbering process takes a very long time. Algorithm 2.1 describes the regular renumbering process. This algorithm is very inefficient because each node is searched for in an index array; its order is O(n^3). This step can be improved. In this work the traditional algorithm has been improved based on the merge sort technique.

2.5.4 Approach to Renumbering

Figure 2.17: Simple example problem

Instead of searching for every element in the whole list, this thesis tracks the node number changes and updates the node numbers of the mesh. Figure 2.17 shows a simple example finite element problem. The boundary elements are circled. Figure 2.18 (a) shows the node numbers assigned in the mesh generation process. Figure 2.18 (b) presents the rearranged nodes, separated according to whether the nodes are of known or unknown value. Figure 2.18 (c) shows the new numbering system. The boundary elements are shown in the gray boxes in Figure 2.18. The corresponding old number list (c) is renumbered in (b).
Figure 2.19 shows the new numbering of the nodes. Let us take the array of Figure 2.18 (a) and use a new index array holding 1, 2, up to the number of elements. We sort the array of Figure 2.18 (b) using the merge sort algorithm, described in the following section, and apply every operation of the sort to the new index array. The resultant arrays are shown in Figure 2.20. Now if we want the renumbered value of a particular node number, we can access it directly from the updated index array: for example 1 is replaced by 4, 2 by 1, 3 by 5, and so on.

Figure 2.18: Numbering and renumbered nodes

Figure 2.19: Renumbered nodes

Figure 2.20: Sorted version of (b) and corresponding index changes

Figure 2.21: Merge Sort

2.5.5 Merge Sort

Sorting is a technique that arranges elements in a certain order. There are many sorting algorithms, such as counting sort, bucket sort, radix sort, merge sort, heapsort and quicksort [59]. Each algorithm has its own advantages. Merge sort is one of the best sorting algorithms, with n log(n) average, best and worst case time complexity [59]. Another very important reason for selecting merge sort is that it is easily parallelizable, and parallelization on the GPU is the main theme of this thesis. Figure 2.21 [60] describes merge sort graphically. First the list is split into two pieces, and then each is split into two further pieces. This process is repeated until we arrive at singleton lists. Then we work our way back up the recursion, merging the short arrays into larger arrays. Algorithms 2.2 and 2.3 describe merge sort.

2.5.6 Modified Form of Merge Sort for Renumbering

As described in section 2.5.4, we track the node number changes using a form of merge sort, instead of searching for every element in the whole list.
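The index-tracking idea described above can be sketched as a merge sort that mirrors every move of the renumbering array into a companion index array, so that after sorting the index array records where each entry went. This is a minimal C sketch under assumed names, not the thesis implementation.

```c
#include <string.h>

/* Merge step: merges a[lo..mid) and a[mid..hi), applying the identical
 * moves to idx[]. ta/ti are caller-supplied scratch arrays of size n. */
static void merge_pair(int *a, int *idx, int lo, int mid, int hi,
                       int *ta, int *ti)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        if (a[i] <= a[j]) { ta[k] = a[i]; ti[k++] = idx[i++]; }
        else              { ta[k] = a[j]; ti[k++] = idx[j++]; }
    }
    while (i < mid) { ta[k] = a[i]; ti[k++] = idx[i++]; }
    while (j < hi)  { ta[k] = a[j]; ti[k++] = idx[j++]; }
    memcpy(a + lo,   ta + lo, (hi - lo) * sizeof *a);
    memcpy(idx + lo, ti + lo, (hi - lo) * sizeof *idx);
}

/* Sorts a[lo..hi) while co-permuting idx[], in O(n log n). */
static void msort(int *a, int *idx, int lo, int hi, int *ta, int *ti)
{
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    msort(a, idx, lo, mid, ta, ti);
    msort(a, idx, mid, hi, ta, ti);
    merge_pair(a, idx, lo, mid, hi, ta, ti);
}
```

After the sort, a renumbered value is read directly from the updated index array in O(1), instead of being searched for in the whole list.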
This new algorithm has n log(n) time complexity, whereas the traditional method has time complexity O(n^3). We define an array of size given by the number of nodes in the mesh, holding 0, 1, 2, up to (number of nodes - 1), and apply every operation of the sort algorithm to this newly defined array. Algorithms 2.4 and 2.5 describe the idea underlying what we have used.

Chapter 3 Low Memory High Speed FEM Solvers Using the GPU

3.1 Introduction

It has been more than 20 years since genetic algorithm (GA) based optimization was first used in finite element optimization [5, 6, 7, 8, 9, 10]. Since GA is practicable and gives a faster solution when parallelized [3], GA has been used for optimization in FE. The object function corresponding to every member of a population has to be computed many times to find the minimum. The many members form the genetic search space. Since each member consists of the dimensions and materials of a particular design being examined for its goodness [3], for those dimensions a mesh is constructed, the finite element problem solved and the object function evaluated (see Figure 3.1). The object function itself is computed from a finite element solution involving a matrix equation. Thus we may treat the object function computation as a kernel and launch it on multiple threads, each for a different member of the population. Then, within that kernel, we can parallelize the matrix equation solution. In genetic algorithm based finite element optimization [61, 62], several copies of the matrix are held on the GPU and the corresponding solutions attempted. Limited memory is also a very big issue on GPUs [23]. This part of the work mainly focuses on low memory and high speed finite element solvers. As already mentioned, the GA based optimization method presents a huge computational load. Powerful PCs are capable today of solving large matrix equations in a few seconds, sometimes using a Graphics Processing Unit (GPU).
GPU based finite element computation offers massive parallelization [63]. The finite element solution of field problems requires the solution of large matrix equations, leading to long waiting times [13]. To address this, parallel computations were used at one time [64, 65]. But the speedup was limited by the fact that on shared memory parallel computers there was a memory bottleneck, which typically allowed 4, 8, 16 or rarely 32 processors, with more computing elements meaning exorbitant cost. For an n-processor machine with one processor dedicated to book-keeping on what the other processors were doing, the best speedup was n - 1, and always a little less because of communication and waiting issues. Recently the graphics processing unit (GPU) has been shown to speed up the matrix solution part of the finite element solution [34, 66]. This was a major advance in finite element computational efficiency.

Figure 3.1: Finite element optimization using genetic algorithm

Figure 3.2: Floating-point operations per second and memory bandwidth for the CPU and GPU

Parallelization is the best approach to speeding up as bigger problems are tackled in field computation [13, 64]. However, as noted, the need for shared memory between processors was a bottleneck, because machines with more than 8 processors were very expensive.

3.2 General Purpose Computing on a Graphics Processing Unit (GPGPU)

The GPU is a single chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines [67]. In November 2006, NVIDIA introduced CUDA, a general purpose parallel computing platform and programming model. The GPU has its own memory, up to 24 GB in current configurations [68]. This device (GPU) memory supports very high bandwidth through a wide data path. The CPU has a few cores, optimized for sequential processes.
In contrast, the GPU has thousands of smaller, more efficient cores designed for massively parallel processing. The GPU has tremendous computational horsepower and very high memory bandwidth (shown in Figure 3.2 [69]).

Figure 3.3: The GPU devotes more transistors to data processing

GPUs are used for highly parallel computation. They are therefore designed with more transistors devoted to data processing (ALUs) rather than to data caching and flow control (shown in Figure 3.3 [69]). The GPU is well suited to data-parallel processing. In GPU FE computation, data-parallel processing maps data elements to parallel processing threads. GPU computing uses the GPU to accelerate the computation of very large scientific and engineering problems. Generally a GPU has thousands of cores; for example the Tesla K40 GPU has 2880 cores [68]. Here cores mean computing components. Nowadays many multi-core processors are available in the market, but the number of cores in CPUs is very limited. Using the GPU for general purpose calculations instead of graphics rendering is called general purpose computing on graphics processing units (GPGPU). The NVIDIA GPU CUDA architecture is composed of streaming multiprocessors (SMs), each containing a number of streaming processor cores (SPs) with on-chip memory, and a single instruction unit. All SMs have access to global memory, the off-chip memory (DRAM), which has a latency of several hundred clock cycles. Thus unlimited speedup is possible, unlike with shared memory machines. The until-recent restriction that a kernel launched on parallel threads cannot launch another kernel in further forking from a fork [12] is no serious shortcoming, since with multi-processor systems, even though we can fork from a fork, we usually do not have spare processors to assign.
However, in a recent development, CUDA dynamic parallelism has been made available on the SM 3.5 architecture GPU [70], and this is available on PCs now. In finite element analysis the coefficient matrix is very large when dealing with real world problems, for example in reconstructing cracks in inverse non-destructive evaluation problems and in device design problems. We use two different techniques to overcome this memory problem:

1. Use sparse storage schemes to store the coefficient matrix and solve the matrix equation using the GPU [23]
2. Use element by element FEM, which performs operations on the local finite element matrix [P]_L that correspond to operations on the global matrix [P], and therefore does not require storage for the larger global matrix [65]

Using the GPU of ordinary PCs is a new alternative [71, 72]. We will use the well known profile (skyline) storage scheme [13] and the sparse storage scheme [13, 73] (also known as the compressed sparse row scheme). We revive the old element by element finite element solvers, developed in the early 1980s for the highly memory limited IBM PC 286, to launch thousands of CUDA threads on the GPU architecture. We will examine different numerical techniques, such as conjugate gradients (CG), preconditioned conjugate gradients (PCG), Jacobi and bi-conjugate gradients, to get the high speed solution we seek.
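The essence of the second technique above is that any product [P]{x} needed by an iterative solver can be accumulated from the local matrices [P]_L without ever assembling [P]. A minimal serial C sketch follows; the names and data layout are assumptions for illustration, not the thesis code's own.

```c
/* Element-by-element product {y} = [P]{x}: each local 3x3 matrix
 * scatters its contribution into y[] through the element's global
 * node numbers, so the global matrix is never formed. */
typedef struct {
    int    v[3];      /* global node numbers of the triangle */
    double PL[3][3];  /* local coefficient matrix [P]_L      */
} Elem;

void ebe_matvec(const Elem *els, int nel, const double *x,
                double *y, int n)
{
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int e = 0; e < nel; e++)           /* parallelizable loop */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                y[els[e].v[i]] += els[e].PL[i][j] * x[els[e].v[j]];
}
```

On a GPU, with one thread per element, concurrent accumulations into the same y[i] from elements sharing a node are exactly the race condition that the atomic-function, coloring and dual-numbering remedies discussed in this chapter address.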
These ideas are explained below.

3.3 Related Works

To Wu and Heng [71] should go the credit for first exploiting the GPU, as far back as 2004, in parallelizing FEM computations. They focused their attention on the conjugate gradients matrix solver, where the most gains could be made. It took until 2010 for GPU FEM computations to enter seminal electrical engineering works. Kiss et al. [34] have recently applied EbE processing to solve their finite element equations from a first order tetrahedral mesh using conjugate gradients algorithms. Their EbE method is different from the EbE of references [65, 74]. They implement the bi-conjugate gradient technique [75]. Since the matrix is not formed, they use the diagonal to implement Jacobi preconditioning.

Fernandez et al. [33] decoupled the solution of a single element from that of the whole mesh (Figure 3.4 [33]), thus exposing parallelism at the element level. Individual element solutions are then superimposed node-wise using a weighted sum over concurrent nodes. They used Jacobi iterations to calculate the solution of each local matrix in parallel and then coupled the local solutions using a weighted average enforcing continuity. For example, node number 1 in Figure 3.4 [33] is replaced by numbers 1 and 6 in the new numbering system; number 1 is in element e1 and 6 in e2. Once the local solutions of elements e1 and e2 are calculated, the global solution at the shared node is formed as a weighted average of the two local solutions, the weights being the corresponding local matrix elements (shown in Figure 3.4 [33]) [33]. Fernandez et al. compared different generations of GPUs, getting speedups with conjugate gradients of up to 14 times and 111 times for the 8800GT and GTX480 GPUs respectively, compared to optimized CPU results.
Figure 3.4: Steps in the classic finite element method (FEM) and the proposed changes for the FEM-SES method enclosed within the dashed line

3.4 Element by Element Solvers

3.4.1 Element by Element with Jacobi Algorithm

To overcome the memory limitation of 612 kB on the IBM PC 286 of the early 1980s, Hughes et al. [74] introduced EbE processing. The then available memory was not enough to hold even a trivial finite element matrix. What we used to do, following Hughes et al. [74], was not form the coefficient matrix [P]. Instead, recognizing that [P] is assembled from the small element matrices [P]_L (of size 3 × 3 in magnetics with triangular first order elements), all the operations meant for [P] we performed multiple times on each [P]_L. We took an element, computed its [P]_L, did the operation meant to be done on [P], added the result to that from the other elements already dealt with as indicated in equation 3.2, threw away [P]_L, went to the next element, and so on. The non-forming of the coefficient matrix meant that only iterative schemes of solution, without decomposing [P] in any way, could be used. This restriction therefore excluded the powerful iterative incomplete Cholesky conjugate gradients (ICCG) scheme of solution usually preferred in finite element work, because an incomplete Cholesky factor of [P] is required [13].

In solving [P]{x} = {Q}, the Gauss-Seidel iterations commonly used by power engineers are an improvement on the older Gauss iterations [13]. In Gauss-Seidel, in each iteration m + 1 we use the latest available values of the unknowns, using equation i of [P]{x} = {Q} to compute x_i, treating only x_i as the unknown and all the other variables as known and given by their latest values in the iteration cycle:

x_i^(m+1) = ( Q_i - Σ_{j<i} P_ij x_j^(m+1) - Σ_{j>i} P_ij x_j^(m) ) / P_ii

with obvious modifications for i = 1 and i = n. In this algorithm x_{i-1}^(m+1) must be computed before x_i^(m+1).
Thus at iteration m + 1, computing in the order i = 1 to n, x_i^(m+1) is computed with values of iteration m + 1 up to the (i-1)th component of {x} and with the values of the previous iteration m for components after i. It is therefore necessarily a sequential algorithm.

Figure 3.5: Proposed method in flow chart

When EbE processing was developed for the Gauss algorithm [65], the older displaced Gauss iterations [65] were used, which take the values of the old iteration m for computing every component of {x} in iteration m + 1. The computation of a particular component of {x} is therefore independent of the computation of all the other components for that iteration, and therefore parallelizable:

x_i^(m+1) = ( Q_i - Σ_{j≠i} P_ij x_j^(m) ) / P_ii

This is inefficient in the context of sequential computations. But in the case of parallelization it is highly efficient, as was laid out by Mahinthakumar and Hoole [65] and Carey et al. [76] for shared memory systems. Speedups were just below n - 1, where n is the number of processors. If [D] is the matrix [P] with all off-diagonal elements eliminated, then the Gauss iterations yield

{x}^(m+1) = {x}^(m) + [D]^{-1} ( {Q} - [P]{x}^(m) )

Our proposed implementation first computes the diagonal vector [D] of the unformed global matrix [P], the right hand side {Q} of the finite element equation and the initial solution vector, and saves them as one dimensional arrays. This computation is done only once, and can be parallelized. In the second step, our algorithm calculates the residual components r_i, i = 1 to n, of

{r} = {Q} - [P]{x}

in parallel. Algorithm 3.1 shows the element-by-element Gauss iterations. In it, n is the total number of unknowns, {V} is a vector which contains the vertices of an element, {r} is the residual vector, [P]_L is the local coefficient matrix, {Q} is the global right hand side vector, {D} is the diagonal vector of the global matrix [P], and {x} is the current solution vector.
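Once the residual {r} = {Q} - [P]{x} has been accumulated element by element, the Gauss update above is one independent division per unknown, which is what makes it parallelizable with one GPU thread per component. A minimal C sketch (names assumed, serial loop standing in for the per-thread body):

```c
/* One displaced-Gauss (Jacobi) sweep:
 *   {x}_(m+1) = {x}_m + [D]^{-1} ( {Q} - [P]{x}_m )
 * r[] must already hold the residual {Q} - [P]{x}_m; each component
 * updates independently of all the others. */
void gauss_update(double *x, const double *r, const double *D, int n)
{
    for (int i = 0; i < n; i++)  /* one GPU thread per unknown */
        x[i] += r[i] / D[i];
}
```

In the CUDA version the loop body becomes the kernel and i is derived from the thread and block indices; only the residual accumulation needs atomics, not this update.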
NC and NR represent the column number and row number respectively of the unformed global matrix to which the local matrix term [P]_L(i, j), stored as the column vector [P]_L(i × 3 + j), belongs. The first step of this algorithm computes the diagonal vector [D] and the right hand side vector {Q}. Algorithm 3.2 explains how to compute them. This step is common to all the algorithms described in this chapter, and is also parallelized. Steps 4 to 19 of Algorithm 3.1, which compute the residual vector, can also be parallelized. A problem arises in updating the vector {r}: more than two processes may access the same memory location at the same time while updating a particular r_i. This is called a race condition; it is handled here with CUDA's predefined atomic functions. The arch = sm_20 [72] flag was used to enable FERMI advanced architecture features which support them [72]. Kiss et al. used a coloring technique [34] to avoid the race condition, and a dual numbering system [33] was used by Fernandez et al. Both methods take a lot of memory and extra computation, so we avoided them.

3.4.2 Element by Element Conjugate Gradients Algorithm

The linear system of equations from the finite element formulation can be solved by a minimization method [77]. Consider the quadratic form

f({x}) = (1/2) {x}^T [A] {x} - {b}^T {x} + c

where [A] is a symmetric matrix, {b} and {x} are vectors, and c is a scalar constant. Since [A] is symmetric and positive definite for our finite element equation, f({x}) is minimized by the solution to [A]{x} = {b}. The iterates {x}^(i) are updated in each iteration by a multiple α^(i) of the search direction vector {p}^(i) [78]:

{x}^(i+1) = {x}^(i) + α^(i) {p}^(i)

The corresponding residuals are updated as

{r}^(i+1) = {r}^(i) - α^(i) [A] {p}^(i)

where α^(i) is chosen so as to minimize f({x}) along {p}^(i).
The search directions {p}^(i) are updated using the residuals {r}^(i), i.e.

{p}^(i+1) = {r}^(i+1) + β^(i) {p}^(i)

where β^(i) is defined by

β^(i) = ( {r}^(i+1)T {r}^(i+1) ) / ( {r}^(i)T {r}^(i) )

The conjugate gradients iterative solver can also be decomposed into an element by element process [79]. Here we have used the element-by-element Jacobi preconditioner [80, 63], because traditional preconditioners cannot be used without the assembled global matrix. The Jacobi preconditioner [78] is weak, but it offers preconditioning. Algorithm 3.3 shows the element by element process using the conjugate gradients algorithm [79]. Steps 6 to 18 of Algorithm 3.3, which compute the residual vector, can also be parallelized.

3.4.3 Element by Element Biconjugate Gradient Algorithm

Kiss et al. [34] have applied EbE processing to solve their finite element equations from a first order tetrahedral mesh using the bi-conjugate gradients algorithm. Algorithm 3.4 shows their element by element process using the biconjugate gradient algorithm. The CG method is not suitable for a non-symmetric system, since the residual vectors cannot be made with short recurrence [81, 82]. The biconjugate gradients method takes another approach, replacing the orthogonal sequence of residuals by two mutually orthogonal sequences, at the price of no longer providing a minimization. In the biconjugate gradients method, we update two sequences of residuals and search directions, for [A] and [A]^T. Steps 4 to 19 of Algorithm 3.4 compute the residual vector; steps 31 to 34 update the vectors {xt}, {r} and {rd}; and steps 41 to 47 update the vectors {d}, {dd}, {q} and {qd}. These processes can also be parallelized.
In this process the residuals may be computed using
\[ \{r\}^{(i)} = \{r\}^{(i-1)} - \alpha_i [A]\{p\}^{(i)}, \qquad \{\tilde r\}^{(i)} = \{\tilde r\}^{(i-1)} - \alpha_i [A]^T\{\tilde p\}^{(i)}. \]
The search directions are given by
\[ \{p\}^{(i)} = \{r\}^{(i-1)} + \beta_{i-1}\{p\}^{(i-1)}, \qquad \{\tilde p\}^{(i)} = \{\tilde r\}^{(i-1)} + \beta_{i-1}\{\tilde p\}^{(i-1)}, \]
and the distance to be moved along a direction by
\[ \alpha_i = \frac{\{\tilde r\}^{(i-1)T}\{r\}^{(i-1)}}{\{\tilde p\}^{(i)T}[A]\{p\}^{(i)}}, \qquad \beta_i = \frac{\{\tilde r\}^{(i)T}\{r\}^{(i)}}{\{\tilde r\}^{(i-1)T}\{r\}^{(i-1)}}. \]

Algorithm 3.4 (Continued)

3.4.4 Element by Element with the Biconjugate Gradient Stabilized Method

The BiCGSTAB iterative method has been used to solve the steady Navier-Stokes equations by Wang et al. [83]. Sheu et al. [84] used the BiCGSTAB method for solving a monotonic finite element model. The BiCGSTAB method is suitable for non-symmetric systems but can be applied to symmetric systems too [75]. This method avoids the irregular convergence pattern of the conjugate gradient squared method. In the biconjugate gradient method, the residual vector $\{r\}^{(i)}$ can be written as the product of $\{r\}^{(0)}$ and an $i$th degree polynomial in $[A]$:
\[ \{r\}^{(i)} = P_i([A])\{r\}^{(0)}. \]
This same polynomial satisfies
\[ \{\tilde r\}^{(i)} = P_i([A]^T)\{\tilde r\}^{(0)}, \]
so that
\[ \rho_i = \{\tilde r\}^{(i)T}\{r\}^{(i)} = \left(P_i([A]^T)\{\tilde r\}^{(0)}\right)^T P_i([A])\{r\}^{(0)} = \{\tilde r\}^{(0)T} P_i^2([A])\{r\}^{(0)}. \]
This result suggests that if $P_i([A])$ reduces $\{r\}^{(0)}$ to a smaller vector $\{r\}^{(i)}$, then it might be advantageous to apply this $P_i$ operator twice and compute $\{r\}^{(i)} = P_i^2([A])\{r\}^{(0)}$ [85]. This algorithm is known as the conjugate gradient squared method. The biconjugate gradient stabilized method (BiCGSTAB) avoids the irregular convergence of the conjugate gradient squared method: BiCGSTAB computes $\{r\}^{(i)} = Q_i([A])P_i([A])\{r\}^{(0)}$ instead of $P_i^2([A])\{r\}^{(0)}$ [86]. The BiCGSTAB method is computationally expensive per iteration compared to the CG algorithm [87]. Algorithm 3.5 shows the element by element process using the biconjugate gradient stabilized algorithm [84, 83]. BiCGSTAB has two stopping tests, which are shown in Algorithm 3.5. Steps 13 to 27 and 36 to 50 of Algorithm 3.5 compute the residual vectors. These processes can also be parallelized.

Algorithm 3.
5 (Continued)

Algorithm 3.5 (Continued)

3.5 Conjugate Gradients Algorithm with Sparse Storage Schemes

3.5.1 Conjugate Gradient Algorithm for Matrix Solution

There are several works which attempt to solve finite element problems using GPUs [66, 88, 89, 90, 91, 92], but they are not based on the element by element technique used in this thesis. The authors of [92] tested the conjugate gradient (CG), biconjugate gradient (BiCG), and biconjugate gradient stabilized (BiCGSTAB) algorithms with popular preconditioning techniques, for example the algebraic multigrid, diagonal, shifted incomplete Cholesky, and shifted incomplete LU methods. The best performance was found for conjugate gradients with preconditioning [92]. The authors of [88] also implemented CG with preconditioning algorithms; the reported speedup is between 2 and 325, varying with the application [88]. The reported speedup for the BiCGSTAB algorithm is on average 8 to 10 [91]. The authors of [89] reported that the speedup obtained with preconditioned conjugate gradients (PCG) on a GPU, with respect to the CPU implementation of the CG algorithm, is between 8 and 10 (depending on the sparse matrix-vector multiplication used). The PCG algorithm is considered efficient for large matrix equations and is therefore usually used for solving a symmetric, sparse, positive definite system of linear equations such as arises from finite element analysis. It reduces the residual in successive orthogonal directions and for n equations is guaranteed to converge in no more than n iterations [93]. We use it because it is computationally cheap. Algorithm 3.6 describes the PCG method for the system of linear equations $[P]\{x\} = \{Q\}$, where $[P]$ is a real, positive definite, symmetric matrix from the finite element discretization and $\{x\}^{(0)}$ is the initial solution of the system, which is improved in each iteration $k$. Preconditioning by the matrix $[M]$ is used to replace the original system by $[M]^{-1}[P]\{x\} = [M]^{-1}\{Q\}$.
The Jacobi preconditioner is one of the simplest forms of preconditioning, in which the preconditioner is chosen to be the diagonal of the matrix: $[M] = \mathrm{diag}([P])$. In the implementation of this algorithm, we use CUDA C for the parallel implementation and C++ for the sequential implementation.

Algorithm 3.6 (Continued)

3.5.2 Matrix Storage Schemes

3.5.2.1 Introduction

In finite element analysis and optimization, the coefficient matrix is very large when dealing with real world problems, but very sparse [13]. For a symmetric matrix, we need to store only the diagonal and the upper or lower triangular part of the matrix. The sparsity property brings storage down to O(n) for the finite element method [13]. The elimination of unnecessary multiplications by 0.0 also speeds up computations significantly. The following sections give an overview of matrix storage schemes.

3.5.2.2 Profile Storage

Profile storage, also known by its equivalent name skyline storage, reduces the storage requirement for a matrix. The matrix is stored in three one-dimensional arrays. Space is allocated for every number to the right of the first non-zero on a row, up to the diagonal term. Therefore renumbering is used first to band the matrix and so reduce storage.

Figure 3.6: A. Sparse full matrix, B. Sparse lower triangular matrix (because of symmetry)

Figure 3.7: Data structures for the symmetric profile storage corresponding to Figure 3.6 B

The matrix of Figure 3.6 A is reduced first to its lower triangular part in Figure 3.6 B. It is then stored as the vectors Diag, giving the diagonal element locations, FC, giving the first column on a row occupied by a non-zero, and V, giving the coefficients of the matrix, which now include the several zeros lying between the first non-zero column and the diagonal. An example matrix and its profile storage scheme vectors are shown in Figures 3.6 and 3.7 respectively.
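Fetching a coefficient from profile storage is then a matter of walking back from the diagonal entry. A short C++ sketch, using the Diag, FC and V arrays described above with 0-based indexing (the figures use 1-based numbering), for the lower triangle $j \le i$:

```cpp
#include <vector>

// Profile (skyline) storage lookup: V holds, row by row, every entry from
// the first nonzero column of the row up to the diagonal; FC[i] is that
// first column; Diag[i] is the index of row i's diagonal entry within V.
double profile_get(const std::vector<double>& V, const std::vector<int>& Diag,
                   const std::vector<int>& FC, int i, int j) {
    if (j < FC[i]) return 0.0;       // left of the profile: structurally zero
    return V[Diag[i] - (i - j)];     // walk back (i - j) slots from the diagonal
}
```

By symmetry, an entry with $j > i$ is read as profile_get(V, Diag, FC, j, i).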
The data structure consists of three one-dimensional arrays: a real type array V, whose size equals the number of non-zero elements plus the number of zeros lying between two non-zero elements of a row; an integer type array FC, with one entry per row giving the first non-zero column; and an integer type array Diag of size n, where n is the number of rows/columns of the matrix. When we are dealing with real world problems, the coefficient matrix is a very large sparse matrix [13], and we can use the profile storage scheme to reduce memory consumption.

Figure 3.8: A. Sparse full matrix, B. Sparse upper triangular matrix (because of symmetry)

Figure 3.9: Data structures for the symmetric sparse (CSR) storage corresponding to Figure 3.8 B

3.5.2.3 Sparse Storage Scheme

The sparse storage scheme is also known as the compressed sparse row (CSR) scheme. It is a row-wise (or alternatively column-wise) representation of the nonzero entries in the coefficient matrix of the linear system. For a symmetric matrix, computer memory can be saved by storing only the nonzero entries in each row on and before the main diagonal. The associated column numbers are stored in an integer-valued array JA such that JA(K) is the column number for the coefficient A(K). A mapping vector IA is used to denote the starting location of each row. An example matrix and its sparse storage scheme vectors are shown in Figures 3.8 and 3.9 respectively. The data structure consists of three one-dimensional arrays: a real type array A containing all the non-zero elements of the matrix, whose size equals the number of non-zero elements; and an integer type array JA containing the matrix column indices of the elements of A, whose size also equals the number of non-zero elements.
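Once the third array IA (the row start indices, described next) is added, a CSR matrix-vector product is a short double loop. A C++ sketch with illustrative names, written for a full (unsymmetric-pattern) CSR matrix; the symmetric variant storing only the lower triangle would additionally scatter the transpose term of each off-diagonal entry:

```cpp
#include <vector>

// CSR matrix-vector product {y} = [A]{x}: A holds the nonzero values,
// JA the column index of each value, and IA (size n+1) the offset at
// which each row's entries begin in A and JA.
std::vector<double> csr_matvec(const std::vector<double>& A,
                               const std::vector<int>& JA,
                               const std::vector<int>& IA,
                               const std::vector<double>& x) {
    int n = (int)IA.size() - 1;
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = IA[i]; k < IA[i + 1]; ++k)   // nonzeros of row i
            y[i] += A[k] * x[JA[k]];
    return y;
}
```

On a GPU, one thread (or one warp) per row parallelizes this loop naturally, since each row writes to a distinct y[i].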
Another integer type array IA contains the index at which each row starts in the arrays A and JA. The size of this array is n+1, where n is the number of rows/columns of the matrix. When we are dealing with real world problems, the coefficient matrix is a very large sparse matrix [13], and we can use the CSR storage scheme to reduce memory consumption and speed up the computations.

Chapter 4

Test and Validation Problems

4.1 Device design inverse-optimization problem: Design of the Pole Face of an Electrical Motor

4.1.1 Problem Definition

Our mesh generators and solvers will be demonstrated on two examples from design. First let us consider the following sample problem. The objective is to achieve a uniform flux density distribution in the vertical direction in the air gap of a pole face (see Figure 4.1). Since the air gap in turbo-alternators is small compared to the radius of the machine, the shape of the pole face can be approximated by a straight line.

Figure 4.1: Pole face of an electrical motor

Figure 4.2 gives the related dimensions, material properties and field excitation values used. The symmetry of the magnetic fields with respect to the pole axis allows the modeling of just half the pole pitch, where the pole axis, which is the line of symmetry, is located at the right boundary of the finite element solution domain. The relative permeability of 20 for the magnetic circuit is deliberately set this low, so that the leakage flux through air at the left edge of the pole face is larger than for a higher and more realistic permeability. Our requirement is to have a uniform vertical flux density of 1 tesla along measuring points on the stator. That means all our measuring points must have their y direction flux density (the x direction derivative of the vector potential) at 1 tesla.
The influence of this leakage flux requires a significant correction in the shape of the pole face close to the left edge in order to achieve the desired constant flux density in the air gap. This example is frequently used in the demonstration of electromagnetic optimization methods and can be considered a standard demonstration example [13].

Figure 4.2: Geometry, boundary conditions and material properties of the sample problem

4.1.2 Problem Model

Figure 4.3 shows how to model the problem using our software. There are 11 fixed points (shown in Figure 4.3), 10 variable heights (h1...h10) to be optimized, 8 measuring points (purple dots), and 4 materials (stator, air, coil, back iron) in this problem. Figure 4.4 shows the generated mesh. Figure 4.5 shows the equipotential lines of the starting design of the pole face of a motor. Since this is a 2D magnetostatic problem, these lines represent the flux lines as well. As we discussed in Chapters 1 and 3, we use the genetic algorithm for optimization. In genetic algorithm optimization [61, 62], several copies of the matrix are held on the GPU and the corresponding solutions attempted. We have tried different problems using the GA on a GPU [12, 3].

Figure 4.3: Defining the problem using our tool

When we optimize this problem directly, by defining an independent parameter for the displacement of each point on the pole face, the shape we get is given in Figure 4.6. This shape is not manufacturable, although the solution gives a very good result (see Table 4.1 and Figure 4.9). It is clear that this type of pole face cannot be practically constructed.

Figure 4.4: Generated mesh using our tool

Figure 4.5: Finite element solution

There are two solutions to this problem: we can add some constraint to the solution, or force the variable points to be arranged in a curve of known mathematical form. Both methods have been tested by Wijesinghe [22].
He claimed that even though the second method is easy because of the known equation, the result is not as good as that of the first method.

Figure 4.6: Results of the un-constrained optimization of the problem

Figure 4.7: Results of the constrained optimization without smoothening

An erratic undulating shape with sharp edges arose when Pironneau [18] optimized a pole face to achieve a constant magnetic flux density, and this was overcome through constraints [19].

Figure 4.8: Results of the constrained optimization with smoothening

Table 4.1: The flux distribution from the un-constrained optimization

Measuring point   Flux density (T)
1                 1.001
2                 1.001
3                 1.002
4                 0.996
5                 1.001
6                 1.001
7                 0.999
8                 0.999

Haslinger and Neittaanmaki [94] suggest Bezier curves to keep the shapes smooth with just a few variables to be optimized, while Preis et al. [95] have suggested fourth order polynomials, which, when we tried them, gave us smooth but undulating shapes. As such we follow Subramaniam et al. [19] and extend their principle, so as to maintain a non-undulating shape by imposing the constraints of equation (4.1) on the neighboring heights to ensure a smooth shape. Even this gives a non-smooth shape (Figure 4.7), but we use the averaging of neighboring heights shown in Figure 4.10 to obtain a very manufacturable shape, as demonstrated in Figure 4.8. The final results are given in Table 4.2 and Figure 4.11. The average error percentage of the un-constrained optimization is 0.15%, while the constrained optimization method has an error percentage of 1.0625%. Even though un-constrained optimization gives a more accurate solution than constrained optimization, the resulting shape from the un-constrained problem is not practicably manufacturable.
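The neighbor-height averaging used here is described in detail with Figure 4.10 below; a minimal C++ sketch of a 3-neighbor version may still be useful at this point. The choice of ghost values (repeating the end heights) is an illustrative assumption, not necessarily the exact padding our tool uses.

```cpp
#include <vector>

// Three-neighbor averaging of the variable heights: each height becomes
// the mean of itself and its two neighbors. One ghost value is padded at
// each end (here, a copy of the end height) so that the first and last
// heights can also be averaged.
std::vector<double> smooth_heights(const std::vector<double>& h) {
    std::vector<double> padded;
    padded.push_back(h.front());                     // ghost element in front
    padded.insert(padded.end(), h.begin(), h.end());
    padded.push_back(h.back());                      // ghost element at the end
    std::vector<double> out(h.size());
    for (std::size_t i = 0; i < h.size(); ++i)
        out[i] = (padded[i] + padded[i + 1] + padded[i + 2]) / 3.0;
    return out;
}
```

A 5-neighbor version pads two ghost values at each end and divides by 5, exactly as the text describes.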
Figure 4.9: The flux distribution from the un-constrained optimization

The average error percentage was calculated using the formula
\[ E_p = \frac{100}{n} \sum_{i=1}^{n} \frac{|B_i - \bar B_i|}{\bar B_i}, \]
where n is the number of measuring points, $E_p$ is the error percentage, $B_i$ is the calculated flux density and $\bar B_i$ is the target flux density. We have reported solutions for a population size of 100. Figure 4.10 shows the averaging technique which has been used to get a smooth, manufacturable shape. The figure shows a 3-neighbor averaging technique: we have a mask that covers 3 elements, and the mask slides from the left-most element to the right-most element, updating the middle element with the average of the three values covered by the mask. We introduce 2 temporary elements, one at the front and one at the end, to allow the averages of the first and last elements to be calculated. If we wish to use a 5-neighbor technique, we need to introduce 2 elements at the front and 2 elements at the end for the first and last elements.

Figure 4.10: Averaging technique for a manufacturable shape

Table 4.2: The flux distribution from the constrained optimization

Measuring point   Flux density (T)
1                 0.991
2                 1.016
3                 1.008
4                 1.008
5                 0.977
6                 0.978
7                 0.993
8                 0.988

Figure 4.11: The flux distribution from the constrained optimization

4.2 Inverse-optimization for Device Design: Determining the Rotor Contour of a Salient Pole Synchronous Generator

4.2.1 Problem Definition

The second example is determining the pole face contour of a salient pole synchronous generator, to demonstrate the parametrized mesh generator and matrix solution software as applied to constrained optimization.
The current density in the excitation coil and the geometric parameters that define the shape of the pole piece have to be predicted in order to achieve a sinusoidal distribution of the airgap flux with a peak value of 1.0 T and to reduce the flux leakage, while the airgap is constrained to a minimum to prevent the rotor from hitting the stator (Figure 4.12).

Figure 4.12: A synchronous generator: (A) two pole and (B) four pole

Figure 4.13 gives the related dimensions, material properties and field excitation values used. The symmetry of the magnetic fields with respect to the pole axis allows the modeling of just half the pole pitch, where the pole axis, which is the line of symmetry, is located at the right boundary of the finite element solution domain. The stator is idealized as a solid steel region without slots, and both stator and rotor are made of linear steel with the relative permeability given in Figure 4.13. The constraints are: the current density in the coil (in A/mm²) must not exceed the limit for copper windings; the air gap between stator and rotor, x < 2 cm; and the flux through the points A1 and A2 must be less than 0.3 times the flux through the points A3 and A1, which means the allowable leakage flux is 30%.

Figure 4.13: Parametrized geometry of the salient pole

4.2.2 Problem Model

Figure 4.13 explains the parameters of the problem and how to model it. There are 14 fixed points, 16 variable heights (h1...h16), 8 measuring points and 4 materials in this problem (see Figure 4.14). Figure 4.15 shows the corresponding starting mesh. Figure 4.16 shows the flux lines of this salient pole synchronous generator at the start. When we optimize this problem directly, by defining an independent parameter for the displacement of each point on the rotor as in the previous example, the shape we get is given in Figure 4.17. This shape is also not manufacturable, but the solution gives a very good result in terms of a sinusoidal distribution of the airgap flux with a peak value of 1 T (Table 4.3 and Figure 4.17).
Figure 4.14: Defining the problem

Figure 4.15: Initial mesh

Figure 4.16: Flux lines of a salient pole synchronous generator

Figure 4.17: Optimized shape without smoothening, constrained by rising pole heights from left to right

Figure 4.17 shows the flux lines for the optimum solution with a constraint like that in (4.1), but with the height of the shaped surface having to go up from left to right. It has sharp corners but is reasonably smooth. We then use an averaging technique to remove the sharp bends: we took five neighboring values of a height and calculated the mean for every variable solution, with suitable modification for the boundary variables, to get Figure 4.20.

Table 4.3: The flux distribution after optimization, without the smoothened shape

Measuring point   Flux density (T)   Target flux density (T)
1                 0.0202             0.0000
2                 0.1869             0.1736
3                 0.3721             0.3420
4                 0.4987             0.5000
5                 0.6321             0.6428
6                 0.7453             0.7660
7                 0.8769             0.8660
8                 0.9215             0.9397
9                 0.9709             0.9848
10                1.0091             1.0000

Figure 4.18: The flux distribution after optimization without the smoothened shape

Figure 4.19: The flux distribution after optimization with the smoothened shape

The final results are given in Table 4.4 and Figure 4.19. The average error percentage of the un-smoothened optimization is 2.96%, while the smoothened optimization method has an error percentage of 2.24%. The average error percentage is calculated using Equation 4.2 (the first point is not included because of the zero denominator in Equation 4.2). Since the GA is a stochastic optimization algorithm [96, 97], the optimization value is not always perfect; the optimum value depends on the initial population and the search space [96, 97].
Table 4.4: The flux distribution after optimization with the smoothened shape

Measuring point   Flux density (T)   Target flux density (T)
1                 0.0083             0.0000
2                 0.1596             0.1736
3                 0.3560             0.3420
4                 0.4964             0.5000
5                 0.6434             0.6428
6                 0.7517             0.7660
7                 0.8549             0.8660
8                 0.9183             0.9397
9                 0.9986             0.9848
10                1.0035             1.0000

Figure 4.20: Final smoothened shape

4.3 NDE benchmark problem: Characterizing Interior Defects

4.3.1 Problem Definition

As an example, when an armored vehicle is targeted by an improvised explosive device (Figure 4.21), the armor is inspected by an eddy current probe to determine whether there is damage or not. But we wish to characterize the interior damage to determine if the vehicle should be withdrawn from deployment. The same system is also intended for regular rust inspection [98, 99, 100].

Figure 4.21: Inspection of an army vehicle after an improvised explosive device attack

Figure 4.22 presents a parametrically described crack in steel excited by an eddy current probe, shown through the mesh. In this NDE exercise the parameters need to be optimized to make the computed fields match the measurements. An objective function F is defined as the sum of the squares of the differences between computed and desired performance values:
\[ F = \sum_i \left(B_i - \bar B_i\right)^2 \]
at measurement points i, where $B_i$ is the calculated magnetic flux density and $\bar B_i$ is the measured flux density. By minimizing the objective function F by the optimization method, the characteristics of the defect can be estimated, since F is a function of the parameters.

Figure 4.22: Parametrically defined crack in a plate, meshed by Triangle

4.3.2 Problem Model

Figure 4.23 shows the parameters of the described NDE problem and how to model it using our tools.
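The objective function F defined above is what the GA evaluates once per candidate parameter set, after the finite element solution for that candidate has been computed. A C++ sketch with illustrative names:

```cpp
#include <cstddef>
#include <vector>

// Objective function: sum of squared differences between the computed
// and measured flux densities at the measurement points. The GA searches
// the crack parameters that minimize this value.
double objective(const std::vector<double>& computed,
                 const std::vector<double>& measured) {
    double F = 0.0;
    for (std::size_t i = 0; i < computed.size(); ++i) {
        double d = computed[i] - measured[i];
        F += d * d;
    }
    return F;
}
```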
There are 14 fixed points and 6 variable points (for example); the x coordinates are fixed [101]. There are also 4 materials (air, crack, steel plate and coil) in this problem. Figure 4.24 shows the generated mesh for this problem. Different materials are shown in different colors (although in black and white in this printout). Figure 4.25 shows the flux lines for this example. Figure 4.26 shows the true defect and the reconstructed defect.

Figure 4.23: Defining the problem

Figure 4.24: Generated mesh for the NDE problem

Table 4.5 shows the variable points (defect coordinates), their x and y coordinates, the Euclidean distances from the centroid of the crack to each point for the true and reconstructed profiles (d1 and d2 respectively), and the normalized difference between d1 and d2, calculated using Equation 4.4. If (x1, y1) and (x2, y2) are two points, the Euclidean distance between them is defined by
\[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}. \]

Figure 4.25: Flux lines for the NDE problem

Table 4.5: The solution of defect characterization

                 True profile           Reconstructed profile   Norm error
Variable point   x     y      d1        x     y      d2         ((d1-d2)/d1)^2
1                8     2.00   1.5892    8     2.24   1.5268     0.001541
2                9     1.20   1.4162    9     2.34   0.5331     0.388815
3                10    2.40   0.5153    10    1.96   0.7544     0.215191
4                11    2.20   1.5348    11    2.15   1.5461     0.000054
5                11    2.70   1.5101    11    2.54   1.5000     0.000044
6                10    3.00   0.6896    10    2.94   0.6497     0.003342
7                9     3.20   0.8400    9     3.42   1.0251     0.048597
8                8     3.50   1.7890    8     3.55   1.8167     0.000240

Figure 4.26: Optimum shape of the reconstructed defect
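The distance and error measures behind Table 4.5 can be sketched in a few lines of C++ (helper names illustrative): the Euclidean distance of Equation 4.4 and the squared normalized difference in the table's last column.

```cpp
#include <cmath>

// Euclidean distance between (x1, y1) and (x2, y2), as in Equation 4.4.
double euclidean(double x1, double y1, double x2, double y2) {
    return std::sqrt((x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1));
}

// Normalized error between the true and reconstructed centroid distances
// d1 and d2: ((d1 - d2)/d1)^2, the last column of Table 4.5.
double norm_error(double d1, double d2) {
    double e = (d1 - d2) / d1;
    return e * e;
}
```

Feeding in row 1 of Table 4.5 (d1 = 1.5892, d2 = 1.5268) reproduces the tabulated norm error of 0.001541 to the printed precision.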
This thesis presents the particular solution for a population of 200. Tests were carried ou t for the different population sizes and different number of iterations and we measured the time taken to compute the solution and tabulated these in Table 4.6. The best fitness score for different population sizes and different number of iterations is rep orted in Table 4.7 for real and binary representations of GA solutions [102]. We can see that solutions from parameters represented by real numbers are obtained faster than the solutions that are represented by binary numbers [102]. Table 4.6: Real and bin ary solutions time need to compute Population Size 30 iterations 40 iterations 50 iterations Time Taken(s) Time Taken(s) Time Taken(s) Real Binary Real Binary Real Binary 10 20 30 40 50 60 274.33 540.68 861.79 1319.36 1583.73 1757.78 311.97 661.79 10 24.65 1431.34 1769.6 2131.94 346.58 695.28 1095.82 1666.79 1994.77 2199.26 408.29 891.06 1258.17 1950.36 2302.99 2534.36 413.16 866.47 1588.32 2024.03 2323.93 2656.53 503.43 1105.42 1749.1 2390.25 2891.94 3303.85 Table 4.7: Real and binary solutions time need to compute Population Size 30 iterations 40 iterations 50 iterations Best Fitness Score Best Fitness Score Best Fitness Score Real Binary Real Binary Real Binary 10 20 30 40 50 60 0.0123 0.0084 0.0040 0.0105 0.0111 0.0127 0.0019 0.0012 0.0009 0 .0017 0.0017 0.0017 0.0091 0.0084 0.0040 0.0105 0.0068 0.0127 0.0020 0.0008 0.0009 0.0017 0.0017 0.0016 0.0091 0.0084 0.0040 0.0105 0.0068 0.0127 0.0015 0.0008 0.0009 0.0017 0.0017 0.0000 98 4.4 A Simple Three - dimensional Problem For testing purposes, we took a small cube which is made of a material with relative permittivity 1. Another inner cube with a relative permittivity of 1 and charge density of 1C/m3 condit ion). The measuring points (along a line) are shown in Figure 4.27. 
Then we solved the Poisson equation to calculate the potential $\phi$ using FEM:
\[ -\nabla\cdot(\epsilon\nabla\phi) = \rho. \]

Figure 4.27: Square conductor problem

There are 16 fixed points, 2 materials (air and the charged material) and 100 measuring points (along the line) in this problem. Since this is not an optimization problem, there are no variable points. Figure 4.28 shows the generated mesh for this problem. Different materials are given in different colors: light green for material 1 and red for material 2. Figure 4.29 shows the potential at the measuring points. We may recognize that the potential within the inner cube is nearly constant because the inner cube carries a constant uniform charge.

Figure 4.28: Mesh for the square conductor problem

Figure 4.29: Potential at the measuring points

Figure 4.30: Potential at the measuring points for a simple cube

Table 4.8 shows the numerical solution for this problem. The first column gives the measuring point number; the second and third columns give the electric potential value and the tetrahedron number in which the particular measuring point lies, respectively. Figure 4.30 shows the potential at the measuring points for a simple cube problem (a simple cube with the same dimensions as above and a charge density of 1 C/m³, with no inner and outer cubes and no change in permittivity).
Table 4.8: Potentials at the measuring points

Point   Potential (V)   Element#     Point   Potential (V)   Element#
1       0.0000          19962        51      3.4708          10058
2       0.2159          13483        52      3.4711          12276
3       0.4318          13483        53      3.4714          12276
4       0.6477          13483        54      3.4717          12276
5       0.8636          13483        55      3.4721          12276
6       1.0795          13483        56      3.4727          7335
7       1.2954          13483        57      3.4736          7335
8       1.4938          982          58      3.4745          7335
9       1.6796          982          59      3.4755          7335
10      1.8616          12016        60      3.4763          1321
11      2.0228          18763        61      3.4767          1321
12      2.1673          18763        62      3.4772          18264
13      2.3118          18763        63      3.4777          18264
14      2.4446          5702         64      3.4793          15140
15      2.5760          5702         65      3.4844          15136
16      2.6948          21367        66      3.4897          15136
17      2.8076          21367        67      3.4950          15136
18      2.9204          21367        68      3.5003          15136
19      3.0161          21366        69      3.5056          15136
20      3.0922          18694        70      3.5109          15136
21      3.1629          15809        71      3.4988          15448
22      3.2293          5610         72      3.4792          15448
23      3.2924          5610         73      3.4596          15448
24      3.3532          14667        74      3.4400          15448
25      3.3950          25126        75      3.4205          15448
26      3.4169          25126        76      3.4009          15448
27      3.4388          25126        77      3.3562          15446
28      3.4607          25126        78      3.2909          15446
29      3.4826          25126        79      3.2257          15446
30      3.5045          25126        80      3.1591          8028

Table 4.8: Potentials at the measuring points (Continued)

Point   Potential (V)   Element#     Point   Potential (V)   Element#
31      3.5181          23548        81      3.0841          19985
32      3.5122          23548        82      3.0092          19985
33      3.5064          23548        83      2.9343          19985
34      3.5005          23548        84      2.8152          18420
35      3.4947          23548        85      2.6758          18420
36      3.4888          23548        86      2.5364          18420
37      3.4830          23548        87      2.3971          18422
38      3.4772          23548        88      2.2578          18422
39      3.4723          22059        89      2.1185          18422
40      3.4719          22059        90      1.9794          8298
41      3.4712          19315        91      1.8409          8298
42      3.4708          19049        92      1.6655          17398
43      3.4707          19049        93      1.4562          17398
44      3.4708          4482         94      1.2470          17398
45      3.4707          4482         95      1.0383          14912
46      3.4705          4482         96      0.8303          14912
47      3.4703          4482         97      0.6222          14912
48      3.4701          4482         98      0.4142          14912
49      3.4704          17128        99      0.2072          13685
50      3.4706          5560         100     0.0001          13685

This is a simple problem; we will test the mesh generator on the more complex system of the next section. These results were verified against the mesh generator Gmsh [29] with Matlab functions.
This simple problem verifies that our parameterized mesh generator and FEM solver work correctly.

4.5 NDE benchmark problem in 3D: Characterizing Interior Defects

As a more complex example for testing our parameterized mesh generator, this is a three-dimensional version of the example of Section 4.3. We wish to characterize the interior damage to a land vehicle hull to determine if the vehicle should be withdrawn from deployment. Figure 4.31 shows the problem frame: air (the upper half of Figure 4.31) and the steel plate (the lower half of Figure 4.31).

Figure 4.31: Three-dimensional NDE problem

Figure 4.32 shows how to define a crack whose outline coordinates and location coordinates change as the optimization algorithm runs. We define two surfaces whose outer coordinates are the same, and we place constraints on the y coordinates to ensure that this is truly a volume, without the surfaces crossing each other. In the top part of Figure 4.32 there are two surfaces, each with a few variable points shown in green, while the outer common coordinates are given in blue. The lower part of Figure 4.32 shows the top view of the top surface, which has 13 variable points in addition to its outer common coordinates. Similarly, the bottom surface also has variable points. Figure 4.33 shows the generated mesh for this problem. Different materials are shown in different colors: green for air, red for steel, blue for the coil and yellow for the crack (made by the 2 surfaces), although this printout is black and white. For this test problem we took 5 variable points on the upper surface, 8 variable points on the common interface and 4 variable points on the lower surface. There are a total of 17 × 3 coordinates associated with the 17 variable points. Here we make the further assumption that the points vary in the y direction only; the x and z coordinates are fixed. Table 4.9 shows the true and characterized profile coordinates and the normalized distance between the results.
Figure 4.32: Defining the variables: top: side view; bottom: top view

Figure 4.33: Three-dimensional mesh for the NDE problem, as the parameters change

The last column of Table 4.9 shows the norm error (the formula is also given in the table). The average error is 2.5%, calculated using the formula
\[ \text{average error} = \frac{1}{n}\sum_{i=1}^{n} \frac{|\bar y_i - y_i|}{\bar y_i}, \]
where n is the number of variables, $\bar y_i$ is the target value and $y_i$ is the reconstructed value. This thesis presents the particular solution for a population of 200. This demonstrates the validity of our 3D optimization tools.

Table 4.9: The solution of 3D defect characterization

        True profile               Reconstructed profile      Norm error
#       x       y       z          x       y       z          |ȳ − y|/ȳ
1       52.853  44.000  50.927     52.853  42.628  50.927     0.0312
2       51.763  42.000  52.427     51.763  43.892  52.427     0.0451
3       48.237  45.000  52.427     48.237  43.682  52.427     0.0293
4       47.147  42.000  50.927     47.147  41.873  50.927     0.0030
5       48.237  45.000  47.573     48.237  44.670  47.573     0.0073
6       50.000  44.000  47.000     50.000  41.893  47.000     0.0479
7       51.763  43.000  47.573     51.763  44.735  47.573     0.0403
8       52.853  43.000  49.073     52.853  42.444  49.073     0.0129
9       51.427  46.000  50.464     51.427  44.239  50.464     0.0383
10      50.000  46.000  51.500     50.000  47.193  51.500     0.0259
11      48.573  47.000  50.464     48.573  46.317  50.464     0.0145
12      50.882  46.500  48.786     50.882  46.483  48.786     0.0004
13      50.000  48.000  50.000     50.000  47.219  50.000     0.0163
14      51.427  40.000  50.464     51.427  37.999  50.464     0.0500
15      48.573  40.000  50.464     48.573  39.835  50.464     0.0041
16      49.118  40.000  48.786     49.118  40.022  48.786     0.0005
17      50.882  40.000  48.786     50.882  38.213  48.786     0.0447
18      50.000  38.000  49.000     50.000  36.903  49.000     0.0289

Chapter 5

Results and Analysis

5.1 Memory Limitation

Recently, graphics processing unit (GPU) computing has had great success in many very large numerical computations. Software developers, researchers, and scientists have been using the GPU to speed up their computations.
Applications taking advantage of this new technology have ranged from quantum chemistry [66] and molecular dynamics [103, 104] to fluid dynamics [105, 106] and cloth simulation [107]. In this work we discuss the often undiscussed GPU memory limitation in finite element optimization. The memory of an Nvidia GPU is limited, and this part of the thesis assesses the memory limits in terms of matrix size, in light of the various ways of storing a large matrix so as to overcome these limits. We took a 4 cm² square conductor with current density J = 1 A/mm² and relative permeability 1. Then we solved the Poisson equation to calculate the magnetic vector potential A using FEM:
\[ -\nabla\cdot\left(\frac{1}{\mu}\nabla A\right) = J. \]
In this experiment we defined different numbers of progressively refined triangles, i.e., 288, 768, 1408, etc. The final solution can be obtained from the equation [P]{A} = {Q}, where [P] is the matrix and {A} and {Q} are vectors. In this experiment we used both the profile and sparse storage schemes to reduce the storage required. The matrix [P] is a symmetric positive definite sparse matrix; each row has approximately 3 to 5 elements in a symmetric half [13], because of the first order mesh. Table 5.1 shows the total number of matrix elements and the storage for different matrix sizes and storage schemes. Clearly, we can store very large matrices using the profile or sparse storage scheme, because in FEM the matrix [P] is a symmetric sparse matrix [13]. According to our results, memory-wise the sparse storage scheme is much better than the profile storage scheme, as is to be expected because of fill-in during decomposition with the latter. Although this is well known, we repeat the investigation to obtain the memory limits with CUDA; in inverse problems, where many equations need to be solved, this is limiting. Figure 5.1 (A) shows the memory required for the matrix versus the number of matrix elements. Figure 5.1 (B) shows the memory required for the profile and sparse storage schemes.
Table 5.1: Number of elements (NE) and storage (in MB) with matrix size for different storage schemes

Matrix size   NE (Regular)   Regular (MB)   NE (Profile)   Profile (MB)   NE (Sparse)   Sparse (MB)
100           10500          0.040054       1161           0.004429       1709          0.006519
400           18000          0.068665       10819          0.041271       4421          0.016865
900           814500         3.107071       33329          0.127140       9521          0.036320
1600          2568000        9.796143       75239          0.287014       18421         0.070271
2500          6262500        23.889542      142549         0.543781       29801         0.113682
6000          36030000       137.443543     404459         1.542889       71681         0.273441
8000          64040000       244.293213     697679         2.661434       94221         0.359425
10000         100050000      381.660461     1070099        4.082104       118021        0.450214

Using curve-fitting projection we determined the maximum size of the problem that may be attempted within the 4 GB memory limit of the GeForce GTX 970 GPU card that we worked with. Table 5.2 shows the matrix size and the corresponding memory needed to store the variables. From these calculations we can say that the sparse storage scheme can be used for very large problems, of size 50,000K × 50,000K; it takes much less memory than the profile storage scheme, as remarked. Using the sparse storage scheme we can go beyond a 50,000K × 50,000K matrix before running into the 4 GB memory limit. With the 24 GB Kepler K80 GPU cards [68] now available, we can solve problems much bigger than a 50,000K × 50,000K matrix.

Figure 5.1: Memory vs matrix size

Table 5.2: Projected Memory (in MB) Needs

Size      Regular          Profile        Sparse
20K       1525.2           14.73          0.90
30K       3430.5           32.13          1.35
50K       9526.5           87.06          2.26
100K      38097.0          341.79         4.52
500K      952240.0         8417.70        22.58
1000K     3808900.0        33608.00       45.17
5000K     95220000.0       838940.00      225.85
10000K    380880000.0      3355100.00     451.70
50000K    9522000000.0     83865000.00    2258.50
100000K   38088000000.0    335450000.00   4517.00

Our conjugate gradient method with Jacobi preconditioning gives the results shown in Figure 5.2.
We stopped our experiment at a size of 10000 × 10000 because, with a population of 5000 in the GA, the memory limits discussed above theoretically allow us to go up to 10000 × 10000 with the sparse storage scheme. If we want a larger population, the problem size must be decreased. Even though the sparse storage scheme reduces memory consumption, we cannot solve problems larger than 10000 × 10000 with populations larger than 5000. But with new GPU cards like the Kepler K80 [68], which came out as this thesis was completed, memory will not be a problem.

Figure 5.2: Speed-up versus matrix size: Jacobi preconditioned conjugate gradients algorithm

5.2 Element-by-Element Solvers

Again we solved the Poisson equation for the magnetic vector potential A with progressively refined meshes, using the finite element method, for a test problem from magnetics. We defined different numbers of progressively refined tetrahedra to obtain results for different matrix sizes. Run-time statistics obtained for the different mesh sizes are given in Table 5.3, which compares the CPU and GPU calculation times for different sizes of problems for the conjugate gradient element-by-element (CGEbE) algorithm. The table gives the speedup, defined as CPU calculation time/GPU calculation time, and the speedup per iteration, defined as CPU calculation time per iteration/GPU calculation time per iteration. It shows a high speedup of 51.23 for 2,828,782 unknowns, which is equivalent to a matrix size of 2,828,782 × 2,828,782 in regular first-order finite element analysis.
Table 5.3: The CGEbE solution

Elements    Unknowns    CPU Itns   CPU Time    GPU Itns   GPU Time   Speedup (S)   S per itn
26595       3072        51         1.15        66         0.33       3.4848        4.5098
32682       3843        55         1.63        69         0.33       5.0938        6.3903
45328       5440        60         2.39        69         0.33       7.2424        8.3288
106587      13312       79         7.38        90         0.45       16.4000       18.6835
210024      27852       119        22.34       126        0.73       30.6027       32.4029
273008      36704       130        36.02       134        0.91       39.5824       40.8003
399928      54672       148        54.32       151        1.33       40.8421       41.6700
978905      137639      206        189.24      201        7.11       26.6160       25.9700
1940736     279340      294        524.26      271        25.19      20.8122       19.1841
2569403     372149      329        953.45      310        41.27      23.1027       21.7685
3830869     559546      413        1542.75     346        75.06      20.5536       17.2192
9521059     1413142     645        11120.06    514        308.25     36.0748       28.7480
18879664    2828782     903        46348.39    733        904.54     51.2397       41.5933

The number of iterations depends on both the number of unknowns and the total number of elements (see the algorithms in Chapter 3). That is why both iterations and speedup are plotted against the number of unknowns and against the total number of elements. Figure 5.3 shows the number of iterations vs the number of unknowns, and Figure 5.4 the number of iterations vs the number of elements. In both graphs the number of iterations increases with problem size, for both CPU and GPU. The reason for the difference between the number of iterations on the GPU and on the CPU is explained later in this section. Figure 5.5 shows the speedup vs the number of unknowns for CGEbE, and Figure 5.6 the speedup vs the number of elements. The speedups vary with problem size, and an erratic character can be seen in Figures 5.5 and 5.6. The causes of the erratic speedup need to be investigated [108].
Figure 5.3: Number of iterations vs number of unknowns for CGEbE

Figure 5.4: Number of iterations vs number of elements for CGEbE

Since we have been working with the regular finite element method for the Poisson equation of magnetics, the bi-conjugate gradient method [34] behaves like the conjugate gradient method because [A] = [A]^T [34, 77]. Where the convection effect in temperature problems is significant, however, the finite element matrix equations take antisymmetric form [83]. Therefore, for a generalized finite element package, antisymmetric solvers should also be included.

Table 5.4 shows the CPU and GPU calculation times for different sizes of problems for the biconjugate gradient stabilized element-by-element (BiCGSTABEbE) method. Even though it gives a higher speedup than the CGEbE method, the BiCGSTABEbE method takes a long time to converge; the differences can be seen in Tables 5.3 and 5.4. The BiCGSTABEbE method shows a higher speedup of 80.1364 for 54,672 unknowns, while the speedup is only 20.55 for 2,828,782 unknowns. So the BiCGSTABEbE method is suited to solving small problems (smaller than about 54,672 unknowns).

Figure 5.5: Speedup vs number of unknowns for CGEbE

Figure 5.6: Speedup vs number of elements for CGEbE

Figure 5.7 shows the number of iterations vs the number of unknowns for BiCGSTABEbE, and Figure 5.8 the number of iterations vs the number of elements. In both graphs the number of iterations increases with problem size. This is as to be expected for a well-conditioned problem, such as this.
Table 5.4: The BiCGSTABEbE solution

Elements    Unknowns    CPU Itns   CPU Time    GPU Itns   GPU Time   Speedup (S)   S per itn
26595       3072        71         3.44        69         0.34       10.1176       9.8326
32682       3843        72         4.27        72         0.34       12.5588       12.5588
45328       5440        83         6.84        72         0.35       19.5429       16.9528
106587      13312       107        21.24       96         0.50       42.4800       38.1129
210024      27852       143        56.68       129        0.88       64.4091       58.1033
273008      36704       145        74.81       134        1.08       69.2685       64.0137
399928      54672       162        123.41      146        1.54       80.1364       72.2217
978905      137639      219        421.69      198        6.41       65.7863       59.4780
1940736     279340      263        1051.64     237        21.24      49.5122       44.6175
2569403     372149      361        1785.89     260        33.43      53.4218       38.4755
3830869     559546      371        5158.10     375        82.78      62.3109       62.9828
9521059     1413142     466        11147.04    569        366.80     30.3900       37.1071
18879664    2828782     553        28191.35    884        1371.98    20.5479       32.8470

Figure 5.7: Number of unknowns vs number of iterations for BiCGSTABEbE

Figure 5.9 shows the speedup vs the number of unknowns for BiCGSTABEbE, and Figure 5.10 the speedup vs the number of elements. The speedups vary with problem size. The speedup decreases after 54,672 unknowns, in contrast to the speedup for CGEbE, which increases after 54,672 unknowns. Again, the causes of the erratic speedup need to be investigated [108].

Table 5.5 compares the CPU and GPU calculation times for different sizes of problems for the Jacobi EbE algorithm, giving the speedup and the speedup per iteration. It shows a high speedup of 102.12 for 54,672 unknowns, which is equivalent to a matrix size of 54,672 × 54,672 in regular first-order finite element analysis.

Figure 5.8: Number of elements vs number of iterations for BiCGSTABEbE

Figure 5.9: Speedup vs number of unknowns for BiCGSTABEbE

Figure 5.11 shows the number of iterations vs the number of unknowns for Jacobi EbE, and Figure 5.12 the number of iterations vs the number of elements. In both graphs the number of iterations increases with problem size, as is to be expected.
This holds for both CPU and GPU. Figure 5.13 shows the speedup vs the number of unknowns for Jacobi EbE, and Figure 5.14 the speedup vs the number of elements. The speedup decreases after 372,149 unknowns (see Table 5.5).

Figure 5.10: Speedup vs number of elements for BiCGSTABEbE

Figure 5.11: Number of unknowns vs number of iterations for JacobiCG

Figures 5.15 and 5.16 show the convergence of the CGEbE algorithm on the CPU and GPU respectively. Both convergence patterns are almost the same and the convergence rate is very high in both figures. Figures 5.17 and 5.18 show the convergence of the BiCGSTABEbE algorithm on the CPU and GPU respectively. The two convergence patterns are slightly different and the convergence rate is high in both figures.

Table 5.5: The Jacobi solution

Elements    Unknowns    CPU Itns   CPU Time     GPU Itns   GPU Time   Speedup (S)   S per itn
26595       3072        806        17.24        807        0.70       24.6286       24.6591
32682       3843        889        23.58        889        0.78       30.2308       30.2308
45328       5440        1019       37.01        1018       0.95       38.9579       38.9197
106587      13312       1557       134.42       1557       1.96       68.5816       68.5816
210024      27852       2371       409.10       2374       4.67       87.6017       87.7126
273008      36704       2712       619.42       2713       6.52       95.0031       95.0381
399928      54672       3320       1129.47      3314       11.06      102.1221      101.9375
978905      137639      5376       6219.27      5374       66.84      93.0471       93.0125
1940736     279340      7826       19026.61     7819       269.75     70.5342       70.4711
2569403     372149      9070       31069.48     9063       452.96     68.5921       68.5392
3830869     559546      11185      56743.83     11176      930.97     60.9513       60.9023
9521059     1413142     17853      217808.09    17856      4364.02    49.9100       49.9184
18879664    2828782     24904      488881.06    24908      12852.42   38.0381       38.0441

Figure 5.12: Number of elements vs number of iterations for JacobiCG

Figures 5.19 and 5.20 show the convergence of the Jacobi EbE algorithm on the CPU and GPU respectively. Both the CPU and GPU convergence rates are very slow compared to CGEbE and BiCGSTABEbE.
But the convergence patterns on CPU and GPU are the same, and the convergence pattern is smooth. Here there is not much difference in the number of iterations between GPU and CPU, compared to the CGEbE and BiCGSTABEbE algorithms. In the CGEbE and BiCGSTABEbE methods many statements involving the element-by-element process are parallelized on the GPU, but in the Jacobi method only one element-by-element statement executes on the GPU. This is the probable reason for the difference in the number of iterations between CPU and GPU.

Figure 5.13: Speedup vs number of unknowns for Jacobi EbE

Figure 5.14: Speedup vs number of elements for Jacobi EbE

Figure 5.15: Convergence rate of CG in CPU

Figure 5.16: Convergence rate of CG in GPU

Figure 5.17: Convergence rate of BiCGSTABEbE in CPU

Figure 5.18: Convergence rate of BiCGSTABEbE in GPU

Figure 5.19: Convergence rate of Jacobi in CPU

Figure 5.20: Convergence rate of Jacobi in GPU

Figure 5.21 compares the speedup of the Jacobi EbE, CGEbE and BiCGSTABEbE methods. In terms of speedup, for small problems (fewer than 372,149 unknowns) the Jacobi CG method gives the highest speedup, followed by BiCGSTABEbE and then CGEbE; but the convergence time is least for CGEbE, followed by BiCGSTABEbE and then Jacobi EbE. For large problems (more than 372,149 unknowns; see Tables 5.3, 5.4 and 5.5), CGEbE gives the highest speedup, followed by Jacobi EbE and then BiCGSTABEbE; the convergence time is again least for CGEbE, second least for BiCGSTABEbE and then Jacobi EbE (see Tables 5.3, 5.4 and 5.5).

Figure 5.21: Speedup comparison between CGEbE, BiCGSTABEbE and Jacobi EbE algorithms

Figure 5.22 graphically summarizes the results of Kiss et al. [109], which differ from those by the same authors, Kiss et al., in [34], even though both use the same hardware. Kiss et al. [109] got 10.01 as their maximum speedup for computations on a single GPU (against our results, which show a maximum speedup of 102.12).
In terms of speedup we got better results, while at the same time they got a higher convergence rate than ours, with an astoundingly fast rate in terms of iterations (e.g., 79 iterations for a 91,000 × 91,000 matrix on a CPU); the results of Fernandez et al. [33] are more comparable to our Jacobi EbE. Kiss et al. too show an erratic up-and-down speedup that needs to be investigated [108].

Figure 5.22: Speed-up versus matrix size of Kiss et al.

Table 5.6: Speedup ratio between single and double precision

Matrix size   Jacobi   Gauss-Seidel   GMRES(35)   BiCGSTAB
2000          1.9148   1.9197         1.9773      1.9712
4000          1.5568   1.8194         1.9854      1.9976
8000          2.3471   1.9148         1.9255      1.8784
12000         1.9851   1.8095         2.0099      1.9157
16000         2.2012   1.9276         1.8528      1.9545
20000         2.0449   1.8841         1.8819      1.8850

Table 5.6 shows the speedup ratio of the single and double precision GPU implementations of the direct method for linear systems [110]. Theoretically, for memory-bound algorithms, double precision work on the GPU has been shown to take twice as much time as single precision arithmetic [111]. However, there are papers where this is not so [108, 110] (see Table 5.6). Communication is a factor, but the exact nature is still not known, as pointed out in our paper [108].

Devon Yablonski analyzes the numerical accuracy issues found in many scientific GPU applications using floating point computation [112]. As he puts it, two widely held myths are that floating-point computation is inaccurate and that computations on the GPU are unavoidably different from the same computations on a CPU [112]. He appears to have dispelled both myths by studying specific applications.

Figure 5.23: Serial addition losing precision. Numbers surrounded by a box represent the actual resulting floating point value with 7 digits.

Accumulating values serially (Figure 5.23 A) will sometimes result in a large value to which each successive small value is added, resulting in diminished accuracy of the results.
The reduction style of computation (Figure 5.23 B) avoids the issue by accumulating floating-point values in a way similar to binning [112]. Binning describes collecting and computing small groups of values and then computing the final result by combining the group results.

There is sometimes an erratic up-and-down speedup [113], as in Figures 5.5, 5.6, 5.9, 5.10, 5.13 and 5.14. The causes need to be investigated, as we have pointed out [108]. In a study we did of GPU computation for finite element optimization by the genetic algorithm [3], the speedup showed an unexpected, erratic up-and-down trajectory [108]. This result is seen in other works too, such as that of Krawezik and Pool [113], shown in Figure 5.24, and Kiss et al. [34], shown in Figure 5.22. In the absence of an explanation we carry on, but a real understanding of how to obtain the best speedup requires some investigation.

Figure 5.24: Erratic behavior of gain for various methods

While expert programmers have programmed their problems in CUDA C to reap the benefits of speedup, it is to be noted that the compiler is very difficult to work with, as pointed out by us [108]. Error messages are still difficult to use in debugging: when memory is violated in one function, the program crashes in another without a proper error message. Debugging is therefore more difficult in CUDA C, particularly because we cannot print intermediate outputs directly from the GPU.

Chapter 6

Conclusion and Future Works

The first part of this thesis describes a successful script-based, parameterized mesh generator library for design and NDE, developed in C for seamless finite element optimization. Unlike other such systems, written in inaccessible code and designed for use with specific software, this system is provided as open-source code [114, 115] so that it may be modified by anyone and used with his or her own finite element (FE) optimization system.
This system, with the finite element analysis and optimization on the GPU, makes for quick NDE assessments and other finite element optimization problems in the field [3]. The modified open source codes Triangle and TetGen that we have developed for optimization are CPU-based, because mesh generation takes little time compared to finite element matrix solution and optimization. As such, the massive effort to port them to a GPU is not justified; moreover, they are too complex for porting to the GPU. A large team of engineers could translate them too to the GPU, but we do not think the effort is warranted by the gains to be had. The mesh generator has been demonstrated through a successful NDE system for testing army ground vehicle hulls with damage from corrosion or IEDs, and some associated optimization problems.

In the second contribution of this thesis, we have presented our GPU-based EbE matrix solution routine, which is very fast and takes little memory. It has been specifically developed for finite element optimization: several finite element solutions are done on parallel GPU threads for use where memory is critical and solution times long. The GPU-parallelized preconditioned conjugate gradient (PCG) algorithm with sparse-stored matrix formation is the best way for PCG. With 24 GB GPU cards now available, EbE processing is not necessary for most practical, single, forward problems. This element-by-element (EbE) method, however, becomes a must for genetic algorithm optimization where, whether in NDE or device design, several genetic algorithm threads are launched in parallel and memory capabilities are challenged. The EbE Jacobi iterations give better speed-up than conjugate gradients element by element (CGEbE), for which the incomplete Cholesky conjugate gradient (ICCG) is not possible because the matrix is never formed in EbE processing.
In our work we use the genetic algorithm, where the object function corresponding to every member of a population has to be computed many times to find the minimum. The many members form the genetic search space. Since each member consists of the dimensions and materials of a particular design being examined for its goodness [116], for those dimensions a mesh is constructed, the finite element problem solved and the object function evaluated. The object function itself is computed from a finite element solution involving a matrix equation. Thus we may treat the object function computation as a kernel and launch it on multiple threads, each for a different member of the population. Then, within that kernel, we can parallelize the matrix equation solution at a speedup which we shall refer to as S_P, which depends on problem size. Alternatively, we may do the object function evaluation for each member of the population in sequence, and in that process parallelize the matrix computations.

Let the population number be n. Say the object function evaluation for each member of the population takes time

t = t_m + t_0,

where t_m is the time for the matrix solution and t_0 is the time for other operations. Therefore, if we parallelize the operations for the different members of the population, the time for evaluating all the object functions corresponding to the entire population would still be, neglecting coordination time,

T = t_m + t_0,

since these are done simultaneously. Here we have assumed that the work for each member of the population is being done in parallel, and that the time for combining results and other communications is negligible. On the other hand, if we parallelize the matrix computation, the evaluation of the object functions has to be in sequence, since we cannot have forking from a parallelized kernel.
The total time would then be the number of members in the population multiplied by the time for computing the object function for each member,

T = n (t_m / S_P + t_0),

where S_P is the matrix equation solution speedup, t_m is the time for the matrix solution, n is the population number and t_0 is the time for other operations. A decision on which of the processes is to be parallelized would depend on considerations like this; however, we have not seen such considerations in the literature [12]. Further, in a recent development, CUDA Dynamic Parallelism has been made available on the SM 3.5 architecture GPU [72], with which we can parallelize both the genetic algorithm and the FE calculations.

Suggested future work includes: developing a GPU-based parameterized mesh generator for two and three dimensions; developing dynamic-parallelism-based optimization tools; and developing user-friendly, PC-based optimization tools for portable NDE use in the field.

APPENDICES

Appendix A: Publications Raised from This Research

Journals

1. S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, Ravi Thyagarajan, Lalita Udpa, Nondestructive Defect Characterization: Element-by-, Journal of Non-destructive Evaluation, vol. 34(9), 2015.

2. S. Ratnajeevan H. Hoole, Sivamayam Sivasuthan, Victor U. Karthik and Paul R. P. Hoole, -teaching Engineering Optimization, Electromagnetic Product Design and, Computer Applications in Engineering Education Journal (CAEE), vol. 23, pp. 374-382, 2015.

3. Victor U. Karthik, Sivamayam Sivasuthan, Arunasalam Rahunanthan, Ravi S. Thyagarajan, accurate, parallelized inversion for shape optimization in electroheat problems on a graphics processing unit (GPU) with the real-coded genetic algorithm, COMPEL: International Journal of Computations and Mathematics in Electrical, vol. 34, no. 1, pp. 344-356, Jan. 2015.

4. S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, Ravi Thyagarajan, Lalita Udpa, -based, Parameterized Mesh Generator for Machine, IETE Technical Review, Vol. 32(2), pp.
94-103, 2015.

5. S. Ratnajeevan H. Hoole, Victor U. Karthik, S. Sivasuthan, A. Rahunanthan, R. Thyagarajan, Non-destructive Evaluation: A Review in Magnetics, and Future Directions in GPU-based, Element-by-Element Coupled Optimization and, International Journal of Applied Electromagnetics and Mechanics, vol. 47(3), pp. 607-627, 2015.

6. S. Ratnajeevan H. Hoole, Sivamayam Sivasuthan, Victor U. Karthik, Arunasalam Rahunanthan, netic Device Optimization: The Forking of Already Parallelized Threads on Graphics Processing, Applied Computational Electromagnetics Society (ACES) Journal, vol. 29(9), pp. 677-684, 2014.

7. Sivamayam Sivasuthan, Victor U. Karthik, Arunasalam Rahunanthan, Ravi S. Thyagarajan, Revue roumaine des sciences techniques - Serie Electrotechnique et Energetique, in press, Vol. 60, No. 3, 2015.

Peer-reviewed Conference Papers

1. 3D Mesh Generator for Optimization in NDE and Shape Design on a GPU, International Review of Progress in Applied Computational Electromagnetics, 22-26 March 2015, Williamsburg, VA.

2. ISFEE Conference 2014 (accepted), November 28-29, 2014, Bucharest, Romania.

3. S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, Ravi Thyagarajan, Lalita Udpa, Conjugate, 16th Biennial IEEE Conference on Electromagnetic Field Computation (CEFC), May 25-28, 2014, Annecy, France.

4. S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, Ravi Thyagarajan, Lalita Udpa, -based, Parameterized Mesh Generator Library for Coupled Gradient Desi, 16th Biennial IEEE Conference on Electromagnetic Field Computation (CEFC), May 25-28, 2014, Annecy, France.

5. In proceedings of the 2014 American Society for Engineering Education NCS Conference, April 4-5, 2014, Oakland University, Rochester, MI.

6. S. Sivasuthan, Victor U. Karthik, Arunasalam Rahunanthan, Ravi S.
Thyagarajan, d in Optimization: The Forking of Already Parallelized Threads on Graphics Processing, The 30th International Review of Progress in Applied Computational Electromagnetics, March 23-27, 2014, Jacksonville, Florida.

7. S. Sivasu, Review of Progress in Quantitative Nondestructive Evaluation, edited by Dale E. Chimenti, Leonard J. Bond, and Donald O. Thompson, AIP Conference Proceedings 1581, 1967-1974, American Institute of Physics, Melville, NY.

8. the Genetic Algorithm on NVIDIA GPU Architecture for S, 40th Annual Review of Progress in Quantitative Nondestructive Evaluation, edited by Dale E. Chimenti, Leonard J. Bond, and Donald O. Thompson, AIP Conference Proceedings 1581, 1991-1998, American Institute of Physics, Melville, NY.

9. Method in Electrical Engineering Optimization: Parallelization on Graphics Processing, ICPAM-LAE satellite conference, Port Moresby, Papua New Guinea, 2013.

10. Victor U. Karthik, S. Sivasuthan, A. Rahunanthan, P. Jayakumar, R. Thyagarajan, S. Graphics Processing Unit for Ground Vehicle Hull Inspection, Ground Vehicle Systems Engineering and Technology Symposium, Troy, MI, August 2013.

11. Inverse Problem Symposium, East Lansing, 2012 (Digest).

Appendix B: Sample Input File 2D

#A set of points in 2D (* WITHOUT VARIABLE POINTS).
# Number of nodes is 9, number of variables is 5
9 5
#And here are the nine points.
1 0.0 0.0
2 10.0 0.0
3 20.0 0.0
4 10.0 10.0
5 0.0 10.0
6 2.0 2.0
7 4.0 2.0
8 4.0 4.0
9 2.0 4.0
# variable points
# number of points in first draw, then coordinates
10 20.0 1.85
11 18 1.90
12 15.5 2.10
13 13 2.40
14 10 3
#segments: 1st line -- number of segments; following lines are segments (each segment has two node numbers and a marker to identify the boundary elements).
#number of segments
15
#segments (two nodes) and a marker
1 3 10 1
2 4 5 2
3 6 7 -1
4 7 8 -1
5 1 2 -1
6 2 3 -1
7 5 1 -1
8 14 4 -1
9 10 11 -1
10 11 12 -1
11 12 13 -1
12 13 14 -1
13 2 14 -1
14 8 9 -1
15 9 6 -1
#segment markers used to identify the boundaries and set boundary
#conditions. Do not give 0 or 1 to a segment marker because these are
#already fixed as defaults; the 3rd column is the boundary condition value
2
#asdf
1 2 0
2 3 0
#regions: number of regions, x y coordinates of region, regional
#attribute (for the whole mesh), area constraint that will not be used.
#We can leave one region without any assignments (in that case we have
#to assign 0 0 0 0), but we can give properties to this region
2
1 1 0.5 1 0.1
2 12 3 2 0.9
#properties of regions: first the number of properties, then property values
2
1 1 1.32
2 1.90 9.312
#holes: number of holes, x y coordinates of the hole
1
1 3 3
#type
0
#measuring points
10
#point coordinates
1 1.1 1.10
2 2.1 1.19
3 3.1 1.18
4 3.3 1.17
5 3.6 1.16
6 3.9 1.15
7 4.1 1.14
8 4.4 1.13
9 4.9 1.12
10 5.1 1.1

Appendix C: Sample Input File 3D

#3D Mesh Input File
#Number of points <--> Number of variable points
36 22
#13 3 0 1
1 0 0 0
2 100 0 0
3 100 0 100
4 0 0 100
5 0 50 0
6 100 50 0
7 100 50 100
8 0 50 100
9 0 100 0
10 100 100 0
11 100 100 100
12 0 100 100
#coil
13 40.0 52.0 48.0
14 44.0 52.0 48.0
15 48.0 52.0 48.0
16 52.0 52.0 48.0
17 56.0 52.0 48.0
18 60.0 52.0 48.0
19 40.0 52.0 52.0
20 44.0 52.0 52.0
21 48.0 52.0 52.0
22 52.0 52.0 52.0
23 56.0 52.0 52.0
24 60.0 52.0 52.0
25 44.0 56.0 48.0
26 48.0 56.0 48.0
27 52.0 56.0 48.0
28 56.0 56.0 48.0
29 44.0 56.0 52.0
30 48.0 56.0 52.0
31 52.0 56.0 52.0
32 56.0 56.0 52.0
33 40.0 60.0 48.0
34 60.0 60.0 48.0
35 60.0 60.0 52.0
36 40.0 60.0 52.0
#variables
37 52.85316955 44 50.92705098
38 51.76335576 42 52.42705098
39 50 43 53
40 48.23664424 45 52.42705098
41 47.14683045 42 50.92705098
42 47.14683045 43 49.07294902
43 48.23664424 45 47.57294902
44 50 44 47
45 51.76335576 43 47.57294902
46 52.85316955 43 49.07294902
47 51.42658477 46 50.46352549
48 50 46 51.5
49 48.57341523 47 50.46352549
50 49.11832212 46 48.78647451
51 50.88167788 46.5 48.78647451
52 50 48 50
53 51.42658477 40 50.46352549
54 50 40 51.5
55 48.57341523 40 50.46352549
56 49.11832212 40 48.78647451
57 50.88167788 40 48.78647451
58 50 38 49
#number of faces, boundary markers
53 1
# no of polygons > no of holes, boundary marker
1 0 3
4 1 2 3 4
1 0 3
4 5 6 7 8
1 0 3
4 9 10 11 12
1 0 3
4 1 2 6 5
1 0 3
4 5 6 10 9
1 0 3
4 2 3 7 6
1 0 3
4 6 7 11 10
1 0 3
4 1 4 8 5
1 0 3
4 5 8 12 9
1 0 3
4 8 7 11 12
1 0 3
4 4 3 7 8
1 0 -1
12 13 14 25 26 15 16 27 28 17 18 34 33
1 0 -1
12 19 20 29 30 21 22 31 32 23 24 35 36
1 0 -1
4 13 19 36 33
1 0 -1
4 18 24 35 34
1 0 -1
4 33 34 35 36
1 0 -1
4 13 14 20 19
1 0 -1
4 14 20 29 25
1 0 -1
4 25 26 30 29
1 0 -1
4 15 21 30 26
1 0 -1
4 15 16 22 21
1 0 -1
4 16 22 31 27
1 0 -1
4 27 28 32 31
1 0 -1
4 17 23 32 28
#new
1 0 -1
4 17 18 24 23
1 0 -1
3 37 38 47
1 0 -1
4 38 39 48 47
1 0 -1
4 39 40 49 48
1 0 -1
3 40 41 49
1 0 -1
4 41 42 50 49
1 0 -1
3 42 43 50
1 0 -1
4 43 44 51 50
1 0 -1
3 44 45 51
1 0 -1
4 45 46 47 51
1 0 -1
3 46 37 47
1 0 -1
4 47 48 52 51
1 0 -1
3 48 49 52
1 0 -1
3 49 50 52
1 0 -1
3 50 51 52
1 0 -1
3 37 38 53
1 0 -1
4 38 39 54 53
1 0 -1
4 39 40 55 54
1 0 -1
3 40 41 55
1 0 -1
4 41 42 56 55
1 0 -1
3 42 43 56
1 0 -1
4 43 44 57 56
1 0 -1
3 44 45 57
1 0 -1
4 45 46 53 57
1 0 -1
3 46 37 53
1 0 -1
4 53 54 58 57
1 0 -1
3 54 55 58
1 0 -1
3 55 56 58
1 0 -1
3 56 57 58
#boundary conditions
1
#conditions
1 3 0
#regions
3
1 10 10 10 1 0.1
2 46 50 50 2 0.01
3 -1 -1 -1 3 1.2
#number of properties
2
1 1.90 2.20
2 2.21 3.30
3 3.21 4.30
#number of holes
0
#measuring points
5
1 52.85316955 51 50.92705098
2 51.76335576 51 52.42705098
3 50 51 53
4 48.23664424 51 52.42705098
5 47.14683045 51 50.92705098
#mesh area constraint
10

BIBLIOGRAPHY
[1] problems in electromagnetic NDE using finite element methods, Magnetics, vol. 34, no. 5, pp. 2924-2927, 1998.

[2] Computers & Mathematics with Applications, vol. 32, p. 133, 1996.

[3] V. U. Karthik, S. Sivasuthan, A. Rahunanthan, R. S. Thyagarajan, P. Jayakumar, L. Udpa, "Faster, More Accurate Parallelized Inversion for Shape Optimization in Electroheat Problems on a Graphics Processing Unit (GPU) with the Real-Coded Genetic Algorithm," COMPEL, vol. 34, no. 1, pp. 344-356, 2015.

[4] AIP Conf. Proc., pp. 1991-1998, 2014.

[5] G. F. Uler, O. A. Mohammed, IEEE Transactions on Magnetics, vol. 30, no. 6, pp. 4296-4298, 1994.

[6] eguide dev, IEEE Transactions on Magnetics, vol. 32, no. 3(2), pp. 1250-1253, 1996.

[7] S. Dong-Joon, C. Dong-Hyeok, C. Jang-Sung, J. Hyun-Kyo, and C. Tae-Kyoung, "Efficiency optimization of interior permanent magnet synchronous motor using genetic," IEEE Transactions on Magnetics, vol. 33, pp. 1880-1883, Mar. 1997.

[8] "Optimization of loss in orthogonal bend," Alexandria Engineering Journal, vol. 52, no. 3, pp. 525-530, 2013.

[9] IEEE Transactions on Magnetics, vol. 34, no. 5, 1998.

[10] H. Enomoto, K. Harada, Y., "Optimal design of linear," IEEE Trans. on Mag., vol. 34, no. 5, 1998.

[11] G. N. Vanderplaats, Numerical Optimization Techniques for Engineering Design: With Applications (McGraw-Hill Series in Mechanical Engineering). McGraw-Hill College, 1984.

[12] S. R. H. Hoole, S. Sivasuthan, V. U. Karthik, A. Rahunanthan, R. S. Thyagarajan, and P. Jayakumar, "Threads on Graphics," ACES Journal, vol. 29, no. 9, pp. 677-684, 2014.

[13] S. R. H. Hoole, Computer-aided Analysis and Design of Electromagnetic Devices. New York: Elsevier, 1989.

[14] Finite Element Analysis and Design of Metal Structures, pp. 182-205, 2014.

[15] M. V. K. Chari and P. P. Silvester, Finite Elements in Electrical and Magnetic Field Problems (Wiley Series in Numerical Methods in Engineering). New York: John Wiley & Sons Inc, 1980.
[16] -sign, Computer Methods in Applied Mechanics and Engineering, vol. 15, no. 3, pp. 277-308, 1978.

[17] "Optimal design of structures by generalized steepest," International Journal for Numerical Methods in Engineering, vol. 10, no. 4, pp. 747-766, 1976.

[18] O. Pironneau, Optimal Shape Design for Elliptic Systems. Berlin, Heidelberg: Springer Berlin Heidelberg, 1984.

[19] IEEE Transactions on Magnetics, vol. 30, no. 5, pp. 3455-3458, 1994.

[20] J. Haslinger and P. Neittaanmaki, Finite Element Approximation for Optimal Shape, Material and Topology Design, 2nd Edition. Wiley, 1996.

[21] IEEE Trans. on Magnetics, vol. 26, no. 5, pp. 2181-2183, 1990.

[22] Annual Transactions of IESL, pp. 141-149, 2003.

[23] S. Sivasuthan, V., "memory limitation in finite," in AIP Conf. Proc. 1581, pp. 1967-1974, 2014.

[24] S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, R. Thyagarajan, L. Udpa, "A Script-Based, Parameterized Finite Element Mesh for Design and NDE on a GPU," IETE Technical Review, vol. 32, pp. 94-103, Mar. 2015.

[25] J. R. Shewchuk, "Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulator," in Applied Computational Geometry: Towards Geometric Engineering, vol. 1148, pp. 203-222, 1996.

[26] J. R. Shewchuk, "Delaunay refinement algorithms," Computational Geometry, vol. 22, pp. 21-74, May 2002.

[27] J. Schöberl, "mesh generator based on abstract rules," Computing and Visualization in Science, vol. 1, pp. 41-52, July 1997.

[28] CGAL Editorial Board, 4.2, 2013.

[29] "-D finite element mesh generator with built-in pre- and post-," International Journal for Numerical Methods in Engineering, pp. 1-24, 2009.

[30] "unstructured mesh generator for shallow water," Ocean Dynamics, vol. 62, pp. 1503-1517, Nov. 2012.

[31] T. Chen, J. Johnson, and W. D. Robert, A Vector Level Control Function for Generalized Octree Mesh Generation. Vienna: Springer Vienna, 1995.

[32] "-erato," Materials Research Innovations, vol. 15, pp. s482-s486, Feb. 2011.

[33] "Processing Approach for FEM," IEEE Trans. on Mag.,
, vol. 48, pp. 399–402, Feb. 2012.
[34] I. Kiss, S. Gyimóthy, Z. Badics, and J. Pávó, "Parallel Realization of the Element-by-Element FEM Technique by CUDA," IEEE Transactions on Magnetics, vol. 48, no. 2, pp. 507–510, 2012.
[35] T. Okimura, T. Sasayama, N. Takahashi, and S. Ikuno, "Parallelization of Finite Element Analysis of Nonlinear Magnetic Fields Using GPU," IEEE Transactions on Magnetics, vol. 49, pp. 1557–1560, May 2013.
[36] "… graphical processing units …," IEEE Transactions on Magnetics, vol. 46, pp. 2303–2306, 2010.
[37] W. Khamsen, A. Aurasopon, and W. Sa-…, "Voltage Harmonics Reduction in Pulse Width Modulation AC Chopper Using Bee Colony …," IETE Technical Review, vol. 30, no. 3, p. 173, 2013.
[38] …
[39] J. Mu…, 1996.
[40] "… technique for finite element magnetic field computation and its application to optimal designs of …," IEEE Transactions on Magnetics, vol. 47, pp. 2943–2946, 2011.
[41] "10 … 2D …," http://www.jaewoo.com/material/magazinefolder/jwnews/0710/10_New_Features.pdf [online; accessed 2013-08-01].
[42] "… ting Magnetic Field Eddy-…," IEEE Transactions on Magnetics, vol. 47, pp. 1070–1073, May 2011.
[43] http://forge-mage.g2elab.grenoble-inp.fr/project/got [online; accessed 2014-07-26].
[44] "… tiple …," in 14th International Conference on Computing in Civil and Building Engineering, pp. 1–8, Publishing House ASV, 2012.
[45] "… fictitious minima of object functions," IEEE Transactions on Magnetics, vol. 27, no. 6, pp. 5214–5216, 1991.
[46] "…," …-Based Design Optimization, vol. 34, pp. 157–180, July 2006.
[47] C. Talischi, G. H. Paulino, A. Pereira, and I. F. M. Menezes, "… implementation of a general topology optimization framework using unstructured polygonal finite element meshes," Structural and Multidisciplinary Optimization, vol. 45, pp. 329–357, Jan. 2012.
[48] "…," ACM Transactions on Mathematical Software, vol. 41, pp. 1–36, Feb. 2015.
[49] H. Si, "TetGen: A quality tetrahedral mesh generator and three-dimensional Delaunay triangulator," http://tetgen.berlios.de, 2006.
[50] "… -Dimensional Improved-Quality Triangulations Using Local …," SIAM Journal on Scientific Computing, vol. 16, pp. 1292–1307, Nov.
1995.
[51] D. J. Mavriplis, Journal of Computational Physics, vol. 117, pp. 90–101, Mar. 1995.
[52] Proceedings of the VKI Lecture Series on Computational Fluid Dynamics, pp. 2000–2004, 2000.
[53] D. T. Lee and B. J. Schachter, "Two algorithms for constructing a Delaunay triangulation," International Journal of Computer & Information Sciences, vol. 9, pp. 219–242, 1980.
[54] S. Fortune, "A sweepline algorithm for Voronoi diagrams," Algorithmica, vol. 2, pp. 153–174, 1987.
[55] C. L. Lawson, "Software for C1 Surface Interpolation," in Mathematical Software III, Academic Press, 1977.
[56] "A comparison of sequential Delaunay triangulation algorithms," Computational Geometry, vol. 7, pp. 361–385, Apr. 1997.
[57] J. Ruppert, "A Delaunay Refinement Algorithm for Quality 2-Dimensional Mesh Generation," Journal of Algorithms, vol. 18, pp. 548–585, May 1995.
[58] http://www.cedrat.com/en/software/got-it.html [online; accessed 2014-04-21].
[59] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. The MIT Press, 2011.
[60] http://webdocs.cs.ualberta.ca/~holte/T26/merge-sort.html [online; accessed 2015-02-11].
[61] D. Robilliard, V. Marion-Poty, and C. Fonlupt, "Genetic programming on graphics processing units," Genetic Programming and Evolvable Machines, vol. 10, pp. 447–471, Oct. 2009.
[62] "… Genetic Algorithms on Graphics …," Intelligent and Evolutionary Systems, vol. 187, pp. 197–216, 2009.
[63] S. Sivasuthan, V. U. Karthik, A. Rahunanthan, P. Jayakumar, R. S. Thyagarajan, and L. Udpa, "… Speed Problems in Nondestructive Defect Characterization: Element-by-Element Processing on a GPU," … Evaluation, vol. 34, no. 2, 2015.
[64] IEEE Transactions on Magnetics, vol. 27, no. 5, pp. 4146–4149, 1991.
[65] Journal of Applied Physics, vol. 67, no. 9, p. 5818, 1990.
[66] C. Cecka, A. J. Lew, and E. Darve, "Assembly of finite element methods on graphics processors," International Journal for Numerical Methods in Engineering, vol. 85, pp. 640–669, Feb. 2011.
[67] http://www.nvidia.com/object/gpu.html [online; accessed 2014-07-26].
[68] http://www.nvidia.com/object/tesla-servers.html [online; accessed 2015-07-26].
[69] "CUDA C Programming Guide," http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz3gSPnvaGC [online; accessed 2013-08-01].
[70] "CUDA Applications for …," http://www.nvidia.com/object/cudaget.html [online; accessed 2014-07-26].
[71] Computer Animation and Virtual Worlds, vol. 15, pp. 219–227, 2004.
[72] "CUDA Toolkit Release Notes," http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html [online; accessed 2013-08-01].
[73] "Algorithm 586: ITPACK 2C: A FORTRAN Package for Solving Large Sparse Linear Systems by Adaptive Accelerated Iterative Methods," ACM Transactions on Mathematical Software, vol. 8, pp. 302–322, Sept. 1982.
[74] T. J. R. Hughes, I. Levit, and J. Winget, "An element-by-element solution algorithm for problems of structural and solid mechanics," Computer Methods in Applied Mechanics and Engineering, vol. 36, pp. 241–254, Feb. 1983.
[75] Y. Saad, Iterative Methods for Sparse Linear Systems, vol. 3, 2003.
[76] G. F. Carey, E. Barragy, R. McLay, and M. Sharma, "Element-by-element vector and parallel computations," Communications in Applied Numerical Methods, vol. 4, pp. 299–307, May 1988.
[77] C. Heusser, "Conjugate gradient-type algorithms for a finite-element discretization of the …," Journal of Computational and Applied Mathematics, vol. 39, no. 1, pp. 23–37, Feb. 1992.
[78] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, 1994.
[79] "… element-by-element preconditioned conjugate …," Parallel Computing, vol. 17, pp. 1051–1065, Nov. 1991.
[80] "… analysis of some element-by-element …," Computer Methods in Applied Mechanics and Engineering, vol. 74, pp. 271–287, Sept. 1989.
[81] SIAM Journal on Numerical Analysis, vol. 21, pp. 352–362, 1984.
[82] "A non-self-adjoint generalization of the conjugate gradient method …," USSR Computational Mathematics and Mathematical Physics, vol. 23, pp. 143–144, 1983.
[83] "… element-by-element BICGSTAB iterative method for three-dimensional steady Navier–Stokes equations," … Mathematics, vol. 79, pp. 147–165, Mar. 1997.
[84] "Application of an element-by-element BiCGSTAB iterative solver …," Computers & Mathematics with Applications, vol. 37, pp. 57–70, Feb. 1999.
[85] P. Sonneveld, "CGS, a Fast Lanczos-Type Solver for Nonsymmetric Linear Systems," SIAM Journal on Scientific and Statistical Computing, vol. 10, no. 1, pp. 36–52, 1989.
[86] H. A. van der Vorst, "Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems," SIAM Journal on Scientific and Statistical Computing, vol. 13, no. 2, pp. 631–644, 1992.
[87] H. A. van der Vorst, Iterative Krylov Methods for Large Linear Systems, vol. 13. Cambridge: Cambridge University Press, 2003.
[88] A. F. P. de Camargos, V. C. Silva, J.-…, "… Preconditioned Conjugate Gradient Solver on GPU for FE Modeling of Electromagnetic …," IEEE Transactions on Magnetics, vol. 50, pp. 569–572, Feb. 2014.
[89] Journal of Computational and Applied Mathematics, vol. 236, pp. 3584–3590, 2012.
[90] M. Ament, G. Knittel, D. Weiskopf, and W. Straßer, "A parallel preconditioned conjugate gradient solver for the Poisson problem on a multi-GPU platform," in Proc. of the 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 583–592, 2010.
[91] "… Conjugate Gradient Stabilized solver …," Journal of Computers (Finland), vol. 7, no. 12, pp. 3088–3095, 2012.
[92] "Multi-GPU Implementations of Krylov-Subspace Methods Applied to FEA of Electromagnetic …," IEEE Transactions on Magnetics, vol. 51, no. 3, pp. 1–4, 2015.
[93] C. A. J. Fletcher, Computational Techniques for Fluid Dynamics 1. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998.
[94] J. Haslinger and P. Neittaanmäki, Finite Element Approximation for Optimal Shape, Material and Topology Design, 2nd Edition. Wiley, 1997.
[95] K. Preis, C. Magele, and …, IEEE Transactions on Magnetics, vol. 26, no. 5, pp. 2181–2183, 1990.
[96] Journal of the American Statistical Association, vol. 95, p. 347, Mar. 2000.
[97] J. H. Holland, "Genetic Algorithms and the Optimal Allocation of Trials," SIAM Journal on Computing, vol. 2, no. 2, pp. 88–105, 1973.
[98] …
[99] …
[100] "…," in DoD Corrosion Conference, pp. 1–11, 2009.
[101] "… ructing and Classifying Damage in a 2D Steel Plate Using Non-…," in ASEE-NCS (Oakland), ASEE, 2014.
[102] V. U. Karthik, Shape Optimization Using Finite Element Analysis in Eddy Current Testing and Electro-thermal Coupled Problems. PhD Thesis, Michigan State University, 2015.
[103] L. Vogt, R. Olivares-Amaya, S. Kermes, Y. Shao, C. Amador-Bedolla, and A. Aspuru-Guzik, "Accelerating Resolution-of-the-Identity Second-Order Møller–Plesset Quantum Chemistry Calculations with Graphical Processing Units," Journal of Physical Chemistry A, vol. 112, no. 10, pp. 2049–2057, 2008.
[104] Journal of Computational Physics, vol. 227, pp. 5342–5359, May 2008.
[105] Parallel Computing, vol. 35, no. 3, pp. 164–177, 2009.
[106] E. Elsen, P. LeGresley, and E. Darve, "Large calculation of the flow over a hypersonic vehicle using a GPU," Journal of Computational Physics, vol. 227, pp. 10148–10161, Dec. 2008.
[107] "… ified parallel finite element Navier–Stokes …," in 2009 International Conference on High Performance Computing & Simulation, pp. 12–21, IEEE, June 2009.
[108] S. Sivasuthan, V. U. Karthik, R. Arunasalam, R. S. Thyagarajan, P. Jayakumar, and S. R. H. Hoole, "… Computations for Finite Element Optimization: Some Issues to be …," Revue Roumaine des Sciences Techniques, Série Électrotechnique et Énergétique, vol. 60, no. 3, 2015 (accepted, in press).
[109] "… ity and increased intra-node parallelism for solving finite element models on GPUs by novel element-by-element …," in 2012 IEEE Conference on High Performance Extreme Computing, HPEC 2012.
[110] "… mance of the …," Challenges for the Knowledge Society, pp. 2036–2045, 2012.
[111] "… _2010/CUDA_Tutorial/SC10_Accelerating_GPU_Computation_Through_Mixed-Precision_Methods.pdf" [online; accessed 2014-07-26].
[112] D. Yablonski, Numerical Accuracy Differences in CPU and GPGPU Codes. MS Thesis, Northeastern University, 2011.
[113] "…," in Symposium on Application Accelerators in High Performance Computing (Urbana-Champaign, IL), 2009.
[114] www.egr.msu.edu/~hoole/FE2DMesh [online; accessed 2014-05-25].
[115] www.egr.msu.edu/~hoole/FE3DMesh [online; accessed 2015-07-26].
[116] E. Laithwaite, Electronics and Power, vol. 11, no. 3, pp. 101–103, 1965.