PARALLEL ROOTFINDING ALGORITHMS

Dissertation for the Degree of Ph. D.
MICHIGAN STATE UNIVERSITY
GEORGE FREDERICK CORLISS
1974

This is to certify that the thesis entitled PARALLEL ROOTFINDING ALGORITHMS presented by George Frederick Corliss has been accepted towards fulfillment of the requirements for the Ph.D. degree in Mathematics.

Gerald D. Taylor, Major professor

ABSTRACT

PARALLEL ROOTFINDING ALGORITHMS

By George Frederick Corliss

Algorithms to find a simple root α of a continuous function f are considered which are suitable for implementation on an array type of parallel processing computer using N processing elements. The algorithms are compared on the basis of their efficiency

    EFF = (log2 order of convergence) / (number of parallel arithmetic operations per iteration).

The first class of algorithms to be considered uses inverse Lagrange interpolation on k points. If f ∈ C^(k+1) on an interval containing α, these methods have order of convergence k. In contrast, the order of convergence for sequential methods using Lagrange interpolation on several previous estimates for α is always less than 2. The parallel methods achieve a higher order of convergence because they use only the most recent estimates for α. Depending on the number of arithmetic operations required to evaluate f, k can be chosen to maximize the efficiency of methods in this class. If the cost of evaluating f is very high, the speed-up possible by using parallel rootfinding algorithms based on Lagrange interpolation over efficient sequential rootfinding algorithms is proportional to log2 N.

A parallel secant rootfinding algorithm is considered as an example of a method using Lagrange interpolation. If it is implemented on a parallel processing system with three or more processing elements, it achieves order of convergence 2 with one parallel evaluation of f and five additional arithmetic operations. Hence it is more efficient than the usual Newton-Raphson method whenever more than three arithmetic operations are required to evaluate f'.

The second class of parallel algorithms considered uses inverse Hermite interpolation of f, f', ..., f^(r) at k points. The order of convergence is equal to the amount of data used, k(r+1). If r is fixed, the fastest algorithm in this class which can be effectively implemented on a parallel processing system with N processing elements achieves a speed-up ratio proportional to log2 N for functions which are very costly to evaluate. If it is less costly to evaluate f than to evaluate any derivative of f, a method using Lagrange interpolation (r = 0) on k points is more efficient than any other method using Hermite interpolation on k points.

The third class of parallel algorithms contains derivative-estimated methods. If the values of f^(r) used in a parallel Hermite interpolation method are estimated using values of f, f', ..., f^(r-1) at enough points, the order of convergence of the Hermite interpolation method is not reduced. However, except in rare examples, a parallel derivative-estimated method is less efficient than the Hermite interpolation method on which it is based.
If the answer to a rootfinding problem is desired as quickly as possible, as in real-time computer applications, these parallel algorithms should be used. In general, a parallel method using inverse Lagrange interpolation on k points is fastest, where k depends on the function. If the root is to be computed at low cost, the task should be executed on a sequential machine, because the speed-up ratio proportional to log2 N achieved by these parallel rootfinding algorithms shows that they do not make efficient use of the parallel capabilities of the system.

PARALLEL ROOTFINDING ALGORITHMS

By George Frederick Corliss

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Mathematics
1974

ACKNOWLEDGMENTS

I wish to express my sincere appreciation to Professor Gerald D. Taylor for his leadership and guidance. His suggestions and assistance have been invaluable.

TABLE OF CONTENTS

Chapter                                                                    Page

I    INTRODUCTION TO PARALLEL PROCESSING AND ROOTFINDING
     1.1   Motivation for Parallel Processor Design
           1.1.1  High Speed Computation
           1.1.2  Very Large Problems
           1.1.3  Inherent Parallelism in Existing Problems
           1.1.4  Enormous Quantities of Data Requiring Real-Time Processing
           1.1.5  Multiuser Systems
           1.1.6  Reliability and Graceful Degradation
     1.2   Parallel Machine Organization
     1.3   Problems Encountered when Implementing Parallel Algorithms
     1.4   Examples of Parallel Processing Algorithms
     1.5   Speed-Up Ratio
     1.6   Introduction to Iterative Rootfinding
     1.7   Definition of Iterative Methods
     1.8   Order of Convergence
     1.9   Examples of Rootfinding Iterations
           1.9.1  Newton-Raphson Method
           1.9.2  Generalized Newton-Raphson Method
           1.9.3  Multipoint Method One
           1.9.4  Multipoint Method Two
           1.9.5  Lagrange Interpolation Methods
           1.9.6  Parallel Secant Method
     1.10  Results on Optimal Order
     1.11  Definitions of Computational Efficiency
     1.12  Results on Optimal Efficiency
     1.13  Parallel Rootfinding

II   ROOTFINDING BY LAGRANGE INTERPOLATION                                   43
     2.1   Rice's Results                                                    43
     2.2   Parallel Secant Method                                            46
     2.3   Order of Convergence                                              47
     2.4   Efficiency                                                        55

III  ROOTFINDING BY HERMITE INTERPOLATION                                    63
     3.1   Order of Convergence                                              63
     3.2   Example of an Hermite Interpolation Method                        67
     3.3   Efficiency                                                        69
     3.4   Comparison of Hermite and Lagrange Methods                        77

IV   ROOTFINDING BY DERIVATIVE-ESTIMATED METHODS                             80
     4.1   An Example of a Derivative-Estimated Method                       81
     4.2   Order of Convergence                                              85
     4.3   Efficiency                                                        97

V    WHICH METHOD TO USE                                                    100
     5.1   Real-Time Problems                                               100
     5.2   Problems to be Solved at Low Cost                                102

BIBLIOGRAPHY                                                                103

LIST OF TABLES

Table                                                                      Page
2-1                                                                          59

LIST OF FIGURES

Figure                                                                     Page
1-1  Functional Block Diagram for ILLIAC IV
1-2  Addition with Six Sub-operations                                        10
1-3  Pipeline Addition of N-dimensional Vectors A + B                        11
1-4  Addition of Two 64-dimensional Vectors on ILLIAC IV                     14
1-5  1 + 2 + ... + 64 on ILLIAC IV                                           15
1-6  Carrol and Wetherford's Variation of Gauss-Seidel                       17
1-7  Stone's Variation of Gauss-Seidel                                       17
1-8  Flow Chart for One Iteration of the Parallel Secant Method              19
1-9  Speed-Up Ratio                                                          21
2-1  Flow Chart for PE k                                                     45
2-2  Parallel Secant Method                                                  47
2-3  Pk(f)                                                                   59
3-1  Inverse Hermite Interpolation by One PE                                 68
Paper wi which us 1'1 Mot CHAPTER I INTRODUCTION TO PARALLEL PROCESSING AND ROOTFINDING The computer industry is currently searching for new hard- ware configurations which will attain greater Speeds with a higher reliability and more flexibility to meet the varying needs of many users. Parallel processing is one very promising approach being considered. The intuitive motivation for research into parallel processing is that if one machine can do a certain job in ten minutes, then perhaps ten machines could be coupled together in some way to do the same job in one minute. In addition to the design and construction of such machines, algorithms must be developed which effectively utilize the parallel capabilities of the machine. This paper will be devoted to the study of algorithms for rootfinding which use the parallel machine structure. 1.1 Motivation for Parallel Processor Design There are several reasons why parallel processor computer systems may have a place in future computer designs (see [14], pp. 4—7). Some areas of possible benefit which we will discuss in more detail are: 1) High Speed computation 2) Very large problems 3) Inherent parallelism in existing problems 4) the use of Ll.l Hig Ci Hat very expensive which ele micro-min HUSt tra\ Small ti: being ex C3mputar N'fold r COElputat head tin N dirt. among tl machine StruCti is COns Sthti 4) Enormous quantities of data requiring real-time pro- cessing 5) Multiuser systems 6) Reliability and graceful degradation. We now discuss each of these areas of possible benefit from the use of parallel processing computers. 1.1.1 High Speed Computation Conventional computer hardware design has reached the point that very small improvements in the speed of the machine are very expensive. The Speed of the machine is limited by the Speed at which electrical pulses travel through the machine. Advances in micro-miniature circuitry have cut down the distances the pulses must travel, and hence the computation times, but the cost of even Small time savings is prohibitive, so other avenues of design are being explored. Parallel processing is one such design alternative. In a parallel processing system, N different parts of a computation may proceed simultaneously, so it is possible that an N-fold reduction in the time needed to perform certain classes of computations could be realized. One would expect the program over- head time to be greater on a parallel processing machine because N different processors must be coordinated and the work shared among them. At the same time, it may be possible to construct a machine with N processing elements at a cost below that of con- structing N separate machines. One design in which this is done is considered in Section 1.2. If the savings possible in the con- struction of the parallel processing machine is larger than the increased cost of the overhead in program execution, then the N- fold SP€€d' on existing premise tha :ost than 1' time turrer 1.1.2 Vet"; $01 so much Cot anlj; at gr: 3 gléhal w. dinanics p and other of data p0 aChiEVe a £935 1b 19 t 1'1-3 Int A configure! be PGrfon an a Para example, althoUgh ferent f0 9119!} thOu fold Speed-up can be achieved at a lower total cost than execution on existing machines. That is, parallel processing designs hold the promise that it may be possible to execute some problems at a lower cost than is currently possible and to execute them in 1/N of the time currently required. 
1.1.2 Very Large Problems Some problems of current interest require so much memory or so much computation that they can be solved on conventional machines only at great expense. Some examples are the problem of constructing a global weather model, some nuclear physics problems, large hydro- dynamics problems, phased array radar air defense systems (see [1]), and other similar problems requiring computations on a large number of data points. If a parallel processing system can be built to achieve a high speed-up at a relatively low cost, it will become feasible to solve problems too large for existing computer systems. 1.1.3 Inherent Parallelism in Existing Problems A large class of problems are well suited to a parallel configuration by their structure. If the same computations are to be performed on many independent sets of data, the job may be executed on a parallel processing system in a highly efficient manner. For example, a payroll job must do the same computations for each employee, eilthough the rate of pay, hours worked, withholding, etc., are dif- feremt for each employee. Tasks of this sort which are essentially l l 'vector operations may be economical to run on a parallel machine even though they do not require a very high speed operation. 1.1 whi an ar fe pr US de Vi ti ar it Ca f) (D 1.1.4 Enormous Quantities of Data Requiring Real-Time Processing Historically, each time a new computer has been introduced which was Significantly faster or had more memory than its pre- decessors, it became feasible to solve problems which had been too complex or too large for smaller, slower machines to handle. Sim— ilarly, the design and construction of a parallel processing system would enable problems which require vast amounts of data to be read and analyzed fast enough to reSpond in real time. Two Such problems are the control of a large phased array radar system for urban de- fense (see [1], [12], and [8]), or Speech and visual perception problems in artificial intelligence (see [31]). A system to control a large air defense radar network must use the information from several different radar installations to determine the nature, position, and motion of each airborn object within the range of the network. Then appropriate radar installa- tions are directed to conduct a further study of each "target", while continuing a general search by the entire network for other targets. Targets whose identity is known must be distinguished from incoming enemy aircraft and from decoys deployed by the attack- ing enemy to fool the air defense system. Then the appropriate air defense mechanisms must be alerted. All of this data must be analyzed and action taken before the attacking aircraft arrive at their targets. In [1], Knapp, Ackins, and Thomas describe a system (Eipable of tracking 109 targets simultaneously. This requires thee controlling computer system to issue about 107 commands per seczond to the radar units. Currently existing computers cannot pro- cesss that much data fast enough to reSpond in the time required. €132 t O CES IEC O:56: In the second examp}e, human Speech and vision require the processing of a large amount of data at a very high rate. For example, when a person is Speaking, his listener must detect the sound of his voice, interpret his Speech, and formulate a response, all during the time the Speaker is talking. Then the listener is able to reSpond so that a conversation is possible. 
If a robot is to recognize human Speech patterns, it must be able to process one second of Speech in one second. Reddy [31] estimates that a robot must be able to process about 2 x 105 bits per second of informa- tion to detect, analyze, and recognize Speech patterns. If a robot is mobile, it must be able to "see" obstacles in its path in time to avoid them. Reddy estimated that the robot must be able to pro- cess approximately 108 bits of data per second for this visual recognition task. Conventional computers are not capable of processing data at the rate required by either of the two examples given above, while a large parallel processing system could be designed to meet these demands for Speed and capacity. 1.1.5 Multiuser Systems It is becoming increasingly common for one computer system to serve many users. Various time-sharing schemes allow work to prcmeed on different programs at once. A parallel processing computer rmay be able to perform a similar function by allocating each user to éin idle processor without the expensive overhead of the current opelrating systems. 1.1.6 Rel of the mac Reliabilit ponent or go gracefi hughly ll In a conv. forces th. of the Sy failures cessing e aSSume th cessing e my 311g he use c icant dIC That is, gradual r reliabili ”Went“ Systems l Process c 1'2 Para 1111966, ] 1.1.6 Reliability and Graceful Degradation A parallel processor system may be able to withstand failures of the machine's components without a major interruption of service Reliability is a measure of the probability of the failure of a com- ponent or of the entire machine. A computer system is said to under- go graceful degradation if the Operation of the machine deteriorates roughly in proportion to the percentage of malfunctioning components. ; In a conventional computer system, the failure of a major component forces the entire system out of operation. If the major functions of the system are decentralized in a parallel configuration, some failures need not be fatal. For example, if one of the many pro- cessing elements failed, the remaining processors could automatically assume the tasks of the disabled one. Hence the failure of a pro- cessing element which would be fatal to a conventional machine would only slightly lower the efficiency of a parallel processing system. The use of several processing elements could be lost before a signif- icant drop in the overall performance of the machine would occur. That is, the performance of a parallel system would deteriorate in a gradual way. This graceful degradation would make the overall reliability of such a System much better than the reliability of conventional machines. The high reliability makes parallel processor systems particularly important in such real-time applications as process control or military surveillance. 1.2 Parallel Machine Organization There are many different alternative designs for parallel processing computers to fulfill the needs discussed in Section 1.1. In 1966, Flynn [11] and [10] described four classes of machine arganizat different fmrclas sider the stream-sf ized by ( Each ins: carried ( mta stn proceed . inpracti essary t instruct in this is carri designs diSCussE are banl Berk 3n gram, 31 n bar . 8y 'stem, < organization. Parallel processing systems were assigned to two different classes. We will give a brief description of each of the four classes of machine organization considered by Flynn, and con- sider the basic design of two important parallel processing systems. 
Flynn classified conventional computers as single instruction stream-single data stream (SISD) machines since they are character- ized by one sequence of instructions acting on one set of data. Each instruction is executed in sequence and the instruction is carried out on only one set of data. Some early computers were multiple instruction stream-single data stream (MISD) machines. Several sets of instructions could proceed simultaneously on one set of data. MISD machines were impractical because of the hardware and programming overhead nec- essary to avoid memory conflicts. A large class of parallel processing machines are single instruction stream-multiple data stream (SIMD) machines. Machines in this class execute instructions in sequence, but each instruction is carried out on several different sets of data. Two important designs for SIMD machines, the ILLIAC IV and the CDC STAR, will be discussed later in this section. Flynn's fourth class of machine is a multiple instruction stream-multiple data stream (MIMD) machine. These multi-processors sire banks of autonomous machines. Each machine can be directed to twork.on a different program, or on different parts of the same pro- gram, all under the overall control of a central control unit. CarTuegie-Mellon University is designing a multi-mini-processor SYStefln, C.mmp [42]. A MIMD machine is well suited to multi-user I 2‘ ‘ h“ '. svstems, blocks uh for use c such mach designed for the h heffet Fi instructi cessing e unit wit} MOI}! or proceeds executes Data nay bEWEen I to act or is an ex; Suited t: DaDEnber! of rum systems, or to large problems where the program can be Split up into blocks which do not depend on data from the other blocks. The algorithms for rootfinding in this paper are intended for use on a SIMD machine, so we will consider the design of two such machines. The first is ILLIAC IV, an array processing system designed by the University of Illinois and Burroughs Corporation for the National Aeronautics and Space Administration Laboratory at Moffet Field, California. ILLIAC IV is composed of one control unit (CU) which generates instruction signals to control the operation of 64 identical pro- cessing elements (PE). Each PE consists of an arithmetic and logic unit with its own memory. Each PE processes the data in its own memory or data which is passed to it through the CU. Computation proceeds according to the commands generated by the CU, so each PE executes the same instruction stream unless it has been turned off. Data may be passed between the CU and the PE's, or it may be passed between PE'S. This configuration allows one sequence of instructions to act on up to 64 different sets of data simultaneously, so ILLIAC IV is an example of a SIMD machine. This machine organization is well suited to a large class of vector or array type problems. In [9], .Danenberg gives a detailed description of the design and organization ()f ILLIAC IV (see also [22] or [31]). J lProcess ll l tilenent A Ciently 0' or final paints, f 54 proces used. p0 Store and “'3 Stor cess“,g ( cessed if the cmp,‘ Such as 1 the Whip So that . is a Pipf can “he: Control Unit Memory l Processing 2 E 64 Element 1 PE ' ' ° P Memory 1 M 2 M 64 Figure 1-1. Functional Block Diagram for ILLIAC IV H-r-eu-‘Ir 1 An array processing system like ILLIAC IV operates effi— ciently only when the size of the data vector is slightly less than or equal to the size of the machine. 
63, 64, 190, or 192 data points, for instance, are well suited to processing on ILLIAC IV'S 64 processing elements since all, or nearly all, of the PE'S are used. For example, if a l90—dimensional vector is used, 62 PE's store and operate on 3 components each, while the two remaining PE's Store and operate on 2 components each. By contrast, pro- cessing 65-dimensional vectors requires that 64 components be pro- cessed first. Then all but one of the PE'S must be disabled while the computations proceed on the "extra" component. In many problems such as the numerical solution to partial differential equations, the computations can be structured to use a multiple of 64 points so tfimt ILLIAC IV'S 64 PE's can be used efficiently. The second example of a SIMD machine which we will consider is aa pipeline processor, the CDC STAR. While array processing Syst:ems like ILLIAC IV are based on the notion that N machines can ssometimes do the work of one machine in l/N of the time, 10 pipeline computers like STAR gain Speed by performing different parts of the same operation simultaneously (see [9] or [13]). For example, suppose the operation of addition is broken into these six sub-operations: 1) fetch operands from memory, 2) adjust exponents to align the decimal points, 3) add mantissas, 4) normalize, 5) round off the result, and 6) store the sum in memory. These six sub-operations are illustrated by a simple example in Figure 1-2. If this addition operation is implemented on a pipeline computer, the sum of two numbers is being stored in memory at the same time that the sum of the next two numbers is being rounded off, etc. Figure 1-3 shows how addition of two N- dimensional vectors would proceed in such a machine. Sub-operations l) Fetch operands A = 0.234 X 103 B = 0.567 x 101 2) Adjust exponents A = 23.4 X 101 B = 0.567 X 101 3) Add mantissas A + B = 23.967 X 101 4) Normalize A + B = 0.23967 X 103 5) Round off A + B = 0.240 X 103 6) Store the sum A + B = 0.240 X 103 Figure 1—2. Addition with Six Sub-operations Step N+1 N+5 Fizure 1 Pipeline PiPeline “hich are illuStrat data. 1“lath. are Separ Pipeline . flawed to Became oi 11 Sub-operation 1 2 . . . 6 1 (A1, B1) 6 (A6. B6) (A5, B5) . . . A1 + B1 Step N (AN’ 3N) (AN-1’ BN_1) . . . AN_5 + BN_S n+1 (15,, BN) . . . AN-4 + BN_4 N+5 AN + BN Figure 1-3. Pipeline Addition of N-dimensional Vectors A + B We have seen that a single sequence of instructions in a pipeline machine acts on several different sets of data. Hence the pipeline processor is an example of a SIMD machine. Computations which are much more involved than the operation of addition illustrated above can be implemented on a pipeline computer if the same sequence of computations must be repeated on different sets of data. A pipeline computer can operate on a vector of arbitrary length. Figure 1-3 shows that if the computations being programmed rare separated into six parts, five steps are required to fill the Fitmeline when starting the computations. Similarly, five steps are nee<fled to complete processing of the last component of the vector. Becatuse of the time Spent filling the pipeline at the beginning and time [28‘ \‘EC at V: (J. 12 emptying it at the conclusion of an operation, the efficiency of the machine decreases if it is necessary to empty and refill the pipeline often. In actual use, the start-up time is typically the time needed to generate 30 - 60 results once the pipeline is full [28], so longer vectors use the machine more efficiently. 
The vector length is limited to 64,000 by the temporary memory available on the CDC STAR, and lOOO—dimensional vectors are typical of efficient machine utilization [28]. There is some overlap among the classes of machines dis— cussed in this section. For example, the array processing system ILLIAC IV uses pipelined instructions and memory access queues to speed execution [9]. As originally designed, the ILLIAC IV system was to be a MIMD machine composed of four autonomous SIMD machines. Each of four control units was to drive 64 processing elements, but only one quadrant of the proposed machine was built. 1.3 Problems Encountered when Implementing Parallel Algorithms After a parallel processing system has been designed, there are at least four major problem areas which must be studied before the system can be used efficiently. The problems discussed in [37] by Stone are: l. The arrangement of data in memory is essential for efficient parallel processing, although it is only a minor concern for the con- ventional sequential machines now in use. 2. An efficient sequential algorithm does not necessarily lead to an efficient parallel algorithm. Hence research is necessary to develop efficient parallel algorithms. 1' L‘Iu 1‘ wine 4. T etc., the c a run influe roatf: In [if for ii of the Stone doubli PYOble transf into a Value, velopel effiCie Probabl 13 3. Many algorithms are inherently sequential, but in some cases the sequential dependence can be relaxed. 4. The behavior such as the order of convergence, round-off error, etc., of a parallel algorithm may be quite different from that of the corresponding sequential algorithm. In this paper we will consider algorithms designed to find a root of a function. We will see how each of these problem areas influences the construction, the behavior, and the choice of efficient rootfinding algorithms. 1.4 Examples of Parallel Processing Algorithms There are many algorithms known which are parallel in nature. In [15], Karp and Miranker describe a parallel searching algorithm for finding roots of a function. Pease [29] gives a modification of the fast Fourier transform which is suited to parallel processing. Stone [35] and Kogge and Stone [16] derive a method called recursive doubling which allows them to solve a large class of recurrence problems using parallel processing. The method of recursive doubling transforms a problem which seems to be inherently sequential in nature into a problem which can be solved in parallel. Miranker [24] gives a survey of many other parallel algorithms. The pull-back interpolation technique for solving initial value, boundary value, and time delay differential equations de— veloped by Rogers [34] incorporates several features which could be efficiently implemented on a parallel processing computer. We now consider some parallel algorithms in more detail. Prtflmably the simplest use of parallel processing is the addition I‘Y‘ Fi Iii 0f 58( adt 14 of two N-dimensional vectors, A + B. In a pipeline processor like the CDC STAR, addition is broken down into several Sub-operations. Figure 1-3 illustrates pipeline addition with six sub-operations. The summands move through the pipeline of sub-operations at the Speed of the slowest sub-operation. In an array processing system like ILLIAC IV, the i th pro- cessing element, PE i, performs A1 + Bi° Figure 1-4 illustrates the addition of two 64-dimensional vectors. If N < 64 some of the PE's are turned off while the rest execute the addition. 
If N > 64, some PE performs the addition of more than one component.

    Control Unit
    PE 1: A1 + B1 = C1     PE 2: A2 + B2 = C2     . . .     PE 64: A64 + B64 = C64

    Figure 1-4. Addition of Two 64-dimensional Vectors on ILLIAC IV

An interesting example of the savings in time possible with ILLIAC IV is the computation of 1 + 2 + ... + 64 together with all of the partial sums. This problem requires 63 additions on a sequential machine, but Figure 1-5 shows the sequence of six parallel additions which yields the desired sum and each intermediate partial sum. The instructions indicate the routing of data at each step. If the instructions are changed to allow the data from PE 64 to be passed directly around to PE 1, each PE contains the sum 1 + 2 + ... + 64 after six parallel additions.

                                                 PE 1   PE 2   . . .   PE 64
    Route each PE's contents one to the right and add:   1   1+2   . . .   63+64
    Route contents two right and add:    1   1+2   1+2+3   1+2+3+4   . . .   61+...+64
    Route contents 4 right and add:      1   1+2   . . .   1+...+8    . . .   57+...+64
    Route contents 8 right and add:      1   1+2   . . .   1+...+16   . . .   49+...+64
    Route contents 16 right and add:     1   1+2   . . .   1+...+32   . . .   33+...+64
    Route contents 32 right and add:     1   1+2   1+2+3   . . .   1+2+...+64

    Figure 1-5. 1 + 2 + ... + 64 on ILLIAC IV

Another type of problem which is well suited to a parallel processing system is the numerical solution of partial differential equations. For example, the point Jacobi iteration to solve the Laplace equation by finite differences with a uniform mesh on a square is given by

    u_{n+1}(i,j) = [u_n(i,j-1) + u_n(i-1,j) + u_n(i,j+1) + u_n(i+1,j)] / 4 .

Here the average of the previously computed approximations to the true solution at the four nearest points is used as the new approximation. Since no u_{n+1} uses previously computed values of u_{n+1} at other points, it is possible to compute u_{n+1} at each point simultaneously on a parallel machine.

In contrast to the point Jacobi method, the Gauss-Seidel iteration uses new approximations for u_{n+1} which have already been computed whenever they are available, so the Gauss-Seidel iteration is inherently sequential. Carrol and Wetherford [6] suggest a variation of the usual Gauss-Seidel method which can be implemented on a parallel machine with N processing elements. Assume √N is an integer. If a 2√N x 2√N grid is used, the solution must be computed on 4N points. Hence each processor must compute the approximation at 4 points. Figure 1-6 shows that each processor computes the value at the first point using old data; the values for the second and third points are computed using some new and some old data; the value at the fourth point is computed using only new data.

    o = old information     n = new information     u_{n+1} = point being computed

    Figure 1-6. Carrol and Wetherford's Variation of Gauss-Seidel

Another variation of the Gauss-Seidel method is due to Stone [37]. He suggests that an N x N grid be used. Then diagonals of length N are processed simultaneously in the manner shown in Figure 1-7.

    Figure 1-7. Stone's Variation of Gauss-Seidel

As a final example of parallel algorithms we consider a parallel rootfinding algorithm. This algorithm is a special case of the simultaneous (n-2) degree method due to Rice [32] with n = 3. A summary of Rice's results and a more careful discussion of this method will be given in Chapter II. This algorithm will be called the parallel secant method because it is derived from the usual secant method given by many authors (see [30, pp. 323-328], or [27, Chapter 3]).
The secant method is given by

    x_{i+1} = [f(x_i) / (f(x_i) - f(x_{i-1}))] x_{i-1} + [f(x_{i-1}) / (f(x_{i-1}) - f(x_i))] x_i

            = x_i - f(x_i) (x_i - x_{i-1}) / (f(x_i) - f(x_{i-1})) .                         (1-1)

Assume that f ∈ C^2 on some interval containing a simple root α of f. Let x_0 and x_1 be chosen near α. Then the error in equation (1-1) satisfies

    α - x_{i+1} = [ -f''(ξ) f'(ξ_i) f'(ξ_{i-1}) / (2[f'(ξ)]^3) ] (α - x_i)(α - x_{i-1}) ,     (1-2)

where ξ, ξ_i, and ξ_{i-1} lie on the interval spanned by α, x_i, and x_{i-1}. Then there exists a constant K such that

    |α - x_{i+1}| ≤ K |α - x_i| |α - x_{i-1}| .                                               (1-3)

We now generalize the secant method given by (1-1) to a parallel method. Let x_{0,1}, x_{0,2}, and x_{0,3} be chosen near α. Let y_{i,j} = f(x_{i,j}), for j = 1,2,3; and i = 0,1,... . The parallel secant method defines a sequence of 3-dimensional vectors {X_i = (x_{i,1}, x_{i,2}, x_{i,3})} by

    x_{i+1,1} = x_{i,1} - [(x_{i,1} - x_{i,2}) / (y_{i,1} - y_{i,2})] y_{i,1}

    x_{i+1,2} = x_{i,2} - [(x_{i,2} - x_{i,3}) / (y_{i,2} - y_{i,3})] y_{i,2}                 (1-4)

    x_{i+1,3} = x_{i,3} - [(x_{i,3} - x_{i,1}) / (y_{i,3} - y_{i,1})] y_{i,3} .

         PE 1                            PE 2                            PE 3
    Store x_{0,1}                   Store x_{0,2}                   Store x_{0,3}
    Evaluate y_{0,1} = f(x_{0,1})   Evaluate y_{0,2} = f(x_{0,2})   Evaluate y_{0,3} = f(x_{0,3})
    Obtain x_{0,2}, y_{0,2}         Obtain x_{0,3}, y_{0,3}         Obtain x_{0,1}, y_{0,1}
    Compute x_{1,1} by (1-4)        Compute x_{1,2} by (1-4)        Compute x_{1,3} by (1-4)

    Figure 1-8. Flow Chart for One Iteration of the Parallel Secant Method

As the flow chart in Figure 1-8 shows, each of the three parallel processors uses the approximations generated at the previous step by itself and by another processor to compute a new approximation to α according to the secant method. From (1-3), the errors in (1-4) satisfy

    |α - x_{i+1,1}| ≤ K |α - x_{i,1}| |α - x_{i,2}|

    |α - x_{i+1,2}| ≤ K |α - x_{i,2}| |α - x_{i,3}|                                           (1-5)

    |α - x_{i+1,3}| ≤ K |α - x_{i,3}| |α - x_{i,1}| .

If we let δ_i = max{|α - x_{i,j}|, j = 1,2,3}, (1-5) becomes

    δ_{i+1} ≤ K (δ_i)^2 .                                                                     (1-6)

It will be shown in Chapter II that the use of parallel processing improves the order of convergence of the conventional secant method from (1 + √5)/2 ≈ 1.6 to 2.

1.5 Speed-Up Ratio

Ideally, a parallel processing system with N processing elements should be able to execute a job in one minute which requires N minutes to execute on a sequential machine. This ideal is realized for the addition of two 64-dimensional vectors on ILLIAC IV discussed in Section 1.4, but the program overhead and the nature of the problem or algorithm under consideration usually make it impossible to achieve the ideal savings of time. This leads us to define a measure of the time actually saved by a parallel machine.

Definition 1-1 (see [37, p. 3]). Define the speed-up ratio to be

    speed-up ratio = (computation time on a sequential computer) / (computation time on a parallel computer) .

Notice that it is necessary to state which sequential algorithm is being used for comparison. When studying the speed-up ratios achieved by different parallel rootfinding algorithms, their execution time should be compared with the execution time of sequential rootfinding algorithms which are optimal in some sense.

As the proportion of commands requiring sequential computation in a program increases, the speed-up ratio falls very quickly.
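This dependence can be made concrete with a short calculation. The sketch below is a modern illustration, not part of the original analysis: it evaluates the speed-up ratio of Definition 1-1 under the common assumption that a fraction s of a job is inherently sequential while the remaining 1 - s is divided evenly among the N processing elements; the function name and the even-division assumption are illustrative choices.

    def speedup(s, N):
        # Speed-up ratio of Definition 1-1 when a fraction s of the work is
        # sequential and the remaining 1 - s is spread evenly over N PE's.
        return 1.0 / (s + (1.0 - s) / N)

    print(speedup(0.0, 64))     # 64.0       -- the ideal N-fold speed-up
    print(speedup(1/64, 64))    # about 32.3 -- one 64th sequential already halves it
    print(speedup(0.5, 64))     # about 2.0  -- half sequential leaves almost no gain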
Figure 1-9, due to Amdahl [2], shows how the Speed-up ratio depends On the sequential processing inherent in a program. For example, EHJppose only 1/64 of a job requires sequential computation on ‘ILLIAC IV, while the remaining 63/64 of the job can be executed in parallel. execution ‘ sequential Spee Rat Tl Processing Cmputatic tasks whit alternatix 3Ut0mtica Parallel m ities in a: We which are a As I the ideal SF {my Problem, 8 speed-up r8 21 parallel. The machine's efficiency drops to 1/2 since half of the execution time will be taken by one processor to perform the sequential part of the program while the other 63 PE‘s are idle. N Speed-up N/2 «u- Ratio 1 160 I 0 N 100 Z of Inherent Serial Computation Figure 1-9. Speed-Up Ratio This example illustrates the importance of using the parallel processing machine only for parallel computations. To avoid sequential computations, the control unit can perform indexing and other overhead tasks while the PE's are executing the body of the program. Another alternative approach is to have a seQuential computer available which automatically assumes the execution of sequential tasks while the parallel machine performs tasks which utilize its parallel capabil- ities in an efficient manner. We now consider some examples to show the Speed-up ratios ‘which are achieved by several different classes of problems. As the remarks at the beginning of this section point out, the ideal Speed-up ratio of N is very rarely attained. However, many problems which have a natural vector or array Structure achieve a Speed-up ratio which is linear in the number of processors used. Aspeed-up r; are very vel problems inc numerical so'. calculations Probl relations (KL and Paterson speed-up rat as a linear Parallel prc bl’ doubling The ratio. Alg: [15] aChiev. °f a1gorith little incr CESSors, 1‘ class, they to COIHpute t large Class The that the use These PTObleL so they Shoo; '3 . ‘mcesslng sy up rati0. 22 A speed-up ratio of aN, with a s l, characterizes problems which are very well Suited to parallel processing. Examples of such problems include matrix computations and problems Such as the numerical solution of partial differential equations which use calculations at grid points (see [17]). Problems such as sorting (Stone in [36]), linear recurrence relations (Kogge and Stone in [16]), and polynomial evaluation (Munro and Paterson in [25]) have been handled on SIMD machines with a speed-up ratio of aN/logzN. Although this Speed-up is not as good as a linear Speed-up, these problems are still well suited to parallel processing since the Speed of computation is nearly doubled by doubling the number of processing elements. The next class of algorithms have an even poorer Speed-up ratio. Algorithms Such as Karp and Miranker's searching algorithm [15] achieve a speed-up ratio proportional to logzN. This class of algorithms is poorly suited to parallel processing since very little increase in Speed is achieved by doubling the number of pro- cessors. In Spite of the poor Speed-up ratio of algorithms in this class, they are of use in real-time applications where it is important to compute quickly, regardless of cost. This paper will Show that a large class of parallel rootfinding algorithms fall into this class. The final class of problems are inherently sequential, so that the use of parallel processing yields no increase in Speed. 
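For a concrete comparison of the classes just described, the brief computation below (illustrative only; the constant a is taken to be 1) tabulates the three speed-up forms for a machine with N = 64 processing elements.

    from math import log2

    N = 64
    print("linear, aN  :", N)             # 64.0      -- well suited to parallel processing
    print("aN / log2 N :", N / log2(N))   # about 10.7
    print("log2 N      :", log2(N))       # 6.0       -- doubling N gains very little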
These problems have a Speed-up ratio which is independent of N, so they should be executed on sequential machines to leave the parallel ‘processing system free to execute algorithms with a favorable Speed- up ratio. 1.6 introduc This root t! 0” statement of Let 1(a) =0 A root of f. The general iter f need only Other condi‘ necessary f. The and has bee to the fixe Cludes Part boundary Va 1'7 DGfinj Let the SIMD ma 0' We Cons N‘tup1e3 [ an EStimate be distinct {HUSt have form 23 1.6 Introduction to Iterative Rootfinding This paper will be concerned with the problem of finding a root a of a real-valued function f. We begin with a careful Statement of the rootfinding problem. Let f E Cm(a,b). Find a 6 (a,b), if it exists, such that f(a) = 0. Assume further that f'(o) # 0, so that o is a simple root of f. The requirement that f E Cm(a,b) will allow us to consider general iterative methods. For each Specific example we will consider, f need only have some finite number of continuous derivatives. Other conditions will be placed on the function f when they are necessary for the discussion. The rootfinding problem arises frequently in applications and has been the subject of extensive research. It is equivalent to the fixed point problem x = g(x) and its general setting in- cludes partial differential equations, integral equations, and boundary value problems (see Collatz [7]). 1.7 Definition of Iterative Methods Let N denote the number of processing elements used by the SIMD machine. Assume that the given function f has a root a- We consider iterative methods which compute a sequence of N-tuples {Xi = (xi,1,xi’2,...,xi,N)]. Each component of X1 is an estimate for a. In general, the components of X1 need not be distinct, but for all the methods considered in this paper we ‘must have xi j # xi k if j # k, We consider iterations of the form In general, considered 1* component to such as TILL The c sufficient f: following ri of the def in Kung and Tra m an Open zero qf 0 functions satisfies: 1) 2) 24 = tori, xi_1,...,x ) x1+1,j i-m In general, I may depend on the component j, but all Of the methods considered in this paper use the same iteration function for each component to that the methods can be implemented on a SIMD machine such as ILLIAC IV. The above definition Of a parallel iterative method is sufficient for our discussion, but for completeness, we give the following rigorous definition. This definition is a generalization Of the definition Of a sequential iterative method (N = 1) given by Knng and Traub in [21]. Definition 1-2. Let D = {flf is a real analytic function defined on an Open interval depending on f, I CIR which contains a Simple f zero cf of f, and f' # 0 on If]. Let n denote the set of functions {m} which map every f E D to a function m(f) which satisfies: 0n+l) a, N 1) (9(a) :RN R ; 2) CP(f) (af)""af) = (af’°",af) ; N(mfil) times N times 3) For each (p E 0, f E D, there exists Icp f, an open subinterval contained in If such that i) of 6 Im,f , ii) o(f)[Icp,fN(m1)] :: Icp.fN , and iii) if X0,X1,...,Xm 6 Iqbe and X1+1 is given by X1+1 = ¢(f)(X1,Xi-1,...,Xi_m) , then The: it de ti 0n ca do De put of is tl “hitch etc/7 Mr 1 25 = = 2 ..., . lim xi,j of, for j l, , N 1'“ Then qaé O is called a parallel rootfinding iteration with memory m. If m = O, m is called a parallel rootfinding iteration with- out memory. If N = l, m is a sequential rootfinding iteration as defined by Kung and Traub. In general m(f) depends on the values Of f and some of its derivatives. 
If m(f) requires the evaluation Of f and its derivatives only at the points xi 1,xi 2,... (values at ’ , ’Xi,N xi-l,l’°°"xi-1,N’°"’xi-m,N may be stored and used in the computa- tion of Xi+1), m is called a one point iteration. If m(f) re- quires the evaluation Of f or its derivatives at points which are not components of Xi, m is called a multipoint iteration. All Of the iterative methods satisfying this definition are called stationary iterative methods since the iteration function m does not change as the iterations proceed. Definition 1:3. The information usage Of the iterative method m is the number of parallel functional evaluations necessary to com- pute Xt+1. If vk(m) denotes the number of parallel evaluations Of f(k) necessary to compute then xi+1’ v(cp) = E Vk(cp) (1‘7) k20 is the total information usage of m. Recall that we are considering only iteration functions m which are suitable for implementation of a SIMD machine. Hence each component of X is computed by the same formula, so if f i+l Inust be evaluated at one component of X it must be evaluated at 1, 26 all N components. These N evaluations of f may be executed in parallel with PE j evaluating f(x ). Thus one parallel func— 1.1 tional evaluation yields N functional values. 1.8 Order of Convergence An iterative method as defined in the preceding section generates a sequence Of N-tuples {Xi}. The order Of convergence Of an iterative method is a quantitative way of measuring the rate at which this sequence converges to the N-tuple (af,...,af). In what follows, we shall discuss this idea for sequential rootfinding algorithms, and then generalize it to include parallel rootfinding algorithms. As Brent points out [4, p. 21], there are many possible definitions of "order of convergence". We will use two Of them. The sequence {xi} with lim x 1 1-400 have order Of convergence 1 if (1-8) or (1-9) hold. = a and 61 = a - x1 is said to 6. 11m\-—1—+—1—‘-=c>o,;.>1. (1-8) - A 1”” leil 1hn (‘ln lei|)1/i = A - (1-9) lam Definition 1:4, We Shall say that the sequence {xi} has strong order of convergence A if (1-8) holds, and weak order Of con- vergence A if (1-9) holds. Note that condition (1-8) implies condition (1-9). Indeed, if we assume that (1-8) holds for the sequence {xi} with x 4 a i and let G. = l61+1l 1 leilx 27 then there exists an integer I such that i 2 I implies ci 2 c/2. Without loss of generality, we may assume that I = 0 and leil < l for all i. It is easily seen that 1-1 1'. _ A A X lei] - ci-lCi-Z ... c0 ‘30] . (l-lO) Using (1-10), one has that 1’1 k i ln leil = REG I 1n ci-l-k.+ x 1U\eol s so that f’. 1 '1 lli 1- 1 ° k ' (~1n leil) /1 = x 1n C 1 +'11 1n k=0 i-l-k leol P ‘1 1-1 k_i 1 1 1/1 = x z 1 1n + 1n (1-11) _k=0 °1-1-k leol The assumption that lei. < 1 implies that -ln‘ei\ > 0. Hence the term in brackets in equation (1-11) is positive. Since lim 81/1 = 1 if a > 0, we need only show that the term in the i—oo brackets is bounded as i _. on. Now i-k Z xk-i ln-C-i—+ln =0 i-l-k leol 1-1 sl— 2: ikin-z- +1n.1 1. C 3‘ x k=0 1 0 i S1n (2/c) [1 -§]+ 1n 1 C01 in the fit and 28 However, condition (1-9) does not imply condition (1-8). This can easily be seen by letting e i even e 1 Odd Ortega and Rheinboldt [26] discuss these and several other possible definitions of order of convergence. They point out that condition (1-8) usually requires involved proofs when A is not an integer, while the proofs required when condition (1-9) is used as the definition are the same whether [X is an integer or not. 
We shall now generalize Definition 1-4 for parallel root- finding algorithms. While a sequential rootfinding method generates a sequence {x1} of estimates for the root a, a parallel algorithm generates a sequence of N-tuples {X1 - ( xi’1,xi,2,...,x1’N)]. Setting a: 'or and 61-mxfie j-1,2,...,N],we 1.1 ' “1.1 1.1" define the order of convergence in terms of the worst error at each step. Definition 1:2, We shall say that a sequence of N-tuples {X1} which converges component-wise to the N-tuple (a,a,...,a) has strong order of convergence A if 6 _ktl_ 11m >. = c > 0 , (1-12_ 1.... (5,) and weak.order of convergence A if lim(-1n 51)”1 = x . (1-13) i-can l.‘ m. 81‘ bc 29 For each of the parallel rootfinding algorithms considered in this paper, we Show that ‘)1/1 lim (-1n\si = x, for j = l,...,N , law 1 so (1-13) holds for the error in each component, not only for the largest error. However, the strong order of convergence of these methods remains an open question. It is our conjecture that for each of the methods considered in this paper, the strong order of convergence is the same as the weak order of convergence. In [32], Rice cites technical problems with using the strong order of con- vergence in the parallel case as his reason for using the weak order of convergence. 1.9 Examples of Rootfinding Iterations In this section we shall consider five sequential rootfinding algorithms and the parallel secant method introduced in Section 1.4 to illustrate the definitions of iterative methods and order of convergence. 1.9.1 Newton-Raphson Method The familiar Newton-Raphson method is given by _ ESL CP(f)(x) " X " ft(x) ' This is a one-point sequential iteration without memory. It has both strong and weak orders of convergence 2. It is characterized by m-0,N=l,v(cp)=2,and x-Z. of act EV. 1.9 30 1.9.2 Generalized Newton-Raphson Method (see [40, pp. 78-87]) Let f exist in a neighborhood of a and be denoted by g. Define v1 = x Y2 = Y1 ' f'(x) 2 - _ f"(xl__ f(x) “g(x) ’ Y3 ’ Y2 ' 2f'(x) [Foo] ° This is a sequential one-point iteration without memory with m = 0, N = l, vo(q9 = v1(q9 = v2(q9 = l, v(q9 = 3, and A a 3. This method is a Special case of a general class of sequential one-point itera- tions without memory which achieve a strong order of convergence A .,f(1-1). with one evaluation each of f, f',.. 1.9.3 Multipoint Method One (see [21]) Let 3 fi 0 be a constant. Define N X lo v, to + efoo) af(v0)f(v1) f(f)(x) = *2 = *1 ‘ f(vl) — f(ts5 This is a sequential multipoint iteration without memory with m - 0, N = l, v0(cp) 8 v(ep) = 2, and L = 2. It is a Special case of a general class of sequential multipoint iteration methods which I achieve a strong order of convergence A = 2“. with just n evaluations of f. 1.9.4 Multipoint Method Two (see [21]) Let 31 20 = x _ {5:01. 21' z0 ' f'(zo) f(z )f(z ) (f(z ) cp(f)(X)=7-2=zl' 1 O z'ifi'g—f [f(zl) - f(zo)] o This is a sequential two-point iteration Since functional values at 20 and 21 N = 1, v0(cp) = 2, v1(cp) = l, v(cp) = 3, and A = 4. This method are needed to compute m(f)(x). Here m = O, is a Special case of a general class of multipoint sequential root- finding algorithms without memory which achieve a strong order of convergence A = 2n-1 with n-l evaluations of f and one evaluation of f'. 1.9.5 Lagrange Interpolation Methods (see [30, pp. 336-339], or [39, pp. 
60-75] Let x0,x1,...,xm be given estimates close to o- Let h(x) be the Lagrange interpolation polynomial of degree m which x ‘Let x' be a real zero agrees with f at x i-m° i+1 1,xi_1,..., of h chosen according to a Specified policy. Define ¢(f)(xi,...,x ) = x This is called direct Lagrange interpola- i-m 1+1. tion. Alternately, let H(y) be the Lagrange interpolation poly- ‘nomial of degree m which satisfies H(yj) 8 x for j’ j - i, i-l,...,i-m, where yj = f(xj). Then H interpolates f-l, so this is called inverse Lagrange interpolation. Define x1+1 I H(O). Both Of these methods are sequential one-point iteration methods *with memory m, N . l, and vo(q9 = v(q9 - l. The Strong order of convergence A is the unique positive root of the equation t - t -...- t - 1 = 0 . 32 i also satisfies lsx<2 (see [40, pp. 62-74]). Both direct and inverse Lagrange interpolation use f(xj), for j = i,i-1,...,i-m, and both have the same strong order of convergence. If H is the inverse Lagrange interpolation polynomial, H(O) is easily computed, while it is more difficult to find the root by direct interpolation using h. Although it can be shown that b has a real root x which makes the method converge, 1+1 that root need not be unique. If more than one root Of h have the desired properties, additional criteria are used to assure that the choice of x1+1 is well-defined. h(x) is a polynomdal of degree m, so the problem of actually finding its root or roots may require a separate rootfinding algorithm. Thus it may be quite difficult to compute x as a root of h, while the computation i+1 of xi+l as H(O), where H is the inverse interpolation poly- nomial, is straightforward. For this reason, rootfinding .algorithms using inverse Lagrange interpolation are preferred over those using direct Lagrange interpolation. Chapter II will consider parallel iterations which use Lagrange interpolat ion . 1.9.6 Parallel Secant Method The parallel secant method introduced in Section 1.4 is a parallel one-point iterative method without memory with m - 0, Ii - 3, vo(q9 - v(m) - 1. That is, only one parallel evaluation of f is necessary to evaluate f at three different points. 33 In Chapter II, it will be shown that the parallel secant method has weak order of convergence 2. We note that the parallel secant method is a generalization of a sequential one-point iteration with memory m = 1. Most of the parallel one-point methods considered in this paper are generaliza- tions of sequential methods with memory. 1.10 Results on Optimal Order Since the publication of Traub's book [40] in 1964, con- siderable progress has been made in the search for sequential iterative rootfinding methods of Optimal order. This section will Survey some of those results. The orders of convergence discussed in this section are all Strong orders of convergence. An extensive list of references dealing with results on optimal orders can be found in [39]. In the case of sequential one-point methods without memory, it is known [40, p. 98] that the highest order which can be achieved using only f(xi),f'(xi),...,f(r-1)(x1) is r for r 2 2. Further, any such method of order r must depend explicitly on f(r-l) f(xi),f'(xi),..., (x1). In fact, this bound is achieved by the generalized Newton-Raphson method introduced in Section 1.9.2. In the case of sequential one-point methods with memory, consider the method m which computes f(xi),f'(xi),...,f(r-1)(x .,£(r'1) 1) and uses the values Of f,f',.. 
evaluated at xi_1,xi_2,...,xi_m which were computed and stored during previous iterations. It is shown in [40, p. 106] that the order of con- vergence A of m satisfies 34 r < K < r+1, for m.= 1,2,... . (1-14) Hence no amount of memory will improve the order by as much as one evaluation of f(r). In [5], Brent and Winograd consider another class of sequential one-point methods with memory. They show that the highest order of convergence which can be achieved by direct or inverse Hermite interpolation of f(k)(xj), for j = 0,1,...,i; k = 0,1,...,r-l, is r+l. This bound is achieved by using all_of the old functional evaluations in the Hermite interpolation to compute x . They suggest that the upper bound in (1-14) is i+1 approached as m grows large. Rissanen [33] proves that the usual secant method has optimal order among sequential one-point iterations using only f(xi) and f(xi_1), if a smoothness condition on admissible algorithms is assumed. In the case of sequential multipoint methods without memory, Kung and Traub [21] construct two classes of algorithms of order Zn-l using n functional evaluations. The first method evaluates f at n points (see Section 1.9.3), while the second method evaluates f' at one point and f at n-l points (see Section 1.9.4). Kung and Traub go on to show that the order of any sequential multipoint method without memory using n functional evaluations is 5 2“. It is their conjecture that the optimal order is actually 5 2n-1. In another paper [20], Kung and Traub prove the truth of their conjecture in the case n - 2. Very little work has been done in the area of sequential multipoint iterations with memory. The question of optimal order for such methods is still open. 35 1.11 Definitions of Computational Efficiency If we wish to compare two different iterative rootfinding algorithms, we must have some measure of efficiency. Perhaps the ultimate efficiency measure is the time required to compute the desired root to within a specified accuracy. However, this is an impractical measure since it depends heavily on the function, the initial guess, and the machine used for the computations. For com- parison of general methods, we want a! measure which does not depend on the computer used to implement the algorithm. The simplest measure is the order of convergence of the method, since the error of a high order method asymptotically approaches 0 more rapidly than the error of a low order method. If two methods have the same order of convergence, the one with the smaller asymptotic error constant converges more quickly. However, the results cited in Section 1.10 on Optimal orders showed that high order iterative methods require more functional evaluations than lower order methods. In some cases, the desired root may be found in less time by a low order method requiring fewer functional evaluations than by a high order method. This suggests that an efficiency measure be defined which reflects both the order of convergence of the method and its in- formation usage (see Definition 1-3). The following efficiency measure generalizes a measure used by Traub [40] for sequential algorithms to include parallel algorithms. Definition ;;§, Let m be an iterative rootfinding method with order of convergence 1- Let the information usage of m, the total number of parallel functional evaluations required by one step of 36 the method, be denoted by v(q9. 
Define the information efficiency of the method to be (1-15) This measure of efficiency is invariant under the composi- tion of a method with itself. That is, if V = q>o m, one step of w is the same as two steps of m. If q; has order I and informa- tion usage v, v has order K2 and information usage 2v. Since that same work is being done, one would expect the information efficiency to be the same. Indeed, 2 log k IE(cp) = T;— = IEW) - In [19], Kung and Traub show that (1-15) is essentially the unique way to define an efficiency measure, since any efficiency measure which is invariant under self composition is of this form or is a strictly increasing function of this form. Traub [40] shows that this measure of efficiency is inversely pr0portional to the total time required to compute the root 0 to a specified accuracy. Other estimates of the time required to find the root 0 to a specified accuracy may replace the information usage used in (1-15). In [18], Kung shows that if the time is assumed to be pro- portional to the number of non-constant multiplications or divisions, M, then 1082 k M S l . 37 It is for this reason that log2 is used in (1-15). Unless other- wise noted, all logs in this paper are taken in base 2. To see that the information efficiency defined by (1-15) does not reflect the entire cost of computation, consider the sequential methods f(xi) x1+1 = 2&0?) (xi) = xi - my , (1-16) and 2 [f(xi)] x1+1 = ‘91:“)“9 = "1 ' f(xi + 130(1)) - f(xi) (1‘17) qh is the Newton-Raphson method, while mr is a two-point method due to Kung and Traub [20]. Both methods have strong order of con- vergence k = 2, and both have information usage v = 2. Hence IE(qN) = IE(q&) = 1/2 . Although both methods have the same information efficiency, the time required by each of them to compute a to a specified accuracy is different. 4% has the advantage of not requiring f'. This is important if f' is not available or if it is costly to evaluate. However, once the functional evaluations have been made, qh requires only one addition and one division, while qk requires three additions, one multiplication, and one division. Hence if f and f' are equally costly to evaluate, qh will converge in less time. In fact, “h is faster whenever f' can be evaluated in less time than is required for the evaluation of f and the execu- tion of two additions and one multiplication. 38 This suggests the following definition of efficiency which is due to Kung and Traub [19]. Definition 1-7. Let m be an iterative rootfinding method with mo) ,0) order of convergence X- Let ) denote the number of arithmetic operations needed to evaluate . Let a(q9 denote the number of parallel arithmetic Operations needed to combine the functional values used to compute X Then we shall call the efficiency 1+1' of m 1082 A z Vj(cp)c(f(j)) + 8(cp) jZO EFF(gp,f) = . (1-18) (1) where vj(q9 denotes the number of parallel evaluations of f which are necessary at each step. It is easily seen that the denominator of (1-18) is the total number of parallel arithmetic operations required at each step of the method. We will be concerned with finding upper and lower bounds for the efficiencies of various classes of algorithms by finding lower and upper bounds, respectively, for the denominator of (1—18). If we let cf = min{c(f(j)), j 2 O}, we immediately have the upper bound log I. EFF(qp,f) svcf + 8W) . (1-19) where v(q9 = E Vj(q9 is the number of parallel functional evalua- ij tions required at each step. 
1.12 Results on Optimal Efficiency

We now survey some of the results which are known about algorithms whose efficiency (see Definition 1-7) is optimal. This is an active area of interest and many questions are still open.

In [20], Kung and Traub consider the methods φ_N and φ_T given by equations (1-16) and (1-17), respectively. They show that these methods have optimal efficiency in the sense of Definition 1-7 among all sequential methods without memory using only two functional evaluations. Their respective efficiencies are given by

\[
\mathrm{EFF}(\varphi_N, f) = \frac{1}{c(f) + c(f') + 2}, \tag{1-20}
\]

and

\[
\mathrm{EFF}(\varphi_T, f) = \frac{1}{2c(f) + 5}, \tag{1-21}
\]

where c(f) and c(f') denote the number of arithmetic operations needed to evaluate f and f', respectively. From equations (1-20) and (1-21), it follows that the Newton-Raphson method φ_N is more efficient than the two-point method φ_T if and only if c(f') < c(f) + 3.

Next we compute the efficiency of the parallel secant method introduced in Section 1.4. Recall that the first component of X_{i+1} is given by

\[
x_{i+1,1} = x_{i,1} - \frac{x_{i,1} - x_{i,2}}{f(x_{i,1}) - f(x_{i,2})}\, f(x_{i,1}). \tag{1-22}
\]

The parallel secant method requires one parallel evaluation of f and has weak order of convergence 2. Equation (1-22) requires five arithmetic operations. Hence the efficiency of the parallel secant method, φ_PS, is given by

\[
\mathrm{EFF}(\varphi_{PS}, f) = \frac{1}{c(f) + 5}. \tag{1-23}
\]

Comparing equation (1-23) with equations (1-20) and (1-21), we see that the parallel secant method is always more efficient than Traub's two-point method, and that it is more efficient than the Newton-Raphson method whenever c(f') > 3.

For iterative rootfinding methods φ belonging to a given class Φ of rootfinding methods, Kung and Traub [19] and Traub [40] define

\[
E_n(\Phi, f) = \sup\{\mathrm{EFF}(\varphi, f) : \varphi \in \Phi,\; v(\varphi) \le n\}
\]

and

\[
E(\Phi, f) = \sup\{E_n(\Phi, f) : n = 1, 2, \ldots\},
\]

where v(φ) denotes the information usage of φ. They then give lower and upper bounds for E_n and E for various classes Φ of algorithms. Their results are summarized by the following theorems.

Theorem 1-1 [19]. Let Φ be the class of sequential one-point iterations without memory. Then there exists some constant ρ > 0 such that

\[
\frac{\log n}{\displaystyle\sum_{i=0}^{n-1} c(f^{(i)}) + \rho\, n \log n} \;\le\; E_n(\Phi, f) \;\le\; \frac{\log n}{c_f + n - 1}, \tag{1-24}
\]

for all n, where c_f = min{c(f^{(i)}) : i ≥ 0}, and, if c_f ≥ 4,

\[
\frac{\log 3}{c(f) + c(f') + c(f'') + 7} \;\le\; E(\Phi, f) \;\le\; \frac{\log 3}{3c_f + 2}. \tag{1-25}
\]

Theorem 1-2 [19]. Let Γ be the class of sequential multipoint iterations without memory using values of f only. Then there exist constants r_0, r_1, and r_2 with r_2 > 0 such that

\[
\frac{n-1}{n\,c(f) + r_2 n^2 + r_1 n + r_0} \;\le\; E_n(\Gamma, f) \;\le\; \frac{n}{n\,c(f) + n - 1}, \tag{1-26}
\]

for all n, and

\[
\frac{1}{c(f)}\left(1 + \frac{\xi}{\sqrt{c(f)}}\right) \;\le\; E(\Gamma, f) \;\le\; \frac{1}{c(f)}, \tag{1-27}
\]

for some constant ξ < 0.

It is further conjectured in [19] that for the class of all sequential one-point or multipoint iterations without memory,

\[
E_n(f) \le \frac{n-1}{n\,c_f + n - 1}, \qquad\text{and}\qquad E(f) \le \frac{1}{c_f + 1}.
\]

The results in Theorems 1-1 and 1-2 will be used to compare the efficiencies of the parallel rootfinding algorithms developed in this paper with the efficiencies of optimal sequential algorithms.

1.13 Parallel Rootfinding

There has been very little research done in the area of parallel rootfinding algorithms.

In 1968, Karp and Miranker [15] gave a parallel searching technique for a maximum which is optimal in the minimax sense. These techniques can be used to solve the rootfinding problem, since finding a maximum of -|f(x)| is the same as finding a root of f. Their algorithm makes one parallel functional evaluation to yield f evaluated at N points.
It then searches for the largest value. Since searching cannot be done efficiently in parallel, this algorithm does not effectively utilize a parallel machine, but it does achieve a Speed-up ratio proportional to log N. In [23], Miranker gives a "parallel" rootfinding algorithm, but the i th processing element PE 1 must have available the results computed by PE 1, PE 2,..., PE i-l before it can proceed. Thus this is not a truly parallel algorithm. Miranker claims a speed-up ratio of %'1og1.618 N. In [32], Rice gives several parallel rootfinding algorithms based on direct or inverse polynomial interpolation. Some of the details of Rice's results will be presented in Section 2.1. Roughly, he computes the order of convergence of each of his methods from a matrix representation of the data used by each method. Some of Rice's methods require different computations to be performed by each processing element, so they are only suitable for implementation on a MIMD machine. However, his simultaneous (N-2) degree method is suitable for use on an array processing system like ILLIAC IV. This class of algorithms will be studied in detail in Chapter II. CHAPTER II ROOTFINDING BY LAGRANGE INTERPOLATION One class Of rootfinding algorithms whose structure is well suited to an array processing system like ILLIAC IV is based on direct or inverse Lagrange interpolation. In this chapter, we will study such methods. In particular, we will give the order of con- vergence of such methods, consider their efficiency, and show that their Speed-up ratio is proportional to log N, where N is the number of processing elements being used. We conclude that because of their poor Speed-up ratio, these methods are not well suited for parallel processing, even though their structure makes efficient use of the parallel capabilities of a SIMD computer. Definition 2;}, Let Gr denote the class of all stationary one- point parallel iterative rootfinding methods without memory which require that f,f',...,f(r) be evaluated at the components of X . i In this chapter we shall consider G , the class of methods 0 which use only values of f. The case of the general class Gr will be considered in Chapter III. In what follows, we shall require that the real valued function f with a simple root 0 is fixed. 2.1 Rice's Results In [32], Rice gives the algorithms we will discuss in this chapter. A general description of the algorithms of [32] follows. 43 44 A sequence of N-dimensional vectors {X1 = (x1,1,xi’2, ...,x1 N)] is generated, where each x, is an approximation to 3 1.1 a. An N x.N matrix T is formed by 1 if x, is used to compute 1,1 x1+1,1< 0 otherwise Consider the set = l} for a fixed k. Let H Sm = {..in3”. be the Lagrange interpolation polynomial which satisfies = f s , H(y1,3) x1,3’ or xi,J 6 i,k where yi j ' f(x. j) Then let xi+1,k = H(O) That is, Rice performs classical inverse Lagrange interpolation on f. Rice's main result is the following. Theorem 2-1 [32]. Assume that f 6 CI“,1 in a neighborhood of the simple root a- Let p be the spectral radius of the matrix repre- sentation T. If p > 1, then the iterative method converges and has weak order of convergence p. We wish now to focus our attention on the type of method which Rice calls the simultaneous (N-2) degree method. This method is of interest because it is the only method proposed by Rice which is suitable for implementation on a SIMD machine. His other methods require a MIMD machine because all of the processing elements do not perform the same computations. 
For example, one method requires three processors to perform interpolation at three points, while a fourth processor performs interpolation on four points.

In the simultaneous (N-2) degree method, there are N distinct approximations to α stored at each step. Each of these N points is stored in a different processing element. To compute the new approximations, each processing element uses N-1 of these N points to compute a Lagrange interpolation polynomial of degree at most N-2. Of the many ways the N-1 points could be chosen by each processor, Rice has each processing element use the points stored in the other N-1 processors. Thus the matrix representation T for the simultaneous (N-2) degree method is given by T_{kj} = 1 - δ_{kj}, where δ_{kj} is Kronecker's delta. For example, the matrix representation T for N = 4 is

\[
T = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}. \tag{2-1}
\]

The weak order of convergence of the simultaneous (N-2) degree method is N-1. A simplified flow chart for the operations of one of the N processing elements is shown in Figure 2-1.

[Figure 2-1. Flow Chart for PE k:
use x_{i,j} and y_{i,j}, for j ≠ k, to compute x_{i+1,k};
evaluate f(x_{i+1,k}) = y_{i+1,k};
send x_{i+1,k} and y_{i+1,k} to the other PE's and, for j ≠ k, receive x_{i+1,j} and y_{i+1,j}.]

2.2 Parallel Secant Method

The parallel secant method introduced in Section 1.4 is a slight modification of Rice's simultaneous first degree method (N = 3). Recall that if y_{i,j} = f(x_{i,j}), then the parallel secant method is given by

\[
\begin{aligned}
x_{i+1,1} &= x_{i,1} - \frac{x_{i,1} - x_{i,2}}{y_{i,1} - y_{i,2}}\, y_{i,1}, \\
x_{i+1,2} &= x_{i,2} - \frac{x_{i,2} - x_{i,3}}{y_{i,2} - y_{i,3}}\, y_{i,2}, \qquad (2\text{-}2)\\
x_{i+1,3} &= x_{i,3} - \frac{x_{i,3} - x_{i,1}}{y_{i,3} - y_{i,1}}\, y_{i,3}.
\end{aligned}
\]

If we compare the flow chart given in Figure 1-8 for the parallel secant method with the flow chart given in Figure 2-1 for the simultaneous (N-2) degree method in the case N = 3, we see that the parallel secant method requires less data to be transferred among the processing elements. For example, in Rice's method, PE 1 must receive x_{i+1,2}, y_{i+1,2}, x_{i+1,3}, and y_{i+1,3}, while in the parallel secant method, PE 1 need only receive x_{i+1,2} and y_{i+1,2}. Hence the parallel secant method requires less communication among the processing elements than Rice's method. Note that this modification does not change the spectral radius of the matrix representation of the method, so both methods have weak order of convergence 2. Theorem 2-5 will give a different proof that the weak order of the parallel secant method is 2.

This method requires one parallel functional evaluation and five arithmetic operations per step, so the information efficiency of this method is

\[
\mathrm{IE} = \frac{\log 2}{1} = 1,
\]

and the efficiency is

\[
\mathrm{EFF} = \frac{\log 2}{c(f) + 5} = \frac{1}{c(f) + 5},
\]

where c(f) is the number of arithmetic operations needed to evaluate f, and log is understood to be log₂.

[Figure 2-2. Parallel Secant Method]

In Section 1.12 we showed that the parallel secant method is more efficient than the Newton-Raphson method whenever c(f') > 3.

2.3 Order of Convergence

In this section we will show that the weak order of convergence of parallel rootfinding methods based on Lagrange interpolation is equal to the number of points used for the interpolation. Before stating the exact result, we state the following well-known result on the error of Hermite interpolation (see [40, p. 244]). In this chapter we will only use this result for Lagrange interpolation (r = 0), but the more general result will be used in Chapter III.
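A minimal numerical sketch of the parallel secant iteration (2-2) of Section 2.2, simulating the three processing elements with a loop, previews the quadratic behaviour established in this section. The test function, the starting triple, and the fixed iteration count are assumptions made only for the illustration.

```python
# Sketch (illustrative only): the parallel secant iteration (2-2) with N = 3.
# Each component j is updated from (x_j, x_{j+1 mod 3}); one parallel
# evaluation of f and five arithmetic operations per component per step.
# The test function and starting points are assumptions for this example.

def f(x):
    return x**3 - 2.0

x = [1.0, 1.5, 2.0]                      # X_0: three distinct approximations
for _ in range(4):                       # a few iterations suffice here; a
    y = [f(t) for t in x]                # practical code would also guard
    x = [x[j] - y[j] * (x[j] - x[(j + 1) % 3]) / (y[j] - y[(j + 1) % 3])
         for j in range(3)]              # against coincident components
print(x)                                 # each component is near 2**(1/3) = 1.259921...
```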
3 k Lemma 2:2, Let B = k'+ 2 rj, rJ 2 0, j = 1,2,...,kx ‘Let f E C6 J=1 on an interval containing x1,x2,...,xk. Let h(x) be the unique Hermite interpolation polynomial of degree 5 B - l satisfying ), for i = 0,...,r ; j = l,...,k . (i) = (i) h (xj) f (xj j Then the error is given by (B) k r.+1 E(x) = f(x) - h(x) = £_§rL§l H (x - xj) J H where g is on the interval Spanned by x,x1,x2,...,xk. Let us give the notation needed to construct parallel root- finding algorithms based On inverse Lagrange interpolation. The same notation will be used in Chapter III for methods based on in- verse Hermite interpolation. Fix k satisfying 2 s k s N-l. Choose N distinct subsets A for j = 1,2,...,N, of the set 1’ {l,2,...,N] such that j 6 Aj and, A1 contains k elements. We will use the following lemma and its corollary to find the orders of convergence for most of the methods considered in this paper. Lemma 2-3. Let g) be a parallel rootfinding algorithm which gen- erates a sequence {Xi} whose errors satisfy 11 (61,98 , j =l,2,...,N , (2-3) EA 31+1,j= K14 t J for some sequence of constants K , where = a - x . Let 1,3 31,3 49 E x \€,,j\ W H. II B H. :3 \e i,j' xl a x \Ki,j\ II S 1.1. D \K,,j\ Assume further that e # O for all i and j, and that E i o , ....I H. a WI ll i-m 11m 51 law g # o . Then there exist positive sequences c1 and di’ both with strong order of convergence ks, satisfying C1S§1$\ei,j\561‘di , for j = 1,2,...,N, and 1 = 0,1,... Proof. From equation (2-3), s. 1 = \x 1 u \e \S , <2-4) ' 1+1,J 1,.1 tt'Aj in: so that '- s ‘- ks \erp1’1\ 5 K1 ch (51) = K1(61) J Then - ks 61+1 S K1(51) Let d = 60, and di+ 50 d lim “~Etl- = K , ks 14m (d1) so the sequence {d1} has strong order of convergence ks. By induction, 61 s (11 since - ks '— ks 5 s Ki(5i) s Ki(di) - d i+l 1+1 Similarly from equation (2-4), s ks 'ei+1,j' 2 E1 H (51) ' §i(§i) tEA J The if c = and c = K (c )k8 n 0 go, 1+1 -i 1 ’ c lim -lilif'= K , s _ ifim (c1) so {c1} has strong order of convergence ks, and ks ks From this lemma it follows that Corolfiltaq 2-4. Under the conditions of Lenma 2-3, the rootfinding method m has weak order of convergence ks. Proof. According to the lemma, there exist sequences {c1} and [d1], both with strong orders of convergence ks, which satisfy for j = l,...,N, and i = 0,1,... ci 5 'ei,j' 5 d1 , Then since strong order implies weak order, 1/1 1/1 ks = lim (-ln d1) 5 if: (-1n\ei,j\) 14a /1 = 5 lim (-1n Ci)1 ks , 1 am which.implies that 51 lim (-1n \ei jp'l' = ks , for j = 1,2,...,N idm ’ [I In order to find the order of convergence for parallel root- finding methods based on Lagrange interpolation, we need some addi- tional restrictions on f. Let f E Ck(I), where I is an Open interval containing the simple root 0 of f, and on which f' i 0. Let f.1 be de- noted by g. Then g E Ck on some Open interval containing 0. Assume that g(k) # O in a neighborhood of 0. Let yi j = f(xi ). .1 We are now ready to state and prove the main result of this section. Theorem 2-5. With the notation and assumptions given above, for each 1 = l,2,...,N, let H (y) denote the Lagrange interpolation J polynomial of degree at most k-l which satisfies Hj(yi,s) = xi,s , for s 6 Aj Let ¢k 6 G0 be the method which defines xi+1 j = Hj(0) If distinct points are chosen sufficiently x0,l’x0,2"°°’xO,N close to a, then either xi j = a for some i and 1, or lim x = a , for j = 1,2,...,N . i,j iaa Thus the method ¢k converges. MOreover, mk has weak order of convergence k. The proof is a generalization of the proof given in [30] for the sequential case. 
When no ambiguity will result in the proof, we suppress the subscript i to simplify the notation. 52 Proof. If yt = 0 for any t = 1,2,...,N, then x1 t = a, and we are done. Hence we may assume that yt i 0. We may also assume that ys # yt, for s i t, for if x = x we may re- i,s i,t’ place xi s by any x from the interval spanned by x1,1,xi’2,...,xi’N. Since the resulting ei,s still lies between the smallest and the largest errors, the proof remains valid. If for some i, all of the xi 3 are identical, they span no interval and the method fails. H is the Lagrange interpolation polynomial for g, so that it must have the form 8(Y)‘“ H(Y) = SEAJL8(Y)8(YS) = SEAJLS(Y)X1’S , (2'5) where () ny-y" Ly= —-—-. s t6Aj ys ' yt tis Let Z = H (-ys). Inverse Lagrange interpolation estimates 86Aj a = 3(0) with Hj(O), so in our method we have x i s x =H(0)- 2: 1(0):: =-zz 5* . (2-6) t+1’j j 36Aj 8 1’8 86Aj ysch (ys - yt) J tfis According to Lemma 2-2, the error in (2-6) is given by 5(k)(fll z a ' xi+1,j ___ k! ’ ei+1,j = (2'7) where n lies in the interval spanned by 0 and ys, s E Aj. Now 53 ’~< I - f(xi,s) = f(xi,s) - f(a) f'(gs) (xi,s - a) = -f'(§s) 61,3 where gs lies between a and x , and T13 lies between 0 is and ys. Then equation (2-7) becomes e = (k) 11 31$.— . (2-8) i+l,j k! SEA g'(‘ns) J If are chosen sufficiently close to 0:, equation (2-8) implies that ’ = = 2 111“ 61,1 0 1 for j 1: 9 J": 11-009 so the method converges. We have assumed yt 1‘ 0, so that 1‘ 0, for t = 1,2,...,N. Since g(k) 3‘ 0 near 0 by assumption, 31,: equation (2-8) implies that 3i+1 j :4 0. Let k K = 11211). n 1 , 1:3 1" SEA 8'(T'S) 1 so that (R) 11111 K1 j = —8—'(9-l); 1‘ O i-m ’ k![g'(0)] Then all of the hypotheses of Leanna 2-3 are satisfied with s = 1, so it follows from Corollary 2-4 that cpk has weak order Of con- vergence k. I For each k, there are (:1) different subsets of the set N {1,2,...,N]. Hence there are ((k)) different methods of order k N 54 based on inverse Lagrange interpolation, but these methods differ only in the indexing of x1 1,xi 2,...,x, . These methods arise 1,N from different choices for the subsets Aj' The condition 2 s k s'N-l arises from the need to choose the N distinct subsets Aj of {1,2,...,N], each containing k distinct elements. The subsets must be distinct to assure that {xt+1,1,x1+1’2,...,x1+1,N] are all distinct, so k cannot equal N. This condition gives the following corollary to Theorem 2-5. Corollary 2-6. Under the conditions of Theorem 2-5, the highest weak order of convergence which can be achieved on a parallel pro- cessing system with N processing elements by rootfinding algorithms based on inverse Lagrange interpolation is N-l. This corollary shows that the Optimal order for this class of parallel rootfinding algorithms grows linearly with the machine size. We now compare the orders of convergence of parallel root- finding algorithms based on inverse Lagrange interpolation with the correSponding orders of convergence of the sequential one-point iterations with memory which also use inverse Lagrange interpolation. We have already seen, for example, that the parallel secant method (N B 3, and k = 2) increases the order of convergence of the usual secant method from 1L-'§-"C2N‘1.618 to 2. As noted in Section 1.10, [40, p. 106] shows that the order of convergence of sequential one- point iterations with memory m - k-l is < 2. Hence the use of parallel processing has increased the order of convergence from something less than two to k. 
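Theorem 2-5 can also be checked numerically. The following sketch, with N = 4 processing elements, k = 3 points per interpolation, the subsets A_j = {j, j+1, j+2} (mod N), and the test function f(x) = x³ - 2, is an illustration only; all of these choices are assumptions made for the example. The printed errors shrink roughly like e → e³, in agreement with the weak order k = 3.

```python
# Sketch (illustrative only) of the method of Theorem 2-5: each simulated
# processing element performs inverse Lagrange interpolation at the k points
# indexed by its subset A_j and evaluates the interpolant at y = 0.

def f(x):
    return x**3 - 2.0

def inverse_lagrange_at_zero(xs, ys):
    # value at y = 0 of the polynomial interpolating the data (ys[t], xs[t])
    total = 0.0
    for s in range(len(xs)):
        w = xs[s]
        for t in range(len(xs)):
            if t != s:
                w *= (0.0 - ys[t]) / (ys[s] - ys[t])
        total += w
    return total

N, k = 4, 3
root = 2.0 ** (1.0 / 3.0)
x = [1.2, 1.25, 1.3, 1.35]                         # N distinct starting points
subsets = [[(j + m) % N for m in range(k)] for j in range(N)]
for i in range(3):
    y = [f(t) for t in x]                          # one parallel evaluation of f
    x = [inverse_lagrange_at_zero([x[s] for s in A], [y[s] for s in A])
         for A in subsets]
    print(i, max(abs(t - root) for t in x))        # errors: ~1e-4, ~1e-11, ~1e-16
```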
It can be shown that Theorem 2-5 remains true if direct Lagrange interpolation is used instead of inverse Lagrange interpolation. However, as we saw in Section 1.9.5, the methods using inverse Lagrange interpolation are always preferred over those using direct Lagrange interpolation. For this reason the corresponding theorem for direct interpolation is omitted.

2.4 Efficiency

It was shown in Section 2.3 that a weak order of convergence of N-1 can be achieved on a parallel processing computer using N processors. This requires each processor to perform Lagrange interpolation on N-1 points. Although this interpolation is very costly for large N, we will show that if evaluations of f are expensive, the optimal efficiency for methods based on Lagrange interpolation is achieved by taking k = N-1.

Definition 2-2. Let φ_k be a parallel rootfinding algorithm based on inverse Lagrange interpolation as described in Theorem 2-5. Define

\[
P_k(f) = \mathrm{EFF}(\varphi_k, f) = \frac{\log_2 k}{\displaystyle\sum_{j \ge 0} v_j(\varphi_k)\, c(f^{(j)}) + a(\varphi_k)},
\qquad
P_*(f) = \sup_{2 \le k} P_k(f).
\]

We will use the results of [19], summarized in Theorems 1-1 and 1-2, to estimate the optimal efficiency for sequential computation.

Theorem 2-7. Assume that f is analytic in a neighborhood of a simple root α. Let φ_k be an inverse Lagrange interpolation method described in Theorem 2-5. Note that k is not restricted to be ≤ N-1. Then

i) IE(φ_k) = log₂ k.

ii)
\[
P_k(f) = \mathrm{EFF}(\varphi_k, f) = \frac{\log k}{c(f) + 2k^2 + k - 1}. \tag{2-10}
\]

iii) Let γ denote the unique root which is ≥ 1 of

\[
u(k) = (4k^2 + k)\ln k - (c + 2k^2 + k - 1) = 0. \tag{2-11}
\]

Then

\[
P_*(f) \approx \frac{1}{(4\gamma^2 + \gamma)\ln 2}. \tag{2-12}
\]

iv) Assume that 2 ≤ k ≤ N-1 and that c = c(f^{(i)}) for all i. For large values of c, the speed-up ratio of φ_k compared to the optimal sequential one-point methods without memory is asymptotic to

\[
\frac{3}{\ln 3}\left(\ln(N-1) - \tfrac{1}{2}\right), \qquad\text{as } N \to \infty.
\]

v) Assume that 2 ≤ k ≤ N-1. For large values of c = c(f), the speed-up ratio of φ_k compared to the optimal sequential multipoint iterations without memory using values of f only is asymptotic to

\[
\log(N-1) - \frac{1}{\ln 4}, \qquad\text{as } N \to \infty.
\]

Proof. i) Each iteration of φ_k requires one parallel evaluation of f and yields weak order of convergence k.

ii) Count the arithmetic operations in equation (2-6).

iii) To show that γ is unique, let

\[
u(t) = (4t^2 + t)\ln t - (c + 2t^2 + t - 1).
\]

Then

\[
u'(t) = (8t + 1)\ln t + (4t + 1) - (4t + 1) = (8t + 1)\ln t > 0, \qquad\text{for } t > 1.
\]

Now u(1) = -(c + 2) < 0, while lim_{t→∞} u(t) = +∞. Hence u(t) = 0 has a unique solution on the interval (1, +∞).

To find the value of k for which P_*(f) = sup_{2≤k} P_k(f) is attained, we observe that P₁(f) = 0 and that lim_{k→∞} P_k(f) = 0. Thus P_*(f) is attained for some finite k > 1. To find the optimal k, differentiate both sides of equation (2-10) and set the derivative equal to zero to get

\[
\frac{dP_k(f)}{dk} = \frac{\dfrac{c + 2k^2 + k - 1}{k \ln 2} - (4k + 1)\log k}{(c + 2k^2 + k - 1)^2} = 0.
\]

This implies that

\[
(4k^2 + k)\ln k - (c + 2k^2 + k - 1) = 0.
\]

γ is assumed to satisfy equation (2-11), so substituting (2-11) into (2-10) gives (2-12). Although k must be an integer, γ need not be. Hence P_*(f) is actually attained by P_k(f) for some k within one unit of γ.

Before we proceed with the proof of the rest of the theorem, we digress to consider the behavior of γ and P_k(f). Figure 2-3 shows how P_k(f) depends on k and on the cost of functional evaluation, c. Table 2-1 shows typical values of γ and the resulting P_*(f) for given values of c.

Table 2-1. P_*(f)

  c(f)      γ         k_opt     P_*(f)
  1         1.755     2         .100
  2         1.864     2         .091
  5         2.120     2         .071
  20        2.893     3         .040
  100       4.764     5         .015
  10^4      29.360    29        .00041
  10^6      225.36    225       7.1 × 10^-6
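A short computational sketch, assuming only the cost model of (2-10) and (2-11), reproduces the entries of Table 2-1; the bisection routine, the bracketing interval, and the choice of candidate integers are illustrative details, not part of the original analysis.

```python
# Sketch (illustrative): solve u(k) = (4k^2 + k) ln k - (c + 2k^2 + k - 1) = 0
# by bisection and evaluate P_k(f) = log2(k) / (c + 2k^2 + k - 1), as in
# Theorem 2-7 and Table 2-1.
import math

def u(k, c):
    return (4 * k * k + k) * math.log(k) - (c + 2 * k * k + k - 1)

def gamma(c):
    lo, hi = 1.0, 2.0
    while u(hi, c) < 0:                 # u(1) < 0 and u increases to +infinity
        hi *= 2.0
    for _ in range(100):                # bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if u(mid, c) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def P(k, c):
    return math.log2(k) / (c + 2 * k * k + k - 1)

for c in (1, 2, 5, 20, 100, 10**4, 10**6):
    g = gamma(c)
    candidates = {max(2, math.floor(g)), max(2, math.ceil(g))}
    k_opt = max(candidates, key=lambda k: P(k, c))   # integer within one of gamma
    print(c, round(g, 3), k_opt, P(k_opt, c))
```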
[Figure 2-3. P_k(f)]

Note that lim_{c→∞} γ = ∞, for if we let

\[
c = 1 + e^{n}\bigl(n - 1 + e^{n}(4n - 2)\bigr),
\]

then γ = eⁿ. Note also that P_*(f) is independent of the machine size N, so it cannot be attained if γ ≥ N. In practice, k should be chosen to satisfy

\[
k = \begin{cases}
N - 1, & \text{if } c > 3N - 2N^2 + (4N^2 - 7N + 3)\ln(N-1), \\[4pt]
\text{the integer within one of } \gamma \text{ which maximizes } P_k(f), & \text{otherwise.}
\end{cases}
\]

For ILLIAC IV, N = 64, so that if c > 58,038, γ cannot be attained. That is, unless 58,038 or more arithmetic operations are required for each evaluation of f (or its rational approximation), the optimal efficiency for the class of parallel rootfinding algorithms based on inverse Lagrange interpolation can be realized on ILLIAC IV. If c = 58,038, then P_*(f) ≈ 9.05 × 10^-5, E₁(f) ≈ 0.91 × 10^-5, and E₂(f) ≈ 1.72 × 10^-5, so all of these methods would require a long time to compute α to a specified accuracy, but the parallel method is slightly faster.

Let us now return to the proof of the theorem.

iv) Using equation (2-12) and the bounds for E₁(f) given in Theorem 1-1,

\[
\frac{3c + 2}{(4\gamma^2 + \gamma)(\ln 2)\log 3} \;\le\; \frac{P_*(f)}{E_1(f)} \;\le\; \frac{3c + 7}{(4\gamma^2 + \gamma)(\ln 2)\log 3}. \tag{2-13}
\]

From equation (2-11),

\[
c = (4\gamma^2 + \gamma)\ln\gamma - (2\gamma^2 + \gamma - 1). \tag{2-14}
\]

Substituting equation (2-14) into (2-13), we get

\[
\frac{3(4\gamma^2 + \gamma)\ln\gamma - 6\gamma^2 - 3\gamma + 5}{(4\gamma^2 + \gamma)\ln 3}
\;\le\; \frac{P_*(f)}{E_1(f)} \;\le\;
\frac{3(4\gamma^2 + \gamma)\ln\gamma - 6\gamma^2 - 3\gamma + 10}{(4\gamma^2 + \gamma)\ln 3},
\]

which implies that

\[
\frac{3}{\ln 3}\ln\gamma - \frac{6\gamma^2 + 3\gamma - 5}{(4\gamma^2 + \gamma)\ln 3}
\;\le\; \frac{P_*(f)}{E_1(f)} \;\le\;
\frac{3}{\ln 3}\ln\gamma - \frac{6\gamma^2 + 3\gamma - 10}{(4\gamma^2 + \gamma)\ln 3}. \tag{2-15}
\]

We have seen that lim_{c→∞} γ = ∞, so that if c is large enough, the most efficient Lagrange interpolation method which can be implemented on a machine with N PE's uses k = N-1 points. As larger machines are used, the optimal efficiency P_*(f) is approached. Hence (2-15) shows that P_*(f)/E₁(f) is asymptotic to

\[
\frac{3}{\ln 3}\left(\ln(N-1) - \tfrac{1}{2}\right), \qquad\text{as } N \to \infty.
\]

v) Similarly, using equation (2-12) and the bounds for E₂(f) given in Theorem 1-2,

\[
\frac{c}{(4\gamma^2 + \gamma)\ln 2} \;\le\; \frac{P_*(f)}{E_2(f)} \;\le\; \frac{c}{\left(1 + \dfrac{\xi}{\sqrt{c}}\right)(4\gamma^2 + \gamma)\ln 2},
\]

where ξ is a negative constant. This implies that

\[
\log\gamma - \frac{2\gamma^2 + \gamma - 1}{(4\gamma^2 + \gamma)\ln 2}
\;\le\; \frac{P_*(f)}{E_2(f)} \;\le\;
\frac{1}{1 + \xi/\sqrt{c}}\left[\log\gamma - \frac{2\gamma^2 + \gamma - 1}{(4\gamma^2 + \gamma)\ln 2}\right],
\]

so that P_*(f)/E₂(f) is asymptotic to

\[
\log(N-1) - \frac{1}{\ln 4}, \qquad\text{as } N \to \infty. \qquad\blacksquare
\]

The important conclusions from Theorem 2-7 are parts iv) and v). They show that if the cost of functional evaluation is so high that γ ≥ N, then the speed-up of parallel rootfinding algorithms based on Lagrange interpolation is proportional to log₂ of the machine size. If it is important to find the root α as fast as possible, as in real-time applications, these methods are useful because they are faster. If, however, one wishes to compute α at a low cost, the speed-up ratio of log N shows that these methods are not well suited for implementation on a parallel processing system.

CHAPTER III

ROOTFINDING BY HERMITE INTERPOLATION

In this chapter we generalize the results of Chapter II on Lagrange interpolation to include inverse Hermite interpolation. As in the Lagrange case, the order of convergence is equal to the amount of information used. That is, while Lagrange interpolation methods use values of the function f evaluated at k points to achieve order k, Hermite interpolation methods use f, f', ..., f^{(r)}, each evaluated at k points, to achieve weak order of convergence k(r+1). If a parallel processing system has N processing elements, we will also show that the speed-up ratio possible by using Hermite interpolation is proportional to log N. This implies that parallel rootfinding algorithms based on Hermite interpolation are not well suited to parallel processing.
The reader is referred to [30] or [40] for a discussion of sequential rootfinding algorithms using Hermite interpolation. 3.1 Order of Convergence The main result of this section is a generalization of Theorem 2-5 to show that the weak order of convergence of parallel rootfinding algorithms based on inverse Hermite interpolation of f,f',...,f(r) at k points is k(r+l). Recall (see Definition 2-1) that Gr is the class of all one-point parallel iterative rootfinding methods without memory 63 ' 64 which require that f,f',...,f(r) be evaluated at the components of X1. Let us also recall the notation needed to construct parallel rootfinding algorithms based on inverse Hermite interpolation. Fix k satisfying 2 s k s N-l. Choose N distinct subsets Aj, j = 1,2,...,N, of the set {1,2,...,N] such that j E Aj’ and Aj contains k elements. Theorem 3-1. Let y = k(r+l). Assume that f E CY on some Open interval containing the simple root a, and on which f' # 0. Let f.1 be denoted by g. Then g E CY on some open interval con- taining 0. Assume that g(Y) * O in a neighborhood of O. For each j = 1,2,...,N, let H (y) denote the Hermite interpolation J polynomial of degree at most v-l which satisfies ) = g(t)(y1 s), for t = 0,...,r, and s 6 A j ”1,. 1 Let i E G be the method which defines k r x141,j = Hj(0) . If distinct points x0’1,x0’2,...,x0’N are chosen sufficiently close to a, then either x = a for some i and j, or 1,1 lim.x = a , for j = 1,2,...,N 1,1 i—oa Thus the method 'k converges. Moreover, 'k has weak order of convergence y. The proof of this theorem follows closely the proof of Theorem 2-5. As before, we will suppress the subscript i to Simplify the notation. and we hood 0: all di: otherw the in where Now Where ficien 65 Proof. We may assume yt i O, for if yt = o = a, X ’ i,t and we are done. we have assumed that f.1 exists in a neighbor- hood Of a, so y1,...,yN are all distinct if x, are 1,l""’xi,N all distinct. If x, are identical, the method fails, 1,1,...,xi,N otherwise, if = x t’ we may replace xi 3 by any point on 9 x1,8 1 9 the interval Spanned by the x's. Hj interpolates g, so according to Lemma 2-2, M r+1 g(y) - Hjm = a_!_m)_ n (y - y ) . (34) y s 86Aj where n lies in the interval Spanned by y, ys, for s E Aj. Now Y8 = f(xi S) - f(a) _ _ O - (xi,s a)f (Es) .- I - -(a - xi,8)/g (T‘s) _ _ I — €1,8/8 (m) 9 where g8 lies between xi 8 and a, and “5 lies between 0 and ya. Then, substituting zero into equation (3-1), ei+1,j = a ' Xi+1,j (v) 1.4.1 =S_Y_!_(fll I'I (7Y8) 36Aj (Y) = A ”D m n (313)”1 . (M) v! H [8'(fls)] SEAj ’ 86Aj If N distinct points x0’1,x0’2,...,x0’N are chosen suf- ficiently close to a, equation (3-2) implies that $0 for The! lute} 1116mm- 66 11m 31’s = O, for 8 = 1,2,...,N , i-m so 'k converges. Equation (3-2) also implies that 31 s # 0 for all i and 3 since we assumed that g(Y) # 0 near 0. Let K = £0011ll 1’j v! n [g'3 SEAj r+1 Then (Y) limKij=—3——L—)—O ,40 1.. ’ Mg'm):Y ’ tear- ...m "an—- so the hypotheses of Lemma 2-3 are satisfied with s = r+l. It follows from Corollary 2-4 that 'k has weak order of convergence v. ' Since k, the number of points in Aj on which the inter- polation is performed, is constrained to lie between 2 and N-l, the following corollary follows easily from Theorem 3-1. Corollary 3-2. Under the conditions of Theorem 3-1, the highest weak order of convergence which can be achieved by an inverse Hermite interpolation method in G1. on a parallel processing computer with N PE's is (N-l)(r+l). 
This order is achieved by ψ_{N-1}. We remark that the sequential one-point iterations with memory m = k-1 which use inverse Hermite interpolation and values of f, f', ..., f^{(r)} have strong order of convergence between r and r+1, so the use of parallel processing has greatly improved the orders of convergence of these methods.

As in the case for Lagrange interpolation, a theorem corresponding to Theorem 3-1 can be proven for direct Hermite interpolation. As before, that theorem is omitted because direct interpolation methods are of little practical importance.

3.2 Example of an Hermite Interpolation Method

Consider the case N = 3, k = 2, r = 1. For simplicity, we consider only the computations done by the second processing element, PE 2. The computations performed by the other two PE's are identical, except that they use different points. We suppress the subscript i to simplify the notation.

Let A₂ = {1, 2}, so that PE 2 uses x₁ = x_{i,1} and x₂ = x_{i,2} to compute x_{i+1,2}. The values y₁ = f(x₁), y₁' = f'(x₁), y₂ = f(x₂), and y₂' = f'(x₂) have been previously computed. The inverse interpolation polynomial of degree 3 which agrees with this data is given by

\[
\begin{aligned}
H_2(y) ={}& \left[1 + \frac{2(y - y_1)}{y_2 - y_1}\right]\left[\frac{y - y_2}{y_1 - y_2}\right]^2 x_1
+ (y - y_1)\left[\frac{y - y_2}{y_1 - y_2}\right]^2 \frac{1}{y_1'} \\
&+ \left[1 + \frac{2(y - y_2)}{y_1 - y_2}\right]\left[\frac{y - y_1}{y_2 - y_1}\right]^2 x_2
+ (y - y_2)\left[\frac{y - y_1}{y_2 - y_1}\right]^2 \frac{1}{y_2'}. \qquad (3\text{-}3)
\end{aligned}
\]

Then

\[
x_{i+1,2} = H_2(0)
= \frac{(3y_1 - y_2)\,y_2^2\,x_1 + (y_1 - 3y_2)\,y_1^2\,x_2}{(y_1 - y_2)^3}
- \frac{y_1 y_2}{(y_1 - y_2)^2}\left[\frac{y_2}{y_1'} + \frac{y_1}{y_2'}\right]. \tag{3-4}
\]

Figure 3-1 illustrates the graphical interpretation of the method of equation (3-4).

[Figure 3-1. Inverse Hermite Interpolation by One PE]

According to Theorem 3-1, this method has weak order of convergence 4. Note that it requires one evaluation of f and one evaluation of f' by each processing element, so the information efficiency is

\[
\mathrm{IE} = \frac{\log 4}{2} = 1.
\]

Equation (3-4) requires 22 arithmetic operations (if (y₁ - y₂)² is stored during the computation of (y₁ - y₂)³), so the efficiency is

\[
\mathrm{EFF} = \frac{2}{c(f) + c(f') + 22}.
\]

The sequential method which this method generalizes has order of convergence 1 + √3 ≈ 2.732 [40, p. 66], information efficiency IE ≈ 0.726, and efficiency

\[
\mathrm{EFF} \approx \frac{0.726}{c(f) + c(f') + 22}.
\]

Thus the use of parallel processing has improved each of these measures. In Section 3.3, it will be shown that the speed-up ratio which can be achieved by methods like this one, which are based on Hermite interpolation, is proportional to log N, so that they are not well suited to implementation on a parallel processing computer.

3.3 Efficiency

It was shown in Corollary 3-2 that the highest weak order of convergence which can be achieved on a parallel processing computer using N processing elements by inverse Hermite interpolation methods is (N-1)(r+1). This order is achieved by methods which use values of f, f', ..., f^{(r)} at N-1 components of the N-dimensional vector X_i. It is very costly to perform Hermite interpolation of f and its first r derivatives at N-1 points, so one might expect these methods to be inefficient. For example, the simple example of such methods constructed in Section 3.2 required 22 arithmetic operations.
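A minimal numerical sketch of that single-PE update (3-4), assuming the illustrative test function f(x) = x³ - 2 and two nearby starting points, shows the sharp (order-4) error reduction noted in Section 3.2; the function, the points, and the printed figures are assumptions made only for the example.

```python
# Sketch (illustrative only) of one PE's update (3-4): inverse Hermite
# interpolation of f and f' at two points, evaluated at y = 0.

def f(x):
    return x**3 - 2.0

def fprime(x):
    return 3.0 * x**2

def hermite_update(x1, x2):
    y1, y2 = f(x1), f(x2)
    yp1, yp2 = fprime(x1), fprime(x2)
    d = y1 - y2
    # equation (3-4): H_2(0)
    return (((3.0 * y1 - y2) * y2**2 * x1 + (y1 - 3.0 * y2) * y1**2 * x2) / d**3
            - (y1 * y2 / d**2) * (y2 / yp1 + y1 / yp2))

root = 2.0 ** (1.0 / 3.0)
x_new = hermite_update(1.2, 1.3)
print(x_new, abs(x_new - root))   # error drops from about 6e-2 to about 3e-5
```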
Although the interpolation becomes more and more expensive as the number of points used increases, we will show that if a very large number of arithmetic operations are required for each evaluation of f,f',...,f(r>, then the most efficient Hermite interpolation method which can be implemented on a given machine is a method which per- forms interpolation at N-l points. That result will be used to show that the best Speed-up ratio which can be achieved by parallel rootfinding methods using Hermite interpolation is proportional to log'N. Hence these methods are poorly suited for implementation on parallel processing computers. Consider the inverse Hermite interpolation method 'k for parallel rootfinding described in Theorem 3-1. Assume that f is 70 analytic so that N, and hence k, may be chosen as large as we wish. Definition 3-1. Define log2 k(r+l) Pr,k(f) = EFF(¢k:f) = I. (i) 9 2 c(f ) + a i=0 where c(f(1)) denotes the number of arithmetic operations needed to compute f(l), and a denotes the number of parallel arithmetic Operations necessary to compute Xi+1. Also define P *(f) = sup Pr k(f) r 3 2‘k 9 If r = 0, so Lagrange interpolation is being used, P (f) = Pk(f), and P0,*(f) = P*(f) which were defined for 0,k Lagrange interpolation in Definition 2-2. N (k N {1,2,...,N], so there are the same number of possible choices for There are ( )) possible choices of N subsets of 'k' In spite of this, Pr,k(f) is well-defined since the possible choices for 'k differ only in their labeling of the points. Hence their order and computational complexities are the same. P ,*(f) is the Optimal efficiency of all inverse Hermite interpolation methods for parallel rootfinding. We will later ask whether N is large enough to allow Pr,*(f) to be achieved. As in Theorem 2-7, we will compare the efficiencies of parallel inverse Hermite interpolation methods with the sequential one-point and multipoint methods given by Traub in [19]. As in Section 2.4, let E1(f) denote the optimal efficiency of the class of sequential one-point iterations without memory. Let E2(f) 71 denote the Optimal efficiency of the class of sequential multipoint iterations without memory which use values of f only. We restate the following bounds from Theorem 1-1 and 1-2: log 3 - c(f) + c(f') + c(f") + 7 SE1“) 5 igg___§ (3 5) l g f 3- c(f)1+/‘;(5‘5E2()‘—_c(f) (6) Here c(f(i)) is the number of arithmetic operations needed to evaluate f(i) 1 = min c(f(i)), and g is some negative constant. 120 Theorem 3-3. Assume that f is analytic in a neighborhood of a , C simple root a. Let 'k be an inverse Hermite interpolation method given in Theorem 3-1. Let r 2 0 be fixed. Let c1 = min c(f(i)), and c2 = max c(f(i)) , Osisr Osisr Then logrk(r+1) i) IE('k)= r+1 ii) There exist positive constants pl and p2 such that k.r+ 2 (r+l)c2 + p2k (r+1)c1 + plk ln k for k sufficiently large. iii) Let k1 and k2 denote the unique solutions to u1(k) = k ln k (2 + lnzk)ln k(r+l) - (r+l)c1 p1 -pk1n2k=0 1 9 (3-8) and 112(k) = 292(k2)21n k(r+l) - (r+l)c2 - p2(k2)2 = 0 , (3-9) ‘flflfi"‘m”“m’"fllll' which 1 the 81 point Spee< poin k(H f,f‘ tiOt ‘hnp‘ to‘ Goth] 72 which satisfy k1 2 l, and k2 2 l/(r+l), reSpectively. Then I 1 51> *(f) s . (340) 292(k2)21n 2 r» 01(2 +'ln k1)k1 ln k1 1n 2 iv) Assume that 2 s k s N-l. If c1 is large and c1 = c2, the speed-up ratio of 'k compared to the optimal sequential one- then point methods without memory approaches 3 (;:l)l;—§iln (N-l), as N «'m . (3-11) v) .Assume that 2 s k s N-l. 
For large values of c(f) = c, the Speed-up ratio of *k compared to the optimal sequential multi- point iteration without memory using values Of f only approaches In gN-lz _ (r+l)ln 2 , as N a m . (3 12) Proof. i) By Theorem 3-1, 'k has order of convergence k(r+l). One iteration of yk requires the evaluation of f,f',...,f(r) at each of the components of X Hence r+l func- i. tional evaluations are necessary. ii) By definition, c s c(£(')) s 1 for 0 s i s r , C2 , implies that r (1) (r+1)c S 2 c(f ) s (r+1)c . (3-13) 1 2 i=0 Let a denote the number of arithmetic operations necessary to compute x1+1 J’ once f,f',-o-,f(r) have been evaluated at the components of X1. In [38], Strassen shows that interpolation on 73 2 k points requires at least 0(k ln k) operations. Several algorithms have been devised which perform interpolation at k 2 points in 0(k ) operations. Hence there exist positive con- stants and p2 such that for k sufficiently large, p1 p1 Applying the lower and upper bounds given by equations (3-13) and (3-14) to the definition of Pr k(f) gives (3-7). iii) We first Show that u1(k) has a unique solution 2 1. Let u1(t) = plt In t (2 + 1n2 t)ln t(r+l) 2 - (r+1)c1 - plt In t 2p t In t 1n t(r+l) - (r+1)c1 l 2 + plt In t (In t(r+l) - 1) Then -(r+l)c1 < 0, and lim u1(t) = +m , tfim u1(1) so there exists a solution k1 2 l to equation (3-8). To see that k1 is unique, notice that , 2 u1(t) = p1(2 + 4 ln t +'ln t)1n t(r+l) > 0 , for t > 1 Similarly, to show that u2(t) = 0 has a unique solution, we observe that 2 2 k 1n k s a s pzk . (3-14) "“m‘? and Hem $0 the 0cm Subs (3-1 toe SUbs and 74 p2 (rm2 u2(1/(r+l)) = —(r+1)c2 - < 0 , and that lhn u2(t) =‘+m . t-m Hence equation (3—9) has a solution k2 > l/(r+l). But 1 ... u2(t) — 4,2: In t(r+l) > 0 , for t > 1/(r+l) , so k2 is unique. To show (3-10), we take suprema of (3-7). The maxima of the expressions log k(r+l) and lgg k(r+l) , 2 (r+1)c2 + 62k (r+1)c1 + plk ln k occur at k2 and k1, reSpectively. Hence log k2(r+l) log k1(r+1) 2$131.41?) s 2 (r+1)c2 +'p2(k2) (r+1)c1 +~p1k1 ln k . (3-15) 1 Substituting equations (3-8) and (3-9) into the denominators of (3-15) gives (3-10). iv) If we assume that all derivatives are equally costly to evaluate, then c1 = c2 = c. Dividing (3-10) by (3-5) gives 3c+2 spree“); 3c+7 2 2p2(k2) 1n 3 E1(f) 91(2 + 1n k1)k1 ln k In 3 1 Substitution for c in terms of k1 and R from equations (3-8) 2 and (3-9) gives 75 3[2p2(k2)21n 1.2mm - 620.2)2] + 2(r+1) Pr *(f) 2 f (r+l)2p2(k2) ln 3 31‘ ) 2 S 3[p1(2 + 1n k1)k1 1n klln k1(r+l) - plkl ln k1] +-7(r+1) (r+l)p1(2 + 1n k1)k1 ln k1 1n 3 so that 3 1“ k2 + 6 1n(r+1) - 3 + 1 ! ..‘ "" ' 2 (r+1)1n 3 2(r+l)ln 3 p2(k2) ln 3 E P (f) 3 ln 1: * -1 ‘ ‘2'??? ‘ (1+1) 1:173— ‘ 32 Q! l (r+l)(i;-Ef'+ l)ln 3 . 1 3 lngr+lz 7 _ + (r+1)1n 3 + 91(2 + 1n k1)k1 1n k1 ln 3 ° (3 16) From equations (3-8) and (3—9), lim k1 = lim k2 = a , c *w *m 1 c2 so for large values of c, (3-16) shows that the Speed-up ratio is approximately bounded below by 3 ln k2 (1+1) 1n 3 ’ and above by 3 1n k1 (r+l)1n 3 For a parallel processing computer with N PE's, if c is so large that k2 2 N-l, then the most efficient Hermite interpolation method which can be implemented uses k -‘N-l points. As larger machines are used, the optimal efficiency Pr f) is approached. ’*( 76 P (f) *- Hence as N .. an, 1%???- is approximately bounded above and below by l 3 ln,(N-l) (r+l) ln 3 v) The proof of v) is similar to the proof of iv) given above. Dividing (3-10) by (3-6) gives Pr * s—-’-——-s E2(f) C 2 2p2(k2) ln 2 C .. §_. 
2 (1 +'/t> p1(2 +~ln k1)k1 1n k1 ln Substituting for c in terms of k1 and k2 from equations (3-8) and (3-9) and simplifying, we get 1n k2 + 21n(r+1) _ 1 g Pr,*(f) (r+l)ln 2 2(r+l)ln 2 E2(f) ln k s l + ln(r+l) (1 +§:)(r+l)ln 2 (1 + §C-)(r+1)1n 2 1 EL. /2)‘r*1)( 2 (3-17) + l)ln 2 1n k (l + 1 If c is large, k1 and k2 are also large, so the Speed-up ratio is approximately bounded below and above by 1n k2 1n k1» d (r+l) ln 2 ’ 8“ (r+l) 1n 2 ’ respectively. Hence for large values of c, the Speed-up ratio behaves like In (N-I) d (r+l) 1n 2 ’ as N ° ' l 77 The important conclusions from Theorem 3-3 are contained in parts iv) and v). Taken together they show that if parallel root- finding methods based on Hermite interpolation are compared to either optimal sequential one-point iterations without memory or to optimal sequential multipoint iterations without memory which use values of f only, the Speed-up ratio is prOportional to the log2 of the machine size. Hence these methods are not well suited for implementation on a parallel processing system. 3.4 Comparison of Hermite and Lagrange Methods In Section 3.3, we studied the efficiency of inverse Hermite interpolation methods for parallel rootfinding which use f,f',...,f(r) at k points to compute In that section, we considered Xi+1,j ' the effect of increasing k to find the Speed-up ratios of the methods. In this section, we suppose that k is fixed and let r vary to determine the optimal number of derivatives to be used for the Hermite interpolation. Let ¢h denote the parallel rootfinding method based on in- verse Lagrange interpolation described in Theorem 2-5. Let 'k,r denote the method based on inverse Hermite interpolation described in Theorem 3-1. Note that ¢k is lk,0' We will show that ¢k is more efficient than 'k,r' Theorem 3-4. .Assume that f is analytic in a neighborhood of a simple root a. Then IE(wk’r) s IE(qk), for all r 2 0 If c(f) s c(f(i)), for o s i s r, then for r 2 0, *“finl-r .i F H. 78 EFF(fik,r) S EFF(qk) . Proof. Recall that log2 (order of convergence of q» IE(qD = number of parallel functional evaluations According to Theorem 3-3, E , ___ tress-ta L1,; I (wk’r r+l . , To find the highest possible IE, we differentiate with reSpect to 7 r to get F r+l dr (r+1)2 ln 2 g 1 - ln k(rtl) 2 (r+1) 1n 2 Hence diz < 0 if and only if k(r+l) >-e. Recall that d:E < 0 unless k = 2 and r = 0. (Notice that k a 2 and r B 0 characterizes the parallel secant ZsksN-l and r20,so method.) In the case k - 2, IE - 1 for both r = 0 and r = 1. For all other values of k and r, IE is a decreasing function of r, so for any fixed k 2 2, the largest IE is always attained by ¢k' Hence IE('k,r) = 1° rfir+1 s 16g k a IE(¢i) . (3-18) That is, Lagrange interpolation on k points has a higher IE than any other Hermite interpolation method on the same k points. We now turn to the efficiences of the two methods. By assumption c(f) s c(f(1)), for 0 s i s r, so from (3-18), 79 log k 2 10g k(r+l) C(f) (r+1)C(f) 2 log k(r+l) 1' z c(f‘”) t=0 Let a(qD denote the number of parallel arithmetic opera- tions used to combine the functional values to compute Xi+1. Clearly a(cpk) s a“:k r) , so that logik(r+l) EFF(il:k r) = r z c(fm) + aw ) t=0 k’r log k(r+l) _ c(f)+8( 1+1 1 §,(xi) where (xi- xi-l)f(xi-2) (xi ' “1-2)f(xi-1) + ' (xi-2 ‘ xi-1)("i-2 ' xi) (xi-1 ' xi-2)("i-1 ' xi) f(xi) f(xi) x -.. +—.—;— - (44> i i-2 xi i-l Equation (4-2) is obtained by differentiating the three point Lagrange interpolation polynomial, C(x), to f at x d 1-2, Xi_1, an x According to Lemma 2-2, 1. 
E(x) = f(x) - C(x) = %(x - xi_2)(x - x1_1)(x - xi)fm(§(x)), (4-3) where g(x) lies on the interval spanned by x, x1_2, x , and i-l 82 x1. If f 6 Civ in a neighborhood of a, differentiating both sides of equation (4-3) and substituting x1 into the resulting equation gives the well-known estimate E'(x ) = 11x - x )(x - x )fm(§ ) (4-4) i 6 i i-2 i i-l i ’ where §i = g(xi). Let F(xi) = x1 - f(xi)/f'(xi) , and H(xi) = x1+1 = x1 — f(xi)/f'(xi) , where f'(xi) is given by equation (4-2). Now F is the usual Newton-Raphson method, so it has error given by 01 - mi) = some - x1>2/f'f(xi,3) + f("1,1) (Xi,3 ' Xi,2)(xi,3 ' Xi,l) x1,1 ' Xi,3 Let x141,2 and Xi+l,3 be defined Similarly. To compute the order of convergence of this parallel method, let ei,j = a - Xi,j° Then equation (4-7) becomes H — f ('1) < )2 ei+1,1 2f'(x ) 81,1 1,1 f'(n1)f”’<§i 1> ’ - 1‘ 6. (e. - e. )(e. - e. )5 (4-9) 6fl(xi 1)fv(xi 1) 1,1 1,2 1,1 1,3 1,1 with similar eXpressions holding for and This ei+1,2 e’i+1,3‘ implies that if x 2, and x are chosen sufficiently 0,1’ x0 0,3 close to a, the method converges. We will Show in Theorem 4-1 that this method has order of convergence 2. One parallel functional evaluation and 16 arithmetic Operations are necessary at each iteration, so the information efficiency is and the efficiency is 85 l EFF “ c(f) + 16 ° Note that the use of parallel processing has improved the order, and hence the efficiency and the information efficiency, of the sequential method given by equation (4-1). Although using an approximation for the value of f'(xi) needed for the usual Newton-Raphson method, F, lowered the order of convergence from 2 to 1.84 in the sequential case, using a similar approximation did not affect the order of convergence of the Newton-Raphson method in the parallel case. We will Show that in many cases, an estimation may be used for the values of a derivative without lowering the order of the method. One of the many variations possible on this example is of interest. Instead of using the data at three points to compute the approximation f'(xi 1) to f'(xi 1), we could use just f(xi 1) and f(xi,2). Then . f("i,z) ' f(xi,l) f0‘11)“ x -x ° ’ 1,2 1,1 This is just the parallel secant method introduced in Section 1.4. Hence the parallel secant method may be viewed either as an example of a parallel method based on Lagrange interpolation, or as a derivative-estimated method based on the Newton-Raphson method. 4.2 Order of Convergence In this section we will indicate that if enough points are used to estimate values of the highest derivative of f needed by certain rootfinding methods, the order of convergence is unchanged. 86 As in Chapters II and III, we assume that the analytic function f has an inverse, denoted by g, in a neighborhood of the simple root 0. Consider the following parallel generalization of the Newton- Raphson method. Let x be initial approximations O’1,x0,2,...,x0’N chosen sufficiently close to a. ASSume that g E Cr+1 near 0. For each j = 1,2,...,N, let Fj(y) be the Hermite interpolation polynomial of degree at most r which satisfies Fj(S) ) = 8(8) (yi,j (yi,j)’ for s=0,...,r . Define xi+1,j g FJ(O) Although this is a parallel method, each processor is acting in- dependently of the others Since x depends only on x i+l,j i,j° Thus each processing element is performing a generalized Newton- Raphson iteration which is well known to have order of convergence r+l (see [40, Chapter 3]). 
Let us set the notation which is needed to construct a derivative-estimated method based on this parallel generalization of the Newton-Raphson method. Let u 2 2 be fixed. Choose N subsets, not necessarily distinct, of {1,2,...,N], each cOntaining u elements. Denote the jth subset by B , and assume j E B J J. Let Gj(y) be the Hermite interpolation polynomial of degree at most ur-l which interpolates (8) (8) G1 (yi t) - s (yi’t). for 8 - 0,...,r-l, t 6 B1 . 87 (r) We will use 0 (r) (Yi j J which is used in Fj' Theorem 4-1. Let f 6 Cur in a neighborhood of the Simple root (yi j) to approximate the value of g ) a. Let Fj and G1 be as defined above. Let Hj(y) be the Hermite interpolation polynomial of degree at most r which satisfies (8) = (S) = - Hj (yi j) 8 (Y1 j) . for s 0. .r 1 . (r) _ (r) Hj (y1 j) - Gj (yi j) Define If the N components of X0 are chosen sufficiently close to a, this method converges and has Strong order of convergence of at least r+l. The proof of this theorem is a generalization of the corresponding sequential case given in [40] by Traub. The first Of two lemmas needed for this proof is a well-known generalization of Rolle's Theorem [40, p. 252]. k Lemma 4-2. Let q 8 Z Yj' Let W E C q (I), where I is an j=0 interval which contains the points x which are zeros of ‘W of J multiplicity Yj’ reSpectively, for j = 0,1,...,k. Then W(1) has at least q-i zeros counting multiplicity, for i = 0,1,...,q-1, on I. In particular, there exists g 6 I such that W(q'1)(§) = 0. The second lemma deals with the error in estimating derivatives by Hermite interpolation. we include the proof for completeness. “1 ii: r" '9 4...... ”1|— " -.-...s 88 Lemma 4-3 [40, p. 252]. Let S = ru, and let g E C s (I), where I is an interval containing the distinct points y1,y2,...,yu. Let C(y) be the Hermite interpolation polynomial of degree at most s-l satisfying C(i)(yj) = g(i)(yj) , for j = l,...,u, and i = 0,...,r-l . Then (r) (r) _ EL (s) “ _ r 8 (5'1) -G (y1)-S, s ($111201 yj) . where g 6 I. Proof. Define u r P(t) = H (t - y ) i=1 3 The r th derivative of P, U P(r)(y1) = I“ n (Y1 - yj)r * 0 9 j: so we may define g(t) (Y1: - c(f) (9'1) x = (r) P (5'1) Let W(t) = g(t) - C(t) - AP(t). Then W(t) has a zero of multiplicity. r+1 at yl, and a zero of multiplicity r at y29y3,...,yu. Hence W has at least s+l zeros, so by Lemma 4-2, (8) there exists g E I such that W (g) = 0. Since C(t) is a polynomial of degree at most 8-1, (8) o = w (t) = g(s’<§> - 18! W9 ,4 4 .Jlmfnruimz 2 89 Then (r) , (r) g (r) 8 (VI) G (5'1) 11’ (3'1) (5 ) (r )y .r_!_ (s) r =8'8 (§)H(Y1-Y Y) .' J=2 3 We are now prepared to prove Theorem 4-1. The subscript i will be suppressed to Simplify the notation. (1‘) Proof. Fj(y) interpolates g,g',...,g at yj = y1 j’ so it must have the form (8) r 8 (y) Fj(y) = 2 —-S—,-J—(y -yj) . s=0 with error given by (”1) 1+1 8(0) - Fj(0) = (rd-l)! (-yj) . where 0 lies between 0 and y . Similarly, Hj(y) is given by J (s) (r) r-l s (y) G (V) Hj(}') = 2:0 --;,—-'— (y - yj)s + J—g—L (y - yj)r 8: From Lemm 4-3 we have (r) (r) g(ur) r - G - , 4-10 g (yj) j (91 (it), (e )tggjyj yt> < > ta‘j where 91 lies on the interval Spanned by [yt\t € Bj]. Now 8(0) - H (0) ‘i+1,j' °’ ' x1+1.j '3 J = 8(0) - 171(0) + Fj(0) - 111(0) w.- a Lenllna n‘.‘ Q I 90 r 9 (4-11) (4-12) ...;L__ (r) - (r) (r+1)! ( yJ) +' r! [8 (yj) Gj (Yj)l (”1) ( )r~+l (1+1)! VJ (-y >r i r! (ur) r + r! (ur)! g (9.1)th (yj - yt) J ta‘j r+1 = g(r+1)(e) (61.1) (”1" Leap)“ (e )r g(ur)(9 ) 1L1 1 I U [f (k )(x ' X ) (lit)! 
[g'(§j)]r CEBJ 39': j t ta‘j s where 51 is between 0 and y], and Yj ' Yt = f(xj) ' f(xt) = f'()‘j’t)(xj - xt), where xj,t lies between xj and xt. Let M =L1r+' and . g("r)(ei)2r(“'1) [ )1‘ = o n f'(x “2.1 (ur),[g.(§j)]r+1 cesj 3,: ti‘j Then from equation (4-11), — M ( )r+1 + “211(61’j)r r ems " 1.1 “1.1 2z-(u-n “ (61,: ' 31.1) ° tEBj ti‘j Let 91 M.1 I mjx 'Ml,j' M2 = mix JM2,3‘ Then if bi = max \8 l. j i.J Jei+1,j\ ‘ M1(51)r+1 + M2051)r n (61)r tEBj t¢j r+l ur s M1(61) + M2(61) . Now (r+1) lim M1 = 3 (01 H1 , and 1% (1+1)! [8 ' (0)] 11 g g(ur)(0) 2r(u-l) m r i-a (ur)! [8 ' (0)] Hence M1 and M2 are bounded by M1 and D71, reSpectively, for all i. If the N components of X are chosen sufficiently 0 close to a, then '- r+1 -' ur 81+1 s 141(61) + M051) . so 11111 61 = 0 l-cco and the method converges. Since 2 s u, the inequality 5i+1 — s M + H(a )ur-r-l i (61) r+l 1 implies that 92 5 ._ lim sup "";i},'+—1' 5M1 9 iaa (6.) 1 showing that the method has Strong order of convergence at least r+l. I We are unable to prove the exact order of convergence because of the form of equation (4-12). However, it is our conjecture that: Conjecture. The parallel derivative-estimated method given in ' .. 1 Theorem 4-1 which defines x1+1 j = Hj(0) has order of convergence exactly 'r+1. We have Shown that the order of convergence of this method 3"- ”1313—2. ... .2 T": “F l I is at least r+l. Recall that the method from which it is derived (r g ) uses and has order of convergence n+1. It seems highly (r) (r) unlikely that using an approximation for g instead of g itself could make the method converge more rapidly for a general choice of the function f. How many points u should be used for the Hermite interpola- tion from which the derivative is estimated? Theorem 4-1 showed that the order of convergence of the sequential generalized Newton- (r) (r) j 9 where Gj is the Hermite interpolation polynomial which agrees (r-l) Raphson method is not lowered if g is approximated by G with g,g',...,g at u 2 2 points. Notice that u does not effect the order of convergence, so we Should use just two points in order to minimize the effort needed to compute G . J It does not appear that the stepping down from r to r-l derivatives in Theorem 4-1 can be repeated since that would require an analogue for Lemma 4-3 giving an estimate for g - Fj = K. . n (... f“ 9.] 19m m€Aj By Lemma 4-3, (r) _ (r) = r! (ur) _ r 8 (71) G1 (yj) (ur), 8 (83) cg; (yj yt) , J t¥j where 91 lies on the interval Spanned by {yt\t 6 BJ]. As in the proof of Theorem 4-1, ei-+1,j = a ' xi+1,j = 3(0) ’ Hj(0) 1‘ r ’ Ki,j mgAj(ei’ ) + Mi,j(€i.j) tghj(€i,t 61.1) ’ ti‘j where M g(ur)(ej) 2r(u-l) n ( 1 )r "3 (ur)![8'(§j)]r+1 teaJ 3“”) tfij lies between 0 and .; the proof of Theorem 4-1, it was shown that ‘Mi j' s M, so that 9 lies between yj and yt. In r+l - r r Jei+1,j' ‘ngAj'ei’m\ +M 'ei,j tEBJ'ei’t ' 61.j' t¥j s K(61)k(r+1) +1701)“r 96 Hence k(r+1) 8 s 1((61) + R61)“ . (4-13) r+l If ,...,x are chosen sufficiently close to a. then x0 ,1 “0,2 (4-13) implies that 0,N lim 6i = 0 i-«n Thus the method converges. MOreover, since y = min{k(r+1), ur], the inequality 6 r, _. _ (6 )Y 1 1 1 implies that 5 lim sup < m . 1-«9 (61) Thus the method has strong order of convergence at least v. . we conjecture that if g(ur)(0) f 0 and g(kr)(0) # 0, then these methods have strong order of convergence exactly y. In that case, the order of convergence of a derived method may equal, but may not exceed, the order of convergence of its parent method. 
Since ' k(r+l), if u 2 k(r+l)/r y = min{k(r+l), ur] 3 ur, if u s k(r+l)/r , the orders of the derived and the parent methods are equal when- ever [k(r+l)/r] or more points are used to estimate the values of g(t). For the method considered in Theorem 4-1, k = l, r 2 l, 97 so (r+l)/r s 2. Hence in that case, the orders of the derived and parent methods are equal if two or more points are used in the derivative estimation. In Theorem 4-4, 2 s k s'N-l, and 2 s u S‘N, so that we have this result. Corollary 4-5. If the strong order of convergence of the derivative- estimated method for parallel rootfinding described in Theorem 4-4 is exactly y, then the highest order of convergence which can be achieved on a parallel machine using N processing elements is either (N-l)(r+l) , if N s n+1, or Nr, if N 2 r+l . This shows that unless the number of derivatives used in the parent method is as large as the machine Size N, use of a derivative-estimated method actually reduces the order of con- vergence. 4.3 Efficiency In this section, the efficiency of the parallel derivative- estimated methods is studied by comparing the computational effort required by the derived method with that required by the parent method. We will assume that the derived methods have strong order of convergence exactly y - min{k(r+l), ur}. If an order of convergence of k(r+l) is desired, it can be achieved by the parent method using Hermite interpolation of 8’8.,...,8(r) at the k points {y1,t‘t€Aj]. Aderived method 98 with u 2 [k(r+1)/r] retains the same order Of convergence, but (1') it does not require the computation and evaluation of g . This (1') is an important consideration if g is much more costly to evaluate than g,g',...,g(rfll). However, the cost of avoiding the (r) _ evaluation of g is high. At least [k/r additional pro cessing elements are needed to perform the computations required at u points. In addition, each processing element must compute an Hermite interpolation polynomial G of degree at most ur-l J r and evaluate Gj( ) at k points. Hence the parent method is more efficient than the derived method unless at least rk/r] additional PE's are available and the cost of one parallel evalua- (1') tion of g is greater than the cost of constructing and evalua- ting G This would be expected to occur only very rarely. 1’ If instead of fixing the desired order, we wish to use the r 8() data from g,g',..., each evaluated at k points, we may pro- ceed in two ways. First, the Hermite interpolation method using this data has order of convergence k(r+l). .Alternatively, we can use the data in a derivative—estimated method to approximate g