PARALLEL ROOTFINDING ALGORITHMS

Dissertation for the Degree of Ph. D.
MICHIGAN STATE UNIVERSITY
GEORGE FREDERICK CORLISS
1974

This is to certify that the thesis entitled PARALLEL ROOTFINDING ALGORITHMS presented by George Frederick Corliss has been accepted towards fulfillment of the requirements for the Ph.D. degree in Mathematics.

Gerald D. Taylor, Major professor

ABSTRACT

PARALLEL ROOTFINDING ALGORITHMS

By George Frederick Corliss

Algorithms to find a simple root α of a continuous function f are considered which are suitable for implementation on an array type of parallel processing computer using N processing elements. The algorithms are compared on the basis of their efficiency

    EFF = (log2 order of convergence) / (number of parallel arithmetic operations per iteration).

The first class of algorithms to be considered uses inverse Lagrange interpolation on k points. If f ∈ C^(k+1) on an interval containing α, these methods have order of convergence k. In contrast, the order of convergence for sequential methods using Lagrange interpolation on several previous estimates for α is always less than 2. The parallel methods achieve a higher order of convergence because they use only the most recent estimates for α. Depending on the number of arithmetic operations required to evaluate f, k can be chosen to maximize the efficiency of methods in this class. If the cost of evaluating f is very high, the speed-up possible by using parallel rootfinding algorithms based on Lagrange interpolation over efficient sequential rootfinding algorithms is proportional to log2 N.

A parallel secant rootfinding algorithm is considered as an example of a method using Lagrange interpolation. If it is implemented on a parallel processing system with three or more processing elements, it achieves order of convergence 2 with one parallel evaluation of f and five additional arithmetic operations. Hence it is more efficient than the usual Newton-Raphson method whenever more than three arithmetic operations are required to evaluate f'.

The second class of parallel algorithms considered uses inverse Hermite interpolation of f, f', ..., f^(r) at k points. The order of convergence is equal to the amount of data used, k(r+1). If r is fixed, the fastest algorithm in this class which can be effectively implemented on a parallel processing system with N processing elements achieves a speed-up ratio proportional to log2 N for functions which are very costly to evaluate. If it is less costly to evaluate f than to evaluate any derivative of f, a method using Lagrange interpolation (r = 0) on k points is more efficient than any other method using Hermite interpolation on k points.

The third class of parallel algorithms contains derivative-estimated methods. If the values of f^(r) used in a parallel Hermite interpolation method are estimated using values of f, f', ..., f^(r-1) at enough points, the order of convergence of the Hermite interpolation method is not reduced. However, except in rare examples, a parallel derivative-estimated method is less efficient than the Hermite interpolation method on which it is based.
If the answer to a rootfinding problem is desired as quickly as possible, as in real-time computer applications, these parallel algorithms should be used. In general, a parallel method using inverse Lagrange interpolation on k points is fastest, where k depends on the function. If the root is to be computed at low cost, the task should be executed on a sequential machine, because the speed-up ratio proportional to log2 N achieved by these parallel rootfinding algorithms shows that they do not make efficient use of the parallel capabilities of the system.

PARALLEL ROOTFINDING ALGORITHMS

By George Frederick Corliss

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Mathematics
1974

ACKNOWLEDGMENTS

I wish to express my sincere appreciation to Professor Gerald D. Taylor for his leadership and guidance. His suggestions and assistance have been invaluable.

TABLE OF CONTENTS

Chapter                                                                    Page

I    INTRODUCTION TO PARALLEL PROCESSING AND ROOTFINDING
     1.1   Motivation for Parallel Processor Design
           1.1.1  High Speed Computation
           1.1.2  Very Large Problems
           1.1.3  Inherent Parallelism in Existing Problems
           1.1.4  Enormous Quantities of Data Requiring Real-Time Processing
           1.1.5  Multiuser Systems
           1.1.6  Reliability and Graceful Degradation
     1.2   Parallel Machine Organization
     1.3   Problems Encountered when Implementing Parallel Algorithms
     1.4   Examples of Parallel Processing Algorithms
     1.5   Speed-Up Ratio
     1.6   Introduction to Iterative Rootfinding
     1.7   Definition of Iterative Methods
     1.8   Order of Convergence
     1.9   Examples of Rootfinding Iterations
           1.9.1  Newton-Raphson Method
           1.9.2  Generalized Newton-Raphson Method
           1.9.3  Multipoint Method One
           1.9.4  Multipoint Method Two
           1.9.5  Lagrange Interpolation Methods
           1.9.6  Parallel Secant Method
     1.10  Results on Optimal Order
     1.11  Definitions of Computational Efficiency
     1.12  Results on Optimal Efficiency
     1.13  Parallel Rootfinding

II   ROOTFINDING BY LAGRANGE INTERPOLATION                                   43
     2.1   Rice's Results                                                    43
     2.2   Parallel Secant Method                                            46
     2.3   Order of Convergence                                              47
     2.4   Efficiency                                                        55

III  ROOTFINDING BY HERMITE INTERPOLATION                                    63
     3.1   Order of Convergence                                              63
     3.2   Example of an Hermite Interpolation Method                        67
     3.3   Efficiency                                                        69
     3.4   Comparison of Hermite and Lagrange Methods                        77

IV   ROOTFINDING BY DERIVATIVE-ESTIMATED METHODS                             80
     4.1   An Example of a Derivative-Estimated Method                       81
     4.2   Order of Convergence                                              85
     4.3   Efficiency                                                        97

V    WHICH METHOD TO USE                                                    100
     5.1   Real-Time Problems                                               100
     5.2   Problems to be Solved at Low Cost                                102

BIBLIOGRAPHY                                                                103

LIST OF TABLES

Table                                                                      Page
2-1                                                                          59

LIST OF FIGURES

Figure                                                                     Page
1-1  Functional Block Diagram for ILLIAC IV
1-2  Addition with Six Sub-operations                                        10
1-3  Pipeline Addition of N-dimensional Vectors A + B                        11
1-4  Addition of Two 64-dimensional Vectors on ILLIAC IV                     14
1-5  1 + 2 + ... + 64 on ILLIAC IV                                           15
1-6  Carrol and Wetherford's Variation of Gauss-Seidel                       17
1-7  Stone's Variation of Gauss-Seidel                                       17
1-8  Flow Chart for One Iteration of the Parallel Secant Method              19
1-9  Speed-Up Ratio                                                          21
2-1  Flow Chart for PE k                                                     45
2-2  Parallel Secant Method                                                  47
2-3  Pk(f)                                                                   59
3-1  Inverse Hermite Interpolation by One PE                                 68
Paper wi which us 1'1 Mot CHAPTER I INTRODUCTION TO PARALLEL PROCESSING AND ROOTFINDING The computer industry is currently searching for new hard- ware configurations which will attain greater Speeds with a higher reliability and more flexibility to meet the varying needs of many users. Parallel processing is one very promising approach being considered. The intuitive motivation for research into parallel processing is that if one machine can do a certain job in ten minutes, then perhaps ten machines could be coupled together in some way to do the same job in one minute. In addition to the design and construction of such machines, algorithms must be developed which effectively utilize the parallel capabilities of the machine. This paper will be devoted to the study of algorithms for rootfinding which use the parallel machine structure. 1.1 Motivation for Parallel Processor Design There are several reasons why parallel processor computer systems may have a place in future computer designs (see [14], pp. 4—7). Some areas of possible benefit which we will discuss in more detail are: 1) High Speed computation 2) Very large problems 3) Inherent parallelism in existing problems 4) the use of Ll.l Hig Ci Hat very expensive which ele micro-min HUSt tra\ Small ti: being ex C3mputar N'fold r COElputat head tin N dirt. among tl machine StruCti is COns Sthti 4) Enormous quantities of data requiring real-time pro- cessing 5) Multiuser systems 6) Reliability and graceful degradation. We now discuss each of these areas of possible benefit from the use of parallel processing computers. 1.1.1 High Speed Computation Conventional computer hardware design has reached the point that very small improvements in the speed of the machine are very expensive. The Speed of the machine is limited by the Speed at which electrical pulses travel through the machine. Advances in micro-miniature circuitry have cut down the distances the pulses must travel, and hence the computation times, but the cost of even Small time savings is prohibitive, so other avenues of design are being explored. Parallel processing is one such design alternative. In a parallel processing system, N different parts of a computation may proceed simultaneously, so it is possible that an N-fold reduction in the time needed to perform certain classes of computations could be realized. One would expect the program over- head time to be greater on a parallel processing machine because N different processors must be coordinated and the work shared among them. At the same time, it may be possible to construct a machine with N processing elements at a cost below that of con- structing N separate machines. One design in which this is done is considered in Section 1.2. If the savings possible in the con- struction of the parallel processing machine is larger than the increased cost of the overhead in program execution, then the N- fold SP€€d' on existing premise tha :ost than 1' time turrer 1.1.2 Vet"; $01 so much Cot anlj; at gr: 3 gléhal w. dinanics p and other of data p0 aChiEVe a £935 1b 19 t 1'1-3 Int A configure! be PGrfon an a Para example, althoUgh ferent f0 9119!} thOu fold Speed-up can be achieved at a lower total cost than execution on existing machines. That is, parallel processing designs hold the promise that it may be possible to execute some problems at a lower cost than is currently possible and to execute them in 1/N of the time currently required. 
1.1.2 Very Large Problems Some problems of current interest require so much memory or so much computation that they can be solved on conventional machines only at great expense. Some examples are the problem of constructing a global weather model, some nuclear physics problems, large hydro- dynamics problems, phased array radar air defense systems (see [1]), and other similar problems requiring computations on a large number of data points. If a parallel processing system can be built to achieve a high speed-up at a relatively low cost, it will become feasible to solve problems too large for existing computer systems. 1.1.3 Inherent Parallelism in Existing Problems A large class of problems are well suited to a parallel configuration by their structure. If the same computations are to be performed on many independent sets of data, the job may be executed on a parallel processing system in a highly efficient manner. For example, a payroll job must do the same computations for each employee, eilthough the rate of pay, hours worked, withholding, etc., are dif- feremt for each employee. Tasks of this sort which are essentially l l 'vector operations may be economical to run on a parallel machine even though they do not require a very high speed operation. 1.1 whi an ar fe pr US de Vi ti ar it Ca f) (D 1.1.4 Enormous Quantities of Data Requiring Real-Time Processing Historically, each time a new computer has been introduced which was Significantly faster or had more memory than its pre- decessors, it became feasible to solve problems which had been too complex or too large for smaller, slower machines to handle. Sim— ilarly, the design and construction of a parallel processing system would enable problems which require vast amounts of data to be read and analyzed fast enough to reSpond in real time. Two Such problems are the control of a large phased array radar system for urban de- fense (see [1], [12], and [8]), or Speech and visual perception problems in artificial intelligence (see [31]). A system to control a large air defense radar network must use the information from several different radar installations to determine the nature, position, and motion of each airborn object within the range of the network. Then appropriate radar installa- tions are directed to conduct a further study of each "target", while continuing a general search by the entire network for other targets. Targets whose identity is known must be distinguished from incoming enemy aircraft and from decoys deployed by the attack- ing enemy to fool the air defense system. Then the appropriate air defense mechanisms must be alerted. All of this data must be analyzed and action taken before the attacking aircraft arrive at their targets. In [1], Knapp, Ackins, and Thomas describe a system (Eipable of tracking 109 targets simultaneously. This requires thee controlling computer system to issue about 107 commands per seczond to the radar units. Currently existing computers cannot pro- cesss that much data fast enough to reSpond in the time required. €132 t O CES IEC O:56: In the second examp}e, human Speech and vision require the processing of a large amount of data at a very high rate. For example, when a person is Speaking, his listener must detect the sound of his voice, interpret his Speech, and formulate a response, all during the time the Speaker is talking. Then the listener is able to reSpond so that a conversation is possible. 
If a robot is to recognize human Speech patterns, it must be able to process one second of Speech in one second. Reddy [31] estimates that a robot must be able to process about 2 x 105 bits per second of informa- tion to detect, analyze, and recognize Speech patterns. If a robot is mobile, it must be able to "see" obstacles in its path in time to avoid them. Reddy estimated that the robot must be able to pro- cess approximately 108 bits of data per second for this visual recognition task. Conventional computers are not capable of processing data at the rate required by either of the two examples given above, while a large parallel processing system could be designed to meet these demands for Speed and capacity. 1.1.5 Multiuser Systems It is becoming increasingly common for one computer system to serve many users. Various time-sharing schemes allow work to prcmeed on different programs at once. A parallel processing computer rmay be able to perform a similar function by allocating each user to éin idle processor without the expensive overhead of the current opelrating systems. 1.1.6 Rel of the mac Reliabilit ponent or go gracefi hughly ll In a conv. forces th. of the Sy failures cessing e aSSume th cessing e my 311g he use c icant dIC That is, gradual r reliabili ”Went“ Systems l Process c 1'2 Para 1111966, ] 1.1.6 Reliability and Graceful Degradation A parallel processor system may be able to withstand failures of the machine's components without a major interruption of service Reliability is a measure of the probability of the failure of a com- ponent or of the entire machine. A computer system is said to under- go graceful degradation if the Operation of the machine deteriorates roughly in proportion to the percentage of malfunctioning components. ; In a conventional computer system, the failure of a major component forces the entire system out of operation. If the major functions of the system are decentralized in a parallel configuration, some failures need not be fatal. For example, if one of the many pro- cessing elements failed, the remaining processors could automatically assume the tasks of the disabled one. Hence the failure of a pro- cessing element which would be fatal to a conventional machine would only slightly lower the efficiency of a parallel processing system. The use of several processing elements could be lost before a signif- icant drop in the overall performance of the machine would occur. That is, the performance of a parallel system would deteriorate in a gradual way. This graceful degradation would make the overall reliability of such a System much better than the reliability of conventional machines. The high reliability makes parallel processor systems particularly important in such real-time applications as process control or military surveillance. 1.2 Parallel Machine Organization There are many different alternative designs for parallel processing computers to fulfill the needs discussed in Section 1.1. In 1966, Flynn [11] and [10] described four classes of machine arganizat different fmrclas sider the stream-sf ized by ( Each ins: carried ( mta stn proceed . inpracti essary t instruct in this is carri designs diSCussE are banl Berk 3n gram, 31 n bar . 8y 'stem, < organization. Parallel processing systems were assigned to two different classes. We will give a brief description of each of the four classes of machine organization considered by Flynn, and con- sider the basic design of two important parallel processing systems. 
Flynn classified conventional computers as single instruction stream-single data stream (SISD) machines since they are character- ized by one sequence of instructions acting on one set of data. Each instruction is executed in sequence and the instruction is carried out on only one set of data. Some early computers were multiple instruction stream-single data stream (MISD) machines. Several sets of instructions could proceed simultaneously on one set of data. MISD machines were impractical because of the hardware and programming overhead nec- essary to avoid memory conflicts. A large class of parallel processing machines are single instruction stream-multiple data stream (SIMD) machines. Machines in this class execute instructions in sequence, but each instruction is carried out on several different sets of data. Two important designs for SIMD machines, the ILLIAC IV and the CDC STAR, will be discussed later in this section. Flynn's fourth class of machine is a multiple instruction stream-multiple data stream (MIMD) machine. These multi-processors sire banks of autonomous machines. Each machine can be directed to twork.on a different program, or on different parts of the same pro- gram, all under the overall control of a central control unit. CarTuegie-Mellon University is designing a multi-mini-processor SYStefln, C.mmp [42]. A MIMD machine is well suited to multi-user I 2‘ ‘ h“ '. svstems, blocks uh for use c such mach designed for the h heffet Fi instructi cessing e unit wit} MOI}! or proceeds executes Data nay bEWEen I to act or is an ex; Suited t: DaDEnber! of rum systems, or to large problems where the program can be Split up into blocks which do not depend on data from the other blocks. The algorithms for rootfinding in this paper are intended for use on a SIMD machine, so we will consider the design of two such machines. The first is ILLIAC IV, an array processing system designed by the University of Illinois and Burroughs Corporation for the National Aeronautics and Space Administration Laboratory at Moffet Field, California. ILLIAC IV is composed of one control unit (CU) which generates instruction signals to control the operation of 64 identical pro- cessing elements (PE). Each PE consists of an arithmetic and logic unit with its own memory. Each PE processes the data in its own memory or data which is passed to it through the CU. Computation proceeds according to the commands generated by the CU, so each PE executes the same instruction stream unless it has been turned off. Data may be passed between the CU and the PE's, or it may be passed between PE'S. This configuration allows one sequence of instructions to act on up to 64 different sets of data simultaneously, so ILLIAC IV is an example of a SIMD machine. This machine organization is well suited to a large class of vector or array type problems. In [9], .Danenberg gives a detailed description of the design and organization ()f ILLIAC IV (see also [22] or [31]). J lProcess ll l tilenent A Ciently 0' or final paints, f 54 proces used. p0 Store and “'3 Stor cess“,g ( cessed if the cmp,‘ Such as 1 the Whip So that . is a Pipf can “he: Control Unit Memory l Processing 2 E 64 Element 1 PE ' ' ° P Memory 1 M 2 M 64 Figure 1-1. Functional Block Diagram for ILLIAC IV H-r-eu-‘Ir 1 An array processing system like ILLIAC IV operates effi— ciently only when the size of the data vector is slightly less than or equal to the size of the machine. 
63, 64, 190, or 192 data points, for instance, are well suited to processing on ILLIAC IV'S 64 processing elements since all, or nearly all, of the PE'S are used. For example, if a l90—dimensional vector is used, 62 PE's store and operate on 3 components each, while the two remaining PE's Store and operate on 2 components each. By contrast, pro- cessing 65-dimensional vectors requires that 64 components be pro- cessed first. Then all but one of the PE'S must be disabled while the computations proceed on the "extra" component. In many problems such as the numerical solution to partial differential equations, the computations can be structured to use a multiple of 64 points so tfimt ILLIAC IV'S 64 PE's can be used efficiently. The second example of a SIMD machine which we will consider is aa pipeline processor, the CDC STAR. While array processing Syst:ems like ILLIAC IV are based on the notion that N machines can ssometimes do the work of one machine in l/N of the time, 10 pipeline computers like STAR gain Speed by performing different parts of the same operation simultaneously (see [9] or [13]). For example, suppose the operation of addition is broken into these six sub-operations: 1) fetch operands from memory, 2) adjust exponents to align the decimal points, 3) add mantissas, 4) normalize, 5) round off the result, and 6) store the sum in memory. These six sub-operations are illustrated by a simple example in Figure 1-2. If this addition operation is implemented on a pipeline computer, the sum of two numbers is being stored in memory at the same time that the sum of the next two numbers is being rounded off, etc. Figure 1-3 shows how addition of two N- dimensional vectors would proceed in such a machine. Sub-operations l) Fetch operands A = 0.234 X 103 B = 0.567 x 101 2) Adjust exponents A = 23.4 X 101 B = 0.567 X 101 3) Add mantissas A + B = 23.967 X 101 4) Normalize A + B = 0.23967 X 103 5) Round off A + B = 0.240 X 103 6) Store the sum A + B = 0.240 X 103 Figure 1—2. Addition with Six Sub-operations Step N+1 N+5 Fizure 1 Pipeline PiPeline “hich are illuStrat data. 1“lath. are Separ Pipeline . flawed to Became oi 11 Sub-operation 1 2 . . . 6 1 (A1, B1) 6 (A6. B6) (A5, B5) . . . A1 + B1 Step N (AN’ 3N) (AN-1’ BN_1) . . . AN_5 + BN_S n+1 (15,, BN) . . . AN-4 + BN_4 N+5 AN + BN Figure 1-3. Pipeline Addition of N-dimensional Vectors A + B We have seen that a single sequence of instructions in a pipeline machine acts on several different sets of data. Hence the pipeline processor is an example of a SIMD machine. Computations which are much more involved than the operation of addition illustrated above can be implemented on a pipeline computer if the same sequence of computations must be repeated on different sets of data. A pipeline computer can operate on a vector of arbitrary length. Figure 1-3 shows that if the computations being programmed rare separated into six parts, five steps are required to fill the Fitmeline when starting the computations. Similarly, five steps are nee<fled to complete processing of the last component of the vector. Becatuse of the time Spent filling the pipeline at the beginning and time [28‘ \‘EC at V: (J. 12 emptying it at the conclusion of an operation, the efficiency of the machine decreases if it is necessary to empty and refill the pipeline often. In actual use, the start-up time is typically the time needed to generate 30 - 60 results once the pipeline is full [28], so longer vectors use the machine more efficiently. 
The vector length is limited to 64,000 by the temporary memory available on the CDC STAR, and lOOO—dimensional vectors are typical of efficient machine utilization [28]. There is some overlap among the classes of machines dis— cussed in this section. For example, the array processing system ILLIAC IV uses pipelined instructions and memory access queues to speed execution [9]. As originally designed, the ILLIAC IV system was to be a MIMD machine composed of four autonomous SIMD machines. Each of four control units was to drive 64 processing elements, but only one quadrant of the proposed machine was built. 1.3 Problems Encountered when Implementing Parallel Algorithms After a parallel processing system has been designed, there are at least four major problem areas which must be studied before the system can be used efficiently. The problems discussed in [37] by Stone are: l. The arrangement of data in memory is essential for efficient parallel processing, although it is only a minor concern for the con- ventional sequential machines now in use. 2. An efficient sequential algorithm does not necessarily lead to an efficient parallel algorithm. Hence research is necessary to develop efficient parallel algorithms. 1' L‘Iu 1‘ wine 4. T etc., the c a run influe roatf: In [if for ii of the Stone doubli PYOble transf into a Value, velopel effiCie Probabl 13 3. Many algorithms are inherently sequential, but in some cases the sequential dependence can be relaxed. 4. The behavior such as the order of convergence, round-off error, etc., of a parallel algorithm may be quite different from that of the corresponding sequential algorithm. In this paper we will consider algorithms designed to find a root of a function. We will see how each of these problem areas influences the construction, the behavior, and the choice of efficient rootfinding algorithms. 1.4 Examples of Parallel Processing Algorithms There are many algorithms known which are parallel in nature. In [15], Karp and Miranker describe a parallel searching algorithm for finding roots of a function. Pease [29] gives a modification of the fast Fourier transform which is suited to parallel processing. Stone [35] and Kogge and Stone [16] derive a method called recursive doubling which allows them to solve a large class of recurrence problems using parallel processing. The method of recursive doubling transforms a problem which seems to be inherently sequential in nature into a problem which can be solved in parallel. Miranker [24] gives a survey of many other parallel algorithms. The pull-back interpolation technique for solving initial value, boundary value, and time delay differential equations de— veloped by Rogers [34] incorporates several features which could be efficiently implemented on a parallel processing computer. We now consider some parallel algorithms in more detail. Prtflmably the simplest use of parallel processing is the addition I‘Y‘ Fi Iii 0f 58( adt 14 of two N-dimensional vectors, A + B. In a pipeline processor like the CDC STAR, addition is broken down into several Sub-operations. Figure 1-3 illustrates pipeline addition with six sub-operations. The summands move through the pipeline of sub-operations at the Speed of the slowest sub-operation. In an array processing system like ILLIAC IV, the i th pro- cessing element, PE i, performs A1 + Bi° Figure 1-4 illustrates the addition of two 64-dimensional vectors. If N < 64 some of the PE's are turned off while the rest execute the addition. 
If N > 64, some PE performs the addition of more than one component.

    Control Unit
    PE 1: A1 + B1 = C1     PE 2: A2 + B2 = C2     . . .     PE 64: A64 + B64 = C64

    Figure 1-4. Addition of Two 64-dimensional Vectors on ILLIAC IV

An interesting example of the savings in time possible with ILLIAC IV is the computation of 1 + 2 + ... + 64 together with all of the partial sums. This problem requires 63 additions on a sequential machine, but Figure 1-5 shows the sequence of six parallel additions which yields the desired sum and each intermediate partial sum. The instructions indicate the routing of data at each step. If the instructions are changed to allow the data from PE 64 to be passed directly around to PE 1, each PE contains the sum 1 + 2 + ... + 64 after six parallel additions.

                                                 PE 1   PE 2   . . .   PE 64
    Route each PE's contents one to the right and add:   1   1+2   . . .   63+64
    Route contents two right and add:    1   1+2   1+2+3   1+2+3+4   . . .   61+...+64
    Route contents 4 right and add:      1   1+2   . . .   1+...+8    . . .   57+...+64
    Route contents 8 right and add:      1   1+2   . . .   1+...+16   . . .   49+...+64
    Route contents 16 right and add:     1   1+2   . . .   1+...+32   . . .   33+...+64
    Route contents 32 right and add:     1   1+2   1+2+3   . . .   1+2+...+64

    Figure 1-5. 1 + 2 + ... + 64 on ILLIAC IV

Another type of problem which is well suited to a parallel processing system is the numerical solution of partial differential equations. For example, the point Jacobi iteration to solve the Laplace equation by finite differences with a uniform mesh on a square is given by

    u_{n+1}(i,j) = [u_n(i,j-1) + u_n(i-1,j) + u_n(i,j+1) + u_n(i+1,j)] / 4 .

Here the average of the previously computed approximations to the true solution at the four nearest points is used as the new approximation. Since no u_{n+1} uses previously computed values of u_{n+1} at other points, it is possible to compute u_{n+1} at each point simultaneously on a parallel machine.

In contrast to the point Jacobi method, the Gauss-Seidel iteration uses new approximations for u_{n+1} which have already been computed whenever they are available, so the Gauss-Seidel iteration is inherently sequential. Carrol and Wetherford [6] suggest a variation of the usual Gauss-Seidel method which can be implemented on a parallel machine with N processing elements. Assume √N is an integer. If a 2√N x 2√N grid is used, the solution must be computed on 4N points. Hence each processor must compute the approximation at 4 points. Figure 1-6 shows that each processor computes the value at the first point using old data; the values for the second and third points are computed using some new and some old data; the value at the fourth point is computed using only new data.

    o = old information     n = new information     u_{n+1} = point being computed

    Figure 1-6. Carrol and Wetherford's Variation of Gauss-Seidel

Another variation of the Gauss-Seidel method is due to Stone [37]. He suggests that an N x N grid be used. Then diagonals of length N are processed simultaneously in the manner shown in Figure 1-7.

    Figure 1-7. Stone's Variation of Gauss-Seidel

As a final example of parallel algorithms we consider a parallel rootfinding algorithm. This algorithm is a special case of the simultaneous (n-2) degree method due to Rice [32] with n = 3. A summary of Rice's results and a more careful discussion of this method will be given in Chapter II. This algorithm will be called the parallel secant method because it is derived from the usual secant method given by many authors (see [30, pp. 323-328], or [27, Chapter 3]).
The secant method is given by

    x_{i+1} = [f(x_i) / (f(x_i) - f(x_{i-1}))] x_{i-1} + [f(x_{i-1}) / (f(x_{i-1}) - f(x_i))] x_i

            = x_i - f(x_i) (x_i - x_{i-1}) / (f(x_i) - f(x_{i-1})) .                         (1-1)

Assume that f ∈ C^2 on some interval containing a simple root α of f. Let x_0 and x_1 be chosen near α. Then the error in equation (1-1) satisfies

    α - x_{i+1} = [ -f''(ξ) f'(ξ_i) f'(ξ_{i-1}) / (2[f'(ξ)]^3) ] (α - x_i)(α - x_{i-1}) ,     (1-2)

where ξ, ξ_i, and ξ_{i-1} lie on the interval spanned by α, x_i, and x_{i-1}. Then there exists a constant K such that

    |α - x_{i+1}| ≤ K |α - x_i| |α - x_{i-1}| .                                               (1-3)

We now generalize the secant method given by (1-1) to a parallel method. Let x_{0,1}, x_{0,2}, and x_{0,3} be chosen near α. Let y_{i,j} = f(x_{i,j}), for j = 1,2,3; and i = 0,1,... . The parallel secant method defines a sequence of 3-dimensional vectors {X_i = (x_{i,1}, x_{i,2}, x_{i,3})} by

    x_{i+1,1} = x_{i,1} - [(x_{i,1} - x_{i,2}) / (y_{i,1} - y_{i,2})] y_{i,1}

    x_{i+1,2} = x_{i,2} - [(x_{i,2} - x_{i,3}) / (y_{i,2} - y_{i,3})] y_{i,2}                 (1-4)

    x_{i+1,3} = x_{i,3} - [(x_{i,3} - x_{i,1}) / (y_{i,3} - y_{i,1})] y_{i,3} .

         PE 1                            PE 2                            PE 3
    Store x_{0,1}                   Store x_{0,2}                   Store x_{0,3}
    Evaluate y_{0,1} = f(x_{0,1})   Evaluate y_{0,2} = f(x_{0,2})   Evaluate y_{0,3} = f(x_{0,3})
    Obtain x_{0,2}, y_{0,2}         Obtain x_{0,3}, y_{0,3}         Obtain x_{0,1}, y_{0,1}
    Compute x_{1,1} by (1-4)        Compute x_{1,2} by (1-4)        Compute x_{1,3} by (1-4)

    Figure 1-8. Flow Chart for One Iteration of the Parallel Secant Method

As the flow chart in Figure 1-8 shows, each of the three parallel processors uses the approximations generated at the previous step by itself and by another processor to compute a new approximation to α according to the secant method. From (1-3), the errors in (1-4) satisfy

    |α - x_{i+1,1}| ≤ K |α - x_{i,1}| |α - x_{i,2}|

    |α - x_{i+1,2}| ≤ K |α - x_{i,2}| |α - x_{i,3}|                                           (1-5)

    |α - x_{i+1,3}| ≤ K |α - x_{i,3}| |α - x_{i,1}| .

If we let δ_i = max{|α - x_{i,j}|, j = 1,2,3}, (1-5) becomes

    δ_{i+1} ≤ K (δ_i)^2 .                                                                     (1-6)

It will be shown in Chapter II that the use of parallel processing improves the order of convergence of the conventional secant method from (1 + √5)/2 ≈ 1.6 to 2.

1.5 Speed-Up Ratio

Ideally, a parallel processing system with N processing elements should be able to execute a job in one minute which requires N minutes to execute on a sequential machine. This ideal is realized for the addition of two 64-dimensional vectors on ILLIAC IV discussed in Section 1.4, but the program overhead and the nature of the problem or algorithm under consideration usually make it impossible to achieve the ideal savings of time. This leads us to define a measure of the time actually saved by a parallel machine.

Definition 1-1 (see [37, p. 3]). Define the speed-up ratio to be

    speed-up ratio = (computation time on a sequential computer) / (computation time on a parallel computer) .

Notice that it is necessary to state which sequential algorithm is being used for comparison. When studying the speed-up ratios achieved by different parallel rootfinding algorithms, their execution time should be compared with the execution time of sequential rootfinding algorithms which are optimal in some sense.

As the proportion of commands requiring sequential computation in a program increases, the speed-up ratio falls very quickly.
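This dependence can be made concrete with a short calculation. The sketch below is a modern illustration, not part of the original analysis: it evaluates the speed-up ratio of Definition 1-1 under the common assumption that a fraction s of a job is inherently sequential while the remaining 1 - s is divided evenly among the N processing elements; the function name and the even-division assumption are illustrative choices.

    def speedup(s, N):
        # Speed-up ratio of Definition 1-1 when a fraction s of the work is
        # sequential and the remaining 1 - s is spread evenly over N PE's.
        return 1.0 / (s + (1.0 - s) / N)

    print(speedup(0.0, 64))     # 64.0       -- the ideal N-fold speed-up
    print(speedup(1/64, 64))    # about 32.3 -- one 64th sequential already halves it
    print(speedup(0.5, 64))     # about 2.0  -- half sequential leaves almost no gain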
Figure 1-9, due to Amdahl [2], shows how the Speed-up ratio depends On the sequential processing inherent in a program. For example, EHJppose only 1/64 of a job requires sequential computation on ‘ILLIAC IV, while the remaining 63/64 of the job can be executed in parallel. execution ‘ sequential Spee Rat Tl Processing Cmputatic tasks whit alternatix 3Ut0mtica Parallel m ities in a: We which are a As I the ideal SF {my Problem, 8 speed-up r8 21 parallel. The machine's efficiency drops to 1/2 since half of the execution time will be taken by one processor to perform the sequential part of the program while the other 63 PE‘s are idle. N Speed-up N/2 «u- Ratio 1 160 I 0 N 100 Z of Inherent Serial Computation Figure 1-9. Speed-Up Ratio This example illustrates the importance of using the parallel processing machine only for parallel computations. To avoid sequential computations, the control unit can perform indexing and other overhead tasks while the PE's are executing the body of the program. Another alternative approach is to have a seQuential computer available which automatically assumes the execution of sequential tasks while the parallel machine performs tasks which utilize its parallel capabil- ities in an efficient manner. We now consider some examples to show the Speed-up ratios ‘which are achieved by several different classes of problems. As the remarks at the beginning of this section point out, the ideal Speed-up ratio of N is very rarely attained. However, many problems which have a natural vector or array Structure achieve a Speed-up ratio which is linear in the number of processors used. Aspeed-up r; are very vel problems inc numerical so'. calculations Probl relations (KL and Paterson speed-up rat as a linear Parallel prc bl’ doubling The ratio. Alg: [15] aChiev. °f a1gorith little incr CESSors, 1‘ class, they to COIHpute t large Class The that the use These PTObleL so they Shoo; '3 . ‘mcesslng sy up rati0. 22 A speed-up ratio of aN, with a s l, characterizes problems which are very well Suited to parallel processing. Examples of such problems include matrix computations and problems Such as the numerical solution of partial differential equations which use calculations at grid points (see [17]). Problems such as sorting (Stone in [36]), linear recurrence relations (Kogge and Stone in [16]), and polynomial evaluation (Munro and Paterson in [25]) have been handled on SIMD machines with a speed-up ratio of aN/logzN. Although this Speed-up is not as good as a linear Speed-up, these problems are still well suited to parallel processing since the Speed of computation is nearly doubled by doubling the number of processing elements. The next class of algorithms have an even poorer Speed-up ratio. Algorithms Such as Karp and Miranker's searching algorithm [15] achieve a speed-up ratio proportional to logzN. This class of algorithms is poorly suited to parallel processing since very little increase in Speed is achieved by doubling the number of pro- cessors. In Spite of the poor Speed-up ratio of algorithms in this class, they are of use in real-time applications where it is important to compute quickly, regardless of cost. This paper will Show that a large class of parallel rootfinding algorithms fall into this class. The final class of problems are inherently sequential, so that the use of parallel processing yields no increase in Speed. 
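For a concrete comparison of the classes just described, the brief computation below (illustrative only; the constant a is taken to be 1) tabulates the three speed-up forms for a machine with N = 64 processing elements.

    from math import log2

    N = 64
    print("linear, aN  :", N)             # 64.0      -- well suited to parallel processing
    print("aN / log2 N :", N / log2(N))   # about 10.7
    print("log2 N      :", log2(N))       # 6.0       -- doubling N gains very little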
These problems have a Speed-up ratio which is independent of N, so they should be executed on sequential machines to leave the parallel ‘processing system free to execute algorithms with a favorable Speed- up ratio. 1.6 introduc This root t! 0” statement of Let 1(a) =0 A root of f. The general iter f need only Other condi‘ necessary f. The and has bee to the fixe Cludes Part boundary Va 1'7 DGfinj Let the SIMD ma 0' We Cons N‘tup1e3 [ an EStimate be distinct {HUSt have form 23 1.6 Introduction to Iterative Rootfinding This paper will be concerned with the problem of finding a root a of a real-valued function f. We begin with a careful Statement of the rootfinding problem. Let f E Cm(a,b). Find a 6 (a,b), if it exists, such that f(a) = 0. Assume further that f'(o) # 0, so that o is a simple root of f. The requirement that f E Cm(a,b) will allow us to consider general iterative methods. For each Specific example we will consider, f need only have some finite number of continuous derivatives. Other conditions will be placed on the function f when they are necessary for the discussion. The rootfinding problem arises frequently in applications and has been the subject of extensive research. It is equivalent to the fixed point problem x = g(x) and its general setting in- cludes partial differential equations, integral equations, and boundary value problems (see Collatz [7]). 1.7 Definition of Iterative Methods Let N denote the number of processing elements used by the SIMD machine. Assume that the given function f has a root a- We consider iterative methods which compute a sequence of N-tuples {Xi = (xi,1,xi’2,...,xi,N)]. Each component of X1 is an estimate for a. In general, the components of X1 need not be distinct, but for all the methods considered in this paper we ‘must have xi j # xi k if j # k, We consider iterations of the form In general, considered 1* component to such as TILL The c sufficient f: following ri of the def in Kung and Tra m an Open zero qf 0 functions satisfies: 1) 2) 24 = tori, xi_1,...,x ) x1+1,j i-m In general, I may depend on the component j, but all Of the methods considered in this paper use the same iteration function for each component to that the methods can be implemented on a SIMD machine such as ILLIAC IV. The above definition Of a parallel iterative method is sufficient for our discussion, but for completeness, we give the following rigorous definition. This definition is a generalization Of the definition Of a sequential iterative method (N = 1) given by Knng and Traub in [21]. Definition 1-2. Let D = {flf is a real analytic function defined on an Open interval depending on f, I CIR which contains a Simple f zero cf of f, and f' # 0 on If]. Let n denote the set of functions {m} which map every f E D to a function m(f) which satisfies: 0n+l) a, N 1) (9(a) :RN R ; 2) CP(f) (af)""af) = (af’°",af) ; N(mfil) times N times 3) For each (p E 0, f E D, there exists Icp f, an open subinterval contained in If such that i) of 6 Im,f , ii) o(f)[Icp,fN(m1)] :: Icp.fN , and iii) if X0,X1,...,Xm 6 Iqbe and X1+1 is given by X1+1 = ¢(f)(X1,Xi-1,...,Xi_m) , then The: it de ti 0n ca do De put of is tl “hitch etc/7 Mr 1 25 = = 2 ..., . lim xi,j of, for j l, , N 1'“ Then qaé O is called a parallel rootfinding iteration with memory m. If m = O, m is called a parallel rootfinding iteration with- out memory. If N = l, m is a sequential rootfinding iteration as defined by Kung and Traub. In general m(f) depends on the values Of f and some of its derivatives. 
If m(f) requires the evaluation Of f and its derivatives only at the points xi 1,xi 2,... (values at ’ , ’Xi,N xi-l,l’°°"xi-1,N’°"’xi-m,N may be stored and used in the computa- tion of Xi+1), m is called a one point iteration. If m(f) re- quires the evaluation Of f or its derivatives at points which are not components of Xi, m is called a multipoint iteration. All Of the iterative methods satisfying this definition are called stationary iterative methods since the iteration function m does not change as the iterations proceed. Definition 1:3. The information usage Of the iterative method m is the number of parallel functional evaluations necessary to com- pute Xt+1. If vk(m) denotes the number of parallel evaluations Of f(k) necessary to compute then xi+1’ v(cp) = E Vk(cp) (1‘7) k20 is the total information usage of m. Recall that we are considering only iteration functions m which are suitable for implementation of a SIMD machine. Hence each component of X is computed by the same formula, so if f i+l Inust be evaluated at one component of X it must be evaluated at 1, 26 all N components. These N evaluations of f may be executed in parallel with PE j evaluating f(x ). Thus one parallel func— 1.1 tional evaluation yields N functional values. 1.8 Order of Convergence An iterative method as defined in the preceding section generates a sequence Of N-tuples {Xi}. The order Of convergence Of an iterative method is a quantitative way of measuring the rate at which this sequence converges to the N-tuple (af,...,af). In what follows, we shall discuss this idea for sequential rootfinding algorithms, and then generalize it to include parallel rootfinding algorithms. As Brent points out [4, p. 21], there are many possible definitions of "order of convergence". We will use two Of them. The sequence {xi} with lim x 1 1-400 have order Of convergence 1 if (1-8) or (1-9) hold. = a and 61 = a - x1 is said to 6. 11m\-—1—+—1—‘-=c>o,;.>1. (1-8) - A 1”” leil 1hn (‘ln lei|)1/i = A - (1-9) lam Definition 1:4, We Shall say that the sequence {xi} has strong order of convergence A if (1-8) holds, and weak order Of con- vergence A if (1-9) holds. Note that condition (1-8) implies condition (1-9). Indeed, if we assume that (1-8) holds for the sequence {xi} with x 4 a i and let G. = l61+1l 1 leilx 27 then there exists an integer I such that i 2 I implies ci 2 c/2. Without loss of generality, we may assume that I = 0 and leil < l for all i. It is easily seen that 1-1 1'. _ A A X lei] - ci-lCi-Z ... c0 ‘30] . (l-lO) Using (1-10), one has that 1’1 k i ln leil = REG I 1n ci-l-k.+ x 1U\eol s so that f’. 1 '1 lli 1- 1 ° k ' (~1n leil) /1 = x 1n C 1 +'11 1n k=0 i-l-k leol P ‘1 1-1 k_i 1 1 1/1 = x z 1 1n + 1n (1-11) _k=0 °1-1-k leol The assumption that lei. < 1 implies that -ln‘ei\ > 0. Hence the term in brackets in equation (1-11) is positive. Since lim 81/1 = 1 if a > 0, we need only show that the term in the i—oo brackets is bounded as i _. on. Now i-k Z xk-i ln-C-i—+ln =0 i-l-k leol 1-1 sl— 2: ikin-z- +1n.1 1. C 3‘ x k=0 1 0 i S1n (2/c) [1 -§]+ 1n 1 C01 in the fit and 28 However, condition (1-9) does not imply condition (1-8). This can easily be seen by letting e i even e 1 Odd Ortega and Rheinboldt [26] discuss these and several other possible definitions of order of convergence. They point out that condition (1-8) usually requires involved proofs when A is not an integer, while the proofs required when condition (1-9) is used as the definition are the same whether [X is an integer or not. 
We shall now generalize Definition 1-4 for parallel root- finding algorithms. While a sequential rootfinding method generates a sequence {x1} of estimates for the root a, a parallel algorithm generates a sequence of N-tuples {X1 - ( xi’1,xi,2,...,x1’N)]. Setting a: 'or and 61-mxfie j-1,2,...,N],we 1.1 ' “1.1 1.1" define the order of convergence in terms of the worst error at each step. Definition 1:2, We shall say that a sequence of N-tuples {X1} which converges component-wise to the N-tuple (a,a,...,a) has strong order of convergence A if 6 _ktl_ 11m >. = c > 0 , (1-12_ 1.... (5,) and weak.order of convergence A if lim(-1n 51)”1 = x . (1-13) i-can l.‘ m. 81‘ bc 29 For each of the parallel rootfinding algorithms considered in this paper, we Show that ‘)1/1 lim (-1n\si = x, for j = l,...,N , law 1 so (1-13) holds for the error in each component, not only for the largest error. However, the strong order of convergence of these methods remains an open question. It is our conjecture that for each of the methods considered in this paper, the strong order of convergence is the same as the weak order of convergence. In [32], Rice cites technical problems with using the strong order of con- vergence in the parallel case as his reason for using the weak order of convergence. 1.9 Examples of Rootfinding Iterations In this section we shall consider five sequential rootfinding algorithms and the parallel secant method introduced in Section 1.4 to illustrate the definitions of iterative methods and order of convergence. 1.9.1 Newton-Raphson Method The familiar Newton-Raphson method is given by _ ESL CP(f)(x) " X " ft(x) ' This is a one-point sequential iteration without memory. It has both strong and weak orders of convergence 2. It is characterized by m-0,N=l,v(cp)=2,and x-Z. of act EV. 1.9 30 1.9.2 Generalized Newton-Raphson Method (see [40, pp. 78-87]) Let f exist in a neighborhood of a and be denoted by g. Define v1 = x Y2 = Y1 ' f'(x) 2 - _ f"(xl__ f(x) “g(x) ’ Y3 ’ Y2 ' 2f'(x) [Foo] ° This is a sequential one-point iteration without memory with m = 0, N = l, vo(q9 = v1(q9 = v2(q9 = l, v(q9 = 3, and A a 3. This method is a Special case of a general class of sequential one-point itera- tions without memory which achieve a strong order of convergence A .,f(1-1). with one evaluation each of f, f',.. 1.9.3 Multipoint Method One (see [21]) Let 3 fi 0 be a constant. Define N X lo v, to + efoo) af(v0)f(v1) f(f)(x) = *2 = *1 ‘ f(vl) — f(ts5 This is a sequential multipoint iteration without memory with m - 0, N = l, v0(cp) 8 v(ep) = 2, and L = 2. It is a Special case of a general class of sequential multipoint iteration methods which I achieve a strong order of convergence A = 2“. with just n evaluations of f. 1.9.4 Multipoint Method Two (see [21]) Let 31 20 = x _ {5:01. 21' z0 ' f'(zo) f(z )f(z ) (f(z ) cp(f)(X)=7-2=zl' 1 O z'ifi'g—f [f(zl) - f(zo)] o This is a sequential two-point iteration Since functional values at 20 and 21 N = 1, v0(cp) = 2, v1(cp) = l, v(cp) = 3, and A = 4. This method are needed to compute m(f)(x). Here m = O, is a Special case of a general class of multipoint sequential root- finding algorithms without memory which achieve a strong order of convergence A = 2n-1 with n-l evaluations of f and one evaluation of f'. 1.9.5 Lagrange Interpolation Methods (see [30, pp. 336-339], or [39, pp. 
60-75] Let x0,x1,...,xm be given estimates close to o- Let h(x) be the Lagrange interpolation polynomial of degree m which x ‘Let x' be a real zero agrees with f at x i-m° i+1 1,xi_1,..., of h chosen according to a Specified policy. Define ¢(f)(xi,...,x ) = x This is called direct Lagrange interpola- i-m 1+1. tion. Alternately, let H(y) be the Lagrange interpolation poly- ‘nomial of degree m which satisfies H(yj) 8 x for j’ j - i, i-l,...,i-m, where yj = f(xj). Then H interpolates f-l, so this is called inverse Lagrange interpolation. Define x1+1 I H(O). Both Of these methods are sequential one-point iteration methods *with memory m, N . l, and vo(q9 = v(q9 - l. The Strong order of convergence A is the unique positive root of the equation t - t -...- t - 1 = 0 . 32 i also satisfies lsx<2 (see [40, pp. 62-74]). Both direct and inverse Lagrange interpolation use f(xj), for j = i,i-1,...,i-m, and both have the same strong order of convergence. If H is the inverse Lagrange interpolation polynomial, H(O) is easily computed, while it is more difficult to find the root by direct interpolation using h. Although it can be shown that b has a real root x which makes the method converge, 1+1 that root need not be unique. If more than one root Of h have the desired properties, additional criteria are used to assure that the choice of x1+1 is well-defined. h(x) is a polynomdal of degree m, so the problem of actually finding its root or roots may require a separate rootfinding algorithm. Thus it may be quite difficult to compute x as a root of h, while the computation i+1 of xi+l as H(O), where H is the inverse interpolation poly- nomial, is straightforward. For this reason, rootfinding .algorithms using inverse Lagrange interpolation are preferred over those using direct Lagrange interpolation. Chapter II will consider parallel iterations which use Lagrange interpolat ion . 1.9.6 Parallel Secant Method The parallel secant method introduced in Section 1.4 is a parallel one-point iterative method without memory with m - 0, Ii - 3, vo(q9 - v(m) - 1. That is, only one parallel evaluation of f is necessary to evaluate f at three different points. 33 In Chapter II, it will be shown that the parallel secant method has weak order of convergence 2. We note that the parallel secant method is a generalization of a sequential one-point iteration with memory m = 1. Most of the parallel one-point methods considered in this paper are generaliza- tions of sequential methods with memory. 1.10 Results on Optimal Order Since the publication of Traub's book [40] in 1964, con- siderable progress has been made in the search for sequential iterative rootfinding methods of Optimal order. This section will Survey some of those results. The orders of convergence discussed in this section are all Strong orders of convergence. An extensive list of references dealing with results on optimal orders can be found in [39]. In the case of sequential one-point methods without memory, it is known [40, p. 98] that the highest order which can be achieved using only f(xi),f'(xi),...,f(r-1)(x1) is r for r 2 2. Further, any such method of order r must depend explicitly on f(r-l) f(xi),f'(xi),..., (x1). In fact, this bound is achieved by the generalized Newton-Raphson method introduced in Section 1.9.2. In the case of sequential one-point methods with memory, consider the method m which computes f(xi),f'(xi),...,f(r-1)(x .,£(r'1) 1) and uses the values Of f,f',.. 
evaluated at xi_1,xi_2,...,xi_m which were computed and stored during previous iterations. It is shown in [40, p. 106] that the order of con- vergence A of m satisfies 34 r < K < r+1, for m.= 1,2,... . (1-14) Hence no amount of memory will improve the order by as much as one evaluation of f(r). In [5], Brent and Winograd consider another class of sequential one-point methods with memory. They show that the highest order of convergence which can be achieved by direct or inverse Hermite interpolation of f(k)(xj), for j = 0,1,...,i; k = 0,1,...,r-l, is r+l. This bound is achieved by using all_of the old functional evaluations in the Hermite interpolation to compute x . They suggest that the upper bound in (1-14) is i+1 approached as m grows large. Rissanen [33] proves that the usual secant method has optimal order among sequential one-point iterations using only f(xi) and f(xi_1), if a smoothness condition on admissible algorithms is assumed. In the case of sequential multipoint methods without memory, Kung and Traub [21] construct two classes of algorithms of order Zn-l using n functional evaluations. The first method evaluates f at n points (see Section 1.9.3), while the second method evaluates f' at one point and f at n-l points (see Section 1.9.4). Kung and Traub go on to show that the order of any sequential multipoint method without memory using n functional evaluations is 5 2“. It is their conjecture that the optimal order is actually 5 2n-1. In another paper [20], Kung and Traub prove the truth of their conjecture in the case n - 2. Very little work has been done in the area of sequential multipoint iterations with memory. The question of optimal order for such methods is still open. 35 1.11 Definitions of Computational Efficiency If we wish to compare two different iterative rootfinding algorithms, we must have some measure of efficiency. Perhaps the ultimate efficiency measure is the time required to compute the desired root to within a specified accuracy. However, this is an impractical measure since it depends heavily on the function, the initial guess, and the machine used for the computations. For com- parison of general methods, we want a! measure which does not depend on the computer used to implement the algorithm. The simplest measure is the order of convergence of the method, since the error of a high order method asymptotically approaches 0 more rapidly than the error of a low order method. If two methods have the same order of convergence, the one with the smaller asymptotic error constant converges more quickly. However, the results cited in Section 1.10 on Optimal orders showed that high order iterative methods require more functional evaluations than lower order methods. In some cases, the desired root may be found in less time by a low order method requiring fewer functional evaluations than by a high order method. This suggests that an efficiency measure be defined which reflects both the order of convergence of the method and its in- formation usage (see Definition 1-3). The following efficiency measure generalizes a measure used by Traub [40] for sequential algorithms to include parallel algorithms. Definition ;;§, Let m be an iterative rootfinding method with order of convergence 1- Let the information usage of m, the total number of parallel functional evaluations required by one step of 36 the method, be denoted by v(q9. 
Define the information efficiency of the method to be (1-15) This measure of efficiency is invariant under the composi- tion of a method with itself. That is, if V = q>o m, one step of w is the same as two steps of m. If q; has order I and informa- tion usage v, v has order K2 and information usage 2v. Since that same work is being done, one would expect the information efficiency to be the same. Indeed, 2 log k IE(cp) = T;— = IEW) - In [19], Kung and Traub show that (1-15) is essentially the unique way to define an efficiency measure, since any efficiency measure which is invariant under self composition is of this form or is a strictly increasing function of this form. Traub [40] shows that this measure of efficiency is inversely pr0portional to the total time required to compute the root 0 to a specified accuracy. Other estimates of the time required to find the root 0 to a specified accuracy may replace the information usage used in (1-15). In [18], Kung shows that if the time is assumed to be pro- portional to the number of non-constant multiplications or divisions, M, then 1082 k M S l . 37 It is for this reason that log2 is used in (1-15). Unless other- wise noted, all logs in this paper are taken in base 2. To see that the information efficiency defined by (1-15) does not reflect the entire cost of computation, consider the sequential methods f(xi) x1+1 = 2&0?) (xi) = xi - my , (1-16) and 2 [f(xi)] x1+1 = ‘91:“)“9 = "1 ' f(xi + 130(1)) - f(xi) (1‘17) qh is the Newton-Raphson method, while mr is a two-point method due to Kung and Traub [20]. Both methods have strong order of con- vergence k = 2, and both have information usage v = 2. Hence IE(qN) = IE(q&) = 1/2 . Although both methods have the same information efficiency, the time required by each of them to compute a to a specified accuracy is different. 4% has the advantage of not requiring f'. This is important if f' is not available or if it is costly to evaluate. However, once the functional evaluations have been made, qh requires only one addition and one division, while qk requires three additions, one multiplication, and one division. Hence if f and f' are equally costly to evaluate, qh will converge in less time. In fact, “h is faster whenever f' can be evaluated in less time than is required for the evaluation of f and the execu- tion of two additions and one multiplication. 38 This suggests the following definition of efficiency which is due to Kung and Traub [19]. Definition 1-7. Let m be an iterative rootfinding method with mo) ,0) order of convergence X- Let ) denote the number of arithmetic operations needed to evaluate . Let a(q9 denote the number of parallel arithmetic Operations needed to combine the functional values used to compute X Then we shall call the efficiency 1+1' of m 1082 A z Vj(cp)c(f(j)) + 8(cp) jZO EFF(gp,f) = . (1-18) (1) where vj(q9 denotes the number of parallel evaluations of f which are necessary at each step. It is easily seen that the denominator of (1-18) is the total number of parallel arithmetic operations required at each step of the method. We will be concerned with finding upper and lower bounds for the efficiencies of various classes of algorithms by finding lower and upper bounds, respectively, for the denominator of (1—18). If we let cf = min{c(f(j)), j 2 O}, we immediately have the upper bound log I. EFF(qp,f) svcf + 8W) . (1-19) where v(q9 = E Vj(q9 is the number of parallel functional evalua- ij tions required at each step. 
1.12 Results on Optimal Efficiency

We now survey some of the results which are known about algorithms whose efficiency (see Definition 1-7) is optimal. This is an active area of interest and many questions are still open.

In [20], Kung and Traub consider the methods φ_N and φ_T given by equations (1-16) and (1-17), respectively. They show that these methods have optimal efficiency in the sense of Definition 1-7 among all sequential methods without memory using only two functional evaluations. Their respective efficiencies are given by

\[
\mathrm{EFF}(\varphi_N, f) = \frac{1}{c(f) + c(f') + 2}, \tag{1-20}
\]

and

\[
\mathrm{EFF}(\varphi_T, f) = \frac{1}{2c(f) + 5}, \tag{1-21}
\]

where c(f) and c(f') denote the number of arithmetic operations needed to evaluate f and f', respectively. From equations (1-20) and (1-21), it follows that the Newton-Raphson method φ_N is more efficient than the two-point method φ_T if and only if c(f') < c(f) + 3.

Next we compute the efficiency of the parallel secant method introduced in Section 1.4. Recall that the first component of X_{i+1} is given by

\[
x_{i+1,1} = x_{i,1} - \frac{x_{i,1} - x_{i,2}}{f(x_{i,1}) - f(x_{i,2})}\, f(x_{i,1}). \tag{1-22}
\]

The parallel secant method requires one parallel evaluation of f and has weak order of convergence 2. Equation (1-22) requires five arithmetic operations. Hence the efficiency of the parallel secant method, φ_PS, is given by

\[
\mathrm{EFF}(\varphi_{PS}, f) = \frac{1}{c(f) + 5}. \tag{1-23}
\]

Comparing equation (1-23) with equations (1-20) and (1-21), we see that the parallel secant method is always more efficient than Traub's two-point method, and that it is more efficient than the Newton-Raphson method whenever c(f') > 3.

For iterative rootfinding methods φ belonging to a given class Φ of rootfinding methods, Kung and Traub [19] and Traub [40] define

\[
E_n(\Phi, f) = \sup\{\mathrm{EFF}(\varphi, f) : \varphi \in \Phi,\; v(\varphi) \le n\}
\]

and

\[
E(\Phi, f) = \sup\{E_n(\Phi, f) : n = 1, 2, \ldots\},
\]

where v(φ) denotes the information usage of φ. They then give lower and upper bounds for E_n and E for various classes Φ of algorithms. Their results are summarized by the following theorems.

Theorem 1-1 [19]. Let Φ be the class of sequential one-point iterations without memory. Then there exists some constant ρ > 0 such that

\[
\frac{\log n}{\displaystyle\sum_{i=0}^{n-1} c(f^{(i)}) + \rho\, n \log n} \;\le\; E_n(\Phi, f) \;\le\; \frac{\log n}{c_f + n - 1}, \tag{1-24}
\]

for all n, where c_f = min{c(f^{(i)}) : i ≥ 0}, and, if c_f ≥ 4,

\[
\frac{\log 3}{c(f) + c(f') + c(f'') + 7} \;\le\; E(\Phi, f) \;\le\; \frac{\log 3}{3c_f + 2}. \tag{1-25}
\]

Theorem 1-2 [19]. Let Γ be the class of sequential multipoint iterations without memory using values of f only. Then there exist constants r_0, r_1, and r_2 with r_2 > 0 such that

\[
\frac{n-1}{n\,c(f) + r_2 n^2 + r_1 n + r_0} \;\le\; E_n(\Gamma, f) \;\le\; \frac{n}{n\,c(f) + n - 1}, \tag{1-26}
\]

for all n, and

\[
\frac{1}{c(f)}\left(1 + \frac{\xi}{\sqrt{c(f)}}\right) \;\le\; E(\Gamma, f) \;\le\; \frac{1}{c(f)}, \tag{1-27}
\]

for some constant ξ < 0.

It is further conjectured in [19] that for the class of all sequential one-point or multipoint iterations without memory,

\[
E_n(f) \le \frac{n-1}{n\,c_f + n - 1}, \qquad\text{and}\qquad E(f) \le \frac{1}{c_f + 1}.
\]

The results in Theorems 1-1 and 1-2 will be used to compare the efficiencies of the parallel rootfinding algorithms developed in this paper with the efficiencies of optimal sequential algorithms.

1.13 Parallel Rootfinding

There has been very little research done in the area of parallel rootfinding algorithms.

In 1968, Karp and Miranker [15] gave a parallel searching technique for a maximum which is optimal in the minimax sense. These techniques can be used to solve the rootfinding problem, since finding a maximum of -|f(x)| is the same as finding a root of f. Their algorithm makes one parallel functional evaluation to yield f evaluated at N points.
It then searches for the largest value. Since searching cannot be done efficiently in parallel, this algorithm does not effectively utilize a parallel machine, but it does achieve a Speed-up ratio proportional to log N. In [23], Miranker gives a "parallel" rootfinding algorithm, but the i th processing element PE 1 must have available the results computed by PE 1, PE 2,..., PE i-l before it can proceed. Thus this is not a truly parallel algorithm. Miranker claims a speed-up ratio of %'1og1.618 N. In [32], Rice gives several parallel rootfinding algorithms based on direct or inverse polynomial interpolation. Some of the details of Rice's results will be presented in Section 2.1. Roughly, he computes the order of convergence of each of his methods from a matrix representation of the data used by each method. Some of Rice's methods require different computations to be performed by each processing element, so they are only suitable for implementation on a MIMD machine. However, his simultaneous (N-2) degree method is suitable for use on an array processing system like ILLIAC IV. This class of algorithms will be studied in detail in Chapter II. CHAPTER II ROOTFINDING BY LAGRANGE INTERPOLATION One class Of rootfinding algorithms whose structure is well suited to an array processing system like ILLIAC IV is based on direct or inverse Lagrange interpolation. In this chapter, we will study such methods. In particular, we will give the order of con- vergence of such methods, consider their efficiency, and show that their Speed-up ratio is proportional to log N, where N is the number of processing elements being used. We conclude that because of their poor Speed-up ratio, these methods are not well suited for parallel processing, even though their structure makes efficient use of the parallel capabilities of a SIMD computer. Definition 2;}, Let Gr denote the class of all stationary one- point parallel iterative rootfinding methods without memory which require that f,f',...,f(r) be evaluated at the components of X . i In this chapter we shall consider G , the class of methods 0 which use only values of f. The case of the general class Gr will be considered in Chapter III. In what follows, we shall require that the real valued function f with a simple root 0 is fixed. 2.1 Rice's Results In [32], Rice gives the algorithms we will discuss in this chapter. A general description of the algorithms of [32] follows. 43 44 A sequence of N-dimensional vectors {X1 = (x1,1,xi’2, ...,x1 N)] is generated, where each x, is an approximation to 3 1.1 a. An N x.N matrix T is formed by 1 if x, is used to compute 1,1 x1+1,1< 0 otherwise Consider the set = l} for a fixed k. Let H Sm = {..in3”. be the Lagrange interpolation polynomial which satisfies = f s , H(y1,3) x1,3’ or xi,J 6 i,k where yi j ' f(x. j) Then let xi+1,k = H(O) That is, Rice performs classical inverse Lagrange interpolation on f. Rice's main result is the following. Theorem 2-1 [32]. Assume that f 6 CI“,1 in a neighborhood of the simple root a- Let p be the spectral radius of the matrix repre- sentation T. If p > 1, then the iterative method converges and has weak order of convergence p. We wish now to focus our attention on the type of method which Rice calls the simultaneous (N-2) degree method. This method is of interest because it is the only method proposed by Rice which is suitable for implementation on a SIMD machine. His other methods require a MIMD machine because all of the processing elements do not perform the same computations. 
For example, one method requires three processors to perform interpolation at three points, while a fourth processor performs interpolation on four points.

In the simultaneous (N-2) degree method, there are N distinct approximations to α stored at each step. Each of these N points is stored in a different processing element. To compute the new approximations, each processing element uses N-1 of these N points to compute a Lagrange interpolation polynomial of degree at most N-2. Of the many ways the N-1 points could be chosen by each processor, Rice has each processing element use the points stored in the other N-1 processors. Thus the matrix representation T for the simultaneous (N-2) degree method is given by T_{kj} = 1 - δ_{kj}, where δ_{kj} is Kronecker's delta. For example, the matrix representation T for N = 4 is

\[
T = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}. \tag{2-1}
\]

The weak order of convergence of the simultaneous (N-2) degree method is N-1. A simplified flow chart for the operations of one of the N processing elements is shown in Figure 2-1.

[Figure 2-1. Flow Chart for PE k:
use x_{i,j} and y_{i,j}, for j ≠ k, to compute x_{i+1,k};
evaluate f(x_{i+1,k}) = y_{i+1,k};
send x_{i+1,k} and y_{i+1,k} to the other PE's and, for j ≠ k, receive x_{i+1,j} and y_{i+1,j}.]

2.2 Parallel Secant Method

The parallel secant method introduced in Section 1.4 is a slight modification of Rice's simultaneous first degree method (N = 3). Recall that if y_{i,j} = f(x_{i,j}), then the parallel secant method is given by

\[
\begin{aligned}
x_{i+1,1} &= x_{i,1} - \frac{x_{i,1} - x_{i,2}}{y_{i,1} - y_{i,2}}\, y_{i,1}, \\
x_{i+1,2} &= x_{i,2} - \frac{x_{i,2} - x_{i,3}}{y_{i,2} - y_{i,3}}\, y_{i,2}, \qquad (2\text{-}2)\\
x_{i+1,3} &= x_{i,3} - \frac{x_{i,3} - x_{i,1}}{y_{i,3} - y_{i,1}}\, y_{i,3}.
\end{aligned}
\]

If we compare the flow chart given in Figure 1-8 for the parallel secant method with the flow chart given in Figure 2-1 for the simultaneous (N-2) degree method in the case N = 3, we see that the parallel secant method requires less data to be transferred among the processing elements. For example, in Rice's method, PE 1 must receive x_{i+1,2}, y_{i+1,2}, x_{i+1,3}, and y_{i+1,3}, while in the parallel secant method, PE 1 need only receive x_{i+1,2} and y_{i+1,2}. Hence the parallel secant method requires less communication among the processing elements than Rice's method. Note that this modification does not change the spectral radius of the matrix representation of the method, so both methods have weak order of convergence 2. Theorem 2-5 will give a different proof that the weak order of the parallel secant method is 2.

This method requires one parallel functional evaluation and five arithmetic operations per step, so the information efficiency of this method is

\[
\mathrm{IE} = \frac{\log 2}{1} = 1,
\]

and the efficiency is

\[
\mathrm{EFF} = \frac{\log 2}{c(f) + 5} = \frac{1}{c(f) + 5},
\]

where c(f) is the number of arithmetic operations needed to evaluate f, and log is understood to be log₂.

[Figure 2-2. Parallel Secant Method]

In Section 1.12 we showed that the parallel secant method is more efficient than the Newton-Raphson method whenever c(f') > 3.

2.3 Order of Convergence

In this section we will show that the weak order of convergence of parallel rootfinding methods based on Lagrange interpolation is equal to the number of points used for the interpolation. Before stating the exact result, we state the following well-known result on the error of Hermite interpolation (see [40, p. 244]). In this chapter we will only use this result for Lagrange interpolation (r = 0), but the more general result will be used in Chapter III.
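A minimal numerical sketch of the parallel secant iteration (2-2) of Section 2.2, simulating the three processing elements with a loop, previews the quadratic behaviour established in this section. The test function, the starting triple, and the fixed iteration count are assumptions made only for the illustration.

```python
# Sketch (illustrative only): the parallel secant iteration (2-2) with N = 3.
# Each component j is updated from (x_j, x_{j+1 mod 3}); one parallel
# evaluation of f and five arithmetic operations per component per step.
# The test function and starting points are assumptions for this example.

def f(x):
    return x**3 - 2.0

x = [1.0, 1.5, 2.0]                      # X_0: three distinct approximations
for _ in range(4):                       # a few iterations suffice here; a
    y = [f(t) for t in x]                # practical code would also guard
    x = [x[j] - y[j] * (x[j] - x[(j + 1) % 3]) / (y[j] - y[(j + 1) % 3])
         for j in range(3)]              # against coincident components
print(x)                                 # each component is near 2**(1/3) = 1.259921...
```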
3 k Lemma 2:2, Let B = k'+ 2 rj, rJ 2 0, j = 1,2,...,kx ‘Let f E C6 J=1 on an interval containing x1,x2,...,xk. Let h(x) be the unique Hermite interpolation polynomial of degree 5 B - l satisfying ), for i = 0,...,r ; j = l,...,k . (i) = (i) h (xj) f (xj j Then the error is given by (B) k r.+1 E(x) = f(x) - h(x) = £_§rL§l H (x - xj) J H where g is on the interval Spanned by x,x1,x2,...,xk. Let us give the notation needed to construct parallel root- finding algorithms based On inverse Lagrange interpolation. The same notation will be used in Chapter III for methods based on in- verse Hermite interpolation. Fix k satisfying 2 s k s N-l. Choose N distinct subsets A for j = 1,2,...,N, of the set 1’ {l,2,...,N] such that j 6 Aj and, A1 contains k elements. We will use the following lemma and its corollary to find the orders of convergence for most of the methods considered in this paper. Lemma 2-3. Let g) be a parallel rootfinding algorithm which gen- erates a sequence {Xi} whose errors satisfy 11 (61,98 , j =l,2,...,N , (2-3) EA 31+1,j= K14 t J for some sequence of constants K , where = a - x . Let 1,3 31,3 49 E x \€,,j\ W H. II B H. :3 \e i,j' xl a x \Ki,j\ II S 1.1. D \K,,j\ Assume further that e # O for all i and j, and that E i o , ....I H. a WI ll i-m 11m 51 law g # o . Then there exist positive sequences c1 and di’ both with strong order of convergence ks, satisfying C1S§1$\ei,j\561‘di , for j = 1,2,...,N, and 1 = 0,1,... Proof. From equation (2-3), s. 1 = \x 1 u \e \S , <2-4) ' 1+1,J 1,.1 tt'Aj in: so that '- s ‘- ks \erp1’1\ 5 K1 ch (51) = K1(61) J Then - ks 61+1 S K1(51) Let d = 60, and di+ 50 d lim “~Etl- = K , ks 14m (d1) so the sequence {d1} has strong order of convergence ks. By induction, 61 s (11 since - ks '— ks 5 s Ki(5i) s Ki(di) - d i+l 1+1 Similarly from equation (2-4), s ks 'ei+1,j' 2 E1 H (51) ' §i(§i) tEA J The if c = and c = K (c )k8 n 0 go, 1+1 -i 1 ’ c lim -lilif'= K , s _ ifim (c1) so {c1} has strong order of convergence ks, and ks ks From this lemma it follows that Corolfiltaq 2-4. Under the conditions of Lenma 2-3, the rootfinding method m has weak order of convergence ks. Proof. According to the lemma, there exist sequences {c1} and [d1], both with strong orders of convergence ks, which satisfy for j = l,...,N, and i = 0,1,... ci 5 'ei,j' 5 d1 , Then since strong order implies weak order, 1/1 1/1 ks = lim (-ln d1) 5 if: (-1n\ei,j\) 14a /1 = 5 lim (-1n Ci)1 ks , 1 am which.implies that 51 lim (-1n \ei jp'l' = ks , for j = 1,2,...,N idm ’ [I In order to find the order of convergence for parallel root- finding methods based on Lagrange interpolation, we need some addi- tional restrictions on f. Let f E Ck(I), where I is an Open interval containing the simple root 0 of f, and on which f' i 0. Let f.1 be de- noted by g. Then g E Ck on some Open interval containing 0. Assume that g(k) # O in a neighborhood of 0. Let yi j = f(xi ). .1 We are now ready to state and prove the main result of this section. Theorem 2-5. With the notation and assumptions given above, for each 1 = l,2,...,N, let H (y) denote the Lagrange interpolation J polynomial of degree at most k-l which satisfies Hj(yi,s) = xi,s , for s 6 Aj Let ¢k 6 G0 be the method which defines xi+1 j = Hj(0) If distinct points are chosen sufficiently x0,l’x0,2"°°’xO,N close to a, then either xi j = a for some i and 1, or lim x = a , for j = 1,2,...,N . i,j iaa Thus the method ¢k converges. MOreover, mk has weak order of convergence k. The proof is a generalization of the proof given in [30] for the sequential case. 
When no ambiguity will result in the proof, we suppress the subscript i to simplify the notation. 52 Proof. If yt = 0 for any t = 1,2,...,N, then x1 t = a, and we are done. Hence we may assume that yt i 0. We may also assume that ys # yt, for s i t, for if x = x we may re- i,s i,t’ place xi s by any x from the interval spanned by x1,1,xi’2,...,xi’N. Since the resulting ei,s still lies between the smallest and the largest errors, the proof remains valid. If for some i, all of the xi 3 are identical, they span no interval and the method fails. H is the Lagrange interpolation polynomial for g, so that it must have the form 8(Y)‘“ H(Y) = SEAJL8(Y)8(YS) = SEAJLS(Y)X1’S , (2'5) where () ny-y" Ly= —-—-. s t6Aj ys ' yt tis Let Z = H (-ys). Inverse Lagrange interpolation estimates 86Aj a = 3(0) with Hj(O), so in our method we have x i s x =H(0)- 2: 1(0):: =-zz 5* . (2-6) t+1’j j 36Aj 8 1’8 86Aj ysch (ys - yt) J tfis According to Lemma 2-2, the error in (2-6) is given by 5(k)(fll z a ' xi+1,j ___ k! ’ ei+1,j = (2'7) where n lies in the interval spanned by 0 and ys, s E Aj. Now 53 ’~< I - f(xi,s) = f(xi,s) - f(a) f'(gs) (xi,s - a) = -f'(§s) 61,3 where gs lies between a and x , and T13 lies between 0 is and ys. Then equation (2-7) becomes e = (k) 11 31$.— . (2-8) i+l,j k! SEA g'(‘ns) J If are chosen sufficiently close to 0:, equation (2-8) implies that ’ = = 2 111“ 61,1 0 1 for j 1: 9 J": 11-009 so the method converges. We have assumed yt 1‘ 0, so that 1‘ 0, for t = 1,2,...,N. Since g(k) 3‘ 0 near 0 by assumption, 31,: equation (2-8) implies that 3i+1 j :4 0. Let k K = 11211). n 1 , 1:3 1" SEA 8'(T'S) 1 so that (R) 11111 K1 j = —8—'(9-l); 1‘ O i-m ’ k![g'(0)] Then all of the hypotheses of Leanna 2-3 are satisfied with s = 1, so it follows from Corollary 2-4 that cpk has weak order Of con- vergence k. I For each k, there are (:1) different subsets of the set N {1,2,...,N]. Hence there are ((k)) different methods of order k N 54 based on inverse Lagrange interpolation, but these methods differ only in the indexing of x1 1,xi 2,...,x, . These methods arise 1,N from different choices for the subsets Aj' The condition 2 s k s'N-l arises from the need to choose the N distinct subsets Aj of {1,2,...,N], each containing k distinct elements. The subsets must be distinct to assure that {xt+1,1,x1+1’2,...,x1+1,N] are all distinct, so k cannot equal N. This condition gives the following corollary to Theorem 2-5. Corollary 2-6. Under the conditions of Theorem 2-5, the highest weak order of convergence which can be achieved on a parallel pro- cessing system with N processing elements by rootfinding algorithms based on inverse Lagrange interpolation is N-l. This corollary shows that the Optimal order for this class of parallel rootfinding algorithms grows linearly with the machine size. We now compare the orders of convergence of parallel root- finding algorithms based on inverse Lagrange interpolation with the correSponding orders of convergence of the sequential one-point iterations with memory which also use inverse Lagrange interpolation. We have already seen, for example, that the parallel secant method (N B 3, and k = 2) increases the order of convergence of the usual secant method from 1L-'§-"C2N‘1.618 to 2. As noted in Section 1.10, [40, p. 106] shows that the order of convergence of sequential one- point iterations with memory m - k-l is < 2. Hence the use of parallel processing has increased the order of convergence from something less than two to k. 
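Theorem 2-5 can also be checked numerically. The following sketch, with N = 4 processing elements, k = 3 points per interpolation, the subsets A_j = {j, j+1, j+2} (mod N), and the test function f(x) = x³ - 2, is an illustration only; all of these choices are assumptions made for the example. The printed errors shrink roughly like e → e³, in agreement with the weak order k = 3.

```python
# Sketch (illustrative only) of the method of Theorem 2-5: each simulated
# processing element performs inverse Lagrange interpolation at the k points
# indexed by its subset A_j and evaluates the interpolant at y = 0.

def f(x):
    return x**3 - 2.0

def inverse_lagrange_at_zero(xs, ys):
    # value at y = 0 of the polynomial interpolating the data (ys[t], xs[t])
    total = 0.0
    for s in range(len(xs)):
        w = xs[s]
        for t in range(len(xs)):
            if t != s:
                w *= (0.0 - ys[t]) / (ys[s] - ys[t])
        total += w
    return total

N, k = 4, 3
root = 2.0 ** (1.0 / 3.0)
x = [1.2, 1.25, 1.3, 1.35]                         # N distinct starting points
subsets = [[(j + m) % N for m in range(k)] for j in range(N)]
for i in range(3):
    y = [f(t) for t in x]                          # one parallel evaluation of f
    x = [inverse_lagrange_at_zero([x[s] for s in A], [y[s] for s in A])
         for A in subsets]
    print(i, max(abs(t - root) for t in x))        # errors: ~1e-4, ~1e-11, ~1e-16
```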
It can be shown that Theorem 2-5 remains true if direct Lagrange interpolation is used instead of inverse Lagrange interpolation. However, as we saw in Section 1.9.5, the methods using inverse Lagrange interpolation are always preferred over those using direct Lagrange interpolation. For this reason the corresponding theorem for direct interpolation is omitted.

2.4 Efficiency

It was shown in Section 2.3 that a weak order of convergence of N-1 can be achieved on a parallel processing computer using N processors. This requires each processor to perform Lagrange interpolation on N-1 points. Although this interpolation is very costly for large N, we will show that if evaluations of f are expensive, the optimal efficiency for methods based on Lagrange interpolation is achieved by taking k = N-1.

Definition 2-2. Let φ_k be a parallel rootfinding algorithm based on inverse Lagrange interpolation as described in Theorem 2-5. Define

\[
P_k(f) = \mathrm{EFF}(\varphi_k, f) = \frac{\log_2 k}{\displaystyle\sum_{j \ge 0} v_j(\varphi_k)\, c(f^{(j)}) + a(\varphi_k)},
\qquad
P_*(f) = \sup_{2 \le k} P_k(f).
\]

We will use the results of [19], summarized in Theorems 1-1 and 1-2, to estimate the optimal efficiency for sequential computation.

Theorem 2-7. Assume that f is analytic in a neighborhood of a simple root α. Let φ_k be an inverse Lagrange interpolation method described in Theorem 2-5. Note that k is not restricted to be ≤ N-1. Then

i) IE(φ_k) = log₂ k.

ii)
\[
P_k(f) = \mathrm{EFF}(\varphi_k, f) = \frac{\log k}{c(f) + 2k^2 + k - 1}. \tag{2-10}
\]

iii) Let γ denote the unique root which is ≥ 1 of

\[
u(k) = (4k^2 + k)\ln k - (c + 2k^2 + k - 1) = 0. \tag{2-11}
\]

Then

\[
P_*(f) \approx \frac{1}{(4\gamma^2 + \gamma)\ln 2}. \tag{2-12}
\]

iv) Assume that 2 ≤ k ≤ N-1 and that c = c(f^{(i)}) for all i. For large values of c, the speed-up ratio of φ_k compared to the optimal sequential one-point methods without memory is asymptotic to

\[
\frac{3}{\ln 3}\left(\ln(N-1) - \tfrac{1}{2}\right), \qquad\text{as } N \to \infty.
\]

v) Assume that 2 ≤ k ≤ N-1. For large values of c = c(f), the speed-up ratio of φ_k compared to the optimal sequential multipoint iterations without memory using values of f only is asymptotic to

\[
\log(N-1) - \frac{1}{\ln 4}, \qquad\text{as } N \to \infty.
\]

Proof. i) Each iteration of φ_k requires one parallel evaluation of f and yields weak order of convergence k.

ii) Count the arithmetic operations in equation (2-6).

iii) To show that γ is unique, let

\[
u(t) = (4t^2 + t)\ln t - (c + 2t^2 + t - 1).
\]

Then

\[
u'(t) = (8t + 1)\ln t + (4t + 1) - (4t + 1) = (8t + 1)\ln t > 0, \qquad\text{for } t > 1.
\]

Now u(1) = -(c + 2) < 0, while lim_{t→∞} u(t) = +∞. Hence u(t) = 0 has a unique solution on the interval (1, +∞).

To find the value of k for which P_*(f) = sup_{2≤k} P_k(f) is attained, we observe that P₁(f) = 0 and that lim_{k→∞} P_k(f) = 0. Thus P_*(f) is attained for some finite k > 1. To find the optimal k, differentiate both sides of equation (2-10) and set the derivative equal to zero to get

\[
\frac{dP_k(f)}{dk} = \frac{\dfrac{c + 2k^2 + k - 1}{k \ln 2} - (4k + 1)\log k}{(c + 2k^2 + k - 1)^2} = 0.
\]

This implies that

\[
(4k^2 + k)\ln k - (c + 2k^2 + k - 1) = 0.
\]

γ is assumed to satisfy equation (2-11), so substituting (2-11) into (2-10) gives (2-12). Although k must be an integer, γ need not be. Hence P_*(f) is actually attained by P_k(f) for some k within one unit of γ.

Before we proceed with the proof of the rest of the theorem, we digress to consider the behavior of γ and P_k(f). Figure 2-3 shows how P_k(f) depends on k and on the cost of functional evaluation, c. Table 2-1 shows typical values of γ and the resulting P_*(f) for given values of c.

Table 2-1. P_*(f)

  c(f)      γ         k_opt     P_*(f)
  1         1.755     2         .100
  2         1.864     2         .091
  5         2.120     2         .071
  20        2.893     3         .040
  100       4.764     5         .015
  10^4      29.360    29        .00041
  10^6      225.36    225       7.1 × 10^-6
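A short computational sketch, assuming only the cost model of (2-10) and (2-11), reproduces the entries of Table 2-1; the bisection routine, the bracketing interval, and the choice of candidate integers are illustrative details, not part of the original analysis.

```python
# Sketch (illustrative): solve u(k) = (4k^2 + k) ln k - (c + 2k^2 + k - 1) = 0
# by bisection and evaluate P_k(f) = log2(k) / (c + 2k^2 + k - 1), as in
# Theorem 2-7 and Table 2-1.
import math

def u(k, c):
    return (4 * k * k + k) * math.log(k) - (c + 2 * k * k + k - 1)

def gamma(c):
    lo, hi = 1.0, 2.0
    while u(hi, c) < 0:                 # u(1) < 0 and u increases to +infinity
        hi *= 2.0
    for _ in range(100):                # bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if u(mid, c) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def P(k, c):
    return math.log2(k) / (c + 2 * k * k + k - 1)

for c in (1, 2, 5, 20, 100, 10**4, 10**6):
    g = gamma(c)
    candidates = {max(2, math.floor(g)), max(2, math.ceil(g))}
    k_opt = max(candidates, key=lambda k: P(k, c))   # integer within one of gamma
    print(c, round(g, 3), k_opt, P(k_opt, c))
```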
[Figure 2-3. P_k(f)]

Note that lim_{c→∞} γ = ∞, for if we let

\[
c = 1 + e^{n}\bigl(n - 1 + e^{n}(4n - 2)\bigr),
\]

then γ = eⁿ. Note also that P_*(f) is independent of the machine size N, so it cannot be attained if γ ≥ N. In practice, k should be chosen to satisfy

\[
k = \begin{cases}
N - 1, & \text{if } c > 3N - 2N^2 + (4N^2 - 7N + 3)\ln(N-1), \\[4pt]
\text{the integer within one of } \gamma \text{ which maximizes } P_k(f), & \text{otherwise.}
\end{cases}
\]

For ILLIAC IV, N = 64, so that if c > 58,038, γ cannot be attained. That is, unless 58,038 or more arithmetic operations are required for each evaluation of f (or its rational approximation), the optimal efficiency for the class of parallel rootfinding algorithms based on inverse Lagrange interpolation can be realized on ILLIAC IV. If c = 58,038, then P_*(f) ≈ 9.05 × 10^-5, E₁(f) ≈ 0.91 × 10^-5, and E₂(f) ≈ 1.72 × 10^-5, so all of these methods would require a long time to compute α to a specified accuracy, but the parallel method is slightly faster.

Let us now return to the proof of the theorem.

iv) Using equation (2-12) and the bounds for E₁(f) given in Theorem 1-1,

\[
\frac{3c + 2}{(4\gamma^2 + \gamma)(\ln 2)\log 3} \;\le\; \frac{P_*(f)}{E_1(f)} \;\le\; \frac{3c + 7}{(4\gamma^2 + \gamma)(\ln 2)\log 3}. \tag{2-13}
\]

From equation (2-11),

\[
c = (4\gamma^2 + \gamma)\ln\gamma - (2\gamma^2 + \gamma - 1). \tag{2-14}
\]

Substituting equation (2-14) into (2-13), we get

\[
\frac{3(4\gamma^2 + \gamma)\ln\gamma - 6\gamma^2 - 3\gamma + 5}{(4\gamma^2 + \gamma)\ln 3}
\;\le\; \frac{P_*(f)}{E_1(f)} \;\le\;
\frac{3(4\gamma^2 + \gamma)\ln\gamma - 6\gamma^2 - 3\gamma + 10}{(4\gamma^2 + \gamma)\ln 3},
\]

which implies that

\[
\frac{3}{\ln 3}\ln\gamma - \frac{6\gamma^2 + 3\gamma - 5}{(4\gamma^2 + \gamma)\ln 3}
\;\le\; \frac{P_*(f)}{E_1(f)} \;\le\;
\frac{3}{\ln 3}\ln\gamma - \frac{6\gamma^2 + 3\gamma - 10}{(4\gamma^2 + \gamma)\ln 3}. \tag{2-15}
\]

We have seen that lim_{c→∞} γ = ∞, so that if c is large enough, the most efficient Lagrange interpolation method which can be implemented on a machine with N PE's uses k = N-1 points. As larger machines are used, the optimal efficiency P_*(f) is approached. Hence (2-15) shows that P_*(f)/E₁(f) is asymptotic to

\[
\frac{3}{\ln 3}\left(\ln(N-1) - \tfrac{1}{2}\right), \qquad\text{as } N \to \infty.
\]

v) Similarly, using equation (2-12) and the bounds for E₂(f) given in Theorem 1-2,

\[
\frac{c}{(4\gamma^2 + \gamma)\ln 2} \;\le\; \frac{P_*(f)}{E_2(f)} \;\le\; \frac{c}{\left(1 + \dfrac{\xi}{\sqrt{c}}\right)(4\gamma^2 + \gamma)\ln 2},
\]

where ξ is a negative constant. This implies that

\[
\log\gamma - \frac{2\gamma^2 + \gamma - 1}{(4\gamma^2 + \gamma)\ln 2}
\;\le\; \frac{P_*(f)}{E_2(f)} \;\le\;
\frac{1}{1 + \xi/\sqrt{c}}\left[\log\gamma - \frac{2\gamma^2 + \gamma - 1}{(4\gamma^2 + \gamma)\ln 2}\right],
\]

so that P_*(f)/E₂(f) is asymptotic to

\[
\log(N-1) - \frac{1}{\ln 4}, \qquad\text{as } N \to \infty. \qquad\blacksquare
\]

The important conclusions from Theorem 2-7 are parts iv) and v). They show that if the cost of functional evaluation is so high that γ ≥ N, then the speed-up of parallel rootfinding algorithms based on Lagrange interpolation is proportional to log₂ of the machine size. If it is important to find the root α as fast as possible, as in real-time applications, these methods are useful because they are faster. If, however, one wishes to compute α at a low cost, the speed-up ratio of log N shows that these methods are not well suited for implementation on a parallel processing system.

CHAPTER III

ROOTFINDING BY HERMITE INTERPOLATION

In this chapter we generalize the results of Chapter II on Lagrange interpolation to include inverse Hermite interpolation. As in the Lagrange case, the order of convergence is equal to the amount of information used. That is, while Lagrange interpolation methods use values of the function f evaluated at k points to achieve order k, Hermite interpolation methods use f, f', ..., f^{(r)}, each evaluated at k points, to achieve weak order of convergence k(r+1). If a parallel processing system has N processing elements, we will also show that the speed-up ratio possible by using Hermite interpolation is proportional to log N. This implies that parallel rootfinding algorithms based on Hermite interpolation are not well suited to parallel processing.
The reader is referred to [30] or [40] for a discussion of sequential rootfinding algorithms using Hermite interpolation. 3.1 Order of Convergence The main result of this section is a generalization of Theorem 2-5 to show that the weak order of convergence of parallel rootfinding algorithms based on inverse Hermite interpolation of f,f',...,f(r) at k points is k(r+l). Recall (see Definition 2-1) that Gr is the class of all one-point parallel iterative rootfinding methods without memory 63 ' 64 which require that f,f',...,f(r) be evaluated at the components of X1. Let us also recall the notation needed to construct parallel rootfinding algorithms based on inverse Hermite interpolation. Fix k satisfying 2 s k s N-l. Choose N distinct subsets Aj, j = 1,2,...,N, of the set {1,2,...,N] such that j E Aj’ and Aj contains k elements. Theorem 3-1. Let y = k(r+l). Assume that f E CY on some Open interval containing the simple root a, and on which f' # 0. Let f.1 be denoted by g. Then g E CY on some open interval con- taining 0. Assume that g(Y) * O in a neighborhood of O. For each j = 1,2,...,N, let H (y) denote the Hermite interpolation J polynomial of degree at most v-l which satisfies ) = g(t)(y1 s), for t = 0,...,r, and s 6 A j ”1,. 1 Let i E G be the method which defines k r x141,j = Hj(0) . If distinct points x0’1,x0’2,...,x0’N are chosen sufficiently close to a, then either x = a for some i and j, or 1,1 lim.x = a , for j = 1,2,...,N 1,1 i—oa Thus the method 'k converges. Moreover, 'k has weak order of convergence y. The proof of this theorem follows closely the proof of Theorem 2-5. As before, we will suppress the subscript i to Simplify the notation. and we hood 0: all di: otherw the in where Now Where ficien 65 Proof. We may assume yt i O, for if yt = o = a, X ’ i,t and we are done. we have assumed that f.1 exists in a neighbor- hood Of a, so y1,...,yN are all distinct if x, are 1,l""’xi,N all distinct. If x, are identical, the method fails, 1,1,...,xi,N otherwise, if = x t’ we may replace xi 3 by any point on 9 x1,8 1 9 the interval Spanned by the x's. Hj interpolates g, so according to Lemma 2-2, M r+1 g(y) - Hjm = a_!_m)_ n (y - y ) . (34) y s 86Aj where n lies in the interval Spanned by y, ys, for s E Aj. Now Y8 = f(xi S) - f(a) _ _ O - (xi,s a)f (Es) .- I - -(a - xi,8)/g (T‘s) _ _ I — €1,8/8 (m) 9 where g8 lies between xi 8 and a, and “5 lies between 0 and ya. Then, substituting zero into equation (3-1), ei+1,j = a ' Xi+1,j (v) 1.4.1 =S_Y_!_(fll I'I (7Y8) 36Aj (Y) = A ”D m n (313)”1 . (M) v! H [8'(fls)] SEAj ’ 86Aj If N distinct points x0’1,x0’2,...,x0’N are chosen suf- ficiently close to a, equation (3-2) implies that $0 for The! lute} 1116mm- 66 11m 31’s = O, for 8 = 1,2,...,N , i-m so 'k converges. Equation (3-2) also implies that 31 s # 0 for all i and 3 since we assumed that g(Y) # 0 near 0. Let K = £0011ll 1’j v! n [g'3 SEAj r+1 Then (Y) limKij=—3——L—)—O ,40 1.. ’ Mg'm):Y ’ tear- ...m "an—- so the hypotheses of Lemma 2-3 are satisfied with s = r+l. It follows from Corollary 2-4 that 'k has weak order of convergence v. ' Since k, the number of points in Aj on which the inter- polation is performed, is constrained to lie between 2 and N-l, the following corollary follows easily from Theorem 3-1. Corollary 3-2. Under the conditions of Theorem 3-1, the highest weak order of convergence which can be achieved by an inverse Hermite interpolation method in G1. on a parallel processing computer with N PE's is (N-l)(r+l). 
This order is achieved by ψ_{N-1}. We remark that the sequential one-point iterations with memory m = k-1 which use inverse Hermite interpolation and values of f, f', ..., f^{(r)} have strong order of convergence between r and r+1, so the use of parallel processing has greatly improved the orders of convergence of these methods.

As in the case for Lagrange interpolation, a theorem corresponding to Theorem 3-1 can be proven for direct Hermite interpolation. As before, that theorem is omitted because direct interpolation methods are of little practical importance.

3.2 Example of an Hermite Interpolation Method

Consider the case N = 3, k = 2, r = 1. For simplicity, we consider only the computations done by the second processing element, PE 2. The computations performed by the other two PE's are identical, except that they use different points. We suppress the subscript i to simplify the notation.

Let A₂ = {1, 2}, so that PE 2 uses x₁ = x_{i,1} and x₂ = x_{i,2} to compute x_{i+1,2}. The values y₁ = f(x₁), y₁' = f'(x₁), y₂ = f(x₂), and y₂' = f'(x₂) have been previously computed. The inverse interpolation polynomial of degree 3 which agrees with this data is given by

\[
\begin{aligned}
H_2(y) ={}& \left[1 + \frac{2(y - y_1)}{y_2 - y_1}\right]\left[\frac{y - y_2}{y_1 - y_2}\right]^2 x_1
+ (y - y_1)\left[\frac{y - y_2}{y_1 - y_2}\right]^2 \frac{1}{y_1'} \\
&+ \left[1 + \frac{2(y - y_2)}{y_1 - y_2}\right]\left[\frac{y - y_1}{y_2 - y_1}\right]^2 x_2
+ (y - y_2)\left[\frac{y - y_1}{y_2 - y_1}\right]^2 \frac{1}{y_2'}. \qquad (3\text{-}3)
\end{aligned}
\]

Then

\[
x_{i+1,2} = H_2(0)
= \frac{(3y_1 - y_2)\,y_2^2\,x_1 + (y_1 - 3y_2)\,y_1^2\,x_2}{(y_1 - y_2)^3}
- \frac{y_1 y_2}{(y_1 - y_2)^2}\left[\frac{y_2}{y_1'} + \frac{y_1}{y_2'}\right]. \tag{3-4}
\]

Figure 3-1 illustrates the graphical interpretation of the method of equation (3-4).

[Figure 3-1. Inverse Hermite Interpolation by One PE]

According to Theorem 3-1, this method has weak order of convergence 4. Note that it requires one evaluation of f and one evaluation of f' by each processing element, so the information efficiency is

\[
\mathrm{IE} = \frac{\log 4}{2} = 1.
\]

Equation (3-4) requires 22 arithmetic operations (if (y₁ - y₂)² is stored during the computation of (y₁ - y₂)³), so the efficiency is

\[
\mathrm{EFF} = \frac{2}{c(f) + c(f') + 22}.
\]

The sequential method which this method generalizes has order of convergence 1 + √3 ≈ 2.732 [40, p. 66], information efficiency IE ≈ 0.726, and efficiency

\[
\mathrm{EFF} \approx \frac{0.726}{c(f) + c(f') + 22}.
\]

Thus the use of parallel processing has improved each of these measures. In Section 3.3, it will be shown that the speed-up ratio which can be achieved by methods like this one, which are based on Hermite interpolation, is proportional to log N, so that they are not well suited to implementation on a parallel processing computer.

3.3 Efficiency

It was shown in Corollary 3-2 that the highest weak order of convergence which can be achieved on a parallel processing computer using N processing elements by inverse Hermite interpolation methods is (N-1)(r+1). This order is achieved by methods which use values of f, f', ..., f^{(r)} at N-1 components of the N-dimensional vector X_i. It is very costly to perform Hermite interpolation of f and its first r derivatives at N-1 points, so one might expect these methods to be inefficient. For example, the simple example of such methods constructed in Section 3.2 required 22 arithmetic operations.
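A minimal numerical sketch of that single-PE update (3-4), assuming the illustrative test function f(x) = x³ - 2 and two nearby starting points, shows the sharp (order-4) error reduction noted in Section 3.2; the function, the points, and the printed figures are assumptions made only for the example.

```python
# Sketch (illustrative only) of one PE's update (3-4): inverse Hermite
# interpolation of f and f' at two points, evaluated at y = 0.

def f(x):
    return x**3 - 2.0

def fprime(x):
    return 3.0 * x**2

def hermite_update(x1, x2):
    y1, y2 = f(x1), f(x2)
    yp1, yp2 = fprime(x1), fprime(x2)
    d = y1 - y2
    # equation (3-4): H_2(0)
    return (((3.0 * y1 - y2) * y2**2 * x1 + (y1 - 3.0 * y2) * y1**2 * x2) / d**3
            - (y1 * y2 / d**2) * (y2 / yp1 + y1 / yp2))

root = 2.0 ** (1.0 / 3.0)
x_new = hermite_update(1.2, 1.3)
print(x_new, abs(x_new - root))   # error drops from about 6e-2 to about 3e-5
```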
Although the interpolation becomes more and more expensive as the number of points used increases, we will show that if a very large number of arithmetic operations are required for each evaluation of f,f',...,f(r>, then the most efficient Hermite interpolation method which can be implemented on a given machine is a method which per- forms interpolation at N-l points. That result will be used to show that the best Speed-up ratio which can be achieved by parallel rootfinding methods using Hermite interpolation is proportional to log'N. Hence these methods are poorly suited for implementation on parallel processing computers. Consider the inverse Hermite interpolation method 'k for parallel rootfinding described in Theorem 3-1. Assume that f is 70 analytic so that N, and hence k, may be chosen as large as we wish. Definition 3-1. Define log2 k(r+l) Pr,k(f) = EFF(¢k:f) = I. (i) 9 2 c(f ) + a i=0 where c(f(1)) denotes the number of arithmetic operations needed to compute f(l), and a denotes the number of parallel arithmetic Operations necessary to compute Xi+1. Also define P *(f) = sup Pr k(f) r 3 2‘k 9 If r = 0, so Lagrange interpolation is being used, P (f) = Pk(f), and P0,*(f) = P*(f) which were defined for 0,k Lagrange interpolation in Definition 2-2. N (k N {1,2,...,N], so there are the same number of possible choices for There are ( )) possible choices of N subsets of 'k' In spite of this, Pr,k(f) is well-defined since the possible choices for 'k differ only in their labeling of the points. Hence their order and computational complexities are the same. P ,*(f) is the Optimal efficiency of all inverse Hermite interpolation methods for parallel rootfinding. We will later ask whether N is large enough to allow Pr,*(f) to be achieved. As in Theorem 2-7, we will compare the efficiencies of parallel inverse Hermite interpolation methods with the sequential one-point and multipoint methods given by Traub in [19]. As in Section 2.4, let E1(f) denote the optimal efficiency of the class of sequential one-point iterations without memory. Let E2(f) 71 denote the Optimal efficiency of the class of sequential multipoint iterations without memory which use values of f only. We restate the following bounds from Theorem 1-1 and 1-2: log 3 - c(f) + c(f') + c(f") + 7 SE1“) 5 igg___§ (3 5) l g f 3- c(f)1+/‘;(5‘5E2()‘—_c(f) (6) Here c(f(i)) is the number of arithmetic operations needed to evaluate f(i) 1 = min c(f(i)), and g is some negative constant. 120 Theorem 3-3. Assume that f is analytic in a neighborhood of a , C simple root a. Let 'k be an inverse Hermite interpolation method given in Theorem 3-1. Let r 2 0 be fixed. Let c1 = min c(f(i)), and c2 = max c(f(i)) , Osisr Osisr Then logrk(r+1) i) IE('k)= r+1 ii) There exist positive constants pl and p2 such that k.r+ 2 (r+l)c2 + p2k (r+1)c1 + plk ln k for k sufficiently large. iii) Let k1 and k2 denote the unique solutions to u1(k) = k ln k (2 + lnzk)ln k(r+l) - (r+l)c1 p1 -pk1n2k=0 1 9 (3-8) and 112(k) = 292(k2)21n k(r+l) - (r+l)c2 - p2(k2)2 = 0 , (3-9) ‘flflfi"‘m”“m’"fllll' which 1 the 81 point Spee< poin k(H f,f‘ tiOt ‘hnp‘ to‘ Goth] 72 which satisfy k1 2 l, and k2 2 l/(r+l), reSpectively. Then I 1 51> *(f) s . (340) 292(k2)21n 2 r» 01(2 +'ln k1)k1 ln k1 1n 2 iv) Assume that 2 s k s N-l. If c1 is large and c1 = c2, the speed-up ratio of 'k compared to the optimal sequential one- then point methods without memory approaches 3 (;:l)l;—§iln (N-l), as N «'m . (3-11) v) .Assume that 2 s k s N-l. 
For large values of c(f) = c, the Speed-up ratio of *k compared to the optimal sequential multi- point iteration without memory using values Of f only approaches In gN-lz _ (r+l)ln 2 , as N a m . (3 12) Proof. i) By Theorem 3-1, 'k has order of convergence k(r+l). One iteration of yk requires the evaluation of f,f',...,f(r) at each of the components of X Hence r+l func- i. tional evaluations are necessary. ii) By definition, c s c(£(')) s 1 for 0 s i s r , C2 , implies that r (1) (r+1)c S 2 c(f ) s (r+1)c . (3-13) 1 2 i=0 Let a denote the number of arithmetic operations necessary to compute x1+1 J’ once f,f',-o-,f(r) have been evaluated at the components of X1. In [38], Strassen shows that interpolation on 73 2 k points requires at least 0(k ln k) operations. Several algorithms have been devised which perform interpolation at k 2 points in 0(k ) operations. Hence there exist positive con- stants and p2 such that for k sufficiently large, p1 p1 Applying the lower and upper bounds given by equations (3-13) and (3-14) to the definition of Pr k(f) gives (3-7). iii) We first Show that u1(k) has a unique solution 2 1. Let u1(t) = plt In t (2 + 1n2 t)ln t(r+l) 2 - (r+1)c1 - plt In t 2p t In t 1n t(r+l) - (r+1)c1 l 2 + plt In t (In t(r+l) - 1) Then -(r+l)c1 < 0, and lim u1(t) = +m , tfim u1(1) so there exists a solution k1 2 l to equation (3-8). To see that k1 is unique, notice that , 2 u1(t) = p1(2 + 4 ln t +'ln t)1n t(r+l) > 0 , for t > 1 Similarly, to show that u2(t) = 0 has a unique solution, we observe that 2 2 k 1n k s a s pzk . (3-14) "“m‘? and Hem $0 the 0cm Subs (3-1 toe SUbs and 74 p2 (rm2 u2(1/(r+l)) = —(r+1)c2 - < 0 , and that lhn u2(t) =‘+m . t-m Hence equation (3—9) has a solution k2 > l/(r+l). But 1 ... u2(t) — 4,2: In t(r+l) > 0 , for t > 1/(r+l) , so k2 is unique. To show (3-10), we take suprema of (3-7). The maxima of the expressions log k(r+l) and lgg k(r+l) , 2 (r+1)c2 + 62k (r+1)c1 + plk ln k occur at k2 and k1, reSpectively. Hence log k2(r+l) log k1(r+1) 2$131.41?) s 2 (r+1)c2 +'p2(k2) (r+1)c1 +~p1k1 ln k . (3-15) 1 Substituting equations (3-8) and (3-9) into the denominators of (3-15) gives (3-10). iv) If we assume that all derivatives are equally costly to evaluate, then c1 = c2 = c. Dividing (3-10) by (3-5) gives 3c+2 spree“); 3c+7 2 2p2(k2) 1n 3 E1(f) 91(2 + 1n k1)k1 ln k In 3 1 Substitution for c in terms of k1 and R from equations (3-8) 2 and (3-9) gives 75 3[2p2(k2)21n 1.2mm - 620.2)2] + 2(r+1) Pr *(f) 2 f (r+l)2p2(k2) ln 3 31‘ ) 2 S 3[p1(2 + 1n k1)k1 1n klln k1(r+l) - plkl ln k1] +-7(r+1) (r+l)p1(2 + 1n k1)k1 ln k1 1n 3 so that 3 1“ k2 + 6 1n(r+1) - 3 + 1 ! ..‘ "" ' 2 (r+1)1n 3 2(r+l)ln 3 p2(k2) ln 3 E P (f) 3 ln 1: * -1 ‘ ‘2'??? ‘ (1+1) 1:173— ‘ 32 Q! l (r+l)(i;-Ef'+ l)ln 3 . 1 3 lngr+lz 7 _ + (r+1)1n 3 + 91(2 + 1n k1)k1 1n k1 ln 3 ° (3 16) From equations (3-8) and (3—9), lim k1 = lim k2 = a , c *w *m 1 c2 so for large values of c, (3-16) shows that the Speed-up ratio is approximately bounded below by 3 ln k2 (1+1) 1n 3 ’ and above by 3 1n k1 (r+l)1n 3 For a parallel processing computer with N PE's, if c is so large that k2 2 N-l, then the most efficient Hermite interpolation method which can be implemented uses k -‘N-l points. As larger machines are used, the optimal efficiency Pr f) is approached. ’*( 76 P (f) *- Hence as N .. an, 1%???- is approximately bounded above and below by l 3 ln,(N-l) (r+l) ln 3 v) The proof of v) is similar to the proof of iv) given above. Dividing (3-10) by (3-6) gives Pr * s—-’-——-s E2(f) C 2 2p2(k2) ln 2 C .. §_. 
2 (1 +'/t> p1(2 +~ln k1)k1 1n k1 ln Substituting for c in terms of k1 and k2 from equations (3-8) and (3-9) and simplifying, we get 1n k2 + 21n(r+1) _ 1 g Pr,*(f) (r+l)ln 2 2(r+l)ln 2 E2(f) ln k s l + ln(r+l) (1 +§:)(r+l)ln 2 (1 + §C-)(r+1)1n 2 1 EL. /2)‘r*1)( 2 (3-17) + l)ln 2 1n k (l + 1 If c is large, k1 and k2 are also large, so the Speed-up ratio is approximately bounded below and above by 1n k2 1n k1» d (r+l) ln 2 ’ 8“ (r+l) 1n 2 ’ respectively. Hence for large values of c, the Speed-up ratio behaves like In (N-I) d (r+l) 1n 2 ’ as N ° ' l 77 The important conclusions from Theorem 3-3 are contained in parts iv) and v). Taken together they show that if parallel root- finding methods based on Hermite interpolation are compared to either optimal sequential one-point iterations without memory or to optimal sequential multipoint iterations without memory which use values of f only, the Speed-up ratio is prOportional to the log2 of the machine size. Hence these methods are not well suited for implementation on a parallel processing system. 3.4 Comparison of Hermite and Lagrange Methods In Section 3.3, we studied the efficiency of inverse Hermite interpolation methods for parallel rootfinding which use f,f',...,f(r) at k points to compute In that section, we considered Xi+1,j ' the effect of increasing k to find the Speed-up ratios of the methods. In this section, we suppose that k is fixed and let r vary to determine the optimal number of derivatives to be used for the Hermite interpolation. Let ¢h denote the parallel rootfinding method based on in- verse Lagrange interpolation described in Theorem 2-5. Let 'k,r denote the method based on inverse Hermite interpolation described in Theorem 3-1. Note that ¢k is lk,0' We will show that ¢k is more efficient than 'k,r' Theorem 3-4. .Assume that f is analytic in a neighborhood of a simple root a. Then IE(wk’r) s IE(qk), for all r 2 0 If c(f) s c(f(i)), for o s i s r, then for r 2 0, *“finl-r .i F H. 78 EFF(fik,r) S EFF(qk) . Proof. Recall that log2 (order of convergence of q» IE(qD = number of parallel functional evaluations According to Theorem 3-3, E , ___ tress-ta L1,; I (wk’r r+l . , To find the highest possible IE, we differentiate with reSpect to 7 r to get F r+l dr (r+1)2 ln 2 g 1 - ln k(rtl) 2 (r+1) 1n 2 Hence diz < 0 if and only if k(r+l) >-e. Recall that d:E < 0 unless k = 2 and r = 0. (Notice that k a 2 and r B 0 characterizes the parallel secant ZsksN-l and r20,so method.) In the case k - 2, IE - 1 for both r = 0 and r = 1. For all other values of k and r, IE is a decreasing function of r, so for any fixed k 2 2, the largest IE is always attained by ¢k' Hence IE('k,r) = 1° rfir+1 s 16g k a IE(¢i) . (3-18) That is, Lagrange interpolation on k points has a higher IE than any other Hermite interpolation method on the same k points. We now turn to the efficiences of the two methods. By assumption c(f) s c(f(1)), for 0 s i s r, so from (3-18), 79 log k 2 10g k(r+l) C(f) (r+1)C(f) 2 log k(r+l) 1' z c(f‘”) t=0 Let a(qD denote the number of parallel arithmetic opera- tions used to combine the functional values to compute Xi+1. Clearly a(cpk) s a“:k r) , so that logik(r+l) EFF(il:k r) = r z c(fm) + aw ) t=0 k’r log k(r+l) _ c(f)+8( 1+1 1 §,(xi) where (xi- xi-l)f(xi-2) (xi ' “1-2)f(xi-1) + ' (xi-2 ‘ xi-1)("i-2 ' xi) (xi-1 ' xi-2)("i-1 ' xi) f(xi) f(xi) x -.. +—.—;— - (44> i i-2 xi i-l Equation (4-2) is obtained by differentiating the three point Lagrange interpolation polynomial, C(x), to f at x d 1-2, Xi_1, an x According to Lemma 2-2, 1. 
E(x) = f(x) - C(x) = %(x - xi_2)(x - x1_1)(x - xi)fm(§(x)), (4-3) where g(x) lies on the interval spanned by x, x1_2, x , and i-l 82 x1. If f 6 Civ in a neighborhood of a, differentiating both sides of equation (4-3) and substituting x1 into the resulting equation gives the well-known estimate E'(x ) = 11x - x )(x - x )fm(§ ) (4-4) i 6 i i-2 i i-l i ’ where §i = g(xi). Let F(xi) = x1 - f(xi)/f'(xi) , and H(xi) = x1+1 = x1 — f(xi)/f'(xi) , where f'(xi) is given by equation (4-2). Now F is the usual Newton-Raphson method, so it has error given by 01 - mi) = some - x1>2/f'f(xi,3) + f("1,1) (Xi,3 ' Xi,2)(xi,3 ' Xi,l) x1,1 ' Xi,3 Let x141,2 and Xi+l,3 be defined Similarly. To compute the order of convergence of this parallel method, let ei,j = a - Xi,j° Then equation (4-7) becomes H — f ('1) < )2 ei+1,1 2f'(x ) 81,1 1,1 f'(n1)f”’<§i 1> ’ - 1‘ 6. (e. - e. )(e. - e. )5 (4-9) 6fl(xi 1)fv(xi 1) 1,1 1,2 1,1 1,3 1,1 with similar eXpressions holding for and This ei+1,2 e’i+1,3‘ implies that if x 2, and x are chosen sufficiently 0,1’ x0 0,3 close to a, the method converges. We will Show in Theorem 4-1 that this method has order of convergence 2. One parallel functional evaluation and 16 arithmetic Operations are necessary at each iteration, so the information efficiency is and the efficiency is 85 l EFF “ c(f) + 16 ° Note that the use of parallel processing has improved the order, and hence the efficiency and the information efficiency, of the sequential method given by equation (4-1). Although using an approximation for the value of f'(xi) needed for the usual Newton-Raphson method, F, lowered the order of convergence from 2 to 1.84 in the sequential case, using a similar approximation did not affect the order of convergence of the Newton-Raphson method in the parallel case. We will Show that in many cases, an estimation may be used for the values of a derivative without lowering the order of the method. One of the many variations possible on this example is of interest. Instead of using the data at three points to compute the approximation f'(xi 1) to f'(xi 1), we could use just f(xi 1) and f(xi,2). Then . f("i,z) ' f(xi,l) f0‘11)“ x -x ° ’ 1,2 1,1 This is just the parallel secant method introduced in Section 1.4. Hence the parallel secant method may be viewed either as an example of a parallel method based on Lagrange interpolation, or as a derivative-estimated method based on the Newton-Raphson method. 4.2 Order of Convergence In this section we will indicate that if enough points are used to estimate values of the highest derivative of f needed by certain rootfinding methods, the order of convergence is unchanged. 86 As in Chapters II and III, we assume that the analytic function f has an inverse, denoted by g, in a neighborhood of the simple root 0. Consider the following parallel generalization of the Newton- Raphson method. Let x be initial approximations O’1,x0,2,...,x0’N chosen sufficiently close to a. ASSume that g E Cr+1 near 0. For each j = 1,2,...,N, let Fj(y) be the Hermite interpolation polynomial of degree at most r which satisfies Fj(S) ) = 8(8) (yi,j (yi,j)’ for s=0,...,r . Define xi+1,j g FJ(O) Although this is a parallel method, each processor is acting in- dependently of the others Since x depends only on x i+l,j i,j° Thus each processing element is performing a generalized Newton- Raphson iteration which is well known to have order of convergence r+l (see [40, Chapter 3]). 
Let us set the notation which is needed to construct a derivative-estimated method based on this parallel generalization of the Newton-Raphson method. Let u 2 2 be fixed. Choose N subsets, not necessarily distinct, of {1,2,...,N], each cOntaining u elements. Denote the jth subset by B , and assume j E B J J. Let Gj(y) be the Hermite interpolation polynomial of degree at most ur-l which interpolates (8) (8) G1 (yi t) - s (yi’t). for 8 - 0,...,r-l, t 6 B1 . 87 (r) We will use 0 (r) (Yi j J which is used in Fj' Theorem 4-1. Let f 6 Cur in a neighborhood of the Simple root (yi j) to approximate the value of g ) a. Let Fj and G1 be as defined above. Let Hj(y) be the Hermite interpolation polynomial of degree at most r which satisfies (8) = (S) = - Hj (yi j) 8 (Y1 j) . for s 0. .r 1 . (r) _ (r) Hj (y1 j) - Gj (yi j) Define If the N components of X0 are chosen sufficiently close to a, this method converges and has Strong order of convergence of at least r+l. The proof of this theorem is a generalization of the corresponding sequential case given in [40] by Traub. The first Of two lemmas needed for this proof is a well-known generalization of Rolle's Theorem [40, p. 252]. k Lemma 4-2. Let q 8 Z Yj' Let W E C q (I), where I is an j=0 interval which contains the points x which are zeros of ‘W of J multiplicity Yj’ reSpectively, for j = 0,1,...,k. Then W(1) has at least q-i zeros counting multiplicity, for i = 0,1,...,q-1, on I. In particular, there exists g 6 I such that W(q'1)(§) = 0. The second lemma deals with the error in estimating derivatives by Hermite interpolation. we include the proof for completeness. “1 ii: r" '9 4...... ”1|— " -.-...s 88 Lemma 4-3 [40, p. 252]. Let S = ru, and let g E C s (I), where I is an interval containing the distinct points y1,y2,...,yu. Let C(y) be the Hermite interpolation polynomial of degree at most s-l satisfying C(i)(yj) = g(i)(yj) , for j = l,...,u, and i = 0,...,r-l . Then (r) (r) _ EL (s) “ _ r 8 (5'1) -G (y1)-S, s ($111201 yj) . where g 6 I. Proof. Define u r P(t) = H (t - y ) i=1 3 The r th derivative of P, U P(r)(y1) = I“ n (Y1 - yj)r * 0 9 j: so we may define g(t) (Y1: - c(f) (9'1) x = (r) P (5'1) Let W(t) = g(t) - C(t) - AP(t). Then W(t) has a zero of multiplicity. r+1 at yl, and a zero of multiplicity r at y29y3,...,yu. Hence W has at least s+l zeros, so by Lemma 4-2, (8) there exists g E I such that W (g) = 0. Since C(t) is a polynomial of degree at most 8-1, (8) o = w (t) = g(s’<§> - 18! W9 ,4 4 .Jlmfnruimz 2 89 Then (r) , (r) g (r) 8 (VI) G (5'1) 11’ (3'1) (5 ) (r )y .r_!_ (s) r =8'8 (§)H(Y1-Y Y) .' J=2 3 We are now prepared to prove Theorem 4-1. The subscript i will be suppressed to Simplify the notation. (1‘) Proof. Fj(y) interpolates g,g',...,g at yj = y1 j’ so it must have the form (8) r 8 (y) Fj(y) = 2 —-S—,-J—(y -yj) . s=0 with error given by (”1) 1+1 8(0) - Fj(0) = (rd-l)! (-yj) . where 0 lies between 0 and y . Similarly, Hj(y) is given by J (s) (r) r-l s (y) G (V) Hj(}') = 2:0 --;,—-'— (y - yj)s + J—g—L (y - yj)r 8: From Lemm 4-3 we have (r) (r) g(ur) r - G - , 4-10 g (yj) j (91 (it), (e )tggjyj yt> < > ta‘j where 91 lies on the interval Spanned by [yt\t € Bj]. Now 8(0) - H (0) ‘i+1,j' °’ ' x1+1.j '3 J = 8(0) - 171(0) + Fj(0) - 111(0) w.- a Lenllna n‘.‘ Q I 90 r 9 (4-11) (4-12) ...;L__ (r) - (r) (r+1)! ( yJ) +' r! [8 (yj) Gj (Yj)l (”1) ( )r~+l (1+1)! VJ (-y >r i r! (ur) r + r! (ur)! g (9.1)th (yj - yt) J ta‘j r+1 = g(r+1)(e) (61.1) (”1" Leap)“ (e )r g(ur)(9 ) 1L1 1 I U [f (k )(x ' X ) (lit)! 
[g'(§j)]r CEBJ 39': j t ta‘j s where 51 is between 0 and y], and Yj ' Yt = f(xj) ' f(xt) = f'()‘j’t)(xj - xt), where xj,t lies between xj and xt. Let M =L1r+' and . g("r)(ei)2r(“'1) [ )1‘ = o n f'(x “2.1 (ur),[g.(§j)]r+1 cesj 3,: ti‘j Then from equation (4-11), — M ( )r+1 + “211(61’j)r r ems " 1.1 “1.1 2z-(u-n “ (61,: ' 31.1) ° tEBj ti‘j Let 91 M.1 I mjx 'Ml,j' M2 = mix JM2,3‘ Then if bi = max \8 l. j i.J Jei+1,j\ ‘ M1(51)r+1 + M2051)r n (61)r tEBj t¢j r+l ur s M1(61) + M2(61) . Now (r+1) lim M1 = 3 (01 H1 , and 1% (1+1)! [8 ' (0)] 11 g g(ur)(0) 2r(u-l) m r i-a (ur)! [8 ' (0)] Hence M1 and M2 are bounded by M1 and D71, reSpectively, for all i. If the N components of X are chosen sufficiently 0 close to a, then '- r+1 -' ur 81+1 s 141(61) + M051) . so 11111 61 = 0 l-cco and the method converges. Since 2 s u, the inequality 5i+1 — s M + H(a )ur-r-l i (61) r+l 1 implies that 92 5 ._ lim sup "";i},'+—1' 5M1 9 iaa (6.) 1 showing that the method has Strong order of convergence at least r+l. I We are unable to prove the exact order of convergence because of the form of equation (4-12). However, it is our conjecture that: Conjecture. The parallel derivative-estimated method given in ' .. 1 Theorem 4-1 which defines x1+1 j = Hj(0) has order of convergence exactly 'r+1. We have Shown that the order of convergence of this method 3"- ”1313—2. ... .2 T": “F l I is at least r+l. Recall that the method from which it is derived (r g ) uses and has order of convergence n+1. It seems highly (r) (r) unlikely that using an approximation for g instead of g itself could make the method converge more rapidly for a general choice of the function f. How many points u should be used for the Hermite interpola- tion from which the derivative is estimated? Theorem 4-1 showed that the order of convergence of the sequential generalized Newton- (r) (r) j 9 where Gj is the Hermite interpolation polynomial which agrees (r-l) Raphson method is not lowered if g is approximated by G with g,g',...,g at u 2 2 points. Notice that u does not effect the order of convergence, so we Should use just two points in order to minimize the effort needed to compute G . J It does not appear that the stepping down from r to r-l derivatives in Theorem 4-1 can be repeated since that would require an analogue for Lemma 4-3 giving an estimate for g - Fj = K. . n (... f“ 9.] 19m m€Aj By Lemma 4-3, (r) _ (r) = r! (ur) _ r 8 (71) G1 (yj) (ur), 8 (83) cg; (yj yt) , J t¥j where 91 lies on the interval Spanned by {yt\t 6 BJ]. As in the proof of Theorem 4-1, ei-+1,j = a ' xi+1,j = 3(0) ’ Hj(0) 1‘ r ’ Ki,j mgAj(ei’ ) + Mi,j(€i.j) tghj(€i,t 61.1) ’ ti‘j where M g(ur)(ej) 2r(u-l) n ( 1 )r "3 (ur)![8'(§j)]r+1 teaJ 3“”) tfij lies between 0 and .; the proof of Theorem 4-1, it was shown that ‘Mi j' s M, so that 9 lies between yj and yt. In r+l - r r Jei+1,j' ‘ngAj'ei’m\ +M 'ei,j tEBJ'ei’t ' 61.j' t¥j s K(61)k(r+1) +1701)“r 96 Hence k(r+1) 8 s 1((61) + R61)“ . (4-13) r+l If ,...,x are chosen sufficiently close to a. then x0 ,1 “0,2 (4-13) implies that 0,N lim 6i = 0 i-«n Thus the method converges. MOreover, since y = min{k(r+1), ur], the inequality 6 r, _. _ (6 )Y 1 1 1 implies that 5 lim sup < m . 1-«9 (61) Thus the method has strong order of convergence at least v. . we conjecture that if g(ur)(0) f 0 and g(kr)(0) # 0, then these methods have strong order of convergence exactly y. In that case, the order of convergence of a derived method may equal, but may not exceed, the order of convergence of its parent method. 
Since ' k(r+l), if u 2 k(r+l)/r y = min{k(r+l), ur] 3 ur, if u s k(r+l)/r , the orders of the derived and the parent methods are equal when- ever [k(r+l)/r] or more points are used to estimate the values of g(t). For the method considered in Theorem 4-1, k = l, r 2 l, 97 so (r+l)/r s 2. Hence in that case, the orders of the derived and parent methods are equal if two or more points are used in the derivative estimation. In Theorem 4-4, 2 s k s'N-l, and 2 s u S‘N, so that we have this result. Corollary 4-5. If the strong order of convergence of the derivative- estimated method for parallel rootfinding described in Theorem 4-4 is exactly y, then the highest order of convergence which can be achieved on a parallel machine using N processing elements is either (N-l)(r+l) , if N s n+1, or Nr, if N 2 r+l . This shows that unless the number of derivatives used in the parent method is as large as the machine Size N, use of a derivative-estimated method actually reduces the order of con- vergence. 4.3 Efficiency In this section, the efficiency of the parallel derivative- estimated methods is studied by comparing the computational effort required by the derived method with that required by the parent method. We will assume that the derived methods have strong order of convergence exactly y - min{k(r+l), ur}. If an order of convergence of k(r+l) is desired, it can be achieved by the parent method using Hermite interpolation of 8’8.,...,8(r) at the k points {y1,t‘t€Aj]. Aderived method 98 with u 2 [k(r+1)/r] retains the same order Of convergence, but (1') it does not require the computation and evaluation of g . This (1') is an important consideration if g is much more costly to evaluate than g,g',...,g(rfll). However, the cost of avoiding the (r) _ evaluation of g is high. At least [k/r additional pro cessing elements are needed to perform the computations required at u points. In addition, each processing element must compute an Hermite interpolation polynomial G of degree at most ur-l J r and evaluate Gj( ) at k points. Hence the parent method is more efficient than the derived method unless at least rk/r] additional PE's are available and the cost of one parallel evalua- (1') tion of g is greater than the cost of constructing and evalua- ting G This would be expected to occur only very rarely. 1’ If instead of fixing the desired order, we wish to use the r 8() data from g,g',..., each evaluated at k points, we may pro- ceed in two ways. First, the Hermite interpolation method using this data has order of convergence k(r+l). .Alternatively, we can use the data in a derivative—estimated method to approximate g