RIGOROUS HIGH-PRECISION ENCLOSURES OF FIXED POINTS AND THEIR INVARIANT MANIFOLDS

By Alexander N. Wittig

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Physics
Mathematics

2011

Abstract

RIGOROUS HIGH-PRECISION ENCLOSURES OF FIXED POINTS AND THEIR INVARIANT MANIFOLDS

By Alexander N. Wittig

The well-established concept of Taylor Models is introduced. Taylor Models offer highly accurate C^0 enclosures of functional dependencies by combining high-order polynomial approximation of functions with rigorous estimates of the truncation error, performed using verified arithmetic. The focus of this work is on the application of Taylor Models in algorithms for strongly non-linear dynamical systems.

A method is proposed to extend the existing implementation of Taylor Models in COSY INFINITY from double precision coefficients to arbitrary precision coefficients. Great care is taken to maintain the highest efficiency possible by adaptively adjusting the precision of higher order coefficients in the polynomial expansion. High precision operations are based on clever combinations of elementary floating point operations yielding exact values for round-off errors. An experimental high precision interval data type is developed and implemented. Algorithms for the verified computation of intrinsic functions based on the High Precision interval data type are developed and described in detail. The application of these operations in the implementation of High Precision Taylor Models is discussed.

An application of Taylor Model methods to the verification of fixed points is presented by verifying the existence of a period 15 fixed point in a near-standard Hénon map. Verification is performed using different verified methods, namely double precision Taylor Models, High Precision intervals, and High Precision Taylor Models. Results and performance of each method are compared. An automated rigorous fixed point finder is implemented, allowing the fully automated search for all fixed points of a function within a given domain. It returns a list of verified enclosures of each fixed point, optionally verifying uniqueness within these enclosures. An application of the fixed point finder to the rigorous analysis of beam transfer maps in accelerator physics is presented.

Previous work by Johannes Grote is extended to compute very accurate polynomial approximations to invariant manifolds of discrete maps of arbitrary dimension around hyperbolic fixed points. The algorithm presented allows for automatic removal of resonances occurring during construction.

A method for the rigorous enclosure of invariant manifolds of continuous systems is introduced. Using methods developed for discrete maps, polynomial approximations of invariant manifolds of hyperbolic fixed points of ODEs are obtained. These approximations are outfitted with a sharp error bound which is verified to rigorously contain the manifolds. While we focus on the three-dimensional case, verification in higher dimensions is possible using similar techniques. Integrating the resulting enclosures using the verified COSY VI integrator, the initial manifold enclosures are expanded to yield sharp enclosures of large parts of the stable and unstable manifolds. To demonstrate the effectiveness of this method, we construct enclosures of the invariant manifolds of the Lorenz system and show pictures of the resulting manifold enclosures.
To the best of our knowledge, these enclosures are the largest verified enclosures of manifolds in the Lorenz system in existence.

Copyright by ALEXANDER N. WITTIG 2011

To my parents

ACKNOWLEDGMENT

During the past five years I had the pleasure of collaborating with many exceptional people here at Michigan State University, both in the Department of Physics and Astronomy and the Department of Mathematics. First and foremost, I would like to thank my academic advisor Professor Martin Berz for his continued support, patience, and determined guidance when needed during my graduate studies. At the same time, I shall not forget Professors Kyoko Makino and Sheldon Newhouse, without whom my research would not have been possible. I also would like to acknowledge Professors Pratt, Mahanti and Schenker for kindly serving on my guidance committee.

A large part of my graduate research has been supported by a generous scholarship from the Studienstiftung des Deutschen Volkes, as well as further financial support from the U.S. Department of Energy and the Thesis Completion Fellowship of the Michigan State University Graduate School.

This acknowledgment would not be complete without thanking my former and current fellow students Shashikant Manikonda, Youn-Kyung Kim, Johannes Grote, Pavel Snopok, Alexey Poklonskiy, Ravi Jagasia, He Zhang, Roberto Armellin, Pierluigi DiLizia, and Benjamin Loseth. The numerous discussions with them, as well as the many shared laughs, were of great value to me.

Contents

List of Tables
List of Figures

1 Introduction
1.1 Computers and Numerical Algorithms
1.2 Problem
1.3 Outline

2 Fundamental Concepts
2.1 Floating Point Numbers
2.2 Interval Arithmetic
2.2.1 Dependency Problem
2.2.2 Wrapping Effect
2.2.3 Scaling
2.3 Taylor Model Arithmetic
2.3.1 Taylor Model Operations
2.3.2 Intrinsic Functions on Taylor Models
2.3.3 Scaling
2.3.4 Order Loss
2.3.5 Implementation

3 High Precision Operations
3.1 General Concepts
3.1.1 Newton Method
3.1.2 Interval Newton Method
3.1.3 Taylor Expansion with Remainder
3.1.4 Leibniz Criterion
3.1.5 Kahan Summation
3.2 Precise Elementary Operations on Floating Point Numbers
3.2.1 Two-Sum
3.2.2 Splitting
3.2.3 Multiplication
3.3 High Precision Data Types
3.3.1 High Precision Intervals
3.3.1.1 Limitations
3.3.1.2 The Accumulator
3.3.1.3 Addition and Subtraction
3.3.1.4 Multiplication
3.3.1.5 Squaring
3.3.2 High Precision Taylor Models
3.4 High Precision Intrinsic Functions
3.4.1 General Concepts
3.4.1.1 Taylor Expansion with Remainder
3.4.1.2 Argument Reduction and Splitting
3.4.2 Division
3.4.2.1 Argument Reduction and Splitting
3.4.3 Square Root
3.4.3.1 Argument Reduction and Splitting
3.4.4 Inverse Square Root
3.4.5 Exponential Function
3.4.5.1 Argument Reduction
3.4.5.2 Argument Splitting
3.4.6 Natural Logarithm
3.4.6.1 Argument Reduction
3.4.6.2 Argument Splitting
3.4.7 Sine and Cosine
3.4.7.1 Argument Reduction
3.4.7.2 Argument Splitting
3.4.8 Tangent
3.4.9 Arc Sine and Arc Cosine
3.4.10 Arc Tangent
3.4.10.1 Argument Reduction
3.4.10.2 Argument Splitting
3.4.11 Hyperbolic Sine and Cosine
3.4.12 Hyperbolic Tangent
3.4.12.1 Argument Reduction
3.4.12.2 Argument Splitting
3.5 Comparisons and Bounding
3.5.1 Approximation
3.5.2 Bounding
3.5.2.1 Absolute Value
3.5.3 Splitting and Merging
3.5.3.1 Splitting
3.5.3.2 Merging
3.5.4 Sign of a Floating Point Series
3.5.4.1 Double Precision Sign Determination
3.5.4.2 High Precision Sign Determination
3.5.5 Comparisons
3.5.5.1 Comparing a High Precision Number to a Floating Point Number
3.5.5.2 Comparing two High Precision Intervals
3.6 Input / Output
3.6.1 Rigorous Binary Output
3.6.1.1 Implementation
3.6.2 Rigorous Decimal Output
3.6.2.1 Implementation
3.6.2.1.1 Addition
3.6.2.1.2 Multiplication by 2
3.6.2.1.3 Division by 2
3.6.3 Rigorous Decimal and Binary Input
3.6.3.1 Examples
3.6.3.2 Implementation
3.6.3.2.1 B-Format
3.7 Future Work
3.7.1 Automated Plausibility Testing

4 Fixed Points
4.1 General Concepts
4.1.1 Brouwer's Fixed Point Theorem
4.1.2 Banach's Fixed Point Theorem
4.1.3 Contraction Maps
4.1.3.1 Operator Norm
4.1.3.2 Infinity Norm
4.1.3.3 Coordinate Transformation
4.2 Periodic Point in Near Standard Hénon Map
4.2.1 Existence of a Periodic Point
4.2.1.1 Preconditioning
4.2.1.2 Rigorous Coordinate Transformation
4.2.1.3 Proof using Intervals
4.2.1.4 Proof using Taylor Models
4.2.2 Implementation and Results
4.2.2.1 Double Precision Taylor Models
4.2.2.2 High Precision Intervals
4.2.2.3 High Precision Taylor Models
4.2.3 Uniqueness and Attractiveness
4.2.4 Conclusion
4.3 Global Fixed Point Finder
4.3.1 Preconditioning
4.3.2 Implementation
4.3.2.1 Algorithm
4.3.3 Analysis
4.3.4 Constants
4.3.5 Example
4.4 Application to Beam Physics
4.4.1 Implementation
4.4.2 Tune Shifts
4.5 Conclusion

5 Rigorous Manifold Enclosures
5.1 General Concepts
5.1.1 Invariant Manifolds of a Diffeomorphism
5.1.2 Invariant Manifolds of Autonomous ODEs
5.1.3 Notation
5.2 High Order Invariant Manifold Approximation in Discrete Systems
5.2.1 Parameterization
5.2.2 Construction
5.2.3 Induction Step
5.2.3.1 Gamma-Convention
5.2.3.2 P-Convention
5.2.4 Stable Manifold
5.3 Invariant Manifolds of Autonomous ODEs
5.3.1 Manifold Generation
5.3.2 Manifold Verification
5.3.2.1 Outline of the Proof
5.3.2.2 One Dimensional Manifold in Three Dimensions
5.3.2.3 Two Dimensional Manifold in Three Dimensions
5.3.3 Numerical Verification
5.3.3.1 Verification using Taylor Models
5.3.4 Manifold Propagation
5.3.4.1 Initial Condition
5.3.4.2 Manifold Growth
5.3.5 Application to the Lorenz System
5.3.5.1 Initial Manifold Enclosure
5.3.5.2 Manifold Propagation

Bibliography

List of Tables

4.1 Comparison of the performance of double precision Taylor Models, High Precision Taylor Models and High Precision intervals in this example.
4.2 Periodic points found for periods from 1 through 20 in the region [−0.09, 0.09] × [−0.06, 0.06] in the Tevatron transfer map.
5.1 Thickening of the manifold enclosure (in diagonalized coordinates) for which verification is successful for various values of η2 and η3.
5.2 Approximate length, maximum error, and number of Taylor Models covering the unstable manifold after propagating for the given time t.
5.3 Approximate area, maximum error, volume, and number of Taylor Models covering the integrated half of the stable manifold within the box B = [−50, 50] × [−50, 50] × [−20, 90] after propagating for the given time t backwards through the Lorenz system.

List of Figures

1.1 The logistic map for r = 4 after 5 iterations with floating point numbers (red) and High Precision intervals (blue). For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.
1.2 The logistic map for r = 4 after 15 iterations with floating point numbers (red) and High Precision intervals (blue).
1.3 The logistic map for r = 4 after 25 iterations with floating point numbers (red) and High Precision intervals (blue).
1.4 The logistic map for r = 4 after 35 iterations with floating point numbers (red) and High Precision intervals (blue).
1.5 The logistic map for r = 4 after 45 iterations with floating point numbers (red) and High Precision intervals (blue).
1.6 The logistic map for r = 4 after 55 iterations with floating point numbers (red) and High Precision intervals (blue).
1.7 The logistic map for r = 4 after 60 iterations with floating point numbers (red) and High Precision intervals (blue).
1.8 The logistic map for r = 4 after 130 iterations with floating point numbers (red) and High Precision intervals (blue).
1.9 The logistic map for r = 4 after 131 iterations with floating point numbers (red) and High Precision intervals (blue).
1.10 The logistic map for r = 4 after 132 iterations with floating point numbers (red) and High Precision intervals (blue).
1.11 The logistic map for r = 4 after 133 iterations with floating point numbers (red) and High Precision intervals (blue).
2.1 Enclosure of a relatively simple, non-convex set (red) by an interval box (blue) and the resulting overestimation.
3.1 Graphical representation of the accumulator algorithm for n = 6. Two cases are shown: in the first, the result holds six limbs; in the second, shown in parentheses, the result holds only three limbs, and the accumulation terminates after calculating three limbs. The absolute values of the remaining terms di are then added to the error bound.
3.2 Graphical representation of the multiplication algorithm for two numbers of length 4 and 5 respectively. Every cell in the grid represents one product. The coloring of the cells shows products of the same order of magnitude, assuming that the input numbers were well conditioned. The numbering shows the order in which these products are evaluated, so that we take advantage of that ordering.
3.3 Graphical representation of the squaring algorithm for a number of length 4. Every cell in the grid represents one product. However, the off-diagonal products appear twice, so we can ignore half of them as shown in the illustration. The numbering shows the order in which these products are evaluated, so that we take advantage of well conditioned input.
3.4 Graphical representation of the division algorithm for two numbers A and B. Capital letters represent high precision intervals, while the lower case letters are floating point numbers that represent the floating point sum of the limbs of the corresponding high precision interval. All arithmetic operations are Dekker operations, except for the floating point division, which is denoted by a colon (:).
4.1 The attractor of the Hénon map with A = 1.422 and B = 0.3 (blue) with the orbit of the suspected period 15 periodic point (green). The point used for verification in Section 4.2.2 is highlighted in red.
4.2 The image of a Taylor Model box (solid black), the interval bounds Ix and Iy, and the rigorous inner enclosure (hatched).
4.3 The effect of preconditioning. A box of size 10^-4 × 10^-4 in a Hénon map and its 15-th iterate without (left) and with (right) preconditioning.
4.4 All period 11 periodic points (green) and the two fixed points (black) within initial boxes (blue) in the standard Hénon map.
4.5 Tracking of evenly spaced particles through the transfer map in the x − a plane (1400 iterations, 5 mm spacing between rings).
4.6 Location of the periodic points identified in the transfer map. In red are two sets of period 5 points, in blue are particles trapped in the islands around the elliptic period 5 points. Two period 8 points are in black at the top left and bottom right corner.
4.7 Polynomial expansion of tune shifts computed to order 3.
4.8 Polynomial expansion of tune shifts computed to order 5.
4.9 Polynomial expansion of tune shifts computed to order 7.
4.10 Polynomial expansion of tune shifts computed to order 9.
4.11 Polynomial expansion of tune shifts computed to order 11.
5.1 The enclosure Γ for a curve γ(x) in R^3 with its sides (blue) and ends (red) highlighted.
5.2 The set C with its ends EC (red) and sides SC (blue) highlighted.
5.3 The enclosure Γ for a surface γ(x, y) in R^3 with its sides (blue) and ends (red) highlighted.
5.4 The set C with its ends EC (red) and sides SC (blue) highlighted. Each Cx and Cy consists of two pyramids around the x and y axis respectively.
5.5 Alignment along the strongest expanding eigenvector for a simple linear ODE. The initial condition is shown in red, and then integrated forward for times T = 0.1 (yellow), T = 0.3 (green), T = 0.3 (light blue), T = 0.4 (dark blue).
5.6 The initial manifold enclosures generated for the Lorenz system at the origin, shown in standard Lorenz coordinates.
5.7 The local unstable manifold of the origin in the Lorenz system propagated for time t = 1. The error of the manifold shown is less than 2 · 10^-11.
5.8 The local unstable manifold of the origin in the Lorenz system propagated for time t = 5. The error of the manifold shown is less than 2 · 10^-10.
5.9 The local unstable manifold of the origin in the Lorenz system propagated for time t = 10. The error of the manifold shown is less than 2 · 10^-9.
5.10 The local unstable manifold of the origin in the Lorenz system propagated for time t = 15. The error of the manifold shown is less than 7 · 10^-8.
5.11 The local unstable manifold of the origin in the Lorenz system propagated for time t = 20. The error of the manifold shown is less than 5 · 10^-5.
5.12 The local unstable manifold of the origin in the Lorenz system propagated for time t = 28. The error of the manifold shown is less than 2 · 10^-2.
5.13 The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.2. The error of the manifold shown is less than 4 · 10^-7.
5.14 The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.3. The error of the manifold shown is less than 4 · 10^-7.
5.15 The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.4. The error of the manifold shown is less than 4 · 10^-7.
5.16 The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.5. The error of the manifold shown is less than 2 · 10^-6.
5.17 The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.6. The error of the manifold shown is less than 8 · 10^-6.
5.18 Both the local stable and unstable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] in the same picture. The error on both manifolds shown is less than 5 · 10^-5.

Chapter 1

Introduction

This thesis is the product of my work as a dual degree student in both the Department of Physics and Astronomy and the Department of Mathematics at Michigan State University. As such, parts of this thesis may be of more interest to the mathematician, while other parts may be more interesting to the physicist. But before jumping into a detailed discussion of the various research topics making up this work, I want to make clear why I felt drawn to bridging the gap between two disciplines with this interdisciplinary research. Instead of attempting to put this into my own words, I will quote the words of Norbert Wiener in what will be the only block quote in this dissertation. The following is taken from the preface of Wiener's famous book Cybernetics: or Control and Communication in the Animal and the Machine[43]:

For many years Dr. Rosenblueth and I had shared the conviction that the most fruitful areas for the growth of the sciences were those which had been neglected as a no-man's land between the various established fields.
Since Leibniz there has perhaps been no man who has had a full command of all the intellectual activity of his day. Since that time, science has been increasingly the task of specialists, in fields which show a tendency to grow progressively narrower. A century ago there may have been no Leibniz, but there was a Gauss, a Faraday, and a Darwin. Today there are few scholars who can call themselves mathematicians or physicists or biologists without restriction. A man may be a topologist or an acoustician or a coleopterist. He will be filled with the jargon of his field, and will know all its literature and all its ramifications, but, more frequently than not, he will regard the next subject as something belonging to his colleague three doors down the corridor, and will consider any interest in it on his own part as an unwarrantable breach of privacy. These specialized fields are continually growing and invading new territory. The result is like what occurred when the Oregon country was being invaded simultaneously by the United States settlers, the British, the Mexicans, and the Russians – an inextricable tangle of exploration, nomenclature, and laws. There are fields of scientific work, as we shall see in the body of this book, which have been explored from the different sides of pure mathematics, statistics, electrical engineering, and neurophysiology; in which every single notion receives a separate name from each group, and in which important work has been triplicated or quadruplicated, while still other important work is delayed by the unavailability in one field of results that may have already become classical in the next field. It is these boundary regions of science which offer the richest opportunities to the qualified investigator. They are at the same time the most refractory to the accepted techniques of mass attack and the division of labor. If the difficulty of a physiological problem is mathematical in essence, ten physiologists ignorant of mathematics will get precisely as far as one physiologist ignorant of mathematics, and no further. If a physiologist who knows no mathematics works together with a mathematician who knows no physiology, the one will be unable to state his problem in terms that the other can manipulate, and the second will be unable to put the answers in any form that the first can understand. Dr. Rosenblueth has always insisted that a proper exploration of these blank spaces on the map of science could only be made by a team of scientists, each a specialist in his own field but each possessing a thoroughly sound and trained acquaintance with the fields of his neighbors; all in the habit of working together, of knowing one another's intellectual customs, and of recognizing the significance of a colleague's new suggestion before it has taken on a full formal expression. The mathematician need not have the skill to conduct a physiological experiment, but he must have the skill to understand one, to criticize one, and to suggest one. The physiologist need not be able to prove a certain mathematical theorem, but he must be able to grasp its physiological significance and to tell the mathematician for what he should look. We had dreamed for years of an institution of independent scientists, working together in one of these backwoods of science, not as subordinates of some great executive officer, but joined by the desire, indeed by the spiritual necessity, to understand the region as a whole, and to lend one another the strength of that understanding.
1.1 Computers and Numerical Algorithms

Ever since early computers became available, both physicists and mathematicians have used them to examine and simulate problems in their fields. As the speed and computational power of computers grew, so did the size and scope of the problems they were applied to. Surprisingly, however, the underlying algorithms used to solve various problems have not developed nearly as fast as the technology. The most widely used numerical integrators, for example, are still Runge-Kutta type integrators, named after the German mathematicians Carl Runge and Martin Kutta who developed these methods around 1900. Similarly, the fundamental techniques behind most numerical algorithms in use today are still linear algebraic methods based on mathematical techniques that have been known for over a century. One of the most commonly used operations in modern numerical mathematics is still the inversion of matrices, which is used in approximating solutions to PDEs, fitting of parameters, and many other problems in numerical analysis.

While the implementations of these algorithms have become better and faster, and new types of algorithms to solve these problems have been developed, the underlying approach to the actual numerical problem to be solved has not changed very much since the early days of the computer age. In part this is due to the fact that many of the currently used methods provide quick, sufficiently good results to be used in physics and engineering applications. Many new discoveries would have been impossible had it not been for the application of these numerical techniques to large real world systems. In physics, the field of computational physics has nowadays become a fully accepted third branch, equally well established alongside theoretical and experimental physics. Engineers use numerical tools every day in simulations during the design process for new and exciting products.

Also in mathematics, computer simulations have been widely used, especially in the field of dynamical systems. Everybody has seen pictures of the Lorenz "butterfly" attractor, or the strange attractor in the Hénon map, or the fractal pictures of the Mandelbrot set. However, while these simulations have spurred much interest and research in chaos theory, except for the very simplest one dimensional systems very little is actually rigorously known about such systems. The numerical simulations have spurred many conjectures, many of which are widely believed to be true. But almost none of these have been rigorously proven to this day.

1.2 Problem

Why is it that, while having a huge impact in physics and engineering, computers have not yet had much impact in the mathematical field in terms of rigorous understanding? The following simple example demonstrates the problem of using computers in dynamical systems research. Consider the logistic map given by

f(x) = r · x · (1 − x)    (1.1)

where r is a parameter. For r = 4, it is easy to see that the map maps the interval [0, 1] onto itself. A typical question in dynamical systems is the behavior of a system under iteration. Starting with the value 15/16, this very simple map is iterated repeatedly. To do so on a computer typically means that the actual numbers are approximated by so-called double precision floating point numbers (see Section 2.1). Fortunately, the number 15/16 can be represented exactly in double precision, as can 4 and 1, so neither the initial value nor the representation of f(x) is a problem.
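To make the setup concrete, the following small Python sketch (my own illustration, not part of the original COSY INFINITY computation and not a rigorous interval evaluation) iterates the logistic map both in ordinary double precision and, as a much more accurate reference, with the standard decimal module set to 120 significant digits. The two trajectories visibly separate after roughly 50 to 60 iterations, consistent with the figures discussed below.

    # Iterate f(x) = 4*x*(1-x) starting from 15/16 in double precision and in
    # 120-digit decimal arithmetic (non-rigorous, but a far more accurate reference).
    from decimal import Decimal, getcontext

    getcontext().prec = 120            # about 120 significant decimal digits

    x_float = 15.0 / 16.0              # exactly representable in binary64
    x_dec = Decimal(15) / Decimal(16)

    for i in range(1, 61):
        x_float = 4.0 * x_float * (1.0 - x_float)
        x_dec = 4 * x_dec * (1 - x_dec)
        if i in (5, 25, 45, 55, 60):
            print(f"iteration {i:3d}: float = {x_float:.15f}, "
                  f"reference = {float(x_dec):.15f}")

The divergence is not a flaw of the decimal reference; it simply has so many guard digits that its own rounding errors have not yet been amplified to a visible size, whereas the double precision errors of about 10^-16 are magnified by roughly a factor of two per iteration.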
Reaching ahead a little, the logistic map is simultaneously evaluated in rigorous interval arithmetic (see Section 2.2) using the rigorous, high precision intervals introduced in Chapter 3. While we delve into the details in the following chapters, suffice it to say at this point that these intervals provide a mathematically correct enclosure of the actual result.

Figures 1.1 through 1.11 show the evolution of both the floating point value and the mathematically correct high precision interval after a growing number of iterations. The graph shown is that of f(x) over the interval [−1, 1]. The red mark on the x-axis of each graph shows the value of the floating point evaluation, while the two blue bounds are enclosures of the mathematically correct value. In the beginning, the red mark and the blue bounds are identical, and the blue bounds cover up the red mark. However, as the iteration continues, the two values grow apart. After 55 iterations (Figure 1.6), it is clearly visible that the floating point value on the right and the sharp blue interval enclosure of the mathematically correct value no longer coincide. The situation does not improve with further iteration, and in the end not a single decimal digit of the result computed using floating point arithmetic is correct any more: after 60 iterations the correct result is found near 0, while the floating point value is close to 1 at the other end of the interval.

So why not just use these High Precision intervals instead of double precision floating point numbers? Continuing the process to 130 iterations (Figure 1.8) shows that, just as the double precision value grew increasingly inaccurate, the High Precision interval also loses sharpness. This trend has been going on all along; it was just not visible so far due to its scale: the High Precision interval started with a width of about 10^-60, so that the gradual change in the width was not visible in the pictures up to 130 iterations. After about 130 iterations, however, it is clear that the width of the interval grows quickly, and after 133 iterations it already covers 1/3 of the entire interval, rendering any further computation useless.

This very simple example demonstrates the issues involved in rigorous computation. Even after a very modest number of iterations of certain rather trivial maps, the accumulated errors can be staggering. Considering how many evaluations of a typically much more complicated function are needed to perform just one single step in a Runge-Kutta integration scheme, and how many such steps are typically performed, it becomes quite apparent that, while a powerful tool in the study of mathematical and physical systems, computers can very quickly reach their limitations when applied to certain problems. While the example in Equation 1.1 is particularly chosen to exhibit chaotic behavior, it is not an untypical case. Most interesting real world systems studied by physicists, for example in celestial mechanics or accelerator physics, exhibit similar chaotic motion and are just as susceptible to very fast accumulation of errors. This is not even a question of making computer implementations better; it is "hard coded" into the DNA of the problem itself. While posing a great problem, at the same time this effect is what makes the study of such systems so interesting and rewarding.
Figure 1.1: The logistic map for r = 4 after 5 iterations with floating point numbers (red) and High Precision intervals (blue). For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Figure 1.2: The logistic map for r = 4 after 15 iterations with floating point numbers (red) and High Precision intervals (blue).

Figure 1.3: The logistic map for r = 4 after 25 iterations with floating point numbers (red) and High Precision intervals (blue).

Figure 1.4: The logistic map for r = 4 after 35 iterations with floating point numbers (red) and High Precision intervals (blue).

Figure 1.5: The logistic map for r = 4 after 45 iterations with floating point numbers (red) and High Precision intervals (blue).

Figure 1.6: The logistic map for r = 4 after 55 iterations with floating point numbers (red) and High Precision intervals (blue).

Figure 1.7: The logistic map for r = 4 after 60 iterations with floating point numbers (red) and High Precision intervals (blue).

Figure 1.8: The logistic map for r = 4 after 130 iterations with floating point numbers (red) and High Precision intervals (blue).

Figure 1.9: The logistic map for r = 4 after 131 iterations with floating point numbers (red) and High Precision intervals (blue).

Figure 1.10: The logistic map for r = 4 after 132 iterations with floating point numbers (red) and High Precision intervals (blue).

Figure 1.11: The logistic map for r = 4 after 133 iterations with floating point numbers (red) and High Precision intervals (blue).

1.3 Outline

This dissertation certainly will not solve all open problems in rigorous numerics. But it will present several relatively new high order, verified numerical techniques, which will then be applied to a variety of problems in dynamical systems. These techniques, developed mostly over the last decade, will be improved to meet the demands of working with even more complicated dynamical systems. It is the belief of the author of this dissertation that these tools and techniques will prove to be useful in the future study of dynamical systems via computer assisted proofs.

In Chapter two, some general concepts of numerical computation will be introduced, including floating point numbers and methods for verified rigorous computation such as interval arithmetic and high order Taylor Model methods. These allow the implementation of numerical computations which produce a mathematically rigorous result that can be used to prove various statements.

Chapter three describes in detail the design and implementation of a major improvement of the Taylor Model methods introduced in Chapter two. The rigorous Taylor Models with high precision coefficients presented in this chapter will form the basis for many exciting new computations that are not possible with the currently available rigorous numerical techniques.

Chapter four will deal with the rigorous identification and verification of fixed points of maps. Again using Taylor Model methods, we will prove the existence of a period 15 fixed point in the Hénon map with near-standard parameters. We then extend this work to describe an automated rigorous global fixed point finder, and apply this tool to a rigorous analysis of a beam transfer map from accelerator physics.

In Chapter five, various methods are presented to compute rigorous enclosures of invariant manifolds using high order Taylor Model methods. Unlike many of the existing non-verified methods, the techniques detailed here allow the rigorous enclosure of large pieces of invariant manifolds with a very small error.
Chapter 2

Fundamental Concepts

In this chapter we introduce certain well-known and fundamental concepts used throughout this thesis.

2.1 Floating Point Numbers

To carry out calculations with real numbers on a computer, most modern processors use floating point numbers. The concept behind the representation of a floating point number is essentially the same as that of "scientific notation": significant digits combined with a power of ten to represent the order of magnitude. The same is done with floating point numbers; the only difference is that, due to the binary number system, a power of two is used to signify the magnitude.

Definition 2.1. We define the set of all floating point numbers R to be given by

R = { m_z · 2^(e_z)  |  2^(t−1) ≤ |m_z| < 2^t ;  M_min < e_z < M_max }

The constants t, M_min, M_max are fixed and define the floating point number system. M_min and M_max define the exponent range and thus the largest and smallest representable numbers. To make the following proofs easier to understand, we will assume that the exponent range is unlimited, i.e. M_min = −∞ and M_max = ∞. This is of course not true for computer systems, where overflows and underflows of the exponent may happen. In our practical implementation we have to deal with those cases separately. The parameter t is the mantissa length in binary digits and thus defines the relative precision of the floating point system (see below). In the following we will use floating point systems with different mantissa lengths, which we will denote by R_t. Over- and underflows notwithstanding, we clearly have that R_t ⊂ R_t' if t ≤ t'.

The lower bound requirement on the mantissa is called normalization. With this additional requirement, the values represented by floating point numbers become unique. Mantissae with absolute value less than 2^(t−1) can be multiplied by a power of two so that they lie within the allowed range for the mantissa, while the exponent is adjusted accordingly.

Given any real number r within the range of the floating point representation, we will denote by r̃ ∈ R the closest floating point number in the given system of floating point numbers. Then it follows readily from Definition 2.1 that

|r − r̃| < ε_m · |r|,  where ε_m = 2^(−t).

The value ε_m is called the machine precision and is determined by the length of the mantissa t.

Every floating point implementation has to provide at least the basic operations addition, subtraction, and multiplication. Clearly the mathematical result of any of those operations on two arbitrary floating point numbers a, b ∈ R does not necessarily have to be in R. Thus, the floating point operations corresponding to +, −, × are not the same as their mathematical counterparts on the real numbers. Let ⊕, ⊖, ⊗ denote the floating point operations for +, −, ×.

Definition 2.2. Let ⊙ denote one of the floating point operations ⊕, ⊖, ⊗ and • the same operation on the real numbers. The operation ⊙ is said to be round-to-nearest if for all a, b ∈ R

|(a ⊙ b) − (a • b)| = min_{x ∈ R} |x − (a • b)|.

Note that, if a floating point operation is round-to-nearest, the result is the floating point number closest to the mathematically correct result. In case of a toss-up, i.e. if the mathematically correct result is exactly between two floating point numbers, we will accept either one. Another immediate consequence is that if the result of an operation is representable exactly by a floating point number, then we obtain the correct result without round-off errors.
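As a quick illustration of these definitions (my own sketch, not part of the thesis code): Python floats are IEEE 754 binary64 numbers, so t = 53 in the notation above, and the machine precision and the round-to-nearest behavior can be observed directly.

    # Illustrative only: Python floats are IEEE 754 doubles, so t = 53 and
    # eps_m = 2^-53 in the notation of Definition 2.1.
    from fractions import Fraction
    import sys

    print(sys.float_info.epsilon == 2.0 ** -52)   # spacing of doubles near 1.0

    # Round-to-nearest tie: 1 + 2^-53 lies exactly halfway between 1 and
    # 1 + 2^-52; the tie is resolved to the even mantissa, i.e. back to 1.0.
    print(1.0 + 2.0 ** -53 == 1.0, 1.0 + 2.0 ** -52 > 1.0)

    # |(a (+) b) - (a + b)| <= eps_m * |a + b|, using Fraction for the exact sum.
    a, b = 1.0 / 3.0, 1.0 / 7.0
    exact = Fraction(a) + Fraction(b)             # exact real sum of the operands
    print(abs(Fraction(a + b) - exact) <= Fraction(2) ** -53 * abs(exact))

All three checks print True on any IEEE 754 compliant platform.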
From this definition, a bound for rounding errors and a useful condition for the mantissa of the result of a round-to-nearest operation follow easily. Let z = m_z · 2^(e_z) = a ⊙ b. Then:

|z − (a • b)| < ε_m · |z|    (2.1)

This is clear since, if the error were larger, then either the floating point number (m_z + 1) · 2^(e_z) or (m_z − 1) · 2^(e_z) would be closer to the correct result. Furthermore, for the mantissa m_z the following equation holds:

m_z = [ (m_a · 2^(e_a) • m_b · 2^(e_b)) / 2^(e_z) ]    (2.2)

where [x] denotes rounding to the nearest integer.

In most modern computers the constants t, M_min, M_max are defined to follow the IEEE 754 standard[2]. The double precision numbers defined in that standard specify that t = 53, M_max = 1023, M_min = −1024. Thus, for double precision numbers ε_m = 2^(−53) ≈ 10^(−16), and therefore in double precision we can represent about 16 valid decimal digits. The standard also specifies that the elementary floating point operations ⊕, ⊖, ⊗ can be set to be round-to-nearest. Consistent with the notation introduced above, we will denote the set of double precision floating point numbers by R_53.

2.2 Interval Arithmetic

Traditionally, verified computations have been performed using interval arithmetic. The development of this technique began with the groundbreaking work in the 1960s, mostly by R. E. Moore[38], and has since grown into a well-known and broadly studied field. It is based on defining the result of an arithmetic operation on intervals as an interval containing all possible results, i.e. for compact intervals A ⊂ R, B ⊂ R and an operation ◦, the result of A ◦ B is a compact interval satisfying the property

A ◦ B ⊇ { a ◦ b | a ∈ A, b ∈ B }.

While easy to implement rigorously on a computer using correct rounding of floating point numbers, and while well studied, interval arithmetic has various intrinsic drawbacks which inflate its error bounds tremendously and make it practically useless for many real world applications. While the final result of an interval computation is guaranteed to contain the correct result, the width of the resulting interval is typically much larger than mathematically necessary. This effect is called overestimation, and it is caused by various factors.

Figure 2.1: Enclosure of a relatively simple, non-convex set (red) by an interval box (blue) and the resulting overestimation.

2.2.1 Dependency Problem

One source of overestimation in interval arithmetic is the so-called dependency problem. Evaluating a simple expression such as x − x in interval arithmetic for x = [−1, 1], one obtains x − x = [−1, 1] − [−1, 1] = [−2, 2], since the dependence between the two occurrences of x is lost. While this correctly encloses the mathematically rigorous result 0, vast overestimation is introduced. Various techniques have been developed to minimize this effect by rewriting equations before evaluating them in interval arithmetic, but they require manual rewriting for each equation and only ease the dependency problem rather than solve it.
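The following minimal Python sketch (an illustration only, not the COSY INFINITY implementation) shows the basic idea of interval arithmetic and reproduces the dependency problem just described. Since plain Python offers no control over the rounding mode, the sketch rounds each computed bound outward by one floating point number using math.nextafter (Python 3.9 or later), which is sufficient because a round-to-nearest result is within half a unit in the last place of the exact value; division and exceptional cases are omitted.

    import math

    def _down(x): return math.nextafter(x, -math.inf)   # push lower bound down
    def _up(x):   return math.nextafter(x, math.inf)    # push upper bound up

    class Interval:
        """A compact interval [lo, hi] with crude outward rounding after each operation."""
        def __init__(self, lo, hi=None):
            self.lo = lo
            self.hi = lo if hi is None else hi

        def __add__(self, other):
            return Interval(_down(self.lo + other.lo), _up(self.hi + other.hi))

        def __sub__(self, other):
            return Interval(_down(self.lo - other.hi), _up(self.hi - other.lo))

        def __mul__(self, other):
            p = [self.lo * other.lo, self.lo * other.hi,
                 self.hi * other.lo, self.hi * other.hi]
            return Interval(_down(min(p)), _up(max(p)))

        def __repr__(self):
            return f"[{self.lo}, {self.hi}]"

    x = Interval(-1.0, 1.0)
    print(x - x)   # about [-2, 2]: the dependency problem, although x - x = 0
    print(x * x)   # about [-1, 1], although the true range of x^2 is [0, 1]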
Since in complicated computations consisting of repeated iteration of similar steps this wrapping happens repeatedly for every intermediate result, these errors due to the wrapping effect can add up significantly during the computation. 2.2.3 Scaling In interval arithmetic, the overestimation in general scales linearly with the width of an interval. Consider again the example used before: f (x) = x − x. As described before, when evaluated over the domain [−1, 1], large overestimation is introduced, and the width of the result is 4. Evaluating the same function instead over the interval [m − k, m + k], where |m|, |k| < 1, yields the following interval result f ([m − k, m + k]) = [m − k, m + k] − [m − k, m + k] = [−2 · k, 2 · k]. Clearly, the width of the result, 4 · k, scales linearly with the width of the initial interval, 20 2 · k. Thus, to obtain a result of a prescribed width of less than , in general the width w of the initial interval to evaluate an expression with goes as w = /C (2.3) for some problem dependent constant C. Particularly in higher dimensions, this quickly leads to an effect commonly referred to as the dimensionality curse. The number of boxes N required to cover an n dimensional box [−1, 1]n by boxes of side length less than w is given by the expression N= 2 n . w That is, the number of boxes scales exponentially with the dimensionality of the problem. It is therefore essential to make w as large as possible so that the number of boxes required is minimized. 2.3 Taylor Model Arithmetic To overcome the problems inherent in Interval arithmetic, Martin Berz, Kyoko Makino et al. developed the concept of Taylor Models. Taylor Models use high order polynomial expansions based on automatic differentiation techniques combined with automatically verified error bounds over a given domain to represent very sharp enclosures of sets[28, 29, 9, 30]. In the following, we will give a brief overview over Taylor Models, drawing largely from [29], which is also a good reference for a more in depth description of Taylor Models. 21 A Taylor Model consists of a polynomial P : Rp → R of fixed order n in p variables as well as an remainder term R 0. All Taylor Models are defined on a specific domain D ⊂ Rp , which by convention and without loss of generality is typically assumed to be [−1, 1]p . We write a Taylor Model T as T = P (x) ± R or T = (P, R). Let now f : D → R be any bounded function on D. We say the Taylor Model (P, R) is a representation of f , if |f (x) − P (x)| R ∀ x ∈ D. Note in particular that this definition is valid for any function f independent of its smoothness. Consider now the size of R in such a representation. If f is in C n+1 , Taylor’s theorem guarantees that R can be chosen in such a way that R = O |D|n+1 2.3.1 as |D| → 0. Taylor Model Operations Elementary operations on two Taylor Models (P1 , R1 ) and (P2 , R2 ) are defined as (P1 , R1 ) + (P2 , R2 ) = (P1 + P2 , R1 + R2 ) c · (P, R) = (c · P, |c| · R) for c ∈ R (P1 , R1 ) · (P2 , R2 ) = (P1·2 , R1·2 ) 22 Here P1·2 represents P1 · P2 truncated to order n. Letting Pe = P1 · P2 − P1·2 be the remaining polynomial terms of order (n + 1) through 2n, R1·2 is then given by R1·2 = R1 · R2 + B(Pe ) + B(P1 ) · R2 + B(P2 ) · R1 . B(P ) in this context represents a bound on the maximum absolute value of the range of P over the domain D, i.e. |P (x)| < B(P ) ∀x ∈ D. Let f1 , f2 : D → R be two bounded functions, and (P1 , R1 ), (P2 , R2 ) two Taylor Models representing f1 and f2 respectively. 
Here P1·2 represents P1 · P2 truncated to order n. Letting Pe = P1 · P2 − P1·2 be the remaining polynomial terms of orders (n + 1) through 2n, the remainder R1·2 is then given by

R1·2 = R1 · R2 + B(Pe) + B(P1) · R2 + B(P2) · R1.

B(P) in this context represents a bound on the maximum absolute value of the range of P over the domain D, i.e. |P(x)| < B(P) for all x ∈ D.

Let f1, f2 : D → R be two bounded functions, and (P1, R1), (P2, R2) two Taylor Models representing f1 and f2 respectively. It can be shown[31, 29, 28] that the Taylor Model operations as defined above satisfy the two properties

(P1, R1) + (P2, R2) represents f1 + f2,
(P1, R1) · (P2, R2) represents f1 · f2.

Furthermore, for any real c we have that the Taylor Model c · (P1, R1) represents the function c · f1.

2.3.2 Intrinsic Functions on Taylor Models

In the references given above, further discussions of intrinsic functions such as trigonometric functions for Taylor Models can be found. For all of these operations the representation property is retained, i.e. if (P, R) represents f then op((P, R)) represents op(f) on the chosen domain.

For example, consider the exponential function exp for Taylor Models. Let (P, R) be a Taylor Model, and denote by P0 the constant part of P and by P> = P − P0 the non-constant part of P. Then

exp((P, R)) = exp(P0 + (P>, R)) = exp(P0) · exp((P>, R))
            = exp(P0) · ( Σ_{i=0}^{n} (P>, R)^i / i!  +  (P>, R)^(n+1) / (n + 1)! · exp(θ · (P>, R)) )

where the expression in parentheses is the truncated Taylor series of the exponential function together with the Taylor remainder term, which is valid for some θ ∈ [0, 1]. Note that since P0 is only a number, a sharp enclosure of exp(P0) with a rigorous error bound can be computed, e.g. using the algorithm described in Section 3.4.5. Furthermore, the sum in parentheses can be evaluated directly in Taylor Model arithmetic, yielding both a polynomial part and a remainder term. For the last Taylor remainder part, note that since P> is a polynomial of order at least 1, the power (P>, R)^(n+1) does not have any polynomial part in Taylor Model arithmetic, since the resulting polynomial is of order greater than n. Thus (P>, R)^(n+1) is a Taylor Model consisting only of a remainder term. The expression exp(θ · (P>, R)) can be bounded by first computing a constant-part-only Taylor Model (B, RB) representing a bound of P> over the domain D. Then a rigorous upper bound of the maximum absolute value of exp([0, 1] · (B, RB)) can be computed, again using constant-part-only Taylor Model arithmetic.

2.3.3 Scaling

Let f, g ∈ C^(n+1), and let (Pf, Rf), (Pg, Rg) be n-th order Taylor Model enclosures of f and g over some domain of width w, respectively. Furthermore let

(Pf+g, Rf+g) = (Pf, Rf) + (Pg, Rg),
(Pf·g, Rf·g) = (Pf, Rf) · (Pg, Rg), and
(Pop(f), Rop(f)) = op((Pf, Rf)),

where op represents any intrinsic operation. Assume the error terms Rf and Rg scale with order n + 1 in the width of the initial domain, i.e.

Rf = O(w^(n+1)),  Rg = O(w^(n+1)),

for sufficiently small w. It can then be shown that, in the absence of round-off errors, the remainder terms Rf+g, Rf·g, Rop(f) for Taylor Model addition, multiplication and the intrinsic operation op all satisfy the same scaling property as the initial remainder bounds Rf, Rg, i.e.

Rf+g = O(w^(n+1)),  Rf·g = O(w^(n+1)),  Rop(f) = O(w^(n+1)).

This fact is known as the Taylor Model scaling theorem.
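Before turning to the consequences of the scaling theorem, the following deliberately simplified, single-variable Python sketch makes the remainder bookkeeping of the operations in Section 2.3.1 concrete. It is my own illustration, not the COSY INFINITY implementation: polynomials are stored as coefficient lists on the domain D = [−1, 1], B(P) is bounded crudely by the sum of absolute coefficient values, and floating point round-off is ignored.

    class TM1:
        """Order-n Taylor Model (P, R) in one variable on the domain [-1, 1]."""
        def __init__(self, coeffs, rem, order):
            self.c = list(coeffs) + [0.0] * (order + 1 - len(coeffs))
            self.r = rem                       # remainder bound R >= 0
            self.n = order

        def bound(self, coeffs=None):
            # Rigorous (though crude) bound B(P) >= |P(x)| for x in [-1, 1].
            return sum(abs(a) for a in (self.c if coeffs is None else coeffs))

        def __add__(self, other):
            return TM1([a + b for a, b in zip(self.c, other.c)],
                       self.r + other.r, self.n)

        def __mul__(self, other):
            full = [0.0] * (2 * self.n + 1)    # exact product polynomial P1 * P2
            for i, a in enumerate(self.c):
                for j, b in enumerate(other.c):
                    full[i + j] += a * b
            trunc, excess = full[: self.n + 1], full[self.n + 1 :]
            # R_{1*2} = R1*R2 + B(Pe) + B(P1)*R2 + B(P2)*R1
            r = (self.r * other.r + self.bound(excess)
                 + self.bound() * other.r + other.bound() * self.r)
            return TM1(trunc, r, self.n)

    # (P, R) representing f(x) = x on [-1, 1] with zero remainder, order 3:
    x = TM1([0.0, 1.0], 0.0, 3)
    f = x * x * x * x            # x^4 truncated to order 3: a pure remainder term
    print(f.c, f.r)              # [0.0, 0.0, 0.0, 0.0]  1.0

The last line shows the order-(n+1) term being swept into the remainder bound, exactly the mechanism that makes the remainders scale as O(w^(n+1)) when the domain is shrunk.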
That is, the minimum width required for Taylor Models to obtain a prescribed accuracy is much larger than in the interval case, so fewer Taylor Models are required. 2.3.4 Order Loss While always correct in theory, in practice this scaling law breaks down at some point when the width of the initial box becomes too small. This leads to somewhat counterintuitive situations, where Taylor Model methods perform very well over relatively large boxes, but suddenly fail with smaller boxes. The application of Taylor Models presented in Chapter 4 to compute very sharp enclosures of periodic and fixed points in particular will prove to suffer from this effect. The cause of this effect is due to the limited precision with which the coefficients of the Taylor Model are stored on a computer. Effectively, this limited precision causes a loss of computation order when the coefficients of Taylor Models become too small. In regular IEEE 754[2] compliant double precision floating point Taylor Models, where coefficients are stored as double precision numbers, each coefficient only carries about 16 26 significant decimal digits. This finite precision requires careful bookkeeping of the truncation errors during Taylor Model operations. Great care has been taken in implementing Taylor Models to bound all those errors in the remainder bound of the Taylor Model. However, these round-off errors of each floating point operation do not scale according to the above Taylor Model scaling law. The round-off error for a given coefficient of the Taylor Model resulting from a Taylor Model operation only scales as the coefficient itself. So for linear coefficients, the round-off error only scales linearly, while the round-off error for the constant part does not scale at all. Under most circumstances, this poses no problem as the absolute value of the round-off error is very much smaller than the error due to the truncation of the Taylor series. However, as the coefficients of a Taylor Model become smaller, the size of the Taylor series truncation error scales as O(wn+1 ), and the non-scaling round-off errors become a relatively larger contribution to the overall error. Eventually, the higher order terms of the Taylor Model get smaller than the round-off errors. Effectively then those orders are lost, in the sense that keeping these terms in the Taylor Model does not increase the accuracy of the computation. The error term would not change if these orders were truncated during the computation. This effect is known as the broken scaling law, or order loss. Consider as an example the simple function f (x) = (1 + x + x2 + x3 )/3. Evaluate this function in Taylor Model arithmetic of order two with a Taylor Model box of width 2 · around 0 of the form T = · x1 . Since the operations are performed in double 27 precision floating point arithmetic, the round-off error for the resulting constant part (1/3) is on the order of 10−16 as 1/3 is not a finite binary fraction, and has to be truncated after 16 digits. The round-off error of /3 and 2 /3, respectively, is on the order of 10−16 · and 10−16 · 2 . Since computations are performed only to order 2, the entire x3 /3 term is truncated, yielding a truncation error contribution to the overall error bound on the order of 3 /3. Let now = 2−10 ≈ 10−3 . That way, all powers of are exact floating point numbers, and the only round-off error occurs when dividing by 3. 
Floating point round-off errors for each of the resulting coefficient are on the order of 10−16 , 10−20 , and 10−24 respectively, while the series truncation error is of order 3 · 10−10 . Taken all together, the error bound of the resulting Taylor Model is on the order of 10−10 , clearly dominated by the truncation error, which scales as O( 3 ) as expected. Note that the magnitude of the second order coefficient is 1/3 · 2−20 ≈ 3 · 10−7 . Consider now the case where = 2−30 ≈ 10−9 . Again powers of are exact floating point numbers, and the only round-off error occurs when dividing by 3. The round-off errors for each coefficient are now on the order of 10−16 , 10−25 , and 10−34 respectively, while the truncation error has shrunk to the order of 3·10−28 . It can clearly be seen how the round-off error of the constant part does not scale with at all. The resulting error bound thus is on the order of 10−16 , easily surpassing the second order coefficient, 1/3 · 2−60 ≈ 3 · 10−19 , in magnitude. Thus, carrying out the above computation using only first order Taylor Models would yield the same quality of result as the second order computation. Effectively the second order is lost, and the error scales no better than that of a first order Taylor Model evaluation 28 would. As gets even smaller, the Taylor Model asymptotically behaves no better than a classical interval, as only the constant part dominates. The only way to prevent this order loss is to perform the computation using a higher precision for the lower order coefficients. That way, the round-off error of the constant part is reduced in magnitude, and higher order computations remain viable for smaller values of . Of course for any finite precision, there exists an such that the Taylor Model still behaves no better than an interval. However as long as this is sufficiently small, full advantage can be taken of the scaling law. In Chapter 3 we introduce High Precision Taylor Models which will provide such features. Note that this is a markedly high-order effect. The closest analogy to order loss in interval arithmetic is practically never observed because it only occurs with very narrow intervals. Consider an interval in which both the lower and upper bound are stored as double precision numbers. Since double precision numbers are discreet, there is a minimum width for an interval, given by the spacing of floating point numbers. Once that minimum width has been reached, attempts to reduce the width of the interval will not lead to better results. Evaluate, for example, the above expression for f (x) in interval arithmetic. Since double precision floating point numbers around 1/3 are spaced approximately 10−16 apart, the resulting interval for f ([0, 10−16 ]) could be roughly of the form [1/3, 1/3+10−16 ]. However, reducing the width of the argument further to e.g. [0, 10−20 ], does not bring any further improvement, as the thinnest possible double precision interval at 1/3 is roughly [1/3, 1/3 + 10−16 ]. Clearly, for all practical purposes, this effect is negligible in interval arithmetic. In Taylor Models, however, it is much more pronounced due to the strong scaling of the higher order 29 terms. 2.3.5 Implementation A computer implementation of such Taylor Models is provided in the COSY INFINITY[32] package. With this implementation, many of the problems present in interval arithmetic are avoided and even complicated computations can be performed quickly and with minimal overestimation. Among the many applications are the R. E. 
Moore prize winning verified integrator COSY VI[34, 10, 35] and the global optimizer COSY GO[33], both developed by Kyoko Makino and Martin Berz. This implementation stores the coefficients of the polynomial constituting each Taylor Model as double precision floating point numbers. In Chapter three, fundamental work will be presented to extend this implementation such that coefficients can be stored in high precision, allowing for much more accurate computations. 30 Chapter 3 High Precision Operations In this chapter, we develop the theory and describe the implementation of High Precision intervals within the COSY INFINITY framework, which will then lead naturally to the implementation of High Precision Taylor Models. To that end, we begin by reiterating some general concepts, as well as introducing wellknown, fundamental operations that allow the precise computation of round-off errors for elementary operations on floating point numbers (2.1). Based on those Dekker operations[16], named after T.J. Dekker who first published them in 1971, I will then describe the general features of an implementation of a high precision interval data type. Once the basic routines of this interval data type are available, we will briefly describe a generalization of those elementary arithmetic routines to full-fledged Taylor Models. Based on those elementary operations, we proceed to develop rigorous, verified algorithms to compute various intrinsic functions. These will then form the basis for a full implementation of High Precision Taylor Models. Our main objectives for the design, as well as the implementation of all of the above 31 algorithms are Rigor The algorithms must be absolutely rigorous. That means we will take into account all possible sources of errors, including roundoff errors due to floating point operations and errors due to numerical methods used. Speed Since the algorithms will be included in version 10 of the COSY INFINITY[32] software package, they have to perform well speed-wise. We will optimize our algorithms, using the fact that our calculations do not have to be precise to the last bit, as long as we do not violate the first objective, mathematical rigor. Precision While focusing on the range of up to about 60 significant digits, the code is designed such that it is not restricted to any fixed precision. This separates it from other existing quadruple precision (about 30 digits) packages based on Dekker operations. 3.1 General Concepts In this section, we will introduce various general concepts which are used throughout the remainder of this chapter. We will not provide proofs for these methods, but instead refer the reader to the given references or any introductory analysis textbook. 3.1.1 Newton Method The Newton Method[12] is a well known method for iteratively finding the zeros of a function f (x). For all sufficiently smooth functions f , there is a neighborhood of the solution, in which the convergence of this method is super-linear. This is of particular interest for our problems, since we can use the available non-rigorous floating point intrinsics to quickly 32 obtain an approximation of the result, which we then can use as a starting value. From there, it typically takes very few iterations to converge to the final result in the desired precision. 
The Newton iteration step is calculated according to the equation

x_{n+1} = x_n − f(x_n) / f′(x_n)    (3.1)

3.1.2 Interval Newton Method

Based on the traditional Newton method, Moore introduced the following, slightly modified, interval Newton method[37]:

X_{n+1} = m(X_n) − f(m(X_n)) / f′(X_n)    (3.2)

Here m(X_n) stands for the midpoint of the interval X_n. If all operations in the above equation are carried out rigorously, and the initial interval contained a zero of the function f, the resulting interval X_{n+1} is also guaranteed to contain a zero. Thus, by iterating 3.2, one can obtain an interval enclosure of a zero of f. For this method to work, it is required that the interval Newton step actually contracts the argument. Note that it is possible to define X_{n+1} as the intersection of X_n and the interval Newton step acting on X_n. The above statements still hold in that case. This intersection, however, often is not necessary if the interval Newton step contracts the interval strongly enough.

3.1.3 Taylor Expansion with Remainder

Using the Taylor expansion to evaluate certain intrinsic functions is a very powerful method to obtain good interval approximations[11, 12]. In particular, in many cases we can derive an analytic expression for the remainder term. A Taylor expansion of a function f(x) up to order n (around 0) is given by 3.3:

f(x) = Σ_{i=0}^{n} f^{(i)}(0) / i! · x^i + R_n    (3.3)

The remainder term of that expansion in the Lagrange form is given by

R_n = f^{(n+1)}(ξ) / (n+1)! · x^{n+1}    (3.4)

for some ξ ∈ (0, x). While there are other forms for the remainder term, this one is the most useful for our purposes. Given that ||f^{(n+1)}||_∞ exists, i.e. f^{(n+1)} is bounded over the whole domain of x, we can simplify the remainder term to the bound

|R_n| ≤ ||f^{(n+1)}||_∞ / (n+1)! · |x|^{n+1}    (3.5)

3.1.4 Leibniz Criterion

Another method for bounding the remainder of a sum is the Leibniz criterion[12]. Given a monotonically decreasing sequence a_n such that lim_{n→∞} a_n = 0, we have a_n ≥ 0 for all n and the series

s = Σ_{n=0}^{∞} (−1)^n a_n

does converge. Furthermore, the remainder after the m-th partial sum is bounded by

|s_m − s| = | Σ_{n=0}^{m} (−1)^n a_n − Σ_{n=0}^{∞} (−1)^n a_n | = | Σ_{n=m+1}^{∞} (−1)^n a_n | ≤ |a_{m+1}|

3.1.5 Kahan Summation

The Kahan summation algorithm[23] is an algorithm that allows the summation of a sequence of floating point numbers, producing a much smaller error than a naive floating point summation would yield. It follows the idea of the Dekker addition, by calculating the round-off error in each addition and feeding it back into the calculation in the next step. While the result is, of course, still not exact, the round-off error is greatly reduced compared to naive addition. Let sum be the floating point sum of a sequence of floating point numbers a_i, and c = 0 initially. Then, for each a_i the following operations are performed in floating point arithmetic:

y = a_i − c
t = sum + y
c = (t − sum) − y
sum = t

We will not go into details of the algorithm here; for more information see [18, 23].

3.2 Precise Elementary Operations on Floating Point Numbers

In the following subsections we will state some well-known facts about obtaining exact results for the basic floating point operations. While this may sound surprising at first, it is indeed possible to obtain the round-off errors of the basic floating point operations exactly from within the floating point arithmetic.
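As a quick illustration of this claim, the following minimal Python sketch (Python floats are IEEE 754 double precision numbers with round-to-nearest) recovers the round-off error of a single addition exactly, using only ordinary floating point operations. It is a direct transcription of the first theorem discussed next and therefore assumes |a| ≥ |b|.

# Minimal sketch: recover the round-off error of one double precision addition
# exactly, using only double precision operations.  Requires |a| >= |b|.
def two_sum(a, b):
    assert abs(a) >= abs(b)
    z = a + b          # rounded floating point sum
    w = z - a          # the part of b that actually made it into z
    zz = b - w         # the part of b that was rounded away
    return z, zz       # z + zz == a + b exactly

if __name__ == "__main__":
    z, zz = two_sum(1.0, 1e-20)
    print(z, zz)       # 1.0 and 1e-20: the rounded sum and the exact error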
The theorems and proofs given here are originally due to Dekker[16], who showed that the theorems also hold with slightly lesser requirements on the underlying floating point operations than prescribed by the IEEE 754 standard. But since our implementation will build on IEEE 754 double precision floating point numbers, we will restrict ourselves to those. To give the reader an idea of how the proofs of those theorems work, we will prove some of the theorems while referring the reader to [16, 25] for others. All throughout this section, I will use the notation introduced in Section 2.1. 3.2.1 Two-Sum The first theorem will provide us with a way to calculate the exact round-off error occurring when adding two floating point numbers. Theorem 3.1. Let two double precision floating point numbers a and b such that |a| > |b| be given. Let z = a ⊕ b, w = z a and zz = b w. Then, neglecting possible over- or underflows during the calculation, we have that z + zz = a + b exactly. Proof. Let a = ma · 2ea and b = mb · 2eb . Since |a| > |b| and floating point numbers are normalized, we have that ea eb . It is sufficient to show that w ∈ R53 and b − w ∈ R53 , 36 then the result follows readily from optimality of the floating point operations. Let z = a ⊕ b = mz · 2ez . From 2.2 we get that mz = ma · 2ea −ez + mb · 2eb −ez Since |a + b| < 2|a| we have that ez and ez ea + 1. Now we consider the two cases ez = ea + 1 ea . Assume ez = ea +1. Then mz = ma · 2−1 − mb · 2eb −ea −1 and letting w = mw ·2ea we find that |mw | = |mz · 2ez −ea − ma | = |mz · 2ez −ea − ma − mb · 2eb −ea + mb · 2eb −ea | |2mz − ma − mb · 2eb −ea | + |mb · 2eb −ea | < 2|mz − ma · 2−1 − mb · 2eb −ea −1 | + 253 1 < 2 + 253 2 Since mw is an integer, we therefore have that mw 253 and thus w ∈ R53 , i.e. w is a double precision floating point number. If ez ea the exact same proof carries through, the only difference being that we define w = mw · 2ez . To prove that zz ∈ R53 , we first note that we can write w = i · 2eb for some integer i since ea eb . Secondly, we have that |b − w| = |b − z + a| 37 |b| by optimality. To see this simply let z = x, and then apply Definition 2.2. We thus have |zz| = |b − w| = |mb − i| · 2eb |b| = |mb | · 2eb < 253 · 2eb and thus (mb − i) · 2eb = zz ∈ R53 . Note that by Definition 2.1, floating point numbers are symmetric, i.e. if a ∈ R then −a ∈ R. Thus the above theorem automatically provides exact subtraction as well. It is worth mentioning that there are also other algorithms to calculate the same two values without the condition that a > b, but requiring some additional floating point operations. The following algorithm is due to Knuth [25]. The advantage of this method is that due to pipelining on modern processors it is often faster to perform the three additional floating point operations instead of having to evaluate a conditional statement on the absolute values of a and b. Theorem 3.2. Let two double precision floating point numbers a and b be given. Let z = a ⊕ b, bv = z a, av = z bv and zz = (a av ) ⊕ (b bv ). Then, neglecting possible over- or underflows during the calculation, we have that z + zz = a + b exactly. Proof. For a proof see, for example, [25]. 3.2.2 Splitting Before we can move on to the exact multiplication, we require the concept of the splitting of a double precision number. Definition 3.3. Let a ∈ R53 be given. We call ah , at ∈ R26 the head and the tail of the 38 splitting of a if ah = ma · 2−26 · 2ea +26 at = a − ah This definition may sound surprising at first. 
After all a has 53 mantissa bits, but both ah and at only have 26 bits each yielding a total of 52 bits. The solution to this riddle is the fact that the difference |[a] − a| 1/2, but depending on x it can have either positive or negative sign. So the missing bit is the sign bit of the tail of the splitting. The following theorem, also presented by Dekker, allows us to calculate such a splitting of a double precision number. Theorem 3.4. Let a ∈ R53 be given and let p = a ⊗ (227 +1). Then the head of the splitting of a is given by ah = p ⊕ (a p). Proof. Since the proof of this theorem is somewhat technical and does not contribute much to the understanding of these operations, we refer the reader to the papers of Dekker [16] or Shewchuk [41]. 3.2.3 Multiplication With the notion of a splitting, we can formulate the following theorem for exact multiplication of two double precision numbers: Theorem 3.5. Given two double precision floating point numbers a and b let a = ah + at , b = bh +bt be a splitting as defined above. Also let p = (ah ⊗bh ), q = (at ⊗bh )⊕(ah ⊗bt ) and 39 r = (at ⊗ bt ). Then, neglecting possible over- or underflows during the calculation, z = p ⊕ q and zz = (p z) ⊕ q ⊕ r satisfy z + zz = a · b exactly. Proof. First note that for any two numbers x, y ∈ R26 their product x · y ∈ R52 ⊂ R53 . This is clear since for x = mx · 2ex and y = my · 2ey we have that x · y = mx · my · 2ex +ey and |mx · my | < 252 since |mx | < 226 and |my | < 226 . We also have that a · b = (ah + at ) · (bh + bt ) = ah · bh + ah · bt + at · bh + at · bt . Since ah , at , bh , bt ∈ R26 , each single term in this sum is in R52 . Furthermore, the two cross terms ah · bt and at · bh have the same exponent and therefore their sum is in R53 . Thus p, q, and r, as defined in the statement of the theorem, are exact, and we obtain that a · b = p + q + r. Now we perform an exact addition of p and q as described above, yielding the leading term z = p ⊕ q and a remainder term z1 = (p z) ⊕ q. We thus have a · b = z + z1 + r. Close examination of the proof of the exact addition shows that r and z1 have the same exponent and both are in R52 , so their sum can be calculated exactly in R53 . This leaves us with the final equation a · b = z + (z1 ⊕ r) = z + ((p z) ⊕ q ⊕ r) = z + zz, which completes the proof. 3.3 High Precision Data Types In this section, we present the detailed algorithms used in our High Precision Interval algebra[45]. We then briefly describe a generalization of these algorithms to High Preci40 sion Taylor Models, and proceed to give rigorous methods to compute all intrinsic functions supported by COSY INFINITY with rigorous remainder bounds. 3.3.1 High Precision Intervals Definition 3.6. A high precision interval A is given by a finite sequence of double precision floating point numbers ai , and a double precision error term aerr . The value of the interval is then given by:  n n  A= ai − aerr , ai + aerr  i=1 i=1 For shorthand notation, we will also refer to the interval as A = n a ±a . err i=1 i We call ai the i-th limb of the interval, the number of limbs n is its length. We call the exact sum of all limbs the value of the high precision interval. The rigorous remainder aerr is another floating point number that specifies the uncertainty (or width) of the high precision interval. The correct number represented by A is guaranteed to be in the interval ai − aerr , ai + aerr . We require aerr to always be a positive number. 
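A minimal, non-verified sketch of such a data type in Python is shown below. The class and field names are purely illustrative and do not correspond to the actual COSY INFINITY implementation, and a rigorous version would have to use outward rounding when forming the bounds.

from dataclasses import dataclass, field
from typing import List

@dataclass
class HPInterval:
    """A high precision interval: value = sum(limbs), uncertainty +/- err."""
    limbs: List[float] = field(default_factory=list)  # the limbs a_1 .. a_n
    err: float = 0.0                                  # non-negative error term

    def approx_value(self):
        # Non-verified floating point approximation of the midpoint.
        return sum(self.limbs)

    def approx_bounds(self):
        # Non-verified approximation of [sum - err, sum + err]; a rigorous
        # implementation would round the bounds outwards here.
        m = self.approx_value()
        return m - self.err, m + self.err

# Represents 1 + 1e-20 with uncertainty 1e-40; note that a plain double
# precision approximation of the midpoint cannot resolve the second limb.
x = HPInterval(limbs=[1.0, 1e-20], err=1e-40)
print(x.approx_bounds())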
The sequence ai is also called a floating point expansion of the midpoint of A. A high precision interval A is called well conditioned if for all i we have |ai | > |ai+1 |, i.e. the absolute values of the sequence are strictly decreasing. A high precision interval A is called normalized or canonical if for all i we have ε · |ai | > |ai+1 |, i.e. the floating point numbers are non-overlapping where ε is the machine precision. Obviously normalized numbers are also well conditioned. 41 3.3.1.1 Limitations This representation implies certain limitations to the range of our high precision intervals. Those limitations are a direct consequence of the limitations of double precision floating point numbers. The range of representable numbers is obviously limited by the range of double precision numbers. The smallest possible non-zero number representable consists of one single limb representing the smallest possible double precision floating point number. That number is of order 2−1024 ≈ 10−309 . The largest representable number, in theory, is only limited by available memory, since there is always an unbounded sequence of double precision floating point numbers. In practice, however, we would like our numbers to be normalized. Therefore, the largest representable number is of order 21024 ≈ 10309 . Furthermore, our algorithms will encounter overand underflow errors during the calculations when limbs get close to the double precision limits. The relative precision of our high precision intervals is bound by the “quantization” in quanta of the size of the smallest possible representable double precision number. Thus, the last digit in a decimal representation of a number is of order 10−309 . A number of order one can, therefore, have a maximum of about 309 valid decimal digits. Note that the maximum number of valid digits depends on the size of the represented number itself. Thus, we have the following limitations for a high precision number X: 10−309 |X| 42 10309 (3.6) where the relative precision depends in the order of X. 3.3.1.2 The Accumulator The key operation in our algorithms is the accumulator. Every other operation will, directly or indirectly, call this function. The accumulator simply takes an arbitrary list of floating point numbers, and adds them up exactly using Dekker’s addition. The result will be a high precision interval of a specified maximum length that represents the sum of the input rigorously. If the length of the output interval is not sufficient to hold the exact result, the accumulator adds the leftover numbers to the error bound of the result. Consequently, the result is always a rigorous enclosure of the mathematically correct result. For performance reason, we also add terms to the error bound that are smaller than the error bound itself. This prevents computationally expensive calculations on numbers that are already below the remainder term. For this purpose, the initial error bound of the result is passed to the accumulator. This allows us to first calculate the error bound of the resulting interval of an operation, and then pass it to the accumulator. Let ai denote the input sequence with n elements, and S the resulting high precision interval of maximal length m. The accumulator then sums up the ai successively using n − 1 exact additions. This results in one double precision number representing an approximate result of the final sum. This number becomes the first limb s1 of the result. In the exact additions we also obtained n − 1 roundoff terms. 
To these, the same scheme is applied again to calculate the second limb of the result. This calculation is repeated until there are either no more round-off errors left, or the maximum number of limbs in the result has been reached. If there are terms left, the absolute values of those terms are added to the result's error term s_err.

Figure 3.1: Graphical representation of the accumulator algorithm for n = 6. There are two cases shown: in the first, the result holds six limbs. In the second case, shown in parentheses, the result holds only three limbs. In that case, the accumulation terminates after calculating three limbs. The absolute values of the remaining terms d_i are then added to the error bound.

For n = 6 and m = 3 this process is graphically shown in Figure 3.1.

To increase performance, we add a slight adjustment to this algorithm. After performing each exact addition, we check if the round-off error of the result actually is above s_err. Only if that is the case is it kept for the next round; otherwise the round-off error is immediately added to s_err. The same test is applied to the resulting limb in each step. If its absolute value is less than the error bound, it is also added to s_err and the limb is discarded. The only exception is the very first limb, which is always kept, in order to prevent large intervals from always being centered at 0.

It is important to note that the order of the input numbers matters. It influences the result and, more importantly, the performance of this algorithm. Best performance is obtained if the sequence a_i is ordered by increasing absolute value. If two small numbers just above the error bound are added, the absolute value of the round-off error for that operation will surely be below the error bound. As described above, it will therefore immediately be discarded from the further calculation. If, however, one such small number is added to an already significantly larger number, the round-off will be of the order of the small number added. Thus the small number will just propagate as a round-off error into the next round. So for best performance, it is important that, if possible, the numbers are at least roughly ordered by magnitude.

It is also interesting to note that the output of this algorithm is not necessarily normalized. In fact, if cancellation occurs in the addition, this can lead to poorly conditioned results. However, in most real world cases it is very likely that the numbers returned are well conditioned. This possibility of cancellation is the reason why none of our algorithms have any normalization requirements for the input numbers. However, we do optimize our algorithms to perform better with well conditioned or even normalized input.

3.3.1.3 Addition and Subtraction

Given the accumulator, addition of two high precision intervals A and B into C is quite trivial. All that has to be done is to concatenate the sequences of limbs making up the two numbers into a temporary array, and apply the accumulator to it. Before calling the accumulator, the two error terms of A and B are added up to form the new error term c_err = a_err + b_err. This new error term is then passed to the accumulator, along with the temporary array. As mentioned above, passing roughly sorted input to the accumulator improves performance.
So, instead of blindly concatenating the two sequences a_i and b_i, it is possible to merge them in order, assuming the high precision intervals A and B themselves are well conditioned. This is achieved by starting at the last limb of both sequences and successively copying the smaller of the two numbers into the temporary array. This yields, in linear time, a sequence of numbers ordered by increasing absolute value. If the input is not well conditioned, best performance of the accumulator would be obtained by a full sort of the input. In practice, however, the performance gain in the accumulator is too small to outweigh the cost of sorting the input.

For the subtraction A − B, the same algorithm as for addition is used, except that all signs of the b_i are flipped. Since the set of floating point numbers is symmetric, i.e. if a is a floating point number so is −a, this operation is exact.

3.3.1.4 Multiplication

The multiplication is the core of our algorithms. Together with addition, this operation will be the most important and the most frequently performed operation. It is, therefore, vital to make this operation as efficient as possible. Mathematically, it is clear what has to be done. To multiply two high precision numbers A and B, one has to perform the following operation:

C = ( Σ_n a_n ) · ( Σ_m b_m ) = Σ_{m,n} a_n b_m

It is obvious that, in general, there are n · m multiplications necessary to evaluate this expression. Each of these multiplications has to be done exactly, using a Dekker multiplication, and yields two resulting terms. Thus, in the end there are 2 · n · m terms to be summed up using the accumulator (3.3.1.2).

However, in practice it is possible to reduce the number of multiplications needed significantly. Once again this is due to the size of the error bound. The new error bound c_err is given by

c_err = ( Σ_n a_n ) · b_err + ( Σ_m b_m ) · a_err + a_err · b_err ≤ ( Σ_n |a_n| + a_err ) · b_err + ( Σ_m |b_m| + b_err ) · a_err

As before, this new error bound is calculated first. Then, before performing a full Dekker multiplication, a simple floating point multiplication x = a_i · b_j of the two limbs a_i and b_j is performed. This product is then compared to c_err. There are three possible cases:

|x| < c_err: In this case, the result of the multiplication is smaller than the error term. Its absolute value can thus be rounded outwards and added to the error bound. No further processing is necessary.

ε · |x| < c_err: In this case, the result of the multiplication does contribute to the final sum significantly. The round-off error of the floating point multiplication, however, is below the error bound and thus can be estimated from above by ε · |x|. This is then added to the error bound. Note that this is possible as long as we have round-to-nearest operations.

Else: In all other cases, a full Dekker multiplication has to be performed. The result of this multiplication consists of two floating point numbers, which have to be stored in a temporary array that is eventually handed off to the accumulator. We also check whether the smaller of the two resulting numbers is below the error term. If so, we add it to the error bound directly, thus reducing the number of elements that are passed to the accumulator.

Figure 3.2: Graphical representation of the multiplication algorithm for two numbers of length 4 and 5 respectively. Every cell in the grid represents one product. The coloring of the cells shows products of the same order of magnitude, assuming that the input numbers were well conditioned. The numbering shows the order in which these products are evaluated, so that we take advantage of that ordering.

This way we can reduce the number of Dekker multiplications that are actually performed significantly. This appears to result in an improvement in speed of about 5% to 10%. However, the full impact of these measures has to be determined by detailed performance analysis with real world examples at a later point.

Another way to reduce execution time is to "pre-split" the limbs of each number. In the worst case, every limb is multiplied by all limbs of the other number using a Dekker multiplication. If we split the numbers anew every time, we recalculate the same splitting over and over. Instead, the limbs are split once in the beginning and then stored in a local array. Whenever a multiplication takes place, it is performed using the pre-split limbs.

The last important issue with this algorithm is the order in which the limbs are multiplied. Clearly, the order of the multiplications is irrelevant, as long as it is ensured that every pair is multiplied exactly once. The naive way would be to simply go straight through the first sequence a_i and multiply each limb with each b_j. However, if we want the input to the accumulator to be roughly ordered, it is better to multiply numbers that yield products of the same order of magnitude, starting with the lowest order (see Figure 3.2). This is a little more involved, but not much more computationally expensive than the naive approach.

Note, lastly, that if the input numbers were normalized, it would be sufficient to just multiply the first limbs and immediately add the products of the higher limbs to the error term. Since we cannot assume that input is normalized, however, this method cannot be applied to our algorithm. But since we do compare each floating point product to the error term, our algorithm does not perform significantly slower.

3.3.1.5 Squaring

Analyzing the algorithm for multiplication given above, one finds that some steps in the process are redundant if both arguments are the same, i.e. if a number is squared. For one, the splitting only has to be performed for one number. Also, we can utilize the fact that a multiplication by 2 is always exact in floating point arithmetic (except for overflows, which we deal with separately). So, to calculate

( Σ_i a_i )² = Σ_i a_i² + Σ_{i≠j} a_i a_j = Σ_i a_i² + 2 Σ_i Σ_{j>i} a_i a_j

one simply goes through all indices i and, in an inner loop, lets j run from i + 1 to n. After the Dekker multiplication of the cross terms, one simply multiplies each part of the result by 2. The algorithm is shown in Figure 3.3. Note that, as in the regular multiplication, to keep the input to the accumulator at least roughly ordered, we evaluate products of the last limb first, and work our way up to the larger terms. While this only results in approximate ordering up to the 6th product, the speed gain by cutting the number of Dekker multiplications in half clearly outweighs the losses in the accumulator.

Figure 3.3: Graphical representation of the squaring algorithm for a number of length 4. Every cell in the grid represents one product. However, the off-diagonal products appear twice, so we can ignore half of them as shown in the illustration. The numbering shows the order in which these products are evaluated, so that we take advantage of well conditioned input.
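To illustrate the overall structure of the multiplication, the following Python sketch combines the Dekker splitting and exact multiplication of Theorems 3.4 and 3.5 with a strongly simplified stand-in for the accumulator. It omits the pre-splitting, the per-product short cuts against the error bound, and the outward rounding that the actual implementation requires; all names are illustrative, so this is a sketch of the idea rather than a verified routine.

def two_sum_knuth(a, b):
    # Branch-free exact addition (Knuth): returns (fl(a+b), exact error).
    z = a + b
    bv = z - a
    av = z - bv
    return z, (a - av) + (b - bv)

def split(a):
    # Veltkamp/Dekker splitting (cf. Theorem 3.4): 26-bit head and tail, a == ah + at.
    p = a * (2.0**27 + 1.0)
    ah = p + (a - p)
    return ah, a - ah

def two_prod(a, b):
    # Exact multiplication following Theorem 3.5: z + zz == a * b.
    ah, at = split(a)
    bh, bt = split(b)
    p = ah * bh
    q = at * bh + ah * bt
    r = at * bt
    z = p + q
    zz = ((p - z) + q) + r
    return z, zz

def hp_mul(a_limbs, a_err, b_limbs, b_err, max_limbs=4):
    # Sketch of high precision interval multiplication (no outward rounding).
    c_err = (sum(abs(x) for x in a_limbs) + a_err) * b_err \
          + (sum(abs(x) for x in b_limbs) + b_err) * a_err
    terms = []
    for x in a_limbs:
        for y in b_limbs:
            p, e = two_prod(x, y)
            terms.append(p)
            terms.append(e)
    # Stand-in for the accumulator: extract one limb per round and carry the
    # exact round-off errors of the additions into the next round.
    limbs = []
    while terms and len(limbs) < max_limbs:
        s = 0.0
        errors = []
        for t in sorted(terms, key=abs):
            s, e = two_sum_knuth(s, t)
            if e != 0.0:
                errors.append(e)
        limbs.append(s)
        terms = errors
    c_err += sum(abs(t) for t in terms)   # leftover terms go to the error bound
    return limbs, c_err

if __name__ == "__main__":
    print(hp_mul([1.0, 1e-20], 1e-40, [3.0, 1e-20], 1e-40))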
3.3.2 High Precision Taylor Models Using similar techniques as those described in detail for High Precision intervals above, it is possible to implement the basic operations of addition and multiplication for Taylor Models. Here, too, each coefficient is stored as an unevaluated sum of double precision floating point numbers, which are operated on by the same type of Dekker based algorithms as in the case of High Precision intervals. One such implementation was written by the author. It only operates on non-verified, high precision differential algebra vectors, storing each coefficient with a variable number of limbs as needed to achieve a pre-defined precision cut-off. To show the power of such a high precision implementation, we then used it to compute the transfer maps of various beam 50 optical elements to about 60 accurate digits (see [44]). Another fully verified implementation of these two operations for the High Precision Taylor Model datatype in the upcoming COSY INFINITY version 10 was done by Kyoko Makino[36]. In this implementation, each coefficient in the polynomial is itself stored as a polynomial with the limbs of the high precision representation of each coefficient as its coefficients. A detailed description of the algorithms used in this implementation is beyond the scope of this dissertation. Instead, we refer to the following publications describing the existing double precision Taylor Model data type[28, 4, 7, 31], as well as a forthcoming publication devoted entirely to the implementation of the new High Precision Taylor Models. 3.4 High Precision Intrinsic Functions Intrinsic functions form an essential part of every scientific computing package. Arithmetic operations only form the very basis upon which a complex and diverse set of intrinsic functions is built. Almost all practical applications in physics and engineering require some form of trigonometric or exponential functions. In this section, we will build upon the basic arithmetic operations addition, subtraction and multiplication described above to derive methods valid for both High Precision Intervals as well as the constant parts of High Precision Taylor Models. With this information, it will be possible to implement a complete High Precision Interval datatype based on Dekker operations. To fully implement a working High Precision Taylor Model datatype, the algorithms for the constant part provided here need to be integrated into the algorithms already fully developed by Kyoko Makino for regular double precision Taylor Models in her thesis[28]. 51 For simplicity, in the following we will collectively refer to both High Precision Taylor Models with only a constant part and Intervals as High Precision Numbers. If a distinction must be made between the two, this will be made clear in the text. 3.4.1 General Concepts Intrinsic functions of a High Precision Number X are implemented using either an iterative Newton method or by the Taylor expansion of the intrinsic function. Newton methods are used mainly for the square root, as here the Newton iterations converge very quickly, and underlying floating point operations can be utilized to easily obtain a good starting values. To evaluate the other intrinsics, such as exponential, logarithm, or trigonometric functions and their inverse, we rely on the Taylor expansion of the respective functions, or we reduce the intrinsic to an evaluation of another intrinsic based on arithmetic identities. 
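Before discussing these concepts individually, the following non-verified Python sketch shows the overall pattern in its simplest form, using the exponential series around 0 as a stand-in example: terms are summed until they drop below the target error, and the truncated tail is covered by a remainder bound (here twice the first omitted term, valid for |x| ≤ 0.5 as discussed in the section on the exponential function below). In plain double precision the sum itself is of course only accurate to about 10^−16; the actual routines perform these steps on High Precision numbers.

# Non-verified sketch of the general evaluation pattern: sum Taylor terms of
# exp(x) around 0, stop once the current term falls below the target error,
# and bound the omitted tail by twice the first omitted term (|x| <= 0.5).
def exp_with_remainder(x, target_err=1e-30):
    assert abs(x) <= 0.5
    term = 1.0          # current Taylor term x**n / n!
    total = 1.0
    n = 0
    while abs(term) > target_err:
        n += 1
        term = term * x / n
        total += term
    remainder = 2.0 * abs(term) * abs(x) / (n + 1)   # bound on the omitted tail
    return total, remainder

print(exp_with_remainder(0.25))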
3.4.1.1 Taylor Expansion with Remainder With the Taylor expansion, we successively calculate each order, and add it to the result. By keeping the previous order term after adding it to the result, the cost of calculating the next order remains constant, so the computational cost of evaluating the series is linear in the order of the expansion. To decide when to terminate the expansion, we generally use the current order term. Since in any useful Taylor formula the terms are monotonously decreasing, once the next order term becomes smaller than the error bound of the result, the expansion can be terminated. All further terms at that point would only contribute to the error term of the result. This method can and is adapted depending on the requirements of the numerical envi52 ronment. It is, for example, possible that due to certain external factors an intrinsic function does not need to be evaluated to full precision. This happens quite frequently when an argument is split (see below) and the boundary values are evaluated separately. In such cases, instead of using the remainder bound as a measure, a pre-set value can be used instead to terminate the series expansion in order to save computation time. To rigorously estimate the error bound of the result, we use the Taylor remainder formula 3.5 where applicable. If that is not possible, we have to revert to other methods geared specifically towards the given expansion. 3.4.1.2 Argument Reduction and Splitting To speed up convergence in the evaluation of the Taylor expansion, it is important to reduce the arguments so that they are close enough to the expansion point of the series to yield fast convergence. This is usually achieved by exploiting symmetries in the functions involved, as well as other analytical identities specific to each intrinsic function. Another very important concept is argument splitting. In this context, argument splitting refers to the separate computation of the upper and lower bounds of wide arguments, which then are combined back into one single result using certain properties of the intrinsic in question, e.g. monotonicity. This approach helps with the fact that evaluating a Taylor expansion with wide intervals is subject to the usual interval overestimation due to repeated addition and subtraction in the Taylor expansion. When determining wether to split an interval, various factors can be taken into account. For our purposes, we examine the size of the relative error and compare it to the relative computation precision multiplied by some value ξ. This ξ is a heuristically determined value for each intrinsic, which indicates how much wider than the computation precision an interval 53 is allowed to be before it is split. Typical values for ξ lie in the range of 104 to 106 . For example, consider an intrinsic for which ξ = 104 being called with an argument 12 ± 10−30 with two limbs precision (i.e. relative computation precision ε of about 10−32 ). −30 The relative error of this argument is 1012 ≈ 10−31 , which is less than εξ = 10−28 . Thus the argument would not be split. The argument 0.012 ± 10−29 , however, has a relative error of about 10−27 and thus would be split. To speed up the computation in the case of splitting, it is furthermore possible to set a cutoff value for the intrinsic evaluation. Since error terms are only stored as a single double precision number, the maximum accuracy of the width of an interval is the precision of a double precision number, which is about 10−16 . 
Thus it is not useful to compute the intrinsic on the left and right boundary to a higher precision than 10^−16 times the expected width of the result. For example, consider the exponential of the interval [0, 1]. The width of the resulting interval is roughly 1.7, so that the precision of the error variable is of order 10^−16. Computing the values of exp(0) and exp(1) to more than 16 significant digits is a waste of effort, as all this extra precision is lost in the computation of the width of the resulting interval.

The width of the final result can usually be estimated using a linear approximation of the function at the center point. To that end, the derivative is computed in non-verified double precision arithmetic, and then multiplied by the width of the interval. Even though this operation is not verified, the actual High Precision computation remains verified, but may not be optimally fast or sharp.

Depending on the requirements on the accuracy of intrinsics over large domains, it is even possible to set this cutoff higher than 10^−16, trading precision of the result against computation time. In no case, however, is it useful to compute the result to a precision better than 10^−16 times the expected width.

3.4.2 Division

Division is somewhat different from the other algorithms presented so far. Instead of directly calculating the result using the low level floating point operations of the processor, or using Newton methods or Taylor expansions, we use a constructive approach that produces more and more valid limbs of the result in each step. The algorithm is basically the same as the simple "black board" division (in the US also known as long division) that is taught in elementary school. It works by utilizing the division of floating point numbers to estimate the next limb. Then the rigorous multiplication and subtraction are used to calculate how far off this estimate is from the correct result.

The idea is to first approximate the high precision intervals by floating point numbers, and then divide those using floating point arithmetic. That number will be the first limb of the result. Then we use high precision operations to calculate the difference between our floating point result times the denominator, and the numerator, i.e. the error left over. Now we continue this process with this difference to get the next limb, and so on. Once we have reached a high enough accuracy, or the maximal number of limbs, we can terminate the process and immediately have a rigorous error bound at hand.

In mathematical terms, we want to evaluate C = A/B. Then we can reformulate the problem like this:

C = A/B = c + D    (3.7)

where c = ( Σ_i a_i ) / ( Σ_i b_i ) is evaluated in floating point arithmetic, and D is a high precision interval or Taylor Model that rigorously represents the error. This error is, barring the possibility of cancellation during the summation of the high precision intervals, of order ε · c.

Figure 3.4: Graphical representation of the division algorithm for two numbers A and B. Capital letters represent high precision intervals, while the lower case letters are floating point numbers that represent the floating point sum of the limbs of the corresponding high precision interval. All arithmetic operations are Dekker operations, except for the floating point division, which is denoted by a colon (:).

Solving 3.7 for D we obtain:

A/B = c + D  ⇒  A = B·c + D·B  ⇒  (A − B·c)/B = D    (3.8)

Let D = A − B·c.
As noted before, D will usually be an order of ε smaller than A. To find the next limbs of the result, we have to determine D . This, again, involves a division of the two high precision intervals D and B. The first limb of this can again be determined as described above, so one simply start over with the same algorithm, replacing A with D. This way one recursively obtains more and more limbs of the final result, while always keeping a rigorous error bound, namely D. This algorithm is graphically shown in 3.4. The rigorous error D allows, of course, immediately the calculation of an error bound for the final result. When the algorithm has determined enough limbs of the result, one can 56 obtain a floating point interval enclosure of the error D. The same can be done for the divisor B. Then, using straightforward floating point interval division, it is possible to obtain an upper bound on the error term for the result, since floating point division is guaranteed to be accurate up to /2. Of course, this algorithm needs to test the denominator for a division by zero. In our implementation, this is done by calculating a floating point interval approximation of the divisor B, which is then tested to not contain zero. If the divisor does contain zero, an error flag is set for the result, marking it invalid. For further remarks about checking for the inclusion of 0 in High Precision datatypes, see Section 3.5. 3.4.2.1 Argument Reduction and Splitting The algorithm presented above does not require any splitting or argument reduction. This is due to the fact that neither addition nor multiplication require splitting, and that in each step, the value subtracted from the remaining error D is optimally sharp, resulting from a multiplication of High Precision Number B and the error-free floating point number c. 3.4.3 Square Root To compute the square root X of a high precision number A, we use the interval Newton method[38] to obtain an interval enclosure of the solution of X 2 − A = 0. 57 This yields the iteration step Xn+1 = M (Xn ) − M (Xn )2 − A 2Xn (3.9) where M (Xn ) stands for the midpoint of the interval Xn . Provided that the initial interval enclosure X0 contains the square root, each Xn contains the correct result. Furthermore, the sequence of intervals converges quickly, even without intersecting each Xn and Xn+1 . The criterion for terminating this iteration is the width, i.e. the error bound, of Xn . Once the intervals do not contract any more, we have reached the best possible enclosure of the result. The choice of the initial starting value X0 is done in double precision interval arithmetic. First, an interval enclosure I of the high precision interval A is calculated (see 3.5.2). If this enclosure contains values less than or equal to 0, the resulting high precision interval is set to 0 and the error flag for that number is set. Otherwise, the interval square root of I is converted into a high precision number, which serves as the initial interval X0 . Provided that the double precision interval operations are rigorous, the initial interval X0 does contain the exact square root, and thus fulfills the requirements for the interval Newton algorithm. 
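The following non-verified Python sketch illustrates the iteration 3.9 on plain double precision intervals, represented as (lower, upper) tuples. Outward rounding is omitted, and the 1% widening of the double precision starting value is an arbitrary, illustrative choice of initial enclosure, so this shows the structure of the method rather than the actual High Precision routine.

import math

def isub(x, y):
    # Interval subtraction, no outward rounding.
    return (x[0] - y[1], x[1] - y[0])

def idiv(x, y):
    assert y[0] > 0.0                      # divisor interval must not contain zero
    q = (x[0] / y[0], x[0] / y[1], x[1] / y[0], x[1] / y[1])
    return (min(q), max(q))

def width(x):
    return x[1] - x[0]

def interval_sqrt(a, max_steps=20):
    # a = (lo, hi) with lo > 0; start from a slightly widened double precision
    # square root as the initial enclosure X0.
    x = (0.99 * math.sqrt(a[0]), 1.01 * math.sqrt(a[1]))
    for _ in range(max_steps):
        m = 0.5 * (x[0] + x[1])                       # midpoint of X_n
        step = idiv(isub((m * m, m * m), a), (2.0 * x[0], 2.0 * x[1]))
        x_new = isub((m, m), step)
        if width(x_new) >= width(x):                  # intervals no longer contract
            break
        x = x_new
    return x

print(interval_sqrt((2.0, 2.0)))    # a tight interval around sqrt(2) = 1.41421...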
Note that the double precision floating point standard prescribes the same round to nearest accuracy requirements for the square root operation on double precision numbers as for the basic addition and multiplication, On an implementational note, we remark that in our representation of High Precision Numbers it is particularly easy to obtain an the midpoint of an argument exactly. All that has to be done is to set the error bound to 0 temporarily. This way all further calculations are still performed in rigorous interval arithmetic, thus assuring that the resulting Xn+1 58 really contains the correct interval. 3.4.3.1 Argument Reduction and Splitting Since the the square root is not differentiable at 0, the convergence of the Newton method very close to zero decreases. This will only really show once arguments get really close to 0, pushing the limits of our implementation. We therefore currently do not implement any argument reduction. It is, however, possible to do based on the identity √ X= 2−2n 22n X = 2−2n 22n X = 2−n 22n X If n is chosen appropriately, the argument in the square root can be made large enough to be of order 1 and thus converging quickly using Newton’s method. The multiplications and divisions by two are all easy to perform since they are exact in floating point arithmetic, thus producing no errors. Argument splitting is also not necessary with interval Newton methods, as the method is contracting the entire interval. Overestimation is minimized by the Newton method’s reliance on the mid point of intervals in most places. 3.4.4 Inverse Square Root While it would be possible to calculate the inverse square root using another interval Newton method, we simply calculate the square root and then invert the result. This simplifies the code development and maintenance at a minimal cost compared to a full implementation of this operation based on another Newton method. Note that all error handling, splitting and argument reduction is automatically inherited 59 from the division and square root routines. Since neither one performs any of those operations, there is no performance penalty due to repeated argument reductions of splitting. 3.4.5 Exponential Function To calculate the exponential of a High Precision number X, we simply use the probably best known Taylor expansion of all, the expansion of the exponential function around 0[12]: ∞ exp X = n=0 Xn X2 X3 =1+X + + + ... n! 2 6 (3.10) To estimate the remainder for that series, we use that d n exp(y) = exp(y) dy and sup(exp((−∞, 0.5]) = exp(0.5) = √ e 2 (3.11) Thus, the Taylor remainder term of the exponential series computed up to order n for an argument y 0.5 can be bounded above by twice the next term in the expansion: |Rn+1 | X (n+1) 2· (n + 1)! (3.12) This yields a rigorous enclosure of the value exp(y). The range of the bound in equation 3.11 is chosen such that it includes the argument after the following argument reduction is performed. 60 3.4.5.1 Argument Reduction Since the exponential function for large arguments obviously grows very fast, it is necessary to reduce the argument to a small value to ensure convergence in reasonable number of terms and thus reasonable computation time. To achieve that, we make use of the identity k exp(x · 2k ) = exp(x)(2 ) (3.13) First notice that exp(700) > 10304 , which is near the largest representable double precision floating point number. Similarly, of course, exp(−700) < 10−304 which again is near the smallest representable double precision floating point number. 
We thus somewhat arbitrarily limit the domain on which the exponential function is computed to D = [−650, +650]. If a value is larger than 650, it will produce an error in the form of a defective result, while a value smaller than 650 will always return a predetermined interval [0, d] where d > 10−280 . This value is determined by considering that exp(−650) = exp(−65)10 < (10−28 )10 = 10−280 , where exp(−65) is bounded using our very own algorithm described here, and checked using a different verified interval package (such as MPFI[40]). Now, given some value x ∈ D, we define y = x·2−11 . This ensures that |y| 1 . By 3.13 we then have 2 11 exp(x) = exp(y · 211 ) = exp(y)(2 ) . 61 650/2048 < Because of the choice of the factor 2−11 , y is of sufficiently small size such that the Taylor expansion for exp(y) converges relatively quickly. As a criterion for terminating the expansion, we use the size of the n-th computed coefficient of the expansion. If that coefficient is less than the error bound of the result, all further coefficients will merely increase the error bound, and thus the expansion can be truncated at that point. Other conditions for terminating the expansion earlier can be considered, based on the external circumstances. In order to obtain the final result for exp(x), the result of the series evaluation has to be raised to the power of 211 = 2048. Naively, this can be done very fast since for any x n x(2 ) = x 2n−1 2  = x 2n−2 2 2  = ··· = 2 2 2 2... ... x 11 To evaluate exp(y)(2 ) one simply squares exp(y) repeatedly 11 times. Since squaring is the second cheapest operation in our arithmetic, these operations incur no performance penalty. However, such a naive evaluation produces large relative overestimation. This is because for a typical exponential argument of order 1, the exponential of the reduced argument is of order 1 exp(2−11 ) = exp( ) ≈ 1.00048 . . . . 2048 When such a value is squared, the result is still of order 1, but the error has doubled in size because (1.00048 ± err)2 = 1.000482 ± 2 · 1.00048 · err + err2 ≈ 1.000482 ± 2 · err 62 for small err. Thus the relative error has doubled in size. Squaring 11 times, the relative error in each step grows by a factor of 2, leading to a relative error of the final result of 211 = 2048 times the size of the initial relative error. In practice the Taylor expansion evaluation of the reduced argument yields an error roughly the size of the selected precision. For a two limb computation of exp(1), for example, that error would be on the order of 10−30 . After the squaring, however, the resulting error bound will be on the order of 2048 · 10−30 = 2 · 10−27 . In order to avoid this effect, we compute z1 = exp(y) − 1 instead of exp(y) by simply omitting the constant term in the Taylor expansion of the exponential function. We then observe that 2 exp(y)2 = (1 + z1 )2 = 1 + 2 · z1 + z1 . (3.14) 2 Letting z2 = 2 · z1 + z1 , note that the relative error of z2 remains about the same as that of z1 . This is because the multiplication of z1 by a factor of 2 of course doubles both the error term but also the value of the center point. Thus the ratio remains the same. For small 2 values of z1 the term z1 is so small that it does not contribute significantly to the value or the error of z2 . 
Repeating the squaring operation as described in equation 3.14 eleven times, keeping the 1 separate from the result in each step and only actually adding it in the very end, finally yields the desired result for exp(x) with a relative error of the same order of magnitude as the initial evaluation of the Taylor expansion. 63 3.4.5.2 Argument Splitting In order to suppress the effects of the interval dependency problem in the evaluation of the Taylor expansion for the exponential function, we split large intervals into two separate High Precision numbers Xu and Xl for the upper and lower bound respectively. Then the exponential function is evaluated as described above separately for both Xu and Xl , yielding Yu = exp(Xu ) as well as Yl = exp(Xl ). Due to the monotonic nature of the exponential function, the final interval is then computed by merging Yu and Yl into one interval covering the range [Yl , Yu ]. 3.4.6 Natural Logarithm To evaluate the natural logarithm of a high precision number X, we utilize the following expansion which can be derived by integrating the Taylor series of the derivatives of log(x+1) and log(x − 1) and subtracting them from each other[12]: ∞ ln X = n=0 2 2n + 1 X − 1 2n+1 X −1 2 =2 + X +1 X +1 3 X −1 3 2 + X +1 5 X −1 5 + . . . (3.15) X +1 Since X has to be positive for the logarithm to be defined, the argument for the series is always less than or equal to 1, and therefore the sequence is monotonously decreasing. That allows us to use the current order as a termination criterion for the evaluation of the series. To estimate the remainder of the natural logarithm, we use the following expression for 64 the remainder bound to the first n terms in the series given in 3.15: |Rn+1 | (X − 1)2 2X X − 1 2n X +1 (3.16) For purely implementational purposes, we slightly rearrange this remainder term so that we can use as many of our already computed quantities as possible: |Rn+1 | (X − 1)2 2|X| X − 1 2n 1 = ·2 X +1 4|X| X − 1 2n+1 2 (X − 1) X +1 (3.17) 2n+1 At the end of the series evaluation, we already calculated 2 X−1 . All that remains X+1 to do is to divide by 4 and X (which is positive) and multiply the result by (X 2 − 1). 3.4.6.1 Argument reduction Obviously, this series converges the faster the closer X is to 1. In particular, it converges very slowly for X close to 0, or large values of X. We hence reduce the argument to be close to 1 making use of the identity: ln(x) = m ln(2) + ln(2−m x) (3.18) Choosing m to be the nearest integer to the 2 based logarithm, i.e. m = ln(x)/ ln(2) + 0.5 , the new argument satisfies the inequality |2−m x − 1| 1 . For those arguments, the Taylor 3 expansion converges relatively quickly. Note that even if m is not chosen exactly as specified 65 above, all operations remain verified, they are just not optimally fast any more. Therefore it is sufficient to compute m using rather crude floating point approximations without any verification. For this, it is necessary to once compute an accurate enclosure of the value of ln(2). This is done by simply evaluating the series given above while skipping the argument reduction step. This value is then stored internally for later use so it does not need to be recomputed every time the logarithm is called. 3.4.6.2 Argument Splitting As with the exponential function before, we also split wide arguments to the logarithm in order to suppress the effects of the interval dependency problem in the evaluation of the Taylor expansion. 
We split wide arguments into two separate High Precision numbers Xu and Xl for the upper and lower bound respectively. Then the logarithm is evaluated as described above separately for both Xu and Xl , yielding Yu = log(Xu ) as well as Yl = log(Xl ). Again, due to the monotonic nature of the logarithm, the final interval is then computed by merging Yu and Yl into one interval covering the range [Yl , Yu ]. 66 3.4.7 Sine and Cosine We use the well known expansions of the sine and cosine functions[12]: ∞ sin X = n=0 ∞ cos X = (−1)n X3 X5 X 2n+1 = X − + − ... (2n + 1)! 3! 5! (3.19) (−1)n 2n X2 X4 X =1− + − ... (2n)! 2! 4! (3.20) n=0 X n+1 Rn = (n + 1)! In the remainder term we used that (3.21) d n+1 sin(x) dx 1 as well as d n+1 cos(x) dx 1. Computationally, it is advantageous to compute both the sine and the cosine simultaneously. The overhead needed is rather small, and in the implementation of the Taylor Model sine and cosine, it is necessary to compute both values of the same argument anyway. Another advantage is that the argument reduction, which is rather complicated for these functions, only has to be performed once. 3.4.7.1 Argument reduction The argument to the sine and cosine is reduced in several steps. First, we exploit the periodicity of the sine and cosine functions, sin(x) = sin(x + 2π) and cos(x) = cos(x + 2π). In a first step, this allows us to reduce the argument into the interval [−π, π] using the equation k= x 1 + 2π 2 This reduction is performed using a floating point approximation x of the value of the argument X as well as a floating point approximation of π to estimate the number k of integer 67 multiples of 2π to subtract from the argument. While this approach does not guarantee that the argument rigorously is reduced to the interval [−π, π], the result is still verified since a wrong estimate here will only increase the computation time, but not affect the correctness of the result. The value of 2π itself, which is subtracted from the argument, is of course computed as a rigorous enclosure. This computation only happens once using the equality 2π = 8·arctan(1). Similarly, we precompute values for π and π/2. All of these values are stored after their first use for later use. In the second argument reduction step, we make use of the symmetry that sin(x) = − sin(x + π) and cos(x) = − cos(x + π). If after the first reduction step the result is negative, i.e. in the interval [−π, 0], a rigorous enclosure of π is added to the argument to bring it into the interval [0, π]. If this step is performed, the final results are negated once they are computed. The third step reduces the argument to the interval [0, π/2] by using the identities sin(x) = sin(π − x) and cos(x) = − cos(π − x) respectively. If the argument is larger than π/2, i.e. in the interval [π/2, π], it is subtracted from π, and the sign of the cosine result is flipped. In the very last step, the reduced argument is divided by 3 and result is called t. Then the actual Taylor expansions of the sine and cosine are evaluated. To undo the last division by 3, the following identities are used: sin(3t) = 3 sin(t) − 4 sin3 (t) cos(3t) = 4 cos3 (t) − 3 cos(t) 68 (3.22) Note that using similar equations for higher multiples it would be possible to reduce the argument even further. However, these equations become increasingly complicated and the additional operations introduce further computational errors. In our experience, using the triple angle formula given above provides good accuracy and reasonably fast evaluations. 
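A non-verified Python sketch of this reduction chain is given below; math.sin and math.cos of the reduced argument stand in for the actual Taylor series evaluations, and all reductions use a floating point approximation of π rather than the rigorous enclosures employed by the real implementation.

import math

def sin_cos_reduced(x):
    # Step 1: reduce into [-pi, pi] by subtracting k * 2*pi.
    k = math.floor(x / (2.0 * math.pi) + 0.5)
    x = x - k * 2.0 * math.pi
    # Step 2: sin(x) = -sin(x + pi), cos(x) = -cos(x + pi) for x in [-pi, 0].
    sign = 1.0
    if x < 0.0:
        x += math.pi
        sign = -1.0
    # Step 3: fold [pi/2, pi] onto [0, pi/2] via sin(x) = sin(pi - x),
    # cos(x) = -cos(pi - x).
    cos_sign = 1.0
    if x > 0.5 * math.pi:
        x = math.pi - x
        cos_sign = -1.0
    # Step 4: divide by 3, evaluate the series (stand-in), and undo the
    # division with the triple angle formulas (3.22).
    t = x / 3.0
    s, c = math.sin(t), math.cos(t)       # stand-in for the Taylor expansions
    sin_x = 3.0 * s - 4.0 * s**3
    cos_x = 4.0 * c**3 - 3.0 * c
    return sign * sin_x, sign * cos_sign * cos_x

print(sin_cos_reduced(10.0), (math.sin(10.0), math.cos(10.0)))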
To obtain the sharpest possible results for the cosine, we replace the multiple angle equation given above by a slightly modified equation. The problem here is the same as with the squaring operations in the exponential function (see Section 3.4.5.1): while the sine of t is a number relatively close to 0, the cosine of t is a number of the form 1 + ε, where ε is small compared to 1. Computing the cube of such a number increases the relative error of the result significantly.

Instead of computing the expression cos(t), we proceed as before with the exponential function and compute y = cos(t) - 1 instead, by simply omitting the first term of the Taylor expansion. We then rewrite equation 3.22 in terms of y:

\cos(3t) = 4(y+1)^3 - 3(y+1) = 4y^3 + 12y^2 + 9y + 1.

Evaluating this expression instead, the relative error of the result does not increase significantly.

3.4.7.2 Argument splitting

Proper argument splitting for the sine and cosine is more complicated than for the other intrinsic routines. The periodicity of these functions requires more elaborate tests when assembling the final interval from the results at the left and right boundary.

First, a plausibility test is performed to check whether the width of the interval is greater than a floating point approximation of π. If that is the case, the interval [-1, 1] is returned immediately without further computation. Since [-1, 1] is always a rigorous enclosure of the sine or cosine, it does not matter that the comparison is done in non-verified floating point arithmetic.

If the argument passes this first test, the sine and cosine splitting for wide arguments begins by computing two separate High Precision numbers Xu and Xl for the upper and lower bound respectively. These can easily be computed by rigorously adding and subtracting the remainder bound from the High Precision number, resulting in rigorous enclosures of Xu and Xl. Then the sine and cosine are evaluated as described above separately for both Xu and Xl, yielding Yu = sin(Xu) and Zu = cos(Xu) as well as Yl = sin(Xl) and Zl = cos(Xl).

The sine function attains its maximum at π/2 + 2kπ and its minimum at 3π/2 + 2kπ, where in both cases k ∈ Z. In between these values, the sine function is monotonic, since its derivative is non-zero there. To determine the final result from the results at the two endpoints Yu and Yl, it is therefore necessary to test whether the argument contains a value of the form π/2 + 2kπ (or 3π/2 + 2kπ). If so, the upper bound Yu (or lower bound Yl) needs to be replaced by the value 1 (or -1). Then the resulting High Precision number can be computed by merging Yu and Yl into one interval covering the range [Yl, Yu]. Similarly, for the cosine the maximum is attained at 2kπ, while the minimum is attained at π + 2kπ. By the same method as for the sine, applied to these values and the bounds Zu and Zl, the final result of the cosine is readily computed.

It remains to actually implement the test whether a High Precision number contains a value of one of the above forms. We will describe here how to perform the test for the maxima of the sine function at π/2 + 2kπ. The test for the other three cases is completely analogous. First, the un-split argument X is linearly transformed as follows:

J = \frac{X}{2\pi} - \frac{1}{4}   (3.23)

where a rigorous High Precision enclosure of 2π is used in the denominator.
By inserting the expression for the location of the maxima into equation 3.23, it is clear that whenever J contains an integer, then X contains a location of a maximum of the sine function. Note that the choice of the value -1/4 is of course not unique; any value of the form -1/4 + k for k ∈ Z will work.

Next, a floating point approximation of J is computed and truncated to the nearest integer, whose representation is exact in double precision floating point arithmetic. If non-zero, this value is then subtracted from J. Clearly, if J contains an integer, so does J - k for any k ∈ Z. If the approximation of J was sufficiently accurate, this step reduces J to lie in the interval [-1/2, 1/2], because in the beginning we already eliminated intervals wider than 2π. Note that even if the approximation was not accurate, the result is still mathematically rigorous, but J may lie outside this interval.

In the last step, J is bounded using double precision intervals. The resulting upper and lower bounds are then tested for inclusion of zero using simple floating point comparisons. In case the approximation in the previous step was not accurate, we also test for inclusion of any other integer. If any of these tests is true, then J may contain an integer. If they are all false, J certainly does not contain an integer. As described in Section 3.5.2, the test for zero is very accurate despite the use of double precision intervals in the last step. This is the reason why we perform the previous reduction of J to the interval [-1/2, 1/2].

The remaining maxima and minima tests are performed analogously, adjusting the value -1/4 in equation 3.23 accordingly, e.g. to 1/4 for the minima of the sine function, 0 for the maxima of the cosine function, and -1/2 for the minima of the cosine function. Since all these tests require the computation of X/(2π), this value is only computed once and then reused in each test.

3.4.8 Tangent

The Taylor expansion of the tangent is not as simple as that of the sine and cosine. Furthermore, the argument checking and splitting required for the periodic trigonometric functions is the most complicated among our intrinsic functions. Thus, instead of reproducing that effort and implementation for a tangent Taylor expansion, we rely on the definition of the tangent:

\tan(X) = \frac{\sin(X)}{\cos(X)}   (3.24)

This requires one evaluation each of the sine and the cosine, but the performance impact is greatly reduced by the combined evaluation of the sine and the cosine as described above. Through this, all argument reductions are only performed once, by those routines. Similarly, the argument splitting is also left to the sine/cosine routine. There is no argument splitting needed for the final division, as the division is fully functional for large arguments as well.

3.4.9 Arc Sine and Arc Cosine

Both the arcsine and arccosine functions rely on the arctangent, as given by the identities [12]:

\arcsin X = 2 \arctan \left( \frac{X}{1 + \sqrt{1 - X^2}} \right), \qquad \arccos X = \frac{\pi}{2} - \arcsin X   (3.25)

In the computation of the argument of the arctangent, neither argument splitting nor argument reduction is necessary, as all involved operations (the square root, division, addition and multiplication) are fully functional on their own without argument splitting. Furthermore, no testing of the argument to lie within the domain D = [-1, 1] of the arcsine or arccosine is necessary. This test is automatically performed by the square root routine when computing the value of 1 - X^2 in the argument for the arctangent.
If this square root produces an arithmetic error, the argument is outside of the allowed domain, and an error is returned immediately without further computations.

Note that for the arccosine, a pre-calculated, rigorous interval enclosure is used for the value of π. This constant is computed once using the identity π = 4·arctan(1) and then stored for future use. All verification as well as all argument splitting and reduction is left to the arctangent function.

3.4.10 Arc Tangent

For the arctangent we use the following expansion [12]:

\arctan X = \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} X^{2n+1} = X - \frac{1}{3} X^3 + \frac{1}{5} X^5 - \dots   (3.26)

This series converges for |X| ≤ 1. To obtain the error bound, it is unfortunately not very useful to use the Taylor remainder term. Careful examination of the expression sup |d^n/dx^n \arctan x| shows that it grows quickly, so that the Lagrange remainder term only converges as 1/n. This overestimates the error by several orders of magnitude. Instead, we use the fact that within the radius of convergence this series has monotonically decreasing terms with alternating signs. Thus the Leibniz criterion (Section 3.1.4) applies, and an estimate of the remainder of the partial series after n terms is given by

R_n = \frac{X^{2n+3}}{2n+3}.   (3.27)

3.4.10.1 Argument Reduction

To reduce the argument to a range where the above expansion is valid, we make use of the following identity [12]:

\arctan(X) = 2 \arctan \left( \frac{X}{1 + \sqrt{1 + X^2}} \right)   (3.28)

Applying this equation once reduces all arguments from (-∞, +∞) to (-1, 1). This is clear since we always have

\left| \frac{X}{1 + \sqrt{1 + X^2}} \right| < 1 \quad \text{and} \quad \lim_{X \to \pm\infty} \frac{X}{1 + \sqrt{1 + X^2}} = \pm 1.

Convergence of the series on (-1, 1) is still rather slow; thus applying the same reduction repeatedly n times ensures that the result lies in a range where the series converges quickly. After evaluating the series for an argument that has been reduced n times, only a multiplication by 2^n is necessary. This is, even in floating point arithmetic, an exact operation and can easily be done.

It is, however, not desirable to perform too many reduction steps, as each reduction causes overestimation and thus increases the remainder bound of the result. We found that n = 4 yields good results for the convergence speed, while still maintaining a sharp error bound. After the second application, the argument lies within (-1/(1+\sqrt{2}), 1/(1+\sqrt{2})) ≈ (-0.41, 0.41), the third application takes it to approximately (-0.2, 0.2), and the fourth application reduces the argument to the final range of about (-0.1, 0.1). To undo the argument reduction after the evaluation of the Taylor expansion, a simple multiplication of the result by 16 is required.

3.4.10.2 Argument Splitting

Wide arguments to the arctangent are split in order to suppress the effects of the interval dependency problem in the evaluation of the Taylor expansion. We split wide arguments into two separate High Precision numbers Xu and Xl for the upper and lower bound respectively. Then the arctangent is evaluated as described above separately for both Xu and Xl, yielding Yu = arctan(Xu) as well as Yl = arctan(Xl). Due to the monotonic nature of the arctangent, the final interval is then computed by merging Yu and Yl into one interval covering the range [Yl, Yu].
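The following non-verified double precision sketch illustrates the combination of the four-fold reduction 3.28, the series 3.26, and the Leibniz remainder estimate 3.27 as a stopping criterion. The routine name MYATAN and the fixed tolerance are illustrative assumptions; the verified routine carries the remainder bound along as a rigorous error term instead.

C     Illustrative, non-verified double precision sketch of the
C     arctangent: four applications of (3.28), the series (3.26), the
C     Leibniz remainder (3.27) as a stopping criterion, and the final
C     multiplication by 2**4 = 16.
      DOUBLE PRECISION FUNCTION MYATAN(X)
      DOUBLE PRECISION X, T, S, TERM, TOL
      INTEGER I, N
      PARAMETER (TOL = 1.0D-18)
      T = X
C     Argument reduction, applied four times.
      DO 10 I = 1, 4
         T = T / (1.0D0 + SQRT(1.0D0 + T*T))
   10 CONTINUE
C     Series evaluation; by the Leibniz criterion the remainder after
C     n terms is bounded by |T|**(2n+3)/(2n+3).
      S    = 0.0D0
      TERM = T
      DO 20 N = 0, 200
         S = S + TERM
         IF (ABS(T)**(2*N+3)/DBLE(2*N+3) .LT. TOL) GOTO 30
         TERM = -TERM * T * T * DBLE(2*N+1) / DBLE(2*N+3)
   20 CONTINUE
   30 MYATAN = 16.0D0 * S
      END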
3.4.11 Hyperbolic Sine and Cosine

Transforming the well known form of the hyperbolic sine and cosine into an equal, yet computationally more favorable form, we obtain the relations

\sinh X = \frac{\exp(X) - \exp(-X)}{2} = \frac{\exp(X)^2 - 1}{2\exp(X)}, \qquad \cosh X = \frac{\exp(X) + \exp(-X)}{2} = \frac{\exp(X)^2 + 1}{2\exp(X)}.   (3.29)

These equations can be evaluated in the verified High Precision arithmetic to directly obtain the result. As can be seen, we completely rely on the expression of the hyperbolic functions in terms of the real exponential function [12]. The overhead due to this is minimal, since each hyperbolic function only requires one evaluation of the exponential function. On the other hand, we automatically gain all verification, argument checking, argument reduction and argument splitting that is performed by the exponential routine. The computation of the final result of the hyperbolic functions does not require argument splitting, as it only involves operations that perform equally well for wide as for narrow arguments (addition, multiplication, division).

3.4.12 Hyperbolic Tangent

The hyperbolic tangent is slightly more involved than the other two hyperbolic functions. This is because of its asymptotic behavior. First, we again express the hyperbolic tangent in terms of the exponential function:

\tanh X = \frac{\exp(X) - \exp(-X)}{\exp(X) + \exp(-X)} = \frac{\exp(X)^2 - 1}{\exp(X)^2 + 1}   (3.30)

Note that there is no benefit in rewriting exp(X)^2 as exp(2X), as the argument reduction in the exponential function would just remove this power of two again. Instead, it is faster to square the result directly. As in the preceding cases, this equation can be evaluated directly in High Precision interval arithmetic. However, such a naive evaluation produces rather large remainder bounds for positive arguments. This is because for even moderately large values of X, the expression exp(X)^2 becomes quite large quickly. Thus, even if the relative error remains optimally sharp, in absolute value the error of the two terms exp(X)^2 - 1 and exp(X)^2 + 1 is rather large. The result of their division, of course, is almost 1, but due to the large absolute value of the error bounds, the resulting error bound will be large as well.

3.4.12.1 Argument Reduction

To prevent the effect described above from happening, we introduce a very simple argument reduction based on the following symmetry of the hyperbolic tangent:

\tanh(-x) = -\tanh(x)

Whenever a positive argument is evaluated, we replace it with its negative and negate the result after the evaluation. Especially for large positive arguments, this reduces the remainder bound of the result drastically.

We also test the incoming argument to lie within the somewhat arbitrarily chosen interval [-325, 325] (half of the interval chosen for similar purposes in the exponential function in Section 3.4.5). For values outside of this interval, we do not compute a result, but instead just return the value ±1 with a sufficiently small error bound (e.g. 10^-50). This prevents calling the exponential function with an argument that is too small to be evaluated.
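A minimal non-verified double precision sketch of this reduction is given below: positive arguments are mapped to negative ones, where exp(X)^2 stays below 1 and no large intermediate values occur, and the sign is restored at the end. The routine name MYTANH is an illustrative assumption; the verified routine additionally attaches the small error bound mentioned above in the saturated region.

C     Illustrative, non-verified sketch of the hyperbolic tangent via
C     equation (3.30) with the sign-based argument reduction.
      DOUBLE PRECISION FUNCTION MYTANH(X)
      DOUBLE PRECISION X, T, E2
      T = -ABS(X)
      IF (T .LT. -325.0D0) THEN
C        Saturated region: return +-1 (the verified code attaches an
C        error bound such as 1.0D-50 here instead of computing).
         MYTANH = SIGN(1.0D0, X)
      ELSE
C        For T <= 0 the quantity EXP(T)**2 lies in (0, 1], so the
C        numerator and denominator stay of moderate size.
         E2 = EXP(T)**2
         MYTANH = SIGN((1.0D0 - E2)/(1.0D0 + E2), X)
      END IF
      END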
3.4.12.2 Argument Splitting

While not strictly necessary, we also perform the usual argument splitting for monotonic functions. Wide arguments are split in order to suppress the effects of the interval dependency problem in the evaluation of the Taylor expansion. We split wide arguments into two separate High Precision numbers Xu and Xl for the upper and lower bound respectively. Then the hyperbolic tangent is evaluated as described above separately for both Xu and Xl, yielding Yu = tanh(Xu) as well as Yl = tanh(Xl). Since the hyperbolic tangent is monotonic, the final interval is then computed by merging Yu and Yl into one interval covering the range [Yl, Yu].

3.5 Comparisons and Bounding

The following section describes algorithms for bounding and comparison of high precision intervals. The bounding operations produce an upper and a lower bound as double precision numbers. The comparison operations allow rigorous comparisons between two high precision intervals, or between a high precision interval and a double precision number. Due to the interval nature of the high precision intervals, the meaning of the comparison operations also has to be specified.

3.5.1 Approximation

In our implementation of the intrinsic functions, it is often required to compute a floating point approximation of the value of the center point of a High Precision number. There are of course various ways to obtain such an approximation; we briefly discuss three different approaches here.

1. The fastest method to estimate the value of a High Precision number is to only consider the value of the first limb. For well conditioned High Precision numbers, this will be a good approximation of the actual value, as all higher limbs are considered to be corrections to the preceding ones. However, for ill conditioned numbers this method can produce completely wrong results.

2. The second method involves the simple summation of all limbs in double precision arithmetic. For well conditioned numbers, this summation will not change the result significantly compared to the preceding method, but in the case of ill conditioned numbers this method provides at least some protection against completely misleading approximations. For added accuracy, the summation should be performed starting with the last, and usually smallest, limb, working its way up to the first, and largest, limb.

3. The third method is similar to the second, but instead of a naive summation of double precision floating point numbers, a Kahan summation (see Section 3.1.5) is performed. This provides additional safeguards against ill conditioned High Precision numbers being approximated poorly. As in method number two, the summation should be performed backwards, starting with the last limb of the number.

In our implementation, we decided to use the Kahan summation algorithm to compute approximations of High Precision numbers. Its accuracy is far better than that of the other two approaches, especially for numbers that are not well conditioned. Such numbers regularly occur when there is heavy cancellation during a computation, e.g. when representing an identity in a very complicated way such as sin^2(x) + cos^2(x) or log(exp(x)). The additional cost of such a summation is negligible compared to the cost of performing the other high precision operations.
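A minimal sketch of this third method is given below. The routine name HPAPX and the calling convention (an array of N limbs, largest first, with the error term excluded from the approximation) are illustrative assumptions about the data layout, not the actual COSY INFINITY interface.

C     A minimal sketch of the backwards Kahan summation used to
C     approximate the center point of a High Precision number.
      DOUBLE PRECISION FUNCTION HPAPX(A, N)
      INTEGER N, I
      DOUBLE PRECISION A(N), S, C, Y, T
      S = 0.0D0
      C = 0.0D0
C     Sum the limbs backwards, starting with the smallest one, while
C     carrying the running compensation C for the lost low-order bits.
      DO 10 I = N, 1, -1
         Y = A(I) - C
         T = S + Y
         C = (T - S) - Y
         S = T
   10 CONTINUE
      HPAPX = S
      END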
3.5.2 Bounding

The bounding operation essentially converts a high precision interval into a double precision interval rigorously containing the high precision interval. The upper and lower bound of the result are, of course, upper and lower bounds for the high precision number. To calculate such an interval enclosure, the unevaluated sum A = \sum_i a_i \pm a_{err} is evaluated in double precision interval arithmetic. The initial interval is just the error bound [-a_err, a_err]. Then each limb is added to this interval. To avoid excessive and unnecessary overestimation, the summation is carried out backwards, beginning with the last, and usually smallest, limb.

This method of bounding is particularly accurate for High Precision numbers close to zero. The reason for this is that in the floating point model, zero is an "accumulation point" of sorts. The density of floating point numbers around zero is much higher than around any non-zero number. This fact allows sharp bounds for values very close to zero without the overestimation typically experienced near non-zero values.

Consider for example a bound for the High Precision number A = 1 - 10^-20 + 10^-40. This is a number very close to, but not exactly, 1. In double precision arithmetic, even the most accurate double precision interval enclosure of A would be given by an interval of the form [1 - 10^-16, 1], because there are no double precision floating point numbers closer to the real value of A. Consider, on the other hand, a similar number near 0 instead of near 1, e.g. B = 0 - 10^-20 + 10^-40. Then the double precision interval enclosure would be much better and correctly reflect the fact that B is strictly negative and bounded away from zero. This is simply because there are plenty of floating point numbers between -10^-20 and 0, despite their small absolute distance from each other.

3.5.2.1 Absolute Value

The interval nature of high precision intervals also requires a new definition of the absolute value. For this operation, we follow the currently existing implementation of double precision intervals in COSY INFINITY. This implementation returns a single double precision number that is guaranteed to be an upper bound on the absolute values of all elements of the interval.

Definition 3.7 (Absolute Value). The absolute value of a High Precision number A returns one double precision number v such that

v \ge |x| \quad \forall x \in A.

Note that there is no requirement for v to be in any way optimal. Any v that satisfies the condition in Definition 3.7 is acceptable. To make this intrinsic useful, however, it is clearly in the interest of the implementation to avoid large overestimation.

In our implementation of High Precision numbers, the upper bound v is calculated by first obtaining a double precision interval enclosure of the high precision number A (see Section 3.5.2). Then v is set to be the larger of the absolute values of the upper and lower bounds of that double precision interval. Since the absolute value is an exact operation in floating point arithmetic, this result satisfies the above definition.

For well conditioned High Precision numbers, or in cases where the sharpness of the absolute value is not absolutely essential, a much faster method to compute a suitable value for v can be employed. Since certainly

\left| \sum_i a_i \pm a_{err} \right| \le \sum_i |a_i| + a_{err},

we can just add up the absolute values of each limb and the error term and round the result outward in double precision arithmetic. If the input is well conditioned, the resulting bound will still be quite sharp, since all the smaller corrections only cause round-off in the last few bits of the result. In case of ill conditioned input, however, this method can produce significant overestimation.

3.5.3 Splitting and Merging

In various places throughout the intrinsics it becomes necessary to split High Precision numbers into a sharp upper and lower bound, each again given as a High Precision number. Conversely, an operation merging two High Precision numbers into one which covers the entire interval of values between them is needed.
These operations are of particular interest for the argument splitting which is performed for wide interval argument. In the following, we will describe two simple and fast algorithms to perform these operations. 3.5.3.1 Let A = Splitting ai ± aerr be a given High Precision number with limbs ai and error term aerr . In order to compute an upper and lower bound of A, the floating point error term aerr is 82 either added or subtracted from the center point A = ai , respectively. This operation is performed in the usual verified High Precision arithmetic, so the result is rigorous. Furthermore, the error bound on the resulting number is on the order of the computation precision. This is because neither A nor aerr have an error term of their own, so any round-off errors that appear are only due to the limited precision. This means that, obviously, for error bounds smaller than the relative computation precision of A this operation has no effect at all and returns the same result for both upper and lower bound. 3.5.3.2 Merging The merging operation, or mathematically speaking the interval union, of two intervals A and B can be performed as follows. First, note that the width W of the resulting interval is mathematically given by W = |A − B| , 2 where the absolute value of the interval is meant to be the largest absolute value of all values in the interval A − B. Furthermore, the new center point C can be computed as C= A+B 2 In High Precision number arithmetic, these operations can be performed rigorously, and W can be bounded from above by a floating point number w using any of our bounding schemes. Then the resulting interval C = A ∪ B is simply given by adding w to the error 83 term of C with outward rounding. This method is very fast, and works well for merging sharp intervals, i.e. A, B with small remainder errors. Another big advantage of this method is that it does not require any previous knowledge about A and B. In particular, it is not necessary to compare both values to determine which one is the larger and which is the smaller. The drawback of this method is that the over-estimation of the result is quite large if the intervals to be merged themselves have large error bounds. While the expression for W still provides the correct width of the resulting interval, the expression for C will have an error bound roughly the size of the average of the individual error bounds of A and B. As this is then added to W , the resulting width is not optimal any more if the error of C was large. This problem is due to the fact that the optimal center point of the resulting interval depends on the widths of A and B. Ideally, the center point would be computed up to the computation precision. However, such a computation is much more involved, requiring the splitting of both A and B into upper and lower bounds and then merging the maximum and minimum of each using e.g. the above algorithm. Fortunately, in our use of the intrinsic operations, we usually merge intervals with very sharp remainder bounds. For these cases the algorithm described here is perfectly sufficient to produce very good results. 3.5.4 Sign of a Floating Point Series In Section 3.5.5, comparisons between floating point numbers and High Precision numbers will be reduced to the determination of the sign of a series of floating point numbers. Here we will start by presenting various methods to determine the sign of a finite series of floating 84 point numbers. In the following, let the series be given by n floating point numbers ai . 
The goal then is to rigorously determine the sign of the sum \sum_{i=1}^{n} a_i.

3.5.4.1 Double Precision Sign Determination

An easy and fast way to determine the sign of such a sum is to use rigorous double precision interval arithmetic. By summing up the ai as described in Section 3.5.2, the resulting double precision interval can easily be tested for its sign. This is particularly effective if the ai are well conditioned in the sense of a High Precision number. In that case, performing the additions starting with the smallest number ai and continuing with the next larger one yields a very sharp interval enclosure of the value, which has a width of about 10^-16 times the correct value of the sum. Thus, as long as the precise sum is non-zero (and not below the underflow threshold), this interval method will yield the correct sign.

This method is not suitable to test for exactly zero, since due to the unavoidable interval overestimation the resulting interval will always have a non-zero width and thus contain positive and negative values. Similarly, this method fails if the ai are not well conditioned. If there is cancellation in the double precision interval computation, the typical effects of interval arithmetic cause the resulting interval to be overly wide, possibly preventing the determination of the sign.

Consider for example the case where a1 = -1, a2 = 1, and a3 = 10^-30. Clearly, the result of the sum is 10^-30 and hence the sign is positive. However, in double precision interval arithmetic a3 + a2 will yield an interval of width about 10^-16 around 1, since a3 is far below the floating point precision relative to the value 1. Consequently, after adding a1 the resulting interval indeed is an interval around 0 containing both positive and negative numbers. If, however, either a1 or a2 in the above series is dropped, the series is well conditioned and the double precision interval computed will easily and safely allow determination of the sign.

3.5.4.2 High Precision Sign Determination

To work around the problems of the double precision interval method of determining the sign, the following algorithm was developed. It is based on a simple bounding argument, and allows the sign of a series of floating point numbers to be determined very precisely, using only floating point arithmetic.

Without loss of generality, let |a1| and |a2| be the largest and second largest absolute values of the n numbers ai. Then we have

\sum_{i=1}^{n} a_i = a_1 + \sum_{i=2}^{n} a_i \le a_1 + \left| \sum_{i=2}^{n} a_i \right| \le a_1 + \sum_{i=2}^{n} |a_i| \le a_1 + (n-1) \cdot |a_2|   (3.31)

and

\sum_{i=1}^{n} a_i = a_1 + \sum_{i=2}^{n} a_i \ge a_1 - \left| \sum_{i=2}^{n} a_i \right| \ge a_1 - \sum_{i=2}^{n} |a_i| \ge a_1 - (n-1) \cdot |a_2|.   (3.32)

Let n* be an integer power of two that is greater than or equal to n - 1. Then the inequalities obviously still hold if n - 1 is replaced by n*. Furthermore, all operations needed to calculate n* · |a2| are exact in floating point arithmetic. Combined with the fact that comparisons of floating point numbers are always exact, we can therefore conclude that the sign of the whole sum is the same as the sign of a1 if

|a_1| > n^* \cdot |a_2|.   (3.33)

If, however, a1 and a2 are too close together, we can perform a Dekker addition of the two floating point numbers. That is guaranteed to result in two different floating point numbers, which rigorously replace a1 and a2 in the above sums without changing their value. Furthermore, they fulfill |a1| > 1/ε · |a2| and, if n* in the above equation is smaller than 1/ε, also |a1| > n* · |a2|. Repeating the process with this new series of numbers will provide different values for a1 and a2, and thus may allow the sign to be determined.

Unfortunately, it cannot be guaranteed that this algorithm terminates. It is possible that for very specific input, the largest number either becomes smaller than the underflow threshold of the underlying floating point arithmetic, or that it becomes infinite in the floating point sense. In such cases, no definitive answer can be given, but in practice such cases do not appear and are thus largely irrelevant.

There are a few other special cases to be considered. In the trivial case of n = 1, the sign of the sum is clear. If the largest value turns out to be a1 = 0, then the sign is also clear, since then the whole sum has to be equal to 0. As mentioned before, over- and underflow of the floating point numbers need to be considered separately. This can be done by e.g. testing the size of the number with the largest absolute value, and returning an undetermined result if this number is too large or too small to perform further operations. In those cases, no statement can be made about the sign of the sum.
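The following sketch illustrates the structure of this algorithm. The routine names HPSIGN and TWOSUM, the return convention (+1, -1, 0, or -2 for "undecided"), and the iteration limit are illustrative assumptions; TWOSUM is assumed to compute the exact Two-Sum of Section 3.2.1, returning the rounded sum and the exact round-off error. The explicit over- and underflow tests described above are omitted here.

C     Sketch of the high precision sign determination of a series
C     A(1..N).  The array A is modified in place by the Two-Sum steps.
      INTEGER FUNCTION HPSIGN(A, N)
      INTEGER N, I, I1, I2, ITER
      DOUBLE PRECISION A(N), NSTAR, S, E
C     NSTAR: smallest power of two greater than or equal to N-1.
      NSTAR = 1.0D0
   10 CONTINUE
      IF (NSTAR .LT. DBLE(N-1)) THEN
         NSTAR = 2.0D0*NSTAR
         GOTO 10
      END IF
      DO 40 ITER = 1, 1000
C        Find the indices I1, I2 of the largest and second largest
C        absolute values in the series.
         I1 = 1
         DO 20 I = 2, N
            IF (ABS(A(I)) .GT. ABS(A(I1))) I1 = I
   20    CONTINUE
         I2 = 0
         DO 30 I = 1, N
            IF (I .NE. I1) THEN
               IF (I2 .EQ. 0) THEN
                  I2 = I
               ELSE IF (ABS(A(I)) .GT. ABS(A(I2))) THEN
                  I2 = I
               END IF
            END IF
   30    CONTINUE
C        Decision based on inequalities (3.31)-(3.33).
         IF (A(I1) .EQ. 0.0D0) THEN
            HPSIGN = 0
            RETURN
         END IF
         IF (I2 .EQ. 0) THEN
            HPSIGN = INT(SIGN(1.0D0, A(I1)))
            RETURN
         END IF
         IF (ABS(A(I1)) .GT. NSTAR*ABS(A(I2))) THEN
            HPSIGN = INT(SIGN(1.0D0, A(I1)))
            RETURN
         END IF
C        Otherwise replace A(I1), A(I2) by their exact Two-Sum
C        (Section 3.2.1) and try again.
         CALL TWOSUM(A(I1), A(I2), S, E)
         A(I1) = S
         A(I2) = E
   40 CONTINUE
C     No decision reached within the iteration limit.
      HPSIGN = -2
      END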
3.5.5 Comparisons

Although high precision intervals are intervals, it is possible to define certain useful comparison operators acting on them. When comparing two real numbers, there are exactly three possibilities: one of them is either larger than, smaller than, or equal to the other. Due to the interval nature of high precision intervals, however, these concepts have to be redefined. Note that in the following definitions, one of the intervals can be assumed to have zero width, yielding the corresponding conditions for comparisons to real numbers.

Definition 3.8. We say an interval A is larger than an interval B if the same is true for each pair of elements of the intervals:

A > B \Leftrightarrow x > y \quad \forall x \in A, \ \forall y \in B

Definition 3.9. Similarly, we say an interval A is smaller than an interval B if the same is true for each pair of elements of the intervals:

A < B \Leftrightarrow x < y \quad \forall x \in A, \ \forall y \in B

Equality, and subsequently inequality, is somewhat harder to define. Of course, one could use the mathematical definition of equality of two intervals,

A = B \Leftrightarrow A \subseteq B \ \wedge \ B \subseteq A.

However, for practical purposes this is not useful. After lengthy calculations, the chance that two intervals represented by high precision intervals are exactly equal is practically zero. This is particularly true if a program intends to check for a certain condition, such as A = 0. In practical situations, this will never be true, since even if A had only one limb of value 0, the error bound very likely would not be zero. Instead, we use the following definition of equality.

Definition 3.10 (Equality). We say A is equal to B iff there is a non-empty intersection between the two intervals:

A = B \Leftrightarrow A \cap B \ne \emptyset

Inequality, on the other hand, is defined as the negation of equality.

Definition 3.11 (Inequality). We say two high precision intervals A and B are not equal iff their intersection is the empty set:

A \ne B \Leftrightarrow A \cap B = \emptyset

This definition of equality has mainly practical value. In typical interval code, it is often necessary to test whether two intervals overlap, or whether a certain value is included in an interval. With the above definition of equality, these kinds of tests can be carried out easily and somewhat elegantly in a program. Note that, while being useful, these definitions do not have all the mathematical properties of the same relations on real numbers.
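A minimal sketch of these comparisons, acting on double precision lower and upper bounds AL, AU and BL, BU of two intervals A and B, is given below. The routine names are illustrative assumptions, and the "undecided" outcome that can occur in the High Precision code when a sign cannot be determined is not modeled here.

C     Definition 3.8: every element of A is larger than every
C     element of B.
      LOGICAL FUNCTION HPGT(AL, AU, BL, BU)
      DOUBLE PRECISION AL, AU, BL, BU
      HPGT = AL .GT. BU
      END

C     Definition 3.9: every element of A is smaller than every
C     element of B.
      LOGICAL FUNCTION HPLT(AL, AU, BL, BU)
      DOUBLE PRECISION AL, AU, BL, BU
      HPLT = AU .LT. BL
      END

C     Definition 3.10: the intervals overlap, i.e. their
C     intersection is non-empty.
      LOGICAL FUNCTION HPEQ(AL, AU, BL, BU)
      DOUBLE PRECISION AL, AU, BL, BU
      HPEQ = (AU .GE. BL) .AND. (BU .GE. AL)
      END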
Clearly, equality in the above sense is not transitive, as is illustrated by the intervals [0, 2] = [1, 4] = [3, 5] but [0, 2] = [3, 5]. Furthermore, as discussed in the previous section, there is always a chance that a comparison cannot be decided. In that case, we a third value, “undecided”, is returned instead of “true” or “false”. 89 3.5.5.1 Comparing a High Precision Number to a Floating Point Number To compare a High Precision number A to some floating point number b, we have to understand what this means in terms of the representation of A. Let X = i xi +xerr be the up- per bound of the interval represented by High Precision number X, and X = i xi − xerr the lower bound. Then we can write the different conditions defined above as: • A > b: A >b ⇒ ai − aerr > b ⇒ ai − aerr − b > 0 i i • A < b: A B: A > B ⇒ ai − aerr > i bj + berr ⇒ j ai − aerr − i bj − berr > 0 j • A < B: A < B ⇒ bj − berr ⇒ ai + aerr < i j ai + aerr − i 92 bj + berr < 0 j • A = B: A ⇒ j ai + aerr − i A bj − berr ∧ ai + aerr i ⇒ ∧ B bj + berr B ai − aerr i 0 ∧ j bj + berr j ai − aerr − i bj − berr 0 j As in the case for the comparison to a single double precision number, the bounding can be performed either in double precision or using the high precision bounding algorithm. However, in these cases the naive double precision interval bounding is usually quite useless for accurate comparisons of nearby intervals due to overestimation. Instead, one has to carefully subtract the right expressions in verified high precision arithmetic, and bound the typically well conditioned result. Alternatively, the sign of the sum can be determined using the high precision sign determination algorithm. 3.6 Input / Output Rigorously printing floating point numbers, and reading input from a string is essential to verified calculations. The functions provided by the underlying programing language are usually insufficient for this task. This is because the output will be converted to decimal fractions, which leads to truncation and thus incorrect results. The input functions provided are implemented to read input into double precision numbers. However, they are non-rigorous, and typically not even correct rounding is guaranteed. Therefore, it is necessary to implement specific functions to perform these tasks. Although it might be possible to use some conversion functions provided by the system li93 braries, we will not rely on any of those. Instead, we implement all functions on our own. This not only guarantees portability, it also allows to ensure that every step of the conversion procedure is performed rigorously. There are generally two types on input and output formats: 1. Human readable input/output. This format is used for human users to input values into code or a program and to read output generated by the program. As such it is usually in decimal notation. In most computer programming languages it is common to write decimal fractions in what is known as scientific notation (see below). 2. Machine readable input/output. This format is not intended to be read or manipulated by humans. It is stored in a format suitably close to the internal representation of numbers (usually binary), and if implemented properly this input can be read into program memory without loss of precision. Output in this format is typically generated by a computer program, and serves the purpose to restore certain data structures from persistent storage into the programs memory. 
3.6.1 Rigorous Binary Output When printing rigorous numbers, it is necessary to avoid conversion from the floating point number system used in the computer representation of a number. We print double precision numbers rigorously, by outputting their floating point representation exactly. In order to extract the relevant parts of the bit pattern of a double precision number, we rely on IEEE 754[2] conform storage. This means that the number is represented internally as s · m · 2e 94 (3.34) where the numbers are integers of different size: s is the sign (1 bit), m is the mantissa (52 bits2 ) and e is the exponent (11 bits). This results in a total of 64 bits. To achieve rigorous output we extract all of these fields form the floating point number a, and write it as ±ma bea where ma is the mantissa and ea is the exponent. The character ”b” as a separator is reminiscent of the ”e” in scientific notation common in most programming languages. Thus the decimal number 4.25, for example, is output as 17b − 2 meaning 17 · 2−2 . It is obvious that this representation is not always unique. Powers of two can be shifted between the mantissa and the exponent somewhat freely. To make it both unique and easy to read, we attempt to keep the absolute value of the exponent as small as possible. This convention means that whenever possible, a number is output as an integer without the need for multiplications by powers of 2. When outputting a High Precision number, each limb is printed once in this rigorous way, and once as a non-rigorous, decimal fraction approximation. That means the output is both human and machine readable at the same time. 3.6.1.1 Implementation To extract mantissa, exponent and sign from a DOUBLE PRECISION variable in FORTRAN, we use the bit field functions introduced into FORTRAN by DoD military standard 1753[1] (also part of the Fortran 90 standard), and the concept of EQUIVALENCE, which allows different FORTRAN variables to share the same physical memory. We assume that the FOR2 Actually it is 53 bits because the first bit is implicitly assumed to be one for nondenormalized numbers. 95 TRAN type DOUBLE PRECISION represents an IEEE 754 compliant double precision number in memory. This is not guaranteed by the FORTRAN standard, but seems to be universally true on all platforms and compilers we tested. We declare a 64 bit integer variable using INTEGER*8 as data type3 . Then we declare this variable to be equivalent to a local DOUBLE PRECISION variable. Assigning the value in question to the local double precision variable, we can now operate on the bit pattern of the double precision number. Using bit field functions, we extract the relevant bits from the double precision number. According to IEEE 754[2], the bit layout of a double precision number is as follows. Bits are numbered from 0 through 63, starting with the right most bit. This is in accordance with the numbering scheme used by the FORTRAN bit field functions. The layout of the memory of a double precision number is given by the following table: 63 62 sign ... 52 51 ... 40 ... exponent 30 ... 20 ... 10 ... 0 mantissa To store the mantissa, a 64 bit integer is required, since there are more than 32 bits. The exponent and the sign are both stored in a regular FORTRAN integer (32 bits). To extract the bits, the FORTRAN bit field extensions provide the very useful function IBITS(m, i, len). 
It extracts len bits from m starting at i, and returns a new integer that contains those bits flushed to the right, and the rest of it padded with zeros. This, of course, is exactly what we need to extract the parts: EXPONENT 3 This construct is not FORTRAN 77 standard, but it seems to be widely supported by all current compilers. 96 = IBITS( TEMP, 52, 11 ) MANTISSA = IBITS( TEMP, 0, 52 ) SIGN = IBITS( TEMP, 63, 1 ) After extracting this information from the double precision number, we have another function interpret it and actually converting it into a character string. All strings returned are 27 characters long and right justified. According to the IEEE754 standard, there are several cases: MANTISSA = EXPONENT = 0 The number has the value 0.0. IEEE allows zeros to be signed, so the sign bit is still valid. The value of this number is ±0.0. EXPONENT = 2047 signifies special cases, depending on the value of the mantissa: MANTISSA = 0 The number is infinity. The sign bit is valid. This number is converted into the string Inf inity. MANTISSA = 0 The number is not valid, i.e. not a number. This is converted into the string N aN (“not a number”). EXPONENT = 0, MANTISSA = 0 signifies a denormalized number, where the implicit bit of the mantissa is not set. Instead, m is directly used as extracted from the bit pattern. The exponent in this case is −1074, so that the value of this number is then ±m · 2−1074 . Other values For all other combinations, the correct value must be calculated according to these rules: The exponent is biased with a biasing constant of 1075 which is subtracted from it, while in the mantissa the implicit leading bit, 52, is set to one. The value of this number is then ±m · 2exponent . 97 The reason for the biasing constant being different from the one given in [2] is that we shift the decimal point of the mantissa to the end, so it becomes an integer instead of a binary fraction. That way, we have to subtract an additional 52 from the exponent. 3.6.2 Rigorous Decimal Output While the storage format as an unevaluated series of double precision numbers is a good choice for computations and developing algorithms, it is not particularly intuitive for humans to read the resulting High Precision numbers. For easily human readable output, it is necessary to provide a mechanism to rigorously convert a series of floating point numbers into a decimal fraction. The format in which we want to print a number is the same as the “scientific notation” for decimal fractions described in Section 3.6.3. In the case of rigorous output, this representation is followed by an additional rigorous error bound. For example: 1.284756354658596836355657869392735443638E − 001 1.284756354658596E − 001 ± 1.21E − 14, where the notation in the second line is meant to be read as the interval [1.284756354658596· 10−1 − 1.21 · 10−14 , 1.284756354658596 · 10−1 + 1.21 · 10−14 ]. To that end, we implemented an very simple decimal arithmetic based on character strings. While not efficient in the least, this allows for a relatively straightforward implementation of the needed operations based on “pencil-and-paper” type computations. More precisely, any number in this arithmetic is represented by a mantissa M , which 98 is represented as a decimal integer in a string of characters, along with a binary integer e representing the decimal exponent associated with this number and a sign s. The value of such a number is then given by s · M · 10e . On these numbers, three operations will be implemented: 1. Addition 2. 
Multiplication by 2 3. Division by 2 These operations are sufficient to convert numbers from the binary fractions stored in a double precision floating point number into a decimal fraction. Furthermore, all of these operations result in finite decimal fractions, thus as long as the length of the mantissa character string is chosen large enough these operations are absolutely exact. Then, in order to compute the correct decimal representation of a series of double precision numbers S = i=1...n ai , the following steps are performed: 1. The sign si , mantissa mi , and exponent ei of each number ai are extracted. The mantissa is stored as an integer in a 16 digit character string, while the exponent and sign are stored as binary integers. Then si · mi · 2ei S= i=1...n 2. The smallest exponent emin is determined. Then si · mi · 2ei −emin S = 2emin · i=1...n 99 3. For each i = 1 . . . n the exact decimal expansion of mi · 2ei −emin is computed. This is done by repeated multiplication by 2 of the initial mantissa mi . At this point no division by 2 is required yet, since by choice of emin all exponents ei − emin are positive. The resulting expansions are then summed into a final number M . Then S = 2emin · M Note that all these operations so far are all purely integer operations, the decimal exponent of M can be chosen to be zero at this point. 4. Finally, S is computed by either multiplying or dividing M repeatedly by 2 for emin times. 5. Depending on the desired rounding mode the result may need to be rounded up or down. This is performed by examining the first non-printed digit in the expansion and the sign of the result. If the result then needs to be rounded towards zero, it is just truncated. If rounding away from zero is required, the last printed digit is “bumped up” by adding a 1 in that place of the mantissa. This results in an exact decimal representation of the value of S, with proper rounding if desired, which then can be formatted properly for output. In the case of rigorous output, some additional steps are required. After computing S as described above, the following additional steps are performed to obtain an exact decimal representation of the truncation error and the error term err of the High Precision number being printed. 1. To compute the remainder error between the value for the center point that is printed 100 and the correct value, the actually printed digits after rounding are subtracted from the exact decimal expansion S. This yields a new number E which contains an exact decimal representation of the truncation error due to rounding and printing. The sign is modified to always be positive, effectively taking the absolute value of the truncation error. 2. As described for each of the ai above, the double precision error term err of the incoming High Precision interval is converted to a decimal representation and then added to E. This now is an exact decimal expansion of the error term to be displayed. 3. Again as described above, the error term is rounded upward (towards positive infinity) at the last printed digit of the error term. This then results in a rigorously rounded representation of the error term that can be formatted and printed on the screen. 3.6.2.1 Implementation First, a safe length for the mantissa character string has to be chosen. We suggest a maximum length of 1440 decimal digits. This is sufficient to add at least 100 completely arbitrary double precision numbers. 
First, note that the maximum number of decimal digits that can occur in the integer resulting from the multiplication within the sum in step 2 is 632. This is because, after subtracting the smallest possible exponent −1074, the absolute value of the largest possible element under the summation sign is less than 253 · 22045 = 22098 < 4 · 10632 . Summing 100 of these values then certainly results in an integer less than 100 · 4 · 10632 < 4 · 10634 . In the last step, this integer of at most 634 digits is then either divided or multiplied |emin | 1074 times by 2. If the number divided is odd, each division by two creates one 101 more digit at the end of the decimal expansion. However, since 24 = 16 > 10, for every 4 divisions, at least one leading decimal digit disappears. Thus division by 2 adds at most 1074 − 1074/4 = 806 digits to the result. Multiplication by 2 adds at most one new digit at the beginning of the decimal expansion if the first digit is greater or equal to 5, but since 23 = 8 < 10 does so no more often than in every third multiplication. Thus multiplication by 2 at most adds 1074/3 = 359 digits to the result. Thus, the final result S will have at most 806 + 634 = 1440 digits. This is only a rough upper bound, but it will provide sufficient digits for all of these operations to be carried out precisely. In the following, a brief description of the implementation of each of the three required operations is given. All algorithms are essentially implementations of the digit-by-digit operations taught in middle school. For ease of implementation, our implementation stores decimal expansions right-justified within their respective string, with the right-most digit being the least significant. The end of the decimal number is signified by a blank character to the left of the most significant digit. For convenience, we also store the sign of the number in the left-most character of the string. This zeroth character does not count towards the character limit and is also not counted in the following descriptions. If the sign character is ’X’, the corresponding number is defective due to some error during the computation and should not be trusted. Otherwise the sign is ’+’ or ’-’. For printing purposes the sign can also be ’˜’, resulting in a ± sign. The last sign is not valid for arithmetic operations and causes an error if used in that context. All of the following operations carry out checks at runtime to insure that no attempts are 102 made to write beyond the maximum length of a decimal expansion string. While the length of such strings is chosen large enough, as described above, this additional test ensures that in no case incorrect numbers are ever returned. Instead, an error is returned. These test will not be mentioned any more explicitly in the following descriptions, to keep them easily understandable. 3.6.2.1.1 Addition Addition surprisingly is the most complex of the operations. First, the decimal exponents e1 and e2 of the two input numbers are made equal. This is done by picking the smaller one, without loss of generality e1 , and adding e2 − e1 zeros on the right hand side of the decimal expansion corresponding to e2 . Now the addition has been reduced to an integer operation on the mantissae of the two numbers. For two numbers with the same sign, addition is rather straightforward, and will be described first. The case of different signs is then discussed afterwards. 
Addition of numbers with the same sign is performed digit by digit, starting with the right-most (least significant) digit. The corresponding digits of both expansions are added to form a number k, and the carry, if any is also added to k. If k is greater than or equal to 10, it is reduced by 10, and carry is set to 1. Otherwise, carry is set to zero. Lastly, the current digit in the result is set to the k. Then the process is repeated with the next digit to the left. If one of the two decimal expansions reaches the end, it is treated as if it were filled with 0 digits on the left. Note that in this process the maximum value k can take is 9 + 9 + 1 = 19. Thus, after the subtraction of 10, k is always a single digit. If after reaching the last digit of both input numbers the carry is 1, it is added as an extra digit in the beginning of the expansion. The decimal exponent of the result is the same 103 as e1 , the sign of the result is that of both input numbers. In the case of differing signs of the two input numbers, the algorithm is modified as follows. Without loss of generality, let the first number be positive, and the second one negative. Again, addition is performed digit by digit, starting with the right-most (least significant) digit. However, in this case the corresponding digit in the second number is subtracted from the digit in the first number to form k. Then the carry, if any, is added to k. If k is greater than or equal to 10, it is reduced by 10 and carry is set to 1. If k is less than 0, 10 is added to k and carry is set to −1. Otherwise, carry is set to zero. Lastly, the current digit in the result is set to the k. Then the process is repeated with the next digit to the left. If one of the two decimal expansions reaches the end, it is treated as if it were filled with 0 digits on the left. Note that in this process the maximum value k can take is 9 − 0 + 1 = 10, while the minimum is 0 − 9 − 1 = −10. Thus, after the addition or subtraction of 10, k is always a single, positive digit. If after reaching the last digit of both input numbers the carry is 1, it is added as an extra digit in the beginning of the expansion. The decimal exponent of the result is the same as e1 , the sign of the result is positive. The same is true if the carry is 0. If the carry in the end is −1, the 10-complement of the result has to be taken. This is done by applying the following transformation to each digit in the result, starting with the rightmost digit. To the digit the carry, if any is added forming k. If k is greater than zero, it is subtracted from 10 and the carry is set to 1, otherwise the carry is set to 0. Then the current digit is replaced by the value in k. Since the maximum value for k is 104 10 − 1 = 9, it is always a single, positive digit. Then the process is repeated with the next digit to the left. After this process has reached the left-most digit of the result, the sign of the result is then set to negative and its decimal exponent is e1 . 3.6.2.1.2 Multiplication by 2 Multiplication by 2 is the easiest of all operations. Start- ing from the end of the string, each digit is multiplied by 2 and the carry over from the previous digit, if any, is added. If the result is greater than or equal to 10, 10 is subtracted from the result and the carry is set to 1. Then the process is repeated with the next digit to the left. Note that the maximum carry over to the next digit that can occur is 1. 
This is because the result of the multiplication of any digit by two is at most 2 · 9 = 18. Thus the first carry that can occur is at most 1, but with a carry of 1 the maximum value of the next digit after adding the carry is 19, which still causes a carry of only 1. If after reaching the last digit of the input number the carry is 1, it is added as an extra digit in the beginning of the expansion. Multiplication by 2 obviously does not affect the decimal exponent or the sign of the number. 3.6.2.1.3 Division by 2 Division is the only operation that “grows” the decimal expan- sion to the right instead of the left. Thus temporarily the decimal expansion is shifted so that it begins on the left of the character string with the most significant digit (i.e. it is left justified within the string). The end of the expansion is denoted by a blank character after the right-most (least significant) digit. 105 From there on, division follows a similar pattern as multiplication, except that operations start on the left. Each digit is divided by 2 and the carry is added to form k. The current digit is then replaced by the integer part of k. If k has a fractional part, the carry is set to 5, otherwise it is set to 0. Then the process is repeated with the next digit to the right. Note that in each step k is always less than 10, since the carry is at most 5 and the result of the division is at most 9/2 = 4.5. If after reaching the last digit of the input number the carry is 5, it is added as an extra digit at the end of the expansion. Finally the entire decimal expansion is shifted back to the end of the string (i.e. right justified within the string). If the left-most digit in the expansion is a 0, it is removed and the decimal exponent is reduced by 1. Division by 2 obviously does not affect the sign of the number. Normalization Before printing a number, it should be normalized. This process removes excess zeros at the beginning and the end of the number, in the latter case adjusting the decimal exponent accordingly. Implementation of this routine is straightforward. 3.6.3 Rigorous Decimal and Binary Input In this section we will present a string conversion algorithm that will read decimal fractions given in a special format called scientific notation, as well as the binary floating point notation introduced above. This conversion algorithm has to perform several tasks: 1. Parse the input string, and make sure it conforms to the specified format 2. Verify that the number is within the limitations of the appropriate type 106 3. Rigorously convert the number into the representation used internally by the computer system First, it is necessary to define the input format. Our input routines accept a format that is based on the scientific notation. More specifically, numbers consist of several fields, some of which are optional. They are separated by specific character constants. [+−]? [0123456789]∗ [., ]? ([0123456789]∗ )[eEhHbB]? ([+−]? [0123456789]∗ ) (3.35) In this representation, which is loosely based on POSIX regular expressions, characters in square brackets represent the allowed character set for the respective field. Any character listed in the square brackets is an allowed character. The superscript behind the square brackets designates the number of characters from the preceding set that are allowed in the field. ? means exactly none or one character appear. ∗ means zero or more characters appear. Expressions in parentheses are only allowed to appear, if the field directly before is not empty. 
Expression 3.35 dictates that each number has an optional sign of either + or −. A missing sign is the same as a positive sign. This is then followed by zero or more digits representing the integer part of the number represented in base 10. This then is followed by an optional decimal point, which, to accommodate users from Europe as well as from the U.S., can be either a decimal dot ”.” or a comma ”, ”. Only if this decimal separator is present, it is followed by an optional fractional part in base 10. The mantissa is then followed by an optional exponent separator which can be one of e, E, h or H. If this separator is present, it can be followed by another optional sign and a base 10 integer of zero or more 107 digits. The exponent for decimal expansions is limited in its range to lie within the interval [−300, 300]. If the exponent is outside of this range, an error is returned. The exponent separator b or B is a special case, indicating a binary fraction in what we call the B-Format. If the input string is in this format, there are some additional restrictions on the input. B-Format numbers are not allowed to have any fractional mantissa part, i.e. the mantissa must be an integer. Furthermore, the value of the mantissa must lie in the interval [−253 + 1, 253 − 1], i.e. the absolute value of the mantissa must be less than 253 , and the exponent must be lie in the interval [−1074, 971]. These values are derived directly from the definition of double precision floating point numbers in [2], chosen such that any number that passes these tests represents a valid double precision number of that precise value. The result may possibly be a denormalized underflow, but even in this case the number accurately represents the correct value. The smallest (denormalized) value representable in double precision is 2−1022−52 = 2−1074 , while the largest number representable is 253 − 1 · 21023−52 = 253 − 1 · 2971 . It is easily verified that this convention guarantees that any number output in the format described in Section 3.6.1 is also a valid number according to these criteria and can be read back exactly. 3.6.3.1 Examples To illustrate allowed formats, here are some examples of valid and invalid numbers: • 0505, 1983E24 is a valid number • .666 is a valid number 108 • 1, 0000000000000000000 is a valid number • 17 is a valid number • 1b4 is a valid number • −0.000911000E is a valid number (the missing exponent will be interpreted as 0) • −1E − 99 is a valid number • .E17 is a valid number (it will be read as zero since the empty mantissa will be interpreted as 0.0) • −. is a valid number (it will again be read as zero)4 • 5; 1983E17 is an invalid number (; is not a decimal delimiter) • . 8888 is an invalid number (illegal spaces after the decimal delimiter) • 1, 000.0 is an invalid number (two decimal delimiters) • 17 is an invalid number (illegal spaces in front) • 1.5b − 3 is an invalid number (no fractional part allowed for B-Format) • 12345678901234567890b1 is an invalid number (mantissa too large for B-Format) • 1 ∗ 10 ∗ ∗22 is an invalid number (illegal character ∗) • 1A7EF F is an invalid number (illegal characters A and F , all input has to be base 10) 4 Although this and the preceding examples are valid, it is not recommended to use this representation since it is confusing. Use 0 instead. 109 In addition to this format, the conversion functions also accept certain special strings, which are handled as completely separate cases in the beginning of the conversion process. 
Currently, those character constants are (case-insensitive): 1. “INFINITY”, “INF”, “INFTY” - Positive infinity double precision constant, 2. “-INFINITY”, “-INF”, “-INFTY” - Negative infinity double precision constant, 3. “NAN”, “–NAN–” - Not a number double precision constant, 4. ”PI” - The value of the pre-calculated High Precision number representing π in the current precision, 5. ”2PI”, ”2*PI” - The value of the pre-calculated High Precision number representing 2π in the current precision, 6. ”PI/2” - The value of the pre-calculated High Precision number representing π/2 in the current precision, 7. “LN2” - The value of the pre-computed High Precision number representing log(2) in the current precision. This allows for certain constants to be obtained easily, without requiring the user to calculate them in their code. 3.6.3.2 Implementation Another important issue is the accuracy of the resulting numbers in the case of decimal fractions. Some numbers that have a finite fractional representation in decimal notation, do 110 not have such a representation in binary. Numbers as simple as 0.1 cannot be represented by a finite binary fraction. By design this is not an issue for binary fractions in the B-Format. The following conversion algorithm attempts to encode at least the current number of digits exactly, possibly more. In any case, it always attaches an error bound to result, so that the number represented by the string is guaranteed to be contained in the resulting high precision interval. Parsing of the string is done by stepping through the input character by character, and checking if it matches the allowed character set of the current field. If it does, as many following characters are consumed as possible until a non-matching character is hit. At that point, the current field must have ended, and the parsing continues with the next field description. Whenever a new field is entered, the beginning index is stored for later use. Similarly, the ending index is stored whenever a field has ended. After completing all fields, we thus have beginning and end indices for all fields. If there are characters left at the end of the string, this means the input did not match our format. In that case an error is returned. In the following, we will make use of the fact that certain integer operations, if performed in double precision, are exact. The mantissa of a regular double precision number has 53 bits. Thus, it can store numbers between −253 ≈ −9 · 1015 and 253 ≈ 9 · 1015 exactly. Since our floating point model is round to nearest, if the result of an operation is representable as a floating point number, the operation is exact. Therefore, integer addition and multiplication performed on double precision numbers are exact, as long as the integers lie between −9·1015 and 9·1015 . Consequently, any 15 digit integer in decimal representation can be read exactly into a floating point number. 111 3.6.3.2.1 B-Format For the B-Format, the actual conversion is rather simple. Due to the bound checks performed previously during parsing, the value of the mantissa is guaranteed to fit into a double precision number precisely. Thus it can just be assembled digit by digit in floating point arithmetic without any roundoff errors, since all integers less than 253 can be represented precisely in double precision arithmetic. The last remaining step is to multiply the result by 2|exp| or 1 |exp| respectively. 
Here it is important to note that one cannot simply compute 2^|exp| and then multiply or divide by this value. This is because for the minimum allowed exponent, −1074, the value 2^1074 is floating point infinity, while 2^−1074 is still a valid (denormalized) value. Apart from this caveat, the computation of 2^|exp|, as well as the subsequent multiplication of the mantissa, can be carried out exactly in double precision floating point arithmetic.

In general, to evaluate B^|exp| efficiently, we have devised the following computation scheme. It requires very few operations, and thus is rather fast. While not relevant in the case of B = 2, for other values of B represented as High Precision numbers this also means less overestimation of the result. Expressing the exponent in binary form as exp = b0 b1 b2 b3 . . . with bi ∈ {0, 1}, we obtain

B^exp = B^(b0·2^0 + b1·2^1 + b2·2^2 + b3·2^3 + ...) = ∏_i B^(bi·2^i).

In the product, only the indices i with bi = 1 contribute to the result, since for bi = 0 we have B^(bi·2^i) = B^0 = 1. Calculating the remaining factors is easy, considering that

B^(2^i) = (B^(2^(i−1)))^2 = ((B^(2^(i−2)))^2)^2 = · · · = ((. . . (B^2)^2 . . .)^2)^2,

i.e. each factor is obtained from the previous one by a single squaring. Let E be a High Precision number or double precision number containing 1 initially. Starting with a High Precision number or double precision number A containing B^(2^0) = B^1 = B, A is squared successively. If the corresponding bit is set in the exponent, then E is multiplied by A. Since the exponent is limited to be smaller than 512 (2048 in the B-Format), there are at most 9 (11) bits to check.

Rigorous Decimal Format To convert a string in decimal format into a verified interval enclosure N, first the value of the exponent exp, including its sign, is extracted from the string and converted into an integer in a straightforward integer conversion. Next, this exponent is adjusted for use in the following steps. Let d be the number of digits before the decimal point in the input string, or 0 if there are none. Then we set e = exp + d − 1 − 15 and set E to be a rigorous High Precision enclosure of 10^e. For an efficient way to compute 10^|e|, see the previous paragraph; it is sufficient to compute 10^|e| and then compute its multiplicative inverse afterwards if the sign of e is negative. This adjustment ensures that each digit in the mantissa is assigned its correct value: the first digit in the mantissa represents a decimal power determined by its position in the mantissa string, d, and the decimal exponent exp, namely 10^(exp+d−1). The additional correction of subtracting 15 from the exponent stems from reading 15 digits of the mantissa at once.

The last step is actually reading the mantissa from the string. To convert the mantissa digits, the mantissa is stepped through from left to right, and the digits are assembled into a DOUBLE PRECISION number m. The decimal point, if present, is simply skipped over completely. Every 15 digits, m is multiplied by the High Precision number E and the result is added to the final number N. Both of these operations are of course performed in rigorous High Precision arithmetic. Afterwards, m is set to zero and the exponent E is multiplied by a precomputed, rigorous High Precision enclosure of 10^−15 in preparation for the next round. Reading then continues with the next 15 digits, until there are no more digits to be read.
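As an illustration, the squaring scheme used here for 2^|exp| and for the powers of ten 10^|e| can be sketched in a few lines. This is a non-verified sketch operating on ordinary doubles; in the actual implementation the running values are rigorous High Precision numbers, so that all round-off is captured in the error bound.

```python
def pow_by_squaring(base, exp):
    """Evaluate base**exp for a non-negative integer exp by successive squaring.
    A holds base**(2**i); whenever bit i of exp is set, the running result E is
    multiplied by A.  In the actual implementation base, A and E are rigorous
    High Precision numbers; ordinary doubles are used here only as a sketch."""
    E, A = 1.0, float(base)
    while exp > 0:
        if exp & 1:       # bit b_i is set, so this factor contributes
            E *= A
        A *= A            # advance from base**(2**i) to base**(2**(i+1))
        exp >>= 1
    return E

# Why the case distinction between 2**|exp| and (1/2)**|exp| matters:
print(pow_by_squaring(2.0, 1074))    # inf   -- 2**1074 overflows in double precision
print(pow_by_squaring(0.5, 1074))    # 5e-324, i.e. 2**-1074, the smallest denormal
print(pow_by_squaring(10.0, 22))     # 1e+22, the kind of power of ten needed for E above
```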
If the number of digits in the mantissa is not a multiple of 15, the last read value of m is padded with zeros at the end by multiplying it by the appropriate power of 10 before multiplying it by E and adding to the final result N , to make up for the “missing” digits. After this process, the sign of the variable N is adjusted according to the input string to complete the conversion into a rigorous enclosure of the decimal expansion given by the input string. The method described here works for all decimal expansions representing valid numbers, regardless of their length. Digits of the mantissa which fall below the current precision are read correctly and are added to the error bound of the result. 3.7 Future Work The algorithms presented in this chapter, combined with the implementation of addition and multiplication of High Precision Taylor Models, form the basis of a new, unified datatype in the COSY INFINITY programming environment. The work of implementing all required helper functions and related operations for this datatype is still ongoing. Once finished, this will result in a new major version of COSY INFINITY version 10. 114 3.7.1 Automated Plausibility Testing To automatically check that the results of these algorithms are indeed plausible (a test for correctness is unfortunately not possible), Ravi Jagasia and I are working on a special version of COSY INFINITY that after each intrinsic operation will call the external high precision interval library MPFI[40]. The argument to the intrinsic will be converted to an MPFI number, evaluated using the MPFI provided intrinsic function, and then compared to the result computed by the algorithms as described above. If the resulting intervals do not overlap, this is an indicator that clearly one of them is wrong. If this happens an error and debugging information is printed and the program is aborted. This method is very slow and inefficient. It is not intended for everyday use, but as a development tool to test the plausibility of the results of our intrinsic operations while performing real world computations. 115 Chapter 4 Fixed Points In this chapter, various algorithms for finding and verifying the existence of fixed and periodic points in discrete maps are described. In the rest of this chapter, we will use the terms periodic point and fixed point somewhat interchangeably, considering how a periodic point of period m of a given map M is a fixed point of the map M m . More precisely, given a continuously differentiable map M : Rn → Rn , we want to identify verified enclosures of periodic points x of period m of that map, i.e. a point x ∈ Rn such that M m (x) = x. The direct analytical proof, of course, would be to compute the exact periodic point x. In exact arithmetic, applying the map M m times then yields the initial argument x. Unfortunately, this direct method is only possible for simple maps where the periodic point 116 can be easily calculated in closed form. As soon as this becomes impossible, that way of proving the existence of a periodic point cannot be employed anymore. Furthermore, if performed on a computer, round-off errors in floating point numbers, as well as overestimation in any verified numerical method will cause the result M (x) to always differ from the initial argument x. Non-verified floating point arithmetic in particular is subject to misleading results. Consider, for example, the following map M (x) = x + 1 + 10−20 − 1. 
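A quick non-verified experiment in ordinary double precision, evaluating the map exactly as written, makes the problem concrete (a minimal sketch):

```python
def M(x):
    # evaluated exactly as written, without algebraic simplification
    return x + 1 + 1e-20 - 1

for x in [0.0, 0.5, 123.456]:
    print(x, M(x) == x)   # prints True for every x: each one looks like a fixed point
```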
This map clearly simplifies to the expression M (x) = x + 10−20 , which does not have any fixed point. However, when evaluated in floating point arithmetic, every x will appear to be a fixed point. This is because in floating point arithmetic either already the addition of 10−20 is below the round-off threshold for x + 1, or if it is not then the subsequent subtraction of 1 will put it there. In the following, several methods will be introduced to rigorously prove the existence of a periodic point of a given map M using rigorous numerical methods. In particular, we are interested in methods that do not require exact knowledge of the periodic point, but rather allow us to prove that a certain subset K ∈ Rn contains a periodic point. That way, if K is small enough, one can obtain very good approximations to the position of a periodic point. Furthermore, the uniqueness of the periodic point within K is subject of our study. 117 4.1 General Concepts We begin by introducing some general mathematical theorems that prove the existence of fixed points under various conditions. 4.1.1 Brouwer’s Fixed Point Theorem To only prove the existence of one or more fixed points in certain sets K, one can make use of the following version of the well known Brouwer fixed point theorem[13] (sometimes also called the Schauder fixed point theorem): Theorem 4.1 (Brouwer Fixed Point Theorem). Let a continuous map M : Rn → Rn and a convex compact set K ⊂ Rn be given. If M (K) ⊂ K then K contains at least one fixed point x of M . Note that this statement of the theorem uses the well known fact that any convex compact set in Rn is homeomorphic to a closed unit ball of some dimension k n[20]. The conditions of Brouwer’s theorem are easily verified by rigorous computations, which makes this theorem very useful for our purposes. All that is required are certain bounding routines, which are readily available in rigorous computations. 4.1.2 Banach’s Fixed Point Theorem Probably the best known fixed point theorem in mathematics is the Banach fixed point theorem[3]. Theorem 4.2 (Banach Fixed Point Theorem). Let a K be a compact metric space and M : K → K a continuous map of the space into itself. If M is a contraction, then K 118 contains exactly one unique, attractive fixed point x of M . Unlike the Brouwer fixed point theorem, this theorem can be applied to prove the existence and uniqueness of a fixed point. Note, however, that the Banach Fixed Point Theorem assumes a map of some compact metric space into itself. Taking this space to be K ⊂ Rn used in Brouwer’s Fixed Point Theorem, the Banach Fixed Point Theorem can be viewed as an extension of Brouwer’s Fixed Point Theorem. For a mere proof of existence, it is numerically more favorable to use Brouwer’s theorem for several reasons. The Banach fixed point theorem requires knowledge of a Lipschitz constant k for the map. Often this is obtained through the derivatives of the map. Unfortunately, it is not always convenient to obtain an analytic expression of the derivative of a map. But even if the derivative is known, evaluating the derivative of a high period iteration of a map by the chain rule is computationally expensive. 4.1.3 Contraction Maps Definition 4.3 (Contraction Map). Recall that a map M is said to be a contraction on K if ∃k, 0 k < 1 such that |M (x) − M (y)| < k |x − y| ∀x, y ∈ K. 
(4.1)

The numerical verification of the contraction property (4.1), unfortunately, is not quite as straightforward as the verification of the inclusion mapping property in Theorem 4.1. The following provides an overview of various ways to assert that a map M satisfies equation (4.1), and discusses their fitness for use in verified numerical computations.

4.1.3.1 Operator Norm We begin with the following definition of the operator norm of a matrix:

Definition 4.4 (Matrix Operator Norm). Let A be a real valued n × n matrix. The matrix operator norm of A is given by

||A|| = sup_{x ≠ 0} |A · x| / |x| = sup_{|x| = 1} |A · x|,

where the vector norm | · | is the Euclidean norm on Rn (or Cn). It is easy to see that due to the linearity of A both expressions are equivalent.

Lemma 4.5. Let C ⊂ Rn be a non-empty, convex, and compact set and f be a continuously differentiable function on C such that ||Df(x)|| < k < 1 ∀x ∈ C. Then f is a contraction on C.

Proof. Given any two points x, y ∈ C, consider the function G : [0, 1] → R given by the dot product

G(t) = (f(y) − f(x)) · f((1 − t)x + ty).

Applying the Mean Value Theorem to this function, one finds that for some ξ ∈ (0, 1)

(f(y) − f(x)) · (f(y) − f(x)) = G(1) − G(0) = G′(ξ) = (f(y) − f(x)) · (Df((1 − ξ)x + ξy) · (y − x)).

Applying the Cauchy–Schwarz inequality to the expression on the right hand side one thus obtains

|f(y) − f(x)|² ≤ |f(y) − f(x)| · ||Df((1 − ξ)x + ξy)|| · |y − x|,

and hence |f(y) − f(x)| ≤ ||Df((1 − ξ)x + ξy)|| · |y − x| < k |y − x|.

Corollary 4.6. Let K ⊂ Rn be non-empty, compact, convex and M : K → K continuously differentiable with ||DM(x)|| < 1 ∀x ∈ K. Then ∃! x ∈ K such that M(x) = x.

Proof. Apply Lemma 4.5 and the Banach fixed point theorem.

The problem with Corollary 4.6 is to determine a rigorous upper bound for the operator norm of the derivative. Definition 4.4 defines what the value is, but provides no direct way of computing it for a given matrix. To that end, recall the following standard result of operator theory[15]:

||A|| = √( λmax(A∗A) )     (4.2)

where A ∈ Cn×n, A∗ denotes the adjoint operator, i.e. the transpose conjugate of A, and λmax denotes the largest eigenvalue of a matrix. Unfortunately, while A∗A can be computed easily, determining the largest eigenvalue of a matrix rigorously is non-trivial in general. This is particularly true in higher dimensions. For the special case of dimension 2, however, the eigenvalues can be computed by an analytic expression. So in this special case this approach is rather straightforward and provides an analytic way to compute the matrix norm and verify the contraction property.

4.1.3.2 Infinity Norm Since direct calculation of the operator norm of a matrix is not straightforward, another option is to use a more readily computed quantity to provide an upper bound of ||A|| for A ∈ Cn×n. Consider, for example, the following upper bound:

||A|| ≤ √n · ||A||∞     (4.3)

where ||A||∞ indicates the maximum absolute row sum of A, i.e.

||A||∞ = max_{i=1,...,n} Σ_{j=1}^{n} |a_{i,j}|.     (4.4)

Proof. For any x ∈ Rn we have, using |x_j| ≤ |x| for each component,

|Ax|² = Σ_{i=1}^{n} ( Σ_{j=1}^{n} a_{ij} x_j )² ≤ Σ_{i=1}^{n} ( max_{i=1,...,n} Σ_{j=1}^{n} |a_{ij}| · |x| )² = n |x|² ( max_{i=1,...,n} Σ_{j=1}^{n} |a_{ij}| )² = n ||A||∞² |x|².

Clearly, ||A||∞ is easily computed in a verified setting for any matrix A. Unfortunately, in general this inequality is a rather crude upper bound of the operator norm ||A||.

4.1.3.3 Coordinate Transformation Using inequality (4.3), it is possible to obtain a much more accurate condition to test for the contraction property of M under certain circumstances.
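Both quantities in (4.3) and (4.4) are trivial to compute. The following non-verified sketch, using an arbitrary illustrative matrix, shows how crude the bound can be, which is precisely the situation the coordinate transformation described in this subsection is designed to remedy.

```python
import math

def norm_inf(A):
    # maximum absolute row sum, equation (4.4)
    return max(sum(abs(a) for a in row) for row in A)

def upper_bound_opnorm(A):
    # the crude bound ||A|| <= sqrt(n) * ||A||_inf of inequality (4.3)
    return math.sqrt(len(A)) * norm_inf(A)

A = [[0.9, 0.05],
     [0.1, 0.8]]
print(norm_inf(A), upper_bound_opnorm(A))

# for comparison, the exact operator norm from equation (4.2) for this 2x2 example
B = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]  # A^T A
tr, det = B[0][0] + B[1][1], B[0][0] * B[1][1] - B[0][1] * B[1][0]
lam_max = (tr + math.sqrt(tr * tr - 4 * det)) / 2
print(math.sqrt(lam_max))   # noticeably smaller than the crude bound above
```

For this particular matrix the crude bound exceeds 1 even though the true operator norm is below 1, so the contraction property could not be verified from (4.3) alone.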
Since we are only interested in showing the uniqueness of a fixed point, it is irrelevant in what coordinate system the map is a contraction. The following approach to compute the operator norm ||A|| does so by changing into a more suitable coordinate system. Let T : C → Rn be an invertible, smooth coordinate transformation. Then the following are equivalent: • f has a unique fixed point in C • F = T ◦ M ◦ T −1 has a unique fixed point in T (C). This is clear since for x ∈ C such that M (x) = x one finds that T (x) satisfies F ◦T (x) = T (x) and vice versa for y ∈ T (C) such that F (y) = y, T −1 (y) satisfies M ◦ T −1 (y) = T −1 (y). Let the derivative of F be decomposed into a purely diagonal part L and a purely offdiagonal part N such that DF = L + N (4.5) Then applying the triangle inequality yields ||DF || ||L|| + ||N ||. The operator norm ||L|| is simply the largest absolute value of the diagonal entries of L, while ||N || can be estimated by inequality 4.3. Let H denote the convex hull of T (C). By the assumptions of Corollary 4.6, Theorem 123 4.1 applies and yields at least one fixed point of M in C and hence a fixed point of F in T (C) ⊂ H. If ||DF || < 1 on H, Lemma 4.5 applied to the function F on H establishes the uniqueness of that fixed point and thus the uniqueness of the corresponding fixed point of f in C ⊂ T −1 (H). Note that if T can be chosen such that the derivative of F on T (C) is approximately diagonalized, the entries in N are small compared to those in L, and the overestimation introduced by the estimation of ||N || is negligible compared to ||L|| and the bound in ||DF || will be a rather tight one. Furthermore, if T (C) itself is already convex, there is no additional overestimation introduced by evaluating the derivative over the convex hull of T (C). Such a coordinate transformation T can be found relatively easily if DM (x) is diagonalizable at some point x ∈ C. Then T can be chosen to be an approximation of the matrix of eigenvectors. In that case, also T (C) will be convex if C was convex, as T is a linear transformation. To obtain a sharp bound, the derivative is only required to be approximately diagonal. It is thus not necessary to compute eigenvalues or eigenvectors rigorously. Only the matrix inversion T −1 has to be performed rigorously. 4.2 Periodic Point in Near Standard H´non Map e We now apply the mathematical tools presented above to an example. In the following, we consider one of the simplest maps exposing chaotic behavior, the H´non map[21], given by e H:    2 R  → R2   (x, y) → (1 + y − Ax2 , Bx).  124 In the original map studied by H´non himself, and many others after him, the parameters e were chosen to be A = 1.4 and B = 0.3. We consider a slightly different value A = 1.422 while keeping B = 0.3. In this map, the existence of an attracting period 15 point was suggested by non-verified numerical experiments by S. Newhouse[39]. The coordinates of this suggested periodic point are approximately x ≈ −0.0869282203452939, y ≈ 0.2391536750716747. As this point seems to be attractive, simple iteration of the H´non map, starting with e those coordinates, or even just the origin, improves the approximate coordinates until the change between iterations is of the size of floating point errors. In standard double precision calculations this will yield about 16 valid decimal digits. Non-verified floating point calculations help gaining some insight into the general behavior of this periodic point. 
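This non-verified experiment can be reproduced with a few lines of ordinary double precision code (a minimal sketch; the starting values are the approximate coordinates quoted above, and the printed digits are whatever the double precision iteration happens to produce):

```python
A, B = 1.422, 0.3

def henon(x, y):
    # the near standard Henon map H considered in this section
    return 1.0 + y - A * x * x, B * x

# start at the (non-verified) approximation of the suspected period 15 point
x, y = -0.0869282203452939, 0.2391536750716747
for k in range(1, 5):
    for _ in range(15):        # one application of H^15
        x, y = henon(x, y)
    print(f"H^{15 * k}: x = {x:.16f}  y = {y:.16f}")
```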
Simply iterating the coordinates given above through the map H^15 a few times in double precision yields the following interesting results:

H^15:  x = −0.0869282203454442   y = 0.2391536750716964
H^30:  x = −0.0869282203452939   y = 0.2391536750716747
H^45:  x = −0.0869282203454442   y = 0.2391536750716964
H^60:  x = −0.0869282203452939   y = 0.2391536750716747
...

Note how the last few digits alternate between two different points. Looking at this result naively, one may suspect that the periodic point is actually of period 30 instead of period 15. It will turn out, however, that these oscillations are caused by floating point round-off errors and have no mathematical significance. This just happens to be an orbit for which this particular effect appears. The location of this orbit within the attractor of this Hénon map is shown in figure 4.1.

Figure 4.1: The attractor of the Hénon map with A = 1.422 and B = 0.3 (blue) with the orbit of the suspected period 15 periodic point (green). The point used for verification in Section 4.2.2 is highlighted in red.

4.2.1 Existence of a Periodic Point

Given a non-verified approximation of a suspected attractive fixed point, the following method provides a verified proof thereof[48]. Let M be a continuously differentiable map. Without loss of generality, we assume the presumed periodic point of order m is near the origin. If that is not the case, we simply consider the new map

Mnew(x) = M(x + z) − z,     (4.6)

where z is an approximation of the presumed periodic point. This new map is still continuously differentiable, and will have a presumed periodic point of the same order m near the origin. Furthermore, if Q is a regular n × n matrix, then instead of M^m we can consider the conjugated map

M̂(x) = Q^−1 ◦ M^m ◦ Q (x).     (4.7)

By continuity, if M̂ has a fixed point x0 near the origin, then M has the periodic point Q(x0) of order m, which is also near the origin.

Let K be an n-dimensional interval vector K = ([−ε, ε], . . . , [−ε, ε]), where ε > 0 is small. Being a Cartesian product of closed intervals, K is certainly compact and convex for any choice of ε. We now want to show that M̂(K) ⊂ K. Once this has been proven, applying Brouwer's theorem 4.1 leads to the conclusion that there is a fixed point of M̂ in K and hence a periodic point of period m of M in Q(K). Note that, depending on the map M and the choice of ε, the proof may not yield a conclusive result. If M̂(K) ∩ K ≠ ∅ but M̂(K) ⊂ K does not hold, then no statement can be made about the existence or non-existence of a periodic point.

4.2.1.1 Preconditioning

Assume now the Jacobian J of M^m at the origin is known approximately through some non-verified numerical computation, e.g. using non-verified differential algebra techniques. Assume furthermore that all eigenvalues of J have norm less than 1. Then there is a linear transformation Q such that Q^−1 J Q contracts each face of the box K towards the origin for sufficiently small ε. This preconditioning by Q transforms, to first order, the attractive region around the origin into a rectangle, instead of a more general parallelogram. With this preconditioning, the proof will succeed in cases that would otherwise not yield a conclusive result.

Consider, for example, in dimension 2 the linear map Φα given by

Φα(x) = α · ( −1  2
               0  1 ) · x

with 0 < α < 1. The eigenvalues are obviously ±α, and thus less than one in absolute value, so the origin is an attractive fixed point.
Yet the point (−1, 1) is mapped to α · (3, 1), which depending on the value of α can even grows in norm, and thus for such values of α Φα is not a contraction. After a change of coordinates such that the coordinate axes become eigenvectors, however, the map becomes a contraction. For attractive fixed points, the change into eigen-coordinates is sufficient to ensure contraction. For hyperbolic or repelling fixed points, a more general preconditioning method is described in Section 4.3.1. 4.2.1.2 Rigorous Coordinate Transformation When performing the preconditioning described above, it is necessary to perform a translation and an affine transformation. While this is easy to do mathematically, some care has to be taken when this is done in rigorous computer arithmetic. The translation is simple, as the calculation of the inverse is just the negation of the periodic point coordinates, which is exact in floating point arithmetic. The more complicated part is the linear transformation into eigen-coordinates. To obtain 129 this transformation, the approximate eigenvectors of the map at the periodic point have to be determined first. Here it is not necessary to be rigorous, as the proof above clearly is correct with any regular linear transformation. To obtain the expansion of the map to first order, we use non-verified, automatic differentiation through the DA data type in COSY INFINITY, but any other numerical or analytical method can be used. Once the matrix representation of the linear part of the map is obtained, any of the well known numerical methods to calculate eigenvalues and eigenvectors can be applied. This allows the construction of the matrix transforming eigen-coordinates into Cartesian coordinates. To complete the coordinate transformation, this matrix has to be inverted. This inversion has to be performed rigorously to ensure the resulting matrix really represents a rigorous enclosure of the inverse. In this example, the H´non map is only two dimensional and e thus the inversion of the matrix is easily performed analytically. Instead of performing the calculations in floating point arithmetic, either interval or Taylor Model arithmetic is used. That way, one obtains a matrix of intervals or Taylor Models, which can be used to rigorously transform from Cartesian coordinates into eigen-coordinates. Once this is done, a rigorous representation of M can be constructed. 4.2.1.3 Proof using Intervals Following the discussion above, using intervals in principle is a straightforward procedure. First, a small interval box K is chosen. Then K is mapped through M and the result M (K) is verified to lie within K. With intervals, this test can easily be performed. All that has to be done is to compare the boundaries of the resulting intervals for each coordinate with the corresponding coordinates boundaries of the initial interval box K. 130 However, due to the overestimation in interval arithmetic, the initial box as a whole will not be mapped into itself right away. In particular, if the map M is of high order, there will be significant dependency introduced in the interval arithmetic, and the result of M (K) will be several orders of magnitude larger than the smallest enclosure of the mathematically correct result. Nonetheless, as long as there is some overlap between M (K) and K, there is a chance that K could actually be mapped into itself. To overcome this problem, it is necessary to split the starting box K into smaller boxes and map each of these boxes separately. 
As the error in interval arithmetic decreases linearly with the box size, we can expect that, after a sufficient number of splits, the small boxes will eventually map into the original box K. Clearly, all of K is mapped into itself if and only if every small sub-box is mapped entirely into K. For the actual implementation, certainly one does not want to split the box into small pieces of a fixed, predefined size. The best box size is a priori unknown, and, in fact, the optimal size of the split boxes varies depending on the position within the starting box K. Pieces close to the center can be larger than pieces along the boundary of K. Instead of pre-splitting, an algorithm that automatically determines the optimal box size for the region under consideration is preferable. To implement this efficiently, it is best to use a stack based approach. Whenever a given box fails to map fully into K, it is split and the process is continued with the newly created boxes until all boxes were successfully mapped into K. The complete algorithm for the verification procedure with automatic box size control is as follows: 1. Start with initial box K on the stack 131 2. Take the top box S off of the stack and evaluate M (S) 3. Compare M (S) and K • If M (S) ⊂ K, discard S as it has been successfully mapped into K. • Otherwise, if M (S) K = ∅, split S along each coordinate axis once, and push the resulting four pieces onto the stack. • If neither of the above is true, we have M (S) K = ∅. In this case, declare failure and stop, as we just mapped a piece of K completely outside of K. 4. Continue with 2 until there are no more boxes on the stack. While this looks trivial, there is a pitfall hidden in the implementation of this algorithm. For the interval based algorithm to work, we require relatively small boxes. In all but the most simple cases, the split boxes have to be so small, that double precision intervals are not precise enough to carry out the operations anymore. In 4.6 x + z is calculated, where z is the presumptive periodic point, and x is one of the small boxes. In typical maps such as the H´non map, the coordinates of the periodic point e z will be of order one, whereas the box size can easily be of order 10−15 , depending on the required accuracy. It is evident, that this addition cannot be carried out precisely in double precision. As a result, as the search space is split into “smaller” pieces, those pieces are then effectively lost due to the limited precision in the coordinate transformation. That is the reason why an interval package with high precision (e.g. the High Precision intervals introduced in Chapter 3) is required to carry out these calculations. 132 4.2.1.4 Proof using Taylor Models The proof using Taylor Models differs significantly from the interval version of the proof. Taylor Models significantly reduce the overestimation that caused the intervals to grow rapidly. This is due to the drastic reduction of the dependency problem intrinsic to Taylor Model arithmetic. Thus, instead of mapping a very small box by successively splitting it, just one single box will be mapped. This is already sufficient to map the box into itself and thus to conclude the proof. To set up the calculation, first the map M is generated with an inverse Taylor Model matrix as described in Section 4.2.1.2. Next, two Taylor Models xi and yi with only a linear coefficient in the first and second independent variable, respectively, are created. 
The image of the domain of these two Taylor Models then represents an initial box centered around the origin. Lastly, these Taylor Models are mapped by M in Taylor Model arithmetic, and the image of the resulting Taylor Models xf and yf is verified to lie within that of xi and yi . COSY INFINITY provides intrinsic functions to calculate an outer interval enclosure of the range of a Taylor Model, so one can easily obtain intervals containing the images of xf and yf . To verify that these two intervals actually lie within the range of the initial Taylor Models is a little more complicated. To test for enclosure, it first is necessary to obtain an inner interval enclosure of the image of the initial Taylor Models xi and yi . In this case, those enclosures can be obtained easily, since the initial Taylor Models are constructed to have a simple structure. The initial Taylor Models only consist of one linear coefficient and a remainder bound. To obtain an inner enclosure one can simply take the remainder bound and add, in interval arithmetic, the linear coefficient extracted from the Taylor Model. Thus one obtains an 133 enclosure of the upper bound of the image of the Taylor Model. Taking the lower bound of that interval yields an upper bound of the inner enclosure. Applying the same technique, but this time subtracting the linear coefficient from the remainder bound, and taking the upper bound of the resulting interval one obtains a lower bound for the inner enclosure (see figure 4.2). With the inner enclosure of the initial Taylor Models known, comparing those two inner interval bounds of the image of xi and yi to the outer interval enclosure of the image of xf and yf is trivial. If the outer range enclosures of the final coordinates are fully contained within the inner range enclosures of the initial coordinates, the initial box was mapped into itself. 4.2.2 Implementation and Results Using COSY INFINITY[32], we implemented the algorithms described above for the H´non e map using the traditional double precision Taylor Models, as well as High Precision Intervals and High Precision Taylor Models (see Chapter 3). This allows a direct side-by-side comparison of the strengths and weaknesses of each of these verified numerical tools. Implementation in each case is straightforward, since in two dimensions the calculation of eigenvectors and eigenvalues, as well as the matrix inversion, can be performed analytically. To ensure the matrix inversion is rigorous, it is carried out using either Taylor Models with only a constant part, or intervals. The result is a matrix with Taylor Model or interval entries, respectively. For ease of notation, in the following results we will denote an interval of the form [1.2345678, 1.2346789] by 1.2346789 to both conserve space and make the width of the in5678 134 terval more clearly visible. 4.2.2.1 Double Precision Taylor Models Theorem 4.7 (Existence of Period 15 Point in H´non Map). Given the H´non map with e e parameters A = 1.422 and B = 0.3, there exists at least one periodic point of period 15 within the interval X = 1.195780721557596 , 58008577504 Y = 0.05052194963414509 , 49328335698421 Proof. For the following calculations, Taylor Models of order 10 in two independent variables were used. By repeated iteration of the origin by the map H in double precision arithmetic, a suitable candidate for the presumed periodic point is obtained. Mathematically there is no preference for any particular point in the orbit of the periodic point candidate. 
Numerically, however, it is favorable to choose a point that has a well conditioned eigenvector matrix. If that is the case, the inversion can be carried out nicely, and will not produce much overestimation. After a non-verified heuristic analysis of the eigenvectors at each point in the orbit, we chose the following periodic point candidate: x ≈ 1.195769365067588, y ≈ 0.05050761649554453. Starting with the origin, this approximation required about 400 iterations through H 15 , or 135 equivalently 6000 iterations through H. Using automatic differentiation with the DA datatype of COSY INFINITY, the linearization of H 15 is computed, and the following approximation of the eigenvector matrix is derived    0.8852161763463854 0.2504328280425341   , −0.4651798804061557 0.9681339776284161 which is then rigorously inverted using Cramer’s rule in rigorous Taylor Model arithmetic on Taylor Models with only constant parts. The x and y coordinates, in eigen-coordinates, of the initial box are chosen to be Taylor Models with only a linear coefficient of 10−5 in the first and second independent variable respectively. This initial box is then transformed from eigen-coordinates into Cartesian coordinates to yield the periodic point enclosure in Cartesian coordinates. This Taylor Model enclosure then is bounded, yielding the interval representation given in the statement of the theorem. To conclude the proof, the initial box in Cartesian coordinates is mapped 15 times by H, converted back to eigen-coordinates, and then tested for inclusion in the initial box as described in Section 4.2.1.4. Note that, while seemingly counter-intuitive, an initial box as big as 2 · 10−5 is required to successfully carry out the proof in double precision Taylor Model arithmetic. In fact, due to the specific structure of Taylor Models, the verification process fails if the initial box is chosen significantly smaller, due to the order loss effect (see Section 2.3.4). This problem can be overcome by using High Precision Taylor Models as described in Section 4.2.2.3. It is also worth noting that the computational time required for this proof was negligible, 136 taking only fractions of a second on a 2 GHz Core 2 Duo processor. The whole initial box is mapped at once, requiring only 15 consecutive evaluations of the map H in Taylor Model arithmetic and a few simple operations afterwards to verify the enclosure property. 4.2.2.2 High Precision Intervals To obtain a more precise rigorous enclosure of the periodic point, we also implemented an interval version of the proof. As described in Section 4.2.1.3, it is necessary to use a High Precision interval package to successfully carry out this proof using intervals. We use the High Precision Intervals introduced in Chapter three. Theorem 4.8 (High Precision Enclosure of Period 15 Point in H´non Map). Given the e H´non map with parameters A = 1.422 and B = 0.3, there exists at least one periodic point e of period 15 within the interval x = 1.1957693650675503360411009839655489 3523372355948068010530037073508396832853 , 10139 y = 0.0505076164955646488882884801756161 016841426808283706281410555165782294397960 . 1531331 Before proving this theorem, an estimate of the computational complexity of the proof can be made. 
Mapping a single interval box around the presumed periodic point from eigen-coordinates into Cartesian coordinates, then through H 15 in interval arithmetic, and 137 back to eigen-coordinates, results a blow-up in the size of the box by a factor of about 1300 in x, and 1100 in y. This is not uncommon for interval arithmetic, and is to be expected given the type of map and the high period of the periodic point. Mathematically H 15 is a polynomial of order 215 = 32768. Of course the full expansion of H 15 is never computed or used, instead H acts 15 times on the argument. While this reduces the total number of operations carried out significantly, in interval arithmetic this still results in a significant amount of overestimation because of the dependency problem. These heuristics allow an estimate of the minimum number of boxes necessary to map the complete initial box into itself. Since the splitting always happens in both x and y, the maximal length the split boxes can have is about 1/1300 of the initial box size. Thus 13002 , or about 1.7 million, is a lower bound for the number of boxes required to successfully complete the proof. Proof. The calculations for this proof are carried out with intervals set to a precision of about 75 significant decimal digits. In the first step, a high precision approximation of the presumed periodic point is obtained by repeatedly iterating the origin through the map H. By iterating H until the same point as in 4.7 is reached, the following periodic point candidate is obtained: x ≈ 1.195769365067550336041100983965548935 23372355948068010530037073508396821495, y ≈ 0.0505076164955646488882884801756161016 841426808283706281410555165782292964645. 138 As before, using a double precision approximation of this point, automatic differentiation is used to calculate the linearization of H 15 around this point. This process yields the following matrix of approximate eigenvectors:    0.8852161763463209 0.2504328280405625   . −0.4651798804062782 0.9681339776289262 While this step is performed in double precision since at the time this code was written the DA datatype of COSY INFINITY did not support high precision yet, this does not pose any problem. The transformation only requires approximate eigenvectors for the verification to be successful. This eigenvector matrix is then rigorously inverted in interval arithmetic using Cramer’s rule on High Precision intervals. The initial box in eigen-coordinates is chosen to be the interval [−10−70 , 10−70 ], both in x and y. This initial box is then transformed from eigen-coordinates to Cartesian coordinates to yield the actual periodic point enclosure given in the statement of the theorem. To conclude the proof, the splitting-mapping-cycle described in Section 4.2.1.3 is initiated with the initial box on the stack and iterated until all boxes on the stack have been successfully mapped into the initial box. Note that, unlike the Taylor Model version of the proof, this process requires about 12 to 13 splits of the initial box in each direction in order to compensate the effects of overestimation in the interval arithmetic. This is consistent with the earlier heuristic estimate, as it results in a box size of about 1/212 ≈ 1/4000. The proof takes about 130 minutes on a 2 GHz Apple MacBook with 2 GB of RAM to complete. In that time, about 70 million boxes are successfully mapped into the initial box. 
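Schematically, the splitting-mapping cycle used in this proof can be organized as the following stack loop. This is a non-verified structural sketch only: boxes are plain tuples of floats, and map_box stands in for the rigorous evaluation of the conjugated map in High Precision interval arithmetic; in a verified setting the midpoints computed while splitting would also have to be handled with outward rounding.

```python
def contains(K, S):
    """Is box S contained in box K?  Boxes are tuples of (lower, upper) pairs."""
    return all(k[0] <= s[0] and s[1] <= k[1] for k, s in zip(K, S))

def intersects(K, S):
    """Do boxes K and S have a non-empty intersection?"""
    return all(s[0] <= k[1] and k[0] <= s[1] for k, s in zip(K, S))

def split4(S):
    """Bisect a two dimensional box along both coordinate axes (four pieces)."""
    (x1, x2), (y1, y2) = S
    xm, ym = (x1 + x2) / 2, (y1 + y2) / 2
    return [((x1, xm), (y1, ym)), ((x1, xm), (ym, y2)),
            ((xm, x2), (y1, ym)), ((xm, x2), (ym, y2))]

def verify_self_mapping(K, map_box):
    """Stack driven splitting-mapping cycle of Section 4.2.1.3 (schematic).

    map_box(S) must return a box enclosing the image of S under the
    preconditioned map; here it is a placeholder for the rigorous evaluation
    of H^15 in High Precision interval arithmetic.  Returns True once every
    piece of K has been verified to map into K.  Note that this sketch does
    not enforce a minimum box size, so it may split indefinitely if the
    self mapping property does not hold with some room to spare."""
    stack = [K]
    while stack:
        S = stack.pop()
        image = map_box(S)
        if contains(K, image):
            continue                 # this piece is verified, discard it
        if not intersects(K, image):
            return False             # a piece of K maps completely outside K
        stack.extend(split4(S))      # inconclusive: split S and try again
    return True
```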
139 4.2.2.3 High Precision Taylor Models Theorem 4.9 (High Precision Taylor Model Enclosure of Period 15 Point in H´non Map). e Given the H´non map with parameters A = 1.422 and B = 0.3, there exists at least one e periodic point of period 15 within the interval X = 1.19576936506755033604110098396 554893523372355948068010530037188 , 6812 Y = 0.0505076164955646488882884801 7561610168414268082837062814105797 , 403 Proof. For the following calculations, High Precision Taylor Models of order 11 with a precision of about 75 in two independent variables were used. By repeated iteration of the origin by the map H in High Precision Taylor Model arithmetic, a candidate for the same presumed periodic point as in 4.7 is computed: x ≈ 1.19576936506755033604110098396 554893523372355948068010530037, y ≈ 0.05050761649556464888828848017 56161016841426808283706281410555. Starting with the origin, this approximation required about 2400 iterations through H 15 , or equivalently 36000 iterations through H. Using automatic differentiation with the DA datatype of COSY INFINITY, the lin140 earization of H 15 is computed, and the following approximation of the eigenvector matrix is derived    0.8852161763462412 0.2504328280382762   . −.4651798804064298 0.9681339776295176 As in the case of High Precision intervals, this computation is performed using only double precision. This matrix is then rigorously inverted using Cramer’s rule in rigorous High Precision Taylor Model arithmetic on High Precision Taylor Models with only constant parts. The x and y coordinates, in eigen-coordinates, of the initial box are chosen to be High Precision Taylor Models with only a linear coefficient of 10−60 in the first and second independent variable respectively. This initial box is then transformed from eigen-coordinates into Cartesian coordinates to yield the periodic point enclosure in Cartesian coordinates. This High Precision Taylor Model enclosure then is bounded, yielding the interval representation given in the statement of the theorem. To conclude the proof, the initial box in Cartesian coordinates is mapped 15 times by H, converted back to eigen-coordinates, and then tested for inclusion in the initial box. The computational time required for this proof of a High Precision enclosure still is negligible, amounting to about one second on a 2 GHz Apple MacBook. As was the case with the double precision Taylor Models, the whole initial box is mapped at once, requiring only 15 consecutive evaluations of the map H in High Precision Taylor Model arithmetic and a few simple operations afterwards to verify the enclosure property. Almost all of this computation time is actually spent on the iteration of H to compute a suitable initial fixed point candidate. The actual verification step is virtually instantaneous. 141 4.2.3 Uniqueness and Attractiveness Now existence of at least one periodic point within the intervals given in the above theorems is established. To establish uniqueness, Corollary 4.6 is applied to the H´non map. e To evaluate the Jacobian of its fifteenth iterate, J(H 15 )(x, y), the chain rule is applied repeatedly, yielding: J(H 15 )(x, y) = J(H 14 )(H(x, y)) · JH(x, y) = JH(H 14 (x, y)) · JH(H 13 (x, y)) · · · JH(H(x, y)) · JH(x, y). To compute the necessary bound of the operator norm of the Jacobian matrix J(H 15 ) of the H´non map, equation 4.2 yields e J(H 15 ) = λmax J(H 15 )T J(H 15 ) , where λmax represents the largest eigenvalue. 
The eigenvalues of the 2×2 matrix J(H^15)^T J(H^15) are computed using the analytic expression for the roots of the characteristic polynomial

λ² − tr( J(H^15)^T J(H^15) ) λ + det( J(H^15)^T J(H^15) ) = 0,

given by

λ1,2 = [ tr( J(H^15)^T J(H^15) ) ± √( tr( J(H^15)^T J(H^15) )² − 4 · det( J(H^15)^T J(H^15) ) ) ] / 2.

Theorem 4.10 (Uniqueness and Attractiveness of Periodic Point). The enclosures of the period 15 point given in Theorems 4.7 and 4.8 contain exactly one periodic point of order 15. Furthermore, this periodic point is attractive within the given enclosures.

Proof. The Jacobian of the Hénon map H is given explicitly by

JH(x, y) = ( −2Ax  1
              B    0 ).

The following computations are carried out in double precision Taylor Model arithmetic with computation order 10. The initial Taylor Model is chosen such that it covers the entire enclosure given in the statement of Theorem 4.7. Computing JH^15 as described above results in a matrix with Taylor Model entries representing a rigorous enclosure of the correct Jacobian of H^15, valid over the entire fixed point enclosure. It is then trivial to compute the matrix product A = J(H^15)^T J(H^15). The eigenvalues of the resulting two dimensional matrix A are then calculated directly as Taylor Models and finally bounded by intervals. The resulting enclosures of the two eigenvalues e1 and e2 over the box given in Theorem 4.7 are:

e1 = [−0.8210105861696513 · 10^−9, 0.8210017711776557 · 10^−9]
e2 = [0.9002450764758135, 0.9809161063129307]

Since both are bounded in absolute value from above by 1, so is ||J(H^15)||, and the map indeed represents a contraction. Together with the previously proven self mapping property from Theorem 4.7, this establishes all requirements for the Banach fixed point theorem, thus concluding the proof.

First, note that by proving uniqueness within the rather large fixed point enclosure given in the statement of Theorem 4.7, uniqueness in the sharper high precision enclosures contained within it follows automatically. Note further that for this computation double precision Taylor Models are the natural choice. They allow for computations to be performed over large volumes without significant overestimation. While the computation could have been performed in high precision, there is no need for High Precision Taylor Models in this part of the proof.

4.2.4 Conclusion

The results from this example, shown in Table 4.1, nicely demonstrate the power of the High Precision Taylor Models introduced in Chapter 3. While the mathematics of the proof are the same for both the Taylor Model based implementations as well as the interval based implementation, the algorithms required to successfully complete the proof in the interval case are significantly more complicated due to the splitting of the initial box required. Furthermore, the time difference is striking. Both High Precision Taylor Models as well as regular double precision Taylor Models finish the proof in less than one second on an ordinary household computer. The High Precision intervals, however, take over two hours to complete the proof.

                          Double Precision    High Precision    High Precision
                          Taylor Models       Taylor Models     Intervals
Approx. Enclosure Width   10^−5               10^−60            10^−69
Computation Time          < 1 sec.            1 sec.            130 min.
Number of Boxes           1                   1                 ∼ 16.7 million

Table 4.1: Comparison of the performance of double precision Taylor Models, High Precision Taylor Models and High Precision intervals in this example.

High Precision Taylor Models very efficiently counteract the effect of order loss that prevents the computation of sharp enclosures in double precision Taylor Model arithmetic. However, High Precision Taylor Models still exhibit order loss, so that, at the same precision, High Precision intervals provide a slightly sharper enclosure at a much higher cost. Increasing the precision of the Taylor Model computation easily overcomes this problem.

4.3 Global Fixed Point Finder

In the previous example, a good approximation of the presumed fixed point was known, and only its verification had to be performed rigorously. Simple numerical techniques, such as Newton methods, have been around for a long time and are a well known tool for identifying isolated, approximate fixed points. However, these techniques suffer from a major drawback. Given a starting point, they converge to a single fixed or periodic point approximation, but there is no information obtained about possible other fixed or periodic points in a given area.

In more general terms, one can ask for sharp, verified enclosures of all periodic points of given order p > 0 of a map M in some compact set. Based on the verification tools developed in the previous section, such an automated global fixed point finder can be implemented[46]. As before, let K ⊂ Rn be a compact set and M : K → Rn, n ∈ N, a continuously differentiable map. The goal is to identify enclosures of all periodic points of given order p > 0 of M in K. That is, we want to find a collection X of sets Xi ⊂ K such that for each
We say that f and CA f are equivalent with respect to their fixed points. Assume x is a fixed point of f and hence CA f . We want to choose the matrix A in such a way, that x becomes a strongly attracting fixed point of CA f . To that extend, consider the derivative of CA f : D(CA f ) = A · (Df − I) + I. If Df (x) − I is a regular matrix, i.e. 1 is not a singular value of Df , then setting A = −(Df (x) − I)−1 yields D(CA f )(x) = −I + I = 0. Hence with this choice of A, CA f has a strongly contracting fixed point at x. Note that 147 this is true independently of the type of fixed point the map f has at x. Furthermore, if Df − I is not regular, it is possible to perturb it a little such that it becomes invertible. In that case, the derivative of CA f will not vanish completely, but if the perturbation was is small enough, CA f will still be a strong contraction. Thus, by applying Theorem 4.1 on CA f and a non-empty, convex, compact set C ⊂ K around x, it is possible to verify the required self mapping property much more easily than by direct application to f . The operator CA is also very advantageous in the proof of uniqueness of a fixed point in C using Theorem 4.2. Since A is chosen such that the derivative of CA f vanishes at the fixed point x, by continuity the ∞-norm of D(CA f ) is small on all of C, provided C itself is sufficiently small. It is then possible to apply Lemma 4.5 to easily show uniqueness of the fixed point of CA f , and thus of f , in C by bounding D(CA f ) over C and utilizing inequality 4.3. The effect of this preconditioning is shown in figure 4.3. The dashed box has width 10−4 in each direction in cartesian coordinates and is centered around one of the period 15 points of the H´non map used in Section 4.2. The solid black object indicates the return of that e box after 15 iterations. Without preconditioning, the resulting object is mostly linear with relatively small non-linear curvature. Clearly, the image does not map into the original box. In the preconditioned case, the same initial box is shrunk into a very highly non-linear object at least one order of magnitude less in size than the original box. 148 !" !# !# !" Figure 4.2: The image of a Taylor Model box (solid black), the interval bounds Ix and Iy , and the rigorous inner enclosure (hatched). Figure 4.3: The effect of preconditioning. A box of size 10−4 × 10−4 in a H´non map and e its 15-th iterate without (left) and with (right) preconditioning. 149 4.3.2 Implementation The operator described above provides a versatile method to classify single boxes around a fixed point. In the following, we extend this to an algorithm for a global fixed point finder as described in the introduction to this Section. To easily satisfy the requirement of Theorem 4.1 that sets must be convex, compact and non-empty, our implementation will only operate on interval boxes. Interval boxes are subsets of Rn given by a non-empty, finite interval in each component, i.e. B = [b1 , b1 ] × [b2 , b2 ] × · × [bn , bn ] Clearly, interval boxes are convex, and if bi , bi ∈ R ∀i = 1, . . . , n they are compact, and if bi < bi ∀i = 1, . . . , n they are non-empty. Let K be the interval box within which one wants to identify all fixed points. The simplest method is to divide the search space into an even grid of fixed size boxes, and then checking each single small box for a fixed point using the classification just presented. 
However, even in two dimensional maps, that approach often is prohibitively expensive in terms of computations required. In order to compute all fixed points in a box of size 1 × 1 to an accuracy of 10−4 , already 104 · 104 = 108 boxes must be checked. Instead of fixing a grid, it is much more advantageous to adaptively select the size of boxes to test in each region of the map based on the complexity of the map in that region. We use a stack-based “divide-and-conquer” style algorithm to successively remove parts of the search space that are guaranteed not to contain fixed points. This algorithm is similar in structure to the algorithm used for the High Precision interval verification in Section 4.2.1.3. Each box is classified based on the following criteria: 1. Boxes that are mapped completely outside of themselves, i.e. boxes B for which it can 150 be verified that M (B) ∩ B = ∅, certainly contain no fixed points. These boxes are discarded without further testing. 2. Boxes for which there is no conclusive result, i.e. boxes for which M (B) ∩ B = ∅ but not M (B) ⊂ B. These boxes may or may not contain a fixed point, and they are split into smaller boxes which are pushed onto the stack. Analysis then continues with one of those smaller boxes. 3. Boxes which contain a fixed point, i.e. boxes for which M (B) ⊂ B. These boxes may be split further to obtain sharper enclosures if their size is too large. Otherwise, these boxes are then tested for uniqueness of their fixed point using Corollary 4.6. At this point, it is intentionally left open how exactly the “size” of an interval box is measured. Such a measure depends on the application. Commonly employed measures include the volume of the interval box, or the length of the longest side. Furthermore, it is not necessary at this point to specify an algorithm for the “splitting” of a box, as long as the employed algorithm results in a reduced measure of each of the split boxes. A simple splitting algorithm is to divide one of the longest sides of the box in half. To guarantee that the above search for fixed points finishes, there are certain preprescribed thresholds relating to the size of the boxes under consideration. The first is the desired maximum size of the verified fixed point enclosures, denoted by Ω1 . All boxes that contain or may contain fixed points will be at most of this size. Boxes larger than this size in step 3 are split further. The second constant, Ω2 , provides a lower limit on the size of boxes that are considered. Once the size of a box falls below this threshold, it will not be split any further, independently of its current status. We will discuss these limits in more detail in Section 4.3.4. 151 4.3.2.1 Algorithm The basic algorithm for the verification procedure with automatic box size control is as follows: 1. Start with initial box K on the stack 2. Take the top box S off of the stack and classify the box: • If S is verified to contain no fixed points, discard it. • Else, if the size of the S is larger than Ω1 , split S and push the two resulting pieces onto the top of the stack. • Else, if the size of S is larger than Ω2 , split along the longest side of S and push the two pieces onto the top of the stack. • Else, if there is a unique fixed point in S, store S in the list of found unique fixed points. • Else, if there is at least one fixed point in S, store S in the list of possibly non-unique fixed points. • Otherwise, place S in the list of undecided boxes. 3. 
In practice, the classification of a box S is performed in several steps, checking after each step whether further tests are necessary. The following order of tests is the most efficient:

1. Compute M(S) in Taylor Model arithmetic. If M(S) ∩ S = ∅, stop, as there is certainly no fixed point of M in S.

2. Compute an approximate A = −(DM(x) − I)⁻¹ in floating point arithmetic, for some x ∈ S. If the inverse does not exist or is very badly conditioned, set A to a more suitable “approximate” inverse, or just the identity.

3. Compute CA M(S) in Taylor Model arithmetic. If CA M(S) ⊄ S, stop, as existence is not verified.

4. Compute D(CA M)(S) in Taylor Model arithmetic using a provided analytic expression for DM, and subsequently test that the bound on the operator norm of the derivative is less than one to verify uniqueness.

4.3.3 Analysis

The use of Taylor Models to carry out the verified computations makes the computation of the matrix A in the construction of the operator CA particularly easy. The initial Taylor Model used to evaluate M consists only of a constant part and a linear part in one variable for each dimension. Thus, very good approximations of the derivatives at the center point of the box can be obtained directly from the first order coefficients of the resulting Taylor Model. Due to the differential algebra approach of Taylor Models, no separate computation is necessary.

The tests for inclusion of a Taylor Model in the original set are done by computing interval bounds on each Taylor Model. In general, it is true that interval enclosures of arbitrary Taylor Models suffer from the wrapping effect, and are substantially larger than the Taylor Model itself. In this particular case, however, the use of interval bounds and the associated wrapping effect introduces no additional overestimation beyond that of the quality of the generated interval enclosure. This is because the initial box S is itself an interval box. Thus, for CA M(S) to lie within S, each separate coordinate bound of CA M(S) must lie in the corresponding interval of S.

Note that the use of Taylor Models for these computations is essential for another reason. For the contraction operator CA to work properly, it is required to have strong control over the dependency problem. This is because CA is constructed above such that the first derivative is nearly cancelled out. This cancellation, however, can only occur if it is computed symbolically, e.g. using Taylor Models. With mere interval evaluation, the dependency problem caused by this cancellation would lead to vast blowup of the resulting interval, and thus render the operator ineffective.

The reader familiar with verified global optimization will find the above type of algorithm very familiar. Most verified global optimizers, such as for example COSY GO[33], use very similar methods to successively examine the entire search space and prune regions where there is no extremum. However, it is worth pointing out one big difference between such global optimizers and our fixed point finder. Global optimizers produce a list of small enclosures where they cannot rule out the existence of an extremal point. However, there is usually no statement made about which box contains the global extremum, or whether a box really contains one at all.
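The importance of the symbolic cancellation can be seen in a small toy computation of mine: for the one dimensional map f(x) = 2x − 1 with fixed point 1 and A = −(f′ − 1)⁻¹ = −1, naive interval evaluation of the preconditioned map x + A·(f(x) − x) blows the initial interval up, while the symbolically cancelled form is simply the constant 1. (Python sketch with plain float endpoints and no outward rounding.)

    # Dependency problem illustration for the preconditioned operator
    # C_A f(x) = x + A*(f(x) - x), toy 1D map f(x) = 2x - 1, A = -1.
    def iv_add(a, b): return (a[0] + b[0], a[1] + b[1])
    def iv_sub(a, b): return (a[0] - b[1], a[1] - b[0])
    def iv_scale(c, a):
        lo, hi = c * a[0], c * a[1]
        return (min(lo, hi), max(lo, hi))

    X = (0.9, 1.1)                                   # initial interval around x = 1
    fX = iv_sub(iv_scale(2.0, X), (1.0, 1.0))        # f(X) = 2X - 1 = (0.8, 1.2)
    naive = iv_add(X, iv_scale(-1.0, iv_sub(fX, X))) # x + A*(f(x) - x), intervals only
    print("naive interval evaluation  :", naive)     # (0.6, 1.4): blown up, not inside X
    # symbolic cancellation (as a Taylor Model performs it): x - (x - 1) = 1
    print("after symbolic cancellation:", (1.0, 1.0))  # contained in X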
Our global fixed point finder, in contrast, not only produces a list of enclosures of fixed points, but also verifies and classifies them in the same step, thus typically guaranteeing the existence and in most cases also the uniqueness of a fixed point in each resulting box.

4.3.4 Constants

As described above, there are two constants controlling the behavior of the algorithm with respect to box sizes. The first constant, Ω1, is the maximum size of boxes stored in the results list. Any boxes larger than this size are either discarded because they do not contain fixed points, or are split into smaller boxes to be analyzed. To the user, this parameter specifies the desired accuracy of the fixed point enclosures. The second constant, Ω2, is the minimum size below which a box is never split any further. Boxes that have been split below this size, and could not be classified yet, are added to the “unclassified” list. From the user’s perspective, this value is a cutoff at which the algorithm gives up its attempts to classify a box. Therefore, resulting boxes in any of the three lists (unique fixed points, fixed points, and undecided) will always range in size between Ω1 and Ω2.

The ideal choice of these constants is problem dependent and can greatly affect the performance of the overall algorithm. In general, of course, Ω1 > Ω2 should hold. In practice, however, there is a lower limit for the size of Ω1 below which the algorithm becomes significantly less effective due to limited floating point precision caused by the order loss effect (see Section 2.3.4). In our experiments, a good choice for the constant Ω1 in double precision computations is about 10⁻⁷.

4.3.5 Example

To demonstrate the effectiveness of this global periodic point finder, we apply the algorithm to the standard Hénon map with A = 1.4 and B = 0.3. The approximate attractor is covered by 8 interval boxes as shown in blue in figure 4.4.
Figure 4.4: All period 11 periodic points (green) and the two fixed points (black) within the initial boxes (blue) in the standard Hénon map.

Within these initial boxes the
global periodic point finder was started to identify all fixed points of period 11. The computation took 54 seconds running on a single core of our Intel Xeon X5677 system at 3.5 GHz. There were 155 resulting periodic points found in these boxes. This corresponds to 14 orbits of period 11 and one of the two fixed points of the Hénon map. Figure 4.4 shows each of these periodic points in green; they were computed to an accuracy of about 10⁻⁵, which is below the printer resolution. Computations were carried out using 5-th order double precision Taylor Models.

4.4 Application to Beam Physics

The study of fixed points and periodic points of maps, such as transfer maps for beam lines, can reveal a lot of information about the behavior of such a dynamical system. In accelerator physics, undesirable chaotic behavior in an accelerator is typically observed near periodic points. One major design goal in accelerator construction is therefore to avoid low-period periodic points in the transfer map. In order to do so, it is of course first necessary to find the fixed and periodic points of a given map. Automated detection of such periodic points can be of great help with this task. To illustrate this, we will now apply the rigorous fixed point finder to a real world transfer map of the Tevatron accelerator at Fermilab. This map is represented as a polynomial of order 11, presenting particular numerical challenges to overcome.

4.4.1 Implementation

The transfer map is an 11-th order map, computed by Pavel Snopok using the COSY INFINITY beam physics package[42]. The model of the Tevatron used is a slightly simplified model consisting of all dipole, quadrupole and sextupole bending magnets, but omitting certain corrective beam line elements, such as electrostatic separators and solenoids. This omission still produces a very close approximation of the real dynamical system. Moreover, it results in a symmetric map in which the x-a plane is invariant.

Figure 4.5: Tracking of evenly spaced particles through the transfer map in the x − a plane (1400 iterations, 5 mm spacing between rings).

Figure 4.6: Location of the periodic points identified in the transfer map. In red are two sets of period 5 points, in blue are particles trapped in the islands around the elliptic period 5 points. Two period 8 points are in black at the top left and bottom right corner.

For this application, we restrict ourselves to the two dimensional motion in the invariant x-a plane. For this purpose, the two dimensional map of the x-a plane into itself is extracted from the full four dimensional map. Using the particle tracking facilities of COSY’s beam physics package, a tracking picture of the system using evenly spaced initial particle coordinates is generated (figure 4.5). Estimating the region of interest to be within the region shown, [−0.09, 0.09] × [−0.06, 0.06], the periodic point finder is started on that region.

Table 4.2 shows the number of enclosures containing a unique periodic point found for periods up to 20. Like all COSY beam maps, this map is origin preserving, since the origin represents the reference trajectory. Hence for every period there is at least one periodic point identified, namely the origin. Similarly, periodic points of period 5 are also listed for periods 10, 15, and 20.
But since all of the periodic point enclosures identified contain exactly one periodic point, and we know that periodic points of lower period must be present, we can subtract those points of lower period from the results. The count of periodic points without those duplicate points is given in the second to last column. This calculation rigorously proves that there are only periodic points of periods 5 and 8 within the region of interest. Note that for the period 5 points, there are two separate sets of 5 points each, while for the period 8 points only two were found. The missing 6 points of their orbit lie outside the search area and thus are not identified. Figure 4.6 illustrates the location of all periodic points identified up to period 20.

Due to the splitting algorithm we chose in our implementation, always cutting along the longest side of a box, the resulting periodic point enclosures are roughly square. Their area is less than the parameter Ω1 specified, given in the area column. As expected due to the broken scaling law, it was necessary to increase the targeted area as the period increases to guarantee successful verification. Note, however, that even the enclosures at period 20 are still accurate up to an error of about 10⁻⁶. The time given is the CPU time consumed for the search, rounded to the nearest full second and measured running on a single core of our Intel Xeon X5677 system at 3.5 GHz. It is interesting to note the vast differences in execution times between different periods. Particularly periods 13 and 18 take significantly longer than all other periods. This phenomenon will be analyzed in more detail in the next section.

4.4.2 Tune Shifts

The Tevatron map analyzed exhibits a linear tune of about 0.5915. The accelerator was intentionally designed such that the linear tune is close to the fundamental resonance 3/5 = 0.6. Normal form theory predicts a shift of the tune as the distance from the reference orbit increases. Those tune shifts can help explain why we found period 5 periodic points. Normal form transformations based on differential algebra methods[6] applied to our beam transfer map M : R² → R² provide a coordinate transformation T : R² → R² such that the map T ◦ M ◦ T⁻¹ takes elliptic orbits into circular orbits around the origin up to a given order, and the tune shift t(r) only depends on the radius of the circle[5]. When the tune shift t(r) attains a rational value of the form n/k with n, k ∈ N, periodic points of period k are expected to appear in the map in normal form coordinates. Consequently, the original map M should also exhibit periodic points of the same period.

Figure 4.7: Polynomial expansion of the tune shift computed to order 3, plotted against the amplitude in normal form coordinates; the resonances 3/5, 5/8, 8/13, and 11/18 are marked.

Figure 4.8: Polynomial expansion of the tune shift computed to order 5.

Figure 4.9: Polynomial expansion of the tune shift computed to order 7.

Figure 4.10: Polynomial expansion of the tune shift computed to order 9.

Figure 4.11: Polynomial expansion of the tune shift computed to order 11.

We used the COSY INFINITY beam physics code to compute an expansion of the tune shift in normal form coordinates[6] to various orders. In figures 4.7 through 4.11, those tune shifts are plotted as a function of the radius from the reference orbit in normal form coordinates. Also marked are several interesting resonances. The wild oscillations of the function at different orders for large radii indicate that the polynomial expansion of the tune shift does not converge very well for values above approximately 0.06. However, for values below about 0.05 the plots are in good agreement, suggesting good convergence in that region.

The last column of Table 4.2 shows the nearest fraction n/p, n ∈ N, for each period p that is larger than the tune of the Tevatron, 0.5915. We chose to list larger values as the tune appears to be shifting upwards in the tune shift plots. As mentioned before, the resonance for the period 5 points is by design very close to the tune of the Tevatron. This is because then other low order resonances automatically occur significantly further away from the Tevatron tune. The next two closest resonances, occurring for periods 13 and 18, are 8/13 ≈ 0.615 and 11/18 ≈ 0.611. Interestingly enough, those two periods were also the two periods taking the longest for the global periodic point finder to treat, taking up to 12 times longer than all other periods. This indicates that those periods are “closer” to producing periodic points, i.e. points do return to within very close proximity of their origin. However, our periodic point finder proved that none actually return exactly to themselves (except for the origin). This complicates the search for our algorithm and results in significantly longer execution times.

From the normal form tune shift plots in figure 4.7 we can determine the radius at which the tune shift reaches 3/5 = 0.6. In all orders it is consistently located at about 0.0387. At this tune in the normal form we would expect period 5 points to appear. To compare this prediction with the results of the periodic point finder, we transformed the coordinates (x, a) of the center of one of the identified period 5 enclosures into normal form coordinates (X, A) = T(x, a) and computed its radius R = √(X² + A²).
The point we chose for this is x ≈ 0.00137 and a ≈ −0.0295, which, in COSY’s normal form coordinates, becomes X ≈ 0.00631 and A ≈ −0.0375, and thus has a radius R ≈ 0.0381. As can be seen, both the radius predicted from the normal form tune shift plot as well as the periodic point identified by the periodic point finder are in agreement within about 2% of each other. 165 Unfortunately, this is the only valid prediction that can be made from the tune shift plot. In all of the plots, the next tune, 11/18, seems to be crossed. However, the periodic point finder proved that there are no periodic points of period 18 related to this tune within the search region. This is probably caused by the slow convergence of the normal form transformation due to the closeness of the linear tune to the 3/5 resonance. This causes the tune shift expansions shown above to not converge for the values at which 11/18 would be obtained. Converting one of the period 8 points we identified by the same normal form coordinate transformation as used for the period 5 point above, we find that its amplitude would be about 0.238. This is clearly far beyond the convergence of the normal form expansion, hence its existence could not be predicted by the tune shift plots. This comparison highlights a general problem associated with the normal form methods applied to this type of problem. Since accelerators are intentionally designed to operate near one of the fundamental resonances, the convergence of the tune shift expansion is slow and the region in which reliable predictions can be made is rather small. Unfortunately, with non-verified computations it is not clear a priori how large that region is. Verified tools, such as this periodic point finder, can help in identifying rigorously such regions. 4.5 Conclusion We succeeded in implementing a fast, rigorous, global periodic point finder. The use of Taylor Models produces fast results even for high period points in numerically difficult maps. The main restriction to obtaining sharper enclosures and higher periods is the limited precision of the underlying double precision floating point numbers. We have shown that by using 166 High Precision Taylor Models the effect of order loss associated with limited precision can be addressed and very sharp enclosures are possible. Applying these techniques to dynamical systems such as beam transfer maps allows the automated analysis of those maps for regions of stability and aides in identifying regions of potential chaos. The rigorous character of the method presented augments the qualitative results obtained from existing methods of analysis. In future work, the rigorous global fixed point finder will be reworked to use High Precision Taylor Models. It can then be applied to a wide range of problems. In its reincarnation as a global rigorous verified zero finder, it can for example be used to compute rigorous enclosures of intersections between Taylor Models, which is useful in automatically enclosing homoclinic intersections e.g. between manifold enclosures. 
Period   Unique boxes   Area (Ω1)    Time (sec)   Periodic points   Tune
  1           1          4·10⁻¹⁹          0              1           0.592
  2           1          4·10⁻¹⁹          0              0           1
  3           1          4·10⁻¹⁹          0              0           2/3 ≈ 0.666
  4           1          4·10⁻¹⁹          0              0           3/4 = 0.75
  5          11          4·10⁻¹⁹          6             10           3/5 = 0.6
  6           1          4·10⁻¹⁹          0              0           4/6 ≈ 0.666
  7           1          4·10⁻¹⁹          0              0           5/7 ≈ 0.714
  8           3          4·10⁻¹⁹          6              2           5/8 = 0.625
  9           1          4·10⁻¹⁹          1              0           6/9 ≈ 0.666
 10          11          4·10⁻¹⁶         13              0           6/10 = 0.6
 11           1          4·10⁻¹⁶          1              0           7/11 ≈ 0.636
 12           1          4·10⁻¹⁶          2              0           8/12 ≈ 0.666
 13           1          4·10⁻¹⁶        107              0           8/13 ≈ 0.615
 14           1          4·10⁻¹⁶          2              0           9/14 ≈ 0.643
 15          11          4·10⁻¹⁶         21              0           9/15 = 0.6
 16           3          4·10⁻¹²         18              0           10/16 = 0.625
 17           1          4·10⁻¹²          4              0           11/17 ≈ 0.647
 18           1          4·10⁻¹²        359              0           11/18 ≈ 0.611
 19           1          4·10⁻¹²         10              0           12/19 ≈ 0.632
 20          11          4·10⁻¹²         33              0           12/20 = 0.6

Table 4.2: Periodic points found for periods from 1 through 20 in the region [−0.09, 0.09] × [−0.06, 0.06] in the Tevatron transfer map.

Chapter 5

Rigorous Manifold Enclosures

In this chapter we will introduce methods to obtain tight, verified enclosures of local invariant manifolds of various dynamical systems. This work is based on previous work by Johannes Grote in his dissertation[19] and in a joint paper[47], which describes a method to compute verified enclosures of manifolds of planar diffeomorphisms. We extend the methods introduced there to provide an algorithm for the construction of a polynomial manifold approximation in arbitrary dimension for both discrete maps and autonomous ordinary differential equations, with proper treatment of resonances. We then proceed to describe the rigorous verification of a very sharp enclosure of the invariant manifolds of dynamical systems in three dimensions, which can be extended to higher dimensional cases as well. Lastly, we examine the inclusion of parameter dependence in the manifold generation and verification, to construct polynomial approximations of the manifolds of maps depending on parameters.

5.1 General Concepts

First, we will briefly introduce the general concept of invariant manifolds and describe some well known properties thereof. The concepts introduced here will be used in the construction and verification of the manifolds later on. While well known, proofs will be provided for some of the theorems presented in this section, as they provide valuable insight into invariant manifolds and their properties.

5.1.1 Invariant Manifolds of a Diffeomorphism

In the following, let f : Rⁿ → Rⁿ be a diffeomorphism, i.e. f is a bijection such that both f and f⁻¹ are continuously differentiable. Furthermore, assume f(0) = 0 and that the eigenvalues λi, i = 1, ..., Nu and µj, j = 1, ..., Ns, Nu + Ns = n, of Df|₀ are real and satisfy

|λi| > 1 ∀i = 1, ..., Nu,    |µj| < 1 ∀j = 1, ..., Ns.

Without loss of generality, let the eigenvectors corresponding to λi be aligned with the xi-axis and the eigenvectors corresponding to the µj be aligned with the x_{Nu+j}-axis. Given the above assumptions, this can always be achieved by a suitable linear coordinate transformation.

Remark 5.1. The restriction to real eigenvalues is not of a mathematical nature. The following theorems hold as well for complex eigenvalues (see e.g. [14]). However, in this research we restrict ourselves to real eigenvalues to simplify the analysis in the following sections. An expansion of the results presented here to the general case of complex eigenvalues is very likely possible and may be explored in a future paper.

Definition 5.2 (Invariant Stable and Unstable Sets of a Diffeomorphism).
For the map f as defined above, let Ms denote all points x ∈ Rn such that ∀x ∈ Ms lim f k (x) = 0. k→∞ Equally, let Mu denote all points x ∈ Rn such that ∀x ∈ Mu lim f −k (x) = 0 k→∞ Certainly Mu and Ms are non-empty sets, since 0 ∈ Mu and 0 ∈ Ms . Where necessary, we denote by Ms (f ) the stable set of the map f , and similarly by Mu (f ) the unstable set of the map f . Theorem 5.3 (Invariant Manifold Theorem for Diffeomorphisms). The invariant sets Ms and Mu as defined above are Ns and Nu dimensional manifolds respectively. Furthermore Mu at 0 is tangent to the linear expanding eigenspace of Df |0 spanned by e1 , . . . , eNu and Ms at 0 is tangent to the linear contracting eigenspace of Df |0 spanned by eNu +1 , . . . , en . If f ∈ C r then also Mu , Ms ∈ C r . Proof. See, for example, [24, 14]. From Definition 5.2 one can infer the following lemma: Lemma 5.4. The manifolds of f and f k for any k ∈ N are identical. 171 Proof. Let x ∈ Ms (f ). Then {xn }n∈N = {f n (x)}n∈N by definition converges to 0. n Hence the subsequence {xkn }n∈N = { f k (x)}n∈N also converges to 0. Thus Ms (f ) ⊂ Ms (f k ). Conversely, let x ∈ Ms (f k ) and 1 > ε > 0. Then since f is continuously differentiable in U = B 1 (0), it is Lipschitz in U with some Lipschitz constant L. Chose 0 < δ < ε . By the Lk definition of Ms (f k ) there exists some N0 ∈ N such that for all N > N0 f kN (x) ∈ Bδ (0). By the Lipschitz condition, we have the following estimate f (f kN (x)) = f (f kN (x)) − f (0) < L f kN (x) − 0 < Lδ. By induction we hence find that for i = 1, . . . , k − 1 f i (f kN (x)) < Li δ, since all Li δ < ε < 1 and hence f i (f kN (x)) ∈ U . But then all f i (x) ∈ Bε (0) for all i > N0 k, and since ε > 0 is arbitrary this shows that limn→∞ f n (x) = 0 and thus Ms (f k ) ⊂ Ms (f ). Replacing f by f −1 in the above argument proves the same for the unstable manifold Mu . 172 5.1.2 Invariant Manifolds of autonomous ODEs Consider now the manifolds of an autonomous dynamical system given by the continuously differentiable vector field F : Rn → Rn through the ordinary differential equation d x = F (x). dt Furthermore, assume F (0) = 0 and the eigenvalues λi , i = 1, . . . , Nu and µj , j = 1, . . . , Ns , Nu + Ns = n, of DF |0 are real and satisfy λi > 0 ∀i = 1, . . . , Nu , µj < 0 ∀j = 1, . . . , Ns . Without loss of generality, let the eigenvectors corresponding to λi be aligned with the xi -axis and the eigenvectors corresponding to the µj be aligned with the xNu +j -axis. Given the above assumptions, this can always be achieved by a suitable linear coordinate transformation. Denote by Φt (x) the time t map of the vector field F , i.e. Φ0 (x) = x and d Φt (x) = dt F (Φt (x)) ∀t ∈ R, assuming solutions exist for all time. Note that since F is continuously differentiable by assumption, it is Lipschitz and thus from the uniqueness of solutions of an ODE it follows that Φt (Φt (x)) = Φt +t (x). 1 2 1 2 We will also refer to the map Φt (x) for some fixed x as the orbit x(t) of x. For t 173 0 we refer to x(t) as the forward orbit of x, and similarly for t 0 as the backward orbit of x. Lemma 5.5. Any map Φt , with t > 0 has an Ns dimensional stable and Nu dimensional unstable manifold at the origin. Proof. The existence of stable and unstable manifolds for any positive real t follows directly from theorem 5.3 and the facts that Φt (0) = 0 and that DΦt (0) is diagonal with eigenvalues exp(λi t) > 1 , i = 1, . . . , Nu and exp(µj t) < 1 , j = 1, . . . , Ns . 
To see this, consider the integral equation formulation for Φt (x) at the origin where Φt (0) = 0. Then t Φt (x) = DΦt (x) = DΦt (0) = 0 t 0 t 0 F (Φs (x)) ds, DF (Φs (x)) · DΦs (x) ds, DF (0) · DΦs (x) ds. But DF (0) is a diagonal matrix with the λi , i = 1, . . . , Nu in the first Nu diagonal entries, and the µj , j = 1, . . . , Ns in the last Ns diagonal entries. Thus the solution of this matrix integral equation for the Jacobian matrix DΦt (0) is simply given by   0 exp(λ1 t)      .. DΦt (0) = exp (DF (0) · t) =  . .     0 exp(µNs t) Definition 5.6 (Invariant Stable and Unstable Set of an ODE). For the vector field F as 174 defined above, let Ms denote all points x ∈ Rn such that ∀x ∈ Ms lim Φt (x) = 0. t→∞ Equally, let Mu denote all points x ∈ Rn such that ∀x ∈ Mu lim Φ−t (x) = 0. t→∞ Theorem 5.7 (Invariant Manifold Theorem for ODEs). The invariant sets Ms and Mu given in Definition 5.6 for the vector field F are Ns and Nu dimensional manifolds respectively. The manifold Mu at 0 is tangent to the linear expanding eigenspace of DF |0 and Ms at 0 is tangent to the linear contracting eigenspace of DF |0 . Furthermore, the manifolds Ms and Mu are identical to the manifolds of the time one map Φ1 of F . Proof. Clearly, all points of the stable set of F are also in the stable manifold of the time one map Φ1 , i.e. Ms (F ) ⊂ Ms (Φ1 ). This is because for any point x ∈ Ms (F ) such that limt→∞ Φt (x) = 0, the subsequence {ΦN (x)}N ∈N also converges to 0. Conversely, consider any point x ∈ Ms (Φ1 ). It suffices to show for every ε > 0 there is some time T such that for t > T |Φt (x)| < ε. So fix some 1 > ε > 0. Since Φt (x) is continuously differentiable in both x and t for x ∈ B 1 (0) and t ∈ [0, 1], there exists a Lipschitz constant L such that Φt (x1 ) − Φt (x2 ) 1 2 L 175 (t1 − t2 )2 + |x1 − x2 |2 for all x1 , x2 ∈ B 1 (0) and t1 , t2 ∈ [0, 1]. Choose some δ such that 0 < δ < min(1, ε/L). Then by the definition of the stable manifold Ms (Φ1 ), there is some N0 ∈ N such that ΦN (x) < δ for all integers N N0 . This then implies that |Φt (x)| < ε for all times t > N0 . This holds true since such t can be written as N + ∆t, where N N0 is an integer and ∆t ∈ [0, 1). Letting x0 = ΦN (x), due to the Lipschitz condition |Φt (x)| = = ΦN +∆t (x) = Φ∆t (ΦN (x)) = Φ∆t (x0 ) Φ∆t (x0 ) − 0 = Φ∆t (x0 ) − Φ∆t (0) L |x0 − 0| = L |x0 | < Lδ < ε Since ε is arbitrary, this shows that limt→∞ Φt (x) = 0, and thus Ms (Φ1 ) ⊂ Ms (F ). Therefore Ms (F ) = Ms (Φ1 ). The same argument holds for the unstable manifold, replacing F by −F and Φ1 by Φ−1 . Corollary 5.8. The invariant manifolds of F at the origin are identical to the invariant manifolds of the time t0 map Φt of F at the origin for any time t0 > 0. 0 Proof. In the proof for theorem 5.7, replace the Lipschitz constant L by L valid for any times t1 , t2 ∈ [0, t]. Similarly, chose N0 ∈ N such that ΦN ·t (x) < δ for all N 0 N0 . Then, any time t can be written as N · t0 + ∆t where N ∈ N and ∆t ∈ [0, t0 ]. The rest of the proof is identical. Corollary 5.9. The invariant manifolds of any two time t maps Φt and Φt of F at the 1 2 origin are identical for any t1 , t2 > 0. 176 Proof. By transitivity and corollary 5.8, Ms (Φt ) = Ms (F ) = Ms (Φt ) and the same for 1 2 Mu . Definition 5.10. Let K ⊂ Rn such that 0 ∈ K, and F as before. We refer to the connected loc part of the invariant manifolds containing 0 within K as local invariant manifolds Wu for loc the unstable and Ws for the stable manifold. 
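As a quick, non-rigorous illustration of the computation used in the proof of Lemma 5.5, the following Python sketch integrates a toy three dimensional vector field with DF(0) = diag(1.0, −0.8, −1.3), approximates the Jacobian of the time t0 map at the origin by finite differences, and compares it to exp(DF(0)·t0). The vector field and all numerical parameters are my own choices and are not part of the verified machinery.

    import numpy as np

    lam, mu1, mu2 = 1.0, -0.8, -1.3

    def F(v):                                   # toy vector field with F(0) = 0
        x, y, z = v
        return np.array([lam*x + y*z, mu1*y + x*x, mu2*z + x*y])

    def rk4_flow(v, t, steps=1000):             # crude fixed-step RK4 integrator
        h = t / steps
        for _ in range(steps):
            k1 = F(v); k2 = F(v + h/2*k1); k3 = F(v + h/2*k2); k4 = F(v + h*k3)
            v = v + h/6*(k1 + 2*k2 + 2*k3 + k4)
        return v

    t0, d = 0.1, 1e-6
    # central finite differences for the Jacobian of the time t0 map at the origin
    J = np.column_stack([(rk4_flow(d*e, t0) - rk4_flow(-d*e, t0)) / (2*d)
                         for e in np.eye(3)])
    print(np.round(J, 6))
    print(np.round(np.diag(np.exp(np.array([lam, mu1, mu2]) * t0)), 6))

Both printed matrices agree to the accuracy of the finite differences, in line with DΦt(0) = exp(DF(0)·t).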
5.1.3 Notation For ease of notation we will introduce the following equivalence relation for two sufficiently differentiable functions f and g. We say f =n g if each of the first n derivatives of f and g at the origin are equal. Furthermore, we say α is a n-multi-index if it is a set of non-negative integers α1 . . . αn . If n is clear from the context, we will simply speak of a multi-index α. We define n |α| = αi , i=1 and given a vector x ∈ Rn and a n-multi-index α, we use the short hand notation n xα = α xi i . i=1 With that notation, it is possible to uniquely identify each coefficient aα of a n-dimensional polynomial P by a n-multi-index α, and write it as aα xα , P (x) = α 177 where x ∈ Rn . Furthermore, we use the notation aα xα |α|=k to denote a sum over all possible multi-indices α such that |α| = k, i.e. all terms of exact order k in P . 5.2 High Order Invariant Manifold Approximation in Discrete Systems In this section, we will develop a constructive algorithm to obtain very accurate high order polynomial expansions of the invariant manifolds of a hyperbolic point in arbitrary dimension. In case resonances between the eigenvalues are encountered, several methods are proposed to overcome those resonances and continue the construction. This method is loosely based on the parametrization method introduced in [14]. Let φ : Rm → Rm be a C r -diffeomorphism with the following properties: 1. There is a fixed point at the origin φ(0) = 0 2. The fixed point is hyperbolic and the map is diagonal with real eigenvalues:    Λ 0  Dφ(0) =   0 M 178 where       Λ=      λ1 0 0 0 0 0 0    λ2 0 0     .. . 0  0   0 0 λk with λ = λ1 , λ2 , . . . , λk T , |λi | > 1, i = 1 . . . k and   0 0 0  µk+1    0 µk+2 0 0  M =  ...  0 0 0   0 0 0 µm with µ = µk+1 , µk+2 , . . . , µm T           , 0 < |µi | < 1, i = k + 1 . . . m. Remark 5.11. Any diffeomorphism Ψ with a fixed point x and real, non-zero and non-unity eigenvalues of the derivative at x can be brought into this form by applying an affine coordinate transformation. By theorem 5.3 there exists a k dimensional unstable manifold and a m−k dimensional stable manifold of φ at the origin. The following describes an algorithm to obtain a polynomial approximation of this local manifold around the origin of any order up to r. Without loss of generality, only the unstable manifold construction will be considered here, the construction for the stable manifold is performed analogously. Let f : Rm → Rm be a polynomial such that f =r φ, i.e. f is the Taylor expansion up to order r of the diffeomorphism φ. In the construction of the polynomial manifold 179 approximation f will be used instead of φ. Note that f can then be written as    Λ 0  f (x) =   · x + f (x) 0 M (5.1) where f is a polynomial without constant or linear part. 5.2.1 Parameterization Let       γ p (t) =       p γ1 (t) p γ2 (t) . . . p γm (t)             where t =           t1    t2    ∈ Rk .  .  .   tk be the polynomial approximation to the manifold of order p. Since it is an approximation to the invariant manifold, we then demand that it satisfies the condition f (γ p (t)) =p γ p (P p (t)) where       P p (t) =      p P1 (t) p P2 (t) . . . p P (t) k (5.2)        , t ∈ Rk     is a polynomial of order p. 
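As a small illustration of condition (5.2), the following sympy check (a toy example of mine, not the construction itself) verifies that the simple candidate γ(t) = (ηt, 0) together with P(t) = λt satisfies the condition at order p = 1 for a two dimensional map with diagonal linear part; as the construction below shows, this pair is exactly the starting point of the induction.

    import sympy as sp

    t, eta = sp.symbols('t eta')
    lam, mu = sp.Rational(2), sp.Rational(1, 2)      # |lam| > 1 > |mu|

    def f(x, y):                                     # toy map: diagonal linear part
        return (lam*x + x*y, mu*y + x**2)            # plus purely nonlinear terms

    gamma = (eta*t, sp.Integer(0))                   # gamma^1
    P     = lam*t                                    # P^1
    lhs   = [sp.expand(c) for c in f(*gamma)]        # f(gamma(t))
    rhs   = [c.subs(t, P) for c in gamma]            # gamma(P(t))
    diff  = [sp.expand(l - r) for l, r in zip(lhs, rhs)]
    print(diff)                                      # every remaining term has order >= 2
    print([sp.series(d, t, 0, 2).removeO() for d in diff])   # -> [0, 0]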
It will become apparent during the following construction of the unstable manifold that 180 the introduction of P p is required to avoid resonances that can otherwise occur in dimensions higher than 2. Remark 5.12. The condition on the manifold parametrization given in equation 5.2 is not sufficient to produce a unique parametrization. Clearly, if γ p (t) and P p (t) satisfy equation 5.2, so do γ p (θ(t)) and P p (θ(t)), where θ : Rk → Rk is any C p diffeomorphism. 5.2.2 Construction We now proceed to give an inductive construction algorithm for such γ p and P p for any p r. First, consider the following polynomials for p = 1:  1  γ1 (t)    γ 1 (t)  2   . .  .    γ 1 (t) =  γ 1 (t)  k   1  γ (t)  k+1  .  . .    1 γm (t) 181                         =                      η1 t1    η2 t2    .  .  .    ηk tk  ,    0   .  .  .    0 that is a simple scaling in the first k variables and zeros for the rest, and    λ1 t1    λ t  22 P 1 (t) = Λ · t =   .  .  .  λk tk      .     The values ηi > 0, i = 1, . . . , k are free parameters in the construction, according to Remark 5.12. Their value determines the size of the derivative of the manifold approximation at the origin. They can be used to control the convergence of the higher order terms of the resulting polynomial, the smaller these values are chosen, the smaller the higher order terms of γ p and P p will be. Then it is apparent that condition (5.2) holds since  f γ1 t            Λ 0     = ·   0 M           η1 t1         η2 t2       .  .  .       +f ηk tk        0      .   .  .      0  η1 t1    η2 t2    .  .  .    ηk tk  =1    0   .  .  .    0                         λ1 η1 t1    λ2 η2 t2     . .  .    1 1 λk ηk tk  = γ P t     0   .  . .    0 because f only contains terms of order 2 or higher and hence f =1 0. 182 5.2.3 Induction step Assume γ p , P p satisfy condition (5.2) up to order p. Let an,α , bn,α be such that p+1 p γn t = γn t + an,α tα , |α|=p+1 p+1 p Pn t = Pn t + bn,α tα , |α|=p+1 p+1 p+1 i.e. they are the p + 1 order coefficients of γn and Pn respectively. It remains to identify values for an,α , bn,α such that condition (5.2) holds up to order p + 1. First note that for 1 n k fn γ p+1 t =p+1 fn γ p t an,α tα , (5.3) an,α tα . + λn (5.4) |α|=p+1 and equivalently for k < n m fn γ p+1 t =p+1 fn γ p t + µn |α|=p+1 This follows from equation (5.1) and the fact that f γ p+1 t =p+1 f γ p t since f has no constant or linear part, and γ p and γ p+1 have no constant parts respectively. 183 Secondly, by a similar argument, for 1 p+1 γn P p+1 t p =p+1 γn P p t n k an,α λα tα + ηn + |α|=p+1 and for k < n bn,α tα , (5.5) |α|=p+1 m p+1 γn P p+1 t p =p+1 γn P p t an,α λα tα . + (5.6) |α|=p+1 This holds since p+1 γn P p+1 t = p γn P p+1 t an,α P p+1 t + α |α|=p+1 p =p+1 γn P p+1 t an,α P 1 t + α |α|=p+1 = p γn P p+1 t an,α λα tα + |α|=p+1 because P p+1 has no constant part, and p γn P p+1 t = 1 γn P p+1 t p 1 + γn − γn P p+1 t p 1 1 =p+1 γn P p+1 t + γn − γn P p t   p  γ P p t + η α  n n |α|=p+1 bn,α t =p+1  p  γ P p t  n p 1 1 as γn − γn has no constant or linear part and γn is linear. 184 1 n k 1 we have |λα | > 1 and thus λα − µn = 0 as |µn | < 1. Thus the value of an,α is well defined. Next, consider the case 1 n k. 
Again performing term-by-term matching in (5.9) the following condition is derived λα − λn an,α + ηn bn,α = cn,α . (5.11) Clearly, the choice of an,α and bn,α is not unique in this case. In the following, two of the possible choices are discussed in more detail. 5.2.3.1 Gamma-Convention Obviously, unlike in the previous case, now it is not possible to chose an,α such that condition (5.11) is aways satisfied. This is contingent upon the resonance condition λα = λn . 186 (5.12) Consider the case where λα − λn = 0, i.e. there is no resonance for α and n. Then let bn,α = 0 and an,α = cn,α . λα − λn On the other hand, if λα − λn = 0, i.e. there is a resonance, let bn,α = cn,α ηn and an,α = 0. In both cases the values are well defined. Remark 5.13. It follows directly from this construction that there exists a p such that P p = P p for all p > p. That is, resonances can only occur up to finite order. This is because for p = |α| > p we have |λα | min(|λi |)p > min(|λi |)p > max(|λi |) i i i for large enough p and thus λα − λn = 0 for all 1 n k, |α| > p. Remark 5.14. Furthermore, in the two dimensional case, this resonance condition can never occur as there is only one expanding and one contracting eigenvalue. Yet even in higher dimensions, since there are only finitely many resonance conditions, in typical cases these 187 resonances do not occur. Consequently, with this construction P p is kept “very close” to P 1. 5.2.3.2 P-Convention An alternative choice is to follow the opposite path than the Gamma-Convention: Instead of keeping bn,α zero whenever possible, one can instead always make the choice bn,α = cn,α ηn and an,α = 0. By this construction, the first k components of γ t always stay the identity, while all the information about manifold propagation is contained in P t . Remark 5.15. The resulting manifold parametrization is identical to the one obtained by the graph transform method. In particular, for n = 1, . . . , k p Pn (t) =p fn (t). However, this construction not only provides a polynomial approximation of the graph, as the graph transform does, but also provides the polynomial expression for propagation on p that manifold in the Pn , n = k + 1, . . . , m. 188 5.2.4 Stable manifold To construct the stable manifold approximation, one possibility is to apply the above construction to the inverse map φ−1 . In practice, however, it is usually easier to perform the same construction just described while exchanging the roles of µ and λ. The required changes to the algorithm are straight forward, and will not be discussed here. 5.3 Invariant Manifolds of Autonomous ODEs We will now describe a method to construct and verify the invariant manifolds of an autonomous, i.e. not explicitly time dependent, ordinary differential equation. In the following, Let F : Rn → Rn be a C r vector field satisfying the conditions described in Section 5.1.2. That is, let F (0) = 0, and the eigenvalues λi , i = 1, . . . , Nu and µj , j = 1, . . . , Ns of DF |0 be real and satisfy λi > 0 ∀i = 1, . . . , Nu , µj < 0 ∀j = 1, . . . , Ns . with the eigenvectors corresponding to λi be aligned with the xi -axis and the eigenvectors corresponding to the µj be aligned with the xNu +j -axis. We then are interested in first generating a polynomial approximation of the stable and unstable manifolds of the critical point at the origin. Given that approximation, we then add a small thickening to it and verify that the result represents a uniformly verified C 0 enclosure of the manifold over a certain domain. 
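Before describing the generation and verification for ODEs in detail, the following sympy sketch illustrates the order-by-order construction of Section 5.2 for a toy two dimensional map, where by Remark 5.14 no resonances can occur and the Gamma-convention leaves P(t) = λt unchanged throughout. The map, the scaling η = 1 and the expansion order are my own choices; the actual implementation works within COSY's differential algebra framework rather than with sympy.

    import sympy as sp

    t = sp.symbols('t')
    lam, mu, eta = sp.Rational(2), sp.Rational(1, 2), sp.Rational(1)
    f1 = lambda x, y: lam*x + y**2          # toy map with diagonal linear part;
    f2 = lambda x, y: mu*y + x**2           # the nonlinear terms are my choice

    order = 6
    g1, g2 = eta*t, sp.Integer(0)           # gamma^1; P(t) = lam*t throughout
    for k in range(2, order + 1):
        a, b = sp.symbols(f'a{k} b{k}')     # unknown order-k coefficients
        c1, c2 = g1 + a*t**k, g2 + b*t**k   # candidate gamma^k
        # condition (5.2): f(gamma(t)) and gamma(lam*t) must agree to order k
        e1 = sp.expand(f1(c1, c2) - c1.subs(t, lam*t)).coeff(t, k)
        e2 = sp.expand(f2(c1, c2) - c2.subs(t, lam*t)).coeff(t, k)
        sol = sp.solve([e1, e2], [a, b])    # always solvable: lam**k != lam, mu
        g1, g2 = c1.subs(sol), c2.subs(sol)

    print(sp.expand(g1))
    print(sp.expand(g2))
    # residual of the invariance condition only contains terms above 'order'
    r = sp.expand(f1(g1, g2) - g1.subs(t, lam*t))
    print(sp.series(r, t, 0, order + 1).removeO())   # -> 0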
189 5.3.1 Manifold Generation Consider a Taylor expansion to order r of the vector field F around the origin of the form   Λ 0  F (x) =   · x + F (x) 0 M where Λ is a diagonal matrix with the eigenvalues λi in the diagonal and M is a diagonal matrix with the eigenvalues µj in the diagonal entries. Furthermore, F only contains terms of order 2 or higher up to some order r. We now proceed to compute a polynomial approximation of the manifold for this approximate vector field F . Due to corollary 5.8 and theorem 5.7, the manifold of the vector field F is identical to the manifold for any of the discrete time t maps Φt induced by F for any value t > 0. It is thus sufficient to compute an approximation of the manifolds of some time t0 map Φt with t0 > 0 for F and then apply the construction algorithm for discrete maps 0 described as described in Section 5.2 to Φt . 0 In order to compute the Φt map, a Taylor expansion of the flow of F around the origin 0 is computed. This can be done particularly nicely using a super-convergent fixed point iteration in differential algebra arithmetic[8]. Without going into the detail of this method here, suffice it to say here that it results in a Taylor expansion of Φt (x) in both t and x. To obtain the map Φt , all that is required is to evaluate this Taylor expansion for the value 0 t = t0 , resulting in a polynomial only in x representing Φt (x). 0 Note that the choice of the value of t0 > 0 is in principle arbitrary. However, one has to keep in mind that the Taylor expansion of the flow is only performed to a certain finite order. By inserting a specific value for the time t, care must be taken to ensure that the expansion 190 for the given time t0 still converges well enough. The larger the error in the approximation of the time t0 map Φt of the system, the worse the final approximation of the manifold will 0 be. On the other hand, the eigenvalues of the derivative of Φt at the origin are given by 0 expressions of the form exp(λi t0 ) and exp(µj t0 ). For the construction to succeed numerically without resulting in divisions by values very close to zero, these eigenvalues of Φt must be 0 separated sufficiently far away from 1. Thus t0 must not be chosen too small, in order to still satisfy this requirement. While somewhat problem dependent, we found that time steps of size 10−1 or 10−2 in our cases seem to strike a good balance between the two requirements, yielding very accurate enclosures while maintaining reasonably large eigenvalues. This construction yields a polynomial approximation γu : RNu → Rn of the unstable manifold, and similarly γs : RNs → Rn of the stable manifold. Both by construction satisfy the properties γu (0) = γs (0) = 0 and     0  du,1           ...     0              0   du,Nu      Dγu (0) =   and Dγs (0) =         ds,1 0              ...     0         0 ds,Ns where ds,i , du,j > 0 for i = 1, . . . , Ns and j = 1, . . . , Nu . 191 5.3.2 Manifold Verification In this Section, we describe how the unstable manifold approximation generated by the algorithm described in the previous section is outfitted with a small remainder bound and verified to form a rigorous enclosure of the local unstable manifold. The verification for the stable manifold is exactly analogous, using the vector field −F instead of F . The verification presented here will cover the two cases of a one dimensional and a two dimensional manifold in three dimensions. 
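To illustrate the generation step of Section 5.3.1 before going into the verification, the following sympy sketch computes a truncated Taylor expansion of the flow of a toy two dimensional polynomial vector field by plain Picard iteration, substitutes t = t0 = 1/10 to obtain the polynomial time t0 map, and checks that its linear part at the origin is close to exp(λ·t0) and exp(µ·t0) and hence well separated from 1. This only illustrates the idea; the actual implementation uses the super-convergent differential algebra fixed point iteration of [8], and the field, order and t0 here are my own choices.

    import sympy as sp

    t, s, x0, y0 = sp.symbols('t s x0 y0')
    lam, mu = sp.Rational(1), sp.Rational(-1)
    order, t0 = 4, sp.Rational(1, 10)
    F = lambda x, y: sp.Matrix([lam*x + y**2, mu*y + x**2])   # toy field, F(0) = 0

    # Picard iteration: after k steps the expansion of the flow is correct to order k in t
    phi = sp.Matrix([x0, y0])
    for _ in range(order):
        integrand = F(phi[0], phi[1]).subs(t, s)
        phi = sp.Matrix([x0, y0]) + integrand.applyfunc(lambda e: sp.integrate(e, (s, 0, t)))

    # truncate in t and substitute t = t0 to obtain the polynomial time t0 map
    phi = phi.applyfunc(lambda e: sp.series(sp.expand(e), t, 0, order + 1).removeO())
    phi_t0 = phi.subs(t, t0)
    print(sp.expand(phi_t0[0]))                    # first component of the time t0 map

    # its linear part at the origin approximates exp(lam*t0) and exp(mu*t0)
    J = phi_t0.jacobian(sp.Matrix([x0, y0])).subs({x0: 0, y0: 0})
    print([J[0, 0].evalf(), J[1, 1].evalf()],
          [sp.exp(lam*t0).evalf(), sp.exp(mu*t0).evalf()])

The resulting polynomial map is exactly the kind of object to which the discrete construction of Section 5.2 is then applied.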
The method described here does generalize to arbitrary dimensions, however the notation becomes quite cumbersome. Throughout the following, we denote by ex , ey , ez the unit vector along the corresponding axis. 5.3.2.1 Outline of the proof The general concept of constructing a manifold enclosure is to perform the following steps: 1. Each point on the polynomial manifold approximation, over some bounded domain, is outfit with a uniform thickening in each stable direction to yield a candidate for the manifold enclosure. We want to show that for each point on the polynomial approximation there is at least one point of the manifold within its thickening. 2. By examining the vector field on the boundary of this thickening, it is shown that the local manifold cannot leave the enclosure through the sides created by this thickening, but only through the ends of the polynomial manifold approximation. 3. By bounding the radial component of the vector field over a suitably chosen set, it is shown that any point on the local unstable manifold away from the origin must leave the manifold enclosure. 192 4. From this we will conclude that for any point on the initial polynomial approximation, there is a point of the manifold that lies within the thickening around this point. 5.3.2.2 One Dimensional Manifold in Three Dimensions Let γu (x) be the polynomial unstable manifold approximation as computed in the previous section. For ease of notation, we will drop the index u and refer to γu simply as γ from now on. Additionally, we will assume that the polynomial approximation is given as the graph of a function over the x-axis, i.e.    R →           R3    x      γ:   γ(x) =      y(x)                z(x) (5.13) where y, z : R → R are polynomials without constant or linear part. Remark 5.16. This requirement can always be satisfied since the manifold is tangent to the x-axis at the origin. Furthermore, such polynomials are easily obtained using the “Pconvention” in the construction as described in Section 5.2.3.2. If the polynomial is constructed with some scaling factor η1 in the x-coordinate (see section 5.2), without loss of generality one can consider the coordinate transform T (x, y, z) = (x/η1 , y, z) and the new vector field F (x, y, z) = T ◦ F ◦ T −1 as well as γ = T ◦ γ. 193 Definition 5.17. For some ε > 0 fixed, let Γ : [−1, 1] × [−1, 1]2 → R3 be defined as Γ(s, t1 , t2 ) = γ(s) + ε · t1 · ey + t2 · ez   s       = y(s) + εt  1    z(s) + εt2 (5.14) (5.15) That is, Γ is a uniform thickening of the unstable manifold approximation along the two stable directions. In the following, we will refer to Γ as both the function Γ(s, t1 , t2 ) as well as the set Γ representing the range of Γ(s, t1 , t2 ) over its domain. The exact meaning will be made clear by the context. loc Clearly Γ contains the local unstable manifold Wu as it contains a neighborhood of the origin since   1 0 0      DΓ(0, 0, 0) = 0 ε 0 and Γ(0, 0, 0) = 0.     0 0 ε We call the following objects the sides of Γ: ± Sy Γ(s, t) = Γ(s, ±1, t) ± Sz Γ(s, t) = Γ(s, t, ±1) where again s, t ∈ [−1, 1]. 194  }     Figure 5.1: The enclosure Γ for a curve γ(x) in R3 with its sides (blue) and ends (red) highlighted. All sides of Γ are collectively referred to as SΓ defined as − − + + SΓ = Sy Γ ∪ Sy Γ ∪ Sz Γ ∪ Sz Γ. Similarly, the ends of Γ are given by ± Ex Γ(t1 , t2 ) = Γ(±1, t1 , t2 ), where again t1 , t2 ∈ [−1, 1]. 
Both ends of Γ are collectively referred to as EΓ defined as − + EΓ = Ex Γ ∪ Ex Γ. Figure 5.1 illustrates the concepts of Γ and its sides and ends. loc Lemma 5.18. If the negative vector field F on SΓ always points outwards of Γ, Wu cannot not intersect SΓ. 195 ± Furthermore, the negative vector field F on Sy Γ points outwards Γ if for all s, t ∈ [−1, 1] ± ± ± −y(s) · Fx Sy Γ(s, t) + Fy Sy Γ(s, t) ˙ < 0 ± and similarly for Sz Γ ± ± ± −z(s) · Fx Sz Γ(s, t) + Fz Sz Γ(s, t) ˙ < 0. Proof. If the negative vector field F on SΓ always points outwards of Γ, the backward orbit loc x(t) of any point x ∈ SΓ leaves Γ. Hence Wu , for which the entire backward orbit is contained in Γ, cannot intersect SΓ. + Consider now the case of Sy Γ(s, t). The tangent vectors to this surface are given by      1  0         y(s) , 0 . ˙         z(s) ˙ ε   ˙ −y(s)     Thus the outward normal vector to the surface is given by N (s, t) =  1 . This vector     0 indeed points outward of Γ because it points in the positive y direction, and by definition of + Γ Sy Γ(s, t) is the boundary of Γ with the largest y value for fixed s, t. + + Thus the vector −F Sy Γ(s, t) points outward if N (s, t) · F Sy Γ(s, t) < 0 which is equivalent to + + −y(s) · Fx Sy Γ(s, t) + Fy Sy Γ(s, t) < 0. ˙ 196 The same analysis performed for the other three sides yields the remaining conditions. We now introduce another set C as follows: Definition 5.19. For some ϑ > 0 fixed, let C : [−1, 1] × [−1, 1]2 → R3 be defined as    1      C(s, t1 , t2 ) = s · ϑt  .  1   ϑt2 That is, C is a thin rectangular cone around the x-axis. In the following, we will refer to C as both the function C(s, t1 , t2 ) as well as the set C representing the range of C(s, t1 , t2 ) over its domain. The exact meaning will be made clear by the context. We call the following objects the sides of C: ± Sy C(s, t) = C(s, ±1, t) ± Sz C(s, t) = C(s, t, ±1) where again s, t ∈ [−1, 1]. All sides of C are collectively referred to as SC defined as − − + + SC = Sy C ∪ Sy C ∪ Sz C ∪ Sz C. 197     Figure 5.2: The set C with its ends EC (red) and sides SC (blue) highlighted. Similarly, the ends of C are given by ± Ex C(t1 , t2 ) = C(±1, t1 , t2 ), where again t1 , t2 ∈ [−1, 1]. Both ends of C are collectively referred to as EC defined as − + EC = Ex C ∪ Ex C. To illustrate these concepts, figure 5.2 shows an example of such a set C. Remark 5.20. Clearly C contains some open neighborhood of the local unstable manifold around the origin because C contains the origin and the manifold at the origin is tangent to the x-axis, while all sides of C have a non-zero slope in the x direction. loc Lemma 5.21. If the negative vector field F on SC\{0} always points outwards of C, Wu cannot not intersect SC\{0}. ± Furthermore, the negative vector field F on Sy C\{0} points outwards C if for all s ∈ 198 [−1, 1]\{0} and t ∈ [−1, 1] ± 1 ± ± −ϑ · Fx Sy C(s, t) + Fy Sy C(s, t) s < 0 ± and similarly for Sz C\{0} ± 1 ± ± −ϑ · Fx Sz C(s, t) + Fy Sz C(s, t) s < 0. Proof. If the negative vector field F on SC\{0} always points outwards of C, the backward loc orbit x(t) of any point x ∈ SC\{0} leaves C. Hence Wu , for which the entire backward orbit is contained in C, cannot intersect SC. + Consider now the case of Sy C(s, t). The tangent vectors to this surface for s ∈ [−1, 1]\{0} and t ∈ [−1, 1] are given by     1 0         ϑ, 0 . 
        ϑt ϑs   −ϑ     Thus the outward normal vector to the surface is given by N =  1  for s > 0. This     0 vector indeed points outward of C because it points in the positive y direction, and by + definition of C Sy C(s, t) is the boundary of C with the largest y value for fixed s > 0, t. + Similarly, for s < 0 Sy C(s, t) switches sides, and the outward normal vector here is given 199   −ϑ     by N = −  1 .     0 + + Thus the vector −F Sy C(s, t) points outward if N (s, t) · F Sy C(s, t) < 0 which is equivalent to    −ϑ · F S + C(s, t) + F S + C(s, t) < 0  x y y y   −ϑ · F S + C(s, t) + F S + C(s, t) > 0  x y y y for s > 0, for s < 0. Since s = 0, we can divide by s and unify this condition to read 1 + + −ϑ · Fx Sy C(s, t) + Fy Sy C(s, t) s < 0. This expression will later turn out to be easier to treat numerically, as well. The same analysis performed for the other three sides yields the remaining conditions. Lemma 5.22. If the following condition holds, the forward orbit of every point x ∈ C\{0} must leave C within finite time. Fx (x) · x > 0 ∀x = (xx , xy , xz ) ∈ C\{0} (5.16) Proof. Let x = (xx , xy , xz ) ∈ C\{0}. Denote by x(t) the forward orbit in the vector field F 200 such that x(0) = x. Then consider the expression d d |xx (t)|2 = xx (t)2 dt dt = xx (t) · xx (t) ˙ = Fx (x(t)) · xx (t). Thus while x(t) ∈ C, |xx (t)|2 is monotonously increasing. But since F is continuous and by virtue of condition 5.16, the expression Fx (x) · xx assumes a strictly positive minimum on every set C\Bδ (0) for any δ > 0. Since x = 0, we have that x ∈ C\Bδ (0) for some δ > 0. Thus d |xx (t)|2 is bounded from below by some positive constant κ > 0. Hence dt |xx (t)|2 > κt for any t such that ∀τ ∈ [0, t] x(τ ) ∈ C. Thus the orbit must leave C at some finite time, as in C by construction |xx | 1. Remark 5.23. For this argument to work, it is necessary to consider some set like C which includes the origin, but does not include a full neighborhood of it. This is because in any neighborhood B of the origin, this condition can never be satisfied, as B would include the local stable manifold, which by definition contains points that stay within B for all forward time. Thus a set has to be chosen that does not include any points of the stable manifold. This automatically precludes testing for these conditions in Γ and requires the introduction of C instead. Theorem 5.24 (Manifold Enclosure Theorem (1D)). If there exist ε, ϑ > 0 such that the conditions of the above lemmata 5.18 through 5.22 are satisfied, then also ∀s ∈ [−1, 1] ∃t1 , t2 ∈ loc [−1, 1] such that Γ(s, t1 , t2 ) ∈ Wu is satisfied. We call Γ a manifold enclosure. Proof. Claim: There exists some δ > 0 such that 201 1. the local unstable manifold can be written as a graph (x, y (x), z (x)) for x ∈ [−δ, δ], ˆ ˆ 2. the entire graph is contained in C, 3. the entire graph is contained in Γ. This is true because 1. of the Hadamard-Perron theorem, 2. the manifold is tangent to the x-axis at zero while the boundaries of C form a finite angle with the x-axis, 3. Γ contains an open neighborhood of the origin. Then consider the end point x = (δ, y (δ), z (δ)) of this graph. By lemma 5.22, the forward ˆ ˆ orbit x(t) of this point must leave C at some time T0 . By lemma 5.21 it cannot leave C through its sides, and thus must leave through the end where its x-coordinate is 1. 
Since the x-coordinate of the orbit is continuous, by the intermediate value theorem for every δ s 1 there is a 0 contained in Γ for all times t t∗ T0 such that xx (t∗ ) = s. By lemma 5.18 the orbit is T0 , when it leaves through the end of Γ. The same argument can be made for the point x = (−δ, y (−δ), z (−δ)) to complete the ˆ ˆ proof. 5.3.2.3 Two Dimensional Manifold in Three Dimensions Let γu (x, y) be the polynomial unstable manifold approximation as computed in the previous section. For ease of notation, we will drop the index u and refer to γu simply as γ from now on. Additionally, we will assume that the polynomial approximation is given as the graph 202 of a function over the x − y plane, i.e. γ:   2  R →           R3    x        γ(x, y) =      y                  z(x, y) (5.17) where z : R2 → R is a polynomial without constant or linear part. Remark 5.25. This requirement can always be satisfied since the manifold is tangent to the x-axis at the origin. Furthermore, such polynomials are easily obtained using the “Pconvention” in the construction as described in Section 5.2.3.2. If the polynomial is constructed with some scaling factors η1 , η2 in the x and y coordinate (see section 5.2), without loss of generality one can consider the coordinate transform T (x, y, z) = (x/η1 , y/η2 , z) and the new vector field F (x, y, z) = T ◦ F ◦ T −1 as well as γ = T ◦ γ. Definition 5.26. For some ε > 0 fixed, let Γ : [−1, 1]2 × [−1, 1] → R3 be defined as Γ(s1 , s2 , t) = γ(s1 , s2 ) + ε · t · ez   s1       =   s2     z(s1 , s2 ) + εt (5.18) That is, Γ is a uniform thickening of the unstable manifold approximation along the stable direction. 203 In the following, we will refer to Γ as both the function Γ(s1 , s2 , t) as well as the set Γ representing the range of Γ(s1 , s2 , t) over its domain. The exact meaning will be made clear by the context. loc Clearly Γ contains the local unstable manifold Wu as it contains a neighborhood of the origin since   1 0 0      DΓ(0, 0, 0) = 0 1 0 and Γ(0, 0, 0) = 0.     0 0 ε We call the following objects the sides of Γ: ± Sz Γ(s1 , s2 ) = Γ(s1 , s2 , ±1) − + where again s1 , s2 ∈ [−1, 1]. The union of both sides Sz Γ and Sz Γ is referred to as SΓ. Similarly, the ends of Γ are given by ± Ex Γ(s, t) = Γ(±1, s, t), ± Ey Γ(s, t) = Γ(s, ±1, t) where again s, t ∈ [−1, 1]. All ends of Γ are collectively referred to as EΓ defined as − − + + EΓ = Ex Γ ∪ Ex Γ ∪ Ey Γ ∪ Ey Γ. Figure 5.3 illustrates the concepts of Γ and its sides and ends. 204      Figure 5.3: The enclosure Γ for a curve γ(x, y) in R3 with its sides (blue) and ends (red) highlighted. Lemma 5.27. If the negative vector field F on SΓ always points outwards of Γ, the forward orbit of any point in Γ does not leave Γ through SΓ. ± Furthermore, the negative vector field F on Sz Γ points outwards Γ if for all s1 , s2 ∈ [−1, 1] ± − ∂z ∂z ± ± ± z(s1 , s2 ) · Fx Sz Γ(s1 , s2 ) − z(s1 , s2 ) · Fy Sz Γ(s1 , s2 ) + Fz Sz Γ(s1 , s2 ) ∂s1 ∂s2 Proof. If the negative vector field F on SΓ always points outwards of Γ, the backward orbit loc x(t) of any point x ∈ SΓ leaves Γ. Hence Wu , for which the entire backward orbit is contained in Γ, cannot intersect SΓ. + Consider now Sz Γ(s1 , s2 ). The tangent vectors to this surface are given by    1       0 ∂z z(s , s ) ∂s1 1 2  0       ,     205 1 ∂z z(s , s ) ∂s2 1 2    .   < 0.   ∂z z(s , s ) − ∂s 1 2    1   − ∂z z(s , s ). 
Thus the outward normal vector to the surface is given by N (s, t) =    ∂s2 1 2    1 This vector indeed points outward of Γ because it points in the positive z direction, and by + definition of Γ Sz Γ(s1 , s2 ) is the top boundary of Γ with the largest z value for fixed s1 , s2 . + Thus the vector −F Sz Γ(s1 , s2 ) + points outward if N (s, t) · F Sz Γ(s1 , s2 ) < 0 which is equivalent to − ∂z ∂z + + + z(s1 , s2 )·Fx Sz Γ(s1 , s2 ) − z(s1 , s2 )·Fy Sz Γ(s1 , s2 ) +Fz Sz Γ(s1 , s2 ) < 0. ∂s1 ∂s2 The same analysis performed for the other side yields the remaining condition with the normal vector pointing in the opposite direction and thus the sign of the condition flipped. We now introduce another set C = Cx ∪ Cy as follows: Definition 5.28. For some ϑ > 0 fixed, let Cx , Cy : [−1, 1] × [−1, 1]2 → R3 be defined as    1      Cx (s, t1 , t2 ) = s ·  t  ,  1   ϑt2    t1      Cy (s, t1 , t2 ) = s ·  1  .     ϑt2 That is, Cx , Cy are pyramids around the x and y axes, respectively, with a small opening 206 angle in the z direction, and a 90 degree opening angle in the x − y plane. We call the following objects the sides of C: ± Sx C(s, t) = Cx (s, t, ±1) ± Sy C(s, t) = Cy (s, t, ±1) where again s, t ∈ [−1, 1]. All sides of C are collectively referred to as SC defined as − − + + SC = Sy C ∪ Sy C ∪ Sz C ∪ Sz C. Similarly, the ends of C are given by ± Ex C(t1 , t2 ) = Cx (±1, t1 , t2 ), ± Ey C(t1 , t2 ) = Cy (±1, t1 , t2 ), where again t1 , t2 ∈ [−1, 1]. All ends of C are collectively referred to as EC defined as − − + + EC = Ex C ∪ Ex C ∪ Ey C ∪ Ey C. To illustrate these concepts, figure 5.4 shows an example of such a set C. Remark 5.29. The boundary ∂C of C consists of ∂C = EC ∪ SC. This is because the remaining boundaries of Cx and Cy are identical and thus do not form boundaries of the 207        Figure 5.4: The set C with its ends EC (red) and sides SC (blue) highlighted. Each Cx and Cy consist of two pyramids around the x and y axis respectively. combined C.  Remark 5.30. Clearly C contains some open neighborhood of the local unstable manifold  around the origin because C contains the origin and the manifold at the origin is tangent to the x − y plane, while all sides of C have a non-zero slope in the x and y  directions.  Lemma 5.31. If the negative vector field F on SC\{0} always points outwards of C, the forward orbit of any point in C does not leave C through SC. ± Furthermore, the negative vector field F on Sx C\{0} points outwards C if for all s ∈ [−1, 1]\{0} and t ∈ [−1, 1] ± 1 ± ± −ϑ · Fx Sx C(s, t) + Fz Sx C(s, t) s < 0 ± and similarly for Sy C\{0} ± 1 ± ± −ϑ · Fy Sy C(s, t) + Fz Sy C(s, t) s 208 < 0. Proof. If the negative vector field F on SC\{0} always points outwards of C, the backward orbit x(t) of any point x ∈ SC\{0} leaves C. Hence no orbit from within C can intersect SC. Furthermore, 0 is a fixed point, and thus no orbit leaves C through 0 either. + Consider now the case of Sx C(s, t). The tangent vectors to this surface for s ∈ [−1, 1]\{0} and t ∈ [−1, 1] are given by      1  0         ϑt , s .         ϑ 0   −ϑ     Thus the outward normal vector to the surface is given by N =  0  for s > 0. This vector     1 indeed points outward of C because it points in the positive z direction, and by definition of + C Sx C(s, t) is the top boundary of C with the largest z value for fixed s > 0, t. 
Similarly, + for s < 0 Sx C(s, t) switches from top to bottom boundary, and the outward normal vector   −ϑ     here is given by N = −  0 .     1 + + Thus the vector −F Sx C(s, t) points outward if N (s, t) · F Sx C(s, t) < 0 which is equivalent to    −ϑ · F S + C(s, t) + F S + C(s, t) < 0  x x z x   −ϑ · F S + C(s, t) + F S + C(s, t) > 0  x x z x 209 for s > 0, for s < 0. Since s = 0, we can divide by s and unify this condition to read 1 + + −ϑ · Fx Sx C(s, t) + Fz Sx C(s, t) s < 0. This expression will later turn out to be easier to treat numerically, as well. Lemma 5.32. If the vector field F on EC always points outwards of C, the backward orbit of any point in C does not leave C through EC. Furthermore, the vector field F on EC points outwards of C if for all t1 , t2 ∈ [−1, 1] ± ±Fx Ex C(t1 , t2 ) > 0 ± ±Fy Ey C(t1 , t2 ) > 0. ± and similarly for Ey C Proof. If the vector field F on EC always points outwards of C, the forward orbit x(t) of any point x ∈ EC leaves C. Hence no orbit from within C can intersect EC. + Consider now the case of Ex C(t1 , t2 ). The tangent vectors to this surface for t1 , t2 ∈ [−1, 1] are given by     0  0          1 ,  0  .         0 ϑ 210   1     Thus the outward normal vector to the surface is given by N = 0. This vector indeed     0 points outward of C because it points in the positive x direction, and by definition of C + Ex C is the right boundary of C with the largest x value. + + Thus the vector F Ex C(t1 , t2 ) points outward if N (s, t) · F Ex C(t1 , t2 ) > 0 which is equivalent to + Fx Ex C(t1 , t2 ) > 0. The same analysis performed for the other three sides yields the remaining conditions. Lemma 5.33. If the following condition holds, then for every δ > 0 there is a T > 0 such that for any orbit x(t) = (xx (t), xy (t), xz (t)) that is contained in C\{0} for all t ∈ [−T, 0] the expression |xx (−T )|2 + xy (−T ) 2 < δ holds: Fx (x) · xx + Fy (x) · xy > 0 ∀x = (xx , xy , xz ) ∈ C\{0} (5.19) Proof. Consider the expression d |xx (t)|2 + xy (t) 2 dt = d xx (t)2 + xy (t)2 dt = xx (t) · xx (t) + xy (t) · xy (t) ˙ ˙ = Fx (x(t)) · xx (t) + Fy (x(t)) · xy (t). Since F is continuous and by virtue of condition 5.19, this expression assumes a strictly 211 δ positive minimum Mδ on every compact set C\Bδ (0). Thus for T = M and any orbit x(t) δ contained in C\{0} for all t ∈ [−T, 0] we have that |xx (−T )|2 + xy (−T ) 2 < Mδ · T < δ. Remark 5.34. As was the case with the one dimensional manifold, for this argument to work, it is necessary to consider some set like C which includes the origin, but does not include a full neighborhood of it. This is because in any neighborhood B of the origin, this condition can never be satisfied, as B would include the local stable manifold. Thus a set has to be chosen that does not include any points of the stable manifold. Again this automatically precludes testing for these conditions in Γ and requires the introduction of C instead. Theorem 5.35 (Manifold Enclosure Theorem (2D)). If there exist ε, ϑ > 0 such that the conditions of the above lemmata 5.27 through 5.33 are satisfied, then also ∀s1 , s2 ∈ [−1, 1] ∃t1 ∈ loc [−1, 1] such that Γ(s, t1 , t2 ) ∈ Wu is satisfied. Proof. Let ζ(r), r ∈ [0, 1] be a continuous curve within C\{0} connecting some point ζ(0) on the top side to some point ζ(1) on the bottom side of C. Let τ (r) 0 be the largest time such that Φτ (r) (ζ(r)) ∈ SC, or let τ (r) = −∞ if no such time exists. 
By lemma 5.31 the function τ (r) is continuous wherever it is finite, since the vector field on the top boundary is never tangent to the boundary and thus each point on SC leaves C together with a small neighborhood around it. Furthermore, by lemma 5.32 for any time 0 > t > τ (r) the orbit Φτ (ζ(r)) is within C. 212 Assume now every point on ζ(r) intersects SC at some time −∞ < T < τ (r). Then consider the continuous function Φτ (r) (ζ(r)). We have that Φτ (r) (ζ(0)) is in the top side of C, and Φτ (r) (ζ(1)) is in the bottom side of C. However, the only point connecting the top and bottom sides of C is the origin, and thus for some s∗ ∈ [0, 1] we have that Φτ (r) (ζ(s∗ )) = 0. That is impossible since no orbit starting away from the origin goes through the origin in finite time. Thus for some r∗ ∈ [0, 1] we must have that τ (r∗ ) = −∞, and then by lemma 5.33 loc limt→−∞ Φt (ζ(r∗ )) = 0. Thus ζ(r∗ ) ∈ Wu . Assume now that ∃ s, t1 ∈ [−1, 1], s = 0 such that ∀ t2 ∈ [−1, 1] we have Cx (s, t1 , t2 ) ∈ / loc Wu . Then let ζ be the curve connecting Cx (s, t1 , 0) and Cx (s, t1 , 1). Certainly ζ meets loc the above criteria, and thus must contain a point p ∈ ζ such that p ∈ Wu , contradicting the assumption. The same holds for Cy . Since Γ contains a neighborhood of the origin, ∃ T > 0 such that Φ−T (p) ∈ Γ. Then also p ∈ Γ, as by lemma 5.27 the forward orbit of Φ−T (p) cannot leave Γ except through loc the ends. Thus ∀ s1 , s2 ∈ [−1, 1] ∃ t ∈ [−1, 1] such that Γ(s1 , s2 , t) ∈ Wu . Remark 5.36. This proof is much more involved than its one dimensional counterpart in theorem 5.24. This is mostly due to the topological aspects of this proof. We hope that further research on the topology of this problem will allow a significant simplification of this proof. 5.3.3 Numerical Verification In the following, we will describe the numerical verification of the above conditions. Computations will be performed in Taylor Model arithmetic, and we first recall a few facts about 213 Taylor Model arithmetic. Taylor Model Derivative Recall that in Taylor Model arithmetic in general it is not possible to obtain an enclosure of the derivatives of a function from a Taylor Model enclosure of the function itself. While it is possible to differentiate the polynomial part of the Taylor Model, we have no information about the derivative of the remainder term. This is because it represents a purely C 0 bound of the truncation and numerical round-off errors. However, if the function itself is a polynomial of order less than the current computation order to begin with, it is simple to compute its derivative. We will use this fact to compute the derivatives of the manifold approximation γu (but not the manifold itself) rigorously. Integral-Division Operator In general, in Taylor Model arithmetic it is also not possible to divide by an independent variable xi . This is because in the differential algebra xi does not have an inverse element. Yet even if every coefficient in the polynomial part P (x) of a Taylor Model depends on xi , and thus Q(x) = P (x)/xi is a polynomial, usually this division cannot be performed rigorously. This is because the error bound R of the Taylor Model in general does not scale with xi , so that the Taylor Model P (x) ± R, when divided by xi would be of the form Q(x) + B(R/xi ) where B is the bounding operation over the Taylor Model domain [−1, 1]. Obviously, the expression for the error bound does not have a bound on that domain unless R = 0. 
However, under certain very specific circumstances it is possible to rigorously divide a Taylor Model by an independent variable xi . For this to be possible, the remainder error R 214 of the Taylor Model must a priori be known to scale at least linearly with xi and, of course, every coefficient of the polynomial part must depend on xi . Then such a division can be performed by simply leaving the remainder term untouched and the resulting Taylor Model is indeed a valid enclosure of the desired expression valid for all xi = 0 on the Taylor Model domain. 1 x In particular, consider the operator x 0 i · dxi . This operator should be read as one i single operator, not as an integration followed by a division. When applied to a Taylor Model T = P (x) ± R the result is 1 xi 1 xi 1 xi T dxi = P (x) dxi ± R dxi . xi 0 xi 0 xi 0 The polynomial part P (x) can easily be integrated, as can be R, which is simply R · xi . The following division by xi can thus be performed on both the integrated polynomial part as well as the integrated error bound, which right after integration scales linearly with xi . 1 x The resulting Taylor Model is a rigorous enclosure of the operator x 0 i · dxi , which is i well defined for all xi = 0, applied to T which is valid over the entire Taylor Model domain [−1, 1] except at xi = 0. We will apply this operator in the following form. Consider the vector field F evaluated over a set of the form K(s, t) = s · v(s, t) where s ∈ [−1, 1]\{0} and v ∈ R3 . The vector v may additionally depend on any number of other parameters t ∈ Rk . 215 Note that since F (0) = 0, we have that F s · v(s, t) s d F s · v(s, t) ds + F (0 · v(0, t)) 0 ds =0 s DF s · v(s, t) · v(s, t) ds. 0 = = If we are only interested in what happens for s = 0, we can divide by s and obtain 1 F s · v(s, t) s = 1 s DF s · v(s, t) · v(s, t) ds s 0 (5.20) which is well defined for all s ∈ [−1, 1]\{0}. While the expression on the right looks complicated, it can be evaluated rather easily in Taylor Model arithmetic using the combined integral-division operator introduced above, as long as the derivative DF of F is known analytically. The advantage of this form is that near zero, this expression is bounded away from zero. Note that for fixed 0 = x ∈ Rn and F diagonal   0 λ1     .. F (s · x) =   · s · x + O s2 . .     0 λn Thus the expression   0 λ1   1   ... F s · v(s, t) =   · v(s, t) + O (s)   s   0 λn 216 has a non-zero limit as s → 0 as long as lims→0 v(s, t) = 0. 5.3.3.1 Verification using Taylor Models The following steps are performed to verify the various conditions for the manifold enclosure. For all of these computations, note that the enclosure candidate Γ can easily be written as a a Taylor Model. The choice of ε can be automated by testing increasingly larger values until all conditions are met, or a value can simply be chosen. For double precision computations, good starting points are values on the order of 10−13 . The same is true for the set C and the value ϑ used in its construction. Since here the enclosure does not need to be very sharp, the value for ϑ can be chosen rather large. We found that using ϑ = 0.1 works quite well. Verification of Lemmata 5.18, 5.27 and 5.32 Straight forward evaluation of each separate condition, with subsequent bounding of the result allows the verification of these conditions. Note that since γ is a polynomial, its derivatives are readily computed and Taylor Model enclosures of the derivatives used in these expressions can be obtained. 
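Both the straightforward bounding just described and the conditions treated next rely on evaluating the vector field F, and in the latter case on the integral-division operator and identity (5.20). To make these concrete, the following minimal sketch applies them to a quadratic example field, namely the Lorenz field that is used later in Section 5.3.5. It is written in plain, non-verified floating point Python rather than the Taylor Model arithmetic of COSY INFINITY; the remainder bound is only commented on, and the direction vector v is an arbitrary choice made purely for the demonstration.

```python
import numpy as np

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0    # canonical Lorenz parameters

def F(x):       # the Lorenz vector field (quadratic, so DF is affine in x)
    return np.array([SIGMA * (x[1] - x[0]),
                     (RHO - x[2]) * x[0] - x[1],
                     x[0] * x[1] - BETA * x[2]])

def DF(x):      # its derivative
    return np.array([[-SIGMA, SIGMA, 0.0],
                     [RHO - x[2], -1.0, -x[0]],
                     [x[1], x[0], -BETA]])

def integral_division(coeffs):
    """Combined operator (1/s) * int_0^s (.) ds acting on a polynomial in s:
    the coefficient of s^k is simply divided by k + 1.  In the Taylor Model
    version the remainder bound is left untouched, since integration scales
    it by s and the subsequent division removes that factor again."""
    return np.array([c / (k + 1) for k, c in enumerate(coeffs)])

v = np.array([0.3, -0.7, 0.5])              # arbitrary direction vector
# For a quadratic field, DF(s*v) @ v is exactly linear in s: c0 + c1*s.
c0 = DF(np.zeros(3)) @ v
c1 = (DF(v) - DF(np.zeros(3))) @ v
q = np.array([integral_division([c0[i], c1[i]]) for i in range(3)])
# q[i] holds the coefficients of (1/s) * int_0^s DF(sigma*v) @ v dsigma.

# Identity (5.20): for s != 0 this equals F(s*v)/s, and it stays finite at s = 0.
for s in (0.5, 1e-3, -0.2):
    assert np.allclose(F(s * v) / s, q[:, 0] + s * q[:, 1])
print(q[:, 0], DF(np.zeros(3)) @ v)          # the limit for s -> 0 is DF(0) v
```

In the actual implementation the same coefficient manipulation is carried out monomial by monomial on multivariate Taylor Models, with the remainder bound carried along as described above.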
Moreover, since all of these conditions are evaluated over sets not containing zero, the vector field F does not vanish. Thus the direct evaluation and bounding of F does not pose a problem if ε is chosen big enough. 217 Verification of Lemmata 5.21, 5.22 Note that the expressions in lemmata 5.21 and 5.31 are of the form 1 −ϑ · Fi (s · v) + Fj (s · v) s 1 1 = −ϑ · Fi (s · v) + Fj (s · v) s s where i, j = x, y, z, s ∈ [−1, 1]\{0} and v ∈ R3 is some vector depending on additional parameters t. This is precisely the form of expression used in equation 5.20, and thus the 1 Fi , i = 1, 2, 3 s used in this expression can be computed using Taylor Models. Bounding the entire expression in Taylor Model arithmetic allows for easy verification of this condition. Verification of Lemmata 5.31 and 5.33 Consider the expressions Fx (x) · xx , Fx (x) · xx + Fy (x) · xy in lemmata 5.22 and 5.33 respectively. Both of them are evaluated for x = (xx , xy , xz ) of the form x=s·v where s ∈ [−1, 1]\{0} and v = (vx , vy , vz ) ∈ R3 is some vector depending on additional parameters t. 218 Inserting this into the above expression yields Fx (sv) · svx , Fx (sv) · svx + Fy (sv) · svy . Since s = 0, this can be rewritten as s2 · s2 · 1 Fx (sv) · vx , s 1 1 Fx (sv) · vx + Fy (sv) · vy . s s Now by again applying equation 5.20, the expressions 1 Fi (sv) can easily be computed s in Taylor Model arithmetic. Thus the entire expression in parenthesis can be computed as a Taylor Model and subsequently bounded. Since s2 is always positive for s = 0, if this expression is always positive, then Fx (x) · xx > 0, Fx (x) · xx + Fy (x) · xy > 0. 5.3.4 Manifold Propagation After verifying an initial enclosure of the manifolds, these enclosures can be propagated to a much larger size using verified integration. In particular, since the initial enclosure is available directly as a Taylor Model, this step can be performed using the COSY VI verified 219 integrator, which is described in detail in [34, 10, 35]. Using this tool, it is possible to rigorously transport an initial Taylor Model through a flow resulting in a verified Taylor Model enclosure of the final Taylor Model. At the core of this integrator lies the computation of a rigorous enclosure of the flow of the ODE for some suitably chosen time step for the given initial condition. That is, in each step the verified integrator computes a rigorous enclosure of Φt (xi ) valid for some initial condition xi , which itself is a Taylor Model, and any time t ∈ [0, ∆t] for some time step ∆t[35]. There are various additional operations performed during the integration in order to keep the growth of the error bound of the Taylor Model under control. These are described in detail in [10, 34] and we will not discuss those methods here. Furthermore, the verified integrator automatically splits the initial condition into smaller pieces if it considers such splits necessary to maintain optimal precision. The integration is then performed on each piece separately, resulting in a list of final Taylor Models covering the final result. Using such a verified integrator, it is possible to “grow” the invariant manifolds by integrating the unstable manifold forward and the stable manifold backward in time. Such an integration would produce a new enclosure of part of the manifold, covered by a collection of Taylor Models. While in principle this is a possible way to integrate the entire verified manifold enclosure, it is not ideal to “grow” the manifold. 
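The central object produced in each step of such a verified integration is a polynomial expansion of the flow. As a drastically simplified, non-verified analogue, the following sketch computes the Taylor expansion in time of a single Lorenz orbit by Picard iteration; unlike COSY VI it expands neither in the initial conditions nor carries a remainder bound, and the initial point and expansion order are chosen arbitrarily for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def picard_time_expansion(x0, order):
    """Coefficients in t of the Lorenz orbit through x0, from the Picard
    iteration x_{k+1}(t) = x0 + int_0^t F(x_k(tau)) dtau, truncated at
    'order'.  Each sweep adds one correct order to the expansion."""
    x = [np.array([c]) for c in x0]                     # start: constants
    for _ in range(order):
        f = (SIGMA * P.polysub(x[1], x[0]),
             P.polysub(P.polysub(RHO * x[0], P.polymul(x[2], x[0])), x[1]),
             P.polysub(P.polymul(x[0], x[1]), BETA * x[2]))
        x = [P.polyadd([x0[i]], P.polyint(fi))[:order + 1]
             for i, fi in enumerate(f)]
    return x

coeffs = picard_time_expansion([1.0, 1.0, 1.0], order=15)
t = 0.01
print([P.polyval(t, c) for c in coeffs])    # orbit position at t = 0.01
```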
5.3.4.1 Initial Condition Recall that with any point x on the manifold, also automatically its entire orbit x(t) is part of the manifold. Thus in order to propagate the invariant manifold enclosure, it is sufficient to integrate a small ball around the origin within the manifold. In fact, the flow of any initial condition I0 within the manifold that is transversal to the vector field F will trace out part 220 of the manifold. The advantage in choosing such an initial condition is that it has one less dimension than the entire manifold, making the verified integration faster and introducing fewer errors during the integration thus increasing the accuracy. Given a verified Taylor Model enclosure Γ(x, t) of the manifold, where t are the error variables representing the slight thickening, we typically simply chose the initial enclosure to be the ball of radius 1 for x as the initial condition, i.e. I0 = Γ(x, t) for |x| = 1. In practice, of course, this initial condition will not be represented by one single Taylor Model, but instead be split up into several smaller Taylor Models which together cover the entire ball |x| = 1. This also allows to only integrate “interesting” initial conditions, that lead to manifold enclosures of the the relevant parts of the manifold. Since the Taylor Model integrator COSY VI computes verified rigorous enclosures of the flow Φt (x) in each integration step, a Taylor Model representation of the flow is readily available with this method. By storing each of these flow expansions, a covering of the manifold by these flows is obtained. 5.3.4.2 Manifold Growth Either way, one problem with the propagation of the manifold in this way is the rapid growth of the manifold. Locally, the growth of the manifold is given by the eigenvalues λi of the Jacobian DF (0) at the origin. For the each manifold, the linear growth in the direction of the eigenvectors is exponential in time, given by xi (t) = xi (0) · exp( λi t) 221 where xi is the distance along the eigenvector. The larger the absolute value of the eigenvalues, the faster the manifold grows. A problem appears if the different eigenvalues of a higher dimensional invariant manifold are significantly different. If that is the case, the manifold will grow very quickly along the larger eigenvalue, while it will grow relatively slowly along the smaller eigenvalue. This poses various problems for the verified integration. First, the manifold in the direction of the major expanding eigenvalue may not be the relevant part of the manifold for a given problem. Thus, integrating the entire manifold in this direction is a waste of computation time. There are several strategies to counter this undesirable growth of the manifold under integration. Two easily implemented techniques are described here, which one is more efficient to use depends on the problem at hand. Since the COSY VI integrator supports the splitting of initial conditions, it is relatively easy to limit the size of each initial condition in each integration step. If an initial condition exceeds a certain predefined size, it is automatically split along the longest side. Thus the propagation moves a growing number of Taylor Models through the ODE. To restrict the growth, one can automatically discard such pieces that have left a predefined area of interest completely. This effectively takes care of pieces that quickly grow along the direction of the largest eigenvalue. 
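To illustrate the bookkeeping behind this pruning strategy, the following non-verified Python sketch subdivides the parameter domain of an initial curve adaptively and throws away pieces whose images lie outside a box of interest. It is only a cartoon: COSY VI splits step by step during the integration based on the size of the Taylor Models, whereas the sketch subdivides at a single final time, decides membership from two endpoint samples only, and uses a toy box, initial curve, and thresholds chosen purely for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def lorenz(t, x):
    return [SIGMA * (x[1] - x[0]), (RHO - x[2]) * x[0] - x[1],
            x[0] * x[1] - BETA * x[2]]

def flow(p, t_final):
    return solve_ivp(lorenz, (0.0, t_final), p, rtol=1e-10, atol=1e-12).y[:, -1]

BOX_LO = np.array([-20.0, -20.0, -20.0])      # toy box of interest
BOX_HI = np.array([ 20.0,  20.0,  20.0])
MAX_DIAM, MIN_WIDTH = 5.0, 1e-2               # forced split / smallest piece

def inside(p):
    return bool(np.all(p >= BOX_LO) and np.all(p <= BOX_HI))

def propagate(curve, a, b, t_final, pieces):
    """Split [a, b] until the image of each piece is small, discarding pieces
    whose sampled endpoints have both left the box of interest."""
    pa, pb = flow(curve(a), t_final), flow(curve(b), t_final)
    if not inside(pa) and not inside(pb):
        return                                 # piece left the box: discard
    if np.linalg.norm(pb - pa) > MAX_DIAM and b - a > MIN_WIDTH:
        m = 0.5 * (a + b)
        propagate(curve, a, m, t_final, pieces)
        propagate(curve, m, b, t_final, pieces)
    else:
        pieces.append((a, b))

e_u = np.array([1.0, 2.18, 0.0])               # roughly the unstable direction
curve = lambda s: 0.1 * s * e_u                # toy initial curve near the origin
kept = []
propagate(curve, 0.0, 1.0, 1.0, kept)
print(len(kept), "pieces kept")
```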
The enforced splitting above a certain size ensures that more pieces actually leave the area of interest faster, instead of having pieces that are mostly outside the area of interest but still have some small overlap. Clearly, the maximum allowed size for each piece should be chosen smaller than the area of interest. Another technique is not to wait until a given piece of the initial condition has left the 222 0,8 1,6 1,2 0,8 0,4 -0,4 0 0,4 0,8 1,2 1,6 2 2,4 Figure 5.5: Alignment along the strongest expanding eigenvector for a simple linear ODE. -0,4 The initial condition is shown in red, and then integrated forward for times T = 0.1 (yellow), T = 0.3 (green), T = 0.3 (light blue), T = 0.4 (dark blue). -0,8 box of interest and then discard it, but discarding parts of the initial condition in every integration step. This is particularly useful if the manifold is only needed in a neighborhood of some known point x0 on the initial manifold enclosure. Consider an initial condition I(s), s ∈ [−1, 1] such that I(0) = x0 . In each integration step a rigorous enclosure of Φt (I(s)) is computed for some t > 0. This flow is then only evaluated over some smaller domain such as s ∈ [−1, 1]/ exp(λt) and rescaled to the domain [−1, 1]. The initial condition for the next integration step is thus given by I1 = Φt (I(s/ exp(λt))), s ∈ [−1, 1]. Clearly in each step a neighborhood of Φt (x0 ) is retained, while the “ends” of the initial condition are discarded. If λ is chosen correctly in each step, this approach allows the computation of a band of relatively constant size on the manifold. Typically, the largest eigenvalue λmax is a good choice for λ. 223 Another more subtle problem arising from very different eigenvalues is that the strong expansion in one direction causes the initial condition to align with that expanding direction after longer integration times. This alignment with the strongest expanding eigenvector also means that the initial condition becomes more and more parallel to the vector field, thus causing the area each flow expansion covers to become smaller and smaller. That is particularly noticeable if parts of the initial condition leaving a given bounding box are discarded. Consider, for example, the two dimensional case of a linear ODE, with eigenvalue λ1 = 8 along the x-axis and λ2 = 0.5 along the y-axis. Then the initial condition I(s) = (1, 0) + s · (−1, 1), s ∈ [0, 1] forming a line from (1, 0) to (0, 1) is stretched out significantly more along the x-axis than along the y-axis as it is integrated over time, resulting in an object almost parallel to the x-axis after long time. This effect is shown in figure 5.5, where the initial condition I (red) is integrated for times T = 0.1 (yellow), T = 0.3 (green), T = 0.3 (light blue), and T = 0.4 (dark blue). Note how the area the rays sweep out between two subsequent time steps becomes smaller and smaller in the shown bounding box. This problem is, of course, intrinsic to the ODE, and thus cannot be circumvented. However, there are possible techniques to prevent the stretching of most pieces of the initial condition. In the above, integration was always performed using fixed time steps of a given size (e.g. 0.1 in the example) for every point in the initial condition. However, since the Taylor Model integrator computes a validated flow expansion before inserting a time, there is nothing preventing us from inserting a polynomial expression for the time step depending on the parametrization of the initial condition. 
Mathematically speaking, instead of computing Φt (I(s)) for some fixed t1 ∈ R and some initial condition I(s) parametrized by s ∈ D ⊂ Rn , 1 224 the expression Φt(s) (I(s)) for some suitably chosen t(s) is thus computed. In particular, one can attempt to chose t(x) in such a way, that points growing faster are delayed by using smaller time steps, while points growing slower are sped up by using larger time steps. Consider the above example, and some subset I1 (s), s ∈ [0, 1] of the initial condition I not including the point (0, 1). Then every point in I1 moves to the right. If the xf time step is chosen as t(s) = ln /λ1 , where x(s) indicates the x coordinate of a point x(s) in I1 and xf ∈ R+ , then the map Φt(s) (I1 (s)) takes I1 to a vertical line with x(s) = xf . If this is performed in each integration step for some suitably chosen xf , the result of each step is always almost perpendicular to the flow, thus providing the best possible coverage while controlling the expansion at the same time. Lastly, there is an inherent numerical problem due to the stretching of the initial condition happening during the integration. Not only the length of the initial condition, but also the size of the unavoidable integration errors grows exponentially. While points on the eigenvector of the smaller eigenvalue move relatively slowly, locally all points slightly off are moving away exponentially with the stronger eigenvector. This leads to rapid growth of the error bound, which presents the integrator with significant problems and limits the maximum possible integration time. This problem can be addressed by either using High Precision Taylor Models to perform the integration, or to some extend by again using position dependent, non-uniform time steps for each integration, thus controlling the growth of the manifold and thus the errors. Unfortunately, while the first technique of discarding pieces of the initial condition, or rescaling the initial condition in each step is relatively straight forward in their implementation, the latter two techniques are more complicated to implement. Both require, for 225 example, various non-trivial changes to the COSY VI integrator. It is beyond the scope of this dissertation to implement and explore the possibilities of these approaches. Further research on how to modify COSY VI to construct an optimal verified manifold propagator is ongoing. 5.3.5 Application to the Lorenz System To demonstrate the verified manifold enclosures we consider the well known Lorenz equations in three dimensions. This system was introduced by Lorenz in 1963[27], and it has been the focus of extended studies in the dynamical systems community ever since. The standard Lorenz system of differential equations is the three dimensional system of ordinary differential equations given by x1 = σ(x2 − x1 ) ˙ x2 = (ρ − x3 )x1 − x2 ˙ (5.21) x3 = x1 x2 − βx3 ˙ where the real numbers σ, ρ, β are external parameters. The canonical values used are σ = 10, ρ = 28, and β = 8/3. We denote the associated vector field by F . In the following, we refer to the vector field with the canonical values for the parameters as the Lorenz system. 226 This system has a hyperbolic fixed point at the origin, with the Jacobian given by   0  −σ σ   JF |0 =  ρ −1 0   0 0 −β    . 
Thus the eigenvalues of the Jacobian at the origin are given by the analytical expressions

    λ1 = (−σ − 1 + √(4ρσ + (1 − σ)²)) / 2 ≈ 11.8
    µ1 = (−σ − 1 − √(4ρσ + (1 − σ)²)) / 2 ≈ −22.8
    µ2 = −β ≈ −2.7

and the corresponding eigenvectors read

    e1 = (1, λ1/σ + 1, 0)ᵀ,   e2 = (1, µ1/σ + 1, 0)ᵀ,   e3 = (0, 0, 1)ᵀ.

Forming the 3 × 3 transformation matrix M = (e1, e2, e3) to transform from cartesian into eigen-coordinates, one readily computes the inverse matrix M⁻¹ given by

    M⁻¹ = (  (µ1 + σ)/(µ1 − λ1)   −σ/(µ1 − λ1)   0 )
          ( −(λ1 + σ)/(µ1 − λ1)    σ/(µ1 − λ1)   0 )
          (          0                  0        1 ).

With these expressions the diagonalized Lorenz vector field F̃ = M⁻¹ ∘ F ∘ M can be written analytically. Similarly, the derivative of F̃ is given by DF̃ = M⁻¹ ∘ DF ∘ M. Note that all of the analytical expressions in M and M⁻¹ are readily computed using verified techniques, yielding a rigorous enclosure of the correctly diagonalized Lorenz map.

5.3.5.1 Initial Manifold Enclosure

We apply the construction algorithm described in the preceding section to this system. All of the following computations are performed using double precision Taylor Models of order 21. First, the matrices M and M⁻¹ are computed using Taylor Models containing only a constant part. This yields a rigorous representation of F̃ and DF̃. Note that the eigenvectors are ordered in such a way as to yield the same order of eigenvalues as described in Section 5.2. In the non-verified construction steps of the polynomial approximation, these rigorous enclosures are simply replaced by double precision approximations by dropping the error bound of the Taylor Models.

Next, the order by order construction of the polynomial approximation is started. Due to the overall size of the dynamics in the Lorenz system, the free parameters for the derivatives ηi of the polynomial manifold approximation at the origin can be chosen rather large without major loss in the accuracy of the resulting approximation. We choose the value η1 = 1 for the unstable manifold, and various values for η2 and η3 for the stable manifold. Since the verification is always performed over the domain [−1, 1]ⁿ, the size of the ηi effectively determines the size of the manifold enclosure. Computing the enclosure for various values of η3 will demonstrate the effect of changing this value on the size and sharpness of the manifold enclosure.

In this particular ODE, it is advantageous to make the size of the initial manifold enclosure along the z-axis as large as possible. This is because in the later propagation of the manifold enclosure, the z-axis represents the slowest growing direction, with an eigenvalue of absolute value 8/3 ≈ 2.67 compared to 22.8 along the y-axis. The effect of this will become painfully obvious during the manifold propagation.

    η2 \ η3 |    1     |    5     |    10    |    15    |    20
    --------+----------+----------+----------+----------+----------
       1    | 5·10^-14 | 1·10^-13 | 1·10^-12 | 1·10^-9  | 1·10^-7
      0.5   | 1·10^-14 | 5·10^-14 | 1·10^-13 | 5·10^-11 | 5·10^-8
      0.2   | 5·10^-15 | 1·10^-14 | 5·10^-14 | 1·10^-11 | 6·10^-9

Table 5.1: Thickening of the manifold enclosure (in diagonalized coordinates) for which verification is successful for various values of η2 and η3.

In order to be able to conveniently perform the verification as described in Section 5.3.2, the "P-convention" is chosen for the manifold construction, yielding the manifold approximation as the graph over the x-axis (unstable) or the y-z plane (stable), respectively.
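The analytic expressions for λ1, µ1, µ2, M and M⁻¹ given at the beginning of this section are easy to cross-check numerically. The following short sketch (plain floating point Python, not the verified constant-part Taylor Models actually used in the construction) builds M and M⁻¹ from these formulas and confirms that they diagonalize the Jacobian at the origin.

```python
import numpy as np

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

disc = np.sqrt(4 * RHO * SIGMA + (1 - SIGMA) ** 2)
lam1 = (-SIGMA - 1 + disc) / 2      # ~  11.83 (unstable)
mu1  = (-SIGMA - 1 - disc) / 2      # ~ -22.83 (strongly stable)
mu2  = -BETA                        # ~  -2.67 (weakly stable)

# eigenvector matrix M = (e1, e2, e3) and its analytic inverse
M = np.array([[1.0, 1.0, 0.0],
              [lam1 / SIGMA + 1.0, mu1 / SIGMA + 1.0, 0.0],
              [0.0, 0.0, 1.0]])
Minv = np.array([[ (mu1 + SIGMA) / (mu1 - lam1), -SIGMA / (mu1 - lam1), 0.0],
                 [-(lam1 + SIGMA) / (mu1 - lam1),  SIGMA / (mu1 - lam1), 0.0],
                 [0.0, 0.0, 1.0]])

def F(x):                                    # standard Lorenz field
    return np.array([SIGMA * (x[1] - x[0]),
                     (RHO - x[2]) * x[0] - x[1],
                     x[0] * x[1] - BETA * x[2]])

def F_diag(x):                               # diagonalized field M^-1 o F o M
    return Minv @ F(M @ x)

J0 = np.array([[-SIGMA, SIGMA, 0.0], [RHO, -1.0, 0.0], [0.0, 0.0, -BETA]])
print(np.round(Minv @ J0 @ M, 9))            # diag(lam1, mu1, mu2)
print(lam1, mu1, mu2)
```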
To test the quality of the computed approximations, the construction condition (5.2) is evaluated with the resulting polynomials. For all values of ηi, both sides of this condition indeed agree up to the computation order 21, with individual differences of less than 10^-17 in the coefficients. This small disagreement is caused by floating point round-off errors during the computation and is to be expected in double precision floating point arithmetic.

For the verification of the one dimensional unstable manifold, the polynomial approximation curve is outfitted with a small thickening of size 2·10^-14 in both the y and z directions in the diagonalized system, which after transforming back into standard Lorenz coordinates translates into a thickness of less than 5.7·10^-14. For the two dimensional stable manifold, the initial surface is thickened by some value depending on the choices of η2 and η3. Table 5.1 shows the thickening chosen for various combinations of η2 and η3 for which verification of the enclosure was successful.

     t | Approximate Length | Maximum Error | # of Taylor Models
     1 |        126         |    2·10^-11   |         44
     2 |        144         |    3·10^-11   |         72
     3 |        166         |    5·10^-11   |        106
     4 |        189         |    1·10^-10   |        140
     5 |        214         |    2·10^-10   |        175
    10 |        390         |    2·10^-9    |        353
    15 |        710         |    7·10^-8    |        537
    20 |       1156         |    5·10^-5    |        718
    25 |       1592         |    7·10^-4    |       1091
    28 |       1879         |    2·10^-2    |       9196

Table 5.2: Approximate length, maximum error, and number of Taylor Models covering the unstable manifold after propagating for the given time t.

After completing the computation of the verification conditions given in Section 5.3.2, this results in a rigorously verified enclosure of the one dimensional unstable and two dimensional stable manifolds of the origin in the Lorenz system. The computation and verification of all these manifold enclosures of order 21 takes only a couple of seconds on a 2 GHz Apple MacBook with 2 GB RAM.

Note that the sizes of the manifold enclosures are roughly given by the size specified by the ηi during the construction. For the unstable manifold, the length is about 2·η1, while for the stable manifold the area is about 2·η2 × 2·η3.

Figure 5.6 shows the initial enclosures of the stable and unstable manifolds in standard Lorenz coordinates for the case η1 = η2 = 1 and η3 = 10. The thickness of the stable manifold enclosure shown is less than 2.3·10^-12, while the error in the unstable manifold is less than 5.7·10^-14. The twisting of the stable manifold around the z-axis is clearly visible, indicating that this enclosure is much better than just a linear approximation.

Figure 5.6: The initial manifold enclosures (unstable and stable) generated for the Lorenz system at the origin, shown in standard Lorenz coordinates.

5.3.5.2 Manifold Propagation

The initial manifold enclosure is then integrated forward. Due to the symmetry of the Lorenz equations, for any orbit (x(t), y(t), z(t)) also (−x(t), −y(t), z(t)) is an orbit. Thus, if the point (x, y, z) is on one of the manifolds, so is (−x, −y, z). The manifolds are therefore symmetric, and it is sufficient to integrate an initial condition on the manifold for which x > 0; the other half can be obtained by a simple reflection instead of a lengthy integration.

For the one dimensional unstable manifold this integration is relatively simple. The end point of the initial manifold enclosure, Γu(1, t), is integrated forward in time using COSY VI.
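A non-verified impression of this step can be obtained with an ordinary numerical integrator. In the sketch below, the end point Γu(1, t) is replaced by a stand-in point a unit distance from the origin along the unstable eigendirection (an assumption made only for this illustration, not the actual verified enclosure), and the arc length of its orbit gives a rough, non-rigorous feeling for the lengths reported in Table 5.2.

```python
import numpy as np
from scipy.integrate import solve_ivp

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def lorenz(t, x):
    return [SIGMA * (x[1] - x[0]), (RHO - x[2]) * x[0] - x[1],
            x[0] * x[1] - BETA * x[2]]

# stand-in for the end point Gamma_u(1, t): a point a unit distance from the
# origin along the unstable eigendirection
lam1 = (-SIGMA - 1 + np.sqrt(4 * RHO * SIGMA + (1 - SIGMA) ** 2)) / 2
e_u = np.array([1.0, lam1 / SIGMA + 1.0, 0.0])
x0 = e_u / np.linalg.norm(e_u)

sol = solve_ivp(lorenz, (0.0, 5.0), x0, rtol=1e-12, atol=1e-14, max_step=1e-3)

# approximate arc length of the traced piece of the unstable manifold
pts = sol.y.T
length = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
print(f"approximate length after t = 5: {length:.0f}")
```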
Since this is a 0 dimensional initial condition (except for the tiny error variables), there are no large problems with the propagation even for large times. COSY VI does not need to split this initial condition, as there are no significant directions to split along. This allows to integrate the unstable manifold forward through the Lorenz equations even for large times. The only limiting factor here is the growth of the remainder bound due to floating point round-off errors. Figures 5.7 through 5.12 show the growth of the unstable manifold for various times between t = 1 and t = 28, while Table 5.2 shows the length of the unstable manifold, the number of Taylor Models covering it, and the maximum error for various times up to t = 28. Both in the pictures and the table, only one half of the manifold is considered, the other half is symmetric due to the underlying symmetry of the Lorenz equations and is not included in these figures. To visualize the three dimensional structure of the manifolds, each polynomial flow was covered with a uniform polygon mesh which was then exported into the Meshlab[17] program for display. In the plots of the manifold, it was necessary to thicken the manifold to make it 232 visible at all, the enclosures are much sharper than shown in the pictures. As can be seen in Table 5.2, for times up to t = 25 the integration performs as it should. Once t = 28 is reached, however, this is near the end of the possible integration range. This can be seen by the huge number of resulting Taylor Models required to cover the manifold, about 9 times more than at time t = 25. This indicates that the time step size COSY VI can take after t = 25 drops significantly. With the two dimensional stable manifold the difference of the eigenvalues of the stable manifold by almost a factor of 10, causes the problems described in Section 5.3.4.2 to appear with full force. Since our goal is to compute as much of the local stable manifold of the origin as possible, we use the cropping integration technique described earlier. Any manifold piece that after an integration step ends up outside the box B = [−50, 50] × [−50, 50] × [−20, 90] is automatically discarded from further integration. This allows the computation of the entire local stable manifold up to some final integration time. The next question is which initial condition to use for the integration. Due to the extreme difference of the growth of the stable manifold in backward time along the direction of the stable eigenvector belonging to the eigenvalue µ1 compared to that along the z-axis, which is the eigenvector associated with eigenvalue µ2 , it proves to be very beneficial to select an initial condition as large as possible along the z-axis. In the following, we use the boundary of the initial stable manifold enclosure generated with η2 = 0.2 and η3 = 20. The small size of this enclosure in the x-y plane is no problem, as this is aligned in the direction of the strongly expanding eigenvector. Even though this larger initial condition is less accurate than one of the smaller ones, by the time the smaller initial conditions reach its size, their error would be at least as large. 233 Z Y Figure 5.7: The local unstable manifold of the origin in the Lorenz system propagated for time t = 1. The error of the manifold shown is less than 2 · 10−11 . 234 Z Y Figure 5.8: The local unstable manifold of the origin in the Lorenz system propagated for time t = 5. The error of the manifold shown is less than 2 · 10−10 . 
Figure 5.9: The local unstable manifold of the origin in the Lorenz system propagated for time t = 10. The error of the manifold shown is less than 2·10^-9.

Figure 5.10: The local unstable manifold of the origin in the Lorenz system propagated for time t = 15. The error of the manifold shown is less than 7·10^-8.

Figure 5.11: The local unstable manifold of the origin in the Lorenz system propagated for time t = 20. The error of the manifold shown is less than 5·10^-5.

Figure 5.12: The local unstable manifold of the origin in the Lorenz system propagated for time t = 28. The error of the manifold shown is less than 2·10^-2.

     t  | Approximate Area A | Maximum Error emax | Volume V  | # of Taylor Models | # of Intervals
    0.2 |        2360        |      3.7·10^-7     | 8.7·10^-4 |         70         |    1.7·10^16
    0.3 |        6307        |      3.7·10^-7     | 2.3·10^-3 |        145         |    4.6·10^16
    0.4 |        8112        |      3.7·10^-7     | 3.0·10^-3 |        233         |    5.9·10^16
    0.5 |       10192        |      1.5·10^-6     | 1.5·10^-2 |        473         |    4.5·10^15
    0.6 |       14862        |      7.9·10^-6     | 1.2·10^-1 |       2469         |    2.4·10^14

Table 5.3: Approximate area, maximum error, volume, and number of Taylor Models covering the integrated half of the stable manifold within the box B = [−50, 50] × [−50, 50] × [−20, 90] after propagating for the given time t backwards through the Lorenz equations.

Figure 5.13: The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.2. The error of the manifold shown is less than 4·10^-7.

Furthermore, due to the symmetry of the Lorenz equations, it is again sufficient to compute the manifold for only one half of the initial condition, lying in the half space x ≥ 0; the other half is obtained by the simple reflection R(x, y, z) = (−x, −y, z). Figures 5.13 through 5.17 show the entire manifold, including the mirror image of the computed pieces, for the same time steps. The rugged, sawtooth-like edges in these pictures are due to the way manifold pieces are discarded. Since the box B is not aligned with the direction of the eigenvectors, each flow expansion moves at an angle to the faces of B. Only flow expansions completely outside the box B are discarded, hence some pieces still have a corner inside B, while others are already completely outside.

Figure 5.14: The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.3. The error of the manifold shown is less than 4·10^-7.

Figure 5.15: The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.4. The error of the manifold shown is less than 4·10^-7.

Figure 5.16: The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.5. The error of the manifold shown is less than 2·10^-6.

Figure 5.17: The local stable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] propagated for time t = 0.6. The error of the manifold shown is less than 8·10^-6.

Table 5.3 shows the numerical results of the integration of the boundary of the initial stable manifold enclosure for η2 = 0.2 and η3 = 20 backward in time for various times t. All statistics given in this table refer to the half of the manifold actually integrated, and do not include the mirror image making up the other half. The maximum error emax comprises both the error due to the integration and the error due to the thickening of the initial manifold.
The volume V given in this table is computed as V = emax · A. The number of Taylor Models indicates how many flow expansions were needed to cover the entire surface. Finally, the number of intervals is an estimate of how many interval boxes would be required to cover the same manifold with the accuracy of the Taylor Models. This number is estimated as the number of interval boxes of 2 size emax required to cover the volume V (i.e. V /e3 max = A/emax ). The reason why the maximum error does not change for times 0.2 through 0.4 is because the maximum error comprises two components. One is the error accumulating during the numerical integration, the other is the error due to the thickening of the initial manifold. To the verified integrator, however, the thickening is not an error but just an extra dimension of the initial condition. In the beginning, the error due to the thickening of the initial condition dominates the maximum error. This thickening grows differently in different places along the manifold, the fastest growth occurring where the manifold itself grows the fastest. Thus right in the beginning, this error grows quickly, but then stops as the manifold pieces with the largest error are discarded as they leave the box B. With those “offenders” discarded, the remaining 243 pieces have much smaller thickness and thus do not contribute to the maximum error for quite a while. This shows that the maximum error given in this table is actually a rather pessimistic number, for the majority of the manifold the error is significantly less than this number. Eventually, the integration errors grow as the integration time increases and they become the dominating effect at time 0.5 and beyond, causing the maximum overall error to grow. As can be seen, integration can be performed up to time t = 0.5 easily. The verified integrator is able to reach time t = 0.6. At this time, however, the current implementation of the automated domain decomposition has split the manifold into so many pieces that the number of Taylor Models covering the manifold has grown too large and further integration becomes impractical. Most likely this effect can be greatly reduced by an improved implementation of the automated domain decomposition, including the use of High Precision Taylor Models to make splitting decisions, allowing integration of the manifold beyond t = 0.6. The comparison between intervals and Taylor Models is striking. As described in [26], one of the techniques to compute invariant manifolds that can be done in a verified setting is an interval based space pruning technique. However, in order to attain the same accuracy of representation as our Taylor Model covering, enormous numbers of interval boxes would be required just to store the representation. Considering that each interval box in three dimensions is made up of six double precision numbers of eight bytes each, just the total memory required to keep all these interval boxes is on the order of 1018 bytes or about 1 exabyte. Clearly this is practically impossible at this point in time, and this does not even consider yet the process to compute such an interval enclosure in the first place. The Taylor Model representation, on the other hand, requires less than 2500 Taylor Mod244 els at most. Each Taylor Model as used by COSY VI has a maximum of 15714 coefficients, resulting in a maximum memory requirement of 15714 · 2500 · 8 ≈ 314 · 106 bytes which is about 314 megabytes for the manifold representation. 
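The storage estimates of the preceding paragraphs amount to simple arithmetic on the entries of Table 5.3 and can be reproduced in a few lines; the numbers below are taken directly from the table and the text.

```python
# comparison of an interval covering with the Taylor Model covering,
# using the t = 0.2 row of Table 5.3
area  = 2360.0                      # approximate area A of the integrated half
e_max = 3.7e-7                      # maximum error of the enclosure

n_boxes   = area / e_max ** 2       # boxes of side e_max needed: ~1.7e16
bytes_box = 6 * 8                   # 3 intervals = 6 doubles of 8 bytes each
print(f"{n_boxes:.1e} interval boxes, {n_boxes * bytes_box:.1e} bytes")

n_tm      = 2500                    # upper bound on Taylor Models used
coeffs_tm = 15714                   # maximum coefficients per Taylor Model
print(f"{n_tm * coeffs_tm * 8 / 1e6:.0f} MB for the Taylor Model covering")
```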
In practice, this number is even much lower, since the Taylor Models used are only sparsely populated. To conclude our examination of the invariant manifolds of the Lorenz at the origin, figure 5.18 shows both the stable and unstable manifolds of the Lorenz system in the same picture. Unfortunately, we were unable to draw the stable and unstable manifold in different colors in this picture due to limitations in the Meshlab program used to display these pictures. 245 Figure 5.18: Both the local stable and unstable manifold of the origin in the Lorenz system within the box [−50, 50] × [−50, 50] × [−20, 90] in the same picture. The error on both manifolds shown is less than 5 · 10−5 . 246 APPENDIX 247 APPENDIX Compiler Options The algorithms described in this dissertation, especially those in Chapter 3, rely heavily on correct implementation of the IEEE 754 floating point standard[2]. While modern computer hardware in general implements double precision operations according to this standard, care must be taken that the code written is compiled into the correct sequence of processor instructions. Unfortunately, many compilers perform optimization on the code written and sometimes reorder operations. While often this results in faster execution time, reordering floating point operations in the Dekker operations breaks these operations, as they rely on subtle round-off effects of floating point numbers that are not taken into consideration by optimizing compilers. Our implementation is written in FORTRAN, and we use the Intel Fortran Compiler version 12. While this compiler performs aggressive optimization, it provides a command line switch to turn off any floating point optimizations. The command line we use to compile the COSY INFINITY environment is “-fp-model strict -m32”. This command line appears to be safe. A detailed analysis and comparison of various compilers performed by us can be found in [22]. 248 BIBLIOGRAPHY 249 BIBLIOGRAPHY [1] DoD supplement to ANSI X3.9-1978, Tech. report, US Department of Defense, 1978. [2] IEEE standard for binary floating-point arithmetic., Tech. Report IEEE Std 754-1985, The Institute of Electrical and Electronics Engineers, 1985. [3] S. Banach, Sur les op´rations dans les ensembles abstraits et leurs application aux e ´quations int´grales, Fund. Math. 3 (1922), 7–33. e e [4] M. Berz, Forward algorithms for high orders and many variables, Automatic Differentiation of Algorithms: Theory, Implementation and Application SIAM (1991). [5] , High-order computation and normal form analysis of repetitive systems, in: M. month (ed), physics of particle accelerators, vol. 249, p. 456, American Institute of Physics, New York, 1991. [6] , Differential algebraic formulation of normal form theory, M. Berz, S. Martin and K. Ziegler (Eds.), Proc. Nonlinear Effects in Accelerators (London), IOP Publishing, 1992, p. 77. [7] [8] , From Taylor series to Taylor models, AIP CP 405 (1997), 1–20. , Modern map methods in particle beam physics, Academic Press, San Diego, 1999, Also available at http://bt.pa.msu.edu/pub. [9] M. Berz and G. Hoffst¨tter, Computation and application of Taylor polynomials with a interval remainder bounds, Reliable Computing 4 (1998), no. 1, 83–97. 250 [10] M. Berz and K. Makino, Suppression of the wrapping effect by Taylor model- based verified integrators: Long-term stabilization by shrink wrapping, International Journal of Differential Equations and Applications 10,4 (2005), 385–403. [11] K. Braune and W. 
Kramer, High-accuracy standard functions for intervals, Computer Systems: Performance and Simulation (Manfred Ruschitzka, ed.), vol. 2, Elsevier Science Publishers B.V., 1986, pp. 341–347. [12] I. N. Bronstein, K. A. Semendjajew, G. Musiol, and H. M¨hlig, Taschenbuch der Mathu ematik, 5th edition ed., Verlag Harri Deutsch, 2001. ¨ [13] L.E.J. Brouwer, Uber eineindeutige, stetige Transformationen von Fl¨chen in sich, a Math. Ann. 69 (1910), 176–180. [14] X. Cabr´, E. Fontich, and R. de la Llave, The parameterization method for invarie ant manifolds I: Manifolds associated to non-resonant subspaces, Software for Railways Control and Protection Systems. EN 50128, 1995. [15] J. B. Conway, Functional analysis, Springer, 1990. [16] T.J. Dekker, A floating-point technique for extending the available precision, Numerische Mathematik 18 (1971/72), 224–242. [17] Paolo Cignoni et al., Meshlab, Tech. report, Visual Computing Lab - ISTI - CNR, 2011. [18] David Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Comput. Surv. 23 (1991), no. 1, 5–48. [19] Johannes Grote, High-order computer-assisted estimates of topological entropy, Ph.D. thesis, Michigan State University, East Lansing, Michigan, USA, 2009. [20] M. Hazewinkel (ed.), Encyclopaedia of mathematics, Springer Verlag, 2002. [21] M. H´non, A two-dimensional mapping with strange attractor, Communications in e Mathematical Physics 50 (1976), no. 1, 69–77. [22] R. Jagasia and A. Wittig, Survey of fortran compiler options and their impact on cosy infinity, Tech. Report MSUHEP-090422, Michigan State University, 2009. 251 [23] W. Kahan, Pracniques: further remarks on reducing truncation errors, Commun. ACM 8 (1965), no. 1, 40. [24] A. Katok and B. Hasselblatt, Introduction to the modern theory of dynamical systems, Cambridge University Press, Cambridge, 1996. [25] D. E. Knuth, The art of computer programming, vol. I-III, Addison Wesley, Reading, MA, 1973. [26] B Krauskopf, H M Osinga, E J Doedel, M E Henderson, J Guckenheimer, A Vladimirsky, M Dellnitz, and O Junge, A survey of methods for computing (un)stable manifolds of vector fields, International Journal of Bifurcation and Chaos 15 (2005), no. 3, 763–791. [27] E. N. Lorenz, Deterministic nonperiodic flow, J. Atmos. Sci. 20 (1963), 130–141. [28] K. Makino, Rigorous analysis of nonlinear motion in particle accelerators, Ph.D. thesis, Michigan State University, East Lansing, Michigan, USA, 1998, Also MSUCL-1093. [29] K. Makino and M. Berz, Remainder differential algebras and their applications, Computational Differentiation: Techniques, Applications, and Tools (Philadelphia) (M. Berz, C. Bischof, G. Corliss, and A. Griewank, eds.), SIAM, 1996, pp. 63–74. [30] , Efficient control of the dependency problem based on Taylor model methods, Reliable Computing 5 (1999), no. 1, 3–12. [31] , Taylor models and other validated functional inclusion methods, International Journal of Pure and Applied Mathematics 6,3 (2003), 239–316. [32] , COSY INFINITY version 9, Nuclear Instruments and Methods 558 (2005), 346–350. [33] , Range bounding for global optimization with Taylor models, Transactions on Computers 4,11 (2005), 1611–1618. [34] , Suppression of the wrapping effect by Taylor model- based verified integrators: Long-term stabilization by preconditioning, International Journal of Differential Equations and Applications 10,4 (2005), 353–384. 
[35] , Suppression of the wrapping effect by Taylor model- based verified integrators: The single step, International Journal of Pure and Applied Mathematics 36,2 (2006), 175–197. 252 [36] Kyoko Makino, Private communication. [37] R. E. Moore, Automatic local coordinate transformation to reduce the growth of error bounds in interval computation of solutions of ordinary differentical equations, Error in Digital Computation, Vol II (L. B. Rall, ed.), 1965. [38] , Interval analysis, Prentice-Hall, Englewood Cliffs, NJ, 1966. [39] S. Newhouse, Private communication. [40] N. Revol, Multiple precision floating-point interval library, Tech. report, SPACES, INRIA Lorraine and Arenaire, INRIA Rhone-Alpes, 2002. [41] J. R. Shewchuk, Adaptive precision floating-point arithmetic and fast robust geometric predicates, Discrete & Computational Geometry 18 (1997), 305–363. [42] P. Snopok, Optimization of accelerator parameters using normal form methods on highorder transfer maps, Ph.D. thesis, Michigan State University, East Lansing, Michigan, USA, 2007. [43] N. Wiener, Cybernetics: or control and communication in the animal and the machine, MIT Press, 1948. [44] A. Wittig and M. Berz, Computation of high-order maps to multiple machine precision, International Journal Modern Physics A 24 (2008), no. 5, 1019–1039. [45] , Rigorous high precision interval arithmetic in COSY INFINITY, Communications of the Fields Institute in print (2008). [46] , High period fixed points, stable and unstable manifolds, and chaos in accelerator transfer maps, Vestinik Mathematica in preparation (2010). [47] A. Wittig, M. Berz, J. Grote, K. Makino, and S. Newhouse, Rigorous and accurate enclosure of invariant manifolds on surfaces, Regular and Chaotic Dynamics 15 (2010), no. 2-3, 107–126. [48] A. Wittig, M. Berz, and S. Newhouse, Computer assisted proof of the existence of high period periodic points, Communications of the Fields Institute in print (2008). 253