F3?"

‘ x’Fi'

lCHlGAN STATE

IIIIIIIIIIIIILII I

 

 

 

 

 

 

 

 

 

IIIIIIIIIIIIIIIIIIIIIIIIIIIIII

24108

This is to certify that the

dissertation entitled

DESIGNING MAXIMALLY ADAPTIVE
ALGORITHMS FOR WORMHOLE ROUTING:
THE TURN MODEL

presented by

CHRI STOPHER JAMES GLASS

has been accepted towards fulﬁllment
of the requirements for

PHD degree in COMPUTER SCIENCE

 

gm

Major professor

Date M

MS U i: an Affirmative Action/Equal Opportunity Institution 0-12771

 

 

 

LJE’SR‘ARY
M'Chlgan State
ntverslty

*

 

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or betoro due due.

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU to An Afﬁrmative Action/Equal Opportunity Institution
oMMmS-ot

 

DESIGNING MAXIMALLY ADAPTIVE
ALGORITHMS FOR WORMHOLE ROUTING:
THE TURN MODEL

By

Christopher James Glass

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1992

{7% a 5/59

ABSTRACT

DESIGNING MAXIMALLY ADAPTIVE
ALGORITHMS FOR WORMHOLE ROUTING:
THE TURN MODEL

By

Christopher James Glass

One obstacle to using wormhole routing as the switching technique in a direct net-
work is designing a good algorithm for routing message packets through the network.
A good routing algorithm should provide low communication latency, high network
throughput, and ease of implementation in VLSI. Contributing to these objectives
are such factors as deadlock freedom, adaptive routing, nonminimal routing, live-
lock freedom, and fault tolerance. Previous models for designing routing algorithms
have been either ad hoc or based on adding virtual channels to the networks. The
algorithms produced by these models prevent deadlock, but the cost is either non-
adaptiveness in routing or extra hardware to support virtual channels. To solve the
design problem more completely, this dissertation presents a model for systematically
designing routing algorithms that are deadlock free and maximally adaptive for a net-
work, as well as minimal or nonminimal, livelock free, and fault tolerant. The model

is not based on adding virtual channels to a network, but can be applied to networks

that have virtual channels. Instead, the model is based on analyzing the directions
in which packets can turn in a network and the cycles that the turns can form. This
turn model is applied to n-dimensional meshes and k-ary n-cubes in the dissertation,
but it can be applied to any direct network that has directions associated with its
channels. Simulations of the routing algorithms produced for meshes and hypercubes

show that they can perform better than previous routing algorithms.

To my Wife, Fran

iv

ACKNOWLED GEMENTS

Many persons have made this degree and dissertation possible. Among them, I
especially wish to thank Dr. Carl V. Page for recruiting me and for lengthy discussions
of artiﬁcial intelligence; Dr. Abdol Esfahanian, Dr. Dorian Feldman, and Dr. Wen-ling
Hsu for making my ﬁrst quarter back in school challenging and exciting; Dr. Lionel M.
Ni for the stimulating class “Multiprocessors and Parallel Processing,” which inspired
this dissertation, and for being an understanding advisor and shrewd researcher;
Dr. Richard Enbody for pointing out practical considerations in the research; and
Dr. Philip McKinley for helping develop the simulator. Finally, I wish to thank
my wife, Fran, whose encouragement, sacrifices, and added. support have made my
education possible. '

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction

1.1 Direct Networks and Multicomputers ..................
1.2 Network Topologies ............................
1.3 Switching Techniques ...........................
1.4 Routing Algorithms ............................
1.5 Problem Statement and Objective ....................
1.6 Dissertation Organization ........................

2 Previous Models

2.1 Dimension Ordering ...........................
2.2 The Virtual Channel Model .......................
2.3 The Virtual Network Model .......................
2.4 The Dimension Reversal Model .....................
2.5 The Wraparound Model .........................
3 The Turn Model
3.1 Turns in my Networks ...........................
3.2 Turns in Double-my Networks ......................
3.3 Turns in Double-y Networks .......................
3.4 Systematic Design of Routing Algorithms ................

4 The Simulator

4.1 Direct Networks ..............................
4.2 Channel Selection Policies . . . . . . . . . ..............
4.2.1 Input Selection Policies ......................
4.2.2 Output Selection Policies .....................
4.3 Trafﬁc Patterns ..............................

vi

ix

mqmtnwwo-t H

G

9
10
12
13
15

18
19
21
22
22

26
27
27
27
29
32

4.4 Simulator Software ............................

4.4.1 Structure .............................
4.4.2 Inputs ...............................
4.4.3 Outputs ..............................
4.4.4 Running the Simulator ......................

Partially Adaptive Routing

5.1 - Two-Dimensional Meshes .........................
5.1.1 The West-First Routing Algorithm ...............
5.1.2 The North-Last Routing Algorithm ...............
5.1.3 The Negative-First Routing Algorithm .............
5.1.4 Degrees of Adaptiveness .....................

5.2 Three-Dimensional Meshes ........................
5.2.1 Degrees of Adaptiveness .....................

5.3 n-Dimensional Meshes ..........................
5.3.1 Degrees of Adaptiveness .....................

5.4 Ic-ary n-cubes ...............................

5.5 Hypercubes ................................
5.5.1 Degrees of Adaptiveness .....................

5.6 Simulation Experiments .........................
5.6.1 Input Selection Policies ......................
5.6.2 Output Selection Policies .....................
5.6.3 Trafﬁc Patterns ..........................
5.6.4 Spreading Traffic More Evenly ..................

Fully Adaptive Routing

6.1 Two-Dimensional Meshes .........................

6.2 Simulation Experiments .........................

6.3 n-Dimensional Meshes ..........................

6.4 Three-Dimensional Meshes ........................

Fault-Tolerant Routing

7.1 A Model of Fault-Tolerant Routing ...................

7.2 Two-Dimensional Meshes .........................

7.3 Simulation Experiments .........................

Summary and Conclusions

8.1 Implementation Considerations .....................

8.2 Future Work ................................

vii

33
34
34
35
36

37
37
39
41
43
43
47
48
51
52
53
55
56
57
58
61
65
72

78
78
87
93
98

100
100
101
104

109

111

1 l4
BIBLIOGRAPHY

viii

3.1

5.1

6.1

8.1

LIST OF TABLES

Comparison of models for designing routing algorithms ......... 25
Example of a path allowed by the p—cube routing algorithm. ..... 58
The degree of adaptiveness for maximally fully adaptive routing algo-

rithms .................................... 98
The directional phases of some routing algorithms ............ 113

ix

1.1

3.1

3.2

3.3

3.4

3.5

4.1

4.2

5.1
5.2
5.3
5.4
5.5

5.6
5.7

5.8
5.9
5.10
5.11

LIST OF FIGURES

A4x4mesh ................................

A wormhole deadlock involving four routers and four packets in a two-
dimensional mesh ..............................
A router in an 33/ network and the turns allowed by some routing
algorithms. ................................
The two simple cycles and one of the complex cycles formed by the
turns in an :ry network ........ ' ...................
A router in a double-2y network and the turns allowed by the double-
zy routing algorithm ...... - ......................
A router in a double-y network and the turns allowed by the double-y
routing algorithm ..............................

Channel utilization in a 10 x 10 mesh for uniform trafﬁc and the my
routing algorithm ..............................
Turns can increase contention .......................

The possible turns and simple cycles in a two-dimensional mesh. . . .
The four turns allowed by the $3; routing algorithm ...........
Six turns that complete the cycles and allow deadlock. ........
The west-ﬁrst routing algorithm for two-dimensional meshes ......
Numbering of the channels leaving each node (3,31) for the west-ﬁrst
routing algorithm ..............................
Numbering of a 4 x 4 mesh for the west-ﬁrst routing algorithm.

For each input channel, the output channels allowed by the west-ﬁrst
routing algorithm have lower numbers. .................
The north-last routing algorithm for two-dimensional meshes.

The negative-ﬁrst routing algorithm for two-dimensional meshes. . . .
The possible turns and simple cycles in a three-dimensional mesh. . .
The turns allowed by three routing algorithms for three-dimensional
meshes. ................ i ..................

19

20

20

21

22

31
32

38
38
39
40

41
42

42

45
47

5.12 The channels used during the two phases of the negative-ﬁrst routing

algorithm for 4-ary 2-cubes. ....................... 54
5.13 The minimal p-cube routing algorithm for hypercubes. ........ 56
5.14 The nonminimal p-cube routing algorithm for hypercubes. ...... 57

5.15 Comparison of input selection policies for the xy routing algorithm. . 59
5.16 Comparison of input selection policies for the west-ﬁrst routing algorithm. 59
5.17 Comparison of input selection policies for the north-last routing algo-

rithm. ................................... 60
5.18 Comparison of input selection policies for the negative-ﬁrst routing
algorithm. ................................. 60

5.19 Comparison of input selection policies for the west-ﬁrst routing algorithm. 62
5.20 Comparison of input selection policies for the north-last routing algo-

rithm. ................................... 62
5.21 Comparison of input selection policies for the negative-ﬁrst routing
algorithm. ................................. 63
5.22 Comparison of output selection policies for the west-first routing algo-
rithm. ................................... 64
5.23 Comparison of output selection policies for the north-last routing al-
gorithm. .................................. 64

5.24 Comparison of output selection policies for the negative-first routing

algorithm. ................................. 65
5.25 Comparison of routing algorithms for uniform trafﬁc in a 10 x 10 mesh. 67
5.26 Comparison of routing algorithms for uniform trafﬁc in a 16 x 16 mesh. 67
5.27 Comparison of routing algorithms for uniform trafﬁc in a 10 X 10 x 10

mesh ....................... . .............. 68
5.28 Comparison of routing algorithms for matrix-transpose trafﬁc in a 10 x

10 mesh. . , ................................ 68
5.29 Comparison of routing algorithms for matrix«transpose trafﬁc in a 16 x

16 mesh ................................... 69
5.30 Comparison of routing algorithms for matrix-transpose trafﬁc in a bi-

nary 8-cube ................................. 69
5.31 Comparison of routing algorithms for matrix-rotate trafﬁc in a 10 x 10

mesh ..................................... 70
5.32 Comparison of routing algorithms for merge-south trafﬁc in a 10 x 10

mesh ..................................... 70
5.33 Comparison of routing algorithms for address-rotate trafﬁc in a binary

8-cube .................................... 71

xi

5.34 Comparison of routing algorithms for reverse-ﬂip trafﬁc in a binary
8-cube ....................................
5.35 The directions in which packets in the two halves of a two-dimensional

71

mesh are routed by the east-west routing algorithm during its two phases. 75

5.36 Comparison of routing algorithms for uniform trafﬁc in a 10 x 10 mesh. 76

5.37 Comparison of routing algorithms for matrix-transpose trafﬁc in a 10 x
10 mesh ...................................
5.38 Comparison of routing algorithms for matrix-rotate trafﬁc in a 10 x 10
mesh .....................................
5.39 Comparison of routing algorithms for merge-south trafﬁc in a 10 x 10
mesh .....................................

6.1 Eight simple cycles formed by the sixteen turns in a double-y network.

6.2 A router in a double-y network and the turns allowed by some routing
algorithms. ................................

6.3 The potential cycles created by allowing O-degree turns when travelling
. north. ...................................

6.4 Numbering of the channels leaving each node (2:, y) fer the mad-y rout-
ing algorithm ................................

6.5 For each input channel, the output channels allowed by the mad-y
routing algorithm have higher numbers ..................

6.6 A router for a two-dimensional mesh with double eastward and north-
ward channels and the possible turns ...................

6.7 Comparison of nonadaptive routing algorithms for uniform trafﬁc in a
16 x 16 mesh with double y channels ...................

6.8 Comparison of routing algorithms for uniform trafﬁc in a 16 x 16 mesh
with double y channels ...........................

6.9 Comparison of routing algorithms for center-reﬂection trafﬁc in a 16 x
16 mesh with double y channels ......................

6.10 Comparison of routing algorithms for matrix-transpose trafﬁc in a 16 x
16 mesh with double y channels ......................

6.11 Comparison of routing algorithms for bit-reversal trafﬁc in a 16 x 16
mesh with double 3; channels. ......................

6.12 Comparison of routing algorithms for matrix-multiply trafﬁc in a 16 x
16 mesh with double y channels ......................

6.13 The directions of the virtual channels in the two pairs of classes for
fully adaptive routing in a two-dimensional mesh. ...........

xii

76

77

77

79

80

81

84

85

86

88

89

90

90

91

91

94

6.14 The directions of the virtual channels in the four pairs of classes for

7.1

7.2
7.3

7.4

7.5

7.6

7.7

fully adaptive routing in a three-dimensional mesh. .......... 99

Examples of the paths allowed by the one-fault-tolerant routing algo-
rithm for two-dimensional meshes ..................... 102
Numbering of a 4 x 4 mesh for the one-fault-tolerant routing algorithm. 105
Comparison of the minimal and nonminimal west-ﬁrst routing algo-

rithms for uniform trafﬁc. ........................ 106
Comparison of the minimal and nonminimal north-last routing algo-
rithms for uniform trafﬁc. ........................ 106
Comparison of the minimal and nonminimal negative-ﬁrst routing al-
gorithms for uniform trafﬁc. ....................... 107

Comparison of different amounts of misrouting in the negative-first
routing algorithm for uniform trafﬁc. .................. 107
Comparison of the minimal and nonminimal negative-first routing al-
gorithms for matrix-transpose trafﬁc. .................. 108

xiii

CHAPTER 1

Introduction

Sequential electronic computers have been steadily decreasing in size, power consump-
tion, and cost, and steadily increasing in speed and reliability, since their creation a
half century ago. These improvements are largely a result of the development of
transistors and ever better techniques for miniaturizing and integrating transistors.
Miniaturization will, however, soon reach a physical limit. Current techniques can
fabricate semiconductors with features smaller than a micron, but features cannot
become smaller than atoms. Therefore, further miniaturization is limited to a few
more orders of magnitude. Because signal propagation rates are limited to the speed
of light, the limit on miniaturization is also a limit on the rate at which sequential
electronic computers can process information. For electronic computers to continue
improving in performance will require a change from sequential architectures to mas-
sively parallel architectures. Massively parallel architectures enable thousands of pro-
cessors to divide a task into subtasks and perform the subtasks concurrently, speeding

computation and permitting greater fault tolerance and reliability.

1.1 Direct Networks and Multicomputers

The first consideration in building a computer with numerous processors, memory
banks, input/output devices, and other components is the interconnection of the
components. The method that currently scales the best to thousands of components
is the direct network [1, 2, 3]. A direct network has channels connecting each node
(that is, group of components) directly to only a few other nodes, its neighbors.
Neighboring nodes communicate with each other by sending messages across the
channels. A node communicates with a node that is not its neighbor by sending a
message to one of its neighbors for forwarding. To handle the complexities of routing
messages through the direct network, each node often has a router. The router
controls both local channels, which connect it to local devices, and network channels,
which connect it to neighboring routers.

One promising, massively parallel architecture based on the direct network is the
multicomputer [4]. Each node of a multicomputer has at least one processor and
memory bank. The processors execute separate instruction streams, and all memory
is distributed. The lack of centralized resources makes multicomputers highly scalable

and a promising architecture for continued improvements in computer performance.

1.2 Network Topologies

Which nodes in a direct network are neighbors is deﬁned by the topology of the net—
work. The most popular topologies for direct networks are n-dimensional meshes
and k-ary n-cubes, particularly low-dimensional meshes and hypercubes. A two-
dimensional mesh is used in the Intel Touchstone DELTA [5], the Intel Paragon,
and the Symult 2010 [6]; a two-dimensional torus is used in the Fujitsu AP1000 [7];
a three-dimensional mesh is used in the MIT J-machine [8]; and a hypercube is used

in the nCUBE—2 [9].

Formally, an n-dimensional mesh has k0 x k1 x - - - x lend nodes, k.- nodes along
each dimension i, where k,- 2 2. A 4 x 4 mesh is shown in Figure 1.1. Each node
X in an n-dimensional mesh is identiﬁed by n coordinates, (2:0, 3:1, . . . , and), where
0 S 1:.- S k; - 1 for each dimension i. Two nodes X and Y are neighbors if and only
if 2:.- = y,- for all i except one, j, where y, = 2:; i 1. Thus, nodes have from n to
2n neighbors, depending on their location in the mesh. In a k-ary n-cube [10], all
nodes have the same number of neighbors. The deﬁnition of a k-ary n-cube differs
from that of an n-dimensional mesh in that the 165,8 are all equal to k and in that
two nodes X and Y are neighbors if and only if 2:.- = y,- for all i except one, j,
where y,- = (2:,- :l: 1) mod Is. The change to modular arithmetic in the deﬁnition
adds wraparound channels to the k-ary n-cube, giving it symmetry. As a result, each
node in a k-ary n-cube has n neighbors if k = 2 and 2n neighbors if k > 2. Another
topology with symmetry is a hypercube, which is a special case of both n-dimensional
meshes and k-ary n-cubes. A hypercube is an n-dimensional mesh in which k,- = 2

for all i, or a 2—ary n-cube.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[3 1-3 ___. 2e 33.
nib—M. 1.2 ‘_‘——'., 2.2 3.2
L. .11.. .l , .l_.

rm 1,1 , 2,1 3,1

[11 1,0 2,0 ‘— 3,0

 

 

 

 

 

 

 

 

 

Figure 1.1. A 4 x 4 mesh.

Meshes and lc-ary n-cubes are popular, in part, because their regular topolo-
gies simplify the routing of messages. In general, low-dimensional meshes are pre-
ferred because they have low, ﬁxed node degrees, which makes them more scalable
than high-dimensional meshes and k-ary n-cubes. Low-dimensional meshes also have
fewer channels and higher channel bandwidth per bisection density and have lower
contention and blocking latencies, which result in lower communication latencies and
higher hot-spot throughputs [10]. In addition, their function better matches their
form, two and three topological dimensions being easier to implement in the three
physical dimensions. High-dimensional meshes and k-ary n-cubes, on the other hand,
have lower diameters, which shortens path lengths. High-dimensional topologies also
have more paths between pairs of nodes, which permits more fault tolerance. Finally,
the symmetry of lc-ary n-cubes can make it easier to spread message trafﬁc more

evenly.

1.3 Switching Techniques

In the latest generation of multicomputers, the most popular technique for switching
messages along network channels is wormhole routing [11]. It is used in the Intel
Touchstone DELTA [5], the Intel Paragon, the Symult 2010 [6], the Fujitsu AP1000
[7], the MIT J-machine [8], and the nCUBE-2 [9]. Wormhole routing switches a
message along the channels of a direct network by ﬁrst dividing the message into
packets and the packets into ﬂow control digits or ﬂits. It then routes the ﬁits in
each packet through the network in a pipeline fashion. Header ﬂits contain all of the
routing information for a packet and lead each packet through the network. When
the header ﬂits reach a router that has no suitable output channel available, all of
the ﬂits in the packet wait where they are for a suitable channel to become available.

One of the attractions of wormhole routing is that each router requires only enough

buffer space to store a few ﬂits for each channel. The store-and-forward and vir-
tual cut-through switching techniques [12, 13] require enough buffer space to store
an entire packet for each channel. Another attraction of wormhole routing is that
its communication latencies are low. In the absence of contention, the latencies for
store-and-forward are proportional to the product of the packet length and the dis-
tance to travel [14, 10]. The latencies for wormhole routing, virtual cut-through,
and circuit switching, on the other hand, are proportional to the sum of the packet
length and the distance to travel. Wormhole routing also has some advantages over
circuit switching. Channel reservation and release are an integral part of wormhole
routing, but are separate phases in circuit switching. Also, wormhole routing permits
routers to replicate ﬂits and output them over multiple channels, making multicast

and broadcast communications possible [15, 9].

1 .4 Routing Algorithms

The sequence of channels a message packet traverses on its way from its source node
to its destination node is determined by the routing algorithm that the routers of
the direct network execute. A good routing algorithm should have three features:
low communication latency, high network throughput, and ease of implementation in
VLSI. In other words, a routing algorithm should provide packet transmission that is
high in quality, quantity, and practicality. Factors contributing to ease of implemen-
tation are little hardware for channels, buffers, switches, and control logic. Factors
contributing to low latency and high throughput are freedom from deadlock, freedom
from indeﬁnite postponement, freedom from livelock, spreading packet trafﬁc evenly,
routing packets adaptively, routing packets along short paths, and fault tolerance.
Many of these terms require some deﬁnition (partly because they have been used

in different ways by different authors). Deadlock occurs when a packet waits for an

event that cannot happen. For example, a packet may wait for a network resource to
be released by a packet that is, in turn, waiting for the ﬁrst packet to release some re-
source. In wormhole routing, such a circular wait condition causes deadlock, because
a packet holds resources while waiting and excludes other packets from acquiring the
held resources. Indeﬁnite postponement is similar to deadlock, but occurs when a
packet waits for an event that can happen but never does. For example, a packet
may wait forever to acquire a network resource for which other packets are always
competing successfully. Indeﬁnite postponement is only possible when the allocation
of network resources is fair. Both deadlock and indeﬁnite postponement can stop a
packet from being injected into a network or from moving once it is in the network.
In contrast, livelock does not stop the movement of a packet, but rather stops its
progress toward the destination. Livelock occurs when the routing of a packet never
leads it to its destination. Livelock is possible only when routing is adaptive (not
along a predetermined path) and nonminimal (away from, or equidistant to, the des-
tination at times). Minimal routing, on the other hand, restricts packets to shortest
paths. Although minimal routing may initially sound more promising, nonminimal
routing provides better fault tolerance.

Of the factors contributing to good routing algorithms, the most important is
freedom from deadlock. Deadlock can keep many or all packets from reaching their
destinations and occurs readily unless a routing algorithm contains preventive mea-
sures. Indeﬁnite postponement and livelock are less likely to occur and are generally
easier to prevent. Adaptiveness is also an important factor, because it contributes
to several of the other factors [16]. It provides alternative paths for packets that
encounter unfair channel allocation, faulty hardware, or hot spots in trafﬁc patterns.
The adaptiveness of a routing algorithm falls into one of three categories, depending
on the number of nodes the algorithm can choose from when forwarding packets, or

the number of shortest paths the algorithm can route packets along. A nonadaptive

algorithm has a choice of only one node for each packet to be forwarded. Thus, a
nonadaptive algorithm routes a packet from its source node to its destination node
along a path that is predetermined. A partially adaptive algorithm [17] can choose
from multiple nodes for some packets, but cannot choose from every node leading
toward the destination node for every packet. Thus, a partially adaptive algorithm
cannot route packets along every shortest path in a network topology. A fully adap-
tiue algorithm can choose from every node leading toward the destination node for
every packet. Thus, a fully adaptive algorithm can route packets along every shortest
path in a network topology. Closely related to the categories of nonadaptive, partially
adaptive, and fully adaptive routing algorithms is the category of maximally adaptive
routing algorithms. A maximally adaptive algorithm can choose from as many chan-
nels leading toward the destination node as possible while preventing deadlock. The
focus is on channels, instead of nodes. Depending on the number and arrangement of
the channels in a direct network, the maximally adaptive algorithms for the network

may be nonadaptive, partially adaptive, or fully adaptive.

1.5 Problem Statement and Objective

One obstacle to using wormhole routing as the switching technique in a direct network
is designing a good algorithm for routing message packets through the network. As
described in the preceding section, a good algorithm provides low communication
latency, high network throughput, and ease of implementation in VLSI. Further,
low latency and high throughput are largely the result of deadlock prevention and
adaptiveness in the algorithm. Accordingly, the objective of this dissertation is to
develop a model for systematically designing routing algorithms that are deadlock

free and maximally adaptive for typical direct networks.

1.6 Dissertation Organization

This dissertation examines the design of wormhole routing algorithms for direct net-
works. Chapter 2 reviews previous models for designing routing algorithms and some
of the resulting algorithms. Chapter 3 presents a new model, one based on analyzing
the directions in which packets can turn in a network and the cycles the turns can
form. Chapter 4 describes the simulator used to study the routing algorithms pro-
duced by the different models. Chapters 5 and 6 apply the turn model to networks
without and with virtual channels. Chapter 7 applies the turn model to the problem
of fault-tolerant routing. Finally, Chapter 8 concludes the dissertation and proposes

future work.

CHAPTER 2

Previous Models

Previous models for designing wormhole routing algorithms for direct networks are
either ad hoc or based on adding extra channels to the networks. The algorithms
produced by these models prevent deadlock, but the cost is either nonadaptiveness

in routing or extra hardware for channels. '

2.1 Dimension Ordering

The primary ad hoc model is dimension ordering, in which the dimensions of a direct
network are ordered and message packets are routed along the dimensions in that
order. Applied to a two-dimensional mesh with the minimum number of channels
(an my network), this model produces the my routing algorithm [6, 5]: route a packet
ﬁrst along the m dimension (dimension 0) and then along the y dimension (dimen-
sion 1). This routing algorithm is the most popular for my networks. Applied to
hypercubes, dimension ordering produces the e-cube routing algorithm [18]: route a
packet ﬁrst along the lowest dimension and then along higher and higher dimensions,
until the destination node is reached. This routing algorithm is the most popular for
hypercubes.

Dimension ordering ensures that the my and e-cube algorithms prevent deadlock,

10

by denying the circular wait condition. A packet that has just reserved a channel along
one dimension will try to reserve another channel along only the same dimension or
a higher dimension. Thus, there is no way for a cycle of waiting packets to be formed
in which a packet holding a channel is waiting for a higher-dimension channel and a
packet holding a channel along that higher dimension is waiting for a lower-dimension
channel. Also, packets waiting to reserve channels along the same dimension cannot
form a cycle: a packet holding a channel in one direction along the dimension will try
to reserve another channel in only the same direction (because routing is minimal),
and the channels in a single direction do not form a cycle (because meshes do not
have wraparound channels, as do k-ary n-cubes).

Although dimension ordering prevents deadlock, it has some major drawbacks.
First, it is applicable only to network topologies without wraparound channels. More
critically, it ensures nonadaptiveness: for each pair of source and destination nodes,
it routes all packets along the same path. Finally, dimension ordering does not allow
nonminimal routing, which is important for fault tolerance and increased adaptive-

11888.

2.2 The Virtual Channel Model

Subsequent models for designing routing algorithms have attempted to overcome the
drawbacks associated with dimension ordering. All of these models are based on
the addition of virtual channels [19, 20] to direct networks. Adding special buffers
and control logic allows a physical channel that would normally be used by only
one packet at a time to be shared by multiple packets. Thus, one physical channel
becomes multiple virtual channels. The virtual channels can be used both to prevent
deadlock and to provide an arbitrary degree of adaptiveness.

Dally and Seitz [19] propose a model to address the problem of topologies with

11

wraparound channels. Their virtual channel model is based on breaking all of the
cycles in the channel dependency graph for a direct network and routing algorithm,
which they prove prevents deadlock. The channel dependency graph is a directed
graph with a vertex for each channel in the network and an arc from one vertex to
another if the algorithm can route packets directly from that channel to the other.

The steps of the model are as follows.

1. Find a nonadaptive routing algorithm for the direct network, without regard

for whether it is deadlock free or not.
2. Construct the channel dependency graph for the network and algorithm.

3. Where there are cycles in the graph, divide the physical channels into virtual
channels, and modify the algorithm to route a packet along any of the virtual

channels.
4. Construct the channel dependency graph for the new network and algorithm.

5. Eliminate enough arcs in the graph to break all cycles, and modify the algorithm

to prohibit the corresponding transitions from one channel to another.

To demonstrate the versatility of their model, Dally and Seitz apply it to three net-
work topologies: k-ary n-cubes, cube-connected cycles, and shuffle-exchange net-
works.

Although the virtual channel model can be applied to any topology, including
those with wraparound channels, it still has several major drawbacks. First, some of
the steps are too indeﬁnite. How does one ﬁnd the initial routing algorithm in Step 1?
Into how many virtual channels does one divide the physical channels in Step 3? How
does one choose arcs to eliminate in Step 5, especially when the channel dependency
graph constructed in Step 4 is very complex? The result is that the model must be

applied iteratively, using trial and error. Another drawback is that the model tends

12

to add more virtual channels to a network than necessary, increasing its cost. Finally,
and most critically, the model produces routing algorithms that are nonadaptive and

minimal, just as dimension ordering does.

2.3 The Virtual Network Model

The ﬁrst to propose a model for designing adaptive routing algorithms are Yantchev
and Jesshope [21, 22]. Their virtual network model produces a fully adaptive algo-

rithm for any network topology. The steps of the model are the following.

1. Classify the pairs of source and destination nodes in the topology according
to the directions of the channels in the shortest paths connecting them. In an
extreme case, there is a separate class for each pair of source and destination

nodes.

2. For each class, create an independent, cycle-free virtual network to route its
messages minimally and with full adaptiveness. The virtual network contains
a virtual channel for each channel of the topology that is in any of the shortest

paths of the class.

3. If partially adaptive routing is acceptable, eliminate some virtual networks and
allow messages to transfer from some virtual networks to some others, as long

as doing so permits full connectivity and does not introduce cycles.

4. Overlay the virtual networks onto the physical network, dividing physical chan—
nels into virtual channels where more than one virtual channel is mapped onto

the same physical channel.

Applied to two-dimensional meshes, the virtual network model classiﬁes source-

destination pairs according to whether they route messages northeast, southeast,

13

southwest, or northwest and creates a separate virtual network to route each class
with full adaptiveness. For example, for the northeast class, the model creates a north
or east virtual channel for each north or east physical channel. Overlaying the four
virtual networks results in two virtual channels for each physical channel, and a two—
dimensional mesh called a double-my network. The corresponding double-my routing
algorithm routes a northwest packet along the ﬁrst set of north and west channels,
a northeast packet along the second set of north and east channels, a southeast
packet along the ﬁrst set of south and east channels, and a southwest packet along
the second set of south and west channels. If it is desirable to reduce the number
of virtual channels per physical channel to one (that is, leave the physical channels
unmodiﬁed), the southeast and northwest virtual networks can be eliminated. The
resulting 2-plane routing algorithm routes packets southwest as far as necessary and
then northeast as far as necessary. The draWback is that it is only partially adaptive.
It routes packets southeast and northwest nonadaptively.

The virtual network model has the advantages both of being applicable to any
network topology and of producing adaptive routing algorithms. In addition, the steps
are more deﬁnite than those in the virtual channel model. Like the previous models,
however, the virtual network model produces algorithms only for minimal routing,
which limits its usefulness for fault tolerance. The virtual network model also tends
to add more virtual channels than necessary to achieve full or partial adaptiveness.
Finally, after adding the virtual channels, the model produces a. routing algorithm

that is not maximally adaptive.

2.4 The Dimension Reversal Model

To design routing algorithms that are both adaptive and nonminimal, Dally and

Aoki [16] propose a model based on dimension reversals. Each message packet keeps

 

14

track of the number of times, p, it reverses dimensions, that is, is routed from a
higher dimension to a lower dimension. The model divides each physical channel into
r + 1 virtual channels, numbered 0 to r, where r is large enough to give the desired
degree of adaptiveness. Based on this division, the model allows either of two routing
algorithms. The static routing algorithm routes packets that have made p dimension
reversals only along virtual channels numbered p. If p < r, routing is along any such
virtual channel; and if p = r, routing is according to dimension ordering. Thus, a
packet is routed adaptively until it makes r dimension reversals, and r is the maximum
number of dimension reversals a packet can make. The dynamic routing algorithm
does not limit the number of dimension reversals. It routes a packet along any virtual
channels with numbers less than r until it reaches a router where all suitable output
channels are being used by packets with equal or lower values of p. It then routes the
packet along virtual channels numbered r, using dimension ordering. Because these
two algorithms allow adaptive, nonminimal routing, livelock is a potential problem.
To prevent livelock, Dally and Aoki propose limiting either the number of times each
packet is misrouted (routed away from the destination or equidistant to it) or the
ratio of misroutings to routings toward the destination.

The dynamic routing algorithm has the unique ability to route a packet along
almost any of the virtual channels leaving a router. This ability signiﬁcantly increases
adaptiveness; and in simulations conducted by Dally and Aoki, the dynamic algorithm
outperforms the static one. The simulations also show, however, that the dynamic
routing algorithm requires throttling to limit the number of new message packets
entering the direct network when trafﬁc is heavy.

Although the dimension reversal model provides a solution to the problem of de-
signing nonminimal routing algorithms, it has several drawbacks. The dimension
reversal model depends on dimension ordering and, like dimension ordering, is appli-

cable only to meshes and other topologies that do not contain wraparound channels.

15

Also, the static and dynamic algorithms route packets along the virtual channels
numbered r nonadaptively. This nonadaptiveness severely reduces the fault tolerance
of the direct network. Next, the algorithms depend on the packets transmitting extra
information, p, in the header ﬂits. This extra information increases the buffer space
required in each router and increases communication latency. Finally, like previous
models, the dimension reversal model depends on adding virtual channels and does

not produce routing algorithms that are as adaptive as possible for the number of

virtual channels added.

2.5 The Wraparound Model

Linder and Harden [23] propose a model similar to the dimension reversal model of
Dally and Aoki, except that it is based on keeping track of the number times a packet
travels over wraparound channels and that it needs no subnetwork of virtual channels
to nonadaptively route packets as a back-up when contention is high. The steps of

the model are as follows.

1. For each combination of directions in dimensions 1 to n — 1 of an n-dimensional

network, create an independent virtual network.

2. If the topology has wraparound channels, give each virtual network at least

n + 1 levels. Otherwise, give each virtual network one level.

3. For each level in each virtual network, create virtual channels in the directions
associated with that virtual network and create virtual channels in both direc-
tions of dimension 0. Where there are wraparound channels for a level, connect

them from that level to the next lowest level.

4. Overlay the virtual networks onto the physical network.

 

16

The routing algorithm that goes with the resulting physical network injects a
message packet into the top level of any virtual network that has a path to the
destination. It then routes the packet along any of the many paths to the destination.
These paths include all shortest paths and many longer ones. Each time a packet
traverses a wraparound channel, it drops down a level. At high levels, almost any
output channel leads to the destination, but at low levels, the routing algorithm must
be more careful that the packet has at least as many levels below it as wraparound
channels it must still traverse. The level the packet is on can be determined from the
virtual channel it just traversed, and the minimum number of wraparound channels
required to reach the destination can be determined from the addresses of the current
and destination nodes. Therefore, the number of wraparound channels traversed by
the packet does not have to be transmitted in the header ﬂits, which is an improvement
over the dimension reversal model. As in the virtual network model, the independent
virtual networks prevent deadlock among packets travelling in different directions. In
addition, the multiple levels in each virtual network prevent deadlock among packets
traversing wraparound channels.

Linder and Harden apply their wraparound model to n-dimensional meshes and
k-ary n-cubes. Applied to two-dimensional meshes, it produces a minimal and fully
adaptive routing algorithm that requires two virtual channels per physical channel
along only the y dimension. Such a two—dimensional mesh is called a double-y network.
The corresponding algorithm is called the double-y routing algorithm. It routes an
eastbound packet along the east physical channels and one set of y virtual channels
and routes a westbound packet along the west physical channels and the other set of
y virtual channels. Dally [24] proposes the same network and algorithm.

The wraparound model has advantages beyond those of the dimension reversal
model and only some of the drawbacks. Like the dimension reversal model, it produces

fully adaptive and nonminimal routing algorithms. In addition, it is applicable to

 

17

topologies with wraparound channels and is more fault tolerant, especially when the
virtual networks contain more levels than the minimum required by Step 2. The
drawbacks the wraparound model shares with the dimension reversal model are that
it often depends on adding many virtual channels and does not produce routing

algorithms that are as adaptive as possible for the number of virtual channels added.

CHAPTER 3

The Turn Model

Deadlock in wormhole routing is caused by message packets waiting on each other
in a cycle. One way that deadlock can occur in a two-dimensional mesh is shown in
Figure 3.1. Four packets travelling in different directions try to turn left and wind up
in a circular wait. If only one of the packets had not turned, deadlock would have been
avoided. This possibility suggests that a routing algorithm might prevent deadlock
in a direct network by prohibiting certain turns. The routing algorithm would have
to prohibit at least one turn in each of the many possible cycles in the network,
however. At the same time, the algorithm would have to leave a path between every
pair of nodes. Finally, the algorithm should not prohibit more turns than necessary;
otherwise, its adaptiveness would be reduced.

The deadlock freedom and adaptiveness of: most of the routing algorithms de—
scribed in the preceding chapter can be explained in terms of the turns prohibited
and allowed. The next few sections present explanations for several of the routing al-
gorithms applicable to the different two-dimensional meshes deﬁned in the preceding

chapter.

18

19

 

 

 

 

\. D :11: buffer

 

 

 

 

input selection
circuit

pocket 4 pocket 2 ~

packet progre;;ion

.......... ....... -
[:1 packet awaiting
resource

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.1. A wormhole deadlock involving four routers and four packets in a two—
dimensional mesh.

3.1 Turns in my Networks

Figure 3.2(a) shows a router in an my network, and Figure 3.2(b) shows the turns the
my routing algorithm allows (solid lines) and prohibits (dashed lines) at each router.
The possible turns from one direction to another form two simple cycles and several
complex cycles, as illustrated in Figure 3.3. The my algorithm is deadlock free because
it prohibits at least one turn in each of the simple cycles and each of the complex
cycles. The algorithm is nonadaptive because the turns it allows cannot be combined
into sequences; the heads of the allowed turns do not match the tails of any of the
other allowed turns. The 2-plane routing algorithm is also applicable to my networks,
allowing the turns shown in Figure 3.2(c). It, too, is deadlock free because it prohibits
at least one turn in each of the cycles. It, however, allows more turns than the my
algorithm; and some of the turns can be combined into sequences, permitting partial

adaptiveness in routing.

20

(north)

”I {'7 F‘"? 4,,
_t

can: ....

(west) -X +X (east) (1)) xy algorithm (nonadaptive)

_Y I'M? F‘ as
(south) I—‘J L._.I

(c) 2— lane al orithm
(a) a router in an xy network (pg-ﬂan, Edaptive)

Figure 3.2. A router in an my network and the turns allowed by some routing algo-
rithms.

""7
._l

 

 

I“? V—II
L._II_._IL.__I

(a) (b) (c)

Figure 3.3. The two simple cycles and one of the complex cycles formed by the turns
in an my network.

21

3.2 Turns in Double-my Networks

Figure 3.4(a) shows a router in a double-my network, and Figure 3.4(b) shows the turns
the double-my routing algorithm allows and prohibits at each router. Which of the two
virtual channels in each direction appears in a turn is important in determining cycles
and adaptiveness. The possible turns from one set of virtual channels to another form
eight simple cycles. The double-my algorithm is deadlock free because it prohibits at
least one turn in each of the simple cycles, as well as each of the complex cycles. The
algorithm is fully adaptive because the turns it allows can be combined into a variety
of sequences in each of the four directions packets need to be routed: southwest,

southeast, northwest, and northeast.

_ ‘53-» 3.....‘41

..q.
'1'
I...

 

 

 

 

<4
+Y1 +Y2
" '1 ‘ﬂ'e
It it f v . =i=
-X1 +‘ ""’+x1 e T
<-I-I ‘—|- a 1".» 8,32
41* -II-> mm»; gun-<4.
.X2 C-II- ”+X2 # I; $ *
H It I s s . I
-Y1 -Y2 "‘°“'*' 't'r't"
arms-g. ;-Il-<lI-.
(a) a router in a .1 i i .1.
double-xy network ‘ . . j
- 'l' T -
T-u- <1» '11-» 41-3"

(b) double-xy algorithm
(fully adaptive)

Figure 3.4. A router in a double-my network and the turns allowed by the double-my
routing algorithm.

22

3.3 Turns in Double-y Networks

Figure 3.5(a) shows a router in a double-y network, and Figure 3.5(b) shows the turns
the double-y routing algorithm allows and prohibits at each router. The possible turns
from one set of virtual channels to another form four simple cycles. The double-y
algorithm is deadlock free because it prohibits at least one turn in each of the simple
cycles, as well as each of the complex cycles. The algorithm is fully adaptive because
the turns it allows can be combined into a variety of sequences in each of the four

directions packets need to be routed.

+Yl +Y2

H H 41'": F1
—x-. g It); LJ i”!-

 

 

 

 

____ ‘_ .‘ 8/16
H H‘— {—3 L;
-Y1 —Y2 "" ‘1":
(a) a router in a (b) double-y algorithm

double-y network (fully adaptive)

Figure 3.5. A router in a double-y network and the turns allowed by the double-y
routing algorithm.

3.4 Systematic Design of Routing Algorithms

The preceding observations suggest a new solution to the problem of designing worm-
hole routing algorithms that are deadlock free and maximally adaptive for a direct

network, the turn model [25, 26, 27]. The model involves analyzing the directions in

23

which packets can turn in the network and the cycles that the turns can form. It
prohibits just enough of the turns to break all of the cycles. Routing algorithms that
employ the remaining turns are deadlock free, maximally adaptive, livelock free, and
minimal or nonminimal for the network. The steps of the turn model are given in
more detail below. Unless speciﬁed otherwise, the term “channel” indicates either a

physical or virtual channel.

1. Partition the channels in the direct network into sets according to the direc-
tions in which they route packets. If each node has v channels in a topological
direction, treat these channels as being in v distinct virtual directions and di-
vide them into v distinct sets accordingly. Put any wraparound channels in a

separate set to be incorporated during Step 5.

2. Identify the possible turns from one virtual direction to another, omitting 0-
degree and 180-degree turns. A 0-degree turn is only possible when there are
multiple channels in a topological direction. It represents a transition from
one set of channels to another, where the two sets are in the same topological

direction, but different virtual directions.

3. Identify the cycles that the turns can form. Generally, identifying the simplest

cycles in each plane of the topology is adequate.

4. Prohibit the minimum number of turns so that at least one turn is prohibited in
each cycle. The turns must be chosen carefully in order to break every possible
cycle, including very complex cycles. A useful approach is ﬁrst to break the
cycles in each plane and then check whether doing so allows more complex

cycles.

5. Incorporate as many turns as possible involving the set of wraparound channels,

without reintroducing cycles. At least one turn involving each wraparound

24
channel can always be incorporated.

6. Incorporate as many 0-degree and 180-degree turns as possible, without rein-

troducing cycles.

Routing algorithms that route packets along the sets of channels identiﬁed in
Step 1 and use only the turns from one set to another allowed by Steps '4, 5, and 6 are
deadlock free, livelock free, minimal or nonminimal, and maximally adaptive for the
network. They are deadlock free because breaking all of the cycles prevents circular
Waits. Proofs that particular routing algorithms are deadlock free are presented in
later chapters, but only as a check that the turn model has been applied correctly.
Preventing circular waits by prohibiting turns means that it is possible to number the

challnels in the network so that each algorithm routes every packet along channels

in strictly decreasing (or increasing) order [19]. The possibility of such a numbering,
together with the fact that a network contains a ﬁnite number of channels, means that
a packet reaches its destination after a limited number of hops. Thus, the routing
algorithms are livelock free. They are maximally adaptive for the network because
the model prohibits the minimum number of turns from one direction to another.
Finally, the routing algorithms are nonminimal, but can easily be made minimal
by modifying them to use channels only when they lead toward the destination.
Nonminimal routing, however, is more adaptive and fault tolerant.

The turn model is applicable to direct networks that have directions associated
with their channels, which is most existing and feasible networks. The model produces
partially adaptive routing algorithms for such common topologies as mesh-connected,
k-ary n-cube, hexagonal, octagonal, and cube-connected cycle networks. Adding
extra virtual or physical channels to the topologies allows the model to produce fully
adaptive routing algorithms. The following chapters apply the model to a variety of

meshes and k-ary n-cubes and should make it much clearer.

25

At this point, it is possible to compare all of the models for designing wormhole
routing algorithms. The differences between the models are summarized in Table 3.1.
Compared to the other models, the turn model has many advantages and few disad-
vantages. The simulations reported in the following chapters show that the advantages

of the turn model actually improve performance.

Table 3.1. Comparison of models for designing routing algorithms.

 

 

 

 

 

 

 

virtual nonminimal livelock
channels routing easy to fault
model topologies used adaptive possible prevent tolerant
dimension meshes none no no NA no
ordering
virtual all few no no NA no
channel
virtual all many fully no NA some
network
dimension meshes some fully yes no some
reversal
wraparound most many fully yes yes yes
turn most any maximally yes yes yes
number

 

 

 

 

 

 

 

 

 

CHAPTER 4

The Simulator

The most versatile way to study and compare the routing algorithms produced by the
different models is simulation. A routing algorithm cannot be simulated in isolation,
however. It must be simulated for a particular direct network, input selection policy,
output selection policy, and pattern of message trafﬁc. Each of these factors affects
the performance of the routing algorithm, however. Therefore, in preparation for
the simulations reported in later chapters, this chapter describes the kinds of direct
networks, channel selection policies, and trafﬁc patterns that are studied. It also
describes the software that carries out the simulations.

The performance of each combination of routing algorithm, direct network, input
selection policy, output selection policy, and trafﬁc pattern is measured in terms of
two main quantities: average communication latency and average sustainable network
throughput. The latency of a message is the time that elapses between the source
node making the message available for transmission and the destination node having
the message available for use. The throughput of a network is the number of messages
delivered per unit of time. It is measured as a percent of the maximum theoretical
throughput, which would be achieved if every channel continuously transmitted ﬂits.
The throughput is considered sustainable when the number of messages queued at

their source nodes does not exceed some small limit.

26

27

4.1 Direct Networks

Each direct network studied in the simulations is treated as part of a multicomputer
that has one processor and one router per node. In all but the double-y networks,
each pair of neighboring routers is connected by one pair of unidirectional physical
channels. Each router is also connected to its local processor by a pair of unidirec-
tional physical channels. All channels have the same bandwidth, 20 ﬂits/psec. Each
input channel in a router has a buffer the size of a single ﬂit. The routers operate
asynchronously, but synchronize to simultaneously transmit all of the ﬂits in a worm,

the portion of a packet in the network.

4.2 Channel Selection Policies

During both nonadaptive and adaptive routing, multiple input channels in a router
may contain header flits waiting for the same available output channel. Which input
channel gets to use the output channel next is decided by the input selection policy.
During adaptive routing, a header ﬂit may have a choice of multiple output channels.

Which output channel to use is decided by the output selection policy.

4.2.1 Input Selection Policies

Input selection policies can be based on a variety of packet and router characteristics.
Some of these are the length of the packet, the location of the source node, the location
of the destination node, the location of the router, the distance travelled already, the
distance to travel still, the time the message was sent, the time the packet arrived
in the router, the number of output channels available to the packet, and the input
channels chosen previously in the router. In this dissertation, seven policies are

studied: random, no-turn, local ﬁrst-come—ﬁrst-served, global ﬁrst-comeﬁrst-served,

28

distance-travelled, least-adaptive, and distance-least. The random policy chooses one
of the input channels randomly. The no-turn policy chooses the input channel directly
across the router from the output channel, provided it is one of the input channels
waiting. Otherwise, the no-turn policy chooses one of the waiting input channels
randomly. The next three policies also decide ties randomly. The local FCFS policy
chooses the input channel containing the packet that arrived in the router ﬁrst. The
global F CFS policy chooses the input channel containing the packet that was sent
ﬁrst. The distance-travelled policy chooses the input channel containing the packet
that has travelled the farthest. The least-adaptive policy chooses the input channel
containing the packet that has the fewest choices of output channels. It decides ties
by using the distance-travelled policy. Using a policy that is more complex than the
random policy is necessary because, in low-dimensional topologies, the number of
output channels upon which packets can wait is small and many packets will often be
waiting for the same number of output channels, making ties common. The distance-
least policy is like the distance-travelled policy, except that it decides ties by using
the least-adaptive policy.

Each of these input selection policies has its advantages. An advantage of the
random, no-turn, and local FCFS policies is that they require no additional routing
information. The global FCF S requires that a global time stamp be passed along in
the header ﬂits of packets. Generating and passing the global time stamps necessitates
establishing a global clock and adding many bits to the header flits. The distance-
travelled, least-adaptive, and distance-least policies require that the distance travelled
so far be passed along in the header ﬂits of packets. The distance takes up a few
extra bits. All but the random policy attempt to reduce average communication
latencies. The reasoning behind the local and global FCFS policies is that giving
priority to older packets prevents their latencies from increasing much more, thereby

decreasing the number of high latencies included in the average. Because the global

29

FCFS policy is based on more global information, it should perform better than local
FCFS. The FCFS policies have an added beneﬁt: they are fair and, therefore, free
from indeﬁnite postponement. The distance-travelled, least-adaptive, distance-least,
and no—turn policies, on the other hand, are not fair and are subject to indeﬁnite
postponement. The reasoning behind the distance-travelled policy is that giving
priority to the packets that have reserved the most channels releases as many channels
as possible, as soon as possible, and limits the number of packets entering the network,
thereby reducing contention and latencies. The least-adaptive policy tries to reduce
contention by giving priority to the least adaptive packet, making it less likely that the
remaining packets will have to wait. The no-turn policy tries to reduce contention by
giving priority to the packet that does not have to turn. Why such a priority reduces

contention will be discussed in connection with output selection policies.

4.2.2 Output Selection Policies

Output selection policies can be based on many of the same packet and router char-
acteristics as input selection policies. Some of the more interesting characteristics
are which channels are available, which lead toward the destination and which away,
the distance the packet must still travel in each direction, and the direction of the
input channel. In this dissertation, three policies are studied: zigzag, my, and no—turn.
When none of the output channels allowed by the routing algorithm is available, the
policies all choose the ﬁrst one to become available. When only one of the output
channels is available, the policies choose the available channel. When more than one
of the output channels are available, the zigzag policy chooses the output channel in
the direction the packet must still travel the farthest. More precisely, if the current
node is C and the destination node is D, the zigzag policy chooses a channel in the
dimension, i, with the largest value |d,- —c.-|. The my policy chooses the output channel

for the lowest dimension. The no-turn policy chooses the output channel in the same

30

direction as the input channel.

Each of these output selection policies tries to improve network performance in
a different way. The reasoning behind the zigzag policy is that preferring to follow
a diagonal, staircase path to the destination node is most likely to preserve adap-
tiveness along the way, thereby reducing latency. The disadvantage of this policy is
that it tends to concentrate trafﬁc near the center of a mesh, although adaptiveness
mitigates the concentration some. The my policy attempts to overcome this disad-
vantage by preferring the channels chosen by the my routing algorithm, which spreads
uniform trafﬁc very evenly across the channels of a mesh. Simulations of the my
routing algorithm for uniform trafﬁc produce channel utilizations like those shown
in Figure 4.1. In the ﬁgure, the plus signs represent the nodes of a 10 x 10 mesh,
and the numbers surrounding each plus sign indicate the utilizations of the output
channels of the node. Each number is the fraction of the time that the channel is
reserved, in tenths. The main thing to note in Figure 4.1 is that the utilizations for
all of the eastward channels in a column are very nearly equal. The same is true of
all of the westward channels in a column, all of the northward channels in a row,
and all of the southward channels in a row. The packets that cross a particular col-
umn or row of channels could not be spread much more evenly across the channels
than the my routing algorithm spreads them. The reasoning behind the no—turn pol-
icy is illustrated in Figure 4.2. Each time a packet turns onto a row or column, it
blocks other packets travelling in the same direction along that row or column. In
Figure 4.2, the zigzag path (solid arrows on the upper left) blocks many northbound
and eastbound packets (dashed arrows), while the my path (solid arrows on the lower
right) blocks fewer. The impact is greatest when turning onto the middle of a row
or column, where trafﬁc is heaviest. Turns are not all bad, however. Their beneﬁt is
that they increase adaptiveness. Therefore, what should be avoided are unnecessary

turns, which is what the no—turn policy tries to do.

1+3
1

1
1+3

2+4

1
2+3

3+4

3+4

4+4

31

4+3

4+3

4+2
1

1
4+2

3+1

1
3+1

Figure 4.1. Channel utilization in a 10 x 10 mesh for uniform trafﬁc and the my
routing algorithm.

 

llhhkhll

Figure 4.2. Turns can increase contention.

4.3 Trafﬁc Patterns

In the simulated multicomputers, the processors generate messages at time intervals
chosen from a negative exponential distribution. Each message has an equal probabil-
ity of being one packet of 10 or 200 flits. Messages that are blocked from immediately
entering the network are queued at the source processor. Messages arriving at a
destination processor are immediately consumed.

What probably affects the performance of the routing algorithms more than
any of the preceding factors is the pattern of message trafﬁc, that is, the desti-
nations to which messages are sent. In this dissertation, nine trafﬁc patterns are
studied: uniform, center-reﬂection, matrix-transpose, matrix-rotate, merge-south,
reverse-ﬂip, address-rotate, bit-reversal, and matrix-multiply. Some apply only to
two-dimensional meshes, and some apply only to hypercubes. The uniform pattern
sends each message to any of the other processors with equal probability. In a k x k

mesh, the center-reﬂection pattern sends each message from the processor at (m,y)

33

to the one at (k — 1 — m, k -— 1 — y). In a k x k mesh, the matrim-transpose pattern
sends each message from the processor at (m, y) to the one at (k - 1 — y, k - 1 — m).
In a binary 8-cube, a matrim-transpose pattern is derived by mapping a 16 x 16 mesh
to the hypercube so that neighbors in the mesh are neighbors in the hypercube. Mes—
sages are then sent to the processors dictated by the matrix transpose in the mesh.
The resulting pattern in the binary 8-cube sends each message from the processor at
(am, m1, $2, m3, m4, m5, m6, m7) to the one at (m1, m5, m6, m7, m3, m1, m2, m3). In a k x k mesh,
the matrim-rotate pattern sends each message from the processor at (m, y) to the one
at (k — 1 - y,m). In a k x k mesh, the merge-south or tree reduction pattern sends
each message from the processor at (m, y) to the one at (m/2+ (y mod 2) at (k/2), y/2).
The merge-south pattern might occur if all the children in a binary tree had to
send messages southward to their parents. In a binary 8-cube, the reverse-ﬂip pat-
tern sends each message from the processor at (mo, m1, m2, m3, m4, m5, m6, m7) to the one
at (177,56, 55,f4,53,fg,f1,fo). In a binary 8-cube, the address-rotate pattern [28]
sends each message from the processor at (mo,m1,mg,m3,m4,m5,m6,m7) to the one at
(m4,m5,m6,m7,mo,m1,m2,m3). The bit-reversal pattern sends each message from the
processor at (m, y) to the one at (yR,mR), where R indicates bit reversal. In a k x k
mesh, the matrim-multiply pattern sends each message from the processor at (m, y) to

the one at (c, k - 1 — m), where c is a random integer between 0 and k — l.

4.4 Simulator Software

The simulations described in this dissertation were carried out by a computer program
written in C and CSIM [29]. It simulates a routing algorithm, direct network, input
selection policy, output selection policy, and pattern of message trafﬁc at the level of
channels and buffers, rather than at the gate level, which would be much more time

consuming.

34

4.4.1 Structure

The simulator is organized as a series of functions, the most important of which are
described below. The main function is called sin, as is required by CSIM. The
function sin reads parameters from the command line, calls ReadConfig to read a
ﬁle containing more parameters, initializes the direct network to be simulated, calls
node for each node in the network, and, when the simulation is done, calls report
to output the results. The function node creates an independent CSIM process for a
node. The process repeatedly waits a time interval chosen from a negative exponential
distribution and calls sendmessage to send a message. The function sendmessage
creates an independent CSIM process for a one-packet message. The process deter-
mines the destination node for the message, calls getmsgsize to determine the size of
the message, calls message to send the appropriate message, and calls batchrecord
to record simulation statistics. The function message repeatedly calls route to de-
termine the choice of output channels for a header ﬂit, calls ReserveLink to wait to
reserve one of those channels, and waits while the header ﬂit is transmitted. It then
repeatedly waits while the tail ﬂit is transmitted and calls ReleaseLink to release
the channel just traversed. The functions ReserveLink and ReleaseLink had to be
written because the CSIM function reserve does not allow two processes to wait for
sets of channels that have some, but not all, channels in common.

The code for related functions appears in the same ﬁle. For example, the functions
deﬁning the network topology appear in one ﬁle, and the functions for initializing,

reserving, and releasing channels appear in another ﬁle.

4.4.2 Inputs

The particulars of a simulation are controlled by a wide variety of parameters. The

parameters come from the command line, a ﬁle of parameters, and a C header ﬁle.

35

The command line and ﬁle of parameters are read when the simulator is run. The
header ﬁle is read when the simulator is compiled. The various parameters control
the size of the direct network, the number of channels between a router and its
local processor, the transmission time for a ﬂit, the mean time between sends for
the negative exponential distribution, the minimum and maximum message sizes, the
trafﬁc pattern, the routing algorithm, the number of messages averaged in each batch
of statistics, the number of batches to ignore at the beginning of a run, and the
minimum number of batches over which to collect statistics. Additional batches of

messages are run until the average latency across batches is known to within 5% with

95% conﬁdence.

4.4.3 Outputs

The outputs of the simulator are the input parameters, the average latency across
batches and the 95% conﬁdence interval, the average throughput across batches, the
simulation time, the total number of messages sent, the utilization of the channels, and
statistics on the latencies, the length of messages, the fraction of channels messages
waited for, the time messages spent waiting, and the distances messages travelled.
One other possible output is error messages. Each function in the simulator contains
code to check each new value for reasonableness. Should an unreasonable value be
found, the code prints it, along with an error message containing the word “ERROR”.

The most important of the outputs are the average latency and throughput, which
determine the performance. The remainder are useful for explaining the performance.
To aid in this explanation, a separate program, map, displays the channel utilizations
graphically. There is also a separate program, test, for debugging the code for the
routing algorithms. This program graphically displays the output channels chosen by
a routing algorithm under each of a wide variety of conditions. The choices appear

numbered in order of preference.

36

4.4.4 Running the Simulator

The simulator can be run by entering a command containing parameters and the
name of a parameter ﬁle. More typically, however, simulations are run in groups. In
this case, a series of commands can be placed in a ﬁle and executed either by Unix

or Condor.

CHAPTER 5

Partially Adaptive Routing

The turn model produces partially adaptive routing algorithms when applied to direct
networks without virtual channels. This chapter illustrates this application of the
turn model, ﬁrst for two-dimensional, three-dimensional, and n-dimensional meshes,
and then for k-ary n-cubes and hypercubes. It ends with simulations comparing
channel selection policies and comparing the partially adaptive routing algorithms

with nonadaptive routing algorithms for a variety of trafﬁc patterns.

5.1 Two-Dimensional Meshes

For two-dimensional meshes, the terminology used in the deﬁnition of n-dimensional
meshes can be simpliﬁed. Dimensions 0 and 1 become m and y. The lengths of the
dimensions, kg and k1, become m and n. Finally, the directions —m, +m, —y, and +y
become west, east, south, and north.

The four directions permit eight 90-degree turns: left and right turns when trav-
elling west, east, south, and north. The eight turns form two simple cycles, as shown
in Figure 5.1. The my routing algorithm prohibits four of the turns, as shown in
Figure 5.2. Prohibiting these four turns prevents deadlock because the four remain-

ing turns cannot form a cycle. Unfortunately, the four remaining turns also do not

37

38

permit any adaptiveness in routing. Deadlock can be prevented by prohibiting fewer
than four turns, however. In fact, only two turns need to be prohibited, one from
each cycle, allowing for some adaptiveness in routing. Prohibiting any two turns does
not prevent deadlock, however, as Figure 5.3 illustrates. The three left turns allowed
in Figure 5.3(a) are equivalent to the prohibited right turn in Figure 5.3(b), and the
three right turns allowed in Figure 5.3(b) are equivalent to the prohibited left turn in
Figure 5.3(a). Thus, both cycles still exist and deadlock is possible, as shown in Fig-
ure 5.3(c). Prohibiting two turns as in Figure 5.3 also has another serious drawback.
It prohibits all turns that are a combination of the directions north and west, which
makes it impossible to route most packets to the node in the northwest corner of the
mesh. Of the 16 different ways to prohibit one turn in each of the two simple cycles,

only 12 prevent deadlock and only 3 are unique if symmetries are taken into account.

I—‘W V—l
L._I L._I

Figure 5.1. The possible turns and simple cycles in a two-dimensional mesh.

Figure 5.2. The four turns allowed by the my routing algorithm.

39

*1 l'"‘l *‘1

t |
'7’ .7] I“

t. "I L._(:

 

 

Figure 5.3. Six turns that complete the cycles and allow deadlock.

5.1.1 The West-First Routing Algorithm

Figure 5.4(a) shows one way to prohibit two turns in a two-dimensional mesh. The
prohibited turns are the two to the west. Therefore, to travel west, a packet must
start out in that direction. This requirement suggests the west-ﬁrst routing algorithm:
route a packet ﬁrst west, if necessary, and then adaptively south, east, and north.

Examples of paths allowed by the west-ﬁrst algorithm are shown in Figure 5.4(b). In

this ﬁgure, black squares represent nodes, and gray bars indicate blocked channels
requiring that packets wait (dashed lines) or take alternative paths (solid lines). Note
that both minimal and nonminimal paths are shown.

Proof that the west-ﬁrst partially adaptive algorithm is deadlock free is based on

the work of Dally and Seitz [19], who show that a routing algorithm is deadlock free if
the channels in the direct network can be numbered so that the algorithm routes every
packet along channels with strictly decreasing (or increasing) numbers. Because the
west-ﬁrst algorithm routes packets ﬁrst west and then in the other directions, the proof
assigns lower numbers to westward channels the farther west they are, and assigns
still lower numbers to eastward, northward, and southward channels the farther east

they are.

Theorem 5.1 The west-ﬁrst routing algorithm is deadlock free.

40

 

(b) Examples of the paths allowed in an 8 x 8 mesh.

Figure 5.4. The west-ﬁrst routing algorithm for two-dimensional meshes.

41

Proof: Assign each channel in a m x n mesh a two-digit number a,b in base r,
where r is the greater of 3m — 2 and n — 1. Figure 5.5 shows the particular numbers
to assign to the channels leaving node (2:, y). Figure 5.6 illustrates the numbering of a
4 x 4 mesh. To show that the west-ﬁrst algorithm routes every packet along channels
with strictly decreasing numbers, it is sufﬁcient to show that, for each channel into
an arbitrary node, the algorithm can only route the packet out along a channel with
a lower number. Figure 5.7 shows the four possible cases. Examination shows that,
in each case, the channels used to route a packet out of a node have lower numbers
than the input channel. Note that in part (c) a packet can make a 180-degree turn.

This turn is only useful for nonminimal routing. El

2m-2-2x,n-2-y

2m-2+X,o ‘— x,y 2m.3.2x,o

 

2m-2-2x,y-1

Figure 5.5. Numbering of the channels leaving each node (m,y) for the west-ﬁrst
routing algorithm. .

5.1.2 The North-Last Routing Algorithm

Figure 5.8(a) shows another way to prohibit two turns in a two-dimensional mesh.
The prohibited turns are the two when travelling north. Therefore, a packet should

only travel north when that is the last direction it needs to travel. This requirement

42

 

8.0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[0.3 1.3—2.1 3.3
5.0 , 3.0 > 1.0 ‘
6.0 4.2 4.0 2.2 2.0 also
I
70 8,0 9 ’
u**—’-—__. 1.2. 2.2 3.2
I. , so , 3.0 w 1.0
6.1 4.1 4.1 2.1 2.1 0.110.:
7L0 0 9,0
[M‘ 14—4—32: 3.!
_ 5,0 ‘ 3,0 } 1,0 .1
so 6.2 4.0 4.2 2.0 2.2 “Jo:
._z.o__‘,._u.2_~ ,._2.o__
”Tl. so ,2. 1.0 3.0

 

 

 

 

 

 

 

 

 

 

 

 

Figure 5.6. Numbering of a 4 x 4 mesh for the west-ﬁrst routing algorithm.

 

 

 

 

2m-2-2x,n-2oy 2m-2-2x,y
2m-1-2x,0 :;:’_:ZLn-3-2x,0 +1m-3-2xﬁ
2m-2-2x,y-12m-2-2x,y-1
(a) input from west (h) input from north
2m-2-2x,n-2-y 2m-2-2x,n-2-y
2m-2+x,0 33'1”» a 2m-3-2x,0
e—1 w ‘_.
L]— 2m-3-2x,0
2m-2-2x,y-l 2m-2-2x,n-l-y
(c) input from east (d) input from south

Figure 5.7. For each input channel, the output channels allowed by the west-ﬁrst
routing algorithm have lower numbers.

43

suggests the north-last routing algorithm: route a packet ﬁrst adaptively west, south,

and east, and then north. Examples of the paths allowed by the north-last algorithm
are shown in Figure 5.8(b).
Theorem 5.2 The north-last routing algorithm is deadlock free.

Proof: The proof is similar to that for Theorem 5.1. Rotate Figures 5.5 and
5.6 counterclockwise 90 degrees, and reverse the directions of the channels. The ﬁg-

ures now show that the north—last algorithm routes every packet along channels with

strictly increasing numbers. [3

5.1.3 The Negative-First Routing Algorithm

Figure 5.9(a) shows a third way to prohibit two turns in a two-dimensional mesh.
The prohibited turns are the two from a positive direction to a negative direction.
Therefore, to travel in a negative direction, a packet must start out in a negative
direction. This requirement suggests the negative-ﬁrst routing algorithm: route a
packet ﬁrst adaptively west and south, and then adaptively east and north. As de-
scribed in Chapter 2, Yantchev and Jesshope propose a similar algorithm [22], but
only for minimal routing. The negative-ﬁrst algorithm can be either minimal or non-
minimal, the nonminimal version being more adaptive and fault tolerant. Examples

of the paths allowed by the negative-ﬁrst algorithm are shown in Figure 5.9(b).

Theorem 5.3 The negative-ﬁrst routing algorithm is deadlock free.

The proof is a special case of the one given for n-dimensional meshes in Theorem 5.5.

5.1.4 Degrees of Adaptiveness

How adaptive are these partially adaptive routing algorithms? Let Saharan," be the

number of shortest paths an algorithm allows from source node (3,, s”) to destination

44

 

(b) Examples of the paths allowed in an 8 x 8 mesh.

Figure 5.8. The north-last routing algorithm for two-dimensional meshes.

‘IC.

L._I
r"...
t_

I_-.

(a) The six turns allowed.

 

IUIIIIII

(b) Examples of the paths allowed in an 8 x 8 mesh.

Figure 5.9. The negative-ﬁrst routing algorithm for two-dimensional meshes.

46

node (d,, d”). Also, let f be a fully adaptive algorithm, p be one of the three partially

adaptive algorithms, Am = Id, — 3,], and Ay = Id, — 3,]. Then,

S = (Am + Ay)!
’ .mmw

 

[Am-+482)! -
AmlAy! ‘f d3 2 31'
Sweet—first =

1 otherwise

(Am+Ay]! -
AmlAy! 1f du S 3::
Snorth—last =

1 otherwise

%ﬁ%iﬂ¢gnmm4$%)
Snegatiue-first = I 01‘ (d3 2 83 and dy 2 8y)
[ 1 otherwise

 

For minimal routing algorithms, the larger Satgwgthm is, the more adaptive the
algorithm is. For the three partially adaptive routing algorithms, however, S,p = 1 for
at least half of the source-destination pairs. Another measure of the degree to which a
minimal routing algorithm is adaptive is the ratio Sugars“... / S f. Sp/S; ranges from 0
to l, but averaged across all source-destination pairs, Sp/Sf > 1/2, which indicates a
signiﬁcant degree of adaptiveness. Finally, note that S, = 1 for a source-destination
pair does not imply that algorithm p is nonadaptive for that pair. The partially
adaptive algorithms all permit nonminimal routing. In the case of the negative—
ﬁrst algorithm, nonminimal routing means that routing can be adaptive even when

d“r _<_ 3, and (1,, Z 3,, or when d, 2 s, and d, S 3,. For an illustration of this added

47

adaptiveness, see the bottom path in Figure 5.9(b).

5.2 Three-Dimensional Meshes

For three-dimensional meshes, the terminology used in the deﬁnition of n-dimensional
meshes can be simpliﬁed. Dimensions 0, 1, and 2 become m, y, and z. The direc-
tions —m, +m, —y, +y, —z, and +2 become west, east, south, north, down, and up
respectively.

There are 24 possible 90-degree turns in a three—dimensional mesh, four for each
of the six directions. These turns form six simple cycles, two in each of the three
planes, as shown in Figure 5.10. At least one turn must be prohibited in each cycle
in order to prevent deadlock. Of the 4096 different ways to prohibit these six turns,

176 prevent deadlock and nine are unique if symmetries are taken into account.

(2) U

I"
L
K
V

Figure 5.10. The possible turns and simple cycles in a three-dimensional mesh.

Each of the nine combinations suggests a partially adaptive routing algorithm.

The ﬁrst three combinations suggest algorithms that route a packet ﬁrst west and

48

then adaptively south, down, east, north, and up with two restrictions on the turns.
The second three combinations suggest algorithms that route a packet ﬁrst adaptively
west, south, down, east, and north with two restrictions and then up. The last three
combinations, Figure 5.11, suggest simpler algorithms. The seventh combination
suggests the west-south-ﬁrst routing algorithm: route a packet ﬁrst adaptively west
and south and then adaptively down, east, north, and up. The eighth combination
suggests the north-up-last routing algorithm: route a packet ﬁrst adaptively west,
south, down, and east and then adaptively north and up. The ninth combination
suggests the negative-ﬁrst routing algorithm: route a packet ﬁrst adaptively west,
south, and down and then adaptively east, north, and up. All nine of the algorithms

are deadlock free. The individual proofs are omitted here.

5.2.1 Degrees of Adaptiveness

The algorithms for three-dimensional meshes are more likely to be able to route any
particular packet adaptively than are the algorithms for two—dimensional meshes. Let
Smash... be the number of shortest paths from source node (3,, 3,, 3,) to destination
node (d,, d”, d,). Also, let f be a fully adaptive algorithm, p be one of the last three
partially adaptive algorithms, Am = Id, — s,I, Ay = Id, — syI, and A2 = Id, — 3,].
Then,

S __ (Am + Ag + A2)!
I - AmlAylAz!

 

M if (d, 2 s, and d, 2 3:)

AmlAylAz!

W if(d.2s.and 4.5.9.)

Sweat-south- first = I

‘W if(d,gs,and age.)

 

I LL-Liifz’ if (d. s s: and <12 2 3,)

49

[.7 [3K 045 257

(a) the west-south-f'n'st routing algorithm

.--— "1"" I“ ﬁt 1:.-
L] D V [gt/.5

(b) the north-up-last routing algorithm

I:._l LJKJKJZJAJ‘

Figure 5.11. The turns allowed by three routing algorithms for three-dimensional
meshes.

50

1.
W 1f(d,$s, and d,$s,)

W if(d.3s.and d. 2s.)

M if (d, 2 .9, and d, 2 s.)

AmlAz!

Snorth-up-last =

W “((13- Z 3:.- and d: S 3:)

V
L—J—Leggymf! if (d, g s, and d, g s, and d. g s,)

01' (dz-23:: and (#28, and d,2s,)
W if(d=Ss.andd._<_s,andd.>s,)

or (d, 2 s, and dy Z s, and d, < s.)
Snegative— first = I

Egg-)- if(d,Ss, and d,>s, and (1.33,)
or (d, 2 s, and d, <3" and d, 2 3,)
%# if(d,>s,andd,,Ssv andd,Ss,)

 

or (d, <3, and d, _>_sv and d, _>_ 3,)

S, = l for few source-destination pairs, especially when the lengths of the dimen-
sions are large. Therefore, minimal routing is adaptive for most source-destination
pairs. Averaged across all source-destination pairs, Sp/S; > 1/4. This average is
low compared to the average of over 1/2 for two-dimensional meshes. As in the
case of two-dimensional meshes, nonminimal routing can increase the degree of adap-
tiveness. Overall, a three-dimensional mesh with the same number of nodes as a
two-dimensional mesh permits a lower average degree of adaptiveness relative to fully
adaptive routing, but distributes the adaptiveness more uniformly across the source-

destination pairs.

51

5.3 n-Dimensional Meshes

In an n-dimensional mesh, each node has up to 2n input channels and 2n output
channels. For each of the 2n directions, there are 2n — 2 possible 90-degree turns to a
different direction, making 4n(n — 1) total turns. These turns form two simple cycles

in each of the n(n — 1)/2 planes, making n(n — 1) total cycles of four turns.

Theorem 5.4 The minimum number of turns that must be prohibited to prevent

deadlock in an n-dimensional mesh is n(n - l), or a quarter of the possible turns.

Proof: The 4n(n - 1) turns in an n-dimensional mesh form n(n — 1) cycles of four
turns each. At least one of the four turns in each cycle must be prohibited in order
to prevent these cycles. Therefore, prohibiting a quarter of the turns is necessary to

prevent deadlock. El

Of the many routing algorithms formed by prohibiting one turn per cycle, three

are noteworthy for their simplicity. They are analogs of the west-ﬁrst, north—last, and
negative-ﬁrst routing algorithms for two-dimensional meshes and the west-south-ﬁrst,
north-up-last, and negative-ﬁrst routing algorithms for three-dimensional meshes.

The analog of the west-ﬁrst and west-south-ﬁrst routing algorithms is the all-but-
one-negative-ﬁrst routing algorithm: route a packet ﬁrst adaptively in the negative
directions of all but one dimension (n — 1) and then adaptively in the other di-
rections. The analog of the north-last and north-up-last routing algorithms is the
all-but-one-positive-last routing algorithm: route a packet ﬁrst adaptively in the neg-
ative directions and the positive direction of one dimension (0) and then adaptively in
the other directions. The analog of the negative-ﬁrst routing algorithms is also called
the negative-ﬁrst routing algorithm: route a packet ﬁrst adaptively in the negative
directions and then adaptively in the positive directions. All of these algorithms are

deadlock free, but the only proof presented here is for the negative-ﬁrst algorithm.

52

Theorem 5.5 The negative-ﬁrst routing algorithm for n-dimensional meshes is dead-

lock free.

Proof: Let K be the sum of the k; for an n-dimensional mesh, and let X be the
sum of the m,- for any node (mo,m1, . . . ,m,_2,m,,_1). Number each channel leaving a
node in a positive direction K — n + X and each channel leaving a node in a negative
direction K - n — X. Then, if a packet enters a node when travelling in a negative
direction, it enters along a channel numbered K — n - X — 1, which is less than
both K - n — X and K — n + X, the numbers of the channels leaving the node in
the negative and positive directions. If a packet enters a node when travelling in
a positive direction, it enters along a channel numbered K — n + X — l, which is
less than K — n + X, the number of the channels leaving the node in the positive
directions. Therefore, the negative-ﬁrst algorithm routes every packet along channels

with strictly increasing numbers and is deadlock free. 0

The negative-ﬁrst algorithm is deadlock free as a result of prohibiting just one turn
per cycle, the turn from a positive direction to a negative direction. Therefore, pro-
hibiting some quarter of the turns is sufﬁcient to prevent deadlock in an n-dimensional
mesh. Combining this observation with Theorem 5.4, gives the following theorem,
which supports the claim that the turn model produces maximally adaptive routing

algorithms.

Theorem 5.6 Prohibiting some quarter of the 4n(n — 1) turns in an n-dimensional

mesh is necessary and suﬁicient to prevent deadlock.

5.3.1 Degrees of Adaptiveness

As the number of dimensions increases, the minimal, partially adaptive algorithms are

more likely to be able to route message packets adaptively. S, = 1 less often, especially

53

when the k,- are large. But averaged across all source-destination pairs, Sp/Sf >
1/2""1 , indicating that the degree of adaptiveness relative to fully adaptive algorithms

decreases as n increases. Again, adaptiveness can be increased by nonminimal routing.

5.4 k-ary n-cubes

The partially adaptive routing algorithms for meshes can be extended to use the
wraparound channels of k-ary n—cubes. One way to extend them is to allow a packet
to be routed along a wraparound channel only on its ﬁrst hop. To prove that the
routing algorithm is still deadlock free, number the mesh channels as before and assign
the wraparound channels a number that is either greater than or less than those of the
mesh channels, depending on whether packets are routed along channels in strictly
decreasing or increasing order.

The negative-ﬁrst routing algorithm can also be extended to k-ary n-cubes in
another way: classify each wraparound channel as negative or positive according to
how it routes packets relative to the mesh channels and then apply the negative-ﬁrst
algorithm. For example, a node at the east edge of the mesh channels has two channels
to the west: a mesh channel to the node immediately to its west and a wraparound
channel to a node at. the west edge of the mesh channels. Either channel to the west
can be used during the negative phase of routing a packet. The negative and positive
phases in a 4-ary 2-cube are illustrated in Figure 5.12. The proof of deadlock freedom
is, again, a simple modiﬁcation of the proof for meshes.

All of the preceding routing algorithms are strictly nonminimal. For k-ary n-
cubes with k > 4, it is impossible to design deadlock-free routing algorithms that are
minimal, without adding extra channels. To see that it is impossible, imagine a k-ary
n-cube and an arbitrary k-ary l-cube embedded in it. The nodes in the k-ary 1-cube

all lie along one dimension. Next, imagine that each of the nodes sends a message to

(a) negative phase (b) positive phase

Figure 5.12. The channels used during the two phases of the negative-ﬁrst routing
algorithm for 4-ary 2-cubes.

the node that is two hops away in the positive direction of the dimension. If routing
is minimal and k > 4, each of these messages is routed two hops in the positive
direction. These messages will deadlock if each reserves its ﬁrst channel before any
of them reserve the second channel. The deadlock is a result of a cycle that does not
involve turns.

To design a routing algorithm that is deadlock free, minimal, and fully adaptive,
Linder and Harden [23] add virtual channels to a k-ary n-cube. They add, however,
enough virtual channels to partition the k-ary n-cube into 2““1 virtual networks with
n + 1 levels per virtual network and (n + 1)k“ channels per level. That is an average

of

(n + 1)22n-2
11

virtual channels per physical channel.

Dividing the physical channels in a k-ary n-cube into fewer virtual channels, how
ever, is sufﬁcient for designing routing algorithms that are deadlock free, minimal,
and partially adaptive. A suitable design method is inspired by the one Dally and
Seitz use for unidirectional k-ary n-cubes [19]. Dally and Seitz divide each physical

channel into two virtual channels and connect the virtual channels in each k-ary l-

55

cube embedded in the k-ary n-cube into a spiral that circles the k-ary n-cube nearly
twice. Unfolding the spirals converts the unidirectional k-ary n-cube into a unidirec-
tional, n-dimensional mesh with 2k — 1 nodes along each dimension. Each node of
the k-ary n-cube appears in the mesh up to 2" times. Dimension-order routing in the
mesh prevents deadlock. Routing algorithms that are deadlock free, minimal, and
partially adaptive for a bidirectional k-ary n-cube can be designed by unfolding the
k-ary n-cube into a bidirectional, n—dimensional mesh with 2k — 1 nodes along each

dimension. The unfolding requires an average of

2k—1 ""
2(——. I

virtual channels per physical channel. Applying the turn model to the mesh produces
routing algorithms that are deadlock free, minimal, and partially adaptive. The
routing algorithms to use in the k-ary n-cube are the ones that correspond to routing

a packet to the nearest copy of its destination node in the mesh.

5.5 Hypercubes

Hypercubes are a special case of both n-dimensional meshes and k-ary n-cubes. Con-
sequently, the partially adaptive routing algorithms for hypercubes are special cases of
the algorithms for n-dimensional meshes and k-ary n-cubes. Proofs that the routing
algorithms are deadlock free are corollaries of the proofs for the more general cases.
The special case of the negative-ﬁrst algorithm has a particularly compact expres-
sion, the p-cube routing algorithm. Let S be the binary address of the source node for
a packet, C be the binary address of the node the header ﬂits currently occupy, and
D be the binary address of the destination node. The p-cube routing algorithm has

two phases. In the case of minimal routing, the ﬁrst phase routes the packet along

56

any dimension, i, for which c,- = 1 and d,- = 0. When there is no such dimension,
the second phase routes the packet along any dimension, i, for which c.- = 0 and
d,- = 1. These steps are easily computed using bitwise logic operations, as shown in
Figure 5.13. If nonminimal routing is desired, because of its increased adaptiveness
and fault tolerance, the ﬁrst phase can also route the packet along any dimension,
i, for which a,- = l and d.- = 1. Then, the steps can be computed as shown in Fig-
ure 5.14. In both of these algorithms, the only input transmitted in the header ﬂits
is D. C is a unique constant for each router, and p depends on which input buffer
the header ﬂits occupy in the router. Konstantinidou proposes an algorithm similar

to p-cube [28], but only for minimal routing.

 

Algorithm: Minimal p-cube routing algorithm.
Input: Current address, C, and destination address, D.
Procedure: ‘ I

1. If C = D, route the packet to the local processor and exit.
2R=CAD.
3. IfR= 0,thenR= CAD.

4. Route the packet along any available channel, i, for which r,- = 1.

 

 

 

Figure 5.13. The minimal p-cube routing algorithm for hypercubes.

5.5.1 Degrees of Adaptiveness

Let IXI represent the number of 1’s in the binary number X. The number of shortest
paths from S to D, S,-,.,b,, equals hllhol, where h1=I(S A D)| and ho=|(S A D)I. For
a fully adaptive routing algorithm, f, S; = h!, where h = h + ho = |(S ® D)|, the

Hamming distance between S and D. The other measure of the degree of adaptiveness

57

 

Algorithm: Nonminimal p-cube routing algorithm.

Input: Current address, C; destination address, D; and whether the last hop was
in a positive direction, p.

Procedure:

1. If C = D, route the packet to the local processor and exit.
2. pr= 1, thenR= CAD.

3. ElseifC AD=0,thenR=C V (CAD).

4. Else R = C.

5. Route the packet along any available channel i, for which r.- = l.

 

 

 

Figure 5.14. The nonminimal p-cube routing algorithm for hypercubes.

is Sp_cub¢/Sf, which equals 1/(,:).

Overall, the p-cube routing algorithm offers a choice of many shortest paths,
especially when compared to the nonadaptive e-cube routing algorithm. The following
table illustrates this adaptiveness, for a binary 10-cube in which the node 1011010100
sends a packet to the node 0010111001. In this example, h = 6, ho = 3, h = 3,
and 5,4,5, = 36. The table shows only one of the 36 possible shortest paths. For
each node transmitting the packet, however, the table shows the number of choices
of output channel allowed by the minimal p-cube routing algorithm. The number
of choices in parentheses indicates the additional choices available with nonminimal

routing.

5.6 Simulation Experiments

Numerous simulations have been conducted to study the partially adaptive routing
algorithms produced by the turn model and the nonadaptive routing algorithms pro-

duced by dimension ordering for different direct networks, different input and output

58

Table 5.1. Example of a path allowed by the p—cube routing algorithm.

 

I address I choices I dimension taken I comment I

 

 

 

 

 

 

1011010100 3(+2) 2 source
1011010000 2(+2) 9 phase 1
0011010000 1(+2) 6 phase 1
0010010000 3 5 phase 2
0010110000 2 0 phase 2
0010110001 1 3 phase 2
0010111001 destination

 

 

 

 

 

 

selection policies, and different trafﬁc patterns.

5.6.1 Input Selection Policies

The ﬁrst simulations compare the random, no—turn, local FCF S, global FCFS, and
distance-travelled input selection policies. Each policy is simulated in a 10 x 10 mesh
for the my, west-ﬁrst, north-last, and negative-ﬁrst routing algorithms. In all cases,
the output selection policy is zigzag, the routing is minimal, and the message trafﬁc
is uniform.

The results for each routing algorithm are fairly consistent (Figures 5.15, 5.16,
5.17, and 5.18). At low throughputs, the input selection policies perform the same.
At high throughputs, the distance-travelled policy produces the lowest latencies, the
global FCFS policy produces the second lowest latencies, and the random, no-turn,
and local FCFS policies produce the highest latencies. Compared to the random, no-
turn, and local FCFS policies, the distance-travelled policy increases the maximum
sustainable throughput by 15%.

The most surprising of these results is that the local FCFS and no—turn policies
do not perform much better than the random policy. The random, no—turn, and

local FCFS policies are the three that do not require additional routing information.

59

   
   
    
   

 

 

 

 

40 l I I l
35 '- random 9- -<
no-turn +-
30 ' local FCFS 5- -
global FCFS -><—
25 I- distance-travelled A— -
Average
latency 20 ..
(pace)
15 q
10 .
5 ' 4
0 ' l 1
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 5.15. Comparison of input selection policies for the my routing algorithm.

 

    

 

 

 

40 I l l T
35 '- random 9- .
no-turn -I—
30 *- local FCFS 5- q
global FCFS -)(—-
25 distance-travelled A— ..
Average
latency 20 -I
(usec)
15 .
10 -I
5 I 4
0 1
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 5.16. Comparison of input selection policies for the west-ﬁrst routing algo-
rithm.

60

 

  
  
 
    
  
  
 
    
     

 

 

 

 

40 I I T I I
35 '- random 9- -
no-turn +—
30 _ local FCFS 5— d
global FCFS ->(—
25 distance-travelled A— J
Average 3
latency 20 ..
( 86¢)
It 15 -
10 ...
5 ' .1
0 . r
0 5 10 I5 20 25 30

Percent of maximum theoretical throughput

Figure 5.17. Comparison of input selection policies for the north-last routing algo-

rithm.

 

 

 

 

 

40 I 1 I I
35 F random 0- -
no-turn -l—
30 *- local FCFS '0'- -
global FCFS -x—
25 distance-travelled A— -I
Average
latency 20 ..
(wee)
15 .t
10 q
5 ..
0 l L l l

0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 5.18. Comparison of input selection policies for the negative-first routing
algorithm.

61

That they perform the worst may not be a coincidence and suggests that improving
performance by changing input selection policies requires changing to a policy that [is
based on additional information carried in the header ﬂits. The other main result is
that the distance-travelled policy outperforms the FCFS policies at high throughputs.
This result suggests that reducing contention is more important than maintaining
fairness. In the simulations of the distance-travelled policy, indeﬁnite postponement,
though theoretically possible, is not a problem.

The next simulations compare the distance-travelled, least-adaptive, and distance-
least input selection policies. These three policies differ only in whether they empha-
size distance or adaptiveness and in how they decide ties. Each policy is simulated
in a 10 x 10 mesh for the west-ﬁrst, north-last, and negative—ﬁrst routing algorithms.
Again, the output selection policy is zigzag, the routing is minimal, and the message
trafﬁc is uniform.

The results for each routing algorithm are very consistent (Figures 5.19, 5.20, and
5.21). At all throughputs, the input selection policies have nearly the same laten-
cies. These results suggest that augmenting a successful policy, such as the distance-
travelled policy, to give priority to the least adaptive packet is of marginal value, at
least for two-dimensional meshes. For higher-dimensional meshes, the adaptiveness
of packets has a wider range of values, and the added discriminating power of the

least-adaptive and distance-least policies may make them more useful.

5.6.2 Output Selection Policies

The next simulations compare the zigzag, my, and no-turn output selection policies.
Each policy is simulated in a 10 x 10 mesh for the west-ﬁrst, north-last, and negative-
ﬁrst routing algorithms. In all cases, the input selection policy is distancetravelled,
the routing is minimal, and the message trafﬁc is uniform.

Figures 5.22, 5.23, and 5.24 show the results. For each of the routing algorithms,

62

 

 

 

 

 

40 I l l I T

35 I- distance-travelled A— -

least-adaptive +-

30 " distance-least -0— -
Average '-
latency 20 ..
(wee)

l5 -

10 -

5 'I
0 . l l l l l
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 5.19. Comparison of input selection policies for the west-ﬁrst routing algo-
rithm.

 

 

 

 

 

 

40 T r I l l

35 ' distancetravelled A— -

least-adaptive +—

30 I' distance-least -0— -
Average2.5 _
latency 20 q
([1880)

15 -

10 -

5 4
0 l l J l l
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 5.20. Comparison of input selection policies for the north-last routing algo-
rithm.

63

 

 

 

 

 

40 I M I I I
35 P distance-travelled A— d
least-adaptive +—
30 " distance-least '0— d
25 -
Average
latency 20 .-
(pace)
15 -
10 d
5 .-
0 7 1 1 1 1 1
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 5.21. Comparison of input selection policies for the negative—ﬁrst routing
algorithm.

the no-turn policy has the lowest latencies and the zigzag policy has the highest
latencies. Depending on the routing algorithm, the my policy is comparable to either
the no-turn or zigzag policy. Relative to the zigzag policy, the no—turn policy increases
the maximum sustainable throughput by an average of 5%.

The relative performance of the three output selection policies appears to be due
to two factors: spreading message trafﬁc evenly and avoiding turns. The zigzag policy
tends to concentrate uniform trafﬁc in the center of the mesh, increasing contention
and latencies. Compensating for this tendency somewhat is the adaptiveness of the
routing algorithms. The my policy attempts to spread the uniform trafﬁc more evenly.
When contention is high, however, the my policy, like the other policies, chooses
the ﬁrst available channel, ignoring which channel is preferred. This behavior tends
to route more packets toward the center of the mesh. The better performance of

the my policy relative to the zigzag policy in general suggests that spreading trafﬁc

64

40 t l l r' I

35 - zigzag O— ..
my D—
30 ' no-turn A— ..

25
Average

latency 20
(#880)
15

 

10

 

 

 

J l l l

7 m
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 5.22. Comparison of output selection policies for the west-ﬁrst routing algo-
rithm. '

 

40 I l l 1 I

35 - zigzag O— , -
my Q— ‘

30 ' no-turn -A- ..

25

Average
latency 20

(am) 15

10

 

 

l

o 5 10 15 20 25 30

Percent of maximum theoretical throughput

 

Figure 5.23. Comparison of output selection policies for the north-last routing algo-
rithm.

65

40 T I I l I
35 - zigzag 0— cut
my D—
30 '- no-turn A— q

25

Average
latency 20

(#896) 15

10

 

 

 

 

I l l l

J
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 5.24. Comparison of output selection policies for the negative-ﬁrst routing
algorithm.

evenly is more important than maintaining maximum adaptiveness. The superior
performance of the no—turn policy suggests that turns are indeed costly and that

avoiding unnecessary turns is even more important than spreading trafﬁc evenly.

5.6.3 Trafﬁc Patterns

The next simulations compare the partially adaptive routing algorithms produced
by the turn model with the nonadaptive routing algorithms produced by dimension
ordering for different direct networks and different patterns of message trafﬁc. The
four networks studied are a 10 x 10 mesh, a 16 x 16 mesh, a 10 x 10 x 10 mesh,
and a binary 8-cube. The six trafﬁc patterns studied are uniform, matrix—transpose,
matrix-rotate, merge-south, reverse-ﬂip, and address-rotate. In Figures 5.25, 5.27,
5.28, 5.31, and 5.32, the input selection policy is distance-travelled and the output

selection policy is no-turn. In Figures 5.26, 5.29, 5.30, 5.33, and 5.34, the input

66

selection policy is local FCFS and the output selection policy is my. In all cases,
routing is minimal.

For each trafﬁc pattern, a different routing algorithm has the lowest latencies at
the high throughputs. For uniform trafﬁc, it is the my algorithm (Figures 5.25, 5.26,
and 5.27); for matrix-transpose trafﬁc, the negative-ﬁrst algorithm (Figures 5.28,
5.29, and 5.30); for matrix-rotate, merge-south, and address-rotate trafﬁc, the north-
last or all-but-one—negative-first (ABONF) algorithm (Figures 5.31, 5.32, and 5.33);
and for reverse-ﬂip trafﬁc, the west-first or all-but-one-positive-last (ABOPL) algo-
rithm (Figure 5.34). In general, the nonadaptive routing algorithms perform better
for uniform trafﬁc, and the partially adaptive routing algorithms perform better for
nonuniform trafﬁc. For matrix-transpose trafﬁc in the meshes and hypercube, the
maximum sustainable throughput of the partially adaptive algorithms is twice that
of the nonadaptive algorithms. In the meshes, this throughput is a third higher than
the second highest sustainable throughput, which occurs for the my algorithm and
uniform trafﬁc. The latter improvement in throughput is not due to shorter path
lengths for matrix-transpose trafﬁc than for uniform trafﬁc: the average path length
for matrix-transpose trafﬁc is 11.34 hops and for uniform trafﬁc is 10.61 hops. For
reverse—flip trafﬁc in the hypercube, the maximum sustainable throughput of the par-
tially adaptive algorithms is four times that of the nonadaptive e-cube algorithm
and one half higher than the next highest sustainable throughput, which occurs for
the e-cube algorithm and uniform trafﬁc. Again, average path length is longer for
reverse-ﬂip trafﬁc (4.27 hops) than for uniform trafﬁc (4.01 hops).

The differences in the performances of the algorithms, for individual trafﬁc pat-
terns and across trafﬁc patterns, are largely due to differences in the types of in-
formation about trafﬁc used in the algorithms. The my and e-cube algorithms are
nonadaptive, but happen to embody global, long-term information about the uni-

form trafﬁc pattern. By routing packets ﬁrst along one dimension and then the other,

 

67

 

   
 
  
   
 

 

 

 

40 I I I I I I
35 r- 533! -)<-— q
west-ﬁrst O—
30 '- north-last 5— ..
negativefirst A—
Average -
latency 20 -
(pace)
15 -
10 4
5 -‘ -
0 l l l
0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 5.25. Comparison of routing algorithms for uniform traffic in a 10 x 10 mesh.

 

    

 

 

 

 

 

 

40 I I I I I I I
35 . any -x- ..
west-ﬁrst O—
30 - north—last 5— -
negative-ﬁrst
Average25 ( -
latency 20 -
(#sec)
15 -
10 -
5 u
0 , 1 l l 1 l l 1
0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 5.26. Comparison of routing algorithms for uniform trafﬁc in a 16 x 16 mesh.

 

68

 

 

 

 

40 I w I I I I I
35 . I- 33/2 -x-— 4
west-south-ﬁrst 0—
30 - north-up-last «D- -
negative~ﬁrst A—
Average ‘
latency 20 -
(#886)
15 -
10 e
5 " d
0 l l I l

 

 

 

0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 5.27. Comparison of routing algorithms for uniform trafﬁc in a 10 x 10 x 10
mesh. ‘

 

40 I I I I I I I

35 - my -x— l d
west-ﬁrst 0-

30 " north-last 9- ..

negative-ﬁrst A-

25

Average
latency 20

(wee) 15

10

 

 

 

 

 

, 4
0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 5.28. Comparison of routing algorithms for matrix-transpose trafﬁc in a 10 x 10
mesh.

69

 

40 I I I I I I I

35 - $.11 *- -
west-ﬁrst 0—

30 '- northolast 5— ..

negativeﬁrst A-

25
Average

latency 20
(mec)
15

10

 

 

 

 

 

0 . I I I I I L I

0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 5.29. Comparison of routing algorithms for matrix-transpose trafﬁc in a 16 x 16
mesh. ‘

 

 

40 I I I I I I I I
35 *' e-cube -)(- _
ABONF @-
30 - ABOPL 9— -
p-cube A-
25 -
Average
latency 20 .
(pace)
15 ‘

 

10

 

 

 

I I I I I I

0 2 4 6 8 10 12 14 16 18 20

Percent of maximum theoretical throughput

Figure 5.30. Comparison of routing algorithms for matrix-transpose trafﬁc in a binary
8-cube.

 

70

 

40 I I - I I I I I

35 - 3y -x— J
west-ﬁrst @-

30 '- north-last 5— -

negative-ﬁrst A—

 

 

 

 

Average
latency 20 -
(#866)
15 d
10 a
5 ..
0 I I I I I I I
0 5 10 15 20 25 30 35 40

 

Percent of maximum theoretical throughput

Figure 5.31. Comparison of routing algorithms for matrix-rotate trafﬁc in a 10 x 10
mesh. '

 

40 I I f I I I I

35 - fry -><— -
west-ﬁrst 0—

30 ' north-last 9- m

negative-ﬁrst A—

25

Average
latency 20

(usec) 15

10

 

 

 

 

0 I I I I I I L

o 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 5.32. Comparison of routing algorithms for merge-south trafﬁc in a 10 x 10
mesh.

40
35
30

Average
latency 20

(wee) 15

10

71

 

   

I

~ e-cube -)(—
ABONF O—
- ABOPL 5—
p-cube «A—

 

 

 

I

12

Percent of maximum theoretical throughput

I

16

18

 

 

20

Figure 5.33. Comparison of routing algorithms for address-rotate trafﬁc in a binary

8-cube.

40
35
30
25

Average
latency 20

(We) 15

10

 

I

- e-cube 9(—
ABONF 9-
'- ABOPL '9—
p-cube

-

I

 

12

Percent of maximum theoretical throughput

16

18

 

20

Figure 5.34. Comparison of routing algorithms for reverse-ﬂip trafﬁc in a binary

8-cube.

72

they spread uniform trafﬁc as evenly as possible across the channels of a mesh or
hypercube in the long term. The adaptive algorithms, on the other hand, select
channels based on local, short-term information. These selections tend to beneﬁt just
the routed packet, and only for the immediate future, and tend to interfere with other
packets. The selections often produce zigzag paths, which disturb the global, long-
term evenness of uniform trafﬁc. The result is increased contention and decreased
performance.

Despite the superior performance of the nonadaptive routing algorithms for uni-
form trafﬁc, the partially adaptive algorithms probably provide better performance
in real systems. Uniform trafﬁc has been used in many previous simulation studies,
but is not generated by many applications. A trafﬁc pattern is determined by the
application and how its processes are mapped to the nodes of the direct network.
For most applications, each process communicates with some processes much more
than others. Nonuniform trafﬁc presents a problem for the my and e-cube algorithms
because they are nonadaptive. Just as they maintain the evenness of uniform trafﬁc,
they blindly maintain the unevenness of nonuniform trafﬁc. The result, as the ﬁgures

illustrate, is often poor performance.

5.6.4 Spreading Trafﬁc More Evenly

The overall conclusion from the simulations in the preceding subsection is that the
performance of a routing algorithm is largely determined by how well it spreads trafﬁc
across the channels of a direct network. The present subsection discusses how to
modify partially adaptive routing algorithms so that they spread trafﬁc more evenly.

Konstantinidou [28] describes two separate modiﬁcations of the p-cube routing
algorithm that make it spread trafﬁc more evenly across a hypercube. The ﬁrst
modiﬁcation is to route some packets ﬁrst in positive directions and then in negative

directions and to route the other packets ﬁrst in negative directions and then in

 

73

positive directions. To prevent deadlock, the number of channels in the positive
directions is doubled. The positive-ﬁrst packets and negative-ﬁrst packets use separate
channels in the positive directions but share the channels in the negative directions.
The trafﬁc of the positive-ﬁrst packets is heavier for channels nearer node n — 1, and
the trafﬁc of the negative-ﬁrst packets is heavier for channels nearer node 0. In total,
the trafﬁc of the two types of packets is fairly even. The second modiﬁcation is to
route all packets ﬁrst in positive directions, then in negative directions, and then
again in positive directions. Deadlock is again prevented by doubling the number of
channels in the positive directions. The total trafﬁc of the three phases is fairly even
across the channels of the hypercube.

The modiﬁcations by Konstantinidou can be generalized to any partially adaptive
routing algorithm and any direct network. The ﬁrst general technique for modifying
a routing algorithm to spread trafﬁc more evenly in a direct network is to create
multiple virtual networks and use a different version of the routing algorithm (or an
entirely different routing algorithm) in each virtual network. For example, a north-
last routing algorithm might be used in one virtual network, and a south-last routing
algorithm might be used in a second virtual network. Which routing algorithm and
virtual network are used to route a particular message can be determined randomly
(as Konstantinidou suggests) or according to which routing algorithm is best suited for
the particular source and destination nodes. When the virtual networks are overlaid
onto the physical network, it may be possible for virtual networks to share some
virtual channels and still prevent deadlock (as Konstantinidou demonstrates for the
p-cube routing algorithm). The overall objective is that the channel utilizations in
the virtual networks complement each other such that the channel utilizations in the
physical network are as even as possible.

The second general technique for modifying a partially adaptive routing algorithm

to spread trafﬁc more evenly is again to create multiple virtual networks with multiple

74

routing algorithms, but to connect the virtual networks in sequence. All messages
start in the same virtual network and repeatedly advance to the next virtual network
in the sequence. When the last phase of the routing algorithm in one virtual network
is similar to the ﬁrst phase of the routing algorithm in the next virtual network, the
two virtual networks may share virtual channels. Again, the overall objective is that
the channel utilizations in the virtual networks complement each other such that the
channel utilizations in the physical network are as even as possible.

The drawback to both of these general techniques is that they require that virtual
channels be added to a direct network. Adding virtual channels increases the cost of a
direct network. It also creates the possibility that a routing algorithm designed from
the start for a direct network with that many virtual channels might make better use
of the channels and perform better.

A third general technique for modifying a partially adaptive routing algorithm to
spread trafﬁc more evenly does not have this drawback of adding virtual channels to
a direct network. The technique is to use different versions of the routing algorithm
(or entirely different routing algorithms) in different parts of the direct network.
The objective is to use the routing algorithm that spreads trafﬁc the most evenly in
each part of the direct network. For example, this technique can be applied to two-
dimensional meshes and the west-ﬁrst routing algorithm. The result is the east-west
routing algorithm: the east half of the mesh uses a west-ﬁrst routing algorithm to
route packets, and the west half uses an east-ﬁrst routing algorithm. Thus, as shown
in Figure 5.35(a), packets destined for the other half of the mesh are routed there
nonadaptively (that is, east or west without going north or south). Packets in the
same half as their destination are routed adaptively primarily if they are heading
away from the center of the mesh, as shown in Figure 5.35(b). This combination of
nonadaptive routing toward the center of the mesh and adaptive routing away from

the center helps keep adaptiveness from inadvertently congesting the center of the

 

75

mesh.

 

 

 

 

 

 

 

 

 

 

(a) (b)

Figure 5.35. The directions in which packets in the two halves of a two-dimensional
mesh are routed by the east-west routing algorithm during its two phases.

To ﬁnd out how the east-west routing algorithm performs, it is simulated in a
10 x 10 mesh for a variety of trafﬁc patterns. In all cases, the input selection pol-
icy is distance-travelled, the output selection policy is no-turn, and the routing is
minimal. The results are shown in Figures 5.36, 5.37, 5.38, and 5.39. Overall, the
east-west routing algorithm performs better than the other partially adaptive routing
algorithms. For uniform trafﬁc, the nonadaptive my routing algorithm still performs

the best, but the east-west algorithm is a close second.

76

 

  
   
    
  
    
   
  

 

 

40 I I I I I I
35 - my 9(— .,
west-ﬁrst 9-
30 - north-last 5- a
negative-ﬁrst A-
east-west -O— -
Average
latency 20 -
(#866)
15 q
10 .
5 ‘ q
0 , I I I

 

 

0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 5.36. Comparison of routing algorithms for uniform trafﬁc in a 10 x 10 mesh.

 

 

 

 

 

 

40 I I I I I
35 - my -)(— -
west-ﬁrst 0-
30 - north-last D— ..
negative-ﬁrst A—
east-west -0— -
Average
latency 20
(psec)
15
10
5 .
0 , l l
0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 5.37. Comparison of routing algorithms for matrix-transpose trafﬁc in a 10 x 10
mesh. ‘

77

 

  
   
    
  

 

   

 

 

40 I I I I I I I
35 - ity -x— a
west-ﬁrst 0—
30 - north-last 5- 4
negative-ﬁrst A—
25 east-west -0— .l
Average
latency 20 -+
(ﬂux)
15 "
10 "
5 d
0 7 I I I I I I
0 5 10 15 20 25 30 35 40

 

Percent of maximum theoretical throughput

Figure 5.38. Comparison of routing algorithms for matrix-rotate trafﬁc in a 10 x 10
mesh. '

 

 

 

40 I I ’ I I I I I
35 - 3y *- . -
west-ﬁrst @-
30 ' north-last 9— , -
negative-ﬁrst A—
Average i -
latency 20 -
(ﬂux)
15 -
10 a
5 ‘ u
0 I I I I

 

 

 

o 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 5.39. Comparison of routing algorithms for merge-south trafﬁc in a 10 x 10
mesh.

CHAPTER 6

Fully Adaptive Routing

The turn model produces fully adaptive routing algorithms when applied to direct
networks with enough virtual channels. Unlike other models, the turn model helps
determine the minimum conﬁguration of virtual channels required for fully adap-
tive routing and produces routing algorithms that are maximally adaptive. This
chapter ﬁrst applies the turn model to the problem of fully adaptive routing in two-
dimensional meshes, ﬁnding the minimum conﬁguration of virtual channels required
and producing a maximally adaptive routing algorithm. Simulations show that this
algorithm performs better than the fully adaptive and nonadaptive algorithms pro-
duced by other models for the same network. Next, the chapter describes how to easily
design maximally fully adaptive routing algorithms for n-dimensional meshes in gen-
eral, adding the minimum number of virtual channels. Finally, the chapter applies
the simpliﬁed model to the design of a maximally fully adaptive routing algorithm

for three-dimensional meshes.

6.1 Two-Dimensional Meshes

Fully adaptive routing is possible in two-dimensional meshes if there are two pairs of

channels in one of the two dimensions. Chapter 2 assumed that this dimension was y

78

79

and named the routing algorithm the double-y algorithm. In light of the turn model,
this algorithm raises two questions. First, is the double-y algorithm the most adaptive
routing algorithm for two-dimensional meshes with double y channels? Second, is
fully adaptive routing possible in two-dimensional meshes with fewer channels? This
section uses the turn model to answer these questions.

The ﬁrst question can be answered by applying the turn model to two-dimensional
meshes with double y channels. In such double-y networks, there are six virtual
directions: one west, one east, two north, and two south, as shown in Figure 6.2(a).
The six directions permit sixteen 90-degree turns: two turns to the north and two
to the south when travelling in each of the directions west and east and one turn to
the west and one to the east when travelling in each of the two north and two south
directions. These turns conveniently form eight cycles of four turns each, as shown in
Figure 6.1. Cycles in which the head of one turn and the tail of the turn to which it is
matched are in the same topological direction but different virtual directions are not
considered now, because the turn model delays considering 0«degree (and ISO-degree)

turns until the last step.

Figure 6.1. Eight simple cycles formed by the sixteen turns in a double-y network.

The next step in applying the turn model is to prohibit enough of the sixteen turns

to break all of the cycles in Figure 6.1, as well as all of the cycles involving more than

80

four turns. A simple way to start is with the four cycles in the left half of Figure 6.1,

which contain all sixteen turns. There are 4“ ways to prohibit one turn in each of these

four cycles. Of these 256 ways, eight break all cycles and allow fully adaptive routing.

The eight ways are symmetric to each other, however, and yield equivalent routing

algorithms. One of the eight ways to prohibit four turns is shown in Figure 6.2(c).

It is easy to verify that the four prohibited turns also break the four cycles in the
right half of Figure 6.1. Note‘that the double-y algorithm (Figure 6.2(b)) prohibits l

the same four turns, plus four additional turns, thereby reducing its adaptiveness. .2

+Yl +Y2

1H

 

 

 

 

H H

-Y1 -Y2

(a) a router in a
double-y network

5"“ {—1

____‘__. 8/16
F . .
1..-... L _4
(b) double-y algorithm
(fully adaptive)
l" "'1’ i— “f
15::1 #:fi 12/16
=Z=
L -33: L _4
(c) mad-y algorithm
(maximally fully adaptive)

Figure 6.2. A router in a double-y network and the turns allowed by some routing

algorithms.

Because there are no wraparound channels in a mesh, the next step is to incor-

81

porate as many 0-degree turns as possible without reintroducing cycles. The four
possible O-degree turns are from the ﬁrst set of northward channels to the second and
vice versa and from the ﬁrst set of southward channels to the second and vice versa.
Each of these four O-degree turns must be tested to see whether its inclusion creates
any cycles. For example, if turns from the ﬁrst set of northward channels to the
second were allowed, they would create four potential cycles of four turns, as shown
in Figure 6.3(a). At least one turn in each of these cycles is already prohibited, so
these O-degree turns can be allowed safely. If turns from the second set of northward
channels to the ﬁrst were allowed, they would create four potential cycles of four
turns, as shown in Figure 6.3(b). No turns are prohibited in two of these cycles,
so these O-degree turns must remain prohibited. Continuing to test O-degree turns
reveals that it is safe to allow O-degree turns only from the ﬁrst sets of northward and
southward channels to their respective second sets. Note that, though these turns
do not cause deadlock, routing a packet along them when the packet still has to be
routed farther west would prevent the packet from reaching its destination, because

turns to the west from the second sets of channels are prohibited.

1"” I"? {“1
L"; I'ZJ L714
i—«é 7 i—

L...-.€ LA LA

(a) If the Mega turn (b) If the 0-degree turn
I. to I '5 allowed. I to I, is allowed.

Figure 6.3. The potential cycles created by allowing 0-degree turns when travelling
north.

-"--‘—-r—

82

If nonminimal routing is desired, the ﬁnal step is to incorporate as many 180.
degree turns as possible without reintroducing cycles. The ﬁve pairs of possible
180-degree turns are from the eastward channels to the westward channels and vice
verso, from the ﬁrst set of northward channels to the ﬁrst set of southward channels
and vice versa, from the ﬁrst set of northward channels to the second set of southward
channels and vice verso, from the second set of northward channels to the ﬁrst set of
southward channels and vice versa, and from the second set of northward channels
to the second set of southward channels and vice versa. Testing whether allowing
each 180—degree turn creates cycles, much as was done for O—degree turns, reveals that
three particular turns must remain prohibited. These are the turns from the eastward
channels to the westward channels, from the second set of southward channels to the
ﬁrst set of northward channels, and from the second set of northward channels to the
ﬁrst set of southward channels. These three turns come from three different pairs. In
the other two pairs, neither of the turns creates a cycle when allowed. Both of the
turns in a pair cannot be allowed, however. If both turns in a pair were allowed, they
would form a cycle by themselves. Thus, in three pairs, a speciﬁc 180-degree turn can
be allowed; and in the other two pairs, either, but not both, of the 180-degree turns
can be allowed. Therefore, four different sets of ﬁve 180-degree turns can be allowed.
That half of the 180-degree turns can be allowed is no accident.

Theorem 6.1 The mamimum number of 180-degree turns that the turn model can

allow in a direct network is always one half of the total number of 180-degree turns.

Proof: The ISO-degree turns in a direct network always come in pairs: for each
180—degree turn, its reverse is always a different ISO-degree turn. As argued before,
both of the turns in a pair cannot be allowed, without creating a cycle by themselves.
Therefore, the turn model can allow at most one half of the 180-degree turns. If .a

ISO-degree turn creates a cycle when allowed, some combination of the other allowed

83

turns must form the equivalent of the reverse 180-degree turn. If both of the turns in
a pair create cycles when allowed, the other allowed turns must form the equivalents
of both of the turns, which implies that the other allowed turns can also form a cycle.
This contradiction implies that at least one of the turns in each pair can be allowed.

Therefore, the turn model can allow at least one half of the l80degree turns. D

The routing algorithm produced by the preceding application of the turn model is
called the momimolly adoptive double-y routing algorithm or mad-y for short. It routes
northeastbound packets ﬁrst along the ﬁrst set of northward channels and then along
the second set of northward channels and the eastward channels. It routes southeast-
bound packets ﬁrst along the ﬁrst set of southward channels and then along the second
set of southward channels and the eastward channels. It routes southwestbound pack-
ets ﬁrst along the westward channels and the ﬁrst set of southward channels and then
along the second set of southward channels. It routes northwestbound packets ﬁrst
along the westward channels and the ﬁrst set of northward channels and then along
the second set of northward channels.

Proof that the mad-y routing algorithm is deadlock free is based on the work
of Dally and Seitz [19], who show that a routing algorithm is deadlock free if the
channels in the direct network can be numbered so that the algorithm routes every

packet along channels with strictly increasing (or decreasing) numbers.
Theorem 6.2 The mad-y routing algorithm for double-y networks is deadlock free.

Proof: Assign each channel in the m x n- mesh a two-digit number a,b in base
r, where r is the greater of 2m and n. Figure 6.4 shows the particular numbers to
assign to the channels leaving node (m, y). To show that the mad-y algorithm routes
every packet along channels with strictly increasing numbers, it is sufﬁcient to show

that, for each channel into an arbitrary node, the algorithm can only route the packet

84

out along a channel with a higher number. Figure 6.5 shows the six possible cases.

Examination shows that, in each case, the channels used to route a packet out of a

node have higher numbers than the input channel. [I]
m-l-x,l+y m+x,l+y
m-x,0 m+l+x,0
m-l-x,n-y m+x,n-y

 

Figure 6.4. Numbering of the channels leaving each node (2:, y) for the mad-y routing
algorithm.

The ﬁrst question from the beginning of this section has now been answered. The
double-y routing algorithm is not the most adaptive algorithm for two-dimensional
meshes with double y channels. It prohibits eight of the sixteen 90-degree turns, all
four O-degree turns, and all ten of the 180—degree turns, while the mad-y algorithm
prohibits only four of the sixteen 90-degree turns, two of the four O-degree turns,
and ﬁve of the ten 180-degree turns. In addition, prohibiting any fewer than four
90—degree turns, two 0~degree turns, or ﬁve ISO-degree turns allows packets to wait
on each other in cycles, creating deadlock. Thus, the mad—y algorithm cannot be
made more adaptive.

The second question from the beginning of this section, whether fully adaptive
routing is possible in two-dimensional meshes with fewer channels, can be answered
by applying the turn model to two-dimensional meshes with double eastward and
northward channels (Figure 6.6(a)). Such a network has six virtual directions and

eighteen 90-degree turns (Figure 6.6(b)). The eighteen turns form eight cycles of four

 

85

m+XII+Y m+x,l+y l

m-l-x,0 m+x,0 m+l+x,0

   

ﬁr”

m+x,n-y m+x,n-y

 

m-l-x,n-1-y m-l-x,l+y m+x,l+y

 
  

m-x,0 m+l+x,0 m-x,0 m+1+x,0

   
  

m-l-x,n-y m+x,n-y m-l-x,y
m+x,n-l-y m+x,l+y
m+1+x,0 m+1+x,0
m+x,n-y m+x,y

 

Figure 6.5. For each input channel, the output channels allowed by the mad-y routing
algorithm have higher numbers.

86

turns. It is not possible to break all of the cycles by prohibiting turns and retain
full adaptiveness, however. Thus, two-dimensional meshes with double eastward and
northward channels do not support fully adaptive routing, even though they have the
same number of channels as double-y networks. Also, two-dimensional meshes with
just double eastward channels or just double northward channels do not support fully
adaptive routing, since they have even fewer of the same turns as two-dimensional
meshes with double eastward and northward channels. Therefore, of the obvious
regular direct networks, double-y networks have the minimum number of channels
necessary to support fully adaptive routing and have the only arrangement of these

channels, other than double :1: channels, that support fully adaptive routing.

# 4"“: r1
$7.791 LA
in ‘1
.x: :“I *— LN}

+— -II->+X2 I".
1;; .1
' l"

(a) a router with double ;
channels in the +X (east)
and +Y (north) directions

 

 

 

 

(b) 18 poSsible turns

Figure 6.6. A router for a two-dimensional mesh with double eastward and northward
channels and the possible turns.

 

87
6.2 Simulation Experiments

This section reports on simulations designed to compare the double-y, mad-y, and
my routing algorithms for different trafﬁc patterns. The direct network studied is a
16 x 16 mesh with double y channels. Neighbors along the y dimension are connected
by two pairs of unidirectional physical channels in order to simplify the computations.
The ﬁve trafﬁc patterns studied are uniform, center-reﬂection, matrix-transpose, bit-
reversal, and matrix-multiply. The input selection policy is distance-travelled, the
output selection policy is my, and the routing is minimal.

The ﬁrst simulations explore whether the my routing algorithm or its reverse, the
ym routing algorithm, performs better under uniform trafﬁc. These two nonadaptive
algorithms are allowed to route packets along any of the y channels, so that they
are not at a disadvantage relative to the adaptive algorithms in later simulations.
Because the y dimension has twice the number of channels and twice the bandwidth
of the m dimension, however, one of the nonadaptive algorithms may perform better
than the other.

The results of the simulations are shown in Figure 6.7. At all throughputs, espe-
cially high ones, the my routing algorithm has the lower latency. The my algorithm
also has a maximum sustainable throughput that is about 13% higher.

These improvements are caused by the greater throttling effect of initially routing
packets along the dimension with the fewer number of channels. Throttling is inherent
in direct networks. Increased trafﬁc tends to increase contention for channels, and the
increased contention tends to restrict the number of new packets that can enter the
network, thereby reducing trafﬁc. When some dimensions have fewer channels than
others, contention tends to increase most rapidly in the dimensions with the fewest
channels, making them the bottleneck. Routing packets along these dimensions ﬁrst

places the bottleneck at the best point for keeping new packets from entering the

88

network and tying up resources. Similar reasoning is behind the decision to use
the my output selection policy for the adaptive routing algorithms instead of a 3)::
policy. The my output selection policy increases contention for the smaller number of
m channels, which are the bottleneck for packets travelling through the network. The
distance-travelled input selection policy causes competitions for the m channels to be
won by packets that have travelled farther through the network and that, therefore,
tend to be older. Thus, with the my output selection policy, younger packets tend to

hold fewer precious resources, and older packets tend to ﬁnish with less interference.

40
35 '-
30 -

25 -
Average
latency 20 -

(twee) 15 _
10 -
5T7 -J
0 1 I I I 1

0 5 10 15 20 25 30

Percent of maximum theoretical throughput

 

 

 

 

 

 

Figure 6.7. Comparison of nonadaptive routing algorithms for uniform trafﬁc in a
16 x 16 mesh with double y channels.

The remaining simulations explore how the performances of the my, double-y,
and mad-y routing algorithms for double-y networks vary with the trafﬁc pattern.
Figures 6.8 to 6.12 show the results. For uniform trafﬁc and high throughputs, the
nonadaptive my algorithm has lower latencies than the fully adaptive algorithms. At

89

low throughputs, however, the mad-y algorithm has slightly lower latencies than the
others. The maximum sustainable throughput for the my algorithm is also higher than
those of the fully adaptive algorithms by around 50%. The results for center-reﬂection
trafﬁc are similar. For the matrix-transpose, bit-reversal, and matrix-multiply trafﬁc
patterns, the mad-y algorithm has the lowest latencies at all throughputs, the double-
y algorithm has the second lowest latencies, and the my algorithm has the highest
latencies. These differences increase as throughput increases. At high throughputs,
the mad-y algorithm has latencies around 30% lower than those of the double-y
algorithm. The maximum sustainable throughputs of the fully adaptive algorithms
are around 50% higher than that of the my algorithm. Finally, regardless of the trafﬁc

pattern, the mad-y routing algorithm has lower latencies than the double-y algorithm.

 

40 I I I

35 " $3] *-

double-y @—
30 mad-y D—
25

Average
latency 20

  
 
    
   

 

 

  

 

   

J I L I

o 5 1o 15 20 25 30

Percent of maximum theoretical throughput

Figure 6.8. Comparison of routing algorithms for uniform trafﬁc in a 16 x 16 mesh
with double y channels.

As in Section 5.6.3, the differences in the performances of all the algorithms for

 

90

 

  

 

 

 

 

 

 

40 I I I I

35 F $31 ->(— ..

double-y @—

30 - . mad-y -D— _
Average25 - .
latency 20 - ..
(Itsec)

l5 " -I

10 - 1

5T -.
0 L l I I I
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 6.9. Comparison of routing algorithms for center-reﬂection trafﬁc in a 16 x 16

mesh with double y channels.

 

 

 

 

40 m I F I I I

35 - 3y -)<— q

double-y @—

30 - mad-y Q— q
Averagez'5 ‘
latency 20 -
(pace)

15 n

10

 

 

0 7 I I I I I

0 5 10 15 20 25

Percent of maximum theoretical throughput

 

30

Figure 6.10. Comparison of routing algorithms for matrix-transpose trafﬁc in a 16 x 16

mesh with double y channels.

 

91

 

  
 
 
   
 

 

   

 

 

40 I I I I
35 "' 3y 9(— .(
double-y 0-

30 mad-y 5- q
Average d
latency 20 q
(pace)

15 -

10 -

5 .-
0 7 I 1 I l
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 6.11. Comparison of routing algorithms for bit-reversal trafﬁc in a 16 x 16
mesh with double y channels.

 

 

 

 

 

 

40 F I I I I
35 - my -)(— a
double-y 0—

30 - mad-y D— _
Average25 ‘
latency 20 -
(#986)

15 -

10 -

5 ..
0 7 I I I I I
0 5 10 15 20 25 30

Percent of maximum theoretical throughput

Figure 6.12. Comparison of routing algorithms for matrix-multiply trafﬁc in a 16 x 16
mesh with double y channels.

is

92

individual trafﬁc patterns and across trafﬁc patterns are largely due to differences
in the types of information about trafﬁc used in the algorithms. The nonadaptive
my algorithm happens to embody global, long-term information about the uniform
and center-reﬂection trafﬁc patterns. It spreads uniform and center-reﬂection trafﬁc
as evenly as possible across the channels of a mesh in the long run. The adaptive
algorithms, on the other hand, select channels based on local, short-term information.
These selections tend to beneﬁt just the routed packet, and only for the immediate
future, and tend to interfere with other packets. The selections often produce zigzag
paths, which disturb the global, long-term evenness of uniform and center-reﬂection
trafﬁc. The result is increased contention and decreased performance.

For the matrix-transpose, bit-reversal, and matrix-multiply trafﬁc patterns, the
situation is different. The my algorithm embodies the wrong global, long-term in-
formation for these patterns and spreads the trafﬁc very unevenly. The adaptive
algorithms, especially the more adaptive mad-y algorithm, spread the trafﬁc fairly
evenly by avoiding local, short-term hot spots. They do not spread the trafﬁc as
evenly as an algorithm would that embodied the correct global and long-term infor-
mation, but they come reasonably close. The maximum sustainable throughputs for
the my algorithm under uniform and center-reﬂection trafﬁc are 29% and 30%; for the
adaptive algorithms under matrix-transpose, bit-reversal, and matrix-multiply traffic,
they are 25%, 20%, and 15%.

What makes the adaptive routing algorithms, particularly the mad-y algorithm,
preferable overall is that uniform and center-reﬂection trafﬁc patterns are rare. In
an actual system, trafﬁc is much more likely to be a changing mix of nonuniform
trafﬁc patterns over time. For most of the nonuniform trafﬁc patterns simulated, the

adaptive routing algorithms perform better than the nonadaptive algorithms.

 

93
6.3 n-Dimensional Meshes

Designing maximally fully adaptive routing algorithms is easy for two-dimensional
meshes. The analysis is simple, and a little trial and error reveals the virtual channels
that must be added to the topology. For higher dimensions, it is less obvious exactly
how many virtual channels need to be added and where, and the analysis of particular
cases is much more difﬁcult. This section uses the turn model and insights gained from
applying it to two-dimensional meshes to discover how to easily design maximally fully
adaptive routing algorithms for n—dimensional meshes. The ﬁrst step is to discover
the minimum number of virtual channels required and their positions. Then, it is
straightforward to design a fully adaptive routing algorithm that is not maximally
adaptive, much as other models might do. Analysis of the fraction of turns used by
such an algorithm reveals a very low degree of adaptiveness, however. Consequently,
the next step is to discover how to easily modify a fully adaptive routing algorithm to
make it maximally adaptive, with the same result as rigorous application of the turn
model. Analysis of the fraction of turns used by a maximally fully adaptive routing
algorithm reveals a high degree of adaptiveness.

To start, an n-dimensional mesh has 2n topological directions: a positive and a
negative direction along each dimension. For each dimension, a packet must travel
in at most one of the two directions. Thus, each packet must travel in at most n
directions and belOngs to at least one of 2" classes, based on its particular combination
of up to n directions. Each class can be designated by a sequence of n 0’s and 1’s,
such as 11010000, where 0 or 1 in position i represents routing in the negative or
positive direction of dimension i. For each class, a fully adaptive algorithm must
route packets along a set of virtual channels in each of the n directions. The virtual
network model would use a separate set of virtual channels for each direction in each

class. In the case of two-dimensional meshes, however, Section 6.1 shows that pairs of

94

classes can share sets. For the double-y and mad-y algorithms, the 00 and 01 classes
share the westward channels, and the 10 and 11 classes share the eastward channels.
The sets of virtual channels in the two pairs of classes can be represented graphically

as in Figure 6.13.

 

 

Figure 6.13. The directions of the virtual channels in the two pairs of classes for fully
adaptive routing in a two-dimensional mesh.

Theorem 6.3 In an n-dimensional mesh, a pair of classes can share all but one set

of virtual channels and still prevent deadlock.

Proof: Assume that the two classes share sets in every dimension except i. Be-
cause two classes can share a set in a dimension only if they have the same direction
in that dimension, the two classes must have the same directions in all but dimension
i. In dimension i, the classes must have different directions; otherwise, the two classes
would be the same class. If at least one of the two 180-degree turns along dimension i
are prohibited, there can be no cycles involving the n + 1 sets. A cycle would require
travelling in both directions of two dimensions, and only one dimension, i, has virtual

channels in two directions. Therefore, deadlock is prevented. D

From the standpoint of reducing the number of virtual channels required for the
fully adaptive routing of two classes, being able to share all but one set is ideal.

Unfortunately, a class involved in such a pair cannot be a member of any other pair.

95

Theorem 6.4 In an n-dimensional mesh, if a class is paired with each of two other

classes and shares all but one set of virtual channels with each, deadlock is possible.

Proof: Assume that class one does not share sets in dimension i with class two
and does not share sets in dimension j with class three. Dimensions i and j must be
different; otherwise, classes two and three would be the same class. The directions
of. the three classes in the two dimensions i and j must be either 00, 01, 10; 00, 01,
11; 00, 10, 11; or 01, 10, 11. Each of these four cases is symmetric to the example of
deadlock shown in Figure 5.3. C]

If a class is paired with each of two other classes and shares less than n — 1
sets with each, even more sets of virtual channels are involved and deadlock is still
possible. Therefore, the biggest reduction in the number of virtual channels comes
from pairing classes so that they share all but one set. Such a pairing leads to 2""1
pairs of classes. Since each pair requires n + 1 sets of virtual channels, each interior
node of the mesh has (n + l)2"‘1 output network channels. The pairs can be formed

so that the maximum number, V, of virtual channels per physical channel is

 

V = [n + 12n-2]
n
Some physical channels have V virtual channels, and the others have V — 1. In
contrast, the wraparound model requires the same total number of virtual channels
as the turn model, but places more of the virtual channels along dimension 0. For the
wraparound model, V = 2"“. Also unlike the turn model, the wraparound model
does not show that the number of virtual channels used is the minimum necessary for
fully adaptive routing. The virtual network model, on the other hand, requires more
virtual channels than either the turn model or the wraparound model.

The simplest fully adaptive routing algorithm is to route each packet along the

96

sets of virtual channels for its class. This algorithm is the one that the wraparound
model produces. The problem with the algorithm is that it is not nearly as adaptive
as it could be, considering the number of virtual channels that have been added.
One meaningful measurement of the degree of adaptiveness of a routing algorithm
in an n-dimensional mesh with virtual channels is the fraction of 90-degree turns
allowed from one virtual direction to another, A. A is easiest to estimate if it is
assumed that every physical channel has V virtual channels, which happens when

n = 2‘, where i 2 2. In this case,

V = lily-2
n
In such a network, the number of virtual directions is 2nV, and the number of possible
90-degree turns from one virtual direction to another is

2
2nV(2n — 2)V = n(n _ 1) (1&1) 22n-2

The simple fully adaptive routing algorithm allows n(n — 1)+ (n — l)2 or (n +2)(n -— 1)
90-degree turns from one virtual direction to another per pair of classes. The total
number of turns is, thus, (n + 2)(n - l)2”‘1. Therefore,

_ (n +1)2 —1 ~ 1
— (n +1)22n-1 ~ 273-1

 

Thus, for all but low-dimensional meshes, the simple fully adaptive routing al-
gorithm makes very poor use of the virtual channels. Fortunately, the turn model
can produce more adaptive routing algorithms, as Section 6.1 demonstrated for two-
dimensional meshes. The difﬁculty is that applying the turn model to meshes with
many dimensions and virtual channels is very tedious. In the case of maximally

fully adaptive routing, however, the turn model can be simpliﬁed. The insight for

 

97

the simpliﬁcation comes from comparing the double-y and mad-y routing algorithms
for two-dimensional meshes. The mad-y algorithms is basically the double-y routing
algorithm with turns added from the virtual channels in one pair of classes to the
virtual channels in the other pair. This insight suggests the following design method
for maximally fully adaptive routing algorithms. First, design a simple fully adaptive
routing algorithm as described above. Second, order the pairs of classes. Finally, add
all turns from earlier pairs to later pairs.

How adaptive are the algorithms produced by this special case of the turn model?

First, the maximum number, P, of pairs with channels in both directions of some

n—2
12:2? 1
72

Some dimensions have P pairs with channels in both directions, and the other dimen-

particular dimension is

 

sion have P - 2 pairs. A is easiest to estimate if it is assumed that every dimension
has P pairs with channels in both directions, which happens when n = 2‘, where

i2 2. In this case,
2n—1

II

P

 

The number 90-degree turns from one virtual direction to another added to the simple

fully adaptive routing algorithm is

 

(2(n -1)+(n _1,,,!LI’2:;2, + (n -1+ a?) (”4‘2“ ’ I) — 53—22,.)

2 2

or

' 1
2""z(2"'1(n2 + n — 1 — a) — n2 — n + 2)

Therefore,
n— 1
— 2n ~

 

bDl

 

98

Thus, the degree of adaptiveness is approximately independent of the number of
dimensions. Table 6.1 shows that the degree of adaptiveness quickly approaches one

half as the number of dimensions increases.

Table 6.1. The degree of adaptiveness for maximally fully adaptive routing algorithms.

 

ur-—-sr .—.. _.- —‘ . I.
_l I I .

‘9

g I

12/16 = 0.75

104/168 z 0.62
672/1200 = 0.56
585088/1161216 z 0.50

 

 

 

mﬁww

 

 

 

 

6.4 Three-Dimensional‘Meshes

This section illustrates the ease with which the simpliﬁed model of the previous section
can be applied to the design of a maximally fully adaptive routing algorithm for three-
dimensional meshes. The ﬁrst step is to identify the 23 classes of messages. The eight
classes can be paired as shown in Figure 6.14. This pairing nearly balances the
number of virtual channels in each dimension, with two virtual channels per physical
channel in the m dimension and three virtual channels per physical channel in the y
and 2 dimensions. The next step is to order the four pairs of classes. Any order is
probably as good as another. The ﬁnal step is to allow all 90-degree turns within
pairs of classes, one 180-turn within each pair, and all 0-, 90-, and ISO-degree turns
from earlier pairs to later pairs. That amounts to 14 0-degree turns, 104 90-degree
turns, and 22 180-degree turns for the maximally fully adaptive routing algorithm.
A simple fully adaptive routing algorithm, in contrast, would allow only 40 90—degree

turns.

99

 

Figure 6.14. The directions of the virtual channels in the four pairs of classes for fully
adaptive routing in a three-dimensional mesh.

CHAPTER 7

Fault-Tolerant Routing

The turn model produces routing algorithms that are deadlock free, maximally adap-
tive, minimal or nonminimal, and livelock free. This combination is ideal for fault-
tolerant routing. This chapter presents a model of fault-tolerant routing, a one-fault-
tolerant routing algorithm for two-dimensional meshes, and simulation results for

nonminimal and fault-tolerant routing algorithms.

7 .1 A Model of Fault-Tolerant Routing

The objective of fault-tolerant routing is to maximize the ability of the good nodes in
a direct network to communicate with each other in the presence of faulty nodes and
channels. Ideally, all of the good nOdes would be able to communicate with each other,
regardless of the number and position of the faulty nodes and channels. Because nodes
in direct networks rely on neighboring nodes to forward message packets, however,
such ideal fault tolerance is not possible.

This section describes a set of requirements for the design of direct networks that
support fault-tolerant routing. It is not, by any means, the only possible set of
requirements, but it is a convenient one for an initial examination of fault tolerance.

To simplify the requirements some, nodes and channels are not considered separately.

100

101

Instead, the input network channels for a node are considered to be part of the node.

The ﬁrst requirement is that faults be detectable. To the extent that some are not,
packet routing and transmission may not function properly. Second, once any part of
a node is detected as faulty, all other parts of the node are detectable only as faulty.
Packets passing through a node when it fails immediately have an erroneous tail ﬂit
inserted in their stream of ﬁits. Then, the faulty node stops operating, performing no
further routing and generating no further messages. A node that receives a packet for
a faulty neighbor or for a node to which the path is blocked by faulty nodes discards
the packet and possibly returns an acknowledgement of the error to the sender. Error
detection and acknowledgement of packet receipt are performed from end to end. If
a node sending a packet receives an acknowledgement that the destination node has
received the packet successfully, the transmission is complete. If the sender receives
an acknowledgement of an error or receives no acknowledgement after a long period,
it may resend the packet. Doing so will probably correct transmission errors, but still

may not allow the packet to be delivered around all of the faulty nodes.

7 .2 Two-Dimensional Meshes

Based on the model of fault-tolerant routing given in the previous section, it is pos-
sible to design a routing algorithm that is one-fault tolerant for a two—dimensional
mesh that has no virtual channels. That is, the algorithm continues to route messages
between all of the good nodes no matter which one node is faulty. One-fault tolerance
is the best that can be achieved in a two-dimensional mesh, since it is possible for a
corner node to be isolated by faults in its two neighbors. It is also a surprising re-
sult, considering that two-dimensional meshes without virtual channels are commonly
thought to require nonadaptive routing.

The one—fault-tolerant routing algorithm is a modiﬁcation of the negative-ﬁrst

 

102

routing algorithm produced by the turn' model for two-dimensional meshes. The
nonminimal version of the negative-ﬁrst routing algorithm can usually route packets
adaptively during both its negative and positive phases. The only time during the
negative phase when routing is nonadaptive is when a packet is at the west or south
edge of the mesh. Then, the packet has only one direction it can be routed: along the
edge in a negative direction. The only time during the positive phase when routing is
nonadaptive is when a packet has only one dimension to travel along in order to reach
the destination. A few simple modiﬁcations of the negative-ﬁrst routing algorithm
can remove these few cases of nonadaptiveness. The modiﬁcations are basically to
route a packet around faulty nodes on the west or south edge of the mesh, to route a
packet farther west or south than required for minimal routing in order to make the
remainder of the path to the east or north adaptive, and to leave the remainder of
the path adaptive until the last hop. Figure 7.1 illustrates these modiﬁcations with

a few examples of the new paths.

faulty node

é minimal path

 

 

Figure 7.1. Examples of the paths allowed by the one-fault-tolerant routing algorithm
for two-dimensional meshes. '

 

103

The one-fault-tolerant routing algorithm resulting from the modiﬁcation of the

negative-ﬁrst routing algorithm has the following four phases.

1. Route a packet as far west and south as necessary. If a faulty node not on the
west or south edge blocks the path, route the packet an extra hop west or south.
If a faulty node on the west edge blocks the path to the west, route the packet
one hop north. If a faulty node on the south edge blocks the path to the south,
route the packet one hop east. If a faulty node on the west edge blocks the path
to the south, route the packet one hop east. If a faulty node on the south edge

blocks the path to the west, route the packet one hop north.

2. If the packet must travel more than one hop north and no hops east, route the
packet one hop west if possible. If not possible because the node to the west is

faulty, route the packet one hop north and repeat this phase.

3. If the packet must travel more than one hop east and no hops north, route the
packet one hop south if possible. If not possible because the node to the south

is faulty, route the packet one hop east and repeat this phase.

4. Route the packet as far east and north as necessary, maintaining the ability to
route the packet both east and north as long as possible. If a faulty node on
the west or south edge of the mesh blocks the path to a destination on that
edge, route the packet one hop perpendicular to the edge, two hops toward the
destination, and one hop back to the edge. While a faulty node blocks the path
back to the edge, route the packet one more hop toward the destination and

one hop back to the edge.

As described in the previous section, if a node ever ﬁnds it impossible to route a
packet farther, it discards the packet and possibly returns an acknowledgement of the

error to the sender.

104

The one-fault-tolerant algorithm can always route packets around a single faulty
node and can often route packets around multiple faulty nodes. Proof for a 4 x14
mesh is given in Figure 7.2. The ﬁgure shows how to number the channels of a 4 x 4
mesh so that the one-fault-tolerant routing algorithm routes packets along them in
strictly increasing order. Channels with two numbers, one being in parentheses, vary
their numbering according to which nodes are faulty. If the node immediately to the
southwest of the channel is good, the number outside of the parentheses is used. If
the node immediately to the southwest of the channel is faulty, the number inside of
the parentheses is used. Varying the numbering is a legitimate proof technique since
each failure of a node creates a new direct network in which a new proof must be
constructed. Fortunately, similar numberings work for all of the possible networks;
only seven faulty nodes require that some channels be renumbered. Examining the
numberings in detail shows that no matter which one node is faulty and no matter
which good nodes are the source and destination of a packet, the packet can be routed
from the source to the destination along channels with strictly increasing numbers.
Therefore, the routing algorithm is both deadlock free and one-fault tolerant for a
4 x 4 mesh. The proof can easily be extended to two-dimensional meshes with other

sizes.

7 .3 Simulation Experiments

The simulations reported in the previous chapters have all been of minimal routing
algorithms. This section describes the results of simulations of some nonminimal
routing algorithms, including the one-fault-tolerant algorithm from the previous sec-
tion. The direct network studied is a 10 x 10 mesh. The two traffic patterns studied
are uniform and matrix-transpose. The input selection policy is distance-travelled,

and the output selection policy is no-turn.

 

105

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

C(22) 3 ‘_ 0
[E ,1.3"—‘—1 2.3“'—“, 3.3
21 (5) 24 2‘7
7[ 20 6 21 3 I24 0 27
_ 9(19) - s 3
[0.2 nth—52.2 .,3.2
‘ 18 (8) 7 ‘ 21 A 24 7
10 17 9 18 6 21 3 24
12(16) . 9 ‘ 6 _
ran (1,1"—"— 2,1 3,1
7 15 (11) 18 7 21 4
13 14 12 15 9 18 s 21
I (16) (u) (19) 1(8) (22) (5)
13 10 7
[ﬂ 1,0“'— 2,0"'—— 3,0
14 17 20

Figure 7.2. Numbering of a 4 X 4 mesh for the one-fault-tolerant routing algorithm.

The ﬁrst simulations compare minimal and nonminimal versions of the west-ﬁrst,
north-last, and negative-ﬁrst routing algorithms for uniform trafﬁc. Figures 7.3, 7.4,
and 7.5 show the results. In each case, nonminimal routing has slightly lower latencies
at low throughputs and has much higher latencies at high throughputs.

These results suggests that, when contention is high, shorter path lengths (min-
imal routing) are more important than greater adaptiveness (nonminimal routing).
The advantage of nonminimal routing, however, is that it can greatly increase fault
tolerance. One way to make nonminimal routing more practical is to restrict the
number of hops that do not lead closer to the destination. In the previous simula-
tions of nonminimal routing, the number of misroutings was unrestricted. In the next
set of simulations, the number of misroutings is varied. Figure 7.6 shows the results.
As the number of misroutings allowed increases, so does the average latency at high
throughputs.

Again, the results suggest that nonminimal routing should be avoided. To be most

practical, nonminimal routing should be used only to avoid faulty nodes. The one-

 

 

106

 

 

 

 

 

 

 

40 j I I I I I I
35 - minimal --><— .
nonminimal -0—

30 '- .4
Average -
latency 20 -4
(wee)

15 r

10 n

5 -1
0 7 I I I I I I I
0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 7.3. Comparison of the minimal and nonminimal west-ﬁrst routing algorithms
for uniform trafﬁc. '

 

 

 

 

 

 

40 I I I I I I
35 " minimal 9(— .
nonminimal '0-

30 r -‘
Average - .-
latency 20 - < -:
(11m)

15 r -

10 - '1

5 -1
0 I I I I I I I
0 5 10 15' 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 7.4. Comparison of the minimal and nonminimal north-last routing algorithms
for uniform trafﬁc.

 

107

 

 

 

 

 

 

 

40 I I TI I I i I
35 -' minimal 9(— .
nonminimal -O—
30 L- one-fault-tolerant -k— -
25 ..
Average
latency 20 q
(#sec)
15 -
10 q
5 -
0 I I .' J L I I I
0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 7.5. Comparison of the minimal and nonminimal negative-ﬁrst routing algo-
rithms for uniform trafﬁc.

 

   
  
 
    
     

40 I I WI. I

35 - minimal *-

1 misrouting @—

30 - 2 misroutings D—
3 misroutings A-

00 misroutings -o—

 

 

 

 

Average
latency 20

(psec) 15

10

 

 

    

I I I I I I

0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

 

Figure 7.6. Comparison of different amounts of misrouting in the negative-ﬁrst routing
algorithm for uniform trafﬁc.

 

108

fault-tolerant version of the negative-ﬁrst routing algorithm uses nonminimal routing
only to avoidnodes that have failed or that may fail and prevent packet delivery. In
the remaining simulations, this algorithm is compared to the minimal and nonminimal
versions of the negative-ﬁrst algorithm.

Figures 7.5 and 7.7 show the results for uniform and matrix-transpose trafﬁc. At
low throughputs, nonminimal routing has the lowest latencies, minimal routing has
slightly higher latencies, and fault-tolerant routing has the highest latencies. At high
throughputs, minimal routing has the lowest latencies, fault-tolerant routing has the
next highest latencies, and nonminimal routing has the highest latencies. Thus, there

is a cost in performance for fault-tolerance. Figure 7.6 suggests that the cost is largely

due to the fault-tolerant algorithm allowing each packet one misrouting.

 

40 I I I I I I I

35 - minimal *— ..
nonminimal -9--
30 P onefault-tolerant d— .

Average25

latency 20

 

 

 

 

 

 

0 5 10 15 20 25 30 35 40

Percent of maximum theoretical throughput

Figure 7.7. Comparison of the minimal and nonminimalnegative-ﬁrst routing algo-
rithms for matrix-transpose traﬂic.

 

CHAPTER 8

Summary and Conclusions

For electronic computers to continue improving in performance far into the future,
computer architectures must become massively parallel. A promising architecture
involves interconnecting thousands of processors and other components using direct
networks and wormhole routing. One. obstacle to this approach is designing the best
algorithms for routing message packets through the direct networks.

This dissertation presents a new model for designing routing algorithms. Unlike
previous models, which are largely based on adding virtual channels, the new model
is based on analyzing the directions in which packets can turn in a network. This turn
model explains why most routing algorithms are deadlock free. More importantly, it
systematically designs routing algorithms that are maximally adaptive for a network.
For networks without virtual channels, the turn model produces routing algorithms
that are partially adaptive. For networks with virtual channels, the turn model can
produce routing algorithms that are fully adaptive. Adaptiveness is important in a
routing algorithm because it reduces the effects of hot spots, unfair channel allocation,
and faults. The turn model also designs routing algorithms that are ideal for fault
tolerance, being minimal or nonminimal and livelock free, in addition to maximally
adaptive. Based on a simple model for fault tolerance, the turn model designs a

one-fault-tolerant routing algorithm for two-dimensional meshes, without adding any

109

 

110

virtual channels.

This dissertation also presents the results of numerous simulations. Simulations
of different input selection policies show the distance-travelled policy to perform the
best, implying that it is more important for a policy to reduce contention than to
provide fairness. Simulations of different output selection policies show the no-turn
policy to perform the best, implying that it is more important for a policy to reduce
contention than to increase adaptiveness. Simulations of different trafﬁc patterns and
routing algorithms show the nonadaptive dimension-order routing algorithm to per-
form the best for uniform trafﬁc and more adaptive routing algorithms to perform
better for nonuniform trafﬁc. The implication is that spreading trafﬁc evenly across
the channels of a network is very important for good performance by a routing al-
gorithm. Three ways are suggested to make partially adaptive routing algorithms

spread trafﬁc more evenly.

8.1 Implementation Considerations

Whether the improvements in performance expected from the turn model are realiz-
able in real systems depends on two things: whether real trafﬁc patterns are highly
nonuniform and whether the routing algorithms can be implemented in efﬁciently.
The latter factor is the topic of this section.

Adaptive routing requires that more or larger header ﬂits be present in a router at
one time, increasing communication latencies. While nonadaptive routing decisions
can be made based on the distance remaining in a single dimension, adaptive routing
decisions must be based on the distance remaining in more than one dimension, if
not all dimensions. For dimension ordering, the address of the destination node can
be stored in a series of header flits, one or more for each dimension. Only the ﬁrst

header ﬂit is required for each routing decision. When it is no longer needed, it can

 

111

be discarded and the next ﬂit used. For adaptive routing, the best approach may be
to place the least significant bits of each dimension in the first header ﬂit, the next
least significant bits in the second header ﬂit, and so forth. For example, the ﬁrst ﬂit
might contain bits 0 to 3, the second store bits 3 to 6, the third store bits 6 to 9,
and so forth. Then, when the distances in the first ﬂit are reduced to zero, the ﬂit
is discarded, the distances in the second ﬂit are decremented by one, and a new ﬁrst
ﬂit is generated by the router.

Adaptive routing also requires more complex control circuits. More information is
involved in the routing decision, and the decision rules are more complex. The extra
control hardware means more expense and possibly longer decision times. How much
longer decision times are depends on the specifics of the direct network and routing
algorithm. Basing the decision on more information does not necessarily require more
time. It is likely, however, that the complexity of the decision rules increases decision
times by a few clock cycles.

Finally, adaptive routing may increase the time for reassembling the packets of a
message. Nonadaptive routing algorithms usually prevent the packets in a message
from passing each other on their way through a network, because all of the packets
follow the same path. Adaptive routing, on the other hand, makes no guarantee that
packets will arrive at the destination in the order they are sent. Thus, reassembling
a message may require sorting. There may have to be more packets for a message,

too, since each packet must carry information about its proper order.

8.2 Future Work

There are many possibilities for future work. One of the most important is identi-
fying realistic tramc patterns, so that the results of future simulations can be more

meaningful. Another is exploring the efficiency with which the routing algorithms

 

112

produced by the turn model can be implemented in hardware.

There are many more input and output selection policies to be explored, too, es—
pecially ones based on additional information. For example, an input selection policy
that assigns packets priorities based on the distance travelled, minus the distance
remaining and the packet length, might release more channels more quickly. Also, an
output selection policy that waits a short period before selecting a nonpreferred out-
put channel, especially a channel that leads away from the destination, may improve
performance.

The direct networks simulated in this dissertation had one buffer for each input
channel. Another topic for research is the impact on the performance of adaptive
routing algorithms by queues of buffers of different sizes. Preliminary research shows
that infinitely large queues can keep at most an additional quarter of the channels
busy in a k x k mesh.

The turn model has been applied to n-dimensional meshes and k-ary n-cubes in
this dissertation, but it can also be applied to other topologies, such as hexagonal
networks, octagonal networks, cube-connected cycles, and express cubes [30]. In such
topologies, the turns are not necessarily 90-degrees and the simple cycles are not
necessarily formed by four turns. The turn model should again produce interesting
and useful routing algorithms.

So far, the research has involved only unicast communication, in which each mes-
sage has only one destination. The question is whether the turn model can help
design routing algorithms for multicast and broadcast communication, in which each
message can have multiple destinations. The turn model may be applicable to the
multicast star model of multicast communication [15].

Many of the routing algorithms designed by the turn model can be described
strictly in terms of routing packets first in one group of directions and then in another.

Some examples are given in Table 8.1. They suggest a second model for designing

113

routing algorithms, one based on grouping channel directions into routing phases.
This direction-based model can never completely replace the turn model, however.
Many of the algorithms designed by the turn model, such as six of the nine for three-
dimensional meshes, cannot be described in terms of grouping channel directions into
phases, and would have been nearly impossible to design without the turn model. The
proofs of deadlock freedom for individual routing algorithms suggest a third model for
designing routing algorithms. The proofs rely on numbering individual channels, and
the model suggested is based on grouping individual channels into routing phases.
Whether the two suggested models can be developed to the point that they are
guaranteed to produce routing algorithms that are deadlock free, maximally adaptive,
minimal or nonminimal, livelock free, and fault tolerant is an open question. If they

can, they may prove to be simpler to apply than the turn model.

Table 8.1. The directional phases of some routing algorithms.

routing ' ° ' in
—0, +0 —1 +1
west- —0 —1,+0,+1
've- —0, -1 +0, +1

-0,—1,+0 +1
-l,-2,...,—n-l —n,+1,+2,
—l,-2,...,—n-1,—n +1,+2,...
-1—2,... —n-1,-—n +1 +2,...

 

 

BIBLIOGRAPHY

BIBLIOGRAPHY

[l] S. B. Borkar, R. Cohn, C. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam,
B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski,
and J. Webb, “iWarp: An integrated solution to high-speed parallel computing,”
in Proceedings of Supercomputing ’88, pp. 330—339, Nov. 1988.

[2] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, “The
directory-based cache coherence protocol for the DASH multiprocessor,” in Proc.

of the 17th International Symposium on Computer Architecture, pp. 148—159,
May 1990.

[3] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz, “APRIL: A processor ar-
chitecture for multiprocessing multiprocessor,” in Proc. of the 17th International
Symposium on Computer Architecture, pp. 104-114, May 1990.

[4] W. C. Athas and C. L. Seitz, “Multicomputers: Message-passing concurrent
computers,” IEEE Computer, pp. 9—25, Aug. 1988.

[5] Intel Corporation, A Touchstone DELTA System Description, 1991.

[6] C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele, and
W. K. Su, “The architecture and programming of the Ametek Series 2010 mul-
ticomputer,” in Proceedings of the Third Conference on Hypercube Concurrent
Computers and Applications, vol. 1, (Pasadena, CA), pp. 33—36, Association for
Computing Machinery, Jan. 1988.

[7] H. Ishihata, T. Horie, S. Inano, T. Shimizu, S. Kato, and M. Ikesaka, “Third-
generation message-passing computer AP1000,” in International Symposium on
Supercomputing, Kyusyu, pp. 46-55, nov 1991.

[8] W. J. Dally, ‘_‘The J-machine: System support for Actors,” in Actors: K nowledge-
Bascd Concurrent Computing (Hewitt and Agha, eds.), MIT Press, 1989.

[9] NCUBE Company, NCUBE 64 00 Processor Manual, 1990.

114

 

115

[10] W. J. Dally, “Performance analysis of k-ary n-cube interconnection networks,”
IEEE Transactions on Computers, vol. C-39, pp. 775-785, June 1990.

[11] W. J. Dally and C. L. Seitz, “The torus routing chip,” Journal of Distributed
Computing, vol. 1, no. 3, pp. 187-196, 1986.

[12] P. Kermani and L. Kleinrock, “Virtual cut-through: A new computer commu-
nication switching technique,” Computer Networks, vol. 3, no. 4, pp. 267—286,
1979.

[13] J. Y. Ngai and C. L. Seitz, “A framework for adaptive routing in multicomputer
networks,” in Proceedings of the 1989 ACM Symposium on Parallel Algorithms
and Architectures, 1989.

[14] L. M. Ni, “Communication issues in multicomputers,” in Proceedings of the First
Workshop on Parallel Processing, (Hsinchu, Taiwan), pp. 52-64, Dec. 1990.

[15] X. Lin and L. M. Ni, “Deadlock-free multicast wormhole routing in in multicom-
puter networks,” in Proceedings of the 18th Annual International Symposium on
Computer Architecture, pp. 116-125, May 1991.

[16] W. J. Dally and H. Aoki, “Adaptive routing using virtual channels,” tech. rep.,
Massachusetts Institute of Technology, Laboratory for Computer Science, Sept.
1990.

[17] L. M. Ni and P. K. McKinley, “A survery of routing techniques in wormhole
networks,” Tech. Rep. MSU-CPS—ACS-46, Dept. of Computer Science, Michigan

State University, East Lansing, Michigan, Oct. 1991.

[18] H. Sullivan and T. R. Bashkow, “A large scale, homogeneous, fully distributed
parallel machine,” in Proceedings of the 4th Annual International Symposium on
Computer Architecture, pp. 105-124, Mar. 1977.

[19] W. J. Dally and C. L. Seitz, “Deadlock-free message routing in multiprocessor
interconnection networks,” IEEE Transactions on Computers, vol. C-36, pp. 547-
553, May 1987.

[20] W. J. Dally, “Virtual channel ﬂow control,” in Proceedings of the 17th Annual
International Symposium on Computer Architecture, pp. 60—68, May 1990.

[21] C. R. Jesshope, P. R. Miller, and J. T. Yantchev, “High performance communi-
cations in processor networks,” in Proceedings of the 16th Annual International
Symposium on Computer Architecture, pp. 150—157, 1989.

116

[22] J. T. Yantchev and C. R. Jesshope, “Adaptive, low latency, deadlock-free packet
routing for networks of processors,” in IEE Proceeding, Pt. E, vol. 136, pp. 178-
186, May 1989.

[23] D. H. Linder and J. C. Harden, “An adaptive and fault tolerant wormhole routing
strategy for lc-ary n-cubes,” IEEE Transactions on Computers, vol. 40, pp. 2—12,
Jan. 1991.

[24] W. J. Dally, “Fine-grain message passing concurrent computers,” in Proceedings
of the Third Conference on Hypercube Concurrent Computers and Applications,
vol. 1, (Pasadena, CA.), pp. 2—12, Association for Computing Machinery, Jan.
1988.

[25] C. J. Glass and L. M. Ni, “The turn model for adaptive routing,” in Proceedings

of the 19th Annual International Symposium on Computer Architecture, pp. 278-
287, May 1992.

[26] C. J. Glass and L. M. Ni, “Adaptive routing in mesh-connected networks,” in
Proceedings of the 12th International Conference on Distributed Computing Sys-
tems, pp. 12—19, June 1992. l

[27] C. J. Glass and L. M. Ni, “Maximally fully adaptive routing in 2D meshes,”
in Proceedings of the 213t International Conference on Parallel Processing, Aug.

1992.

[28] S. Konstantinidou, “Adaptive, minimal routing in hypercubes,” in Proceedings

of the 6th MIT Conference: Advanced Research in VLSI, pp. 139-153, 1990.

[29] H. D. Schwetman, “CSIM: A C-based, process-oriented simulation language,”
tech. rep., 1985.

[30] W. J. Dally, “Express cubes: Improving the performance of k-ary n-cube inter-
connection networks,” IEEE Transactions on Computers, vol. 40, pp. 1016—1023,
Sept. 1991.