This is to certify that the dissertation entitled "Communication Performance of Multicomputers," presented by Suresh Chittor, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

COMMUNICATION PERFORMANCE OF MULTICOMPUTERS

By

Suresh Suryanarayana Chittor

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1991

ABSTRACT

COMMUNICATION PERFORMANCE OF MULTICOMPUTERS

By Suresh Suryanarayana Chittor

Massively-parallel distributed-memory concurrent computers, or multicomputers, are beginning to dominate the supercomputing arena. The continued success of multicomputers has resulted in an ever increasing demand on their communication performance. In this thesis, we perform theoretical, experimental and simulation studies to understand and improve the communication performance of large multicomputers that use direct networks. We show that 2D/3D mesh networks provide much higher performance than popular hypercube networks when the effect of contention is negligible. The performance improvement is due to higher bandwidth channels, and the ability of the switching technique to efficiently handle communication over long paths. However, in the case of large multicomputers that use mesh networks, contention for network channels can be significant, which will lower the performance. We study the effect of contention for a given mapping of parallel tasks on a set of multicomputer nodes. A metric called path contention level is introduced as a measure of contention and quality of mapping. The results of our studies have been used to implement a tool that helps users predict the effect of contention, identify communication bottlenecks and evaluate the adequacy of a mapping for a given application. We also show that random mapping is not advisable for large multicomputers that use mesh networks. We illustrate in several cases that careful mapping and routing can minimize contention, which in turn can significantly improve communication performance. Theoretical results are supported and complemented by experimental results on a Symult 2010, and simulations that give the performance of large multicomputers not yet available.
We conclude that mesh networks will replace hypercube networks as they can provide better performance under contention-free conditions, and that research interest will shift from minimizing path length to minimizing contention.

Copyright by
SURESH SURYANARAYANA CHITTOR
1991

To my parents Ramamani and Suryanarayana Rao

ACKNOWLEDGMENTS

I owe a lot to my thesis advisor Dr. Richard Enbody for his constant encouragement and support right from the beginning. I am grateful for the many invaluable and long discussions I had with him. Those discussions helped very much in shaping my ideas into this thesis in its present form. Many thanks to Dr. Lionel Ni for his wise and timely suggestions that kept me on the right track during my research. I thank my other guidance committee members, Dr. Esfahanian and Dr. Drachman, for their help, encouragement and guidance. I am grateful to the chairperson, Prof. Wojcik, for all his help, and for providing the much needed financial assistance during my stay at MSU.

I thank the members of the parallel processing research group at MSU. The weekly seminars and discussions with them have been very useful and stimulating. Special thanks to Arun, Jayashree, and Tyan-Shu for their help and encouragement. I thank the system managers for keeping the computer facilities up and running all the time.

I thank my roommates, friends and students at MSU who made my stay at Michigan State University very pleasant and unforgettable. Special thanks to Sateesh, Shashidhar, Keshavchandra, Vibhu, Mohankrishnan, Bharath, and Sanjay.

I am indebted to my wife Aruna for her support, understanding, and patience during many long days and late nights I spent on my research and thesis.

Table of Contents

List of Figures
List of Tables
1 INTRODUCTION
1.1 Concurrent Computers
1.2 Communication Networks
1.3 Direct Network Architecture
1.3.1 Switching Technology
1.3.2 Routing Scheme
1.3.3 Network Topology
1.3.4 Channel Bandwidth
1.4 Network Performance Studies
1.5 Thesis Overview
2 COMMUNICATION PERFORMANCE : CONTENTION-FREE CASE
2.1 Terms and Notations
2.2 Wire Complexity
2.2.1 Wire Density
2.2.2 Wire Length
2.3 Performance Comparison
2.3.1 Contention-free Adjacent-node Communication
2.3.2 Contention-free Non-adjacent-node Communication
2.4 Conclusion
3 COMMUNICATION PERFORMANCE UNDER CONTENTION
3.1 Modeling Communication and Contention
3.1.1 Communication Model
3.1.2 Contention Parameters
3.2 Saturation Path Traffic
3.3 Saturation Node Traffic
3.4 Random Mapping
3.4.1 Path Lengths
3.4.2 Channel Loads
3.4.3 Path Contention Levels
3.5 Optimal Mapping
3.5.1 Binary Tree
3.5.2 Mesh-3D
3.5.3 Hypercube
3.6 Speedup
3.7 Contention Analyzer: A Software Tool
3.8 Conclusion
4 EXPERIMENTS ON SYMULT-2010
4.1 Symult 2010 Architecture
4.2 Experiments
4.2.1 Contention-free Adjacent-node Communication
4.2.2 Contention-free Non-adjacent-node Communication
4.2.3 Communication Under Contention
4.3 Analysis of Experimental Results
4.4 Discussion
4.5 Conclusion
5 NETWORK SIMULATOR
5.1 Introduction
5.2 Simulator Design : Kernel
5.2.1 Kernel Data Structures
5.2.2 Simulation Cycle
5.2.3 Kernel Algorithms
5.3 Simulator Shell
5.4 Simulator Validation
5.5 Simulator Performance
5.6 Simulation Experiments
5.6.1 Small vs. Large Messages
5.6.2 Constant vs. Probabilistic Computation Times
5.6.3 Packetized vs. Unpacketized Messages
5.6.4 Mapping
5.6.5 Arbitration Policy
5.7 Conclusion
6 CONCLUSION AND FUTURE RESEARCH
6.1 Summary and Major Contributions
6.2 Future Research
Bibliography

List of Figures

1.1 Overview of a concurrent computer
1.2 Trends in direct networks
1.3 Common switching techniques
1.4 Two common network topologies
2.1 Linear layout and wire density : An example
2.2 Identity layout of 2D mesh
2.3 Speedup R_e for contention-free non-adjacent-node communication
3.1 An example of a random mapping
3.2 Contention parameters for the mapping example
3.3 A simple contention model
3.4 Effect of contention on communication time for a single path
3.5 Actual path traffic vs. applied path traffic for a single path (V = 2)
3.6 Estimating contention parameters
3.7 Estimating path contention level
3.8 Path lengths, estimated and actual values
3.9 Channel loads, estimated and actual values
3.10 Path contention level, estimated and actual values
3.11 Three basic tiles in modified Youn's scheme
3.12 Mapping 3D mesh on to 2D mesh network
4.1 The architecture of a node of Symult 2010
4.2 Transfer of a single flit over one hop
4.3 Contention-free adjacent-node communication
4.4 Communication time for contention-free adjacent-node communication
4.5 Contention-free, non-adjacent-node communication
4.6 Communication time per message as a function of path length
4.7 Communication under contention for network channels
4.8 Increase in communication time per loop
4.9 Near-neighbor communication patterns
4.10 Base loop for four near-neighbor (4NN) communication patterns
4.11 Components of loop time for 4 Kb messages
4.12 Normalized communication time per message for different system sizes
4.13 Normalized communication time per message for different message lengths
4.14 Actual node traffic for different message lengths
5.1 Overview of the network model
5.2 Main data structures of the simulator
5.3 A simulation cycle
5.4 Interface between shell and kernel
5.5 Outline of the simulation main function
5.6 Matrix-transpose pattern : simulation results
5.7 The simulator performance in cycles per CPU second
5.8 The simulator performance in microseconds per message per cycle
5.9 Effect of message length
5.10 Effect of distribution on computation times
5.11 Effect of packetization
5.12 Effect of random mapping on the network performance
5.13 Effect of careful mapping on the network performance
5.14 FIFO queue policy, 256 node system
5.15 Biased priority queue policy, 256 node system
5.16 Message-source priority queue policy, 256 node system

List of Tables

1.1 Important characteristics of mesh and hypercube topologies
2.1 Ratio of channel bandwidths for the three delay models
2.2 Speedup (R_e) for different system sizes
2.3 Channel width, rate and bandwidth for some commercial systems
3.1 Contention parameters : summary of estimates
3.2 Saturation node traffic for various process graphs for a specific random mapping
4.1 Communication time per loop for near-neighbor patterns in ms

Chapter 1

INTRODUCTION

Rapid progress has been made in concurrent computing technology, and the result is increased interest in this technology. Some important reasons for this interest are

- Feasibility : Large systems are becoming viable and cost-effective due to rapid advances in VLSI technology. Available systems have demonstrated remarkable performance for various applications.

- Supercomputing : Concurrent systems that use more than one computing element are already capable of providing 1-10 Gflops peak performance, and are expected to reach a teraflop speed in the near future.

- Scalability : The performance of concurrent systems is scalable in the number of computing elements. Real systems in the supermini range have demonstrated that the capacity of the systems can be easily increased by adding processor boards. Large systems having hundreds of processors have shown that the performance can be directly proportional to the number of processors used for various scientific applications.

The architecture of concurrent systems varies widely and is changing rapidly, as the technology is far from maturity. There are many research problems and issues that need to be investigated.

1.1 Concurrent Computers

A concurrent computer consists of a set of nodes connected by a communication network (Figure 1.1). Individual nodes may have a processor or memory or both.
In addition, nodes will have communication hardware that will provide the interface to the network.

Figure 1.1: Overview of a concurrent computer

Concurrent computers can be divided into three broad categories depending on the organization of the memory and access to it by various processors.

1. Shared Memory systems : In these systems any processor can access any part of the memory with the same latency. The entire memory is global as it is physically and logically shared. Individual nodes will be either a processor or a memory, each with a communication interface to the network. The size of such systems tends to be small, at most 20-30 nodes, due to limitations of the communication network technology. Sequent's Balance and Encore's Multimax are examples of such commercially available systems.

2. Non-Uniform Memory Access (NUMA) systems : In these systems, as in shared memory systems, any processor can access any part of the memory. There is, however, a difference in access times to various parts of the memory. Each node has a processor and some memory called local memory. Access to the memory of another node, called remote memory, is slower than access to the local memory. The memory is physically distributed, but logically shared. Medium sized NUMA systems are available which have hundreds of nodes. BBN's Butterfly GP1000 [1] or its successor the TC2000 are examples of such commercially available systems.

3. Distributed memory systems or Multicomputers : In such a system, memory is physically and logically distributed. Each processor can only access its local memory. Interaction between processors is by passing messages. The systems are also known as message-passing concurrent computers or multicomputers [5]. Some of the largest parallel systems available belong to this category, such as the NCUBE/1 [48], iPSC/1 [67], Symult-2010 [74], iPSC/2 [4], etc.

Large multicomputers are gaining interest as they already have achieved some of the fastest computing rates, and seem to be the way to design a teraflop computer (a computing rate of 10^12 floating point instructions per second) in the near future. A critical architectural component of all concurrent systems, including multicomputers, is the communication network.

1.2 Communication Networks

The network provides a means for the various nodes in a concurrent system to communicate, exchange information and coordinate their activities. The performance of the communication network is essential to the success of any concurrent computer. Extensive research has explored alternate network architectures over the last ten to fifteen years. Ideally one wants a completely-connected network where each node has a dedicated communication link to every other node. Implementation difficulties, however, prevent one from using such a network. The goal is therefore to design a network that is feasible and cost-effective, and can provide a performance comparable to that of a completely-connected network. Four types of networks have dominated current concurrent systems.

1. High-speed time-shared bus : In such a network a single bus, typically, connects all the nodes in the system [51]. As a result, communication time will be the same for any pair of nodes. The main disadvantage is that the single bus becomes the bottleneck when the number of nodes or the size of the system reaches a certain threshold (20-30 for current technology). For large systems, we have to modify the network. One possibility is to use more than one bus.
Researchers have proposed various ways of using more than one bus, such as hierarchically organized buses [78, 63], multidimensional structures as in Hyperbus [7], and so on. In spite of those modifications, contention for the bus continues to be a major problem with bus-based networks.

2. Crossbar network : Here, a crossbar switch of N^2 cross points is used to connect N nodes to one another [51]. Such a network allows any node to communicate with any other node, and the time taken will be the same for any pair of nodes as in bus-based systems. One major advantage of the crossbar over the bus is that all the nodes can communicate simultaneously without any contention, provided that no two nodes communicate with the same node. However, the cost of a crossbar increases quadratically with the size of the system. In addition, power and pinout problems make a crossbar network unattractive for large systems.

3. Indirect or Multistage Interconnection Network (MIN) : These networks strike a compromise between the price/performance alternatives offered by the bus and the crossbar networks. An N x N MIN network connects N nodes to one another by employing multiple stages of banks of switches in the interconnection pathway. One approach possible with N as a power of 2 is to use lg N stages of N/2 switches, each switch being a 2x2 crossbar. Different ways of connecting successive stages of a network have been proposed, such as the Omega network, the indirect binary n-cube and so on, with some of them shown to be equivalent. One advantage of a MIN is its expandability, since the number of stages increases only as lg N. There are also disadvantages. Contention for paths, such as the hot-spot problem, can degrade the performance significantly. The network cost increases with the system size as the number of switches increases as N lg N. Communication time remains the same for different pairs of nodes, provided there is no contention for resources within the network.

4. Direct or Point-to-point networks : In direct networks, nodes are connected by high-speed communication links. Each node is connected to only a small number of other nodes. Two nodes that are connected directly by a communication link are called adjacent nodes. The communication hardware in each node, called the router, is used to support communication between two non-adjacent nodes. If each node is connected to only a fixed number of other nodes irrespective of the system size, then we have N routers, one in each node, and O(N) total communication links connecting the N routers. The cost of the network increases only as O(N). Direct networks are therefore much simpler to build than either MINs or crossbars. In earlier direct networks, the communication time depended on the relative position of the communicating nodes. Communication between adjacent nodes was much faster than communication between non-adjacent nodes. However, with recent advances in technology, the dependence of the communication time on the position of the nodes has been significantly reduced.

Shared memory systems use a bus-based network or a crossbar network. NUMA systems use MIN networks, whereas direct networks seem to be the choice for multicomputers. The performance of direct networks has dramatically improved recently due to advances in technology, so they have become competitive with other networks. Direct networks can be used for both small and large concurrent systems.
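The cost trends quoted above for the four network types are easy to tabulate. The short sketch below is only an illustration of those scaling rules, not something from the original text: the crossbar crosspoint count (N^2) and the MIN switch count (lg N stages of N/2 two-by-two switches) come directly from the descriptions above, while the assumption that a direct network has a fixed node degree of four (giving roughly 2N links) is ours.

    import math

    def network_costs(n):
        """Rough hardware-cost counts for an n-node system (n a power of two).
        Uses the scaling rules described in the text; the degree-4 direct
        network is an assumption made only for illustration."""
        return {
            "bus (shared medium)": 1,                          # a single shared bus
            "crossbar crosspoints": n * n,                     # N^2 crosspoints
            "MIN 2x2 switches": (n // 2) * int(math.log2(n)),  # lg N stages of N/2 switches
            "direct network links": 2 * n,                     # O(N) links, assuming degree 4
        }

    for n in (16, 64, 256, 1024):
        print(n, network_costs(n))

Even at a modest 1024 nodes the crossbar needs over a million crosspoints, while the direct network grows only linearly with the system size, which is the point made above.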
In this thesis, we study the performance of direct networks when used for message-passing in multicomputers. Such a performance study will be useful even if systems using direct networks are organized as a shared-memory or a NUMA system at the user level.

1.3 Direct Network Architecture

Direct networks have four main characteristics: the switching technique, the routing scheme, the network topology and the channel bandwidth. The switching technique determines how communication takes place between two non-adjacent nodes. The routing scheme decides the sequence of channels, the path, used for communication between any two nodes. The network topology determines the set of adjacent nodes for any given node in the network. The bandwidth of a network channel depends on the width of the channel and the maximum rate at which the channels can transmit bits. The bandwidth will put an upper limit on the communication performance.

The last few years have witnessed changes in all the four main characteristics of a direct network used by a multicomputer. Current series of multicomputers such as the iPSC/2 [4, 28, 66] and the Symult-2010 [74] are called second-generation multicomputers, while earlier systems such as the Cosmic Cube [73], the NCUBE/1 [48], and the iPSC/1 [67] are called first-generation multicomputers [5]. The technology of the networks used by second-generation systems differs significantly from that of the first-generation multicomputers.

Figure 1.2: Trends in direct networks (first-generation multicomputers: store-and-forward switching, fixed routing, hypercube topology, 1-bit wide channels; second-generation multicomputers: circuit switching or wormhole routing, fixed or adaptive routing, 2D mesh or 2D torus topology, 8-32 bit wide channels)

The trends in network characteristics from first to second-generation multicomputers are shown in Figure 1.2. It is necessary to understand the changes in network features, as they have a significant influence on the network behavior and performance. As we see later, changes in one network characteristic influence changes in other network characteristics. We now discuss the four main network characteristics and trends in more detail.

1.3.1 Switching Technology

The three common switching techniques used in current multicomputer networks are shown in Figure 1.3. First generation multicomputers used store-and-forward switching, which is common in wide-area and local-area networks. A message is split into packets if necessary. Each packet is then sent completely to an adjacent node that is nearer to the destination than the source. This adjacent node, called an intermediate node, relays the message to one of its adjacent nodes and the process repeats. Each transmission takes the message nearer to the destination until the message finally reaches it.

Definition 1 Path length is the number of hops a message travels, and is denoted by D.

Path length depends on the routing scheme which determines the channels used for a communication. First generation multicomputers used fixed routing schemes, and path length was equal to the shortest distance between the communicating nodes. The total communication time, t_sf, is proportional to the number of hops. The time taken for one hop, t_h, is a linear function of the message length L, i.e., t_h = a + bL. Here, a and b are system-dependent constants. The fixed overhead is denoted by a, and b denotes the effective message transmission rate.
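As a concrete illustration of this per-hop cost model, the sketch below evaluates t_h = a + bL and the resulting D-hop store-and-forward time, anticipating Equation 1.1 derived next. It is a rough sketch, not code from the thesis; the constants a and b are placeholders of the order quoted below for first-generation machines (about 1000 microseconds and 1 microsecond per byte), and any real system would have its own values.

    def hop_time(msg_len, a=1000.0, b=1.0):
        """Per-hop time t_h = a + b*L in microseconds.
        a: fixed per-hop software overhead (placeholder value),
        b: per-byte transmission cost (placeholder value)."""
        return a + b * msg_len

    def store_and_forward_time(msg_len, hops, a=1000.0, b=1.0):
        """Contention-free store-and-forward time: the whole message is
        retransmitted on each of the D hops, so the cost is D * t_h."""
        return hops * hop_time(msg_len, a, b)

    # A 1 KB message over 1 hop versus 5 hops.
    for hops in (1, 5):
        print(hops, store_and_forward_time(1024, hops), "microseconds")

With these placeholder constants the five-hop transfer costs five times the single-hop transfer, which is exactly the sensitivity to path length that the following derivation formalizes.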
Hence, total communication time is t“ = D * t}, = D(a + bL) = O(DL) (1.1) Processors at each intermediate node are involved in processing messages passing through the node. Software control at each intermediate node slows communication significantly as the number of hops, D, increases. Typical values for a and b for first generation computers are 1000 ps and 1 ps/byte. source intiryiediate nodes destination —path length (D) = 3—* t = +afi-bL-r "__tca __' 3 MRI] header flits E lIlIIIll] m 11 mm“: mm mum] 12 WIDE] E E [1111111] mm d writing a HEIDI] mum "—_- tat "_ twh " .fi time store-and-forward switching circuit switching wormhole-routing Figure 1.3: Common switching techniques The switching technology has changed in second-generation multicomputers. In- stead of store-and-forward switching, circuit switching or wormhole routing is used. Also, each node has dedicated communication hardware to process the messages pass- ing through the node. In circuit switching, a header (one or two words) is sent from the source, and the header reserves channels on its way to the destination node thereby the dc .I,.. f ILAe‘e A establishing a physical communication path between the source and the destination nodes. Once the path is established, it takes the same time to send the message to the destination as it does to an adjacent node, th. Hence the total communication time for a circuit switched network is to, = 01D + th = C1D + a + OL = 0(1) + L) (1.2) Here, 01 is a system dependent constant that represents the time taken per hop to establish the path. One example of a second generation multicomputer that uses circuit switching is iPSC/2 for which 01 is 10—30 as, and a and b are 660 as and 0.36 as respectively [13]. Wormhole routing has also been used in commercial, second-generation multicom- puters [32]. Wormhole-routing is a modified form of virtual cut-through switching proposed earlier for computer networks [54]. It is similar to circuit switching as the header reserves channels establishing a dedicated path for the message. However, words, called flits, of a message follow the header in a pipelined fashion instead of waiting until the entire path is established as was done in circuit-switched networks. The last message word, called the tail flit, automatically releases the channels as it moves towards the destination. This switching appears to be similar to packet switching where the packet size is reduced to one flit. There is, however, an im- portant difference between this level of packet switching and wormhole routing. In packet switching, it is possible for several packets to accumulate in an intermediate node which requires each node to have a sufficient number of buffers for packets and to manage these buffers. In contrast, there is no need for a large number of buffers in wormhole routing. A flit is not sent to a subsequent node unless the previous flit has left that node. Hence, if a message header is blocked due to a reserved channel, then the entire flit stream following it stops progressing, with typically one flit in each intermediate node. Hence, one buffer that can hold a flit is enough per channel per node. Details of wormhole routing and actual implementations can be found in ’n- .- 3)“ ii [37, 77, 32, 40, 35]. For wormhole routing, the communication time needed is twh =a+CgD+b(L-— 1) = 0(D+L) (1.3) Once again, Cg is a system dependent constant that represents the time taken per hop to establish the path. 
Current technology is quite impressive as the lowest values claimed for 01 and b is about 20 us for NDF [77] and 13 ns/byte for iWarp [14] respectively; However, a is still high for currently available systems as the lowest value claimed is 177 psecs for Symult 2010 [75]. Systems are being designed to have low overheads such as the iWarp [14] and the J-machine [29, 30]. Note that all the three equations above for t“, tc, and twh gives the communica- tion time under contention-free conditions. We see that both circuit switching and wormhole routing used in second-generation multicomputers take O(L + D) time in- stead of the O(LD) time required by the store-and-forward switching used in the first-generation multicomputers. Since, L > D, O(L + D) z O(D), and hence com- munication time is not sensitive to D, the distance between the two communicating nodes. However, in both circuit switching and wormhole routing channels need to be reserved. The resulting contention for channels can become a serious problem and affect network performance. Hybrid switching schemes have also been suggested such as Staged Circuit Switching [3] which is similar to virtual cut—through. 1.3.2 Routing Scheme The routing scheme decides the path taken by a message, and is another main feature that can significantly affect the network’s performance. Multicomputers put a tight constraint on the time spent in deciding routing. We need to implement all routing functions in hardware to keep the processing times at an acceptable level. Hence, routing schemes should be simple. At the same time, it should ensure freedom from deadlocks and utilize the channels as best as possible. Routing schemes can be divided into two main classes. 0 Fixed routing schemes : In such schemes, the choice of a path depends only 10 on the communicating nodes. A message between two given nodes always uses the same path. The choice of the path does not change with time or current utilization levels of various network channels. 0 Adaptive routinfichemes : These schemes tend to be smarter as the choice of path depends on the network state or the state of various network channels. The choice of a path taken by a message between a given pair of nodes may change with time. Adaptive routing schemes attempt to avoid already congested channels, strike a balance in the utilization of the channels and improve the communication times. The routing scheme depends on the switching technology used. In first-generation multicomputers which used store-and-forward technology, it was essential to minimize the path lengths. Hence, the routing scheme was designed to select the shortest path possible between the communicating nodes. The routing decision was made by processors in each node by software, and hence the overheads were high. Network channels were never a bottleneck as messages were not injected fast enough to increase their utilization to high levels. Hence, fixed routing scheme were used and were sufficient. Second-generation multicomputers use circuit switching or wormhole routing which, as explained in the previous section, are not sensitive to path lengths. Hence, routing schemes are not constrained to choose the shortest path possible. The routing overheads are significantly reduced as all routing decisions are made by the hardware, and the utilization of the channels can become quite high. Hence, network channels can become a bottleneck as individual communications reserve channels. 
Routing schemes may have to choose a path that reduces contention and balances the utiliza- tion of channels [10, 8]. Simple adaptive routing schemes may increase the network performance appreciably . We see that the challenge is to design a routing scheme that is simple enough to be implemented in hardware, but at the same time good enough to keep the network performance at a high level. Both fixed and adaptive routing schemes have been proposed lb we do not schemes to 1.3.3 I‘ he netwm the set of a need is represen represents important Degree \ 0i 2. iopolc lie diam, Pars of m The me 3068 not i that dial liliemate :15. 42. .3[ taps Slow. 11 proposed that are simple and avoid deadlock for modern networks [36, 60]. However, we do not have much information on the relative performance of various routing schemes to make a correct choice for the routing scheme. 1.3.3 Network Topology The network topology defines the way nodes are connected; the topology determines the-set of adjacent nodes for any given node. It is represented by the graph obtained when each node is defined as a vertex and each physical channel between two nodes is represented by a directed edge between corresponding vertices. The topology is represented by G(V,E), where V is the set of vertices and E is the set of edges. Two important features of a topology are m : The degree of any node is the number of its adjacent nodes. The degree of a topology is the maximum degree of any node, i.e. mmev d(V). Meter: The shortest distance between any two nodes u, v is denoted by I (u, v). The diameter of a topology is the maximum value of the shortest distance over all pairs of nodes, i.e. diameter = maxumev [(11,21). The network topology should have a small number of edges, a fixed degree (degree does not increase with the system size), regular (all vertices have the same degree), a low diameter, symmetric (network looks alike from each node), and fault-tolerant (alternate paths between vertices). Various network topologies have been suggested [16, 42, 50, 64, 71, 43, 38, 53], and compared [2, 56, 70]. Since an increased number of hops slowed communication in first-generation multicomputers, low diameter topolo- gies were suggested and used [27, 39]. The hypercube topology was the most popular (Figure 1.4), which is a member of k-ary n-dimensional topology. In such a topology, the vertices can be numbered by a n—digit k-radix number d = dndn_1 . . . d1, where 0 S d.- < k. Two vertices a = anan_1 ...a1 and b = bnbn-1...b1 are adjacent if a.- = b.- for all 3' except i = i, and 6.- = b; :l: 1. Hypercube topology is obtained when k = 2, and n = lg N. The diameter and the degree of a hypercube are both lg N, and the number of edges increases only as O(N 1g N). The hypercube topology was also shown to have other topologies such as meshes and trees as subgraphs. Even though 12 the degree and the number of edges are reasonable when compared to a completely connected network, they are still quite high for large systems having hundreds or thousands of nodes. The large number of edges results in increased wire densities and pin-outs, making network implementation difficult. Hypercube 2D Mesh Figure 1.4: Two common network topologies In the second-generation multicomputers, the distance between communicating nodes is no longer a serious problem, and hence a low diameter topology is no longer necessary. Simpler topologies such as the 2D mesh or torus (mesh with wrap-around edges) are worth considering. 
Mesh is also a member of Ic-ary n-dimensional topology, for which n = 2, and k = x/N. In case of a 2D mesh, the number of edges increases only as O(N) and the degree of a node is constant, i.e. independent of N. A mesh is therefore easier to implement. However, if we use circuit switching or wormhole routing and a 2D mesh or a 2D torus, then there will be increased contention for network channels as only O(N) of them are available. Some important characteristics are summarized in Table 1.1. 2 . It is also possible to use switching element to increase the flexibility of a chosen topology, or even provide configurable networks i.e. networks whose topology can be changed by setting the active switching elements properly [76, 18, 41]. 2The concept of bisection width, wire density and wire lengths are precisely defined in Section 2.2 b I ’2' ’il ml'ul HI ' 1‘ v—«l L ‘5" . (I) 13 2D Mesh Hypercube Degree + 4,fixed - 1g N, variable No. of edges + 2(N — x/N), O(N) - (ng N)/2, O(N lg N) Diameter - 2N7? — 1), O(x/N) + lg N, O(lg N) Alternate paths + yes + yes Bisection width + \/N,O(\/N) - N/2, O(N) Wire Density + Uniform - Non uniform Wire lengths + Uniform and short , O(l) - Non uniform (+(-) indicates an advantage (disadvantage)) Table 1.1: Important characteristics of mesh and hypercube topologies 1.3.4 Channel Bandwidth The physical bandwidth of the network channels is another important feature that can affect the network performance. The bandwidth depends on the width of the channel, i.e. number of data bits that can be simultaneously transmitted across the channel, and the rate at which bits can be transmitted. The bandwidth puts an upper limit on the communication performance of a direct network. A higher channel bandwidth will not only increase the peak communication rates, but also reduce the possibility of individual channels getting saturated. High bandwidths may ensure that the channel will never become a bottleneck and avoid the need for adaptive routing schemes. The maximum bandwidth possible depends on the network topology. A topology that has more edges is difficult to implement as it will increase the wire density. Hence, width of channels has to be reduced if we use a richer topology. Hypercube networks typically use bit-serial lines to keep the wire complexity to a manageable level. However, if a simpler topology such as the 2D mesh is used, then it is possible to increase the width of the channels to 8 or even 32 bits wide, and operate them at faster rates as the length of the channels will be small. Mesh networks can therefore can be designed to have channels with much higher bandwidths than hypercubes. The tradeoff between simpler, high-diameter networks that have wide, fast channels and complex, low-diameter networks that have slower channels is not clear, and needs to be studied. acre: proce 14 In addition to the changes in network switching technique, network topology and network channel bandwidth, the router or communication coprocessor is becoming increasingly complex and that in turn is minimizing the involvement of the node processor in handling communication. It is essential that all these developments are complemented by sufficient performance studies to clearly understand the issues, to achieve highest possible performance, and to ensure that the communication network does not become the bottleneck. Efficient communication performance in turn will help us to achieve efficient computation, i.e. 
actual computation rates close to the maximum possible rates of the system. 1.4 Network Performance Studies The discussion in the previous section indicates that the choices of the four main features of a direct network are not independent of one another. A change in one network feature will have influence on the choice for another network feature. Such a dependence makes network design difficult. Rapid advances in network technology and changes in individual network features further complicates the design decisions. A detailed understanding of the dependencies and the effect of the choices on the network performance is essential to design a good network. Appropriate performance studies help both the system designer and the user. The studies will help the system designer in understanding the drawbacks of the current architecture, and to come up with better network designs for future systems. The studies also help a user to get a better and deeper understanding of the factors that will degrade the communication performance when his application is run on modern multicomputers. Such an understanding will enable him to get the best possible performance on a given machine for his application. In this section, we briefly review the performance studies done so far and explain why there are still many questions about the performance for which answers are not clear. The discussion also explains the motivation for our thesis. Researchers have attempted to compare the performance of different switching c 31]}: second- cuit 5w: efect. 1 mp 41 a. 'L'll * 15 techniques suggested for direct networks of multicomputers. Simulation studies have been made considering small hypercube systems [46, 26, 49, 15]. Results show that adaptive circuit switching and wormhole routing perform well for small systems hav- ing less than a hundred nodes. The studies do not consider larger systems. Re- cently, experimental results are available about the performance of iPSC/ 2 [13], a second-generation hypercube multicomputer that uses Direct-connect (a type of cir- cuit switching) technology [66]. The results show that path lengths have negligible effect, and that the communication rate is about 5 times better than that of the iPSC / 1. The results also show that contention for network channels can be a serious problem. The experiments were run on a small system having less than a hundred nodes. Fixed routing schemes have been suggested that make the wormhole-routed net- works deadlock-free. A general scheme has been suggested based on virtual channels which avoids deadlocks [36]. All channels are appropriately numbered and the fixed routing scheme is designed such that it always routes a message from a higher num- bered channel to a lower numbered channel. Circular waits are thus avoided, making the network deadlock-free. The scheme can also be used for circuit-switched networks. The fixed routing scheme, however, fails to utilize all the channels and can result in increased contention for network channels. Virtual channels have been suggested to improve the performance under contention [34]. Virtual channels have also been used to provide adaptive routing schemes [52, 60]. A new adaptive routing scheme has been suggested for which network traffic is shown to saturate at 80% network capac- ity instead of 40% capacity if we use fixed routing [65]. Multicast communication that reduce network traffic has been suggested and evaluated [59, 58]. 
Recently, different network topologies have been compared considering their suit- ability to implementation [69, 62, 31, 33]. Different studies show that the wire density is an important implementation constraint. Hence, the network topologies have been compared assuming constant wire density [31, 33]. The studies clearly indicate the advantage of simpler networks such as the 2D mesh or torus over richer topologies such as the hypercube. Simpler networks can be designed to have higher bandwidth 1, .16 . SiCEf . snow ‘0 911,11 :LQMA 0i 31C 16 channels for a given wire density than rich networks. Higher bandwidth channels result in higher communication rates between adjacent nodes. Dally has recently compared various members of k-ary n-dimensional torus networks, which include 2D torus and hypercube, to determine the best dimension for a given network size [33]. The studies assume wormhole routing and uniformly distributed messages, and con- sider the effect of path length and the difference in channel bandwidths. The results show that the best dimension is 2 or 3 for systems up to ten thousand nodes. All these studies have resulted in a 2D or 3D mesh or torus topology being used for networks of modern multicomputers in place of a hypercube network used earlier. High bandwidth channels of simpler, high-diameter topology networks enable them to provide higher performance if contention is negligible. Smaller number of channels and fixed routing, however, may result in increased contention and reduced performance. A recent study by Dally shows that 2D mesh performs better than hypercube even if we consider the effect of contention for the network channels for system sizes up to 4K nodes [33]. The theoretical study has been supported by simula— tions. The results hold only for uniformly distributed communication; an assumption made in the studies. Even though uniformly distributed messages give a common ground to compare different topologies, in reality it is difficult to achieve. Traffic will be non-uniform. Some nodes will suffer from contention more than others, and they can degrade the overall performance of an application. Hence, it is necessary to study the nonuniformities in network traffic and the effect of such nonuniform traffic on the network performance. Current results also indicate that contention may become a serious problem for large systems [13, 33, 10]. Parallel algorithms are designed to be scalable. Hence, an individual node running a parallel task will have the same message injection rate irrespective of the system size. The network traffic [33], which is inversely propor- tional to the path length, will then increase with the system size. Results show that communication time increases exponentially once the network traffic reaches a certain threshold. Hence, we can expect a saturation effect at lower message injection rates for larger systems. Large systems are not yet available, and we do not have a clear 10 svste the last a 1961:] II a At L 35 com: (7:13.11: at a ree 17 idea of the message injection rates at which the saturation effects will be visible for a given communication pattern and system size. The performance also depends on mapping—the way parallel tasks are assigned to system nodes. 
The problem of mapping has been explored by many researchers in the last decade in order to find the effect of mapping and to find the best mapping [9, 6, 12, 44, 72] Most of these studies have concentrated on minimizing path lengths as communication time increased linearly with path length. For a given mapping, communication may be appropriately routed to ensure utilization of channels remain at a reasonable level, and none of the channels are becoming a bottleneck [8, 11]. It has also been claimed that the mapping problem no longer exists in second generation multicomputers [75]. The claim is based on the observation that the path length effect is negligible, and that the peak node injection rates are limited. Even though the claim seems to hold for current, small systems, it is not clear if random mapping is sufficient for large, future systems having hundreds (or thousands of nodes. Future systems can also have higher message injection rates as a result of improved node architecture and reduced fixed overheads per message. Mapping will decide the exact communication pattern for a given application and the non—uniformity in network traffic, and random mapping may result in saturation of network channels. A careful mapping and routing may balance the utilization of channels and network traffic, and hence avoid saturation of channels and the consequent exponential increase in communication times. It is necessary to study how non-uniformity depends on the mapping and how it increases with the system size. We find that most of the performance studies done so far are from the system designer perspective, and help the system designer to make correct design choices. The performance studies do not give sufficient information for the user to run his ap- plications well on the second-generation multicomputers, and ensure the best possible communication and overall performance. 18 1 .5 Thesis Overview We explained in Section 1.3 that direct communication networks used in second gen- eration multicomputers are quite different from that used in first generation multi- computers. The discussion clearly showed that the choices of network features are dependent on one another, which makes network design difficult. In addition, changes in one network feature will influence changes in other network features. Section 1.4 explained that performance studies of modern networks are still inadequate and that many questions related to network performance remain unanswered. This thesis is an effort to improve our understanding of the performance of communication networks used in second generation multicomputers. Results from our studies will complement the results already available. In particular, we investigate the following important research issues about the network performance. 1. Performance of large, mesh-connected networks: Second generation multicom- puters are using 2D / 3D mesh or Torus networks instead of hypercube network. We do not have much information about the performance of mesh-connected wormhole- routed networks. The impact of the transition from hypercube networks to 2D or 3D mesh networks on the user is also not clear. Currently only small second generation multicomputers having less than a hundred node are available. Larger systems are expected in the near future. 2. Contentien end Non-nniform fI'taffic: It is not clear how the contention problem scales up with the system size, and whether the effects of contention will be more easily visible in larger systems. 
We need to be able to model and predict the effect of contention on network performance. The results will be useful to a user of second- generation multicomputers in ensuring the best possible communication performance. Non-uniformity in traffic also needs to be modeled and analyzed to understand the variation in performance, and to predict the performance of the worst-case node. 3. The Mapping Problem: For second-generation multicomputers, minimizing path lengths no longer seems to be important, and random mapping seem to be good enough. Contention for network channels depends on the mapping, and hence 19 a. mapping can affect the communication performance if the effect of contention is visible. We may need to redefine the mapping objective based on contention rather than on path length. It would be useful for a user to have a quick way of deciding if random mapping is sufficient for a problem at hand. The three issues are related as we see later. The thesis is organized as follows. In Chapters 2 and 3, we study theoretically several issues that affect the communica- tion performance. Theoretical studies provide a rough estimate of the performance. However, it is difficult to develop exact models to accurately predict the performance for many cases. Experimental and simulation studies are needed, which also help in validating the theoretical results. The theoretical studies are complemented by experimental studies in Chapter 4, and by simulations in Chapter 5. In Chapter 2 we study the improvement in performance provided by simpler, high-diameter topology networks over richer, low-diameter topologies [19, 23, 17]. We do this by comparing the performance of two typical networks: hypercubes which represent rich topologies that have been quite popular and 2D mesh networks that represent a simple topology highly suitable for implementation. We take into consid- eration the difference in bandwidth, the effect of the switching technique, path lengths and contention for network channels. Results show that mesh networks are capable of O(N log N) improvement in performance. Path lengths are negligible as long as message lengths are much longer than path lengths. However, increased contention for network channels reduces the performance, and may take away the advantage of mesh networks. We study the problem of contention for network channels for a given mapping in Chapter 3 [24, 20, 22]. We model the contention for network channels, and the nonuniformity in traffic in terms of path contention level and relate mapping to con- tention. We also show that the effect of contention is to put an upper bound on the rate at which messages can be injected into the network by a node—measured by node traflic. For a given communication pattern and mapping we show how to com- pute the saturation node traffic. We also analyze the case of random mapping, and show that random mapping may not be advisable for large systems having hundreds 20 or thousands of nodes that use mesh networks. Careful mapping and routing will reduce contention and nonuniformity in traffic and increase the saturation levels for node traffic. We illustrate the advantage of a careful mapping in the case of several common communication patterns. We also develop a software module based on the theory which computes the upper bound on node traffic, and nonuniformities for a given communication pattern, and mapping. 
In Chapter 4, we describe several experiments we conducted on a second- generation mesh-connected multicomputer, and analyze the results [21]. The exper- imental results demonstrate the performance offered by the modern mesh-connected wormhole-routed multicomputers at the user level. The results confirm that the effect of path lengths are negligible, and that the effect of contention may no longer be neg- ligible as suggested by theoretical studies. The experimental results agree well with the theoretical results, and hence validate theoretical analysis in the earlier chapters. Analysis of the the results indicates that high, fixed overheads and smaller system sizes are currently limiting the effect of contention, and that the contention prob- lem will be more serious in future, larger multicomputers. The studies identify other architectural drawbacks in current networks, and suggest improvements needed in future networks. In Chapter 5, we describe a high-performance simulator that can be used to study the performance of large future multicomputers not yet available [25]. We discuss the design of the simulator to show that the design is flexible and efficient. The simulator implements all the essential features of a wormhole-routed network, but can still simulate large networks in reasonable time. Several simulation experiments were done to investigate questions about the network performance. The results of the experiments are presented. The experiments demonstrates the usefulness of the simulator. The results agree with the theoretical as well as the experimental results given in earlier chapters, thus assuring the credibility of our results. Finally we conclude our thesis and suggest some interesting research problems. Chapter 2 COMMUNICATION PERFORMANCE : CONTENTION-FREE CASE In this chapter, we analyze the simpler case of contention-free communication. We as- sume that there is no contention for network channels or any other network resources. The communication time then depends on the bandwidth of network channels, path length and the switching technique used. The channel bandwidth depends on the network topology, so communication performance depends on the network topology. We illustrate this dependency by analyzing the improvement in performance provided by 2D mesh networks relative to hypercubes. The analysis can be easily extended to evaluate the performance of other networks such as the 3D mesh. We first introduce our notation in Section 2.1. In Section 2.2, we describe how implementation constraints can be taken into account in the analysis. The perfor- mance of mesh networks relative to hypercubes is analyzed in Section 2.3. Finally, in Section 2.4 we summarize the results of this chapter. 2.1 Terms and Notations We already introduced the notations L, D, N, and rm in previous chapters which represent message length, path length, number of nodes in the system, and com- munication time per message respectively. In addition, we introduce the following terms and notations that will be used later in the analysis. We make the following assumptions about the network which hold for most implementations. o All network channels are identical and have the same bandwidth. 21 22 o The transmissions are error-free. Definition 2 Channel width, denoted by W, is the number of bits that a physical channel can transmit simultaneously between two adjacent nodes. 
Definition 3 Channel rate, denoted by R, is the peak rate in bits/sec at which bits can be transferred over the individual lines of a physical channel, and is equal to the reciprocal of the time delay to transfer one bit. Definition 4 Channel bandwidth, denoted by B, is the peak rate in bits/sec at which data can be transmitted over a physical channel between adjacent nodes. Channel bandwidth is equal to product of channel width and channel rate, i.e. B = W x R. Definition 5 Message aspect ratio, denoted by A, is the ratio of the message length to the channel width, i.e. A = L/ W. Definition 6 Header length, denoted by H, is the length of a message header (gener- ated and used by the communication network) in number of bits. For current systems, H is typically 1—4 bytes. Definition 7 Network latency, denoted by n, is the time from the instant the first bit of a message leaves the source PE to the instant the first bit reaches the destination PE. Definition 8 Network message time, denoted by TM, is the time from the instant the first bit of a message arrives at the destination PE. to the instant the last bit of a message arrives at the destination PE. Definition 9 Network delay, denoted by Tn, is the total network delay for a message. We have 1“,, = 1'; + TM. Communication time per message, rm, will be sum of network delay and the processing delays per message at the source and the destination nodes. Since, we are not interested in studying the node architecture and its effect on communication 23 time, we assume that communication time per message is equal to network delay, i.e. Tm = 7'". However, fixed overheads at the source and the destination nodes can be accounted for by adding them to the computation time between messages. Definition 10 Effective Bandwidth, denoted by E, is the rate at which a message can be transmitted between the source and the destination; it is equal to the ratio of the message length to the network delay, i.e. E = L/rn. Any parameter with a superscript h(m) indicates its value for a hypercube (2D mesh) network. Definition 11 Channel bandwidth ratio, denoted by R5, is the ratio of the channel bandwidth of a mesh network to that of a hypercube network of the same size, i.e. R. = Bm/B". Definition 12 Effective bandwidth ratio or Speedup, denoted by Re, is the ratio of the efl'ective bandwidth for mesh network to that of a hypercube, i.e., Re = Em/E". Note that effective rates and speedup depend on the specific case under consider- ation. We use the speedup as the metric to compare the performance, and show how to derive the speedup for various cases in later sections. 2.2 Wire Complexity A major implementation problem is the wires that connect various components of a system. In a large concurrent system, most of the wires will be the communication wires that connect different nodes of the system. The complexity of the connection is limited by the maximum wire density possible. The communication rates are limited by the wire lengths, and the majority of the power consumed by the system is used to drive the wires. Concurrent systems are therefore wire-limited [33, 31]. Any reasonable analysis of network performance should account for the wire complexity. Recently, several researchers have suggested ways to account for the wire complexity in theoretical analysis of the performance of communication networks of concurrent Systems [62, 69, 31, 33]. 24 2.2.1 Wire Density First, we should be able to relate wire density to topology and system size, and measure wire density. 
The following definitions help us to quantify and define a metric for the wire density. Definition 13 A linear layout of a concurrent system is a I-to-I mapping of the N nodes onto the integers 0—(N -1) A linear layout in which node i is mapped onto integer i is called the identity layout. Definition 14 For a given linear layout and k, cut width is the number of edges connecting a processor numbered less than or equal to k, to a processor numbered higher than k, and is denoted by n(k). Definition 15 The bisection width of a topology, denoted by m, is the minimum value of 17(N/2) taken over all possible linear layouts of the given topology. Definition 16 Bisection density is the product of bisection width and the width of a channel i.e m, x W. Bisection density denotes the minimum number of wires needed to connect any two halves of a system. Researchers have proposed bisection density as a measure of wire density [31, 33]. Hence, different network designs can be compared by assuming equal bisection density. Here, we use a more general metric, peak density defined similar to bisection density. Definition 17 The peak width for a given linear layout is equal to the maximum value of cut width, i.e. m k; n(k). The peak width of a topology, denoted by 17",, is the minimum peak width over all possible linear layouts. Definition 18 Peak density is the product of the peak width and the width of a chan- nel, i.e 17m x W. 25 We denote the value of k which gives the maximum cut width by km. We can compare different topologies assuming same peak densities. We then find that the width of a channel will be inversely proportional to the peak width of the network topology. a I Peak Width Bisection Width 8-Nodc Hypercube A linear Layout of 8-node Hypercube Figure 2.1: Linear layout and wire density : An example Figure 2.1 shows the identity layout of an 8-node hypercube. The bisection width and the peak width are 4 and 5 respectively for the given layout. For an 8-node mesh (4x2), the values will be 2 and 3 respectively for an identity layout. For a given topology, the peak width is a function of the system size. We now derive peak width for hypercube and 2D mesh. Let us consider the hypercube topology first. The dimension of the hypercube is denoted by n so N = 2". Let ni(k) denote the cut width if we consider only the 2"" dimension edges of the hypercube so that We) = ink) on i=1 It has been shown that the identity layout gives the minimum cut width for the hypercube topology [47], but the value of nm is not derived for the identity layout. Before we can derive the value of nm for the identity layout, we have to find the value of km, value of k for which 17,. is maximum (2 nm). For the identity layout of an N -node hypercube topology, observe the value of ni(k) as It increases. We find that 111(k) increases from 1 to 2""1 and then decreases on "i .u .r. a .- F1: n. the ra lemr (“4 by. Jub but 26 to 0, and the cycle repeats. We can state this observation precisely as n‘(k) = min(k + 1,2i — 1 — k) where k = (k tnod 2‘) and 0 S k S N — 1 (2.2) Hence, n‘(k) = 7: +1 over the range 0 S k < 2i/2, and n‘(k) = 2‘ — 1 - k over the range 2‘/2 S k < 2‘. Lemma 1 The cut width for an identity layout of hypercube is symmetric if we leave out the last value 17(N — 1), i.e. n(k) = n(N — 2 - k) for 0 S k S N — 2. The symmetric property also holds for edges of each dimension, i.e. 17‘(k) = n‘(N — 2 — k)for1SiSn andOSkSN—2. Proof: We prove the lemma by induction on the dimension of the hypercube. 
Eagle : For n = 1 (N = 2), we have the cutwidth values 17(0) = 171(0) = 1 and 77(1) = 171(1) = 0 which satisfy the lemma. Hypothesis : The lemma holds for an (n — 1)-dirnension hypercube, i.e n‘(k) = 17"(2"'1 — 2 — k) for 0 S k S 2'”1 — 2, and hence 17(k) = 17(2"'1 — 2 — k). Indnetign : Consider n-dimension cube, N = 2". First, consider the highest, i.e. nth dimension edges. We can show that n"(k) satisfies the symmetric property. From Equation 2.2, we have for 0 S k S N - 2 n"(k) min(k +1,N -1-— k) as k=(kmed2")=(kmedN)=k and 2"=N = min(N -— 1 — k, k + 1) ; after swapping the two terms = min((N — 2 — k) + 1), N — 1 — (N — 2 — k)) ; after rewriting the terms = 1;“(N — 2 — k) ; by Equation 2.2 (2.3) Next, consider the lower dimension edges, i.e. i < n, we have from Equation 2.2 nl(k) = 71"(2"’1 — 2 — k) from the induction hypothesis = 17:72“ + 2'"1 — 2 — k) as a: mod 2" = (a: + 2'“) mod 2‘ = n‘(2" - 2 — k) 27 = niuv — 2 — k) Vi < n (2.4) From Equations 2.3 and 2.4 we have n‘(k) = n‘(N — 2 — k) for 1 S i S n. Com- bining this result with Equation 2.1, we have n(k) = 17(N — 2 — k), and induction completes the proof for the lemma. E] Due to the symmetry of the cut width values, we find that peak width should occur in both halves of the layout. Hence, we have the following corollary. Corollary 1 The peak width for an identity layout of the hypercube occurs in the first half of the layout, i.e. 0 S km S N/2 — 1. The next lemma restricts the possible value of km further. Lemma 2 The cut width for an identity layout of an n-dimensional hypercube, 17(k), is related to the cutwidth for an identity layout of an (n — 2)-dimensional hypercube, "’06): as 11(16) = N/2+77’(lc—N/4)Zn(lc—N/4) Vk N/4Slc>H, wehave f=(L+Hm)~ 1 (2.17) We find that speedup is almost equal to the channel bandwidth ratio, which in turn, is the product of the ratio of channel bandwidths and the ratio of channel rates from the definition i.e. m ‘ e = a = (t?) x (a) m We can multiply the results of Theorems 4 and 6 to get the ratio of channel bandwidth for all three delay models: constant, logarithmic, and linear. The results obtained are stated in the following theorem. (In the case of the linear delay model, we have ignored the term \/2 in the numerator for the ratio of channel rates (Theorem 6)). Theorem 7 The ratio of channel bandwidths of a 2D mesh network to that of a hypercube network for the three delay models are given in the Table 2.1. Delay Model Constant Logarithmic Linear R 31‘}: \f1\7(:3.139+152 ~41QO N1) zlgw b 3 4.33 3152 N Table 2.1: Ratio of Channel Bandwidths for the three delay models From Equation 2.17, the speedup R6 for contention-free adjacent—node commu- nication is approximately equal to the channel bandwidth ratio given in Table 2.1. 36 As a result, we find that the speedup is a function of the system size. From the Table 2.1 we find that the speedup increase as O(x/N), O(x/ng N) and O(N/ 1g N) for constant, logarithmic and linear delay models respectively. In Table 2.2, we give the speedup value for different system sizes. The speedup is appreciable for systems having hundreds of nodes. Speedup (Re) N $17,- Linear Logarithmic Constant delay delay delay 64 11 10.4 8.9 5.3 128 18 17.2 13.7 7.5 256 32 30.2 21.8 10.7 512 57 53.7 34.1 15.0 1024 103 97.1 53.6 21.3 2048 187 176.3 83.3 30.1 1.1096 341 321.8 129.6 42.7 _; Table 2.2: Speedup (Re) for different system sizes Table 2.3 shows the channel widths and channel rates for several commercial systems. 
The ratio of channel widths and rates agrees with that predicted by analysis (Table 2.1). For example, if we compare iPSC/l and MRC, both of which can have up to 128 nodes, the channel width ratio 8/1 = 8 and channel rate ratio 20/ 10 = 2. The channel width ratio is equal to R, for the constant delay model. From Table 2.2 for N = 128, we find channel width ratio is 7.5 (value of R.3 for the constant delay model) which matches the actual value 8. From Table 2.2 we can also derive the ratio of channel rates, assuming the linear delay model, by dividing the value of Re for the linear delay model by channel width ratio. Hence, the channel rate ratio for N = 128 is 17.2/7.5 z 2, which also matches the actual ratio. We find that the results of the analysis gives a reasonable estimate of the increased bandwidth possible by using a 2D mesh instead of a. hypercube network for a given system size, when we consider the suitability of the network for implementation. In the subsequent sections, we show that speedup will be lower than the channel bandwidth ratio due to longer path lengths and contention for network channels. In 37 System ] Network Topology ] W ] R ] B = WXR [] Ametek(sys 14) hypercube 1 3 3 iPSC/ 1 hypercube 1 10 10 Torus 2D Torus 8 8 64 MRC 2D mesh 8 20 160 iWarp 2D mesh 32 10 320 NDF 2D mesh 9 50 450 Table 2.3: Channel width, rate and bandwidth for some commercial systems fact, the channel bandwidth ratio is an upper bound for the speedup, i.e. R.2 S Rb. From now on, we will assume the constant delay model and use the corresponding value for the channel bandwidth ratio, as they have been found to be appropriate for current technology and system sizes [33]. 2.3.2 Contention-free Non-adjacent-node Communication For many applications it may not be possible to map the communicating tasks onto adjacent nodes. The path length will be more than 1 in such a case. If communication is not frequent, contention can still be ignored. We analyze the speedup for such a case in this section. The network latency will be the time taken by the header flit to reach the desti- nation node. For each hop, communication time for the header is H / B. For a path of length D, the network latency will be D x H / B. The data flits follow the header in a pipelined fashion. Hence, once the message header arrives at the destination node, remaining data flits arrive at the rate of B. The total time for all the data flits will then be L/ B. Hence, the network delay for communication over a path of length D will be rn=rz+rM=DxH/B+L/B=(DxH+L)/B (2.19) The speedup can be derived as it is the ratio of the communication times. From 1‘ staticns 13.16] lentil}: D“ Z ..l .' . f to. tie talue 01 I... u i v 1‘ .2136: .engtn) e 2::- .e:gtn of th ;at'; lengths in . :4: 2D mesh am D”: at; D": an The speedil] lezgths by sub: estate from E D" = 03,, D "_. v t , [”2 lie iOllOWm] The Speedu ..tt as long as i," L i does not tier i " heady coz t V. ! 13 n . t . “v deal if? is lo decreases 38 equations (2.16) and (2.19), we have r" B’">< DhH + L Dh + L’ = _n = —- — = 2 R‘ 7,71 B—h' xDmH + L :3, x Dm + L’ M" ( 20) Dh + L’ where L'=L/H and f=m Normally, D’" 2 D". Hence, we find that speedup is a fraction of R5 (Re S Rb), and the value of the fraction f depends on the message length (expressed in units of header length) and the path length. If the header length is one flit, then L’ will be the length of the message in flits. Let Davg and Dmar be the average and maximum path lengths in a hypercube and a 2D mesh. 
The values of D_avg and D_max are known for 2D mesh and hypercube networks [65, 33], and are stated below.

D^m_avg ≈ 2√N/3    D^m_max = 2(√N − 1)
D^h_avg = lg N / 2    D^h_max = lg N    (2.21)

The speedup can be expressed as a function of system size for various message lengths by substituting the values of R_b (from Theorem 7) and the values of the distance from Equation (2.21) into Equation 2.20. The speedup for an average node (D^m = D^m_avg, D^h = D^h_avg) and a worst-case node (D^m = D^m_max, D^h = D^h_max) are given by the following equations.

R_e = R_b × (lg N/2 + L′) / (2√N/3 + L′)    (average case)    (2.22)

R_e = R_b × (lg N + L′) / (2(√N − 1) + L′)    (worst case)    (2.23)

The speedups are plotted for various message lengths in Figure 2.3. We observe that as long as the message length is much greater than the path length, the path length does not affect the speedup and R_e = R_b. Experimental results in Section 4.2.2 later clearly confirm this conclusion. We assumed the header length to be 1 flit. If the header is longer, the speedup will be further reduced for a given message length (as L′ decreases for a given L).

[Figure 2.3: Speedup R_e for contention-free non-adjacent-node communication. A: Average case; B: Worst case. Speedup R_e (0-25) is plotted against network size N (up to 1000 nodes) for message lengths L = 5, 10, 50, 250, and 1000 flits, and for L = ∞, where R_e = R_b.]

2.4 Conclusion

The results of our studies in this chapter lead to the following conclusions.

1. Mesh networks can be designed with channels of one to two orders of magnitude higher bandwidth than hypercube networks, and hence have the potential for greater performance.

2. Under contention-free conditions, meshes perform better than hypercubes due to their higher bandwidth channels, except when message sizes are comparable to path lengths.

Applications that either map naturally onto a mesh or are not communication intensive can benefit from mesh networks. However, increased contention in mesh networks may make meshes perform worse than hypercubes. The contention problem scales up with the system size, so the performance of large networks is more sensitive to contention. In the next chapter, we show how to model the effect of contention on the communication performance for a given mapping.

Chapter 3

COMMUNICATION PERFORMANCE UNDER CONTENTION

We study the communication performance of large multicomputers under contention in this chapter. Contention can be a significant factor that determines the performance of large mesh networks. We develop a framework to predict the effect of contention for a given application and mapping. The results help to estimate the saturation node traffic, and to find whether the effect of contention is negligible or significant. We also show that we will lose the speedup provided by mesh networks if we use random mapping, and hence random mapping is not advisable in the case of large networks.
We concentrate on 2D mesh networks in this chapter. However, our results can be extended to networks that have 3D mesh or other topologies. The chapter is organized as follows. In Section 3.1 we describe several key concepts including the mapping problem, notations and assumptions that lay the foundation for later sections. We model the performance for a given path of communication (defined later) in Section 3.2. Several paths originate from a given node. Hence, we use the performance model for a single path to compute bounds on the message injection rates for various nodes in Section 3.3. We investigate the special case of random mapping in Section 3.4. Results show that contention scales up with random mapping, and is significant for large systems (having hundreds or thousands of nodes). Hence, random mappingiis not advisable for large multicomputers. Optimal mappings that minimize contention are given in Section 3.5 for several common communication patterns. Contention is reduced significantly by a careful placing of parallel tasks, and by routing various communications. We also describe steps in analyzing the effects of 41 42 contention and the effect of mapping in Section 3.6. Finally, we summarize the main results of this chapter. 3.1 Modeling Communication and Contention The effect of contention on communication time depends on the communication char- acteristics of an application, and the level of contention for network resources, In this section, we model both which will help us later to estimate the degradation in communication performance due to contention. 3.1.1 Communication Model Parallel programs are usually designed assuming a specific topology such as a tree, a ring, a 3D mesh, a hypercube and so forth. Hence, the communication pattern of a parallel program can be represented by a process graph. Definition 19 Process Graph, GP, represents a parallel program. The vertices de- note the parallel tasks and the edges denote the communication between tasks. Definition 20 System Graph, Gm, represents the concurrent system. The vertices represent the nodes and the edges represent the communication channels. We denote the set of vertices and edges of the process graph (system graph) by V(Gp) (V(Gm)) and E(G'p) (E(G'm)) respectively. In general, G1, and Gm are different. The system graph Gm is chosen based on implementation constraints, whereas Gp may be any topology that suits the problem at hand. Hence, the problem of mapping arises. Definition 21 Mapping Scheme, [1, decides the placement of the nodes of GP. It is a function that maps the vertices of G1, to that of Gm, i.e. [1 : V(Gp) —+ V(Gm). If we assume a fixed routing scheme R, then each edge of the process graph (v1, v2) corresponds to a path between p(v1) and n(v2) defined below. 9 "1’1 11 (la 6.\ Lev y~ L3 ’ I E; In ‘I" ‘ 1 1 - Uzi. \, ‘k, \Q '\ 0.]E 4 1,1 43 Definition 22 A Path, p(e), is the image of an edge e = (v1,v2) of the process graph for any given mapping p. The path is a sequence of edges of the system graph (e1, e2, . . . , eD) defined by the routing function for communication between nodes p(vl) and p(vg), where D is the path length. We restrict the domain of our study by making the following assumptions about the process graphs and mapping. 1. Process graphs, GP, have an adjacency matrix that is symmetric. Hence, the indegree is equal to the outdegree for any vertex v, and is denoted by 6(v). 2. A task communicates uniformly with its neighbors in GP. 3. Message lengths are exponentially distributed. 4. 
Message generation on each path of 0,, is a Poisson process with an arrival rate of A’ messages / sec. 5. The mapping function is one-to-one, i.e. every vertex (node) of the system graph, Gm, is assigned at most one vertex (parallel task) of the process graph, G’- 6. We assume that blocking communication is used, and that communication and computations do not overlap. These assumptions imply that a node of Gm communicates uniformly with only a small set of other nodes of Gm onto which the neighbors (in GP) of the parallel task running on that node are mapped. Two nodes that communicate during a run are called peers. Assumptions 3 and 4, are made so we can analyze the performance using queueing theory. Our results, however, apply quite well to other distributions of message lengths and message interarrival times. The last assumption is made so we consider the simpler case first. We can extend our results to the cases when the last assumption may not be true. One important characteristic of a process graph is its richness measured by the total number of edges. Since we assumed the process graph to have a symmetric i 1 a‘ 79-" a ‘-. ¢A" “‘ v :3; the ‘- 3'; (0351' mod 0 pp fort ‘4L _‘j “Ln {.11 . “Jug“ p 44 edge-matrix, we have an even number of edges in a process graph. Let E be equal to half the total number of edges. The average and the maximum degree can be defined by considering all nodes of the process graph. If Np denotes the number of vertices in the process graph, then we have 6m: = i Z 6(v) 6m”: max 6(v) (3.1) N, vet/(op) ”El/(6’) Here, 6(v) denotes the indegree (equal to outdegree) of a vertex of GP. Observe that 2E is equal to sum of the degrees (outdegree or indegree) of all the nodes of the process graph. Hence, _ _1_ 2B 5... — N, ”Egg“ 6(v) = F; (3.2) We see that E and 601,9 are related, and either one of them can be used as a measure of the richness of a process graph. In addition to the process graph, we need to model the communication intensity of the parallel tasks to completely define the communication characteristics of an application, and the effect of contention. We use the following notation. o t represents the average time for local computations between two successive message injections by a node. L is the average length of a message. 7' is the communication time for a message of length L. r’ is the value of r if the communication is contention-free. 0 is the factor by which communication time has increased due to contention, i.e. 0 = r/r’. T represents the average time between two successive message injections by a node, i.e. T = t+ r. T’ represents the average time between two successive message injections under contention-free communication conditions, i.e. T’ = t + r’. The following definitions help Definit? a . .' P. i}. Ill 4'1 ' ' __-orl :fip‘tlileli Definit 5 56716.? .‘i *q . . 55. f0 3331 lit are: ac 45 Definition 23 Applied path traffic, A,“ is the ratio of data generated on any edge of G, in bits/sec under contention-free conditions to the capacity of a network network channel, B, in bits/sec. Definition 24 Applied node traffic, A,, is the ratio of the rate of data generated by a vertex of G, ( or Gm) in bits/sec under contention-free conditions to the capacity of a network channel, B, in bits/sec. Since a message of length L is generated by a node in T secs, we find that A, = LLg—Il. Applied node traffic is the sum of the applied path traffic over all paths that originate from the given node. 
We assumed that a node of G, communicates uniformly with 6 other nodes; so node traffic A, = 6A,“, where 6 is the degree (in G,) of the parallel task running on the given node. Hence, A,“ = Aa/6 = (#511, Applied path traffic and applied node traffic depend only on the application and the system constant B. Contention will increase the communication time, and hence the time between messages, T. This contention will reduce the path traffic and also the node traffic. Hence, we have the following definitions for the traffic under contention. Definition 25 Actual path traffic, A,, is the ratio of the rate of data generated on any edge of G, in bits/sec to the capacity of a network channel, B, in bits/sec, i.e. A __ [LIT] P — B ' Definition 26 Node traffic, A,, is the ratio of the rate of data generated by a vertex of G, (or Gm) in bits/sec to the capacity of a network channel, B, in bits/sec. Similar to the path traffic, we find that A, = L99, and A, = 6A,. Observe that A, S A, and A, S A,“ as T Z T’. The relation between 0 and A, can be found by their definitions. , = (M) = (L/B) = (L/B) = Tu. " B T t+r t+r’+(0—1)T’ Au Aa = 1+(0—1)T’/T’ = 1+(6—1)A, (3'3) L‘J' 46 We can also show a similar relation between A, and 6. A = @531 = (L/B) = 5T'Ap. B 6T 6(t + r) Ape = 1 + (9 _ 1))‘pe by proceeding as inA, derivation above (3.4) Since, 0 and A, are related, we can represent the effect of contention by the ratio of A, to A,“ or by the ratio of A, to A,, instead of 0. As we see later, this way of representing the effect of contention has some advantages. We make the following assumptions about the system graph or the communication network. Most of these are reasonable for current networks, and simplify our analysis. 1. All network channels (edges of G,,) have the same bandwidth. 2. All transmissions are error free. 3. Wormhole-routing or Circuit-switching is used. Header length is assumed to be negligible compared to message length, and the time taken by intermediate routers to process headers is assumed to be negligible compared to the overall message transmission time. 4. A fixed deadlock-free routing scheme is used in which messages travel along the row channels first, and then along the column channels. 5. Physical channels do not support virtual channels. The last assumption is not true for some new mulitcomputers, such as iWarp. Our analysis needs to be extended to cover those systems. 3.1.2 Contention Parameters We define the following three parameters for a given G,, G,,, p and R which help us to measure the contention for network channels. 47 Definition 27 Channel load, x, for any edge e of G,,, which represents a network channel, is the number of paths that share the given edge, i.e. x(e) = [{e’ : e’ E E(G,,) A 6 6 P(€')}|- Channel load helps in characterizing the non-uniformity in traffic through various network channels, and gives an upper bound on path traffic A,. If the path traffic along each path sharing a channel is A, then A, S l/x. The average and maximum channel loads, denoted by x,,, and Xma, respectively, are defined considering the various edges of the system graph. We have 1 Xav = —— x e Xma, = max x(e) (3.5 g ]E(Gm)| 66%,”) ( ) CEE(Gm) ) Channel load alone cannot characterize the contention. We define two additional contention parameters—logical path length and path contention level—for an edge of the process graph. First, we need to define the concept of path contention sets which is used in the definitions. 
Definition 28 For any edge e of G_p which corresponds to a path p of G_m, the path contention set S_i is the set of paths that share with p at least one of the first i segments (which correspond to channels) of p.

Let path p = e_1, e_2, . . . , e_D. Then the corresponding path contention sets S_i, 0 ≤ i ≤ D, are defined as

S_0 = ∅,    S_i = S_{i−1} ∪ {p′ | e_i ∈ p′}    (3.6)

Here, ∅ denotes the null set.

Definition 29 Logical path length (d) for any edge of G_p is the number of path segments in which at least one new path is "encountered", i.e. d = |{i : 1 ≤ i ≤ D ∧ S_i ≠ S_{i−1}}|.

Definition 30 Path contention level, ν, for any edge of G_p is the total number of paths that share at least one network channel with the given path, and is equal to the cardinality of the set S_D, i.e. ν = |S_D|.

The logical path length is bounded by the physical path length D, and denotes the number of queues (for network channels) at which a message on a given edge of G_p may have to wait. High values of the logical path length indicate a possibility of longer wait times. The significance of the path contention level is that a message on a given edge of the process graph will be blocking messages on that many paths. Hence, the path contention level also puts another upper bound on the path traffic, λ_p ≤ 1/(ν + 1). As in the case of channel loads, we define average and maximum logical path lengths and path contention levels considering all the edges of the process graph.

An example is given in Figure 3.1 to illustrate the contention parameters. The figure shows a random mapping of a 15-node binary tree (G_p) onto a 16-node, 2D-mesh system graph (G_m). The paths are shown in Figure 3.1.B for the tree edges from parents to their children. We have only shown edges from parents to children in G_p to keep the example simple and the figures clear. In Figure 3.2, we present the values of the contention parameters for this example. In Figure 3.2.A, the tree edges are labeled by the triple (D, d, ν). The mesh edges are labeled by the corresponding channel loads. The average and maximum values of the contention parameters for the example are also given. Note that for this example d and ν are the same for all edges of G_p, as G_p is sparsely connected and the maximum channel load is 2. In general, ν ≥ d.

[Figure 3.1: An example of a random mapping. A: Process graph (15-node binary tree); B: System graph (16-node 2D mesh).]

Actual path traffic, λ_p, will then be the solution of Equations 3.4 and 3.13. We find that λ_p = λ_pa if λ_pa < λ_sat, and that λ_p = λ_sat if λ_pa ≥ λ_sat. The exact and approximate solutions for λ_p are shown for a specific instance of λ_pa = 0.5 in Figure 3.5.A. We can repeatedly solve for λ_p for different values of λ_pa for a given λ_sat. We can then plot λ_p as a function of λ_pa. The resulting curves are shown in Figure 3.5.B when λ_sat = 33.33% (ν = 2).

3.3 Saturation Node Traffic

In the last section, we analyzed the effect of contention considering a single path. The results show that path traffic saturates at a level determined by the path contention level. The effect of contention on node traffic is similar—to put an upper bound on the node traffic. We can determine the bound on node traffic by considering multiple paths originating from a given node.
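Before doing so, the single-path quantities can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the contention-analyzer tool described later in Section 3.7: it assumes the fixed row-first routing of Section 3.1.1, takes a hypothetical placement of four tasks, and computes the channel loads (Definition 27), logical path lengths and path contention levels (Definitions 29 and 30), and the per-path saturation bound 1/(ν + 1) used above.

```python
from collections import defaultdict

def xy_path(src, dst):
    """Channels (ordered node pairs) used by row-first (X-then-Y) routing."""
    (r1, c1), (r2, c2) = src, dst
    hops = []
    c = c1
    while c != c2:                       # travel along the row first
        nc = c + (1 if c2 > c else -1)
        hops.append(((r1, c), (r1, nc)))
        c = nc
    r = r1
    while r != r2:                       # then along the column
        nr = r + (1 if r2 > r else -1)
        hops.append(((r, c2), (nr, c2)))
        r = nr
    return hops

def contention_parameters(edges, placement):
    """For each process-graph edge, return (D, d, nu, 1/(nu + 1))."""
    paths = {e: xy_path(placement[e[0]], placement[e[1]]) for e in edges}
    load = defaultdict(int)              # channel load chi (Definition 27)
    for p in paths.values():
        for ch in p:
            load[ch] += 1
    results = {}
    for e, p in paths.items():
        seen = set()                     # path contention set built so far
        d = 0                            # logical path length (Definition 29)
        for ch in p:
            sharers = {e2 for e2, p2 in paths.items() if e2 != e and ch in p2}
            if sharers - seen:           # at least one new path encountered
                d += 1
            seen |= sharers
        nu = len(seen)                   # path contention level (Definition 30)
        results[e] = (len(p), d, nu, 1.0 / (nu + 1))
    return dict(load), results

# Hypothetical example: four tasks placed along one row, three communicating pairs.
placement = {0: (0, 0), 1: (0, 1), 2: (0, 2), 3: (0, 3)}
edges = [(0, 2), (1, 3), (0, 3)]
load, results = contention_parameters(edges, placement)
print("max channel load:", max(load.values()))
for e, (D, d, nu, sat) in results.items():
    print(f"edge {e}: D={D}, d={d}, nu={nu}, saturation path traffic={sat:.2f}")
```

For the hypothetical placement shown, every edge ends up with ν = 2, so no path can sustain more than one third of a channel's bandwidth.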
[Figure 3.4: Effect of contention on communication time for a single path. A: Normalized communication time θ versus path traffic λ_p (in percent, up to 40%) for contention levels ν = 15, 10, 5, and 2. B: The contention level ν, as a function of path traffic, below which the effect of contention is negligible.]

[Figure 3.5: Actual path traffic vs. applied path traffic for a single path (ν = 2). A: Normalized communication time θ versus path traffic, showing the exact solution, the approximation θ = 1/(1 − λ_p/0.333), and the saturation level λ_sat = 33.3%. B: Actual path traffic versus applied path traffic for the exact solution and the threshold approximation.]

The following assertion holds for the limited problem space we are considering.

Assertion 1 For any node, the path traffic is the same for all the paths originating from that node.

We assumed that each node is running at most one parallel task, and that a parallel task communicates uniformly with its neighbors in the process graph (refer to Section 3.1). Hence, over a given time a node would have sent an equal amount of data on all paths originating from the node. The result is that the path traffic will be equal on all paths of any node. We consider an average node first.

Lemma 3 The saturation node traffic for an average-case node is given by λ^avg_sat = δ_avg/(ν_avg + 1).

Proof: The average number of paths that originate in a node is δ_avg. Each of these paths has an average path contention level of ν_avg. From the results for a single path in Section 3.2, the saturation path traffic for each path is λ_sat = 1/(ν_avg + 1) (Equation 3.11). Hence, we have

λ^avg_sat = δ_avg λ_sat = δ_avg/(ν_avg + 1)

which completes the proof. □

We can also use our framework to estimate the saturation levels for individual nodes and the worst node, which will help in characterizing the nonuniformities in traffic. We have the following result for the worst case.

Lemma 4 The saturation node traffic for the worst-case node, λ^wc_sat, is bounded by δ_avg/(ν_max + 1).

Proof: Consider the path p that has the maximum path contention level (ν_max). Let S be the set of paths that are in the path contention set of p. Let λ_p(x) denote
Since, the range of i is 0-6max, we have 677303 2(pi(umaz+l))( nun/i) S 1 (3-15) i=0 We can find so,- easily. If n,- is the number of nodes whose degree is i, then ngi paths originate from nodes of degree i. The total number of paths is 2E (from the definition of E). Hence, p,- = ngi / 2E. Using this result in Equation 3.15, we have to 1 1 n“; S 6mm: ' 6 (”max + 1) 23:0 (pi/2)= (”max +1)Z:'="6’ 215': 2E 6aug (Vmaa: +1)Np =(Vma1: +1) by Equation 3.2 which completes the the proof. D We show in the next chapter by simulation that the above bound is tight, i.e. the node traffic for the worst-case node saturates at this bound if messages are given pri— ority based on the time they entered the network. Hence, we conclude that saturation l. Sin 58 node traffic for the worst case node is given by 6a,,g/(Vmaz + 1). We can also compute an upper bound on the node traffic for any individual node which is stated as the following lemma. Lemma 5 For any node, the node trafl‘ic is bounded by 6 (1 — where Vm is _.._. > ”mas'l’l ’ the maximum path contention level for any path originating from the node Proof: From Assertion l, we know that all of the paths that originate in a given node should have the same path traffic in the steady state. We know that the path traffic for any path is bounded by A,,, = 1 / (V +1), V being the path contention level for that path (Equation 3.11). Hence, the bound is lower for paths that have a higher path contention level. Therefore, of all the paths originating from a node, the path pm that has the maximum path contention level equal to Vm will determine the path traffic on the remaining paths from the node. Let S be the set of paths that are in the path contention set for pm. The sum of path traffic on pm, and path traffic on all paths in S should be less than 1. Therefore, we have Mp...) s l—ZMP) (3.16) P65 Using the same argument as in the proof of Lemma 4 we find that 67750: 6111a: A,(p,,,) S 1- Z PiVm(/\:’.../i) = 1 - I’m/W... Z (Pi/i) i=0 i=0 VmA‘” Um = 1 _ _AL: < 1 _ _— 5m; _ (”mm + 1) by Lemma 4 Since, A,, = 6A,, we prove the lemma. E] Lemmas 3, 4 and 5 show that it is possible to estimate the bound on node traffic for an average node, the worst-case node or any node. 3.4 Ba We consider term suggest We first sho‘ average and Show that 01 outed talues Combining 1 section. we suficient for to the avera traffic 311 Here, w. ;ossible me- lnode in r 3.4.1 I Path lengtl gapil is m. ml path Comer nod ‘05" aSsumir Stoch- tica ”(Elected [e Dr. 59 3.4 Random Mapping We consider the special case of random mapping now. In random mapping, as the term suggests, tasks are mapped onto a randomly—selected system node that is free. We first show that we can use graph theory and probability theory to estimate the average and maximum values of contention parameters for a random mapping. We show that our estimates are sufficiently accurate by comparing them with the com- puted values from 25 different random mappings for several known process graphs. Combining the estimates for path contention levels with the results of the previous section, we can find the saturation node traffic, and decide if random mapping is sufficient for the given application. We can estimate the worst-case values in addition to the average-case values. Hence, the results also show the extent of nonuniformity in traffic and contention. Here, we assume that process graphs are randomly mapped onto the smallest possible mesh network. Such a system graph will have N nodes where N = [@12. 
A node in row i and columnj is denoted by (i,j), 0 < i,j' < \/.7V—. 3.4.1 Path Lengths _Path length depends only on the size of the system graph onto which the process graph is mapped, and is independent of the richness of the process graph. The max- imum path length (pmaz) occurs for communication between the diagonally-opposite corner nodes of the system graph. The average path length (paw) can be computed by assuming that the lengths of the segments of a path in the two dimensions are stochastically independent [33, 65]. The average path length will then be twice the expected length in each dimension. We, therefore, have om, = 2(x/1V—1) (3.17) 1 m-lm-l . . llnm :: 2(YV'ZE: :E: h'—Jd) := i=0 j=0 [\D ..., '1le 0 tame: 60 We observe that Dma, z 3Davg, i.e. the maximum path length is approximately three times higher than the average path length. The logical path length is approximately equal to and is bounded by the physical path length. It can be seen that if the process graph is richer, there is higher prob- ability that a new path will be “encountered” at each segment of the path. Hence, the logical path length become closer to the physical path length as a graph becomes richer. The estimates and actual values from 25 random mappings are shown in Figure 3.8.A and B. We observe that estimates match well with actual values. 3.4.2 Channel Loads The average channel load is easy to calculate as it is related to the average path length. The sum of the channel loads over all edges of the system graph should be equal to the sum of path lengths over all edges of process graph, as both count the same quantity in two different ways. Hence, we have- 1 1 X609 _ |E(Gm)|e€E§;M)X(e) '_ 4flv—(m— _ 1) aegip)D(e) |E(G,)|D,.,, WWW — 1) 2E(2\/1V/3) x/NMW _ l) usingEquation 3.18 E 22 W (3.19) To find the maximum channel load, consider the load on a row channel between nodes (i, j - 1) and (i, j) (Figure 3.6.A). The source nodes of paths that need that channel are the nodes (i, 0) .. (i, j — 1), and the destination of these paths should be in a column greater than or equal to j (This is a consequence of the fixed routing scheme used by the network). A total of jbavg paths originate from nodes (i,0) . . . (i, j — 1). If destinations are uniformly distributed, the probability, so, that the destination of a, given path is in a column greater than or equal to j is (W — j / x/IV ) Therefore, the number of paths through, or channel load, on the channel under consideration, 61 1:1 d’fifi'n" 1:1 6 fit“ I: if)” a. . Cl DEE] DID! D V 3'3 sséblesource' Cl “Ell—JD" 1,0 i,j-1 i,j:-i,j+1 . _ 1 D m] E’°--Dnifi”nfii’“' 1:1 ~EIEIEI- 1:1 (:1 chan - .......... ndhrel consideration ' Eiégljfi CID: n-lg-I possible destination n-1,0 n-1’J n-1,n-1 no es A. Estimating Channel load B. Estimating path contention level Figure 3.6: Estimating contention parameters i j, is a fraction p of the j 6“,, paths. Xi'j = p(jam) = j(m‘/‘Nj)5avn The above expression will be maximum when j = W / 2 (which makes p = 1/2). Hence, mam _ «37 2 E Xmas = 4 — 4 X _ = 2W (320) N? The maximum channel load derived so far is the expected value for a single row. There are N rows. When we consider all the rows, it is possible that for the worst-case row, all destinations are on the right half. Hence, we conclude that the maximum channel load is given by E Xmas: = W (3.21) Here, only row channels were considered. The distribution of loads on column chan- nels can be estimated similarly, i.e. 
considering the possible source and destination nodes for the paths that need a given channel. We find that the distribution of loads on column channels is similar to that of row channels. Hence, we conclude that the maximum channel load is given by Equation 3.21. Note that loads increase as one moves towards the center of the mesh network.

The estimated and the actual values as computed from 25 random mappings are shown in Figure 3.9 for four common process graphs. The results show that our estimates for channel loads are good. The actual values are slightly less than the estimate, except for the binary tree, for maximum channel loads. We find that the actual values are better approximated by

χ_max = 0.75 E/√N    (3.22)

instead of E/√N. The maximum channel load is therefore about 0.75 × 3 ≈ 2.25 times higher than the average channel load.

3.4.3 Path Contention Levels

The path contention level is more difficult to estimate. A path on average has a length of √N/3 in each dimension, as explained in Section 3.4.1. The path can be partitioned into four segments V_i, 1 ≤ i ≤ 4 (Figure 3.7). We can estimate the values of V_1 .. V_4 by considering the possible sources and destinations of the paths "encountered" on the various segments of the path, and making use of the results for channel loads and path lengths. We can then add them to get ν, i.e.

ν ≈ V_1 + V_2 + V_3 + V_4    (3.23)

[Figure 3.7: Estimating path contention level.]

The first term, V_1, is the number of paths that use the first, higher dimension (row) channel. The average value of V_1 will be one less than the average channel load. We showed in Section 3.4.2 that the average channel load is E/3√N, so

V_1 = χ_avg − 1 ≈ E/(3√N) − 1 ≈ E/(3√N)    (3.24)

The next term, V_2, in Equation 3.23 is the number of new paths encountered on the remaining channels of the higher dimension. To estimate it, consider the possible sources and destinations for a new path encountered at the channel connecting node (i, j) to (i, j + 1). According to the routing protocol, the source must be the intermediate node (i, j), and the destination must be a node in a column higher than j. By definition, δ_avg paths originate at each of the √N/3 intermediate nodes, and approximately half of them can be expected to go in the same direction as the path under consideration. Hence,

V_2 ≈ (δ_avg/2) × (√N/3) = (1/2) × (2E/N_p) × (√N/3) ≈ E/(3√N)    (3.25)

The third term, V_3, in Equation 3.23 is the number of new paths encountered on the first channel of the lower dimension, which is approximately equal to the average channel load.

V_3 = V_1 ≈ E/(3√N)    (3.26)

The last term, V_4, in Equation 3.23 is the number of paths encountered on the remaining channels of the lower dimension. We can compute this value by considering the number of new paths encountered at each intermediate node (Figure 3.7). The source node of these new paths should be one of the nodes in the same row as the intermediate node under consideration. The total number of paths emanating from that row to all destinations is δ_avg × √N. The destination of the paths we encounter should be one of the nodes in the same column as the node under consideration, and in a row numbered higher than the intermediate node. The probability ℘ will be (√N/2)/N = 1/(2√N) on an average. Hence,

V_4 ≈ (√N/3) × (℘ × δ_avg√N) = (√N/3) × (1/(2√N)) × (2E/N_p) × √N ≈ E/(3√N)    (3.27)

Combining Equations 3.24–3.27 with Equation 3.23, we have

ν_avg = V_1 + V_2 + V_3 + V_4 ≈ E/(3√N) + E/(3√N) + E/(3√N) + E/(3√N) = 4E/(3√N)    (3.28)

From Equations 3.28 and 3.19 we find that ν_avg/χ_avg = 4, i.e. the average path contention level is about four times the average channel load.

The maximum path contention level is more difficult to estimate. The problem is that the values of the four components are dependent on one another. For no path can all the components reach their maximum values. There are two cases to consider. In the first case, the source node of the path is at the center of the mesh. In that case, the first channel of the path in each dimension can be shown to have the maximum load, equal to χ_max = E/(2√N). The maximum length of the path segments can only be √N/2. Hence, V_2 and V_4 will not have their maximum possible values. Using √N/2 instead of √N/3 in Equations 3.25 and 3.27, we can show V_2 and V_4 to be E/(2√N). The maximum path contention level will be

ν_max = E/(2√N) + E/(2√N) + E/(2√N) + E/(2√N) = 2E/√N    (3.29)

Another possible case resulting in a high path contention level is when the length of the path segments is maximum, i.e. √N − 1 instead of √N/3. In that case, we find that V_2 and V_4 will be maximum (≈ E/√N), by using the value √N − 1 instead of √N/3 in Equations 3.25 and 3.27. However, V_1 and V_3 will be minimum, equal to 0. Once again we find that ν_max ≈ 2E/√N, and we conclude that Equation 3.29 is the estimate for the maximum path contention level.

Actual values of the average path contention level are slightly lower than our estimates. We find that ν_avg = E/√N fits the actual values better than our estimate of 4E/(3√N). The difference arises from the fact that the values of the four components are not independent of each other, and some error is introduced when we estimate the four components independently and then add them to get ν_avg. We, therefore, modify our estimates, and summarize the final results in Table 3.1, where C = δ_avg√N = 2E√N/N_p ≈ 2E/√N.

    Contention Parameter          Average    Maximum
    Channel load (χ)              C/6        3C/8
    Path contention level (ν)     C/2        C

Table 3.1: Contention parameters: summary of estimates

From the results so far, we conclude the following about the special case of random mapping.

• Contention increases with both the system size and the degree of the process graph.

• The worst-case values are 2 to 3 times higher than the average case.

• The ratio of the path contention level to the channel load remains approximately equal to 3 as the system size or the richness of the process graph changes.

We can combine the estimates of the contention parameters for random mapping with the results of the previous section for saturation node traffic (Lemmas 3 and 4). The results are the following lemmas.

Lemma 6 The saturation node traffic for an average node when an application is randomly mapped onto a mesh of N nodes is λ^avg_sat = 2/√N.

Proof: We have shown that the average path contention level for random mapping is E/√N (Table 3.1), and δ_avg = 2E/N_p (Equation 3.2). Using these results in Lemma 3, we have

λ^avg_sat = δ_avg/(ν_avg + 1) = (2E/N_p) / ((E/√N) + 1) ≈ 2√N/N_p ≈ 2/√N    as N_p ≈ N    (3.30)

which completes the proof. □

[Figure 3.8: Path lengths, estimated and actual values, for the binary tree, 2D mesh, 3D mesh, and hypercube process graphs as a function of the number of nodes N. A: Physical path lengths; B: Logical path lengths (lines show the estimates, legends show the actual values).]

[Figure 3.9: Channel loads, estimated and actual values. A: Average channel load (estimate E/3√N); B: Maximum channel load (estimate E/√N). Lines show the estimates, legends show the actual values.]

[Figure 3.10: Path contention level, estimated and actual values. A: Average path (estimate 4E/3√N); B: Worst path (estimate 2E/√N). Lines show the estimates, legends show the actual values.]

Lemma 7 The saturation node traffic for the worst node when an application is randomly mapped onto a mesh of N nodes is bounded by λ^wc_sat = 1/√N.

Proof: We have shown that the maximum path contention level for random mapping is 2E/√N (Equation 3.29), and δ_avg = 2E/N_p (Equation 3.2). Using these results in Lemma 4, we have

λ^wc_sat = δ_avg/(ν_max + 1) = (2E/N_p) / ((2E/√N) + 1) ≈ √N/N_p ≈ 1/√N    as N_p ≈ N    (3.31)

which completes the proof. □

Lemmas 6 and 7 also give a rough estimate of the saturation levels when applications are randomly mapped. However, we can actually compute the path contention levels even for the case of random mapping, and use these values with the results of Lemmas 3, 4 and 5 to estimate the saturation levels more precisely. Table 3.2 gives the estimated saturation levels for four different topologies when they are randomly mapped (a specific instance). In fact, the software we have developed automatically computes the saturation node traffic values for any given process graph and mapping, in addition to the contention parameters.

                  2D Mesh      Binary tree    3D mesh      Hypercube
                  Avg   Wc     Avg   Wc       Avg   Wc     Avg   Wc
    N = 64        47    23     31    14       36    19     41    22
    N = 256       19    10     12    6        13    7      17    9
    N = 1024      8     4      5     3        5     3      7     4

Table 3.2: Saturation node traffic for various process graphs for a specific random mapping

In the next chapter, we show by simulations that the average estimates are quite accurate. The worst-case (Wc) estimates also match, but simulation results seem to
Observe that the saturation node traffic for an average node as well as the worst node (Lemma 6 and 7) is independent of the the richness of the process graph as the expression for A,,”, does not contain the term 6“,, or E. In general, the result applies to any process graph including a completely connected graph. If the process graph is completely connected, then a node communicates with all other nodes uniformly which results in uniformly distributed messages. Therefore, we have the following corollary. Corollary 5 For uniformly distributed messages in an N -node mesh, the saturation node trafi‘ic is 2/x/N. In the case of uniformly distributed messages, all nodes are identical and the worst node is the same as the average node. For a 256-node mesh, the saturation level will be 2/16 = 12.5%. The saturation traffic level for uniformly distributed messages has recently been derived in a different way by Dally [33], in which a different unit, called network traffic, was used to measure traffic. Definition 31 Network traffic per node is the ratio of the data injected by the node to the capacity of the network, and is denoted by A,. Network capacity per node is the total bandwidth out of a node divided by the average distance traveled by a message, and is denoted by C. We have C’ = 4B/Da,, for a mesh network, as there are four outgoing channels, each having a bandwidth of B. The data injected by a node is A,,B by the definition of node traffic, our metric for traffic (Definition 26). Hence, we have C "" 413/13,, ‘ 4 (3'32) For uniformly distributed messages in a mesh network, we know the value of D“, to be 2x/N / 3 (Equation 3.18). Hence, we have E x A" - EA, (3.33) A: —_. ‘ 3 4 6 The e :eiati [Cort 71 The expression above shows that node traffic is related to network traffic. The above relationship holds even for saturation traffic values. Hence, using the value of A2,“, (Corollary 5) we have ,. _ fl... _ W has 6 float - T = 33.33% (3.34) OOH—t which match the reported result for the uniformly distributed messages by Dally in [33]. The advantage of our framework is that it is more flexible, and can be used to predict the saturation node traffic levels for various process graphs and mappings. 3.5 Optimal Mapping In this section, we present mappings and routings that minimize contention (measured by path contention level and channel load) in the case of several common process graphs. Simulations later show that these mappings provide a significant performance improvement over random mapping. 3.5.1 Binary Tree Several schemes have been suggested recently to optimally layout a binary tree. Youn and Singh [79] have suggested a tile-based scheme which uses just one extra node i.e. N = N, + 1. We adapted their scheme to reduce Xma, to 2. Three types of basic tiles that give the mapping and routing for a 15-node complete binary tree onto a 16-node mesh are shown in Figure 3.11. The edges shown in the figure indicate the routing for communication in either direction i.e. child to parent or parent to child. Observe that for these tiles Xmas: is 2. The scheme can be extended for larger trees by combining the basic tiles. The path corresponding to higher level edges of the larger tree uses only the channels along the perimeter of the basic tiles. Also, no two paths corresponding to higher level edges share a channel. Hence, the maximum channel level still remains 2. The scheme also reduces path contention levels significantly. 
72 Contention is reduced even if we only use the placement suggested, and use the standard fixed routing instead of the suggested routing. In the next chapter, we show that even if we only use the placement suggested by Youn’s scheme and the standard fixed routing (i.e. ignore the routing suggested by this scheme) we still get significant improvement in performance compared to random mapping. Another scheme suggested by Gordon [45] uses slightly more nodes for a given tree, but reduces x”, to 1 and 11m“,- to 0, completely eliminating contention. :J 713-5734? .72 - I. e .3 2:11 ;1] 3:37 1 _1_ 1—_. |73L-—-\ —-2]—7Lf| "L--3—- “E2. {—27 \2-—:T :€—'\21=:f ':_1' I _1-1: L. I 1 1 : 1 1 _1 Tile M1 Tile M2 ’ Tile M3 Figure 3.11: Three basic tiles in modified Youn’s scheme (The labels indicate the levels of the tree node 1 indicates the leaf level, and 4 indicates the root level) 3.5.2 Mesh-3D The 3D mesh is another popular communication pattern used by many applications. Figure 3.12 suggests two schemes. Scheme A does not use any extra nodes, but results in high channel loads of xmu = (3/ N, + 1. Scheme B may need more nodes, but reduces the value of xm“ to 3” N, + 1. 3.5.3 Hypercube We explained and proved in Section 2.2 that an identity layout of hypercube topology gives the minimum peak width equal to [2N / 3]. We can therefore use an identity r72 73 _kfl/L k ’1 I. l T I 1 __ | I l I l k_, 12:1" : 2' .... : : " f,.‘--,lz:-,- ........ 1 : : : ---... /: : , k2 _fi: +]"” //'___1____,: ________ klc J-J-i------- . ' [...] : : Yfi--- ‘ [‘33 E L/ / k33DMesh I _241 1 Scheme 1 Scheme 2 Figure 3.12: Mapping 3D Mesh on to 2D mesh network (N, = k3, k’ = N751) mapping to minimize contention where node (i, j) of mesh runs the ix/N + j task of the hypercube process graph. If the dimension of the hypercube is even, then we will be mapping N, = 2" nodes of the process graph onto a 2’”2 x 2'”2 mesh (then N, = N and «N = 2"”). Each row or column then will have x/N nodes. The maximum channel load for this optimal mapping will be [2\/N / 3] , from the result of Theorem 2. Alternate schemes of mapping and routing that are possible in the case of several process graphs indicate that there is a tradeoff between the number of extra nodes used (i.e. N — N,) and the extent to which the contention can be minimized (x values or V values). 3.6 Speedup Contention can reduce the performance of 2D or 3D mesh networks more than a richer topology such as the hypercube. The following lemma relates A,,; to a reduction in effective rates. 74 Lemma 8 The efi'ective rate under contention can be reduced by a factor of Aug, i.e. E = AgatBo Proof: By Definition 26 A, = (£921. or T = %. Also, the communication time is the loop time. minus the computation time, i.e. r = T -— t. For communication intensive applications, A,, -> Am. Also in such applications, the term t can be ignored in comparison to 1'. Hence, T m 1'. By the definition of effective rate (Definition 10), we have L L L E:: = 71." = m = Asat-B (3.35) Asa: D In the case of a richer topology, the effect of contention will be negligible and A,,, z 1. Hence, the effective rate will be the same as channel bandwidth. We find that the speedup also decreases by a factor of A.“ for communication intensive applications if we use random mapping. Rc = E- (Definition 12) A,, B'" = fi— (Lemma 8) = AmRb (Definition 11) 2 2W 1.33 (Table 2.1) 22 This result shows that all the speedup we obtained by using a mesh would be canceled by increased contention if we use random mapping. 
Hence, mesh would perform just slightly better than hypercube for communication intensive application. In reality, the bandwidth of mesh networks may not be exactly 2\/N / 3 times that of hypercube networks. If the bandwidth is less than this factor, the mesh will perform worse than hypercubes. For applications that have applied node traffic comparable to 2/x/N or greater, we can use careful mappings. Such careful mappings will reduce the path contention level. and then the that to re elective « careful in 3.7 The thec form of a the satur At prese ‘31) Tom. rewriting other use The 5 i0! an a; 1. Fl: be done 0 Pm me the me ces cor 0f< ter: con it \\ ‘ Rur 75 level, and increase Am. If mapping and routing are sufficient to make A”; > A“, then the effect of contention will be negligible, and R. R: R1,. Hence, we conclude that to retain the advantage of the higher bandwidth of mesh networks and to get an effective communication speed equal to the channel bandwidth, we may have to be careful in mapping parallel tasks. 3.7 Contention Analyzer: A Software Tool The theoretical framework developed in this section has been implemented in the form of a contention analyzer. It does all the computations involved, and hence gives the saturation levels for node traffics directly, for a given process graph and mapping. At present, the tool works for Linear array, 2D mesh, 3D mesh, Hypercube, and 2D Torus. The tool can be easily modified for other system graphs, as it involves rewriting one or two small functions. Random mapping is built into the tool. Any other user-specific mapping has to be defined by the user in the form of a function. The software tool is useful in analyzing contention, and evaluating the mapping for an application. The following steps are suggested. 1. First estimate the applied node traffic for the application. This estimation can be done in several ways. 0 Program analysis : If the program involves a single kernel loop, we can esti- mate the data generated per loop on average, say L bytes. We can also estimate the average number of instructions, say I, executed per loop. Then, from the multicomputer specification sheet we can use the MIPS rating of the node pro- cessor P to find the computation time t. Recall that communication time under contention-free conditions can be specified by a linear function a+bL. The value of a and b are either known from the machine specification or can be easily de- termined. Hence, communication time under contention-free condition can be computed if we know L. The Applied node traffic, A“, can also be computed as - - bL It Wlll be ”7,15%. 0 Run the application on a small system say 9 or 16 nodes, carefully mapping the para the The 2. Co using the is not go becomes. good en0‘ If A2,. afiected. nodes are determinr nodes isr enough. 3.8 . The com] in this cl traffic, ar hound ca that the] ping l5 nc neutralize imize pat 76 parallel tasks if possible. In such a case contention will be negligible. Measure the total amount of data generated by a node L’, and the total run time T’. Then, A“ = bL’/T. 2. Compute the worst-case saturation node traffic under random mapping A3,“. using the tool. If A, is comparable to or greater than Ana“, then a random mapping is not good enough. Try an improved mapping, and recompute A33“. Once A3“, becomes at least a few percent greater than A“, then the mapping can be considered good enough. If A3,.“ is less than or equal to Am, the performance of most of the nodes is affected. 
However, if λ_sat^a is greater than λ_o but λ_sat^w is lower than λ_o, only a few nodes are affected by contention. The bound on individual node traffic can be used to determine the nodes that will be affected by contention. If the performance of those nodes is not critical to the overall performance, the mapping may be considered good enough.
3.8 Conclusion
The communication performance under contention for network channels was modeled in this chapter. The results show that message injection rates, measured by node traffic, are bounded for a given communication pattern and mapping, and that this bound can be predicted. The special case of random mapping was analyzed to show that the performance can be poor for large multicomputers, and hence random mapping is not advisable for them. Increased contention in mesh networks will effectively neutralize their main advantage, higher bandwidth channels. Careful mappings that minimize path contention levels can increase the saturation levels well above the applied node traffic of an application, and hence make the effect of contention negligible.
Chapter 4
EXPERIMENTS ON SYMULT 2010
In this chapter, we describe several experiments we conducted on a Symult 2010, a second-generation multicomputer that uses a mesh-connected, wormhole-routed network for interprocessor communication. Although the Symult computer is no longer in production, there are many lessons to be learned from this machine. For a long time, the largest machine available had only 64 nodes, but it was recently increased to 192 nodes. The experimental studies in this chapter are significant for several reasons.
• The experiments were run on a real system that uses a state-of-the-art, wormhole-routed network, so the results indicate the performance that can be expected from current technology at the user level.
• The experiments reveal the problems and bottlenecks in the current architecture, and indicate architectural areas to be examined in future designs for improved performance.
• The experimental results for communication under contention are analyzed and compared with our theoretical results. This comparison is done by expressing the performance in terms of the node traffic defined in the previous chapter.
• We discuss current architectural trends and how they will affect our conclusions. In particular, our results show that path lengths are no longer a problem, whereas the effect of contention for network channels will become an important concern, even though the effect is negligible in the small systems currently available.
The chapter is organized as follows. In Section 4.1, we briefly describe the Symult 2010's architecture and show how wormhole routing is implemented. In Section 4.2, we describe the experiments and present the results. The results give the network performance, both under contention-free conditions and under contention, at the user level. The results under contention are analyzed in detail in Section 4.3. We generalize the performance curves by expressing them in terms of the applied node traffic (defined in Chapter 2). In Section 4.4, we discuss the weaknesses and idiosyncrasies of the Symult 2010, and current architectural trends, in order to extend the results to the performance of future multicomputers. The discussion clearly shows that contention for network channels will become a serious concern in future multicomputers. Finally, we summarize our results.
4.1 Symult 2010 Architecture
Figure 4.1: The architecture of a node of the Symult 2010 (each network channel: 20 Mbytes/sec)
We briefly describe the architecture and salient features of the Symult 2010, as that will help in understanding our experiments and experimental results. The overall architecture of a node is shown in Figure 4.1. The two main components of a node are the router and the processing element (or PE). The PE includes a processor, local memory and an interface to the router. The router supports interprocessor communication through five bidirectional physical channels, one connected to each of its four neighbors in the 2D mesh topology and one connected to the local PE. We refer to a router-to-router channel as a network channel or rr-channel, and to a PE-to-router channel as a pr-channel. Each physical channel is 8 bits wide and operates at a peak rate of 20 MHz, giving a peak bandwidth of 20 Mbytes/sec. Traffic in opposite directions is supported by time-multiplexing the physical channel at the flit level, a flit being the data unit for communication between adjacent nodes. Hence, a physical channel can be thought of as supporting two logical, unidirectional channels in opposite directions. For each such unidirectional channel, the router has one buffer whose size is big enough to hold one flit.
The Symult 2010 uses wormhole routing, which was explained in Section 1.3.1. A message consists of a set of data flits preceded by two head flits, one for each dimension. The head flits indicate the number of hops needed along each dimension to reach the destination node. As soon as a router receives a message, it examines the head flit to determine whether the message is for the local PE. If it is, the head flit and the message flits following it will be forwarded to the local PE through the interface. Otherwise, the router will determine the next channel, and forward the flit if that channel is available.
Figure 4.2: Transfer of a single flit over one hop (data ready: the control line goes high; the flit is transferred from the outport of A to the inport of B; an acknowledgement from the inport of B to the outport of A completes the handshake and signals readiness to receive another flit)
A single control line is sufficient for each logical channel. The line is used both for handshaking and for flow control. Figure 4.2 shows the basic steps involved in a single hop of a data flit from one node to its adjacent node. The control line is normally low, indicating the readiness of the receiving router. Whenever data is to be transmitted, the sending router raises the line, indicating the availability of a new data flit. The receiving router then reads the data flit on the channel into its input buffer. The router then examines whether the next output port to which the data flit is to be routed is free. If it is free, the router transfers the flit to the output port, and lowers the control line to acknowledge to the sender the successful transfer of a data flit and to indicate its readiness to receive another data flit. Once the sending router detects the lowering of the control line, it marks the output buffer free to be used for another transfer. A router cannot transfer the received flit to the next output port if that port has been reserved by another message. In that case, the router will not lower the line, and hence will not complete the message transfer. Similarly, the sending node will not be able to use the output port for another transfer. Successive flits then get blocked along the partially established path, one flit per node.
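The handshake just described can be summarized in code. The following is a minimal sketch, not the Symult hardware logic: the names (send_flit, receive_flit and the channel and port structures) are ours, and the routing decision is abstracted away. It only illustrates how a single control line serves both as a data-ready strobe and as flow control, and how a reserved output port blocks the flits behind it.

    #include <stdbool.h>

    /* Illustrative model of one unidirectional logical channel (sketch only). */
    struct channel {
        bool line_high;      /* control line raised: a new flit is on the channel */
        unsigned char flit;  /* the 8-bit flit currently being transferred        */
    };

    struct port {
        bool reserved;       /* output port already reserved by another message?  */
        bool buffer_full;    /* one-flit buffer of the receiving router           */
        unsigned char buffer;
    };

    /* Sender side: offer a flit only if the previous one has been acknowledged. */
    static void send_flit(struct channel *ch, unsigned char flit)
    {
        if (!ch->line_high) {        /* line low: receiver is ready               */
            ch->flit = flit;
            ch->line_high = true;    /* raise line: "data ready"                  */
        }
    }

    /* Receiver side: accept the flit only if the next output port is free.
     * Lowering the line is both the acknowledgement and the flow-control signal. */
    static void receive_flit(struct channel *ch, struct port *next_out)
    {
        if (ch->line_high && !next_out->reserved && !next_out->buffer_full) {
            next_out->buffer = ch->flit;
            next_out->buffer_full = true;
            ch->line_high = false;   /* acknowledge; ready for another flit       */
        }
        /* If next_out is reserved, the line stays high and this flit, and the
         * flits behind it, remain blocked along the path, one per node.          */
    }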
Each router requires only a limited number of storage buffers, one per channel, to hold the flits of the messages passing through. All routing functions are implemented in hardware, so the PE is not involved in processing messages that are merely passing through. The time taken to process a newly arrived flit and to send it to the appropriate adjacent node is less than 100 ns. In order to avoid communication deadlock, the Symult 2010 uses a fixed (static) routing scheme. In this scheme, flits first travel along the rows until they reach the column of the destination node, and then travel along that column to reach the destination node.
4.2 Experiments
Experiments were run on Symult 2010 systems at Caltech under the Cosmic Environment / Reactive Kernel (CE/RK), a message-passing programming environment available on several multicomputers [75]. All test programs were written in the C language and used the run-time library supported by the CE/RK environment for interprocessor communication. We ran the experiments using 16 nodes (connected as a 4x4 2D mesh), 64 nodes (8x8) or 144 nodes (12x12). We start this section with a list of general comments, notation and precautions that were necessary to ensure the validity of the measurements. In our experiments, the following steps were taken to isolate the desired network performance from other system characteristics.
1. Once the programs were spawned onto the nodes, the host waited 5 seconds for spawning to complete before sending (broadcasting) the input parameters of the experiment. This delay ensured that the intensive traffic due to program spawning would end before the start of the experiments. If we had not waited, this extraneous traffic would have interfered with the messages generated by the experiments and affected the measured times.
2. In each experiment our aim was to find the average time taken by a node to complete a given program segment. Such a segment, called the base loop, consists of both local computation and communication with other nodes. The loops of the experiments require 0.5-15 ms, but the clock resolution is only 1 ms, so the base loop was repeated n′ times. Traffic due to the broadcast of the experimental parameters interfered with the communication in the initial loops. Hence, the first n loops were ignored. The clock reading at the beginning of the (n+1)-st loop was subtracted from the clock reading at the end of the n′-th loop, yielding the time for n′ − n loops, say T_{n′−n} ms. The average loop time, T, was then computed as T = T_{n′−n}/(n′ − n). In our experiments we found that n′ = 11,000 and n = 1000 were sufficient. We measured the loop time on each node in milliseconds by averaging the time for 10,000 loops.
3. The objective of the experiments was to measure the performance of the communication network and not of the Cosmic Environment functions that support messages. Hence, messages were created before the beginning of the first loop, and the same set of messages was used throughout the experiment. Messages were neither created nor destroyed once the first loop started. This strategy prevented any processing time by functions such as xalloc or xfree (which manage message buffers) from affecting the measured times.
4. Blocking receives were used in all experiments to capture the delays in the communication. The design of the base loop ensured that receiving nodes were always ready to receive by the time a message arrived at the node.
This technique prevented any program-generated delays in accepting a message from appearing as communication delays.
5. The kernels were different on different machines, and were modified during the course of the experiments. Changes due to kernel performance were avoided as follows. Many experiments were performed in pairs, one after the other; the first tested the inclusion of a specific factor under study, while the other tested its exclusion. We then calculated the difference in communication times. Timings measured on different days were never compared.
6. Each node computed the average loop time as described above, using its own clock for each reading. This local timing avoided any problems due to the variance in different clock readings. In addition, each node executed the same program. In some experiments, the mean of the loop times determined by the individual nodes was used as the final loop time, to avoid idiosyncrasies of individual node hardware (or kernel).
We use the following notation. The system consists of N nodes connected as a √N x √N 2D mesh. Nodes are numbered 0..N−1, and rows and columns are numbered 0..√N−1. Node (i,j) refers to the node located in the i-th row and the j-th column.
Definition 32 Two nodes which communicate with each other during an experimental run are called peer nodes.
In most experiments a node (i,j) will communicate with only one other node, and hence has a unique peer node. A communication pattern is specified by defining the peer nodes. The positions of the peers were changed to form different communication patterns, causing changes in the demand on the network. In all the experiments, the base loop consisted of communication with peer nodes followed by local computation. Communication is represented by two functions, tsend and trecv.
• tsend(dst,msg,L): Send a message of length L bytes, stored starting at the address msg, to the node dst. The function is non-blocking.
• msg = trecv(src): Receive a message from the node src. The message will be put at the address msg when the call returns. This function is a blocking receive, so the call will not return until the message arrives.
In all experiments the length of the message is the same in each base loop. The computation times of different loops are uniformly distributed over the interval 0..2t_avg. We used such a distribution to reflect the fact that in actual applications the computation times may vary depending on the contents of the received messages. Hence, t_avg is the average computation time per loop, and is the same for every node. On the other hand, the communication time per loop differs from node to node, as messages from different nodes are affected differently by contention and path length. The communication time per loop is denoted by τ_i for the i-th node. Since the communication time differs for different nodes, the loop times also differ, and are denoted by T_i = t_avg + τ_i for the i-th node. The loop times for an average and a worst node, denoted by T_avg and T_max, are defined as
    T_avg = (1/N) Σ_{i=0}^{N−1} T_i   and   T_max = max_i T_i.    (4.1)
The communication time per loop for an average and a worst-case node can be computed by subtracting t_avg from the corresponding loop times.
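The measurement protocol of precaution 2 and Equation 4.1 can be summarized in a short sketch. This is not the actual test program: the clock routine (node_clock_ms) and the loop body are hypothetical placeholders, and n, n′ correspond to the 1,000 warm-up loops and 11,000 total loops used in our runs.

    /* Sketch of the per-node timing harness (assumed clock routine, 1 ms resolution). */
    extern double node_clock_ms(void);        /* hypothetical local clock               */
    extern void   base_loop(void);            /* communication followed by computation  */

    double measure_avg_loop_time(int n, int n_prime)
    {
        double start = 0.0, end;
        int i;

        for (i = 0; i < n_prime; i++) {
            if (i == n)                       /* ignore the first n loops               */
                start = node_clock_ms();      /* clock at the start of loop n+1         */
            base_loop();
        }
        end = node_clock_ms();                /* clock at the end of loop n'            */
        return (end - start) / (n_prime - n); /* T = T_{n'-n} / (n' - n)                */
    }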
In each experiment, nodes are partitioned into three classes: I, A and B. Nodes belonging to class I are idle, and do not participate in the experiment. The remaining nodes belong to either class A or class B, and execute the corresponding base loop repeatedly. A node and its peer usually belong to different classes.
The experiments are explained in three sections. In Section 4.2.1, we measure the performance of contention-free communication between adjacent nodes, which is the best an application programmer can achieve. The times in this section are affected neither by longer path lengths nor by contention for channels and data paths. In Section 4.2.2 we concentrate on the effect of path lengths: non-adjacent nodes communicate, but contention is still carefully avoided. In Section 4.2.3, several experiments are described which involve communication under contention.
4.2.1 Contention-free Adjacent-node Communication
In the first set of experiments, communication is restricted to physically adjacent nodes. The communication pattern is defined by (Figure 4.3)
    peer of node(i,j) = node(i, j ⊕ 1)   (⊕ denotes the bitwise exclusive-or function)
    class of node(i,j) = even(j) ⇒ A, odd(j) ⇒ B.
None of the nodes are idle, i.e. none belong to class I. The N active nodes of the system are partitioned into N/2 pairs, and these pairs do not interfere with each other's communication. In the base loop, a message of L bytes is sent from an A node to its peer, and then a message of the same size is sent back from the peer (Figure 4.3).
    Class A node:                 Class B node:
      tsend(peer,msg,L);            msg = trecv(peer);
      msg = trecv(peer);            tsend(peer,msg,L);
      compute(t);                   compute(t);
Figure 4.3: Contention-free adjacent-node communication (communication pattern and base loop)
Contention is carefully avoided. First notice that at any time a node will be either sending or receiving. Thus there is traffic along only one direction on the pr-channels. The communication between the various pairs does not interfere. Also, on each network channel, the traffic is along only one direction at any instant. In addition, both peers compute for the same amount of time after completing their communication. This strategy ensures that the receiving node will always be ready for the incoming message.
The communication time should be approximately the same for all the nodes, as there is no contention and all path lengths are unity. Each loop involves the communication of two messages sequentially. Hence, the communication time per message, τ_m, will be half the communication time per loop. We have
    τ_m = τ/2 = (T − t_avg)/2.    (4.2)
The experiment was repeated for different message lengths; the communication time per message is plotted as a function of the message length in Figure 4.4. The plot shows the mean value of τ_m computed over 16 nodes. Even though different pairs could progress at their own rates, the communication time per message for different nodes was not significantly different. The maximum deviation of τ_m from the mean value is less than 4 μs for messages up to 8 Kb long. This small variation demonstrates that the communication performance of the different nodes (hardware and kernel) is almost identical.
These experiments show that the communication time per message can be specified as a linear function of the message length L, i.e. a + bL. Here, a is the fixed overhead and 1/b is the effective bandwidth. We observe that the fixed overhead is still significant in this second-generation machine, about 210 μs. Our measured value is slightly larger than the value of 177 μs reported recently [75].
The effective bandwidth is 1/0.0577 ≈ 17 Mbytes/sec for small (≤ 256 bytes) messages, which is about 87% of the channel bandwidth (20 Mbytes/sec). Larger messages are sent as packets of 256 bytes. The packetization is enforced¹ to prevent long messages from blocking the transfer of small messages. We observe in Figure 4.4.B that packetization has a noticeable impact on the effective bandwidth. The packetization overhead reduces the effective bandwidth to 1/0.1536 ≈ 6.5 Mbytes/sec. This is again slightly lower than the claimed value of 1/0.11 ≈ 9 Mbytes/sec [75]. Communication of longer messages is about 17/6.5 ≈ 2.6 times slower than for smaller messages.
¹In fact, there is no way to avoid this packetization in the current system.
Figure 4.4: Communication time for contention-free adjacent-node communication (A: short messages, L ≤ 256 bytes, fitted by 210.9 + 0.0577L μs; B: long messages, L > 256 bytes, fitted by 210.2 + 0.1536L μs)
4.2.2 Contention-free Non-adjacent-node Communication
In this section, we examine the performance when communication is between nodes that are not physically adjacent. Contention for both rr-channels and pr-channels is to be avoided so we can concentrate on the effect of longer path lengths. We achieve that goal by allowing only the nodes along the leading diagonal to be active. The communication pattern is defined by (Figure 4.5)
    peer of node(i,j) = node(√N−i−1, √N−i−1)
    class of node(i,j) = (i ≠ j) ⇒ I, (i < √N/2) ⇒ A, (i ≥ √N/2) ⇒ B.
Thus, we have √N/2 pairs communicating, avoiding any interference between their communications. The result is no contention for network channels. The base loop is the same as shown in Figure 4.3. Hence, contention for pr-channels is avoided as before.
Figure 4.5: Contention-free non-adjacent-node communication (A: communication pattern; B: loop times for the different nodes, in ms)
The path lengths for the various pairs are 2, 4, 6, ..., 2√N−2. If longer path lengths increase delays, we would expect to see the communication time (and hence the loop time) increase as we move from node (√N/2, √N/2) to node (0,0). The average loop times for the diagonal elements of a 12x12 2D mesh are shown in Figure 4.5.B. The times are for 4 Kb messages and 450 μs average computation time per loop. The times do not show any dependency on path length. Figure 4.6 shows how the time for a single message varies with path length for different message sizes. The change is less than ±4 μs. Since such variations existed even for adjacent-node communication, we conclude that the effect of path lengths can be ignored in current wormhole-routed networks².
²The times here are slightly different from the values given in Section 4.2.1, as the systems were different and so were the CE kernels.
Figure 4.6: Communication time per message as a function of path length (for several message sizes)
4.2.3 Communication Under Contention
In this section, we describe several experiments designed to study the effect of contention on the communication time. We establish communication patterns so messages have to compete for network channels (rr-channels) and/or PE-to-router channels (pr-channels). Unlike the experiments in the previous two sections, we find that the communication time increases exponentially as the computation time decreases.
Contention for Network Channels
In order to study the effect of contention for network channels, we need a communication pattern in which the nodes are affected differently by contention, and in which node pairs can progress independently of each other. Under these conditions, it is possible to measure the effect of contention by computing the difference in loop times. We call the pattern used here the matrix-transpose pattern (Figure 4.7.A). In this pattern, all nodes along the leading diagonal are idle, and the others are active. The communication pattern is defined by
    peer of node(i,j) = node(j,i)
    class of node(i,j) = (i = j) ⇒ I, (i > j) ⇒ A, (i < j) ⇒ B.
We have √N(√N−1)/2 pairs communicating. The base loop is identical to that of the earlier experiments, and there is no contention for pr-channels. The experiment is more demanding on the network because the peers are widely separated, and the messages between them result in severe contention for network channels.
An examination of Figure 4.7.A provides an idea of the distribution of contention due to communication in the matrix-transpose pattern. To construct that contention pattern, we have utilized the fact that Symult 2010 systems use a fixed routing scheme. Messages travel along the rows first, and then along the columns. The channels connected to the two corner nodes (0,0) and (√N−1,√N−1) have the maximum loads (√N−1), whereas the channels connected to the other two corner nodes have the lowest loads (1). The loads on the other channels vary between those two extremes.
Figure 4.7: Communication under contention for network channels (A: communication pattern; B: loop times for the different nodes, in ms)
Logical path lengths (equal to 1/2 the physical path length) increase as we move away from the leading diagonal. Hence, the logical path length is highest for the corner nodes (0,√N−1) and (√N−1,0), and is equal to √N−2. Path contention levels are higher for upper (lower) row nodes if we consider nodes above (below) the leading diagonal, and vary from 0 to √N−2.
The loop times for a specific case, when the message length is 4 Kb and the mean computation time is 150 μs, are shown in Figure 4.7.B. Observe that nodes that are away from the leading diagonal have higher loop times.
The observed distribution can be explained. The Symult 2010 does not use priority queues. Hence, nodes whose paths have a longer logical path length are affected more. For example, a message from node (0,√N−1) to its peer node (√N−1,0) must compete for each of the √N−1 horizontal segments (channels) of its path. On the other hand, a node near the principal diagonal, such as node (0,1), only has to compete for the one channel between nodes (0,0) and (0,1). Hence, the communication time, as well as the loop time, is higher for node (√N−1,0). Later we show that the path contention level still determines the increase in communication time for an average and a worst-case node.
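The channel loads of the matrix-transpose pattern can be counted directly. The following sketch is not part of the measurement programs; it assumes the row-first, column-second fixed routing described above and uses our own naming, simply to show that the row channels next to corners (0,0) and (√N−1,√N−1) carry √N−1 messages while those next to the other two corners carry one.

    /* Count, for every horizontal (row) channel, how many matrix-transpose
     * messages use it under row-first dimension-order routing.  A channel
     * (row,c) lies between nodes (row,c) and (row,c+1).  Sketch only;
     * vertical channels are handled in the same way.                          */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int sn = 12;                               /* sqrt(N), e.g. a 12x12 mesh */
        int *load = calloc((size_t)sn * (sn - 1), sizeof *load);
        if (!load) return 1;

        for (int i = 0; i < sn; i++)
            for (int j = 0; j < sn; j++) {
                if (i == j) continue;              /* diagonal nodes are idle    */
                /* message (i,j) -> (j,i): move along row i from column j toward
                 * column i, marking every horizontal channel crossed            */
                int lo = i < j ? i : j, hi = i < j ? j : i;
                for (int c = lo; c < hi; c++)
                    load[i * (sn - 1) + c]++;
            }

        printf("load on channel (0,0)-(0,1):   %d\n", load[0]);          /* sn-1 */
        printf("load on channel (0,%d)-(0,%d): %d\n", sn - 2, sn - 1,
               load[0 * (sn - 1) + (sn - 2)]);                            /* 1    */
        free(load);
        return 0;
    }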
Let T_avg and T_max denote the average and maximum loop times taken by any node in this experiment (refer to Equation 4.1). Let T′_avg and T′_max be the corresponding values when the peers are mapped onto adjacent nodes, as in the experiment of Section 4.2.1 (note that T′_avg ≈ T′_max). The increase in loop time for an average node is ΔT_avg = T_avg − T′_avg, and for the worst node it is ΔT_max = T_max − T′_max. The increase in communication time, Δτ, will be equal to the increase in loop time, ΔT, because the computation time for any loop is fixed.
The increase in communication time per loop for an average and a worst-affected node is plotted for various system sizes in Figure 4.8. Note that decreasing the average computation time per loop has the effect of increasing the communication time. Our results show that the communication time increases exponentially with the communication-to-computation time ratio. The increase is greater for larger system sizes and longer messages.
Figure 4.8: Increase in communication time per loop as a function of the average computation time per loop (A: 4 Kb messages; B: 8 Kb messages; N = 16, 64 and 144; average-case and worst-case nodes)
Since we showed in Section 4.2.2 that path lengths cannot account for that exponential increase in communication time, we conclude that the increase in communication time here is due to contention for network channels. The pattern in the loop times (Figure 4.7.B) further confirms that the increase can only be due to contention for network channels. Although the results show that the increase in communication time is significant only for large messages and very high communication-to-computation ratios, the analysis in Section 4.3 shows that these delays can occur for smaller messages and more reasonable communication-to-computation ratios in future, larger systems.
Contention for PR-channels
We now examine the effect of contention for the PE-to-router channel, the pr-channel. Contention for pr-channels will occur if a node receives and/or sends several messages simultaneously³. Such contention was carefully avoided in all our previous experiments. Contention for pr-channels is more serious in current systems than contention for network channels. For many common communication patterns, contention for pr-channels can occur and increase the communication times significantly.
³Multiple, simultaneous sends and receives will also result in contention for any node hardware involved in sending or receiving a message, including the CPU.
The architecture of a node (shown in Figure 4.1) indicates the problem. All five channels (four rr-channels and one pr-channel) have the same bandwidth. If messages arrive on more than one rr-channel at a given node, the messages must compete for the single pr-channel and create a bottleneck. This pr-channel bottleneck can slow down an incoming message by a maximum factor of eight. We use variants of the popular near-neighbor communication patterns here to demonstrate the effects of contention for the pr-channel. The communication patterns are shown in Figure 4.9.
In near-neighbor patterns, a node typically exchanges data with one or more neighbors in each loop. Consider the experiment in Section 4.2.1. In that experiment, each node exchanged (sent and received) a message with its peer in each loop. Contention was avoided for both rr-channels and pr-channels. Let us call that pattern the AB-1NN pattern⁴. The base loop is the same as given in Figure 4.3. If we now make all nodes in Figure 4.3 belong to class A, then each node will first initiate a send and then receive the incoming message from its peer (Figure 4.9). We call this pattern the AA-1NN pattern. In this pattern, a node will be simultaneously sending and receiving. Traffic exists in both directions on the pr-channels as well as the rr-channels. Since messages are progressing concurrently, we might expect the communication time per loop to be half of that for the AB-1NN pattern. However, the measured times indicate that it is almost the same as for the AB-1NN pattern (Table 4.1). We find that if traffic exists in both directions, the communication time for a message is increased by a factor of 2.
⁴One Near Neighbor; the peer nodes belong to classes A and B.
Instead of sending to and receiving from one neighbor, we next consider the case where a node communicates with all four of its neighbors in each loop, a more common situation. Typically, a node sends the same message to all four neighbors in each loop. The four-neighbor pattern which avoids contention for network channels is given in Figure 4.9 (AB-4NN pattern). The base loops of the A and B nodes are given in Figure 4.10. A node (i,j) belongs to class A (B) if i+j is even (odd). Even though contention for network channels is avoided, four messages will be contending for the pr-channel. The communication time for the AB-4NN pattern is four times that of the AB-1NN pattern. Hence, the communication time per loop is eight times that for a single message. As in the case of the 1NN pattern, we can place all nodes in class A to create the AA-4NN pattern.
Figure 4.9: Near-neighbor communication patterns
    A node:                          B node:
      for each neighbor do             for each neighbor do
        tsend(neighbor,msg,L);           msg = trecv(neighbor);
      for each neighbor do             for each neighbor do
        msg = trecv(neighbor);           tsend(neighbor,msg,L);
      compute(t);                      compute(t);
Figure 4.10: Base loop for the four near-neighbor (4NN) communication patterns
    Msg len      AB-1NN           AA-1NN           AB-4NN            AA-4NN
    (bytes)    τ1    τ1/τm      τ2    τ2/τm      τ3     τ3/τm      τ4     τ4/τm
       1      .412   2.00      .385   1.87      2.422  11.75      2.269  11.01
      515     .687   2.00      .681   1.98      3.425   9.97      3.059   8.90
     1027     .791   2.00      .785   1.98      3.782   9.56      3.683   9.31
     4099    1.767   2.00     1.766   1.99      7.648   8.65      7.508   8.49
     8196    3.306   2.00     3.295   1.99     12.938   7.82     12.793   7.73
Table 4.1: Communication time per loop for the near-neighbor patterns, in ms (τ_m is the time for a single message under contention-free conditions)
The loop times for the AA-4NN pattern are given in Table 4.1. Again, we find little difference between the loop times of AA-4NN and AB-4NN, even though nodes in the AA-4NN pattern initiated all sends before receiving any messages in a loop. As before, the problem is due to the contention for the pr-channel by the four incoming and the four outgoing messages.
The experiments in this section lead us to the conclusion that the pr-channel, or the node architecture, can become a serious performance bottleneck in current systems. One solution is to increase the bandwidth of the pr-channel, and to design the memory with sufficient bandwidth so that multiple incoming and outgoing messages can be handled without any degradation in performance. Another solution is to improve the router to provide multicast communication support. This support will be useful as a node will often be trying to send the same message to more than one node. In that case, a single message on a pr-channel could generate multiple copies of itself on the network channels, and significantly reduce the bandwidth requirements of the pr-channel.
4.3 Analysis of Experimental Results
In Section 4.2.3, the experimental results showed that contention for network channels can increase communication times exponentially. The increase in communication time was significant (> 10 μs) only for messages larger than 1 Kb. In this section, we analyze the results further to examine whether the increase in delays can be significant in future systems even for smaller messages and higher computation times per loop. Theoretical results in Chapter 2 indicated that the normalized communication time should be a function of the applied node traffic. We also verify whether the experimental results agree with the theory. To convert the experimental data, we need to take a look at the various components of the loop time.
Consider the components of the loop time for contention-free communication for the base loop shown in Figure 4.3 (refer to the experiments in Section 4.2.1). Figure 4.11 shows the components for a specific instance, when the message length is 4 Kb and the average computation time per loop is 1 ms. The communication time per loop includes the time to send a 4 Kb message from an A node to its peer (A-B communication) and the time for another 4 Kb message from the peer back to node A (B-A communication). Each message was shown in Section 4.2.1 to take 210.2 + 0.1536x4096 = 839 μs. Of this, 210 μs is fixed overhead and 629 μs is needed to send the 4096 bytes as 16 packets of 256 bytes each. Each packet therefore takes 629/16 = 39.3 μs. The data of an individual packet is communicated over the network channels at the rate of 0.0577 μs/byte, or 0.0577x256 = 14.8 μs per packet, as measured for smaller messages (Figure 4.4). These times are shown as shaded regions in Figure 4.11. The remaining time, 39.3 − 14.8 = 24.5 μs, is the overhead to process each packet.
Figure 4.11: Components of the loop time for 4 Kb messages (A-B and B-A communication: τ_m = 0.839 ms each; communication time τ = 1.678 ms; average computation time t_avg = 1 ms; loop time T = 2.678 ms; each of the 16 packets takes 14.8 μs of data transfer and 24.5 μs of overhead)
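The arithmetic behind Figure 4.11 can be reproduced in a few lines. This is only a back-of-the-envelope sketch using the measured constants from Section 4.2.1 (a = 210.2 μs and b = 0.1536 μs/byte for long messages, and 0.0577 μs/byte of raw channel time); the variable names are ours.

    #include <stdio.h>

    int main(void)
    {
        const double a      = 210.2;    /* fixed overhead per message, us          */
        const double b      = 0.1536;   /* effective time per byte, long messages  */
        const double b_chan = 0.0577;   /* raw per-byte channel time, us           */
        const int    L      = 4096;     /* message length, bytes                   */
        const int    pkt    = 256;      /* packet size enforced by the system      */
        const double t_avg  = 1000.0;   /* average computation time per loop, us   */

        double per_msg   = a + b * L;             /* ~839 us                       */
        int    npkts     = L / pkt;               /* 16 packets                    */
        double per_pkt   = (per_msg - a) / npkts; /* ~39.3 us                      */
        double data_pkt  = b_chan * pkt;          /* ~14.8 us of channel time      */
        double ovhd_pkt  = per_pkt - data_pkt;    /* ~24.5 us of packet overhead   */
        double loop_time = t_avg + 2.0 * per_msg; /* ~2678 us, as in Figure 4.11   */

        printf("per message %.1f us, per packet %.1f us "
               "(%.1f data + %.1f overhead), loop %.1f us\n",
               per_msg, per_pkt, data_pkt, ovhd_pkt, loop_time);
        return 0;
    }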
An understanding of the components of the loop time indicates how to express the experimental data in terms of applied node traffic and normalized communication time. For contention-free communication we have, by Equation 4.2, T = t_avg + 2τ_m = t_avg + 2(a + bL), where a and b are system-dependent constants whose values can be easily determined (Section 4.2.1). The values for the Symult 2010 are shown in Figure 4.4. Hence, we find that for our base loop the applied node traffic (Definition 24) is
    λ_o = (L/B) / (t_avg + 2(a + bL)).    (4.3)
Observe that the applied node traffic depends only on the characteristics of a parallel task (L, t_avg) and the machine-dependent constants (B, a, b). The significance of λ_o is that under contention-free conditions the pr-channel of a node will be busy injecting data into the network for a fraction λ_o of the time. For the example in Figure 4.11, λ_o = 204.8/2678 ≈ 7.64%.
We can also normalize the communication time per message under contention to the communication time under contention-free conditions less the fixed overheads, bL. The results of the experiments (Section 4.2.3) shown in Figure 4.8.A give the increase in communication time per loop, Δτ. Since each loop (Figure 4.3) involves the communication of two messages, the increase in communication time per message will be Δτ_m = Δτ/2. The normalized communication time is therefore given by
    (bL + Δτ_m)/bL = (bL + Δτ/2)/bL.    (4.4)
The results given in Figure 4.8.A for a 144-node system are presented in terms of applied node traffic and normalized communication time in Figure 4.12. Figure 4.12 shows that for the same applied node traffic, the increase in communication time is greater for larger systems. We also observe that some nodes (such as the worst-case node) perform much worse than an average node. The behavior can be easily explained. The maximum path contention level for the matrix-transpose pattern is ν_max = √N − 2, and the maximum channel load is χ_max = √N − 1. For a 144-node system, √N − 2 = 10. Hence, from Equation 3.11 we find that the saturation node traffic for the worst-case node is λ_sat^w = 1/(10 + 1) = 9.1%. A 9.1% node traffic will saturate physical channels, as χ_max = 11. We can therefore expect a dramatic increase in communication time when the applied node traffic is higher than λ_sat^w = 9%.
Figure 4.12: Normalized communication time per message for different system sizes (matrix-transpose pattern, 4 Kb messages; N = 16, 64 and 144; average-case and worst-case nodes)
Figure 4.13 gives the normalized communication time per message for three different message lengths for a 144-node system using the matrix-transpose pattern. The graph clearly shows that the normalized communication time is a function of the applied node traffic and is independent of the message length. The plots in Figure 4.13 stop at λ_o = 13% because overheads and architectural bottlenecks in the Symult 2010 limited the applied node traffic. If the fixed overheads and architectural bottlenecks are removed in future systems, the applied node traffic will increase for the same message length and computation time (as the values of a and b in Equation 4.3 are reduced). Such an increase in applied node traffic will seriously degrade the communication performance.
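Equation 4.3 and the saturation bound of Equation 3.11 are simple enough to evaluate directly. The following sketch, using the measured Symult constants and our own function name, shows how the fixed overheads cap the applied node traffic for sub-kilobyte messages and where the 9.1% worst-case saturation level of the 144-node matrix-transpose experiment comes from.

    #include <stdio.h>
    #include <math.h>

    /* Applied node traffic of the two-message base loop, Equation 4.3 (sketch). */
    static double applied_node_traffic(double L, double t_avg_us)
    {
        const double B = 20.0;                    /* channel bandwidth, bytes/us  */
        double a = (L <= 256) ? 210.9 : 210.2;    /* fixed overhead, us           */
        double b = (L <= 256) ? 0.0577 : 0.1536;  /* per-byte time, us            */
        return (L / B) / (t_avg_us + 2.0 * (a + b * L));
    }

    int main(void)
    {
        /* Even with no computation at all, the overheads keep small messages
         * well below the worst-case saturation level of a 144-node system.      */
        double sat_w = 1.0 / ((sqrt(144.0) - 2.0) + 1.0);  /* 1/(nu_max+1) = 9.1% */

        printf("saturation (worst case, N=144): %.1f%%\n", 100.0 * sat_w);
        printf("lambda_o, L=256,  t=0:     %.1f%%\n", 100.0 * applied_node_traffic(256, 0));
        printf("lambda_o, L=512,  t=0:     %.1f%%\n", 100.0 * applied_node_traffic(512, 0));
        printf("lambda_o, L=4096, t=1 ms:  %.1f%%\n",
               100.0 * applied_node_traffic(4096, 1000));  /* ~7.6%, Figure 4.11  */
        return 0;
    }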
Figure 4.13 also shows why we do not observe the effects of contention for messages of less than 1 Kb on a Symult 2010. In that case, the overheads limit the applied node traffic to less than 6% for small messages, reducing the percentage increase in communication time. In addition, for small messages the communication time τ_m is also small. Hence, the absolute increase in communication time, Δτ_m, will be too small to be observable. However, with reduced overheads, increased applied node traffic is possible, and we will observe the effect of contention even for small messages.
Figure 4.13: Normalized communication time per message for different message lengths (matrix-transpose pattern, 144-node system; L = 1 Kb, 4 Kb and 8 Kb; average-case and worst-case nodes)
One may argue that even such an increase in communication time may not affect the overall loop time significantly, since communication is only a small part of the loop time. We could plot the percentage increase in loop time instead of the percentage increase in communication time. However, an interesting and useful way to present the effect of contention on the overall performance is to plot the actual node traffic (Definition 26) as a function of the applied node traffic (Definition 24). An increase in communication time will increase the loop time, which will reduce the actual node traffic. The results given in Figure 4.13 are presented in terms of actual and applied node traffic in Figure 4.14.
Figure 4.14: Actual node traffic as a function of applied node traffic for different message lengths (matrix-transpose pattern, 144-node system)
If the effect of contention on the overall performance is negligible, the actual node traffic, λ_a, should be approximately the same as the applied node traffic, λ_o. The curves in Figure 4.14 clearly show that for a 144-node system the actual node traffic begins to saturate at about 9% applied node traffic for the worst-case node, as expected. The effect of contention on the overall performance (or loop time) will then be visible if a node attempts to generate traffic that is more than the saturation level. Applications can never achieve a node traffic higher than the saturation level. The level at which the actual node traffic saturates depends in general on the communication pattern, the mapping and the size of the system, as elaborated in the previous chapter.
4.4 Discussion
In the previous section we saw that although path lengths are not a problem in wormhole-routed networks, contention can be a serious problem. Our experiments were conducted on Symult 2010 systems, which may not be a typical representative of the class of 2D mesh-connected, wormhole-routed networks. In this section we discuss some of the architectural idiosyncrasies and drawbacks of the Symult 2010 system which may not be present in future systems, and extend our results to such systems.
1. Processor-to-router channel and node architecture: In a Symult 2010, the processor-to-router channel (pr-channel) bandwidth is the same as that of the network channels.
Hence, the pr-channel can become a bottleneck for several communication patterns, and it limits the peak rate at which a node can inject traffic into the network. It is possible to increase the bandwidth between the router and the processor by increasing either the channel width or the channel rate. Ideally, four times the bandwidth of the network channels in each direction should be sufficient to eliminate this bottleneck. In addition, the node and its kernel should be designed to handle multiple incoming and/or outgoing messages. Future designs (such as iWarp [14]) are increasing the bandwidth between the router and the PE. With these improvements, however, higher applied node traffic is possible and the problem of contention for network channels can become increasingly visible.
2. System size: At present, the largest Symult system has 192 nodes. However, system sizes are expected to be as high as 1K nodes or more in the near future. Applied node traffic is a characteristic of the parallel program, and parallel programs should be designed to be scalable. Hence, the applied node traffic remains the same as the system size increases. As we saw earlier, the utilization of some channels can increase to very high levels due to nonuniform traffic, even though the applied node traffic remains reasonable. Longer path lengths in larger systems will force a message to compete for more channels before successfully establishing a path for communication. With larger system sizes, interference between traffic due to different users (or applications), and between system and user traffic, will also increase. Hence, the contention effect will be increasingly visible in larger systems, and contention may have to be studied seriously if inefficient performance is to be avoided.
3. Fixed overhead per message: In current systems, the fixed overhead per message is about 210 μs, which is quite high compared to the network delays. It is possible to reduce this overhead further by a careful design of the node architecture, a good interface between the communication and the computation processors, and restricting the flexibility of the message primitives. Some of these improvements are being implemented in the latest multicomputers [14, 30]. As explained in the previous section, reduced overheads can increase the peak applied node traffic, and hence the effect of contention. Even with current overheads, which limit the peak node traffic to about 12%, the effect of contention will be visible in large systems, as saturation may occur at less than 10% node traffic.
4. Packetization: In a Symult 2010, messages larger than 256 bytes are sent as 256-byte packets. Packets were introduced to prevent large messages from affecting the communication of smaller messages due to contention for channels. However, packetization introduces additional software overhead, which causes larger messages to progress 2.7 times slower than smaller messages. In addition, there is no way a user can prevent this, even if he is sure that he is avoiding contention and is not interfering with the communications of other applications. Increasing communication times by a factor of 2.7 may reduce the overall performance significantly and may not be acceptable. Such packetization cannot be very effective if channels are getting saturated. Turning off packetization will increase the applied node traffic, and hence the effect of contention. However, we may achieve a significant speedup if we turn off packetization and carefully avoid contention.
5. Multicast and broadcast support: At present, support for multicast and broadcast communication is very inefficient. In fact, if the host needs to broadcast a message to all nodes, it simply sends a copy to each node. Only one node is connected to the host, and all the messages from the host to the nodes must pass through a single channel. This implementation slows down the broadcast by a factor of N. Similarly, if a node has to send the same message to more than one node, it has to make copies of the message and send one copy to each node. Architectural support for broadcast and multicast would be useful. However, such support will also increase the network traffic for a given applied node traffic, and hence the problem of contention for network channels.
6. Wider data paths: Future systems are being designed to have wider data paths. As the width of the channels increases, the network delays become smaller, which will reduce the node traffic and the utilization of the channels for a given application. Hence, increased width is one main factor which may reduce the applied node traffic, and hence the problem of contention. Detailed studies are needed to determine whether increases in channel width can compensate for all the earlier factors which increase the problem of contention.
We see that a number of trends in network architecture indicate that the problem of contention for network channels will become serious in the near future. It is not clear whether the higher bandwidth possible in mesh networks can compensate for these changes and enable mesh networks to perform better than hypercube networks.
4.5 Conclusion
The results of our experiments, and the analysis of these results, indicate the performance of a state-of-the-art mesh-connected, wormhole-routed interprocessor communication network. The communication time per message under contention-free conditions is one of the best any implementation has achieved, and is given by the linear function 210.9 + 0.0577L μs (210.2 + 0.1536L μs if L > 256), where L is the message length. This expression is accurate to within ±4 μs, even for paths of length 22. Hence, with current technology we find that the effect of path length on communication performance can be ignored.
However, contention for network channels may turn out to be the new problem, as predicted by theory. Contention is a more serious problem than path length, as it can result in an exponential increase in the communication time once the communication intensity of an application reaches a certain threshold. In the Symult 2010, the effect of contention is significant (> 10 μs) only for large systems (> 100 nodes) and for long (> 1 Kb) messages. Large overheads limit the applied node traffic to less than 13%, and hence the effect of contention in the small systems currently available. However, our analysis indicates that the effect of contention may be visible even at applied node traffic of less than 10% for large systems not yet available, so the problem of contention can become serious in the near future.
Improvements are needed in the current architecture to reduce overheads, to remove bottlenecks in the node architecture, and to provide efficient broadcast/multicast support. Future multicomputers under development are already planning to provide some of these improvements. Although such improvements will improve the network performance for contention-free communication, they will also increase the problem of contention for network channels.
The network performance will become more sensitive to contention.
Chapter 5
NETWORK SIMULATOR
5.1 Introduction
We decided to develop a high-performance simulator that can be used to study the performance of large networks. Such a simulator is needed for several reasons. Experimental studies on systems larger than 200 nodes are not possible, as such large multiprocessor systems that use wormhole routing or circuit switching are not yet available. Such large networks and systems are, however, expected in the near future. In addition, experiments reveal only the characteristics of a specific system or network, and may fail to highlight the characteristics common to all wormhole-routed networks. The theoretical analysis in the earlier chapters clearly showed that contention can be a serious problem in large networks, and that appropriate mapping can result in a significant improvement in performance. Simulation studies can be used to confirm these theoretical results. If the results of the theory and the simulations agree, then that would also validate our simulator. Once the simulator is validated, it can be a useful tool to investigate various cases that are difficult to model and analyze by theory.
To our knowledge, there is no published literature that describes the design of a general-purpose, efficient wormhole-routed network simulator that can simulate large networks in reasonable time. We were, therefore, motivated to design a simulator which can be used to study the performance of large, wormhole-routed networks. We had the following goals for our simulator.
• General purpose: The simulator was not intended to study any particular multiprocessor system or network. It was intended to model the fundamental characteristics of a wormhole-routed network, and to be sufficiently general that we can study the performance characteristics common to all wormhole-routed networks.
• High performance: We wanted the simulator to be fast enough that we can use it to study large networks having hundreds or thousands of nodes.
• Flexible: It should be easy to modify the simulator to include specific characteristics of a machine, so that a specific machine can be simulated more accurately if needed.
• Portable: The simulator should run on any computing resource available; it is not designed to run on a specific machine or to require special support from the operating system.
We describe how we managed to meet our goals, and show that our simulator can be useful in understanding the behavior of modern wormhole-routed networks. In this chapter, we concentrate on the performance of 2D mesh networks, as they have the potential for high performance and are also significantly affected by contention. However, the simulator can easily be modified (as explained later in the chapter) to study the performance of networks having other topologies.
In Section 5.2, we describe the design of the simulator kernel and explain how we could make it simulate all the essential features of a wormhole-routed network and still keep it simple and efficient. Additional code, called the simulator shell, needs to be written to complement the kernel and complete the simulator. We describe the design of a simulator shell to study the effect of mapping on the performance of the network in Section 5.3. In Section 5.4, we describe how we validated our simulator by comparing its outputs with results from experiments on a real system and from theoretical analysis.
We present the simulator results for several experiments in Section 5.6; these demonstrate the usefulness of the simulator in understanding the behavior of wormhole-routed networks. Details of the performance of the simulator on a Sun Sparc workstation are given in Section 5.5. The results show that our simulator can give useful results in reasonable time, even for large systems. Finally, we conclude and mention the limitations of our simulator, and possible improvements to make it more powerful and flexible.
5.2 Simulator Design: Kernel
The major challenge in designing the simulator was to make it fast enough that we can simulate large networks in reasonable time and get useful results. After careful study, we chose to implement the simulator in C, a high-level language, rather than in a simulation language. The network model we simulate is simple, so we do not require the power of a simulation language, and we can avoid its processing overheads. Simulators using a language such as CSIM have been found to be too slow to simulate large wormhole-routed networks [68]. Our kernel design is similar to that of another work reported recently [61].
For the first version of our simulator, we made the following assumptions, which let us concentrate on the main characteristics of a wormhole-routed network. These assumptions simplify the simulator design, and hence improve the simulator performance.
• All network channels are assumed to have the same bandwidth, and all transmissions are assumed to be error-free.
• Channels between adjacent nodes are assumed to be full-duplex.
• A node processor or memory will never be a bottleneck. We made this assumption because we were interested in studying the effects of contention and non-uniform traffic, and not the limitations of a node's capability on communication performance.
• All routers are synchronized by a global clock. We assume that the duration of a clock period is sufficient for a router to process all newly arrived data flits.
• No support for virtual channels.
• One flit buffer per channel in each direction.
At the same time, the following essential features of a wormhole-routed network are retained, so the simulator predicts the network performance reasonably well:
• reservation of network channels and data paths by the message header
• small headers of one or two words
• pipelined flow of the message flits that follow the header
• release of reserved channels by the tail flit of the message
• a limited number of buffers per router, typically one or two per channel
• blocking of the message flits if the header is blocked.
We also decided to implement the main functions in the form of a kernel, a library that supports the key functions. A problem-specific shell can then be written to complete the simulator. Such a design helps us optimize the key functions, and at the same time keeps the simulator quite flexible. We first describe the main data structures, and then the algorithms for the key kernel functions. The design of a simulator shell is described in the next section.
5.2.1 Kernel Data Structures
The overview of the network model simulated by the kernel is shown in Figure 5.1, and is similar to the model used in another recent work [61]. There are N nodes, each node consisting of a PE and a router. Each router, and hence the corresponding node, is connected to a limited number of other nodes by physical links. The PEs generate and absorb messages, whereas the routers communicate the messages from one node to another.
We need data structures to remember the state of the links, the routers and the PEs. Each output port is connected to the input port of an adjacent node by a physical link. The state of a link or an input port is not remembered explicitly, as it can be deduced from the state of the corresponding output port. We can therefore represent the status of the network and the routers by an N x δ array of PORT structures, each structure remembering the status of one output port (Figure 5.2). Here, δ is the degree of a node, i.e. the number of output (or input) ports in a router.
Figure 5.1: Overview of the network model (N nodes, each consisting of a PE and a router connected by input and output ports over physical channels)
Figure 5.2: Main data structures of the simulator (the QUEUE, PORT and MESSAGE structures)
Each PORT has two fields: pstatus and waitq. The field pstatus stores the state of the output port. Its value is IDLE if the port is not currently reserved for any message; otherwise, its value indicates the message which has reserved it. The other field, waitq, is a small FIFO or priority queue. The queue is not for message flits, but for message headers that have arrived on input ports and are waiting to reserve this port and the corresponding physical link. When the tail flit of the message currently holding the port passes through, it releases the port and the channel. Once the port becomes free, the first message in the waitq gets the channel. In an actual network, the elements of waitq would be input ports. In our simulator, each queue element is the index of a message, as this is more convenient to process.
A PE is treated as a source/sink for messages. Observe that every incoming message corresponds to an outgoing message from some other node. The kernel handles only the messages currently in the network and leaves the other details of PE behavior, including the generation of future messages, to the shell. Hence, the status of the PEs is represented by an N x δ array of MESSAGE structures, each structure remembering the status of a message currently in the network. A MESSAGE has the following fields:
• mstatus: the status of the message: IDLE if there is no message at present, BUSY if the message is not blocked and is progressing, or WAIT if the message is blocked and waiting for a network channel.
• src, dst, msgln, btime: the details of the message: source node, destination node, length of the message, and the time at which it entered the network, respectively.
• headposn, tailposn: the positions of the head and tail flits, respectively. If the value of headposn (tailposn) is i, then the head (tail) flit of the message has arrived at an input port of node i. tailposn will be -1 if the tail flit has not yet left the source PE.
• nbytessent: the number of message flits that have already left the source PE.
Arrays of PORT and MESSAGE structures can remember the status of the entire simulator kernel. The total space needed is O(N), and hence we find that our simulator is space efficient.
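The fields just listed translate directly into C. The following declarations are a sketch of what Figure 5.2 describes, not the exact code of our kernel; the queue bound and type names are illustrative.

    /* Sketch of the kernel data structures of Figure 5.2 (illustrative only). */
    #define MAXWAIT 8                  /* assumed bound on waiting headers per port */

    typedef struct {                   /* small FIFO/priority queue of message ids  */
        int nelems;
        int elems[MAXWAIT];
    } QUEUE;

    typedef enum { IDLE, BUSY, WAIT } STATUS;

    typedef struct {
        STATUS pstatus;                /* IDLE, or reserved by some message         */
        int    holder;                 /* index of the message holding the port     */
        QUEUE  waitq;                  /* headers waiting to reserve this port      */
    } PORT;

    typedef struct {
        STATUS mstatus;                /* IDLE, BUSY or WAIT                        */
        int    src, dst;               /* source and destination nodes              */
        int    msgln;                  /* length of the message                     */
        long   btime;                  /* time the message entered the network      */
        int    headposn, tailposn;     /* nodes where the head and tail flits are   */
        int    nbytessent;             /* flits already injected by the source PE   */
    } MESSAGE;

    /* Kernel state: PORT port[N][DELTA]; MESSAGE msg[N][DELTA];  O(N) space. */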
5.2.2 Simulation Cycle

The kernel functions essentially take the simulator through one simulation cycle (Figure 5.3). It is necessary to understand the functions carried out by a router in one simulation cycle before we can describe the kernel functions.

The initial state of a router before a new simulation cycle is as follows. Some of the input ports received message flits from adjacent nodes in the previous cycle and need to be processed. Other input ports are currently either blocked or idle. Some of the output ports have sent flits to adjacent nodes and are waiting for acknowledgment. Other output ports are either idle or waiting for the next data flit to arrive. We decided on a four-phase simulation cycle based on related work [77, 61].

Phase 1, Routing: In the first phase, the router has to decide the output port for each newly received message flit. If the flit received is a header flit, then the router has to decode the header and decide the output port to which the header should be sent. Otherwise, i.e., if the received flit is not a header, its output port can be inferred from the path taken earlier by its header.

Phase 2, Blocking and Arbitration: In the second phase, the router has to decide, for each flit, whether the required output port is free. For non-header flits, it just checks whether the previous flit of that message has been successfully transmitted to the adjacent node, so that the output port is free to receive the next flit. For a header flit, it has to check whether the required output port is idle. It also has to compete in, and win, the arbitration if headers on several input ports are requesting the same output port.

Phase 3, Internal data transfer (input port to output port): In this phase, the router transfers all unblocked flits from their input ports to their specified output ports and sends an acknowledgment to the adjacent nodes which sent these flits. All output ports for which acknowledgments are received are marked ready, and if an acknowledgment is received for a tail flit then the output port is marked IDLE.

Phase 4, External data transfer (output port to input port): In this last phase, all output ports that received new flits in Phase 3 transmit them to the respective adjacent node.

Figure 5.3: A simulation cycle. The example in the figure starts with head flits newly arrived on input ports 1 and 4 and a data flit on input port 3; in Phase 1 the head flits on ports 1 and 4 are routed and the data flit on port 3 is routed to output port 2; in Phase 2 ports 1 and 4 arbitrate for output port 4, and port 1 is assumed to win; in Phase 3 the flits are transferred from input port 1 to output port 4 and from input port 3 to output port 2, input port 4 is blocked, and acknowledgments are sent; in Phase 4 the flits on output ports 2 and 4 are sent out, giving the final state.

At the end of the simulation cycle, we are back to the initial state except that all unblocked message flits have advanced by one node. (Note that it takes two cycles to complete the transfer of a message flit, so the flits of a message follow one another separated by two nodes. With two buffers for each output port, it would be possible for the flits to follow one another separated by just one node.)
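As an illustration of how these four phases could be organized in code, the sketch below shows one simulation cycle written per phase; the function names are hypothetical, and Section 5.2.3 describes how the kernel actually folds the phases into a scan over the active messages rather than over the routers.

/* One simulation cycle (sketch; the phase functions are placeholders). */
void simulate_cycle(void)
{
    route_new_flits();      /* Phase 1: choose an output port for every newly received flit        */
    arbitrate_and_block();  /* Phase 2: check port availability, arbitrate among competing headers */
    transfer_internal();    /* Phase 3: move unblocked flits from input to output ports, send acks */
    transfer_external();    /* Phase 4: transmit flits over the physical links to adjacent nodes   */
}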
5.2.3 Kernel Algorithms

The computational requirement will be high if we process each flit of each message. We reduced the requirement significantly, and made the simulator computationally efficient, based on the following observations.

- Handling message flits: Instead of processing each message flit, we process only the head and tail flits. The movement of the other data flits is implicit, as they are between the head and tail flits.
- Idle routers: The fraction of routers that have an event to process in a cycle is small. We skip idle routers in a simulation cycle by scanning the list of active messages instead of scanning routers.
- Blocked messages: Any message whose header is waiting for the release of a channel is marked to be in the WAIT state, put in the appropriate waitq, and not processed until the required channel is free. The message is signaled when the channel becomes free (when the tail flit passes through). Hence, in each cycle we process only the unblocked messages.
- Arbitration policy: We also need to handle efficiently the case of two or more headers which arrive in the same cycle and compete for the same output port. Each port is associated with a priority queue, and each message is associated with some priority. In Phase 1, all headers which need channels are inserted into the queue. In the second phase, each header checks whether it is at the head of the queue to determine whether the channel is granted to it. By default, the priority value of any message is the time at which it was injected into the network, and the message having the lowest priority value is the winner. Such a policy ensures that the network tries to minimize the communication time for each message when there is contention. Also, under such a policy no message can wait forever.

We later demonstrate that these improvements result in a simulation rate per cycle which is independent of the system size and is only proportional to the number of progressing messages.

The following are some important functions implemented in the kernel.

1. route(curnode, dstn, nextnode, nextport): The dependence of the simulator on the network topology and on the routing scheme is confined to this function, so we can easily adapt our simulator to any network topology. Given a current node number, curnode (for a head or a tail flit), and a destination node number, dstn, the function route determines the next node along the path to the destination and the output port to which the flit should be routed (which is connected to an input port of nextnode). The function is given for a 2D mesh network below. Here, sn is a global variable whose value is √N. The function is simple even for hypercube or 3D mesh networks, and can be implemented in a few statements of C.

procedure route(curnode, dstn, nextnode, nextport)
  curnode, dstn : integer ;             /* inputs, curnode <> dstn */
  nextnode : integer ; PORT *nextport ; /* outputs */
begin
  fn = curnode - (curnode % sn) ;       /* first node in curnode's row */
  nextnode = curnode ;
  if(dstn < fn)              { nextnode -= sn ; nextport = UP ; }
  else if(dstn >= (fn + sn)) { nextnode += sn ; nextport = DOWN ; }
  else if(dstn < curnode)    { nextnode-- ; nextport = LEFT ; }
  else if(dstn > curnode)    { nextnode++ ; nextport = RIGHT ; }
  else /* dstn == curnode */ { nextport = PE ; }
end
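To illustrate the claim that the function remains simple for other topologies, a hypercube version might look like the sketch below, written in the same style as the listing above; the convention that output port d corresponds to dimension d, and the e-cube order of resolving address bits, are assumptions rather than part of the original code.

procedure route_hypercube(curnode, dstn, nextnode, nextport)
  curnode, dstn : integer ;             /* inputs */
  nextnode : integer ; PORT *nextport ; /* outputs */
begin
  diff = curnode ^ dstn ;               /* address bits in which the two nodes differ */
  if(diff == 0) { nextnode = curnode ; nextport = PE ; }
  else {
    d = 0 ;
    while(((diff >> d) & 1) == 0) d++ ; /* lowest differing dimension (e-cube order)  */
    nextnode = curnode ^ (1 << d) ;     /* correct that dimension first               */
    nextport = d ;                      /* assumed: one output port per dimension     */
  }
end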
2. add_message(srcnode, dstnode, msgln): Given a source node, srcnode, a destination node, dstnode, and a message length, msgln, this function adds a message to the network in the current simulation cycle. The function searches for an available MESSAGE structure, initializes it appropriately, and adds it to the active message list.

3. advance_message: This is a key function which attempts to move a single, unblocked message through the network by one step. It executes Phases 3 and 4 of the simulation cycle. It advances the position of the header and/or tail flit if that flit is at some intermediate node, and it increments the number of flits that have left the source. The function also signals the first waiting message in the queue associated with the channel being released by the tail flit. The function is given below.

procedure advance_msg(msgp)
  MESSAGE *msgp ;
begin
  if(msgp->headposn <> msgp->dstn)          /* advance the header flit */
    route(msgp->headposn, msgp->dstn, nextnode, nextport) ;
    msgp->headposn = nextnode ;
  endif
  if(msgp->nbytssent < msgp->msgln)         /* release one more flit from the source */
    msgp->nbytssent += 1 ;
  else if(msgp->tailposn <> msgp->dstn)     /* move the tail flit */
    route(msgp->tailposn, msgp->dstn, nextnode, nextport) ;
    if(!empty_queue(nextport->waitq))
      /* signal the message at the head of the wait queue */
      wmsgp = get_queue(nextport->waitq) ;  /* separate variable so msgp is not overwritten */
      wmsgp->mstatus = BUSY ;
    endif
    msgp->tailposn = nextnode ;
  else                                      /* tail has reached the destination node */
    msg_finished(msgp) ;                    /* inform the shell */
    msgp->mstatus = IDLE ;
  endif
end

Here, the functions put_queue, get_queue and first_elem operate on waitq: they insert an element, remove an element, and examine the first entry, respectively.

4. advance(n): This function makes the simulator execute n simulation cycles, and is the main function supported by the kernel. The algorithm for advance is given below, where the global variable simclock is the virtual time of the simulator. Here, user_fn is a function implemented in the simulator shell which simulates the behavior of the PEs at the start of a new cycle. Typically, as we see later, user_fn may inject several fresh messages by calling the add_message function.

procedure advance(n)                 /* advance the simulation clock by n cycles */
  n : integer ;
begin
  n += simclock ;
  while (simclock <> n) do           /* execute a simulation cycle */
    user_fn() ;
    simclock++ ;
    /* Phase 1 : routing */
    for each active, unblocked message, msgp, do
      if(msgp->headposn <> msgp->dstn)
        /* header has not reached the destination ; */
        /* attempt to get the next segment of the path */
        route(msgp->headposn, msgp->dstn, nextnode, nextport) ;
        put_queue(nextport->waitq, msgp) ;
      endif
    endfor
    /* Phase 2 : blocking and arbitration */
    for each active, unblocked message, msgp, do
      if(msgp->headposn <> msgp->dstn)
        route(msgp->headposn, msgp->dstn, nextnode, nextport) ;
        if(nextport->pstatus <> IDLE or first_elem(nextport->waitq) <> msgp)
          msgp->mstatus = WAIT ;           /* wait */
        else
          get_queue(nextport->waitq) ;     /* port granted, remove entry from queue */
      endif
    endfor
    /* Phases 3 and 4 */
    for each active, unblocked message, msgp, do
      advance_msg(msgp)
    endfor
  endwhile ;
end

5.3 Simulator Shell

The interface between the simulator shell and the kernel is shown in Figure 5.4. The shell essentially has to implement three functions: main, msg_finished and user_fn. The shell functions call the kernel functions init_kernel, add_message and advance. The kernel function advance calls the shell functions msg_finished and user_fn on specific events. The function msg_finished is called when the tail flit reaches the destination node, indicating completion of a message transfer. The function user_fn is called by advance at the beginning of each simulation cycle. A proper design of msg_finished and user_fn can implement the behavior of the parallel tasks running on the PEs, which is the goal of the shell module.

Figure 5.4: Interface between shell and kernel
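As a compact sketch of this interface (Figure 5.4), the declarations below list the kernel entry points and the shell callbacks named in the text; the exact argument lists are assumptions based on the descriptions given above.

/* Kernel entry points called by the shell (argument lists assumed). */
void init_kernel(int nnodes);                           /* set up the PORT and MESSAGE arrays        */
void add_message(int srcnode, int dstnode, int msgln);  /* inject a new message in the current cycle */
void advance(int ncycles);                              /* run the four-phase cycle ncycles times    */

/* Shell callbacks invoked by the kernel. */
void user_fn(void);                 /* called at the beginning of every simulation cycle              */
void msg_finished(MESSAGE *msgp);   /* called when a message's tail flit reaches its destination node */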
In this section, we briefly describe a simulator shell for studying the effect of mapping on network performance. We use this example to illustrate how to write a simulator shell for a specific problem, and to show that the shell can be simple and efficient. The mapping problem is to find an optimal assignment of the tasks of a parallel application to the nodes of a parallel system so as to get the best communication performance.

The shell has to simulate the behavior of the individual tasks of a parallel application running on different nodes. A parallel task typically performs local computations for some time and then sends a message to one of its neighbors in the process graph; the sequence is then repeated. (Note that a node will also be receiving messages from its neighbors asynchronously.) In our shell, the status of a PE is maintained in a structure called MESSAGE_HISTORY, whose fields are:

start_time : time at which the first message was sent.
next_message : time at which the next message is to be generated.
no_messages_completed : number of messages completed so far.
completion_time : time at which the nth message was completed.
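A direct C rendering of this per-PE record might be the following; the field names are the ones listed above, while the use of plain integer clock values is an assumption.

/* Per-PE bookkeeping kept by the shell (sketch). */
typedef struct {
    int start_time;              /* time at which the first message was sent          */
    int next_message;            /* time at which the next message is to be generated */
    int no_messages_completed;   /* number of messages completed so far               */
    int completion_time;         /* time at which the nth message was completed       */
} MESSAGE_HISTORY;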
Various functions of the simulator shell set and use MESSAGE_HISTORY to simulate the communication behavior of a PE. The functions implemented in the shell are described now.

1. init_msgs: This function is called at the beginning of a simulation run. It schedules a message from each PE by setting the next_message field to a value chosen randomly from the interval 0..2t_avg. The function also initializes the other members of the MESSAGE_HISTORY structure for each PE.

2. msg_finished: This function is called by the simulation kernel whenever a message is completed. It increments the count of messages completed (the no_messages_completed field) for the source PE, and schedules another message from the source after a time chosen randomly from the interval 0..2t_avg. This randomly chosen time simulates the time needed for local computation by the PE before sending another message. The function also records the completion time if the message just transmitted was the nth message from the source node.

3. user_fn: This function is called at the beginning of each simulation cycle, and must be implemented efficiently. In this shell, all active MESSAGE_HISTORY structures are linked in a list sorted in ascending order of the next_message field. The function examines the head of this list and adds the required number of messages to the network by calling the kernel function add_message.

The functions init_msgs and msg_finished, together with user_fn, simulate the message injections by a node. The destination of these messages has to be decided based on knowledge of the process graph and the current mapping. This knowledge is set and used by calling the following two functions.

4. map: maps the parallel tasks in a requested way, and remembers the mapping for later use. The function is called before a simulation run begins.

5. find_dstn: chooses the destination node given the source node, and is called whenever a new message is to be added to the network. The choice is made so that each parallel task communicates with all its neighbors uniformly. The function uses the knowledge of the current mapping and the characteristics of the process graph.

The main function is shown below. It calls initializing functions, including init_msgs and map. After initialization, the main function enters a loop, calling advance in each iteration to advance the simulation clock by 1 until all nodes complete n messages. The command line arguments specify the parameters of a simulation run: the process graph, the message length, the average computation time per loop, the system size, and the number of messages to be completed by each node.

In our simulations, nodes continue to generate messages even after they have finished n messages. This continued generation maintains the traffic levels in the network and avoids any tail effect. This strategy also makes a small value of n (100 in our experiments) sufficient, and a small value of n in turn reduces the time taken by a simulation run. After a simulation run is completed, the function statistics scans the MESSAGE_HISTORY structures and computes the time taken by each node to successfully send n messages. The difference in the times computed for two different mappings then shows the effect of mapping on the network performance. Note that even though the parallel tasks communicate in exactly the same way, a mapping changes the way the system nodes communicate. The change in communication pattern affects the traffic and contention, and hence the time to complete a message.

procedure main()
begin
  prcs_args() ;    /* process the command line arguments */
  init_kernel() ;  /* initialize the kernel */
  map() ;          /* map the process graph in the required way */
  init_shell() ;   /* initialize the shell and schedule a message on each node */
  simclock = 0 ;   /* initialize the simulation clock */
  cnt = 0 ;        /* number of nodes that have completed 'n' messages */
  do advance(1) ; while (cnt < N) ;  /* advance simulation time until all nodes complete 'n' messages */
  statistics() ;   /* process the results and compute statistics */
end

Figure 5.5: Outline of the simulation main function
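To make the shell functions concrete, the sketch below shows one possible find_dstn for a 2D-mesh process graph; gside, task_of and node_of are hypothetical globals recording the process-graph side and the mapping stored by map, and the neighbor is drawn uniformly at random so that each task communicates with all of its neighbors uniformly, as required above.

#include <stdlib.h>               /* for rand() */

extern int gside;                 /* side of the process graph (hypothetical global)   */
extern int task_of[], node_of[];  /* mapping tables maintained by map() (hypothetical) */

/* Sketch: choose the destination node for a message leaving srcnode. */
int find_dstn(int srcnode)
{
    int task = task_of[srcnode];              /* task running on this node            */
    int r = task / gside, c = task % gside;   /* its coordinates in the process graph */
    int nbr[4], n = 0;
    if (r > 0)         nbr[n++] = task - gside;
    if (r < gside - 1) nbr[n++] = task + gside;
    if (c > 0)         nbr[n++] = task - 1;
    if (c < gside - 1) nbr[n++] = task + 1;
    return node_of[nbr[rand() % n]];          /* node to which the chosen neighbor task is mapped */
}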
5.4 Simulator Validation

One of the major challenges in developing a simulator is to validate it after it is implemented. We validate our simulator by comparing its output with the results from experiments on a real system, and by comparison with the results from theoretical analysis. We use the matrix-transpose pattern (Section 4.2.3) for validation. Figure 5.6 shows the network performance for N = 144, i.e., a 12 x 12 mesh. The experimental points in Figure 5.6 are from experiments on a Symult 2010 (Figure 4.12). We find that the simulation results match well with the experiments. The curves match even though the simulator used 50 byte messages, and the experiments on the Symult used 1K, 4K or 8K bytes. The figure indicates that the node traffic tends to saturate as predicted by the theoretical results.

From the theoretical analysis in the previous chapter, we also know that the saturation node traffic for a worst-case node is λ_sat = 1/(ν_max + 1), as δ_avg = δ_max = 1 for the matrix-transpose pattern (any node communicates with only one other node). For a 144-node (√N = 12) system, ν_max = √N - 2 = 10, and hence λ_sat = 1/11 ≈ 9.1%. The saturation node traffic for an average node will be 1/(1 + ν_avg). To compute ν_avg, notice that the ν value of any path originating from a node (i, j) below the principal diagonal (i.e., i > j) is i - 1, and is also equal to the ν of the path originating from node (√N - 1 - i, √N - 1 - j). Hence, considering the average over the paths originating from the nodes below the principal diagonal, we have

    ν_avg = [ Σ_{i=1}^{√N-1} Σ_{j=0}^{i-1} ν(p_ij) ] / [ √N(√N - 1)/2 ]

where p_ij denotes the path originating from node (i, j). Since ν(p_ij) = i - 1, this evaluates to

    ν_avg = 2(√N - 2)/3 = 20/3.

Figure 5.6: Matrix-transpose pattern: simulation results. The plot shows actual node traffic versus applied node traffic (both in percent), with experimental results for 1, 4 and 8 Kbyte messages, the simulation results, and saturation levels of about 13% for an average-case node and 9.1% for a worst-case node.

The saturation node traffic for an average node is therefore 1/(ν_avg + 1) = 1/(20/3 + 1) = 3/23 ≈ 0.13, or 13%. Figure 5.6 confirms that the simulator results agree with the saturation node traffics predicted by the theoretical results, further validating the simulator. We verify in Section 5.6 that the saturation levels indicated by the simulation results match those predicted by theory for other communication patterns (process graphs) as well.
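As a quick arithmetic check of this derivation, the short program below sums ν(p_ij) = i - 1 over the below-diagonal nodes of a 12 x 12 mesh; it reproduces ν_avg = 20/3 and the 13% average-case saturation level. It is only an illustration of the calculation, not part of the simulator.

#include <stdio.h>

int main(void)
{
    int sn = 12;                      /* sqrt(N) for the 144-node mesh                    */
    int i, j, npaths = 0;
    double sum = 0.0;
    for (i = 1; i < sn; i++)          /* nodes (i, j) below the principal diagonal, i > j */
        for (j = 0; j < i; j++) {
            sum += i - 1;             /* nu(p_ij) = i - 1                                 */
            npaths++;
        }
    double nu_avg = sum / npaths;     /* evaluates to 2(sn - 2)/3 = 20/3                  */
    printf("nu_avg = %.4f  saturation = %.4f\n", nu_avg, 1.0 / (nu_avg + 1.0));
    return 0;
}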
5.5 Simulator Performance

One criterion for judging the quality of a simulator is its execution performance: how long it takes to generate the desired results. The time for a simulation run depends on the message generation rate and the system size. We measured the simulation rate by running the simulator on Sun SPARC 1 workstations. Figure 5.7 shows the simulator performance as a function of applied node traffic for different system sizes. The Y-axis represents how many cycles can be simulated per CPU second, and the X-axis is the applied node traffic as defined in Section 5.4. Performance curves are presented for two different communication patterns in Figure 5.7. In one pattern, each node communicates only with its four neighbors, so there is no contention for network channels (solid line). The second pattern is the uniformly distributed pattern, in which each node communicates with every other node uniformly (dotted line); in this case messages need to compete for channels. In fact, simulation run times are approximately the same irrespective of the exact communication pattern used. The results show that the simulation rates are quite high. The simulation rate when there is significant contention is slightly lower due to the additional work as messages are blocked and released.

Figure 5.7: The simulator performance in cycles per CPU second, plotted against applied node traffic for N = 144 and N = 256, contention-free and under contention.

We can observe from our simulator design that the time per simulation cycle depends essentially on the number of unblocked messages in the network in that cycle. Given the system size N and the node traffic λ_a, the number of unblocked messages in the system at any instant will be N × λ_a. Hence, another useful way to display the performance of the simulator is shown in Figure 5.8. The Y-axis represents how many microseconds of CPU time are needed to process one unblocked message per simulation cycle; the X-axis is still the applied node traffic. We observe that the CPU time needed is approximately constant, about 10 to 15 microseconds to process one message per simulation cycle. The CPU time is independent of the system size and the communication pattern.

Figure 5.8: The simulator performance in microseconds per message per cycle, plotted against applied node traffic for N = 144 and N = 256, contention-free and under contention.

We can also estimate the time needed for a simulation run by using the curves in Figure 5.8. Let m be the number of messages that need to be sent by each node in a simulation run. As a first approximation we assume λ_n = λ_a = l/(t + l) (Section 3.1), so that l = t λ_a/(1 - λ_a). If the simulator processes events at the rate of x microseconds per message per cycle (as shown in Figure 5.8), the execution time for m simulation loops will be

    CPU time = x × (number of messages in the system per cycle) × (number of cycles per loop) × (number of loops)
             = x × (N × λ_a) × (t + t λ_a/(1 - λ_a)) × m

For example, consider simulating a network with 256 nodes (N = 256), with each node completing 100 messages per simulation run (m = 100). Also assume that t = 270 cycles and λ_a = 10% (l = 30). From Figure 5.8 we find that the simulator processes at a rate of x ≈ 15 microseconds per message per cycle. Hence, we can estimate the time for the simulation run to be 15 × (256 × 0.1) × (300 + (300 × 0.1)/(1 - 0.1)) × 100 microseconds, or about 13 seconds, which is very efficient. The simulation time is directly proportional to N if all other parameters remain the same. Hence, in the above example, if instead of a 256-node network we wanted to simulate a 1024-node system, we would need 13 × 4 = 52 seconds to complete the simulation run. We find that our simulator performs well and can simulate even large networks in reasonable time.

5.6 Simulation Experiments

The simulator formed by combining the simulator shell with the kernel can be a very useful tool in understanding the network performance under the effects of contention for network channels. In this section, we describe some of the experiments we conducted using the simulator to better understand the network performance. The experiments show the flexibility of our simulator and the ease with which we can use the simulator to answer various questions about the network performance. In all cases, we present the performance results as a plot of actual node traffic versus applied node traffic (refer to Section 4.3). We also indicate the saturation node traffic predicted by our theoretical results to demonstrate that our theoretical results agree with the simulator results.

5.6.1 Small vs. Large Messages

Figure 5.9 shows the results for three different message lengths of 5, 50 and 250 flits, when the entire message is sent as a single packet. The saturation levels for 50 and 250 flits are almost identical, and match the theoretical results. We conclude that if message lengths are greater than path lengths, the saturation effect is independent of the absolute message length. For small message lengths, saturation occurs at a slightly lower value.

5.6.2 Constant vs. Probabilistic Computation Times

Another important characteristic of our experiments is the distribution of the computation times of the various loops. We tried two distributions. In one, the computation time in each loop was constant and equal to t_avg; in the other, the computation time was distributed uniformly over the interval 0..2t_avg. The average computation time was equal to t_avg in both experiments, and hence the applied node traffic was the same. The results are given in Figure 5.10. We observe that the performance is almost identical.
The performance is slightly better for constant computation times, and the worst case node shows a sharper transition in performance when applied node traffic is near saturation level. 25 20 Actual node 15 traffic (A,,) in percentage 10 126 I I oL=5flits _ >L=50flits *L=250flits , ' Average-case node A 913.0% 0 5 10 15 20 25 30 Applied node traffic (A,,) in percentage Figure 5.9: Effect of message length 25 20 Actual node 15 traffic (A,,) in percentage 10 I I b - I * Constant computation times 0 probabilistic computation times - I I I , ' Average-case node d 0 5 10 15 20 25 30 Applied node traffic (A,,) in percentage Figure 5.10: Effect of distribution on computation times 127 5.6.3 Packetized vs. Unpacketized Messages The effect of packetization on performance was measured next. We sent 250-flit messages as ten packets of 25 flits each, with a inter-packet interval of 10 cycles to account for the processing overhead per packet. The results were then compared with that obtained in the first experiment when 250—flit messages were sent as a single message (Figure 5.11). We observe that the performance once again is almost identical. Experiments also revealed that neither processing overheads per packet (which determines time between packets) nor packet size has a significant effect on the saturation level. However, if the average length of the messages generated by various nodes are different, then packetization may be useful. Packetization will prevent a long message from blocking a small message for a long time, and hence the network response will be fair to different messages. 25 _ I I r I I -1 * 250 flits as one packet! . o 250 flits as ten packets . 20 *- ~ - I ' Average-case node _ -13.0% Actual node 15 traffic (A,,) in percentage 10 ' Wors -case 035.1% ‘ 5 r _ 0 ' l l I l 1 0 5 10 15 20 25 30 Applied node traffic (A,,) in percentage Figure 5.11: Effect of packetization These results indicate that none of the factors—absolute message length, packeti- zation, or distribution of computation times—affect the saturation levels. We there- fore use 50 byte, single packet messages and uniformly distributed computation times in the remaining experiments. 128 5.6.4 Mapping We now examine the effect of random mapping on the network performance for four common process graphs: binary tree, 2D mesh, 3D mesh and hypercube. We also present the results for a uniform distribution which acts as a reference for other re- sults, and shows the effect of non-uniformity in traffic due to random mapping. A uniform distribution was realized by selecting the destination node randomly each time a message was generated. Figure 5.12 give the actual node traffic for different communication patterns for two system sizes of 64 nodes and 256 nodes. We observe that the actual traffic tends to saturate at different levels, depending on the com- munication patterns, and that the results match the saturation levels predicted by theoretical analysis in the previous two chapters. Compare Figure 5.12.A and Figure 5.12.B. We find that saturation occurs at a lower level for larger system sizes—less than 10% for systems having 256 node or more. Hence, random mapping may not be advisable for multicomputers having hundreds of nodes if the applied node traffic of an applications is 5% or higher. As mentioned in the previous section, overheads in current multicomputers are limiting the applied node traffic to 13% or less (6% or less for small messages). 
With reduced overheads in future machines, higher applied node traffic are possible depending on the application. Hence, the effect of contention on performance will be increasingly visible in future multicomputers if we use random mapping. We also observe that the saturation level for the worst-case node is about 2-3 times lower than the level for an average node. The saturation levels predicted by our theoretical analysis are indicated in parenthesis in the graph’s legends (average case, worst case). We find that theoretical predictions match the simulation results further validating our simulator. The actual saturation levels for the worst case are about 20% higher than the value predicted by the theoretical analysis. One solution to the saturation problem is to use careful mappings that reduce con- tention. Simulations showed that for careful mappings actual node traffic was almost equal to applied node traffic for all our patterns even for a 25% applied node traffic. Hence, the mapping problem still exists, and mappings that minimize contention may 25 20 Actual node 15 traffic (A,,) in percentage 10 25 20 Actual node 15 traffic (A,,) in percentage 10 (The pair of values in parenthesis gives the theoretical results) 129 ‘4: Unifofm distribution (25%) x 2D Mesh (47%,23%) _I> Binary tree (3170,1470) 0 3D Mesh (36%,19%) . o Hypercube (41%,22%) . ' '— Average-case node , ' l l l J l Applied node traffic (A,,) in percentage B. N = 256 0 5 10 15 20 25 30 Applied node traffic (A,,) in percentage A. N = 64 I I I i I I ‘* Uniform distribution (13%) I x 2D Mesh (19%,10%) t> Binary tree (12%,6%) ‘ I 0 3D Mesh (13%,7%) <> Hypercube (17%,9%) ”— Average-case node , ' ‘ - - - Worst-case node i "' no.9 ............ Q - ...9 ............ D — -l l l l l J 0 5 10 15 20 25 30 Figure 5.12: Effect of random mapping on the network performance. 130 25 _ I I I I :g _ x 2D Mesh (100%) 3;?“ > Binary tree (65%) 3:3' 20 " ..u -4 0 3D Mesh (90%) .fi' 0 Hypercube (75%) 3‘. Actual node 15 - 5‘: _ traffic (A,,) in '9'. percent age . . ' 10 - “x .. 5 - I - I: K 0 l J l I l 0 5 10 15 20 25 30 Applied node traffic (A,,) in percentage A. N -.= 64 25 __ I I j i r j< _ x 2D Mesh (100%) .32 > Binary tree (39%) . 20 t' _ 0 3D Mesh (71%) fi 0 Hypercube (36%) fig" Actual node 15 — .532' _ traffic (A,,) in .515; percentage ‘ . i' 10 '- '1‘. - 5 - ,ur'. ~ I" l 0 l l l l l 0 5 10 15 20 25 30 Applied node traffic (A,,) in percentage B. N =256 (The plots are shown only for the worst-case node) Figure 5.13: Effect of careful mapping on the network performance. 131 still make a significant difference in the performance of large multicomputers. The results confirm our fear that the problem of contention will be increasingly visible in future, larger multicomputers. Random mapping may not be advisable and careful mapping may improve the performance significantly. 5.6.5 Arbitration Policy The previous experiments considered a software solution—mapping. One possible hardware improvement is to find a good queueing policy when message headers on different input ports request an output port. In our simulator, we time-stamped mes- sages when they entered the network and gave priority to messages with earlier time stamps while granting output ports. Intuitively it seems to be the best policy as the network attempts to complete older messages and minimizes the time spent waiting for messages by parallel tasks. We investigate performance of other queueing policies to determine if this is the best policy. 
For each policy we evaluate the performance for the four process graphs used in Experiment 2. A simple queueing policy is FIFO with ties resolved by giving a fixed priority to input channels. Figure 5.14 shows that performance is significantly worse, and A,, actually declines when applied node traffic is increased beyond a level. Hence, we need a better queue policy than the FIFO. FIFO can be improved by using a tie- resolution scheme that gives priority to an input port that least-recently requested the output port. However, there is little performance improvement. An effective policy is to discourage message injections by the local node when the router senses heavy network traffic passing through the node. Different input ports are given different base priorities—the port with a lower priority is granted the output port. To prevent infinite waits, each time an input port fails to get an output port its priority is reduced by 1, and when it succeeds its priority is set to its base value. We call this queue policy biased priority policy. Figure 5.15 shows the performance when the bias for the input port for the channel connected to the local node was 9, and the other ports were 4. The performance is comparable to that of the time-stamped priority policy. The advantage of the biased priority policyis that it is simple and 132 r I I I I 25 ' * Uniform distribution _, X 2D Mesh 20 _ o Binary tree _ 0 3D Mesh o Hypercube Actual node 15 " — Average-case node ' traffic (A,,) in ‘- percentage - '- Worst-case node e 3 10 p O . ""'/X3:. 0 ‘ .4 I. ----- ' . .......... .9, ....... 5 '- ........ 9 ”QM-0.. .h '3 _‘ 0 0 l l 1 D ------ t .......... f 0 5 10 15 20 25 30 Applied node traffic (A,,) in percentage C Figure 5.14: FIFO queue policy, 256 node system does not require time stamp information in messages. Another queue policy whose performance is comparable to the time-stamped pri- ority is the message-source priority. If the message header contains the source node number, this policy can be easily implemented. Each output port remembers the source of the last message that used the port and acts as a clock hand. The message whose source node number is nearest to the clock hand gets the port next. The perfor- mance for such a policy seems to be quite similar to the time-stamped priority policy (Figure 5.16). We conclude that the time-stamped priority policy seems to be the best of all the policies considered, and that the performance of other simpler policies such as biased priority policy and message source priority policy are comparable. The experiments and the results clearly demonstrate that our simulator can be a useful tool in understanding the performance of modern wormhole-routed networks. Our results indicate that the contention problem is significant for large networks, and that the effect depends on the applied node traffic (a measure of the communication intensity of an application) for a given communication pattern and mapping. Careful 25" 20- Actual node 15 traffic (A,,) in percentage 10 Figure 5.15: 25*- 20- Actual node 15 traffic (A,,) in percentage 10 133 I I I I I * Uniform distribution x 2D Mesh D Binary tree 0 3D Mesh <> Hypercube — Average-case node ° - - - Worst-case node . . .... C I ...... O O .0 C 5 10 15 20 25 Applied node traffic (A,,) in percentage Biased priority queue policy, 256 node system 30 I I I I I T * Uniform distribution x 2D Mesh t> Binary tree 0 3D Mesh o Hypercube — Average-case node ' e 0. 0.... 
I o ..O:'o .- 5 10 15 20 25 Applied node traffic (A,,) in percentage 30 Figure 5.16: Message-source priority queue policy, 256 node system 134 mappings seem to be the main solution to the contention problem even though system level solutions such as appropriate queueing policies and adaptive routing can help. 5.7 Conclusion We described a high-performance simulator for wormhole-routed interprocessor com- munication networks. A careful design and implementation in a high-level language resulted in a simulator that can simulate large networks in reasonable amount of time. We demonstrated that the simulator can be a useful tool in understanding the effect of contention on network performance. The simulator was validated by comparing its outputs with the results from experiments on a real machine, and the results from theory. The design clearly shows that our simulator is quite flexible, and can be easily modified to simulate any network topology, routing scheme or even circuit-switched networks. ' The simulator performance can be improved further. The performance for large messages, and longer computation time per loop is not good. It is possible to improve the performance by recognizing the messages that have successfully established their paths, and avoid processing them until the tail flit leaves the source. The simulator was not designed to study the architecture of the router or node in detail. It is not useful in evaluating the bottlenecks that may be present in the router or the limitations of the node or the PE. Some recent network implementations support virtual channels that do not consume any channel bandwidth if idle. Our simulator cannot simulate such virtual channels at present. Chapter 6 CONCLUSION AND FUTURE RESEARCH In this thesis, we studied the performance of direct interprocessor communication networks—a critical component of message-passing distributed-memory concurrent computers. The main objective was to model and improve the performance of large direct networks that use wormhole-routing, and to determine if network performance can limit the overall performance of a multicomputer. Results showed that rapid advances in the technology of direct networks have made their performance charac- teristics quite different from the characteristics of the earlier networks. The results also showed that the performance of direct networks is now comparable to that of other networks. Our studies involved 0 theoretical analysis to model and predict the communication performance, and to study some design tradeoffs. The analysis was in two parts. First, we studied the simpler case of contention-free communication, i.e. we assumed that there was no contention for network channels. Next, we modeled the performance under contention for a given communication pattern and mapping. experimental studies to evaluate the performance of a real multicomputer that uses a stateof-the-art communication network. Our experimental results pro- vided an understanding of the performance characteristics of current networks, and helped to verify theoretical results. simulation studies to evaluate the performance of large future multicomputers. The simulator was validated by comparing its results with those from theoretical analysis and experiments. Simulations were also used to investigate the effect of several design factors on performance that are difficult to model theoretically. 135 136 We summarize the main contribution of our thesis now. 
6.1 Summary and Major Contributions

The communication performance under contention-free conditions is determined by the bandwidth of the network channels and the switching technique. The bandwidth depends on the width of the channels and the rate of communication over these channels. Since large networks are wire-limited, the choice of network topology affects the width of the network channels. Topology also affects the length of the communication wires, and the length of the wires determines the maximum rate of communication. We estimated the improvement in channel bandwidths if we use 2D mesh networks instead of the popular hypercube networks. Our results showed that channel widths improve by a factor R_w = 2√N/3 and that channel rates improve by a factor R_r in the range 1..√(2N)/lg N, depending on how the channel rate depends on the length of a channel. Hence, overall bandwidths improve by a factor R_b = R_w × R_r. Calculations showed that R_b ≈ 100 for a 1024-node system, i.e., the bandwidths of mesh networks could be two orders of magnitude higher than those of hypercube networks for a 1024-node system. Current networks use wormhole routing, which is efficient for non-adjacent-node communication. Considering the characteristics of wormhole routing, we showed that higher bandwidth channels improved performance significantly even for non-adjacent-node communication, in spite of the longer paths in mesh networks. In short, we concluded that a mesh topology was a better choice than a hypercube because we get a significant improvement in performance under contention-free conditions.

However, in large multicomputers that use mesh networks, we showed that the effect of contention can be significant unless carefully reduced. Contention for network channels results in additional wait times, which reduce the effective communication rates to only a fraction of the channel bandwidths. Contention depends on the mapping, that is, the way parallel tasks are assigned to the various multicomputer nodes. We developed a framework to model the effect of contention on the communication performance for a given application and mapping. For a given mapping, we introduced a new metric, path contention level, to measure the contention for a specific communication path. The path contention levels of the various paths, together with the channel loads of the various network channels, can characterize the nonuniformities in traffic. We used queueing theory to estimate and model the performance under contention as a function of the path contention levels. Our results showed that the level of message traffic that can be injected by a node is bounded, and that this bound can be reasonably estimated if we know the path contention levels. A simple tool was developed based on our theoretical models to compute the bounds on node traffic for a given mapping.

We used the general results for performance under contention to study the special case of random mapping. The objective was to determine if random mapping was good enough, that is, to determine if the mapping problem no longer exists. We used simple probability theory to estimate the channel loads and path contention levels as a function of system size. These estimates were verified against the actual values computed for several popular communication patterns. We then used those estimates of the path contention levels in our general results to predict the performance under random mapping. Our results showed that message injection rates, measured by node traffic, cannot exceed 2/√N for an average node.
Some nodes are affected more than others, and the results also showed that bounds on node traffic for the worst case nodes can be lower than that for an average node by a factor of two to three. We concluded that careful mapping is necessary to have effective communication rates close to the channel bandwidths, and hence retain the advantage of higher bandwidth channels of mesh networks. We illustrated in case of several common communication patterns how careful mappings can reduce contention and increase bounds on node traffic. Our results clearly showed that the path contention level can be a useful metric as it can reasonably predict the communication performance under contention. Minimizing path contention levels can replace minimizing path lengths as the mapping objective. Experiments were conducted on a Symult 2010—a real multicomputer that uses a state-of-the-art wormhole-routed mesh-connected network. The objective of the experiments was to understand the performance characteristics of current networks, 138 and to verify theoretical results. Experimental results showed that contention-free communication can be modeled as a linear function of message length, and that it is independent of path length. The node traffic was limited to 6% if message lengths were less than 1 Kb, and less than 13% even for a 8K message. The reason was high, fixed overhead and packetization overhead. Limited injection rates and small system sizes made the effect of contention negligible, and made the performance independent of mapping for current systems. However, on the largest system which had 192 nodes was the effect of contention visible, and the performance curves matched well with theoretical results. In the future, overheads are being reduced and system sizes are increasing. We, therefore, concluded as predicted by theoretical results that the effect of contention would no longer be negligible. Mapping can make a difference in communication performance of multicomputers. We also showed that architectural bottlenecks exist inside a node in current systems. A high-performance simulator was developed in C to study the communication performance of large multicomputers not yet available. A careful design resulted in a simulator that can reasonably model the performance of wormhole-routed or circuit-switched networks. At the same time, the simulator is fast enough to pro- vide results for large multicomputers having hundreds of nodes in tens of seconds on SPARCstations. The simulator is also flexible as it can be easily modified to simulate any network topology, routing scheme or application characteristics under specified mappings. The simulator was used to verify the theoretical results for large multicomputers and/or multicomputers which can have high message injection rates. Simulations show that performance is as predicted by the theory. The results also match with experimental results over the operation range of current multicomputers. Results showed that performance is a function of node traffic, and is relatively independent of message lengths, packet lengths, interpacket time and so on provided that the average length of messages generated by various nodes is approximately the same. We also used the simulator to study system-level solutions to the contention problem. One such solution is an appropriate arbitration policy to resolve contention. Current 139 networks use FIFO policy to resolve contention. 
Simulations show that performance deteriorates fast in such networks once traffic reaches a certain threshold. Time— stamping the messages when they entered the network, and giving priority to ’old’ messages provides the best performance. 6.2 Future Research The results of the thesis indicate that contention for network channels will be a growing concern in future, larger multicomputers as its effect on network performance will be increasingly visible. In this thesis, we did some initial work to model and improve network performance under contention. Still, there are many interesting problems that need to be studied. Some of them are 1. Automated Mgping : In this thesis, we gave careful mappings that minimized contention ( measured by path contention level ) only for a few common communi- cation patterns. It is useful to have a tool that can find out such a mapping for any given communication pattern specified as a process graph. It is useful if the tool can accept process graphs with weighted edges to account for nonuniformities in message intensity along different edges. Since finding such a mapping is a. hard problem, such a tool may use heuristics. A heuristic algorithm is sufficient as it is only necessary to find a mapping that results in bounds that are higher than the node traffic applied by the application, and there is no need to find the best mapping. The tool may also have to consider routing possibilities in addition to placement, as future systems may provide adaptive or user-defined routing schemes. 2. System traffic : We considered only the user-generated traffic due to one appli- cation. In a general system, the traffic from several users interfere with one another. There is also system-generated traffic which interferes with user-traffic. Some exam- ples of system-traffic are I/O traffic, operating system messages, etc. The character- istics of such traffic may be different from that of user-generated traffic. For example, message lengths may be greater than user-generated messages. One can study the communication performance at this higher level considering the interference between 140 different traffic using the framework we developed. Such a study may help in making policy decisions such as the allocation of nodes to a parallel task and the position of I/ O nodes. 3. Fault tolerance : In a fault-tolerant system, the routing scheme may change un- der faults. A different routing may in turn change contention levels, non-uniformities and communication performance. It could be interesting to study how the com- munication performance may change under faults, and whether we can extend our theoretical results to those cases. 4. System-level solutions : We studied one system-level solution to the contention problem—an appropriate arbitration policy. There can be other solutions such as adaptive routing, whose effectiveness is still an open question. Another solution is virtual channels. The improvements in communication performance provided by these system level solutions need to be studied. It is not clear if these solutions can allow random mapping to be sufficient. It would be interesting to see if we can model the communication performance for a mapping considering adaptive routing and virtual channels. 5. W: In our thesis, we made assumptions about the communication characteristics of an application to simplify the model. It would be interesting to see if we can extend our theoretical results to more general cases. 
For example, a task may not communicate uniformly with its peers. We can handle such a case by assigning weights to the edges of a process graph and take that into account in our analysis. In addition, the messages generated may be bursty in many applications. The message generation at a node may depend on the messages received by a node, i.e. path traffic on one path may depend on another. Multicast and broadcast communications may occur which need to be accounted for. The process graphs itself may not be static and may change during the execution. It should be possible to extend our framework to most of these cases. To summarize, old problems such as path lengths may no longer exist due to advances in network technology. However, the communication network is a critical architectural component, and the demand on the network performance continues to 141 grow. As a result, we face new problems, new issues, and we require new solutions to be able to continue to improve the communication performance of multicomputers. Bibliography Bibliography [1] Butterfly GPIOOO, section 5, 1988. [2] D. P. Agarwal, V. K. Janakiram, and G. Pathak. Evaluating the performance of multicomputer configurations. In IEEE Computer, pages 23—37, May 1986. [3] M. Arango, H. Badr, and D. Gelerenter. Staged circuit switching. IEEE Trans- actions on Computers, pages 174—179, Feb. 1985. [4] S. Arlauskas. iPSC / 2 system: a second generation hypercube. In G. C. Fox, edi- tor, Proc. 3rd Conference on Hypercube Concurrent Computers and Applications, pages 38—42. ACM, 1988. [5] W. C. Athas and C. L. Seitz. Multicomputers: Message passing concurrent computers. Computer, pages 9-25, Aug. 1988. . [6] F. Berman and L. Snyder. On mapping parallel algorithms into parallel archi- tectures. Journal of Parallel and Distributed Computing, pages 439—458, 1987. [7] L. N. Bhuyan and D. P. Agarwal. Generalized hypercube and hyperbus struc- tures for a computer networks. In IEEE Transactions on Computers, Apr. 1984. [8] R. P. Bianchini and J. P. Shen. Interprocessor traffic scheduling algorithm for multiple-processor networks. IEEE Transactions on Computers, pages 396—409, Apr. 1987. [9] S. H. Bokhari. On the mapping problem. In IEEE Transactions 0n Computers, pages 207-214, Mar. 1981. [10] S. H. Bokhari. Communication overhead on the intel ipsc-860 hypercube. Tech- nical report, ICASE, NASA Langley Research Center, May 1990. [11] S. H. Bokhari. A network flow model for load balancing in circuit-switched multicomputers. Technical report, ICASE, NASA Langley Research Center, May 1990. [12] S. W. Bollinger and S. F. Midkiff. Processor and link assignment in multi- computers using simulated annealing. In International Conference on Parallel Processing, pages 1—7, 1985. [13] L. Bomans and D. Roose. Benchmarking the iPSC / 2 hypercube multiprocessor. Concurrency: Practice and Experience, pages 3—18, Sept. 1989. 142 143 [14] S. Borkar et al. iWarp: An integrated solution to high-speed parallel computing. In Proc. Supercomputing Conferecne, pages 330-339, 1988. [15] G. D. Buzzard. High performance communication for hypercube multiprocessor. [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] Ph.D Thesis, University of Michigan, Ann Arbor, 1988. D. A. Carlson. Modified mesh connected parallel computer. In IEEE Transac- tions 0n Computers, pages 1315-1321, Oct. 1988. S. Chittor and R. Enbody. Hypercubes vs. 2d meshes. Technical Report CPS- 89-17, Computer Science, Michigan State University, 1989. S. Chittor and R. Enbody. 
Link switching : A communication architecture for configurable parallel systems. In proc IEEE Phoenix Conference on Computers and Communication, Mar. 1989. S. Chittor and R. Enbody. Hypercubes vs. 2D meshes. In Proc. SIAM Fourth Annual Conference on Parallel Processing for Scientific Computing, pages 313— 318, 1990. S. Chittor and R. Enbody. Performance degradation in large wormhole-routed interprocessor communication networks. In Proc. International Conference on Parallel Processing, volume 1, pages 424—428, 1990. S. Chittor and R. Enbody. Performance evaluation of mesh-connected wormhole- routed networks for interprocessor communication in multicomputers. In Proc. Supercomputing Conference, pages 647—656, Nov. 1990. S. Chittor and R. Enbody. Predicting the effect of mapping on the communica- tion performance of large multicomputes. Technical Report CPS-91, Computer Science Dept, Michigan State University, 1990. S. Chittor and R. Enbody. Will 2d meshes replace hypercubes ? In PARBASE— .90, pages 317—319, 1990. S. Chittor and R. Enbody. Predicting the effect of mapping on the communica- tion performance of large multicomputers. In Proc. International Conference on Parallel Processing, (to appear). S. Chittor, R. Enbody, and T.-S. Jon. A high performance simulator for large wormhole-routed networks. In International Journal of Computer Simulation, 1991 (to appear). E. Chow, H. Madan, J. Peterson, D. Grunwald, and D. Reed. Hyperswitch net- , work for the hypercube computer. In Proc. International Conference on Parallel [27] Processing, pages 90—99, 1988. D. V. Chudnovsky, G. V. Chudnovsky, and M. M. Denneau. Regular graphs with small diameter as models for interconnection networks. In International Conference on Parallel Processing, pages 232—239, 1987. 144 [28] P. Close. The iPSC/2 node architecture. In Third Conference on Hypercube Concurrent Computers and Applications, pages 43-51, 1988. [29] W. Dally. Network and processor architecture for message-driven computers. In R. Suaya and G. Birtwistle, editors, VLSI and Parallel Computation, pages 140-222. Morgan Kaufmann Publishers, Inc., 1990. [30] W. Dally et al. Architecture of a message-driven processor. In Proc. 14th ACM/IEEE Symposium on Computer Architecture, pages 189—196, June 1987. [31] W. J. Dally. Wire efficient VLSI multiprocessor communication networks. In Proc. Stanford Conference on Advanced Research in VLSI, pages 391—415. MIT Press, Cambridge, Mar. 1987. [32] W. J. Dally. Fine-grain message-passing concurrent computers. In Proc. The third Conference on Hypercube Concurrent Computers and Applications, 1988. [33] W. J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775—785, June 1990. [34] W. J. Dally. Virtual-channel flow control. In Proc. of the 19.91 International Symposium on Computer Architecture, pages 60—68, 1990. [35] W. J. Dally and C. L. Seitz. The torus routing chip. Journal of Distributed Systems, 1: 187-196, 1986. [36] W. J. Dally and C. L. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Transactions on Computers, pages 547—553, May 1987. [37] W. J. Dally and P. Y. Song. Design of a self-timed VLSI multicomputer com- munication controller. In Proc. International Conference on Computer Design, pages 230-234, 1987. [38] S. P. Dandamudi and D. L. Eager. Hierarchical interconnection networks for multicomputer systems. IEEE Transactions on Computers, pages 786-797, June 1990. [39] K. W. Doty. 
New designs for dense processor interconnection networks. IEEE Transactions on Computers, pages 447—450, May 1984. [40] C. Flaig. VLSI mesh routing systems. Technical Report 5241, California Institue of Technology, May 1987. [41] J. Gecsei. Interconnection networks from three-state cells. In IEEE Transactions on Computers, pages 705—711, Aug. 1977. [42] J. Ghosh and K. Hwang. Hypernet : A communication efficient architecture for constructing massively parallel computers. In IEEE Transactions On Computers, pages 1450—1465, Dec. 1987. 145 [43] J. R. Goodman and C. H. Sequin. Hypertree: A multiprocessor interconnection topology. In IEEE Transactions on Computers, pages 923-933, Dec. 1981. [44] M. Goodrich, C. Koelbel, W. Robinson, and K. Showell. Prep-p: A mapping preprocessor for chip computers. In International Conference On Parallel Pro- cessing, 1985. [45] D. Gordon. Efficient embeddings of binary trees in VLSI arrays. IEEE T rans- actions on Computers, c-36(9):1009—1017, Sept. 1987. [46] D. C. Grunwald and D. A. Reed. Networks for parallel processors: Measurements and prognostications. In Proc. The third Conference on Hypercube Concurrent Computers and Applications, pages 610-619, 1988. [47] S. Hart. A note on the edges of the n-cube. Discrete Mathematics, pages 157— 163, 1976. [48] J. P. Hayes et al. Architecture of a hypercube supercomputer. IEEE Micro, pages 653-660, Oct. 1986. [49] R. E. Horan and D. A. Poplawski. Communication/ computation paradigms for hypercubes. In Hypercube Computers Conference, pages 620—623, 1988. [50] E. Horowitz and A. Zorot. The binary tree as an interconnection network: Appli- cations to multiprocessor systems and vlsi. In IEEE Transactions 0n Computers, 1981. [51] K. Hwang and W. Briggs. Parallel Processing. McGraw Hills, 1985. [52] T.-S. Jon and R. Enbody. Adaptive routing on wormhole-routed two-dimensional meshes. Technical Report CPS-90-03, Computer Science, Michigan State Uni- versity, 1990. [53] A. M. Jrad and R. W. Hall. Orthogonal fast channels: An enhanced mesh architecture. In International Conference on Parallel Processing, pages 828—831, 1987. [54] P. Kermani and L. Kleinrock. Virtual cut-through: A new computer communi- cation networks. Computer Networks, 3:267—286, 1979. [55] P. Kermani and L. Kleinrock. A tradeoff study of switching systems in computer communication networks. IEEE Transactions on Computers, c—29, Dec. 1980. [56] T. Kerola and A. Hartmann. Operational analysis on hyper-rectangulars. In International Conference on Parallel Processing, pages 78-82, 1988. [57] L. Kleinrock. Queueing Systems : Computer Applications, volume 2. John Wiley & Sons, 1976. 146 [58] X. Lin, P. M. Mckinley, and L. M. Ni. Performance evaluation of multicast wormhole routing in 2d-mesh multicomputers. Technical report, Computer Sci- ence Dept, Michigan State University, 1991. [59] X. Lin and L. M. Ni. Deadlock-free multicast wormhole routing in multicom- puter networks. In Proc. of the 19.91 International Symposium on Computer Architecture, 1991. [60] D. H. Linder and J. C. Harden. An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes. IEEE Transactions on Computers, pages 2—12, 1991. [61] J. N. Mailhot. A Comparative Study of Routing and Flow Control Strategies in k-ary n-cube Networks. B.S. thesis, Computer Science, Massachusetts Institute of Technology, 1989. [62] P. Muzumdar. Evaluation of on-chip static interconnection networks. IEEE Transactions on Computers, pages 365—369, Mar. 1987. [63] T. Nash et al. 
High performance parallel computers for science: New devel- opments at the fermilab advanced computer program. Technical report, Fermi National Accelerator Laboratory, Illinois, 60510, Aug. 1988. [64] D. Nath, S. Mahasweri, and P. Bhat. Efficient vlsi networks for parallel pro- cessing based on orthogonal trees. In IEEE Transactions 0n Computers, pages 569—581, June 1983. ' [65] J. Ngai. Framework for Adaptive Routing in Multicomputer Networks. Ph.D. dissertation, Computer Science, California Institute of Technology, 1989. [66] S. F. Nugent. The iPSC/ 2 direct-connect communication technology. In G. C. Fox, editor, Proc. 3rd Conference on Hypercube Concurrent Computers and Ap- plications, pages 51—60. ACM, 1988. [67] D. M. Passe and A. R. Larrabee. Intel iPSC concurrent computer. In R. G. Babb II, editor, Programming Parallel Processors, pages 105-124. Addison Wesley, 1988. [68] N. M. Ram, M. Ramu, and A. Paulraj. On the performance of a backtrack- ing variation of wormhole routing. Technical Report AR—TR-003, Centre for Development of Advanced Computing, Bangalore , India, Aug. 1990. [69] A. G. Ranade and S. L. Johnson. The communication efficiency of meshes, boolean cubes and cube connected cycles for wafer scale integration. In Proc. International Conference on Parallel Processing, pages 479—482, 1987. [70] D. A. Reed and H. D. Schwetman. Cost-performance bounds for multicomputer networks. IEEE Transactions on Computers, pages 83-95, Jan. 1983. 147 [71] D. D. Riley and R. J. Baron. Design and evaluation of a synchronous triangular interconnection scheme for interprocessor communication. In IEEE Transactions on Computers, pages 110-118, Feb. 1982. [72] M. Rosing and R. P. Weaver. Mapping data to processors in distributed memory computations. In The Fifth Distributed Memory Computing Conference, pages 884-892, 1990. [73] C. L. Seitz. The cosmic cube. In Communications of ACM, pages 22—23, Jan. 1985. [74] C. L. Seitz et al. The architecture and programming of the Ametek series 2010 multicomputer. In Proc. Hypercube Computers Conference, pages 33—36, 1988. [75] C. L. Seitz, J. Seizovic, and W.-K. Su. The C programmer’s abbreviated guide to multicomputer programming. Technical Report Caltech-CS-TR-88-1, Computer Science, California Institute of Technology, Pasadena, Apr. 1989. [76] L. Snyder. Introduction to the configurable highly parallel computer. In IEEE Computer, pages 47-56, Jan. 1982. [77] P. Y. Song. Design of a network for concurrent message passing systems. Mas- ter’s thesis, Electrical Engineering and Computer Science, MIT, May 1988. [78] R. J. Swan, S. H. Fuller, and D. Siewiorek. Cm*—a modular multiprocessor. In Proc. Nat. Comput. Conf., pages 39—46, 1977. [79] H. Y. Youn and A. D. Singh. Near optimal embedding of binary tree architectures in VLSI. In Proc. The 8th International Conference on Distributed Computing Systems. IEEE, 1988. "‘lithium“