MICHIGAN STATE UNIVERSITY

This is to certify that the dissertation entitled "Multicast Communication in Multicomputer Networks" presented by Xiaola Lin has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major professor

Date: May 19, 1992
MULTICAST COMMUNICATION IN MULTICOMPUTER NETWORKS

By

Xiaola Lin

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1991

ABSTRACT

MULTICAST COMMUNICATION IN MULTICOMPUTER NETWORKS

By Xiaola Lin

Efficient routing of messages is a key to the performance of multicomputers. Multicast communication service in multicomputers delivers the same message from a source node to an arbitrary number of destination nodes. Multicast communication is finding increased demand in many applications, including various simulations, image processing, and numerous parallel algorithms.

Providing support for multicast communication involves several, often conflicting, requirements. First, it is desirable that the message delay from the source to each of the destinations be as small as possible. Second, the amount of network traffic must be minimized. Third, the routing algorithm must not be computationally complex. Finally, the multicast communication protocol must be deadlock-free.

In this dissertation, the above issues are addressed. According to the different switching technologies and routing criteria, various multicast models (multicast path, multicast cycle, Steiner tree, and multicast star) have been proposed. We show that finding an optimal route for each proposed model is NP-hard for the most common multicomputer topologies, such as the hypercube and mesh. Basic heuristic routing algorithms for these models are presented. Based on node labeling and network partitioning strategies, various deadlock-free routing schemes (tree-like, dual-path, multi-path, and fixed-path methods) have been presented. These are the first deadlock-free multicast wormhole routing algorithms ever proposed.
We also conduct a series of simulation studies to evaluate the performance of these multicast algorithms under both static and dynamic network traffic conditions. The programs are written in C and use an event-driven simulation package, CSIM.

In summary, this dissertation research focuses on the basic issues of multicast communication: multicast routing models, investigation of optimal routing for the models, development of heuristic algorithms, deadlock avoidance, and performance study based on both static and dynamic network traffic.

Copyright © by Xiaola Lin 1991

To my parents, Xiao Ju-Xiang and Lin Yi.

ACKNOWLEDGEMENTS

I would like to thank Professor Lionel M. Ni for providing me with the opportunity to perform this research and continue my graduate studies under his direction. His suggestions and guidance were invaluable to the completion of this work.

Professor Wen-Jin Hsu has been my co-advisor. His help in my studies and his advice on my research enabled me to overcome a host of obstacles.

Professor Philip K. McKinley's contributions are gratefully acknowledged. The discussions we had were always enlightening, and his recommendations helped to improve the quality of this research.

The careful review of this documentation by the other Ph.D. committee members, Professor Abdol H. Esfahanian, Professor Byron Drachman, and Professor Sakti Pramanik, is greatly appreciated.

The graduate assistantship I received from the Case Center for CAEM of the Engineering College was critical to the completion of this work. I wish to thank Professor Erik Goodman, Case Center director, for his financial support. Special thanks also go to Jackie Carlson, John Lees, Robert Raicsh, and Susan Smith for their kindness and friendship during my stay in the Case Center.

I am grateful to my parents and my sisters for their love and encouragement. I am also thankful to Shiqi He, Liangsheng Cao, and Zhonggang Zeng for their assistance and friendship in these years.
TABLE OF CONTENTS

LIST OF TABLES ix
LIST OF FIGURES x

1 INTRODUCTION 1
1.1 Interprocessor Communication 4
1.2 Motivation and Problem Statements 6
1.3 Dissertation Organization 7

2 MULTICOMPUTER NETWORK AND MULTICAST COMMUNICATION 9
2.1 Topology of Multicomputer Network 10
2.1.1 Hypercube Topology 11
2.1.2 2D Mesh Topology 13
2.1.3 Other Topologies 14
2.2 Switching Technology 15
2.2.1 Store-And-Forward Switching 16
2.2.2 Virtual Cut-Through Switching 16
2.2.3 Circuit Switching 17
2.2.4 Wormhole Routing 18
2.3 Message Routing and Flow Control 20
2.3.1 Source Routing and Distributed Routing 20
2.3.2 Deterministic Routing and Adaptive Routing 20
2.3.3 Flow Control 21
2.3.4 Deadlock Issues 22
2.4 Issues of Multicast Communication 26

3 MODELS FOR MULTICAST COMMUNICATION 28
3.1 Multicast Path and Cycle Problems 30
3.2 Steiner Tree Problem 32
3.3 Multicast Tree Problem 33
3.4 Multicast Star Problem 34

4 OPTIMAL MULTICAST IN MULTICOMPUTERS 36
4.1 Optimal Multicast in 2D Mesh Topology 36
4.2 Optimal Multicast in Hypercube Topology 41
4.3 Optimal Multicast in 3D Mesh Topology 49

5 BASIC HEURISTIC MULTICAST ROUTING ALGORITHMS 50
5.1 Heuristic Routing Algorithms for MP and MC 51
5.2 Heuristic Routing Algorithms for ST 55
5.3 Heuristic Routing Algorithms for MT 60
5.4 Illustrative Examples 64
6 DEADLOCK-FREE MULTICAST WORMHOLE ROUTING 73
6.1 Deadlock Issue in Multicast Wormhole Routing 73
6.2 Deadlock-Free Multicast in 2D Mesh 78
6.2.1 Tree-Like Deadlock-Free Multicast Routing Schemes 78
6.2.2 Path-Like Deadlock-Free Multicast Routing Schemes 83
6.3 Deadlock-Free Multicast in Hypercube 94

7 PERFORMANCE STUDY OF THE ROUTING SCHEMES 101
7.1 Performance Study under Static Network Traffic 101
7.2 Performance Study under Dynamic Network Traffic 106

8 CONCLUSIONS 113
8.1 Summary of Major Contributions 113
8.2 Direction for Future Research 115

BIBLIOGRAPHY 118

LIST OF TABLES

5.1 A Hamilton cycle and the corresponding mapping h of a 4 x 4 mesh.
5.2 The sorting key f(x) and mapping h(x) for each node x in a 4 x 4 mesh, where the source node is 9.
5.3 A Hamilton cycle and the corresponding mapping h of a 4-cube.
5.4 The sorting key f(x) and mapping h(x) for each node x in a 4-cube, where the source node is 9.

LIST OF FIGURES

1.1 (a) A generic multicomputer architecture; (b) A generic node architecture.
1.2 A multicast communication pattern with three destinations.
2.1 Hypercube topology with 3 dimensions.
2.2 A 4 x 3 mesh.
2.3 A comparison of different switching technologies.
2.4 An example of deadlock involving four messages.
2.5 (a) A direct network I; (b) Channel dependency graph for I and X-first routing.
3.1 An example of multicast path in 2D mesh.
3.2 An example of Steiner tree in 2D mesh.
3.3 An example of multicast tree in 2D mesh.
3.4 An example of multicast star in 2D mesh.
4.1 The construction of G' from G.
4.2 A simple grid graph with 8 nodes.
5.1 The sorted MP algorithm for message preparation.
5.2 The sorted MP algorithm for message routing.
5.3 The greedy ST algorithm for message preparation.
5.4 The greedy ST algorithm for message routing.
5.5 X-first multicast algorithm.
5.6 Divided greedy algorithm.
5.7 A 4 x 4 mesh with node 9 as the source.
5.8 A 4-cube with node 9 as the source.
5.9 A complete ST routing pattern in an 8 x 8 mesh.
5.10 A complete ST routing pattern in a 6-cube.
5.11 The routing pattern of X-first algorithm for the example.
5.12 The routing pattern of divided greedy algorithm for the example.
6.1 Deadlock in a 3-cube multicomputer. 75
6.2 The detailed diagram of deadlock configuration. 75
6.3 An X-first multicast routing pattern. 76
6.4 A deadlock situation in a 3 x 4 mesh. 77
6.5 Network partitioning for 3 x 4 mesh. 79
6.6 Double-channel X-first routing algorithm. 80
6.7 The routing pattern of the tree algorithm. 81
6.8 The label assignment for N+x,+y for channel ordering. 82
6.9 The labeling of a 4 x 3 mesh. 84
6.10 The other label assignment and channel partitioning for a 4 x 3 mesh. 85
6.11 Message preparation for the dual-path routing algorithm. 87
6.12 The dual-path routing algorithm. 88
6.13 An example of dual-path routing in a 6 x 6 mesh. 89
6.14 Message preparation for the multi-path routing algorithm in 2D mesh. 91
6.15 Multi-path destination address partitioning. 92
6.16 An example of multi-path routing in a 6 x 6 mesh. 93
6.17 An example of fixed-path routing in a 6 x 6 mesh. 94
6.18 The label assignment and the corresponding high-channel and low-channel networks for a 3-cube. 95
6.19 A 4-cube with node 1100 as the source for dual-path routing. 98
6.20 Message preparation for the multi-path routing algorithm in hypercube. 99
6.21 A 4-cube with node 1100 as the source for multi-path routing. 100
7.1 Performance of the sorted MP algorithm on a 32 x 32 mesh. 102
7.2 Performance of the sorted MP algorithm on a 10-cube. 103
7.3 Performance of the greedy ST algorithm on a 32 x 32 mesh. 104
7.4 Performance of the greedy ST algorithm on a 10-cube. 104
7.5 Performance of the X-first and divided greedy algorithms on a 16 x 16 mesh. 105
7.6 Performance of different multicast methods on a 6-cube. 106
7.7 Performance of different number of destinations on 8 x 8 mesh. 107
7.8 Performance under different loads on a double-channel mesh. 108
7.9 Performance of different number of destinations on a double-channel mesh. 109
7.10 Performance under different loads. 110
7.11 Performance of different number of destinations. 111

CHAPTER 1

INTRODUCTION

Parallel processing is rapidly becoming a dominant theme in all areas of computer science and its applications. Some experts [1] believe that it is likely that virtually all developments in computer architecture, system programming, computer applications, and the design of algorithms will be centered around parallel computation within a decade. Parallel computers are those systems that emphasize parallel processing, including pipeline computers, array processors, and multiprocessor systems [2].
Multicomputers are a class of MIMD multiprocessor systems composed of a large number of processors interconnected by message-passing networks with some fixed topology [3]. Each processor or node is a programmable computer with its own CPU, local memory, and other supporting devices. Multicomputers are also referred to as distributed-memory multiprocessors. Figure 1.1(a) shows a generic multicomputer in which a set of nodes are connected through a multicomputer network. Figure 1.1(b) is the structure of a generic node. The router in the node is responsible for handling message passing for communication among the nodes. A dedicated router is required to allow computation and communication to overlap in support of a high-performance computing system.

Figure 1.1. (a) A generic multicomputer architecture; (b) A generic node architecture.

The key characteristics of a multicomputer that distinguish it from other parallel processor systems are briefly summarized as follows [4]:

(1) Large number of processors. Because microprocessors are becoming inexpensive, easily packaged, and consume little power, it is economically practical to construct systems with hundreds or thousands of nodes. Multicomputers are able to provide massive parallelism by having a large number of nodes. For example, the FPS T-series [5] can have up to 2^14 processors. nCUBE-1 has 1024 nodes, and the new nCUBE-2 can have up to 8192 nodes [6]. A multicomputer of a few hundred nodes can have a peak performance exceeding that of today's supercomputers.

(2) Message-based communications. There is no globally shared memory in a multicomputer. Message passing is the only means for information exchange between nodes. It thus can reduce the software bottleneck.
(3) Asynchronous execution. Each node executes independently of all other nodes. Synchronization between nodes relies on message-passing primitives. It is a MIMD machine.

(4) Moderate communication overhead. Delays in a well-designed multicomputer network should not exceed a few hundred microseconds, in contrast to milliseconds in loosely coupled distributed systems such as a network of workstations.

(5) Medium-grained computation. A parallel program consists of a number of "tasks". The "grain" of a computation denotes the size of these tasks, and it can be measured by the amount of computation between task interactions. In order to exploit the parallelism of the parallel system, the program should be decomposed into many small tasks. However, an extremely fine computation granularity will increase communication and task-swapping overheads. Multicomputers are best suited for medium grain to balance the desire for massive parallelism against these overheads.

The appearance of multicomputers raises many challenging design and performance issues. Basically, two major factors affect the performance of a multicomputer: the computational power of each node and interprocessor communication across the multicomputer network. With the development of VLSI technology, the nodes are becoming more powerful and faster. An efficient communication mechanism should be able to pass messages at the right time to keep all nodes busy and achieve high parallelism. Interprocessor communication will dominate computing cost in both hardware and software [4].

1.1 Interprocessor Communication

At the system level, depending on the number of destinations, interprocessor communication can be classified into three types: unicast, broadcast, and multicast communication.

Unicast (one-to-one) communication is to send a message from a source node to one destination node. It is the basic type of communication, and it has been directly supported by all current multicomputers.
Broadcast (one-to-all) is a type of message delivery in which a source node sends a message to all other nodes. This type of communication has also been extensively studied in the past [7]; it is directly supported in nCUBE-2 using wormhole routing [6].

Multicast (one-to-many) communication refers to the delivery of the same message from a source node to an arbitrary number of destination nodes. Figure 1.2 shows a multicast communication pattern with three destinations, where process P0 has to send the same message to three processes: P1, P2, and P3 (here we assume each process resides in a separate node). Such a communication pattern is typical in many parallel and distributed simulations, such as circuit simulation, Petri net simulation, simulation of computer networks and queuing networks, and some other parallel algorithms. In these cases, the application of multicast is straightforward, as the output of a gate (or server, place) may become the input of some connected gates. Parallel algorithms for such applications require proper partitioning and mapping strategies to achieve a balance between granularity and communication overhead.

Figure 1.2. A multicast communication pattern with three destinations.

If multicast communication primitives are not supported, the program structure of the communication pattern in Fig. 1.2 is shown below.

    P0:            P1:            P2:            P3:
    send(msg,P1)   recv(msg,P0)   recv(msg,P0)   recv(msg,P0)
    send(msg,P2)
    send(msg,P3)

Multicast communication, although highly demanded in the development of many parallel algorithms, has not received much attention. As shown above, multicast communication can be supported by multiple one-to-one communications. Assume that both send and recv are synchronous operations. If P0 is executing send(msg,P1) and P1 has not yet executed the recv statement, P0 is blocked. In the meantime, P2 is executing the recv statement and is blocked because P0 has not yet executed send(msg,P2).
Obviously, system resources are wasted due to the unnecessary blocking. Because of the non-determinism and asynchrony properties of multicomputers, there is no way for P0 to determine a proper message-passing sequence a priori. Furthermore, some of these replicated messages may traverse the same communication channel, which creates more traffic than needed.

1.2 Motivation and Problem Statements

Efficient routing of messages is critical to the performance of multicomputers. Historically, commercial multicomputers have supported only single-destination, or unicast, message passing. More recently, multicomputers have begun to provide multicast communication services, in which the same message is delivered from a source node to an arbitrary number of destination nodes [6] [8]. The multicast primitive has been shown to be highly useful in many parallel algorithms [9] [10] [11]. A growing number of parallel applications can benefit from multicast services.

- In parallel simulation, an event modeled at one processor may cause a chain reaction of events that are to be modeled at other processors. The causal information must be communicated to those other processes [12].

- In parallel search algorithms, a set of processes collectively solve a decision or optimization problem. Examples include parallel alpha-beta search and parallel algorithms for artificial intelligence problems [13]. Processes in such applications typically search some global state space and must inform one another concerning their findings as well as pruning information.

- In image processing and pattern recognition, parallel processes operate on different areas of an image and must exchange information in order to identify complex objects and identify changes in the image since a previous image was generated [14] [10].
- In parallel graph algorithms, such as finding multiple shortest paths [15], information concerning the characteristics of the graph discovered by one process affects the behavior of other processes.

- In numerical algorithms, such as finding steady-state solutions of power flow equations [16], it is common for iterations of a loop to be done in parallel. If some steps in an iteration depend on results from previous iterations, then all of the processes must synchronize after executing earlier steps. That is, none of the processes will proceed with the dependent step until all have executed the earlier steps. This "barrier synchronization" can be efficiently implemented using multicast communication [17].

A software approach for multicast communication in hypercube multicomputers was proposed in [18]. The Cosmic Environment, a popular message-based parallel programming language developed at Caltech, also supports multicast communication [8]. A better heuristic multicast algorithm for the hypercube was proposed in [19] [20]. In [19], it was conjectured that finding an optimal multicast communication tree in terms of time and traffic is NP-hard in a hypercube graph. Furthermore, a VLSI router to handle multicast communication based on the virtual cut-through communication mechanism has been successfully prototyped [21].

Realizing the importance of multicast communication to multicomputers, the objective of this dissertation research is to study the major issues of multicast communication, including the modeling of multicast patterns, investigation of the complexity properties of the optimal routing problems, development of efficient distributed routing algorithms, deadlock issues, and performance study. We will state these issues in detail in the following chapter.

1.3 Dissertation Organization

This chapter introduces the demand for high-performance parallel computers, especially the use of multicast communication in multicomputers.
The motivation of this dissertation research is also presented. The next chapter gives a brief review of the basic properties of multicomputer networks. The main issues of multicast communication will be addressed. Based on various routing evaluation criteria and switching technologies, various multicast models are proposed in Chapter 3.

Chapter 4 examines the computational complexities of the proposed optimal multicast problems for 2D mesh, hypercube, and 3D mesh topologies. It shows that all of the optimal multicast problems are NP-complete.

In Chapter 5, basic heuristic algorithms are proposed for the multicast models in 2D mesh and hypercube multicomputers.

Chapter 6 focuses on deadlock-free multicast wormhole routing. It is shown that the broadcast wormhole routing adopted in the nCUBE-2 is not deadlock-free, a property that is essential to wormhole-routed networks. Several multicast wormhole routing strategies for 2D mesh and hypercube multicomputers are proposed and studied. All of the algorithms are shown to be deadlock-free.

A simulation study has been conducted that compares the performance of these multicast algorithms under both static and dynamic network traffic conditions. The performance study results are presented in Chapter 7. The last chapter summarizes the dissertation research work and proposes directions for related future research.

CHAPTER 2

MULTICOMPUTER NETWORK AND MULTICAST COMMUNICATION

The critical component of a multicomputer is its communication network. Multicomputer networks are used to pass messages between the nodes in multicomputers. An interconnection network is characterized by its topology, switching technology, routing scheme, and flow control mechanism. The topology defines the way the nodes are interconnected by channels, and it is usually modeled as a graph. A switching mechanism transfers data from an input channel to an output channel during the message passing process.
Routing determines the path(s) selected for messages to reach the destination(s). Flow control deals with the allocation of channels and buffers to a message as it travels along the path.

Multicomputer networks have become a popular architecture for building large-scale multiprocessors that offer massive parallelism. The issues of interprocessor communication are basically related to the structure of the multicomputer network. In this chapter, we discuss the general properties of multicomputer networks. Multicast communication issues will also be addressed.

2.1 Topology of Multicomputer Network

A key decision in the design of a multicomputer is the choice of the topology of its interconnection network. A good network topology should be able to accommodate a large number of nodes while minimizing the message transmission time. It is desirable to connect each node to every other node, as in a complete graph; messages would then be delivered directly without passing through any intermediate node. The number of physical connections, however, would be extremely large, especially as the number of nodes increases. Hardware constraints, such as the number of available pins and pads on the router as well as the communication area, preclude such a completely connected network. Therefore, all multicomputers of relatively large size adopt a fixed topology.

The following factors should be considered in selecting and evaluating a multicomputer network topology:

number of connections: the number of channels used to connect the nodes; to obtain a reasonable wire complexity, it should not be massive;

regularity: to make the design easy, the network should look alike from every node; all of the nodes have the same or nearly the same degree;

diameter: the maximum distance over all pairs of nodes; a smaller diameter gives a smaller worst-case delay caused by passing messages through intermediate nodes;

scalability: the system size can be increased by easily adding new nodes to the network;

routing: the ease of designing deadlock-free and efficient routing algorithms;
the maximum distance between all of the pairs of nodes, a smaller diameter has smaller worst case delay caused by passing message through the intermediate nodes; scalability: the system size can be increased by easily adding new nodes to the network; routing: the easy of designing of deadlock-free and efficient routing algorithm; 11 robustness: the fault-tolerant ability, there should exist alternate paths between pairs of nodes; throughput: the total number of messages the network can handle per unit time, the network should be able to efficiently handle various traffic distributions. Some of the factors are related, others may be conflicting. Therefore, trade-off must be taken for selecting the network topology. In our study for multicast com- munication, we will focus on two popular topologies: n-cube and 2D mesh. Various network topologies have been proposed and studied. We discuss the properties of these topologies in more detail as follows. 2.1.1 Hypercube Topology All first generation multicomputers adopt n-cube or hypercube topology due to its rich topological properties [22]. Most second generation multicomputers also adopt the hypercube topology, such as iPSC-2 [23] and nCUBE~2 [6]. An n—dimensional hypercube, also known as an n-cube, is a multicomputer with N, N =2" nodes interconnected as an n-dimensional binary cube. There are 2" distinct n-bit binary addresses or labels that may be assigned to each node. A node’s address differs from that of each of its 77. neighbors in exactly one bit position. Figure 2.1 illustrates a hypercube with 3 dimensions. It has been known that hypercube topology has many nice features that make it suitable for the multicomputer network [24]. One of the most important advantages of hypercube topology is that meshes of all dimensions and trees can be embedded in a hypercube so that neighboring nodes are mapped to neighbors in the hypercube. 
The communication structures used in the fast Fourier transform and bitonic sort algorithms can be embedded easily in the hypercube. Since a great many scientific applications use mesh, tree, FFT, or sorting interconnection structures, the hypercube is a good choice for general-purpose multicomputer networks.

Figure 2.1. Hypercube topology with 3 dimensions.

The other major advantage is that a hypercube's maximum internode distance (diameter) is only n, n = log2 N, where N is the size of the hypercube, so that any two nodes can communicate fairly rapidly. Although the diameter is greater than the unit diameter of a completely connected topology, it is achieved with nodes having a degree of n = log2 N, compared with the degree of N - 1 for nodes in a complete connection. Another important feature is that there are d! distinct paths of length d between two nodes at distance d in the hypercube.

There are additional features of the hypercube topology that are useful in designing multicomputers. For example, it is homogeneous in that all nodes look the same; an I/O channel can be added to each node to obtain extremely high I/O rates. Also, there are many ways to partition a hypercube into subcubes to support multiprocessing. Different users can work on different subcubes in the same hypercube. This feature makes a system more tolerant of faults, since the operating system can allocate subcubes that contain no faulty processors or faulty channels. In summary, the hypercube topology balances node connectivity, communication diameter, algorithm embeddability, and programming ease. This balance makes it suitable for a broad class of computational problems.

However, it was shown in [25] that, in wormhole networks, low-dimensional networks (e.g., 2D mesh) have lower latency and higher hot-spot throughput than high-dimensional networks such as hypercubes with the same bisection width. Further, low-dimensional networks are more scalable than high-dimensional networks.
For instance, to increase the size of a hypercube with 2^n nodes, at least 2^n nodes must be added. The size of a hypercube is also limited by the degree of each node, which is usually fixed. For a 2D mesh, the degree of each node is a small constant, at most 4. The system can be easily upgraded with additional rows or columns, and there is no size limitation imposed by the degree of the node. We next examine the properties of the 2D mesh topology, a low-dimensional network.

2.1.2 2D Mesh Topology

As shown in Fig. 2.2, a node in a 2D mesh has connections to at most four neighbors. Some second-generation multicomputers such as the Ametek 2010 use the 2D mesh topology [26]. Intel's "Touchstone" next-generation multicomputer will also adopt the 2D mesh topology due to its desirable properties of regularity, low cross-section bandwidth, low fixed degree of channels, and scalability.

Figure 2.2. A 4 x 3 mesh.

VLSI systems are wire-limited [25]. The complexity of the connection is limited by the wire density, and the performance is also limited by the delay of the interconnect. Bisection density is used to account for wire density; it is the product of the width of a channel and the bisection width, which is in turn defined as the minimum number of channels required to divide the network into two equal halves. It is not hard to see that low-dimensional networks have wider channels, and thus higher channel bandwidth, assuming that all the networks have the same bisection density. The 2D mesh is a low-dimensional network. For wormhole networks, it has been shown by theoretical analysis and simulation [25] that low-dimensional networks reduce contention, because having a few high-bandwidth channels provides more resource sharing than having many low-bandwidth channels as in high-dimensional networks. This also gives low-dimensional networks higher hot-spot throughput than higher-dimensional networks. Low-dimensional networks have a higher maximum throughput and lower average blocking latency.
The 2D mesh is an asymmetric network. The channels in the center of the mesh are likely to carry higher traffic than those at the boundary. Another disadvantage is its greater diameter. For a pair of source and destination nodes, the average number of channels that a message traverses in a sqrt(N) x sqrt(N) mesh is sqrt(N)/2, while it is (log N)/2 in an n-cube, n = log N.

2.1.3 Other Topologies

Many other interconnection topologies have been proposed for multicomputers. For example, the 3D mesh has recently become popular; the MIT J-machine [30] and the Caltech MOSAIC [3] adopt the 3D mesh topology. Other topologies include the ring, the tree, etc. A general topology is the k-ary n-cube, which has n dimensions and k nodes in each dimension, connected as a ring. Both the hypercube and 2D mesh topologies discussed above are special cases of the k-ary n-cube topology.

2.2 Switching Technology

Network performance is usually measured in terms of communication latency, defined as the elapsed time from the instant a "message send" is initiated by a sending process to the instant the message is received by the receiving process. The communication latency is the sum of three values: start-up latency, network latency, and blocking time [27]. The start-up latency is the time for message framing/unframing, memory/buffer copying, validation, etc., at both the source and destination nodes. It depends on the software techniques used in each node. The blocking time is the delay during message passing due mainly to contention for communication resources, such as waiting for a busy channel to become idle or waiting for buffers to become available. It reflects the dynamic behavior of the network when there is contention for network resources. Network latency is the time from when the head of a message leaves the source node until its tail enters the destination node. Obviously, the switching technology has a great impact on network latency.
Four switching technologies have been proposed to handle the switching tasks at intermediate nodes: store-and-forward switching, virtual cut-through, circuit switching, and wormhole routing. We will describe these switching technologies in more detail. A message is usually divided into a number of packets for transmission in order to use network resources efficiently. A packet is the unit of information handled by the router.

2.2.1 Store-and-Forward Switching

All first generation hypercube multicomputers adopt the store-and-forward mechanism, in which each packet is stored completely at each intermediate node before being forwarded to the next node. If we define the network latency as the time from when the head of a message enters the network at the source until the tail emerges at the destination, the network latency of store-and-forward switching can roughly be expressed as [3]

T_SF = (L/B)D + L/B = (L/B)(D + 1),

where D is the number of nodes on the path (the distance) and L/B is the time required for a message of length L to pass through a channel of bandwidth B; the delay of the individual routing nodes on the path is ignored. In this approach, the message transmission time is linearly proportional to the number of hops (links) between two nodes. Thus, the network latency is usually represented by the number of hops.

2.2.2 Virtual Cut-Through Switching

Virtual cut-through switching [28] was originally proposed for computer networks. In this method, the routing decision is made as soon as the header, which carries the destination information, is available. The node then immediately forwards the packet to the next node if the corresponding outgoing channel is available. The network latency, without considering the blocking time, is

T_VCT = (L_h/B)D + L/B,

where L_h is the length of the header field. When the length L of the message is much greater than L_h, (L_h/B)D is negligible. Therefore, the distance D is no longer an important factor affecting the network latency.
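The two expressions above can be compared numerically. The following sketch is illustrative only (the function names and the example figures are ours, not the dissertation's); it shows how the distance term dominates for store-and-forward but becomes negligible for cut-through.

```python
def t_store_and_forward(L, B, D):
    """Store-and-forward network latency: (L/B)D + L/B = (L/B)(D + 1)."""
    return (L / B) * (D + 1)

def t_virtual_cut_through(L, B, D, Lh):
    """Virtual cut-through network latency: (Lh/B)D + L/B."""
    return (Lh / B) * D + L / B

# A 1024-unit message with an 8-unit header over D = 10 hops (B = 1):
print(t_store_and_forward(1024, 1, 10))       # 11264.0
print(t_virtual_cut_through(1024, 1, 10, 8))  # 1104.0
```

Doubling D roughly doubles the store-and-forward latency, but adds only another (L_h/B)D = 80 time units under cut-through.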
As indicated in [28], however, when the outgoing channel is unavailable, the packet has to be buffered at the intermediate node. If the traffic in the network is heavy, the packet is buffered at each intermediate node, and virtual cut-through switching behaves just like store-and-forward switching.

2.2.3 Circuit Switching

Circuit switching is similar to the way signals are transmitted in telephone networks. In circuit switching, a physical circuit between the source and destination nodes must be established before the source node can start sending the message. The circuit is torn down after the tail of the message is delivered. In the circuit establishment phase, the source node sends a short control packet with routing information, such as the destination address, to set up a connection to the destination node. The control packet can be stored at intermediate nodes. If a circuit cannot be set up due to contention for channels, various protocols can be used to reestablish the circuit [27]. After the circuit has been established, the source node sends the message through the circuit, which is exclusively reserved for this message transmission. No message buffer is needed for the message transmission. The network latency for circuit switching is

T_CS = (L_c/B)D + L/B,

where L_c is the length of the control packet sent to establish the circuit. Again, if L_c << L, the distance D has little impact on the network latency. The major advantage of circuit switching is that the routing cost is paid only once, in the circuit establishment phase. After the circuit is set up, the message can be transmitted with very little delay, and no buffering is required.

2.2.4 Wormhole Routing

Wormhole routing was proposed by Dally and Seitz in their torus routing chip design [29]. In wormhole routing, the message is divided into flits (flow control digits).
Similar to virtual cut-through, when the header flit, which contains the routing information, arrives at an intermediate node, it is sent out immediately through an available outgoing channel. The remaining flits of the message follow in a pipelined fashion. However, blocked messages remain in the network in wormhole routing, while virtual cut-through buffers blocked messages and thus removes them from the network. The network latency for this switching mechanism is

T_WH = (L_f/B)D + L/B,

where L_f is the length of the smallest information unit, the flit. If L >> L_f, the message transmission time is almost independent of the number of hops between the two nodes. Because wormhole routing has low network latency and requires only a small amount of dedicated buffer space at each node, it has become the most promising switching technology and has been adopted in the Symult 2010, nCUBE-2, iWARP, and Intel's Touchstone project. It is also used in some fine-grained multicomputers, such as MIT's J-machine [30] and Caltech's MOSAIC [3]. Detailed studies of communication issues and the wormhole routing method can be found in [27] [31]. Figure 2.3 illustrates the transmission of a packet from a source node S to a destination node D for store-and-forward switching, circuit switching, and wormhole routing in a contention-free network. A good survey and comparison of various communication paradigms can be found in [32] [27].

Figure 2.3. A comparison of different switching technologies: (a) store-and-forward switching; (b) circuit switching; (c) wormhole routing.

2.3 Message Routing and Flow Control

Message routing is an important issue in parallel processing. Since there are potentially many paths joining pairs of nodes in a multicomputer, different routes can be found depending on the criteria employed.
The other important and practical issue is the avoidance of deadlock in message passing.

2.3.1 Source Routing and Distributed Routing

Depending on which node(s) determine the routing path, either source routing or distributed routing may be adopted. In source routing, the source node determines all the nodes that will be involved in delivering its message to all the destinations. Thus, the routing information is carried in the message itself. With distributed routing, the source node determines only to which of its neighboring nodes the message will be sent; these nodes then repeat the procedure in turn. In this approach, the destination field in the message carries only the destination addresses. The main drawback of source routing is that the addresses of all the intermediate nodes must be carried in the source message, which creates extra message overhead. In distributed routing, on the other hand, since the routing decision is made at each intermediate node, the routing algorithm should be simple enough to achieve a low overall message transmission time. All commercially available multicomputers support distributed routing for one-to-one communication. However, obtaining an efficient multicast routing algorithm in a distributed manner is not an easy task [33].

2.3.2 Deterministic Routing and Adaptive Routing

Routing schemes can also be classified as deterministic or adaptive. With deterministic routing, the path from the source node to the destination node is determined solely by their addresses. Adaptive routing, in contrast, can use more than one path to deliver a message between a given pair of source and destination nodes; the choice of the path taken by a message may depend on the network traffic. Minimal adaptive routing refers to adaptive routing that always takes a shortest path from the source node to the destination node.
Nonminimal adaptive routing can take any path between the source and the destination. The deterministic routing strategy is simple and easy to implement in hardware. Since the path from a source to a destination is determined a priori, restrictions can be imposed on the selection of the path so that deadlock is avoided. For example, E-cube routing for the hypercube and XY routing for the 2D mesh are well-known deterministic deadlock-free routing schemes that have been used in many multicomputers [34] [35]. However, although the network of a multicomputer usually has many paths between a pair of nodes, only one is regularly used; the load may not be evenly distributed over the channels, and the scheme is not fault tolerant. Adaptive routing, on the other hand, has the advantage of providing more flexibility to select different channels based on network conditions, which increases network throughput. Also, it can avoid faulty channels to achieve fault tolerance. The main issue in providing adaptive routing is to avoid deadlock while keeping the routing algorithm relatively simple. By introducing virtual channels, some adaptive routing schemes have been proposed [36]. A partially adaptive routing scheme that does not introduce virtual channels is presented in [37].

2.3.3 Flow Control

Flow control deals with the allocation of communication resources, such as channels and buffers, to messages as they pass along their paths, as well as with the resolution of resource collisions. A good flow control policy should reduce the communication latency. A resource collision occurs when more than one message wants to use the same resource, such as a communication channel. Usually, the message that requests the resource first gets it; the rest are blocked until the resource becomes available. Depending on the switching technology and routing scheme, several methods can be used to resolve collisions.
Upon a channel collision, one of several actions may be taken. The message may be buffered at the intermediate node where the collision occurred; in virtual cut-through switching, the blocked message is buffered until the relevant outgoing channel becomes available. The message may be blocked in place until the requested resource becomes available; in wormhole routing, the blocked message stops progressing and remains in the network until the relevant outgoing channel becomes idle. The blocked message may be dropped, in which case the source node waits for a random time before trying to send the same message again. Finally, the message may be derouted through another channel; in adaptive routing, messages can be routed along different paths according to the usage of the channels. When several messages request the same resource, a selection policy decides which message may use it. Possible selection policies include first come first served, round-robin, fixed resource priority, etc.

2.3.4 Deadlock Issues

In multicomputer networks, communication channels and message buffers constitute the set of permanent reusable resources [27]. The processors that send or receive messages compete for these resources; in short, messages are the entities that compete for them. Deadlock refers to the situation in which a set of messages is blocked forever because each message in the set holds some resources also needed by another. Figure 2.4 shows a deadlock configuration in which four messages each hold some channels required by the other messages.

Figure 2.4. An example of deadlock involving four messages.

In store-and-forward and virtual cut-through switching, buffers are the resources used in message passing, while in circuit switching and wormhole routing, channels are the critical resources if each physical (or virtual) channel has its own set of buffers.

Buffer Deadlock

Buffers are used in a multicomputer to reduce the effect of traffic fluctuations and increase the total bandwidth.
During message transmission, buffers hold both incoming and outgoing packets, and they are released when the packets are sent out or consumed. The size of the buffer pool at each node is an important issue; if it were unlimited, deadlock would never occur. A large number of deadlock-free routing algorithms have been developed for store-and-forward networks [38] [39] [40] [41] [42]. These algorithms are mostly based on the concept of a structured buffer pool. In this method, the message buffers at each node are divided into C + 1 classes numbered from 0 to C, where C is the length of the longest route in the network. The assignment of buffers to message packets is restricted so as to define a partial order on buffer classes. A packet is ready to be sent out only if the buffer pool of class 0 is not empty, in which case it is placed in a buffer of class 0. The queued packet can be sent to the next node only if a buffer of class 1 at that node is available, and so on, until the packet arrives at the destination node. The buffers of class i hold only packets that have traversed at least i hops. Packets are dropped if buffers are unavailable and are retransmitted later. The other method, based on the same idea, is to divide the buffers into C classes; a packet with i hops remaining to its destination can only be placed in a buffer of class i. It is not difficult to prove that the structured buffer pool algorithm is deadlock free, since it assigns a partial order to resources. In the second version of the algorithm, for example, after a destination node consumes a message packet, packets with 1 hop remaining to that destination can always be forwarded and their class-1 buffers released; then packets with 2 hops remaining can move into the freed class-1 buffers, and so on, until all packets reach their destination nodes. There are two major drawbacks of this method.
First, buffer utilization is low due to the restricted usage: many buffers are barely used during message passing. The other problem is that if the diameter of the network is large, the buffers must be divided into many classes, so the number of buffers must also be large. As indicated earlier, more advanced switching technologies, such as circuit switching and wormhole routing, are used in newer multicomputers. We therefore need to study the channel deadlock problem in networks using these switching technologies.

Channel Deadlock

If we suppose that each physical (or virtual) channel has its own small set of buffers, as in wormhole routing and circuit switching, then channels are the critical resources. We consider here the channel deadlock problem in wormhole routing; the solution can also be applied to circuit switching. Figure 2.5 shows an example of a channel deadlock configuration. In [44], a graph model, the channel dependency graph (CDG), is proposed for a direct network to establish the necessary and sufficient condition for deadlock-free routing. A channel dependency graph for a given network I and routing function R is a directed graph D(C, E). The nodes of D are the channels of I. The edges of D are the pairs of channels connected by R:

E = {(c_i, c_j) | R(c_i, n) = c_j for some n in N},

where R(c_i, n) = c_j means that if a message with destination n arrives on channel c_i, the routing function R forwards it to channel c_j toward destination n. It has been proved in [44] that a routing algorithm is deadlock-free if and only if there is no cycle in its CDG. Therefore, we need to restrict routing so as to avoid cycles in the channel dependency graph. Figure 2.5 gives an application of the channel dependency graph. The routing function in this example always forwards a message horizontally first and then vertically, which is called X-first routing.
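To make the acyclicity condition concrete, the sketch below builds the channel dependency graph for X-first routing on a small mesh and checks it for cycles. The 3 x 3 mesh size and the representation of a channel as a directed node pair are our own illustrative choices, not notation from [44].

```python
from itertools import product

W, H = 3, 3                                   # a small 3 x 3 mesh
nodes = list(product(range(W), range(H)))

def next_channel(cur, dst):
    """X-first routing: the outgoing channel (a directed node pair) taken at
    node cur by a message destined to dst, or None upon arrival."""
    (x, y), (dx, dy) = cur, dst
    if x != dx:
        nxt = (x + (1 if dx > x else -1), y)  # correct the X offset first
    elif y != dy:
        nxt = (x, y + (1 if dy > y else -1))  # then the Y offset
    else:
        return None
    return (cur, nxt)

# CDG edge (ci, cj): a message arriving on ci may be forwarded on cj.
cdg = set()
for src, dst in product(nodes, nodes):
    ch = next_channel(src, dst)
    while ch is not None:
        nxt = next_channel(ch[1], dst)
        if nxt is not None:
            cdg.add((ch, nxt))
        ch = nxt

def has_cycle(edges):
    """Cycle detection in a directed graph by iterative depth-first search."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    for start in adj:
        if color.get(start, WHITE) != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(adj.get(start, [])))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for nb in it:
                c = color.get(nb, WHITE)
                if c == GRAY:                 # back edge: a cycle exists
                    return True
                if c == WHITE:
                    color[nb] = GRAY
                    stack.append((nb, iter(adj.get(nb, []))))
                    advanced = True
                    break
            if not advanced:
                color[node] = BLACK
                stack.pop()
    return False

# X-first routing yields an acyclic CDG, hence is deadlock-free:
assert cdg and not has_cycle(cdg)
```

The same check run on a routing function that allowed both X-first and Y-first paths would find a cycle, matching the restriction argued for in the text.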
A similar solution is to divide the network into several acyclic subnetworks, with messages routed in each subnetwork independently. For unicast communication, such a routing scheme cannot cause deadlock, since each subnetwork is acyclic and no cycle can be formed in its channel dependency graph.

Figure 2.5. (a) A direct network I; (b) Channel dependency graph for I and X-first routing.

2.4 Issues of Multicast Communication

Providing support for multicast communication involves several, often conflicting, requirements. First, it is desirable that the message delay from the source to each of the destinations be as small as possible. Sending a separate copy of the message to each destination along shortest paths appears to be a logical solution, but the increased traffic load resulting from these copies may actually hinder the progress of messages. Hence arises the second requirement, that the amount of network traffic be minimized. Unfortunately, as will be shown later, finding optimal multicast routes, in terms of delay and traffic, is NP-hard for the most common multicomputer topologies. The third requirement, then, is that the routing algorithm not be computationally complex. Heuristic algorithms must be employed, but they are constrained by the final requirement, that the multicast communication protocol be deadlock-free. Designing multicast protocols and routing algorithms to meet these requirements naturally depends on the network topology and the underlying switching mechanism used in the multicomputer. The underlying switching technique clearly affects the criteria used to evaluate multicast communication schemes. Based on the different criteria for multicast communication, various multicast routing models should be proposed. Theoretically, we need to examine the computational complexity of the optimal routing problem under each proposed model.
As will be shown later, the optimal routing problems for all of the proposed models are NP-complete for the 2D mesh, hypercube, and 3D mesh topologies. These results make the existence of polynomial-time algorithms for optimal routing unlikely, especially in a distributed manner. Having established the theoretical foundation of the routing problem, we need to develop heuristic multicast routing algorithms for each proposed model. In designing the routing algorithms, three issues are of practical importance. First, the heuristic routing should be performed in a distributed manner. Second, the routing algorithm should be deadlock-free or capable of dealing with deadlock situations. Finally, such routing algorithms should be simple enough for efficient hardware implementation. Wormhole routing is becoming the most promising switching technique for multicomputer networks. We will study new routing models and network partitioning strategies for multicast wormhole routing. Based on these routing models and partitioning strategies, new multicast wormhole routing algorithms must be developed. These routing algorithms should be deadlock-free and give a message a low probability of being blocked. A performance study is necessary to evaluate the multicast routing schemes. The performance of a multicast routing algorithm can be measured by the average traffic it generates. More importantly, we need to study the network latency under dynamic network conditions, because it depends mainly on the interaction of the different messages in the network. We will address these issues in more detail in the following chapters.

CHAPTER 3

MODELS FOR MULTICAST COMMUNICATION

Two major routing design parameters are traffic and network latency. For a given multicast communication, the parameter traffic is quantified as the number of communication links used to deliver the source message to all its destinations. The parameter network latency is the message transmission time.
In an asynchronous multiprocessing environment, the message transmission time should be considered from the destination node's point of view, because the receiving process can continue its execution as soon as the message is received. Obviously, the time required to transmit a message from a source to any destination should be minimized. It is desirable to develop a routing mechanism that completes the communication while minimizing both traffic and network latency. However, these two parameters are not, in general, totally independent, and achieving the lower bound for one may prevent us from achieving it for the other. The network latency depends on the underlying switching technology. As shown in Chapter 2, in the store-and-forward mechanism the network latency is linearly proportional to the number of hops (links) between two nodes; thus, the parameter network latency is usually represented by the number of hops. Minimizing network latency, in this case, implies that the message should be delivered to each destination node through a shortest path. In second generation multicomputers, more advanced switching mechanisms have been adopted; for example, the iPSC-2 [43] adopts circuit switching, and the Ametek 2010 (Symult) adopts wormhole routing [26] [44]. In these new switching techniques, the network latency is almost independent of the number of hops between two nodes. Clearly, the underlying switching technique affects the criterion for evaluating multicast communication schemes. Graphs will be used to model the underlying topology of a multicomputer. We will closely follow the graph-theoretic terminology and notation of [45]; terms not defined here can be found in that book. Let G(V, E) denote a graph with node set V and edge set E. When G is known from context, the sets V(G) and E(G) will be referred to as V and E, respectively. A path of length n is a sequence of edges e_1, e_2, ..., e_n such that:
1. e_i and e_{i+1} have a common end-node, and

2. if e_i is neither the first nor the last edge, then it shares one of its end-nodes with e_{i-1} and the other with e_{i+1}.

Suppose e_i = (v_i, v_{i+1}) for 1 <= i <= n. In the following discussion, this path of length n will be represented by its node-visiting sequence (v_1, v_2, ..., v_n, v_{n+1}). A cycle is a path whose starting and ending nodes are the same, i.e., v_1 = v_{n+1}; furthermore, we assume that all nodes in the path, except v_1 and v_{n+1}, are distinct. A graph is said to be connected if every pair of its nodes is joined by a path. A tree is a connected graph that contains no cycles. A graph F(V, E) is a subgraph of another graph G(V, E) if V(F) is a subset of V(G) and E(F) is a subset of E(G). A subgraph that is a tree is referred to as a subtree. For a pair of nodes u, v in V(G), d_G(u, v) denotes the length (the number of edges) of a shortest path from u to v in G. The interconnection topology of a multicomputer is denoted by a host graph G(V, E), where each node in V corresponds to a processor and each edge in E corresponds to a communication link. For a multicast communication, let u_0 denote the source node and u_1, u_2, ..., u_k denote the k destination nodes, where k >= 1. The set K = {u_0, u_1, ..., u_k}, which is a subset of V(G), is called a multicast set. Depending on the underlying switching technique and the routing method, the multicast communication problem in a multicomputer can be formulated as different graph-theoretic problems.

3.1 Multicast Path and Cycle Problems

In some communication mechanisms, replicating an incoming message so that it can be forwarded to multiple neighboring nodes involves too much overhead and is usually undesirable. Thus, the routing method does not allow a processor to replicate a message passing through it. Also, as indicated in [46], a multicast path model provides better performance than the tree model when there is contention in the network.
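As a quick illustration of the distance function d_G defined above (the breadth-first-search check and the example mesh are our own, not the dissertation's), in a 2D mesh d_G(u, v) reduces to the Manhattan distance between the two coordinates:

```python
from collections import deque

def mesh_distance(width, height, u, v):
    """d_G(u, v) in a width x height mesh, computed by breadth-first search."""
    dist = {u: 0}
    q = deque([u])
    while q:
        x, y = q.popleft()
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nb[0] < width and 0 <= nb[1] < height and nb not in dist:
                dist[nb] = dist[(x, y)] + 1
                q.append(nb)
    return dist[v]

# The shortest-path length equals the Manhattan distance |ux-vx| + |uy-vy|:
assert mesh_distance(4, 3, (0, 0), (3, 2)) == 3 + 2
```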
From a communication technology point of view, the multicast path model is more suitable for newer communication mechanisms such as wormhole routing. The multicast communication problem then becomes the problem of finding a shortest path starting from u_0 and visiting all k destination nodes. This optimization problem, the finding of an optimal multicast path (OMP), is formally defined below.

Definition 3.1 A multicast path (MP) (v_1, v_2, ..., v_n) for a multicast set K in G is a subgraph P(V, E) of G, where V(P) = {v_1, v_2, ..., v_n} and E(P) = {(v_i, v_{i+1}) : 1 <= i <= n - 1}, such that v_1 = u_0 and K is a subset of V(P). An OMP is an MP with the shortest total length.

Figure 3.1 shows an example of a multicast path in a 2D mesh graph.

Figure 3.1. An example of multicast path in 2D mesh.

Reliable communication is essential to a message passing system. Usually, a separate acknowledgement message is sent from every destination node to the source node upon receipt of a message. One way to avoid sending |K| separate acknowledgement messages is to have the source node itself receive a copy of the message it initiated after all destination nodes have been visited. Acknowledgements are provided in the form of error bits flagged by intermediate nodes when a transmission error is detected. Thus, the multicast communication problem becomes the problem of finding a shortest cycle, called an optimal multicast cycle (OMC), for K.

Definition 3.2 A multicast cycle (MC) (v_1, v_2, ..., v_n, v_1) for K is a subgraph C(V, E) of G, where V(C) = {v_1, v_2, ..., v_n} and E(C) = {(v_n, v_1)} together with {(v_i, v_{i+1}) : 1 <= i <= n - 1}, such that K is a subset of V(C). An OMC is an MC with the shortest total length.

3.2 Steiner Tree Problem

Both the OMC and OMP models assume that the message will not be replicated by any node during transmission. However, message replication can be implemented with hardware support [21].
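Definition 3.1 can be checked mechanically. The sketch below (the example multicast set and path are our own) verifies that a node sequence in a 2D mesh is a multicast path for K and reports its total length:

```python
def is_multicast_path(seq, K, u0):
    """Check Definition 3.1 on a 2D mesh: seq must start at u0, move only
    between adjacent mesh nodes, and visit every node of the multicast set K."""
    if seq[0] != u0 or not K <= set(seq):
        return False
    # consecutive nodes must be mesh neighbors (Manhattan distance 1)
    return all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
               for a, b in zip(seq, seq[1:]))

def path_length(seq):
    """Total length of the path, i.e., its number of edges."""
    return len(seq) - 1

u0 = (0, 0)
K = {(0, 0), (2, 0), (2, 1)}
seq = [(0, 0), (1, 0), (2, 0), (2, 1)]
assert is_multicast_path(seq, K, u0)
assert path_length(seq) == 3
```

An OMP is then simply a valid multicast path of minimum `path_length`; as shown in the next chapter, finding one is NP-complete.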
If our major concern is to minimize traffic, the multicast problem becomes the well-known Steiner tree (ST) problem [47]. An example of a Steiner tree in a 2D mesh is shown in Fig. 3.2. Formally, we restate the ST problem as follows.

Definition 3.3 A Steiner tree S(V, E) for a multicast set K is a subtree of G such that K is a subset of V(S). A minimal Steiner tree (MST) is an ST with minimal total length.

Figure 3.2. An example of Steiner tree in 2D mesh.

3.3 Multicast Tree Problem

The ST problem does not require the use of a shortest path from the source to each destination. If the distance between two nodes is not a major factor in the network latency, as in virtual cut-through, wormhole routing, and circuit switching, the above optimization problem is appropriate. However, if the distance is a major factor in the communication time, as in the store-and-forward mechanism, then we may wish to minimize time first and then traffic. The multicast communication problem is then modeled as an optimal multicast tree (OMT), as shown in Fig. 3.3. The OMT problem was originally defined in [19].

Definition 3.4 An OMT T(V, E) for K is a subtree of G such that (a) K is a subset of V(T), (b) d_T(u_0, u_i) = d_G(u_0, u_i) for 1 <= i <= k, and (c) |E(T)| is as small as possible.

Figure 3.3. An example of multicast tree in 2D mesh.

3.4 Multicast Star Problem

In newer switching technologies such as wormhole routing, the tree-like multicast models are not suitable, because the message passing of the entire tree is blocked if any one of its branches is blocked by contention for communication resources [48] [49]. Further, the deadlock properties of wormhole routing are different from those of store-and-forward and virtual cut-through. As will be shown later, more than one path is needed for deadlock-free multicast wormhole routing.
A practical multicast routing scheme must be deadlock-free; it should also deliver the source message to each destination node using as little transmission time and as few communication channels as possible. The optimal routing problem in the routing scheme described above becomes the problem of finding an optimal multicast star in a given host graph.

Definition 3.5 A multicast star (MS) S(V, E) of a host graph G(V, E) for a multicast set K is a collection of multicast paths, where the i-th path starts from the source node and reaches each destination in D_i, with the union of all D_i plus {u_0} equal to K, and D_i and D_j disjoint for i != j. An optimal MS (OMS) is an MS with minimum length.

Figure 3.4 shows an example of a multicast star in a 2D mesh graph.

Figure 3.4. An example of multicast star in 2D mesh.

The graph optimization problems can be stated as follows: given a host graph G, a multicast set K, and an integer l, does there exist an OMP (OMC, MST, OMT, OMS) for K with total length less than or equal to l? We shall investigate the computational complexities of these optimization problems for the 2D mesh, hypercube, and 3D mesh topologies in the next chapter.

CHAPTER 4

OPTIMAL MULTICAST IN MULTICOMPUTERS

Clearly, the complexity of each of the above optimization problems depends directly on the underlying host graph. We will focus on two popular host graphs: the n-cube and the 2D mesh. Results obtained for the 2D mesh can be easily extended to the 3D mesh topology. In this chapter, we will examine the computational complexities of the MP, MC, ST, and MS optimization problems for these topologies. The OMT problem was originally studied in [19] and has been proven to be NP-complete for n-cube graphs [50]. It is still open whether the OMT problem is NP-complete for 2D mesh graphs. The OMT problem will not be discussed here.

4.1 Optimal Multicast in 2D Mesh Topology

We first consider the case in which the host graph is a 2D mesh (without wraparound), as adopted in the Ametek 2010 [26].
Before discussing the 2D mesh graph, we briefly review a more general class of graphs, called grid graphs [51]. Let G^inf be the infinite graph whose vertex set consists of all points of the plane with integer coordinates and in which two vertices are connected if and only if the Euclidean distance between them is equal to 1 [51]. A grid graph G(V, E) is a finite, node-induced subgraph of G^inf. A grid graph is completely specified by its vertex set V(G). Let v_x and v_y be the coordinates of a vertex v in V(G). A 2D mesh, also called a rectangular graph, is a special case of a grid graph, as defined below.

Definition 4.1 Let M(V, E) represent a 2D N_1 x N_2 mesh graph, which is a grid graph whose vertex set is

V(M) = {v : x_0 <= v_x <= N_1 + x_0 - 1 and y_0 <= v_y <= N_2 + y_0 - 1},

where (x_0, y_0) is the lower left coordinate of the mesh.

In [51], the authors proved a number of related and interesting results, stated below.

G1 The Hamilton cycle (circuit) problem for grid graphs is NP-complete.

G2 The Hamilton path problem, (G, s, t), for grid graphs is NP-complete.

G3 The Hamilton cycle problem for rectangular subgrid graphs is NP-complete.

G4 The Hamilton path problem, (G, s, t), for rectangular subgrid graphs is NP-complete.

Our problem differs from these problems on grid graphs in that we are looking for a multicast cycle or path for a given multicast set K, which is a subset of the 2D mesh M.

Theorem 4.1 The OMC problem for 2D mesh graphs is NP-complete.

Proof: In the proof of NP-completeness of this theorem, we shall use the known NP-completeness results above. First, we show that we can construct a 2D mesh graph M(V, E) from G(V, E) in polynomial time. Then we show that we can select a multicast set K in M(V, E) in polynomial time. Finally, we prove that there is an OMC for K in M(V, E) with total length less than or equal to some number if and only if G(V, E) has a Hamilton cycle.
From G1, we know the Hamilton cycle problem is NP-complete for a grid graph G(V,E). Given a grid graph G(V,E), we can easily construct a 2D mesh graph M(V,E) such that V(G) ⊆ V(M) in polynomial time. Then, we select K = V(G). It is straightforward to see that G has a Hamilton cycle if and only if there exists an OMC for K in M with total length |V(G)|. Thus, the OMC problem for 2D mesh graphs is NP-complete. This technique will also be used in the proofs of the other NP-completeness theorems. □

The Hamilton path problem (G,s,t) in G2 has a solution if there exists a Hamilton path from s to t in G. To prove that the OMP problem is NP-complete, we need the following lemma, which is more general than G2 in that there is no restriction on which node should be the termination node of the Hamilton path.

Lemma 4.1 The Hamilton path problem with a given starting node for grid graphs is NP-complete.

Proof: For a given grid graph G(V,E), we construct a grid graph G'(V,E) such that G has a Hamilton cycle if and only if G' has a Hamilton path starting from a given node s. First, we select a corner node u = (u_x, u_y) such that u_x = min_{v∈V(G)} {v_x} and u_y = min_{v∈V(G), v_x = u_x} {v_y}. We define the following four points:

p = (u_x − 1, u_y), q = (u_x − 1, u_y + 1), t = (u_x − 2, u_y + 1), s = (u_x − 1, u_y − 1).

Then, we construct the grid graph G'(V,E) such that V(G') = V(G) ∪ {p, q, t, s} (see Fig. 4.1).

Figure 4.1. The construction of G' from G.

Note that the degree of each node in a grid graph is at most 4, and for a given node v of a grid graph, the coordinates of its four possible adjacent nodes are (v_x − 1, v_y), (v_x + 1, v_y), (v_x, v_y − 1), and (v_x, v_y + 1), respectively. Suppose there exists a Hamilton cycle in G; then the degree of each node in V(G) is at least 2.
Now, consider the node u. Since its two possible adjacent nodes with coordinates (u_x − 1, u_y) and (u_x, u_y − 1) are not in V(G), due to the selection of u, the other two possible adjacent nodes, with coordinates (u_x, u_y + 1) and (u_x + 1, u_y), must be in V(G). Let w = (u_x, u_y + 1) and r = (u_x + 1, u_y) (see Fig. 4.1). Obviously, both edges (u,r) and (w,u) must be in the Hamilton cycle. Suppose this Hamilton cycle is (u, r, ..., w, u). It is easy to see that G' has a Hamilton path (s, p, u, r, ..., w, q, t).

Conversely, suppose G' has a Hamilton path starting from s. By the selection of u, we know p, q, t, and s are not in V(G), and E(G') = E(G) ∪ {(q,w), (t,q), (q,p), (p,s), (p,u)}. The Hamilton path must end with t, since the degree of t is 1. It is straightforward to check that such a Hamilton path must have the form (s, p, u, r, ..., w, q, t). Therefore, (u, r, ..., w, u) is a Hamilton cycle in G. □

The transformation from the Hamilton path problem stated in Lemma 4.1 to the OMP problem for 2D mesh graphs is similar to the one in the proof of Theorem 4.1. Thus, we have the following theorem.

Theorem 4.2 The OMP problem is NP-complete for 2D mesh graphs.

For the multicast star problem, we have the following result.

Theorem 4.3 The OMS problem is NP-complete for 2D mesh graphs [52].

Proof: To show that the MS problem for 2D mesh graphs is NP-complete, we reduce the known NP-complete problem of finding a Hamilton cycle in a grid graph [51] to this problem. Given a grid graph G, construct G' as in Lemma 4.1; as in the proof of Theorem 4.1, a 2D mesh graph M(V,E) can be constructed in polynomial time such that V(G') ⊆ V(M) (Fig. 4.1). Now select K = V(G') and u_0 = s. We need to prove that

(1) G has a Hamilton cycle if and only if G' has a Hamilton path starting from s;

(2) G' has a Hamilton path starting from s if and only if M(V,E) has an MS with s as the root and total length |V(G')| − 1, where |V(G')| = |V(G)| + 4.
The proof of (1) is the same as that of Lemma 4.1, and here we only prove (2), which is straightforward. Suppose G' has a Hamilton path starting from s. Since G' is a subgraph of M(V,E), the Hamilton path is also an MS for K (K = V(G')) in M(V,E) with s as the root, and it has the minimum length |V(G')| − 1. On the other hand, if M(V,E) has an MS with s as the root and with length |V(G')| − 1, the MS can contain only the nodes in K, since K = V(G'). Because the only neighbor of s in K is p, the MS consists of only one path starting from s through p; otherwise at least one node of the MS would not be in K. Obviously, this MS is a Hamilton path in G'. The proof of the theorem is completed by combining (1) and (2). □

Garey and Johnson [52] proved that the rectilinear Steiner tree (RST) problem is NP-complete. An RST for a set A is a tree, composed solely of horizontal and vertical line segments, that interconnects all the points in A. Replacing A by the multicast set K, this problem is the same as the MST problem for 2D mesh graphs. The following theorem can be derived directly from [52].

Theorem 4.4 The MST problem is NP-complete for 2D mesh graphs [52].

4.2 Optimal Multicast in Hypercube Topology

We first formally define a hypercube, or n-cube, graph.

Definition 4.2 Let H(V,E) represent an n-dimensional hypercube graph, with |V(H)| = 2^n, in which each node in V(H) has a unique n-bit binary address. For each v ∈ V(H), let b(v) denote its n-bit binary address and || b(v) || represent the number of 1's in b(v). An edge e = (u,v) ∈ E(H) if and only if || b(u) ⊕ b(v) || = 1, where ⊕ is the bitwise exclusive OR operation on binary numbers.

The shortest distance between any two nodes u, v ∈ V(H) is d_H(u,v) = || b(u) ⊕ b(v) ||.

To show that the OMC problem for n-cube graphs is NP-complete, we will reduce the known NP-complete problem of finding a Hamilton cycle in a grid graph to this problem.
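Definition 4.2 and the distance formula can be checked mechanically. The following small Python sketch (our own illustration, not part of the dissertation; integer node addresses stand in for the n-bit binary strings) computes d_H and the adjacency test:

```python
def h_dist(u, v):
    """d_H(u, v) = || b(u) XOR b(v) ||: the number of 1's in the
    bitwise XOR of the two binary addresses (Hamming distance)."""
    return bin(u ^ v).count('1')

def adjacent(u, v):
    """An edge (u, v) is in E(H) iff the addresses differ in exactly one bit."""
    return h_dist(u, v) == 1
```

For example, in the 4-cube the nodes 0b0101 and 0b1010 are at distance 4 (all four bits differ), while 0b0101 and 0b0100 are adjacent.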
First, given a grid graph G(V,E) with |V(G)| = k, consider an n-cube graph H(V,E) with n = 4k. For a node q ∈ V(H), its address b(q) = b_0(q) b_1(q) ... b_{n−1}(q) can be thought of as consisting of k blocks, each of length 4. Thus, we can express b(q) = a_0(q) a_1(q) ... a_{k−1}(q), where a_i(q) = b_{4i}(q) b_{4i+1}(q) b_{4i+2}(q) b_{4i+3}(q) for 0 ≤ i < k.

The next step is to select K = {u_0, u_1, ..., u_{k−1}}, where K ⊆ V(H), such that G(V,E) has a Hamilton cycle if and only if H(V,E) has an OMC with length less than or equal to 6k. To begin with, we apply the breadth-first search algorithm to G(V,E) with an arbitrary node v_0 as the starting node. As a result, the node set V(G) is partitioned into w disjoint subsets A_0, A_1, ..., A_{w−1}, where A_0 = {v_0} and A_i = {v : d_G(v_0, v) = i} for 1 ≤ i < w. Next, we order the nodes of V(G) as v_0, v_1, ..., v_{k−1} in such a way that for every pair of nodes v_i ∈ A_p, v_j ∈ A_h, if p < h, then i < j. The nodes in K are selected by the following procedure.

1. Select u_0 ∈ V(H) such that a_0(u_0) = 1111 and a_h(u_0) = 0000 for 0 < h < k.

2. For m = 1 to k − 1, select u_m ∈ V(H) as follows:

(a) Let V_m = {v_p : p < m and (v_p, v_m) ∈ E(G)}. For each v_p ∈ V_m, let U_{p,m} = {v_q : p < q < m and (v_p, v_q) ∈ E(G)}.

i. if |U_{p,m}| = 0, a_p(u_m) = 1000
ii. if |U_{p,m}| = 1, a_p(u_m) = 0100
iii. if |U_{p,m}| = 2, a_p(u_m) = 0010
iv. if |U_{p,m}| = 3, a_p(u_m) = 0001

(b) If |V_m| = 1, then a_m(u_m) = 1110. If |V_m| = 2, then a_m(u_m) = 1100.

(c) For all other address blocks, i.e., for v_h not in V_m ∪ {v_m}, 0 ≤ h < k, set a_h(u_m) = 0000.

By the ordering of the nodes in V(G) and the definition of V_m, we have |V_0| = 0 and 1 ≤ |V_m| ≤ 2 for m > 0. Since the degree of each node in a grid graph is at most 4, |U_{p,m}| ≤ 3. Thus, K can be constructed by the above procedure in polynomial time. The following example should make the procedure clear.

Example 4.1 Consider the grid graph G of Fig. 4.2.
After applying the breadth-first search algorithm, the node set is partitioned as A_0 = {v_0}, A_1 = {v_1, v_2}, A_2 = {v_3, v_4}, A_3 = {v_5, v_6}, and A_4 = {v_7}.

Figure 4.2. A simple grid graph with 8 nodes.

Consider H(V,E), where n = 4k = 4 × 8 = 32, and select K = {u_0, u_1, ..., u_7} such that

a(u_0) = 1111, 0000, 0000, 0000, 0000, 0000, 0000, 0000
a(u_1) = 1000, 1110, 0000, 0000, 0000, 0000, 0000, 0000
a(u_2) = 0100, 0000, 1110, 0000, 0000, 0000, 0000, 0000
a(u_3) = 0000, 1000, 1000, 1100, 0000, 0000, 0000, 0000
a(u_4) = 0000, 0000, 0000, 1000, 1110, 0000, 0000, 0000
a(u_5) = 0000, 0000, 1000, 0000, 0000, 1110, 0000, 0000
a(u_6) = 0000, 0000, 0000, 0100, 0000, 1000, 1100, 0000
a(u_7) = 0000, 0000, 0000, 0000, 1000, 0000, 1000, 1100.

Note that in the above example, for every pair of nodes u_i and u_j in K, 0 ≤ i, j ≤ 7, d_H(u_i, u_j) = 6 if (v_i, v_j) is in E(G); otherwise, d_H(u_i, u_j) = 8. In what follows, we show that this holds in general.

For each v_m in V(G), let U_m = V_m ∪ {v_m}. From the selection procedure, we have the following properties:

Property 1: Σ_{v_h ∈ U_m} || a_h(u_m) || = 4;

Property 2: For any v_p ∈ V_m, || a_p(u_m) || ≥ 1; and

Property 3: For any v_h ∈ V(G) − U_m, || a_h(u_m) || = 0.

Since d_H(u_i, u_j) = || b(u_i) ⊕ b(u_j) || = Σ_{m=0}^{k−1} || a_m(u_i) ⊕ a_m(u_j) || and V(G) = {v_0, v_1, ..., v_{k−1}}, we have d_H(u_i, u_j) = Σ_{v_m ∈ V(G)} || a_m(u_i) ⊕ a_m(u_j) ||. By Property 3, Σ_{v_h ∈ V(G)−U_i} || a_h(u_i) || = 0 and Σ_{v_h ∈ V(G)−U_j} || a_h(u_j) || = 0, so

Σ_{v_h ∈ V(G)−U_i∪U_j} || a_h(u_i) ⊕ a_h(u_j) || = 0.

Thus,

d_H(u_i, u_j) = Σ_{v_m ∈ U_i∪U_j} || a_m(u_i) ⊕ a_m(u_j) ||
= Σ_{v_m ∈ U_i−U_i∩U_j} || a_m(u_i) ⊕ a_m(u_j) || + Σ_{v_m ∈ U_i∩U_j} || a_m(u_i) ⊕ a_m(u_j) || + Σ_{v_m ∈ U_j−U_i∩U_j} || a_m(u_i) ⊕ a_m(u_j) ||.

Again by Property 3, Σ_{v_m ∈ U_j−U_i∩U_j} || a_m(u_i) || = 0 and Σ_{v_m ∈ U_i−U_i∩U_j} || a_m(u_j) || = 0. Finally, we have the following equation:

d_H(u_i, u_j) = Σ_{v_m ∈ U_i−U_i∩U_j} || a_m(u_i) || + Σ_{v_m ∈ U_i∩U_j} || a_m(u_i) ⊕ a_m(u_j) || + Σ_{v_m ∈ U_j−U_i∩U_j} || a_m(u_j) ||.   (4.1)

Lemma 4.2 For every pair of nodes u_i and u_j in K, d_H(u_i, u_j) = 8 if (v_i, v_j) is not in E(G).
Proof: By the definition of V_i and V_j, if (v_i, v_j) ∉ E(G), then v_j ∉ V_i and v_i ∉ V_j. Since

U_i ∩ U_j = (V_i ∪ {v_i}) ∩ (V_j ∪ {v_j}) = (V_i ∩ V_j) ∪ (V_i ∩ {v_j}) ∪ (V_j ∩ {v_i}) ∪ ({v_i} ∩ {v_j}),

we have U_i ∩ U_j = V_i ∩ V_j. Suppose V_i ∩ V_j = ∅. Equation 4.1 becomes

d_H(u_i, u_j) = Σ_{v_m ∈ U_i} || a_m(u_i) || + Σ_{v_m ∈ U_j} || a_m(u_j) ||.

That is, d_H(u_i, u_j) = 4 + 4 = 8 by Property 1. On the other hand, if V_i ∩ V_j ≠ ∅, then for each v_p ∈ V_i ∩ V_j we have |U_{p,i}| < |U_{p,j}|, since i < j and v_i ∈ U_{p,j}. From Step 2(a) of the selection procedure, it is not difficult to check that

|| a_p(u_i) ⊕ a_p(u_j) || = || a_p(u_i) || + || a_p(u_j) ||.

Therefore, for (v_i, v_j) ∉ E(G), we have

d_H(u_i, u_j) = Σ_{v_m ∈ U_i−U_i∩U_j} || a_m(u_i) || + Σ_{v_m ∈ U_i∩U_j} || a_m(u_i) ⊕ a_m(u_j) || + Σ_{v_m ∈ U_j−U_i∩U_j} || a_m(u_j) ||
= Σ_{v_m ∈ U_i} || a_m(u_i) || + Σ_{v_m ∈ U_j} || a_m(u_j) || = 4 + 4 = 8. □

Lemma 4.3 For every pair of nodes u_i and u_j in K, d_H(u_i, u_j) = 6 if (v_i, v_j) is in E(G).

Proof: By the definition of V_i and V_j, if (v_i, v_j) ∈ E(G), then v_j ∉ U_i and v_i ∈ U_j. We have V_i ∩ V_j = ∅; otherwise, if there were a node v_p ∈ V_i ∩ V_j, then again by the definition of V_i and V_j there would be a triangle {(v_i, v_p), (v_j, v_p), (v_i, v_j)} in G. However, this is impossible in a grid graph, G. Thus, U_i ∩ U_j = {v_i}.

Let us consider || a_i(u_i) ⊕ a_i(u_j) || in terms of the number of elements of V_i.

1. If |V_i| = 0, then i = 0 by the definition of V_i. From Step 1 of the selection procedure, we have a_0(u_0) = 1111, and || a_0(u_j) || = 1 (Property 2). Thus, || a_0(u_0) ⊕ a_0(u_j) || = 3 = || a_0(u_0) || + || a_0(u_j) || − 2.

2. If |V_i| = 1, then a_i(u_i) = 1110 (Step 2(b)). In this case, we must have |U_{i,j}| ≤ 2, because v_i already has both a neighbor v_p ∈ V_i with p < i and the neighbor v_j, with (v_i, v_j) ∈ E(G), and the degree of each node in V(G) is at most 4. Hence, a_i(u_j) is either 1000, 0100, or 0010 by Step 2(a) of the selection procedure. Thus, || a_i(u_i) ⊕ a_i(u_j) || = 2 = || a_i(u_i) || + || a_i(u_j) || − 2.
3. If |V_i| = 2, then v_i has two neighbors v_p ∈ V_i with p < i, so |U_{i,j}| ≤ 1 for the same reason as in case 2. Since a_i(u_i) = 1100 by Step 2(b) of the selection procedure, we have || a_i(u_i) ⊕ a_i(u_j) || = 1 = || a_i(u_i) || + || a_i(u_j) || − 2.

Since (v_i, v_j) ∈ E(G) and U_i ∩ U_j = {v_i}, we have

d_H(u_i, u_j) = Σ_{v_m ∈ U_i−{v_i}} || a_m(u_i) || + || a_i(u_i) ⊕ a_i(u_j) || + Σ_{v_m ∈ U_j−{v_i}} || a_m(u_j) ||
= Σ_{v_m ∈ U_i} || a_m(u_i) || − 2 + Σ_{v_m ∈ U_j} || a_m(u_j) || = 4 − 2 + 4 = 6. □

Now we can show that G has a Hamilton cycle if and only if H has an OMC for K with length less than or equal to 6k. Suppose G has a Hamilton cycle (v_{i_1}, v_{i_2}, ..., v_{i_k}, v_{i_1}); then (v_{i_j}, v_{i_{j+1}}) ∈ E(G) for 1 ≤ j < k and (v_{i_k}, v_{i_1}) ∈ E(G). By Lemma 4.3, for all the nodes u_{i_j} (1 ≤ j ≤ k) of K, d_H(u_{i_j}, u_{i_{j+1}}) = 6 (1 ≤ j < k) and d_H(u_{i_k}, u_{i_1}) = 6. Hence, there exists a cycle (u_{i_1}, ..., u_{i_2}, ..., u_{i_k}, ..., u_{i_1}) in H that contains all the nodes in K and has length 6k. That is, there must exist an OMC with length less than or equal to 6k.

Conversely, if H has an OMC for K with length less than or equal to 6k, say (u_{i_1}, ..., u_{i_2}, ..., u_{i_k}, ..., u_{i_1}), then d_H(u_{i_j}, u_{i_{j+1}}) ≥ 6 for 1 ≤ j < k and d_H(u_{i_k}, u_{i_1}) ≥ 6, by Lemma 4.2 and Lemma 4.3. Since the length of the OMC is at most 6k, we must have d_H(u_{i_j}, u_{i_{j+1}}) = 6 (1 ≤ j < k) and d_H(u_{i_k}, u_{i_1}) = 6. Again by Lemma 4.3, (v_{i_j}, v_{i_{j+1}}) ∈ E(G) (1 ≤ j < k) and (v_{i_k}, v_{i_1}) ∈ E(G). Therefore, the cycle (v_{i_1}, v_{i_2}, ..., v_{i_k}, v_{i_1}) is a Hamilton cycle of G. Thus, we have established the following theorem.

Theorem 4.5 The OMC problem is NP-complete for n-cube graphs.

Note that, due to the regular structure of the n-cube graph, we do not need to construct H explicitly in the reduction. The input size of the reduced problem is max{n, k}, where k is the number of destinations. The reduction in the proof of Theorem 4.5 and the solution of the reduced problem are independent of the number of nodes in the n-cube graph.
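The selection procedure and the 6-versus-8 distance dichotomy of Lemmas 4.2 and 4.3 can also be checked mechanically. The Python sketch below is our own illustration (the 2 × 2 grid is a made-up example rather than the graph of Fig. 4.2, and the vertex list is assumed to be a connected grid graph listed in BFS order, as the proof requires); it builds the 4k-bit addresses and verifies the distances:

```python
def build_codes(vertices, edges):
    """Construct the 4k-bit addresses u_0..u_{k-1} for the vertices of a
    grid graph, following Steps 1-2 of the selection procedure above.
    `vertices` must already be in BFS order; `edges` is a set of
    frozensets of vertex pairs. Assumes the graph is connected, so
    1 <= |V_m| <= 2 for m > 0."""
    k = len(vertices)
    blocks = []
    for m, vm in enumerate(vertices):
        a = ['0000'] * k
        if m == 0:
            a[0] = '1111'                       # Step 1
        else:
            Vm = [p for p in range(m)
                  if frozenset((vertices[p], vm)) in edges]
            for p in Vm:                        # Step 2(a)
                Upm = sum(1 for q in range(p + 1, m)
                          if frozenset((vertices[p], vertices[q])) in edges)
                a[p] = {0: '1000', 1: '0100', 2: '0010', 3: '0001'}[Upm]
            a[m] = '1110' if len(Vm) == 1 else '1100'   # Step 2(b)
        blocks.append(''.join(a))
    return blocks

def hamming(x, y):
    """d_H between two binary address strings."""
    return bin(int(x, 2) ^ int(y, 2)).count('1')
```

Running it on the 4-cycle grid {(0,0),(0,1),(1,0),(1,1)} yields distance 6 exactly for the grid edges and 8 for the two non-adjacent pairs, as the lemmas predict.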
By transforming the problem of finding a Hamilton path in a grid graph G (see Lemma 4.1) into the problem of finding an OMP in an n-cube graph H, in the same way as in the proof of Theorem 4.5, it can be shown that G has a Hamilton path if and only if H has an OMP with length equal to 6(k − 1), where k = |V(G)|. So we have the following theorem.

Theorem 4.6 The OMP problem for n-cube graphs is NP-complete.

Using the same proof process, we can easily obtain the following theorem.

Theorem 4.7 The OMS problem for n-cube graphs is NP-complete.

Proof: We again reduce the Hamilton path problem for grid graphs to this problem. Given a grid graph G(V,E), we consider an n-cube graph H(V,E) with n = 4k, where k = |V(G)|, as in the proof of Theorem 4.5. By the method used in the proof of Theorem 4.5, it can be shown that for every pair of nodes u_i and u_j in K, d_H(u_i, u_j) = 6 if (v_i, v_j) is in E(G), and d_H(u_i, u_j) = 8 if (v_i, v_j) is not in E(G). By the same arguments as in the proof of Theorem 4.3, we can show that G has a Hamilton path if and only if H has an OMS for K with length less than or equal to 6(k − 1). □

For the problem of finding an MST in n-cube graphs, it was proved in [53] that the MST problem is NP-complete for n-cube graphs.

Theorem 4.8 The MST problem for n-cube graphs is NP-complete [53].

4.3 Optimal Multicast in 3D Mesh Topology

We have studied the multicast issues for the 2D mesh and hypercube topologies, the most popular topologies in current multicomputers. However, other interconnection topologies are also used or will be adopted in future multicomputers, for example, the cube-connected-cycles and general k-ary n-cube topologies. The 3D mesh topology is also becoming popular. Because a 2D mesh graph is a subgraph of a 3D mesh graph, it is not difficult to obtain the following corollaries.

Corollary 4.1 The OMC problem is NP-complete for 3D mesh graphs.
Corollary 4.2 The OMP problem is NP-complete for 3D mesh graphs.

Corollary 4.3 The MST problem is NP-complete for 3D mesh graphs.

Corollary 4.4 The OMS problem is NP-complete for 3D mesh graphs.

It remains open whether the OMT problem is NP-complete for 2D mesh graphs. It is also open for 3D mesh graphs.

CHAPTER 5

BASIC HEURISTIC MULTICAST ROUTING ALGORITHMS

As shown in the previous chapter, all the proposed optimal multicast problems are NP-complete. Thus, several basic heuristic multicast algorithms for the MP, MC, ST, and MT problems are given in this chapter. The heuristic multicast algorithms for the MS problem will be proposed in the next chapter. A performance study of the routing algorithms will be presented in Chapter 7.

We present new heuristic distributed routing schemes for the multicast optimization problems. In our distributed routing algorithms, the source node must perform a preprocessing function (message preparation) to obtain some routing control information. This routing control information is carried in the message, in conjunction with the destination addresses, to help forward nodes make efficient routing decisions. In this sense, our distributed routing algorithm may be considered a hybrid algorithm.

For simplicity, for each node v ∈ V(G), its address is also denoted by v. Edge and link (channel) will be used interchangeably. As mentioned before, the multicast set K is {u_0, u_1, ..., u_k}, where u_0 is the source node and u_1 through u_k are destination nodes. Let G(V,E) represent a 2D mesh topology or an n-cube topology. Without loss of generality, we assume that the width of at least one dimension of the 2D mesh is an even number.

5.1 Heuristic Routing Algorithms for MP and MC

The problem of finding an optimal MC (or MP) resembles the traveling salesman problem [54].
In finding an MC (or MP) for multicast routing, however, the destination nodes cannot be visited in an arbitrary order, since the host graph is a mesh graph or an n-cube graph. As a result, some well-known heuristic algorithms for the traveling salesman problem, such as the nearest-neighbor algorithm, cannot be applied to this kind of problem. The proposed algorithms are heuristic in nature and make use of the following facts.

F1: There exists a Hamilton cycle (path) in an N_1 × N_2 mesh graph when N_1 or N_2 is even.

F2: There exists a Hamilton cycle (path) in an n-cube graph for any integer n.

F3: Given a 2D mesh graph or an n-cube graph G and a multicast set K of G, there exists an MC (or MP) in G.

Note that F3 is due to the fact that a Hamilton cycle (path) is an MC (MP) for any multicast set in G. Let C be a Hamilton cycle in the host graph G, where C = (v_1, v_2, ..., v_m, v_1), m = |V(G)|, and v_1 = u_0. We define a mapping h such that h(v_i) = i, the position of v_i in the cycle. For a given Hamilton cycle C, which is known to all nodes, each node u can determine in advance h(u) and the h values of its neighboring nodes.

Each of the following routing algorithms has two parts. The first part (Fig. 5.1), message preparation, is performed by the source node, and the second part (Fig. 5.2), message routing, is performed by each forward node, including the source node. In the message preparation part, a routing control field, D, which carries the destination addresses and some routing information, is prepended to the actual message. In the message routing part, the routing control field D may be modified. The basic idea in this part is to choose the next node that is closest in the cycle to the next destination.

Algorithm: The sorted MP algorithm for Message Preparation
Input: Multicast set K and local address u_0
Output: A sorted destination node list D
Procedure:
1.
For each u_i (1 ≤ i ≤ k), if h(u_i) < h(u_0), then set f(u_i) = h(u_i) + m, where m = |V(G)|; otherwise, set f(u_i) = h(u_i).
2. Sort the destination nodes u_1, ..., u_k using the f value as the key. Put the sorted nodes in a list D in ascending order.

Figure 5.1. The sorted MP algorithm for message preparation

Theorem 5.1 Let G represent a 2D mesh or an n-cube graph. Given a multicast set K, the edges selected by the sorted MP algorithm induce an MP for K in G.

Proof: To prove that the edges selected by the algorithm induce an MP, we first need the following facts.

Fact 1: For any pair of nodes u and v in V(G), if f(u) = f(v), then u = v.

Algorithm: The sorted MP algorithm for Message Routing
Input: Local address w, sorted destination list D = {d_1, ..., d_l}
Output: A new destination list D' and a neighboring node w', or nothing.
Procedure:
1. If w = d_1, then D' = D − {d_1} and a copy of the message is sent to the local node; otherwise, D' = D.
2. If D' = ∅, then return nothing.
3. Let d be the first node in D'. For each neighboring node v, set f(v) = h(v) + m if h(v) < h(u_0); otherwise, set f(v) = h(v). Select a neighboring node w' such that f(w') = max{f(p) : f(p) ≤ f(d) and p is a neighboring node of w}.

Figure 5.2. The sorted MP algorithm for message routing

Fact 2: If f(w) < f(d), then f(w) < f(w') ≤ f(d), where w is the local node, d is the first node in the sorted list of the remaining destination nodes, and w' is the next forward node selected by Step 3 (Fig. 5.2).

Fact 1 is obvious, since each node has a unique f value. By the definition of f, there exists a neighboring node t of w such that f(t) = f(w) + 1. That is, f(w) < f(t) ≤ f(d), since f(w) < f(d). Note that f(t) ≤ f(w') ≤ f(d) by the selection of w' in Step 3. Therefore, we conclude that Fact 2 is correct. According to the above facts, given a local node w and a destination node d with f(w) < f(d), the next forward node w' is determined uniquely by Step 3.
Now we can prove the theorem by induction on k, the number of destination nodes.

Basis (k = 1): We have D = {u_1}, and u_1 is the first (and only) node in the remaining destination node list D'. The routing starts from u_0, f(u_0) = 1, and f(u_0) < f(u_1). Suppose that w_0 = u_0 and, for i > 0, w_i is the forward node at which the source message arrives after traversing i edges from w_0, i.e., w_i is the forward node selected by w_{i−1} in Step 3. It is easy to see that f(u_0) ≤ f(w_{i−1}) < f(w_i) ≤ f(u_1), by Fact 2. Since f is an integer mapping, there must exist an integer l_1 such that f(w_{l_1}) = f(u_1), and thus w_{l_1} = u_1 by Fact 1. Note that w_i is a neighboring node of w_{i−1}. Clearly, the edges selected by the algorithm induce an MP for K = {u_1}.

Induction: Assume the hypothesis is true for k = j. For k = j + 1, without loss of generality, suppose that after the sorting step in message preparation, D = {u_1, u_2, ..., u_j, u_{j+1}}, where f(u_i) < f(u_{i+1}) for 0 ≤ i ≤ j. By the induction hypothesis, if D = {u_1, u_2, ..., u_j}, the edges selected by the algorithm induce an MP P, where P = (w_0, w_1, ..., w_{l_j}), w_0 = u_0, and w_{l_j} = u_j. For D = {u_1, u_2, ..., u_j, u_{j+1}}, after the source message has arrived at node u_j, we have D' = {u_{j+1}}. By exactly the same arguments as in the basis (replacing u_0 and u_1 by u_j and u_{j+1}, respectively), it can be shown that the algorithm routes the source message from node u_j through a path P' to u_{j+1}, where P' = (w_{l_j}, ..., w_{l_{j+1}}), w_{l_j} = u_j, and w_{l_{j+1}} = u_{j+1}. Since f(w_t) > f(u_j) for l_j < t ≤ l_{j+1} and f(w_t) < f(u_j) for 0 ≤ t < l_j, by Fact 2 the edges selected by the algorithm induce an MP P ∪ P' for {u_1, u_2, ..., u_j, u_{j+1}}, where P ∪ P' = (w_0, w_1, ..., w_{l_j}, w_{l_j+1}, ..., w_{l_{j+1}−1}, w_{l_{j+1}}), w_0 = u_0, and w_{l_{j+1}} = u_{j+1}.

The proof of this theorem is completed by the principle of induction. □
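As an illustration of Theorem 5.1, the sorted MP algorithm can be simulated in a few lines of Python. This is our own sketch, not the dissertation's implementation: `snake_cycle` is an assumed Hamilton-cycle construction for an n × n mesh with n even (playing the role of the cycle C known to all nodes), and the source is assumed to be the first node of the cycle, so that h(u_0) = 1.

```python
def snake_cycle(n):
    """A Hamilton cycle in an n x n mesh (n even): go up column x=0,
    snake through columns x=1..n-1, then return along row y=0."""
    cyc = [(0, y) for y in range(n)]
    for x in range(1, n):
        ys = range(n - 1, 0, -1) if x % 2 else range(1, n)
        cyc += [(x, y) for y in ys]
    cyc += [(x, 0) for x in range(n - 1, 0, -1)]
    return cyc

def route_mp(cycle, source, dests):
    """Simulate the sorted MP algorithm: relabel the cycle with f so the
    source has the smallest label, sort destinations by f (message
    preparation), then repeatedly forward to the neighbor with the
    largest f not exceeding f(d) (Step 3 of message routing)."""
    m = len(cycle)
    h = {v: i + 1 for i, v in enumerate(cycle)}
    f = lambda v: h[v] + m if h[v] < h[source] else h[v]
    D = sorted(dests, key=f)
    path, w = [source], source
    for d in D:
        while w != d:
            x, y = w
            nbrs = [p for p in [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
                    if p in h]
            w = max((p for p in nbrs if f(p) <= f(d)), key=f)
            path.append(w)
    return path
```

On a 4 × 4 mesh, the returned walk starts at the source, moves along mesh links only, and visits every destination in ascending f order, as the theorem guarantees.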
Corollary 5.1 The time complexity of the sorted MP algorithm for message preparation is O(k log k), where G represents a 2D mesh or n-cube graph and k is the number of destinations. The time complexity for message routing is O(1) when G is a 2D mesh topology and O(n) when G is an n-cube topology.

Proof: Note that the mapping h can be determined in advance for a given host graph and starting address u_0. Thus, the computation of the h value for each node takes O(1) time. Step 1 of the message preparation part takes O(k) time to set the f values for all of the destination nodes. Step 2 only sorts the k destination nodes; a well-known sorting algorithm, such as quicksort, may be used with time complexity O(k log k). In the message routing part, both Step 1 and Step 2 take O(1) time when G is a 2D mesh or n-cube graph. Step 3 must examine all the neighboring nodes of w. Since each node in a mesh graph G has at most 4 neighboring nodes, this step takes O(1) time. For n-cube graphs, Step 3 can be done in O(n) time, since each node in G has n neighboring nodes. □

The routing algorithm for MC is similar to the MP algorithm. Suppose C is a Hamilton cycle in G, where C = (v_1, v_2, ..., v_m, v_1) and v_1 = u_0. Again we define h(v_i) = i for i > 1 and h(v_1) = m + 1. By adding u_0 (with h(u_0) = m + 1) to the destination list D, the MP algorithm can be used to find an MC for K.

5.2 Heuristic Routing Algorithms for ST

For a general host graph, Kou, Markowsky, and Berman presented an algorithm (the KMB algorithm) for constructing a Steiner tree [55]. The basic idea is to first construct an auxiliary complete graph G_1 that consists of only the Steiner nodes and gives the distance in the original graph G between any two of the Steiner nodes, then find a minimum spanning tree T_1 of G_1, and finally replace each edge in T_1 by its shortest path in G and prune the result to a Steiner tree. Wall modified the algorithm for use in a distributed environment [56].
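The KMB outline above can be sketched for a 2D mesh host graph in a few lines of Python. This is our own illustration, not the dissertation's algorithm: distances between mesh nodes are Manhattan distances, `xfirst_path` is one arbitrary choice of shortest path, and the minimum spanning tree is computed with Prim's algorithm on the complete distance graph over the multicast set.

```python
def manhattan(u, v):
    """Shortest-path distance between two 2D mesh nodes."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def mst_edges(nodes):
    """Prim's algorithm on the complete graph whose edge weights are
    the mesh distances between multicast nodes (the auxiliary graph
    G_1 of the KMB idea)."""
    intree, edges = {nodes[0]}, []
    while len(intree) < len(nodes):
        u, v = min(((a, b) for a in intree for b in nodes if b not in intree),
                   key=lambda e: manhattan(*e))
        intree.add(v)
        edges.append((u, v))
    return edges

def xfirst_path(u, v):
    """One shortest path in the mesh: correct the x coordinate first,
    then the y coordinate."""
    path, (x, y) = [u], u
    while x != v[0]:
        x += 1 if v[0] > x else -1
        path.append((x, y))
    while y != v[1]:
        y += 1 if v[1] > y else -1
        path.append((x, y))
    return path

def steiner_links(multicast_set):
    """Expand each MST edge into a shortest mesh path and collect the
    resulting set of mesh links (an approximate Steiner tree)."""
    links = set()
    for u, v in mst_edges(multicast_set):
        p = xfirst_path(u, v)
        links |= {frozenset(e) for e in zip(p, p[1:])}
    return links
```

For the three nodes (0,0), (0,2), and (2,0), the sketch selects the four links of the two straight segments meeting at (0,0), which here happens to be an optimal Steiner tree.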
Our Steiner tree algorithm applies to 2D mesh and n-cube graphs. As will be shown below, for any three nodes u, v, and w in either graph, a node t can be located in constant time such that t is the nearest node to w among the nodes on all the shortest paths between u and v. Therefore, in constructing a Steiner tree, we consider not only the Steiner nodes but also the nodes on shortest paths between Steiner nodes. It is not difficult to see that our algorithm is at least as good as the KMB algorithm in the worst case.

The message preparation part (see Fig. 5.3) of the routing algorithm for ST sorts the destination nodes in some order. A forward node in the tree is called a bypass node if it only forwards the message. If the message is replicated, the forward node is called a replicate node. The computational complexity of the routing algorithm executed at each bypass node is a constant in the 2D mesh and n-cube topologies. Each replicate node performs additional computation to determine which neighboring nodes are to receive a copy of the message. As explained later, the maximum number of replicate nodes is k − 1, where k is the number of destinations. The basic idea of the message routing part (see Fig. 5.4) is to construct a Steiner tree by repeatedly adding a closest remaining destination node until all of the destination nodes are covered by the tree.

In Step 4(a) of Fig. 5.4, d_e(u_i) can be computed as follows. Assume that d_e(u_i) = d(u_i, v), e = (s,t), and v ∈ P_e; v can be computed in the following way. For a 2D mesh graph, suppose that x_1 = min{s_x, t_x}, x_2 = max{s_x, t_x}, y_1 = min{s_y, t_y}, and y_2 = max{s_y, t_y}. Also, let x_i = (u_i)_x and y_i = (u_i)_y. It is easy to see that

Algorithm: The greedy ST algorithm for Message Preparation
Input: Multicast set K
Output: A sorted multicast set node list D = (u_0, u_1, ..., u_k).
Procedure:
1. Sort K in ascending order with d(u_0, u_i), 1 ≤ i ≤ k, as the key.
Without loss of generality, suppose d(u_0, u_1) ≤ d(u_0, u_2) ≤ ... ≤ d(u_0, u_k) after sorting.

Figure 5.3. The greedy ST algorithm for message preparation

v_x = x_1 if x_i < x_1; v_x = x_i if x_1 ≤ x_i ≤ x_2; v_x = x_2 if x_i > x_2;
v_y = y_1 if y_i < y_1; v_y = y_i if y_1 ≤ y_i ≤ y_2; v_y = y_2 if y_i > y_2.

For an n-cube graph, let u_i = a_1 a_2 ... a_n, s = b_1 b_2 ... b_n, and t = c_1 c_2 ... c_n. Suppose v = d_1 d_2 ... d_n; then, for 1 ≤ j ≤ n, d_j = b_j if b_j = c_j, and d_j = a_j otherwise.

Theorem 5.2 Let G represent a 2D mesh graph or an n-cube graph. Given a multicast set K, the edges selected by the greedy ST algorithm induce an ST for K in G.

Proof: The message routing part (Fig. 5.4) of the greedy ST algorithm at the source node u_0 constructs a tree T that contains all the nodes in K. In Step 3 of Fig. 5.4, V(T) = {u, u_1} and E(T) = {(u, u_1)}. Thus, T is a tree after Step 3. The i-th

Algorithm: The greedy ST Algorithm for Message Routing
Input: Local address w, destination list D = (u, u_1, ..., u_l).
Output: Destination sublist(s): D_1, D_2, ..., where D_i ∩ D_j = ∅ for i ≠ j.
Procedure:
1. If w ≠ u, set D_1 = D and send the message to one of w's neighbors that is on a shortest path between w and u. Stop.
2. If w = u and D − {u} = ∅, then the message is sent to the local node w. Stop.
3. Construct a tree T with u as the root as follows: E(T) ← {(u, u_1)}.
4. For i = 2, 3, ..., l, do the following:
(a) For each e = (s,t) ∈ E(T):
Define P_e as the set {v : v ∈ V(G), v is on a shortest path from s to t}.
Let d_e(u_i) = min_{v ∈ P_e} {d(u_i, v)}.
(b) Find a v such that d(u_i, v) = min_{e ∈ E(T)} {d_e(u_i)}, where v is on one of the shortest path(s) between s and t and (s,t) ∈ E(T).
(c) If v ≠ s and v ≠ t, then E(T) ← E(T) ∪ {(s,v), (v,t)} − {(s,t)}.
(d) If v ≠ u_i, then E(T) ← E(T) ∪ {(v, u_i)}.
5. Suppose u has m sons in T, r_1, r_2, ..., r_m. Set D_i = {v : v ∈ D and v is in the subtree of T with r_i as the root}.
6. Send a copy of the message with D_i to one of w's neighbors that is on a shortest path between w and r_i, for 1 ≤ i ≤ m.

Figure 5.4.
The greedy ST algorithm for message routing

iteration of Step 4 in Fig. 5.4 adds at least u_i to T while keeping T cycle-free. It is not hard to see that each edge in E(T) is replaced by a path in G whose edges are selected by the message routing part of the greedy ST algorithm. In Step 4 of Fig. 5.4, at most two nodes may be added to V(T). Since there are k − 2 iterations, T has at most 2(k − 1) nodes. □

Corollary 5.2 Consider the greedy ST algorithm with k destinations. The time complexity of the message preparation part for both 2D mesh and n-cube graphs is O(k log k). The time complexity of the message routing part is O(1) for the bypass nodes and O(k^2) for the replicate nodes for both 2D mesh and n-cube graphs. The maximum number of replicate nodes is k − 1.

Proof: Since the distance between any two nodes can be computed in constant time (by a simple hardware design) for both 2D mesh and n-cube graphs, the message preparation part in Fig. 5.3 takes O(k log k) time to sort the destination nodes. In the message routing part, a bypass node simply executes Step 1 and a leaf (destination) node simply executes Step 2 in Fig. 5.4, each of which takes constant time. For a replicate node, Step 3 in Fig. 5.4 takes constant time. As mentioned above, d_e(u_i) can be computed in O(1) time for both 2D mesh and n-cube graphs. Step 4(b) can be done in O(k) time. Both Step 4(c) and Step 4(d) take O(1) time. There are k − 1 iterations of Step 4. Thus, the time complexity of the message routing part is O(k^2). In the worst case, if each replicate node makes only one additional copy of the message, it is trivial to see that the maximum number of replicate nodes is k − 1. □

Note that the basic concept behind the greedy ST algorithm for message routing is to identify the closest replicate nodes to forward the message, where each replicate node, which may itself be a destination, is the root of a subtree. In order to obtain the
In order to obtain the 60 best replicate nodes, all destination nodes have to be considered. Thus, at the end of Step 4 (Fig. 5.4), a complete Steiner subtree rooted at the calling node is obtained. In other words, the source node is able to obtain the complete Steiner tree. Another approach to implement this algorithm is to pass the complete Steiner tree information in the message to all replicate nodes. In this case, the complexity required at each replicate node is reduced to 0(1) and O(n) for 2D mesh and n-cube, respectively. However, each message will carry additional Steiner tree information. Thus, there is a tradeoff between message size and processing time. In both implementations, the amount of traffic generated is the same. 5.3 Heuristic Routing Algorithms for MT Since the multicast routing algorithm for hypercube has been proposed in [19], we only present the routing algorithms of MT for 2D mesh. In unicast communication in 2D mesh, message is generally delivered first in X- direction then Y-direction. The first multicast routing algorithm in Fig. 5.5, X—first multicast routing, is a natural extension of the one-to—one routing method. It is easy to show the following theorem. Theorem 5.3 Given a multicast set K, the links selected by executing X—first multi- cast routing algorithm in each forward node in 2D mesh induce a multicast subgraph for K. The time complexity of the algorithm is of O(k) for k input destination nodes. The X-first algorithm is simple and can be easily implemented in hardware. How- ever, the selected path from the source to a destination node is only determined by the addresses of the source and the destination node. It is independent of the posi- tions of the rest destination nodes. As a result, it often creates much unnecessary traffic. 61 X-first Multicast Algorithm Input: Local address (x0,yo), destination list D, D 79 0. 
Output: Destination node sublist(s) D+X, D−X, D+Y, D−Y, with Di ⊆ D for i ∈ {+X, −X, +Y, −Y} and Di ∩ Dj = ∅ for i ≠ j.
Procedure:
1. Set Di ← ∅ for i = +X, −X, +Y, −Y.
2. For each (x, y) ∈ D, do: Case (x, y) of
   x > x0: D+X ← D+X ∪ {(x, y)};
   x < x0: D−X ← D−X ∪ {(x, y)};
   x = x0 and y > y0: D+Y ← D+Y ∪ {(x, y)};
   x = x0 and y < y0: D−Y ← D−Y ∪ {(x, y)};
   x = x0 and y = y0: send the message to the local processor.
3. If D+X ≠ ∅, put D+X in the message head and send it to (x0 + 1, y0);
   If D−X ≠ ∅, put D−X in the message head and send it to (x0 − 1, y0);
   If D+Y ≠ ∅, put D+Y in the message head and send it to (x0, y0 + 1);
   If D−Y ≠ ∅, put D−Y in the message head and send it to (x0, y0 − 1).

Figure 5.5. X-first multicast algorithm

In the second multicast routing algorithm, we consider the positions of all the destination nodes in a multicast when selecting the message passing path(s). The divided greedy multicast algorithm in Fig. 5.6 first divides the destination set into several subsets. The destination nodes in the same subset have the same one or two candidate neighboring nodes to which the message can be sent. If necessary, each subset is further divided into two sublists, and the two sublists and the source message are sent to the two candidate neighboring nodes, respectively.

Theorem 5.4 Given a multicast set K, the links selected by executing the divided greedy multicast routing algorithm in each forward node in a 2D mesh induce a multicast subgraph for K. The time complexity of the algorithm is O(k) for k input destination nodes.

Proof: Starting from the source node, the algorithm is executed at each forward node. Consider an arbitrary destination node (x, y) in the input destination node list of a forward node. If it is in Di of Step 3 for some i, i ∈ {+X, −X, +Y, −Y}, the message and (x, y) will be sent in Step 6 to a neighboring node that lies on a shortest path from the current forward node to (x, y).
Otherwise, (x, y) is in Pi of Step 4 for some i, 0 ≤ i ≤ 3. It will be placed in either Six or Siy in Step 4(b). Finally, it will be put in Di, D(i−1) mod 4, or D(i+1) mod 4 in Step 5. By checking the definition of Pi and the routing decision made in Step 6, it is easy to see that the message and (x, y) will be sent to a neighboring node that lies on a shortest path from the current forward node to (x, y). Therefore, the message is forwarded from the source node to each destination node along a shortest path, and the links selected by executing the algorithm in each forward node induce a multicast subgraph for the given multicast set.

Step 1 and Step 2 need O(k) and O(1) time, respectively. Step 3 through Step 6 each take O(k) time. Hence, the time complexity of the divided greedy multicast algorithm is O(k), where k is the number of input destination nodes. □

Divided Greedy Multicast Algorithm
Input: Local address (x0, y0), destination node list D, D ≠ ∅.
Output: Destination sublist(s) D+X, D+Y, D−X, D−Y, with Di ⊆ D for i ∈ {+X, +Y, −X, −Y} and Di ∩ Dj = ∅ for i ≠ j.
Procedure:
1. If (x0, y0) = (x, y) for some (x, y) ∈ D, send the message to the local processor and D ← D − {(x, y)};
2. If D = ∅, stop; else S ← D;
3. Set D+X ← {(x, y) | (x, y) ∈ D, x > x0 and y = y0},
   D+Y ← {(x, y) | (x, y) ∈ D, x = x0 and y > y0},
   D−X ← {(x, y) | (x, y) ∈ D, x < x0 and y = y0},
   D−Y ← {(x, y) | (x, y) ∈ D, x = x0 and y < y0};
4. Let P0 = {(x, y) | (x, y) ∈ D, x > x0 and y > y0},
       P1 = {(x, y) | (x, y) ∈ D, x < x0 and y > y0},
       P2 = {(x, y) | (x, y) ∈ D, x < x0 and y < y0},
       P3 = {(x, y) | (x, y) ∈ D, x > x0 and y < y0};
   For i = 0, 1, 2, 3 do:
   (a) Six ← ∅, Siy ← ∅; (xix, yix) ← (x0, y0), (xiy, yiy) ← (x0, y0);
   (b) For each destination (x, y) in Pi do:
       Let H = min{ |y − y0|, dG((x, y), (xix, yix)) } and
           V = min{ |x − x0|, dG((x, y), (xiy, yiy)) };
       If H ≤ V, then Six ← Six ∪ {(x, y)}, (xix, yix) ← (x, y);
       else Siy ← Siy ∪ {(x, y)}, (xiy, yiy) ← (x, y);
5. Let S00 = S3x, S01 = S0x; S10 = S0y, S11 = S1y; S20 = S1x, S21 = S2x; S30 = S2y, S31 = S3y; and D0 = D+X, D1 = D+Y,
D2 = D−X, and D3 = D−Y;
   For i = 0, 1, 2, 3 do:
   If Di ≠ ∅, or both Si0 ≠ ∅ and Si1 ≠ ∅, then Di ← Di ∪ Si0 ∪ Si1;
   else D(i−1) mod 4 ← D(i−1) mod 4 ∪ Si0 and D(i+1) mod 4 ← D(i+1) mod 4 ∪ Si1;
6. If D+X ≠ ∅, put D+X in the message head and send it to (x0 + 1, y0);
   If D+Y ≠ ∅, put D+Y in the message head and send it to (x0, y0 + 1);
   If D−X ≠ ∅, put D−X in the message head and send it to (x0 − 1, y0);
   If D−Y ≠ ∅, put D−Y in the message head and send it to (x0, y0 − 1).

Figure 5.6. Divided greedy algorithm

5.4 Illustrative Examples

Five examples are presented in this section to illustrate the operation of the proposed heuristic multicast algorithms.

Finding an MP in a 4 × 4 Mesh

Consider a 4 × 4 mesh, G. For simplicity, we use an integer to represent the address of each node, as shown in Fig. 5.7. Obviously, C = (0, 1, 2, 3, 7, 6, 5, 9, 10, 11, 15, 14, 13, 12, 8, 4, 0) is a Hamilton cycle in G. For each node x (0 ≤ x ≤ 15), Table 5.1 shows the corresponding h(x). Consider the multicast set K = {9, 0, 1, 6, 12}, where u0 = 9 is the source. Given u0 = 9 as the source, the sorting key f(x) for each node x is shown in Table 5.2.

Figure 5.7. A 4 × 4 mesh with node 9 as the source.

Table 5.1. A Hamilton cycle and the corresponding mapping h of a 4 × 4 mesh.

  h(x)  x  |  h(x)  x  |  h(x)   x  |  h(x)   x
   1    0  |   5    7  |   9    10  |  13    13
   2    1  |   6    6  |  10    11  |  14    12
   3    2  |   7    5  |  11    15  |  15     8
   4    3  |   8    9  |  12    14  |  16     4

Table 5.2. The sorting key f(x) and mapping h(x) for each node x in a 4 × 4 mesh, where u0 = 9.

   x   h(x)  f(x)  |   x   h(x)  f(x)
   0    1     17   |   8   15    15
   1    2     18   |   9    8     8
   2    3     19   |  10    9     9
   3    4     20   |  11   10    10
   4   16     16   |  12   14    14
   5    7     23   |  13   13    13
   6    6     22   |  14   12    12
   7    5     21   |  15   11    11

Based on the sorted MP algorithm for message preparation shown in Fig. 5.1, we have the sorted destination list D = {12, 0, 1, 6}. Then, we apply the sorted MP algorithm for message routing (Fig. 5.2) at each node receiving the message. For node 9, its four neighboring nodes are 5, 8, 13, and 10.
The corresponding f values are 23, 15, 13, and 9, respectively. Since f(12) = 14, only node 13 and node 10 have f values less than 14, and f(13) > f(10). Thus, node 13 is selected by node 9 to forward the message. By repeating the same procedure at each node receiving the message, we obtain the multicast path (9, 13, 12, 8, 4, 0, 1, 2, 6) shown in Fig. 5.7.

Finding an MP in a 4-cube

Consider the 4-cube shown in Fig. 5.8. One can easily obtain the Hamilton cycle shown in Table 5.3. For each node x, the corresponding h(x) is also shown in Table 5.3. Suppose we have the multicast set K = {0011, 0100, 0111, 1100, 1010, 1111}. Given u0 = 0011 as the source, Table 5.4 lists the sorting key f(x) for each node x.

Figure 5.8. A 4-cube with node 0011 as the source.

Table 5.3. A Hamilton cycle and the corresponding mapping h of a 4-cube.

  h(x)   x    |  h(x)   x    |  h(x)   x    |  h(x)   x
   1    0000  |   5    0110  |   9    1100  |  13    1010
   2    0001  |   6    0111  |  10    1101  |  14    1011
   3    0011  |   7    0101  |  11    1111  |  15    1001
   4    0010  |   8    0100  |  12    1110  |  16    1000

By applying the sorted MP algorithm for message preparation, we have the sorted destination list D = {0111, 0100, 1100, 1111, 1010}. By applying the sorted MP algorithm for message routing (Fig. 5.2) at each node receiving the message, we obtain the multicast path (0011, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010), as illustrated in Fig. 5.8.

Table 5.4. The sorting key f(x) and mapping h(x) for each node x in a 4-cube, where u0 = 0011.

    x    h(x)  f(x)  |    x    h(x)  f(x)
  0000    1     17   |  1000   16    16
  0001    2     18   |  1001   15    15
  0010    4      4   |  1010   13    13
  0011    3      3   |  1011   14    14
  0100    8      8   |  1100    9     9
  0101    7      7   |  1101   10    10
  0110    5      5   |  1110   12    12
  0111    6      6   |  1111   11    11

Finding an ST in an 8 × 8 Mesh

Let G be an 8 × 8 mesh. Each node in G is indexed as [i, j] for 0 ≤ i, j ≤ 7. Suppose node [2,7], the source node, wishes to send a message to 5 destinations: [0,5], [2,3], [4,1], [6,3], and [7,4]. By applying the greedy ST algorithm for message preparation (Fig. 5.3), we have D = ([2,7], [0,5], [2,3], [4,1], [6,3], [7,4]).
Note that if d(u0, u_i) = d(u0, u_j), then u_i and u_j may be placed in arbitrary order. Let us consider node [2,7], the source, with the sorted input D, applying the greedy ST algorithm for message routing (Fig. 5.4). Obviously, Step 1 and Step 2 are skipped. In Step 3, we have E(T) = {([2,7], [0,5])}. In the first iteration of Step 4, node [2,5] is the nearest node to [2,3] among the nodes on any shortest path between [2,7] and [0,5]. Thus, [2,5] is selected in Step 4(b). After the first iteration of Step 4, E(T) becomes {([2,7], [2,5]), ([2,5], [0,5]), ([2,5], [2,3])}. Nodes [4,1], [6,3], and [7,4] are added to the Steiner tree in the second, third, and fourth iterations of Step 4, respectively. At the end of Step 4, we have E(T) = { ([2,7], [2,5]), ([2,5], [0,5]), ([2,5], [2,3]), ([2,3], [4,3]), ([4,3], [4,1]), ([4,3], [6,3]), ([6,3], [7,4]) }, which specifies the complete Steiner tree. Thus, in Step 5, we have output D1 = ([2,5], [0,5], [2,3], [4,1], [6,3], [7,4]), which will be sent along with the message to the neighboring node [2,6]. Note that node [2,6] is a bypass node: the neighboring node on one of the shortest paths between [2,7] and the next replicate node, [2,5].

Upon receiving the message, node [2,5] executes the greedy ST algorithm for message routing. It generates D1 = ([0,5]) and D2 = ([2,3], [4,1], [6,3], [7,4]). One copy of the message with D1 is sent to node [1,5], a bypass node, and another copy with D2 is sent to the bypass node [2,4]. By repeating this procedure at all forward nodes, the complete Steiner tree routing pattern shown in Fig. 5.9 is obtained. Note that the selection of bypass nodes is based on the underlying shortest path routing algorithm.

Figure 5.9.
A complete ST routing pattern in an 8 × 8 mesh.

Finding an ST in a 6-cube

Consider a 6-cube with the multicast set K = { 000110, 010101, 001101, 000001, 101001, 110001 }. The sorted destination list is D = (000110, 010101, 000001, 001101, 101001, 110001). The source node 000110 is a replicate node. In this node, neither Step 1 nor Step 2 is executed, and we have E(T) = {(000110, 010101)} in Step 3. Since node 000101 is the nearest node to node 000001 among the nodes on any shortest path between 000110 and 010101, it is selected in Step 4(b). At the end of the first iteration of Step 4, E(T) is {(000110, 000101), (000101, 010101), (000101, 000001)}. In the second iteration, node 001101 is added to the Steiner tree, and nodes 101001 and 110001 are added in the third and fourth iterations, respectively. Finally, we have output D1 = (000101, 010101, 000001, 001101, 101001, 110001). By applying the greedy ST algorithm for message routing (Fig. 5.4) at each node receiving the message, the resulting Steiner tree routing pattern shown in Fig. 5.10 is obtained.

Figure 5.10. A complete ST routing pattern in a 6-cube.

Finding an MT in a 6 × 6 Mesh

The following example illustrates the X-first algorithm for MT. Consider the 6 × 6 mesh in Fig. 5.11. Suppose source node (3,2) wants to send a message to (2,0), (3,0), (4,0), (1,1), (5,1), (0,2), (1,3), (2,5), (3,5), and (5,5). At the beginning, source node (3,2) runs the X-first algorithm. After running the algorithm, we get D+X = {(4,0), (5,1), (5,5)}, D−X = {(2,5), (2,0), (1,3), (1,1), (0,2)}, D+Y = {(3,5)}, and D−Y = {(3,0)}, which are added to the respective message heads and sent to the neighboring nodes (4,2), (2,2), (3,3), and (3,1). The forward nodes (4,2), (2,2), (3,3), and (3,1) then run the algorithm in turn, and so on, until all the destination nodes receive the message. The routing pattern is shown in Fig.
5.11, where each path from the source to a destination node always goes in the X-direction first and then the Y-direction. The total traffic for this example is 24.

Figure 5.11. The routing pattern of the X-first algorithm for the example.

The same example is used to illustrate the divided greedy algorithm, which is considerably more complicated than the X-first algorithm.

Figure 5.12. The routing pattern of the divided greedy algorithm for the example.

Again, suppose source node (3,2) is to send a message to each destination node in the destination list D = {(2,0), (3,0), (4,0), (1,1), (5,1), (0,2), (1,3), (2,5), (3,5), (5,5)}, as shown in Fig. 5.12. Initially, the only forward node, the source node (3,2), runs the divided greedy algorithm as follows. Step 1 and Step 2 do nothing, since no destination node equals the local node (3,2) and the input destination list D is not empty. In Step 3, node (3,5) is put in D+Y, to be sent toward (3,3); node (0,2) is put in D−X and (3,0) in D−Y, to be sent toward (2,2) and (3,1), respectively. By the definition of Pi, 0 ≤ i ≤ 3, in Step 4 we have P0 = {(5,5)}, P1 = {(1,3), (2,5)}, P2 = {(1,1), (2,0)}, and P3 = {(4,0), (5,1)}. After executing Step 4, each Pi, 0 ≤ i ≤ 3, is divided into two subsets Six and Siy, where S0x = ∅, S0y = {(5,5)}; S1x = {(1,3)}, S1y = {(2,5)}; S2x = {(1,1)}, S2y = {(2,0)}; and S3x = {(5,1)}, S3y = {(4,0)}. Note that S3x and S0x are the candidate sets for D+X, S0y and S1y for D+Y, S1x and S2x for D−X, and S2y and S3y for D−Y. However, since S0x is empty (as is D+X), in Step 5 its partner S3x is not put in D+X; instead it is merged with S3y, which in turn is put in D−Y.
Therefore, by the end of Step 5, we get the output destination node sublists D+Y = {(3,5), (2,5), (5,5)}, D−X = {(0,2), (1,3), (1,1)}, and D−Y = {(3,0), (2,0), (4,0), (5,1)}. D+Y, D−X, and D−Y are sent, along with the source message, to (3,3), (2,2), and (3,1), respectively. Next, the forward nodes are (3,3), (2,2), and (3,1). The algorithm is executed in each forward node with its input destination list in the same way as described above for node (3,2). Finally, the message is delivered to each destination node, and the routing paths form a multicast subgraph for the source and destination nodes, as shown in Fig. 5.12.

Note that the destination node list of a non-source forward node can be divided into at most two of the sublists Pi, 0 ≤ i ≤ 3, in Step 4. This is because the routing is shortest-path. For example, for the forward node (3,3) with input destination node list {(3,5), (2,5), (5,5)}, the list can only be divided into the two sets P0 and P1 in Step 4 for further consideration. Thus, the execution of the algorithm at a non-source forward node is much faster than at the source node.

CHAPTER 6

DEADLOCK-FREE MULTICAST WORMHOLE ROUTING

As described in Chapter 2, wormhole routing offers a best-case latency that is independent of the path length while requiring only a small amount of dedicated buffer space. Wormhole routing has been a popular choice in new-generation multicomputers. Since blocked messages are not buffered at intermediate nodes in wormhole routing, its routing criteria and deadlock properties are quite different from those of other switching technologies, such as store-and-forward and virtual cut-through. This chapter focuses on the routing and deadlock issues of multicast communication in these most promising wormhole-routed networks.
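Before turning to deadlock, it is worth pinning down the source-node computation of the divided greedy algorithm traced in the last example of Chapter 5, since the deadlock-free schemes below rely on a similar split-at-the-source pattern. The sketch below reconstructs Steps 3–5 of Fig. 5.6 in Python; the Step 5 candidate-set pairing is inferred from the worked example rather than quoted, so treat it as an interpretation:

```python
def divided_greedy_split(src, dests):
    """Source-node splitting of the divided greedy algorithm (Fig. 5.6).

    Returns [D+X, D+Y, D-X, D-Y] as sets. Assumes the local node has
    already been removed from `dests` (Steps 1-2 of the algorithm).
    """
    x0, y0 = src
    dist = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    D = [set(), set(), set(), set()]   # D0=D+X, D1=D+Y, D2=D-X, D3=D-Y
    P = [[], [], [], []]               # quadrants P0=NE, P1=NW, P2=SW, P3=SE
    for x, y in dests:                 # Step 3: axis-aligned destinations
        if y == y0:
            D[0 if x > x0 else 2].add((x, y))
        elif x == x0:
            D[1 if y > y0 else 3].add((x, y))
        else:
            q = (0 if (x > x0 and y > y0) else
                 1 if (x < x0 and y > y0) else
                 2 if (x < x0 and y < y0) else 3)
            P[q].append((x, y))
    S = {}
    for i in range(4):                 # Step 4: greedy chaining per quadrant
        Sx, Sy, refx, refy = set(), set(), src, src
        for x, y in P[i]:
            H = min(abs(y - y0), dist((x, y), refx))
            V = min(abs(x - x0), dist((x, y), refy))
            if H <= V:
                Sx.add((x, y)); refx = (x, y)
            else:
                Sy.add((x, y)); refy = (x, y)
        S[i, 'x'], S[i, 'y'] = Sx, Sy
    # Step 5: direction i draws on the matching-axis sets of its two
    # adjacent quadrants (e.g. D+X on S3x and S0x); a lone candidate set
    # is folded into the neighboring direction instead.
    pairs = [((3, 'x'), (0, 'x')), ((0, 'y'), (1, 'y')),
             ((1, 'x'), (2, 'x')), ((2, 'y'), (3, 'y'))]
    for i, (a, b) in enumerate(pairs):
        if D[i] or (S[a] and S[b]):
            D[i] |= S[a] | S[b]
        else:
            D[(i - 1) % 4] |= S[a]
            D[(i + 1) % 4] |= S[b]
    return D
```

Run on the example of Section 5.4 (source (3,2)), this reproduces exactly the output sublists derived above: D+X empty, D+Y = {(3,5), (2,5), (5,5)}, D−X = {(0,2), (1,3), (1,1)}, and D−Y = {(3,0), (2,0), (4,0), (5,1)}.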
6.1 Deadlock Issue in Multicast Wormhole Routing

A practical multicast routing algorithm must be deadlock-free and should transmit the source message to each destination node using as little time and as few communication channels as possible. For a given set of destination nodes, the multicast routing algorithm might deliver the message along a common path as far as possible, then branch in order to deliver the message to each destination node. Essentially, the message follows a tree-like route to the destinations. One possible approach to implementing such a multicast tree is to extend deadlock-free unicast routing algorithms to handle multicast traffic. For example, the E-cube algorithm has been proved to be deadlock-free for one-to-one communication in n-cubes [44]. The nCUBE-2 [6], the first hypercube multicomputer to use wormhole routing, uses this approach to support broadcast and a special form of multicast in which the destination nodes form a subcube. A tree is created in which each path from the source to a destination uses E-cube routing. In order to avoid buffering, when the tree branches and a node must send the message on multiple channels, all of the required channels must be available before transmission on any of them may take place. Hence, the branches of the tree proceed forward in lock-step fashion. Blockage of any branch of the tree can prevent completion of the message transfer even to those destination nodes to which paths have been successfully created. Moreover, deadlock can occur under such a routing scheme. Figure 6.1 shows a deadlock configuration on a 3-cube. Suppose that nodes 000 and 001 simultaneously attempt to transmit broadcast messages M0 and M1, respectively. Figures 6.1a and 6.1b, respectively, depict the would-be trees for these broadcasts. Figure 6.2 shows a detailed diagram of six of the nodes involved in the broadcasts.
Outgoing channel selectors are represented by solid rectangles, while input buffers are represented by open rectangles. The broadcast originating at node 000, M0, has acquired channels [000,001], [000,010], and [000,100]. The header flit has been duplicated at node 001 and is waiting on channels [001,011] and [001,101]. The M1 broadcast has already acquired channels [001,011], [001,101], and [001,000], but is waiting on channels [000,010] and [000,100]. The two broadcasts will block forever.

Figure 6.1. Deadlock in a 3-cube multicomputer.

Figure 6.2. The detailed diagram of the deadlock configuration.

In a similar manner, one may attempt to extend deadlock-free unicast routing on a 2D mesh to encompass multicast. One such unicast routing algorithm requires that messages be sent first in the X-direction and then in the Y-direction; it is straightforward to prove that this algorithm is deadlock-free. An extension of the X-first routing method to include multicast is shown in Figure 6.3, in which the message is delivered to each destination in the manner described. As in the n-cube example, the progress of the tree requires that all branches be unblocked. For example, suppose the header flit in Figure 6.3 is blocked due to a busy channel [(4,2),(4,3)]. Unlike store-and-forward or virtual cut-through routing, node (4,2) cannot buffer the entire message. As a result, the progress of messages in the entire routing tree must be stopped: all of the flits stop forwarding and remain in contiguous channels of the network. In turn, other messages requiring segments of this tree are also blocked. Network congestion may increase and degrade the performance of the multicomputer.

Figure 6.3. An X-first multicast routing pattern.

Moreover, this routing algorithm can lead to deadlock.
Figure 6.4 shows a deadlock configuration in which two multicasts, M0 and M1, have the following communication patterns:

M0: source (1,1); destinations (0,2), (3,1);
    acquired [(1,1),(0,1)], [(1,1),(2,1)];
    requiring [(2,1),(3,1)].
M1: source (2,1); destinations (0,1), (3,0);
    acquired [(2,1),(3,1)], [(2,1),(1,1)];
    requiring [(1,1),(0,1)].

Figure 6.4. A deadlock situation in a 3 × 4 mesh.

In Figure 6.4a, M0 has acquired channel [(1,1),(0,1)] and requires channel [(2,1),(3,1)], while M1 has acquired channel [(2,1),(3,1)] and requires channel [(1,1),(0,1)]. Figure 6.4b shows the details of the situation at the nodes (0,1), (1,1), (2,1), and (3,1). Because neither node (1,1) nor node (2,1) buffers the blocked message, channels [(1,1),(0,1)] and [(2,1),(3,1)] cannot be released. Deadlock has occurred.

The goal of our research is to develop and evaluate deadlock-free multicast routing algorithms in order to support reliable and efficient parallel applications in multicomputers. In the next two sections, we present deadlock-free multicast wormhole routing schemes based on the concept of network partitioning for the 2D mesh and the hypercube, respectively. The basic idea is to divide a multicast into several sub-multicasts, each routed in a different acyclic subnetwork. Because the subnetworks are disjoint and acyclic, no cyclic resource dependency can exist. Thus, the routing schemes are deadlock-free.

6.2 Deadlock-Free Multicast in 2D Mesh

We first present several deadlock-free multicast wormhole routing algorithms for 2D mesh multicomputers.

6.2.1 Tree-Like Deadlock-Free Multicast Routing Schemes

The first deadlock-free multicast wormhole method that we discuss uses a modification of the X-first routing algorithm discussed earlier and shown to be susceptible to deadlock.
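That susceptibility is, at bottom, a circular wait on channels. The configuration of Figure 6.4 can be checked mechanically; a minimal sketch in which each message is modeled only by the channels it holds and the channels it still requires:

```python
def deadlocked(holds, wants):
    """Detect a circular wait: message m waits on m' if m requires a
    channel that m' currently holds; deadlock = a cycle in that graph."""
    owner = {ch: m for m, chans in holds.items() for ch in chans}
    waits = {m: {owner[ch] for ch in wants[m]
                 if ch in owner and owner[ch] != m}
             for m in wants}

    def cyclic(m, stack):               # simple DFS cycle detection
        if m in stack:
            return True
        return any(cyclic(p, stack | {m}) for p in waits.get(m, ()))

    return any(cyclic(m, frozenset()) for m in waits)

# The two multicasts of Figure 6.4 (channels written as (tail, head)):
holds = {"M0": {((1, 1), (0, 1)), ((1, 1), (2, 1))},
         "M1": {((2, 1), (3, 1)), ((2, 1), (1, 1))}}
wants = {"M0": {((2, 1), (3, 1))},   # held by M1
         "M1": {((1, 1), (0, 1))}}   # held by M0
```

Removing either `wants` entry breaks the cycle, which is exactly what the channel-disjoint, acyclic subnetworks introduced below guarantee by construction.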
In order to avoid cycles of channel dependencies, we first double each channel of the 2D mesh and then partition the network into four subnetworks: N+X,+Y, N−X,+Y, N−X,−Y, and N+X,−Y. Subnetwork N+X,+Y contains the unidirectional channels with addresses [(i,j),(i+1,j)] and [(i,j),(i,j+1)]; subnetwork N+X,−Y contains the channels with addresses [(i,j),(i+1,j)] and [(i,j),(i,j−1)]; and so on. Figure 6.5 shows the partitioning of a 3 × 4 mesh into the four subnetworks.

Figure 6.5. Network partitioning for a 3 × 4 mesh.

For a given multicast, the destination node set D is divided into at most four subsets, D+X,+Y, D−X,+Y, D−X,−Y, and D+X,−Y, according to the positions of the destination nodes relative to the source node u0. Set D+X,+Y contains the destination nodes to the upper right of u0, D−X,+Y contains the destinations to the upper left of u0, and so on. Formally, we have

D+X,+Y = { (x,y) | (x,y) ∈ D and x > x0, y ≥ y0 },
D−X,+Y = { (x,y) | (x,y) ∈ D and x ≤ x0, y > y0 },
D−X,−Y = { (x,y) | (x,y) ∈ D and x < x0, y ≤ y0 }, and
D+X,−Y = { (x,y) | (x,y) ∈ D and x ≥ x0, y < y0 }.

Specifically, the multicast comprises at most four submulticasts, from u0 to each of D+X,+Y, D−X,+Y, D−X,−Y, and D+X,−Y. The submulticast to D+X,+Y is implemented in subnetwork N+X,+Y using X-first Y-next routing, the one to D−X,+Y in subnetwork N−X,+Y, and so on. The message routing algorithm is given in Fig. 6.6.

Algorithm: The double-channel X-first routing algorithm for subnetwork N+X,+Y
Input: Destination set D′, D′ = D+X,+Y; local address v, v = (x, y);
Procedure:
1. If x < Min{ xi | (xi, yi) ∈ D′ }, send the message and D′ to (x+1, y). Stop.
2. If (x, y) ∈ D′, then D′ ← D′ − {(x, y)} and a copy of the message is sent to the local node.
3. Let D′x = { (xi, yi) | xi = x, (xi, yi) ∈ D′ }.
If D′x ≠ ∅, send the message and D′x to (x, y+1);
   If D′ − D′x ≠ ∅, send the message and D′ − D′x to (x+1, y).

Figure 6.6. Double-channel X-first routing algorithm.

The following example explains the algorithm. Consider the 6 × 6 mesh in Figure 6.3. At the source node, (3,2), the destination set D = {(0,0), (0,2), (0,5), (1,3), (4,5), (5,0), (5,1), (5,3), (5,4)} is divided into D+X,+Y = {(4,5), (5,3), (5,4)}, D−X,+Y = {(0,5), (1,3)}, D−X,−Y = {(0,0), (0,2)}, and D+X,−Y = {(5,0), (5,1)}. The message will be sent to the destination nodes in D+X,+Y through subnetwork N+X,+Y, to the nodes in D−X,+Y using subnetwork N−X,+Y, to the nodes in D−X,−Y
The label assignment for N+x’+y for channel ordering. While the X-first multicast tree approach avoids deadlock, one of the major dis- advantages is the need for double channels. It may be possible implement double channels with virtual channels [44], however, early analysis shows that the signaling for multicast communication may be quite complex. Another major disadvantage of tree-like routing is high block probability. As mentioned before, if one of branches of the routing tree is blocked, the progress of the entire tree has to be stopped. Based on multicast star model proposed in Chapter 3, we next introduce multicast routing algorithms that are deadlock-free and which do not require additional physical or virtual channels. Deadlock situations arise in multicast trees when copies are 83 created and diverge at intermediate nodes. The algorithms in the next section avoid deadlock by enforcing the rule that a message, once in the network, may never be copied. 6.2.2 Path-Like Deadlock—Flee Multicast Routing Schemes We first present a network partitioning strategy based on Hamiltonian paths, which is fundamental to the deadlock-free routing schemes. A Hamiltonian path visits every node in a graph once and only once. It is not difficult to see that a 2D mesh has many Hamiltonian paths. In our algorithms, each node i in a multicomputer is assigned a label, [(i). In a network with N nodes, the assignment of the label to a node is based on the position of that node in a Hamiltonian path, where the first node in the path is labeled 0 and the last node in the path is labeled N — 1. Each node is represented by its integer coordinate (x, y). Figure 6.9( a) shows such a labeling in a 4 X 3 mesh. The labeling effectively divides the network into two subnetworks. 
The high-channel subnetwork contains all of the channels whose direction is from lower labeled nodes to higher labeled nodes, and the low-channel network contains all of the channels whose direction is from higher labeled nodes to lower labeled nodes. First, let’s consider the case of one-to-one communications. If the label of the destination node is greater than the label of the source node, the routing always takes place in the high-channel network; otherwise, it will take the low-channel network. Given a source and a destination node, it can be easily observed from Fig. 6.9 that there always exists a shortest path to deliver the message. Obviously, the performance of a routing scheme is dependent on the selection of a Hamilton path. Figure 6.10 shows the label assignment for a 4 X 3 mesh based on a different Hamilton path and the corresponding high-channel and low-channel networks. Under such a label 84 11 10 9 03 1.3 %-27I 0.? «1.3 23 03 PEI—’— .. - (1 16 “in “Is [3.2 3 1222.4 02 ”15 I14 ”l3 I II Tl 22 02 12 v {‘1 2.3 I 221 _L 2 1 0,1 ‘11 1‘2 2,1I 0,1 ._ 1,1 2,1 x 0.1 1,1 . 11 11 ’ i 1 ‘ ‘ 0 _ 11 1 I 2 l 1 , . 0,0 1.0 2 2,0 0,0 —> 1,0 2,0 0.0 1.0 2.0 (a) physical network (b) high-channel network (c) low-channel network Figure 6.9. The labeling of a 4 X 3 mesh. assignment, the routing paths between the nodes (1,0) (label 4) and (1,2) (label 8) take 4 channels in either direction (see Fig. 6.10(b-c)) rather than a shortest path which should take only 2 channels. The label assignment function 6 for a 7n X 71 mesh can be expressed as any): y*n+x ifyiseven y*n+n—x—1ifyisodd The first step in finding a deadlock-free multicast algorithm for the 2D mesh is to define a routing function that uses these two subnetworks in such a way as to avoid channel cycles. Let V be the node set of the 2D mesh. 
One routing function R : V × V → V is defined as R(u, v) = w, such that w is a neighboring node of u and

    ℓ(w) = max{ ℓ(p) : ℓ(p) ≤ ℓ(v) and p is a neighboring node of u }, if ℓ(u) < ℓ(v), or
    ℓ(w) = min{ ℓ(p) : ℓ(p) ≥ ℓ(v) and p is a neighboring node of u }, if ℓ(u) > ℓ(v).

Figure 6.10. An alternative label assignment and channel partitioning for a 4 × 3 mesh: (a) physical network, (b) high-channel network, (c) low-channel network.

For two arbitrary nodes r and t, a path from r to t can be selected by the routing function R as (v1, v2, ..., vk), where v1 = r, vi = R(v(i−1), t) for 1 < i ≤ k, and vk = t. The path is partial-order preserving for the label assignment ℓ if ℓ(vi) < ℓ(v(i+1)) for 1 ≤ i < k (that is, it lies in the high-channel network), or ℓ(vi) > ℓ(v(i+1)) for 1 ≤ i < k (that is, it lies in the low-channel network). We have the following important lemma.

Lemma 6.1 For two arbitrary nodes u and v in a 2D mesh, the path selected by the routing function R is a shortest path from u to v, and it is partial-order preserving for the label assignment function ℓ.

Proof: Since u and v are two arbitrary nodes in a 2D mesh, without loss of generality we can assume that ℓ(u) < ℓ(v). We prove the lemma by induction on the distance d(u, v).

Basis (d(u, v) = 1): It is easy to check that R(u, v) = v and ℓ(u) < ℓ(v).

Induction: Assume the hypothesis is true for d(u, v) = k. We will prove that it is true for d(u, v) = k + 1. Suppose that u = (xu, yu), v = (xv, yv), and w = R(u, v).

Case yu = yv: By the definition of ℓ, xv > xu if yv is even, and xv < xu if yv is odd. By the definition of the routing function R, we have w = (xu + 1, yu) for xv > xu, and w = (xu − 1, yu) for xv < xu. Again, by the definition of ℓ, ℓ(u) < ℓ(w) ≤ ℓ(v). Obviously, d(w, v) = d(u, v) − 1 = k, and ℓ(w) < ℓ(v) since w ≠ v.
Case yu < yv: By the definition of ℓ and R, when yv = yu + 1: if yu is even and xv > xu, we have w = (xu + 1, yu); if yu is odd and xv < xu, then w = (xu − 1, yu). Otherwise, w = (xu, yu + 1). In all cases, we have d(w, v) = d(u, v) − 1 = k and ℓ(u) < ℓ(w) < ℓ(v). By the definition of ℓ for the 2D mesh, it is impossible to have yu > yv, since ℓ(u) < ℓ(v). By the induction hypothesis, the path selected by the routing function R for w and v, say (w, u1, ..., u(k−1), v), is a shortest path and partial-order preserving for ℓ, with length k. Thus (u, w, u1, ..., u(k−1), v) is also such a path, with length k + 1. The proof of this lemma is completed by the principle of induction. □

We are now ready to define three multicast routing heuristics that use the routing function R.

Dual-Path Multicast Routing

The first heuristic routing algorithm partitions the destination node set D into two subsets, DH and DL, where every node in DH has a higher label than that of the source node u0 and every node in DL has a lower label than that of u0. Multicast messages from u0 are sent to the destination nodes in DH using the high-channel network and to the destination nodes in DL using the low-channel network. The message preparation algorithm executed at the source node of the dual-path routing algorithm is given in Figure 6.11. The source node divides the destination node set into two subsets, DH and DL, which are then sorted in ascending order and descending order, respectively, with the label of each node used as its sorting key.

The dual-path routing algorithm, shown in Figure 6.12, uses a distributed routing method in which the routing decision is made at each intermediate node; thus the dual-path routing algorithm is executed at each intermediate node. Upon receiving the message, each node first determines whether its address is the first destination node in the message header. If so, the address is removed from the message header and the message is delivered to the host node.
At this point, if the address field of the message header is not empty, the message is forwarded toward the first destination node in the message header using the routing function R.

Algorithm: Message preparation for the dual-path routing algorithm
Input: Destination set D, local address u_0, and node label assignment function ℓ;
Output: Two sorted lists of destination nodes, D_H and D_L, placed in the message header.
Procedure:
1. Divide D into two sets D_H and D_L such that D_H contains all the destination nodes with higher ℓ value than ℓ(u_0) and D_L the nodes with lower ℓ value than ℓ(u_0).
2. Sort the destination nodes in D_H and D_L using the ℓ value as the key. Put the sorted nodes of D_H in a list in ascending order and those of D_L in a list in descending order, respectively.
3. Construct two messages, one containing D_H as part of the header and the other containing D_L as part of the header.

Figure 6.11. Message preparation for the dual-path routing algorithm.

Note that the label assignment function ℓ can be determined in advance for a given multicomputer topology. Thus, the label in Fig. 6.11 and Fig. 6.12 for each node can be obtained in O(1) time. Step 1 and Step 2 can be merged by using a well-known sorting algorithm, such as quicksort, with time complexity O(d log d), where d = |D|.

Algorithm: The dual-path routing algorithm
Input: A message with sorted destination list D_M = {d_1, ..., d_l}, a local address w, and node label assignment ℓ;
Procedure:
1. If w = d_1, then D'_M = D_M - {d_1} and a copy of the message is delivered to the local node; otherwise, D'_M = D_M.
2. If D'_M = ∅, then terminate the message transmission.
3. Let d be the first node in D'_M, and let w' = R(w, d).
4. The message is sent to node w' with destination list D'_M in its header.

Figure 6.12. The dual-path routing algorithm.

In the message routing part, both Step 1 and Step 2 take constant time if we do not count the time for message copying.
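The preparation step of Figure 6.11 can be sketched in a few lines. This is an illustrative rendering with our own helper names, reusing the row-snake labeling of the 6 x 6 examples.

```python
def mesh_label(x, y, width=6):
    # Row-snake Hamiltonian-path label (even rows left to right).
    return y * width + (x if y % 2 == 0 else width - 1 - x)

def prepare_dual_path(dests, source, label):
    """Figure 6.11: split D by label relative to the source, then sort
    D_H ascending and D_L descending by label."""
    d_h = sorted((d for d in dests if label(d) > label(source)), key=label)
    d_l = sorted((d for d in dests if label(d) < label(source)),
                 key=label, reverse=True)
    return d_h, d_l
```

For the example of Figure 6.13, with source (3,2), this yields D_H = [(5,3), (1,3), (5,4), (4,5), (0,5)] and D_L = [(0,2), (5,1), (5,0), (0,0)], matching the worked example later in this section.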
Step 3 can be done in O(1) time, since the maximum outdegree of a node in a 2D mesh is 4. Thus we have established the following lemma and theorem.

Lemma 6.2 The message preparation complexity for the dual-path routing algorithm is O(|D| log |D|), where D is the destination set.

Theorem 6.1 The dual-path routing algorithm in a 2D mesh for a destination set D sends the source message to each destination node in D once and only once. The time complexity at each node is O(1).

Figure 6.13 shows an example of applying the dual-path routing algorithm to a 6 x 6 mesh with (3,2) as the source node.

Figure 6.13. An example of dual-path routing in a 6 x 6 mesh.

In order to show that the dual-path routing algorithm is deadlock-free, we need only prove that there can exist no cyclic dependency among the channels, because cyclic dependency on resources is a necessary condition for deadlock. By the definition of the channel partitioning scheme, each of the high-channel and low-channel subnetworks comprises a separate set of channels in the multicomputer. There are no cycles within each subnetwork, and hence no cyclic dependency can be created among the channels. Therefore, we have the following assertion.

Assertion 2 The dual-path multicast routing algorithm for the 2D mesh is deadlock-free.

Multi-Path Multicast Routing

The performance of the dual-path routing algorithm depends on the location distribution of the destination nodes. Consider the example shown in Figure 6.13 for a 6 x 6 mesh topology. The total number of channels used to deliver the message is 33 (18 in the high-channel network and 15 in the low-channel network). The maximum distance from the source to a destination is 18 hops.
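One plausible reading of the routing function R used in Figure 6.12, consistent with the examples in this chapter, is "forward to the neighbor whose label is closest to, but does not pass, the label of the current target." The sketch below uses our own names and the 6 x 6 snake labeling; it is an interpretation, not the dissertation's exact definition.

```python
def mesh_label(x, y, width=6):
    return y * width + (x if y % 2 == 0 else width - 1 - x)

def neighbors(node, width=6):
    x, y = node
    cand = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(a, b) for a, b in cand if 0 <= a < width and 0 <= b < width]

def next_hop(w, d, label):
    """R(w, d): on the high-channel network, take the highest-labelled
    neighbor not exceeding ell(d); the low-channel rule is the mirror image."""
    if label(d) > label(w):
        return max((n for n in neighbors(w)
                    if label(w) < label(n) <= label(d)), key=label)
    return min((n for n in neighbors(w)
                if label(d) <= label(n) < label(w)), key=label)

def route(src, dst, label):
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst, label))
    return path
```

Under this reading, the first high-channel leg of the Figure 6.13 example, from source (3,2) to the first destination (5,3), is the label-monotone shortest path (3,2), (4,2), (5,2), (5,3).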
In order to reduce the average length of the multicast paths and the number of channels used for a multicast, an alternative is a multi-path multicast routing algorithm, in which the restriction of having at most two paths is relaxed. In a 2D mesh, most nodes have outgoing degree 4, so up to 4 paths can be used to deliver a message, depending on the location of the source node. The only difference between multi-path routing and dual-path routing concerns the message preparation algorithm at the source node. Figure 6.14 shows the message preparation of the multi-path routing algorithm, in which we further partition the destination sets D_H and D_L of the dual-path algorithm. The set D_H is divided into two sets, one containing the nodes whose x-coordinates are greater than or equal to that of u_0 and the other containing the rest of the nodes in D_H. D_L is divided in a similar manner. The rules by which ties are broken in partitioning the destination nodes depend on the location of the source node in the network and the particular labeling method used. For example, Figure 6.15(a) shows the partitioning of the destinations in the high-channel network when the source is the node labeled 15. When the node labeled 8 is the source, the high-channel network is partitioned as shown in Figure 6.15(b). In the multi-path routing algorithm, Step 1 and Step 2 are exactly the same as those in the dual-path algorithm; the two steps need O(|D| log |D|) time. In Step 3, we need to check at most two queues for each destination node in D_H, and since |D_H| < |D|, this takes O(|D|) time. Step 4 is symmetric to Step 3 and also requires O(|D|) time. The following lemma is established.

Lemma 6.3 The time complexity of message preparation for the multi-path routing algorithm is O(|D| log |D|), where D is the destination set.
Algorithm: Message preparation for the multi-path routing in 2D mesh
Input: Destination set D, local address u_0 = (x_0, y_0), and node label assignment ℓ;
Output: Sorted destination node lists D_H1, D_H2, D_L1, D_L2 for 4 multicast paths.
Procedure:
1. Divide D into two sets D_H and D_L such that D_H contains all the destination nodes with higher ℓ value than ℓ(u_0) and D_L the nodes with lower ℓ value than ℓ(u_0).
2. Sort the destination nodes in D_H and D_L using the ℓ value as the key. Place the sorted nodes of D_H in a list in ascending order and those of D_L in a list in descending order, respectively.
3. (Assume that v_1 = (x_1, y_1) and v_2 = (x_2, y_2) are the two neighboring nodes of u_0 with higher labels than that of u_0.) Divide D_H into two sets, D_H1 and D_H2, as follows: D_H1 = {(x,y) | x ≤ x_1 if x_1 < x_2, x ≥ x_1 if x_1 > x_2} and D_H2 = {(x,y) | x ≤ x_2 if x_2 < x_1, x ≥ x_2 if x_2 > x_1}. Construct two messages, one containing D_H1 as part of the header and sent to v_1, and the other containing D_H2 as part of the header and sent to v_2.
4. Similarly, divide D_L into D_L1 and D_L2.

Figure 6.14. Message preparation for the multi-path routing algorithm in 2D mesh.

Figure 6.15. Multi-path destination address partitioning.

For the multicast example shown in Figure 6.13, the destination set is first divided into two sets D_H and D_L at source node (3,2), with D_H = {(5,3), (1,3), (5,4), (4,5), (0,5)} and D_L = {(0,2), (5,1), (5,0), (0,0)}. D_H is further divided into two subsets, D_H1 and D_H2, at Step 3, with D_H1 = {(5,3), (5,4), (4,5)} and D_H2 = {(1,3), (0,5)}. D_L is also divided into D_L1 = {(5,1), (5,0)} and D_L2 = {(0,2), (0,0)}. The multicast will be performed using four multicast paths, as shown in Figure 6.16. Note that multi-path routing requires only 20 channels in this example, and the maximum distance from the source to a destination is 6 hops.
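Step 3 of Figure 6.14 can be sketched as follows (illustrative names; v1 and v2 stand for the two higher-labelled neighbors of the source):

```python
def split_high(d_h, v1, v2):
    """Figure 6.14, Step 3: divide the already-sorted list D_H between
    the two higher-labelled neighbors v1, v2 of the source, keyed on the
    x-coordinate.  Because v1 and v2 have consecutive x-coordinates, the
    two half-planes together cover all of D_H."""
    x1, x2 = v1[0], v2[0]
    if x1 < x2:
        d_h1 = [p for p in d_h if p[0] <= x1]
        d_h2 = [p for p in d_h if p[0] >= x2]
    else:
        d_h1 = [p for p in d_h if p[0] >= x1]
        d_h2 = [p for p in d_h if p[0] <= x2]
    return d_h1, d_h2
```

For the running example, with source (3,2) and higher-labelled neighbors v1 = (4,2) and v2 = (3,3), splitting D_H = [(5,3), (1,3), (5,4), (4,5), (0,5)] gives D_H1 = [(5,3), (5,4), (4,5)] and D_H2 = [(1,3), (0,5)], as in the text.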
Hence, this example shows that multi-path routing can offer a significant advantage over dual-path routing in terms of generated traffic and the maximum distance between the source and destination nodes. By the same argument as for Assertion 2, we have the following assertion.

Assertion 3 The multi-path multicast routing algorithm for the 2D mesh is deadlock-free.

Figure 6.16. An example of multi-path routing in a 6 x 6 mesh.

Fixed-Path Multicast Routing

For purposes of comparison, we describe a third multicast algorithm called fixed-path routing. This routing scheme was suggested in [49]. Fixed-path routing is similar to dual-path routing except that each path traverses all possible horizontal links before traversing a single vertical link. The upper path visits all nodes in increasing order until the last destination is reached. Similarly, the lower path visits all nodes in decreasing order until the last destination is reached. Figure 6.17 shows fixed-path routing in a 6 x 6 mesh network for the source and destinations of the previous examples. The total number of channels used to deliver the message is 35 (20 in the high-channel network and 15 in the low-channel network). The maximum distance from the source to a destination is 20 hops. Clearly, fixed-path routing is not as efficient as the other two approaches. However, it is very simple to implement. We will discuss fixed-path routing further in the next chapter.

Figure 6.17. An example of fixed-path routing in a 6 x 6 mesh.

6.3 Deadlock-Free Multicast in Hypercube

The tree-like deadlock-free multicast routing algorithm using double channels is no longer suitable for the hypercube topology, since each node in an n-cube has n neighbors. It is our conjecture that O(n) channels between any two neighboring nodes are necessary to support tree-like deadlock-free multicast routing.
As mentioned before, the tree-like multicast routing pattern is not well suited to wormhole networks. The remainder of this chapter considers only path-like routing schemes. We again first partition the network based on Hamiltonian paths; a hypercube has many Hamiltonian paths. As before, each node i in the multicomputer is assigned a label ℓ(i), based on the order of that node in a Hamiltonian path: the first node in the path is assigned label 0 and the last node is assigned label m - 1 if there are m nodes in the network. Figure 6.18(a) shows a possible label assignment for a 3-cube multicomputer. The corresponding high-channel and low-channel networks are also shown in Fig. 6.18.

Figure 6.18. The label assignment and the corresponding high-channel and low-channel networks for a 3-cube: (a) physical network, (b) high-channel network, (c) low-channel network.

We now propose a label assignment scheme for the hypercube topology and prove that the assignment scheme provides a shortest routing path for any given pair of source and destination nodes. For an n-cube, the label assignment function ℓ for a node with address d_{n-1}d_{n-2}...d_0 is

ℓ(d_{n-1}d_{n-2}...d_0) = Σ_{i=0}^{n-1} (e_i ⊕ d_i) 2^i,

where e_{n-1} = 0 and e_{n-j} = d_{n-1} ⊕ d_{n-2} ⊕ ... ⊕ d_{n-j+1} for 1 < j ≤ n; that is, e_i is the parity of the address bits to the left of position i, and ℓ(d) is the position of the address d in the binary-reflected Gray code sequence. It is straightforward to check that for two arbitrary nodes u and v in an n-cube, ℓ(u) ≠ ℓ(v) if u ≠ v. By defining the routing function R in the same way as for the 2D mesh, we have the following important lemma.

Lemma 6.4 For two arbitrary nodes u and v in an n-cube, the path selected by the routing function R is a shortest path from u to v, and it is partial-order preserving for the label assignment function ℓ.

Proof: Again, we assume that ℓ(u) < ℓ(v), and prove by induction on the distance d(u,v). For d(u,v) = 1, we have R(u,v) = v and ℓ(u) < ℓ(v). Assume that the claim holds for k, that is, for d(u,v) = k.
Now, consider the case d(u,v) = k + 1. Suppose the j-th bit (from the left) is the first bit in which u and v differ.

1. Suppose that u is a_{n-1}...a_{n-j+1} 0 b_{n-j-1}...b_0 and v is a_{n-1}...a_{n-j+1} 1 c_{n-j-1}...c_0.

(a) Consider the node w = a_{n-1}...a_{n-j+1} 1 b_{n-j-1}...b_0. If ℓ(a_{n-1}...a_{n-j+1} 1 b_{n-j-1}...b_0) < ℓ(a_{n-1}...a_{n-j+1} 1 c_{n-j-1}...c_0), then R(u,v) = w by the definition of R. We have d(w,v) = d(u,v) - 1 = k, and ℓ(u) < ℓ(w) < ℓ(v) by the definition of ℓ for the n-cube. By the induction hypothesis, there is a shortest path selected by the routing function R, say (w, u_1, ..., u_{k-1}, v), which is partial-order preserving for ℓ, with length k. Clearly, (u, w, u_1, ..., u_{k-1}, v) is also such a path, with length k + 1.

(b) If ℓ(a_{n-1}...a_{n-j+1} 1 b_{n-j-1}...b_0) > ℓ(a_{n-1}...a_{n-j+1} 1 c_{n-j-1}...c_0), let the (n-p)-th bit be the first bit (from the left) in which b_{n-j-1}...b_0 and c_{n-j-1}...c_0 differ, where p > j. Then a_{n-1}...a_{n-j+1} 1 b_{n-j-1}...b_0 can be written as a_{n-1}...a_{n-j+1} 1 c_{n-j-1}...c_{n-p+1} c̄_{n-p} b_{n-p-1}...b_0. Consider the node

s = a_{n-1}...a_{n-j+1} 0 c_{n-j-1}...c_{n-p+1} c_{n-p} b_{n-p-1}...b_0.

Since ℓ(a_{n-1}...a_{n-j+1} 1 c_{n-j-1}...c_{n-p+1} c̄_{n-p} b_{n-p-1}...b_0) = ℓ(a_{n-1}...a_{n-j+1} 1 b_{n-j-1}...b_0) > ℓ(a_{n-1}...a_{n-j+1} 1 c_{n-j-1}...c_{n-p+1} c_{n-p} b_{n-p-1}...b_0), and a_{n-1} ⊕ ... ⊕ a_{n-j+1} ⊕ 0 ⊕ c_{n-j-1} ⊕ ... ⊕ c_{n-p+1} ≠ a_{n-1} ⊕ ... ⊕ a_{n-j+1} ⊕ 1 ⊕ c_{n-j-1} ⊕ ... ⊕ c_{n-p+1}, by the definition of ℓ for the n-cube we have

ℓ(u) = ℓ(a_{n-1}...a_{n-j+1} 0 b_{n-j-1}...b_0) = ℓ(a_{n-1}...a_{n-j+1} 0 c_{n-j-1}...c_{n-p+1} c̄_{n-p} b_{n-p-1}...b_0) < ℓ(a_{n-1}...a_{n-j+1} 0 c_{n-j-1}...c_{n-p+1} c_{n-p} b_{n-p-1}...b_0) = ℓ(s).

Note that R(u,v) = s, that d(s,v) = d(u,v) - 1 = k, and that ℓ(u) < ℓ(s) < ℓ(v). Again, by the induction hypothesis, the path selected by R, say (s, u_1, ..., u_{k-1}, v), is a shortest path and is partial-order preserving for ℓ, with length k. Hence, the path (u, s, u_1, ..., u_{k-1}, v) selected by R is also such a path, with length k + 1.

2. Suppose u is a_{n-1}...a_{n-j+1} 1 b_{n-j-1}...b_0 and v is a_{n-1}...a_{n-j+1} 0 c_{n-j-1}...c_0. This case is symmetric to the proof of part 1. □

The labeling effectively divides the network into two subnetworks, the high-channel subnetwork and the low-channel subnetwork, similar to those for the 2D mesh topology. The basic idea of the dual-path multicast routing algorithm is the same as for the 2D mesh; thus, the algorithms in Fig. 6.11 and Fig. 6.12 can be used directly for the hypercube topology. Using the above label assignment, we have the following corollary.

Corollary 6.1 The dual-path multicast routing algorithm for the hypercube is deadlock-free.

For example, consider a 4-cube with node 1100 as the source, as shown in Fig. 6.19. The number outside each node in the figure is the assigned label of the node. Suppose the destination nodes are 0100, 0011, 0111, 1000, and 1111. Clearly, D_L = {0100, 0111, 0011} and D_H = {1111, 1000}. For D_H, the source node will first send the message to node 1111, the first destination node in D_H, through the high-channel network. Based on the label assignment scheme, node 1100 has three outgoing channels, reaching nodes 1000, 1101, and 1110, respectively. According to the routing function R, node 1101 will be selected to forward the message. By repeating the same procedure at each node receiving the message, we obtain the routing pattern shown in Fig. 6.19. For the n-cube, the time complexity of message preparation is the same as for the 2D mesh, i.e., O(|D| log |D|). However, the routing algorithm in Fig. 6.12 needs to check at most O(n) outgoing channels. We have the following theorem.

Theorem 6.2 The dual-path routing algorithm on an n-cube for a destination set D sends the source message to each destination node in D once and only once. The time complexity at each node is O(n).

Figure 6.19. A 4-cube with node 1100 as the source for dual-path routing.

In an n-cube, each node has n neighboring nodes.
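The 4-cube example above can be checked mechanically. The sketch below uses our own names; the label function implements the Gray-code labeling of Lemma 6.4, and the next-hop rule is our "highest label that does not pass the target" reading of R.

```python
def cube_label(d, n=4):
    """ell for an n-cube address d: bit i of the label is e_i XOR d_i,
    where e_i is the parity of the address bits to the left of position i.
    This is the rank of d in the binary-reflected Gray code, so labels
    0 .. 2^n - 1 trace a Hamiltonian path of the cube."""
    label, parity = 0, 0
    for i in range(n - 1, -1, -1):
        bit = (d >> i) & 1
        label |= (parity ^ bit) << i   # (e_i XOR d_i) * 2^i
        parity ^= bit                  # extend the prefix parity
    return label

def next_hop_up(u, dest, n=4):
    """High-channel R(u, dest): among neighbors whose label lies above
    ell(u) but not above ell(dest), take the highest-labelled one."""
    cand = [u ^ (1 << i) for i in range(n)
            if cube_label(u, n) < cube_label(u ^ (1 << i), n) <= cube_label(dest, n)]
    return max(cand, key=lambda w: cube_label(w, n))
```

Splitting the destinations {0100, 0011, 0111, 1000, 1111} about source 1100 reproduces D_H = [1111, 1000] and D_L = [0100, 0111, 0011] exactly as stated above, and the first hop toward 1111 comes out as node 1101.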
Up to n multicast paths can therefore be used to deliver a message. Figure 6.20 depicts the message preparation of the multi-path routing algorithm; the message routing part of the multi-path algorithm is the same as that of dual-path routing.

Algorithm: Message preparation for the multi-path routing in hypercube
Input: Destination set D, local address u_0, and node label assignment ℓ;
Output: Sorted destination node lists D_H1, D_H2, ..., D_Hd, D_L1, D_L2, ..., D_Ld' for at most d + d' multicast paths.
Procedure:
1. Divide D into two sets D_H and D_L such that D_H contains all the destination nodes with higher ℓ value than ℓ(u_0) and D_L the nodes with lower ℓ value than ℓ(u_0).
2. Sort the destination nodes in D_H and D_L using the ℓ value as the key. Place the sorted nodes of D_H in a list in ascending order and those of D_L in a list in descending order, respectively.
3. (Assume that v_1, v_2, ..., v_d are the d neighboring nodes of u_0 with higher labels than that of u_0, and ℓ(v_i) < ℓ(v_{i+1}) for 1 ≤ i < d.) Divide D_H into at most d sets, D_Hi, 1 ≤ i ≤ d, as follows: D_Hi = {w | ℓ(v_i) ≤ ℓ(w) < ℓ(v_{i+1})}, where for i = d the upper bound is omitted. Construct at most d messages, each containing D_Hi as part of the header, and send each message to v_i for 1 ≤ i ≤ d.
4. Similarly, divide D_L into D_L1, D_L2, ..., D_Ld'.

Figure 6.20. Message preparation for the multi-path routing algorithm in hypercube.

Lemma 6.5 The time complexity of message preparation for the multi-path routing algorithm on an n-cube is max{O(|D| log |D|), O(n|D|)}, where D is the destination set.

Proof: In the message preparation algorithm in Fig. 6.20, Step 1 and Step 2 are exactly the same as those of the dual-path algorithm in Fig. 6.11; the two steps need O(|D| log |D|) time. In Step 3, at most n queues must be checked for each destination node in D_H, and |D_H| < |D|, so this takes at most O(n|D|) time. Step 4 is symmetric to Step 3 and also requires O(n|D|) time. □

Figure 6.21 depicts an example of multi-path routing in a 4-cube multicomputer. Figure 6.21.
A 4-cube with node 1100 as the source for multi-path routing.

Corollary 6.2 The multi-path multicast routing algorithm is deadlock-free for the hypercube.

A fixed-path algorithm can also be developed accordingly. We present a performance study of these proposed algorithms in the next chapter.

CHAPTER 7

PERFORMANCE STUDY OF THE ROUTING SCHEMES

In this chapter, we first present the simulation results for the multicast routing algorithms for MC, MP, and ST under static network conditions. The average traffic of each algorithm is measured for different multicast sets. We then give the simulation results for the proposed deadlock-free multicast routing schemes. These results are measured not only by traffic but also by network latency under dynamic network conditions, in which messages in the network interact dynamically.

7.1 Performance Study under Static Network Traffic

We study the performance of the proposed heuristic multicast algorithms for MC, MP, and ST on a 32 x 32 mesh and a 10-cube by simulation. A large number of randomly generated multicast sets with different numbers of destinations are tested to measure the traffic generated by the algorithms. The number of destination nodes, k, is chosen from 1 to 900. For a given k, a random number generator generates k integers within the range [0, 1023], which represent the k destination addresses. A generated integer i is mapped to a 2D mesh node address (x, y), where x = i mod 32 and y = floor(i/32). In the n-cube, the integer i is represented by the corresponding binary number. Each unit of traffic represents the transmission of one message over a link. The amount of traffic generated by the routing algorithm is averaged over 1000 runs for each k. Figure 7.1 and Figure 7.2 show the amount of traffic generated by the MP algorithm in the 32 x 32 mesh and 10-cube topologies, respectively. Multicast can be performed by multiple one-to-one or broadcast communications.
When using broadcast to implement multicast, the router delivers the message to the local processor only when it is a destination node. The proposed algorithm always creates less traffic than the multiple one-to-one and broadcast methods. A 1-to-k multicast requires at least k units of traffic [20]; the additional traffic is defined as the total amount of traffic minus k. For a broadcast in a network of 1024 nodes, the traffic generated is always 1023, so the additional traffic generated is 1023 - k for k destinations.

Figure 7.1. Performance of the sorted MP algorithm on a 32 x 32 mesh (average additional traffic versus number of destination nodes; curves: multiple one-to-one, broadcast, sorted MP).

Figure 7.2. Performance of the sorted MP algorithm on a 10-cube (average additional traffic versus number of destination nodes; curves: multiple one-to-one, broadcast, sorted MP).

Figure 7.3 and Figure 7.4 show the average additional amount of traffic obtained by simulating the greedy ST algorithm for the 32 x 32 mesh and the 10-cube, respectively. In Fig. 7.3, the performance of the greedy ST algorithm is compared with multiple one-to-one and broadcast communications; it is much better than either of the other two approaches. For the n-cube graph, we compare our results with the LEN heuristic routing algorithm [20], which has demonstrated its superiority over multiple one-to-one and broadcast. The results of our routing algorithm show a significant improvement over the LEN algorithm in terms of the amount of traffic. We also study the performance of the MT algorithm by simulation on a 16 x 16 mesh. Figure 7.5 plots the amount of traffic generated by each proposed MT algorithm for the 16 x 16 mesh.
It shows that the X-first algorithm always creates much less traffic than the multiple one-to-one and broadcast methods, and that the traffic generated by the divided greedy algorithm is always much less than that created by the X-first algorithm.

Figure 7.3. Performance of the greedy ST algorithm on a 32 x 32 mesh (curves: multiple one-to-one, broadcast, greedy ST).

Figure 7.4. Performance of the greedy ST algorithm on a 10-cube (curves: LEN algorithm, greedy ST algorithm).

Figure 7.5. Performance of the X-first and divided greedy algorithms on a 16 x 16 mesh (curves: multiple one-to-one, broadcast, X-first, divided greedy multicast).

For the deadlock-free multicast routing algorithms proposed in Chapter 6, Figure 7.6 and Figure 7.7 compare, for various numbers of destinations, the amount of "additional" traffic resulting from multi-path and dual-path routing. This is a static measurement and does not depend on network traffic conditions, but it gives an indication of the efficiency of each algorithm. The dynamic measurements for these routing algorithms are given in the next section.

Figure 7.6. Performance of different multicast methods on a 6-cube (curves: multiple one-to-one, broadcast, dual-path, multi-path).
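The destination-set generation used throughout these static experiments can be sketched as follows. This is a paraphrase of the set-up described at the start of this section; we assume the k generated integers are distinct, and the function names are ours.

```python
import random

def random_multicast_set(k, width=32, n_nodes=1024, seed=None):
    """Draw k distinct destination addresses and map each integer i to
    the 2D-mesh address (i mod 32, i div 32), as in the 32 x 32 runs."""
    rng = random.Random(seed)
    return [(i % width, i // width) for i in rng.sample(range(n_nodes), k)]

def additional_traffic(total_traffic, k):
    """Traffic beyond the k-link lower bound of a 1-to-k multicast [20]."""
    return total_traffic - k
```

A broadcast on 1024 nodes always generates 1023 units of traffic, so its additional traffic for k destinations is additional_traffic(1023, k) = 1023 - k.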
7.2 Performance Study under Dynamic Network Traffic

The performance of a multicast routing algorithm is measured not only in terms of the delay and traffic resulting from a single multicast message; it also depends on the interaction of the multicast message with other network traffic. In order to study these effects on the performance of the proposed multicast routing algorithms, we have written a simulation program to model multicast communication in 8 x 8 mesh networks. In this section, we describe the program and the results obtained with it.

Figure 7.7. Performance for different numbers of destinations on an 8 x 8 mesh (curves: dual-path, multi-path).

The simulation program used to model multicast communication in 2D mesh networks is written in C and uses an event-driven simulation package, CSIM [57]. CSIM allows multiple pseudo-processes to execute in a quasi-parallel fashion and provides a very convenient interface for writing modular simulation programs. Our simulation program consists of several components, all of which run within the CSIM package. The main program activates 64 CSIM parallel processes, called multicast generators, one for each network node. Each multicast generator loops, creating multicast messages whose destinations are determined by a uniform random number generator. Each multicast message is simulated with a pseudo-process that sends the message to the destinations by creating flit pseudo-processes. Each flit pseudo-process models the transmission of one flit of the message. If the message branches at an intermediate node, the flit process forks several flit processes, one for each new branch. A routing module for each routing algorithm is used by the flit processes to determine the channels on which each message should be transmitted.
Each channel has a single queue of messages waiting for transmission. A statistics module gathers information concerning network traffic and timing, for example, average network latency, using the method of batch means [58]. Although the confidence intervals are not shown in the figures, all simulations were executed until the 95 percent confidence interval was smaller than 5 percent of the mean.

Figure 7.8. Performance under different loads on a double-channel mesh (average network latency versus mean interarrival time; curves: tree-like, multi-path, dual-path).

In order to compare the tree-like and path-like algorithms fairly, we first simulated each on a network that contained double channels. Figure 7.8 plots the average network latency for various network loads. The average number of destinations for a multicast is 10, and the message size is 128 bytes. The speed of each channel is 20 Mbytes/second. All three algorithms display good performance at low loads. As the load is increased, the path algorithms do not degrade as soon as the tree-like algorithm. This result is explained by the fact that in tree-like routing, when one branch is blocked, the entire tree is blocked; this type of dependency does not exist in path-like routing. Multi-path routing outperforms dual-path routing because, as we have shown earlier, its paths tend to be shorter and less traffic is generated; hence, the network does not saturate as quickly.

Figure 7.9. Performance for different numbers of destinations on a double-channel mesh (curves: tree-like, multi-path, dual-path).

The disadvantage of tree-like routing increases with the number of destinations. Figure 7.9 compares the three algorithms, again using double channels.
The average number of destinations varies from 1 to 45. In this set of tests, every node generates multicast messages with an average time between messages of 300 microseconds. The other parameters are the same as for the previous figure. With larger sets of destinations, the dependencies among branches of the tree become more critical to performance and cause the delay to increase rapidly. The path algorithms still perform well, however. Notice that the dual-path algorithm outperforms the multi-path algorithm for large destination sets; this result will be explained shortly. Our conclusion from Figures 7.8 and 7.9 is that tree-like routing is not particularly well suited to 2D mesh networks. First, it requires double channels in order to be deadlock-free. Second, its performance is worse than that of the path-based schemes. The remainder of the simulation results concern only single-link, path-based approaches. As shown by example in Figures 6.13 and 6.16, multi-path routing usually requires fewer channels than dual-path routing. Because the destinations are divided into four sets rather than two, they are reached more efficiently from the source, which is approximately centrally located among the sets.

Figure 7.10. Performance under different loads (average network latency versus mean interarrival time; curves: dual-path, multi-path).

Figure 7.10 plots the average network latency for various network loads under multi-path and dual-path routing. Only single channels are used, the average number of destinations for a multicast is 10, and the message size is 128 bytes. The speed of each channel is 20 Mbytes/second. Both algorithms display good performance at low loads. As the load is increased, multi-path routing offers a slight improvement over dual-path routing, likely because multi-path routing introduces less traffic to the network.
One may conclude from the results given thus far that multi-path routing is superior to dual-path routing. However, a major disadvantage of multi-path routing is not revealed until both the load and the number of destinations are relatively high. Figure 7.11 shows that under these conditions, dual-path routing performs much better than multi-path routing.

Figure 7.11. Performance for different numbers of destinations (curves: multi-path, dual-path).

The reason is somewhat subtle. When multi-path routing is used to reach a relatively large set of destinations, the source node will likely send on all of its outgoing channels. Until this multicast transmission is complete, any flit from another multicast or unicast message that routes through that source node will be blocked at that point. In essence, the source node becomes a "hot spot." In fact, every node currently sending a multicast message is likely to be a hot spot. As the load increases, these hot spots throttle system throughput and greatly increase message latency. Hot spots are much less likely to occur in dual-path routing, and this fact accounts for its stable behavior under high loads with large destination sets: although all of the outgoing channels at a node can be simultaneously busy, this can only result from two or more messages routing through that node. Figure 7.11 also compares fixed-path routing to multi-path and dual-path routing. For a small number of destinations, fixed-path routing traverses many unnecessary channels, creating more traffic and needlessly blocking more messages than multi- or dual-path routing. For a large enough number of destinations, however, dual-path and fixed-path routing offer effectively identical performance.
Because fixed-path routing is much simpler than dual-path routing, it may be the best choice for messages with large destination sets.

CHAPTER 8

CONCLUSIONS

This chapter summarizes the major contributions of this dissertation research and outlines directions for future research in this field.

8.1 Summary of Major Contributions

This dissertation research was motivated by the high demand for multicast communication in multicomputers. We have studied various multicast routing evaluation criteria for multicomputers with different switching techniques. Based on a graph-theoretical model, five optimization problems, namely the problems of finding OMP, OMC, MST, OMT, and OMS, were identified. This dissertation provided a theoretical foundation for the problems of OMP, OMC, MST, and OMS by showing that all of these optimization problems are NP-complete for the 2D mesh, n-cube, and 3D mesh interconnection topologies. In the short time since its introduction in the early 1970s, NP-completeness has come to symbolize the abyss of inherent intractability that algorithm designers increasingly face as they seek to solve larger and more complex problems. Indeed, discovering that a problem is NP-complete is usually just the beginning of work on that problem; knowledge of a problem's NP-completeness provides invaluable information about how hard it is. The proofs of these NP-completeness results were one of the main contributions of this dissertation research. We can therefore claim that it is extremely difficult to find optimal solutions for these problems on mesh or hypercube topologies in a reasonable amount of time. These results also justified our development of heuristic algorithms for these problems. Designing efficient multicast protocols and routing algorithms naturally depends on the topology of the network, and also on the underlying switching mechanism used in the multicomputer.
We have proposed new hybrid heuristic multicast routing algorithms for all of the optimization problems for 2D mesh and hypercube multicomputers using various switching techniques. In general, they perform much better than current multicast routing schemes, including multi-unicast, broadcast, and some existing multicast algorithms.

In order to support reliable and efficient parallel computation in multicomputers, multicast routing schemes must be deadlock-free. We have focused on the deadlock issues of multicast communication under wormhole routing, the most promising switching technique. We have shown that deadlock-free unicast routing algorithms, when extended to include multicast traffic, are no longer deadlock-free. A tree-based algorithm using X-first routing can be made deadlock-free if double channels are used in the network, but it was observed that the tree-like routing model is not well suited to multicast routing in wormhole networks. An alternative approach is to use the multicast star model. We presented three algorithms based on the star model that are deadlock-free without requiring additional channels. These were the first deadlock-free multicast algorithms using wormhole routing to be studied. They can be applied to any multicomputer network that has a Hamilton path, and the performance of the routing scheme depends on the selection of a good Hamilton path. For the 2D mesh and hypercube topologies, we proposed methods to select a Hamilton path such that for any two given nodes there exists a shortest routing path between them.

A simulation study has been conducted that compares the performance of these multicast algorithms under both static and dynamic network traffic conditions. Among the deadlock-free multicast routing algorithms, the dual-path routing algorithm offered the best overall performance over different traffic loads and destination set sizes.
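The Hamilton-path labeling behind these schemes can be sketched for a 2D mesh as follows. This is a minimal illustration of the labeling and destination-splitting idea with hypothetical function names, not the dissertation's exact algorithm: nodes are numbered along a boustrophedon Hamilton path (even rows left to right, odd rows right to left), so nodes with consecutive labels are always physically adjacent.

```python
# Sketch: Hamilton-path labeling of a 2D mesh and a dual-path style split
# of the destination set (illustrative, not the dissertation's algorithm).

def label(x, y, cols):
    """Position of node (x, y) on the boustrophedon Hamilton path of a
    cols-wide mesh: even rows run left to right, odd rows right to left."""
    return y * cols + (x if y % 2 == 0 else cols - 1 - x)

def partition(source, dests, cols):
    """Dual-path style split: destinations labeled above the source go to
    one message (walking up the path), the rest to another (walking down)."""
    s = label(*source, cols)
    high = sorted((d for d in dests if label(*d, cols) > s),
                  key=lambda d: label(*d, cols))
    low = sorted((d for d in dests if label(*d, cols) < s),
                 key=lambda d: -label(*d, cols))
    return high, low

# 4-column mesh: source (1, 1) has label 6; destinations lie on both sides.
high, low = partition((1, 1), [(0, 0), (3, 0), (2, 2), (0, 2)], cols=4)
print(high)  # [(0, 2), (2, 2)] -- visited in increasing label order (8, 10)
print(low)   # [(3, 0), (0, 0)] -- visited in decreasing label order (3, 0)
```

Because one message only ever moves toward higher labels and the other only toward lower labels, no cyclic channel dependency can form along either path, which is the intuition behind the deadlock freedom of these schemes.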
The major disadvantage of multi-path routing is that hot spots may occur under certain conditions, significantly degrading communication performance. A simpler fixed-path routing algorithm offers performance equal to that of the dual-path algorithm for large destination sets.

In summary, this dissertation research involved the study of routing criteria, modeling, the investigation of optimal routing problems, routing algorithm development, deadlock issues, and the performance evaluation of routing schemes.

8.2 Directions for Future Research

The following issues need to be addressed in the future in order to better support multicast communication in multicomputers.

System-Supported Multicast Service

System-supported multicast service can offer a large number of applications improved performance and simplified programming. This dissertation focused on hardware support for multicast routing. In order to provide multicast service to users, we must define a set of multicast primitive operations and develop the interface between application programs and system software, so that the underlying multicast facility can be used easily. A set of protocols must also be developed to implement the multicast primitives using the functionality of the multicast facility. In most existing multicomputers, multicast communication is not directly supported; the protocols have to map each multicast primitive onto many unicast messages.

Hardware Implementation — Multicast Router

In first-generation multicomputers, communication functions were implemented in software. With a hardware implementation, a dedicated hardware router is attached to each node. In order to fully utilize the resources of a multicomputer, such dedicated hardware routers are necessary so that computation and communication can be overlapped and network latency minimized. Some second-generation multicomputers, such as the Intel iPSC/2, use dedicated hardware routers.
In [29], a hardware router for unicast communication that supports the virtual cut-through packet switching method has been reported. A hardware router for multicast communication is presented in [21]. In this dissertation, several deadlock-free multicast routing algorithms have been proposed to support multicast communication in wormhole networks. Hardware implementation of these routing algorithms is required to match the speed of the processors. If multicast is directly supported in hardware, the multicast protocols may be quite simple.

Adaptive Routing and Use of Virtual Channels

The routing algorithms proposed in this dissertation are deterministic. In order to increase network throughput and decrease network latency, adaptive routing may be used; it can also support fault-tolerant routing. The main issue in adaptive routing is avoiding deadlock. Although some adaptive unicast routing schemes have been proposed [36] [37], they are not directly applicable to multicast communication. Some adaptive unicast routing schemes use virtual networks [36]. A virtual channel is a logical channel with its own message buffers, control, and data path; several virtual channels may share a physical channel. The use of virtual channels increases the connectivity of the network and can easily support some adaptive routing schemes. Instead of partitioning the network into high-channel and low-channel networks as in this dissertation, the network may be partitioned into many sub-networks, and the set of destination nodes may then be distributed over the different sub-networks to support multiple multicast paths. The open questions are how many virtual channels are required and how the destination nodes should be partitioned; these are interesting issues that deserve further study.

More Performance Study of Routing Schemes

We have conducted a simulation under both static and dynamic network traffic to study the performance of the proposed multicast routing schemes.
The simulation was conducted under the assumption that the distribution of source and destination nodes is uniform and that the time interval between multicast sends is uniformly random. Some benchmarks need to be run in the simulation in order to obtain more convincing results. Another issue is to study the interaction between unicast and multicast traffic, and how different multicast algorithms affect the performance of unicast wormhole routing.

BIBLIOGRAPHY

[1] R. M. Karp and V. Ramachandran, "A survey of parallel algorithms for shared-memory machines," Tech. Rep. UCB/CSD 88/408, University of California, Berkeley, Mar. 1988.

[2] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing. McGraw-Hill Book Company, 1984.

[3] W. C. Athas and C. L. Seitz, "Multicomputers: Message-passing concurrent computers," IEEE Computer, pp. 9–25, Aug. 1988.

[4] D. A. Reed and R. M. Fujimoto, Multicomputer Networks: Message-Based Parallel Processing. The MIT Press, 1987.

[5] J. L. Gustafson, S. Hawkinson, and K. Scott, "The architecture of a homogeneous vector supercomputer," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 649–652, 1986.

[6] nCUBE Company, NCUBE 6400 Processor Manual, 1990.

[7] C.-T. Ho and S. L. Johnsson, "Distributed routing algorithms for broadcasting and personalized communication in hypercubes," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 640–648, Aug. 1986.

[8] C. L. Seitz, J. Seizovic, and W.-K. Su, "The C programmer's abbreviated guide to multicomputer programming," Tech. Rep. Caltech-CS-TR-88-1, Department of Computer Science, California Institute of Technology, Jan. 1988.

[9] H.-W. Hsu and X. Lin, "Parallel algorithms for labeling image components," in Algorithms, International Symposium SIGAL '90, Tokyo, Japan, Lecture Notes in Computer Science 450, Springer-Verlag, Berlin, New York, 1990.

[10] W.-J. Hsu, L. Wu, and X. Lin, "Optimal algorithms for labeling image components," in Proceedings of the 1990 International Conference on Parallel Processing, Aug. 1990.

[11] Z. Zeng, X. Lin, and T.-Y. Li, "An efficient parallel homotopy algorithm for unsymmetric eigenproblems with O(n) running time," in Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing (to be published in 1992), Mar. 1991.

[12] M. Loper, "Distributed interactive simulation (DIS) requirement for multicast," Aug. 1991. Draft presented at the Multipeer/Multicast Workshop, Orlando, Florida.

[13] R. F. DeMara and D. I. Moldovan, "Performance indices for parallel marker-propagation," in Proceedings of the 1991 International Conference on Parallel Processing, vol. I, (St. Charles, IL), pp. 658–659, Aug. 1991.

[14] S. G. Akl, The Design and Analysis of Parallel Algorithms. Englewood Cliffs, New Jersey: Prentice-Hall, 1989.

[15] V. Kumar and V. Singh, "Scalability of parallel algorithms for the all-pairs shortest-path problem," Journal of Parallel and Distributed Computing, pp. 124–138, Oct. 1991.

[16] X.-H. Sun, L. M. Ni, F. A. Salam, and S. Guo, "Compute-exchange computation for solving power flow problems: The model and application," in Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing (J. Dongarra, P. Messina, D. C. Sorensen, and R. G. Voigt, eds.), pp. 198–203, Dec. 1989.

[17] H. Xu, P. K. McKinley, and L. M. Ni, "Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers," Tech. Rep. MSU-CPS-ACS-47, Department of Computer Science, Michigan State University, East Lansing, MI, Oct. 1991.

[18] C. Moler and D. Scott, "Communication utilities for the iPSC," iPSC Technical Report No. 2, Intel Scientific Computers, 1986.

[19] Y. Lan, A. H. Esfahanian, and L. M. Ni, "Distributed multi-destination routing in hypercube multiprocessors," in Proceedings of the Third Conference on Hypercube Computers and Concurrent Applications, pp. 631–639, Jan. 1988.

[20] Y. Lan, A. H. Esfahanian, and L. M. Ni, "Multicast in hypercube multiprocessors," Journal of Parallel and Distributed Computing, pp. 30–41, Jan. 1990.

[21] Y. Lan, L. M. Ni, and A. H. Esfahanian, "A VLSI router design for hypercube multiprocessors," Integration: The VLSI Journal, vol. 7, pp. 103–125, 1989.

[22] Y. Saad and M. H. Schultz, "Topological properties of hypercubes," IEEE Transactions on Computers, vol. C-37, pp. 867–872, July 1988.

[23] R. Arlauskas, "iPSC/2 system: A second generation hypercube," in Proceedings of the Third Conference on Hypercube Computers and Concurrent Applications, (Pasadena, CA), pp. 38–42, Association for Computing Machinery, Jan. 1988.

[24] J. P. Hayes, T. Mudge, Q. F. Stout, S. Colley, and J. Palmer, "A microprocessor-based hypercube supercomputer," IEEE Micro, vol. 6, pp. 6–17, Oct. 1986.

[25] W. J. Dally, "Performance analysis of k-ary n-cube interconnection networks," IEEE Transactions on Computers, vol. C-39, pp. 775–785, 1990.

[26] C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele, and W.-K. Su, "The architecture and programming of the Ametek Series 2010 multicomputer," in Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Volume I, (Pasadena, CA), pp. 33–36, Association for Computing Machinery, Jan. 1988.

[27] L. M. Ni, "Communication issues in multicomputers," in Proceedings of the Fifth Workshop on Parallel Processing, (Taiwan), Dec. 1990.

[28] P. Kermani and L. Kleinrock, "Virtual cut-through: A new computer communication switching technique," Computer Networks, vol. 3, no. 4, pp. 267–286, 1979.

[29] W. J. Dally and C. L. Seitz, "The torus routing chip," Journal of Distributed Computing, vol. 1, no. 3, pp. 187–196, 1986.

[30] W. J. Dally, "The J-Machine: System support for Actors," in Actors: Knowledge-Based Concurrent Computing (Hewitt and Agha, eds.), MIT Press, 1989.

[31] L. M. Ni and P. K. McKinley, "A survey of routing techniques in wormhole networks," Tech. Rep. MSU-CPS-ACS-46, Michigan State University, Oct. 1991.

[32] D. C. Grunwald and D. A. Reed, "Networks for parallel processors: Measurements and prognostications," in Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Volume I, (Pasadena, CA), pp. 610–619, Association for Computing Machinery, Jan. 1988.

[33] X. Lin and L. M. Ni, "Multicast communication in multicomputer networks," in Proceedings of the 1990 International Conference on Parallel Processing, pp. III-114 – III-118, Aug. 1990.

[34] H. Sullivan and T. R. Bashkow, "A large scale homogeneous machine," in Proceedings of the 4th Annual Symposium on Computer Architecture, pp. 105–124, 1977.

[35] C. Lang, "The extension of object-oriented languages to a homogeneous concurrent architecture," Tech. Rep. 5014:TR:82, Department of Computer Science, California Institute of Technology, 1982.

[36] D. H. Linder and J. C. Harden, "An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes," IEEE Transactions on Computers, pp. 2–12, Jan. 1991.

[37] C. Glass and L. M. Ni, "Adaptive wormhole routing in multicomputer networks," Tech. Rep. MSU-CPS-ACS-35, Michigan State University, Oct. 1991.

[38] K. D. Gunther, "Prevention of deadlocks in packet-switched data transport systems," IEEE Transactions on Communications, pp. 512–524, Apr. 1981.

[39] P. M. Merlin and P. J. Schweitzer, "Deadlock avoidance in store-and-forward networks – I: Store-and-forward deadlock," IEEE Transactions on Communications, pp. 345–354, 1980.

[40] S. Toueg and J. D. Ullman, "Deadlock-free packet switching networks," in Proceedings of the 11th ACM Symposium on Theory of Computing, pp. 89–98, 1979.

[41] S. Toueg, "Deadlock- and livelock-free packet switching networks," in Proceedings of the 12th ACM Symposium on Theory of Computing, pp. 94–99, 1980.

[42] D. Gelernter, "A DAG-based algorithm for prevention of store-and-forward deadlock in packet networks," IEEE Transactions on Computers, pp. 709–715, Oct. 1981.

[43] S. F. Nugent, "The iPSC/2 direct-connect communications technology," in Proceedings of the Third Conference on Hypercube Computers and Concurrent Applications, pp. 51–60, Jan. 1988.

[44] W. J. Dally and C. L. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Transactions on Computers, vol. C-36, pp. 547–553, May 1987.

[45] F. Harary, Graph Theory. Reading, Massachusetts: Addison-Wesley, 1972.

[46] G. T. Byrd, N. P. Saraiya, and B. A. Delagi, "Multicast communication in multiprocessor systems," in Proceedings of the 1989 International Conference on Parallel Processing, pp. I-196 – I-200, 1989.

[47] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.

[48] X. Lin and L. M. Ni, "Deadlock-free multicast wormhole routing in multicomputer networks," in Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 115–125, May 1991.

[49] X. Lin, P. K. McKinley, and L. M. Ni, "Performance study of multicast wormhole routing in 2D-mesh multicomputers," in Proceedings of the 1991 International Conference on Parallel Processing, pp. I-345 – I-348, 1991.

[50] H. A. Choi and A. H. Esfahanian, "On complexity of a message-routing strategy for multicomputer systems," in Proceedings of the 16th International Workshop on Graph-Theoretic Concepts in Computer Science, (West Berlin, West Germany), June 1990.

[51] A. Itai, C. H. Papadimitriou, and J. L. Szwarcfiter, "Hamilton paths in grid graphs," SIAM Journal on Computing, vol. 11, pp. 676–686, Nov. 1982.

[52] M. R. Garey and D. S. Johnson, "The rectilinear Steiner tree problem is NP-complete," SIAM Journal on Applied Mathematics, vol. 32, pp. 826–834, 1977.

[53] R. L. Graham and L. R. Foulds, "Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time," Mathematical Biosciences, vol. 60, pp. 133–142, 1982.

[54] D. S. Johnson and C. H. Papadimitriou, "Computational complexity," in The Traveling Salesman Problem (E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys, eds.), John Wiley & Sons, 1984.

[55] L. Kou, G. Markowsky, and L. Berman, "A fast algorithm for Steiner trees," Research Report RC 7390, IBM, Nov. 1978.

[56] D. W. Wall, Mechanisms for Broadcast and Selective Broadcast. PhD thesis, Stanford University, June 1980.

[57] H. D. Schwetman, "CSIM: A C-based, process-oriented simulation language," Tech. Rep., 1985.

[58] A. M. Law and W. D. Kelton, Simulation Modeling and Analysis. New York: McGraw-Hill, 1982.