. ‘—

 

”WWW

2722225922“ 2

.2 .

.1-
.. i2
2222'22
o “I I 2'

 

222-22:
WM 2
‘VYI

,.,._..--._._
A . v.4 .-

. '.c.‘“‘“
a- 2

 

c.22‘
222’"

Eng

2 -.22
'.-.‘.122'Iv2l
: 2'22. 22222225

22:22. 2.. 22222.

r‘..22‘|2

2 2
222
2 2‘2" 2

"I 22 .222
. 222 ."’2
‘2
2:2.7' 2.'.2."
2232722If

222-22 .2
"42.21222

22E;
._1 J:
I 2 V 2
3

NJ " '2‘22
__.

1‘: 222 l':
. 2222

2 .

2). .
“2‘22

.1

.f”
'3';

g. 6
:vu~
.

L__...1

. . <
....f-‘v-o-~

“-
22
J._

A

.-_.4.. t n-
‘

-.-. - r».

-. NwC-

rt 0..“
.a-OCI IL-

 

n...-o-v «Hf 74f:
’
.4";

‘5‘...
o

a..."

..
_cn.’\‘-"- V‘
7"- ...-¢.....o --

. .n ..
n-

"$233....“—
"'v 3‘ ..
l .‘

-v

J.
l. -.‘o

4221'”.
:21. 2! ‘27:;

v. .“545—

‘ﬁcﬁw:

.-.4

‘I‘

'G‘...o-
..

....'.
w..-

‘ .‘- .
. '.-..~.4~uv

.-
j

1‘22

2,.

-
‘—

1

,-

.... ~
v - o -—'
" “,ﬁu...
‘2‘- ..

era-3

4

5:3.

,.,.7--..
.WA...
.

n
v __

c
at
‘

.4...-

.‘-

.7

A .‘_.
W“ . --—.‘.

“21”.;

'f-J

'22“ .222

.2-

t'nhm“

2 E22322

. A
. . _ 4. ' H V A“: .'. CV .u
_ . . .. ... . . .J‘.) ~.
~ ,A . , . 7 ,,-,
_ _, ... *3 ’A “
f... . _. . .' ‘ 2
f: ~ — ' _ . f ._ V“ I
.g M... .’ ﬂ ; -2 - .‘.
n '
_.... _

227:2“; ’2' 2:»

22:2

27:2: 2 2:27:

1

' 132,31}th
2222.2
322 2’) 12:222-
f." I 2 M22 221321;"

' {32' 2‘

. 22.12.527.722225232.

 

22,”

222322;;‘

.-
. ' .
.— .
.
- . .4
2. ~.
1.,
m .- ’
‘ .. Ira-v9
' ..
' .

'L

-.
‘1- L12.

3,223.1.

12:"-
'2

.i. .-
...._
I Cﬁn',

1.2....

.24.?ch

unpu>
‘_. .< n
.-

‘2‘
2 [22:22:

01202
2222“.

2 L2,

232-29 “

" -. ,HI 2

222222 "2 2.22.2121

2262222222! 922.2

222222 - ,
"2221’ l

22».

.2.- 3-) "221-:
3202212122

2.22
22.

.‘J

I

. .5373

2)! “1'1
22'!

2.9.3

222222.‘

1
2.5.

“.2223
01"“

“I!
_ ca

0 .
_ npni

' .21“...

¢

1.:

... 4 .
1:17.:

"Paar:

. :1"

 

..
,........v

XVI.

""L'J,’

a- n.-

"Y‘ t‘::!‘.'." ~"
_. ..... .um: _ 7:, r"-
. . -m ‘ f
.

u.

2:51. ‘9
7" ,.:
..

'UF—

\lmq. ‘— ’
$25.19?“
2' ,‘21n‘r' 3‘6}

   
   

.. :«n’?

“.4

v

THESIS

lllllllllllllllllllllllllllllllllllllll

3 1293 01686

This is to certify that the

dissertation entitled

Network Routing and Planning
in Switch-Based Networks

presented by

Wenjian Qiao

has been accepted towards fulﬁllment
of the requirements for

Doctoral degree in PhilggophL

\AMO \Vl M

Major v\professor

Date sept. 9, 1997

MS U is an Afﬁrmative Action/Equal Opportunity Institution 0-12771

 

 

 

 

LIBRARY

Michigan State
University

 

 

 

PLACE IN RETURN BOX
to remove this checkout from your record.
TO AVOID FINE-3 return on or before date due.

 

DATE DUE MTE DUE MTE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

me macs-m4

 

NETWORK ROUTING AND PLANNING IN SWITCH-BASED
NETWORKS

By

Wenjian Qiao

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁlhnent of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1997

ABSTRACT
NETWORK ROUTING AND PLANNING IN SWITCH-BASED NETWORKS
By

Wenjian Qiao

In recent years, switch-based networks have received much attention in both local area
networks (LAN 8) and wide area networks (WAN 5) due to their higher network bandwidth
and throughput, greater interconnect scalability and ﬂexibility, and better fault handling
capability than shared-medium networks. For cost and performance reasons, switch-based
networks should support arbitrary topologies. Although arbitrary topologies do provide
the needed ﬂexibility and incremental scalability, network routing and planning in such
networks are not trivial and have a great impact on the network performance.

This thesis studies issues of switching techniques, routing algorithms, topology design,
network tuning, and performance evaluation in switch-based networks. Three switching
schemes are reviewed and compared. Analytical models are presented for average delay
analysis and comprehensive simulation results are conducted to verify the analytical models
and evaluate network performance under different switching schemes and buffer architec-
tures. 'vao novel deadlock-free routing algorithms are proposed to allow arbitrary inter-

connection of cut-through (wormhole) switches. In the proposed routing algorithms, two

opposite unidirectional Eulerian trails are used to avoid deadlock and shortcuts are used to
provide shorter routes. For initial network topology design, a simple method based on block
design is proposed to efﬁciently minimize path length. To improve network performance,
a systematic method is proposed for network tuning and incremental reconﬁguration. Our
design principles for network planning and tuning include minimizing routing path length,
exploiting trafﬁc locality, and balancing channel utilization. To obtain a practical insight to
the network behavior and performance under various networking environments, a ﬂexible
and extensible simulator has been implemented. Comprehensive simulation experiments
have been conducted to demonstrate network behavior under different situations and eval-

uate network performance of our proposed algorithms.

© Copyright 1997 by Wenjian Qiao
All Rights Reserved

To my parents and my husband

ACKNOWLEDGMENTS

I would like to take this Opportunity to express my appreciation to those who have con-
tributed to the completion of this dissertation. I am very grateful to my advisor Professor
Lionel M. Ni for exposing me to a broad range of exciting issues in computer networks and
distributed systems. Without his patient instruction and imaginative enlightening, this dis-
sertation would not be possible. He has provided me with a great deal of encouragement, a
direction for my research, and a great sense of organization.

I would like to thank the other members of my dissertation committee: Professors
Abdol-Hossein Esfahanian, Tien-Yien Li, and Matt W. Mutka, for their valuable sugges-
tions on this dissertation, for their careful review of the manuscript, and for their time and
support. Special thanks to Professor Esfahanian for his very good suggestions to improve
this dissertation.

I wish to thank all the people who helped me during my stay at Michigan State Univer-
sity. Thanks especially to Natawut Nupairoj, Mingyao Yang, Dr. Hong Xu, Sherry Moore,
Wei-kuo Liao, and Xipeng Xiao for their valuable technical discussions and sharing their
views on research with me.

I also wish to acknowledge the love and support of my parents and my parents-in-law.

vi

My parents have provided their constant love, encouragement, and support. My parents-
in-law offered their continuous support and patient understanding throughout my doctorate
work.

Finally and foremost, I wish to acknowledge and thank the most important person in my
life, my husband Dai Wan. He deserves most credit, for loving me, for providing support

in time of need, and for putting up with me during all of my ups and downs.

vii

TABLE OF CONTENTS

LIST OF TABLES xi
LIST OF FIGURES xii
1 Introduction 1
1. 1 Motivations .................................... 5
1.2 Problem Statement and Research Contributions ................. 6
1.3 Organization of the Thesis ............................ 10
2 Switch Design and Switch Interconnects 12
2.] Switching Technology .............................. 12
2.2 Considerations on Switch Design ......................... 13
2.2.1 Nonblocking vs. Blocking ........................... 15
2.2.2 Scalability and Incremental Expandability ................... 15
2.2.3 Switching Techniques ............................. 16
2.2.4 Buffer Architecture and Channel Arbitration ................. 17
2.2.5 Other Issues ................................... 18
2.3 Example Switches ................................. 21
2.4 Design Considerations on Switch Interconnects ................. 25
2.4.1 Routing Algorithms ............................... 25
2.4.2 Topology Design and Incremental Reconﬁguration .............. 27
2.4.3 Other Issues ................................... 30
2.5 Trafﬁc Characteristics and Measurement ..................... 31
2.6 Performance Metrics and Evaluation ....................... 33
3 Performance Evaluation of Switching Techniques 37
3.1 Analytical Model ................................. 38
3.1.1 Delay Analysis ................................. 39
3.1.2 Analytical Model for Store-and-Forward Switching .............. 41
3.1.3 Analytical Model for Virtual Cut-Through Switching ............. 44
3.1.4 Cut-Through Probability ............................ 45
3.2 Simulation Results ................................ 46
3.2.1 Simulation Environment ............................ 46
3.2.2 Veriﬁcation of Analytical Models ....................... 47
3.2.3 Performance Evaluation ............................ 52

4 Deadlock-Free Routing in Arbitrary Switch-Based Networks 63

4.1 Related Work ................................... 64
4.2 Deﬁnitions and Eulerian Trail ........................... 67
4.2. 1 Deﬁnitions ................................... 69
4.2.2 Eulerian Trail .................................. 70
4.3 Eulerian-Trail Routing Algorithm (ETR) ..................... 71
4.3.1 Routing Tables ................................. 72
4.3.2 Time Complexity ................................ 72
4.4 Adaptive-Trail Routing Algorithm (ATR) .................... 73
4.4.1 Adaptive Trail .................................. 74
4.4.2 Routing Tables ................................. 79
4.4.3 Time Complexity ................................ 81
4.5 Heuristics ..................................... 8 1
4.5.1 Heuristics about Degree of Adaptivity ..................... 82
4.5.2 Selection of Eulerian Trails ........................... 83
4.6 The Proof of Deadlock Freedom ......................... 84
4.7 Performance Evaluation .............................. 89
4.7.1 Network Workload and Parameters ....................... 90
4.7.2 Network Topologies .............................. 92
4.7.3 Uniform Trafﬁc ................................. 95
4.7.4 Client/Server Trafﬁc .............................. 102
5 Network Planning and Timing in Switch-Based LANs 110
5.1 Design Considerations of Switch-Based Networks ................ 111
5.1.1 Routing Path Length .............................. 111
5.1.2 Local Trafﬁc, Remote Trafﬁc and Inter-Switch Trafﬁc ............. 112
5.1.3 Balance Channel Utilization .......................... 113
5.2 Model Description ................................ 114
5.2.1 Basic Parameters ................................ 114
5.2.2 Design Variables ................................ 115
5.3 Initial Topology Design .............................. 1 18
5.3.1 From Block Design to Network Design .................... 120
5.3.2 Construction of Blocks ............................. 122
5.3.3 Construction of Bo ............................... 127
5.3.4 Considerations of Trafﬁc Distribution ..................... 132
5.4 Incremental Reconﬁguration ........................... 133
5.4.1 Overall Considerations ............................. 134
5.4.2 Add A New Client ............................... 135
5.4.3 Add A New Server ............................... 136
5.4.4 Network Tuning ................................ 136
5.5 Simulations .................................... 141
5.5.1 Simulation Environment ............................ 141
5.5.2 Performance Improvement by Network Tuning ................ 145
5.5.3 Server Bottleneck ................................ 148
5.5.4 Fairness ..................................... 148

ix

5.6 Evaluation of the Network Planning Algorithm ................. 151

6 A Flexible Simulation Model for Performance Evaluation 160
6.1 Overall Structure ................................. 161
6.2 Design and Implementation ............................ 163
6.2.1 Routing Algorithm ............................... 164
6.2.2 Switch Architecture ............................... 165
6.2.3 Incoming and Outgoing Channels ....................... 170
6.2.4 Message Generation and Reception ...................... 171
6.2.5 Flow Control .................................. 172
6.2.6 Basic Parameters ................................ 172
6.2.7 Performance Metrics .............................. 173
6.3 Workload and Topology Support ......................... 173
6.4 Flexibility and Expandability ........................... 176
6.5 Simulation Experiments .............................. 177
7 Conclusions 178
7.1 Contributions of the Thesis ............................ 178
7.1.1 Performance Evaluation of Switching Techniques ............... 179
7.1.2 Routing Algorithm ............................... 179
7.1.3 Network Planning and Tuning ......................... 180
7. 1 .4 Simulations ................................... 1 80
7.2 Future Work .................................... 18 1
BIBLIOGRAPHY 183

4.1
4.2
4.3
4.4
4.5

5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8
5.9
5.10
5.11
5.12

6.1

LIST OF TABLES

Example routing table entries in Switch 8 for deterministic ETR routing . . . . 73
Example routing table entries for a ﬁrst visited node ............... 80
Example routing table entries for a multiply visited node ............ 81
The network conﬁgurations for simulation .................... 93
Three models to study the effect of server locations for A and D ........ 106
List Of important symbols in Chapter 5 ...................... 1 19
Blocks of symmetric BIBD (7,3,1) ........................ 120
Blocks of BBD (14,3,1) .............................. 123
Blocks of CBD (14,4+2,Z 1) ........................... 124
The two columns determined by z' and j ..................... 125
The two columns determined by i and 3' (after rotation) ............. 126
Relationship between block size and number of distinct differences ....... 127
Minimum r to yield all differences in a BBD (s, r, 2 1) ............. 129
Blocks of BBD (10, 4, Z 1): before reorder and after reorder .......... 131
Number of clients in each workgroup ...................... 142
Trafﬁc distribution of pattern 1 .......................... 143
Trafﬁc distribution of pattern 2 .......................... 144
List of major parameters in SimNet ........................ 172

xi

1.1

2.1
2.2
2.3

3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
3.10
3.11
3.12
3.13
3.14

4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8
4.9
4.10
4.1 1
4.12
4.13

5.1
5.2

LIST OF FIGURES

A switch-based network with nine 8-port switches ................ 2
A high-level view of switch architecture ..................... 14
A network with three switches and two channels ................. 34
An example of latency/throughput performance curve .............. 35
Four networks for veriﬁcation and evaluation .................. 48
Average latency of store-and-forward switching under uniform trafﬁc ..... 49
Average latency of virtual cut-through switching under uniform trafﬁc ..... 50
Average latency of store-and-forward switching under client/server trafﬁc . . . 51
Average latency of virtual cut-through switching under client/server trafﬁc . . . 52
Comparison of switching schemes under uniform trafﬁc ............. 54
Comparison of switching schemes under client/server trafﬁc .......... 55
Comparison of FIFO and shared buffering under uniform trafﬁc ........ 56
Comparison of FIFO and shared buffering under client/server trafﬁc ...... 57
Effect of packet size ................................ 58
Comparison of switching schemes under uniform trafﬁc ............. 59
Comparison of switching schemes under client/server trafﬁc .......... 60
Comparison of buffer management under uniform trafﬁc ............ 61
Comparison of buffer management under client/server trafﬁc .......... 62
An example of an arbitrary network topology .................. 68
An example of deadlock ............................. 74
Illustration for the generation of adaptive trails .................. 76
All cases where f—shortcuts from u to 1) should be removed to avoid deadlock . 78
The deadlock case due to source shortcut ..................... 79
Six topologies considered in simulation ..................... 92
Comparisons of ETR and ATR under virtual cut-through switching ....... 94
Performance under uniform trafﬁc with ﬁxed message distribution ....... 96
Performance under uniform trafﬁc with bursty message distribution ...... 101
Effect of buffer size under uniform trafﬁc .................... 103
Performance under client/server trafﬁc ...................... 104
Effect of server locations and number of servers ................. 108
Effect of buffer size under client/server trafﬁc .................. 109
From block design to the network ........................ 121
The algorithm to construct Bo .......................... 130

xii

5.3
5.4
5.5
5.6
5.7
5.8
5.9
5.10

5.11
5.12
5.13

5.14
5.15

5.16

6.1
6.2
6.3
6.4
6.5
6.6

Algorithm to add a new client to the network .................. 135

Algorithm to add a new server to the network .................. 136
Algorithm for network tuning ........................... 138
Initial topology used for simulation ........................ 142
Performance improvement by apply network tuning ............... 147
Performance comparison of trafﬁc pattern 1 and trafﬁc pattern 2 ........ 148
The importance of higher bandwidth to release server bottleneck ........ 149
Comparison of fairness ratio for workgroup 2 with and without a higher band-

width channel ................................. 150

Five non-isomorphic switch-based networks with 6 switches and 12 channels . 152
Nine non-isomorphic switch-based networks with 6 switches and 11 channels . 153
Performance comparisons of networks (6 switches and 12 channels) under uni-

form trafﬁc .................................. 155
Performance comparisons of networks (6 switches and 1 1 channels) under uni-

form trafﬁc .................................. 156
Performance comparison of networks (6 switches and 12 channels) under

client/server trafﬁc .............................. 157
Performance comparisons of networks using high bandwidth server ports under

client/server trafﬁc .............................. 158
The structure and modular relationship in SimNet ................ 162
Data structure RoutEntry ............................. 165
A generic model of a switch ........................... 165
Deﬁnition of class Switch ............................ 166
Algorithm to simulate switch forwarding in each cycle ............. 168
Class Deﬁnitions of InChannel and OutChannel ................. 170

xiii

Chapter 1

Introduction

In recent years, switch-based networks have received much attention in both local area
networks (LAN 5) and wide area networks (WAN 5) due to their higher network bandwidth
and throughput, greater interconnect scalability and ﬂexibility, and better fault handling ca-
pability than shared-medium networks. Traditional shared-medium LANs do not provide
satisfactory throughput and latency for some communication intensive applications, espe-
cially those emerging multimedia applications. For example, a fast workstation can easily
choke a 10Mbps Ethernet. On a shared-medium network only one host can be a sender and
other hosts can only be receivers at one moment. Switch—based networks make such appli-
cations possible by providing virtual point-to-point communication and achieving higher
bandwidth and lower latency. Security is improved than that of share-medium networks
because data on the physical link is no longer always available to every host on the subnet.
Switched LANs are also an incremental approach to upgrade LAN performance since the
investment in existing adapters, cables, and drivers are preserved and switched LAN 3 still
conform to widely accepted network standards.

1

2

An example of switch-based networks is shown in Figure 1.1, where nine 8-port
switches are interconnected. An open port is either connected to a node or open for fu-
ture usage. A node can be a workstation, a multi-processor, or a gateway to another net-
work. In a switch-based network, each host computer has a network adapter connecting to
a network switch. When the scale of the network increases due to the increasing number of
host computers and the increasing demand of aggregate network bandwidth, more switches
and channels can be added to the network to accommodate the increasing communication
demands. The interconnection of switches deﬁnes various network topologies. For cost
and performance reasons, switch-based networks should support arbitrary topologies. Al-
though arbitrary topologies do provide the needed ﬂexibility and incremental scalability
[1], network routing and planning in such networks are not trivial and have a great impact

on the network performance.

 

Figure 1.1: A switch-based network with nine 8-port switches

A switch-based network provides an excellent platform for distributed multimedia ap-

plications and high performance computing based on networks of workstations. Obviously,

3

good switch design and appropriate interconnection of switches play signiﬁcant roles in im-
proving network performance and meet performance requirements. This thesis addresses
the challenging issues in switch design and switch interconnect, presents practical and ef-
ﬁcient methods for network routing, planning, and tuning, and evaluates network perfor-
mance based on comprehensive simulation results. Analytical modeling is also presented
to analyze the average latency for store—and-forward and virtual cut-through switching.

There are two common forwarding schemes used in building LAN switches: store-and-
forward and cut-through. Store-and-forward switching must keep a whole frame of data
buffered before forwarding the data. The disadvantages of this approach include increased
latency and buffer space, especially for frame-switching networks. In contrast, cut-through
switching can send packets in a pipeline fashion and forward partially received data as
long as the packet header is received and decoded, thus promises low-latency delivery. On
the other hand, store-and-forward switching can make better decisions for corrupted or
truncated frames by receiving a whole packet, but cut-through switching just propagates a
frame all the way to its destination. If a frame is corrupted or truncated, it may not be able
to be fully removed from the network until it reaches the destination, thus wasting network
bandwidth. However, in a LAN environment the error rate for a data frame is generally very
low. Therefore, latency-sensitive real-time applications and high performance computing
will greatly beneﬁt from cut-through switching because of its shorter latency.

Cut-through switching can be further distinguished as virtual cut-through switching
[2] and worrnhole switching [3]. Virtual cut-through switching buffers the entire packet
in a large buffer when the outgoing channel is in use, effectively degrading to the store-

and-forward mode. In contrast, a blocked worrnhole packet stalls in the network, holding

4

previous channels in its route until the outgoing channel becomes available. Since a worm-
hole switch does not need to store entire packets, it is enough to have a small buffer holding
a few ﬂits of the packet, which relieves the burden on buffer space. Furthermore, wormhole
switching can provide primitive link-to-link ﬂow control capability due to the back pres-
sure mechanism in each link. Wormhole switching can achieve low latency, particularly in
a lightly-loaded network; however, a blocked wormhole packet may stall other trafﬁc on its
route because it holds the channel(s). This limits network throughput and complicates the
effort to avoid packet deadlock. Wormhole switched networks can improve throughput by
including additional ﬂit buffers or by incorporating virtual channels [4] on each physical
channel to allow other trafﬁc to bypass a blocked packet.

Many new generation switches support cut-through switching, such as the DEC GI-
GAswitch [5] for FDDI networks, the Ancor FCS 266/ 1062 [6] switches for FCS networks,
the Myricom Myrinet [7], Ethernet Switches[8], and Sun Microsystem’s S-Connect [9].
Unlike cell-based switches such as ATM switches, the above frame-based switches allow
large packets and can increase the effective channel utilization. This research concentrates
on these frame-based switches which can support varying size of frames.

There are design and performance tradeoffs among different switching techniques. Al-
though cut-through switching schemes have the potential to reduce communication latency,
store-and-forward switching can simplify operation over different channel bandwidth and
reduce ﬂow-control requirements. Ideally, the selection of a speciﬁc switching scheme
should depend on network conditions and dynamically change its switching mode to pro-
vide best network performance. Cut-through switching could be applied to achieve low

latency under light workload; while store-and-forward switching may be applied to achieve

high throughput under heavy workload.

1.1 Motivations

High-speed networks are demanded for various applications, such as high-performance
computing, database applications, and distributed multimedia communications. A high-
speed network should provide high throughput, low latency, and high scalability for various
applications. However, current network systems may not satisfy these requirements. Sys-
tem area networks (e.g., some MPP machines), may provide high throughput due to high
bandwidth channel and low latency due to wormhole routing. But they usually have the
constraints of regular topologies and small physical distance between switches (routers),
which result in poor incremental scalability in general. Shared medium LAN s (e.g., Ehter-
net) do not provide scalable bandwidth to accommodate the increasing communication
demands. In traditional store-and-forward switched networks, the communication latency
is high because packets have to be entirely buffered in each intermediate switch before
forwarding.

Therefore, the question is how to design a high—speed network dealing with the fol-
lowing requirements. First, it provides high throughput and low latency, which means
cut-through switching is the choice in such a network. Second, it is able to cover longer
distance band in local area. Thus, system area networks may not be a good candidate.
Third, it should be incrementally scalable. Because regular topologies impose stringent
interconnect constraints and may not be incrementally scalable, arbitrary topologies are

needed due to the economic reason, the technical requirement, and the engineering wiring

6

constraints. Fourth, multiple paths between switches may be needed in order to handle
fault-tolerance. It is the above requirements that trigger our research to study issues in
switch-based networks and allow arbitrary switch interconnects.

As we mentioned earlier, switch-based networks are becoming popular to meet the
increasing demand for higher network performance. Meanwhile, there are still lots of chal-
lenging issues open to researchers. A good method for performance evaluation is necessary
to understand the network behavior and guide network design under various network envi-
ronments. It is not trivial to allow deadlock-free routing in any arbitrary network especially
for a wormhole switched network. Without topology constraint, network planning and
incremental reconﬁguration become important in order to provide reasonable network per-
formance. In order to increase sustained throughput, channel utilization should be balanced
under expected trafﬁc workload. Some other challenges include fault-tolerant routing and

efﬁcient multicast communication.

1.2 Problem Statement and Research Contributions

This thesis considers high-speed switch-based LANs, where wiring length is short and the
cost of channels is relatively insigniﬁcant comparing with the cost of switches. Such net-
works provide a lower network latency, high incremental scalability, and very infrequent
transmission error. Due to the high reliability of transmission, link level acknowledgment
is eliminated and only end-to—end acknowledgment is supported. Such a network has fully
incremental scalability, where channels and switches may be added, removed, or reconﬁg-

ured to adapt to a particular network trafﬁc pattern of a given environment.

7

In order to reduce communication latency, the trend of new generation switches is to
support cut-through switching. Recently, cut-through switching is being applied to switch-
based interconnects like Myrinet [7] and ServerNet [10] to build networks of workstations
in LANs. The concept of wormhole switching, as a special case of cut-through switching,
has been used in new generation scalable parallel computers, such as the IBM SP [11],
Cray T3D [12], MIT J-machine [13], Ncube-3 [14], and Intel Paragon [15]. In order to
avoid deadlock, those scalable parallel computers have to use regular network topologies,
such as meshes and tori. Many deadlock-free routing algorithms have been proposed for
such regular network topologies. Interested readers may refer to [16] for a detailed survey.
However, using regular network topologies is impractical and uneconomical for hi gh-speed
switched networks as regular networks impose stringent interconnect constraints and are
not incrementally scalable. Thus, a natural approach is to allow arbitrary interconnection
of switches.

A major difﬁculty in constructing a large-scale network with wormhole switches is
to avoid deadlock among simultaneously transmitting frames. An efﬁcient deadlock-free
routing algorithm is highly demanded for switch-based networks using wormhole switches.
Deadlock-free routing algorithms based on regular topologies cannot be applied for an
arbitrary topology. Although some related algorithms have been proposed for arbitrary
topologies, they either are too complex or easily cause trafﬁc bottleneck.

While arbitrary topologies do provide ﬂexibility and incremental scalability [1], the
design of an appropriate network topology can be a complex undertaking. After the net-
work becomes operational, network tuning and upgrading may be needed to maintain

high performance. Network topology design has been an important issue for many years.

8

In the past, considerable research efforts have been spent on Optimal topological design
[17, 18, 19, 20], which was based on some assumptions including known trafﬁc distribu-
tions and homogeneous environment. Although these algorithms do have important con-
tributions for topological design, they may not be applicable to real network environment
due to unrealistic assumptions and very time-consuming algorithms.

In this thesis, we study network routing and planning issues in switch-based networks
and evaluate network performance under different switching techniques, different routing
algorithms, different topologies, different trafﬁc distributions, and different network pa-

rameters. The major contributions are summarized as follows.

1. Both analytical modeling and simulation results are used to evaluate performance in
switch-based networks. Analytical modeling provides a cost-effective way to predict
the average latency for store-and-forward switching and virtual cut-through switch-
ing. However, analytical models usually require simplifying assumptions and may
not be able to give a practical insight of network behavior due to the dynamic nature
of networks. Therefore, simulation results are measured to evaluate performance un-
der various network environments which are difﬁcult to be estimated by analytical

modeling. Simulation results are also used to verify the analytical models.

2. Two novel deadlock-free routing algorithms are proposed for arbitrary networks us-
ing wormhole switches [21]. In order to avoid deadlock, an Eulerian trail is used
to help maintain an order of channel dependency and allow reasonable routing. The
critical issue is how to reduce network latency due to channel blocking and at the

same time avoid deadlock. In order to reduce message blocking time and allow more

9

and shorter routing paths, shortcuts are added to the Eulerian trail. To avoid deadlock,

some shortcuts have to be removed or used in a restricted way.

3. We study design principles and propose efﬁcient and practical methods for network
planning and tuning in switch-based LANs. To provide good network performance,
our design criteria include minimizing routing path length, exploiting trafﬁc local-
ity, and balancing channel utilization. The idea of block design is applied for ini-
tial topology design to efﬁciently minimize routing path length. For network tun-
ing and incremental reconﬁguration, we consider trafﬁc distributions in the network
and present a systematic method to reduce message latency and increase network
throughput. Simulation experiments are implemented to demonstrate the perfor-

mance improvement by applying our method.

4. We have implemented a ﬂexible and extensible simulator for the purpose of perfor-
mance evaluation in switch-based networks. Our simulator enables comprehensive
experiments with a variety of routing algorithms, network topologies, switching tech-
niques, buffer architectures under diverse trafﬁc patterns. Due to the dynamic nature
of networking environments, simulation results should give us more practical insight
to the network behavior and performance than analytical modeling. Intensive simu-
lation experiments are conducted under various network environments and network
performance are compared for different routing algorithms, different topologies, dif-

ferent trafﬁc distributions, and many other network parameters.

10
1.3 Organization of the Thesis

This thesis presents effective techniques for the design and evaluation of switching tech-
nique, routing algorithm, network topology in switch-based networks. Buffer architectures,

trafﬁc distributions, and many other network parameters are considered as well.

Chapter 2 investigates major issues in switch design and switch interconnects. It ad-
dresses the fundamental issues such as scalability, switching techniques, deadlock-free
routing, network planning and tuning, trafﬁc characteristics, performance metrics, and so

on.

In Chapter 3, we compare different switching schemes and buffer architectures based on
both analytical models and simulation results. Analytical modeling provides an easy way
to estimate the average latency in the network, but they may depend on some unrealistic
assumptions. Comprehensive simulation results can demonstrate the effect of different

network environments on performance.

Chapter 4 presents two novel deadlock-free routing algorithms for arbitrary wormhole
switched networks. The routing algorithms are based on an Eulerian trail and adding short-
cuts. Several related algorithms are reviewed and compared with our algorithm based on

comprehensive simulation results.

In Chapter 5, the idea of block design is proposed for initial topology design to efﬁ-
ciently minimize path length and a systematic method is proposed for network tuning to
improve performance. Design principles considered in these methods include minimizing

routing path length, exploiting trafﬁc locality, and balancing channel utilization.

Chapter 6 describes a ﬂexible and extensible simulation model, which enables experi-

ll

ments with different routing algorithms, topology design, switching techniques, and buffer
management under various network workloads.
Chapter 7 is the conclusion of this thesis, where major contributions of this research are

summarized and some interesting topics are indicated for future work.

Chapter 2

Switch Design and Switch Interconnects

Switch-based networks are becoming popular to meet the increasing demand for higher net-
work performance. To meet performance requirements of many communication-intensive
applications, good switch design and appropriate interconnection of switches are critical to
deliver demanded network performance. Many challenging issues are open to researchers
and major design considerations have been discussed in [22]. In this chapter, we review

major issues and design principles in the design of switches and switched networks.

2.1 Switching Technology

Switching technology operates at layer 2 of the 081 Reference Model. A switch makes
a relatively simple forwarding decision based on the destination MAC address contained
in each packet [23]. Switches allow bandwidth to be scaled and can alleviate trafﬁc bot-
tlenecks in traditional shared-medium LANs. Today, commercial switching products are
available for Ethernet, Fast Ethernet, ATM and FDDI.

l2

13

Switches provide many intemetworking beneﬁts. Switches economically segment the
network into smaller collision domain, providing a higher percentage of bandwidth to each-
station. Switches have protocol transparency, which allows them to be installed in multiple
protocols with little or no software conﬁguration [23]. Ethernet switches are fully com-
patible with the existing cables, hubs, and network adapters without expensive hardware
upgrades. The use of application-speciﬁc integrated circuit (ASIC) technology allows a
switch to provide greater performance by supplying high packet throughput with extremely
low latency. Therefore, it is possible to simultaneously forward packets across all ports at
wire speed. For example, a single Ethernet interface can support maximum throughput
of 10Mbps. A twelve-port ‘wire speed’ Ethernet switch, supporting twelve concurrent
streams, must provide an aggregate throughput of 12* 10:120Mbps.

Switches are designed and tuned to address LAN performance problems resulting from
bandwidth shortages and network bottlenecks. Switches solve these problems by providing

high aggregate bandwidth and low latency at a low cost per port.

2.2 Considerations on Switch Design

A high-level view of switch architecture is shown in Figure 2.1. Its function is to forward
packets from input ports to output ports. Each part is associated with a pair of input and
output channels (or a bidirectional channel). A port may connect to a node which generates
and consumes messages, connect to a port of another switch which deﬁnes the network
topology, or leave it open for a future connection. A node (or end node) can be a worksta-

tion, a multi-processor system, or a gateway to another network. To clarify our description,

14

a port connecting to another switch is called a trunk port and a port connecting to a node is

 

 

 

 

 

 

called terminal port.
Switch Board Switching Element
' Switch Module /
I
Input butters / / / output butters
‘ r I "D
——~E rte , ‘iLiiJ“*T—’
—-iED~ ‘ 4: 1:»; _.
g —~ﬂ’*rv at» a,
‘ 'Yﬁ‘q If T' l
8 *Fﬂdir' ’ is ‘1 f ’ "*9:
s i v j T I ‘ =
g r—“iilL” l L} rag
—-~WIE~ eifl‘j’ ..
—ﬁ‘IT} . LEI~~
WHEEL ,_ lntraswitch Interconnect 4] i7 D , ]_ _,
‘F Control Unit i

 

 

 

Figure 2.1: A high-level view of switch architecture

The switch fabric is based on many small switching elements. Switch modules are
built from switching elements, where a switch module is usually a physical chip. A switch
board is built from a number of interconnected switch modules. A switch consists of many
stacked switch boards, where the number of switch boards is determined by the switch
capacity, i.e., the number of ports.

When a MAC-layer frame arrives, a routing tag is added to the front of the frame after
destination MAC address is available for its routing decision. Then the frame is routed
through the switch to the output port using the routing tag. The routing tag is stripped
at the output port and the frame is delivered to a destination host or another switch. An
internal packet includes a routing tag, MAC-layer header and trailer, and payload data.

There are some input FIFO buffers for packets which cannot be delivered immediately

15

because of port/channel contention in the switch. Some output FIFO buffers are available
to accommodate network links of different speed. Shared buffer pool could be applied in
switch design. Cut-through switching may pipeline a packet through a sequence of switches

without fully buffering the packet at each switch.

2.2.1 Nonblocking vs. Blocking

Nonblocking switch guarantees that any routing request to any free output port can be
supported without interference with other packets being routed. Nonblocking switches
have dedicated bandwidth for each port and is helpful to guarantee the quality of service

for some applications.

Nonblocking switch is also important to inter-switch deadlock freedom when transmit-
ting data through several interconnected wonnhole switches without buffering messages
between switches. For a network environment which only requires intra-switch deadlock
freedom, blocking switches may well meet the performance requirement. Nonblocking

switches help in an environment which requires inter-switch deadlock freedom.

2.2.2 Scalability and Incremental Expandability

A scalable switch means that the same switch architecture can be applied when higher
performance is demanded from the switch. Therefore, bottlenecks existing in the switch
architecture will severely degrade its scalability. For example, design based on a hi gh-speed
bus is not scalable. The bandwidth of a bus is constrained by some physical limits. The bus

becomes a bottleneck when the bandwidth for each port needs to be increased beyond some

16

point. Design based on shared memory is not scalable either because the memory latency
and bandwidth of dual-ported memory module limits the performance of the switch.

The scale of a switch tends to grow with the increasing number of hosts connected to
it. To achieve good expandability in switches, modular design is usually followed [25]. A
switch is constructed from switch boards which are also built from some building modules.
Ideally the minimum requirement to expand a switch to the next allowed conﬁguration is
a single switch board. For example, if a switch board has four input and output ports, it
is more economic to expand the switch by adding four ports, that is one switch board at a

time rather than having to add at least eight or even more extra ports.

2.2.3 Switching Techniques

Switches forward trafﬁc based on one of two forwarding models:
0 Store-and-forward switching
o Cut-through switching

As we mentioned earlier, a store-and—forward switch reads and validates the entire
packet before initiating the forwarding process. This allows the switch to discard cor-
rupted packets and permits the network manager to deﬁne custom packet ﬁlter to control
trafﬁc through the switch. The disadvantage of store-and-forward switching is that latency
increases in proportion to the size of the packet.

A cut-through switch (virtual cut-through or wormhole) starts the forwarding process
immediately upon receipt of enough header information to make a routing decision. Since

the switch only has to read the destination MAC address before it begins to forward a

17

frame, packets are processed faster and latency is at the same low level for both short and
long packets. The results is a dramatic reduction in the network latency over the traditional
store-and—forward switching techniques under light to moderate trafﬁc. The major disad-
vantage of pure cut-through switching is that corrupted frames are also forwarded by the
switch. However, this should not be a big problem in LAN 3 where transmission error rate

is extremely low.

Although cut-through switching techniques have the potential to reduce network la-
tency, most existing local and wide area networks employ packet switching to reduce ﬂow-
control requirements, simplify routing algorithms, and simplify transmission over a range
of different channel bandwidth. Still, wormhole and virtual cut-through switching are vi-
able options in more homogeneous, tightly-coupled domain, such as system network and
high-speed switch designs [26, 27]. Research projects like Shrimp [28] and Atomic [29] as
well as commercial products like Sun Microsystem’s S-connect [9] and Myricom’s Myrinet

[7] have been the results of this paradigm.

2.2.4 Buffer Architecture and Channel Arbitration

Buffers in a switch can be organized in various ways: input, output, cross-point, and shared.
FIFO queues or non-FIFO buffers can be used to manage buffers. Input FIFO queuing is
the simplest but worst performing architecture. In such a case, only the front packet of each
input queue is scheduled for possible routing, which is called head-of-line (HOL) blocking
and causes a limited throughput due to output contention. In [30], the performance of

nonblocking switches with FIFO input buffers is surveyed. The analysis shows that the

18

maximum throughput of the nonblocking switch with FIFO input buffer is 58.6% [31, 32].

One approach to improve the switch performance of is to reduce or eliminate the HOL
blocking. Non-FIFO input buffering architecture provides better performance than input
queuing [33]. Use of output buffers or a shared buffer improves the throughput by elimi-
nating the HOL blocking. However, these solutions are complicate to implement in practice
and usually can only handle ﬁxed size frames (packets) [34]. In output queuing, each buffer
is associated with one outgoing link, and must be able to accept, in the worst case, packets
arriving simultaneously from all inputs [35].

If more than one incoming packets compete the same outgoing channel, the arbitra-
tion scheme coordinates access amongst competing packets. Round-robin arbitration is
a common demand-driven approach to divide bandwidth fairly among incoming trafﬁc.
Priority-based schemes may improve performance by favoring longer queues or packets

with stricter delay requirements [36].

2.2.5 Other Issues

Besides the above discussed issued, many other important issues need also be studied and

some of them are addressed in the following:

Deadlock-Free Routing

Deadlock is a well known issue in wormhole switching with back pressure ﬂow control.
Because a message can hold some channels and wait for another channel, deadlock may oc-
cur if there is a cycle in the channel dependency graph [16]. Deadlock will cause messages

to be blocked in the network forever. Therefore, it is important that the routing algorithm is

19

deadlock free. In practice, even if the routing algorithm is deadlock free, deadlock may still
occur due to some hardware or software errors, such as a lost trailer signal. A long-period
timeout to force a packet out of the network can be used to avoid deadlocks completely for

such rare cases [37].

There are two levels of deadlock freedom in a switch-based network. One guarantees
that within a switch the routing is deadlock free, i.e., if the whole network is connected to
a single switch, deadlock never happens. This may be referred to as intra-switch deadlock
freedom. The other guarantees that the switching in a network which is interconnected
by more than one switch is deadlock free. This may also be referred to as inter-switch
deadlock freedom. To achieve inter-switch deadlock freedom, careful considerations need

to be made on switch design, network topology and routing algorithm.

Link-level Back Pressure Flow Control

Link-level back pressure ﬂow control is used in wormhole switches. When the input or
output buffers in a switch are nearly full, back pressure is generated. This back pressure
can be propagated to the sending source to throttle it from sending more data when a packet
is blocked. Thus, there is less packet loss caused by buffer overﬂow if this mechanism is
utilized. In a local environment, most packet loss is due to buffer overﬂow instead of
transmission error. Packet loss is generally handled by hi gh-level network protocols which
involve timeout and other overhead. Therefore, both network throughput and latency will

beneﬁt from less packet loss due to wormhole switching with link-level ﬂow control.

20

Packaging and Layout Issues

For ease of maintenance and expandability, the switch design must follow a modular ap-
proach. There may be many hierarchies of modules, from chips to switch boards. How
to efﬁciently design 2D layout on a board and stack switch boards in 3D is a critical engi-
neering issue. When a switch is expanded or reconﬁgured due to fault, it is desirable that
minimum number of rewiring is required. Packaging of a switch decides the physical space
consumed, wiring complexity, wire length, whether a switch is stackable, easy to conﬁgure

and maintain, and ready for expansion [24].

Hardware Multicast

Multicast communications are critical in teleconferencing, video-on-demand, distributed
computing, and so forth. Conventional shared-medium networks have a very natural way
to support multicast/broadcast communications. In a switch-based network, however, each
switch must possess multicast capability. Hardware multicast in networks based on cut-
through switching with back pressure ﬂow control is not a trivial task because it is very
difﬁcult to prevent deadlock. A multicast capable cut-through switch design was proposed

in [24].

Fault Tolerance

It is likely that one or more of the building modules of a switch are faulty sometime. A fault
tolerant switch either can continue to function well regardless of the faulty module(s), or
is able to function well by making minimal reconﬁguration to the existing boards, without

adding new good boards. We call the former on-line fault tolerance and the latter board-

21

level fault tolerance. On-line fault tolerance can afford faulty boards without loss of service.
Board-level fault tolerance may lose some service upon failure but allows quick and easy

recovery.

2.3 Example Switches

As we mentioned before, many switches have been commercially available. In this section,

we review the design characteristics of some commercial switches.

Myrinet Switch

Myrinet [7] supports cut-through switching in arbitrary local network topologies. It was
developed from two research projects: Caltech Mosaic [38] which was a multicomputer
and USC/ISI ATOMIC [29] which was a high-speed local area network built using Mosaic
components.

A Myrinet network consists of a number of Myrinet switches and host interfaces.
Myrinet can be both a LAN of relatively small physical diameter and a System Area Net-
work (SAN) that clusters computer components. A Myrinet link has a full-duplex pair of
Myrinet channels. Each channel carries a sequence of 9-bit characters. A SAN channel
operate at 1.28 Gb/s up to 3 meters. A LAN channel operate at 1.28 Gb/s up to 10 meters
or at half this speed up to 25 meters. The core of a Myrinet switch is a pipelined crossbar
for cut-through switching. A custom VLSI CMOS chip that implements link level pro-
tocol and ﬂow control have SAN ports. Another custom VLSI COMS chip can translate

SAN ports to LAN ports. Flow control is done on each physical channel with STOP and

22

GO control characters and ”slack buffer” at the receiver. A slack buffer has 80 bytes. A
STOP control character is sent to the sender when the slack buffer begins to be ﬁlled with
more than 48 bytes and a G0 control character is sent to the sender when data in the slack
buffer drops to less than 32 bytes. An 8-bit CRC character is computed on each packet.
Low-latency and low-error-rate communication is achieved in Myrinet with its cut-through

switching, ﬂow control and error control.

A Myrinet Control Program (MCP) is running on each host interface. MCP interacts
with host to send packets from the host to the network or deliver incoming packets to the
host. MCP also maps network addresses to routes. The routing algorithm is designed to be
deadlock free. Up*/down* routing [39] is currently being used. Deadlock due to hardware

failure is guarded against with a long timeout of 50ms.

ServerNet Switch

ServerNet [40] is a SAN which integrates I/O transfers and inter-processor communication
within the same network. Scalable 1/0 is achieved by decoupling processor buses from
I/O buses with an interconnection network between processors and I/O components. The
basic components of ServerNet include six-way crossbar routers, dual-ported processor
interfaces on a PCI card and low-cost peripheral-device interfaces. The addressing and
routing provided by ServerNet can support communications between CPUs, from CPUs
to I/O devices, or between I/O devices. ServerNet allows reliable, scalable and powerful

computing systems to be built using clustered commodity components.

A ServerNet router has FIFO buffers on the inputs, logic for arbitration and ﬂow con-

23

trol, a routing table and a crossbar switch. Cut-through switching is implemented by the
routers to provide low latency communication. Link-level ﬂow control prevents packet
loss during network congestion. Six-port switches allow various network topologies such
as hypercube, mesh, and tree. Routing is based on table lookup. The 1024 entry routing
table maps the destination address in the packet header to the output port. Deadlock free-
dom is achieved by routing table entry design. Conﬁgurations with circular dependencies
are disallowed. New topologies called “fractahedrons” [10] were proposed for deadlock

avoidance in ServerNet.

Ethernet Switches

Ethernet has been the most popular local area network for more than a decade. As more
and more communication intensive applications appear, tradition shared-medium Ethernet
is approaching bottle status. To achieve higher bandwidth, Ethernet switching hubs are

being deployed to replace traditional shared-medium Ethernet.

There are generally three switching modes in an Ethernet switch: fast-forward (cut-
through), fragment-free, and store-and-forward. Store-and-forward buffers a whole frame,
makes an error check, and forwards the frame. Invalid frames are ﬁltered out. To cut
switching latency, a packet can be partially received before it begins to be forwarded. In
fast-forward, a frame is switched as soon as the destination address or a few more following
bytes are received. Corrupted packets may be propagated through the network since they
are forwarded before frame errors can be detected. Fragment-free switching buffers at least

64 bytes of data before switching is made. Buffering 64 bytes of data allows Ethernet

24

collision fragments to be ﬁltered out. Collision may happen often in a heavily loaded
network, such as when a port is shared by many host interfaces. A good Ethernet switch
design should be able to select the switching mode adaptively based on the network trafﬁc
conditions.

In a client/server computing environment, a server is shared by many clients and re-
quires higher network bandwidth than a client. This can be achieved by using switches
which support both Ethernet and Fast Ethernet ports. A server can be connected to a
Fast Ethernet port for higher bandwidth. When switching a frame from an Ethernet port
(10Mb/s) to a Fast Ethernet port (100Mb/s), store-and-forward is necessary to adjust the
speed difference of input and output ports. When switching a frame from a Fast Ethernet
port to an Ethernet port, a frame can be partially received before being switched.

Back pressure mechanism can also be applied in Ethernet switches to keep the sending
source from delivering more frames to an overloaded port. A congested port can fake
collision-detection frames which let the sending source think that collision is occurring and
back off its attempt to send a frame according to CSMA/CD protocol [8].

Each Ethernet port can support multiple MAC addresses which may correspond to in-
dividual workstations, repeaters, or switches. Address learning can be done by automati-
cally learning the source address of frames received on each port. Address table can also
be manually maintained. Old addresses not in use are aged out from the address table.
Spanning-Tree protocol (part of IEEE 802.1d standard) is generally used to interoperate
between compliant bridges and switches. Spanning tree allows networks to dynamically
maintain a loop-free topology and establish redundant paths in event of lost connections.

Some Ethernet switches allow virtual LAN s to be built. Ports on the switch are grouped

25

into separate logical networks. Virtual LAN limits broadcast trafﬁc and offers higher secu-

rity.

2.4 Design Considerations on Switch Interconnects

The interconnection of those switches deﬁnes various network topologies. Regular net-
work topologies do not have good incremental scalability due to the stringent intercon-
nection constraints and may not be able to use the original routing schemes with faulty
nodes. Therefore, a natural approach is to allow arbitrary switch interconnects like Autonet
[41], Myrinet, and ServerNet. Arbitrary switched networks are truly incrementally scalable
and have potential to be reconﬁgured to adapt to the dynamics of network trafﬁc. Many
challenging issues need to be studied in a switched network. Some major tasks include
routing and ﬂow control, topology design and incremental reconﬁguration, trafﬁc pattern,

performance evaluation, reliability, scalability, and fault tolerance.

2.4.1 Routing Algorithms

Routing plays an important role in a switched network. Routing decision is usually im-
plemented based on table lookup or on a ﬁnite state machine. An important engineering
issue is to make the routing decision as fast as possible. In this dissertation, we focus on
routing in wormhole switched networks. A major difﬁculty in constructing a large-scale
network with wormhole switches is to avoid deadlock among simultaneously transmitting
frames. Arbitrary switch interconnects make the problem even harder. Three philosophies

have been used to develop deadlock-free routing algorithms. One is based on deadlock

26

prevention which never allows the formation of a cycle and is quite conservative. Another
approach allows the existence of cycles [42, 43], but does provide a path to escape from
the cycle. The third and more aggressive approach is based on deadlock detection [44, 45].
Practical, convincing, and meaningful quantitative evidence is needed to demonstrate the
superiority of one approach over other approaches.

For regular topologies like k—ary n-cube, many deadlock-free routing algorithms have
been proposed in literature. For arbitrary switched networks, the ﬁrst known deadlock-
free routing algorithm is probably the up*/down* routing used in the DEC AN 1 system
[41]. A similar approach is used in the Myrinet. In the up*/down* algorithm, a breadth-
ﬁrst spanning tree is constructed from a selected root. Based on the direction to the root
node, the network is divided into two acyclic subnetworks (up and down). When a packet
enters the down subnetwork, it will never enter the up subnetwork to ensure cycle free. The
algorithm is simple and easy to understand, but it may introduce severe trafﬁc congestion
near the root and allow a routing path as twice long as a shortest path.

Some other routing algorithms have been proposed as well. Adaptive-trail routing
(ATR) was proposed in [21] to allow deadlock-free adaptive routing in arbitrary switched
networks. The basic idea of ATR is using an Eulerian trail as a ‘base line’ to maintain chan-
nel dependency order and allow reasonable routing. To minimize the path length, shortcuts
are added to the Eulerian trail. The smart routing (SR) [46] created a channel dependency
graph from the network topology and broke dependency cycles one by one, which mini-
mized a heuristic cost function based on average path length. The routing is represented
by the ﬁnal non-cyclic channel dependency graph. The trafﬁc balancing, which assumed

uniform trafﬁc, depends on a linear programming solver and is the most time consuming

27
part.

Graph partitioning is applied in [47] to avoid deadlock. To evaluate it in arbitrary
networks, a simple and efﬁcient implementation has to be enforced, which may not be
intuitive. A recent research [48] proposed a general methodology for the design of adaptive
routing algorithms in arbitrary topologies. In this method, all physical channels are split
into two virtual channels: ‘original channels’ and ‘new channels’. For ‘original channels’,
a deadlock-free routing algorithm like up*/down* is deﬁned. A message can only leave
source from ‘new channels’, but it can use ‘original channels’ at intermediate switches
if none of ‘new channels’ is available. Once a message is transmitted on any ‘original

channel’, it cannot go back to ‘new channels’ any more.

2.4.2 Topology Design and Incremental Reconﬁguration

The design of an appropriate network topology can be a complex undertaking and the
conﬁguration of a trafﬁc-balanced network is not an easy task. Most of existing research
efforts have been spent on optimal topological design [17, 18, 19, 20], which was based
on some assumptions including known trafﬁc distributions and homogeneous environment.
Although these algorithms do have important contributions for topological design, they
may not be applicable to real network environment due to unrealistic assumptions and
very time-consuming algorithms. A systematic method [49] should be applied in topology
design, which concerns some basic parameters. The key parameters are the number of
switches, the number of ports per switch and the bandwidth of each port, the required

number of host computers, the estimated trafﬁc distribution, and the average path length.

28

When designing a network topology, a primary consideration should be minimizing
diameter and the average path length. Longer paths mean more network resources are
consumed per packet sent. In addition to minimizing average path length, there should be

no obvious bottleneck or small bisectionl in the network.

Network planners need to be responsive to the demands of client/server trafﬁc. A server
contention may become a main bottleneck because many clients communicate with the
server. A key part of scaling network performance is to eliminate bottlenecks to servers.
Based on the server’s throughput capacity, dedicated bandwidth is needed to release server
bottleneck. Trafﬁc locality is an important trafﬁc characteristic in LAN and we should
maximize trafﬁc locality to beneﬁt intra-switch trafﬁc. If a signiﬁcant fraction of trafﬁc
ﬂows among a set of end computers (i.e., a workgroup), we should consider to place them

on the same switch if possible, or on two switches connected by at most one hop.

To provide good network performance, a key point is to balance trafﬁc ﬂow among
channels. It is important to know the real or estimated trafﬁc distribution. For a partic-
ular topology, given routing algorithm and estimated trafﬁc distribution, trafﬁc ﬂow on
each channel and the maximum attainable aggregate ﬂow can be calculated with a multi-

commodity ﬂow solver. This solution will also identify bottleneck channels.

After a network is operational, network tuning may be needed to accommodate network
changes and improve network performance. Many situations may be changed in a network;
however, network reconﬁguration should be incremental and not involve major changes if

possible. Therefore, one practical question is how to make a minor change to a network to

 

lBisection is deﬁned as the minimum number of channels to be removed in order to cut a network into
two disjoint subnetworks with equal number of switches in each subnetwork.

29

obtain a signiﬁcant improvement in performance.

The ﬁrst step is to diagnose what causes the problem. If the real trafﬁc distribution is
much different from the estimated trafﬁc distribution, the topology and routing algorithm
may need to be redesigned. If the average path length of routing is much longer than
the average path length of the base topology, chances are that experimenting with some
additional routing strategies might improve the performance signiﬁcantly. If neither the
routing nor the base topology is the problem, additional resources may need to be added.
If there are spare ports on any switches, additional channels might be added to yield a
performance increase. In general, additional channels should be added between switches
that are a maximum distance apart (taking some long paths and making them shorter), or

between switches that have few channels attached already.

If adding one or more channels is not possible or does not provide adequate perfor-
mance improvement, one or more switches can be added. To add a switch to an existing
network, it may be necessary to connect a number of channels to switches that are as widely
spaced (with respect to path length) as possible. We may consider to redistribute some end
nodes to the new switch. When adding channels or switches, the routing should be recalcu-
lated from scratch. Such a change has a profound impact on the path length and potential
deadlock in the system. Since most of these switches have conﬁgurable routing tables, this

yields a practical solution.

30
2.4.3 Other Issues

Topology design will be further complicated by topological constraints (such as reliabil-
ity or physical layout constraints) upon the network. To consider reliability, a traditional
method is to make more than one switch-disjointed path available between each pair of

sources and destinations.

It is clear that efﬁcient multicast communication support is critical to the performance of
many applications. A recent paper [50] proposed heuristic multicast algorithms in switched
networks, which assumed the underlying routing algorithm is up*ldown* routing. While
efﬁcient multicast communication is becoming very important and obtaining much atten-
tion, multiple multicasts and the effect of background trafﬁc should be considered as well.
Since the performance of multicast algorithms are various on different platforms, a prac-
tical approach is to consider some critical machine-speciﬁc parameters in the design of

platform—independent algorithms as proposed in [51].

In large scale networks, the probability of having a faulty channel or switch is relatively
high. An intuitive solution is to simply shutdown the system and replace the faulty ele-
ments. In addition to some simple solutions like redundancy, a practical and useful scheme
is demanded to automatically detect faulty elements, handle dynamic faults, properly deal
with transmitting messages, isolate faulty elements for live replacement, and avoid routing

deadlock [1].

Flow control determines the amount of packets that can be injected into the network in
order to avoid congestion. As indicated in Figure 2.3, without proper ﬂow control, the net-

work throughput will decrease after reaching its maximal. In addition to local ﬂow control

31

like link-level back-pressure in cut-through switching, there also needs global ﬂow con-
trol (e.g., end-to-end ﬂow control) to limit the total number of packets that can be handled
by network simultaneously. If ﬂow control mechanism works properly, further input to

network should be stopped well before the network is over-saturated.

2.5 'Ii'afﬁc Characteristics and Measurement

To design an appropriate network, we should understand the trafﬁc characteristics in LAN 8.
LAN trafﬁc analysis can be done by collecting LAN trafﬁc and some common characteris-
tics have been observed in previous work [52, 53]. In the following, we summarize some

important characteristics.

o Client/server trafﬁc: every server provides services to a number of clients and there
may be more than one server in a LAN. Therefore, messages generated from a node
may not be uniformly destined to the other nodes and messages received by a node

may not be uniformly originated from the other nodes.

0 Workgroup: the existence of workgroups means that a large portion of trafﬁc is 10-
calized among relative small workgroups of nodes. Such a feature is called ‘traﬁ‘ic

locality’.

o Bursty trafﬁc: the burstiness may cause congestion within a network resulting in poor

overall network performance.

Network planners need to be responsive to the demands of client/server trafﬁc, the 10-

cality of trafﬁc, and the challenge of trafﬁc burstiness. A server usually has much more

32

amount of trafﬁc than a client so that dedicated bandwidth is needed to release server bot-
tlenecks. Trafﬁc within a workgroup is usually very high, nodes within each workgroup
should be placed in a same switch (if possible) to maximize trafﬁc locality. To deal with
burstiness, we have to compromise the trafﬁc rate between the average rate and the peak
rate. If a network is designed based on peak rate, it may cause an excessive cost. On
the other hand, if network design considers only average rate, it may be under-engineered.
How to balance between peak rate and average rate is still an open issue.

The knowledge of trafﬁc distribution is very helpful for topology design and tuning.
There are two methods to obtain trafﬁc distribution in a network. Before a network is op-
erational, the only method is based on a good understanding of network applications and
communication demands in the future network. For example, we may be able to ﬁgure out
workgroups, servers and their clients. However, in general, trafﬁc distribution is hard to be
estimated accurately a priori. Thus, network tuning and reconﬁguration may be needed to
improve network performance after the network is operational. After the network is oper-
ational, trafﬁc distribution can be measured by monitoring the total number of transmitted
packets over the network, the size of each packet, and the source and destination infor-
mation associated with each packet (e.g., SNMP management [54]). Note that measured
trafﬁc rate should consider both average rate and bursty rate.

Many LAN switches support standard-based SNMP management. For example, both
HP’s AdvanceStack Assistant [55] and 3Com’s Transcend Enterprise Manager [56] pro-
vide SNMP-based network management. An SNMP agent is a program running on the
managed device and collecting information about device operations. For example, if the

object is a switch, the agent can collect information about network trafﬁc passing through

33

each port within the switch and information about the behavior of the switch itself under
different load conditions. The SNMP agent maintains a database called the Management

Information Base (MIB). The agent uses the MIB to track and systematically update data.

2.6 Performance Metrics and Evaluation

From an application program’s point of view, the most important performance metric of
a network is communication latency, which is the time interval between the instant that a
send command is issued until the instant that the message is completely received by the
recipient. The metric communication latency consists of network latency (including the
blocking time due to unavailable output channels) and software latency, which is software
overhead introduced by the sender and receiver.

From the network point of view, an important metric is network throughput which is de-
ﬁned as the total amount of trafﬁc transmitted in the network per unit of time. The sustained
network throughput is dependent on the communication pattern, message size distribution,
and the rate at which each node is able to inject messages into the network. Due to the
limitation of network contention and output contention, the maximum sustained network
throughput is usually much smaller than the maximum theoretical network throughput.

Fairness is also an important factor in network design although it is often ignored [57].
Without some global fairness mechanism, switched networks can be naturally unfair, ser-
vicing some injection ports better than others, especially when the network workload is
heavy. For example, consider a simple network shown in Figure 2.2, where the center

switch 0 is connected to 10 nodes and the two end switches 1 and 2 are connected to 11

34

nodes, respectively. Let us assume all nodes are injecting as fast as they can, the desti-
nations are uniformally distributed and the center switch 0 is fair to all of its input ports.
From our simulation, we observe that each node on switch 0 injects about 1 1 times as much
trafﬁc as the node on switches l and 2. The aggregate throughput is only 9% of the maxi-
mum throughput, but it reﬂects almost entirely trafﬁc originating at the central switch. The
reason is because that the center switch treats the trafﬁc from 11 nodes on each end switch

in the same way as it treats the trafﬁc from each node on the center switch.

fei‘i%%>% 9:"

Figure 2.2: A network with three switches and two channels

 

After we restrict the injection rate of each node in the center switch, the network be-
haves more fairly and the aggregate throughput is higher. The fairness can be signiﬁcantly
impacted by the topology, routing, and trafﬁc distribution in a network. For instance, a
fully-connected topology of three switches (i.e., a triangle) under uniform trafﬁc will show

no fairness problems of the sort exhibited by the network in Figure 2.2.

Besides message latency, network throughput, and fairness, there are also other metrics

like reliability and fault-tolerance. However, we will not address them in this dissertation.

For a given network workload, the performance of a switched network is best rep-
resented by its latency-throughput curve as shown in Figure 2.3. This ﬁgure shows that
the network throughput can reach up to 19% of the network’s maximum throughput. The

average network latency increases when the throughput increases. However, when more

35

messages are injected into the network after the network has reached its maximum sustain-
able throughput, the network throughput will decrease while the average network latency
will continuously increase, creating the so—called over-saturated or unstable region. In such
a situation, fairness may become a major concern; some ports will essentially starve while
others will inject more trafﬁc. In such a case, the aggregate throughput may not be enough
to indicate the actual network performance. Thus, fairness ratio is introduced as the ratio
of the maximum trafﬁc generated in a node over the minimum trafﬁc generated in a node.
When fairness ratio is over 2, the network is considered in a unfair situation and we mark

such a performance point with a shaded rectangle in Figure 2.3.

E: uniform traffic and fixed size messages
8000 . T . .

7000 L .\ .
6000 - 1 ~
5000 ~ in .
4000 - i -
3000 - .35 ~
2000 L ”313”" -

---El"
”am-8'”?
r m r

 

Average Latency

 

 

1000 J—""B"P'”E’"

0.1 0.12 0.14 0.16 0.18 0.2
Network Throughput (bytes per time unit)

 

Figure 2.3: An example of latency/throughput performance curve

Both analytic models and simulation results can be used to evaluate network perfor-
mance. By assuming the knowledge of network workload and applying multi-commodity
ﬂow model and queuing theory, a good analytic model may indicate message latency and
bottleneck channel and obtain optimal or sub-optimal solutions for network design. How-
ever, due to the dynamic nature of networking environments, analytic modeling is unlikely

to give practical insight to the network behavior and performance. Simulation experiments

36

should be conducted to evaluate the performance. But we can hardly trust the simulation
results as well. For example, most of the network simulation studies do not consider soft-
ware latency and assume an inﬁnite input source. This may lead researchers to investigate
something that may never (or very rarely) occur in an actual network. Even worse, most of
the models used to characterize the network workload are far from realistic. Trace-driven
or execution-driven simulation based on real applications should be considered [58]. A set

of benchmark applications covering different communication patterns has to be identiﬁed.

Chapter 3

Performance Evaluation of Switching

Techniques

Many researchers have studied performance evaluation of interconnection networks. Store-
and-forward packet switching networks were described as a M -channel, N -node model and
average packet latency were analytically studied [59, 60]. Performance evaluation of cut-
through switching began with the work in [2], which presented a mean-value analysis of
end-to-end latency under deterministic routing for virtual cut-through switching, derived
from M/M/l model for packet switched networks. The extensions of this M/M/l analysis
can be found in [61, 62, 63, 26]. In [64], an asymptotic performance analysis was proposed
for packet-switching multistage interconnection networks. The delay analysis of k-ary n-
cube direct networks was studied in [65, 66] for wormhole switching. However, these
analytical models do not consider arbitrary topologies.

Both analytical models and simulation experiments have been presented in literature
for the purpose of performance evaluation. Effective analytical modeling provides a

37

38

cost-effective way to predict the behavior of large networks and help balance the cost-
perforrnance tradeoff in switch-based networks. However, analytical models usually re-
quire simplifying assumptions about the network conﬁguration and trafﬁc patterns, for the
sake of tractability. Simulation study can characterize the effects of these assumptions to
determine and show insight network behavior, especially when analytical models cannot
accurately estimate real performance.

This chapter compares performance of switch-based networks for different switching
modes and buffer architectures, using both analytical models and simulation experiments.
Analytical models based on MIMI l queuing theory are presented for average packet delay
analysis under a certain assumption like deterministic routing and inﬁnite buffer size. For
cut-through switching, a packet can cut through a switch if its outgoing channel is idle,
which is reﬂected in the analytical model by the deduction of the time saved by cut through.
The analytical models have been veriﬁed by simulation results. As we mentioned earlier,
analytical modeling needs simplifying assumptions and may not be able to give accurate
estimation under various network environments. Therefore, simulation experiments are
conducted to evaluate performance under various network environments which are hardly

to be accurately estimated in the analytical models.

3.1 Analytical Model

Our goal is to address an analytical model for average packet latency in a switch-based
network. A basic queuing model of computer networks can decompose the interconnec-

tion network into a collection of independent channels and queues. The packet delay in

39

store-and-forward packet-switching networks has been expressed using the MIMI 1 model
in terms of source-destination paths through the network [60]. An arriving packet will be
forwarded in a switch based on switching modes. Various topologies, algorithms, and traf-
ﬁc distributions differ in how they affect the amount of trafﬁc on each channel and therefore

the delay waiting on the channel.

3.1.1 Delay Analysis

The average packet latency T for a computer-communication network has been deﬁned as

follows in terms of the average source-to-destination packet delay [59]:
T: 231212,,- (3.1)
7

where Z,”- is the average packet delay (queuing and transmission) from source 2' to destina-
tion j, 7,-0- is the average packet rate from i to j, and 7 = Z 710-
By applying Little’s result, which relates the average number in the system to the aver-

age arrival rate and the average service time, a useful expression was given in [60, 67]

A
T = Z 7’51). (3.2)

where A), is the average packet rate on channel 10 and Tk is the average queuing plus trans-
mission delay on channel k. Given this useful but too general expression, we may not not
able in general evaluate A], and Tk. However, as indicated in [17], if we make the following

assumptions, the network channels can be considered as independent M/M/l queues by

40

Jackson’s theorem [68].

1. Poisson packet arrival distribution

2. exponential packet length distribution
3. inﬁnite buffer space

4. ﬁxed (deterministic) routing

5. error-free channels

6. no switch delay

7. independence between interanival times and transmission times on each channel

Therefore, the average delay T), on channel It can be evaluated as follows based on
M/M/l queuing theory:

1

T=——-—-——-—
k ,u'Clc-Ak

(3.3)

where Up is average packet length (bytes/packet), Ck is capacity of channel I: (bytes/s),
and Ak is average packet rate on channel k (packet/s). The average rates Ak’s can be com-
puted from routing tables and the trafﬁc distribution matrix. By plugging Equation ( 3.3 )
into ( 3.2 ) and letting f], be ﬁAk, i.e., the average bit rate on channel k (bytes/s), we have

the following equation for average packet delay T:

 

1 f.
____ 3.4
T VZCk-fk ( ’

41

Considering propagation delay and switch processing delay, a more accurate result can

be expressed in the following:
1
T = 3 E AMT]; + Pk + Sic] (3.5)

where P], is the propagation delay (second) on channel It and S1, is the processing time
(second/packet) in the switch at which channel I: terminates.

Although more accurate expressions may be obtained by considering more general dis-
tributions, the validation of the above assumptions and MM] delay analysis has been
tested and veriﬁed through simulations and measurements by many authors in a variety
of applications [69, 59]. For packet-switching networks, M/M/l theory has been applied
for delay analysis, capacity assignment, ﬂow assignment, and topology design [60]. An
application example is ARPANET design and performance evaluation. Furthermore, the
solution of some of the design problems seems to be rather insensitive to the introduction
of additional details in the delay formula [17]. In this paper, without loss of generality,
we limit our considerations to the delay formula in ( 3.4 ), which means that we do not

consider propagation delay and switch processing time in this chapter.

3.1.2 Analytical Model for Store-and-Forward Switching

Given the network topology, the ﬁxed routing algorithm, and trafﬁc distribution, we are
able to calculate trafﬁc rate f), on each channel and the total trafﬁc rate 7, and therefore,
the average delay in ( 3.4 ). However, ( 3.4 ) was proposed for the networks where only

one processor connects to a switch/router. In switch-based networks, a packet has to com-

42

pete with other packet for same destination node even in the ﬁnal destination switch. So
the packet delay should consider this waiting time, resulting in a modiﬁed packet delay

formula:

 

—1 ff _£_
T3_7(ZCk-fk+ZCL—f5) (3.6)

where C,’, and f,’, are the port bandwidth and the average trafﬁc rate to node v. T, means
delay in store-and-forward switching.

Under uniform trafﬁc, Equation ( 3.6 ) can be further simpliﬁed by using average chan-
nel utilization and average routing path length. To make it even easier, we consider a
homogeneous environment, where the bandwidth for all switch ports are same (represented

by C bytes/s). For the purpose of analysis, we introduce the following notations:

N: total number of nodes in the network.

M: total number of bidirectional channels in the network.

6: mean of packet length (bytes/packet), exponential distribution.

19”,”: routing path length from node u to node v

6: average path length of packet routing.

A: packet generation rate (packet/sec), Poisson distribution.

pk: utilization of the kth channel and pk = fk/C'k.

p: average channel utilization

43

o {0: utilization of node v

o g : average utilization of node bandwidth

0 T3: average packet delay in store-and-forward switching.

0 Tc: average packet delay in cut-through switching.

Since trafﬁc distribution is uniform, we can calculate the average path length 6 as fol-

lows.

 

2 23191.,” (3.7)

p = number.o f -nodemairs

The average channel utilization is expressed under uniform trafﬁc as follows:

MpN
” — me (33’
The trafﬁc arrival rate in each node should also be uniform, which results in
M
= ,, = — 3.9
6 r C < )

Therefore, given average path length i, average channel utilization p, average packet

length Z, and average node bandwidth utilization f, the average packet latency is

T. = __ + __ (3.10)

Comparing ( 3.6 ) and ( 3.10 ), the former one is more general to handle different

44

network environments, while the later one is easier to calculate. Note that ( 3.10 ) can only
be applied for a homogeneous network, where trafﬁc are uniform and channels are evenly
utilized. It may not be able to accurately estimate latency if trafﬁc is non-uniform, or the

selected topology and routing algorithm cannot evenly utilize channels.

3.1.3 Analytical Model for Virtual Cut-Through Switching

For store-and-forward switching, we have analyzed the average latency using MM 1 queu-
ing theory. For cut-through switching, a M/M/l model is still applicable; however, we
should reduce the packet buffer time whenever the packet is cut through an intermediate
switch. Remember that we have assumed inﬁnite buffer space in order to use MM 1 queu-
ing theory, which means that it is virtual cut-through here. Since the network channels are
independent, a packet traveling h hops has h — 1 independent opportunities to cut through
intermediate switches, resulting in a binomial distribution for the number of cut-throughs
[17]. Based on this binomial distribution, analytical models can calculate average latency

for cut-through networks. From ( 3.6 ), we have

 

 

1 _f___k(1,, f'(12)
T,=;[2Ckf_“ fl. —ZPfak ——:+Zc,f_” f, —ZRﬁv——C: ]

(3.11)

where P: is the cut-through probability for a packet forwarded to channel [0; R3 is the
cut-through probability for a packet forwarded to node v; 01,, is the percentage of f), which
has opportunity to cut through the corresponding switch; 0,, is the percentage of f,’, which

has opportunity to cut through the corresponding switch; and e is the header length for a

45

packet (included in the packet). For the ﬁrst hop, none of the packets has the cut-through

opportunity to transmit to next channel or node, which is why we use a), and ﬂu.

From ( 3.10 ), we have

"6 Z— —
Tc=_p___PC(ﬁ—1)——£+—€—_RCIB€C_€

0(1 - p) C 00 —a> ‘3'”)

where 6 is the average percentage of opportunity for a packet to be transmitted to a node
in a cut-through way. This eliminates packets whose source and destination switches are

same. Here C is channel bandwidth (bytes/s). Since we assume ﬁxed routing, we have

Pf=1—p,,anng=1—§,.

3.1.4 Cut-Through Probability

The cut-through probability depends on channel utilization and the number of routing op-
tions. For ﬁxed routing, PC" can be simply expressed as 1 — pk. For adaptive routing, it
depends on the number of routing options available for a packet. Suppose average chan-
nel utilization is p and P,- is the probability for a packet to have i routing options, the

cut-through probability can be calculated as follows:
PC, = (1 — m1?1 + (1 — mp2 + (1 — p3)P3 + . -- (3.13)

We just calculate cut-through probability under ﬁxed routing algorithms. Under an
adaptive routing algorithm, we may be able to derive an expression for average cut-through

probability in some symmetric topologies with the assumption of uniform trafﬁc distri-

46

bution. As an example, cut-through probabilities for adaptive routing algorithms in 2D
meshes were analyzed and compared under uniform trafﬁc [26]. However, for switch-based
networks, we may not be able to calculate average cut-through probability even under uni-
form trafﬁc. Because the topology is very likely to be asymmetric, adaptive routing may
result in different routing options in different switches and channels.

Under adaptive routing, we are still able to calculate the routing options, trafﬁc rate
and cut-through probability for each individual source and destination pair. However, these
calculations may not be able to capture the accurate channel utilization and cut-through
probability because the adaptive routing options will be dynamically selected based on the
status of each channel. Therefore, we will not present analytical models under adaptive

routing in this dissertation.

3.2 Simulation Results

To verify analytical modeling and evaluate network performance, simulation experiments

have been conducted based on the simulation model presented in Chapter 6.

3.2.1 Simulation Environment

The simulation environment is same as the one presented in Chapter 6. Although the sim-
ulation model can handle many network parameters and their combinations, we only con-
sider a subset of these network parameters in this chapter. Unless otherwise speciﬁed,
packet (message) size is exponential distribution and the mean is 100 bytes, including 3-

byte packet header. Buffer size is 2000 bytes per port in store-and-forward and virtual

47

cut-through switches and 60 bytes per port in wormhole switches. For simplicity, all ports

have same bandwidth, which is 1 byte per time unit.

Due to the arbitrary topologies in switch-based networks, we consider the four networks
shown in Figure 3.1. In each network, there are 10 switches and each switch is supposed to
have 12 bidirectional ports. There are totally 75 nodes in each network and they are evenly
distributed among switches (if possible). Under client/server trafﬁc, 4 out of 75 nodes are
considered as servers. For a server, it uniformly communicates to all the other nodes. For a
client, 50% of its trafﬁc uniformly goes to the 4 servers and the rest of its trafﬁc uniformly

goes to all the other clients.

As we will present in Chapter 6, a ﬁnite source model is applied in the simulations. For
each source node, there is only one distinguished message (packet) and next message will
not be generated until the current message completely leaves the source node. Note that
in this dissertation, we do not consider packetization and therefore, packets and messages
are used interchangeably. Theoretically, each node is supposed to have same trafﬁc gen-
eration rate. However, when trafﬁc load is heavy, trafﬁc congestion and buffer overﬂow
may reduce trafﬁc generation rate in certain nodes via back-pressure ﬂow control. This
phenomenon may cause a different offered load from theoretical calculation, resulting in

inaccurate mathematical analysis.

3.2.2 Veriﬁcation of Analytical Models

The analytical expressions for average packet latency estimate the network performance,

but it is difﬁcult to capture inter-switch dependencies mathematically. To verify the analyt-

48

 

 

Figure 3.1: Four networks for veriﬁcation and evaluation

ical expressions, simulation experiments are conducted to compare the simulation results
and analytical calculations under both uniform trafﬁc and non-uniform trafﬁc. Since the
analytical models assume ﬁxed routing, we use the deterministic ETR routing (which will

be presented in Chapter 4) in simulations to verify the analytical models.
Uniform ’Ii'afﬁc

Under uniform trafﬁc, trafﬁc is supposed to evenly distributed among channels. There-
fore, Equations ( 3.10 ) and ( 3.12 ) may be simple and reasonable estimations by con-
sidering average channel utilization and average cut-through probability. Equations ( 3.6 )
and ( 3.11 ) estimate average latency by considering each individual channel utilization and
cut-through probability, which should be more accurate but more complex than Equations

(3.10)and(3.12).

As shown in Figure 3.2 and in Figure 3.3, analytical expressions closely match the

49

simulation results under light to moderate workload, although the simulation results are
slightly higher than analytical calculations. It is not surprising to observe that Equations
( 3.10 ) and ( 3.12 ) underestimate latency than Equations ( 3.6 ) and ( 3.11 ), because the
former two equations use average data. However, when offered network load is equal or
greater than the load marked ‘high’, the analytical model cannot accurately estimate the

average latency.

 

 

Topology A Topology B
I I f I f I I r T I II 3500 I T I I T I I 1 I T I
1400 - ‘ 4
I 3000 - , ~
1200 - - 3 Azalysis (E .3.10 -o—
f - al is .3.6 -+---
E 1000 - Axaiygigﬁ egg ~— .w 5 250° ysSiinglation M... 3,..7’
"‘ na is q. . -+--- "' _ .'
g 800 Simulation We a 2000 5","
g 600 . “ﬁnd 3 1500 _ .' x
s .a-"i’x' s
< 400 " ..4E>"_'.'£’,',-»"" < 1000 "
200 ' 500 r-
0 1 1 1 1 1 1 1 1 1 1 l 0 1 1 1 1 1 1 1 1 1 1 1
low high low high
Offered Load (low to high) Offered Load (low to high)
Topology C Topology D
2m I I I T rﬁ I I I I I 3000 I I I I I I I I j I I
f
I ,D
,"p 2500 i Analysis (Eq. 3.10) +— p”
g 1500 - Analysis (E .3.10 .._ g~ Analyse (Eq- 3.6) -+---
o Analysis( q. 3.6 -+--- f g 2000 . Simulation ----+:i a" .1”
3 Simulation am- I 3 1'
1000 L Q 1500 "
S co
‘-‘-’ a
g g 1000 r
< 500 - <
500 ~
0 1 1 1 1 1 1 1 1 1 1 1 o 1 1 1 1 1 4L J 1 1 1 1

 

    
 

  
 

 

 

 

 

 

 

 

low high
Offered Load (low to high)

 

 

 

 

 

 

 

low high
Offered Load (low to high)

 

 

Figure 3.2: Average latency of store-and-forward switching under uniform trafﬁc

For all plots in this chapter, the offered load marked ‘high’ means that under the same
or higher offered load, one or more than one channel have channel utilization over 95% (or

even over 100%). Under such high workloads, some channels are over utilized so that it is

hard to capture the real trafﬁc rate on those channels. Moreover, since we use back—pressure
ﬂow control, trafﬁc congestion on overloaded channels may further inﬂuence trafﬁc on
other channels and eventually reduce the real trafﬁc generation rate in the source nodes. It

means that the offered load used for analysis is different from the real network load, which

50

is difﬁcult to know mathematically.

1000

800

400

Average Latency

200

Average Latency

 

 

1000

TopologyA
r r r r r I I I y :1; , 35m
f ill
I‘ 3000
Analysis (Eq. 3.12 .._. :o' .
Analysis Eq.3.11 -+~-- : g: 2500
Simulation We ..
i 3 2000
§ 1500
e
<

 

500

1 1 l 1 1 L 1 1 1 1 o

 

 

 

 

 

 

 

low high
Offered Load (low to high)
Topology
I I I I I I r I I I l’ I 25m
-° , 2000
Analysis (Eq. 3.12 ~— I ,-' - 2;
Analysis Eq. 3.11 -+--- _:' 5
Simulation We ‘ 3 1500
i a
. g 1000
a)
- 2
500
1 1 1 1 1 1 1 1 1 1 1 0
low high
Offered Load (low to high)

Topology B

 

 

I

Analysis (Eq. 3.12 -o— ,.--;'
Eq.3.11 -+--- ,9," -
Simulation am- I

Analysis

 

I I T I I T I I I

 

 

 

 

low high
Offered Load (low to high)
Topology D

Analysis (Eq. 3.12 .._. «.a ,«" .
Analysis (Eq. 3.11 -+---

 

r I f ,1
B

Simulation ev- ,"'

 

 

low high

Offered Load (low to high)

Figure 3.3: Average latency of virtual cut-through switching under uniform trafﬁc

N on-Uniform 'D'afﬁc

Under non-uniform trafﬁc, i.e., the client/server trafﬁc mentioned in Chapter 3.2.1,
Equations ( 3.10 ) and ( 3.12 ) will not be valid because trafﬁc is not uniformly distributed

any more. As shown in Figures 3.4 and 3.5, Equations ( 3.6 ) and ( 3.11 ) are used to

51

estimate the average latency. Again, it is observed that the analytical models closely match
the simulation results under light to moderate workload. When the offered load is equal to
or over the point marked ‘high’, analytical models cannot accurately estimate the latency

although they still keep the performance trend.

 

 

 

 

 

 

 

 

   

 

 

        

 

 

 

 

 

 

Topology A Topology B
4000 I I I I I I I I I 1 4000 I I I I I I I I I T I
3500 . Analysis (Eq. 3.6) —o— q 3500 _ Analysis (Eq. 3.6) -o— "" ,
g 3000 ~ Simulation -+--- ,1” ~ g 3000 - Simulation 4;“! .
a 2500 - " - e 2500 ' i
3 3
3. 2000 - . 8» 2000 .
g 1500 . - g 1500 .
E 1000 - - 3: 1000 .
500 r . 500 4
0 + o r
Topology C Topolog D
4000 I I I I I I I I I '1' 4000 I I I ﬂ I I yI I I I 1””,
350° ' Analysis (Eq. 3.6) +— .x” i 3500 ' Analysis (Eq. 3.6) +—,r"' ‘
g 3000 - Simulation -+--- g 3000 - Simulation #7,: .
3 2500 r " - 3?. 2500 - f .
o 2000 ~ - a, 2000 - -
D D
g 1500 h - g 1500 r ~
2 1000 - - 3: 1000 ~ «
500 - a 500 P J
o 4 1 r 1 r r 1 r r 1 r o P L 1_ r r 1 1 r 1 1 L
low high low high
Offered Load (low to high) Offered Load (low to high)

Figure 3.4: Average latency of store-and-forward switching under client/server trafﬁc

Remember that when the offered load is equal to or over ‘high’, some channel have
utilization over 95% or even over 100%, which means that these channels are under an
over-saturated situation. The trafﬁc congestion may inﬂuence trafﬁc on other channels
and trafﬁc generation rate. Unfortunately, such a trafﬁc inter-dependency in a network is

very difﬁcult to capture mathematically. In a word, if the offered load does not cause high

52

utilization for any channel, our analytical expressions closely match simulation results.
Otherwise, the analytical results may not be close to simulation results, but they still have

the same performance trend.

 

 

   
     

 

 

 

 

 

 

 

 

 

 

      

 

 

 

 

 

 

 

Topology A Topology 8
4” I I I I I T ﬁ T I I 4m I I I I I I I I I I l}
3500 - I; ~ 3500 b .. .
3‘ 3000 . Analysis (Eq. 3.11) -+— ,t” - >~ 3000 - Analysis (Eq. 3.11) +—,/" -
g 2500 Simulation -+--- d g 2500 _ Simulation «+5.4: ,
a 2000 . g, 2000 - f .
g 1500 4 g 1500 i -
3: 1000 1 2 1000 - -
500 i 500 - -
O 1 o r r 1 1 r r 1 1 r r 1
low high
Offered Loa (low to high)
TopologyC Topology
4000 I I r I I I I TI ﬁ I I 4” I I f I I I I I I I ,"T
,.
350° ' 1 350° ' Analysis (Eq. 3.11) .._ ’ '
g 3000 - Analysis (Eq. 3.11) -+— ,I’ 1 5 3000 L Simulation -+---,r"" ~
Simulation C
g 2500 - ' . g 2500 - .
g 2000 " " Q 2000 "' "
CD 0)
g 1500 ~ . g 1500 l «
> >
< 1000 ~ . < 1000 r -
500 - 4 500 - ~
0 1- - r 1 1 1 J 1 1 r 1 r O 1 1 1 1 1 L 1 J 1 1 1
low high low high
Offered Load (low to high) Offered Loa (low to high)

Figure 3.5: Average latency of virtual cut-through switching under client/server trafﬁc

3.2.3 Performance Evaluation

To understand the effect of switching modes, buffer management, and other network pa-
rameters on network performance, simulation results are measured to make comprehensive
comparisons.

Comparison of switching schemes

53

There are three different switching modes in commercial switches: store-and-forward,
virtual cut-through, and wormhole. Figures 3.6 and 3.7 demonstrates their performance
tradeoff under both uniform and client/server trafﬁc. Under light to moderate workload,
virtual cut-through and wormhole switches provide low packet latency because they do
not need to buffer the whole packet before forwarding. However, wormhole switches do
not provide high throughput under heavy workload, because we use small buffer space for
wormhole switches, which is not big enough to store most of packets. If wormhole switches
have bigger buffer space and still allow pipeline forwarding, it should have higher through-
put. Store-and-forward switches cause high latency even under light workload due to the
overhead of buffering packets. On the other side, store-and-forward switching provides as
high throughput as virtual cut—through switching. This is because that they both have large
buffer to keep blocked packets.

Given the beneﬁts of the pipeline fashion in cut-through switching and the big buffer
space in store-and-forward switching, a promising switch design should integrate them to-
gether and dynamically change switching mode for each port. Some recent router/switch
designs have supported hybrid switching modes. As an example, the Cray T3D router [70]
implements wormhole switching with two pairs of virtual channels. Each virtual channel
has enough buffer space to store a small packet, while longer packets spread across mul-
tiple routers (channels). Therefore, a small packet will not stall in a channel. IBM SP2
switch [71] reduces network congestion through a shared output buffer. The central queue

dynamically buffers blocked wormhole packets in 8-ﬂit chunks.

Average Latency

Average Latency

54

 

 

   

 

 

 

 

 

 

 

 

     

 

 

 

 

 

 

Toplogy A Toplogy B
I I r 3000 I r r 1 7
100° ' Store-and-forward —o— , ‘ 2500 _ " .
Virtual cut-through -+---
800 - Wormhole We w 3‘ i
5 2000 - Store-and-forward -o— 5 ~
600 3 Virtual cut-through ~+--
' 1 o 1500 . Wormhole We .
U)
as
40° L J a; 1000 - -
'I/r'” <
200 .- ”Ottawa/as” . 500 _ 4
odW’G‘kw
O 1 1 1 o r 1 1 1
0.04 0.08 0.12 0.16 0.04 0.06 0.08 0.10 0.12
Normalized Network Throughput Normalized Network Throughput
ToplogyC Toplogy D
2m I I I 3000 I I I I I
2500 r ,r a
1500 r Store-and-forward 4— a 3 .
Virtual CUI'WOUQh '+'" ,r 5 2000 - Store-and-forward _._ + l
W0m‘h0'8 "“9 3 Virtual cut-through -+--- -
1ooo , . m 1500 i- Wormhole -----o , .
<3 .
g 1000 l- -
500 - - . <
. “MM" 500 .
o-vo-W'Q’MB'Q i
o r 1 1 o 1 1 1 1 1_
0.04 0.08 0.12 0.16 0.04 0.06 0.08 0.10 0.12
Normalized Network Throughput Normalized Network Throughput

Figure 3.6: Comparison of switching schemes under uniform trafﬁc

Average Latency

Average Latency

4000
3500
3000

Topology A

55

 

 

I

Store-and-forward +—
Virtual cut-through -+---

 

 

 

 

 

   
   
 

 

 

250° ’ Wormhole We ‘
2000 .
1500 5
1000 ~
500 .
0
0.03 0.05 0.07 0.09 0.1 1
Normalized Network Throughput
TopologyC
4000 r r
3500 ,5
3000 \Sitore-land-ftomarg —o— ,i .
rtua cut-t roug -+--- :'
250° ' Wormhole -~-9~ f i
2000 If r
1500 a .
a.
1000 “g .
500 '1 ,,,,, -
o r...1--.-a+.a-r 1
0.03 0.07 0 09 0.11

0.05 .
Normalized Network Throughput

Average Latency

Average Latency

 

 

 

 

 

 

 

 

 

 

Topology B
4000 . . r,
3500 «
3000 \Store-Iand-frorwarg -o— «
irtua cut-t roug -+--- i
2500 Wormhole o"- " ‘
2000 .
1500 .
1000 ~
500 '43") u
o 8"?” 1 1
0.03 0.05 0.07 0.09 0.11
Normalized Network Throughput
Topology D
4000 I I I
3500 : 1
3000 \Sitore-land-fgrwarg +— 1+ ~
rtua cut-t roug -+--- ,:
2500 Wormhole en- ; ‘
2000 J
1500 -i
1000 ~
500 -4-é'itlflau "
0 °“*T'"°‘"“’“f;"' .
0.03 0 05 0.07 0.09 0.11

Normalized Network Throughput

Figure 3.7: Comparison of switching schemes under client/server trafﬁc

56
Comparison of FIFO and Shared Buffering

It is well-known that FIFO input buffering can cause poor performance due to the head-
of-line blocking. In Figures 3.8 and 3.9, we use simulation results to compare the perfor-
mance between FIFO input buffers and shared buffers under store-and-forward switching.
For shared buffering, each input port has a buffer to temporarily hold the incoming packet
and each switch has a 2000 bytes shared buffer pool. It is not surprising to observe that
shared buffering provides lower latency and higher network throughput than FIFO input

buffering, although the shared buffer pool has smaller space than the sum of all input FIFO

 

 

         

 

 

 

 

 

 

 

 

   

 

 

 

 

 

 

buffers.
Topology A Topology B
I j I I 3000 I I I r I
1000 . J
x." 2500 r- .
5‘ 800 l FIFO +— ,x’ - a FIFO 4—
3: Shared -+--- " 5 2000 r Shared -+--- 1 -
g 600 r ,1" - g 1500 , 5+4" -
S S s
5 40° ’ ..... v" ‘ as 1000 L ,1 .
> *-.4—-"* > '
< ** <
200 P - 500 . .
o 1 l 1 1 0 1 1 1 L 1
0.04 0.08 0.12 0.16 0.2 0.04 0.06 0.08 0.10 0.12 0.14
Normalized Network Throughput Normalized Network Throughput
Topology C Topology D
am I I I I 3000 T I I I I I
2500 ~ * -
1500 - . 3 x,
5 2000 . FIFO _._ f: -
3 x’ 3 Shared -+--- 5
1000 - , - a, 1500 - i 4
8. j o: 3
S . S ,t
g ,.v" 9 1000 - .
< 500 - ”my“ . <
+--+-+"“"'+---- 500 l- -
o 1 r 1 1 0 1 1 L 1 1 1L
0.04 0.08 0.12 0.16 0.2 0.04 0.06 0.08 0.10 0.12 0.14
Normalized Network Throughput Normalized Network Throughput

Figure 3.8: Comparison of FIFO and shared buffering under uniform trafﬁc

Average Latency

Average Latency

Topology A

57

 

4000
3500 ..
3000 .
2500 -
2000 ~
1500 -
1000 -
500 -

 

I

FIFO +—
Shared -+---

  

    

Average Latency

 

 

0
0.03

0.05
Normalized Network Throughput

0.07

Topology C

0.09

0.11

 

4000
3500 -
3000 F
2500 -
2000 i-
1500 ~
1000 -
500 *

 

FIFO -o—
Shared -+---

L

  

  
 
 

Average Latency

 

 

0
0.03

0.05

0.07

0.09

0.11

Normalized Network Throughput

 

 

 

 

 

 

 

 

 

 

Topology B
4000 1 . .
3500 ~
3000 4“ -
2000 FIFO +— ,2" -
1500 Shared -+--- -
1000 ~
--4
500 -
0 1 r r
0.03 0.05 0.07 0.09 0.11
Normalized Network Throughput
Topology D
4000 . r .
3500 , ‘
3000 *1.‘ .
To?
2500 i 1
2000 FIFO _._ ,1 +
1500 i Shared -+--- _
1000 ,I’ -
500 "4' -
0 1 J L
0.03 0.05 0.07 0.09 0.11

Normalized Network Throughput

Figure 3.9: Comparison of FIFO and shared buffering under client/server trafﬁc

Effect of Packet Size

To understand the effect of packet size on network performance, we also measure the

performance data when the mean packet size is 500 bytes. Under same packet generation

rate (bytes/s), the performance under different packet sizes are compared in Figure 3.10. It

shows that the performance trends are similar. The higher latency for larger packet size is

expected.

Adaptive Routing

For all the above performance data, they are measured under a deterministic routing

algorithm. Adaptive routing algorithms are also proposed to provide routing ﬂexibility

58

 

 

 

 

 

 

 

 

 

 

ToplogyA
2000 ~ _
5‘
g 15001 Msgsize=1003 .._ .
‘5 Msg 5128:5008 "1“"
..J
8 1000 I:
S
0
>
< 500 ~
0 l ‘ l r
0.04 0.08 0.12 0.16
Normalized Network Throughput
ToplogyB
4000 . , , r
3500 F i .
g 2500 - ii -i
m 2000 L Msg size: 1003 .._ .
CD ' = —+--- ,4‘
g 1500 _ Msg Size 5008 q
E 1000 - -
500 L*-+-+--+-"' _,
0 1

0.04 0.06 0.08 0.10 0.12
Normalized Network Throughput

Figure 3.10: Effect of packet size

and take the advantage of multiple routing paths. We have implemented an adaptive rout-
ing algorithm (ATR), which will be presented in Chapter 4, in our simulations. Under
the adaptive routing, we also measure performance and make comparison among different
switching modes and buffer management. As Shown in Figures 3.11, 3.12, 3.13 and 3.14,
the same performance trends under deterministic routing can also be observed under adap-

tive routing.

1000

800 -

600

400

Average Latency

200

1 400
1 200

Average Latency

Figure 3.11: Comparison of switching schemes under uniform trafﬁc

59

Toplogy A: adaptive routing

 

I I

Store-and-forward _.— .
Virtual cut-through -+---
Wormhole ---—-e

 

 

 

 

0.04 0.08 0.12 0.16 0.2
Normalized Network Throughput

Toplogy C: adaptive routing

 

I

- Store-and-forward —o— -
q.“ Virtual cut-through -+---
- ‘31. Wormhole We -

 

 

 

 

0.04 0.08 0.12 0.16 0.2
Normalized Network Throughput

60

Topology A: adaptive routing

 

    
  

 

 

 

 

4000 . | .
3500 i .
g 3000 ~ Store-land-fgrwarg ~9— i -
irtua cut-t roug -+--- ,
g 2500 ’ Wormhole We " '
.1
Q 2000 ' "
°’ I
E 1500 -
a
< 1000 " RED I "
500 ~ '9; ,,,,,,,, -
0 9'1" mer r r
0.05 0.07 0.09 0.1 1
Normalized Network Throughput
Topology C: adaptive routing
5000 I I I “‘ I
4500 - I, ~
4000 ~ ‘1 .

3500 - Store-and-forward _.__ .
Virtual cut-through -+--- I -
Wormhole -----£»

2500 -
2000 -
1500 -
1000 -
500 -

Average Latency

 

 

 

 

0.05 0.07 0.09 0.1 1
Normalized Network Throughput

Figure 3.12: Comparison of switching schemes under client/server trafﬁc

Average Latency

Average Latency

Figure 3.13

61

Topology A: adaptive routing

 

1000 r

FIFO 4— -
Shared -+---

800

600

 

 

 

 

0.04 0.08 0.12 0.16 0.2 0.24 0.28
Normalized Network Throughput

 

I I

Topology C: adaptive routing
1 000 I I I I

800 L FIFO -°— ~
Shared -+--- 1,,“

600L

 

 

 

 

0.04 0.08 0.12 0.16 0.2 0.24
Normalized Network Throughput

: Comparison of buffer management under uniform trafﬁc

62

Topology A: adaptive routing

 

I

4000 .
3500 _ .
3000 ~ FIFO —._ i, .
2500 _ Shared -+--- it
2000 ~
1500 F
1000 —
500 ~ A. - .

Average Latency

 

 

 

 

0.05 0.07 0.09 0.1 1
Normalized Network Throughput

 

Topology C: adaptive routing
4000 I I I ‘

3500 ~ .
3000 — FIFO 4— i -
2500 _ Shared -+--- .
2000 i
1500 ~
1000 ~
500 ~

I

Average Latency

 

 

 

 

0.05 0.07 0.09 0.1 1
Normalized Network Throughput

Figure 3.14: Comparison of buffer management under client/server trafﬁc

Chapter 4

Deadlock-Free Routing in Arbitrary

Switch-Based Networks

The main contributions of this chapter include two proposed routing algorithms and com-
prehensive performance evaluations among three routing algorithms in arbitrary networks.
Due to the arbitrary network topologies, it is not trivial to develop an efﬁcient routing algo-
rithm with lower transmission latency and higher network throughput. In this chapter, we
consider wormhole switching, a special case of cut-through switching. A main difﬁculty
for wormhole switched networks is to design a routing algorithm avoiding deadlock. To
deal with this challenge, we propose two simple and efﬁcient deadlock-free routing algo-
rithms for arbitrary networks using wormhole switches. In order to avoid deadlock, we
use two unidirectional Eulerian trails to help maintain an order of channel dependency and
allow reasonable routing. Shortcuts are added to the Eulerian trails to provide more and
shorter routing paths. Heuristics are suggested in order to provide better performance.

63

4.1 Related Work

In a wormhole switch, based on the header information of an incoming packets, a routing
algorithm selects an outgoing channel to deliver or forward the packet. A critical issue
in designing an appropriate routing algorithm in wormhole switched networks is to avoid
channel deadlock, because a message can hold some channels and wait for another chan-
nel. A deadlock occurs when there is a cyclic dependency among occupied channels and
requesting channels. A well-known solution for deadlock avoidance was proposed in [72]
based on restricting the routing algorithm. For example, the turn model proposed in [73]

provides deadlock freedom via prohibiting certain turns.

Some general methodologies [42, 43, 45, 74] have been proposed to allow deadlock-
free routing in arbitrary networks. However, it is difﬁcult to directly apply those general
ideas to an arbitrary network. The examples given in those papers are all for regular topolo-
gies. There are other methods of preventing deadlocks, such as dropping packets for which
there is insufﬁcient buffer space, or using virtual channels [4] to allow more routing free-
dom. We assume that these methods are unacceptable or impractical for the switch-based

networks we desire.

For the purpose of evaluation and comparison, we consider another two existing routing
algorithms, which can be directly applied to arbitrary networks using wormhole switches.
These two algorithms are the up*ldown* routing (UDR) used in the DEC AN 1 system [41]
and the smart-routing (SR) [46] proposed by HP researchers. In order to avoid deadlock, the
UDR and SR adopt the concept of deadlock-prevention, which never allows the formation

of a dependency cycle. Our proposed routing algorithm can be used in two ways: a simple

65

algorithm called Eulerian trail routing (ETR) and a complicated algorithm called adaptive
trail routing (ATR). Details of the two routing algorithm will be described in Chapters 4.3
and 4.4. While the ETR never allows a dependency cycle, the ATR uses the philosophy
in [42, 43], which allows the existence of cycles, but does provide a channel to escape
from the cycles. Neither UDR nor SR considers virtual channels. For simple and practical
reasons, we do not consider virtual channels in our routing algorithm either, but it is an
interesting issue worthy further study.

The UDR algorithm. A breadth-ﬁrst Spanning tree is constructed from a speciﬁed root.
Each channel is assigned a direction based on the spanning tree, with ‘up’ meaning ‘toward
the root’. A tie is solved by comparing the ids of two end switches of a channel. With this
assignment, the directed channels do not form loops. A legal route is deﬁned to be one that
never uses a channel in the ‘up’ direction after it has used one in the ‘down’ direction. The
UDR routing is easy to understand and implement, but it may concentrate trafﬁc around
the root switch and allow unnecessary long routing paths. Thus, its performance is greatly
dependent on the network topology and the selected spanning tree. Since all routing paths
satisfying the up*ldown* requirement are allowed in the routing table, many very long
routing paths may exist and cause poor performance in many cases. We have modiﬁed the
original algorithm in the following way. If there are multiple routing choices in a routing
table entry and the shortest distance among these paths is h, longer routing paths, which
have more than two hops and are longer or equal to 2h, will be discarded. The modiﬁed
UDR is called MUDR.

A Similar idea is used in Myrinet [37], where a spanning tree is built up for routing.

However, Myrinet uses source routing instead of adaptive routing based on table lookup.

66
The SR algorithm. Based on the network topologies, the SR algorithm builds an

explicit channel dependency graph and searches the graph for cycles. For each cycle, a
dependency is broken to minimize a heuristic cost function. The procedure terminates when
the channel dependency graph has no cycles. The routing is represented by the channel
dependency graph. The SR can be used as adaptive routing (SRA) or deterministic routing
(SRD). For the SRA, an adaptive routing table, which allows multiple routing paths among
switch pairs, is created in each switch. For the SRD, a deterministic routing table, which
provides only a single path between any two switches, is created in each switch. In this
paper, the SR represents both SRA and SRD. The smart-routing has done a good job to
balance channel utilization under uniform trafﬁc. However, the trafﬁc balancing, which
depends on a linear programming solver, is the most time consuming part when calculating
the routing table. Such complexity may not gain signiﬁcant performance beneﬁt in real
networks Since non-uniform (e.g., client/server) trafﬁc is very commonly observed.
Extensive simulation experiments have been conducted to evaluate the performance of
up*ldown*, ATR, and SR under different topologies and workloads [57]. It is observed in
the simulations that a premature bottleneck channel is one main reason to cause poor net-
work performance. The SR is the only algorithm putting signiﬁcant effort and computation
time to balance channel trafﬁc. Note that trafﬁc balancing should be guided by real trafﬁc
pattern in the network. The SR assumes a unrealistic trafﬁc pattern, uniform trafﬁc, so
it may not provide better performance in non-uniform (e.g., client/server) trafﬁc. A good
heuristic routing algorithm may have good network performance under many network con-
ditions. For example, a much more simple algorithm ATR can offer as good performance

as SR in most cases [57]. However, to achieve near optimal performance, trafﬁc balancing

67

should be considered if routing table computation time is not very critical.

4.2 Deﬁnitions and Eulerian 11:31]

As we mentioned earlier, regular network topologies do not have good incremental scal-
ability due to the stringent interconnection constraints [1] and may not be able to use the
original routing algorithms with faulty nodes. Thus, some arbitrary network topologies
with good network performance are highly demanded. In this chapter, we will use the
same network shown in Figure 1.1 to illustrate our proposed algorithm. The nine 8-port
wormhole switched network is Shown again in Figure 4.1(a). The graph in Figure 4.1(b) is
used to model this network, where each vertex corresponds to a switch and each edge cor-
responds a channel connecting between a switch pair. This graph representation is used to
represent network topologies and derive channel dependency graph. All channels (edges)
are assumed to be bidirectional and multiple channels (edges) between a switch pair are
allowed.

Our proposed algorithm [21, 75] is applicable to any network topology with Eulerian
trails. Its basic idea is to ﬁnd two opposite unidirectional Eulerian trails to provide rea-
sonable routing paths and control the order of channel dependency. An Eulerian trail is a
sequence of channels, which visits each channel once and exactly once so that it can main-
tain the order of channel dependency. In order to maximize channel utilization and allow
more and Shorter routing paths, shortcuts may be added to the two unidirectional Eulerian

trails. The two unidirectional trails with shortcuts are called adaptive trails.

To avoid deadlock, shortcuts have to be used in a restricted way (e.g., some shortcuts

68

can only be used for messages coming from terminal ports) based on the channel depen-
dencies along the adaptive trails. However, a dependency cycle is allowed in adaptive trails
as long as there is an escape channel [42, 43] for that cycle. Based on the unidirectional Eu-
lerian trails and adaptive trails, we have two routing algorithms: one is called Eulerian trail
routing (ETR) based on the two unidirectional Eulerian trails, and the other one is called
adaptive trail routing (ATR) based on the two unidirectional adaptive trails. The ETR is
easier to implement, but it may cause longer routing paths and imbalanced channel utiliza-
tion due to channels’ different locations in the trail. The ATR is more complex, but it is
more fair to route messages evenly among all channels. For both ETR and ATR algorithms,
routing information are carried in a static routing table in each switch, which provides local

routing information for each switch.

   

(b)

Figure 4.1: An example of an arbitrary network topology

69
4.2.1 Deﬁnitions

In order to simplify our presentation, we introduce the following deﬁnitions.

Deﬁnition 1 Trail, path, and cycle: a trail is a ﬁnite sequence of edges (channels) of the
form '00 —> 01 —> —) vm_1 —> pm, in which all the edges are distinct. If vertices

00, 01, - - ~vm_1, vm are also distinct, it is called a path. When 00 2 pm, it is called a cycle.

A graph G is said to be connected if every pair of its vertices are joined by a path.
Obviously, all graphs considered in this thesis are connected. The degree of a vertex is
deﬁned as the number of edges connected to the vertex. For a pair of vertices u, v in G,
dG(u, 0) denotes the length of a shortest path from u to v in G. The diameter of a graph is

the maximum dG(u, 0).

Deﬁnition 2 Eulerian trail and Eulerian graph: an Eulerian trail of a connected graph is
a trail that contains all the edges of the graph. A graph is called an Eulerian graph if and
only if it has an Eulerian trail. The sufﬁcient and necessary condition is that all vertices

have even degrees or exactly two vertices have odd degrees.

Deﬁnition 3 Channel, shortcut, and index in a unidirectional Eulerian trail: given a uni-
directional Eulerian trail ET: 110 —> -~ . —> v,- —+ - .. —> 12,- —> . - - —> vm in graph G, each
vertex (switch) along the trail is given an index beginning from 0. In trail ET} a channel
between any two subsequent vertices v,- and 11,41 is denoted as c,-,,-+1 (i. e., v,- —> 0,41). Given
an Eulerian trail E7: a shortcut 31"]: between two non-subsequent vertices v, and 12,- exists
if j > z' + 1, 223- 75 0,41, and the channel 0,- —+ '0,- is in the edge set of G. A shortcut 3,3,“

exists between two subsequent vertices v, and 0,41 if there are multiple channels between

70

v,- and 12,-.” in the network. Let C( ET) denote the set of all unidirectional physical links in

E7: which means 11,- —-> 0,41 6 C(ET), but 0.41 —> 12,- ¢ C(ET).

In Deﬁnition 3, the concepts of channels and shortcuts are used to logically distinguish
physical links’ positions in a trail. Any channel or shortcut corresponds to a unidirectional
physical link in the network. If v, and 03- represent the same switch in a network, we say
u,- = 03-. A shortcut SM 6 C(ET) if and only if there is a k such that v), = 0,, and 0H1 2 pg.
We use cs,”- to represent either cm- or 3,-0- if applicable. osm- and cs“ represent the same

physical link if v,- 2 up, 03- = rig, and there is no multiple channel between the two switches.

4.2.2 Eulerian Trail

Our routing algorithm can be applied to any network which has an Eulerian trail. The
algorithm to ﬁnd an Eulerian trail is well-known. We adopt the same algorithm in the book
[76]. A graph may have more than one Eulerian trail. Our routing algorithm is based on
any Eulerian trail. A heuristic is suggested in Chapter 4.5.2 to select an Eulerian trail which
may provide a good performance. In this section, we suppose a random Eulerian trail.
Since bidirectional channels are used to interconnect switches, an Eulerian trail is con-
sidered as two opposite unidirectional Eulerian trails, called ETl and ET2, respectively.
Unless otherwise speciﬁed, any channel or trail mentioned in the proposed routing algo—
rithms means a unidirectional channel or trail. As illustrated in Figure 4.3, one Eulerian
trail,ETl,forFigure4.1(b)isO—->5—i4——>3—>8—>7—> 1 £3 5—> 2—+3—+1—‘>5—+
6 —+ 0 —-> 1 —-> 8. We use 6 and e’ to distinguish the two physical links between switches 1

and 5. Another Eulerian trail ET2 can be obtained if we look at the above trail in the reverse

71

order:8—)1——>O—>6—>5-3)1—>3—>2——>5—e;1—>7—>8—>3—+4—>5—>0.Ifa

channel is in ETl , it cannot be a channel in ET2.

4.3 Eulerian-Trail Routing Algorithm (ETR)

Given ET] and ET2, an easy way is to route messages along the two trails. To allow more
and shorter routing paths, the two Eulerian trails are concatenated into a single unidirec-
tional trail. This routing method is called Eulerian trail routing (ETR), where messages are
routed on the concatenated unidirectional trail.

Given ET] and ET2, we can easily concatenate them into a single unidirectional trail
as follows:

0—>5—>4—>3—>8—>7—>1$5—>2—>3——>1-‘35—>6—>O—>1—>8—>1-—>
0—+6—>5—e>1—~>3—->2—>5i'>1—>7—+8—>3—>4—>5—>0.

Message routing will follow the unidirectional trail. For example, a message from
switch 4 to switch 8 will take path 4 —) 3 —> 8. The ETR routing algorithm takes the
reasonable paths following the channel dependency order in the concatenated trail. Note
that a message does not have to go through every channel and it never passes the same
switch more than one time. For example, consider a message from 0 to 6. It will take
0->5->6insteadof0->5—>4—>3—>8—>7—i135—”4341:3546.
Because the later routing path passes switch 5 multiple times, increasing path length and
wasting channel bandwidth.

The ETR can be used as deterministic routing or adaptive routing. If used as determinis-

tic routing, only the shortest path in the trail is allowed. If used as adaptive routing, multiple

72

routing choices can be kept. Heuristics suggested in Chapter 4.5.] could be applied to avoid
longer routing paths. Different Eulerian trails generate different routing tables, which may
provide different performance. A heuristic method is suggested in Chapter 4.5 .2 to select
an Eulerian trail providing better performance.

Like the UDR routing algorithm, the ETR routing is very easy to understand and very
simple to implement. Neither shortcut iS added to the Eulerian trail nor channel dependency
cycle exists. Unlike the Up/down, it uses an Eulerian trail to provide reasonable routes and

avoid obvious trafﬁc congestion.

4.3.1 Routing Tables

Based on the routing paths allowed in the concatenated trail, a routing table is created
in each switch to carry routing information for the local switch. A routing table has many
entries and each entry is indexed by the destination switch and the incoming port (channel).
If adaptive routing is allowed, more than one routing option may be listed in an entry. As
an example, the routing table in switch 8 for deterministic ETR is Shown in Table 4.1. Note
that if a message cannot match any entry, it means somehow it has taken a wrong route.
For example, if a message in switch 8 has destination switch 0, it Should not come from

switch 7 because from 7 to 0 should take 7 —> 1 —) 0.

4.3.2 Time Complexity

The ETR algorithm is simple to implement. Let m be the number of bidirectional channels

and n be the number of switches in a network. Finding an Eulerian trail takes 0(m) time

73

 

 

 

 

 

 

 

 

 

 

l Dst Incoming Channels " Outgoing Channel I
0 3—i8,terminalports 8—>1
1 3 —+ 8, terminal ports 8 -> 1
2 terminal ports 8 —> 1
3 7—>8,terminal ports 8—>3
4 7—>8,terminal ports 8—93
5 terminal ports 8 —+ 7
6 terminal ports 8 —+ l
7 3—+8,terminalports 8—->7

 

 

 

 

 

 

Table 4.1: Example routing table entries in Switch 8 for deterministic ETR routing

complexity and creating routing tables in all switches takes 0(mn) time complexity.

4.4 Adaptive-ﬂail Routing Algorithm (ATR)

The ETR algorithm is very easy to implement and also provide reasonable performance.
Deadlock can never happen in the ETR because there is no channel dependency cycle.
However, the ETR may have two drawbacks. First, some messages are not able to use
the shortest paths to reach their destinations. For example, from switch 6 to switch 2, a
message has to go through 6 —+ 5 —> 1 ——> 3 ——> 2 (four hops) although the shortest path is
6 -—> 5 —> 2. Second, the earlier a channel appears in the ﬁnal concatenated Eulerian trail,
the less nodes can utilize them. For example, channel 0 —> 5 can only be used by nodes
in switch 0 because it is the ﬁrst channel in the trail. This may cause unbalanced channel
utilization.

Therefore, we proposed a more complex algorithm, ATR, to avoid these problems. In
the ATR, shortcuts are added to each Eulerian trail. The trails with shortcuts are called

adaptive trails. The adaptive trails derived from ETl and ET2 are called AT] and AT2,

74

respectively.

4.4.1 Adaptive Trail

Any channel in the network may be a shortcut in an adaptive trail. However, deadlock is
possible if shortcuts are added without any restriction. A deadlock example is shown in

Figure 4.2, where a shortcut 30,5 (i.e., 8 —> 3) is added to AT2 and used in the routing.

m1: 8"’3—> 2, waitingfor 2 ' >5
e,

m2: 2—>5->l, waitingfor 1 >7

m3: l""*7""’ 8, waitingfor 8 ‘ >3

 

 

Figure 4.2: An example of deadlock

In order to avoid deadlock, shortcuts have to be used with some restrictions. The con-
cept of channel dependency was ﬁrst presented in [72]. A routing algorithm is deadlock-
free if there is no cyclic channel dependency in the network [72]. However, such a condition
is too restrictive for adaptive routing, where more than one outgoing channel are offered
at many switches. It has been shown in both [43] and [42] that an adaptive routing algo-
rithm with cyclic channel dependency can still be deadlock-free as long as some necessary
conditions are satisﬁed. For any detected dependency cycle, we break it if it may cause
deadlock, but we allow it if it cannot cause deadlock.

The shortcuts are categorized into three different types: free-style shortcut, destination

shortcut, and source shortcut. Figure 4.3 shows how to add shortcuts step by step. For a

75
partialtraillike~~u1 ——> —> u,---- —) ’U1°H—> vk---,whereu,- = u(1$i_<_ j),
v,- = 0(1 3 i g k), and no 0 exists between ul and ii], any allowed Shortcut u MOI-tic”
is drawn from 11,- to 01. The order to add different types of shortcuts facilitates deadlock
avoidance.

Free-style shortcut: Given an Eulerian trail, a shortcut SM is a free-style shortcut (or
f-shortcut) if there is an i < p such that v,- = 0,, and 0,41 = 0,1. A free-style shortcut must
be in C(ET). To create adaptive trails, free-style shortcuts are ﬁrst added to each Eulerian
trail as illustrated in Figure 4 (ﬁrst step). Note that in ET], the free-style shortcut 310,11
from 010 (i.e., switch 1) to 1111 (i.e., switch 5) is channel e’, while channel cm,” is e. A
free-style Shortcut can be used by any message if it is on the routing path.

Destination shortcut: Given an Eulerian trail ET, any shortcut of ET is a destination
shortcut (or d-shortcut). For example, in AT] of second step in Figure 4, there are two
destination shortcuts to 09 (i.e., switch 3) from 04 (i.e., switch 8) and Us (i.e., switch 1),
respectively. A destination shortcut 3M can only be used by messages whose destination
is vq. Thus, whenever a message acquires a destination shortcut, it will eventually go to its
destination and release the channel.

Source shortcut: Given an Eulerian trail ET, any shortcut of ET, but not in C(ET), is
a source shortcut (or s-Shortcut). For example, in AT] of third step in Figure 4, there are
two source shortcuts from Us (i.e., switch 1) to rig (i.e., switch 3) and 013 (i.e., switch 0),
respectively. Note that channel 36,15 (i.e., 1 ——> 8) on AT] is not a source shortcut since
1 —> 8 is in C(ETl). A source shortcut SM can only be used by messages whose source is
a local processor connected to switch up. No message can hold a channel and request for a

source Shortcut.

76

- The two unidirectional Eulerian trails
lndcx:0 l 2 3 4 5 6 7 8 9 10 ll 12 l3 14 15

arr: 0->s->4->3—>8->7—>135->2—>3—>15>5—>6->0—>1->8

ET2: 8->r—>o->6->5-‘>1->3—>2—>55>i->7->s—>3->4->5->0

- First step: add free-style shortcut channels ' - e.‘ ..

AT]: 0+5+4+3+8+7+1$5+2+5+i3>g+6+o+i+§
.-:-.. ''''''''''''''''''' 1 --------- .

AT2: s->1—>o->6—>53>1—>3—>2->'55>li'->7->s—>i->4->5—>ri

«- Second step: add destination shortcut channels and remove free-style shortcut if it win cause deadlock

,-...r-- .............

AT1:0->5-> —>-—>->->3'>->—>->'1-5>—>6—>->

 
 
 
 

cl
AT2;3—>1—>0—>—>—‘>—>->—>-5->7—>—>—>->->

e

- Third step: add source shortcut channels and remove some of them causing deadlock:

    

 

  
   

      

- -- --— - —--——-.

'

    

AT1:0->5->->—>->->-c.>—> 915-9699»

    
 
  

vi
AT2;3—>1—>0->—>—$—>—>—>5>—>7->—>->—>—>

l. Free-style shortcuts are drawn in dashed line above the Eulerian trails.
2. Destination shortcuts are drawn in solid line below the Eulerian trails.
3. Source shortcuts are drawn in solid line above the Eulerian trails.

Figure 4.3: Illustration for the generation of adaptive trails

77

After adding different Shortcuts, we have to detect cyclic channel dependency. Any
routing path allowed in the corresponding Eulerian trail will be kept. Whenever a depen-
dency cycle has to be broken to avoid deadlock, we always remove a shortcut to break the
cycle.

As we mentioned above, different types of shortcuts are used for different purposes. A
destination shortcut cannot cause deadlock. Because whenever it is occupied as a destina-
tion shortcut, it always delivers a message to the destination and will be released ﬁnally. In

Figure 4.3, there is a dependency 3(03) —-+ 8(04) d—s’gta‘t 1(

06) on AT] and a dependency
8(00) —> 1(2),) —) 0(1),) —> 6(03) —> 5(1),) —+ 1(1),) -+ 3(2),) ““3““ sun) on AT2. The
two dependencies cause a dependency cycle, but the cycle cannot cause deadlock due to
the following. If 8 —) 1 is used as a destination shortcut on ATl, it will be released ﬁnally.
If 8 -——) 1 is used on AT2, there is another path from 8 to l (8 —> 7 —+ 1) available in ATl.

Dependency cycles due to free-style shortcuts may cause deadlock. Consider the fol-

lowing trail, where u and y_ represent a same node and v and y represent a same node.

 

f —shorfcut

. —shortcut
Assume there IS a free-style shortcut g ——> y. Because it —> v and g f —> y

represent a same physical link, there is no way to distinguish them when a message arrives
at switch 0 from u. Due to u —> v —> a ——) -- - —> d allowed by the Eulerian trail, a
message from source 1: to destination d may take path a: —> g f—sgtmt y —> a —> - - - —+ d.
Such a path, which breaks the channel dependency order given by the Eulerian trail, causes
a dependency cycle together with the dependency of a —> - - - -—> d —> - - - —> a: —)

— h
g. It may cause deadlock, so g f 83““ g has to be removed. As an example, let us

78

f—shogcuﬂe) 1

see why 5(08) (09) on AT2 may cause deadlock and has to be removed. If

we allow this free-style Shortcut, we have 5(04) 5) 1(225) —> 3(06 2111) and

f—shogcuﬂe)

5(03) 1(v9) —> 7(010) ——> 8(011). At switch 1, if a message coming from

f—shogcuﬂe)

2(07) —) 5(08) 1(09) wants to go to switch 8, it may take 1(05) —+ 3(v6) and

) d—sh_o>rtcut 8( ) d-slgrtcut 8(

wait on 3(06 2211). However, if 3(06 on) is not available, it may

take 3(06) —> 2(07) instead and have to wait on 2(07) —-> 5(08) which has been occupied by
the same message. Thus, deadlock happens. In Figure 4.4, we show all the conﬁgurations

— h t t
where _q I 8—0; cu 31 cannot be used as a free-style Shortcut.

case]:---u—>v—>a—+-~d-~-—ia:—+g—+~--—>y—>---d---

 

ca562:...u__)IU—)a_)...p°"—)$"ig",‘d...y—.)...d...

 

case3:...u_.)v__)a_)...p°"—)$—)y_—;...—-)Q—)"'d...

 

case4: ~--u—>v—)a—)---P"'—>$—>H—>"'—>y—> ---...d
(Note: over arrow lines mean f-shortcuts or d-shortcuts)

Figure 4.4: All cases where f-shortcuts from u to 0 should be removed to avoid deadlock

Deadlock may be caused by a dependency cycle due to source shortcuts too. In Fig-

s—shortcut

ure 4.5, there is a dependency a —+ b —> -- - —-> c —+ d on AT] and a dependency

s—shortcut

—> d —> - - - -—) a ——) b on AT2. The two dependencies create a cycle and may
cause deadlock. Therefore, a source shortcut (e.g., c —> d on AT2) has to be removed to
break the cycle. Without loss of generality, we assume AT2 is used to break such a cy-

cle. For example, if 2 -—) 3 on ET] is added to AT2 as a source shortcut, there will be

s—shortcut

dependency 2(v7) ——> 3(012) —> 4(013) —> 5(014) on AT2. Since there is dependency

4(02) ”skim“ 5(v7) —> 2(v8) —+ 3(09) on ATl, a dependency cycle exists and may cause

deadlock. Therefore, shortcut 2(v7) s-s’gtw‘ 3(012) has to be removed from AT2.

79

 

ATl:---b—>a—>---—>b—>~--—>d—>---—>c—>d—+~-

 

AT2:---d—+c—>---—+i’l—>---—+b—>---—+a—ib—->---

Figure 4.5: The deadlock case due to source shortcut

Figure 4.4 and Figure 4.5 have shown all the possible deadlock conﬁgurations caused by
adding shortcuts. The deadlock detection algorithm will detect them and break a deadlock
cycle by removing a shortcut. The detailed proof given in Chapter 4.6 shows that the
adaptive-trail routing is deadlock free. By routing messages along the two adaptive trails,
we guarantee deadlock—free adaptive routing. Note that a shortcut 3,,1- may have more than
one shortcut type. For example, 333 (3 —i 2) on AT] is both a source Shortcut and a

destination shortcut. In this case, 3,3]. is used for both purposes.

4.4.2 Routing Tables

As the ETR, the ATR also creates a static routing table in each switch based on the two
adaptive trails. A routing table has many entries and each entry is indexed by both the
destination switch and the incoming channel (port). Some entryI may have multiple op-
tions of outgoing channels, which are selected by Shortest path ﬁrst policy. We ﬁrst create
routing table for each trail separately; then combine them together and remove redundant
entries if any. Adjustment of the routing table may be done according to the heuristics in
Section 4.5.1.

Our scheme scans each trail beginning from the largest index to the smallest index.
When a switch is visited, different types of shortcuts are used in different ways. In order

to use a source shortcut as an outgoing channel, the incoming channel has to be a terminal

80

port. A destination shortcut SM cannot be used in an entry unless the destination is vq.

Channels c,-,,-+1 and free-style shortcuts are used without limitation. In Table 4.2, a routing

table is created in switch 5 when it (i.e., 011) is ﬁrst visited along AT] of Figure 4.3. Note
d—shortcut

that 5(1)”) —+ 0(013) and 5(2)”) d-s’fltw‘ 1(014) can only be used for destination

switches 0 and 1, respectively.

 

 

 

 

 

 

I Dst ] Incoming Channels H OutCh 1 I OutCh 2 I
0 1‘—/§'5,2—>5,tenninaipons 5—>0 5—+6
1 2 —> 5, terminal ports 5 ‘1‘? 1 5 —> 6
6 all, 5, 2 —> 5, terminal ports 5 -—) 6 none
8 1 613’ 5, 2 —> 5, terminal ports 5 —) 6 none

 

 

 

 

 

 

 

Table 4.2: Example routing table entries for a ﬁrst visited node

When a switch is visited again, a key point is to use the output channels available for
its current occurrence and previous occurrence(s) with larger index. However, we Should
avoid routing a message to any switch more than once. Consider that switch 5 (07) is
visited for the second time along AT] in Figure 4.3. The new entries in switch 5 are shown
in Table 4.3. Note that for destination switch 6, we do not allow the path 5(07) —> 2(08) —->
3(09) —+ 1(010) (”15’ 5(1)“) —> 6(012) because it passes switch 5 twice. A Shortcut offers
a shorter routing path, but it may not be available for message routing. Consider that a
message M comes from 5(07) —> 2(08) ——> 3(09) ——> 1(v10) (on AT]) and goes to switch
0. Although 1(010) d-sh—oitw‘ 0(013) is the shortest path to M’s destination, M may take
1(v10) 5) 5(1211) ——> 6(012) -> 0(013) if 1(010) (#8th 0(1213) is not available when M is
routed from switch 1. This results in passing switch 5 twice and wastes channel utilization.

Due to this reason, a heuristic in Section 4.5 .1 is suggested to handle such a case.

81

 

 

 

 

 

 

 

 

 

[ Dst Incoming Channels ] OutCh 1 OutCh 2 ] OutCh 3 I
0 4—>5,terminalports 5—>0 5—)6 5—>2
0 1 5’) 5 5 —-> 0 5 —) 6 none
1 4—>5,terminalports 5‘51 5—+2 5—>6
2 1 £3 5, 4 —+ 5, terminal ports 5 —> 2 none
3 1 £3 5, 4 —+ 5, terminal ports 5 —) 2 none
6 1 3’) 5, 4 —-> 5, terminal ports 5 ——> 6 none
8 1315,445Jerminalports 5—)2 5—r6 none

 

 

 

 

 

 

 

 

Table 4.3: Example routing table entries for a multiply visited node

4.4.3 Time Complexity

The whole algorithm is simple to implement. Let m be the number of trunk channels and n
be the number of switches in a network. As shown in [75], the time complexity of ﬁnding
an Eulerian trail, adding all shortcuts, removing free-style shortcuts for deadlock avoid-
ance, removing source shortcuts for deadlock avoidance, and creating routing tables are
0(m), 0(m2), 0(m2), 0(m2), and 0(nm), respectively. Thus, the total time complexity

is 0(m2).

4.5 Heuristics

To provide potentially good network performance, some heuristics are suggested in the

following to avoid longer routing paths and decide the selection of an Eulerian trail.

82
4.5.1 Heuristics about Degree of Adaptivity

If a switch appears more than once along an Eulerian trail, there will be more than one
outgoing channels from the switch to some destinations. Multiple routing paths from this
switch to a destination switch offers higher degree of adaptivity and redundant routing
choices for fault-tolerance. However, if a path is very long, many messages may be blocked
due to the long path. Simulation results Show that allowing long paths will cause low
network throughput and high message latency. Our heuristics consider multiple routing
options in a switch and decide whether to keep a longer routing path or not. Two heuristics

are suggested: dual-path heuristic and source heuristic.

Dual-path heuristic: Suppose switch 0 appears more than once along the trail. Let 0,-
and 12,- (j > 2') denote any two of them. If v,- and 12,- have different outgoing channels for a

destination d, the dual-path heuristic will decide if the longer path is allowed.

 

Onecaseisthatdappearsonce: —) 12,—) a —> -~—->vj —>b—>~>- —> d —-)
In this case, 0,- can go to d from either 12,- —+ b or v,- —+ a plus some shortcut(s) if any.
Suppose the later choice is used. The shortcut may not be actually used in a route if it is not
available when a message is scheduled, which results in a longer path and visiting switch
0 twice. An example is that a packet may take path 5(07) ——> 2(08) —> 3(09) ——+ 1(v10) —e>
5(1)“) —> 6(012) —> 0(013) instead of 5(07) —> 2(08) —> 3(v9) —-> 1(010) —> 0(1213) (on
ATl). Our dual—path heuristic will not keep such a path unless its length is no longer than

the path from 11,-.

Anothercaseisthatdappearstwice: ---—>v,-—>a—>---d---—>vj —+b—>-~-—+

d —> . - -. When a message comes from vﬁs incoming channel, it may use either 0, —-> a.

83

or 12,- —> b. If one path is much longer than the other, the longer path may not be good
for performance because it consumes many channel resources. Suppose the length of the
shorter path is h, our dual-path heuristic will discard the longer path if its length is greater

than 2h.

Source heuristic: There may be more than one routing paths between a source and a
destination. Suppose the shortest distance among these options is h, our heuristic will not
keep an option in the corresponding entry if its length is greater than or equal to 2h. Such
a heuristic is called source heuristic, which is only applied to terminal incoming ports. In
Figure 4.3, Since the shortest path from switch 0 to 1 is 0 -—> 1 (on AT]), the longer path
0 —-> 6 -—> 5 —) 1 (on AT2) will not be used by messages coming from a terminal port in

switch 0.

4.5.2 Selection of Eulerian 'Ii'ails

So far, we use the well-known algorithm to randomly choose an Eulerian trail. Different
Eulerian trails may result in performance difference, which has been observed in our sim-
ulations [21, 75]. In order to get a better performance, a heuristic to ﬁnd an appropriate

Eulerian trail is needed.

Given two bidirectional Eulerian trails, the following method is used to choose one of
them which may provide a better performance. Let r(u, v) be the shortest distance from u
to v allowed by the routing algorithm. Note that the adaptive trails are constructed if the
ATR is used. Let 6(u, v) = W, where dG(u, v) is the shortest distance from u to

v allowed by the network. Let A be the sum of all 6(11, 0). In general, an Eulerian trail with

84

smaller A will have a better performance. However, it is not guaranteed that the trail with
minimum A will provide the best performance. Also, it is still an open issue about how to
construct an Eulerian trail with the minimum A.

In our algorithms, we select several bidirectional Eulerian trails by varying the starting
switch and the way to select the next channel. For each bidirectional Eulerian trail, we
create the corresponding adaptive trails and compute A. An Eulerian trail with the smallest
A is selected.

We use the following methods to select the next channel for an Eulerian trail.

0 Select a channel randomly.

e A table is used to maintain the possible Shortest distance along the trail for each
pair of visited nodes and un-visited nodes. Whenever a new channel (and a node)
is visited, the table is updated. Suppose the pair of nodes 3 (visited node) and d
(uh-visited node) has the longest distance in the table. We select next channel whose

newly visited node is on the Shortest path from s to d.

4.6 The Proof of Deadlock Freedom

The deadlock freedom of ETR routing is straightforward because there is no channel depen-
dency cycle. In this section, we are going to show that the ATR algorithm is deadlock free
too. In order to simplify our proof, we need the following deﬁnition of deadlock-immune

channel which was ﬁrst deﬁned in [43].

Deﬁnition 4 A channel c is deadlock-immune for a packet M if and only if, once M oc-

85

cupies the channel c, M will eventually leave channel c and release it. For convenience,
if a packet M can never use channel c, c is also deﬁned to be deadlock-immune for packet
M. A channel is said to be deadlock-immune if and only if it is deadlock-immune for all

packets.

As shown in [43], a routing algorithm is deadlock-free if and only if every channel is
deadlock-immune. This is obvious because the channels involved in a deadlock could not
be deadlock-immune.

In the ATR routing algorithm, a packet is transmitted to its destination along ATl
and/or AT2. Although the routing tables are created separately based on each adaptive
trail, they are ﬁnally combined together to become integrated routing tables. Due to the use
of source shortcut, it is possible for a packet to Start from one trail and change its route to
another trail. Note that this change will not happen unless there is a channel on its routing
path, which is also used as a source shortcut on another trail.

Let ET] be 00 —+ 121 ——> ——> Um and ET2 be no —+ Ur —> ——> um, where
u,- = vm_,-. In the following, Cit-+1 and 3:,” represent a channel in C(ETl) and a shortcut of
AT], respectively, and cf, +1 and 31%,q represent a channel in C(ET2) and a shortcut of AT2,

respectively.

Lemma 1 If none of channels cit-+1 (z' _>_ k) in C(ETI) is a source shortcut in A72, all

channels c},,~+1’s (2' 2 k) are deadlock-immune.

Proof: Let us consider any packet, no matter it is originally routed along AT] or AT2.

There are three cases:

86

1. The packet holds a channel c3”. +1 (1' _>_ k) when it is routed along AT]. Then its route

cannot be changed to AT2 any more, because none of Ci.i+1 (i 2 k) is a source short-

1
m—1,m’

cut in AT2. If any packet is holding c the packet will arrive in destination um
and release c,l,,_1’m. After Gin—1,771 is released, any packet holding Crln—2,m—1 will go to
its destination (either cm or vm_1) and release cln_2,m_1. Eventually, cln_3,m_2, . - - ,
and ch +1 will be released by the packet because any packet holding cf“- +1 (2' _>_ k)

can always go to the destination via path 1),-+1 —+ 1),-+2 —-> - - . —> destination. There-

fore, all channels cl, +1 (1' 2 k) are deadlock-immune for such a packet.

2. The packet holds a channel oil-+1 (i 2 10) when it is routed along AT2. It means that
of“. H is used as a destination shortcut in AT2 (because it is not a source Shortcut) for
the packet. The packet will go to the destination and release the channel eventually.

Therefore, each ck,- “ (i 2 k) is deadlock-immune for such a packet.

3. The packet never takes any channel Ci,- +1 (2' 2 k). By deﬁnition, c2“- +1 (2' Z k) is

deadlock-immune for such a packet.
Therefore, c,-,,-+1’s are deadlock-immune. CI

Lemma 2 If none of channels Ci,- +1 (71 2 k) in C(ED) is a source shortcut in AT], all

channels 012,241," (i 2 k) are deadlock—immune.
The proof is Similar to that of Lemma 1 .

Lemma 3 Channel c1 is deadlock-immune.

m—1,m

Proof: If c}n_1,m is not a source shortcut in AT2, it is deadlock-immune by Lemma 1.

87

Otherwise, let 3:", is the source shortcut, which corresponds to c}n_1,m, in AT2. We
claim that none of channel egg-+1 (j _>_ q) can be a source shortcut in AT]. Because other-

wise, there will be a dependency cycle as follows, which is the same deadlock conﬁguration

shown in Figure 4.5 and should not be allowed in our adaptive trails.

 

 

AT1:50—+.--—>v,—>---—>vg---vr—>vr+r—>---vm(wher68.l,y=0§.i+r)
AT2“0"*°”"')up—l"°_)u6'°’uj—)uj+l”—l"'um(Wheresg,q=cln—l,m)

Therefore, all c§J+1’s are deadlock-immune by Lemma 2. In this case, if any packet

holds 312,", as a source shortcut, it will go to its destination via 11,, —-> uq+1 —> -—>
destination and eventually release 3,2,”. So, c}n_1,m is deadlock-immune for any packet. D

Lemma 4 Channel an—1,m is deadlock-immune.

The proof is Similar to that of Lemma 1 .

Lemma 5 For any channel ch +1 in C(ETI), if all channels cit-+1 (i > k) are deadlock-

immune, 011ch +1 is deadlock-immune.

Proof: For any packet, there are the following four cases:

1. The packet never takes channel Ci, k +1. Then by deﬁnition, cjc, k +1 is deadlock-immune

for the packet.

2. The packet holds channel ch +1 and its destination is switch ’Uk+1- In this case, the
packet will ﬁnally go to destination and free ch +1. So 0,1”: +1 is deadlock-immune

for the packet.

88

3. The packet holds Cb, +1 and requests c}c+1,k+2 in ATl. Since c},,-+1’s (i > k) are
deadlock-immune, they will be available for the packet. Therefore, the packet will
eventually go through 0H1 -—> vk+2 —> - - - —-> destination and reach the destination.

In this case, ck, k +1 is deadlock-immune for the packet.

4. The packet holds ch: H as a source shortcut in AT2. Suppose the source shortcut is
33",. Then the packet requests cg,q +1. If none of egg-+1 (j 2 q) is a source shortcut in
ATl, c? J- +1’S are deadlock-immune by Lemma 2. It means that the packet can reach
its destination via c? J- +1’S. If some 01.241 is a source shortcut in AT], Let 33w be
a source shortcut corresponding to egg-+1 in AT], there must be y > k. Otherwise,
there will be a dependence cycle as follows, which is the same deadlock conﬁguration

shown in Figure 4.5 and should not be allowed in our adaptive trails.

 

Alevo—>---——>'Ux—>--°—>’U;'°°’Ur—>’Ur+1—>°“'Um(Wheresalr,y=inn)

 

AT2uo —> —->up —) ——> 115-nu]- —-+uj+1—>-~um(wheres§,q =c%,,-+1)
Therefore, we must have y > k. Any packet holding Sim (y > k) will go to its
destination by path 12,, —) vy+1 —> —> destination because all channels c},,+1’s
(i > k) are deadlock-immune. It means that Sis (i.e. c1141) is deadlock-immune.
Therefore, the packet holding sf,” (i.e., Cir,k+1) can go to its destination via uq —>

uq+1 ——) - ~ - —> destination and release 3;"). So ch +1 is deadlock-immune for the

packet is this case.

In a word, Ci,k+1 is deadlock-immune. C1

89
Lemma 6 For any channel ci’k +1 in C(ET2), if all channels egg-+1 ( j > k) are deadlock-

immune, c2 is deadlock-immune.
k,k+1

The proof is similar to that of Lemma 5.

Theory 1 The ATR routing algorithm is deadlock-free.

Proof: Due to Lemmas 3, 4, 5, and 6, channels c1 c1

I 1
m-l,m’ m—2,m—11 ' ' ' 161,2, and C0,l and

channels c2

m_1,m,cfn_2,m_1, - - - ,ciz, and 00,1 are proved to be deadlock-immune one by

one. Therefore, the routing algorithm is deadlock-free. Otherwise, some channels must not
be deadlock-immune because channels involved in a deadlock cannot be deadlock-immune.

E]

4.7 Performance Evaluation

This section shows some typical simulation results based on the simulation environment
described in Chapter 6. Let a: be the time duration between the time when the current
message has completely left the source buffer and the time when the next message is gen-
erated in each node. Thus, a: is a random variable based on an exponential distribution. To
demonstrate the performance behavior of different routing algorithms, we run the simula-
tions under different workloads by varying the value of a: from large to small. The smaller
the a: is, the heavier the workload is. To make a fair comparison, we use the same :1: for all
algorithms under the same combination of network parameters. Note that the actual effec-
tive network workload, which depends on both a: and how fast a message leaves its source
in each switch, may not be exactly same for different algorithms. However, this should

be enough to evaluate the performance behavior of each algorithm. For each run, at least

90

30,000 messages have been received and the error rate on average message latency is less
than 5%. The ﬁrst 1,000 messages are passed to eliminate start—up transient effect. The la-
tency/throughput curves are used as our performance curves. In order to Show the fairness
ratio, we use a star to mark each point on the performance curves where the fairness ratio

is grater than 2 for the corresponding workload.

4.7.1 Network Workload and Parameters

Two destination distributions (i.e., trafﬁc pattern) are considered:

1. Uniform trafﬁc: a node uniformly communicates with any other node. Two nodes

connected to the same switch may communicate with each other.

2. Client/server trafﬁc: k nodes are servers and all the other nodes are clients. For a
client, a certain rate (75% here) of trafﬁc is sent uniformly among those servers and
the rest of trafﬁc is sent uniformly among those clients. A server sends messages to

other clients and servers with uniform distribution.

In a real network, a client may not uniformly communicate with all servers; instead,
it may communicate with a speciﬁc server more frequently. However, there are too many
cases in such a non-uniform client/server pattern. For simplicity, we choose the above
model. In terms of k, we use 4 or 5 servers out of 75 nodes in our simulations.

We use the following two message Size distributions:

1. Fixed size messages: the message size is always 1000 bytes.

91

2. Bursty messages: bursty trafﬁc happens at a Speciﬁc rate and bursty size is 10,000

bytes. Other messages are 100 bytes long. The bursty rate is 5% in this chapter.

The ﬁxed message size and the uniform trafﬁc distribution are combined to demonstrate
the maximum sustained throughput or worst case latency. A client/server distribution is
more realistic than uniform distribution and useful to model network hot Spots. In actual
network applications, the bursty message distribution and the hot-spot destination pattern

are commonly observed.

A time unit is deﬁned as the time needed to transmit a byte via a channel, which is
decided by channel bandwidth. We assume the header for each message is 10 bytes. Thus,
the switch delay for a routing decision is 10 time unit (i.e., the input buffer should have at
least 10 bytes to hold the complete header ﬁeld). The input FIFO buffer capacity is 100
bytes by default. However, we also measure the performance of 1000 bytes buffer capacity

to Show the effect of buffer Size.

In a wormhole switch, it is not necessary to limit the size of a packet smaller than
the input buffer capacity. We assume variable size packets and consider each message as
a packet. Because the ATR uses escape channels to avoid deadlock, a buffer cannot be
assigned to a new packet unless it is empty. However, it is not necessary for the ETR, SRA,
SRD, UDR, and MUDR. Thus, the SRA, SRD, UDR, and MUDR can take more beneﬁts of
large buffer capacity since a single buffer can hold more than one packet. But this beneﬁt
may not be signiﬁcant if the buffer capacity is much smaller than the message size (e.g.,

1000-byte messages and 100—byte buffer capacity).

92
4.7 .2 Network Topologies

The topologies used to measure network performance are Shown in Figure 4.6. In order to
make a rather fair comparison, all topologies have the same number of switches and nodes.
There are 10 switches and 75 nodes in each network. The number of channels, the node
distributions, diameters, and average path length are shown in Table 4.4. We assume that
all switches have the same number of bidirectional ports and the number of used ports in
each switch is similar. Open ports are left for future usage. Based on the node distribution,

each switch should have at least 12 ports.

 

 

Figure 4.6: Six topologies considered in Simulation

The path length from node a to v is deﬁned as the number of trunk channels traversed by
the path. For nodes 11 and v in the same switch, the path length between them is deﬁned as

0. The average path length is the average length over all pairs of nodes of the shortest path

Edam,“ pathJength(i,j)
number..of.nodes2

 

between nodes, which is deﬁned as . Note that the average length of

93

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Switch Number of Nodes

A B l C D I E ] F

0 7 7 7 3 3 8

1 8 8 8 8 9 7

2 8 7 8 8 9 8

3 8 8 7 8 8 8

4 7 8 8 8 9 8

5 8 8 7 8 9 7

6 7 7 7 8 9 7

7 7 7 8 8 8 7

8 8 8 8 8 8 8

9 7 7 7 8 3 7

total nodes 75 75 75 75 75 75

total trunk channels 20 20 20 18 17 19

diameter 2 3 3 2 2 3
avg. path length 1.399 1.442 1.50 1.510 1.631 1.454

 

Table 4.4: The network conﬁgurations for simulation

paths allowed by routing may be longer than the average path length in the network due to
the deadlock-avoidance and adaptive routing. A good routing algorithm should keep the
average length of routing paths as small as possible. Although A, B, and C have the same
number of trunk channels for each switch, they have different diameter and the average
path length. The number of trunk channels are unevenly distributed among switches in D
and E. Channels in F are basically evenly distributed among switches except that switch 9
has only two trunk channels, which may cause bottleneck for trafﬁc coming from or going

to switch 9.

To avoid showing too many curves in a ﬁgure, we do not show the performance for
the ETR routing. As shown in Figure 4.7, the performance trend is that the ATR provides
better performance than the ETR if the ETR provides a much longer average routing path
length than the ATR. However, it the ETR can allow most of the shortest paths, it could

have as good performance as the ATR.

94

 

 

   

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1400 Toplogy A: uniform traffic Toplogy C: uniform traffic
12m L I I I I I J 14m p I I I I I q
1200 - m j: .
E 1000 r- -4 5‘
ﬂ 5 1000 - ~
3 80° ’ ‘ 3 300 . .
O 0
3 60° ' ‘ E 600 ~ ,. .
0 ,e (D
3 400 ~ - 2 400 - ~
200 '- " 200 r-
0 1 1 1 r 1 0 1 1 1 1 1
0.04 0.08 0.12 0.16 0.2 0.04 0.08 0.12 0.16 0.2
Normalized Network Throughput Normalized Network Throughput
Toplogy A: client/server traffic Toplogy C: client/server traffic
4000 1 a r . 1.. 4000 .
3500 ~ ETR ' 3500 - ~
g 3000 ~ Am 3: g 3000 » -
C
9 2500 - 2 2500 r -
3 2000 - 3 2000 L
o o -
a a
g 1500 - § 1500 — -
< 1000 - < 1000 ~ «
500 - - - - 500 ~ .
O 1 T . 1 1 r O r
0.04 0.06 0.08 0.10 0.04 0.06 0.08 0.10 0.12
Normalized Network Throughput Normalized Network Throughput
2.9in

Figure 4.7: Comparisons of ETR and ATR under virtual cut—through switching

In the following, we are going to compare performance between ATR, UDR, MUDR,
SRA, and SRD routing algorithms. The performance of UDR and ATR depends on the
selection of the root switch and the selection of Eulerian trails. For the ATR, we adopt all
the heuristics in [21] and select one bidirectional Eulerian trail which minimizes a heuristic
function based on average path length [21] among three randomly selected bidirectional
Eulerian trails. For the UDR, we select the switch with maximum number of trunk chan-
nels as the root (e.g., switch 0 is the root in D). If more than one switch have a highest
degree, we construct three different trees with different roots and Show the one with the best

performance in the paper. To assign the direction for a channel whose end switches are at

95

the same level, we assume that the ‘up’ direction is always from the switch with a smaller
id to the switch with a larger id. If the above channel direction is assigned in a random way,
the performance may be better but may not have a big difference. The MUDR selects the

same root as the UDR in the same topology.

4.7 .3 Uniform 'h'afﬁc

In order to fully understand the effect of topologies and routing algorithms to network
performance under uniform trafﬁc, we have measured the network performance for all the
six network topologies with ﬁxed message size distribution and with bursty message size
distribution. Figure 4.8 shows the performance curves with ﬁxed message Size distribution.

It is observed that most performance curves reach their maximum sustained throughput
under a certain workload. Then, the throughput will be decreased but the latency is still
increased when the workload is further increased. For those curves which have not shown
such behavior, they will do so if some smaller :1: is given. Because channel contentions are
very serious under such over-saturated workloads, many messages have to wait for a long
time before they are able to use free channels, which causes higher transmission latency
and lower network throughput. In a real network, a good ﬂow control mechanism Should
be applied to avoid the over-saturated workloads. The same range of a: is used for all
topologies except for E, where three additional smaller values of a: are applied in order to

Show the trend of performance.

Average Latency

Average Latency

Average Latency

96

A: uniform traffic and fixed size messages

9000
8000
7000
6000
5000
4000
3000
2000
1 000

 

 

3

r

4

I

5
Network Throughput (bytes per time unit)

I

  
    

6 7

I I r II I I

ATR ~— ‘
SRA -+--- .
SFID .......
UDR —~-—- -
MUDR 4" .

f Unfair -

+
1
1
t
l
1
.‘C'

.g”

."a

   

1 1 1

8 9101112

 

C: uniform traffic and fixed size messages

10000

9000
8000
7000
6000

3000
2000
1000

Figure 4.8: Performance under uniform trafﬁc with ﬁxed message distribution

 

 

3

 

 

 

4
Network Throughput (bytes per time unit)

ATR -o—
SFIA -+ ‘
SRD ..9.... .1
UDR +—
MUDR 4" ‘
Unfair I .
5 6 7 8 9 1011 12

E: uniform traffic and fixed size messages

 

 

3

4

I

I

I

I

Unfair I

 
  
 
   

ATFI +—
SRA -+---
see a
one __,.....

   

I

I

 

 

5

6 7 8 9
Network Throughput (bytes per time unit)

10 11

12

Average Latency

Average Latency

Average Latency

B: uniform traffic and fixed size messages

 

 

3

4

 

5

I I I I I I

ATR -o—
SRA -+--
SRD .9...
UDR -+—-
Unfair I

67891011

 

12

Network Throughput (bytes per time unit)

D: uniform traffic and fixed size messages

 

 

3

5

I I I

MUDR
Unfair I

4 6 7 8 9
Network Throughput (bytes per time unit)

 

1

1

10 11

 

12

F: uniform traffic and fixed size messages

 

 

3

I

4
Network Throughput (bytes per time unit)

I

T

I .f I I
1

 

5

6789

ATR «—
SRA -+---
see am-
one «-

Unfair -

10 11

i

 

12

97

The Effect of Topologies

As shown in Figure 4.8, the network performance is varied under different topologies. If
we consider both the average latency and the network throughput, A is probably the best
choice among the six networks although E has higher throughput for some algorithms
under very heavy workload and delivers the best performance for the UDR algorithm.

The reason that A delivers a better performance is multi-folded. First, A has the shortest
average path length. Second, A’ 3 channels are evenly distributed and more than one shortest
path exist between most pairs. Thus, channels are shared among trafﬁc and most trafﬁc
may use a shortest path to deliver messages, which reduces channel contention in heavy
workload. Although B and C have the same number of trunk channels in each switch as A,
they both have worse performance than A. There is no simple explanation for why B and
C perform worse. Besides that B and C have longer average path length, one fact is that
B and C both have a bisection‘ of only six channels. F does not show surprise behavior
given its topology and number of trunk channels. Note that switch 9 in F has only two
trunk channels which may become a bottleneck.

An interesting fact is that E has a relative better performance, especially under heavy
workload, although it has only 17 trunk channels. In E, most of the shortest paths could be
allowed in a routing algorithm without causing deadlock. Thus, the average path length of
routing in E could be very close to the average path length of the network, which may not
be true for the other ﬁve topologies. Moreover, each channel in E is used to route messages

between a certain pairs. For example, channel 2) —+ 0 may be only used for messages from

 

lBisection is deﬁned as the minimum number of channels to be removed in order to cut a network into
two disjoint subnetworks with equal number of switches in each subnetwork.

98

switch 2) to another switch. Such a feature is likely to provide better ﬂow control and reduce
channel contention under heavy workload.

Although D also has a central switch connected to all other switches, it shows different
performance curves than E. In D, channels between the central switch and non-central
switches are heavily used. Although there are extra channels between two non-central
switches in D, those extra channels do little to increase throughput because they are not
well shared among trafﬁc. The unbalanced channel utilization causes poor performance of

D.

Adaptive Routing vs. Deterministic Routing

It is interesting to compare the performance of the SRA and SRD. Both of them are de-
veloped by the same philosophy and have been made to balance channel utilization under
uniform trafﬁc. For all the six topologies, the SRA achieves lower latency under light work-
load but suffers from worse performance under heavy workload. This is not a surprising
observation because adaptive routing can provide an alternative routing path and take ad-
vantage of multiple paths between pairs under light workload. However, adaptive routing
may take a longer path when shorter paths are not available. The misrouting can cause
more channel contentions under heavy workload and therefore, result in poor performance.
On the other hand, deterministic routing, especially trafﬁc balanced deterministic routing
like the SRD, always selects a unique and shorter path for a source-destination pair. Thus, a
message has to wait on a busy channel while a free alternative path exists, which results in
higher latency under light workload. Under heavy workload, such a disadvantage is gone

because all channels are busy on transmission and a shorter path is the best choice.

99
The Effect of Routing Algorithms

Although these routing algorithms have different behavior under different topologies, there
are still common behavior for each algorithm. As we mentioned above, the SRD provides
the best performance under heavy workload for most of topologies. With the time complex-
ity of balancing channel utilization, the SR achieves good performance in most of uniform

cases.

The ATR is much more simple than the SR, but it provides very close (even better)
performance to the SR in all topologies except E. The good performance of ATR is because
that it uses Eulerian trails and shortcuts to provide reasonable and shorter paths. The reason
of worse performance in E is due to the way to avoid deadlock. The ATR uses two opposite
Eulerian trails to avoid deadlock. For E, some longer routing paths along the Eulerian trails
have to be kept in order to avoid deadlock. Those longer paths consume more channel

resources and may cause poor performance.

The UDR provides a worse performance in most of topologies except E. First, it may
concentrate trafﬁc near the root switch. If the root switch does not have enough trunk
channels to transfer those trafﬁc, serious contention may cause poor performance, which
is the case for A, B, C, and F. Second, the UDR allows a routing path as long as it never
uses any ‘up’ channel after using ‘down’ channels. This may allow very long routing paths.
For example, in E, 3 —+ 4 —> 5 —> 7 —> 0 may be a legal path, if switch 0 is the root and
4 —> 5 and 5 —) 7 are considered as ‘up’ channels. It is the second reason that makes the
performance difference between MUDR and UDR. The reason that the UDR performs very

good in E is the avoidance of the above two shortcomings.

100

Fairness

It is observed that the fairness ratio in some routing algorithms may be over 2 after the net-
work trafﬁc is heavier than a certain point. The fairness is impacted by both the topologies
and the routing algorithms. For A, B, C, and F, the UDR shows higher fairness ratio than
the SR and ATR under same workload. This is because that the UDR favors the nodes at
the root switch. In terms of the topologies, A’ s and E’s fairness ratio is seldom over 2
comparing with other topologies under the same workload.

When the network is overloaded, it may lose the fairness for terminal nodes and suffer
from decreasing performance. Thus we should control the network trafﬁc under the over-
saturated workload. The wormhole switching can partially control the network workload.
When the workload reaches some saturation region, where increased latency and decreased
throughput are observed, the serious channel contentions will block messages and eventu-

ally avoid the injection of more messages in our ﬁnite source model.

Bursty Messages

As shown in Figure 4.9, the performance comparison under the bursty messages is similar
to that under the uniform trafﬁc with ﬁxed size messages. This is because that they both
have the same destination distribution. When long messages are transmitted in a network,
some channels are occupied by these long messages and block many other messages, which
causes higher network latency and lower network throughput for all topologies. The SRD
causes higher latency because it has to wait on a busy channel for a long time while that

channel is occupied by a long message.

Average Latency

Average Latency

Average Latency

1% —I I— I I I I I
9000 - .
8000 i ATR .._. . ~
7000 ~ SRA “T“ -
SRD -----a

6000 i MUDR +-— ,r .
5000 __ Unfair ' 1: ”‘9 .1
4000 L w} «
aooo - -
2000 - 1
1000 .

1 0000
9000
8000
7000
6000

5000-

4000

2000
1000

1 0000
9000
8000
7000
6000
5000
4000
3000
2000
1 000

101

A: uniform traffic and bursty messages

 

 
 
     

 

 

 

 

2 2.5

C: uniform traffic and bursty messages

 

I I I I I I I
I

I

|

  
   
   

- ATR _._.
SRA -+---
SRD -a-
- MUDR +-
Unfair I

  

 

 

 

3 3. 5
Network Throughput (bytes per time unit)

2 2.5?

E: uniform traffic and bursty messages

 

 
 
     

 

i ATR *—

- SRA -+--- , ~
. sap ...- g”. .
UDR +- ,3:

‘ Unfair . 2,2?" 4
.- . " ,f/ .4

 

 

2 2. 5 3 3. 5 4. 5 5
Network Throughput (bytes per time unit)

3 3.5 4 4.5 5 5.5
Network Throughput (bytes per time unit)

Average Latency

Average Latency

4.5555

Average Latency

5.5

1

0000

8000
7000
6000
5000
4000
3000
2000
1 000

10000

9000
8000
7000
6000
5000
4000

2000-

1000

B: uniform traffic and bursty messages

 

 

 

ATR 4— -
SRA -+~—- .
SRD ----a
UDR ~*-—~ «
Unfair I

 

i l J

 

2

2.5
Network Throughput (bytes per time unit)

3 3.5 4 4.5 5 5.5

D: uniform traffic and bursty messages

 

ﬂ

--«»“

 

f

I ﬁﬁ

  
 

ATR _._ -
SRA -+--— ‘
SRD --~--a
UDR ~~x~~~ «

Unfair I

1

 

 

2.5

33.5 4555.5

2 4
Network Throughput (bytes per time unit)

F: uniform traffic and bursty messages

 

I

 

 

  

fﬁ

ATR -o— .
SRA -+--- .
SRD ”+2-
UDR «w- «
Unfair I

 

1 1 1

 

2.5

33...5445555

2
Network Throughput (bytes per time unit)

Figure 4.9: Performance under uniform trafﬁc with bursty message distribution

102
Buffer Size

In order to demonstrate the effect of buffer size to performance, we measure the perfor-
mance by using larger buffer capacity with 1000 bytes. As shown in Figure 4.10, large
buffer capacity does improve the performance for all the algorithms under uniform trafﬁc
with both ﬁxed size messages and bursty messages. This is because that a larger buffer
will hold more bytes and reduce the number of occupied channels by a blocked message.
Note that the UDR and SR gain more performance beneﬁt from larger buffer than the ATR,
especially for bursty trafﬁc. This is due to the limitation of empty buffer required by the
ATR. For bursty message distribution, a 1000-byte buffer may hold up to 10 IOO-byte mes-
sages for UDR and SR, while these 10 messages have to occupy 10 different channels for
the ATR. Such an improvement can be observed more or less in other topologies.

Further simulation results show that the beneﬁt of larger buffer space will not be signif-

icant after the buffer size is big enough to hold the maximize message.

4.7 .4 Client/Server Trafﬁc

The performance under client/server trafﬁc is shown in Figure 4.11. Let us look at Fig-
ure 4.6 again. Four servers are assumed in each network shown in Figure 4.6, where two
server nodes are located at each of the shadowed switches with wider border and one server
node is located at each of the shadowed switches with thinner border.

Under client/server trafﬁc, the routing algorithms may have different performance be-
havior than that of uniform trafﬁc. For example, the ATR has the worst performance under

uniform trafﬁc in E, but it does provide a better performance than the UDR and the SRA in

Average Latency

Average Latency

the selected client/server model. The reason is that many messages in such a client/server
model will go to switches O and 9 and the ATR does not allow very long routing paths for
these two destinations. Thus, messages to O and 9 will not waste channel utilization caused

by long paths. Such an observation shows that we have to consider different trafﬁc patterns

8: uniform traffic and fixed size messages

 

 

 
   

 

 

 

 

  

12“ I I T I I I I I I
“ii
10000 - ,3, . ATR(100 +— <
‘ AJ31(000 -+---
_ 6.1 100 ......) .
8°00 “- uonuooo -——~
Unfair -
6000 .
4000 \ ~
\\
‘r
2000 _ 4' .
3 4 5 6 7 8 9 10 11 12
Network "Throughput (bytes per time unit)
8: uniform traffic and bursty messages
18“ I I I I I I I I
_ ATR(100 4— .
16°00 ATR(1000 -+-- \
14°°° * Mattias i i
+.—.
12000 i' ungair I 'i
10000 - ,3, -\ >‘ J
8000 - é.“ , «
eooo - 2 4
4000 " '39 '1
2m ”8 l 1 l l 1 q

 

 

      

 

2 2.5 3 3.5 4 .5 5 5.5

6
Network Throughput (bytes per time unit)

103

Average Latency

Average Latency

B: uniform traffic and fixed size messages

10000

8000

6000

4000

2000

18000
16000
14000
12000

 

 

 

3

r

4

5

I

  
   
 

6 7

I

‘\
‘8
\

I

8910111

I

SRA(100 *—
SRA 1000

SRD(1000 +3
Unfair

 

I

.9....

I

2

 

Network Throughput (bytes per time unit)

8: uniform traffic and bursty messages

 

F

b

b

 

I

I

  

I

I

I

1

I

I

 

 

SRA(100 +— ‘ ~
SRA1000 -+--
SR (100 am- ‘
SRD(1000 ~*-~ \ .
Unfair I i
,i
.4: j
2 2.5 3 3.5 4 4.5 5 5.5 6

Network Throughput (bytes per time unit)

Figure 4.10: Effect of buffer size under uniform trafﬁc

and workload when we evaluate the performance for a network.

Average Latency Average Latency

Average Latency

A: client/server traffic and fixed size messages

104

B: client/server traffic and fixed size messages

 

 

 

16” I I I ﬁ I I I I I I
14000 ~ 4 14000 - ~
12000 - ATR +— I « 6‘ 12000 - ~
SRA -+--- '-. 5
10000 - SRD .....a ,- - 3 10000 L ATR +—- -
MUDR «~— 9 SRA '+-“
8000 - Unfair . up « § 8000 - 833 We ~
6000 - {p - g 6000 " Unfair - '
4000 ~ ~ 4000 - «
2000 - . . 2000 L . . J
2 2. 5 3 3. 5 4. 5 5 5. 5 2 2.5 3 3.5 4 4.5 5 5.5
Network Throughput (bytes per time unit) Network Throughput (bytes per time unit)
C: client/server traffic and ﬁxed size messages 160000: client/server traffic and fixed size messages
14000 ~ 4 14000 ~ ATR 11' 11 ~
—0— 0
12000 - ~ E 12000 . SRA ,+_-_ ’8 : .
_ . m SRD -0-- 6 °
10000 3 10000 - 1611?? ...... :1 1 -
3000 r ‘ g 8000 - "a” 1 -
6000 L" ‘ a); 6000 _ ..
4000 . ‘ < 4000 - -
2000 1 « 2000 _ .
2 2.5 3 3.5 4 4.5 5 5.5 2 2.5 3 3.5 4 4.5 5 5.5
Network Throughput (bytes per time unit) Network Throughput (bytes per time unit)
E: client/server traffic and fixed size messages F: client/server traffic and fixed size messages
I I I I I I I i I T I ﬁ
14000 - f' f - 14000 ~ 1,
12000 ~ 1' g « a 12000 - ATR 4—\ -
,1 5 SRA -+---
10000 - ATR _._ {1 , 1 3 10000 - sap ...... .
SRA -+--- 1‘ i J UDR +—- '1, .
8000 - 333 9....1‘ 1 § 8000 r Unfair - 1‘ 1
6°00 “ Unfair - i 1 1 E’ 6000 " i." ‘
4000 - J .1 - < 4000 - .i -
2000 - ’ . 1 ‘ 2°00 ' ; ------ . . . . . ‘
2 2.5 3 3.5 4 4.5 5 5.5 2 2. 5 3 3. 5 4. 5 5 5. 5

 

    
     

  

  
 
     

 

 

 

 

 

 

 

 

 

 

 

 

 

 
  
   

   

 

 

Network Throughput (bytes per time unit)

 

   

 

 

 
      
    

 

 

   

 

 

 

Network Throughput (bytes per time unit)

Figure 4.1 1: Performance under client/server trafﬁc

105

The Effect of Topologies

In a client/server trafﬁc pattern, the output contention of servers is the main bottleneck for
performance. Also, the trunk channels near the servers will be contended among messages.
E provides the best performance in our selected client/server model, because its topology
favors messages to switches O and 9 and provides a good performance when most messages
are destined to switches O and 9. However, the output contention limits its performance.
How to reduce the output contention is a very important issue under client/server trafﬁc.
One way is to use multiple ports for servers, but multiple network adapters and network

addresses are needed, which is a complicate design in practice.

The Effect of Routing Algorithms

As shown in Figure 4.11, the performance difference of the four algorithms under
client/server trafﬁc shown in Figure 4.11 is not as signiﬁcant as that of uniform trafﬁc
shown in Figure 4.8. This is due to the limitation of output contention. An interesting thing
is that the SRD delivers the best performance for E but the worst performance for D. This
is due to the locations of servers. The SRD is designed for uniform trafﬁc and it cannot ad-
just routing based on the trafﬁc pattern due to its deterministic feature. Thus, it may cause
poor performance because it cannot deal with some speciﬁc trafﬁc pattern. The SRA does
not show such extreme performance although it is also designed based on uniform trafﬁc.
This is because that the SRA has more ﬂexibility to deal with various trafﬁc patterns due to
its adaptive feature. A routing algorithm, which is based on uniform trafﬁc and performs

well under uniform trafﬁc, may not provide good performance under client/server trafﬁc.

106
Locations of Servers

The number of servers and the locations of servers both inﬂuence the network performance.
In Figure 4.12, we show their effects to performance in A and D, where three different
models shown in Table 4.5 are used to study the effect for A and D, respectively. We
select the SRD and the ATR to show the performance of A and D, respectively. Similar

performance is observed in other routing algorithms and other topologies.

 

Models Server location Server location
switch ids in A switch ids in D

Model 1 2 3 8 9 - 0 4 5 7
Model 2 2 3 8 9 l 0 0 4 7
Model 3 2 2 2 8 8 O 0 0 7

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 4.5: Three models to study the effect of server locations for A and D

A is a network where trunk channels are evenly distributed among switches. Since
no switch has extremely high degree, it is better to distribute the servers among different
switches if possible. In Figure 4.12, we compare the performance under three client/server
models: model 1 has four servers located at four switches, model 2 has ﬁve severs located
at ﬁve switches, and model 3 has ﬁve servers located at two switches. It is not surprising
that model 2 achieves the best performance. Model 3 has the worst performance because it

concentrates the trafﬁc near two switches and channel bottleneck becomes serious.

D is a network where switch 0 has full connection to all other switches. Servers should
be allocated at switch 0 in order to use those channels for it. Again, we compare three mod-
els: model 1 has four servers located at four switches with one server at switch 0, model 2

has four servers located at three switches with two servers at switch 0, and model 3 has four

107

servers located at two switches with three servers at switch 0. As shown in Figure 4.12,
while model 1 shows the worst performance, model 2 shows the best performance under
light workload and model 3 has the best performance under heavy workload. When three
servers are located at switch 3, there is no client node in switch 3, which cannot take advan-
tage of short paths to servers. That is why model 3 delivers higher latency in light workload.
When the workload becomes heavier, model 3 gains a little bit better performance because
it has the highest bandwidth to release trafﬁc near servers. However, because output con-
tention is the major bottleneck in a client/server environment, the performance difference
among these three models are little.

Note that the performance we have shown is based on our uniform client/server model.
If a group of nodes tend to communicate with a speciﬁc server, that server should be located

close to the group of nodes.

Fairness

A star is used to mark each point where the fairness ratio is over 2. Under client/server
trafﬁc, we have distinguished the server class and the client class and measured the ratio
for the two classes individually. The marks in the ﬁgure show the unfair cases for the client
class. Because servers send messages in a uniform pattern, the fairness ratio of the server
class is much less than that of the client class under same workload. However, the ratio of
the server class can still be over 2 when workload is very heavy.

Given a range of workloads, an observation is that there are more unfair cases under
client/server trafﬁc than that under uniform trafﬁc. This is due to the different distance from

clients to servers. For those clients located at the same switch as one server, messages from

108

A: client/server traffic and fixed size messages

 

I I 1 T I I

 

 

 

 

 

14000 ~ 4
> 12000 . ~
g SRD-modeii —.— x
9 10000 - SRD-modelz -+--- 11 ~
3 SRD-model3 We ' 11:

a 3000 ' Unfair I ‘1', J
“’ i

§ 6000 ' 1 d
< 4000 - .1 -

2000 - ~

2 2.5 3 3.5 4 4.5 5 5.5
Network Throughput (bytes per time unit)
D: client/server traffic and fixed size messages
14000 - I I I j *I *I d
I
> 12000 - i." .
g ATR-model1 -o— 1i
3 10000 - ATR-model 2 -+--- 1,115 -
3 ATR-modela --a--- :1
81 8000 . Unfair x 1
g 6000 1 ' 4
< 4000 - 4
2000 - -

 

 

 

 

2 2.5 3 3.5 4 4.5 5 5.5
Network Throughput (bytes per time unit)

Figure 4.12: Effect of server locations and number of servers

them to the server do not suffer from trunk channel contention, and so more messages are

sent out from these clients.

Buffer Size

The effect of buffer size under client/server trafﬁc is shown in Figure 4.13. Note that there
is no big difference between small buffer capacity and big buffer capacity. This is because

that the output contention and the channel contention near the servers are main bottlenecks

109

in such an environment. To improve the performance under client/server trafﬁc, output
contention should be considered and multiple ports may be used for a server to reduce

output contention.

D: client/server traffic and fixed size messages

 

 

 

    

 

 

16000 a . . . .
14000 ~ and: 1
5 12000 - ATR(100 .._ S; '1 «
5 ATR(1000 -+--- $41
3 ...... M11111 ‘
1000 ~-x-~-— fix .
81 3°00 ’ Unfair .. a; . 1
«.2 ff 1
E 6000 - , .
4000 - ~
2000 - I L I l l -
2 2.5 3 3.5 4 4.5 5
Network Throughput (bytes per time unit)
D: client/server traffic and fixed size messages
16000 I f I 7 i I
1.1851
14000 f SRA(100 +— g '
12000 L SRAUOOO "1"" 0111 . .,

$110000 ...... e: |
10000- 811011000 +- at" 1 -
Unfair 1- 1;» 5

8000 -
6000 -
4000 .
2000 -

2 2.5 3 3.5 4 4.5 5
Network Throughput (bytes per time unit)

Average Latency

 

 

 

 

Figure 4.13: Effect of buffer size under client/server trafﬁc

Chapter 5

Network Planning and '111ningin

Switch-Based LAN 8

This chapter studies network planning and tuning issues in switch-based LANs [77]. Our
purpose is not to address an optimal solution for network design, which can be an NP-
complete problem [20]. Instead, we propose a simple method for initial topology design
and an incremental method for network tuning and reconﬁguration after the network be-
comes operational. To provide low message latency and high network throughput, our
criteria are to minimize routing path length, maximize trafﬁc locality, reduce inter-switch
trafﬁc, and balance channel utilization. For the initial topology design, we use block design
to efﬁciently minimize routing path length by covering as many switch pairs as possible
in blocks. For incremental reconﬁguration and network tuning, we propose a systematic
method to deal with various situations in a network.

In this chapter, we ﬁrst describe general considerations on topology design and then

present design variables and analytic models used in our method for network planning and

110

111

tuning. A simple method based on block design is proposed for initial topological design,
where the number of switches, the number of computers, and the estimated trafﬁc pattern
are given. We also propose solutions to handle various changes in network environment and
use simulation results to demonstrate the performance improvement after network tuning.
Finally, we present some preliminary results for the evaluation of our topology design

algorithm.

5.1 Design Considerations of Switch-Based Networks

A switch-based network consists of a number of switches. Each switch has a certain num-
ber of ports and it forward packets from input ports to output ports. Switches are inter-
connected via their ports. A node, which can be a workstation, a multi-processor system,
or a gateway to another network also connects to a switch port. Although there are differ-
ent switching techniques (e.g., store-and-forward and cut-through), our method does not
assume any speciﬁc switching technique.

The topology design and node placement in a switch-based network can signiﬁcantly
inﬂuence network performance. To achieve good performance, we need to consider the
LAN trafﬁc characteristics mentioned in Chapter 2.5. Trafﬁc locality, workgroup, and

client/server trafﬁc should be exploited in the design of switch-based networks.

5.1.1 Routing Path Length

In general, a high-performance network will have as low an average path length as pos-

sible. Longer paths mean higher transmission latency and more channel bandwidth are

112

consumed per packet sent, resulting in lower network throughput and higher message la-
tency in most cases. Thus, when designing a network topology, a primary consideration
should be minimizing the average path length and the maximum path length.

Given the number of switches, ports per switch, and the required number of nodes, it
is, in general, a difﬁcult problem to ﬁnd a network with a minimal average path length.
However, there are still some basic ideas that can make it easier. In general, each channel
should run between a unique pair of switches; extra channels between a particular pair of
switches does not decrease the path length, and almost always more performance can be
gained by connecting two otherwise unconnected switches. Connecting the long-distance
switches with a new channel is usually a better idea than connecting two short-distance
switches.

In addition to minimizing average path length, we have to ensure that there is no obvious
bottleneck. Such a bottleneck might be a single switch connected with only one or two
channels; generally, the channel connections should be fairly uniformally spread among
the switches. We also have to ensure that the bisectionl is not too small, because such a

‘channel cut’ is usually a performance bottleneck.

5.1.2 Local ﬁafﬁc, Remote ﬁafﬂc and Inter-Switch Trafﬁc

We deﬁne trafﬁc within a switch as ‘local trafﬁc’ and trafﬁc between switches as ‘remote
traﬁic’. ‘Inter-switch traﬁic’ is used to represent trafﬁc between any two switches. To efﬁ—

ciently utilize channel bandwidth, an important design principle is to maximize local trafﬁc

 

lBisection is deﬁned as the minimum number of channels to be removed in order to cut a network into
two disjoint subnetworks with equal number of switches in each subnetwork.

113

and minimize remote trafﬁc as well. Exploiting trafﬁc locality will beneﬁt intra-workgroup
communication and efﬁciently reduce average path length. Small average path length usu-
ally means less trafﬁc passing through intermediate switches and more efﬁcient utilization
of the network bandwidth. Therefore if a signiﬁcant fraction of trafﬁc is transmitted among
a set of nodes, we should consider to place them on the same switch if possible, or on two

switches connected by at most one hep.

5.1.3 Balance Channel Utilization

The message delay Th on channel 11: can be calculated based on a standard M / M / 1 queuing

model:

Tk = 1 (51)

 

where tk is the time required to transmit a byte on channel k, m is the average message size
in bytes, and 11,, is the utilization of channel k.

Equation 5.1 means that Tk will be very large if 11,, is close to 1. Our goal is to keep
mammk) as small as possible because the channel with maximum utilization is usually the
trafﬁc bottleneck. To reduce maa:(11,,), it is important to reduce total inter-switch trafﬁc
and balance channel utilization. There are three ways to balance trafﬁc: change topology,
change routing, and balance inter-switch trafﬁc. Changing topology, which should be in-
tegrated with routing algorithm, usually needs many iterations to achieve a near-optimal
solution. On the other hand, it is usually efﬁcient to redirect trafﬁc from overloaded chan-

nels to under—loaded channels by taking alternate routing paths and balancing inter-switch

1 14
trafﬁc.

5.2 Model Description

This section considers the basic parameters and derives design variables for topology de-
sign and incremental reconﬁguration. To design network topologies, ﬁxed routing is gen-
erally assumed because it is easy to evaluate channel ﬂows as a function of routing tables
and trafﬁc distribution [17]. Although we assume ﬁxed routing algorithms to simplify our
analysis, our idea can also be applied to adaptive routing algorithms, which can provide

routing ﬂexibility and adjust network utilization.

5.2.1 Basic Parameters

The design of an appropriate network topology can be a complex undertaking, which con-

cerns some basic parameters. In this paper, we deﬁne the following given parameters:

N: the total number of nodes in the network.

S (or s): the number of switches.

c,-: the number of ports in switch z'.

m: the bandwidth (bytes/sec) of channel k (each unidirectional channel has a unique

id 1:).

To design or tune a network topology, we deﬁne two trafﬁc distribution matrices to

represent trafﬁc patterns and workloads in the corresponding network. Speciﬁcally, we

115

deﬁne a N x N trafﬁc matrix X, where mu,” represents the trafﬁc rate (bytes/sec) from node
11 to node v, and a S x S trafﬁc matrix Y, where yid- represents the trafﬁc rate (bytes/sec)
from switch 1' to switch j. X and Y reﬂect both the message generation rate and message
size and they could use peak rate, average rate or a value in between depending on the
design requirement. Note that Y is actually derived from X and 3114‘ is inter-switch trafﬁc

rate from switch 1' to switch j. The offered workload W of the network should be

N N S S
W = :2 as... = 22y...- (5.2)

As mentioned in Section 2, before a network is operational, the trafﬁc distribution can
only be estimated based on the knowledge of expected network applications, which may
not be accurate. After the network is operational, trafﬁc distribution may be measured via

monitoring network trafﬁc, which should be close to the real trafﬁc.

5.2.2 Design Variables

For the purpose of analysis, the following design variables are deﬁned:

71,-: the number of nodes in switch 2' and N = 2179:1111.

0 mi: the number of channels in switch 1' and n,- + m,- g c,-.

M: the total number of bidirectional channels in the network and M = 235211.

pm: the length of routing path from source node 11 to destination node v in hops,
which is deﬁned as the number of channels traversed by the path. For nodes 11 and

v in the same switch, the path length between them is deﬁned as 0. Note that path

116
length should be proportional to channel bandwidth and higher bandwidth channel

contributes less path length.

0 13: the maximum path length, i.e. ma$(pu,v).

N N
211:1 211:1 puiv
number_o f -(u,v)_pairs '

0 p: the mathematical average of all pum’s, i.e.

o F: the average path length with the consideration of the amount of trafﬁc between
. . . _ EN: EN: (Pum‘zum) — —

each pair, Wthh 18 deﬁned as P = —n—l——E—v1,———. Note that p and P may not

be equal unless the trafﬁc is uniform. B should be more accurate to reﬂect both path

length and trafﬁc distribution.

fk: the average rate of trafﬁc ﬂow on channel 11:.

pk: utilization of channel 11:, which is calculated by f1, /ﬂk.

Given a network topology and a ﬁxed routing algorithm, it is easy to compute the route
and its path length for each source and destination pair. Due to the bandwidth requirements,

we have F >1: W g 2 Bk. Therefore, the upper bound of W is

 

W s ZFB" (5.3)

The above equation shows that smaller F results in larger upper bound of W. Note
that the upper bound of W is obtained without the consideration of channel contention,
switch delay, and software latency. Such an upper bound gives us a metric to evaluate the
throughput of a network in terms of offered load, but it can hardly be obtained in a real

network.

117

One of the design criteria is to maximize trafﬁc locality. For the purpose of design
and analysis, we deﬁne ‘local traﬁ‘ic’ L as the total trafﬁc transmitted within a switch and
‘remote trajﬁc’ R as the total trafﬁc transmitted between switches. Formally, L and R are
deﬁned in Equations 5.4 and 5.5. Note that R is calculated by the consideration of routing
path length, which should be more reasonable than a simple sum because a longer routing

path consumes more channel bandwidth per packet send. Increasing L and reducing routing

path length will decrease R.
N N
L = Z :3 $11,111 where u and v are in same switch (5.4)
u=1 v=l
N N
R = Z Z rump”, where u and v are not in same switch (5.5)
u=l v=l

To express the local trafﬁc out of W, we deﬁne ‘traﬁ‘ic locality ratio’ a = {4%. However,
simply maximizing a may cause unbalanced inter-switch trafﬁc and unbalanced channel
utilization, although bigger a will beneﬁt local trafﬁc. We should also balance inter-switch
trafﬁc and avoid channel bottleneck(s) due to large inter-switch trafﬁc. For example, a
group of nodes may be placed in switch 2' to maximize 01. Meanwhile, these nodes also
communicate to a server in switch j, but they cannot be placed in switch j due to the limit
of available ports. If inter-switch trafﬁc ym- contributes a signiﬁcant fraction of trafﬁc, the
path from switch 2' to switch j may be overloaded and become trafﬁc bottleneck(s). In
such a case, we may redistribute some nodes from switch 2' to other switches, resulting in

smaller 0: but more balanced channel utilization and better overall network performance.

118

The criterion is which is bigger, trafﬁc within these nodes or trafﬁc from these nodes to the

SCI'VCI'.

Given X and Y, we are able to compute trafﬁc ﬂow on each channel for a given topol-
ogy and its routing algorithm. Our selection of ﬁxed routing algorithm makes this calcula-

tion unambiguous. A decision variable 1r and trafﬁc ﬂow f1, can be calculated as follows:

1 if the route from switch 1' to switch j passes channel 11:
711'“ = (5.6)

0 otherwise.

S S
f1 = Z :01...- * wt.) (5.7)

i=1 j=l

The calculations of fk and pk help us identify trafﬁc bottleneck under the current net-
work conﬁguration. The channel with mad/11,) is usually the bottleneck channel. To
improve network performance, we should minimize mad/11,). Besides balancing channel
utilization by adjusting routing paths, we also need to reduce the total remote trafﬁc and

balance inter-switch trafﬁc.

For convenience, important symbols used in this chapter is listed in the following table.

5.3 Initial Topology Design

In this section, we propose a simple and efﬁcient method for initial topology design. As we

mentioned earlier, a primary consideration for topology design is minimizing both maxi-

119

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

a trafﬁc locality ratio
Bk bandwidth of channel I:
A minimum number of occurrences for a pair in block design
B,- a block in block design
q, o number of ports in a switch
fk trafﬁc ﬂow rate (bytes/s) of channel k
L local trafﬁc within switches
M total number of bidirectional channels
m,- number of channels in switch 1‘
N number of nodes
11,- total number of nodes in switch 1'
p1,,” routing path length from node 11 to node v
13 maximum routing path length
1‘9 average routing path length (not consider trafﬁc)
_75 average routing path length (consider trafﬁc)
R remote trafﬁc between switches
s, S number of switches (objects)
pk channel utilization
X trafﬁc distribution matrix between nodes
mu,” trafﬁc rate from node u to node 1)
Y trafﬁc distribution matrix between switches
yih,‘ inter-switch trafﬁc rate from switch 1' to switch j

 

Table 5 . 1: List of important symbols in Chapter 5

 

120

mum and average path lengths. With shorter routing paths, less channel resources are con-
sumed per packet sends, resulting in lower message latency and higher network throughput.
To design a topology with shorter routing paths, we adopt the idea of block design, which

was previously suggested in multiprocessor networks [78] and switch-based LAN 5 [79].

5.3.1 From Block Design to Network Design

Combinatorial design provides methodologies to distribute a set of objects into blocks to
satisfy certain properties. One method is called balanced incomplete block design (BIBD),
which distributes 3 objects to blocks of size d such that (1) each pair of the objects occurs
in exactly A blocks, (2) each object occurs in 7‘ blocks, and (3) there are b blocks. A BIBD
with s = b and r = d is called symmetric, which can be denoted by (13,13).). For example,
a symmetric BIBD (7,3,1) has seven objects 0, 1, - - - ,6 and seven blocks B0, B1, - - - ,B5

as shown in Table 5.2, where each block has three objects.

 

Bo=(013) B4=(450)
Bl=(124) B5=(561)
Bz=(235) Bs=(602)
Bg=(346)

 

 

 

 

Table 5.2: Blocks of symmetric BIBD (7,3,1)

To design a switch-based network from the block design, we have the following maps

between blocks and the network:

1. Each object matches a switch and there are total 3 switches.

2. Each block deﬁnes a path for its objects (switches). For example, Bo determines a

bidirectional path 0 H 1 <—-) 3.

121

3. All paths within blocks are connected to construct a single bidirectional trail, which
is considered as an Eulerian trail2 of the network. The network topology is then

decided by the Eulerian trail. For the block design in Table 5.2, the Eulerian trail and

topology are shown in Figure 5.1.

BOIOH1H3;BI:1H2H4;3222<—)3(—)5;
B323H4H6;B4:4H5H0;B5:5(-)6(—>1;B¢;:6<—)0(—)2;
EulerianTrail:0(—>1H3H4H6HOH2H3H5H6H1H2H4H5H0

 

 

Figure 5.1: From block design to the network

4. The average path length 5 and the maximum path length i) depend on d.

5. The degree of each switch is 211 — 2.

6. The number of routing paths between a switch pair is related to A.

Given the paths within blocks and the Eulerian trail, it is easy to design a routing al-
gorithm. One way is to route messages along one of the two unidirectional Eulerian trails.
Message routing from switch 2' to switch j can take the path in the corresponding block,
because each pair occurs in exactly one block. Thus, the maximum routing path length is

d — 1. Shortcuts may be used to provide shorter routing paths. We can also adopt some

 

2An Eulerian trail of a network is deﬁned as a trail which passes each channel once and exactly once.

122

existing routing algorithms like Eulerian trail routing in [21] or virtual ring routing in [79].
Note that we do not consider switching techniques here. For wormhole switching, special

considerations may be needed to avoid deadlock [72, 42].

5.3.2 Construction of Blocks

In a topology design, we are given the number of switches (3), number of ports per switch
(c,-), and the required number of nodes (N). To simplify our description, we assume all cﬁs
are equal to c. The average degree k of switches can be calculated by cs 2 211: + N. Given
k, we have 11 = [Ir/2] +1.

Combinatorial design has shown some block design theories and methods for special
parameters. For symmetric BIBD, an important property is:
Property 1 An (s,r;1 )-symmetric block design exists when (r — 1) is a power of a prime
numberands = r2 - 1" +1.

The above property means that we may not be able to ﬁnd a symmetric BIBD for any
3 and 11. Since r decides 2r — 2 degree for each switch, it should not be very large in
our design. In [79], it points out that the set of integers for which there is a constructive
symmetric design with A = 1 and 3 S r g 20 is limited to {7, 13, 21, 31, 57, 73, 91, 133,
273, 307, 381}. To design a network for any reasonable size of s and r, we introduce the

following deﬁnitions as variations of symmetric block design.

Deﬁnition 5 Basic block design (BBD): in a basic block design (3, r, 2 A), there are 3 ob-
jects and 3 blocks, each block has r objects, each objects occurs in 7‘ blocks, and each pair

of objects occurs in at least A blocks. Note that A may be 0 and the number of occurrences

123

of different pairs may be dtﬁerent.

A BBD (14, 3, Z 0) is shown in Table 5.3, where some pairs occur in one of blocks but

other pairs (e.g., 5 and 10) occur in none of blocks.

 

Bo=(013) B7=(7810)
Bl=(l24) B3=(89ll)
Bz=(235) Bg=(91012)
Bg=(346) B10 =(lO ll 13)
B4=(457) Bll=(11 120)
B5 =(S68) B12 =(12131)
Bs=(679) B13=(1302)

 

 

 

 

Table 5.3: Blocks of BBD (14,3,1)

Deﬁnition 6 Combined block design (CBD): a combined block design (3, r0 + r1 + - ~ - +
7.9—1)? A) is a combination ofg BEDS: (317.01 2. A0): (817.11 2 AI), ° ' '1 (31 Tg—la 2 Ag);

where all the BBDs have the same set of objects. In the CBD, there are 3 objects and gs

blocks. Each pair occurs in at least A blocks and it is obvious that A Z A,- (1 S i S g).

A CBD (14, 4 + 2, 2 1) is shown in Table 5.4, which consists of two BBDs (14, 4, 2 O)
and (14,2, _>_ O). For a CBD, the construction of an Eulerian trail is the integration of the
Eulerian trail from each individual BBD and the switch degree is 2r — 29. If A 2 1, the
maximum routing path length will be max(r,-) — 1. In the network derived from Table 5 .4,
the switch degree is 8 and the maximum path length is 4 — 1 = 3.

Given 3 and r, we use the following method to construct a BBD. Switches are rep-
resented by numbers 0, 1, 2, s — 1. Suppose Bo is (b0,b1, - --b,_1), where each b,-
represents a switch number. All the other blocks are constructed in such a way that block

B,- is ( (b0 + 2') mod 3, (b1 + i) mod 3, - - - , (b,._1 + 2') mod 3) for O < i < s. This kind of

124

 

30,0 = (O l 3 7) 31,0 = (0 5)
30,1 =(12 4 8) 31,1 =(16)
30,2 = (2 3 5 9) 31,2 = (2 7)

30,3 = (3 4 6 10) 31,3 = (3 8)

Bo,4=(45711) 31,4 =(4 9)

130,5 = (5 6 812) 31,5 =(510)
B01; =(67 913) 31,6 =(611)
30,7 = (7 8100) 31,7 =(712)
80,8 = (8 911 1) 31,8 =(813)
30,9 =(91012 2) 31,9 = (9 0)

30,10 =(1011 13 3) 31,10 =(101)
Bo,1l=(1112 04) 31,11 =(112)
30,12 =(121315) 31,12 = (12 3)
30,13 =(13026) 31,13=(13 4)

Table 5.4: Blocks of CBD (14,4+2,2 1)

 

 

 

 

block designs have special cyclic automorphism, which pennutes the objects of all blocks
in a cycle of 3. Therefore, if B0 is decided, all the other blocks are determined. No matter
if each pair occurs in one of blocks, messages can be routed along the Eulerian trail. How-
ever, a message may have to travel more than one block if its source and destination switch
pair do not occur in any block. In this case, the maximum path length may be greater than
the block size 71.

To cover as many pairs as possible in a BBD, we use differences to construct B0 based
on s and r. A diﬁerence of two switches is deﬁned as the difference of their numbers. For
3 switches, there are totally s — 1 distinct nonzero differences: 1, 2, - . - ,s — 1. Note that
any two switches i and j yield two s-complement differences: (2' — j) mod 5 and (j — 1')
mod 3. For example, 0 and 1 yields differences 1 and s — 1 (i.e., —1 mod 3). Due to the
way a BBD is constructed, all Bﬁs should yield a same set of differences. We are interested
in BBD (s, r, 2 1), which means that each pair occurs in at least one block of the BBD so

that maximum path length can be bounded by r — 1. We have the following lemma.

125

Lemma 7 Each pair occurs in at least one block of the corresponding BBD if and only if

Bo yields all dzﬁerences 1, 2, - - - ,s — 1.

Proof:

Step 1 ‘if’: consider an BBD (3, 11,2 A), where Bo yields all differences 1, 2, - - - , and
s — 1. We are trying to show that each object (switch) pair occurs in at least one block in
the BBD (i.e., A = 1).

Let us consider any two objects 2' and j in B0. Without loss of generality, let j is greater
than i. Leta =j-— i andﬂ = (i —j) MODs. Theremustbea+,6 = s. IntheBBD, we
have 3 rows (blocks) and 7' columns. Therefore, we can further consider the two columns

decided by i and j, which are constructed in Table 5.5.

 

 

 

 

i H. j
i+1 ~- j+1
s—a—l ~~ s—l
s—a ~- 0
3—a+1 ~- 1
s—l a—l
... a
1 ~- a+1
i—l ~- j-l

 

 

Table 5.5: The two columns determined by i and j

If we rotate the rows and let 0 and a be the ﬁrst row, we will have Table 5.6.
The above two columns clearly show that they cover all object (switch) pairs whose
differences are a (e. g, j — i) and all object (switch) pairs whose differences are 6 = s — a

(e.g., i — j). Therefore, given any difference :1: generated by two objects in B0, all pairs

126

 

 

 

 

 

a
1 a+l
2 .7'
s—a—l ~- s—l
s-a ...
s—a+1 ~- 1
s—l ~- a—l

 

Table 5.6: The two columns determined by i and j (after rotation)

whose differences are either a: or s - a: must occur in the BBD. Since it is given that Bo

can generate all differences, all pairs will occur in the BBD.

Step 2 ‘only if’: Suppose each object (switch) pairs occurs in at least one block of the

BBD(s, r, 2 1). We are trying to show that Bo yields all differences 1, 2, - - - ,s — 1.

Consider a difference (1. Since each object pair occurs in at least one block of the BBD,
there must be two objects 2' and j in one block whose differences are a and s — a. Note

that the differences of i — j and j — i are s-complement.

Due to the cyclic addition method to construct all blocks, each pair in the same columns
as 2' and j in a block yields the same differences a and s — a. It means that the pair in B0
also yields differences a and s — (1. Therefore, for any a, we can ﬁnd such two objects 1'

and j and the corresponding pair in B0 whose differences are a and s — a. C]

A CBD can be created by the combination of corresponding BBD’s. Without loss
of generality, we suppose a CBD consisting of two BBDs: BBDO and BBD]. In BBDo
(3,710, 2 A0), we construct B010 which yields some differences. In BBD1 (3,711, 2 A1), we

construct 31,0 which tries to yield as many unyielded differences as possible. Ideally, B110

127

yields all the differences not yielded by 80,0. If each difference is yielded by either Bop or
31,01 each pair will occur in either BBDO or BBDl.

The examples of BBD and CBD constructions have been shown in Tables 5.2, 5.3,
and 5.4. In Table 5.4, B0], of BBDO yields differences 1, 13, 2, 12, 3, 11, 4, 10, 6, 8, and
7. Since we have differences 5 and 9 left, 31,0 is constructed to yield differences 5 and 9.

This results in a CBD ( 14, 4+2, 2 1).

5.3.3 Construction of B0

Because all the other blocks are determined by Bo, we only need to show the construction
of B0. Our objective is to yield as many distinct differences as possible in B0. The more
distinct differences Bo yields, the more pairs a BBD covers. As we mentioned earlier, each
pair occurs in at least one block if and only if Bo yields all differences.

Block B0 of BBD (s, r, 2 A) yields r(r — 1) differences which may or may not be
mutually distinct depending on its objects. If some of the differences yielded by Bo are
duplicated, the total number of distinct differences is less than r(r — 1). Since there are

totally s - 1 non-zero differences (modulo 3), we have the following relationship as shown

 

 

 

 

 

 

 

in Table 5.7.
r r(r — 1) Yield all differences‘d
3 \/.§ < s — 1 no
\/§<r<\/§+l -Z<-8—1 maybeormaybenot
2 J3 +1 > s — 1 maybe or maybe not

 

 

Table 5.7: Relationship between block size and number of distinct differences

In general, a larger 11 yields more distinct differences in a block and more distinct pairs

128

in a BBD. Given a big enough 71, we can always ﬁnd a block of size 11 yielding all distinct
differences of 3 objects. However, r also determines the maximum length of routing paths
so that 1' should be as small as possible. Moreover, r is equal to [Ir/2] + 1, where k is
the average switch degree in the network. Due to the limit on the number of channels
connected to a switch, r may not be very big. Given 8 and r, a best solution of B0, which
covers as many distinct differences as possible, can always be found by exhausting search.
However, such an algorithm may take too much time for large 3 and r. Therefore, we
propose a heuristic algorithm in Figure 5.2 to construct B0.

As shown in Table 5.7, to cover all pairs in a BBD, 7' must be greater than ﬂ. We
have run our algorithm for 3 S s g 400 and found the minimum r yielding all differences
for each of 3. Although our algorithm considers only local optima, it turns out that the
results are pretty good. As shown in Table 5.8, when 3 is equal to or less than 100, we
can construct a network with a r which is no more than [Vs—l + 2. Smaller r guarantees
smaller routing path length between pairs. If r is allowed to be bigger than the one listed
in Table 5.8 for a given 3, a CBD can be constructed to reduce path length.

In most cases, 11 = [J8] + 1 can yield at least 85% of differences, so it is easy to use
another BBD with smaller block size to yield all the other differences. The two BBDs can
then be combined as a CBD covering all pairs. There is tradeoff between a BBD and a
CBD. For same 3 and r, a CBD has smaller block size than a BBD, which decides shorter
routing paths for pairs within blocks. If a CBD can cover all pairs, we may choose the
CBD instead of a BBD because smaller path lengths are promised in a CBD. However, the
number of pairs in a CBD is smaller than that in a BBD for same 3 and 1‘. Therefore, if r is

not big enough to yield all differences in one block, we do not consider partitioning r for a

129

CBD unless there is a good reason.

 

 

 

 

 

 

 

 

 

3 1‘
3—100 six/3+2
101—200 six/5+4
201—300 six/37%
301—400 six/3+7

 

 

Table 5.8: Minimum 2' to yield all differences in a BBD (3,11, 2 1)

After all objects have been selected in B0, the order of these objects may need to be
reorganized in the following cases. Our purposes are to avoid double channels between any
two switches and reduce maximum path length if possible. Since double channels between
a switch pair will not reduce routing path length, we try to avoid double channels and save
the channel for another two otherwise unconnected switches, which results in shorter path

length.

0 If s is an even number, the two objects whose difference is g should not be adjacent in
B0. Otherwise, there will be double channels between switches 2' and 2' + 3, because
the two s-complement differences yielded by the two switches are same. For example
in Table 5.4, if block 80,0 is (O 7 1 3), where O and 7 have difference of l;— and are
adjacent in the block, there will be double channels between switches O and 7, 1 and

8,-~,and6and13.

o If more than one pair in B0 yield same differences, try to place one of these pairs in

the two boundaries of B0 so that the maximum path length is less than T — 1.

o If more than one pair in B0 yield same differences, try to place at most one of the

these pairs adjacent in B0. Otherwise, there will be double channels between some

130

Input: 3, 1‘
Output: block B0 2 (b0, b1, ...b,_1)
Algorithm:

/* Step 1: initialization */

()0 = 0;

D2ffSet={1,2,---,s—1};

/* Step 2: iteration to ﬁnd next object for Bo */
For(2'= 1;2 <r;2++){
d2 f f = the smallest unyielded difference in D2 f f Set;
F0rU=0;jSi-1;j++){
newobject = bj + diff;
If newobject yields distinct differences in the partial block
add newobject to B0 as b; and continue to next 2';
Else

keep newob j ect which yields maximum number of distinct differences in the partial block
}

b,- is the newobject yielding maximum number of distinct differences in the partial block;
remove yielded differences from Diff Set;

}

/* Step 3: rearrange the order of objects in B0 */
If s is an even number {
/* to avoid double channels between any two switches */
place two objects, whose difference is g, unadjacent to each other;
}
If a difference is yielded by more than one pair {
/* to avoid double channels between a switch pair */
at most one of those pairs can be adjacent in B0;
/* to reduce maximum path length *I
one of those pairs should be placed as boundary objects in B0;

}

Figure 5.2: The algorithm to construct B0

131

switch pairs.

Note that the new order of objects in B0 will inﬂuence all the other blocks and the
network topology. Consider a BBD (10, 4, 2 1), initially we have the block design shown
in the left side of Table 5.9. After rearranging the order of objects in B0 and all other
blocks, we have a new block design shown in the right side of Table 5.9, which has smaller

maximum path length and smaller average path length.

 

 

before reorder after reorder

Bo=(0136) Bo=(3106)
B1=(1247) Bl=(4217)
Bz=(2358) B2=(5328)
B3=(3469) B3=(6439)
B4=(457O) = B4=(7540)
B5=(5681) B5=(8651)
Be=(6792) B6=(9762)
B7=(7803) B7=(0873)
Bg=(8914) B3=(1984)
Bg=(9025) Bg=(2095)

 

 

 

 

Table 5.9: Blocks of BBD (10, 4, 2 1): before reorder and after reorder

Until now, we suppose all switches have same degree and suggest channels are evenly
distributed among switches if possible. In a real network, a switch may have more or less
degree than the average degree. If a switch has less degree, we may remove it from a certain
block(s) depending on its degree requirement. If a switch has more degree, we can add it
to a block which originally does not include the switch. For example in Table 5.3, if switch
10 has more degree(s), it could be added to B0 as follows. We do not place the new switch

in a boundary of a block because we will construct an Eulerian trail.

132
5.3.4 Considerations of Trafﬁc Distribution

In the above block design, objects 0, 1, - ~ - ,s — 1 have not been matched to switches. To

design initial topology, the switches are matched to objects one-by-one as follows:

1. Place servers on switches and each switch has at most one server if possible. This is

helpful to balance inter-switch trafﬁc.

2. Place nodes on switches. The number of nodes in each switch is determined by
design variable n,-. In our design, 72,- could be up to all the spare ports not used by
channels. Nodes are placed in a switch to maximize local trafﬁc L, which can be
guided by the knowledge of client/server trafﬁc and workgroups. Meanwhile, we
have to make sure that inter-switch trafﬁc between any two switches is not too large

in order to avoid trafﬁc bottleneck on the corresponding routing path.

3. Match switches to objects one by one. Our objective is to minimize remote trafﬁc R
and recall that R = E Z mu,,,p,,,,,. For two switches which have lots of communica-
tion between each other, they should be matched to two adjacent objects in a block so
that there is a channel connecting the two switches. If two switches have little trafﬁc

between each other, they may be matched to two boundary objects in a block.

A switch matching object 2 is called switch 2'. After matching switches to objects, we
apply the network tuning algorithm in Section 5.4.4 and create the ﬁnal initial t0pology.
The difference is that we do not have any measured trafﬁc distribution or channel utiliza-
tion before the network is operational. But we can use estimated trafﬁc distribution and

calculated channel utilization. If the estimated trafﬁc matrix is not close to the real trafﬁc

133

distribution, we need incremental network tuning to obtain better performance.

5.4 Incremental Reconﬁguration

Due to the lack of accurate knowledge of network trafﬁc, the initial topology design may
not be good enough to deliver demanded performance. Even the initial network is good
enough, network tuning and incremental reconﬁguration are still needed to accommodate
changes in the network. Some general ideas have been proposed in [49] for network plan-
ning and tuning in shared-medium LANs. However, as indicated in [20], the method pro-
posed in [49] for high trafﬁc locality and high load balance may contradict with each other.
Also those ideas are based on shared-medium LAN 8 and may not be applicable to switch-
based networks.

In a real network environment, it is unlikely to rewire the whole network to achieve
optimal design for the new environment. In other words, network topology changes usually
are incremental and do not include major change. Although so many situations may be
changed in a real network, we are not able to address all of them in one paper. The following

three typical and frequent changes are considered in this section.

1. A new client computer is added. Where to put this computer and does the placement

of this computer make difference to network performance?

2. A new server is added. Where to connect this server to provide reasonable perfor-

mance?

3. Due to the change of trafﬁc patterns or the poor design of initial topology, net-

134

work tuning is needed to provide better performance (e.g., lower latency and higher
throughput). What a minimum reconﬁguration can be made to meet the performance

requirement? This is a more general issue.

5.4.1 Overall Considerations

For any change in an operational network, it should be minor and incremental so that the
whole network does not have to be shutdown for a long time to make major changes and
many rewirings. Due to such a consideration, our objective is how to make minor change

to improve network performance.

Based on queuing theory, message latency on channel k is proportional to 1:15;, where
p), is the channel utilization. Therefore, to achieve lower latency, it is is very important to
reduce 11k. Recall that 11,, = 2%, where B), is channel bandwidth and f), is trafﬁc ﬂow rate of
channel 11:. Due to the limit of network resources, it is usually difﬁcult to increase channel
bandwidth. Thus, the solution should be to reduce fk if possible. There are two necessary
steps to reduce fk’s. The ﬁrst step is to minimize remote trafﬁc R and the second step is to

balance channel utilization.

In the following, we present our algorithms for some typical and frequent changes
in the network: adding a new client, adding a new server, and network tuning. All the
algorithms consider the following criteria: (1) maximize local trafﬁc L and balance inter-
switch trafﬁc; (2) minimize remote trafﬁc R; and (3) balance channel utilization. Note that
we just show the sketch of each algorithm. For each algorithm, it can be implemented in

different heuristic methods, but these methods should consider the topology design, the

135

routing algorithm, and trafﬁc distribution. In the simulation, we have implemented an

algorithm based on our block design and Eulerian trail routing [21].

5.4.2 Add A New Client

Client/server trafﬁc is a common characteristic in LAN 3. It is often demanded to add a new
client to a network. This is the easiest situation for network change and usually does not in-
troduce much complexity. Where to connect a new client depends on its trafﬁc distribution
and the availability of spare ports. We also need consider the physical constraint of its loca-
tion. If its server/workgroup is known, it should be placed as close to the server/workgroup
as possible. The algorithm sketch is shown in Figure 5.3. If network tuning is needed, the

algorithm in Section 5.4 can be applied.

Input: current topology, routing tables, and trafﬁc distribution matrix
Output: the network after adding the new client
Algorithm:
understand trafﬁc distribution of the new client, its server, and its workgroup;
ﬁnd all feasible and available ports for the new client;
positionS = {feasible and available ports for the new client};
recalculate L and R for each port in positionS;
recalculate y“, fk, and 12,, for inﬂuenced switches and channels;
select the one based on minimizing max(pk), R, and max(y,-,J-), and maximizing L;
refer Section 5.4.4 if network tuning is needed;

Figure 5.3: Algorithm to add a new client to the network

136
5.4.3 Add A New Server

In general, adding a server is similar to adding a user. However, a server may easily cause
high utilization on a routing path because it usually has much more trafﬁc than a client
and concentrates its trafﬁc to a subset of clients. Since the large amount of trafﬁc from/to
a server node may easily cause a trafﬁc bottleneck and signiﬁcantly reduce network per-
formance, a higher bandwidth port may be needed to connect a server. Multiple network
interfaces on a server could also be a solution. If the connection of new server causes
very high utilization on some channels, network tuning in Section 5.4.4 will be needed to

balance channel utilization.

Input: current topology, routing tables, and trafﬁc distribution matrix
Output: the network after adding the new server
Algorithm:
ﬁnd all clients of the new server;
ﬁnd available and reasonable ports for the new server;
positionS = {feasible and available ports for the new server};
recalculate L and R for each port in positionS;
recalculate yi‘j, fk, and a), for inﬂuenced switches and channels;
select the one based on minimizing maxim), R, and max(y,-J-), and maximizing L;
refer Section 5.4.4 if trafﬁc balance is needed;

Figure 5.4: Algorithm to add a new server to the network

5.4.4 Network Timing

Suppose we have sufﬁcient network resources for the network workload. If an operational
network does not provide demanded performance, network tuning is needed to improve

performance. The reason to cause poor performance of the original network could be two-

137

folded. First, network trafﬁc may change so that the original network may not be able
to deliver demanded performance under the new trafﬁc. For example, a server’s load is
increased by a certain percentage after it is upgraded or merged with another server. Sec-
ond, the estimated workload may be different from the real workload so that the original
network design is not suitable for the real workload.

A practical and systematic method is proposed in the following for network tuning.
Our method follows the design considerations in Section 5.4.1. The major objective is
to balance channel utilization and minimize mad/1),). To identify overloaded channel,
we introduce 6 as a threshold for channel utilization. If a), > 6, channel k is considered
overloaded. For most design, 6 should be about 50-60%. We try to avoid overloaded
channels by reducing remote trafﬁc R and balancing channel utilization. The algorithm
sketch is shown in Figure 5.5, which includes the following four major steps. In any step,

if none of 12,, is greater than 6, the algorithm can be stopped.

1. Increase local trafﬁc L and balance inter-switch trafﬁc for any switch pairs. The
method is to move/exchange node placement among switches. This is guided by
trafﬁc distribution matrix. Note that maximizing L can efﬁciently reduce total remote

trafﬁc R.

2. Decrease remote trafﬁc R. Our method is to reduce routing path length for those
switch pairs with bigger inter-switch trafﬁc by moving and/or adding channels. If our
block design method is used for initial topology design, it should be easy to make
these channel changes as follows. We can change the topology and routing path

length by rearranging the order of objects in blocks. For example, if block (0 l 3 7)

138

Input: current topology, routing tables, and trafﬁc distribution matrix
Output: network after network tuning
Algorithm:
/* Step 1: increase L and balance yid- */
increase local trafﬁc L by moving/exchanging node placement;
balance inter-switch trafﬁc between switch pairs by moving/exchanging node placement;

/* Step 2: decrease R */

minimize R by moving and/or adding channels;
recalculate f,c and a), on inﬂuenced channels;
[fall a), < 6, done;

/* Step 3: balance channel utilization by changing routing paths */
calculate or measure fk’s, pk’s;
BusySet = all channels whose utilization is over 6;
For each channel BusyCh in BusySet {
BusyUtil = BusyCh’s utilization;
PathSet = all paths which include BusyCh;
try to ﬁnd an alternate routing path for each path in PathSet, which
reduces BusyC’h’s utilization;
does not increase max(pk);
does not make another channel’s utilization over BusyUtil;

}

/* Step 4: balance channel utilization by changing node placement */
recalculate f), and pk;
BusySet = all channels whose utilization is over 6;
For each channel BusyCh in BusySet {
BusyUtil = BusyCh’s utilization;
ﬁnd all source and destination switches whose path include BusyCh;
try to move nodes in these switches so that new node placement
reduces BusyC’h’s utilization;
does not increase max(p,,);
does not make another channel’s utilization over BusyUtz’l;

}

Figure 5.5: Algorithm for network tuning

139

is changed to (O 3 l 7), the changed topology will have two more channels between
switches 0 and 3 and switches 1 and 7 but lose two channels between switches 0 and
1 and switches 3 and 7. We will make these channel changes if inter-switch trafﬁc
310,3 + y1,7 is more than 310,1 + 313,7, because the remote trafﬁc will be reduced in this
case. Since we use Eulerian trail to design network topology and routing algorithm,
we do not change boundary objects in this step; otherwise, maximum path length
may be increased. Moreover, we try to avoid double channels between any two
switches because we can save one of the double channels to connect two otherwise
unconnected switches. Note that this change only inﬂuences the corresponding block

not all the blocks.

3. Balance trafﬁc by changing routing paths if possible. We need the knowledge of
trafﬁc ﬂow f), and channel utilization a), on each channel, which can be obtained by
either theoretical calculations or network measurement. Because theoretical calcu-
lations may be different from network measurement due to simpliﬁed assumptions
in the calculations and the dynamic nature of networking environment, measured
channel utilization is usually more accurate and is suggested whenever applicable.
If any pk is over 6, the corresponding channel is marked overloaded. Our algorithm
searches all the paths for the overloaded channels. If a routing path includes at least
one overloaded channel, we will try to ﬁnd an alternate path so that trafﬁc on the cor-
responding overloaded channel(s) will be reduced. This routing change should also
ensure that it does not cause new overloaded channel or bigger max(uk). If there

are more than one alternate path, we will pick up the one which derives smallest

140

max(pk) and more balanced channel utilization.

4. Balance trafﬁc by moving/exchanging nodes among switches so that the trafﬁc on
heavily utilized channels can be reduced. Suppose channel It is overloaded and inter-
switch trafﬁc from switch 2' to switch j passes channel 12. If we can move/exchange
node placement to reduce 31,-J, it will reduce the utilization of channel 12 especially
when 12,/,3,- contributes a signiﬁcant portion to the trafﬁc ﬂow of channel 12. Of course,
changing node placement has to be taken in a global point of view and should not
cause new bottleneck channel. Usually, changing routing tables can balance trafﬁc
pretty well. This step is helpful only when the initial node placement causes very

unbalanced inter-switch trafﬁc.

We would like to mention that it may not always be so straightforward to redistribute
nodes from one switch to another. Even in a LAN, the distances between the switches
may be too large to make it difﬁcult (or impossible) to reconnect a given host to another
switch. Therefore, the feasibility of end node replacement may be limited by a real network
environment.

There is no straight-forward implementation for any of the above steps. Moreover,
we have to compromise between the criteria if we are not able to achieve all of them at
the same time. The most important criterion should be minimizing mad/2),). Note that

we ﬁrst need to check if the average channel utilization :5“; is less than 6. If the average

 

channel utilization is over 6, the channel bandwidth may not be enough to handle the offered
workload. In this case, more network resources such as new channels and/or switches are

needed. Even though the average channel utilization is under 6, it is still possible that

141

some uk’s are greater than 6 after applying the algorithm due to the trafﬁc distribution and
network conﬁguration. If this happens, we may need to add more channels or upgrade to

high bandwidth ports (switches).

5.5 Simulations

A simulator has been implemented to evaluate the network performance before and after
network tuning under various topologies and trafﬁc workload. There are so many network
parameters (e.g., topologies, placement of nodes, trafﬁc distributions, message size, buffer
size, message injection rate, location of servers, etc.) and combinations of these parameters.
However, due to the space limit, we can only show the performance under a limited subset

of parameter combinations.

5.5.1 Simulation Environment

Without loss of generality, the performance data shown in this section are measured from
a switch-based network which has 10 switches and 200 nodes. The initial topology design
is created by a BBD (10, 4, 2 1), which has been shown in the right side of Table 5.9. The
corresponding initial topology is shown in Figure 5.6, where the network has 30 bidirec-
tional channels and the degree of each switch is 6. The deterministic ETR is used as the
routing algorithm. All calculations for trafﬁc rate and channel utilization are based on the
topology and the routing algorithm.

In order to evaluate interconnection networks, a simulator should consider the following

workloads: trafﬁc patterns, message size distributions, and temporal distribution [80]. Two

142

Bo:3<—>1H0<—>6;BI:4+—>2Hl<—>7;Bg:SH3H2H8;33:6H4++3H9;
B4:7H5H4H0;B5:8H6H5H1;Bs:9H7H6(->2;B710H8H7H3;
Bszl++9<-+8H4;Bg:2<->O<—>9<—>5;

 

 

 

Figure 5.6: Initial topology used for simulation

 

Group 1 Group 2 Group 3 Group 4 Group 4
50 35 25 45 40

 

 

 

 

 

 

 

 

Table 5.10: Number of clients in each workgroup

143

trafﬁc patterns are used in this section, where there are 5 servers and 5 client wokgroups
out of 200 nodes for both patterns. The number of clients in each workgroup is listed in
Table 5.10 and the trafﬁc distribution of each pattern is shown in Tables 5.11 and 5.12.
Each server or client has 20-30% global uniform trafﬁc as the background trafﬁc. Trafﬁc
pattern 1 and trafﬁc pattern 2 are similar, except that pattern2 has more workgroup trafﬁc
and less client/server trafﬁc. Our intention is to evaluate the effect of workgroup trafﬁc and
client/server trafﬁc on network performance. For simplicity, we choose ﬁxed size messages
and the message size is always 128 bytes.

Although we just show network performance for a speciﬁc topology and two trafﬁc
patterns, similar performance trend can also be observed in other t0pologies and network
workload. The switching technique is assumed to be cut-through switching and each port
has an input FIFO buffer of 1000 bytes. Back-pressure ﬂow control is used to avoid buffer

overﬂow. Again, similar performance trend can be observed in store-and-forward switched

 

 

networks.
srvl srv2 srv3 srv4 srv5 grpl grp2 grp3 grp4 grpS uniform
srv 1 10% 40% 30% 20%
srv 2 20% 40% 20% 20%
srv 3 60% 20% 20%
srv 4 10% 70% 20%
srv 5 10% 70% 20%
group 1 30% 30% 20% 20%
group 2 50% 30% 20%
group 3 30% 50% 20%
group 4 20% 40% 20% 20%
group 5 10% 40% 30% 20%

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 5.11: Trafﬁc distribution of pattern 1

We use ﬁnite input source to model the message generation in each node because it is

 

144

 

 

 

srvl srv2 srv3 srv4 srv5 grpl grp2 grp3 grp4 grp5 uniform
srv 1 10% 40% 30% 20%
srv 2 20% 40% 20% 20%
srv 3 50% 30% 20%
srv 4 10% 60% 30%
srv 5 10% 60% 30%
group 1 20% 20% 40% 20%
group 2 30% 50% 20%
group 3 20% 60% 20%
group 4 10% 30% 40% 20%
group 5 10% 30% 40% 20%

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 5.12: Trafﬁc distribution of pattern 2

more realistic than the inﬁnite source model. A ﬁnite source model may have up to K out-
standing messages. In high performance computing, most parallel programs use blocking
send to deliver messages. Thus, the next send will not be initiated until the previous one is
completed. In a client/server environment, a server may receive several requests and have
several messages waiting on the queue to send. For simplicity, we consider the case of
K = 1, which means next message cannot be generated unless the previous message has

completely left the source.

A time unit is deﬁned as the time needed to transmit a byte via a channel, which is de-
cided by channel bandwidth. Unless otherwise speciﬁed, all channels connecting switches
are assumed to have bandwidth of 1 byte per time unit, all ports connecting to clients are
assumed to have bandwidth of 1 byte per time unit, and all ports connecting to servers are
assumed to have bandwidth 10 bytes per time unit. The ratio between higher bandwidth
and lower bandwidth emulates lOO-baseT and lO—baseT ports in Ethernet switches. Higher
bandwidth for server ports will efﬁciently reduce trafﬁc congestion due to server bottle-

neck. We assume the header for each message is 3 bytes. Thus, the switch delay for a

145

routing decision is 3 time unit.

Let x be the time duration between the time when the current message has completely
left the source buffer and the time when the next message is generated in each node. Thus, :1:
is a random variable based on an exponential distribution. The smaller :1: is, the heavier the
workload is. In our simulation, all results are measured under a same set of workloads. The
simulator will be run until at least 30000 messages are received. In each run, the ﬁrst 2000
messages are passed to eliminate start-up transient effect. Average message latency and
network throughput are used to evaluate network performance. The network throughput is
normalized by the theoretically maximum throughput, which is 200 bytes per time unit in

the selected network.

5.5.2 Performance Improvement by Network 'Itming

In this section, we use simulation results to show the performance improvement by applying
our algorithm for network tuning. We use the method of block design in Section 4 to
plan initial topology and adopt Eulerian trail routing [21] as our routing algorithm. Our
algorithm is implemented based on the way we design topology and routing algorithm.
Due to the space limit, we cannot address our algorithm in detail. (should address the

algorithm a little bit)

In our simulations, we measure the network performance under the following four
cases, which are categorized into four levels based on how much network tuning has been
done. In tuning levels 0-2, each switch has exactly 20 nodes and at most one of the 20

nodes is a server. In tuning level 3, a switch can have at most 22 nodes. This is intended to

146

simulate that each switch has a limited number of ports. Allowing at most one server in a

switch is helpful to balance inter-switch trafﬁc.

1. Tuning level 0: initial topology; random node placement; and no network tuning.

2. Tuning level 1: initial topology; nodes are placed in a way to maximize local trafﬁc

and balance inter-switch trafﬁc; and no other network tuning.

3. Tuning level 2: based on level 1, try to minimize remote trafﬁc R by changing the
order of objects in each block, resulting in new network topology and new routing

path length. No further trafﬁc balance.

4. Tuning level 3: based on level 2, try to balance channel utilization by changing rout-

ing paths and node placement.

As shown in Figure 5.7, each level of network tuning can achieve signiﬁcant perfor-
mance beneﬁts. When new clients or servers are added, similar performance beneﬁts can
be achieved as long as we follow the algorithms. The better performance after network
tuning demonstrates the importance of network tuning and the efﬁciency of our algorithm.
In tuning level 3, we use measured channel utilization because it is more accurate than cal-
culations and can be measured after a network is operational. All the other network tuning
levels are based on calculation.

To evaluate the effect of workgroup trafﬁc and client/server trafﬁc on network perfor-
mance, we compare the performance between trafﬁc pattern 1 and trafﬁc pattern 2. As
shown in Figure 5.8, trafﬁc pattern 2 provides lower message latency and higher network

throughput than pattern 1. For example, maximum sustained network throughput is about

147

Traffic pattern 1

 

I I I I

1600 - .
Level 0 ..—
> 1400 - teve11 -+--- 1
eve 2+1-
5 1200 P Level 3 -..--. ‘
3 1000 L 1,. q .
“8 80° ’ 1tWM E 1
+’ 1‘ i‘
3 600 - .1”, ”23m .2" .
90’ ,B ,,...x'
400 - p-‘G x .1
.a ------- x‘
200 '- :B&HR:;?C:BKX~&/T 4 ..

 

 

 

 

0.05 0.1 0.15 0.2 0.25
Normalized Network Throughput

 

 

Traffic pattern 2

1600 h Level 0 —.—_ ‘

1400 - Levei1 -+--- .
5‘ Leve|2 ----a
g 1200 " Leve| 3 ...x ....... .
E 1000 - +1 3 .
L. ‘4: U x
3 600 - ,1, a, 1x .
<1: x.)

400 - 31"3 ....... x", -- .

200 " a ﬁggzglge ”Nix”. " l .(

 

 

 

0.05 0.10 0.15 0.20 0.25
Normalized Network Throughput

Figure 5.7: Performance improvement by apply network tuning

200 x 22% = 44 bytes per time unit for pattern 1 and 200 x 24% = 48 bytes per time unit
for pattern 2. This is because pattern 2 has more local trafﬁc within workgroups but less

trafﬁc to servers. Therefore, server bottleneck in pattern 2 is not as much as that in pattern

1.

1200
1000
>
(c)
,9 800
_‘3
8, 600
f!
O
5 400
200

Figure 5 .8: Performance comparison of trafﬁc pattern 1 and trafﬁc pattern 2

148

Comparison of pattern 1 and pattern 2

 

I

 

I I

PAT1-Levei 2 4—
PAT2-Level 2 -+---
PAT1-Level3 we
PAT2-Levei 3 ...,,......

:
i

 

I

 

5.5.3 Server Bottleneck

To avoid server bottleneck, we assume that server ports have higher bandwidth (10 bytes
per time unit). We have also measured network performance when all server ports have
same bandwidth (1 byte per time unit) as other ports. As shown in Figure 5.9, the network
performance is dramatically decreased without higher bandwidth ports for servers. Further
network tuning cannot improve performance very much due to server bottleneck. This re-
sult emphasizes the importance to release server bottleneck in a client/server environment.

We will talk more about server bottleneck in Section 5.5.4.

5.5.4 Fairness

In Chapter 2.6, we have discussed the fairness issue. To evaluate fairness in a switch-based

0.05
Normalized Network Throughput

0.10 0.15

network, we deﬁne fairness ratio as follows:

0.20

 

149

Traffic pattern 1 (tuning level 1)

 

 

 

 

3500 . i
3000 - -
>
2 2500 L Low Bandwidth +— ~
93 High Bandwidth -+---
_‘3 2000 - .
Q)
53 1500 - -
2
< 1000 "' 1..+A------H'"-+’*’+ -
500 F ,,_.—-1r"""'l'r "
+"4'"*’ 1 g
0.05 0.10 0.15

Normalized Network Throughput

Figure 5.9: The importance of higher bandwidth to release server bottleneck

max traﬁ'ic generated by a node
min trafﬁc generated by a node

 

f a2rness-rat2'o =

The bigger fairness ratio is, the less the network is fair. We have measured trafﬁc generated
by each node and calculated fairness ratio for each client workgroup. It turns out that
fairness ratio is usually lower than 2 under light to medium workload, but the ratio could
be as high as 7 under heavy workload. Network tuning can reduce fairness ratio for each

workgroup and improve fairness for the whole network.

In our selected trafﬁc patterns, the main reason causing high fairness ratio is the differ-
ent locations of clients in a workgroup. Due to the limited number of ports in each switch,
none of workgroups can be all placed in the same switch as its server. For a client which
is located in the same switch as its server, its messages to the server do not compete chan-
nel bandwidth with other trafﬁc. Therefore, its messages are transmitted to the server very

quickly and next message are generated very soon. However, for a client which is located in

150

a different switch from its server, its messages have to compete at least one channel before
they are transmitted to the server. Since a client communicates to a server frequently, such
a location difference can make big difference in terms of transmission latency and number

of messages generated by a node.

To avoid such an unfair situation, we may consider use higher bandwidth channel to
connect two switches if there are a signiﬁcant portion of trafﬁc transmitted between them.
Thus, the higher bandwidth channel may be able to avoid trafﬁc congestion on it. As an
example, we use a higher bandwidth channel (10 bytes per time unit) to connect the two
switches which host the clients of workgroup 2. As shown in Figure 5.10, under the new
network conﬁguration, fairness ratio of workgroup 2 is signiﬁcantly reduced under heavy

workload.

Workgroup 2 of traffic pattern 1 (tuning level 1)
7 . 1

 

 

 

 

 

6 - No high bandwidth channel +—— -
0 One high bandwidth channel -+—--
'8 5 ~
c:
iii 4 -
01
E
ii 3 ' .

2 .

1

high workload low workload

Workload

Figure 5 . 10: Comparison of fairness ratio for workgroup 2 with and without a higher band-
width channel

151
5.6 Evaluation of the Network Planning Algorithm

In this chapter, we have proposed a topology design algorithm based on block design and
a systematic method for network tuning. Since our proposed solutions are heuristic, it is
important to show how good the heuristic methods are. However, it is difﬁcult to compare
our methods with the optimal design and/or other existing topology design algorithms due
to the following reasons. First, the optimal topology design is NP-complete, which makes it
unrealistic to ﬁnd the optimal solution when a network has more than a few switches and a
dozen of channels. Second, although there are many other existing algorithms for topology
design, it is not straightforward to implement any of them and measure its performance via
simulations. Third, even if we could implement some algorithm, the network environment
considered in the algorithm may not be same as the one we considered — switch-based
LAN 5, which may cause unfair comparisons.

Therefore, how to comprehensively evaluate our network planning and tuning algo-
rithms remains an open issue in this dissertation. We will try to ﬁgure out a good approach
to solve this problem in the future. On the other hand, we have done some preliminary
work in this section to compare our topology design with all the other possible topologies
for two small network conﬁgurations. Although our comparisons are incomplete in terms
of trafﬁc patterns, routing algorithms, and network size, we hope that these preliminary
results can demonstrate the network behavior and performance in some sense.

In the following, we consider all non-isomorphic switch-based networks in the two

cases:

0 Case 1: a network has 6 switches and 12 channels,

152

0 Case 2: a network has 6 switches and 11 channels.

We assume that each switch has 16 ports and there are totally 70 nodes (i.e., host
computer) in each network. As shown in Figure 5.11 and Figure 5.12, there are 5 non-
isomorphic topologies for case 1 and 9 non-isomorphic topologies for case 2. As marked
in the ﬁgures, Topology 1 in each ﬁgure is the topology generated by our network planning

algorithm. Note that we have not done any network tuning yet because it highly depends

 

 

 

 

 

 

 

 

on the trafﬁc pattern.
0 1 O 1
5 2 5 ‘ 2
4 3 4 3
Topology 1* (Eulerian) Topology 2 (Eulerian) Topology 3 (Eulerian)
0 1 0 1

 

 

/ 5
\

 

 

 

4 3 4 3
Topology 4 Topology 5

* the tapology generated by the proposed network planning method

Figure 5.11: Five non-isomorphic switch-based networks with 6 switches and 12 channels

For simplicity, we consider two trafﬁc patterns in our comparisons and the trafﬁc gen-

eration rate is Poisson distribution.

 

4 3

 

 

 

4 3
Topology 4 (Eulerian)
0 l

 

A
A
NV

 

 

4 3
Topology 7

153

 

 

 

 

 

 

 

5 V g 7 2
4 3
Topology 5 (Eulerian)

 

 

 

 

 

 

 

 

 

Topology 9

"‘ the topology generated by the proposed network planning algorithm

Figure 5.12: Nine non-isomorphic switch-based networks with 6 switches and 11 channels

154

0 Uniform trafﬁc: message destinations are uniformly distributed among all network

nodes;

0 Client/server trafﬁc: there are 2 servers and 68 clients out of 70 nodes. For a client,
a certain ratio (e.g, 66.67% is used in this section) of trafﬁc evenly goes to one of the
two servers and the rest of trafﬁc evenly goes to the other clients. For a server, it is

assumed to have uniform trafﬁc to all servers and clients.

It is known that the routing algorithm is critical to network performance for a given
topology. In Chapter 4, we have studied three different routing algorithms. In this sec-
tion, we choose the ETR and SR routing algorithms for the purpose of comparison. Since
the ETR algorithm cannot simply deal with non-Eulerian graph, we only compare those
networks with Eulerian trails under this routing algorithm.

We have run our simulation for each of the network topologies and measured the net-
work performance under uniform trafﬁc and client/server trafﬁc. The simulation results are
summarized in the following.

Uniform Me

As shown in Figure 5.13, Topology 1 of Figure 5.11, the topology generated by our
network planning algorithm, has the best performance among all the other topologies under
uniform trafﬁc and the ETR and SR routing algorithms. Similarly, Figure 5.14 shows that
our topology design, Topology 1, has the best performance among the nine networks of
Figure 5.12. The SR routing cannot generate routing table for Topology 6, and it is why
there is no performance curve for Topology 6.

Client/Server ﬁafﬁc

 

Average Latency

Average Latency

155

Networks (6 switches and 12 channels): ETR

I I I I

 

 

1500 - Topology1 4— 1 1
Topology2 -+--- 12‘
Topology3 ---+3- .
1000 - f p 4
500 -
o l l l l

 

 

 

0.04 0.08 0.12 0.16 0.2
Normalized Network Throughput

 

Networks (6 switches and 12 channels): SR

 

 

 

 

 

2500 . . . ,1: .
2000 - .
‘1 '5’

1500 - Topology1 +— i: .
Topology2 -+--- .
Topology3 We it;

1000 ' Topology 4 w» j ‘
Topology5 ----- / ...-.5

500 - -
O L l I l

 

0.04 0.08 0.12 0.16 0.2
Normalized Network Throughput

Figure 5.13: Performance comparisons of networks (6 switches and 12 channels) under

uniform trafﬁc

156

 

 

 

 

 

 

 
 
  
   

 

 

 

 

Networks (6 switches and 11 channels): ETR
500 . s . .
2000 T I 1 1
' 090 ogy "— -
g Topology 2 -+---‘.,-‘ T
g Topology3 mu»; 3‘
3 1500 " Topology 4 -..”. ..... i“ ‘l‘ -
a, Topology 5 ‘--- f 3 :9
3 1000 . ’1 ' -
3%
500 - .
O 1
0 0.04 0.08 0.12 0.16 0.2
Normalized Network Throughput
Networks (6 switches and 11 channels): SR
4000 I I 1 I x! I
‘2 K.
> - ODO ogy2 "1"" i' .. _
3 3000 Topology 3 .9....4 h
2! 2500 - Topology4 «w; ii“- -._. .
3 Topology5 —~A----g i513
m 2000 ' Topology7 ---~-g~ ii“? ; -
5’ 1500 - T0p0|09y8 -----.°-i :
2 TOPO'OQY 9 "+“",‘ 2
< 1000 ' ,‘i 7".- "
500 L .
0 . - -
0.04 0.08 0.12 0.16

Normalized Network Throughput

Figure 5.14: Performance comparisons of networks (6 switches and 11 channels) under

uniform trafﬁc

157

Under the client/server trafﬁc, if a server port has the same bandwidth as a client port,
the server will become trafﬁc bottleneck because it is unable to consume the incoming traf-
ﬁc as fast as demanded. Therefore, the whole network performance will be signiﬁcantly
degraded. This phenomenon has been observed in Figure 5.15 when the server ports have
regular bandwidth, where none of the topology is superior to others. To relieve the server
bottleneck, we should provide higher bandwidth port for each server. It is shown in Fig-
ure 5.16 that the network performance is signiﬁcantly improved and our generated topology

performs very well among all the topologies.

Client/server. regular bandwidth server port
5000 I .

4500 - d
4000 ' Topology1 -o— ,+ d
3500 ' Topology2 -+--- ’ ‘
3000 _ Topology3 We ;, _
2500 r '
2000 .
1500 -
1000 ~
500 -

O I 1
0.04 0.08
Normalized Network Throughput

 

Average Latency

 

 

 

 

Figure 5.15: Performance comparison of networks (6 switches and 12 channels) under
client/server trafﬁc

Remarks

As we mentioned before, our comparisons are incomplete. In terms of trafﬁc pattern,
we consider only uniform trafﬁc and a speciﬁc client/server trafﬁc. To make more prac-
tical comparison, more realistic trafﬁc, like bursty trafﬁc or trafﬁc traces from real LAN
networks should be considered. Obviously, a routing algorithm plays an important role

to inﬂuence the network performance. Since not many routing algorithms are speciﬁcally

158

Networks with 6 switches and 12 channels

 

 

 

 

 

 

 

1000 1 . I .
800 - Topology1 —o—- ~
§ Topology2 -+——-
% 600 _ Topology3 «Er-~- q
_l
8,
g 400 ~ -
a)
2
200 - -
0 P l l l
0.04 0.08 0.12 0.16
Normalized Network Throughput
Networks with 6 switches and 11 channels
1800 r . . .
160° ' Topology1 -o— ‘
>, 1400 L Topology2 -+"' :2! :
g Topology3 "9""gi
9 120° ’ Topology 4 «WE; ‘
3 1000 . Topology5 ""3 ' .
§’ 800 r a y x .
g 600 i 1
< 400 - -
200 - ~
0 l l 1 l

 

 

 

0.04 0.08 0.12 0.16
Normalized Network Throughput

Figure 5.16: Performance comparisons of networks using high bandwidth server ports un-
der client/server trafﬁc

159

proposed for switch-based networks, we choose ETR and SR in this section. Due to the ex-
istence of numerous switch-based networks, there is no way to use a few example networks
to represent all of them. However, our preliminary work can at least show the performance

trend of the considered networks and help us to investigate a better evaluation method.

Chapter 6

A Flexible Simulation Model for

Performance Evaluation

Due to the dynamic nature of networking environments, analytic modeling is unlikely to
give a practical insight to the network behavior and performance. Moreover, analytical
models often require simplifying assumptions that degrade the accuracy of the evalua-
tion. To overcome these limitations, simulation experiments have been used to evaluate
performance of multicomputer networks by many researchers. These simulators either
evaluate routing algorithms and switching schemes on multicomputer network architecture
[81, 82, 83] or simulate the instruction-level operations of parallel processing applications
[84].

To address network behavior under different trafﬁc patterns, routing algorithms, topolo-
gies, and switching techniques, this chapter presents a ﬂexible simulation model for per-
formance evaluation in switch-based networks. The simulator, SimNet, is implemented in
C-H-. Major network components are deﬁned as different classes, representing the network

160

161

topology, routing algorithm, switches, channels, message sending, message receiving, net-
work workload, and so on. It is flexible to run simulation experiments under different
network environments by selecting topologies, routing algorithms, workloads, switching
techniques, buffer architecture, and many other network parameters in user input ﬁles.

New modules can be easily added to the simulator with minor change of existing modules.

6.1 Overall Structure

SimNet’s overall structure is shown in Figure 6.1. The main components in SimNet are a
set of CH- classes, supporting network topology (NetTop), routing algorithm (RouteAlg),
switches (Switch), message generation (MsgSend), message reception (Mngecv), channels
(InChannel and OutChannel), workload (Workload), and messages (F lit). The task library,
a multi-threaded library in SUNWspro software package, is used in the simulator to control
concurrent operations of switches and nodes. A task is an object with a control thread run-
ning its associated program. More speciﬁcally, classes like Switch, Msgsend, and Mngecv
are derived from class task and their scheduling are controlled by the implementation of
the task library. The constructors of these classes, which are designed and implemented in
SimNet, are the main program of the tasks. None of the tasks survives after the completion
of the constructor.

After reading user input ﬁles, NetTop set up its member data, which consist of the
switch adjacency matrix, port bandwidth, number of nodes connected to each switch, and
other necessary information to get the complete knowledge of the network topology. Given

NetTop, routing tables are created based on the selected routing algorithm and each switch

162

 

[ User Inputs

topology information

& network parameters

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

  

 

 

 

 
   

 

 

 

 

 

 

 

 

 

 

 

 

 

 

NetTop Workload
generate rate
- ' message Size
adjacent swrtches destination distribution
&channels ,.___ ----
l V- \
ll ' '
t 1 / MsgSend :
opo ogy . inport messa e
SWItch generat§on : :
Oll 11
change node A A (P0 message : Node :
placement & annels , . eptiont '
1n/out port routing u |
status options ' :
. Mngecv I
| I
I I
Route“: ‘ ‘ " ' " negate size
latency
channel 1‘ change it hops
utilizatiorﬁ routing table
PerfMeasure
NetTune

 

 

 

 

 

 

Figure 6.1: The structure and modular relationship in SimNet

163

carries one of the tables. Each switch has a number of input ports and output ports, which
are connected to other switches or nodes via OutChannels and InChannels. Each unidirec-
tional channel is represented by one InChannel and one OutChannel corresponding its two
end switches, respectively. Since a node generates and receives messages, two classes are
deﬁned to simulate message generation and message reception, respectively. MsgSend en-
capsulates the details of the trafﬁc generation and Mngecv handles all functions related to
packet reception. NetTune will balance trafﬁc based on the current network conﬁguration,
resulting in changed node placement, changed channels and/or changed routing tables.

To evaluate network performance under a wide range of workload, SimNet deﬁnes
workload as the combination of message generation rate, message size distribution, and
trafﬁc pattern. To consider various network situations, the simulator provides users a pa-
rameter ﬁle and users can select a speciﬁc situation by setting proper parameters. These
parameters are listed in Chapter 6.2.6. Whenever a message is received by Mngecv, im-
portant performance data like transmission latency, message size, and number of traversed
channels will be collected in order to calculate the performance data. The simulator calcu-
lates end-to-end transmission latency statistics, such as the mean, variance, and conﬁdence
intervals. We do not distinguish messages and packets in SimNet, which means that each

message is considered as a packet.

6.2 Design and Implementation

As mentioned before, SimNet is written in C++ and uses the task library. Major modules

are implemented by C++ classes, which deﬁnes related data structures and member func-

164

tions. Switch, MsgSend, and Mngecv are derived from task class in task.h, so they are
independent threads and scheduled by the library implementation. The switching schemes,
buffer management, arbitration policy, message generation, and message reception are im-
plemented in SimNet, independent of task library. At the beginning, SimNet reads two input
ﬁles: one for topology information and one for network parameters. Then, all classes are
setup, routing tables are created, and switches and channels are properly interconnected.
After initial setup, switches (Switch) and nodes (MsgSend and Mngecv) will keep running
as individual threads. Each object of MsgSend generates messages based on workload in-
formation, each object of Switch forwards messages ﬂit by ﬂit from input ports to output
ports, and each object of Mngecv receives arrived messages and measures the performance

data.

6.2.1 Routing Algorithm

Three routing algorithms have been implemented in SimNet: ETR, ATR, and UDR. The
simulator also accepts the results from SR simulator and creates corresponding routing ta-
bles from the results. A new routing algorithm can be either accommodated in the simulator
or implemented outside of the simulator. In the later case, we need to make the simulator
to accept the output format of the new routing algorithm. Each routing algorithm is rep-
resented by routing tables in switches. Each routing table keeps routing information for
packets in the related switch. In SimNet, each entry in a routing table includes the input

port, the destination switch, and the output port(s).

The major part of data structure RoutEntry class is shown in the follows, which is the

165

basic element of routing tables.

typedef struct RoutEntry_t {

int dest; // destination switch id

int inCh; // input channel id (input port)

int options; // number of routing options for this entry

OutCh *outCh[MAXOPTION]; // all possible output ports, sorted by routing distance
} RoutEntry;

Figure 6.2: Data structure RoutEntry

6.2.2 Switch Architecture

Figure 6.3 shows a generic switch with 1: ports. The switch is assumed to be non-blocking
and have a logical crossbar inside. As we mentioned earlier, each port is associated with a
pair of input and output channels. Each port may connect to a node, which generates and

consumes messages, connect to the port of another switch, or leave it open for future usage.

 

 

 

 

Input Input FIFO Buffer Output Buffer Output
Channels Channels 0
>EEEIZD—> —EEEI]:l———>
Nonblocking
1 >CEI:I:D—> —-Cl:l:l:l:l——-> 1
0 Network 0
o o
Interconnect
O O
k-l >EE1133—> —CEEED--*——i> k-l

 

 

 

 

 

 

 

Figure 6.3: A generic model of a switch

The sketch of Switch deﬁnition is shown in Figure 6.4. In each input port, a buffer
called ‘inbuf’ is used to hold incoming packets. Each output port also has a small ‘outbuf’

to temporarily store the packet coming from an inport. The use of small output buffer

166

simpliﬁes switch processing because there is no need to check if the downstream switch
has enough buffer space. For virtual cut-through and wormhole switches, there are only
FIFO input buffers in inports. For store-and-forward, shared buffer pool is supported as
well as FIFO input buffers. Note that store-and-forward and virtual cut-through schemes
need a big enough buffer to hold the max sized packet, while wormhole scheme only needs
a buffer to hold the packet header (one or a few ﬂits). Arbitration schemes to assign an

outport among competing incoming packets include round-robin and FCFS.

class Switch :: public task {

private:
int id; // identiﬁer of the switch
int numinport; // number of input ports
int numoutport; // number of output ports
qhead *inbuﬂMAXPORT]; // buffer for each input port, from InChannel
qtail *outbuﬂMAXPORT]; // buffer for each output port, to OutChannel
Flit *sharebuf[NUMBUF]; // used for shared buffer only
int busyﬂag[MAXPORT]; // status of each output port
int bandwidth [MAXPORT]; // bandwidth of each output port
int matchPort[MAXPORT]; // match inport to outport
int readyﬂVIAXPORT]; // status of each input port

// private and public member functions;

Figure 6.4: Deﬁnition of class Switch

The switch process in each cycle is highlighted in Figure 6.5. Each inport can have the

following six statuses.

o Idle: no packet is waiting for transmission;

167

o Not-ready: the packet in the inport is not totally buffered in store-and-forward switch—
ing or virtual cut-through switching if the packet has been blocked before; or the

header has not completely arrived in wormhole switching;

0 Ready: the packet in the inport is ready to be forwarded depending on the forwarding

scheme;

0 In-transit: part of the packet has been forwarded and the rest of the packet will be

continually forwarded to the corresponding outport;

0 Go: the packet is assigned an available outport and its header ﬂit will be forwarded

in the current cycle;

0 Blocked: the packet cannot be forwarded because there is no available outport for it;

Statuses ‘idle’, ‘not-ready’, and ‘ready’ will be set in step 1 and used in step 2. Statues
‘go’ and ’blocked’ will be set in step 2 and used in step 3. Status ‘in-transit’ will be set in
step 3 and used in all steps. An outport has two statuses: ‘idle’ and ‘occupied’, which are

well self-explained.

168

step 1: check the status of each inport
store-and-forward: if the whole packet has been buffered;
virtual-cut-through: the port is cut-through or store-and-forward mode;
wormhole: if the header has been received;
set ‘idle’, ‘not-ready’, ‘ready’ and keep ‘in-transit’;

step 2: search routing table for an output port

if (an inport is ‘idle’, ‘not-ready’, ‘in-transit’)
do nothing;

if (an inport is ‘ready’)
ﬁnd an available outport for it;

if (there is more than one available outport)
select the one with shortest path length;
mark the inport ‘go’;
keep the mapping from inport to outport in the switch;

if (there is one available outport)
select that outport;
mark the inport ‘go’;
keep the mapping from inport to outport in the switch;

if (more than one inport compete the same outport)
use arbitration principle to assign the outport to one of the inports;
mark the selected inport ‘ready’ and the other inports ‘blocked’;
keep the mapping from inport to outport in the switch;

if (no output port available for an inport)
mark the inport ‘blocked’

step 3: forwarding from inports with ‘in-transit’ and ‘go’
if (an inport is marked ‘idle’ or ‘not-ready’)
do nothing;
if (an inport is marked ‘in-transit’)
forward its ﬂit to the corresponding outport;
mark the inport ‘idle’ if it is last ﬂit;
mark the outport ‘idle’ if it is last ﬂit;
if (an inport is marked ‘go’)
forward its ﬂit to the corresponding outport;
mark the inport ‘in-transit’;
mark the outport ‘occupied’;
if (an inport is marked ‘blocked’)
block the packet;
change it to store-and-forward mode if virtual cut-through;

Figure 6.5: Algorithm to simulate switch forwarding in each cycle

169

At each simulation cycle, a switch checks the status of its input ports, with no depen-
dence on the network topology, trafﬁc patterns, or internal switching techniques in other
switches. If an input port has an in-transit packet, whose header has been forwarded to an
output port, the switch will continuously forward its ﬂits one by one to the outport. If an
input port has a new ‘ready’ packet, whose header ﬂit has not been forwarded, the switch
will try to ﬁnd an available output port based on its header information and table lookup. If
more than one input packet compete the same output port, the channel arbitration principle
is applied to coordinate the access of output ports. If no output port is available, the packet
will be blocked. By default, a switch forwards one ﬂit from an input port to an output
port (if available) within one cycle, and a k-port switch can forward at most k ﬂits from k
inports to k outports in one cycle if no channel contention. However, if a port has higher
bandwidth, more than one ﬂits can be forwarded in one cycle depending on the bandwidth.
At next cycle, the input port of the corresponding downstream switch will have the packet

in its buffer.

The simulator is implemented to be able to handle different forwarding schemes: store-
and-forward, virtual cut-through, and wormhole. For store-and-forward, the whole packet
will be buffered before it is forwarded, resulting in longer transmission latency. For virtual
cut-through, a packet will be forwarded as soon as the header information has been re-
ceived. If the packet is blocked in an inport because of no available output port, the whole
packet has to be buffered before it is ready to be transmitted from this inport. For wormhole
switching, its input buffer does not have to be big enough to hold a whole packet. Flits can

be forwarded at any time as long as the header information has been completely received.

170
6.2.3 Incoming and Outgoing Channels

The class deﬁnitions InChannel and OutChannel are shown in Figure 6.6. They use qhead
and qtail deﬁned in the task library, respectively. For qhead, a FIFO queue, we can ‘get’ a

ﬂit from its head. For qtail, which is also an FIFO queue, we can ‘put’ a new ﬂit to its tail.

class InChannel {
friend Switch;
private:
int multiple; l/ number of channels connecting a switch pair
int bandwidth[MULTIPLE]; // channel bandwidth
qhead *InBuﬂMULTIPLE]; // buffers for multiple channels

// member functions

class OutChannel {
friend Switch;

private:
int multiple; // number of channels connecting a switch pair
int bandwidth[MULTIPLE]; // channel bandwidth
qtail * OutBuﬂMULTIPLE]; // buffers for multiple channels

// member functions

Figure 6.6: Class Deﬁnitions of InChannel and OutChannel

All channels in the network are organized as two-dimension arrays. Inchannel[i][j]

means the incoming channel in switch 2' coming from switch j and 0utchannel[i][j] means

171

the outgoing channel in switch 2' going to switch j. If there is no channel connecting to two
switches, the buffer space in InChannel and OutChannel is empty. As we mentioned earlier,
each unidirectional channel is represented by both InChannel and OutChannel. Outchan-
nel[i]|'j]’s OutBuf points to the tail of Inchannelﬁ][i]’s InBuf and they logically represent
the same unidirectional channel from switches z’ to j. The separation of InChannels and
OutChannels simulates input buffer and output buffer in a switch so that a switch does not

need to check the channel status of any downstream switch.

6.2.4 Message Generation and Reception

Each node will send and receive messages and these two processes are simulated by two
classes MsgSend and Mngecv. Remember that we do not distinguish messages and pack-
ets in this Chapter. For MsgSend, it generates next message after a period of time based
on message generation rate. The message size follows message size distribution and the
destination follows destination distribution, which will be described in Chapter 6.3. Since
we use ﬁnite source model holding one distinct message, next message cannot be gener-
ated unless the previous message has completely left the source buffer. In the simulation,
we suppose a ﬂit has 1 byte. Each message have a number of ﬂits (based on its message
size) including one or more than one header ﬂit (depending on the header delay in the user
parameter ﬁle) and one tail ﬂit. The header ﬂit includes the following information: source,
destination, message size, generated time, and header ﬂit tag. The tail ﬂit has a special tag

to indicate the end of a message.

When a message (packet) arrives in the destination node, Mngecv will remove it from

172

its buffer and record important performance data like latency and message size. Some

calculations are needed to compute average latency in a batch.

6.2.5 Flow Control

In SimNet, we suppose all switches have link-level back pressure, which means that if a
buffer is full in the downstream switch, the upstream switch will receive a ‘stop’ signal and
stop sending more packets to the downstream switch until it receives a ‘go’ signal later on.
This ﬂow scheme is similar to the one in Myrinet [7]. Therefore, we do not worry about

packet loss due to buffer overﬂow in the simulation.

6.2.6 Basic Parameters

An input ﬁle including all needed parameters are required by the simulator. These param-
eters include important information for network topology and routing. Major parameters

are listed in Table 6.1.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

BufSize buffer size for each input port

BurstRate bursty rate for bursty message

DistType destination distribution

MastgSize maximum packet size

MinMsgSize minimum packet size

MeanMsgSize mean packet size

MsgSizeType message size: ﬁxed, bimodal, or bursty

Mngate message generation rate

RouteType what routing algorithm: ETR, ATR, UPDOWN, SR
SwitchType what switching techniques: store-and-forward, cut-through
Bquype FIFO or shared buffer (for store-and-forward)
Tuning how much network tuning will be done

 

Table 6.1: List of major parameters in SimNet

 

173

6.2.7 Performance Metrics

When the number of received messages is over a certain number and the statistical average
latency is within a conﬁdential interval, the simulator will calculate and report the important
performance data and terminate itself. As we have discussed in Chapter 2.6, we consider
the following three important performance metrics: average message latency, sustained
network throughput, and fairness ratio. In most plots in this thesis, the performance is
represented by its latency-throughput curve. We also mark unfair situation in some ﬁgures.

Fairness ratio is calculated by considering node types (i.e., server or client).

6.3 Workload and Topology Support

In a switch-based network, network performance depends on the subtle interaction between
the network topology, the routing algorithm, and the trafﬁc distribution. In order to eval-
uate network interconnects, a simulator should consider the following workloads: trafﬁc
patterns, message size distributions, and temporal distribution [80]. To facilitate a wide
range of experiments, SimNet supports various workloads.

In terms of destination distribution (i.e., trafﬁc pattern), We have the following three

major types of destination distribution.

1. Uniform trafﬁc: any node will uniformly communicate with any other node.

2. Global client/server trafﬁc: k nodes out of n nodes are servers and all the other
nodes are clients. For a client, a certain rate of trafﬁc is sent uniformly among those

servers and the rest of trafﬁc is sent uniformly among those clients. A server sends

174

messages to other clients and servers with uniform distribution. This trafﬁc pattern
is a simpliﬁed version of client/server trafﬁc. Because in a real network, a client
may not uniformly communicate with all servers; instead, it may communicate with
a speciﬁc server more frequently. Therefore, we come up with the following more

realistic and general trafﬁc pattern.

3. Workgroup and client/server trafﬁc: k out of n nodes are servers and the rest 72 - k
nodes are partitioned into g workgroups. A trafﬁc distribution matrix is used to
describe how many percentage of trafﬁc will be transmitted among 9 workgroups
and k servers. Trafﬁc patterns for all nodes in the same workgroup are same. The

example expression of trafﬁc distribution has been shown in Chapter 5.5. 1.

The following four message size distributions are used in our simulation:

0 Fixed length: all message size is same.

0 Exponential: message size is an exponential distribution with a given mean.

0 Bimodal: 75% of messages are short message with 100 bytes, and 25% of messages

are long message with 1,500 bytes.

0 Bursty: burst trafﬁc happens at a speciﬁc rate and burst size is 10,000 bytes. Other

messages are 100 bytes long.

In actual network applications, the bimodal and bursty distributions are typical and the
hot—spot destination pattern is commonly observed. The ﬁxed message size and the uni-
form trafﬁc distribution are combined to demonstrate the maximum sustained throughput

or worst case latency.

175

We use ﬁnite input source to model the message generation in each node because it is
more realistic than the inﬁnite source model typically used by other researchers. Moreover,
due to the limitation of memory space, the over-saturated behavior is extremely difﬁcult
to measure using the inﬁnite source model. A ﬁnite source model may have up to K out-
standing messages. In high performance computing, most parallel programs use blocking
send to deliver messages. Thus, the next send will not be initiated until the previous one is
completed. In a client/server environment, a server may receive several requests and have
several messages waiting on the queue to send. For simplicity, we consider the case of
K = 1, which means next message cannot be generated unless the previous message has
completely left the source.

Let a: be the time duration between the time when the current message has completely
left the source buffer and the time when the next message is generated in each node. Thus, a:
is a random variable based on an exponential distribution. The smaller the :r is, the heavier
the workload is. Note that the actual effective network workload, which depends on both
:1: and how fast a message leaves its source in each switch, may not be exactly same for
same a: under different network situations. However, this should be enough to fairly and
practically evaluate the performance.

The routing algorithms implemented in the simulator can handle all topologies except
that ETR and ATR cannot deal with a network which has no Eulerian trail. The sufﬁcient
and necessary condition for the existence of Eulerian trail in a network is that each switch
has even number of channels. Such a limitation can be partially released by removing a
few channels and adding them back to the Eulerian trail. In most simulation experiments,

we just consider networks with Eulerian trails. The simulator needs an input ﬁle, which

176

includes information for switch interconnects, number of ports and their bandwidth in each
switch, number of nodes in each switch, and server location if applicable. After reading
these input data, NetTop is initiated and keeps all information related to network topology.

InChannels and OutChannels will be linked together based on switch interconnects.

6.4 Flexibility and Expandability

The simulation environment for switch-based networks (SimNet) provides an extensible,
object-oriented framework for performance evaluation. The simulator’s structure facili-
tates experiments with ﬂexible routing algorithms, topologies, switching schemes, buffer
management, and arbitration policies. All experiments can be run independently based on
various design parameters. At each run time, the simulator can instantiate unique network
workloads, routing algorithms, switching-schemes, and buffer management. Major com-
ponents in SimNet are deﬁned as different classes, which makes clean and well-deﬁned

boundaries between the components and allows easy extension of the simulator.

For example, NetTune was not included in the simulator originally. Later on, we need to
add this module for network tuning in order to reconﬁgure network, balance channel utiliza-
tion, and improve performance. Since this module will only communicate with RouteAlg

and NetTop, there is no need to change any other modules (classes).

The simulator can easily incorporate new topologies, routing algorithms, switching

techniques, trafﬁc patterns, and data collection routines.

177
6.5 Simulation Experiments

SimNet has been run many times under different network environments. Lots of perfor-
mance data have been collected and simulation results are plotted in Chapters 3, 4, and 5
for the purpose of evaluation and comparison. In each run, a speciﬁc parameter ﬁle and a
topology ﬁle are given as input arguments. The simulator will run until enough number of
packets have been received and the average latency is within a conﬁdential interval. The

major experiments are summarized here.

0 In Chapter 3, simulation experiments are used to verify the accuracy of our analytic

models and evaluate performance under various network environments.

0 In Chapter 4, we evaluate network performance for three different routing algorithms
in switch-based networks. Wormhole switching and FIFO input queuing are as-

sumed. Uniform and client/server trafﬁc are considered as well as bursty trafﬁc.

0 In Chapter 5, we demonstrate the performance improvement by applying the network

tuning algorithm, which reduces inter-switch trafﬁc and balance channel utilization.

Chapter 7

Conclusions

In previous chapters, we have reviewed fundamental issues in switch design and switch
interconnects, described analytical models for delay analysis, proposed two simple and ef-
ﬁcient routing algorithms, presented practical solutions for network planning and tuning,
evaluated network performance under various network environments. This chapter con-

cludes the research contributions of this thesis and indicates future work.

7 .1 Contributions of the Thesis

In this thesis, we have presented our research results on switching techniques, routing al-
gorithms, network planning and tuning, and performance evaluation in switch-based net-
works. Major research contributions are summarized in the following.

178

179
7.1.1 Performance Evaluation of Switching Techniques

Performance evaluation is very important for the design of interconnection networks. In
switch-based networks, three switching schemes (store-and-forward, virtual cut-through,
and wormhole) have been proposed and applied in commercial products. To understand
their performance tradeoff, analytical models are presented for average packet delay analy-
sis in both store-and-forward switching and cut-through switching. The analytical models,
which consider some simpliﬁed assumptions in switch-based networks, are inspired by
previous work based on M/M/l queuing theory. Meanwhile, simulation results are used to
verify the accuracy of the models and evaluate performance under various network envi-

ronments.

7.1.2 Routing Algorithm

We have proposed two simple and efﬁcient deadlock-free routing algorithms for arbitrary
switch-based networks. Both algorithms use an Eulerian trail as the ‘base line’ to control
channel dependency and provide reasonable routes. The ETR algorithm is very simple,
which concatenates the two unidirectional Eulerian trail together to provide reasonable
routes. The ATR algorithm is more complex, which adds different types of shortcuts to
minimize channel utilization and balance channel utilization. The ATR has the potential to
allow shorter routing paths and evenly utilize channels.

Two related algorithms are reviewed and their performance are compared with ATR via
simulation experiments under various topologies and trafﬁc workloads. For most of cases,

our proposed algorithm ATR and the SR provide better performance than the up*ldown*.

180
While the ATR has close performance to the SR, the ATR is much more simple than the

SR. The up*ldown* may concentrate trafﬁc near root switch and therefore its performance

heavily depends on topologies and the selection of root switch.

7 .1.3 Network Planning and 'lhning

We have studied issues of network planning and routing in switch-based LANs. Impor-
tant design considerations have been discussed in terms of network characteristics, routing
path length, trafﬁc locality, trafﬁc balance, and performance metrics. We have proposed
a method for initial topology design based on block design. This method can efﬁciently
minimize the maximum path length and the average path length by appropriately selecting
differences and rearranging the order of objects to construct block Bo. A systematic method
has been presented for network tuning and incremental reconﬁguration. Our method tries to
maximize local trafﬁc, balance inter-switch trafﬁc, reduce remote trafﬁc, and balance chan-
nel utilization. Our algorithm for network tuning has been implemented based on our block
design and Eulerian trail routing algorithm. Simulation results show that network tuning
can signiﬁcantly improve network performance, especially when we have better knowledge

of trafﬁc distribution after a network is operational.

7.1.4 Simulations

A ﬂexible simulator has been implemented in C++ using a multi-threaded task library. The
simulator can handle different routing algorithms, topologies, trafﬁc workloads, switching

schemes, and buffer management. Users can initiate a speciﬁc network conﬁguration by

181

setting proper parameters in the input parameter ﬁle. New features can be easily incorpo-
rated in the simulator. Comprehensive simulation experiments provides an insight view of
network behavior under various network conditions. The simulation results have been used
to evaluate performance for different switching techniques, routing algorithms, topologies,

workloads, and network tuning methods.

7 .2 Future Work

There are many directions worth further exploration. The following issues indicate some

possible directions to extend the work which have been done in this thesis.

0 More realistic trafﬁc pattern: We have considered uniform trafﬁc and client/server
trafﬁc in this dissertation. Bursty trafﬁc is also used to measure performance for
different routing algorithms. However, to make practical design and evaluation, we
should consider more realistic trafﬁc, like bursty trafﬁc and trafﬁc traces from real

LAN networks, in network routing, planning, and tuning.

0 Evaluation of network planning and tuning methods: Although we have pre-
sented the preliminary evaluation results in Chapter 5.6, a good approach is needed to
evaluate our proposed algorithms and compare them with other existing algorithms.
Due to the difﬁculty, this issue is not addressed very well in the dissertation and

should be studied in the future.

0 Stochastic analysis: Stochastic analysis of the network is another direction for future

study, since the stochastic behavior of host to host interaction is very relevant in

182

LAN 3. The interaction can signiﬁcantly inﬂuence network performance.

Multicast: Although our routing algorithm focuses on unicast routing, we are also
interested in multicast communication. It is clear that efﬁcient multicast communi-
cation support is critical to the performance of many applications. Some hardware
multicast methods have been proposed in [85] for the gigabit ATM switch and [24]
for cut-through switches. In [86], a reliable multicast scheme has been proposed
for Myrinet using Fast Message [87]. A trip-based multicast routing algorithms is
also proposed in [88] for any wormhole routed network using two virtual channels.
When multicast communication is becoming very important in practice, more re-
search is needed to allow efﬁcient multicast communication in switched networks.
Since the performance of a multicast algorithm varies on different platform, a more
reality and practical method is to consider some critical machine-speciﬁc parameters

as proposed in [51] when we design the multicast algorithm.

BIBLIOGRAPHY

 

Bibliography

[1] L. M. Ni, “Issues in designing truly scalable interconnection networks,” Proc. of the
1996 ICPP Workshop on Challenges for Parallel Processing, pp. 74 — 83, Aug. 1996.

[2] P. Kermani and L. Kleinrock, “Virtual cut—through: A new computer communication
switching technique,” Computer Networks, vol. 3, no. 4, 1979.

[3] W. J. Dally and C. L. Seitz, “The torus routing chip,” Journal of Distributed Comput-
ing, vol. 1, no. 3, pp. 187 — 196, 1986.

[4] W. J. Dally, “Virtual channel ﬂow control,” Computer Networks, vol. 3, pp. 194 — 205,
Mar. 1992.

[5] R. J. Souza and et al, “The GIGAswitch system: A high-performance packet switch-
ing platform,” Digital Technical Journal, vol. 6, Jan. 1994.

[6] Ancor Communications Inc. http://www. ancot: com/cxt.html.

[7] N. J. Boden and et al, “Myrinet — a Gigabit-per-second local area network,” IEEE
Micro, vol. 15, pp. 29 - 36, Feb. 1995.

[8] T. W. Giorgis, “29 switching hubs save the bandwidth,” BYTE, pp. 162 — 169, July
1995.

[9] A. G. Nowatzyk, M. C. Browne, E. J. Kelly, and M. Parkin, “S-connect: from net-
works of workstations to supercomputer performance,” Proceedings of the 22nd In-
ternational Symposium on Computer Architecture, June 1995.

[10] R. Horst, “ServerNet deadlock avoidance and fractahefral topologies,” Proceedings
of the International Parallel Processing Symposium, pp. 274 — 280, Apr. 1996.

[11] C. B. Stunkel and et a], “The SP1 high-performance switch,” Proceeding of 1994
Scalable High Performance Computing Conference ( SHPC C94 ), pp. 150 — 157, May
1994.

[12] “Cray T3D system architecture overview,” tech. rep., Cray Research Inc., Sept. 1993.

[13] M. Noakes, D. A. Wallach, and W. J. Dally, “The j-machine multicomputer: An ar-
chitecture evaluation,” International Symposium on Computer Architecture, pp. 224 —
236, 1993.

183

184

[14] B. Duxett and R. Buck, “An overview of the Ncube-3 supercomputer,” Proceedings
of the Frontiers of Massively Parallel Computation, pp. 458 — 464, 1992.

[15] Intel Corporation, “Paragon XP/S product overview,” 1991.

[16] L. M. Ni and P. K. McKinley, “A survey of wormhole routing techniques in direct
networks,” IEEE Computer, vol. 26, pp. 62 — 76, Feb. 1993.

[17] M. Gerla and L. Kleinrock, “On the topological design of distributed computer net-
works,” IEEE Transaction on Communications, vol. COM-25, pp. 48 — 60, Jan. 1977.

[18] B. Gavish, “A general model for the topological design of computer networks,”
GLOBECOM’86, PP. 1584 - 1588, 1986.

[19] C. Ersoy and S. S. Panwar, “Topological design of interconnected LAN/MAN net-
works,” IEEE Journal of Selected Areas in Communications, vol. 11, pp. 1 172 - 1 182,
Oct. 1993.

[20] R. Elbaum and M. Sidi, “Topological design of local area networks using genetic
algorithms,” INF OCOM ’95, pp. 64 — 71, Apr. 1995.

[21] W. Qiao and L. M. Ni, “Adaptive routing in irregular networks using cut-throughput
switches,” Proceedings of the 1996 International Conference on Parallel Processing,
vol. 1, pp. 52 — 60, Aug. 1996.

[22] L. M. Ni, W. Qiao, and M. Yang, “Switches and switch interconnects,” MPPOI’97,
pp. 52 — 60, June 1997.

[23] 3Com Corporation, “Switches and routers,” http://www.3com.com/solutions, 1997.

[24] M. Yang and L. M. Ni, “Design of scalable and multicast capable cut-through
switches for high-speed LAN 3,” accepted to appear in the Proceedings of the I 997
International Conference on Parallel Processing, Aug. 1997.

[25] M. Yang and L. M. Ni, “Incremental design of scalable interconnection networks
using basic building blocks,” Proceedings of the Seventh IEEE symposium on parallel
and Distributed Processing, pp. 252 — 259, Oct. 1995.

[26] J. Rexford, “Tailoring router architectures to performance requirements in cut-through
networks,” Ph.D Dissertation, University of Michigan, 1996.

[27] S. Daniel, J. Rexford, J. Dolter, and K. Shin, “A programmable routing controller
for ﬂexible communications in point-to-point networks,” Proceedings-of the IEEE
International Conference on Computer Design, pp. 320 — 325, Oct. 1995.

[28] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, and J. Sandberg, “Virtual
memory mapped network interface for the Shrimp multicomputer,” Proceedings of the
International Symposium on Computer Architecture, pp. 142 — 153, Apr. 1994.

 

185

[29] D. Cohen and G. Finn, “ATOMIC: A low-cost, very-high-speed, local communica—
tion architecture,” in Proceedings of the 1993 International Conference on Parallel
Processing, vol. I, pp. 39 — 46, Aug. 1993.

[30] Y. Oie, T. Suda, M. Murata, and H. Miyahara, “Survey of the performance of non-
blocking switches with ﬁfo input buffers,” Proceedings of ICC ’90, Apr. 1990.

[31] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, “Input versus output queuing on
a space-division packet switch,” IEEE Transactions on Communications, vol. 35,
pp. 1347 - 1356, Dec. 1987.

[32] M. G. Hluchyj and M. J. Karol, “Queuing in high-performance packet switching,”
IEEE Journal on Selected Areas in Communications, vol. 6, pp. 1587 — 1597, Dec.
1988.

[33] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, “High-speed switch schduling for
local area networks,” ACM Transactions on Computer Systems, vol. 11, pp. 319 —
352, Nov. 1993.

[34] R. O. Onvural, Asynchronous Transfer Mode Networks: Performance Issues. Artech
House, 1994.

[35] M. Katevenis, P. Vatsolaki, and A. Efthmiou, “Pipelined memory shared buffer for
VLSI switches,” Proceedings of ACM SIGCOMM ’95, pp. 39 — 48, Aug. 1995.

[36] J. P. Li and M. W. Mutka, “Priority based real-time communication for large scale
wormhole networks,” Proceedings of the International Parallel Processing Sympo-
sium, pp. 433 - 438, Apr. 1994.

[37] Myricom, “Myrinet link speciﬁcation,” http://www.myri.com/products/docurnentation/linkl.

[38] C. Seitz and et al, “The design of the Caltech Mosaic C multicomputer,” Proceedings
of the University of Washington Symposium on Integrated Systems, 1993.

[39] M. Schroeder and et al., “Autonet: a high-speed, self-conﬁguring, local area net-
work using point-to-point links,” IEEE Journal on Selected Areas in Communications,
vol. 9, Oct. 1991.

[40] R. Horst, “TNet: A reliable system area network,” IEEE Micro, pp. 36 — 44, Feb.
1995.

[41] S. S. Owicki and A. R. Karlin, “Factors in the performance of the ANl computer
network,” Performance Evaluation Review, vol. 20, pp. 167 — 180, June 1992.

[42] J. Duato, “Deadlock-free message routing in multiprocessor interconnection net-
works,” IEEE Transactions on Parallel and Distributed System, vol. 4, Nov. 1993.

[43] X. Lin, P. McKinley, and L. Ni, “The message ﬂow model for routing in wormhole-
routed networks,” IEEE Transactions on Parallel and Distributed Systems. PP- 755 —
760, July 1995.

 

186

[44] A. K. VenKatramani and T. M. Pinkston, “An efﬁcient fully adaptive deadlock recov—
ery scheme: DISHA,” Proceedings of the 22nd International Symposium on Com-
puter Architecture, pp. 201 — 210, June 1995.

[45] A. K. VenKatramani, T. M. Pinkston, and J. Duato, “Generalized theory for deadlock-
free adaptive wormhole routing and its application to disha concurrent,” Proceedings
of the International Parallel Processing Symposium, pp. 815 — 821, Apr. 1996.

[46] L. Cherkasova, V. Kotov, and T. Rokicki, “Fibre channel fabrics: Evaluation and
design,” 29th Hawaii International Conference on System Sciences, Feb. 1995.

[47] B. Abali, “A deadlock avoidance method for computer networks,” Proceedings of

the First International Workshop on Communication and Architectural Support for
Network-based Parallel Computing, pp. 61 — 72, Feb. 1997.

[48] F. Silla, M. Malumbres, A. Robles, P. Lopez, and J. Duato, “Efﬁcient adaptive routing
in networks of workstations with irregular topology,” Proceedings of the First Inter-

national Workshop on Communication and Architectural Support for Network-based
Parallel Computing, pp. 46 — 60, Feb. 1997.

[49] K. M. Khalil and P. A. Spencer, “A systematic approach for planning, tunning and
upgrading local area networks,” GLOBECOM’9I, pp. 658 — 663, 1991.

[50] R. Kevavan, K. Bondalapati, and D. Panda, “Multicast on irregular switch-based net-
works with wormhole routing,” Third International Symposium on High Performance
Computer Architecture (HPCA-3), pp. 48 - 57, Feb. 1997.

[51] J. L. Park, H. Choi, N. Nupairoj, and L. M. Ni, “Construction of optimal multicast
trees based on the parameterized communication model,” Proceedings of Intema-
tional Conference on Parallel Processing, vol. 1, pp. 180 -— 187, Aug. 1996.

[52] K. M. Khalil, K. Q. Luc, and D. V. Mlson, “LAN trafﬁc analysis and workload char-
acterization,” Proceedings of 15th IEEE Conference on Local Computer Networks,
Oct. 1990.

[53] J. Shoch and J. Hupp, “Measured performance of an Ethernet local network,” Com—
munications of ACM, pp. 711 — 721, Dec. 1980.

[54] M. Rose and K. McCloghrie, How to Manage Your Network using SNMP: The Net-
work Management Practicum. Prentice Hall, 1995.

[55] Hewlett Packard Company http://www. hp. com/rnd/technoI/switch/impperf/lans.htrn.
[56] 3Com Corporation http://www.3com.com/nsc.html, 1996.

[57] W. Qiao and L. M. Ni, “Evaluate routing algorithms in cut-through switched net-
works,” Tech. Rep. MSU-CPS-ACS-96—07, Michigan State Univ., Dept. of Computer

Science, June 1996.

187

[58] A. Kumar and L. Bhuyan, “Evaluating virtual channels for cahce—coherent shared-
memory multiprocessors,” International Conference on Supercomputing, May 1996.

[59] L. Kleinrock, Communication Nets: Stochastic message ﬂow and delay. McGraw-
Hill, New York, 1964.

[60] L. Kleinrock, Queuing Systems, vol. 11: Computer Applications. Wiley-interscience,
New York: John Wiley & Sons, 1976.

[61] J. W. Dolter, P. Ramanathan, and K. G. Shin, “Performance analysis of virtual cut-

through switching in harts: a hexagonal mesh multicomputer,” IEEE Transactions on
Computers, vol. 40, pp. 669 - 680, June 1991.

[62] Y. S. Youn and C. K. Un, “Performance analysis of an integrated voice/data cut-
through switching network,” Journal of Distributed Computing, vol. 21, no. 1, pp. 41
— 51, 1991.

[63] T. D. Morris and H. G. Perros, “Approximate analysis of a discrete-time tandem net-
work of cut-through queues with blocking and bursty trafﬁc,” Performance Evalua-
tion, vol. 17, no. 3, pp. 207 — 223, 1993.

[64] C. P. Kruskal and M. Snir, “The performance of multistage interconnection networks
for multiprocessors,” IEEE Transactions on Computers, vol. C-32, pp. 1091 — 1098,
Dec. 1983.

[65] W. J. Dally, “Performance analysis of k-ary n-cube interconnection networks,” IEEE
Transactions on Computers, vol. 39, pp. 775 - 785, June 1990.

[66] A. Agarwal, “Limits on interconnection network performance,” IEEE Transactions
on Parallel and Distributed Systems, vol. 2, pp. 398 - 412, Oct. 1991.

[67] L. Kleinrock, Queuing Systems, vol. 1: Theory. Wiley-interscience, New York: John
Wiley & Sons, 1975.

[68] J. R. Jackson, “Networks of waiting lines,” Operations Research, vol. 5, pp. 518 —
521, Aug. 1957.

[69] G. L. Fultz and L. Kleinrock, “Adaptive routing techniques for store-and-forward
computer communication networks,” Proceedings of International Conference on
Communication, pp. 14 — 16, June 1971.

[70] W. Oed The Cray Research Massively Parallel Processor System: Cray T3D, Nov.
1993.

[71] C. Stunkel and et al., “Sp2 high-performance switch,” IBM Systems Journal, vol. 34,
no. 2, pp. 185 — 204, 1995.

[72] W. Dally and C. Seitz, “Deadlock-free message routing in multiprocessor intercon-
nection networks,” IEEE Transactions on Computers, vol. 036, pp. 547 — 553, May
1987.

188

[73] C. J. Glass and L. M. Ni, “The turn model for adaptive routing,” Journal of the Asso-
ciation for Computing Machinery, vol. 41, pp. 874 — 902, Sept. 1994.

[74] H. Park and D. P. Agrawal, “Generic methodologies for deadlock-free routing,” Pro-
ceedings of the International Parallel Processing symposium, pp. 638 — 643, Apr.
1996.

[75] W. Qiao and L. M. Ni, “Adaptive routing in irregular networks using cut-throughput
switches,” Tech. Rep. MSU-CPS-ACS-108, Michigan State Univ., Dept. of Computer
Science, Dec. 1995.

[76] F. Harary, Graph Theory. Readings, Massachusetts: Addson-Wesley, 1972.

[77] W. Qiao and L. M. Ni, “Network planning and tuning in switch—based lans,” submitted
for publication, July 1997.

[78] J. Opatmy, N. Srinivasan, and V. S. Alagar, “Highly fault-tolerant communication
network models,” IEEE Transactions on Circuits and Systems, vol. 36, pp. 23 — 29,
Jan. 1989.

[79] B. Yener, Y. Ofek, and M. Yung, “Topological design of loss-free switch-based
LANs,” INF OCOM ’95, pp. 88 — 96, Apr. 1995.

[80] A. A. Chien and M. Konstantinidou, “Workload and performance metrics for eval-
uating parallel interconnects,” IEEE Computer Architecture Technical Committee

Newsletter, pp. 23 - 27, Summer-Fall 1994.

[81] K. Bolding, S. E. Choi, and M. Fulgharn, “The chaos router simulator,” Parallel Com-
puter Routing and Communication Workshop, pp. 115 - 124, May 1994.

[82] P. K. McKinley and C. Trefftz, “MultiSim: a simulation tool for the study of large-
scale multiprocessors,” Proceedings of the International Workshop on Modeling,

Analysis, and Simulation of Computer and Telecommunication Systems, pp. 57 - 62,
Jan. 1993.

[83] J. R. Jump and S. Lakshmanamurthy, “NETSIM: a general-purpose interconnection
network simulator,” Proceedings of the International Workshop on Modeling, Analy-
sis, and Simulation of Computer and Telecommunication Systems, pp. 121 — 125, Jan.

1993.

[84] E. Olk, “PARSE: Simulation of message passing communication networks,” Proceed-
ings of the Annual Simulation Symposium, pp. 115 - 124, Apr. 1994.

[85] J. S. Tumer, “An optimal nonblocking multicast virtual circuit switch,” Proceedings
of 1994 INF OCOM. Pp. 298 — 305, June 1994.

[86] K. Verstoep, K. Langendoen, and H. Bal, “Efﬁcient reliable multicast on myrinet,”
Proceedings of the 1996 International Conference on Parallel Processing, vol. 3,
pp. 156 — 165, Aug. 1996.

189

[87] S. Pakin, M. Lauria, and A. Chien, “High performance messaging on workstations:
Illinois fast messages (fm) for myrinet,” Supercomputing ’95, 1995.

[88] Y.-C. Tseng, D. K. Panda, and T. Lai, “A trip-based multicasting model in wormhole-
routed networks with virtual channels,” IEEE Transaction on Parallel and Distributed
Systems, vol. 7, pp. 138 - 150, Feb. 1996.