I!
.. :hf .
I

.5... r?

r .“
tﬁadvuhx Kauai... HP .it.‘
Ema», .3

Yr. t
llaectnytthn

30‘33 .g

nahﬁﬁga
... .

an... .95.
:I, '0‘."'I:lvn$0‘llu {‘0' '
39.399.17.155: :
81.55013 :3
x: ban: .
.21.. 95...”
1.2!... .. x 1 {If
s . .t'sisiiﬁwto
. ﬁt:a‘!..s..nhrelpfﬂuo.lh
«Furs Ardgtl.fulu~
2,333.36. 37".. .
£51.53}...

[nu-3:332:31
{at}!!!
4v, ‘0“ u:

 

This is to certify that the
dissertation entitled

CoStore: A STORAGE CLUSTER ARCHITECTURE
USING NETWORK ATTACHED STORAGE DEVICES

presented by

Yong Chen

has been accepted towards fulﬁllment
of the requirements for

Doctoral degree in Computer Science
& Engineering

 

 

\J‘\\w& Mimi:

Major professor

Date May 7, 2002

 

MS U is an Afﬁrmative Action/Equal Opportunity Institution 0-12771

 

 

 

LIBRARY
Michigan State
University

 

 

 

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6/01 c:/CIRC/DateDue.p65-p. 15

\a'

" . "
u .11)
f 1 1w

CoStore: A STORAGE CLUSTER ARCHITECTURE
USING NETWORK ATTACHED STORAGE DEVICES

BY

Yong Chen

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2002

ABSTRACT

CoStore: A STORAGE CLUSTER ARCHITECTURE
USING NETWORK ATTACHED STORAGE DEVICES

BY

Yong Chen

The exponential growth of the lntemet has presented numerous challenges to
computer scientists and engineers. One of the major challenges is to design
advanced storage systems to meet the demanding requirements for high
performance, high capacity, and strong reliability.

This dissertation proposes the CoStore cluster architecture to construct reliable
storage systems using network attached storage devices. The performance of a
CoStore prototype has been measured and the preliminary results show that
CoStore’s performance is comparable to that of the efficient NFS and that the
CoStore cluster architecture is highly scalable in terms of the cluster size.

This dissertation study investigates remote replication using CoStore clusters
to construct highly reliable and highly available storage systems. It has been
demonstrated that if a cluster is mirrored over the network with sufficient bandwidth
and low latency, the remote replica cluster can considerably reinforce
preparedness for disaster recovery without sacrificing performance.

This dissertation study also confirms the feasibility of deploying CoStore
architecture to construct reliable storage service utilizing idle disk space on

workstation clusters on existing desktop computing infrastructure.

Copyright © 2002 Yong Chen

All rights reserved.

ACKNOWLEDGEMENTS

I would like to express my deep gratitude to my advisor Prof. Lionel M. Ni for
his years of advising, guiding and continuous strong support. I also would like to
thank other guidance committee members, in particular, Prof. Abdol H.
Esfahanian, Prof. Matt W. Mutka, and Prof. Zhengfang Zhou for having spent time
reviewing this dissertation and providing feedbacks.

I am grateful for the unconditional love and support from my wife, my parents
and parents-in-law, and all other members in my family in my dear home country.
Without their spiritual support, this work would not have been possible. I especially
thank my wife Danhua for her continuous support, encouragement and patience, in
addition to her love. I also would like to thank my younger sister Jun Chen for her
encouragement.

I am very glad that l have had the opportunity working as a graduate system
administrator in the Department of Computer Science and Engineering at Michigan
State University. This experience has been proved invaluable in the dissertation

study and in my career.

iv

TABLE OF CONTENTS

LIST OF TABLES ................................................................................................... viii
LIST OF FIGURES .................................................................................................. ix
Chapter 1 Introduction and Motivations ............................................................... 1
1.1 Introduction ..................................................................................................... 1
1.2 Motivations ...................................................................................................... 2
1.2.1 Evolving storage architectures ................................................................. 2
1.2.2 Data is the most valuable asset ............................................................... 4
1.2.3 Storage space utilization irony ................................................................. 4

1.3 Problem Abstraction ........................................................................................ 5
1.4 Dissertation Overview ..................................................................................... 6
Chapter 2 Background ......................................................................................... 7
2.1 Storage System Components ......................................................................... 7
2.1.1 Storage devices ........................................................................................ 7
2.1.1.1 Disk drives ......................................................................................... 7
2.1.1.2 Disk arrays ......................................................................................... 9
2.1.1.2.1 RAID level 4 and level 5 ........................................................... 11
2.1.1.2.2 RAID status ............................................................................... 13

2.1.2 I/O interfaces .......................................................................................... 14
2.1.2.1 Parallel interfaces ............................................................................ 15
2.1.2.2 Serial interfaces ............................................................................... 17
2.1.2.3 Merger of V0 Channels and Ethernet ............................................. 20

2.2 Local File Systems ........................................................................................ 22
2.2.1 File system layers in operating systems ................................................ 22
2.2.2 Virtual File System (VFS) ....................................................................... 24
2.2.3 File system structures ............................................................................ 24
2.2.3.1 Ext2 file system ................................................................................ 26

2.2.4 File system management ....................................................................... 30
2.2.4.1 Joumaling ........................................................................................ 31
2.2.4.2 Soft Updates .................................................................................... 33
2.2.4.3 Ext3: joumaling Ext2 ........................................................................ 33
2.2.4.3.1 Backwards compatibility ........................................................... 33
2.2.4.3.2 The Ex13 approach ................................................................... 34
2.2.4.3.2 Joumaling for both meta-data and data .................................... 36

2.3 Distributed File Systems ............................................................................... 36
2.3.1 Network File System (NFS) .................................................................... 38
2.3.2 Andrew and Coda File Systems ............................................................. 39

2.3.3 Zebra and xFS ........................................................................................ 40

2.3.4 Petal and Frangipani .............................................................................. 41
2.3.5 Network Attached Secure Disks (NASD) ............................................... 42
Chapter 3 The CoStore Storage Cluster Architecture and Performance
Evaluation .......................................................................................................... 46
3.1 Introduction ................................................................................................... 46
3.2 Technology Trends in Networked Storage ................................................... 47
3.2.1 Storage interfaces: block vs. file ............................................................ 47
3.2.2 Architecture taxonomy ............................................................................ 48
3.2.3 Storage Area Network (SAN) ................................................................. 50
3.2.4 Network Attached Storage (NAS) .......................................................... 51
3.2.5 Upcoming storage architectures ............................................................ 52
3.3 A Storage Cluster Architecture Using Network Attached Storage Devices.. 54
3.3.1 Architecture overview ............................................................................. 54
3.3.2 Distributed file subsystem ...................................................................... 56
3.3.3 Local file subsystem ............................................................................... 57
3.3.4 RAID subsystem ..................................................................................... 60
3.3.4.1 Block buffer cache ........................................................................... 62
3.3.5 System operations .................................................................................. 64
3.3.5.1 Cluster initialization .......................................................................... 65
3.3.5.2 Client initialization ............................................................................ 65
3.3.5.3 File and directory creation ............................................................... 66
3.3.5.4 lntra-cluster communications and deadlock .................................... 67
3.3.5.5 Recovery from node failures ........................................................... 68
3.4 Prototype Implementation ............................................................................. 69
3.5 Performance Evaluation ................................................................................ 70
3.5.1 Experimental setup ................................................................................. 70
3.5.2 Write performance: CoStore vs. NFS vs. CIFS ..................................... 74
3.5.3 Read performance: CoStore vs. NFS vs. CIFS ..................................... 74
3.5.4 Impact of distributed RAID and commit policy ....................................... 77
3.5.5 Scalability of CoStore ............................................................................. 79
3.5.6 Parity daemon’s bottleneck effect in RAID-4 ......................................... 81
3.6 Summary ....................................................................................................... 82
Chapter 4 CoStore Clusters Utilizing Idle Disk Space on Workstations ........... 83
4.1 Introduction ................................................................................................... 83
4.2 Assumptions and Environment Description .................................................. 85
4.3 Alternative Solutions ..................................................................................... 87
4.3.1 Ad hoc solutions ..................................................................................... 87
4.3.2 NBD based solutions .............................................................................. 87
4.3.3 Peer-to-peer solutions ............................................................................ 89
4.3 Feasibility Assessment of Deployment on Existing Desktop Computing
Infrastructure ....................................................................................................... 90
4.3.1 Reliability theory in RAID ........................................................................ 90

4.3.2 Feasibility assessment ........................................................................... 94

4.4 Summary ....................................................................................................... 96
Chapter 5 Reliable and Highly Available Storage Systems Using CoStore
Clusters .......................................................................................................... 97

5.1 Introduction ................................................................................................... 97

5.2 Backup and Archiving Techniques for Recovery .......................................... 97

5.2.1 Tape backups ......................................................................................... 99
5.2.2 Snapshots ............................................................................................ 100
5.2.2 Redundancy ......................................................................................... 102
5.3 Reliable and Highly Available Storage Systems Using CoStore Clusters. 103
5.3.1 Benefits and implications ..................................................................... 104
5.4 Experimental Setup ..................................................................................... 107
5.4.1 Network emulator EMIP ....................................................................... 109
5.5 Performance Evaluation .............................................................................. 112
5.5.1 Local mirroring ...................................................................................... 112
5.5.2 Remote mirroring .................................................................................. 116
5.5.3 Performance enhancing techniques .................................................... 118

5.5 Summary ..................................................................................................... 119
Chapter 6 Conclusions and Future Work ........................................................ 121

6.1 Related Work .............................................................................................. 121

6.1.1 Storage cluster architecture ................................................................. 121
6.1.2 Remote storage replication .................................................................. 123
6.1.3 Storage service utilizing idle space ...................................................... 124
6.2 Conclusions ................................................................................................. 124
6.2.1 Storage cluster architecture ................................................................. 125
6.2.2 Remote storage replication .................................................................. 125
6.2.3 Storage cluster utilizing idle space ....................................................... 126
6.3 Contributions ............................................................................................... 126
6.4 Future Work ................................................................................................ 127
6.4.1 Implementations ................................................................................... 127
6.4.2 Future research areas .......................................................................... 127
Bibliography .......................................................................................................... 129

LIST OF TABLES

SCSI parameters. DT means double transition clocking. ......................................... 16
MTI‘ F w, and MTTRW, based on system uptime data ................................................. 95
The reliability (MTT F cm”) of clusters ..................................................................... 96
Test sets and the characteristics of network environments ..................................... 108

2-1

2-2

2-4

2-5

2-6

2-7

3-1

3-2

3-3

3-4

3-5

3-6

3-7

3-8

3-9

3-10

LIST OF FIGURES

(L) Structures inside disk; (R) mechanisms in a magnetic disk drive ........................ 7
The layout of data and parity blocks in a RAID level 5 ............................................. 9
RAID level 5 host writes and reads in normal condition ......................................... 11
RAID level 5 host writes and reads in the presence of a fault ................................. 12
File system layers in UNIX ....................................................................................... 22

(a) The schematic structure of a ﬁle system; (b) the structure of Ext2 ﬁle system .26

The superblock in Ext2 ﬁle system ........................................................................... 28
(a) The structure of a Unix inode; (b) the inode in Ext2 ﬁle system ........................ 29
A directory entry in the Ext2 ﬁle system .................................................................. 3O
Network-attached secure disks (N ASD) ................................................................... 43
NASD/Cheops architecture ....................................................................................... 45
Server-attached disks (SAD) ..................................................................................... 49
Network Attached Storage (NAS) vs. Storage Area Network (SAN) ..................... 52
The structure of a CoStore cluster ............................................................................. 54
Functional modules in a CoStore daemon ................................................................ 56
Block addressing and inode numbering scheme ....................................................... 58
File system layout in a CoStore cluster with RAID level 4 ..................................... 59
Architectural similarities between CoStore and DASH ........................................... 60
Linklists in block buffer cache .................................................................................. 63
The write performance of CoStore ............................................................................ 73
The read performance of CoStore ............................................................................. 76

3-11 The impact of distributed RAID overhead and commit policy ................................ 78

3-12 The scalability of CoStore clusters ............................................................................ 80
3—13 Parity daemon’s bottleneck effect in CoStore with RAID level 4 ........................... 81
4-1 Queuing theory in the reliability of disk arrays. ....................................................... 90
4-2 The relationship between MTT F dam, and MTT F w, .................................................. 93
5-1 A CoStore cluster with RAID level 4+1 ................................................................. 104
5-2 RAID buffers and block buffer cache in CoStore ................................................... 107
5-3 The network diagram of testing environments ....................................................... 109
5—4 The internals of EMIP ............................................................................................. 111
5-5 A CoStore cluster with RAID level OIO+1 in a switched LAN .............................. 113
5-6 A CoStore cluster with RAID level 4/4+1 in a switched LAN .............................. 1 15
5-7 A CoStore cluster mirrored in various network environments ............................... 117

Chapter 1

Introduction and Motivations

1.1 Introduction

The Internet has been a phenomenon that has cultivated a worldwide information
revolution. At unprecedented paces and in unparalleled scales, the information
digitalization processes have presented numerous new challenges to IT scientists and
engineers. One of the major challenges is to design advanced storage systems to meet the
demanding requirements for high performance, high capacity, and strong reliability.

Traditional storage systems rely on ﬁle servers to copy data between clients and storage
devices. File servers, actually generic computers themselves, are not efﬁcient in moving
data between clients and storage nodes. The reason is that the data path involved includes
client network interface, system memory, disk controllers to storage devices, and then
along the same path back to the client. Inevitably memory copying and protocol processing
have introduced many delays and overhead.

As a result, ﬁle servers in storage systems have emerged as a major barrier to achieve a
better scalability. The performance of storage systems has been actively studied, including
research areas on disk arrays [Anderson 1995b; Gibson 1996; Hitz 1994; Lee 1996] and
distributed ﬁle systems [Anderson 1995b; Gibson 1996; Hartman 1993; Hitz 1994;
Sandberg 1985; Satyanarayanan 1989; Thekkath 1997].

Over the years the market has witnessed many technological innovations ranging from
faster peripheral channel, to dedicated storage area networks (SAN), and ﬁnally to

aggressively specialized storage systems with custom hardware, operating systems and ﬁle

systems. Unavoidably these highly specialized storage systems come with high prices,
which, more often than not, makes many organizations fumble for more budget on storage
systems.

According to [Alvarez 2000], storage takes up 30-50% of total system cost before even
paying for the recurring storage management. It is estimated that recurring personnel costs
for storage management now dominate one-time capital costs over the equipment’s useful
lifetime [Gibson 2000]. The storage cost of per MB (megabytes) decreases constantly;
however, the management cost keeps growing because of the increasing shortage of skilled

storage system administrators in current tight ['1‘ labor market [Gibson 2000].

1.2 Motivations

1 .2.1 Evolving storage architectures

The architecture of storage systems has evolved into three generations. The ﬁrst generation
is called Direct-Attached Storage (DAS) [Duplessie 2001]. In DAS, storage servers are just
generic workstations running a distributed ﬁle system like NFS, connecting to a SCSI
RAID controller. SCSI’s limited bandwidth and scalability quickly become a bottleneck
when there are a large amount of clients.

The second generation Storage Area Networks (SAN) [Gibson 2000] has emerged to
help solve the limitations imposed mainly by SCSI channels. With a dedicated storage
network, Fibre Channel [Mearian 2001] can provide higher bandwidth (leps) and higher
capacity (256 devices per Fibre Channel - Arbitrated Loop, or more if using switch
fabrics). There is a new technology called IP Storage (iSCSI) [Chudnow 2002; Satran

2000], which adapts SCSI protocol onto TCP/IP suite so that storage devices can run on

commodity Ethernet networks. At the same time, similar technologies have been proposed
to expand Fibre Channel onto TCP/IP, namely iFCP and FCIP [Mearian 2001].

The third generation architecture is called Network Attached Storage (NAS) [Gibson
2000]. With SAN, a ﬁle server is still needed to manage the block space on an SAN rack
like in the DAS. NAS devices are simply plugged into the network and are extremely easy
to manage, therefore often referred to as storage appliances. NAS should not be considered
as a competing technology to SAN, because internally very likely SAN technology is still
used. Usually a dedicated ﬁle server is embedded inside an N AS storage rack.

The DAS structure is primarily used in small sever environment. The SAN structure is
mostly used in medium to large enterprise products and the N AS structure is mainly used
for department-level servers. With the evolution from DAS, to SAN and NAS, the
fundamental architecture in storage systems has not changed. Essentially it is still a single
server for both distributed ﬁle system and local ﬁle system, probably for redundancy
management as well.

The author believes that this single-server architecture is preventing us from realizing
the full potential from storage devices connected by the dedicated hi gh-speed storage
networks. The solution is to cluster multiple servers, each connected with a large amount of
storage space, so that much higher aggregated bandwidth, performance, and capacity can
potentially be achieved. The research in this dissertation study proposes the CoStore cluster
architecture to construct storage systems using network attached storage devices and to

achieve high performance, high reliability, high scalability and high capacity.

1.2.2 Data is the most valuable asset

The value of data has risen dramatically over the last few years to a point where, for many
enterprises, it is the most valuable corporate asset [Webster 2001]. To maintain the
availability, integrity and disaster recoverability of data, storage is now the most important
resource in information infrastructure. This dissertation study also investigates remote
replication using CoStore clusters to achieve high reliability and high availability storage

systems and to reinforce preparedness for disaster recovery.

1 .23 Storage space utilization irony

Besides the need for higher performance and better reliability, the other problem facing
most infrastructure management is the demand for more storage capacity. With the Internet
growing in size and reaching into every aspect of the society, information explosion has
generated billions of bits every millisecond. Examples include databases of astronomical,
geographical and medical images, data centers for e-Commerce, and multimedia content
for news or entertainment. Even personal email has become attachment rich with pictures,
audio or video clips. Given the rapid growth rate, many organizations are under continuous
pressure to expand their storage systems as demands for capacities grow when the data sets
swell relentlessly.

This dissertation study is in part motivated by the ironical fact that in some
organizations there are growing demands for more storage space while enormous amount
of disk space is idling and unused in their infrastructure. The disparity of storage space
utilization ratios is expected to deteriorate further over time. This dissertation study

assesses the feasibility of deploying CoStore architecture to construct reliable storage

service utilizing idle disk space on workstation clusters on existing computing

infrastructure.

1.3 Problem Abstraction

The problem in this dissertation study can be generalized as follows. Given a group of
network attached storage (N AS) devices, we are to construct a storage cluster without
external ﬁle system managers. Such a storage cluster will provide a uniﬁed ﬁle namespace
with a unique root in the ﬁle system tree. In CoStore both the distributed ﬁle system
responsibilities and local ﬁle system management are evenly distributed among all cluster
members.

Such NAS devices have many different forms. However, they all share the following
attributes: i) they all contain block devices; ii) they all have Ethernet connections (Fast
Ethernet or Gigabit Ethernet); iii) they are intelligent enough to manage local block devices
themselves and provide a ﬁle interface on the network, in addition to the block interface
available from traditionally dumb disk drives.

With the exponential growth in ASIC chip technology, disk drive controllers are
potentially very powerful and intelligent. Ideally CoStore can use the imminent network
attached smart disks [Gibson 1996] or Active Disks [Riedel 2001]. Though technically
possible, Active Disks have not appeared on the market as of this writing because there is
little economic incentive for disk drive manufacturers.

For practical reasons, in this study we will use generic PCs, running modern operating
systems and connected with one or more disk drives. If the CoStore architecture turns out

to be very successful, we can use advanced servers each connected with SAN rack devices.

1 .4 Dissertation Overview
This dissertation consists of six chapters. Chapter 1 presents an introduction and
motivations to this study and provides an abstraction to the problem being solved.

Chapter 2 presents a background review on storage systems. Chapter 3 ﬁrst introduces
technology trends; then presents the CoStore cluster architecture using contemporary
approaches and describes the construction of a prototype CoStore cluster using
commercial-off-the-shelf (COTS) components. The performance of the prototype has been
measured and compared with other commonly available distributed ﬁle systems.

Chapter 4 assesses the feasibility of deploying the CoStore architecture using
workstation clusters on existing desktop environment. Chapter 5 investigates the
construction of highly reliable and highly available storage systems using CoStore clusters.

Chapter 6 ﬁrst discusses related work and then concludes this dissertation study with

conclusions and contributions made by this study. Chapter 6 provides areas for future work.

Chapter 2

Background

2.1 Storage System Components

2.1.1 Storage devices

2.1.1.1 Disk drives

Disk drives are nonvolatile storage devices that record data on magnetic media [Ruemmler
1994]. Each disk device consists of one or more stacked platters attached to a rotating
spindle (Figure 2—1). Disks magnetically store data on recording surfaces located on both
sides of each platter. Concentric circles called tracks divide each platter surface. Sectors

divide tracks into the smallest unit that can be read or written, typically a size of 512 bytes.

Platters Q E Platters

\ me... T7“:
: _ ‘p ' write on both @ f \
'. -‘ 1;, platter sides
' ~ ‘ _ Electronics

   

Actuator

Areal density = Ilnear density ' track density

Figure 2-1 (L) Structures inside disk; Image courtesy
of Seagate Technology, Inc. (R) mechanisms in a
magnetic disk drive

Each recording surface has one head that reads and writes sector data. Heads attach to
the ends of actuator arms. These arms pivot the heads, in unison, between tracks. Seek time

is the duration required to position a head to the desired track. Rotational latency is the time

spent after a head seek but before the target sector rotates under the head. Together, seek
time and rotational latency comprise total positioning time of a request.

Modern devices possess speed-matching buffers. To coordinate between channel
availability and media latencies, devices temporarily store data in speed-matching buffers.
Some advanced devices use buffer memory for caching. Caches may act as read-ahead
buffers by pre-fetching likely to be read data. Caches may also write buffer requests by
transferring data into buffers and then releasing the channel. Devices write cached data to
media at a later time. Advanced conﬁgurations function as both read and write caches.
Read and write caches typically employ variants of least recently used (LRU) replacement
policies.

The latency of a disk access can therefore be broken down into three main elements:
seek, rotational and transfer latencies. Seek latency refers to the time it takes to position the
read/write head over the proper track. This involves a mechanical transitional movement
that may require acceleration in the beginning and deceleration and repositioning in the
end. As a result, although seek times have been improving, they have not kept up with the
rates of improvement of silicon processors. While processing rates have improved by more
than an order of magnitude, average seek times have shrunk to only half of their values of a
decade ago [Gibson 1992].

The second element, rotational latency, refers to the time it takes to wait for the sector
to rotate under the read/write head. This is determined by the rotational speed of the disk.
Rotational speeds have improved slowly over the past decade, improving at an average
annualized rate of 13%. Higher rotational speeds reduce rotational latencies and improve

transfer rates. Unfortunately, they are hard to improve because of electrical and

manufacturing constraints. The third element is transfer time, which is the time for the
target sectors to pass under the read/write head. Disk transfer times are determined by the
rotational speed and storage density (in bytes/square inch). Disk areal densities continue to
increase at 50 to 55% per year, leading to dramatic increases in sustained transfer rates,

averaged at 40% per year [Grochowski 1996].

 

Disk1 Disk2 Disk3 Disk4

 

 

 

D1 DZ 03

 

' P1=01e02e03

 

 

 

 

 

 

06

 

D4 05

 

 

 

 

 

 

I. parity stripe

 

 

D7 D8 09

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2-2 The layout of data and parity blocks in a
RAID level 5

2. 1. 1.2 Disk arrays

In the late 19805, in order to bridge the growing access gap between the storage subsystem
and the processors, redundant arrays of inexpensive disks (RAID) were proposed [Chen
1994; Patterson 1988] to replace expensive large disk systems. (Inexpensive was later
replaced with Independent.) RAID arrays provide the illusion of a single logical device
with high small-request parallelism and large-request bandwidth. By storing a partially
redundant copy of the data as parity on one of the disks, RAIDs improved reliability in

arrays with a large number of disks.

A piece of data is spread over N-I data blocks and one parity block, which is the XOR
of the corresponding bits of the data blocks. In other words the data and the parity are
striped over different disks in the array. High performance is achieved by the parallelism of
multiple disks and fault tolerance is provided by the extra check data parity — the contents
of a failed data disk can be reconstructed by taking the exclusive-OR of the remaining data
blocks and the parity block.

Several different schemes to organize disk arrays were deﬁned by Paterson et aI.,
known as RAID level 0 through RAID level 5. RAID 0 is data striping only without any
check information; therefore it has the highest parallelism but no redundancy. The RAID 1
duplicates all disks or mirroring; hence fault tolerance is excellent and read performance is
doubled. However, mirroring is the most expensive of all levels and the write performance
is not improved. RAID 2 applies Hamming code error-correction like memory-style ECC
and RAID 3 uses bit-interleaved parity. Both RAH) 2 and 3 are rarely used in common
disk arrays.

RAID 4 and 5 are both block-interleaved parity. In RAID 5, parity is uniformly
distributed across all disks (Figure 2-2) whilst in RAID 4 parity is stored on one dedicated
disk. Because the parity disk is a potential bottleneck, normally RAID 4 is seldom used.

There are also a few other levels proposed. RAID 6 is essentially an extension to RAID
5, which utilizes double correcting redundancy (Reed-Solomon) codes to provide
additional fault tolerance. RAID 6 has a much more complex controller design and the

write performance is very poor.

10

2. 1.1.2.1 RAID level 4 and level 5

The parity block is rotated around the devices in an array. Each write to any of the disks
needs to update the parity block. Rotating the parity block balances the parity write trafﬁc
across the devices.

RAID level 5 employs a combination of striping and parity checking. The use of parity
checking provides redundancy without the 100% capacity overhead of mirroring. In RAID
level 5, a redundancy code is computed across a set of data blocks and stored on another
device in the group. This allows the system to tolerate any single self-identifying device
failure by recovering data from the failed device using the other data blocks in the group
and the redundant code. The block of parity that protects a set of data units is called a

parity unit. A set of data units and their corresponding parity unit is called a parity stripe.

 

Datat ‘
2’21““
DDDPDDDPDDDP

(a) Large—Write (b) Read-Modify-Write (c) Reconstmct-Write

 

 

Data Parity an'ty
pre-read uzpdate pre-read
(1) ( ) ”In

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2-3 RAID level 5 host writes and reads in
normal condition

Figure 2-3 shows RAID level 5 host writes and reads in normal condition. Write
operations in fault-free mode are handled in one of three ways, depending on the number of
units being updated. In all cases, the update mechanisms are designed to guarantee the
property that after the write completes, the parity unit holds the cumulative XOR over the

corresponding data units. In the case of a large write (Figure 2-3(a)), since all the data units

11

in the stripe are being updated, parity can be computed by the host as the XOR of the data
units and the data and parity blocks can be written in parallel.

If less than half of the data units in a stripe are being updated, the read-modify-write
protocol is used (Figure 2-3(b)). In this case, the prior contents of the data units being
updated are read and XORed with the new data about to be written. This produces a map of
the bit positions that need to be toggled in the parity unit. These changes are applied to the
parity unit by reading its old contents, XORing it with the previously generated map, and
writing the result back to the parity unit. Reconstruct-writes (Figure 2-3(c)) are invoked
when the number of data units is more than half of the number of data units in a parity
stripe. In this case, the data units not being updated are read, and XORed with the new data

to compute the new parity. Then, the new data units and the parity unit are written.

[Ti l l IFF‘T

(a) Reconstruct-Write (b) Read-Modin-Write (c) Degraded-Read

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2—4 RAID level 5 host writes and reads in the
presence of a fault

Figure 2-4 shows RAID level 5 host writes and reads in the presence of a fault. If a
device has failed, the degraded-mode write protocols shown in Figure 2-4(a) and Figure
2-4(b) are used. Data on the failed device is recomputed by reading the entire stripe and
XORing the blocks together as shown in Figure 2-4(c). In degraded mode, all operational

devices are accessed whenever any device is read or written.

12

The more balanced RAID 5 is by far the most popular RAID organization of all levels.
However, Network Appliance has adopted RAID 4 because their novel WAFL (Write
Anywhere File Layout) design can overcome most of the bottleneck effect [Hitz 1994].
Compared with RAID 5, RAID 4 is much easier to implement and disks can be added
incrementally into the array with little reorganizing work. In this proposal only RAID level
4 and 0+1 are considered.

One major drawback of RAID 4 and 5 is that small writes have very high overhead,
called small write penalty. Techniques have been proposed to improve the performance of
small updates, including parity-logging [Stodolsky 1993], ﬂoating-parity scheme [Menon
1989] and log-structured ﬁle system design (LFS) [Rosenblum 1992.], among others.
AFRAID (A Frequently Redundant Array of Independent Disks) was proposed [Savage
1996] to eliminate the small—update penalty to RAID 5. The concept is that considerably
better performance can be achieved by consciously sacriﬁcing a small amount of data

redundancy.

2. 1.1.2.2 RAID status

RAID 1 is increasingly popular because the controllers are simple to implement and many
PC motherboards of recent models are providing them embedded, along with the
combination of RAID 0 and 1. RAID 0+1 stands for striping across mirrored disks and
RAID 1+0 stands for the mirroring of two virtual disks, each is a set of stripping disks. The
performance characteristics of these two are equivalent. However, whenever possible
RAID 0+1 is preferred because a single disk loss will not affect the whole array, while in

RAID 1+0 a single disk failure will ruin the whole striping set in one virtual disk

13

[VERITAS 2000]. The striped-mirrors RAID 0+1 or 1+0 (sometimes also called RAID 10)
has the reliability of mirroring RAID 1 and the performance of striping-only RAID 0.
Because the cost per MB continues to drop rapidly, mirroring RAID 1 or its variation 0+1
has witnessed accelerating market acceptance as inexpensive RAID solution. Another
possibility is to combine level 4 and 1 into RAID 4+1, similar to RAID 0+1.

Full-blown hardware RAID is an expensive solution, compared to software RAID,
which is implemented by operating systems kernel without hardware support, mostly in a
form of device driver. Software RAID is available from many popular PC operating
systems: Linux, Solaris and Windows 2000 Server.

A volume manager is a subsystem for online disk storage management [Teigland
2001]. It adds an additional layer between physical peripherals and the I/O interface in the
kernel to present a logical view of disks. A volume manager will often implement one or

more levels of software RAID to improve performance or reliability.

2.1.2 l/O interfaces

I/O interfaces transport data between host processors and peripheral devices. Interfaces
often become open industry standard. I/O interfaces can be categorized into two groups:
parallel and serial interface. It is an industrial trend that these parallel interfaces will soon
be serialized. Parallel interfaces include ATA/IDE and SCSI. Serial interface include USB,

IEEE1394, SSA, Fibre Channel, and forthcoming serial ATA and serial SCSI.

14

2. 1 .21 Parallel interfaces

The most popular I/O interface is the Integrated Drive Electronics (IDE) [Tanenbaum
1999], which appeared with the BIOS (Basic Input Output System) on IBM PCs since mid
19805. The more ofﬁcial name for IDE is ATA (AT Attachment). For lack of foresight, the
initial BIOS only supports a maximum drive of capacity 512MB because in the BIOS there
are 4 bits for the head, 6 bits for the sectors and 10 bits for the cylinders. Eventually IDE
drives evolved into EIDE (Extended IDE), which also supports a second addressing
scheme called LBA (Logical Block Address), which just numbers the sectors starting at 0
up until a maximum of 224-1.

IDE and EIDE disks were originally only for Intel-based systems since the interface is
an exact copy of the IBM PC bus. However, now a few other computer architectures
(including SUN SPARC workstations) also use them recently. The latest EIDE supports
Ultra ATA 100/ 133 with a burst data rate up to 100MB/ 133MB per second. The
disadvantage of (E)IDE is that only limited aggregated bandwidth can be supported, with
limited expandability, up to 4 channels and 8 devices. Therefore, EIDE is still mostly
limited to PCs and entry-level workstations.

In storage system, the other most important I/O interface is SCSI (Small Computer
System Interface) interface, which was standardized in 1986 by the T10 committee of the
National Committee on Information Technology Standards (NCITS). NCITS, formerly
known as X3, operates under the auspices of the American National Standards Institute
(ANSI). Since then, increasingly faster vesions have been standardized with the names Fast
SCSI (lOMHz), Ultra SCSI (2OMHz), and Ultra2 SCSI (40MHz). Each of these has a wide

(16bit) version as well. The current (sixth) generation, Ultra160 SCSI data rate reaches to

15

l60Mbps. Future versions include Ultra320 and Ultra640 each doubling the preceding
generation’s data rates [Mason 2000]. Unlike other I/O interfaces, SCSI product
generations are always backward and forward compatible, preserving the owner’s
investment and making it possible to connect legacy devices to newer system [Mason
2000]. Table 2-1 shows the all of the different SCSI transfer modes and feature sets, along
with their key characteristics [Kozierok 2001].

Because SCSI disk have high transfer rates, traditionally they are the standard disks in
most UNIX workstations from Sun, HP, SGI and other vendors. They are also the standard

disks on Macintoshes and high-end Intel PCs, especially network servers.

Table 2-1 SCSI parameters. DT means double
transition clocking.

 

 

 

 

 

 

 

 

 

 

 

 

 

Transfer Deﬁning Bus Width Bus Speed Throughput

Mode Standard (bits) (MHz) (MB/s) Cabling
SCSI-l SCSI-l 8 5 5 50-pin
Wide SCSI SCSI-2 16 5 10 68-pin
Fast SCSI SCSI-2 8 10 10 50-pin
Fast Wide SCSI SCSI-2 16 10 20 68-pin
Ultra SCSI SCSI-3 I SP1 8 20 20 50-pin
Wide Ultra SCSI SCSI-3 / SPI 16 20 40 68-pin
Ultra2 SCSI SCSI-3 l SPI-2 8 40 40 50-pin
Wide Ultra2 SCSI SCSI-3 / SPI-2 16 40 80 68-pin
Ultra3 SCSI SCSI-3 / SPI-3 16 40 (DT) 160 68-pin
Ultral60 SCSI SCSI-3 / SPI-3 16 40 (DT) 160 68-pin
Ultra320 SCSI SCSI-3 / SPl-4 I6 80 (DT) 320 68-pin

 

 

 

 

 

 

16

 

On the high-end storage servers, SCSI disks were once exclusively used. However,
SCSI has its own limitations. The main disadvantages are the limited number of devices on
each channel and limited distance of SCSI cabling. SCSI is a parallel interface, with a wide
cable connector, which is very unpleasant in an environment of high-density devices. To
accommodate the emerging requirements of massive capacity, high-bandwidth and
continuous availability, recently on the high-end storage systems, SCSI interface disk
drives have been gradually replaced by serial storage interfaces such as Fibre Channel —
Arbitrated Loop (FC-AL).

The parallel ATA will give way, over the next two or three years, to a new connectivity
standard pr0posed by the Serial ATA Working Group [Norman 2002]. The use of gigabit
technology will make Serial ATA viable for at least 10 more years, enabling the interface
to better keep up with the data-transfer demands of advanced processor. Serial ATA will

also eliminated parallel ATA’s large, unwieldy ribbon cables.

2. 1.2.2 Serial interfaces

USB (Universal Serial Bus) interface appeared in mid 19905 and is being widely
implemented on personal computers [Anderson 1997]. USB is mainly designed for low-
speed devices such as keyboards, mice, digital cameras, scanner, printers and so on. There
are disk drives with USB interface on the market and most of them are external to take
advantage of the convenience of mobility. The USB 1.1 bandwidth is 1.5MB/s (or
12Mbps). The latest USB 2.0 can support bandwidth as high as 480Mbps and is backward

compatible with USB 1.1.

17

The IEEE 1394 multimedia connection enables simple, low-cost, high-bandwidth real-
time data interfacing between computers, peripherals, and consumer electronics products
such as camcorders, VCRs, printers, PCs, TVs, and digital cameras. The 1394 digital link
standard was conceived in 1986 by technologists at Apple Computer, who chose the
trademark 'FireWire', in reference to its speeds of Operation. The ﬁrst speciﬁcation for this
link was completed in 1987. It was ad0pted in 1995 as the IEEE 1394 standard. Some high-
end disk drives are equipped with 1394 port for the purpose of faster transfer rate of data,
which is a necessity in professional digital video processing, and other multimedia
applications. Currently IEEE 1394 supports bandwidth of 400Mbps and future versions
will go as high as 800Mbps and 1.6Gbps.

Serial Storage Architecture (SSA) SSA [SSA 1995], a serial-connection technology
proposed by IBM, abandons the wide parallel cables used by SCSI in favor of a simple
four-wire serial connection. Unlike SCSI, which uses a shared bus to connect the device,
SSA uses a pair of point-to-point links to connect two devices together via a port on each
device [Du 1996]. We investigated the impact of SSA’s spatial reuse property on the
performance of RAID storage system in [Jayaram 1998].

By year 2000, SSA has almost lost the war to the competing FC-AL technology, in part
for the lack of wide industrial support to an IBM proprietary standard and for its technical
limitations. However, some of its technology contributions will prevail as is attested by the
fact that Seagate, Adaptec and IBM have started development of a new technology, called
FC-EL (Fibre Channel — Enhanced Loop), which will combine both the advantages of SSA

and FC-AL.

18

Fibre Channel - Arbitrated Loop (FC-AL) Fibre Channel (FC) is an emerging ANSI
serial interface that supports channel and network operations [ANSI 1993]. Fibre Channel
consists of ﬁve functional levels. Modular design allows independent implementation of
each level.

Fibre Channel deﬁnes all computers, switching elements, and storage devices as nodes.
Each node has one or more FC ports called network ports (N Ports). Each port possesses a
transmitter and a receiver to interface media Fibre Channel supports three topologies:
point-to-point, arbitrated loop (FC-AL), and fabric. Point-to-point conﬁgurations are the
simplest. This topology connects node pairs or acts as backbones for other network types.
Arbitrated loops connect several ports via shared media. Fabrics connect ports to switched
environments.

Like shared buses, only one arbitrated loop node transmits at any given time. As many
as 126 nodes connect together to form a ring; the transmitter of one node connects to the
receiver of another. A transmitting node must ﬁrst arbitrate for the loop. After acquiring the
loop, the transmitting node either sends messages to other nodes or broadcasts to several
nodes. After transmission, the node releases the loop. Loop ports (L Ports) are nodes that
support the arbitrated loop topology. The bandwidth per Fibre Channel loop is leps or
100MBps.

Performance and cost requirements dictate network topology. Many nodes may attach
to inexpensive, arbitrated loops in order to share loop bandwidth and connectivity. Nodes
can also attach to fabric ports and exploit the network bandwidth. Loops connect to fabric

ports for high connectivity but relatively low costs.

19

2. 1.2.3 Merger of I/O Channels and Ethernet

Recent interface trends combine channel and network technologies into single interfaces
capable of supporting multiple protocols [Sachs 1994]. Interface merging tends to produce
slightly more complicated designs, but these interfaces generally inherit advantages of both
channels and networks. Combining these traditionally independent subsystems enables
vendors to produce single products with multiple uses. Vendors providing combined
products beneﬁt from larger markets.

A network attached peripheral (NAP), according to Van Meter, is “a network computer
peripheral that communicates via a network rather than a traditional I/O bus, such as SCSI”
[Van Meter 1996]. Van Meter presents characteristics that distinguish NAPS from bus-
based devices. Such characteristics include interconnect distance, ownership, and an ability
to handle general network trafﬁc. Printers and computer terminals often ﬁt this deﬁnition.
Other peripherals like scanners and storage devices may also be designed as NAPS.
Merging channels and networks provides new functionality to devices. Network attached
storage (N AS) devices connect directly to networks. Multiple computers can share NAS
devices.

One particular example is the new technology called IP Storage or iSCSI (Internet
SCSI), which maps SCSI protocol onto TCP/IP protocol [Satran 2000]. iSCSI will enable
SCSI storage controllers, disk subsystems and tape libraries to attach directly to IP
networks. According to Nick Allen, iSCSI promises to let users operate SAN, NAS, LAN
and WAN as a single, integrated network [Chudnow 2002]. While iSCSI can run on
standard Gigabit Ethernet switches, the overhead involved in processing TCP and the

iSCSI protocol can quickly overwhelm the CPU in servers with signiﬁcant storage network

20

trafﬁc. Therefore standard Ethernet adaptors with special host bus adapters (HBA) called
TCP off-load engines (TOE). There is signiﬁcant investment in the imminent 10GB
Ethernet technology, which adds more fuel into iSCSI’s development.

The future of Fibre Channel based SAN is overcast with questions due to the limited
bandwidth (2GB) in its next generation while iSCSI is soon to take advantage of the 1063
Ethernet. However, the Fibre Channel camp is not resting. Several projects similar to iSCSI
are in the works, including iFCP (Internet Fibre Channel Protocol) and FCIP (Fibre
Channel over IP) [Mearian 2001]. FCIP enables the transmission of Fibre Channel
information by tunneling data between SANS over IP networks. A hybrid technology, iFCP
is a version of FCIP that moves Fibre Channel data over networks using the iSCSI
protocols. All the three new protocols iSCSI, iFCP and FCIP are interfaces to SAN based
storage systems [Paulson 2002].

The demand to move data with greater speed has been restricted by a number of
bottlenecks and arguably the greatest restriction to data ﬂow is the nature of the I/O
architecture [Williams 2001]. The InﬁniBand technology addresses the shortfall of current
I/O buses and will provide speed range from 500Mbps to 6Gbps per link. The implications
for storage networking are enormous [Williams 2001]. A fabric of InﬁniBand switches and
links is used to provide connections between servers, remote storage and networking
devices. The key to InﬁniBand is switching technology that links high-channel adaptors
(HCA) to target channel adaptors (T CA). The TCAs support the storage and peripheral
device U0. The switch operates between the HCA and TCA to manage and direct data

packets.

21

2.2 Local File Systems

This section presents operating system background with UNIX as example. Reasons for
choosing UNIX include: (1) a large install base and variations of UNIX run on most
platforms architectures; (2) a large amount of UNIX design literature and research exist; (3)
numerous non-UND( operating system inherit many design principles from UNIX; (4) the
prototype of this proposal is targeted at direct integration into existing UNIX dominant

infrastructure.

 

Kernel mode

process process process
1 2 n
User mode
I I 1

Virtual in. System '

    

 

 

 

Figure 2-5 File system layers in UNIX

2.2.1 File system layers in operating systems

UNIX functionally organizes storage and network subsystems into layers [Bach 1986].
Figure 2-5 illustrates these layers. Each layer views storage and communication through
different degrees of abstractions. The top layer of Figure 2-5 is user space. Programs

operating at this level include command shells and system utilities; applications run on top

of these programs. All user level programs interact with the operating system, or kernel,
through system calls.

The ﬁle system layer lies beneath the system call layer. Modern UND( implementations
include installable ﬁle system interfaces. Many UNIX implementations incorporate the
Virtual File System (VFS) interface. VFS, developed by Sun Microsystems, provides a
common interface to ﬁle systems. VFS divides ﬁle system functionality into ﬁle system
and individual ﬁle operations [Kleiman 1986; Sandberg 1985].

File systems often use buffer cache services. The buffer cache layer consists of system
memory buffers and routines that operate on these buffers. Buffer caches provide caching,
pre-fetching, and temporary memory for non-aligned transfers. These caches reduce device
read and write requests by caching recently accessed data. Buffer caches either write-
through or write-behind, and often use a least recently used (LRU) replacement policy.
File systems pre-fetch data into buffer caches for possible future reference. Pre-fetches are
often continuations of current requests. Such pre-fetching only increases data transfer times
of request durations, since devices already perform initial seeks and rotations for the non-
pre—fetched data. File systems also use buffer caches to temporarily store data for non-
aligned transfers.

The device driver layer is the lowest software level. Device drivers, or drivers, are
interfaces between the kernel and hardware. Drivers hide hardware device speciﬁcs.
Several levels of device drivers comprise the device driver layer. High-level device drivers
are more abstract than low-level drivers. One good example is that software RAID is
implemented as high-level device driver, which interacts with low-level block device

drivers. Low-level drivers are most hardware speciﬁc [Rubini 2001]. 1/0 requests arrive at

23

device drivers from higher in the kernel. Drivers use input requests to construct requests
suitable for lower drivers or hardware. Drivers place newly formed requests on the driver
queue. These queues are usually ﬁrst-in, ﬁrst-out (FIFO), but can be priority-based to
enable scheduling. Drivers pass queued requests down to lower device drivers or the

hardware level.

2.2.2 Virtual File System (VFS)

The virtual ﬁle system (VFS) is an interface that supports various ﬁle system types within a
kernel. Several UNIX implementations incorporate VFS; however, interfaces differ from
one platform to another. There have been many ﬁle systems and they all share common
structures. VFS is an object-oriented interface. This interface deﬁnes virtual VFS and
vnode operations. Each installed ﬁle system provides the kernel with functions associated
with VFS and vnode Operations.

VFS operations include functions that operate on ﬁle systems by mounting,
unmounting, and reading status, respectively. Vnode operations manipulate individual ﬁles.
A vnode is the VFS virtual equivalent of an inode. VFS creates and passes vnodes to ﬁle
system vnode operations. Vnode operations include opening, closing, creating, removing,
reading, writing, and renaming ﬁles. VFS deﬁnes many other vnode operations, yet ﬁle

system implementations need only support a subset of these routines.

2.2.3 File system structures

File systems manage user and system data on secondary storage. Applications often address

data at the byte level, though storage devices are typically block addressable. File systems

24

perform translations between byte and block level addressing schemes. The terms metadata
and real data classify ﬁle system structure data and user data, respectively. In other words,
real data is data that users store in ﬁles. File systems create metadata to store layout
information; metadata is not directly visible to users.

Files abstractly hide details concerning storage management from users. UNDI
recognizes several ﬁle types. File systems and application programs handle each ﬁle type
differently. Users store and retrieve data from regular ﬁles as contiguous, randomly
accessible segments of bytes. Users are responsible for organizing data stored in regular
ﬁles.

Directories are ﬁle abstractions that organize collections of ﬁles. In most modern ﬁle
systems, directories nest within one another, thereby forming a tree structure with empty
directories and regular ﬁles at the leaves. File names are unique to directories but not to ﬁle
systems. Applications identify ﬁles by complete pathnarnes. A ﬁle name and the names of
all encompassing directories comprise a pathnarne. The entire collection of ﬁle names
composes the ﬁle system name-space.

Computers may simultaneously access ﬁles from multiple ﬁle systems, including local
and network ﬁle systems. To provide a transparent name-space, ﬁle systems mount other
ﬁle systems at mount points. Users transparently traverse from one ﬁle system, across a
mount point, to another ﬁle system. File system root directories attach to mount points.

Internal fragmentation occurs when ﬁle systems do not ﬁll entire blocks. Traditionally,
ﬁle systems limited internal fragmentation by using small ﬁle system block sizes. Today,
conservation of storage is less important, so optimizations focus on improving ﬁle transfer

rates. External fragmentation is due to non-contiguous storage of data blocks. Since I/O

25

requests incur substantial device overheads, external fragmentation strongly inﬂuences
transfer rates. Increases to ﬁle system block sizes tend to reduce external fragmentation,
however large block sizes increase internal fragmentation.

File systems transfer blocks of data between system memory and storage devices. File

system block sizes are multiples of device block sizes.

2.2. 3. 1 Ext2 file system
In this section the Second Extended File System (Ext2) [Card 1986], the de facto standard
ﬁle system on Linux, is used to explain meta-data structures for generic ﬁles systems. In

Ext2 ﬁle system block sizes are lKB, 2KB and 4KB.

 

 

Boot

block Superblock inode blocks Data blocks

 

 

 

 

 

(a) The schematic structure of a UNIX file system

 

 

 

 

 

 

 

 

 

 

Boot Block Group Block Group Block Group
block 0 1 n
F".— ''''' ~~ V7-“ _, ~- ”mu-:u; _ . _.,_ .........
,1 Super-blk 1.3:! Block ' inode . _i,§inode":
:BG-dscptr ti‘iibitmap ,., bitmap 4‘: table i Data blocks
>'- ..- . t .3. W ' .. ...A . . m)

 

 

 

 

 

 

 

 

 

 

(b) The structure of Ext21ile system

 

Figure 2-6 (a) The schematic structure of a file
system; (b) the structure of Eth file system

The basic structure is the same for all the different UNIX ﬁle systems (Figure 2-6(a)).
Each ﬁle system starts with a boot block. This block is reserved for the code required to
boot the operating system. All the information essential for managing the ﬁle system is

held in the superblock. superblocks maintain information concerning the amount of free

26

space left on the ﬁle system, the device on which the ﬁle system is mounted, and ﬁle
system access privileges. superblock.s~ also maintain pointers to locate ﬁle system root
directories. superblock is followed by a number of inode blocks containing the inode
structures for the ﬁle system. The remaining blocks provide the space for data. These data
blocks thus contain ordinary ﬁles along with the directory entries and the indirect blocks.

In Unix ﬁle systems each ﬁle is represented by one inode, which stores all the
information about that ﬁle: permissions, type, size, and data block locations etc. The block
devices are divided into metadata and data blocks. The metadata blocks store superblock,
block-bitmap and inode-bitmap, and inode-table. For efﬁciency reason, blocks are ﬁrst split
into block groups with each block group has its own metadata [McKusick 1984]. Figure
2-6(b) shows a typical Ext2 ﬁle system structure.

The design of the Ext2 ﬁle system is very much inﬂuenced by BSD’s Fast File System
[McKusick 1984]. A partition is divided into a number of block groups, with each block
group holding a copy of the superblock, and inode and data blocks, as shown in Figure
2-6(b). The block groups are employed with aim of keeping (i). data blocks close to their
inodes; and (ii) ﬁle inodes close to their directory inode. Block groups reduce disk seek
time to a minimum. Every block contains the superblock along with information on all the
block groups, allowing the ﬁle system to be restored in an emergency. Figure 2-7 show the

superblock structure in Ext2 ﬁle system.

27

 

 

 

 

o 1 2 4 5 a 7
oWumber of Inodes Number of blocks
afNumber of reserved blocks Number of free blocks
rerNumber of free Inodes First data block
24IBlock size Fragment size

 

32IBlocks per group

Fragments mgroup

 

4o|lnodes per group

Time of mounting

 

 

 

 

 

«ITime of last write Status LMax. mount counter
seIExtz signature [Status Error behaviour

e4|Time of last test Max test Interval

72I0perating system File system revision

eoIRESUID Tnesero

 

 

Figure 2-7 The superblock in Eth file system

Additionally, in each block group Ext2 ﬁle systems maintain free lists of unallocated
data blocks inode by means of bitmap tables: block-bitmap and inode-bitmap as shown in
Figure 2-6(b). File systems set bits to signify blocks or inode that are allocated to ﬁles.

Figure 2—8(a) illustrates the structure of an inode and metadata pointer tree for a
traditional UNIX ﬁle system. The inode contains header information and several pointers.
The ﬁrst ten pointers are direct pointers that address data blocks. The ﬁrst indirect pointer
addresses a data block ﬁlled with direct pointers. The second indirect pointer addresses a

data block ﬁlled with indirect pointers. The third indirect pointer addresses a block of

double indirect pointers, taking a total of three indirections to reach real data.

28

 

Direct
references
to
data blocks

Indlrect block

 

a) The structure of a UNIX inode

o 1 2 a 4 s e 7
File sIze
time of creation
of modiﬁcation of deletlon
Link counter of blocks
File attributes Reserved
12 direct blocks

indirect block Indirect block
indirect block File version
File ACL ACL
address Reserved (OS-dependent)

 

b) The inode in Ex12 file system.

Figure 2-8 (a) The structure of a Unix inode; (b) the
inode in Ext21ile system

29

Figure 2—8(b) shows the structure of an inode in Ext2. inodes also store ﬁle type (ﬁle or
directory), ownership information, access permissions, access times, and ﬁle sizes.
Directory ﬁles contain ﬁle names and unique inode index numbers (Figure 2-9), called
inode numbers. File systems use inode numbers to locate inodes stored on disk. The
separation of inode numbers from ﬁle names allows ﬁle systems to support link ﬁles. Link

ﬁles transparently reference other ﬁles. Links provide multiple ﬁle names for single ﬁles.

 

 

 

Fincdelt Iantrylglenl name_len[ name J

 

 

 

 

Figure 2-9 A directory entry in the Eth file system

Ext2 is adopted with minor modiﬁcations as local ﬁle system in the prototype
implementation of this proposal. These modiﬁcations enable multiple Ext2 ﬁle systems to

construct a virtual ﬁle system spanning across all storage nodes.

2.2.4 File system management
File system operations include data operations and meta—data operations. Data operations
are performed on actual user data, reading or writing data from/to ﬁles. Meta—data
operations modify the ﬁle system structure, creating, deleting or renaming ﬁles or
directories.

When unfortunate situations occur, such as an unexpected power failure or system
lock-up, the system does not have the opportunity to cleanly unmount ﬁlesystems. When
the system is rebooted and fsck starts its scan and detects that these ﬁlesystems were not

cleanly unmounted. It is very likely that the meta-data is in inconsistency status. To ﬁx this

30

situation, fsck will begin an exhaustive scan and sanity check on the meta-data, correcting
any errors that it ﬁnds along the way. Once fsck is complete, the ﬁlesystem is ready for use.
Although some recently modiﬁed data may have been lost due to the unexpected system
crash, since the meta-data is now consistent, the ﬁlesystem is ready to be mounted and
used.

The problem with fsck is that complete consistency check on all meta-data is a time-
consuming task in itself and normally takes at least several minutes to ﬁnish. The bigger
the ﬁlesystem, the longer this exhaustive scan will take.

Therefore during a meta-data operation, the system must ensure that data are written to
disk in such as way that the ﬁle system can be recovered to a consistent state if a system
crash occurs. Traditionally this requirement has been met by synchronously writing each
block of meta-data [Seltzer 2000]. Unfortunately synchronous writes can signiﬁcantly
impair ﬁle system’s performance on meta-data operations.

There are two commonly used approaches for improving the performance of meta-data
operations and recovery: joumaling and Soft Updates. Joumaling (or logging) ﬁle systems
use an auxiliary log to record meta-data operations and Soft Updates uses ordered writes to
ensure metadata consistency. Most operating systems have adopted joumaling ﬁle
systems. There have been quite a few joumaling ﬁle systems: IFS, ReiserFS, XFS, Ext3

and GFS.

2.2.4.1 Joumaling

Joumaling ﬁle systems attack the meta-data update problem by maintaining an auxiliary

log that records all meta-data operations and ensuring that the log and data buffers are

31

synchronized in such a way to guarantee recoverability [Seltzer 2000]. The system enforces
write-ahead logging [Gray 1993], which ensures that the log is written to disk before any
pages containing data modiﬁed by the corresponding operations. If the system crashes, the
log system replays the log to bring the ﬁle system to a consistent state. Joumaling systems
always perform additional I/O to maintain ordering information (i.e., they write the log).
However, these additional I/Os can be efﬁcient, because they are sequential compared with
random writes with long track-seeking. When the same piece of meta-data is updated
frequently, joumaling systems consolidate these log writes in exchange for avoiding
multiple meta-data writes.

When the ﬁlesystem is mounted, the ﬁle system checks to see whether the ﬁlesystem is
consistent. If for some reason it is not, then the meta—data needs to be ﬁxed, but instead of
performing an exhaustive meta-data scan in fsck, it takes a look at the journal. Since the
journal contains a chronological log of all recent meta-data changes, it simply inspects
those portions of the meta-data that have been recently modiﬁed. Thus, it is able to bring
the ﬁlesystem back to a consistent state in a matter of seconds. And unlike the more
traditional approach that fsck takes, this journal replaying process does not take longer on
larger ﬁlesystems.

The key design issues in a joumaling ﬁle system are where to store the log; how to
manage the log and when to reclaim space and when to do checkpointing; interface

between the log and the main ﬁle system; and how to recover from the log.

32

2.2.4.2 Soft Updates

Soft Updates tries to solve the meta-data update problem by guaranteeing that blocks are
written to disk in their required order without using synchronous disk I/Os [Seltzer 2000].
In general, a Soft Updates system must maintain dependency information, or detailed
information about the relationship between cached pieces of data. For example, when a ﬁle
is created, the system must ensure that the new inode reaches disk before the directory that
references it does. In order to delay writes, Soft Updates must maintain information that
indicates that the directory data block is dependent upon the new inode and therefore, the
directory data block cannot be written to disk until after the inode has been written to disk.
In practice, this dependency information is maintained on a per-pointer basis instead of a

per—block basis in order to reduce the number of cyclic dependencies.

2.2.4.3 Ext3: joumaling Ext2

Ext3 has been designed by Stephen Tweedie to be extremely easy to deploy [Robbins
2001; Tweedie 2000]. Ext3 is built on the solid Ext2 ﬁlesystem code and it inherits a great
fsck tool. And Ext3's joumaling capabilities have been specially designed to ensure the

integrity of both metadata and data [Robbins 2001].

2.2.4.3.1 Backwards compatibility

Ext2 and Ext3's on-disk format is identical, which means that a cleanly unmounted Ext3
ﬁlesystem can be remounted as an Ext2 ﬁlesystem- It is possible to perform in-place Ext2
to Ext3 ﬁlesystem upgrades. By upgrading a few key system utilities, installing a modern

2.4 kernel and typing in a single tune2fs command per ﬁlesystem, users can convert

33

existing Ext2 ﬁle systems into joumaling Ext3 systems. The transition is safe, reversible,
and incredibly easy.

In addition to being Ext2-compatible, Ext3 inherits other beneﬁts by sharing Ext2's
metadata format. Ext3 users gain access to a rock-solid fsck tool. If you do end up getting
corrupted metadata, either from a ﬂaky kernel, bad hard drive, or something else, you can
still use fsck to ﬁx inconsistency in your Ext3 ﬁle systems.

Ext3's journal is stored in an inode, or basically a ﬁle. By storing the journal in an
inode, Ext3 is able to add the needed journal to the ﬁlesystem without requiring
incompatible extensions to the Ext2 metadata This is one of the key ways that an Ext3

ﬁlesystem maintains backwards compatibility with Ext2 metadata.

2.2.4.3.2 The Exts approach

Ext3 handles joumaling very differently than other joumaling ﬁlesystems. With ReiserFS,
XFS, and JFS, the ﬁlesystem take extra special care to journal metadata, but makes no
provisions for joumaling data However, unexpected reboots and system lock-ups can
result in signiﬁcant corruption of recently modiﬁed data. Ext3 uses a couple of innovative
solutions to avoid these problems, which we'll look at in a bit.

In Ext3, the joumaling code uses a special API called the Joumaling Block Device
(JBD) layer. The JBD has been designed for the express purpose of implementing a journal
on any kind of block device. Ext3 implements its joumaling by hooking in to the JBD API.
For example, the Ext3 ﬁlesystem code will inform the JBD of modiﬁcations it is
performing, and will also request permission from the JBD before modifying certain data

on disk. By doing so, the JBD is given the appropriate opportunities to manage the journal

34

on behalf of the Ext3 ﬁlesystem. The JBD is being developed as a separate, generic entity
and it could be used to add joumaling capabilities to other ﬁlesystems in the future.

There are a number of ways to implement a journal. For example, a ﬁlesystem
developer could design a journal that stores spans of bytes that need to be modiﬁed on the
host ﬁlesystem. The advantage of this approach is that your journal would be able to store
lots of tiny little modiﬁcations to the ﬁlesystem in a very efﬁcient way, since it would only
record the individual bytes that need to be modiﬁed and nothing more.

JBD takes a different approach. Rather than recording spans of bytes that must be
changed, JBD stores the complete modiﬁed ﬁlesystem blocks themselves. The Ext3
ﬁlesystem also uses this approach and stores complete replicas of the modiﬁed blocks
(”(8, 2KB or 4KB) in memory to track pending 10 operations.

At ﬁrst, this may seem a bit wasteful. After all, complete blocks contain modiﬁed data
but may also contain unmodiﬁed (already on disk) data as well. The approach that the JBD
uses is called physical joumaling, which means that the JBD uses complete physical blocks
as the underlying currency for implementing the journal. In contrast, the approach of only
storing modiﬁed spans of bytes rather than complete blocks is called logical joumaling.

Because Ext3 uses physical joumaling, an Ext3 journal will have a larger relative on-
disk footprint than those using logical joumaling. But because Ext3 uses complete blocks
internally and in the journal, Ext3 does not deal with as much complexity as it would if it
were to implement logical joumaling. In addition, the use of full blocks allows Ext3 to
perform some additional optimizations, such as "squishing" multiple pending 10 operations
within a single block into the same in-memory data structure. This, in turn, allows Ext3 to

write these multiple changes to disk in a single write operation.

35

2.2.4.3.2 Joumaling for both meta-data and data

Ext3 ﬁlesystem provides both metadata and data joumaling to ensure both data and
metadata integrity. Originally, Ext3 was designed to perform full data and metadata
joumaling. In this mode (data=joumal), the JBD journals all changes to the ﬁlesystcm,
whether they are made to data or metadata. Because both data and metadata are joumaled,
JBD can use the journal to bring both metadata and data back to a consistent state. The
drawback of full data joumaling is that it can be slow, although you can reduce the
performance penalty by setting up a relatively large journal.

A new joumaling mode has been added to Ext3 that provides the beneﬁts of full
joumaling but without introducing a severe performance penalty. This new mode works by
joumaling metadata only. However, the Ext3 ﬁlesystem keeps track of the particular data
blocks that correspond with each metadata update, grouping them into a single entity called
a transaction. When a transaction is applied to the ﬁlesystem proper, the data blocks are
written to disk ﬁrst. Once they are written, the metadata changes are then written to the
journal. By using this technique (data=ordered mode), Ext3 can provide data and metadata
consistency, even though only metadata changes are recorded in the journal. Ext3 uses this

mode by default.

2.3 Distributed File Systems

Distributed ﬁle system consists of ﬁle system servers, ﬁle system clients, and the
speciﬁcation of the ﬁle system service interface to the client. The performance of
distributed ﬁle systems is affected by several key design decisions. Among them are the

statelessness or statefulness of servers, caching location, and semantics of ﬁle sharing.

36

Servers are either stateless or stateful. Stateless servers maintain no information
regarding the history of client requests. Since servers maintain no state information,
requests must contain all necessary information to describe server tasks. Stateless behavior
simpliﬁes recovery; clients only need to resend requests that do not successfully complete.
Stateful servers maintain information about previous client requests. With this knowledge,
clients and servers effectively pre-fetch and cache data. State information is difﬁcult to
rebuild if lost by server failures.

Distributed ﬁle systems cache data at a variety of locations. These locations include
main memories and storage devices of clients and servers. Data location largely determines
client access times. Access times are shortest for data cached in client memories. Access
times are longest when servers cache data on local storage devices.

File systems support various degrees of ﬁle sharing. There are two important sharing
models. The ﬁrst model, UND( semantics, states that when a READ operation follows a
WRITE operation, the READ returns the value just written. This model is easy to
understand and straightforward to implement. A more relaxed model, session semantics,
states that changes to an open ﬁle are initially visible only to the process that modiﬁed that
ﬁle and only when the ﬁle is closed are the changes made visible to other processes.

Cache inconsistencies occur when clients modify one or more copies of the same data
To prevent or manage inconsistencies, distributed ﬁle systems have mechanisms to ensure
coherence. These consistency mechanisms either invalidate or update stale copies of data.
Stateless servers typically rely on clients to maintain consistency. Before accessing locally
cached data, clients must verify consistency with servers. With stateless servers, clients

perform write-through caching so that other clients receive modiﬁed data. Stateful servers

37

take an active approach to consistency management. Servers perform callbacks to notify
clients and other servers that cached data is inconsistent and must be written back or

invalidated.

2. 3. 1 Network File System (NFS)

The Network File System (NFS) was designed by Sun Microsystems in 1985 [Sandberg
1985]. The goals of NFS include system independence, name transparency, and
preservation of UNIX ﬁle system semantics. The server design is stateless, so clients make
requests with all information necessary to complete operations. Clients and servers
communicate over a network using the remote procedure call (RPC) protocol. RPC is a
high-level protocol built upon User Datagram Protocol (UDP) and lntemet Protocol (IP).
Transmission Control Protocol (TCP) may replace UDP to provide connection-oriented
communication with guaranteed, in-order delivery.

NFS servers are stateless and write modiﬁed data to stable storage before completing
requests. NFS maintains only a weak form of consistency, since single read and write
requests may span several RPC operations. Multiple clients may issue overlapping
requests.

Both NFSv2 and NFSv3 [Callaghan 2000] clients and servers cache data in system
memory in order to improve read performance. Clients also cache ﬁle attributes, but
periodically invalidate these attributes to limit the use of stale data. Clients maintain ﬁle
data consistency by verifying ﬁle modiﬁcation times with servers.

The stateless server design is the crux of NFS simplicity. Servers use local ﬁle systems

to store data; NFS does not manage storage. Primary server functions manage client

38

requests and transport data. Server statelessness simpliﬁes crash recovery. Failed clients do
not affect the operations of servers or other clients. Servers that fail need only reboot;
clients resend requests not completed within a given duration. Clients perceive failed
servers as slow servers.

NFS offers portability and high connectivity but lacks fundamentals necessary for good
performance. Single servers become bottlenecks as the number and size of client requests
increase. NFS is distributed in the sense that multiple computers share ﬁles; however, the
design is not distributed in a manner capable of providing scalable performance. Single

servers also make NFS vulnerable to failures.

2. 3.2 Andrew and Coda File Systems

The Andrew File System (AFS) was developed from a joint research project between IBM
and Carnegie Mellon University [Satyanarayanan 1990]. Coda descended from AFS
research [Satyanarayanan 1989]. Both AFS and Coda are designed to operate on distributed
networks of workstations scaling up to 5000 machines. AFS and Coda use locally attached
storage devices on both servers and clients.

AFS distributes the ﬁle system across multiple server computers. AFS servers maintain
state. Servers perform call-backs when client cached data is modiﬁed by other clients. AFS
only guarantees consistency at the granularity of the entire ﬁle. When multiple copies of a
ﬁle exist, servers save the last ﬁle written.

Transarc Corporation took AFS technology and developed the Distributed File System

(DFS) [Bever 1993]. DFS is the basis of the Open Software Foundation (OSF) Distributed

39

Computing Environment (DCE) [Kazar 1990]. DFS provides stronger UNIX consistency
semantics than AFS.

Coda, which stands for “Constant Data Availability”, improves the availability of AFS.
Clients cache entire ﬁles locally in memory and on disk. Furthermore, multiple ﬁle copies
may exist on different servers. Single server failures have little impact on availability.
Clients may also run in disconnected operation mode, thereby using only locally cached
ﬁles. Disconnected clients later reconnect to the network and synchronize modiﬁed ﬁles
with the distributed system.

Like AFS, Coda distributes ﬁle manager responsibilities to server computers, although
Coda clients also perform server-like functions during disconnected operation. For this
reason, the Coda client ﬁle manager organization is a merged client/server system with

private ﬁle managers. However, Coda server design is that of a distributed organization.

2. 3.3 Zebra and xFS
Zebra network ﬁle system [Hartman 1993] is a log-structured ﬁle system striped over
network storage nodes directly by clients. Zebra stripes client logs of recent ﬁle system
modiﬁcations across network storage servers and uses RAID level 4 to ensure fault-
tolerance of the each log. By logging many recent modiﬁcations before initiating a parallel
write to all storage servers, Zebra avoids the small write problem of RAID level 4.

As in the Log-structured File System (LFS) [Rosenblum 1992.], Zebra uses stripe
cleaners to reclaim free space. Zebra assumes clients are trusted; each time a client ﬂushes
a log to the storage servers, it notiﬁes the ﬁle manager of the new location of the ﬁle blocks

just written through a message, called a “delta” which is post-processed by the manager to

40

resolve conﬂicts with the cleaner. Zebra lets each clients write to the storage servers
without going through the server and coordinates the clients and the cleaners optimistically
with ﬁle server post-processing. By making clients responsible for allocating storage for
new ﬁles across the storage servers, Zebra effectively delegates to the clients the
responsibility of low-level storage management.

The Serverless Network File System (xFS) is part of the Network of Workstations
(NOW) project at the University of California at Berkeley [Anderson 1995a]. xFS uses a
log structured organization like the Log-structured File System (LFS) [Rosenblum 1992.]
and striping techniques from Zebra [Hartman 1993] to simplify failure recovery and
provide high throughput transfers. Fast, switched networks connect xFS clients.

The xFS project recognizes that central servers are performance and reliability
bottlenecks. Therefore, xFS distributes traditional server responsibilities to the clients.
Hence, any system can manage control directives, metadata, and real data. The serverless
design attempts to improve load balancing, scalability, and availability.

xFS differs from Coda, and AFS, since xFS distributes metadata management across
multiple nodes. In contrast, the other systems divide directory trees into subtrees and assign

each subtree to different servers. xFS is a merged client/server architecture.

2. 3.4 Petal and Frangipani

Petal was a research project at Compaq’s Systems Research Center [Lee 1996] based on
arrays of storage-appliance-like disk “bricks” but offering a block-oriented interface rather
than a ﬁle interface. Petal scaled by splitting the controller function over a cluster of

controllers, any one of which had access to consistent global state. As a Petal system’s

41

capacity grew, so did the number of Petal servers in the cluster along with the performance
they sustained. Logically, Petal could be viewed as a RAID system implemented on a
symmetric multiprocessor, though it used distributed consensus algorithms instead of
shared memory for global state.

Frangipani is a distributed ﬁle system based on the lower layer Patel, which provides a
distributed storage service [Thekkath 1997]. Multiple machines run the same frangipani
ﬁle system code on top of a shared Petal virtual disk, using distributed loc service to ensure
coherence. Frangipani is designed to run in a cluster of machines that are under a common

administration and can communicate securely.

2. 3.5 Network Attached Secure Disks (NA SD)

Network Attached Secure Disks (NASD) was a research project at Carnegie Mellon
University. NASD exploits the computational power at storage devices to perform parallel
and network ﬁle system functions, as well as more traditional storage optimizations
[Gibson 1998]. The basic goal of the NASD project is to eliminate the server bottleneck
from the storage hierarchy, and make disks directly accessible to clients. This eliminates
the need to move all data from the disks, over a storage network, through the memory

system of a server machine, over a client network, and to the clients.

42

 

 

 

 

 

 

 

 

 

 

 

 

ontro at

F NASD File Manager I ‘7
Network Protocol] | Access Control J LCM“t 310'3991 d

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

“Hr Network Driver j r Trecurity 1 etwo W
33M .3 (a
[REMOrklnterface] <5> ® @
(D
$3; @ Local Area Network
w %

     

 

Figure 2-10 Network-attached secure disks (NASD)

The NASD architecture can be summarized in its three key attributes:

1)

2)

Direct transfer to clients; Data accessed by a ﬁlesystem client is transferred between
NASD drive and client without indirection (store-and-forward) through a ﬁle server
machine. NASD are designed to ofﬂoad more of the ﬁle system’s simple and
performance-critical operations. For example in Figure 2-10, a client, prior to
reading a ﬁle, requests access to that ﬁle from the ﬁle manager (1), which delivers a
capability to the authorized client (2). So equipped, the client may make repeated
accesses to different regions of the ﬁle (3, 4) without contacting the ﬁle manager
again unless the ﬁle manager chooses to force reauthorization by revoking the
capability (5).

Asynchronous oversight by ﬁle managers; Access control decisions made by a ﬁle
manager must be enforced by a NASD drive. This enforcement implies
authentication of the ﬁle manager’s decisions. The authenticated decision

authorizes particular operations on particular groupings of storage. Because this

43

authorization is asynchronous, a NASD device may be required to record an audit
trail of operations performed, or to revoke authorization at the ﬁle manager’s
discretion.

3) The abstraction of variable-length objects. The NASD interface abandons the
notion that ﬁle managers understand and directly control storage layout. Instead,
NASD drives store variable-length, logical byte streams called objects. Filesystems
wanting to allocate storage for a new ﬁle request one or more objects to hold the
ﬁle’s data. Read and write operations apply to a byte region (or multiple regions)
within an object. The layout of an object on the physical media is determined by the
NASD drive.

To exploit the high bandwidth possible in a NASD storage architecture, the client-
resident portion of a distributed ﬁle system needs to make large, parallel data requests
across multiple NASD drives and to minimize copying, preferably bypassing operating
system ﬁle caches. Cheops is a storage service that can be layered over NASD devices to
accomplish this function. In particular, Cheops was designed to provide this function
transparently to the hi gher-level ﬁle systems.

Figure 2—11 depicts how NASD, Cheops, and ﬁlesystem code ﬁt together to enable
direct parallel transfers to the client while maintaining the NASD abstraction to the
ﬁlesystem. At each client, a ﬁle clerk performs ﬁle caching and namespace mapping
between high-level ﬁlenames and virtual storage objects. The storage clerk on the client
receives logical accesses to a virtual storage object space, and maps the accesses onto
physical accesses to the NASD objects. Parallel transfers are then carried out from multiple

NASDs into the storage clerk.

 

I ------ 1

l
I
l
I
I
l
.l

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

   

: Client-app : : Client-app :
r ______ ﬁlejntertace. '1 ............ i ..... I ............ :. ........
: £53m?” 'lr'c———) File clerk : | File clerk :
;_._-._.-._.-._.-._-a,_.-,-,:,_:........:. ------------ . ----- :r ------------ .L --------
rim a”: °' +—r_z~:°r L°"°°“f°_".‘.l
Object interface ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
(network)

 

 

 

Figure 2-11 NASD/Cheops architecture

Cheops implements storage striping and RAID functions but not ﬁle naming and other
directory services. This maintains the traditional “division of concerns” between
ﬁlesystems and storage subsystems, such as RAID arrays. Cheops performs the function of
a disk array controller in a traditional system. One of the design goals of Cheops was to
scale to a very large numbers of nodes. Another goal was for Cheops to export a NASD
interface, so that it can be transparently layered below ﬁlesystems ported to NASD. A
NASD-optimized parallel ﬁlesystem, NASD/PFS, has been implemented using a cluster of

workstations simulating NASD devices.

45

Chapter 3

The CoStore Storage Cluster Architecture
and Performance Evaluation

3.1 Introduction

A CoStore system clusters a variety of network attached storage devices, each capable of
providing ﬁle interface in addition to block interface. Using the same NAS approach, a
CoStore cluster offers a ﬁle interface with built-in fault-tolerance to achieve strong
reliability and availability, similar to modern storage appliances. With the system
responsibilities evenly distributed across all collaborating storage devices, the proposed
architecture provides scalable high-performance high-capacity storage services,
traditionally only achievable by high-end storage systems.

Jim Gary predicts that, with IPv6 network interfaces and operating systems, storage
bricks have arrived and they will evolve from block servers to application servers [Gray
2002]. Before these NAS disks materialize in the market, we will use commodity PCs with
locally attached disks to simulate NAS disks. It is worth to point out that advanced storage
systems can also be constructed in the same manner by clustering high-end computers
attached with huge SAN devices.

Distributed computing based on COTS (commercial-off-the-shelf) components has
been a burgeoning strategy to deliver high performance with superior cost effectiveness
and ﬂexibility [TFCC 2001]. The availability of inexpensive high-performance PCs and
other high-grade commodity components, such as disk drives, network interface cards,
memory chips, has made great sense for COTS-based storage clusters. The author believes

that the CoStore cluster architecture using network attached storage devices can achieve the

46

high performance of advanced storage systems through ensembles of intelligent storage
devices.

The potentials of commodity components are often underestimated in high-end storage
system arena. With the disk capacity in excess of 100 GB, a PC can easily store up to half
TB even with IDE interfaces, whose bandwidths also have reached to as high as 133MB/s
[Fido 2001] per channel. Assisted by the latest SCS1320 [Mason 2000] a PC can store
multiple TB with transfer rates up to 320MB/s per channel with up to 15 devices. Due to
intense market competitions, the speed of processors has been too fast to easily name a
CPU killer-application for personal computing. Cheap PC memory prices make extensive
caching less expensive. With switched Fast Ethernet has become commonplace, Gigabit
Ethernet is getting more affordable recently [Sander 2001a]. With disks scattered into PCs,
we do not need to deal with the cooling, packaging and powering challenges commonly

encountered in high-density storage servers.

3.2 Technology Trends In Networked Storage

3. 2. 1 Storage interfaces: block vs. file

Gibson et al. classiﬁed storage device interfaces into two abstractions in [Gibson 2000].
block, is a simple, untyped, ﬁxed-size (block), memory-like interface for manipulating
nonvolatile magnetic media. Traditional disk drives (IDE or SCSI), disk arrays or even
SAN rack systems essentially are all block devices. The other interface is ﬁle, which is a
richer, typed, variable-size (ﬁle), hierarchical interface. Network attached storage (NAS)
systems provide a ﬁle interface, which is similar to that of a traditional local ﬁle system.

Storage appliances are in fact intelligent devices that provide ﬁle interface storage services

47

by hiding the details of managing internal nonvolatile media through a block interface.
NAS systems provide the same functions as computers running distributed ﬁle systems
with attached disks. Differently, NAS systems are normally based on aggressively
specialized hardware and software and are attached internally with SAN systems. Storage
appliances are NAS systems engineered to be especially simple to manage and extremely

reliable, like a home appliance [Hitz 1997].

3. 2.2 Architecture taxonomy
Gibson et al. present a taxonomy of network attached storage architectures in [Gibson
1996]. Case 0 in the taxonomy is server-attached disks (SAD) as shown in Figure 3-1.
Disks locally attach to general server computers. Servers transfer data between server
storage devices and memory via traditional I/O buses using protocols like SCSI. The data is
transferred to client’s memories via network links typically using protocols like TCP/IP.
Case 1 devices, known as Server Integrated Disks (SID), are more specialized
computers that only perform distributed ﬁle system functions. The disks are still connected
locally through I/O channels like SCSI. Case 2 is deﬁned as Network SCSI (NetSCSI),
which directly transfers data between clients and storage devices via SCSI over network
protocols. NetSCSI is a network-attached disk architecture designed for minimal change in
the disk’s command interface. NetSCSI devices are block addressable. File manager
computers facilitate ﬁle system operations and name-space manipulations between clients

and storage devices.

48

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

File Server
f NASD File Manager J
T [Network Protocol ] | [Sc—El File Sgtem | Tl ' l
' F‘S‘C’Sm » 0 0 0
‘1] Network Dnver J nver it <5) SEE d
BackplaneCD Q I3) 1 ®
Bus I in F mm ] [SCSI I t rt I—Q‘J
N twork lnte ace tern emo n e acel‘
r e I ys W ‘ (Packetized) SCSI

 

 

 

 

 

 

(i) 3
Local Area Network

are

Figure 3-1 Server-attached disks (SAD)

 

 

 

   

 

 

 

Case 3 approach proposed by Gibson et al. is called Network Attached Secure Disks
(NASD) NASD is an enhanced device interface that supports object addressable
operations, which is higher level than NetSCSI, which is only block addressable. Objects
may be data extents or ﬁles. Standalone NASD ﬁle manager maintains the name-space by
performing standard ﬁle system functions. NASD devices authenticate client request via
ﬁle managers to ensure security. The key enabling technology in NASD is a powerful on-
drive microprocessor capable of executing the drive's embedded ﬁle system, networking
and security code.

The evolution of storage architecture can also be categorized into three generations.
Case 0 and case 1 in Gibson’s taxonomy are the ﬁrst-generation, which essentially includes
computers connected to local disks via traditional I/O interfaces like SCSI. The more than
20 years old SCSI interface has several limitations: a clumsy parallel cable; limited

distance to devices; scalability problem because only up to 16 devices can be connected to

49

each channel; and limited headroom for bandwidth growth, even though the newer version

promising up to 160MB per second.

3.2.3 Storage Area Network (SAN)

The Storage Area Network (SAN) emerges as data communication platform to interconnect
servers and storage devices at Gigabit speed. With a dedicated high-speed storage network,
SAN eliminates the bandwidth bottleneck and both scalability and distance limitations
haunted the ﬁrst generation architecture for a long time. The ANSI standard for high-end
storage interface is Fibre Channel Protocol for SCSI (FCP) on Fibre Channel - Arbitrated
Loop (FC-AL) network. The other interface, Serial Storage Architecture (SSA) from IBM
competing a few years ago, failed because of little market acceptance to the proprietary
proposal and much lower bandwidth. Both FC-AL and SSA are serial storage interface. If
NetSCSI in Case 2 has a separate network for storage devices, or SAN connects disks using
the same local area network, then SAN and NetSCSI share the same architecture.

The second-generation storage architecture is based on SAN. All the disks connected
are combined together to provide a single virtual huge disk device through interconnect of
Fibre Channel (FC) network. Also through FC interface, the huge virtual disk provides a
block addressable abstraction to a general purpose computer, which runs as a distributed
ﬁle system server for Sun Microsystems’s Network File System (NFS), or Common
lntemet File System (CIFS) service [Leach 1997] from Microsoft or the open source
Samba project.

The core concept of SAN is to use a fully connected FC-AL network infrastructure,

rather than direct-attached SCSI devices, to manage a collection of storage devices. The

50

single FC-AL network connects storage devices and storage servers, but is still separate
from the general-purpose network that connects clients and storage servers. This allows
multiple hosts to share the same storage, but still requires clients to access data through

intermediate storage servers rather than directly.

3.2.4 Network Attached Storage (NAS)

The latest third-generation architecture is Network Attached Storage (NAS) with the
concept of storage appliance. Storage appliances are servers that have been specialized to
perform only storage service [Hitz 1994]. Storage appliances are designed to be especially
simple to manage [Gibson 2000]. With higher-level interface, ﬁle, to clients, storage
appliances provide storage service at the distributed ﬁlesystem level through standardized
protocols including NFS, CIFS, and HTTP. Storage appliances are seamlessly integrated
into the whole infrastructure. Storage appliances authenticate client requests through login
servers via interfaces like Light-weight Directory Access Protocol (LDAP), or Sun
Microsystems’s Network Information Service (N18) or Microsoft’s Windows NT domain
controllers.

NAS is more a trend than a new architecture, because most of the NAS devices of
recent models are built using SAN technology (Figure 3-2). What really differentiates NAS
devices from SAN devices is the former includes a ﬁle system and provides ﬁle interface
while the latter has only block addressable interface [Gibson 2000]. NAS devices normally
are standalone server with very specialized software and hardware connected with disk
devices via hi gh-speed SAN network, Fibre Channel for example. Compared with general-

purpose computers, both the software and hardware are striped down with only the

51

necessary modules to handle: network communication; data redundancy management; ﬁle
system management, and distributed ﬁle system functions among others. On the contrary,
the onboard controller of SAN devices mainly implements data redundancy management.
Another important differentiation between NAS and SAN devices is the former has a close
association with Ethernet network hardware, while the latter with Fibre Channel network
hardware [Gibson 2000]. Both architectures may use Fibre Channel as internal connection

interface.

 

file system

(metadata+data) storage volume
(blocks)

”-ooooocg.-...

"‘ \“

 

    

 

 

 

 

 

1':

 

r.
I. ' I
7 u

0"
O
.0

 

 

 

 

 

 

 

Figure 3-2 Network Attached Storage (NAS) vs.
Storage Area Network (SAN)

3.2.5 Upcoming storage architectures

Providing an interface of ﬁles, the NASD architecture belongs to the third generation
storage architecture. Different from most storage appliances, the standalone ﬁle manager in
N ASD connects to disks through local area network such as Ethernet. Not only do these
disks have Ethernet interface, they also are more intelligent than standard disks on the

market. These intelligent disks will perform more involving functions like object

52

manipulation than normal disks do with block interface in SAN. Data redundancy is
provided as an extra layer called Cheops, running on clients.

Like Fibre Channel’s command protocol FCP, lntemet SCSI (iSCSI) is a SAN
interface. According to [Van Meter 1998], it is a natural way to exploit the inﬂuence of the
lntemet by layering a block-level SAN protocol over lntemet protocols, as was
demonstrated by the Netstation project at the University of Southern Califomia’s
Information Sciences Institute.

IPS (IP Storage) working group has chartered to work on security, naming, discovery,
conﬁguration, and quality of service for IP storage. iSCSI embodies the effort to generalize
in storage-device networking. The IPS working group use TCP as transport mechanism to
reliably deliver data over IP. However, the connection oriented-ness and the efﬁciency of
TCP over hi gher-speed network have prompted some to propose an entirely different
congestion-control algorithm that is appropriate for storage trafﬁc.

Instead of the highly specialized super storage appliance, one major alternative
approach is the use of PC clusters with a low-latency cluster interconnect based on network
interface cards that can ofﬂoad protocol processing from each machine’s main processor
[Intel 1997]. Such cluster approaches require specialized software. Network Appliance and
Intel have proposed such ﬁle system architecture and protocol called the Direct Access File
System (DAFS) [NetworkAppliance 2000]. The implementation of the proposed
architecture in this study can take advantage of the direct ﬁle access protocol.

With the introduction of Virtual Interface Architecture (VIA) [Intel 1997], extremely
high performance storage clusters can be implemented. VIA’s Remote DMA (RDMA)

support is capable of sending data in and out from I/O devices to the network, with none or

53

little involvement of server nodes’ host CPUs. One example is the Direct Access File
System (DAFS) project [NetworkAppliance 2000] proposed by Network Appliance and

Intel.

3.3 A Storage Cluster Architecture Using Network Attached Storage Devices
3.3.1 Architecture overview

A CoStore cluster implements a virtual storage server with a ﬁle interface as in other
storage appliance products using NAS approach. To provide an interface of a single virtual
cluster server, each cluster is assigned with a multicast IP address. All participating cluster
members join in this multicast group, whose IP address is known to its clients. The main
advantage of NAS approach is that internally the design can seamlessly integrate major
storage components to work closely together. The close integration of local ﬁle system
management and fault-tolerance management is essential to the efﬁciency of CoStore

architecture, as in other NAS approach based storage server designs [Hitz 1994].

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CoStore cluster
data data data —
. _
' m
—
ta I I I meta meta —
me
data data data m
storage storage ‘ storage
device 0 device n-1 device n
l l
v v
Ethernet: multicast and unicast (JDP, TOP) |
client client clignt

 

 

 

Figure 3-3 The structure of a CoStore cluster

54

As shown in Figure 3-3, a CoStore cluster consists of a group of network attached
storage devices and clients that access the storage resources provided by the cluster. All
members in a CoStore cluster work collaboratively to construct a storage system with a
uniﬁed ﬁle namespace, i.e. one root directory. One multicast IP address for a CoStore
cluster gives clients an impression of a single server initially. This virtual CoStore server is
a serverless design because no central ﬁle manager is required to maintain the consistency
of the overall namespace in the cluster. Both ﬁle system metadata (and hence local ﬁle
system management) and distributed ﬁle system responsibilities are evenly distributed
across all participating members.

Each of the storage device members in the cluster runs a program, called daemon, to
serve clients’ requests. There are three key functional modules in a daemon program
(Figure 3-4). The top layer is the distributed ﬁle subsystem, which presents a ﬁle interface
to CoStore clients. The middle layer is the local ﬁle subsystem, which manages block
resources as part of the global namespace of the cluster. The block resources can be raw
disks, disk partitions, or even large ﬁles in regular local ﬁle systems. The local ﬁle
subsystem accesses the linearly addressable block resources via the block interface
provided by the bottom layer RAID subsystem. By providing a block interface, the RAID
subsystem hides fault-tolerance details from upper layers and maintains various levels of
data redundancy. The lower two subsystems interact with peer daemons and the top layer

interacts with clients, both via communication networks.

55

 

are... ‘

 

daemon n i

 

 

 

 

 

 

 

 

 

 

Distributed File Subsystem ,
peer ‘ ' Local File Subsystem ;: ’ d peer r;
daemons 4 ’ RAID Subsystem :‘ TL aemons
o

 

 

block
device

Figure 34 Functional modules in a CoStore daemon

 

 

 

 

3. 3.2 Distributed file subsystem
Network File System (NFS) has been proved to be an extremely successful distributed ﬁle
system protocol, whose success should largely attribute to NFS protocol's simple and
stateless design [Callaghan 2000]. Not to reinvent the wheel, this virtual CoStore server
adopts NFS Version 3 protocol [Callaghan 1995] as its interface to clients. Most NFS
servers have been implemented using RPC, which is a mechanism for point-to-point
remote procedure calls. However, because of the multi-entity nature of a cluster, traditional
RPC library is not capable to support the one-to-many call from a client to a CoStore
cluster. In our prototype we resort to low-level socket programming using UDP to
implement the NFS protocol between a CoStore cluster and clients. Even though our
prototype CoStore cluster implementation strictly follows NFS protocol, at binary level
CoStore cluster is not compatible with the many traditional NFS client implementations.
Therefore at client end, a ﬁle system module is needed in order to access storage
services from a CoStore cluster. Currently a standalone ftp-like program is used to simulate
a regular client. Alternatively a CoStore proxy can be implemented to support standard

NFS and MOUNTD calls from traditional NFS clients. With such a proxy, it is possible for

56

CoStore to reach more platforms without having to implement client modules for speciﬁc
platforms because NFS is supported on most platforms.

Multicast communication is only used when necessary, such as at initialization stage
when a client makes the ﬁrst lookup call. All subsequent requests are sent to individual
members’ UDP ports using unicast. The reason is that each multicast involves all group
members and may generate unnecessary network trafﬁc to those machines that the request
was not intended. More efﬁcient unicast using UDP protocol is used for communications
between a cluster and clients. Point-to-point TCP protocol is used for internal
communications among cluster members.

As in NFS protocol, CoStore enforces Unix semantics for ﬁle sharing, that is, the latest
write prevails. At current stage in CoStore, there is no caching at either server side or client
side except the block buffer cache in CoStore’s RAID subsystem. In future study we will
look into whether caching at various levels can be exploited to improve performance

without violating NFS’s stateless design.

3. 3.3 Local file subsystem
The purpose of a local ﬁle subsystem is to manage the linear addressable storage space on
block devices. In CoStore, the local ﬁle subsystem manipulates an abstract block device
through an interface, such as read_blocks() and write_blocks(), provided by the lower
RAID subsystem.

In CoStore disk space is efﬁciently managed by individual local daemons in the same
way as any generic local ﬁle system. Without loss of generality, we adopt the well-

established Unix ﬁle system structure in the local ﬁle system management. Speciﬁcally

57

CoStore’s local ﬁle system has the same layout as the de facto ﬁle system on Linux:
Second Extended File System (Ext2) [Card 1986]. More sophisticated local ﬁle systems
can be utilized in the future. For instance, the latest Ext3 and ReiserFS ﬁle systems support
joumaling [Robbins 2001], which makes ﬁle system consistency much more stable.
Focusing on the cluster architecture the prototype implementation in this study only adopts
local ﬁle system as generic as Ext2.

In CoStore Ext2 ﬁle system structure is furnished with a special block addressing and
inode numbering scheme to unify all the local ﬁle systems on individual daemons. Both
data blocks and inodes on each daemon are numbered in 32—bit integers. However the
highest 8 bits are reserved for daemon’s identiﬁcation number (Figure 3-5). 32-bit integers
(cluster capacity up to 16 TB) are sufﬁcient in our initial prototype implementation. For
large capacity clusters the current design can be easily extended to accommodate 64-bit

integers.

 

 

Ldaerr?on ID ] local in"... number ]

 

 

 

| daenion ID 1 local blocz: address ]

 

 

Figure 3-5 Block addressing and inode numbering
scheme

With this scheme, multiple local ﬁle systems on individual storage devices are
combined into a global large ﬁle system with one hierarchy of namespace. This ﬁle system
design is best summed up as Data Anywhere, Metadata at Fixed Locations. This design is
efﬁcient because of the locality of metadata management and yet ﬂexible because of the

data anywhere layout. The feature of Data Anywhere also includes indirect block pointers.

58

Figure 3-6 shows an example of three scenarios to allocate blocks for one ﬁle. The blocks
allocated can be limited to local disk space only, or can be spread across multiple or all

storage devices in the cluster.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

A CoStore cluster with RAID level 4 H
device 0 device 1 device 2 parity device
super super super super
block block block block
block block block block
bltrnap bitmap bltrna bit
inode lnode I ﬁe ln§e
bitmap bitmap bit __blt_maL
lnode lnode lnode . _ _ lnode
table table table table
+< scenario A
data data data data
blocks blocks blocks blocks
scenario )
1 l
x Irealism 3; 3
L__l I___l L___J

 

 

 

 

 

 

Figure 3-6 File system layout in a CoStore cluster with
RAID level 4

Considering the metadata as directories for the data blocks, CoStore architecture
resembles a CC-NUMA (Cache Coherency - Non-Unifonn Memory Access) distributed
shared memory multiprocessor system with directory-based cache coherence protocol as in
DASH [Lenoski 1990]. In DASH, the local memory is managed using a directory by a
local node. In CoStore, the local space is is managed by each daemon using a metadata,
essentially a directory (Figure 3-7). Likewise, the consistency of the global ﬁle system
management in CoStore cluster is achieved with distributed locking on a per ﬁle (lnode)
basis. One daemon can request remote daemons to assign inodes or allocate blocks. After
having obtained the exclusive lock to an lnode on a remote storage daemon, the daemon

can read to write to the data blocks for that lnode via remote daemon’s block interface.

59

Our CoStore prototype implements the traditional Unix-style security with permission
bits in inodes. This is safe in a secure environment with daemons trusting the operating
systems on client machines. Without major changes, more sophisticated security
mechanism can be employed to enhance the CoStore design, which is beyond the scope of

this study.

 

Daemon Daemon Daemon

    

          
   
   
 

Meta-data Meta-data

CA
CA: communlcatlon assistant

 

 

 

Figure 3-7 Architectural similarities between CoStore
and DASH

3. 3.4 RAID subsystem
The RAID subsystem provides an abstract block device to local ﬁle subsystem by hiding
the fault-tolerance details. Internally RAID subsystem will read from or write to physical
magnetic storage devices while maintaining data redundancy at RAID levels of user’s
choice. When a modifying write_blocks() is called, distributed RAID semantics with
various RAID levels has to be enforced by synchronizing mirror or parity updates to
remote storage devices.

Fault-tolerance is essential when each individual storage device in a CoStore cluster is
not reliable enough. In CoStore different RAID levels with variable reliability features are

supported, each with different overheads in space or latency. The current CoStore

60

prototype supports RAID level 0, 1, 0+1 (or 10), 4 and 4+1. Support for RAID 5 and 5+1 is
still ongoing. RAID level 5 is normally preferred because of its balanced performance.
However, the parity-rotating scheme makes the distributed RAID implementation more
complicated than other levels.

At the meantime RAID level 4 does have its own advantages: simplicity and
incremental expandability. In a cluster of RAID 4, the capacity can grow by simply adding
more devices without expensive RAID reconstruction. As a side note, RAID level 5+1, 4+1
and 0+1 are extremely useful in disaster avoidance applications, such as mirroring of
clusters in two distant sites as we will explore separately in Chapter 5. One nice feature of
RAID 0+] is that storage devices in one cluster are not required to have identical capacity
(except mirror pairs) because logical data blocks are not striped across all devices in a
horizontal manner. Instead, all data blocks are managed locally.

With special techniques the parity bottleneck problem in level 4 can be signiﬁcantly
alleviated as has been demonstrated in products from Network Appliance [Hitz 1994].
Because all local ﬁle systems have identical layout, stripes in RAID 4 are aligned with
respect to block types: metadata or user data as shown in Figure 3-4. The relatively more
frequently updated metadata stripes are naturally consolidated into fewer parity updates,
assuming that the trafﬁc is balanced in terms of load and space utilization on all storage
devices. Due to the stripe alignment, metadata stripes can be differentiated from ordinary
data stripes with different priorities to update parity. The metadata consistency is critical to
each ﬁle system's overall stability. Therefore, it is desirable to enhance local ﬁle system in

CoStore with joumaling support.

61

Because of longer than traditional delays, the concept of delayed parity updates in
AFRAII) [Savage 1996] will be helpful to improve performance when distributed RAID is
enabled. In future study we will explore the effects of delays on the performance and

reliability of CoStore systems.

3. 3.4. 1 Block buffer cache
All modern operating systems employ a block buffer cache system based on the concept
described by Maurice J. Bach [Bach 1986]. However, the RAID functionality in CoStore
requires having direct control of committing modiﬁed data blocks to storage devices.
Therefore, the RAID subsystem is implemented in an independent block-buffer cache
module. In public domain, there is no implementation more famous than the block buffer
cache in Linux kernel [Beck 1998]. Our prototype borrows the buﬁer_head data structure
and hence the list-manipulating algorithms from Linux kernel.

buﬁer_head is a header structure pointing to a block cache of buffered data. One buffer
cache system includes a hash table used to accelerate buﬁer_head searching, and three
circular doubly-link lists: lru_list, dirtyJist and lock_list (Figure 3-8). At one moment,
each buﬂerJread can only belong to one list. Initially all the buﬂer_head are stored in a
free list. Once used, the buffer_head is ﬁrst added into the hash table according to the hash
value based on block number. Using separate chaining method to solve hashing collisions,
each hash table entry is a non-circular doubly-link list, which will be short if the table is

relative large and the hash function is well chosen.

62

 

 

dirty_list:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

bh_a
bh_a
hash_table , ,J h-next
,x’” , ah_pprev
Ditch, IIII ’ I_next
I _prev
: bh_1 ~\‘
\blL 1.- MRU
\\‘* h_next
lru_ll'st: ,‘ h—Pp'e"
bh_1 r I_next

 

 

 

 

hash chains-"k
list links ——*

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

free_list:
bh_2: MRU
/ h_ﬂGXt- ”:1:
I,” J h _pprev ' lock_list:
,"l I_next
I_prev
LRU
h_next h_next
h _pprev h _pprev
I_next >
I _prev .

 

 

 

 

 

 

I_next
I_prev ﬂ

 

Figure 3-8 Linklists in block buffer cache

The status of the buffered data pointed by buﬁer_head indicates which list the

If one clean buffer is modiﬁed, the buﬂer_head will be inserted into dirty__list. At

buﬁer_head belongs to. If the buffered data is clean, then buﬁ’erjread is in lru_list, which
is named after the replacement algorithm. Each read hit in lru__list will make the
buﬁer_head removed and inserted into to the head of lru_list as the MR U (Most Recently
Used). If new buﬁer_head is needed and there is none in free_list, the tail of lru_list will be
removed as the LRU (Least Recently Used). Because of the hash table, locating a
buﬁer_head never has to examine a long list, but a short chaining queue for those

buﬁerjreads that collided on one hash table entry.

meantime mirror updates will be copied and/or parity (XOR) updates will be generated to
separate accumulative buffers, which will be sent to targeted daemons once the buffer is
full or commit periods time out. dirty_list is kept short due to dirty blocks ﬂushing, which

is either periodical or mandatory per transaction depending on the system policy.

 

To speed up performance, asynchronous write is used to commit modiﬁed data blocks
to media. Once a write request is issued, the bujfer_head will be moved from dirty_list to
lock_list, which means that it can only be read and a new write has to block until the
pending write is completed by block devices. Once the modiﬁed data is committed to the
media, the buffer_head can be removed from lock_list and be inserted into lru_list as the

MRU or into dirty_list if it is being modiﬁed again.

3. 3.5 System operations

NFS protocol is based on a cornerstone called ﬁlehandle. Each ﬁle or directory has one
ﬁlehandle in the ﬁle system. Each NFS procedure call has to provide a ﬁlehandle, upon
which the request action can be performed. In production systems, ﬁlehandle embodies
much more information. Without loss of generality we simply use the 32-bit lnode number
as ﬁlehandle in our CoStore prototype. As in Figure 3-5, there is one ﬁeld in each
ﬁlehandle that indicates which daemon this ﬁle or directory is hosted on.

The ﬁlehandle for the root directory in one namespace is a special one (all 0’s), called
public ﬁlehandle. Therefore clients can start from this well-know ﬁlehandle without having
to resort to MOUNTD protocols. The most importance one in all NFS procedure calls is
LOOKUP, which evaluates the pathnames relative to current path or root to ﬁnd the target
ﬁlehandle. NFS version 4 will support multi-component LOOKUP operation, which is not

support in CoStore or NFS 3.

64

3. 3. 5. 1 Cluster initialization

At startup, each CoStore daemon reads conﬁguration information from a local
conﬁguration ﬁle, which includes the daemon’s ID number, RAID level, RAID role
(storage-, mirror- or parity-daemon). It also includes the multicast IP address for the
cluster, TCP and UDP ports, and various buffer sizes etc.

Multicast communication is used in the cluster’s initialization process and can
potentially serve as system status bus for the cluster. Each CoStore cluster is assigned one
IP multicast address. At initialization stage, all daemons join this multicast group, exchange
information with other daemons and construct a daemon’s identiﬁcation number to IP
address (ID-to-IP) mapping table, shared by all participating daemons. When initialization
is ﬁnished, each daemon listens on the UDP port of the multicast address and its own UDP
port for requests. All daemons also establish a full-connection among them using TCP for

intra—cluster communications.

3. 3. 5.2 Client initialization
When a CoStore client ﬁrst starts up, it needs to get hold of the root directory in a cluster.
The client makes a LOOKUP call using the public ﬁlehandle by sending a request to the
known multicast IP address. All daemons will receive the request. Only daemon 0 which
hosts the root directory will answer the LOOKUP request and will piggyback one ID-to-IP
table in the reply.

With the ID-to-IP mapping table piggybacked in the reply, the client can send requests
directly to individual daemons based on the daemon ID ﬁeld in the ﬁlehandle. The same

request can also continue to be sent to the multicast IP address. Every daemon checks

65

requests received at the multicast address and only processes these requests with the

daemon ID in the ﬁlehandle matching the daemon’s own identiﬁcation number.

3. 3. 5.3 File and directory creation

File creation. Files are created in the same way as traditional NFS systems by calling
CREATE on target directory with a ﬁle name. Only the daemon owning the target
directory will process the request. The daemon ﬁrst validates the ﬁlehandle for the target
directory in its local ﬁle system and then allocates one inode (and therefore the ﬁlehandle
for the new ﬁle) from the inode table on that daemon. If the CREATE indicates the ﬁle
size, the daemon will also allocate the desired blocks. Then the daemon replies the
ﬁlehandle for the new ﬁle to the requesting client. Upon receiving the new ﬁlehandle, the
client can write to or read from that ﬁle by calling WRITE or READ on that ﬁlehandle. All
ﬁles are created on the same daemon as its parent directory.

Directory creation. Directory is a special ﬁle and can be created similarly by calling
MKDIR. It seems that initially all ﬁles and directories will be created on daemon 0 and
other daemons will have no business at all. To balance the load to all daemons, the MKDIR
procedure is extended by an extra parameter: target daemon number. By default directories
are created on the same daemon as its parent directory. The user can manually indicate the
target daemon on which the new directory should be stored. For example, mkdir foo 4 will
make the new foo subdirectory created on daemon 4. Ideally the target daemon of new
directories could be decided by all daemons dynamically to load-balance all new generated

directories and ﬁles. The CoStore prototype has yet supported dynamic load-balancing.

66

3. 3. 5.4 lntra-cluster communications and deadlock

A full-connection using TCP among all daemons is established at initialization time for
intra-cluster communications. So far there are three different kinds of intra-cluster requests:
local ﬁle system (LFS), logic block interface, and physical block interface for distributed
RAID. The physical block interface is used for parity updates. Because we are using TCP,
these is no Ack for each physical block updates, therefore it is deadlock free.

However, we are less fortunate at LFS level and logical block interface level. The main
purpose of LFS level is primarily for remote directory creation (MKDIR). By imposing a
one-way direction of remote MKDIR we can effectively remove potential loops. That
means remote directories can only be created from lower daemons on higher daemons.

A remote MKDIR involves two LFS requests between daemons. First the source
daemon sends an LFS_NEWJNODE_NO request for a new inode on target daemon.
Second, the source daemon sends an LFS_ALLOC_BLOCK request so that blocks are
allocated for a directory with a default size, which will have at least two entries: self (.) and
parent (..). Then the source daemons writes the initial two entries (. and ..) using the logical
block interface. The imposed one-way direction for remote MKDIR also removes the need
of read and write to the logical block interface in two-direction, and thus eliminates
potential deadlocks at the logical block level. Therefore CoStore is deadlock free.

Because of the three-pass communication involved in each remote MKDIR call, the
latency is relatively high. In a regular local MKDIR, the latency is about 0.005 second,
while the latency for a remote MKDIR is about 0.18 second. Fortunately remote MKDIRs

are rare and most ﬁles and directories are created locally.

67

3. 3. 5.5 Recovery from node failures

Without data redundancy from the RAID subsystem, CoStore has a single point of failure
because the root directory in the global ﬁle system for one cluster is hosted on daemon 0
only. To overcome node failures, CoStore clusters should be conﬁgured to have parity in
one cluster using RAID 4 or 5, or to mirror individual daemons using RAID 0+1, or both,
using RAID 4+1 or 5+1.

When storage daemons are mirrored in a master-slave pair in RAID 0+1, 4+1 or 5+1,
failure recovery is relatively easy and straightforward. Slave nodes can upgrade themselves
and take the responsibility of master nodes. There may be unrecoverable data lost when
master node failed. How to maintain a consistent ﬁle system (with joumaling) on slave
nodes is an important and interesting problem and deserves closer scrutiny in further
studies.

For clusters with only parity data, recovery takes longer as we have to reconstruct a
new storage node from the surviving node similar to disk failures in disk RAID arrays. The
reconstruction is an expensive operation in terms of data trafﬁc and XOR computation.
Theoretically the cluster can continue operation while construction is performed in
background as a degraded mode disk array. However, it is extremely to implement the
online reconstruction in a distributed manner and the ﬁle performance from failed nodes
would be very slow. So far, none of the recovery features has been implemented in our

prototype, however.

68

3.4 Prototype Implementation

A CoStore prototype has been implemented on Windows 2000 and will be ported to other
platforms once implementation of all functionalities has been done. CoStore is still an
ongoing project. Currently the prototype only supports RAID level 0, 1, 0+1, 4 and 4+1.
RAID level 5 and 5+1 is under development as of this writing. We have conducted some
measurements on basic ﬁle read and write operations on the CoStore prototype, in
comparison with other non-cluster distributed ﬁle systems: NFS on Linux and the ﬁle-
sharing service (CIFS) [Leach 1997] on Windows 2000. Our experience told us that the
CIFS service from Samba on Linux is slightly outperformed by that of Windows 2000.
Therefore we did not include Samba in our experiments.

The limitations of our CoStore prototype are: i). it is only appropriate in a secure
environment and it requires strong encryption schemes if use is expanded beyond LAN
environment; ii). modiﬁcations are necessary to existing NFS client implementations.

CoStore systems can consist of a variety of storage devices: PCs running different
operating systems, high-end servers attached to SAN systems, or even onboard controllers
on NAS devices. Considering the potential heterogeneity in various platforms, CoStore
systems should be implemented as independent of operating systems as possible. For
example, CoStore can be implemented in Java so as to take advantage of its Write Once,
Run Anywhere feature. To implement CoStore in Java is an increasingly attractive choice
as more JINI storage devices are available [Heyn 1999].

At this time the CoStore prototype is implemented in C at user-level. Ideally CoStore’s
block buffer cache and hence the RAID subsystem should be implemented at kernel level

of underlying operating system for efﬁciency and stability, but at the expense of portability.

69

Planning to port the CoStore prototype to other platforms, the authors have begun
investigating the availability of aforementioned requirement in various Unix environments,
particularly the open source Linux and BSDs.

A CoStore system requires at least the following support from underlying operating
systems: network communications, memory management and block device I/O interface.
TCP and UDP sockets are used for communications between clients and the cluster and
among cluster members. Large-memory management is needed for block buffer cache

subsystem. Asynchronous I/O support is necessary for efﬁcient accesses to storage devices.

3.5 Performance Evaluation

3. 5.1 Experimental setup

The experiments were conducted on a cluster of PCs. In most tests, PCs for servers and
clients are Pentium III lGHz with 128MB PC133 SDRAM. There are two tests where PCs
with different speed are used. In one case PCs with Pentium III 550MHz are used as clients
and in another case PCs with Pentium 111 1.2GHz are utilized for both clients and servers.
A lOOMbps Ethernet switch connects all the PCs. The storage devices on the servers are
Quantum Fireball Plus A8205 hard drives (Ultra ATA/ 100 interface and 7200RPM).

For CoStore both client programs and server daemons are running on Windows 2000
Professional. For NFS on Linux and CIFS on Windows 2000 Professional tests, the client
side is Linux. The client mounts the remote ﬁle share as either an smbfs share from
Windows 2000 or an nfs share from NFS on Linux. A kemel-based NFS server (0.2-1) is
running on GNU/Linux (kernel 2.4.2) for NFS tests. The kemel-based NFS service has a

slightly better performance than user-level one.

70

All the tests are a set of consecutive requests for read or write with variable ﬁle sizes.
We arrange the tests into two groups according to ﬁle sizes: 32KB to 512KB for small-size
and 1MB to 128MB for large-size. Each ﬁle size has 4 different ﬁles and an average is
taken as the result for that size. All tests are conducted on a cold system, i.e. no data is in
the cache before read or write requests start.

Unix command dd is invoked to read from or write to remote ﬁles, with a constant
block size 32KB. For example, dd bs=32k count=8 if=/rmt/256k.dat.a of=/loc/256k.dat.a
is used to read 256KB from remote server to local disk. Unix command time is used to
measure the total elapsed time for each request. In CoStore, the client measures the elapse
time for each request by checking the system clock.

The CoStore prototype can choose to use the standard block buffer cache (OS-buff)
from the operating system (Windows 2000), or use an independent block buffer cache
subsystem (RAID-buff). When independent block buffer cache subsystem is enabled in
CoStore, the reserved memory size for block buffer is 32MB and 1MB for mirror or parity
update buffer. When OS-buff is used, read or write accesses to storage devices are buffered
by the block buffer from operating systems. In that case, when data is modiﬁed, rrrirror or
parity updates have to be sent to the RAID partners no matter how small the modiﬁcation
is. To send out many tiny packets is very inefﬁcient use of the network resources. With an
standalone block buffer cache, we could maintain a separate buffer to accumulate small
mirror or parity updates into larger ones.

When RAID-buff is used, the block updates can be conﬁgured as Transaction Comrrrit
or Lazy Commit. Transaction Commit means updates will be committed to devices when

one status-modifying NFS request, such as nfs_write(), is processed. Lazy Commit means

71

dirty block buffers will be ﬂushed to devices periodically (3 seconds in our tests) or when
dirty buffers are swapped out. Similarly mirror or parity update buffers can be conﬁgured
as Transaction Commit or Lazy Commit. Here by committing nrirror or parity updates, we
only mean that the update has been sent out to the network. It does not mean that the
updates have been received by network partners or committed to remote storage devices.
Normally block buffer and mirror or parity buffers are conﬁgured using the same policy:

either lazy or transactional.

72

Latency (seconds)

Latency (seconds)

(a) Small-size write

I l l

 

0-2 J CIFS/Win2K ---o--- _
NFS/Linux wow ,1"
CoStore RAID-O OS-buff + /
CoStore RAID-0 SyncDIsk ---ttt--- I!
CoStore RAID-0 AsyncDisk + ,-"

0.15 - f,

 

 

 

 

32K 64K 128K 256K 512K
File size (bytes)

(b) Large-size write

I l l l l l

 

50 J CIFS/Win2K ---o--- .r 1..
NFS/Linux --—-v--~ ,- .
CoStore RAID-0 OS-bulf + 1',"
CoStore RAID-O SyncDisk ---l--- ,-’,r
CoStore RAID-0 AsyncDisk + it";

\4
fﬁv

 

 

 

 

1M 2M 4M 8M 16M 32M 64M 128M
File size (bytes)

Figure 3-9 The write performance of CoStore

73

3.5.2 Write performance: CoStore vs. NFS vs. CIFS

Figure 3-9 compares a single-daemon RAID-0 CoStore’s write performance with those of
NFS and CIFS. In both small-size and large-size accesses, the performance of CIFS from
Windows 2000 lags behind others. In Figure 3-9(a), all CoStore cases outperform NFS. But
the result is susceptible to background noise due to very small latencies in tiny accesses.

In Figure 3-9(b), CoStore with asynchronous disk I/O (AsyncDisk) using only 32MB
buffer cache slightly outperforms NFS. By enabling AsyncDisk, we also disable the
standard block buffer in the operating system for our storage devices. CoStore with
synchronous disk I/O (SyncDisk) has a much worse performance, only a little better than
CIFS of Windows 2000. The reason is that with SyncDisk, the buffer cache at operating
system level for storage devices is forced to ﬂush every 3 seconds.

Interestingly, CoStore using only standard OS block buffer has an outstanding
performance in Figure 3-9(b). There are several reasons: Windows 2000 takes advantage of
the bulk of free memory as buffer cache (more than 64MB in our case); it has a much
longer ﬂushing threshold for dirty blocks; and it also does read-ahead caching. Because of
the RAID support and efﬁciency reason for network mirror or parity updates, we need to

maintain a separate block buffer cache and have a more conservative commit policy.

3. 5.3 Read perfonnance: CoStore vs. NFS vs. CIFS

Figure 3-10 shows the read performance of CoStore. It is hard to compare read
performance when access sizes are small in Figure 3-10(a). However, in Figure 3-10(b)
CoStore underperforms both NFS and CIFS by a large margin. CoStore with SyncDisk or

OS-buff has an almost identical performance. But CoStore with AsyncDisk is considerable

74

slower than SyncDisk or OS-buff. The reasons are two-folded. First the buffers in OS-buff
and SyncDisk can occupy almost all free memory. Second the operating system can do

optimizations with read-ahead.

75

Latency (seconds)

Latency (seconds)

0.14

0.12

20

10

1
32K 64K 128K 256K
File size (bytes)

(a) Small-size read

1 l l

 

- CIFS/WinZK --—e.--
NFSlIJnux wom—

CoStore RAID-0 OS-bufl +
CoStore RAID-O SyncDisk ---I---
- CoStore RAID-0 AsyncDisk +

 

 

 

 

(b) Large-size read

1 l l l 1

512K

 

 

 
 
 
 
 
 

ClFSMﬁnZK ---e---

NFS/Linux ---v~--

CoStore RAID-0 OS-buff ---+
CoStore RAID-O SyncDisk —--II---
CoStore RAID-0 AsyncDisk —-—

 

2M 4M 8M 16M 32M
File size (bytes)

Figure 3-10 The read performance of CoStore

76

 

 

1 28M

3. 5.4 Impact of distributed RAID and commit policy

In Figure 3-11 we evaluate the impact of distributed RAID and the effect of commit policy
on the performance of CoStore with only one client. Figure 3-11(a) illustrates the overhead
of synchronizing rrrirror or parity updates to network targets. We ﬁrst verify that the l-
daemon and 2-daemon RAID-0 CoStore clusters have the same performance when there is
one client. In the early tests we found that the 2-daemon CoStore cluster had longer write
latency because in the implementation the daemon called select() with 1 usec timeout to
check requests from peer daemons. With the timeout set to 0, there is no difference
between l-daemon and 2—daemon RAID-0 CoStore clusters for single client. We will
further demonstrate the scalability of RAID-0 in Figure 3-12(a).

The overhead of copying updates to network partners in RAID-1 is relatively small. For
example, to write a 32MB ﬁle in RAID-1 costs 11.125, or 1.255 more than 9.875 in RAID-
0. The extra mirror copying incurs about 12.66% overhead. To write the same ﬁle in
RAID-4 costs 11.80 second, or 1.935 more than in RAID-O. The extra parity update incurs
about 19.55% compared with non-redundant RAID-0. The reserved buffer for either mirror
or parity is 1MB. At the mirror and parity target daemon, dirty blocks are committed to
devices once a set of updates (up to 1MB) is received.

Figure 3-11(b) demonstrates the effect of commit policy on the latency in RAID-0
CoStore with l-dearnon and l-client. The Transaction Commit does have signiﬁcant
impact on the latency of each request. To write a 32MB ﬁle, the latency is 12.555 with
Transact Commit policy, compared to 9.875 with Lazy Commit policy, which times out

every 3 seconds. That is a slowdown of about 27.02%.

77

Latency (seconds)

Latency (seconds)

(a) the overhead of mirror and parity in CoStore with 1-client, large write

 

 

 

 

 

 

 

 

 

 

45 l l I l l l
I:
40 - CoStore RAID-O 1-daemon + fill
CoStore RAID-0 2-daemon ~-v~~ I;
CoStore RAID-1 1-daemon --+-- .' I’ r
35 .. CoStore RAID-4 2-daemon ---«X--- ’llr.
30 - .5," .-
.'. III 1’
25 /
ox. I’ll’l/
.o' /
20 - ,r / -
*1/21”
.. r”/
15 " o‘i’ll / _
,.,—53;/
10 ‘ ”5,99% '
'.4::-.’;"/
5 " .¢0;::’;‘/ "
-‘ ¢’O'c
0 1M" I , T r r
1M 2M 4M 8M 16M 32M 64M 128M
File size (bytes)
(b) the effect of commit policy on RAID-0
l l l l l l
50 - ,« _
.’
I"
CoStore RAID-0 Lazy Commit (3 sec) + .'
CoStore RAID-0 Transaction Commit ".3.-. j
.l
40 '1 l" t-
0’ I
30 - I" l-
I")I
20 - r .
I’d
fill" "4"; t-
I I
1M 2M 4M 8M 16M 32M 64M 128M
File size (bytes)

Figure 3-11 The impact of distributed RAID overhead
and commit policy

78

3. 5.5 Scalability of CoStore

Figure 3-12(a) illustrates a 2—daemon RAID-0 CoStore cluster is almost linearly scalable in
terms of the daemon number, when each of the two clients is only requesting service from
one of the two daemons in an ideal case. The two curves are approximately identical except
at the size of 128MB, where the average latency of two clients is a little less than that of
single client, possibly a result of noise due to system background activities. However,
RAID-4 CoStore clusters are less scalable than RAID-0. Figure 3-12(b) illustrates a RAID-
4 CoStore cluster (two daemons plus parity daemon). For two clients to write a 32MB ﬁle
concurrently, the average latency is about 12.775, while it takes 11.805 if there is only one
client. The slowdown is about 8.2% for an extra client-daemon pair in ideal cases. Please
note in Figure 3-12(b) both of the two clients are Pentium III 550MHz PCs; but the two

members in the cluster are Pentium III lGHz PCs.

79

Latency (seconds)

Latency (seconds)

(a) the scalability of CoStore RAID-O

l 1 l l l

 

4o

351

 

CoStore RAID-0 2-daemon 1-client ..4...
CoStore RAID-0 2-daemon 2-client ——x—

 

l

 

 

 

 

 

 

 

 

1 M 2M 4M 8M 16M 32M 64M 128M
File size (bytes)
(b) the scalability of CoStore RAID-4
so 1 I l 1 P l
I
45 - CoStore RAID-4 2-daemon 1-client ---t--- -
CoStore RAID-4 2-daemon 2-client +
40 ~
35 -
30 -r
25 -
20 -
15 a
10 -
5 d
0 T r l I l l
1M 2M 4M 8M 16M 32M 64M 128M
File size (bytes)

Figure 3-12 The scalability of CoStore clusters

80

3. 5.6 Parity daemon ’s bottleneck effect in RAID-4

To evaluate the parity daemon’s bottleneck effect in RAID-4, we also experiment with 4
clients on a CoStore cluster of 5 daemons (one is parity). Even though each client is
requesting on one daemon, the average access latency increases gradually as the number of
concurrent clients grows from 1 to 3 (Figure 3-13). It seems that the parity node reaches the
saturation point where the average latency jumps sharply when there are 4 clients. As the
group size and the number of concurrent clients increase, the parity node in RAlD-4 may
quickly becomes a potential bottleneck due to limited I/O bandwidth and extensive XOR
computation on that node. The parity bottleneck should become less a problem in CoStore
when the more balanced RAID-5 support is implemented. Please note in Figure 3-13 all the

clients and servers are Pentium IH 1.2GHz PCs.

Bottleneck effect in RAID 4 (Lazy-Wt)

l l l

 

40 a /” L
1-client —r— '
2-client --.,..-- /

35 1 3-client ...G-.. ,2" __
4-client «Mr-- ,4“

xii/f}
30 - "" h
."/
25 ‘1 l-

Latency (seconds)

 

 

 

 

1 M 2M 4M 8M 16M
File size (bytes)

Figure 3-13 Parity daemon’s bottleneck effect in
CoStore with RAID level 4

81

3.6 Summary

The N AS approach has been proven to be effective in constructing scalable storage server
clusters using network attached storage devices. In CoStore the consistency of a uniﬁed ﬁle
namespace is collaboratively maintained by all participating cluster members without any
central ﬁle manager. The enabling factor for CoStore’s efﬁciency and scalability is the
ﬂexible Data Anywhere, Metadata at Fixed Locations ﬁle system layout. The serverless
design eliminates any central server bottleneck and provides strong scalability.

The CoStore prototype using COTS-based components demonstrated feasibility of
building such scalable storage clusters. The performance results measured on the prototype
illustrate the potential of CoStore to achieve scalable high-perfomrance high-capacity

storage services with strong reliability and availability.

82

Chapter 4

CoStore Clusters Utilizing Idle Disk Space on Workstations

4.1 Introduction

In recent years, the computer industry has made signiﬁcant advances in magnetic recording
technology. At the meantime the economy of volume of storage has improved enormously
from the mass production of hard disk drives. Consequently, disk drives are getting higher
in capacity, smaller in size and cheaper in cost—a trend that is expected to continue [Ng
1998]. The standard size for disk drives on mainstream computers in the market is about
20-30GB as of March 2001 and is growing continuously over the time.

To an end user, there are two different kinds of storage services provided in a well-
organized computing system: local disk space on the client desktop machines and remote
space on network storage servers. Most users prefer the network storage for various reasons
including:

1) Mobility. The users are more on the go and they prefer to be able to access their
data through consistent interface from any place they are in. Most client machines
are behind ﬁrewall and/or may not be able to easily provide local resources to other
systems.

2) Quality of Service. Normally the network space is stored on high performance
highly available storage systems, with built-in redundancy to counter most
hardware failures. A well-maintained storage system also has regularly scheduled

backups so that information can survive catastrophic accidents or even natural

83

disasters. On the other hand, most client machines use commodity disk drives and
normal do not have local backups.

3) Security assurance. System security is much easier to maintain on a centralized
storage system managed by professional administrators than a group of systems
managed by individual owners. Well-organized systems provide peace of mind for
information security.

As a result, in many organizations most of the local disk space on client workstations is
only used for operating systems, application programs and temporary ﬁles, which in total
take up only 2 to 5GB disk space. Douceur and Bolosky measured and analyzed a large set
of client machines in a large commercial environment with more than 60,000 desktop
person computers [Bolosky 2000; Douceur 1999]. The measurement includes disk usage
and content; ﬁle activity; and machine uptimes, lifetimes and loads. The result shows that
about 53% and 50% of the overall disk space of studied environment is in use in September
1998 and August 1999, respectively. The disparity of space utilization ratios on storage
servers and local machines is expected to deteriorate further over the time as the average
disk size grows rapidly.

Motivated by the increasingly pervasive resource wasting, the CoStore cluster
architecture was originally proposed to construct a storage system utilizing idle disk space
on workstation clusters. In this chapter we evaluate the client-computing environment and
assess the feasibility of deploying CoStore cluster on existing computing infrastructure.

It is worth to point out that in typical ofﬁce environment, dedicated system

administrators loosely manage most desktop machines. The term of being supported by

84

technical staff may better describe the relation. It is reasonable to assume that each of seat
owners has almost full control of the local resource on his or her desktop machine(s).

There is another common model: centralized system management. In this model, end
users are not tied to desktop machines. Instead, they go to the public lab and use whatever
system available. The system administration team supervises the whole infrastructure. This
model is very popular in universities, public libraries, and technical training centers in large
corporate or government organizations. Even in the non-centralized environment, many of
the non-technical personnel, for obvious reasons, choose to totally rely on technical support
to take care of their systems. This study mainly focuses on the centralized management
environment, even though it can also apply to the non-centralized model.

More research work is warranted by the fact that the cost for advanced storage servers
continues to be expensive and that there is growing prevalence of idling local disk. Limited
work has been done so far, partially because of the complexity of the diverse environment
and the administrative overhead involved. To our best knowledge, previously only

Microsoft research project Farsite [Microsoft 2000] tries to solve the same problem.

4.2 Assumptions and Environment Description

The subjects in this study are front-end desktop workstations, compared with back-end
time-sharing servers. A typical example is client workstations in a public lab or an
engineering lab environment. Computing seats in such environments are on a ﬁrst-come,
ﬁrst-serve basis and the local disks are primarily used to store ﬁles for operating system

and application programs.

85

We assume that in such environments there exists a central administration with login
authentication servers and storage servers with a uniﬁed namespace. All these servers and
client workstations are connected in a secure local area network behind a ﬁrewall.
Therefore, the servers can trust the operating systems on these workstations. This study
mainly focuses on the centralized model, even though it can also be applied to the non-
centralized one.

The foremost characteristic of such environments is heterogeneity in terms of hardware
platforms and operating systems. Generally these workstations are well equipped with fast
processors, large amount of memory, and high bandwidth local I/O and network interfaces.
Another noticeable fact is that these workstations are susceptible to occasional unexpected
reboots, due to software failure, user choice, or system-sharing policy.

The primary objective of this study is to transform the idle disk space into usable
storage service with satisfactory reliability and efﬁciency. A good solution should require
little administrative effort; otherwise, the prohibitive human cost may overshadow the gain
from recovered resources. An ideal solution should provide additional storage service as a
seamless part of the current storage infrastructure. CoStore is not intended to replace main
storage servers. Instead CoStore attempts to provide extra storage space supplementary to
main storage servers with little or no further investment.

Potentially the recovered storage space can be used for archiving purposes for both
end-users and system administrators, such as large multimedia data or system snapshots
ﬁles for online backup. Other possible usages include web page caching, website
replication, or data buffering for search engines. Most of them are not frequently updated

and some of them require little or no redundancy at all.

86

4.3 Alternative Solutions

There are different approaches to solve this problem. We can use existing software installed
on these workstations to manually setup and combine individual ﬁle system resource into
one large ﬁle system. This is called ad hoc solutions. We can also adopt the concept of
virtual disks in Petal [Lee 1996] by using Network Block Device [Breuer 2000] driver
available in Linux kernel (2.1.101 or later). The peer-to-peer movement has emerged as an

interesting approach to solve the idle space problem in very large scale.

4.3. 1 Ad hoc solutions

It is possible to solve the idling disk space problem using existing software installed on
current systems. Speciﬁcally, on UNIX workstations disk space can be shared via NFS or
Samba service; on Windows machines disk space can be shared through CIFS protocol
[Leach 1997]. On UNIX platform, automount can be used to construct one unique ﬁle
system namespace. The Distributed File System (DFS) service on Windows 2000 Server
has the same function to coalesce multiple ﬁle resources on Windows 2000 Server or
Samba (2.2.0 or later) into one single namespace. However, this is a manual process and
may be cost-prohibitive because of the extensive human work involved in the setup. There
is no data fault-tolerance in this ad hoc solution except the fact that Microsoft DFS service

can provide ﬁle-based replication.

4.3.2 NBD based solutions
The distributed RAID approach used in xFS [Anderson 1995b] and Petal [Lee 1996] can be

applied to build a virtual disk with block interface. On top the virtual storage devices,

87

higher level ﬁle system and distributed ﬁle system can be deployed, like the metadata
manager in xFS and Frangipani [Thekkath 1997]. The result is a centralized reliable
storage service that can easily be integrated into the existing storage infrastructure.

Using the same concept, we can build one or many virtual disks using the Network
Block Device (NBD) driver on Linux. The Network Block Device driver simulates a block
device, such as a hard disk or hard-disk partition, on the local host, but connects across the
network to a remote host that provides the real physical backing [Breuer 2000]. Locally the
device looks like a disk partition, but it is a facade for the remote. The remote host is a
lightweight piece of daemon code providing the real access to the remote device and does
not need to be running under Linux. The local operating system will be Linux and must
support the Linux kernel NBD driver and a local client daemon. NBD setups can be used to
transport physical devices virtually anywhere in the world. To introduce redundancy, we
can build a software RAID out of multiple NBD devices. On top of block devices from
simple NBD or software RAID on t0p of multiple NBD devices, any local ﬁle system can
be chosen and so can any distributed ﬁle system available on Linux.

However, the relatively independent relation between RAID module and NBD module
prevents efﬁcient error handling when there is any network ﬂuctuation to the TCP
connections between the Linux server and remote hosts serving the physical block
resources. This single server might be a potential bottleneck and it can combine limited
number of NBD devices. To manage a local ﬁle system on top of network connections
without special caching can be an issue in term of both efﬁciency and reliability. For

obvious reasons, these remote hosts can do much more than only serving block resources.

88

4.3.3 Peer-to-peer solutions

One of the Intemet’s recent phenomena is the introduction of peer-to-peer (P2P)
computing. The current P2P movement was started by a simple motivation for many
pe0ple: the need to exchange music ﬁles. There have been several prominent peer-to-peer
systems offering ﬁle-sharing services, such as Napster [Napster] and Gnutella [Gnutella].
The peer-to-peer movement has grown to more than ﬁle-sharing. This peer-to-peer
phenomenon has reached to many areas, including distributed computing and distributed
storage, to take advantage of the abundant processing power and disk resources widely
available on millions of PCs.

There have been several peer-to-peer approach based storage systems: Farsite [Douceur
2001], PAST [Druschel 2001], and CPS/Chord [Dabek 2001; Stoica 2001] compared to the
so-called server-to-server [Yianilos 2001] based systems such as OceanStore [Kubiatowicz;
Rhea]. Many contemporary distributed ﬁle systems [Callaghan 1995; Leach 1997] are
based on client-server model.

A P2P network distributes information among all member nodes instead of
concentrating it at single server [Parameswaran 2001]. Using a P2P approach, Douceur et
al. has proposed Farsite, a serverless distributed ﬁle system to solve this speciﬁc problem
[Douceur 2001]. From the measurement of machine availability including uptime and
lifetime, [Bolosky 2000] concluded that the measured desktop infrastructure would
passably support their proposed system Farsite [Douceur 2001], providing availability on
the order of one unﬁlled ﬁle request per user per thousand days.

The lack of security and central authority in P2P solutions makes such solutions less

ideal in public labs environments that we are trying to solve. The efﬁciency of P2P

89

solutions remains to be evaluated. P2P is best for information sharing in large scale and its

candidacy in storage is still an open question.

4.3 Feasibility Assessment of Deployment on Existing Desktop Computlng

Infrastructure

4.3.1 Reliability theory in RAID

1 2 3 G
- - - -
u . . ' u

((3.1); disk failure rate: .1

GA
: m : m : disk repair ratezp
K/
a

Figure 4-1 Queuing theory in the reliability of disk
arrays.

 

 

 

 

Patterson et al. analyzed the reliability of disk arrays using queuing theory [Patterson
1988]. Assuming a constant disk failure rate — that is, an exponentially distributed time to
failure — and that failures are independent — both assumptions are made by disk
manufactures when calculating the Mean Time To Failures (MTTFdisk) or Mean Time
Between Failures (MTBF). In Figure 4-1, disk failure rate 11. is 1/MTTFdisk; and disk repair
rate 11 is 1/M77‘Rd,sk. Without any parity, a disk array’s reliability is:

M "F M
number of Disks in the Array '

 

MTTFdisk array =

To overcome the reliability challenge, extra disks are introduced to provide redundant

information to recover the original information when a disk fails. Disk arrays are broken

90

into reliability groups, with each group having extra check disks containing the redundant
information. When a disk fails, it is assumed that within a short time the failed disk will be
replaced and information will be reconstructed on to the new disk using the redundant
information. This time is called the Mean Time to Repair (MTTRdisk). MYTRdjsk is far
smaller than MTI‘Fdisk.

A few notations for the following discussion: G: number of data disks in a group; C:
number of check disks in a group, C is l for RAID level 4/5. There will be data loss when a
second disk fails before the ﬁrst failed disk is repaired within MTI‘Rdisk. The probability
Pam loss for this to happen is:

Pdata loss = [1_ (e_MTTRdiSk [andisk )(G+C—l) ]

Since MTTRdisk << MTI‘Fdisk / (G+C) and (l-e'x ) is approximately X when 0 <X<<1,
therefore:
PW 1053 = MTT R*(G+C-1)/MTI’Fdisk;
and MTI‘Fgmup = Expected[time between failures] * Expected[number of disk failures until
data loss]

_ Expected[time between failures]

 

P data loss

= MTTFdiSk * l
G+C MNRdisk *(G+C-1)/Manisk)

 

= (MTTFdisk )2
(G+C)*(G+ C—1)*M71Rd,_,k'

 

Similarly, the proposed architecture reliability depends on the reliability of each
individual system whose idling disk space is utilized. Likewise we deﬁne the following

notations. For disks the Mean Time To Failure is denoted as M ITEM and the Mean Time

91

to Repair as MYTRdisk. MTTRdisk is far smaller than MTTFdisk. For disk arrays the Mean

(M mm )2
(G+C)*(G+C-l)*MTTRd,-sk

 

Time To Failure is M TTF gmp. Therefore, MTI' F group is

With the latest disk manufacturing technologies, modern disk drives on client machines
are very reliable. The main failing factor in this study would be the system availability of
desktop workstations. Due to failures in operating system, or user's choice to reboot the
system for whatever reason, the mean uptime of client machine is very small compared to
MTT F disk. Thus, we consider the MIT F415,, of individual disk drive inﬁnitely large.

However, when the system reboots, it normally can restore to working condition in 2 or
3 minutes, including grace time for all services to get started. This period of time for
system to reboot is very small compared to the mean system uptime. We deﬁne M77 F m as
the mean system uptime; and MTI‘Rw, as the mean time that each system reboot needs to
restore to fully working status. MTTRm is far smaller than MTT F m.

When the number of concurrent rebooting workstations exceeds C, the service of this
cluster will be interrupted, even though this does not render permanent data-loss as it would
when too many physical disks failed in a RAID group. To evaluate the service interruption

rate, we apply the same formula and thus the MTTde, for a cluster consisting of desktop

(MTI‘FWS )2
(G+C)*(G+C—1)*MT1‘RWS

 

workstations is . For RAID 0+1, data are striped across on

G mirrored workstation pairs without check data. For RAID 4+1, data are striped across on

C mirrored workstation pairs with check data on another mirrored pair. As MTI‘FRAIDI is

 

2 2

, th M77 °
2*MT1‘Rm (G+1)*G*MT1’RW, creme FR’W’” ’8

92

 

 

 

 

2 . MTI‘F 2
—-——MnFMDl = (MTTFM ) , and M TTF RA [04+ 1 IS ( jRAIDI )
G 2*G*MITRM (G+1)*G*MTI‘RW,
(MTI‘FW, )4
4*(G+1)*G*MITR3,,°
Reliability of various RAID levels
164-14 1 1 I l l l l 1 l

te+12 -

 
   

 

   

 

 

 

le+10 - RAID 4/5+1: G=8,C=10 —a1— -
RAID 1: G , 1 --+--
RAID 0+128 ? -o--—-
E 16+08 ‘ G 0 "
8
3: 1e+06 ~ ~
3
‘55
3 — — — — _ : 7 7 : v
75 10000 a vvvvvvvvvv r
100 .. r-
1 1 _' i-
1. + ++~+ +~+-+-+-+ +-+-+-+-+-+.++-+ +~+-+-+--+- +~++-+-+-+-++-++-¢
'1 ‘+'+- - I-
001 "+'**
‘i’
00001 I W I I I I I I I
20 40 60 80 100 120 140 160 180
MTI'Fws (days)

Figure 4-2 The relationship between MiTqusm, and
M'ITFW3

To illustrate the relationship between MTTFm and MTI'qusm with variable RAID
conﬁgurations and different group sizes, we plot the above formulas in Figure 4-2. The
mean reboot time MTTRWS is 3 minutes. On the x axle, the mean system uptime MTT F w,
ranges from 1 day to 180 days. Typical system uptime MTTFw, measured in real

environment goes from 5 to 20 and 40 days.

93

4.3.2 Feasibility assessment

To assess the feasibility of constructing a CoStore cluster on an existing desktop computing
infrastructure, we collected system uptime data from public labs in Computer Science and
Engineering Department at Michigan State University. There are seven instructional labs
(18 to 20 seats each): four Sun UltraSPARC labs and three PC labs. The system activity
monitor has recorded all rebooting events into a database. All the machines have been
monitored during the school year of 2000-2001, except that the monitoring of three PC labs
was started in January 2001. Table 4-2 shows the mean system uptime M'I'I'Fws and the
measured reboot time M'I'I‘Rws for each of the labs.

Assume that we construct 14 independent clusters from workstations in 7 labs.
Normally the lifetime for desktop workstations is 3 years. Therefore, within three years we
accept one occurrence of service interruption. In other words the number of concurrent
rebooting workstations exceeds the number of workstations for check data in one cluster.
Statistically the MTTde, is equivalent of 3*365*14 or 15330 days. With this number in

mind as a hypothetical criterion, we evaluate different RAID conﬁgurations in Table 4-2.

94

Table 4-1 MTTFWs and MTTRWS based on system

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

uptime data
Start End Average Average Reboot

Lab Architecture OS Date Date Reboot uptime time

(midnight) (midnight) numbers (days) (sec.)
Aleutians Spachltral Solaris 2.8 28AUGOO DSMAYOI 1 5.23529 19.02635 120
Bones SpachltraIO Solaris 2.8 28AUGOO OSMAYOI 10.57895 2575577 120
IIndonesians Spachltral Solaris 2.8 28AUGOO OSMAYOI 11.33235 2324329 120
Sounds SpachltraIO Solaris 2.8 28AUGOO OSMAYOI 1 5.00000 18.21376 120
Mountains lP-II/333 GNU/Linux 2.2 OlJANOI OSMAYOI 294737 4222310 150
iRivers P-lI/400 Win2000 Pro. 01JAN01 OSMAYOl 25.53333 5.461 001 180
Snakes 2xP-III/600 Win2000 Pro. DIJANOI OSMAYOI 31 .35714 4.925493 150

 

 

 

Table 4-2 shows the MIT F clam, value for each cluster with different RAID

conﬁgurations based on the MTI‘ F ,9, data from Table 1. From Table 2 we can determine
that, using RAID level 4/5 or 0+1, the last two labs cannot construct a cluster with
satisfactory reliability according to our previous criterion. The reason is that they both are
too unstable or reboot too often (every 5 days). With RAID level 4/5, only three of the ﬁrst
ﬁve more stable labs can construct a cluster with RAID 4/5 when group size is small
(G=4). When using RAID 0+1, almost all the ﬁrst ﬁve labs can construct reliable clusters
with group size up to 8. When RAID 4/5+1 is used, even with group size 8 all the seven
labs can make up clusters with exceptional reliability. The reason is that RAID 4/5+1 is an
extremely reliable conﬁguration and the check disks are more than data disks. How much
impact the reliability of RAID level 4/5+1 will have on the performance remains to be

evaluated in our further study.

95

Table 4-2 The reliability (MTTqusmr) of clusters

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MT“: MTTF MTTF M'ITF MTTF MTTF MTTF
Lab MTTF MTI'R RAID4/S W15 W15 1mm RAIDOH RAIDO-rl RAM/5+1
m m G=4 G=8 G=16 G=l G=4 G=8 G=8

(days) (sec.) (daYS) (daYS) (daYS) (dayS) (daYS) (daYS) (daYS)
Aleutians 19.0 120 13032.1 3620.0 958.2 1303210 32580.2 16290.1 17064-1 1
Bones 25.7 120 23882.8 6634.1 1756.1 2388280 59707.0 29853.5 5.70844 1
Indonesians 23.2 120 19457.4 5404.8 1430.7 1945740 48643.5 24321.7 3.79e+1 1
Sounds 18.2 120 1 1942.7 3317.4 878.1 1 19427.0 29856.7 14928.3 1.438-1-1 1
Mountains 42.2 150 51356.5 14265.7 3776.2 5 1 3565 .0 1283910 64195.6 2.1 le+12
'vers 5 .4 180 715.7 198.8 52.6 7157.4 1789.4 894.7 3.426408]
Snakes 4.9 150 698.7 194.1 51.4 6987.0 1746.8 873.41 3.916408l

 

Table 4-2 conﬁrms that theoretically it is feasible to deploy the CoStore architecture on

existing desktop workstation clusters. With extra handling of the rebooting effect in

daemon’s implementation, the reliability of resultant clusters can be further improved.

4.4 Summary

Using an NAS approach, CoStore is a novel solution to build reliable storage service with

efﬁciency, utilizing idle disk space on desktop workstations. The enabling factor for system

efﬁciency is the ﬂexible Data Anywhere, Metadata at Fixed Locations ﬁle system layout.

The serverless design eliminates any central server bottleneck and can provide improved

scalability. System uptime data collected from production systems conﬁrmed the viability

of constructing storage cluster systems utilizing the idle disk space on desktop computing

infrastructure.

96

Chapter 5

Reliable and Highly Available Storage Systems
Using CoStore Clusters

5.1 Introduction

According to Todd Gordon of IBM, 70,000 workstations and 10 million square feet of
rentable space were lost in the devastation that occurred when the World Trade Center
collapsed in the attacks of September 11 [Hoard 2001]. However, no signiﬁcant amounts of
data were destroyed because these (mostly ﬁnancial services) companies had the foresight
to maintain backups remotely [Hoard 2001].

In the aftermath of this tragedy, disaster recovery services used storage networks to
restore thousands of terabytes of business data and get hundreds of companies running
[Noblett 2001]. In the wake of this previously unimaginable disaster, storage providers like
StorageTek and EMC experienced increased customer interest in storage networks that
include complex business continuance capabilities. One service provider claimed that the
request for business continuance services increased 1,000% since September 11 [Noblett
2001].

In a world with anything but certainty, storage system administrators must prepare for
any catastrophe, imaginable or not. For disaster recovery, the most prominent approaches

have been tape backup, redundancy and snapshot technologies.

5.2 Backup and Archiving Techniques for Recovery
The purpose of backing up data is to protect from data loss or corruption due to user errors

or hardware failures. Archiving data is a process to create a complete, self-consistent

97

replica of a collection of data, typically a directory hierarchy, appropriate for bringing back
online in the future. Therefore backups protect from errors and archives allow business to
stop and resume work (possibly at different locations) [Brown 2001].

Restoration from backup most commonly entails the recovery of a single ﬁle or
directory contained within the backup (complete ﬁle system restores are rarely necessary)
[Brown 2001]. In contrast, archives are usually restored in their entirety. Disaster recovery
is a closely related subject that shares characteristics with both backup and archiving
[Brown 2001]. Like backup, the intent is to protect from errors, but disaster recovery
concentrates only on catastrophic failures. Where backup is concerned with users
accidentally destroying ﬁles or individual hardware components failing, disaster recovery
addresses natural disasters wiping out an entire building or geography. This scale of
concern makes disaster recovery similar to archiving, in that the data saved must be self-
consistent and restorable in its entirety to allow work to continue at the recovery site
[Brown 2001].

Backups allow individual ﬁles or entire ﬁle systems to be restored after a hardware
failure or user error. Archives allow entire collections of data to be resumed after a disaster.
Disaster recovery allows businesses to resume operations at a remote site after a major
disaster. There is a hierarchical relationship: copies of backup tapes may be sufﬁcient to
archive a project, and sufﬁcient (offsite) archive tapes may be enough to recover from a

disaster [Brown 2001].

98

5.2.1 Tape backups

Traditionally system and user ﬁles have been backed up onto tape cartridges and preferably
the tapes should be stored offsite. However, in today's multi-terabyte environments, tape
backup has become more and more impractical for organizations that generate enormous
amount of data daily. The primary issue for backup and archiving is speed. Even with the
latest technology, it will take more than tens of hours to backup or restore one terabyte of
data. The ultimate limitation for backup speed is the backup media itself: modern tape
drives are capable of only a few megabytes per second [Brown 2001]. Software and system
I/O bandwidth typical limit performance even further.

More problematic than the speed issue is the issue of data integrity. Because it takes a
ﬁnite amount of time to write data to tape, there is always the risk that a ﬁle being backed
up while the tape is being written. If this occurs, there is no guarantee that the data on tape
can even be restored meaningfully, regardless of tape speed [Brown 2001].

Modern commercial backup software packages provide a variety of mechanisms to
perform safe live ﬁle system backups. It remains a fact, however, that a signiﬁcant portion
of the worlds IT assets are written to tape on the assumption that no application is writing
to the dataset while it is being backed up [Brown 2001]. NetApp's Snapshot technology
provides system administrators with a powerful tool set that virtually eliminates these
problems.

With the per megabyte price of hard drives continuing to drop and its superior
performance advantage over tape backups, it has been suggested to use hard drives for
backup. Unfortunately hard drives are not designed to be easily plugged or unplugged. We

have not seen many backup systems taking advantage of the inexpensive high-capacity disk

99

drives. The tape backup remains to be the most widely used method for system archiving
purposes. Many sites are investigating dumping to disk devices (or disk arrays) as

temporary staging areas prior to dumping to tapes [Brown 2001].

5.2.2 Snapshots

Snapshots are a series of images of the data on a ﬁle system that is stored onto a protected
area of the same system [Hitz 1994]. Although a few approaches to snapshots have been
proposed, it appears that the one developed by Columbia Data Systems may be the most
successful [Sander 2001b]. This technology has been included in systems offered by IBM,
and has been integrated under license by Microsoft in its Server Appliance Kit (SAK). For
example, Maxtor’s MaxAttach 4300 is based on Windows 2000 with SAK extensions.

The other popular snapshot is the on Network Appliance’s ﬁler products based on the
Write Anywhere Filesystem Layout (W AFL). The WAFL ﬁle system can freeze frame
itself at any point in time, and make the frozen versions of the ﬁle system available via
special subdirectories that appear in the current (active) ﬁle system [Hitz 1994]. Each
freeze from version of the ﬁle system is called a snapshot. Snapshots usually require only a
tiny disk space premium to maintain and only start to consume disk space as the ﬁle system
changes after a snapshot is created.

This facility allows users to recover accidentally damaged or deleted data. Rather than
requesting an administrator to locate and mount a backup tape containing the user's ﬁle,
and waiting for the tape to seek and load the ﬁle in question, the user can simply copy from

a hidden snapshot directory a recent version of the ﬁle (prior to the accident). If conﬁgured

100

and scheduled to take snapshots regularly, it also enables users to retrieve old versions at
particular date.

Because snapshots provide a consistent, read-only view of the ﬁle system, they provide
an elegant solution to the problem of live ﬁle system backups. Prior to beginning the
backup, a snapshot is taken (a process of a second or two). The backup may then be taken
from the highest level Snapshot directory. As long as backups read from a snapshot,
changes that occur to the ﬁle system while the backup is running will not affect what gets
written to tape. Users always see an active ﬁle system, fully read-write at all times, the
backup device sees a stable, read-only Snapshot of the ﬁle system taken an instant before
the backup began. This ability to create guaranteed consistent, restorable archives at any
time (especially during working hours) is a very valuable feature.

Snapshots provide a particularly convenient method for backing up relational data ﬁles.
Traditionally, the only way to guarantee a consistent state for such databases prior to
backup is to shutdown the application that controlled the database. Backups consist of
shutting down the application, performing the backup, then restarting the application. The
downtime is completely dependent on the speed of the backup, typically several minutes to
several hours.

With snapshots, the downtime can be reduced to a few seconds, the time it takes to
create a snapshot. The sequence becomes: shutdown the application, create a snapshot,
restart the application, then dump to the backup media from the top level Snapshot
directory. Such backup/archive images are guaranteed to be consistent and immediately

usable to the end application. Equally importantly, restorable snapshot images of such

101

databases can be left online for signiﬁcant periods of time. If a database ever experiences
some form of corruption, the time to recover can be dramatically reduced.

The snapshot technology is particularly useful for user data backup. However, snapshot
alone does not improve the reliability of physical storage systems. If the snapshot image
resides on a server, the server or drive fails, recovery from a snapshot is impossible,
because the device that holds the snapshot is not available. Therefore, the concept of
mirroring an entire data volume over a network is a more reliable approach to overcome

disasters.

5.2.2 Redundancy

RAID (Redundant Array of Independent Disks) [Patterson 1988] is one of the most
important and most successful inventions in storage technology. The success of redundancy
in system area networks (SAN s) has been extended to wide area networks (W ANS). Data
centers are thriving to provide storage-mirroring services between remote sites hundreds or
thousands of miles away. Storage systems can be mirrored using point-to-point ﬁbre
connections or through WAN s or Virtual Private Networks (VPNs) connections. With the
imminence of multiple gigabits Ethernet on the horizon, the concept of data redundancy
between geographically separated locations is gaining more popularity in IT market.

In this study we will construct a reliable and highly available storage system using
CoStore clusters. To the best of our knowledge there is little literature available discussing
the performance of mirroring storage between remote sites. There are several products from
major storage vendors, for example, the Symmetrix Remote Data Facility (SRDF) from

EMC [EMC 2001], and the SnapMirror from Network Appliance [Brown 2001]. Both of

102

them are mirroring between point-to-point storage volumes. A CoStore cluster with

mirroring can consist of one pair or multiple pairs.

5.3 Reliable and Highly Available Storage Systems Using CoStore Clusters

In information technology, high availability refers to a system or component that is
continuously operational for a desirably long length of time. Reliability is an attribute of
any computer-related component that consistently performs according to its speciﬁcations.
The RAID approach has been ubiquitously utilized to improve reliability and high
availability in storage systems. The success of RAID is mainly driven by its data striping
technique, which achieves high performance through the parallelism of multiple disks and
fault tolerance by the extra check data parity.

The redundancy in RAID arrays has enabled to overcome component (disk) failures. In
the event of a disk failure, a RAID array can still serve in degraded mode while automatic
reconstruction takes place on a hot-spare disk in background. Level 1 (mirror) and level 5,
along with variations (such as 0+1) are the most commonly used in RAID conﬁgurations.
To add more availability, a pair of clustered servers can be connected to the same disk
array as failover partners. They can be conﬁgured as master/slave, or master/master each
sharing half of the service load. In the event of one server failure, the other one will take

over the whole responsibility. This is the traditional cluster of two standby failover hosts.

103

 

read/Wigs clients

 

4 A

s s s s > P stomgel
l 1 1 1 T i cl
(1::112twork connections?

I 1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

r ‘6' Y
M M . M M M/P mirro]
cluster

 

 

 

 

 

 

 

 

I
readonly clients

 

 

 

Figure 5-1 A CoStore cluster with RAID level 4+1

In this study we use cluster in a broader sense. A CoStore cluster can consist of a single
computer or a group of computers, each equipped with its own block devices. The same
RAID semantics of various levels is maintained in a distributed manner. Borrowing the
concept from RAID variations levels (0+1, 4+1 and 5+1), we introduce a second level
redundancy by mirroring the original storage cluster. Figure 5-1 shows one example cluster
with redundancy equivalent to RAID level 4+1. The main cluster is called the storage
cluster, which handles read-write client requests. The mirror cluster stores a mirror copy of
the storage cluster and can handle requests from read-only clients. Similar to the roles of
disks in RAID arrays, daemons can assume different roles in CoStore clusters: storage
daemons (S), parity daemons (P), mirror daemons (M), and possibly hot-spare daemons

(H).

5. 3. 1 Benefits and implications
Just as read performance is doubled in RAID 1 disk arrays, the mirroring of clusters can
also improve performance in addition to system availability. In a LAN environment,

members in a mirror cluster can be conﬁgured as standby failover partners and at the same

104

time they can also serve read requests. When a member in storage cluster has failed, the
corresponding member in mirror cluster can upgrade itself into storage cluster and take the
place of failed member to avoid time- and bandwidth-consuming reconstruction. Without
expensive reconstruction, storage cluster can also avoid switching into the degraded mode,
which has a seriously reduced performance. Therefore this second level redundancy
improves availability as well as read performance.

Following the trend of extending RAID's success from SANS to WANs, we move the
mirror cluster away from the storage cluster with a range of distances: to different rooms, to
different buildings in the same campus, or to campuses in different geographical regions.
Mirroring between remote sites is particularly beneﬁcial and desirable in certain situation.
For example, in a business environment, some information needs to be deployed at remote
sites for read-only applications. In this way, a full-time online almost up-to—date ﬁle system
is being replicated. The distribution of data is very efﬁcient because requests from remote
clients are served by the local mirror cluster at remote sites.

Another beneﬁt of mirroring clusters between remote sites is improved disaster
recoverability. Disasters have various scopes in terms of geographical sizes, ranging from
rooms, ﬂoors, buildings, campuses, cities, and to regions. The longer the distance between
remote sites, the more likely at least one of the mirrored clusters is able to survive a major
catastrophe. The success of data centers should mainly attribute to their replication services
between geographical regions so as to provide maximum survivability to customers'
storage systems.

Compared with a traditional point-to-point clustered pair, a multi-entity CoStore cluster

with mirroring has several advantages: larger volume size, higher aggregated network

105

 

bandwidth and performance. Due to the number of active members serving in a cluster,
there are multiple members in a cluster each with local disk resources and there exist
multiple network connections between storage and mirror clusters. In this study we will
evaluate the impact of network bandwidth and latency on the performance of CoStore
cluster with mirroring.

Using queuing theory the theoretical reliability of RAID arrays could be quantiﬁed
[Patterson 1988]. As we have analyzed in Chapter 4, the reliability of some of the RAID
variations with two levels of redundancy, level 4+1 or 5+1 in particular, can provide
extremely high reliability, compared to traditional RAID levels. However, this high
reliability comes at a price in increasing overhead, i.e. more disk space used for mirror
copies; and reduced performance because mirror data have to be copied. In fact, the space
overhead in RAID 4+1/5+1 is so high that in extreme cases only 1/3 of the total disk space
is usable and that the redundancy overhead is always more than 50%.

Transparent to upper layers, data redundancy management is actually implemented in a
standalone block buffer cache subsystem, which provides a block interface to local ﬁle
subsystem. With a standalone buffer cache system, multiple tiny updates can be bundled
together and sent in relatively fewer network operations and hence improved overall

performance.

106

,.. .1!

. . bio-ck'buffercache module 1

 

 

lru_list (clean)

-r:-1:}-:1 , 'e

f wnte_ blocks :'

/ T

8” read_ sectors \\\ I mirror updates : ‘0 i

{W \\ 5. [21—0—0— . networks

:83/ b.—ock / parity updates ‘1 .
.evice \\ m L

write_ TSOC“) TS
in

 

 

 

 

write_ b/locksl

 

 

 

 

 

 

Figure 5-2 RAID buffers and block buffer cache in
CoStore

5.4 Experimental Setup

The tests are conducted on a cluster of PCs, each equipped with 1.26Hz Pentium III
processor, 256MB SDRAM memory, 100Mbps network interface, and 20GB hard disks
(ATA100/5400RPM/2MB cache). All of them are running Windows 2000 Professional.
Locally all PCs are connected by 100Mbps Fast Ethernet switches.

To measure the impact of mirroring distance on the performance of CoStore, we
conduct the experiments such that the mirror cluster is located in a series of different
network environments. These environments include A) a single switched Ethernet LAN; B)
two LANs connected by 100Mbps router/switches within one department; C) two LANs
connected by a Gigabit building router; D) two LAN s connected by multiple Gigabit
routers on a Gigabit Ethernet campus backbone ring; E) two LANs in two universities
connected by the lntemet with a physical distance of about 140km; F) two LANs
connected by an IP emulator with a latency of 30ms. (The latency through Internet between

East and West coast of the United States is about 60ms).

107

 

Table 5-1 Test sets and the characteristics of network

 

 

 

 

 

 

 

environments

Test Network TCP throughput ping latency
sets environments (Mbps) (ms)

A switched LAN 94.72 0.18

8 cross switched LANs 94.72 0.20

C cross building router 94.38 0.30

D cross campus backbone 94.35 0.40

E cross lntemet 10.68 10.0

F cross WAN emulator 32.79 30.0

 

 

 

 

 

 

The network bandwidth (or TCP throughput) and latency for each of the testing
environments is shown in Table 5-1. The TCP throughput is measured by Test T CP
('I'I'CP) [TI‘CP]. The latency is measured by the utility program ping. Figure 5—3 The
network diagram of testing environments shows the network diagram of the testing
environments and interested readers can refer to campus network diagrams for MSU and
WSU from [MSU-network 2001; WSU-network 2001] and MichNet backbone diagram
from [MichNet-Backbone 2001].

CoStore clients and the storage cluster are always on the same Ethernet switch of LAN
A. Test Set A is the performance baseline as the mirror cluster is on the same switch as
well. In Test Set B, we move the mirror cluster onto LAN B. The two LAN s are connected
by a department network backbone: Cisco 7200 router and Catalyst 5500 switch in the
Department of Computer Science and Engineering (CSE) at Michigan State University
(MSU). The closely coupled router and switch are conﬁgured to ensure that only the ﬁrst
packet goes through router and that all subsequent packets will be switched following the
path of the ﬁrst packet. In Test Set C, the mirror cluster is located on a farther LAN C and
the network trafﬁc has to go through the Engineering Building router. This Gigabit

Ethernet Foundry router is one node on MSU's Gigabit Ethernet campus backbone ring. In

108

Test Set D, the mirror cluster is moved onto LAN D in National Superconducting
Cyclotron Lab (NSCL) building (a quarter mile away) at MSU. Both buildings are
connected to the Gigabit Ethernet campus backbone via Foundry routers, with ﬁve routers

in between.

 

 

 
       
   
  
   

 

 

 

 

 

 

   

_ Gigabit
lntemet _ 100Mbps

(MichNet) F3082?"

a Cisco

router

6. arts”
9 tea us . user. Bld
ﬂ backbone "9 g
Egr Bldg /-\—-—/ switched
LANs
/ \ \

LAN A LAN 8 LAN 0 LAN D

 

 

 

 

 

 

 

 

 

Figure 5-3 The network diagram of testing
environments

In Test Set E, the mirror cluster is placed on LAN E in the Department of Electrical and
Computer Engineering at Wayne State University (W SU), which is located in another
metropolitan area. Both universities are connected to the Internet2 via MichNet [MichNet-

Backbone 2001].

5.4. 1 Network emulator EMIP

Unable to further expand the distance, in Test Set F we resort to network emulator to

evaluate the performance of CoStore with the mirror cluster located far away from the

109

storage cluster. The network emulator in Test Set F is EMIP (Emulation of IP Networks)
[Huang 1998].

The EMIP emulator provides the functionality of applying a set of network conditions
and trafﬁc dynamics to the traversing network trafﬁc. EMIP, in contrast to a pure software
simulation solution like ns-2 [ns-2], is a real laboratory network on which real protocol and
application implementations can be developed and tested both for correctness and for
efﬁciency against a selected set of real network conditions and trafﬁc dynamics.

In EMIP, each physical network interface is attached with a virtual device module in
Linux kernel (Figure 5-4). The virtual device module consists of ﬁve policy units: MTU
(minimum transfer unit) unit, delay unit, bandwidth unit, loss unit and bit error unit. By
intercepting outgoing network trafﬁc to be sent out from one of the emulator’s network
interface, the virtual device module can impose bandwidth control, delay control, packet
loss rate control, bit error rate control or MTU changes for each outgoing packet sending
out from that network interface.

In addition, background trafﬁc can be generated within the emulator host such that the
impact of the background trafﬁc to the emulation experiment can be studied. EMIP also
supports trace-driven emulation, in which a trace ﬁle of real network trafﬁc is used as the
input to the emulator instead of speciﬁc protocols and applications. The virtual device
module is implemented as a kernel loadable module on Linux. A GUI interface is also
provides to easily conﬁgure network conditions, trafﬁc dynamics, and background trafﬁc to
be applied. Useful trafﬁc statistics can be viewed on the GUI interface. A project called
EMPOWER is currently underway to design and implement a more scalable network

emulation framework [EMPOWER].

110

 

| IP |
Virtual Device Module

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MT U Delay Loss Bit-error
Unit Unit Unit 7' Unit
LBandwidth Unit’I
Background
Traffic

 

 

 

 

 

 

 

 

 

 

 

Device Driver ]
I

 

 

 

 

 

 

Figure 5-4 The internals of EMIP

In Test Set F, the EMIP emulator acts as a router between the storage cluster and the
mirror cluster and the virtual device module of EMIP is used as a trafﬁc shaper to delay
traversing network trafﬁc through the emulator host. The delay is set to 30ms in either
direction and the measured TCP throughput is 32.8Mbps using 'ITCP.

For all the tests we only evaluate write performance, which is the central topic in RAID
performance. All the tests are a set of consecutive write requests with variable ﬁle sizes.
We arrange the tests into two groups according to ﬁle sizes: 32KB to 512KB for small-size
and 1MB to 128MB for large-size (For clearer curves not all large samples are shown in
some of the ﬁgures.) Each ﬁle size has 4 different ﬁles and an average is taken as the
latency for that size. Currently the CoStore prototype only supports RAID level 0/O+l and
level 4/4+l with various cluster sizes. In RAID 0/0+1, the group size is set to one because
larger size would not change the performance of CoStore. In RAID 4/4-1-1, we only
evaluate single-client cases. The bottleneck effect in RAID 4/4+l with multiple clients
deserves separate scrutiny in further studies.

The independent block buffer cache subsystem in CoStore is enabled in all the tests. All

the disk and network I/O operations are called synchronously. There is another block

111

 

buffering at the operating system level. This extra buffering from OS can be disabled and
we will investigate its effect along with that of asynchronous HQ in upcoming studies. For
all tests in this study, the block buffering at the operating system level is left on and the
only negative effect it might cause is the use of extra memory, which is abundant on the

test PCs.

5.5 Performance Evaluation

There are two options in deciding when to commit modiﬁcations to devices or to mirror or
parity daemons. One choice is to commit the modiﬁcation for every client request before
sending out the reply. This policy is called transactional write (transact-Wr). The other
option is to periodically commit dirty buffers, which is called lazy write (lazy-Wt).

For parity and mirror updates, when the buffer is full, the system will force the flushing
before more updates can be inserted into buffers. Therefore, the timing of committing
parity and mirror is controlled by two parameters: the ﬂushing interval and the capacity of
buffer size. We will demonstrate the effect of transact-Wr in Test Set A. For all other tests
we use lazy-Wr only; we reserve 32MB for block buffer and 1MB for either mirror or

parity update buffer and the buffer committing period is 5 seconds.

5. 5. 1 Local mirroring

Figure 5-5 shows the results for RAID 0/0+1 in a switched Ethernet environment (Test Set
A). For both small and large size accesses, the transactional commit policy considerably
lowered the performance of either RAID level 0 or level 0+1. The overhead of copying

updates to mirrors is relatively small compared to the total latency for each access.

112

Latency (seconds)

Latency (seconds)

(a) Small-size writes

 

1 l l s
1.4 - L
Transact-Wr in RAID 0 —-+—— i
Lazx—Wr in RAID 0 --+-— /
Transact- rin RAID 0+1 ~-«e------ . i.
1.2 ‘ Lazy-Wt in RAID 0+1 ---b~-- /

 

 

 

 

 

 

 

 

 

 

P
512K
File size (bytes)
(b) Large-size writes
1 20 I l l l I I
i/
Transact-Wr in RAID 0 —+—-
-Wr in RAID 0 mn"-
100 . Transact- r in RAID 0+1 ----- e ------ .1" _
Lazy-Wr in RAID 0+1 ---b---
80 d «it; h
60 ' I /r
’5‘,
40 " “if". .riﬁ” 1—
.-"l "‘-:’l’
20 - .................. .,.:;’,':"”’ 1-
.... ”"an “.’-';:.:'~:‘"
. ..--»-""""""MM -.—.-:-.'.::;:;‘-§;:; 55555
A ‘. w.L--
0 T l I I I
1M 2M 4M 8M 16M 32M 64M

File size (bytes)

Figure 5-5 A CoStore cluster with RAID level 0/0+1 in
a switched LAN

113

Figure 5-6 shows the results for RAID 4/4+l with 4 storage daemons (Test Set A).
Again, just as in RAID 0/0+l, the transactional commit signiﬁcantly lowered the
performance of RAID 4. Due to the large buffers, the performance for small writes in
RAID 4 or 4+1 are indistinguishable in Figure 5-6(a). However, for large size accesses
RAID 4+1 is slower than RAID 4 because of the extra mirroring effort. For example, for a
64MB ﬁle, the writing in RAID 4+1 is about 15% slower than RAID 4.

Special techniques can be applied to reduce the overhead in RAID 4+1. In the
implementation, we have an option about how parity and mirror buffers are updated. A
default method called S-Both (solid lines in Figure 5-1) is to let storage daemons update
both buffers: to the parity daemon and to its corresponding mirror daemon. Alternatively,
storage daemons can choose to only update mirror daemons. Upon receiving the updates
from storage daemons, the mirror daemons then notify the changes to the parity daemon in
the mirror cluster, which in turn will update the parity daemon in the storage cluster. This
option is called M-Parity (dotted lines in Figure 5-1). In other words, the parity updates are
postponed until the parity daemon in the mirror cluster is updated. From Figure 5-6(b), the
M-Parity brings the RAID 4+1 performance very close to that of RAID 4, however still

about 2.4% slower.

114

 

Latency (seconds)

Latency (seconds)

(a) Small-size writes

 

 

 

 

 

 

l J 1
1.4 - Transact-Wr in RAID 4 —+—— -
Lazyi-Wr in RAID 4 ---1n---
Lazy-Wr (S-Bot in RAID 4+1 ---e---~
1 2 LBZY'Wf (M-Parity in RAID 4+1 --—&---
. " >-
1 ~ 1-
0-8 4 _
/,.r>
0.6 -* x/ ..
./ ,0
./ .......
If: ,,,,,,, J!
0.4 ~ .,-./"' ,,,,,,,, _
.‘r’ti..:::' """"
"_,-;"-w£"
..-—~~¢!O"""'“
0.2 " "M-MJo-ﬁ'f’y """" .-
‘ m— $5333" “a: a. ............ "k ....
......................... .3. -------------
0 y I I I
32K 64K 128K 256K 512K
File size (bytes)
(b) Large-size writes
80 l I l l l I
Transact-Wr in RAID 4 —+— /I
70 ‘ Lazy-Wr in RAID 4 ---x—-— ,5?
Leszr (S-Bot in RAID 4+1 mew / 42'
Lazy- r(M-Parity in RAID 4+1 -'-b--- .I.

3

8

 

 

 

 

4M

File size (bytes)

Figure 5-6 A CoStore cluster with RAID level 4/4+1 in
a switched LAN

115

 

Comparing RAID 4+1 (Figure 5-6(b)) with RAID 0+1 (Figure 5-5(b)), we can ﬁnd that
the extra parity has an overhead of about 31.6% using standard S-Both option, or about
12% if using the special M-Parity option. For extra redundancy, this may be a small
performance price to pay. The M-Parity technique effectively removes the parity bottleneck
in storage clusters and makes parity updates out of storage daemon's critical path in serving
the front-end clients request trafﬁc. The beneﬁts of M-Parity remain to be further

investigated when there are multiple concurrent clients.

5. 5.2 Remote mirroring

We evaluate the effect of CoStore clusters mirrored in a switched Ethernet LAN
environment in Test Set A. Next, we conduct similar tests on RAID 0+1 and 4+1 on
different network environments with varying network bandwidth and latency (distance).
All the following tests are conducted with option lazy-Wt and S—Both.

From Figure 5-7(a) and Figure 5-7 (b) we can ﬁnd that both the performance of RAID
0+1 and RAID 4+1 clusters are virtually unaffected by the fact that the mirror clusters were
located on different subnets, ranging from in the same room, to different wings in same
building, and ﬁnally to different buildings in the same campus. The high bandwidth gigabit
campus backbone and the high performance routers/switches make possible mirroring
storage clusters located as far as a campus networks can reach. With such distances, the
storage systems can survive most natural disasters or even unexpected catastrophes like the
September 11 attacks. The network infrastructure in MSU is state~of-the-art, but by no
means rare. Many large universities, business organizations have such infrastructures in

place. Many of the data centers have better network resources at their disposal.

116

Latency (seconds)

Latency (seconds)

45

40

35

25

20

15

10

(a) 0+1 clusters mirrored between remote sites (Lazy-Wr)

l l l

1.-

 

 

 

 

 

 

 

 

 

 

.0
switched LAN (0.18ms latency —-+——
. cross switched LANs (0.20ms latency) ---n--- _
cross building router ’0.30ms latency) mew--
cross campus backbone (0.40ms latency) ---b--
_ cross lntemet2110.0rns latency) v ..
cross WAN emulator (30.0ms latency) e
d F‘
................ o _
r) ..................... .- r'.
I I I '
1M 2M 4M 8M 16M 32M
File size (bytes)
(b) 4+1 clusters mirrored between remote sites (Lazy-Wr)
l l #17 I
switched LAN 0.18ms latency —r——
cross switched LANs 0.20ms latency "41--- _
cross building router 0.30ms latency mem-
cross campus backbone 0.40ms latency ---b--~
- cross WAN emulator 30.0ms latency ~0~~ i

 

 

 

 

1M 2M 4M 8M 16M
ﬁle size (bytes)

Figure 5-7 A CoStore cluster mirrored in various

network environments

117

32M

In Test Set E, we measure the performance of RAID 0+1 between two university
campus networks. The baseline throughput of CoStore cluster with RAID 0+1 is roughly
1MB/s in Test Set A (Figure 5-5(b)). Even though the available bandwidth (lOMbps) from
the connection is adequate, the throughput of the mirrored CoStore cluster is slowed to
0.57MB/s, a 43% drop, by the 10 ms latency along the 10—hop path through the lntemet
connection. In the emulated IP network with 30ms latency, the throughput has dropped to
0.32MB, or 68% off from the baseline performance. Again the network bandwidth
(32.79Mbps) is more than enough and it is the much longer delay along the mirroring path
that has slowed the performance of mirrored CoStore clusters. Due to the physical long
distance and limited time, we choose not to conduct tests with RAID 4+] in Test Set E.

In Figure 5-7(b) the Test Set F shows that the performance of RAID 4+1 is doubly
penalized (0.16MB or 84% off) by the long delays because both data mirrors and parity
mirrors have to be updated crossing the long-delayed emulated WAN connections. This
30ms latency is equivalent of going through 10+ lntemet backbone routers with a physical
distance of up to 3000km. This latency is drastically too big for most commercial
organizations (such as data centers). However most business can afford direct links or only-
a-few-hops links with very low latency and high bandwidth. At the meantime the longer the
distance separating the mirrored clusters, the more likely one cluster can survive large-scale

disasters.

5. 5.3 Performance enhancing techniques

By having larger block buffers or longer commit interval, the system performance can be

improved. However, we risk more data loss if storage daemons crash before updates are

118

committed to disk or parity/mirror daemons. The availability of NVRAM (Non-volatile
RAM) certainly can help; unfortunately NVRAM is very expensive and only high-end
systems can afford.

So far the daemon in CoStore is only implemented as a single threaded program. It
makes perfect sense to introduce a second thread to take care the synchronization of RAID
updates (parity or mirror) in background so that the main thread can be more responsive to
clients’ requests. Another potential technique to improve performance is to use
asynchronous I/O operations for disk and network accesses. By issuing an I/O request
asynchronously, daemons can return right away to handle more client requests without
waiting for the request to be completed.

As of this writing we have just ﬁnished the implementation of asynchronous disk and
network I/O. Unfortunately there is not enough time for further experiments or to
implement multi-threaded daemons. We plan to evaluate the effect of multithreading and

asynchronous I/O on the performance of mirrored CoStore clusters in future study.

5.5 Summary
The CoStore cluster architecture could construct storage systems with strong reliability and
high availability. The performance evaluation of the CoStore prototype demonstrates that
there is little impact on performance even if the cluster is mirrored in different buildings on
a large campus as long as the network bandwidth and latency are satisfactory.

The test results also show that the latency between remote sites plays an important role
in the performance of the cluster even if the bandwidth is adequate. With direct

communication channels or efﬁcient VPN links, the latency could be kept low even if the

119

distance is long. Therefore, with dedicated networks the CoStore cluster architecture could
effectively improve storage systems’ preparedness for disaster recovery without sacriﬁcing

performance. However, direct deployment on public lntemet may be less desirable.

120

Chapter 6

Conclusions and Future Work

6.1 Related Work

6.1.1 Storage cluster architecture

Gibson et al. have built a high performance storage system based on directly network-
attached secure disks (NASD) [Gibson 1996]. With a similar architecture, the CoStore is
also proposed to take advantage of the emerging network attached storage (NAS) devices.
CoStore currently assumes that clients authenticate through a central authentication server
while the NASD system uses a dedicated ﬁle manager to enforce access control. More
importantly in CoStore the namespace consistency is maintained collaboratively by
participating cluster members, different from NASD whose central ﬁle manager maintains
the namespace consistency for the whole system.

Previously proposed serverless distributed ﬁle system xFS [Anderson 1995b] is a ﬁle
cluster server supporting high-performance distributed computing applications on network
of workstations (NOW) [Anderson 1995b]. The xFS ﬁle system inherits the log-structured
ﬁle system from LFS [Rosenblum 1992.] and the distributed RAID striping from Zebra
[Hartman 1993] to provide data redundancy.

Compaq/DEC’s research project Patel [Lee 1996] provides a distributed storage
system, offering a block-level virtual disk interface. Block-level interface rather than a
richer ﬁle-level interface is chosen to handle heterogeneous client ﬁle systems. The authors
of Patel were concentrated on replication-based redundancy schemes like chained-

declustering [Hsiao 1989]. Patel logically could be viewed as a RAID system implemented

121

on a symmetric multiprocessor [Gibson 2000]. Similarly, the architecture in our design is
virtually a directory-based approach in distributed shared-memory multiprocessor system.
Frangipani [T hekkath 1997] is a distributed ﬁle system built separately on top of Patel’s
virtual disk.

Distributed RAID approach was introduced in [Stonebraker 1990] and was used in xFS
[Anderson 1995b] and Petal [Lee 1996] to build virtual disks with block interface. On top
virtual block devices, higher level ﬁle systems and distributed ﬁle systems were built, like
the metadata manager in xFS and Frangipani [Thekkath 1997]. Differently, CoStore offers
a ﬁle interface using NAS approach, which seamlessly integrates local ﬁle system and data
redundancy management. Both xFS and CoStore are a serverless design. However,
CoStore evenly distributes metadata and the management across all storage devices in the
cluster. Metadata in CoStore systems are stored at ﬁxed locations while data can be stored
anywhere, compared with the Anything, Anywhere ﬁle layout in xFS.

We borrow the concept of A Frequently RAID from [Savage 1996]. The ﬂexibility of
parity-update-delay parameter ranges from strictly full-time redundancy to variable-
delayed frequent redundancy. This makes possible the idea of conﬁgurable tradeoff
between reliability and performance. LFS technique is mainly used to improve small-write
efﬁciency, which is critical for overall performance in typical RAID systems. For
simplicity reason our design will not employ the LFS technique, but we will use the
joumaling concept for fast recovery after system crash.

The CoStore design is also inﬂuenced by Network Appliance’s design in their storage
appliance products [Hitz 1994]. We use the concept of storage appliance to integrate RAID

management and ﬁle system management functions. Similarly, RAID 4 and RAID 0+1 are

122

ﬁrst chosen for the ease to expand the volume capacity and the simplicity in

implementation.

6.1.2 Remote storage replication

The most relevant work is Microsoft Research's Farsite project [Microsoft 2000], which
builds a serverless distributed ﬁle system with a large group of collaborative desktop
computers across campus-wide corporate network. The fault-tolerance of Farsite is
achieved through replication of ﬁles assisted by Byzantine-fault-tolerant algorithm
[Douceur 2001]. Compared with our proposal, this higher-level approach works well in an
environment of mostly with seat owners. Their previous study demonstrated the feasibility
of a serverless distributed ﬁle system based on existing desktop PCs at Microsoft
Corporation [Bolosky 2000].

Farsite [Douceur 2001], PAST [Druschel 2001] and CFS [Dabek 2001] are all peer-to-
peer approach based storage systems. OceanStore [Kubiatowicz; Rhea] is server-to-server
based. Farsite and PAST are replication based data archiving systems. CoStore uses
distributed RAID, a speciﬁc form of the erasure-resilient coded [Bloemer 1995] fragments
(or block) based approach taken by OceanStore and CFS [Dabek 2001]. Both Farsite and
OceanStore exploit Byzantine agreement protocol [Castro 1999] to realize fault tolerance
and high availability. PAST and CFS only serve as read-only archival systems and are
mainly used for ﬁle-sharing purposes like the ﬁrst generation peer-to-peer systems.
Differently, Farsite, OceanStore and our proposal all provide traditional ﬁlesystem
semantics. PAST, OceanStore and CFS are designed for global scale deployment, while

Farsite and our proposal are targeted for organizational scale.

123

Both using distributed RAID, xFS [Anderson 1995b] and CoStore shares their
decentralized architecture with peer-to-peer systems. However xFS is only intended for
general-purpose ﬁlesystem in a LAN environment and CoStore was targeted at
departmental level in organizations.

The efﬁciency of locating data items in large-scale peer-to-peer systems is critical. In
this regard several routing and location schemes have been designed, including Pastry
[Rowstron 2001] in PAST, Tapestry [Zhao 2001] in OceanStore and Chord [Stoica] in
CFS. Farsite uses a distributed directory service to keep track of replica and ﬁle version

locations.

6.1.3 Storage service utilizing idle space

EMC's SRDF [EMC 2001] and Network Appliance's SnapMirror [Brown 2001] are
examples of point-to-point mirroring between remote storage servers. Typical CoStore
systems are clusters with a group storage hosts with various degrees of redundancy,

possibly with two levels of redundancy.

6.2 Conclusions

In this dissertation study, we have proposed a cluster architecture to construct storage
systems and have demonstrated the CoStore has the potential to achieve: higher scalability
due to the effective ﬁle system layout; high performance because of its root in the efﬁcient
NFS protocol; high reliability by using distributed RAID; and high capacity by pooling

multiple devices together.

124

6.2.1 Storage cluster architecture
Speciﬁcally, the NAS approach has been proven to be effective in constructing scalable
storage server clusters using network attached storage devices. The key to CoStore’s
efﬁciency and scalability is the ﬂexible Data Anywhere, Metadata at Fixed Locations ﬁle
system layout. In CoStore the consistency of a uniﬁed ﬁle namespace is maintained by all
cluster members collectively without any external assistance. The serverless design
eliminates any central server bottleneck and provides strong scalability.

The CoStore prototype using COTS-based components demonstrated feasibility of
building such scalable storage clusters. The performance results measured on the prototype
conﬁrm the potential of CoStore to achieve scalable high-performance high-capacity

storage services with strong reliability and availability.

6. 2.2 Remote storage replication
The CoStore cluster architecture could construct storage systems with strong reliability and
high availability. The performance evaluation of the CoStore prototype demonstrates that
there is little impact on performance even if the cluster is mirrored in different buildings on
a large campus as long as the network bandwidth and latency are satisfactory.

With dedicated networks the CoStore cluster architecture could effectively improve
storage systems’ preparedness for disaster recovery without sacriﬁcing performance.

However, direct deployment on public lntemet may be less desirable.

125

6. 2.3 Storage cluster utilizing idle space

Using an N AS approach, CoStore is a novel solution to build reliable storage service with
efﬁciency, utilizing idle disk space on desktop workstations. System uptime data collected
from production systems conﬁrmed the viability of constructing storage cluster systems

utilizing idle disk space on desktop computing infrastructure.

6.3 Contributions
We make several contributions in this dissertation study. Four main contributions are:

First, we adopted the contemporary NAS approach in constructing a storage cluster.
Distributed ﬁle system, local ﬁle system and RAID management have been seamlessly
integrated in each cluster member to achieve efﬁciency in a CoStore cluster.

Second, we designed the Data Anywhere and Metadata at Fixed Locations ﬁle system
layout using a simple but effective block and inode addressing scheme. This ﬂexible layout
is the key to achieve the efﬁciency and scalability in a CoStore cluster and its serverless
feature.

Third, we investigated cluster mirroring to enable remote replication and to provide an
online almost up-to-date backup copy. This extra redundancy is invaluable in disaster
recovery planning.

Finally, we actually implemented a prototype CoStore cluster and demonstrated its
advantages. This prototype implementation can also serve as a concept proof tool or test-
bed for further research and experiments.

Other contributions of this research include: we applied the directory-based CC-NUMA

approach from distributed shared-memory systems to the architecture of storage cluster

126

systems; we assessed the feasibility deploying the proposed cluster architecture on existing

desktop-computin g environments.

6.4 Future Work

6. 4.1 Implementations

Future work includes the implementation of the RAID management module with full
support for different RAID levels, particularly the more balanced level SIS-+1. With only a
client simulator, the evaluation of the current prototype is only limited to synthetic test sets.
We will implement an installable module at client end so that experimenting with
comprehensive benchmark programs is possible.

To address the heterogeneity of potential platform, the CoStore prototype is
implemented at user level and can be relatively easy to port to other platforms. To reach
more platforms, like the open source Linux and BSD, the effort to port the prototype is
currently on going.

We will implement a second thread to manage the synchronization of RAID updates
(parity or mirror) in the distributed RAID. We will investigate the effect of asynchronous
I/O operations for network and disk accesses on the performance of mirroring CoStore
clusters. Further study topics also include the investigation of importing more sophisticated

security mechanism and joumaling ﬁle system management into CoStore systems.

6.4.2 Future research areas

In the future we will look into DAFS and NFS version 4 and investigate the potential of

DAFS-enabled CoStore in taking advantage of modem memory-to-memory networking

127

technologies. Another interesting area is the Quality Of Service (QoS) on storage. Storage
out-sourcing has gradually becoming standard business conduct.

However, there has been a lack of mutual-agreeable and enforceable terms in deﬁning
the kind of storage service a customer is paying for in service contracts between data
centers and customers. Storage services nowadays are mostly best effort. We believe that
the CoStore cluster architecture has the potential to facilitate the fulﬁllment and

enforcement of QoS Storage contractual agreements and will explore in future studies.

128

 

Bibliography

Alvarez, G., K. Keeton, A. Merchant, E. Riedel, and J. Wilkes, "Tutorial on Storage
Systems Management," in SIGMETRICS, 2000.

Anderson, D., Universal Serial Bus System Architecture, Addison-Wesley, 1997.

Anderson, T., D. Culler, et al., "A case for NOW (Networks of workstations)," in IEEE
Micro, 19953, pp. 54—64.

Anderson, T., M. Dahlin, et al., "Serverless Network File Systems," ACM Transactions
on Computer Systems (TOSC), 1995b.

ANSI, "Small Computer System Interface-2 (X3T9.2)," Draft Revision 10k ed: American
National Standard Institute, Inc., 1993.

Bach, M., The Design of the UNIX Operating System, Prentice Hall, 1986.
Beck, M., et a1, Linux Kernel Internals, 2nd Ed. ed. Addison Wesley Longman, 1998.

Bever, M., et al., "Distributed Systems, OSF DCE, and Beyond," in DCE-The OSF
Distributed Computing Environment, Schill, A., Ed. Berlin: Springer-Verlag,
1993.

Bloemer, J., et al., "An XOR-Based Erasure-Resilient Coding Scheme," The Int’l
Computer Science Inst., Berkeley, Calif. tech. report TR-95-048, 1995.

Bolosky, W. J ., J. R. Douceur, et al., "Feasibility of a Serverless Distributed File System
Deployed on an Existing Set of Desktop PCs," in Proceedings of the 2000

International Conference on Measurement and Modeling of Computer Systems
(SIGMETRICS), 2000.

Breuer, P. T., et al., "The Network Block Device," in Linux Journal Magazine, 2000.

Brown, K., J. Katcher, R. Walters and A. Watson, "SnapMirror and SnapRestore:
Advances in Snapshot Technology," Network Appliance Inc. TR-3043, 2001.

Callaghan, B., NF S Illustrated, Addison Wesley, 2000.

Callaghan, B., B. Pawlowski, and P. Staubach, "NFS Version 3 Protocol Speciﬁcation,"
June 1995.

Card, R., T. Ts'o, and S. Tweedie, "Design and Implementation of the Second Extended

File system," in Proceedings of the First Dutch International Symposium on
Linux, 1986.

129

Castro, M., and B. Liskov, "Practical Byzantine Fault Tolerance," in Proc. Usenix Symp.
Operating Systems Design and Implementation (OSDI), 1999.

Chen, P., E. Lee, G. Gibson, R. Katz, and D. Patterson, "RAID: High-Performance,
Reliable Secondary Storage," in ACM Computing Surveys, 1994, pp. 145-185.

Chudnow, C., "Christmas Comes Early for IP Storage," in Computer Technology Review,
2002.

Dabek, F., M. F. Kaashoek, D. Karger, R. Morris, "Wide-Area Cooperative Storage with
CFS," in 18th ACM Symposium on Operating Systems Principles, 2001.

Douceur, J. R., and R. P. Wattenhofer, "Optimizing File Availability in a Secure
Serverless Distributed File System," in Proceedings of 20th IEEE Symposium on
Reliable Distributed Systems (SRDS), 2001.

Douceur, J. R., and W. J. Bolosky, "A Large-Scale Study of File System Contents," in
Proceedings of the 1999 Conference on Measurement and Modeling of Computer
Systems (SIGMETRICS), 1999.

Druschel, P., and A. Rowstron, "PAST: A large-scale, persistent peer-to-peer storage
utility," in Proc. HotOS VIII, Elmau Germany, 2001.

Du, D., et al., "Emerging Serial Storage Interfaces: Serial Serial Storage Architecture
(SSA) and Fiber Channel - Arbitrated Loop (FC-AL)," Computer Science
Department, University of Minnesota TR 96-073, 1996.

Duplessie, S., "From DAS/NAS/SAN to 'utility-class storage'," in InfoStor, 2001.

EMC, "Symmetrix Remote Data Facility (SRDF) - Connectivity Overview (Engineering
White Paper)," EMC Corp. Dec. 2001.

EMPOWER project website: http://elans.cse.msu.edu/empower

Fido, 1., "Review: Promise Ultra133 TX2 ATA133 Controller & Maxtor D740X ATA133
Drive," LittleWhiteDog.com Nov. 2001.

Gibson, 6., et al., "A cost-effective, high-bandwidth storage architecture," in Proceedings
of the ACM 8th International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), 1998.

Gibson, G. A., Redundant Disk Arrays: Reliable, Parallel Secondary Storage, MIT Press,
1992.

Gibson, G. A., and R. Van Meter, "Network Attached Storage Architecture," in
Communications Of the ACM, vol. 13, 2000.

130

Gibson, G. A., et al., "A Case for Network-Attached Secure Disks," TR-CMU-CS-96—
142, Sept. 1996.

Gnutella project website: http://gnutella.wego.com

Gray, J., "Storage Bricks Have Arrived (talk in the Session on The Future of Storage
Technology)," USENIX Conference on File and Storage Technologies (FAST
'02). Jan. 2002.

Gray, J ., A. Renter, Transaction Processing: Concepts and Techniques, Morgan
Kaufrnann, 1993.

Grochowski, E. G. a. R. F. H., "Future trends in hard disk drives," IEEE Transactions on
Magnetics, vol. Vol. 32, 1996.

Hartman, J ., and J. Ousterhout, "The Zebra striped network ﬁle system," in Proceedings
ofACM Symposium on Operating Systems Principles (SOSP), 1993.

Heyn, T., "Jini: A Pathway for Intelligent Network Storage," Seagate Technology,
Virtual Press Room online literature Jan. 1999.

Hitz, D., "An NFS ﬁle server appliance," Network Appliances TR3001, Jan. 1997.

Hitz, D., J. Lau, and M. Malcolm, "File systems design for an NFS ﬁle server appliance,"
in USENIX Winter 1994 Technical Conference Proceedings, 1994.

Hoard, B., "Disaster recovery, interoperability are recurrent themes at Storage
Networking World," in Storage Networking World Online, 2001.

Hsiao, H.-I., and D. J. DeWitt, "Chained declustering: A new availability strategy for

multiprocessor database machines," University of Wisconsin, Madison CSTR-
854, June 1989.

Huang, M., and L. M. Ni, "Emulation of IP-based Networks," Department of Computer
Science and Engineering, Michigan State University, Technical Report 1998.

Intel, "Virtual Interface (VI) Architecture," Dec. 1997.

Jayaram, H., E. Tomg, Y. Chen, S. Wagner, L. M. Ni, P. Hodges, "The Impact of Smart
Disks and Spatial Reuse Property on RAID-5 Storage Systems," in Workshop on
Architectural and OS Support for Multimedia Applications, 1998.

Kazar, M. L., et al., "Decorum File System Architectural Overview," in Proceedings of
Summer 1990 USENIX Conference, 1990.

Kleiman, T., "Vnodes: An Architecture for Multiple File System Types in Sun UND(," in
Proceedings of the Summer 1986 USENIX Conference, 1986.

Kozierok, C. M., "Storage Review's Reference Guide: Hard Disk Drives," 2001.

131

Kubiatowicz, J ., et al., "OceanStore: An Architecture for Global-Scale Persistent
Storage," in Proc. Conf Architectural Support for Programming Languages and
Operating Systems (ASPLOS-IX), 2000.

Leach, P., and D. Naik, "Common lntemet File System (CIFS/ 1.0) Protocol Preliminary
Draft," Internet-Draft Dec. 1997.

Lee, E., and C. Thekkath, "Petal: Distributed virtual disks," in Proceedings of the ACM
7th International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), 1996.

Lenoski, D., J. Laudon, K. Gharachorloo, A. Gupta and J. Hennessy, "The Directory-
Based Cache Coherence Protocol for the DASH Multiprocessor," in Proceedings
of the 17th International Symposium on Computer Architecture, 1990.

Mason, H., "SCSI, the Industry Workhorse, Is Still Working Hard," in IEEE Computer,
2000.

McKusick, M., et al., "A Fast File System for UNIX," ACM Transaction of Computer
Systems (TOSC), 1984.

Mearian, L., M. Hall, "What's After Fibre Channel?," in Computer World, 2001.

Menon, J ., and J. Kasson, "Methods for improved update performance of disk arrays,"
IBM Almaden Research Center, San Jose, CA, Technical Report r] 6928 (66034),
July 1989.

MichNet Backbone diagram: httpzl/www.merit.edu/michnet/maps/backbone.html
Farsite project website: http://www.research.microsoft.com/research/sn/Farsite/
MSU campus network diagram: http://mrtg.cl.msu.edu/topology/gigabitl.htrnl
Napster website: http://www.napster.com

NetworkAppliance, "DAFS: Direct Access File System Protocol, Version 0.54," Network
Appliance and Intel Oct. 2000.

Ng, S., "Advances in Disk Technology: Performance Issues," in IEEE Computer, 1998,
pp. 75-81.

Noblett, T., "Storage networking technology plays prominent role in September 11
disaster recovery," in Storage Networking World Online, 2001.

Norman, B., F. Lee, "Implementing Serial ATA in Next-Generation Computer Systems,"
in Computer Technology Review, 2002.

Network Simulator n82: http://www.isi.edu/nsnam/ns/

132

Parameswaran, M., et al., "P2P Networking: An Information-Sharing Alternative," in
IEEE Computer, 2001.

Patterson, D. A., G. Gibson, and R. H. Katz, "A Case for Redundant Array of
Inexpensive Disks (RAID)," in Proceedings of the 1988 ACM Conference on
Management of Data (SIGMOD), Chicago, IL, 1988.

Paulson, L. D., "Faster Storage Connectivity Technologies Emerge," in IEEE Computer,
vol. 35, 2002.

Rhea, S., et al., "Maintenance-Free Global Data Storage," in IEEE lntemet Computing,
2001, pp. 40—49.

Riedel, E., C. Faloutsos, et al., "Active Disks for Large-Scale Data Processing," in IEEE
Computer, 2001.

Robbins, D., "Introducing Ext3," IBM developerWorks Nov. 2001.

Rosenblum, M., and J. Ousterhout, "The Design and Implementation of a Log-Structured
File System," ACM Transactions on Computer Systems, 1992.

Rowstron, A., and P. Druschel, "Pastry: Scalable, Distributed Object Location and
Routing for Large-Scale Peer-to-Peer Systems," in Proc. of the 18th IFIP/ACM
International Conference on Distributed Systems Platforms (Middleware 2001),
Heidelberg, Germany, 2001.

Rubini, A., et al., Linux Device Drivers, 2nd Ed. ed. O'Reily & Associates, 2001.

Ruemmler, C., and J. Wilkes, "An Introduction to Disk Drive Modeling," in IEEE
Computer, 1994, pp. 17-28.

Sachs, M., A., Leff, and D. Sevigny, "LAN and I/O convergence: A survey of the issues,"
pp. 24-32, 1994.

Sandberg, R., D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, "Design and
Implementation of the Sun Network File System," 1985.

Sander, S. L., "Gigabit Ethernet: settling in as a storage networking standard," in Storage
Networking World Online, 2001a.

Sander, S. L., "Disaster recovery in the data center," in Storage Networking World
Online, 2001b.

Satran, J., et al., "iSCSI (Internet SCSI), lntemet draft," IETF Dec. 2000.
Satyanarayanan, M., "Coda: A Highly Available File System for a Distributed

Workstation Environment," in Proceedings of the Second IEEE Workshop on
Workstation Operating Systems, 1989.

133

Satyanarayanan, M., "Scalable, Secure, and Highly Available Distributed File Access," in
IEEE Computer, 1990, pp. 9-20.

Savage, S., and J. Wilkes, "AFRAID -- A Frequently Redundant Array of Independent
Disks," in Proceedings of1996 USENIX Technical Conference, 1996.

Seltzer, M. I., G. R. Ganger, M. K. McKusick, et. al., "Joumaling versus Soft Updates:
Asynchronous Meta-data Protection in File Systems," in Proceeedings of USENIX
Annual Technical Conference, 2000.

SSA, "Serial Storage Architecture: A Technology Overview, Version 3.0," SSA Industry
Association 1995.

Stodolsky, D., G. Gibson, and M. Holland, "Parity logging: overcoming the small write
problem in redundant disk arrays," in Proceedings of 20th International
Symposium on Computer Architecture, San Diego, CA, 1993.

Stoica, I., R. Morris, D. Karger, M. F. Kaashoek, and H. BALAKRISHNAN, "Chord: A
Scalable Peer-to-peer Lookup Service for lntemet Applications," in Proc. ACM
SIGCOMM, 2001. ~

Stonebraker, M., G. A. Schloss, "Distributed RAID - a new multiple copy algorithm," in
Proceedings of the Sixth IEEE International Conference on Data Engineering,
1990.

Tanenbaum, A. S., Structured Computer Organization, 4th Ed. ed. Prentice Hall, 1999.

Teigland, D., H. Maugelshagen, "Volume Managers in Linux," in FREENIX Track
USENIX Annual Technical Conference, 2001.

TFCC, "Organization statement," The IEEE Computer Society Task Force on Cluster
Computing (TFCC) 2001.

Thekkath, C., T. Mann, et al., "Frangipani: A Scalable Distributed File System," in
Proceedings of the 16th Symposium on Operating Systems Principles (SOSP),
1997.

TTCP website: http:l/www.pcausa.com/Utilities/pcattcp.htm
Tweedie, S., "EXT3, J oumaling Filesystem," Ottawa Linux Symposium 2000.

Van Meter, R., "A Brief Survey of Current Work on Network Attached Peripherals," in
ACM Operating Systems Review, 1996, pp. 63-70.

Van Meter, R., G. Finn, and S. Hotz, "VISA: Netstation's virtual lntemet SCSI adapter,"
in Proceedings of the ACM 8th International Conference on Architectural
Support for Programming languages and Operating Systems (ASPLOS), San
Jose, CA, 1998.

134

VERITAS, "RAID for Enterprise Computing," VERITAS Software Corporation 2000.

Webster, J ., "The overwhelming importance of storage," in Storage Networking World
Online, 2001.

Williams, R., "InﬁniBand I/O standard will enhance storage networking," in Storage
Networking World Online, 2001.

WSU campus network diagram: http:llnetwork.ucomm.wayne.edu/

Yianilos, P. N., and S. Sobti, "The Evolving Field of Distributed Storage," in IEEE
lntemet Computing, 2001, pp. 35-39.

Zhao, B. Y., ID. Kubiatowicz, and AD. Joseph, "Tapestry: An Infrastructure for Fault-
Tolerant Wide-Area Location and Routing," Univ. of California at Berkeley tech.
report UCB/CSD-O 1 -1 141, April 2001.

135