ALGORITHMS FOR THE SCHEDULING PROBLEM IN VLSI HIGH-LEVEL SYNTHESIS

presented by

GIREENDRANATH TIRUVURI

has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science.

Major professor

Date

ALGORITHMS FOR THE SCHEDULING PROBLEM IN VLSI HIGH-LEVEL SYNTHESIS

By

Gireendranath Tiruvuri

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science
Michigan State University
East Lansing, MI 48824

1996

ABSTRACT

ALGORITHMS FOR THE SCHEDULING PROBLEM IN VLSI HIGH-LEVEL SYNTHESIS

By

Gireendranath Tiruvuri

High-level synthesis takes an abstract behavioral specification of a digital system and finds a register-transfer level structure that realizes the given behavior. Operation scheduling and datapath construction are the core of high-level synthesis in obtaining efficient designs in terms of area and speed. A novel dynamic programming approach for optimal scheduling in high-level synthesis under resource constraints is presented. Our algorithm can handle multi-cycle operations, pipelined functional units, and scheduling under bus and register constraints. In our dynamic programming approach, an equivalence relation between configurations (partial schedules) is used to efficiently detect equivalent states in the solution space. Exploiting the principle of optimality, the scheduling algorithm keeps only the best partial schedules in the search space for further exploration. A set of effective pruning methods is proposed to further reduce the design space.

A new scheme for representing Control/Data Flow Graphs (CDFGs) is proposed.
Based on the representation, a new method for conditional resource sharing analysis between an arbitrary number of operations is presented. Using this method, our dynamic programming scheduling algorithm exploits global conditional resource sharing to obtain optimal schedules in very short computational times.

For a synthesis system to produce efficient designs, it should have the capability to consider different cost-performance trade-offs. Instead of producing all designs and selecting the best, the system can use estimation to identify and prune inferior designs. A simple and efficient technique for the estimation of lower-bound performance is presented. The estimation method is extended to estimate lower-bound completion times from partially scheduled graphs. This is a very important problem in our scheduling algorithm and in some other classes of scheduling algorithms such as branch and bound methods.

The scheduling method and the estimation method are implemented and extensive experimentation is performed using a number of benchmarks. We are able to schedule large examples optimally in very short computational times. Our method is an order of magnitude faster than some of the best exact scheduling methods in the literature. Our results for the lower-bound estimation compare well with the best in the literature.

To Amma, Naannagaru, Anna

ACKNOWLEDGEMENTS

I wish to express my gratitude to my advisor, Dr. Moon Jung Chung, for his able guidance and constant encouragement throughout this work. I would also like to thank my dissertation committee members, Dr. Lionel Ni, Dr. A. Esfahanian and Dr. Bruce Sagan, for their help and guidance and the invaluable education I got from them.

I would like to thank my wife, Kameswari, for her constant encouragement, patience and love. I wish to thank my friend, Vasu, for his friendship and help throughout my years at MSU. I wish to express my heartfelt thanks to my parents, my brother, and many friends and family for giving me everything I have in life.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction
  1.1 Motivation
  1.2 Scheduling in High-Level Synthesis
  1.3 Scheduling Data Flow Graphs
  1.4 Scheduling Control/Data Flow Graphs
  1.5 Lower-bound Estimation
  1.6 Thesis Organization

2 Dynamic Programming Scheduling
  2.1 Related Work
  2.2 Problem Description and Terminology
  2.3 Basic Dynamic Programming Scheduling Algorithm
      2.3.1 Equivalence of Configurations
  2.4 More Pruning of the Search Space
  2.5 Extension of Models
      2.5.1 Multi-cycle Operations
      2.5.2 Pipelining Functional Units
  2.6 Exploring the Search Space
      2.6.1 Breadth First Search (BFS)
      2.6.2 Depth First Search (DFS)
      2.6.3 BFS vs DFS
  2.7 Experimental Results
  2.8 Conclusions

3 Scheduling under Bus and Register Constraints
  3.1 Introduction
  3.2 Scheduling under Bus Constraint
  3.3 Scheduling under Register Constraint
  3.4 Experimental Results
      3.4.1 Results for scheduling under bus constraints
      3.4.2 Results for scheduling under register constraint
  3.5 Conclusions

4 Lower-bound Estimation
  4.1 Introduction
  4.2 Previous Work
  4.3 Model and Definitions
  4.4 Lower-bound Estimation for Performance
  4.5 Estimating a Lower Bound for a Partially Scheduled DFG
  4.6 Experimental Results
      4.6.1 Lower bound estimation of partial schedules in branch and bound methods
      4.6.2 Lower-bound estimation for the entire DFGs
  4.7 Conclusions

5 Scheduling CDFGs
  5.1 Introduction
  5.2 Related Work
  5.3 CDFG Definitions and Issues
  5.4 Representation of CDFGs
      5.4.1 The Dot and Dash Notation
  5.5 Conditional Resource Sharing for a Set of Operations
  5.6 Lower-bound Estimation for CDFGs
  5.7 DPS Algorithm for CDFGs
  5.8 Experimental Results
  5.9 Conclusions

6 Conclusions
  6.1 Summary of Contributions
  6.2 Future Research

BIBLIOGRAPHY

LIST OF TABLES

2.1 Number of control steps and configurations with non-pipelined multiplier
2.2 Number of control steps and configurations with pipelined multiplier
3.1 Results for the fifth-order elliptic wave filter
3.2 Results for the once unrolled fifth-order elliptic wave filter
3.3 Results for the sixth-order elliptic Bandpass filter
3.4 Results for the Fast Discrete Cosine Transform
3.5 Comparison with the InSyn heuristic method [70]
3.6 Comparison with the multi-schedule approach [9]
3.7 Comparison with the symbolic technique [60]
4.1 CPU time and number of configurations in branch and bound algorithm
4.2 Lower bounds with non-pipelined multiplier
4.3 Lower bounds with pipelined multiplier
5.1 Number of total control steps for Conditional Branch benchmarks
5.2 Speculative execution in Conditional Branch benchmarks

LIST OF FIGURES

2.1 An example DFG
2.2 The Design Search Space for the example DFG
2.3 Outline of the basic dynamic programming scheduling procedure
4.1 Outline of the procedure for lower-bound estimation of entire DFGs
4.2 An example for the lower-bound estimation for the entire DFG
4.3 Outline of the procedure for lower-bound estimation from partial schedules
4.4 Computing the values for I and J
4.5 An example for the estimation of lower-bound completion time of partial schedules
5.1 An example CDFG
5.2 Our representation for the example CDFG
5.3 Procedure for Conditional Resource Sharing Analysis
5.4 Pseudo-code to update the total resource requirement
5.5 Outline of the procedure to compute the weights of operations
5.6 Design Search Space for the example CDFG
5.7 Outline of the DPS algorithm for CDFGs

CHAPTER 1

Introduction

High-level synthesis takes an abstract behavioral specification of a digital system and finds a register-transfer level structure that realizes the given behavior. Usually, there are many different structures that can be used to realize a given behavior. One of the main goals of a synthesis system is to find the structure that best meets the constraints, such as limitations on the number of functional units, registers and power, while minimizing some other parameters like the number of cycles. The task of high-level synthesis can be decomposed into four major subtasks: parsing, operation scheduling, datapath construction and controller synthesis. Parsing translates a high-level behavioral description, such as a VHDL description of a digital system, into an internal representation such as a Control/Data Flow Graph (CDFG). The operations in the CDFG are then assigned to control steps by a scheduler. A control step is the fundamental sequencing unit in synchronous systems. Typically, it corresponds to a clock cycle. Objects (operations, variables and data transfers) are bound to resources (functional units, registers, multiplexers, etc.) in the datapath construction phase. Finally, a controller is synthesized to monitor the execution of the operations.

1.1 Motivation

In recent years, the trend toward automating synthesis of digital systems at higher and higher levels of the design hierarchy has been gaining momentum. An example of this is the acceptance and increasing use of logic synthesis tools by the VLSI industry. There have been several motivating factors behind this trend.

• As VLSI design complexities have increased, it has become necessary for designers to move to higher and higher levels of design abstraction. At higher levels of design abstraction, it is relatively easier to understand the specifications and faster to explore different trade-offs between cost, speed, power and so on.

• Since design development contributes the most to the cost of a chip, automating the design process as much as possible can help to reduce the cost. Automated design tools at higher levels of design can also help to reduce the design time, which is important in an increasingly competitive market.

• With an automated design tool, there is less likelihood of errors. Furthermore, a verification mechanism can be incorporated into the tool.

• At higher levels of automation, it is easier to keep track of the history of design decisions: what decisions were made and why, and what the effect of those decisions was.

Operation scheduling and datapath construction are the core of high-level synthesis in obtaining efficient designs in terms of area and speed. Scheduling datapath operations into the best control steps is a task whose importance has been recognized in many systems, e.g. [2, 5, 18, 48, 52, 53, 38, 56, 57]. Typically, a scheduler attempts to obtain an efficient design using one of two approaches: resource-constrained scheduling or performance-constrained scheduling. In resource-constrained scheduling, the number of control steps is minimized given the counts and delays of the resources that perform the operations in the CDFG.
The number of resources required to meet a given performance goal is minimized in performance-constrained scheduling. In many high-level synthesis systems, a large behavioral description is decomposed into several blocks and each block is scheduled separately. In such cases, the resource requirement is determined by the blocks that need the maximum number of resources of each type. Ideally, all blocks should be able to use all the available resources, and this cannot be ensured by performance-constrained scheduling. Secondly, given a single performance constraint for the whole design, it is a non-trivial task to determine the performance constraint for each block separately. On the other hand, a constraint on the resources is uniform across all blocks. Hence, resource-constrained scheduling is preferred to performance-constrained scheduling. We address the scheduling problem under resource constraints in this thesis.

1.2 Scheduling in High-Level Synthesis

The general problem of scheduling is known to be NP-complete [16]. In [71], it is proved that the restricted cases of single execution time scheduling and two processor, one or two time unit scheduling problems are also NP-complete.

The most elementary instance of the scheduling problem in high-level synthesis is that where the behavioral description is a Data Flow Graph (DFG) and the resource constraints are constraints on the available functional units. In a DFG, the nodes represent operations and the edges represent data dependencies between operations. The operations need to be assigned to control steps such that the number of operations scheduled in any control step is less than or equal to the number of available functional units that can perform those operations. The problem gets more complex when there are additional constraints such as bus and register constraints. For example, the number of buses required in a control step depends on the actual operations scheduled in that step and their inputs, not just the number of operations. A register constraint is even more complex, because the register requirement in a control step depends on the operations scheduled in the previous steps as well.

If some of the operations in the behavior are conditional constructs, such as if-then-else and case structures, there are control dependencies in addition to data dependencies between operations. The graphs representing such behaviors are called Control/Data Flow Graphs (CDFGs). In a CDFG, one hardware resource can be assigned to more than one operation in the same control step if they are mutually exclusive, i.e. if they belong to different branches of a conditional construct. In order to obtain efficient schedules, it is very important that the scheduling algorithms maximize such conditional hardware resource sharing. This adds another dimension to the complexity of the scheduling problem.

A scheduler often has to explore the design space with a variety of resource constraints and evaluate different cost-performance tradeoffs before selecting a design. Instead of producing schedules with each and every resource constraint, a scheduler can use estimation to identify and prune inferior designs. So, lower-bound estimation of performance has been an important problem in scheduling. For some classes of scheduling algorithms, such as branch and bound, it is often necessary to estimate a lower-bound completion time from a large number of partially scheduled graphs.
Though it appears to be a special case of lower-bound estimation for the entire graph, the problem is much more complex because the algorithm has to be much faster.

In this thesis, we address the following problem instances of scheduling and estimation.

• Scheduling Data Flow Graphs (DFGs) with single cycle operations under constraints on the number of functional units only (Chapter 2)

• Scheduling DFGs with multi-cycle operations and pipelining functional units under constraints on the number of functional units only (Chapter 2)

• Scheduling DFGs under additional constraints on the number of buses and registers (Chapter 3)

• Scheduling Control/Data Flow Graphs (CDFGs) under resource constraints (Chapter 5)

• Lower-bound performance estimation for entire DFGs and partially scheduled DFGs under constraints on functional units (Chapter 4)

• Lower-bound performance estimation for entire CDFGs and partially scheduled CDFGs under constraints on functional units (Chapter 5)

In the next few sections, these problems are introduced in further detail in terms of the existing techniques, the necessity for new techniques, and so on.

1.3 Scheduling Data Flow Graphs

Since scheduling is an intractable problem, a wide variety of heuristic algorithms is used by synthesis systems to solve the scheduling problem. The heuristics range from list scheduling [10, 53, 48] to force-directed scheduling [55, 6] to transformational [2, 11, 36, 51]. The weakness of the heuristic approaches is that the effectiveness of a heuristic depends on the nature of the input circuit and on the nature of the constraints. Some heuristics work well in some scenarios but not in others, resulting in sub-optimal solutions. Furthermore, as pointed out in [15], "the weakness of the heuristic approaches is that they begin their scheduling by imposing a heuristic rather than begin with the optimal schedule. Hence, the heuristic is intrinsic to the scheduling algorithm and makes significant modifications of the algorithm very hard." For example, many heuristic algorithms try to schedule one operation at a time. This makes it very hard to schedule under constraints like bus and register constraints. An extensive empirical evaluation of some high-level synthesis scheduling heuristics reported in [28] shows that many scheduling heuristics result in sub-optimal solutions even for small problem sizes.

Exact scheduling algorithms are mainly Integer Linear Programming (ILP) based [31, 7, 26, 24], branch and bound [45, 9] or symbolic techniques [59, 60]. The existing methods are computationally very expensive because they do not sufficiently exploit the structure of the input circuit and constraints at each intermediate step of the scheduling process. They do not prune the solution space enough to be applicable to large scheduling problems. Most importantly, none of the algorithms attempted to detect equivalent states in the solution space, i.e. states that represent partial schedules of the same subset of the operations. If the operations are scheduled in a constructive fashion into increasing control steps, we need to store only the best partial schedule for any given subset of operations for further scheduling (principle of optimality). Based on this principle, we proposed a new optimal dynamic programming scheduling (DPS) algorithm [8]. An 'equivalence' relation is defined between partial schedules, using which equivalent states in the solution space can be detected very efficiently.
In addition to this, a simple yet powerful set of pruning techniques is proposed which reduces the size of the solution space (in terms of the number of states) drastically. In order to keep the memory requirement practical, it is also very important that the amount of data stored at each state is minimal. Another major advantage of our algorithm is that it requires very little data to be stored at each state. New states are computed very efficiently from the existing states. Our scheduling algorithm can handle multi-cycle operations and pipelining functional units. We extended the DPS algorithm to handle scheduling under bus and register constraints.

We implemented the proposed algorithm and conducted experiments on a number of standard benchmarks. We created large examples by unrolling the embedded loops in some of the benchmarks and tested our method on those examples. We are able to solve all the examples optimally in very short computational times. The results show the efficiency of our basic dynamic programming approach and the effectiveness of our pruning methods.

1.4 Scheduling Control/Data Flow Graphs

Several scheduling algorithms have been proposed in the literature [22, 30, 62, 73] to handle conditional branches. Except for [62], no other work comprehensively addresses the issues pertaining to conditional branches. However, the symbolic technique used in their work is computationally very expensive and has failed to solve large problems optimally [60]. Some of the important issues of conditional branches include: conditional resource sharing beyond pair-wise mutual exclusiveness, speculative execution of operations, and simultaneous scheduling of parallel conditional trees. A concise representation of the conditional behavior is fundamental in addressing these issues efficiently. The existing representations [61, 63, 72] have various limitations. We proposed a new representation for CDFGs in which the CDFG operations are labeled in a dot and dash notation that precisely represents their conditional behavior. Based on this representation, we present an efficient method for analyzing conditional resource sharing among an arbitrary number of operations. Speculative execution and simultaneous scheduling of parallel trees are supported, thus overcoming the limitations of previous works. The conditional resource sharing analysis method is incorporated into our DPS algorithm to obtain optimal schedules that exploit global sharing.

1.5 Lower-bound Estimation

For a synthesis system to produce efficient designs, it should have the capability to consider different cost-performance trade-offs. So, a scheduler has to explore the design space with a variety of resource constraints. Instead of producing schedules with each and every resource constraint, a scheduler can use estimation to identify and prune inferior designs. For example, suppose that for a given CDFG the scheduler has found a schedule of x control steps with a cost constraint of y. If there is an estimation tool for lower-bound performance, and it determines that the lower bound with a cost constraint of y + z is x control steps, the scheduler need not be used to produce an actual schedule with any cost constraint between y and y + z. Furthermore, a lower bound can be used to evaluate a heuristic solution. For an estimation tool to be useful, it has to be much faster than the actual scheduler, and the lower bounds it produces should be as tight as possible.
We have proposed an efficient estimation technique [19] for the lower-bound performance. We tested our estimation method on a number of benchmarks and compared our results with those of some other known methods [29, 65, 69] in the literature. Our method is faster and produces better lower bounds. In most cases, the lower bound is tight, and it is always within five control steps of the optimum.

Our dynamic programming scheduling method performs an exhaustive search of the design space to find an optimal schedule with a given resource constraint. Any such scheduling method, as well as the branch and bound methods, needs a dynamic estimation method to detect partial schedules that are not going to lead to an optimum schedule. Since the number of partial schedules is generally very high in a design space search process, the dynamic estimation should be much faster than the lower-bound estimation for the entire DFG. At the same time, it should be effective and able to detect infeasible partial schedules as early as possible in the exploration. Furthermore, it should not introduce significant memory overhead into the scheduling algorithm. We extended our lower-bound estimation method so that it can find such dynamic lower bounds. This problem is not addressed in earlier works. We incorporated this extended method into our dynamic programming scheduling algorithm. The results show that it reduces the size of the search space drastically. We also extended our lower-bound estimation methods to handle CDFGs.

1.6 Thesis Organization

In the next chapter, we describe our dynamic programming scheduling algorithm under constraints on functional units. Scheduling under additional constraints (bus and register constraints) is explained in Chapter 3. The lower-bound estimation methods for entire DFGs and partially scheduled DFGs are presented in Chapter 4. Scheduling CDFGs is described in Chapter 5. In that chapter, we describe our representation scheme for CDFGs and the method for conditional resource sharing analysis. The scheduling algorithm and the lower-bound estimation are extended to handle CDFGs in the same chapter. Finally, Chapter 6 concludes, outlining the contributions of this work and topics for further research.

CHAPTER 2

Dynamic Programming Scheduling

In this chapter, we explain our basic dynamic programming scheduling algorithm and the pruning methods. We first describe our algorithm for single-step operations and then describe the extensions for multi-cycle operations and pipelining functional units. Then, two approaches to explore the search space using the dynamic programming algorithm, a Breadth First Search and a Depth First Search, are described.

2.1 Related Work

The simplest scheduling technique is as soon as possible (ASAP) scheduling [21, 38], where the operations are scheduled into increasing control steps as allowed by precedence constraints, resource constraints and their initial order. The list scheduling technique, which was originally used in [10] for microcode compaction, has been adopted by many high-level synthesis systems [28, 53, 48]. A list scheduling algorithm selects operations from a list ordered according to some priority function. A selected operation is assigned to the earliest time step allowed by the precedence and resource constraints. Different list scheduling heuristics vary in the priority function they use. Many algorithms, e.g. [28], use the as late as possible (ALAP) values of operations as the priority function. Given a performance constraint, the ALAP (ASAP) value of an operation is the latest (earliest) time step in which the operation can start executing without violating the performance constraint and precedence constraints.
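To make these two quantities concrete, the following is a minimal C sketch (C being the implementation language used later in this thesis) that computes precedence-only ASAP and ALAP start steps for a tiny hand-coded DFG. The example graph, array names, and the assumption that operations are numbered in topological order are ours, for illustration only; they are not taken from any particular benchmark.

#include <stdio.h>

#define NOPS 4

/* delay[i] = number of control steps operation i takes */
static const int delay[NOPS] = {2, 1, 1, 1};   /* op 0 is a 2-cycle multiply */
/* pred[i][j] != 0 iff operation j is an immediate predecessor of i */
static const int pred[NOPS][NOPS] = {
    {0,0,0,0},   /* op 0: input               */
    {0,0,0,0},   /* op 1: input               */
    {1,0,0,0},   /* op 2 depends on op 0      */
    {0,1,1,0},   /* op 3 depends on ops 1, 2  */
};

int main(void) {
    int asap[NOPS], alap[NOPS];
    /* ASAP: earliest start = 1 + latest finish of any predecessor.
     * A single forward pass works because ops are topologically ordered. */
    for (int i = 0; i < NOPS; i++) {
        asap[i] = 1;
        for (int j = 0; j < i; j++)
            if (pred[i][j] && asap[j] + delay[j] > asap[i])
                asap[i] = asap[j] + delay[j];
    }
    int target = 0;                 /* critical path = performance constraint */
    for (int i = 0; i < NOPS; i++)
        if (asap[i] + delay[i] - 1 > target)
            target = asap[i] + delay[i] - 1;
    /* ALAP: latest start meeting the target, propagated backwards:
     * an op must finish before any successor starts. */
    for (int i = NOPS - 1; i >= 0; i--) {
        alap[i] = target - delay[i] + 1;
        for (int j = i + 1; j < NOPS; j++)
            if (pred[j][i] && alap[j] - delay[i] < alap[i])
                alap[i] = alap[j] - delay[i];
    }
    for (int i = 0; i < NOPS; i++)
        printf("op %d: ASAP=%d ALAP=%d\n", i, asap[i], alap[i]);
    return 0;
}

On this toy graph, operation 1 has ASAP 1 and ALAP 3; the difference of the two values is exactly the mobility used by Slicer, described next.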
The non-pipelined scheduling algorithm of SEHWA [53] uses urgency as its priority function. The urgency of an operation is defined to be the length of the longest path from that operation to the output operations. The priority of operations is in the decreasing order of their urgencies. Slicer [48] uses the mobility (the difference between the ALAP and ASAP values) of an operation as the priority function. In [66], it was shown that all these list scheduling methods are in fact identical. RECALS II [66] is a collection of list scheduling algorithms, also with ALAP priority functions. However, in RECALS II more than one ALAP value, called Pseudo ALAP (PALAP) values, are computed, each with a subset of the resource constraints. Then, a set of schedules is generated using one PALAP priority function at a time, and the best schedule is chosen from the set. In freedom-based scheduling [52], the range of possible control step assignments for each operation is calculated. Then, operations on the critical path are scheduled first. To schedule the other operations, at each step, the operation with the least freedom, that is, the one with the smallest range of control steps into which it can go, is chosen.

A different approach, force-directed scheduling (FDS), was originally presented in [55] and has later been used in other high-level synthesis systems like SAM [6]. In FDS, "force" values are calculated for all operations at all feasible control steps. The assignment of an operation to a control step is made such that it has the most attractive force. FDS attempts to minimize the number of resources subject to a performance constraint. In [54], another algorithm using the same concept of "force", force-directed list scheduling, was presented with the goal of optimizing the performance under resource constraints. Transformational algorithms start with a default schedule and apply transformations to probabilistically move towards an optimal schedule. Transformational methods in the literature include simulated annealing [11, 43], simulated evolution [36] and FAMOS by Park et al. [51].

Exact scheduling methods are mainly Integer Linear Programming (ILP) based, branch and bound, or symbolic techniques. Several ILP-based algorithms for high-level synthesis are presented in the recent literature [17, 35, 31, 7, 26, 24, 74]. In ILP-based scheduling algorithms, the scheduling objectives and constraints are translated into integer linear programming formulations. In [7, 24], several techniques are used to reduce the number of variables in the formulation. Others used heuristics like zone scheduling [26] and step-wise expansion [31] to keep the solution space from exploding. The zone scheduling approach [25, 26] works by successively partitioning the control steps into zones and solving each of the zones by a 0-1 integer linear programming technique. By specifying the maximum number of 0-1 variables involved within a zone, the model can be turned into an optimal scheduling, a list scheduling, or one in between, with different speed/quality characteristics. The optimal scheduling method proposed in [45] is a combination of branch and bound and transformational approaches. It optimizes hardware cost under a performance constraint.
The multi-schedule approach in [9] generates multiple optimal schedules in a branch and bound fashion and synthesizes a complete datapath for each schedule. The symbolic technique used in [60] describes scheduling constraints as Boolean functions and builds an OBDD (Ordered Binary Decision Diagram) encapsulating all optimal solutions. Their technique is computationally very expensive, and they used heuristics to solve large scheduling problems.

2.2 Problem Description and Terminology

A DFG (Data Flow Graph) G = (V, E) is a directed graph representation of a behavioral description, where V is the set of operations and E denotes the set of dependencies (precedence constraints) between the operations. For any two operations x, y in V, if (x, y) is in E, operation x should be finished before operation y can start. Operation x is called an immediate predecessor of y, and y is an immediate successor of x. Associated with each operation x in V there is a unique type T_x, indicating that a resource of type T_x should be used to perform operation x. Figure 2.1 shows a DFG for the differential equation example used in [55].

Figure 2.1. An example DFG

Let T be the set of operation types. Resource constraints are given by a set of ordered tuples <t, D(t), N(t)>, where D(t) is the delay (number of control steps) an operation of type t takes to complete and N(t) is the number of available resources of type t. For any operation x, we denote D(T_x) by d_x.

A schedule is a many-to-one mapping f : V -> {1, 2, 3, ..., k}. For any node v in V, f(v) = r means that the node v is scheduled to start at control step r, and it will be finished at control step f(v) + d_v - 1. A valid schedule f has to satisfy the following conditions.

For all (x, y) in E, f(x) + d_x - 1 < f(y)   (precedence constraints).

For all r in {1, 2, ..., k}, |V(t, r)| <= N(t), where V(t, r) is the set of operations v of type t such that
  f(v) <= r <= f(v) + delta_t - 1, if type t resources are pipelined with a latency delta_t;
  f(v) <= r <= f(v) + d_v - 1, if type t resources are not pipelined   (resource constraints).

The number of control steps C_f taken by a schedule f is max over all v in V of f(v) + d_v - 1. A schedule f is an optimal schedule if and only if for any other schedule g, C_f <= C_g.
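The two validity conditions translate directly into a checker. The following hedged C sketch covers the non-pipelined case on a toy four-operation DFG; for pipelined units, the occupancy d[t] in the resource test would be replaced by the latency delta_t, as defined above. The graph and all names are illustrative, not from the thesis.

#include <stdio.h>

#define NOPS 4
#define NTYPES 2

static const int type[NOPS]     = {0, 0, 1, 1};  /* 0: multiplier, 1: ALU */
static const int d[NTYPES]      = {2, 1};        /* delay D(t) per type   */
static const int navail[NTYPES] = {1, 1};        /* N(t)                  */
static const int edge[NOPS][NOPS] = {            /* edge[x][y]: x -> y    */
    {0,0,1,0}, {0,0,1,0}, {0,0,0,1}, {0,0,0,0},
};

int is_valid(const int f[NOPS], int k) {
    /* precedence: f(x) + d_x - 1 < f(y) for every edge (x, y) */
    for (int x = 0; x < NOPS; x++)
        for (int y = 0; y < NOPS; y++)
            if (edge[x][y] && f[x] + d[type[x]] - 1 >= f[y])
                return 0;
    /* resources: in every step r, at most N(t) type-t ops occupy a unit */
    for (int r = 1; r <= k; r++)
        for (int t = 0; t < NTYPES; t++) {
            int busy = 0;
            for (int v = 0; v < NOPS; v++)
                if (type[v] == t && f[v] <= r && r <= f[v] + d[t] - 1)
                    busy++;
            if (busy > navail[t])
                return 0;
        }
    return 1;
}

int main(void) {
    int f[NOPS] = {1, 3, 5, 6};   /* one multiplier forces the multiplies
                                     into non-overlapping steps 1-2, 3-4 */
    printf("valid: %d\n", is_valid(f, 6));
    return 0;
}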
An operation x is called a predecessor of y (and y a successor of x) if there is a directed path from x to y in G. S_x denotes the set of successors of x, and P_x is the set of predecessors of x. An operation without any predecessors is called an input operation. An operation without any successors is called an output operation. A multi-cycle operation x is said to be unfinished at control step r if it is scheduled but not finished at control step r, i.e. f(x) < r <= f(x) + d_x - 1. An operation x is said to be ready at control step r if it is not scheduled before control step r and all its predecessors are scheduled and finished at control step r, i.e. f(y) + d_y - 1 < r for all y in P_x.

In the process of finding a schedule for a given DFG G, a configuration S represents a state in the design space that is reached by scheduling operations until some control step r. We call r the depth of S (depth(S)). A root configuration is the initial configuration at which none of the operations is scheduled. The depth of the root configuration is 0. At a configuration S, ready(S) denotes the set of ready operations at S and unfinished(S) denotes the set of unfinished operations (scheduled at a previous control step, but not yet finished).

The scheduling process is started at the root configuration. The set of input operations constitutes the ready set at the root configuration. To continue the process of scheduling from a configuration S, a subset of ready(S) is chosen to be scheduled subject to resource constraints. That represents an assignment of operations to control step depth(S) + 1, creating a child configuration S' of S. This set of newly scheduled operations, together with any unfinished operations at S, is called the active set of S' (active(S')). The configuration S is called the parent of S' (parent(S')). If the depth of S is d, the depth of S' is d + 1. The path from the root configuration to a configuration represents a partial schedule, i.e. a schedule of a subset of the operations in the DFG. Hence, we say that a configuration represents a partial schedule. All the configurations on the path from the root configuration to a configuration S (including the root) are called predecessor configurations of S. A leaf configuration represents a complete schedule of the given DFG, where all the operations in G are scheduled and finished. By following the path from the root configuration to a leaf configuration, we obtain the control step assignments for all the operations. At a configuration S, unsch(S, t) denotes the number of unscheduled operations of type t at S.

Figure 2.2 shows a partial design space for the example DFG in Figure 2.1 with two multipliers and one adder. All operations take one control step. Addition, subtraction and comparison operations are all performed by adders. In Figure 2.2, ovals represent configurations. The numbers inside the ovals are the operations in the ready set. The numbers along an arc are the operations in the active set of the configuration at the head of the arc. For example, ready(J) = {3, 6, 8, 11} and active(J) = {12, 10}. I is the root configuration and P is a leaf configuration. J is the parent of K, and K is a child of J. The values for unsch(S, t) are not shown in the figure. For example, unsch(K, *) = 2 and unsch(K, +) = 3.

Figure 2.2. The Design Search Space for the example DFG

2.3 Basic Dynamic Programming Scheduling Algorithm

The search space is explored starting at the root configuration. All the input operations are included in the ready set of the root configuration. At each configuration, the scheduling process is continued by generating its child configurations. All the configurations thus generated constitute the search space. This exploration of the search space can be done either in breadth-first order or in depth-first order, as explained in Section 2.6. In this section and the next, we describe how a child configuration is generated from its parent configuration and how it is determined whether a generated configuration should be added to the search space. We first describe the method for uni-cycle operations. Extensions to multi-cycle operations are described in Section 2.5.

To generate a configuration S from its parent R, a subset of ready(R) is scheduled in the current control step, i.e. at control step depth(R) + 1. This set of newly scheduled operations constitutes active(S). Then, ready(S) is computed.
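Before turning to how ready(S) is computed, note that a configuration can be stored very compactly. The sketch below is our illustration (assuming the DFG has at most 64 operations, so operation sets fit in one machine word); it records exactly the four items the algorithm needs per configuration: ready(S), active(S), depth(S) and parent(S).

#include <stdint.h>

struct config {
    uint64_t ready;          /* bit v set iff operation v is ready      */
    uint64_t active;         /* ops executing in the current step       */
    int      depth;          /* control steps used so far               */
    struct config *parent;   /* partial schedule = the path to the root */
};

/* Set membership and update are single bit operations. */
static inline int  in_set(uint64_t s, int v) { return (s >> v) & 1; }
static inline void add(uint64_t *s, int v)   { *s |= (uint64_t)1 << v; }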
The computation of ready(S) is based on the following lemma.

Lemma 2.1: For any unscheduled operation x at configuration S, x is either a member of ready(S) or there exists a y in ready(S) such that y is a predecessor of x.

Proof: Let x be an unscheduled operation. If x is not in ready(S), then there is a predecessor p of x that is not scheduled. Among all such p, let q be the farthest from x, i.e. the length of the longest path from q to x is maximum among all p. If q is not in ready(S), then there is a predecessor q1 of q that is not scheduled. Note that q1 is a predecessor of x also. And q1 is farther than q from x, which is a contradiction. Hence, q is in ready(S). □

To compute ready(S), all the operations in ready(R) that are not in active(S) are first included in ready(S). Then, the only candidates to be added to ready(S) are the immediate successors of operations in active(S). Among these candidate operations, only those operations with no unscheduled predecessors can be included in ready(S). For a candidate operation x, to check whether or not all of its predecessors are scheduled, due to Lemma 2.1 it suffices to check only the set (ready(R) - active(S)) and the set of candidate operations itself. The operation x can be included in ready(S) if and only if these sets do not contain any predecessors of x.
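A possible realization of this update, under the uni-cycle assumption of this section and using the bitset representation sketched earlier, is the following. Here pred_tc and isucc stand for the precomputed transitive-closure predecessor sets and immediate-successor sets; the toy graph and all names are ours.

#include <stdint.h>
#include <stdio.h>

#define NOPS 4   /* toy DFG: 0 -> 2, 1 -> 2, 2 -> 3 */

static const uint64_t pred_tc[NOPS] = {0x0, 0x0, 0x3, 0x7}; /* all predecessors      */
static const uint64_t isucc[NOPS]   = {0x4, 0x4, 0x8, 0x0}; /* immediate successors  */

uint64_t compute_ready(uint64_t ready_R, uint64_t active_S)
{
    uint64_t carried = ready_R & ~active_S;   /* still ready, not yet scheduled */
    uint64_t cand = 0;
    for (int v = 0; v < NOPS; v++)            /* successors of newly scheduled ops */
        if ((active_S >> v) & 1)
            cand |= isucc[v];
    /* By Lemma 2.1, an unscheduled predecessor can only be found in
     * (ready(R) - active(S)) or among the candidates themselves. */
    uint64_t ready_S = carried;
    for (int x = 0; x < NOPS; x++)
        if (((cand >> x) & 1) && (pred_tc[x] & (carried | cand)) == 0)
            ready_S |= (uint64_t)1 << x;
    return ready_S;
}

int main(void)
{
    /* root: ops 0 and 1 ready; scheduling both makes op 2 ready */
    printf("ready = %#llx\n",
           (unsigned long long)compute_ready(0x3, 0x3));
    return 0;
}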
Thus, if the schedule for the set 19 Of operations in A is not Optimal, then the schedule for the whole DFG cannot be Optimal (principle of optimality [1]). Hence, a configuration S need not be explored further if there is already another equivalent configuration at an equal or lower depth. Also, S can be pruned if there is another configuration P at an equal or lower depth such that ready(P) is a subset Of ready(S). The equivalence Of configurations is checked with the help Of a hash table. For a con- figuration S, we save ready(S) and depth(S) in the hash table. When a Configuration S is generated, the hash table is searched to check if there already is a configuration equivalent tO S. If none is found, S is added to the search space and stored in the hash table. If an equivalent configuration P is found, then we process S as follows. Let depth(S) be h and depth(P) be k. If h is equal tO k, then S is pruned because S cannot lead to a better solution than P. In Figure 2.2, configuration A1 can be pruned since there is an equivalent configu- ration A at the same depth. If h is less than k, then S is added tO the search space since S leads tO a better solution than P does. In Figure 2.2, if E2 is generated earlier than E, the hash table gives a hit when E is generated. E is still added to the search space because it is at a lower depth. The hash table is also updated in this case so that among all the equivalent configurations, only the one with the best(least) depth is stored in the hash table. Thus, our scheduling algorithm exploits two important principles Of dynamic program- ming [1]. e First, it efficiently constructs solutions to larger subinstances Of the original problem (child configurations) from the solutions to a smaller subinstance (parent configura- tion). At each configuration S, we only need to save ready(S), active(S), depth(S) and parent( S). This reduces the memory requirement drastically because when the search space is huge, the memory required is primarily determined by the amount Of data stored at each configuration. The predecessor and successor information of Operations can be looked up the transitive closure Of the adjacency matrix, which is computed once prior to exploring the search space. 20 e It exploits the principle Of Optimality to prune the search space. 2.4 More Pruning of the Search Space We explain three pruning methods in this subsection. The first pruning method is based on an important Observation about equivalent configurations. Lemma 2.3 : Let S and P be two equivalent configurations. Let depth(S) be h and depth(P) be It. Let h be greater than k and let P’ be the predecessor configuration Of S at depth 1:. Then, P’ and all its successor configurations can be pruned. Proof : The configurations P and S have the same set of unscheduled Operations due tO Lemma 2.2. Since P’ is a predecessor configuration Of S, the set Of unscheduled Operations at P’ is a superset of the set of unscheduled Operations at S. Thus, the set Of unscheduled Operations at P’ is a superset of the set Of unscheduled Operations at P and they are at the same depth. Hence, P’ cannot lead to a better solution than the best solution from P. Hence, the lemma. ‘ . D In Figure 2.2, E3 and E are equivalent. Hence, E3 as well as its parent F can be pruned. I The following additional condition is added to the basic dynamic programming scheduling procedure to accommodate this pruning method. 
2.4 More Pruning of the Search Space

We explain three pruning methods in this subsection. The first pruning method is based on an important observation about equivalent configurations.

Lemma 2.3: Let S and P be two equivalent configurations. Let depth(S) be h and depth(P) be k. Let h be greater than k, and let P' be the predecessor configuration of S at depth k. Then, P' and all its successor configurations can be pruned.

Proof: The configurations P and S have the same set of unscheduled operations, due to Lemma 2.2. Since P' is a predecessor configuration of S, the set of unscheduled operations at P' is a superset of the set of unscheduled operations at S. Thus, the set of unscheduled operations at P' is a superset of the set of unscheduled operations at P, and they are at the same depth. Hence, P' cannot lead to a better solution than the best solution from P. Hence, the lemma. □

In Figure 2.2, E3 and E are equivalent. Hence, E3 as well as its parent F can be pruned. The following additional condition is added to the basic dynamic programming scheduling procedure to accommodate this pruning method.

if (depth(S) > depth(P)) {
    find P', the predecessor configuration of S at depth depth(P);
    prune P' and all of its successor configurations
}

The other two pruning methods are based on an upper bound for the performance. In order to use these pruning methods, we find a schedule f using a list scheduling algorithm with the ALAP value as the priority function. Then, we search the design space targeting a schedule with a performance better than that of f, i.e. a leaf configuration at a depth less than C_f. A configuration is called an infeasible configuration if it cannot lead to a schedule with the target performance. Before the search space is explored, the ALAP values of all operations are computed with respect to the target performance.

The second pruning method is based on the precedence constraints of the DFG. When finding a child of R, all the operations in ready(R) with ALAP values equal to depth(R) + 1 must be scheduled in the current control step. If this violates the resource constraints, R is an infeasible configuration and hence can be pruned from the search space. By forcing the operations with ALAP values equal to depth(R) + 1 into the active set of any child configuration of R, we can reduce the number of child configurations of R. In Figure 2.2, from configuration L, D1 will not be generated at all, because operations 1 and 2 should be included in the active set of any child of L with a target performance of 5 steps.

The third method is based on both precedence constraints and resource constraints. After S is generated, it is checked to see if S is an infeasible configuration. This is done by estimating a lower bound on the number of control steps from S to schedule the remaining unscheduled operations at S. If it exceeds the target performance, S is an infeasible configuration and hence is pruned. We compute the lower bound as follows. At a configuration S, for each resource type t in T we find a separate lower bound on the number of control steps to schedule the remaining operations of type t. The maximum of all these lower bounds gives a lower bound on the number of control steps from S. We compute the lower bound gamma for any type t from a configuration S as gamma = a + b + c, where

a is the minimum number of control steps after the current control step before any of the type t unscheduled operations can be scheduled;

b is the minimum number of control steps to complete any schedule of G after the last of the type t operations is scheduled; and

c is the minimum number of control steps to schedule the unscheduled operations of type t, assuming that all the N(t) resources will be used in all control steps.

We defined new data structures that are computed only once prior to design space exploration, using which a and b can be computed in O(k) time, where k is the number of ready operations at S [19]. These data structures and the computation of a and b are described in Chapter 4. The value of c is computed by dividing unsch(S, t) by N(t). At each configuration S, we store unsch(S, t) for each type t; these values can be easily computed from the corresponding values at the parent configuration. At the root configuration, this quantity for each type is equal to the number of operations of that type in the DFG. Further details on the lower-bound estimation method are given in Chapter 4.
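A small sketch of this per-type bound follows, with a and b taken as already computed (their O(k) computation is the subject of Chapter 4) and c obtained by a ceiling division; the overall bound for a configuration is then the maximum of gamma over all types. The numbers in the example are hypothetical.

#include <stdio.h>

/* gamma = a + b + c for one resource type t at a configuration S */
int type_lower_bound(int a, int b, int unsch_t, int navail_t)
{
    int c = (unsch_t + navail_t - 1) / navail_t;   /* ceil(unsch(S,t) / N(t)) */
    return a + b + c;
}

int main(void)
{
    /* e.g. 5 unscheduled multiplications on 2 multipliers, with a = 1, b = 2 */
    printf("gamma = %d\n", type_lower_bound(1, 2, 5, 2));   /* 1 + 2 + 3 = 6 */
    return 0;
}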
2.5 Extension of Models

2.5.1 Multi-cycle Operations

The definitions in Section 2.2 already account for multi-cycle operations. The only problem that needs to be addressed for multi-cycle operations is the handling of unfinished operations. First, the definition of the equivalence of configurations needs to be modified as follows. Two configurations S1 and S2 are called equivalent if (i) ready(S1) = ready(S2) and unfinished(S1) = unfinished(S2), and (ii) for all x in unfinished(S1) and unfinished(S2), x has finished the same number of control steps at both S1 and S2.

Lemma 2.2 still holds with this modified definition of equivalence, and the set of unscheduled operations at S1 and S2 is the same. Since the set of unfinished operations is also the same and each unfinished operation has the same number of control steps yet to complete, it is going to take the same number of control steps to schedule the unscheduled operations from S1 and S2.

At each configuration S, we also save unfinished(S) and the number of control steps that each x in unfinished(S) has finished. We save these quantities in the hash table as well. To generate a child configuration S of R, all operations in unfinished(R) should necessarily be included in active(S). The number of resources available for the operations to be scheduled at the current step should be updated accordingly. The subset of ready(R) to be scheduled is chosen subject to the updated resource availability. The set unfinished(S) should then be determined; it is a subset of active(S). In computing ready(S), we should also check unfinished(S) in addition to the sets explained in Section 2.3. An operation is not included in ready(S) if it has a predecessor in unfinished(S).

2.5.2 Pipelining Functional Units

If pipelining functional units are used, a functional unit can be assigned to more than one operation in an overlapping fashion. Let delta_t be the latency with which functional units of type t can be pipelined. A functional unit of type t occupied with an unfinished operation x at a control step is still available to another operation if x has finished at least delta_t control steps. Hence, when an unfinished operation x at a configuration R is included in the active set of all its child configurations, the number of available resources need not be updated if x has finished at least delta_t control steps at R.

2.6 Exploring the Search Space

2.6.1 Breadth First Search (BFS)

In BFS, the design space is explored in breadth-first fashion, i.e. all the configurations at depth d are generated before any configuration at depth d + 1 is generated. If a leaf configuration is reached during the exploration, it represents an optimal schedule and the search can be terminated.

When a new configuration is generated, it is checked to see if it is an infeasible configuration. If it is not, then the dynamic programming technique is used to check if an equivalent configuration already exists. If none exists, the current configuration is added to the search space. Configurations are added to the rear of a queue of configurations as they are generated. Those configurations are retrieved from the front of the queue one at a time, and all of their child configurations are generated and added to the queue. This search process is continued until a leaf configuration is reached.
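Schematically, the breadth-first exploration is the following loop. The queue primitives and the expansion, feasibility and equivalence tests stand in for the machinery described above, so this is an outline under our own names rather than the thesis's actual code.

#include <stddef.h>

struct config;   /* ready, active, depth, parent, as sketched earlier */

/* Queue primitives and the tests described above; supplied elsewhere. */
extern struct config *dequeue(void);
extern void           enqueue(struct config *s);
extern int            queue_empty(void);
extern int            is_leaf(struct config *s);       /* complete schedule? */
extern int            infeasible(struct config *s);    /* lower-bound test   */
extern int            keep_config(struct config *s);   /* equivalence test   */
extern int            num_children(struct config *s);
extern struct config *child(struct config *s, int i);  /* schedule one step  */

/* In BFS, the first leaf reached is an optimal schedule. */
struct config *bfs(struct config *root)
{
    enqueue(root);
    while (!queue_empty()) {
        struct config *s = dequeue();
        if (is_leaf(s))
            return s;
        for (int i = 0; i < num_children(s); i++) {
            struct config *c = child(s, i);
            if (!infeasible(c) && keep_config(c))
                enqueue(c);
        }
    }
    return NULL;   /* no schedule meets the target bound */
}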
2.6.2 Depth First Search (DFS)

In DFS, the design space is explored in depth-first fashion, i.e. a complete path from the root configuration to a leaf configuration is explored before the next path is examined. The exploration of a particular path is continued until either an infeasible configuration or a leaf configuration is reached. In DFS, a leaf configuration does not necessarily represent an optimal schedule.

From a configuration R, a child configuration R' of R is generated. Then, it is checked to see if it is an infeasible configuration. If it is not, then the dynamic programming technique is used to check if an equivalent configuration already exists. If none exists, R' is added to the search space. Then, a child of R' is generated, and this process is recursively continued until an infeasible configuration or a leaf configuration is reached. When all paths from a configuration R' are explored, the search returns to its parent R. Then the next child of R is found and the recursive process is repeated. When an infeasible configuration is reached, it is ignored and the next child of its parent is generated. If a leaf configuration is reached, it means that a better schedule than the current best has been found. If this solution is as good as the lower bound from the root, the search is terminated. Otherwise, we save it as the current best schedule, update the target, update the ALAP values with respect to the new target, and continue the search.

Our depth-first approach is different from branch and bound in that it uses dynamic programming techniques to explore the design space. It exploits the concept of equivalent configurations to prune the search space. Our algorithm also uses effective methods to recognize as early as possible whether a path cannot lead to a target solution.

2.6.3 BFS vs DFS

The major advantage of BFS over DFS is that among equivalent configurations, more promising ones are always generated first in BFS. Hence, the inferior ones are never explored further. Thus, BFS exploits the dynamic programming technique better than DFS. Another advantage is that no configurations deeper than any optimal solution will ever be generated in the BFS approach.

The DFS approach also has its own advantages over the BFS approach. The DFS approach takes much less memory than BFS, because all the configurations at all levels need not be stored. Whenever we reach a solution that is better than the current best, we can update the target for the rest of the search, and hence there can be more pruning. If a solution is as good as the lower-bound performance, the search can be terminated. If such a solution is found early on during the search, the DFS approach will perform faster than the BFS approach. If the optimal solution is very close to the solution obtained from the list scheduling algorithm, the design space of the BFS approach will be nearly the same as the design space of the DFS approach. In this case, the DFS approach has a higher probability of performing better than the BFS approach. If there are a lot of solutions with the target performance, then the DFS approach might be able to find one very quickly and hence perform well. However, if the optimal solution is far from the list scheduling solution, or if the number of solutions with the target performance is very small, the BFS approach might perform better, since the number of unnecessary configurations generated will be smaller in the BFS approach.

2.7 Experimental Results

The proposed dynamic programming scheduling algorithm was implemented in the C language. We implemented it using both the breadth-first approach and the depth-first approach (columns DPSB and DPSD in our tables), employing all the pruning methods explained in Section 2.4.
To measure the effect of the pruning methods alone, we also implemented both approaches without the pruning methods of Section 2.4 (columns BFSN and DFSN). The dynamic programming technique of pruning the equivalent configurations is used in all four implementations. The design space is exponential in size if there is no pruning. Thus, the experiments are performed so as to measure the effect of the dynamic programming technique and of all the other pruning methods.

The benchmarks we used are the AR Filter [27], the Differential Equation [55], the fifth-order elliptic Wave Filter [26], the FIR Filter [53], the complex Biquad recursive digital Filter [49] and the sixth-order elliptic Bandpass Filter [50]. We created larger examples for testing purposes by unrolling the embedded loops in the differential equation and fifth-order elliptic filter benchmarks. For the Biquad Filter example, we used three control steps for the multiplier and one for the ALU (we used the same resource type, ALU, to do addition, subtraction and comparison). For all the other examples, we used two control steps for multiplication and one for the ALU. We tested each of these benchmarks with a number of different resource constraints. In 70% of the cases, the lower-bound performance is the same as the list scheduling solution, so the dynamic programming scheduling procedure need not be invoked. We present the results for some of the other cases.

Table 2.1. Number of control steps and configurations with non-pipelined multiplier

                                Resources   Control steps   No. of configurations
DFG                              +    *     DPS   List      DPSB   BFSN   Ratio    DPSD   DFSN    Ratio
Wave Filter                      2    2     18    19        58     355    6.12     30     228     7.6
                                 2    3     18    19        81     262    3.23     34     374     11
                                 2    4     18    19        81     247    3.23     34     356     10.47
                                 1    2     28    28        8      5612   701.5    11     20214   1837.6
Once unrolled Wave Filter        2    2     35    36        333    749    2.25     334    13529   40.51
                                 2    3     35    36        337    559    1.66     247    1396    5.65
                                 2    4     35    36        328    529    1.61     243    1351    5.56
Twice unrolled Wave Filter       2    1     58    58        124    310    2.5      176    4449    25.28
                                 2    2     52    53        410    2268   5.53     1969   20210   10.2
Differential Equation            1    2     8     8         12     55     4.5      12     68      5.67
                                 1    3     7     7         3      22     7.33     3      26      8.67
Once unrolled Diff. Equation     2    2     14    14        15     281    18.73    15     359     23.93
                                 1    2     14    14        22     410    18.63    24     637     26.54
                                 3    2     14    14        15     275    18.33    15     359     23.93
Twice unrolled Diff. Equation    2    2     20    21        146    813    5.56     58     1278    22.03
                                 1    2     20    21        435    2661   6.11     131    6110    46.64
                                 3    2     20    21        127    673    5.3      58     1026    17.69
FIR Filter                       3    2     11    11        47     2906   61.83    47     5401    114.91
Biquad Filter                    3    3     13    13        12     125    10.42    12     274     22.83
                                 3    2     16    16        73     347    4.75     73     409     5.6
                                 2    3     14    14        51     759    14.88    59     1565    26.52
                                 3    4     13    13        8      170    21.25    8      204     25.5
Bandpass Filter                  3    2     13    14        195    552    2.83     72     723     10.04
                                 2    2     13    14        253    623    2.46     126    939     7.45
                                 2    3     11    11        16     180    11.25    16     192     12.0

DPS: Our solution (optimal)
List: List scheduling solution
DPSB: Dynamic Programming Scheduling with Breadth-first search
DPSD: Dynamic Programming Scheduling with Depth-first search
BFSN: Breadth First Search with no pruning other than pruning equivalent configurations
DFSN: Depth First Search with no pruning other than pruning equivalent configurations

Table 2.2. Number of control steps and configurations with pipelined multiplier

                                Resources   Control steps   No. of configurations
DFG                              +    *     DPS   List      DPSB   BFSN   Ratio    DPSD   DFSN    Ratio
Table 2.2. Number of control steps and configurations with pipelined multiplier

                       Resources   Control steps        No. of configurations
    DFG                 +   *      DPS   List    DPSB   BFSN   Ratio    DPSD   DFSN    Ratio
    AR Filter           2   2       13    13      132   5812   44.03     155   8785    56.68
                        3   2       13    13      134   5812   43.37     155   8417    54.30
    Wave Filter         2   2       18    19       80    247    3.24      34    356    10.47
                        2   3       18    19       81    247    3.23      34    356    10.47
                        2   4       18    19       81    247    3.23      34    356    10.47
                        2   1       19    19       23    412   17.91      23    703    30.56
                        1   1       28    28        8   7128  891         11  20216  1837.81
                        1   2       28    28        8   7114  889.25      11  20214  1837.6
    Once unrolled       2   2       35    36      327    529    2.25     334   1351     5.56
    Wave Filter         2   3       35    36      328    529    1.66     247   1351     5.56
                        2   4       35    36      328    529    1.61     243   1351     5.56
                        2   1       36    36      279    869    3.11     424   6206    14.64
    Twice unrolled      2   2       52    53      382   1743    4.56     645  20312    31.49
    Wave Filter         2   1       53    53      346   2492    7.20     543  18423    33.92
    Biquad              2   2       13    13      180    935    5.19     247   4479    18.13
    Filter              1   1       18†   19     1455   2536    1.74     162  28658   176.9
                        2   3       12    12       25    950   38         36   1757    48.8
                        3   1       14    14       32    635   19.84      32    984    30.75

    † Our solution for this case is one control step better than the best known result [64].

We present the results for the experiments for the non-pipelined and pipelined multiplier in Table 2.1 and Table 2.2 respectively. An extensive experimentation by R. Jain et al [28] indicates that a list scheduling algorithm with ALAP values as the priority function gives better schedules in more cases than a few other popular heuristics such as FDS [55], FDLS [54] and MAHA [52]. Hence, we implemented an ALAP list scheduling algorithm and its solutions are presented here (column List in our tables). The optimal solutions as obtained by our dynamic programming scheduling are in column DPS. Our experiments indicated that our approach is very efficient in terms of run-time performance. The average CPU time for our method is 0.5 seconds. The average CPU times for FDS, FDLS and MAHA on a SUN Sparc Station 1 are 33.6 seconds, 2.834 minutes and 2.58 seconds respectively [28]. We ran all our experiments on a SUN Sparc Station 2.

As a machine-independent measure of the efficiency of our dynamic programming technique and the effect of the pruning methods, we reported the number of configurations explored in each of the four implementations (DPSB, DPSD, BFSN and DFSN). We compare BFSN with DPSB and DFSN with DPSD, and the ratios of the numbers of configurations are shown in the tables. The big ratios between BFSN and DPSB, and between DFSN and DPSD, show the effectiveness of our pruning methods. The small number of configurations even without the pruning methods is an indication of the effectiveness of our dynamic programming technique. We mentioned that the breadth-first approach takes better advantage of the dynamic programming technique than the depth-first approach. This is illustrated by the fact that the number of configurations is much smaller for BFSN than for DFSN.

2.8 Conclusions

We presented a dynamic programming approach for the scheduling problem under resource constraints. It can handle multi-cycle operations and pipelined functional units. Using a dynamic programming technique combined with a set of pruning methods, our algorithm searches the design space efficiently and always finds an optimal solution. We have performed experiments with a wide range of resource constraints using a number of benchmark circuits from the literature. The experimental results show that the proposed dynamic programming technique and the other pruning techniques are very effective in reducing the search space drastically. Our algorithm can be used to find multiple optimal schedules. Then, we can choose one among them so as to minimize some other design parameter such as the number of registers. Also, a limit on any such parameter can be imposed as an additional constraint.
The next chapter describes how bus and register constraints can be incorporated into our scheduling algorithm.

CHAPTER 3

Scheduling under Bus and Register Constraints

3.1 Introduction

In obtaining efficient designs, synthesis systems have to consider constraints on design elements like buses and registers in addition to functional units. The number of buses and registers may have a considerable impact on the overall cost of the final design. Buses are needed for data transfers between functional units, registers, multiplexers, etc. Registers are needed to store values of the operands until they are consumed by all their respective operations. The operands include inputs to and outputs from the circuit as well as the outputs produced by operations for other operations to use. The number of buses/registers required in a control step depends on the actual operations scheduled in that control step, not just the number of operations. Hence, scheduling under bus and register constraints is more complex than scheduling under constraints on functional units alone.

Many heuristic algorithms for the scheduling problem consider one operation at a time for scheduling. In the presence of additional constraints like a constraint on the number of buses, scheduling an operation in a particular control step may destroy the possibility of scheduling more than one equally critical operation later in the same step. As an example, consider the following scenario for scheduling under a bus constraint. The number of buses required in a control step is calculated as the number of distinct inputs to all the operations scheduled in that control step. Suppose there are three operations A, B and C that can be scheduled at control step k. The inputs of A are P and Q. B and C have common inputs R and S. There is a bus constraint of two. There are functional units available to schedule any two operations. Let A, B and C have the same priority and let A appear before B and C in the list. So, A is considered first for scheduling and A is scheduled in control step k. Now, neither B nor C can be scheduled in the current control step because of the bus constraint. If either B or C had been considered first for scheduling, both could have been scheduled at control step k because they share inputs.

Scheduling under a register constraint is even more difficult. First, scheduling an operation in a control step can possibly reduce the register requirement in that control step. Second, the register requirement in a control step depends on the scheduling of several previous control steps. Hence, many synthesis algorithms allocate registers following scheduling [33, 75]. Some others [6, 44, 70] try to minimize the number of registers while scheduling, as opposed to imposing a constraint. MSS [9] produces multiple schedules and selects the one with the least number of registers. In our approach, since we schedule one control step (not one operation) at a time, we can handle a register constraint easily. In this chapter, we describe how bus and register constraints can be handled in our DPS algorithm. We present a comparison of our algorithm with a few other exact and heuristic scheduling algorithms that address bus constraints and register constraints/minimization.

3.2 Scheduling under Bus Constraint

In our method, a constraint on the number of buses can be imposed just like the constraint on the number of functional units. The number of buses in a control step is calculated as the number of distinct inputs to all the operations scheduled in that step. This is based on the model described in [25] for a bused architecture with broadcasting. The assumption is that the inputs and outputs are interleaved and the number of inputs is always more than the number of outputs.
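As an illustration of this rule, the following C fragment counts the distinct inputs of a tentative set of operations for one control step under the broadcasting model of [25]. The value-identifier encoding, the two-input bound and the constant MAX_VALUES are assumptions of the sketch, not the data structures of our implementation.

    #include <string.h>

    #define MAX_VALUES 1024        /* assumed bound on value identifiers */

    /* inputs[i][0..n_in[i]-1] hold the input value ids of operation i.
       Returns the bus requirement of the step under broadcasting. */
    int bus_requirement(int n_ops, const int inputs[][2], const int n_in[])
    {
        char seen[MAX_VALUES];
        int  i, j, distinct = 0;

        memset(seen, 0, sizeof seen);
        for (i = 0; i < n_ops; i++)
            for (j = 0; j < n_in[i]; j++)
                if (!seen[inputs[i][j]]) {      /* first occurrence of a value */
                    seen[inputs[i][j]] = 1;
                    distinct++;
                }
        return distinct;
    }

In the scenario above, scheduling A alone already gives a count of two (P and Q), so B and C no longer fit; scheduling B and C together also gives a count of two (the shared R and S), which is why the order of consideration matters.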
In our algorithm, when a configuration is generated, we compute the number of buses by finding the number of distinct inputs to all the operations scheduled in the control step corresponding to that configuration. If it exceeds the bus constraint, we consider that configuration infeasible. In a bused architecture with no broadcasting capability, the number of buses in a control step is calculated as twice the number of operations scheduled in that step [24]. The number of control steps without any bus constraint is never more than the number of control steps with a bus constraint. Hence, the lower bounds used in our algorithm are valid even with a bus constraint.

3.3 Scheduling under Register Constraint

At each configuration we save the set of values active at that configuration (register contents). For each value v, we save the number of unscheduled operations that use v. When a new configuration S is generated, the register usage details are copied from its parent configuration and updated according to the newly scheduled operations. Then, if there are not enough free registers to accommodate the values produced at S, S is pruned. Registers are filled using the left-edge algorithm as described in [33]. The left-edge algorithm guarantees optimal register usage if no conditional branches exist. The number of control steps without any register constraint is never more than the number of control steps with a register constraint. Hence, the lower bounds used in our algorithm are valid even with a register constraint.
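A sketch of this per-configuration bookkeeping in C follows. The structures and the helper uses_of are assumptions; in particular, the sketch frees a register as soon as the last reader of its value is scheduled, which simplifies the treatment of values read and overwritten in the same step.

    typedef struct {
        int value_id;
        int pending_uses;      /* unscheduled operations still reading it */
    } LiveValue;

    typedef struct {
        LiveValue live[64];    /* assumed bound on simultaneously live values */
        int n_live;
    } RegState;

    extern int uses_of(int value_id);  /* total readers of a value */

    /* Updates the register state for one newly scheduled control step.
       Returns 0 on success, -1 if the configuration must be pruned. */
    int update_registers(RegState *s,
                         const int consumed[], int n_consumed,
                         const int produced[], int n_produced,
                         int reg_limit)
    {
        int i, j;
        for (i = 0; i < n_consumed; i++)
            for (j = 0; j < s->n_live; j++)
                if (s->live[j].value_id == consumed[i]) {
                    if (--s->live[j].pending_uses == 0)      /* value dies  */
                        s->live[j] = s->live[--s->n_live];   /* free a slot */
                    break;
                }
        for (i = 0; i < n_produced; i++) {
            if (s->n_live == reg_limit)
                return -1;                     /* not enough free registers */
            s->live[s->n_live].value_id     = produced[i];
            s->live[s->n_live].pending_uses = uses_of(produced[i]);
            s->n_live++;
        }
        return 0;
    }

The actual binding of values to physical registers is then done with the left-edge algorithm of [33], as stated above; the sketch only maintains the count that decides pruning.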
Table 3.1. Results for the fifth-order elliptic wave filter

                     non-pipelined multiplier               pipelined multiplier
     Resources       Control steps         CPU (sec.)   Control steps         CPU (sec.)
     +   *  Bus    DPS  List  ALPS [26]       DPS     DPS  List  ALPS [26]       DPS
     1   1   2      30   31     35            3.89     29   30      -            7.52
     2   2   4      18   19     20            0.12     18   19      -            0.16
     2   1   4      21   21     21            0.09     19   19     20            0.15
     3   3   6      17   17     17            0.08     17   17      -            0.08
     3   2   6      18   18     18            0.08     17   17     17 [25]       0.07
     2   2   6      18   19     18            0.12     18   19      -            0.12
     3   3   8      17   17     17            0.08     17   17      -            0.08
     3   2   8      18   18     18            0.1      17   17      -            0.07
     2   3   6      18   19      -            0.15     18   19      -            0.19
     2   3   8      18   19      -            0.15     18   19      -            0.19

3.4 Experimental Results

3.4.1 Results for scheduling under bus constraints

Previously, scheduling under bus constraints was given major consideration in the ALPS system [25, 26, 24]. To our knowledge, the results in [26] are the best results for scheduling under bus constraints. We incorporated scheduling under bus constraints into our dynamic programming scheduling (DPS) algorithm and compared our results with those of ALPS. The results indicate that our methods are much faster and better. For the fifth-order elliptic filter and the once unrolled fifth-order elliptic wave filter (Tables 3.1 and 3.2 respectively), we used two control steps for multiplication and one for the ALU. For the sixth-order elliptic bandpass filter (Table 3.3) and the Fast Discrete Cosine Transform (FDCT) [37] (Table 3.4), we used all uni-cycle operations as in [26]. In the first part of Table 3.3, additions and subtractions are executed on ALUs; in the second part they are executed on adders and subtracters, respectively. In Table 3.4 also, they are executed on adders and subtracters. The CPU times of our results are on a SUN Sparc Station 2. The CPU times of ALPS are on a VAX 11/8550. The Dhrystone MIPS ratings indicate that our system is nearly 5 times as fast as their system. The speedup we obtained is up to a few hundred times.

Table 3.2. Results for the once unrolled fifth-order elliptic wave filter

                     non-pipelined multiplier               pipelined multiplier
     Resources       Control steps         CPU (sec.)   Control steps    CPU (sec.)
     +   *  Bus    DPS  List  ALPS [26]       DPS     DPS  List             DPS
     1   1   2      63   63     70            0.26     61   62            12.79
     2   2   4      36   37     39            0.49     35   37             0.48
     2   1   4      40   40     40            0.29     36   37             0.31
     3   3   6      33   33     33            0.23     33   33             0.24
     3   2   6      34   34     34            0.27     33   33             0.23
     2   2   6      35   36     35            0.52     35   36             0.54
     3   3   8      33   33     33            0.21     33   33             0.23
     3   2   8      34   34     34            0.27     33   33             0.24
     2   2   5      35   36      -            0.49     35   36             0.50
     2   3   5      35   36      -            0.43     35   36             0.51
     2   3   4      36   37      -            0.48     36   37             0.48

For FDCT, the results of FDLS as reported in [26] are also presented. Our solutions are better than the solutions from list scheduling, ALPS and FDLS. When there are bus constraints, the list scheduling algorithm fails to give an optimum solution in many cases, while our method always gives an optimum solution. For the fifth-order elliptic wave filter (Table 3.1), the number of configurations explored is less than a hundred except in one case. The example of the once unrolled wave filter contains 52 additions and 16 multiplications. For this example (Table 3.2), the number of configurations explored is always under 400 except for one case. For both benchmarks, the number of configurations never exceeded a few thousand. Thus, the number of configurations and the CPU times of our method are very small and practical even for considerably large examples. We reported our CPU times and the CPU times of ALPS in Tables 3.3 and 3.4. For the example of the bandpass filter, our method is faster than ALPS by up to a few hundred times. For FDCT, we obtained a speedup of up to more than a thousand times.

Table 3.3. Results for the sixth-order elliptic Bandpass filter

     Resources            Control steps            CPU time (seconds)
     +,-    *  Bus     DPS  List  ALPS [26]        DPS    ALPS [26]
      3     2   8        8    8      8             0.06      1.0
      2     2   8        9    9      9             0.07      4.2
      2     1   6       12   13     13             0.53     40.0
      1     1   4       17   17     17             0.06      8.8
      2     2   4        9    9      -             0.06       -
      2     1   4       12   13      -             0.51       -
     2,2†   1   2        8    8      8             0.06      1.3
     2,1†   1   2        8    9      -             0.07      4.1
     1,1†   2   8       11   11     11             0.07     18.2
     1,1†   1   6       13    -      -             0.47     19.3

     † Additions and subtractions are performed on adders and subtracters.

Table 3.4. Results for the Fast Discrete Cosine Transform

     Resources               Control steps                CPU time (seconds)
     +   -   *  Bus     DPS  List  ALPS [26]  FDLS        DPS    ALPS [26]
     3   3   8   16       6    6      6         7         0.08      3.7
     2   2   4   12       7    8      7         8         0.14      5.8
     2   2   3   12       8    9      8         9         0.12    142.2
     2   2   2   10      10   11     10        11         0.14     63.1
     1   1   2    8      13   14     14        14         0.12     85.1
     1   1   1    6      18   21     20        20         0.23    132.3

In all our results, we considered buses with broadcasting as explained in [25]. It was not clearly explained in [26] how the number of buses in a control step is computed. For the cases where there is a result from [25], we cite the reference next to the ALPS result in our tables. For those cases, the number of control steps is the same in both [25] and [26]. We also experimented assuming that there is no broadcasting. For the examples of Table 3.1 and Table 3.2, our results in terms of the number of control steps are identical with those of [26]. For the examples of Table 3.3 and Table 3.4, however, we obtained better results even without broadcasting. For these benchmarks, our results with or without broadcasting are identical. The CPU times are also nearly the same. Thus, our results are better and much faster than those of [26], with or without broadcasting of buses.
3.4.2 Results for scheduling under register constraints

We then experimented including a register constraint and compared our results with InSyn [70] in Table 3.5. InSyn allows delayed transfer of output data to minimize register usage and uses a concept of reusable data values to minimize bus activity. We obtained better results in some cases even without using those concepts. InSyn reported an average CPU time of ~15 sec. to produce each design point on a Sun SPARCstation SLC. Our method took less than 1 sec. on the average on a SUN Sparc-2. Thus, our method produced better results and is much faster than InSyn.

Table 3.5. Comparison with the InSyn heuristic method [70]

                                                non-pipelined          pipelined
                                                 multiplier            multiplier
     DFG                 +   *  Bus  Reg.     DPS   InSyn [70]      DPS   InSyn [70]
     Fifth-order         1   1   4    8        28      28            28      28
     EW Filter           2   1   4    8        21      22            19      19
                         2   2   4    8        19      19            18       -
                         3   1   6    8        19       -            18      19
     Twice unrolled      1   1   6   10        84      84            84      84
     fifth-order         2   1   4   10        60      60            57      57
     EW Filter           3   2   6   10        50      51            51       -
                         2   2   6   10        52      52            52      52
     Bandpass            1   1   4    8        25      25            17      17
     Filter              2   1   2    7        25      28            24      24
                         2   2   4    8        13      14            13      13
                         2   3   6    7        11      11            10      10
     Biquad†             1   1   4    6        28      27            19      19
     Filter              1   2   4    5        20      19            18      18
                         2   3   4    5        15      15            13      13
                         3   4   2    3        19      19            19       -

     † Complex multiplier uses three control steps.

We also compared our register usage results and CPU times with MSS [9] for the fifth-order elliptic wave filter in Table 3.6. The results show that our method not only produced designs that use a smaller number of registers (two less in all cases) but also took much less CPU time to produce the designs (up to a few hundred times faster). Their method was implemented on a SparcCenter 2000.

Table 3.6. Comparison with the multi-schedule approach [9]

     Resources                       Registers            CPU time (seconds)
     +   *   Control steps        DPS    MSS [9]          DPS    MSS [9]
     3   3        17               8       10             0.09     3.8
     2   2        19               8       10             0.79    39.1
     2   1        21               8       10             0.03    25.3
     3   2†       17               8       10             0.2      3.9
     2   1†       19               8       10             0.18    43.9

     † Pipelined multiplier is used for this case.

Table 3.7 shows a comparison with the symbolic technique [60] for two large examples, EWF2 and EWF3 (the fifth-order elliptic filter unfolded two times and three times respectively). For many cases shown in Table 3.7, their exact method could not finish in one CPU hour or confirm the optimality of the schedule obtained by their heuristic method. In Table 3.7, we reported the results of their exact method for the cases it could solve. Otherwise, we reported the results of their heuristic method. As can be seen from the results in Table 3.7, our method is much faster than even their heuristic method. In some cases, we obtained a speedup of up to a few thousand times. Our results are superior in terms of register usage. We used buses with broadcasting and obtained schedules with fewer control steps.
Table 3.7. Comparison with the symbolic technique [60]

                     Resources      Control steps     Registers     CPU time (seconds)
     Benchmark      +   *   Bus     Ours   Symb.     Ours  Symb.     Ours    Symb.‡
     EWF2           1   1    2       64     70        12    14       0.12      94.4
                    2   2    4       36     39        10    11       0.5     2700
                    3   2    6       35     35        10    11       0.13      21.52
                    1   1†   2       63     68        12    14       0.15      88.3
                    2   1†   4       36     39        10    11       0.44    1850
                    3   2†   6       33     33        10    11       0.1        4.49
     EWF3           1   1    2       96    105        12    15       0.33     372.9
                    2   1    4       59     59        12    12       0.32      31.5
                    2   2    4       55     58        12    13       1.23      39.8
                    3   2    6       50     50        12    12       0.22      62.66
                    1   1†   2       94    102        12    15       0.27     356.7
                    2   1†   4       54     58        12    12       1.03      42.3
                    2   2†   6       52     52        12    12       1.37     550
                    3   2†   6       49     49        12    12       0.2       10.52

     † Pipelined multiplier is used for this case.
     ‡ Their experiments were run on a Sun SPARCstation 10.

3.5 Conclusions

We have extended our DPS algorithm to handle scheduling under bus and register constraints in addition to the constraints on the number of functional units. The extensions preserve the simple structure and efficiency of the original algorithm. Extensive experiments on large problems show that our method is an order of magnitude faster than the existing exact scheduling methods.

CHAPTER 4

Lower-bound Estimation

4.1 Introduction

For a synthesis system to produce efficient designs, it should have the capability to consider different cost-performance trade-offs. So, a scheduler has to explore the design space with a variety of resource constraints. Instead of producing schedules with each and every resource constraint, a scheduler can use estimation to identify and prune inferior designs. For an estimation tool to be useful, it has to be much faster than the actual scheduler and the lower bounds it produces should be as tight as possible. In this chapter, we describe an efficient estimation technique for lower-bound performance and compare the results with some of the best in the literature.

As explained in Chapter 2, the DPS algorithm estimates the lower-bound completion time from each configuration to check if it is an infeasible configuration. Any such scheduling algorithm, as well as branch and bound methods, needs a dynamic estimation method for partial schedules to detect partial schedules that are not going to lead to an optimal schedule. Since the number of partial schedules is generally very high in a design space search process, the dynamic estimation should be much faster than the lower-bound estimation for the entire DFG. At the same time, it should be effective and be able to detect inferior partial schedules as early as possible in the exploration. In this chapter, we also describe an extension to our lower-bound estimation method to find such dynamic lower bounds. None of the existing methods for lower-bound estimation has addressed the problem of dynamic estimation.

4.2 Previous Work

There are several methods proposed in the literature for the lower-bound estimation of cost as well as performance. Jain et al [29] proposed a mathematical model for predicting the area-delay curve. Their lower-bound method is very fast, but it is rather simplistic and does not consider precedence constraints at all. The technique proposed by Fernandez and Bussell [13] computes the minimum number of operations which must be scheduled in each sub-interval of time steps. It then derives the maximum increase in total execution time, over all intervals, for not having enough processors to accommodate all the operations in an interval. Their method considers only homogeneous resources and can be applied only to multiprocessor schedules. Their method has been extended to high-level synthesis by Sharma et al [69]. They compute the increase in the length of each interval due to the concentration of each type of operations in that interval.
They also address lower bounds on area cost including interconnect cost. Their method has a computational complexity of O(nc²) where n is the number of nodes in the DFG and c is the critical path length. The method proposed by Ohm et al [46] estimates lower bounds on functional units as well as registers. Their technique for functional unit estimation is a refinement of the basic technique of [69]. It is not applicable to lower-bound performance estimation. The complexity of their method is O(n(c² + n + e)) where e is the number of edges in the DFG. The method proposed in [65] uses a relaxation technique of the ILP formulation of the scheduling problem for the lower-bound estimation of performance under resource constraints. It has a computational complexity of O(n + cC) where C is the number of time steps, and produced lower bounds as good as [69] on many benchmarks. The method in [58] is similar to [65] in that it relaxes the precedence constraints and solves the relaxed problem using a slack-driven list scheduling algorithm. Hu et al [23] proposed a method to estimate lower bounds on iteration time and functional unit cost for functionally pipelined DFGs. The complexity of their method is O(nck²) where k is the initiation latency. A recursive technique is proposed in [34] for lower-bound performance estimation and it has a complexity of O(n² + c²). Our method for the lower-bound performance estimation of entire DFGs has a complexity of O(n + c²).

4.3 Model and Definitions

For all v ∈ V, ASAP(v) (as soon as possible) is the earliest time step in which v can be scheduled to start execution, assuming unlimited resources. For all v ∈ V, we define MSAT(v) (Minimum Steps After This operation) as the minimum number of time steps that any schedule of G is going to take after the completion of operation v, assuming unlimited resources. The ASAP values can be computed in a top-down fashion starting from the input operations as follows. If v is an input operation, ASAP(v) = 1. Otherwise, ASAP(v) = max{ASAP(u) + d_u : (u,v) ∈ E}. The MSAT values are similarly computed in a bottom-up fashion starting from the output operations as follows. If v is an output operation, MSAT(v) = 0. Otherwise, MSAT(v) = max{MSAT(w) + d_w : (v,w) ∈ E}. The critical path length c is the minimum number of time steps that any schedule of G is going to take, assuming unlimited resources. It can be computed as max{ASAP(v) + d_v − 1} with the maximum taken over all output operations v.

If a resource type is pipelined, one instance of that resource type can be assigned to more than one operation in an overlapping fashion. The latency with which they can overlap is denoted by δ_t. Clearly, δ_t < D(t). Chaining of operations is handled by dividing time steps into time units and extending the definitions of the MSAT and ASAP values in terms of time units. Let the time period of a time step be T time units. Let μ_v denote the execution time of an operation v in terms of time units. Let A(v) and M(v) be the ASAP and MSAT values of an operation in terms of time units. The A and M values can be computed similarly to the computation of the ASAP and MSAT values using the μ_v values and under the following requirement. For any (u,v) ∈ E, if v cannot be chained to u and if u finishes execution during time step d, then v can start execution only at the beginning of time step d + 1. The ASAP and MSAT values can be easily computed from the respective A and M values.
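These two recurrences translate directly into two linear passes over the graph. The following C sketch assumes the operations are numbered in topological order and that predecessor/successor lists with at most two entries are available; those representation details are assumptions of the sketch.

    #define MAX_OPS 256

    extern int n_ops;
    extern int delay[MAX_OPS];                     /* d_v in time steps */
    extern int n_pred[MAX_OPS], pred[MAX_OPS][2];
    extern int n_succ[MAX_OPS], succ[MAX_OPS][2];

    int asap[MAX_OPS], msat[MAX_OPS];

    void compute_asap_msat(void)
    {
        int v, i, u, w;
        for (v = 0; v < n_ops; v++) {              /* top-down pass      */
            asap[v] = 1;                           /* input operations   */
            for (i = 0; i < n_pred[v]; i++) {
                u = pred[v][i];
                if (asap[u] + delay[u] > asap[v])
                    asap[v] = asap[u] + delay[u];
            }
        }
        for (v = n_ops - 1; v >= 0; v--) {         /* bottom-up pass     */
            msat[v] = 0;                           /* output operations  */
            for (i = 0; i < n_succ[v]; i++) {
                w = succ[v][i];
                if (msat[w] + delay[w] > msat[v])
                    msat[v] = msat[w] + delay[w];
            }
        }
    }

The critical path length c then falls out as the maximum of asap[v] + delay[v] - 1 over the output operations.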
If two operations u and v are chained, the functional unit executing u cannot be freed until v is finished [68]. If v spans across time steps, this may result in under-utilization of resources. So, we assume that v can be chained to the end of another operation only if v starts and finishes execution in the same control step. We impose the following necessary condition, as in [68]. An operation v can be chained at the end of operation u only if (u,v) ∈ E, μ_u mod T ≠ 0 and T − (μ_u mod T) ≥ μ_v. Note that this is not a sufficient condition and the scheduling algorithm should make sure that v starts and finishes execution in the same step. Clearly, it has an impact only on the sharpness, not the validity, of the lower bound.

4.4 Lower-bound Estimation for Performance

The intuitive idea behind our lower-bound estimation is as follows. For each resource type t, we group the operations of that type into three non-overlapping intervals and compute a lower bound as the sum of the lengths of those intervals. The final lower bound is the maximum among all resource types and all possible groupings of operations of each type.

Let P_t be the number of operations of type t in the DFG. If there are L(i,t) operations of type t with an ASAP value less than or equal to i, then there are at least P_t − L(i,t) type t operations that cannot be scheduled in the first i time steps. Similarly, if there are M(j,t) type t operations with an MSAT value less than j, then there are at least P_t − M(j,t) type t operations that cannot be scheduled to execute in the last j time steps. Thus, there are at least P_t − L(i,t) − M(j,t) type t operations that can neither be scheduled in the first i time steps nor be scheduled to execute in the last j steps of any schedule. The three intervals considered are: the first i steps (interval_1), the last j steps (interval_3) and an interval between the two (interval_2) that does not overlap with the other two. The lengths of interval_1 and interval_3 are i and j respectively. The length of interval_2 depends on the minimum number of type t operations in that interval as well as the number of available resources of type t. The number of operations depends on the ASAP and MSAT values of operations, which are determined by data dependencies. Thus, the lower-bound estimation takes into account both precedence constraints and resource constraints.

Let n be the number of type t operations in interval_2 and let r = N(t). These n operations can be scheduled into ⌈n/r⌉ stages. If type t resources are not pipelined, each stage takes D(t) time steps where D(t) is the delay of a type t resource. If they can be pipelined with a latency δ_t, each stage except the last takes δ_t steps. The last stage in either case takes D(t) steps. Hence, the minimum length of interval_2, a, can be computed as follows:

    a = ⌈n/r⌉ · δ_t + D(t) − δ_t    if resource type t is pipelined
    a = ⌈n/r⌉ · D(t)                otherwise.

If n ≤ 0, then a = 0. Note that it takes at least i time steps before any of the n operations in interval_2 can start execution and at least j steps after the last operation has finished execution. The lengths of interval_1 and interval_3 are independent of the set of operations scheduled in those intervals. Thus, for the purposes of lower-bound estimation, the three intervals are non-overlapping. We denote the minimum number of type t operations in interval_2 by q(i,j,t) and the minimum length of interval_2 due to type t operations by h(i,j,t). As explained above, q(i,j,t) = P_t − L(i,t) − M(j,t), and the value of h(i,j,t) can be computed from q(i,j,t). Let λ be equal to max{h(i,j,t) + i + j}, with the maximum taken over all resource types t and all i, j with i + j ≤ c.
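Figure 4.1 below outlines the full procedure; as a concrete illustration of this formula, the following C sketch evaluates h(i,j,t) and the bound λ from precomputed L and M tables. The array layout (indexed by time step and resource type) and the small bounds are assumptions made for the illustration.

    #define MAX_C     128   /* assumed bound on the critical path length */
    #define MAX_TYPES 8

    extern int n_types, c;
    extern int P[MAX_TYPES];             /* P_t: type-t operation count   */
    extern int N[MAX_TYPES];             /* N(t): available units         */
    extern int D[MAX_TYPES];             /* D(t): delay of a type-t unit  */
    extern int LAT[MAX_TYPES];           /* delta_t, or 0 if not pipelined */
    extern int L[MAX_C + 1][MAX_TYPES], M[MAX_C + 1][MAX_TYPES];

    static int h(int i, int j, int t)
    {
        int q = P[t] - L[i][t] - M[j][t];          /* q(i,j,t)            */
        int stages;
        if (q <= 0)
            return 0;
        stages = (q + N[t] - 1) / N[t];            /* ceil(q / N(t))      */
        return LAT[t] ? stages * LAT[t] + D[t] - LAT[t]
                      : stages * D[t];
    }

    int estimate_lower_bound(void)
    {
        int i, j, t, v, lambda = 0;
        for (t = 0; t < n_types; t++)
            for (i = 0; i <= c; i++)
                for (j = 0; i + j <= c; j++) {     /* non-overlapping intervals */
                    v = h(i, j, t) + i + j;
                    if (v > lambda)
                        lambda = v;
                }
        return lambda;
    }

This is the O(c²)-per-type evaluation referred to in the complexity analysis below.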
As ex- plained above, q(i, j, t) = P; — L(i, t) — M(j, t) and the value of h(i, j, t) can be computed from q(i,j, t). Let /\ be equal to mar(¢,,-+j56){h(i,j, t) + i + j}. 45 estimate_lower_bound() { Read the DFG, resource constraints and resource delays (1) Find the ASAP and MSAT values for all operations (2) Compute L(i,t) and M(i,t) for all i and t (3) Compute q(i,j,t) and h(i,j,t) for all i,j and t and compute lower bound (4) Figure 4.1. Outline of the procedure for lower-bound estimation of entire DFGs Lemma 4.1 : The expression A is a lower bound on the the completion time of any schedule of the given DFG. Proof : The above discussion implies that i + h(i, j, t) + j is a lower bound for a given 2', j and type t. The expression for A is the best lower bound among all i, j and t. The condition 3' + j S c makes sure that the intervals are non-overlapping. D An outline of the procedure for the lower-bound estimation is shown in Figure 4.4. Figure 4.2 shows an example for lower-bound computation. The ordered pair next to each node indicates its ASAP and MSAT values respectively. It is assumed that addition takes one time step and multiplication takes two. There are one adder and one non-pipelined multiplier. All values of L and M for multiplication and addition are shown in the figure. For example, the multiplication operations 1 and 2 each has ASAP value less than or equal to 2. Hence, L(2, at) is 2. Similarly, the addition Operations 6, 7, 8 and 9 each has MSAT value less than 2. Hence, M (2, +) is 4. For the example DFG, the maximum value for A is obtained when i=0 and j =2. Complexity analysis : Let n be the number of operations in the DFG. The number of edges in a DFG will grow 46 (5.0) L(1,+)=1 M(1,+)=2 L(1,*)=2 M(1,*)=0 L(2,+)=1 M(2,+)=4 L(2,*)=2 M(2,v)=0 L(3,+)=2 M(3,+)=5 L(3,4)=3 M(3,*)=1 L(4,+)=3 M(4,+)=6 L(4,*)=3 M(4,*)=2 L(5,+)=5 M(5,+)=6 L(5,*)=3 M(5,*)=3 L(6,+)=6 M(6,+)=6 L(6,*)=3 M(6,4)=3 q(0, 2, 4:) = P. - L(0, a) — M(2, at) = 3 A = h(0,2,*)+0+2 = 8 Figure 4.2. An example for the lower-bound estimation for the entire DFG 47 linearly in 11 since the number of inputs of each Operation is generally bounded by small number such as two. Hence, the first two steps require O(n) time because the ASAP and MSAT values can be found in 0(n) by traversing the DFG in a top-down and in a bottom-up fashion respectively. The number of resource types is generally bounded by a small number. For any i and t, L(i, t) and M (i, t) can be found recursively as follows. L(i, t) = L(i — 1, t) + A(z', t) where A(i, t) is the number of type t operations with ASAP value 2'. Similarly, M(i, t) = M(i — 1, t) + B(i — 1,t) where B(i — 1,t) is the number of type t operations with MSAT value 2' — 1. The values for A and B can be found during step 2 without affecting its complexity. Hence, step 3 takes 0(c) time. Finally, step 4 requires 0(c2) time because there are 0(c2) intervals. Thus, the complexity of our algorithm to estimate lower bound of the entire DFG is 0(n + c2). The method in [69] is similar to our method in that it estimates the length of each in- terval of time steps. It computes the required computation cycles of each type as the sum of minimum overlaps of all operations of that type in each interval. Then the difference in the required and available computation cycles of each type is divided by the number of available functional units of that type to get any increase in the length of that interval. 
In their method, for each interval the minimum overlap of each operation has to be determined. Hence, it has a complexity of O(nc²). Our method computes only the number of operations in each interval, in constant time, using some precomputed data structures, thus having a complexity of only O(n + c²). This is made possible by slightly modifying the way multi-cycle operations are considered and by using the MSAT values of operations. Their method allows a portion of a multi-cycle operation to overlap with an interval. In our method, a multi-cycle operation has to be completely inside an interval to be counted in that interval. This restriction allows a more effective computation of the length of each interval for non-pipelined operation types. We divide the number of operations of each type (as opposed to the difference in required computation cycles) by the available number of functional units of that type and then multiply it by the delay of that type.

4.5 Estimating a Lower Bound for a Partially Scheduled DFG

As in Chapter 2, we refer to a partial schedule as a configuration. The depth, ready set, unfinished set and unsch are similarly defined. The method explained in this section is applicable to any forward scheduling algorithm where operations are scheduled into non-decreasing control steps.

Scheduling algorithms such as branch and bound methods need to compute lower bounds on the completion times from partially scheduled DFGs. As can be seen from the experimental results in Chapter 2, the number of configurations can often be very large. If the methods in [69] or [65], or our method from the previous section, are used for the estimation of the lower-bound completion time from each configuration, the time spent in estimation itself may nullify the advantage of estimation. In this section, we present a faster extension for lower-bound estimation from configurations. Our method takes O(k) time to compute a lower bound for a configuration, where k is the number of operations in the ready and unfinished sets at the configuration. This is much faster than the methods in [69] and [65], which take O(nc²) and O(c²) time respectively to compute a lower bound for a partial schedule. Our method from the previous section also takes O(c²) time.

Our method for partial schedules is a special case of the method for entire DFGs. The basic idea is as follows. At a partial schedule, a subset of operations is already scheduled so as to satisfy precedence constraints as well as resource constraints. So, instead of considering all possible values for i and j, we can consider the following special case for each resource type t. We find I = max{i | L(i,t) = 0} and J = max{j | M(j,t) = 0}, thus maximizing h(i,j,t). Note that ASAP, L and M are defined with respect to the unscheduled portion of the DFG. Intuitively, I is the minimum number of time steps after the current step before any type t unscheduled operation can start execution. And J is the minimum number of time steps that any complete schedule from the current configuration takes after the last type t operation has finished executing.
The most important merit of our 49 lower_bound_partial(R,t) /* This procedure returns a lower bound on the number of control steps to schedule the remaining Operations of type t from configuration R */ Compute I <-- max{i} such that L(i,t) 0 (Step 1) Compute J <-- max{j} such that M(j,t) 0 (Step 2) /* ASAP, L and M are defined with respect to the unscheduled portion of the DFG */ q(I,J,t) <-- unsch(R,t) (Step 3) compute h(I,J,t) from q(I,J,t) (Step 4) return h(I,J,t) + I + J Figure 4.3. Outline of the procedure for lower-bound estimation from partial schedules algorithm is that it computes I and J in a very efficient way. After computing I and J, we compute h(I, J, t) + I + J as a lower bound on the number of time steps to schedule the remaining operations of type t. The maximum of these lower bounds over all resource types t added to depth(R) gives a lower bound on the completion time of any schedule that configuration R can lead to. An outline of the procedure to compute the lower-bound from a configuration R for type t is shown in Figure 4.3. The steps 3 and 4 are straightforward. Since L(I, t) = M (J, t) = 0, q(I, J, t) = unsch(R, t). The value for h(I, J, t) can be computed from q(I, J, t) using the formula in the previous section. It can be noted from the proof of Lemma 4.1 that h(i, j, t) + i + j for any i, j and t gives a lower bound on the number of time steps to schedule the remaining operations. 80, depth(R) + h(I, J, t) + I + J is a lower bound on the completion time of any schedule from R. Figure 4.4 describes an efficient way to compute values for I and J (steps 1 and 2 of Figure 4.3). 50 The value for I is given by min(a, b) where a = minu {q(u, t)} with the minimum taken over all u 6 ready(R). We ignore it if it has no successors of type t. (a = 0 if there is a type t operation in ready(R)). b = minu {q(u, t) - Fu} where Fu is the number of time steps it finished at R and the minimum is taken over all u E unfinished(R). We ignore it if it has no successors of type t- The value for J is given by : min“ {fl(u, t)} with the minimum taken over all u E {ready(R) U unfinished(R)}. If 21 has no successors of type t,then : we use M SAT(u) for [3(11, t) if u is a type t operation and u E ready(R); we ignore it otherwise. Figure 4.4. Computing the values for I and J For each node 21 and each operation type t, q(u, t) is defined as the minimum number of time steps that any type t successor of u can start after the starting time of it. For each node u and each Operation type t, 6(11, t) is defined as the minimum number of time steps that any schedule of the given DFG is going to take after the completion of all type t successors of u. The value for 6(11, t) can be computed as min,,{MSAT(v)} such that v is a type t successor of u. The formulas for the computation of I and J are based on the following lemma which is an extension of Lemma 2.1 to take multi-cycle operations into account. Lemma 4.2 : For any unscheduled operation 2: at configuration R, 2: is either a member of ready(R) or there exists a y in (ready(R) U unfinished(R)) such that y is a predecessor of 2:. Proof : Let :c be an unscheduled operation. If r is not in ready(R), then there is a prede- cessor p of a: that is not scheduled or scheduled but not finished executing. Among all such p, let q be the farthest from 2: i.e. the length of the longest path from q to a: is maximum among all p. (i) If q is scheduled but not finished executing, q is in on f inished(R). 
"- 51 readyal) = {1.5} unsch(R,+) = 5 unsch(R,*) = 2 depth(R) = 1 0(11 '1') = 1 B(11+) = 0 0101*) = 0 3(11") = 3 0(5,+) =1 fl(5,+) = 0 a(5W) = 0 [3(5.*) = 2 a(2,+) = 2 fi(2,+) = 0 oz(2,*) = 1 ,6(2,*) = 2 043,-?) = 2 fl(‘3.+) = 0 0(3.*) =1 3(3, *) = 2 I = min{a(1. +). 9(5, +), a(2, +). a(3. +)} = 1 J = min{fi(1,+),[3(5t+),B(2,+),fl(3i+)} = 0 For type +, q(I, J, +) = unsch(R, +) = 5 MI. J. +) = l(q(I, J. +)/1)1 *1 = 5 completionJime = depth(R) + h(I, J, +) + I + J = 7 I = 0 / at type at operation in ready(R) It: / J = min{fl(1t*),5(5t *),fi(2. *),fl(3t '0} = 2 For type 2:, q(I, J, at) = unsch(R, 4:) = 2 h(I,-7H“) = l(Q(I,Ji*)/1)l * 1 = 2 completioniime = depth(R) + h(I, J, at) + I + J = 5 Figure 4.5. An example for the estimation of lower-bound completion time of partial sched- ules 52 (ii) Otherwise : if q is not in ready(R), then there is a predecessor ql of q that is not scheduled or is in unfinished(R). Note that ql is a predecessor of 2: also. And, ql is farther than q from 2:, which is a contradiction. Hence, q is in ready(R). C] Figure 4.5 shows an example for the estimation of lower-bound completion time of a partially scheduled DFG. It is assumed that the scheduling of the first step is finished. There is one adder and one multiplier both with a delay of one time-step. The lower-bound completion time is 7 time-steps. If the target performance is 6 time-steps, the lower-bound estimation suggests that the selection of operations 2 and 3 in the first time-step is wrong. The method in this section is especially useful for a class of scheduling algorithms that compute the lower bound for a large number of configurations during the design space ex- ploration. The matrices a and [3 for all the Operations in the DFG and all resource types are computed only once before the design space exploration. This can be done by computing the transitive closure. For a directed graph, the transitive closure can be computed using depth-first search in O(n(e + n)) where e is the number of edges in the graph [67]. As already explained, e grows linearly in n in a DFG since the number of inputs of each open ation is generally bounded by a small number such as two. Thus, a and 6 can be computed in 0(n2) time. Note that only these matrices are used in computing the values of I and J (steps 1 and 2 of figure 4.3). Steps 3 and 4 take constant time. Hence, a lower-bound for any configuration R can be computed in O(lc) time where k is the number of operations in (ready(R) U unfinished(R)). Another major advantage of our method is that it introduces very little memory over- head. The only overhead is to store the matrices a and [3. When there are a large number Of partial schedules the memory requirement is dominated by the amount of information stored at each configuration. Any scheduling algorithm taking full advantage of our lower- bound estimation needs to store very little information at each configuration. 53 4.6 Experimental Results We implemented our methods in C language on a SUN Spare-2 workstation. We tested them using a number of benchmarks in the literature. The benchmarks we used are the AR Filter [27], the Differential Equation [55], the fifth-order elliptic Wave Filter [49], twice unfolded Wave Filter, the FIR Filter [53], the complex Biquad recursive digital Filter [49], the sixth-order elliptic Bandpass Filter [49], Discrete Cosine Transform [47], and Fast Discrete Cosine Transform [37]. 
4.6 Experimental Results

We implemented our methods in the C language on a SUN Sparc-2 workstation. We tested them using a number of benchmarks from the literature. The benchmarks we used are the AR Filter [27], the Differential Equation [55], the fifth-order elliptic Wave Filter [49], the twice unfolded Wave Filter, the FIR Filter [53], the complex Biquad recursive digital Filter [49], the sixth-order elliptic Bandpass Filter [49], the Discrete Cosine Transform [47], and the Fast Discrete Cosine Transform [37]. For the Biquad Filter example, we used three time steps for the multiplier and one for the adder (we used the same resource type adder to do addition, subtraction and comparison). For all the other examples, we used two time steps for multiplication and one for the adder.

4.6.1 Lower-bound estimation of partial schedules in branch and bound methods

Scheduling algorithms such as branch and bound methods rely on estimating lower bounds for partial schedules to keep the design space from exploding. Generally, lower bounds need to be estimated for a large number of partial schedules (configurations). If the time spent in lower-bound estimation itself is too high, it will have a big negative impact on the overall time taken by the scheduling algorithm. We implemented a branch and bound scheduling algorithm [10] and tested it on the benchmarks. We first find a schedule using a list scheduling algorithm. We use that performance as an upper bound in the branch and bound algorithm and search the design space exhaustively for an optimum schedule. From each partial schedule, we estimate a lower bound for the schedule completion time. If it exceeds the upper bound, the partial schedule cannot lead to a complete schedule with the target performance and it is not explored further.

We separately measured the time spent in lower-bound estimation using our method of Section 4.5 and Rim's method [65]. The results are reported in Table 4.1.

Table 4.1. CPU time and number of configurations in the branch and bound algorithm

                        Resources      CPU time (sec)            Configurations
     DFG                 +   *      Our method   Rim [65]     Our method    Rim
     AR Filter           3   3         0.3         7.1           1609       2953
                         2   3         0.35        8.2           1769       3497
                         1   3         1.1        22.6           6161       8001
     Once unrolled       2   3         0.13        2.9            770        723
     EW Filter           2   2         0.1         2.2            539        508
                         2   4         0.13        2.8            766        723
     Twice unrolled      2   2         2.1        42.4          18973      16366
     EW Filter           2   1         0.1         2.2            413        503
                         2   3         4.0        81.2          33182      27767
     Biquad              2   2         0.3         6.1           2355       2355
     Filter              2   1         0.8        17.6           6216       6384
                         1   2         1.1        24.0           7042       8173
     Fast Discrete       1   1         0.14        3.1            688        892
     Cosine Transform    2   2         0.11        2.6            474        774

Our method is at least 20 times faster in all cases. As a measure of the effectiveness of the lower-bound estimation in reducing the size of the search space, we also measured the number of configurations visited using each method separately. These results are also reported in Table 4.1. Both methods are equally effective. Since their estimation is very slow, the CPU time taken using their method is more than the time taken without using any lower-bound estimation in a few cases. However, in a majority of the cases, the search space exploded without lower-bound estimation, thus showing the necessity of estimation in branch and bound scheduling algorithms. Our method is more suitable than the existing methods for use in such scheduling algorithms. The results in Chapter 2 show the efficiency as well as the effectiveness of our estimation method in pruning the search space in our DPS algorithm.

4.6.2 Lower-bound estimation for the entire DFGs

We tested all the benchmarks with eleven different resource constraints for pipelined and non-pipelined multipliers. For all the cases, our lower bound is compared with an optimum solution we obtained using our DPS algorithm [8].
In Tables 4.2 and 4.3, we present the lower bounds for some of the cases as obtained by our method of Section 4.4. The column DPS in the tables shows the number of steps in an optimum solution. The lower bound is tight in 156 out of 198 cases. In 22 more cases, the difference is only one step. We also implemented the algorithms by Rim [65] and Sharma [69], and compared the results with ours. Our method gives a better lower bound than [65] in nine cases. In five cases, our lower bounds are better than [69]. In [65], the lower bounds by one more method, Jain [29], are also reported. Those are copied into the second-to-last column (Jain) of our tables. For the cases not reported in our tables, our lower bounds are identical to Rim's [65]. The average CPU times are 21 ms, 25 ms and 270 ms for our method, Rim's method and Sharma's method respectively. Thus, our method is faster than the fastest non-trivial method in the literature [65] and produces better lower bounds in more cases. Our method is an order of magnitude faster than the method in [69] and still produces better results in some cases. The lower bounds of [29] are far inferior to ours.

Table 4.2. Lower bounds with non-pipelined multiplier

                       Resources   our lower bound and an optimum solution    other lower bounds
     DFG                +   *      DPS   lower bound   difference    Rim [65]   Jain [29]   Sharma [69]
     AR Filter          1   1       34       34            0            34          32          34
                        1   2       18       18            0            18          16          18
                        1   3       16       14            2            14          12          14
                        2   3       15       14†‡          1            13          11          13
                        3   3       15       14†‡          1            13           -          13
                        2   4       11       11            0            11          11          11
     Diff. Eqn.         1   1       13       13            0            13          12          13
                        1   2        7        7            0             7           6           7
                        1   3        7        6            1             6           6           6
                        2   2        7        7            0             7           6           7
                        1   4        6        6            0             6           6           6
     EW Filter          1   1       28       27            1            27          26          27
                        2   1       21       21            0            21          17          21
                        2   2       18       18            0            18          17          18
                        3   3       17       17            0            17          17          17
     FIR Filter         1   1       18       18            0            18          16          18
                        1   2       15       15            0            15          15          15
                        2   2       11       11            0            11          10          11
                        2   3       10       10            0            10          10          10
     Twice unrolled     1   1       84       79            5            79          78          79
     Wave Filter        2   2       52       50            2            50          49          50
                        3   2       50       50            0            50          49          50
                        3   3       49       49            0            49          49          49
     Bandpass           1   1       25       25            0            25          24          25
     Filter             1   2       17       17            0            17          17          17
                        2   2       13       13            0            13          12          13
                        2   3       11       11            0            11          10          11
                        2   4       10       10            0            10          10          10
     Biquad             1   1       27       24            3            24          21          24
     Filter (*)         2   2       17       16†           1            15          11          15
                        3   2       16       16†           0            15           -          16
                        2   4       14       13†           1            12           -          13
                        3   4       13       13†           0            12           -          13
                        3   3       13       13            0            13          11          13
     Discrete Cosine    1   1       34       34            0            34           -          34
     Transform          2   2       18       18            0            18           -          18
                        3   3       14       14†‡          0            13           -          13
     Fast Discrete      1   1       34       34            0            34           -          34
     Cosine Transform   2   2       18       18            0            18           -          18
                        2   3       14       14†‡          0            13           -          13
                        3   3       14       14†‡          0            13           -          13

     (*) Complex multiplication takes 3 time steps.
     †   Our lower bound for this case is better than Rim's.
     ‡   Our lower bound for this case is better than Sharma's.

4.7 Conclusions

We have presented simple, efficient and powerful techniques for estimating the lower-bound completion time for the scheduling problem in high-level synthesis. The proposed techniques can handle multi-cycle operations, pipelined functional units, and chaining of operations. Our method for entire DFGs produces better lower bounds than [65] and [69] in many cases. It is more than 10 times faster than [69] and 20% faster than [65]. We presented an extension of our method to estimate the lower-bound completion times of partially scheduled DFGs even more efficiently. Upon extensive experimentation, it was found to be at least 20 times faster than the fastest non-trivial method known in the literature [65] while being equally effective in reducing the size of the search space. It is very useful for scheduling algorithms such as branch and bound methods in keeping the search space from exploding. We used this method in our DPS algorithm and obtained excellent results.
Table 4.3. Lower bounds with pipelined multiplier

                       Resources   our lower bound and an optimum solution    other lower bounds
     DFG                +   *      DPS   lower bound   difference    Rim [65]   Jain [29]   Sharma [69]
     AR Filter          1   1       19       19            0            19          16          19
                        1   2       16       14            2            14          12          14
                        2   2       13       12            1            12          11          12
                        2   4       11       11            0            11          11          11
     Diff. Eqn.         1   1        8        8            0             8           6           8
                        1   2        6        6            0             6           6           6
     EW Filter          1   1       28       27            1            27          26          27
                        2   1       19       18            1            18          17          18
                        3   2       17       17            0            17          17          17
     FIR Filter         1   1       15       15            0            15          15          15
                        2   1       11       11            0            11          10          11
                        2   2       10       10            0            10          10          10
     Twice unrolled     1   1       84       79            5            79          78          79
     EW Filter          2   1       52       50            2            50          49          50
                        3   1       50       50            0            50          49          50
                        3   2       49       49            0            49          49          49
     Bandpass           1   1       17       17            0            17          17          17
     Filter             2   1       14       14            0            14          12          14
                        2   2       10       10            0            10          10          10
     Biquad             1   1       19       18            1            18          15          18
     Filter (*)         2   1       15       14            2            14          11          14
                        2   2       13       12            1            12          11          12
                        2   3       12       12            0            12          11          12
                        3   3       11       11            0            11          11          11
     Discrete Cosine    1   1       32       32            0            32           -          32
     Transform          2   2       16       16            0            16           -          16
                        3   3       11       11            0            11           -          11
     Fast Discrete      1   1       26       26            0            26           -          26
     Cosine Transform   2   2       13       13            0            13           -          13
                        2   3       13       13            0            13           -          13
                        3   3       10       10            0            10           -          10

     (*) Complex multiplication takes 3 time steps.

CHAPTER 5

Scheduling CDFGs

5.1 Introduction

In the presence of conditional constructs such as if-then-else and case, conditional resource sharing may have a major impact on the efficiency of designs. Several scheduling algorithms have been proposed in the literature [22, 30, 62, 73] to handle conditional branches. Except for [62], no other work comprehensively addresses the issues pertaining to conditional branches. However, the symbolic technique used in their work is computationally very expensive and has failed to solve large problems using exact methods [60]. Some of the important issues of conditional branches include conditional resource sharing beyond pair-wise mutual exclusiveness, speculative execution of operations and simultaneous scheduling of parallel conditional trees. A concise representation of the conditional behavior is fundamental to addressing these issues efficiently. The existing representations [61, 63, 72] have various limitations. In this chapter, we describe a new representation for CDFGs in which the CDFG operations are labeled in a dot and dash notation that precisely represents their conditional behavior. Based on this representation, we present an efficient method for analyzing conditional resource sharing among an arbitrary number of operations. Speculative execution and simultaneous scheduling of parallel trees are supported, thus overcoming the limitations of previous works. The conditional resource sharing analysis method is incorporated into our DPS algorithm to obtain optimal schedules that exploit global sharing.

5.2 Related Work

Several representation schemes have been proposed in the recent literature to capture conditional behavior [61, 63, 72]. The node tagging scheme in [63] cannot handle case structures and can only support pair-wise resource sharing between operations. Condition Vectors in [72] cannot be used to correctly analyze resource sharing between operations from parallel conditional trees. The guard-based representation in [61] can perform resource sharing analysis for an arbitrary number of operations. However, the complexity of this analysis is very high. For each operation, they assign a Boolean guard function. The representation of case structures is not straightforward in their scheme and is not explained in their paper.

Several scheduling algorithms have been proposed in the literature [22, 30, 62, 73] to handle conditional resource sharing. The algorithm in [30] cannot exploit sufficient potential parallelism because they consider only pair-wise mutual exclusiveness of CDFG operations and do not allow speculative execution. Condition Vectors [72] are used in Condition Vector List Scheduling (CVLS) [73] to analyze resource sharing beyond pair-wise mutual exclusiveness.
Speculative execution is allowed in their method. Tree-based Scheduling (TS) [22] also allows movement of operations. However, both CVLS and TS are restricted to nested conditional trees. Parallel conditional trees are addressed in [73], but the trees are either scheduled sequentially or duplicated. A recent paper by Radivojevic et al [62] presents a comprehensive treatment of global conditional resource sharing. Their method is based on the symbolic technique of [60] and is computationally very expensive, as already pointed out.

5.3 CDFG Definitions and Issues

A CDFG (Control/Data Flow Graph) G = (V, E, C) is a (directed) graph representation of a behavioral description where V is the set of operations, E denotes the set of data dependencies (precedence constraints) and C denotes the set of control precedences between the operations. A conditional operation generates a control signal that fires one of its mutually exclusive branches. If (x, y, k) ∈ C, then the operation y belongs to the kth branch of the conditional operation x, and the operation x is called a control predecessor of y. A conditional operation x is called a control ancestor of an operation y if there is a directed path from x to y in G using only the arcs in C. Figure 5.1(b) shows the CDFG for the behavioral description in Figure 5.1(a). Conditional operations are shown as boxes and other operations are shown as circles. The data dependencies are shown as solid arcs and control dependencies are shown as broken arcs.

All the operations that belong to the same branch of the same control predecessor are said to belong to the same block. Note that this definition of a block is different from the concept of 'basic block' used in other works. For example, operations D and J in the CDFG in Figure 5.1(b) belong to different basic blocks. But they belong to the same block because they have the same control predecessor C and belong to the same branch of C. Thus, a block precisely represents the conditional behavior of its operations. A conditional operation in a block may in turn have one or more branch blocks. We refer to the control predecessor and control ancestors of the operations in a block also as the control predecessor and control ancestors of the block. A block a is a branch block of another block b if the control predecessor of a is in b; b is called the parent block of a. A block a is a nested branch block of another block b if b contains a control ancestor of a; b is called an ancestor block of a. The root block is an ancestor of all the blocks in the CDFG and contains all the operations that are always executed, i.e., the operations with no control predecessors. Examples of blocks in the example CDFG are {A, B, C, M}, {D, E, J}, and {H, I}. The block {A, B, C, M} is the root block. The blocks {F, G} and {H, I} are the branch blocks of operation E. Thus, E is the control predecessor of F, G, H and I.

There are several issues that need special attention in maximizing conditional resource sharing. Some of the issues are briefly discussed below.

• Resource sharing beyond pair-wise mutual exclusiveness : It is often necessary for scheduling algorithms to analyze conditional resource sharing for a set of operations, i.e., to determine the minimum number of resources required by a set of operations. Using pair-wise mutual exclusiveness, it may not be possible in some cases to compute the minimum requirement without employing complex algorithms such as clique partitioning.
Resource sharing analysis for an arbitrary number of operations is addressed in [72, 61]. A case was reported in [63] where [72] gives erroneous conclusions on the mutual exclusiveness of operations. The complexity of this analysis in [61] is very high. We present here a method to obtain maximum conditional resource sharing for a set of operations that runs in polynomial time.

• Speculative execution of operations : In speculative execution, an operation in a branch block can be executed before the corresponding conditional (control predecessor). In some cases, allowing speculative execution results in faster schedules. To allow speculative execution, resource sharing analysis should take into account the fact that operations executed in a speculative fashion cannot share resources even if they belong to different branches of the corresponding conditional. Furthermore, some implicit precedences should be imposed explicitly. For example, in the example CDFG, the conditional operation E needs to be executed before J since the data used by J depends on the result of E. This kind of implicit precedence is handled in [62] by imposing a strict precedence between the conditional and the join node. This may enforce unnecessary precedences, as pointed out in [63]. We track these precedences more dynamically and avoid unnecessary precedences.

    [Figure 5.1. An example CDFG : (a) a behavioral description beginning
     p = a + b; q = p + c; if (q ...) ...; (b) the corresponding CDFG.
     The drawing is not reproduced here.]

5.4 Representation of CDFGs

In this section, we present a new representation of CDFGs and a method for conditional resource sharing analysis using the representation. Each operation in our representation is given a label that reflects its conditional behavior. The data precedences are stored as a list of immediate predecessors for each operation. Associated with each operation x in V, there is a type T_x indicating that a resource of type T_x should be used to perform operation x.

5.4.1 The Dot and Dash Notation

In our representation scheme, we label each operation by a sequence of numbers separated by alternating dots (.) and dashes (-). The number following a dot denotes the sequence number within a block and the number following a dash denotes the branch number among the branches of a conditional operation. For example, the label 1.2-1.3-2.4 represents the fourth operation in the block 1.2-1.3-2. The block 1.2-1.3-2 is the second branch block of the conditional operation labeled 1.2-1.3. The operation 1.2-1.3 in turn is the third operation in the first branch of another conditional operation 1.2, which is the second operation in the root block. Thus, the control path of an operation is implicitly described by its label.

Figure 5.2 shows our representation of the example CDFG. Each operation has three parts: a dot and dash label, the type of the operation (in braces) and the list of its immediate data predecessors (in square brackets). Note that the control precedence information is implicit in the label and is not imposed as a strict precedence.

    A : 1.1 {+} [ ]
    B : 1.2 {=} [ 1.1 ]
    C : 1.3 {<} [ 1.2 ]
    D : 1.3-1.1 {+} [ 1.2 ]
    E : 1.3-1.2 {<} [ 1.3-1.1 ]
    F : 1.3-1.2-1.1 {+} [ 1.3-1.1 ]
    G : 1.3-1.2-1.2 {-} [ 1.3-1.2-1.1 ]
    H : 1.3-1.2-2.1 {+} [ 1.3-1.1 ]
    I : 1.3-1.2-2.2 {-} [ 1.3-1.2-2.1 ]
    J : 1.3-1.3 {-} [ 1.3-1.2-1.2  1.3-1.2-2.2 ]
    K : 1.3-2.1 {+} [ 1.2 ]
    L : 1.3-2.2 {+} [ 1.3-2.1 ]
    M : 1.4 {+} [ 1.3-1.3  1.3-2.2 ]

Figure 5.2. Our representation for the example CDFG
5.5 Conditional Resource Sharing for a Set of Operations

Let M be the set of operations considered for scheduling at a particular control step. Let α(b, t) denote the number of type t resources a block b requires, i.e. the number of type t resources required by the operations that belong to either block b or one of its nested branch blocks. A type t operation p in M contributes one to α(b, t) if it belongs to b. Additionally, each conditional operation q in b (even if q is not in M) may contribute to α(b, t), due to the resource requirements of the corresponding branch blocks. Let β(q, t) denote the contribution of q to α(b, t). If q is already resolved, i.e. the value of the control signal generated by q is available at the current control step, only one of the branch blocks of q needs to be executed. Hence, β(q, t) is given by the maximum of α(c, t), taken over all branch blocks c of q. If q is not resolved (speculative execution), then β(q, t) is the sum of such α(c, t). Note that the total number of type t resources required by M is given by α(B, t), where B is the root block.

Figure 5.3 shows the procedure to compute the number of resources of each type required by M. We schedule one operation from M at a time, each time updating the total resource requirement accordingly.

    resource_sharing(M)
    /* Computes the resource requirement for the set M of
       operations currently selected for scheduling */
    {
        for each resource type t {
            initialize α(b, t) to zero for all blocks b
            initialize β(q, t) to zero for all conditional operations q
        }
        for each operation p in M {
            t ← type of p
            call update_req(p, t)
        }
    }

Figure 5.3. Procedure for Conditional Resource Sharing Analysis

Figure 5.4 shows pseudo-code for the recursive procedure update_req(p, t), which updates the total resource requirement when a type t operation p is scheduled. If p belongs to block b, α(b, t) is incremented by one. Then, the change in α(b, t) is passed to the parent block of b through q, the control predecessor of p. First, β(q, t) is reevaluated. If there is a change in the value of β(q, t), the resource requirement of the block of q is updated by calling the procedure update_req recursively. This recursive process is continued until either the root block is reached or there is no change in the resource requirement.

    update_req(p, t)
    {
        b ← block of p
        q ← control predecessor of p
        α(b, t) = α(b, t) + 1
        if b is the root block, return
        if q is resolved {
            if α(b, t) ≤ β(q, t)
                return    /* No change in the resource requirement */
            else
                β(q, t) = α(b, t)
        }
        else {    /* q is not resolved */
            β(q, t) = β(q, t) + 1
        }
        call update_req(q, t)    /* Update the requirement of the block of q */
    }

Figure 5.4. Pseudo-code to update the total resource requirement

When all the operations from the set are added, α(B, t) is the number of type t resources required by that set of operations. Trivially, it takes O(k) time to update the resource requirement with respect to an operation p, where k is the number of nested levels from the root block to p. The dot and dash notation provides a mechanism to easily track the nested conditionals and the corresponding blocks from the operation to the root block. This method can be used to compute the requirement of not only functional units but also resources like busses and registers.

The above method takes into account the impact of speculative execution on resource sharing. Out-of-order execution of conditionals can be allowed with the labeling scheme and resource sharing analysis, and parallel trees can be scheduled simultaneously. The implicit precedence discussed in the previous section is handled as follows. Let operation q be an immediate predecessor of p, so p cannot be scheduled until q is executed. We additionally impose the following implicit precedence: operation p cannot be scheduled until all the nested conditionals from q up to r are executed, where r is the corresponding conditional in the same block as p. Using our representation, all these conditionals can be easily determined.
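For concreteness, here is one possible executable rendering of the procedures of Figures 5.3 and 5.4 in Python. The Block and Op classes and the dictionary-based α and β tables are scaffolding we introduce for the sketch; they are not part of the thesis implementation. The example at the end reproduces the two cases discussed with Figure 5.6 below.

    from collections import defaultdict

    class Block:
        """A conditional block; 'pred' is the conditional operation
        guarding it (None for the root block)."""
        def __init__(self, pred=None):
            self.pred = pred

    class Op:
        """An operation of resource type 'rtype' in block 'block'; for a
        conditional, 'resolved' says whether its control signal is
        already available in the current control step."""
        def __init__(self, rtype, block, resolved=False):
            self.rtype, self.block, self.resolved = rtype, block, resolved

    def update_req(b, t, alpha, beta):
        """Figure 5.4: account for one more type-t operation in block b
        and propagate any change in the requirement toward the root."""
        alpha[(b, t)] += 1
        q = b.pred
        if q is None:
            return                          # reached the root block
        if q.resolved:
            # beta(q,t) is the max over q's branch blocks; propagate
            # only if this branch now dominates
            if alpha[(b, t)] <= beta[(q, t)]:
                return
            beta[(q, t)] = alpha[(b, t)]
        else:
            # speculative execution: branches cannot share, so sum them
            beta[(q, t)] += 1
        update_req(q.block, t, alpha, beta)  # update q's own block

    def resource_sharing(M):
        """Figure 5.3: resources of each type required by the set M."""
        alpha, beta = defaultdict(int), defaultdict(int)
        for p in M:
            update_req(p.block, p.rtype, alpha, beta)
        root = M[0].block
        while root.pred is not None:         # walk up to the root block
            root = root.pred.block
        return {t: alpha[(root, t)] for t in {p.rtype for p in M}}

    # With C resolved, D and K (different branches of C) share one adder;
    # with E unresolved, F and H need two adders (speculative execution).
    root = Block()
    C = Op('<', root, resolved=True)
    D = Op('+', Block(pred=C)); K = Op('+', Block(pred=C))
    print(resource_sharing([D, K]))          # {'+': 1}
    E = Op('<', Block(pred=C), resolved=False)
    F = Op('+', Block(pred=E)); H = Op('+', Block(pred=E))
    print(resource_sharing([F, H]))          # {'+': 2}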
5.6 Lower-bound Estimation for CDFGs

For the lower-bound estimation of CDFGs, we use the same approach as for DFGs, i.e. dividing each type of operations into three non-overlapping intervals. As explained in Section 4.4, the lengths of the first and the third intervals are independent of resource constraints. The length of the second interval is a function of the total resource requirement of the operations that should be scheduled in that interval. If there are no conditional branches, the resource requirement is equal to the number of operations. In the presence of conditional branches, however, more than one operation can share one resource in the same time-step. Effectively, an operation requires only a fraction of a resource. If an operation can share resources with at most n other operations in the same time-step, its minimum resource requirement is 1/(n + 1). We refer to this quantity as the weight of that operation. For any given resource type t, the minimum total resource requirement in an interval can be computed as the sum of the weights of all type t operations in that interval. Given the weights of individual operations, the computation of the sum of weights of operations in each interval is similar to the computation of the number of operations, with no impact on the complexity.

For lower-bound estimation from a configuration S, we use SumWeights(S, t), the sum of the weights of the unscheduled operations of each type t, in place of unsch(S, t). This can be easily computed from the corresponding value at its parent configuration. At the root configuration, it is the sum of the weights of all operations of each type in the CDFG. Thus, the only increase in complexity with our extension to conditional branches is due to the computation of the weights of operations.

Figure 5.5 shows an outline of the procedure to compute the weights of individual operations.

    compute_weights()
    /* Computes weights of all operations in the CDFG */
    {
        Partition all the operations into conditional blocks
        for each operation x in the CDFG {
            t ← type of x
            b ← block of x
            n ← number of blocks that have a type t operation
                and are mutually exclusive with the block b
            weight(x) ← 1/(n + 1)
        }
    }

Figure 5.5. Outline of the procedure to compute the weights of operations

Since all the operations in a block have the same control behavior, the concept of mutual exclusiveness between operations can be easily extended to blocks. If x is a type t operation in the block b, then at any given control step x can share a resource with at most one other type t operation from any other block that is mutually exclusive with b. Hence, if there are n blocks that are mutually exclusive with b and that have a type t operation, then the weight of x is 1/(n + 1). If x is in the root block, the weight of x is 1.
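The weight computation of Figure 5.5 can be phrased directly over the dot-and-dash labels. The sketch below is again illustrative rather than the thesis implementation: block_label and the ops dictionary are our own scaffolding, mutually_exclusive is reused from the earlier label sketch (it applies to block labels unchanged), and exact fractions are used so the weights sum without rounding error.

    from fractions import Fraction

    def block_label(label):
        """The block of an operation: the label left of the rightmost dot."""
        return label[:label.rfind('.')]

    def compute_weights(ops):
        """Weights per Figure 5.5; 'ops' maps an operation name to a
        (label, resource type) pair.  mutually_exclusive() is the label
        test from the earlier sketch, applied here to block labels."""
        weights = {}
        for name, (label, t) in ops.items():
            b = block_label(label)
            # distinct blocks holding a type-t operation and mutually
            # exclusive with b (an operation's own block never qualifies)
            rivals = {block_label(l) for l, rt in ops.values()
                      if rt == t and mutually_exclusive(block_label(l), b)}
            weights[name] = Fraction(1, len(rivals) + 1)
        return weights

    # For the example CDFG, F can share an adder with one operation from
    # block 1.3-1.2-2 and one from block 1.3-2, so its weight is 1/3.
    ops = {'A': ('1.1', '+'), 'D': ('1.3-1.1', '+'), 'F': ('1.3-1.2-1.1', '+'),
           'H': ('1.3-1.2-2.1', '+'), 'K': ('1.3-2.1', '+')}
    print(compute_weights(ops)['F'])  # 1/3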
The method in [68] to handle conditional branches is an extension of [69]. In their approach, for each interval the operations from only one conditional path are considered, so as to maximize the minimum resource requirement in that interval. Since the conditional path analysis is performed for each interval, their method is very slow. When used in scheduling algorithms on partial schedules, the actual time spent in estimation can itself outweigh the advantage of the resultant pruning. In comparison, our method computes the weights of operations only once, and the complexity of the remaining steps remains unchanged. Their method is based on a distribute-join representation of CDFGs, which is a C-select implementation. In a C-select implementation, the operations in conditional branches cannot be executed until the corresponding condition is resolved. Our estimation method can support both C-select implementation and speculative execution. In C-select implementation, the control precedences are treated the same way as data dependencies and are considered in computing the ASAP and MSAT values of operations. In speculative execution, control dependencies are ignored while computing ASAP and MSAT values.

5.7 DPS Algorithm for CDFGs

In this section, we describe how the conditional resource sharing analysis method can be used in our DPS algorithm to extend it to schedule CDFGs. Figure 5.6 shows the design search space for the example CDFG using one adder, one subtracter and one comparator. All operations take one control step.

[Figure 5.6. Design search space for the example CDFG]

At configuration 3, both the addition operations D and K can be scheduled using one adder, because they belong to different branches of C and C is already resolved. At configuration 4, however, F and H cannot share resources, because E is not resolved and one of them will be executed in a speculative fashion.

Figure 5.7 shows the outline of the DPS algorithm for scheduling CDFGs. This algorithm is applied to a configuration R when a subset M of ready(R) is selected for scheduling (dps_cdfg(R, M)). First, the resource sharing analysis procedure of Figure 5.3 is called to compute the number of resources required of each type. A corresponding child configuration S is generated only if the resource requirement does not exceed the resource availability. Processing of newly generated configurations is done the same way as for DFGs. The lower-bound completion times can be estimated from configurations as explained in Section 5.6.
Equivalence of configurations can be checked the same way as in the case of DFGs.

    dps_cdfg(R, M)
    /* This DPS algorithm is applied to each configuration R when a
       subset M is selected for scheduling from its ready set */
    {
        call resource_sharing(M)
        if (resources required exceed resources available)
            /* The set of operations M cannot share available resources */
            return
        else
            generate a child configuration S of R by scheduling M
            in the current control step
        if (lower-bound completion time from S exceeds upper bound)
            /* lower-bound estimation */
            prune S
        if (an equivalent configuration P is found) {
            /* Dynamic programming technique */
            if (depth(S) >= depth(P))
                prune S
            if (depth(S) < depth(P))
                add S to the search space and prune P
        }
        if (no equivalent configuration found)
            add S to the search space
    }

Figure 5.7. Outline of the DPS algorithm for CDFGs

5.8 Experimental Results

The algorithm for resource sharing analysis described in this chapter was implemented in our original DPS scheduling algorithm. We tested our methods on many benchmarks in the literature. Table 5.1 shows the results for examples with conditional behavior: Maha from [52], Kim from [30], Waka from [72] and MulT from [73]. Parker is Maha with addition A6 replaced by a subtraction. The results, in terms of the total number of control steps, are compared to the optimal or best results known. The Resources column lists the number of adders, subtracters and comparators used in each case. All additions and subtractions are single-cycle. We use the same set of assumptions as in [62] about comparison operations. For the examples Maha, Parker and MulT, the result of a comparison is available to other operations in the same control step, and the branches of that conditional become mutually exclusive in the same step. For the other examples, a comparison takes one control step.

Table 5.1. Number of total control steps for Conditional Branch benchmarks

    Benchmark   Resources (+,-,<)   Ours   Symb. [62]   Kim et al. [30]   Wakabayashi [73]
    Maha        1,1,-               5      5            5                 5
    Maha        2,3,-               4      4            -                 4
    Parker      2,3,-               4      4            -                 4
    Kim         2,1,1               6      6            7                 6
    MulT        2,1,1               3      3            -                 4
    Waka        1,1,2               7      7            7                 -

Kim et al. do not support speculative execution and consider only pair-wise mutual exclusiveness. For the example Kim, their method takes 7 steps; our method takes only 6 steps with speculative execution. The results for MulT show the benefits of scheduling parallel trees simultaneously: CVLS [73] takes 4 steps while our method takes only 3 steps.

The advantages of speculative execution are shown in Table 5.2. It shows optimal schedule lengths without (column No spec.) and with (column Spec. Exec.) speculative execution. The Resources column shows the number of adders, subtracters and comparators used in each case. For all examples, we assume that all operations take one cycle. If there is no speculative execution, operations from mutually exclusive branches can always share resources; however, since control precedences are then strict precedences, the critical path length may increase. In Table 5.2, Maha and Parker are two examples with a high degree of branching. When speculative execution is not allowed, the advantage of conditional resource sharing is nullified by the increase in the critical path length: the length of the schedules could not be reduced even by adding more resources. In comparison, speculative execution gives much superior results, and adding more resources reduces schedule lengths.
The only other work that addresses speculative execution with no limitations is the method in [62]. However, the symbolic technique it uses is computationally very expensive and has failed to solve some large scheduling problems exactly [60].

Table 5.2. Speculative execution in Conditional Branch benchmarks

    Benchmark   Resources (+,-,<)   No spec.   Spec. Exec.
    Maha        1,1,1               11         7
    Maha        2,1,1               11         6
    Maha        2,2,2               11         5
    Parker      1,1,1               11         6
    Parker      2,2,1               11         5
    Parker      2,2,2               11         5
    Kim         1,1,1               9          9
    Kim         1,2,1               8          6
    Kim         2,2,2               7          6
    Waka        1,1,1               7          7
    Waka        2,2,2               7          7
    MulT        1,1,1               5          4
    MulT        2,1,1               5          3
    MulT        2,2,1               5          3

5.9 Conclusions

We proposed a new representation for CDFGs in which operations are labeled in a dot and dash notation. The dot and dash labels capture the conditional behavior of the operations, including constructs such as case statements. Exploiting this representation, a method for conditional resource sharing analysis is presented. This method is shown to be efficient in computing the resource requirement for a set of operations, and it allows speculative execution of operations and out-of-order execution of conditionals. The method is incorporated into our DPS algorithm to extend it to schedule CDFGs. The DPS algorithm for CDFGs overcomes the limitations of several previous works and exploits global sharing to obtain schedules with the minimum number of control steps in very short computational times.

CHAPTER 6

Conclusions

Scheduling datapath operations into the best control steps is a task whose importance has been recognized in many high-level synthesis systems. Many heuristics proposed in the literature to solve this intractable problem do not work well in all scenarios. This is especially true when there are constraints such as bus and register constraints. The current exact scheduling algorithms in the literature suffer from the combinatorial explosion of the search space because they do not apply sufficient pruning. Most importantly, none of the methods attempts to detect equivalent states in the search space. Based on this observation and some other characteristics of the scheduling problem, we developed algorithms to address various aspects of the scheduling problem in high-level synthesis. We described these algorithms in the previous chapters of this thesis. This chapter presents a summary of the contributions of our work and the further research that needs to be done in order to maximally utilize the techniques developed in this thesis.

6.1 Summary of Contributions

The contributions of this work can be summarized as follows.

• A new optimal dynamic programming algorithm is proposed for the scheduling problem in high-level synthesis. The algorithm minimizes the number of control steps (clock cycles) under resource constraints. A novel feature of our algorithm is that it efficiently detects equivalent configurations (partial schedules) and explores only one of the best among them. A set of simple yet powerful techniques to prune the search space is invented.

• Our DPS algorithm uses a succinct representation of configurations in the search space and stores very little data at each configuration. The child configurations are generated from their parent configuration very efficiently, exploiting the representation.

• Our scheduling algorithm is extended to handle multi-cycle operations, pipelined functional units, and scheduling under bus and register constraints. Experimental results on large benchmarks show that optimal results are obtained in very short computational times.
Our method is an order of magnitude faster than the existing exact methods in the literature.

• A new approach for lower-bound estimation of performance is proposed. Using our approach, we obtained faster and better results than some of the best known methods in the literature. In most cases, our method produces tight lower bounds. The approach is extended to estimate lower-bound completion times of partially scheduled data flow graphs. We used this approach in our dynamic programming scheduling algorithm, where it is shown to be very effective in pruning the design space while being very efficient. The estimation algorithms can be used in any branch-and-bound or multi-schedule scheduling algorithm.

• A new representation scheme to represent behaviors with conditional constructs (Control/Data Flow Graphs) is proposed. The operations are labeled in a novel dot-and-dash notation which precisely captures their conditional behavior.

• Based on this representation, a new method for conditional resource sharing analysis is proposed. This method can address issues like speculative execution and resource sharing beyond pair-wise mutual exclusiveness, which are very important in obtaining efficient designs for CDFGs. The resource sharing analysis method can be used in any scheduling algorithm.

• The resource sharing analysis method is incorporated into our dynamic programming scheduling algorithm, thus extending it to handle conditional branches. Our lower-bound estimation algorithms are also extended to handle conditional branches.

6.2 Future Research

There are several problems that need to be addressed further.

• The scheduling algorithm and the lower-bound estimation algorithms need to be extended to handle multi-function functional units.

• The algorithms need to be extended to handle loops and functional pipelining, i.e. pipelined datapaths.

• The current lower-bound estimation method considers constraints on the number of functional units only. It may be possible to obtain better lower bounds by including additional constraints such as bus constraints and register constraints. The lower-bound estimation method needs to be extended in this direction.

• Further investigation of ways to improve the performance (in terms of speed and memory usage) of our methods is necessary. Some of the issues are: how the merits of both the breadth-first and depth-first approaches can be combined into a single approach; and, in addition to the 'equivalence' relation between configurations, whether a 'better' relation between configurations can be determined that is easily implemented.

• Since our algorithm can have exponential complexity in the worst case, it is important to have the capability to search only a part of the search space for a near-optimal solution. This can be done by imposing a limit on the number of configurations explored at each depth. However, this should be done intelligently, such that the best configurations are selected for further exploration.

• Finally, a parser needs to be implemented that translates VHDL descriptions of digital systems into our representation of CDFGs.

BIBLIOGRAPHY

[1] G. Brassard and P. Bratley, "Algorithmics: Theory and Practice", Prentice Hall, 1988.

[2] R.K. Brayton, R. Camposano, et al., "The Yorktown Silicon Compiler", in Silicon Compilation, Addison-Wesley, pp.204-310, 1988.

[3] R.E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation", IEEE Transactions on Computers, Vol.C-35, No.8, August 1986.
[4] R.E. Bryant, "Symbolic Boolean Manipulation with Ordered Binary Decision Diagrams", ACM Computing Surveys, Vol.24, No.3, September 1992.

[5] R. Camposano and W. Wolf, editors, "High-Level VLSI Synthesis", Kluwer Academic Publishers, 1991.

[6] R.J. Cloutier and D.E. Thomas, "The Combination of Scheduling, Allocation, and Mapping in a Single Algorithm", Proceedings of the Design Automation Conference, pp.71-76, June 1990.

[7] S. Chaudhuri, R.A. Walker and J.E. Mitchell, "Analyzing and Exploiting the Structure of the Constraints in the ILP Approach to the Scheduling Problem", IEEE Trans. on VLSI Systems, Vol.2, No.4, pp.456-471, December 1994.

[8] M.J. Chung and G. Tiruvuri, "Optimum Dynamic Programming Scheduling under Resource Constraints", Technical Report MSUCPS-TR95-44, Computer Science Department, Michigan State University, 1995.

[9] M.E. Dalkilic and V. Pitchumani, "A Multi-Schedule Approach to High-Level Synthesis", International Conference on Computer Design, pp.572-575, 1994.

[10] S. Davidson, D. Landskov, B.D. Shriver, and P.W. Mallett, "Some Experiments in Local Microcode Compaction for Horizontal Machines", IEEE Transactions on Computers, pp.460-477, July 1981.

[11] S. DeVadas and A.R. Newton, "Algorithms for Hardware Allocation in Data Path Synthesis", IEEE Transactions on Computer-Aided Design, Vol.8, No.7, pp.768-781, July 1989.

[12] Y. Fann, et al., "Global Scheduling For High-Level Synthesis Applications", 31st Design Automation Conference, 1994.

[13] E.B. Fernandez and B. Bussell, "Bounds on the Number of Processors and Time for Multiprocessor Optimal Schedules", IEEE Transactions on Computers, Vol.C-22, No.8, pp.745-751, August 1973.

[14] D. Gajski, N. Dutt, et al., "High-Level Synthesis: Introduction to Chip and System Design", Kluwer Academic Publishers, 1992.

[15] D. Gajski, R. Potasman, J. Lis, and A. Nicolau, "Percolation Based Synthesis", Proceedings of the Design Automation Conference, pp.444-449, June 1990.

[16] M.R. Garey and D.S. Johnson, "Computers and Intractability", W.H. Freeman and Co., 1979.

[17] C.H. Gebotys and M.I. Elmasry, "Simultaneous Scheduling and Allocation for Cost Constrained Optimal Architectural Synthesis", Proceedings of the Design Automation Conference, pp.2-7, 1991.

[18] E.F. Girczyc and J.P. Knight, "An Ada to Standard Cell Hardware Compiler based on Graph Grammars and Scheduling", International Conference on Computer Design, pp.726-731, October 1984.

[19] G. Tiruvuri and M.J. Chung, "Estimation of Lower Bounds in Scheduling Algorithms for High-Level Synthesis", under revision for ACM Transactions on Design Automation of Electronic Systems.

[20] E.M. Gurari and I.H. Sudborough, "Improved Dynamic Programming Algorithms for Bandwidth Minimization and the MinCut Linear Arrangement Problem", Journal of Algorithms, Vol.5, No.4, pp.531-546, December 1984.

[21] C.Y. Hitchcock and D.E. Thomas, "A Method of Automatic Data Path Synthesis", Proceedings of the Design Automation Conference, pp.484-489, July 1983.

[22] S.H. Huang, et al., "A Tree-Based Scheduling Algorithm For Control-Dominated Circuits", 30th Design Automation Conference, 1993.

[23] Y. Hu, A. Ghouse and B.S. Carlson, "Lower Bounds on the Iteration Time and the Number of Resources for Functional Pipelined Data Flow Graphs", International Conference on Computer Design, pp.21-24, 1993.

[24] C.T. Hwang, J.H. Lee and Y.C. Hsu, "A Formal Approach to the Scheduling Problem in High Level Synthesis", IEEE Trans. on Computer-Aided Design, Vol.10, No.4, pp.464-475, April 1991.
[25] C.T. Hwang, Y.C. Hsu and Y.L. Lin, "Optimum and Heuristic Data Path Scheduling Under Resource Constraints", 27th Design Automation Conference, pp.65-70, June 1990.

[26] C.T. Hwang and Y.C. Hsu, "Zone Scheduling", IEEE Trans. on Computer-Aided Design, Vol.12, No.7, pp.926-934, July 1993.

[27] R. Jain, K. Kucukcakar, M.J. Mlinar, and A.C. Parker, "Experience with the ADAM Synthesis System", Proceedings of the Design Automation Conference, pp.56-61, 1989.

[28] R. Jain, A. Mujumdar, A. Sharma, and H. Wang, "Empirical Evaluation of Some High-Level Synthesis Scheduling Heuristics", Proceedings of the Design Automation Conference, pp.686-689, 1991.

[29] R. Jain, A.C. Parker, and N. Park, "Predicting System-Level Area and Delay for Pipelined and Non-pipelined Designs", IEEE Trans. on Computer-Aided Design, Vol.12, No.8, August 1992.

[30] T. Kim, et al., "A Scheduling Algorithm For Conditional Resource Sharing", Proceedings of ICCAD, 1991.

[31] H. Komi, S. Yamada and K. Fukunaga, "A Scheduling Method by Stepwise Expansion in High-Level Synthesis", ICCAD, 1992.

[32] S.Y. Kung, H.J. Whitehouse, and T. Kailath, "VLSI and Modern Signal Processing", Englewood Cliffs, NJ: Prentice Hall, pp.258-264, 1985.

[33] F.J. Kurdahi and A.C. Parker, "REAL: A Program for REgister ALlocation", 24th Design Automation Conference, 1987.

[34] M. Langevin and E. Cerny, "A Recursive Technique for Computing Lower-Bound Performance of Schedules", International Conference on Computer Design, pp.16-20, 1993.

[35] J.H. Lee, Y.C. Hsu and Y.L. Lin, "A New Integer Linear Programming Formulation for the Scheduling Problem in Data Path Synthesis", Proceedings of ICCAD, November 1989.

[36] T.A. Ly and J.T. Mowchenko, "Applying Simulated Evolution to High Level Synthesis", IEEE Trans. on Computer-Aided Design, Vol.12, No.3, pp.389-409, March 1993.

[37] D.J. Mallon and P.B. Denyer, "A New Approach to Pipeline Optimization", Proceedings of the European Design Automation Conference, March 1990.

[38] P. Marwedel, "A New Synthesis Algorithm for the MIMOLA Software System", Proceedings of the Design Automation Conference, pp.271-277, July 1986.

[39] M.C. McFarland, "Using Bottom-Up Techniques in the Synthesis of Digital Hardware from Abstract Behavioral Descriptions", Proceedings of the Design Automation Conference, pp.474-480, July 1986.

[40] M.C. McFarland, A.C. Parker and R. Camposano, "Tutorial on High-Level Synthesis", Proceedings of the Design Automation Conference, pp.330-336, July 1988.

[41] M.C. McFarland, A.C. Parker and R. Camposano, "The High-Level Synthesis of Digital Systems", Proceedings of the IEEE, pp.301-318, February 1990.

[42] S. Minato, "Zero-Suppressed BDDs for Set Manipulation in Combinatorial Problems", Proceedings of the Design Automation Conference, 1993.

[43] J.A. Nestor and G. Krishnamoorthy, "SALSA: A New Approach to Scheduling with Timing Constraints", Proceedings of ICCAD, 1990.

[44] M. Nourani and C. Papachristou, "Move Frame Scheduling and Mixed Scheduling-Allocation for the Automated Synthesis of Digital Systems", 29th Design Automation Conference, 1992.

[45] S.Y. Ohm and C.S. Jhon, "A Branch and Bound Method for the Optimal Scheduling", Proceedings of CICC, pp.8.6.1-8.6.4, May 1992.

[46] S.Y. Ohm, F.J. Kurdahi, and Nikil Dutt, "Comprehensive Lower Bound Estimation from Behavioral Descriptions", Proceedings of ICCAD, pp.182-187, November 1994.

[47] S.Y. Ohm, personal communication, March 1995.
[48] B.M. Pangrle and D.D. Gajski, "Slicer: A State Synthesizer for Intelligent Silicon Compiler", International Conference on Computer Design, pp.42-45, July 1987.

[49] C.A. Papachristou and H. Konuk, "A High Level Synthesis Technique Based on Linear Programming", Technical Report CES-89-21, Computer Engineering and Science Department, Case Western Reserve University, November 1989.

[50] C.A. Papachristou and H. Konuk, "A Linear Program Driven Scheduling and Allocation Method Followed by an Interconnect Optimization Algorithm", Proceedings of the Design Automation Conference, pp.77-83, June 1990.

[51] I. Park and C. Kyung, "FAMOS: An Efficient Scheduling Algorithm for High-Level Synthesis", IEEE Trans. on Computer-Aided Design, Vol.12, No.10, pp.1437-1448, October 1993.

[52] N. Park and A.C. Parker, "MAHA: A Program for Datapath Synthesis", Proceedings of the Design Automation Conference, pp.461-466, July 1986.

[53] N. Park and A.C. Parker, "SEHWA: A Software Package for Synthesis of Pipelines from Behavioral Specifications", IEEE Trans. on Computer-Aided Design, Vol.7, No.3, pp.356-368, March 1988.

[54] P.G. Paulin and J.P. Knight, "Scheduling and Binding Algorithms for High-Level Synthesis", Proceedings of the Design Automation Conference, pp.1-6, 1989.

[55] P.G. Paulin and J.P. Knight, "Force-Directed Scheduling for Behavioral Synthesis of ASIC's", IEEE Trans. on Computer-Aided Design, Vol.8, No.6, pp.661-679, June 1989.

[56] P.G. Paulin, J.P. Knight, and E.F. Girczyc, "HAL: A Multi-Paradigm Approach to Automatic Data Path Synthesis", Proceedings of the Design Automation Conference, pp.263-270, July 1986.

[57] J.M. Rabaey, H. De Man, et al., "CATHEDRAL II: A Synthesis System for Multiprocessor DSP Systems", in Silicon Compilation, Addison-Wesley, pp.204-310, 1988.

[58] J.M. Rabaey and M. Potkonjak, "Estimating Implementation Bounds for Real Time DSP Application Specific Circuits", IEEE Trans. on Computer-Aided Design, Vol.13, No.6, pp.669-683, June 1994.

[59] I.P. Radivojevic and F. Brewer, "Symbolic Techniques for Optimal Scheduling", Proceedings of the 4th SASIMI, Nara, Japan, 1993.

[60] I.P. Radivojevic and F. Brewer, "On Applicability of Symbolic Techniques to Larger Scheduling Problems", Proceedings of the European Design and Test Conference, pp.48-53, 1994.

[61] I.P. Radivojevic and F. Brewer, "Analysis of Conditional Resource Sharing using a Guard-based Control Representation", International Conference on Computer Design, 1995.

[62] I.P. Radivojevic and F. Brewer, "A New Symbolic Technique for Control-Dependent Scheduling", IEEE Trans. on CAD, January 1996.

[63] M. Rim and R. Jain, "Representing Conditional Branches for High-Level Synthesis Applications", 29th Design Automation Conference, 1992.

[64] M. Rim, "High-Level Synthesis of VLSI Designs for Scientific Programs", Ph.D. thesis, Department of Electrical and Computer Engineering, University of Wisconsin, Madison, August 1993.

[65] M. Rim and R. Jain, "Lower-Bound Performance Estimation for the High-Level Synthesis Scheduling Problem", IEEE Trans. on Computer-Aided Design, Vol.13, No.4, pp.451-458, April 1994.

[66] M. Rim and R. Jain, "RECALS II: A New List Scheduling Algorithm", ICASSP, 1994.

[67] R. Sedgewick, "Algorithms in C", Addison-Wesley, 1990.

[68] A. Sharma, "Estimation and Design Algorithms for the Behavioral Synthesis of ASICs", Ph.D. thesis, University of Wisconsin, 1992.
[69] A. Sharma and R. Jain, "Estimating Architectural Resources and Performance for High-Level Synthesis Applications", IEEE Trans. on VLSI Systems, Vol.1, No.2, June 1993.

[70] A. Sharma and R. Jain, "InSyn: Integrated Scheduling for DSP Applications", 30th Design Automation Conference, 1993.

[71] J.D. Ullman, "NP-Complete Scheduling Problems", Journal of Computer and System Sciences, pp.384-393, 1975.

[72] K. Wakabayashi and T. Yoshimura, "A Resource Sharing and Control Synthesis Method for Conditional Branches", ICCAD, 1989.

[73] K. Wakabayashi and H. Tanaka, "Global Scheduling Independent of Control Dependencies Based on Condition Vectors", 29th Design Automation Conference, 1992.

[74] T.C. Wilson, N. Mukherjee, M.K. Garg, and D.K. Banerjee, "An Integrated and Accelerated ILP Solution for Scheduling, Module Allocation, and Binding in Datapath Synthesis", 6th International Conference on VLSI Design, pp.192-197, Bombay, India, January 1993.

[75] Nam-Sung Woo, "A Global, Dynamic Register Allocation and Binding for a Data Path Synthesis System", 27th Design Automation Conference, 1990.