ALGORITHMS FOR THE SCHEDULING PROBLEM IN VLSI HIGH-LEVEL SYNTHESIS

presented by

GIREENDRANATH TIRUVURI

has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science.

Major professor

Date

ALGORITHMS FOR THE SCHEDULING PROBLEM IN VLSI HIGH-LEVEL SYNTHESIS

By

Gireendranath Tiruvuri

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science
Michigan State University
East Lansing, MI 48824

1996

ABSTRACT

ALGORITHMS FOR THE SCHEDULING PROBLEM IN VLSI HIGH-LEVEL SYNTHESIS

By

Gireendranath Tiruvuri

High-level synthesis takes an abstract behavioral specification of a digital system and finds a register-transfer level structure that realizes the given behavior. Operation scheduling and datapath construction are the core of high-level synthesis in obtaining efficient designs in terms of area and speed. A novel dynamic programming approach for optimal scheduling in high-level synthesis under resource constraints is presented. Our algorithm can handle multi-cycle operations, pipelined functional units, and scheduling under bus and register constraints. In our dynamic programming approach, an equivalence relation between configurations (partial schedules) is used to efficiently detect equivalent states in the solution space. Exploiting the principle of optimality, the scheduling algorithm keeps only the best partial schedules in the search space for further exploration. A set of effective pruning methods is proposed to further reduce the design space.

A new scheme for representing Control/Data Flow Graphs (CDFGs) is proposed.
Based on the representation, a new method for conditional resource sharing analysis between an arbitrary number of operations is presented. Using this method, our dynamic programming scheduling algorithm exploits global conditional resource sharing to obtain optimal schedules in very short computational times.

For a synthesis system to produce efficient designs, it should have the capability to consider different cost-performance trade-offs. Instead of producing all designs and selecting the best, the system can use estimation to identify and prune inferior designs. A simple and efficient technique for the estimation of lower-bound performance is presented. The estimation method is extended to estimate lower-bound completion times from partially scheduled graphs. This is a very important problem in our scheduling algorithm and in some other classes of scheduling algorithms such as branch and bound methods.

The scheduling method and the estimation method are implemented and extensive experimentation is performed using a number of benchmarks. We are able to schedule large examples optimally in very short computational times. Our method is an order of magnitude faster than some of the best exact scheduling methods in the literature. Our results for the lower-bound estimation compare well with the best in the literature.

To Amma, Naannagaru, Anna

ACKNOWLEDGEMENTS

I wish to express my gratitude to my advisor, Dr. Moon Jung Chung, for his able guidance and constant encouragement throughout this work. I would also like to thank my dissertation committee members, Dr. Lionel Ni, Dr. A. Esfahanian and Dr. Bruce Sagan, for their help and guidance and the invaluable education I got from them.

I would like to thank my wife, Kameswari, for her constant encouragement, patience and love. I wish to thank my friend, Vasu, for his friendship and help throughout my years at MSU. I wish to express my heartfelt thanks to my parents, my brother, and many friends and family for giving me everything I have in life.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction
  1.1 Motivation
  1.2 Scheduling in High-Level Synthesis
  1.3 Scheduling Data Flow Graphs
  1.4 Scheduling Control/Data Flow Graphs
  1.5 Lower-bound Estimation
  1.6 Thesis Organization

2 Dynamic Programming Scheduling
  2.1 Related Work
  2.2 Problem Description and Terminology
  2.3 Basic Dynamic Programming Scheduling Algorithm
      2.3.1 Equivalence of Configurations
  2.4 More Pruning of the Search Space
  2.5 Extension of Models
      2.5.1 Multi-cycle Operations
      2.5.2 Pipelining Functional Units
  2.6 Exploring the Search Space
      2.6.1 Breadth First Search (BFS)
      2.6.2 Depth First Search (DFS)
      2.6.3 BFS vs DFS
  2.7 Experimental Results
  2.8 Conclusions

3 Scheduling under Bus and Register Constraints
  3.1 Introduction
  3.2 Scheduling under Bus Constraint
  3.3 Scheduling under Register Constraint
  3.4 Experimental Results
      3.4.1 Results for scheduling under bus constraints
      3.4.2 Results for scheduling under register constraint
  3.5 Conclusions

4 Lower-bound Estimation
  4.1 Introduction
  4.2 Previous Work
  4.3 Model and Definitions
  4.4 Lower-bound Estimation for Performance
  4.5 Estimating a Lower Bound for a Partially Scheduled DFG
  4.6 Experimental Results
      4.6.1 Lower bound estimation of partial schedules in branch and bound methods
      4.6.2 Lower-bound estimation for the entire DFGs
  4.7 Conclusions

5 Scheduling CDFGs
  5.1 Introduction
  5.2 Related Work
  5.3 CDFG Definitions and Issues
  5.4 Representation of CDFGs
      5.4.1 The Dot and Dash Notation
  5.5 Conditional Resource Sharing for a Set of Operations
  5.6 Lower-bound Estimation for CDFGs
  5.7 DPS Algorithm for CDFGs
  5.8 Experimental Results
  5.9 Conclusions

6 Conclusions
  6.1 Summary of Contributions
  6.2 Future Research

BIBLIOGRAPHY

LIST OF TABLES

2.1 Number of control steps and configurations with non-pipelined multiplier
2.2 Number of control steps and configurations with pipelined multiplier
3.1 Results for the fifth-order elliptic wave filter
3.2 Results for the once unrolled fifth-order elliptic wave filter
3.3 Results for the sixth-order elliptic Bandpass filter
3.4 Results for the Fast Discrete Cosine Transform
3.5 Comparison with the InSyn heuristic method [70]
3.6 Comparison with the multi-schedule approach [9]
3.7 Comparison with the symbolic technique [60]
4.1 CPU time and number of configurations in branch and bound algorithm
4.2 Lower bounds with non-pipelined multiplier
4.3 Lower bounds with pipelined multiplier
5.1 Number of total control steps for Conditional Branch benchmarks
5.2 Speculative execution in Conditional Branch benchmarks

LIST OF FIGURES

2.1 An example DFG
2.2 The Design Search Space for the example DFG
2.3 Outline of the basic dynamic programming scheduling procedure
4.1 Outline of the procedure for lower-bound estimation of entire DFGs
4.2 An example for the lower-bound estimation for the entire DFG
4.3 Outline of the procedure for lower-bound estimation from partial schedules
4.4 Computing the values for I and J
4.5 An example for the estimation of lower-bound completion time of partial schedules
5.1 An example CDFG
5.2 Our representation for the example CDFG
5.3 Procedure for Conditional Resource Sharing Analysis
5.4 Pseudo-code to update the total resource requirement
5.5 Outline of the procedure to compute the weights of operations
5.6 Design Search Space for the example CDFG
5.7 Outline of the DPS algorithm for CDFGs

CHAPTER 1

Introduction

High-level synthesis takes an abstract behavioral specification of a digital system and finds a register-transfer level structure that realizes the given behavior. Usually, there are many different structures that can be used to realize a given behavior. One of the main goals of a synthesis system is to find the structure that best meets the constraints, such as limitations on the number of functional units, registers and power, while minimizing some other parameters like the number of cycles. The task of high-level synthesis can be decomposed into four major subtasks: parsing, operation scheduling, datapath construction and controller synthesis. Parsing translates a high-level behavioral description, such as a VHDL description of a digital system, into an internal representation such as a Control/Data Flow Graph (CDFG). The operations in the CDFG are then assigned to control steps by a scheduler. A control step is the fundamental sequencing unit in synchronous systems. Typically, it corresponds to a clock cycle. Objects (operations, variables and data transfers) are bound to resources (functional units, registers, multiplexers, etc.) in the datapath construction phase. Finally, a controller is synthesized to monitor the execution of the operations.

1.1 Motivation

In recent years, the trend toward automating synthesis of digital systems at higher and higher levels of the design hierarchy has been gaining momentum. An example of this is the acceptance and increasing use of logic synthesis tools by the VLSI industry. There have been several motivating factors behind this trend.

• As VLSI design complexities have increased, it has become necessary for designers to move to higher and higher levels of design abstraction. At higher levels of design abstraction, it is relatively easier to understand the specifications and faster to explore different trade-offs between cost, speed, power and so on.

• Since design development contributes the most to the cost of a chip, automating the design process as much as possible can help to reduce the cost. Automated design tools at higher levels of design can also help to reduce the design time, which is important in an increasingly competitive market.

• With an automated design tool, there is less likelihood of errors. Furthermore, a verification mechanism can be incorporated into the tool.

• At higher levels of automation, it is easier to keep track of the history of design decisions: what decisions were made and why, and what the effect of those decisions was.

Operation scheduling and datapath construction are the core of high-level synthesis in obtaining efficient designs in terms of area and speed. Scheduling datapath operations into the best control steps is a task whose importance has been recognized in many systems, e.g. [2, 5, 18, 48, 52, 53, 38, 56, 57]. Typically, a scheduler attempts to obtain an efficient design using one of two approaches: resource-constrained scheduling or performance-constrained scheduling. In resource-constrained scheduling, the number of control steps is minimized given the counts and delays of the resources that perform the operations in the CDFG.
The number of resources required to meet a given performance goal is minimized in performance-constrained scheduling. In many high-level synthesis systems, a large behavioral description is decomposed into several blocks and each block is scheduled separately. In such cases, the resource requirement is determined by the blocks that need the maximum number of resources of each type. Ideally, all blocks should be able to use all the available resources, and this cannot be ensured by performance-constrained scheduling. Secondly, given a single performance constraint for the whole design, it is a non-trivial task to determine the performance constraint for each block separately. On the other hand, a constraint on the resources is uniform across all blocks. Hence, resource-constrained scheduling is preferred to performance-constrained scheduling. We address the scheduling problem under resource constraints in this thesis.

1.2 Scheduling in High-Level Synthesis

The general problem of scheduling is known to be NP-complete [16]. In [71], it is proved that the restricted cases of single execution time scheduling and two processor, one or two time unit scheduling problems are also NP-complete.

The most elementary instance of the scheduling problem in high-level synthesis is that where the behavioral description is a Data Flow Graph (DFG) and the resource constraints are constraints on the available functional units. In a DFG, the nodes represent operations and the edges represent data dependencies between operations. The operations need to be assigned to control steps such that the number of operations scheduled in any control step is less than or equal to the number of available functional units that can perform those operations. The problem gets more complex when there are additional constraints such as bus and register constraints. For example, the number of buses required in a control step depends on the actual operations scheduled in that step and their inputs, not just the number of operations. A register constraint is even more complex, because the register requirement in a control step depends on the operations scheduled in the previous steps as well.

If some of the operations in the behavior are conditional constructs, such as if-then-else and case structures, there are control dependencies in addition to data dependencies between operations. The graphs representing such behaviors are called Control/Data Flow Graphs (CDFGs). In a CDFG, one hardware resource can be assigned to more than one operation in the same control step if they are mutually exclusive, i.e. if they belong to different branches of a conditional construct. In order to obtain efficient schedules, it is very important that the scheduling algorithms maximize such conditional hardware resource sharing. This adds another dimension to the complexity of the scheduling problem.

A scheduler often has to explore the design space with a variety of resource constraints and evaluate different cost-performance tradeoffs before selecting a design. Instead of producing schedules with each and every resource constraint, a scheduler can use estimation to identify and prune inferior designs. So, lower-bound estimation of performance has been an important problem in scheduling. For some classes of scheduling algorithms, such as branch and bound, it is often necessary to estimate a lower-bound completion time from a large number of partially scheduled graphs.
Though it appears to be a special case of lower-bound estimation for the entire graph, the problem is much more complex because the algorithm has to be much faster.

In this thesis, we address the following problem instances of scheduling and estimation.

• Scheduling Data Flow Graphs (DFGs) with single cycle operations under constraints on the number of functional units only (Chapter 2)

• Scheduling DFGs with multi-cycle operations and pipelining functional units under constraints on the number of functional units only (Chapter 2)

• Scheduling DFGs under additional constraints on the number of buses and registers (Chapter 3)

• Scheduling Control/Data Flow Graphs (CDFGs) under resource constraints (Chapter 5)

• Lower-bound performance estimation for entire DFGs and partially scheduled DFGs under constraints on functional units (Chapter 4)

• Lower-bound performance estimation for entire CDFGs and partially scheduled CDFGs under constraints on functional units (Chapter 5)

In the next few sections, these problems are introduced in further detail in terms of the existing techniques, the necessity for new techniques, and so on.

1.3 Scheduling Data Flow Graphs

Since scheduling is an intractable problem, a wide variety of heuristic algorithms is used by synthesis systems to solve the scheduling problem. The heuristics range from list scheduling [10, 53, 48] to force-directed scheduling [55, 6] to transformational [2, 11, 36, 51]. The weakness of the heuristic approaches is that the effectiveness of a heuristic depends on the nature of the input circuit and on the nature of the constraints. Some heuristics work well in some scenarios but not in others, resulting in sub-optimal solutions. Furthermore, as pointed out in [15], "the weakness of the heuristic approaches is that they begin their scheduling by imposing a heuristic rather than begin with the optimal schedule. Hence, the heuristic is intrinsic to the scheduling algorithm and makes significant modifications of the algorithm very hard." For example, many heuristic algorithms try to schedule one operation at a time. This makes it very hard to schedule under constraints like bus and register constraints. An extensive empirical evaluation of some high-level synthesis scheduling heuristics reported in [28] shows that many scheduling heuristics result in sub-optimal solutions even for small problem sizes.

Exact scheduling algorithms are mainly Integer Linear Programming (ILP) based [31, 7, 26, 24], branch and bound [45, 9] or symbolic techniques [59, 60]. The existing methods are computationally very expensive because they do not sufficiently exploit the structure of the input circuit and constraints at each intermediate step of the scheduling process. They do not prune the solution space enough to be applicable to large scheduling problems. Most importantly, none of the algorithms attempted to detect equivalent states in the solution space, i.e. states that represent partial schedules of the same subset of the operations. If the operations are scheduled in a constructive fashion into increasing control steps, we need to store only the best partial schedule for any given subset of operations for further scheduling (principle of optimality). Based on this principle, we proposed a new optimal dynamic programming scheduling (DPS) algorithm [8]. An 'equivalence' relation is defined between partial schedules, using which equivalent states in the solution space can be detected very efficiently.
In addition to this, a simple yet powerful set of pruning techniques is proposed which reduces the size of the solution space (in terms of the number of states) drastically. In order to keep the memory requirement practical, it is also very important that the amount of data stored at each state is minimal. Another major advantage of our algorithm is that it requires very little data to be stored at each state. New states are computed very efficiently from the existing states. Our scheduling algorithm can handle multi-cycle operations and pipelining functional units. We extended the DPS algorithm to handle scheduling under bus and register constraints.

We implemented the proposed algorithm and conducted experiments on a number of standard benchmarks. We created large examples by unrolling the embedded loops in some of the benchmarks and tested our method on those examples. We are able to solve all the examples optimally in very short computational times. The results show the efficiency of our basic dynamic programming approach and the effectiveness of our pruning methods.

1.4 Scheduling Control/Data Flow Graphs

Several scheduling algorithms have been proposed in the literature [22, 30, 62, 73] to handle conditional branches. Except for [62], no other work comprehensively addresses the issues pertaining to conditional branches. However, the symbolic technique used in their work is computationally very expensive and has failed to solve large problems optimally [60]. Some of the important issues of conditional branches include: conditional resource sharing beyond pair-wise mutual exclusiveness, speculative execution of operations, and simultaneous scheduling of parallel conditional trees. A concise representation of the conditional behavior is fundamental in addressing these issues efficiently. The existing representations [61, 63, 72] have various limitations. We proposed a new representation for CDFGs in which the CDFG operations are labeled in a dot and dash notation that precisely represents their conditional behavior. Based on this representation, we present an efficient method for analyzing conditional resource sharing among an arbitrary number of operations. Speculative execution and simultaneous scheduling of parallel trees are supported, thus overcoming the limitations of previous works. The conditional resource sharing analysis method is incorporated into our DPS algorithm to obtain optimal schedules that exploit global sharing.

1.5 Lower-bound Estimation

For a synthesis system to produce efficient designs, it should have the capability to consider different cost-performance trade-offs. So, a scheduler has to explore the design space with a variety of resource constraints. Instead of producing schedules with each and every resource constraint, a scheduler can use estimation to identify and prune inferior designs. For example, suppose that for a given CDFG the scheduler has found a schedule of x control steps with a cost constraint of y. If there is an estimation tool for lower-bound performance, and it determines that the lower bound with a cost constraint of y + z is x control steps, the scheduler need not be used to produce an actual schedule with any cost constraint between y and y + z. Furthermore, a lower bound can be used to evaluate a heuristic solution. For an estimation tool to be useful, it has to be much faster than the actual scheduler, and the lower bounds it produces should be as tight as possible.
We have proposed an efficient estimation technique [19] for the lower-bound performance. We tested our estimation method on a number of benchmarks and compared our results with those of some other known methods [29, 65, 69] in the literature. Our method is faster and produces better lower bounds. In most cases, the lower bound is tight, and it is always within five control steps of the optimum.

Our dynamic programming scheduling method performs an exhaustive search of the design space to find an optimal schedule with a given resource constraint. Any such scheduling method, as well as the branch and bound methods, needs a dynamic estimation method to detect partial schedules that are not going to lead to an optimum schedule. Since the number of partial schedules is generally very high in a design space search process, the dynamic estimation should be much faster than the lower-bound estimation for the entire DFG. At the same time, it should be effective and able to detect infeasible partial schedules as early as possible in the exploration. Furthermore, it should not introduce significant memory overhead into the scheduling algorithm. We extended our lower-bound estimation method so that it can find such dynamic lower bounds. This problem is not addressed in earlier works. We incorporated this extended method into our dynamic programming scheduling algorithm. The results show that it reduces the size of the search space drastically. We also extended our lower-bound estimation methods to handle CDFGs.

1.6 Thesis Organization

In the next chapter, we describe our dynamic programming scheduling algorithm under constraints on functional units. Scheduling under additional constraints (bus and register constraints) is explained in Chapter 3. The lower-bound estimation methods for entire DFGs and partially scheduled DFGs are presented in Chapter 4. Scheduling CDFGs is described in Chapter 5. In that chapter, we describe our representation scheme for CDFGs and the method for conditional resource sharing analysis. The scheduling algorithm and the lower-bound estimation are extended to handle CDFGs in the same chapter. Finally, Chapter 6 concludes, outlining the contributions of this work and topics for further research.

CHAPTER 2

Dynamic Programming Scheduling

In this chapter, we explain our basic dynamic programming scheduling algorithm and the pruning methods. We first describe our algorithm for single-step operations and then describe the extensions for multi-cycle operations and pipelining functional units. Then, two approaches to explore the search space using the dynamic programming algorithm, a Breadth First Search and a Depth First Search, are described.

2.1 Related Work

The simplest scheduling technique is as soon as possible (ASAP) scheduling [21, 38], where the operations are scheduled into increasing control steps as allowed by precedence constraints, resource constraints and their initial order. The list scheduling technique, which was originally used in [10] for microcode compaction, has been adopted by many high-level synthesis systems [28, 53, 48]. A list scheduling algorithm selects operations from a list ordered according to some priority function. A selected operation is assigned to the earliest time step allowed by the precedence and resource constraints. Different list scheduling heuristics vary in the priority function they use. Many algorithms, e.g. [28], use the as late as possible (ALAP) values of operations as the priority function. Given a performance constraint, the ALAP (ASAP) value of an operation is the latest (earliest) time step in which the operation can start executing without violating the performance constraint and precedence constraints.
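To make these two quantities concrete, the following is a minimal C sketch (C being the implementation language used later in this thesis) that computes precedence-only ASAP and ALAP start steps for a tiny hand-coded DFG. The example graph, array names, and the assumption that operations are numbered in topological order are ours, for illustration only; they are not taken from any particular benchmark.

#include <stdio.h>

#define NOPS 4

/* delay[i] = number of control steps operation i takes */
static const int delay[NOPS] = {2, 1, 1, 1};   /* op 0 is a 2-cycle multiply */
/* pred[i][j] != 0 iff operation j is an immediate predecessor of i */
static const int pred[NOPS][NOPS] = {
    {0,0,0,0},   /* op 0: input               */
    {0,0,0,0},   /* op 1: input               */
    {1,0,0,0},   /* op 2 depends on op 0      */
    {0,1,1,0},   /* op 3 depends on ops 1, 2  */
};

int main(void) {
    int asap[NOPS], alap[NOPS];
    /* ASAP: earliest start = 1 + latest finish of any predecessor.
     * A single forward pass works because ops are topologically ordered. */
    for (int i = 0; i < NOPS; i++) {
        asap[i] = 1;
        for (int j = 0; j < i; j++)
            if (pred[i][j] && asap[j] + delay[j] > asap[i])
                asap[i] = asap[j] + delay[j];
    }
    int target = 0;                 /* critical path = performance constraint */
    for (int i = 0; i < NOPS; i++)
        if (asap[i] + delay[i] - 1 > target)
            target = asap[i] + delay[i] - 1;
    /* ALAP: latest start meeting the target, propagated backwards:
     * an op must finish before any successor starts. */
    for (int i = NOPS - 1; i >= 0; i--) {
        alap[i] = target - delay[i] + 1;
        for (int j = i + 1; j < NOPS; j++)
            if (pred[j][i] && alap[j] - delay[i] < alap[i])
                alap[i] = alap[j] - delay[i];
    }
    for (int i = 0; i < NOPS; i++)
        printf("op %d: ASAP=%d ALAP=%d\n", i, asap[i], alap[i]);
    return 0;
}

On this toy graph, operation 1 has ASAP 1 and ALAP 3; the difference of the two values is exactly the mobility used by Slicer, described next.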
The non-pipelined scheduling algorithm of SEHWA [53] uses urgency as its priority function. The urgency of an operation is defined to be the length of the longest path from that operation to the output operations. The priority of operations is in the decreasing order of their urgencies. Slicer [48] uses the mobility (the difference between the ALAP and ASAP values) of an operation as the priority function. In [66], it was shown that all these list scheduling methods are in fact identical. RECALS II [66] is a collection of list scheduling algorithms, also with ALAP priority functions. However, in RECALS II more than one ALAP value, called Pseudo ALAP (PALAP) values, are computed, each with a subset of the resource constraints. Then, a set of schedules is generated using one PALAP priority function at a time, and the best schedule is chosen from the set. In freedom-based scheduling [52], the range of possible control step assignments for each operation is calculated. Then, operations on the critical path are scheduled first. To schedule the other operations, at each step, the operation with the least freedom, that is, the one with the smallest range of control steps into which it can go, is chosen.

A different approach, force-directed scheduling (FDS), was originally presented in [55] and has later been used in other high-level synthesis systems like SAM [6]. In FDS, "force" values are calculated for all operations at all feasible control steps. The assignment of an operation to a control step is made such that it has the most attractive force. FDS attempts to minimize the number of resources subject to a performance constraint. In [54], another algorithm using the same concept of "force", force-directed list scheduling, was presented with the goal of optimizing the performance under resource constraints. Transformational algorithms start with a default schedule and apply transformations to probabilistically move towards an optimal schedule. Transformational methods in the literature include simulated annealing [11, 43], simulated evolution [36] and FAMOS by Park et al. [51].

Exact scheduling methods are mainly Integer Linear Programming (ILP) based, branch and bound, or symbolic techniques. Several ILP-based algorithms for high-level synthesis are presented in the recent literature [17, 35, 31, 7, 26, 24, 74]. In ILP-based scheduling algorithms, the scheduling objectives and constraints are translated into integer linear programming formulations. In [7, 24], several techniques are used to reduce the number of variables in the formulation. Others used heuristics like zone scheduling [26] and step-wise expansion [31] to keep the solution space from exploding. The zone scheduling approach [25, 26] works by successively partitioning the control steps into zones and solving each of the zones by a 0-1 integer linear programming technique. By specifying the maximum number of 0-1 variables involved within a zone, the model can be turned into an optimal scheduling, a list scheduling, or one in between, with different speed/quality characteristics. The optimal scheduling method proposed in [45] is a combination of branch and bound and transformational approaches. It optimizes hardware cost under a performance constraint.
The multi-schedule approach in [9] generates multiple optimal schedules in a branch and bound fashion and synthesizes a complete datapath for each schedule. The symbolic technique used in [60] describes scheduling constraints as Boolean functions and builds an OBDD (Ordered Binary Decision Diagram) encapsulating all optimal solutions. Their technique is computationally very expensive, and they used heuristics to solve large scheduling problems.

2.2 Problem Description and Terminology

A DFG (Data Flow Graph) G = (V, E) is a directed graph representation of a behavioral description, where V is the set of operations and E denotes the set of dependencies (precedence constraints) between the operations. For any two operations x, y in V, if (x, y) is in E, operation x should be finished before operation y can start. Operation x is called an immediate predecessor of y, and y is an immediate successor of x. Associated with each operation x in V there is a unique type T_x, indicating that a resource of type T_x should be used to perform operation x. Figure 2.1 shows a DFG for the differential equation example used in [55].

Figure 2.1. An example DFG

Let T be the set of operation types. Resource constraints are given by a set of ordered tuples <t, D(t), N(t)>, where D(t) is the delay (number of control steps) an operation of type t takes to complete and N(t) is the number of available resources of type t. For any operation x, we denote D(T_x) by d_x.

A schedule is a many-to-one mapping f : V -> {1, 2, 3, ..., k}. For any node v in V, f(v) = r means that the node v is scheduled to start at control step r, and it will be finished at control step f(v) + d_v - 1. A valid schedule f has to satisfy the following conditions.

For all (x, y) in E, f(x) + d_x - 1 < f(y)   (precedence constraints).

For all r in {1, 2, ..., k}, |V(t, r)| <= N(t), where V(t, r) is the set of operations v of type t such that
  f(v) <= r <= f(v) + delta_t - 1, if type t resources are pipelined with a latency delta_t;
  f(v) <= r <= f(v) + d_v - 1, if type t resources are not pipelined   (resource constraints).

The number of control steps C_f taken by a schedule f is max over all v in V of f(v) + d_v - 1. A schedule f is an optimal schedule if and only if for any other schedule g, C_f <= C_g.
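The two validity conditions translate directly into a checker. The following hedged C sketch covers the non-pipelined case on a toy four-operation DFG; for pipelined units, the occupancy d[t] in the resource test would be replaced by the latency delta_t, as defined above. The graph and all names are illustrative, not from the thesis.

#include <stdio.h>

#define NOPS 4
#define NTYPES 2

static const int type[NOPS]     = {0, 0, 1, 1};  /* 0: multiplier, 1: ALU */
static const int d[NTYPES]      = {2, 1};        /* delay D(t) per type   */
static const int navail[NTYPES] = {1, 1};        /* N(t)                  */
static const int edge[NOPS][NOPS] = {            /* edge[x][y]: x -> y    */
    {0,0,1,0}, {0,0,1,0}, {0,0,0,1}, {0,0,0,0},
};

int is_valid(const int f[NOPS], int k) {
    /* precedence: f(x) + d_x - 1 < f(y) for every edge (x, y) */
    for (int x = 0; x < NOPS; x++)
        for (int y = 0; y < NOPS; y++)
            if (edge[x][y] && f[x] + d[type[x]] - 1 >= f[y])
                return 0;
    /* resources: in every step r, at most N(t) type-t ops occupy a unit */
    for (int r = 1; r <= k; r++)
        for (int t = 0; t < NTYPES; t++) {
            int busy = 0;
            for (int v = 0; v < NOPS; v++)
                if (type[v] == t && f[v] <= r && r <= f[v] + d[t] - 1)
                    busy++;
            if (busy > navail[t])
                return 0;
        }
    return 1;
}

int main(void) {
    int f[NOPS] = {1, 3, 5, 6};   /* one multiplier forces the multiplies
                                     into non-overlapping steps 1-2, 3-4 */
    printf("valid: %d\n", is_valid(f, 6));
    return 0;
}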
An operation x is called a predecessor of y (and y a successor of x) if there is a directed path from x to y in G. S_x denotes the set of successors of x, and P_x is the set of predecessors of x. An operation without any predecessors is called an input operation. An operation without any successors is called an output operation. A multi-cycle operation x is said to be unfinished at control step r if it is scheduled but not finished at control step r, i.e. f(x) < r <= f(x) + d_x - 1. An operation x is said to be ready at control step r if it is not scheduled before control step r and all its predecessors are scheduled and finished at control step r, i.e. f(y) + d_y - 1 < r for all y in P_x.

In the process of finding a schedule for a given DFG G, a configuration S represents a state in the design space that is reached by scheduling operations until some control step r. We call r the depth of S (depth(S)). A root configuration is the initial configuration at which none of the operations is scheduled. The depth of the root configuration is 0. At a configuration S, ready(S) denotes the set of ready operations at S and unfinished(S) denotes the set of unfinished operations (scheduled at a previous control step, but not yet finished).

The scheduling process is started at the root configuration. The set of input operations constitutes the ready set at the root configuration. To continue the process of scheduling from a configuration S, a subset of ready(S) is chosen to be scheduled subject to resource constraints. That represents an assignment of operations to control step depth(S) + 1, creating a child configuration S' of S. This set of newly scheduled operations, together with any unfinished operations at S, is called the active set of S' (active(S')). The configuration S is called the parent of S' (parent(S')). If the depth of S is d, the depth of S' is d + 1. The path from the root configuration to a configuration represents a partial schedule, i.e. a schedule of a subset of the operations in the DFG. Hence, we say that a configuration represents a partial schedule. All the configurations on the path from the root configuration to a configuration S (including the root) are called predecessor configurations of S. A leaf configuration represents a complete schedule of the given DFG, where all the operations in G are scheduled and finished. By following the path from the root configuration to a leaf configuration, we obtain the control step assignments for all the operations. At a configuration S, unsch(S, t) denotes the number of unscheduled operations of type t at S.

Figure 2.2 shows a partial design space for the example DFG in Figure 2.1 with two multipliers and one adder. All operations take one control step. Addition, subtraction and comparison operations are all performed by adders. In Figure 2.2, ovals represent configurations. The numbers inside the ovals are the operations in the ready set. The numbers along an arc are the operations in the active set of the configuration at the head of the arc. For example, ready(J) = {3, 6, 8, 11} and active(J) = {12, 10}. I is the root configuration and P is a leaf configuration. J is the parent of K, and K is a child of J. The values for unsch(S, t) are not shown in the figure. For example, unsch(K, *) = 2 and unsch(K, +) = 3.

Figure 2.2. The Design Search Space for the example DFG

2.3 Basic Dynamic Programming Scheduling Algorithm

The search space is explored starting at the root configuration. All the input operations are included in the ready set of the root configuration. At each configuration, the scheduling process is continued by generating its child configurations. All the configurations thus generated constitute the search space. This exploration of the search space can be done either in breadth-first order or in depth-first order, as explained in Section 2.6. In this section and the next, we describe how a child configuration is generated from its parent configuration and how it is determined whether a generated configuration should be added to the search space. We first describe the method for uni-cycle operations. Extensions to multi-cycle operations are described in Section 2.5.

To generate a configuration S from its parent R, a subset of ready(R) is scheduled in the current control step, i.e. at control step depth(R) + 1. This set of newly scheduled operations constitutes active(S). Then, ready(S) is computed.
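Before turning to how ready(S) is computed, note that a configuration can be stored very compactly. The sketch below is our illustration (assuming the DFG has at most 64 operations, so operation sets fit in one machine word); it records exactly the four items the algorithm needs per configuration: ready(S), active(S), depth(S) and parent(S).

#include <stdint.h>

struct config {
    uint64_t ready;          /* bit v set iff operation v is ready      */
    uint64_t active;         /* ops executing in the current step       */
    int      depth;          /* control steps used so far               */
    struct config *parent;   /* partial schedule = the path to the root */
};

/* Set membership and update are single bit operations. */
static inline int  in_set(uint64_t s, int v) { return (s >> v) & 1; }
static inline void add(uint64_t *s, int v)   { *s |= (uint64_t)1 << v; }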
The computation of ready(S) is based on the following lemma.

Lemma 2.1: For any unscheduled operation x at configuration S, x is either a member of ready(S) or there exists a y in ready(S) such that y is a predecessor of x.

Proof: Let x be an unscheduled operation. If x is not in ready(S), then there is a predecessor p of x that is not scheduled. Among all such p, let q be the farthest from x, i.e. the length of the longest path from q to x is maximum among all p. If q is not in ready(S), then there is a predecessor q1 of q that is not scheduled. Note that q1 is a predecessor of x also. And q1 is farther than q from x, which is a contradiction. Hence, q is in ready(S). □

To compute ready(S), all the operations in ready(R) that are not in active(S) are first included in ready(S). Then, the only candidates to be added to ready(S) are the immediate successors of operations in active(S). Among these candidate operations, only those operations with no unscheduled predecessors can be included in ready(S). For a candidate operation x, to check whether or not all of its predecessors are scheduled, due to Lemma 2.1 it suffices to check only the set (ready(R) - active(S)) and the set of candidate operations itself. The operation x can be included in ready(S) if and only if these sets do not contain any predecessors of x.
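A possible realization of this update, under the uni-cycle assumption of this section and using the bitset representation sketched earlier, is the following. Here pred_tc and isucc stand for the precomputed transitive-closure predecessor sets and immediate-successor sets; the toy graph and all names are ours.

#include <stdint.h>
#include <stdio.h>

#define NOPS 4   /* toy DFG: 0 -> 2, 1 -> 2, 2 -> 3 */

static const uint64_t pred_tc[NOPS] = {0x0, 0x0, 0x3, 0x7}; /* all predecessors      */
static const uint64_t isucc[NOPS]   = {0x4, 0x4, 0x8, 0x0}; /* immediate successors  */

uint64_t compute_ready(uint64_t ready_R, uint64_t active_S)
{
    uint64_t carried = ready_R & ~active_S;   /* still ready, not yet scheduled */
    uint64_t cand = 0;
    for (int v = 0; v < NOPS; v++)            /* successors of newly scheduled ops */
        if ((active_S >> v) & 1)
            cand |= isucc[v];
    /* By Lemma 2.1, an unscheduled predecessor can only be found in
     * (ready(R) - active(S)) or among the candidates themselves. */
    uint64_t ready_S = carried;
    for (int x = 0; x < NOPS; x++)
        if (((cand >> x) & 1) && (pred_tc[x] & (carried | cand)) == 0)
            ready_S |= (uint64_t)1 << x;
    return ready_S;
}

int main(void)
{
    /* root: ops 0 and 1 ready; scheduling both makes op 2 ready */
    printf("ready = %#llx\n",
           (unsigned long long)compute_ready(0x3, 0x3));
    return 0;
}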
Thus, if the schedule for the set 19 Of operations in A is not Optimal, then the schedule for the whole DFG cannot be Optimal (principle of optimality [1]). Hence, a configuration S need not be explored further if there is already another equivalent configuration at an equal or lower depth. Also, S can be pruned if there is another configuration P at an equal or lower depth such that ready(P) is a subset Of ready(S). The equivalence Of configurations is checked with the help Of a hash table. For a con- figuration S, we save ready(S) and depth(S) in the hash table. When a Configuration S is generated, the hash table is searched to check if there already is a configuration equivalent tO S. If none is found, S is added to the search space and stored in the hash table. If an equivalent configuration P is found, then we process S as follows. Let depth(S) be h and depth(P) be k. If h is equal tO k, then S is pruned because S cannot lead to a better solution than P. In Figure 2.2, configuration A1 can be pruned since there is an equivalent configu- ration A at the same depth. If h is less than k, then S is added tO the search space since S leads tO a better solution than P does. In Figure 2.2, if E2 is generated earlier than E, the hash table gives a hit when E is generated. E is still added to the search space because it is at a lower depth. The hash table is also updated in this case so that among all the equivalent configurations, only the one with the best(least) depth is stored in the hash table. Thus, our scheduling algorithm exploits two important principles Of dynamic program- ming [1]. e First, it efficiently constructs solutions to larger subinstances Of the original problem (child configurations) from the solutions to a smaller subinstance (parent configura- tion). At each configuration S, we only need to save ready(S), active(S), depth(S) and parent( S). This reduces the memory requirement drastically because when the search space is huge, the memory required is primarily determined by the amount Of data stored at each configuration. The predecessor and successor information of Operations can be looked up the transitive closure Of the adjacency matrix, which is computed once prior to exploring the search space. 20 e It exploits the principle Of Optimality to prune the search space. 2.4 More Pruning of the Search Space We explain three pruning methods in this subsection. The first pruning method is based on an important Observation about equivalent configurations. Lemma 2.3 : Let S and P be two equivalent configurations. Let depth(S) be h and depth(P) be It. Let h be greater than k and let P’ be the predecessor configuration Of S at depth 1:. Then, P’ and all its successor configurations can be pruned. Proof : The configurations P and S have the same set of unscheduled Operations due tO Lemma 2.2. Since P’ is a predecessor configuration Of S, the set Of unscheduled Operations at P’ is a superset of the set of unscheduled Operations at S. Thus, the set Of unscheduled Operations at P’ is a superset of the set Of unscheduled Operations at P and they are at the same depth. Hence, P’ cannot lead to a better solution than the best solution from P. Hence, the lemma. ‘ . D In Figure 2.2, E3 and E are equivalent. Hence, E3 as well as its parent F can be pruned. I The following additional condition is added to the basic dynamic programming scheduling procedure to accommodate this pruning method. 
2.4 More Pruning of the Search Space

We explain three pruning methods in this subsection. The first pruning method is based on an important observation about equivalent configurations.

Lemma 2.3: Let S and P be two equivalent configurations. Let depth(S) be h and depth(P) be k. Let h be greater than k, and let P' be the predecessor configuration of S at depth k. Then, P' and all its successor configurations can be pruned.

Proof: The configurations P and S have the same set of unscheduled operations, due to Lemma 2.2. Since P' is a predecessor configuration of S, the set of unscheduled operations at P' is a superset of the set of unscheduled operations at S. Thus, the set of unscheduled operations at P' is a superset of the set of unscheduled operations at P, and they are at the same depth. Hence, P' cannot lead to a better solution than the best solution from P. Hence, the lemma. □

In Figure 2.2, E3 and E are equivalent. Hence, E3 as well as its parent F can be pruned. The following additional condition is added to the basic dynamic programming scheduling procedure to accommodate this pruning method.

if (depth(S) > depth(P)) {
    find P', the predecessor configuration of S at depth depth(P);
    prune P' and all of its successor configurations
}

The other two pruning methods are based on an upper bound for the performance. In order to use these pruning methods, we find a schedule f using a list scheduling algorithm with the ALAP value as the priority function. Then, we search the design space targeting a schedule with a performance better than that of f, i.e. a leaf configuration at a depth less than C_f. A configuration is called an infeasible configuration if it cannot lead to a schedule with the target performance. Before the search space is explored, the ALAP values of all operations are computed with respect to the target performance.

The second pruning method is based on the precedence constraints of the DFG. When finding a child of R, all the operations in ready(R) with ALAP values equal to depth(R) + 1 must be scheduled in the current control step. If this violates the resource constraints, R is an infeasible configuration and hence can be pruned from the search space. By forcing the operations with ALAP values equal to depth(R) + 1 into the active set of any child configuration of R, we can reduce the number of child configurations of R. In Figure 2.2, from configuration L, D1 will not be generated at all, because operations 1 and 2 should be included in the active set of any child of L with a target performance of 5 steps.

The third method is based on both precedence constraints and resource constraints. After S is generated, it is checked to see if S is an infeasible configuration. This is done by estimating a lower bound on the number of control steps from S to schedule the remaining unscheduled operations at S. If it exceeds the target performance, S is an infeasible configuration and hence is pruned. We compute the lower bound as follows. At a configuration S, for each resource type t in T we find a separate lower bound on the number of control steps to schedule the remaining operations of type t. The maximum of all these lower bounds gives a lower bound on the number of control steps from S. We compute the lower bound gamma for any type t from a configuration S as gamma = a + b + c, where

a is the minimum number of control steps after the current control step before any of the type t unscheduled operations can be scheduled;

b is the minimum number of control steps to complete any schedule of G after the last of the type t operations is scheduled; and

c is the minimum number of control steps to schedule the unscheduled operations of type t, assuming that all the N(t) resources will be used in all control steps.

We defined new data structures that are computed only once prior to design space exploration, using which a and b can be computed in O(k) time, where k is the number of ready operations at S [19]. These data structures and the computation of a and b are described in Chapter 4. The value of c is computed by dividing unsch(S, t) by N(t). At each configuration S, we store unsch(S, t) for each type t; these values can be easily computed from the corresponding values at the parent configuration. At the root configuration, this quantity for each type is equal to the number of operations of that type in the DFG. Further details on the lower-bound estimation method are given in Chapter 4.
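A small sketch of this per-type bound follows, with a and b taken as already computed (their O(k) computation is the subject of Chapter 4) and c obtained by a ceiling division; the overall bound for a configuration is then the maximum of gamma over all types. The numbers in the example are hypothetical.

#include <stdio.h>

/* gamma = a + b + c for one resource type t at a configuration S */
int type_lower_bound(int a, int b, int unsch_t, int navail_t)
{
    int c = (unsch_t + navail_t - 1) / navail_t;   /* ceil(unsch(S,t) / N(t)) */
    return a + b + c;
}

int main(void)
{
    /* e.g. 5 unscheduled multiplications on 2 multipliers, with a = 1, b = 2 */
    printf("gamma = %d\n", type_lower_bound(1, 2, 5, 2));   /* 1 + 2 + 3 = 6 */
    return 0;
}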
2.5 Extension of Models

2.5.1 Multi-cycle Operations

The definitions in Section 2.2 already account for multi-cycle operations. The only problem that needs to be addressed for multi-cycle operations is the handling of unfinished operations. First, the definition of the equivalence of configurations needs to be modified as follows. Two configurations S1 and S2 are called equivalent if (i) ready(S1) = ready(S2) and unfinished(S1) = unfinished(S2), and (ii) for all x in unfinished(S1) and unfinished(S2), x has finished the same number of control steps at both S1 and S2.

Lemma 2.2 still holds with this modified definition of equivalence, and the set of unscheduled operations at S1 and S2 is the same. Since the set of unfinished operations is also the same and each unfinished operation has the same number of control steps yet to complete, it is going to take the same number of control steps to schedule the unscheduled operations from S1 and S2.

At each configuration S, we also save unfinished(S) and the number of control steps that each x in unfinished(S) has finished. We save these quantities in the hash table as well. To generate a child configuration S of R, all operations in unfinished(R) should necessarily be included in active(S). The number of resources available for the operations to be scheduled at the current step should be updated accordingly. The subset of ready(R) to be scheduled is chosen subject to the updated resource availability. The set unfinished(S) should then be determined; it is a subset of active(S). In computing ready(S), we should also check unfinished(S) in addition to the sets explained in Section 2.3. An operation is not included in ready(S) if it has a predecessor in unfinished(S).

2.5.2 Pipelining Functional Units

If pipelining functional units are used, a functional unit can be assigned to more than one operation in an overlapping fashion. Let delta_t be the latency with which functional units of type t can be pipelined. A functional unit of type t occupied with an unfinished operation x at a control step is still available to another operation if x has finished at least delta_t control steps. Hence, when an unfinished operation x at a configuration R is included in the active set of all its child configurations, the number of available resources need not be updated if x has finished at least delta_t control steps at R.

2.6 Exploring the Search Space

2.6.1 Breadth First Search (BFS)

In BFS, the design space is explored in breadth-first fashion, i.e. all the configurations at depth d are generated before any configuration at depth d + 1 is generated. If a leaf configuration is reached during the exploration, it represents an optimal schedule and the search can be terminated.

When a new configuration is generated, it is checked to see if it is an infeasible configuration. If it is not, then the dynamic programming technique is used to check if an equivalent configuration already exists. If none exists, the current configuration is added to the search space. Configurations are added to the rear of a queue of configurations as they are generated. Those configurations are retrieved from the front of the queue one at a time, and all of their child configurations are generated and added to the queue. This search process is continued until a leaf configuration is reached.
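Schematically, the breadth-first exploration is the following loop. The queue primitives and the expansion, feasibility and equivalence tests stand in for the machinery described above, so this is an outline under our own names rather than the thesis's actual code.

#include <stddef.h>

struct config;   /* ready, active, depth, parent, as sketched earlier */

/* Queue primitives and the tests described above; supplied elsewhere. */
extern struct config *dequeue(void);
extern void           enqueue(struct config *s);
extern int            queue_empty(void);
extern int            is_leaf(struct config *s);       /* complete schedule? */
extern int            infeasible(struct config *s);    /* lower-bound test   */
extern int            keep_config(struct config *s);   /* equivalence test   */
extern int            num_children(struct config *s);
extern struct config *child(struct config *s, int i);  /* schedule one step  */

/* In BFS, the first leaf reached is an optimal schedule. */
struct config *bfs(struct config *root)
{
    enqueue(root);
    while (!queue_empty()) {
        struct config *s = dequeue();
        if (is_leaf(s))
            return s;
        for (int i = 0; i < num_children(s); i++) {
            struct config *c = child(s, i);
            if (!infeasible(c) && keep_config(c))
                enqueue(c);
        }
    }
    return NULL;   /* no schedule meets the target bound */
}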
2.6.2 Depth First Search (DFS)

In DFS, the design space is explored in depth-first fashion, i.e. a complete path from the root configuration to a leaf configuration is explored before the next path is examined. The exploration of a particular path is continued until either an infeasible configuration or a leaf configuration is reached. In DFS, a leaf configuration does not necessarily represent an optimal schedule.

From a configuration R, a child configuration R' of R is generated. Then, it is checked to see if it is an infeasible configuration. If it is not, then the dynamic programming technique is used to check if an equivalent configuration already exists. If none exists, R' is added to the search space. Then, a child of R' is generated, and this process is recursively continued until an infeasible configuration or a leaf configuration is reached. When all paths from a configuration R' are explored, the search returns to its parent R. Then the next child of R is found and the recursive process is repeated. When an infeasible configuration is reached, it is ignored and the next child of its parent is generated. If a leaf configuration is reached, it means that a better schedule than the current best has been found. If this solution is as good as the lower bound from the root, the search is terminated. Otherwise, we save it as the current best schedule, update the target, update the ALAP values with respect to the new target, and continue the search.

Our depth-first approach is different from branch and bound in that it uses dynamic programming techniques to explore the design space. It exploits the concept of equivalent configurations to prune the search space. Our algorithm also uses effective methods to recognize as early as possible whether a path cannot lead to a target solution.

2.6.3 BFS vs DFS

The major advantage of BFS over DFS is that among equivalent configurations, more promising ones are always generated first in BFS. Hence, the inferior ones are never explored further. Thus, BFS exploits the dynamic programming technique better than DFS. Another advantage is that no configurations deeper than any optimal solution will ever be generated in the BFS approach.

The DFS approach also has its own advantages over the BFS approach. The DFS approach takes much less memory than BFS, because all the configurations at all levels need not be stored. Whenever we reach a solution that is better than the current best, we can update the target for the rest of the search, and hence there can be more pruning. If a solution is as good as the lower-bound performance, the search can be terminated. If such a solution is found early on during the search, the DFS approach will perform faster than the BFS approach. If the optimal solution is very close to the solution obtained from the list scheduling algorithm, the design space of the BFS approach will be nearly the same as the design space of the DFS approach. In this case, the DFS approach has a higher probability of performing better than the BFS approach. If there are a lot of solutions with the target performance, then the DFS approach might be able to find one very quickly and hence perform well. However, if the optimal solution is far from the list scheduling solution, or if the number of solutions with the target performance is very small, the BFS approach might perform better, since the number of unnecessary configurations generated will be smaller in the BFS approach.

2.7 Experimental Results

The proposed dynamic programming scheduling algorithm was implemented in the C language. We implemented it using both the breadth-first approach and the depth-first approach (columns DPSB and DPSD in our tables), employing all the pruning methods explained in Section 2.4.
To measure the effect of the pruning methods alone, we also implemented both approaches without the pruning methods of Section 2.4 (columns BFSN and DFSN). The dynamic programming technique of pruning the equivalent configurations is used in all four implementations. The design space is exponential in size if there is no pruning. Thus, the experiments are performed so as to measure the effect of the dynamic programming technique and of all the other pruning methods.

The benchmarks we used are the AR Filter [27], the Differential Equation [55], the fifth-order elliptic Wave Filter [26], the FIR Filter [53], the complex Biquad recursive digital Filter [49] and the sixth-order elliptic Bandpass Filter [50]. We created larger examples for testing purposes by unrolling the embedded loops in the differential equation and fifth-order elliptic filter benchmarks. For the Biquad Filter example, we used three control steps for the multiplier and one for the ALU (we used the same resource type, ALU, to do addition, subtraction and comparison). For all the other examples, we used two control steps for multiplication and one for the ALU. We tested each of these benchmarks with a number of different resource constraints. In 70% of the cases, the lower-bound performance is the same as the list scheduling solution, so the dynamic programming scheduling procedure need not be invoked. We present the results for some of the other cases.

Table 2.1. Number of control steps and configurations with non-pipelined multiplier

                                Resources   Control steps   No. of configurations
DFG                              +    *     DPS   List      DPSB   BFSN   Ratio    DPSD   DFSN    Ratio
Wave Filter                      2    2     18    19        58     355    6.12     30     228     7.6
                                 2    3     18    19        81     262    3.23     34     374     11
                                 2    4     18    19        81     247    3.23     34     356     10.47
                                 1    2     28    28        8      5612   701.5    11     20214   1837.6
Once unrolled Wave Filter        2    2     35    36        333    749    2.25     334    13529   40.51
                                 2    3     35    36        337    559    1.66     247    1396    5.65
                                 2    4     35    36        328    529    1.61     243    1351    5.56
Twice unrolled Wave Filter       2    1     58    58        124    310    2.5      176    4449    25.28
                                 2    2     52    53        410    2268   5.53     1969   20210   10.2
Differential Equation            1    2     8     8         12     55     4.5      12     68      5.67
                                 1    3     7     7         3      22     7.33     3      26      8.67
Once unrolled Diff. Equation     2    2     14    14        15     281    18.73    15     359     23.93
                                 1    2     14    14        22     410    18.63    24     637     26.54
                                 3    2     14    14        15     275    18.33    15     359     23.93
Twice unrolled Diff. Equation    2    2     20    21        146    813    5.56     58     1278    22.03
                                 1    2     20    21        435    2661   6.11     131    6110    46.64
                                 3    2     20    21        127    673    5.3      58     1026    17.69
FIR Filter                       3    2     11    11        47     2906   61.83    47     5401    114.91
Biquad Filter                    3    3     13    13        12     125    10.42    12     274     22.83
                                 3    2     16    16        73     347    4.75     73     409     5.6
                                 2    3     14    14        51     759    14.88    59     1565    26.52
                                 3    4     13    13        8      170    21.25    8      204     25.5
Bandpass Filter                  3    2     13    14        195    552    2.83     72     723     10.04
                                 2    2     13    14        253    623    2.46     126    939     7.45
                                 2    3     11    11        16     180    11.25    16     192     12.0

DPS: Our solution (optimal)
List: List scheduling solution
DPSB: Dynamic Programming Scheduling with Breadth-first search
DPSD: Dynamic Programming Scheduling with Depth-first search
BFSN: Breadth First Search with no pruning other than pruning equivalent configurations
DFSN: Depth First Search with no pruning other than pruning equivalent configurations

Table 2.2. Number of control steps and configurations with pipelined multiplier

                                Resources   Control steps   No. of configurations
DFG                              +    *     DPS   List      DPSB   BFSN   Ratio    DPSD   DFSN    Ratio
Table 2.2. Number of control steps and configurations with pipelined multiplier

                       Resources   Control steps        No. of configurations
    DFG                 +   *      DPS   List    DPSB   BFSN   Ratio    DPSD   DFSN    Ratio
    AR Filter           2   2       13    13      132   5812   44.03     155   8785    56.68
                        3   2       13    13      134   5812   43.37     155   8417    54.30
    Wave Filter         2   2       18    19       80    247    3.24      34    356    10.47
                        2   3       18    19       81    247    3.23      34    356    10.47
                        2   4       18    19       81    247    3.23      34    356    10.47
                        2   1       19    19       23    412   17.91      23    703    30.56
                        1   1       28    28        8   7128  891         11  20216  1837.81
                        1   2       28    28        8   7114  889.25      11  20214  1837.6
    Once unrolled       2   2       35    36      327    529    2.25     334   1351     5.56
    Wave Filter         2   3       35    36      328    529    1.66     247   1351     5.56
                        2   4       35    36      328    529    1.61     243   1351     5.56
                        2   1       36    36      279    869    3.11     424   6206    14.64
    Twice unrolled      2   2       52    53      382   1743    4.56     645  20312    31.49
    Wave Filter         2   1       53    53      346   2492    7.20     543  18423    33.92
    Biquad              2   2       13    13      180    935    5.19     247   4479    18.13
    Filter              1   1       18†   19     1455   2536    1.74     162  28658   176.9
                        2   3       12    12       25    950   38         36   1757    48.8
                        3   1       14    14       32    635   19.84      32    984    30.75

    † Our solution for this case is one control step better than the best known result [64].

We present the results for the experiments for the non-pipelined and pipelined multiplier in Table 2.1 and Table 2.2 respectively. An extensive experimentation by R. Jain et al [28] indicates that a list scheduling algorithm with ALAP values as the priority function gives better schedules in more cases than a few other popular heuristics such as FDS [55], FDLS [54] and MAHA [52]. Hence, we implemented an ALAP list scheduling algorithm and its solutions are presented here (column List in our tables). The optimal solutions as obtained by our dynamic programming scheduling are in column DPS. Our experiments indicated that our approach is very efficient in terms of run-time performance. The average CPU time for our method is 0.5 seconds. The average CPU times for FDS, FDLS and MAHA on a SUN Sparc Station 1 are 33.6 seconds, 2.834 minutes and 2.58 seconds respectively [28]. We ran all our experiments on a SUN Sparc Station 2.

As a machine-independent measure of the efficiency of our dynamic programming technique and the effect of the pruning methods, we reported the number of configurations explored in each of the four implementations (DPSB, DPSD, BFSN and DFSN). We compare BFSN with DPSB and DFSN with DPSD, and the ratios of the numbers of configurations are shown in the tables. The big ratios between BFSN and DPSB, and between DFSN and DPSD, show the effectiveness of our pruning methods. The small number of configurations even without the pruning methods is an indication of the effectiveness of our dynamic programming technique. We mentioned that the breadth-first approach takes better advantage of the dynamic programming technique than the depth-first approach. This is illustrated by the fact that the number of configurations is much smaller for BFSN than for DFSN.

2.8 Conclusions

We presented a dynamic programming approach for the scheduling problem under resource constraints. It can handle multi-cycle operations and pipelined functional units. Using a dynamic programming technique combined with a set of pruning methods, our algorithm searches the design space efficiently and always finds an optimal solution. We have performed experiments with a wide range of resource constraints using a number of benchmark circuits from the literature. The experimental results show that the proposed dynamic programming technique and the other pruning techniques are very effective in reducing the search space drastically. Our algorithm can be used to find multiple optimal schedules. Then, we can choose one among them so as to minimize some other design parameter such as the number of registers. Also, a limit on any such parameter can be imposed as an additional constraint.
The next chapter describes how bus and register constraints can be incorporated into our scheduling algorithm.

CHAPTER 3

Scheduling under Bus and Register Constraints

3.1 Introduction

In obtaining efficient designs, synthesis systems have to consider constraints on design elements like buses and registers in addition to functional units. The number of buses and registers may have a considerable impact on the overall cost of the final design. Buses are needed for data transfers between functional units, registers, multiplexers, etc. Registers are needed to store values of the operands until they are consumed by all their respective operations. The operands include inputs to and outputs from the circuit as well as the outputs produced by operations for other operations to use. The number of buses/registers required in a control step depends on the actual operations scheduled in that control step, not just the number of operations. Hence, scheduling under bus and register constraints is more complex than scheduling under constraints on functional units alone.

Many heuristic algorithms for the scheduling problem consider one operation at a time for scheduling. In the presence of additional constraints like a constraint on the number of buses, scheduling an operation in a particular control step may destroy the possibility of scheduling more than one equally critical operation later in the same step. As an example, consider the following scenario for scheduling under a bus constraint. The number of buses required in a control step is calculated as the number of distinct inputs to all the operations scheduled in that control step. Suppose there are three operations A, B and C that can be scheduled at control step k. The inputs of A are P and Q. B and C have common inputs R and S. There is a bus constraint of two. There are functional units available to schedule any two operations. Let A, B and C have the same priority and let A appear before B and C in the list. So, A is considered first for scheduling and A is scheduled in control step k. Now, neither B nor C can be scheduled in the current control step because of the bus constraint. If either B or C had been considered first for scheduling, both could have been scheduled at control step k because they share inputs.

Scheduling under a register constraint is even more difficult. First, scheduling an operation in a control step can possibly reduce the register requirement in that control step. Second, the register requirement in a control step depends on the scheduling of several previous control steps. Hence, many synthesis algorithms allocate registers following scheduling [33, 75]. Some others [6, 44, 70] try to minimize the number of registers while scheduling, as opposed to imposing a constraint. MSS [9] produces multiple schedules and selects the one with the least number of registers. In our approach, since we schedule one control step (not one operation) at a time, we can handle a register constraint easily. In this chapter, we describe how bus and register constraints can be handled in our DPS algorithm. We present a comparison of our algorithm with a few other exact and heuristic scheduling algorithms that address bus constraints and register constraints/minimization.

3.2 Scheduling under Bus Constraint

In our method, a constraint on the number of buses can be imposed just like the constraint on the number of functional units. The number of buses in a control step is calculated as the number of distinct inputs to all the operations scheduled in that step. This is based on the model described in [25] for a bused architecture with broadcasting. The assumption is that the inputs and outputs are interleaved and the number of inputs is always more than the number of outputs.
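As an illustration of this rule, the following C fragment counts the distinct inputs of a tentative set of operations for one control step under the broadcasting model of [25]. The value-identifier encoding, the two-input bound and the constant MAX_VALUES are assumptions of the sketch, not the data structures of our implementation.

    #include <string.h>

    #define MAX_VALUES 1024        /* assumed bound on value identifiers */

    /* inputs[i][0..n_in[i]-1] hold the input value ids of operation i.
       Returns the bus requirement of the step under broadcasting. */
    int bus_requirement(int n_ops, const int inputs[][2], const int n_in[])
    {
        char seen[MAX_VALUES];
        int  i, j, distinct = 0;

        memset(seen, 0, sizeof seen);
        for (i = 0; i < n_ops; i++)
            for (j = 0; j < n_in[i]; j++)
                if (!seen[inputs[i][j]]) {      /* first occurrence of a value */
                    seen[inputs[i][j]] = 1;
                    distinct++;
                }
        return distinct;
    }

In the scenario above, scheduling A alone already gives a count of two (P and Q), so B and C no longer fit; scheduling B and C together also gives a count of two (the shared R and S), which is why the order of consideration matters.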
In our algorithm, when a configuration is generated, we compute the number of buses by finding the number of distinct inputs to all the operations scheduled in the control step corresponding to that configuration. If it exceeds the bus constraint, we consider that configuration infeasible. In a bused architecture with no broadcasting capability, the number of buses in a control step is calculated as twice the number of operations scheduled in that step [24]. The number of control steps without any bus constraint is never more than the number of control steps with a bus constraint. Hence, the lower bounds used in our algorithm are valid even with a bus constraint.

3.3 Scheduling under Register Constraint

At each configuration we save the set of values active at that configuration (register contents). For each value v, we save the number of unscheduled operations that use v. When a new configuration S is generated, the register usage details are copied from its parent configuration and updated according to the newly scheduled operations. Then, if there are not enough free registers to accommodate the values produced at S, S is pruned. Registers are filled using the left-edge algorithm as described in [33]. The left-edge algorithm guarantees optimal register usage if no conditional branches exist. The number of control steps without any register constraint is never more than the number of control steps with a register constraint. Hence, the lower bounds used in our algorithm are valid even with a register constraint.
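A sketch of this per-configuration bookkeeping in C follows. The structures and the helper uses_of are assumptions; in particular, the sketch frees a register as soon as the last reader of its value is scheduled, which simplifies the treatment of values read and overwritten in the same step.

    typedef struct {
        int value_id;
        int pending_uses;      /* unscheduled operations still reading it */
    } LiveValue;

    typedef struct {
        LiveValue live[64];    /* assumed bound on simultaneously live values */
        int n_live;
    } RegState;

    extern int uses_of(int value_id);  /* total readers of a value */

    /* Updates the register state for one newly scheduled control step.
       Returns 0 on success, -1 if the configuration must be pruned. */
    int update_registers(RegState *s,
                         const int consumed[], int n_consumed,
                         const int produced[], int n_produced,
                         int reg_limit)
    {
        int i, j;
        for (i = 0; i < n_consumed; i++)
            for (j = 0; j < s->n_live; j++)
                if (s->live[j].value_id == consumed[i]) {
                    if (--s->live[j].pending_uses == 0)      /* value dies  */
                        s->live[j] = s->live[--s->n_live];   /* free a slot */
                    break;
                }
        for (i = 0; i < n_produced; i++) {
            if (s->n_live == reg_limit)
                return -1;                     /* not enough free registers */
            s->live[s->n_live].value_id     = produced[i];
            s->live[s->n_live].pending_uses = uses_of(produced[i]);
            s->n_live++;
        }
        return 0;
    }

The actual binding of values to physical registers is then done with the left-edge algorithm of [33], as stated above; the sketch only maintains the count that decides pruning.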
Table 3.1. Results for the fifth-order elliptic wave filter

                     non-pipelined multiplier               pipelined multiplier
     Resources       Control steps         CPU (sec.)   Control steps         CPU (sec.)
     +   *  Bus    DPS  List  ALPS [26]       DPS     DPS  List  ALPS [26]       DPS
     1   1   2      30   31     35            3.89     29   30      -            7.52
     2   2   4      18   19     20            0.12     18   19      -            0.16
     2   1   4      21   21     21            0.09     19   19     20            0.15
     3   3   6      17   17     17            0.08     17   17      -            0.08
     3   2   6      18   18     18            0.08     17   17     17 [25]       0.07
     2   2   6      18   19     18            0.12     18   19      -            0.12
     3   3   8      17   17     17            0.08     17   17      -            0.08
     3   2   8      18   18     18            0.1      17   17      -            0.07
     2   3   6      18   19      -            0.15     18   19      -            0.19
     2   3   8      18   19      -            0.15     18   19      -            0.19

3.4 Experimental Results

3.4.1 Results for scheduling under bus constraints

Previously, scheduling under bus constraints was given major consideration in the ALPS system [25, 26, 24]. To our knowledge, the results in [26] are the best results for scheduling under bus constraints. We incorporated scheduling under bus constraints into our dynamic programming scheduling (DPS) algorithm and compared our results with those of ALPS. The results indicate that our methods are much faster and better. For the fifth-order elliptic filter and the once unrolled fifth-order elliptic wave filter (Tables 3.1 and 3.2 respectively), we used two control steps for multiplication and one for the ALU. For the sixth-order elliptic bandpass filter (Table 3.3) and the Fast Discrete Cosine Transform (FDCT) [37] (Table 3.4), we used all uni-cycle operations as in [26]. In the first part of Table 3.3, additions and subtractions are executed on ALUs; in the second part they are executed on adders and subtracters, respectively. In Table 3.4 also, they are executed on adders and subtracters. The CPU times of our results are on a SUN Sparc Station 2. The CPU times of ALPS are on a VAX 11/8550. The Dhrystone MIPS ratings indicate that our system is nearly 5 times as fast as their system. The speedup we obtained is up to a few hundred times.

Table 3.2. Results for the once unrolled fifth-order elliptic wave filter

                     non-pipelined multiplier               pipelined multiplier
     Resources       Control steps         CPU (sec.)   Control steps    CPU (sec.)
     +   *  Bus    DPS  List  ALPS [26]       DPS     DPS  List             DPS
     1   1   2      63   63     70            0.26     61   62            12.79
     2   2   4      36   37     39            0.49     35   37             0.48
     2   1   4      40   40     40            0.29     36   37             0.31
     3   3   6      33   33     33            0.23     33   33             0.24
     3   2   6      34   34     34            0.27     33   33             0.23
     2   2   6      35   36     35            0.52     35   36             0.54
     3   3   8      33   33     33            0.21     33   33             0.23
     3   2   8      34   34     34            0.27     33   33             0.24
     2   2   5      35   36      -            0.49     35   36             0.50
     2   3   5      35   36      -            0.43     35   36             0.51
     2   3   4      36   37      -            0.48     36   37             0.48

For FDCT, the results of FDLS as reported in [26] are also presented. Our solutions are better than the solutions from list scheduling, ALPS and FDLS. When there are bus constraints, the list scheduling algorithm fails to give an optimum solution in many cases, while our method always gives an optimum solution. For the fifth-order elliptic wave filter (Table 3.1), the number of configurations explored is less than a hundred except in one case. The example of the once unrolled wave filter contains 52 additions and 16 multiplications. For this example (Table 3.2), the number of configurations explored is always under 400 except for one case. For both benchmarks, the number of configurations never exceeded a few thousand. Thus, the number of configurations and the CPU times of our method are very small and practical even for considerably large examples. We reported our CPU times and the CPU times of ALPS in Tables 3.3 and 3.4. For the example of the bandpass filter, our method is faster than ALPS by up to a few hundred times. For FDCT, we obtained a speedup of up to more than a thousand times.

Table 3.3. Results for the sixth-order elliptic Bandpass filter

     Resources            Control steps            CPU time (seconds)
     +,-    *  Bus     DPS  List  ALPS [26]        DPS    ALPS [26]
      3     2   8        8    8      8             0.06      1.0
      2     2   8        9    9      9             0.07      4.2
      2     1   6       12   13     13             0.53     40.0
      1     1   4       17   17     17             0.06      8.8
      2     2   4        9    9      -             0.06       -
      2     1   4       12   13      -             0.51       -
     2,2†   1   2        8    8      8             0.06      1.3
     2,1†   1   2        8    9      -             0.07      4.1
     1,1†   2   8       11   11     11             0.07     18.2
     1,1†   1   6       13    -      -             0.47     19.3

     † Additions and subtractions are performed on adders and subtracters.

Table 3.4. Results for the Fast Discrete Cosine Transform

     Resources               Control steps                CPU time (seconds)
     +   -   *  Bus     DPS  List  ALPS [26]  FDLS        DPS    ALPS [26]
     3   3   8   16       6    6      6         7         0.08      3.7
     2   2   4   12       7    8      7         8         0.14      5.8
     2   2   3   12       8    9      8         9         0.12    142.2
     2   2   2   10      10   11     10        11         0.14     63.1
     1   1   2    8      13   14     14        14         0.12     85.1
     1   1   1    6      18   21     20        20         0.23    132.3

In all our results, we considered buses with broadcasting as explained in [25]. It was not clearly explained in [26] how the number of buses in a control step is computed. For the cases where there is a result from [25], we cite the reference next to the ALPS result in our tables. For those cases, the number of control steps is the same in both [25] and [26]. We also experimented assuming that there is no broadcasting. For the examples of Table 3.1 and Table 3.2, our results in terms of the number of control steps are identical with those of [26]. For the examples of Table 3.3 and Table 3.4, however, we obtained better results even without broadcasting. For these benchmarks, our results with or without broadcasting are identical. The CPU times are also nearly the same. Thus, our results are better and much faster than those of [26], with or without broadcasting of buses.
3.4.2 Results for scheduling under register constraints

We then experimented including a register constraint and compared our results with InSyn [70] in Table 3.5. InSyn allows delayed transfer of output data to minimize register usage and uses a concept of reusable data values to minimize bus activity. We obtained better results in some cases even without using those concepts. InSyn reported an average CPU time of ~15 sec. to produce each design point on a Sun SPARCstation SLC. Our method took less than 1 sec. on the average on a SUN Sparc-2. Thus, our method produced better results and is much faster than InSyn.

Table 3.5. Comparison with the InSyn heuristic method [70]

                                                non-pipelined          pipelined
                                                 multiplier            multiplier
     DFG                 +   *  Bus  Reg.     DPS   InSyn [70]      DPS   InSyn [70]
     Fifth-order         1   1   4    8        28      28            28      28
     EW Filter           2   1   4    8        21      22            19      19
                         2   2   4    8        19      19            18       -
                         3   1   6    8        19       -            18      19
     Twice unrolled      1   1   6   10        84      84            84      84
     fifth-order         2   1   4   10        60      60            57      57
     EW Filter           3   2   6   10        50      51            51       -
                         2   2   6   10        52      52            52      52
     Bandpass            1   1   4    8        25      25            17      17
     Filter              2   1   2    7        25      28            24      24
                         2   2   4    8        13      14            13      13
                         2   3   6    7        11      11            10      10
     Biquad†             1   1   4    6        28      27            19      19
     Filter              1   2   4    5        20      19            18      18
                         2   3   4    5        15      15            13      13
                         3   4   2    3        19      19            19       -

     † Complex multiplier uses three control steps.

We also compared our register usage results and CPU times with MSS [9] for the fifth-order elliptic wave filter in Table 3.6. The results show that our method not only produced designs that use a smaller number of registers (two less in all cases) but also took much less CPU time to produce the designs (up to a few hundred times faster). Their method was implemented on a SparcCenter 2000.

Table 3.6. Comparison with the multi-schedule approach [9]

     Resources                       Registers            CPU time (seconds)
     +   *   Control steps        DPS    MSS [9]          DPS    MSS [9]
     3   3        17               8       10             0.09     3.8
     2   2        19               8       10             0.79    39.1
     2   1        21               8       10             0.03    25.3
     3   2†       17               8       10             0.2      3.9
     2   1†       19               8       10             0.18    43.9

     † Pipelined multiplier is used for this case.

Table 3.7 shows a comparison with the symbolic technique [60] for two large examples, EWF2 and EWF3 (the fifth-order elliptic filter unfolded two times and three times respectively). For many cases shown in Table 3.7, their exact method could not finish in one CPU hour or confirm the optimality of the schedule obtained by their heuristic method. In Table 3.7, we reported the results of their exact method for the cases it could solve. Otherwise, we reported the results of their heuristic method. As can be seen from the results in Table 3.7, our method is much faster than even their heuristic method. In some cases, we obtained a speedup of up to a few thousand times. Our results are superior in terms of register usage. We used buses with broadcasting and obtained schedules with fewer control steps.
Table 3.7. Comparison with the symbolic technique [60]

                     Resources      Control steps     Registers     CPU time (seconds)
     Benchmark      +   *   Bus     Ours   Symb.     Ours  Symb.     Ours    Symb.‡
     EWF2           1   1    2       64     70        12    14       0.12      94.4
                    2   2    4       36     39        10    11       0.5     2700
                    3   2    6       35     35        10    11       0.13      21.52
                    1   1†   2       63     68        12    14       0.15      88.3
                    2   1†   4       36     39        10    11       0.44    1850
                    3   2†   6       33     33        10    11       0.1        4.49
     EWF3           1   1    2       96    105        12    15       0.33     372.9
                    2   1    4       59     59        12    12       0.32      31.5
                    2   2    4       55     58        12    13       1.23      39.8
                    3   2    6       50     50        12    12       0.22      62.66
                    1   1†   2       94    102        12    15       0.27     356.7
                    2   1†   4       54     58        12    12       1.03      42.3
                    2   2†   6       52     52        12    12       1.37     550
                    3   2†   6       49     49        12    12       0.2       10.52

     † Pipelined multiplier is used for this case.
     ‡ Their experiments were run on a Sun SPARCstation 10.

3.5 Conclusions

We have extended our DPS algorithm to handle scheduling under bus and register constraints in addition to the constraints on the number of functional units. The extensions preserve the simple structure and efficiency of the original algorithm. Extensive experiments on large problems show that our method is an order of magnitude faster than the existing exact scheduling methods.

CHAPTER 4

Lower-bound Estimation

4.1 Introduction

For a synthesis system to produce efficient designs, it should have the capability to consider different cost-performance trade-offs. So, a scheduler has to explore the design space with a variety of resource constraints. Instead of producing schedules with each and every resource constraint, a scheduler can use estimation to identify and prune inferior designs. For an estimation tool to be useful, it has to be much faster than the actual scheduler and the lower bounds it produces should be as tight as possible. In this chapter, we describe an efficient estimation technique for lower-bound performance and compare the results with some of the best in the literature.

As explained in Chapter 2, the DPS algorithm estimates the lower-bound completion time from each configuration to check if it is an infeasible configuration. Any such scheduling algorithm, as well as branch and bound methods, needs a dynamic estimation method for partial schedules to detect partial schedules that are not going to lead to an optimal schedule. Since the number of partial schedules is generally very high in a design space search process, the dynamic estimation should be much faster than the lower-bound estimation for the entire DFG. At the same time, it should be effective and be able to detect inferior partial schedules as early as possible in the exploration. In this chapter, we also describe an extension to our lower-bound estimation method to find such dynamic lower bounds. None of the existing methods for lower-bound estimation has addressed the problem of dynamic estimation.

4.2 Previous Work

There are several methods proposed in the literature for the lower-bound estimation of cost as well as performance. Jain et al [29] proposed a mathematical model for predicting the area-delay curve. Their lower-bound method is very fast, but it is rather simplistic and does not consider precedence constraints at all. The technique proposed by Fernandez and Bussell [13] computes the minimum number of operations which must be scheduled in each sub-interval of time steps. It then derives the maximum increase in total execution time, over all intervals, for not having enough processors to accommodate all the operations in an interval. Their method considers only homogeneous resources and can be applied only to multiprocessor schedules. Their method has been extended to high-level synthesis by Sharma et al [69]. They compute the increase in the length of each interval due to the concentration of each type of operations in that interval.
They also address lower bounds on area cost including interconnect cost. Their method has a computational complexity of O(nc²) where n is the number of nodes in the DFG and c is the critical path length. The method proposed by Ohm et al [46] estimates lower bounds on functional units as well as registers. Their technique for functional unit estimation is a refinement of the basic technique of [69]. It is not applicable to lower-bound performance estimation. The complexity of their method is O(n(c² + n + e)) where e is the number of edges in the DFG. The method proposed in [65] uses a relaxation technique of the ILP formulation of the scheduling problem for the lower-bound estimation of performance under resource constraints. It has a computational complexity of O(n + cC) where C is the number of time steps, and produced lower bounds as good as [69] on many benchmarks. The method in [58] is similar to [65] in that it relaxes the precedence constraints and solves the relaxed problem using a slack-driven list scheduling algorithm. Hu et al [23] proposed a method to estimate lower bounds on iteration time and functional unit cost for functionally pipelined DFGs. The complexity of their method is O(nck²) where k is the initiation latency. A recursive technique is proposed in [34] for lower-bound performance estimation and it has a complexity of O(n² + c²). Our method for the lower-bound performance estimation of entire DFGs has a complexity of O(n + c²).

4.3 Model and Definitions

For all v ∈ V, ASAP(v) (as soon as possible) is the earliest time step in which v can be scheduled to start execution, assuming unlimited resources. For all v ∈ V, we define MSAT(v) (Minimum Steps After This operation) as the minimum number of time steps that any schedule of G is going to take after the completion of operation v, assuming unlimited resources. The ASAP values can be computed in a top-down fashion starting from the input operations as follows. If v is an input operation, ASAP(v) = 1. Otherwise, ASAP(v) = max{ASAP(u) + d_u : (u,v) ∈ E}. The MSAT values are similarly computed in a bottom-up fashion starting from the output operations as follows. If v is an output operation, MSAT(v) = 0. Otherwise, MSAT(v) = max{MSAT(w) + d_w : (v,w) ∈ E}. The critical path length c is the minimum number of time steps that any schedule of G is going to take, assuming unlimited resources. It can be computed as max{ASAP(v) + d_v − 1} with the maximum taken over all output operations v.

If a resource type is pipelined, one instance of that resource type can be assigned to more than one operation in an overlapping fashion. The latency with which they can overlap is denoted by δ_t. Clearly, δ_t < D(t). Chaining of operations is handled by dividing time steps into time units and extending the definitions of the MSAT and ASAP values in terms of time units. Let the time period of a time step be T time units. Let μ_v denote the execution time of an operation v in terms of time units. Let A(v) and M(v) be the ASAP and MSAT values of an operation in terms of time units. The A and M values can be computed similarly to the computation of the ASAP and MSAT values using the μ_v values and under the following requirement. For any (u,v) ∈ E, if v cannot be chained to u and if u finishes execution during time step d, then v can start execution only at the beginning of time step d + 1. The ASAP and MSAT values can be easily computed from the respective A and M values.
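These two recurrences translate directly into two linear passes over the graph. The following C sketch assumes the operations are numbered in topological order and that predecessor/successor lists with at most two entries are available; those representation details are assumptions of the sketch.

    #define MAX_OPS 256

    extern int n_ops;
    extern int delay[MAX_OPS];                     /* d_v in time steps */
    extern int n_pred[MAX_OPS], pred[MAX_OPS][2];
    extern int n_succ[MAX_OPS], succ[MAX_OPS][2];

    int asap[MAX_OPS], msat[MAX_OPS];

    void compute_asap_msat(void)
    {
        int v, i, u, w;
        for (v = 0; v < n_ops; v++) {              /* top-down pass      */
            asap[v] = 1;                           /* input operations   */
            for (i = 0; i < n_pred[v]; i++) {
                u = pred[v][i];
                if (asap[u] + delay[u] > asap[v])
                    asap[v] = asap[u] + delay[u];
            }
        }
        for (v = n_ops - 1; v >= 0; v--) {         /* bottom-up pass     */
            msat[v] = 0;                           /* output operations  */
            for (i = 0; i < n_succ[v]; i++) {
                w = succ[v][i];
                if (msat[w] + delay[w] > msat[v])
                    msat[v] = msat[w] + delay[w];
            }
        }
    }

The critical path length c then falls out as the maximum of asap[v] + delay[v] - 1 over the output operations.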
If two operations u and v are chained, the functional unit executing u cannot be freed until v is finished [68]. If v spans across time steps, this may result in under-utilization of resources. So, we assume that v can be chained to the end of another operation only if v starts and finishes execution in the same control step. We impose the following necessary condition, as in [68]. An operation v can be chained at the end of operation u only if (u,v) ∈ E, μ_u mod T ≠ 0 and T − (μ_u mod T) ≥ μ_v. Note that this is not a sufficient condition and the scheduling algorithm should make sure that v starts and finishes execution in the same step. Clearly, it has an impact only on the sharpness, not the validity, of the lower bound.

4.4 Lower-bound Estimation for Performance

The intuitive idea behind our lower-bound estimation is as follows. For each resource type t, we group the operations of that type into three non-overlapping intervals and compute a lower bound as the sum of the lengths of those intervals. The final lower bound is the maximum among all resource types and all possible groupings of operations of each type.

Let P_t be the number of operations of type t in the DFG. If there are L(i,t) operations of type t with an ASAP value less than or equal to i, then there are at least P_t − L(i,t) type t operations that cannot be scheduled in the first i time steps. Similarly, if there are M(j,t) type t operations with an MSAT value less than j, then there are at least P_t − M(j,t) type t operations that cannot be scheduled to execute in the last j time steps. Thus, there are at least P_t − L(i,t) − M(j,t) type t operations that can neither be scheduled in the first i time steps nor be scheduled to execute in the last j steps of any schedule. The three intervals considered are: the first i steps (interval_1), the last j steps (interval_3) and an interval between the two (interval_2) that does not overlap with the other two. The lengths of interval_1 and interval_3 are i and j respectively. The length of interval_2 depends on the minimum number of type t operations in that interval as well as the number of available resources of type t. The number of operations depends on the ASAP and MSAT values of operations, which are determined by data dependencies. Thus, the lower-bound estimation takes into account both precedence constraints and resource constraints.

Let n be the number of type t operations in interval_2 and let r = N(t). These n operations can be scheduled into ⌈n/r⌉ stages. If type t resources are not pipelined, each stage takes D(t) time steps where D(t) is the delay of a type t resource. If they can be pipelined with a latency δ_t, each stage except the last takes δ_t steps. The last stage in either case takes D(t) steps. Hence, the minimum length of interval_2, a, can be computed as follows:

    a = ⌈n/r⌉ · δ_t + D(t) − δ_t    if resource type t is pipelined
    a = ⌈n/r⌉ · D(t)                otherwise.

If n ≤ 0, then a = 0. Note that it takes at least i time steps before any of the n operations in interval_2 can start execution and at least j steps after the last operation has finished execution. The lengths of interval_1 and interval_3 are independent of the set of operations scheduled in those intervals. Thus, for the purposes of lower-bound estimation, the three intervals are non-overlapping. We denote the minimum number of type t operations in interval_2 by q(i,j,t) and the minimum length of interval_2 due to type t operations by h(i,j,t). As explained above, q(i,j,t) = P_t − L(i,t) − M(j,t), and the value of h(i,j,t) can be computed from q(i,j,t). Let λ be equal to max{h(i,j,t) + i + j}, with the maximum taken over all resource types t and all i, j with i + j ≤ c.
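Figure 4.1 below outlines the full procedure; as a concrete illustration of this formula, the following C sketch evaluates h(i,j,t) and the bound λ from precomputed L and M tables. The array layout (indexed by time step and resource type) and the small bounds are assumptions made for the illustration.

    #define MAX_C     128   /* assumed bound on the critical path length */
    #define MAX_TYPES 8

    extern int n_types, c;
    extern int P[MAX_TYPES];             /* P_t: type-t operation count   */
    extern int N[MAX_TYPES];             /* N(t): available units         */
    extern int D[MAX_TYPES];             /* D(t): delay of a type-t unit  */
    extern int LAT[MAX_TYPES];           /* delta_t, or 0 if not pipelined */
    extern int L[MAX_C + 1][MAX_TYPES], M[MAX_C + 1][MAX_TYPES];

    static int h(int i, int j, int t)
    {
        int q = P[t] - L[i][t] - M[j][t];          /* q(i,j,t)            */
        int stages;
        if (q <= 0)
            return 0;
        stages = (q + N[t] - 1) / N[t];            /* ceil(q / N(t))      */
        return LAT[t] ? stages * LAT[t] + D[t] - LAT[t]
                      : stages * D[t];
    }

    int estimate_lower_bound(void)
    {
        int i, j, t, v, lambda = 0;
        for (t = 0; t < n_types; t++)
            for (i = 0; i <= c; i++)
                for (j = 0; i + j <= c; j++) {     /* non-overlapping intervals */
                    v = h(i, j, t) + i + j;
                    if (v > lambda)
                        lambda = v;
                }
        return lambda;
    }

This is the O(c²)-per-type evaluation referred to in the complexity analysis below.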
As ex- plained above, q(i, j, t) = P; — L(i, t) — M(j, t) and the value of h(i, j, t) can be computed from q(i,j, t). Let /\ be equal to mar(¢,,-+j56){h(i,j, t) + i + j}. 45 estimate_lower_bound() { Read the DFG, resource constraints and resource delays (1) Find the ASAP and MSAT values for all operations (2) Compute L(i,t) and M(i,t) for all i and t (3) Compute q(i,j,t) and h(i,j,t) for all i,j and t and compute lower bound (4) Figure 4.1. Outline of the procedure for lower-bound estimation of entire DFGs Lemma 4.1 : The expression A is a lower bound on the the completion time of any schedule of the given DFG. Proof : The above discussion implies that i + h(i, j, t) + j is a lower bound for a given 2', j and type t. The expression for A is the best lower bound among all i, j and t. The condition 3' + j S c makes sure that the intervals are non-overlapping. D An outline of the procedure for the lower-bound estimation is shown in Figure 4.4. Figure 4.2 shows an example for lower-bound computation. The ordered pair next to each node indicates its ASAP and MSAT values respectively. It is assumed that addition takes one time step and multiplication takes two. There are one adder and one non-pipelined multiplier. All values of L and M for multiplication and addition are shown in the figure. For example, the multiplication operations 1 and 2 each has ASAP value less than or equal to 2. Hence, L(2, at) is 2. Similarly, the addition Operations 6, 7, 8 and 9 each has MSAT value less than 2. Hence, M (2, +) is 4. For the example DFG, the maximum value for A is obtained when i=0 and j =2. Complexity analysis : Let n be the number of operations in the DFG. The number of edges in a DFG will grow 46 (5.0) L(1,+)=1 M(1,+)=2 L(1,*)=2 M(1,*)=0 L(2,+)=1 M(2,+)=4 L(2,*)=2 M(2,v)=0 L(3,+)=2 M(3,+)=5 L(3,4)=3 M(3,*)=1 L(4,+)=3 M(4,+)=6 L(4,*)=3 M(4,*)=2 L(5,+)=5 M(5,+)=6 L(5,*)=3 M(5,*)=3 L(6,+)=6 M(6,+)=6 L(6,*)=3 M(6,4)=3 q(0, 2, 4:) = P. - L(0, a) — M(2, at) = 3 A = h(0,2,*)+0+2 = 8 Figure 4.2. An example for the lower-bound estimation for the entire DFG 47 linearly in 11 since the number of inputs of each Operation is generally bounded by small number such as two. Hence, the first two steps require O(n) time because the ASAP and MSAT values can be found in 0(n) by traversing the DFG in a top-down and in a bottom-up fashion respectively. The number of resource types is generally bounded by a small number. For any i and t, L(i, t) and M (i, t) can be found recursively as follows. L(i, t) = L(i — 1, t) + A(z', t) where A(i, t) is the number of type t operations with ASAP value 2'. Similarly, M(i, t) = M(i — 1, t) + B(i — 1,t) where B(i — 1,t) is the number of type t operations with MSAT value 2' — 1. The values for A and B can be found during step 2 without affecting its complexity. Hence, step 3 takes 0(c) time. Finally, step 4 requires 0(c2) time because there are 0(c2) intervals. Thus, the complexity of our algorithm to estimate lower bound of the entire DFG is 0(n + c2). The method in [69] is similar to our method in that it estimates the length of each in- terval of time steps. It computes the required computation cycles of each type as the sum of minimum overlaps of all operations of that type in each interval. Then the difference in the required and available computation cycles of each type is divided by the number of available functional units of that type to get any increase in the length of that interval. 
In their method, for each interval the minimum overlap of each operation has to be determined. Hence, it has a complexity of O(nc²). Our method computes only the number of operations in each interval, in constant time, using some precomputed data structures, thus having a complexity of only O(n + c²). This is made possible by slightly modifying the way multi-cycle operations are considered and by using the MSAT values of operations. Their method allows a portion of a multi-cycle operation to overlap with an interval. In our method, a multi-cycle operation has to be completely inside an interval to be counted in that interval. This restriction allows a more effective computation of the length of each interval for non-pipelined operation types. We divide the number of operations of each type (as opposed to the difference in required computation cycles) by the available number of functional units of that type and then multiply it by the delay of that type.

4.5 Estimating a Lower Bound for a Partially Scheduled DFG

As in Chapter 2, we refer to a partial schedule as a configuration. The depth, ready set, unfinished set and unsch are similarly defined. The method explained in this section is applicable to any forward scheduling algorithm where operations are scheduled into non-decreasing control steps.

Scheduling algorithms such as branch and bound methods need to compute lower bounds on the completion times from partially scheduled DFGs. As can be seen from the experimental results in Chapter 2, the number of configurations can often be very large. If the methods in [69] or [65], or our method from the previous section, are used for the estimation of the lower-bound completion time from each configuration, the time spent in estimation itself may nullify the advantage of estimation. In this section, we present a faster extension for lower-bound estimation from configurations. Our method takes O(k) time to compute a lower bound for a configuration, where k is the number of operations in the ready and unfinished sets at the configuration. This is much faster than the methods in [69] and [65], which take O(nc²) and O(c²) time respectively to compute a lower bound for a partial schedule. Our method from the previous section also takes O(c²) time.

Our method for partial schedules is a special case of the method for entire DFGs. The basic idea is as follows. At a partial schedule, a subset of operations is already scheduled so as to satisfy precedence constraints as well as resource constraints. So, instead of considering all possible values for i and j, we can consider the following special case for each resource type t. We find I = max{i | L(i,t) = 0} and J = max{j | M(j,t) = 0}, thus maximizing h(i,j,t). Note that ASAP, L and M are defined with respect to the unscheduled portion of the DFG. Intuitively, I is the minimum number of time steps after the current step before any type t unscheduled operation can start execution. And J is the minimum number of time steps that any complete schedule from the current configuration takes after the last type t operation has finished executing.
The most important merit of our 49 lower_bound_partial(R,t) /* This procedure returns a lower bound on the number of control steps to schedule the remaining Operations of type t from configuration R */ Compute I <-- max{i} such that L(i,t) 0 (Step 1) Compute J <-- max{j} such that M(j,t) 0 (Step 2) /* ASAP, L and M are defined with respect to the unscheduled portion of the DFG */ q(I,J,t) <-- unsch(R,t) (Step 3) compute h(I,J,t) from q(I,J,t) (Step 4) return h(I,J,t) + I + J Figure 4.3. Outline of the procedure for lower-bound estimation from partial schedules algorithm is that it computes I and J in a very efficient way. After computing I and J, we compute h(I, J, t) + I + J as a lower bound on the number of time steps to schedule the remaining operations of type t. The maximum of these lower bounds over all resource types t added to depth(R) gives a lower bound on the completion time of any schedule that configuration R can lead to. An outline of the procedure to compute the lower-bound from a configuration R for type t is shown in Figure 4.3. The steps 3 and 4 are straightforward. Since L(I, t) = M (J, t) = 0, q(I, J, t) = unsch(R, t). The value for h(I, J, t) can be computed from q(I, J, t) using the formula in the previous section. It can be noted from the proof of Lemma 4.1 that h(i, j, t) + i + j for any i, j and t gives a lower bound on the number of time steps to schedule the remaining operations. 80, depth(R) + h(I, J, t) + I + J is a lower bound on the completion time of any schedule from R. Figure 4.4 describes an efficient way to compute values for I and J (steps 1 and 2 of Figure 4.3). 50 The value for I is given by min(a, b) where a = minu {q(u, t)} with the minimum taken over all u 6 ready(R). We ignore it if it has no successors of type t. (a = 0 if there is a type t operation in ready(R)). b = minu {q(u, t) - Fu} where Fu is the number of time steps it finished at R and the minimum is taken over all u E unfinished(R). We ignore it if it has no successors of type t- The value for J is given by : min“ {fl(u, t)} with the minimum taken over all u E {ready(R) U unfinished(R)}. If 21 has no successors of type t,then : we use M SAT(u) for [3(11, t) if u is a type t operation and u E ready(R); we ignore it otherwise. Figure 4.4. Computing the values for I and J For each node 21 and each operation type t, q(u, t) is defined as the minimum number of time steps that any type t successor of u can start after the starting time of it. For each node u and each Operation type t, 6(11, t) is defined as the minimum number of time steps that any schedule of the given DFG is going to take after the completion of all type t successors of u. The value for 6(11, t) can be computed as min,,{MSAT(v)} such that v is a type t successor of u. The formulas for the computation of I and J are based on the following lemma which is an extension of Lemma 2.1 to take multi-cycle operations into account. Lemma 4.2 : For any unscheduled operation 2: at configuration R, 2: is either a member of ready(R) or there exists a y in (ready(R) U unfinished(R)) such that y is a predecessor of 2:. Proof : Let :c be an unscheduled operation. If r is not in ready(R), then there is a prede- cessor p of a: that is not scheduled or scheduled but not finished executing. Among all such p, let q be the farthest from 2: i.e. the length of the longest path from q to a: is maximum among all p. (i) If q is scheduled but not finished executing, q is in on f inished(R). 
"- 51 readyal) = {1.5} unsch(R,+) = 5 unsch(R,*) = 2 depth(R) = 1 0(11 '1') = 1 B(11+) = 0 0101*) = 0 3(11") = 3 0(5,+) =1 fl(5,+) = 0 a(5W) = 0 [3(5.*) = 2 a(2,+) = 2 fi(2,+) = 0 oz(2,*) = 1 ,6(2,*) = 2 043,-?) = 2 fl(‘3.+) = 0 0(3.*) =1 3(3, *) = 2 I = min{a(1. +). 9(5, +), a(2, +). a(3. +)} = 1 J = min{fi(1,+),[3(5t+),B(2,+),fl(3i+)} = 0 For type +, q(I, J, +) = unsch(R, +) = 5 MI. J. +) = l(q(I, J. +)/1)1 *1 = 5 completionJime = depth(R) + h(I, J, +) + I + J = 7 I = 0 / at type at operation in ready(R) It: / J = min{fl(1t*),5(5t *),fi(2. *),fl(3t '0} = 2 For type 2:, q(I, J, at) = unsch(R, 4:) = 2 h(I,-7H“) = l(Q(I,Ji*)/1)l * 1 = 2 completioniime = depth(R) + h(I, J, at) + I + J = 5 Figure 4.5. An example for the estimation of lower-bound completion time of partial sched- ules 52 (ii) Otherwise : if q is not in ready(R), then there is a predecessor ql of q that is not scheduled or is in unfinished(R). Note that ql is a predecessor of 2: also. And, ql is farther than q from 2:, which is a contradiction. Hence, q is in ready(R). C] Figure 4.5 shows an example for the estimation of lower-bound completion time of a partially scheduled DFG. It is assumed that the scheduling of the first step is finished. There is one adder and one multiplier both with a delay of one time-step. The lower-bound completion time is 7 time-steps. If the target performance is 6 time-steps, the lower-bound estimation suggests that the selection of operations 2 and 3 in the first time-step is wrong. The method in this section is especially useful for a class of scheduling algorithms that compute the lower bound for a large number of configurations during the design space ex- ploration. The matrices a and [3 for all the Operations in the DFG and all resource types are computed only once before the design space exploration. This can be done by computing the transitive closure. For a directed graph, the transitive closure can be computed using depth-first search in O(n(e + n)) where e is the number of edges in the graph [67]. As already explained, e grows linearly in n in a DFG since the number of inputs of each open ation is generally bounded by a small number such as two. Thus, a and 6 can be computed in 0(n2) time. Note that only these matrices are used in computing the values of I and J (steps 1 and 2 of figure 4.3). Steps 3 and 4 take constant time. Hence, a lower-bound for any configuration R can be computed in O(lc) time where k is the number of operations in (ready(R) U unfinished(R)). Another major advantage of our method is that it introduces very little memory over- head. The only overhead is to store the matrices a and [3. When there are a large number Of partial schedules the memory requirement is dominated by the amount of information stored at each configuration. Any scheduling algorithm taking full advantage of our lower- bound estimation needs to store very little information at each configuration. 53 4.6 Experimental Results We implemented our methods in C language on a SUN Spare-2 workstation. We tested them using a number of benchmarks in the literature. The benchmarks we used are the AR Filter [27], the Differential Equation [55], the fifth-order elliptic Wave Filter [49], twice unfolded Wave Filter, the FIR Filter [53], the complex Biquad recursive digital Filter [49], the sixth-order elliptic Bandpass Filter [49], Discrete Cosine Transform [47], and Fast Discrete Cosine Transform [37]. 
4.6 Experimental Results

We implemented our methods in the C language on a SUN Sparc-2 workstation. We tested them using a number of benchmarks from the literature. The benchmarks we used are the AR Filter [27], the Differential Equation [55], the fifth-order elliptic Wave Filter [49], the twice unfolded Wave Filter, the FIR Filter [53], the complex Biquad recursive digital Filter [49], the sixth-order elliptic Bandpass Filter [49], the Discrete Cosine Transform [47], and the Fast Discrete Cosine Transform [37]. For the Biquad Filter example, we used three time steps for the multiplier and one for the adder (we used the same resource type adder to do addition, subtraction and comparison). For all the other examples, we used two time steps for multiplication and one for the adder.

4.6.1 Lower-bound estimation of partial schedules in branch and bound methods

Scheduling algorithms such as branch and bound methods rely on estimating lower bounds for partial schedules to keep the design space from exploding. Generally, lower bounds need to be estimated for a large number of partial schedules (configurations). If the time spent in lower-bound estimation itself is too high, it will have a big negative impact on the overall time taken by the scheduling algorithm. We implemented a branch and bound scheduling algorithm [10] and tested it on the benchmarks. We first find a schedule using a list scheduling algorithm. We use that performance as an upper bound in the branch and bound algorithm and search the design space exhaustively for an optimum schedule. From each partial schedule, we estimate a lower bound for the schedule completion time. If it exceeds the upper bound, the partial schedule cannot lead to a complete schedule with the target performance and it is not explored further.

We separately measured the time spent in lower-bound estimation using our method of Section 4.5 and Rim's method [65]. The results are reported in Table 4.1.

Table 4.1. CPU time and number of configurations in the branch and bound algorithm

                        Resources      CPU time (sec)            Configurations
     DFG                 +   *      Our method   Rim [65]     Our method    Rim
     AR Filter           3   3         0.3         7.1           1609       2953
                         2   3         0.35        8.2           1769       3497
                         1   3         1.1        22.6           6161       8001
     Once unrolled       2   3         0.13        2.9            770        723
     EW Filter           2   2         0.1         2.2            539        508
                         2   4         0.13        2.8            766        723
     Twice unrolled      2   2         2.1        42.4          18973      16366
     EW Filter           2   1         0.1         2.2            413        503
                         2   3         4.0        81.2          33182      27767
     Biquad              2   2         0.3         6.1           2355       2355
     Filter              2   1         0.8        17.6           6216       6384
                         1   2         1.1        24.0           7042       8173
     Fast Discrete       1   1         0.14        3.1            688        892
     Cosine Transform    2   2         0.11        2.6            474        774

Our method is at least 20 times faster in all cases. As a measure of the effectiveness of the lower-bound estimation in reducing the size of the search space, we also measured the number of configurations visited using each method separately. These results are also reported in Table 4.1. Both methods are equally effective. Since their estimation is very slow, the CPU time taken using their method is more than the time taken without using any lower-bound estimation in a few cases. However, in a majority of the cases, the search space exploded without lower-bound estimation, thus showing the necessity of estimation in branch and bound scheduling algorithms. Our method is more suitable than the existing methods for use in such scheduling algorithms. The results in Chapter 2 show the efficiency as well as the effectiveness of our estimation method in pruning the search space in our DPS algorithm.

4.6.2 Lower-bound estimation for the entire DFGs

We tested all the benchmarks with eleven different resource constraints for pipelined and non-pipelined multipliers. For all the cases, our lower bound is compared with an optimum solution we obtained using our DPS algorithm [8].
In Tables 4.2 and 4.3, we present the lower bounds for some of the cases as obtained by our method of Section 4.4. The column DPS in the tables shows the number of steps in an optimum solution. The lower bound is tight in 156 out of 198 cases. In 22 more cases, the difference is only one step. We also implemented the algorithms by Rim [65] and Sharma [69], and compared the results with ours. Our method gives a better lower bound than [65] in nine cases. In five cases, our lower bounds are better than [69]. In [65], the lower bounds by one more method, Jain [29], are also reported. Those are copied into the second-to-last column (Jain) of our tables. For the cases not reported in our tables, our lower bounds are identical to Rim's [65]. The average CPU times are 21 ms, 25 ms and 270 ms for our method, Rim's method and Sharma's method respectively. Thus, our method is faster than the fastest non-trivial method in the literature [65] and produces better lower bounds in more cases. Our method is an order of magnitude faster than the method in [69] and still produces better results in some cases. The lower bounds of [29] are far inferior to ours.

Table 4.2. Lower bounds with non-pipelined multiplier

                       Resources   our lower bound and an optimum solution    other lower bounds
     DFG                +   *      DPS   lower bound   difference    Rim [65]   Jain [29]   Sharma [69]
     AR Filter          1   1       34       34            0            34          32          34
                        1   2       18       18            0            18          16          18
                        1   3       16       14            2            14          12          14
                        2   3       15       14†‡          1            13          11          13
                        3   3       15       14†‡          1            13           -          13
                        2   4       11       11            0            11          11          11
     Diff. Eqn.         1   1       13       13            0            13          12          13
                        1   2        7        7            0             7           6           7
                        1   3        7        6            1             6           6           6
                        2   2        7        7            0             7           6           7
                        1   4        6        6            0             6           6           6
     EW Filter          1   1       28       27            1            27          26          27
                        2   1       21       21            0            21          17          21
                        2   2       18       18            0            18          17          18
                        3   3       17       17            0            17          17          17
     FIR Filter         1   1       18       18            0            18          16          18
                        1   2       15       15            0            15          15          15
                        2   2       11       11            0            11          10          11
                        2   3       10       10            0            10          10          10
     Twice unrolled     1   1       84       79            5            79          78          79
     Wave Filter        2   2       52       50            2            50          49          50
                        3   2       50       50            0            50          49          50
                        3   3       49       49            0            49          49          49
     Bandpass           1   1       25       25            0            25          24          25
     Filter             1   2       17       17            0            17          17          17
                        2   2       13       13            0            13          12          13
                        2   3       11       11            0            11          10          11
                        2   4       10       10            0            10          10          10
     Biquad             1   1       27       24            3            24          21          24
     Filter (*)         2   2       17       16†           1            15          11          15
                        3   2       16       16†           0            15           -          16
                        2   4       14       13†           1            12           -          13
                        3   4       13       13†           0            12           -          13
                        3   3       13       13            0            13          11          13
     Discrete Cosine    1   1       34       34            0            34           -          34
     Transform          2   2       18       18            0            18           -          18
                        3   3       14       14†‡          0            13           -          13
     Fast Discrete      1   1       34       34            0            34           -          34
     Cosine Transform   2   2       18       18            0            18           -          18
                        2   3       14       14†‡          0            13           -          13
                        3   3       14       14†‡          0            13           -          13

     (*) Complex multiplication takes 3 time steps.
     †   Our lower bound for this case is better than Rim's.
     ‡   Our lower bound for this case is better than Sharma's.

4.7 Conclusions

We have presented simple, efficient and powerful techniques for estimating the lower-bound completion time for the scheduling problem in high-level synthesis. The proposed techniques can handle multi-cycle operations, pipelined functional units, and chaining of operations. Our method for entire DFGs produces better lower bounds than [65] and [69] in many cases. It is more than 10 times faster than [69] and 20% faster than [65]. We presented an extension of our method to estimate the lower-bound completion times of partially scheduled DFGs even more efficiently. Upon extensive experimentation, it was found to be at least 20 times faster than the fastest non-trivial method known in the literature [65] while being equally effective in reducing the size of the search space. It is very useful for scheduling algorithms such as branch and bound methods in keeping the search space from exploding. We used this method in our DPS algorithm and obtained excellent results.
Table 4.3. Lower bounds with pipelined multiplier

                       Resources   our lower bound and an optimum solution    other lower bounds
     DFG                +   *      DPS   lower bound   difference    Rim [65]   Jain [29]   Sharma [69]
     AR Filter          1   1       19       19            0            19          16          19
                        1   2       16       14            2            14          12          14
                        2   2       13       12            1            12          11          12
                        2   4       11       11            0            11          11          11
     Diff. Eqn.         1   1        8        8            0             8           6           8
                        1   2        6        6            0             6           6           6
     EW Filter          1   1       28       27            1            27          26          27
                        2   1       19       18            1            18          17          18
                        3   2       17       17            0            17          17          17
     FIR Filter         1   1       15       15            0            15          15          15
                        2   1       11       11            0            11          10          11
                        2   2       10       10            0            10          10          10
     Twice unrolled     1   1       84       79            5            79          78          79
     EW Filter          2   1       52       50            2            50          49          50
                        3   1       50       50            0            50          49          50
                        3   2       49       49            0            49          49          49
     Bandpass           1   1       17       17            0            17          17          17
     Filter             2   1       14       14            0            14          12          14
                        2   2       10       10            0            10          10          10
     Biquad             1   1       19       18            1            18          15          18
     Filter (*)         2   1       15       14            2            14          11          14
                        2   2       13       12            1            12          11          12
                        2   3       12       12            0            12          11          12
                        3   3       11       11            0            11          11          11
     Discrete Cosine    1   1       32       32            0            32           -          32
     Transform          2   2       16       16            0            16           -          16
                        3   3       11       11            0            11           -          11
     Fast Discrete      1   1       26       26            0            26           -          26
     Cosine Transform   2   2       13       13            0            13           -          13
                        2   3       13       13            0            13           -          13
                        3   3       10       10            0            10           -          10

     (*) Complex multiplication takes 3 time steps.

CHAPTER 5

Scheduling CDFGs

5.1 Introduction

In the presence of conditional constructs such as if-then-else and case, conditional resource sharing may have a major impact on the efficiency of designs. Several scheduling algorithms have been proposed in the literature [22, 30, 62, 73] to handle conditional branches. Except for [62], no other work comprehensively addresses the issues pertaining to conditional branches. However, the symbolic technique used in their work is computationally very expensive and has failed to solve large problems using exact methods [60]. Some of the important issues of conditional branches include conditional resource sharing beyond pair-wise mutual exclusiveness, speculative execution of operations and simultaneous scheduling of parallel conditional trees. A concise representation of the conditional behavior is fundamental to addressing these issues efficiently. The existing representations [61, 63, 72] have various limitations. In this chapter, we describe a new representation for CDFGs in which the CDFG operations are labeled in a dot and dash notation that precisely represents their conditional behavior. Based on this representation, we present an efficient method for analyzing conditional resource sharing among an arbitrary number of operations. Speculative execution and simultaneous scheduling of parallel trees are supported, thus overcoming the limitations of previous works. The conditional resource sharing analysis method is incorporated into our DPS algorithm to obtain optimal schedules that exploit global sharing.

5.2 Related Work

Several representation schemes have been proposed in the recent literature to capture conditional behavior [61, 63, 72]. The node tagging scheme in [63] cannot handle case structures and can only support pair-wise resource sharing between operations. Condition Vectors in [72] cannot be used to correctly analyze resource sharing between operations from parallel conditional trees. The guard-based representation in [61] can perform resource sharing analysis for an arbitrary number of operations. However, the complexity of this analysis is very high. For each operation, they assign a Boolean guard function. The representation of case structures is not straightforward in their scheme and is not explained in their paper.

Several scheduling algorithms have been proposed in the literature [22, 30, 62, 73] to handle conditional resource sharing. The algorithm in [30] cannot exploit sufficient potential parallelism because they consider only pair-wise mutual exclusiveness of CDFG operations and do not allow speculative execution. Condition Vectors [72] are used in Condition Vector List Scheduling (CVLS) [73] to analyze resource sharing beyond pair-wise mutual exclusiveness.
Speculative execution is allowed in their method. Tree-based Scheduling (TS) [22] also allows movement of operations. However, both CVLS and TS are restricted to nested conditional trees. Parallel conditional trees are addressed in [73], but the trees are either scheduled sequentially or duplicated. A recent paper by Radivojevic et al [62] presents a comprehensive treatment of global conditional resource sharing. Their method is based on the symbolic technique of [60] and is computationally very expensive, as already pointed out.

5.3 CDFG Definitions and Issues

A CDFG (Control/Data Flow Graph) G = (V, E, C) is a (directed) graph representation of a behavioral description where V is the set of operations, E denotes the set of data dependencies (precedence constraints) and C denotes the set of control precedences between the operations. A conditional operation generates a control signal that fires one of its mutually exclusive branches. If (x, y, k) ∈ C, then the operation y belongs to the kth branch of the conditional operation x, and the operation x is called a control predecessor of y. A conditional operation x is called a control ancestor of an operation y if there is a directed path from x to y in G using only the arcs in C. Figure 5.1(b) shows the CDFG for the behavioral description in Figure 5.1(a). Conditional operations are shown as boxes and other operations are shown as circles. The data dependencies are shown as solid arcs and control dependencies are shown as broken arcs.

All the operations that belong to the same branch of the same control predecessor are said to belong to the same block. Note that this definition of a block is different from the concept of 'basic block' used in other works. For example, operations D and J in the CDFG in Figure 5.1(b) belong to different basic blocks. But they belong to the same block because they have the same control predecessor C and belong to the same branch of C. Thus, a block precisely represents the conditional behavior of its operations. A conditional operation in a block may in turn have one or more branch blocks. We refer to the control predecessor and control ancestors of the operations in a block also as the control predecessor and control ancestors of the block. A block a is a branch block of another block b if the control predecessor of a is in b; b is called the parent block of a. A block a is a nested branch block of another block b if b contains a control ancestor of a; b is called an ancestor block of a. The root block is an ancestor of all the blocks in the CDFG and contains all the operations that are always executed, i.e., the operations with no control predecessors. Examples of blocks in the example CDFG are {A, B, C, M}, {D, E, J}, and {H, I}. The block {A, B, C, M} is the root block. The blocks {F, G} and {H, I} are the branch blocks of operation E. Thus, E is the control predecessor of F, G, H and I.

There are several issues that need special attention in maximizing conditional resource sharing. Some of the issues are briefly discussed below.

• Resource sharing beyond pair-wise mutual exclusiveness : It is often necessary for scheduling algorithms to analyze conditional resource sharing for a set of operations, i.e., to determine the minimum number of resources required by a set of operations. Using pair-wise mutual exclusiveness, it may not be possible in some cases to compute the minimum requirement without employing complex algorithms such as clique partitioning.
Resource sharing analysis for an arbitrary number of operations is addressed in [72, 61]. A case was reported in [63] where [72] gives erroneous conclusions on the mutual exclusiveness of operations. The complexity of this analysis in [61] is very high. We present here a method to obtain maximum conditional resource sharing for a set of operations that runs in polynomial time.

• Speculative execution of operations : In speculative execution, an operation in a branch block can be executed before the corresponding conditional (control predecessor). In some cases, allowing speculative execution results in faster schedules. To allow speculative execution, resource sharing analysis should take into account the fact that operations executed in a speculative fashion cannot share resources even if they belong to different branches of the corresponding conditional. Furthermore, some implicit precedences should be imposed explicitly. For example, in the example CDFG, the conditional operation E needs to be executed before J since the data used by J depends on the result of E. This kind of implicit precedence is handled in [62] by imposing a strict precedence between the conditional and the join node. This may enforce unnecessary precedences, as pointed out in [63]. We track these precedences more dynamically and avoid unnecessary precedences.

    [Figure 5.1. An example CDFG : (a) a behavioral description beginning
     p = a + b; q = p + c; if (q ...) ...; (b) the corresponding CDFG.
     The drawing is not reproduced here.]

5.4 Representation of CDFGs

In this section, we present a new representation of CDFGs and a method for conditional resource sharing analysis using the representation. Each operation in our representation is given a label that reflects its conditional behavior. The data precedences are stored as a list of immediate predecessors for each operation. Associated with each operation x in V, there is a type T_x indicating that a resource of type T_x should be used to perform operation x.

5.4.1 The Dot and Dash Notation

In our representation scheme, we label each operation by a sequence of numbers separated by alternating dots (.) and dashes (-). The number following a dot denotes the sequence number within a block and the number following a dash denotes the branch number among the branches of a conditional operation. For example, the label 1.2-1.3-2.4 represents the fourth operation in the block 1.2-1.3-2. The block 1.2-1.3-2 is the second branch block of the conditional operation labeled 1.2-1.3. The operation 1.2-1.3 in turn is the third operation in the first branch of another conditional operation 1.2, which is the second operation in the root block. Thus, the control path of an operation is implicitly described by its label.

Figure 5.2 shows our representation of the example CDFG. Each operation has three parts: a dot and dash label, the type of the operation (in braces) and the list of its immediate data predecessors (in square brackets). Note that the control precedence information is implicit in the label and is not imposed as a strict precedence.

    A : 1.1 {+} [ ]
    B : 1.2 {=} [ 1.1 ]
    C : 1.3 {<} [ 1.2 ]
    D : 1.3-1.1 {+} [ 1.2 ]
    E : 1.3-1.2 {<} [ 1.3-1.1 ]
    F : 1.3-1.2-1.1 {+} [ 1.3-1.1 ]
    G : 1.3-1.2-1.2 {-} [ 1.3-1.2-1.1 ]
    H : 1.3-1.2-2.1 {+} [ 1.3-1.1 ]
    I : 1.3-1.2-2.2 {-} [ 1.3-1.2-2.1 ]
    J : 1.3-1.3 {-} [ 1.3-1.2-1.2  1.3-1.2-2.2 ]
    K : 1.3-2.1 {+} [ 1.2 ]
    L : 1.3-2.2 {+} [ 1.3-2.1 ]
    M : 1.4 {+} [ 1.3-1.3  1.3-2.2 ]

Figure 5.2. Our representation for the example CDFG
5.5 Conditional Resource Sharing for a Set of Operations

Let M be the set of operations considered for scheduling at a particular control step. Let α(b, t) denote the number of type t resources a block b requires, i.e. the number of type t resources required by the operations that belong to either block b or one of its nested branch blocks. A type t operation p in M contributes one to α(b, t) if it belongs to b. Additionally, each conditional operation q in b (even if q is not in M) may contribute to α(b, t), due to the resource requirements of the corresponding branch blocks. Let β(q, t) denote the contribution of q to α(b, t). If q is already resolved, i.e. the value of the control signal generated by q is available at the current control step, only one of the branch blocks of q needs to be executed. Hence, β(q, t) is given by the maximum of α(c, t), taken over all branch blocks c of q. If q is not resolved (speculative execution), then β(q, t) is the sum of such α(c, t). Note that the total number of type t resources required by M is given by α(B, t), where B is the root block.

Figure 5.3 shows the procedure to compute the number of resources of each type required by M. We schedule one operation from M at a time, each time updating the total resource requirement accordingly.

    resource_sharing(M)
    /* Computes the resource requirement for the set M of
       operations currently selected for scheduling */
    {
        for each resource type t {
            initialize α(b, t) to zero for all blocks b
            initialize β(q, t) to zero for all conditional operations q
        }
        for each operation p in M {
            t ← type of p
            call update_req(p, t)
        }
    }

Figure 5.3. Procedure for Conditional Resource Sharing Analysis

Figure 5.4 shows pseudo-code for the recursive procedure update_req(p, t), which updates the total resource requirement when a type t operation p is scheduled. If p belongs to block b, α(b, t) is incremented by one. Then, the change in α(b, t) is passed to the parent block of b through q, the control predecessor of p. First, β(q, t) is reevaluated. If there is a change in the value of β(q, t), the resource requirement of the block of q is updated by calling the procedure update_req recursively. This recursive process is continued until either the root block is reached or there is no change in the resource requirement.

    update_req(p, t)
    {
        b ← block of p
        q ← control predecessor of p
        α(b, t) = α(b, t) + 1
        if b is the root block, return
        if q is resolved {
            if α(b, t) ≤ β(q, t)
                return    /* No change in the resource requirement */
            else
                β(q, t) = α(b, t)
        }
        else {    /* q is not resolved */
            β(q, t) = β(q, t) + 1
        }
        call update_req(q, t)    /* Update the requirement of the block of q */
    }

Figure 5.4. Pseudo-code to update the total resource requirement

When all the operations from the set are added, α(B, t) is the number of type t resources required by that set of operations. Trivially, it takes O(k) time to update the resource requirement with respect to an operation p, where k is the number of nested levels from the root block to p. The dot and dash notation provides a mechanism to easily track the nested conditionals and the corresponding blocks from the operation to the root block. This method can be used to compute the requirement of not only functional units but also resources like busses and registers.

The above method takes into account the impact of speculative execution on resource sharing. Out-of-order execution of conditionals can be allowed with the labeling scheme and resource sharing analysis, and parallel trees can be scheduled simultaneously. The implicit precedence discussed in the previous section is handled as follows. Let operation q be an immediate predecessor of p, so p cannot be scheduled until q is executed. We additionally impose the following implicit precedence: operation p cannot be scheduled until all the nested conditionals from q up to r are executed, where r is the corresponding conditional in the same block as p. Using our representation, all these conditionals can be easily determined.
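For concreteness, here is one possible executable rendering of the procedures of Figures 5.3 and 5.4 in Python. The Block and Op classes and the dictionary-based α and β tables are scaffolding we introduce for the sketch; they are not part of the thesis implementation. The example at the end reproduces the two cases discussed with Figure 5.6 below.

    from collections import defaultdict

    class Block:
        """A conditional block; 'pred' is the conditional operation
        guarding it (None for the root block)."""
        def __init__(self, pred=None):
            self.pred = pred

    class Op:
        """An operation of resource type 'rtype' in block 'block'; for a
        conditional, 'resolved' says whether its control signal is
        already available in the current control step."""
        def __init__(self, rtype, block, resolved=False):
            self.rtype, self.block, self.resolved = rtype, block, resolved

    def update_req(b, t, alpha, beta):
        """Figure 5.4: account for one more type-t operation in block b
        and propagate any change in the requirement toward the root."""
        alpha[(b, t)] += 1
        q = b.pred
        if q is None:
            return                          # reached the root block
        if q.resolved:
            # beta(q,t) is the max over q's branch blocks; propagate
            # only if this branch now dominates
            if alpha[(b, t)] <= beta[(q, t)]:
                return
            beta[(q, t)] = alpha[(b, t)]
        else:
            # speculative execution: branches cannot share, so sum them
            beta[(q, t)] += 1
        update_req(q.block, t, alpha, beta)  # update q's own block

    def resource_sharing(M):
        """Figure 5.3: resources of each type required by the set M."""
        alpha, beta = defaultdict(int), defaultdict(int)
        for p in M:
            update_req(p.block, p.rtype, alpha, beta)
        root = M[0].block
        while root.pred is not None:         # walk up to the root block
            root = root.pred.block
        return {t: alpha[(root, t)] for t in {p.rtype for p in M}}

    # With C resolved, D and K (different branches of C) share one adder;
    # with E unresolved, F and H need two adders (speculative execution).
    root = Block()
    C = Op('<', root, resolved=True)
    D = Op('+', Block(pred=C)); K = Op('+', Block(pred=C))
    print(resource_sharing([D, K]))          # {'+': 1}
    E = Op('<', Block(pred=C), resolved=False)
    F = Op('+', Block(pred=E)); H = Op('+', Block(pred=E))
    print(resource_sharing([F, H]))          # {'+': 2}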
5.6 Lower-bound Estimation for CDFGs

For the lower-bound estimation of CDFGs, we use the same approach as for DFGs, i.e. dividing each type of operations into three non-overlapping intervals. As explained in Section 4.4, the lengths of the first and the third intervals are independent of resource constraints. The length of the second interval is a function of the total resource requirement of the operations that should be scheduled in that interval. If there are no conditional branches, the resource requirement is equal to the number of operations. In the presence of conditional branches, however, more than one operation can share one resource in the same time-step. Effectively, an operation requires only a fraction of a resource. If an operation can share resources with at most n other operations in the same time-step, its minimum resource requirement is 1/(n + 1). We refer to this quantity as the weight of that operation. For any given resource type t, the minimum total resource requirement in an interval can be computed as the sum of the weights of all type t operations in that interval. Given the weights of individual operations, the computation of the sum of weights of operations in each interval is similar to the computation of the number of operations, with no impact on the complexity.

For lower-bound estimation from a configuration S, we use SumWeights(S, t), the sum of the weights of the unscheduled operations of each type t, in place of unsch(S, t). This can be easily computed from the corresponding value at its parent configuration. At the root configuration, it is the sum of the weights of all operations of each type in the CDFG. Thus, the only increase in complexity with our extension to conditional branches is due to the computation of the weights of operations.

Figure 5.5 shows an outline of the procedure to compute the weights of individual operations.

    compute_weights()
    /* Computes weights of all operations in the CDFG */
    {
        Partition all the operations into conditional blocks
        for each operation x in the CDFG {
            t ← type of x
            b ← block of x
            n ← number of blocks that have a type t operation
                and are mutually exclusive with the block b
            weight(x) ← 1/(n + 1)
        }
    }

Figure 5.5. Outline of the procedure to compute the weights of operations

Since all the operations in a block have the same control behavior, the concept of mutual exclusiveness between operations can be easily extended to blocks. If x is a type t operation in the block b, then at any given control step x can share a resource with at most one other type t operation from any other block that is mutually exclusive with b. Hence, if there are n blocks that are mutually exclusive with b and that have a type t operation, then the weight of x is 1/(n + 1). If x is in the root block, the weight of x is 1.
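The weight computation of Figure 5.5 can be phrased directly over the dot-and-dash labels. The sketch below is again illustrative rather than the thesis implementation: block_label and the ops dictionary are our own scaffolding, mutually_exclusive is reused from the earlier label sketch (it applies to block labels unchanged), and exact fractions are used so the weights sum without rounding error.

    from fractions import Fraction

    def block_label(label):
        """The block of an operation: the label left of the rightmost dot."""
        return label[:label.rfind('.')]

    def compute_weights(ops):
        """Weights per Figure 5.5; 'ops' maps an operation name to a
        (label, resource type) pair.  mutually_exclusive() is the label
        test from the earlier sketch, applied here to block labels."""
        weights = {}
        for name, (label, t) in ops.items():
            b = block_label(label)
            # distinct blocks holding a type-t operation and mutually
            # exclusive with b (an operation's own block never qualifies)
            rivals = {block_label(l) for l, rt in ops.values()
                      if rt == t and mutually_exclusive(block_label(l), b)}
            weights[name] = Fraction(1, len(rivals) + 1)
        return weights

    # For the example CDFG, F can share an adder with one operation from
    # block 1.3-1.2-2 and one from block 1.3-2, so its weight is 1/3.
    ops = {'A': ('1.1', '+'), 'D': ('1.3-1.1', '+'), 'F': ('1.3-1.2-1.1', '+'),
           'H': ('1.3-1.2-2.1', '+'), 'K': ('1.3-2.1', '+')}
    print(compute_weights(ops)['F'])  # 1/3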
The method in [68] to handle conditional branches is an extension of [69]. In their approach, for each interval the operations from only one conditional path are considered, so as to maximize the minimum resource requirement in that interval. Since the conditional path analysis is performed for each interval, their method is very slow. When used in scheduling algorithms on partial schedules, the actual time spent in estimation can itself outweigh the advantage of the resultant pruning. In comparison, our method computes the weights of operations only once, and the complexity of the remaining steps remains unchanged. Their method is based on a distribute-join representation of CDFGs, which is a C-select implementation. In a C-select implementation, the operations in conditional branches cannot be executed until the corresponding condition is resolved. Our estimation method can support both C-select implementation and speculative execution. In C-select implementation, the control precedences are treated the same way as data dependencies and are considered in computing the ASAP and MSAT values of operations. In speculative execution, control dependencies are ignored while computing ASAP and MSAT values.

5.7 DPS Algorithm for CDFGs

In this section, we describe how the conditional resource sharing analysis method can be used in our DPS algorithm to extend it to schedule CDFGs. Figure 5.6 shows the design search space for the example CDFG using one adder, one subtracter and one comparator. All operations take one control step.

[Figure 5.6. Design search space for the example CDFG]

At configuration 3, both the addition operations D and K can be scheduled using one adder, because they belong to different branches of C and C is already resolved. At configuration 4, however, F and H cannot share resources, because E is not resolved and one of them will be executed in a speculative fashion.

Figure 5.7 shows the outline of the DPS algorithm for scheduling CDFGs. This algorithm is applied to a configuration R when a subset M of ready(R) is selected for scheduling (dps_cdfg(R, M)). First, the resource sharing analysis procedure of Figure 5.3 is called to compute the number of resources required of each type. A corresponding child configuration S is generated only if the resource requirement does not exceed the resource availability. Processing of newly generated configurations is done the same way as for DFGs. The lower-bound completion times can be estimated from configurations as explained in Section 5.6.
Equivalence of configurations can be checked the same way as in the case of DFGs.

    dps_cdfg(R, M)
    /* This DPS algorithm is applied to each configuration R when a
       subset M is selected for scheduling from its ready set */
    {
        call resource_sharing(M)
        if (resources required exceed resources available)
            /* The set of operations M cannot share available resources */
            return
        else
            generate a child configuration S of R by scheduling M
            in the current control step
        if (lower-bound completion time from S exceeds upper bound)
            /* lower-bound estimation */
            prune S
        if (an equivalent configuration P is found) {
            /* Dynamic programming technique */
            if (depth(S) >= depth(P))
                prune S
            if (depth(S) < depth(P))
                add S to the search space and prune P
        }
        if (no equivalent configuration found)
            add S to the search space
    }

Figure 5.7. Outline of the DPS algorithm for CDFGs

5.8 Experimental Results

The algorithm for resource sharing analysis described in this chapter was implemented in our original DPS scheduling algorithm. We tested our methods on many benchmarks in the literature. Table 5.1 shows the results for examples with conditional behavior: Maha from [52], Kim from [30], Waka from [72] and MulT from [73]. Parker is Maha with addition A6 replaced by a subtraction. The results, in terms of the total number of control steps, are compared to the optimal or best results known. The Resources column lists the number of adders, subtracters and comparators used in each case. All additions and subtractions are single-cycle. We use the same set of assumptions as in [62] about comparison operations. For the examples Maha, Parker and MulT, the result of a comparison is available to other operations in the same control step, and the branches of that conditional become mutually exclusive in the same step. For the other examples, a comparison takes one control step.

Table 5.1. Number of total control steps for Conditional Branch benchmarks

    Benchmark   Resources (+,-,<)   Ours   Symb. [62]   Kim et al. [30]   Wakabayashi [73]
    Maha        1,1,-               5      5            5                 5
    Maha        2,3,-               4      4            -                 4
    Parker      2,3,-               4      4            -                 4
    Kim         2,1,1               6      6            7                 6
    MulT        2,1,1               3      3            -                 4
    Waka        1,1,2               7      7            7                 -

Kim et al. do not support speculative execution and consider only pair-wise mutual exclusiveness. For the example Kim, their method takes 7 steps; our method takes only 6 steps with speculative execution. The results for MulT show the benefits of scheduling parallel trees simultaneously: CVLS [73] takes 4 steps while our method takes only 3 steps.

The advantages of speculative execution are shown in Table 5.2. It shows optimal schedule lengths without (column No spec.) and with (column Spec. Exec.) speculative execution. The Resources column shows the number of adders, subtracters and comparators used in each case. For all examples, we assume that all operations take one cycle. If there is no speculative execution, operations from mutually exclusive branches can always share resources; however, since control precedences are then strict precedences, the critical path length may increase. In Table 5.2, Maha and Parker are two examples with a high degree of branching. When speculative execution is not allowed, the advantage of conditional resource sharing is nullified by the increase in the critical path length: the length of the schedules could not be reduced even by adding more resources. In comparison, speculative execution gives much superior results, and adding more resources reduces schedule lengths.
The only other work that addresses speculative execution with no limitations is the method in [62]. However, the symbolic technique it uses is computationally very expensive and has failed to solve some large scheduling problems exactly [60].

Table 5.2. Speculative execution in Conditional Branch benchmarks

    Benchmark   Resources (+,-,<)   No spec.   Spec. Exec.
    Maha        1,1,1               11         7
    Maha        2,1,1               11         6
    Maha        2,2,2               11         5
    Parker      1,1,1               11         6
    Parker      2,2,1               11         5
    Parker      2,2,2               11         5
    Kim         1,1,1               9          9
    Kim         1,2,1               8          6
    Kim         2,2,2               7          6
    Waka        1,1,1               7          7
    Waka        2,2,2               7          7
    MulT        1,1,1               5          4
    MulT        2,1,1               5          3
    MulT        2,2,1               5          3

5.9 Conclusions

We proposed a new representation for CDFGs in which operations are labeled in a dot and dash notation. The dot and dash labels capture the conditional behavior of the operations, including constructs such as case statements. Exploiting this representation, a method for conditional resource sharing analysis is presented. This method is shown to be efficient in computing the resource requirement for a set of operations, and it allows speculative execution of operations and out-of-order execution of conditionals. The method is incorporated into our DPS algorithm to extend it to schedule CDFGs. The DPS algorithm for CDFGs overcomes the limitations of several previous works and exploits global sharing to obtain schedules with the minimum number of control steps in very short computational times.

CHAPTER 6

Conclusions

Scheduling datapath operations into the best control steps is a task whose importance has been recognized in many high-level synthesis systems. Many heuristics proposed in the literature to solve this intractable problem do not work well in all scenarios. This is especially true when there are constraints such as bus and register constraints. The current exact scheduling algorithms in the literature suffer from the combinatorial explosion of the search space because they do not apply sufficient pruning. Most importantly, none of the methods attempts to detect equivalent states in the search space. Based on this observation and some other characteristics of the scheduling problem, we developed algorithms to address various aspects of the scheduling problem in high-level synthesis. We described these algorithms in the previous chapters of this thesis. This chapter presents a summary of the contributions of our work and the further research that needs to be done in order to maximally utilize the techniques developed in this thesis.

6.1 Summary of Contributions

The contributions of this work can be summarized as follows.

• A new optimal dynamic programming algorithm is proposed for the scheduling problem in high-level synthesis. The algorithm minimizes the number of control steps (clock cycles) under resource constraints. A novel feature of our algorithm is that it efficiently detects equivalent configurations (partial schedules) and explores only one of the best among them. A set of simple yet powerful techniques to prune the search space is invented.

• Our DPS algorithm uses a succinct representation of configurations in the search space and stores very little data at each configuration. The child configurations are generated from their parent configuration very efficiently, exploiting the representation.

• Our scheduling algorithm is extended to handle multi-cycle operations, pipelined functional units, and scheduling under bus and register constraints. Experimental results on large benchmarks show that optimal results are obtained in very short computational times.
Our method is an order of magnitude faster than the existing exact methods in the literature.

• A new approach for lower-bound estimation of performance is proposed. Using our approach, we obtained faster and better results than some of the best known methods in the literature. In most cases, our method produces tight lower bounds. The approach is extended to estimate lower-bound completion times of partially scheduled data flow graphs. We used this approach in our dynamic programming scheduling algorithm, where it is shown to be very effective in pruning the design space while being very efficient. The estimation algorithms can be used in any branch-and-bound or multi-schedule scheduling algorithm.

• A new representation scheme to represent behaviors with conditional constructs (Control/Data Flow Graphs) is proposed. The operations are labeled in a novel dot-and-dash notation which precisely captures their conditional behavior.

• Based on this representation, a new method for conditional resource sharing analysis is proposed. This method can address issues like speculative execution and resource sharing beyond pair-wise mutual exclusiveness, which are very important in obtaining efficient designs for CDFGs. The resource sharing analysis method can be used in any scheduling algorithm.

• The resource sharing analysis method is incorporated into our dynamic programming scheduling algorithm, thus extending it to handle conditional branches. Our lower-bound estimation algorithms are also extended to handle conditional branches.

6.2 Future Research

There are several problems that need to be addressed further.

• The scheduling algorithm and the lower-bound estimation algorithms need to be extended to handle multi-function functional units.

• The algorithms need to be extended to handle loops and functional pipelining, i.e. pipelined datapaths.

• The current lower-bound estimation method considers constraints on the number of functional units only. It may be possible to obtain better lower bounds by including additional constraints such as bus constraints and register constraints. The lower-bound estimation method needs to be extended in this direction.

• Further investigation of ways to improve the performance (in terms of speed and memory usage) of our methods is necessary. Some of the issues are: how the merits of both the breadth-first and depth-first approaches can be combined into a single approach; and, in addition to the 'equivalence' relation between configurations, whether a 'better' relation between configurations can be determined that is easily implemented.

• Since our algorithm can have exponential complexity in the worst case, it is important to have the capability to search only a part of the search space for a near-optimal solution. This can be done by imposing a limit on the number of configurations explored at each depth. However, this should be done intelligently, such that the best configurations are selected for further exploration.

• Finally, a parser needs to be implemented that translates VHDL descriptions of digital systems into our representation of CDFGs.

BIBLIOGRAPHY

[1] G. Brassard and P. Bratley, "Algorithmics: Theory and Practice", Prentice Hall, 1988.

[2] R.K. Brayton, R. Camposano, et al., "The Yorktown Silicon Compiler", in Silicon Compilation, Addison-Wesley, pp.204-310, 1988.

[3] R.E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation", IEEE Transactions on Computers, Vol.C-35, No.8, August 1986.
[4] R.E. Bryant, "Symbolic Boolean Manipulation with Ordered Binary Decision Diagrams", ACM Computing Surveys, Vol.24, No.3, September 1992.

[5] R. Camposano and W. Wolf, editors, "High-Level VLSI Synthesis", Kluwer Academic Publishers, 1991.

[6] R.J. Cloutier and D.E. Thomas, "The Combination of Scheduling, Allocation, and Mapping in a Single Algorithm", Proceedings of the Design Automation Conference, pp.71-76, June 1990.

[7] S. Chaudhuri, R.A. Walker and J.E. Mitchell, "Analyzing and Exploiting the Structure of the Constraints in the ILP Approach to the Scheduling Problem", IEEE Trans. on VLSI Systems, Vol.2, No.4, pp.456-471, December 1994.

[8] M.J. Chung and G. Tiruvuri, "Optimum Dynamic Programming Scheduling under Resource Constraints", Technical Report MSUCPS-TR95-44, Computer Science Department, Michigan State University, 1995.

[9] M.E. Dalkilic and V. Pitchumani, "A Multi-Schedule Approach to High-Level Synthesis", International Conference on Computer Design, pp.572-575, 1994.

[10] S. Davidson, D. Landskov, B.D. Shriver, and P.W. Mallett, "Some Experiments in Local Microcode Compaction for Horizontal Machines", IEEE Transactions on Computers, pp.460-477, July 1981.

[11] S. DeVadas and A.R. Newton, "Algorithms for Hardware Allocation in Data Path Synthesis", IEEE Transactions on Computer-Aided Design, Vol.8, No.7, pp.768-781, July 1989.

[12] Y. Fann, et al., "Global Scheduling For High-Level Synthesis Applications", 31st Design Automation Conference, 1994.

[13] E.B. Fernandez and B. Bussell, "Bounds on the Number of Processors and Time for Multiprocessor Optimal Schedules", IEEE Transactions on Computers, Vol.C-22, No.8, pp.745-751, August 1973.

[14] D. Gajski, N. Dutt, et al., "High-Level Synthesis: Introduction to Chip and System Design", Kluwer Academic Publishers, 1992.

[15] D. Gajski, R. Potasman, J. Lis, and A. Nicolau, "Percolation Based Synthesis", Proceedings of the Design Automation Conference, pp.444-449, June 1990.

[16] M.R. Garey and D.S. Johnson, "Computers and Intractability", W.H. Freeman and Co., 1979.

[17] C.H. Gebotys and M.I. Elmasry, "Simultaneous Scheduling and Allocation for Cost Constrained Optimal Architectural Synthesis", Proceedings of the Design Automation Conference, pp.2-7, 1991.

[18] E.F. Girczyc and J.P. Knight, "An Ada to Standard Cell Hardware Compiler based on Graph Grammars and Scheduling", International Conference on Computer Design, pp.726-731, October 1984.

[19] G. Tiruvuri and M.J. Chung, "Estimation of Lower Bounds in Scheduling Algorithms for High-Level Synthesis", under revision for ACM Transactions on Design Automation of Electronic Systems.

[20] E.M. Gurari and I.H. Sudborough, "Improved Dynamic Programming Algorithms for Bandwidth Minimization and the MinCut Linear Arrangement Problem", Journal of Algorithms, Vol.5, No.4, pp.531-546, December 1984.

[21] C.Y. Hitchcock and D.E. Thomas, "A Method of Automatic Data Path Synthesis", Proceedings of the Design Automation Conference, pp.484-489, July 1983.

[22] S.H. Huang, et al., "A Tree-Based Scheduling Algorithm For Control-Dominated Circuits", 30th Design Automation Conference, 1993.

[23] Y. Hu, A. Ghouse and B.S. Carlson, "Lower Bounds on the Iteration Time and the Number of Resources for Functional Pipelined Data Flow Graphs", International Conference on Computer Design, pp.21-24, 1993.

[24] C.T. Hwang, J.H. Lee and Y.C. Hsu, "A Formal Approach to the Scheduling Problem in High Level Synthesis", IEEE Trans. on Computer-Aided Design, Vol.10, No.4, pp.464-475, April 1991.
[25] C.T. Hwang, Y.C. Hsu and Y.L. Lin, "Optimum and Heuristic Data Path Scheduling Under Resource Constraints", 27th Design Automation Conference, pp.65-70, June 1990.

[26] C.T. Hwang and Y.C. Hsu, "Zone Scheduling", IEEE Trans. on Computer-Aided Design, Vol.12, No.7, pp.926-934, July 1993.

[27] R. Jain, K. Kucukcakar, M.J. Mlinar, and A.C. Parker, "Experience with the ADAM Synthesis System", Proceedings of the Design Automation Conference, pp.56-61, 1989.

[28] R. Jain, A. Mujumdar, A. Sharma, and H. Wang, "Empirical Evaluation of Some High-Level Synthesis Scheduling Heuristics", Proceedings of the Design Automation Conference, pp.686-689, 1991.

[29] R. Jain, A.C. Parker, and N. Park, "Predicting System-Level Area and Delay for Pipelined and Non-pipelined Designs", IEEE Trans. on Computer-Aided Design, Vol.12, No.8, August 1992.

[30] T. Kim, et al., "A Scheduling Algorithm For Conditional Resource Sharing", Proceedings of ICCAD, 1991.

[31] H. Komi, S. Yamada and K. Fukunaga, "A Scheduling Method by Stepwise Expansion in High-Level Synthesis", ICCAD, 1992.

[32] S.Y. Kung, H.J. Whitehouse, and T. Kailath, "VLSI and Modern Signal Processing", Englewood Cliffs, NJ: Prentice Hall, pp.258-264, 1985.

[33] F.J. Kurdahi and A.C. Parker, "REAL: A Program for REgister ALlocation", 24th Design Automation Conference, 1987.

[34] M. Langevin and E. Cerny, "A Recursive Technique for Computing Lower-Bound Performance of Schedules", International Conference on Computer Design, pp.16-20, 1993.

[35] J.H. Lee, Y.C. Hsu and Y.L. Lin, "A New Integer Linear Programming Formulation for the Scheduling Problem in Data Path Synthesis", Proceedings of ICCAD, November 1989.

[36] T.A. Ly and J.T. Mowchenko, "Applying Simulated Evolution to High Level Synthesis", IEEE Trans. on Computer-Aided Design, Vol.12, No.3, pp.389-409, March 1993.

[37] D.J. Mallon and P.B. Denyer, "A New Approach to Pipeline Optimization", Proceedings of the European Design Automation Conference, March 1990.

[38] P. Marwedel, "A New Synthesis Algorithm for the MIMOLA Software System", Proceedings of the Design Automation Conference, pp.271-277, July 1986.

[39] M.C. McFarland, "Using Bottom-Up Techniques in the Synthesis of Digital Hardware from Abstract Behavioral Descriptions", Proceedings of the Design Automation Conference, pp.474-480, July 1986.

[40] M.C. McFarland, A.C. Parker and R. Camposano, "Tutorial on High-Level Synthesis", Proceedings of the Design Automation Conference, pp.330-336, July 1988.

[41] M.C. McFarland, A.C. Parker and R. Camposano, "The High-Level Synthesis of Digital Systems", Proceedings of the IEEE, pp.301-318, February 1990.

[42] S. Minato, "Zero-Suppressed BDDs for Set Manipulation in Combinatorial Problems", Proceedings of the Design Automation Conference, 1993.

[43] J.A. Nestor and G. Krishnamoorthy, "SALSA: A New Approach to Scheduling with Timing Constraints", Proceedings of ICCAD, 1990.

[44] M. Nourani and C. Papachristou, "Move Frame Scheduling and Mixed Scheduling-Allocation for the Automated Synthesis of Digital Systems", 29th Design Automation Conference, 1992.

[45] S.Y. Ohm and C.S. Jhon, "A Branch and Bound Method for the Optimal Scheduling", Proceedings of CICC, pp.8.6.1-8.6.4, May 1992.

[46] S.Y. Ohm, F.J. Kurdahi, and Nikil Dutt, "Comprehensive Lower Bound Estimation from Behavioral Descriptions", Proceedings of ICCAD, pp.182-187, November 1994.

[47] S.Y. Ohm, personal communication, March 1995.
[48] B.M. Pangrle and D.D. Gajski, "Slicer: A State Synthesizer for Intelligent Silicon Compiler", International Conference on Computer Design, pp.42-45, July 1987.

[49] C.A. Papachristou and H. Konuk, "A High Level Synthesis Technique Based on Linear Programming", Technical Report CES-89-21, Computer Engineering and Science Department, Case Western Reserve University, November 1989.

[50] C.A. Papachristou and H. Konuk, "A Linear Program Driven Scheduling and Allocation Method Followed by an Interconnect Optimization Algorithm", Proceedings of the Design Automation Conference, pp.77-83, June 1990.

[51] I. Park and C. Kyung, "FAMOS: An Efficient Scheduling Algorithm for High-Level Synthesis", IEEE Trans. on Computer-Aided Design, Vol.12, No.10, pp.1437-1448, October 1993.

[52] N. Park and A.C. Parker, "MAHA: A Program for Datapath Synthesis", Proceedings of the Design Automation Conference, pp.461-466, July 1986.

[53] N. Park and A.C. Parker, "SEHWA: A Software Package for Synthesis of Pipelines from Behavioral Specifications", IEEE Trans. on Computer-Aided Design, Vol.7, No.3, pp.356-368, March 1988.

[54] P.G. Paulin and J.P. Knight, "Scheduling and Binding Algorithms for High-Level Synthesis", Proceedings of the Design Automation Conference, pp.1-6, 1989.

[55] P.G. Paulin and J.P. Knight, "Force-Directed Scheduling for Behavioral Synthesis of ASIC's", IEEE Trans. on Computer-Aided Design, Vol.8, No.6, pp.661-679, June 1989.

[56] P.G. Paulin, J.P. Knight, and E.F. Girczyc, "HAL: A Multi-Paradigm Approach to Automatic Data Path Synthesis", Proceedings of the Design Automation Conference, pp.263-270, July 1986.

[57] J.M. Rabaey, H. De Man, et al., "CATHEDRAL II: A Synthesis System for Multiprocessor DSP Systems", in Silicon Compilation, Addison-Wesley, pp.204-310, 1988.

[58] J.M. Rabaey and M. Potkonjak, "Estimating Implementation Bounds for Real Time DSP Application Specific Circuits", IEEE Trans. on Computer-Aided Design, Vol.13, No.6, pp.669-683, June 1994.

[59] I.P. Radivojevic and F. Brewer, "Symbolic Techniques for Optimal Scheduling", Proceedings of the 4th SASIMI, Nara, Japan, 1993.

[60] I.P. Radivojevic and F. Brewer, "On Applicability of Symbolic Techniques to Larger Scheduling Problems", Proceedings of the European Design and Test Conference, pp.48-53, 1994.

[61] I.P. Radivojevic and F. Brewer, "Analysis of Conditional Resource Sharing using a Guard-based Control Representation", International Conference on Computer Design, 1995.

[62] I.P. Radivojevic and F. Brewer, "A New Symbolic Technique for Control-Dependent Scheduling", IEEE Trans. on CAD, January 1996.

[63] M. Rim and R. Jain, "Representing Conditional Branches for High-Level Synthesis Applications", 29th Design Automation Conference, 1992.

[64] M. Rim, "High-Level Synthesis of VLSI Designs for Scientific Programs", Ph.D. thesis, Department of Electrical and Computer Engineering, University of Wisconsin, Madison, August 1993.

[65] M. Rim and R. Jain, "Lower-Bound Performance Estimation for the High-Level Synthesis Scheduling Problem", IEEE Trans. on Computer-Aided Design, Vol.13, No.4, pp.451-458, April 1994.

[66] M. Rim and R. Jain, "RECALS II: A New List Scheduling Algorithm", ICASSP, 1994.

[67] R. Sedgewick, "Algorithms in C", Addison-Wesley, 1990.

[68] A. Sharma, "Estimation and Design Algorithms for the Behavioral Synthesis of ASICs", Ph.D. thesis, University of Wisconsin, 1992.
[69] A. Sharma and R. Jain, "Estimating Architectural Resources and Performance for High-Level Synthesis Applications", IEEE Trans. on VLSI Systems, Vol.1, No.2, June 1993.

[70] A. Sharma and R. Jain, "InSyn: Integrated Scheduling for DSP Applications", 30th Design Automation Conference, 1993.

[71] J.D. Ullman, "NP-Complete Scheduling Problems", Journal of Computer and System Sciences, pp.384-393, 1975.

[72] K. Wakabayashi and T. Yoshimura, "A Resource Sharing and Control Synthesis Method for Conditional Branches", ICCAD, 1989.

[73] K. Wakabayashi and H. Tanaka, "Global Scheduling Independent of Control Dependencies Based on Condition Vectors", 29th Design Automation Conference, 1992.

[74] T.C. Wilson, N. Mukherjee, M.K. Garg, and D.K. Banerjee, "An Integrated and Accelerated ILP Solution for Scheduling, Module Allocation, and Binding in Datapath Synthesis", 6th International Conference on VLSI Design, pp.192-197, Bombay, India, January 1993.

[75] Nam-Sung Woo, "A Global, Dynamic Register Allocation and Binding for a Data Path Synthesis System", 27th Design Automation Conference, 1990.