1141:5518 (Ll llllIllllllllllllllllllllllllllll‘lllllilllllllllllWill 31293 01712 This is to certify that the thesis entitled Hardware-software Partitioning in Co- -design of Embedded System presented by Habeel Ahmad has been accepted towards fulfillment of the requirements for Master's degree in Electrical Eng 5mm 1 [ {5'14 Major professor Date 5/10/71“ 0-7639 MS U is an Affirmative Action/Equal Opportunity Institution LIBRARY | Michigan State Unlverslty PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. DATE DUE MTE DUE DATE DUE 1m W14 HARDWARE-SOFTWARE PARTITIONING IN CO-DESIGN OF EMBEDDED SYSTEMS By Habeel Ahmad A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Electrical Engineering 1998 ABSTRACT HARDWARE-SOFTWARE PARTITIONING IN CO-DESIGN OF EMBEDDED SYSTEMS By Habeel Ahmad The rapid advancements in science and technology and especially in the field of computers have made an impact on every aspect of our daily life. Among the latest trends is to make devices known as embedded systems that consist of one or more programmable components. Hardware-software co-design is a methodology that provides rules and techniques for embedded system design. This work has focused on one of the aspects of co—design called hardware-software partitioning. A variety of co-design frameworks were studied with a view to explore the partitioning algorithms. POLIS, which currently supports manual partitioning, was chosen as a candidate for implementation of automatic partitioning. Two partitioning algorithms, namely, Group Migration and Simulated Annealing were implemented in C++ for this purpose. The existing design flow in POLIS was altered to generate the performance estimates for both software and hardware implementations of system modules. The algorithms used these estimates to find the best partition that satisfied the constraints. The results produced by the partitioning algorithms are in agreement with the results published in the POLIS documentation. The work can be extended to integrate the partitioning algorithm with POLIS design flow and to include a rapid prototyping environment for verification. I dedicate this thesis to my parents, Aziz Ahmad and Amna Aziz iii ACKNOWLEDGEMENTS I start in the name of Allah Almighty, the most beneficent and the most merciful, whose everlasting blessings made the completion of this thesis possible. Next, I wish to express my sincere gratitude to my advisor Dr Diane T. Rover for her commitment, encouragement and guidance that helped me conclude this study. Appreciation is also extended to Dr. Micheal Shanblatt and Dr. Bruce E. Kim for their valuable suggestions and also for serving as members of my guidance committee. My thanks are also due to the secretarial staff of EB Department for their support at all times. I am grateful to Mr. Bassam Tabbara at UC Berkeley for his prompt response and guidance during my struggle with POLIS and PTOLEMY. I would like to express my gratitude to all of my valuable friends, especially Mr. Abdul Naeem Khan, Dr. Pervaiz Akhtar, Mr. Fida M. Khan and Dr. Dale Joachim for their understanding, encouragement and social support during my stay at East Lansing. My special thanks to Mr Aman-ullah Ateequi and his family for their caring. Finally, I wish to acknowledge the support and sacrifice of my mother, wife and sister back home during my stay at MSU that enabled me to fully concentrate on my studies. I am especially grateful to my loving wife, Najm us Saher, who always remained a source of inspiration and encouragement during the course of studies. I also acknowledge her patience and courage for single handedly running all the affairs in looking after our children, Rafiah, Sa’ad, Saleha and Sa’adiah during my long absence from home. iv TABLE OF CONTENTS LIST OF TABLES ................................................................................ viii LIST OF FIGURES ............................................................................... ix CHAPTER 1 INTRODUCTION .................................................................................. 1 1.1 Embedded Systems ........................................................................... 2 1.2 Hardware-Software Co-design ............................................................... 4 1.3 Co—design Methodology ...................................................................... 5 1.4 Conventional Approach ....................................................................... 6 1.4.1 Specification .............................................................................. 6 1.4.2 Partitioning .............................................................................. 9 1.4.3 Synthesis ................................................................................. 9 1.4.3.1 Hardware Synthesis ............................................................. 9 1.4.3.2 Software Synthesis ............................................................ 10 1.4.3.3 Interface Synthesis ............................................................ 10 1.4.4 Hardware-Software Integration and Co-simulation .............................. 10 1.5 Model-Based Approach ..................................................................... 11 1.5.1 Validation .............................................................................. 11 1.5.2 Partitioning and Implementation ..................................................... 12 1.6 Partitioning Problem ........................................................................ 12 1.7 Scope of This Work ......................................................................... 13 CHAPTER 2 RELATED WORK ............................................................................... 15 2.1 Vulcan ......................................................................................... 16 2.2 COSYMA .................................................................................... 16 2.3 LYCOS: The Lyngby Co—Synthesis System ............................................. 17 2.4 COMET ....................................................................................... 18 2.5 Ptolemy ....................................................................................... 19 2.6 SpecSyn ....................................................................................... 20 2.7 TOSCA ....................................................................................... 21 2.8 Other Frameworks ........................................................................... 21 CHAPTER 3 POLIS ............................................................................................... 24 3.1 Introduction ................................................................................... 24 3.1.1 Model of Computation (CFSM) .................................................... 25 3.2 Design Flow .................................................................................. 28 3.2.1 Overview ............................................................................... 28 3.2.2 High Level Language Translation ................................................. 32 3.2.3 Formal Verification .................................................................. 32 3.2.3 System Co-simulation ............................................................... 33 3.2.4 Partitioning and Architecture Selection ........................................... 33 3.2.5 Hardware Synthesis .................................................................... 34 3.2.6 Software Synthesis ..................................................................... 34 3.2.7 Interfacing Implementation Domains ............................................... 35 3.2.8 Rapid Prototyping ..................................................................... 35 3.3 Design Example .............................................................................. 35 3.3.1 Specification ........................................................................... 36 3.3.2 Estimation ............................................................................... 38 3.3.3 Co-simulation .......................................................................... 39 3.3.4 Hardware Synthesis ................................................................... 40 3.3.5 Software Synthesis .................................................................... 41 3.3.6 Implementation ........................................................................ 41 CHAPTER 4 SYSTEM PARTITION ING ...................................................................... 42 4.1 Partitioning Approaches .................................................................... 43 4.1.1 Structural Partitioning ................................................................ 43 4.1.2 Functional Partitioning ............................................................... 44 4.2 Partitioning Issues ........................................................................... 45 4.2.1 Specification and Levels of Abstraction ............................................ 45 4.2.2 Granularity ............................................................................. 47 4.2.3 System-Component Allocation ...................................................... 47 4.2.4 Metrics and Estimation ............................................................... 48 4.2.5 Cost Function .......................................................................... 48 4.2.6 Partitioning Algorithm ................................................................ 50 4.2.6.1 Constructive/Iterative Algorithms ......................................... 50 4.2.6.2 Greedy/Hill-climbing Algorithms ......................................... 51 4.2.7 Output ................................................................................... 52 4.3 Basic Partitioning Algorithms ............................................................... 52 4.3.1 Vulcan Algorithms ................................................................... 53 4.3.2 Ratio Cut ............................................................................... 53 4.3.3 Group Migration (Kemighan-Lin) ................................................. 54 4.3.4 Simulated Annealing ................................................................. 56 4.3.5 Genetic Evolution ..................................................................... 58 4.3.6 Binary Constraint-Search ............................................................ 59 4.3.7 Integer Linear Programming ........................................................ 59 CHAPTER 5 APPLICATION OF PARTITIONING ALGORTIHMS 1N POLIS ........................ 60 5.1 Background ................................................................................... 60 5.2 Partitioning in POLIS ....................................................................... 61 5.3 Generation of Estimates ..................................................................... 62 5.4 Selection of Algorithms for Automatic Partitioning in POLIS ........................ 65 5.4.1 Assumptions ........................................................................... 66 5.5 Features of C++ Code ......................................................................... 67 5.5.1 Group Migration (GM) Algorithm ................................................. 67 5.5.2 Simulated Annealing (SA) Algorithm ............................................. 69 vi 5.6 Case Studies .................................................................................. 71 5.6.1 Hypothetical Cases ................................................................... 71 5.6.1.1 Small Size Example .......................................................... 72 5.6.1.2 Large Size Example .......................................................... 72 5.6.2 Real World Example (Dashboard Controller) ..................................... 73 5.6.2.1 Design Constraints ............................................................ 73 5.6.3 Results Obtained ....................................................................... 75 5.7 Analysis of Results ......................................................................... 81 5.8 Limitations of POLIS ........................................................................ 82 CHAPTER 6 CONCLUSION AND FUTURE DIRECTIONS ............................................. 84 6.1 Summary ...................................................................................... 84 6.2 Validity of Results ........................................................................... 86 6.3 Recommendations and Future Directions ................................................ 86 6.3.1 Integrating the Partitioning Algorithm in Polis Design Flow ................... 87 6.3.2 Rapid Prototyping Platform ......................................................... 87 APPENDICES .................................................................................... 88 APPENDD( A Group Migration Algorithm implementation in CH ................ 89 APPENDIX B Simulated Annealing Algorithm implementation in C++ .......... 102 APPENDIX C Group Migration Algorithm Results .................................. 114 APPENDD( D Simulated Annealing Algorithm Results ............................. 122 APPENDIX E Dashboard Example Results .......................................... 130 BIBLIOGRAPHY ............................................................................... 141 vii Table 2.1 Table 5.1 Table 5.2 Table 5.3 Table C.1 Table C.2 Table C.3 Table C.4 Table C.5 Table D.1 Table D2 Table D.3 Table D4 Table E.1 Table E.2 Table E.3 Table E.4 Table E.5 Table E.6 LIST OF TABLES List of co-design frameworks examined in this thesis .......................... 15 Estimates for the modules of dashboard example ............................... 76 Partition found using GM and SA algorithms (1 MHz Clock) ................. 77 Partition found using GM and SA algorithms (4 MHz Clock) ................. 79 Input data for the small example ................................................. 1 14 Partitioned system of the small example ........................................ 116 Input data for the large example .................................................. l 17 Partitioned system of the large example with equal weights ................ 118 Partitioned system when Hw_Area is more important ........................ 120 Input data for the small example .................................................. 122 Partitioned system of the small example ........................................ 123 Input data for the large example .................................................. 126 Partitioned system of the large example ......................................... 127 Input and partition data for belt_control module ............................... 135 Input and partition data for en gine_speed module ............................. 136 Input and partition data for wheel__speed module ............................. 137 Input and partition data for net dac_demo (1 MHz Clock) ................... 138 Partition data for net dac_demo (4 MHz Clock) ............................... 139 Partition data for net dac_demo (Hw_Area constraint = 4000000) ......... 140 viii LIST OF FIGURES Figure 1.1 Architecture of a typical embedded system ........................................ 3 Figure 1.2 Conventional co-design methodology .............................................. 7 Figure 1.3 Model-based co-design methodology ............................................... 8 Figure 3.1 Overview of the design flow in POLIS ........................................... 29 Figure 3.2 The POLIS design flow ............................................................. 30 Figure 3.3 Transition diagram of seat belt controller ......................................... 36 Figure 4.1 Essential partitioning issues ........................................................ 46 Figure 4.2 Escaping local minimum in iterative partitioning ............................... 51 Figure 4.3 Classification of automatic partitioning algorithms .............................. 53 Figure 4.4 Group migration algorithm with two-way partitioning ......................... 55 Figure 4.5 Simulated annealing algorithm .................................................... 57 Figure 5.1 Modified design flow in POLIS .................................................... 62 Figure 5.2 Plot of cost vs. iterations for GM algorithm (1 MHz Clock) ................... 78 Figure 5.2 Plot of cost vs. iterations for SA algorithm (1 MHz Clock) ................... 78 Figure 5.3 Plot of cost vs. iterations for GM algorithm (4 MHz Clock) ................... 80 Figure 5.4 Plot of cost vs. iterations for SA algorithm (4 MHz Clock) ................... 80 Figure A.1 Flow chart of GM algorithm program ............................................ 98 Figure A.2 Flow chart of SA algorithm program ........................................... 1 11 Figure C.1 Plot of cost vs. iterations corresponding to partition in Table C.2 .......... 1 16 Figure C.2 Plot of cost vs. iterations corresponding to partition in Table C.4 . .. 1 19 ix Figure C.4 Plot of cost vs. Figure D.1 Plot of cost vs. Figure D.2 Plot of cost vs. Figure D.3 Plot of cost vs. Figure D.4 Plot of cost vs. Figure D.5 Plot of cost vs. Figure D.6 Plot of cost vs. Figure D.7 Plot of cost vs. Figure D.8 Plot of cost vs. Figure E.l Plot of cost vs. Figure E.2 Plot of cost vs. Figure E.3 Plot of cost vs. Figure E.4 Plot of cost vs. Figure E.5 Plot of cost vs. iterations corresponding to partition in Table C.5 .......... 12] iterations for the small example (Case 1) ..................... 123 iterations for the small example (Case 2) ..................... 124 iterations for the small example (Case 3) .................... 124 iterations for each temperature value (Case 4) .............. 125 iterations for the small example (Case 5) .................... 125 iterations for the small example (Case 6) .................... 126 iterations for the large example (Case 1) ..................... 129 iterations for the large example (Case 2) ..................... 129 iterations for belt_control module ............................ 134 iterations for engine_speed module ........................... 135 iterations for wheel_speed module ............................ 136 iterations for net dac_demo (1 MHz Clock) ................. 138 iterations for net dac_demo (4 MHz Clock) ................. 139 CHAPTER 1 INTRODUCTION Electronics has made inroads in all fields of science and technology. Digital electronics is one of the most prolific fields in the present day world. It provides the building blocks for the wonder of the 20th century, the computer. Digital electronics is the enabling technology for the design of high-performance systems in which reliability and dependability are key issues, in such applications as space exploration and medical instrumentation. The continued miniaturization of digital circuits has made it possible to design and manufacture programmable components such as microprocessors and microcontrollers. With the increasing role of computers in our daily life, one may envision a world where everything would assume an electronic dimension; i.e., either it would be controlled by a computer or may contain one or more programmable components. This demands reconsideration of the role of computers in our daily life. For now, a user buys a computer as a platform to run a variety of programs that make it perform widely different tasks. With the advances in technology and increased market competition, the prices of computer hardware have gone so low that the highest cost a user has to pay today is that of software. In this scenario, it is logical to think of building special-purpose devices with optimized functionality for a dedicated application. The manufacturers of electronics systems have already adopted this approach. Out of the millions of microprocessors manufactured each year, only 20% are being used in the general—purpose computers, whereas the rest of them are used in digital systems for dedicated applications known as embedded systems. 1.1 Embedded Systems Embedded systems are digital systems, which perform specific functions. They are normally categorized as hardware-software systems. Embedded systems are defined as a collection of programmable parts and dedicated hardware components that are continuously interacting with the environment. By virtue of the requirement of continuous interaction with the environment they are also termed as reactive systems [1]. The software runs on microcontrollers or Digital Signal Processors (DSPs) and dedicated hardware is implemented in Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs). Generally, software is used for features and flexibility, while hardware is used for performance. Design of embedded systems can be subject to many different types of constraints, including timing, size, weight, power- consumption, reliability and cost. Rapid advancement in the field of computer aided design (CAD) during the recent past has opened numerous vistas of research and development in the area of embedded system design. The research efforts are now being directed towards the development of such techniques, which not only result in meeting the requisite performance criteria but also reduce time-to-market and cost [2]. Following are some examples of embedded systems: 0 Consumer Electronics: medical instruments, cameras, compact disc players, VCRs, microwave ovens and washing machines. 0 Telecommunications: networking and communication systems such as satellites and cellular phones. 0 Automotive: engine controllers, anti-lock brakes and dashboard controllers. 0 Defense and Aviation Electronics: airborne radio and radar, fire control, navigation and guidance, and cryptographic systems. 0 Plant and Process Control: Remote controlled toys, robots and plant monitors. Figure 1.1 shows an example of the architecture of a typical embedded system. It consists of programmable components such as microcontroller and DSP and dedicated hardware such as ASIC and standard logic components. The microcontroller runs the application program under the control of real—time operating system (RTOS). The interface among the various components is implemented with the system bus. Microcontroller System Bus 3 ....... Logic Figure 1.]: Architecture of a typical embedded system. Current practices of embedded system design tend to follow different paths for the design of hardware and software. A specification, often incomplete and written in non- formal language, is developed and sent to the hardware and software engineers. It involves multiple, subsequent hardware/software development steps in which a prototype is designed through refinement of specifications. Hardware-software partition is decided in advance and is adhered to as much as possible, because any changes in this partition may necessitate extensive redesign. Designers often strive to make everything fit in software, and off-load only some parts of the design to hardware to meet timing constraints. The problems with this design strategy are: 0 Lack of a unified hardware-software representation, which leads to difficulties in verifying the whole system and, therefore, to incompatibilities across the hardware-software boundary. 0 Definition of hardware-software partitions in the early design stages, which leads to sub-optimal designs. 0 Lack of a well-defined design flow, which makes specification revision difficult, and directly impacts time-to-market. 1.2 Hardware-Software Co-design Hardware-Software co-design is considered to be a design methodology that can avoid the above-mentioned disadvantages. It is trying to meet system-level objectives by exploiting the synergism of hardware and software through their concurrent design. It can be viewed as a management discipline, which offers the possibility to develop large complex system products. Hardware-Software codesign is a complex process that involves transforming a high-level system specification to an implemented hardware- software system that meets the specification constraints. Hardware-software co-design is predominant in the development of such systems where hardware and software modules closely interact to solve a certain task [3]. As discussed earlier, hardware—software systems are not new and they have continued to be designed using conventional approaches, however, methodologies that concurrently apply to both domains are now emerging. The growing interest in hardware-software co-design can be attributed to following developments or compulsions: 0 Advances in enabling technologies such as system level specification and simulation environments, prototyping techniques, formal methods for design and verification, high-level synthesis and the emergence of CAD frameworks have opened new venues for hardware-software co-design. 0 The increasing diversity and complexity of embedded systems demands advanced design methods for the development of both hardware and software. 0 The market competition has made it imperative to decrease the cost of design and test of hardware-software systems. More than ever, optimization of cost and performance and a significant reduction in time-to-market are vital issues in the development of embedded systems. 1.3 Co-design Methodology Co-design as practiced today relies heavily on techniques and methods that have been successfully applied in the past. New contributions are being made in the areas of design automation tools, tool interface, hardware-software partitioning techniques, and enhanced framework technologies. Currently, two approaches are followed by hardware—software system designers [3]. The conventional approach is the traditional approach where generic framework techniques are employed to facilitate tool encapsulation and integration, and management support is provided for coordinated and cooperative design. Figure 1.2 depicts a typical series of steps in this co—design methodology. This approach has been the focus of research in the past. The model-based approach is the focus of more recent research and favors late partitioning during the design process. Figure 1.3 illustrates the model-based approach. A brief description of both approaches follows. 1.4 Conventional Approach 1.4.1 Specification The first step in the conventional approach is description of specification. The objectives, requirements and constraints supplied by the user are often incomplete and lack clarity. This step helps in removal of inconsistencies and location of missing information, which results in formulation of system specifications. The traditional approach of system specification as an informal natural-language description has proved to be inadequate over the past [2][3]. It is not possible to automate the co-design process by using natural language descriptions. The designers have therefore evolved an executable-specification approach to overcome this limitation. In this approach the system’s functionality is first captured with an executable language and then the functional objects are derived and partitioned. Since the specifications are machine readable, it is possible to develop tools to automate co-design. The specifications could be verified using simulation, thereby eliminating errors early in the design and preventing costly changes in the subsequent stages of the design. System Specifications Hardware-Software Partitioning Hardware Descriptio Hardware Synthesis Software Description Software Synthesis ‘------------------_--. Interface Synthesis Hardware Hardware-Software Configuration Interface 1 HW-SW Integration and Co-simulation Y Integrated System] A A System Evaluation Design Verification Figure 1.2: Conventional co-design methodology. System Specification 1 a Modeling System Model Validation Refinement Validated System Model Simulation and Verification l [ System Partitioning ] / l \ Hardware Synthesis Interface Synthesis \l Technology Assignment 1 System Evaluation Software Synthesis Figure 1.3: Model-based co-design methodology. 1.4.2 Partitioning The next step is partitioning, where the designer decides to realize various components of the design in hardware and software. Partitioning is usually done manually. Partitioning may be done at various levels of abstraction or at different stages of the design. Early partitioning is preferred by the industry because of requirements of preplanning the development cycle. This however restricts late changes in the specifications. Late partitioning may result in better performance optimization and allows user change requests at later stages of the design. 1.4.3 Synthesis Synthesis is final realization of a system in hardware and software. It also provides some feedback to the partitioning process based on the technology requirements or the specific application of the system. Tools for software synthesis have already been available in the past because of efforts put in by the software designers in development of compilers. Hardware synthesis tools are also appearing which have made rapid prototyping of hardware possible. 1.4.3.1 Hardware Synthesis The hardware platform on which the software is to execute and the dedicated hardware that shares the functionality are synthesized during this phase, using the results of partitioning. Hardware synthesis involves technology binding by translation (mapping) of hardware descriptions such as VHDL, HardwareC, BLIF, etc. into gate level netlists. 1.4.3.2 Software Synthesis Software synthesis involves translation of functionality into a program for a particular processor. The software description is generated in terms of a high level language such as C/CH. A compiler is then used to translate this description into an executable code for the selected processor. The software synthesis may also include synthesis of the operating system, some times called real-time operating system (RTOS) due to the real-time reactive nature of embedded systems [1]. 1.4.3.3 Interface Synthesis In order to provide the signal/data exchange capability an interface between the hardware and software is often required. Interface synthesis provides a means of hardware and software synchronization. Typically signal exchange (hardware), semaphore (software) or interrupt driven schemes are employed in this phase [1]. Implementations range from custom logic to dynamically configurable logic devices. A central scheduler in software may also be incorporated to control the hardware processes. 1.4.4 Hardware-Software Integration and Co-simulation The integration step involves co-simulation of hardware and software on a heterogeneous simulator. The results of simulation provide an assessment of the performance of design in terms of meeting the specifications and satisfaction of constraints. The verification step ensures that the designed system performs according to specifications. 10 1.5 Model-based Approach Modeling is the process of conceptualizing and refining the given specifications. There exist a wide variety of potential formalizations of a design, but often a relation between a set of inputs and outputs characterizes the behavior of a system. This relation could be informal, may even be expressed in natural language, but the result of such an informal specification can easily be an expensive and unnecessary redesign. Formal modeling of a design consists of the following components: A functional specification, given as a set of explicit or implicit relations, which consists of inputs, outputs and internal state information. o A set of performance indices that evaluate the quality of the design (cost, reliability, size, speed, power consumption, etc.) given as a set of equations, which include at least inputs and outputs. 1» A set of constraints on these performance indices, specified as a set of inequalities. - A set of properties that must be satisfied, given as a set of relations over inputs, outputs and states that can be checked against the functional specification. 1.5.1 Validation After transforming a specification into a formal model, the whole model has to be checked on its functional correctness. This step is known as validation. It includes functional validation of subtasks and of the whole system, including formal verification and functional simulation. Functional validation and simulation concerning functionality and constraints such as hardware size, costs, timing and reusability are also performed. 11 1.5.2 Partitioning and Implementation After validation, the designer can start to be more concrete by partitioning the model. The decision of what to do in hardware and what in software is important. If the best partitioning is found and validated successfully, a prototype of the hardware and the necessary machine code is generated for testing the system in actual environment. 1.6 Partitioning Problem One phase of the co-design process, the partitioning of specification into components and binding them to hardware/software resources, is the focus of current research. It is the central problem where a system designer decides which components of the system will be implemented in hardware and which will be realized in software. Partitioning requires an effective means of exploring the design space through evaluation of candidate solutions, considering the interaction of multiple constraints [2]. The decision to put a particular component in hardware or software has to be based on an evaluation of the metrics of interest for the entire system. This evaluation can either be done in the physical domain by actual implementation, e.g., by synthesizing the hardware to a gate netlist on which accurate metrics for area and performance can be obtained; or it can be done in the model domain, which is less accurate but much faster. Co-design is an iterative process that requires repeated partitioning and evaluation for design space exploration, and thus the speed of this process is a critical issue. In practice, it means that the model domain is the right choice for efficient estimation and evaluation. 12 The hardware-software partitioning problem involves two key metrics: performance and hardware size. Performance is normally improved by moving objects to hardware, while the hardware size is obviously improved by moving objects out of hardware. This tradeoff has led to the development of specialized algorithms for hardware-software partitioning. Depending on the underlying theoretical model, the level of abstraction, and the integration strategy, several performance and size estimation methods are available [5]. Among the possible partitioning schemes derived mostly from related areas, such as VLSI design, the deterministic, statistical, benchmarking and profiling techniques are most popular. Deterministic estimation requires a fully specified model, with all data dependencies removed and all costs of components known. This method leads to very good partitions but fails whenever data items are unavailable. Statistical estimation based on the analysis of similar systems and certain design parameters is then required. 1.7 Scope of This Work This thesis examines the hardware-software co-design framework called POLIS to explore the possibility of performing automatic partitioning. POLIS is a software system which takes as input the system specifications in terms of an executable code written in a high-level language called Esterél [6] and provides the output in terms of C” code for software and VHDL code or technology-mapped gate-level netlists for hardware. The partitioning process in POLIS is manual. In this thesis we investigated automatic partitioning algorithms that may be integrated within the POLIS framework and report the results of several case studies. 13 In the next chapter the related work in the area of co—design, with an emphasis on the partitioning issues are discussed. Various techniques and algorithms have been explored and their strengths and weaknesses are highlighted. Chapter 3 presents an in-depth analysis of the design flow in POLIS and the partitioning methodology based on simulations using the Ptolemy simulation platform. In Chapter 4, partitioning issues and algorithms for automatic partitioning are discussed. Application of automatic partitioning algorithms in POLIS and their implementation in C” are presented in Chapter 5. The results obtained after the application of selected algorithms to automate the partitioning process in POLIS and their validity are also discussed in this chapter. The last chapter draws some conclusions and sets forth the goals for future work in this direction. 14 CHAPTER 2 RELATED WORK Hardware-software co-design with its promise of bringing order to the chaotic world of embedded system design, has attracted a lot of attention in recent years. Several research groups have addressed the problem of co-design and in particular that of hardware-software partitioning. Two research projects can be called the pioneers in the field of hardware-software partitioning: Vulcan [7][8][9] and COSYMA [10][11]. Others have also joined in the effort, and numerous co-design frameworks employing automatic or manual partitioning techniques are now being developed at various universities and research institutions. A brief description of some of these frameworks and environments is presented in this chapter. Where applicable, the partitioning strategies employed in each environment are also identified. Chapter 4 describes the partitioning in detail. Table 2.1 lists the co-design environments examined in this thesis along with their important attributes. System Input Partitioning Algorithm 1 . Vulcan HardwareC Automatic Iterative 2. COSYMA C, C" Automatic Simulated Annealing 3. LYCOS C, VHDL Automatic PACE( Fine grain partitioning) 4. COMET VHDL Automatic Scoreboard (Ratio-cut) 5. PTOLEMY C, ptl-code Manual Iterative 6. SpecSyn SpecChart Automatic Clustering 7. TOSCA C, VHDL Automatic Clustering 8. POLIS Esterél Manual None 9. Chinook Verilog Manual None Table 2.1: List of co-design frameworks examined in this thesis. 15 2.1 Vulcan The Vulcan system was developed at Stanford University [7][8]. The target architecture consists of single CPU and one or more ASICs, connected to the CPU via a communication bus. It starts with an all-hardware solution specified in HardwareC. The input to Vulcan consists of two components: system’s functionality and design constraints. The specification is translated into an internal graph-based representation on which the partitioning is performed. The partitioning algorithm uses an iterative approach to move operations from hardware to software under a timing constraint. It tries to keep the communication cost minimum by keeping the neighboring vertices together in either software or hardware. The Vulcan system attempts to reduce the size of hardware by moving the functionality to software using a CPU. The system can handle multiple processes as hardware and software can run in parallel. 2.2 COSYMA COSYMA is an acronym for COSYnthesis of eMbedded Architectures. It is a hardware-software codesign system designed at Technical University of Braunsheweig, Sweden [10][11]. COSYMA has a graphical user interface which runs under X-windows environment. The target architecture for COSYMA consists of one CPU and one ASIC, where ASIC is used as a co-processor to speed up the execution. The input to COSYMA comprises a high level description of the algorithm in C" and a constraint file. C" is a C dialect with some restrictions and modifications to ANSI C. The objective of COSYMA is to partition the algorithmic description into hardware- software parts such that the user supplied constraints are met while keeping the hardware 16 cost minimum. The system provides the user with information to fine-tune or remove the emerging bottlenecks from the algorithmic description. Input to COSYMA can be of three types. The input file can be in ANSI C, pure C“ or a C" program with parallel extensions to encode concurrent interacting processes. The user can specify rate constraints, inter/intra-process constraints and communication between processes. The input description is translated to a syntax graph. The run-time analysis of the program is then performed. For the parallel extensions, a scheduler orders the processes for a single processor environment with respect to their timing and communication constraints. COSYMA uses simulated annealing algorithm for automatic partitioning. The algorithm starts with an all-software solution and moves the chunks of software code to hardware until the timing constraints are met. The software part is compiled by a C compiler and the hardware part is handed over to the high level synthesis. Finally a co—simulation is performed using the compiled software part and the timing information of the hardware part supplied by the high level synthesis. 2.3 LYCOS: The Lyngby Co-Synthesis System LYCOS is an experimental co-synthesis environment being developed at Technical University of Denmark [12]. LYCOS targets an architecture consisting of a single CPU and a single dedicated hardware component (ASIC, FPGA etc.) communicating through memory mapped 1/0. The areas of application include DSP, embedded systems, software execution acceleration and hardware emulation and prototyping. LYCOS is built as a suite of tools centered around an implementation independent model of computation based on communicating Control Data Flow Graphs (CDFGs). The 17 design process starts with an input specification described in either C or VHDL translated to the computational model, Quenya. The CDFGs are divided into chunks of computation called Basic Scheduling Blocks (BSBs) [13], that may be moved between hardware and software. An automatic partitioning algorithm called PACE [14] is used to partition the CDFG by moving BSBs from software to hardware to achieve the best speed-up while keeping the total hardware area less than or equal to the area available on a particular ASIC or FPGA. PACE is based on a fine-grain partitioning [15] approach where adjacent blocks sharing the same variables are moved to hardware so as to reduce communication between hardware and software as well as increase hardware utilization. 2.4 COMET The main goal of the COMET project [16] at University of Cincinnati is to transform a high-level system specification into application-specific electronic signal processing modules using a hardware-software co-synthesis process and to produce working hardware within a two—week time period. The input to and output from the COMET system are in VHDL. The design framework maintains a database of modules with a variety of functionality. The system specification is divided into modules, matched to component specifications, and then allocated to either hardware or software synthesis processes. The Co-synthesis process is iterative during which alternate bindings are used to satisfy constraints such as performance and area requirements. The Co-synthesis tool issues requests to the design database using qualifications on design properties, and the query processor determines the set of design objects that match the request. In other words, a 18 query is a module description, and any modules in the database that have at least the desired functionality (possibly additional functionality) are returned. The co-synthesis tool analyzes candidate solutions and determines the best assignment of resources to hardware and software using an iterative binding approach, called Scoreboard algorithm. The hardware and software specifications are processed by hardware and software synthesis tools, then integrated to form a system that satisfies the initial specifications. The end result of these transformations is an application-specific hardware design that can be fabricated along with the embedded software that will be executed on the manufactured hardware. 2.5 PTOLEMY The PT OLEMY [17] [19] software is a system-level design framework developed at University of California, Berkeley. The objectives of the PT OLEMY project include most aspects of designing signal processing and communications systems, ranging from designing and simulating algorithms to synthesizing hardware and software, parallelizing algorithms, and prototyping real-time systems. PT OLEMY allows the interaction of diverse models of computation by using the object-oriented principles of polymorphism and information hiding. For example, using PTOLEMY, a high-level data-flow model of a signal processing system can be connected to a hardware simulator that in turn may be connected to a discrete—event model of a communication network. PTOLEMY is still in a state of evolution. Recent enhancements of the software have been in the field of data-flow modeling of algorithms, synthesis of embedded software from such data-flow models, animation and visualization, multidimensional signal 19 processing, managing complexity by means of higher-order functions, hardware-software partitioning, and VHDL code generation. PTOLEMY has been used for a broad range of applications including signal processing, telecommunications, parallel processing, wireless communications, network design, radio astronomy, real time systems, and hardware-software co-design. The main emphasis in PTOLEMY is on co-simulation of system modules targeted for different implementations. A hardware-software partitioning algorithm [18] has also been implemented in the framework of PTOLEMY. The input to this algorithm is a system level description. The partitioning goal is to minimize hardware area given a global execution time constraint. 2.6 SpecSyn SpecSyn [3] is another Co-design system, which incorporates automatic hardware- software partitioning. It extends the scope of system design from the previously discussed systems in three key ways: 0 It includes two additional component types such as memories and buses, o It allows inclusion of functionality in terms of variables and communication channels that together with behaviors comprise an executable specification, and 0 The numbers and types of physical system components can be changed as an integral part of system design. In SpecSyn the input specification is produced in the visual language SpecChart which is based on Statecharts [21]. This is translated into an intermediate system representation called SLIF [22], on which the system analysis and partitioning is 20 performed. SpecSyn supports several partitioning algorithms [3] & [23]. [23] Presents a combined approach where clustering is used to reduce the number of code blocks to be considered and a greedy algorithm is used to obtain the partition. The interesting aspect of this approach is that it is able to reach regions in the design space, which lie between the regions obtained by fast greedy algorithms and those obtained by the more costly simulated annealing algorithms. 2.7 TOSCA In TOSCA [20] the internal representation is based on concurrent hierarchical finite state machines (FSM) which are generated from either standard languages such as C or VHDL or from higher-level languages such as SpecChart [2]. Hardware-software partitioning is done automatically by a clustering algorithm, which tries to cluster FSMs based on some closeness criteria. The target architecture for TOSCA is a single standard processor and one or more coprocessors embedded on a single chip. 2.8 Other Frameworks A number of researchers have focused on algorithmic aspects rather than complete systems. Janstch et a1. [24], [25] present a dynamic programming algorithm to solve the partitioning problem of optimizing an existing C program for speed, given a hardware area constraint. The algorithm is derived from the Knapsack Stuffing algorithm [26] and solves (with exponential memory requirements) the partitioning problem for a partitioning model in which blocks can include other blocks and blocks in general therefore cannot be moved to/from hardware independently of each other. A full loop 21 block for example includes the loop body and loop test blocks but all three are considered simultaneously in their model. Another approach using a high level language as input is presented by Barros et a1. [4]. The partitioning algorithm is a two-stage clustering algorithm which selects groups of code based on similarity measures obtained from classification of assignments in the input specification, which is described in UNITY [27]. Co-design systems in which hardware-software partitioning is obtained with user interaction were also investigated. Among these are POLIS [1], PARTIF [28] and CASTLE [29]. POLIS is the focus of our research and its features are discussed in detail in Chapter 3. PARTIF is an interactive partitioning tool, which allows the designer to explore different partitions by applying a small set of transformation and decomposition rules. These rules are applied to a system representation consisting of hierarchical concurrent FSMs. The CASTLE system is another co-design framework where the input specs are given in a standard language, which can be Verilog, VHDL or C/C-H-. This input specification is translated into a common representation based on control data flow graphs called SIR, which provides the backbone for all tools. As for POLIS and PARTIF, the partitioning is done manually, but in CASTLE it is based on mappings from a hardware library which is used to specify complex components including microprocessors. In the Chinook [30] system the emphasis is on module interface and synchronization. The system is used for real-time reactive controllers initially specified in Verilog. Chinook does not provide automatic hardware-software partitioning, but leaves it to the designer, nor does it provide code generation tools for the target processors, but uses 22 standard C compilers. However, Chinook does synthesize the hardware and software needed for inter- process communication which is a difficult task as different components may not initially fit very well together. This chapter has shown that there are two main directions for hardware-software partitioning methods: 0 Automatic partitioning, which almost always means a restricted target architecture i.e. one processor and one or more ASICs/FPGAs. The emphasis in automatic partitioning is on transferring the functionality from one domain to the other so as to achieve the best performance at a minimum cost. 0 Manual partitioning, which typically allows for more advanced architectures involves detailed analysis of design tradeoffs. The complexity of the system and design constraints demand exploration of a variety of architectural choices to meet the system specifications. Also, with advanced target architectures, synchronization and communication between different components becomes much more difficult and very important. With this review, we may now proceed to explore the design methodology adopted by POLIS. 23 CHAPTER 3 POLIS 3.1 Introduction POLIS is a software system developed at the University of California Berkeley for hardware-software co-design of control—dominated embedded systems [1]. The main aspect of POLIS that distinguishes it from other co-design methods is the use of a formal model of computation. This model, called Co-design Finite State Machine (CFSM) [1], is based on the Extended Finite State Machine (EFSM) [1] model operating on a set of finite-valued (enumerated or integer sub-range) variables by using arithmetic, relational and Boolean operators as well as user-defined functions. EFSM model is similar to the FSM model, but the transition relation may also depend on a set of internal variables. POLIS offers a flexible environment for design analysis and verification. The design can be analyzed at the behavioral level either with formal tools such as model checking or by co-simulation in a heterogeneous simulation environment offered by PTOLEMY, another tool for co-design. The user can select an architecture for evaluation of design tradeoffs with respect to constraints during the simulation phase. Hardware-software partitioning is based on the results of simulation. Software synthesis is fully automated including generation of a custom scheduler and hardware synthesis also involves limited user interaction. POLIS has a path towards an emulation board including Xilinx FPGAs [1], the microprocessor of choice, and A/D and D/A interfaces. 24 3.1.1 Model of Computation (CFSM) A Co-design Finite State Machine (CFSM), as does a classical finite state machine, transforms a set of inputs into a set of outputs with only a finite number of internal states. The difference between the two models is that the synchronous communication model of classical concurrent FSMs is replaced in the CFSM model by a finite, non-zero, unbounded reaction time. This model of computation can also be described as Globally Asynchronous, Locally Synchronous. Each element of a network of CFSMs describes a component of the system to be modeled. The CFSM specification is a priori unbiased towards a hardware or software implementation. While both perform the same computation for each CFSM transition, hardware and software exhibit different delay characteristics. A synchronous hardware implementation of CFSM can execute a transition in one clock cycle, while a software implementation will require more than one clock cycle. CFSM is also a synthesizable and verifiable model, because many existing theories and tools for the FSM model can be easily adapted for CFSM. Each transition of a CFSM is an atomic operation. All the analysis and synthesis steps ensure that: 0 A consistent snapshot of the system-state is taken just before the transition is executed. 0 The transition is executed, thus updating the internal state and output of the CFSM. o The result of the transition is propagated to the other CFSMs and to the environment. 25 The interaction between CFSMs is asynchronous, in order to support neutral specification of hardware and software components by means of a single CFSM network. This means that: The execution time for a CFSM transition is unknown but assumed to be non-zero in order to avoid the problem of zero-delay feedback loops. The synthesis procedure refines this initial specification, by adding more precise timing information, as more design choices are made (e.g. partitioning, processor selection, compilation, etc). The designer, during the analysis steps, may on the other hand add constraints on this timing information that synthesis process should satisfy. The overall design philosophy of POLIS is to provide the designer with such tools that help in meeting these constraints, rather than provide a quick- fix solution. Communication between CFSMs is not by means of shared variables (as in the classical composition of finite state machines), but by means of events. An event is a semi-synchronizing communication primitive that is both powerful enough to represent practical design specifications and efficiently implementable in hardware, software, and between the two domains. CFSM’s behavior and the CFSM network topology are represented using an intermediate language called SHIFT, for Software Hardware Interchange FormaT. SHIFT is however, not meant to be used as a specification language. The FSM semantics of each CFSM ensures that any of the following graphical or textual languages can be used to specify individual behaviors: O Reactive synchronous languages, such as StateCharts, Esterél, Lustre and Signal; 26 0 The so-called synthesizable subset of hardware description languages such as VHDL and Verilog; and 0 System specification languages with FSM semantics such as SDL [2]. The interconnection between the CFSMS, on the other hand, can be specified (due to the distinctive asynchronous interconnection semantics) within the POLIS environment, using either a textual netlist auxiliary language or a graphical editor VEM [17] that is part of the PTOLEMY co—simulation environment. Events are emitted by CFSMs and/or by the environment over a set of carriers called signals. One or more CFSMs can detect the emission of each event (the actual delay depends on several implementation-related factors, such as partitioning, scheduling policy and so on). Each detecting CFSM has its own copy of the event, and each emission can be detected at most once by each receiving CFSM. Signals can carry control information, data information, or both. Events occurring on pure control signals, such as reset input, can be used only to trigger a transition of a CFSM. Once the receiver CFSM has detected an event, it can no longer be detected again until its sender CFSM re-emits it. Values carried by data signals, such as keyboard input or a data sample, can be used as inputs to and output from the CFSM data path. Signals carrying only control information are often called pure signals, while signals carrying only data information are often called pure values. Each CFSM transition has a pre- condition of a set of: 0 Input event presence or absence conditions (only for signals with a control part) 0 Boolean functions of some relational operations over the values of its input signals. 27 The post-condition is the conjunction of a set of: 0 Output event presence or absence conditions (presence implies emission, absence implies no action), and 0 Values assigned to output data signals. Note that no buffering is provided by the POLIS communication mechanism, apart from the event and value information. This means that events can be overwritten, if the sending end is faster than the receiving end. This overwriting, also called “losing” may or may not be a problem depending both on the application and the type of the event. The designer can make sure that “critical” events are never lost either by: 0 Providing an explicit handshaking mechanism, built by using a set of signals, between the CFSMs, or 0 Using synthesis directives, such as partitioning choices or scheduling techniques, that ensure that no such loss can ever occur. For example, this can be achieved by: a) implementing the receiving CFSM in hardware b) implementing both CFSMs in software and using a round-robin scheduler that executes both at the same rate. 3.2 Design Flow 3.2.1 Overview An overview of the design flow in the POLIS system is depicted in Figure 3.1. The detailed composition of POLIS is shown in Figure 3.2. The input specifications are described in Esterél language. The Esterél code is first translated to SHIFT code. This format is used by POLIS to generate S-Graphs similar to Control Data Flow Graph 28 (CDFG) for software synthesis and BLIF (Berkeley Logic Interchange Format) code for hardware synthesis. In the architecture selection step, a processor is chosen. POLIS supports Motorola 68HC11 and 68832, and MIPS R3000 processors. POLIS has powerful software and hardware cost estimation capabilities, which can guide the user in selection of the best architecture. Design specification (Esterél) ‘ 1 Translation to SHIFT Architecture selection Design Modification DESig." v valrdatron Partitioning, and architecture and scheduler selection by high-level co- simulation (PTOLEMY) Hw, Sw and Interface synthesis j Rapid prototyping Figure 3.1: Overview of the design flow in POLIS [courtesy POLIS group]. 29 [Formal Languages I Translators if I System Behavior i , SChedUIGI floning ormal EAL $flfiéme I Verification l nterf ace Synthesis \ Constraints Erttioned Specification / i ii 1 1i i 1r it [ S—Graph | r Unofiimized HW I | HW Interfaces I [Verif Interm. Format] Hw Estimation Sw Estimation OS Sy® Task Synthesis Logic Synthesis IOptmized HW Partitioning O mized HW 1? .1 BOARD LEVEL w: Standard Components ‘1 [ Physical Prototype J Figure 3.2: The POLIS design flow [courtesy POLIS group]. 30 The S-Graph is further translated to C code and PTOLEMY code for use by PTOLEMY for simulation. The cost estimates are also passed to PTOLEMY for use in simulation. The user can interactively define the system architecture within PT OLEMY as well by changing the implementation of each CFSM to software or hardware. The software and hardware models are executed in the same simulation environment, and the simulation code is the same that will run on the target processor. The simulation can be done at two levels of abstraction: functional and timing. In functional simulation, timing is ignored and only the functional correctness is checked. In timing-based simulation, the timing is approximated using cycle count estimations for software and using a cycle- based simulation for hardware. After starting the simulation, input events are generated and then the overflow file is checked for any missed deadlines. The processor clock speed can be adjusted to simulate different versions of a processor. If the highest speed processor is also unable to meet the deadlines, then the critical CFSM is transferred to hardware by changing its implementation. Depending upon the results of co-simulation, the final partitioning is decided. A complete SHIFT netlist describing the CFSM-network topology and implementation choice for each CFSM is then created. This SHIFT file is passed back to POLIS for synthesis of hardware (BLIF) and software (C code). The final software also contains a RTOS for the target processor, and the hardware contains the interface circuitry required for communication between hardware and software. There is a path available in POLIS for rapid prototyping of final design for functional validation or in- system testing. Results of this step may then be fed back for the improvement in design. 31 3.2.2 High Level Language Translation In POLIS, designers write their specifications in a high level language (e.g., Esterél, graphical FSMs, subsets of Verilog or VHDL) that can be directly translated into CFSMs. Any high level language with precise semantics based on extended FSMs can be used to model individual CFSMs. Currently, however, Esterél is supported directly. The Esterél programs are translated to SHIFT format using the command strlZshift. The Esterél compiler first compiles Esterél code and then another tool othShift translates the output to SHIFT. 3.2.3 Formal Verification The formal specification and synthesis methodology embedded within POLIS makes it possible to interface directly with existing formal verification algorithms that are based on FSMs. POLIS includes a translator from the CFSM to the FSM formalism which can be fed directly to verification systems (e. g. V18 [31]). In addition to uncovering bugs in a design, it also uses formal verification to guide the synthesis process. Since the abstract CFSM model covers the behavior of all possible hardware-software implementations at once, it is possible to refine the specification based on the results of formal verification. Formal verification tools of today still have problems with complexity. A methodology has been developed that incorporates a set of rules specific to POLIS and CFSMs so that it is now possible to verify larger designs. 32 3.2.3 System Co-simulation System level hardware-software co-simulation provides feedback on the design choices. These design choices include hardware-software partitioning, CPU selection, and scheduler selection. Fast co-simulation, in the order of millions of clock cycles per second (on a workstation) is possible due to the software synthesis and performance estimation techniques available. The purpose of high-level co-simulation in POLIS is to provide the designer with a flexible environment where architectural tradeoffs can be explored. POLIS currently utilizes PTOLEMY as a simulation engine, but it is not limited to PTOLEMY. VHDL code including all the co-simulation information such as code size and delays etc. can also be generated by POLIS. 3.2.4 Partitioning and Architecture Selection Making system—level design decisions such as hardware-software partitioning, target architecture selection and scheduler selection is not a trivial task. These decisions are based heavily on design experience and are very difficult to automate. In POLIS the designer is provided with an environment to quickly evaluate any such decisions through various feedback mechanisms from either formal verification or system co-simulation. This feedback is however limited to the results of simulation in terms of signal arrival times and missed deadlines. The key advantage of CFSM specification is that it is implementation-independent. The designer can interactively explore the implementation options using the same user interface as co-simulation. 33 3.2.5 Hardware Synthesis A CFSM sub-network chosen for hardware implementation is implemented and optimized using logic synthesis techniques from SIS [32]. Each CFSM, interpreted as a Register-Transfer Level (RTL) specification, can be mapped into BLIF, XNF (XILINX Netlist Format), VHDL or Verilog. 3.2.6 Software Synthesis A CFSM sub-network chosen for software implementation is mapped into a software structure that includes a procedure for each CFSM, together with a simple real-time operating system (RTOS). The reactive behavior of CFSM is synthesized in a two-step process: 0 Implement and optimize the desired behavior in a high-level, processor- independent representation similar to a CDFG, 0 Translate the CDFG into portable C code and use any available compiler to implement and optimize it in a specific microcontroller dependent instruction set. A timing estimator quickly analyzes the program and reports code size and speed characteristics. The algorithm uses a formula, with parameters obtained from benchmark programs, to compute the delay of each node in the CDFG for various microcontroller architectures (characterization data for MIPS R3000 and Motorola 68HC11 and 68332 are already available). The precision of the estimator, with respect to true cycle counting, is currently on the order of i 20 %. An application-specific operating system, consisting 34 of a scheduler (e.g. Rate-Monotonic and Deadline-Monotonic) and I/O drivers, is generated for each partitioned design. 3.2.7 Interfacing Implementation Domains Interfaces between different implementation domains (hardware-software) are automatically synthesized within POLIS. These interfaces come in the form of cooperating circuits and software procedures (I/O drivers) embedded in the synthesized implementation. Communication can be through I/O ports available on the micro- controller, or general memory-mapped I/O. 3.2.8 Rapid Prototyping A rapid prototyping environment is also available in POLIS based on APT IX architecture [1] [33] system. 3.3 Design Example In order to elaborate the various design steps in POLIS, it is convenient to work through an example. A seat-belt alarm example given in POLIS user’s manual [33] is highlighted in this work to explain the design methodology. Figure 3.3 shows the transition diagram for the system. The specification of the system is stated as follows: When the ignition key is turned on, wait for five seconds for the belt to be fastened. If the belt is not fastened within five seconds, turn the alarm on for five seconds. If the belt is fastened or the ignition is turned off, then don’t turn the alarm on. 35 key_on / start__timer key_off OR belt_on / alarm (0) end_10 OR end_5 / alarm (1) belt_on OR key_off OR key_on / alarm (0) Figure 3.3: Transition diagram of seat belt controller [courtesy POLIS group]. 3.3.1 Specification The initial specification of the system is written in Esterél language. The system is divided into two modules; belt_control and timer, and functionality of each module is then described in terms of an Esterél program. The belt_control program is shown below: module belt_control: input reset, key_on, key_off, belt_on, end_5, end_10; output alarm: boolean, start_timer; loop abort emit alarm(false); every key_on do abort emit start_timer; await end_5; emit alarm(true); await end_lO; when [key_off or belt_on]; emit alarm(false); end when reset and 36 The first line of each Esterél program gives a name to the module being described. The next two lines declare the input and output signals of the module. All input signals here are control signals meaning they do not have any values associated. One of the output signals alarm has a value of type Boolean. The loop statement starts an infinite loop. The abort statement instantaneously kills its body whenever reset signal is received. The emit alarm (false ) signal has an associated Boolean value to turn the alarm off. The every key_on do statement executes its body every time key_on is present. The abort statement instantaneously kills its body whenever key_ofi‘ or belt_on signals are present. The emit start_timer signal is an output to be sent to timer module. The await end_5 halts the execution until end_5 signal is received. The emit alarm (true) signal is an output signal with a value (true) to turn the alarm on. The await end_10 again halts the execution until end_10 signal is received, which means that five seconds have elapsed since the alarm was turned on. The emit alarm (false) signal then turns the alarm off and the system goes back to its initial or OFF state. Following is a listing of timer module: modulo timer: constant count_5 , count__lO : integer; input msec, start_timer; output end_5, end_10; every start_timer do await count_5 msec; emit end_5; await count_lO msec; emit end_10 ; and The second line declares a constant, whose actual value will be defined later in the design. The timer starts counting whenever it receives the start_timer signal. The await 37 const_5 msec statement counts const_5 transitions in which signal msec has an event, and then yields control to the emit statement that emits the corresponding output signals. 3.3.2 Estimation The next step is to read the design into POLIS. The POLIS environment is invoked by giving the command polis at the command prompt. The POLIS interpreter is then used to enter POLIS commands. The SHIFT file is read by giving the command read_shift filename. To read the belt_control.shift file, the following command is used: read_shift -a belt_control.shift; where -a means that no auxiliary file is used In the next step an implementation choice for the module is specified, using set_impl -s/h, where -s means software and -h means hardware. Initially software implementation is selected unless a module is specified for implementation in hardware only. Following command sets the implementation to software: set_impl -s The module is assigned to the selected partition by the command partition. The selection of target microprocessor is done by using the command set arch processor. We will select Motorola 68HC11 as our target processor. partition set arch 68hc11 The estimation tools are run to get an idea of the size and speed of the resulting software. The command is: read_cost_param 38 The SHIFT format is now translated to S—Graph representation by build_sg command. The graph is then optimized internally by POLIS and translated to C code using: sg-to-c -d ptolemy The PTOLEMY code is also generated for use in simulation by the command: write__pl -d ptolemy 3.3.3 Co-simulation Now we exit from the POLIS program and use the Makefiles, available in POLIS distribution, to create enough information to simulate the entire design in PTOLEMY. Subsequently the modules are instantiated in PTOLEMY environment as stars and their interconnections are described as galaxies. After setting various parameters we run the simulation to verify that the design behaves as expected. The design can be improved by selecting different implementation of the modules or selecting different architectures. This is a manual process therefore the quality of design depends upon the user’s experience rather than the design choices available. Once the design has been verified using this mixed simulation, a real implementation is created. We again go back to POLIS and read the final design into it. An auxiliary file is also generated to indicate the interconnections between the modules. Each module is designated for implementation in either hardware or software. The synthesis process performs different operations for hardware and software. 39 3.3.4 Hardware Synthesis The SHIFT files are translated to BLIF files. The command for this is: net_to_blif This representation is then optimized using SIS, which is embedded in POLIS. Finally the netlist is written out either as an interconnection of gates and flip-flops or as an XNF file for the Xilinx FPGAs. The complete set of commands for hardware implementation is as follows: read_shift filenameshift propagate_const set_impl -h set arch 68hc11 partition net_to_blif print_stats source script.rugged write_blif filenameblif print_stats For ASIC: read_library librarygenlib map -W write_blif -n filenamegate For FPGA (XHJNX 3000 family): read_be filenameblif xl__merge -l -o temp.merge write_xnf -M -m -d xilinx temp.merge filename _3000.xnf For VHDL (Outside POLIS): blivast librarygenlib filenamegate -ofilename.vhd Alternatively the designer can also use following make utilities for hardware synthesis but some commands do not perform the required operations due «to certain limitations: make hw_opt make an000 40 3.3.5 Software Synthesis The steps for creating the S-Graph are repeated with the modules for software synthesis. Finally the C code for the software modules and the operating system is generated. The command for generation of operating system is: gen_os -d belt_part_s g The set of commands for software synthesis is as follows: read_shiftfilename.shift propagate_const set_impl -s set arch 68hc11 partition read_cost_param print_cost -c build_sg print_cost -s sg_to__c -d sg_directory gen_os -d sg_directory Alternatively the whole sequence can be executed by the single command: make sw The following command generates a VHDL simulation model of the software: sg_to_vhdl Alternatively the following command can be used to generate a VHDL file: make beh_vhdl 3.3.6 Implementation The output of POLIS can now be used to implement the system. The C code is compiled for the target processor and loaded onto a prototyping board. The hardware netlist is downloaded onto a FPGA using the Xilinx software tools. 41 CHAPTER 4 SYSTEM PARTITIONING Partitioning is one of the central problems in the design of embedded systems. Hardware-software partitioning is a way of deciding that which part or task of the system is to be implemented in hardware and which in software. Finding the best partition to implement a system’s functionality is a critical and challenging task. To achieve this goal a set of system components is to be selected and the system’s functionality is to be allocated amongst them. The partition must lead to an implementation that satisfies a set of design constraints, such as cost, performance, size and power consumption. Partitioning of a system specification onto a target-architecture consisting of a single CPU and a single ASIC has been investigated by a number of research groups [1, 8, 11, 12]. This architecture is of interest in situations where the performance requirements can not be met by general-purpose microprocessor or where a complete hardware solution is too costly. Different approaches have been adopted by different researchers for partitioning of a hardware-software system based on the specific application areas, such as embedded systems, DSP, software execution acceleration and hardware emulation and prototyping. One of the important issues during partitioning is the way the communication between hardware and software is taken into account. Gupta and De Micheli [8] have presented a partitioning approach which starts from an all-hardware solution. Their partitioning algorithm takes communications into account and is able to reduce communication when neighboring vertices are placed together in either software or hardware. The algorithm presented by Henkle, Ernst et a1. [11] is based on simulated annealing algorithm which moves chunks of software code to hardware until timing 42 constraints are met. The algorithm accounts for communication and only variables that need to be transferred are actually taken into account. Kalavade and Lee [18] have presented an algorithm that also takes communications into account by attributing a fixed communication time to each pair of blocks. The COMET [16] project employs the scoreboard algorithm, which involves a three-step evaluation process for selecting the best node to move based on user supplied constraints. 4.1 Partitioning Approaches The partitioning problem is of two types: homogenous or heterogeneous. The homogenous partitioning attempts to partition a system into a minimal number of parts that are completely implemented in either hardware or software. In case of hardware the goal is to reduce the size, whereas software implementations try to achieve a speed up in overall execution time. The focus of heterogeneous partitioning problem is to partition the system into hardware and software implementations. The problem of partitioning into hardware and software is many times more complex as compared to partitioning for implementation into purely hardware or software. There are two different approaches to system partitioning: structural and functional. In the structural approach, the system is implemented with structure first and then partitioned. In the functional approach, the partitioning is done prior to implementation. 4.1.1 Structural Partitioning Structural approach is very popular in hardware design because of its straightforward methods for obtaining size and pin estimates. Another factor for its popularity is the case 43 with which it allows mapping of structural partitioning problem to a graph-partitioning problem, for which an extensive body of formal theory, algorithms and tools exist. It has produced good results in the past due to relatively small sizes of system components. Structural partitioning suffers from three main drawbacks. o It is difficult to make decisions to trade off size and performance while implementing the structure because subsequent partitioning steps may nullify them. It may lead to increased size of hardware or inter-chip communication. 0 With the increasing size of systems the number of objects tend to grow. This leads to poor results from partitioning algorithms. It also makes it more difficult to interact during the partitioning process. 0 The structural approach is limited to hardware design. It does not support allocation of the functionality to software. 4.1.2 Functional Partitioning In this approach, a system’s functionality is first divided into non-divisible parts called functional objects. These objects are then partitioned among system components, which are implemented either as hardware or software. The functional approach is more suited to hardware-software co-design due to several advantages over structural approach. One of the key advantages is that it is possible to make performance/design tradeoffs during the subsequent structural implementation stage with full knowledge of the partition. The performance estimates obtained during the structural implementation stage are accurate because of the complete knowledge of all data transfers between system 44 components. The second advantage is reduction in number of objects to be partitioned, since there are fewer functional objects than Register-Transfer level (RTL) structural objects. It is easier for the designer to interact with the partitioning algorithm while dealing with fewer objects. Functional approach naturally provides a means of hardware- software partitioning, since the partitioned objects are functional. Each object can either be implemented in hardware or software depending upon the system’s requirements. 4.2 Partitioning Issues It is easier to understand and compare various partitioning techniques if the partitioning problem is divided into seven essential issues as depicted in Figure 4.1 and described below. 4.2.1 Specification and Levels of Abstraction Partitioning techniques greatly depend upon the specification language used and the abstraction level of the conceptual model. The language alone may not efficiently describe the functionality in terms of specifications, since the same language can be used to represent many different conceptual models. It is, therefore, important to define the input at an appropriate level of abstraction. At low level of abstraction, the system is defined as a large number of low-complexity objects. At higher levels the system consists of small number of hi gh-complexity objects. The highest abstraction level is the task level. Here the conceptual model does not define the actual computations but the data transfers between the tasks and other characteristics such as size or delay are defined. A data flow graph of tasks and 45 behavioral level description are examples of task level abstraction. At the next lower level, the system is described by a control data flow graph (CDFG), which includes arithmetic and control operations. It is the most-used level in DSP applications and a variety of partitioning algorithms is based on this representation. The next level involves FSMs for system description. They may be simple FSMs or complex ones like CFSMs in POLIS. At RTL the input represents a set of register transfers for each operation. The lowest level of abstraction is structural interconnection of physical components commonly known as netlist. Specification & Levels of Abstraction I O O Granularity O O System-component Allocation [ Metrics and Estimation ] [ Cost Function J l [ Partitioning Algorithm ] i C Output > Figure 4.1: Essential partitioning issues [2]. 46 It is possible to define a single input specification with multiple parts at various levels of abstraction depending upon the intermediate implementations during the design. 4.2.2 Granularity The extent of decomposition of the input specifications decides the granularity of the functional objects. A large number of objects indicates fine granularity with small amount of specification allocated to each object; whereas coarse granularity implies a small number of objects, each with a lot of functionality assigned. The granularity greatly depends upon the level of abstraction. The task level has the most flexibility for decomposition, where each task could further be decomposed into procedures and statement blocks. Fine granularity is better suited for partitioning Optimization but it has certain drawbacks, such as more computation time, difficulty in recognizing fine-grained objects and difficulty in estimation. 4.2.3 System-Component Allocation A partitioning algorithm needs to know the types of system components to which the functional objects may be mapped. These include programmable devices such as processors with associated memories, l/O devices and buses as well as dedicated hardware such as ASICs, FPGAs and standard logic components. Selection of these components is greatly influenced by the specifications and constraints of the design. 47 4.2.4 Metrics and Estimation Metrics are the attributes of a partition that determine its goodness. These include monetary cost, execution time, program size, hardware area, power consumption, communication bandwidth, testability and reliability. Some techniques adopt an incremental approach towards partitioning, where objects are grouped one at a time. A new type of metric is required for this technique that could predict the benefit of grouping any two objects. Such metrics are called closeness metrics. These metrics help in reducing communication between hardware and software by grouping functions sharing same variables in software or data paths in hardware. Computation of metrics is another challenge that needs to be addressed during the co- design process. There are two options available for computing the metrics. Either the system is actually created, resulting in accurate metric values or a rough implementation is created resulting in estimated values. The second option is also known as estimation. An estimation process must posses speed and accuracy. These are two conflicting requirements because speed requires less amount of detail in the implementation, whereas the accuracy requires detailed implementation. The lower the level of abstraction the better are the estimates. 4.2.5 Cost Function A cost function defines the goodness of a partition in terms of various metrics. Different metrics of a system have conflicting requirements and it is very important to assess their combined effect as a single value. The cost function is a linear weighted-sum expression of the products of metrics and their associated weights, where weight 48 indicates each metric’s relative importance to the goodness of partition. For example the following cost function is a weighted-sum expression of four metric values; hardware- area (A), program—size (S), program-execution-time (T) and hardware-delay (D) are weighted by constants w1, w2, W3 and m respectively, and then summed: Cost_Func =w,-A+w2-S+w3-T+w4-D selecting a larger value for w than wz, W3 and W4 makes the hardware-area more important metric. It is more common to incorporate the constraints into the cost function to help make constraints driven design decisions. The partitions that meet the constraints are considered better than those that do not. A generalized cost function, incorporating the constraints can be written as follows: Cost_Func 2 EW; -(m,. —c‘.) i=1 where i, is the index of metric, n is the total number of metrics, w,- is the weight for the ith metric, m,- is the ith metric and c,- is the constraint on ith metric. It is also important to take care of the units of various metrics included in the cost function. It would not be a good idea to compare units of area with the units of delay. To overcome this problem normalization of metric’s units is carried out. It is achieved by dividing the difference of the metric and constraint by the constraint. A cost function employing this technique is given below: Cost_Func = 2": w, ~(ml. - c,)/c,. i=1 49 While a cost function combines metrics to evaluate a partition, a closeness function combines closeness metrics to indicate the desirability of grouping the objects, before a complete partition is formed. 4.2.6 Partitioning Algorithm For a given set of functional objects and a set of system components, it is the goal of a partitioning algorithm to find a partition with lowest cost as computed by the cost function. With a relatively large number of functional objects it would take a large amount of computations to evaluate all possible partitions. The algorithm must therefore be capable of choosing a subset of all possible partitions, evaluate each of them and find the best partition that meets the constraints. The partitioning problem can be defined as follows [2]: Given a set of Objects O = {01, 02, ....on}, determine a partition P = {p1,p2, ....pm} such that p1 Upz U pm = 0, pi fl pj z ¢ for all i, j, i at j, and the cost determined by an objective function Cost_Fuct (P) is minimal. 4.2.6.1 Constructive/Iterative Algorithms Partitioning algorithms can be generally classified as constructive or iterative. Constructive algorithms group objects into a complete partition. These algorithms use closeness metrics to group objects to achieve a good partition. The Iterative algorithms repeatedly modify a partition for improvements. They use an objective cost function to 50 evaluate the partition, which yields better results as compared to closeness functions used by the constructive algorithms. 4.2.6.2 Greedy/Hill-climbing Algorithms There are two types of partitioning algorithms being used for hardware-software co- design: greedy and hill-climbing. Greedy algorithms start with an initial partition and move objects to the opposite group as long as the cost is improved. They cannot escape a local minimum. The concept of local minimums is depicted in Figure 4.2. Point A is a local minimum, whereas point B is the global minimum. Hill-climbing algorithms have the capability of escaping local minimum. This is achieved by accepting cost increasing moves during the iterations as shown in Figure 4.2. Cost V Number of Moves Figure 4.2: Escaping local minimum in iterative partitioning [2]. 51 4.2.7 Output The output of a partitioning algorithm must be comprehendible by the user who has to perform the next step of design. The output may be a list of objects indicating which object is to be implemented in hardware and which in software. It may be a new version of specifications that contains information about the structural implementation of system components. The output may be machine-readable so that a synthesis too] could use it to map the functions onto hardware and software components. 4.3 Basic Partitioning Algorithms Most of the partitioning algorithms adopted by hardware-software co-design researchers are commonly used in hardware partitioning. Iterative techniques such as simulated annealing (SA), Kernighan—Lin (KL), Fiduccia—Mattheyses (FM) and genetic algorithms (GA) are some examples. Hardware partitioning provides a means of breaking a system into smaller, more manageable parts based primarily on the number of communication channels between them. Many partitioning approaches have been developed which incorporate an ancillary goal of balancing the relative sizes of the parts. However, during hardware-software partitioning, this goal is not relevant because the system has to be partitioned into two parts only, i.e., hardware and software. Given an initial partition of the system into two parts, iterative techniques move one or more objects between the partitions in an effort to minimize objective function cost. Figure 4.3 shows a classification of partitioning algorithms. 52 Partitioning Algorithms l t i Constructive Algorithms Iterative Algorithms ._—T l domMappmg] + Group Migration + Hierarchical Clustering Simulated Annealing + Multi-Stage Clustering + Genetic Evolution L, Binary Constraint Search I:I Hill-climbing Algorithms C] Greedy Algorithms Figure 4.3: Classification of automatic partitioning algorithms. 4.3.1 Vulcan The Vulcan [7] system uses two types of greedy algorithms. Vulcan 1, deals with hardware partitioning only, while Vulcan 11 attempts to reduce hardware cost by partitioning the functionality between hardware and software. Both algorithms adopt an iterative approach to find the best partition. 4.3.2 Ratio-Cut The ratio-cut algorithm is a constructive algorithm that was originally developed for structural partitioning. This algorithm is very effective when large number of objects have to be moved between the partitions. The algorithm groups objects till the time no 53 object is considered suitable to be merged. Scoreboard algorithm [16] is a version of ratio-cut method that is used for hardware-software partitioning in the COMET system. 4.3.3 Group Migration (Kernighan-Lin) This algorithm was originally proposed by Kemighan and Lin [34] in 1970 for graph partitioning to improve two-way partitions. It has been modified over time by Fiduccia and Mattheyses [35] and later by Krishnamurthy [36] to require less computation and obtain better results. This algorithm yields excellent results for both structural and functional partitioning. The control strategy of the algorithm for two-way partitioning may be stated as follows: For each object, determine the decrease in cost if the object is moved to the other group and then move the object if it generates the greatest decrease or smallest increase in cost. In order to prevent an infinite loop caused by moving the same object back and forth between partitions, each object can be moved once only. After all the objects have been moved once, the lowest cost partition is selected as the initial partition and another iteration is done. A Group migration algorithm with two way partitioning is listed in figure 4.4. The algorithm starts with an initial two-way partition PM = {P1, P2}, where P; may be termed as hardware partition and P2 as software partition. 54 P=Pinjr // Loop // P pm, = P // Cost prev = Cost_Func(P) // COSI best_P = 00 // Loop for each Obj, // Obji.moved = false // End Loop Loop for i=1, n // COSt best_move = °° // Start with an initial partition Iterative Loop Previous Partition is equal to new Partition Previous Cost is equal to the cost of new Partition Best Partition Cost is equal to infinity Initialize status of each Object Moved flag of each object is set to false Create a sequence of n moves Best_move Cost is equal to infinity Loop for each Obji.moved = false P = Move(P, Obit) // Create new partition by moving Obj,- to opposite side Cost = Cost_Func (P) // Calculate Cost of new Partition If Cost < Cost best_move then Cost best_move = Cost // Best_move Cost is equal to new Cost Obj best_move = Obji End if End Loop P = Move(P, Obj best_move ) Obj best_movcmoved = true // Best_move Object is now Obji // Create new partition // Set moved flag to true If Cost best_move < Cost best_p then P best_p = P // Best Partition is P Cost be“; = Cost best_move // Best Partition Cost is Best_move Cost End if End Loop If Cost best_p < Cost pm then P = P best_p // P is the Best Partition of this iteration Else return P pm // Otherwise return the previous Partition End Loop Figure 4.4: Group migration algorithm with two-way partitioning [2]. After initialization of certain variables and setting the flag, moved, of each object as false, a sequence of 11 iterations is generated. During each iteration, for each object with a false flag, a function Move(P,Obji) is called that moves object (Obji) to the opposite side and returns a new partition. The cost of each new partition is calculated and the minimum cost (Cost best_move) found during these moves is saved along with the object (Obj best_move) that generated the minimum cost partition. Once all moveable objects have been moved, 55 all possible partitions would have been created and their costs would have been compared. The object (Obj best_move) giving the minimum cost partition is now permanently moved to the opposite side and its flag is set to true so that it will not be moved during the next iteration. A new partition P is thus, created by this operation and its cost (Cost best_move) is now compared with the best partition cost (Cost hw_p) saved earlier. If the cost is lower than the previous cost then the current partition P is saved as the best partition (P best_p) along with its cost (Cost best_p). During the next iteration one less object would be moved and all possible partitions would again be compared. After n iterations the best partition found so far, is returned by the algorithm. The algorithm is iterated by the outer loop till no improvement in cost is generated. The whole process may be repeated with a certain number of new initial partitions and the results may be saved to find the best partition. During the experiments with this algorithm it has been observed that the number of outer loops in the algorithm is indeed less than five, which is in conformity with the earlier observation [34]. 4.3.4 Simulated Annealing This algorithm is similar to group migration in that it also accepts cost increasing moves in order to escape local minimums. But it also allows multiple moves of an object, while limiting the complexity by decreasing the tolerance for accepting cost increasing moves. The algorithm is intended to model the annealing process in physics where a material is melted and its minimal energy stage is achieved by lowering the temperature slowly enough that equilibrium is reached at each temperature. 56 The algorithm starts with an initial partition and an initial simulated temperature. While the temperature is being slowly decreased, for each temperature, random moves are generated. The move that improves the cost is accepted, otherwise the move may still be accepted but such acceptance at lower temperatures becomes less likely. Figure 4.5 lists the simulated annealing algorithm. Temp = Initial temperature // Assign a high value to Temp Cost = Cost_Func(P) // Calculate cost of initial partition Loop while not Frozen // Start of outer loop, continues until Temp > freezing Loop while not Equilibrium // Start of inner loop, continues till equilibrium PM“ = Move_Random(P) // Create tentative partition Costmm = Cost_Func(Ptem) // Calculate cost of partition Costdena = Costtem — Cost // Calculate ACost If (Accept(Costdena, Temp) > Random(0, 1) then P = Pm. // Tentative partition is the new partition Cost = Costtem // Cost of new partition End if End Loop Temp = or x Temp // Lower the teperature End Loop Figure 4.5 Simulated annealing algorithm [2]. A variable Temp, is initialized with a high value and the cost of current partition is calculated. Then a sequence of random moves is generated inside a two-level loop. The moves are accepted until an equilibrium state is reached, that is when for a certain number of moves the cost is not improved. At this point the temperature is lowered by multiplying the old temperature with a variable or, where 0.nn. Counter k is used as the index for accessing a particular object from the array for nn objects and to point to its 67 corresponding position in the partition array. If the flag moved of the indexed object is false and it is not locked then a trial partition is generated by moving this object to the other side using Move() function. If the object was originally mapped to software then it is moved to hardware and vice versa, and the cost of the trial partition is calculated. If the cost is less then the bestmove_cost then the new cost is saved as the bestmove_cost and the index of current object is also saved as bestmove_obj. At the end of this loop all moveable objects have been moved once and the cost of best move i.e., bestmove_cost and the index of corresponding object i.e., bestmove_obj have been saved. Now this object is actually moved to form a new partition P, and the moved-flag of this object is set ‘true’ to indicate that the object has been moved during the current iteration. If the bestmove_cost is less then the bestpart_cost, then the current partition P is saved as bestcost_P and bestmove_cost is saved as the bestpart_cost. The bestpart_ cost is written to output file for plotting and analysis. The algorithm repeats the above process until all objects have been moved to the other side i.e., jZnn. If the resulting bestpart_cost is less then the prev_cost, then the best partition bestcost_P is saved as the new partition P otherwise the previous partition prev_P is saved as P. The algorithm iterates with P as the new partition and continues the above process while bestpart_cost is less than prev_cost. When the above condition fails the algorithm stops and the best partition best_P found so far and the best_cost are saved. The algorithm is repeated with randomly formed initial partitions, until yZnn condition is reached during which a global minimum is usually found. The results are saved in an output file for use in final implementation. The user can change the clock frequency of the processor at the start of the execution of program to 68 select a particular speed of software execution. Higher clock speed reduces the size of hardware and helps in assessing the effect of higher speed processor on overall system performance. The code is portable and runs equally well on PCs and Sun workstations. The algorithm was applied to two hypothetical cases with five and 50 objects, respectively, and was found to converge to the lowest cost partition in both cases. The complexity of the algorithm is O(n3). 5.5.2 Simulated Annealing (SA) Algorithm. The C“ code implemented for SA algorithm reads an input file with metric, constraint and the weight values. The number of metrics to be used in the cost function and the number of objects are constant for a particular project. The variables and functions used in the program are explained in appendix B. The flowchart of the program is shown in Figure B]. After reading the input file the program prompts the user to input the processor clock frequency (clk) in MHz. Based on the input data the program generates an initial partition P_in. This is done by first defining an all-software partition except for those objects which are already assigned to hardware. The program then generates a random partition P using Random _gen() function. The temperature is initialized by assigning temp = init. The initial value init and the lowest value freez are two important factors affecting the speed of convergence or number of iterations performed by the algorithm. The cost of initial partition is calculated using Cost_fct() and saved as best_cost for comparison with the cost of new partitions later in the program. A loop is started with a counter y=0, that continues while the temp is greater than the freez value. Two more counters are 69 initialized, i.e., n=1 and z=0. Another loop is started that continues while n is less than n_eq. During this loop the current partition P is assigned to tentative partition P_ten. An object is randomly moved in P_ten from hardware to software or vice versa using Random_moveO function. After each move the cost cost_ten of the tentative partition P_ten is calculated and the difference delta_cost of the tentative cost and original cost is determined. A random number in the range [0,1] is also generated for use in Accept() function. The values of current temperature temp and d_cost are passed to the AcceptO function, and the returned value from the function is compared with the random number earlier generated. If the value of Accept() function is greater than the random number, then the tentative partition is accepted, i.e., P=P_ten. However, if the partition is not accepted or the tentative cost cost_ten is greater then cost then the counter 11 is incremented by one meaning that there is no improvement in cost. The equilibrium state is reached if for a certain number of moves n_eq the tentative cost is found to be greater than the previous cost or the value of Accept() function is less then the random number. In either case the value of cost_ten is assigned to cost. The counter z is incremented to keep track of the iterations carried out in reaching the equilibrium. Once the counter 11 has incremented to the value n_eq , it means that equilibrium is reached and the loop is collapsed. The terminal value n_eq was determined after extensive experimentation and consultation of results of earlier applications of simulated annealing algorithm [38][39]. Upon reaching equilibrium, the temperature is lowered to a new value by multiplying the old value with the constant alpha, selection of which also affects the convergence of algorithm. The value of alpha depends upon the number of objects comprising the system being partitioned. If the cost of the new partition is less than the best_cost, saved 70 earlier then the new value is saved as the best_cost and the partition is saved as best_P. The counter y is then incremented and the algorithm repeats with the new value of temperature and the process continues until freezing point is reached, i. e., temp becomes less then or equal to freez. At that point the algorithm should have converged to the lowest cost partition. The program generates two output files, one containing a list of cost saved at each equilibrium state, for plotting and analysis. The second file contains the results of partitioning with allocation of objects to hardware and software mentioned against each. The program is capable of analyzing the effect of different clock speeds on the execution speed of objects mapped to software. 5.6 Case Studies 5.6.1 Hypothetical Cases Both algorithms were tested using two sets of hypothetical data and were found to converge to a minimum cost partition. The results for various selections of cost estimates and constraints in case of GM algorithm are shown in Appendix C. The results of experimentation with various values of initial temperature init, final temperature freez, terminal value n_eq of counter 11, and multiplier alpha for SA algorithm, are shown in appendix D. It was observed that GM algorithm always finds the lowest cost partition, but requires a large number of iterations. The SA algorithm is faster and takes fewer iterations but may not always find the lowest cost partition in the first attempt. 71 5.6.1.1 Small Size Example A system consisting of a set of five objects was taken as an example to analyze the behavior of the algorithms. The hardware and software estimates and constraints for these objects were selected in such a way to force various selections by the algorithms. Appendices C and D lists this information. Both programs found the same minimum cost partition for the system with the same values of estimates assigned to modules. The resultant partitioning choices are also shown in appendices C and D, along with the plots of cost vs. iterations are given in Figures Cl and C2 for group migration and Figures D1 to D6 for SA algorithms respectively. It is quite evident that both algorithms did escape local minimums and SA algorithm did converge to best partition within a limited number of iterations. 5.6.1.2 Large Size Example A set of 50 objects was taken as a large example for further analysis of the software. Different values of the hardware and software estimates and constraints were selected to force various selections by the algorithms. Appendix C and D list this information. Both programs were found to successfully find the minimum cost partition of the system. The partitioning choices corresponding to different values of weights are shown in Tables C3 to C5, and Tables D7 and D8 for GM and SA algorithms respectively. The plots of cost vs. iterations are also shown in Figures C2 and C3 for GM algorithm and Figures D7 and D8 for SA algorithm respectively. 72 5.6.2 Real World Example: Dashboard Controller A real world example has been included in the POLIS distribution for experimentation by the new users. The dashboard controller consists of five subsystems, namely, tachometer, speedometer and odometer, fuel gauge, temperature gauge and seat- belt alarm system. The dashboard controller interacts with various parts of the automobile and controls display of information on various dials and indicators. For the purpose of design implementation, the functionality is divided amongst three sub-systems each consisting of certain number of modules. This example also makes use of some pre—built modules representing processor peripherals and on-board hardware support such as counter/timer. A list of modules in each subsystem, identified as net and the complete dac_demo.aux file are given at appendix E. The modules’ interaction is defined in terms of various galaxies in PT OLEMY for simulation. The complete system is simulated in PTOLEMY by generating various inputs at different rates to simulate the varying engine rpm. and vehicle speed. The output of the controller is also required to be at a certain rate to update various indicators. Based on the results of co-simulation the system is partitioned between hardware and software. 5.6.2.1 Design Constraints The input to odometer and speedometer is wheel-pulses. The wheel-pulses are received from the wheel at a rate of 4 pulses per revolution. Assuming a standard tire circumference of 0.66 m, the maximum vehicle speed of 260 [ #include #include #include #include #include #define infi 999. #define np 4 #define nn 26 float kk[np], area, size, hdelay, sdelay, proc_clk=1.0; struct Object { public: char name[20]; float hw_area, sw_size, hw_delay, sw_delay; bool moved, locked;; char pp; }; Object objt[nn]; float Cost_fct(char [1): float Func(float, float); void Move(int, char [1); void Gen_part(char [1); void Pr_part(char 11); void Part_file(char [], fstream); void Final_part(char Pt[], fstream): void main() { int ii, 11, bestmove_obj, y, 2; char P[nn], bestcost_P[nn], PP[nn], prev_P[nn], best_P[nn]. P_in[nn]; char inputfile[15]; float clk, cost, prev_cost, bestpart_cost, bestmove_cost, best_costzinfi; cout << "Enter the name of input file to use: "; cin >> inputfile; fstream infilefinputfile, ioszzin); fstream outfile( "outputdat", ios::out): fstream outfile 1 ("outl .dat", ioszzout); if(outfile.fail()|l outfilel.fail()) { cout << "Could not open output file" <> objt[counter].name >> objt[counter].hw_area >> objt[counter].sw_size >> objt[counter].hw_delay >> objt[counter].sw_delay >> objt[counter].pp; if ((objt[counter].pp==’H’)ll(objt[counter].pp==’S’)) { objt[counter].locked=true; } else if (objt[counter].pp==’X’) { objt[counter].locked=true; } else {objt[counter].locked=false; } } infile >> area >> size >> hdelay >> sdelay; for (counter=0; counter> kk[counter]; } }else cout << "Could not open input file" <> clk; if(clk>0) { proc_clk=clk; } Seed the random number srand(time(0)); outfilel << "Constraints " << area << " " << size << " " << hdelay << " " << sdelay << endl; outfile] << "Weights "; for (ii=0; ii> x; break; 94 } } sd=sdlproc_clk; est_cost = kk[0]*Func(sz, size) + kk[l]*Func(sd, sdelay) + kk[2]*Func(ha, area) + kk[3]*Func(hd, hdelay); return est_cost; } float Func(float val, float constr) // Estimate the normalized cost of a parameter for an Object { float temp; temp=(val-constr)/constr; if (temp <= 0) {temp=0;} return temp; } void Move(int ob, char Pt[]) { switch (Pt[ob]) { case ’S’: { Pt[ob]=’H’; objt[ob].pp=H’; cout << "Object moved to HW" << endl; break; } case ’H’: { Pt[ob]=’S’; objt[ob].pp=’S’; cout << "Object moved to SW" << endl; break; } case ’X’: { cout << "Object cannot be moved" << endl; break; } cout << "Un-identified partition code " << Pt[ob] << endl; break; } void Pr_part(char Pt[]) { for (int pj=0; pj #include #include #include #include #include #define init_temp 1.0 #define tip 4 #define nn 26 #define alpha 0.99 #define freez 0.001 #define n_eq 3 float kk[np], temp, area, size, hdelay, sdelay, proc_clk=1.0; struct Object { pubhc: char name[20]; float hw_area, sw_size, hw_delay, sw_delay; bool locked; char pp; }; Object objt[nn]; float Cost_fct(char []); float Func(float, float); void Random_move(char []); void Gen_part(char Pt[]); void Pr_part(char []); void Part_file(char [], fstream); float Accept(float, float); void Final_part(char Pt[], fstream); void main() { int ii, 2, y, 11; char P[nn], P_ten[nn], best_P[nn], P_in[nn]; 103 float clk, cost, cost_ten, best_cost, delta_cost, random; fstream infile("input00.dat", ios::in); // input file fstream outfile("output.dat", ios::out); // output file fstream outfile1("outl .dat", ios::out); // output file for graphs if(outfile.fail()|| outfile1.fail()) { cout << "Could not open output file" <> objt[counter].name >> objt[counter].hw_area >> objt[counter].sw_size >> objt[counter].hw_delay >> objt[counter].sw_delay >> objt[counter].pp; if ((objt[counter].pp==’I-I’)|l(objt[counter].pp==’S’)) { objt[counter].locked=true; 1 else if (objt[counter].pp==’X’) { objt[counter].locked=true; } else {objt[counter] .locked=false; } } infile >> area >> size >> hdelay >> sdelay; for (counter=0; counter> kk[counter]; } }else cout << "Could not open input file" <> clk; if(clk>0) { proc_clk=clk; } // Seed the random number srand(time(0)); // write input data to output file for graphs outfile] << "Modules = " << nn << ", Constraints " << area << ", " << size << ", " << hdelay << ", " << sdelay << ", Weights "; for (ii=0; ii freez) // Loop while temperature is not frozen { n=1; z=0; while(n < n_eq) // loop while equilibrium is not reached { for (ii=0; ii random) { for (ii=0; ii cost) { n++; } cost=cost_ten; 1 else { n++; } z++; } temp=alpha*temp; outfile] << cost << endl; // Output cost after reaching equilibrium if(cost < best_cost) { } y++; } best_cost=cost; for (ii=0; ii> x; break; } } sd=sdlproc_clk; est_cost = kk[0]*Func(sz, size) + kk[l]*Func(sd, sdelay) + kk[2]*Func(ha, area) + kk[3]*Func(hd, hdelay); return est_cost; } float Func(float val, float constr) // Estimate the normalized cost of a parameter for an Object { float t_val; t_val=(val-constr)/constr; if (t_val <= 0) {t__val=0;} 107 return t_val; } void Pr_part(char Pt[]) { for (int pj=0; pj