This is to certify that the dissertation entitled

A SYSTEM-LEVEL PLATFORM-BASED MULTI-CORE SYSTEM-ON-CHIP SIMULATION FRAMEWORK WITH RETARGETABLE PROCESSING ELEMENT MODELS

presented by JINWEN XI has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical and Computer Engineering.

Major Professor's Signature

Date

A SYSTEM-LEVEL PLATFORM-BASED MULTI-CORE SYSTEM-ON-CHIP SIMULATION FRAMEWORK WITH RETARGETABLE PROCESSING ELEMENT MODELS

By
Jinwen Xi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical and Computer Engineering

2007

ABSTRACT

A SYSTEM-LEVEL PLATFORM-BASED MULTI-CORE SYSTEM-ON-CHIP SIMULATION FRAMEWORK WITH RETARGETABLE PROCESSING ELEMENT MODELS

By Jinwen Xi

Today's semiconductor technology makes it possible to integrate billions of transistors and create a complete system on a single silicon die. The escalating design complexity makes the traditional register transfer level (RTL) design and modeling methodology inadequate under the pressure of time-to-market and limited resources. An electronic system level (ESL) design methodology is needed to handle such complex system-on-chip (SOC) designs. System-level modeling is the core of this new methodology.
This work explores system-level modeling of function, performance and power, and proposes a complete hierarchical SOC simulation framework supporting heterogeneous processing elements (PEs) integrated with a network-on-chip (NOC) communication infrastructure. To efficiently map an application onto multiple heterogeneous PE architectures, an object-oriented application/task model is proposed. PEs are characterized by macro-models for both execution time and energy consumption. Back-annotating these macro-models onto application tasks enables fast, cycle-approximate timing and energy simulation. Moreover, the resulting retargetable PE model methodology reduces the effort of remapping tasks among different core architectures by using a uniform macro-operation set. A task scheduling engine (TSE) and a task execution engine (TEE) support dynamic task scheduling, so task-level simulation can be carried out without architecture-dependent cross-compilers or synthesizers. The parameterizable NOC model features configurable routers and links. All internal operations in each component are modeled as transaction-level function calls. Analytical transaction-level dynamic and leakage energy models are developed for each NOC component, and energy estimation is triggered whenever a transaction occurs during simulation. The NOC model also supports temporal and spatial power profiling, which helps designers identify potential hot-spots during application simulations. By configuring four architectural parameters, NOCs with different on-chip routers are evaluated. A complete SOC simulation framework is developed with a case study of an M-JPEG application. Power/performance evaluations are performed to demonstrate the effectiveness of the proposed modeling methodology. Based on the proposed simulation model, a system-level SOC design space exploration framework is developed with a novel in-loop simulation feature.
The principle it employs is to minimize the on-chip communication volume while meeting the throughput constraints. Local architectural fine-tuning of the PE/NOC is also applied to achieve further power savings. Overall, this methodology can significantly change the way future systems-on-chip are designed. Its inherent scalability enables a tradeoff between modeling time and accuracy, and it has broad applications, from early design space exploration, through design refinements and iterations, to design characterization and reuse.

ACKNOWLEDGEMENTS

First of all, I'd like to thank my advisor, Professor Peixin Zhong, for his invaluable guidance throughout my four years of research and study in East Lansing, MI. His extraordinary vision, generosity, and patience have made this thesis possible. He provided many significant suggestions and comments during my research to help me refine the ideas and improve the models, and the completion of this thesis would have been impossible without his direction. I am also impressed by his keen insight into technology trends: almost nobody was using SystemC when we started developing our SystemC model in 2004, and it has since become an IEEE standard. I look forward to applying in my future career the knowledge I have gained from working with him during the past four years. I am also grateful to Professor Fathi Salem, Professor Gregory Wierzba, and Professor Anthony Wojcik for taking the time to read my thesis and providing helpful feedback on how to improve its quality. Serving as my Ph.D. committee members, they provided significant suggestions during my research work, and I really appreciate their time and effort in reviewing my work, discussing my research and providing suggestions. I also thank them for their excellent teaching in the circuit and computer architecture courses that I took during my Ph.D.
career, from which I gained many theoretical and practical skills that will definitely apply to my future work. I also thank Professor Donnie Reinhard for his generous help with my Ph.D. plan modification. Additionally, I thank the faculty and staff of the Electrical and Computer Engineering Department for the help and advice I have received during my stay at Michigan State. I also want to thank my colleagues and friends, to whom I am indebted; their support has enabled me to enjoy my research over the years: the members of the AMSAIC lab at Michigan State (Chao Yang, Jichun Zhang, Junwei Zhou, Yue Huang and Professor Andrew Mason). Finally, and most important, I'd like to thank my beloved wife, Xiaohan Chen. Her encouragement made me more confident in my research, especially during the most difficult idea-forming phase, and her love made my life in East Lansing more beautiful. I also want to thank my parents, Baojing Xi and Fengqi Xu, who always loved and supported me while I pursued my Ph.D., and my brother Bowen Xi for his love and friendship. I dedicate this thesis to my wife, parents and brother.

TABLE OF CONTENTS

LIST OF FIGURES .... viii
LIST OF TABLES .... xii
1. Introduction .... 1
1.1. Platform-based System-on-Chip Design Methodology .... 2
1.2. SOC Architecture Overview .... 3
1.2.1. Shared Bus OCA .... 4
1.2.2. Switched Bus OCA .... 4
1.2.3. Network-on-Chip OCA .... 5
1.3.
Requirements of SOC Modeling for Efficient Design Space Exploration .... 6
1.4. Related Work on System-on-Chip Modeling .... 9
1.4.1. System-on-Chip Simulation Frameworks .... 11
1.4.2. SOC Modeling Languages and Environments .... 12
1.4.2.1. SystemC: System-level Modeling Language .... 12
1.4.2.2. Ptolemy II Modeling Environment .... 13
1.4.2.3. Metropolis System-level Design Framework .... 13
1.5. Key Objectives and Contributions .... 15
1.6. Thesis Organization .... 16
2. Cycle-approximate Processing Element Model .... 18
2.1. Multi-granular Time Model .... 19
2.1.1. Cycle-approximate Time Annotation .... 20
2.1.2. Implementation of Multi-granular Time Model in SystemC .... 24
2.2. Object-oriented Uniform PE Simulation Framework .... 26
2.2.1. Macro-operations and Their Characterization .... 27
2.2.2. Simulate Tasks by Running Macro-operations .... 30
2.2.3. Object-oriented Application and Task Graph Model .... 32
2.2.4. Object-oriented PE Simulation Model .... 35
2.2.4.1. Sequential Microprocessor Model .... 39
2.2.4.2. Parallel Hardware Accelerator (ASIC) Model .... 41
2.2.4.3. Case Study of an Application Mapped onto Microprocessor Model .... 43
2.2.5. Retargetable Simulation Framework for Heterogeneous PE Cores .... 44
2.2.6.
Selective Code Simulation .... 47
2.2.7. System-level Modeling of Dynamic Voltage and Frequency Scaling .... 48
2.3. Experimental Results on Media Applications .... 51
3. Architecturally Parameterized NOC Simulation Model .... 53
3.1. On-chip Interconnection: Packet-switching Network-on-Chip (NOC) .... 53
3.1.1. Network-on-Chip Fundamentals .... 54
3.1.2. Object-oriented NOC Simulation Framework .... 55
3.2. System-level Model of On-chip Router .... 58
3.2.1. Input/Output Buffer Model and its Architectural Parameter .... 62
3.2.2. Output Arbiter Model and its Architectural Parameter .... 65
3.2.3. Router Crossbar Model and its Architectural Parameter .... 68
3.3. System-level Model of Interconnecting Links .... 75
3.4. Latency Model of NOC Routers and Links .... 77
3.4.1. On-chip Router's Latency Model .... 78
3.4.2. Link Latency Model .... 81
3.5. Simulate NOC Model with Traffic Generator/Traffic Receiver .... 83
3.6. NOC Performance Model Experiment Results .... 85
3.6.1. APL Performance under Different NOC Topologies .... 86
3.6.2. APL Performance with Different Packet Sizes on Mesh NOC .... 88
3.6.3. APL Performance with Different Router Architectures on Mesh NOC .... 89
3.7. Integrate PE and NOC into the SOC Simulation Framework .... 94
3.7.1. Network Interface for Bridging NOC and PEs ....
94
3.7.2. Buffered Network Interface Model .... 95
3.7.3. Put Everything Together: SOC Simulation Framework xSoCSim .... 98
3.8. M-JPEG NOC-based SOC Simulation Model Case Study .... 100
4. System-on-Chip Power Modeling Framework .... 106
4.1. Related Work on Power Modeling .... 107
4.2. PE Power Macro-modeling Methodology .... 109
4.2.1. Programmable Processor Power Macro-model .... 110
4.2.2. Experimental Results of Processor Macro Power Model .... 114
4.2.3. Application Specific Hardware Accelerator Power Macro Model .... 116
4.2.4. Experimental Results of DFHA Macro Power Model .... 120
4.3. NOC Architectural Power Model .... 123
4.3.1. Power Model of Router's Input/Output Buffer .... 124
4.3.2. Power Model of Router's Crossbar .... 128
4.3.3. Power Model of Output Arbiter .... 131
4.3.4. Power Model of Interconnecting Wires .... 134
4.4. Integrate Power Models into SOC Simulation Framework .... 137
4.5. NOC Power Simulation Experimental Results .... 140
4.5.1. NOC Router's Dynamic and Leakage Power Simulation .... 140
4.5.2. NOC Router's Dynamic Power vs. Data Payload Switching Properties .... 141
4.5.3. Router Architecture and its Dynamic Power .... 143
4.5.4. NOC Dynamic Power's Temporal Profiling .... 147
4.5.5. NOC Link Power Evaluation .... 148
4.6.
Full SOC System-level Power Simulation on Real Applications .... 152
5. Power-aware SOC Design Space Exploration Framework .... 159
5.1. Mathematical Formulation of SOC Minimum Power Problem .... 160
5.2. Related Work on DSE Framework for Multi-core SOCs .... 163
5.3. Global Communication Minimization Energy-aware Exploration Framework .... 166
5.4. Local PE/NoC Refinement .... 174
6. Conclusions .... 176
Bibliography .... 181

LIST OF FIGURES

Figure 1.1. Bus on-chip interconnections in SOC. (a) Shared-bus SOC architecture. (b) Switched-bus architecture .... 5
Figure 1.2. SOC with 16 PEs and 4x4 mesh NOC interconnection .... 7
Figure 2.1. Cycle-approximate time annotation. (a) DCT functional specification. (b) Executable model with time annotation .... 23
Figure 2.2. Time event distributions during simulation with different time models .... 24
Figure 2.3. The same macro-operation on different architectures. (a) OP_MUL_2 with MUL instruction. (b) OP_MUL_2 without MUL instruction .... 28
Figure 2.4. Time annotation styles to model PE timing. (a) Discrete time annotation. (b) Grouped time annotation .... 31
Figure 2.5. Task and application model represented by C++ class .... 33
Figure 2.6. Application abstraction model. (a) Task graph with 6 tasks. (b) Application task's dependency model .... 35
Figure 2.7.
Class diagram of PE model framework .... 36
Figure 2.8. Architecture of TSE/TEE PE model (with application task models) .... 38
Figure 2.9. Microprocessor's TSE-TEE task model .... 40
Figure 2.10. Concurrent ASIC model. (a) Task graph with concurrency. (b) ASIC's TSE-TEE model .... 42
Figure 2.11. (a) Ad-hoc task remapping. (b) Retargetable task remapping .... 45
Figure 3.1. Topological illustration of 4-by-4 NOC-based SOC .... 54
Figure 3.2. Modeling of on-chip communication architecture. (a) Class diagram of OCA model. (b) OCA instantiation .... 57
Figure 3.3. The data transferred on NOC. (a) NOC packet structure. (b) Class diagram of packet .... 59
Figure 3.4. Generic NOC router architecture .... 60
Figure 3.5. NOC on-chip router FIFO architecture .... 62
Figure 3.6. Different ways of modeling FIFO. (a) SystemC sc_fifo-based FIFO model. (b) C++ deque-based FIFO model .... 64
Figure 3.7. Arbitration algorithms. (a) FPUS algorithm in NOC router. (b) FSUS algorithm in NOC router .... 67
Figure 3.8. Generic router crossbar architecture .... 69
Figure 3.9. XY-routing on 4x4 mesh .... 70
Figure 3.10. Crossbar control of XY-deterministic router. (a) RCU for XY-routing algorithm. (b) Crossbar control logic with 5 RCUs .... 71
Figure 3.11. WF-routing on 4x4 mesh ....
72
Figure 3.12. Crossbar control of WF-adaptive router. (a) RCU for WF-routing algorithm. (b) Crossbar control logic with 5 RCUs .... 74
Figure 3.13. NL-routing on 4x4 mesh .... 75
Figure 3.14. Reuse of sc_fifo-based channel for both input buffer and link models .... 76
Figure 3.15. Input buffer's latency model .... 79
Figure 3.16. NOC interconnecting wire model. (a) Interconnecting wire's delay model. (b) 1st-order wire RC delay model .... 82
Figure 3.17. TG/TR connected with NOC .... 84
Figure 3.18. NOC average package latency under different topologies. (a) Injection rate = 0.025 packet/us. (b) Injection rate = 0.25 packet/us .... 87
Figure 3.19. (a) SRAM area vs. size. (b) SRAM power vs. size .... 88
Figure 3.20. APL performance with different packet sizes on mesh NOC. (a) Uniform traffic. (b) Transpose traffic .... 89
Figure 3.21. NOC APL with different router architectures .... 91
Figure 3.22. (a) SOC with 2x2 mesh NOC and NIs. (b) NI architecture .... 96
Figure 3.23. State machine in NI. (a) State machine for data packetization. (b) State machine for data de-packetization .... 97
Figure 3.24. SOC simulation framework by integrating PEs, NIs and NOC .... 100
Figure 3.25. Task graph of M-JPEG application .... 101
Figure 3.26. SOC simulation model mapped with M-JPEG application .... 104
Figure 3.27. (a) PE cycle count on M-JPEG application. (b) Communication cycle count on M-JPEG application ....
105
Figure 4.1. Abstraction levels of microprocessor models .... 110
Figure 4.2. (a) Characterized 2-operand ALU operation. (b) Characterized 2-operand multiplier operation .... 113
Figure 4.3. Parallel FIR architecture with 3 pipeline stages providing 1 sample/cycle throughput .... 117
Figure 4.4. Characterized ADD, REG and MPY macro-operations' power versus data widths .... 119
Figure 4.5. Power macro-model for hardware. (a) Dynamic power estimation results. (b) Simulation speed comparison .... 122
Figure 4.6. NOC hierarchical power model .... 124
Figure 4.7. Architecture of dual-port SRAM .... 125
Figure 4.8. Power model of matrix crossbar. (a) Matrix crossbar capacitance model. (b) Active state of matrix crossbar .... 130
Figure 4.9. Power model of output arbiter. (a) Structure of the matrix arbiter. (b) Default transistor sizes .... 132
Figure 4.10. Interconnecting wires considering (i, i+1), (i, i-1) coupling effects .... 135
Figure 4.11. Integrating power models with SOC simulation framework .... 137
Figure 4.12. Power profile of ADPCM application on PowerPC. (a) @ Tens = 5us. (b) @ Tens = 20us .... 139
Figure 4.13. (a) Router's leakage power. (b) Router's dynamic power and leakage .... 141
Figure 4.14. Router's dynamic power on different input patterns .... 143
Figure 4.15. Power of different router architectures on uniform and transpose traffics. (a) Random traffic.
(b) Transpose traffic .... 145
Figure 4.16. Dynamic router power with different input/output buffer configurations. (a) Random traffic. (b) Transpose traffic .... 146
Figure 4.17. 4x4 mesh NOC under burst traffic from nodes #0, #2 and #5 .... 147
Figure 4.18. NOC temporal power profile under the burst traffic .... 148
Figure 4.19. NOC power breakdown at different technology nodes .... 149
Figure 4.20. Temporal power profile of different links. (a) Link #16 (router #1 -> #5)'s profile. (b) Link #17 (router #5 -> #1)'s profile .... 150
Figure 4.21. 4x4 NOC power spatial profile under burst at different time stamps. (a) 5us. (b) 8us. (c) 11us. (d) 80us .... 151
Figure 4.22. (a) Leakage power profile of router #0. (b) Dynamic power profile of router #0 .... 152
Figure 4.23. M-JPEG application mapped on 2x2 NOC with (a) non-overlapped traffic flow. (b) overlapped traffic flow .... 153
Figure 5.1. Tile-based SOC architecture and task mapping .... 167
Figure 5.2. SOC DSE framework based on GCM .... 168
Figure 5.3. (a) Algorithm to find the cut with minimum ICL from CATG. (b) CATG with cut set .... 172

LIST OF TABLES

Table 2.1. Task mapping scheme .... 44
Table 2.2. Simulated task finish time .... 44
Table 2.3. Simulation time for Mediabench applications .... 52
Table 3.1.
Characterized read latency (ns) for FIFOs in IBM 0.18um process .... 79
Table 3.2. Evaluated router architectures .... 90
Table 4.1. Energy simulation results for JPEG on PISA architecture .... 115
Table 4.2. Simulation times .... 115
Table 4.3. Regression coefficients of ADD operations .... 121
Table 4.4. Regression coefficients of REG operations .... 121
Table 4.5. Regression coefficients of MPY operations .... 121
Table 4.6. Wire parameters for different technologies .... 149
Table 4.7. M-JPEG power simulation results on architecture 4.23(a) with input video_1 .... 154
Table 4.8. M-JPEG power simulation results on architecture 4.23(a) with input video_2 .... 154
Table 4.9. M-JPEG power simulation results on architecture 4.23(a) with input video_3 .... 155
Table 4.10. M-JPEG power simulation results on architecture 4.23(a) with input video_4 .... 155
Table 4.11. M-JPEG power simulation results on architecture 4.23(a) with input video_5 .... 155
Table 4.12. M-JPEG power simulation results on architecture 4.23(a) with input video_6 .... 156
Table 4.13.
M-JPEG power simulation results on architecture 4.23(b) with input video_2 .... 157
Table 4.14. M-JPEG power simulation results on architecture 4.23(b) with input video_5 .... 157

1. INTRODUCTION

Today's semiconductor technology makes it feasible to create a whole system on a single silicon die by integrating millions or even billions of transistors, as the transistor feature size continues to scale down. A system-on-chip (SOC) integrates system-level functions on a single silicon chip. An SOC can be the integration of a complete electronic system, including all its peripherals and its interfaces to the outside world, on a single die [1]. It can also be a high-level integration of digital cores that leaves the analog and mixed-signal functions to other devices. Recently, designers have also started integrating RF functionality into SoCs to build single-chip cell phone platform solutions, such as Texas Instruments' LoCosto [116] and eCosto [117] platforms, which dramatically reduce the BOM (Bill of Materials). The SOC is a strong candidate to meet increasing system complexity and ever-tightening time-to-market demands. SoCs belong to the broader class of embedded systems, in which hardware and software are closely coupled and the system targets one application or a class of applications, in contrast to general-purpose computing systems. This thesis concentrates mainly on digital SoCs or the digital portion of mixed-signal SoCs, but the framework is also extendable to support analog/mixed-signal functionality once the appropriate analog models are plugged in.
Generally speaking, modern high-performance SOC designs consist of multiple integrated parts, including the Processing Elements (PEs), the On-chip Communication Architecture (OCA), i.e., a shared bus or a packet-switching Network-on-Chip (NOC), and embedded memory blocks providing on-chip data/program storage. The PEs process the computation tasks while the OCA provides the necessary communication services between them [2]. It is critical to optimize both the computation architecture and the OCA so that the system can meet the ever-increasing demands of computation- and communication-intensive applications. These SoCs often require comprehensive functionality, high performance, low power consumption, low production cost and short time-to-market. The continuously increasing VLSI design complexity has stimulated the advancement of Electronic Design Automation (EDA) tools and methodologies, such as the well-established logic-synthesis-based ASIC design flow. However, the continuously shrinking feature size in the deep sub-micron (DSM) era aggravates negative effects such as gate and sub-threshold leakage [3], crosstalk [4] and soft errors [5], which have resided in the custom ASIC design area for decades. Moreover, as the ITRS [6] predicts, the productivity of current EDA tools is becoming increasingly inadequate for tackling the complexity of today's ICs; one reason is that EDA tool productivity is not growing as fast as the exponential growth in semiconductor gate count driven by Moore's Law. Alarmingly, the design productivity gap has widened in recent years as design enters the multi-core SOC epoch. Copying the conventional ASIC design methodology over to SOC is not adequate to solve the problem. To help bridge this design productivity gap, the semiconductor industry needs a new generation of EDA methodologies and tools that can significantly increase design productivity and reduce the Non-Recurring Engineering (NRE) cost.

1.1.
Platform-based System-on-Chip Design Methodology

In response to these challenges, a new design methodology named "platform-based design" [7] was proposed, promoting reuse at higher levels of granularity and relying heavily on flexibility and programmability to amortize the NRE costs of complex multi-core SOC design. A platform-based design methodology is essentially a meet-in-the-middle approach, which leverages the power of top-down methods and the efficiency of bottom-up styles. A design process is viewed as a stepwise refinement of a specification into a lower level of abstraction chosen from a restricted library of available components. Components are computational blocks or interconnecting blocks and are often referred to as Intellectual Property (IP) cores in the literature. This library is a platform: in essence a family of designs, not a single design. A platform defines the design space that can be explored; the design space is the set of combinations of components, with different configurations, drawn from the platform library. Once a particular collection of components of the platform is selected, we obtain a platform instance, i.e., a point in the design space. The choice of the platform instance and the mapping of the components of the specification onto the components of the platform instance represent the top-down process, where the constraints accompanying the specification are mapped into constraints on the components of the platform instance. When selecting a platform instance and mapping constraints, it is important to guide the selection with constraint parameters, e.g., delay, power, size and cost, that summarize the characteristics of the platform's components. Evaluating these parameters quickly, with appropriate accuracy, is one of the critical parts of platform-based design [8].

1.2.
SOC Architecture Overview

From the architectural point of view, an SOC consists of five main functional blocks: processor cores, hardware accelerator IPs, memory blocks, peripherals and the on-chip interconnection architecture (OCA). The processor core functions as the central controller, orchestrating the operations of the other functional blocks in the system. Modern SoCs often employ a RISC microprocessor as the processor core due to its simple architecture and high clock frequency. Sometimes there is more than one processor in the SOC, homogeneous or heterogeneous, to handle different control/computation tasks. Another important feature of SoCs is that dedicated hardware blocks are often included to accelerate the computation-intensive parts of the application and mitigate the load on the main processor core; these may be co-processors executing specially designed instructions, custom logic, or reconfigurable FPGAs. In the SOC context, both processors and hardware accelerators are referred to as Processing Elements (PEs), since their job is data processing. PEs access memory blocks providing code and data storage, and peripherals offer the system's interface to the external world. Non-volatile flash memory is often employed to store program code or persistent data, while SRAM is used as a data buffer for intermediate computation results due to its fast access speed. Dynamic RAM (DRAM) has not been extensively integrated in current SOC designs because of its specialized fabrication process. All these functional blocks are interconnected via the OCA to compose the complete system, and three main interconnection schemes differentiate SOC OCA architectures: shared bus, switched bus and network-on-chip (NOC).

1.2.1. Shared Bus OCA

The shared bus is the most common way to move on-chip data in current SOC designs.
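On a shared bus, access is serialized: an arbiter grants at most one requesting master per cycle. As a hypothetical illustration only (the thesis does not fix a specific policy at this point; round-robin is one common choice, and the class name and interface below are invented for this sketch), a cycle-level grant function might look like:

```cpp
#include <cassert>
#include <vector>

// Hypothetical cycle-level model of a shared-bus arbiter with a
// round-robin policy: the grant rotates so that the master granted
// last has the lowest priority on the next cycle.
class RoundRobinArbiter {
public:
    explicit RoundRobinArbiter(int num_masters)
        : n_(num_masters), last_(num_masters - 1) {}

    // req[i] is true when master PE i requests the bus this cycle.
    // Returns the granted master's index, or -1 if the bus is idle.
    int arbitrate(const std::vector<bool>& req) {
        for (int i = 1; i <= n_; ++i) {
            int m = (last_ + i) % n_;  // scan starting after the last grant
            if (req[m]) { last_ = m; return m; }
        }
        return -1;
    }

private:
    int n_;     // number of master PEs on the bus
    int last_;  // index of the most recently granted master
};
```

With two masters requesting continuously, the grant alternates between them each cycle; this fairness property is what distinguishes round-robin from a fixed-priority policy, in which one master could starve the others.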
In this scheme, many components can drive a single interconnect net that sends the signals to all devices on the net. The target component replies to the master's request based on the address. Figure 1.1(a) shows a classical shared-bus SOC architecture connecting 8 PEs. A bus arbiter decides which PE is granted access to the bus according to the arbitration policy when multiple masters raise their requests simultaneously. During execution only one master PE can initiate a bus transaction, as orchestrated by the bus arbiter. The shared bus architecture is very popular in previous and current SOC designs, and several semiconductor vendors provide industry-standard buses such as ARM's AMBA [51] and IBM's CoreConnect [52]. But a shared bus is not very scalable electrically, because its high fan-out load limits the speed. The physical wires also tend to be long, adding more delay and signal integrity issues. 1.2.2. Switched Bus OCA The switched bus architecture differs from the shared bus by allowing multiple bus transactions in the same clock cycle. There are multiple point-to-point interconnections controlled by the crossbar switch, which selects interconnects between any combination of the CPUs, the memories and the other PEs. The switched bus architecture is essentially a multi-bus whose connectivity is controlled by the crossbar switch. This interconnection architecture relieves some of the burden of the shared bus in a large SOC consisting of a large number of PEs. Currently MIPS Technology has developed its switched bus product SOC-it [97] to leverage the multi-layer capabilities of ARM's AMBA, allowing two simultaneous transfers in the same clock cycle. [Figure 1.1 Bus on-chip interconnections in SOC: (a) shared-bus architecture; (b) switched-bus architecture.] 1.2.3.
Network-on-Chip OCA Network-on-Chip (NOC) has evolved as a promising alternative for large SOC on-chip interconnections in recent years. The idea is that scalable switched networks are used for on-chip communication between PEs in order to cope with the design of continuously growing systems. Different from bus-based SoCs, cores in an NOC communicate by sending information packets across the on-chip network instead of driving signals across dedicated global circuit wires [67]. The on-chip interconnection is implemented by routers that connect with each other via interconnecting links. These routers are placed in a regular floorplan, and multi-bit-width links connect them to form different network topologies, such as mesh and torus. Each PE is connected to one router via a network interface, and together they form a “tile” in the SOC. Compared with the shared bus architecture, the shorter links in an NOC are more controllable than long wires when the system grows large. The main force behind NOC is the progress in semiconductor manufacturing technology and the fact that it is not matched by increased effectiveness of design automation tools. Design complexity promotes reuse of investment in earlier designs as well as purchase of outside intellectual property (IP), but in larger designs communication among components becomes a bottleneck with traditional techniques like common shared or switched buses. The most adverse pitfall of the shared-bus interconnection is that the global wires do not scale as the number of connected cores increases, since adding one core to the bus increases its capacitive load, which directly contributes to the delay and dynamic power. Moreover, the inverse scaling properties of global wires call for a more scalable and flexible on-chip interconnection architecture for future SoCs fabricated in sub-90nm processes.
NOC is one solution to this, where a regular packet-switched communication architecture can provide better scalability, flexibility, throughput and reusability. Currently NOC-based SOC design is still in its infancy, and many researchers and companies are putting much effort into it. Figure 1.2 shows an SOC with a 4x4 mesh NOC interconnection architecture. 1.3. Requirements of SOC Modeling for Efficient Design Space Exploration With the advent of multi-core architectures, each application can be partitioned into different tasks, each of which runs on a different PE to achieve the pre-defined functionality. Given an application and a set of available PE cores, the SOC designer has to decide the number and configuration of hardware resources to be used for the application to fulfill the functionality and meet the non-functional constraints such as power and area. This has increased the design space manifold. It is not possible to explore all possible candidate architectures with detailed simulation and synthesis. Fast simulation frameworks and efficient modeling methodologies are indispensable in tackling the Design Space Exploration (DSE) problem in the multi-core SOC era. Efficient implementation of a heterogeneous system requires performance and energy estimation at the system level in early design stages, so that the design space can be reduced in later refinement stages. In general, system-level SOC DSE requires a modeling framework to perform the following tasks: (1) system specification; (2) architecture selection and resource allocation; (3) task partitioning and resource mapping; (4) system-level integration; (5) fast hardware/software co-simulation; (6) efficient simulation data post-processing. [Figure 1.2 SOC containing 16 PEs with 4x4 mesh NOC interconnection] The system specification should be manifested as a functional model plus nonfunctional requirements. The functional model should be abstract and should not constrain implementation details. Analysis will provide insight into architectural decisions.
The specification is then transformed into an implementation in multiple steps. The target application is partitioned into small tasks, mapped onto different PEs, and the inter-PE communication model is inserted for data synchronization. The choices of target PE and task mapping grow exponentially with complex applications and large PE libraries. The high complexity of embedded systems requires a system-level design automation framework to perform these transformations quickly and efficiently, and it should allow exploration of possible architectures at a higher level without resorting to synthesis and detailed timing simulation. In addition to implementing the functionality defined by the specification, another important role a system-level SOC model should fulfill is estimating the non-functional constraints, of which energy/power and area are the most frequently used. Energy consumption directly relates to battery life in battery-powered handheld devices. In high-performance systems, the power consumption determines the need for heat dissipation, which can be costly in places such as central offices for telecommunication systems. The area has a major impact on the chip cost. Energy consumption is related to many factors in the SOC design, from the task mapping of the application to the configuration of each PE, and estimating the energy precisely and quickly is not a trivial task. It is widely believed that the higher the level at which energy estimation and optimization are performed, the better the energy saving that can be achieved [49]. A fast energy/power simulator is becoming an important component in modern SOC DSE frameworks targeting low-power applications. SOC energy consumption comprises the computation energy of the PEs and the communication energy of the OCA, and each of them should be estimated when simulating an application.
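A minimal sketch of this decomposition: total SOC energy is the computation energy of the PEs plus the communication energy of the OCA, accumulated per simulation event. The structure names and per-event numbers below are hypothetical, chosen only to illustrate the bookkeeping, not taken from the framework described here:

```cpp
#include <vector>

// Hypothetical energy log: one entry per simulated PE macro-operation
// and one per OCA transaction, in nanojoules.
struct EnergyLog {
    std::vector<double> pe_events;   // computation energy events (nJ)
    std::vector<double> oca_events;  // communication energy events (nJ)

    double computation() const {
        double e = 0; for (double v : pe_events) e += v; return e;
    }
    double communication() const {
        double e = 0; for (double v : oca_events) e += v; return e;
    }
    // Total SOC energy = computation + communication, as described above.
    double total() const { return computation() + communication(); }
};
```

During simulation each PE and OCA model would append to the appropriate list whenever it executes an operation or a transaction, so the split between computation and communication energy is available at any point in the run.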
Profiling the energy consumption of SOC components temporally or spatially also helps designers identify the potential energy-hungry parts at run-time. In general, a system-level model should provide the system properties without detailed low-level information. Depending on the properties of interest, different types of system models can be built. In this research, the system-level model is functionality-centric in the sense that it cares more about the system functionality than the detailed circuit implementation. For example, a PowerPC [11] microprocessor executing application code could be modeled simply by the same application code executed on the host machine with back-annotated timing information characterized from low-level PowerPC models or real chips. Given a functional specification, other system blocks could be modeled in the same way, and a similar method can be applied to communication fabrics. In this way, the system-level model accelerates the process of functionality verification, design space exploration and nonfunctional metrics estimation in early design stages, before the design goes down to the lower levels for logic- and circuit-level implementation. 1.4. Related Work on System-on-Chip Modeling In this section, we discuss the existing approaches for modeling multi-core SoCs or their IP cores on three main metrics, performance (throughput, clock rate), power/energy, and thermal properties (chip temperature), as well as functionality. These metrics are critical to SOC design, and much effort [12][13][14] has been devoted to evaluating them quickly and accurately. First of all, several prior works addressing system-level modeling methodologies [13][14] are essentially modifications of an available micro-architectural model, such as SimpleScalar [10], adding customized power, energy or thermal models, integrating the modified model with other blocks, and performing simulation.
While these models can generate accurate estimates, a natural consequence of employing architectural information such as the L1 and L2 caches, the BTB and instruction fetch in the microprocessor core, their simulation speed is often slow. They are suitable for detailed study of micro-architecture choices. Secondly, a system-level model is functionality-centric without too much low-level information, as mentioned earlier. Utilizing this property, macro-models have been proposed and researched in many works for programmable microprocessors [15][16], hardwired ASICs [17][18], and SoCs containing both of these two types of components [19][20] with memory blocks. Macro-models rely on the characterization of a comprehensive set of atomic operations by which all other functionalities can be implemented. The characterization process involves simulating a set of test benches composed of these atomic operations on the available low-level models to acquire the metric (power, performance) data, then performing mathematical linear/non-linear regression analysis and curve fitting, from which a closed-form expression for each atomic operation in terms of one of these metrics can be derived. Applying these derived models, system-level simulation can be performed by taking only the block's input/output properties into consideration, and the metrics can be estimated by evaluating the incoming atomic operations without the verbose architectural information. The most frequently used atomic operations are instructions (Instruction Set Architecture, or ISA) for microprocessors and input/output switching statistics for ASIC cores, respectively. However, the conventional macro-model methodology still has some challenges when applied to multi-core SoCs.
Using a virtual SOC containing only two microprocessor cores (for example, PowerPC and ARM) with their corresponding ISA macro-models as an example, the partitioned tasks (segments of application code, often written in C/C++) mapped onto each core must be compiled to instructions before the macro-model can be evaluated and simulated. This compilation process must be repeated whenever there is a change in the mapped tasks for each core, and the compilation time incurs non-negligible overhead when the system scale becomes large. We refer to this as the retargetability issue in our research; retargetability reflects the flexibility of performing task mapping among different core architectures, including both programmable microprocessors and hardwired ASICs. This issue resides in many proposed system-level SOC modeling frameworks [21][12][22], because changing the task mapping always results in code recompilation for microprocessors and re-synthesis for hardware cores, and solving this issue is still an open question. Since an SOC integrates multiple IP cores, each of which can be a complete functional block performing computational tasks or providing communication services, simulating the whole SOC at the circuit level becomes intolerably time-consuming and error-prone. In consequence, available SOC simulation models are built at RTL or higher levels of abstraction. Many research groups have proposed different methodologies and simulation frameworks to model and simulate multi-core SOCs at different levels of abstraction to facilitate functional verification and design space exploration. We briefly review their work and enumerate three major modeling languages and environments used in industry and academia: SystemC, Ptolemy II and Metropolis. We also review some recent research on the architecture and modeling of network-on-chip, which is chosen as the communication architecture in this research. 1.4.1.
System-on-Chip Simulation Frameworks [23] proposes a simulation framework utilizing VHDL to evaluate several features of virtual channels in mesh-based and hierarchical packet-switching NOC topologies in multi-core SoCs. The accuracy of the VHDL model is high due to its cycle-accurate simulation engine, but it suffers from low simulation speed as the number of cores increases. To enhance the simulation speed, several approaches have been proposed. [24] presents a hybrid VHDL/SystemC model using a template router to support multiple interconnection networks. [25][26] describe a modeling framework for custom NOC topologies based on SystemC. In [27], XML is employed to specify the NOC components and evaluate different mesh-based NOC architectures. These models all concentrate on the architectural detail of each SOC component and are cycle-accurate. As a result, the simulation speed is still limited, especially when one or more microprocessor cores are present in the target SOC. Moreover, [28][29] use data traffic from tracing the application on off-the-shelf simulators to simplify the PE models, since the main goal of [28][29] is to model and evaluate different interconnection schemes (OCA). C/C++ is broadly adopted as a high-level system modeling language as well as a programming language. For most microprocessor-based SOC architectures, the Instruction Set Simulator (ISS) has been extensively employed in modeling and simulating the target applications, in that an ISS is essentially a set of C/C++ code running on the host machine to mimic code execution on the target architecture. Most ISSes [10][30][31] are constructed at the micro-architecture level to provide cycle-accurate simulation engines, and they are much faster than HDL simulators when evaluating software execution on microprocessors. [32][33][34] propose Multi-Processor SOC (MPSOC) simulation frameworks using Inter-Process Communication (IPC) and a bus wrapper.
The ISS and the C/C++ hardware model run as distinct processes on the host system; the bus wrapper handles communication between the system simulation and the ISS and translates the information from the ISS into cycle-accurate bus transactions. To resolve the issue that different ISSes need different wrappers to hook up to the MPSOC simulation model, [35] develops an implementation of the IPC interface between the wrappers and the ISSes based on the open-source debugger GDB [36], so that any ISS that can communicate with GDB can become part of the co-simulation framework seamlessly. These simulation frameworks are homogeneous from the language point of view, since both hardware (HW) and software (SW) are described using C/C++, even though they show heterogeneity from the simulator point of view, because HW and SW can be executed using different simulators: the SystemC simulation kernel for hardware and the ISS for the software program, respectively. 1.4.2. SoC Modeling Languages and Environments This section introduces three representative modeling environments in chronological order of development. 1.4.2.1. SystemC: System-level Modeling Language As described before, many SOC simulation frameworks are based on SystemC. SystemC is an open-source architecture- and system-level design and modeling language based on C++ [37][38], and it allows designers to apply well-understood software techniques, such as object-oriented programming (OOP), to the problems of system modeling and verification. Although originally designed for system and hardware design, SystemC is also applied in verification, testbench design, software modeling, etc. Currently SystemC 2.0.1 and later versions are extensively used by many semiconductor vendors as well as EDA companies.
The SystemC language itself provides three basic computation units and one communication unit: (1) the function, which is similar to a regular C++ member function; (2) the process, which is the basic functional entity of SystemC, with three different process abstractions; (3) the module, which is the basic structural entity of SystemC; modules communicate with each other through ports, which can be either uni-directional or bi-directional; (4) the channel, which transports data between ports connecting different modules. SystemC also provides a rich set of data types targeting hardware modeling and simulation. Finally, the SystemC library includes a lightweight Discrete Event (DE) simulation kernel. 1.4.2.2. Ptolemy II Modeling Environment Ptolemy II is an object-oriented, heterogeneous design and modeling framework based on Java. It targets the modeling, simulation and design of concurrent heterogeneous systems, which are usually hard to model in one unified design framework. Its description language can easily describe complex hierarchical interconnections of parameterized executable components. Ptolemy II supports a number of Models of Computation (MOC) [39], including Communicating Sequential Processes (CSP) [40], DE and Synchronous/Reactive (SR) [41]. Each MOC is implemented in Ptolemy II as a domain, i.e. a software package that performs the required scheduling algorithms for the MOC. The software architecture of Ptolemy II is layered. Besides the domain packages, the core package contains the abstract syntax and semantics for the data model. The library packages provide actors that can execute in multiple domains. Also, a graphical user interface (GUI) package is provided to support its internal XML format, MoML. 1.4.2.3. Metropolis System-level Design Framework Metropolis [13] develops a formal methodology for SOC design and the corresponding system-level framework facilitating the exploration of multi-core SoCs.
In the Metropolis methodology, the SOC designer first describes (or selects from IP libraries) the blocks that perform computations and then designs the communication among them using a rigorous successive refinement process. To maximize reusability, the layers of the protocols should only encapsulate the original computation cores without any change to their internal structure. This framework is based on a formal representation of the system throughout all levels of abstraction, but is not biased towards any specific Model of Computation (MOC) or communication semantics. At the highest level of abstraction, the system specification is purely denotational, i.e. the system is described as a set of concurrent components, called processes, each defined as an input/output relation. In contrast to traditional design practice, where the specification of the processes also includes details of the communication interface and hence is not easily reusable, the communication among processes in Metropolis is separated and dealt with explicitly. In Metropolis, communication is modeled by channels with a set of properties, such as FIFO ordering, error rate and bandwidth, and a set of interfaces (e.g., read and write) that the processes connected to the channel can access. Once a physical channel is chosen, the designer must select a MOC that defines the firing rules that the processes must follow when they access the channel. The firing rules are part of a MOC wrapper that encapsulates each process. If the selected physical channel does not immediately meet the communication constraints on parameters like delay, throughput and reliability, it is necessary to introduce adapters (called channel adapters) between the sender, the receiver and the channel. Consider the problem of implementing a reliable error-free connection. If an unreliable channel with a non-negligible error rate is selected, it is necessary to introduce adapters, e.g.
encoding or retransmission functions, between the unreliable channel and the sender and receiver. 1.5. Key Objectives and Contributions We propose a hierarchical SoC modeling methodology and the associated framework that tackle the issues related to ever-increasing system complexity and retargetability. The goal is to satisfy the need for efficient design space exploration. The key objectives of this work are: (1) a unified modeling environment that can be used at different stages of design and can support different levels of abstraction; (2) a unified modeling framework that can model the functional specification, the system architecture and the functions mapped to the architecture; (3) a framework that models concurrency and communication; (4) a modeling framework that provides fast simulation for realistic application loads; (5) a performance modeling framework that supports untimed, approximate-timed and cycle-based timing models; (6) a modeling framework that supports nonfunctional properties such as power consumption and dynamic power management; (7) integration of processing element models and on-chip communication architecture models; (8) a methodology for rapid simulation model generation according to specs; (9) design space exploration tools with simulation in the loop. In order to achieve these objectives, a set of models and associated tool flows are developed in this research. Firstly, the multi-granularity time model is proposed and is utilized throughout the whole framework; it also supports performance and power models with variable granularity to trade off simulation speed against estimation accuracy. An approximate-timed, cycle-free transaction-level simulation model for processing elements (PEs), including programmable microprocessors and hardware accelerators (ASICs), based on the uniform task scheduling engine and task execution engine (TSE-TEE), is developed.
This methodology provides an open framework for modeling different architectures at the system level with the aid of the proposed macro-modeling methodology, covering both performance and power estimation. To reduce the simulation time of the macro-model in computation-intensive applications, as well as the back-annotated simulation code size, an innovative macro-model grouping technique will be presented. On the other hand, to address on-chip communication modeling, we propose a parameterized simulation model for Network-on-Chip (NOC) supporting multiple configurations. Combining the computation (PE) and communication (NOC) models through interfaces (wrappers), a fully integrated model of the PEs and the NOC OCA that enables efficient modeling and evaluation of multi-core SOCs is proposed. Due to the framework's open architecture, our model supports plugging in power and performance models at different levels of abstraction, by which the system-level performance (latency and throughput), power (both dynamic and leakage), and potentially thermal properties can be evaluated; this information in turn supports fast simulation-based design space exploration and temporal and spatial application profiling to find the optimal system configuration given a set of design constraints, as well as possible hot-spots at run-time. Finally, a power-aware design space exploration framework employing global communication minimization and local PE/NOC refinement is developed. This DSE framework benefits heavily from the fast task-level SOC simulation model, which provides in-loop simulation to estimate the performance and power under a real application load. 1.6. Thesis Organization The remainder of the thesis is organized as follows. Chapter 2 presents the multi-granularity time model and the application task model. It also covers the proposed empirical macro-modeling methodology for PEs' high-level performance estimation.
It then presents the approximate-timed transaction-level PE simulation model based on the task scheduling engine (TSE) and task execution engine (TEE). Based on the proposed TSE/TEE model, a retargetable PE simulation framework is developed that eliminates architecture-dependent cross-compilation and synthesis when performing task remapping. Chapter 3 details the parameterized packet-switching NOC on-chip communication infrastructure, focusing on the modeling of the two main NOC components: the on-chip router and the link. On top of the functionality model, which supports exploring different NOC architectures, the NOC performance model is presented based on an analysis of the router's and link's architectural features. This chapter also presents the integration of the PE model of chapter 2 with the NOC model to form a complete task-level multi-core SOC simulation model named xSoCSim, which has been successfully applied to the M-JPEG media application. Chapter 4 presents the energy models for both the PE and the NOC, supporting run-time dynamic and leakage power estimation. The PE energy model is based on a statistical macro-model similar to what is presented in chapter 2, while the NOC energy model is based on the architectural information of the target NOC router and link, as well as the run-time data payload pattern. Based on the functionality, performance and power models presented in chapters 2, 3 and 4, chapter 5 presents the system-level simulation-in-loop design space exploration methodology and the associated global communication minimization search algorithm built upon the xSoCSim model. Chapter 6 concludes this thesis and presents future work that can be done based on this research. 2. CYCLE-APPROXIMATE PROCESSING ELEMENT MODEL As discussed in chapter 1, simulation is one of the most frequently used methods to explore the SOC design space due to its precision and high fidelity.
By running applications on the simulator, the system functionality can be verified, and non-functional design metrics such as throughput and energy can be estimated to help designers refine the design to meet the specification. Different types of simulators [25][30] have been developed to simulate hardware and software at different levels of abstraction. For hardware driven by synchronous clocks, cycle-accurate simulators are the well-established environment. In a cycle-accurate (or cycle-true in some literature) hardware simulator, signal transitions on memory components are evaluated at the rising or falling edge of synchronizing clocks for deterministic timing. For software running on microprocessors, the cycle-accurate instruction-set simulator (ISS) is adopted extensively in both industry and academia to evaluate processors. In an ISS, the operations of each instruction on the processor's functional units are traced and recorded to mimic its execution, and a virtual global clock is employed to synchronize the processor's functional units. The heterogeneity of modern SoCs, with IP cores ranging from hardwired ASICs to fully programmable processors, makes developing an SOC simulator non-trivial. Ad-hoc integration of each component's cycle-accurate simulation model is not efficient, since each simulator has different interfaces and semantics. Simulation time is another major concern when the system scale is large. More importantly, simply integrating different simulators may not tackle the simulation of on-chip communication, which needs to be dealt with more efficiently, since communication will become a major overhead in future SoCs containing hundreds of cores. Many companies and universities are currently working on efficient simulators for multi-core SoCs, and there are some products available in the market [111] as well as published research tools [34][35].
Our work targets an efficient and fast simulation platform for multi-core SoCs with NOC on-chip communication architecture. We propose an innovative cycle-approximate retargetable SOC simulation framework in this and the next chapters. The novelty lies in the following aspects. First, the multi-granular time model leverages the different timing properties residing in heterogeneous multi-core SoCs at the system level. Second, a uniform PE simulation framework supporting both the concurrent ASIC model and the sequential microprocessor model is presented; it executes at the task level with back-annotated timing and energy macro-models. Finally, the parameterized packet-switching NOC model is developed and integrated with the PE models via glue interfaces to construct a complete SOC simulation framework. The NOC model is built by integrating multiple functionally orthogonal modules, each of which can be reconfigured to support different communication features, such as routing algorithm, arbitration and flow control, to facilitate system-level NOC reconfiguration for design space exploration. The open architecture interface provides room for future extension to other NOC architectures, such as asymmetrical network topologies and virtual-channel routing algorithms. Adhering to the broadly adopted IEEE SystemC standard, we choose SystemC 2.1 as the main modeling language for this work. 2.1. Multi-granular Time Model Time is a crucial design parameter in almost all electronic systems, and it is of the utmost concern for many system models to evaluate their sequential or concurrent executions. Depending on the level of abstraction, there are three basic time models applied in digital VLSI systems, as described in [43]: (1) The physical time model uses physical time units and is based on physical principles.
It is used extensively in circuit-level timing analysis and RTL delay back-annotation. (2) The clocked time model relates all circuit activities to a clocking scheme, where one or multiple periodic clocks are applied to synchronize the operations of each functional component inside the system. It is essentially the concept of digital time. (3) The causality time model is abstract and employs a partial ordering, defined by the generation and consumption of data and by explicit control dependencies, to emulate the total ordering of system events. At this level, time is abstracted to a higher level without specified values and units. 2.1.1. Cycle-Approximate Time Annotation Time granularity is the minimum timing unit used to model the system's behavior. At the circuit level, the time granularity is the physical time resolution related to solving the circuit's differential equations. At RTL, the time granularity is the synchronizing clock with specific parameters such as the period, and the circuit is triggered and evaluated on each transition event (falling or rising edge) of the clock. Similarly, a cycle-accurate micro-architectural model triggers its evaluation on each clock cycle, with the difference that it only evaluates operations at the micro-architecture level by abstracting out unnecessary detailed transitions inside each micro-architectural block. Although cycle-accurate models generate accurate timing information, they limit the temporal resolution to the cycle period, which will be in the ns range, resulting in slow simulation for large systems. They are also difficult to use if the system contains multiple cores running on different clocks, where meta-stability needs to be handled carefully in the blocks crossing two clock domains.
A well-used design method is to insert a re-synchronizer between the two clock domains, but modeling it brings additional overhead, since tens or hundreds of re-synchronizers may need to be evaluated at each delta cycle on clock events from the different clock domains. This poses a significant simulation slow-down, which can be resolved by moving to higher levels of abstraction. A functionality-centric system-level model mainly considers the relationship between the system's inputs and outputs, eliminating implementation details to raise the level of abstraction and tackle future large platform-based SOC designs. The time model utilized in our system-level framework is essentially a meet-in-the-middle solution combining the physical and causality time models introduced above. Figure 2.1(a) shows part of a functional description of the DCT algorithm, which can be implemented in an ASIC, an FPGA or a microprocessor. This functional model contains 9 macro-executors, each of which performs some intermediate computation. At the system level, the functionality-centric model cares more about "what is the output for given inputs" and "when does each task finish" than "how is each task implemented", since designers do not care about internal signal transitions at early design stages. Given pre-characterized timing models of the macro-executors, an efficient approach is to annotate an executable timing model onto each macro-executor to emulate its timing behavior on the target architecture. Assume a macro-executor ME_i has execution time ET_i,ASIC on an ASIC core; the functional spec can then be converted into the system-level model shown in Figure 2.1(b). Each macro-executor is annotated with a delay function DelayFunc(ET_i, ASIC), which issues a time event after its execution. This time event is sent to the simulation kernel, which synchronizes the whole system if multiple processes execute simultaneously.
In this scheme, causality is maintained by the control and data dependencies, and timing information is provided by invoking the delay functions. Moreover, time events are aligned with the completion of macro-executors for system synchronization; they are not distributed evenly on the time axis, since different macro-executors have different execution times, and even the same macro-executor may spend a disparate amount of time executing on different target architectures. Compared with cycle-accurate or clocked time models with fixed granularity, the proposed time model is multi-granular in the sense that there is no constant minimum interval between two consecutive synchronization time events along the time axis during simulation. The time granularity is associated with macro-executors through the annotated timing information evaluated by the delay functions. Figure 2.2 describes three sequential macro-executors ME1, ME2 and ME3 with 100ns, 80ns and 200ns of execution time respectively, and shows the distribution of time events on the time axis for a cycle-accurate and a multi-granular time model. Executing the functional code with the multi-granular time model issues only 3 time events, represented by the vertical lines on the time axis, while the cycle-accurate counterpart generates 38 time events given a clock period of 10ns. Since most Discrete Event (DE) simulators in current use are triggered by events to perform evaluation, the number of time events generated during simulation has a direct influence on simulation speed. As shown in Figure 2.2, simulating with a C++/SystemC source-level simulator should be much faster than with an RTL-based cycle-accurate simulator, owing to the reduced number of time events. Applying this time model makes system-level evaluation and design space exploration more efficient when performing macro-architectural simulation.
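The event-count comparison above can be sketched as a small stand-alone C++ snippet; the function names and the integer-nanosecond bookkeeping are illustrative, not part of the framework:

```cpp
#include <numeric>
#include <vector>

// Count synchronization time events under the two time models discussed above.
// Multi-granular: one event per macro-executor completion.
int multi_granular_events(const std::vector<int>& exec_times_ns) {
    return static_cast<int>(exec_times_ns.size());
}

// Cycle-accurate: one evaluation per clock period over the total execution time.
int cycle_accurate_events(const std::vector<int>& exec_times_ns, int clock_ns) {
    int total = std::accumulate(exec_times_ns.begin(), exec_times_ns.end(), 0);
    return total / clock_ns;
}
```

With the 100ns/80ns/200ns executors and a 10ns clock, this reproduces the 3-versus-38 event counts cited for Figure 2.2.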
void fdct_1d(int row[8], int row_res[8]) {
    // variable declarations
    t0 = (float)(row[0] + row[7] - bias);  //ME1
    t7 = (float)(row[0] - row[7]);         //ME2
    t1 = (float)(row[1] + row[6] - bias);  //ME3
    t6 = (float)(row[1] - row[6]);         //ME4
    t2 = (float)(row[2] + row[5] - bias);  //ME5
    t5 = (float)(row[2] - row[5]);         //ME6
    t3 = (float)(row[3] + row[4] - bias);  //ME7
    t4 = (float)(row[3] - row[4]);         //ME8
    x0 = t0 + t3;                          //ME9
    // other macro-executors
}

Figure 2.1(a) DCT functional specification

void fdct_1d(int row[8], int row_res[8]) {
    t0 = (float)(row[0] + row[7] - bias);  //ME1
    DelayFunc(ET1,ASIC);
    t7 = (float)(row[0] - row[7]);         //ME2
    DelayFunc(ET2,ASIC);
    t1 = (float)(row[1] + row[6] - bias);  //ME3
    DelayFunc(ET3,ASIC);
    t6 = (float)(row[1] - row[6]);         //ME4
    DelayFunc(ET4,ASIC);
    t2 = (float)(row[2] + row[5] - bias);  //ME5
    DelayFunc(ET5,ASIC);
    t5 = (float)(row[2] - row[5]);         //ME6
    DelayFunc(ET6,ASIC);
    // other macro-executors
}
// Time event sequence on the time axis: ET1,ASIC; ET2,ASIC; ...; ET6,ASIC

Figure 2.1(b) Executable model with time annotation

// Spec in C, taken down two paths:
//  - C-to-RTL conversion -> Verilog or VHDL -> RTL simulator
//  - Time annotation -> C-level simulator:
t0 = (float)(row[0] + row[7] - bias);  //ME1
DelayFunc(100ns);
t7 = (float)(row[0] - row[7]);         //ME2
DelayFunc(80ns);
t5 = (float)(row[2] - row[5]);         //ME3
DelayFunc(200ns);

Figure 2.2 Time event distributions during simulation with different time models

Cycle-approximate simulation can be performed with the multi-granular time model by considering timing only at the border of each macro-executor, without delving into its architectural detail.
Although the timing information inside each macro-executor, which is determined by its implementation, is not made explicit by annotation, external timing information such as execution time can be applied directly for overall system-level timing analysis. In this sense, cycle information for each macro-executor is still available given a pseudo-clock, which is not used for system synchronization but only for cycle-number calculation. This scheme is called cycle-approximation. On the other hand, the multi-granular time model does not preclude cycle accuracy. Where cycle accuracy is required to model the internal micro-architecture of a core, repetitively invoking a delay function with a constant time argument equal to the clock period provides a pseudo-clocked time model. In this way, each pair of neighboring time events has the same interval, just as in the cycle-accurate simulation scheme introduced before. Moreover, the three aforementioned time models (physical, clocked and causal) can coexist in the same simulation framework using the multi-granular time model, accommodating the different timing requirements of a heterogeneous SOC architecture. This scheme is applied in the NOC OCA model: each packet's reception is handled with the clocked time model for system-level synchronization, route computation of each packet is processed with the causal time model, and transferring a packet from one router to another is evaluated with the physical time model, since the wire has an intrinsic RC delay that cannot be modeled with the synchronizing clock. 2.1.2. Implementation of the Multi-granular Time Model in SystemC The proposed multi-granular time model is implemented in SystemC 2.1 in our framework. The SystemC 2.1 library provides the pre-defined sc_time data type to represent physical time in different resolutions (s, ms, us, ns, ps and fs).
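The pseudo-clocked special case can be illustrated with a minimal sketch: calling the delay function repeatedly with a constant period yields evenly spaced time events. This is a plain C++ stand-in for repeated wait(sc_time(period, SC_NS)) calls; the names are illustrative:

```cpp
#include <vector>

// Emulate a pseudo-clocked time model: a delay function invoked with a
// constant argument (the clock period) produces time events at even
// intervals, unlike macro-executor annotation, whose spacing is irregular.
std::vector<int> pseudo_clock_events(int period_ns, int n_cycles) {
    std::vector<int> stamps;
    int now = 0;
    for (int i = 0; i < n_cycles; ++i) {
        now += period_ns;     // equivalent of wait(sc_time(period, SC_NS))
        stamps.push_back(now);
    }
    return stamps;
}
```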
An sc_time datum has two fields, value and unit, describing the time attribute. The value field contains the absolute time value, while the unit field specifies the time scale as one of the pre-defined macros SC_SEC, SC_MS, SC_US, SC_NS, SC_PS and SC_FS. Operations such as addition, subtraction and division are overloaded for the sc_time data type to handle automatic time-scale conversion. The SystemC library provides the wait( ) method as a prototype of the delay function. When wait( ) is invoked with an argument of sc_time(t, u), a time-out event is generated after the simulated time advances by t in units of u. Since multiple processes execute concurrently during simulation, the SystemC DE simulation kernel evaluates the executing processes at each time event, performing updates before evaluating the next incoming time event. This evaluation is global: if multiple concurrent processes execute in one simulation model, any time event triggers updating of all processes, no matter which one generated it. To illustrate the time model for different PE types, we use "[ ]" to represent a core, which may be a sequential microprocessor or a concurrent ASIC, and "{ }" for a process with a sequence of macro-executors. Assume MO1, MO2, MO3 and MO4 are four macro-operations running on the same or different PEs, each with execution time sc_time(t_i, u_i), i = 1, 2, 3, 4. Assume these four macro-operations are independent and are mapped onto one core. Two different timing properties are modeled according to the core architecture. (1) The four macro-operations execute sequentially on one microprocessor in the order MO1 -> MO2 -> MO3 -> MO4; their execution is expressed as: [{MO1; wait(st(t1, u1)); MO2; wait(st(t2, u2)); MO3; wait(st(t3, u3)); MO4; wait(st(t4, u4))}] where st(t_i, u_i) = sc_time(t_i, u_i)
and this describes the multi-granular time model. (2) MO1, MO2 and MO3, MO4 form two independent processes, each consisting of two sequential macro-executors, running on an ASIC core supporting concurrency. The description is: [{MO1; wait(st(t1, u1)); MO2; wait(st(t2, u2))} || {MO3; wait(st(t3, u3)); MO4; wait(st(t4, u4))}] where || represents the concurrency relationship between the two processes. In these two schemes, the total execution times are evaluated as: T1 = Σ_{i=1}^{4} sc_time(t_i, u_i) and T2 = max[ Σ_{i=1}^{2} sc_time(t_i, u_i), Σ_{i=3}^{4} sc_time(t_i, u_i) ] (2.1) When simulation is performed by invoking the SystemC kernel, the global time stamp is advanced to T1 and T2 for schemes (1) and (2) respectively after the four macro-executors finish executing. In this way, both sequential and concurrent systems' timing properties can be emulated by invoking the wait( ) delay functions. The sc_time argument of the wait( ) delay function is a crucial parameter when implementing the multi-granular time model; it can be statically predetermined or dynamically computed at run time. The first scheme is often applied when generating a periodic time-event sequence to mimic a cycle-accurate clock, while the latter is used in the simulation of PEs or the OCA, where the timing granularity varies with the current system status, for example under Dynamic Voltage and Frequency Scaling (DVFS), covered in later sections. 2.2. Object-Oriented Uniform PE Simulation Framework The increasing diversity and complexity of PEs, ranging from hardwired ASIC cores to fully programmable microprocessors, call for a more efficient and modular way to simulate them in a multi-core SOC context during early exploration stages. Although cycle-accurate models offer good simulation accuracy, their ad-hoc nature, with a distinct model for each architecture platform, makes application simulation involving both the computations of multiple models and their communications very slow and error-prone.
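Equation (2.1) can be sketched in plain C++ outside SystemC, with times reduced to integer nanoseconds; the function names are illustrative:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Simulated completion time for the two schemes in Eq. (2.1).
// T1: four macro-operations MO1..MO4 run sequentially on one microprocessor.
long t1_sequential(const std::vector<long>& t_ns) {
    return std::accumulate(t_ns.begin(), t_ns.end(), 0L);
}

// T2: two independent two-operation processes run concurrently on an ASIC;
// the global time stamp advances to the later of the two process finish times.
long t2_concurrent(const std::vector<long>& t_ns) {
    long p1 = t_ns[0] + t_ns[1];  // process {MO1; MO2}
    long p2 = t_ns[2] + t_ns[3];  // process {MO3; MO4}
    return std::max(p1, p2);
}
```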
For example, the well-known micro-architectural ISS SimpleScalar [10] takes ~5 minutes to run a JPEG encoder algorithm on a 64x64 input image. Even worse, a VHDL simulator [44] spends more than 15 minutes simulating the JPEG encoder core for the same input. For system-level SOC exploration, it is no longer efficient to use cycle-accurate models to simulate real applications running on multi-core architectures. This motivates us to raise the level of abstraction from the architecture level to the system level to model PEs more efficiently in our framework. 2.2.1. Macro-operations and Their Characterization In early design stages, the functionality of an SOC is documented as a system specification and modeled in high-level languages such as SystemC/C++, regardless of the target architecture. This executable specification can be used to verify the system functionality coarsely, since it contains only architecture-independent functional information. Annotating the multi-granular time model onto the executable spec makes it possible to account for timing information when executing the functional code. Due to the flexibility of high-level programming languages, each macro-executor may consist of multiple basic operations. To modularize the timing annotation, it is beneficial to find a minimum set of operations from which any macro-executor can be constructed. In our work a comprehensive set of C operations is identified and characterized for timing information. Since the PEs target signal and media processing applications, source-level analysis of benchmarks [45] is performed by hand or with profiling tools [46] to track the occurrence of each basic operation in the code.
The most frequent operations are then identified and categorized into five groups containing 21 atomic operations: (1) arithmetic operations such as a = b + c and a = c >> 2; (2) loop operations, such as for and while loops; (3) branch operations, such as if...then...else structures; (4) function calls with different numbers of parameters and types of return values; (5) memory access operations such as a = *p, where p is a data pointer to a memory address. Any macro-executor can be described as a combination of these atomic operations, each of which is called a macro-operation. These 21 macro-operations form the macro-operation set in the system-level model. For example, the macro-executor ME: c = a * b + (*d) contains three macro-operations, OP_ALU_2, OP_MUL_2 and OP_MEM_1, and is described as ME = (OP_ALU_2, OP_MUL_2, OP_MEM_1). Note that the execution order of these macro-operations is not yet defined in this description.

    (a) with MUL instruction        (b) without MUL instruction
    LOAD @0x200, R1                 CLR  RASH
    LOAD @0x202, R2                 CLR  RASL
    DMUL R1, R2, R3                 MOV  &0x200, RM
    MOV  R3, @0x204                 MOV  &0x202, RM1
                                    NOP
                                    MOV  RASL, @0x204

Figure 2.3(a) OP_MUL_2 with MUL instruction (b) OP_MUL_2 without MUL instruction

Assume each macro-operation can be implemented on both a microprocessor and an ALU-type ASIC core. In the macro-executor shown above, each macro-operation may have different implementations, resulting in different timing properties after architecture mapping. Figure 2.3 shows the implementation of macro-operation OP_MUL_2 (two-operand multiplication) on two microprocessor architectures: (a) shows that only four instructions are needed to finish OP_MUL_2, including memory access, on a microprocessor with multiplication instructions, while (b) shows that six instructions are needed to execute the same macro-operation on a processor without a hardware MAC data path (such as the MSP430, which treats the multiplier as a peripheral).
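As a minimal illustration of this representation, a macro-executor can be held as a list of macro-operation tags whose execution order is left undefined; only three of the 21 operations are shown, and the C++ type names are illustrative stand-ins:

```cpp
#include <vector>

// Three of the 21 macro-operations named in the text.
enum MacroOp { OP_ALU_2, OP_MUL_2, OP_MEM_1 };

// A macro-executor described as a collection of macro-operations, as in
// ME: c = a * b + (*d)  ->  (OP_ALU_2, OP_MUL_2, OP_MEM_1).
struct MacroExecutor {
    std::vector<MacroOp> ops;  // execution order not yet defined at this level
};

MacroExecutor make_mul_add_load() {
    return MacroExecutor{{OP_ALU_2, OP_MUL_2, OP_MEM_1}};
}
```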
Different PE architectures have different timing properties for the macro-operation set, and timing characterization is performed to extract their timing information. Benchmark programs, small segments of C code, are created by repeating the same type of macro-operation for a pre-defined number of iterations. An RTL- or microarchitecture-level simulator such as SimpleScalar runs the compiled benchmark and generates two parameters, sim_num_inst and sim_CPI, representing the number of simulated instructions and the average CPI (cycles per instruction). The total execution cycle count of the benchmark is computed from: Cycle_benchmark = sim_num_inst × sim_CPI (2.2) Each benchmark is encapsulated in a main( ) function, which initializes the run-time environment, allocates stacks and consumes clock cycles as well. To calculate the net cycle count for each iteration of a macro-operation, the main( ) initialization overhead, computed by simulating another program containing only a main( ) body, is subtracted. The macro-operation cycle count is then calculated as shown in (2.3): Cycle_operation = (Cycle_benchmark − Cycle_main) / N (2.3) where N is the number of iterations of the macro-operation in the benchmark and Cycle_main is main( )'s initialization clock cycle count, expressed by: Cycle_main = sim_num_inst_main × sim_CPI_main (2.4) Ideally, each macro-operation's cycle count is linear in the number of benchmark iterations and passes through the origin. In this step, each macro-operation is simulated with different numbers of iterations, from 10000 to 20000 in steps of 100 iterations. The cycle data from (2.3) is approximately linear in the iteration number, with ~0.01% average non-linearity and a non-zero offset, which originates from the inherent operating mechanism of embedded microprocessors with caches.
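Equations (2.2) through (2.4) amount to the following arithmetic. This is a sketch with illustrative function names; the inputs are the simulator-reported instruction counts and CPIs:

```cpp
// Eq. (2.2) and Eq. (2.4): total cycles of a simulated run.
double benchmark_cycles(long sim_num_inst, double sim_cpi) {
    return sim_num_inst * sim_cpi;
}

// Eq. (2.3): net cycle count of one macro-operation iteration, i.e. the
// benchmark's total cycles minus the main() initialization overhead,
// divided by the iteration count N.
double op_cycles(long inst_bench, double cpi_bench,
                 long inst_main, double cpi_main, long n_iters) {
    double total    = benchmark_cycles(inst_bench, cpi_bench);
    double overhead = benchmark_cycles(inst_main, cpi_main);
    return (total - overhead) / n_iters;
}
```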
At the start of program execution the cache is empty, so the miss rate is nearly 100% and brings additional clock penalties to fill the cache from external memory. Linear regression is employed to minimize the mean square error on the measured data, statistically estimating each macro-operation's approximate cycle count. Each macro-operation thus corresponds to a linear function whose parameters populate the model library. For example, the cycle function for macro-operation OP_MUL_2 is 1.8n + 13.9, where n is the number of iterations of OP_MUL_2 on a MIPS embedded microprocessor; the first coefficient is the average cycle count per macro-operation and the second is the offset. The characterized cycle functions for all operations in the macro-operation set form the macro-timing library. Combining each macro-operation's cycle information with the operating frequency, any macro-executor's execution time can be estimated by multiplying the cycle count by the clock period. For each target PE architecture, a dedicated characterization process must be performed to obtain its macro-timing library. 2.2.2. Simulating Tasks by Running Macro-operations Executing a macro-executor involves executing both the functional code and its macro-operations' time models. For a macro-executor consisting of multiple macro-operations, timing annotation is implemented by passing the sum of these macro-operations' time models to the wait( ) method for timing emulation. If a macro-executor ME contains N different macro-operations with parameters st(t_i, u_i), 1 <= i <= N, its execution can be expressed as: [ ME; wait( Σ_{i=1}^{N} st(t_i, u_i) ) ] By executing the wait( ) function with the summed time model as its argument, the timing effect of these macro-operations is reflected in the global simulated time. A task contains a set of macro-executors, and each task can only execute on one core at a time.
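Using a characterized cycle function together with the operating frequency can be sketched as follows, with the OP_MUL_2 coefficients quoted above; the helper names are illustrative:

```cpp
// Characterized linear cycle function for OP_MUL_2 on the MIPS core from
// the text: cycles = 1.8 * n + 13.9 (slope = average cycles per operation,
// intercept = cache warm-up offset found by linear regression).
double op_mul_2_cycles(long n) { return 1.8 * n + 13.9; }

// Execution time estimate: cycle count multiplied by the clock period.
double exec_time_ns(double cycles, double freq_mhz) {
    double period_ns = 1000.0 / freq_mhz;  // period in ns for a MHz frequency
    return cycles * period_ns;
}
```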
We also assume that each task is non-preemptive in the framework. Intuitively, a task executes directly on the host machine given its input and generates its output target-independently. Annotating a time model onto each macro-executor in a task provides timing information for the target core architecture, whose details are hidden at early design stages. Depending on the annotation method, the same task may have different simulation models. Figure 2.4(a) shows a computational task containing M macro-executors S1, ..., SM, each of which consists of N_j macro-operations. The discrete annotation method annotates the code with the time model of each macro-executor separately: [ S1; wait( Σ_{i=1}^{N1} st(t_i, u_i) ); S2; wait( Σ_{i=1}^{N2} st(t_i, u_i) ); ...; SM; wait( Σ_{i=1}^{NM} st(t_i, u_i) ) ] (2.5) while the grouped annotation method annotates all the macro-executors when the task finishes: [ S1; S2; ...; SM; wait( Σ_{j=1}^{M} Σ_{i=1}^{Nj} st(t_i, u_i) ) ] (2.6)

(a) Discrete time annotation: a time event after each macro-executor
(b) Grouped time annotation: a single time event when the task finishes

Figure 2.4 Different time annotations on a task's macro-executors

The difference between the discrete and grouped time annotation methods lies in the number of time events generated during simulation. Both models finish execution at the same simulated time Σ_{j=1}^{M} Σ_{i=1}^{Nj} st(t_i, u_i) and have the same functionality, but the model from discrete time annotation in (2.5) generates a time event after each macro-executor commits execution, notifying other concurrent processes as shown in Figure 2.4(a), while the model from grouped time annotation in (2.6) issues a time event only when the whole task finishes, as described in Figure 2.4(b).
Applying these properties, discrete time annotation is used when synchronization with other tasks is needed during the task's execution, while grouped annotation views the task as a black box, assuming no synchronization with other tasks is required while it runs. Simulation with grouped annotation is also faster than with discrete annotation because of the smaller number of synchronization time events. 2.2.3. Object-Oriented Application and Task Graph Model Before detailing the retargetable PE simulation model, we present the C++ object-oriented application model on which the PE model is built. The motivation for creating the application model is to encapsulate the application, which essentially is just a set of C++ code containing only functionality information, behind a programming interface through which the PE model can handle it regardless of the application code. The application model is shown graphically in Figure 2.5 and consists of two abstract C++ classes: TaskType and AppModel. TaskType abstracts all the required task information in the following fields: (1) task_name, which specifies the task name; any two tasks in an application must have different task_name fields. (2) task_platform, modeled as a pointer to an object of the PE class, which will be described later. This field specifies the target PE on which the task executes, and it implicitly links to the target PE's macro-library for timing/energy simulation. (3) task_start_time, which models the start time of the task. This is modeled with the SystemC sc_time data type and can be converted to a cycle count given the operating frequency. (4) task_end_time, which models the end time of the task, also as an sc_time; the task execution time can be calculated as task_end_time - task_start_time. (5) task_status, which indicates the task's status: IDLE, RUN or DONE.
This field provides the PE model with the task's run-time information. (6) task_handler, a pointer to the starting address of a function modeling the task's functionality, which can be modified at run time simply by changing its value. The C++ callback mechanism takes task_handler as its entry point to start executing the task. (7) pred_task_num, which specifies the number of tasks that must complete before this task. This member works together with the next one to handle task dependencies in a multi-task application. (8) pred_tasks, an array of pointers referring recursively to TaskType objects. It specifies the tasks that must execute before this one, and each element of the pred_tasks array has the same structure, fields (1) to (7).

    class TaskType {
        string    task_name;
        ProcElem *task_platform;
        sc_time   task_start_time;
        sc_time   task_end_time;
        STATUS    task_status;
        void    (*task_handler)(ProcElem *);
        int       pred_task_num;
        TaskType *pred_tasks[MAX_TASK_NUM];
    };

    class AppModel {
        TaskType *m_task[MAX_TASK_NUM];
        ApplicationModel(TaskType *m_task[], unsigned int);
        ~ApplicationModel();
    };

Figure 2.5 Task and application model represented by C++ classes

The AppModel class abstracts the application at the task level by associating multiple TaskType objects in its instantiation. As shown in Figure 2.5, the m_task array in AppModel contains all the task instances of the application, and the task dependency is implicitly represented, recursively, by each task's pred_tasks data member. A task T_i can be scheduled for execution if and only if every element in T_i's pred_tasks array has the status DONE. This condition indicates that all of T_i's predecessor tasks in the application task graph have finished, so that T_i is ready to start. Given an application S partitioned into N tasks {T1, T2, ..., TN}, a Task Graph (TG) G_S is constructed to represent the task dependencies, as Figure 2.6(a) shows.
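The readiness condition can be sketched with a pared-down stand-in for TaskType, keeping only the fields the check needs; the field names follow the text, while the struct itself is a simplification:

```cpp
#include <string>
#include <vector>

// A task is ready to be scheduled iff every entry in its pred_tasks
// array has status DONE.
enum Status { IDLE, RUN, DONE };

struct Task {
    std::string        name;
    Status             status;
    std::vector<Task*> pred_tasks;  // predecessor tasks in the task graph
};

bool is_ready(const Task& t) {
    for (const Task* p : t.pred_tasks)
        if (p->status != DONE) return false;
    return true;  // vacuously true for tasks with no predecessors
}
```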
In our research each task object is created by initializing its task_handler field, and the pred_tasks array is initialized for each task object according to the task dependencies in the TG. The AppModel object is then instantiated at the top level to include all these sub-tasks, creating the encapsulated abstract application model. Figure 2.6(b) presents the task dependency model for the 6 tasks of the application represented by the task graph in Figure 2.6(a). Using this object-oriented application model, any application containing multiple tasks with arbitrary task dependencies can be modeled formally and without ambiguity, since each task's information is explicitly encapsulated in the class object. The method also provides an executable formal representation of the application for the PE model. When two tasks mapped onto different PEs exchange data, a buffer may be allocated for each task object to temporarily store the communicated data.

Figure 2.6(a) Task graph with 6 tasks (task_a through task_f, annotated with their execution times)

    // Task object instantiation
    m_task[a] = new TaskType("Task a", NULL, T_ZERO, T_ZERO, function_a);
    m_task[b] = new TaskType("Task b", NULL, T_ZERO, T_ZERO, function_b);
    m_task[c] = new TaskType("Task c", NULL, T_ZERO, T_ZERO, function_c);
    m_task[d] = new TaskType("Task d", NULL, T_ZERO, T_ZERO, function_d);
    m_task[e] = new TaskType("Task e", NULL, T_ZERO, T_ZERO, function_e);
    m_task[f] = new TaskType("Task f", NULL, T_ZERO, T_ZERO, function_f);

    // Task dependency initialization
    // No task before task a
    m_task[a]->pred_task_num = 0;
    // Tasks a and c before task b
    m_task[b]->pred_task_num = 2;
    m_task[b]->pred_tasks[0] = m_task[a];  // task a
    m_task[b]->pred_tasks[1] = m_task[c];  // task c
    // No task before task c
    m_task[c]->pred_task_num = 0;
    // Task e before task d
    m_task[d]->pred_task_num = 1;
    m_task[d]->pred_tasks[0] = m_task[e];  // task e
    // Task a before task e
    m_task[e]->pred_task_num = 1;
    m_task[e]->pred_tasks[0] = m_task[a];  // task a
    // Tasks b and e before task f
    m_task[f]->pred_task_num = 2;
    m_task[f]->pred_tasks[0] = m_task[b];  // task b
    m_task[f]->pred_tasks[1] = m_task[e];  // task e

Figure 2.6(b) Application's task dependency model

2.2.4. Object-Oriented PE Simulation Model The PE simulation framework creates a uniform environment for modeling both microprocessors and hardwired ASIC cores efficiently and flexibly. It also provides a simple scheduling engine as a timed OS model for processors running multiple tasks. Object-oriented modeling is applied by abstracting the main attributes of the target PEs to create a homogeneous simulation model across different architectures. The class diagram of the proposed PE modeling framework is shown in Figure 2.7, where all PEs are abstracted into the base class PEL, which contains architecture-independent parameters such as clock frequency (in IP-based SOCs, one operating clock frequency may correspond to several core choices), memory size (assuming each PE has its own local memory), and the timing/energy libraries. Two classes, CPUL and ASICL, inherit from PEL, and each contains high-level information specific to microprocessors and ASICs respectively.
The main parameters of CPUL are the architecture, the application model object and the scheduling policy, while ASICL is configured by its task list and parallel degree, which defines the maximum number of parallel processes. Each PE is instantiated as an object of the CPUL or ASICL class, with the constructor specifying the architecture. For example, an IBM PowerPC and a JPEG encoder core could be instantiated by the following C++ code in the SystemC environment:

    m_pe[0] = new CPUL("PowerPC405", PPC405, 50, TRUE);
    m_pe[1] = new ASICL("JPEGENC", JPEGASIC, 200, 4, TRUE);

where the first statement creates a 50MHz IBM PowerPC405 microprocessor core and the second creates a 200MHz JPEG encoder core with 4 parallel execution paths.

    class PEL {
        frequency, mem_size;
        latency_LUT, energy_LUT;
        PEL(); ~PEL();
        void SetFrequency(int freq);
        int GetFrequency();
        void init_latency_LUT(const char *);
        void init_energy_LUT(const char *);
    };

    class CPUL : public PEL {        class ASICL : public PEL {
        type;                            type;
        (void *) schedule();             int parallel_degree;
        FILE *tasklist;                  FILE *tasklist;
        CPUL(); ~CPUL();                 ASICL(); ~ASICL();
        void AddTask(unsigned char, ...);    ...
        void (*func)(CPUL *);        };
    };

Figure 2.7 Class diagram of the PE model framework

After task mapping, different cores may exhibit different task dependency features, which must be resolved for execution. We divide PEs into two categories, microprocessor and ASIC, abstracted by the CPUL and ASICL classes. We also assume that a microprocessor executes only one task at a time, while an ASIC targets computation-intensive applications and performs concurrent task execution through multiple execution paths. The uniform cycle-approximate PE simulation framework based on the Task Execution Engine (TEE) applies to both microprocessor and ASIC cores. First, this simulation framework is a high-level behavioral model: it performs multi-granular timed simulation by executing system-level SystemC specifications with annotated time models, without compiling them to architecture-dependent code.
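A compilable, heavily simplified stand-in for the PEL/CPUL/ASICL hierarchy can keep only the frequency and parallel-degree behavior discussed here; the constructor signatures are simplified relative to the instantiation example above, and this is an illustration, not the framework's actual class definitions:

```cpp
#include <string>
#include <utility>

// Base class: architecture-independent parameters (name, frequency).
struct PEL {
    std::string name;
    int freq_mhz;
    PEL(std::string n, int f) : name(std::move(n)), freq_mhz(f) {}
    virtual ~PEL() {}
    virtual int parallel_degree() const = 0;
};

// Sequential microprocessor: executes one task at a time.
struct CPUL : PEL {
    CPUL(std::string n, int f) : PEL(std::move(n), f) {}
    int parallel_degree() const override { return 1; }
};

// Concurrent ASIC: multiple parallel execution paths.
struct ASICL : PEL {
    int paths;
    ASICL(std::string n, int f, int p) : PEL(std::move(n), f), paths(p) {}
    int parallel_degree() const override { return paths; }
};
```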
Second, task executions are synchronized by the annotated timing models of their macro-operations and are cycle-approximate, without being driven by a fixed-cycle clock. As discussed before, the execution time of macro-operation MO_i is modeled by a SystemC wait( ) function call with its argument calculated as: t(MO_i) = sc_time( f^-1 · cycle_num[MO_i] ) (2.7) where f and cycle_num[MO_i] represent the operating frequency and macro-operation MO_i's cycle count respectively. The factor f in (2.7) is initialized when the PE object is instantiated, or configured by the top-level model at run time when dynamic voltage and frequency scaling (DVFS) is applied; the cycle count is acquired by table lookup in the macro-timing library, which is also set up when a PE module is instantiated. The PE model consists of two main modules: the Task Scheduling Engine (TSE) and the Task Execution Engine (TEE). Different PE categories correspond to different TSE/TEE model architectures according to their execution concurrency properties. Figure 2.8 shows the generic architecture of the TSE/TEE-based PE model represented by the ProcElem SystemC class. This PE abstraction contains one TSE and one TEE, modeled by the TaskScheduler and process_1 processes shown in Figure 2.8, respectively. Each PE has exactly one TSE, and one or multiple TEEs depending on whether it is a single-threaded microprocessor or a multi-threaded hardware ASIC. TaskScheduler is registered with the SystemC kernel as an SC_METHOD process; it performs task status checking and scheduling, which does not need to be suspended during simulation. On the other hand, process_1 is registered with the SystemC kernel as an SC_THREAD process, because it must invoke the C++ callback function pointing to the application task code, which contains explicit wait( ) statements to model task timing directly.
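Equation (2.7) with its table lookup can be sketched as follows. The library contents and the function name are illustrative, and the frequency is passed per call to reflect its possible change at run time under DVFS:

```cpp
#include <map>
#include <string>

// Eq. (2.7): t(MO_i) = f^-1 * cycle_num[MO_i]. The cycle count comes from
// a macro-timing library lookup; the frequency may be rescaled at run time.
// Library contents here are made-up values for illustration.
double macro_op_delay_ns(const std::map<std::string, double>& timing_lib,
                         const std::string& op, double freq_mhz) {
    double cycles = timing_lib.at(op);  // macro-timing library lookup
    return cycles * 1000.0 / freq_mhz;  // period(ns) = 1000 / f(MHz)
}
```

Halving the frequency doubles the annotated delay, which is exactly how DVFS reshapes the time-event distribution during simulation.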
When this thread process is suspended while evaluating timing information, TaskScheduler can still be triggered to monitor the current task execution status.

Figure 2.8 Architecture of TSE/TEE PE model (with application/task models)

TSE (TaskScheduler) and TEE (process_1) communicate via dedicated events and signals. TEE_task_assign_event is issued by the TSE when it decides that a task's dependence is resolved and the task is ready to execute according to the application task graph and the TEE status. The TEE will not start executing the task until it receives the TEE_task_assign_event event, and it executes with the causal time model. The TEE provides another event, current_task_done_event, to the TSE, along with a signal TEE_busy showing whether the TEE is currently executing a task. The first event is used to trigger the TSE to start scheduling the next task asynchronously regardless of the clock signal, which can be used for time-out control purposes, while the TEE_busy signal controls synchronous TSE scheduling by running the scheduler if and only if TEE_busy is '0', representing no executing task in the TEE, on the TSE clock rising edge. The TSE picks up and passes the scheduled task for execution to the TEE from the task_pool array, which stores all tasks that are mapped onto this PE. Each task element in the task array is a copy of a task object in the application task graph model shown in Figure 2.8. When an application is partitioned, each PE object's task_pool is initialized with the mapped tasks by a separate system controller.

2.2.4.1. Sequential Microprocessor Model

Most embedded microprocessors feature sequential program execution, without considering advanced features such as VLIW [47], hyper-threading [48], etc.
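Before detailing the microprocessor model, the TSE/TEE handshake described above (assign event, done event, TEE_busy flag) can be condensed into a plain C++ sketch. This is a loose, event-free analog for illustration only; the struct and member names are assumptions, and the real model uses SystemC processes and sc_event notifications.

```cpp
#include <functional>
#include <queue>

// Cooperative sketch of the TSE/TEE handshake: the scheduler hands a task to
// the execution engine only when TEE_busy is false, and completion plays the
// role of current_task_done_event, re-enabling scheduling.
struct TeeTseSketch {
    std::queue<std::function<double()>> task_pool;  // each task returns its exec time
    bool tee_busy = false;
    double sim_time = 0.0;

    // One scheduling step: emulates TEE_task_assign_event then
    // current_task_done_event. Returns true if a task was dispatched.
    bool schedule_step() {
        if (tee_busy || task_pool.empty()) return false;
        tee_busy = true;                  // TEE accepts the assigned task
        sim_time += task_pool.front()();  // task body advances simulated time
        task_pool.pop();
        tee_busy = false;                 // done event: TSE may schedule again
        return true;
    }
};
```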
Figure 2.9 depicts the TSE/TEE model for a microprocessor with one execution unit; this figure is derived from Figure 2.7 by adding some details. Assume a sub-graph G_s1 from TG in Figure 2.9 containing 3 tasks {T1, T2, T3} is mapped onto a microprocessor. The TSE first assigns these tasks by invoking the AddTask( ) method, which registers each task entry with the microprocessor. Then the TSE calculates a schedule, from which the processor executes only one task at a time and meets each task's deadline requirement, using the Scheduler Function (SF). Each task is characterized by its timing attributes such as starting time, deadline and period; the SF evaluates them dynamically during task execution and calculates the next task to be scheduled according to predefined scheduling policies. Based on the scheduling result, the TSE allocates the processor resource to the scheduled task once the currently executing task finishes, and starts scheduling the unscheduled tasks in the sub-graph G_s1. The TSE loads the scheduled task into the TEE for execution through the LoadTask( ) function call and gets the currently executing task from the TEE through GetCurrentExeTask( ). Depending on different scheduling policies, such as maximizing throughput or minimizing energy, a TSE may have multiple pre-defined SF prototypes to be loaded into its scheduler. More significantly, the TSE can change its scheduler's SF during run-time by invoking the SetTSEScheduler( ) method, and this feature provides the potential to model dynamic run-time scheduling policies.
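The run-time SF swap enabled by SetTSEScheduler( ) amounts to exchanging a function pointer. The sketch below illustrates that mechanism with two made-up policies (FIFO order and earliest-deadline-first); the names and the deadline-vector interface are assumptions, not the framework's API.

```cpp
#include <cstddef>
#include <vector>

// The scheduler function (SF) is a plain function pointer that the TSE can
// replace at run time, mirroring the SetTSEScheduler( ) feature.
using SchedulerFn = std::size_t (*)(const std::vector<int>& deadlines);

// Policy 1: pick the task with the earliest deadline.
std::size_t earliest_deadline_first(const std::vector<int>& d) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < d.size(); ++i)
        if (d[i] < d[best]) best = i;
    return best;
}

// Policy 2: simple FIFO order, always the first pending task.
std::size_t fifo_order(const std::vector<int>&) { return 0; }

struct TseSketch {
    SchedulerFn sf = fifo_order;
    void set_scheduler(SchedulerFn f) { sf = f; }       // SetTSEScheduler analog
    std::size_t pick(const std::vector<int>& d) { return sf(d); }
};
```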
Figure 2.9 Microprocessor's TSE-TEE model (the TSE side provides AddTask( ), LoadTask( ), GetCurrentExeTask( ) and SetTSEScheduler( ) and consults the macro-timing library; the TEE side is an SC_THREAD process bounded by Task_start and Task_finish events)

A SystemC sc_thread process provides a run-time environment for its encapsulated tasks, and an sc_thread body triggered by the effective scheduled task implements the main part of the TEE for the microprocessor. Since the macro-executors in one task execute sequentially, and so does each task, sequential execution is guaranteed for the microprocessor model. Each task's execution is triggered by the TEE_task_assign_event event issued from the TSE after the task is loaded into the TEE. During task execution, the annotated time model is invoked by wait( ) function calls, thus generating time events for the whole system's synchronization if there exist multiple cores. Once a task finishes execution, a current_task_done_event event is generated to notify the TSE to perform scheduling for its unhandled tasks. When all the mapped tasks are completed, another all_task_done event is issued by the TEE to trigger the TSE to rerun the scheduler after it resets the task status information recorded during scheduling.

2.2.4.2. Parallel Hardware Accelerator (ASIC) Model

We model ASIC cores that are employed to exploit the parallelism among tasks for computation-intensive applications. As described in Figure 2.10(a), one of the most significant features of this type of application is that part of its task graph can be partitioned into multiple concurrently executing threads. For example, the JPEG encoder application can be partitioned into 4 homogeneous computation flows, each performing the same tasks {DCT, Quant, Zigzag, Huffman} sequentially on each color component with a fixed downsampling ratio.
This motivates us to use multiple processing resources to handle these tasks concurrently to accelerate the computation. Reflected in the TEE model, these executing paths are implemented by multiple sc_thread processes, each of which executes a sequence of tasks as shown in Figure 2.10(b).

Figure 2.11 Task re-mapping between two processor cores. (a) Ad-hoc task re-mapping with dedicated ISSes (b) Retargetable task re-mapping based on macro-model and TSE/TEE

To mitigate the constraint of relying on different cross-compiling environments for different PE architectures, our proposed PE model employs a uniform, architecture-independent set of macro-operations, as discussed in 2.2.1, to characterize the timing information. Each PE architecture has a lookup table (LUT) with the same entries for these macro-operations' names, and the same macro-operation on different cores features different execution times characterized from low-level simulations. These global LUTs constitute the system-level timing library, which is invoked by the task code. Each PE object has its local LUT, which is initialized by copying the corresponding global LUT from the library according to its architecture when executing its class constructor.
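The global-to-local LUT copy at construction time can be sketched as follows. The table contents, architecture keys and class names are made up for illustration; only the mechanism (a per-architecture global table copied into the PE's local LUT by the constructor) reflects the text.

```cpp
#include <cstdint>
#include <map>
#include <string>

using LatencyLUT = std::map<std::string, uint64_t>;

// Global system-level timing library: per-architecture cycle counts for the
// same uniform set of macro-operations. Values here are placeholders.
const std::map<std::string, LatencyLUT>& global_timing_library() {
    static const std::map<std::string, LatencyLUT> lib = {
        {"PPC405",     {{"MO_ADD", 1}, {"MO_MUL", 4}}},
        {"MicroBlaze", {{"MO_ADD", 1}, {"MO_MUL", 3}}},
    };
    return lib;
}

struct PeSketch {
    LatencyLUT local_lut;
    // The constructor copies the matching global LUT into the local LUT,
    // playing the role of init_latency_LUT( ).
    explicit PeSketch(const std::string& arch)
        : local_lut(global_timing_library().at(arch)) {}
};
```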
For example, when instantiating a PowerPC core using core_1 = new PEL("PowerPC_1", PPC405, 100, ...), a PE object core_1 named "PowerPC_1" is created with the PPC405 (IBM PowerPC405) architecture and a base operating frequency of 100MHz; its constructor then invokes the init_latency_LUT( ) method to copy the PowerPC405's LUT information from the library to core_1's local LUT data structure, as shown in Figure 2.11(b). This step essentially links a PE architecture with its timing information. Given a task T = task_func(argu_1, ..., argu_n), where argu_i is the task's ith parameter, its mapping onto a PE is simply modeled by concatenating the PE object core to its argument list, as well as annotating the macro-executors inside the task with the macro-operation time model, as Figure 2.11(a) shows. After instantiating a PE core_1 as a PowerPC405, T will be associated with core_1's local LUT after core_1 calls its AddTask( ) member function to allocate resources for this task, and each annotated macro-operation MO in T will access core_1's LatencyLUT to evaluate its execution time by table lookup, such as t_MO = LatencyLUT[MO]. By passing the evaluated execution time to the wait( ) method after each macro-executor finishes, the timing of executing this macro-executor on the PowerPC405 core is emulated. In this way, neither an architecture-dependent cross-compiler nor an ISS is required to perform simulation. And if there is another core core_2 with a different architecture, remapping task_func onto it is implemented straightforwardly by invoking core_2's AddTask( ), by which all macro-operations in T are associated with core_2's local timing LUT. Figure 2.11(b) describes the retargetability of the TSE-TEE model using the same example as Figure 2.11(a). The only operation needed to switch the target processors for these two tasks is to exchange the function pointers passed to these two cores' task allocation methods.
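A condensed sketch of this retargeting mechanism: a task is a callback that receives the core it was added to, so its annotated macro-operations resolve latency through that core's local LUT, and remapping is just an AddTask( ) call on a different core. The struct, task body and LUT values below are illustrative assumptions.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct CoreSketch {
    std::map<std::string, uint64_t> latency_lut;           // local timing LUT
    std::vector<uint64_t (*)(CoreSketch&)> tasks;          // mapped task callbacks
    void add_task(uint64_t (*fn)(CoreSketch&)) { tasks.push_back(fn); }
    uint64_t run_all() {                                   // total cycles on this core
        uint64_t cycles = 0;
        for (auto fn : tasks) cycles += fn(*this);
        return cycles;
    }
};

// A task annotated with macro-operations; its cost depends only on which
// core's LUT it is mapped to, not on any cross-compiled machine code.
uint64_t fdct_task(CoreSketch& core) {
    return core.latency_lut.at("MO_MUL") * 8 + core.latency_lut.at("MO_ADD") * 8;
}
```

Remapping fdct_task from one core to another means calling the other core's add_task( ) with the same function pointer, exactly the exchange described above.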
Through this task-function callback-based method, system-level task remapping is simplified dramatically to just invoking each core's task allocation method AddTask( ). During simulation, the annotated macro-operations access the mapped LUT and in turn evaluate the execution time on the target PE without relying on compilers/ISSes. Given a task graph with a large number of tasks and several different core architectures, applying this methodology significantly reduces the overhead of building a simulation model targeting heterogeneous PE cores. Consequently, this uniform simulation framework enables fast simulation-based SoC design space exploration, which will be presented in chapter 5.

2.2.6. Selective Code Simulation

The overhead introduced by the retargetable PE model is the additional access to the target core object during task execution. If a task contains multiple child functions, each child function and its possible children need to be modified by adding a virtual PE object to its argument list to access the internal macro-operations. In addition to fast architecture-independent task remapping, another promising feature of this simulation framework is the Selective Code Simulation (SCS) ability, which has not been offered by previous cross-compilation-based simulation models. Any real application contains both functional code implementing the core functionality and some nonfunctional code for debugging and file I/O. When compiled with a cross-compiler, the whole application, including both functional and nonfunctional parts, is translated into architecture-dependent machine code. As a result, the nonfunctional code not only induces errors in the timing analysis, where most concerns are concentrated on the functional code, but also reduces the simulation speed because of the increased machine code. As we know, simulating file I/O functions on a cycle-accurate ISS is very time-consuming and inefficient.
To tackle this issue, designers often tailor the application using compiler directives to manually compile only the core functions to feed the ISS for a more accurate and faster simulation; we believe this becomes inefficient for large applications. In the TSE/TEE-based simulation framework, any function without macro-model annotation linked with a PE object will not trigger any time event, which means that this function finishes execution without invoking the time-evaluating functions. Applying this methodology, designers only need to selectively annotate the function code of interest, without changing any part of the nonfunctional code, to get accurate timing information from system-level simulation. SCS significantly increases the flexibility of application simulation and also generates output dump files with reasonable simulation time and accuracy, which is hard to achieve with cross-compilation-based ISSes.

2.2.7. System-level Modeling of Dynamic Voltage and Frequency Scaling

Many modern SoCs with embedded microprocessor cores are becoming more energy-centric than speed-centric, given that the performance constraints are met, since most of them are battery-powered. Because battery technology has not kept pace with the increasing power demands of such systems, it is important to target power during the design process. It is also known that the higher the level of design hierarchy at which power is tackled, the higher the possible power reduction [49]. Dynamic Voltage and Frequency Scaling (DVFS, or DVS in short) is one of the most extensively applied power-reduction techniques in embedded SoCs. In DVS, different computation tasks are run at different voltages and clock frequencies in order to fill up the idle periods in the schedule. In this chapter only the system-level DVS model of microprocessor cores is presented; chapter 5 mainly addresses task mapping and scheduling for multi-core SoCs when taking DVS into consideration.
Given a circuit fabricated in a certain technology, the delay depends on the supply voltage V_dd as follows: L = k·V_dd/(V_dd − V_t)^2, where k is a constant and V_t is the threshold voltage. Thus, delay increases as V_dd decreases. This formula also applies to a microprocessor when taking L as the minimum operational clock period corresponding to the maximum working frequency. In our model, each microprocessor is assigned a normal supply voltage V_dd,norm, and we define the DVFS factor R_DVFS as the ratio between the normal supply voltage and the actual supply voltage V_dd,act. Based on the fact that the supply voltage must exceed the threshold voltage for the digital circuit to work normally, we have:

T = k · (V_dd,norm / R_DVFS) / (V_dd,norm / R_DVFS − V_t)^2,   R_DVFS ∈ [1, V_dd,norm / V_t)    (2.11)

where T represents the microprocessor's clock period. From (2.11), it is observed that the increase of latency is not linear given a linear reduction of the supply voltage. We use the microprocessor's clock period to represent the latency of instructions and macro-operations. Assuming the baseline latency T_0 with normal supply voltage V_dd,norm, the latency increase ratio can be derived from the DVFS factor by:

R_L = T / T_0 = [k · (V_dd,norm / R_DVFS) / (V_dd,norm / R_DVFS − V_t)^2] / [k · V_dd,norm / (V_dd,norm − V_t)^2] = R_DVFS · (R_t − 1)^2 / (R_t − R_DVFS)^2    (2.12)

where R_t = V_dd,norm / V_t is constant for a given technology node and R_DVFS ∈ [1, R_t). Equation (2.12) forms the base on which our system-level DVFS model is built.
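A small numeric sketch of the latency scaling in Eq. (2.12) and its application to the macro-operation time of Eq. (2.7). Function names and units (frequency in MHz, time in nanoseconds) are illustrative assumptions.

```cpp
#include <cstdint>

// Eq. (2.12): R_L = R_DVFS * (R_t - 1)^2 / (R_t - R_DVFS)^2, where
// R_t = Vdd_norm / Vt is a technology constant and R_DVFS in [1, R_t).
double dvfs_latency_ratio(double r_dvfs, double r_t) {
    double d = r_t - r_dvfs;
    return r_dvfs * (r_t - 1.0) * (r_t - 1.0) / (d * d);
}

// The macro-operation latency of Eq. (2.7) scaled by R_L: with no voltage
// scaling (R_DVFS = 1, hence R_L = 1) this reduces to the baseline time.
double macro_op_time_dvfs_ns(double r_l, uint64_t cycles, double freq_mhz) {
    return r_l * cycles * (1000.0 / freq_mhz);
}
```

Note that R_L grows super-linearly as R_DVFS approaches R_t, matching the observation that latency does not scale linearly with the supply voltage.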
From (2.7), each macro-operation evaluates its annotated latency as the product of the clock period and the pre-characterized cycle number. This scheme assumes a constant normal supply voltage without DVFS. Applying (2.12) to (2.7), we model the DVFS effect by including R_L when evaluating each macro-operation's execution time:

t(MO_i)_DVFS = sc_time( R_L · f^-1 · cycle_num[MO_i] )    (2.13)

for macro-operation MO_i. The DVFS model is applied at the task level, and every macro-operation inside each task is assigned the same supply voltage by calling the TSE's LoadTask( ) method after it completes scheduling. If two tasks T_1 and T_2 are scheduled to execute sequentially with supply voltages V_dd1 and V_dd2 on core core_1, its TSE performs the following procedure:

core_1->LoadTask(&T1, V_dd,norm / V_dd1);
core_1->LoadTask(&T2, V_dd,norm / V_dd2);

to apply the appropriate voltages for task execution. In this model, when applying a smaller supply voltage of V_dd,norm/R_DVFS, the execution time of each macro-operation is evaluated based on (2.13). Only R_DVFS is updated during this procedure, without modification of any other parts. Application execution at the new voltage and frequency is reflected by the change of the simulated time evaluated from (2.13). The policy deciding how to change each task's supply voltage for each core in a multi-core SOC is a hot research topic and will not be addressed in this thesis. Compared with the models presented in [50], our system-level DVFS model eliminates the architecture-level evaluation, and thus accelerates the simulation of real applications.

2.3. Experimental Results on Media Applications

To evaluate the proposed PE simulation framework, we run the Mediabench [45] application benchmark on the standalone microprocessor models by annotating the macro-operations. Mediabench targets computation-intensive signal processing applications covering audio, image and video processing algorithms. We choose the GSM (audio), G721 (audio), ADPCM (audio), EPIC (image), JPEG (image), and MPEG-2 (video) applications and map them onto the PISA microprocessor core based on the MIPS architecture. Each application has both encoder and decoder sub-applications.
After characterizing PISA's macro-timing library, we annotate the macro-operations in the applications' source code and map them onto the TSE-TEE PE model. For comparison, SimpleScalar/Wattch version 1.02 is employed as a cycle-accurate ISS to run these applications using the gcc PISA cross-compiler. These applications are also executed on the host machine directly, without including any architecture information. The simulation results are shown in Table 2.3. Table 2.3 lists the original code length (number of effective lines) and the code length after timing macro-model annotation, as well as each application's run-time on the three platforms. T_host, T_ISS and T_TSE represent the run-time for each application on the host machine, the cycle-accurate SimpleScalar/Wattch and the TSE/TEE PE model, respectively. For all these five applications, executing the functional spec without any architectural information on the host machine is fastest, which is natural and straightforward. SimpleScalar performs cycle-accurate simulation on the cross-compiled machine code for each application, and its simulation speed is 6659X to 10508X slower than executing the functional spec, resulting from the fact that each micro-architectural transaction on the processor model is evaluated explicitly for high accuracy. We do not perform RTL-level simulation for these applications due to the unavailability of the target HDL models, and we believe it would be much slower than SimpleScalar. For the TEE/TSE model, although the simulation time is up to ~102.11X slower than the functional spec, due to the overhead of additional accesses to the PE object's timing LUT during execution, it demonstrates a minimum speedup of 68.39X over SimpleScalar on these applications, since the proposed model raises the level of abstraction by evaluating the external features of each macro-operation, which may consist of multiple instructions.
Also, for MPEG-2, the application with heavy file I/O, the ISS simulation speed is particularly slow compared with the macro-model due to the additional machine code implementing the file I/O library on the target microprocessor.

Table 2.3. Simulation time for Mediabench applications

Application  Mode      Original code length  Annotated code length  T_host(s)  T_ISS(s)  T_TSE(s)
ADPCM        encoding  168                   264                    0.08       645       7.39
             decoding  119                   183                    0.02       217       2.14
G721         encoding  522                   796                    0.32       2298      33.6
             decoding  507                   765                    0.22       1465      20.9
JPEG         encoding  4017                  5315                   0.12       1261      13.8
             decoding  3622                  4608                   0.02       269       2.43
EPIC         encoding  2621                  4067                   0.07       717       8.96
             decoding  2085                  3241                   0.02       216       2.12
MPEG-2       encoding  5469                  8523                   1.94       >1hr      190.14
             decoding  5159                  7977                   0.27       2597      32.43

3. ARCHITECTURALLY PARAMETERIZED NOC SIMULATION MODEL

3.1. On-Chip Interconnection: Packet-Switching Network-on-Chip (NOC)

By far, the most common way to integrate a number of IP cores on a SOC is to use a bus, which provides a shared backplane to connect the different cores. A bus is usually composed of master interfaces, slave interfaces, arbiters and the shared backplane. The cores are connected to the shared backplane through master and slave interfaces. Access to the backplane is arbitrated through the arbiter. Cores connected to master interfaces initiate the bus transfer, and the device addressed by the transfer is considered a slave. Thanks to its flexibility and easy programming model, the bus is well understood and applied effectively in numerous SOC designs [51][52]. However, one of the major drawbacks of the bus-based SOC integration approach is its scalability. The next generation of SOCs will be able to integrate more than one billion transistors on a single chip by the end of this decade [6], thanks to the ever-advancing silicon technology. Thus, future SoCs need to interconnect tens or hundreds of cores on the chip. But a bus itself is not scalable, since it is implemented by global metal wires.
In addition, as process geometries continue scaling down, the increased wire capacitance and low signal swing (<1.0V) make signal propagation slower on global wires compared with gates, a challenge further complicated by much larger chip areas. Given the above scenario, it is widely believed that the bus-based approach alone cannot address the challenges posed by the next generation of on-chip communication traffic. To create a more scalable and high-performance interconnection alternative, using a packet-switching on-chip network as the backbone for on-chip inter-module communication was proposed in [53][54]. Similar to the off-chip general-purpose network, a packet-switching interconnection network consists of routers and links that are connected in a specific network topology such as ring, tree, mesh, torus, etc. Cores communicate with each other through routers by sending and receiving data in the form of packets via the interconnecting links. The most promising feature of the packet-switching NOC is that it provides multiple communication channels for different traffic, as Figure 3.1 shows. Moreover, NOCs with regular topologies are more scalable than buses: adding cores also adds more communication paths, which in turn increases the system's total bandwidth. In this work, the OCA model targets packet-switching NOCs, abbreviated to "NOC" in this thesis.

3.1.1. Network-on-Chip Fundamentals

Before elaborating the NOC model, the basics of NOC are covered by presenting a component-based view, introducing the basic building blocks of a typical NOC. Then we extract the NOC architectural features that will be modeled in the simulation framework. Figure 3.1 shows a sample SOC with a NOC interconnect architecture structured as a 4-by-4 grid, which provides global chip-level communication.
Instead of busses and dedicated point-to-point links, a more general scheme is adopted, employing a grid of routing nodes spread out across the chip, connected by communication links. Currently we adopt a simplified perspective in which the NOC contains the following fundamental components.

Figure 3.1 Topological illustration of 4-by-4 NOC-based SOC, indicating fundamental components (PE, network interface, router, link)

--Network Interfaces (NI) implement the interface by which IP cores (also called PEs, shown as the large squares in Figure 3.1) connect to the NOC. They decouple the computation that IP cores perform from the communication that the NOC handles. With NIs, PEs do not need any information about the communication protocol, because the NI adds the NOC-related information to the data from the PE according to the network protocol before injecting it into the NOC.

--Routing nodes (Routers) route the data according to the chosen protocols from the source to the destination, and they implement the routing strategy. A router is characterized by the following 4 parameters: router type, routing algorithm, arbitration policy and buffering scheme, each of which results in different NOC performance and power features. One of the distinct features of our work is that the router is modeled in a modular fashion, by which different routers are constructed by simply configuring each parameter separately.

--Links connect the routers, providing the raw bandwidth. They may consist of one or more logical or physical channels. The data transfer from one node to another physically happens on links by toggling the voltage. In a NOC, links can be implemented as single-cycle non-pipelined or multi-cycle pipelined ones. In our work only the non-pipelined link is modeled and evaluated.

3.1.2.
Object-Oriented NoC Simulation Framework

The NOC simulation model targets the tile-based regular topologies presented in Figure 3.1 and is easy to extend to other irregular interconnect architectures. The most frequently addressed topology is a mesh, where MxN routers (switches) are organized as a 2-D MxN array. Excluding the border ones, each router has four ports, NORTH, SOUTH, EAST and WEST, to connect with its neighbors, and another LOCAL port connected to the PE, which may be a microprocessor, DSP or ASIC. Each router and its interconnected PE is called a tile, and each tile is connected with its neighbors through two unidirectional links. By changing the border routers' connecting properties, three other derived topologies, torus, X-torus and Y-torus, are modeled as well in this simulation framework. Abstracted from the low-level circuit implementation, the NOC is composed of two functional components: routers and interconnecting links. The NOC essentially contains no information about PEs, which only perform computation tasks and inject their data into the network for communication services. To make the framework conform to the scheme of separating computation and communication, the NOC model is PE-independent and only handles communication requests. To verify the NOC functionality, traffic generator/receiver models are used to emulate PEs by generating specific data traffic. A traffic generator creates packets compatible with the communication protocol and injects them into the NOC to emulate the communication traffic from a PE, and the traffic receiver model does the reverse work, evaluating the NOC's performance and power consumption. Employing this method, the PE and NOC models can be developed and verified independently, which also eases their integration. Similar to the PE, the NOC is modeled through the object-oriented methodology with reconfigurable parameters in SystemC TL-1 (transaction-level 1).
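The traffic generator/receiver pair used to exercise the NOC in isolation can be sketched in plain C++ as below. The packet struct, field names and nanosecond units are illustrative assumptions; the real models produce protocol-compliant packets and drive the SystemC NOC channels.

```cpp
#include <cstdint>
#include <vector>

// A bare-bones "packet" carrying only what the generator/receiver pair needs.
struct Pkt { uint8_t src, dst; double inject_ns, arrive_ns; };

// Generator side: emit n packets from src to dst, one per period.
std::vector<Pkt> generate_traffic(uint8_t src, uint8_t dst, int n, double period_ns) {
    std::vector<Pkt> out;
    for (int i = 0; i < n; ++i)
        out.push_back(Pkt{src, dst, i * period_ns, 0.0});
    return out;
}

// Receiver side: average per-packet latency over the delivered traffic,
// the basic performance metric the receiver model accumulates.
double average_latency_ns(const std::vector<Pkt>& delivered) {
    double sum = 0.0;
    for (const Pkt& p : delivered) sum += p.arrive_ns - p.inject_ns;
    return delivered.empty() ? 0.0 : sum / delivered.size();
}
```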
The motivation for TL-1 modeling is that pin-accurate information, which needs delta-cycle evaluate/update, does not need to be considered in the system-level model. The class diagram of the NOC OCA is shown in Figure 3.2(a), where each class is described by a solid rectangle, with hollow-arrow and black-diamond lines representing the inheritance and association relations between classes, respectively. The dotted rectangles are the SystemC primitive classes used to construct the OCA model. The NOC object is instantiated from the top NoC_TL1_OCA class, which associates two child classes: NoC_TL1_Router_(param) and NoC_TL1_Fifo_Channel. param is a set of router architectural parameters characterizing the router, and different NOC OCAs can be constructed by instantiating the NOC router with different architectural parameters. Each router parameter corresponds to a specific routing functionality, which will be detailed in section 3.2.

Figure 3.2(a) SystemC class diagram of NOC OCA model
Figure 3.2(b) NOC OCA instantiation by channel binding

NoC_TL1_Router_(param) models the router by implementing the packet forwarding functionality through routing member functions, while NoC_TL1_Fifo_Channel models the interconnecting links by associating the m_fifo class, which inherits from the sc_fifo SystemC primitive channel. In addition to the m_fifo class, NoC_TL1_Fifo_Channel also contains connectivity information to interact with the upper NoC_TL1_OCA class for connecting with routers' ports. NoC_TL1_Router_(param) inherits from the sc_module SystemC class to provide functionality encapsulation and associates with the sc_port class to exchange data with external modules. The NOC model is parameterized by its topology, routing algorithm, arbitration policy, and buffering scheme, of which the last three are the router's architectural parameters. A NOC is created by instantiating the NoC_TL1_OCA class object as shown in Figure 3.2(b), where a 4x4 mesh is initialized with wormhole routers using only input buffers, a fixed-priority arbitration policy and the XY-deterministic routing algorithm. The OCA initialization process calls the constructors of the associated NoC_TL1_Router and NoC_TL1_Fifo_Channel classes to create the router and link objects, and then interconnects them according to the topological information. Connecting a port P on router R with a link L is implemented by binding P with L, as shown in Figure 3.2(b). If another router R1 has its port P1 bound with L, these two routers are connected logically and data can be transmitted between the two entities via L. For a regular MxN mesh NOC, there are MN router objects and 2(2MN-M-N) link objects instantiated before simulation.
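The router and link bookkeeping for a regular mesh can be sketched as two small counting functions; the function names are illustrative. Each adjacent tile pair contributes two unidirectional links, giving 2(2MN-M-N) links in total, consistent with the 16 routers and 48 channels instantiated for the 4x4 mesh in Figure 3.2(b).

```cpp
// Routers in an M x N mesh: one per tile.
int mesh_router_count(int m, int n) { return m * n; }

// Unidirectional links in an M x N mesh: two per adjacent tile pair.
int mesh_link_count(int m, int n) {
    int adjacencies = m * (n - 1) + n * (m - 1);  // horizontal + vertical pairs
    return 2 * adjacencies;                       // one link per direction
}
```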
3.2. System-level Model of On-chip Router

Similar to the general-purpose network, the NOC communicates information in the form of packets, where a packet integrates a chunk of data with some additional information (source, destination, CRC, etc.), and routing is performed on this data chunk. In a NOC, each packet is transferred and received in units of flits, the minimum data block transmitted on the wires of the network. A packet is composed of two parts: packet header and packet payload. As shown in Figure 3.3(a), the packet header is also the header flit, and it contains the source address, destination address, and effective flit number. The payload is the actual data from the PE that requests communication service. Our system-level NOC model also conforms to this packetization scheme, and defines a dedicated PACKET class associated with a FLIT class. Both the PACKET and FLIT classes have an sc_time-type field latency to record the packet or flit's latency, defined as the time between the injection of each packet/flit from the source PE and its arrival at the destination PE. A PACKET object can consist of multiple FLIT objects instantiated according to the packet length.

Figure 3.3(a) NOC packet structure (a header flit 0xFF | src_addr | dest_addr | eff_flit_num followed by 32-bit payload flits data_0 through data_7)

Figure 3.3(b) Class diagram of NOC packet/flit model

A router is a NOC component that performs the routing algorithm for packet delivery.
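A plain C++ sketch of the packetization scheme above: a header carrying source address, destination address and effective flit count, followed by 32-bit payload flits. The field layout loosely mirrors Figure 3.3(a), but the exact widths, the factory function and the nanosecond latency field (standing in for the sc_time field) are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One flit: the minimum data block transmitted on the network wires.
struct Flit {
    uint32_t data;
    double latency_ns;  // stands in for the sc_time latency field
};

// A packet: header information plus a payload of flits.
struct Packet {
    uint8_t src_addr = 0, dest_addr = 0;
    std::vector<Flit> payload;

    static Packet make(uint8_t src, uint8_t dst, const std::vector<uint32_t>& words) {
        Packet p;
        p.src_addr = src;
        p.dest_addr = dst;
        for (uint32_t w : words) p.payload.push_back(Flit{w, 0.0});
        return p;
    }
    // Effective flit number recorded in the header flit.
    std::size_t eff_flit_num() const { return payload.size(); }
};
```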
As described in Figure 3.4, a router has one local port connected with a PE and four routing ports connected with its neighboring routers to form the interconnection network. Inside a router, three functional modules cooperate to route a packet from an input port to the appropriate output port: the input buffers, the crossbar, and the output arbiters. In addition, each output port has an optional output buffer for better flow control. Among these, the input buffer is implemented in conjunction with the link model, which will be described later. Each router is configured with a specific routing algorithm defining the path taken by a packet between the source and the destination; the algorithm must prevent deadlock, livelock, and starvation. In this work, we only consider wormhole routers, where the multiple flits of a packet spread through a set of routers from the source to the destination during operation.

Figure 3.4 Generic NOC router architecture: five input buffers Ibuffer[LOCAL/NORTH/SOUTH/EAST/WEST] feed routing-state units RS[*] and a full crossbar; each output port has an arbiter and an optional output buffer (OBuffer)

A NOC router is modeled by the sc_module class associated with input and output ports instantiated from the sc_port class. The router processes the incoming packets from its five buffered input ports concurrently, since these ports may receive data simultaneously, and each packet is routed to the appropriate output port through a 5x5 full crossbar controlled by the specified routing algorithm. Five optional output buffers bridge the outputs of the crossbar and the output ports to provide better throughput at the price of additional area.
Each output port is arbitrated by an output arbiter to handle the situation where multiple packets from different input ports are routed to the same output port. To model the router's inherent concurrency, we employ the SystemC concurrent process sc_thread to model packet routing. For the 5-port router in this work, five parallel sc_thread processes, each invoking the corresponding routing handler, are hooked up for each router during instantiation. The routing function route( ) emulates the crossbar operations on one input port: it performs a non-blocking read of a packet/flit from its input port, computes the route according to the assigned routing algorithm, and returns the calculated output port for this packet. As mentioned before, a NOC router is characterized by four architectural parameters: router type, buffering scheme, arbitration policy, and routing algorithm. Only the last three parameters are actually taken into consideration, since our work mainly models the wormhole router, which is used extensively in NOC research. The (param) field in each router class is the permutation of these fields, <Buffer>_<Arbiter>_<Routing>, where <Buffer>, <Arbiter>, and <Routing> specify the input/output buffering scheme, the arbitration policy, and the routing algorithm, respectively. Each parameter is modeled by a dedicated set of classes/function calls in the router class, and these three parameters are completely orthogonal. Our methodology creates a router template model that implements only the architecture-independent features, such as input port scanning and routed packet/flit forwarding. Different router architectures are modeled explicitly by plugging in the architecture-dependent class members/methods supporting the features designated by the three aforementioned architectural parameters. Because these three architectural parameters are architecturally orthogonal, each one can be changed independently without requiring a corresponding change to another.
Categorizing the orthogonal architecture parameters enables system designers to explore the design space more efficiently.

3.2.1. Input/Output Buffer Model and its Architectural Parameter Buffer

As in the general routers used in conventional networks, buffering provides temporary data storage when the requested hardware resource is busy with previous or higher-priority requests, avoiding data loss and network congestion. The on-chip router modeled in this work cannot afford a large amount of SRAM/register-based buffering because of area and power considerations; the router's maximum buffer size is 32x32 bits.

Figure 3.5 NOC on-chip router FIFO architecture: a dual-port SRAM array with Din/Dout data ports of width S_flit, a row-address decoder driven by read and write counters, and flag logic generating the full/empty indicators

The buffer in the on-chip router is architected as a First-In First-Out (FIFO) queue containing one read port and one write port with two sets of read/write control logic. Figure 3.5 shows the block diagram of a typical dual-port SRAM-based N x S_flit FIFO that can store at most N flits (words), each containing S_flit bits. On a write (read), the flit is written into (read from) the appropriate row of the SRAM, with the row indexing based on the address generated from a write (read) counter. In the system-level model, the architectural and circuit-level details are not modeled elaborately, since from the system designer's point of view the FIFO is treated as a black box with specified input/output properties. Functionally, the on-chip FIFO buffer is modeled via two methods in our work: an event-driven method based on a hierarchical channel built upon the sc_fifo primitive channel, and a non-event-driven method based on the standard C++ template class deque.
Here only the functional modeling methodology is presented; the performance and power models will be detailed in chapter 3.4 and chapter 4, respectively. The key to modeling FIFO behavior is supporting reads and writes in first-in first-out fashion, as the name indicates. Figure 3.6(a) graphically presents the FIFO modeled with the event-driven feature via a hierarchical channel Ch_fifo, which is essentially a SystemC module. A primitive sc_fifo channel Ch_pf (or its derived class) is instantiated inside Ch_fifo along with two ports P_in and P_out, which have opposite access polarities. P_in is templated with the sc_fifo_in_if interface, virtually declaring the non-blocking/blocking write( ) function, and similarly P_out is templated with the sc_fifo_out_if interface, virtually declaring the non-blocking/blocking read( ) function. Both the read( ) and write( ) implementations are provided by the encapsulated sc_fifo channel Ch_pf. During simulation, data is written to and read from the FIFO model by means of remote procedure call (RPC), where the actual read/write operations do not belong to the module that calls them. Moreover, Ch_pf's length and data width are both reconfigurable at module elaboration time, and the data_written_event( ) and data_read_event( ) member functions return the write and read events to the input and output ports connecting the channel, respectively. The read/write operations are therefore always triggered by these two events, either implicitly (non-blocking read/write) or explicitly (blocking read/write). The storage element is provided by the sc_fifo channel Ch_pf, which encapsulates a FIFO data structure that only allows the data written first to be read out first.

Figure 3.6(a) SystemC sc_fifo-based FIFO model: the hierarchical channel Ch_fifo encapsulates templated storage elements between ports P_in and P_out, connecting Module A and Module B
Figure 3.6(b) C++ deque-based FIFO model: templated storage elements accessed through push_back( ) and pop_front( )

Another way of modeling the on-chip router buffer is the C++ template class deque, which supports different queue data structures with built-in encapsulation and is not event-driven. The details of the deque class template can be found in several references [118][119], but no previous work had employed it to model the on-chip router. Figure 3.6(b) shows the on-chip output buffer OBuffer modeled by the deque template class. The key member functions used in modeling the FIFO behavior are push_back( ) and pop_front( ); the former writes data to the tail of the queue structure and the latter reads data from the head. As in the first method, the FIFO size and data width are both templated, so the model can support a broad range of FIFOs with different sizes and data widths. The user explicitly controls the write/read operations on the FIFO without any indication of events. The difference between the FIFO model architectures in Figures 3.6(a) and (b) is that the sc_fifo-based FIFO model is implemented as a SystemC module having input and output ports, while the deque-based FIFO model does not have any hierarchy. The sc_fifo-based FIFO model is applied when interconnections need to be modeled explicitly, since its hierarchy with input/output ports provides the connecting interface, whereas the deque-based model represents a FIFO without explicit port-level interconnection. In the on-chip router model shown in Figure 3.4, the input buffer is modeled via the event-driven method using the sc_fifo-based hierarchical channel, since data reception is triggered by monitoring the data-writing event on another router's output port, which connects to this router's input port via an on-chip link.
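The deque-based buffer can be sketched as a small templated wrapper (a plain-C++ approximation of the OBuffer described above; the class and method names here are illustrative, not the dissertation's):

```cpp
#include <deque>
#include <cstddef>

// Bounded FIFO modeled on std::deque: push_back() writes to the tail,
// pop_front() reads from the head, giving first-in first-out order.
// No events are involved; the caller controls every read/write.
template <typename T>
class DequeFifo {
public:
    explicit DequeFifo(std::size_t depth) : depth_(depth) {}

    bool write(const T& flit) {              // returns false when full
        if (q_.size() >= depth_) return false;
        q_.push_back(flit);
        return true;
    }

    bool read(T& flit) {                     // returns false when empty
        if (q_.empty()) return false;
        flit = q_.front();
        q_.pop_front();
        return true;
    }

    std::size_t size() const { return q_.size(); }

private:
    std::size_t depth_;
    std::deque<T> q_;
};
```

The boolean return values stand in for the non-blocking semantics; the event-driven sc_fifo variant would instead block or wait on data_written_event/data_read_event.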
The optional output buffer in the on-chip router is modeled by the non-event-driven method, since its write/read access is controlled by the crossbar/output arbiter without monitoring any events. Another distinct difference between these two FIFO modeling schemes is their simulation speed. Because of the complexity of the internal hierarchy of the sc_fifo-based buffer, as well as its event-driven evaluate/update simulation scheme, its simulation speed is much slower than the non-event-driven one: experiments show that the event-driven FIFO is ~27.5X slower. In summary, the NOC router's on-chip buffer has two architectural configurations, Ibuf and IObuf:

(1) Ibuf indicates that the router is buffered only on the input port. The input buffer is modeled via the event-driven method using the primitive channel, and the buffer size is configured by the channel class constructor during object instantiation.

(2) IObuf indicates that the router has both input and output buffers. The input buffer is modeled by the event-driven method, and the output buffer is modeled by the non-event-driven scheme via the deque<T> template class, where the template parameter T specifies the FIFO width.

3.2.2. Output Arbiter Model and its Architectural Parameter Arbitration

The output arbiter performs arbitration over multiple active requests routed to the same output port for flow-control purposes. In our work, we propose a unified arbitration model based on the concepts of a priority list (PL) and a priority updating scheme (PUS), built on a mathematical representation of the arbitration process.

[Priority List (PL) Model] Given an N-tuple input port set IP = {IP_0, IP_1, ..., IP_{N-1}} and an M-tuple output port set OP = {OP_0, OP_1, ..., OP_{M-1}}, for each OP_i (i = 0, 1, ..., M-1) there exists an IP permutation set A = P(IP)|_{OP_i} = {IP_i, IP_j, ..., IP_k}, with i, j, k in [0, N-1] and i != j != k, representing output port OP_i's priority list.
OP_i then has the priority ordering p(Arr(A)[0]) > p(Arr(A)[1]) > ... > p(Arr(A)[N-1]), where Arr(A) is the transformation of the permutation set into an array indexed by position from 0 to N-1, and p(d) represents the priority of the dth element in the array Arr(A). If an output port has the IP permutation set A = {IP_i, IP_j, IP_k, ..., IP_l}, the requests coming from input port d cannot be serviced until all the requests from input ports t with smaller positions pos(t) in Arr(A), i.e., higher priorities, have been serviced.

The architectural parameter <Routing> takes its value from {XY, WF, NL}.

Figure 3.9 XY-routing on a 4x4 mesh

(1) XY indicates that the route is calculated by the XY-deterministic routing algorithm. Assuming that each router in the NOC has a two-dimensional address (x, y) representing its location, the header flit of each packet contains its destination address (xd, yd). The XY-routing algorithm first routes the packet in the X direction until reaching the xd address, and afterwards in the Y direction to reach the destination, as shown in Figure 3.9. The router computes a new route only when receiving a header flit; body flits follow the pre-computed route once the output port is granted. The XY-routing algorithm is very area efficient for hardware implementation, with a micro-architecture called the Route Calculation Unit (RCU) containing only two 4-bit comparators and a 5-entry ROM table, with additional latches for data storage, shown in Figure 3.10(a). Figure 3.10(b) presents the routing control logic architecture containing five XY-routing RCUs associated with the router's five input ports. Each RCU drives a direction signal selecting the appropriate pass gates to let the routed flits traverse the router shown in Figure 3.4. The crossbar control logic also orchestrates the five RCUs so that no two RCUs select the same output port.
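The XY route computation above can be sketched as a pure function (a plain-C++ stand-in for the RCU behavior; port names follow Figure 3.4, the function name is illustrative, and +Y is assumed to point NORTH):

```cpp
// Output port selected by XY-deterministic routing: travel along X
// until the destination column is reached, then along Y, then deliver
// to the local PE.
enum Port { LOCAL, NORTH, SOUTH, EAST, WEST };

Port xy_route(int x, int y, int xd, int yd) {
    if (xd > x) return EAST;
    if (xd < x) return WEST;
    if (yd > y) return NORTH;   // assumption: +Y is NORTH
    if (yd < y) return SOUTH;
    return LOCAL;               // arrived at (xd, yd)
}
```

Routing from (0,0) to (3,2) yields EAST at every hop until x reaches 3, then NORTH until y reaches 2, matching the dimension-ordered paths of Figure 3.9.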
The FSM inside the crossbar control logic also monitors the arbitration status to decide whether to start a new packet's route computation and data forwarding. In addition to its low gate count, the XY-routing algorithm gives the micro-network the "in-order delivery" property, since every flit in a packet is routed on the same path.

Figure 3.10 Crossbar control of the XY-deterministic router: (a) the RCU for the XY-routing algorithm, using two 4-bit comparators (Addr_x vs Dest_x, Addr_y vs Dest_y) and a 5-entry ROM table mapping the comparison results to the five output directions; (b) the crossbar control logic with five RCUs, one per input buffer, and a control FSM, fed back from the output arbiters, driving the gates of the crossbar's passing transistors

(2) WF indicates that the route is calculated by the West-First adaptive routing algorithm, which can selectively choose the route depending on both the input packet's information and the current traffic status. As shown by the dotted lines in Figure 3.11, turns where the packet comes from the Y direction are forbidden, to prevent deadlock. In the West-First algorithm, if xd <= x, packets are routed deterministically as in the XY algorithm; if xd > x, packets can be routed adaptively in the East, North, or South directions depending on traffic conditions.
Figure 3.11 WF-routing paths on a 4x4 mesh

In our work, we use the output port's average throughput Th as the traffic metric when determining which output port grants a packet/flit in the WF-adaptive routing algorithm. The average throughput is defined as the amount of data (in bytes) traversing the output port per unit time. Given the NOC starting time t and the current time stamp tc, and assuming N_p packets have been recorded traversing output port OP_i, with packet size S_p (flits) and flit size S_flit (bits), the average throughput of this port is calculated as:

    Th_{OP_i} = (N_p x S_p x S_flit) / ((tc - t) x 8)   bytes/s

Of the five output ports LOCAL, NORTH, SOUTH, WEST, and EAST, Th_LOCAL and Th_WEST are not evaluated, since the LOCAL and WEST ports cannot be routed adaptively in the WF routing algorithm. The non-WEST output of the adaptive routing algorithm is the port with the minimum average throughput:

    OP_WF = WEST,                                          if xd <= x
    OP_WF = k such that Th_k = min{Th_p},                  otherwise, p in {NORTH, SOUTH, EAST}

where {Th_p} is the 3-element array of throughputs for the remaining routable output ports NORTH, SOUTH, and EAST. If two or more Th_p values are equal, the output port with the highest priority is selected and then rotated to the lowest priority, as the FS arbitration algorithm in chapter 3.2.2 performs. An adaptive routing algorithm has the benefit of automatically monitoring the current traffic and adjusting packets' routes dynamically to avoid intensive access to a fixed output port, but the adaptability comes at the price of additional hardware logic. Figure 3.12(a) shows the micro-architecture of the WF route calculation logic. Three 32-bit counters C1, C2, and C3 record the number of passing packets on the NORTH, SOUTH, and EAST output ports, and a min-max logic block L evaluates the minimum value among the counters' outputs to determine the port with the minimum average throughput.
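The throughput metric and the WF port selection can be sketched together in plain C++ (all names are illustrative; the FS tie-break rotation is simplified to "earlier, higher-priority port wins", and the selection follows the simplified formula in the text):

```cpp
// Average throughput of an output port in bytes/s: np packets of
// sp flits each, every flit sflit bits wide, observed over (tc - t) s.
double avg_throughput(long np, long sp, long sflit, double t, double tc) {
    return (double)np * sp * sflit / ((tc - t) * 8.0);
}

enum Port { LOCAL, NORTH, SOUTH, EAST, WEST };

// West-First selection: deterministic when xd <= x, otherwise the
// adaptively routable port {NORTH, SOUTH, EAST} with minimum throughput.
Port wf_route(int x, int xd, double th_north, double th_south, double th_east) {
    if (xd <= x) return WEST;
    Port ports[3] = {NORTH, SOUTH, EAST};
    double th[3] = {th_north, th_south, th_east};
    int best = 0;
    for (int i = 1; i < 3; ++i)
        if (th[i] < th[best]) best = i;  // ties keep the earlier (higher-priority) port
    return ports[best];
}
```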
A 4-bit comparator CP compares the input packet header's x-direction address with this router's. If xd <= x, CP outputs '0', which selects the WEST route; otherwise, CP outputs '1', selecting the 2-bit output of L that chooses one of the other three possible routes {NORTH, SOUTH, EAST}. Compared with the deterministic routing logic, the WF adaptive logic adds three 32-bit counters, which contribute the largest portion of the routing logic.

Figure 3.12 Crossbar control of the WF-adaptive router: (a) the RCU for the WF-routing algorithm, where three 32-bit counters feed min-max logic selecting among NORTH (00), SOUTH (01), and EAST (10), and a 4-bit comparator selects WEST when xd <= x; (b) the crossbar control logic with five RCUs sharing the three counters, plus a control FSM fed from the output arbiters

To save on-chip router area, the three 32-bit packet-number counters are shared among the five RCUs in the router. The overall routing control logic is shown in Figure 3.12(b). As in Figure 3.10(b), the output arbiter's information is provided to the FSM to schedule the route computation. During the routing calculation process, all three counters are reset whenever any one of them overflows, to prevent wrong comparison results. Also, if two or more counters hold the same count value, the output with the highest priority is selected.
Figure 3.13 NL-routing paths on a 4x4 mesh

(3) NL (North-Last) is another adaptive routing algorithm supported by this model; Figure 3.13 shows its possible routing paths. In the NL algorithm, if yd <= y, packets are routed deterministically, and if yd > y, packets can be routed adaptively in the WEST, EAST, or SOUTH directions. Its architecture is similar to that of the WF route-computing unit shown in Figure 3.12; the difference is that the average throughputs of WEST, EAST, and SOUTH are calculated instead of those of NORTH, SOUTH, and EAST.

3.3. System-level Model of Interconnecting Links

The link model abstracts the physical signal transmission on the global metal interconnecting wires of the NOC shown in Figure 3.1. From the system-level point of view, a link can be considered a process that only receives/sends data from/to its connected routers. Currently, a link is modeled by inheriting from the SystemC sc_fifo primitive channel with customized read( ) and write( ) methods. These two methods override the virtual read/write functions declared in the sc_fifo_if class, from which sc_fifo inherits. Since sc_fifo_if is associated with the router's input/output ports, the link and the router's port are logically connected by the channel, which provides a FIFO data structure manipulated by the channel's read( ) and write( ) methods. Consider the case in Figure 3.14 where a flit F is sent from router A's output port 1 to router B's input port 1: the process in A calls its port 1's virtual function write(F), which in turn invokes the implementation of write(F) in the m_fifo channel inheriting from sc_fifo. The FIFO is updated by putting F at the end of its queue. Treating this FIFO as the input buffer of port 1 on router B, writing data into the FIFO from A models the arrival of data from router A at router B.
At this point, router B may not process this data directly, since it still resides in the input buffer. B needs to invoke the read( ) associated with its input port 1 to receive F, and the FIFO pops its first element into B's local data member flit. Applying this method similarly, A writes a sequence of data to B by invoking port1->write( ) multiple times, and B receives the data in the same sequence by repeatedly calling port1->read( ). Through this remote procedure call (RPC) method, inter-module communication is modeled by explicit read/write function calls.

Figure 3.14 Reuse of the sc_fifo-based channel for both the input buffer and link models: modules A and B communicate through the link/m_fifo channel via Pin->write(flit) and Pout->read( )

Since the data transfer behavior of links is modeled via the channel's RPC read/write, which is also used for modeling the router's input buffer, we share the sc_fifo-based event-driven channel between the router's input buffer model and the interconnecting link model. The rationale for reusing the same model for these two different NOC functional blocks is that the sc_fifo-based channel can be viewed as the link between ports P1 and P2, while the storage elements encapsulated in sc_fifo are treated as the input buffer of the router's P2 port, as shown in Figure 3.14. The RPC call of write( ) from P1 models port P1 injecting data onto the link, and this write's associated data_written_event( ) indicates that the data has been received by P2's input buffer. So when annotating the latencies of the link and the router's input buffer, the overloaded sc_fifo write( ) method executes a delay function with the argument (t_wr + t_link), where t_wr and t_link are the input buffer's write latency and the link's latency, respectively. The input buffer's read latency, on the other hand, is modeled by overloading sc_fifo's read( ) method.
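Stripped of SystemC events, the shared buffer/link latency annotation reduces to charging (t_wr + t_link) on every write and the buffer read latency on every read; a plain-C++ sketch (all names illustrative, time kept as a plain double in ns rather than sc_time):

```cpp
#include <deque>

// A flit carrying its accumulated latency, as the FLIT class does.
struct TimedFlit {
    int data;
    double latency_ns;
};

// A link/input-buffer channel that annotates flit latency instead of
// calling SystemC wait(): write() charges buffer-write plus link delay,
// read() charges the buffer read latency, mirroring the overloaded
// sc_fifo methods described in the text.
class LatencyLink {
public:
    LatencyLink(double t_wr, double t_link, double t_rd)
        : t_wr_(t_wr), t_link_(t_link), t_rd_(t_rd) {}

    void write(TimedFlit f) {
        f.latency_ns += t_wr_ + t_link_;   // delay charged on injection
        q_.push_back(f);
    }

    TimedFlit read() {
        TimedFlit f = q_.front();
        q_.pop_front();
        f.latency_ns += t_rd_;             // input-buffer read latency
        return f;
    }

private:
    double t_wr_, t_link_, t_rd_;
    std::deque<TimedFlit> q_;
};
```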
In the current NOC model, we assume that a packet cannot be delivered until its predecessor has been sent successfully. Applying the blocking read/write methods provided by sc_fifo, a simple lossless flow control is modeled. In short, once invoked, an input port's blocking read does not return until the data is successfully read from the connected FIFO, and similarly for its blocking write. While a port's blocking read/write is waiting for data, it allows other concurrent processes to continue executing by using the wait( ) method for synchronization. Instead of an explicit timed sc_time value, wait( ) is parameterized by an sc_event variable data_ready, indicating that the FIFO has effective data for reading, or enough space for writing, in the blocking read/write methods. Lossless delivery is significant in NOC because the on-chip micro-network does not have the area to implement the complex error-correcting algorithms used in general-purpose network infrastructures.

3.4. Latency Model of NOC Routers and Links

In NOC, router and link delay (latency) is another important design metric that must be taken into account, because it contributes significantly to whole-system performance. Router latency is defined as the time interval between a flit's injection at an input port and its departure from an output port, while link delay is the time a flit spends traversing the link. With the decreasing transistor feature size of advancing technologies, routers are becoming smaller with faster operating frequencies, and in turn incur less latency. On the contrary, global interconnecting wires scale at a much slower rate than transistors, so link latency on the global wires will increase as a result of the increased unit-length resistance and capacitance. Also, the increasing aspect ratio (AR) of deep-submicron wires makes the inter-wire coupling capacitance a significant contributor to wire delay.
This work proposes a system-level modeling methodology for router and link delay and its integration with the NOC simulation framework through high-level function calls.

3.4.1. On-chip Router's Latency Model

First consider the router's latency model for the generic router architecture shown in Figure 3.4. For each flit passing through this router, its latency depends on both the router's architecture and the traffic status. Here only the router's intrinsic latency resulting from its architectural components is explicitly modeled; the latency resulting from network congestion is implicitly modeled by sc_fifo's blocking read/write function calls. For a router consisting of an input buffer, a crossbar, and an arbiter, the latency of a flit traversing it is estimated by:

    L_flit,router = L_r,buffer + L_r,crossbar + L_r,arbiter   (3.1)

where L_r,buffer, L_r,crossbar, and L_r,arbiter are the latencies of the router's input buffer, crossbar, and arbiter, respectively. For the input buffer, most available on-chip routers use a dual-port SRAM array for fast access. In this case, the latency contributed by the input buffer can be modeled as:

    L_r,buffer = t_SRAM,read   (3.2)

where t_SRAM,read is the read latency of the dual-port SRAM array. At the system level, we do not go into the SRAM's detailed internal architecture to evaluate its latency, but instead characterize a set of memory instances with different sizes and bit widths to generate a lookup table. Whenever a memory access happens in the input buffer, the simulator performs a table lookup according to the router's input buffer size and data width, and invokes the wait( ) method with the extracted latency value as its argument to emulate the SRAM read delay.
Since the input buffer read is modeled by calling the corresponding input port's read( ) virtual function, which is implemented by read( ) in the link object this port is connected with, inserting a wait( ) call parameterized by the extracted SRAM latency l_sram postpones the data-receiving operation by l_sram amount of time, as described in Figure 3.15.

Figure 3.15 Input buffer's latency model: a read command issued at t ns returns a flit ready for routing at t + l_sram ns. The overloaded read( ) is sketched as:

    while (num_available() == 0)
        wait(m_data_written_event);
    m_num_read++;
    wait(sc_time(l_sram, SC_NS));  // delay function
    buf_read(val_);
    request_update();

We use the Artisan memory generator [55] for the IBM 7RF 0.18um CMOS process to create and characterize memory instances. The input buffer size ranges from 16 words to 2K words, and the data width spans from 32 to 256 bits. The characterized FIFO read latencies are shown in Table 3.1:

Table 3.1. Characterized read latency (ns) for FIFOs in the IBM 0.18um process

    Data width \ RAM size (words):  32     64     128    256    512    1024   2048
    32-bit                          1.118  1.123  1.134  1.155  1.198  1.283  1.453
    64-bit                          1.311  1.316  1.327  1.348  1.390  1.475  1.645
    128-bit                         1.696  1.702  1.712  1.734  1.776  1.861  2.031
    256-bit                         1.914  2.008  2.207  2.325  2.342  2.428  2.576

All of these characterization data are stored as tables for the router model to access during simulation; each (RAM size, data width) pair corresponds to a characterized memory access latency. In a NOC router, the latency resulting from the crossbar is defined as the time between a flit's arrival at the crossbar's input and its departure from the crossbar's output. In the 5x5 crossbar shown in Figure 3.8, each cross-point between input and output lines is connected by an NMOS pass transistor (or a transmission gate consisting of an NMOS and a PMOS). Each transistor's gate is controlled by the routing logic unit implementing the routing function.
The crossbar latency is composed of two components: the route computation time (t_route_comp) and the crossbar traversal time (t_cbar). In wormhole routing, where each packet is decomposed into fixed-size flits for transmission, the router uses the header flit containing the address information to compute the route, which is then applied to all successive flits in the packet. Once the route for a packet is evaluated, it is reserved for all flits belonging to that packet. The header flit cannot traverse the crossbar until the route computation finishes, while the other flits simply follow the same computed route. Based on this, the latency in the crossbar can be estimated as:

    L_r,crossbar = t_route_comp + t_cbar,  for the header flit   (3.3)
    L_r,crossbar = t_cbar,                 for the body flits

Instead of developing a micro-architectural crossbar latency model, we employ the published parameterizable crossbar model proposed in [56] to estimate the crossbar's inherent delay. Let p be the number of ports in the crossbar and w the channel width of each port; the crossbar's traversal time can then be evaluated approximately by:

    t_cbar(p, w) = [9 log8(wp/2) + 6 log2(p) + 6] * tau4   (3.4)

where tau4 is the delay of an inverter with a fanout of 4. The route computation time t_route_comp is characterized by running timing simulation of the synthesized routing algorithm, written in Verilog, with the IBM 0.18um standard cell library, and we perform scaling to estimate t_route_comp at other technology nodes with different feature sizes. The arbiter performs arbitration when there are multiple routing requests to the same output port. The parameterized arbiter model in [56] is applied to estimate the arbitration latency when an arbitration request occurs. Compared with the crossbar latency model in (3.4), the arbiter latency is independent of the channel width and depends only on the crossbar's port number p:

    t_arbiter(p) = [(21/2) log2(p) + 14.5] * tau4   (3.5)
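The header/body distinction of (3.3) combines with the total of (3.1) as a small pure function; a sketch in plain C++ (names and the sample latency values are illustrative, times in ns):

```cpp
// Per-flit router latency per equations (3.1)-(3.3): every flit pays the
// input-buffer, crossbar-traversal, and arbiter delays, and the header
// flit additionally pays the route computation time.
double router_flit_latency(bool is_header,
                           double t_sram_read,    // L_r,buffer, eq. (3.2)
                           double t_route_comp,
                           double t_cbar,
                           double t_arbiter) {
    double l_crossbar = is_header ? t_route_comp + t_cbar : t_cbar;  // (3.3)
    return t_sram_read + l_crossbar + t_arbiter;                     // (3.1)
}
```

For example, with the Table 3.1 value t_sram_read = 1.118 ns, a header flit's latency exceeds a body flit's by exactly t_route_comp.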
3.4.2. Link Latency Model

At the system level, the link's delay model is similar to the router's: the wait( ) method is called to emulate the delay when a flit passes through an interconnecting link. Since the physical signal transmission on links is abstracted into the channel's write( ) function, modeling delay on links requires inserting the link's delay function into the write( ) method. As shown in Figure 3.16(a), when the output driver on the source router A writes a flit F to router B through link L, the transmission of F on the link is modeled by moving F from A into L's FIFO data structure. Each flit has a data field f_latency to store the latency it experiences when traversing the network, and updating f_latency while invoking write( ) intuitively models the link's delay. As the pseudo-SystemC code in Figure 3.16(a) describes, the write( ) execution is "halted" for link_delay by inserting the wait( ) function before writing the flit F into the FIFO, which is modeled as router B's input buffer.

Figure 3.16(a) Interconnecting wire's delay model: F is sent at t ns and received at (t + link_delay) ns. The overloaded write( ) is sketched as:

    while (num_free() == 0)
        wait(m_data_read_event);
    m_num_written++;
    wait(sc_time(link_delay, SC_NS));  // delay function
    buf_write(val_);
    request_update();

The link_delay parameter used in the wait( ) call is calculated by the simplified 1st-order wire RC delay model shown in Figure 3.16(b). In the current wire model, we do not take the input and output drivers into consideration, for simplicity. The whole global wire of length L is divided into small segments, each of length L_seg, by repeaters inserted for lower latency and better driving capability. Different repeater sizes and locations result in different wire latencies, and we assume repeaters of the same size are distributed uniformly along each interconnecting wire. Following Heo's wire model [57], the wire delay is represented as a function of the repeater sizing r, defined as the ratio of the repeater gate capacitance to the wire capacitance within a wire segment:

Figure 3.16(b) 1st-order wire RC delay model: repeaters of size (beta+1)w with drain and gate capacitances Cd(beta+1)w and Cg(beta+1)w drive wire segments of capacitance Cw*l/2

    r = w(beta + 1)Cg / (Cw * L_seg)   (3.6)

The wire delay is modeled by two components, the RC delay from the repeaters and that from the metal wires:

    D_R = 0.7 * (L/L_seg) * [Rd(beta + 1)(Cd + Cg) + L_seg * Rw * w(beta + 1)Cg]   (3.7)

    D_W = 0.7 * (L * Rd * Cw)/w + 0.7 * L * L_seg * (Rw * Cw)/2   (3.8)

and the total link delay is estimated by adding up these two components:

    link_delay = f(L, L_seg) = D_R + D_W   (3.9)

where Rw and Cw are the unit-length wire resistance and capacitance; Cd and Cg are the drain and gate capacitances of a minimum-size NMOS transistor; Rd is the resistance of a minimum-size repeater; and beta is the size ratio between the repeater's PMOS and NMOS transistors. In this model, we also treat the link delay as data-independent, meaning that the delay is decided only by the global wire length and the repeater insertion scheme. However, this model could easily be extended to account for the inter-wire crosstalk resulting from different switching characteristics in neighboring wires by inserting a corresponding data-dependent model, which is left as future work. A switch, configured in the link class constructor, decides whether the link delay model is applied during simulation.

3.5. Simulating the NOC Model with a Traffic Generator/Traffic Receiver

The NOC model is developed independently of the PE model, and its functionality is verified before interconnecting with real PE models, to reduce the turn-around time of system integration. Figure 3.17 depicts the verification environment using traffic generator/traffic receiver models. The traffic generator (TG) and traffic receiver (TR) essentially mimic the packet injection and consumption mechanisms of PEs and their bridging network interfaces (NIs).
The TG and TR coexist in one SystemC module, where an sc_thread process tgen models packet generation and injection and another sc_thread process trcv emulates packet reception. The TG/TR has input and output ports connected to a router's LOCAL ports, and each router in the NoC is connected to one TG/TR. The TG/TR calls the write( ) virtual function of its output port's interface to inject packets into the NoC, while read( ) is used to retrieve incoming packets. Each write( ) call actually transmits one flit of a packet, and so does each read( ); hence a packet segmentation and reassembly (SAR) mechanism is required in the TG/TR model. We employ the packet format shown in Figure 3.3(a), where each packet contains a header flit followed by body flits. The header flit carries the source and destination addresses and the number of body flits in the packet, and the body flits compose the packet's data payload. Every flit in a packet has the same bit-width, while different packets in the network may contain different numbers of flits.

Figure 3.17 TG/TRs connected with the NoC.

When a TG injects a packet into the NoC, it first writes the header flit, then continuously writes the body flits until all flits of the packet are injected. On the other side, a TR first identifies the received packet's header flit before receiving its body flits. By retrieving the flit-number field from the header flit, the TR identifies the last flit of the received packet with a simple counting mechanism. The TG controls the interval between two successively injected flits or packets by calling the wait( ) method, which simulates a delay of a given amount of time t_d. By adjusting the value and distribution of t_d, different traffic patterns can be emulated; changing each packet's destination address likewise creates different network traffic.
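The SAR mechanism described above can be sketched in plain C++: the sender emits one header flit carrying the body-flit count, and the receiver counts body flits to find the packet boundary. The Flit layout below is an illustrative stand-in, not the exact bit format of Figure 3.3(a).

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Illustrative flit: a header flit carries addressing and the body count;
// a body flit carries one payload word.
struct Flit {
    bool is_header;
    int src, dst, num_body;   // valid in header flits
    uint64_t payload;         // valid in body flits
};

// TG side: segment a payload into a header flit plus body flits.
std::vector<Flit> segment(int src, int dst, const std::vector<uint64_t>& data) {
    std::vector<Flit> flits;
    flits.push_back({true, src, dst, static_cast<int>(data.size()), 0});
    for (uint64_t d : data) flits.push_back({false, 0, 0, 0, d});
    return flits;
}

// TR side: reassemble one packet from the front of an incoming flit queue,
// using the header's body-flit count to find the last flit.
std::vector<uint64_t> reassemble(std::queue<Flit>& in) {
    Flit hdr = in.front(); in.pop();
    std::vector<uint64_t> data;
    for (int i = 0; i < hdr.num_body; ++i) {
        data.push_back(in.front().payload);
        in.pop();
    }
    return data;
}
```

A round trip through segment( ) and reassemble( ) reproduces the original payload, mirroring the TG/TR counting mechanism.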
Each TG/TR can inject and receive packets simultaneously during simulation, since its sc_thread processes execute independently; this mechanism intuitively models the concurrent traffic flows in the NoC. We apply the following two synthetic traffic patterns to different NoC configurations in terms of topology, input buffer size, routing algorithm and packet/flit size. (1) Transpose traffic: each TG/TR located at (x, y) sends data to a fixed destination (x', y') throughout the simulation. In our experiments, the relationship between source and destination is set to x' = (x + 2) % X_SIZE and y' = (y + 2) % Y_SIZE, where X_SIZE and Y_SIZE are the numbers of routers in the x and y directions, respectively. (2) Uniform traffic: all TG/TRs in the NoC inject packets to random destinations, with the destination TG/TR drawn from a Gaussian distribution.

3.6. NoC Performance Model Experiment Results
Four 2-D regular architectures, mesh, torus, X-torus and Y-torus, are used to verify the NoC's functionality and latency models. The maximum number of routers in each direction is set to 50, with a fixed neighboring-router distance of 1 mm. The target technology is the 180nm IBM CMOS process. The simulation platform is the same as that used for simulating the PE models described in 2.3. The simulated time is fixed at 20 ms (2M cycles at a clock frequency of 100 MHz). In the first set of experiments, different NoC topologies are evaluated, while the other experiments all focus on the mesh architecture, since it is the most frequently used NoC topology in the literature [105][107]. Both physical and cycle-approximate time models are applied during the NoC performance simulations. Average packet latency (APL) is one of the most important performance metrics for evaluating a NoC; APL is defined as the average time between a packet's generation at the source node (TG) and its reception at the destination node (TR).
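The transpose-style destination mapping used in (1) above is a one-liner; a sketch, with the node coordinates and mesh dimensions as parameters:

```cpp
#include <cassert>
#include <utility>

// Destination mapping for the transpose-style synthetic traffic:
// a TG/TR at (x, y) always sends to ((x+2) mod X_SIZE, (y+2) mod Y_SIZE).
std::pair<int, int> transpose_dest(int x, int y, int x_size, int y_size) {
    return { (x + 2) % x_size, (y + 2) % y_size };
}
```

Because the mapping is deterministic, every source has exactly one destination for the whole simulation, which is what makes the transpose pattern's latency curves monotonic compared with the random uniform traffic.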
By this definition, the packet queuing time, i.e., the time a packet generated at the TG spends waiting for an available input buffer under traffic, is also accounted for. APL measures how fast the on-chip communication infrastructure delivers information to its destination. Another NoC performance metric, average router throughput (ART), is defined as the average number of bytes that traverse each router per unit of time. Assume a tile-based NoC has M×N routers, and measurement shows that router i has K_I,i injected bytes and K_E,i ejected bytes (i = 0, 1, ..., M×N-1) over a given interval T; the ART is then calculated as:

    ART = [ Σ_{i=0}^{M×N-1} (K_I,i + K_E,i) ] / ((M×N)·T)    (3.10)

3.6.1. APL Performance under Different NoC Topologies
We simulate the four regular tile-based NoC topologies (mesh, torus, X-torus and Y-torus) with input buffer sizes of 4, 8, 16, 32 and 64 under uniform traffic. The packet injection rate (the number of packets injected into the NoC per unit of time) is set to 0.025 packet/ns and 0.25 packet/ns, respectively. The flit width is fixed at 64 bits and each packet contains 10 flits. The average packet latencies are plotted in Figures 3.18(a) and (b). The simulation results show that the NoC model clearly reflects the input buffer's influence on the average packet latency across the different NoC topologies. Since the torus NoC has more communication channels than the other three topologies, its average packet latency is the minimum among the four. Mesh does not have any "loop-back" channels, and in turn it incurs 23.78% and 34.61% more average packet latency at packet injection rates of 0.025 packet/ns and 0.25 packet/ns compared with torus, with the compensation of a 50% saving in wiring resources. As trade-offs, X-torus and Y-torus both consume 25% more wiring area, with latency reductions of 15.81% and 8.75% compared with mesh, respectively.
Here we assume that the interconnecting wires are laid out only on the top metal layer. When the injection rate is low, at 0.025 packet/ns, each doubling of the input buffer size from 4 up to 32 results in 7.27%, 5.7% and 4.96% improvements in packet latency on the mesh. The latency improvement is larger at the high injection rate of 0.25 packet/ns, where the reductions increase to 18.55%, 11.95% and 9.8%, respectively. A similar trend applies to the other three NoC topologies. The high injection rate makes the network congest easily, and a large input buffer relieves the latency penalty by buffering unrouted packets.

Figure 3.18(a) Average packet latency vs. input buffer size (packet injection rate = 0.025 packet/ns). Figure 3.18(b) Average packet latency vs. input buffer size (packet injection rate = 0.25 packet/ns); each for the 5x5 mesh, torus, X-torus and Y-torus.

We also observe that increasing the buffer size to 64 gives only marginal network latency improvement under the applied uniform traffic. This suggests that keeping a reasonable buffer size not only maintains NoC performance but also reduces area and power consumption for a given traffic pattern. Note that the SRAM implementing a router's buffers is quite power-hungry and area-consuming due to its architecture. Figures 3.19(a) and (b) show the area and power consumption of dual-port SRAMs of different sizes in the IBM 0.18µm CMOS process. Due to an architectural constraint, the MUX number can only be 4 when the memory size is less than 32. Doubling the SRAM size demands 42.71% additional area and 29.92% more power on average when the flit size is 64 bits.
Combining Figures 3.18 and 3.19, we observe that 16 entries is the optimal buffer size for the 5x5 mesh under uniform traffic when buffer area and power are taken into consideration.

Figure 3.19(a) Dual-port SRAM area vs. size. Figure 3.19(b) Dual-port SRAM dynamic power vs. size; each for 64-bit and 32-bit data widths.

3.6.2. APL Performance with Different Packet Sizes on Mesh NoC
On the 5x5 mesh NoC, we vary the packet size from 8 flits to 128 flits. Each flit has a fixed bit-width of 64 and the input buffer size is set to 16. Two injection rates are applied to the NoC model: 0.025 packet/ns and 0.25 packet/ns. The simulation results for the uniform and transpose traffics are plotted in Figures 3.20(a) and (b). Packet size also affects network performance, because a reserved route in the NoC wormhole router is not released for other requesting packets until all flits of the currently transmitting packet reach their destination. Given that each flit takes equal time to traverse a router, a large packet reserves the router for longer than a short one. Once a router's output port is reserved, arbitration takes place among the incoming packets requesting that same output port, and according to (3.4) and (3.5), arbitration takes roughly 50% more time than a crossbar traversal. Long route reservation times and arbitration overhead make packets injected at the lower rate experience larger average packet latency. Under uniform traffic, the randomness of each packet's destination makes it possible for a packet to reserve a long route, compared with the transpose traffic (x' = (x+2) mod 5, y' = (y+2) mod 5),
where the average hop count is 4. In this case, the uniform traffic shows 56.96% more average latency at the slow packet injection rate of 0.025 packet/ns compared with the transpose traffic. When the injection rate is increased to 0.25 packet/ns, the reduced flit injection interval shortens each packet's port reservation time by roughly 10X, and thus decreases the average packet latency under both uniform and transpose traffic.

Figure 3.20(a) Packet size vs. average latency on the 5x5 mesh under uniform traffic. Figure 3.20(b) Packet size vs. average latency on the 5x5 mesh under transpose traffic; each at injection rates of 0.025 and 0.25 packet/ns.

Under uniform traffic, the average packet latency does not increase monotonically with packet size, which seems to deviate from the explanation in the previous paragraph. This is due to the applied random traffic pattern; under the deterministic transpose traffic, where no randomness exists, a monotonic increase of average packet latency is observed.

3.6.3. APL Performance with Different Router Architectures on Mesh NoC
This experiment evaluates the performance of different NoC router architectures. As presented in 3.2, the on-chip router model in the NoC can be reconfigured into 12 different architectures characterized by combinations of three functionally orthogonal architectural parameters. The 6 router architectures with input buffering only, shown in Table 3.2, are configured during the simulation. The target mesh NoC size is set to 4x4, and each simulation starts with a 2000-cycle warm-up period before each packet's average latency is calculated, to ensure the NoC carries sustained traffic.
The injected packets are single-flit in this experiment: each packet contains only one 64-bit flit. Figures 3.21(a)~(f) present the APL (average packet latency) under the different on-chip router architectures with input buffer sizes ranging from 2 to 8.

Table 3.2. Evaluated Router Architectures

    NoC router architecture | Buffering         | Arbitration policy | Routing algorithm
    Ibuf_FP_XY              | Input buffer only | Fixed-priority     | XY-deterministic
    Ibuf_FS_XY              | Input buffer only | Fair-sharing       | XY-deterministic
    Ibuf_FP_WF              | Input buffer only | Fixed-priority     | WF-adaptive
    Ibuf_FS_WF              | Input buffer only | Fair-sharing       | WF-adaptive
    Ibuf_FP_NL              | Input buffer only | Fixed-priority     | NL-adaptive
    Ibuf_FS_NL              | Input buffer only | Fair-sharing       | NL-adaptive

With increasing traffic injection rate, the average packet latency of all 6 router architectures shows a similar trend across the different input buffer depths. At the very low packet injection rates of 1.5~2.5 packets/cycle, the APL is almost constant, with only ~2.68% difference across architectures and buffer sizes; this small variance results from the randomness of each simulated traffic run. In the low-packet-rate regime, the NoC's handling capability exceeds the PEs' (modeled by the TG) injection rate, i.e., the interval between two consecutive packets from each PE is larger than the time each packet spends traversing from source to destination, including the buffer read/write accesses, route calculation, crossbar forwarding and arbitration. Moreover, the output arbiter is only triggered when two or more packets from different input ports are routed to the same output port, so the arbiter's latency can be neglected here: its triggering rate is very low (< 2%) at the low packet injection rate.
Overall, at the low packet injection rates of 1.5~2.5 packets/cycle, the APL is mainly determined by the NoC's intrinsic latency, which is contributed chiefly by the input/output buffers, the crossbar and the interconnecting links. As the TG gradually increases its packet injection rate, the APL starts to rise at a given injection rate, and beyond that point it increases dramatically with the injection rate. This is expected, since the NoC's packet delivery capability is overwhelmed by the fast packet injection, resulting in network congestion, where packets stay in the queuing buffers waiting for routing resources to become available. This congestion-related latency takes a large portion of the APL at high packet injection rates. Figures 3.21(a) to (f) also present this trend explicitly.

Figure 3.21 NoC average packet latency vs. average packet injection rate for input buffer depths 2~8 under different router architectures: (a) Ibuf_FP_XY, (b) Ibuf_FS_XY, (c) Ibuf_FP_WF, (d) Ibuf_FS_WF, (e) Ibuf_FP_NL, (f) Ibuf_FS_NL.

Each APL curve in Figures 3.21(a) to (f) has, for each input buffer size, a maximum achievable packet injection rate (MAIR), beyond which a small increase of injection rate results in a very large increase of APL. For the Ibuf_FP_XY router architecture, the MAIR is 4.2 packets/cycle when the input buffer depth is 2, rising to 4.9 packets/cycle when the buffer depth is 8; that is, increasing the buffer size by 4X yields a 14.28% improvement in MAIR. Another observation is that an injection rate producing a high APL under a small buffer corresponds to a much lower APL under a larger buffer. In the NoC with the Ibuf_FP_XY router architecture at an injection rate of 4.4 packets/cycle, the APL is ~5000 cycles with the minimum buffer size of 2, and it drops by 36% to ~3200 cycles when the buffer size is increased by 50%.
When the input buffer size reaches 5, the APL falls further to 1400 cycles, 3.57X smaller than in the 2-entry input buffer case. This simulation result shows the potential of a 3.57X performance speedup at the price of 1.5X more memory area. In applications where the APL is critical, selectively increasing the input buffer size is thus good design practice. The APL is influenced not only by the input buffer size but also by the other two architectural parameters, the arbitration policy and the routing algorithm. Comparing Figure 3.21(b) with Figure 3.21(a), we notice that changing the arbitration policy from fixed-priority to fair-sharing improves the APL performance greatly: the MAIR under 2-entry and 8-entry buffers increases by 1.76X and 1.79X, respectively. This is because the fair-sharing policy arbitrates each output port in a round-robin manner, so every input port has the same probability of accessing each output port. In consequence, the starvation of low-priority input ports seen under fixed-priority arbitration is reduced or eliminated. That starvation directly causes very high latency for packets injected from the low-priority input ports, and it also inflates the APL shown in Figure 3.21(a). The overhead of fair-sharing is additional hardware, such as an FSM and priority-rotating logic: the synthesis result shows that the fair-sharing arbiter takes 2.16X the silicon area of the fixed-priority arbiter, which uses only one comparator and a 5-entry ROM. So in area-critical situations, the FP arbiter remains practical if the performance requirement is met. Compared with XY-deterministic routing, the WF and NL adaptive routing algorithms have the potential to choose an appropriate route adaptively according to the current NoC traffic status.
With an adaptive routing algorithm, a congested packet does not need to wait for the availability of the designated output port computed by the deterministic XY-routing algorithm. The price of adaptive routing is a potentially increased hop count, the number of routers a packet traverses before reaching its destination. In some cases the same packet passes through the same router multiple times, which introduces heavy latency overhead; unfortunately, the experimented random traffic falls into this case. As shown in Figures 3.21(c), (d), (e) and (f), the MAIR of the NoC with the Ibuf_FP_WF router is 2.27X and 2.24X lower than Ibuf_FP_XY's at input buffer depths of 2 and 8, respectively, and the NoC with the Ibuf_FP_NL router architecture has a MAIR reduced by 1.9X and 1.53X at buffer depths of 2 and 8 compared with its XY-deterministic counterparts. Another effect of the increased hop count under adaptive routing is the steeper curve when the injection rate reaches the MAIR, visible in Figures 3.21(c)~(f). This trend only shows that adaptive routing has worse APL performance under this particular random traffic, owing to the increased average packet hop count; in applications where network congestion can be estimated, adaptive routing has the potential to increase the NoC throughput. In the adaptive routers, the choice of output arbiter still has the same effect as before, due to the reduced possibility of starving the low-priority input ports: when the input buffer size is 2, the MAIR of the Ibuf_FS_WF architecture is 3.3 packets/cycle, 1.78X larger than its FP arbitration counterpart. These simulations verify that the NoC model delivers packets successfully inside the network with multiple concurrent traffic flows.
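The two arbitration policies compared above can be contrasted in a few lines. This is a behavioral sketch, not the synthesized arbiter logic: fixed-priority always scans from port 0, so high-numbered ports can starve, while fair-sharing scans round-robin from just past the last grant, equalizing access across input ports.

```cpp
#include <cassert>
#include <vector>

// req[i] == true means input port i is requesting the output port.
// Fixed-priority: lowest-numbered requester always wins.
int arbitrate_fixed_priority(const std::vector<bool>& req) {
    for (int i = 0; i < (int)req.size(); ++i)
        if (req[i]) return i;
    return -1;  // no request
}

// Fair-sharing (round-robin): scan starting just past the last grant,
// so a persistent low-numbered requester cannot starve the others.
int arbitrate_fair_sharing(const std::vector<bool>& req, int& last_grant) {
    int p = (int)req.size();
    for (int k = 1; k <= p; ++k) {
        int i = (last_grant + k) % p;
        if (req[i]) { last_grant = i; return i; }
    }
    return -1;
}
```

With ports 0 and 2 both requesting continuously, fixed-priority grants port 0 forever, whereas fair-sharing alternates between 0 and 2, which is exactly the starvation-avoidance behavior behind the MAIR improvement reported above.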
With the router and link latency models, the average packet latency under different NoC configurations and traffic patterns can be evaluated, guiding designers toward the appropriate parameters during SoC interconnect design.

3.7. Integrate PE and NoC into the SoC Simulation Framework
After the PE and interconnecting NoC models are developed independently, they are integrated to form the complete SoC simulation framework, used not only for functional verification and timing estimation of applications, but also for system-level power estimation, which is becoming significantly important for ever-demanding low-power system design, and for system-level simulation-based design space exploration. This section proposes a methodology for integrating the models of computation cores and communication fabrics into an application-driven SoC simulation framework, using an M-JPEG encoding system as the case study; the same methodology also applies to other complex applications.

3.7.1. Network Interface for Bridging NoC and PEs
Each PE performs computation tasks and sends data via the NoC for inter-PE communication. As the system scale grows, it is important to separate the computation and communication models for better modularity and portability. Orthogonalizing computation and communication means that the PEs in a SoC are transparent to each other and the communication fabric is invisible from each PE's point of view. When a PE initiates a communication operation to send data to another PE via the NoC, it only injects the destination PE's address and the data into the NoC, without handling any network protocol information. Each PE also has a different data-processing granularity, represented by the minimum amount of data its tasks process together, as well as different data formats. This heterogeneity in PE data granularity calls for different packetization methods before data are delivered into the NoC.
NoC routers, on the contrary, do not perform SAR functionality: they have no knowledge of the data format inside the packets traversing them, but simply deliver the packets to their destinations. To bridge the data-sensitive PEs with the data-insensitive NoC, network interfaces (NIs) are required when connecting PEs to the NoC, maintaining the separation between computation and communication.

3.7.2. Buffered Network Interface Model
The NI performs data packetization to encapsulate data blocks from the PEs into packets according to the NoC communication protocol, and it also de-packetizes an incoming packet back into PE-recognizable data once the packet is accepted from the NI's associated router. Another function of the NI is data buffering before the PE injects data into the NoC. Similar to the NIC (Network Interface Card) connecting a personal computer to the Internet, buffering between the PE and the network improves system throughput and reduces the penalty of network congestion. This work implements a PE-specific NI model with bi-directional data buffering. Figure 3.22(a) describes the connection of the PEs to a 2x2 mesh NoC via four NIs for the M-JPEG application, and Figure 3.22(b) presents the internal architecture of the buffered NI model. The NI has two input ports and two output ports: pe_out_data and ni_out_packet form the egress channel from PE to NoC, while ni_in_packet and pe_in_data compose the ingress channel from NoC to PE. A data buffer and packetization logic sit on the egress channel between its two data ports, while a packet buffer and de-packetization logic sit on the ingress channel. The packetization and de-packetization logics are two independent finite state machines, shown in Figures 3.23(a) and (b).
During data packetization, the PE first sends the address indicator, followed by the current packet's destination address, the packet size (number of flits), the flit size (number of data elements a flit contains) and the actual data payload to the data buffer; the packetization logic then fetches data from the buffer. As Figure 3.23(a) shows, the packetization logic stays in the NI_IDLE state until it receives the effective destination address information, which moves it to the NI_ADDR state. Receiving the next datum (the packet size) puts it into the NI_PLENGTH state, followed by NI_FLENGTH upon receiving the flit size. It stays in the NI_FDATA state until a flit is constructed, then moves to the NI_SENDF state to inject the flit into the NoC, and returns to NI_FDATA to receive the next flit's data. Once all flits of the packet have been sent, the state machine goes back to NI_IDLE to wait for the next packet. A similar procedure applies to the de-packetization state machine shown in Figure 3.23(b).

Figure 3.22(a) SoC with 2x2 mesh NoC and NIs, mapping the M-JPEG tasks (file read, RGB-YUV conversion, down-sampling and bit-stream forming; DCT; quantization and zigzag scan; VLC encoding) onto the four PEs. Figure 3.22(b) NI architecture, with data buffer, packetization and de-packetization logic between the pe_out_data/ni_out_packet and ni_in_packet/pe_in_data port pairs.

The NI is PE-specific with respect to its PE's supported data format. This specificity is reflected in the buffer's data width, which matches the PE's output port: an NI bridging a PE with an 8-bit output port has a one-byte-wide data buffer, while an NI connecting a processor core with 32-bit GPIO has a 32-bit data buffer. When a PE is initialized, its associated NI is instantiated by specifying the input data buffer width as well as the pe_in_data port width.
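The packetization FSM of Figure 3.23(a) can be sketched in plain C++. This is a simplified behavioral model, not the SystemC implementation: the input stream is [destination address, packet size in flits, flit size in data words, payload...], the address-indicator handshake is folded into the IDLE state, and "sending" a flit just records it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// States mirror Figure 3.23(a): IDLE -> ADDR -> PLENGTH -> FLENGTH ->
// FDATA <-> SENDF, returning to IDLE after the last flit of the packet.
enum class NiState { IDLE, ADDR, PLENGTH, FLENGTH, FDATA, SENDF };

struct PacketizerFsm {
    NiState state = NiState::IDLE;
    int dest = 0, packet_size = 0, flit_size = 0;
    int flits_sent = 0;
    std::vector<uint64_t> current_flit;                 // Received_data_num
    std::vector<std::vector<uint64_t>> sent_flits;      // flits injected so far

    void feed(uint64_t word) {
        switch (state) {
        case NiState::IDLE:    dest = (int)word;        state = NiState::ADDR;    break;
        case NiState::ADDR:    packet_size = (int)word; state = NiState::PLENGTH; break;
        case NiState::PLENGTH: flit_size = (int)word;   state = NiState::FLENGTH; break;
        case NiState::FLENGTH:
        case NiState::FDATA:
            state = NiState::FDATA;
            current_flit.push_back(word);
            if ((int)current_flit.size() == flit_size) {
                state = NiState::SENDF;          // inject this flit into the NoC
                sent_flits.push_back(current_flit);
                current_flit.clear();
                ++flits_sent;                    // Flit_num
                state = (flits_sent == packet_size) ? NiState::IDLE : NiState::FDATA;
                if (state == NiState::IDLE) flits_sent = 0;
            }
            break;
        case NiState::SENDF: break;  // transient; resolved within feed()
        }
    }
};
```

Feeding a destination, a packet size of 2, a flit size of 2 and four payload words drives the FSM through two SENDF transitions and back to IDLE, ready for the next packet.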
The NI's packetization and de-packetization logics are also configured according to the data buffer width and the NoC's flit width, since a flit may contain multiple data elements to utilize the network bandwidth efficiently.

Figure 3.23(a) FSM for data packetization in the NI, with states NI_IDLE, NI_ADDR, NI_PLENGTH, NI_FLENGTH, NI_FDATA and NI_SENDF; Received_data_num counts the data received in a flit and Flit_num the flits constructed in a packet. Figure 3.23(b) The analogous FSM for data de-packetization in the NI.

At the system level, the NI is likewise modeled as a SystemC sc_module object for modularity and functional encapsulation. The SystemC interfaces sc_fifo_in_if and sc_fifo_out_if are applied to set its input and output ports' attributes, gluelessly bridging PE and NoC. Although the NI's input/output ports on the PE side may have non-uniform data widths for different cores, its I/Os on the NoC side have exactly the same data width as the NoC's flit size. The C++ standard template class queue is employed to model the NI's data/packet buffers; it encapsulates all the required operations as member functions that users can invoke directly. The packetization and de-packetization logics are modeled explicitly by two concurrent sc_thread processes, scheduled for execution when triggered by incoming data/packets.
3.7.3. Put Everything Together: SoC Simulation Framework xSoCSim
The PEs, NIs and NoC (the interconnection of routers via links) form the SoC component library, and the top-level SoC simulation framework is constructed by integrating PEs, NIs and the NoC. PEs handle the computation tasks, the NoC provides the inter-PE communication services required for data transmission between tasks mapped onto different PEs, and the NIs provide seamless bridging between PEs and NoC. As discussed above, each SoC component is modeled by an abstract class encapsulating its functionality. Integrating these components into a multi-core SoC simulation model involves the following steps, described in Figure 3.24. (1) Application analysis and task partitioning extract tasks from the application specification represented in C++/SystemC. A task is a function call, and tasks may be dependent or independent; a task graph integrating the partitioned tasks is derived to represent the application at the task level. Currently the application task partitioning is still manual, without automation tools. (2) PE instantiation creates the required PE objects with architectural information represented by the macro-models; the initial operating frequency is also assigned. For an ASIC core, the parallelism is explicitly represented by the number of concurrent processes initialized by its class constructor. (3) Task macro-operation annotation inserts wait( ) calls, parameterized by macro-operations, into each task. Each macro-executor in a task is annotated with a delay function representing its execution time, or multiple macro-executors are annotated with one wait( ) emulating their cumulative execution time. (4) Task mapping associates each PE with its tasks. For a microprocessor mapped with multiple tasks, the sub-graph representing these tasks' dependencies must be provided to the PE's TSE for task scheduling.
Each task is hooked up with the PE architecture by calling the PE's AddTask() method, which associates the PE's macro timing model with the macro-operations annotated to this task. For an ASIC core, only tasks with explicit concurrency are mapped onto it, and the parallelism is resolved explicitly by employing multiple sc_thread processes in the current work. If tasks mapped onto different PEs communicate data, the communication pattern needs to be specified explicitly for each PE by inserting code that accesses the NI to write these data to the NoC. (5) NoC instantiation creates the on-chip communication architecture that interconnects these PEs. The topological, architectural and technological information is provided for initializing the NoC components (routers and links). The NoC class constructor further calls its associated router and link constructors to create their objects and performs port binding according to the NoC topology to form the NoC. The NoC's buffer size, flit size and routing algorithm are also set up during its initialization. (6) NI initialization creates the bridging NI objects for different PEs to connect with the NoC. A PE cannot send data into the NoC until its NI is initialized and connected with this PE. For each PE, its NI is mapped to its memory space so that the NI is treated as the PE's peripheral during the simulation. (7) Connecting PEs with the NoC performs the final connection between PEs and NoC through the NIs' bridging. Two sets of channels, one of which connects PE with NI and the other links NI to NoC, are instantiated. At this point the complete SoC simulation model and its associated methodology, named xSoCSim, supporting multiple heterogeneous cores connected by an on-chip network, is created and ready for simulation. The top-level framework instantiates the SoC model and performs cycle-approximate simulation, where each PE executes its mapped tasks and sends data via its NI to the NoC for communication.
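Step (3) above can be sketched in plain C++ (this is not the dissertation's actual SystemC code): each macro-executor is followed by a wait() parameterized by the PE's macro timing model. Here the SystemC wait(sc_time(...)) is replaced by simple accumulation of simulated time, and the macro-operation indices and cycle counts are hypothetical placeholders.

```cpp
#include <cassert>

// Hypothetical macro-operation indices; the real LUT indexing may differ.
enum MacroOp { ALU_OP_2 = 0, MUL_OP_2 = 1, LDST_OP = 2 };

struct PEModel {
    double period_ns;        // clock period of this PE
    double latency_lut[3];   // cycles per macro-operation (placeholder values)
    double sim_time_ns;      // accumulated simulated execution time

    PEModel(double period, double alu, double mul, double ldst)
        : period_ns(period), latency_lut{alu, mul, ldst}, sim_time_ns(0.0) {}

    // Stands in for the framework's SystemC wait(sc_time(...)) call.
    void wait_macro(MacroOp op, int count = 1) {
        sim_time_ns += count * latency_lut[op] * period_ns;
    }
};
```

Annotating the macro-executor Y[i]=a*X[i]+Y[i] would then cost one MUL_OP_2 plus one ALU_OP_2 per iteration.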
After the execution, the total execution time as well as each PE's timing overhead is provided for analysis and optimization.

Figure 3.24 The xSoCSim simulation framework, built by integrating PEs, NIs and the NoC
(The flow starts from the application specification in C/C++ and proceeds through (1) application analysis and task partitioning into a task graph, (3) macro-operation annotation into an annotated task graph, (2) PE instantiation from the PE library (ARM, PowerPC, FFT, ...), (4) task mapping, (5) NoC instantiation, (6) NI instantiation from the NI library, and (7) connecting PEs with the NoC to form the xSoCSim model.)

3.8. M-JPEG NoC-based SoC Simulation Model Case Study

We use the M-JPEG application as a case study to demonstrate the proposed SoC simulation framework. M-JPEG is a motion picture encoding algorithm that encodes each frame using a JPEG engine without considering the inter-frame temporal and spatial redundancies. Its functional specification is written in C++ and contains eight tasks: (1) input frame read, (2) RGB to YUV color space conversion, (3) 2:1:1 down-sampling, (4) 2-D DCT transformation, (5) quantization, (6) zigzag scan and data reordering, (7) VLC and Huffman encoding, and (8) bit-stream dumping. Tasks (1)-(8) are applied to each input frame to convert the input motion video into a series of JPEG images, as the task graph in Figure 3.25 shows. The computation is performed on 8x8 data blocks in each task except for (8). Task (3)'s output is 8-bit for each YUV component, while tasks (4)-(7) operate on 32-bit fixed-point data for better precision.
Figure 3.25 Task graph of the M-JPEG application

Although there exists intra-task-level parallelism in tasks (4), (5) and (6), since each down-sampled YUV component can be processed independently, we do not exploit it in this case study, and all YUV components share the same processing element. We employ an MPSoC (multi-processor SoC) architecture, where each PE is implemented by a microprocessor and the PEs are interconnected through the NoC, to exemplify the construction of the M-JPEG SoC simulation framework. Assume three microprocessor architectures, the IBM PowerPC405, Xilinx MicroBlaze and MIPS PISA, are supported in the PE library. The macro-operations of the PowerPC405 and MicroBlaze are characterized with Xilinx RTL-level simulators, and PISA is characterized with the micro-architectural simulator SimpleScalar. We will not go into the detailed characterization process here. A 2x2 mesh NoC architecture provides the interconnection. To construct the SoC simulation model under the xSoCSim framework, the following steps are performed sequentially:

① PE instantiation: Four PEs, two MicroBlaze cores, one PPC405 and one PISA, are instantiated with operating frequencies of 100MHz (PPC405), 50MHz (MicroBlaze) and 30MHz (PISA), and each processor is assigned an address from 0x00 to 0x03 corresponding to its NoC router's location. The SystemC description to initialize these PEs is shown as follows:

    m_pe[0] = new CPUL("PowerPC", PPC405, 100, 0x00, TRUE);
    m_pe[1] = new CPUL("DCT",     MB,     50,  0x01, TRUE);
    m_pe[2] = new CPUL("Quant",   MB,     50,  0x02, TRUE);
    m_pe[3] = new CPUL("VLC",     PISA,   30,  0x03, TRUE);

② Task-level macro-operation annotation: Each task is annotated with its macro-operations, and each function and its child functions inside the task are annotated in an iterative way. The following code describes the annotated code for the 1-D DCT that is called by the 2-D DCT task.

    void fdct_1d(int row[8], int row_res[8], int bias, PEL *core)
    {
        float x0, x1, x2, x3, t0, t1,
              t2, t3, t4, t5, t6, t7;
        t0 = (float)(row[0] + row[7] - bias);
        latency = sc_time(2 * core->DVS_ratio * core->period * core->latency_LUT[ALU_OP_2], SC_NS);
        wait(latency);
        t7 = (float)(row[0] - row[7]);
        latency = sc_time(core->DVS_ratio * core->period * core->latency_LUT[ALU_OP_2], SC_NS);
        wait(latency);
        ......    // other operations

③ Task mapping: tasks (1), (2), (3) and (8) are mapped onto the PowerPC405; task (4) is mapped onto the MicroBlaze located at address 0x01; tasks (5) and (6) are mapped onto the other MicroBlaze core addressed 0x02; and task (7) is mapped onto PISA. Each core object invokes its AddTask() to hook up its assigned tasks as follows:

    m_pe[0] -> AddTask(1, &read_ppm_pic);
    m_pe[0] -> AddTask(2, &rgb2yuv);
    m_pe[0] -> AddTask(3, &downsample);
    m_pe[0] -> AddTask(4, &dump_jpeg_data);
    m_pe[1] -> AddTask(1, &fdct);
    m_pe[2] -> AddTask(1, &quant);
    m_pe[2] -> AddTask(2, &zigzag);
    m_pe[3] -> AddTask(1, &vlc_huffman);

④ NoC instantiation: a 2x2 mesh is created by reading the NoC configuration file, which specifies the topology, input buffer size and other interconnection parameters. The NoC flit size is set to 32 bits and the packet size is 10 flits (one header flit, 8 data flits, and one CRC tail flit).

⑤ NI configuration: NIs connecting to different PEs have different packet formats. For a packet sent from the PowerPC containing the down-sampled Y1Y2UV data, each 32-bit flit is divided into 4 parts, each of which stores one component. For the two MicroBlaze cores running tasks (4), (5) and (6), each flit contains one 32-bit integer. For the PISA core running the VLC encoding algorithm, each flit consists of two encoded 16-bit words for the output JPEG bit-stream dumping. Accordingly, three different NIs are initialized to handle these different packetization/de-packetization requirements.
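The packetization side of the NI can be sketched in plain C++ (the framework itself uses SystemC sc_thread processes and sc_fifo channels, which are omitted here): narrow PE data words are packed into fixed-width flits, and flits are grouped into packets. Header and CRC tail flits are left out and the sizes are illustrative, not the configured values above.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Minimal packetization sketch: data elements -> flits -> packets.
class NetworkInterface {
public:
    NetworkInterface(int data_per_flit, int flits_per_packet)
        : data_per_flit_(data_per_flit), flits_per_packet_(flits_per_packet) {}

    // Called once per data element arriving from the PE side.
    void push_data(uint8_t d) {
        cur_flit_.push_back(d);
        if (static_cast<int>(cur_flit_.size()) == data_per_flit_) {   // flit complete
            cur_packet_.push_back(cur_flit_);
            cur_flit_.clear();
            if (static_cast<int>(cur_packet_.size()) == flits_per_packet_) {
                packet_queue_.push(cur_packet_);   // packet complete: inject to NoC side
                cur_packet_.clear();
            }
        }
    }
    std::size_t packets_ready() const { return packet_queue_.size(); }

private:
    int data_per_flit_, flits_per_packet_;
    std::vector<uint8_t> cur_flit_;
    std::vector<std::vector<uint8_t> > cur_packet_;
    std::queue<std::vector<std::vector<uint8_t> > > packet_queue_;
};
```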
⑥ PE-NoC integration: the final step connects the PEs to the NoC through the NIs to construct the top-level SoC module for simulation. The test bench instantiates the top-level simulation model to perform the application simulation at the source-code level without compiling it to target-dependent binary executables. Each processor core executes its mapped tasks along with the annotated timing macro-models scheduled by the PE model's TSE. Once a processor needs to send its result data to another one, it accesses the NI, which is mapped into the processor core's memory space, to provide the data payload and destination information. The NI keeps receiving the data to assemble the packet before injecting it into the NoC, which routes the information to the router connected to the processor core mapped with the next dependent task. PEs and NoC operate collaboratively until all the data are processed and the result is dumped into an output file by one of these four processor cores. Figure 3.26 presents the final SoC architecture mapped with the M-JPEG application. The PowerPC at tile #(0,0) operates at 100MHz, the two MicroBlaze cores at tiles #(0,1) and #(1,0) operate at the same frequency of 50MHz, and the PISA core located at tile #(1,1) operates at 30MHz. This heterogeneous frequency setup is not optimal; it simply demonstrates the framework's ability to simulate a heterogeneous SoC. The NoC is clocked at the same frequency of 100MHz. The NoC on-chip router architecture is set to 1buf_FP_XY with an input buffer depth of 4. Due to the deterministic XY-routing algorithm, 4 possible routes between NoC nodes can be identified: comm1(#(0,0) -> #(1,0)), comm2(#(1,0) -> #(0,1)), comm3(#(0,1) -> #(1,1)), and comm4(#(1,1) -> #(0,0)). The input video stream is a set of 64x64 frames in the ppm format, and the output is dumped into the corresponding set of files in the jpeg format.
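The deterministic XY routing decision that produces these routes can be sketched as follows, assuming tile coordinates #(x,y) with x increasing eastward and y increasing southward; the text does not fix this convention, so the port names here are illustrative.

```cpp
#include <cassert>
#include <utility>

enum Port { LOCAL, EAST, WEST, NORTH, SOUTH };

// Dimension-ordered XY routing: resolve the X offset first, then Y.
Port xy_route(std::pair<int,int> cur, std::pair<int,int> dst) {
    if (cur.first != dst.first)
        return dst.first > cur.first ? EAST : WEST;
    if (cur.second != dst.second)
        return dst.second > cur.second ? SOUTH : NORTH;
    return LOCAL;   // arrived: eject the flit to the local PE
}
```

Because the route is a pure function of source and destination, every packet between a given pair of tiles follows the same path, which is why exactly four routes appear in this mapping.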
Figure 3.26 SoC simulation model mapped with the partitioned M-JPEG application
(PowerPC at #(0,0), 100MHz: (1) file read, (2) RGB-YUV conversion, (3) down-sampling, (4) bit-stream formation. MicroBlaze at #(1,0), 50MHz: (1) FDCT. MicroBlaze at #(0,1), 50MHz: (1) quantization, (2) zigzag scan. PISA at #(1,1), 30MHz: (1) VLC encoding. Routes comm1-comm4 connect the tiles; the NoC runs at 100MHz.)

Figure 3.27(a) shows the cycle count of each mapped PE to process one 64x64 video frame, and Figure 3.27(b) depicts the corresponding cycle count of each communication route. For each 64x64 frame, the computational latency is 7.84ms, 43.11ms, 33.52ms, and 35.96ms on the PEs located at tiles #(0,0), #(1,0), #(0,1) and #(1,1), respectively. The communication latencies on the four inferred routes are evaluated as 0.09ms, 0.26ms, 0.09ms, and 0.22ms at the 100MHz NoC frequency. Since the data is processed in a pipelined fashion across the PEs, the average throughput is estimated from the worst-case PE with the maximum processing latency; the communication latency is not considered since it is more than 100X smaller than the computation latency in this simple case without heavy communication traffic. Among the 4 PEs, the MicroBlaze at #(1,0) running at 50MHz has the maximum latency of 43.11ms, so the average throughput of the M-JPEG application is 23.19fps (frames per second).

Figure 3.27(a) PE cycle count on M-JPEG (PowerPC, MBlaze #0, MBlaze #1, PISA)
Figure 3.27(b) Communication cycle count on M-JPEG (comm1-comm4)

4. SYSTEM-ON-CHIP POWER MODELING FRAMEWORK

Nowadays VLSI designers face ever more stringent power challenges on two fronts. First, the continuously shrinking transistor size squeezes millions to billions of transistors, clocked at GHz-level frequencies, onto a single die.
This trend, together with the ever-increasing chip size, aggrandizes the dynamic power consumption given by Pdyn = 0.5·α·C·Vdd²·f, where α, C, Vdd and f represent the switching activity factor, circuit capacitance, supply voltage and operating frequency, respectively. Secondly, battery technology advances at a much slower rate than transistor technology, which requires VLSI designers to adopt low-power techniques to extend the battery life of SoCs, which are mostly used in handheld devices. Power/energy modeling serves as the core part of low-power techniques, estimating the system's power/energy consumption so the design can be fine-tuned to meet its power budget. It is also widely believed that the higher the level at which power optimization is performed, the more power savings can be achieved, although power estimation at a high level may not achieve the same accuracy as low-level models do. Taking these two factors into consideration, a system-level power modeling framework based on the SoC simulation model is developed to provide fast power/energy estimation for multi-core SoCs targeting multimedia applications. Depending on the level of abstraction at which the power model is built, there are different methodologies to estimate a system's power consumption, from the circuit level, considering each transistor's switching properties, to the system level, evaluating input/output transition statistics. In our work, different power modeling methodologies targeting different levels of abstraction are employed to estimate the SoC's power consumption within the same framework. For PEs handling computational tasks, power macro-modeling is applied to abstract away their low-level implementation details, using statistical linear regression fitting to derive a closed-form formula for each macro-operation's power/energy consumption. For routers and links in the NoC, an analytical power model is developed by evaluating their micro-architectures, since they are relatively stable compared with the various PE architectures.
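The dynamic-power relation Pdyn = 0.5·α·C·Vdd²·f stated above transcribes directly into code; the numbers in the check below are illustrative, not characterized values from this work.

```cpp
#include <cassert>
#include <cmath>

// Dynamic power: P_dyn = 0.5 * alpha * C * Vdd^2 * f.
double dynamic_power(double alpha, double capacitance_f, double vdd_v, double freq_hz) {
    return 0.5 * alpha * capacitance_f * vdd_v * vdd_v * freq_hz;
}
```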
These models provide power functions that can be called by the simulation framework to evaluate power at run time without delving into circuit- or RT-level transition details. Using a function call to model power also offers the flexibility that different power models can be applied to the same functional block by simply changing the corresponding power function, which makes the SoC simulation framework more modular and maintainable. This chapter is organized as follows. Section 4.1 reviews the related research on power modeling and motivates raising the abstraction level in multi-core SoC power estimation. Section 4.2 proposes the microprocessor power macro-modeling methodology, where the level of abstraction is raised from the instruction level to the macro-operation level to increase power estimation speed while keeping reasonable accuracy. The analytical power estimation of NoC components is presented in Section 4.3, where both dynamic and leakage power are addressed. Using the analytical power model does not preclude employing statistical high-level NoC models, once available, in the framework. It also shows the openness of the proposed framework, which can adopt heterogeneous power models for different functional blocks, an aspect that has not been addressed before. Section 4.4 presents the integration of these power models with the SoC simulation framework to form a run-time, application-specific SoC power simulator.

4.1. Related Work on Power Modeling

SoC power/energy estimation essentially consists of modeling the power consumed by its cores, including both PEs and the NoC. For modeling PEs, especially embedded microprocessors, various techniques based on instruction-level characterization and simulation of the underlying hardware have been proposed [16][58]. In [15][16], microprocessors are characterized by gathering their ISAs' energy data through experimental methods, running applications on real prototype boards.
Isci et al. [15] propose a coordinated measurement approach that combines real power measurement with performance-counter-based, per-unit power estimation, and apply it to the Pentium 4 processor. In [16][59], ARM and SH3 microprocessor models are created by recording the power consumption of each instruction in a spreadsheet-like library, which is invoked by program models during simulation to obtain the total power consumption. Characterizing real hardware achieves good precision, but it requires prototypes of the target microprocessors, which are not available in the early design stages of an application-specific SoC. As an alternative, architectural modeling mitigates the requirement of real hardware for instruction set characterization. It models the micro-operations of the functional blocks inside the microprocessor, such as the execution unit, cache, and branch predictor, by creating capacitance models of these blocks, as Wattch does. Wattch [60] is built upon the SimpleScalar [10] framework and achieves a speedup of 1000X or more compared with circuit-level power tools, yet maintains accuracy within 10% of their estimates as verified using industry tools. Another cycle-accurate power estimator, SimplePower [61], simulates 5-stage pipelined microprocessors and employs a transition-sensitive energy model [62]. Compared with microprocessors, whose internal signal transitions can be abstracted to the ISA, a hardwired ASIC does not have such a minimal operation set to characterize. Low-level SPICE-based power estimation is performed on circuits containing a small number of transistors for high accuracy [63], and RT-level power models estimate the circuit's transition activity using input vectors to derive the power consumption [63]. As circuit size and complexity continue to increase at an exponential pace, macro-modeling [64][65][66] for high-level power estimation has attracted considerable attention over the past few years.
In general, the macro-model is based on a look-up table (LUT) search, where the LUT is created by characterizing the circuit using input vectors with specific statistical distributions. The most frequently used parameters are the average input signal probability Pin, the average input transition density Din, and the average output transition density Dout. The metric Dout is evaluated using zero-delay simulation. The LUT reports estimates for equally spaced discrete values of these parameters; for input characteristics that do not correspond to any LUT entry, estimates are obtained using an interpolation scheme [65]. Several modifications to the macro LUT method are introduced in [66] to improve its accuracy. NoC power/energy modeling has attracted extensive research effort since the NoC concept was proposed in [67]. Ye et al. [68] analyze the power consumption in NoC routers and propose "bit energy" models to estimate the routers' power consumption. Wang et al. [25] present ORION, a power-performance simulation model with analytical dynamic power models of NoC routers based on the LSE framework. [69] extends ORION by adding a leakage power model, which is based on empirically characterizing some frequently used circuit components. Interconnect links, especially in deep-submicron processes below 180nm, have also been extensively researched from the energy point of view. Gupta et al. [70] present a high-level interconnect power model for the wires of a single-core chip. Bakoglu et al. [71] propose a wire delay model considering optimal repeater sizing and spacing. Ho et al. [72] point out that the power consumption of the delay-optimal repeated wire is prohibitively large, and introduce a methodology to save wire power by increasing repeater spacing and decreasing repeater size. Heo et al. analyze the power of the global wires in an on-chip network [57] using a first-order RC model.

4.2.
PE Power Macro-Modeling Methodology

In communication-centric NoC-based SoCs, the power consumption of both the PEs and the NoC needs to be modeled to evaluate the energy perspective of the system. For the PEs, the macro-model method is employed to estimate processing element power based on characterizing a set of common operations the target PE supports; the NoC is modeled with an analytical methodology, where circuit-level properties are analyzed and elaborated to calculate the communication power. These two distinctive power-modeling threads (statistical model and analytical model) show the flexibility and openness of the proposed framework: users can easily switch from one power model to another for SoC blocks to trade off simulation speed against modeling accuracy.

4.2.1. Programmable Processor Power Macro-Model

Figure 4.1 Abstraction levels of microprocessor modeling
(From layout/circuit through RTL and instruction level, e.g., LD R0, 0(R1); DMUL R0, R0, R3; LD R4, 0(R2); DADD R0, R0, R4, up to function level, e.g., void fft(int *a, int *b). Simulation speed increases and estimation error grows as the abstraction level rises.)

Figure 4.1 shows the abstraction levels of microprocessor power models, starting from the lowest circuit level. At the lowest level of the hierarchy, the microprocessor is modeled by the actual silicon layout or transistor netlist. The power estimates are highly dependent on transistor sizes and switching behavior, and are very slow to simulate. At the middle of the hierarchy, models are created at the register-transfer level (RTL), which takes the signal transfers among different logic blocks into consideration. The RTL model is mainly used by ASIC designers. The instruction-level model lies on top of the RTL model; it takes each instruction as an atomic operation from the energy perspective, although an instruction is composed of several micro-operations.
As stated before, instruction-level power modeling cannot meet the simulation speed requirement of high-complexity SoCs with embedded microprocessors for efficient design space exploration on real applications. At a higher level of the hierarchy, a function-level energy model characterizes the energy consumption with different parameter sets for the frequently invoked functions in the software code, and builds a linear regression model for each function's energy consumption. It can achieve very high speed in microprocessor energy estimation but suffers from lower efficiency and flexibility, since a real application may contain hundreds or even thousands of frequently invoked functions, and characterizing them requires a huge amount of effort that may not pay off in the gained speed-up. Recall that chapter 3 defines the macro-operation set, which provides a minimal number of basic operations from which all applications can be constructed. As a compromise, modeling the energy properties of macro-operations raises the level of abstraction compared with the instruction level, and reduces the characterization effort compared with the function-level model. More importantly, different microprocessor architectures share the same macro-operation set, which conforms to the retargetable energy-modeling scheme, where architecture-dependent cross-compilation is eliminated, similar to the execution time model described in chapter 3. As stated before, a C source code line is regarded as a macro-executor, such as Y[i]=a*X[i]+Y[i]. Each macro-executor may contain one or more macro-operations, each of which is compiled into multiple instructions. Since energy is additive, the energy consumption of a macro-operation is the sum of its compiled instructions' energies, and the macro-executor's energy can be estimated by adding its macro-operations' energies.
Consider a simple loop that is compiled to the assembly code shown in Figure 4.1:

    for (i = 0; i < 1000; i++)
        Y[i] = a*X[i] + Y[i];

Assuming the energy values of these instructions are E_LD, E_DSUB, E_DADD, E_SD, E_DMUL and E_BR, respectively, the energy consumption of the C code segment is computed by:

    E_for_loop = n × [(E_LD + E_DSUB + E_BR) + (E_LD + E_DSUB + E_SD + E_DMUL)]    (4.1)

where n is the number of iterations the for loop takes. From Equation (4.1), two macro-executors can be extracted: the for-loop control and the body Y[i]=X[i]*a+Y[i], with energy values of E_LD+E_DSUB+E_BR and E_LD+E_DSUB+E_SD+E_DMUL, respectively. However, such a simplistic approach has limitations. The first is the accuracy of the energy model for a single instruction: real measurement is difficult, and pipelined execution makes the estimation even harder. The error accumulates, since the overall energy is the summation over all instructions. The second issue is the inter-instruction energy consumption, which results from switching between two consecutive, different instructions. To generate a relatively accurate energy model, an empirical method is employed to extract energy data for the macro-operations. The same macro-operations used for execution time modeling are employed directly in the energy estimation of embedded microprocessors. The first step is characterizing the energy consumption of each macro-operation. In this work, it is extracted from the energy data generated by executing benchmarks on Wattch; the methodology can equally be applied to physical measurements of a processor or other energy-aware simulation methods. Statistical methods are employed to design the experiments and to process the raw simulation data. Benchmark programs with small segments of C code are created by repeating the same type of macro-operation for a pre-defined number of iterations.
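Equation (4.1) above can be checked numerically; the per-instruction energy values used here are placeholders, not measured data.

```cpp
#include <cassert>

// Equation (4.1): n iterations of the loop-control macro-executor (LD, DSUB, BR)
// plus the body macro-executor Y[i]=X[i]*a+Y[i] (LD, DSUB, SD, DMUL).
double for_loop_energy(int n, double e_ld, double e_dsub, double e_br,
                       double e_sd, double e_dmul) {
    double control = e_ld + e_dsub + e_br;          // loop-control macro-executor
    double body    = e_ld + e_dsub + e_sd + e_dmul; // body macro-executor
    return n * (control + body);
}
```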
After the C code is compiled, Wattch runs the benchmark and generates three outputs: sim_num_inst, sim_CPI, and energy_cycle, which represent the number of simulated instructions, the average CPI (cycles per instruction), and the energy consumption per cycle, respectively. The total energy consumed in executing the benchmark is computed from:

    Energy_benchmark = sim_num_inst × sim_CPI × energy_cycle    (4.2)

Similar to the method used in chapter 3 to eliminate the initialization execution time of the main() function, a code segment containing only the main() body is simulated to obtain the corresponding sim_num_inst_main, sim_CPI_main, and energy_cycle_main. To calculate the energy consumption of each macro-operation, the main() initialization energy is removed by:

    N × Energy_operation = Energy_benchmark - Energy_main    (4.3)

where N is the number of iterations a macro-operation in the benchmark takes in simulation and Energy_main is main()'s initialization energy, expressed by:

    Energy_main = sim_num_inst_main × sim_CPI_main × energy_cycle_main    (4.4)

Ideally, each macro-operation's energy consumption is linear in the benchmark's iteration count and passes through the origin. In this step, each macro-operation is simulated with iteration counts from 100 to 1000 in steps of 10. Figure 4.2 shows the energy simulation results for two macro-operations, OP_ALU_2 (2-operand ALU operations) and OP_MUL_2 (2-operand multiplier operations), for the PISA processor with a hardware multiplier. Both are approximately linear in the iteration count, with ~0.1% average non-linearity and a non-zero offset, which originates from the inherent operation mechanism of embedded microprocessors with caches: at the start of program execution the cache is empty, the miss rate is nearly 100%, and filling the cache from external memory incurs additional energy.
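Equations (4.2)-(4.4), together with the least-squares fit used to turn the swept energy samples into a per-iteration model, can be sketched as follows; all inputs are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Eqs. (4.2)/(4.4): total energy from the three Wattch outputs.
double total_energy(double sim_num_inst, double sim_cpi, double energy_cycle) {
    return sim_num_inst * sim_cpi * energy_cycle;
}

// Eq. (4.3): remove the main() initialization energy and divide by N.
double macro_op_energy(double e_benchmark, double e_main, int n_iters) {
    return (e_benchmark - e_main) / n_iters;
}

// Ordinary least-squares fit E = a*iters + b over the characterization sweep;
// the slope a is the per-iteration macro-operation energy and the offset b
// absorbs the cold-cache start-up cost.
void linear_fit(const std::vector<double>& x, const std::vector<double>& y,
                double& a, double& b) {
    double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    b = (sy - a * sx) / n;
}
```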
After that, the looped benchmark program accesses the cache frequently, resulting in a linear increase in energy consumption. We employ regression-based linear fitting, minimizing the mean square error over these pre-computed energy data, to estimate each macro-operation's energy consumption.

Figure 4.2(a) Energy of the 2-operand ALU operation versus iteration count
Figure 4.2(b) Energy of the 2-operand multiplier operation versus iteration count

With these characterized macro-operation energy data, an energy lookup table (LUT) is composed in a similar way to the execution time model. Each PE architecture corresponds to a dedicated LUT, where the same macro-operations, represented by a set of integers, designate the LUT indexes. The characterized energy model for each macro-operation corresponds to a LUT index and can be accessed by providing the index as the macro-operation type during simulation. To estimate the energy consumption of an application executed on a specific microprocessor, each involved macro-operation is annotated with its energy model. Each macro-operation's energy model is accessed by table lookup after running the macro-operation, and the corresponding energy number is retrieved. A global energy variable accumulates all the incurred energy and evaluates the total energy consumption during application execution.

4.2.2. Experimental Results of the Processor Macro Power Model

The performance and accuracy of the energy model are validated with a media processing application, a JPEG encoder. JPEG [21] is a lossy compression algorithm used extensively for still-picture compression. The four computation kernels in the JPEG encoder algorithm, DCT, quantization, zigzag and VLC, execute sequentially, and DCT is more computation-intensive than the other three.
We use the JPEG source code [22] from UCLA and map it onto PISA, combining the energy model under SystemC. This JPEG encoder is simulated to encode pictures of different sizes from 64x64 to 512x512. For comparison, the JPEG source code is compiled to a PISA binary and simulated with Wattch. Meanwhile, the code is also compiled with gcc and run on the same Solaris platform to verify the functionality. Table 4.1 presents the experimental results of energy estimation from the macro-model methodology and Wattch. The first column lists the picture names and sizes, followed by the energy estimation results from Wattch (eng_W) and the macro-model energy estimator (eng_M). The last column shows the relative errors compared to the Wattch simulation results. Table 4.2 describes the execution time of each simulation model. The columns t_C, t_SC and t_W represent the run time of the pure functional run without the energy model in C, the functional and energy simulation with PISA's macro energy model in SystemC, and the functional/energy simulation with PISA's architecture-level energy model in Wattch, respectively. The last column shows the speedup of the energy macro-model compared with Wattch.

Table 4.1. Energy simulation results for JPEG on the PISA architecture

    picture          eng_W (pJ)     eng_M (pJ)     error (%)
    bunny (64x64)    5.243x10^8     4.784x10^8     -8.75
    name (128x128)   1.972x10^9     1.854x10^9     -5.96
    sun (256x256)    7.374x10^9     6.977x10^9     -5.38
    face (512x512)   2.916x10^10    2.787x10^10    -4.42

Table 4.2. Simulation times

    picture          t_C (s)    t_SC (s)    t_W (s)    speedup
    bunny (64x64)    0.0015     0.149       31         208.05X
    name (128x128)   0.0191     0.462       115        248.92X
    sun (256x256)    0.089      1.631       438        268.55X
    face (512x512)   0.372      6.288       1730       275.2X

We observe from the simulation results of both Wattch and the macro-model energy estimator that the microprocessor energy increases approximately linearly with run time as the picture size increases, as Table 4.1 shows.
This result is expected, since the number of instructions and macro-operations executed to encode a picture increases proportionally with its size. The macro-model estimator achieves good precision, with relative errors ranging from -4.42% to -8.75% and an average of -6.70% over the 4 test pictures of different sizes. The under-estimation of the energy results from not modeling inter-macro-operation effects and I/O functions such as fprintf() and printf(), which consume energy as well. In many embedded applications, such I/O functions may not be used except for debugging. As compensation for these errors, which are acceptable in high-level design stages without much architectural information, this framework's simulation time is reduced dramatically, with speedups from 208.05X to 275.2X compared with Wattch. The speedup comes from eliminating the evaluation of each executed instruction during simulation by combining multiple instructions into one macro-operation and invoking the model library directly.

4.2.3. Application-Specific Hardware Accelerator Power Macro-Model

Unlike the programmable microprocessor, the ASIC core is frequently used as a hardware accelerator for computationally intensive tasks on the SoC to increase performance and save power. With a dedicated data path fine-tuned to the requirements of the specific application, a hardware accelerator reduces the processing load of the main processor, in turn lowering its MIPS requirement to meet the target throughput. The reduced processor MIPS requirement brings the benefit of further power savings from a lower operating frequency, and possibly a lower voltage if DVFS is enabled.
The customized data path of an SoC hardware accelerator complicates an efficient power macro-modeling methodology, because the "instruction" concept of the microprocessor does not apply to this type of PE: its operations cannot be categorized by an instruction set as was done in 4.2.1. Moreover, the broad range of SoC hardware accelerators makes it difficult to characterize their power properties with one common set of basic macro-operations. As an alternative, if we narrow the range to hardware accelerators sharing common features such as data path type, architecture, or memory allocation, it remains feasible to extract a set of atomic operations from which all of the hardware's functionality is composed. In this work, we target the digital filter hardware accelerators (DFHA) presented in 2.2.4 when building the hardware accelerator's power model. A DFHA performs FIR/IIR algorithms in many media processing applications, where multiplication/MAC, expressed in the form of a sum of a x b products, is the most intensively used operation. In multi-tap FIR/IIR hardware accelerators, multiple MACs work in parallel, as shown in Figure 4.3(a), to reduce the cycle count for computing each input sample. This scheme requires two more basic operations in the hardware accelerator:

(1) 2N-bit adders for intermediate-result accumulation among the multiple NxN multipliers;
(2) 2N-bit registers to enable pipelined processing and maintain the data throughput.

Figure 4.3. Parallel FIR architecture with 3 pipeline stages providing 1 sample/cycle throughput

Pipelining in the hardware accelerator is very useful in real designs, since the insertion of multiple stages of partial-result accumulators reduces the length of the critical path.
Although the pipelined data path still has a multi-cycle processing latency, it achieves a throughput of one sample per cycle, which is essential for most filter applications with stringent sampling-rate requirements. The three basic DFHA operations, ADD, REG and MPY, are characterized for power consumption on the available IBM 180nm CMOS process using Synopsys RTL power tools. During characterization, the operating frequency is set to 200MHz with a 50% activity factor, and the supply voltage is 1.8V. The power consumption of each macro-operation is estimated at 7 different data widths from 8 to 32 bits. Based on the characterized power at the different data path widths, a linear fit is applied to the ADD and REG operations and a third-order polynomial fit to the MPY operation to derive a regression power model. Figures 4.4(a), (b) and (c) show the characterized power versus input data width of the MPY (multiplier), ADD (adder) and REG (register) operations, together with their regression curve-fitting results. The power consumption of each macro-operation is clearly highly correlated with its data width. The RMS errors for the ADD, REG and MPY operations after curve fitting are 2.04%, 2.46% and 1.98%, respectively. Three other factors are taken into consideration when building the power macro-model: clock frequency (f), operating voltage (Vdd) and activity factor (a).
Given a hardware architecture, dynamic power is proportional to f, a, and the square of Vdd, so the generic dynamic power models for the ADD, REG and MPY operations on the IBM 180nm CMOS process can be derived as:

P_ADD(f, Vdd, a, w) = (f/200) * (Vdd/1.8)^2 * (a/0.5) * P_ADD(200, 1.8, 0.5, w)
                    = (f/200) * (Vdd/1.8)^2 * (a/0.5) * (23.519w - 8.1901)
                    = (0.072w - 0.025) * f * Vdd^2 * a        (4.5)

P_REG(f, Vdd, a, w) = (f/200) * (Vdd/1.8)^2 * (a/0.5) * P_REG(200, 1.8, 0.5, w)
                    = (f/200) * (Vdd/1.8)^2 * (a/0.5) * (8.794w - 0.0068)
                    = (0.027w - 0.00002) * f * Vdd^2 * a        (4.6)

P_MPY(f, Vdd, a, w) = (f/200) * (Vdd/1.8)^2 * (a/0.5) * P_MPY(200, 1.8, 0.5, w)
                    = (f/200) * (Vdd/1.8)^2 * (a/0.5) * (0.0037w^3 + 6.208w^2 + 6.539w + 4.175)
                    = (0.0001w^3 + 0.019w^2 + 0.017w + 0.013) * f * Vdd^2 * a        (4.7)

Figure 4.4. Characterized power versus data width with regression fits: (a) ADD operation (linear fit y = 23.519x - 8.1901), (b) REG operation (linear fit y = 8.7938x - 0.006893), (c) MPY operation (cubic fit), each compared against the DC estimates

(4.5), (4.6) and (4.7) form the basic power model of the DFHA used in signal processing data paths. They are four-variable functions of operating frequency, supply voltage, activity factor and data width. These models are derived on the available IBM 180nm process, and they are scaled to other predictive advanced CMOS technologies by the following capacitance scaling methodology.
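Before the technology-scaling step, the fitted models (4.5)-(4.7) can be written down directly as functions. A minimal Python sketch follows (frequency in MHz to match the 200 MHz characterization point; the tiny REG constant follows the fit in (4.6)):

```python
def p_add(f_mhz, vdd, a, w):
    """Eq. (4.5): dynamic power of a w-bit adder, characterized at
    200 MHz / 1.8 V / 50% activity on the IBM 180nm process."""
    return (0.072 * w - 0.025) * f_mhz * vdd**2 * a

def p_reg(f_mhz, vdd, a, w):
    """Eq. (4.6): dynamic power of a w-bit register."""
    return (0.027 * w - 0.00002) * f_mhz * vdd**2 * a

def p_mpy(f_mhz, vdd, a, w):
    """Eq. (4.7): dynamic power of a w x w multiplier (cubic in w)."""
    return (0.0001 * w**3 + 0.019 * w**2 + 0.017 * w + 0.013) * f_mhz * vdd**2 * a
```

By construction, each model scales linearly with f and a and quadratically with Vdd; for example, halving Vdd from 1.8 V to 0.9 V quarters the estimated power at fixed f, a and w.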
From (4.5), (4.6) and (4.7), each operation's power macro-model has the regression form:

P(f, Vdd, a, w) = F(w) * f * Vdd^2 * a        (4.8)

According to the CMOS circuit's dynamic power consumption:

P(f, Vdd, a) = (1/2) * C * f * Vdd^2 * a        (4.9)

the switching capacitance under a 100% activity factor is estimated as:

C(w) = 2F(w) = 2(a0*w^3 + a1*w^2 + a2*w + a3)        (4.10)

where w is the data path width and {a0, a1, a2, a3} are the regression coefficients from the curve fitting. (4.10) is the statistical switching-capacitance model for the data path elements used in the DFHA, and it is scaled to other predictive technology nodes according to the ITRS roadmap [6]. For each technology node with m nm feature size, the statistical power model is built as P(f, Vdd, a, w, m) = beta * P(f, Vdd, a, w, 180), where beta is the scaling factor derived from the ITRS report [6], and P(f, Vdd, a, w, 180) is the regression power model on the characterized IBM 180nm CMOS process. Tables 4.3-4.5 present the coefficients {a0, a1, a2, a3} for the different technology nodes after capacitance scaling. As feature size decreases in the advancing sub-100nm technologies, the capacitances of transistors and local wires scale with the feature size, while those of global wires do not. Since the wires in these hardware accelerators are frequently implemented as local wires for better timing closure and power consumption, the proposed DFHA statistical power model remains effective when applied to those technologies.

4.2.4. Experimental Results of the DFHA Power Macro-Model

Due to the lack of technology libraries for the advanced nodes, the proposed statistical power model is verified only against the DC-estimated power on the available 180nm CMOS technology provided by IBM. In this experiment, the 3-stage pipelined 8-tap FIR shown in Figure 4.3 is used for evaluation. This filter architecture contains eight NxN multipliers in the first stage, six 2N-bit cascaded adders in the second stage and one 2N-bit adder in the third stage.
Two pipeline registers, with data widths of 16N and 4N bits, separate these three pipe stages. The flip-flops for delay-element storage at the input are not explicitly shown here. This architecture finishes an N-bit sample's 8-tap FIR calculation in three cycles, and it sustains one sample/cycle throughput once the pipeline is fully utilized.

Table 4.3. Regression coefficients of the DFHA power macro-model for the ADD operation

coefficient   IBM 180nm   BPTM 130nm   BPTM 90nm   BPTM 65nm   BPTM 45nm
a0            0           0            0           0           0
a1            0           0            0           0           0
a2            0.072       0.0648       0.0547      0.0504      0.0432
a3            -0.025      -0.0225      -0.019      -0.0175     -0.015

Table 4.4. Regression coefficients of the DFHA power macro-model for the REG operation

coefficient   IBM 180nm   BPTM 130nm   BPTM 90nm   BPTM 65nm   BPTM 45nm
a0            0           0            0           0           0
a1            0           0            0           0           0
a2            0.027       0.0243       0.0205      0.0189      0.0162
a3            -0.002      -0.0018      -0.0015     -0.0014     -0.0012

Table 4.5. Regression coefficients of the DFHA power macro-model for the MPY operation

coefficient   IBM 180nm   BPTM 130nm   BPTM 90nm   BPTM 65nm   BPTM 45nm
a0            0.0001      0.00009      0.000076    ~0          ~0
a1            0.019       0.0171       0.0144      0.0133      0.0114
a2            0.017       0.0153       0.0129      0.0119      0.0102
a3            0.013       0.0117       0.0099      0.0091      0.0078

The power consumption of this FIR DFHA is estimated with the proposed regression power model, and the result is compared with the Synopsys-estimated power consumption. DFHAs with four different data widths (8, 14, 18 and 24 bits) are evaluated. Figure 4.5(a) presents the power results estimated by the regression model and by the Synopsys tool, while Figure 4.5(b) presents the run time, in seconds, for the two methods.
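Combining equation (4.8) with the coefficient tables, the power of the whole 8-tap FIR can be sketched by summing the per-operation models over the operation counts of Figure 4.3. The operation counts and the helper below are illustrative assumptions, not the framework's actual code:

```python
# {a0, a1, a2, a3} per macro-operation, IBM 180nm columns of Tables 4.3-4.5.
COEFF_180NM = {
    "ADD": (0.0, 0.0, 0.072, -0.025),
    "REG": (0.0, 0.0, 0.027, -0.002),
    "MPY": (0.0001, 0.019, 0.017, 0.013),
}

def op_power(op, w, f_mhz=200.0, vdd=1.8, af=0.5, beta=1.0):
    """Eq. (4.8): P = F(w) * f * Vdd^2 * a, with an optional ITRS-derived
    technology scaling factor beta (beta = 1 for the 180nm process)."""
    a0, a1, a2, a3 = COEFF_180NM[op]
    f_w = a0 * w**3 + a1 * w**2 + a2 * w + a3
    return beta * f_w * f_mhz * vdd**2 * af

def fir8_power(n, **kw):
    """3-stage pipelined 8-tap FIR of Figure 4.3: eight NxN multipliers,
    seven 2N-bit adders, and pipeline registers of 16N and 4N bits."""
    return (8 * op_power("MPY", n, **kw)
            + 7 * op_power("ADD", 2 * n, **kw)
            + op_power("REG", 16 * n, **kw)
            + op_power("REG", 4 * n, **kw))
```

Evaluating fir8_power at the four data widths used in the experiment (8, 14, 18 and 24 bits) gives regression-model estimates of the kind plotted in Figure 4.5(a), under the stated assumptions.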
The proposed statistical hardware-accelerator power modeling methodology achieves relative errors of 4.08%, 4.22%, 4.27% and 4.14% on the DFHA data paths with widths of 8, 14, 18 and 24 bits, respectively, compared with the Synopsys simulation results, showing that the curve-fitting method can estimate SoC hardware-accelerator power consumption with better than 95% accuracy for arbitrary data widths of these FIR-type architectures. We argue that the ~4% overestimation results from two factors: (1) the intrinsic error of the polynomial curve fit, whose optimization minimizes the RMS error; global RMS-error minimization leaves some residual error at the discrete sampling points, which are the data widths in this case; (2) the Synopsys tool performs logic optimizations when compiling the design to a gate netlist before running power estimation, and these optimizations reduce the gate count, in turn lowering the estimated power consumption. If more architecture-level blocks, such as muxes and memories, were characterized in the proposed way, the power consumption of hardware blocks with more complex architectures could be estimated without time-consuming RTL- or gate-level synthesis. Another significant merit of this modeling methodology is estimation speed. The Synopsys tool's run time grows with data width because of the computational overhead of elaborating all the inferred cells during power estimation, whereas the proposed statistical macro-model is based on the regression curve-fit model without elaborating the design in full hierarchy, so its simulation speed is constant for DFHAs of different sizes. The simulation times measured on a Sun Solaris server show that the macro-model achieves a speedup of ~205.1X on average compared with synthesis-based power estimation.
Figure 4.5. (a) Power macro-model versus Synopsys dynamic power estimation results on the FIR data path; (b) macro-model simulation speedup, for FIR data widths of 8, 14, 18 and 24 bits

Combining the processor and DFHA macro power models presented in this section, we can explore the energy behavior of the PEs in the SoC with very high simulation speed and good precision. The next step is to create the power model for the SoC interconnection infrastructure, the NoC.

4.3. NoC Architectural Power Model

For modeling flexibility and scalability, each router or link class has a dedicated associated power model class gluelessly connected to the NoC performance model, as the dotted rectangle in Figure 4.6 shows. Each power model class has two main member functions: one estimates the leakage power, and the other calculates the dynamic power. The router's power model is instantiated and initialized in each router's constructor, as is the link's. The architectural and technological parameters of the NoC are in turn passed to these models for power estimation. Since dynamic power depends on the data-switching characteristics, the flit payload involved in each transaction is passed to the power model as well. During simulation, the communication fabric (router or link) involved in a transaction invokes the power estimation functions of its power model to calculate the power. The use of power model classes makes the model modular and easy to refine, and it conforms to the object-oriented modeling methodology applied throughout the whole SoC simulation framework.
Another feature of this power modeling framework is its support for temporal and spatial power profiling at run time. Power profiling describes the power implications of an application dynamically in terms of time and location, and it helps designers locate potential hot spots when simulating applications. In this work, a power monitor is implemented by sampling the power data of each router and link at a specified time interval T_sample, as shown in Figure 4.6. The sampling interval can be configured by the designer for different applications. From the sampled data, the average power can be calculated by:

P_avg = ( sum_{i=0}^{S-1} P_i^tran * T_sample ) / ( S * T_sample )        (4.11)

where P_i^tran is the transient power (dynamic or leakage) at the ith sample, and S is the number of samples.

Figure 4.6. NoC hierarchical power model (a mesh configuration with X_size = Y_size = 4; OCA router and link classes each bound to power model classes that are parameterized by the configuration and technology data)

4.3.1. Power Model of the Router's Input/Output Buffer

A router has the following main components: input/output buffers, a crossbar and an arbiter, each of which has a dynamic and a leakage power model. For the input/output buffer, we model its power properties by analyzing the SRAM array that implements the main storage functionality. Studies have shown that the input buffers, typically structured as FIFOs, are potentially the single largest energy consumer in a typical router [5]. Figure 3.5 shows the block diagram of a typical dual-port SRAM-based N x S_flit FIFO that can store at most N flits (words), each containing S_flit bits. On a write (read), the flit is written into (read from) the appropriate row of the SRAM, with the row indexing based on addresses generated from a write (read) counter.
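Since the sampling interval is uniform, T_sample cancels in (4.11) and the average power reduces to the mean of the sampled transient values; a minimal sketch:

```python
def average_power(samples, t_sample):
    """Eq. (4.11): sum_i(P_i * T_sample) / (S * T_sample); with a uniform
    interval this reduces to the arithmetic mean of the samples."""
    s = len(samples)
    return sum(p * t_sample for p in samples) / (s * t_sample)
```

For example, average_power([0.4, 0.6, 0.5], 5e-6) returns 0.5 regardless of the interval chosen.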
Figure 4.7. Architecture of the dual-port SRAM cell

We refer to [25] to create the router's dynamic energy model, with some modifications to interface with our NoC transaction-level model. Figure 4.7 shows a dual-port SRAM cell inside the input buffer. N1, N2, P1 and P2 compose the storage element; N3 and N4 form the write control, while read control is implemented by N5 and N6. P3 and P4 are PMOS transistors that precharge the bitlines before a read operation. Two types of SRAM operation consume dynamic power. In a read, the dynamic power comprises the switching of the wordline, precharge and bitline capacitances, as (4.12), (4.13) and (4.14) describe:

C_wordline(read) = S_flit * [C_g(N5) + C_g(N6)] + C_w(L_word, M1)        (4.12)
C_precharge = C_g(P3)        (4.13)
C_bitline(read) = 2 * [N_row * C_d(N5) + C_d(P3) + C_w(L_bit, M2)]        (4.14)

The wordline, bitline and memory-cell capacitances contribute to a write access, as shown in (4.15), (4.16) and (4.17):

C_wordline(write) = 2 * S_flit * C_g(N3) + C_w(L_word, M1)        (4.15)
C_bitline(write) = 2 * [N_row * C_d(N3) + C_w(L_bit, M2)]        (4.16)
C_cell = 2 * [C_d(N5) + C_d(N3) + C_a(P1) + C_a(N1)]        (4.17)

where C_g(x) and C_d(x) are the gate and drain capacitances of transistor x, respectively, and C_a(x) = C_g(x) + C_d(x) as [25] defines. N_row is the number of rows in the FIFO, and C_w(L, Mn) is the capacitance of a wire of length L laid out on metal layer Mn; conventionally, M1 is used for wordlines and M2 for bitlines. The router's dynamic power for read and write is computed as:

P_read = P_wordline(read) + 2 * S_flit * [P_precharge + P_bitline(read) + P_sense]        (4.18)
P_write = P_wordline(write) + m * [P_bitline(write) + P_cell]        (4.19)

where P_i = a_i * C_i * Vdd^2 * f is component i's dynamic power, and m is the number of switching bits on the wordline during a write access, which is available during simulation.
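Equations (4.12)-(4.19) reduce to sums of a*C*Vdd^2*f terms once the capacitances are known. A sketch follows, with hypothetical placeholder capacitance values (the framework derives them from transistor geometry and metal wire lengths):

```python
def dyn_power(c, vdd, f, a=1.0):
    """P = a * C * Vdd^2 * f for one switched capacitance component."""
    return a * c * vdd**2 * f

def sram_read_power(c_word_rd, c_prech, c_bit_rd, p_sense, s_flit, vdd, f):
    """Eq. (4.18): wordline switching plus, per flit bit, the precharge,
    bitline and sense-amplifier contributions of the two bitlines."""
    return dyn_power(c_word_rd, vdd, f) + 2 * s_flit * (
        dyn_power(c_prech, vdd, f) + dyn_power(c_bit_rd, vdd, f) + p_sense)

def sram_write_power(c_word_wr, c_bit_wr, c_cell, m, vdd, f):
    """Eq. (4.19): m is the number of bits that actually switch on the
    written word, known during simulation."""
    return dyn_power(c_word_wr, vdd, f) + m * (
        dyn_power(c_bit_wr, vdd, f) + dyn_power(c_cell, vdd, f))
```

With m = 0 (rewriting identical data) only the wordline term remains, matching the data-dependent intent of (4.19).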
As [69] states, the main part of a CMOS circuit's leakage power results from the subthreshold current, the static current flowing between a transistor's drain and source when it is in the off state. The subthreshold current can be calculated by the following equations [73]:

I_s0 = u0 * (W/L) * sqrt( q * e_si * N_ch / (2*phi_s) ) * V_t^2        (4.20)

I_sub = I_s0 * [1 - exp(-V_ds / V_t)] * exp( (V_gs - V_th - V_off) / (n*V_t) )
      = u0 * (1/L) * sqrt( q * e_si * N_ch / (2*phi_s) ) * V_t^2 * exp( (-V_th - V_off) / (n*V_t) ) * W   (at V_gs = 0, V_ds = Vdd)
      = I_x0 * W        (4.21)

The methodology used to model leakage differs from that proposed in [69] in that our analytical model is derived from transistor-level SRAM circuits. We assume that a MOSFET's subthreshold leakage is zero when it is not in the off state, and identify the off-state transistors in each SRAM operation to calculate the total leakage current. [74] proposes similar work analyzing a one-port SRAM; here we extend it to the dual-port FIFO cell. Compared with the two-state assumption of the FIFO's dynamic power model, three states are taken into account, including an additional "idle" state in which no operation takes place, since leakage in the idle state exceeds that of the other two states due to the maximum number of off-state transistors. In the leakage model, W_i is the width of transistor i, and I_N0 and I_P0 are the unit-width leakage currents of NMOS and PMOS as given in (4.21).

(1) When the SRAM cell is in the "idle" state, the bitlines are precharged to '1', write = read = '0', and din (din_b) is '0' ('1') or '1' ('0') with 50% probability. If din = '0', transistors N1, N3, N5 and P2 are in the off state; only N1, N5 and P2 are off when din = '1':

I_leak,din=0 = (W_N1 + W_N3 + W_N5) * I_N0 + W_P2 * I_P0        (4.22)
I_leak,din=1 = (W_N1 + W_N5) * I_N0 + W_P2 * I_P0        (4.23)

So the SRAM array has an idle-state leakage of:

I_leakage,idle = N_row * N_col * [ (W_N1 + 0.5*W_N3 + W_N5) * I_N0 + W_P2 * I_P0 ]        (4.24)

(2) When the SRAM cell is in the "read" state, all write signals are kept at '0', while the read signal of the selected row is driven to '1' and the others remain at '0'.
For the cell that is deselected during a read operation:

I_leak,read=0 = (W_N1 + 0.5*W_N3 + W_N5) * I_N0 + W_P2 * I_P0        (4.25)

and (4.26) estimates the leakage for the cell that is selected during the read:

I_leak,read=1 = (W_N1 + 0.5*W_N3) * I_N0 + W_P2 * I_P0        (4.26)

The SRAM array then has a read-state leakage of:

I_leakage,read = N_row * N_col * [ (W_N1 + 0.5*W_N3) * I_N0 + W_P2 * I_P0 ] + (N_row - 1) * N_col * W_N5 * I_N0        (4.27)

(3) In the "write" state, one of the write signals is asserted to '1' while the others are kept at '0'; all read signals stay at logic '0', and the bitlines are precharged to logic '1'. The off-state transistors depend on the current din (din_b) and the previous data stored in the SRAM cell. When write = '0', the leakage current is:

I_leak,write=0 = (W_N1 + 0.5*W_N3 + W_N5) * I_N0 + W_P2 * I_P0        (4.28)

Otherwise,

I_leak,write=1 = (W_N1 + W_N5) * I_N0 + W_P2 * I_P0        (4.29)

Overall, the SRAM array's leakage during a write access is:

I_leakage,write = N_row * N_col * [ (W_N1 + W_N5) * I_N0 + W_P2 * I_P0 ] + (N_row - 1) * N_col * 0.5*W_N3 * I_N0        (4.30)

From (4.24), (4.27) and (4.30), the leakage power of the input buffer is derived as:

P_leakage,i = Vdd * I_leakage,i        (4.31)

where I_leakage,i is the leakage current in operation state i, with i in {idle, write, read}.

4.3.2. Power Model of the Router's Crossbar

The crossbar steers flits from the router's input ports to the routed output ports. The NoC framework employs a matrix-architecture crossbar with N_i input ports and N_o output ports, shown in Figure 4.8(a); each port has the same data width of S_flit bits. The crossbar's input ports are connected to the router's input buffers, and its output ports are connected to the router's outputs. Pass transistors located at the intersections of the input and output lines control the data flow with the aid of the routing logic unit and the arbiter. A matrix crossbar therefore has S_flit x N_i x N_o pass transistors, each of which controls the connectivity of one output with a specific input.
At any time, only one input can be granted for each output port. Similar to the input buffer, the power model of the crossbar is constructed by deriving a switching-capacitance model for its basic operation. Consider one bit slice of the S_flit-bit-wide crossbar with N_i inputs and N_o outputs. Let C_in_int, C_out_int and C_ctr_int be the capacitances at the input, output and control terminals of an intersection, and let N denote the NMOS pass transistor:

C_in_int = C_d(N),   C_out_int = C_d(N),   C_ctr_int = C_g(N)        (4.32)

Assume that in the matrix-crossbar layout the horizontal and vertical spacings between two adjacent intersection nodes are w_t and h_t. The corresponding wire lengths for the input and output lines can be estimated by:

L_in = N_o * S_flit * w_t        (4.33)
L_out = N_i * S_flit * h_t        (4.34)

So the input and output capacitances of one bit slice of the matrix crossbar are:

C_bc_in = N_o * C_in_int + [C_d(T_in) + C_g(T_in)] + C_w(L_in)        (4.35)
C_bc_out = N_i * C_out_int + [C_d(T_out) + C_g(T_out)] + C_w(L_out)        (4.36)

where T_in and T_out are the input and output drivers. When there is a data transition on an input or output port, the corresponding effective capacitance C_bc_in or C_bc_out is charged or discharged, resulting in dynamic power consumption. Assume the current data is latched on each port until the next data is available. For a switching operation in which delta_xi input bits and delta_xo output bits change value, the effective switching capacitance is:

C_eff = delta_xi * C_bc_in + delta_xo * C_bc_out        (4.37)

and the dynamic power consumption is:

P = (1/2) * AF * Vdd^2 * (delta_xi * C_bc_in + delta_xo * C_bc_out) * f        (4.38)

where AF is the crossbar's activity factor, which can be characterized from low-level simulations. In this model, we assume AF = 0.5.
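The data-dependent part of (4.37)-(4.38) is driven by the switched-bit counts delta_xi and delta_xo, which can be obtained as Hamming distances between consecutive port values. A sketch with placeholder capacitance values:

```python
def hamming(a, b):
    """Number of differing bits between two flit values (non-negative ints)."""
    return bin(a ^ b).count("1")

def crossbar_dyn_power(in_prev, in_cur, out_prev, out_cur,
                       c_bc_in, c_bc_out, vdd, f, af=0.5):
    """Eqs. (4.37)-(4.38): the switched input/output bits charge or
    discharge the per-bit-slice capacitances C_bc_in / C_bc_out."""
    c_eff = hamming(in_prev, in_cur) * c_bc_in \
          + hamming(out_prev, out_cur) * c_bc_out      # eq. (4.37)
    return 0.5 * af * vdd**2 * c_eff * f               # eq. (4.38)
```

Identical consecutive data yields zero dynamic power, as expected from the model.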
Figure 4.8. (a) Matrix crossbar capacitance model; (b) active state of the matrix crossbar

Similar to the SRAM's leakage power model, different operating states are taken into consideration for the matrix crossbar's leakage power. The main contributors to leakage in the crossbar are the pass transistors. Two states, idle and open, are analyzed to estimate the crossbar's leakage power.

(1) When the crossbar is in the "idle" state, all control signals applied to the gates of the pass transistors are '0'. All pass transistors are then in the off state and contribute to the total leakage:

I_leak,idle = N_i * N_o * W_N * I_N0 * S_flit        (4.39)

where W_N and I_N0 are the pass transistor's width and the unit-width leakage current of an NMOS transistor, respectively.

(2) When the crossbar is working, there is only one open pass transistor in each input row and each output column, as shown in Figure 4.8(b); each row thus has (N_o - 1) pass transistors in the off state, and each column has (N_i - 1). Let OutDat_j be the flit latched on output port j and InDat_i the flit from input port i; the Hamming distance H(InDat_i, OutDat_j) gives the number of differing bits between the two. For each input port i:

I_leak,i = W_N * I_N0 * sum_{j=1}^{N_o} H(InDat_i, OutDat_j)        (4.40)

Taking all input rows into consideration and adding the leakage of the input/output drivers, the crossbar leakage can be estimated by:

I_leak,open = W_N * I_N0 * sum_{i=1}^{N_i} sum_{j=1}^{N_o} H(InDat_i, OutDat_j) + (1/2) * (W_drv,P * I_P0 + W_drv,N * I_N0) * S_flit * (N_i + N_o)        (4.41)

where W_drv,P and W_drv,N are the PMOS and NMOS widths in the input/output drivers.

4.3.3. Power Model of the Output Arbiter

Figure 4.9(a) shows a canonical micro-architecture of a fair-sharing output arbiter, which is characterized by the number of requests per arbiter, R. The arbiter's request number equals the number of the crossbar's input ports.
The arbitration operation of the matrix arbiter can be divided into three atomic operations, request evaluation, priority update and grant generation, which are employed to model its dynamic power consumption. First, the incoming header flit on each input port generates a request, which is evaluated together with the request signals from the other inputs to produce a grant signal based on the current priority list. The priority list is then updated cyclically to achieve arbitration fairness.

Figure 4.9. (a) Structure of the matrix arbiter, with the R(R-1)/2 priorities m_ij stored in flip-flops and grant generation logic per request; (b) default transistor sizes (T_n1: W_N = 13.5, W_P = 76; T_n2: W_N = 13.5, W_P = 76; T_i: W_N = 12.5, W_P = 25, all in units of lambda)

For the request evaluation component, the switching capacitance contributing to the dynamic power can be divided into three sub-components:

(1) The gate capacitance of the first-level NOR gates T_n1. A request signal is involved in computing all the other grants; besides the inverter connected to this request signal, there are (R-1) NOR gate inputs, for a total capacitance of (R-1)*C_g(T_n1).
(2) The gate capacitance of the second-level multi-input NOR gate T_n2, which is driven not by the request signal directly but by its complement from T_i. Its capacitance is C_g(T_n2).
(3) The inverter T_i, which generates the complemented request signal for grant computation. Its capacitance is C_a(T_i) = C_g(T_i) + C_d(T_i).

So the capacitance switched during request evaluation can be estimated by:

C_request = (R-1) * C_g(T_n1) + C_g(T_n2) + C_a(T_i)        (4.42)

For the priority update, two capacitance components are taken into consideration:

(1) The gate capacitance of the first-level NOR gates T_n1. Each priority signal represents the priority relationship between two requesters.
So each priority signal is used to compute two grants, and its gate capacitance is 2*C_g(T_n1).
(2) The flip-flop switching capacitance C_FF, since the priorities are stored in flip-flops.

The switching capacitance during a priority update is:

C_priority = 2*C_g(T_n1) + C_FF        (4.43)

For grant generation, two capacitance components contribute to the dynamic power:

(1) The drain capacitance of the second-level NOR gate T_n2, which is C_d(T_n2).
(2) The crossbar control line capacitance C_ctr_int, since the grant signals are connected directly to the control lines.

Overall,

C_grant = C_d(T_n2) + C_ctr_int        (4.44)

Arbitration has one argument, request, the bit vector of all request signals, with each bit representing whether a request signal is on or off. The request vector can be acquired by evaluating each column of the traffic matrix representing the current traffic information in this router. For request evaluation, the number of switching requesters is simply the Hamming distance between req0, the previous request bit vector, and req, the current one. Hence the energy dissipated during request evaluation can be estimated by:

E_request = H(req, req0) * C_request * Vdd^2        (4.45)

For the priority update, suppose the priority bit vector before arbitration is pri0 and that after arbitration the priorities are updated to pri. The number of switching priority bits is then the Hamming distance between pri0 and pri. Each switching priority node switches once, so the energy dissipation is:

E_priority = H(pri, pri0) * C_priority * Vdd^2        (4.46)

For grant generation, each arbitration operation grants one requester. If an arbitration grants the same requester as the previous one, there is no activity on the grant capacitance; if it grants a different requester, one grant signal rises from '0' to '1' while another falls from '1' to '0'.
Let r0 and r be the ids of the previously and currently granted requesters; the energy consumption is then:

E_grant = 2 * C_grant * Vdd^2   if r != r0
E_grant = 0                     if r = r0        (4.47)

Taking request evaluation, priority update and grant generation together, the energy consumption of one arbitration operation of the matrix round-robin arbiter is:

E_arbiter = E_request + E_priority + E_grant        (4.48)

In the current router model, the transistor sizing parameters are chosen as shown in Figure 4.9(b).

4.3.4. Power Model of Interconnecting Wires

This section presents the analytical power model of the interconnecting wires. We assume that all interconnecting links in the NoC are treated as global wires laid out on the top metal layers (M5 and M6 in the 180nm 6-metal process), and that these global wires are pipelined and repeated uniformly. The dynamic power results from the signal switching of the wires and repeaters, while the leakage power originates mainly from the repeaters' subthreshold current. Unlike the delay model, the power model takes inter-wire coupling capacitance into consideration, and a simple data-dependent interconnecting-wire power model is derived. For a single interconnecting wire, the switching capacitance is:

C_wire = C_w * L * XOR(bit, bit')        (4.49)

where bit and bit' are the current and previous 1-bit data latched on the wire, and C_w and L represent the unit-length wire capacitance and the wire length, respectively. The other contributor to the wires' dynamic power is the inserted repeaters, whose switching capacitance is:

C_repeater = (b + 1) * w * C_g * (L / L_seg) * XOR(bit, bit') = r * C_w * L * XOR(bit, bit')        (4.50)

where b is the repeater's PMOS/NMOS width ratio, w its NMOS width, L_seg the repeated segment length, and r the resulting ratio of repeater capacitance to wire capacitance. For an S_flit-bit-wide interconnecting link with (L/L_seg - 1) uniformly distributed repeaters, the dynamic power consumption depends on the traversing data's switching pattern, represented by the Hamming distance between two consecutively transmitted flits.
From (4.49) and (4.50), we can derive:

P_link,dyn = (1/2) * Vdd^2 * f * AF * sum_{i=0}^{S_flit - 1} (C_wire^i + C_repeater^i)
           = (1/2) * Vdd^2 * f * AF * (1 + r) * C_w * L * D_H(x, y)        (4.51)

where Vdd is the link's supply voltage and f is the operating frequency, assumed to be the same as the router's. AF is the activity factor, assumed to be 50% in the system model. D_H(x, y) is the Hamming distance between two successive flits x and y flowing through this link during simulation. The leakage power model accounts for the repeaters' subthreshold leakage; in a CMOS inverter, either the NMOS or the PMOS is in the off state with 50% probability:

P_link,leakage = (1/2) * (W_P * I_P0 + W_N * I_N0) * (L / L_seg) * S_flit * Vdd        (4.52)

where I_P0 and I_N0 are the unit-width PMOS and NMOS leakage currents, respectively.

Figure 4.10. Interconnecting wires considering the (i,i+1) and (i,i-1) coupling effects

In the wire model, the unit-length wire capacitance C_w plays an important role in determining the switching capacitance during power modeling. The wire model employed in the current NoC framework takes inter-wire coupling capacitance into consideration for DSM processes. As VLSI design enters DSM technologies, the interconnecting wires have smaller widths and larger aspect ratios (height/width), and in most cases they are placed closer to each other than in older, larger-scale technologies. Moreover, chip area has increased, and so has the ratio of expected global link length to the cross-sectional area of the lines. Based on this, we derive a more detailed interconnecting-wire model for estimating the unit-length wire capacitance, for more accurate power estimation.
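Equations (4.51) and (4.52) can be sketched directly; the parameter values passed in below are hypothetical, and hamming() stands in for D_H:

```python
def hamming(a, b):
    """D_H: number of differing bits between two successive flits."""
    return bin(a ^ b).count("1")

def link_dyn_power(flit_prev, flit_cur, c_w, length, r, vdd, f, af=0.5):
    """Eq. (4.51): switching power of wires plus repeaters, driven by the
    Hamming distance of consecutive flits; r is the repeater-to-wire
    capacitance ratio of (4.50)."""
    return 0.5 * vdd**2 * f * af * (1 + r) * c_w * length * hamming(flit_prev, flit_cur)

def link_leak_power(w_p, w_n, i_p0, i_n0, length, l_seg, s_flit, vdd):
    """Eq. (4.52): subthreshold leakage of the repeaters on an S_flit-bit
    link; either device of each inverter is off with probability 1/2."""
    return 0.5 * (w_p * i_p0 + w_n * i_n0) * (length / l_seg) * s_flit * vdd
```

As with the crossbar, repeated transmission of identical flits dissipates no dynamic link power, while leakage scales linearly with the link width.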
Assume N wires form an N-bit link laid over a ground plane, with wire width w, height h above the plane, thickness t and spacing s, as shown in Figure 4.10. The total unit-length capacitance C_total of each wire i (i = 2, ..., N-1), except for the bordering ones, is composed of three components: the self-capacitance C_s between the wire and the ground plane, and the coupling capacitances C_(i,i+1) and C_(i,i-1) formed by the wire and its directly neighboring ones. Here we consider only the coupling effect between a wire and its immediate neighbors; for the bordering wires, the coupling capacitance has just one component contributing to the total wire capacitance. Overall, we have

C_total = C_s + 2 * C_(i,i+-1)   for 1 < i < N        (4.53)

where the self-capacitance C_s can be approximately modeled by:

C_s = e_ox * [ 1.15 * (w/h) + 2.80 * (t/h)^0.222 ]        (4.54)

Here e_ox is the dielectric constant of the insulator, such as SiO2. C_(i,i+-1) is the coupling capacitance between adjacent wires, where we consider only the cases j = i+1 or i-1:

C_(i,i+-1) = e_ox * [ 0.03 * (w/h) + 0.83 * (t/h) - 0.07 * (t/h)^0.222 ] * (s/h)^(-1.34)        (4.56)

Combining (4.56) and (4.53), it should be noted that

lim_{s -> infinity} ( C_s + C_(i,i+-1) ) = C_s        (4.57)

showing that the coupling effect can be neglected when the line separation s approaches infinity. Applying (4.53) and (4.54) to the interconnect power model (4.51), the coupling effect can be simulated.

4.4. Integrating Power Models into the SoC Simulation Framework

So far, the power/energy modeling methodologies for the SoC's functional components have been addressed. Because the PEs and the NoC use different power-model schemes, different methods are applied when integrating their power models into the SoC simulation framework.
Figure 4.11 Integrating power models with the SoC simulation framework (annotated tasks accumulate macro-operation energies through the PE energy library via the PSI; Power_SRAM, Power_cbar, and Power_arbiter attach to each OCA_Router, and Power_link attaches to each OCA_link, each providing calculate_*_dynamic_eng() and calculate_*_leakage_pwr() functions)

As shown in Figure 4.11, the energy implications of the application-driven PE models are implemented by annotating the experienced macro-operations' energy model along with their timing model. To enable the top-level SoC simulation framework to evaluate each PE's power/energy in-flight, each PE object employs a power-sampling interface (PSI), which includes a power-sampling sc_thread process and an energy data member Eng. During simulation, Eng is updated by the PSI by adding the energy values of executed macro-operations from the table lookup at the programmed interval T_eng. Assume ME is a task's macro-executor containing n macro-operations MO_1, MO_2, ..., MO_n, with each one consuming E(MO_i) (i=1,2,...,n) amount of energy. The energy consumed by executing ME is:

Δenergy = Σ_{i=1}^{n} E(MO_i)        (4.58)

For a task containing m macro-executors, the total energy consumption is evaluated by:

energy = Σ_{j=1}^{m} Σ_{i=1}^{n_j} E(MO_i)        (4.59)

The instantaneous power consumption can be estimated by the average power consumed between two consecutive energy-sampling events when the sampling period is small enough:

P_PE = energy_sampled / T_eng        (4.60)

Plotting a series of P_PE values versus the sampling time generates the temporal power profile for the PE, showing the core's power behavior when executing the mapped tasks dynamically.
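A minimal sketch of the table-lookup energy accumulation behind (4.58) and (4.60) follows; the macro-operation names and energy values are illustrative placeholders, not the characterized PE energy library.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical per-macro-operation energy table (values illustrative, in nJ).
std::map<std::string, double> energy_lut = {
    {"ALU_OP_2", 0.5}, {"MUL_OP_2", 1.5}, {"LD_OP", 0.8}};

// Eq. (4.58): energy of one macro-executor = sum of its macro-op energies.
double macro_executor_energy(const std::vector<std::string>& ops) {
    double e = 0.0;
    for (const auto& op : ops) e += energy_lut.at(op);
    return e;
}

// Eq. (4.60): average power over one sampling interval T_eng.
double sampled_power(double energy_sampled, double teng) {
    return energy_sampled / teng;
}
```

In the framework the PSI's sc_thread would call the equivalent of sampled_power() once per T_eng on the energy accumulated since the previous sample.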
Although the current system-level model doesn't consider the PE's leakage power, it is easy to refine this methodology by adding the empirical PE leakage power models proposed in [69] to support system-level leakage power estimation. Moreover, regarding the PSI interval T_eng, a smaller T_eng results in higher power-profiling resolution and vice versa. Figure 4.12(a) and (b) present the power-profiling results for the ADPCM application executed on a PowerPC405 with T_eng = 5us and T_eng = 20us, respectively. The different power resolutions at these two sampling rates can be clearly seen, although they yield similar average power. Although the smaller T_eng gives 4X better power resolution, it requires ~2.18X more simulation time due to the increased number of sampling time events evaluated by the SystemC kernel.

Figure 4.12. Power profile of ADPCM application on PowerPC405: (a) T_eng = 5us, (b) T_eng = 20us

For the NoC, the power consumption of each of its components is modeled by a dedicated power class (Power_SRAM, Power_cbar, and Power_arbiter for routers, and Power_link for interconnecting links), which implements the functions to calculate the dynamic/leakage power when a NoC operation happens. These functions are plugged into the corresponding NoC functional models in the SoC framework. When an operation takes place, e.g. a read from the input buffer to retrieve a flit, the buffer's output flit data along with the previously latched flit (assuming each input buffer latches its output until new data appears on the output data port) are passed to the power model function calculate_SRAM_dynamic_eng() of the Power_SRAM class to evaluate the energy consumed during this buffer read operation.
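The buffer-read hookup just described can be sketched as below. The class and member names mirror those in the text, but the per-bit energy constant and the toggle-count energy model are placeholder assumptions, not the framework's characterized SRAM model.

```cpp
#include <cstdint>

struct PowerSRAM {
    double e_per_toggle = 1.0e-12;  // J per toggled output bit (assumed value)

    // calculate_SRAM_dynamic_eng: energy of one buffer read, driven by the
    // switching between the previously latched flit and the new output flit.
    double calculate_SRAM_dynamic_eng(uint64_t prev_flit, uint64_t out_flit) const {
        uint64_t x = prev_flit ^ out_flit;
        int toggles = 0;
        while (x) { toggles += int(x & 1u); x >>= 1; }
        return toggles * e_per_toggle;
    }
};

struct Router {
    double energy = 0.0;   // energy data member updated by the power calls
    PowerSRAM psram;
    uint64_t latched = 0;  // each input buffer latches its last output

    // Called by the functional model whenever a flit is read from the buffer.
    void on_buffer_read(uint64_t out_flit) {
        energy += psram.calculate_SRAM_dynamic_eng(latched, out_flit);
        latched = out_flit;
    }
};
```

Reading the same flit twice in a row adds no dynamic energy, which is exactly the data-dependence the switching-based model captures.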
The evaluated energy, which is the return value of the power model function, updates a global data member energy defined in the router class. Similarly, the leakage power is evaluated by calling the Power_SRAM class's calculate_SRAM_leakage_pwr() with the operation type (READ, WRITE, or IDLE) and the temperature as arguments. During simulation, the updating of the energy data member of each router/link reflects its energy/power information. As with the sampling mechanism used in the PE model, a PSI is also employed to control the energy-sampling interval for temporal power profiling.

To evaluate the spatial distribution of each router/link's power at a specific time point, a power snapshot capability is developed alongside the temporal power profiling: each router/link's instantaneous power is dumped into a matrix whose elements represent the relative positions of the routers/links at the pre-configured time points. It is implemented by adding another sc_thread process sensitive to the current time stamp. This process is triggered when the current simulated time reaches the configured time points; it collects the instantaneous power information from each router and link and dumps it into a file, which can be plotted by third-party tools such as Matlab to show the NoC's spatial power properties.

4.5. NoC Power Simulation Experimental Results

Using the same shared Sun Solaris 5.9 server as in the experiments of the previous chapters as the simulation platform, we simulate several synthetic traffic patterns and a real application, M-JPEG, to evaluate the NoC and SoC power implications under different process technologies using the SoC simulator equipped with the PE and NoC power models. Six predictive CMOS process models from the Berkeley Predictive Technology Models (BPTM), from 180nm to 45nm, are chosen as the target processes for the simulated SoCs.
For the instantaneous power consumption, the operating frequency is set to 2GHz for all six processes. The NoC is a mesh architecture with an assumed chip size of 10mm×10mm.

4.5.1. NoC Router's Dynamic and Leakage Power Simulation

Figure 4.13(a) shows the leakage power of the NoC router under different temperatures. The target NoC is configured as a 4x4 mesh architecture with 5-port routers. Each port has 8 input buffers, and each buffer is 64 bits in width. The workload is set to uniform traffic, where each packet contains 10 flits. To show the trend of dynamic power across different technologies, the average value is plotted together with the accompanying leakage power at the ambient temperature of 300K, as Figure 4.13(b) depicts. From these data, we can see that the leakage power increases dramatically with the scaled technology nodes. Each newer technology node has on average 2.8X more leakage power than the previous one, and this worsens to a rate of >8X when the feature size scales below 70nm, where the estimated leakage power is nearly the same as the dynamic power.

Figure 4.13(a) Router's leakage power at different processes (T = 300K, 323K, 353K); (b) Router's dynamic power and leakage power across the BPTM processes

4.5.2. NoC Router's Dynamic Power v.s. Data Payload Switching Properties

According to the NoC power model, including the router's and the link's, the dynamic power is directly related to the data bus's switching activity, which is the number of switched bits between two consecutive transactions on this bus.
The dynamic power is also proportional to the operating frequency, which determines how often the circuit's internal/load capacitance is charged and discharged. In our model, we estimate the average dynamic power consumed in a given sampling window and use it to approximate the instantaneous dynamic power. Assume that an on-chip router (or link) consumes energy E during the sampling window T_w; the average dynamic power is calculated by:

P = E / T_w = E / (n·T) = E·f / n        (4.61)

where n is the number of clock cycles contained in this sampling window and T = 1/f is the clock period. From (4.61):

lim_{n→1} P = E · f        (4.62)

which shows that the average dynamic power approaches the instantaneous dynamic power as the sampling window approaches one cycle. So reducing the length of the sampling window gives better power-estimation resolution, but it brings the overhead of more simulation time because of the increased number of events that trigger energy calculation and output-file dumping.

Figure 4.14 presents the average router dynamic power consumption under four different input data payload switching schemes controlled by the TG: (1) two consecutive data are fully random; (2) two consecutive data have 50% of their bits constant; (3) two consecutive data are constant; (4) two consecutive data have 100% bit switching. The simulation is run at a 1GHz NoC clock rate, and the NoC is configured with the IBM 180nm process. When there is no bit switching between two injected data payloads, the router's average dynamic power is estimated to be 25.695mW, which mainly results from the bit switching on the packet head, including the source and destination addresses. When the data payload has 50% random bits, the router dynamic power increases greatly, by 3.187X, and this power overhead comes from the switching activities on the data payload bits.
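The payload-switching schemes driven by the TG can be sketched as follows. The scheme numbering follows the list above; holding the upper payload half fixed to realize the 50%-constant scheme is an assumption for illustration, as is supplying the random bits as an argument rather than from a PRNG stream.

```cpp
#include <cstdint>

// Sketch of the TG payload-switching control: given the previous 64-bit
// payload, produce the next one under schemes (1)-(4) described above.
uint64_t next_payload(uint64_t prev, int scheme, uint64_t rnd) {
    switch (scheme) {
        case 1: return rnd;                                 // fully random
        case 2: return (prev & 0xFFFFFFFF00000000ull)       // upper half constant,
                     | (rnd  & 0x00000000FFFFFFFFull);      // lower half random
        case 3: return prev;                                // constant payload
        default: return ~prev;                              // (4): all bits toggle
    }
}

// Hamming distance, i.e. the switching activity the power model charges for.
int hamming64(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;
    int d = 0;
    while (x) { d += int(x & 1u); x >>= 1; }
    return d;
}
```

Scheme (3) yields zero payload switching and scheme (4) the maximum of 64 toggles per flit, bracketing the power range the figure reports.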
When the whole data payload is set to 100% random, each router consumes 139.22mW of dynamic power on average due to the further increased bit-switching activity. Although 50% and 100% of the payload bits are chosen randomly in schemes (2) and (1), respectively, the switching ratio of (1) is no larger than 2X that of (2) because of the random feature; this results in the power consumption of scheme (1) being 1.7X that of scheme (2). In (4) the router has the maximum data payload switching activity, and this corresponds to a power consumption of 173.55mW, which is 2.12X larger than the scheme where 50% of the payload bits switch randomly. The dynamic power on the link has a similar trend across these four data payload switching schemes. The data payload switching activity has a distinctive influence on NoC dynamic power consumption: the higher the rate at which the data toggles, the higher the dynamic power the NoC takes to transfer the data. From the dynamic-power-saving perspective, data encoding schemes similar to those applied to shared-bus architectures to reduce the Hamming distance between two successive data words can also be applied to the NoC architecture. If only 25% of the bits toggle between each two consecutive data payloads after encoding, the router's dynamic power is estimated to be 52.658mW, which is 3.28X less than the full-data-toggling scheme. In this work we do not target exploring power-aware NoC data encoding techniques, but the proposed NoC power simulation model provides a complete platform to evaluate different encoding schemes.

Figure 4.14 Router's average dynamic power under the TG payload bit switching schemes (1)-(4)

4.5.3. Router Architecture and its Dynamic Power

The router architecture influences its power consumption as well as its performance. In general, the higher throughput a router has, the higher the power consumption it takes.
All three of a router's basic operations involved in a packet's delivery, namely input buffering (and optional output buffering), route computing and packet/flit forwarding, and output arbitration, toggle the router hardware, which in turn consumes dynamic power resulting from circuit charging and discharging. Currently, twelve different router architectures have been modeled by adjusting three architectural parameters, each of which corresponds to a specific hardware architecture in the on-chip router and has different power properties from the others. We simulate six different router architectures, the same as those presented in 3.6, with the input buffer size ranging from 2 to 8 flits, each flit containing 64 bits. None of these routers has an output buffer. Both random and transpose traffic are used in the simulation, and the PE and NoC are clocked at the same frequency of 2GHz. Figure 4.15(a), (b) present the average router dynamic power consumption under random and transpose traffic. Under the same buffering scheme (input and output buffer size), the NoC employing the XY-deterministic routing algorithm has higher dynamic power consumption than the NoC using the other four adaptive ones, even though the adaptive routers need more power overhead to perform the adaptive route-computing operation. For input buffer sizes from 2 to 8, the NoC with the Ibuf_FP_XY router architecture has on average 2.03X higher power than Ibuf_FP_WF's, and this ratio reduces to 1.73X when comparing with Ibuf_FP_NL's under random traffic. When the arbitration policy is fixed-priority, the North-Last routing algorithm is about ~22.4% more power-hungry than West-First routing, while this difference becomes marginal (~0.85%) when employing the fair-sharing arbitration policy. It is also observed that employing the fair-sharing policy incurs 54.4% more power consumption than fixed-priority. All of the above observations come from the random-traffic results.
Comparing the power data with the APL simulation results in chapter 3 shows a strong correlation between the NoC average router power and the maximum achieved injection rate (MAIR). This correlation explains the power simulation results in Figure 4.15(a): a higher MAIR indicates that more packets are successfully injected into the NoC per unit time, and these packets trigger the NoC routers to access the input buffers and start route calculation and arbitration. So a higher MAIR results in more router operations in the NoC per unit time, and this directly brings out more dynamic power consumption. From 3.6.3, the NoC with the Ibuf_FP_XY router (buffer size 2) has 2.27X and 1.9X the MAIR of Ibuf_FP_WF and Ibuf_FP_NL, respectively, and this correlates with the 2.12X and 1.73X average router power consumption. The lower increase ratio of dynamic power compared with MAIR results from the fact that the adaptive routing algorithms consume additional power to perform the minimum-throughput calculation, which takes a small amount of power compared with the other router blocks.

Figure 4.15(a) Router dynamic power (random traffic); (b) Router dynamic power (transpose traffic)

Under the transpose traffic, the data payload is still random, but the destination address of each node is fixed as dest_x = (source_y + 3)%4, dest_y = (source_x + 3)%4, and the final destination address is calculated as (dest_x + dest_y × 4). The power simulation shows some different properties compared with the random traffic, although the power consumption still increases almost linearly with increased buffer size.
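The transpose destination computation, as reconstructed above for the 4x4 mesh, is small enough to state directly in code; the coordinate roles are an interpretation of the garbled formula in the source, so treat this as a sketch.

```cpp
// Destination node id for the transpose-style traffic on the 4x4 mesh:
// dest_x = (source_y + 3) % 4, dest_y = (source_x + 3) % 4,
// final id = dest_x + dest_y * 4.
int transpose_dest(int source_x, int source_y) {
    int dest_x = (source_y + 3) % 4;
    int dest_y = (source_x + 3) % 4;
    return dest_x + dest_y * 4;
}
```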
The NoC Ibuf_FP_XY router has almost the same dynamic power as Ibuf_FS_XY, and the performance simulation shows that the NoC APL with these two routers is 5 in both cases, which indicates that the NoC is not congested. The output arbiter in these two cases is not triggered due to the well-balanced traffic pattern, so the only contributors to router dynamic power consumption are the input buffer and the crossbar; this explains the identical power consumption of the two XY-deterministic routers. For the other four router architectures with adaptive routing algorithms, the power consumption is lower due to the reduced actual packet injection rate. When comparing the power consumption under the same router architecture between Figure 4.15(a) and 4.15(b), it is observed that the deterministic routers consume 85% more power under the transpose traffic than under the random traffic. This is again the result of the higher achieved injection rate in the transpose traffic, which stems from its more regular traffic pattern. The power simulation results indicate that the NoC router power is strongly related to all three architectural parameters, namely input buffer size, arbitration policy, and routing algorithm, as well as to the actual traffic pattern. From the system architect's point of view, it is good practice to extract the NoC traffic pattern by tracing the application model when exploring candidate NoC architectures. It is also beneficial to keep the on-chip routers' input buffers as small as possible, provided performance requirements are met, to lower the on-chip communication power as well as the silicon area. The previous simulation and analysis were based on routers without output buffering, which is employed in some throughput-critical applications. The output buffer provides additional temporary storage for the routed packets.
To evaluate the power impact of output buffering, we adopt an 8-entry fixed total buffer budget for the input and output ports of the same direction (NORTH, SOUTH, WEST, and EAST). The input and output buffer sizes can take 6 different combinations: (8,0), (2,6), (3,5), (4,4), (5,3), and (6,2), where (a,b) represents an input buffer size of a and an output buffer size of b. Figure 4.16(a) and (b) present the power simulation results for the router with the XY-deterministic routing algorithm under the random and transpose traffic, respectively.

Figure 4.16(a) Dynamic power consumption with different input/output buffering (random traffic); (b) Dynamic power consumption with different input/output buffering (transpose traffic)

The results show that the router dynamic power consumption is more sensitive to the input buffer size than to the output buffer size. Figure 4.16(a) indicates that, under random traffic, the dynamic power increases by 17.2% when the input buffer size is increased and the output buffer size decreased by 1 while keeping the same total buffer size of 8. For the power simulation under transpose traffic shown in Figure 4.16(b), the power consumption has a similar trend, but the non-congested traffic here results in the same power consumption for the two NoC architectures with fixed-priority and fair-sharing arbitration policies.

4.5.4. NoC Dynamic Power's Temporal Profiling
Figure 4.17 4x4 mesh NoC under burst traffic from nodes #0, #2 to node #5

A burst-mode traffic pattern, which mimics some real applications, is applied on the same mesh NoC to evaluate its temporal and spatial energy properties. The target technology is the BPTM 180nm process. Each PE injects packets randomly at a rate of 10M flit/s. During the burst period from 10us to 15us, the PEs at nodes 0 and 2 increase their injection rate to 100M flit/s, and their packets' destination is fixed to node 5, as described in Figure 4.17. Figure 4.18(a)~(d) show the dynamic power temporal profiles of routers #0, #1, #2, and #15 from 0 to 100us. Before the burst arrives, all four routers have power consumption of 60~100mW depending on their locations, since the border routers consume less power than the others due to decreased traffic. After 10us, the routers involved in the burst have increased power, up to ~150mW. Notice that router #1 is the shared switching node for packets from #0 and #2 to reach destination #5, and its power increases along with the burst even though its host PE still keeps the low injection rate. After the burst, the power consumption of these nodes gradually returns to its previous state due to the decreased traffic, which is clearly reflected in Figure 4.18(a). For router #15, which is far away from the burst area, the power is mainly influenced by the uniform traffic, and the activity change is far smaller than that of routers #0, #1, and #2.
Figure 4.18 NoC temporal power profile under the burst traffic shown in Figure 4.17: (a) node #0, (b) node #1, (c) node #2, (d) node #15

4.5.5. NoC Link Power Evaluation

In this set of experiments, the NoC is the same 4x4 mesh with a physical size of 10mm×10mm. The packet size is still fixed to 10 flits, with each flit 64 bits in width. Based on the assumption of fixed chip size and router count, we use the minimum global wire width of 0.8um in the BPTM 180nm process as the baseline and keep this wire width the same for all the other 5 processes. As a result, the unit-length wire resistance and capacitance scale accordingly, as shown in Table 4.6.

Table 4.6 Wire parameters for different technologies using the same global wire width (0.8um)

Figure 4.19(a)~(d) describe the NoC's power breakdown in four processes under the uniform traffic load, where both frequency and voltage are scaled according to the target technology nodes. We can see how the relative weight between router and link power shifts. In the NoC fabricated in the 180nm process, the power ratio between link and router is 0.81. This ratio increases to 0.91, 1.12, and 3.13 in the 130nm, 100nm, and 45nm processes, showing the trend that links are becoming more power-hungry than routers.
Figure 4.19 NoC power breakdowns at different technology nodes: (a) BPTM 180nm, (b) BPTM 130nm, (c) BPTM 100nm, (d) BPTM 45nm

To illustrate the temporal and spatial interconnect power profiling of the proposed simulation framework, a burst-mode traffic pattern, the same as that used in 4.5.4, is applied to the NoC. Figure 4.20(a)~(b) describe the dynamic power temporal profiles of link #16 (from router #1 to router #5) and link #17 (from router #5 to router #1) from 0 to 2ms. When the burst arrives at 10us, the power of the involved link (#16) increases by up to ~5.2X and gradually returns after the burst traffic. For the link (#17) that is not involved in the burst, the power remains almost constant, as Figure 4.20(b) shows.

Figure 4.20(a) Link #16 (router #1->#5)'s power profile; (b) Link #17 (router #5->#1)'s power profile

Combining the routers' and links' power consumption, Figure 4.21(a)~(d) describe the NoC spatial power distribution in the 4x4 mesh architecture under the burst traffic at the simulation time points of 5us, 8us, 11us, and 80us. The X and Y coordinates represent the physical locations of routers and links. The power of the two links connecting the same router pair in opposite directions is added together, since they are assumed to reside in the same physical location.
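The construction of one spatial snapshot can be sketched as below; the router-id-to-(x, y) indexing and the inputs being plain power samples are assumptions for illustration, since in the framework these values come from the router/link models at the configured time points.

```cpp
#include <vector>

// Build one spatial power snapshot for a dim x dim mesh: each router's
// instantaneous power lands at its grid position (x = id % dim, y = id / dim).
std::vector<std::vector<double>> spatial_snapshot(
        const std::vector<double>& router_pwr, int dim) {
    std::vector<std::vector<double>> grid(dim, std::vector<double>(dim, 0.0));
    for (int id = 0; id < (int)router_pwr.size(); ++id)
        grid[id % dim][id / dim] += router_pwr[id];
    return grid;
}

// The two opposite-direction links between a router pair share one physical
// location, so their powers are summed into a single value, as described above.
double folded_link_power(double link_ab, double link_ba) {
    return link_ab + link_ba;
}
```

The resulting matrix is what gets dumped to a file and surface-plotted (e.g. in Matlab) to produce profiles like Figure 4.21.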
After the NoC enters burst mode, the power consumption of the routers and links involved in the burst increases by ~3X compared with the non-burst nodes, as Figure 4.21(c) shows, due to the suddenly increased injection rate. When the burst ends and the traffic returns to uniform, the power distribution profile becomes flat again, as Figure 4.21(d) describes. The routers on the four edges consume less power than the others, in that the reduced number of input ports results in less traffic load. Here we only take the dynamic power into consideration, because the leakage power stays almost constant at run-time compared with the dynamic power, which depends on the switching capacitance. Figure 4.22(a) shows the leakage power's temporal profile for router #0 under the burst mode, with a maximum power swing of only 2.4%, while the dynamic power has at most a 362.5% power swing, as shown in Figure 4.22(b).

Figure 4.21 4x4 NoC spatial power profile under burst traffic at different time stamps: (a) 5us, (b) 8us, (c) 11us, (d) 80us

Figure 4.22(a) Leakage power profile of router #0; (b) Dynamic power profile of router #0

4.6.
Full SoC System-level Power Simulation on Real Applications

Applying the proposed power models to real applications, we explore the power implications of the M-JPEG application described in the previous chapter. For convenience, the SoC platform for simulating the M-JPEG application is re-drawn in Figure 4.23(a), and another NoC configuration featuring a different communication pattern is shown in Figure 4.23(b). To evaluate the computational power breakdown of the different tasks partitioned from M-JPEG, all four PEs are configured as PowerPC microprocessors. In Figure 4.23(a), the communication pattern to complete the encoding of one video frame follows the path router #(0,0) -> #(1,0) -> #(1,1) -> #(0,1) -> #(0,0) counterclockwise, without overlap between the packet flows in any two routers. Figure 4.23(b), in contrast, features some overlap between two routers' communication flows, shown as the dotted arrows. The original video frame data follow path p1 to traverse from core (1,1) to (0,0) to perform DCT, followed by p2 to send the transformed data to core (1,0) for quantization and zigzag scan. The re-ordered data in the quantized frequency domain are sent to core (0,1) for encoding through p3; finally, the encoded data stream is transferred to (1,1) again, following p4, for output bitstream dumping.
Figure 4.23 M-JPEG application mapped on a 2x2 NoC with (a) non-overlapped traffic flow, and (b) overlapped traffic flow (tasks: file read, RGB-YUV conversion, downsampling, and bit-stream forming on one core; DCT, quantization with zigzag scan, and VLC encoding on the others)

Based on the two configurations, 6 video segments with different frame sizes and contents are employed as inputs to the SoC simulation platform. The properties of these inputs are listed as follows:

(video_1) the size of each frame is 16x16, and the pixel values in each frame are constant
(video_2) the size of each frame is 32x32, and the pixel values in each frame are constant
(video_3) the size of each frame is 128x128, and the pixel values in each frame are constant
(video_4) the size of each frame is 16x16, and the pixel values in each frame change randomly
(video_5) the size of each frame is 32x32, and the pixel values in each frame change randomly
(video_6) the size of each frame is 128x128, and the pixel values in each frame change randomly

Tables 4.7~4.10 describe the average power consumption of each component in this homogeneous SoC, except for the interconnecting links, under the different inputs and SoC architectures. In each table, the (ci, vj) tuple represents the SoC configuration (c1 for Figure 4.23(a), c2 for Figure 4.23(b)) and the input from video_1 to video_6. The average power consumption of the three architectural components inside each NoC router is presented, as well as the dynamic power of the PE connected to this router. For the input buffer and crossbar, whose leakage power is modeled, the leakage value is also evaluated at the ambient temperature of 300K. Table 4.7.
M-JPEG power simulation results on architecture 4.23(a) with input video_1

(c1, v1)   | Input buffer power (mW) | Crossbar power (mW) | Arbiter power (mW) | PE power (mW)
Position   | dynamic / leakage       | dynamic / leakage   | dynamic            | dynamic
tile (0,0) | 47.3 / 6.67             | 1.02 / 0.0953       | 0.835              | 59.32
tile (1,0) | 50.9 / 6.69             | 1.09 / 0.0861       | 0.901              | 155.87
tile (1,1) | 41.2 / 6.68             | 0.946 / 0.0668      | 0.72               | 62.31
tile (0,1) | 45.7 / 6.68             | 1.01 / 0.0764       | 0.72               | 89.54

Table 4.8. M-JPEG power simulation results on architecture 4.23(a) with input video_2

(c1, v2)   | Input buffer power (mW) | Crossbar power (mW) | Arbiter power (mW) | PE power (mW)
Position   | dynamic / leakage       | dynamic / leakage   | dynamic            | dynamic
tile (0,0) | 47.7 / 6.67             | 1.03 / 0.099        | 0.842              | 58.94
tile (1,0) | 51.1 / 6.69             | 1.09 / 0.0856       | 0.905              | 152.38
tile (1,1) | 41 / 6.68               | 0.943 / 0.0797      | 0.717              | 60.79
tile (0,1) | 45.9 / 6.67             | 1.01 / 0.0847       | 0.808              | 88.12

Table 4.9. M-JPEG power simulation results on architecture 4.23(a) with input video_3

(c1, v3)   | Input buffer power (mW) | Crossbar power (mW) | Arbiter power (mW) | PE power (mW)
Position   | dynamic / leakage       | dynamic / leakage   | dynamic            | dynamic
tile (0,0) | 45.9 / 6.66             | 0.992 / 0.106       | 0.81               | 57.62
tile (1,0) | 50.5 / 6.69             | 1.08 / 0.0799       | 0.894              | 151.27
tile (1,1) | 41 / 6.67               | 0.943 / 0.0846      | 0.717              | 60.18
tile (0,1) | 45.9 / 6.67             | 1.01 / 0.0879       | 0.809              | 86.53

From tables 4.7~4.9, it is observed that the SoC's average power consumption remains almost constant for the input videos (video_1, video_2, and video_3) that share the same data switching properties but have different frame sizes. This is expected: the average power of both computation and communication is the ratio of the energy consumption, which is proportional to the number of experienced macro-operations (in PEs) or NoC operations and thus to the input data size, to the simulated time, which is also proportional to the input data size. We also observe that tile (1,0), performing DCT, consumes the maximum dynamic power, and tile (1,1), handling quantization/scan, has the minimum power. This results from the minimum information entropy of the packets traversing router (1,1): tile (1,1)'s input flits come from tile (1,0), where DCT is performed, and tile (1,1) sends out flits containing quantized/scanned data. Both the DCT and quantization operations reduce the data entropy, which is tied to the data switching property from which the power model is derived.

Table 4.10. M-JPEG power simulation results on architecture 4.23(a) with input video_4

(c1, v4)   | Input buffer power (mW) | Crossbar power (mW) | Arbiter power (mW) | PE power (mW)
Position   | dynamic / leakage       | dynamic / leakage   | dynamic            | dynamic
tile (0,0) | 64.3 / 6.67             | 3.33 / 0.106        | 0.878              | 59.31
tile (1,0) | 67.5 / 6.69             | 3.68 / 0.1          | 0.899              | 155.87
tile (1,1) | 42.7 / 6.68             | 1.07 / 0.0648       | 0.735              | 62.3
tile (0,1) | 55.1 / 6.67             | 2.25 / 0.091        | 0.833              | 89.52

Table 4.11. M-JPEG power simulation results on architecture 4.23(a) with input video_5

(c1, v5)   | Input buffer power (mW) | Crossbar power (mW) | Arbiter power (mW) | PE power (mW)
Position   | dynamic / leakage       | dynamic / leakage   | dynamic            | dynamic
The input flits of tile (1,1) come from tile (1,0), where DCT is performed, and tile (1,1) sends out the flits containing quantized/scanned data. Both the DCT and the quantization operations reduce the data entropy, which is related to the data switching property from which the power model is derived.

Table 4.10. M-JPEG power simulation results on architecture 4.23(a) with input video_4 (c1, v4)

            Input buffer power (mW)   Crossbar power (mW)   Arbiter power (mW)   PE power (mW)
Position    dynamic     leakage       dynamic    leakage    dynamic              dynamic
tile (0,0)  64.3        6.67          3.33       0.106      0.878                59.31
tile (1,0)  67.5        6.69          3.68       0.1        0.899                155.87
tile (1,1)  42.7        6.68          1.07       0.0648     0.735                62.3
tile (0,1)  55.1        6.67          2.25       0.091      0.833                89.52

Table 4.11. M-JPEG power simulation results on architecture 4.23(a) with input video_5 (c1, v5)

            Input buffer power (mW)   Crossbar power (mW)   Arbiter power (mW)   PE power (mW)
Position    dynamic     leakage       dynamic    leakage    dynamic              dynamic
tile (0,0)  65.7        6.66          3.38       0.11       0.901                58.94
tile (1,0)  67.6        6.69          3.64       0.102      0.904                152.38
tile (1,1)  42.6        6.68          1.07       0.077      0.844                60.78
tile (0,1)  55.6        6.67          2.25       0.0998     0.733                88.12

Table 4.12. M-JPEG power simulation results on architecture 4.23(a) with input video_6 (c1, v6)

            Input buffer power (mW)   Crossbar power (mW)   Arbiter power (mW)   PE power (mW)
Position    dynamic     leakage       dynamic    leakage    dynamic              dynamic
tile (0,0)  63.7        6.66          3.29       0.115      0.872                57.62
tile (1,0)  67.1        6.69          3.63       0.0985     0.898                151.27
tile (1,1)  42.6        6.68          1.07       0.0818     0.733                60.16
tile (0,1)  55.9        6.67          2.26       0.103      0.847                86.53

Tables 4.10~4.12 exhibit the same property in terms of the average power of each SoC component. Compared with Tables 4.7~4.9, however, the dynamic power of the NoC components shows a distinct increase: 32.3% in the input buffer, 361.7% in the crossbar, and 0.5% in the arbiter, on average.
The ~3.6X increase in the crossbar's dynamic power results from the fact that the random pixel data in each frame of video_4, video_5, and video_6 dramatically boosts the data switching, to which the crossbar's dynamic power is proportional. For the input buffer, only the write operation's dynamic power is related to the data switching pattern, which results in a much smaller increase in dynamic power compared with the crossbar. The arbiter's dynamic power is independent of the data switching on the data links and depends only on the request status of the crossbar ports, and the simulation results clearly reflect this property. The power consumption in the PEs remains almost constant compared with Tables 4.7~4.9; this is explained by the fact that PE power is estimated using data-independent macro-models.

Consider the SoC configuration shown in Figure 4.23(b), where overlapped traffic flows are present to model an increased communication load on the NoC. Since simulating input videos with the same data switching properties but different frame sizes results in similar power distributions, only the simulation results for video_2 and video_5 are presented, in Tables 4.13 and 4.14 respectively.

Table 4.13. M-JPEG power simulation results on architecture 4.23(b) with input video_2 (c2, v2)

            Input buffer power (mW)   Crossbar power (mW)   Arbiter power (mW)   PE power (mW)
Position    dynamic     leakage       dynamic    leakage    dynamic              dynamic
tile (0,0)  46.4        6.66          1.01       0.0894     0.82                 47.42
tile (1,0)  44.7        6.69          0.964      0.057      0.79                 137.94
tile (1,1)  43.2        6.66          0.939      0.101      0.763                54.41
tile (0,1)  88.7        6.69          2.05       0.0978     1.55                 84.05

Table 4.14.
M-JPEG power simulation results on architecture 4.23(b) with input video_5 (c2, v5)

            Input buffer power (mW)   Crossbar power (mW)   Arbiter power (mW)   PE power (mW)
Position    dynamic     leakage       dynamic    leakage    dynamic              dynamic
tile (0,0)  54.9        6.66          2.32       0.105      0.82                 46.78
tile (1,0)  52.3        6.69          2.15       0.0804     0.79                 136.45
tile (1,1)  46.6        6.66          1.38       0.099      0.775                53.89
tile (0,1)  97.2        6.69          3.36       0.0963     1.14                 83.72

In this SoC configuration, where traffic flows may share the same routers and links, the most power-hungry tile changes to location (0,1), where three traffic flows, one of which carries the original image flits, traverse the router. The increased traffic results in 93.25% and 74.82% more dynamic power consumption in router (0,1) for the two inputs. The minimum power consumption still occurs on the tile performing the quantization/scan operation, due to the reduced data switching after transformation. The increased traffic lengthens the execution time to encode one frame by ~11.5%, and in turn reduces the PEs' average power consumption by ~10.46% on average compared with the SoC configuration with non-overlapped traffic. For the leakage power, only a small variation of ~1.5% in router power is identified, due to the weak dependence of leakage on the data switching properties and on the traffic status in the NoC.

Finally, consider the dynamic power breakdown among the partitioned tasks of the M-JPEG application. The power breakdown also reflects the different power requirements of the computational tasks, since all of them are mapped onto the same type of PE with the identical power macro-model. Among the four partitioned tasks, DCT consumes 42% of the power and is the most power-hungry task due to its high computational complexity. VLC encoding takes 24% of the power consumption, followed by quantization/scan with 18% of the total computation power.
The least power-demanding task is the one that handles RGB2YUV conversion, downsampling, and file handling; the power consumption of file I/O is not explicitly modeled in this framework. From the power breakdown, we envision that employing a hardware accelerator for DCT would reduce the PE power consumption and flatten the spatial power distribution profile. Moreover, using a dedicated ASIC core instead of a microprocessor would exploit the inherent parallelism of the DCT algorithm and hence increase the data throughput as well.

5. POWER-AWARE SOC DESIGN SPACE EXPLORATION FRAMEWORK

The platform-based SoC design methodology based on reconfigurable IP cores enables designers to "assemble" an application-specific SoC given a set of design constraints. This process requires the following operations: (a) resource allocation, (b) task mapping, and (c) computation and communication task scheduling, each of which may have multiple choices that fulfill the functionality requirement. Pruning all the possible configuration choices efficiently and quickly to find one optimal (or Pareto-optimal) SoC configuration that meets the ever-stringent performance and power budgets is the main task of Design Space Exploration (DSE). Ever-increasing design sizes and IP libraries cause the design space to explode, due to the dramatically increased number of potential configuration combinations. In general, DSE considers two orthogonal issues: (1) how can a single design point be evaluated efficiently, and (2) how can the design space be covered during the exploration process? [78] Chapters 2, 3 and 4 have answered the first question by proposing the system-level simulation framework to estimate an application's execution time and power consumption for regular tile-based SoCs.
However, the latter question needs to be tackled carefully at all levels of abstraction, since an exhaustive search of the design space is usually prohibitive due to its sheer size. As with power modeling, it is believed that the higher the abstraction level at which DSE is performed, the less search time it takes, provided the design is simulated during the exploration process. On the other hand, the market demands handheld wireless personal communication devices with more highly integrated one-chip solutions for better performance, less power, and lower cost. For embedded systems running on batteries, maximizing battery life is becoming one of the chief design drivers, so targeting low energy consumption is extremely important. For high-performance systems such as cellular base stations, battery life is not the problem, but lowering power consumption reduces heat dissipation costs such as the cooling system, which has already become extremely complex in PC systems. Based on the proposed system-level SoC simulation framework, especially the TSE/TEE-based retargetable PE models enabling fast task remapping and simulation, and the fully parameterized on-chip communication infrastructure (NoC) model, we move forward and propose a power-aware DSE framework grounded in fast task-level application simulation. Given an application represented by its task graph with minimum throughput requirements, the exploration process consists of two related parts: global communication minimization (GCM) and local energy-aware scheduling (LES). Before elaborating the proposed DSE framework, we start with the mathematical formulation of the SoC energy-aware mapping and scheduling problem. 5.1.
Mathematical Formulation of the SoC Minimum Power Problem

The application executed on the SoC is modeled by a communication-annotated task graph (CATG), as opposed to the original task graph (TG) that was presented with the processor model in Chapter 2.

Definition 1: A Communication-Annotated Task Graph (CATG) G = G(T, C) is a directed acyclic graph, where each vertex represents a computational module of the application, referred to as a task t_i ∈ T. Each t_i has the following properties:

(1) A normal execution time array NR^i, where the j-th element nr_j^i ∈ NR^i stores the execution time of task t_i if it is executed on the j-th PE architecture under its normal supply voltage. If task t_i cannot be mapped onto the j-th PE (e.g., a sorting algorithm cannot be executed on a hardwired DCT core) due to architectural constraints, nr_j^i = ∞. When DVFS is applied with a supply voltage of V = V_norm / R_DVFS, the execution time can be derived as:

    nr_j^i(R_DVFS) = (R - 1)^2 / [R_DVFS * (R / R_DVFS - 1)^2] * nr_j^i,  where R = V_norm / V_th    (5.1)

(2) A normal energy array NE^i, where the j-th element ne_j^i ∈ NE^i stores the energy consumption of task t_i if it is executed on the j-th PE under the normal supply voltage. If task t_i cannot be executed on the j-th PE, ne_j^i = 0.

Each directed edge c_{i,j} ∈ C characterizes the communication or control dependency between t_i and t_j. A communication matrix (CM) denoting the communication volume among these tasks is created by:

    CM = [ cm_{0,0}      cm_{0,1}      ...  cm_{0,N-1}
           cm_{1,0}      cm_{1,1}      ...  cm_{1,N-1}
           ...
           cm_{N-1,0}    cm_{N-1,1}    ...  cm_{N-1,N-1} ],

where cm_{i,j} = v(c_{i,j}) stands for the number of bits communicated from task t_i to task t_j. In this matrix cm_{i,i} = 0, and cm_{i,j} = 0 when tasks t_i and t_j do not communicate with each other. The NoC connectivity and link bandwidth are represented by the NoC architecture graph (NAG).

Definition 2: An NoC Architecture Graph (NAG) G = G(P, R) is a directed graph, where each vertex p_i ∈ P represents one router in the NoC, and each directed edge r_{i,j}
∈ R represents the unique route from p_i to p_j under the deterministic XY-routing algorithm, or other possibly non-minimal routes under the adaptive WF or NL routing algorithms. Each r_{i,j} is associated with two metrics, e(r_{i,j}) and l(r_{i,j}). e(r_{i,j}) stands for the minimum energy consumption of sending one bit of data from p_i to p_j. l(r_{i,j}) gives the minimum latency (in cycles) for each piece of data transferred on that route. Note that although we use the same definition as [105], the energy and latency of each possible route are not available until a NoC characterization simulation is performed with the help of the TG/TR models. Given a tile-based NoC of size M x N, two matrices can be used to model the NAG:

    Energy matrix:   EM = [ e(r_{0,0})      e(r_{0,1})      ...  e(r_{0,N-1})
                            e(r_{1,0})      e(r_{1,1})      ...  e(r_{1,N-1})
                            ...
                            e(r_{M-1,0})    e(r_{M-1,1})    ...  e(r_{M-1,N-1}) ]_(M x N)

    Latency matrix:  LM = [ l(r_{0,0})      l(r_{0,1})      ...  l(r_{0,N-1})
                            l(r_{1,0})      l(r_{1,1})      ...  l(r_{1,N-1})
                            ...
                            l(r_{M-1,0})    l(r_{M-1,1})    ...  l(r_{M-1,N-1}) ]_(M x N)

Each element in EM and LM is pre-calculated or characterized before the exploration starts. They provide the information of how data can be routed from each node to another on the NoC. If the NoC is constrained such that no information can be routed from p_i to p_j, the elements e(r_{i,j}) and l(r_{i,j}) in EM and LM are both 0. Currently LM is defined under the condition that there is no traffic congestion on the NoC (the minimum latency between two nodes). In an actual application, it is very possible that a packet is blocked at a router due to congested traffic, which results in much larger latency. Compared with the deterministic minimum latency, the actual latency after taking congestion into consideration is decided mainly by the overall NoC traffic pattern, whose information cannot be extracted until the NoC communication characterization simulation is performed. We define the NoC congestion factor matrix CFM:
    CFM = [ α(r_{0,0})      α(r_{0,1})      ...  α(r_{0,N-1})
            α(r_{1,0})      α(r_{1,1})      ...  α(r_{1,N-1})
            ...
            α(r_{M-1,0})    α(r_{M-1,1})    ...  α(r_{M-1,N-1}) ]_(M x N),

where each element α(r_{i,j}) ≥ 1 describes how congested the path between two NoC nodes p_i and p_j is. In each round of NoC simulation, each CFM element may change as a result of the new traffic pattern. A traffic congestion threshold K_t is defined to characterize how two different traffic patterns influence the CFM. Assume that there are two traffic patterns tr_1 and tr_2, whose CFMs have the following relationship:

    Σ_{i=0}^{M-1} Σ_{j=0}^{N-1} (α_1(r_{i,j}) - α_2(r_{i,j}))^2 ≤ K_t,

in which case the two CFMs are considered compatible.

Figure 5.2 SoC DSE framework based on GCM

It is well believed that global on-chip communication demands an increasing share of the timing slack and energy budget, due to the poor scalability of the global wires laid out on the top metal layers in advanced sub-100nm CMOS technologies. On the other hand, the increasingly stringent performance requirements make processing the whole application, which is becoming ever more complicated and contains tens to hundreds of computation-intensive tasks, on one PE infeasible. The trade-off is to map the application tasks onto as few PEs as possible, so that the global communication overhead is minimized while the performance requirement is still met. This is the key point of the global communication minimization methodology applied in our DSE framework. Figure 5.2 presents the main architecture of the proposed SoC design space exploration framework based on global communication minimization. There are three main blocks in the SoC exploration framework: Initial Task Characterization (ITC), Initial Communication Characterization (ICC), and Global Communication Minimization (GCM).
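The CFM compatibility test above reduces to a sum of squared element-wise differences against the threshold K_t. A minimal sketch, with hypothetical matrices and threshold value:

```python
# Sketch of the CFM compatibility test: two congestion factor matrices are
# compatible when the sum of squared element-wise differences stays within
# the congestion threshold K_t. Matrices and threshold are hypothetical.

def cfms_compatible(cfm1, cfm2, k_t: float) -> bool:
    diff_sq = sum((a - b) ** 2
                  for row1, row2 in zip(cfm1, cfm2)
                  for a, b in zip(row1, row2))
    return diff_sq <= k_t

cfm_a = [[1.0, 1.2], [1.1, 1.0]]   # lightly congested 2x2 NoC
cfm_b = [[1.0, 1.3], [1.1, 1.0]]   # slightly different traffic pattern
cfm_c = [[2.5, 3.0], [2.8, 1.0]]   # heavily congested pattern

print(cfms_compatible(cfm_a, cfm_b, k_t=0.05))  # small perturbation: compatible
print(cfms_compatible(cfm_a, cfm_c, k_t=0.05))  # large shift: not compatible
```

The choice of K_t controls how aggressively the exploration treats two traffic patterns as interchangeable.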
Given an application model represented by the CATG, and a PE library containing N_p PEs with their macro timing and energy models, ITC performs a task-level simulation of each single task mapped on each PE to acquire the execution time matrix NR and the energy matrix NE at the normal operating voltage and frequency. For a CATG with N_t tasks, the resulting NR and NE are represented by the following two arrays:

    NR = [ t_{0,0}        t_{0,1}        ...  t_{0,N_t-1}
           t_{1,0}        t_{1,1}        ...  t_{1,N_t-1}
           ...
           t_{N_p-1,0}    t_{N_p-1,1}    ...  t_{N_p-1,N_t-1} ]_(N_p x N_t)

    NE = [ p_{0,0}        p_{0,1}        ...  p_{0,N_t-1}
           p_{1,0}        p_{1,1}        ...  p_{1,N_t-1}
           ...
           p_{N_p-1,0}    p_{N_p-1,1}    ...  p_{N_p-1,N_t-1} ]_(N_p x N_t)

Since the performance and power simulations can be performed by the same PE model, the total number of simulations in ITC is N_p x N_t. Although the ITC simulations pose some overhead to obtain the two task/PE performance/power matrices, this is still acceptable in the whole DSE framework because (1) ITC is performed only once, before the core exploration loop starts, so its run-time, which may not be negligible, does not take a big portion of the overall run-time; and (2) with the fast simulation framework equipped with macro-models, each simulation in ITC runs very fast on the host machine. Similar to ITC, ICC (Initial Communication Characterization) is performed on the NoC architecture with synthetic traffic generators and without any traffic congestion. ICC generates EM and LM, while the congestion factor matrix (CFM) is an all-ones matrix because of the uncongested background traffic. ICC provides the basic information about the NoC properties. Different from ITC, the NoC is characterized by its router architecture, which has 12 different configurations in our work. The NoC with each router architecture is characterized with a different set of EM and LM, which will be used to explore the NoC architecture during the exploration. For the 4x4 NoC, 16 x 12 = 192 simulation runs are performed to acquire all this information.
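The fast per-task simulations that ITC relies on are enabled by the PE macro-models: task code is back-annotated with a per-PE table of characterized (cycles, energy) costs per macro-operation, and the costs are accumulated as the task runs natively on the host. A minimal sketch, in which the macro-operation set, architecture names, and all cost values are hypothetical:

```python
# Sketch of macro-model back-annotation: each macro-operation carries a
# characterized (cycles, energy-nJ) cost per PE architecture, and annotated
# task code accumulates these costs while running natively on the host.
# The macro-operation set and all costs below are hypothetical.

MACRO_MODEL = {
    "risc32": {"alu": (1, 0.10), "mul": (3, 0.45), "load": (2, 0.30)},
    "dsp":    {"alu": (1, 0.08), "mul": (1, 0.25), "load": (2, 0.28)},
}

class AnnotatedPE:
    def __init__(self, arch: str):
        self.model = MACRO_MODEL[arch]
        self.cycles = 0
        self.energy_nj = 0.0

    def op(self, name: str):
        """Charge one macro-operation against the PE's timing/energy totals."""
        c, e = self.model[name]
        self.cycles += c
        self.energy_nj += e

def mac_task(pe: AnnotatedPE):
    # A 4-tap multiply-accumulate expressed as uniform macro-operations,
    # so the same task code retargets to any characterized PE.
    for _ in range(4):
        pe.op("load"); pe.op("mul"); pe.op("alu")

risc, dsp = AnnotatedPE("risc32"), AnnotatedPE("dsp")
mac_task(risc); mac_task(dsp)
print(risc.cycles, dsp.cycles)   # same task, different retargeted costs
```

Running `mac_task` once per (task, PE) pair is all ITC needs to fill one entry of NR and NE, which is why the N_p x N_t characterization sweep stays cheap.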
Both ITC and ICC are automated with Perl scripts. The initial mapping is performed outside of the DSE main loop shown in Figure 5.2. We start with the scenario where there is no global communication, which means that all tasks are mapped onto only one PE, chosen randomly from the PE library. This PE is mapped onto an initial NoC architecture, which is set to be constructed from routers with the Ibuf_2_FP_XY architecture. Since there is zero global communication, there is no communication power on the NoC in this case. This PE and NoC configuration is used to construct an SoC simulation model, as presented in Chapter 3, under the xSoCSim framework with the annotated timing and energy models. This executable, annotated model is then run to generate the performance and energy numbers. First, the performance metric is evaluated to see whether the achieved throughput meets the specification, which is provided as a performance constraint before the exploration process starts. It is most likely that the performance requirement is not met in the initial mapping, where all tasks are mapped onto one PE with zero communication overhead. An additional PE is then employed to take over some of the first PE's computational load. This scheme is similar to a processor pipeline, where the throughput is mainly decided by the latency of one stage if there are no pipeline stalls or other hazards. The difference is that adding another PE requires inter-PE communication through the NoC infrastructure. Another key point in the GCM-based exploration is therefore to minimize the inferred NoC communication when another PE is added. We employ an algorithm based on the minimum weighted cut set of the CATG to decide which task, or task set, will be partitioned onto the second PE.

Definition 3: A Cut Set is a partition of a CATG; it includes the partition of both the vertices and the edges of the graph.
After partitioning, a CATG becomes two sub-graphs G1 = G1(T1, C1) and G2 = G2(T2, C2), shown in Figure 5.3(a); G1 and G2 are sub-graphs of the original G = G(T, C). All edges c_{k,i} ∈ C_k (k = 1, 2; i = 1, 2, ...) that intersect the contour of the cut are referred to as the main edges of each sub-graph. If a CATG is partitioned into two sub-graphs by one cut, the two sub-graphs share the same set of main edges. In the weighted CATG, the sum of the weights of the main edges after the cut is defined as the Inferred Communication Load (ICL), in units of the number of bits communicated between the two sub-graphs. Figure 5.3(b) presents the algorithm for finding a minimum-weight cut set of a given CATG, so that the partition of the CATG has the minimum ICL. It starts with an empty cut set, iteratively checks the inferred communication load of the possible vertex permutations, and keeps the current permutation if its ICL is less than that of the previous iteration. For a CATG containing N tasks, the possible search space is:

    S = Σ_{i=1}^{N} C_N^i = C_N^1 + C_N^2 + ... + C_N^N
      = N + N(N-1)/2! + N(N-1)(N-2)/3! + ... + 1    (5.2)

which is quite large when the task graph size N is big. For example, for a CATG with 5 tasks, S is 56, and if the task number increases to 20, the search space S explodes to > 2000. We therefore define a time-out threshold and an ICL threshold to reduce the search time. Once the evaluated ICL is smaller than the ICL threshold, or the run-time exceeds the time-out threshold, the algorithm ends and the current permutation with minimum ICL is returned.
Find_partition_with_minimum_ICL(G(T, C)):
    P = {}; G = {T, C}; min_ICL = ∞;
    for each node t_i in T
        calculate the accumulated weight of its associated edges;
    while (!Time-out) {
        P = permutation(G);
        G = G - P;
        if (ICL(P, G) < min_ICL) { min_ICL = ICL(P, G); Pmin = P; }
    }
    return Pmin, min_ICL;

Figure 5.3 (a) Algorithm to find the cut with minimum ICL; (b) CATG with cut set

After partitioning tasks onto the other PE and updating the task mapping function M_task( ), a new simulation model is created to run the partitioned application. This is also iterated until a partitioning that meets the performance requirement is found. If the requirement is still not met after partitioning the CATG onto 2 PEs, a third PE is added to take some of the tasks further partitioned from one of the two sub-graphs using the same GCM method. This process is performed iteratively until a partition satisfying the performance requirement is found. Based on the refined application partitioning and mapping, power minimization is then performed as shown in Figure 5.2. The target function during the application power minimization contains two elements: power consumption from the PEs, and power consumption from the NoC. After GCM, the NoC communication has already been minimized, which indicates that the communication power is the lowest that achieves the required performance. There are two possible exploration processes to further reduce the power consumption while keeping the performance metric. The congestion factor matrix CFM plays an important role during the power optimization: the main idea of employing the CFM is to compare the congestion of two mapping schemes. The exploration framework performs task or communication re-mapping based on the current and previous CFMs.
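A brute-force variant of the minimum-ICL search can be sketched in a few lines. Here the CATG is a hypothetical 4-task chain given by its communication matrix CM, every non-trivial vertex subset is tried as one side of the cut, and the communication volume crossing the cut is summed (the time-out/threshold early exits of Figure 5.3(a) are omitted for brevity):

```python
from itertools import combinations

# Brute-force sketch of the minimum-ICL cut search of Figure 5.3(a).
# The CATG is given by its communication matrix CM (bits exchanged between
# tasks); the example 4-task chain below is hypothetical.

def min_icl_cut(cm):
    n = len(cm)
    best_set, best_icl = None, float("inf")   # min_ICL starts at infinity
    for size in range(1, n):                  # every non-trivial subset size
        for subset in combinations(range(n), size):
            part = set(subset)
            # ICL: communication volume crossing the cut, in either direction
            icl = sum(cm[i][j] for i in range(n) for j in range(n)
                      if (i in part) != (j in part))
            if icl < best_icl:
                best_icl, best_set = icl, part
    return best_set, best_icl

cm = [[0, 800, 0,   0],     # t0 -> t1: 800 bits
      [0, 0,   600, 0],     # t1 -> t2: 600 bits
      [0, 0,   0,   50],    # t2 -> t3: 50 bits
      [0, 0,   0,   0]]
print(min_icl_cut(cm))      # splitting off t3 alone crosses only 50 bits
```

The exponential subset enumeration is exactly the search space of Equation (5.2), which is why the actual algorithm bounds the search with the ICL and time-out thresholds.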
During the exploration, the CFM is calculated element-wise as CFM = ALM / LM, where ALM is obtained from the fast system-level SoC simulation and LM is the ICC characterization result. For two different mapping schemes, if they have compatible CFMs, the next exploration step changes the task mapping function M_task( ) so as to reduce the operating voltage and frequency and thereby the PE power consumption. To avoid over-reducing the voltage/frequency, a new xSoCSim model is created after M_task( ) is updated, to verify that the performance still meets the throughput requirement. Once the performance is met, the algorithm returns with the minimum power consumption for executing this application. If two different mappings do not have compatible CFMs, the next step is to change the communication mappings to achieve compatible CFMs before adjusting the PEs' voltage and frequency. It is possible that the search enters a region that never converges, with the power consumption estimate oscillating among several values. To avoid this situation, a time-out parameter is defined in the algorithm; it breaks the search and moves it back to the start of the loop with a different initial mapping once the run-time exceeds the time-out. It is also possible that the search result is a local minimum, which is still larger than the potential global minimum. So far this algorithm is not equipped with the ability to automatically escape local minima. The way we propose to avoid local minima is to perform multiple searches with different initial mappings, so that the algorithm has a higher probability of covering those locally optimal points and finding the globally power-optimized mapping. The proposed exploration framework, along with its search algorithm, is called simulation-in-loop DSE, since the main search loop uses the fast system-level SoC simulation to direct the search procedure.
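The voltage/frequency reduction step relies on the DVFS scaling of Equation (5.1). A small sanity-check sketch, assuming the supply-voltage scaling model as reconstructed there (the V_norm, V_th, and execution-time values are hypothetical):

```python
# Sketch of the DVFS execution-time scaling of Equation (5.1):
#   nr(R_DVFS) = (R - 1)^2 / (R_DVFS * (R / R_DVFS - 1)^2) * nr
# with R = V_norm / V_th. All numeric values below are hypothetical.

def scaled_execution_time(nr: float, r_dvfs: float, v_norm: float = 1.2,
                          v_th: float = 0.4) -> float:
    r = v_norm / v_th                      # R = V_norm / V_th
    return (r - 1.0) ** 2 / (r_dvfs * (r / r_dvfs - 1.0) ** 2) * nr

nr_normal = 100.0                          # execution time at normal voltage
print(scaled_execution_time(nr_normal, 1.0))   # R_DVFS = 1: time unchanged
print(scaled_execution_time(nr_normal, 1.5))   # lower voltage: longer run-time
```

With R_DVFS = 1 the formula collapses to the normal execution time, and any R_DVFS > 1 stretches it, which is the stretch the xSoCSim re-simulation must verify against the throughput requirement.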
Both latency and power metrics are evaluated to find the power-optimal task and communication mappings that meet the performance constraint. The whole exploration framework is written in Perl, and it invokes the xSoCSim model to run the simulations and collect the data.

5.4. Local PE/NoC Refinement

After a power-optimal mapping for an application is found, local PE/NoC refinement can be used to further reduce the global power consumption without degrading the performance. The rationale behind local refinement is the pipeline-style on-chip communication in the SoC architecture. The heterogeneity of the PEs, along with the mapped tasks, results in a different processing latency on each PE, and the overall application throughput is determined by the PE/router tile with the maximum processing latency. The basic idea of local refinement is to modify those tiles with smaller processing latency, which do not contribute to better throughput in the system but do contribute to more power consumption. For example, suppose a NoC node connecting a 200MHz processor has an average processing latency of 20ns, and the router's communication latency is 60ns. If the overall throughput is 10Msps, the slowest node needs 100ns to finish its processing and communication task. There is then a 20ns timing slack for the fast node, and performance is not hurt if we slow the processor down to 100MHz to get a 40ns processing latency. A direct effect of slowing the processor down from 200MHz to 100MHz is that the PE's dynamic power consumption is reduced by 50%. Similarly, we could reduce the router's input buffer size to increase its communication latency from 60ns to 80ns, which reduces the router's power consumption. The difference of local refinement compared with GCM is that it is performed locally, so the global performance/power metrics such as LM and CFM are not evaluated.
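The slack arithmetic of the 200MHz example can be sketched as follows; the slowest tile's total latency serves as the baseline, every other tile's slack is the baseline minus its own total latency, and the PE frequency may be lowered until the slack is consumed (node names and values are illustrative):

```python
# Sketch of slack-based local PE refinement: the slowest tile's total
# latency is the baseline; every other tile's slack is baseline minus its
# own total latency, and its PE may be slowed until the slack is used up.
# Latencies mirror the illustrative 200MHz example in the text.

def refine(nodes):
    """nodes: {name: (processing_ns, communication_ns, freq_mhz)}"""
    baseline = max(p + c for p, c, _ in nodes.values())   # slowest tile
    plan = {}
    for name, (proc, comm, freq) in nodes.items():
        slack = baseline - (proc + comm)
        # factor by which processing may be stretched (frequency lowered)
        scale = (proc + slack) / proc
        plan[name] = (slack, freq / scale)
    return plan

nodes = {"fast": (20.0, 60.0, 200.0),    # 20ns processing at 200MHz
         "slow": (40.0, 60.0, 200.0)}    # baseline tile: 100ns total
plan = refine(nodes)
print(plan["fast"])   # 20ns slack: the processor may drop to 100MHz
```

Applying the slack to the PE in this way needs no verification simulation; spending it on the router (e.g. a smaller input buffer) does, since the neighbors see the changed latency.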
Before the local refinement, the pre-optimized mapping is simulated to generate each node's processing and communication latencies, which are added together to obtain the total latency. The maximum total latency is identified as the local refinement baseline, and every other node's timing slack is computed by subtracting its total latency from this baseline. This slack can be applied locally to a PE or a NoC router to further reduce the power consumption. If the slack is applied to a PE, dynamic voltage and frequency scaling is invoked to select a reduced supply voltage, as well as the corresponding frequency, according to (5.1). No further simulation is necessary to verify the PE's local refinement. If the slack is applied to a NoC router, its architecture can be changed to reduce the power consumption. The most frequently used method is to reduce the buffer depth, degrading the latency with the payoff of better power properties. Since a NoC router communicates with its neighbors, it is necessary to perform a verification simulation to ensure that the change of this tile's router architecture does not influence the overall communication performance. The CFM defined before is used to perform this verification: the router architecture refinement is acceptable if the CFM of the NoC with the modified router architecture is compatible with that of the original one running the same application. After the local PE/NoC refinement, the final optimal application task and communication mapping is achieved. In this framework, the local refinement is optional, so the user can choose whether to activate it after GCM, trading off exploration time against the final optimized power consumption.

6. CONCLUSIONS

System-on-chip integration has become the trend of today's semiconductor system-level solutions, combining all the system components onto one die as advanced semiconductor fabrication technologies evolve.
Beyond hardware integration, software/firmware is much more tightly coupled with the SoC development cycle, since programmable processors are frequently employed in the SoC, and it becomes very beneficial to develop the firmware as early as possible during the system design cycle. The increasing complexity of modern SoCs containing tens or hundreds of different processing cores (PEs) demands a more efficient development tool flow and design methodology to increase design productivity. In addition to computation, a future SoC with multiple processing cores relies heavily on on-chip communication to synchronize their operations, and on-chip communication has become a hot spot of current research, since the future SoC will be more communication-centric than today's computation-centric bus-based SoCs. The conventional simulation-synthesis-simulation design flow is far too slow in the complex SoC era, especially when there are multiple PEs in the system, each of which may be a general-purpose or application-specific processor, or a dedicated hardware accelerator. This forces SoC designers to raise the level of abstraction from RTL to system level, which focuses more on the interactions between the integrated IP cores in the system: the target application needs much inter-core communication and synchronization, since each core only handles one or a few tasks partitioned from the application. At the system level, the modeling of the SoC is quite different from what we have done at the RTL based on HDL synthesis, since the system-level model needs to abstract away some architectural details to extract the performance/power properties from the whole system's point of view, not from a single IP block and its sub-blocks. System-level SoC modeling is still a very active research field in both industry and academia, and different modeling tools and methodologies have been proposed.
This work proposes and implements a system-level multi-core SoC modeling and simulation framework compatible with the IEEE-1666 SystemC standard, which was approved by IEEE in 2005; the main phase of this work was performed while SystemC was still evolving toward becoming an IEEE standard. The modeling methodology in this framework covers both the processing elements and the on-chip interconnection, and it integrates them seamlessly into an executable SoC simulation model that runs the partitioned application directly on the host machine without target-dependent software and hardware synthesis (we refer to the compilation of software to the processor binary as "software synthesis"), which is time-consuming when the system scale is large. To fulfill this target, a multi-granular time model, a critical factor in electronic system performance modeling, is applied to the modeling of both the processing elements and the on-chip communication. For features such as the intrinsic wire delay, the physical time model is applied to acquire the best accuracy; the cycle-accurate model is applied to those blocks requiring cycle-true simulation behavior; and for features where cycle-accurate modeling is unnecessarily fine-grained, cycle-approximate accuracy is applied, which keeps the timing at the boundaries of the IP blocks of interest cycle-accurate while not evaluating the internal operation timing, since the system-level simulation does not need this detailed information. The combination of the three time models enables us to explore the system with fast simulation speed, sufficient accuracy, and, moreover, a homogeneous SystemC environment. For the two major SoC functional blocks, different modeling methodologies are applied to achieve the best trade-off between simulation speed and modeling accuracy.
We employ the macro-modeling methodology to model the Processing Element (PE). The key point is that the common operations of the processing element, which may be a programmable microprocessor or a parallel hardware accelerator, are extracted and characterized for both performance (timing) and energy. The characterized data are then annotated to the task code mapped onto the target PE, so that the timing and energy figures can be acquired during simulation on the host machine without invoking target-architecture-dependent compilers or HDL synthesizers. This system-level annotation also enables the creation of a retargetable PE simulation model, which supports heterogeneous PE architectures under the same framework when estimating timing and energy. Essentially, the macro-model itself is not a completely new idea; our contribution is to utilize the macro-modeling method to build a target-independent PE performance/energy modeling framework supporting task-level simulation, which had not appeared in prior publications. For the SoC on-chip communication infrastructure, we focus on the new area of network-on-chip (NoC) instead of the mature and conventional shared-bus architecture. The concept of NoC is borrowed from computer networks, where information is transferred in the form of packets under the control of routers spread throughout the network. The key benefit of NoC is that it breaks the long global communication links into multiple short ones connected via on-chip routers. This is important because it is well believed that the performance and power of global wires in evolving CMOS technologies do not scale down as well as transistors and local wires do, so using short global wires reduces the wire load and yields better communication performance and power. Currently NoC is still a developing idea, there are no mature products in the market, and NoC modeling is still in its infancy.
Because no reference NoC models were available, we developed an in-house parameterized, hierarchical NoC simulation model from the ground up in SystemC. In this model, the main functional modules are routers and links. The router is fully parameterized by four architectural parameters: input buffering, output buffering, routing algorithm, and arbitration policy; changing any parameter yields a new router architecture. The on-chip router model implements three routing algorithms (one deterministic and two adaptive), two arbitration policies, and unlimited buffer depths (if the area budget is large enough). With different connections among these on-chip routers, four NoC topologies, including mesh, torus, and half-torus, are modeled. To exploit these configurations, a NoC generator produces different NoC architectures by parsing an input NoC configuration file, and a traffic generator emulates the actual PEs when characterizing the NoC for performance and power. In contrast to the empirical PE macro-model, NoC performance and power are modeled architecturally: each performance and power model is derived from the architectural information of the NoC components. Annotating the developed performance/power models enables run-time acquisition of performance and power figures for different traffic patterns. Moreover, the power model is based on the bit switching factor, obtained by tracing two consecutive data payloads on each router and link, so it is highly data-dependent. This provides application-dependent power/performance estimation, in contrast to previously proposed statistical models. Based on the PE and NoC models, a complete SoC simulation model, named xSoCSim, runs the partitioned application code directly with the appropriate performance/power annotation.
The most promising feature of this model is that, thanks to the PE macro-models, it runs simulations without requiring any target compiler or synthesizer. The communication pattern is formed by the network interface (NI) bridging the PE and the NoC, which also decouples computation from communication. Mapping different tasks onto PEs is easily performed through the object-oriented application and task model coupled with the PE models. We have successfully ported two applications, M-JPEG and a WCDMA filtering array, onto this simulation model, and the packaging and publication of the model is in progress. On top of the proposed model's fast task-level SoC application simulation, we also developed a preliminary SoC design space exploration framework that prunes different PE/NoC architectures to achieve minimum total power consumption while still meeting the throughput constraint. The distinctive feature of this exploration framework is that it exploits the fast in-loop application simulation provided by xSoCSim to collect run-time performance and power data as needed, in contrast to other works where all performance and power metrics are pre-characterized. Although this framework still requires characterization, the characterization is executed only once, at the beginning, outside the main search loop. The main search loop minimizes the global communication overhead, based on the assumption that communication latency and power account for a large portion of system performance and power. An optional local PE/NoC refinement based on timing slack further lowers processing/communication power while preserving overall system performance. Looking ahead, we envision further research in several directions. On the PE side, it would be beneficial to add a cache model to the processor core, since caches contribute a large portion of area as well as power.
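Returning to the exploration framework: the in-loop search described above can be sketched as a loop that simulates each candidate PE/NoC configuration and keeps the lowest-power one meeting the throughput constraint. The `Config` and `SimResult` types and the `explore` function below are illustrative stand-ins, not xSoCSim's actual interface.

```cpp
#include <functional>
#include <limits>
#include <vector>

// Hypothetical candidate architecture and its simulated metrics.
struct Config { int id = -1; };
struct SimResult { double power_mw = 0.0; double throughput_mbps = 0.0; };

// In-loop exploration: run the fast task-level simulation for each
// candidate (no pre-characterized metric table) and keep the
// lowest-power configuration that still meets the throughput constraint.
Config explore(const std::vector<Config> &candidates,
               const std::function<SimResult(const Config &)> &simulate,
               double min_throughput_mbps) {
    Config best;  // id stays -1 if no candidate meets the constraint
    double best_power = std::numeric_limits<double>::infinity();
    for (const Config &c : candidates) {
        SimResult r = simulate(c);  // in-loop application simulation
        if (r.throughput_mbps >= min_throughput_mbps && r.power_mw < best_power) {
            best_power = r.power_mw;
            best = c;
        }
    }
    return best;
}
```

The one-time characterization mentioned above would populate the PE macro-model tables before this loop runs; the loop itself then only pays the cost of the fast task-level simulations.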
On the on-chip communication side, the NoC could be extended to support virtual channels, which reduce traffic congestion in communication-centric applications. Another direction is to add more automation to the framework, so that it supports automatic application partitioning and automatic computation and communication annotation.