. . ..: :
153:3.

 

3.

f: ..

1 “'r
v

:1.
.335.
.3 i.

. t.
7“”.

V-

, ;.
. T!.,....s?

. .. .. U Eu...

.1. .

but

‘1'

1

.31...‘
. u u‘. v .
,. 3..“ $11.33...

 

, 3».
. .13.!»17.
3.5.15; .

 

THESIS

962.7

 

LIBRARY
ichigan State
University

 

 

 

TM

 

This is to certify that the
dissertation entitled

A SYSTEM-LEVEL PLATFORM-BASED MULTI-CORE
SYSTEM-ON-CHIP SIMULATION FRAMEWORK WITH
RETARGETABLE PROCESSING ELEMENT MODELS

presented by

JINWEN XI

has been accepted towards fulfillment
of the requirements for the

Ph.D. degree in Electrical and Computer
Engineering

 

 

Major Professor’s Signature
97 7 /u 7

Date

 

MSU is an affirmative-action, equal-opportunity employer

 

 

—.-.--.-.-n--c-u--.—-n-.--.-.-o-.—.-o--.--o--..

.------—o-------u—o-n-t-u-u-u-n-o-----I-n-c--u-n-v-a---.. -

PLACE IN RETURN Box to remove this checkout from your record.
To AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

DATE DUE DATE DUE DATE DUE

 

 

A SYSTEM-LEVEL PLATFORM-BASED MULTI-CORE SYSTEM-ON-CHIP
SIMULATION FRAMEWORK WITH RETARGETABLE PROCESSING
ELEMENT MODELS
By

Jinwen Xi

A DISSERTATION
Submitted to
Michigan State University
In partial fulﬁllment of the requirements
For the degree of
DOCTOR OF PHILOSOPHY
Department of Electrical and Computer Engineering

2007

ABSTRACT
A SYSTEM-LEVEL PLATFORM-BASED MULTI-CORE SYSTEM-ON—CHIP

SIMULATION FRAMEWORK WITH RETARGETABLE PROCESSING
ELEMENT MODELS

By

Jinwen Xi

Today’s semiconductor technology makes it reality to integrate up to billions of
transistor to create the complete system on the single silicon die. The escalating
design complexity makes the traditional register transfer level (RTL) design and
modeling methodology inadequate facing the pressure of time to market and limited
resource. Electronic system level (ESL) design methodology is needed to handle such
a complex system-on-chip (SOC) design. System level modeling is core to this new
methodology. This work explores system level modeling of function, performance and
power and proposes a complete hierarchical SOC simulation framework supporting
heterogeneous processing elements (PEs) integrated with a network-on-chip (NOC)

communication infrastructure.

To efﬁciently map an application onto multiple heterogeneous PE architectures, an
object-oriented application/task mode] is proposed. PEs are characterized by their
macro-models for both execution time and energy consumption. Performing macro-
model back-annotation to application tasks enables fast and cycle-approximate timing
and energy simulation. Moreover, the resulting retargetable PE model methodology
mitigates the effort of task remapping among different core architectures by using the
uniform macro-operation set. Task scheduling engine (TSE) and Task execution
engine (TEE) supports dynamic task scheduling. By this method, the task-level

simulation can be taken without architecture-dependent cross—compilers/synthesizers.

The parameterizable NOC model features conﬁgurable routers and links. All the
internal operations in each component are modeled as transaction-level function calls.
Analytical transaction-level dynamic and leakage energy models are developed for
each NOC component, and energy estimation is triggered whenever there is a
transaction during the simulation. The NOC model also supports temporal and spatial
power profiling that will help designers identify the potential hot-spots during
application simulations. By conﬁguring the four architectural parameters, NOC with
different on-chip routers is evaluated. A complete SOC simulation framework is
developed with the case study of M-JPEG application. The power/performance
evaluations are performed to prove the effectiveness of the proposed modeling

methodology.

Based on the proposed simulation model, a system-level SOC design space
exploration framework is developed with the novel in-loop simulation feature. The
principle it employs is to minimize the on—chip communication size while meeting the
throughput constraints. And local architectural ﬁne-tuning of PE/NoC is also applied
to achieve the further power savings.

Overall, this methodology will bring signiﬁcant changes in the way how future
system-on-chips are designed. The inherent scalability enables the tradeoff of
modeling time and accuracy. It has broad applications, from early design space
exploration, through design reﬁnements and iterations. to design characterization and

reuse.

ACKNOWLEDGEMENTS

First of all, I'd like to thank my advisor, Professor Peixin Zhong, for his invaluable
guidance throughout the road of my four years’ research and study in East Lansing,
MI. His extraordinary vision, generosity, and patience have made this thesis possible.
He provided a lot of signiﬁcant suggestions and comments during my research, to
help me reﬁne the idea and modify the model. And the completion this thesis is also a
mission impossible without his directions and suggestions. I am also very impressed
by his keen insight into the technology trend because there was almost nobody using
SystemC when we started developing the SystemC model in 2004, and it has become
IEEE standard now. I look forward to applying in my future career the knowledge that
I have gained from working with him during the past four years.

I am also grateful to Professor Fathi Salem, Professor Gregory Wierzba, and
Professor Anthony Wojcik for taking the time to read my thesis and providing me with
helpful feedbacks on how to improve its quality. Serving as my Ph.D. committee
members, they provided signiﬁcant suggestions during my research work. I really
appreciate their time and effort to review my work, discuss my research and provide
suggestions. I also thank their excellent teaching efforts in the circuit and computer
architecture courses that I took during my Ph.D. career, and I gained lot of theoretical
and practical skills that will deﬁnitely be applied to my future career work. I also
thank Professor Donnie Reinhard for his generous help in my Ph.D. plan
modiﬁcation. Additionally, I thank the faculty and staff of the Electrical and
Computer Engineering Department for the help and advice 1 have received during my
stay at Michigan State.

I also want to thank my colleagues and friends to whom I am indebted, and their

support has enabled .me to enjoy my research over the years: members from AMSAIC

lab at Michigan State (Chao Yang, Jichun Zhang, Junwei Zhou, Yue Huang and
Professor Andrew Mason).

Finally, and most important, I’d like to thank my loved wife, Xiaohan Chen. Her
encouragement makes me more conﬁdent in my research, especially during the most
difﬁcult idea-forming phase. And her love makes my life in East Lansing more
beautiful. And I also want to thank my parents, Baojing Xi and Fengqi Xu, who
always loved and supported me when I pursued my Ph.D. Thanks to my brother
Bowen Xi for his love and friendship. I dedicate this thesis to my wife, parents and

brother.

‘1

TABLE OF CONTENTS

LIST OF FIGURES ........................................................................... viii
LIST OF TABLES ................................................................................. xii
1. Introduction ........................................................................................ 1
1.1. Platform-based System-on-Chip Design Methodology ................................ 2
1.2. SOC Architecture Overview ................................................................ 3
1.2.1. Shared Bus OCA .......................................................................... 4
1.2.2. Switched Bus OCA ..................................................................... 4
1.2.3. Network-on-Chip OCA ................................................................. 5
1.3. Requirements of SOC Modeling for Efficient Design Space Exploration............6
1.4. Related Works on System-on-Chip Modeling ............................................ 9
1.4.1. System-on-Chip Simulation Frameworks .......................................... 11
1.4.2. SOC Modeling Languages and Environments ...................................... 12
1.4.2.1. SystemC: System-level Modeling Language .................................. 12
1.4.2.2. Ptolemy 11 Modeling Environment ............................................. 13
1.4.2.3. Metropolis System-level Design Framework... ............ ................... 13
1.5. Key Objectives and Conuibutions ....................................................... 15
1.6. Thesis Organization ........................................................................ l6
2. Cycle-approximate Processing Element Model ............................................. 18
2.1. Multi-granular Time Model ............................................................... 19
2.1.1. Cycle-approximate Time Annotation ................................................ 20
2.1.2. Implementation of Multi-granular Time Model in SystemC .................... 24
2.2. Object-oriented Uniform PE Simulation Framework ................................. 26
2.2.1. Macro-operations and Their Characterization ...................................... 27
2.2.2. Simulate Tasks by Running Macro-operations .................................... 30
2.2.3. Object-oriented Application and Task Graph Mode132
2.2.4. Object-oriented PE Simulation Model .............................................. 35
2.2.4.1. Sequential Microprocessor Model .............................................. 39
2.2.4.2. Parallel Hardware Accelerator (ASIC) Model ................................. 41
2.2.4.3. Case Study of an Application Mapped onto Microprocessor Model ....... 43
2.2.5. Retargetable Simulation Framework for Heterogeneous PE Cores ............. 44
2.2.6. Selective Code Simulation ............................................................ 47
2.2.7. System-level Modeling Of Dynamic Voltage and Frequency Scaling ........... 48
2.3. Experimental Results on Media Applications .......................................... 51
3. Architecturally Parameterized NOC Simulation Model .................................... 53
3.1. On-chip Interconnection: Packet-switching Network—on-Chip(NoC) ............... 53
3.1.1. Network-on-Chip Fundamentals ..................................................... 54
3.1.2. Object-oriented NOC Simulation Framework ...................................... 55
3.2. System—level Model of On-chip Router ................................................. 58
3.2.1. Input/Output Buffer Model and its Architectural Parameter ..................... 62
3.2.2. Output Arbiter Model and its Architectural Parameter ........................... 65
3.2.3. Router Crossbar Model and its Architectural Parameter ......................... 68
3.3. System-level Model of Interconnecting Links .......................................... 75
3.4. Latency Model of NOC Routers and Links ............................................. 77
3.4.1. On-chip Router’s Latency Model .................................................... 78

vi

3.4.2. Link Latency Model ................................................................... 81

3.5. Simulate NOC Model with Trafﬁc Generator/Trafﬁc Receiver ...................... 83
3.6. NOC Performance Model Experiment Results .......................................... 85
3.6.1. APL Performance under Different NOC Topologies ............................ 86
3.6.2. APL Performance with Different Packet Sizes on Mesh NOC .................. 88
3.6.3. APL Performance with Different Router Architectures on Mesh NOC ......... 89

3.7. Integrate PE and N 0C into the SOC Simulation Framework .......................... 94
3.7.1. Network Interface for Bridging NOC and PEs ..................................... 94
3.7.2. Buffered Network Interface Model .................................................. 95
3.7.3. Put Everything Together: SOC Simulation Framework xSoCSim ............... 98

3.8. M-JPEG NOC-based SOC Simulation Model Case Study ........................... 100

4. System-on-Chip Power Modeling Framework ............................................. 106
4.1. Related Work on Power Modeling ...................................................... 107
4.2. PE Power Macro-modeling Methodology ............................................. 109
4.2.1. Programmable Processor Power Macro-model ................................... 110
4.2.2. Experimental Results of Processor Macro Power Mode] ....................... 114
4.2.3. Application Speciﬁc Hardware Accelerator Power Macro Model ............. 116
4.2.4. Experimental Results of DFHA Macro Power Model ........................ 120
4.3. NOC Architectural Power Model ........................................................ 123
4.3.1. Power Model of Router’s Input/Output Buffer ................................... 124
4.3.2. Power Model of Router’s Crossbar ................................................ 128
4.3.3. Power Model Of Output Arbiter .................................................... 131
4.3.4. Power Model of Interconnecting Wires ........................................... 134
4.4. Integrate Power Models into SOC Simulation Framework .......................... 137
4.5. NOC Power Simulation Experimental Results ........................................ 140
4.5.1. NOC Router’s Dynamic and Leakage Power Simulation ........................ 140
4.5.2. NOC Router‘s Dynamic Power vs. Data Payload Switching Properties......14l
4.5.3. Router Architecture and its Dynamic Power ...................................... 143
4.5.4. NOC Dynamic Power’s Temporal Proﬁling ....................................... 147
4.5.5. NOC Link Power Evaluation ........................................................ 148
4.6. Full SOC System-level Power Simulation on Real Applications ................... 152
5. Power-aware SOC Design Space Exploration Framework ................................ 159
5.1. Mathematical Formulation of SOC Minimum Power Problem ..................... 160
5.2. Related Work on DSE Framework for Multi-core SOCs ............................. 163
5.3. Global Communication Minimization Energy—aware Exploration Framework...166
5.4. Local PE/NoC Reﬁnement ............................................................... 174
6. Conclusions ..................................................................................... 176
Bibliography ....................................................................................... 181

vii

LIST OF FIGURES

Figure 1.1. Bus on—chip interconnections in SOC. (a)Shared—bus SOC architecture. (b)
Switched-bus architecture ....................................................................... 5

Figure 1.2. SOC - - ‘ ' ' 16 PBS with 4x4 mesh NOC interconnection ................ 7

Figure 2.1. Cycle approximate time annotation (a)DCT functional speciﬁcation.
(b)Executable model with time annotation .................................................. 23

Figure 2.2. Time event distributions during simulation with different time models. . .24

Figure 2.3. The same macro-operation on different architectures (a)OP_MUL_2 with

MUL instruction. (b)0P_MUL_2 without MUL instruction .............................. 28
Figure 2.4. Trme annotation styles to model PE timing. (a)Discrete time annotation. (b)
Grouped time annotation ....................................................................... 31
Figure 2.5. Task and application model represented by C++ class ....................... 33

Figure 2.6. Application abstracu'on model. (a)Task graph with 6 tasks. (b) Application
task’s dependency model ....................................................................... 35

Figure 2.7. Class diagram of PE model framework ........................................ 36

Figure 2.8. Architecture of TSP/TEE PE mode1(with application task models). . . .. ...38

Figure 2.9. Task Microprocessor’s TSE—TEE ............................................... 40
Figure 2.10. Concurrent ASIC model. (a)Task graph with concurrency. (b)ASIC’s
TSE-TEE model ................................................................................. 42
Figure 2.11. (a)Ad-hoc task remapping. (b)Retargetable task remapping ............... 45
Figure 3.1. Topological illustration of 4-by-4 NOC-based SOC ........................... 54

Figure 3.2. Modeling of on-chip communication architecture. (a)Class diagram of
OCA model. (b)0CA instantiation ............................................................ 57

Figure 3.3. The data transferred on NOC. (a)NoC packet structure. (b) Class diagram
of packet .......................................................................................... 59

Figure 3.4. Generic NOC router architecture ................................................ 60
Figure 3.5. NOC on-chip router FIFO architecture .......................................... 62

Figure 3.6. Different ways of modeling FIFO. (a)SystemC sc_ﬁfo-based FIFO model.
(b) C++ deque-based FIFO model ............................................................ 64

Figure 3.7. Arbitration algorithms. (a)FPUS algorithm in NoC router. (b) FSUS
algorithm in NOC router ........................................................................ 67

viii

Figure 3.8. Generic router crossbar architecture ............................................ 69

Figure 3.9. XY-routing on 4x4 mesh ......................................................... 70
Figure 3.10. Crossbar control of XY—deterministic router. (a)RCU for XY—routing
algorithm. (b)Crossbar control logic with 5 RCUs ......................................... 71
Figure 3.11. WF-routing on 4x4 mesh ....................................................... 72
Figure 3.12. Crossbar control of WF-adaptive router. (a)RCU for WF-routing
algorithm. (b)Crossbar control logic with 5 RCUs ......................................... 74
Figure 3.13. NL-routing on 4x4 mesh ........................................................ 75

Figure 3.14. Reuse of sc_ﬁfo-based channel for both input buffer and link models. . .76

Figure 3.15. Input buffer’s latency model ................................................... 79
Figure 3.16. NOC interconnecting wire model. (a)Interconnecting Wire’s delay model.
(b)1St order wire RC delay model ............................................................. 82
Figure 3.17. TG/TR connected with NOC ................................................... 84
Figure 3.18. NOC average package latency under different topologies. (a)Injection rate
= 0.025 packet/us. (b)Injection rate = 0.25 packet/n3 ...................................... 87
Figure 3.19. (a)SRAM area v.s. size. (b) SRAM power v.s. size ......................... 88
Figure 3.20. APL performance with different packet sizes on mesh NOC. (a)Uniform
trafﬁc. (b)Transpose trafﬁc ..................................................................... 89
Figure 3.21. NOC APL with different router architectures ................................. 91
Figure 3.22. (a)SoC with 2x2 mesh NOC and Nis. (b) NI architecture .................. 96
Figure 3.23. State machine in NI. (a)State machine for data packetization. (b) State
machine for data de-packetization ............................................................ 97
Figure 3.24. SOC simulation framework by integrating PEs, N15 and NOC ............ 100
Figure 3.25. Task graph of M-JPEG application .......................................... 101
Figure 3.26. SOC simulation model mapped with M-JPEG application ............... 104
Figure 3.27. (a)PE cycle count on M-JPEG application. (b)Communication cycle
count on M-JPEG application ............................................................... 105
Figure 4.1. Abstraction levels of microprocessor ‘ " a 110

 

Figure 4.2. (a)Characterized 2-operand ALU operation. (b)Characterized 2-Operand

Multiplier operation ........................................................................... 113
Figure 4.3. Parallel FIR architecture with 3 pipeline stages providing 1 sample/cycle
throughput ....................................................................................... 117
Figure 4.4. Characterized ADD, REG and MPY macro-Operations’ power versus data
widths ........................................................................................... 119
Figure 4.5. Power macro-model for hardware. (a)Dynamic power estimation results.
(b)Simu1ation speed comparison ............................................................ 122
Figure 4.6. NOC hierarchical power model ................................................ 124
Figure 4.7. Architecture of dual-port SRAM .............................................. 125

Figure 4.8. Power model of matrix crossbar. (a)Mat.rix crossbar capacitance model.
(b)Active state of matrix crossbar ........................................................... 130

Figure 4.9. Power model of output arbiter. (a)Structure of the matrix arbiter. (b)
Default transistor sizes ........................................................................ 132

Figure 4.10. Interconnecting wires considering (i, i+1), (i, H ) coupling effects.....135
Figure 4.11. Integrating power models with SOC simulation framework .............. 137

Figure 4.12. Power proﬁle of ADPCM application on PowerPC. (a)@ T."g = Sus (b)
@ Tens = 20us .................................................................................. 139

Figure 4.13. (a)Router’s leakage power. (b)Router’s dynamic power and leakage...14l
Figure 4.14. Router’s dynamic power on different input patterns ...................... 143

Figure 4.15. Power Of different router architectures on uniform and transpose trafﬁcs.
(a)random traffic. (b)transpose traffic ...................................................... 145

Figure 4.16. Dynamic router power with different input/output buffer conﬁgurations.
(a)random trafﬁc. (b)transpose trafﬁc ...................................................... 146

Figure 4.17. 4x4 mesh NOC under burst trafﬁc from node #0, #2 and node #5 ....... 147

Figure 4.18. NOC temporal power proﬁle under the burst trafﬁc ....................... 148
Figure 4.19. NOC power breakdown at different technology nodes ..................... 149

Figure 4.20. Temporal power proﬁle of different links. (a)Link #16(router #1 -> #5)’s
proﬁle. (b) Link #l7(router #5 -> #l)’s proﬁle ............................................ 150

Figure 4.21. 4x4 NOC power spatial proﬁle under burst at different time stamps.
(a)Sus. (b)8us. (c)11us. (d)80us .............................................................. 151

Figure 4.22. (a)Leakage power proﬁle Of router #0. (b)Dynamic power profile of

router #0 ........................................................................................ 152
Figure 4.23. M-JPEG application mapped on 2x2 NOC with (a)non-overlapped trafﬁc
ﬂow. (b)0ver1apped trafﬁc flow ............................................................. 153
Figure 5.1. Tile-based SOC architecture and task mapping .............................. 167
Figure 5.2. SOC DSE framework based on GCM ......................................... 168

Figure 5.3. (a)Algorithm to ﬁnd the cut with minimum ICL from CATG (b)CATG
with out set ...................................................................................... 172

xi

LIST OF TABLES

Table 2.1. Task mapping scheme .............................................................. 44
Table 2.2. Simulated task ﬁnish time ......................................................... 44
Table 2.3. Simulation time for Mediabench applications .................................. 52
Table 3.1. Characterized read latency(ns) for FIFOs in IBM 0.18um process .......... 79
Table 3.2. Evaluated router architectures .................................................... 90
Table 4.1. Energy simulation results for J PEG on PISA architecture .................. 115
Table 4.2. Simulation times ...... - ............................................................ 115
Table 4.3. Regression coefﬁcients of ADD operations ................................... 121
Table 4.4. Regression coefﬁcients of REG operations .................................... 121
Table 4.5. Regression coefficients of MPY operations ................................... 121
Table 4.6. Wire parameters for different technologies .................................... 149

Table 4.7. M-JPEG power simulation results on architecture 4.23(a) with input
video_l .......................................................................................... 154

Table 4.8. M-JPEG power simulation results on architecture 4.23(a) with input
video_2 .......................................................................................... 154

Table 4.9. M-JPEG power simulation results on architecture 4.23(a) with input
video_3 .......................................................................................... 155

Table 4.10. M-JPEG power simulation results on architecture 4.23(a) with input
video_4 .......................................................................................... 155

Table 4.11. M-JPEG power simulation results on architecture 4.23(a) with input
video_5 .......................................................................................... 155

Table 4.12. M-JPEG power simulation results on architecture 4.23(a) with input
video_6 .......................................................................................... 156

Table 4.13. M-JPEG power simulation results on architecture 4.23(b) with input
video_2 .......................................................................................... 157

Table 4.14. M-JPEG power simulation results on architecture 4. 23(b) with input
video_ 5 .......................................................................................... 157

xii

1. INTRODUCTION

Today’s semiconductor technology makes creating the whole system on a silicon die
feasible by integrating millions or even billions of transistors when the transistor feature
size continuously scales down. The system—on-chip (SOC) refers to integration of system
level functions on a single silicon chip. SOC can be the integration of a complete
electronic system including all its periphery and its interfaces to the outside world on a
single die[1]. It can also be high-level integration of digital cores while leaving the
analog and mixed signal functions to other devices. Nowadays people also started
integrating the RF functionality into SoCs to make the single—chip cell phone platform
solution, such as Texas Instruments’ LOCostO[116] and eCosto[117] platforms, which
reduces the BOM (Bill Of Materials) dramatically. SOC is a best candidate to meet the
increasing system complexity and the ever tightening time-to-market. SOC belongs to the
broader class of embedded systems where hardware and software are closely coupled and
the system targets one application or a class of applications, in contrast to general purpose
computing systems. This thesis concentrates mainly on the digital SOC or the digital
portion of a mixed-signal SOC, but it is also extendable to support the analog/mixed-
signal functionalities once the appropriate analog models are plugged—in. Generally
speaking, modern high performance SOC designs consist of multiple integrated parts
including the Processing Elements (PBS), the On-chip Communication Architecture
(OCA), i.e., shared-bus or packet-switching Network-on-Chip (NOC), and embedded
memory blocks providing on-chip data/program storage. The PEs process the
computation tasks while the OCA provides the necessary communication services
between these PEs[2]. It is critical to optimize both the computation architecture and the
OCA so that it can meet the ever-increasing demands of the computation and
communication intensive applications. These SoCs often require comprehensive
functionality, high performance, low power consumption, low production cost and short

time-to-market.

The continuously increasing VLSI design complexity has stimulated the advancement
of Electronic Design Automation (EDA) tools and methodologies, such as the well-
established logic synthesis-based ASIC design ﬂow. However, the continuously shrinking
feature size in the deep sub-micron (DSM) era aggravates the negative effects of gate and
sub-threshold 1eakage[3], crosstalk[4], soft error[5] etc., which have resided in the
custom ASIC design area for decades. Moreover, as ITRS[6] predicted, the productivity
of current EDA tools is becoming increasingly inadequate for tackling the complexity of
today’s ICs with one of the reasons that the productivity of EDA tools is not growing as
fast as the exponential growth of semiconductor gate count thanks to the Moore’s Law.
Alamiingly, the design productivity gap is widening in recent years as the design goes to
the multi-core SOC epoch. Copying the conventional ASIC design methodology to SOC is
not adequate to solve the problem. To help bridge this design productivity gap, the
semiconductor industry needs a new generation of EDA methodologies and tools that can
signiﬁcantly increase design productivity and reduce the Non-Recurring Engineering
(NRE) cost.

I .1. Platform-based System-on-Chip Design Methodology

In response to these challenges, a new design methodology named “platform—based
design”[7] was proposed, promoting reuse at higher levels of granularity and relying
heavily on ﬂexibility and programmability to amortize the NRE costs of complex multi-
core SOC design. A platform-based design methodology is essentially a meet-iu-the
middle approach, which leverages the power of top-down methods and the efﬁciency of
bottom-up styles. A design process is viewed as a stepwise reﬁnement of a speciﬁcation
into a lower level of abstraction chosen from a restricted library of available components.
Components are computational blocks or interconnecting blocks and are often referred to
as Intellectual Property (IP) cores in many literatures. This library is a platform, which is
a family of designs and not a single design in essence. A platform deﬁnes the design

space that can be explored, and design space is the set of combinations of the components

with different conﬁgurations from the platform library.

Once a particular collection of components of the platform is selected, we obtain a
platform instance or a point of design space. The choice of the platform instance and the
mapping of the components of the speciﬁcation into the components of the platform
instance represent the top—down process where the constraints accompanying the
speciﬁcation are mapped into constraints on the components of the platform instance.
When selecting a platform instance and mapping constraints, it is important to guide the
selection with constraint parameters, e.g. delay, power, size and cost, that summarize the
characteristics of the components of the platform. Evaluating these parameters quickly
with the appropriate accuracy is one of the critical parts of platform-based design [8].

1.2. SOC Architecture Overview

From the architectural point of view, an SOC consists of ﬁve main functional blocks:
processor core, hardware accelerator IPs, memory blocks, peripherals and on-chip
interconnection architecture (OCA). The processor core functions as the central controller
for other functional blocks in the whole system to orchestrate their operations. Modern
SoCs often employ RISC microprocessor as the processor core due to its simple
architecture and high clock frequency. Sometimes there is more than one processor,
which may be homogeneous or heterogeneous, in the SOC to handle different
control/computation tasks. Another important feature in SOC is that dedicated hardware
blocks are often included to accelerate the computation-intensive parts of the application
to mitigate the load on the main processor core. It may be the co-processor executing
speciﬁcally designed instructions, or custom logic and reconﬁgurable FPGAs. In the SOC
context, both processors and hardware accelerators are referred as Processing Elements
(PE) since their tasks are performing data processing. PEs access memory blocks
providing code and data storage, and peripherals offer the system interface with the
external world. The non-volatile ﬂash memory is often employed to store the program

code or persistent data, while SRAM is used as data buffer to store intermediate

computation results due to its fast accessing speed. Dynamic RAM (DRAM) has not been
extensively integrated in current SOC designs due to its speciﬁc fabrication process. All
these functional blocks are interconnected together via OCA to compose the complete
system, and there are three main interconnecting schemes differentiating SOC OCA
architectures: shared bus, switched bus and network-on-chip (NOC).
1.2.1. Shared Bus OCA

The shared bus is the most common way to move on-chip data in current SOC designs.
In this scheme, many components can drive a single interconnect net that sends the
signals to all devices on the net. The target component replies to the master's request
depending on the machine address. Figure 1.1 (a) shows a classical shared-bus SOC
architecture connecting 8 PEs. A bus arbiter is present to decide which PE is granted for
accessing the bus according to the arbitration policy once multiple masters raise their
requests simultaneously. During execution only one master PE can initialize a bus
transaction orchestrated by the bus arbiter. Shared bus architecture is very popular in the
previous and current SOC designs and several semiconductor vendors provide industrial
standardized buses such as ARM’s AMBA[51] and IBM’s CoreConnect[52]. But shared
bus is not very scalable electrically because high fan-out load limits the speed. The
physical wires also tend to be long, adding more delay and signal integrity issues.
1.2.2. Switched Bus OCA

The switched bus architecture differentiates from the shared bus by allowing multiple
bus transactions simultaneously at the same clock cycle. There are multiple point-to—point
interconnections controlled by the crossbar switch, which selects interconnects between
any combination of the CPU, the memory and other PS. The switched bus architecture is
essentially a multi-bus whose connectivity is controlled by the crossbar switch. This
interconnection architecture relieves some burden resulting from the shared bus in large
SOC consisting of large number of PBS. Currently MIPS Technology has developed its
switched bus product SOC-it[97] to leverage the multi-layer capabilities of ARM's

AMBA to allow two simultaneous transfers on the same clock cycle.

 

ASIC-1 ASIC-2 CPU-1 ASIC-3

 

 

 

 

 

 

 

 

 

 

 

SRAM CPU-2 FPGA ASIC—4

 

 

 

 

 

 

 

 

 

(a) Shared—bus architecture

ASIC-2
FPGA
ASIC-3
ASIC-4

r
7

51‘2“ x“_

sees:

V

CPU-1 Z

 

 

ASIC-1 4

 

CPU-2 2

 

SRAM 1

USED
§%%%_
gage-

V

 

 

 

 

I I

(b) Switched—bus architecture
Figure 1.1 Bus on-chip interconnections in SOC.

1.2.3. Nerwork-on-Chip OCA

Network-on-Chip (NOC) has evolved as a promising alternative for large SOC on-chip
interconnections during the recent years. The idea is that scalable-switched networks are
used for on-chip communication between PBS in order to cope with the design of
continuously growing systems. Different from bus—based SOC, cores in NOC
communicate by sending information packets across the on-chip network instead of

driving signals across dedicated global circuit wires [67]. The on-chip interconnection is

implemented by routers that connect with each other via interconnecting links. These
routers are placed in a regular ﬂoorplan and multi-bit width links connect these routers to
form different network topologies, such as mesh and torus. Each PE is connected with
one router via network interface and they form a “tile” in the SOC. Compared with the
shared bus architecture, the shorter links in NOC are more controllable than the long
wires when the system grows large. The main force behind NOC is the progress in
semiconductor manufacturing technology and the fact that it is not followed by increased
effectiveness in design automation tools. Design complexity promotes reuse of
investment in earlier designs as well as purchase Of outside intellectual property (IP), but
due to larger designs, communication among components will become a bottleneck using
traditional techniques like common shared- or switched-buses. The most adverse pitfall
for the shared-bus interconnection is that the global wires are non-scalable when the
number of connected cores increases since adding one core to the bus will increase its
capacitive load, which tightly contributes to the delay and dynamic power. Moreover, the
inverse scaling properties residing in global wires ask for a more scalable and ﬂexible on—
chip interconnection architecture for the future SoCs fabricated in sub-90nm processes.
NOC is one solution to this where regular packet switched communication architecture
can provide better scalability, ﬂexibility, throughput and reusability. Currently NOC-based
SOC is still in the infant stage and many researchers and companies put much effort in it.
Figure 1.2 shows an SOC with 4x4 mesh NOC interconnection architecture.
1.3. Requirements of 50C Modeling for Eﬁicient Design Space Exploration

With the advent of multi-core architecture, each application can be partitioned into
different tasks, each of which runs on different PE to achieve the pre-deﬁned
functionality. Given an application and a set of available PE cores, SOC designer has to
decide the number and configuration of hardware resources to be used for the application
to fulﬁll the functionality and meet the non-functional constraints such as power and area.

This has increased the design space manifolds. It is not possible to explore all possible

candidate architectures with detailed simulation and synthesis. Fast simulation
frameworks and efﬁcient modeling methodologies are indispensable in tackling the
Design Space Exploration (DSE) problem in multi-core SOC era. Efﬁcient
implementation of heterogeneous system requires performance and energy estimation at
the system level in early design stages so that design space can be reduced in later
reﬁnement stages. In general, the system-level SOC DSE requires a modeling framework
to performing the following tasks:

System speciﬁcation

@Architecture selection and resource allocation

@Task partitioning and resource mapping

@System-level integration

©Fast hardware/software co-simulation

©Efﬁcient simulation data post-processing

 

 

 

 

 

 

 

 

 

Figure 1.2 SOC containing 16 PEs with 4x4 mesh NOC interconnection
The system speciﬁcation should be manifested as functional model and nonfunctional
requirements. The functional model should be abstract and does not constraint
implementation details. Analysis will provide insight on architectural decisions. The
speciﬁcation is then transformed into implementation in multiple steps. The target

application will be partitioned into small tasks, mapped onto different PBS, and the inter-

PE communication model will be inserted for data sy-u L ' " ' ' The

 

choice of target PE and task mapping increases exponentially along with the complex
applications and large PE libraries. The high complexities of embedded systems require a
system level design automation framework to perform these transformations fast and
efﬁciently, and it should allow exploration of possible architectures at a higher level
without going for synthesis and detailed timing simulation.

In addition to implementing the functionality deﬁned by speciﬁcation, another
important role an system—level SOC model should possess is the ability to estimate the
non-functional constraints, from which energy/power and area are the most frequently
used ones. Energy consumption directly relates to the battery life in the battery-powered
handheld devices. In high performance systems, the power consumption determines the
need for heat dissipation, which can be costly in places such as central offices for
telecommunication systems. The area has major impact on the chip cost. Energy
consumption is related to many factors in the SOC design from task mapping of the
application to the conﬁguration of each PE, and estimating the energy precisely and
quickly is not a trivial task. It is well believed that the higher level the energy estimation
and optimization is performed, the better energy saving will be achieved [49]. Fast
energy/power simulator is becoming an important component in modern SOC DSE
frameworks targeting low power applications. SOC energy consumption is composed by
the computation energy from PBS and communication energy from OCA, and each of
them should be estimated when simulating an application. Proﬁling the energy
consumption of SOC components temporarily or spatially also helps designers estimate
the potential energy-hungry parts during run-time.

In general, A system-level model should provide the system properties without detailed
low-level information. Depending on the properties of interest, different types of system
models can be built. In the research, the system-level model is functionality—centric in

some sense since it cares more about the system functionality than the detailed circuit

implementation. For example, a PowerPC[ll] microprocessor executing an application
code could be modeled simply by the same application code executed on the host
machine with the back-annotated timing information characterized from the low-level
PowerPC models or real chips. Given a functional speciﬁcation, other system blocks
could be modeled in the same way and the similar method can be applied on
communication fabrics. In this way, the system-level model accelerates the process of
functionality veriﬁcation, design space exploration and nonfunctional metrics estimation
in early design stages before the design goes down to the lower levels for logic- and
circuit-level implementation.

1.4. Related Works on System-on-Chip Modeling

In this section, we discuss the existing approaches for modeling multi-core SoCs or its
IP cores on three main metrics: performance(throughput, clock rate), power/energy, and
thermal properties (chip temperature) as well as functionality. These metrics are critical to
SOC and much effort[l2][l3][14] has been taken to evaluate them quickly and accurately.
First Of all, several prior work addressing system-level modeling methodologies[l3][14]
is essentially a modification of the available micro-architectural model such as
SimpleScalar[ 10] by adding the customized power, energy or thermal models, integrating
the modified model with other blocks and performing simulation. While these models can
generate an accurate estimation, which is a natural consequence of employing the
architectural information such as L1 and L2 cache, BTB, instruction fetch in
microprocessor core, their simulation speed is often slow. They are suitable for detailed
study of micro-architecture choices.

Secondly, system-level model is functionality-centric without too much low-level
information as was mentioned earlier. Utilizing this property, macro-models have been
proposed and researched in many literatures for programmable microprocessors [15][16],
hardwired ASICs [17][l8], and SoCs containing both of these two types of

components[ l9][20] with memory blocks. Macro-models rely on the characterization of a

comprehensive set of atomic operations by which all other functionalities could be
implemented. The characterization process involves simulating a set of test benches
composed by these atomic operations on the available low-level models to acquire the
metrics (power, performance) data and performing the mathematical linear/non-linear
regression analysis and curve ﬁtting, from which a closed-form expression for each
atomic operation in terms of one of these metrics could be derived. Applying these
derived models, system-level simulation could be performed by only taking the block’s
input/output properties into consideration and these metrics could be estimated by
evaluating the incoming atomic operations without the verbose architectural information.
The most frequently used atomic operations are instructions (Instruction Set Architecture,
or ISA) and input/output switching statistics for microprocessors and ASIC cores,
respectively. However, the conventional macro-model methodology still has some
challenges when applied to multi-core SoCs. Using a virtual SOC containing only two
microprocessor cores (for example, PowerPC and ARM) with their corresponding ISA
macro-models as an example, the partitioned tasks (a segment of application code, which
is often written in C/C++) mapped on each core should be compiled to instructions before
the macro-model could be evaluated and simulated. And this compilation process should
be performed whenever there is a change on the mapped tasks for each core and the
compilation time will take a non-negligible overhead when the system scale becomes
large. We address this as retargetability issue in our research and retargetability reﬂects
the ﬂexibility of performing task mapping among different core architectures including
both programmable micro-processors and hardwired ASICs. This issue resides in many
proposed system-level SOC modeling frameworks[21][12][22] by the fact that changing
task mapping always results in the code recompilation for microprocessors and re-
synthesis for hardware cores, and solving this issue is still an open question.

Since SOC integrates multiple IP cores, each Of which can be a complete functionality

block to perform computational tasks or provide communication services, simulating the

whole SOC on the circuit-level becomes intolerably time-consuming and error-prone. In
consequence, available SOC simulation models are built upon RTL or higher levels of
abstraction. Many research groups have proposed different methodologies and simulation
frameworks to model and simulate multi-core SOCs at different levels of abstraction to
facilitate the functional veriﬁcation and design space exploration. We brieﬂy review their
work and enumerate three major modeling languages and environments residing in
industry and academia: SystemC, Ptolemy H and Metropolis. And we also review some
recent research work on the architecture and modeling Of network-on-chip, which is
chosen as the communication architecture in this research.
1.4.1. System-on-Chip Simulation Frameworks

[23] proposes a simulation framework utilizing VHDL to evaluate several features of
virtual channels in mesh-based and hierarchical packet-switching NOC topologies in
multi-core SoCs. The accuracy of VHDL model is high due to its cycle-accurate
simulation engine, but it suffers from low simulation speed if the number of cores
increases. To enhance the simulation speed, several approaches have been proposed. [24]
presents a hybrid VHDL/SystemC model using a template router to support multiple
interconnection networks. [25][26] describes a modeling framework for custom NOC
topologies based on SystemC. In [27], XML is employed to specify the NOC components
and evaluate different mesh-based NOC architectures. These models all concentrate on
the architectural detail of each SOC component and are cycle-accurate. As a result, the
simulation speed is still limited especially in the case that one or multiple microprocessor
cores are present in the target SOC. And moreover, [28][29] use data trafﬁc from tracing
the application on the off-the-shelf simulators to simplify the PE models since the main
goal of [28][29] is to model and evaluate different interconnection schemes (OCA).

C/C++ is adopted broadly as a high-level system modeling language as well as a
programming language. For most microprocessor-based SOC architectures, Instruction

Set Simulator (188) has been extensively employed in modeling and simulating the target

applications in that 188 is essentially a set of C/C++ codes running on the host machine
to mimic the code execution on target architectures. Most ISSes[10][30][31] are
constructed in the micro-architecture level to provide cycle-accurate simulation engines,
and they are much faster than HDL simulators when evaluating software execution on
microprocessors. [32][33][34] propose the Multi-Processor SOC (MPSOC) simulation
framework using Inter-Process Communication (IPC) and a bus wrapper. The 188 and the
C/C++ hardware model run as distinct processes on the host system and communicate
between the system simulation and 158 and translate the information from the 188 into
cycle-accurate bus transactions. To resolve the issue that different ISSes need different
wrappers tO hook up the MPSOC simulation model, [35] develops an implementation Of
the IPC interface between wrappers and the ISSes based on the Open-source debugger-
GDB[36] so that any 188 that can communicate with GDB can also become part of the
co-simulation framework seamlessly. These simulation frameworks are homogeneous
from the language point of view since both hardware (HW) and software (SW) are
described using C/C++, even they show heterogeneity from the simulator point of view
because HW and SW can be executed using different simulators where SystemC
simulation kernel is used for hardware and 188 for software program, respectively.
1.4.2. SoC Modeling Languages and Environments

This section introduces three representative modeling environments in the
chronological order of development.
1.4.2.1. SystemC: System-level Modeling Language

As described before, many SOC simulation frameworks are based on the SystemC.
SystemC is an Open source architecture- and system-level design and modeling language
based on C++[37],[38] and it allows designers to apply well-understood software
techniques such as Object-oriented programming (GOP), to the problems of system
modeling and veriﬁcation. Although originally designed for system and hardware design,

SystemC is also applied in veriﬁcation, testbench design, software modeling, etc.

Currently SystemC 2.0.1 and later versions have been extensively used by many
semiconductor vendors as well as EDA companies.

SystemC language itself provides three basic computation units and one
communication unit: (1) Function that is similar to the regular C++ member functions,
(2) Process that is the basic functional entity Of SytemC, and there are three different
process abstractions. (3) Module that is the basic structural entity of SystemC and
modules can communicate with each other through port, which can be either uni-
directional or bi-directional. (4) Channel that transports data between ports connecting
different modules. SystemC also provides a rich set of data types targeting hardware
modeling and simulation. Finally, SystemC library includes a lightweight Discrete Event
(DE) simulation kernel.
1.4.2.2. Ptolemy 11 Modeling Environment

Ptolemy II is an object-oriented, heterogeneous design and modeling framework based
on Java. It targets the modeling, simulation and design of concurrent heterogeneous
systems, which are usually hard to model in One uniﬁed design framework. Its
description language can easily describe complex hierarchical interconnections Of
pararneterized executable components. Ptolemy II supports a number of Models of
Computation (MOC)[39], including Communication Sequential Process (CSP)[40], DE
and Synchronous/Reactive (SR)[41]. Each MOC is implemented in Ptolemy II as a
domain that is a software package that performs the required scheduling algorithms for
the MOC. The software architecture of Ptolemy H is layered. Besides the domain
packages, the core package contains the abstract syntax of semantics for the data model.
The library packages provide actors that can execute in multiple domains. Also, a
graphical user interface (GUI) package is provided to support its internal XML format,
MOML.
1.4.2.3. Metropolis System-level Design Framework

, The Metropolis [13] develops a formal methodology for SOC design and the

corresponding system—level framework facilitating the exploration of multi-core SoCs. In
the Metropolis methodology, the 30C designer ﬁrst describes (or selects from IP
libraries) the blocks that perform computations and, then, designs the communication
among them using a rigorous successive reﬁnement process. To maximize reusability, the
layers of the protocols should only encapsulate the original computation cores without
any change in their internal structure. This framework is based on a formal representation
of the system throughout all levels of abstraction, but is not biased towards any speciﬁc
Model of Computation (MOC) or communication semantics. At the highest level of
abstraction, the system specification is purely denotational, i.e. the system is described as
a set of concurrent components, call processes, and each deﬁned as an input/output
relation. In contrast to the traditional design practice, where the speciﬁcation of the
processes also includes details of the communication interface and hence is not easily
reusable, the communication among processes in Metropolis is separated and dealt with
explicitly.

In Metropolis communication is modeled by channels with a set of properties, such as
FIFO ordering, error rate and bandwidth, and a set of interfaces (e.g., read and write) that
the processes connected to the channel can access. Once a physical channel is chosen, the
designer must select a MOC that deﬁnes the ﬁring rules that the processes must follow
when they access the channel. The ﬁring rules are part of a MoC wrapper that
encapsulates each process. If the selected physical channel does not immediately meet the
communication constraints on parameters like delay, throughput, reliability, it is
necessary to introduce adapters (called channel adapters) between sender and receiver,
and the channel. Consider the problem of implementing a reliable error-free connection.
If an unreliable channel with a non-negligible error rate is selected, it is necessary to
introduce adapters, e.g. encoding or retransmission functions, between the unreliable

channel and the sender and receiver.

1.5. Key Objectives and Contributions

We propose a hierarchical SoC modeling methodology and the associated framework
that tackle the issues related to the ever-increasing system complexity and retargetibility.
The goal is to satisfy the need for efﬁcient design space exploration. The key objectives
of this work are:

@A uniﬁed modeling environment that can be used at different stages of design and can

support different levels of abstraction.

@A uniﬁed modeling framework that can model functional specification, system

architecture and functions mapped to the architecture.

@A framework that models the concurrency and communication.

@A modeling framework that provides fast simulation for realistic application loads.

©A performance modeling framework that supports untimed, approximate timed and

cycle—based timing models.

©A modeling framework that supports nonfunctional properties such as power

consumption and dynamic power management.

@Integration of processing element models and on-chip communication architecture
models.

.A methodology for rapid simulation model generation according to specs.

@Build the design space exploration tools with simulation in the loop.

In order to achieve these objectives, a set of models and associated tool ﬂows are
developed during this research. Firstly, the multi-granularity time model is proposed and
will be utilized throughout the whole framework, and it also supports performance and
power models with variable granularity to trade off the simulation speed and estimation
accuracy. An approximate-timed cycle-free transaction-level simulation model for
processing elements (PE) including programmable microprocessor and hardware
accelerator (ASIC) based on the uniform task scheduling engine - task execution engine

(TSE-TEE) is developed. This methodology provides an Open framework for modeling

different architectures at system-level with the aid of proposed macro-modeling
methodology covering both performance and power estimation. To reduce the simulation
time based on the macro-model in some computation-intensive applications as well as
back-annotated simulation code size, an innovative macro-model grouping technique will
be presented. On the other side, to address the on-chip communication modeling, we
propose a parameterized simulation model for Network-on-Chip (NOC) supporting
multiple conﬁgurations. Combining the computation (PE) and communication (NOC)
models through the interfaces (wrappers), a fully integrated model of PEs and the NOC
OCA that enables efﬁcient modeling and evaluation of the multi-core SOC is proposed.
Due to the framework’s open architecture feature, our model supports plugging-in power
and performance models at different levels of abstraction, by which the system-level
performance (latency and throughput), power (including both dynamic and leakage
power), and potentially thermal properties can be evaluated, and this information in turn
helps the fast simulation-based design space exploration and application temporal and
spatial proﬁling to ﬁnd the optimal system conﬁguration given a set of design constraints
and possible hot-spots during run-time. Finally a power-aware design space exploration
framework employing global communication minimization and local PE/NOC reﬁnement
is developed. This DSE framework heavily beneﬁts from the fast task-level SOC
simulation model, which provides capabilities of in-loop simulation to estimate the
performance and power under the real application load.
1.6. Thesis Organization

The remainder of the thesis is organized as follows. Chapter 2 presents the multi-
granularity time model and the application task model. It also covers the proposed
empirical macro-modeling methodology for PE’s high-level performance estimation.
Then it starts presenting the approximate-timed transaction-level PE simulation model
based on the task schedule engine (TSE) and task execution engine (TEE). Based on the
proposed TSE/TEE model, a retargetable PE simulation framework is developed and it

eliminates architecture—dependent cross-compilation and synthesis when performing the
task remapping. Chapter 3 details the parameterized packet-switching NOC on-chip
communication infrastructure, and it focuses on the modeling of two main NOC
components: on-chip router and link. Based on the functionality model supporting
exploring different NOC architectures, the NOC performance model is presented based on
analyzing the router and link’s architectural features. This chapter also presents the
integration of PE model presented in chapter 2 and the NOC model to form a complete
task-level multi—core SOC simulation model named xSoCSim, which has been
successfully ported to the M-JPEG media application. Chapter 4 presents the energy
model for both PE and NOC supporting the run-time dynamic and leakage power
estimation. PE energy model is based on the statistical macro—model similar to what’s
presented in chapter 3, while the NOC energy model is based on the architectural
information of the target NOC router and link, as well as the run—time data payload
pattern. Based on the functionality, performance and power models presented in chapter
2, 3 and 4, chapter 5 presents the idea of system-level simulation—in—loop design space
exploration methodology and the associated global communication minimization
searching algorithm built upon the xSoCSim model. Chapter 6 concludes this thesis and

presents the future work that can be done based on this research work.

2. CYCLE-APPROXIMATE PROCESSING ELEMENT MODEL

As what’s discussed in chapter 1, simulation is one of the most frequently used
methods to explore the SOC design space due to its precision with high ﬁdelity. By
running applications on the simulator, the system functionality can be veriﬁed and some
non—functional design metrics such as throughput and energy will be estimated to help
designers reﬁne the design to meet the speciﬁcation. Different types of
simulators[25][30] have been developed to simulate the hardware and software at
different levels of abstraction. For hardware driven by the synchronous clocks, cycle—
accurate simulators have been the well-established environment In the cycle-accurate (or
cycle-true in some literatures) hardware simulator, signal transitions on memory
components are evaluated at the rising or falling edge of synchronizing clocks for
deterministic timing. For the software running on microprocessors, cycle-accurate
instruction-set simulator (188) is adopted extensively in both industry and academia to
evaluate the processors. In 188, the operations of each instruction on the processor’s
functional units are traced and recorded to mimic its execution, and a virtual global clock
is employed to synchronize the processor’s functional units.

Heterogeneity in modern SoCs due to different types of IP cores ranging from
hardwired ASIC to fully programmable processor makes developing SOC simulator a
non-trivial work. Ad-hoc integration of each component’s cycle-accurate simulation
model is not an efﬁcient way since each simulator has different interfaces and semantics.
Also the simulation time will be another major concern when system scale is large. More
importantly, simply integrating different simulators may not tackle the simulation of on-
chip communications, which need to be dealt more efﬁciently since communication will
become a major overhead in the future SoCs containing hundreds of cores. Many
companies and universities are currently working on developing efﬁcient simulators for
multi-core SoCs, and there have been some products available in the market [111], and

published research tools [34][35].

Our work in this research targets developing an efﬁcient and fast simulation platform
for multi-core SoCs with NoC on—chip communication architecture. We propose an
innovative cycle-approximate retargetable SOC simulation framework in this and the next
chapters. The novelty lies in the following aspects: First of all, the multi-granular time
model leverages the different timing properties residing in the heterogeneous multi-core
SoCs at system-level; second, an uniform PE simulation framework supporting both
concurrent ASIC model and sequential microprocessor model is presented and it executes
at the task—level with back-annotated timing and energy macro-models. At last, the
parameterized packet—switching NOC model is developed to be integrated with PE models
via gluing interfaces to construct a complete SOC simulation framework. The NOC model
is developed by integrating multiple functionally orthogonal modules, each of which can
be reconﬁgured to support different communication features such as routing algorithm,
arbitration and ﬂow control etc. to facilitate system-level NOC reconﬁguration for the
design space exploration. And the open architecture interface provides the space for the
future extension to support other NoC architectures such as asymmetrical network
topology and virtual channel routing algorithm. Adhering to the broadly adopted IEEE
SystemC standard, we choose SystemC 2.1 as the main modeling language during this
work.

2.1. Mum-granular Time Model

Time is a crucial design parameter among almost all electronic systems and it is of the
utmost concern for many system models to evaluate their sequential or concurrent
executions. Depending on different levels of abstraction, there are three basic time
models applied in the digital VLSI system as what’s described in [43]:

(l) The physical time model uses physical time units and is based on physical principles.
It is used extensively in the circuit-level timing analysis and RTL-level delay back—
annotation.

.(2) The clocked time model relates all circuit activities to a clocking scheme where one or

multiple periodic clocks are applied to synchronize the operations of each functional
component inside the system. It is essentially the concepts of digital time.
(3) The causality time is an abstract model and employs a partial ordering deﬁned by
generation and consumption of data and by explicit control dependencies to emulate the
total ordering of system events. At this level, time is abstracted to a higher-level without
speciﬁed values and units.
2.1.1 . Cycle-Approximate Time Annotation

Trme granularity is the minimum timing unit to model the system’s behavior. At the
circuit-level, the time granularity is the physical time resolution related to solve the
Circuit’s differential equation. At RTL, the time granularity is the synchronizing clock
with speciﬁc parameters such as period, and the circuit will be triggered and evaluated on
each transition event (falling- or rising— edge) of the clock. Similarly, cycle-accurate
micro-architectural model triggers its evaluation on each clock cycle with the difference
that it only evaluates the operations on the micro-architecture level through abstracting
out some unnecessary detailed transitions inside each micro-architectural block. Although
cycle-accurate models generate accurate timing information, it limits the temporal
resolution at each cycle period, which will be in the ns range, resulting in slow simulation
for large systems. It is also difficult to use if the systems contain multiple cores running at
different clocks, where the meta-stability needs to be handled carefully in the blocks
crossing two clock-domains. A well-used design method is to insert the re-synchronizer
between the two clock domains, but modeling it brings additional overheads since there
may be tens or hundreds of re-synchronizers needed to be evaluated at each delta cycle
on the clock events resulting from different clock domains. This will pose signiﬁcant
simulation speed-down, which can be resolved by moving to higher levels of abstraction.

The functional centric system-level model mainly considers the relationship between
the system’s input and output by eliminating its implementation details to raise the level

of abstraction to tackle the future large platform-based SOC design. The time model

20

utilized in our system-level framework is essentially a meet-in-the-middle solution
combining both physical and causality time models introduced beforehand. Figure 2. 1(a)
shows part of a functionality description of DCT algorithm that can be implemented in
ASIC, FPGA or microprocessor. This functional model contains 9 macro-executors, each
of which performs some intermediate computations. At the system-level, the
functionality-centric model concerns more about “what’s the output when applying given
inputs” and “when does each task ﬁnish” than “how does each task is implemented” since
designers don’t care about the internal signal transitions at early design stages. Given the
pre-characterized timing models of the macro-executors, an efﬁcient way is to annotate
the timing model, which is executable, to each macro-executor to emulate its timing
property on the target architecture. Assume a macro-executor ME,- has the execution time
Of ETMSIC on ASIC core, the functional spec can be converted into the system-level model
shown in Figure 2.1(b). Each macro-executor is annotated with a delay function
DelayFunc(ET.-, ASIC), which issues a time event after its execution. This time event will
be sent to the simulation kernel, which synchronizes the whole system if multiple
processes execute simultaneously. In this scheme, the causality is maintained by the
control and data dependencies and the timing information is also provided by invoking
delay functions. Moreover, time events are aligned by the completion of macro-executors
for system synchronization and they are not distributed evenly on the time axis
considering the fact that different macro-executors have different execution time. And
even the same macro—executor may still spend disparate amount of time to execute on
different target architectures. Compared with cycle-accurate or clocked time models with
ﬁxed granularity, the proposed time model is multiple-granular in the sense that it doesn’t
have a constant minimum unit for each two consecutive synchronization time events
along the time axis during simulation. The time granularity is associated with macro-
executors by the annotated timing information evaluated by delay functions. Figure 2.2

describes three sequential macro-executors ME], ME; and ME3 with lOOns, 80ns and

21

200ns of execution time respectively. It also shows the time events’ distribution on the
time axis for a cycle-accurate and multi-granular time model. The execution of functional
code with multi-granular time model only issues 3 time events represented by the vertical
lines on the time axis, while the cycle-accurate counterpart generates 38 time events
given the clock period Of lOns. Since most currently used Discrete Event (DE) simulators
are triggered by events to perform evaluation, the number of time events generated during
simulation has a direct inﬂuence on the simulation speed. Shown in Figure 2.2,
simulating using C++/SystemC-source-level simulator should be much faster than using
RTL-based cycle-accurate simulator resulting from the reduced number of time events.
Applying this time model makes system-level evaluation and design space exploration
more efﬁcient when performing macro-architectural simulation thanks to the reduced

number of time events.

 

 

 

Vvoidfdct_1d(int r0w[8], int r0w_res[8])
{
"""" . //’variable declaration
t0 = (float) (r0w[0] + r0w[7] — bias); //ME,
t7 = (float) (r0w[0] - r0w[7]); //ME,3
t1 = (float) (r0w[l] + r0w[6] - bias); //ME3
t6 = (float) (r0w[1] - r0w[6]); //ME4
t2 = (float) (r0w[2] + r0w[5] - bias); //ME5
t5 = (ﬂoat) (r0w[2] - r0w[5]); //ME6
t3 = 070611) (r0w[3] + r0w[4] - bias); //ME7
t4 = (float) (r0w[3] - row[ 4 j); //ME3
x0 = [0 + [3; //ME9
""""""" . //other macro-executors
1' m

 

Figure 2.1(a) DCT functional speciﬁcation

V

void fdct_1 d( int r0w[8], int row_res[8])

I
r

t0 = (float) (r0w[0] + r0w[7] - bias); //ME1
elayF unc(E T ram);

(7 = (float) (r0w[0] - r0w[7]); //ME2
/ , DelayFunc(ET2.A.s1c);
, / t1 = (ﬂoat) (r0w[l] + r0w[6] - bias); ﬁll/IE3
Delay fUHCIIOD ‘——-—DelayFunc(ET3,4s,();

ammatio” x :6 = (float) (r0w[1] - r0w[6]).' //ME4

\Dela yF unc(E T 4,45%);

 

 

 

 

 

F t2 = (ﬂoat) (r0w[2] + r0w[5] - bias); //ME5
\DelayFundETMm);
t5 = (float) (r0w[2] - r0w[5]); //’ME6
\ DelayF unc(E T 6 ASIC);
//\ """" . //other macro-executors
,/'//// f
,«"‘l\/‘/ ',/// I
T \1"
Time event sequence
I J l l l I : time

 

ETIASIC ETIASK' ET3.A.\'[(' ETJ..4.\‘I(' ETJ'ASK‘ ETISASK'
Figure 2.1(b) Executable model with time annotation

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

t0 = (float) (r0w[0] + r0w[7] - bias); rill-IE,

I7 = (float) (r0w[0] * r0w[7]); AIME;

t5 : (float) (r0w[2] - r0w[5]); Adi/IE3

spec in C? N
C-to-RTL conversion Time annotation
Ven'lOQ or VHDL :0 = (ﬂoat) (row/01+ row/71 - bias): mg,

DelayFuncﬂ 00113);
t7 = (float) (row/0] * r0w[7]); .I'I/MEg
DelayFunc(80ns);
t5 = (float) (r0w[2] - r0w[5]) ,xf/ME3
DelayFunc(200ns);

 

 

 

[RTL simulator) (C-level simulator)

I time

ME1 ME: ME; ME1 MEZ ME3

 

  

Figure 2.2 Time event distributions during simulation with different time models

Cycle-approximate simulation can be performed with the multi-granular time model by

23

considering the timing only on the border of each macro—executor without delving its
architectural detail. Although the timing information inside each macro-executor, which
is decided by its implementation, is not explicit by annotation, the external timing
information such as execution time can be applied directly for the overall system-level
timing analysis. In this sense, the cycle information for each macro-executor is still
effective given a pseudo-clock, which is not used for system synchronization, but just for
cycle number calculation. And this scheme is called cycle-approximation. On the other
hand, the multi-granular time model doesn’t preclude the cycle-accurate property. In the
case where cycle-accuracy is required to model the internal micro-architecture of a core,
invoking a delay function with the argument of constant time value equal to the clock
period repetitively can provide a pseudo-clocked time model. In this way, each pair of
neighboring time events has the same interval, which is the same as the cycle-accurate
simulation scheme introduced before.

Moreover, the three aforementioned time models (physical, clocked and causal) can
coexist in the same simulation framework using the multi-granular time model to
leverage different simulation timing requirements in the heterogeneous SOC architecture.
This scheme is also applied in the NOC OCA model, where each packet’s receiving are
handled using the clocked time model for the system-level synchronization, and route
computation of each packet is processed with causal time model, and the transferring
packet from one to another router is evaluated with the physical time model since the
wire has the intrinsic RC delay that cannot be modeled with the synchronizing clock.
2.1.2. Implementation of Mum-granular Time Model in SystemC

The proposed multi-granular time model is implemented by SystemC 2.1 in our
framework. SystemC 2.1 library provides the pre-deﬁned sc_time data type to represent
physical time in different resolutions (s, ms, us, ps and fs). A sc_time data has two fields:
value and unit, to describe the attribute of time. The value ﬁeld contains the absolute time

value, while the unit ﬁeld speciﬁes the time scale that is one of the pre-deﬁned macros:

SC_S, SC_MS, SC_US, SC_NS, SC_PS and SC_FS. Operations such as addition,
subtraction and division are overloaded for sc_time data type to deal with the automatic
time scale conversion. SystemC library provides wait( ) method as a prototype of the
delay function. When wait( ) is invoked with an argument Of sc_time(t,u), a time—out
event will be generated after the simulated time reaches t in the unit of u. Since multiple
processes are executed concurrently during simulation, SystemC DE simulation kernel
will evaluate these executing processes at each time event to perform updating before
evaluating the next incoming time event. This type of evaluation is global, saying that if
there are multiple concurrent processes executing in one simulation model, any time
event will trigger updating of all processes no matter which one generates it.

To illustrate the time model for different PE types, we use “[ ]” to represent the core
that may be sequential microprocessors or concurrent ASICs, and “{ }” for the process
with a sequence of macro-executors. Assume that ME], ME;, ME; and ME4 are four
macro-executors running on the same or different PBS, and each one has sc(t,-, u;),(i =
l,2,3,4) of execution time. Assume these four macro-executors are independent and can
only be implemented on one core. Two different timing properties are modeled according
to different core architectures.

(1) These four macro-operations execute sequentially on one microprocessor with the
order of M0,—>M02->M03.>M04, modeling their execution is expressed by:
[{M01;Wait(5t(tls 141)); M02; wait(st(t2, u2)); M03; wait(st(t3, 143)); M04; wait(St(t4, u4))}l
where st(t,-,u.-) = sc_time(t,-, u.) and it describes the multi-granular time model

(2) M0,, M02 and M03, M04 form two independent processes with each consisting of
two sequential macro-executors running on the ASIC core supporting concurrency. The
description is:

”MOI; wait(st(tz. no); M02; wait(st(t2. WW II {M03; wait(st(t3, us»; M04; wait(st(t4.
M)”

where II represents the concurrency relationship between two processes.

In these two schemes, the total execution time is evaluated as:

4
TI =2sc_time(t,.,u,-) and

i=1
2 4

T2 = max[z sc _ time(t,. , u, ), 2 SC __ time(t,. , u, )] (2.1)

i=1 i=3

When performing simulation by invoking SystemC kernel, the global time stamp will
be assigned to T1 and T2 for (1) and (2) after these four macro-executors ﬁnish executing.
From this way, both sequential and concurrent system’s timing properties can be
emulated by invoking the wait( ) delay functions.

The sc_time argument for wait( ) delay function is a crucial parameter when
implementing the multi—granular time model, and it can be statically predetermined
beforehand or dynamically computed at run-time. The ﬁrst scheme is often applied when
generating a periodical time event sequence to mimic a cycle—accurate clock, while the
latter one is used in the simulation of PE or OCA where the tinting granularity varies
along with current system status such as Dynamic Voltage and Frequency Scaling
(DVFS), which will be covered in future sections.

2.2. Object-Oriented Uniform PE Simulation Framework

The increasing diversity and complexity of PBS ranging from hardwired ASIC cores to
fully programmable microprocessors call for a more efﬁcient and modular way to
simulate them in a multi-core SOC context during early exploration stages. Although
cycle-accurate models offer good simulation accuracy, their ad-hoc way that a distinct
model corresponds to each architecture platform makes the application simulation that
involves both computations Of multiple models and their communications very slow and
prone to errors. For example, the well-known micro-architecture ISS SimpleScalar[lO]
takes ~5 minutes to run a JPEG encoder algorithm on a 64x64 input image. Even worse,
an VHDL simulator[44] spends >15 minutes to simulate the JPEG encoder core for the

same input. In the system-level SOC exploration, it is no longer efficient to use cycle-

26

accurate models for simulating real applications running on multi-core architectures. This
factor motives us to raise the level of abstraction from the architecture level to system-
level to model PEs more efﬁciently in our framework.

2.2. l. Macro-operations and Their Characterization

In early design stages, the functionality of SOC is documented as system speciﬁcation
and modeled in high-level languages such as SystemC/C++ regardless of its target
architecture. This type of executable speciﬁcation can be employed to verify the system
functionality coarsely since it only contains architecture-independent functional
information. Annotating the multi-granular time model to the executable spec makes it
possible to account for timing information when executing the functional code. Due to
the ﬂexibility of high-level programming language, each macro-executor may consist of
multiple basic operations. To modularize the timing annotation, it is beneﬁcial to ﬁnd a
minimum set of operations by which any macro-executor could be constructed.

In our work a comprehensive set of C operations is identiﬁed and characterized for
timing information. Since PEs target at signal and media processing applications, source-
level analysis on benchmarks[45] is taken by hand or proﬁling too]s[46] to track the
occurrence of each basic Operation in the code. Then the most frequent operations are
identiﬁed and categorized into ﬁve groups containing 2] atomic operations:

(1) Arithmetic operations such as: a = b + c, a = c >> 2;

(2) Loop operations, such as for and while loops

(3) Branch operations, such as if...then...else structures

(4) Function calls with different number of parameters and type of return values

(5) Memory access Operations such as a = *p, where p is a data pointer to memory
address

Any macro-executor can be described by the combination of these atomic operations,
and each operation is called a macro-operation. These 21 macro-operations form the

macro-operation set in the system-level mode]. For example, consider macro-executor

27

ME: c = a * b + (*d), there are three macro-operations 0P_ALU_2, 0P_MUL-2 and
0P_MEM_I. In this proposal, it is described as ME = (0P_ALU_2, 0P_MUL_2,
0P_MEM_I). Notice that the execution order of these macro-operations is not deﬁned

yet in this description.

 

 

LOAD @0x200, Rl CLR RASH
LOAD @0x202, R2 CLR RASL
DMUL R1, R2, R3 MOV &0x200, RM
MOV R3, @0x204 MOV &0x202. RM]
NOP
MOV RASL, @Ox204

 

 

 

 

 

 

Figure 2.3(a) OP_MUL_2 with MUL instruction (b) 0P_MUL_2 without MUL instruction

Assume that each macro-operation can be implemented on both microprocessor and
ALU-type ASIC core. In the macro-executor shown before, each macro-operation may
have different implementations resulting in different timing properties after architecture
mapping. Figure 2.3 shows the implementation Of macro-Operation 0P_MUL_2(two
operand multiplication) on two microprocessor architectures. (a) shows that it needs only
four instructions to ﬁnish 0P_MUL_2 including memory access in a microprocessor
with multiplication instructions, while (b) shows that it needs six instructions to execute
the same macro-operation on the processor without hardware MAC data path (such as
MSP430 that treats multiplier as a peripheral).

Different PE architecture has different timing properties in its macro-operation set, and
timing characterization is performed to extract timing information for them. Benchmark
programs with small segments Of C code are created by repeating the same type of
macro-operation for a pre-deﬁned number Of iterations. Using RTL- or microarchitecture-
level simulator such as SimpleScalar to run the compiled benchmark and generate two
parameters: sim_num_inst and sim_CPle, which represent the number of simulated
instructions and average CPI (cycles per instruction). The total execution cycle number of

the benchmark can be computed from:

28

Cycle benchmark 2 sim _ num _ inst X sim _ CPI (22)
Each benchmark is encapsulated in a main( ) function, which initializes the run-time
environment and allocates stacks and consumes clock cycles as well. To calculate the net
cycle count for each iteration Of macro-operation, the main( ) initialization overhead,
which is computed by simulating another code containing only a main( ) body, is
removed. Then the macro—Operation cycle number is calculated as (2.3) shows:

_ CyClebenchmark — CyCIemain
operation _ ' N (2.3)

 

Cycle

where N is the number of iterations a macro-operation in the benchmark takes in

simulation and Cyclemm-n is main( )’s initialization clock cycles expressed by:

Cycl

emain

= sim_num_inst

mm X sim_CPImam (2.4)

Ideally, each macro-operation’s cycle number is linear and passes through the origin
point in terms of number of iterations in the benchmark. In this step, each macro-
operation is simulated by different number of iterations from 10000 to 20000 with the
step of 100 iterations. Cycle data from (3.2) is approximately linear to the iteration
number with ~0.0l% average non-linearity and a non-zero offset, which originates from
the inherent Operation mechanism in embedded microprocessor with cache. At the start of
program execution, the cache is empty and the miss rate is nearly 100% and brings
additional clock penalties to ﬁll the cache from external memory.

Linear regression is employed to minimize the mean square error on the pre-computed
data to calculate each macro—operation’s approximate cycle number statistically. Each
macro-operation corresponds to a linear function, whose parameters are used to create the
model library. For example, cycle function for macro-operation 0P_MUL_2 is: 1.8n +
13.9, where n is the number of iterations macro-operation 0P_MUL_2 takes on a MIPS
embedded microprocessor. The ﬁrst constant is average cycle number per macro-
Operation and the latter one is offset value.

‘ The characterized cycle functions for all operations from the macro-operation set form

29

the macro-timing library. Combine the cycle information of each macro-operation and the
operating frequency, any macro-executor’s execution time can be estimated by
multiplying the cycle number with the clock period. For each target PE architecture, a
dedicated characterization process should be performed to get its macro-timing library.
2.2.2. Simulate Tasks by Running Macro-operations

Executing a macro—executor involves executing the functional code and its macro-
operations’ time model. For macro-executor consisting of multiple macro-operations,
timing annotation is implemented by providing the sum of these macro-operations’ time
model to the wait( ) method for timing emulation. If a macro-executor ME contains N
different macro-operations with parameters of st(t,-, u,-)(1<=i<=N), its execution could be

expressed as:

N
[ ME; wait Zst(t,,u,) ]
i=1
By executing wait( ) function with the argument Of summed time model, the timing effect
of these macro-operations will be reflected by the global simulated time on the time axis.
A task contains a set of macro-executors and each task can only be executed on the
same core at one time. We also assume that each task is non-preemptive in the
framework. Intuitively, a task executes directly on the host machine given the input and
generates the output target-independently. Annotating time model to each macro-executor
in a task provides the timing information related to the target core architecture whose
details are hidden at early design stages. Depending on different annotation methods, the
same task may have different simulation models. Figure 2.4(a) shows a computational
task containing M macro-executors, each of which consists of N,- macro-Operations. The
discrete annotation method annotates the code with the time model for each macro-

executor separately as follows:

30

Nl N2 NM
[ SI;waiI[Zst(t,,u,)];Sz;wait 2st(t,,u,)];...;SM;wait[Zst(t,,u,)] ]
i=1

i=| [=1

(2.5)
while the grouped annotation method annotates all the macro-executors when the task

ﬁnishes:

M N1
[ 51:52:...;SM;wait ZZst(t,,u,) ] (2.6)

j=i i=1

N, N: NM
wait[ st(t,. , u, )] wait(}: st(t, , u, )] Wdit[z “(1,- , u, )]
Task

...... ..z

 

 

 

 

l I ...... I ‘ .
E T 51 ETsz' ETSM ' time

(3) Discrete time annotation for 3 macro-operations

wait[§: i st(t,. , u .- )]

j=i i=1

------

 

 

 

l u .
ETS'w' time
(b) Grouped time annotation
Figure 2.4 Different time annotations on 3 macro-operations

The difference between the discrete and grouped time annotation methods resides in

the number of time events generated during the simulation. Although both models ﬁnish

31

M N j
execution at the same simulated time of 22st(t,,u,) and have the same functionality,

j=i i=1

the model from discrete time annotation in (2.5) generates a time event after each macro—
executor commits execution to notify other concurrent processes shown in Figure 2.4(a),
while the model from grouped time annotation in (2.6) issues the time event only when
the whole task ﬁnishes as described in Figure 2.4(b). Applying these properties, discrete
timing annotation is used in the context where synchronization needs to be performed
with other tasks inside the task while the grouped one views the task as a black box
assuming that no synchronization with other tasks is required during its execution. And
the simulation time for the grouped annotation is faster than the discrete one because of
less synchronization time events.

2.2.3. Object—Oriented Application and Task Graph Model

Before detailing the retargetable PE simulation model, we present the C++ Object-
oriented application model on which the PE model is built. The motivation Of creating the
application model is to provide an encapsulation of the application, which essentially is
just a set of C++ code and contains only functionality information, to create a
programming interface by which the PE model is easy to handle regardless of the
application code.

The application model is graphically shown in Figure 2.5, which consists of two
abstract C++ classes: TaskType and AppModel. TaskType abstracts all the required task
information including the following ﬁelds:
®mSk_name, which speciﬁes the task name, and any two tasks in an application should
have different task__name ﬁelds.
®task_platfonn, which is modeled by the pointer to the object of PE class, which will be
described later. This ﬁeld speciﬁes the target PE on which this task executes, and it also
implicitly links to the target PE’s macro—library for timing/energy simulation.

®task_start_time, which models the start time of this task. This is modeled by SystemC

32

sc_time data type, and can be derived to cycle count given the operating frequency.
@task_end_time, which models the end time Of this task. This is also modeled by
SystemC sc_time data type. The task execution time can be calculated by task_end_time
— task_start_time.

@task_status, which indicates this task’s status, which may be IDLE, RUN, or DONE.
This ﬁeld provides PE model this task’s run-time information.

@task_handler, which is a pointer pointing to the starting address of a function, which
models this task’s functionality that can be modiﬁed at run-time by just changing its
value. The C++ callback function takes task_handler as its entry to start execution on this
task.

@pred_task_num, which speciﬁes the number of tasks that are required to be completed
before this task. This member works with the next one together to handle the task
dependency in a multi-task application.

.pred_tasks, which is an array of pointers pointing recursively to the TaskType object. It
speciﬁes the tasks that should be executed before this task, and each element in the

pred_task array has the same structure of Q) to @.

 

TaskType

string task_name;

ProcEIem *task_platform;

sc_time task_start_time;

sc_time task_end_time;
STATUS task__status;

void (* task_handler) (ProcEIem *);
int pred_task_num;
TaskType *pred_tasks[MAX_TASK_NUM]; AppModeI

TaskType *m_task[MAX_ TA SK_NUM];
ApplicationMode/(T A SK_ TYPE
*m_taskﬂ, unsigned int);
~ApplicationMode/0;

 

 

 

 

 

 

 

 

 

 

 

Figure 2.5 Task and application model represented by C++ class
AppModel class abstracts the application to the task-level by associating multiple
TaskType objects in its instantiation. As what’s shown in Figure 2.5, m_task array in

AppModel contains all the task instances for this application, and the task dependency is

33

implicitly presented by each task’s prcd__tasks data member recursively. A task T.- can be
scheduled for execution if and only if the following condition meets: Each element in Ti’s
pred_tasks array has the status of DONE. This condition indicates that T,-’s all precedent
tasks in the application task graph have ﬁnished so that T ,~ is ready to start.

Given an application S partitioned into N tasks {T1, T2,..., TN}, a Task Graph (TG) Gs is
constructed to represent the task dependency as what Figure 2.6(a) shows. In our research
each task object is created by initializing its task_handler ﬁeld, and the pred_tasks array
will be initialized for each task object according to the task dependency in TO The
AppModel Object will be instantiated to include all these sub-tasks at the top-level to
create the encapsulated abstract application model. Figure 2.6(b) presents the task
dependency model for all the 6 tasks in application represented by the task graph shown
in Figure 2.6(a). Utilizing this proposed object-oriented application model, any
application containing multiple tasks with arbitrary task dependencies could be easily
modeled by a formal method without any ambiguity since each task’s information is
explicitly encapsulated into the class object. This method also provides an application’s
executable formal representation for the PE model. In the case that two tasks mapped
onto different PEs have data communication, a buffer may be allocated for each task

Object to temporarily store the communicating data.

Application

/

/ ' l \\
100us K task_a ., task_c ,: 220us

\

 
  
   
    
   
   
 

/ \ /.
\3/ \‘\~_..p’
///w \\ ”T‘“\ \\

200us task_b ) task_e /=19us

\

\/'

R, if-

,.—-

\ / ‘\

150 ’/taskmd\ / t k f \
usr _ i’ as « 40ns
k j \\ — ,.///

..— ’a/

Figure 2.6(a) Task graph with 6 tasks

34

Task object instantiation

m _task[a]= new TaskTypeC‘Task a" NULL T _ZERO T _,ZERO function _:a)
m _-task[b] — new TaskTypeC‘Taskb”, NULL, T __,ZERO T _ZERO, function _;b)
m _task[c]= new TaskTypeC’Taskc", NULL T _ZERO, T Z_ERO, function __;c)
m_task[d]= newTaskTypeC‘Task d". NULL, T_ZERO, T__ZERO, function_d);
m_task[e] = new TaskTypeC'T ask e". NULL, T_ZERO, T__ZERO, function_e);
\m_task[f] = new TaskTypeC'Task 1", NULL, T_ZERO, T_ZERO, function_f);/

    
 

 

Task dependency initialization

 

l/No task before task a \
m_task[a]->pred_task_num = 0:

HT ask a and c before task b
m_task[b]->pred_task_num = 2;
m_task[b]->pred_tasks[0] = m_task[a]; //task a
m_task[b]->pred_tasks[1] = m_task[c]; l/task c
//No task before task c
m_task[c]->pred_task_num = 0;

[fr ask e before task d
m_task[d]~>pred_task_num = 1;
m_task[d]->pred_tasks[0] = m_task[e]: ”task e
In ask a before task e
m_task[e]->pred_task_num = 1;
m_task[e]—>pred_tasks[0] = m_task[a]; ”task a
[IT ask b and e before taskf

m _-task[t] >pred_ task_ num= 2;

m _task[f]- >pred _=tasks[0] m _task[b]; l/task b
@_wsk[ﬂ- >pred_tasks[1]- — m_task[e]; I/task e j

 

 

 

Figure 2.6(b) Application’s task dependency model
2.2.4. Object-Oriented PE Simulation Model

The PE simulation framework is developed to create a uniform environment for
modeling both microprocessors and hardwired ASIC cores efﬁciently and flexibly. It also
provides a simple scheduling engine as a timed OS model for processors running multiple
tasks. Object-oriented modeling methodology is applied by abstracting the main
attributes of the target PBS to create a homogeneous simulation model for different
architectures.

The class diagram Of the proposed PE modeling framework in shown in Figure 2.7
where all PEs are abstracted into the base class PEL containing the architecture
independent parameters such as clock frequency (in IP based SOC, one operating clock
frequency may correspond to several core choices), memory size (assume each PE has its

own local memory), and timing/energy library. Two classes CPUL and ASICL inherit

35

from PEL and each of them contains some high-level information regarding
microprocessors and ASICs respectively. The main parameters for CPUL are architecture,
the application model object and its scheduling policy, and the ASICL is conﬁgured by its
task list and parallel degree, which deﬁnes the maximum number of parallel processes.
Each PE is instantiated as an Object Of CPUL or ASICL class by their constructors to
specify the architecture. For example, an IBM PowerPC and a JPEG encoder core could
be instantiated by the following C++ code in the SystemC environment:

m _pe[0] = new CPUL( "PowerPC405" PPC405, 50, TRUE);

m_pe[1] = new ASICL( "JPEGENC", JPEGASIC, 200, 4, TRUE);
where the ﬁrst one initiates a 50MHz IBM PowerPC405 rrricroprocessor core, and the

second one creates a 200MHz JPEG encoder core with 4 parallel executing paths.

 

 

 

 

 

 

CPUL ASICL
type, type.
(void *) schedu/e( ) Int parallel_ degree
File 'tasklist; ------ ﬁle “task/Ist;
CPU”): . ,ASICL();
~CPULU’ ~ASICL()‘
void Addtask(unsigned char, ...... ’
void (*func)(CPUL 7); ------

 

 

 

 

 

 

PEL

Frequency, latency_LUT
Mem_size, energy_LUT

PEN );

~PEL( );

SetFrequencyﬁnt freq);

Int getFrequency( );

void init_latency_LUT(oonst char ');
void init enemy LUT(const char *);

 

 

 

 

 

Figure 2.7 Class diagram of PE model framework
After task mapping, different cores may have different task dependency features, which
need to be resolved for execution. We take different PEs into two categories:
microprocessor and ASIC abstracted by CPUL and ASICL classes. We also assume that
microprocessor executes only one task at one time, while ASIC targets computation—

intensive applications and performs concurrent task executions through multiple

36

executing paths. The uniform cycle-approximate PE simulation framework based on Task
Execution Engine (TEE) applies to both microprocessor and ASIC cores. First, this
simulation framework is a high-level behavioral model and it performs multi-granular
timed simulation by executing system-level speciﬁcations based on SystemC with
annotated time model without compiling them to architecture-dependent codes. Second,
the task executions are synchronized by their annotated timing models of macro-
operations and feature cycle-approximate without being driven by ﬁxed-cycle clocks. As
what’s discussed before, the execution time of macro-Operation M0,- is modeled by

SystemC wait( ) function call with its argument calculated by:
t(M0, ) = sc _ time( f ‘_' - cycle _ numlMO, D (2.7)

where f and cycle_num[M0,-] represent the operating frequency and macro-operation
M058 cycle number respectively. The ﬁrst term f in (2.7) is initialized when instantiating
the PE object or conﬁgured by the top-level model during run-time when dynamic
voltage and frequency scaling (DVFS) is applied, and the second one is acquired from
table lookup given the macro-timing library, which is also setup when instantiating a PE
module.

PE model consists of two main modules: Task Scheduling Engine (TSE) and Task
Execution Engine (TEE). Different PE categories correspond to different TSE/TEE
model architectures by their executing concurrency properties. Figure 2.8 shows the
generic architecture of the TSE/TEE-based PE model represented by the ProcElem
SystemC class. This PE abstraction model contains one TSE and one TEE, which are
modeled by TaskScheduler and process__l process shown in Figure 2.8, respectively.
Each PE can only have one TSE and one or multiple TEES depending on whether this PE
is single-thread microprocessor or multi-threaded hardware ASIC. TaskScheduler is
registered to SystemC kernel as a SC_METHOD process and it performs the task status
checking and scheduling, which doesn’t need to be suspended during simulation. On the

other hand, process__1 is registered to SystemC kernel as SC_THREAD process because

37

it needs to invoke the C++ callback function pointing to the application task code that
contains explicit wait( ) statement to model task timing directly. When this thread process
is suspended during evaluating timing information, TaskScheduler can still be triggered

to monitor the current task execution status.

ProcElem

 

 

 

\

TEE _task __assigned_ event

cW—gTaskSchedulerO L ———————— prooess_1

current_ fask_ done_ event
<— ————————— ‘

TEE _busy

‘\ I

 

 

 

 

 

 

 

 

 

 

ApplicationModel

6 ’ Tabsk Task
d/
E\i Task lllll Task

TASK TYPE

 

q TSE accesses task
q TEE updates task
% PE Is mapped with task

    

   

Figure 2.8 Architecture of TSE/TEE PE model (with application/task models)

TSE(TaskScheduler) and TEE(pr0cess__1) communicate via dedicated events and
signals. TEE_task_assign_event is issued by TSE when it decides that a task’s
dependence is resolved and ready to execute according to the application task graph and
TEE status. TEE will not start executing the task until it receives TEE_task_assign_event
event and it executes with the causal time model. TEE provides another event
current_task_done_event to TSE along with a signal TEE_busy showing whether TEE is
currently executing a task. The ﬁrst event is used to trigger TSE to start scheduling the

next task asynchronously regardless of the clock signal, which can be used for time-out

controlling purposes, while TEE_busy signal is to control the synchronous TSE
scheduling by running scheduler if and only if TEE_busy is ‘0’ representing no executing
task in TEE on the TSE clock rising edge. TSE will pick up and pass the scheduled task
for execution to TEE from the task _pool array, which stores all tasks that are mapped
onto this PE. Each task element in the task array is a copy of task Object in the application
task graph model shown in Figure 2.8. When an application is partitioned, each PE
object’s task _pool will be initialized by those mapped tasks by a separate system
controller.
2.2.4.1. Sequential M icr0pr0cessor Model

Most embedded microprocessors feature sequential program execution without
considering the advanced factors such as VLIW [47], hyper-threading[48], etc. Figure 2.9
depicts the TSE/TEE model for a microprocessor with one execution unit and this ﬁgure
is derived from Figure 2.7 by adding some details. Assume a sub-graph Gs, from TG in
Figure 2.9 containing 3 tasks {T1, T2, T3) is mapped onto a microprocessor, TSE ﬁrst
assigns these tasks by invoking AddTask( ) method, which registers each task entry with
the microprocessor. Then TSE calculates a schedule from which the processor executes
only one task at a time and meets task’s deadline requirement by the Scheduler Function
(SF). Each task is characterized by its timing attributes such as starting time, deadline and
period, and SF evaluates them dynamically during task execution and calculates the next
task that will be scheduled according to predefined scheduling policies. Based on the
scheduling result, TSE allocates the processor resource to the scheduled task once the
current executing task ﬁnishes, and starts scheduling the unscheduled tasks in the sub-
graph 05,. TSE loads the scheduled task to TEE for execution by L0adTask( ) function
call and gets the current executing task from TEE by GetCurrentExeTask( ). Depending
on different scheduling policies such as maximizing throughput or minimizing energy, a
TSE may have multiple pre-deﬁned SF prototypes to be loaded to its scheduler. More
signiﬁcantly TSE can change its scheduler’s SF during run-time by invoking

39

SetTSEScheduler( ) method and this feature provides the potential to model dynamic run-

time scheduling policies.

Scheduled task in 631
T6 for application 8 Partitioned tasks in 631 mappgd_o_p_ppo_cessor
, - ., \

,2aai
l

I"

(3' T1 \

 

H.

~.-I..-">

Y

./ ﬂ ‘ l"\‘
I T3 ,1
_-/’

\._,

l

T2)

//
:1
\ /
\
'-l
N!
\/
,—————-—————-.
/\
11:
\f/
I
El
\

:/\

f \

l—li

10"]
(IV?

8: .

I .

I ' .
I f
I \’
I -
\
\

\

 

 

 

  
  

 

 

 

TSE TEE
r ‘ r 1
[AddTask(TASKL 'taskﬂ
M t. . Task_staIt—U
acro- ImIng /___~__ﬁ1 LoadTask(TASKL
library \ “W“ [ 'task) ]

 

 

/__la!5":£‘l°_,\l>
{GetCurrentExeTasld )]\m i — m SC_THREAD( )

o—Taskjlnish

 

 

 

 

_._ _._.-__.___,., SetTSEScheduler((voi
d ')sch )

 

 

 

 

 

[Schedule_oompute( )]
L J L J

Figure 2.9 Microprocessor’s TSE-TEE

 

 

 

 

 

SystemC sc_thread process provides a run-time environment for its encapsulated tasks
and a sc_thread body triggered by the effective scheduled task implements the main part
of TEE for microprocessor. Since the macro-executors in one task execute sequentially
and so does each task, the sequential execution is guaranteed for the microprocessor
model. Each task’s execution is triggered by the TEE_task_assign_event event issued
from the TSE after the task is loaded into TEE by TSE. During task execution, the
annotated time model will be invoked by wait( ) function calls, thus generate time events
for the whole system’s synchronization if there exist multiple cores. Once a tasks ﬁnishes
execution, a current_task_done_event event will be generated and notify the TSE to
perform scheduling for its unhandled tasks. When all the mapped tasks are completed,
another all_task_done event will be issued by TEE to trigger TSE to rerun the scheduler

after it resets these tasks’ status information recorded during scheduling.

40

2. 2.4.2. Parallel Hardware Accelerator (ASIC) Model

We model ASIC cores that are employed to exploit the parallelism among tasks for
computation-intensive applications. Described as Figure 2.10(a), one of the most
signiﬁcant features of this type of applications is that part of its task graph can be
partitioned into multiple concurrent executing threads. For example, JPEG encoder
application can be partitioned into 4 homogeneous computation ﬂows with each one
performing the same tasks {DCT, Quant, Zigzag, Huffman} sequentially on each color
component with the downsampling ratio of 21121. This motivates us to use multiple
processing resources to handle these tasks concurrently to accelerate the computation.

Reflected on the TEE model, these executing paths are implemented by multiple

sc_thread processes, each of which executes a sequence of tasks as shown in Figure

start

2. lO(b).

 

  

/

func1 i func1” --- func1ig
./ x - i : / T1" \)
T1 \Tl \‘ _,-/’
funczr_ func2 E11: funcZ i‘\
‘\ . ‘\; T2" )
(1.2/j \ T2 ’ <\._ _,/"

func3 i ““103 func§,,[\
T3 \._ <T3’ T3" 3
K _//j \‘\x _.//

    

end

 

(a) Task graph with concurrency

41

Partitioned tasks in

 

/" _- T“ '” ..\
I3...) (I39 03,.)
fund; fungi; funds-lg
(T4) <1 T42)- Q41)
funds)“ func§’_l_g_\ func§l\
(,1 rs) U51) (35",)

 
 

~_————.——————_J

Macro-timing

library \1 """"""""

f——_—‘_—\
- roc__1_start—# SC_THREAD( )

. roc_2_sta It—o

_Taskﬁfo.:.11\ SC_THREAD()

\l ,,,,,

          
    
   

AddTask(TASKL ‘taskl
GetCurrentExeTask( )
LoadTaskUASKL
‘task )

  
 

SC_THREAD( )

h————J

 

 

(b) ASIC’s TSE/TEE model
Figure 2.10 Application with concurrent tasks and the corresponding TSE/TEE model

Since each ASIC is dedicated to one speciﬁc application as its name shows, its
TSE/TEE model does not have the dynamic scheduling ﬂexibility as microprocessor
model has, and this feature also conforms to the (Synchronous Data Flow) SDF MOC
(Model of Computation) that these applications pertain to. Described in Figure 2.10(b),
each ASIC model is conﬁgured by its parallelism degree that is the number of concurrent
executing ﬂows, and each ﬂow is mapped onto one sc_thread process, which contains the
sequence of tasks. Although using the same name, TSE of ASIC only performs task
assignment by LoadTask( ) method that takes the arguments of task entry for each
executing ﬂow during core initialization after application partitioning. It also sends the
start signal to TEE to trigger task execution. Each executing ﬂow feedbacks a ﬂow _ﬁnish

event to TSE once its tasks complete. If these computing ﬂows have different execution

time, TSE will handle their synchronization by these events before the next iteration of
execution.

TEE consists of multiple sc_thread processes with each one providing run-time
environment for the execution of mapped tasks. Similar to the TEE in micrOprocessor
model, each sc_thread executes its tasks and evaluates timing sequentially. On the
contrary, multiple sc_rhread processes handle the time evaluation and advancement from
the task’s annotated time model independently, thus these sc_thread processes mimic the
concurrent properties residing in these execution ﬂows. Using the task graph in Figure
2.10(a) as an example and assume each task has the execution time of t,-, the simulated

time after running this task graph for one iteration will be evaluated by:

Texe Zmax Zti’zti’ - ti. (2.8)

Compared with the simulated time of executing them sequentially on microprocessor

3 3 3
'_ . ..
Texe - Zti + Zti + 2 ti (2.9)
i=1 i=1 i=1
The speedup using hardware ASIC accelerator can be estimated by:

Speedup = (it, + it; +itg'J/max[izi,izg,iz;] = 3
i=1 i=1 i=1 i=| i=l '=l
3 3 . 3 n
if Zti =Z’i :24“ (2.10)

2.2.4.3. Case Study of an Application Mapped onto Microprocessor Model

The TSE/TEE microprocessor and ASIC models have been fully veriﬁed with
applications containing arbitrary number of tasks and their dependencies. This section
presents one case study with a synthetic application containing 6 tasks having the
dependency graph shown in Figure 2.6(a). The execution time of each task is pre-

calculated as

43

Ta = 100115, Th 2 200115, 7}- : 220us, Td = [50:13, T, = l 9115, T]: 40ns

Table 2.1. Task mapping scheme

 

 

 

 

tasks mapped on ARM tasks mapped on PowerPC
Mapping l a, b, d c, e, f
Mapping 2 a, b. d, e, f c
Mapping 3 a, b, c, d, e, f

 

 

 

 

 

Each task has the delayed function call with the argument of its pre-calculated
execution time for veriﬁcation purpose. Actually this time is calculated by evaluating
each macro-operation in the task given the target-dependent timing library. The target
SoC architecture contains two PEs, ARM and PowerPC (PPC). And the inter-PE
communication is not modeled yet here. Three different task mapping schemes are
considered as table 2.] shows. Table 2.2 presents the simulated task ﬁnish time

Table 2.2. Simulated task ﬁnish time
a b c d e f
Mapping l 100.002us 420.002us 220us 570.002 239us 420.042us
Mapping 2 100.002us 469.002us 220us 269.002us 119.002us 469.042us
Mapping 3 100.002us 520.002us 300.002us 689.002us 539.002us 689.042us

 

 

 

 

 

 

 

 

 

 

 

 

 

And it matches the manual calculation results. Also table 2.2 clearly shows that
different task mapping on the same application to the same set of PEs will result in
different performance characterized by the application’s ﬁnish time. The proposed
TSE/TEE model also provides a platform to evaluate different scheduling policies for
applications on multi-core platforms by fast system-level simulation.

2.2.5. Retargetable Simulation Framework for Heterogeneous PE Cores

Another issue resolved by the proposed PE model is the retargetability in
heterogeneous cores. Retargetability reﬂects the ﬂexibility of performing task
mapping/remapping and simulation among different PE architectures. First we present the
retargetability issue by a case study of MPSOC architecture containing two

microprocessor cores shown in Figure 2.] 1(3). Two RISC processors, (A)PowerPC405

44

and (B)MicroBlaze, are allocated to execute different tasks partitioned from application.
Two tasks fdct( ) and idct( ) are initially mapped as A:[[fdct( )}] and B:[{idct( )}]
respectively where X:[{f,( )},{fz( )}] represents to map tasks f1( ) and f2( ) on core X. To
change the mapping to: A:[{idct( )}], B:[{fdct( )}] and perform simulation, the
conventional method recompiles fdct( ) and idct( ) using MicroBlaze and PowerPC
compilers and run their ISSes again as shown in Figure 2.ll(a). Obviously it is ad-hoc

and will be time-consuming when task and PE number increase.

llll’ PPC405 |II¢ PPC405
Compiler ISS

Task1: fdct
fdct(arg list)

 

 

 

 

 

 

 

 

 

 

 

// Function code

 

 

 

 

 

Recompile
Task2: idct Recompile
idct(arg list) _ _
MicroBlaze MicroBlaze
# Compiler # ISS

 

 

 

 

 

 

 

 

// Function code

 

§m__PPC->AddTask(1, &ldct); Task remapping

 

 

F

l lnit PPC » m_FPC—>AddTask(1,&fdct)

Task1: fdct ‘

Wcuarg list)
' (SystemC TEE> « Pigggc

 

 

 

 

 

 

 

{

 

 

 

 

ll Function code J

 

Task2: idct
Fidct(arg list)

{ <SystemC TEE> « Mlﬁgﬁgze

// Function code

 

 

 

 

 

 

 

 

 

 

'"it » §m_MB—>Addrask(1, aid'éiigw'i
MicroBlaze ‘ ..........

 

 

 

 

 

im_MB->AddTask(1[Sf-dot); Task remapping
(b)

Figure 2.11 Task re-mapping between two processor cores. (a)Ad-hoc task re-mapping with
dedicated ISSes (b) Retargetable task re-mapping based on macro-model and TSEJTEE

 

 

45

To mitigate the constraint of relying on different cross-compiling environments for
different PE architectures, our proposed PE model employs a uniform architecture-
independent set of macro-operations as discussed in 2.2.1 to characterize the tinting
information. Each PE architecture has a lookup table (LUT) with the same entries of
these macro-operations’ names, and the same macro-operation on different cores features
different execution time characterized from low-level simulations. These global LUTs
constitute the system-level timing library, which is invoked by the task code. Each PE
object has its local LUT, which is initialized by copying the corresponding global LUT
from the library according to its architecture properly when executing its class
constructor. For example, when instantiating a PowerPC core using core_1 = new PEL
("PowerPC_1 PPC405, 100, .....), it creates a PE object c0re_1 named “PowerPC_1”
with the architecture of PPC405 (IBM PowerPC405) and the base operating frequency of
100MHz, then its constructor invokes initiare_LatencyLUT( ) method to copy the
PowerPC405’s LUT information in the library to core_1’s local LUT data structure as
shown in Figure 2.11(b). This step essentially links a PE architecture with its timing
information.

Given a task T = task junc(argu 1, argun) where argu; is the task’s ith parameter, its
mapping onto PE is simply modeled by concatenating the PE object core to its argument
list as well as annotating the macro-executors inside the task with the macro-operation
time model as Figure 2.11(a) shows. After instantiating a PE core_1 as PowerPC405, T
will be associated with core_1 ’3 local LUT after coreJ calls AddTask( ) member function
to allocate resource for this task, and each annotated macro-operation M0 in T will
access core_1’s LatencyLUT to evaluate its execution time by table lookup such as him =
LatencyLUTIMO]. After passing the appropriate execution time data to the wait( )
method after each macro-executor ﬁnishes, the tinting property of executing this macro-
executor on PowerPC 405 core will be emulated. In this way, either architecture-

dependent cross-compiler or ISS is not required to perform simulation. And if there is

46

another core core_2 with different architecture, remapping task _func onto it is
straightforwardly implemented by invoking core_2’s AddTask( ), by which all macro-
operations in T are associated with core_2’s local timing LUT. Figure 2.11(b) describes
the retargetability of TSE-TEE model using the same example as Figure 2.11(a). The only
operation needed for switching the target processors for these two tasks is to exchange
the function pointers of these two cores’ task allocation methods.

Through this task function callback-based method the system-level task remapping can
be simpliﬁed dramatically by just invoking each core’s task allocation method AddTask(
). During simulation, the annotated macro-operations access the mapped LUT and in turn
evaluate the execution time on the target PE without relying on compilers/ISSes.
Provided a task graph with large amount of tasks and several different core architectures,
applying this methodology signiﬁcantly reduces the overhead of building a simulation
model targeting heterogeneous PE cores. In consequence, this uniform simulation
framework enables the fast simulation-based SoC design space exploration, which will be
presented in chapter 5.

2. 2.6. Selective Code Simulation

The overhead brought from the retargetable PE model is the additional access to the
target core object during task execution. If a task contains multiple child functions, each
child function and its possible children need to be modified by adding a virtual PE object
to its argument list to access the internal macro-operations. In addition to fast
architecture-independent task remapping, another promising feature that this simulation
framework possesses is the Selective Code Simulation (SCS) ability, which has not been
proposed in previous work compared with the cross-compilation-based simulation model.

Any real applications contain both functional code implementing the core functionality,
and some nonfunctional code for debugging and ﬁle I/Os. When compiled with cross-
compiler, the whole application including both functional and nonfunctional parts will be

translated into architecture-dependent machine code. As a result, not only will the

47

nonfunctional code induce errors to the timing analysis where most concerns are
concentrated on the functional code, but also reduce the simulation speed because of the
increased machine code. As we know, simulating ﬁle I/O functions on cycle-accurate ISS
is very time-consuming and inefﬁcient. To tackle this issue, designers often tailor the
application using compiler directives to only compile the core functions manually to feed
ISS for a more accurate and faster simulation, and we believe it will become inefﬁcient
with large applications. In the TSE/TEE-based simulation framework, any function
without macro-model annotation linked with PE object will not trigger any time event,
which means that this function ﬁnishes execution without executing the time evaluating
functions. Applying this methodology, designers only need to selectively annotate the
function code that is of interest without changing any part of the nonfunctional code to
get accurate timing information from system-level simulation. SCS signiﬁcantly
increases the ﬂexibility of application simulation and also generates output dumping ﬁle
with reasonable simulation time and accuracy, which is hard to achieve by the cross-
compilation-based ISSes.

2. 2. 7. System-level Modeling of Dynamic Voltage and Frequency Scaling

Currently many modern SoCs with embedded microprocessor cores are becoming
more energy-centric than speed-centric given that the performance constraints are met
since most of them are powered by battery. Because battery technology has not kept pace
with the increasing power demands of such systems, it is important to target power
during the design process. It is also known that the higher level of design hierarchy where
the power is tackled, the higher is the power reduction possible [49].

Dynamic Voltage and Frequency Scaling (DVFS, or DVS in short) is one of the
extensively applied power-reduction technologies in embedded SoCs. In DVS, different
computation tasks are run at different voltages and clock frequencies in order to ﬁll up
the idle periods in the schedule. In this chapter only the system-level DVS model of

microprocessor cores is presented and chapter 5 mainly addresses the task mapping and

48

scheduling for multi-core SoCs when taking DVS into consideration.

Given a circuit fabricated through a certain technology, the delay depends on supply
voltage V44 as follows: L = kx Vdd/(Vdd — V,)2, where k is a constant and V, is the threshold
voltage. Thus, delay increases as Vdd decreases. This formula also applies to
microprocessor when taking L as the minimum operational clock period corresponding to
the maximum working frequency. In our model, each microprocessor is assigned a
normal supply voltage Vddmm, and we deﬁne the DVFS factor Rows as the ratio between
the normal supply voltage and actual supply voltage Vddm. Also based on the fact that the
supply voltage must exceed the threshold voltage to make the digital circuit work
normally, we have:

//
Vdd ,norm //

Vdd .acr / R V

_ k DVFS : k detorm

Vdd / 2 V 2
.norm/ -— V dd.n0rm ’
[ ,/ Rows I) [F_ RDVFS 'V:
DVFS

 

 

(2.1l)

 

Vdd norm
R DVFS E [1’
‘ , and T represents the microprocessor’s clock period.

where

From (2.11), it is observed that the increase of latency is not linear given a linear
reduction of the supply voltage. Use the microprocessor’s clock period to represent the
latency of instructions and macro-operations. Assume the baseline latency T0 with normal
supply voltage Vddmm, the latency increase ratio could be derived from the DVFS factor

by:

 

 

k . Vddmorm //’ 7
Vderorm - V 2
/ [ _ RDVFS v: ] [ dd .norm _ I]
R _ 1 _ V Rovrs _ V,
L To k ' Vddmorm / Vdd norm 2
,// (Vddmorm — Vt )2 ' ///‘/' ‘/_R——
” _ DVFS

VRDVFS

49

(R. —1)'2
= (2.12)

R 2
[ I " _ V Rovrs ]

RDVFS

 

 

_ dd,n0rm

,where R: - ‘V— is constant given a technology node, Rovrs E [1, R, )

t

(2.12) forms the base on which our system-level DVFS model is built. From (2.1),
each macro-operation evaluates the annotated latency by the multiplication of clock
period and the pre-characterized cycle number. This scheme is based on the assumption
of the constant normal supply voltage without DVFS. Applying (2.12) to (2.1), we model
the DVFS effect by considering RL when evaluating each macro-operation’s execution
time by:

t(M0,. )DVFS 2 SC _time(RL - f ‘1 -cycle_num[M0,. ])

- cycle _ num[M0,.]

 

= sc_time (2.13)

 

 

 

for macro-operation M0,.

The DVFS model is applied at the task-level, and every macro-operation inside each
task is assigned with the same supply voltage by calling TSE’s LoadTask( ) method after
it completes scheduling. If two tasks T, and T 2 are scheduled to execute sequentially with
the supply voltages of Vdd; and Vddz on core core_l, its TSE will perform the following
procedure

c0re_I ~> LoadTask(&T1, Vddﬂm/Vd.” );
core_l -> LoadTask(&T2, Vdde/Vddg);
to apply the appropriate voltages for task execution. In this model, when applying a

smaller supply voltage of Vd.i,,mm/RDVF5, the execution time of each macro-operation is

50

evaluated based on (2.13). Only Row-‘5 is updated during this procedure without the
modiﬁcation of any other parts. Application execution on the new voltage and frequency
will be reﬂected by the change of the simulated time evaluated from (2.13). The policy
that decides how to change each task’s supply voltage for each core in a multi-core SOC
is a hot research topic and will not be addressed in this thesis. Compared with the models
presented in [50], our system-level DVFS model eliminates the architecture-level
evaluation, and thus accelerates the real application’s simulation.

2. 3. Experimental Results on Media Applications

To evaluate the proposed PE simulation framework, we run the Mediabench[45]
application benchmark on the standalone microprocessor models by annotating the
macro-Operations. Mediabench targets the computation-intensive signal processing
applications covering audio, image and video processing algorithms. We choose
GSM(audio), (1721(audio), ADPCM(audio), EPIC(image), JPEG(image), and MPEG-
2(video) applications and map them onto the PISA microprocessor core based on MIPS
architecture. Each application has both encoder and decoder sub-applications. After
characterizing PISA’s macro-timing library, we annotate the macro-operations to the
applications’ source code and map them on TSE-TEE PE model. For comparison,
SimpleScalar/Wattch version 1.02 is employed as a cycle—accurate ISS to run these
applications using gcc PISA cross-compiler. These applications are also executed on the
host machine directly without including any architecture information. The simulation
results are shown in table 2.3.

Table 2.3 lists the original code length (number of effective lines) and the code length
after timing macro-model annotation, as well as each application’s run-time on these
three platforms. T ho”, T153 and T73; represent the run-time for each application on the host
machine, the cycle—accurate SimpleScalar/Wattch and the TSE/TEE PE model
respectively. For all these ﬁve applications, executing their functional spec without any

architectural information on the host machine is fastest, which is pretty natural and

51

straightforward. SimpleScalar performs cycle-accurate simulation on the cross-compiled
machine code for each application and its simulation speed is 6659X to 10508X slower
than executing the functional spec resulting from the fact that each micro-architectural
transaction on the processor model is evaluated explicitly for high accuracy. We don’t
perform the RTL-level simulation for these applications due to the unavailability of the
target HDL models and we believe it will be much slower than SimpleScalar. For the
TEE/TSE model, although the simulation time features ~102.11X slower compared with
the functional spec due to the overhead of additional access to PE object’s timing LUT
during execution, it demonstrates a minimum speedup of 68.39X compared with
SimpleScalar on these applications since the proposed model raises the level of
abstraction by evaluating the external features of each macro-operation, which may
consist of multiple instructions. Also for the applications with heavy ﬁle I/Os MPEG-2,
the ISS simulation speed is pretty slow compared with the macro-model due to the
additional machine code to implement the file I/O library on the target microprocessor.

Table 2.3. Simulation time for Mediabench applications

 

 

 

 

 

 

 

 

 

 

 

 

Application mode Original code Annotated code Tho,,(s) T,5,(s) TTSE(s)
length length
ADPCM encoding 168 264 0.08 645 7.391
decoding ] 19 183 0.02 217 2.14
0.721 encoding 522 796 0.32 2298 33.6
decoding 507 765 0.22 1465 20.9
JPEG encoding 4017 5315 0.12 1261 13.8
decoding 3622 4608 0.02 269 2.43
EPIC encoding 2621 4067 0.07 7 17 8.96
decoding 2085 3241 0.02 2 l6 2. 12
MPEG-2 encoding 5469 8523 1.94 >lhr 190.14
decoding 5 159 7977 0.27 2597 32.43

 

 

 

 

 

 

 

 

3. ARCHITECTURALLY PARAMETERIZED NOC SIMULATION MODEL
3.1. Orr-Chip Interconnection: Packet-Switching Network-on-Chip (NOC)

By far, the most common way tO integrate a number of IP cores on a SOC is tO use a
bus, which provides a shared backplane tO connect with different cores. A bus is usually
composed of master interface, slave interface, arbiters and shared backplane. The cores
are connected to the shared backplane through master and slave interfaces. The access to
the backplane is arbitrated through the arbiter. Cores connected to master interfaces
initiate the bus transfer, and the device addressed by the transfer is considered as a slave.
Thanks to its ﬂexibility and easy programming model, the bus is well understood and
applied effectively in numerous SOC designs [51][52].

However, one Of the major drawbacks for the bus-based SOC integration approach is
the scalability issue. The next generation of SOC will be able to integrate more than one
billion transistors on a single chip by the end Of this decade[6] thanks tO the ever-
increasing silicon technology. Thus, futures SoCs need to interconnect tens or hundreds
of cores on the chip. But a bus itself is not scalable since it is implemented by global
metal wires. In addition, as process geometries continue scaling down, the increased wire
capacitance and low signal swing (<1.0V) make signal propagation slower on global
wires in comparison with the gates, which brings a challenge further complicated by a
much larger chip area.

Given the above scenario, it is widely believed that the bus-based approach alone
cannot address the challenges posed by the next generation of on-chip communication
trafﬁc. TO create a more scalable and high-performance interconnection alternative, using
packet-switching on-chip network as the backbone for on-chip inter-module
communication is proposed in [53][54]. Similar to the Off-chip general-purpose network,
a packet-switching interconnection network consists Of routers and links that are
connected by a speciﬁc network topology such as ring, tree, mesh, torus, etc. Cores

communicate with each Other through routers by sending and receiving data in the form

53

Of packet via interconnecting links. The most promising feature the packet-switching
NOC has is that it provides multiple communication channels for different trafﬁcs as
Figure 3.1 shows. Moreover, NOCs with regular topologies are more scalable than bus by
the fact that adding other cores will also add more communication paths, which in turn
increase the system’s total bandwidth. In this work, the OCA model targets packet-
switching NOCs, which is abbreviated to “NOC” in this thesis.

3.1.1 . Network-on-Chip Fundamentals

Before elaborating the NOC model, the basics of NOC are uncovered by presenting a
component-based view, introducing the basic building blocks of a typical NOC. Then we
will extract some NOC architectural features that will be modeled in the simulation
framework.

Figure 3.1 shows a sample SOC with NOC interconnecting architecture structured as a
4-by-4 grid which provides global chip-level communication. Instead of busses and
dedicated point-tO-point links, a more general scheme is adapted, employing a grid Of
routing nodes spread out across the chip, connected by communication links. Currently
we adopt a simpliﬁed perspective in which the NOC contains the following fundamental

components.

_ﬁ

«__— PE
Network Interface
«~— Router

! <—— Link

    
 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.1 Topological illustration Of 4-by-4 NOC-based SOC, indicating fundamental
components

54

--Network Interfaces (NI) implement the interface by which IP cores (also called PE,
shown as the large squares in Figure 3.1) connect to the NOC. They function decoupling
the computation that IP cores perform from communication that NOC handles. With NI,
PEs don’t need to have any information of the communication protocol because N I will
add the NOC-related information to the data from PE according to the network protocol
before injecting them into NOC.
--Routing nodes (Routers) route the data according to the chosen protocols from the
source to the destination and they implement the routing strategy. A router is
characterized by the following 4 parameters: router type, routing algorithm, arbitration
policy, buffering scheme, each of which results in different NOC performance and power
features. One of the distinct features of our work is that the router is modeled in the
modular method by which different routers are constructed by simply conﬁguring each
parameter separately.
--Links connect the routers, providing the raw bandwidth. They may consist Of one or
more logical or physical channels. Tire data transferring from one node to another
physically happens on links by toggling the voltage. In NOC links can be implemented as
single-cycle non-pipelined or multi-cycle pipelined ones. In our work only the non-
pipelined link is modeled and evaluated.
3.1.2. Object-Oriented NoC Simulation Framework

The NOC simulation model targets tile—based regular topologies presented in Figure 3.1
and it is easy tO extend to other irregular interconnecting architectures. The most
frequently addressed topology is a mesh, where M xN routers (switches) are organized as
a 2-D MxN array. Excluding the border ones, each router has four ports, NORTH,
SOUTH, EAST and WEST, to connect with its neighbors, and another LOCAL port will
be connected to the PE, which may be microprocessor, DSP or ASIC. Each router and its
interconnected PE is called a tile and each tile is connected with is neighbors through two

unidirectional Sﬂu-blt width links. By changing the border routers’ connecting properties,

55

other three derived topologies, toms, X-torus and Y—torus are modeled as well in this
simulation framework.

Abstracted from the low-level circuit implementation, NOC is composed by two
functional components: routers and interconnecting links. NOC essentially doesn’t
contain any information of PE, which only performs computation tasks and injects their
data into the network for communication services. To make the framework conform to
the scheme of separating computation and communication, NOC model is PE-
independent and it only handles communication requests. TO verify the NOC
functionality, traffic generator/receiver models are used to emulate PEs by generating
speciﬁc data trafﬁcs. A trafﬁc generator creates packets compatible to the communication
protocol and injects them into NOC to emulate the communication trafﬁc from PE, and
trafﬁc receiver model does the reversed work for evaluating the NoC’s performance and
power consumption. Employing this method PE and NOC models can be developed and
veriﬁed independently and it also easies the integration of them.

Similar to PE, NOC is modeled through the object-oriented methodology with
reconfigurable parameters by SystemC TL-l(transaction—level l). The motivation Of TL-l
modeling is that the pin-accurate information, which needs delta-cycle evaluate/update,
doesn’t need tO be considered in the system-level model. The class diagram of NOC OCA
is shown in Figure 3.2(a) where each class is described by the solid rectangle with hollow
arrow and black diamond lines representing the inheritance and association relations
between classes, respectively. The dotted rectangles are the SystemC primitive classes
used to construct the OCA model. The NOC Object is instantiated from the top
N0C_TL1_0CA class associating two children classes: NoC_TLI_Router_(param) and
NoC_TL1_Fifo_Channel. param is a set Of router architectural parameters characterizing
the router, and different NOC OCA can be constructed by instantiating NOC router with
different architectural parameters. Each router parameter corresponds to a speciﬁc routing

functionality, which will be detailed in section 3.2.

56

 

 

NoC__ TL 1_Router_ (param)

 

 

 

NoC_TL1_OCA ‘ — .335“:
int mDegree x output '"p‘"
int mDegree:y input __sc__ inte—rfa_ce_‘ output

 

 

 

FNOC_TYPE mArch

mAddr
BufferSize
Routing
Tonalogy

 

create_noc_oca()
oonﬁgure_wire_model( )

 

 

1

-———$ sc_module |<l-—
I_ _ _ _. .4

route_compute()
Arbitrate( )
router_xfer( )
route_East( )
route_North( )
route_ West( )

 

 

m_chan

m_ﬁfo<T>

 

 

 

_ .1 _._“
NoC__ TL 1_Fifo_ChanneI

 

Index
transient_ throughput
average_throughput

 

 

 

 

 

SetConnectSts(unsigned int)
m_ﬁfo ' GetChannel()

connected A num_ﬂits
. delaymodel_en
GetConnectedSts( ) “ha""e' read( T8. )
write(wnst T8. )

Cal_link_delay( )

 

 

m_frfo(chanlnfo *)

 

_ $11104:

Figure 3.2(a) SystemC Class diagram of NOC OCA model

-\\
14 191,1 5
// \

l,/"”“\ /‘ \._ ”\
1 12 )4—H‘ 13 u—s

\EJ/ ‘
\ / \\

(874—59
\ \

\r
l
l
/1

/' “\j

/

\

a—ﬁ 10 \ﬁ—h’ 11 )

m_router[i] = new 0CA_Rou!er(i, XY, 64);

new m_ﬁfo<FLIT> (64,1);

m_ router[0[->input/EAST](*m_chan[0]);
m router] I ]->output[ WES T]( *m_chan[ 0]);

f l T \l/
, " ' ‘\ /’ \. /‘ ‘\
1 ’1 )‘ ’1’ 1 l
“1,4 /‘ 5 ,/ “\6 ,/" "\7 forti=0.- i<4*4.-i++)
I f 7 7i ‘1 / ‘\ /'§\ f0r(i=0: K48: i++i
I 1’ 0 [H—H 1 [‘1 _ 2 \K—ﬁ 3 j m_chan[r'].channel =
\\ \\ _// \\\~._/’ I ‘\\_/ \\\«'/ n.
\ \ ‘ 1’ //perform port connection
. \E ) 7
1nput[EAST] output[WEST] .\ / /\,
,/_\\“—o ------------- ﬁ/-~\L \ﬁ
( ,1 m __chan[0] ; 1 l
\\_ —>\ V //
m_router[0] m _router[1]

Figure 3.2(b) NOC OCA instantiation by channel binding

57

 

NoC_TLI_Router__(param} models router by implementing the packet forwarding
functionality through its routing functions implemented as member functions, while
NoC_TLI_Fifo__Channel models the interconnecting links by associating m _ﬁfo class.
which inherits from scjifo SystemC primitive channel. In addition to m _ﬁfo class,
NoC_TLI_Fifo_Channel also contains connectivity information to interact with the upper
NoC_TLI_OCA class for connecting with routers’ ports. NoC_TLI_Router_(param)
inherits from sc_module SystemC class tO provide the functionality encapsulation and
associates with sc_port class to exchange with its external modules. NOC model is
parameterized ”by its topology, routing algorithm, arbitration policy, and buffering
scheme, in which the last three rs are router’s architectural parameters.

A NOC is created by instantiating the NoC_TLI_OCA class Object as shown Figure
3.2(b) where a 4x4 mesh is initialized by the wormhole router with only input buffer,
ﬁxed-priority arbitration policy and XY—deterministic routing algorithm. The OCA
initialization process calls constructors of its associated NoC_TLI_Router and
NOC_TL1_Fifo_Channel classes to create the router and link Objects, and then performs
interconnecting between them according to the topological information. Each port P on
the router R is connected with link L is implemented by binding P with L shown in Figure
3.2(b). If another router R, has the port P, bound with L, these two routers are connected
logically and data can be transmitted between these two entities via L. For a regular M xN
mesh NOC, there are MN router Objects and 2(MN-M-N) link objects instantiated before
simulation.

3.2. System-level Model of Orr-chip Router

Similar tO the general-purpose network, NOC communicates information in the form of
a packet, which integrates a chunk Of data with some additional information (source,
destination, and CRC etc.) and performs routing on this data chunk. In NOC each packet
is transferred and received in the unit of ﬂit, which is the minimum data block transmitted

on wires in the network. A packet is composed of two parts: packet header and packet

58

payload. As shown in Figure 3.3(a), packet header is also the header ﬂit and it contains
the information Of source address, destination address, and effective ﬂit number. The
payload is the actual data from PE that requests communication service. Our system—level
NOC model also conforms to this packetization scheme, and deﬁnes a dedicated PACKET
class associating with FLIT class. Both PACKET and FLIT classes have a sc_time—type
ﬁeld latency to record the packet or ﬂit’s latency, which is deﬁned by the time between
the injection of each packet/ﬂit from the source PE and its arrival at the destination PE. A
PACKET object can consists of multiple FLIT Objects by instantiating them according tO

the packet length.

 

packet head r OxFF j src_addr [dest_addrT eff_ﬂit_numl

 

 

data_0 (32-bit) j
data__1 (132-bit) T

 

 

 

data_2 (132-bit)

 

 

l
i—T'_T_1F:T

data_3 (32-bit)

 

 

 

 

 

 

 

 

 

payload J
L data_4 (32-1111) 1

[ data_5 (32-1011) 1

[ data_6 (324311) j

_.l data_7 (32-bit) j

 

Figure 3.3(a) NOC packet structure

 

 

 

 

 

 

 

PACKET l—ﬁ F LIT
data
Length
Latency hopcount
inline bool operator == (const.
inline bool operator == (const, FLIT8. rhs) const
PACKET& rhs) 00081 inline ostream& operator <<(

inline ostream& operator << (
ostream& 05, const PACKET& a )

ostream& 05. const FLIT& a )

 

 

 

 

 

 

sc_time

 

Figure 3.3(b) Class diagram Of NOC packet/ﬂit model
Router is a NOC component that performs routing algorithm for packet delivery. As

described in Figure 3.4, a router has one local port connected with PE and four routing

59

ports connected with its neighboring routers to form the interconnection network. Inside a
router three functional modules cooperate with each other to route a packet from one
input port to the appropriate output port: input buffers, crossbar, and output arbiters. And
each output port has an optional output buffer for better ﬂow control. Among these the
input buffer is implemented in conjunction with the link model, which will be described
later. Each router is conﬁgured with a speciﬁc routing algorithm deﬁning the path taken
by a packet between the source and the destination. They must prevent deadlock,
livelock, and starvation situations. In this work, we only consider the wormhole routers
where multiple ﬂits in a packet spread through a set of routers from the source to the

destination during operation.

NoC_TL1__Router_Wh_<buffer>_<arbitration>_<routing>

 

OBuffer
IbufferlLOCAL] RS[LOCAL] , _________ ,

--------- Arb'ter »-
lbuﬁer[NORTH] RSINORTHI r --------- , ' i

 

 

a
1

Ibuffer[SOUTH] RSlSOUTH] ::_-_-_-_-_-___-:.I Arbiter

l
L _________ 1
lbuffe EAS RS[EAST] .' _________ 1' Arbiter
{rial {inn—w .
_’ L _________ l '
lbuffer[WEST] RSIWEST] , _________ ,
M' —’© Full crossbar

Routing algortihm

Figure 3.4 Generic NOC router architecture

a
l
_ 1--
ill ilJl

 

uuuuu
65
l
i

 

 

 

 

 

 

_-_L

 

 

 

 

A NoC router is modeled by sc_module class associated with input and output ports
instantiated from sc_port class. Router processes the incoming packets from its ﬁve
buffered input ports concurrently since these ports may receive data simultaneously, and
each packet will be routed to the appropriate output port through a 5x5 full crossbar,

which is controlled by the speciﬁed routing algorithm. Five Optional output buffers

60

bridge the output Of the crossbar and the output ports to provide better throughput with
the price of additional area. Each output port is arbitrated by an output arbiter to handle
the situation that multiple packets from different input ports are routed to the same output
port.

To model the router’s inherent concurrency, we employ SystemC concurrent process
sc_thread to model the packet routing. In the 5-port router in this work, ﬁve parallel
sc_thread processes with each one invoking the corresponding routing handlers are
hooked up for each router during instantiation. The routing function route( ) emulates the
crossbar operations on one input port. It performs non-blocking read for packet/ﬂit from
its input port, computes the route according to the assigned routing algorithm, and returns
the calculated output port for this packet.

As what’s mentioned before. a NOC router is characterized by four architectural
parameters: router type, buffering scheme, arbitration policy and routing algorithm.
Actually only the last three parameters are taken into consideration since our work
mainly models the wormhole router, which is used extensively in NOC research. The
(param) ﬁeld in each router class is the permutation of these ﬁelds:
<buﬂer>_<arbitration>_<routing>, where <buﬁfer>, <arbitration> and <routing> specify
the input/output buffering scheme, arbitration policy and routing algorithm, respectively.
Each parameter is modeled by a dedicated set Of classes/function calls in the router class,
and these three parameters are completely orthogonal. Our methodology creates a router
template model that only implements those architecture independent features such as
input port scanning and routed packet/ﬂit forwarding. Different router architectures are
modeled explicitly by plugging in the architecture-dependent class members/methods
supporting those features designated by the three aforementioned architectural
parameters. Another feature of these three architectural parameters is that they are
architecturally orthogonal in that each one can be changed independently without the

requirement of changing another one accordingly. Categorizing the orthogonal

61

architecture parameters enables system designers to explore the design space more
efﬁciently.
3.2.1. Input/Output Buffer Model and itsArchiteCtural Parameter Buﬂer

Similar to the general routers used in the conventional network, buffering provides the
ability of temporary data storage under the situation that the requesting hardware resource
is busy on the previous or higher priority requests to avoid the data loss or network
congestion. The on-chip router in NOC modeled in this work cannot afford large amount
Of SRAM/register—based buffers because Of area and power consideration. In this work,

the router’s maximum buffer size is 32x32 bit.
l—l Din/Sf", :0]

.. .9 ..
[Unsung
3:3 [1:] iii“ [5]

 

 

 

 

 

 

 

 

3x
s :3)
i iiiiii Dout[S -01
‘2 ﬂit'
g l .1 .......... i ...... I .......................... l"~-|
WR \-
- . 1:1 1:1 ------ 1:1
Address /
“we“ Address[Nm,-l:0] _ M

 

 

 

 

 

 

V - 6
RD
Read 1 Flag logic
control em ty
- J

 

 

 

 

 

Figure 3.5 NOC on-chip router FIFO architecture
The buffer in the on-chip router is architected as First-In First-Out (FIFO) containing
one read port and one write port having two sets of read/write control logics. Figure 3.5
shows the block diagram of a typical dual—port SRAM-based N x Sﬂi, FIFO that can store

at most N ﬂits (words), each containing Sﬂi, bits. On a write( read), the ﬂit is written into

62

(read from) the appropriate row Of the SRAM, with the row indexing done based on the
addresses generated from a read (write) counter.

In the System-level model, the architectural and circuit-level details are not modeled
elaborately since the FIFO is treated as a black box with the speciﬁed input/output
properties in the system designers’ point of view. For the functionality, the on-chip FIFO
buffer is modeled via two methods in our work: event-driven based on hierarchical
channel built upon sc_ﬂfo primitive channel, and non-event-driven based on standard
C++ template class deque. Here only the functional modeling methodology is presented,
and the performance and power models will be detailed in chapter 3.4 and chapter 4,
respectively.

The key of modeling FIFO behaviors is the support Of reading/writing data with the
ﬁrst-in ﬁrst -out fashion as what’s indicated from the name. Figure 3.7(a) graphically
presents the FIFO modeled with event-driven feature via a hierarchical channel Chﬁfo,
which is essentially a SystemC module. A primitive sc_ﬁfo channel Chpﬁor its derived
class) is instantiated inside Chﬁfo along with two ports P1,, and Po“, , which have opposite
access polarities. P,-,, is templated with sc_ﬁfo_in_if interface virtually declaring non-
blocking/blocking write( ) function, and similarly PM is templated with scjifo_out_rf
interface virtually declaring non-blocking/blocking read( ) function. Both read( ) and
write( ) functions’ implementations are done by the encapsulated sc_ﬂfo channel Chm.
During simulation, the data is written and read to/from the FIFO model by the method Of
remote procedure call(RPC), where the actual read/write operations don’t belong to the
module that calls them. Moreover, Chpf’s length and data width are both reconﬁgurable
during the module elaboration time, and data__written_event( ) and data_read_event( )
member functions return the write and read event to the input and output port connecting
the channel, respectively. SO the read/write operations are always triggered by these two
events implicitly (non-blocking read/write) or explicitly (blocking read/write). The

storage element is provided by the sc_ﬁfo module Chpf, which encapsulates a FIFO data

63

structure only allowing the data written ﬁrst to be read out ﬁrst.

F [F 0 templated storage
Chﬁfo elements

P P

Module A Module 3

 

Figure 3.6(a) SystemC sc_ﬁfo-based FIFO model

FIFO templated storage
Qeque ,70 . elements

 
   

 

push_back( / pop frond )

 

 

 

 

 

 

Figure 3.6(b) C++ deque—based FIFO model

Another way Of modeling the on-chip router buffer is the C++ template class deque,
which supports different queue data structures with built-in encapsulation and is not
event-driven. The details Of deque class template can be found in several references
[118][ll9], but there was no literature that employed it to model the on-chip router in the
previous work. Figure 3.6(b) shows the on-chip output buffer OBuﬁfer modeled by the
deque template class. The key member functions used in modeling the FIFO behavior are
push_back( ) and pop _front( ), in which the previous one writes data to the tail of the
queue structure and the last one reads the data from the head. Similar to the ﬁrst method,
the FIFO size and data width are both templated so that it can support a broad ranges of
FIFOs with different sizes and data widths. The user explicitly controls the write/read
operations on the FIFO without the indication of events.

The difference between the FIFO model architectures in 3.6(a) and (b) is that the

sc_ﬁfo-based FIFO model is implemented as a SystemC module having input and output

portsand the deque-based FIFO model doesn’t have any hierarchy. The scjifo-based
FIFO model is applied in the situation that interconnections need to be modeled explicitly
since its hierarchy with input/output ports provides the connecting interface and the
deque-based one models the FIFO without explicit port-level interconnection. In the on-
chip router model shown in Figure 3.4, the input buffer is modeled via the event-driven
method using the sc_ﬂfo-based hierarchical channel since the data reception is triggered
by monitoring the data writing event on another router’s output port connecting with this
router’s input port via on—chip link. The optional output buffer in the on—chip router is
modeled by the non-event-driven method since its write/read access is controlled by the
crossbar/output arbiter without monitoring any events.

Another distinct difference between these two FIFO modeling schemes is their
simulation speed. Because Of the complexity of internal hierarchy of the scjfo—based
buffer as well as its event-driven evaluate/update simulation scheme, its simulation speed
is much slower than the non-event driven one. The experiment shows that the event-
driven FIFO is ~27.5X slower than the non-event driven one.

In summary, the NOC router’s on-chip buffer has two architectural conﬁgurations: Ibuf,
IObuf.
@Ibuf indicates that the router is only buffered on input port. The input buffer is
modeled by via the event-driven method via primitive channel. The buffer size is
conﬁgured when the channel class constructor during Object instantiation.
@IObuf indicates that the router has both input and output buffers. The input buffer is
modeled by the event-driven method, and the output buffer is modeled by the non-event-
driven scheme via deque<T> template class, where the template T speciﬁes the FIFO
width.
3.2.2. Output Arbiter Model and its Architectural Parameter Arbitration

The output arbiter performs arbitration on multiple active requests routed to the same

output port for the ﬂow-control purpose. In our work, we propose a uniﬁed arbitration

65

model based on the concept Of priority list (PL) and priority updating scheme (PUS). And
this model is based on the mathematical representation of the arbitration process based on

PL and PUS.

[Priority List (PL) Model] Given an N-tuple input port set [P = {IPOJPl ..... lPN—I }and an M-
tuple output port set 0P={0P0.0P1 ,...,0PM__,}, for each 0P,-, (1' = 0, 1, M-I), there
exists one IP permutation set A=P(IP)|03 ={IRJPVHJB‘}, i,j,ke [0,N—l] and
1' ¢ j 1: k representing output port 0P.- priority list. And 0P,- has the priority list of:

p(Arr(A)[0])> p(Arr(A)[l])>... > p(Arr(A)[N —l]), where Arr(A) is the transformation of

permutation set to an array indexed by the position from 0 to NJ, and p(d) represents the

priority Of the dth element in the array Arr(A). If an output port has the IP permutation set

of A = {IR,I[§,IR,...,IP, }, the requests coming from input port d cannot be serviced until

all the requests from the input port I with Arr(A)[pos(t)]<Arr(A)[pos(d)]are processed,

where pos(d) is the position function calculating the absolute position of the port d in the
IP permutation set A.

[Priority Update Scheme (PUS) Model] The PUS stands for the iterative permutation on
the IP permutation setA = P(Ip)l08 ={IP,-,IPJ-,...,1Pk }. There exists a rule that updates
set A via the iterative method Of:

A“ = P(A),0,, = [113. JP}. ....,1P,. }, wherei',j',k' .- [o,~_1] and 1' ¢ j' e r

If the new permutation A* = A, which indicates that Arr(A) = Arn(A *), it is called Fixed
Priority Update Scheme (FPUS); otherwise, the PUS is variable. There is speciﬁc

situation that if 1P1, k e [0,N —l]is serviced in the current cycle, in the next cycle the new

permutation becomes:

At = {I}: 91p]- 9IPA"(A)[1(_1]aIPA,-r(A)[k+1]-~91Pk }, where i',j..k il,k E [0,N—l]

66

In this scheme A’ :t A, and [)(Arr(A‘IN-1D=k, indicating that the previous serviced

input port has the lowest priority. This priority updating is referred as Fair Sharing
Update Scheme (FSUS). There exist different iterative permutations resulting distinctive
priority update schemes, but only the FPUS and FSUS are modeled in our current work.

In the NOC router with 5 input and 5 output ports of LOCAL, NORTH, SOUTH.
EAST and WEST. Each output port’s priority list is initially set to P(OP,- = (LOCAL,
NORTH, SOUTH, EAST, WEST} i = LOCAL, NORTH, SOUTH, EAST or WEST, giving
the highest priority to LOCAL port and the lowest priority to WEST port. And each
output port has the same priority list with the same highest priority input and lowest
priority input. Under FPUS, each output’s priority list keeps constant and a ﬂit from low-
priority input port cannot be serviced until there is no incoming ﬂit from all high-priority
ports accessing the same output port. The fixed priority arbitration works well under
some deterministic traffic such as a fixed input port having more frequent access to a
ﬁxed output compared with other inputs. But it may result in the starvation of low-
priority ports if there are continues data from both high-priority and low-priority input
ports asking the same output port. To overcome this issue, the FSUS scheme that rotates
the priority list by setting the most recently serviced input port to the lowest priority is
applied to allow all the input ports to share the output port’s bandwidth.

The implementations of FPUS and FSUS are based on their mathematical model
proposed before, and the priority list of the 5 output ports is simply modeled as a 2-D
array with each element being constructed as a S-entry l-D array specifying the input port
with the priorities from high to low. The algorithm FPUS and FSUS are shown in Figure
3.7(a) and Figure 3.7(b), respectively. Inside these two algorithms, the real arbitration is
not triggered until there are multiple ﬂits being routed to the same output port. And the
router module has ﬁve sc_thread processes, each one of which calls the arbitration
function in Figure 3.7 as the output arbiter. These ﬁve processes are triggered

independently with their own simulated times so that the parallelism is modeled explicitly

67

without any other coding efforts.
F PUS abitration

FSUS abitralion

 

bool Arbitrate_FPUS(int prioriryﬂ)

if(there is only one active request)
{grant_this_requesl;
break; }

else

index _priority++)
index_queue++)

{if(find that the index_queue has the highest priority)
{granted = rs[index_queue];

{for(index_priority = 0; index _prion'ty < lNPORT_NUM ;

{for(index_queue = 0; index_queue < INPORT_NUM;

bool Arbitrate__FSUS(int priorityﬂ)

if(there is only one active request)
{grant_this_request;
break; }
else
{for(index_priority = 0; index _priority < INPORLNUM ;
index__priority++)
{tor(index_queue = 0; index_queue < INPORT_NUM;
index_queue++)
{if(ﬁnd that the index_queue has the highest priority)
{granted = rs[index_queue];
ﬁne = true;

312a; if”. for(k:index_priority; k<lNPORT_NUM—1: k++)
- prioritylkl = priority{k+1l:
”(ﬁnd) priority[k] = grantedjngress; ll set lowest priority level
‘ break;}}
} break.) if(ﬁnd)
_ break: } }
}return granted. return granted;

 

 

 

}
Figure 3.7(b) FSUS algorithm in NOC router

In summary, NOC router’s output arbiter has two architectural conﬁgurations: F P, FS.

 

 

Figure 3.7(a) FPUS algorithm in NOC router

@FP indicates that the router’s arbiter uses the ﬁxed-priority arbitration policy
employing FPUS scheme, where each output port’s priority list doesn’t change during the
operation.
@FS indicates that the router’s arbiter employs the fair-sharing arbitration policy
employing FSUS scheme, where each output port’s priority list is permuted in the way
that the most recently serviced one has the lowest priority.
3.2.3. Router Crossbar Model and its Architectural Parameter Routing

The crossbar functions steering the ﬂit from router’s input port to the routed output
port. In the NOC framework, the NixNo matrix architecture crossbar is employed and each
crossbar has N,- input ports and N0 output ports shown in Figure 3.8. Each port has the
same data width of Spy-bit. The crossbar’s input ports are connected with the router’s
input buffer, and its output ports drive the router’s output directly or the router’s output
buffer, which in turn connects to the output port The passing transistors located on the
intersections between the input and output ports control the data ﬂow with the aid of the
routing logic unit and the arbiter. So a matrix crossbar has Sﬂi;XNiXN0 passing transistors,

each of which controls the connectivity status of an output with a specific input. At any

68

time, only one output port can be granted for the request from one input.

The output port that a packet/ﬂit will be routed to is calculated by the router’s routing
logic by evaluating the packet’s destination address, the router’s address and the NOC
topology. In some NOC architectures with adaptive routing capability, the dynamic
network trafﬁc status on this router is also evaluated to calculate the route for this packet.
For the functionality model’s perspective, circuit-level details such as input/output drivers
and passing transistors are abstracted out. A routing function modeling the routing logic
calculates the route based on the packet and router information, and the incoming packet

will be written to the output port/buffer designated by the calculated route.

slimli—
Fs—s s
s

’N

Switchbox

 

 

 

1}
a
it

‘\

 

}l
}l

\\

 

Routing Logic

 

 

 

 

Ni

ii

is

 

f;

2;,

‘k

1}“
J}?
}j
}l

 

 

 

 

I

v v
0000

2 ......

 

 

 

 

 

 

{:14 .\

—I

Figure 3.8 Generic router crossbar architecture

The most signiﬁcant factor differentiating one router from another is the routing
algorithm that the routing logic part shown in Figure 3.8 performs to calculate the route
for the incoming packet. In the wormhole router, only the head ﬂit containing the
source/destination information is evaluated for the route computation, and all the

following ﬂits will follow the same route until the tail ﬂit is successfully delivered from

69

this router. Depending on the route calculation algorithm, the router crossbar model has
the following 3 architectural parameters:

<routing> = {XY, WF, NL}

 

.0001
_L
00

0,3

 

    
    

cool-

 

I. neocoe’e-c ooeeetuooeteee-oq

 

......5.....l.. .....

 

 

 

 

 

Figure 3.9 XY—routing on 4x4 mesh
Q) XY indicates that the route is calculated by the XY—deterministic routing algorithm.

Assume that each router in the NOC has a two-dimension address (x,y) representing its
location, the header ﬂit of each packet contains its destination addresses (xd, yd). XY-
routing algorithm ﬁrst routes the packet in the X direction until reaching the yd address,
and afterwards in the Y direction to reach the destination as shown in Figure 3.9. The
router only computes the new route when receiving a header ﬂit or just routes the body
ﬂit with the pro-computed route if the output port is granted. The XY—routing algorithm is
very area efﬁcient for hardware implementation with the micro-architecture called Route
Calculation Unit (RCU) containing only two 4-bit comparators and l 5-entry ROM table
with additional latches for data storage shown in Figure 3.10(a). And Figure 3.10(b)
presents the routing control logic architecture containing ﬁve XY—routing RCUs
associated with router’s ﬁve input ports. Each one drives a direction signal selecting the
appropriate passing gates to let the routed ﬂits traversing through the router shown in
Figure 3.4. The crossbar control logic also orchestrates the ﬁve RCUs so that each two
RCUs will not select the same output port. The FSM inside the crossbar control logic also

monitors the arbitration status to decide whether to start the new packet’s route

70

computation and data forwarding. In additional to its low gate count architecture, the XY-

routing algorithm gives the micro—network the “in-order delivery” property since each ﬂit

in one packet is routed on the same path.

4-bit comparator

 

 

 

 

 

 

 

  
 

 

> “00,, 5—entry ROM table
Addr x 4
- —,4> 0101; LOCAL
___ s01. ; 00xx: WEST
Dest_x 4; I 10xx: EAST
u " xxOO: SOUTH
< 10 xx10: North
>“00" VitiAddrs Oé's'i) " ”
Addr 4 :
‘y +> ' output = 00
De t 4 = '01- ._ If(Addr == Dest) -,
5 _y A . j output = '01" i
< .10. 3, If(Addr == Dest) 1
output = '10" :

 

 

 

4-bit comparator

(a)

crossbar control unit

   

 

 

   
 

To gate of passing
transistors

 

To crossbar’s
data port

 

 

 

 

 

 

 

 

 

/‘\
LOCA b r
L ”ﬂer— RCU(LOCAL)
[ 1
NORTH bufferv RCU(NORTH) _T»
SOUTH bufferl >i RCU(SOUTH) i >
1 1 ‘
b ff . '
EAST ” er - > RCU(EAST) —1l—>!-'
t 1 1’
l 1 /
WEST buffer R CU (WEST)

 

To crossbar’s control port(gate of
passing transistor)

LCM—F

Control FSM
\ ,

\\‘_ >m" _/

/
From output arbiter

 

(b)
Figure 3.10 Crossbar control of XY—deterministic router (a) RCU for XY—routing algorithm
(b) Crossbar control logic with 5 RCUs

(2) WF indicates that the route is calculated by the West First adaptive routing algorithm,
which could selectively choose the route depending on both the input packet’s
information and current trafﬁc status. As what’s shown by the dotted line in Figure
3.ll(a), the turns where the packet comes from the Y direction are forbidden to prevent
the deadlock situation. In the West-ﬁrst algorithm, if xd <= 11', packets are routed
deterministically as what the XY algorithm does, and packets can be routed adaptively in

East, North or South directions if xd > x depending on traﬁic conditions.

 

  

1&0) ' (3.11 l (3.21
I

 

 

 

 

 

 

Figure 3.11 WF-routing paths on 4x4 ntesh

In our work, we use the output port’s average throughput ThA as the trafﬁc metrics
when determining which output port grants this packet/ﬂit in WF-adaptive routing
algorithm. The average throughput is deﬁned as the amount of data (number of bytes)
traversing through the output port. Given the NOC starting time t and current time stamp
tc, assume that there are Np traversing packets recorded on output port 0P,~, with the
packet size of 5,,(ﬂits), and the ﬂit size szi,(bits), the average throughput of this port is
calculate as:
Thorax = —Np XSP x Sﬂl

(tc —t)x8

In the ﬁve output ports LOCAL, NORTH, SOUTH, WEST, and EAST, Throcxtx. and

Byte/ s

Thwgsm are not evaluated since the LOCAL and WEST ports cannot be routed adaptively
in the WP routing algorithm. The non-WEST output of the adaptive routing algorithm is
the port with the minimum average throughput.
WEST, if (xd <= x)
OPWF = , where [Thml is the 3-
k,T/z,_, = 1111110712,, [)1 = {NORTH,SOUTH, EAST}
element array of the remaining routable output ports NORTH, SOUTH and EAST. If
there are two or more Th,“ have the equal values, the output port with highest priority
will be selected and it will be rotated to the lowest priority as what FS arbitration
algorithm performs in chapter 3.2.2.

Adaptive routing algorithm has the beneﬁt of automatically monitoring the current
trafﬁc and adjusts the packets’ route dynamically to avoid the intensive access to the
ﬁxed output port. But the adaptability is achieved by paying the price of additional
hardware logics. Figure 3.12(a) shows the micro-architecture of the WF routing
calculation logic. Three 32-bit counters C1, C2 and C 3 are used to record the number of
passing packets on the output ports of NORTH, SOUTH and EAST. And a min-max logic
L evaluates the minimum value among the counters’ outputs to calculate the port with
minimum average throughput. A 4-bit comparator CP compares the input packet header’s
x-direction address and this router’s. If xd <= x, CP will outputs ‘0’ that selects the
WEST route; Otherwise, CP outputs ‘1’ selecting the 2-bit output of L that will be used to
choose one out of another three possible routes {NORTH, SOUTH, EAST}. Compared
with the deterministic routing logic, the WP adaptive one adds three 32-bit counters,

which contributes the biggest portion of the routing logic.

73

Ct L MUX

 

 

 

 

 

 

 

 

 

 

 

 

 

32-bit 32/ _
counter ’ '
c2 1: “00" => NORTH
lg:- 1t ‘32, Min-max 2, _ 1: "01" => SOUTH
counter / ' logic / ' 1: “10" => EAST
To gate of passing
32-bit 32 A transistors
counter fa '-——>

 

 

 

 

 

 

 

 

 

 

4-bit 0: WEST
comparator
Addr_x 4 5 0 “
<=| D
Dest__x 4; >‘1'
(SP
(21)
To crossbar's

crossbar control unit

 
 
 
 
 

 

p011 DLOCAL output

NORTH output

LOCAL buffer

   

[without
counter

       
 
 
  

NORTH buffer

  

[without
counter

 
 

SOUTH output

 
  
  

SOUTH buffer

 
 

[without
counter

 
 

output

 
 
  

EAST buffer

 
 

[without
counter

   
 
 
  

  

WEST output

 

D

To crossbar's control port(gate of
passing transistor)

WEST butler

     

[without
counter

 
 
      
 

Control FSM
From output arbiter

(b)
Figure 3.12 Crossbar control of WF—adaptive router (a) RCU for XY—routing algorithm
(b) Crossbar control logic with 5 RCUs

To save the on-chip router’s area, the three 32-bit packet number counters are shared
among the five RCUs in the router. The overall routing control logic is shown in Figure
3.12(b). Similar to Figure 3.10(b), the output arbiter’s information is provided to the FSM

to schedule the route computing. During the routing calculation process, all the three

74

counters are reset when anyone of them overﬂows to prevent the wrong comparison
results. Also if two or more counters have the same counting value, the output with

highest priority will be selected.

 

 

   
    

 

 

 

 

 

 

 

 

 

(3.01 0 (3.11 i (3.2) l (3.31
I
rm I rx . /\ I. .
k] I \J I \_/____L_
l l l
(2.01 I (2.1 l (2.2) 12. 1
r‘x ' I r\ i .
L 2 r r \f f
l | 1
70,1) — — +07) _ _ 17271; — :J—(aTI' _ _
\V~ J r — +- 7? | A
J l .\ l K/ 1 l \r/
I l l l
(0.0) F - T173) '7 _ Ting _ (33 _
r\ | KL‘ ‘1' r ’ LA.
k/ I U I V’ I k/
I l l

 

 

Figure 3.13 NL-routing paths on 4x4 mesh

® NL (North—Last) is another adaptive routing algorithm supported by this model. Figure
3.13 shows its possible routing paths. In the NL algorithm if yd<=y, packets are routed
deterministically. And if yd >y, packets can be routed adaptively in WEST, EAST, or
SOUTH directions. It has the similar architecture compared with the WF-routing
computing unit shown in Figure 3.12. And the difference lies in that the average
throughputs of WEST, EAST, and SOUTH are calculated instead of NORTH, SOUTH
and EAST.
3.3. System-level Model of Interconnecting Links

Link model abstracts the physical signal transntission on the global metal
interconnecting wires in NOC shown in Figure 3.1. From the system-level point of view, a
link can be considered as a process that only receives/sends data front/to its connected
routers. Currently link is modeled by inheritng the SystemC scjifo primitive channel
with customized read( ) and write( ) methods. These two methods override the virtual

read/write functions declared in sc_ﬁfo_if class from which sc_ﬁfo inherits. Since

sc_ﬂfo_if is associated with the router’s input/output port, the link and the router’s port
are logically connected with the channel that provides a FIFO data structure manipulated
by the channel’s read( ) and write( ) methods. Consider the case in 3.15(a) where a ﬂit F
is sent from router A’s output port 1 to router B’s input port 1, the process in A will call its
port 1’s virtual function write(F), which in turn invokes the implementation function
write(F) in the m jifo channel inheriting from sc_ﬁfo. The FIFO will be updated by
putting F to the end of its queue. Consider the FIFO as the input buffer of port 1 on router
B, writing data into FIFO from A models the operation of data arrival from router A to
router B. At this time, router B may not process this data directly since it still exists in the
input buffer. B needs to invoke read( ) associated with its input port 1 to receive F, and
the FIFO will pop the ﬁrst element in the FIFO to B’s local data member ﬂit. Similarly
applying this method to write a sequence of data from A to B by invoking port1-> write(
) multiple times, and B will receive them in the same sequence by continuously calling
port] ->read( ). Through this Remote Procedure Call (RPC) method, inter-module

conununication is modeled by the explicit read/write function calls.

_ Eel 9:19.614 .. .... ineuLBzrfierMade.’ ..
1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

P, | P2 I
v 0d” e , " I l v odu e
A \tlmk '1‘er / B
:— 7" (\I: Read()I
" 7' " “s"? —————————— J
<<\\\/

Pin-> Write( ﬂit)
Figure 3.14 Reuse of scjifo—based channel for botlt input buffer and link models
Since the data transfer behavior of links is modeled via the channels’ RPC read/write,

which is also used for router’s input buffer modeling, we share the stifo-based event-
driven channel for modeling both router’s input buffer and interconnecting link. The
rationality of reusing the same model for the two different NOC functional blocks resides
in the fact that the scﬁfo-based channel can be viewed as the link between port P, and

P2, and the storage elements encapsulated in scjifo is treated as the input buffer of

76

router’s P2 port shown in Figure 3.14. The RPC function call from P, write( ) models the
behavior of port P, injecting data onto the link, and this write’s associated
data_written_event( ) indicates that data is received by P2’s input buffer. SO when
annotating the latencies of link and the router’s input buffer, the overloaded sc_ﬁfo’s
write( ) method would have the delay function with the argument of (t,,., + t,,-,,,,), where t,,.,
and t,,-,,,, are input buffer’s write latency and link’s latency, respectively. On the other
hand, the input buffer read latency is modeled by overloading sc_ﬁfo’s read( ) method.

In the current NOC model, we assume that each packet can’t be delivered until its
previous one is sent successfully. Applying blocking read/write methods provided in
sc_ﬁfo, a simple lossless ﬂow control is modeled. In short, an input port’s blocking-read
doesn’t return until the data is successfully read from the connected FIFO once invoked,
and similar for its blocking-write. When a port’s blocking read/write is waiting for the
data, it allows other concurrent processes continue executing by using wait( ) method for
synchronization. Instead of an explicit timed sc_time data, wait( ) is parameterized by an
sc_event variable data_ready showing that the FIFO has effective data for read or enough
space for write in blocking read/write methods. The lossless delivery is signiﬁcant in
NOC because the on-chip micro-network doesn’t have much area for implementing
complex error-correcting algorithms compared with general-purpose network
infrastructures.

3. 4. Latency Model of NoC Routers and Links

In NOC, router and link’s delay or latency is another important design metric that needs
to be taken into account because they contribute to the whole system performance
signiﬁcantly. Router latency is deﬁned by the time interval between a ﬂit’s injection from
its input port and its departure from its output port, while link delay is the time that a ﬂit
spends to traverse through the link. With the decreasing transistor feature size in
advancing technologies, routers are becoming smaller with faster Operating frequency,

and in turn bring less latency. On the contrary, global interconnecting wires scale at a

77

much slower rate compared with transistors’, and in consequence link latency will
increase on the global wires resulting from the increased unit length resistance and
capacitance. Also, the increasing aspect ratio (AR) on deep-submicron wires makes the
inter-wire coupling capacitance a signiﬁcant contributor to wire delay. This work
proposes a system-level modeling methodology for router and link’s delay and its
integration with the NOC simulation framework by high-level function calls.
3. 4. l . Orr-chip Router ’3 Latency Model

First consider router’s latency model for the generic router architecture shown in
Figure 3.4. For each ﬂit passing tltrough this router, its latency depends on both the
router’s architecture and the trafﬁc status. Here only the router’s intrinsic latency
resulting from its architectural components is explicitly modeled, and the latency
resulting from the network congestion is implicitly modeled by scjifo’s blocking
read/write function calls. For a router consisting of input buffer, crossbar and arbiter, the

latency of a ﬂit traversing through it is estimated by:

L = Lr’buff" + L + L

flit .router r,crossbar r ,arbiter (3- 1 )
where Lnbuﬂgr, L,,,,,,,,,,a, and Lmrbm, are the latencies of router’s input buffer, crossbar and
arbiter respectively. For the input buffer, most available on-chip routers use dual-port
SRAM array for fast access. In this case, the latency contributed from input buffer could

be modeled by:

Lr,buffer = t5RAM,read (3.2)
where tSRAMﬂm, is the read latency of the dual-port SRAM array. In the system-level, we
don’t go into the detailed SRAM’s internal architecture to evaluate its latency, but
characterize a set of memory instances with different sizes and bit widths to generate a
lookup table. Once there is a memory access happening in the input buffer, the simulator
will perform a table lookup according to each router’s input buffer size and data width
and invoke the wait( ) method with the argument of extracted latency data to emulate the

SRAM read delay. Since the input buffer read is modeled by calling the corresponding

78

input port’s read( ) virtual function, which is implemented by read( ) in the link object
that this port is connected with, inserting wait( ) function parameterized by the extracted
SRAM latency l_sram will postpone the data receiving operation by “l_sram” amount of

time described in Figure 3.15.

 

Read command issued @ t ns

F is ready for routing @ t+l_sram ns

 

Router
T o crossbar

 

(read(<7> val _j
1’
while/num_available() == 0)
wait(m_data_written_event),'
m_num_read + +,'

”Delay
wait(sc_time(l_sram,NS)).‘4/ function
buf_read(val_);
request_update().'

Q J

 

 

Figure 3.15 lnput buffer’s latency model
We use the Artisan memory generator[55] for IBM 7RF 0.18am CMOS process to
create and characterize memory instances. The input buffer size ranges from 16-word to
2K-word, and the data width spans from 32 to 256 bits. The characterized FIFO read

latency is shown in table 3.1:

Table 3.1. Characterized read latency(ns) for FIFOs in IBM 0.18um process

 

 

 

 

 

 

 

 

 

 

 

 

 

RAM size 32 64 128 256 512 1024 2048

Data wi
32-bit 1.118 1.123 1.134 1.155 1.198 1.283 1.453
64-bit 1.311 1.316 1.327 1.348 1.390 1.475 1.645
128—bit 1.696 1.702 1.712 1.734 1.776 1.861 2.031
256-bit 1.914 2.008 2.207 2.325 2.342 2.428 2.576

 

79

 

All these characterization data are stored as tables for the router model to access during
simulation. Each RAM size-Data width pair corresponds to characterized memory access
latency.

In NOC router, the latency resulting from crossbar is deﬁned as time between the ﬂit’s
arrival in crossbar’s input and its departure on its output. For the 5x5 crossbar as Figure
3.8 shows, each cross-point between input and output lines are connected by a NMOS
passing transistor (or a transmission gate consisting of an NMOS and a PMOS). Each
transistor’s gate is controlled by the routing logic unit implementing the routing function.
The crossbar latency is composed by two components: the route computation
time(t,o,.,2_c,,,,,,,) and the crossbar traverse time (t,,,,,,). In wormhole routing where each
packet is decomposed into ﬁxed-size ﬂits for transmission, the router takes the header ﬂit
containing address information to compute the route, which will be applied to all the
successive ﬂits in this packet. Once the route for a packet is evaluated, this route will be
reserved for all the ﬂits belonging to this packet. The header ﬂit can not traverse through
the crossbar until the route computation ﬁnishes, while other ﬂits just follow the same

computed route. Based on this, the latency on the crossbar can be estimated as:

Lr,crossbar :. troute_comp + tcbar 9 for the header ﬂit (3'3)

or L,,crossb,,r = tcba, , for the body ﬂits

Instead of developing a micro-architecture crossbar latency model, we employ the
published parameterizable crossbar model proposed in [56] to estimate crossbar’s
inherent delay. Assume p is the number of ports in the crossbar, and w represents the
channel width of each port, the crossbar’s traversal time could be evaluated

approximately by:

1.....(p.W)= 910g8 wi-‘EJ +610g2lpl+6 '47 (3.4)

where 1.’ is the delay of an inverter with the fanout of 4.

The route computation time tm,.,e_,.0,,,,, is characterized by running the timing simulation

80

of the synthesized routing algorithm written in Verilog with IBM 0.18am standard cell
library. And we perform scaling to estimate tmu,e_,,,m,, on other technology nodes with
different feature sizes.

Arbiter performs arbitration when there are multiple routing requests to the same
output port. The parameterized arbiter model in [56] is applied to estimate the arbitration
latency when there is an arbitration request. Compared to the crossbar latency model in
(3.4), the arbiter latency is independent with the channel width, and only associated with

the crossbar’s port number p:

l l
[arbiter (p) : [ZIEIOgZ p +145) . 42- (3.5)

3. 4.2. Link Latency Model

In the system-level, link’s delay model is similar to the router’s by calling the wait( )
method to emulate the delay when a ﬂit passing through an interconnecting link. Since
the physical signal transmission on links is abstracted to the channel’s write( ) function,
modeling delay on links needs to insert the link’s delay function to the write( ) method to
emulate the real delay. Shown in Figure 3.16(a), if the output driver on the source router
A writes a ﬂit F to router B tlnough link L, the transmission of F on the link is modeled
by moving F from A to L’s FIFO data structure. Each ﬂit has a data ﬁeld f_latency to
store the latency it experiences when traversing the network, updating f_latency during
invoking write( ) intuitively models the link’s delay. As the pseudo-SystemC code in
Figure 3.16(a) describes, the write( ) execution is “halted” for link_delay by inserting the
wait( ) function before writing the ﬂit F into the FIFO, which is modeled as router B’s
input buffer.

The link_delay parameter used for wait( ) function call is calculated by the simpliﬁed
lst-Order wire RC delay model shown in Figure 3.16(b). In the current wire model, we
don’t take the input and output drivers into consideration for simplicity. The whole global

wire with length L is divided into small segments, each with length of Lug. by repeaters

81

for less latency and better driving capability. Different repeater sizing and location will
result in different wire latency, and we assume repeaters with the same size are
distributed uniformly along each interconnecting wire. According to Heo[57]’s wire
model, wire delay is represented as a function Of the repeater sizing r, which is defined as

the ratio of the repeater gate capacitance and wire capacitance within a wire segment.

 

F is sent @ t ns F is received @ (t+link_delay) ns
Router
input
‘3‘ Link FIFO
‘

 

 

...... Crier) W
A ->write(F) {
...... while(num_ﬁ‘ee() == 0)

wait(m_data_read_event),'

m_num_written + +; Delay
wait(sc_tim€(li"Ldelay’NS»v' 4" V’ﬁmction
buf_write(val_);

request_update( ),'
L} J
Figure 3.16(a) Interconnecting wire’s delay model
Vdd Vdd

 

 

 

 

 

 

 

 

'TT'TTT

/
!Cd(a+1)w $Cwl/2 !Cg(a+1)W

Figure 3.16(b) 1” order wire RC delay model

_ w(,8 +1)Cg

r_ C L (3'6)
11‘ seg

 

The wire delay is modeled by two components: RC—delay from metal wires and repeaters:

 

 

L
DR =0-7' L [Rd '(ﬂHi'iCd +Cg i+ Lsch..,W(ﬂ+1)Cgl (3.7)
seg
LR C R C.
1),, =0.7——d——l+0.7L-Lm ’2 " (3.3)
w .

And the total link delay is estimated by adding up these two components:
link _ delay = f(L, L3,, )= DR + DM (3.9)

where R... and C... are the unit-length wire resistance and capacitance; C, and C, are the
unit-length capacitance of drain and gate of minimum size NMOS transistor; and R4 is the
resistance of minimum size repeater. ,6 is the size ratio between repeater’s PMOS and
NMOS transistors.

In this model, we treat the link delay data-independent as well, which means that the
delay is only decided by the global wire length and repeater inserting scheme. But it is
easy to extend this model to account for the inter-wire crosstalk resulting from different
switching characteristics in neighboring wires by inserting the corresponding data-
dependent model, which is left as one of the future work. A switch, which is conﬁgured
in the link class constructor, is used to decide whether the link delay model is applied or
not during simulation.

3. 5. Simulate NoC Model with Traﬂic Generator/Traﬁ‘ic Receiver

NOC model is developed independently with PE model, and its functionality is veriﬁed
before interconnecting with real PE models to reduce the tum-around time of system
integration. Figure 3.17 depicts the verification environment using trafﬁc generator/trafﬁc
receiver models. Trafﬁc generator (TG) and trafﬁc receiver(TR) essentially mimic the

packet injection and consumption mechanism in PBS and their bridging network

83

interfaces (NI). The TG and TR coexist in one SystemC module where sc_thread process
tgen models the packet generation and injection and another sc_thread process trev
emulates the packet reception. The TG/T R has input and output ports connected with
router’s LOCAL ports, and each router in the NOC will be connected with a TG/TR.
TG/T R calls the write( ) virtual function of its output port’s interface to inject its packets
into the NOC, while read( ) is used for retrieving the incoming packet. Actually each
write( ) function call transmits a ﬂit of the packet, and so does the read( ). Hence a packet
segmentation and reassembly (SAR) mechanism is required for the TG/TR model. We
employ the packet format shown in Figure 3.3(a) where each packet contains a header ﬂit
and the following body ﬂits. The header ﬂit contains the information of source and
destination addresses, number of body ﬂits in this packet. And the body ﬂits compose

packet’s data payload. Each ﬂit in the packet has the same bit-width, and each packet in

the network may have different number of ﬂits.

 

 

$99

 

 

 

Figure 3.17 TG/T R connected with NOC
When TG injects a packet into the NOC, it ﬁrst writes the header ﬂit, then it

continuously writes the body ﬂits until all ﬂits in this packet are injected. On the other
hand, TR ﬁrst identiﬁes the received packet header ﬂit before receiving its body ﬂits. By

retrieving the ﬂit number ﬁeld from the header ﬂit, TR will identify the last ﬂit of the

84

received packet by a simple counting mechanism. TG controls the interval of two
successive injected ﬂits or packets by calling wait( ) method that simulates delaying a
given amount of time td. By adjusting the value and distribution of t2, different trafﬁc
patterns can be emulated. Also changing each packet’s destination address creates
different network trafﬁc. Each TG/TR can inject/receive packets simultaneously during
simulation since its sc_thread processes execute independently, and this mechanism
intuitively models the concurrent trafﬁc ﬂows in the NOC. We apply the following two
synthetic trafﬁcs on different NOC conﬁgurations in terms of topology, input buffer size,
routing algorithm and packet/ﬂit size.

(1) Transpose trafﬁc: Any TG/TR located in (x, y) sends data to a ﬁxed destination (x’,y’)
throughout simulation. In our experiments, the relationship between the source and

destination is set to be:
.r'= (x + 2)%X _ SIZE and y'= (y + 2)%Y _ SIZE

where X_SIZE and Y_SIZE are the numbers of routers in x and y directions respectively.
(2) Uniform trafﬁc: All TG/TRs in NOC inject packets to random destinations. The
possibility of destination TG/TR features Gaussian distribution.
3. 6. NoC Performance Model Experiment Results

Four 2-D regular architectures: mesh, toms, X-torus and Y-torus are used in verifying
the NoC’s functionality and latency models. The maximum number of routers in both
directions is set to be 50 with the ﬁxed neighboring router distance of 1mm. The target
technology is 180nm IBM CMOS process. The simulation platform is the same as what is
used for simulating the PE models described in 2.3. The simulated time is ﬁxed to 20ms
(2M cycles with the clock frequency of 100MHz). In the ﬁrst set of experiment, different
NOC topologies are evaluated; while other experiments all focus on mesh architecture
since it is the most frequently used NOC topology in many literatures[105][107]. Both
physical and cycle-approximate time models are applied during the NOC performance

simulations.

Average packet latency (APL) is one of the most important performance metrics for
evaluating NOC, APL is defined as the average time between a packet’s generation at the
source node (TG) and its reception at the destination node (TR) in the NOC performance
model. By this deﬁnition, the packet queuing time, which is spent when a packet is
generated at TG, but waits for the available input buffer during the trafﬁc, is also
accounted. APL measures how fast the on-chip communication infrastructure delivers
information to the destination. Another NOC performance metric, average router
throughput (ART), is deﬁned as the average number of bytes that traverse each router at
unit amount of time. Assume a tile-based NOC has the router number of MxN, and the
measurement shows that each router has KI,- injected bytes and KB, ejected bytes (i = 0, I,

M xN-I ) at the given interval of T, the ART will be calculated by:

MxN-l

)3 (K1,. + KEI)

ART = "=0
(M xN)-T (3'10)

3. 6. 1. APL Performance under Different NOC Topologies

 

We simulate the four regular tile—based NOC topologies (mesh, torus, X-torus and Y-
torus) with different input buffer sizes of 4, 8, 16, 32 and 64 under uniform traffic. The
packet injection rate (number of packet injected into the NOC in unit amount of time) is
adjusted to 0.025 packet/us and 0.25 packet/us respectively. The ﬂit width is ﬁxed to 64-
bit and each packet contains 10 ﬂits. The average packet latencies are plotted as Figure
3.18(a) and (b).

From the simulation results we can see that the NOC model clearly reﬂects the input
buffer’s inﬂuence on the average packet latency in different NOC topologies. Since torus
NOC has more communicating channels than the other three topologies, its average
packet latency is the minimum one among these four NOCs. Mesh doesn’t have any
“loop-back” channels, and in turn it results in 23.78% and 34.61% more packet average
latency at the packet injection rate of 0.025packet/ns and 0.25packet/ns compared with

toms, with the compensation of 50% wiring resource saving. As trade-offs, X-torus and

86

Y—torus both consume 25% more wiring areas with the latency reductions of 15.81% and
8.75% compared with mesh respectively. Here we assume that the interconnecting wires
are only laid out on the top metal layer. When the injection rate is low at 0.025 packet/us,
doubling the input buffer size from 4 to 32 results in 7.27%, 5.7% and 4.96%
improvement in packet latency in mesh. While the latency improvement is getting better
in high injection rate at 0.25 packet/ns, and the latency reductions increase to 18.55%,
11.95% and 9.8% respectively. And the similar trend also applies to the other three NOC
topologies. The high injection rate makes the network easily congested, and big input

buffer can relieve the latency penalty by buffering the unrouted packets.

 

    
  
   

   
   
    
 

   

 

   

Average packet latency v.8. input buffer size Average packet latency v.5. input butler size
0339‘“?1 i"1.99%" rate = 0-025 packet/n3) (packet injection rate = 0.25 packet/n3)
’5? 70 120 T— -———r——'_—_,
E 5x5 mesh II A 1 - 5x5 mesh
0 60 5x5 torus 1 8 100i
5 V . - 5x5 toms
33 5x5 X-torus > 1
_19 so 8 5x5 X-torus
._ - 5x5 Y—torus <1) 80
3 40 °_.a . - 5x5 Y—torus
o .— ‘
a 5; col
g) 30 I § I
g 20 § 40‘
< 10 E 20I
<
0 . . o :31 3: a:
4 a 16 32 4 8 16 32
Input buffer size (ﬂit) Input butler size(ﬂit)

Figure 3.18(a). Injection rate = 0.025 packet/ms Figure 3.18(b) Injection rate = 0.25 packet/ns

We also observe that increasing buffer size to 64 only gives marginal network latency
improvement in the applied uniform traffic. And it suggests that keeping a reasonable
buffer size not only maintains the NOC performance, but also reduces its area and power
consumption for a given traffic pattern. Note that SRAM implementing router’s buffer is
pretty power—hungry and area—consuming due to its architecture. Figure 3.19(a) and (b)
show the area and power consumption of different dual-port SRAM sizes under [BM
0.18um CMOS process. Due to the architectural constraint, the MUX number can only be
4 when memory size is less than 32. Doubling SRAM size will ask for 42.71% additional
area and 29.92% more power in average when the ﬂit size is 64-bit. Combining Figures

3.18 and 3.19, we observed that l6-entry is the optimal buffer size for the 5x5 mesh with

the uniform trafﬁc when taking buffer area and power into consideration.

      

 

 

   

  

 

 

 

Dual-port SRAM area v.s. size Dual-port SRAM dynamic power v.s. size
0.35 r . r
, - W t . _ _ . j
0 3 + Data width: 64b? ' A 1‘“ + Data Wt” 64 bf‘
' . . . 3 + Data width: 132-bit

A , + Data Width. 32—bit E 10
E 0.25 2___ ‘5
E. 3
g 0.2 3
(U 0
9 E
m
2 o ‘5 MUX srze g
E 0.1 2
a) <

o. $

01 . . . . . . .
4 8 16 32 64 128 4 8 16 32 64 128
Dual-port SRAM size(ﬂits) Dual-port SRAM size(ﬂits)
Figure 3.19(a) SRAM area v.s. size Figure 3.19(b) SRAM power v.s. size

3.6.2. APL Performance with Diﬁ’erent Packet Sizes on Mesh NoC

Under 5x5 mesh NOC, we adjust different packet sizes ranging from 8 ﬂits to 128 flits.
Each ﬂit has the ﬁxed bit-width of 64 and the input buffer size is set to 16. Two injection
rates are applied to the NOC model: 0.025 packetJns and 0.25 packet/ns. The simulation
results on the uniform and transpose trafﬁcs are plotted in Figure 3.20(a) and (b).

Packet size also affects the network performance due to the fact that a reserved route in
the NOC wormhole router will not be released for other requesting packets unless all ﬂits
in the currently transmitting packet reach their destination. It takes longer time for large
packets to reserve the router than short packets given that each flit takes equal time to
traverse the router. Once the router’s output port is reserved, the arbitration will take
place for the incoming packets requesting the same output port. And according to (3.4)
and (3.5), arbitration takes ~50% more time than crossbar traverse. Large route
reservation time and arbitration overhead make the packets with smaller injection rate
have bigger average packet latency. In the uniform trafﬁc, the randomness of packet’s
destination makes it possible for a packet to reserve a long route compared with the
transpose trafﬁc (x’= (x+2) mod 5, y’=(y+2) mod 5). where the average hop count is 4.
In this case, the uniform traffic has 56.96% more average latency with slow packet

injection rate of 0.025 packet/ms compared with transpose trafﬁc. When the injection rate

88

is increased to 0.25 packet/ns, the reduced ﬂit injection interval reduces the port
reservation time for each packet by ~10X, thus decreases the average packet latency in

both uniform and transpose trafﬁcs.

Packet size v.s. awrage latency in 5x5 mesh Packet size v.s. awrage latency in 5x5 mesh
(uniform trafﬁc) (transpose traffic)

800 800

 

7 __A _-_‘1——v_17 if

,l_L _
f- injection rate: 0.025 packet/as
j - injection rate: 0.25 packet/n5

 

 

1

[E‘— injection ratezfo.025 packet/n3
600 M - injection rate: 0.25 packet/n5

 

 

600

 

 

 

400i

.,

Packet size(ﬂits) Packet size(llits)

Average packet latency(ns)
Average packet latency(ns)

ti
a

Figure 3.20(a) Uniform trafﬁc Figure 3.20(b) Transpose trafﬁc

In the uniform trafﬁc, the average packet latency doesn’t increase monotonously with
the increased packet size, which seems to deviate what we explained in the previous
paragraph. This is due to the applied random trafﬁc pattern compared with the
deterministic transpose trafﬁc, where no randomness exists and monotonous increase of
average packet latency is observed.

3. 6.3. APL Performance with Drﬁerent Router Architectures on Mesh NoC

This experiment evaluates the performance resulting from different NoC router
architectures. As what’s presented in 3.2, the on-chip router model in NOC can be
reconﬁgured to 12 different architectures characterized by the combination of three
functionally orthogonal architectural parameters. 6 different router architectures with
only input buffering shown in Table 3.2 are conﬁgured during the simulation. The
targeting mesh NoC size is adjusted to 4x4 and each simulation starts with 2000 cycles’
warm up period before calculating each packet’s average latency to ensure that NOC is in
the sustained trafﬁc. The injected packet is single—ﬂit format, where each packet contains
only one 64-bits ﬂit in this experiment. Figure 3.21 (a) ~ (f) presents the APL (average

packet latency) under different on-chip router architectures with the input buffer size

89

ranging from 2 to 8.
Table 3.2. Evaluated Router Architectures

 

 

 

 

 

 

 

NOC router architecture Buffering Arbitration Policy Routing algorithm
Ibuf_FP_XY Input buffer only Fixed—priority XY-deterministic
Ibuf_FS_XY Input buffer only Fair-sharing XY—deterministic
lbuf_FP_WF Input buffer only Fixed-priority WF—adaptive
Ibuf_FS_WF Input buffer only Fair-sharing WF—adaptive
Ibuf_FP_NL Input buffer only Fixed-priority N L-adaptive
Ibuf_FS_NL Input buffer only Fair-sharing N L-adaptive

 

 

 

 

 

 

With the increased trafﬁc injection rate, the average packet latency associated with all
6 different router architecture shows the similar trend under different input buffer depths.
At the very low packet injection rate of 1.5~2.5 packets/cycle, the APL is almost constant
with only ~2.68% difference under each architecture and buffer size. This small variance
results from the randomness of each simulated trafﬁc. In the low packet rate scheme, the
NOC handling capability is higher than PEs’ (modeled by TG model) injection rate, e.g.
the interval between two consecutive packets from each PE is larger than the time that
each packet spends to traverse from the source to the destination including the operations
of buffer read/write access, route calculation and crossbar forwarding, and arbitration.
Moreover, the output arbiter is only triggered when more than two packets from different
input ports are routed to the same output port. So the arbiter’s latency can be omitted here
since the triggering rate is very low of < 2% under the low packet injection rate. Overall,
the APL is mainly determined by the NoC intrinsic latency, which is contributed mainly
by input/output buffer, crossbar and interconnecting link, under the low packet injection
rate of 1.5~2.5 packets/cycle.

When TG gradually increases its packet injection rate, the APL starts increasing at a
given injection rate, and after that APL increases dramatically regarding to the injection
rate. This is also expected since the NOC packet delivery ability is overwhelmed by the
fast packet injection, and it results in the network congestion, where there are packets

staying in the queuing buffer waiting for the available routing resources. This congestion—

90

related latency takes a big portion of the APL at high packet injection rates. From3 2.1(a)

to (f) also presents this trend explicitly.

Average Packet Latency(cycles)

Average Packet Latency(cycles)

Average Packet Latency(cycles)

Input Butler Only, FP arbitration, XY routing

 

Input Buffer Only, FS arbitration, XY routing

 

   
  

I i-&- Buffer = 2 .

'--l
u
_U
.-

 

 

 

 

4000 —-+---Bufler=3 Ii
""5"" Buffer = 4 :5
30m -"'-' Buffer = 5 II: :-'
--Bufier = w
0" Bu=lier 7 ': "'
I -->--- Buller— =
1000 -
0
1 5

g

..a

E S

§

Average Packet Injection Rate(Packet/cycle)

(a) Ibuf_FP_XY router
Input Butler Only, FP arbitration, WF routing

Average Packet Latency(cycles)?

§

3000 ,

_33

0

WI --9-- Buﬂer7 =
, - +-- Butter- ~
MB -Buffer =
--A--- Buffer = 5
Butler = 6
.......0 Butter = 7
->-- Butler = 8

  
    
   

 

I
I
l

.u a. ..
I ... V .

  

 

 

 

 

    

4 5 79

Average Packet Injection Rate(Packet/cyc|e)
(b) Ibuf_ FS_ XY router

Input Buffer Only, F8 arbitration, WF routing

3

 

 

    

1
~
‘~

 
   

Average Packet Injection Rate (Packet/cycle)

(c) Ibuf_ FP_ WF router
Input Buffer Only, FP arbitration, NL routing

:99?

--9-- Buﬂer- =

-- +-- -Buffer =
WEI-- Buffer = 4
-qA--- Buffer = 5
Butter = 6
"“9" Buffer = 7

--D--‘ Buffer = 8 i’
O

 

-:::::::::-_g

33%;;

--

«(a-(«rumman-
-- - - - - Inc.

3

‘ Ira-—

S
r
.
5

 
 
 

 

3.5

1.5
Average Packet Injection Rate(Packet/cycle)

(e) Ibuf_FP_NL router
Figure 3.21 NoC Average Packet Latency with different router architectures

2.5

Average Packet Latency(cycles)

91

S

   

_' l ‘ 'l
--O--- Buffer = 2 , i
4000I I --+-- Buffer = 3 j
”a Butler = 4 ,'
3000 --A--~ Buller = 5 '9
.-.,-,3-.- Butler = 6 '1 j
"”0- Buﬂer = 7 .' - :
2000 ‘ -->--‘ Butler: 8 ’5 ' 5

S

 

1.5

 

 

 

 

  
  

Chm—2...- , _ , ‘ 1 500° __ ‘ 'L: .
II --e-- Buffer=2 : I I g; I --e--- Butter: 7 jjli 7:7
4000II _.+._ Buffer=3 ? +P+ ~29 ? :4); 4000: —+-- Buller=3 3I II}? III
., "um... Buffer = 4 : .5 g . :z : I ; "Ha-u. Buffer: 4 W ’ :
II I .5: ,5 ,- § 300011--A--Buuer=s‘ III-..- .'
I .. I: ' I: = 'l 36 - A;;:'---~ Butler: 6 1.3:: ,‘l
.., l 5 g f ' .— ....°.... Butler ,__ 7 :Ii: p
[I 5 I .3 1':- '
2000IL__’__ Buffer=8 13+ :9 f 8 2000 __,.-.Buﬂehar 5% }
I ‘ : , »' c; ’.’I:' I
3
m
>
<

2 2.5 3 7

Average Packet Injection Rate(Packet/cycle)
(d) Ibuf_FS_WF router

Input Buffer Only, FS arbitration, NL routing

 

I

   
 

 
 
 

 

 

  
  

 

 

1.5.5 4. 5
Average Packet Injection Rate(Packet/cycle)

(i) Ibuf_FS_NL router

The APL curve in 3.21(a) to (f) for each input buffer size has a maximum achieved

packet injection rate (MAIR), on which a small increase of injection rate results in a very
large increase of APL. For router architecture of lbuf_FP_X Y, the MAIR is 4.2
packets/cycle when the input buffer depth is 2, while the MAIR promotes to 4.9
packet/cycle when the buffer depth is 8. This indicates that increasing buffer size to 4X
results in 14.28% improvement of MAIR. Another phenomenon is that the injection rate
resulting in high APL under small buffer size corresponds to a much lower APL under
higher buffer size. In NOC with Ibuf_FP_X Y router architecture under 4.4 packets/cycle’s
injection rate, the APL is ~5000 cycles with the minimum buffer size of 2, and it reduces
by 36% to ~3200 cycles when the buffer size is increased by 50%. When the input buffer
size reaches 5, the APL reduces further to 1400 cycles, which is 3.57X smaller than the 2-
entry input buffer’s case. This simulation result shows that it is potential to get 3.57X
performance speedup by the price of 1.5X more memory area. In some applications
where the APL is critical, increasing the input buffer size selectively is a good design
practice.

The APL is not only inﬂuenced by the input buffer size, but also the other two
architectural parameters including the arbitration policy and routing algorithm.
Comparing Figure 3.21(b) with Figure 3.21(a), we notice that changing the arbitration
policy from ﬁxed-priority to fair-sharing increases the APL performance greatly and
MAIR under 2—entry and 8-entry buffer increases by 1.76X and 1.79X, respectively. This
is because the fair-sharing policy arbitrates the output buffer in a round-robin way where
each input port has the same accessing possibility to each output port. In consequence the
starvation of some low-priority input ports in the ﬁxed-priority arbitration is reduced or
eliminated. The starvation directly results to the very high latency for the packets injected
from these low priority input ports, and it also increases the APL shown in Figure 3.21(a).
The overhead with fair-sharing is the additional hardware resource such as FSM, and
priority rotating logics. The synthesis result shows that the fair-sharing arbiter has 2.16X

silicon area compared with ﬁxed-priority arbiter, which uses only one comparator and a

5-entry ROM. So in the area-critical situations, the FP arbiter is still practical if the
performance requirement is met.

Compared with XY—deterministic routing, WP and NL adaptive routing algorithms
have the potential of choosing the appropriate route adaptively according to the current
NOC trafﬁc status. Applying this routing algorithm the congested packet doesn’t need to
wait for the availability of the designated output port calculated by deterministic XY-
routing algorithm. The price associated with the adaptive routing is the potential of
increased hop count, which is the number of routers a packet traverses through before
reaching the destination. In some cases, the same packet will pass through the same
router multiple times, and this scheme introduces heavy latency overhead. Unfortunately,
the experimented random trafﬁc falls into this case. As shown from Figure 3.21(c), (d),
(e) and (f), the MAIR of NOC with Ibuf_FP_WF router is 2.27X and 2.24X less than
Ibuf_FP_XY’s at the input buffer depth of 2 and 8, respectively, and NoC with
Ibuf_FP_NL router architecture has reduced MAIR of 1.9x and 1.53X at buffer depth of
2 and 8 compared with their counterparts of XY—deterministic routing. Another effect of
the increased hop count in the adaptive routing algorithm is the steeper curve when the
injection rate reaches the MAIR shown from 3.21(c)~(f). This trend does only show that
the adaptive routing has worse APL performance at this dedicated random trafﬁc
resulting from the increased average packet hop counts. In some applications where the
network congestion can be estimated, the adaptive routing algorithm has the potential to
increase the N 0C throughput.

In the adaptive routers, the architecture selection of output arbiter still results in the
same effect as shown before due to the reduced possibility of starving the low priority
input ports. When the input buffer size is 2, the MAIR of Ibuf_FS_WF architecture is 3.3
packets/cycle, which is 1.78X larger than is FP arbitration counterpart.

From these simulations, the NoC model is veriﬁed to deliver packets successfully

inside the network with multiple concurrent traffic ﬂows. With router and link’s latency

93

models, the average packet latency on different NOC conﬁgurations and trafﬁc patterns
can be evaluated and direct designers tO choose the appropriate parameters during SOC
interconnection design.
3. 7. Integrate PE and NOC into the 50C Simulation Framework

After PE and interconnecting NOC models are developed independently, they are
integrated together to form the complete SOC simulation framework not only for
applications’ functionality veriﬁcation and timing estimation, but also for the system-
level power estimation, which is becoming signiﬁcantly important for ever-demanding
low-power system design, and system-level simulation-based design space exploration.
This section will propose the methodology Of integrating models Of computation cores
and communication fabrics into an application-driven SOC simulation framework by the
case study Of M-JPEG encoding system. And the same methodology also applies tO other
complex applications.
3. 7.1. Network Interface for Bridging NoC and P155

Each PE performs computation tasks and sends data via NOC for inter-PE
communication. When the system scale becomes large, it is important tO separate
computation and communication models for better modularity and portability.
Orthogonizing computation and communication indicates that PBS in a SOC are
transparent tO each Other, and communication fabrics are invisible from each PE’s point
Of view. When a PE initiates a communication Operation tO send data tO another PE via
NOC, it only injects the destination PE’s address and the data tO NOC without handling
any network protocol information. Also each PE has different data processing granularity
represented by the minimum amount of data processed together for its tasks as well as
different data formats. The heterogeneity in PE’s data granularity asks for different
packetization methods before delivering them into NOC. On the contrary, NOC routers
don’t perform the SAR functionality in that they have nO knowledge Of the data format in

packets traversing through them but delivery them to the destinations. TO bridge PEs that

94

are data-sensitive with the NOC featuring data-insensitive, network interfaces (NI) are
required when connecting PBS with NOC tO maintain the separation between computation
and communication.

3. 7.2. Buﬂered Network Interface Model

NI performs data packetization to encapsulate data blocks from PBS into packets
according tO the communication protocol in NOC, and it also de-packetizes the incoming
packet back into the PIE-recognizable data once the packet is accepted by the NI’s
associated router. Another functionality NI has is tO perform data buffering before PE
injects them intO NOC. Similar tO the NIC (Network Interface Card) used in personal
computers connecting tO the Internet, buffering between PE and network improves the
system throughput and reduces the penalty of network congestion. This work implements
a PE-speciﬁc NI model with bi-directional data buffering. Figure 3.22(a) describes the
connection of PBS with 2x2 mesh NOC via four N18 for M—JPEG application. And Figure
3.22(b) presents the internal architecture Of the buffered NI model.

N I has two input ports and two output ports, where pe_out_data and ni_out_packet
form the egress channel from PE to NOC, and ni_in_packet and pe_in_data compose the
ingress channel from NOC to PE. A data buffer and a packetization logic are located on
the egress channel between its two data ports, while a packet buffer and a de-
packetization logic are set on the ingress channel. The packetization and de-packetization
logics are two independent ﬁnite state machines shown in Figure 3.22(a) and (b). During
data packetization, PE ﬁrst sends the address indicator, followed by the current packet’s
destination address, the packet size (number Of ﬂits), ﬂit size (number Of data a ﬂit
contains) and the real data payload tO the data buffer. And the packetization logic
accesses data from the buffer. From 3.23(a), the packetization logic stays in the NI_IDLE
state until it receives the effect destination address information, which moves it to
NI_ADDR state. Receiving the next data (packet size) puts it into NI_PLENGTH state
followed by NI_FLENGTH upon receiving the ﬂit size. It stays in NI_FDATA state until

95

a ﬂit is constructed, and moves to NI_SENDF state to inject this ﬂit into NOC. The state
machine will return to NI_FDATA state tO receive the next ﬂit’s data. Upon receiving all
ﬂits a packet contains, the state machine will go back tO NI_IDLE state waiting for the
next packet. The similar procedure applies to the de-packetization state machine shown in

Figure 3.23(b).

(1) VLC encodinm NOC @ (1)FiIe read
r — _ _ _ ‘ (2) RGB-YUV
I l (3)Downsampling
I (4)bit-stream form

L.____)
I® @-

(1) DCT (1) Quantization
(2) Zigzag-scan

Figure 3.22(a) SOC with 2x2 mesh NOC and N13

Network Interface (N1)

 

 
    
       
  
      

(1 data buffer ni ouLpacke!
e out at —
P _ _ a Packetization

pe_in_ data

b

ni__in_packet

 

Figure 3.22(b) NI architecture

N1 is PE-speciﬁc according tO its PE’s supported data format. This speciﬁcity is
reﬂected by the buffer’s data width in accordance with the PE’s output port. For an NI
bridging a PE featuring 8-bit output port, its data buffer width is set to one byte, while NI
will have 32-bit width data buffer when connecting a processor core with 32-bit GPIO.

When a PE is initialized, its associated NI will be instantiated by specifying the input data

96

buffer width as well as pe_in_data port width. The Nl’s packetization and de-
packetization logics are also conﬁgured according tO the data buffer width and the NOC’s
ﬂit width since a ﬂit may contain multiple data elements for efﬁciently utilizing the

network bandwidth.

No address
No address

indicator

 
   

 

 
 

address
received

  
  

No packet
length received

    

Received_data_nu N o ﬂit size received

m < ﬂit size

  

'Received_data_num: Number Of data received in a ﬂit |
lFIit_num: Number of ﬂits constructed in a packet I

FIit_num < packet size

Figure 3.23(a) FSM for data packetization in N I

Effective packet not

received F

lit_num = packet size

packet size not
extracted

    
 

Effective packet
received

    
  
   
   

  

packet size extracted

eceived_data_num = ﬂit size &&
Flit num < packet size ®

 

ﬂit size not
extracted

 
  

IReoeived_data_num: Number of data received in a ﬂit I
FIit_num: Number Of ﬂits constructed in a packet
h --------------- A

Received_data_num < ﬂit size

Figure 3.23(b) FSM for data de-packetization in N1

97

In the system-level, N1 is also modeled as SystemC sc_module object for modularity
and functional encapsulation. SystemC interface sc_ﬁfo_in_if and scjifo_out_zf are
applied to set its input and output ports’ attribute for gluelessly bridging PE and NoC.
Although NI’s input/output ports to PE side may have non-uniform data width for
different cores, its 108 to NoC side have exactly the same data width equal to the NoC’s
ﬂit size. The C++ standard template class queue is employed to model the NI’s
data/packet buffer and it encapsulates all the required Operations that are provided as
member functions for users to invoke directly. The packetization and de-packetization
logics are modeled explicitly by two concurrent sc_thread processes, which are
scheduled for execution triggered by the incoming data/packet.

3. 7.3. Put Everything Together: SoC Simulation Framework xSoCSim

PE, N1 and NoC (interconnection of routers via links) form the 80C component library
and the top-level SoC simulation framework is constructed by integrating PEs, N15 and
NoC together. PE handles the computation tasks, NoC provides the inter-PE
communication services required by data transmission between tasks mapped on different
PBS, and NI provides seamless bridging between PE and NoC. As what’s discussed
above, an abstract class encapsulating its functionality models each SoC component.
Integrating these components into a multi-core SoC simulation model involves the
following steps described in Figure 3.24.

(1) Application analysis and task partition extract tasks from the application
speciﬁcation represented by C++/SystemC. A task is a function call and these tasks may
be dependent or independent. A task graph integrates the partitioned tasks is derived to
represent the application at the task-level. Cmrently the application task partitioning is
still manual without any automation tools.

(2) PE instantiation creates the required PE objects with architectural information
represented by the macro-models. The initial operating frequency is also assigned. For

the ASIC core, the parallelism is explicitly represented by the number of concurrent

98

processes initialized by its class constructor.

(3) Task macro-operation annotation inserts the wait( ) function call parameterized by
macro—operations for each task. Each macro-executor in the task is annotated with a delay
function representing its execution time or multiple macro-executors are annotated with
one wait( ) emulating their cumulative execution time.

(4) Task mapping associates each PE with tasks. For microprocessor mapped with
multiple tasks, the sub-graph representing these tasks’ dependency needs to be provided
to the PBS TSE for task scheduling. Each task is hooked up with the PE architecture by
calling the PE’s AddTask( ) method, which associates the PE’s macro timing-model with
the macro-operations annotated to this task. For ASIC core, only tasks with explicit
concurrency are mapped onto it, and the parallelism is resolved explicitly by employing
multiple sc_thread processes in the current work. If different tasks mapped onto different
PEs have data communication, the communicating pattern needs to be speciﬁed explicitly
for each PE by inserting code accessing the NI to write these data to NoC.

(5) NoC instantiation creates the on-chip communication architecture that interconnects
these PBS. The topological, architectural and technological information is provided for
initializing NoC components (routers and links). The NoC class constructor further calls
its associative router and link’s constructors to create their objects and performs port
binding according to the NoC topology to form the NoC. The NoC’s buffer size, ﬂit size
and routing algorithm are also setup during its initialization.

(6) N1 initialization creates the bridging NI objects for different PBS to connect with
NoC. A PE can not send data into NoC until its NI is initialized and connected with this
PE. For each PE, its N1 is mapped to its memory space so that N1 is treated as PE’s
peripheral during the simulation.

(7) Connecting PEs with NoC performs the ﬁnal connection between PBS and NoC
through NI’s bridging. Two sets of channels, one of which connects PE with N1 and the

other links NI to NoC, are instantiated.

99

Until now the complete SoC simulation model and its associated methodology named
xSoCSim supporting multiple heterogeneous cores connected with on-chip network is
created and ready for simulation. The top—level framework instantiates the 80C model
and performs cycle-approximate simulation, where each PE executes its mapped tasks
and sends data via NI to NoC for communication. After the execution, the total execution

time as well as each PE’s timing overhead will be provided for analysis and optimization.

 

    

. /

1 Application 5 ~\ 5’3” W
speciﬁcation ;

(C/C++) i % funci- func2 /_ﬂ __\ --, funcZ
3 / i I” .
1 (a)Macro-operation ‘~.\T\‘(a"”‘f’/k_..\* T\2(anno’)/,
’ i annotation

Application i _f- .{ncii

3 Task Moel : /

i : \Tj’jaT‘ﬂ/

"‘ ‘ ‘ " “""‘(1)Appiieation analysis fun“

 

8. task partitioning

task graph

5»—
T4(anno) /

\
\

a nnotated

  
 

(2)PE instantiation

' nd ‘ task ra h

(4)Task mapping Q L) g p

:Dg . (7)Connect PEs with Not:
' | -------------------- I /~—4‘\ , ~\

IE] [El

Figure 3.24 xSoCSim simulation framework by integrating PEs, N13 and NoC

  
  
 

  

PE library
(ARM. PowerPC.
FFT...)

 

 

 

i
l
i
i
i

z...“ _ .. ___, . , A
I (6)Nl instantiation
Ni library :>

 

 

xSoCSim model

aI-yotao‘

 

3.8. M—JPEG NOC-based SoC Simulation Model Case Study

We use the M-JPEG application as a case study to demonstrate the proposed SoC
simulation framework. M-JPEG is a motion picture encoding algorithm that encodes each
frame using JPEG engine without considering the inter-frame temporal and spatial
redundancies. Its functionality speciﬁcation is written in C++ containing eight tasks: (1)
Input frame read, (2) RGB to YUV color space conversion. (3) 2: 1:1 down-sampling, (4)
2-D DCT transformation, (5) Quantization, (6) Zigzag scan and data reordering, (7) VLC

and Huffman encoding, and (8) bit-stream dumping. Tasks (1)-(8) apply to each input
frame to convert the input motion video into a serial of JPEG images as the task graph in
Figure 3.25 shows. The computation is performed on 8x8 data block in each task except
for (8). Task (3)’s output is 8-bit for each YUV component while tasks (4) ~ (7) operate

on 32-bit ﬁxed-point data for better precision.

1 2 3
read frame
block

   

 

 

6
8 7 .
Figure 3.25 Task graph of M-JPEG application

Although there exists intra-task-Ievel parallelism in tasks (4), (5) and (6) since each
down-sampled YUV component can be processed independently, we don’t exploit it in
this case study and all YUV components share the same processing element.

We employ the MPSOC (Multi-Processor SoC) architecture, where each PE is
implemented by a microprocessor and they are interconnected through the NOC, to
exemplify the construction of M-JPEG SoC simulation framework. Assume 3
microprocessors, IBM PowerPC405, Xilinx MicroBlaze and MIPS PISA architecture, are
supported in the PE library. The macro-operations of PowerPC405 and MicroBlaze are
characterized by Xilinx RTL-level simulators, and PISA is characterized with micro-
architectural simulator, SimpleScalar. We will not go into the detailed characterization
process here. A 2x2 mesh NoC architecture provides the interconnection. To construct the
80C simulation model under the xSoCSim framework, the following steps are performed
sequentially:

@PE instantiation: Four PEs, including two MicroBlaze, one PPC405 and one PISA, are
instantiated with their operating frequency of SOMHz, 100MHz and 3OMHz. And each

101

processor is assigned an address from 0x00 to 0x03 in correspondence with NoC router’s

location. The SystemC description to initialize these PEs is shown as follows

 

m _pe[ 0] = new CPUL("PowerPC", PPC405, 100, 0x00. T RUE);
m_pe[1] = new CPUL("DCT", MB. 50, 0x01, TRUE):

m _pe[2] = new CPUL("Quant", MB. 50, 0x02, T RUE);
m_pe[3] = new CPUL("VLC", PISA, 30, 0x03, TRUE);

 

 

 

@I'ask-level macro-operation annotation: Each task is annotated with the macro-
operations, and each function and its child functions inside the task are annotated in an
iterative way. The following code describes the annotated code for the l-d DCT that will

be called by 2-d DCT task.

 

void fdct_ld(int row[ 8], int row_res[8], int bias, PEL *core)
{
ﬂoat x0, xl, x2, x3, 10, tl. t2, t3, t4, t5, t6, t7;
(0 = (float) (r0w[0] + r0w[7] - bias);
latency = sc_time(2* core->D VS_ratio * core->period * core->latency_ L UT]ALU_ 0P_2], SC_NS);
wait(latency);

t7 = (ﬂoat) (r0w[0] - r0w[7]);

latency = sc_time(core->D VS_ratio * core->period * care->latency_L U TIAL U_ 0P_2], SC_ NS);
wait(latenqy);

.................. / /Olher operations

 

 

 

(@I‘ask mapping: tasks (1), (2) (3) and (8) are mapped to the PowerPC 405, task (4) is
mapped on to MicroBlaze located at address 0x01, tasks (5) and (6) are mapped onto
another MicroBlaze core addressed by 0x02, and task (7) is mapped on PISA. Each core

object invokes its AddTask( ) to hook up its assigned tasks described as follows.

 

m _pe[0] -> AddTask(1, &read_ppm_pic);
m_pe[0] -> AddTas/t(2, &rgb2yuv);

m _pe[0] -> AddTask(3, &d0wnsample):
m_pe[0] -> AddTask/4, &dump_jpeg_data);
m_pe[1] -> AddTask(1, &fdct);

m_pe[2] -> AddTask(I, &quant);

m _pe[2] -> AddTask(2, &zigzag);

m_‘pe[3] -> AddTask(1, &vlc_huﬁinan);

 

 

 

@NOC instantiation: a 2x2 mesh is created by reading the NoC conﬁguration ﬁle, which

102

specifies the topology, input buffer size and other interconnection parameters. The NOC
ﬂit size is set to be 32-bit and packet size is 10 (including header ﬂit, 8 data ﬂits, and one
CRC tail ﬂit).

@Nl conﬁguration: NIs connecting to different PEs have different packet formats. For
the packet that is sent from PowerPC containing the down-sampled YiYgUV data, each
32-bit ﬂit is divided into 4 parts, each of which stores one component. For the two
MicroBlaze cores running tasks (4), (5) and (6), each ﬂit only contains one 32-bit integer.
For the PISA running the VLC encoding algorithm, each ﬂit consists of two encoded 16-
bit word for the output JPEG bit-stream dumping. In this sense, three different NIs are
initialized to handle these different packetization/de-packetization requirements.
@PE-NOC integration: the ﬁnal step is to connect PBS to the NoC through N13 to
construct the top-level SoC module for simulation. The test bench instantiates the top-
level simulation model to perform the application simulation in the source code level
without compiling it to the target dependent binary executables. Each processor core
executes their mapped tasks along with the annotated timing macro models scheduled by
the PE model’s TSE. Once a processor needs to send its result data to another one, it
accesses the NI, which is mapped to processor core’s memory space, to inform it the data
payload and destination information. NI keeps receiving the data to assemble the packet
before injecting it into the NoC, which routes the information to the router connecting to
the processor core mapped with the next dependent task. PE and NoC operation
collaboratively until all the data are processed and the result is dumped into an output file
by one of these 4 processor cores.

Figure 3.26 presents the ﬁnal SoC architecture mapped with M-JPEG application. The
PowerPC at tile #(0,0) operates on the frequency of 100MHz, the two MicroBlaze cores
at tiles #(0,1) and #(l,0) works on the sample frequency of 50MHz, while the PISA core
located at tile #(l,1) operates at 30MHz. The different PE frequencies setup is not

optimal but just shows the capabilities of the simulation of heterogeneous SoC under this

103

framework. The NoC is clocked by the same frequency of 100MHz. The NoC on-chip
router architecture set to lbuf_FP_XY with the input buffer depth of 4. Due to the
deterministic XY—routing algorithm, 4 possible routes between NoC nodes can be
identiﬁed as comm ,(#(0,0) —> #(l,0)), comm2(#(l,0) —> #(0,1)), comm3(#(0,1) —> #(1,l)).
and comm4(#( 1,1) -> #(0,0)). The input video stream is a set of 64x64 frames in the ppm

format, and the output is dumped into the corresponding set of ﬁles in the jpeg format.

???AHZ- cor_nm4 30MHZ
l m MicroBlaze ....-....§9'.“.VD§.-...> PISA

 

Q

 

 

 

(1) Q-uantization . (1) VLC encoding
(2)2igzag-scan' .

: l:

3 E; i 100MHz

: z = JUL

: 5 . :+—

I I i

i E

r t v - 3
100MHz ® ‘@ 50MHz
.ouerPC corny-n2 \

(.1)File read comm1 > 1 FDCT
(2)RGB—YUV ( )
(3)Downsampling
(4)b‘rt—stream form

Figure 3.26 SoC simulation model mapped with the partitioned M-JPEG application
Figure 3.27(a) shows the cycle count of each mapped PE to process one 64x64 video

frame, and ﬁgure 3.27(b) depicts the corresponding cycle count of each communication
route. For each 64x64 block, the computational latency is 7.84ms, 43.1]ms, 33.52ms, and
35.96ms on each PE located at tile #(0,0). #(l,0), #(0,l) and #(l ,1), respectively. And the
communication latencies on the four inferred routers are evaluated as 0.09ms, 0.26ms,
0.09ms, and 0.22ms under the 100MHz NoC frequency. Since the data is processed in the
pipelined fashion in each PE, so the average throughput is estimated by considering the
worst case PE with maximum processing latency, and the communication latency is not
considered since it is >100X smaller than the computation latency in this simple case
without heavy communication trafﬁc. Among the 4 PEs, MicroBlaze at #(l,0) running at

SOMHz has the maximum latency of 43.11ms, the average throughput of M-JPEG

application is 23.19fps (frames per second).

 

PE cycle overhead

frequency(M Hz)

 

 

PowerPC MBlaze #0 MBlaze #1 PISA

 

 

Figure 3.27(a) PE cycle count on M-J PEG

 

Comrrunication cycle overhead

    

30000
25000
20000
1 5000
1 0000

5000

commt ' comrnZ comma comm4

 

 

Figure 3.27(b) Communication cycle count on M-JPEG

105

 

4. SYSTEM-ON-Cl-IIP POWER MODELHVG FRAMEWORK
Nowadays VLSI designers are facing ever-stringent challenges dealing with power on
two folds. First of all, the continuously shrinking transistor size squeezes millions and

billions of transistors clocked at GHz-level frequency. This trend aggrandizes the

dynamic power consumption based on Pd," = 0.50CVj, f , where or, C, Vdd and f represent

the switching factor, circuit capacitance, supplying voltage and operating frequency
respectively, along with the ever increasing chip size. Secondly, the battery technology
advances at a much slower rate compared with transistors and it needs VLSI designers to
adopt low-power techniques to extend the battery life for the SoCs mostly used in
handheld devices. Power/energy modeling serves as the core part in low-power
techniques to estimate the system’s power/energy consumption for ﬁne-tuning the design
to meet its power budget. It is also believed that the higher level the power optimization
is performed, the more power savings that can be achieved although power estimation at
high-level may not achieve the same accuracy as what low-level models do. Take these
two factors into consideration, a system-level power modeling framework based on the
80C simulation model is developed to provide fast power/energy estimation for multi-
core SoCs targeting multimedia applications.

Based on the levels of abstraction where the power model is built, there are different
methodologies to estimate system’s power consumption from circuit-level considering
each transistor’s switching property to system-level evaluating the input/output’s
transition statistics. In our work different power modeling methodologies targeting
different levels of abstraction are employed to estimate the SoC’s power consumption in
the same framework. For PEs handling computational tasks, power macro-modeling is
applied to abstract off their low-level implementation details using statistical regression
linear ﬁtting to derive a closed form formula for each macro-operation’s power/energy
consumption. For routers and links in NoC, an analytical power model is developed by

evaluating their micro-architectures since they are relatively stable compared with

106

various PE architectures. These models provide power functions that can be called by the
simulation framework to evaluate the power during run-time without delving into the
circuit- or RT-level transition details. Using a function call to model power also offers the
ﬂexibility that different power models can be applied to the same functionality block by
just changing the corresponding power function, and it makes the 80C simulation
framework more modular and maintainable.

This chapter is organized as follows. Section 4.1 reviews the related research work in
power modeling and proposes the motivation of raising the abstraction level in multi-core
SoC power estimation. Section 4.2 will propose the microprocessor’s power macro
modeling methodology where the level of abstraction is raised from instruction-level to
macro-operation—level to escalate the power estimation speed while keeping reasonable
accuracy. The analytical power estimation of NoC components will be proposed in
section 4.3, where both dynamic and leakage power are addressed. Using the analytical
power model doesn’t preclude the capability of employing statistical high-level NoC
models, once available, in the framework. It also shows the openness of the proposed
framework that can adopt heterogeneous power models on different functional blocks,
which has not been addressed before. Section 4.4 presents the integration of these power
models with SoC simulation framework to form a run-time application speciﬁc SoC
power simulator.

4.]. Related Work on Power Modeling

SoC power/energy estimation essentially consists of modeling the power consumed on
its cores including both PBS and NoC. On modeling PEs especially embedded
microprocessors, various techniques based on instruction-level characterization and
simulation of the underlying hardware have been proposed[l6] [58]. In [15][16]
microprocessors are characterized by gathering its ISAs’ energy data through
experimental methods by running applications on real prototype boards. Isci et al. [15]

propose the coordinated measurement approach that combines real power measurement

l07

with performance-counter—based, per-unit power estimation, and applies it to the Pentium
4 processors. In [16][59], ARM and SH3 microprocessor models are created by recording
the power consumption of each instruction to a spreadsheet-like library, which is invoked
by program models during simulation to get total power consumption. Characterizing
hardware achieves good precision, but it needs prototypes of target microprocessors,
which are not available in early design stages of an application-speciﬁc SoC. As an
alternative, architectural modeling mitigates requirements of real hardware for instruction
set characterization. It models micro-operations of the functional blocks such as
execution unit, cache, and branch predictor inside the microprocessor by creating these
blocks’ capacitance models as what Wattch does. Wattch[60] is built upon
SimpleScalar[IO] framework and achieves a speedup of 1000X or more compared with
circuit-level power tools, and yet maintains accuracy within 10% of their estimates as
veriﬁed using industry tools. Another cycle-accurate power estimator, SimplePower[6l]
is proposed to simulate 5-stage pipeline microprocessors and it employs a transition
sensitive energy model [62].

Compared with microprocessor whose internal signal transition can be abstracted to its
ISA, the hardwired ASIC does not have such a minimum operation set to characterize.
Low-level SPICE-based power estimation is performed on circuits containing small
number of transistors for a high accuracy[63]. And RT-level power model estimates the
circuit’s transition activity using some input vectors and derives the power
consumption[63]. While as circuit size and complexity continue increasing with
exponential pace, macromodeling[64][65][66] for high-level power estimation has
attracted considerable attention over the past few years. In general, the macro-model is
based on a look-up table (LUT) search and the LUT is created based on characterizing the
circuits using input vectors with specific statistical distribution. The most frequently used
parameters are the average input signal probability P;,,, the average input transition

density Di", and the average output transition density Dam. The metric Dam is evaluated

l08

using zero-delay simulation. The LUT reports estimates for equally spaced discrete
values of these parameters. For input characteristics that do not correspond to any LUT
entries, estimates are obtained using an interpolation scheme[65]. Several modiﬁcations
to the macro LUT method are introduced in [66] to improve its accuracy.

NoC power/energy modeling has attracted extensive research efforts since the NoC
concept was proposed in [67]. Ye et al. [68] analyzes the power consumption in the NoC
routers and proposes the “bit energy” models to estimate the routes’ power consumption.
Wang et al. [25] presents ORION, a power-performance simulation model with analytical
dynamic power models of NoC routers based on LSE framework. [69] extends the
ORION by adding the leakage power model, which is based on empirically
characterizing some frequently used circuit components. Interconnecting links, especially
in deep submicron processes under 180nm, are also extensively researched from the
energy point of view. Gupta et al. [70] presents a high-level interconnect power model for
wires of a single core chip. Bokoglu et al. [71] proposes the wire delay model
considering the optimal repeater sizing and spacing. Ho et al. [72] points out the power
consumption of the delay optimal repeated wire is prohibitively large, and introduces a
methodology to save wire power by increasing repeater spacing and decreasing repeater
size. Heo et al. analyzes the power of the global wires in the on-chip network [57] using a
ﬁrst-order RC model.

4. 2. PE Power Macro-modeling Methodology

In the communication centric NOC-based SoCs, the power consumption from both PE
and NoC needs to be modeled to evaluate the energy perspective of the system. For the
PE, macro-model method is employed to estimate processing element’s power based on
characterizing a set of common operations the target PE supports; and NoC is modeled in
an analytical methodology where the circuit-level properties are analyzed and elaborated
to calculate the communication power. These two distinctive power-modeling threads

(statistical-model and analytical-model) show the ﬂexibility and openness of the

l09

proposed framework, and users can easily switch from one power model to another one
for SoC blocks to trade off the simulation speed versus modeling accuracy.

4.2.1. Programmable Processor Power Macro-Model

 

+ Simulation speed _
function

 

 

 

void fft(int ‘3, int *b)

 

   

Instruction

 

 

 

DMUL R0, R0, R3
LD R4, 0(R2)
DADD R0, R0, R4

Loop: LD R0, 0(R1)
RTL J

 

 

 

Layout+
ckcuﬂ

 

 

 

estimation error

I
I
I
I
I
I
I
I
l
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
V

Figure 4.1 Abstraction levels of microprocessor modeling

Figure 4.1 shows the abstraction levels of microprocessor power models starting from
the lowest circuit-level. In the lowest hierarchy, the microprocessor is modeled by the
actual silicon layout or transistor netlist. The power estimates are highly dependent on
transistors’ sizes and switching features, which are very slow to simulate. At the middle
hierarchy, models are created at the register-transfer level (RTL). It takes the signal
transferring among different logic blocks into consideration. The RTL model is mainly
used by ASIC designers. The Instruction-level model lies on the top of the RTL model
and it takes each instruction as an atomic operation from energy perspective, although an
instruction is composed by several micro-operations. As stated before, instruction-level
power modeling cannot meet the simulation speed requirement in high-complexity SoCs
with embedded microprocessor for efﬁcient design space exploration on real applications.
At a higher level of hierarchy, function-level energy model characterizes the energy
consumption with different parameter sets for the frequently invoked functions in the
software code, and builds a linear regression model for each function’s energy

consumption. It can achieve a very high-speed in microprocessor’s energy estimation but

“0

suffers less efﬁciency and ﬂexibility since a real application may contain hundreds or
even thousands of frequently invoked functions and characterizing them needs huge
amount of effort that may pay off in the gained speed-up.

Recall that chapter 3 deﬁnes the macro-operation set that provides a minimum number
of basic operations from which all the applications can be constructed. As a compromise,
modeling macro-operations’ energy property raises the level of abstraction compared
with instruction—level, and reduces the characterization effort compared with function-
level model. More importantly, different microprocessor architectures have the same
macro—operation set and it also conforms to the retargetable energy-modeling scheme,
where architecture-dependent cross-compilation is eliminated similar to the execution
time model described in chapter 3.

As stated before, a C source code line is regarded as a macro-executor, such as
YIi]=a*X[i]+Y[i]. Each macro-executor may contain one or more macro—operations,
each of which is compiled into multiple instructions. Since energy is additive, the energy
consumption of a macro-operation is the sum of its compiled instructions’ energy and the
macro-executor’s energy can be estimated by adding its macro-operations’ energies.
Consider a simple loop code that is compiled to the assembly code shown in Figure 4.1:
for(i=0; i<1000: i++)

Y[i]=a*X[i]+Y[i];
Assuming the energy values of these instructions are: Eu), EDSUB, EDADD, E50, EDMUL, EBR,

respectively, then the energy consumption of the C code segment is computed by:

E = n X [(ELD + EDSUB + EBR)+ (ELD + EDSUB + E51) + EDMUL )] (4-1)

for_loop
where n is the number of iterations this for loop takes.

From Equation (4.1) two macro—executors can be extracted: for loop body and
Y[i]=X[i]*a+Y[i], with the energy values of Ew+EDSUB+EBR and
Ew+EDSU3+ESD+EDMUL, respectively. However, such simplistic approach has limitations.

The ﬁrst is the accuracy of energy model for a single instruction. Real measurement is

Ill

difficult and pipelining of execution makes the estimation even harder. The error will
accumulate as the overall energy is the summation of all instructions. The second issue is
the energy consumption for inter-instruction, which results from switching between two
consecutive different instructions. In order to generate a relatively accurate energy model,
the empirical method is employed to extract energy data for the macro-operations. The
same macro-operations as what are used for execution time modeling are employed
directly in the energy estimation of embedded microprocessors.

The first step is characterizing the energy consumption of each macro-operation. In this
work, it is extracted from the energy data generated from execution of benchmarks on
Wattch. The methodology can be applied to extract from physical measurement of a
processor or other energy aware simulation methods. Statistical methods are employed to
design the experiment and to process the raw simulation data. Benchmark programs with
small segments of C code are created by repeating the same type of macro-operation for a
pre-deﬁned number of iterations. Wattch runs the benchmark to generate three parameters
after the C code is compiled: sim_num_inst, sim_CPI, and energy_cycle, which represent
the number of simulated instructions, average CPI (cycles per instruction) and energy
consumption per cycle. The total energy consumption in the execution of benchmark can

be computed from:
Energybm,,mm.k 2 Sim _ num _ inst >< Sim _ CPI x energy _ cycle (4.2)

Similar to the method to eliminate the initialization main( ) function’s execution time in
chapter 3, a code segment containing only main( ) body is simulated to get the
corresponding sim_num_inst,,,a,-,,, sim_CPImam, and energy_cycle,,m,-,,. To calculate the
energy consumption of each macro-operation, the main( ) initialization energy is removed
by:

N XEnergyuperution : Energybenchmark — Energy main (43)

where N is the number of iterations a macro-operation in the benchmark takes in

simulation and Energymm is main( )’s initialization energy expressed by:

112

Energymm = sim _ num - instmum x sim_ CPIM." X energy_ cyclemain (4.4)

Ideally, each macro-operation’s energy consumption is linear and passes through the
origin in terms of iteration number in the benchmark. In this step, each macro—operation
is simulated by different number of iterations from 100 to 1000 with the step of 10
iterations. Figure 4.2 shows the energy simulation results for two macro—operations
OP_ALU_2(2-operand ALU operations) and OP_MUL_2(2-operand multiplier
operations) for PISA processor with hardware multiplier. They are both approximately
linear to the iteration number with ~0.l% average non-linearity and a non—zero offset,
which originates from the inherent operation mechanism in embedded microprocessors
with cache. At the start of program execution, the cache is empty and the miss rate is
nearly 100% and it brings additional energy consumption to ﬁll the cache from the
external memory. After that, the looped benchmark program accesses the cache
frequently and hence results in a linear increase on the energy consumption. We employ

the regressive linear ﬁtting method to minimize the mean square error on these pre-

computed energy data to estimate each macro-operation’s energy consumption.

 

    
  

 

 

 

 

 

3.5 . . . . 1% a a . s . r

{109” t xtoe+8
3.0 3 5.

K i

9’ a
2.5 o

“2

O 0
2.0» t; g

i 3“ “E
1.5- O 8

2- 0 i
1.0
0.5 1' l
iterations iterations

" 15 25 35 4b 55 Sb 70 86 95 “0 o 10 20 30 40 50 60 76 80 90 100

Figure 4.2(a) Power of 2-operand ALU operation (b) Power of 2-operand Multiplier operation

With these characterized macro-operation energy data, an energy lookup table (LUT)
is composed with the similar way that constructs the execution time model. Each PE
architecture corresponds to a dedicated LUT, where the same macro-operations

represented by a set of integers designate the LUT indexes. The characterized energy

113

model for each macro-Operation corresponds to each LUT index and can be accessed by
providing the index as the macro-operation type during simulation. To estimate the
energy consumption for an application executed on a specific microprocessor, each
involved macro—operation is annotated with its energy model. Each macro—operation’s
energy model will be accessed by table lookup after running this macro—operation, and
the corresponding energy number will be retrieved. A global energy variable accumulates
all the experienced energy data and evaluates the total energy consumption during
application execution.

4. 2.2. Experimental Results of Processor Macro Power Model

The performance and accuracy of the energy model is validated with a media
processing application, JPEG encoder. JPEG[21] is a lossy compression algorithm, which
is used extensively in the static picture compression. The four computation cores in JPEG
encoder algorithm, DCT, Quantization, Zigzag and VLC, execute sequentially and DCT
is more computation-intensive than the other three. We use JPEG source code[22] from
UCLA and map it to PISA combining the energy model under SystemC. This JPEG
encoder is simulated to encode pictures with different sizes from 64x64 to 512x512. For
comparison, the JPEG source code is compiled to PISA binary and simulated with
Wattch. Meanwhile, the code is also compiled with gcc and runs on the same Solaris
platform to verify the functionality.

Table 4.1 presents the experimental results of energy estimation from the macro-model
methodology and Wattch. The ﬁrst column lists the picture names and sizes, followed by
energy estimation results from Wattch (eng_W) and macro-model energy estimator
(eng_M). The last column shows the relative errors compared to the Wattch simulation
results. Table 4.2 describes the execution time for each simulation model. In each column
t_C, t_SC and t_W represent the run-time for pure functionality run without energy
model in C, functional and energy simulation with PISA’s macro-energy model in

SystemC and functional/energy simulation with PISA’s architecture-level energy model

114

in Wattch. The last column shows the speedup of energy macro-model compared with

Wattch.

Table 4.1. Energy Simulation Results for JPEG on PISA architecture

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

picture eng_W(pJ) eng_M(p.I) error(%)
bunny (64x64) 5.243x 16‘ 4.784x108 ~8.75
name(128x128) 1.972x109 1.854x109 -5.96
sun (256x256) 7.374x109 6.977x10" -5.38
face (512x512) 2.916x10m 2.787x10m -442

Table 4.2. Simulation times

picture t_C(s) t_SC(s) t_W(s) Speedup
bunny (64x64) 0.0015 0.149 31 208.05X
name(128x128) 0.0191 0.462 115 248.92x
sun (256x256) 0.089 1.631 438 268.55X
face (512x512) 0.372 6.288 1730 275.2X

 

 

 

 

 

 

 

We observe that microprocessor energy increases approximately linearly to run time
from the simulatiOn results in both Wattch and the macro-model energy estimator when
picture size increases as table 4.1 shows. This result is expectable since the amount of
instructions and macro-operations executed to encode a picture with increasing size
increases proportionally. The macro—model estimator achieves pretty good precision by
the relative error ranging from —4.42% to -8.75%, with the average of —6.70% among 4
test pictures with different sizes. The under-estimation of the energy results from the
absence of the modeling of inter-macro-operations and I/O functions such as fprintﬂ ) and
printﬂ ), which consume energy as well. In many embedded applications, such I/O
functions may not be used except for debugging. As compensation to the errors, which
are acceptable in high-level design stages without much architectural information, this
framework’s simulation time reduces dramatically with speedup from 208.05X to 275.2X
when comparing with Wattch. The speedup is due to eliminating the evaluation of each
executed instruction during simulation by combining multiple instructions into one

macro-operation and invoking the model library directly.

115

4. 2.3. Application Speciﬁc Hardware Accelerator Power Macro Model

Differentiating from the programmable microprocessor, the ASIC core, which is
frequently used as the hardware accelerator for some computational intensive tasks on
80C to increase the performance and save power. By dedicatedly designed data path ﬁne-
tuned for the requirement of the specific application, hardware accelerator reduces the
processing load of the main processor, in turn lowering its MIPS requirement to meet the
target throughput. The reduced processor MIPS requirement brings the beneﬁt of further
power saving from the lower operating frequency, and possible voltage if DVFS is
enabled.

The customized data path of 80C hardware accelerator brings out the overhead of
efﬁcient power macro-modeling methodology because the “instruction” concept of
microprocessor doesn’t apply to this type of PBS and their operation cannot be easily
categorized by its “instruction” set as what we did in 4.2.1. What’s more, the broad range
of 80C hardware accelerator makes it more difficult to use the common set of basic
macro-operations to characterize the power properties of them. As an alternative solution,
we narrow down the range of the hardware accelerators of interest with some common
features such as data path type, architecture, or memory allocation, it is still feasible to
extract a set of atomic operations by which all the hardware’s functionalities are
composed.

In this work, we target the digital ﬁlter hardware accelerators (DFHA) presented in the
2.2.4 when building the hardware accelerator’s power model. DFHA performs FIR/IIR

algorithms in many media processing applications, where multiplication/MAC is the
most intensively used operation expressed by the format of Zaxb. In the multi-tap
FIR/11R hardware accelerators, multiple MACs are working in parallel shown in Figure
4.3(a) to reduce the cycle count of computing the input sample, and this scheme poses the

requirement of another two basic Operations in the hardware accelerator:

® 2N-bit adders for intermediate result accumulation among multiple NxN multipliers

116

C2) 2N-bit width registers to enable the pipeline processing to maintain the data

throughput

N N N N, N N N N N N N N N N/ N N
2N 2N 2N 2N 2N 2N 2N 2N
LI _ I REG 3

REG

 

 

 

 

 

 

 

 

2N /

1 /

A4 ADD
2N

 

 

 

 

Figure 4.3 Parallel FIR architecture with 3 pipeline stages providing 1 sample/cycle throughput

Pipelining in the hardware accelerator is very useful in the real design since it reduces
the length of the critical path resulting from the insertion of multiple stages of partial
result accumulators. Although pipelined data path still has multiple cycles’ processing
latency, it achieves one sample/cycle’s throughput, which is essential for most ﬁlter
applications with stringent sampling rate requirements.

The three basic DFHA operations, ADD, REG and MPY, are characterized by their
power consumption using the available IBM 180nm CMOS process by Synopsys RTL
power tools. During characterization, the operating frequency is set to 200MHz with 50%
activity factor, and the supplying voltage is 1.8V. The power consumption of each macro-
operation is estimated on 7 different data widths from 8 to 32. Based on the characterized
power consumption on different data path widths, linear ﬁtting is applied on the ADD

and REG operations and a 3rd order polynomial ﬁtting is performed on the MPY

117

operation to derive a regression power model. Figure 4.4 (a), (b) and (c) shows the
characterized power versus input data width of operation MPY (multiplication),
ADD(adder) and REG(register) as well as their regression curve ﬁtting results,
respectively. It clearly shows that the power consumption of each macro-operation is
highly correlated with its data width. The RMS errors for ADD, REG and MPY
operations after curve ﬁtting are 2.04%, 2.46% and 1.98%, respectively.

There are other three factors taken into consideration when building the power macro-
model: clock frequency 0‘), operating voltage (Vdd) and activity factor (a). Given a
hardware architecture, dynamic power is proportional to f, a, and square of Vdd, so the
generic dynamic power model for ADD, REG, and MPY operations with IBM 180nm
CMOS process can be derived as:

2
f (Vdd) a

P ,V , , =-—- —— -———-P 200,1.8,0.5,

ADD(f dd 0 W) 200 1.8 05 ADD( W)

2
=_[_.(Vi] ._a_.(23.519w—8.190l)
200 1.8 0.5

= (0.072w—0.025)- f v; 0a (4.5)

2
f [Vdd a
P ,V , ,w =—- — -—-P 200,1.8,0.5,w
REG(f dda ) 200 1.8 0.5 REG( )

2
=_f_.(Y_a ._a_.(8.794w—0.0068)
200 1.8 0.5
= (0.027w— 00002)- f -v,,’:’, -a (4.6)

f Vdd 2 a
P ,V, , =—‘ — -—°P 200,1.8,0.5,
MPY (f dd 0 W) 200 [1.8 0.5 MPY( W)

2
=—f—-[Xéi -i.(0.0037w3+6.208w2+6.539w+4.175)
200 1.8 0.5

= (0.0001w3 + 0.019w2 +0.017w+ 0013)- f -v,,‘;’-, .a (4.7)

118

Characterized power consumption 01 ADD operatior

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

.., A Characterized power consumption of REG operatic
5. 1000 f . a i 400 r . .
g y = 23.519,x _ 8.1901 1]]. g 300 . y = 8.7938 x - 0.006893 A
,4" © 200 - ,A
E 400 1", E ,’A
Z 1 2’ ,A
.g r .g 100 - AK «
g 200 ~ .1 g ,A
g o I DC estimate E 0 1 A DC estimate
8 T6
3 -200 1 . . g -100 . , 1
o. 0 10 20 30 40 r; 10 20 30 4C
Adder data width(bits) Data width(bits)
(a) ADD operation (b) REG operation
A Characterized power consumption of MPY operati
a 12000 . . -
1%- * DC estimate
5 10000 " --——— cubic ,.
O /
8 8000 -"/
© y = 0.0037206'x3 + _/
E 6000_ 6.207sz + 6.539'x + 4.175 f"
c ,/
.9 ’
E 4000 . ',,1"/
a”
g 2000»
o 0,}
g ,,_....r**
8 0 10 20 30 4c
Data width(bits)
(c) MPY operation

Figure 4.4 Characterized ADD, REG, and MPY macro—operations’ power versus data widths

(4.5), (4.6) and (4.7) form the basic power model of DHFA used in signal processing
data paths. They are 4-variable functions considering the operating frequency, supplying
voltage, activity factor and data width. These models are derived based on the available
IBM 180nm process, and they are scaled to other predictive advanced CMOS
technologies by the following capacitance scaling methodology.

From (4.5), (4.6) and (4.7), each operation’s power macro model has the regression

formula of:

P(f,vd,,,a, w) = F(w)- f vi, -a (4.8)

According to the CMOS Circuit’s dynamic power consumption:

119

l .
P(faviaiia):§'c'f'vdfi’a (4-9)

The switching capacitance under 100% activity factor is estimated as:
C(w)=2F(w)=2(a0w3 +alw2 +a2w+a3) (4.10)
where w is the data path width, and [(10, a], a2, a3] are regression coefﬁcients from the
curve ﬁtting.
(4.10) presents the statistical switching capacitance model for the data path element
used in DHFA, and it is scaled to the other predictive technology node according to the

lTRS roadmap[6]. For each technology node with m nm feature size, its statistical power

model can be built as: P(f,V ,a, w,m) = ,B"P(f,V ,a,w,180), where B is the scaling
dd dd

factor derived from the ITRS report [6], and P( f ,Vdd , a, w,l 80) is the regression power

model on the characterized IBM 180nm CMOS process. Table 4.3~4.5 presents the
coefficients (an, a], a2, a 3} for different technology nodes after the capacitance scaling.

With the feature size decreasing on advancing sub-100nm technologies, the
capacitance of transistor and local wire scale with the feature size, while the global wires
don’t. Since the wires in those hardware accelerator are frequently implemented as local
ones for better timing closure and power consumption, the proposed DFHA statistical
power model is still effective when being applied to those technologies.
4. 2.4. Experimental Results of DFHA Macro Power Model

Due to the lack of technology libraries on those advanced technologies, the proposed
statistical power model is only veriﬁed with the DC estimated power on the available
180nm CMOS technology provided by IBM. In this experiment, a 3-stage pipelined 8-tap
FIR shown in Figure 4.3 is used for evaluation.

This ﬁlter architecture contains eight NxN multipliers in the ﬁrst stage, six 2N-bit
cascaded adders in the second stage and one 2N-bit adder in the third stage. two pipeline
registers with the data width of MN and 4N separate these three pipe stages. The ﬂip—

ﬂops for the delayed element storage at the input are not explicitly shown here. This

120

architecture could ﬁnish an N—bit sample’s 8-tap FIR calculation in three cycles, and it

has one sample/cycle throughput after the pipeline is 100% utilized.

Table 4.3. Regression coefﬁcients of DFHA macro power model for ADD operation

 

 

 

 

 

 

 

 

 

 

 

coefﬁcient IBM 180nm BPT M 130nm BPT M 90nm BPT M 65nm BP’I‘M 45nm
an 0 0 0 0 0
a, 0 0 0 0 0
a; 0.072 0.0648 0.0547 0.0504 0.0432
03 -0.025 -0.0225 -0.019 -0.0175 -0.015

 

Table 4.4. Regression coefﬁcients of DFHA macro power model for REG operation

 

 

 

 

 

 

 

 

 

 

 

coefﬁcient IBM 180nm BPTM 130nm BPTM 90nm BPTM 65nm BPTM 45nm
an 0 0 0 0 0
a; 0 0 0 . 0 0
02 0.027 0.0243 0.0205 0.189 0.0162
03 -0.002 -0.0018 -0.0015 -0.0014 -0.0012

 

Table 4.5. Regression coefﬁcients of DFHA macro power model for MPY operation

 

 

 

 

 

 

 

 

 

 

 

coefﬁcient IBM 180nm BPT M 130nm BPT M 90nm BPT M 65nm BP’I’M 45nm
an 0.0001 0.00009 0.000076 0 (approx) 0 (approx)
0, 0.019 0.0171. 0.0144 0.0133 0.0114
(1; 0.017 0.0153 0.0129 0.0119 0.0102
a 3 0.013 0.01 17 0.0099 0.0091 0.0078

 

 

 

 

The power consumption of this FIR DFHA is estimated with the proposed regression
power model, and its result is compared with the Synopsys tool estimated power
consumption. The DFHA with four different data widths: 8, 14, 18, and 24, are evaluated,
and the power results estimated by the regression model and the Synopsys tool are
presented by Figure 4.5(a), while Figure 4.5(b) presents the run-time in the unit of
seconds for these two methods.

The proposed statistical hardware accelerator power modeling methodology achieves
the relative error of 4.08%, 4.22%, 4.27% and 4.14% on the DFHA data path with 8, 14,

18 and 24 compared with the Synopsys simulation results, showing that the curve fitting

121

method can estimate SoC hardware accelerator power consumption with > 95% accuracy
with arbitrary data width of those FIR—type architectures. We argue that the ~4 percent of
overestimation results from two factors: (1) The intrinsic error related to the polynomial
curve ﬁtting, whose optimization method is to minimize the RMS error. The global RMS
error minimization results in some additional error on the discrete sampling points, which
are the data width in this case: (2) Some logic optimizations are performed by the
Synopsys tool when it compiles the design to the gate netlist before running power
estimation, and this optimization will reduce the gate count, in turn lowering the power
consumption. If more architecture-level blocks, such as muxes, memories, etc, are
characterized in the proposed way, the power consumption of hardware blocks with more
complex architecture could be estimated without the timing-consuming RTL— or gate
level synthesis. Another signiﬁcant merit this modeling methodology brings is the
estimation speed, which increases along with the increasing data width in Synopsys tool
because of the computational overhead of elaboration all the inferred cells during power
estimation. Since the proposed statistical macro-model is based on the regression curve
ﬁtting model without the elaboration of the design in full hierarchy, its simulation speed
is constant for DFHA with different sizes. The simulation time based on Sun Solaris
server unveils that the macro—model achieves the speed-up of ~205.1X in average
compared with the synthesis—based power estimation.

Power macro—model simulation results
4

 

 

  
   

 

 

 

5 x 10 on FIR data path 300 Macromodel simulation speedup
E; 4i - Macro-model A 1 f
A x
g5 i - Synopsys iooi ‘ 7%. 250
e I; 3 j 8 e
8 I i g
a 5 i 200
(518 2 a .
'9 g E 150
E 3, 1 ‘17:
E ‘ .
0 100'L -v.—_z*__l
8 14 18 24 8 14 18 24
FIR data width(bits) FiR data width(bits)

Figure 4.5(a) Dynamic power estimation results Figure 4.5(b) Simulation speed comparison

122

Combining the processor and DFHA macro power models presented in this section, we
can explore the energy perspective of PBS in the 80C with very high simulation speed
and good precision. Next step would be to create the power model for the 80C
interconnection infrastructure, NoC.

4. 3. NoC Architectural Power Model

For modeling ﬂexibility and scalability, each router or link class has a dedicated
associative power model class gluelessly connected with the NoC performance model as
the dotted rectangle in Figure 4.6 shows. Each power model class has two main member
functions, one of which estimates the leakage power, and the other one calculates the
dynamic power. The router’s power model is instantiated and initialized in each router’s
constructor and so does the link’s. The architectural and technological parameters of the
NoC will in turn be passed to these models for power estimation. Take the fact that the
dynamic power depends on data switching characteristics into consideration, the ﬂit
payload that involves in each transaction is passed to the power model as well. During
simulation, the communication fabric (router or link) involved in a transaction will
invoke the power estimation functions in its power model to calculate the power. The
utilization of power model class makes the model modular and easy for reﬁnement, and it
also conforms to the object-oriented modeling methodology, which is applied throughout
the whole SoC simulation framework.

Another feature of this power modeling framework is that it supports the temporal and
spatial power proﬁling during run-time. Power proﬁling describes the power implications
of an application dynamically in terms of time and location. It also helps designers locate
the potential hot spots when simulating the applications. In this work a power monitor is
implemented by sampling the power data for each router and link at the speciﬁed time
interval Tsampie shown in Figure 4.6. The sampling interval can be conﬁgured by the
designers according to different applications. From the sampled data, the average power

can be calculated by:

S—l
:0: Bran . Tbample
PM = ——S-—T—_ (4.11)

sample
where Bjﬂis the transient power(dynamic or leakage) on the ith sample, and S is the
sample number.

6E9???¢9I!l§l931§--.-

 

 

 

 

 
   
 

 

 

 

 

Topology = mesh
X_size = 4 .
Y_size = 4 1 Power
Channel 1
“CA [ Channe. M OCA Link -- ~
m mm” W

Svdd = 1.3 ' ‘ Power decoder
iLamda = 90e-9 l:> OCA Router
EVt_n = 03423
New: 4,1196 > MEL]
i_::.'.:.'..' .............................. ~ >
Technology data

 

 

 

Figure 4.6 NoC hierarchical power model
4. 3.1. Power Model of Router ’3 Input/Output Buﬁ‘er

Router has the following main components: input/output buffer, crossbar and arbiter,
each of which has a dynamic and leakage power model. For the input/output buffer, we
model its power properties by analyzing the SRAM array that implements the main
storage functionality.

Studies have shown that the input buffers, typically structured as FIFOs, are potentially
the single largest energy consumer in a typical router [5]. Figure 3.5 shows the block
diagram of a typical dual-port SRAM-based N x 5,71, FIFO that can store at most N ﬂits
(words), each containing Sﬂi’ bits. On a write( read), the ﬂit is written into (read from) the
appropriate row of the SRAM, with the row indexing done based on the addresses

generated from a read (write) counter.

124

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 4.7 Architecture of dual-port SRAM

We refer to [25] to create the router’s dynamic energy model with some modiﬁcations
to interface our NOC transaction-level model. Figure 4.7 describes a dual—port SRAM cell
inside the input buffer. N 1, N2, P1 and P2 compose the storage element, and N3, N4
form the write control, while read control is implemented by N5 and N6. P3 and P4 are
PMOS transistors for precharging the bitlines before read operation. There are two types
of SRAM operations that consume dynamic power. In read, the dynamic power is
comprised by the switchings of precharge, wordline, and bitline as (4.12), (4.13) and
(4.14) describe:

Cwordline(read) = Sﬂit I [C2 (N5)+ Cg (N6)J+ Cw (Lword ’ M 1) (4-12)
Cprecharge = Cg (P3) (413)
Chit/inee(read) = 2 ' [Nrow ' Cd (N5)+ Cd (P3)+ Cw (Lbit ’ M2)] (4-14)

And wordline, bitline, and memory cell capacitances contribute to write access shown in

125

(4.15), (4.16) and (4.17):

wordlinet write) : 2 . Sﬂit . Cg (N3)+ Cw (L1H)!!! ’ M 1) (4-15)
Chitlinelivrite) : 2 . [wa . Cd (N3)+ Cw (Lbit ’ M 2)] (4-16)
C... = 2 - 1C. (N5)+ C. (N3)+ C. (P1)+ C. (N01 (4171

,where C 80:) and Cd(x) are the gate and drain capacitances of transistor x respectively, and
Ca(x) = Cg(x) + Cd(x) as [25] deﬁnes. NM. is the number of rows in the FIFO, and Cull,
Mn) is the capacitance of wire laid out on Metal 11 with length L. Conventionally, M1 is
used for word lines and M2 is applied to bit lines. The router’s dynamic power of read

and write is computed as:

P rend : Rvordlindrearl) + 2 . S ﬂit . lzpprecharge + Bﬁt/ineUead) + IDsenseJ (418)
Pwri'te : Pwordline(write) + m . [Pbitline(ivrite) + [)cell J (4-19)

whereP, = a,C,Vdf, f is component i’s dynamic power, and m is the number of switching

bits on the word line during a write access, which is available during simulation.

As [69] states, the main part of CMOS Circuit’s leakage power results from the sub-
threshold current, which is the static current ﬂowing through the transistor’s drain and
source when it’s in the “off” state. Sub-threshold current could be calculated by the

following equations [73]:
W q 8 51 N ch

 

 

1.0 =u0 .7. T ,sz (4.20)
V 5‘ -V —V()
[sub = [50 .[l-exp(— qu /Vt )j'exp s- th ﬁr
- nV,
Vg, =(),V(l.S'=Vd .1
V2 8.. ‘ _Vt _Vo .
:ﬂo.._‘_. q 51 ch 'CX h ﬂ W
L 2¢s nV,
: Ix” . W (4.21)

The methodology to model leakage differentiates from what is proposed in [69] in that
our analytical model is derived from transistor-level SRAM circuits. We take the
assumption that a MOSFET’s subthreshold leakage is zero when it’s not in the off-state,
and identify the off-state transistors of each SRAM operation to calculate the total
leakage current. [74] proposes similar work that analyzed the one-port SRAM, while we
extend it to dual-port FIFO cell in this work.

Compared with 2-state assumption in FIFO’s dynamic power model, 3 states, including
an additional “idle” state where no operation is taken, are taken into account since
leakage in “idle” state exceeds the other two states due to maximum number of off-state
transistors. In the leakage model, W,- is the width of transistor i, 1M) and [p0 are the unit
width leakages of NMOS and PMOS as shown in (4.21).

@When SRAM cell is in “idle” state, bitlines are precharged to ‘1’, write = read = ‘0’,
din(din_b) has 50% probability to be either ‘0’(‘l’) or ‘1’(‘0’). If din = ‘1’, N1, N3, N5

and P2 are in off-state; and only N1, N5 and P2 are in off-state when din = ‘ l ’.

Ileak,din=0 2 (Wm + WN3 + WNS) ' 1N0 + WP2 ' [PO (422)

Ileak,din=l : (Wm +W~5 ) ' 1N0 +WP2 'IPO (4.23)
So the SRAM array has the leakage of:

Ilekuge,idle Z N row 'Ncni ’[(W~i +0-5WN3 +WN5 )1 N0 +WP21P0] (424)

@When SRAM cell is in “read” state, All write signals are kept at ‘0’, while the read
signal of the selected row is activated to ‘l’ and others remains at ‘0’. For the cell that is

deselected in read operation:
Ileuk,read:0 : (Wm +0‘5WN3 + WNS )1 N0 +WP21P0 (425)
And (4.20) estimates the leakage for the cell that is selected in read operation:

Ileak,read=l = (Wm + O‘SWN3 )INO +WP2 1P0 (426)

The SRAM array has the leakage estimation of

127

I -N

leukagt'Jeud = rmv col . [(W’Vl + O’SWN3 )INO + WP'ZIPU]+ (wa _ l) ‘ Neal . WNS . 1N0 (4'27)

©In “write” state, one of write signals is asserted to ‘1’ while others keep at ‘0’, all read
signals are staying at logic ‘0’. Bitlines are precharged to logic ‘ l’. The transistors in off-
state depend on status of current din(a’in_b) and previous data stored in the SRAM cell.

When write = ‘0’, the leakage current is:

Ileak.write=0 = (Wm + O'SWN3 + WNS )1 N0 +“11921190 (4-28)
Otherwise,
Ileak.write=l : (Wm + WNS )INO +WP21PO (429)

Overall, the SRAM cell’s leakage during write access:

I = wa ' Ncol ' [(WNI +WNS)]N0 +WP2’P0]+ (N

— 1). N 05W,3 - 1,, (4.30)

leakage. write col

From (4.24), (4.27) and (4.30), the leakage power of input buffer is derived by
P

leakagej = Vdd ' Ileakagej (431)
where Iraqi-age..- is the leakage current at operation state i (i belongs to the set (idle, write,
read}).
4. 3.2. Power Model of Router ’s Crossbar

Crossbar functions steering the ﬂit from router’s input port to the routed output port. In
the NoC framework, the NixNo matrix architecture crossbar is employed and each
crossbar has N,- input ports and N0 output ports shown in Figure 4.8(a). Each port has the
same data width of SIM-bit. The crossbar’s input ports are connected with the router’s
input buffer, and its output ports are connected with router’s output. The passing
transistors located on the intersections between the input and output ports control the data
ﬂow with the aid of the routing logic unit and the arbiter. So a matrix crossbar has
SymeixNo passing transistors, each of which controls the connectivity status of an output
with a speciﬁc input. At any time, only one output port can be granted for the requests

from inputs.

Similar to the input buffer, the power model of the crossbar is constructed by deriving

its basic operation’s switching capacitance model. Consider one slice of the SﬂerIdth
crossbar with the input and output munbers of N.- and No respectively. Assume C.._.,.,,
C.,.,_..,., and Cam" are the capacitance in the input, output and control ports. And N
represents for the NMOS used as the passing transistor.

C. =Cd (N), C =Cd(N)+Cg(N) (4.32)

in _ cnt

out _tnt _ _Cd (N)9 Cctr_(‘nl
Assume in the layout of matrix crossbar, the horizontal and vertical spaces between each
two adjacent intersection nodes are w, and h,. The corresponding wire length for the input

and output ports can be estimated by:

L," = N0 -Sﬂ,., ~w, (4.33)
L... = N . '5... 'h. (4.34)

So the input and output capacitances of one bit slice on the matrix crossbar are:
C.._.-.- - N.- C.._ +IC.(T T... )+ C.(T)1+T.-. C...(L.-.) (4.35)
C...._.. =N. C..- +.(ICT ..).+C (..T )J+C.(L (L...) (4.36)

If there is a data transition on the input or output port, the corresponding input or output
effective capacitor C.._.-,, or Cw... will be charged or discharged, resulting in dynamic
power consumption. Assume the current data is latched on each port until the next data is
available. During a switching operation, suppose there are (5,..- bits and 5..., bits on input

and output ports change their values. The effective switching capacitance is:
Ceff = 61d . be _in + 6x0 be_aut (4-37)

And the dynamic power consumption:

1
Pza-WZIAFV ddxi(§. .be _in +6xo.be_out).f (4.38)

where AF is the crossbar’s activity factor, which can be characterized from low-level

simulations. In this model, we assume that AF = 0.5.

 

 

 

 

 

 

......

 

 

.
O. .-

 

 

 

 

 

 

 

 

 

 

 

 

............

Figure 4.8(a) Matrix crossbar capacitance model Figure 4.8(b) Active state of matrix crossbar
Similar to SRAM’s leakage power model, different operating states are taken into

consideration when dealing with matrix crossbar’s leakage power. The main contributors

to the leakage in the crossbar are the passing transistors. Two states, idle and open, are

analyzed to estimate the crossbar’s leakage power.

@When the crossbar is in the “idle” state, all the control signals applied on the gate of

the passing transistors are ‘0’. All these passing transistors are in off—state and contribute

to the total leakage.
[lea/(idle : Ni . No . WN . 1N0 . Sﬂit (439)

where WN and [No are the passing transistor’s width and the unit width leakage current for
NMOS transistor respectively.

@When the crossbar is working, there is only one passing transistor open in each input
row and each output column as shown in Figure 4.8(b). So each row has (No - I) passing
transistors in off-state, and each column has (N,- — 1) off-state passing transistors. Assume
that OutDatj is the ﬂit latched on the output port j, and InDat) is the ﬂit from the input
port i. And the Hamming distance between OutDatj and InDat; H(0utDatj, InDat.) shows

the number of switched bits between these two data. For each input port i, there exists:

lea/t ,l

1,0 -H(1nDai,,ouiDat 1) (4.40)

Ms

\.
II

1 30

Take all input rows into consideration and add leakage from the input/output drivers, the
leakage of crossbar can be estimated by:

N,- N,
I,,,,W = WN 22 1,,0 -H(1nt)at,.,oi.ubaij )

i=1 j=l

+%’(dev,tr “1100 +dev,N ' INC). San '(N1 + No) (4.41)
where dep, and deN are the PMOS and NMOS’ width in the input/output drivers.
4. 3.3. Power Model of Output Arbiter

Figure 4.9(a) shows a canonical micro-architecture of fair-sharing output arbiter, which
is characterized by the number of requests per arbiter R. The arbiter’s request number
equals to the number of crossbar’s input ports. The arbitration operation of the matrix
arbiter can be divided into three atomic operations: request evaluation, priority update,
and grant generation, which are employed to model its dynamic power consumption. First
the incoming header ﬂits on each input port generates a request, which will be evaluated
along with the request signals from other inputs to generate a grant signal based on the
current priority list. After that the priority list will be updated cyclically to achieve the

arbitration fairness.

(R(R-1)/2 ﬂit-ﬂops)
priorities(mij)l-~ 01:14}
grantt req_n-1
FL.“ ELM , , : ﬁn' '1)" T02

5 E ’ rantn
reqn ;-. 3.11,,
_.' ~ten-nu. -_ -' I P
. ' req_n+1
mhln+1)

—R
grant generation logic fell D—
mnR
Tn1

Figure 4.9(a) Structure of the matrix arbiter

 

 

 

 

 

 

 

 

 

 

 

 

 

 

reqR E _

 

 

 

 

 

 

 

 

 

 

 

Gate 19(TN) M’(Tp)

 

 

 

Tnl 13.51 76)»
Tn2 13.5)» 761
Ti 12.5}. 251

 

 

 

 

 

Figure 4.9(b) Default transistor sizes

For the request evaluation component, the switching capacitance contributing to the
dynamic power can be further divided into three sub-components:
@The gate end capacitance of the ﬁrst level NOR gates Tn]. A request signal is involved
in computing all the other grants. Besides the inverter connected with this request signal,
there are (R-I) NOR gate ends. The total capacitance is (R-1)Cg(T,,1).
@Gate end capacitance of the second-level multi-input NOR gate Tnz, which is not
driven by the request signal directly, but by its compliment from T,-. Its capacitance is
Cg(T,,2).
@Inverter T;, which generates the complimented request signal for grant computation. Its
capacitance is Ca(T,-) = Cg(T,-) + C401).

80 the capacitance during the request evaluation can be estimated by:

C = (R-1)-C,(T,.r)+C,(T,.2)+C.(T.-) (4.42)

request
For the priority update, two capacitance components are taken into consideration

during power modeling:

@The gate end capacitance of the ﬁrst level NOR gates Tn ,. Each priority signal

represents the priority relationship between two requesters. So it is used to compute two

grants, and the gate end capacitance is 2Cg(T,,,).

@Flip-ﬂop switching capacitance Cpp since the priorities are stored by Flip-flops. The

switching capacitance during priority update is:
CPriority = 2 Cg (an )+ CFF (4.43)

For the grant generation, two capacitance components contribute to its dynamic power:

I32

@Drain end capacitance of the second level NOR gate Tnz, the capacitance is Cd(T,,2).
@Crossbar control line’s capacitance Cum-m Since the grant signals are connected directly

with them.
Overall, Cgrtmt = Cd (T112 )+ Cctr_cm (4.44)

Arbitration has one argument request, the bit vector of all requesting signals, with each
bit representing whether a request signal is on or off. The request vector can be acquired
by evaluating each column of the trafﬁc matrix representing the current trafﬁc
information in this router. For the request evaluation, the number of switching requesters
is simply the Hamming distance between rer, the previous request bit vector, and req,
the current request bit vector. Hence the energy dissipation during request evaluation can

be estimated by:

E = H (req, rer)- C V; (4.45)

request request
For the priority update operation, supposing that the priority bit vector before arbitration
is priO, after the arbitration, priorities are updated to become pri. So the number of
switching priority vector is the Hamming distance between priO and pri. Each switching

priority node has a single switch, so the energy dissipation is:

E prion-.., = H (pri, pri 0)- C mom,- 'Vdi (4.46)
For the grant generation, each arbitration operation grants one requester. If this arbitration
grants the same requester as the previous arbitration does, there is no activity on grant
capacitance. If this arbitration grants a different requester, then one grant signal rises
from ‘0’ to ‘1’, while another grant signal falls from ‘1’ to ‘0’. Use r0 and r be the id of
the previously granted requester and the currently granted requester, the energy

consumption is:

2-C ~Vd§,(r¢r0)

gram

0’ (r = ’0) (4.47)

grant

Take request evaluation, priority update, and grant generation into consideration, the

133

power consumption of one arbitration Operation on the matrix round-robin arbiter is:

E 2E +E +E

arbiter request priority grunt (4-48)
And in the current router model, the transistor sizing parameters are chosen as shown in
Figure 4.9(b).
4.3.4. Power Model of Interconnecting Wires

In this section, the analytical power model of interconnecting wires is presented. We
assume that all the interconnecting links in NoC are treated as global wires laid out on the
top metal layers (M5 and M6 in 180nm 6-metal process), and these global wires are
pipelined and repeated uniformly. The dynamic power results from the signal switching
of wires and repeaters, while the leakage power mainly originates from repeaters’
subthreshold current. Differentiated from the delay model, inter-wire coupling
capacitance is taken into consideration in wires’ power modeling and a simple data-
dependent interconnecting wire power model is derived. For a single interconnecting

wire, the switching capacitance is:

C 2 CW -L - XOR (bit,bit’) (4,49)

wire
where bit and bit’ are current and previous l-bit data latched on the wire, and Cw and L
represent the unit length wire capacitance and wire length respectively. The other part
contributing to wires’ dynamic power is the inserted repeaters, whose switching

capacitance is:

C = (6+1)w- Cg ZI; XOR(bit,bit')

repeater
seg

= r . Cw .L. XOR(bit, bit') (4.50)
For a Sﬂit'bit width interconnecting link inserted by (ULseg—I) uniformly distributed
repeaters, the dynamic power consumption depends on the traversing data’s switching

pattern represented by two consecutive transmitted ﬂits’ Hamming distance. From (4.49)

and (4.50), we can derive that:

134

1 SﬂiI—l

Binkdyn = E ' ded ' f ' AF 0 Z (Cili'ire + Criepemer )
i=0

1
=3-Vdi -f-AF ~(l+r)C..L- DAM) (4.51)

where Vdd is link’s supply voltage, f is the operating frequency, which is assumed to be
the same as router’s. AF is activity factor with the assumption value of 50% in the system
model. DH(x,y) is the Hamming distance between two successive flits x and y ﬂowing
through this link during simulation.

The leakage power model accounts for the repeaters’ subthreshold leakage. In the
CMOS inverter, either NMOS or PMOS has 50% probability in the off state.

1 L
RinkJeakage = E (WPIPO + WN [NO )L— Sﬂil 'Vdd
seg
wL . S .
ﬂit
= T (ﬂ! P0 + 1N0 ) Vdd (4.52)

reg
where [pa and IN“ are unit-width NMOS and PMOS’s leakage current, respectively.

Ci.” Ci.i+1

      

l 2 1
I g 3:— .

 

//

 

Figure 4.10 Interconnecting wires considering (i.i+l), (Li-1) coupling effects
In the wire model, the unit length wire capacitance Cw plays an important role that
determines the switching capacitance during power modeling. The wire model employed

in the current NoC framework takes the inter-wire coupling capacitance into

135

consideration for the DSM processes. When the VLSI design enters DSM technologies,
the interconnecting wires have: smaller width, larger aspect ratio (height/width); and in
most cases, they are placed closer to each other as compared to older, larger-scale
technologies. Moreover, the chip area has increased and so has the ratio of the expected
global link length over the cross sectional area of the lines. Based on this, we derive a
more complicated interconnecting wire model when estimating the unit length wire
capacitance for more accurate power estimation. Assume N wires form an N—bit link laid
upon the ground plane with the width of w, height of h, thickness of t and space of 5
shown in Figure 4.10, the total unit length capacitance erm of each wire i(i=1,2,..N-1),
except for the bordering ones, is composed by three components: the self-capacitance C,-
that is the capacitance between this wire and the ground plane, the coupling capacitance
Cu”, Cu-) which are the capacitance formed by this wire and its directly neighboring
ones. Here we only consider the coupling effect between one wire and its direct
neighboring ones. For the bordering wires, the coupling capacitance only has one

component contributing to the total wire capacitance. Overall, we have
CW” = C, + 2 ' Curt] when l< i < N (453)

where C, is the wire’s self-capacitance, which can be approximately modeled by:

0.222
w t
C.=£ 1.15 — +2.80 —
, (,3 1,.)

an, is the dielectric constant of the insulator such as $02. er is the coupling capacitance

between the adjacent wires, and here we only consider the case that j = i+1 or H .

C — e 0 03(3) + 0 83(1) — 0 07“)”222 (3)—”4
i,i:l at - h r h - h h (4.56)

Combine (4.56) and (4.53), it should be noted that

136

1im(C,. + CW ) = C, (4.57)

S-)°°

showing that the coupling effect can be neglected when the line separation s approaches
inﬁnity. Applying (4.53) and (4.54) to interconnecting power model (4.51), the coupling
effect can be simulated.
4.4. Integrate Power Models into SoC Simulation Framework

By far the power/energy modeling methodologies for SoC functional components have
been addressed. Due to different power model schemes for PBS and NoC, different

methods are applied when integrating their power models with the 80C simulation

 

     
   

   

framework.
taSk1 Task1 (annotated)
c = a * b;
energy + = enerng UT[r1L U_OP_2].‘
d = a - b:
energv + = energﬂlx’T/ALU_()P_2]; PSI(PE)
e r c * d:
energy += errerthUT[.~UUL_OP_2]: <\
...... \
PE energy library \
(PPC405,

 

 

\\ \7

\/\ <J , .............. ,
MSP430...) '
core = new CPUL( -'-,PPC~/05. .'..,) " energy ,

core -> Add Task/ I. &taskl).‘ ‘. '
Power_SRAM .

calculate_SR4 Arl_dynamic_eng( ) '

! calculate SRAM leakage _pwrﬂ

r out 3'11] Power_cbar
[ OCA_Router

calculate_cbar_dvnamic_eng/ )
' calculate cbar Ieakage_pwr(/

 

 

 

 

 

 

 

 

 

 

N 00 < [ router[N] 1

link[0] P e r k
— ow r_|n
E [ OCA—link » calculate_link_dynamic_eng( )
k ‘ — calculate link leakage _pwﬂ )
lmk[M]

Figure 4.11 Integrating Power Models with SoC simulation framework

 

 

 

Power_arbiter
calculate_arb_dynamic_eng( ) 5:, I
calculate arb leakage _pwl'( ) J

 

 

 

 

 

 

 

 

 

 

 

 

 

As shown in Figure 4.11, the energy implications of the application-driven PE models
are implemented by annotating the experienced macro-operations’ energy model along

with their timing model. To enable the top-level SoC simulation framework to evaluate

137

each PE’s power/energy in-flight, each PE object employs a power-sampling interface
(PSI), which includes a power sampling sc_thread process and an energy data member
Eng. During simulation, Eng will be updated by PS] through adding the energy values of
executed macro-operations from the table-lookup under the programmed interval Teng.
Assume ME is a task’s macro-executor containing n macro-operations M01, M02,
MO", with each one consuming E(M0,~) (i=1,2,...,n) amount of energy. The energy
consumption by executing ME is:
n
Aenergy = ZE(M0,) (4.58)
i=1
For a task containing m macro-executors, the toral energy consumption is evaluated by:

mid

energy = Z 2 E (M 0,- ) (4.59)

j=i i=1
And the instantaneous power consumption can be estimated by the average power

consumed between two consecutive energy-sampling events when the sampling period is

small enough:

Pm = LEI-13 energy_sampled lTeng (4.60)

Plotting a series of P)”, values versus the sampling time generates the temporal power
proﬁle for PE showing the core’s power feature when executing the mapped tasks
dynamically. Although current system-level model doesn’t consider PE’s leakage power,
it is easy to reﬁne this methodology by adding PE’s empirical leakage power models
proposed in [69] to support system-level leakage power estimation. Moreover,
considering the PSI interval ng, smaller T mg will result in higher power proﬁling
resolution and vice versa. Figure 4.12(a) and (b) present the power profiling results for
ADPCM application executed on PowerPC405 with T mg = 5us and Teng=20us
respectively. It can be clearly seen that the different power resolutions between these

different sampling rates although they got the similar average power. Although smaller

138

Teng results in 4X better power resolution, it asks for ~2.18X more simulation time due to

the increased number of sampling time events evaluated by the SystemC kernel.

 

 

        

:06 «111'. , =2 is l
g ‘i 'r i" 11' g l" H111
0.2 '41:”... 2.1111111111 l

 

 

o e ' ‘ . . - . o ‘ ' - ' . . . . .
0 500 1000150020002500300035004000 0 800 1600 2400 3200 4000
time (us) time (us)
(a) @ Teng = 5143 (b) @ Teng = 20143

Figure 4.12. Power proﬁle of ADPCM application on PowerPC

For the NoC, the power consumption of each of its components is modeled by a
dedicated power class (Power_SRAM, Power_cbar, and Power_arbiter for routers, and
Power_link for interconnecting links), which implements the functions to calculate the
dynamic/leakage power when a NoC operation happens. These functions are plugged into
the corresponding NoC functional models in the 30C framework. When an operation, e.g.
a read from the input buffer to retrieve a ﬂit, takes place, the buffer’s output flit data
along with the previous latched flit (assume each input buffer latches its output until new
data appears on the output data port) will be passed to the power model function
calculate_SRAM_dynamic_eng( ) of the Power_SRAM class to evaluate the energy
consumption during this buffer’s read operation. The evaluated energy, which is the
return value of the power model function, will update a global data member energy
deﬁned in the router class. Similarly, the leakage power is evaluated by calling the
Power_SRAM class’ calculate_SRAM_leakage_pwr( ) with the arguments of operation
type (READ, WRITE, or IDLE) and temperature. During simulation, the updating of
energy data member of each router/link reﬂects its energy/power information. The same
as the sampling mechanism used in the PE model, a PSI is also employed to control the

energy-sampling interval for temporal power proﬁling.

139

To evaluate the spatial distribution of each router/link’s power at a speciﬁc time point,
the power snapshot capability is developed along with the temporal power proﬁling by
dumping each router/link’s instantaneous power into a matrix with each element
representing the relative position of each router/link at the pre-conﬁgured time points. It
is implemented by adding another sc_thread process sensitive to the current time stamp.
This process is triggered when the current stimulated time reaches the conﬁgured time
points, and it collects the instantaneous power information from each router and link, and
dumps them into a ﬁle, which will be plotted by the third-party tools such as Matlab to
show the NoC’s spatial power properties.

4. 5. NoC Power Simulation Experimental Results

Using the same shared Sun Solaris 5.9 server as the experiments taken in the previous
chapters for the simulation platform, we simulate some synthetic traffic patterns and a
real application: M-JPEG, to evaluate the NoC and 80C power implications with different
process technologies using the 80C simulator equipped with PE and NoC power models.
Six predictive CMOS process models from Berkeley Predictive Technology Models
(BPTM) from 180nm to 45nm are chosen as the target processes for the simulated SoCs.
For the instantaneous power consumption, the operation frequency is set to ZGHz for all
these six processes. The NoC is mesh architecture with the assumed chip size of
lOmmxlOmm.

4. 5.1. NoC Router’s Dynamic and Leakage Power Simulation

Figure 4.13(a) shows the leakage power of NoC router under different temperatures.
The target NoC is configured as 4x4 mesh architecture with S-port routers. Each port has
8 input buffers, and each buffer is 64-bit in width. The workload is set to be uniform
trafﬁc, where each packet contains 10 ﬂits. To show the trend of dynamic power among
different technologies, the average value is plotted with the accompanying leakage power
under the ambient temperature of 300K as Figure 4.13(b) depicts. From these data, we

could envision that the leakage power increases dramatically with the scaled technology

140

nodes. Each newer technology node has 2.8x more leakage power in average compared
with the old one, and it goes worse by the rate of >8X when the feature size scales below

70nm, where the estimated leakage power is nearly the same as the dynamic power.

 

 

 

 

 

 

 

 

 

 

 

 

 

Router's leakage power at different processes Dynamic v.s. leakage power of NoC router
. . . 1500 '
E - T=300K(27C)7 + router‘s dynamic power
g T=323K(SOC) g @ fer”: k
5 - T=353K 80C 5, E rou e 5 ea age power
3 1000 ( ) . 4 g 1000 u @ 300K
in. ._
a router's leakage power
E 1 :E, e @ 323K
‘3 2 routers leakage power
0 _a—
3 500 — 3 500 a @ 353K
3 a)
e E
3 n.
I . .
__ ...—- .1“. I:- l"
180nm130nm 90nm 70nm 65nm 45nm 1 nm 130nm 90nm 70nm 65nm 45nm
BPTM processes BPTM processes

Figure 4.13(a) Router’s leakage power Figure 4.13(b) Router’s dynamic power and leakage
4.5.2. NoC Router’s Dynamic Power v.s. Data Payload Switching Properties

According to the NoC power model including router’s and link’s, the dynamic power
has the direct relationship with the data bus’ switching activity, which is the number of
switched bits between two consecutive u'ansactions on this bus. The dynamic power is
also proportional to the operating frequency, which decides how often the Circuit’s
intemal/load capacitance is charged and discharged. In our model, we estimate the
average dynamic power consumed in a given sampling window and use it to approximate
the instantaneous dynamic power. Assume that an on-chip router (or link) consumes the
energy of E during the sampling window T», the average dynamic power is calculated by:

Laugh.)
T nT n

W

(4.61)
Where it is the number of cycles contained in this sampling window. And (4.61) can be:
lim

E
P =1im,.->r f Zn— = E - f (4.62)

n->1

, which shows that the average dynamic power is the instantaneous dynamic power when

the sampling window is approaching one cycle. So reducing the length of the sampling

141

window can result in better power estimation resolution, while it brings the overhead of
more simulation time because of the increased number of events that trigger energy
calculation and output ﬁle dumping.

Figure 4.14 presents the average router dynamic power consumption under three
different input data payload switching schemes controlled by the TO: (1) two consecutive
data are fully random; (2) two consecutive data have 50% bits constant, (3) two
consecutive data are constant, (4) two consecutive data have 100% bit switching . The
simulation is run under lGHz NoC clock rate and the NoC is conﬁgured with IBM
180nm process. When there is no bit switching between two injected data payloads, the
router’s average dynamic power is estimated to be 25.695mW, which mainly results from
the bit switching on the packet head including the source and destination addresses.
When the data payload has 50% of random bits, the router dynamic power increases
greatly by 3.187X, and this power overhead comes from the switching activities on the
data payload bits. When the whole data payload is set to 100% random, each router
consumes 139.22mW of dynamic power in average due to the further increased bit
switching activities. Although 50% and 100% of payload bits are chosen randomly in
schemes (1,) and (2), the switching ratio of (1) is no larger than 2X of (2) because of the
random feature. This results in the power consumption of scheme (1) is 1.7X of scheme
(2). In (4) the router has the maximum data payload switching activity, and this
corresponds to the power consumption of 173.55mW, which is 2.12X larger than the
scheme where 50% of payload bits switch randomly.

The dynamic power on the link has a similar trend on this four different data payload
switching schemes. The data payload switching activity has distinctive inﬂuence on NoC
dynamic power consumption, and the higher rate the data toggles, the higher dynamic
power NoC takes to transfer the data. From the dynamic power saving perspective, the
data encoding scheme similar to what’s applied to the share-bus architecture to reduce the

Hamming distance between two successive data can also be applied to NoC architecture.

142

If there are only 25% bits toggling between each two continuous data payload after
encoding, the router’s dynamic power is estimated to be 52.658mW, which is 3.28X less
than the scheme of full data toggling. In this work we don’t target exploring the power—
aware NoC data encoding techniques, but the proposed NoC power simulation model
provides a complete platform to evaluate different encoding schemes.

Data switching patter v.s. NoC router power

20% ,

 

Router's average dynamic power(mW)

 

 

(1) (2) (3) (4)
TG payload bit switching schemes

Figure 4.14 Router’s dynamic power

4.5.3. Router Architecture and its Dynamic Power

The router architecture inﬂuences its power consumption as well as its performance. In
general, the higher throughput a router has, the higher the power consumption it takes.
All three router’s basic operations, input buffering (optional output buffering), route
computing and packet/ﬂit forwarding, and output arbitration, involved in a packet’s
delivery will toggle the router hardware, which in turn consumes the dynamic power
resulting from the circuit charging and discharging. Currently twelve different router
architectures have been modeled by adjusting three architectural parameters, each of
which corresponds to speciﬁc hardware architecture in the on-chip router and will have
different power properties compared with others.

We simulate six different router architectures the same as what’s presented in 3.6 with
the input buffer size ranging from 2 to 8 ﬂits, each containing 64 bits. And all these

routers don’t have the output buffer. Both random and transpose trafﬁcs are used in the

143

simulation, and the PE and NoC are clocked at the same frequency of ZGHz. Figure
4.15(a), (b) presents the average router dynamic power consumption at random and
transpose trafﬁcs.

Under the same buffering scheme (input and output buffer size), the NoC employing
XY—deterministic routing algorithm has higher dynamic power consumption compared
with the NoC using the other 4 adaptive ones, although the adaptive routers need more
power overhead to perform the adaptive route computing operation. For the input buffer
size from 2 to 8, the NoC with Ibuf_F P_X Y router architecture has the average of 2.03X
higher power than the lbuf_FP_WF’s, and this ratio reduces to 1.73X when comparing
with lbuf_FP_NL’s under random trafﬁc. And when the arbitration policy is ﬁxed-
priority, North-Last routing algorithm is about ~22.4% more power-hungry than West-
First routing; while this difference becomes marginal (~0.85%) when employing fair-
sharing arbitration policy. It is also observed that employing the fair-sharing policy has
54.4% of more power consumption compared with ﬁxed-priority. All the above
information comes from the result from random trafﬁc. Comparing the power data with
the APL simulation results in chapter 3, it shows strong correlation between NoC average
router power and the maximum achieved injection rate. This correlation explains the
power simulation results in Figure 4.15(a): Higher MAIR indicates that there are more
packets successfully injected into NoC at the unit amount of time, and these packets will
trigger the NoC routers to access the input buffers, start route calculation and arbitration.
80 higher MAIR results in more router operations in NoC during the unit amount of time,
and this directly brings out more dynamic power consumption. From 3.6.3, the NoC with
lbuf_FP_X Y router (buffer size is 2) has 2.27X and 1.9X of MAIR compared to
lbuf_FP_WF and lbuf_FP_NL, respectively, and this correlates to the 2.12X and 1.73X
average router power consumption. The lower increase ratio of dynamic power compared
with MAIR results from the fact that the adaptive routing algorithm consumes additional

power to perform the minimum throughput calculation, which takes small amount of

144

power compared with other router blocks.

 

 

 
 
  
  

 

 

 
   

 

 

Random trafﬁc 500 Transpose trafﬁc

#7 A ﬁr

y—‘ i '—but.FP.xvl
E400 -'b”"FP’XY E 1 —bur.r=s.xvl
a -IburFs,xv g 400 —bur.n=.wrl
i w Ibuf, FP, WF 8 g bu1.FS,WF
O 300 Ibuf, FS. WF 2 - g“? :2. ''3':
E E 300 (,_ U- ,
g _Ibuf, FP, NL g
15‘ 200 !m I?
._ 5 200
.9
a a
g, 100 g, 100
e e
g . 2
< O 4' < 0

2 a 4 5 6 7 8
Input buffer size(nits) Input butter size(tlits)

Figure 4.15(a) Router dynamic power (random) (b) Router dynamic power (transpose)

Under the transpose traffic where the data payload is still random, but the destination

address of each node is ﬁxed as:
dest _ x = (source_x + 3)%4 ,dest_ y 2 (source_ x + 3)%4 , and the ﬁnal destination

address is calculated as (dest_x + dest_y x 4). The power simulation shows some
different properties compared with the random trafﬁc, although the power consumption
increases almost linearly with the increased buffer size. The NoC lbuf_FP_X Y router has
almost the same dynamic power as Ibuf_FS_X Y, and performance simulation shows that
the NOC APL with these two routers are both 5, which indicates that the NoC is not
congested. The output arbiter in these two cases is not triggered due to the well-balance
trafﬁc pattern. So the contributors of router dynamic power consumption is only input
buffer and crossbar, this explains the identical power consumptions between the two XY—
deterministic routers. For the other four routers architecture with adaptive routing
algorithm, the power consumption is lower due to the reduced actual packet injection
rate. When comparing the power consumption under the same router architecture
between Figure 4.15(a) and 4.15(b), it is observed that the deterministic routers consumes
85% more power in the transpose trafﬁc than the random trafﬁc. This is still the result of

the higher achieved injection rate in the transpose trafﬁc resulting from the more regular

145

trafﬁc pattern.

Power simulation results indicate that the NoC router power is strongly related to all
the three architectural parameters: input buffer size, arbitration policy and routing
algorithm, as well as the actual trafﬁc pattern. From the system architect‘s point of view,
it is a good practice to extract NoC trafﬁc pattern based on tracing the application model
to explore the candidate NOC architectures. And it is also beneﬁcial to keep as small input
buffer as possible for on-chip routers if performance is met to lower the on—chip
communication power as well as the silicon area.

The previous simulation and analysis was based on the router without output buffering,
which is employed in some throughput—critical applications. The output buffer provides
additional temporary information storage for the routed packets. To evaluate the output
buffer power perspective, we adopt the 8-entry ﬁxed buffer scheme for the input and
output ports of the same direction (NORTH, SOUTH, WEST, and EAST). The input and
output buffer size can have 6 different combinations of (8,0), (2,6), (3,5), (4,4), (5,3) and
(6,2), where (ab) represents the input buffer size of a and output buffer size of b. Figure
4.16(a) and (b) presents the power simulation results with the router of XY-deterministic

routing algorithm on the random and transpose trafﬁcs, respectively.

namic power consumption with
different input/output buffering

Dynamic power consumption with
different input/output buffering

 

500; ,JJ
1 _Ibut/Obui/FP/XY 1
- lbut/Obui/FS/XY ‘

. W

‘9’

- lbut/Obul/FP/XY 1
- but/Obui/FS/XY

M
0"
O
,2
A
O
O

  
  
 
 
  

     

M
8,.

--.--E-..

Average router dynamic power(mW)
,8.

case

(2. e) (3, 5) (4. 4) (5. a) (s, 2) (8. 0) (2. 6) (3. 5) (4r 4) (5. 3) (6, 2) (8, 0)

Average router dynamic power(mW)

Butter conﬁguration
(Input buffer size. output buffer size)

Figure 4.l6(a) Random trafﬁc

Buffer conﬁguration
(input buffer size. output butter size)

Figure 4. l6(b) Transpose trafﬁc

It shows that the router dynamic power consumption is more sensitive to the size of

input buffer than output buffer. Figure 4. l6(a) indicates that the dynamic power increases

by 17.2% when increasing and decreasing the input and output buffer sizes by 1 while
keeping the same total buffer size of 8 under random trafﬁc. For the power simulation
under transpose trafﬁc shown in Figure 4.l6(b), the power consumption has the similar
trend, but the non-congested trafﬁc here results in the same power consumption between
the two NoC architectures with ﬁxed-priority and fare-sharing arbitration policies.

4.5.4. NoC Dynamic Power’s Temporal Proﬁling

.ﬂ. ,/"\ /'
1 12 H 13 H 14 }4—vf 15 l

\. ,/' \ J/ \‘. // \ I/

Figure 4.17 4x4 mesh NoC under burst trafﬁc from node #0, #2 to node #5

A burst-mode trafﬁc pattern, which mimics some real applications, is applied on the
same mesh-NOC to evaluate the temporary and spatial energy properties. The target
technology is selected as BPTM 180nm process. Each PE injects packets with the rate of
10M ﬂit/s randomly. During the burst period from lOus to 15us, PEs at nodes 0 and 2
increase their injection rate to 100M ﬂit/s and their packets’ destination is ﬁxed to node 5
as described in Figure 4.17. Figure 4.18 (a)~(d) show the dynamic power temporal
proﬁles of routers #0, #1, #2 and #15 from 0 to 100us. Before the burst comes, all these 4
routers have power consumption of 60~100mW depending on the locations since the
border routers consumes less power than others due to decreased traffics. After 10us, the
routers involving in the burst have increased power up to ~150mW. Notice that router #l
is the shared switching node for packets from #0 and #2 to reach destination #5, and its
power increases along with the burst even its host PE still keeps the low injection rate.
After the burst, power consumption of these nodes will gradually return back to previous

states due to the decreased trafﬁc, which is reﬂected clearly from Figure 4.18(a). For

147

router #15, which is far away from the burst area, its power is mainly influenced by the

uniform trafﬁc and the activity change is far less than that of routers #0, #1 and #2.

Power (mW)
160 «- ~
140 1
120
100‘
80 ~

 

 

 

 

    
 

 

     

       

60 .
40 . . .. . .. . II
,0 l , I l' 1 r
o l . l l‘ l l
9 8 8 8 E 53 8 8 8 E o o o o o o o o o o
(D N 0') L!) N (D V v— |\ v- N ('7 V Ll") (O l\ on 0') O
O '— v— (\1 0') ('3 V l!) L!) (D N w V 0 L0 N co LO
v- N C') V 1.0 (D l\ CD 0') O r- v- N (V) ('3 V V LO
. '- N (V) V LO (D f\ (D (D
time (ns) time (ns)
(:1) Temporal power profile on node #0 (b) Temporal power proﬁle on node #1
Power (mW) Power (mW)
180 . . . 14o _ . ..............
:22
100 80 .
80 60 3 ‘
:2 4o
20 g , l 20
0 5 l o
9 8 E 8 8 E 8 E 8 8 .9
m c r a a s 3 <0 3
9 N 8 v m w n 3 m
time (ns) time (ns)
(c) Temporal power proﬁle on node #2 (d) Temporal power proﬁle on node #15

Figure 4.18 NoC temporal power proﬁle under the burst trafﬁc shown in Figure 4.17

4. 5.5. NoC Link Power Evaluation

In this set of experiments, the NoC is the same 4x4 mesh with the physical size of
lOmmxlOmm. The packet size is still ﬁxed to lO—tlit, with each ﬂit 64—bit in width. Based
on the assumption of ﬁxed chip size and router number, we use the minimum global wire
width of 0.8um in BPTM 180nm process as the baseline, and keep this wire width same
for all the other 5 processes. As a result, the unit—length wire resistance and capacitance
scale accordingly, and are shown in Table 4.6.

Figure 4.19(a)—(d) describe NoC’s power breakdown in four processes under the
uniform trafﬁc load, where both frequency and voltage are scaled according to the target

technology nodes. We can see that the relative weight between router and link. In NoC

148

fabricated in I80nm process, the power ratio between link and router is 0.81. This ratio
increases to 0.91, 1.12 and 3.13 in 130nm, 100nm and 45nm processes, showing the trend

that links are becoming more power—hungry than routers.

Table 4.6 Wire parameters for different technologies using the same global wire width (0.8um)

Technology

BPTM 180nm

BPTM 45nm 0.3

Power breakdown on BPTM 180nm
1000 . f , 1 .

W,,.i..(um)

 

900 - Router’s power
§ 800 — Link's ower

0 5 10 20 50 100 250 50010002000
Injection rate (MfIit/s)

(a) BPTM I 80am

Power breakdown on BPTM 100nm

§

. - Router's power
- Link's ower

8
o

 

ABOO'

Dynamci power (mW
w A 01 (D \I
O O O O 8
0 O O O

8
o

 

.4

O

OO
ﬁr

o 5
Injection rate (Mﬂit/s)

(c) BPTM 100nm

 

 

10 20 So 100 250 50010002000

20

114.75

 

Power breakdown on BPTM 130nm
1000 'ﬁe— e v

 

 

 

900 .. - Router's power
- Link's power
g 600 ..
§ 700 ..
g 600 ~-
0
.8 500 ~
E
a 400 ..
g 300-
200 '~
100 ' a
0 5 10 20 50 100 250 5001000 2000
Injection rate (Mﬂit/s)
(b) BPTM I30nm
Power breakdown on BPTM 45nm
1000 j — R t '
. ou er 5 power}
A 900 - Link's war A
g 800 I
§ 700 .
g 600-
O
3 500 . ,
0
g 400 '
C

0
0 5 10 20 50 100 250 50010002000

Injection rate (MIIit/s)

(d) BPTM 45nm

I49

Figure 4.19 NOC Power breakdowns at different technology nodes
To illustrate the temporal and spatial interconnecting power proﬁles of the proposed
simulation framework, a burst-mode traffic the same as what is used in 4.5.2 is applied on
the NoC. Figure 4.20(a)—(b) describe link #l6(from router #1 to router #5) and #17(from
router #5 to router #l)’s dynamic power temporal proﬁle from O to 2ms. When the burst
comes at 10us, the involving Iink’s (#16) power increases to ~5.2X in maximum, and
gradually returns back after the burst traffic. The link (#17) that doesn’t involve in burst,

the power keeps almost constant as Figure 4.20(b) shows.

 

 

    

Link 16's power profile Link 1115's power proﬁle
. ﬂ Y . . . goo . 7 x w r r f
A A700
E E eoo
g g 500‘
§ .9 400
o g i l
g m 300 i . 1 i
g g I. ' illiIIll‘lli‘ii,
0 020° Illiii'rir' sing, III
ii» i |
100 iii ;“l'.s“"r ‘ l
0
0 200 400 600 80010001200140016001800 2000
simulation times (us) simulation times (us)
Figure 4.20(a) Link #l6(router #l->#5)’s proﬁle (b) Link #l7(router #5->#1)’s proﬁle

Combine the routers and links’ power consumption together, Figure 4.21(a)-(d)
describe the NoC spatial power distribution in the 4x4 mesh architecture under burst
trafﬁc at different simulation time points of 5us, 8us, llus and 8014s. The X and Y
coordinates represent the physical locations of routers and links. The power of two links
connecting the same router pair on the opposite directions is added together since they are
assumed to reside in the same physical location. After the NOC reaches burst mode,
power consumption of routers and links involving in the burst is increased by ~3X
compared with the non-burst nodes as Figure 4.21(c) shows due to the suddenly increased
injection rate. When the burst ends and the trafﬁc returns back to uniform, the power
distribution proﬁle will become ﬂat as Figure 4.21(d) describes. The routers on four

edges consume less power than others in that the reduced number of input ports results in

150

less trafﬁc load. Here we only take dynamic power into consideration because leakage
power stays almost constant during the run-time compared with the dynamic one, which
is relevant to the switching capacitance. Figure 4.22(a) shows the leakage power’s
temporal profile for router #0 under the burst mode with the maximum power swing of
only 2.4%, while the dynamic power has at most 362.5% in power swing as shown in

Figure 4.22(b).

      

    
    

NoC Spatial Power Proﬁle at 5us NoC Spatial Power Profile at Bus
1
31000 — ‘ g
E T , T " " \ i E
: aoo . L E
0 L 4 v
s 600 . : a
o. 1' ' 1 3
‘6 400 . I 8.
E .1 '6
g 200 ; E
> W (U
D 0 . 7 . g
10 D
10 —. LL ‘ .
\ / 6
Y-position (mm) 0 0 X-position (mm) Y-position (mm) 0 0 X-position (mm)
(a) (13)
N00 Spatial Power Profile at 11us NoC Spatial Power Profile at 80115
1000 1* Li ‘1‘ x A
A H“; ;-. ~. “ 3
E000 ’4'} i xi 3%
E i A ‘ — f' ’ ‘ “ 1 s
g ‘ I - . l 3
a. . .E‘
'5 J E
E j cu
N ‘ ‘1 C
i . 5‘
o

 

\x/’ 4 ' . .
Y‘POSiﬁon (mm) 0 0 2 X-position (mm) Y-posmon (mm) 0 0 X-position (mm)
(C) ((1)
Figure 4.21 4x4 NoC power spatial proﬁle under burst at different time stamps. (a) Sus, (b) 8us,
(C) llus, (d) 80us

151

 

2.18— ___“. ’ W“ m" ”‘

 

 

 

 

--.,, ..L..,L ,0 ...J.
2‘13£111_L:Laa;£1@s:
/// Mme
Leakage(mW)////,/\ 838£88§§§§E§§l
2.4 ’/ .—
22 ’
2 .—
1.8
1.6
1.4 -~ ~~~~ ~— —————— *
1'? l1me(ns)
m 8 ‘3 E § 3 E 83“: v
w .
t a S 8 3 S §g timelns)

Figure 4.22(a) Leakage power proﬁle of router #0 (b) Dynamic power proﬁle of router #0
4.6. F all SoC System-level Power Simulation on Real Applications

Applying the proposed power model on the real applications, we explore the power
implications on the M—JPEG application described in the previous chapter. For
convenience, the SoC platform for simulating the M-JPEG application is re-drawn in
Figure 4.23(a), and another NoC conﬁguration featuring different communication
patterns is shown in Figure 4.23(b). To evaluate the computational power breakdown of
different tasks partitioned from M-JPEG, all the four PEs are conﬁgured as PowerPC
microprocessors. In Figure 4.23(a), the communication pattern to complete the encoding
of one video frame follows the path of router #(0,0) -> #( 1,0) -> #( 1,] ) -> #(0, l) -> #(0,0)
counterclockwise without overlapping between the packet flows in any two routers.
While Figure 4.23(b) features some overlapping in between two routers’ communication
flows shown as the dotted arrows. The original video frame data follows path p] to
traverse from core (1,1) to (0,0) to perform DCT, followed by p2 to send the transformed
data to core (1.0) for the quantization and zigzag scan. The re-ordered data in the
quantized frequency domain are sent to core (0,1) for encoding through p3, ﬁnally the
encoded data stream will be transferred to (1,1) again following p4 for output bitstream

dumping.

152

‘. ....................................
Ir ---------
(1)VLC encoding: @~ ® : (1)Quantization
' 1 ® ®
I

(2) Zigzag-scan
. .0 0 '
' @1-

(1)File read (1) D-CT
(2) RGB-YUV ------------------------------
(3)Downsampling

(4)bit-stream form

...-......

‘C....-... .

 

(3)
p4
:’
p1 .........................................................
Ir- --------- .-
' @ ©l I (1)File read

(1) VLC eneodi g .
1 (2) RGB-YUV
I |(3)Downsampling

: (4)bit-stream form
I

PowerPC It

‘C...-O......C..............

 

...-......CCCOO......O..........»

 

(1) DCT (1) Quantization
p2 (2) Zigzag-scan
p3 ""
(b)

Figure 4.23 M-JPEG application mapped on 2x2 NoC with (a) non-overlapped trafﬁc ﬂow, and
(b) overlapped trafﬁc ﬂow

Based on the two conﬁgurations, 6 video segments with different frame sizes and
contents are employed as the input to the SoC simulation platform. The properties of
these inputs are listed as follows:

(video_l) the size 0f each frame is 16x16, and the pixel values in each frame are constant

(video_2) the size of each frame is 32x32, and the pixel values in each frame are constant

153

(video_3) the size of each frame is 128x128, and the pixel values in each frame are
constant
(video_4) the size of each frame is 16x16, and the pixel values in each frame change
randomly
(video_5) the size of each frame is 32x32, and the pixel values in each frame change
randomly
(video_6) the size of each frame 128x128, and the pixel values in each frame changed
randomly

Tables 4.7~4.10 describe the average power consumption of each component in this
homogeneous SOC except for the interconnecting links under different inputs and 80C
architectures. In each table the (ci, vj) tuple represents the 80C conﬁguration (CI for
Figure 4.23(a), while c2 for Figure 4.23(b)) and the inputs from video_l to video_6. The
average power consumption of the three architectural components inside each NoC router
is presented as well as the dynamic power of the PE connected with this router. For the
input buffer and crossbar whose leakage power is modeled, their leakage value is also
evaluated given the ambient temperature of 300K.

Table 4.7. M-JPEG power simulation results on architecture 4.23(a) with input video_l

 

 

 

 

 

 

 

 

 

 

 

 

 

(c1, v1) Input buffer power(mW) Crossbar power(mW) Arbiter power(mW) PE power(mW)
Position dynamic leakage dynamic leakage dynamic dynamic
tile (0,0) 47.3 6.67 1.02 0.0953 0.835 59.32
tile (1,0) 50.9 6.69 1.09 0.0861 0.901 155.87
tile (1,1) 41.2 6.68 0.946 0.0668 0.72 62.31
tile (0,1 ) 45.7 6.68 1.01 0.0764 0.72 89.54

 

Table 4.8. M-JPEG power simulation results on architecture 4.23(a) with input video_2

 

 

 

 

 

 

 

 

 

 

 

 

 

(cl, v2) Input buffer power(mW) Crossbar power(mW) Arbiter power(mW) PE power(mW)
Position dynamic leakage dynamic leakage dynamic dynamic
tile (0,0) 47.7 6.67 1.03 0.099 0.842 58.94

tile (1.0) 51.1 6.69 1.09 0.0856 0.905 152.38

tile (1,1) 41 6.68 0.943 0.0797 0.717 60.79
tile (0,1 ) 45.9 6.67 1.01 0.0847 0.808 88.12

 

154

 

 

Table 4.9. M-J PEG power simulation results on architecture 4.23(a) with input video_3

 

 

 

 

 

 

(cl, v3) Input buffer power(mW) Crossbar power(mW) Arbiter power(mW) PE power(mW)
Position dynamic leakage dynamic leakage dynamic dynamic
tile (0,0) 45.9 6.66 0.992 0.106 0.81 57.62

tile (1,0) 50.5 6.69 1.08 0.0799 0.894 151.27

tile (1,1) 41 6.67 0.943 0.0846 0.717 60.18

tile (0.1) 45.9 6.67 1.01 0.0879 0.809 86.53

 

 

 

 

 

 

 

 

 

From table 4749, it is observed that the SoC’s average power consumption keeps
almost constant for the input videos (video_l , video_2, and video_3) with the same data
switching properties and different frame sizes. This is expectable since the average power
of both computation and communication is related to the energy consumption, which is
proportional to the experienced macro-operations (in PBS) or NoC Operations dependent
on the input data size, and the simulated time that is also dependent on the input data size.
We also observe that tile (1,0) performing DCT consumes the maximum dynamic power,
and tile (1,1) handling quantization/scan has the minimum power. It results from the
minimum information entropy that the packets traverse router (1,1). The tile (1,1)‘s input
ﬂits come from tile (1,0) where DCT is performed, and tile (1,1) will send the ﬂits
containing quantized/scanned data. Both of DCT and quantization operations reduce the
data entrOpy, which is related with the data switching property, from which the power
model is derived.

Table 4.10. M-J PEG power simulation results on architecture 4.23(a) with input video_4

 

 

 

 

 

 

(c1, v4) Input buffer power(mW) Crossbar power(mW) Arbiter power(mW) PE power(mW)
Position dynamic leakage dynamic leakage dynamic dynamic
tile (0,0) 64.3 6.67 3.33 0.106 0.878 59.31

tile (1,0) 67.5 6.69 3.68 0.1 0.899 155.87

tile (1,1) 42.7 6.68 1.07 0.0648 0.735 62.3

tile (0,1) 55.1 6.67 2.25 0.091 0.833 89.52

 

 

 

 

 

 

 

 

 

Table 4.11. M-JPEG power simulation results on architecture 4.23(a) with input video_S
I (c1, v5) I Input buffer power(mW) I Crossbar power(mW) I Arbiter power(mW) I PE power(mW) I
leakage I dynamic leakage I

 

 

I Position I dynamic

 

 

dynamic I dynamic I

 

155

 

 

 

 

tile (0.0) 65.7 6.66 3.38 0.11 0.901 58.94
tile (1,0) 67.6 6.69 3.64 0.102 0.904 152.38
tile (1,1) 42.6 6.68 1.07 0.077 0.844 60.78
tile (0,1) 55.6 6.67 2.25 0.0998 0.733 88.12

 

 

 

 

 

 

 

 

 

Table 4.12. M-JPEG power simulation results on architecture 4.23(a) with input video_6

 

 

 

 

 

 

(c1, v6) Input buffer power(mW) Crossbar power(mW) Arbiter power(mW) PE power(mW)
Position dynamic leakage dynamic leakage dynamic dynamic
tile (0,0) 63.7 6.66 3.29 0.1 15 0.872 57.62
tile (1,0) 67.1 6.69 3.63 0.0985 0.898 151.27
tile (1,1) 42.6 6.68 1.07 0.0818 0.733 60.16
tile (0,1) 55.9 6.67 2.26 0.103 0.847 86.53

 

 

 

 

 

 

 

 

 

Tables 4.10~4.12 depict the same property in terms of average power in each SoC
component. But compared with tables 4.7~4.9, the dynamic power of NoC components
has distinct increase with 32.3% in input buffer, 361.7% in crossbar and 0.5% in arbiter
in average. The ~3.6X increase of crossbar’s dynamic power results from the fact that the
random pixel data in each frame in video_4, video_5, and video_6 dramatically boosts
the data switching, which is proportional to crossbar’s dynamic power. For the input
buffer, only the buffer write operation’s dynamic power is related to the data switching
pattern, and it results in a much smaller increase in dynamic power compared with the
crossbar. The arbiter’s dynamic power is independent with the data switching on data
links but on the crossbar ports’ requesting status, and the simulation results clearly reﬂect
this property. The power consumption in PBS still keeps almost constant compared with
tables 4.7~4.9. This can be explained by that PE’s power is estimated using data-
independent macro-models.

Consider the 80C conﬁguration shown in Figure 4.23(b), where overlapped trafﬁc
ﬂows are present to model the increased communication load on NoC. Since the
simulation of the input videos with the same data switching properties results in the
similar power distributions although frame sizes are different, only the simulation results
for video_2 and video_5 are presented in table 4.13~4.14 respectively.

Table 4.13. M-J PEG power simulation results on architecture 4.23(b) with input video_2

156

 

 

 

 

 

 

(02, v2) Input buffer power(mW) Crossbar power(m W) Arbiter power(mW) PE power(mW)
Position dynamic Leakage dynamic leakage dynamic dynamic
tile (0.0) 46.4 6.66 1.01 0.0894 0.82 47.42
tile (1,0) 44.7 6.69 0.964 0.057 0.79 137.94
tile (1,1) 43.2 6.66 0.939 0.101 0.763 54.41
tile (0.1) 88.7 6.69 2.05 0.0978 1.55 84.05

 

 

 

 

 

 

 

 

 

Table 4.14. M-JPEG power simulation results on architecture 4.23(b) with input video_5

 

 

 

 

 

 

(c2, v5) Input buffer power(mW) Crossbar power(mW) Arbiter power(mW) PE power(mW)
Position dynamic Leakage dynamic leakage dynamic dynamic
tile (0.0) 54.9 6.66 2.32 0.105 0.82 46.78
tile (1,0) 52.3 6.69 2.15 0.0804 0.79 136.45
tile (1,1) 46.6 6.66 1.38 0.099 0.775 53.89
tile (0,1) 97.2 6.69 3.36 0.0963 1.14 83.72

 

 

 

 

 

 

 

 

 

In this SoC conﬁguration where trafﬁcs may share the same routers and links, the most
power-hungry tile changes to the location (0,1), where three trafﬁcs, one of which is the
original image ﬂits, traverse the router. The increased trafﬁc results in 93.25% and
74.82% more dynamic power consumption in the router (0,1). The minimum power
consumption still takes place on the tile where the quantization/scan operation is
performed due to the reduced data switching after transformation. The increased trafﬁc
results in ~11.5% of execution time to encode one frame, and in turn reduces the PE’s
average power consumption for ~10.46% in average compared with the SoC
conﬁguration with non-overlap trafﬁcs. For the leakage power, a small variation of
~1.5% on router’s power is identiﬁed due to the small dependency between leakage and
data switching properties, as well as the trafﬁc status in the NoC.

Finally consider the dynamic power breakdown among the partitioned tasks for
completing the M-JPEG application. The power breakdown also reﬂects different power
requirement for the computational tasks since all of them are mapped on the same type of
PBS with the identical power macro-model. Among the partitioned four tasks, the DCT
consumes 42% of power and is the most power-hungry task due to its high computation

complexity. The VLC encoding takes 24% power consumption, followed by the

157

Quantization/scan taking 18% of the total computation power. The least power-
demanding task is the one that handles RGB2YUV conversion, downsampling and ﬁle
handling, where the power consumption of ﬁle U0 is not explicitly modeled in this
framework. From the power breakdown, we envision that it will be beneficial to employ
hardware accelerator for DCT to reduce the PE’s power consumption and ﬂatten the
spatial power distribution proﬁle. On the other hand, using a dedicated ASIC core instead
of a microprocessor will exploit the inherent parallelism inside the DCT algorithm, and

hence increase the data throughput as well.

158

5. POWER-AWARE SOC DESIGN SPACE EXPLORATION FRAMEWORK

The platform-based SoC design methodology based on reconﬁgurable IP cores enables
designers to “assemble” an application speciﬁc SoC given a set of design constraints.
This process requires the following Operations including (a) resource allocation, (b) task
mapping and (c) computation and communication task scheduling, each of which may
have multiple choices to fulﬁll the functionality requirement. How to prune efﬁciently
and quickly among all the possible conﬁguration choices to get one optimal (or pareto-
optimal) SoC conﬁguration to meet the ever-stringent performance and power budgets is
the main task of Design Space Exploration (DSE). The ever-increasing design size and IP
library result in the exploding design space due to the dramatically increased nrunber of
potential conﬁguration combinations. In general, DSE considers two orthogonal issues:
(1) how can a single design point be evaluated efﬁciently, (2) how can the design space
be covered during the exploration process? [78] Chapters 2, 3 and 4 have answered the
ﬁrst question by proposing the system—level simulation framework to estimate
application’s execution time and power consumption for regular tile-based SoCs.
However, the latter question needs to be tackled carefully in all levels of abstraction since
an exhaustive search of the design space is usually prohibitive due to the sheer size of the
design space. Similar to the power modeling, it is also believed that the higher level is the
DSE performed, the less search time it will take, if the design is simulated during the
exploration process.

On the other hand, the market-demanding handheld wireless personal communication
devices ask for more integrated one-chip solutions for better performance, less power,
and lower cost. For embedded systems running on batteries, maximizing the battery-life
is becoming one of the chief design drivers. As such, targeting low-energy consumption
is then extremely important. For the high-performance system such as the cellular base
station, the battery life is not the problem, but lowering power consumption would reduce

the heat dissipation cost such as the cooling system, which is becoming extremely

159

complex in the PC system as what we saw before. Based on the proposed system-level
SoC simulation framework, especially the TSE-TEE-based retargetable PE models
enabling fast task remapping and simulation and the fully parameterized on-chip
communication infrastructure (NoC) model, we move forward and propose a power-
aware DSE framework ground on fast task-level application simulation. Given an
application represented by its task graph with minimum throughput requirements, the
exploration process consists of two related parts: global communication minimization
(GCM) and local energy-aware scheduling (LES). Before elaborating the proposed DSE
framework, we start with the mathematical formulation of the 80C energy-aware
mapping and scheduling algorithm.
5.]. Mathematical Formulation of 50C Minimum Power Problem

The application executed on the 80C is model by the communication-annotated task
graph (CATG) compared with the original task graph (TG) that was presented in the
processor model in chapter 2.
Deﬁnition 1: A Communication-Annotated Task Graph (CATG) G = C(T, C) is a

directed acyclic graph, where each vertex represents a computational module of the

application referred to as a task I, e T . Each 1,- has the following properties:

®A normal execution time array NR', where the j-th element nr; 6 NR‘stores the

execution time of task I,- if it is executed on j-th PE architecture under its normal

supplying voltage. If task t,- cannot be mapped onto j-th PE (e.g. the sorting algorithm

can’t be executed on a hardwired DCT core) due to the architectural constraints, nrj = oo.

When DVFS is applied with the supplying voltage of V = men / Rut/F5, the execution
time can be derived by:

i (Rt -1)2 i
"r1 (RDVFS) _ 2 ~nrj , where R: 2% (5-1)

I

 

R
_r__ _ / R
V RDVFS DVFS

I60

@A normal power array NE, where the j-th element ”836 NE’ stores the energy
consumption of task ti if it is executed on j-th PE in the architecture under normal
supplying voltage. If task I,- cannot be executed on j—th PE, nej. = 0.

Each directed edge C1416 C characterizes the communication or control dependency

between I,- and t}. A communication matrix (CM) denoting the communication volume

among these tasks is created by:

 

 

cmoo cmm 6mm;
cm”, cmLl cm”,
CM 2 ,where em”. = v(c,_j) standing for the
_me cm",l emu." JIM

communicating bit number from task I,- to I}. In this matrix cm,-,,- = 0, and cm”- : 0 when
task I; and tj don’t communicate with each other.

The NoC connectivity and link bandwidth is represented by the NoC architecture graph
(NAG).
Deﬁnition 2: An NoC Architecture Graph (NAG) G = C(P, R) is a directed graph, where

each vertex p, e Prepresents one router in the NoC, and each directed edge rm. 6 R

represents the unique route from p,- to p,- under the deterministic XY-routing algorithm or
other possible non-minimum routes under the adaptive WF or NL-routing algorithms.
Each r,-.,- associates with two metrics e(r,-J-) and l(r,-J). e( r,-, 1) stands for the minimum energy
consumption of sending one bit of data from p,- to pj. [(riJ) gives the minimum latency (in
cycles) for each piece of data transferred on that route. Note that although we use the
same definition as [105], the energy and latency for each possible route is not available
until a NoC characterization simulation is performed with the help of TG/TR models.
Given a tile-based NoC with the size of M xN, two matrices can be used to model the

NAG:

161

Energy matrix: EM 2 , and

 

_e(rM-l.0) 6(rM—IJ) e(rM—l.N—l)_MxN

P [(57.0) l(’b.r) [(rOJV—l) -
[(5.0) [(ru) [(rlJV—I)

Latency matrix: LM 2

 

 

_l("M-r.o) [(rM—IJ) [(rM—l.:‘V-I)_MxN
Each element in EM and LM is pre-calculated or characterized before the exploration
starts. They provide the information of how information can be routed from each node to

another on the NoC. If the NoC is constrained that no information can be routed from p,-

to pj, the elements e(rl.‘j )and [(ri'j)in EM and LM are both 0.

Currently LM is deﬁned by the condition that there is no trafﬁc congestions on NoC
(the minimum latency between two nodes). In the actual application, it is very possible
that information (packet) is block at the router due to the congested trafﬁc, which results
in much larger latency. Compared with the deterministic minimum latency, the actual
latency after taking the congestion into consideration is decided mainly by the overall
NoC trafﬁc pattern, whose information cannot be extracted until the NoC communication
characterization simulation is performed. We deﬁne the NoC congestion factor matrix

CFM

“06.0) “06.1) “(QM—r)—
“(4.0) dbl!) “(rm—r)

CFM =

 

 

frost—1,0) a(rM-l,l) a(rM—l,N—1)Mx~

, where each element a(r..‘ )2 1 describes how congested the path between two NoC nodes p.-

to pj. In each round of NoC simulation, each CFM element may be changed resulting
from the new trafﬁc pattern. And a trafﬁc congestion threshold K, is deﬁned to
characterize how two different trafﬁc patterns inﬂuence the CFM. Assume that there are

two trafﬁc patterns tr, and trz, under which their CFMs have the following relationship:

I62

M—IN-l

22(al(’i_j)—a2 (rt; )F

=r:0j=() <K
M-N ‘ '

 

ct: = |(CFM, — CFM2 )2

 

, the two trafﬁc patterns tr; and tr; are compatible in congestion. When tr; and tr; are
compatible with each other, we can use the same CFM to represent these two trafﬁcs. The

actual latency matrix for NoC is the dot multiplication between its CM and CFM,
F [(5,0) “(fool [(rm ) “(furl l(’b.N—r )' a(’b,N-r)

1(r..o)-a(r.,0) 1(r...)-a(n..) 1(r..~-l)-a(r..~-i)

—

ALM =LM0CFM =

 

 

Ll(r,,,_,.0)-at(r,,,_,‘u) l(rM_L,)-a(rM_L,) [(rM-l.N-l).a(rM-l.N-l)JMxN
And this is the metric used for evaluating the on-chip communication overhead during
the exploration process.

Using these deﬁnitions, the energy-aware scheduling problem for heterogeneous SoC
with NoC on-chip interconnecting infrastructure under real-time constraints can be

formulated as:

 

 

Given a CATG and an NAG ﬁnd a task mapping function Mm“ ) from tasks( T) to
PEs( P), and a communication mapping ﬁtnction Mmmm( )fmm PEs( P ) to NoC( N), which :

min E = zeMM,,)(V;d)+ 212(6); j)x elm... (Mm (r. )).M...,,.. (M... om]

Vin,- vcw.

and

@All the control/data dependencies are satisﬁed.
@The throughput requirement of the CATG is met.

@Mmk( ) can be multiple-to-one/one-to-0ne mapping (one or multiple tasks is mapped

 

onto one PE), while Mmmm( ) must be one-to-one mapping.

 

5.2. Related Work on DSE Framework for Multi-core SoCs
Before presenting our SoC design space exploration methodology, we review some of

related research work published recently.

l63

Exhaustively evaluating every possible design point is the most straightforward way to
prune the 50C design space. However, evaluating every possible combination of design
parameters is prohibitive for large design spaces. The design space can be reduced by
limiting the range of parameters and/or by parameter quantization. The search process is
completely unguided and unbiased towards preferences of the designer. [79][80][81]
propose the DSE methodologies based on exhaustive search at system—level simulation.
Other methodologies, including high-level synthesis [82][83][84][8S], ADL-driven
approaches [86][87], cycle-accurate simulations [88][89], instruction set simulators [90],
code parallelization and partitioning onto multi-processors [91], trace-based analysis
[92], and last but not least static analysis [93] based on full search are also proposed.

Evaluating only random samples is the obvious choice for coping with large design
spaces. It also has the advantage of revealing an unbiased view of the characteristics of
the design space. In [94], a Monte Carlo-based system is described where random
samples of the solution space are generated by randomly creating constraints for system
synthesis. Another approach is to apply simulated annealing techniques. Starting from an
initial design, changes to the design become more and more unlikely with advancing
search time. In [95], Srinivasan et al. compare explorations driven by simulated
annealing with results using an evolutionary approach. Gajski et al. [96] combine an
exhaustive search over all possible architecture allocations with a simulated annealing-
based exploration of mappings for each allocation.

Trace-based performance/energy design space exploration presented by Lahiri et al.
[92][97] is applied to the design of on-chip communication structures. An initial system-
level simulation of communicating components representing the workload environment,
i.e. HW and SW components surrounding the communication structure under
investigation is performed to collect traces of communications going on between
components. Traces are compressed by generating a communication analysis graph

(CAG) that accumulates burst transfers and computations into single events. Given a

164

communication substructure template consisting of point-to—point and shared connections
as well as bridges, the CAG is modiﬁed to incorporate effects of synchronization,
arbitration, and block transfers. Modiﬁcations include adding nodes and adjusting time
stamps of nodes, which enable the estimation of performance and resource utilization.
Zivkovic et al. [98] augment traditional traces that usually contain information on data
transfers and task executions only, with control information in order to evaluate the cost
of control as well.

Analytical methods come into play if deterministic or worst-case behavior is a
reasonable assumption for the system under evaluation. In addition, building an
executable model of the system as well as simulations might be too costly or even
impossible at the time of the evaluation. Analytical models in particular ease early design
decisions by identifying comer cases of potential designs. Eles et al. in [99] present a
scheduling algorithm for distributed embedded systems that takes into consideration bus
access optimization. Sih et al. in [100] propose a compile-time scheduling heuristic called
dynamic-level scheduling which accounts for inter-processor communication overhead.
Both papers try to maximize the performance of the system without considering the
system’s energy consumption. In [101], the authors perform. battery-aware task
scheduling for multiprocessor systems by moving tasks to smoothen the power proﬁle. In
[102], a dynamic programming technique for solving the multiple supply voltage-
scheduling problems is proposed with the assumption that tasks are already mapped to
processors. Similarly, [103] considers a generalized task graph with inter-task
communication and suggests an algorithm for partitioning and mapping. That algorithm,
however, does not try to minimize the total interprocessor communication volume. More
recently, [104] describes the need for a scheduling algorithm to take into account the
voltage selection awareness. All of the aforementioned algorithms ignore the impact of
scheduling on the interprocessor communication volume.

Hu et. al. [105] propose an energy-aware communication and task-scheduling

165

algorithm for the regular tile—based NoC architectures under real-time constraints. In this
framework, the application and the NoC architecture are abstracted to directed graphs,
and the mapping and scheduling are performed by the task-level slack allocation
budgeting, level-based communication scheduling, and ﬁnal global search and repair.
This work is based on the assumption of the pre-modeled PE/NoC and is essentially a
static multi-processor scheduling algorithm that doesn’t take voltage and frequency
scaling into consideration. [106] proposes the modiﬁed communication-aware task
scheduling for SoC considering variable voltage selection for minimizing the total energy
while meeting all timing constraints. A priority-based task ordering and best ﬁt processor
selection are developed to map and schedule the tasks and adjust each task’s supplying
voltage for energy saving. A communication ignorance parameter is employed to tune the
communication awareness in the algorithm. This is still a static scheduling algorithm
given each task’s WCET, which may under-utilize the available processing resources. [ ]
et al. present an energy- and performance—aware mapping for regular NoC architecture
using branch—and-bound algorithm as a complementary work of [105].

Compared with these work, our exploration framework differentiates them greatly by
employing the in-loop real application simulation to characterize the application and
communication performance and energy metrics and this type of run-time fast
simulation—based characterization is performed dynamically according to the exploration
process. And we explore both the mapping from task to PE, as well as the mapping from
PE to NoC, in which the ﬁrst mapping determines the latency and energy consumption
locally in each PE-level, while the second one would inﬂuence the global communication
latency and energy. Moreover, the in-loop fast task-level simulation, which is the core of
this design space exploration framework, beneﬁting from the retargetable TEE-TSE PE
models presented in chapter 2.

5. 3. Global Communication Minimization Energy-Aware Exploration Framework

Since finding an optimal schedule for a multi-core system that consumes the minimum

166

energy is known to be NP—complete or NP-hard[105], which means that no polynomial
time algorithm that can guarantee the optimal solution. We propose a heuristic and
iterative approach combining global communication optimization and local reﬁnement to

obtain satisfactory solutions with reasonable computation time.

 

 

 

 

 

Communication Annotated Task Graph
(CAT G)

Tile-based SoC architecture
Figure 5.1 Tile—based SoC architecture and task mapping
Although this algorithm can be easily applied to architectures with different NoC

topologies, in this work, we focus on the tile-based mesh NoC on-chip interconnection
infrastructure shown in Figure 5.1 describing an abstract view of the NoC architecture
where each rectangle represents a tile containing a PE and a router connected by NI as
described in chapter 3. A PE can be a high-performance, energy-hungry CPU, or a low-
power ARM processor, or dedicated hardware accelerator. The timing and energy macro-
model is characterized for each PE architecture. Each PE can be conﬁgured by its base
operating frequency (fbase) and normal supplying voltage (men). In the future, more
conﬁguration parameters such as cache will be taken into consideration. The NoC has the
ﬁxed size of 4x4, and each router is conﬁgured by its router’s three architectural
parameters: buffering, output arbitration policy, and routing algorithm. Here the

wormhole routing for packets delivery which seems to be a reasonable solution for NoCs

167

is applied. A 5x5 full crossbar switch is used as the switching fabric in the router.

The CATG in the right part of Figure 5.1 will be ﬁrst partitioned and mapped to
different PEs, which will be further mapped onto the NoC shown in the left part of Figure
5.1. The application performance constraint is represented by its minimum throughput
Thmm. We use throughput as the performance metric instead of cycle number in this work
because the heterogeneous SoC has multiple PEs operating at different frequencies,

which cannot be easily translated into cycle—based metrics under one common frequency.

 

 

[[3 192
"ism/>3
NoC library 1
<TGf1’R lbuf_FP_XY,
Ibuf_FS_XY
L4 0 o o . '

 

 

 

 

EM(ri,j), LMIriJ)

"
O

 

 

 

 

NRj, NE}

 

.......... d Initial mapping
Mtask( ), Mcomm( )

 

 

 

5 xSoCSim L

,-’—_:I::.\ modei_/"

AdJUSt J I task-level performance
M188” _/ , simulation

<Th > Thmiﬁb
/

no \
yes 1
xSoCSim w.

' 1
no Lag]

/‘

no ‘\
< >= Pmin‘»

/

 

 

 

yes

\\
, ’ - \‘7/ yes Adjust ‘
E\W._CMF9':9/ [Admtll
V \V'/

FIE nd'

Figure 5.2 SoC DSE framework based on GCM

It is well believed that global on-chip communication is asking more and more timing

 

 

l

 

slack and energy budget due to the worse scalability of the global wires laid out on the

top metal layers in the advanced sub-100nm CMOS technologies. But on the other hand,

168

the increasingly stringent performance requirement makes processing the whole
application, which is becoming more and more complicated and contains tens of
hundreds of computational intensive task, on one PE unfeasible. The trade-off here is that
the application tasks are mapped onto as small numbers of PBS as possible so that the
global communication overhead is minimized, and at the same time the performance
requirement is still met. This is the key point of the global communication minimization
methodology that we applied in our DSE framework. Figure 5.2 presents the main
architecture of the proposed SoC design space exploration framework based on the global
communication minimization.

There are three main blocks in the SOC exploration framework, Initial Task
Characterization (IT C), Initial Communication Characterization (ICC), and the Global
Communication Minimization (GCM). Given an application model represented by the
CATG, and a PE library containing Np PBS with its macro timing and energy models, ITC
performs the task-level simulation for each single task mapped on each PE to acquire the
execution time matrix NR and power consumption matrix NE at normal operating voltage
and frequency. For a CATG with NI tasks, the resulting NR and NE will be represented by

the following two arrays:

 

 

F _
ton to] t0,N,-1
NR _ tro In t1,N,—-1
NPxN, '—
_tN,,—I.o INP—IJ [Np—1,N,—1‘
_ _
100.0 p0,] pom—r
NE pro [71,] P1,N,—1
prN, _

 

 

_pr—ro pr—u PN,,-1,N,—1_

169

Since the performance and power simulation can be performed by the same PE model,
the total number of simulations in ITC is N,,xN,. Although the ITC simulations pose some
overhead to get the two task/PE perfonnance/power matrices, it is still acceptable in the
whole DSE framework because (I) ITC is only performed one time before the
exploration core algorithm, which is the main loop starts, so its run-time, which may not
be neglectable, will not take a big portion of the overall run-time; (2) With the fast
macro—model simulation framework equipped with macro-models, each simulation in
ITC would be very fast on the host machine.

Similar to ITC, ICC (Initial Communication Characterization) is performed on the
NoC architecture with the synthetic trafﬁc generators without any trafﬁc congestions. The
ICC will generate EM and LM, while the congestion factor matrix (CFM) is an all-l
element matrix because of the non-congested background trafﬁc. ICC provides the basic
information of the NoC properties. Different from IT C, NoC is characterized by is router
architecture, which has 12 different conﬁgurations in our work. NoC with each router
architecture will be characterized with a different set of EM and LM, which will be used
to explore the NoC architecture during the exploration. For the 4x4 router, 16x12 = 192
simulation runs will be performed to acquire all this information. Both ITC and ICC are
automatically performed by the Perl scripting language.

The initial mapping is performed outside of the DSE main loop shown in Figure 5.2.
We start with the scenario where there is no global communication, which means that all
tasks are mapped onto only one PE, which is chosen randomly from PE library. And this
PE will be mapped to an initial NoC architecture, which is set to be constructed by the
router with architecture of Ibuf_2_FP_XY. Since there is zero global communication,
there is no communication power on the NoC in this case. This PE and NoC
conﬁguration will be used to construct a SoC simulation model presented in chapter 3
under the xSoCSim framework with the annotated timing and energy models. This

executable model timing/energy annotation will run to generate the performance and

1 7O

energy numbers. Firstly the performance metric is evaluated to see whether the
anticipated throughput meets the speciﬁcation, which is provided as a performance
constraint before the exploration process starts. It is most likely that the performance
doesn’t meet requirement in the initial mapping where all tasks are mapped onto one PE
resulting the zero communication overhead. And additional PE will be employed to take
some of the ﬁrst one’s computational load. This scheme is similar to the processor
pipeline, where the throughput is mainly decided by one stage’s latency if there is no
pipeline stalls and other hazards. But the difference is adding another PE requires the
inter-communication between the PEs through the NoC infrastructure. Another key point
in the GCM-based exploration is to minimize the inferred NoC communication when
another PE is added. We employed the algorithm based on the minimum weighted cut set
on CATG to decide which task, or task set, will be partitioned to the second PE.

Deﬁnition 3: A Cut Set is a partition of CATG, and it includes the partition of both the
vertices and edges in the graph. After partition, a CATG becomes two sub-graphs G) =
GI(T1, C1), G2 = G2(T2, C2) shown in Figure 5.3(a). G1 and G2 are two sub-graphs of

original G = G( T. C). All edges c“ e C,.(k = 1,2;i = 1.2....) that intersect with the contour of

the cut for each sub—graph is referred to the main edges in each sub-graph. If a CTAG is
partition to two sub-graphs but one cut, they will share the same set of main edges. In the
weighted CATG, the sum of each main edge after cut is deﬁned as the Inferred
Communication Load (ICL) in the units of number of bits communicated between to sub-
graphs.

Figure 5.3(b) presents the algorithm of ﬁnding a minimum weight cut set from the
given CATG so that the partition of CATG has the minimum ICL. It starts will an empty
cut set, and iteratively checks the inferred communication load for the possible vertex
permutations, and keeps the current permutation if its ICL is less the previous iteration.

For a CATG containing N tasks, the possible search space is:

171

s: Xe; =C;,+c; +....+c;“

i=l.2...N

 

- N(1\l/!—l)+N(N—;)!(N—2)+m+1 (5.2)
, which is pretty large when the task graph size N is big. For example, for a CATG with 5
tasks, S is 56, and if the tasks number increases to 20, the search space S would explode
to > 2000. So we deﬁne a time-out threshold or an ICL threshold to reduce the search
time. Once the evaluated ICL is smaller than the ICL threshold, or the run-time exceeds
the time-out threshold, the algorithm will end and the current permutation with minimum

ICL will be returned.

 

F ind partition with minimum lCLfrom G()

  
  
 

t1 .13)
P = {}; G={T,C}; min_ICL = 0;
for each node t, in T
calculates the accumulated weight of its
associated edges
Whileﬂ Time-out)
{P = permutation(G);
G = G — P;
if(lCL(P,G) < min_ICL)
{min_lCL = ICL(P,G);
Pmin = P:}
i
return Pmin, min_ICL:

}

CATG with out set

 

 

 

Figure 5.3(a) Algorithm to ﬁnd the cut with minimum ICL (b)CATG with cut set

After partitioning tasks to other PE and updating the task mapping function M,a,k( ), a
new simulation model will be created to run the partitioned application. This is also
iterative until a partitioning that meets the performance requirement is found. If it still
doesn’t meet the performance requirement after partitioning CATG into 2 PBS, the third
PE will be added to take some the tasks further partitioned from one of the two sub-
graphs using the same GCM method. This process will be performed iteratively until a
partition satisfying the performance requirement is found. Based on the reﬁned

application partitioning and mapping, the power minimization will be performed as

172

what’s shown in Figure 5.2.

The target function during the application power minimization contains two elements:
power consumption from PE, and power consumption from NoC. Actually from GCM,
the NoC communication has already been minimized and it indicates that the
communication power is the lowest to achieve the required performance. There are two
possible exploration processes to further reduce the power consumption while keeping
the performance metric.

The congestion factor matrix CFM will take an important role during the power
optimization. The main idea of employing CFM is to evaluate the communication
between two mapping schemes. The exploration framework will perform the task or
communication re-mapping based on the current and previous CFMs. During the
exploration, the CFM is calculated by: CFM = ALM / LM , where ALM can be achieved
from the system-level SoC fast simulation and LM is the ICC characterization result. For
two different mapping schemes, if they have compatible CFMs, the next exploration
procedure will change the task mapping function Mmk( ) to reduce the operating voltage
and frequency to reduce the PE power consumption. To avoid over-reduce the
voltage/frequency, a new xSoCSim model will be created after the updated Mmtl ) to
evaluate the performance, which should meet the throughput requirement. Once the
performance is met, the algorithm will return with the minimum power consumption
when executing this application. If two different mappings don’t have compatible CMFs,
the next step will be changing the communication mappings to achieve the compatible
CFMs before adjusting PE’s voltage and frequency.

It is possible that the search will go to some space that never converges, and the power
consumption estimation is vibrating among several values. To avoid this situation, a time-
out parameter is deﬁned in the algorithm, and it will break the search and move it back to
the starting of the loop to get a different initial mapping once the run-time exceeds the

time-out parameter. It is also possible that the search result is the local minimum value,

173

which is still larger than the potential global minimum one. So far this algorithm doesn’t
equip with the abilities to automatically to avoid ﬁnding the local minimum. The way
that we propose to avoid the local minimum is to perform multiple searches with different
initial mapping, so that the algorithm has more probability to cover those local optimal
points to get the global power-optimized mapping.

The proposed exploration framework along with its search algorithm is called
simulation-in-loop DSE since the main search loop uses the fast system-level SoC
simulation to direct the search procedure. Both latency and power metrics are evaluated
to ﬁnd the optimal power consumption task and communication mappings to meet the
performance constraint. The whole simulation framework is written in Perl scripting
language and it invokes the xSoCSim model to run the simulation and do the data
collection.

5.4. Local PE/NoC Reﬁnement

After we ﬁnd a power-optimal mapping for an application, local PE/NoC reﬁnement
can be used to further reduce the global power consumption without degrading the
performance. The rational of local refinement is the pipeline-style on-chip
communication in the 80C architecture. The heterogeneity of PBS along with the mapped
tasks result in different processing latency on each PE, and the overall application
throughput is determined by the PE/router—tile with the maximum processing latency. The
basic idea of local reﬁnement is to modify those tiles with smaller processing latency,
which doesn’t not contribute to better throughput performance in the system while does
contribute to more power consumption. For example, a NoC node connecting a 200MHz
processor has the average sampling latency of 20ns, and the router’s communication
latency is 60ns. If the overall throughput is lOMsps, which indicates that the slowest node
needs lOOns to ﬁnish the processing communication task. There is 20ns’ timing slack for
the fast node, and it will not hurt the performance if we slower down the processor to

100MHz to get 40ns’ processing latency. A direct effect of slowing down processor from

174

200MHz to 100MHz is that the PE dynamic power consumption reduces 50%. Or
similarly, we can reduce the router’s input buffer size to increase its communication
latency from 60ns to 80ns, and this method reduces the router’s power consumption.

The difference of local reﬁnement compared with the GCM is that it is performed
locally so that the global performance/power metrics such as LM and CFM are not
evaluated. Before the local reﬁnement, the pre—Optimized mapping will be simulated to
generate each node’s processing and communication latency, which are added together to
get the total latency. The minimum total latency is identiﬁed as the local reﬁnement
baseline. And each other node’s timing slack will be computed by subtracting with the
minimum total latency. This slack can be applied to PE or NoC router locally to further
reduce the power consumption.

If this slack is applied to PE, dynamic voltage and frequency scaling will be invoked to
select a reduced supplying voltage as well as the corresponding frequency according to
(5.1). No further simulation is necessary to verify the PE’s local reﬁnement. And if this
slack is applied to NoC router, its architecture can be changed to reduce the power
consumption. The most frequently used method is to reduce the buffer depth to degrade
the latency performance with the payback of better power properties. Since NoC router
communicates with its neighboring ones, it is necessary to perform a veriﬁcation
simulation to ensure that the change of this tile’s router architecture would not influence
the overall communication performance. CFM deﬁned before is used to perform this
veriﬁcation. The router architecture reﬁnement is acceptable if the CFM of the NoC with
modiﬁed router architecture is compatible with the original one running the same
application.

After the local PE/NoC reﬁnement, the ﬁnal optimal application task and
communication mapping is achieved. In this framework, the local reﬁnement is an option
so that user can choose whether to activate after GCM or not to trade off the exploration

time and the ﬁnal optimal power consumption.

175

6. CONCLUSIONS

As the system-on-chip integration is becoming a trend of today’s semiconductor
system-level solution by combing all the system components onto one die with the
evolvement of the advanced semiconductor fabrication technologies. Not only the
hardware integration, software/ﬁrmware is much more tightly coupled with the 80C
development cycle since the programmable processor is frequently engaged into the 80C,
and it becomes very beneﬁcial that the firmware be developed as early as possible during
the system design cycle. The increasing complexity of the modern SoC containing tens or
hundreds of different processing cores (PEs) asks more efficient development tool ﬂow
and design methodology to increase the design productivity. In additional to the
computation, the future SoC with multiple processing cores heavily relies on the on-chip
communication to synchronize their operations. And the on-chip communication is
becoming a hot spot of the current research since the future SoC is becoming more
communication-centric compared with today’s computation-centric bus-based SoCs. The
conventional simulation - synthesis — simulation design ﬂow is far too slow in the
complex SoC era; especially there are multiple PEs, each of which may be a general
purpose or application speciﬁc processor, or a dedicated hardware accelerator in the
system. This forces the 80C designers to raise the level of abstraction to a higher one
from RTL-level to system-level, which more focuses on the interactions between the
integrated IP cores in the system since the target application needs much inter-core
communication and synchronization since each core only handles one or multiple tasks
partitioned from the application. At the system-level, the modeling of the 80C is pretty
different from what we have done in the RTL level based on HDL-synthesis, since the
system-level model needs to appropriately abstract some architectural details to extract
the performance/power properties from the whole system’s point of view, not a single IP
block and its sub-block. System-level SoC modeling is still a very active research ﬁeld in

both industry and academia, and different modeling tools and methodologies have been

176

proposed.

This work proposes and implements a system-level multi-core SoC modeling and
simulation framework compatible to the new IEEE-1666 SystemC standard, which was
approved by IEEE in 2005. Actually the main phase of this work was performed when the
SystemC was evolving before becoming an IEEE standard. The modeling methodology
in this framework covers both processing element and on-chip interconnection, and it
also integrates them seamlessly to an executable SoC simulation model running the
partitioned application directly on the host machine without the target-dependent
software and hardware synthesis (we refer to the compilation of software to the processor
binary as “software synthesis”), which is time-consuming when the system scale is large.
To fulﬁll this target, the multi-granular time model, which is a critical factor in the
electronic system performance model, is applied to the modeling of both processing
elements and on-chip communication. For those features such as the intrinsic wire delay,
the physical time model is applied to acquire the best accuracy; the cycle-accurate model
is applied for those blocks requiring cycle-true simulation performance; and for the
feature that cycle-accurate level is too ﬁne but the physical level is too coarse, cycle-
approximate accuracy is applied and it focuses on the timing on the boundaries of the IP
blocks of interest, which is cycle-accurate, while the internal operation timing is not
evaluated since the system-level simulation doesn’t need this detailed information. The
combination of the three different time models enables us to explore the system with fast
simulation speed, sufﬁcient accuracy, and moreover, the homogeneous SystemC
environment.

For the two major SoC functional blocks, different modeling methodologies are
applied to achieve the best trade-off between the simulation speed and modeling
accuracy. We employ the macro-modeling methodology to model the Processing Element
(PE), and the key point of this is that the common operations of the processing element,

which may be a programmable microprocessor or a parallel hardware accelerator, are

177

extracted and characterized for both performance (timing) and energy. And the
characterized data will be annotated to the task code that is mapped onto the target PE, so
that the timing and energy features can be acquired during the simulation on the host
machine without invoking the target architecture dependent compilers or HDL
synthesizers. This system-level annotation also enables to create a retargetable PE
simulation model, which supports heterogeneous PE architectures under the same
framework when estimating timing and energy. Essentially speaking, the macro-model
itself is not a completely new idea, but our contribution is that we utilize the macro-
modeling method to build the target independent PE performance/energy modeling
framework supporting task-level simulation, which has not been seen in the publications
before.

For the 50C on-chip communication infrastructure, we focus on the new area of
network-on-chip (NoC) instead of the maturing and conventional shared-bus architecture.
The concept of NoC is based on the original network where information is transferred in
the format of packet under the control of routers spreading through the network. The key
beneﬁt of NoC is that it breaks the long global communication links into multiple short
ones connected via on-chip routers. This is very important since it is well believed that
the performance and power of the global wires in the evolving CMOS technologies
doesn’t scale down as transistors and local wires, so that using short global wires can
reduce the wire load to get the better communication performance and power. Currently
NoC is still a developing idea and there are no mature products in the market, and the
NoC modeling is still in its infant stage. Due to the unavailability of reference NoC
models, we developed our in-house NoC parameterized hierarchical simulation model
from ground-up using SystemC. In this NoC model, the main functional modules are
routers and links. The router is fully parameterized by 4 architectural parameters: input
buffering, output buffering, routing algorithm and arbitration policy. With the change of

each parameter, new router architecture will be created. The on-chip router model

178

implements 3 routing algorithms (1 deterministic, and 2 adaptive), 2 arbitration policies,
and unlimited buffer depths (if the area budge is large enough). With different
connections among these on-chip routers, 4 different NoC topologies including mesh,
torus and half-torus are modeled. To fully utilize these conﬁgurations, a NoC generator is
developed to generate different NoC architectures by parsing the input NoC conﬁguration
ﬁle. And we also develop the trafﬁc generator to emulate the actual PEs when
characterizing N 0C for the performance and power.

NoC performance and power are modeled architecturally compared with the empirical
PE macro model. We develop each performance and power models by the NoC
components architectural information. The annotation of the developed
performance/power model enables the run-time acquirement of the performance and
power information for different trafﬁc patterns. Moreover, the power model is based on
the bit switching factor by tracing two consecutive data payloads on each router and link,
so it is highly data-dependent. This feature provides the application-dependent
power/performance estimation compared with those proposed statistical models.

Based on the PE and NoC models, a complete SoC simulation model running the
partitioned application code directly with the appropriate performance/power annotation
is developed and it is named xSoCSim. The most promising feature of this model is that it
runs the simulation without the requirement of any target—compiler/synthesizer with the
help of PE macro-models. And the communication pattern is formed by the network
interface (NI) bridging between PE and NoC, and it also decouples the computation with
the communication. The mapping of different tasks onto PEs is easily performed with the
help of object-orientated application and task model coupled with the PE models. We
have successfully ported two applications: M-J PEG and WCDMA ﬁltering array into this
simulation model. And the packaging and publication of this model is under progress.

Based upon the proposed model supporting fast task-level SoC application simulation,

we also develop a preliminary SoC design space exploration framework to prune

179

 

different PE/NoC architectures to achieve the minimum total power consumption while
still meeting the throughput constraint. The distinct feature of this exploration framework
is that it exploits the fast in-loop application simulation facilitated by the xSoCSim model
to collect the run-time performance and power data if necessary compared with other
works where the all the performance and power metrics are pre-characterized. Although
this exploration framework still needs the characterization, it needs only to be executed
once at the beginning outside the main searching loop. The main searching loop features
minimizing the global communication overhead based on the assumption that the
communication latency and power is taking a big portion of the system performance. And
an optional local PE/NoC reﬁnement based on the slack is used to further lower the
processing/communication power while keeping the overall system performance.
Looking ahead of this work, we envision some further research that can be done in the
future. On the PE side, it would be beneﬁcial to add the cache model to the processor
core since cache contributes a big portion of area well as power. On the on-chip
communication architecture side, the NoC would be expanded to support the virtual
channel, which brings the beneﬁt of reduced trafﬁc congestions in the communication-
centric applications. Another work that can be done is to add more automation to this
framework, so that it supports automatic application partitioning and computation and

communication annotation.

180

BIBLIOGRAPHY

1. L. Benini, GD. Micheli, “Network-on-Chips: A New SoC Paradigm”, In IEEE
Computer; 35(1), Jan 2002, pp. 70—78.

2. X. Zhu, and S. Malik, “Modeling Operation and Microarchitecture Concurrency for
Communication Architectures with Application to Retargetable Simulation”, In
Proceedings of CODES +1555 ’04, Sept. 2004, Sweden.

3. K. Roy et al., “Leakage current mechanisms and leakage reduction techniques in deep

submicrometer CMOS circuits,” In Proceeding of the IEEE, vol. 91, no. 2, pp. 305 ~
327, Feb 2003.

4. M. Becer, R. Vaidyanathan, C. Oh, and R. Panda, “Crosstalk Noise Control in an SoC
Physical Design Flow”, In IEEE Transactions on Computer-Aided Design of
Integrated Circtuis and Systems, Vol. 23, No. 4, Apr. 2004.

5. N. Cohen, T. Sriram, N. Leland, D. Moyer, S. Butler and R. Flatley, “Soft Error
Considerations for Deep-Submicron CMOS Circuit Applications”, In IEDM
Proceedings of IEEE International Electron Device Meeting, pp. 315-318, 1999.

6. http://public.itrs.net/

7. A. Sangiovanni-Vincentelli, “Deﬁning Platform-based Design”, In EEDesign, Feb.
2002.

8. K. Keutzer, S. Malik, R. Newton, J . Rabaey and A. Sangiovanni-Vincentelli, “System
Level Design: Orthogonalization of Concerns and Platform-Based Design”, In IEEE

Transactions of Computer-Aided Design of Integrated Circuits & Systems, Vol. 19,
No. 12, Dec. 2000, pp.1523-43.

9. http://bwrc.eecs.berkeley.edu/Classes/IcBook/SPICE/

10. T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An Infrastructure for Computer
System Modeling,” IEEE Computer, February 2002.

11. www-306.ibm.corn/chips/techlib/techlib.nsf/ techdocs/
12. A. Agrawal, A. Bakshi, J. Davis, K., Ledeczi, S. Mohanty, G Nordstrom, V. Prasanna,

C. Raghavendra, and M. Singh, “MILAN: A Model Based Integrated Simulation
Framework for Design of Embedded Systems”, Workshop on Languages, Compilers,

181

and Tools for Embedded Systems (LCTES 2001).

13. Y. Zhang, D. Parish, K. Sankaranarayanan, K. Skadron, and M. Stan, “HotLeakage: A

l4.

16.

17.

18.

19.

20.

21.

22.

Temperature-Aware Model of Subthreshold and Gate Leakage for Architects”,

Technical report of ECE Department, University of Virginia, Charlottesville, VA,
2003.

L. He, W. Liao, and MR. Stan, “System Level leakage Reduction Considering the
Interdependence of Temperature and Leakage”, In Proceedings of Design Automation
Conference, Jun. 2004, San Diego.

. C. Isci and M. Martonosi, “Runtime Power Monitoring in High-End Processors:

Methodology and Empirical Data”, In Proceedings of the 36th International
Symposium on Microarchitecture, pp.93-105.San Diego, 2003.

R. Joseph and M. Martonosi, “Run-time Power Estimation in High Performance
Microprocessors”, In Proceedings of ISLPED ’2001 , pp.135-140, California, 2001.

S. Gupta, EN. Najm, “Analytical Model for High Level Power Modeling of
Combinational and Sequential Circuits”, In Proceedings of IEEE Alessandro Volta
Workshop on Low Power Design, Mar. 1999.

EN. Najm, “Transition Density: A Stochastic Measure of Activity in Digital Circuits”
In Proceedings of 20’” Design Automation Conference, Jun. 1991.

M. Lajolo, A. Raghunathan, S. Dey, “Efficient Power Co-estimation Techniques for
System-on-Chip Design, In Proceedings of DAT E”, pp. 27-34, Paris, 2000.

M. Loghi, M. Poncino, and L. Benini, “Cycle—Accurate Power Analysis for
Multiprocessor Systems-on-a-Chip”, In Proceedings of the 14th ACM Great Lakes
symposium on VLSI, Boston, MA, USA, pp. 406-410.

P. Liverse, V. Wolf, K. Vissers, and E. Deprettere, “A Methodology for Architecture
Exploration of Heterogeneous Signal Processing Systems”, In Journal of VLSI Signal
Processing, Vol. 29, 2001, pp. 197-207.

W. Qin, and S. Malik, “Flexible and Formal Modeling of Microprocessors with

Application to Retargetable Simulation”, In Proceedings of Conference on Design
Automation and Test in Europe, pages 556—561 , 2003.

23.

D.S.Tortosa et al., “VHDL—based Simulation Environment for PROTEO NoC”, In

182

24.

25.

26.

27.

28.

29.

30.

31.

32.

33.

Proc. of HLDVT Workshop, 2002.

J. Chen et al., “NoCGen: A Template-Based Reuse Methodology for NOC
Architecture”, In Proc. 1C VLSI, 2004.

HS. Wang, X. Zhu, L.S. Peh and S. Malik, “ORION: A Power-Performance
Simulator for Interconnection Networks”, In Proceedings of MI CRO, 2002.

D. Wiklund, S. Sathe, et al., “NoC Simulations for Benchmarking”, In Proc. IWSoC
for Real Time Apps, 2004.

K. Goossens, J. Dielissen, O.P. Gangwal, SG Pestana, A. Radulescu, and E.
Rijpkema, “A Design Flow for Application-Speciﬁc Networks on Chip with
Guaranteed Performance to Accelerate SOC Design and Veriﬁcation”, In
Proceedings of Conference on Design Automation and Test in Europe, pages 1182-

1187, 2005.

J. Hu, and R. Marculescu, “Exploiting Routing Flexibility for Energy/Performance
Aware Mapping of Regular NoC Architectures”, In Proceedings of Design
Automation & Test in Europe (DATE), March, 2003.

K. Lahiri, and A. Raghunathan, “Power Analysis of System Level On-Chip
Communication Architectures”, In IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, Vol. 20, No. 6, Jun. 2001.

B. Cmelik et al., “Shade: A Fast Instruction-Set Simulator for Execution Proﬁling”,
In ACM SIGMETRICS Performance Evaluation Review, Volume 22(1), Pages 128-
137, May 1994.

R. Leupers et al., “Generation of Interpretive and Compiled Instruction Set
Simulators” In Proceedings of Design Automation Conference, 1999.

L. Séméria and A. Ghosh, “Methodology for Hardware/Software Co-veriﬁcation in
C/C++,” In Proceedings of Asia and South Pacific Design Automation Conf.
(ASPDAC’OO), ACM Press, 2000. PP. 405-408.

J. Liu, M. Lajolo, and A. Sangiovanni-Vincentelli, “Software Timing Analysis Using
HW/SW Co-simulation and Instruction Set Simulator,” In Proceedings of 6th
International Workshop Hardware—Software Codesign, IEEE CS Press, 1998, pp. 65-
69.

183

 

34.

36.

37.

38.

39.

40.

41.

42.

43.

45.

P. Gerin et al., “Scalable and Flexible Co—Simulation of 80C Designs with
Heterogeneous Multi-Processor Target Architectures”, In Proceedings of Conf Asia
and South Pacific Design Automation (ASP-DAC 01) ACM Press, 2001, pp. 63-68.

. L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino, “SystemC

Cosimulation and Emulation of Multiprocessor SoC Designs”, In Proceedings of
IEEE, Apr. 2003.

GNU Project, “Debugging with GDB,” www.gnu.orglmanuallgdb~4.l7.

T. Grotker, S. Liao, G. Martin, and S. Swan, System Design with SystemC. Kluwer
Academic Publishers, 2002.

Open SystemC Initiative website. http://systemc.org. December 2004.

Edward A. Lee. “Overview of the Ptolemy Project, technical memorandum no.
ucb/erlm03/25”, Technical report, Department of Electrical Engineering and
Computer Sciences, University of California, Berkeley, CA, 94720, USA, July 2004.

HI. Volkerink, G.H. Hilderink, J.F. Broenink, W.A. Vervoort, and A.W.P Bakkers,
“CSP Design Model and Tool Support”, Communicating Process Architectures 2000
33 PH. Welch and A. W.P. Bakkers (Eds. ), IOS Press, 2000.

SA. Edwards, “The Speciﬁcation and Execution of Synchronous Reactive Systems”.
PhD thesis, University of California, Berkeley, 1997.

M. Vachharajani, N. Vachharajani, D.A. Penry, J.A. Blome, and DA. August,
“Microarchitectural Exploration with Liberty”, In Proceedings of the 35th
International Symposium on Microarchitecture (MICRO), 2002.

J. Nurrni, H. Tenhunen, J. Isoaho, and A. Jantsch, Interconnect-Centric Design for
Advanced SoC and NoC. Kluwer Academic Publishers, April 2004.

. http://www.mentor.com/products/fv/digital_veriﬁcation/modelsim_se/index.cfm

C. Lee, M. Potkonjak and W. Mangione, “MediaBench: A Tool for Evaluating and
Synthesizing Multimedia and Communications Systems”, In Proceedings of 30th

Annual International Symposium on Microarchitecture (Micro '97), pp. 330-336,
Research Triangle Park, NC, December, 1997.

184

46.

47.

48.

49.

50.

51.

52.

53.

54.

55.

56.

S. Erik and J. Steffen and BE. Sumner, Jr. “Seesoft - A Tool for Visualizing Line
Oriented Software Statistics“, IEEE Transactions on Software Engineering, pp.957-

968.

C. Kozyrakis, and D. Patterson, “Vector Vs. Superscalar and VLIW Architectures for
Embedded Multimedia Benchmarks”, In the Proceedings of the 35th International
Symposium on Microarchitecture, Instabul, Turkey, November 2002.

K.F. Milfeld, C.S. Guiang, AP, and J .R. Boisseau, “Exploring the Effects of Hyper-
Threading on Scientiﬁc Applications”, In Proceedings of C UG 2003.

N.K. Jha, “Low Power System Scheduling and Synthesis”, In Proceedings of the
2001 IEEE/ACM International Conference on Computer-Aided Design, pp. 259 -
263, San Jose, California.

K. Choi, R. Soma, and M. Pedram. Fine-grained dynamic voltage and frequency
scaling for precise energy and performance tradeoff based on the ratio of off—chip
access to on-chip computation times. In IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, pages 18-28, Jan 2005.

ARM Holdings PLC. Advanced Microcontroller Bus Architecture (AMBA)
Speciﬁcation rev 2.0. http://www.arm.com/products/solutions/AMBA_Spec.htrnl.

IBM., The 32-bit Processor Local Bus Architecture Speciﬁcation, ver. 2.9. Technical
Write Paper, http://www-306.ibm.com/chips/techlib/techlib.nsf/products.

M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, J. Rabaey, and A. Sangiovanni-
Vincentelli, “Addressing the System-on-Chip Interconnect Woes through
Communication-based Design”, In the Proceedings of Design Automation
Conference, 2001.

MB. Tailor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman,
W. Lee, M. Seneski, N. Shnidman, M. Frank, S. Amarasinghe, and A. Agarwal, “The
RAW Processor: A Computation Fabric for Software Circuits and General-purpose
Programs”, In IEEE MA CRO, 22(2), 2002.

http://www.mosis.org/Technical/Designsupport/artisan-university.html
L.S. Peh and W. J. Dally. A Delay Model and Speculative Architecture for Pipelined

Routers. In Proceedings of the 7th International Symposium on High-Performance
Computer Architecture, 2001.

185

57.

58.

59.

61.

62.

63.

65.

67.

68.

Heo and K. Asanovic, “Replacing Global Wires with an On-Chip Network: A Power
Analysis”, In Proc. of ISPLED ’05

V. Trwari, S. Malik and A. Wolfe, “Power Analysis of Embedded Software: A ﬁrst
Step Towards Software Power Estimation”. In Proceedings of IEEE/ACM
International Conference on Computer-Aided, pp.384-390, August 1994.

T. Sirnunic, L. Benini, G Micheli and M. Hans, “Source Code Optimization and
Proﬁling of Energy Consumption in Embedded Systems”, In Proceedings of
15832000. pp.193-l98, Madrid, Spain, 2000.

. D. Brooks, V. Trwari, M. Martonosi, “Wattch: A Framework for Architectural-Level

Power Analysis and Optimizations”, In Proceedings of International Symposium on
Computer Architecture, pp.83-94, 2000.

W. Ye, N. Vijaykrishna, M. Kandemir, and M. J. Irwin, “The Design and Use of
SimplePower: A Cycle-Accurate Energy Estimation Tool”, In the Proceedings of the
Design Automation Conference, June 2000.

http://www.cse.psu.edu/~mdl/software.htm
R. Marculescu, D. Marculescu, and M. Pedram, “Efﬁcient power estimation for

highly correlated input streams,” in Proc. DAC-32: ACM/IEEE Design Automation
Conf., San Francisco, CA, J une 1995, pp. 628—634.

. M. Barocci, L. Benini, A. Bogliolo, B. Ricco, and GD. Micheli, “Lookup Table

Power Macro-Models for Behavioral Library Components”, In Proceedings of IEEE
Alessandro Volta Workshop on Low Power Design, Mar. 1999.

S. Gupta, and EN. Najm, “Powe Macromodeling for High Level Power Estimation”,
In Proceedings of 34th Design Automation Conference, Jun. 1997.

. G Bemacchia, and MC. Papaefthymiou, “Analytical Macromodeling for High-Level

Power Estimation”, In Proceedings of IEEE International Conference on Computer
Aided Design, Nov. 1999.

W. Dally, “Route Packets, Not Wires: On-Chip Interconnection Networks”. In
Proceedings of the 2001 Design Automation Conference (DAC-OI), pp.684-689, New
York, June 2001. ACM Press.

Terry T. Ye, L. Benini, and GD. Micheli, “Analysis of Power Consumption on Switch

186

 

69.

70.

71.

72.

73.

74.

75.

76.

77.

78.

79.

80.

Fabrics in Network Routers”, In Proceedings of DAC, 2002.

X. Chen, and LS. Peh, “Leakage Power Modeling and Optimization in
Interconnection Networks”, In Proceedings of ISPLED’2003 , Seoul, Korea.

P. Gupta et al., “A High-level Interconnect Power Mode] for Design Space
Exploration”, In Proceedings of IEEE International Conference on Computer Aided
Design, 2003.

H.B. Bakoglu, Circuits, Interconnections and Packaging for VLSI. Addision-Wesley,
1990.

R. Ho et al., “The Future of Wires”, Proceeding of the IEEE, 89(4): 490-504, Apr.,
2001.

W. Liu, X. Jin, J. Chen, M. Jeng et. al, “BSIM3v3.2.2 MOSFET Model User’s
Manual”, Dept of Electrical Engineering and Computer Science, UC-Berkeley, 1999.

M. Mamidipaka, N. Dutt and M. Abadir, “Leakage Power Estimation in SRAMs”,
CECS Technical Report, 2003.

Berkeley Predictive Technology Model and BSIM4.
http://www-device.eecs.berkeley.edu/research

M. Powell, S.H. Yang, K Roy, and TN. Vijaykumar, “Gated-Vdd: A Circuit
Technique to Reduce Leakage in Deep Submicron Cache Memories”, In Proc.
ISPLED, 2000.

K. Flautner, N.S. Kim, S. Martin, D. Blaauw, and T. Mudge, “Drowsy Caches: Simple
Techniques for Reducing Leakage Power”, In Proceedings of ISCA, Alaska, May
2002.

M. Gries, “Methods for Evaluating and Covering the Design Space during Early
Design Development”, In VLSI Journal, 26 May 2004.

M. Gries, Algorithm-architecture Trade-offs in Network Processor Design, Ph.D.
thesis, Diss. ETH No. 14191, Swiss Federal Institute of Technology (ETH) Zurich,
Switzerland (Jul. 2001).

A. Baghdadi, N. Zergainoh, W. Cesario, T. Roudier, and A. Jerraya, “Design Space

Exploration for Hardware/Software Codesign of Multiprocessor Systems”, In 11th

187

81.

83.

84.

85.

86.

87.

88.

89.

90.

International Workshop on Rapid System Prototyping (RSP), 2000, pp. 8-13.

C. Ykman-Couvreur, J. Larnbrecht, D. Verkest, F. Catthoor, A. Nikologiannis, G.
Konstantoulakis, “System-level Performance Optimization of the Data Queueing
Memory Management in High-Speed Network Processors, in 39th Design
Automation Conference (DAC), 2002, pp. 518-523.

. S. Blythe, and R. Walker, “Toward a Practical Methodology for Completely

Characterizing the Optimal Design Space”, in 9th International Symposium on
System Synthesis, 1996, pp. 8-13.

S. Blythe, and R. Walker, “Efﬁcient Optimal Design Space Characterization

Methodologies”, In ACM Transactions on Design Automation of Electronic Systems
5 (3) (2000) 323-336.

S. Chaudhuri, S. Blythe, and R. Walker, “A Solution Methodology for Exact Design
Space Exploration in a Three-dimensional Design Space”, In IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, 5 (l) (1997) 69-81.

M. Auguin, L. Capella, F. Cuesta, and E. Gresset, “CODEF: a System Level Design
Space Exploration Tool”, In 2001 IEEE International Conference on Acoustics,
Speech, and Signal Processing, Vol. 2, 2001, pp. 1145-1148.

P. Mishra, N. Dutt, and A. Nicolau, “Functional Abstraction Driven Design Space
Exploration of Heterogeneous Programmable Architectures”, In International
Symposium on System Synthesis, 2001, pp. 256-261.

S. Pees, A. Hoffmann, and H. Meyr, “Retargeting of Compiled Simulators for Digital
Signal Processors Using a Machine Description Language”, In Proceedings of
Design, Automation and Test in Europe Conference (DA TE), 2000, pp. 669-673.

J. Kin, C. Lee, W. Mangione-Smith, and M. Potkonjak, “Power Efﬁcient
Mediaprocessors: Design Space Exploration”, In Proceedings of 36th Design
Automation Conference (DAC), 1999, pp. 321-326.

G. Hekstra, G. L. Hei, P. Bingley, and F. Sijstermans, “TriMedia CPU64 Design
Space Exploration”, In Proceedings of 1999 IEEE International Conference on
Computer Design: VLSI in Computers and Processors, 1999, pp. 599-606.

T. Givargis, J. Henkel, and F. Vahid, “Interface and Cache Power Exploration for
Core-based Embedded System Design”, In Proceedings of International Conference

188

91.

92.

93.

94.

95.

96.

97.

98.

99.

on Computer-Aided Design (ICCAD), 1999, pp. 270-273.

I. Karkowski, and H. Corporaal, “Design Space Exploration Algorithm for
Heterogeneous Multi-processor Embedded System Design”, In Proceedings of 35th
Design and Automation Conference (DAC), 1998, pp. 82-87.

K. Lahiri, A. Raghunathan, and S. Dey, “System-level Performance Analysis for
Designing On-chip Communication Architectures”, In IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 20(6) (200]) 768-783.

M. Gries, and J. Greutert, “Modeling a Shared Medium Access Node with QoS
Distinction”, Tech. Rep. 86, Computer Engineering and Networks Laboratory (TIK),
ETH Zurich, Switzerland (Apr. 2000).

D. Bruni, and A.B.L. Benini, “Statistical Design Space Exploration for Application
Speciﬁc Unit Synthesis”, In Proceedings of 38th Design Automation Conference
(DAC), 2001, PP. 641-646.

V. Srinivasan, S. Radhakrishnan, and R. Vemuri, “Hardware Software Partitioning
with Integrated Hardware Design Space Exploration, In Proceedings of Design,
Automation and Test in Europe (DATE), 1998, pp. 28-35.

D. Gajski, F. Vahid, S. Narayan, and J. Gong, “System-level Exploration with
SpecSyn”, In Proceedings of 35th Design and Automation Conference (DAC), 1998,
pp. 812-817.

K. Lahiri, A. Raghunathan, and S. Dey, “Efﬁcient Exploration of the SoC
Communication Architecture Design Space”, In Proceedings of IEEE/ACM
International Conference on Computer Aided Design (1C CAD), 2000, pp. 424-430.

V. Zivkovic, E. Deprettere, P. van der Wolf, and E. de Kock, “Design Space
Exploration of Streaming Multiprocessor Architectures”, In IEEE Worksh0p on
Signal Processing Systems (SIPS), 2002. pp. 228-234.

P. Eles, et al., “Scheduling with Bus Access Optimization for Distributed Embedded
Systems”, In IEEE Transaction on VLSI, Vol. 8, No. 5, pp.472-491, 2000.

100. GC. Sih, E.A. Lee, “A Compile-time Scheduling Heuristic for Interconnection-

constrained Heterogeneous Processor Architectures”, In IEEE Transaction on
Parallel and Distributed Systems, Vol. 4, No. 2, 1993.

189

101. l. Luo, and N.K. Jha, “Battery-aware Static Scheduling for Distributed Real-time
Embedded Systems”, In Proceedings of DAC, Las Vegas, NV, Jun. 2001.

102. J.M. Chang, and M. Pedram, “Energy Minimization using Multiple Supply
Voltages”, In IEEE Transaction on VLSI Systems, Dec. 1997.

103. J.M. Chang, and M. Pedram, “Codex-dp: Codesign of Communicating Systems
Using Dynamic Programming”, In IEEE Transaction on CAD, Jul. 2000.

104. Y. Zhang, X. Hu, and D. Chen, “Task Scheduling and Voltage Selection for Energy
Minimization”, In Proceedings of DAC, New Orleans, LA, Jun. 2002.

105. J. Hu, and R. Marculescu, “Energy-aware Communication and Task Scheduling for
Network-on-Chip Architectures under Real-Time Constraints”, In Proceedings of
Design, Automation and Test in Europe Conference and Exhibition, 2004.

106. G Varatkar, and R. Marculescu, “Communication-aware Task Scheduling and
Voltage Selection for Total Systems Energy Minimization”, In Proceedings of Intl.
Conf. on Computer-Aided Design (IC CAD ), November, 2003.

107. J. Hu, and R. Marculescu, “Energy- and Performance-Aware Mapping for Regular
NoC Architectures”, In IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, Vol. 24, No. 4, Apr. 2005.

108. L.M. Ni, and PK. McKinley, “A Survey of Wormhole Routing Techniques in Direct
Networks”, In IEEE Transactions on Computers, Vol. 26, No. 2, Feb. 1993.

109. Berkeley Wireless Research Center, website: http://bwrc.eecs.berkeley.edu/

110. D. Krishnaswamy, R. Stevens, R. Hasbun, J. Revilla, and C. Hagan, “Intel PXA8OOF
Wireless Intemet-On-a—Chip Architecture and Design”, In Proceedings of IEEE Custom
Integrated Circuits Conference, 2003.

11 1. www.mentor.com

112. C. Demartini, R. Losif, and R. Sisto, “dSPIN:A Dynamic Extension of SPIN”, In
Proceedings of the 5th and 6th International SPIN Workshops on Theoretical and
Practical Aspects of SPIN Model Checking, 1999, pp.261-276.

113. J. Xi, Z. Huang, and P. Zhong, “SystemC-based Energy Macro-Modeling of
Embedded Microprocessor”, In Proceedings of 2005 IEEE International Conference on

190

Electro Information Tech., Mar. 2005.

114. J. Xi, and P. Zhong, “Fast Energy Estimation of Multi-processor System-on-Chip
with Energy Macro-Models for Embedded Microprocessors”, In Proceedings of 2005
International Conference on Modeling, Simulation and Visualization Methods, Jul. 2005.

115. J. Xi, and P. Zhong, “A Transaction-Level Network-on-Chip Simulation Platform
with Architecture-Level Dynamic and Leakage Energy Models”, In Proceedings of
GLSVLSI 2006, Apr. 2006.

116. Texas Instruments Locosto TCS-2300 GSM System-on-Chip
http://focus.ti.com/general/docs/wtbu/wtbuproductcontent.tsp?template1d=6l 23&navigati
onId=1 265 6&contentId= 15408&DCMP=WTBU&HQS=Other+OT+gsm_locosto

117. Texas Instruments Ecosto OMAPV-1035 GSM/GPRS/EDGE System-on-Chip
http://focus.ti.com/general/docs/wtbu/wtbuproductcontent.tsp?templateld=61 23 &navi gati
onld=1273 l &contentId=27883

1 l 8. http://www.cppreference.com/cppdeque/index.html

119. http://msdn2.microsoft.com/en-us/library/wa 1s7hx(VS.80).aspx

191

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

11[11111111111111511111]]111