THEORY, SYNTHESIS AND IMPLEMENTATION OF CURRENT-MODE CMOS PIECEWISE-LINEAR CIRCUITS USING MARGIN PROPAGATION

By Ming Gu

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
Electrical Engineering
2012

ABSTRACT

Achieving high energy efficiency is a key requirement for many emerging smart sensors and portable computing systems. While digital signal processing (DSP) has been the de facto technique for implementing ultra-low-power systems, analog signal processing (ASP) provides an attractive alternative that can achieve not only high energy efficiency but also high computational density. Conventional ASP techniques are based on a top-down design approach, where proven mathematical principles and related algorithms are mapped onto and emulated by computational primitives inherent in the device physics. An example is the translinear principle, the state-of-the-art ASP technique, which uses the exponential current-to-voltage characteristic to design ultra-low-power analog processors. However, elegant formulations can also result from a bottom-up approach, where device- and bias-independent computational primitives (e.g. current and charge conservation principles) are used for designing "approximate" analog signal processors. The hypothesis of this work is that many signal processing algorithms exhibit an inherent calibration ability, due to which their performance remains unaffected by the use of "approximate" analog computing techniques. In this research, we investigate the theory, synthesis and implementation of high-performance analog processors using a novel piecewise-linear (PWL) approximation algorithm called margin propagation (MP).
The MP principle utilizes only basic conservation laws of physical quantities (current, charge, mass, energy) for computing and is therefore scalable across devices (silicon, MEMS, microfluidics). MP-based processors have additional advantages when implemented using CMOS current-mode circuits, which include: (1) bias-scalability and robustness to variations in environmental conditions (e.g. temperature); and (2) improved dynamic range and faster convergence compared to translinear implementations. We verify our hypothesis using two ASP applications: (a) the design of high-performance analog low-density parity-check (LDPC) decoders for applications in sensor networks; and (b) the design of ultra-low-power analog support vector machines (SVM) for smart sensors. Our results demonstrate that an algorithmic framework for designing margin propagation (MP) based LDPC decoders can be used to trade off BER performance against energy efficiency, making the design attractive for applications with adaptive energy-BER constraints. We have verified this trade-off using an analog current-mode implementation of an MP-based (32,8) LDPC decoder. Measured results from prototypes fabricated in a 0.5 µm CMOS process show that the BER performance of the MP-based decoder outperforms a benchmark state-of-the-art min-sum decoder at SNR levels greater than 3.5 dB, and that it can achieve energy efficiencies better than 100 pJ/bit at a throughput of 12.8 Mbps. In the second part of this study, the MP principle is used for designing an energy-scalable SVM whose power and speed requirements can be configured dynamically without any degradation in performance. We have verified the energy-scaling property using a current-mode implementation of an SVM operating with 8-dimensional feature vectors and 18 support vectors.
The prototype fabricated in a 0.5 µm CMOS process integrates an array of floating-gate transistors that serves as storage for up to 740 SVM parameters, as well as novel circuits designed for interfacing with an external digital processor. These include a novel current-input current-output logarithmic amplifier that achieves a dynamic range of 120 dB while consuming nanowatts of power, and a novel varactor-based, temperature-compensated floating-gate memory that demonstrates a wider programming range than competing designs.

I dedicate this dissertation to my husband, Wei Wang, for his love and support.

ACKNOWLEDGMENT

I am deeply grateful to Dr. Shantanu Chakrabartty for his advice during my Ph.D. studies. I owe my successful transition from control theory to analog integrated circuit design to his guidance, encouragement, support, and constant assurance. As an academic advisor, he was always approachable and ready to help in every possible way to make sure I was at ease while working on my research. I wish him well in all his future endeavors and would be happy to collaborate with him in the future. Many thanks to Dr. Hayder Radha, Dr. Tongtong Li and Dr. Jonathan I. Hall for serving on my Ph.D. committee, and for their feedback, suggestions, and continual interest in my progress. Thanks to Dr. Hayder Radha for his help on our collaborative project and his instruction in information theory. Thanks to Dr. Tongtong Li and Dr. Jonathan I. Hall for helping me with communication theory and error-correcting coding. I would like to thank my lab mates Ravi Krishna Shaga, Tao Feng, Pikul Sarkar and Thamira Hindo. I really enjoyed the opportunity to work with them and had a good time with them during my graduate studies. I am grateful to my friends Rui Lin, Xiaoqin Tang, Lei Zhang, Kiran Misra, Hao Wen and Yun Li for their friendship, support, and motivation. My special thanks go to my husband, Dr.
Wei Wang, for his support during my graduate studies. His love, support, and encouragement have carried me to where I am today. We met, got to know each other, fell in love and got married at MSU. The time we spent together at MSU pursuing our Ph.D.s is the best part of our lives. I also dedicate this dissertation to my parents and grandparents for their support all along.

LIST OF TABLES

Table 1.1  Power dissipation of major components for a 0.6 µm mixed-signal CMOS image sensor [1]
Table 1.2  Milestones in TASP
Table 3.1  Operands of MP units in Fig. 3.4
Table 4.1  Operations in ordinary and logarithmic arithmetic in a digitally implemented LNS processor
Table 5.1  Threshold comparison for AWGN channel under margin propagation and sum-product
Table 5.2  Specifications of the MP-based analog LDPC decoder
Table 5.3  CMOS Digital (top 4) and Analog (bottom 6) Decoders
Table 6.1  Specifications of the current-input current-output CMOS logarithmic amplifier
Table 6.2  Comparison of performance
Table 6.3  Specification of the fabricated prototype

LIST OF FIGURES

Figure 1.1  Illustration of the tradeoffs in (a) DSP and (b) CADSP systems [2, 3]. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.
Figure 1.2  Power consumption comparison for DSP [4, 5] and ASP [6, 7].
Figure 1.3  Basic CMOS circuits for (a) digital half adder based on NAND and NOT gates; (b) digital multiplier based on (a); (c) analog adder; (d) analog multiplier (Gilbert cell).
Figure 1.4  Resolution versus "cost" on using ASP or DSP [2].
Figure 1.5  log(ID) versus VGS, gm versus VGS, and gm/ID versus ID/(W/L) curves for various operation regions [8].
Figure 1.6  Illustration of the proposed concept for designing scalable analog computational systems. The depth of potential wells can be controlled by an external voltage source. At equilibrium the charge re-distributes amongst potential wells to satisfy charge conservation laws.
Figure 1.7  Top-down design procedure for conventional TASP and the bottom-up procedure we propose.
Figure 2.1  MOS symbols
Figure 2.2  An alternating translinear loop of NMOS transistors. The number of clockwise-facing transistors in the loop is the same as the number of counterclockwise-facing transistors.
Figure 2.3  A voltage-in-current-out winner-take-all circuit.
Figure 2.4  A multiple-input translinear element (MITE).
Figure 2.5  Principle of dynamic translinear circuits.
Figure 3.1  Illustration of the reverse water-filling procedure
Figure 3.2  Scaling property of margin propagation
Figure 3.3  The curve of e^(Li − Ln) and its PWL approximation u(Li, Ln)
Figure 3.4  PWL approximation effect of the MP algorithm
Figure 4.1  Top architecture of the computation unit
Figure 4.2  CMOS implementation of the core MP circuit.
Figure 4.3  Die micrograph of the chip.
Figure 4.4  System architecture of the chip.
Figure 4.5  MP circuit for implementing two-operand addition.
Figure 4.6  Addition: z+ − z− computed according to (a) the original log-sum-exp function; (b) MP approximations (simulation); and (c) MP circuits (measurement).
Figure 4.7  Addition: measurement results for z+ − z− with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; (d)-(f) computation errors for (a)-(c).
Figure 4.8  Subtraction: z+ − z− computed by (a) original functions; (b) MP approximations (simulation); and (c) MP circuits (measurement).
Figure 4.9  Subtraction: measurement results for z+ − z− with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; (d)-(f) show the respective approximation errors.
Figure 4.10  MP circuit for implementing four-quadrant operand multiplication
Figure 4.11  Multiplication: z+ − z− computed according to the (a) original function; (b) MP approximations (simulation); and (c) MP circuits (measurement).
Figure 4.12  Multiplication: measurement results for z+ − z− with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; (d)-(f) the respective approximation errors
Figure 4.13  A current mirror implementing the power function in the logarithmic domain.
Figure 4.14  Power: z+ − z− computed according to the log-sum-exp function, the MP approximation (simulation) and the MP circuit (measurement).
Figure 4.15  Noise model for the MP circuit in Fig. 4.2.
Figure 4.16  System architecture and circuit diagram of the MP-based pattern classifier.
Figure 4.17  Classifier: z+ − z− simulated by (a) the log-sum-exp function (simulation); (b) MP approximation (simulation); and measured z+ − z− with the transistors biased in (c) strong inversion; (d) moderate inversion; (e) weak inversion.
Figure 5.1  Factor graph corresponding to a 32-bit (3, 4) LDPC code
Figure 5.2  A (3,6) LDPC message PDF evolution for: (a) sum-product variable-to-check messages; (b) sum-product check-to-variable messages; (c) min-sum variable-to-check messages; (d) min-sum check-to-variable messages
Figure 5.3  Comparison of pair-wise decoding functions for sum-product and min-sum LDPC decoding algorithms
Figure 5.4  Comparison of pair-wise decoding functions for sum-product, min-sum and margin propagation decoding algorithms
Figure 5.5  Comparison of the 3D LDPC check function for (a) sum-product, (b) min-sum and (c) margin propagation with γ = 1.75
Figure 5.6  Possible solutions z for M(L1, L2)
Figure 5.7  A (3,6) LDPC message PDF evolution for: (a) margin propagation variable-to-check messages; (b) margin propagation check-to-variable messages
Figure 5.8  (a) Effect of the hyper-parameter γ on the L0 norm of the check-to-variable messages (for different AWGN channel conditions); (b) effect of the hyper-parameter γ on BER performance (for different AWGN channel conditions)
Figure 5.9  Comparison of BER for LDPC decoders based on different algorithms, achieved by (3,6)-regular LDPC codes of length 256, 10^4, and 10^6; the maximum iteration number is set to 20
Figure 5.10  Comparison of BER for LDPC decoders based on different algorithms, achieved by a length-10^6 (3,6)-regular LDPC code and a (2.5,5) irregular LDPC code; the maximum iteration number is set to 20.
Figure 5.11  L0 norm comparison of sum-product, min-sum and margin propagation for iterations 1, 2, 3 for check-to-variable messages: (a),(b),(c); and variable-to-check messages: (d),(e),(f)
Figure 5.12  Architecture of the check node c1 in Fig. 5.1
Figure 5.13  Circuits for implementing the check node module using the MP algorithm: (a) design used in this work; and (b) an alternate low-voltage circuit.
Figure 5.14  Architecture of variable node v1 in Fig. 5.1
Figure 5.15  Circuit to implement the variable node
Figure 5.16  System-level architecture of the (32,8) LDPC decoder
Figure 5.17  Die microphotograph of the chip
Figure 5.18  Experimental setup of the chip
Figure 5.19  Timing diagram of digital signals used to control the operation of the LDPC decoder chip
Figure 5.20  Comparison of BER of the MP-based LDPC decoder using software simulation and measurements from the fabricated prototypes.
Figure 5.21  Comparison of measured BER performances for different values of the hyper-parameter γ.
Figure 5.22  Comparison of BERs of the measured MP-based LDPC decoder and a benchmark min-sum LDPC decoder.
Figure 5.23  Measured BER performances of the MP-based LDPC decoder (Vγ = 0.74 V) for different decoding times (throughput)
Figure 6.1  The circuit diagram of the MP-based SVM
Figure 6.2  Multiplication computed by (a) the "log-sum-exp" function; (b) the MP-based function with γ = 1.2 (in Matlab); (c) the MP-based function with γ = 1.2 (in Cadence)
Figure 6.3  The system architecture and basic circuit of floating-gate memory cells.
Figure 6.4  The basic concept of the translinear Ohm's law.
Figure 6.5  Schematic of the translinear logarithmic current-to-current converter.
Figure 6.6  Proposed logarithmic amplifier: (a) basic circuit of the translinear stage; (b) basic circuit of the input stage; (c) basic circuit of the reference stage.
Figure 6.7  Basic circuit of (a) the current reference; (b) the current DAC.
Figure 6.8  Proposed floating-gate based current reference and current memory cells
Figure 6.9  (a) Cross-sectional architecture of the MOS varactor; (b) layout of the memory cell in Fig. 6.8
Figure 6.10  Floating-gate transistor current with respect to the current-starving voltage
Figure 6.11  Floating-gate transistor current with respect to Vdd_fg
Figure 6.12  Floating-gate transistor current with respect to Vx
Figure 6.13  Precise programming procedure for floating-gate transistors
Figure 6.14  Die microphotograph of the chip.
Figure 6.15  (a) Measured current response and (b) relative error for the current reference.
Figure 6.16  Measurement results for the logarithmic current conversion.
Figure 6.17  (a) Noise of Iout versus Iin; (b) SNR of Iout
Figure 6.18  Measured response plotting (Iout(T1) − Iout(T2))/(Iout(T1) − Iout(T3)) under different temperatures.
Figure 6.19  Measured response plotting bandwidth with respect to current.
Figure 6.20  Die microphotograph of the chip
Figure 6.21  The C-V curve of the MOS capacitor
Figure 6.22  Measured response plotting (a) I1/I2, (b) I1/I3, (c) I1/I4, (d) I1/I5, (e) I1/I6 with respect to Vx under 28°C, 33°C, 38°C, and 43°C.
Figure 6.23  Measured response plotting temperature-compensated current ratios under Vx of (a) 1.2 V; (b) 1.3 V.
Figure 7.1  Improved MP circuit with cascode current mirrors
Figure 7.2  Decoupled MP circuit to prevent gate leakage

Chapter 1

Introduction

Digital signal processing (DSP) has been prevalent in computing systems for the past five decades. Since its beginnings in the early 1950s, progress in the field of DSP has paralleled the growth of digital computers, as signified by the celebrated Moore's law [9]. Key milestones in the area of DSP include the development of the Fast Fourier Transform (FFT) in 1965 [10], followed by the implementation of DSP algorithms in integrated circuit (IC) technology in the 1980s. The use of analog signal processing (ASP) can be traced back earlier than its digital counterpart, primarily driven by the need to process naturally occurring signals (image, speech), which are inherently "analog". However, due to rapid progress in digital IC technology and the limitations imposed by analog artifacts, analog computers were gradually superseded by digital computers in the 1970s. Many experts even predicted the demise of ASP in the early 1980s.
However, interest in ASP was renewed a decade ago, primarily due to the need for more energy-efficient and high-density signal processing systems. The inspiration for achieving this using ASP techniques came from biological computing systems, which are inherently "analog". Biological neural processors excel at solving hard problems in sensory signal processing (video and speech) by sustaining high computational throughput while keeping energy dissipation at a minimal level. At the core of these neural processors, and similarly of analog signal processors, are massively parallel circuits which exploit computational primitives inherent in device physics to achieve high energy-efficiency and high computational density. In this dissertation, we will follow the ASP principle as a guideline to investigate energy-efficient computing systems. From a practical point of view there are additional advantages of ASP over DSP techniques, which are listed below:

(a) Practical limitations in analog-to-digital converter (ADC) design.

Since all naturally occurring signals are analog, the ADC is an essential element of any DSP system. Many DSP applications need to recover signals over a wide dynamic range or from a noisy background, which necessitates high-speed and high-resolution ADCs. However, designing ADCs that combine high resolution, high performance and low power dissipation remains a challenge, and technology scaling has not benefited ADCs as much as it has benefited DSP efficiency: ADC resolution has improved at only about 1.5 bits every 5 years [11]. As a consequence, the long ADC design time has become an increasingly severe constraint on overall system design as well. One efficient solution is cooperative analog/digital signal processing (CADSP), proposed by Hasler in [2, 3]. By utilizing ASP at the front-end, the ADC complexity, and hence the overall system complexity, is reduced significantly [2, 3]. Fig. 1.1 shows the tradeoff between CADSP and DSP.
(b) Limited power consumption.

An increasing demand for portability is being placed on electronic devices, and low power dissipation is crucial in these applications to obtain long battery life.

Figure 1.1: Illustration of the tradeoffs in (a) DSP and (b) CADSP systems [2, 3]. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

The power consumption of DSP, measured in mW/MIPS, has halved roughly every 18 months [4], keeping pace with Moore's law [9]. Even so, a fixed power budget still constrains the increasing proliferation of portable electronics. In contrast, ASP tends to be more power efficient, since an ASP block does not require an ADC (as shown in Fig. 1.1), which has proven to be a major energy consumer, especially in scaled-down systems. Table 1.1 lists the power consumption of the key components of a state-of-the-art 0.6 µm CMOS image sensor with a mixed-signal processor array [1]. It can be seen that the power consumption of the ADC is a major factor and occupies more than one-third of the total. Furthermore, the arithmetic unit of a DSP, which consists of digital multipliers and adders, also consumes more energy than its ASP counterpart, mainly because ASP uses far fewer transistors. As a consequence, even custom digital solutions are inadequate for ultra-low-power applications [12]. In Fig. 1.2 the power consumption of DSP and ASP are compared.

Table 1.1: Power dissipation of major components for a 0.6 µm mixed-signal CMOS image sensor [1]
    Supply voltage               5 V
    Processor unit               0.25 mW
    Sensor, readout and biasing  2.25 mW
    ADC                          21 mW
    FIFO memory                  83 mW

The black square represents the power consumption of a state-of-the-art DSP core [5]; it appears to follow Gene's law.
The star represents the power consumption of an analog, floating-gate integrated chip [6], and the circle represents the power consumption of a CMOS analog processor that can be used for long-term, self-powered mechanical usage monitoring [7].

Figure 1.2: Power consumption comparison for DSP [4, 5] and ASP [6, 7].

(c) Size constraints.

Circuit size constraints also favor analog VLSI circuits. It is often possible to perform complex operations using device physics with only a few transistors. Fig. 1.3 compares basic circuit schematics for digital/analog adders and digital/analog multipliers implemented with metal-oxide-semiconductor field-effect transistors (MOSFETs). The number of transistors in the digital implementations clearly exceeds that in the analog implementations.

Figure 1.3: Basic CMOS circuits for (a) digital half adder based on NAND and NOT gates; (b) digital multiplier based on (a); (c) analog adder; (d) analog multiplier (Gilbert cell).

(d) Speed requirements.

In recent years great effort has been devoted to real-time computation, which is applied throughout three-dimensional (3D) graphics systems, high-speed communication systems, image sensors, etc. For these applications, the drawbacks of DSP arise largely from the fact that all operations in a single arithmetic unit have to be carried out in sequence. In contrast, ASP has a simpler implementation and an inherently parallel architecture for computation. Thus, when signal processing has to be executed at high speed or high frequency, ASP outperforms DSP on almost all metrics.
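The 18-month halving of DSP power efficiency quoted above can be turned into a back-of-the-envelope projection. The sketch below is ours; the 1 mW/MIPS baseline is an assumed figure for illustration only:

```python
# Power-efficiency trend quoted in the text: mW/MIPS halves every 18 months.
# p0 is an assumed (illustrative) baseline, not a measured value.
def mw_per_mips(years, p0=1.0, halving_period=1.5):
    """Projected mW/MIPS after `years`, halving every `halving_period` years."""
    return p0 * 2.0 ** (-years / halving_period)

# After 15 years (10 halvings) the same workload costs 1/1024 of the baseline.
projection = mw_per_mips(15.0)
```

This compounding is exactly why a fixed power budget becomes less binding for digital over time, yet (as the text argues) still leaves a large gap relative to ASP.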
Although there are compelling reasons why ASP remains indispensable for scientific computation, ASP still has problems. One primary disadvantage is the lack of flexibility. By directly exploiting the computational primitives inherent in device physics, complicated non-linear functions can be implemented in analog using significantly fewer transistors than in a digital counterpart. However, this methodology only works for specific computations. For instance, the subthreshold translinear characteristic of the MOS transistor can only be used to solve log-domain computations, whereas DSP can be programmed to solve almost any computation. Another disadvantage is the high noise sensitivity of analog implementations. Analog circuits are more vulnerable to noise than digital circuits: numbers in analog computation are represented by continuous voltages or currents, which are easily degraded by noise. A digital system, due to quantization and the bi-valued logic representation (either one or zero) of discrete signals, is more robust; unless the noise level is high enough to flip a logical bit, it operates without any effect on the precision of the result. The effective resolution of the computing system is also a major concern for signal processors. To evaluate the effort required to obtain a designated resolution, a comprehensive "cost" metric is proposed in [2, 13]. It is calculated from a variety of metrics, ranging from circuit performance (size, power dissipation, etc.) to the human effort and other resources involved in design and fabrication. The comparison of cost versus resolution is shown in Fig. 1.4. It can be seen that a "threshold" exists.
When the required computation resolution is above the threshold, it is less "expensive" to use a digital implementation; otherwise, an analog implementation is more economical.

Figure 1.4: Resolution versus "cost" on using ASP or DSP [2].

In summary, ASP is suitable for applications where
• precision/resolution of computation is not strictly required;
• a stringent budget for power, area, or number of transistors is imposed on the system design (e.g. image sensors);
• high-speed or fully-parallel computation is required (e.g. analog decoders).

1.1 Motivation for the proposed research

The underlying principle of analog computation is to represent numbers as physical quantities such as current, voltage, charge, time, etc., and then use the physics of the device for computing (e.g. addition, subtraction, multiplication, etc.). Amongst all analog computation techniques, translinear analog signal processing (TASP) is the most widely used and the most intensively researched. Since the TASP technique was introduced by Gilbert in 1975 [14], it has been applied to the synthesis of neural, filter and radio-frequency circuits. TASP utilizes the linear relation between the transconductance and the current of the device, and can be readily implemented with MOSFETs and bipolar transistors.
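As a numerical illustration of the translinear principle (our sketch, not a circuit from this dissertation): in weak inversion each device drops a gate-source voltage proportional to the logarithm of its current, so Kirchhoff's voltage law around an alternating loop forces the product of clockwise currents to equal the product of counterclockwise currents. The device parameters I0, n and UT below are illustrative values:

```python
import math

# Simplified subthreshold MOS law: I = I0 * exp(VGS / (n * UT)).
# I0, n, UT are illustrative device parameters, not measured values.
I0, n, UT = 1e-15, 1.0, 0.0259

def vgs(i):
    """Gate-source voltage needed to carry current i in weak inversion."""
    return n * UT * math.log(i / I0)

# Alternating four-device translinear loop: VGS1 + VGS3 = VGS2 + VGS4.
# Fixing I1, I2, I3 and solving for I4 yields a multiply/divide: I4 = I1*I3/I2.
I1, I2, I3 = 10e-9, 5e-9, 20e-9
V4 = vgs(I1) + vgs(I3) - vgs(I2)    # KVL around the loop
I4 = I0 * math.exp(V4 / (n * UT))   # invert the device law
```

Note that the thermal voltage UT cancels between the two sides of the loop equation, which is the origin of the temperature insensitivity listed among the TASP advantages below.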
In Table 1.2 we summarize the different variants and applications of the TASP principle that have been reported in the literature:

Table 1.2: Milestones in TASP
(Columns: year; device characteristic utilized; operation region; analog computation circuits synthesized.)

Static TASP:
  1975 [14]  I ∝ e^V                      weak inversion    multiplier, divider [15], maximum [16]
  1991 [17]  I ∝ V^2                      strong inversion  square, square-root, polynomial, multiplier-divider [18]
  2001 [19]  I ∝ e^(w1·V1 + ... + wN·VN)  weak inversion    square, square-root, divider, magnitude of vector

Dynamic TASP:
  1979 [20]  C·UT·(dIDS/dt) = Icap·IDS    weak inversion    first-order filter
  1997 [21]  C·UT·(dIDS/dt) = Icap·IDS    weak inversion    RMS-DC converter
  1998 [22]  C·UT·(dIDS/dt) = Icap·IDS    weak inversion    second-order filter

The advantages of TASP circuits are as follows:
• A high functional density, which explains the extensive application of translinear circuits in neural networks.
• Insensitivity to temperature, due to the cancellation of UT inside the translinear loop.
• Low power consumption due to the absence of passive components; translinear filter implementations require only transistors and capacitors.
• A large dynamic range. The trend toward low supply voltages decreases the dynamic range of filters implemented with conventional circuit techniques, for which companding may be a solution; in translinear filters, companding is performed implicitly as a consequence of the exponential behavior of the translinear devices.
• High controllability. Most parameters can be tuned through a current, which follows directly from the translinear principle.

However, there are some disadvantages of TASP circuits when implemented in CMOS. To understand these disadvantages, we first briefly describe the operating regions of a MOS transistor. Fig. 1.5 shows the current versus gate-voltage response of a typical n-type MOSFET.
The characteristic consists of three operating regions: (a) weak inversion, where the current depends exponentially on the gate voltage; (b) strong inversion, where the current depends quadratically on the gate voltage; and (c) moderate inversion, the intermediate region between strong and weak inversion. Operating in each of these regions has specific advantages, summarized below:
• In the weak inversion region, the minimum power consumption is achieved (low ID).
• In the strong inversion region, higher operational speed can be obtained (high ID, high gm).
• In the moderate inversion region, an optimal compromise between power consumption and speed can be achieved.

The power-consumption (energy-efficiency) versus speed trade-off can also be expressed using the gm/ID metric [8], where gm is the input transconductance of the MOS transistor and ID is the drain current flowing through it. The parameter gm effectively determines the speed of the transistor and ID determines its power consumption. gm/ID is chosen as a metric because it reflects the efficiency of translating current (i.e. power) into transconductance: the greater the gm/ID value, the greater the transconductance obtained at a constant current [8]. Fig. 1.5 displays gm/ID versus the drain current ID for MOS transistors in the three operating regions, and shows that the weak inversion region yields the highest gm/ID and thus the best power efficiency. As can be observed from Table 1.2, TASP circuits are designed to operate in only one specific operating region. They therefore lack flexibility, i.e. the ability to trade off performance across operating regions, which is a severe problem when a high dynamic range is required. This limits the application of TASP circuits in designing more complex and flexible analog signal processors.
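The gm/ID trade-off across the three regions can be sketched numerically with a common EKV-style interpolation between the weak- and strong-inversion asymptotes. The slope factor n and the specific interpolation formula below are illustrative assumptions, not taken from this dissertation:

```python
import math

# Assumed slope factor and thermal voltage at room temperature (illustrative).
n, UT = 1.3, 0.0259

def gm_over_id(IC):
    """EKV-style interpolation of gm/ID versus inversion coefficient IC
    (IC << 1: weak inversion, IC ~ 1: moderate, IC >> 1: strong inversion)."""
    return 1.0 / (n * UT * (0.5 + math.sqrt(0.25 + IC)))

weak, moderate, strong = gm_over_id(0.01), gm_over_id(1.0), gm_over_id(100.0)
# Weak inversion approaches the 1/(n*UT) ceiling (~29.7 S/A with these numbers);
# gm/ID, and hence power efficiency, falls as the device moves into strong inversion.
```

The monotonic drop of gm/ID with increasing inversion coefficient is the quantitative version of the bullet list above: bias deep in weak inversion for efficiency, deep in strong inversion for speed.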
To solve this problem, we propose ASP circuits that depend only on universal conservation principles such as Kirchhoff's current law, as illustrated in Fig. 1.3 (c). By construction, ASP circuits based on universal conservation principles have several advantages over their TASP counterparts:
• Independence of the transistor operating region, since the underlying principle follows physical laws (e.g., universal conservation laws).
• Wide dynamic range, since operation is not restricted to a single region.
• Diverse ways of implementation (e.g., charge-mode, voltage-mode, time-mode, etc.). For instance, Fig. 1.6 illustrates how reverse water-filling can naturally be implemented using a charge-coupled device (CCD). Initially, a fixed amount of charge (shown by the shaded area in Fig. 1.6a) is introduced into the CCD. As the potential barriers (wells) are raised adiabatically, the charge re-distributes itself amongst the wells (Fig. 1.6b). At equilibrium (Fig. 1.6c), the total charge accumulated in the potential wells (the residue) and the normalization level are determined by charge conservation. Even though potential wells have been used in this example, the procedure is applicable to other physical quantities (current, mass, energy) implemented by different analog structures (MEMS, microfluidic devices). However, since current-mode circuits are the easiest and most mature implementation medium, in this dissertation we adhere to current-mode implementations.
• Insensitivity to environmental variations (e.g., temperature).
However, using only universal conservation laws restricts the choice of computations that can be implemented exactly. That leads to the main hypothesis of this dissertation: many signal processing algorithms exhibit an inherent calibration ability, and hence their performance remains unaffected by the use of the "approximate" analog computing techniques proposed in this study.
Also, approximations at the circuit level require both bottom-up and top-down synthesis of signal processing algorithms, as illustrated in Fig. 1.7 and compared there with the conventional top-down design flow for TASP-based synthesis. The chart in Fig. 1.7 shows that many of the approximation techniques have to be simulated, verified and calibrated at the algorithmic and system level before prototyping in hardware. However, the algorithmic and system levels of the design flow offer more flexibility in implementing a given system and hence can overcome performance limitations due to low-level approximations. In this dissertation, we illustrate this using two specific applications: (a) designing high-performance analog LDPC decoders; and (b) designing ultra energy-efficient analog SVMs.

1.2 Scientific contributions

This dissertation is motivated by improving the energy efficiency of computation implemented with ASP, while resorting to physical conservation laws to ensure bias-scalability. The major scientific contributions are listed below:
(a) An analog signal processing (ASP) algorithm named margin propagation (MP) is proposed as an efficient piecewise-linear (PWL) approximation to the "log-sum-exp" function, along with its current-mode analog circuit implementation. Because its underlying principle is a conservation law, the MP algorithm can be mapped easily onto various hardware substrates. The current-mode MP circuit implementation is designed to be bias-scalable, a characteristic that is beneficial for computational circuits since it allows the design to adapt to applications with different specification requirements.
(b) A wide range of mathematical functions is synthesized with MP-based bias-scalable analog computational circuits. The synthesized functions include addition, subtraction, multiplication, division, power and polynomial computation, etc.
MP circuits operate in the log-likelihood domain, where complex functions such as multiplication, division and square-root are mapped onto simpler operations like addition, subtraction or scaling, making the circuit design less complex. The circuits can also operate over a wider dynamic range, since the transistors can span different biasing regions.
(c) A 100 pJ/bit, (32,8) CMOS analog low-density parity-check (LDPC) decoder based on MP is designed and implemented. One fascinating property is its capability of trading off BER performance against energy efficiency through the tunable hyper-parameter γ in the MP algorithm. The prototyped MP-based LDPC decoder achieves an energy efficiency of 100 pJ/bit, while an optimal configuration can also deliver up to 3 dB improvement in BER compared to the benchmark min-sum LDPC decoder. This design is instructive for high-energy-efficiency, high-error-correcting-performance analog LDPC decoder design.
(d) An analog energy-scalable MP-based support vector machine (SVM) is designed and implemented. The prototype stores 2052 SVM parameters in a floating-gate memory array and is fully programmable. Due to the use of MP computation circuits, the prototyped SVM is energy-scalable.
(e) A novel current-input, current-output logarithmic amplifier circuit is designed and implemented. This design, based on the translinear Ohm's law, directly generates currents as a logarithmic function of the input current; it exhibits a dynamic range of 120 dB and a temperature sensitivity of 230 ppm/K, while consuming less than 100 nW of power. The design brings three major benefits to logarithmic amplifier implementation: high dynamic range, low power consumption, and temperature compensation.
(f) A novel varactor-driven temperature compensation circuit for CMOS floating-gate current memories is designed and implemented.
By adapting the floating-gate capacitance with a voltage-controlled capacitor (varactor), this design enables the temperature-dependent factors to be effectively canceled, achieving a temperature sensitivity of 150 ppm/K. This design is instructive for implementing high-density, temperature-compensated floating-gate current memories.

1.3 Dissertation Organization

The dissertation is organized as follows. Other ASP research is briefly introduced in Chapter 2. The theory and properties of the MP algorithm are presented in detail in Chapter 3. The circuit implementation and the synthesized analog computation circuits are introduced in Chapter 4. The applications of MP-based PWL circuits are described in Chapters 5 and 6.
In Chapter 2, the device models of the MOS transistor are introduced first as background material, followed by a thorough survey of state-of-the-art ASP research, i.e., a survey of translinear analog signal processing circuits.
Chapter 3 presents the theory underlying the MP algorithm. Fundamental definitions and mathematical properties of the MP algorithm are given first, followed by a proof that the MP algorithm is a PWL approximation to the "log-sum-exp" function. These definitions and properties serve as the theoretical foundation for the later chapters. Since MP is a PWL approximation to the "log-sum-exp" function, a wide variety of scientific computations can be synthesized with the MP algorithm.
In Chapter 4, the basic current-mode circuit implementation of the MP algorithm is presented. Based on the basic MP circuit block, a variety of analog computation circuits are synthesized, including addition, subtraction, multiplication, division, power, inner-product, polynomial, and tanh. The PWL approximation behavior of the MP-based circuits is demonstrated through simulation results.
Chapter 5 presents an LDPC decoder implemented with the MP circuits.
This chapter is divided into theory sections and implementation sections. The theory sections start with the metrics used to evaluate the performance of LDPC decoders and a significant property of the LDPC decoder: its performance trade-off. Density evolution is introduced as an analytical tool and is used to evaluate the message-sparsity performance of the MP-based LDPC decoding algorithm and two conventional algorithms. Subsequent simulation results verify the density evolution analysis as well as the BER performance. It is also demonstrated that the trade-off capability of the MP-based LDPC decoder can be realized elegantly by simply tuning the hyper-parameter γ. The implementation sections start with the circuits for the basic modules, after which the system architecture is introduced and the test setup and measurement results are presented.
Chapter 6 presents another application of the MP-based PWL circuits: an analog SVM. In this application, the vector inner product and the scalar addition, multiplication and squaring operations are all synthesized with MP-based PWL circuits. The core computation, the system architecture and the primary circuits are introduced in sequence, followed by the simulation results and the measurement results.
Chapter 7 offers conclusions and outlook.

Figure 1.5: log(ID) versus VGS, gm versus VGS, and gm/ID versus ID/(W/L) curves for the various operating regions [8].

Figure 1.6: Illustration of the proposed concept for designing scalable analog computational systems. The depth of the potential wells can be controlled by an external voltage source. At equilibrium, the charge re-distributes amongst the potential wells to satisfy charge conservation.
Figure 1.7: Top-down design procedure for conventional TASP and the bottom-up procedure we propose.

Chapter 2
Device Models and Translinear Analog Signal Processing

2.1 Device Models of MOS Transistor

Fig. 2.1 illustrates the terminals of MOS transistors. Since the models of n-type and p-type MOS transistors differ only in the polarity of the current flow through the device, their mathematical models differ only in sign; thus only NMOS transistors are discussed hereinafter.

Figure 2.1: MOS symbols, showing the gate (G), drain (D), source (S) and body (B) terminals of the NMOS and PMOS devices.

The operation of a MOSFET can be separated into three different modes, depending on the voltages at the terminals. The most frequently used MOSFET models describe operation in strong inversion (also called "above-threshold operation") and weak inversion (also called "subthreshold conduction"). The strong inversion model is described as follows:

IDS = (µn Cox / (2κ)) (W/L) [κ(VG − VT) − VS]².   (2.1)

In equation (2.1), W and L denote the width and length of the transistor respectively, and W/L is called the aspect ratio. µn represents the mobility of electrons, and Cox denotes the gate oxide capacitance per unit area. VT represents the threshold voltage. κ is a parameter reflecting the effectiveness of the gate potential in controlling the channel current, and is usually approximated as κ ≈ 1 in strong inversion. According to (2.1), the relationship between the current and the controlling voltages follows a "square" law. The weak inversion model is described as:

IDS = I0 e^{(1−κ)VBS/UT} e^{κVGS/UT} (1 − e^{−VDS/UT} + VDS/V0).
(2.2)

In this equation, VGS is the gate-to-source potential, VDS is the drain-to-source potential, VBS is the bulk-to-source potential (accounting for the body effect), I0 is the zero-bias current of the given device, V0 is the Early voltage, and UT is the thermal voltage. For devices in subthreshold saturation, which holds for VDS ≥ 4UT, and neglecting the Early effect and the body effect, equation (2.2) simplifies to

IDS = I0 e^{κVGS/UT}.   (2.3)

According to (2.3), the current is exponential in the controlling potential. We can also deduce the input transconductance in the subthreshold saturation region,

gm = ∂IDS/∂VGS = κ·IDS/UT,   (2.4)

and the output conductance in the subthreshold saturation region,

gd = ∂IDS/∂VDS = IDS/V0.   (2.5)

It can be seen that gm and gd are linear in the current IDS in this operating region.
No model of the MOSFET was uniform across all the operating regions until the EKV model was developed by C. C. Enz, F. Krummenacher and E. A. Vittoz in 1995. With the EKV model, the currents, transconductances, intrinsic capacitances and thermal noise can all be expressed continuously across all operating regions, including weak inversion, moderate inversion, strong inversion, conduction, and saturation. The EKV model is described by

IDS = IS log²(1 + e^{(κ(VG − VT) − VS)/2UT}) − IS log²(1 + e^{(κ(VG − VT) − VD)/2UT}).   (2.6)

In this equation, IS is a specific current, which depends essentially on the aspect ratio W/L and the mobility µn.

2.2 Translinear Analog Signal Processing

Figure 2.2: An alternating translinear loop of NMOS transistors, in which the number of clockwise-facing junction transistors equals the number of counterclockwise-facing ones.

The translinear principle was defined by Gilbert in 1975 [14], originally for bipolar transistors. Translinear circuits exploit the exponential current-voltage characteristic of the bipolar transistor.
With the emergence of MOSFET circuits, the translinear principle was extended to MOS circuits, since MOS transistors in the subthreshold region also exhibit an exponential current-voltage characteristic, as shown in (2.3). The translinear principle states that, in a closed loop of translinear devices comprising an equal number of clockwise-facing and counterclockwise-facing junction devices, the currents satisfy

∏_{j∈CW} (Ij/Sj) = ∏_{j∈CCW} (Ij/Sj),   (2.7)

where CW and CCW denote the sets of clockwise-facing and counterclockwise-facing junction transistors respectively, Ij denotes the IDS of transistor j, and Sj denotes its aspect ratio, i.e., the W/L ratio. Fig. 2.2 shows a conceptual alternating translinear loop comprising NMOS transistors. Note that the translinear principle is derived from Kirchhoff's voltage law around the loop:

Σ_{j∈CW} VGS(j) = Σ_{j∈CCW} VGS(j).   (2.8)

In 1991, the translinear principle was extended by Seevinck and Wiegerink [17]. They generalized it with the statement that the transconductance of a device is linear in the controlling voltage, rather than linear in the current as Gilbert intended originally. This generalized translinear principle exploits the square-law MOS characteristic shown in equation (2.1); correspondingly, the MOS transistors in their translinear loops are biased in strong inversion, and the principle takes the form

Σ_{j∈CW} √(Ij/Sj) = Σ_{j∈CCW} √(Ij/Sj).   (2.9)

It has been suggested that this generalized principle be renamed the voltage-translinear principle to prevent possible confusion [23]. However, neither principle describes the behavior of translinear loops when any of the transistors enters the moderate inversion region, where the device is neither in weak inversion nor in strong inversion, but in between.
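The loop equation (2.7) can be checked numerically for subthreshold devices: pick gate-source voltages satisfying the KVL constraint (2.8), compute each current from the exponential law (2.3), and compare the clockwise and counterclockwise products of I/S. The device constants below are illustrative assumptions, not measured values:

```python
import math

I0 = 1e-12   # zero-bias current (A), assumed
KAPPA = 0.7  # subthreshold slope factor, assumed
UT = 0.0259  # thermal voltage (V)

def ids(vgs, s=1.0):
    """Subthreshold saturation current, eq. (2.3), scaled by aspect ratio s."""
    return s * I0 * math.exp(KAPPA * vgs / UT)

# Gate-source voltages chosen so that KVL (2.8) holds: sum(CW) == sum(CCW)
v_cw, v_ccw = [0.20, 0.35], [0.30, 0.25]
s_cw, s_ccw = [2.0, 1.0], [1.0, 4.0]

prod_cw = 1.0
for v, s in zip(v_cw, s_cw):
    prod_cw *= ids(v, s) / s
prod_ccw = 1.0
for v, s in zip(v_ccw, s_ccw):
    prod_ccw *= ids(v, s) / s

# Translinear principle (2.7): the two products of I/S agree
print(prod_cw, prod_ccw)
```

The aspect ratios cancel out of each I/S term, so the equality holds for any choice of Sj as long as the voltages obey KVL, which is exactly the content of (2.7).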
In this region, the current-voltage characteristic follows neither the exponential form of equation (2.3) nor the square-law form of equation (2.1).
The EKV model, described in (2.6), provides a model for the current-voltage characteristics of the MOS transistor at all levels of inversion, including moderate inversion. In 2008, Minch presented a generalized translinear principle for alternating loops of saturated MOS transistors that is valid at all levels of inversion, based on the EKV model [24]. This generalized principle reduces to the conventional translinear principle when all transistors in the loop are biased in weak inversion, and to the voltage-translinear principle when all transistors are biased in strong inversion. The current-voltage characteristic of a saturated NMOS transistor with its bulk tied to ground is modeled as

Isat = S·IS·log²(1 + e^{(κ(VG − VT0) − VS)/2UT}),   (2.10)

where S = W/L denotes the aspect ratio, κ = Cox/(Cox + Cdep) denotes the subthreshold slope factor, VT0 denotes the zero-bias threshold voltage, UT = kT/q denotes the thermal voltage, and IS = 2µnCoxUT²/κ denotes the transistor's specific current. From this equation it follows directly that

κ(VG − VT0) − VS = 2UT log(e^{√(Isat/(S·IS))} − 1).   (2.11)

When the transistor is in weak inversion, expanding e^{√(Isat/(S·IS))} in a Maclaurin series and retaining the constant and linear terms gives the approximation

κ(VG − VT0) − VS ≈ UT log(Isat/(S·IS)).   (2.12)

When the transistor is in strong inversion, the −1 is negligible compared to e^{√(Isat/(S·IS))}, and therefore

κ(VG − VT0) − VS ≈ 2UT √(Isat/(S·IS)).   (2.13)

Applying (2.12) and (2.13) to the Kirchhoff voltage-law constraint (2.8) recovers the weak inversion translinear principle (2.7) and the strong inversion voltage-translinear principle (2.9) respectively, while equation (2.11) provides the continuous model that also covers moderate inversion.
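These asymptotes can be checked numerically. The sketch below evaluates the exact inverse relation (2.11) against the weak inversion approximation (2.12) and the strong inversion approximation (2.13) at extreme inversion levels (UT is assumed at room temperature):

```python
import math

UT = 0.0259  # thermal voltage at ~300 K (V)

def v_exact(r):
    """Eq. (2.11): kappa*(VG - VT0) - VS as a function of r = Isat/(S*IS)."""
    return 2 * UT * math.log(math.exp(math.sqrt(r)) - 1.0)

def v_weak(r):
    """Eq. (2.12): weak inversion approximation."""
    return UT * math.log(r)

def v_strong(r):
    """Eq. (2.13): strong inversion approximation."""
    return 2 * UT * math.sqrt(r)

r_wi, r_si = 1e-6, 1e4   # deep weak / deep strong inversion levels
print(v_exact(r_wi), v_weak(r_wi))   # nearly identical in weak inversion
print(v_exact(r_si), v_strong(r_si)) # nearly identical in strong inversion
```

At intermediate inversion levels (r near 1) neither approximation is accurate and only the exact expression (2.11) applies, which is precisely the moderate inversion gap the EKV-based principle closes.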
Based on the aforementioned translinear principles, continuous-time translinear analog signal processing (TASP) systems are constructed. TASP circuits can be divided into two major classes: static translinear circuits and dynamic translinear circuits.
Static translinear circuits exploit the exponential relation between voltage and current to realize static linear and nonlinear transfer functions, such as products, quotients, arbitrary fixed power laws, etc. [15, 25, 18, 26]. With such operations, we can implement signal-processing algorithms involving auto-correlation, cross-correlation, signal-energy computation, least-mean-square (LMS) error metric computation, and the like [23, 25]. For instance, the Gilbert multiplier [15] shown in Fig. 1.3 (d) is a static translinear circuit. For the circuit of Fig. 1.3 (d):

I1+ = Ib / (1 + e^{−(V1+ − V1−)/UT}),  I1− = Ib / (1 + e^{(V1+ − V1−)/UT}),   (2.14)
I2+ = I1+ / (1 + e^{−(V2+ − V2−)/UT}),  I2− = I1+ / (1 + e^{(V2+ − V2−)/UT}),   (2.15)
I3+ = I1− / (1 + e^{(V2+ − V2−)/UT}),  I3− = I1− / (1 + e^{−(V2+ − V2−)/UT}).   (2.16)

If we denote (V1+ − V1−)/UT = x and (V2+ − V2−)/UT = y, we obtain

Iout+ = I2+ + I3+ = Ib (e^x e^y + 1) / ((e^x + 1)(e^y + 1)),   (2.17)
Iout− = I2− + I3− = Ib (e^x + e^y) / ((e^x + 1)(e^y + 1)).   (2.18)

Thus,

Iout+ − Iout− = Ib tanh(x/2) tanh(y/2).   (2.19)

Using the Taylor series expansion tanh(z) ≈ z for small z,

Iout ≈ Ib · (x/2) · (y/2)   (2.20)
     ∝ x · y.   (2.21)

Another example of a static translinear circuit is the "winner-take-all" circuit, which can be used to find the maximum among a set of inputs [16]. A voltage-in, current-out winner-take-all circuit is shown in Fig. 2.3. In this circuit, the current through each branch follows:

Ii = (e^{κVi/UT} / Σ_{j=1}^{N} e^{κVj/UT}) · N·Ib.   (2.22)

Due to the fast decay of the exponential, at equilibrium the branch with the maximum input voltage takes nearly all of the current, while the currents in the other branches decay toward zero. Consequently, the winner (maximum) is easy to pick out.
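The derivation (2.17)-(2.19) can be checked numerically. The sketch below recomputes the differential output of the Gilbert multiplier from the branch-current expressions and compares it with the closed form Ib·tanh(x/2)·tanh(y/2) and, for small inputs, with the ideal product; the bias current is an illustrative value:

```python
import math

IB = 1e-6  # tail bias current (A), illustrative

def gilbert_diff(x, y):
    """Differential output Iout+ - Iout- from eqs. (2.17)-(2.18)."""
    ex, ey = math.exp(x), math.exp(y)
    i_plus = IB * (ex * ey + 1) / ((ex + 1) * (ey + 1))
    i_minus = IB * (ex + ey) / ((ex + 1) * (ey + 1))
    return i_plus - i_minus

# Closed form (2.19): Ib * tanh(x/2) * tanh(y/2)
x, y = 0.8, -0.5
print(gilbert_diff(x, y), IB * math.tanh(x / 2) * math.tanh(y / 2))

# Small-signal regime (2.20)-(2.21): output approximates Ib * x * y / 4
x, y = 0.05, 0.04
print(gilbert_diff(x, y), IB * x * y / 4)
```

The first comparison is an exact algebraic identity, since (e^x e^y + 1) − (e^x + e^y) factors into (e^x − 1)(e^y − 1); the second holds only while |x| and |y| stay well below 1, which is the linear multiplier regime.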
There is also an increasing effort to develop new circuit design techniques for building analog information-processing systems. In 2001, Minch proposed multiple-input translinear element (MITE) networks [19], which produce an output current that is exponential in a weighted sum of the input voltages. MITEs are simply implemented using multiple-input floating-gate MOS (FGMOS) transistors. An MITE is shown in Fig. 2.4; its output current Iout follows:

Figure 2.3: A voltage-in, current-out winner-take-all circuit.

Figure 2.4: A multiple-input translinear element (MITE).

Iout ∝ e^{(w1 V1 + w2 V2 + ··· + wN VN)}.   (2.23)

From equation (2.23) it can be seen that if the weights are adjusted according to some learning algorithm, such as the least-mean-square (LMS) algorithm or the recursive least-squares (RLS) algorithm, these built-in weights can be used to implement a variety of linear adaptive signal-processing architectures. In this respect, TASP circuits have also been applied to neural systems [27, 28, 29, 30]. Furthermore, the weights can be programmed digitally to realize reconfigurable TASP systems. Such reconfigurable analog signal processing systems are called field-programmable analog arrays (FPAAs) [6]; a TASP FPAA is analogous to an E²PROM-based field-programmable gate array (FPGA) in the digital domain. Based on MITEs, a variety of static and dynamic translinear circuits can be synthesized (e.g., the vector-magnitude circuit, the vector-normalization circuit, the second-order low-pass filter, and the rms-to-dc converter) [19, 31, 32, 26].
By admitting capacitors into the translinear loops, dynamic translinear circuits are constructed to realize frequency-dependent transfer functions. The dynamic translinear principle can be explained with reference to the circuit shown in Fig. 2.5.
Figure 2.5: Principle of dynamic translinear circuits.

The expression for Icap is easily found to be

Icap = C·UT·İDS / IDS,   (2.24)

which shows that Icap is a nonlinear function of IDS and its time derivative İDS. Equation (2.24) can be rewritten as

C·UT·İDS = Icap·IDS,   (2.25)

which states the dynamic translinear principle: a time derivative of a current can be mapped onto a product of currents [33].
The first dynamic translinear circuit, or log-domain filter, was introduced by Adams in 1979 [20]; the first-order filter described in [20] is in fact a translinear circuit. Another milestone is a general class of filters, called exponential state-space filters, comprising a number of translinear loops, first published by Frey [34]. Exponential state-space filters enable the design of higher-order log-domain filters [35, 36]. Since then, interest in dynamic translinear filters has increased dramatically, and many more studies have been reported in this area [37, 38, 39, 40, 41, 42].
However, the dynamic translinear principle is not limited to filters, i.e., to linear differential equations; it can also be used to realize nonlinear differential equations using transistors and capacitors only. In [21], the nonlinear differential equations describing the rms-dc conversion function are implemented using the dynamic translinear principle. Another example is [22], in which a translinear second-order oscillator was implemented. Other example functions described by nonlinear differential equations and implemented by translinear circuits include phase-locked loops (PLLs) [43] and a translinear sinusoidal frequency tripler [44].

Chapter 3
Theory of PWL technique based on margin propagation

3.1 Definition of MP algorithm

At the core of the MP algorithm is the reverse water-filling procedure, which has been described in detail in [45, 46].
Given a set of scores Li ∈ R, i = 1, ..., N, reverse water-filling computes a normalization factor z according to the constraint:

Σ_{i=1}^{N} [Li − z]+ = γ,   (3.1)

where [·]+ = max(·, 0) denotes a rectification operation and γ ≥ 0 represents a parameter of the algorithm. The computation of z according to constraint (3.1) is summarized in Algorithm 1. The algorithm recursively computes the factor z such that the net balance of the log-likelihood scores Li in excess of z equals γ.

Algorithm 1 Reverse water-filling procedure to compute the parameter z
Require: set of log-likelihood scores {Li}, i = 1, ..., N
Initialize: T ← 0, m ← 1, a ← max{Li}, {s} ← {Li} \ {a}
while T < γ and m < N do
  b ← max{s}
  if T + m(a − b) ≥ γ then break
  T ← T + m(a − b)
  a ← b
  {s} ← {s} \ {b}
  m ← m + 1
end while
z ← a − (γ − T)/m

The solution to equation (3.1) can also be visualized using Figure 3.1, where the cumulative score beyond the normalization factor z (shown by the shaded area) equals γ. To be practical, we always limit z to the range z ≤ max_i Li. The computation of z can be expressed as an equivalent function M(L, γ) whose inputs are a set of log-likelihood scores L = {Li}, i = 1, ..., N, of size N and a hyper-parameter γ:

z_mp = M(L, γ).   (3.2)

3.2 Properties of MP algorithm

In this section we present some of the key properties of the MP function M(L, γ). Since the margin approximation uses the threshold operation [x]+ = max(x, 0), we first state two lemmas which will be useful for proving some of the other properties.

Lemma 1: ∀a, b ∈ R, [a]+ − [b]+ ≤ [a − b]+.

Figure 3.1: Illustration of the reverse water-filling procedure.

Lemma 2: ∀a, b ∈ R, [a + b]+ ≤ [a]+ + [b]+.

The proofs of Lemma 1 and Lemma 2 are straightforward and can be found in any standard analysis textbook. For the sake of brevity, only the statements and illustrations of some of the properties are presented in this section; all the proofs are presented in detail in the appendix.
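Algorithm 1 can be mirrored in software. The sketch below implements reverse water-filling by descending through the sorted scores and solving for the water level in closed form at each step (an equivalent reformulation of Algorithm 1's recursion, not the dissertation's hardware implementation):

```python
def margin_propagate(scores, gamma):
    """Reverse water-filling: find z such that sum(max(Li - z, 0)) == gamma."""
    s = sorted(scores, reverse=True)
    n = len(s)
    prefix = 0.0
    for m in range(1, n):
        prefix += s[m - 1]
        z = (prefix - gamma) / m   # water level if exactly m scores lie above it
        if z >= s[m]:
            return z
    # gamma is large enough that every score lies above the level
    prefix += s[n - 1]
    return (prefix - gamma) / n

# The defining constraint (3.1) holds by construction:
L = [1.2, -0.4, 0.9, 0.3]
z = margin_propagate(L, 0.8)
print(z, sum(max(l - z, 0.0) for l in L))
```

As γ shrinks toward zero the returned level approaches max(L), which anticipates the asymptotic property (3.4) stated below.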
Property 1 (Scaling Property): For any α ∈ R, α > 0, and a set of scores L = {Li}, i = 1, ..., N,

M(αL, αγ) = αM(L, γ),   (3.3)

where αL = {αLi}, i = 1, ..., N. The proof of this property is simple, since if the condition

Σ_{i=1}^{N} [Li − z]+ = γ

is satisfied, then the condition

Σ_{i=1}^{N} [αLi − αz]+ = αγ

is also satisfied. Figure 3.2 illustrates the scaling property of the margin approximation function: the threshold z scales with the scaling of the log-likelihood scores and the hyper-parameter γ.

Figure 3.2: Scaling property of margin propagation.

Property 2 (Monotonicity): Given a set of scores L = {Li}, i = 1, ..., N, if γ1 ≥ γ2 ≥ 0 then M(L, γ1) ≤ M(L, γ2). An important implication of monotonicity is the asymptotic property as γ → 0:

lim_{γ→0} M(L, γ) = max(L).   (3.4)

The importance of this asymptotic property is that the min-sum algorithm for LDPC decoding becomes a special case of the proposed margin propagation algorithm.

Proof. Given

Σ_{i=1}^{N} [Li − z1]+ = γ1,
Σ_{i=1}^{N} [Li − z2]+ = γ2,
γ1 ≥ γ2 ≥ 0,   (3.5)

what we need to prove is z1 ≤ z2. Subtracting the second equation in (3.5) from the first, we get

Σ_{i=1}^{N} ([Li − z1]+ − [Li − z2]+) = γ1 − γ2 ≥ 0.   (3.6)

Based on Lemma 1, [Li − z1]+ − [Li − z2]+ ≤ [z2 − z1]+ for every i, so

N·[z2 − z1]+ ≥ Σ_{i=1}^{N} ([Li − z1]+ − [Li − z2]+).

Combined with (3.6), it can be deduced that

N·[z2 − z1]+ ≥ γ1 − γ2 ≥ 0.

If γ1 > γ2 this forces [z2 − z1]+ > 0, i.e., z2 > z1; if γ1 = γ2 the two constraints coincide and z1 = z2. In conclusion, z1 ≤ z2. □

Property 3 (Convexity Property): Given a set of coefficients {αk} satisfying 0 ≤ αk ≤ 1 and Σ_k αk = 1, and given a set of hyper-parameters {γk},

M(L, Σ_k αk γk) ≤ Σ_k αk M(L, γk).

Proof. Given the same group of Li's and a set of γk, k = 1, ..., n,

Σ_{i=1}^{N} [Li − z1]+ = γ1,
Σ_{i=1}^{N} [Li − z2]+ = γ2,
...
Σ_{i=1}^{N} [Li − zn]+ = γn.   (3.7)

Based on Property 1, we can transform the equations in (3.7) into

Σ_{i=1}^{N} [αk Li − αk zk]+ = αk γk,  k = 1, ..., n.

Summing these equations over k and applying Lemma 2 (together with Σ_k αk = 1), we get

Σ_{i=1}^{N} [Li − Σ_{k=1}^{n} αk zk]+ ≤ Σ_{k=1}^{n} αk γk.

Let z satisfy

Σ_{i=1}^{N} [Li − z]+ = Σ_{k=1}^{n} αk γk.

Since Σ_i [Li − z]+ is non-increasing in z, it follows that z ≤ Σ_{k=1}^{n} αk zk. That is,

M(L, Σ_k αk γk) ≤ Σ_k αk M(L, γk). □

Property 4 (Superposition Property): Given two sets of scores L and G of size N with a well-defined (sorted) ordering, and letting L + G represent element-by-element scalar addition,

M(L + G, γ) ≥ M(L, γ) + M(G, γ).

Proof. Given

Σ_{i=1}^{N} [Li − z1]+ = γ,
Σ_{i=1}^{N} [gi − z2]+ = γ,   (3.8)

what we need to prove is that if

Σ_{i=1}^{N} [Li + gi − z3]+ = γ,   (3.9)

then z3 ≥ z1 + z2. All the Li's are sorted and rearranged in decreasing order so that L1 ≥ L2 ≥ L3 ≥ ... ≥ LN, and we use N1 to represent the number of Li's above the threshold z1. The same is done for the gi's: they are sorted decreasingly as g1 ≥ g2 ≥ g3 ≥ ... ≥ gN, and the number of gi's above the threshold z2 is N2. Rewrite (3.8) as

Σ_{i=1}^{N1} Li − N1 z1 = γ,
Σ_{i=1}^{N2} gi − N2 z2 = γ.   (3.10)

Assume N2 ≥ N1. Then, from the second equation in (3.10), we get

0 ≤ Σ_{i=1}^{N1} gi − N1 z2 ≤ γ.   (3.11)

Adding both sides of (3.11) to the first equation in (3.10), we get

γ ≤ Σ_{i=1}^{N1} (Li + gi) − N1 z1 − N1 z2 ≤ 2γ.   (3.12)

The left inequality of (3.12) can be transformed into

Σ_{i=1}^{N1} [Li + gi − (z1 + z2)] ≥ γ,

from which we can deduce that

Σ_{i=1}^{N} [Li + gi − (z1 + z2)]+ ≥ γ.   (3.13)

Given (3.9) and (3.13), and since Σ_i [Li + gi − z]+ is non-increasing in z, it can be deduced that z3 ≥ z1 + z2; the same argument applies if we instead assume N1 ≥ N2. Written as MP functions,

M(L + G, γ) ≥ M(L, γ) + M(G, γ). □
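The properties above can be sanity-checked numerically. The sketch below solves the constraint (3.1) by bisection, an illustrative software stand-in for the MP function, and compares both sides of the scaling, monotonicity, asymptotic and constant-offset relations:

```python
def mp(scores, gamma, iters=200):
    """Solve sum(max(Li - z, 0)) == gamma for z by bisection."""
    lo = min(scores) - gamma   # residue at this level is >= gamma
    hi = max(scores)           # residue at this level is 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(max(l - mid, 0.0) for l in scores) > gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

L = [1.2, 0.4, -0.3, 0.9]
g = 0.7
print(mp([2 * l for l in L], 2 * g), 2 * mp(L, g))    # scaling (Property 1)
print(mp(L, 1.5), mp(L, 0.5))                         # monotonicity (Property 2)
print(mp(L, 1e-9), max(L))                            # asymptote (3.4)
print(mp([l + 3.0 for l in L], g), mp(L, g) + 3.0)    # constant offset shifts z
```

Bisection works here because the residue Σ_i [Li − z]+ is a continuous, non-increasing function of z, the same monotonicity used repeatedly in the proofs above.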
A related exact equality holds when the second set of scores reduces to a constant offset:

Property 5 (Offset Invariance): Given a set of scores L of size N and a scalar g ∈ R,

M(L + g, γ) = M(L, γ) + g.

Property 5 states that a constant offset applied to all the elements of the input set leads to an equivalent offset in the output of the margin approximation function.

Proof. Given

Σ_{i=1}^{N} [Li − z]+ = γ,

adding the scalar g to every score and to the threshold leaves each rectified term unchanged:

Σ_{i=1}^{N} [Li + g − (z + g)]+ = γ.

Written in MP function format, M(L + g, γ) = M(L, γ) + g. □

Property 6 (Invariance Property): Given a set of scores L of size N, and a set of scores G = {gj}, j = 1, ..., n, whose elements satisfy gj ≤ M(L, γ), then

M(L ∪ G, γ) = M(L, γ).

Proof. Given

Σ_{i=1}^{N} [Li − z]+ = γ,

since gj ≤ z for every j ∈ [1, n],

Σ_{j=1}^{n} [gj − z]+ = 0,

which means

Σ_{i=1}^{N} [Li − z]+ + Σ_{j=1}^{n} [gj − z]+ = γ.

Therefore, M(L ∪ G, γ) = M(L, γ). □

3.3 PWL approximation based on MP algorithm

The generalized form of a "log-sum-exp" function is

z_log = log(Σ_{i=1}^{N} e^{Li}).   (3.14)

A piecewise-linear (PWL) approximation of the "log-sum-exp" function can be achieved with the MP algorithm.

Proof. Assume that (3.14) and (3.2) share the same group of operands L1, ..., LN, Li ∈ R, i = 1, ..., N. If the operands are sorted and rearranged in decreasing order, L1 ≥ L2 ≥ ... ≥ LN, then (3.14) reads

z_log = log(Σ_{i=1}^{N} e^{Li}),   (3.15)

and for the MP algorithm, (3.2) reads

z = M(L1, ..., LN, γ).   (3.16)

If the number of Li's above the threshold z is n, 1 ≤ n ≤ N, then according to the definition of the MP algorithm, (3.1) becomes

Σ_{i=1}^{n} (Li − z) = γ,  i.e.,  Σ_{i=1}^{n} Li − nz = γ,   (3.17)

and z can be deduced as

z = (1/n)·Σ_{i=1}^{n} Li − γ/n.   (3.18)

For (3.18), under the conditions that (a) γ and all the Li's other than Ln are fixed, and (b) Ln varies within a range that keeps it at or above the threshold z, we have

∂z/∂Ln = 1/n.   (3.19)
For z_log in (3.15), under the same conditions (a) and (b),

∂z_log/∂Ln = e^{Ln} / Σ_{i=1}^{N} e^{Li} = 1 / Σ_{i=1}^{N} e^{(Li − Ln)}.   (3.20)

Define a function u(Li, Ln) to approximate e^{(Li − Ln)} as

u(Li, Ln) = 1 if Li ≥ Ln, and 0 if Li < Ln,  1 ≤ n ≤ N.

Fig. 3.3 shows the curve of u(Li, Ln) and its PWL approximation of e^{(Li − Ln)}.

Figure 3.3: The curve of e^{(Li − Ln)} and its PWL approximation by u(Li, Ln).

Since the number of Li's not less than Ln is n,

Σ_{i=1}^{N} u(Li, Ln) = n.   (3.21)

Therefore,

∂z_log/∂Ln ≈ 1 / Σ_{i=1}^{N} u(Li, Ln) = 1/n.   (3.22)

Comparing (3.19) and (3.22), we find that

∂z_log/∂Ln ≈ 1/n = ∂z/∂Ln.   (3.23)

Based on the illustrations above, we can conclude that z in (3.2) approximates z_log in (3.14) under the condition that γ and all the Li's other than Ln are fixed, with Ln varying within a certain range. Since 1 ≤ n ≤ N, by adjusting the value of γ to change the number of operands above the threshold (reflected by n), this conclusion can be extended to the whole group of operands. Based on the proof, we can write

z_mp = M(L, γ) ≈ log(Σ_{i=1}^{N} e^{Li}).   (3.24)

To demonstrate the PWL approximation behavior of the MP algorithm, a "log-sum-exp"-based function is used for reference. It is a transform function φ with two log-likelihood operands L1 and L2 [47], named "sum-product" in the field of factor-graph decoding [48]:

φ_{L2}(L1) = log((1 + e^{L1+L2}) / (e^{L1} + e^{L2})).   (3.25)

The two operands are expressed in differential form, i.e., L1 = L1+ − L1−, L2 = L2+ − L2−. Now (3.25) can be rewritten and decomposed as

φ_{L2}(L1) = log((e^{L1− + L2−} + e^{L1+ + L2+}) / (e^{L1+ + L2−} + e^{L1− + L2+}))
           = log(e^{L1− + L2−} + e^{L1+ + L2+}) − log(e^{L1+ + L2−} + e^{L1− + L2+}),   (3.26)

where φ_{L2}(L1) is represented as a difference of two "log-sum-exp" functions. Based on
Based on equation (3.24), we can approximate (3.26) using two MP units with two operands each:

\phi_{L_2}(L_1) \approx M(L_1^- + L_2^-, L_1^+ + L_2^+, \gamma) - M(L_1^+ + L_2^-, L_1^- + L_2^+, \gamma).  (3.27)

Fig. 3.4 compares the PWL approximation achieved with a fixed γ and different schemes of operands in the MP units: (a) two operands per MP unit, as described in equation (3.27); (b) four operands per MP unit; and (c) six operands per MP unit. The operands for the MP units are listed in Table 3.1, where S denotes the sum L_1^- + L_2^- + L_1^+ + L_2^+.

Table 3.1: Operands of MP units in Fig. 3.4

Num | Operands for MP unit z^+ | Operands for MP unit z^-
2 | L_1^- + L_2^-,  L_1^+ + L_2^+ | L_1^+ + L_2^-,  L_1^- + L_2^+
4 | L_1^- + L_2^-,  (1/2)S,  (3/5)S,  L_1^+ + L_2^+ | L_1^+ + L_2^-,  (1/2)S,  (3/5)S,  L_1^- + L_2^+
6 | L_1^- + L_2^-,  (2/5)S,  (1/2)S,  (3/5)S,  (2/3)S,  L_1^+ + L_2^+ | L_1^+ + L_2^-,  (2/5)S,  (1/2)S,  (3/5)S,  (2/3)S,  L_1^- + L_2^+

As can be seen from Fig. 3.4, when more intercepted operands are introduced into the MP computation, the approximating curves become finer and closer to that of the original “sum-product” algorithm. That is, the PWL approximation effect of (3.27) improves as more operands are utilized in the approximating algorithm.
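As a numerical sanity check of (3.27), the following sketch evaluates the exact “sum-product” transform (3.25) against its two-MP-unit approximation. The differential decomposition of the operands and the value of γ are illustrative choices:

```python
import math

def mp(L, gamma):
    # reverse water-filling solver for M(L, gamma) (illustrative model)
    Ls = sorted(L, reverse=True)
    total = 0.0
    for n in range(1, len(Ls) + 1):
        total += Ls[n - 1]
        z = (total - gamma) / n
        if z <= Ls[n - 1] and (n == len(Ls) or z >= Ls[n]):
            return z
    return z

def phi_exact(L1, L2):
    # "sum-product" transform, eq. (3.25)
    return math.log((1 + math.exp(L1 + L2)) / (math.exp(L1) + math.exp(L2)))

def phi_mp(L1p, L1m, L2p, L2m, gamma):
    # difference of two two-operand MP units, eq. (3.27)
    zp = mp([L1m + L2m, L1p + L2p], gamma)
    zm = mp([L1p + L2m, L1m + L2p], gamma)
    return zp - zm

# L1 = 2.0 - 1.0 = 1.0 and L2 = 1.5 - 1.0 = 0.5 in differential form
print(phi_exact(1.0, 0.5))                      # ~0.227
print(phi_mp(2.0, 1.0, 1.5, 1.0, gamma=1.0))    # ~0.25
```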
Figure 3.4: PWL approximation effect of the MP algorithm (MP with 2, 4 and 6 operands versus the sum-product function).

Chapter 4

MP-based current-mode PWL circuit and synthesis

4.1 Log-domain computation

As high-speed computation in signal processing, graphics and communication systems receives increasing attention, logarithmic computation, a specific and effective option in the signal processing area, has become a central research topic. The logarithmic number system (LNS) has been adopted in many digital signal processor (DSP) designs [49, 50, 51, 52, 53]. A list of operations in base-2 LNS, compared with a conventional arithmetic system, is shown in Table 4.1. As can be seen, log-domain computation can dramatically simplify various complex arithmetic computations, such as multiplication, division, reciprocal and square root. Consequently, with LNS the number of bits needed in the computation decreases, and therefore the power consumption drops [52, 53]; at the same time, the lower complexity results in high-speed computation as well [49, 52, 53]. Generally speaking, for scenarios demanding low power dissipation, high throughput and small gate count, LNS is more appropriate than conventional arithmetic systems. However, a major obstacle for LNS implemented in digital circuits is the complicated addition and subtraction. Moreover, to simplify the hardware implementation, PWL schemes are frequently adopted by digital logarithmic converters for real-number to logarithmic-domain data conversion. As a result of PWL, small coefficient lookup tables serve as the power- and area-efficient implementation [52, 53, 54]. However, the use of read-only memory (ROM) is costly and seriously restricts the word length of the LNS data system [49, 55].
Table 4.1: Operations in ordinary and logarithmic arithmetic in a digitally implemented LNS processor

Operation | Ordinary Arithmetic | Logarithmic Arithmetic
Multiplication | x × y | X + Y
Division | x ÷ y | X − Y
Square | x² | X << 1
Square Root | √x | X >> 1
Addition | x + y | X + log_2(1 + 2^{Y−X})
Subtraction | x − y | X + log_2(1 − 2^{Y−X})

The proposed MP-based algorithm is capable of approximating a specific log-domain computation (the “log-sum-exp” function) and can be implemented in an analog manner, which circumvents the aforementioned problems of the digital implementation. In later sections of this chapter, a variety of log-domain computation circuits synthesized from a basic MP circuit will be shown.

4.2 Architecture of synthesized system

Let us take an arbitrary function as an example:

Y = a_1 \cdot b_1 + a_2 \cdot b_2.  (4.1)

The operands can be converted into differential forms as

a_1 = a_1^+ - a_1^-,  b_1 = b_1^+ - b_1^-,  a_2 = a_2^+ - a_2^-,  b_2 = b_2^+ - b_2^-,

and the corresponding operands in LNS are

L_{a1}^± = \log a_1^±,  L_{b1}^± = \log b_1^±,  L_{a2}^± = \log a_2^±,  L_{b2}^± = \log b_2^±.

Accordingly, (4.1) can be converted into

Y = e^{z^+} - e^{z^-},  (4.2)

where z^+ and z^- are expressed as

z^+ = \log\left( e^{L_{a1}^+ + L_{b1}^+} + e^{L_{a1}^- + L_{b1}^-} + e^{L_{a2}^+ + L_{b2}^+} + e^{L_{a2}^- + L_{b2}^-} \right),
z^- = \log\left( e^{L_{a1}^+ + L_{b1}^-} + e^{L_{a1}^- + L_{b1}^+} + e^{L_{a2}^+ + L_{b2}^-} + e^{L_{a2}^- + L_{b2}^+} \right).  (4.3)

From (4.3), it can be seen that the original computation in (4.1) is transformed into “log-sum-exp” functions of the form (3.14) when the computation is converted into LNS. It can then be approximated with the MP algorithm, as equation (3.24) shows. Based on the illustration above, for any arbitrary function that can be converted into the form of (4.2) and (4.3), i.e., expressed in terms of “log-sum-exp” functions, its PWL approximation can be obtained with the MP algorithm.
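The decomposition (4.1)–(4.3) can be verified numerically: a bipolar sum of products maps exactly to a difference of two “log-sum-exp” terms (no MP approximation is involved yet). The operand values and the offset used to form the differential representation are arbitrary illustrative choices:

```python
import math

def lse(terms):
    # "log-sum-exp" over log-domain operands, eq. (3.14)
    return math.log(sum(math.exp(t) for t in terms))

# operands of Y = a1*b1 + a2*b2, and an (arbitrary) offset for the
# differential representation x = x_plus - x_minus with x_plus, x_minus > 1
a1, b1, a2, b2, ofs = 1.5, -0.8, -2.0, 0.4, 4.0
a1p, a1m, b1p, b1m = a1 + ofs, ofs, b1 + ofs, ofs
a2p, a2m, b2p, b2m = a2 + ofs, ofs, b2 + ofs, ofs
log = math.log

# z+ and z- of eq. (4.3): log-sum-exp over sums of log-likelihoods
zp = lse([log(a1p) + log(b1p), log(a1m) + log(b1m),
          log(a2p) + log(b2p), log(a2m) + log(b2m)])
zm = lse([log(a1p) + log(b1m), log(a1m) + log(b1p),
          log(a2p) + log(b2m), log(a2m) + log(b2p)])

print(math.exp(zp) - math.exp(zm))   # recovers a1*b1 + a2*b2 = -2.0
```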
Let us generalize such an arbitrary function, with all the operands in the real domain, as Y = f(a, ..., b). It can be converted into differential form and LNS as

Y = e^{z^+} - e^{z^-},
z^+ = F^+(L_a^+, L_a^-, ..., L_b^+, L_b^-),
z^- = F^-(L_a^+, L_a^-, ..., L_b^+, L_b^-),  (4.4)

where F^+ and F^- denote “log-sum-exp” functions of L_a^+, L_a^-, L_b^+, L_b^-, etc.

Fig. 4.1 shows the system-level architecture for arbitrary computations. The input signals are represented in differential form, a = a^+ - a^-, b = b^+ - b^-, and are converted into logarithms at the input stage. Similarly, the logarithms of the output signals are represented by z^+ and z^-. With all the logarithm/antilogarithm conversions realized by the outer units at the input/output stages, the core computations (within the dashed block) are carried out in the log-likelihood domain and implemented by the MP-based PWL approximation circuit.

Figure 4.1: Top architecture of the computation unit.

Here we focus only on the core computation unit.

4.3 Current-mode circuit implementation for MP algorithm

A current-mode circuit implementing the basic reverse water-filling equation (3.1) is shown in Fig. 4.2. The four operands L_1 ~ L_4 in (3.2) are represented by the currents I_{L1} ~ I_{L4}. The hyper-parameter γ in (3.2) is implemented by the current I_γ. The circuit in Fig. 4.2 bears similarity to other parallel current-mode circuits like winner-take-all circuits; however, as mentioned in the previous section, the MP implementation is more general, and for I_γ → 0 the MP circuit implements a winner-take-all function. Kirchhoff’s current law (KCL), when applied to node A, is equivalent to the reverse water-filling condition

\sum_{i=1}^{4} [I_{L_i} - I_z]_+ = I_\gamma.

I_z represents the MP approximation M(L_1, ..., L_4, γ), and [·]_+ denotes a rectifying operation, which is implemented by the PMOS diodes P_1 ~ P_4. It can be readily verified that the PMOS diodes ensure that V_{DS,N_i} = V_{SG,P_i} + V_{DS,sat}, where V_{DS,N_i} is the drain-to-source voltage drop across transistor N_i, i = 1, ..., 4, V_{SG,P_i} is the voltage drop across the PMOS diode, and V_{DS,sat} is the saturation voltage across the current sink I_γ. Thus, transistors N_1 ~ N_4 are always maintained in saturation irrespective of the operating region (weak, moderate or strong inversion). This is important because the operation of the MP circuit relies on the accuracy of the current mirrors formed by N_0 ~ N_4. Also, to ensure proper operation of the core MP circuit, the input currents have to satisfy the constraint \sum_{i=1}^{4} I_{L_i} ≥ I_γ, and the input current range can be expressed as

\frac{1}{2} \mu_n C_{ox} \frac{W}{L} \left( \frac{V_{dd}}{2} \right)^2 \geq I_{L_i} \geq 0.  (4.5)

In deriving the upper bound of the input current range, we have assumed that I_{L_i} >> I_γ and that the PMOS and NMOS transistors have similar drain-to-source voltage drops (implying V_B ≈ V_{dd}/2). It should be noted that the PMOS diodes in the circuit of Fig. 4.2 introduce a threshold-voltage drop, which implies that the supply voltage has to be greater than 2V_{GS} + V_{DS}. However, a low-voltage implementation of the circuit in Fig. 4.2 is possible by replacing the PMOS diodes with low-threshold active diodes [56], at the expense of larger silicon area, higher power dissipation, and degraded speed and stability.

We have prototyped a programmable version of the MP circuit shown in Fig. 4.2 in a 0.5 µm standard CMOS process. The architecture of the prototype is shown in Fig. 4.4 and its micrograph in Fig. 4.3. A serial shift-register is used to selectively turn each of the current branches on and off; in this manner, the same MP circuit in Fig. 4.2 can be reconfigured to implement different mathematical functions.
The architecture in Fig. 4.4 uses an on-chip first-order Σ∆ modulator as an analog-to-digital converter for measurement and calibration of the output currents. Because the circuits used for implementing the Σ∆ modulator have been reported elsewhere [57], we omit its implementation details here. The input currents, and hence the operating region (weak, moderate or strong inversion) of all the transistors, are adjusted by controlling the gate voltages of the PMOS transistors. For each of the operating regions, the ADC is re-calibrated to the maximum current, and the measured results (presented in the next sections) are normalized with respect to the maximum current.

Figure 4.2: CMOS implementation of the core MP circuit.

Figure 4.3: Die micrograph of the chip.

Figure 4.4: System architecture of the chip.

4.4 MP based computational circuits

In the previous section, we illustrated that MP is an effective PWL approximation algorithm for a specific log-domain function; the circuit schematic of a four-operand MP-based PWL approximation circuit is shown in Fig. 4.2. In this section, we illustrate some fundamental log-domain computations that can be synthesized based on the “log-sum-exp” function, together with their MP-based PWL approximation circuits. Based on these fundamental computation circuits, arbitrary log-domain arithmetic operations and their combinations can be implemented, which is validated with the classical mathematical functions presented in this section.
Current-mode signal representation is exploited so that the power dissipation can be controlled adaptively under different power-consumption requirements.

4.4.1 Addition

To compute a + b for two input variables a ∈ R and b ∈ R, the operands are first represented in their differential forms a = a^+ - a^- and b = b^+ - b^-, with a^+, a^-, b^+, b^- ∈ R^+. Then, as in Fig. 4.1, the operation is rewritten as

a + b = (a^+ + b^+) - (a^- + b^-) = e^{\log(a^+ + b^+)} - e^{\log(a^- + b^-)}.

All the operands are then mapped into a log-likelihood domain according to

L_a^+ = \log a^+,  L_a^- = \log a^-,  L_b^+ = \log b^+,  L_b^- = \log b^-,  (4.6)

where L_a^+, L_a^-, L_b^+, L_b^- ∈ R. Since our circuit implementation uses only unipolar current sources, we impose the additional constraint a^+, a^-, b^+, b^- > 1; due to the differential representation, this additional constraint does not affect the final output. Let z^+ denote \log(a^+ + b^+) and z^- denote \log(a^- + b^-). Then, (4.6) can be written as

a + b = e^{z^+} - e^{z^-},  (4.7)

and z^+ can be expanded as

z^+ = \log(a^+ + b^+) = \log\left( e^{L_a^+} + e^{L_b^+} \right).  (4.8)

We now apply the MP approximation to the log-sum-exp function to obtain

z^+ \approx M(L_a^+, L_b^+, \gamma).  (4.9)

Similarly,

z^- = \log(a^- + b^-) = \log\left( e^{L_a^-} + e^{L_b^-} \right) \approx M(L_a^-, L_b^-, \gamma).  (4.10)

Since equations (4.9) and (4.10) use only two operands, the circuit-level implementation is identical to Fig. 4.2 except that it has only two input branches instead of four. The single-ended circuit for addition is shown in Fig. 4.5. I_{z+}, the current through N_0, represents the value of z^+; I_{La+}, I_{Lb+} and I_γ represent L_a^+, L_b^+ and γ respectively.

Figure 4.5: MP circuit for implementing two-operand addition.
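A software model of the addition scheme (4.6)–(4.10), comparing the MP outputs against the exact log-domain values; this mirrors the comparison made in Fig. 4.6. The operand offset and γ below are illustrative choices:

```python
import math

def mp(L, gamma):
    # reverse water-filling model of the two-input MP circuit (Fig. 4.5)
    Ls = sorted(L, reverse=True)
    total = 0.0
    for n in range(1, len(Ls) + 1):
        total += Ls[n - 1]
        z = (total - gamma) / n
        if z <= Ls[n - 1] and (n == len(Ls) or z >= Ls[n]):
            return z
    return z

a, b, ofs, gamma = 1.2, -0.4, 2.0, 1.0
ap, am, bp, bm = a + ofs, ofs, b + ofs, ofs              # a+, a-, b+, b-
Lap, Lam, Lbp, Lbm = (math.log(v) for v in (ap, am, bp, bm))

z_exact = math.log(ap + bp) - math.log(am + bm)          # exact z+ - z-
z_mp = mp([Lap, Lbp], gamma) - mp([Lam, Lbm], gamma)     # eqs. (4.9), (4.10)
print(z_exact, z_mp)                                     # ~0.182 vs ~0.123
```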
Fig. 4.6(a)-(c) compares the output z^+ - z^- obtained using a floating-point implementation of the “log-sum-exp” functions, the MP approximation, and measurements (normalized currents) from the fabricated prototype.

Figure 4.6: Addition: z^+ - z^- computed according to (a) the original log-sum-exp function; (b) MP approximations (simulation); and (c) MP circuits (measurement).

Fig. 4.7(a)-(c) shows the measurement results when the transistors are biased in (a) strong inversion; (b) moderate inversion; and (c) weak inversion. Fig. 4.7(d)-(f) shows the approximation error between the measured response and the software model. The results validate our claim, for the addition operation, that the MP approximation is scalable across different biasing regions and that the circuit in Fig. 4.5 approximates the “log-sum-exp” function with a reasonable error. The error increases when the circuit is biased in weak inversion, as with any sub-threshold analog VLSI circuit; we attribute it to transistor mismatch, flicker noise, and errors introduced in mapping floating-point operands into analog operands (currents).

Figure 4.7: Addition: measurement results for z^+ - z^- with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; (d)-(f) computation errors for (a)-(c).

4.4.2 Subtraction

Synthesizing an MP circuit for subtraction a - b, with a and b being the two operands, is similar to the addition operation. The operands are written in differential form as a = a^+ - a^- and b = b^+ - b^-, with a^+, a^-, b^+, b^- ∈ R^+ satisfying a^+, a^-, b^+, b^- > 1.
Thus, a - b is written as

a - b = (a^+ + b^-) - (a^- + b^+) = e^{\log(a^+ + b^-)} - e^{\log(a^- + b^+)},  (4.11)

which, after conversion into a log-likelihood domain

L_a^+ = \log a^+,  L_a^- = \log a^-,  L_b^+ = \log b^+,  L_b^- = \log b^-,

leads to

z^+ = \log(a^+ + b^-) = \log\left( e^{L_a^+} + e^{L_b^-} \right) \approx M(L_a^+, L_b^-, \gamma)  (4.12)

and

z^- = \log(a^- + b^+) = \log\left( e^{L_a^-} + e^{L_b^+} \right) \approx M(L_a^-, L_b^+, \gamma),  (4.13)

where z^+ and z^- are the differential forms of the output log-likelihoods as described in (4.2). The circuit implementation of subtraction is identical to that of addition, except that the input differential currents are now permuted. Fig. 4.8 compares the measured response with the software-based “log-sum-exp” and MP implementations. Also, Fig. 4.9(a)-(c) shows the measured results when the transistors are biased in the three different operating regimes. Similar to the results obtained for the addition operation, the measured results show the bias-scalable operation of the MP circuits.

Figure 4.8: Subtraction: z^+ - z^- computed by (a) original functions; (b) MP approximations (simulation); and (c) MP circuits (measurement).

Figure 4.9: Subtraction: measurement results for z^+ - z^- with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; and (d)-(f) the respective approximation errors.

4.4.3 Four quadrant multiplication

Implementing a single-quadrant multiplier using the MP circuit is straightforward because z = a · b in its logarithmic form can be expressed as a sum of likelihoods L_z = L_a + L_b, where L_x denotes the logarithm of the operand x.
Thus, a single-quadrant multiplier requires only a current summation and can be implemented using KCL. For an MP-based four-quadrant multiplier, the operation a · b is again expressed in terms of the differential operands as

a \cdot b = (a^+ - a^-) \cdot (b^+ - b^-) = (a^+ b^+ + a^- b^-) - (a^- b^+ + a^+ b^-) = e^{\log(a^+ b^+ + a^- b^-)} - e^{\log(a^- b^+ + a^+ b^-)}.  (4.14)

After log-likelihood mapping, the following differential forms are obtained:

z^+ = \log(a^+ b^+ + a^- b^-) = \log\left( e^{\log(a^+ b^+)} + e^{\log(a^- b^-)} \right) \approx M(L_a^+ + L_b^+, L_a^- + L_b^-, \gamma)  (4.15)

and

z^- = \log(a^- b^+ + a^+ b^-) = \log\left( e^{\log(a^- b^+)} + e^{\log(a^+ b^-)} \right) \approx M(L_a^- + L_b^+, L_a^+ + L_b^-, \gamma).  (4.16)

The circuit implementing an MP-based four-quadrant multiplier is shown in Fig. 4.10. I_{z+} denotes the value of z^+, I_γ represents γ, and I_{La+} (I_{La-}), I_{Lb+} (I_{Lb-}) represent L_a^+ (L_a^-), L_b^+ (L_b^-) respectively.

Figure 4.10: MP circuit for implementing four-quadrant operand multiplication.

Fig. 4.11 compares the measured output z^+ - z^- with the ideal implementation of the model in (4.15) and (4.16). Fig. 4.12(a)-(c) shows the measurement results when the transistors are biased in (a) strong, (b) moderate and (c) weak inversion, along with the respective approximation errors. Again, the measured results show the bias-scaling property of MP circuits.

Figure 4.11: Multiplication: z^+ - z^- computed according to the (a) original function; (b) MP approximations (simulation); and (c) MP circuits (measurement).
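The four-quadrant multiplier of (4.14)–(4.16) can likewise be modeled in software; the sign of z^+ - z^- tracks the sign of a · b in all four quadrants. The offset and γ below are illustrative choices:

```python
import math

def mp(L, gamma):
    # reverse water-filling model of the MP circuit in Fig. 4.10
    Ls = sorted(L, reverse=True)
    total = 0.0
    for n in range(1, len(Ls) + 1):
        total += Ls[n - 1]
        z = (total - gamma) / n
        if z <= Ls[n - 1] and (n == len(Ls) or z >= Ls[n]):
            return z
    return z

def mult_mp(a, b, gamma=0.5, ofs=3.0):
    ap, am, bp, bm = a + ofs, ofs, b + ofs, ofs
    Lap, Lam, Lbp, Lbm = (math.log(v) for v in (ap, am, bp, bm))
    zp = mp([Lap + Lbp, Lam + Lbm], gamma)   # eq. (4.15)
    zm = mp([Lam + Lbp, Lap + Lbm], gamma)   # eq. (4.16)
    return zp - zm

# sign of z+ - z- follows the sign of a*b across the quadrants
print(mult_mp(1.5, -0.5))    # negative (a*b < 0)
print(mult_mp(-1.5, -0.5))   # positive (a*b > 0)
print(mult_mp(1.5, 0.5))     # positive (a*b > 0)
```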
Figure 4.12: Multiplication: measurement results for z^+ - z^- with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; and (d)-(f) the respective approximation errors.

4.4.4 Single Quadrant Division

Like the implementation of a single-quadrant multiplication, implementing a single-quadrant division in the logarithmic domain is straightforward. For operands a, b > 0, division in the logarithmic domain is equivalent to a subtraction operation L_a - L_b, which can be implemented using KCL. To implement four-quadrant division, the operation first requires extracting the sign of each operand as x = sgn(x) · |x|, where |x| is the absolute value of x. A single-quadrant division can then be applied to the absolute values, and the sign of the final output can be computed using only comparators.

4.4.5 Power and Polynomial Computation

For positive operands satisfying a ∈ R^+, a > 1, computing a^n, n ∈ Z^+, in the logarithmic domain is equivalent to

\log a^n = n \cdot \log(a) = n \cdot L_a,  (4.17)

which can be implemented using a simple current mirror as shown in Fig. 4.13. Thus, in MP-based synthesis, power functions like square root or cube root can be implemented using current mirrors, which significantly simplifies the implementation compared to translinear-based synthesis [58]. However, a four-quadrant implementation of power and polynomial functions would require evaluating a Taylor expansion using multiplication and addition operations. For a non-positive operand a, a ∈ R, its power function a^n, n ∈ Z^+, is handled by expressing a in differential form a = a^+ - a^-, with a^+, a^- ∈ R^+ and a^+, a^- > 1.
For arbitrary n, a^n can be written as

(a^+ - a^-)^n = \sum_{i=0}^{n} (-1)^i C_n^i (a^+)^{n-i} (a^-)^i
             = \sum_{i\,even} C_n^i (a^+)^{n-i} (a^-)^i - \sum_{i\,odd} C_n^i (a^+)^{n-i} (a^-)^i
             = e^{\log \sum_{i\,even} C_n^i (a^+)^{n-i} (a^-)^i} - e^{\log \sum_{i\,odd} C_n^i (a^+)^{n-i} (a^-)^i}.  (4.18)

Designate z^+ as \log \sum_{i\,even} C_n^i (a^+)^{n-i} (a^-)^i and z^- as \log \sum_{i\,odd} C_n^i (a^+)^{n-i} (a^-)^i in (4.18). Then

z^+ \approx M(L^+, \gamma),  z^- \approx M(L^-, \gamma),  (4.19)

with

L_i = \log C_n^i + (n - i) \log a^+ + i \log a^-,  (4.20)

where L^+ (L^-) is the group of L_i s satisfying i ∈ [0, n] with i even (odd).

Figure 4.13: A current mirror implementing the power function in the logarithmic domain.

As an example, a four-quadrant square function a² can be implemented by expressing it as a product of two differential operands, (a^+ - a^-) · (a^+ - a^-), which has an architecture similar to the four-quadrant multiplier described above. The differential outputs z^+ and z^- corresponding to the square operation can be expressed in terms of the differential log-likelihood inputs L_a^+, L_a^- as

z^+ \approx M(2L_a^+, 2L_a^-, \gamma)  (4.21)

z^- = \log 2 + L_a^+ + L_a^-.  (4.22)

Fig. 4.14 shows the measured results when the circuit in Fig. 4.10 is configured to compute a four-quadrant square function. Again, the circuit is shown to approximate the “log-sum-exp” function under strong-inversion, moderate-inversion and weak-inversion biasing conditions.

Figure 4.14: Power: z^+ - z^- computed according to the log-sum-exp function, the MP approximation (simulation) and the MP circuit (measurement).

4.5 Scalability analysis of MP circuit

In Fig. 3.4, we showed that the accuracy of the PWL approximation improves when the number of operands in the margin function increases. In this section, we analyze the effect of hardware artifacts (noise, mismatch, speed and input/output impedance) on the performance and scalability of MP circuits.
Figure 4.15: Noise model for the MP circuit in Fig. 4.2.

4.5.1 Noise and Mismatch Analysis of MP circuit

The noise model for the basic MP circuit in Fig. 4.2 is shown in Fig. 4.15. For the following analysis we assume that the number of input current branches in the MP circuit equals K. Also, assuming that the noise in K - 1 of the parallel branches (due to KCL) is mutually independent, the total output noise power I_{n,z}^2 can be expressed as

I_{n,z}^2 = \sum_{i=1}^{K} \left[ \frac{I_{n,N_i}^2}{(K-2)^2} + \frac{I_{n,P_i}^2}{(K-2)^2} + \frac{I_{n,source_i}^2}{g_m^2 R_{source}^2 (K-2)^2} \right] + \frac{I_{n,sink}^2}{K^2} + I_{n,N_0}^2,  (4.23)

where I_{n,source_i}, I_{n,N_i}, I_{n,P_i} and I_{n,sink} are the noise currents shown in Fig. 4.15 and R_{source} is the output impedance of the current sources. We have simplified the expression in (4.23) by assuming K to be large and by assuming that the small-signal parameters g_m and R_{source} are approximately equal for all the transistors. Equation (4.23) can then be written as

I_{n,z,thermal}^2 = \left( 1 + \frac{1}{K} + \frac{1}{g_m^2 R_{source}^2 K} \right) \frac{8}{3} kT g_m,  (4.24)

where we have considered only the effect of thermal noise, given by I_n^2 = (8/3) kT g_m [59]. The effect of flicker noise has been ignored in equation (4.24) because it exhibits a similar dependency on K as the thermal noise. Equation (4.24) shows that as the number of branches (or operands) K increases, the output noise reduces, and in the limit K → ∞ the output noise is just equal to the noise due to transistor N_0. The reduction in the output noise with the increase in K can be attributed to the attenuation and averaging effect of the K-stage current mirrors formed by N_1 ~ N_K. Also, it is important that g_m R_{source} >> 1, not only to reduce the effect of noise but also to ensure that the input impedance of the MP circuit is much lower than the output impedance of the current sources (see Fig. 4.15).
The mismatch analysis for the MP circuit follows a procedure similar to the noise analysis. The MOS transistor mismatch models used in the analysis are based on the work reported in [60], which uses two technology-dependent constants A_{V_T} and A_β that determine the mismatch variances as

\sigma^2(\Delta V_T) = \frac{A_{V_T}^2}{WL},  (4.25)

\left( \frac{\sigma(\Delta\beta)}{\beta} \right)^2 = \frac{A_\beta^2}{WL},  (4.26)

where V_T denotes the threshold voltage, β = μ C_{ox} W/L, and W and L represent the width and length of the MOS transistor. It is also claimed in [60] that for MOS transistors the mismatch in V_T dominates the mismatch in β; therefore, our analysis here is based only on the V_T mismatch. The error variance σ²(ΔV_T) can be converted into an equivalent noise current source in Fig. 4.15 according to

\sigma^2(I) = g_m^2 \sigma^2(\Delta V_T) = g_m^2 \frac{A_{V_T}^2}{WL}.  (4.27)

Hence, similar to the noise analysis, the error variance in the output current σ²(I_z) is given by

\sigma^2(I_z) \approx \left( 1 + \frac{1}{K} + \frac{1}{g_m^2 R_{source}^2 K} \right) g_m^2 \frac{A_{V_T}^2}{WL}.  (4.28)

Equation (4.28) shows that for a large K the accuracy of the MP circuit is limited by the mismatch of the output current-mirror stage N_0. Thus, all layout and impedance-transformation methods used for improving the accuracy of current mirrors, such as common-centroid layout and cascoding techniques, can be used to improve the accuracy of MP circuits. Cascoding the output-stage transistor N_0 increases the output impedance and makes the core MP module scalable for implementing large networks.

4.5.2 Speed and Bandwidth Analysis

As with other current-mode circuits, the transient response of the MP circuit is determined by (a) its slew rate, which determines the large-signal response, and (b) its bandwidth, which determines the small-signal response. The node A in Fig. 4.2 has the largest capacitance and is discharged by the current I_γ.
Therefore, the slew rate of the MP circuit can be expressed as

SR_A = \frac{I_\gamma}{K (C_{GS,N} + C_{GS,P} + C_{DB,P})},  (4.29)

where C_{GS,N} represents the gate-to-source capacitance of NMOS transistors N_1 through N_K, and C_{GS,P}/C_{DB,P} denotes the gate-to-source/drain-to-body capacitance of PMOS transistors P_1 through P_K. Equation (4.29) shows that the slew rate reduces as the number of input branches K increases. However, the small-signal bandwidth of the circuit remains invariant with respect to K; it is determined by the frequency of the pole located at node A and is given by

f_{-3dB,A} = \frac{g_{m,P}}{2\pi (C_{GS,N} + C_{GS,P} + C_{DB,P})},  (4.30)

where g_{m,P} represents the small-signal gate-referred transconductance of the PMOS transistors P_1 ~ P_K. The slew-rate and bandwidth analysis shows that the scaling of the MP circuit is limited by the application-specific speed requirements, which are controlled by the current I_γ. However, unlike in translinear synthesis, this limitation can be overcome in the MP circuit by re-biasing the circuit in strong inversion.

4.5.3 Performance comparison with Translinear synthesis

In this section, we summarize and compare the performance of CMOS circuits designed using MP-based synthesis with circuits designed using translinear synthesis. The comparisons made here are qualitative in nature because many of the performance metrics (silicon area and power dissipation) depend on the topology and complexity of the translinear circuit.

• Accuracy: MP-based synthesis relies on PWL approximations at the algorithmic level. However, the accuracy of its circuit-level implementation using CMOS current-mode circuits is limited only by mismatch. Our analysis in the previous section showed that the techniques used for designing precision current mirrors can also be used to improve the accuracy of MP circuits. TL-based synthesis, on the other hand, is precise at the algorithmic level.
However, its mapping to CMOS current-mode circuits is approximate due to the effect of the sub-threshold slope (unlike bipolar transistors) and finite drain impedance. These artifacts also introduce temperature dependency in TL circuits, whereas MP circuits are theoretically temperature invariant. Also, due to their operation in weak inversion, CMOS TL circuits are more prone to errors due to mismatch, whereas MP circuits can be re-biased in strong inversion if higher precision is desired, at the expense of higher power dissipation.

• Dynamic range: The bias-scalable property of the MP circuit enables it to achieve a larger dynamic range compared to a TL circuit. Also, MP synthesis uses currents to represent log-likelihoods, which likewise yields a larger dynamic range.

• Speed: For an MP circuit, higher speed can be achieved by re-biasing the circuit in strong inversion, without changing the circuit topology. For CMOS translinear circuits, higher speed can be achieved by using larger currents, which implies using larger transistors to maintain operation in weak inversion; this increases the silicon area compared to an equivalent MP circuit.

4.6 Application Example: Pattern Classifier

One of the applications where ultra-low-power analog computing is attractive is pattern classification [61, 62]. Most pattern classifiers do not require high precision, and analog artifacts (like mismatch and non-linearity) can be calibrated out using an offline training procedure. In this example, we use the MP circuits to design a bias-scalable linear classifier. A linear classifier implements an inner-product computation between a weight vector w ∈ R^n and the classifier input x ∈ R^n, given by

f = w^T x.  (4.31)

The weight vector w is determined by an offline training procedure [61] using a labeled training dataset.
The two vectors w and x are expressed in differential form as

w = w^+ - w^- = [w_1^+, w_2^+, ..., w_N^+]^T - [w_1^-, w_2^-, ..., w_N^-]^T,
x = x^+ - x^- = [x_1^+, x_2^+, ..., x_N^+]^T - [x_1^-, x_2^-, ..., x_N^-]^T,  (4.32)

with all w_i^+ (w_i^-) and x_i^+ (x_i^-) ∈ R^+. Then equation (4.31) can be rewritten as

w^T x = \sum_{i=1}^{N} (w_i^+ x_i^+ + w_i^- x_i^-) - \sum_{i=1}^{N} (w_i^+ x_i^- + w_i^- x_i^+)
      = e^{\log \sum_{i=1}^{N} (w_i^+ x_i^+ + w_i^- x_i^-)} - e^{\log \sum_{i=1}^{N} (w_i^+ x_i^- + w_i^- x_i^+)}
      = e^{z^+} - e^{z^-},  (4.33)

where z^+ and z^- are expressed as

z^+ = \log \sum_{i=1}^{N} (w_i^+ x_i^+ + w_i^- x_i^-) \approx M(L_{w1}^+ + L_{x1}^+, L_{w1}^- + L_{x1}^-, ..., L_{wN}^+ + L_{xN}^+, L_{wN}^- + L_{xN}^-, \gamma),
z^- = \log \sum_{i=1}^{N} (w_i^+ x_i^- + w_i^- x_i^+) \approx M(L_{w1}^+ + L_{x1}^-, L_{w1}^- + L_{x1}^+, ..., L_{wN}^+ + L_{xN}^-, L_{wN}^- + L_{xN}^+, \gamma).  (4.34)

L_{wi}^+ (L_{wi}^-) and L_{xi}^+ (L_{xi}^-) denote the logarithms of w_i^+ (w_i^-) and x_i^+ (x_i^-) in equation (4.33). For a binary classifier, only the sign of z^+ - z^- is important, implying that the output stage in the architecture (Fig. 4.1) can be implemented using a comparator (instead of computing e^{z^+} - e^{z^-}). The system architecture and circuit diagram for the pattern classifier are shown in Fig. 4.16.

Figure 4.16: System architecture and circuit diagram of the MP-based pattern classifier.

In this example, we generated a synthetic linearly separable dataset corresponding to two classes: class I and class II. We used a linear support vector machine training algorithm [63] to determine the weight vector w, which was then programmed as currents on the prototype chip. Test vectors x were selected randomly and applied as inputs to the chip. For the sake of visualization, we show only results from a two-dimensional dataset.
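The classifier computation (4.33)–(4.34) can be prototyped in software as two MP units over the 2N sums of log-likelihoods followed by a comparator on z^+ - z^-. The weight vector, offset and γ below are illustrative choices and are not the values programmed on the chip:

```python
import math

def mp(L, gamma):
    # reverse water-filling model of the MP inner-product circuit
    Ls = sorted(L, reverse=True)
    total = 0.0
    for n in range(1, len(Ls) + 1):
        total += Ls[n - 1]
        z = (total - gamma) / n
        if z <= Ls[n - 1] and (n == len(Ls) or z >= Ls[n]):
            return z
    return z

def classify(w, x, gamma=1.0, ofs=3.0):
    """Comparator on z+ - z- approximating sign(w.x), eqs. (4.33)-(4.34)."""
    log = math.log
    Lp, Lm = [], []
    for i in range(len(w)):
        wp, wm, xp, xm = w[i] + ofs, ofs, x[i] + ofs, ofs
        Lp += [log(wp) + log(xp), log(wm) + log(xm)]   # operands of z+
        Lm += [log(wp) + log(xm), log(wm) + log(xp)]   # operands of z-
    return 1 if mp(Lp, gamma) - mp(Lm, gamma) > 0 else -1

w = [1.0, -1.0]                   # toy weight vector: boundary x1 = x2
print(classify(w, [2.0, -1.0]))   # +1 (class I,  w.x > 0)
print(classify(w, [-1.0, 2.0]))   # -1 (class II, w.x < 0)
```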
In this special case, the classifier equation (4.31) reduces to a two-dimensional inner-product formulation, as described in the example in section 4.2. The circuit implementation is therefore identical to that of the four-quadrant multiplier in Fig. 4.10. Fig. 4.17 shows the classification boundary and the classifier scores (negative values are represented by darker shades, positive values by lighter shades) obtained from software simulations and from the fabricated prototype under different biasing conditions. Again, the results show that the operation of the prototype is bias-scalable, and the classification performance has been verified to be similar to that of the software model.

Figure 4.17: Classifier: z^+ - z^- simulated by (a) log-sum-exp function (simulation); (b) MP approximation (simulation); and measured z^+ - z^- with the transistors biased in (c) strong inversion (measurement); (d) moderate inversion (measurement); (e) weak inversion (measurement).

Chapter 5

An analog LDPC decoder implemented on MP-based PWL circuits

5.1 Performance metrics and trade-off of LDPC decoder

Low-density parity-check (LDPC) codes [64] constitute an important class of capacity-approaching error-correcting codes that has seen widespread acceptance in emerging communication standards [65, 66, 67]. One of the key factors behind the success of LDPC codes is their iterative decoding algorithms [68, 69, 70, 71], which are scalable and hence can be easily mapped onto digital [72, 73, 74, 75] and analog [76, 77, 78, 79] hardware.
As applications of LDPC codes expand into new areas like sensor networks [80, 81, 82], it is becoming more important to understand the theoretical foundations that govern and determine the energy efficiency of LDPC decoders. Even though numerous studies reported in the literature evaluate the bit-error-rate (BER) performance of LDPC decoding at different signal-to-noise ratios (SNR) [83, 84], there exists no theoretical basis for understanding the BER/SNR trade-off with respect to the decoder's energy efficiency. Understanding this trade-off is important because:

• Different applications impose different constraints on the energy efficiency of the decoding algorithm. For instance, a decoder used in a sensor network is more severely energy-constrained [81, 82] than a decoder used in the base-station of a wireless network. Therefore, decoding algorithms that provide an elegant trade-off between BER, SNR and power dissipation are important, since superior energy efficiency can be achieved by sacrificing BER performance.

• Understanding the trade-off could lead to a novel class of decoding algorithms that deliver a BER performance similar to that of existing state-of-the-art algorithms while demonstrating superior energy efficiency.

One way to quantify the energy efficiency of a decoding algorithm is to evaluate the sparsity of the messages passed between the nodes of the bipartite graph during LDPC decoding. Here we consider one notion of sparsity: the L0 norm of the messages, defined as the fraction of non-zero messages out of a fixed total. Since a zero-valued message consumes no energy, for a fixed total number of messages a higher fraction of zero messages implies a lower L0 norm and hence lower energy consumption. In this regard, the overall energy consumption can be characterized by the L0 norm of the messages.
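The normalized L0 norm used as the energy proxy above is simple to state in software. A minimal sketch (Python; the `tol` parameter is our own addition, to treat near-zero analog messages as free):

```python
def normalized_l0(messages, tol=0.0):
    """Normalized L0 norm: the fraction of messages whose magnitude
    exceeds tol. With tol=0 only exactly-zero messages count as 'free'."""
    active = sum(1 for m in messages if abs(m) > tol)
    return active / len(messages)
```

Under the assumption that only non-zero messages cost energy, the relative energy consumption of a decoding run is proportional to this quantity.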
This method of energy analysis is relevant to analog implementations because the decoding of large LDPC codes is communication-intensive rather than computation-intensive [77, 78, 79]. Also, it has been shown that in emerging deep sub-micron and nanoscale integrated circuits, the energy dissipated during communication (through interconnects) will be significantly larger than the energy cost of local computations [85]. We use density evolution (DE) [86, 87, 84] as a tool to observe and analyze the L0 norm of the messages generated during iterative LDPC decoding.

5.2 Density evolution and message sparsity

5.2.1 Density evolution analysis for conventional LDPC decoding algorithms

Figure 5.1: Factor graph corresponding to a 32-bit (3, 4) LDPC code, with check nodes c_1, ..., c_24 and variable nodes v_1, ..., v_32.

Density evolution is a well-established technique to determine the asymptotic performance of any LDPC code for a given channel condition and code rate [87]. It provides a way to study the decoding algorithm by analyzing the evolution of message densities on a hypothetical infinite tree-like graph with the same local topology as the LDPC code of interest [88]. In this section, we first provide a brief introduction to density evolution analysis, whose details can be found in [87]. Consider the example factor graph shown in Fig. 5.1. It consists of variable nodes v_k, k = 1, ..., N, which are connected to check nodes c_i, i = 1, ..., M, by edges. In the description that follows, the numbers of edges associated with each check node and variable node (also known as the degrees of the nodes) are denoted by d_c and d_v. Let V_i denote the set of check nodes connected to the variable node v_i, and let V_{i~j} represent the set of check nodes other than c_j that are connected to variable node v_i. Similarly, let C_j denote the set of variable nodes connected to the check node c_j, and let C_{j~i} represent the set of variable nodes other than the node v_i connected to c_j.
In a sum-product based LDPC decoding algorithm [69], each check node c_j receives messages from its set of neighbors C_j (denoted by L(v_i \to c_j)) and computes the messages to be sent to the variable nodes v_i \in C_j (denoted by L(c_j \to v_i)) according to

L(c_j \to v_i) = 2 \tanh^{-1} \left( \prod_{v_k \in C_{j~i}} \tanh \frac{L(v_k \to c_j)}{2} \right)    (5.1)

In subsequent iterations, each variable node v_i receives messages from its neighboring check nodes c_j \in V_i and re-computes the messages that will be sent to the check node c_j (denoted by L(v_i \to c_j)) according to

L(v_i \to c_j) = L(v_i) + \sum_{c_k \in V_{i~j}} L(c_k \to v_i)    (5.2)

where L(v_i) denotes the initial message obtained from the communication channel. Messages are then propagated back and forth between the variable and check nodes for a pre-determined number of iterations before a decision on the received bits is made [69]. An alternative implementation of the LDPC decoder is based on the min-sum approximation of L(c_j \to v_i) in equation (5.1) according to

L(c_j \to v_i) = L(v_k \to c_j),    (5.3)

where k = \arg\min_{\hat{k}: v_{\hat{k}} \in C_{j~i}} |L(v_{\hat{k}} \to c_j)|. Even though min-sum decoders are easier to implement in hardware than sum-product decoders, we will show later in this section and in section 5.3 that the message distributions the algorithm produces are less sparse, and hence that its energy efficiency is inferior to that of sum-product decoders.
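For reference, the node updates (5.1)-(5.3) can be sketched in a few lines of Python (the function names are ours; each function takes the messages from the *other* neighbors, i.e. over C_{j~i} or V_{i~j}):

```python
import math

def check_update_sum_product(incoming):
    """L(cj -> vi) from the messages of all other neighbours (eq. 5.1)."""
    prod = 1.0
    for L in incoming:
        prod *= math.tanh(L / 2.0)
    return 2.0 * math.atanh(prod)

def check_update_min_sum(incoming):
    """Min-sum approximation as stated in eq. (5.3): the outgoing message
    is the incoming message of smallest magnitude. Note: practical min-sum
    decoders usually also multiply in the product of the message signs."""
    return min(incoming, key=abs)

def variable_update(channel_llr, incoming):
    """L(vi -> cj) (eq. 5.2): channel LLR plus all other check messages."""
    return channel_llr + sum(incoming)
```

For two incoming messages {2.0, -1.0}, the sum-product update gives 2 tanh^{-1}(tanh(1) tanh(-0.5)) ≈ -0.735, while the min-sum update returns -1.0.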
Figure 5.2: Message PDF evolution (iterations 1, 3, 5) for a (3,6) LDPC code: (a) sum-product variable-to-check messages; (b) sum-product check-to-variable messages; (c) min-sum variable-to-check messages; (d) min-sum check-to-variable messages.

In this regard, density evolution will serve as an important computational tool to track the distributions of the messages propagated between the variable and check nodes and hence to visualize their sparsity. The DE procedure computes the probability density of the messages during an iterative decoding algorithm based on a recursive estimation procedure that can be evaluated analytically or numerically using Monte-Carlo simulations [87, 86, 84]. By observing the asymptotic message distributions, traditional DE analysis estimates a minimum threshold parameter (e.g. the noise variance of a Gaussian channel) that ensures a BER performance below some specified tolerance. Even though the asymptotic behavior is of importance in DE, we will use the formulation for estimating the sparsity of messages, which will serve as a theoretical estimator of energy efficiency for a given decoder. The underlying assumption in DE analysis is that the graph rooted at a variable node, with messages incident from the variable node, is cycle-free (for a finite number of iterations), which permits the presumption of independence between incoming messages [88]. The independence assumption implies that the summation of messages at variable and check nodes translates to convolution of their probability distributions, making it easier to track the evolution of the message densities [86].
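The convolution step just mentioned is the workhorse of a numerical DE implementation. A minimal discrete sketch (Python, illustrative only; it handles the variable-side summation and ignores the check-side nonlinearity, which is treated later):

```python
def convolve(p, q):
    """Discrete convolution of two message PMFs over a uniform grid.
    Index i + j of the result corresponds to the sum of the supports."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def variable_density(P0, R, dv):
    """Variable-side density P = P0 (x) R^{(x)(dv-1)} for a degree-dv
    variable node, where (x) denotes convolution."""
    P = P0
    for _ in range(dv - 1):
        P = convolve(P, R)
    return P
```

Since convolution preserves total probability, the output of `variable_density` still sums to 1 when its inputs do.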
We now outline the equations which concisely capture the analytical framework for DE analysis. Let the degree distribution function \lambda(x) (\rho(x)) for variable (check) nodes be represented as

\lambda(x) = \lambda_2 x + \lambda_3 x^2 + \cdots + \lambda_n x^{(n-1)} = \sum_{i=2}^{n} \lambda_i x^{(i-1)}    (5.4)

\rho(x) = \rho_2 x + \rho_3 x^2 + \cdots + \rho_{q+1} x^q = \sum_{i=2}^{q+1} \rho_i x^{(i-1)}    (5.5)

where \rho_i is the fraction of edges connected to check nodes with degree i, \lambda_i is the fraction of edges connected to variable nodes with degree i, and x^i represents the convolution of the message density x with itself (i - 1) times. Note: the tuple (\lambda, \rho) is often referred to as the "code degree distribution". Also, let the message distributions corresponding to check and variable nodes after \ell message-passing iterations be denoted by R_\ell and P_\ell. Then, DE analysis computes R_\ell and P_\ell recursively according to:

R_\ell = \Gamma^{-1}\left( \rho\left( \Gamma\left( P_0 \otimes \lambda(R_{\ell-1}) \right) \right) \right)    (5.6)

P_\ell = P_0 \otimes \lambda(R_{\ell-1})    (5.7)

where P_0 represents the probability distribution of the messages received from the communication channel, \otimes represents the convolution operation, and \Gamma is the probability density function (PDF) transformation due to the non-linear operation at each check node. For instance, in sum-product decoding \Gamma is associated with the change of variables related to the tanh(.) function in equation (5.1). Note that for sum-product decoding \Gamma = \Gamma^{-1}. At the final step of the DE recursions given by equation (5.7), the fraction of incorrect messages propagated from the variable nodes to the check nodes can be determined using:

P_e(\lambda, \rho) = \sum_{x < 0} P_\ell(x)    (5.8)

where x is a quantized random variable within a bounded range, with the error region defined relative to the transmitted codeword. The probability distribution functions computed by DE at each iteration step can also be used to compute metrics that quantify the sparsity of messages. One metric of sparsity commonly used in the signal processing literature [89] is the L0 norm. For a vector \theta, \|\theta\|_0 denotes the number of non-zero elements of \theta.
However, the size of the vector is usually taken into account, so \|\theta\|_0 is normalized to the fraction of non-zero coefficients [90]. Let L_\ell(c \to v) (L_\ell(v \to c)) denote the set of messages propagated from check (variable) nodes to variable (check) nodes during the \ell-th iteration, with L0 norm \|L_\ell(c \to v)\|_0 (\|L_\ell(v \to c)\|_0). Since the PDFs computed by DE give the discrete message density, it is convenient to calculate the normalized L0 norms as

\|L_\ell(c \to v)\|_0 = \sum_{x \neq 0} R_\ell(x)    (5.9)

\|L_\ell(v \to c)\|_0 = \sum_{x \neq 0} P_\ell(x)    (5.10)

The random variable x in (5.9) and (5.10) is quantized. In Figure 5.2 we show the mean message PDFs obtained during the iterative decoding of a length-2000, (3,6) LDPC code. The decoding algorithms employed are sum-product and min-sum. To obtain the PDFs, a zero codeword was transmitted through an additive white Gaussian noise (AWGN) channel with an SNR of 2.5 dB before the decoding was carried out. The figure shows trends similar to those previously reported for DE analysis in [87, 86]. The distribution P_\ell (Fig. 5.2 (a) and (c)) shifts to the right, reducing the area under the curve to the left of the origin. As a result, after several iterations the probability of error calculated according to equation (5.8) is small. Additionally, we can infer (by comparing Fig. 5.2 (b) and (d)) that the PDF of the messages propagated from check nodes to variable nodes in the \ell-th iteration, denoted by L_\ell(c \to v), is more concentrated around zero for the sum-product algorithm than for the min-sum algorithm. Consequently, the sum-product algorithm has a lower L0 norm for L_\ell(c \to v) than the min-sum algorithm. Overall, the L0 norm of L_\ell(c \to v) increases with the number of iterations \ell, and the message distribution approaches a Gaussian distribution as the number of iterations increases.
This observation has been verified by several previous studies [91, 92] and is also the basis of the Gaussian-approximation analysis of the sum-product algorithm [91].

5.2.2 Sparse LDPC decoding with MP

To understand how one can achieve sparsity in LDPC decoding, we analyze the core transforms associated with the operator \Gamma in the DE recursion given in equation (5.6). Since the check node computations in LDPC decoding can be performed pair-wise, we analyze the properties of the two-dimensional transforms related to \Gamma. For sum-product decoding the check function is given by [47]

\phi_y^{sp}(x) = \log \frac{1 + e^{x+y}}{e^x + e^y}    (5.11)

where x and y are the two arguments of the function \phi, representing the messages in (5.1) and (5.2); (5.11) gives the result of combining the two messages x and y, and can be applied recursively to compute the sum-product check function. For min-sum decoding the equivalent operation \phi_y^{ms}(x) is given by [47]

\phi_y^{ms}(x) = \max(0, x + y) - \max(x, y)    (5.12)

with x and y as in (5.11); (5.12) is likewise applicable recursively in the check function computation. Figure 5.3 compares the functions \phi_y^{sp}(x) and \phi_y^{ms}(x) for y = 1.0. The plot shows that around the origin the behavior of both functions is linear. As a result, during the first few iterations of LDPC decoding, when the values of the messages propagated from the check to the variable nodes are concentrated around the origin (see Fig. 5.2), the resulting distribution obtained using the transform \Gamma is non-sparse. This indicates that if we choose a transform function that contains a "dead-zone" around the origin while still approximating the sum-product transformation, we could obtain a sparser LDPC decoding procedure.
Figure 5.3: Comparison of the pair-wise decoding functions \phi_{1.0}(x) for the sum-product and min-sum LDPC decoding algorithms.

In the first step of our approximation procedure, the sum-product transform is decomposed as

\phi_y^{sp}(x) = \log(1 + e^{x+y}) - \log(e^x + e^y)    (5.13)

which is a difference of two "log-sum-exp" functions whose general form can be expressed as

z_{log} = \log \left( \sum_{i=1}^{N} e^{L_i} \right)    (5.14)

where {L_i}, i = 1, ..., N, denotes a set of log-likelihood ratios which are used in DE analysis and LDPC decoding. Equation (5.14) can be re-written equivalently as:

\sum_{i=1}^{N} e^{L_i - z_{log}} = 1    (5.15)

Next, we lower-bound the exponential in equality (5.15) using the inequality [x]_+ \leq e^x, where [x]_+ = \max(x, 0) is a threshold function. Equation (5.15) can then be expressed in its piecewise-linear form as

\sum_{i=1}^{N} [L_i - z_{log}]_+ \leq 1.    (5.16)

As seen from equation (5.16), if z_{log} is replaced by z as in equation (3.1), and the bound "\leq 1" is replaced by the constraint "= \gamma" as in equation (3.1), equation (5.16) turns into the MP algorithm. When the MP-based PWL approximation is applied to the "log-sum-exp" functions in equation (5.13), the MP-based check function is obtained as:

\phi_y^{mp}(x) = M(\{0, x + y\}, \gamma) - M(\{x, y\}, \gamma)    (5.17)

Fig. 5.4 compares the margin propagation check function to its sum-product and min-sum equivalents for different values of the hyper-parameter γ. It can be seen that the margin check function is a PWL approximation to the sum-product check function, and that the width of the "dead-zone" around the origin is controlled by the hyper-parameter γ. As γ diminishes to zero, the "dead-zone" shrinks to zero as well and the margin propagation algorithm converges to min-sum, as can be seen in Figure 5.4. The "dead-zone" can also be clearly seen in the three-dimensional plots of the check functions illustrated in Figure 5.5.
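The three pair-wise check functions are easy to compare numerically. A minimal sketch (Python, illustrative only; `margin2` is the two-input closed form of M, which for two scores reduces to max(max(a,b) - γ, (a+b-γ)/2)):

```python
import math

def margin2(a, b, gamma):
    """Two-input margin function M({a, b}, gamma), closed form."""
    return max(max(a, b) - gamma, (a + b - gamma) / 2.0)

def phi_sp(x, y):
    """Sum-product pair-wise check function, eq. (5.11)."""
    return math.log((1.0 + math.exp(x + y)) / (math.exp(x) + math.exp(y)))

def phi_ms(x, y):
    """Min-sum pair-wise check function, eq. (5.12)."""
    return max(0.0, x + y) - max(x, y)

def phi_mp(x, y, gamma):
    """MP pair-wise check function, eq. (5.17)."""
    return margin2(0.0, x + y, gamma) - margin2(x, y, gamma)
```

With γ = 1.75, small inputs fall inside the dead-zone (phi_mp(0.2, 1.0, 1.75) evaluates to 0), whereas letting γ approach zero recovers the min-sum function, consistent with Fig. 5.4.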
Note that for both the sum-product check function (Figure 5.5(a)) and the min-sum check function (Figure 5.5(b)), the origin is a saddle point and the gradient at the origin is non-zero.

Figure 5.4: Comparison of the pair-wise decoding functions \phi_{1.0}(x) for the sum-product, min-sum and margin propagation (γ = 1.25, 1.75, 2.25) decoding algorithms.

Figure 5.5: Comparison of the 3D LDPC check functions for (a) sum-product, (b) min-sum and (c) margin propagation with γ = 1.75.

5.2.3 Density evolution analysis of MP

As shown above, at the core of margin propagation based decoding is the margin approximation algorithm; consequently, we first simplify it. If we consider two likelihood scores with levels \ell_1 and \ell_2, the solution of the algorithm is the level z. Figure 5.6 displays the possible solutions for different values of \ell_1 and \ell_2. Mathematically, the solution z = M(\ell_1, \ell_2) can be stated as:

z = \max(\ell_1, \ell_2) - \gamma, if |\ell_1 - \ell_2| > \gamma; z = \frac{\ell_1 + \ell_2 - \gamma}{2}, otherwise    (5.18)

  = \max\left( \max(\ell_1, \ell_2) - \gamma, \frac{\ell_1 + \ell_2 - \gamma}{2} \right)    (5.19)

Figure 5.6: Possible solutions z of M(\ell_1, \ell_2) for the cases (a) |\ell_1 - \ell_2| > \gamma and (b) |\ell_1 - \ell_2| \leq \gamma.

If the (recursive) operation at the check side of an LDPC graph is represented as L_1 \boxplus L_2, then:

L_1 \boxplus L_2 = M(L_1^+ + L_2^+, L_1^- + L_2^-, \gamma) - M(L_1^+ + L_2^-, L_1^- + L_2^+, \gamma) = z_1 - z_2    (5.20)

where L_1^+, L_1^-, L_2^+, L_2^- are the differential forms of L_1, L_2 given by:

L_1^+ = T + L_1/2,  L_1^- = T - L_1/2
L_2^+ = T + L_2/2,  L_2^- = T - L_2/2    (5.21)

where T denotes the common-mode signal.
Using equation (5.18), the terms z_1 and z_2 in equation (5.20) can be rewritten as:

z_1 = \max(L_1^+ + L_2^+, L_1^- + L_2^-) - \gamma, if |L_1^+ + L_2^+ - L_1^- - L_2^-| \geq \gamma;
z_1 = \frac{L_1^+ + L_2^+ + L_1^- + L_2^- - \gamma}{2}, otherwise    (5.22)

z_2 = \max(L_1^+ + L_2^-, L_1^- + L_2^+) - \gamma, if |L_1^+ + L_2^- - L_1^- - L_2^+| \geq \gamma;
z_2 = \frac{L_1^+ + L_2^- + L_1^- + L_2^+ - \gamma}{2}, otherwise    (5.23)

The computations outlined in equations (5.20), (5.22) and (5.23) transform the PDFs of the incoming messages L_1, L_2 into a new PDF. This PDF transformation can be captured "exactly" by using the surjective function corresponding to the operation L_1 \boxplus L_2. Let us represent this PDF transfer function as follows:

MP: P_{L_1}(\ell_1) \times P_{L_2}(\ell_2) \to P_Z(z), where \ell_1, \ell_2, z \in \mathbb{R}    (5.24)

More specifically, the resulting PDF is calculated as follows:

P_Z(z) = \int_{\ell_1 \in \mathbb{R}} \int_{\{\ell_2 \in \mathbb{R} \,|\, \ell_1 \boxplus \ell_2 = z\}} P_{L_1, L_2}(\ell_1, \ell_2) \, d\ell_2 \, d\ell_1
       = \int_{\ell_1 \in \mathbb{R}} \int_{\{\ell_2 \in \mathbb{R} \,|\, \ell_1 \boxplus \ell_2 = z\}} P_{L_1}(\ell_1) P_{L_2}(\ell_2) \, d\ell_2 \, d\ell_1    (5.25)

In the above equation, the cycle-free assumption [86] for the message-incident sub-graph (rooted at a variable node) allows us to express the joint distribution P_{L_1, L_2}(\ell_1, \ell_2) as the product of the marginals P_{L_1}(\ell_1) P_{L_2}(\ell_2). Equivalently to the parity degree distribution function (5.5) used in the density evolution analysis of the sum-product algorithm, we can now define the following parity degree distribution function for margin propagation:

\rho_{MP}(x) = \rho_2 x + \rho_3 x^{\boxplus 2} + \cdots + \rho_{q+1} x^{\boxplus q} = \sum_{i=2}^{q+1} \rho_i x^{\boxplus (i-1)}    (5.26)

where x^{\boxplus i} represents MP(\cdots MP(MP(x, x), x) \cdots, x) applied i times. Consequently, the density evolution analysis for margin propagation at the check side becomes:

R_\ell = \rho_{MP}\left( P_0 \otimes \lambda(R_{\ell-1}) \right)    (5.27)

Using equations (5.7), (5.25), (5.26) and (5.27) we determine the AWGN noise thresholds for a (3,6) and an irregular LDPC code, which we list in Table 5.1 (under the exact analysis column).
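The differential combine of equations (5.20)-(5.23) can be sketched directly in software. The sketch below (Python, illustrative only; `margin2` is the two-input closed form of M from equation (5.19)) also lets one check numerically that the differential output is independent of the common-mode level T:

```python
def margin2(a, b, gamma):
    """Two-input margin function M({a, b}, gamma), eq. (5.19)."""
    return max(max(a, b) - gamma, (a + b - gamma) / 2.0)

def mp_combine(L1, L2, gamma, T=0.0):
    """Differential MP check-side combine, eqs. (5.20)-(5.23).
    T is the common-mode level of eq. (5.21); the output z1 - z2
    does not depend on it."""
    L1p, L1n = T + L1 / 2.0, T - L1 / 2.0
    L2p, L2n = T + L2 / 2.0, T - L2 / 2.0
    z1 = margin2(L1p + L2p, L1n + L2n, gamma)
    z2 = margin2(L1p + L2n, L1n + L2p, gamma)
    return z1 - z2
```

For instance, with γ = 1.75 the combine of L_1 = 2 and L_2 = 1 evaluates to 0.625 for any choice of T.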
Note that the outer integral over \ell_1 in the PDF equation (5.25) (corresponding to L_1 \boxplus L_2) involves a calculation over the entire real line \mathbb{R}, which we can restrict to sufficiently large values for practical considerations. However, determining the domain of the inner integral in equation (5.25) involves a search operation identifying the region where the equation \ell_1 \boxplus \ell_2 = z is satisfied. This search operation significantly increases the complexity of the density evolution analysis for margin propagation. We therefore look for simplifications which reduce the complexity involved in computing thresholds for the margin propagation algorithm without sacrificing too much accuracy. A similar complexity-reducing approach has been proposed for sum-product threshold calculation and is often referred to as the Gaussian approximation analysis for sum-product [91]. To obtain approximations which simplify the analysis of margin propagation decoding, let us consider the expressions involved in the first part of equation (5.22) in conjunction with equation (5.21).
We get:

L_1^+ + L_2^+ = (T + L_1/2) + (T + L_2/2) = 2T + \frac{1}{2}[L_1 + L_2]
L_1^- + L_2^- = (T - L_1/2) + (T - L_2/2) = 2T - \frac{1}{2}[L_1 + L_2]

To simplify the representation we substitute U = \frac{1}{2}[L_1 + L_2]; then:

L_1^+ + L_2^+ = 2T + U
L_1^- + L_2^- = 2T - U    (5.28)

Substituting the representation (5.28) in the first part of equation (5.22) gives:

\max(L_1^+ + L_2^+, L_1^- + L_2^-) - \gamma = 2T + \max(U, -U) - \gamma = 2T + |U| - \gamma    (5.29)

Substituting the representation (5.28) in the second part of equation (5.22) gives:

\frac{L_1^+ + L_2^+ + L_1^- + L_2^- - \gamma}{2} = \frac{1}{2}[2T + U + 2T - U] - \frac{\gamma}{2} = 2T - \frac{\gamma}{2}    (5.30)

Using equations (5.19), (5.29) and (5.30) we can rewrite z_1 as:

z_1 = \max\left( 2T + |U| - \gamma, \; 2T - \frac{\gamma}{2} \right) = 2T + \max\left( |U| - \gamma, \; -\frac{\gamma}{2} \right)    (5.31)

If the common-mode signal T is set to zero, the term 2T disappears from the above equations. Using arguments similar to those used for z_1, and substituting V = \frac{1}{2}[L_1 - L_2], we can derive the following expression for z_2:

z_2 = \max\left( 2T + |V| - \gamma, \; 2T - \frac{\gamma}{2} \right) = 2T + \max\left( |V| - \gamma, \; -\frac{\gamma}{2} \right)    (5.32)

Again, if we choose T = 0, the term 2T disappears from the above equations. Substituting (5.31) and (5.32) in (5.20) gives:

L_1 \boxplus L_2 = \max\left( |U| - \gamma, \; -\frac{\gamma}{2} \right) - \max\left( |V| - \gamma, \; -\frac{\gamma}{2} \right)    (5.33)

Figure 5.7: Message PDF evolution (iterations 1, 3, 5) for a (3,6) LDPC code under margin propagation: (a) variable-to-check messages; (b) check-to-variable messages.

Table 5.1: Threshold comparison for the AWGN channel under margin propagation and sum-product

\lambda(x)         \rho(x)/\rho_{MP}(x)   MP threshold (γ = 1.75),   MP threshold (γ = 1.75),        Sum-product
                                          exact density evolution    approximate density evolution   threshold
x^2                x^5 / x^{\boxplus 5}   1.14 dB                    1.29 dB (1)                     1.10 dB [87]
0.4x + 0.6x^2      x^4 / x^{\boxplus 4}   1.09 dB                    0.95 dB                         1.07 dB [93]

Let us now focus on calculating the evolution of the message probability densities. The absolute value operation simply folds the probability distribution.
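Before continuing, the simplified combine of equation (5.33) is easy to sanity-check in software. A minimal sketch (Python, illustrative only) evaluates it with the common mode T set to zero:

```python
def mp_combine_simplified(L1, L2, gamma):
    """L1 [+] L2 from eq. (5.33), with the common mode T set to zero."""
    U = (L1 + L2) / 2.0
    V = (L1 - L2) / 2.0
    return max(abs(U) - gamma, -gamma / 2.0) - max(abs(V) - gamma, -gamma / 2.0)
```

With γ = 1.75, combining L_1 = 2 and L_2 = 1 gives 0.625, matching the value obtained from the unsimplified differential form in equations (5.20)-(5.23); flipping the sign of one input flips the sign of the output, as a check-node combine should.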
Moreover, since L_1 and L_2 are drawn from independent random variables (due to the cycle-free assumption), the summation or difference of the two used to obtain U and V simply leads to the convolution of the probability density functions. The calculations associated with U and V are:

U = \frac{1}{2}[L_1 + L_2]    (5.34)

V = \frac{1}{2}[L_1 - L_2]    (5.35)

The operation c = \max(a, b), where a and b are drawn from two independent random variables A and B, gives c with a probability distribution function C determined as follows [94]:

P(C = c) = P(A = c)P(B \leq c) + P(B = c)P(A \leq c)    (5.36)

Footnote 1: Check-side density evolution for x^{\boxplus 5} is calculated by partitioning it into (x^{\boxplus 4}) \boxplus x. Density evolution corresponding to the part within the brackets is calculated using the approximate analysis, followed by exact analysis for the remaining part.

Using the above arguments we can determine the PDFs of z_1 and z_2 from the PDFs of L_1 and L_2. However, determining the PDF of z_1 - z_2 is not trivial, since in general z_1 and z_2 are drawn from dependent random variables. If, however, the random variables corresponding to L_1 and L_2 are independent identically distributed (i.i.d.) and Gaussian in nature, the PDF of z_1 - z_2 is just the convolution of the PDFs corresponding to z_1 and -z_2. Due to the i.i.d. and Gaussian restriction, the density evolution computation for the parity side needs to be carried out in a particular order. For example, if we are interested in determining the PDF of L_1 \boxplus L_2 \boxplus L_3 \boxplus L_4, we first determine the PDFs corresponding to (L_1 \boxplus L_2) and (L_3 \boxplus L_4). This gives two PDFs which are i.i.d. but not Gaussian, so we approximate them with Gaussian distributions before computing the PDF of (L_1 \boxplus L_2) \boxplus (L_3 \boxplus L_4). Using the approximation approach described above in conjunction with equation (5.33) allows us to track the evolution of the message densities at the parity side of an LDPC graph when margin propagation based decoding is employed.
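For quantized densities, the max-of-independents rule of equation (5.36) can be sketched via the CDF product F_C = F_A F_B, a tie-safe discrete analogue of the continuous formula (Python, illustrative only; both PMFs are assumed to share the same ordered support):

```python
def pdf_of_max(pA, pB):
    """Discrete PDF of C = max(A, B) for independent A and B on a
    common ordered support, via the CDF product F_C = F_A * F_B
    (discrete analogue of eq. (5.36))."""
    FA = FB = 0.0
    out = []
    for a, b in zip(pA, pB):
        nFA, nFB = FA + a, FB + b
        out.append(nFA * nFB - FA * FB)  # P(C = current support point)
        FA, FB = nFA, nFB
    return out
```

For two fair coins on support {0, 1}, the maximum has PDF (0.25, 0.75), as expected.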
However, the i.i.d. and Gaussian requirements on the inputs restrict our ability to track message densities to parity degree distribution functions of the following form:

\rho_{MP}(x) = \sum_{i \in \{2^0+1, 2^1+1, 2^2+1, \cdots\}} \rho_i x^{\boxplus (i-1)}    (5.37)

AWGN noise thresholds obtained for a (3,6) and an irregular rate-1/2 LDPC code using the above approximation are listed in Table 5.1. The noise thresholds obtained using the exact and approximate analyses are found to be comparable: the approximate-analysis noise threshold for margin propagation is within 0.16 dB of the actual threshold. The noise threshold for sum-product decoding is included to highlight the minimal performance loss for margin propagation; the thresholds obtained using the exact analysis show that the loss in performance for margin-propagation decoding is less than 0.04 dB. Using the setup described in section 5.2, we can track the evolution of the message densities for margin propagation decoding using simulations for a (3,6) LDPC code of length 1000. Figure 5.7 displays the message density evolution for margin propagation over 5 iterations. The left column corresponds to the messages from the variable side to the check side and the right column corresponds to the messages from the check side to the variable side. The message PDFs behave similarly to those outlined in Figure 5.2 (sum-product algorithm): as the iterations progress, the message PDF shifts to the right, reducing the probability of error.

5.3 Performance simulation results of MP-based LDPC decoder

This section presents Monte-Carlo simulations for evaluating the performance of LDPC decoding based on margin propagation. The results are presented in the following order: first, we analyze the effect of the hyper-parameter γ on the L0 norm of the messages; next, we compare the BER performance of an optimized margin propagation decoder with equivalent sum-product and min-sum decoders.
Finally (keeping the BER performance the same), we compare the sparsity of LDPC decoding for the MP, sum-product and min-sum algorithms by observing the L0 norm of their messages. For all the simulations described in this section, binary phase shift keying (BPSK) modulation and an AWGN channel have been assumed. The experimental parameter being varied is the signal-to-noise ratio (SNR) of the AWGN channel, which can be expressed as

SNR(dB) = -20 \log_{10} \sigma    (5.38)

where \sigma denotes the standard deviation of the noise.

5.3.1 Effect of γ on the L0 norm

Figure 5.8: (a) Effect of the hyper-parameter γ on the L0 norm of the check-to-variable messages (for different AWGN channel conditions); (b) effect of the hyper-parameter γ on the BER performance (for different AWGN channel conditions).

Figure 5.8a shows the L0 norm of the messages propagated from the check side to the variable side for different values of γ. For this experiment, a (2000,1000) regular LDPC code with a (3, 6) parity-check matrix H was chosen. The maximum number of decoder iterations was set to 20, and the average L0 norm was computed over all the decoding iterations. The simulation results presented here consider only the average L0 norm for messages propagating from the check side to the variable side; as we will show in a later experiment, the L0 norm for messages propagated from the variable side to the check side does not exhibit sparse behavior and hence does not provide any efficiency improvement over the sum-product or min-sum algorithms. As the parameter γ increases, Figure 5.8a shows that the L0 norm of the messages decreases. This can be attributed to the size of the "dead zone" in the margin transfer function (see Fig. 5.4), which increases with an increase in γ.
Thus γ is a consistent knob for modulating the sparsity (L0 norm) of the LDPC messages. Also, for a fixed value of γ, the L0 norm increases with an increase in channel SNR. This is because a lower SNR concentrates the likelihood scores around the "dead-zone" (due to noise), leading to a sparser message distribution (lower L0 norm). An important consequence of this experiment is that the sparsity of the MP decoding algorithm can be controlled using γ and can also be adaptively varied depending on the channel conditions. To understand the trade-off between γ, SNR and the BER performance of the MP algorithm, we conducted a three-parameter experiment for a length-1000 (3,6) LDPC code, whose results are summarized in Fig. 5.8b. It can be clearly seen that for a fixed channel SNR there exists an optimal value of γ (and hence of sparsity) that yields the best BER performance. This indicates that a correct choice of γ provides an optimal balance between sparsity and BER performance. Figure 5.8b could also be used as a calibration curve for designing LDPC decoders with varying specifications of sparsity (energy efficiency). Figure 5.8b shows that, for SNRs ranging from 0.5 dB to 2 dB, the BER surface lowers monotonically as γ increases from 0 to around 1.75; as γ increases further from 1.75 to 3, the trend is reversed. It can also be seen that at an SNR of 2 dB, the "waterfall" in the BER curve occurs for γ = 1.75 ahead of the other values. Hence, for SNRs ranging from 0.5 dB to 2 dB, γ = 1.75 is deemed the optimal value for achieving the lowest BER.

5.3.2 Comparison of BER performance

In the next set of experiments we compare the BER performance of the optimal MP decoding algorithm to the equivalent sum-product and min-sum algorithms. Three different lengths of the (3,6) LDPC code were chosen: 256, 10^4 and 10^6. For each of the runs, the optimal parameter γ was found to be 1.75.
Figure 5.9 summarizes the BER performance for the different algorithms. It can be seen from the results that for a 10^6-length code, min-sum decoding incurs a 0.3 dB penalty in performance compared to its sum-product equivalent, whereas MP decoding incurs less than a 0.03 dB penalty. The result shows that the performance of MP and sum-product decoding is nearly identical and is consistent with the observations made in Table 5.1. This has also been verified to be true for a rate-1/2 irregular code. Figure 5.10 compares the BER performance for a (\lambda(x), \rho(x)) \equiv (0.4x + 0.6x^2, x^4) irregular LDPC code, which we refer to as a (2.5,5) code. For this experiment, the optimal value of γ was again found to be 1.75. It can also be seen from the figure that, with higher degrees on both sides, the regular code performs better than the irregular code, with a penalty of no less than 0.7 dB for all three algorithms. This is embodied by the BER "waterfall", which occurs at a lower SNR and is more pronounced. The results presented here differ slightly from the BER metrics reported in the literature [87]; we attribute this difference to the number of decoding iterations used in the reported experiments.

5.3.3 Comparison of message sparsity

The previous results confirmed that the BER performance of an optimized MP decoding algorithm is nearly identical to that of the sum-product decoding algorithm.
However, the MP decoding algorithm produces a sparser distribution of messages than the sum-product algorithm, as confirmed by the following experiment, which uses a (2000, 1000) regular LDPC code with a (3, 6) parity-check matrix.

Figure 5.9: Comparison of the BER of LDPC decoders based on the different algorithms for a (3,6)-regular LDPC code of length 256, 10^4 and 10^6; the maximum iteration number is set to 20.

Figure 5.10: Comparison of the BER of LDPC decoders based on the different algorithms for a length-10^6 (3,6)-regular LDPC code and a (2.5,5) irregular LDPC code; the maximum iteration number is set to 20.

Figure 5.11 shows the comparison of the average L0 norm for messages propagated between the check (variable) and the variable (check) nodes. It can be seen that the L0 norms (sparsity) of the messages propagated from the variable side to the check side are similar across algorithms. For a high SNR, the message convergence is considerably faster and requires only two iterations. However, the L0 norm of the messages propagated from the check side to the variable side exhibits a significant difference between the algorithms: the MP algorithm achieves the best sparsity, whereas the min-sum algorithm achieves the worst. This result shows that the MP algorithm provides an elegant approach for achieving sparse decoding (and hence superior energy efficiency) without sacrificing BER performance.

5.4 Basic circuits of MP-based LDPC decoder

All the basic operations of the check and variable nodes for this LDPC decoder have been described in section 5.2.
In this section, we present the basic modules of the MP-based LDPC decoder. For the check node, we take c1 in Fig. 5.1 as an example. Note that differential circuits are used for both the check-node and variable-node implementations.

5.4.1 Implementation of the check node

As shown in equation (3.1), the MP approximation requires addition, subtraction and rectification. Each of these operations can be implemented in a CMOS process using basic conservation laws (e.g. Kirchhoff's current summation), which scale across transistor operating regimes (weak, moderate and strong inversion). Therefore, the same decoder architecture can be operated at different decoding speeds by scaling the bias current, and hence by scaling its power dissipation.

Figure 5.11: L0-norm comparison of sum-product, min-sum and margin propagation for iterations 1, 2 and 3; check-to-variable messages: (a), (b), (c); variable-to-check messages: (d), (e), (f).

In this section, we describe the current-mode circuits that have been used for implementing the check-node module of the proposed MP-based LDPC decoder.
A differential topology has been adopted to represent the positive and negative log-likelihood ratios and to cancel out any common-mode interference. We use the check node c1 in Fig. 5.1 as an example to explain the implementation details. c1 is connected to the variable nodes v1, v9, v17 and v24. To avoid long mathematical notation, we denote the message from c1 to v24, L(c1 → v24), as z^sp, and the messages L(v1 → c1), L(v9 → c1) and L(v17 → c1) as L1, L2 and L3 respectively. The check function, according to equation (5.1), can then be written as:

z^sp = log[ (e^(L1) + e^(L2) + e^(L3) + e^(L1+L2+L3)) / (1 + e^(L1+L2) + e^(L1+L3) + e^(L2+L3)) ].   (5.39)

If the log-likelihood ratios Li are represented in differential form as Li = Li^+ - Li^-, with Li^+, Li^- > 0 and i = 1, ..., 3, then equation (5.39) can be rewritten as:

z_diff^sp = log[ (e^(L1^+ + L2^- + L3^-) + e^(L1^- + L2^+ + L3^-) + e^(L1^- + L2^- + L3^+) + e^(L1^+ + L2^+ + L3^+)) / (e^(L1^+ + L2^+ + L3^-) + e^(L1^+ + L2^- + L3^+) + e^(L1^- + L2^+ + L3^+) + e^(L1^- + L2^- + L3^-)) ].   (5.40)

Once the check function has been expressed in terms of log-sum-exp functions, we can apply the MP-based approximation according to the procedure described in section 5.2. The MP approximate check function (3.2) is written as

z_diff^mp = z^+ - z^-,   (5.41)

where z^+ and z^- are given by

z^+ = M(L1^+ + L2^- + L3^-, L1^- + L2^+ + L3^-, L1^- + L2^- + L3^+, L1^+ + L2^+ + L3^+, γ),
z^- = M(L1^+ + L2^+ + L3^-, L1^+ + L2^- + L3^+, L1^- + L2^+ + L3^+, L1^- + L2^- + L3^-, γ).   (5.42)

These equations also apply to the computation of the messages L(c1 → v1), L(c1 → v9) and L(c1 → v17). Fig. 5.12 shows the architecture of the check node c1 implementing equations (5.42). The differential messages generated by the variable nodes are selected, and the permutations and summations are carried out according to equation (5.42). The summation results are then processed by individual MP units as described in (5.42).
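The M(·, γ) operator in (5.42) can be prototyped in software as a reverse water-filling constraint: find the threshold z such that the rectified scores above z sum to γ. A minimal sketch with hypothetical differential message values (γ = 1.75, the value found optimal in section 5.3; units arbitrary):

```python
def margin_prop(scores, gamma, iters=60):
    # reverse water-filling: solve sum_i max(s_i - z, 0) = gamma for z by
    # bisection; the left-hand side is continuous and non-increasing in z
    lo, hi = min(scores) - gamma, max(scores)
    for _ in range(iters):
        z = 0.5 * (lo + hi)
        if sum(max(s - z, 0.0) for s in scores) > gamma:
            lo = z
        else:
            hi = z
    return 0.5 * (lo + hi)

# hypothetical differential messages L_i^+ and L_i^- entering check node c1
Lp, Lm, gamma = [2.0, 1.5, 1.8], [0.0, 0.0, 0.0], 1.75
z_plus = margin_prop([Lp[0] + Lm[1] + Lm[2], Lm[0] + Lp[1] + Lm[2],
                      Lm[0] + Lm[1] + Lp[2], Lp[0] + Lp[1] + Lp[2]], gamma)
z_minus = margin_prop([Lp[0] + Lp[1] + Lm[2], Lp[0] + Lm[1] + Lp[2],
                       Lm[0] + Lp[1] + Lp[2], Lm[0] + Lm[1] + Lm[2]], gamma)
z_mp = z_plus - z_minus   # differential check-to-variable message, eq. (5.41)
```

For these inputs the differential MP output has the same sign as, and a magnitude close to, the exact log-domain check message of (5.40); the circuit in the next section realizes the same constraint with rectifying diodes and Kirchhoff summation.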
The outputs produced by the respective MP units (z^+ and z^- in (5.42)) are then propagated to the neighboring variable nodes. Fig. 5.13a shows the current-mode implementation of equations (5.42). The currents I_Li^+(-) in Fig. 5.13a represent the differential variables Li^+(-) in equation (5.42), i = 1, 2, 3. The hyper-parameter γ in equation (5.42) is represented by the current Iγ. The circuit has two operational modes: (a) when the reset signal RST is set to logic low, transistor N1 pulls the gate voltages of transistors N2-N6 low and the output current I_z^+ is set to zero (implying that no message is conveyed); (b) when the reset signal RST is set to logic high, the gate voltage of N1 is set to Vγ, which determines the current Iγ. At equilibrium, the output current I_z^+ is determined by the MP condition (3.1). The PMOS diodes P1 through P4 ensure a unidirectional flow of current, thus implementing the rectification operation in equation (3.1). The subtraction operations in equation (3.1) are implemented by transistors N2-N5, and the summation constraint is implemented using Kirchhoff's current summation at the node K. However, the diodes introduce a threshold voltage drop,

Figure 5.12: Architecture of the check node c1 in Fig.
5.1.

Figure 5.13: Circuits for implementing the check-node module using the MP algorithm: (a) design used in this paper; and (b) an alternate low-voltage circuit.

requiring a larger supply voltage (≥ 2Vth + Vdsat). An alternative low-voltage implementation is shown in Fig. 5.13b, which consists of a cascode of PMOS transistors P1 through P4. This implementation eliminates the threshold voltage drop but potentially suffers from current leakage between neighboring stages, as shown in Fig. 5.13b. This situation can occur when one of the branch currents is much larger than the other. In the proposed implementation, we have chosen the circuit shown in Fig. 5.13a. Since the operation of the circuit is based on Kirchhoff's current law, it is functional irrespective of the biasing condition of the transistors. Consequently, the magnitude of the current can vary over a wide range, from pA to µA, which only affects the operational speed of the circuit. However, the matching between the transistors N2-N5 is important because it ensures that the constraint in equation (3.1) is accurately satisfied.

5.4.2 Implementation of the variable node

We will use the variable node v1 to illustrate the basic operation described by equation (5.2). The architecture of the variable node is shown in Fig. 5.14, and the circuit implementation for computing the message propagated from variable node v1 to check node c20 is shown in Fig. 5.15. The current I_v1^+ represents the positive part of the differential message L(v1) (L(v1) = L^+(v1) - L^-(v1)). Similarly, I_c1→v1^+ and I_c8→v1^+ represent L^+(c1 → v1) and L^+(c8 → v1).
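The net behavior of this variable-node output stage, rectifying the difference of the summed positive and negative message currents as formalized in equation (5.44), can be sketched as follows (the current values are hypothetical, units arbitrary):

```python
def variable_to_check(i_plus, i_minus):
    # differential rectification: only the sign-carrying excess current
    # propagates on one rail; the other rail carries zero
    diff = i_plus - i_minus
    if diff > 0.0:
        return diff, 0.0      # (L^+, L^-) of the outgoing message
    return 0.0, -diff

# i_plus and i_minus stand for the summed currents I^+ and I^- of eq. (5.43)
assert variable_to_check(3.0, 1.5) == (1.5, 0.0)
assert variable_to_check(1.0, 2.5) == (0.0, 1.5)
```

The common-mode component of the two input currents cancels, which is why only the differential excess is propagated to the check node.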
These currents are summed at node B and mirrored through transistors P0 and P1. Similarly, the current through P2 equals the sum of all the negative portions of the differential messages. For brevity, we denote the two summed currents as I^+ and I^-, as shown in Fig. 5.15, where

I^+ ⇒ L^+(vi) + Σ_{ck ∈ Vi\j} L^+(ck → vi),
I^- ⇒ L^-(vi) + Σ_{ck ∈ Vi\j} L^-(ck → vi).   (5.43)

The two currents I^+ and I^- are compared at node A, and when I^+ is greater than I^-, the current through P5 equals the positive part of the differential message propagated from the variable node to the check node. A similar circuit computes the negative part of the differential message. Thus, we realize the differential form of (5.2) as

L^+(vi → cj) ⇐ I^+ - I^-, if I^+ - I^- > 0; 0, otherwise.
L^-(vi → cj) ⇐ 0, if I^+ - I^- > 0; I^- - I^+, otherwise.   (5.44)

Note that the transistors P3 and P4 in Fig. 5.15 act as current limiters, whereby the common-mode current, and hence the total power dissipation of the decoder, can be limited using the voltage VB.

5.5 Measurement Results and Discussions

Fig. 5.16 shows a system-level schematic of the (32,8) LDPC decoder corresponding to the Tanner graph shown in Fig. 5.1. Besides the variable-node and check-node modules, the decoder also integrates the following modules to facilitate output buffering and testing through an external digital interface (an FPGA in this paper):

Figure 5.14: Architecture of the variable node v1 in Fig.
5.1.

Figure 5.15: Circuit implementing the variable node.

(a) A 6-bit current-mode digital-to-analog converter (DAC), used to digitally program the initial messages (currents) in equation (5.43). The MSB of the DAC is denoted by d1 and the LSB by d6. The architecture of the current DAC (shown in Fig. 5.16) is based on the popular resistive divider [95], where the current through each branch is twice as large as that of its neighbor. The output, determined by the bits d1-d6, is a binary-weighted sum of currents that is mirrored into each of the variable-node modules.

(b) Output comparators, which compute the decoded bits by comparing the differential variable-node currents once the decoder has reached equilibrium. For example, if I_V^+ is greater than I_V^-, the output of the comparator Q is logic "high", indicating that the bit "0" has been decoded. Note that the comparator needs to be reset before each comparison is made.

(c) A digital sample-and-hold input buffer: a shift-register chain that converts the serial input Din to the decoder IC into parallel bit slices that are processed by the DAC module. As shown in Fig. 5.16, the shift-register chain is driven by two sets of non-overlapping control signals, Φ1 and Φ2 (each with its complement). In one period, when (Φ1 = 1 and Φ2 = 0) the previous-stage data di-1 is sampled, and when (Φ1 = 0 and Φ2 = 1) the sampled data is held while the DAC module processes the latched bits.

(d) A digital sample-and-hold output buffer: also a shift-register chain, which samples the parallel decoded bits (from the comparator modules) and converts them into a bit-serial format Qout.

The microphotograph of a prototype MP-based LDPC decoder fabricated in a 0.5 µm standard CMOS process is shown in Fig. 5.17. Table 5.2 summarizes the main specifications of the chip.
Table 5.2: Specifications of the MP-based analog LDPC decoder

Fabrication Process:        Standard CMOS 0.5 µm
Die Size:                   3000 µm × 3000 µm
Number of Transistors:      11424 (6336 PMOS, 5088 NMOS)
Number of Variable Nodes:   32
Number of Check Nodes:      24

The measurement setup used to evaluate the fabricated chip is shown in Fig. 5.18. The decoder chip is hosted on a daughter board, which is mounted on a test-station mother board. A second mountable daughter board hosts a field-programmable gate array (FPGA), which is responsible for clock generation and data synchronization. A sample timing diagram of the digital signals used by the LDPC chip (shown in Fig. 5.16) and generated by the FPGA is shown in Fig. 5.19. The FPGA selectively programs the on-chip DACs to emulate data received over an additive white Gaussian noise (AWGN) channel, after which the FPGA enables the analog decoding core. The output of the decoder is latched on the comparators after a pre-determined decoding time, and the latched bit-stream is serially retrieved by the FPGA. As shown in Fig. 5.18, the experimental setup is also controlled through a Visual Studio interface on a PC. A C script emulates the AWGN channel and generates 6-bit signal values at different SNR levels.

Figure 5.16: System-level architecture of the (32,8) LDPC decoder.

Figure 5.17: Die microphotograph of the chip.

These bit patterns are then stored in the SDRAM of the FPGA, which uses the values to perform real-time decoding. After each noisy transmission, decoded bits are retrieved from the LDPC chip and stored in the FPGA's SDRAM. At the end of each Monte Carlo run, the Visual Studio interface transfers the data logged in the SDRAM and computes the overall BER.
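The 6-bit binary-weighted current DAC that the FPGA programs with the emulated channel values can be modeled behaviorally as follows; the 1 µA MSB branch current is a hypothetical value, not the fabricated bias:

```python
def dac_current(bits, i_msb):
    # 6-bit binary-weighted current DAC: d1 (bits[0]) is the MSB, and each
    # subsequent branch carries half the current of its neighbor
    assert len(bits) == 6
    return sum(d * i_msb / (2 ** k) for k, d in enumerate(bits))

full_scale = dac_current([1, 1, 1, 1, 1, 1], 1.0e-6)   # 63/32 of the MSB current
lsb_step = dac_current([0, 0, 0, 0, 0, 1], 1.0e-6)     # 1/32 of the MSB current
```

The full-scale output is 63 LSB steps, consistent with the binary-weighted branch structure of the resistive-divider DAC.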
The MATLAB interface is also used to adapt the bias voltages of the decoder through a National Instruments data-acquisition card. Fig. 5.20 compares the BER performance of a software implementation of the MP-based decoder with the measured results obtained from the fabricated prototype. For this experiment the decoding throughput was set to 320 Kbps, and the hyper-parameter γ was chosen optimally on the basis of iterative experiments described later. At low SNR, the hardware implementation outperforms its software counterpart, which could be attributed to the continuous-time dynamics of the analog

Figure 5.18: Experimental setup of the chip. The FPGA board (a) sends noisy bits as DAC values to the chip (stored in the SDRAM on the FPGA board); (b) receives the decoded bits to calculate bit error rates; and (c) programs the analog signals through the MATLAB test station.

Figure 5.19: Timing diagram of the digital signals used to control the operation of the LDPC decoder chip.

Figure 5.20: Comparison of the BER of the MP-based LDPC decoder obtained from software simulation and from measurements of the fabricated prototypes.

Figure 5.21: Comparison of the BERs of the measured MP-based LDPC decoder and a benchmark min-sum LDPC decoder.

Figure 5.22: Comparison of measured BER performances for different values of the hyper-parameter γ.
Figure 5.23: Measured BER performances of the MP-based LDPC decoder (Vγ = 0.74 V) for different decoding times (throughput).

decoder. However, at high SNR the software implementation outperforms its hardware counterpart, which could be attributed to the limited dynamic range of the programming DACs and to the offset and gain errors introduced by the current mirrors. Fig. 5.21 shows measured BER curves obtained under two conditions: (a) when the system is configured to operate as a min-sum decoder (γ ≈ 0) [45] and the decoding throughput is set to 320 Kbps; and (b) when the MP-based decoder is configured with an optimal setting of the hyper-parameter γ > 0. For comparison, we have included the BER results reported for a state-of-the-art analog min-sum decoder [77]. The results in Fig. 5.21 show that for γ = 0, the performance of the MP-based min-sum LDPC decoder is inferior to the benchmark min-sum decoder [77] at low SNR; however, at high SNR (> 3.5 dB) the MP min-sum decoder outperforms the benchmark. When γ > 0, the results show that the MP-based LDPC decoder outperforms the benchmark min-sum decoder by more than 3 dB. Fig. 5.22 shows the measured BER curves for different values of the hyper-parameter Vγ. Again, for this experiment the decoding throughput is set to 320 Kbps. As Vγ (and hence γ) increases, the BER of the MP-based LDPC decoder first improves and then degrades, which is consistent with the BER-SNR-γ trade-off previously demonstrated only in simulation (shown in Fig. 5.8b). Fig. 5.23 shows the measured BER performance (at different SNRs) as the decoding time (the inverse of the decoding speed) is varied. For this experiment, Vγ is held constant at 0.74 V (corresponding to one of the four curves shown in Fig. 5.22).
It is seen that as the decoding time increases, the BER performance improves, which is consistent with the response of previously reported analog decoders [77]. When the decoding time is reduced below 2.5 µs (a data throughput of 12.8 Mb/s), the BER performance starts to degrade rapidly. When the decoding time is as low as 833 ns (a data throughput of 38.4 Mb/s), the BER (measured at an SNR of 7 dB) can be 13 times higher than the BER measured at a data throughput of 12.8 Mb/s. Table 5.3 compares the measured specifications of the fabricated MP-based LDPC decoder with those of different digital and analog decoders reported in the literature. The comparison shows that the MP-based LDPC decoder has the second-highest throughput among the reported analog decoders. The turbo decoder with the highest throughput [96], however, has a much lower energy efficiency than the MP-based design, and it also has a longer codeword length. The benchmark min-sum LDPC decoder [77] has the same code length as the proposed MP-based LDPC decoder; however, its reported energy efficiency is based on the power consumption of the whole chip, whereas ours is based on the core, which makes a direct comparison difficult. Nevertheless, the throughput of the MP-based LDPC decoder is more than twice that of the benchmark min-sum LDPC decoder. In terms of energy efficiency, the MP-based LDPC decoder ranks second; the (8,4) Hamming Trellis-graph decoder, which has the highest energy efficiency, has a much lower throughput than this implementation. In terms of silicon area, the proposed implementation achieves an integration density comparable to that of [77], considering the differences in the respective technology feature sizes. The comparison table also shows that state-of-the-art digital decoders can achieve energy efficiencies as high as those of the analog decoders.
This is because these implementations exploit highly parallel architectures [72, 73], early-termination techniques [97], post-processing methods [98] and aggressive feature-size and voltage scaling. However, it should be noted that, unlike analog decoders, digital decoders require an analog-to-digital converter to digitize the analog channel information. For instance, the digital decoder reported in [98] requires 4- to 6-bit digital inputs, and the energy-efficiency metric (pJ/bit) reported in Table 5.3 does not incorporate the power dissipation of the ADC.

Table 5.3: CMOS digital (top 4) and analog (bottom 6) decoders. The digital entries are Blanksby et al. [73], Bickerstaff et al. [72], Darabiha et al. [97] and Zhang et al. [98]; the analog entries are Gaudet et al. [96], Winstead et al. [99], Amat et al. [100], Winstead et al. [78], Hemati et al. [77] and Gu et al. (this work, (32,8) LDPC, 0.5 µm CMOS, 3.3 V). The decoders are compared in terms of code, CMOS technology, core area (mm²), power (mW), throughput (Mb/s) and energy efficiency (nJ/b).

Chapter 6

An analog SVM implemented on MP-based PWL circuits

6.1 History of SVM

SVM theory provides a principled approach for designing classifiers (binary and multi-class), with a strong foundation in statistical learning theory.
Due to their excellent generalization ability and modest requirements on the size of the training data, SVMs have been successfully applied to numerous real-time recognition tasks [101]. Even though several digital implementations of SVMs have been reported in the literature [102], they have been mapped onto analog VLSI using translinear CMOS circuits [103, 104] only recently. Here, we re-formulate SVMs to operate in a log-likelihood domain and apply MP-based transforms to map the architecture onto CMOS circuits. The result is a general-purpose micro-power SVM hardware that can be used in applications ranging from communications to biometrics.

6.2 An MP-based SVM

In its general setting, a binary SVM computes a decision score F ∈ R for every presented feature vector z ∈ R^N, given by:

F = Σ_{i=1}^{N} αi K(xi, z) + b.   (6.1)

K : R^N × R^N → R denotes a kernel function computed between the input vector z and an array of templates (support vectors) xi ∈ R^N. The parameters αi ∈ R and b ∈ R are obtained through a supervised training procedure [105]. Even though several choices of kernel function exist, a second-order polynomial kernel, K(xi, z) = [(xi · z) + β]^2, has been chosen. All the parameters of the SVM are represented in differential form as αi = αi^+ - αi^- and b = b^+ - b^-, which leads to a differential representation of equation (6.1) as:

F = F^+ - F^-,   (6.2)

where F^+ = Σ_i αi^+ K(xi, z) + b^+ and F^- = Σ_i αi^- K(xi, z) + b^-.

The inner product of xi and z can also be computed in differential form:

xi^T · z = Σ_{j=1}^{N} (xij^+ - xij^-)(zj^+ - zj^-)
        = Σ_{j=1}^{N} (xij^+ zj^+ + xij^- zj^-) - Σ_{j=1}^{N} (xij^+ zj^- + xij^- zj^+).   (6.3)

Here xij and zj are the elements of the vectors xi and z. Equation (6.3) can be approximated with MP units as:

z^+ ≈ M(log xi1^+ + log z1^+, ..., log xiN^+ + log zN^+, log xi1^- + log z1^-, ...
, log xiN^- + log zN^-, γ),
z^- ≈ M(log xi1^+ + log z1^-, ..., log xiN^+ + log zN^-, log xi1^- + log z1^+, ..., log xiN^- + log zN^+, γ).   (6.4)

For a classification task, only the sign of F is important, and the monotonicity of log(·) implies sign(F) = sign(log(F^+) - log(F^-)). Therefore, the decision generated by the SVM can be expressed as

sign(F) = sign(z^+ - z^-).   (6.5)

The system architecture and circuit diagram of the MP-based SVM are shown in Fig. 6.1. The MP-based SVM architecture can be partitioned into three major sub-systems, as shown in Fig. 6.1. It consists of a log-MAP matrix-vector multiplier that computes compressive inner-product kernels based on margin propagation. The outputs of the matrix-vector multiplier are scaled in the log domain, which is equivalent to a cubic transformation of an inner-product kernel. The scaled factors are then presented to an output normalization stage that computes the decision function according to equation (6.5). A current-mode circuit for implementing the log inner-product computation log(x^T y) based on MP circuits is also shown in Fig. 6.1. At the core of the network is the basic MP cell shown in Fig. 4.2. Currents representing the logarithms of the vector elements are summed according to Kirchhoff's current law. A constant current sink of magnitude γ constrains the total current at the gate of the output transistor; the reverse water-filling criterion is thus implemented.

Figure 6.1: The circuit diagram of the MP-based SVM.

For the SVM sub-system in Fig. 6.1, the circuit is replicated for each template (support vector). The output current from the first stage is mirrored and scaled to an outer normalization stage that implements another MP normalization procedure.
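The log-domain decision path of equations (6.3)-(6.5) can be sketched in software as follows. The margin_prop solver, the element floor eps, and the test vectors are illustrative assumptions; the hardware realizes the same constraint with the reverse water-filling cell of Fig. 4.2:

```python
import math

def margin_prop(scores, gamma, iters=60):
    # reverse water-filling: solve sum_i max(s_i - z, 0) = gamma for z
    lo, hi = min(scores) - gamma, max(scores)
    for _ in range(iters):
        z = 0.5 * (lo + hi)
        if sum(max(s - z, 0.0) for s in scores) > gamma:
            lo = z
        else:
            hi = z
    return 0.5 * (lo + hi)

def split(v, eps=1e-12):
    # differential representation v = v_plus - v_minus (floor keeps logs finite)
    return max(v, eps), max(-v, eps)

def mp_inner_product_sign(x, z, gamma=1.2):
    # sign(x . z) from the MP approximations z_plus, z_minus of eq. (6.4)
    same, mixed = [], []
    for xv, zv in zip(x, z):
        xp, xm = split(xv)
        zp, zm = split(zv)
        same += [math.log(xp) + math.log(zp), math.log(xm) + math.log(zm)]
        mixed += [math.log(xp) + math.log(zm), math.log(xm) + math.log(zp)]
    return 1 if margin_prop(same, gamma) > margin_prop(mixed, gamma) else -1
```

For example, mp_inner_product_sign([0.8, -0.3], [0.5, 0.4]) recovers the sign of 0.8·0.5 - 0.3·0.4 = +0.28, consistent with equation (6.5).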
Consequently, the outer normalization stage is also implemented using the reverse water-filling circuit of Fig. 4.2. The SVM parameters αi and xi are stored as currents on floating-gate transistors, illustrated in the highlighted box of Fig. 4.2. Each of the storage cells is individually addressable via a serial shift-register chain. Programming is achieved by using hot-electron injection in conjunction with global tunneling [106]. There are 14 reverse water-filling cells per support vector in the log-MAP matrix-vector multiplier. The number of cells in the output normalization sub-system is proportional to the total number of support vectors. Since the positive and negative class labels must be split to maintain positive current flow during analog computation, a fully-differential implementation of the output stage is required to compute the decision score.

6.3 Simulation result

Fig. 6.2 shows the PWL approximation effect of the MP-based algorithm on the inner product expressed by equation (4.34), with N = 2. Fig. 6.2a shows the result computed by the "log-sum-exp" function; Fig. 6.2b shows the result computed by the MP-based function simulated in Matlab; and Fig. 6.2c shows the result computed by the MP-based function simulated in Cadence.

Figure 6.2: Multiplication computed by (a) the "log-sum-exp" function; (b) the MP-based function with γ = 1.2 (in Matlab); (c) the MP-based function with γ = 1.2 (in Cadence).

6.4 Circuit Implementation

The prototype, fabricated in a 0.5 µm CMOS process, integrates an array of floating-gate transistors that serves as storage for up to 740 SVM parameters, as well as novel circuits designed for interfacing with an external digital processor.
These include a novel current-input current-output logarithmic amplifier circuit that achieves a dynamic range of 120 dB while consuming nanowatts of power, and a novel varactor-based, temperature-compensated floating-gate memory that demonstrates a programming range superior to competing designs.

6.4.1 Floating-gate memory cells

The prototype uses a floating-gate memory array to store the SVM parameters. The system architecture and the basic circuit of the floating-gate memory cells are shown in Fig. 6.3. The prototype embeds the floating-gate memory array, temperature-compensation circuitry and logarithmic-amplifier (translinear) circuitry. It also provides a connection pin to a Keithley instrument for current calibration and measurement, as well as a pin for floating-gate programming. The memory array is composed of 740 floating-gate memory cells, distributed into 18 rows and 114 columns. Each cell is addressable through its address switches, Row_select and Col_select1-3; these switches are set by the programming shift registers. As can be seen from Fig. 6.3, the system can also be switched between two states by setting the signal RUN. Once RUN is set to logic high, the parameters stored in the memory cells (as currents) flow into the MP-based computational circuit and the computation is carried out. Once RUN is set to logic low, the stored current is instead routed to the Keithley so that calibration and measurement can be executed. The current through P2 represents the support vectors. As seen in Fig. 6.3, P2 and P1 form a pair of floating-gate transistors; hence the current through P2 is tunable through charge injection and tunneling. During injection, the memory cell to be programmed is selected by the address switches and a negative voltage is applied to the Program pin so that charge is injected onto the gates of P1 and P2. During tunneling, a high voltage is applied to Vtunnel.
Vc is a voltage applied to the floating-gate control capacitor Cc, which can be used to tune the gate voltage of P1 and P2 and hence control the current flowing through them. Vx is the voltage applied to the varactor Cv and is used to tune the value of Cv. This property is exploited for temperature compensation, protecting the memory array against temperature variations; we discuss this feature in detail in later sections. The current through P3 represents the input vectors. It is controlled by Vdd_translinear and Vg_translinear, which are supplied by the reference circuitry, i.e. the current-input current-output logarithmic amplifier circuitry.

6.4.2 Current-input current-output logarithmic amplifier circuit

Logarithmic amplifiers generate output signals that are proportional to the logarithm of the input [107]. Due to the compressive nature of the logarithm function, such an amplifier is useful for processing wide-dynamic-range signals such as sensory (auditory or visual) signals or multi-path signals in radio-frequency and radar applications. The most popular technique for implementing logarithmic amplifiers is a transimpedance-based approach that exploits the exponential dependence between the current and the voltage across a p-n junction diode, a bipolar transistor [14] or a MOSFET in the sub-threshold region [108, 109]. Based on this approach, amplifiers with an input dynamic range greater than 100 dB have been reported.
However, the transimpedance-based approach has three disadvantages: (a) the output of a transimpedance logarithmic amplifier is a voltage signal, which implies that the output dynamic range is limited by the supply voltage; (b) owing to its dependence on the current-to-voltage relationship, a transimpedance logarithmic amplifier is sensitive to temperature variations and therefore requires additional compensation circuitry; and (c) for current-mode analog VLSI computation, the use of a transimpedance logarithmic amplifier requires a linear transconductor stage to convert the output voltage into a current. In our study, we propose the design of a current-input, current-output logarithmic amplifier which, by construction, is insensitive to variations in temperature.

Figure 6.3: The system architecture and basic circuit of the floating-gate memory cells.

At the core of the proposed design is a translinear Ohm's law principle that exploits floating-voltage sources and linear/non-linear resistive elements embedded within a translinear loop. By exploiting multiple translinear networks, temperature compensation is achieved through a resistive cancellation technique. We verify the functionality of the circuit using measured results from fabricated prototypes and justify the results using mathematical models and analysis.
We also extend the design of the logarithmic amplifier to be fully programmable and digitally addressable: all the current references are made programmable using analog floating gates, and the input to the amplifier is made digitally programmable using a standard ladder-based digital-to-analog converter (DAC).

6.4.2.1 Translinear Ohm's Law based Logarithm Computation

The concept of the translinear Ohm's law is an extension of the celebrated translinear principle [14], which exploits the exponential relationship between voltage and current in certain devices (diodes, BJTs and sub-threshold MOSFETs). The translinear principle has been successfully used for implementing linear analog computation, such as matrix-vector multiplication [17], and non-linear analog computation, such as quadratic kernels [61]. The translinear Ohm's law is first explained in this section using the example circuit shown in Fig. 6.4, which consists of several diodes D1-4 acting as translinear elements, connected to a floating-voltage source ∆V and a memoryless, non-linear circuit element N. The element N is characterized by a function f(·) that models the relationship between the current flowing through N (IN) and the voltage across N (VN) according to

IN = f(VN).   (6.6)

If the voltage drop across each diode in Fig. 6.4 is denoted by V1-4, then

V1 + V2 + ∆V = V3 + V4 + VN,   (6.7)

which, after using the translinear diode equation I1-4 = Is exp(V1-4/UT), leads to

I1 · I2 · exp(∆V/UT) = I3 · I4 · exp(VN/UT),   (6.8)

and hence

VN = ∆V + UT ln[(I1 · I2)/(I3 · I4)],   (6.9)

IN = f(∆V + UT ln[(I1 · I2)/(I3 · I4)]).   (6.10)

UT in equations (6.8)-(6.10) denotes the thermal voltage, which is a linear function of absolute temperature and is approximately equal to 26 mV at room temperature (25 °C). If the currents I1-4 satisfy the relation I1 · I2 = I3 · I4, then

IN = f(∆V).
(6.11)

If N is a resistor with resistance R, then f(∆V) = ∆V/R, which when inserted in equation (6.11) leads to an equivalent Ohm's law (note that the current IN is not directly drawn from VN). If N is a general type of resistor (linear or non-linear) with f(0) = 0, then f can be approximated using its first-order and second-order Taylor series terms as f(∆V) ≈ f′(0)∆V + (f″(0)/2)(∆V)². Inserting the approximation in equation (6.11) leads to

IN = f′(0)∆V + (f″(0)/2)(∆V)². (6.12)

Note that f′(0) has the units of a transconductance and equation (6.12) is only satisfied when the translinear condition I1 · I2 = I3 · I4 is satisfied, hence the name "translinear Ohm's law". For the sake of simplicity and to minimize mathematical clutter, we will assume f″(0) ≈ 0 for the rest of the analysis and will reintroduce the higher-order terms later to understand the limitations of the proposed method.

Figure 6.4: The basic concept of translinear Ohm's law.

The translinear Ohm's law is now applied towards approximating the logarithm of a current using the circuit in Fig. 6.5. The circuit is derived from a translinear circuit reported in [61], where the transistors M1 to M4 are biased in weak-inversion and form the translinear loop.

Figure 6.5: Schematic of the translinear logarithmic current-to-current converter.

The drain-to-source voltages of all transistors are larger than 100 mV, in which case the transistors satisfy the following translinear relations [110]:

NMOS: In = Sn · ID0n · exp((VGn − VTn)/(nn · UT)) · exp(−VSn/UT), (6.13)

PMOS: Ip = Sp · ID0p · exp((−VGp + VTp)/(np · UT)) · exp(VSp/UT), (6.14)

where Sn,p, VTn,p, nn,p, UT, VGn,p, and VSn,p are the aspect ratio, the threshold voltage, the sub-threshold slope, the thermal voltage, and the gate and source voltages referred to the bulk potential (Vdd or gnd) for the nMOS and pMOS transistors, respectively. 
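As a quick numerical sanity check of the loop equations (6.7)-(6.11), the sketch below (plain Python; the current and voltage values are illustrative, not taken from the fabricated circuit) evaluates VN = ∆V + UT·ln(I1·I2/(I3·I4)) and confirms that the full floating voltage appears across N whenever I1 · I2 = I3 · I4:

```python
import math

UT = 0.026  # thermal voltage at room temperature (V)

def vn_across_N(dV, I1, I2, I3, I4):
    """Voltage across element N from the translinear loop, eq. (6.9):
    VN = dV + UT*ln(I1*I2/(I3*I4))."""
    return dV + UT * math.log((I1 * I2) / (I3 * I4))

# When the translinear condition I1*I2 = I3*I4 holds, the logarithmic
# term vanishes and the full floating voltage dV appears across N, so
# IN = f(dV) regardless of the absolute current levels.
f = lambda v: v / 1e7  # example: N is a linear 10-Mohm resistor
dV = 0.1
VN = vn_across_N(dV, I1=2e-9, I2=3e-9, I3=1e-9, I4=6e-9)  # I1*I2 == I3*I4
print(abs(VN - dV) < 1e-12, f(VN))
```

Scaling all four currents by a common factor leaves VN unchanged, which is what makes the principle bias-scalable.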
The transistor M5 serves as a feedback element which reduces the output impedance at the drain of M2. If the sizes of the transistors are equal, the current mirror formed by M3 and M4 ensures I1 · I2 = I3 · I4. Then, using the translinear Ohm's law, the output current Iout can be expressed as

Iout = f′(0) · ∆V(Iin), (6.15)

where the floating-voltage source ∆V(Iin) equals the difference in the gate voltages of transistors M1 and M2, which are mirrored from M6 and M7, and can be expressed as

∆V(Iin) = n · UT · ln(Iin/Ib2), (6.16)

which leads to

Iout = n · f′(0) · UT · ln(Iin/Ib2). (6.17)

Thus, the output current is proportional to the logarithm of the input current under the condition Iin > Ib2, which is required for the circuit to be operational. Equation (6.16) contains several temperature-dependent and non-linear parameters: (a) the thermal voltage UT; (b) the transconductance parameter f′(0); and (c) the sub-threshold slope n. However, all these parameters affect only the gain of the amplifier and hence can potentially be cancelled.

6.4.2.2 Complete Logarithmic Amplifier Circuit

Fig. 6.6(a) shows a complete implementation of a temperature-insensitive logarithmic amplifier based on the proposed translinear Ohm's law principle. It consists of an input stage, a reference stage and a translinear stage. The input stage and the reference stage are formed using the basic circuit shown in Fig. 6.6.

Figure 6.6: Proposed logarithmic amplifier: (a) basic circuit of translinear stage; (b) basic circuit of input stage; (c) basic circuit of reference stage.

Based on equation (6.16), the output current generated by the input stage for an input Iin is given by

Ix = n · f′(0) · UT · ln(Iin/Ib2). 
(6.18)

For the reference stage, the ratio of the currents I1 and I2 is set to 2 using the current mirrors M1, M2 and M3. Thus the reference current Iref is given by

Iref = n · f′(0) · UT · ln(2) (6.19)

≈ 0.69 · n · f′(0) · UT. (6.20)

The translinear stage forms a translinear loop using the gate-to-source terminals of the pMOS transistors M1-M4, which leads to Iout · Iref = Igain · Ix. Note that Ib3 in Fig. 6.6 is an external biasing current used to establish the translinear loop and is usually 10 times larger than Ib4. M15 is added at the input of Ib4 to ensure that M11 remains in the saturation region. Thus, using equation (6.16), the output current Iout can be expressed as

Iout = Igain · Ix/Iref (6.21)

= 1.44 · Igain · ln(Iin/Ib2). (6.22)

Thus, if the currents Igain and Ib2 are independent of temperature, Iout is theoretically invariant with respect to temperature. Also, if the resistances R1 and R2 are matched, all the terms dependent on the resistance are potentially cancelled out. The current Igain determines the gain of the amplifier and the current Ib2 determines the minimum input current of the amplifier.

The circuit in Fig. 6.6 requires current sources Iref and Ib4 that can be programmed to different amplifier settings. It turns out that the core circuit shown in Fig. 6.5 can also be used for designing programmable current references, as shown in Fig. 6.7(a).

Figure 6.7: Basic circuit of (a) current reference; (b) current DAC. 
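The cancellation argued for equation (6.22) can be checked with a short numerical sketch. The parameter values below (the sub-threshold slope, f′(0) and its assumed 1/T drift) are illustrative stand-ins rather than extracted device data; the point is only that the common temperature-dependent factors of the input and reference stages divide out in the translinear stage:

```python
import math

def stage_outputs(T, Iin, Ib2, Igain, n=1.5, gm300=1e-7):
    """Eqs. (6.18)-(6.22): the input and reference stages share the same
    temperature-dependent factors n, f'(0) and UT, which cancel in the
    translinear division Iout = Igain*Ix/Iref. gm300 and its assumed
    1/T drift are illustrative, not device data."""
    UT = 1.3806e-23 * T / 1.602e-19      # thermal voltage kT/q
    gm = gm300 * (300.0 / T)             # assumed drift of f'(0)
    Ix = n * gm * UT * math.log(Iin / Ib2)
    Iref = n * gm * UT * math.log(2.0)
    return Igain * Ix / Iref             # all common factors cancel

cold = stage_outputs(273.0, Iin=10e-9, Ib2=100e-12, Igain=1e-9)
hot = stage_outputs(330.0, Iin=10e-9, Ib2=100e-12, Igain=1e-9)
print(abs(cold - hot) < 1e-12)  # Iout is temperature-invariant in this ideal model
```

In this ideal model only Igain and Ib2 remain in the transfer function, which is why only those two currents need temperature-compensated references.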
If the gate voltages of M1 and M2 are given by VG1 and VG2, then the reference current is given by Iout = (VG1 − VG2)/R. If the resistance R is temperature compensated and the voltage difference VG1 − VG2 is independent of temperature, then Iout will also be temperature compensated. Note that once Iout is generated on-chip, different values of currents can be generated using current mirrors. The voltages VG1 and VG2 are generated using programmable floating-gate voltage references, which were reported in [111]. In this section, we briefly describe the operating principle of the voltage reference; the reader is referred to [111] for additional details. If the floating-gate transistor M4 is matched to the current-source transistor M1, the output voltage VG1 is given by

VG1 = Vdd − n·Vb1 + n·Q1/CT, (6.23)

where n is the sub-threshold slope factor, Vdd and Vb1 are the supply and biasing voltages, Q1 is the charge on the floating gate F, and CT is the total capacitance at node F. Thus,

VG1 − VG2 = n·(Q1 − Q2)/CT. (6.24)

The charges Q1 and Q2 are programmed using a linear injection technique [111], where the opamps A1 and A2 implement an active feedback that holds the source current, the source voltage and the gate voltage of M1 and M2 constant. This eliminates any non-linear factors that affect hot-electron injection onto the gate, thus achieving a stable and controllable injection rate. As a result, the injection rate, and hence the resolution with which Q1 and Q2 are programmed, remains constant throughout the programming range, which is rail-to-rail. Programming is achieved by periodically enabling the feedback loop by opening the switch S1 (controlled by an FPGA). Again, for the sake of brevity, we have omitted the details of hot-electron injection programming, which can be found in [111]. Programming of the input current Iin is achieved using a 10-bit current-mode DAC as shown in Fig. 6.6. The MSB of the DAC is denoted by d10 whereas the LSB is denoted by d1. The architecture of the current DAC (shown in Fig. 
6.7(b)) is based on the popular resistive divider [95], where the current through each branch is twice as large as that of its neighbor. The output, as determined by the bits d10-d1, is a binary-weighted sum of currents which are mirrored into the input stage. The digital inputs d10 through d1 are transmitted through a shift-register chain, which converts the serial input Din into parallel bit slices that are processed by the DAC module.

6.4.3 Adaptive, varactor-driven temperature compensation circuitry

Floating-gate transistors serve as an attractive medium for non-volatile storage of analog parameters. However, conventional current memories based on floating-gate transistors are sensitive to variations in temperature, limiting their application to controlled environments. In our research, we propose a temperature-compensated floating-gate array that can be programmed to store currents down to the picoampere level. At the core of the proposed architecture is a control algorithm that uses a varactor to adapt the floating-gate capacitance such that the temperature-dependent factors can be effectively canceled. As a result, the stored current is theoretically a function only of a reference current and the differential charge stored on the floating gates. We validate the proof-of-concept using measurement results obtained from prototype current memory cells fabricated in a 0.5µm CMOS process.

6.4.3.1 Varactor based Floating-gate Temperature Compensation

The architecture of the proposed varactor-based floating-gate current memory cell is shown in Fig. 6.8. It consists of a reference cell formed by two floating-gate transistors P1 and P2. Note that in addition to the usual control-gate capacitor Cc and tunneling capacitor Ctun, all the floating-gate transistors have a varactor Cv connected to their respective gates. 
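Before moving to the varactor-based compensation, the floating-gate reference relations (6.23)-(6.24) can be illustrated with a short numerical sketch. The slope factor, capacitance and resistance below are illustrative values; the point is that the supply and bias terms cancel in VG1 − VG2, leaving a current set purely by the programmed charge difference:

```python
def fg_reference_current(Q1, Q2, n=1.3, CT=1e-12, R=1e7):
    """Eqs. (6.23)-(6.24): the Vdd and Vb1 terms cancel in the
    difference of the two floating-gate voltages, leaving only the
    programmed charge difference. Component values are illustrative."""
    dV = n * (Q1 - Q2) / CT           # VG1 - VG2, eq. (6.24)
    return dV / R                     # Iout = (VG1 - VG2)/R

# A 100 fC charge difference on a 1 pF total gate capacitance with
# n = 1.3 gives 130 mV, i.e. 13 nA through a 10-Mohm resistor.
print(fg_reference_current(200e-15, 100e-15))
```

Because only the charge difference matters, a common drift of both floating-gate charges (e.g. due to leakage) ideally leaves the reference current unchanged.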
The currents through P1 and P2 (measured using an on-chip analog-to-digital converter) are controlled by a control module which sets the control-gate voltage Vc and the tuning voltage Vx using an off-chip digital-to-analog converter. If the charges on the floating-gate nodes A and B are denoted by Q1 and Q2, then the gate voltages of the pMOS transistors P1 and P2 are given by

VA ≈ (Q1 + Cc·Vc + Cv·Vx + Ctun·Vtunnel + Cb·Vdd)/CT, (6.25)

VB ≈ (Q2 + Cc·Vc + Cv·Vx + Ctun·Vtunnel + Cb·Vdd)/CT. (6.26)

Figure 6.8: Proposed floating-gate based current reference and current memory cells.

When P1 and P2 are biased in weak-inversion, the ratio of their drain currents I1 and I2 is given by

I1/I2 = exp(κ·(Q2 − Q1)/(CT·UT)). (6.27)

The control module in Fig. 6.8 monitors I1 and I2 and ensures that their ratio remains constant across temperature variations by tuning the voltage Vx, which changes the capacitance CT. Letting K = I1/I2, the product CT·UT is then also held constant according to

CT·UT = κ·(Q2 − Q1)/ln(K), (6.28)

which leads to the expression for the output current Iout generated through the memory cell:

I1/Iout = exp(ln(K)·(Q3 − Q1)/(Q2 − Q1)). (6.29)

Thus, if the current I1 is compensated for temperature, then according to equation (6.29), the currents through the memory cells are also temperature compensated. In principle, I1 could be generated by a bandgap reference circuit [59]. For this implementation, we maintained I1 constant by co-adapting the control-gate voltage Vc along with the tuning voltage Vx.

6.4.3.2 Varactor Implementation and Second-order Effects

The varactor has been realized using a MOS capacitor operating in accumulation mode [112]. The cross-sectional architecture and the layout of the MOS-cap are shown in Fig. 
6.9, along with the floating-gate transistor, the control gate and the tunneling node. The use of a MOS-cap as a varactor introduces second-order effects into equation (6.29):

• The derivation of equation (6.29) assumed that the floating-gate transistors, including the variable capacitors Cv, are matched with respect to each other. Unfortunately, each floating-gate node is programmed to a different potential, which will lead to mismatch in Cv. In weak-inversion, however, the difference between the respective gate voltages is small (less than 100 mV), which reduces the effect of the mismatch.

• The variation of Cv with respect to the tuning voltage Vx is non-linear, as shown in the measured response in Fig. 6.21. Therefore, Vx and Vc have to be biased properly to ensure: (a) sub-threshold biasing of the transistors; and (b) high sensitivity of Cv with respect to Vx.

• The MOS-cap Cv also varies with temperature. However, the temperature model of a MOS varactor is too complex for any closed-form calibration, which motivates adapting Vx in an online manner.

6.5 Measurement Results

The prototype is still being measured for the SVM computation result.

6.5.1 Measurement results for floating-gate memory

As seen in Fig. 6.3, the programming of the floating-gate cells is controlled by Vcurrent_starving, Vdd_fg, Vc, Vx and the charge-injection voltage applied at the program pin. In the first set of experiments, we vary one of these parameters and keep the others fixed to verify the functionality of each control parameter. Fig. 6.10 shows the current through three different floating-gate transistors when different gate voltages are applied to the current-starving transistor P0 during charge injection.

Figure 6.9: (a) Cross-sectional architecture of MOS varactor; (b) layout for memory cell in Fig. 
6.8.

These three cells are first tunneled until their currents are below a picoampere. As Vcurrent_starving decreases, the charging speed increases, because the injection current through the transistor is increased by lowering the gate voltage of P0.

Figure 6.10: Floating-gate transistor current with respect to the current-starving voltage.

Fig. 6.11 shows the current through three different floating-gate transistors when different values of Vdd_fg are applied. As Vdd_fg increases, the charging speed increases as well, consistent with the principle that the injection current increases with a higher VSG across the floating-gate transistor.

Figure 6.11: Floating-gate transistor current with respect to Vdd_fg.

Fig. 6.12 shows the current through three different floating-gate transistors when different values of Vx are applied. As Vx increases, the charging speed decreases. Since increasing Vx raises the gate voltage of the floating-gate transistor, this is consistent with the principle that the injection current is proportional to the VSG across the floating-gate transistor.

Figure 6.12: Floating-gate transistor current with respect to Vx.

Fig. 6.13 shows the precise programming procedure achieved by tuning the voltage applied at the program pin, i.e., by tuning the injection current. The methodology is to first charge the cell toward a threshold value close to the desired value, and then reduce VPROD to allow a lower injection current, which ensures a fine step size. For this experiment, the desired value is 6.3 µA, and the threshold is set at 6.1 µA. 
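The coarse/fine programming methodology described above can be summarized in a short behavioral sketch. The injection step sizes below are invented for illustration; only the two-phase structure (fast injection up to a threshold, then a reduced injection rate for fine placement) mirrors the actual procedure:

```python
def program_cell(target=6.3e-6, threshold=6.1e-6,
                 coarse_step=50e-9, fine_step=2e-9):
    """Two-phase floating-gate programming: inject at a high rate until
    the cell current reaches `threshold`, then switch to a lower
    injection rate (smaller per-cycle step) to land precisely on
    `target`. Step sizes are illustrative."""
    current, cycles = 0.0, 0
    while current < target:
        current += coarse_step if current < threshold else fine_step
        cycles += 1
    return current, cycles

final, n = program_cell()
print(final)  # lands within one fine step of the 6.3 uA target
```

The final placement error is bounded by the fine step rather than the coarse one, which is the reason for lowering the injection voltage near the target.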
When VPROD is set as low as −3.3 V, the current increases rapidly; as the current approaches 6.1 µA, VPROD is switched to −2.8 V. As a result, the current through the floating-gate transistors is programmed to 6.3 µA precisely, as seen in Fig. 6.13.

Figure 6.13: Precise programming procedure for floating-gate transistors.

6.5.2 Measurement results for current-input current-output CMOS logarithmic amplifier

The microphotograph of a prototype current-input current-output CMOS logarithmic amplifier fabricated in a 0.5µm standard CMOS process is shown in Fig. 6.14. Table 6.1 summarizes the main specifications of the chip.

Figure 6.14: Die microphotograph of the chip.

Table 6.1: Specifications of the current-input current-output CMOS logarithmic amplifier

Fabrication Process: Standard CMOS 0.5µm
Die Size: 3000µm × 3000µm
M1-4, M6-7 (refer to Fig. 6.5): 10µm/5µm
M5 (refer to Fig. 6.5): 30µm/1.5µm
R (refer to Fig. 6.5): 10MΩ

6.5.2.1 Measurement results for the current reference

Fig. 6.15 shows the measured output current Iout (refer to Fig. 6.7(a)) versus the differential gate voltage V1 − V2 (refer to Fig. 6.7(a)). During this experiment, the switches Sp are set to be off so that both V1 and V2 are determined by the Vref of the opamps. V1 is varied while V2 is kept fixed. It is seen from Fig. 6.15 that Iout is proportional to the differential voltage, which confirms that equation (6.11) holds for this circuit.

Figure 6.15: (a) Measured current response and (b) relative error for the current reference.

6.5.2.2 Measurement results for the logarithmic amplifier

The measured results of the circuit are shown in Fig. 6.16. The input current was varied from 100 pA to 40 nA, and the measurement was repeated for different values of Ib4 (refer to Fig. 6.6). 
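For reference, the ideal transfer characteristic that these measurements are compared against, equation (6.22) with illustrative values for Igain and Ib2, can be generated as follows; equal output steps per decade of input current are the signature of the logarithmic transfer:

```python
import math

def logamp_iout(Iin, Igain=1e-9, Ib2=100e-12):
    """Ideal model of the complete logarithmic amplifier, eq. (6.22):
    Iout = 1.44 * Igain * ln(Iin/Ib2), valid for Iin > Ib2.
    Igain and Ib2 are illustrative, not the programmed chip values."""
    if Iin <= Ib2:
        raise ValueError("circuit requires Iin > Ib2")
    return 1.44 * Igain * math.log(Iin / Ib2)

# Each decade of input current adds the same fixed increment to Iout.
outs = [logamp_iout(i) for i in (1e-9, 10e-9, 100e-9)]
print(abs((outs[1] - outs[0]) - (outs[2] - outs[1])) < 1e-15)
```

Lowering Ib2 extends the usable input range downward, which is consistent with the observation that a larger dynamic range is obtained by reprogramming the bias references.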
The results show a close agreement between the measured data and the mathematical model given by equation (6.22). The results also demonstrate that a larger dynamic range can be achieved by increasing Ib4.

Figure 6.16: Measurement results for the logarithmic current conversion.

6.5.2.3 Noise Analysis

When the current memory cell operates at sub-nA current levels, thermal noise and flicker noise (due to the sub-threshold MOSFETs and the resistor) affect the accuracy of the current read out from the cell. Under DC conditions, flicker noise is dominant. The power spectral density (PSD) of the flicker noise of a MOS transistor biased in the sub-threshold region can be expressed as

S_ID = (KF · ID²)/(W · L · Cox · f), (6.30)

where KF is a process-dependent constant; Cox is the gate-oxide capacitance per unit area; W and L are the width and length of the transistor, respectively; ID is the drain current flowing through the transistor; and f is the frequency. Equation (6.30) indicates that the noise spectral density increases with the biasing current. To verify equation (6.30), a set of experiments is performed in which Iin and Iout are varied while Ib4 and Ib2 are fixed. The noise is computed as the standard deviation of 100 current samples for each pair of Iin and Iout. Fig. 6.17(a) shows that the noise of Iout is proportional to the magnitude of Iout, and Fig. 6.17(b) shows the corresponding SNRs.

Figure 6.17: (a) Noise of Iout versus Iin; (b) SNR of Iout.

6.5.2.4 Temperature characteristics

The underlying principle of the translinear Ohm's law intrinsically leads to a temperature-compensated operation. 
According to equation (6.22), the temperature coefficient of Iout is determined by the temperature characteristics of Ib4, Ib2, and Iin. Assume that Ib4 and Ib2 are not temperature compensated whereas Iin is temperature compensated, so that Ib4 and Ib2 are written as functions of temperature, Ib4(T) and Ib2(T). Then equation (6.22) is rewritten as

Iout(T) = Ib4(T) · (ln(Iin) − ln(Ib2(T))) / ln 2. (6.31)

For a specific temperature T0, three outputs are sampled for three inputs Iin1, Iin2 and Iin3:

Iout1(T0) = Ib4(T0) · (ln(Iin1) − ln(Ib2(T0))) / ln 2,
Iout2(T0) = Ib4(T0) · (ln(Iin2) − ln(Ib2(T0))) / ln 2,
Iout3(T0) = Ib4(T0) · (ln(Iin3) − ln(Ib2(T0))) / ln 2. (6.32)

Then it can be deduced that

(Iout1(T0) − Iout2(T0)) / (Iout1(T0) − Iout3(T0)) = ln(Iin1/Iin2) / ln(Iin1/Iin3). (6.33)

As seen from equation (6.33), the temperature-related terms are canceled. Hence the ratio (Iout1(T) − Iout2(T)) / (Iout1(T) − Iout3(T)) is temperature compensated as long as Iin is temperature compensated.

A corresponding experiment with the temperature varying from 27°C to 57°C is conducted; the result is shown in Fig. 6.18. It is seen that in the weak-inversion regime, the current ratios are approximately invariant with respect to temperature, which validates the relationship given by equation (6.33). Based on a first-order polynomial fit to the data shown in Fig. 6.18, a temperature sensitivity of 230 ppm/K is achieved over the temperature range of 27°C to 57°C.

6.5.2.5 Speed analysis

The speed of the circuit is limited by the current levels. According to equation (6.22), the limiting factor could be Iin, Ib4, or Ib2. The following experiment is conducted to characterize the bandwidth variation with respect to these factors. In the experiment, Iin and Ib4 are varied whereas Ib2 is fixed at 100 pA. The result is shown in Fig. 6.19. From Fig. 
6.19(a), it is seen that when Ib4 is set to 0.38 nA and Iin varies from 50 nA to 110 nA, the bandwidth increases from 0.6 kHz to 0.9 kHz; when Ib4 is set to 0.75 nA, the bandwidth increases from 0.8 kHz to 1.2 kHz over the same range of Iin.

Figure 6.18: Measured response plotting (Iout1(T) − Iout2(T)) / (Iout1(T) − Iout3(T)) under different temperatures.

From Fig. 6.19(b), it is seen that when Iin is set to 85 nA, the bandwidth increases monotonically from 0.45 kHz to 1 kHz as Ib4 varies from 0.25 nA to 2.75 nA; when Iin is set to 110 nA, the bandwidth increases from 0.73 kHz to 1.1 kHz as Ib4 varies over the same range. These observations demonstrate that the bandwidth increases as the bias currents are increased.

Figure 6.19: Measured response plotting bandwidth with respect to current.

Table 6.2 compares this work with some of the references.

6.5.3 Measurement results for varactor-driven temperature compensation circuitry

An array of varactor-based FG current memory cells has been prototyped in a 0.5µm standard CMOS process. Table 6.3 summarizes the specifications of the fabricated prototype and Fig. 6.20 shows its micrograph. 
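The compensation relations (6.27)-(6.29), which the following measurements are designed to validate, can be sanity-checked numerically. The charge values and κ below are illustrative; the sketch only demonstrates that once the control loop pins the measured ratio K = I1/I2, the product CT·UT (and hence every other cell ratio) is fixed:

```python
import math

def cell_ratios(Q1, Q2, Q3, CT_UT, kappa=0.7):
    """Eqs. (6.27) and (6.29) for matched weak-inversion floating-gate
    pMOS devices; all parameter values are illustrative."""
    K = math.exp(kappa * (Q2 - Q1) / CT_UT)                        # I1/I2
    I1_over_Iout = math.exp(math.log(K) * (Q3 - Q1) / (Q2 - Q1))   # eq. (6.29)
    return K, I1_over_Iout

K, ratio = cell_ratios(1e-15, 3e-15, 2e-15, CT_UT=2.6e-14)
# Eq. (6.28): measuring K recovers the product CT*UT that the varactor
# control loop is holding constant across temperature.
recovered = 0.7 * (3e-15 - 1e-15) / math.log(K)
print(abs(recovered - 2.6e-14) < 1e-20)
```

Because both (6.27) and (6.29) depend on temperature only through CT·UT, holding K constant by adapting Vx freezes all the memory-cell ratios at once, which is exactly the behavior reported in Fig. 6.23.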
Programming of the FG memory cells is based on hot-electron injection and Fowler-Nordheim (FN) tunneling, whose details have been presented elsewhere [61, 116] and are omitted here for the sake of brevity.

Table 6.2: Comparison of performance

Reference:   [113] | [114] | [115] | [108] | This work
Conversion:  current-in voltage-out | current-in voltage-out | current-in voltage-out | current-in voltage-out | current-in current-out
Process:     1.6µm | 1.5µm | Discrete | 0.5µm | 0.5µm
Area:        2.9 × 3.7mm² | 91 × 75µm² | - | 290 × 140µm² | -
Supply:      - | ±6V | - | 3.3 V | 3.3 V
DR:          100dB | 114dB | 80dB | 140dB | 120dB
Bandwidth:   >10Hz | 1kHz | 25kHz | >3.5kHz | 1.2kHz
Power:       10pA-1µA | 30mW | - | 0.1µW-33µW | 1nW-10µW
ppm/K:       uncompensated | uncompensated | uncompensated | uncompensated | 230

Table 6.3: Specifications of the fabricated prototype

Fabrication Process: Standard CMOS 0.5µm
Die Size: 3000µm × 3000µm
Size of P1-P3: 6µm/3µm
Cc: 100fF

Figure 6.20: Die microphotograph of the chip.

In the first set of experiments, the floating-gate voltage VA was measured (using a voltage buffer shown in Fig. 6.8) when Vx was tuned and Vc = 1.2 V. The ratio Cv/Cc was determined based on a sensitivity analysis of the current I1 with respect to Vx and Vc. The measurement result in Fig. 6.21 shows that the capacitance is inversely proportional to Vx, with the largest change occurring at Vx = 1.25 V. The experiment also demonstrates that Cv can be varied over a large dynamic range with respect to Vx, which should be sufficient to compensate for large temperature variations.

Figure 6.21: The C-V curve of the MOS capacitor.

For the next set of experiments, we measured the current ratios under four different temperatures, namely 28°C, 33°C, 38°C, and 43°C, with the voltage Vc fixed at 2 V and Vx varied. The results shown in Fig. 6.22(a)-(e) demonstrate that with an increase in temperature, the respective current ratios decrease, as indicated by the relationship in equation (6.28). 
The control algorithm then varies Vx to ensure that the ratio I1/I2 is maintained constant. The trajectory of Vx traversed by the control algorithm is shown in Fig. 6.22(a)-(e), together with the corresponding ratios I1/Iout (here Iout is represented by I3-I6 from different memory cells). The compensated results with Vx biased at 1.2 V and 1.3 V are shown in Fig. 6.23. A temperature sensitivity of 150 ppm/K in the current ratio is achieved over the temperature range of 28°C to 43°C. It is seen that the currents through the other memory cells (I3-I6) can be effectively compensated with respect to temperature once I1/I2 is constant, no matter how different each current is from I1. This result validates the proof-of-concept temperature compensation for the proposed floating-gate current memory.

Figure 6.22: Measured response plotting (a) I1/I2 (b) I1/I3 (c) I1/I4 (d) I1/I5 (e) I1/I6 with respect to Vx under 28°C, 33°C, 38°C, and 43°C.

Figure 6.23: Measured response plotting temperature compensated current ratios under Vx of (a) 1.2V (b) 1.3V.

Chapter 7

Conclusions and future work

The major conclusion of this thesis is that it is feasible to use the MP algorithm as an effective alternative for analog computation. The MP algorithm can be implemented with current-mode analog VLSI circuits, and the basic MP circuit can be utilized to synthesize a variety of static analog computation circuits. 
The MP circuit also serves as a fundamental circuit unit in the MP-based applications. The proposed MP algorithm is a novel analog computation algorithm whose underlying principle is the universal conservation law. In contrast to state-of-the-art translinear circuits, which depend on the physics of electronic devices, the MP algorithm depends only on a physical conservation law, which makes it independent of the transistor operation region. As a consequence, the MP circuit has a wide dynamic range and is capable of trading off one performance metric against another when required, owing to its wide operating region. The underlying conservation principle also enables the MP algorithm to be implemented in a diversity of devices, not limited to electronic devices, using physical quantities such as mass, time, or resistance. Since current-mode circuits are the easiest and most mature means of implementation, this dissertation adheres to a current-mode implementation.

One characteristic of the MP algorithm is that it achieves a PWL approximation of the "log-sum-exp" function. The MP algorithm also has several mathematical properties which facilitate MP-based computations. Another characteristic of the MP algorithm lies in its implementation based on the reverse water-filling criterion, which enables MP to be implemented easily in the analog domain. Together, these two characteristics allow diverse analog arithmetic computations to be synthesized from the basic MP circuit. In this study, the investigated computations include addition, subtraction, multiplication, division, power, polynomial, tanh, and inner product. These computations are ubiquitous in applications such as neural networks, communications, and machine learning. The two MP-based application examples presented in this research study are an analog LDPC decoder and an analog SVM. 
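For readers skimming the conclusions, the core MP computation referred to throughout can be captured in a few lines. The sketch below solves the reverse water-filling constraint Σi max(Li − z, 0) = γ by bisection; this is a behavioral model of the algorithm only, not of the current-mode circuit that implements it:

```python
def margin_propagation(L, gamma, tol=1e-12):
    """PWL approximation of log-sum-exp via the reverse water-filling
    constraint sum_i max(L_i - z, 0) = gamma, solved by bisection.
    The left-hand side decreases monotonically in z, so bisection on
    [min(L) - gamma, max(L)] always brackets the unique solution."""
    lo, hi = min(L) - gamma, max(L)
    while hi - lo > tol:
        z = 0.5 * (lo + hi)
        if sum(max(x - z, 0.0) for x in L) > gamma:
            lo = z  # z too small: too much "water" above the level
        else:
            hi = z
    return 0.5 * (lo + hi)

z = margin_propagation([1.0, 2.0, 4.0], gamma=1.0)
print(round(z, 6))
```

Since the constraint involves only additions, subtractions and thresholding, it maps directly onto Kirchhoff's current law, which is what makes the circuit implementation bias-scalable.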
For each application, the mathematical model built upon the MP algorithm is illustrated, simulation results are presented, the system architecture and circuit implementation based on the MP circuit are demonstrated, and measurement results are presented as well.

7.1 Summary of contributions

The contributions of this dissertation are summarized here:

• Propose an analog signal processing (ASP) algorithm named margin propagation (MP) as an efficient piecewise-linear (PWL) approximation technique for the "log-sum-exp" function, and present its current-mode analog circuit implementation. Since the MP algorithm involves only addition, subtraction and threshold operations, it can be implemented using universal conservation principles like Kirchhoff's current law. Its analog circuit implementation is bias-scalable across different operation regions. As a result, the MP circuit can adapt to various application requirements, since different biasing conditions are beneficial for different specifications.

• Synthesize bias-scalable analog computational circuits for a wide range of linear and non-linear mathematical functions based on the MP principle. We have implemented addition, subtraction, multiplication, division, power and polynomial computations, etc., with MP-based circuits. An important attribute of MP-based synthesis is that all computations are performed in the logarithmic domain, as a result of which multiplications map into additions, divisions map into subtractions and power computations map into multiplications. This simplifies the implementation of many complicated non-linear functions. All the computational circuits can be operated in the weak-inversion, moderate-inversion, and strong-inversion regions.

• Design and implement an analog (32,8) MP-based LDPC decoder. 
One fascinating property is its capability of trading off BER performance with energy efficiency through the tunable hyper-parameter γ in the MP algorithm, which has been verified by the (32,8) MP-based LDPC decoder prototype. The prototyped MP-based LDPC decoder can achieve an energy efficiency of 100 pJ/bit, while an optimal configuration can also deliver up to 3 dB improvement in BER compared to the benchmark min-sum LDPC decoder.

• Design and implement an analog energy-scalable MP-based support vector machine (SVM). The prototype stores 2052 SVM parameters in a floating-gate memory array and is fully programmable. Owing to the use of the MP computation circuit, the SVM is bias-scalable.

• Design and implement a novel current-input current-output logarithmic amplifier circuit. Unlike traditional transimpedance-based logarithmic amplifier circuits, our design, based on the translinear Ohm's law, directly generates a current as a logarithmic function of the input current; it exhibits a dynamic range of 120 dB and a temperature sensitivity of 230 ppm/K, while consuming less than 100 nW of power.

• Design and implement a varactor-based temperature compensation circuit for the floating-gate memory array. By adapting the floating-gate capacitance, this design enables the temperature-dependent factors to be effectively canceled, which resolves the temperature-induced variation of the current stored in the floating-gate memory. A temperature sensitivity of 150 ppm/K is achieved. 
Figure 7.1: Improved MP circuit with cascode current mirrors.

Figure 7.2: Decoupled MP circuit to prevent gate leakage.

7.2 Future directions

Some aspects of the MP algorithm and MP circuits can be further improved or remain unexplored:

• Novel methodologies to implement the MP algorithm. The MP algorithm is scalable to different analog structures ranging from silicon transistors to microfluidic devices. This dissertation adheres to the current-mode implementation since it is the easiest. In future research, the physical parameter used to represent a value could be time (to improve energy efficiency), charge (to improve precision), or any other characteristic parameter, depending on the application requirements.

• Ultra-low-power implementation of the MP algorithm. In this thesis, all the prototypes are capable of operating with current magnitudes as low as nanoamperes. However, since the MP algorithm is based on a conservation law, it could in principle function even at current magnitudes of femtoamperes. For this improvement, specific circuit design techniques need to be developed.

• Precision improvement for the MP algorithm. In this thesis, one intuitive solution to improve the PWL approximation of the "log-sum-exp" function by MP is proposed, based on an experiment presented in chapter 3: increasing the number of operands involved in the MP computation. The rigorous mathematical design of the operands, including the optimal number of operands and their values, is a promising topic for research. 
7.2.1 Potential solutions for ultra-deep-submicron computational circuits using margin propagation

7.2.1.1 Cascode current mirrors

As discussed in [117], the performance of the MP circuit is limited by the accuracy of the "current mirrors" composed of N0 through N4 in Fig. 4.2. When the MP circuit is scaled down to the nano-scale, the inaccuracy in current replication caused by systematic mismatch (due to differing VDS values) becomes more pronounced. Several studies have reported methodologies for improving the precision of ultra-deep-submicron current mirrors under low-voltage conditions [118, 119, 120, 121, 122, 123], most of them focused on cascoding schemes. Following these low-voltage high-precision current mirror designs, the simplest way to improve the accuracy of a nano-scale MP circuit is to use the cascode technique. A basic nano-MP circuit with cascode current mirrors is displayed in Fig. 7.1. To achieve a low-voltage implementation, it is also feasible to introduce the other techniques reported in [118, 119, 120, 121, 122, 123].

7.2.1.2 Gate decoupling

Gate leakage is another concern for nano-scale MP circuits since it may cause the MP circuit to malfunction. As can be seen in Fig. 7.1, gate leakage is liable to occur at node A, which would affect the output current mirrored from N1 through N4. To mitigate this, a possible solution is proposed that "decouples" the gates of N1 through N4 from node A. The improved topology is shown in Fig. 7.2. KCL still holds at node A; the difference is that the gates of transistors N0 through N4 are decoupled from node A and connected to node C. In this way, the gate voltages of N1 through N4, which used to be driven by the currents through P1-P4, are self-balanced by the currents through the transistors themselves. This is made possible by the hot-carrier effect in nano-scale circuits.

BIBLIOGRAPHY

[1] A. Graupner, J. Schreiter, S. Getzlaff, and R.
Schüffny, "CMOS image sensor with mixed-signal processor array," IEEE J. Solid-State Circuits, vol. 38, no. 6, pp. 948–957, Jun. 2003.
[2] P. Hasler, P. D. Smith, D. Graham, R. Ellis, and D. V. Anderson, "Analog floating-gate, on-chip auditory sensing system interfaces," IEEE Sensors Journal, vol. 5, no. 5, pp. 1027–1034, Oct. 2005.
[3] P. Hasler and D. V. Anderson, "Cooperative analog-digital signal processing," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 2002, pp. IV-3972–IV-3975.
[4] G. Frantz, "Digital signal processor trends," IEEE Micro, vol. 20, no. 6, pp. 52–59, Dec. 2000.
[5] H. Suzuki, H. Takata, H. Shinohara, E. Teraoka, M. Matsuo, T. Yoshida, H. Sato, N. Honda, N. Masui, and T. Shimizu, "1.047GHz, 1.2V, 90nm CMOS, 2-way VLIW DSP core using saturation anticipator circuit," in 2006 Symposium on VLSI Circuits Digest of Technical Papers, 2006, pp. 152–153.
[6] T. S. Hall, C. M. Twigg, J. D. Gray, P. Hasler, and D. V. Anderson, "Large-scale field-programmable analog arrays for analog signal processing," IEEE Trans. Circuits Syst. I, vol. 52, no. 11, pp. 2298–2307, Nov. 2005.
[7] C. Huang, N. Lajnef, and S. Chakrabartty, "Calibration and characterization of self-powered floating-gate usage monitor with single electron per second operational limit," IEEE Trans. Circuits Syst. I, vol. 57, no. 3, pp. 556–567, Mar. 2010.
[8] F. Silveira, D. Flandre, and P. G. A. Jespers, "A floating-gate MOS learning array with locally computed weight updates," IEEE J. Solid-State Circuits, vol. 31, no. 9, pp. 1314–1319, Sep. 1996.
[9] G. E. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, pp. 114–117, Apr. 1965.
[10] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Computation, vol. 19, pp. 297–301, Apr. 1965.
[11] A. Bandyopadhyay, J. Lee, R. W. Robucci, and P.
Hasler, "MATIA: A programmable 80 µW/frame CMOS block matrix transform imager architecture," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 663–672, Mar. 2006.
[12] N. Zhang, C. Teuscher, H. Lee, and B. Brodersen, "Architectural implementation issues in a wideband receiver using multiuser detection," in Proc. 36th Annu. Allerton Conf. Communication, Control, and Computing, Sep. 1998, pp. 765–771.
[13] R. Sarpeshkar, "Efficient precise computation with noisy components: Extrapolating from an electronic cochlea to the brain," Ph.D. dissertation, Dept. Comput. Neur. Syst., California Institute of Technology, Pasadena, CA, 1997.
[14] B. Gilbert, "Translinear circuits: a proposed classification," Electronics Letters, vol. 11, no. 1, pp. 14–16, 1975.
[15] ——, "A high-accuracy analog multiplier," in Proc. IEEE ISSCC, 1974, pp. 128–129.
[16] S. Hemati and A. Banihashemi, "Full CMOS min-sum analog iterative decoder," Jun. 2003, pp. 347.
[17] E. Seevinck and R. J. Wiegerink, "Generalized translinear circuit principle," IEEE J. Solid-State Circuits, vol. 26, no. 8, pp. 1198–1202, Aug. 1991.
[18] W. Gai, H. Chen, and E. Seevinck, "Quadratic-translinear CMOS multiplier-divider circuit," Electron. Lett., vol. 33, no. 10, pp. 860–861, May 1997.
[19] B. A. Minch, P. Hasler, and C. Diorio, "Multiple-input translinear element networks," IEEE Trans. Circuits Syst. II, vol. 48, no. 1, pp. 20–28, Jan. 2001.
[20] R. W. Adams, "Filtering in the log domain," in 63rd AES Conf., New York, May 1979.
[21] J. Mulder, A. C. van der Woerd, W. A. Serdijn, and A. H. M. van Roermund, "An RMS-DC converter based on the dynamic translinear principle," IEEE J. Solid-State Circuits, vol. 32, no. 7, pp. 1146–1150, Jul. 1997.
[22] Y. P. Tsividis, "Externally linear integrators," IEEE Trans. Circuits Syst. II, vol. 45, no. 9, pp. 1181–1187, Sep. 1998.
[23] B. Gilbert, "Translinear circuits: A historical overview," Analog Integrated Circuits and Signal Processing, vol. 9, no. 2, pp.
95–118, 1996.
[24] B. A. Minch, "MOS translinear principle for all inversion levels," IEEE Trans. Circuits Syst. II, vol. 55, no. 2, pp. 121–125, Feb. 2008.
[25] E. A. Vittoz, "Analog VLSI signal processing: Why, where, and how?" Analog Integr. Circuits Signal Process., vol. 6, no. 1, pp. 27–44, 1994.
[26] B. A. Minch, "Synthesis of static and dynamic multiple-input translinear element networks," IEEE Trans. Circuits Syst. I, vol. 51, no. 2, pp. 409–421, Feb. 2004.
[27] A. G. Andreou, K. A. Boahen, P. O. Pouliquen, A. Pavasovic, R. E. Jenkins, and K. Strohbehn, "Current-mode subthreshold MOS circuits for analog VLSI neural systems," IEEE Trans. Neural Netw., vol. 2, no. 2, pp. 205–213, Mar. 1991.
[28] A. G. Andreou and K. A. Boahen, "Translinear circuits in subthreshold MOS," Analog Integrated Circuits and Signal Processing, vol. 9, no. 2, pp. 141–166, 1996.
[29] P. Hasler, C. Diorio, and B. A. Minch, "A four-quadrant floating-gate synapse," in Proc. IEEE International Symposium on Circuits and Systems, vol. 3, Jun. 1998, pp. 29–31.
[30] C. Diorio, "Neurally inspired silicon learning: From synapse transistors to learning arrays," Ph.D. dissertation, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, 1997.
[31] B. A. Minch, "Multiple-input translinear element log-domain filters," IEEE Trans. Circuits Syst. II, vol. 48, pp. 29–36, Jan. 2001.
[32] ——, "Construction and transformation of multiple-input translinear element networks," IEEE Trans. Circuits Syst. I, vol. 50, pp. 1530–1538, Dec. 2003.
[33] J. Mulder, W. A. Serdijn, A. C. van der Woerd, and A. H. M. van Roermund, "A generalized class of dynamic translinear circuits," IEEE Trans. Circuits Syst. II, vol. 48, no. 5, pp. 501–504, May 2001.
[34] D. Frey, "Log-domain filtering: An approach to current-mode filtering," Proc. Inst. Electron. Eng., vol. 140, no. 6, pp. 406–416, Dec. 1993.
[35] ——, "A general class of current mode filters," in Proc. IEEE ISCAS, vol. 2, 1993, pp.
1435–1438.
[36] ——, "Exponential state space filters: A generic current mode design strategy," IEEE Trans. Circuits Syst. I, vol. 43, pp. 34–42, Jan. 1996.
[37] C. Toumazou and T. S. Lande, "Micropower log-domain filter for electronic cochlea," Electron. Lett., vol. 30, no. 22, pp. 1839–1841, Oct. 1994.
[38] M. Punzenberger and C. C. Enz, "Low-voltage companding current-mode integrators," in Proc. IEEE ISCAS, 1995, pp. 2112–2115.
[39] D. Perry and G. W. Roberts, "Log-domain filters based on LC ladder synthesis," in Proc. IEEE ISCAS, 1995, pp. 311–314.
[40] J. Mulder, A. C. van der Woerd, W. A. Serdijn, and A. H. M. van Roermund, "A current-mode companding √x-domain integrator," Electron. Lett., vol. 32, no. 3, pp. 198–199, Feb. 1996.
[41] W. A. Serdijn, M. Broest, J. Mulder, A. C. van der Woerd, and A. H. M. van Roermund, "A low-voltage ultra-low-power translinear integrator for audio filter applications," IEEE J. Solid-State Circuits, vol. 32, pp. 577–581, Apr. 1997.
[42] W. A. Serdijn, J. Mulder, A. C. van der Woerd, and A. H. M. van Roermund, "A wide-tunable translinear second-order oscillator," IEEE J. Solid-State Circuits, vol. 33, no. 2, pp. 195–201, Feb. 1998.
[43] A. Payne, A. Thanachayanont, and C. Papavassilliou, "A 150-MHz translinear phase-locked loop," IEEE Trans. Circuits Syst. II, vol. 45, no. 9, pp. 1220–1231, Sep. 1998.
[44] T. Lin and A. Payne, "Translinear sinusoidal frequency tripler," Electron. Lett., vol. 38, no. 10, pp. 441–442, May 2002.
[45] M. Gu, K. Misra, H. Radha, and S. Chakrabartty, "Sparse decoding of low density parity check codes using margin propagation," in Proc. of IEEE Globecom, Nov. 2009.
[46] ——, "Sparse decoding of low-density parity-check codes based on margin propagation," IEEE Trans. Commun., 2010, submitted.
[47] W. E. Ryan, "An introduction to LDPC codes," in CRC Handbook for Coding and Signal Processing for Recording Systems. Taylor, CRC Press, 2004, ch. 36.
[48] F. Kschischang, B. J. Frey, and H.-A.
Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
[49] F.-S. Lai and C.-F. E. Wu, "A hybrid number system processor with geometric and complex arithmetic capabilities," IEEE Trans. Comput., vol. 40, no. 8, pp. 952–962, Aug. 1991.
[50] E. I. Chester and J. N. Coleman, "Matrix engine for signal processing applications using the logarithmic number system," in Proc. IEEE Application-Specific Syst., Architectures and Processors, Jul. 2002, pp. 315–324.
[51] J. N. Coleman, E. I. Chester, C. I. Softley, and J. Kadlec, "Arithmetic on the European logarithmic microprocessor," IEEE Trans. Comput., vol. 49, no. 7, pp. 702–715, Jul. 2000.
[52] H. Kim, B.-G. Nam, J.-H. Sohn, J.-H. Woo, and H.-J. Yoo, "A 231-MHz, 2.18-mW 32-bit logarithmic arithmetic unit for fixed-point 3-D graphics system," IEEE J. Solid-State Circuits, vol. 41, no. 11, pp. 2373–2381, Nov. 2006.
[53] B.-G. Nam, H. Kim, and H.-J. Yoo, "A low-power unified arithmetic unit for programmable handheld 3-D graphics systems," IEEE J. Solid-State Circuits, vol. 42, no. 8, pp. 1767–1778, Aug. 2007.
[54] K. H. Abed and R. E. Siferd, "CMOS VLSI implementation of a low-power logarithmic converter," IEEE Trans. Comput., vol. 52, no. 11, pp. 1421–1433, Nov. 2003.
[55] S. L. SanGregory, C. Brothers, and D. Gallagher, "A fast, low-power logarithm approximation with CMOS VLSI implementation," in Proc. IEEE Midwest Symp. Circuits and Systems, Aug. 1999, pp. 388–391.
[56] Y.-H. Lam, W.-H. Ki, and C.-Y. Tsui, "Integrated low-loss CMOS active rectifier for wirelessly powered devices," IEEE Trans. Circuits Syst. II, vol. 53, no. 12, pp. 1378–1382, Dec. 2006.
[57] A. Gore, S. Chakrabartty, S. Pal, and E. C. Alocilja, "A multichannel femtoampere-sensitivity potentiostat array for biosensing applications," IEEE Trans. Circuits Syst. I, vol. 53, no. 11, pp. 2357–2363, Nov. 2006.
[58] T. Serrano-Gotarredona, B. Linares-Barranco, and A. G.
Andreou, "A general translinear principle for subthreshold MOS transistors," IEEE Trans. Circuits Syst. I, vol. 46, no. 5, pp. 607–616, May 1999.
[59] B. Razavi, Design of Analog CMOS Integrated Circuits. McGraw-Hill Science/Engineering/Math, Aug. 2000.
[60] P. R. Kinget, "Device mismatch and tradeoffs in the design of analog circuits," IEEE J. Solid-State Circuits, vol. 40, no. 6, pp. 1212–1224, Jun. 2005.
[61] S. Chakrabartty and G. Cauwenberghs, "Sub-microwatt analog VLSI trainable pattern classifier," IEEE J. Solid-State Circuits, vol. 42, no. 5, pp. 1169–1179, May 2007.
[62] K. Kang and T. Shibata, "An on-chip-trainable Gaussian-kernel analog support vector machine," IEEE Trans. Circuits Syst. I, vol. 57, no. 7, pp. 1513–1524, Jul. 2010.
[63] S. Chakrabartty and G. Cauwenberghs, "Gini-support vector machine: Quadratic entropy based multi-class probability regression," Journal of Machine Learning Research, vol. 8, pp. 813–839, 2007.
[64] R. G. Gallager, Low Density Parity-Check Codes, R. G. Gallager, Ed. MIT Press, Cambridge, MA, 1963.
[65] European Telecommunications Standards Institute (ETSI), "Digital Video Broadcasting (DVB) second generation framing structure, channel coding and modulation systems for broadcasting, interactive services, news gathering and other broadband satellite applications; tm 2860rl dvbs2-74r8," www.dvb.org.
[66] B. Bangerter et al., "Wireless technologies: High-throughput wireless LAN air interface," http://www.intel.com/technology/itj.
[67] Y. Rashi et al., "LDPC: Efficient alternative FEC for the TFI-OFDM PHY proposal," http://www.grouper.ieee.org/groups/802/15/pub.
[68] R. M. Tanner, "A recursive approach to low complexity codes," IEEE Trans. Inf. Theory, vol. 27, no. 5, pp. 533–547, Sep. 1981.
[69] J. Hagenauer, E. Offer, and L. Papke, "Iterative decoding of block and convolutional codes," IEEE Trans. Inf. Theory, vol. IT-42, no. 3, pp. 429–445, Mar. 1996.
[70] D. J. C.
MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Trans. Inf. Theory, vol. 45, no. 2, pp. 399–431, Mar. 1999.
[71] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, and D. A. Spielman, "Improved low-density parity-check codes using irregular graphs," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 585–598, Feb. 2001.
[72] M. M. Mansour and N. R. Shanbhag, "A 640 Mbps 2048-bit programmable LDPC decoder chip," IEEE J. Solid-State Circuits, vol. 41, pp. 684–698, Mar. 2006.
[73] A. J. Blanksby and C. J. Howland, "A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder," IEEE J. Solid-State Circuits, vol. 37, no. 3, Mar. 2002.
[74] A. Darabiha, A. C. Carusone, and F. R. Kschischang, "A 3.3-Gbps bit-serial block-interlaced min-sum LDPC decoder in 0.13-µm CMOS," in IEEE Custom Integrated Circuits Conference, Sep. 2007, pp. 459–462.
[75] T. Brandon, R. Hang, G. Block, V. C. Gaudet, B. Cockburn, S. Howard, C. Giasson, K. Boyle, P. Goud, S. S. Zeinoddin, A. Rapley, S. Bates, D. Elliott, and C. Schlegel, "A scalable LDPC decoder ASIC architecture with bit-serial message exchange," Elsevier Integration, the VLSI Journal, vol. 41, no. 3, pp. 385–398, May 2008.
[76] J. Hagenauer and M. Winklhofer, "The analog decoder," in Proceedings of IEEE International Symposium on Information Theory (ISIT '98), Cambridge, MA, USA, Aug. 1998, p. 145.
[77] S. Hemati, A. Banihashemi, and C. Plett, "An 80-Mb/s 0.18-µm CMOS analog min-sum iterative decoder for a (32,8,10) LDPC code," IEEE J. Solid-State Circuits, vol. 41, pp. 2531–2540, Nov. 2006.
[78] C. Winstead, N. Nguyen, V. Gaudet, and C. Schlegel, "Low-voltage CMOS circuits for analog iterative decoders," IEEE Trans. Circuits Syst. I, vol. 53, pp. 829–841, Apr. 2006.
[79] V. Gaudet, "Energy efficient circuits for LDPC decoding," CMOS Emerging Technologies, 2007.
[80] A. Liveris, Z. Xiong, and C.
Georghiades, "Compression of binary sources with side information at the decoder using LDPC codes," IEEE Communications Letters, vol. 6, no. 10, pp. 440–442, Oct. 2002.
[81] M. Sartipi and F. Fekri, "Source and channel coding in wireless sensor networks using LDPC codes," in Proc. First Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON 2004), Oct. 2004, pp. 309–316.
[82] M. Potluri, S. Chilumuru, S. Kambhampati, and K. Namuduri, "Distributed source coding using non-binary LDPC codes for sensor network applications," in Proc. 10th Canadian Workshop on Information Theory (CWIT '07), Jun. 2007, pp. 105–109.
[83] D. MacKay and R. Neal, "Near Shannon limit performance of low density parity check codes," Electronics Letters, vol. 33, no. 6, pp. 457–458, Mar. 1997.
[84] S.-Y. Chung, G. D. Forney, Jr., T. Richardson, and R. Urbanke, "On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit," IEEE Communications Letters, vol. 5, no. 2, pp. 58–60, Feb. 2001.
[85] ITRS, "International technology roadmap for semiconductors," http://www.itrs.net, 2010.
[86] T. Richardson and R. Urbanke, "The capacity of low-density parity check codes under message-passing decoding," IEEE Trans. Inf. Theory, vol. 47, pp. 599–618, 2001.
[87] T. Richardson, M. Shokrollahi, and R. Urbanke, "Design of capacity approaching irregular low-density parity-check codes," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 619–637, Feb. 2001.
[88] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003, ch. 6, pp. 557–572, http://www.inference.phy.cam.ac.uk/mackay/itila/.
[89] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, Apr. 2006.
[90] L. Mancera and J. Portilla, "L0-norm-based sparse representation through alternate projections," in Proc. 13th International Conference on Image Processing (ICIP), Atlanta, GA, USA, Oct. 2006.
[91] S.-Y. Chung, T.
Richardson, and R. Urbanke, "Analysis of sum-product decoding of low-density parity-check codes using a Gaussian approximation," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 657–670, Feb. 2001.
[92] P. Rusmevichientong and B. Van Roy, "An analysis of belief propagation on the turbo decoding graph with Gaussian densities," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 745–765, Feb. 2001.
[93] A. Amraoui, "LdpcOpt," http://ipgdemos.epfl.ch/ldpcopt/.
[94] O. C. Ibe, Fundamentals of Applied Probability and Random Processes, 1st ed. Academic Press, 2005.
[95] C.-H. Lin and K. Bult, "A 10-b, 500-MSample/s CMOS DAC in 0.6 mm²," IEEE J. Solid-State Circuits, vol. 33, no. 12, pp. 1948–1958, Dec. 1998.
[96] V. Gaudet and G. Gulak, "A 13.3-Mb/s 0.35µm CMOS analog turbo decoder IC with a configurable interleaver," IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 2010–2015, Nov. 2003.
[97] A. Darabiha, A. C. Carusone, and F. R. Kschischang, "Power reduction techniques for LDPC codes," IEEE J. Solid-State Circuits, vol. 43, no. 8, pp. 1835–1845, Aug. 2008.
[98] Z. Zhang, V. Anantharam, M. J. Wainwright, and B. Nikolić, "An efficient 10GBASE-T Ethernet LDPC design with low error floors," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 843–855, Apr. 2010.
[99] C. Winstead, J. Dai, S. Yu, C. Myers, R. Harrison, and C. Schlegel, "CMOS analog MAP decoder for (8,4) Hamming code," IEEE J. Solid-State Circuits, vol. 39, no. 1, pp. 122–131, Jan. 2004.
[100] D. Vogrig, A. Gerosa, A. Neviani, A. Amat, G. Montorsi, and S. Benedetto, "A 0.35-µm CMOS analog turbo decoder for the 40-bit rate 1/3 UMTS channel code," IEEE J. Solid-State Circuits, vol. 40, pp. 753–762, Mar. 2005.
[101] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," Comput. Vis. Pattern Recogn., pp. 130–136, 1999.
[102] R. Genov and G. Cauwenberghs, "Charge-mode parallel architecture for matrix-vector multiplication," IEEE Trans. Circuits Syst. II, vol. 48, no.
10, pp. 930–936, 2001.
[103] S. Chakrabartty and G. Cauwenberghs, "Sub-microwatt analog VLSI support vector machine for pattern classification and sequence estimation," in Adv. Neural Information Processing Systems (NIPS 2004), vol. 17. Cambridge, MA: MIT Press, 2005.
[104] ——, "Sub-microwatt analog VLSI trainable pattern classifier," IEEE J. Solid-State Circuits, vol. 42, no. 5, pp. 1169–1179, May 2007.
[105] B. Boser, I. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annu. ACM Workshop Computational Learning Theory, 1992, pp. 144–152.
[106] A. Bandyopadhyay, G. J. Serrano, and P. Hasler, "Adaptive algorithm using hot-electron injection for programming analog computational memory elements within 0.2% of accuracy over 3.5 decades," IEEE J. Solid-State Circuits, vol. 41, no. 9, pp. 2107–2114, Sep. 2006.
[107] W. L. Barber and E. R. Brown, "A true logarithmic amplifier for radar IF applications," IEEE J. Solid-State Circuits, vol. 15, no. 3, pp. 291–295, Jun. 1980.
[108] A. Basu, R. W. Robucci, and P. E. Hasler, "A low-power, compact, adaptive logarithmic transimpedance amplifier operating over seven decades of current," IEEE Trans. Circuits Syst. I, vol. 54, no. 10, pp. 2167–2177, Oct. 2007.
[109] C. D. Holdenried, J. W. Haslett, J. G. McRory, R. D. Beards, and A. J. Bergsma, "A DC-4GHz true logarithmic amplifier: Theory and implementation," IEEE J. Solid-State Circuits, vol. 37, no. 10, pp. 1290–1299, Oct. 2002.
[110] E. Vittoz and J. Fellrath, "CMOS analog integrated circuits based on weak inversion operations," IEEE J. Solid-State Circuits, vol. 12, no. 3, pp. 224–231, Jun. 1977.
[111] C. Huang, P. Sarkar, and S. Chakrabartty, "Rail-to-rail, thermal-noise limited, linear hot-electron injection programming of floating-gate voltage bias generators at 13-bit resolution," IEEE J. Solid-State Circuits, vol. 46, no. 11, pp. 2685–2692, Nov. 2011.
[112] R. L. Bunch and S. Raman, "Large-signal analysis of MOS varactors in CMOS-Gm LC VCOs," IEEE J.
Solid-State Circuits, vol. 38, no. 8, pp. 1325–1332, Aug. 2003.
[113] T. Delbruck and D. Oberhoff, "Self-biasing low power adaptive photoreceptor," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 4, May 2004, pp. 844–847.
[114] D. H. K. Hoe and D. B. Ribner, "An auto-ranging photodiode preamplifier with 114 dB dynamic range," IEEE J. Solid-State Circuits, vol. 31, no. 2, pp. 187–194, Feb. 1996.
[115] M. J. Hayes, "A nonlinear optical preamplifier for sensing applications," IEEE Trans. Circuits Syst. I, vol. 49, no. 1, pp. 1–9, Jan. 2002.
[116] A. Basu, S. Brink, C. Schlottmann, S. Ramakrishnan, C. Petre, S. Koziol, F. Baskaya, C. M. Twigg, and P. Hasler, "A floating-gate-based field-programmable analog array," IEEE J. Solid-State Circuits, vol. 45, no. 9, pp. 1781–1794, Sep. 2010.
[117] M. Gu and S. Chakrabartty, "Synthesis of bias-scalable CMOS analog computational circuits using margin propagation," IEEE Trans. Circuits Syst. I, vol. 59, no. 2, pp. 243–254, Feb. 2012.
[118] E. Säckinger and W. Guggenbühl, "A high swing, high impedance MOS cascode circuit," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 289–298, Feb. 1990.
[119] T. Itakura and Z. Czarnul, "High output-resistance CMOS current mirrors for low-voltage applications," IEICE Trans. Fundamentals, vol. E80-A, no. 11, pp. 230–232, Jan. 1997.
[120] F. You, S. H. K. Embabi, J. F. Duque-Carrillo, and E. Sánchez-Sinencio, "An improved tail current source for low voltage applications," IEEE J. Solid-State Circuits, vol. 32, no. 8, pp. 1173–1180, Aug. 1997.
[121] T. Serrano and B. Linares-Barranco, "The active-input regulated cascode current-mirror," IEEE Trans. Circuits Syst. I, vol. 41, no. 6, pp. 464–467, Jun. 1994.
[122] J. Ramírez-Angulo, R. G. Carvajal, and A. Torralba, "Low supply voltage high-performance CMOS current mirror with low input and output voltage requirements," IEEE Trans. Circuits Syst. II, vol. 51, no. 3, pp. 124–129, Mar. 2004.
[123] X. Zhang and E. I.
El-Masry, "A regulated body-driven CMOS current mirror for low-voltage applications," IEEE Trans. Circuits Syst. II, vol. 51, no. 10, pp. 571–577, Oct. 2004.