THEORY, SYNTHESIS AND IMPLEMENTATION OF CURRENT-MODE CMOS PIECEWISE-LINEAR CIRCUITS USING MARGIN PROPAGATION

By Ming Gu

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
Electrical Engineering
2012

ABSTRACT

Achieving high energy efficiency is a key requirement for many emerging smart sensors and portable computing systems. While digital signal processing (DSP) has been the de facto technique for implementing ultra-low-power systems, analog signal processing (ASP) provides an attractive alternative that can achieve not only high energy efficiency but also high computational density. Conventional ASP techniques are based on a top-down design approach, where proven mathematical principles and related algorithms are mapped onto and emulated by computational primitives inherent in the device physics. An example is the translinear principle, the state-of-the-art ASP technique, which uses the exponential current-to-voltage characteristic to design ultra-low-power analog processors. However, elegant formulations can also result from a bottom-up approach, where device- and bias-independent computational primitives (e.g. current and charge conservation principles) are used for designing "approximate" analog signal processors. The hypothesis of this work is that many signal processing algorithms exhibit an inherent calibration ability, due to which their performance remains unaffected by the use of "approximate" analog computing techniques. In this research, we investigate the theory, synthesis and implementation of high-performance analog processors using a novel piecewise-linear (PWL) approximation algorithm called margin propagation (MP).
The MP principle utilizes only basic conservation laws of physical quantities (current, charge, mass, energy) for computing and is therefore scalable across devices (silicon, MEMS, microfluidics). MP-based processors have additional advantages when implemented using CMOS current-mode circuits, which include: (1) bias-scalability and robustness to variations in environmental conditions (e.g. temperature); and (2) improved dynamic range and faster convergence compared to translinear implementations. We verify our hypothesis using two ASP applications: (a) the design of high-performance analog low-density parity-check (LDPC) decoders for applications in sensor networks; and (b) the design of ultra-low-power analog support vector machines (SVM) for smart sensors. Our results demonstrate that an algorithmic framework for designing margin propagation (MP) based LDPC decoders can be used to trade off BER performance against energy efficiency, making the design attractive for applications with adaptive energy-BER constraints. We have verified this trade-off using an analog current-mode implementation of an MP-based (32,8) LDPC decoder. Measured results from prototypes fabricated in a 0.5 µm CMOS process show that the BER performance of the MP-based decoder outperforms a benchmark state-of-the-art min-sum decoder at SNR levels greater than 3.5 dB, and that it can achieve energy efficiencies better than 100 pJ/bit at a throughput of 12.8 Mbps. In the second part of this study, the MP principle is used for designing an energy-scalable SVM whose power and speed requirements can be configured dynamically without any degradation in performance. We have verified the energy-scaling property using a current-mode implementation of an SVM operating with 8-dimensional feature vectors and 18 support vectors.
The prototype fabricated in a 0.5 µm CMOS process integrates an array of floating-gate transistors that serves as storage for up to 740 SVM parameters, as well as novel circuits designed for interfacing with an external digital processor. These include a novel current-input current-output logarithmic amplifier that achieves a dynamic range of 120 dB while consuming nanowatts of power, and a novel varactor-based, temperature-compensated floating-gate memory that demonstrates a wider programming range than competing designs.

I dedicate this dissertation to my husband, Wei Wang, for his love and support.

ACKNOWLEDGMENT

I am deeply grateful to Dr. Shantanu Chakrabartty for his advice during my Ph.D. studies. I owe my successful transition from control theory to analog integrated circuit design to his guidance, encouragement, support, and constant assurance. As an academic advisor, he was always approachable and ready to help in every possible way to make sure I was at ease while working on my research. I wish him well in all his future endeavors and would be happy to collaborate with him in the future. Many thanks to Dr. Hayder Radha, Dr. Tongtong Li and Dr. Jonathan I. Hall for serving on my Ph.D. committee, and for their feedback, suggestions, and continual interest in my progress. Thanks to Dr. Hayder Radha for his help on our collaborative project and his instruction in information theory. Thanks to Dr. Tongtong Li and Dr. Jonathan I. Hall for helping me with communication theory and error-correcting coding. I would like to thank my lab mates Ravi Krishna Shaga, Tao Feng, Pikul Sarkar and Thamira Hindo. I really enjoyed the opportunity to work with them and had a good time with them during my graduate studies. I am grateful to my friends Rui Lin, Xiaoqin Tang, Lei Zhang, Kiran Misra, Hao Wen and Yun Li for their friendship, support, and motivation. My special thanks go to my husband, Dr.
Wei Wang, for his support during my graduate studies. His love, support, and encouragement have carried me to where I am today. We met, got to know each other, fell in love and got married at MSU. The time we spent together at MSU pursuing our Ph.D.s is the best part of our lives. I also dedicate this dissertation to my parents and grandparents for their support all along.

LIST OF TABLES

Table 1.1  Power dissipation of major components for a 0.6 µm mixed-signal CMOS image sensor [1]
Table 1.2  Milestones in TASP
Table 3.1  Operands of MP units in Fig. 3.4
Table 4.1  Operations in ordinary and logarithmic arithmetic in a digitally implemented LNS processor
Table 5.1  Threshold comparison for AWGN channel under margin propagation and sum-product
Table 5.2  Specifications of the MP-based analog LDPC decoder
Table 5.3  CMOS Digital (top 4) and Analog (bottom 6) Decoders
Table 6.1  Specifications of the current-input current-output CMOS logarithmic amplifier
Table 6.2  Comparison of performance
Table 6.3  Specification of the fabricated prototype

LIST OF FIGURES

Figure 1.1  Illustration of the tradeoffs in (a) DSP and (b) CADSP systems [2, 3]. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.
Figure 1.2  Power consumption comparison for DSP [4, 5] and ASP [6, 7].
Figure 1.3  Basic CMOS circuits for (a) digital half adder based on NAND and NOT gates; (b) digital multiplier based on (a); (c) analog adder; (d) analog multiplier (Gilbert cell).
Figure 1.4  Resolution versus "cost" on using ASP or DSP [2].
Figure 1.5  log(ID) versus VGS, gm versus VGS, and gm/ID versus ID/(W/L) curves for various operation regions [8].
Figure 1.6  Illustration of the proposed concept for designing scalable analog computational systems. The depth of potential wells can be controlled by an external voltage source. At equilibrium the charge re-distributes amongst potential wells to satisfy charge conservation laws.
Figure 1.7  Top-down design procedure for conventional TASP and the bottom-up procedure we propose.
Figure 2.1  MOS symbols
Figure 2.2  An alternating translinear loop of NMOS transistors. The number of clockwise-facing transistors in the loop is the same as the number of counterclockwise-facing transistors.
Figure 2.3  A voltage-in-current-out winner-take-all circuit.
Figure 2.4  A multiple-input translinear element (MITE).
Figure 2.5  Principle of dynamic translinear circuits.
Figure 3.1  Illustration of the reverse water-filling procedure
Figure 3.2  Scaling property of margin propagation
Figure 3.3  The curve of e^(Li − Ln) and its PWL approximation u(Li, Ln)
Figure 3.4  PWL approximation effect of the MP algorithm
Figure 4.1  Top architecture of the computation unit
Figure 4.2  CMOS implementation of the core MP circuit.
Figure 4.3  Die micrograph of the chip.
Figure 4.4  System architecture of the chip.
Figure 4.5  MP circuit for implementing two-operand addition.
Figure 4.6  Addition: z+ − z− computed according to (a) the original log-sum-exp function; (b) MP approximations (simulation); and (c) MP circuits (measurement).
Figure 4.7  Addition: measurement results for z+ − z− with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; (d)-(f) computation errors for (a)-(c).
Figure 4.8  Subtraction: z+ − z− computed by (a) original functions; (b) MP approximations (simulation); and (c) MP circuits (measurement).
Figure 4.9  Subtraction: measurement results for z+ − z− with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; (d)-(f) show the respective approximation errors.
Figure 4.10  MP circuit for implementing four-quadrant operand multiplication
Figure 4.11  Multiplication: z+ − z− computed according to the (a) original function; (b) MP approximations (simulation); and (c) MP circuits (measurement).
Figure 4.12  Multiplication: measurement results for z+ − z− with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; (d)-(f) the respective approximation errors
Figure 4.13  A current mirror implementing the power function in the logarithmic domain.
Figure 4.14  Power: z+ − z− computed according to the log-sum-exp function, the MP approximation (simulation) and the MP circuit (measurement).
Figure 4.15  Noise model for the MP circuit in Fig. 4.2.
Figure 4.16  System architecture and circuit diagram of the MP-based pattern classifier.
Figure 4.17  Classifier: z+ − z− simulated by (a) the log-sum-exp function (simulation); (b) MP approximation (simulation); and measured z+ − z− with the transistors biased in (c) strong inversion; (d) moderate inversion; (e) weak inversion.
Figure 5.1  Factor graph corresponding to a 32-bit (3, 4) LDPC code
Figure 5.2  A (3,6) LDPC message PDF evolution for: (a) sum-product variable-to-check messages; (b) sum-product check-to-variable messages; (c) min-sum variable-to-check messages; (d) min-sum check-to-variable messages
Figure 5.3  Comparison of pair-wise decoding functions for sum-product and min-sum LDPC decoding algorithms
Figure 5.4  Comparison of pair-wise decoding functions for sum-product, min-sum and margin propagation decoding algorithms
Figure 5.5  Comparison of the 3D LDPC check function for (a) sum-product, (b) min-sum and (c) margin propagation with γ = 1.75
Figure 5.6  Possible solutions z for M(L1, L2)
Figure 5.7  A (3,6) LDPC message PDF evolution for: (a) margin propagation variable-to-check messages; (b) margin propagation check-to-variable messages
Figure 5.8  (a) Effect of the hyper-parameter γ on the L0 norm of the check-to-variable messages (for different AWGN channel conditions); (b) effect of the hyper-parameter γ on BER performance (for different AWGN channel conditions)
Figure 5.9  Comparison of BER for LDPC decoders based on different algorithms, achieved by (3,6)-regular LDPC codes of length 256, 10^4, and 10^6; the maximum iteration number is set to 20
Figure 5.10  Comparison of BER for LDPC decoders based on different algorithms, achieved by a length-10^6 (3,6)-regular LDPC code and a (2.5,5) irregular LDPC code; the maximum iteration number is set to 20.
Figure 5.11  L0 norm comparison of sum-product, min-sum and margin propagation for iterations 1, 2, 3 for check-to-variable messages: (a),(b),(c); and variable-to-check messages: (d),(e),(f)
Figure 5.12  Architecture of the check node c1 in Fig. 5.1
Figure 5.13  Circuits for implementing the check node module using the MP algorithm: (a) design used in this work; and (b) an alternate low-voltage circuit.
Figure 5.14  Architecture of variable node v1 in Fig. 5.1
Figure 5.15  Circuit to implement the variable node
Figure 5.16  System-level architecture of the (32,8) LDPC decoder
Figure 5.17  Die microphotograph of the chip
Figure 5.18  Experimental setup of the chip
Figure 5.19  Timing diagram of digital signals used to control the operation of the LDPC decoder chip
Figure 5.20  Comparison of BER of the MP-based LDPC decoder using software simulation and measurements from the fabricated prototypes.
Figure 5.21  Comparison of measured BER performances for different values of the hyper-parameter γ.
Figure 5.22  Comparison of BERs of the measured MP-based LDPC decoder and a benchmark min-sum LDPC decoder.
Figure 5.23  Measured BER performances of the MP-based LDPC decoder (Vγ = 0.74 V) for different decoding times (throughput)
Figure 6.1  The circuit diagram of the MP-based SVM
Figure 6.2  Multiplication computed by (a) the "log-sum-exp" function; (b) the MP-based function with γ = 1.2 (in Matlab); (c) the MP-based function with γ = 1.2 (in Cadence)
Figure 6.3  The system architecture and basic circuit of floating-gate memory cells.
Figure 6.4  The basic concept of the translinear Ohm's law.
Figure 6.5  Schematic of the translinear logarithmic current-to-current converter.
Figure 6.6  Proposed logarithmic amplifier: (a) basic circuit of the translinear stage; (b) basic circuit of the input stage; (c) basic circuit of the reference stage.
Figure 6.7  Basic circuit of (a) the current reference; (b) the current DAC.
Figure 6.8  Proposed floating-gate based current reference and current memory cells
Figure 6.9  (a) Cross-sectional architecture of the MOS varactor; (b) layout of the memory cell in Fig. 6.8
Figure 6.10  Floating-gate transistor current with respect to the current-starving voltage
Figure 6.11  Floating-gate transistor current with respect to Vdd_fg
Figure 6.12  Floating-gate transistor current with respect to Vx
Figure 6.13  Precise programming procedure for floating-gate transistors
Figure 6.14  Die microphotograph of the chip.
Figure 6.15  (a) Measured current response and (b) relative error for the current reference.
Figure 6.16  Measurement results for the logarithmic current conversion.
Figure 6.17  (a) Noise of Iout versus Iin; (b) SNR of Iout
Figure 6.18  Measured response plotting (Iout(T1) − Iout(T2))/(Iout(T1) − Iout(T3)) under different temperatures.
Figure 6.19  Measured response plotting bandwidth with respect to current.
Figure 6.20  Die microphotograph of the chip
Figure 6.21  The C-V curve of the MOS capacitor
Figure 6.22  Measured response plotting (a) I1/I2, (b) I1/I3, (c) I1/I4, (d) I1/I5, (e) I1/I6 with respect to Vx under 28°C, 33°C, 38°C, and 43°C.
Figure 6.23  Measured response plotting temperature-compensated current ratios under Vx of (a) 1.2 V; (b) 1.3 V.
Figure 7.1  Improved MP circuit with cascode current mirrors
Figure 7.2  Decoupled MP circuit to prevent gate leakage

Chapter 1

Introduction

Digital signal processing (DSP) has been prevalent in computing systems for the past five decades. Since its beginnings in the early 1950s, progress in the field of DSP has paralleled the growth of digital computers, as signified by the celebrated Moore's law [9]. Key milestones in the area of DSP include the development of the Fast Fourier Transform (FFT) in 1965 [10], followed by the implementation of DSP algorithms in integrated circuit (IC) technology in the 1980s. The use of analog signal processing (ASP) can be traced back earlier than its digital counterpart, primarily driven by the need to process naturally occurring signals (image, speech), which are inherently "analog". However, due to rapid progress in digital IC technology and the limitations imposed by analog artifacts, analog computers were gradually superseded by digital computers in the 1970s. Many experts even predicted the demise of ASP in the early 1980s.
However, interest in ASP was renewed a decade ago, primarily due to the need for more energy-efficient and high-density signal processing systems. The inspiration for achieving this using ASP techniques came from biological computing systems, which are inherently "analog". Biological neural processors excel at solving hard problems in sensory signal processing (video and speech) by sustaining high computational throughput while keeping energy dissipation at a minimal level. At the core of these neural processors, and similarly of analog signal processors, are massively parallel circuits which exploit computational primitives inherent in device physics to achieve high energy-efficiency and high computational density. In this dissertation, we will follow the ASP principle as a guideline to investigate energy-efficient computing systems. From a practical point of view there are additional advantages of ASP over DSP techniques, which are listed below:

(a) Practical limitations in analog-to-digital converter (ADC) design.

Since all naturally occurring signals are analog, the ADC is an essential element of any DSP system. Many DSP applications need to recover signals over a wide dynamic range or from a noisy background, which necessitates high-speed and high-resolution ADCs. However, designing ADCs that combine high resolution, high performance and low power dissipation remains a challenge, and technology scaling has not benefited ADCs as much as it has benefited DSP efficiency: ADC resolution has improved at only about 1.5 bits every 5 years [11]. As a consequence, the long ADC design time has become an increasingly severe constraint on overall system design as well. One efficient solution is cooperative analog/digital signal processing (CADSP), proposed by Hasler in [2, 3]. By utilizing ASP at the front-end, the ADC complexity, and hence the overall system complexity, is reduced significantly [2, 3]. Fig. 1.1 shows the tradeoff between CADSP and DSP.
(b) Limited power consumption.

An increasing demand for portability is being placed on electronic devices, and low power dissipation is crucial in these applications to obtain long battery life.

Figure 1.1: Illustration of the tradeoffs in (a) DSP and (b) CADSP systems [2, 3]. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

The power consumption of DSP, measured in mW/MIPS, has halved roughly every 18 months [4], keeping pace with Moore's law [9]. Even so, a fixed power budget still constrains the increasing proliferation of portable electronics. In contrast, ASP tends to be more power efficient, since an ASP block does not require an ADC (as shown in Fig. 1.1), which has proven to be a major energy consumer, especially in scaled-down systems. Table 1.1 lists the power consumption of the key components of a state-of-the-art 0.6 µm CMOS image sensor with a mixed-signal processor array [1]. It can be seen that the power consumption of the ADC is a major factor and occupies more than one-third of the total. Furthermore, the arithmetic unit of a DSP, which consists of digital multipliers and adders, also consumes more energy than its ASP counterpart, mainly because ASP uses far fewer transistors. As a consequence, even custom digital solutions are inadequate for ultra-low-power applications [12]. In Fig. 1.2 the power consumption of DSP and ASP are compared.

Table 1.1: Power dissipation of major components for a 0.6 µm mixed-signal CMOS image sensor [1]
    Supply voltage               5 V
    Processor unit               0.25 mW
    Sensor, readout and biasing  2.25 mW
    ADC                          21 mW
    FIFO memory                  83 mW

The black square represents the power consumption of a state-of-the-art DSP core [5]; it appears to follow Gene's law.
The star represents the power consumption of an analog, floating-gate integrated chip [6], and the circle represents the power consumption of a CMOS analog processor that can be used for long-term, self-powered mechanical usage monitoring [7].

Figure 1.2: Power consumption comparison for DSP [4, 5] and ASP [6, 7].

(c) Size constraints.

Circuit size constraints also favor analog VLSI circuits. It is often possible to perform complex operations using device physics with only a few transistors. Fig. 1.3 compares basic circuit schematics for digital/analog adders and digital/analog multipliers implemented with metal-oxide-semiconductor field-effect transistors (MOSFETs). The number of transistors in the digital implementations clearly exceeds that in the analog implementations.

Figure 1.3: Basic CMOS circuits for (a) digital half adder based on NAND and NOT gates; (b) digital multiplier based on (a); (c) analog adder; (d) analog multiplier (Gilbert cell).

(d) Speed requirements.

In recent years great effort has been devoted to real-time computation, which is applied throughout three-dimensional (3D) graphics systems, high-speed communication systems, image sensors, etc. For these applications, the drawbacks of DSP arise largely from the fact that all operations in a single arithmetic unit have to be carried out in sequence. In contrast, ASP has a simpler implementation and an inherently parallel architecture for computation. Thus, when signal processing has to be executed at high speed or high frequency, ASP outperforms DSP on almost all metrics.
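The 18-month halving of DSP power efficiency quoted above can be turned into a back-of-the-envelope projection. The sketch below is ours; the 1 mW/MIPS baseline is an assumed figure for illustration only:

```python
# Power-efficiency trend quoted in the text: mW/MIPS halves every 18 months.
# p0 is an assumed (illustrative) baseline, not a measured value.
def mw_per_mips(years, p0=1.0, halving_period=1.5):
    """Projected mW/MIPS after `years`, halving every `halving_period` years."""
    return p0 * 2.0 ** (-years / halving_period)

# After 15 years (10 halvings) the same workload costs 1/1024 of the baseline.
projection = mw_per_mips(15.0)
```

This compounding is exactly why a fixed power budget becomes less binding for digital over time, yet (as the text argues) still leaves a large gap relative to ASP.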
Although there are compelling reasons why ASP remains indispensable for scientific computation, ASP still has problems. One primary disadvantage is the lack of flexibility. By directly exploiting the computational primitives inherent in device physics, complicated non-linear functions can be implemented in analog using significantly fewer transistors than in a digital counterpart. However, this methodology only works for specific computations. For instance, the subthreshold translinear characteristic of the MOS transistor can only be used to solve log-domain computations, whereas DSP can be programmed to solve almost any computation. Another disadvantage is the high noise sensitivity of analog implementations. Analog circuits are more vulnerable to noise than digital circuits: numbers in analog computation are represented by continuous voltages or currents, which are easily degraded by noise. A digital system, due to quantization and the bi-valued logic representation (either one or zero) of discrete signals, is more robust; unless the noise level is high enough to flip a logical bit, it operates without any effect on the precision of the result. The effective resolution of the computing system is also a major concern for signal processors. To evaluate the effort required to obtain a designated resolution, a comprehensive "cost" metric is proposed in [2, 13]. It is calculated from a variety of metrics, ranging from circuit performance (size, power dissipation, etc.) to the human effort and other resources involved in design and fabrication. The comparison of cost versus resolution is shown in Fig. 1.4. It can be seen that a "threshold" exists.
When the required computation resolution is above the threshold, it is less "expensive" to use a digital implementation; otherwise, an analog implementation is more economical.

Figure 1.4: Resolution versus "cost" on using ASP or DSP [2].

In summary, ASP is suitable for applications where
• precision/resolution of computation is not strictly required;
• a stringent budget for power, area, or number of transistors is imposed on the system design (e.g. image sensors);
• high-speed or fully-parallel computation is required (e.g. analog decoders).

1.1 Motivation for the proposed research

The underlying principle of analog computation is to represent numbers as physical quantities such as current, voltage, charge, time, etc., and then use the physics of the device for computing (e.g. addition, subtraction, multiplication, etc.). Amongst all analog computation techniques, translinear analog signal processing (TASP) is the most widely used and the most intensively researched. Since the TASP technique was introduced by Gilbert in 1975 [14], it has been applied to the synthesis of neural, filter and radio-frequency circuits. TASP utilizes the linear relation between the transconductance and the current of the device, and can be readily implemented with MOSFETs and bipolar transistors.
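As a numerical illustration of the translinear principle (our sketch, not a circuit from this dissertation): in weak inversion each device drops a gate-source voltage proportional to the logarithm of its current, so Kirchhoff's voltage law around an alternating loop forces the product of clockwise currents to equal the product of counterclockwise currents. The device parameters I0, n and UT below are illustrative values:

```python
import math

# Simplified subthreshold MOS law: I = I0 * exp(VGS / (n * UT)).
# I0, n, UT are illustrative device parameters, not measured values.
I0, n, UT = 1e-15, 1.0, 0.0259

def vgs(i):
    """Gate-source voltage needed to carry current i in weak inversion."""
    return n * UT * math.log(i / I0)

# Alternating four-device translinear loop: VGS1 + VGS3 = VGS2 + VGS4.
# Fixing I1, I2, I3 and solving for I4 yields a multiply/divide: I4 = I1*I3/I2.
I1, I2, I3 = 10e-9, 5e-9, 20e-9
V4 = vgs(I1) + vgs(I3) - vgs(I2)    # KVL around the loop
I4 = I0 * math.exp(V4 / (n * UT))   # invert the device law
```

Note that the thermal voltage UT cancels between the two sides of the loop equation, which is the origin of the temperature insensitivity listed among the TASP advantages below.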
In Table 1.2 we summarize the different variants and applications of the TASP principle that have been reported in the literature:

Table 1.2: Milestones in TASP
(Columns: year; device characteristic utilized; operation region; analog computation circuits synthesized.)

Static TASP:
  1975 [14]  I ∝ e^V                      weak inversion    multiplier, divider [15], maximum [16]
  1991 [17]  I ∝ V^2                      strong inversion  square, square-root, polynomial, multiplier-divider [18]
  2001 [19]  I ∝ e^(w1·V1 + ... + wN·VN)  weak inversion    square, square-root, divider, magnitude of vector

Dynamic TASP:
  1979 [20]  C·UT·(dIDS/dt) = Icap·IDS    weak inversion    first-order filter
  1997 [21]  C·UT·(dIDS/dt) = Icap·IDS    weak inversion    RMS-DC converter
  1998 [22]  C·UT·(dIDS/dt) = Icap·IDS    weak inversion    second-order filter

The advantages of TASP circuits are as follows:
• A high functional density, which explains the extensive application of translinear circuits in neural networks.
• Insensitivity to temperature, due to the cancellation of UT inside the translinear loop.
• Low power consumption due to the absence of passive components; translinear filter implementations require only transistors and capacitors.
• A large dynamic range. The trend toward low supply voltages decreases the dynamic range of filters implemented with conventional circuit techniques, for which companding may be a solution; in translinear filters, companding is performed implicitly as a consequence of the exponential behavior of the translinear devices.
• High controllability. Most parameters can be tuned through a current, which follows directly from the translinear principle.

However, there are some disadvantages of TASP circuits when implemented in CMOS. To understand these disadvantages, we first briefly describe the operating regions of a MOS transistor. Fig. 1.5 shows the current versus gate-voltage response of a typical n-type MOSFET.
The characteristic consists of three operating regions: (a) weak inversion, where the current depends exponentially on the gate voltage; (b) strong inversion, where the current depends quadratically on the gate voltage; and (c) moderate inversion, the intermediate region between strong and weak inversion. Operating in each of these regions has specific advantages, summarized below:
• In the weak inversion region, the minimum power consumption is achieved (low ID).
• In the strong inversion region, higher operational speed can be obtained (high ID, high gm).
• In the moderate inversion region, an optimal compromise between power consumption and speed can be achieved.

The power-consumption (energy-efficiency) versus speed trade-off can also be expressed using the gm/ID metric [8], where gm is the input transconductance of the MOS transistor and ID is the drain current flowing through it. The parameter gm effectively determines the speed of the transistor and ID determines its power consumption. gm/ID is chosen as a metric because it reflects the efficiency of translating current (i.e. power) into transconductance: the greater the gm/ID value, the greater the transconductance obtained at a constant current [8]. Fig. 1.5 displays gm/ID versus the drain current ID for MOS transistors in the three operating regions, and shows that the weak inversion region yields the highest gm/ID and thus the best power efficiency. As can be observed from Table 1.2, TASP circuits are designed to operate in only one specific operating region. They therefore lack flexibility, i.e. the ability to trade off performance across operating regions, which is a severe problem when a high dynamic range is required. This limits the application of TASP circuits in designing more complex and flexible analog signal processors.
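The gm/ID trade-off across the three regions can be sketched numerically with a common EKV-style interpolation between the weak- and strong-inversion asymptotes. The slope factor n and the specific interpolation formula below are illustrative assumptions, not taken from this dissertation:

```python
import math

# Assumed slope factor and thermal voltage at room temperature (illustrative).
n, UT = 1.3, 0.0259

def gm_over_id(IC):
    """EKV-style interpolation of gm/ID versus inversion coefficient IC
    (IC << 1: weak inversion, IC ~ 1: moderate, IC >> 1: strong inversion)."""
    return 1.0 / (n * UT * (0.5 + math.sqrt(0.25 + IC)))

weak, moderate, strong = gm_over_id(0.01), gm_over_id(1.0), gm_over_id(100.0)
# Weak inversion approaches the 1/(n*UT) ceiling (~29.7 S/A with these numbers);
# gm/ID, and hence power efficiency, falls as the device moves into strong inversion.
```

The monotonic drop of gm/ID with increasing inversion coefficient is the quantitative version of the bullet list above: bias deep in weak inversion for efficiency, deep in strong inversion for speed.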
To solve this problem, we propose ASP circuits that depend only on universal conservation principles such as Kirchhoff's current law, as illustrated in Fig. 1.3 (c). By construction, ASP circuits based on universal conservation principles have several advantages over their TASP counterparts:
• Independence of the transistor operating region, since the underlying principle follows physical laws (e.g., universal conservation laws).
• Wide dynamic range, since operation is not restricted to a single region.
• Diverse ways of implementation (e.g., charge-mode, voltage-mode, time-mode, etc.). For instance, Fig. 1.6 illustrates how reverse water-filling can naturally be implemented using a charge-coupled device (CCD). Initially, a fixed amount of charge (shown by the shaded area in Fig. 1.6a) is introduced into the CCD. As the potential barriers (wells) are raised adiabatically, the charge re-distributes itself amongst the wells (Fig. 1.6b). At equilibrium (Fig. 1.6c), the total charge accumulated in the potential wells (the residue) and the normalization level are determined by charge conservation. Even though potential wells have been used in this example, the procedure is applicable to other physical quantities (current, mass, energy) implemented by different analog structures (MEMS, microfluidic devices). However, since current-mode circuits are the easiest and most mature implementation medium, in this dissertation we adhere to current-mode implementations.
• Insensitivity to environmental variations (e.g., temperature).
However, using only universal conservation laws restricts the choice of computations that can be implemented exactly. That leads to the main hypothesis of this dissertation: many signal processing algorithms exhibit an inherent calibration ability, and hence their performance remains unaffected by the use of the "approximate" analog computing techniques proposed in this study.
Also, approximations at the circuit level require both bottom-up and top-down synthesis of signal processing algorithms, as illustrated in Fig. 1.7 and compared there with the conventional top-down design flow for TASP-based synthesis. The chart in Fig. 1.7 shows that many of the approximation techniques have to be simulated, verified and calibrated at the algorithmic and system level before prototyping in hardware. However, the algorithmic and system levels of the design flow offer more flexibility in implementing a given system and hence can overcome performance limitations due to low-level approximations. In this dissertation, we illustrate this using two specific applications: (a) designing high-performance analog LDPC decoders; and (b) designing ultra energy-efficient analog SVMs.

1.2 Scientific contributions

This dissertation is motivated by improving the energy efficiency of computation implemented with ASP, while resorting to physical conservation laws to ensure bias-scalability. The major scientific contributions are listed below:
(a) An analog signal processing (ASP) algorithm named margin propagation (MP) is proposed as an efficient piecewise-linear (PWL) approximation to the "log-sum-exp" function, along with its current-mode analog circuit implementation. Because its underlying principle is a conservation law, the MP algorithm can be mapped easily onto various hardware substrates. The current-mode MP circuit implementation is designed to be bias-scalable, a characteristic that is beneficial for computational circuits since it allows the design to adapt to applications with different specification requirements.
(b) A wide range of mathematical functions is synthesized with MP-based bias-scalable analog computational circuits. The synthesized functions include addition, subtraction, multiplication, division, power and polynomial computation, etc.
MP circuits operate in the log-likelihood domain, where complex functions such as multiplication, division and square-root are mapped onto simpler operations like addition, subtraction or scaling, making the circuit design less complex. The circuits can also operate over a wider dynamic range, since the transistors can span different biasing regions.
(c) A 100 pJ/bit, (32,8) CMOS analog low-density parity-check (LDPC) decoder based on MP is designed and implemented. One fascinating property is its capability of trading off BER performance against energy efficiency through the tunable hyper-parameter γ in the MP algorithm. The prototyped MP-based LDPC decoder achieves an energy efficiency of 100 pJ/bit, while an optimal configuration can also deliver up to 3 dB improvement in BER compared to the benchmark min-sum LDPC decoder. This design is instructive for high-energy-efficiency, high-error-correcting-performance analog LDPC decoder design.
(d) An analog energy-scalable MP-based support vector machine (SVM) is designed and implemented. The prototype stores 2052 SVM parameters in a floating-gate memory array and is fully programmable. Due to the use of MP computation circuits, the prototyped SVM is energy-scalable.
(e) A novel current-input, current-output logarithmic amplifier circuit is designed and implemented. This design, based on the translinear Ohm's law, directly generates currents as a logarithmic function of the input current; it exhibits a dynamic range of 120 dB and a temperature sensitivity of 230 ppm/K, while consuming less than 100 nW of power. The design brings three major benefits to logarithmic amplifier implementation: high dynamic range, low power consumption, and temperature compensation.
(f) A novel varactor-driven temperature compensation circuit for CMOS floating-gate current memories is designed and implemented.
By adapting the floating-gate capacitance with a voltage-controlled capacitor (varactor), this design enables the temperature-dependent factors to be effectively canceled, achieving a temperature sensitivity of 150 ppm/K. This design is instructive for implementing high-density, temperature-compensated floating-gate current memories.

1.3 Dissertation Organization

The dissertation is organized as follows. Other ASP research is briefly introduced in Chapter 2. The theory and properties of the MP algorithm are presented in detail in Chapter 3. The circuit implementation and the synthesized analog computation circuits are introduced in Chapter 4. The applications of MP-based PWL circuits are described in Chapters 5 and 6.
In Chapter 2, the device models of the MOS transistor are introduced first as background material, followed by a thorough survey of state-of-the-art ASP research, i.e., a survey of translinear analog signal processing circuits.
Chapter 3 presents the theory underlying the MP algorithm. Fundamental definitions and mathematical properties of the MP algorithm are given first, followed by a proof that the MP algorithm is a PWL approximation to the "log-sum-exp" function. These definitions and properties serve as the theoretical foundation for the later chapters. Since MP is a PWL approximation to the "log-sum-exp" function, a wide variety of scientific computations can be synthesized with the MP algorithm.
In Chapter 4, the basic current-mode circuit implementation of the MP algorithm is presented. Based on the basic MP circuit block, a variety of analog computation circuits are synthesized, including addition, subtraction, multiplication, division, power, inner-product, polynomial, and tanh. The PWL approximation behavior of the MP-based circuits is demonstrated through simulation results.
Chapter 5 presents an LDPC decoder implemented with the MP circuits.
This chapter is divided into theory sections and implementation sections. The theory sections start with the metrics used to evaluate the performance of LDPC decoders and a significant property of the LDPC decoder: its performance trade-off. Density evolution is introduced as an analytical tool and is used to evaluate the message-sparsity performance of the MP-based LDPC decoding algorithm and two conventional algorithms. Subsequent simulation results verify the density evolution analysis as well as the BER performance. It is also demonstrated that the trade-off capability of the MP-based LDPC decoder can be realized elegantly by simply tuning the hyper-parameter γ. The implementation sections start with the circuits for the basic modules, after which the system architecture is introduced and the test setup and measurement results are presented.
Chapter 6 presents another application of the MP-based PWL circuits: an analog SVM. In this application, the vector inner product and the scalar addition, multiplication and squaring operations are all synthesized with MP-based PWL circuits. The core computation, the system architecture and the primary circuits are introduced in sequence, followed by the simulation results and the measurement results.
Chapter 7 offers conclusions and outlook.

Figure 1.5: log(ID) versus VGS, gm versus VGS, and gm/ID versus ID/(W/L) curves for the various operating regions [8].

Figure 1.6: Illustration of the proposed concept for designing scalable analog computational systems. The depth of the potential wells can be controlled by an external voltage source. At equilibrium, the charge re-distributes amongst the potential wells to satisfy charge conservation.
Figure 1.7: Top-down design procedure for conventional TASP and the bottom-up procedure we propose.

Chapter 2
Device Models and Translinear Analog Signal Processing

2.1 Device Models of MOS Transistor

Fig. 2.1 illustrates the terminals of MOS transistors. Since the models of n-type and p-type MOS transistors differ only in the polarity of the current flow through the device, their mathematical models differ only in sign; thus only NMOS transistors are discussed hereinafter.

Figure 2.1: MOS symbols, showing the gate (G), drain (D), source (S) and body (B) terminals of the NMOS and PMOS devices.

The operation of a MOSFET can be separated into three different modes, depending on the voltages at the terminals. The most frequently used MOSFET models describe operation in strong inversion (also called "above-threshold operation") and weak inversion (also called "subthreshold conduction"). The strong inversion model is described as follows:

IDS = (µn Cox / (2κ)) (W/L) [κ(VG − VT) − VS]².   (2.1)

In equation (2.1), W and L denote the width and length of the transistor respectively, and W/L is called the aspect ratio. µn represents the mobility of electrons, and Cox denotes the gate oxide capacitance per unit area. VT represents the threshold voltage. κ is a parameter reflecting the effectiveness of the gate potential in controlling the channel current, and is usually approximated as κ ≈ 1 in strong inversion. According to (2.1), the relationship between the current and the controlling voltages follows a "square" law. The weak inversion model is described as:

IDS = I0 e^{(1−κ)VBS/UT} e^{κVGS/UT} (1 − e^{−VDS/UT} + VDS/V0).
(2.2)

In this equation, VGS is the gate-to-source potential, VDS is the drain-to-source potential, VBS is the bulk-to-source potential (accounting for the body effect), I0 is the zero-bias current of the given device, V0 is the Early voltage, and UT is the thermal voltage. For devices in subthreshold saturation, which holds for VDS ≥ 4UT, and neglecting the Early effect and the body effect, equation (2.2) simplifies to

IDS = I0 e^{κVGS/UT}.   (2.3)

According to (2.3), the current is exponential in the controlling potential. We can also deduce the input transconductance in the subthreshold saturation region,

gm = ∂IDS/∂VGS = κ·IDS/UT,   (2.4)

and the output conductance in the subthreshold saturation region,

gd = ∂IDS/∂VDS = IDS/V0.   (2.5)

It can be seen that gm and gd are linear in the current IDS in this operating region.
No model of the MOSFET was uniform across all the operating regions until the EKV model was developed by C. C. Enz, F. Krummenacher and E. A. Vittoz in 1995. With the EKV model, the currents, transconductances, intrinsic capacitances and thermal noise can all be expressed continuously across all operating regions, including weak inversion, moderate inversion, strong inversion, conduction, and saturation. The EKV model is described by

IDS = IS log²(1 + e^{(κ(VG − VT) − VS)/2UT}) − IS log²(1 + e^{(κ(VG − VT) − VD)/2UT}).   (2.6)

In this equation, IS is a specific current, which depends essentially on the aspect ratio W/L and the mobility µn.

2.2 Translinear Analog Signal Processing

Figure 2.2: An alternating translinear loop of NMOS transistors, in which the number of clockwise-facing junction transistors equals the number of counterclockwise-facing ones.

The translinear principle was defined by Gilbert in 1975 [14], originally for bipolar transistors. Translinear circuits exploit the exponential current-voltage characteristic of the bipolar transistor.
With the emergence of MOSFET circuits, the translinear principle was extended to MOS circuits, since MOS transistors in the subthreshold region also exhibit an exponential current-voltage characteristic, as shown in (2.3). The translinear principle states that, in a closed loop of translinear devices comprising an equal number of clockwise-facing and counterclockwise-facing junction devices, the currents satisfy

∏_{j∈CW} (Ij/Sj) = ∏_{j∈CCW} (Ij/Sj),   (2.7)

where CW and CCW denote the sets of clockwise-facing and counterclockwise-facing junction transistors respectively, Ij denotes the IDS of transistor j, and Sj denotes its aspect ratio, i.e., the W/L ratio. Fig. 2.2 shows a conceptual alternating translinear loop comprising NMOS transistors. Note that the translinear principle is derived from Kirchhoff's voltage law around the loop:

Σ_{j∈CW} VGS(j) = Σ_{j∈CCW} VGS(j).   (2.8)

In 1991, the translinear principle was extended by Seevinck and Wiegerink [17]. They generalized it with the statement that the transconductance of a device is linear in the controlling voltage, rather than linear in the current as Gilbert intended originally. This generalized translinear principle exploits the square-law MOS characteristic shown in equation (2.1); correspondingly, the MOS transistors in their translinear loops are biased in strong inversion, and the principle takes the form

Σ_{j∈CW} √(Ij/Sj) = Σ_{j∈CCW} √(Ij/Sj).   (2.9)

It has been suggested that this generalized principle be renamed the voltage-translinear principle to prevent possible confusion [23]. However, neither principle describes the behavior of translinear loops when any of the transistors enters the moderate inversion region, where the device is neither in weak inversion nor in strong inversion, but in between.
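The loop equation (2.7) can be checked numerically for subthreshold devices: pick gate-source voltages satisfying the KVL constraint (2.8), compute each current from the exponential law (2.3), and compare the clockwise and counterclockwise products of I/S. The device constants below are illustrative assumptions, not measured values:

```python
import math

I0 = 1e-12   # zero-bias current (A), assumed
KAPPA = 0.7  # subthreshold slope factor, assumed
UT = 0.0259  # thermal voltage (V)

def ids(vgs, s=1.0):
    """Subthreshold saturation current, eq. (2.3), scaled by aspect ratio s."""
    return s * I0 * math.exp(KAPPA * vgs / UT)

# Gate-source voltages chosen so that KVL (2.8) holds: sum(CW) == sum(CCW)
v_cw, v_ccw = [0.20, 0.35], [0.30, 0.25]
s_cw, s_ccw = [2.0, 1.0], [1.0, 4.0]

prod_cw = 1.0
for v, s in zip(v_cw, s_cw):
    prod_cw *= ids(v, s) / s
prod_ccw = 1.0
for v, s in zip(v_ccw, s_ccw):
    prod_ccw *= ids(v, s) / s

# Translinear principle (2.7): the two products of I/S agree
print(prod_cw, prod_ccw)
```

The aspect ratios cancel out of each I/S term, so the equality holds for any choice of Sj as long as the voltages obey KVL, which is exactly the content of (2.7).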
In this region, the current-voltage characteristic follows neither the exponential form of equation (2.3) nor the square-law form of equation (2.1).
The EKV model, described in (2.6), provides a model for the current-voltage characteristics of the MOS transistor at all levels of inversion, including moderate inversion. In 2008, Minch presented a generalized translinear principle for alternating loops of saturated MOS transistors that is valid at all levels of inversion, based on the EKV model [24]. This generalized principle reduces to the conventional translinear principle when all transistors in the loop are biased in weak inversion, and to the voltage-translinear principle when all transistors are biased in strong inversion. The current-voltage characteristic of a saturated NMOS transistor with its bulk tied to ground is modeled as

Isat = S·IS·log²(1 + e^{(κ(VG − VT0) − VS)/2UT}),   (2.10)

where S = W/L denotes the aspect ratio, κ = Cox/(Cox + Cdep) denotes the subthreshold slope factor, VT0 denotes the zero-bias threshold voltage, UT = kT/q denotes the thermal voltage, and IS = 2µnCoxUT²/κ denotes the transistor's specific current. From this equation it follows directly that

κ(VG − VT0) − VS = 2UT log(e^{√(Isat/(S·IS))} − 1).   (2.11)

When the transistor is in weak inversion, expanding e^{√(Isat/(S·IS))} in a Maclaurin series and retaining the constant and linear terms gives the approximation

κ(VG − VT0) − VS ≈ UT log(Isat/(S·IS)).   (2.12)

When the transistor is in strong inversion, the −1 is negligible compared to e^{√(Isat/(S·IS))}, and therefore

κ(VG − VT0) − VS ≈ 2UT √(Isat/(S·IS)).   (2.13)

Applying (2.12) and (2.13) to the Kirchhoff voltage-law constraint (2.8) recovers the weak inversion translinear principle (2.7) and the strong inversion voltage-translinear principle (2.9) respectively, while equation (2.11) provides the continuous model that also covers moderate inversion.
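These asymptotes can be checked numerically. The sketch below evaluates the exact inverse relation (2.11) against the weak inversion approximation (2.12) and the strong inversion approximation (2.13) at extreme inversion levels (UT is assumed at room temperature):

```python
import math

UT = 0.0259  # thermal voltage at ~300 K (V)

def v_exact(r):
    """Eq. (2.11): kappa*(VG - VT0) - VS as a function of r = Isat/(S*IS)."""
    return 2 * UT * math.log(math.exp(math.sqrt(r)) - 1.0)

def v_weak(r):
    """Eq. (2.12): weak inversion approximation."""
    return UT * math.log(r)

def v_strong(r):
    """Eq. (2.13): strong inversion approximation."""
    return 2 * UT * math.sqrt(r)

r_wi, r_si = 1e-6, 1e4   # deep weak / deep strong inversion levels
print(v_exact(r_wi), v_weak(r_wi))   # nearly identical in weak inversion
print(v_exact(r_si), v_strong(r_si)) # nearly identical in strong inversion
```

At intermediate inversion levels (r near 1) neither approximation is accurate and only the exact expression (2.11) applies, which is precisely the moderate inversion gap the EKV-based principle closes.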
Based on the aforementioned translinear principles, continuous-time translinear analog signal processing (TASP) systems are constructed. TASP circuits can be divided into two major classes: static translinear circuits and dynamic translinear circuits.
Static translinear circuits exploit the exponential relation between voltage and current to realize static linear and nonlinear transfer functions, such as products, quotients, arbitrary fixed power laws, etc. [15, 25, 18, 26]. With such operations, we can implement signal-processing algorithms involving auto-correlation, cross-correlation, signal-energy computation, least-mean-square (LMS) error metric computation, and the like [23, 25]. For instance, the Gilbert multiplier [15] shown in Fig. 1.3 (d) is a static translinear circuit. For the circuit of Fig. 1.3 (d):

I1+ = Ib / (1 + e^{−(V1+ − V1−)/UT}),  I1− = Ib / (1 + e^{(V1+ − V1−)/UT}),   (2.14)
I2+ = I1+ / (1 + e^{−(V2+ − V2−)/UT}),  I2− = I1+ / (1 + e^{(V2+ − V2−)/UT}),   (2.15)
I3+ = I1− / (1 + e^{(V2+ − V2−)/UT}),  I3− = I1− / (1 + e^{−(V2+ − V2−)/UT}).   (2.16)

If we denote (V1+ − V1−)/UT = x and (V2+ − V2−)/UT = y, we obtain

Iout+ = I2+ + I3+ = Ib (e^x e^y + 1) / ((e^x + 1)(e^y + 1)),   (2.17)
Iout− = I2− + I3− = Ib (e^x + e^y) / ((e^x + 1)(e^y + 1)).   (2.18)

Thus,

Iout+ − Iout− = Ib tanh(x/2) tanh(y/2).   (2.19)

Using the Taylor series expansion tanh(z) ≈ z for small z,

Iout ≈ Ib · (x/2) · (y/2)   (2.20)
     ∝ x · y.   (2.21)

Another example of a static translinear circuit is the "winner-take-all" circuit, which can be used to find the maximum among a set of inputs [16]. A voltage-in, current-out winner-take-all circuit is shown in Fig. 2.3. In this circuit, the current through each branch follows:

Ii = (e^{κVi/UT} / Σ_{j=1}^{N} e^{κVj/UT}) · N·Ib.   (2.22)

Due to the fast decay of the exponential, at equilibrium the branch with the maximum input voltage takes nearly all of the current, while the currents in the other branches decay toward zero. Consequently, the winner (maximum) is easy to pick out.
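The derivation (2.17)-(2.19) can be checked numerically. The sketch below recomputes the differential output of the Gilbert multiplier from the branch-current expressions and compares it with the closed form Ib·tanh(x/2)·tanh(y/2) and, for small inputs, with the ideal product; the bias current is an illustrative value:

```python
import math

IB = 1e-6  # tail bias current (A), illustrative

def gilbert_diff(x, y):
    """Differential output Iout+ - Iout- from eqs. (2.17)-(2.18)."""
    ex, ey = math.exp(x), math.exp(y)
    i_plus = IB * (ex * ey + 1) / ((ex + 1) * (ey + 1))
    i_minus = IB * (ex + ey) / ((ex + 1) * (ey + 1))
    return i_plus - i_minus

# Closed form (2.19): Ib * tanh(x/2) * tanh(y/2)
x, y = 0.8, -0.5
print(gilbert_diff(x, y), IB * math.tanh(x / 2) * math.tanh(y / 2))

# Small-signal regime (2.20)-(2.21): output approximates Ib * x * y / 4
x, y = 0.05, 0.04
print(gilbert_diff(x, y), IB * x * y / 4)
```

The first comparison is an exact algebraic identity, since (e^x e^y + 1) − (e^x + e^y) factors into (e^x − 1)(e^y − 1); the second holds only while |x| and |y| stay well below 1, which is the linear multiplier regime.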
There is also an increasing effort to develop new circuit design techniques for building analog information-processing systems. In 2001, Minch proposed multiple-input translinear element (MITE) networks [19], which produce an output current that is exponential in a weighted sum of the input voltages. MITEs are simply implemented using multiple-input floating-gate MOS (FGMOS) transistors. An MITE is shown in Fig. 2.4; its output current Iout follows:

Figure 2.3: A voltage-in, current-out winner-take-all circuit.

Figure 2.4: A multiple-input translinear element (MITE).

Iout ∝ e^{(w1 V1 + w2 V2 + ··· + wN VN)}.   (2.23)

From equation (2.23) it can be seen that if the weights are adjusted according to some learning algorithm, such as the least-mean-square (LMS) algorithm or the recursive least-squares (RLS) algorithm, these built-in weights can be used to implement a variety of linear adaptive signal-processing architectures. In this respect, TASP circuits have also been applied to neural systems [27, 28, 29, 30]. Furthermore, the weights can be programmed digitally to realize reconfigurable TASP systems. Such reconfigurable analog signal processing systems are called field-programmable analog arrays (FPAAs) [6]; a TASP FPAA is analogous to an E²PROM-based field-programmable gate array (FPGA) in the digital domain. Based on MITEs, a variety of static and dynamic translinear circuits can be synthesized (e.g., the vector-magnitude circuit, the vector-normalization circuit, the second-order low-pass filter, and the rms-to-dc converter) [19, 31, 32, 26].
By admitting capacitors into the translinear loops, dynamic translinear circuits are constructed to realize frequency-dependent transfer functions. The dynamic translinear principle can be explained with reference to the circuit shown in Fig. 2.5.
Figure 2.5: Principle of dynamic translinear circuits.

The expression for Icap is easily found to be

Icap = C·UT·İDS / IDS,   (2.24)

which shows that Icap is a nonlinear function of IDS and its time derivative İDS. Equation (2.24) can be rewritten as

C·UT·İDS = Icap·IDS,   (2.25)

which states the dynamic translinear principle: a time derivative of a current can be mapped onto a product of currents [33].
The first dynamic translinear circuit, or log-domain filter, was introduced by Adams in 1979 [20]; the first-order filter described in [20] is in fact a translinear circuit. Another milestone is a general class of filters, called exponential state-space filters, comprising a number of translinear loops, first published by Frey [34]. Exponential state-space filters enable the design of higher-order log-domain filters [35, 36]. Since then, interest in dynamic translinear filters has increased dramatically, and many more studies have been reported in this area [37, 38, 39, 40, 41, 42].
However, the dynamic translinear principle is not limited to filters, i.e., to linear differential equations; it can also be used to realize nonlinear differential equations using transistors and capacitors only. In [21], the nonlinear differential equations describing the rms-dc conversion function are implemented using the dynamic translinear principle. Another example is [22], in which a translinear second-order oscillator was implemented. Other example functions described by nonlinear differential equations and implemented by translinear circuits include phase-locked loops (PLLs) [43] and a translinear sinusoidal frequency tripler [44].

Chapter 3
Theory of PWL technique based on margin propagation

3.1 Definition of MP algorithm

At the core of the MP algorithm is the reverse water-filling procedure, which has been described in detail in [45, 46].
Given a set of scores Li ∈ R, i = 1, ..., N, reverse water-filling computes a normalization factor z according to the constraint:

Σ_{i=1}^{N} [Li − z]+ = γ,   (3.1)

where [·]+ = max(·, 0) denotes a rectification operation and γ ≥ 0 represents a parameter of the algorithm. The computation of z according to constraint (3.1) is summarized in Algorithm 1. The algorithm recursively computes the factor z such that the net balance of the log-likelihood scores Li in excess of z equals γ.

Algorithm 1 Reverse water-filling procedure to compute the parameter z
Require: set of log-likelihood scores {Li}, i = 1, ..., N
Initialize: T ← 0, m ← 1, a ← max{Li}, {s} ← {Li} \ {a}
while T < γ and m < N do
  b ← max{s}
  if T + m(a − b) ≥ γ then break
  T ← T + m(a − b)
  a ← b
  {s} ← {s} \ {b}
  m ← m + 1
end while
z ← a − (γ − T)/m

The solution to equation (3.1) can also be visualized using Figure 3.1, where the cumulative score beyond the normalization factor z (shown by the shaded area) equals γ. To be practical, we always limit z to the range z ≤ max_i Li. The computation of z can be expressed as an equivalent function M(L, γ) whose inputs are a set of log-likelihood scores L = {Li}, i = 1, ..., N, of size N and a hyper-parameter γ:

z_mp = M(L, γ).   (3.2)

3.2 Properties of MP algorithm

In this section we present some of the key properties of the MP function M(L, γ). Since the margin approximation uses the threshold operation [x]+ = max(x, 0), we first state two lemmas which will be useful for proving some of the other properties.

Lemma 1: ∀a, b ∈ R, [a]+ − [b]+ ≤ [a − b]+.

Figure 3.1: Illustration of the reverse water-filling procedure.

Lemma 2: ∀a, b ∈ R, [a + b]+ ≤ [a]+ + [b]+.

The proofs of Lemma 1 and Lemma 2 are straightforward and can be found in any standard analysis textbook. For the sake of brevity, only the statements and illustrations of some of the properties are presented in this section; all the proofs are presented in detail in the appendix.
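Algorithm 1 can be mirrored in software. The sketch below implements reverse water-filling by descending through the sorted scores and solving for the water level in closed form at each step (an equivalent reformulation of Algorithm 1's recursion, not the dissertation's hardware implementation):

```python
def margin_propagate(scores, gamma):
    """Reverse water-filling: find z such that sum(max(Li - z, 0)) == gamma."""
    s = sorted(scores, reverse=True)
    n = len(s)
    prefix = 0.0
    for m in range(1, n):
        prefix += s[m - 1]
        z = (prefix - gamma) / m   # water level if exactly m scores lie above it
        if z >= s[m]:
            return z
    # gamma is large enough that every score lies above the level
    prefix += s[n - 1]
    return (prefix - gamma) / n

# The defining constraint (3.1) holds by construction:
L = [1.2, -0.4, 0.9, 0.3]
z = margin_propagate(L, 0.8)
print(z, sum(max(l - z, 0.0) for l in L))
```

As γ shrinks toward zero the returned level approaches max(L), which anticipates the asymptotic property (3.4) stated below.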
Property 1 (Scaling Property): For any α ∈ R, α > 0, and a set of scores L = {Li}, i = 1, ..., N,

M(αL, αγ) = αM(L, γ),   (3.3)

where αL = {αLi}, i = 1, ..., N. The proof of this property is simple, since if the condition

Σ_{i=1}^{N} [Li − z]+ = γ

is satisfied, then the condition

Σ_{i=1}^{N} [αLi − αz]+ = αγ

is also satisfied. Figure 3.2 illustrates the scaling property of the margin approximation function: the threshold z scales with the scaling of the log-likelihood scores and the hyper-parameter γ.

Figure 3.2: Scaling property of margin propagation.

Property 2 (Monotonicity): Given a set of scores L = {Li}, i = 1, ..., N, if γ1 ≥ γ2 ≥ 0 then M(L, γ1) ≤ M(L, γ2). An important implication of monotonicity is the asymptotic property as γ → 0:

lim_{γ→0} M(L, γ) = max(L).   (3.4)

The importance of this asymptotic property is that the min-sum algorithm for LDPC decoding becomes a special case of the proposed margin propagation algorithm.

Proof. Given

Σ_{i=1}^{N} [Li − z1]+ = γ1,
Σ_{i=1}^{N} [Li − z2]+ = γ2,
γ1 ≥ γ2 ≥ 0,   (3.5)

what we need to prove is z1 ≤ z2. Subtracting the second equation in (3.5) from the first, we get

Σ_{i=1}^{N} ([Li − z1]+ − [Li − z2]+) = γ1 − γ2 ≥ 0.   (3.6)

Based on Lemma 1, [Li − z1]+ − [Li − z2]+ ≤ [z2 − z1]+ for every i, so

N·[z2 − z1]+ ≥ Σ_{i=1}^{N} ([Li − z1]+ − [Li − z2]+).

Combined with (3.6), it can be deduced that

N·[z2 − z1]+ ≥ γ1 − γ2 ≥ 0.

If γ1 > γ2 this forces [z2 − z1]+ > 0, i.e., z2 > z1; if γ1 = γ2 the two constraints coincide and z1 = z2. In conclusion, z1 ≤ z2. □

Property 3 (Convexity Property): Given a set of coefficients {αk} satisfying 0 ≤ αk ≤ 1 and Σ_k αk = 1, and given a set of hyper-parameters {γk},

M(L, Σ_k αk γk) ≤ Σ_k αk M(L, γk).

Proof. Given the same group of Li's and a set of γk, k = 1, ..., n,

Σ_{i=1}^{N} [Li − z1]+ = γ1,
Σ_{i=1}^{N} [Li − z2]+ = γ2,
...
Σ_{i=1}^{N} [Li − zn]+ = γn.   (3.7)

Based on Property 1, we can transform the equations in (3.7) into

Σ_{i=1}^{N} [αk Li − αk zk]+ = αk γk,  k = 1, ..., n.

Summing these equations over k and applying Lemma 2 (together with Σ_k αk = 1), we get

Σ_{i=1}^{N} [Li − Σ_{k=1}^{n} αk zk]+ ≤ Σ_{k=1}^{n} αk γk.

Let z satisfy

Σ_{i=1}^{N} [Li − z]+ = Σ_{k=1}^{n} αk γk.

Since Σ_i [Li − z]+ is non-increasing in z, it follows that z ≤ Σ_{k=1}^{n} αk zk. That is,

M(L, Σ_k αk γk) ≤ Σ_k αk M(L, γk). □

Property 4 (Superposition Property): Given two sets of scores L and G of size N with a well-defined (sorted) ordering, and letting L + G represent element-by-element scalar addition,

M(L + G, γ) ≥ M(L, γ) + M(G, γ).

Proof. Given

Σ_{i=1}^{N} [Li − z1]+ = γ,
Σ_{i=1}^{N} [gi − z2]+ = γ,   (3.8)

what we need to prove is that if

Σ_{i=1}^{N} [Li + gi − z3]+ = γ,   (3.9)

then z3 ≥ z1 + z2. All the Li's are sorted and rearranged in decreasing order so that L1 ≥ L2 ≥ L3 ≥ ... ≥ LN, and we use N1 to represent the number of Li's above the threshold z1. The same is done for the gi's: they are sorted decreasingly as g1 ≥ g2 ≥ g3 ≥ ... ≥ gN, and the number of gi's above the threshold z2 is N2. Rewrite (3.8) as

Σ_{i=1}^{N1} Li − N1 z1 = γ,
Σ_{i=1}^{N2} gi − N2 z2 = γ.   (3.10)

Assume N2 ≥ N1. Then, from the second equation in (3.10), we get

0 ≤ Σ_{i=1}^{N1} gi − N1 z2 ≤ γ.   (3.11)

Adding both sides of (3.11) to the first equation in (3.10), we get

γ ≤ Σ_{i=1}^{N1} (Li + gi) − N1 z1 − N1 z2 ≤ 2γ.   (3.12)

The left inequality of (3.12) can be transformed into

Σ_{i=1}^{N1} [Li + gi − (z1 + z2)] ≥ γ,

from which we can deduce that

Σ_{i=1}^{N} [Li + gi − (z1 + z2)]+ ≥ γ.   (3.13)

Given (3.9) and (3.13), and since Σ_i [Li + gi − z]+ is non-increasing in z, it can be deduced that z3 ≥ z1 + z2; the same argument applies if we instead assume N1 ≥ N2. Written as MP functions,

M(L + G, γ) ≥ M(L, γ) + M(G, γ). □
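The properties above can be sanity-checked numerically. The sketch below solves the constraint (3.1) by bisection, an illustrative software stand-in for the MP function, and compares both sides of the scaling, monotonicity, asymptotic and constant-offset relations:

```python
def mp(scores, gamma, iters=200):
    """Solve sum(max(Li - z, 0)) == gamma for z by bisection."""
    lo = min(scores) - gamma   # residue at this level is >= gamma
    hi = max(scores)           # residue at this level is 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(max(l - mid, 0.0) for l in scores) > gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

L = [1.2, 0.4, -0.3, 0.9]
g = 0.7
print(mp([2 * l for l in L], 2 * g), 2 * mp(L, g))    # scaling (Property 1)
print(mp(L, 1.5), mp(L, 0.5))                         # monotonicity (Property 2)
print(mp(L, 1e-9), max(L))                            # asymptote (3.4)
print(mp([l + 3.0 for l in L], g), mp(L, g) + 3.0)    # constant offset shifts z
```

Bisection works here because the residue Σ_i [Li − z]+ is a continuous, non-increasing function of z, the same monotonicity used repeatedly in the proofs above.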
A related exact equality holds when the second set of scores reduces to a constant offset:

Property 5 (Offset Invariance): Given a set of scores L of size N and a scalar g ∈ R,

M(L + g, γ) = M(L, γ) + g.

Property 5 states that a constant offset applied to all the elements of the input set leads to an equivalent offset in the output of the margin approximation function.

Proof. Given

Σ_{i=1}^{N} [Li − z]+ = γ,

adding the scalar g to every score and to the threshold leaves each rectified term unchanged:

Σ_{i=1}^{N} [Li + g − (z + g)]+ = γ.

Written in MP function format, M(L + g, γ) = M(L, γ) + g. □

Property 6 (Invariance Property): Given a set of scores L of size N, and a set of scores G = {gj}, j = 1, ..., n, whose elements satisfy gj ≤ M(L, γ), then

M(L ∪ G, γ) = M(L, γ).

Proof. Given

Σ_{i=1}^{N} [Li − z]+ = γ,

since gj ≤ z for every j ∈ [1, n],

Σ_{j=1}^{n} [gj − z]+ = 0,

which means

Σ_{i=1}^{N} [Li − z]+ + Σ_{j=1}^{n} [gj − z]+ = γ.

Therefore, M(L ∪ G, γ) = M(L, γ). □

3.3 PWL approximation based on MP algorithm

The generalized form of a "log-sum-exp" function is

z_log = log(Σ_{i=1}^{N} e^{Li}).   (3.14)

A piecewise-linear (PWL) approximation of the "log-sum-exp" function can be achieved with the MP algorithm.

Proof. Assume that (3.14) and (3.2) share the same group of operands L1, ..., LN, Li ∈ R, i = 1, ..., N. If the operands are sorted and rearranged in decreasing order, L1 ≥ L2 ≥ ... ≥ LN, then (3.14) reads

z_log = log(Σ_{i=1}^{N} e^{Li}),   (3.15)

and for the MP algorithm, (3.2) reads

z = M(L1, ..., LN, γ).   (3.16)

If the number of Li's above the threshold z is n, 1 ≤ n ≤ N, then according to the definition of the MP algorithm, (3.1) becomes

Σ_{i=1}^{n} (Li − z) = γ,  i.e.,  Σ_{i=1}^{n} Li − nz = γ,   (3.17)

and z can be deduced as

z = (1/n)·Σ_{i=1}^{n} Li − γ/n.   (3.18)

For (3.18), under the conditions that (a) γ and all the Li's other than Ln are fixed, and (b) Ln varies within a range that keeps it at or above the threshold z, we have

∂z/∂Ln = 1/n.   (3.19)
For z_log in (3.15), under the same conditions (a) and (b),

∂z_log/∂Ln = e^{Ln} / Σ_{i=1}^{N} e^{Li} = 1 / Σ_{i=1}^{N} e^{(Li − Ln)}.   (3.20)

Define a function u(Li, Ln) to approximate e^{(Li − Ln)} as

u(Li, Ln) = 1 if Li ≥ Ln, and 0 if Li < Ln,  1 ≤ n ≤ N.

Fig. 3.3 shows the curve of u(Li, Ln) and its PWL approximation of e^{(Li − Ln)}.

Figure 3.3: The curve of e^{(Li − Ln)} and its PWL approximation by u(Li, Ln).

Since the number of Li's not less than Ln is n,

Σ_{i=1}^{N} u(Li, Ln) = n.   (3.21)

Therefore,

∂z_log/∂Ln ≈ 1 / Σ_{i=1}^{N} u(Li, Ln) = 1/n.   (3.22)

Comparing (3.19) and (3.22), we find that

∂z_log/∂Ln ≈ 1/n = ∂z/∂Ln.   (3.23)

Based on the illustrations above, we can conclude that z in (3.2) approximates z_log in (3.14) under the condition that γ and all the Li's other than Ln are fixed, with Ln varying within a certain range. Since 1 ≤ n ≤ N, by adjusting the value of γ to change the number of operands above the threshold (reflected by n), this conclusion can be extended to the whole group of operands. Based on the proof, we can write

z_mp = M(L, γ) ≈ log(Σ_{i=1}^{N} e^{Li}).   (3.24)

To demonstrate the PWL approximation behavior of the MP algorithm, a "log-sum-exp"-based function is used for reference. It is a transform function φ with two log-likelihood operands L1 and L2 [47], named "sum-product" in the field of factor-graph decoding [48]:

φ_{L2}(L1) = log((1 + e^{L1+L2}) / (e^{L1} + e^{L2})).   (3.25)

The two operands are expressed in differential form, i.e., L1 = L1+ − L1−, L2 = L2+ − L2−. Now (3.25) can be rewritten and decomposed as

φ_{L2}(L1) = log((e^{L1− + L2−} + e^{L1+ + L2+}) / (e^{L1+ + L2−} + e^{L1− + L2+}))
           = log(e^{L1− + L2−} + e^{L1+ + L2+}) − log(e^{L1+ + L2−} + e^{L1− + L2+}),   (3.26)

where φ_{L2}(L1) is represented as a difference of two "log-sum-exp" functions. Based on
Based on equation (3.24), we can approximate (3.26) using two MP units with two operands each:

\phi_{L_2}(L_1) \approx M(L_1^- + L_2^-, L_1^+ + L_2^+, \gamma) - M(L_1^+ + L_2^-, L_1^- + L_2^+, \gamma).  (3.27)

Fig. 3.4 compares the PWL approximation achieved with a fixed γ and different schemes of operands in the MP units: (a) two operands per MP unit, as described in equation (3.27); (b) four operands per MP unit; and (c) six operands per MP unit. The operands for the MP units are listed in Table 3.1, where S denotes the sum L_1^- + L_2^- + L_1^+ + L_2^+.

Table 3.1: Operands of MP units in Fig. 3.4

Num | Operands for MP unit z^+ | Operands for MP unit z^-
2 | L_1^- + L_2^-,  L_1^+ + L_2^+ | L_1^+ + L_2^-,  L_1^- + L_2^+
4 | L_1^- + L_2^-,  (1/2)S,  (3/5)S,  L_1^+ + L_2^+ | L_1^+ + L_2^-,  (1/2)S,  (3/5)S,  L_1^- + L_2^+
6 | L_1^- + L_2^-,  (2/5)S,  (1/2)S,  (3/5)S,  (2/3)S,  L_1^+ + L_2^+ | L_1^+ + L_2^-,  (2/5)S,  (1/2)S,  (3/5)S,  (2/3)S,  L_1^- + L_2^+

As can be seen from Fig. 3.4, when more intercepted operands are introduced into the MP computation, the approximating curves become finer and closer to that of the original “sum-product” algorithm. That is, the PWL approximation effect of (3.27) improves as more operands are utilized in the approximating algorithm.
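As a numerical sanity check of (3.27), the following sketch evaluates the exact “sum-product” transform (3.25) against its two-MP-unit approximation. The differential decomposition of the operands and the value of γ are illustrative choices:

```python
import math

def mp(L, gamma):
    # reverse water-filling solver for M(L, gamma) (illustrative model)
    Ls = sorted(L, reverse=True)
    total = 0.0
    for n in range(1, len(Ls) + 1):
        total += Ls[n - 1]
        z = (total - gamma) / n
        if z <= Ls[n - 1] and (n == len(Ls) or z >= Ls[n]):
            return z
    return z

def phi_exact(L1, L2):
    # "sum-product" transform, eq. (3.25)
    return math.log((1 + math.exp(L1 + L2)) / (math.exp(L1) + math.exp(L2)))

def phi_mp(L1p, L1m, L2p, L2m, gamma):
    # difference of two two-operand MP units, eq. (3.27)
    zp = mp([L1m + L2m, L1p + L2p], gamma)
    zm = mp([L1p + L2m, L1m + L2p], gamma)
    return zp - zm

# L1 = 2.0 - 1.0 = 1.0 and L2 = 1.5 - 1.0 = 0.5 in differential form
print(phi_exact(1.0, 0.5))                      # ~0.227
print(phi_mp(2.0, 1.0, 1.5, 1.0, gamma=1.0))    # ~0.25
```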
Figure 3.4: PWL approximation effect of the MP algorithm (MP with 2, 4 and 6 operands versus the sum-product function).

Chapter 4

MP-based current-mode PWL circuit and synthesis

4.1 Log-domain computation

As high-speed computation in signal processing, graphics and communication systems receives increasing attention, logarithmic computation, a specific and effective option in the signal processing area, has become a central research topic. The logarithmic number system (LNS) has been adopted in many digital signal processor (DSP) designs [49, 50, 51, 52, 53]. A list of operations in base-2 LNS, compared with a conventional arithmetic system, is shown in Table 4.1. As can be seen, log-domain computation can dramatically simplify various complex arithmetic computations, such as multiplication, division, reciprocal and square root. Consequently, with LNS the number of bits needed in the computation decreases, and therefore the power consumption drops [52, 53]; at the same time, the lower complexity results in high-speed computation as well [49, 52, 53]. Generally speaking, for scenarios demanding low power dissipation, high throughput and small gate count, LNS is more appropriate than conventional arithmetic systems. However, a major obstacle for LNS implemented in digital circuits is the complicated addition and subtraction. Moreover, to simplify the hardware implementation, PWL schemes are frequently adopted by digital logarithmic converters for real-number to logarithmic-domain data conversion. As a result of PWL, small coefficient lookup tables serve as the power- and area-efficient implementation [52, 53, 54]. However, the use of read-only memory (ROM) is costly and seriously restricts the word length of the LNS data system [49, 55].
Table 4.1: Operations in ordinary and logarithmic arithmetic in a digitally implemented LNS processor

Operation | Ordinary Arithmetic | Logarithmic Arithmetic
Multiplication | x × y | X + Y
Division | x ÷ y | X − Y
Square | x² | X << 1
Square Root | √x | X >> 1
Addition | x + y | X + log_2(1 + 2^{Y−X})
Subtraction | x − y | X + log_2(1 − 2^{Y−X})

The proposed MP-based algorithm is capable of approximating a specific log-domain computation (the “log-sum-exp” function) and can be implemented in an analog manner, which circumvents the aforementioned problems of the digital implementation. In later sections of this chapter, a variety of log-domain computation circuits synthesized from a basic MP circuit will be shown.

4.2 Architecture of synthesized system

Let us take an arbitrary function as an example:

Y = a_1 \cdot b_1 + a_2 \cdot b_2.  (4.1)

The operands can be converted into differential forms as

a_1 = a_1^+ - a_1^-,  b_1 = b_1^+ - b_1^-,  a_2 = a_2^+ - a_2^-,  b_2 = b_2^+ - b_2^-,

and the corresponding operands in LNS are

L_{a1}^± = \log a_1^±,  L_{b1}^± = \log b_1^±,  L_{a2}^± = \log a_2^±,  L_{b2}^± = \log b_2^±.

Accordingly, (4.1) can be converted into

Y = e^{z^+} - e^{z^-},  (4.2)

where z^+ and z^- are expressed as

z^+ = \log\left( e^{L_{a1}^+ + L_{b1}^+} + e^{L_{a1}^- + L_{b1}^-} + e^{L_{a2}^+ + L_{b2}^+} + e^{L_{a2}^- + L_{b2}^-} \right),
z^- = \log\left( e^{L_{a1}^+ + L_{b1}^-} + e^{L_{a1}^- + L_{b1}^+} + e^{L_{a2}^+ + L_{b2}^-} + e^{L_{a2}^- + L_{b2}^+} \right).  (4.3)

From (4.3), it can be seen that the original computation in (4.1) is transformed into “log-sum-exp” functions of the form (3.14) when the computation is converted into LNS. It can then be approximated with the MP algorithm, as equation (3.24) shows. Based on the illustration above, for any arbitrary function that can be converted into the form of (4.2) and (4.3), i.e., expressed in terms of “log-sum-exp” functions, its PWL approximation can be obtained with the MP algorithm.
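The decomposition (4.1)–(4.3) can be verified numerically: a bipolar sum of products maps exactly to a difference of two “log-sum-exp” terms (no MP approximation is involved yet). The operand values and the offset used to form the differential representation are arbitrary illustrative choices:

```python
import math

def lse(terms):
    # "log-sum-exp" over log-domain operands, eq. (3.14)
    return math.log(sum(math.exp(t) for t in terms))

# operands of Y = a1*b1 + a2*b2, and an (arbitrary) offset for the
# differential representation x = x_plus - x_minus with x_plus, x_minus > 1
a1, b1, a2, b2, ofs = 1.5, -0.8, -2.0, 0.4, 4.0
a1p, a1m, b1p, b1m = a1 + ofs, ofs, b1 + ofs, ofs
a2p, a2m, b2p, b2m = a2 + ofs, ofs, b2 + ofs, ofs
log = math.log

# z+ and z- of eq. (4.3): log-sum-exp over sums of log-likelihoods
zp = lse([log(a1p) + log(b1p), log(a1m) + log(b1m),
          log(a2p) + log(b2p), log(a2m) + log(b2m)])
zm = lse([log(a1p) + log(b1m), log(a1m) + log(b1p),
          log(a2p) + log(b2m), log(a2m) + log(b2p)])

print(math.exp(zp) - math.exp(zm))   # recovers a1*b1 + a2*b2 = -2.0
```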
Let us generalize such an arbitrary function, with all the operands in the real domain, as Y = f(a, ..., b). It can be converted into differential form and LNS as

Y = e^{z^+} - e^{z^-},
z^+ = F^+(L_a^+, L_a^-, ..., L_b^+, L_b^-),
z^- = F^-(L_a^+, L_a^-, ..., L_b^+, L_b^-),  (4.4)

where F^+ and F^- denote “log-sum-exp” functions of L_a^+, L_a^-, L_b^+, L_b^-, etc.

Fig. 4.1 shows the system-level architecture for arbitrary computations. The input signals are represented in differential form, a = a^+ - a^-, b = b^+ - b^-, and are converted into logarithms at the input stage. Similarly, the logarithms of the output signals are represented by z^+ and z^-. With all the logarithm/antilogarithm conversions realized by the outer units at the input/output stages, the core computations (within the dashed block) are carried out in the log-likelihood domain and implemented by the MP-based PWL approximation circuit.

Figure 4.1: Top architecture of the computation unit.

Here we focus only on the core computation unit.

4.3 Current-mode circuit implementation for MP algorithm

A current-mode circuit implementing the basic reverse water-filling equation (3.1) is shown in Fig. 4.2. The four operands L_1 ~ L_4 in (3.2) are represented by the currents I_{L1} ~ I_{L4}. The hyper-parameter γ in (3.2) is implemented by the current I_γ. The circuit in Fig. 4.2 bears similarity to other parallel current-mode circuits like winner-take-all circuits; however, as mentioned in the previous section, the MP implementation is more general, and for I_γ → 0 the MP circuit implements a winner-take-all function. Kirchhoff’s current law (KCL), when applied to node A, is equivalent to the reverse water-filling condition

\sum_{i=1}^{4} [I_{L_i} - I_z]_+ = I_\gamma.

I_z represents the MP approximation M(L_1, ..., L_4, γ), and [·]_+ denotes a rectifying operation, which is implemented by the PMOS diodes P_1 ~ P_4. It can be readily verified that the PMOS diodes ensure that V_{DS,N_i} = V_{SG,P_i} + V_{DS,sat}, where V_{DS,N_i} is the drain-to-source voltage drop across transistor N_i, i = 1, ..., 4, V_{SG,P_i} is the voltage drop across the PMOS diode, and V_{DS,sat} is the saturation voltage across the current sink I_γ. Thus, transistors N_1 ~ N_4 are always maintained in saturation irrespective of the operating region (weak, moderate or strong inversion). This is important because the operation of the MP circuit relies on the accuracy of the current mirrors formed by N_0 ~ N_4. Also, to ensure proper operation of the core MP circuit, the input currents have to satisfy the constraint \sum_{i=1}^{4} I_{L_i} ≥ I_γ, and the input current range can be expressed as

\frac{1}{2} \mu_n C_{ox} \frac{W}{L} \left( \frac{V_{dd}}{2} \right)^2 \geq I_{L_i} \geq 0.  (4.5)

In deriving the upper bound of the input current range, we have assumed that I_{L_i} >> I_γ and that the PMOS and NMOS transistors have similar drain-to-source voltage drops (implying V_B ≈ V_{dd}/2). It should be noted that the PMOS diodes in the circuit of Fig. 4.2 introduce a threshold-voltage drop, which implies that the supply voltage has to be greater than 2V_{GS} + V_{DS}. However, a low-voltage implementation of the circuit in Fig. 4.2 is possible by replacing the PMOS diodes with low-threshold active diodes [56], at the expense of larger silicon area, higher power dissipation, and degraded speed and stability.

We have prototyped a programmable version of the MP circuit shown in Fig. 4.2 in a 0.5 µm standard CMOS process. The architecture of the prototype is shown in Fig. 4.4 and its micrograph in Fig. 4.3. A serial shift-register is used to selectively turn each of the current branches on and off; in this manner, the same MP circuit in Fig. 4.2 can be reconfigured to implement different mathematical functions.
The architecture in Fig. 4.4 uses an on-chip first-order Σ∆ modulator as an analog-to-digital converter for measurement and calibration of the output currents. Because the circuits used for implementing the Σ∆ modulator have been reported elsewhere [57], we omit its implementation details here. The input currents, and hence the operating region (weak, moderate or strong inversion) of all the transistors, are adjusted by controlling the gate voltages of the PMOS transistors. For each of the operating regions, the ADC is re-calibrated to the maximum current, and the measured results (presented in the next sections) are normalized with respect to the maximum current.

Figure 4.2: CMOS implementation of the core MP circuit.

Figure 4.3: Die micrograph of the chip.

Figure 4.4: System architecture of the chip.

4.4 MP based computational circuits

In the previous section, we illustrated that MP is an effective PWL approximation algorithm for a specific log-domain function; the circuit schematic of a four-operand MP-based PWL approximation circuit is shown in Fig. 4.2. In this section, we illustrate some fundamental log-domain computations that can be synthesized based on the “log-sum-exp” function, together with their MP-based PWL approximation circuits. Based on these fundamental computation circuits, arbitrary log-domain arithmetic operations and their combinations can be implemented, which is validated with the classical mathematical functions presented in this section.
Current-mode signal representation is exploited so that the power dissipation can be controlled adaptively under different power-consumption requirements.

4.4.1 Addition

To compute a + b for two input variables a ∈ R and b ∈ R, the operands are first represented in their differential forms a = a^+ - a^- and b = b^+ - b^-, with a^+, a^-, b^+, b^- ∈ R^+. Then, as in Fig. 4.1, the operation is rewritten as

a + b = (a^+ + b^+) - (a^- + b^-) = e^{\log(a^+ + b^+)} - e^{\log(a^- + b^-)}.

All the operands are then mapped into a log-likelihood domain according to

L_a^+ = \log a^+,  L_a^- = \log a^-,  L_b^+ = \log b^+,  L_b^- = \log b^-,  (4.6)

where L_a^+, L_a^-, L_b^+, L_b^- ∈ R. Since our circuit implementation uses only unipolar current sources, we impose the additional constraint a^+, a^-, b^+, b^- > 1; due to the differential representation, this additional constraint does not affect the final output. Let z^+ denote \log(a^+ + b^+) and z^- denote \log(a^- + b^-). Then, (4.6) can be written as

a + b = e^{z^+} - e^{z^-},  (4.7)

and z^+ can be expanded as

z^+ = \log(a^+ + b^+) = \log\left( e^{L_a^+} + e^{L_b^+} \right).  (4.8)

We now apply the MP approximation to the log-sum-exp function to obtain

z^+ \approx M(L_a^+, L_b^+, \gamma).  (4.9)

Similarly,

z^- = \log(a^- + b^-) = \log\left( e^{L_a^-} + e^{L_b^-} \right) \approx M(L_a^-, L_b^-, \gamma).  (4.10)

Since equations (4.9) and (4.10) use only two operands, the circuit-level implementation is identical to Fig. 4.2 except that it has only two input branches instead of four. The single-ended circuit for addition is shown in Fig. 4.5. I_{z+}, the current through N_0, represents the value of z^+; I_{La+}, I_{Lb+} and I_γ represent L_a^+, L_b^+ and γ respectively.

Figure 4.5: MP circuit for implementing two-operand addition.
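A software model of the addition scheme (4.6)–(4.10), comparing the MP outputs against the exact log-domain values; this mirrors the comparison made in Fig. 4.6. The operand offset and γ below are illustrative choices:

```python
import math

def mp(L, gamma):
    # reverse water-filling model of the two-input MP circuit (Fig. 4.5)
    Ls = sorted(L, reverse=True)
    total = 0.0
    for n in range(1, len(Ls) + 1):
        total += Ls[n - 1]
        z = (total - gamma) / n
        if z <= Ls[n - 1] and (n == len(Ls) or z >= Ls[n]):
            return z
    return z

a, b, ofs, gamma = 1.2, -0.4, 2.0, 1.0
ap, am, bp, bm = a + ofs, ofs, b + ofs, ofs              # a+, a-, b+, b-
Lap, Lam, Lbp, Lbm = (math.log(v) for v in (ap, am, bp, bm))

z_exact = math.log(ap + bp) - math.log(am + bm)          # exact z+ - z-
z_mp = mp([Lap, Lbp], gamma) - mp([Lam, Lbm], gamma)     # eqs. (4.9), (4.10)
print(z_exact, z_mp)                                     # ~0.182 vs ~0.123
```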
Fig. 4.6(a)-(c) compares the output z^+ - z^- obtained using a floating-point implementation of the “log-sum-exp” functions, the MP approximation, and measurements (normalized currents) from the fabricated prototype.

Figure 4.6: Addition: z^+ - z^- computed according to (a) the original log-sum-exp function; (b) MP approximations (simulation); and (c) MP circuits (measurement).

Fig. 4.7(a)-(c) shows the measurement results when the transistors are biased in (a) strong inversion; (b) moderate inversion; and (c) weak inversion. Fig. 4.7(d)-(f) shows the approximation error between the measured response and the software model. The results validate our claim, for the addition operation, that the MP approximation is scalable across different biasing regions and that the circuit in Fig. 4.5 approximates the “log-sum-exp” function with a reasonable error. The error increases when the circuit is biased in weak inversion, as with any sub-threshold analog VLSI circuit; we attribute it to transistor mismatch, flicker noise, and errors introduced in mapping floating-point operands into analog operands (currents).

Figure 4.7: Addition: measurement results for z^+ - z^- with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; (d)-(f) computation errors for (a)-(c).

4.4.2 Subtraction

Synthesizing an MP circuit for subtraction a - b, with a and b being the two operands, is similar to the addition operation. The operands are written in differential form as a = a^+ - a^- and b = b^+ - b^-, with a^+, a^-, b^+, b^- ∈ R^+ satisfying a^+, a^-, b^+, b^- > 1.
Thus, a - b is written as

a - b = (a^+ + b^-) - (a^- + b^+) = e^{\log(a^+ + b^-)} - e^{\log(a^- + b^+)},  (4.11)

which, after conversion into a log-likelihood domain

L_a^+ = \log a^+,  L_a^- = \log a^-,  L_b^+ = \log b^+,  L_b^- = \log b^-,

leads to

z^+ = \log(a^+ + b^-) = \log\left( e^{L_a^+} + e^{L_b^-} \right) \approx M(L_a^+, L_b^-, \gamma)  (4.12)

and

z^- = \log(a^- + b^+) = \log\left( e^{L_a^-} + e^{L_b^+} \right) \approx M(L_a^-, L_b^+, \gamma),  (4.13)

where z^+ and z^- are the differential forms of the output log-likelihoods as described in (4.2). The circuit implementation of subtraction is identical to that of addition, except that the input differential currents are now permuted. Fig. 4.8 compares the measured response with the software-based “log-sum-exp” and MP implementations. Also, Fig. 4.9(a)-(c) shows the measured results when the transistors are biased in the three different operating regimes. Similar to the results obtained for the addition operation, the measured results show the bias-scalable operation of the MP circuits.

Figure 4.8: Subtraction: z^+ - z^- computed by (a) original functions; (b) MP approximations (simulation); and (c) MP circuits (measurement).

Figure 4.9: Subtraction: measurement results for z^+ - z^- with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; and (d)-(f) the respective approximation errors.

4.4.3 Four quadrant multiplication

Implementing a single-quadrant multiplier using the MP circuit is straightforward because z = a · b in its logarithmic form can be expressed as a sum of likelihoods L_z = L_a + L_b, where L_x denotes the logarithm of the operand x.
Thus, a single-quadrant multiplier requires only a current summation and can be implemented using KCL. For an MP-based four-quadrant multiplier, the operation a · b is again expressed in terms of the differential operands as

a \cdot b = (a^+ - a^-) \cdot (b^+ - b^-) = (a^+ b^+ + a^- b^-) - (a^- b^+ + a^+ b^-) = e^{\log(a^+ b^+ + a^- b^-)} - e^{\log(a^- b^+ + a^+ b^-)}.  (4.14)

After log-likelihood mapping, the following differential forms are obtained:

z^+ = \log(a^+ b^+ + a^- b^-) = \log\left( e^{\log(a^+ b^+)} + e^{\log(a^- b^-)} \right) \approx M(L_a^+ + L_b^+, L_a^- + L_b^-, \gamma)  (4.15)

and

z^- = \log(a^- b^+ + a^+ b^-) = \log\left( e^{\log(a^- b^+)} + e^{\log(a^+ b^-)} \right) \approx M(L_a^- + L_b^+, L_a^+ + L_b^-, \gamma).  (4.16)

The circuit implementing an MP-based four-quadrant multiplier is shown in Fig. 4.10. I_{z+} denotes the value of z^+, I_γ represents γ, and I_{La+} (I_{La-}), I_{Lb+} (I_{Lb-}) represent L_a^+ (L_a^-), L_b^+ (L_b^-) respectively.

Figure 4.10: MP circuit for implementing four-quadrant operand multiplication.

Fig. 4.11 compares the measured output z^+ - z^- with the ideal implementation of the model in (4.15) and (4.16). Fig. 4.12(a)-(c) shows the measurement results when the transistors are biased in (a) strong, (b) moderate and (c) weak inversion, along with the respective approximation errors. Again, the measured results show the bias-scaling property of MP circuits.

Figure 4.11: Multiplication: z^+ - z^- computed according to the (a) original function; (b) MP approximations (simulation); and (c) MP circuits (measurement).
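The four-quadrant multiplier of (4.14)–(4.16) can likewise be modeled in software; the sign of z^+ - z^- tracks the sign of a · b in all four quadrants. The offset and γ below are illustrative choices:

```python
import math

def mp(L, gamma):
    # reverse water-filling model of the MP circuit in Fig. 4.10
    Ls = sorted(L, reverse=True)
    total = 0.0
    for n in range(1, len(Ls) + 1):
        total += Ls[n - 1]
        z = (total - gamma) / n
        if z <= Ls[n - 1] and (n == len(Ls) or z >= Ls[n]):
            return z
    return z

def mult_mp(a, b, gamma=0.5, ofs=3.0):
    ap, am, bp, bm = a + ofs, ofs, b + ofs, ofs
    Lap, Lam, Lbp, Lbm = (math.log(v) for v in (ap, am, bp, bm))
    zp = mp([Lap + Lbp, Lam + Lbm], gamma)   # eq. (4.15)
    zm = mp([Lam + Lbp, Lap + Lbm], gamma)   # eq. (4.16)
    return zp - zm

# sign of z+ - z- follows the sign of a*b across the quadrants
print(mult_mp(1.5, -0.5))    # negative (a*b < 0)
print(mult_mp(-1.5, -0.5))   # positive (a*b > 0)
print(mult_mp(1.5, 0.5))     # positive (a*b > 0)
```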
Figure 4.12: Multiplication: measurement results for z^+ - z^- with transistors biased in (a) strong inversion; (b) moderate inversion; (c) weak inversion; and (d)-(f) the respective approximation errors.

4.4.4 Single Quadrant Division

Like the implementation of a single-quadrant multiplication, implementing a single-quadrant division in the logarithmic domain is straightforward. For operands a, b > 0, division in the logarithmic domain is equivalent to a subtraction operation L_a - L_b, which can be implemented using KCL. To implement four-quadrant division, the operation first requires extracting the sign of each operand as x = sgn(x) · |x|, where |x| is the absolute value of x. A single-quadrant division can then be applied to the absolute values, and the sign of the final output can be computed using only comparators.

4.4.5 Power and Polynomial Computation

For positive operands satisfying a ∈ R^+, a > 1, computing a^n, n ∈ Z^+, in the logarithmic domain is equivalent to

\log a^n = n \cdot \log(a) = n \cdot L_a,  (4.17)

which can be implemented using a simple current mirror as shown in Fig. 4.13. Thus, in MP-based synthesis, power functions like square root or cube root can be implemented using current mirrors, which significantly simplifies the implementation compared to translinear-based synthesis [58]. However, a four-quadrant implementation of power and polynomial functions would require evaluating a Taylor expansion using multiplication and addition operations. For a non-positive operand a, a ∈ R, its power function a^n, n ∈ Z^+, is handled by expressing a in differential form a = a^+ - a^-, with a^+, a^- ∈ R^+ and a^+, a^- > 1.
For arbitrary n, a^n can be written as

(a^+ - a^-)^n = \sum_{i=0}^{n} (-1)^i C_n^i (a^+)^{n-i} (a^-)^i
             = \sum_{i\,even} C_n^i (a^+)^{n-i} (a^-)^i - \sum_{i\,odd} C_n^i (a^+)^{n-i} (a^-)^i
             = e^{\log \sum_{i\,even} C_n^i (a^+)^{n-i} (a^-)^i} - e^{\log \sum_{i\,odd} C_n^i (a^+)^{n-i} (a^-)^i}.  (4.18)

Designate z^+ as \log \sum_{i\,even} C_n^i (a^+)^{n-i} (a^-)^i and z^- as \log \sum_{i\,odd} C_n^i (a^+)^{n-i} (a^-)^i in (4.18). Then

z^+ \approx M(L^+, \gamma),  z^- \approx M(L^-, \gamma),  (4.19)

with

L_i = \log C_n^i + (n - i) \log a^+ + i \log a^-,  (4.20)

where L^+ (L^-) is the group of L_i s satisfying i ∈ [0, n] with i even (odd).

Figure 4.13: A current mirror implementing the power function in the logarithmic domain.

As an example, a four-quadrant square function a² can be implemented by expressing it as a product of two differential operands, (a^+ - a^-) · (a^+ - a^-), which has an architecture similar to the four-quadrant multiplier described above. The differential outputs z^+ and z^- corresponding to the square operation can be expressed in terms of the differential log-likelihood inputs L_a^+, L_a^- as

z^+ \approx M(2L_a^+, 2L_a^-, \gamma)  (4.21)

z^- = \log 2 + L_a^+ + L_a^-.  (4.22)

Fig. 4.14 shows the measured results when the circuit in Fig. 4.10 is configured to compute a four-quadrant square function. Again, the circuit is shown to approximate the “log-sum-exp” function under strong-inversion, moderate-inversion and weak-inversion biasing conditions.

Figure 4.14: Power: z^+ - z^- computed according to the log-sum-exp function, the MP approximation (simulation) and the MP circuit (measurement).

4.5 Scalability analysis of MP circuit

In Fig. 3.4, we showed that the accuracy of the PWL approximation improves when the number of operands in the margin function increases. In this section, we analyze the effect of hardware artifacts (noise, mismatch, speed and input/output impedance) on the performance and scalability of MP circuits.
Figure 4.15: Noise model for the MP circuit in Fig. 4.2.

4.5.1 Noise and Mismatch Analysis of MP circuit

The noise model for the basic MP circuit in Fig. 4.2 is shown in Fig. 4.15. For the following analysis we assume that the number of input current branches in the MP circuit equals K. Also, assuming that the noise in K - 1 of the parallel branches (due to KCL) is mutually independent, the total output noise power I_{n,z}^2 can be expressed as

I_{n,z}^2 = \sum_{i=1}^{K} \left[ \frac{I_{n,N_i}^2}{(K-2)^2} + \frac{I_{n,P_i}^2}{(K-2)^2} + \frac{I_{n,source_i}^2}{g_m^2 R_{source}^2 (K-2)^2} \right] + \frac{I_{n,sink}^2}{K^2} + I_{n,N_0}^2,  (4.23)

where I_{n,source_i}, I_{n,N_i}, I_{n,P_i} and I_{n,sink} are the noise currents shown in Fig. 4.15 and R_{source} is the output impedance of the current sources. We have simplified the expression in (4.23) by assuming K to be large and by assuming that the small-signal parameters g_m and R_{source} are approximately equal for all the transistors. Equation (4.23) can then be written as

I_{n,z,thermal}^2 = \left( 1 + \frac{1}{K} + \frac{1}{g_m^2 R_{source}^2 K} \right) \frac{8}{3} kT g_m,  (4.24)

where we have considered only the effect of thermal noise, given by I_n^2 = (8/3) kT g_m [59]. The effect of flicker noise has been ignored in equation (4.24) because it exhibits a similar dependency on K as the thermal noise. Equation (4.24) shows that as the number of branches (or operands) K increases, the output noise reduces, and in the limit K → ∞ the output noise is just equal to the noise due to transistor N_0. The reduction in the output noise with the increase in K can be attributed to the attenuation and averaging effect of the K-stage current mirrors formed by N_1 ~ N_K. Also, it is important that g_m R_{source} >> 1, not only to reduce the effect of noise but also to ensure that the input impedance of the MP circuit is much lower than the output impedance of the current sources (see Fig. 4.15).
The mismatch analysis for the MP circuit follows a procedure similar to the noise analysis. The MOS transistor mismatch models used in the analysis are based on the work reported in [60], which uses two technology-dependent constants A_{V_T} and A_β that determine the mismatch variances as

\sigma^2(\Delta V_T) = \frac{A_{V_T}^2}{WL},  (4.25)

\left( \frac{\sigma(\Delta\beta)}{\beta} \right)^2 = \frac{A_\beta^2}{WL},  (4.26)

where V_T denotes the threshold voltage, β = μ C_{ox} W/L, and W and L represent the width and length of the MOS transistor. It is also claimed in [60] that for MOS transistors the mismatch in V_T dominates the mismatch in β; therefore, our analysis here is based only on the V_T mismatch. The error variance σ²(ΔV_T) can be converted into an equivalent noise current source in Fig. 4.15 according to

\sigma^2(I) = g_m^2 \sigma^2(\Delta V_T) = g_m^2 \frac{A_{V_T}^2}{WL}.  (4.27)

Hence, similar to the noise analysis, the error variance in the output current σ²(I_z) is given by

\sigma^2(I_z) \approx \left( 1 + \frac{1}{K} + \frac{1}{g_m^2 R_{source}^2 K} \right) g_m^2 \frac{A_{V_T}^2}{WL}.  (4.28)

Equation (4.28) shows that for a large K the accuracy of the MP circuit is limited by the mismatch of the output current-mirror stage N_0. Thus, all layout and impedance-transformation methods used for improving the accuracy of current mirrors, such as common-centroid layout and cascoding techniques, can be used to improve the accuracy of MP circuits. Cascoding the output-stage transistor N_0 increases the output impedance and makes the core MP module scalable for implementing large networks.

4.5.2 Speed and Bandwidth Analysis

As with other current-mode circuits, the transient response of the MP circuit is determined by (a) its slew rate, which determines the large-signal response, and (b) its bandwidth, which determines the small-signal response. The node A in Fig. 4.2 has the largest capacitance and is discharged by the current I_γ.
Therefore, the slew rate of the MP circuit can be expressed as

SR_A = \frac{I_\gamma}{K (C_{GS,N} + C_{GS,P} + C_{DB,P})},  (4.29)

where C_{GS,N} represents the gate-to-source capacitance of NMOS transistors N_1 through N_K, and C_{GS,P}/C_{DB,P} denotes the gate-to-source/drain-to-body capacitance of PMOS transistors P_1 through P_K. Equation (4.29) shows that the slew rate reduces as the number of input branches K increases. However, the small-signal bandwidth of the circuit remains invariant with respect to K; it is determined by the frequency of the pole located at node A and is given by

f_{-3dB,A} = \frac{g_{m,P}}{2\pi (C_{GS,N} + C_{GS,P} + C_{DB,P})},  (4.30)

where g_{m,P} represents the small-signal gate-referred transconductance of the PMOS transistors P_1 ~ P_K. The slew-rate and bandwidth analysis shows that the scaling of the MP circuit is limited by the application-specific speed requirements, which are controlled by the current I_γ. However, unlike in translinear synthesis, this limitation can be overcome in the MP circuit by re-biasing the circuit in strong inversion.

4.5.3 Performance comparison with Translinear synthesis

In this section, we summarize and compare the performance of CMOS circuits designed using MP-based synthesis with circuits designed using translinear synthesis. The comparisons made here are qualitative in nature because many of the performance metrics (silicon area and power dissipation) depend on the topology and complexity of the translinear circuit.

• Accuracy: MP-based synthesis relies on PWL approximations at the algorithmic level. However, the accuracy of its circuit-level implementation using CMOS current-mode circuits is limited only by mismatch. Our analysis in the previous section showed that the techniques used for designing precision current mirrors can also be used to improve the accuracy of MP circuits. TL-based synthesis, on the other hand, is precise at the algorithmic level.
However, its mapping to CMOS current-mode circuits is approximate due to the effect of the sub-threshold slope (unlike bipolar transistors) and finite drain impedance. These artifacts also introduce temperature dependency in TL circuits, whereas MP circuits are theoretically temperature invariant. Also, due to their operation in weak inversion, CMOS TL circuits are more prone to errors due to mismatch, whereas MP circuits can be re-biased in strong inversion if higher precision is desired, at the expense of higher power dissipation.

• Dynamic range: The bias-scalable property of the MP circuit enables it to achieve a larger dynamic range compared to a TL circuit. Also, MP synthesis uses currents to represent log-likelihoods, which likewise yields a larger dynamic range.

• Speed: For an MP circuit, higher speed can be achieved by re-biasing the circuit in strong inversion, without changing the circuit topology. For CMOS translinear circuits, higher speed can be achieved by using larger currents, which implies using larger transistors to maintain operation in weak inversion; this increases the silicon area compared to an equivalent MP circuit.

4.6 Application Example: Pattern Classifier

One of the applications where ultra-low-power analog computing is attractive is pattern classification [61, 62]. Most pattern classifiers do not require high precision, and analog artifacts (like mismatch and non-linearity) can be calibrated out using an offline training procedure. In this example, we use the MP circuits to design a bias-scalable linear classifier. A linear classifier implements an inner-product computation between a weight vector w ∈ R^n and the classifier input x ∈ R^n, given by

f = w^T x.  (4.31)

The weight vector w is determined by an offline training procedure [61] using a labeled training dataset.
The two vectors w and x are expressed in differential form as

w = w^+ - w^- = [w_1^+, w_2^+, ..., w_N^+]^T - [w_1^-, w_2^-, ..., w_N^-]^T,
x = x^+ - x^- = [x_1^+, x_2^+, ..., x_N^+]^T - [x_1^-, x_2^-, ..., x_N^-]^T,  (4.32)

with all w_i^+ (w_i^-) and x_i^+ (x_i^-) ∈ R^+. Then equation (4.31) can be rewritten as

w^T x = \sum_{i=1}^{N} (w_i^+ x_i^+ + w_i^- x_i^-) - \sum_{i=1}^{N} (w_i^+ x_i^- + w_i^- x_i^+)
      = e^{\log \sum_{i=1}^{N} (w_i^+ x_i^+ + w_i^- x_i^-)} - e^{\log \sum_{i=1}^{N} (w_i^+ x_i^- + w_i^- x_i^+)}
      = e^{z^+} - e^{z^-},  (4.33)

where z^+ and z^- are expressed as

z^+ = \log \sum_{i=1}^{N} (w_i^+ x_i^+ + w_i^- x_i^-) \approx M(L_{w1}^+ + L_{x1}^+, L_{w1}^- + L_{x1}^-, ..., L_{wN}^+ + L_{xN}^+, L_{wN}^- + L_{xN}^-, \gamma),
z^- = \log \sum_{i=1}^{N} (w_i^+ x_i^- + w_i^- x_i^+) \approx M(L_{w1}^+ + L_{x1}^-, L_{w1}^- + L_{x1}^+, ..., L_{wN}^+ + L_{xN}^-, L_{wN}^- + L_{xN}^+, \gamma).  (4.34)

L_{wi}^+ (L_{wi}^-) and L_{xi}^+ (L_{xi}^-) denote the logarithms of w_i^+ (w_i^-) and x_i^+ (x_i^-) in equation (4.33). For a binary classifier, only the sign of z^+ - z^- is important, implying that the output stage in the architecture (Fig. 4.1) can be implemented using a comparator (instead of computing e^{z^+} - e^{z^-}). The system architecture and circuit diagram for the pattern classifier are shown in Fig. 4.16.

Figure 4.16: System architecture and circuit diagram of the MP-based pattern classifier.

In this example, we generated a synthetic linearly separable dataset corresponding to two classes: class I and class II. We used a linear support vector machine training algorithm [63] to determine the weight vector w, which was then programmed as currents on the prototype chip. Test vectors x were selected randomly and applied as inputs to the chip. For the sake of visualization, we show only results from a two-dimensional dataset.
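The classifier computation (4.33)–(4.34) can be prototyped in software as two MP units over the 2N sums of log-likelihoods followed by a comparator on z^+ - z^-. The weight vector, offset and γ below are illustrative choices and are not the values programmed on the chip:

```python
import math

def mp(L, gamma):
    # reverse water-filling model of the MP inner-product circuit
    Ls = sorted(L, reverse=True)
    total = 0.0
    for n in range(1, len(Ls) + 1):
        total += Ls[n - 1]
        z = (total - gamma) / n
        if z <= Ls[n - 1] and (n == len(Ls) or z >= Ls[n]):
            return z
    return z

def classify(w, x, gamma=1.0, ofs=3.0):
    """Comparator on z+ - z- approximating sign(w.x), eqs. (4.33)-(4.34)."""
    log = math.log
    Lp, Lm = [], []
    for i in range(len(w)):
        wp, wm, xp, xm = w[i] + ofs, ofs, x[i] + ofs, ofs
        Lp += [log(wp) + log(xp), log(wm) + log(xm)]   # operands of z+
        Lm += [log(wp) + log(xm), log(wm) + log(xp)]   # operands of z-
    return 1 if mp(Lp, gamma) - mp(Lm, gamma) > 0 else -1

w = [1.0, -1.0]                   # toy weight vector: boundary x1 = x2
print(classify(w, [2.0, -1.0]))   # +1 (class I,  w.x > 0)
print(classify(w, [-1.0, 2.0]))   # -1 (class II, w.x < 0)
```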
In this special case, the classifier equation (4.31) reduces to a two-dimensional inner-product formulation, as described in the example in section 4.2. The circuit implementation is therefore identical to that of the four-quadrant multiplier in Fig. 4.10. Fig. 4.17 shows the classification boundary and the classifier scores (negative values are represented by darker shades, positive values by lighter shades) obtained from software simulations and from the fabricated prototype under different biasing conditions. Again, the results show that the operation of the prototype is bias-scalable, and the classification performance has been verified to be similar to that of the software model.

Figure 4.17: Classifier: z^+ - z^- simulated by (a) log-sum-exp function (simulation); (b) MP approximation (simulation); and measured z^+ - z^- with the transistors biased in (c) strong inversion (measurement); (d) moderate inversion (measurement); (e) weak inversion (measurement).

Chapter 5

An analog LDPC decoder implemented on MP-based PWL circuits

5.1 Performance metrics and trade-off of LDPC decoder

Low-density parity-check (LDPC) codes [64] constitute an important class of capacity-approaching error-correcting codes that has seen widespread acceptance in emerging communication standards [65, 66, 67]. One of the key factors behind the success of LDPC codes is their iterative decoding algorithms [68, 69, 70, 71], which are scalable and hence can be easily mapped onto digital [72, 73, 74, 75] and analog [76, 77, 78, 79] hardware.
As applications of LDPC codes expand into new areas like sensor networks [80, 81, 82], it is becoming more important to understand the theoretical foundations that govern and determine the energy efficiency of LDPC decoders. Even though numerous studies reported in the literature evaluate the bit-error-rate (BER) performance of LDPC decoding at different signal-to-noise ratios (SNR) [83, 84], there exists no theoretical basis for understanding the BER/SNR trade-off with respect to the decoder's energy efficiency. Understanding this trade-off is important because:

• Different applications impose different constraints on the energy efficiency of the decoding algorithm. For instance, a decoder used in a sensor network is more severely energy-constrained [81, 82] than a decoder used in the base-station of a wireless network. Therefore, decoding algorithms that provide an elegant trade-off between BER, SNR and power dissipation are important, since superior energy efficiency can be achieved by sacrificing BER performance.

• Understanding the trade-off could lead to a novel class of decoding algorithms that deliver a BER performance similar to that of existing state-of-the-art algorithms while demonstrating superior energy efficiency.

One way to quantify the energy efficiency of a decoding algorithm is to evaluate the sparsity of the messages passed between the nodes of the bipartite graph during LDPC decoding. Here we consider one notion of sparsity: the L0 norm of the messages, defined as the fraction of non-zero messages out of a fixed total. Since a zero-valued message consumes no energy, for a fixed total number of messages a higher fraction of zero messages implies a lower L0 norm and hence lower energy consumption. In this regard, the overall energy consumption can be characterized by the L0 norm of the messages.
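The normalized L0 norm used as the energy proxy above is simple to state in software. A minimal sketch (Python; the `tol` parameter is our own addition, to treat near-zero analog messages as free):

```python
def normalized_l0(messages, tol=0.0):
    """Normalized L0 norm: the fraction of messages whose magnitude
    exceeds tol. With tol=0 only exactly-zero messages count as 'free'."""
    active = sum(1 for m in messages if abs(m) > tol)
    return active / len(messages)
```

Under the assumption that only non-zero messages cost energy, the relative energy consumption of a decoding run is proportional to this quantity.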
This method of energy analysis is relevant to analog implementations because the decoding of large LDPC codes is communication-intensive rather than computation-intensive [77, 78, 79]. Also, it has been shown that in emerging deep sub-micron and nanoscale integrated circuits, the energy dissipated during communication (through interconnects) will be significantly larger than the energy cost of local computations [85]. We use density evolution (DE) [86, 87, 84] as a tool to observe and analyze the L0 norm of the messages generated during iterative LDPC decoding.

5.2 Density evolution and message sparsity

5.2.1 Density evolution analysis for conventional LDPC decoding algorithms

Figure 5.1: Factor graph corresponding to a 32-bit (3, 4) LDPC code, with check nodes c_1, ..., c_24 and variable nodes v_1, ..., v_32.

Density evolution is a well-established technique to determine the asymptotic performance of any LDPC code for a given channel condition and code rate [87]. It provides a way to study the decoding algorithm by analyzing the evolution of message densities on a hypothetical infinite tree-like graph with the same local topology as the LDPC code of interest [88]. In this section, we first provide a brief introduction to density evolution analysis, whose details can be found in [87]. Consider the example factor graph shown in Fig. 5.1. It consists of variable nodes v_k, k = 1, ..., N, which are connected to check nodes c_i, i = 1, ..., M, by edges. In the description that follows, the numbers of edges associated with each check node and variable node (also known as the degrees of the nodes) are denoted by d_c and d_v. Let V_i denote the set of check nodes connected to the variable node v_i, and let V_{i~j} represent the set of check nodes other than c_j that are connected to variable node v_i. Similarly, let C_j denote the set of variable nodes connected to the check node c_j, and let C_{j~i} represent the set of variable nodes other than the node v_i connected to c_j.
In a sum-product based LDPC decoding algorithm [69], each check node c_j receives messages from its set of neighbors C_j (denoted by L(v_i \to c_j)) and computes the messages to be sent to the variable nodes v_i \in C_j (denoted by L(c_j \to v_i)) according to

L(c_j \to v_i) = 2 \tanh^{-1} \left( \prod_{v_k \in C_{j~i}} \tanh \frac{L(v_k \to c_j)}{2} \right)    (5.1)

In subsequent iterations, each variable node v_i receives messages from its neighboring check nodes c_j \in V_i and re-computes the messages that will be sent to the check node c_j (denoted by L(v_i \to c_j)) according to

L(v_i \to c_j) = L(v_i) + \sum_{c_k \in V_{i~j}} L(c_k \to v_i)    (5.2)

where L(v_i) denotes the initial message obtained from the communication channel. Messages are then propagated back and forth between the variable and check nodes for a pre-determined number of iterations before a decision on the received bits is made [69]. An alternative implementation of the LDPC decoder is based on the min-sum approximation of L(c_j \to v_i) in equation (5.1) according to

L(c_j \to v_i) = L(v_k \to c_j),    (5.3)

where k = \arg\min_{\hat{k}: v_{\hat{k}} \in C_{j~i}} |L(v_{\hat{k}} \to c_j)|. Even though min-sum decoders are easier to implement in hardware than sum-product decoders, we will show later in this section and in section 5.3 that the message distributions the algorithm produces are less sparse, and hence that its energy efficiency is inferior to that of sum-product decoders.
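For reference, the node updates (5.1)-(5.3) can be sketched in a few lines of Python (the function names are ours; each function takes the messages from the *other* neighbors, i.e. over C_{j~i} or V_{i~j}):

```python
import math

def check_update_sum_product(incoming):
    """L(cj -> vi) from the messages of all other neighbours (eq. 5.1)."""
    prod = 1.0
    for L in incoming:
        prod *= math.tanh(L / 2.0)
    return 2.0 * math.atanh(prod)

def check_update_min_sum(incoming):
    """Min-sum approximation as stated in eq. (5.3): the outgoing message
    is the incoming message of smallest magnitude. Note: practical min-sum
    decoders usually also multiply in the product of the message signs."""
    return min(incoming, key=abs)

def variable_update(channel_llr, incoming):
    """L(vi -> cj) (eq. 5.2): channel LLR plus all other check messages."""
    return channel_llr + sum(incoming)
```

For two incoming messages {2.0, -1.0}, the sum-product update gives 2 tanh^{-1}(tanh(1) tanh(-0.5)) ≈ -0.735, while the min-sum update returns -1.0.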
Figure 5.2: Message PDF evolution (iterations 1, 3, 5) for a (3,6) LDPC code: (a) sum-product variable-to-check messages; (b) sum-product check-to-variable messages; (c) min-sum variable-to-check messages; (d) min-sum check-to-variable messages.

In this regard, density evolution will serve as an important computational tool to track the distributions of the messages propagated between the variable and check nodes and hence to visualize their sparsity. The DE procedure computes the probability density of the messages during an iterative decoding algorithm based on a recursive estimation procedure that can be evaluated analytically or numerically using Monte-Carlo simulations [87, 86, 84]. By observing the asymptotic message distributions, traditional DE analysis estimates a minimum threshold parameter (e.g. the noise variance of a Gaussian channel) that ensures a BER performance below some specified tolerance. Even though the asymptotic behavior is of importance in DE, we will use the formulation for estimating the sparsity of messages, which will serve as a theoretical estimator of energy efficiency for a given decoder. The underlying assumption in DE analysis is that the graph rooted at a variable node, with messages incident from the variable node, is cycle-free (for a finite number of iterations), which permits the presumption of independence between incoming messages [88]. The independence assumption implies that the summation of messages at variable and check nodes translates to convolution of their probability distributions, making it easier to track the evolution of the message densities [86].
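The convolution step just mentioned is the workhorse of a numerical DE implementation. A minimal discrete sketch (Python, illustrative only; it handles the variable-side summation and ignores the check-side nonlinearity, which is treated later):

```python
def convolve(p, q):
    """Discrete convolution of two message PMFs over a uniform grid.
    Index i + j of the result corresponds to the sum of the supports."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def variable_density(P0, R, dv):
    """Variable-side density P = P0 (x) R^{(x)(dv-1)} for a degree-dv
    variable node, where (x) denotes convolution."""
    P = P0
    for _ in range(dv - 1):
        P = convolve(P, R)
    return P
```

Since convolution preserves total probability, the output of `variable_density` still sums to 1 when its inputs do.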
We now outline the equations which concisely capture the analytical framework for DE analysis. Let the degree distribution function \lambda(x) (\rho(x)) for variable (check) nodes be represented as

\lambda(x) = \lambda_2 x + \lambda_3 x^2 + \cdots + \lambda_n x^{(n-1)} = \sum_{i=2}^{n} \lambda_i x^{(i-1)}    (5.4)

\rho(x) = \rho_2 x + \rho_3 x^2 + \cdots + \rho_{q+1} x^q = \sum_{i=2}^{q+1} \rho_i x^{(i-1)}    (5.5)

where \rho_i is the fraction of edges connected to check nodes with degree i, \lambda_i is the fraction of edges connected to variable nodes with degree i, and x^i represents the convolution of the message density x with itself (i - 1) times. Note: the tuple (\lambda, \rho) is often referred to as the "code degree distribution". Also, let the message distributions corresponding to check and variable nodes after \ell message-passing iterations be denoted by R_\ell and P_\ell. Then, DE analysis computes R_\ell and P_\ell recursively according to:

R_\ell = \Gamma^{-1}\left( \rho\left( \Gamma\left( P_0 \otimes \lambda(R_{\ell-1}) \right) \right) \right)    (5.6)

P_\ell = P_0 \otimes \lambda(R_{\ell-1})    (5.7)

where P_0 represents the probability distribution of the messages received from the communication channel, \otimes represents the convolution operation, and \Gamma is the probability density function (PDF) transformation due to the non-linear operation at each check node. For instance, in sum-product decoding \Gamma is associated with the change of variables related to the tanh(.) function in equation (5.1). Note that for sum-product decoding \Gamma = \Gamma^{-1}. At the final step of the DE recursions given by equation (5.7), the fraction of incorrect messages propagated from the variable nodes to the check nodes can be determined using:

P_e(\lambda, \rho) = \sum_{x < 0} P_\ell(x)    (5.8)

where x is a quantized random variable within a bounded range, with the error region defined relative to the transmitted codeword. The probability distribution functions computed by DE at each iteration step can also be used to compute metrics that quantify the sparsity of messages. One metric of sparsity commonly used in the signal processing literature [89] is the L0 norm. For a vector \theta, \|\theta\|_0 denotes the number of non-zero elements of \theta.
However, the size of the vector is usually taken into account, so \|\theta\|_0 is normalized to the fraction of non-zero coefficients [90]. Let L_\ell(c \to v) (L_\ell(v \to c)) denote the set of messages propagated from check (variable) nodes to variable (check) nodes during the \ell-th iteration, with L0 norm \|L_\ell(c \to v)\|_0 (\|L_\ell(v \to c)\|_0). Since the PDFs computed by DE give the discrete message density, it is convenient to calculate the normalized L0 norms as

\|L_\ell(c \to v)\|_0 = \sum_{x \neq 0} R_\ell(x)    (5.9)

\|L_\ell(v \to c)\|_0 = \sum_{x \neq 0} P_\ell(x)    (5.10)

The random variable x in (5.9) and (5.10) is quantized. In Figure 5.2 we show the mean message PDFs obtained during the iterative decoding of a length-2000, (3,6) LDPC code. The decoding algorithms employed are sum-product and min-sum. To obtain the PDFs, a zero codeword was transmitted through an additive white Gaussian noise (AWGN) channel with an SNR of 2.5 dB before the decoding was carried out. The figure shows trends similar to those previously reported for DE analysis in [87, 86]. The distribution P_\ell (Fig. 5.2 (a) and (c)) shifts to the right, reducing the area under the curve to the left of the origin. As a result, after several iterations the probability of error calculated according to equation (5.8) is small. Additionally, we can infer (by comparing Fig. 5.2 (b) and (d)) that the PDF of the messages propagated from check nodes to variable nodes in the \ell-th iteration, denoted by L_\ell(c \to v), is more concentrated around zero for the sum-product algorithm than for the min-sum algorithm. Consequently, the sum-product algorithm has a lower L0 norm for L_\ell(c \to v) than the min-sum algorithm. Overall, the L0 norm of L_\ell(c \to v) increases with the number of iterations \ell, and the message distribution approaches a Gaussian distribution as the number of iterations increases.
This observation has been verified by several previous studies [91, 92] and is also the basis of the Gaussian-approximation analysis of the sum-product algorithm [91].

5.2.2 Sparse LDPC decoding with MP

To understand how one can achieve sparsity in LDPC decoding, we analyze the core transforms associated with the operator \Gamma in the DE recursion given in equation (5.6). Since the check node computations in LDPC decoding can be performed pair-wise, we analyze the properties of the two-dimensional transforms related to \Gamma. For sum-product decoding the check function is given by [47]

\phi_y^{sp}(x) = \log \frac{1 + e^{x+y}}{e^x + e^y}    (5.11)

where x and y are the two arguments of the function \phi, representing the messages in (5.1) and (5.2); (5.11) gives the result of combining the two messages x and y, and can be applied recursively to compute the sum-product check function. For min-sum decoding the equivalent operation \phi_y^{ms}(x) is given by [47]

\phi_y^{ms}(x) = \max(0, x + y) - \max(x, y)    (5.12)

with x and y as in (5.11); (5.12) is likewise applicable recursively in the check function computation. Figure 5.3 compares the functions \phi_y^{sp}(x) and \phi_y^{ms}(x) for y = 1.0. The plot shows that around the origin the behavior of both functions is linear. As a result, during the first few iterations of LDPC decoding, when the values of the messages propagated from the check to the variable nodes are concentrated around the origin (see Fig. 5.2), the resulting distribution obtained using the transform \Gamma is non-sparse. This indicates that if we choose a transform function that contains a "dead-zone" around the origin while still approximating the sum-product transformation, we could obtain a sparser LDPC decoding procedure.
Figure 5.3: Comparison of the pair-wise decoding functions \phi_{1.0}(x) for the sum-product and min-sum LDPC decoding algorithms.

In the first step of our approximation procedure, the sum-product transform is decomposed as

\phi_y^{sp}(x) = \log(1 + e^{x+y}) - \log(e^x + e^y)    (5.13)

which is a difference of two "log-sum-exp" functions whose general form can be expressed as

z_{log} = \log \left( \sum_{i=1}^{N} e^{L_i} \right)    (5.14)

where {L_i}, i = 1, ..., N, denotes a set of log-likelihood ratios which are used in DE analysis and LDPC decoding. Equation (5.14) can be re-written equivalently as:

\sum_{i=1}^{N} e^{L_i - z_{log}} = 1    (5.15)

Next, we lower-bound the exponential in equality (5.15) using the inequality [x]_+ \leq e^x, where [x]_+ = \max(x, 0) is a threshold function. Equation (5.15) can then be expressed in its piecewise-linear form as

\sum_{i=1}^{N} [L_i - z_{log}]_+ \leq 1.    (5.16)

As seen from equation (5.16), if z_{log} is replaced by z as in equation (3.1), and the bound "\leq 1" is replaced by the constraint "= \gamma" as in equation (3.1), equation (5.16) turns into the MP algorithm. When the MP-based PWL approximation is applied to the "log-sum-exp" functions in equation (5.13), the MP-based check function is obtained as:

\phi_y^{mp}(x) = M(\{0, x + y\}, \gamma) - M(\{x, y\}, \gamma)    (5.17)

Fig. 5.4 compares the margin propagation check function to its sum-product and min-sum equivalents for different values of the hyper-parameter γ. It can be seen that the margin check function is a PWL approximation to the sum-product check function, and that the width of the "dead-zone" around the origin is controlled by the hyper-parameter γ. As γ diminishes to zero, the "dead-zone" shrinks to zero as well and the margin propagation algorithm converges to min-sum, as can be seen in Figure 5.4. The "dead-zone" can also be clearly seen in the three-dimensional plots of the check functions illustrated in Figure 5.5.
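The three pair-wise check functions are easy to compare numerically. A minimal sketch (Python, illustrative only; `margin2` is the two-input closed form of M, which for two scores reduces to max(max(a,b) - γ, (a+b-γ)/2)):

```python
import math

def margin2(a, b, gamma):
    """Two-input margin function M({a, b}, gamma), closed form."""
    return max(max(a, b) - gamma, (a + b - gamma) / 2.0)

def phi_sp(x, y):
    """Sum-product pair-wise check function, eq. (5.11)."""
    return math.log((1.0 + math.exp(x + y)) / (math.exp(x) + math.exp(y)))

def phi_ms(x, y):
    """Min-sum pair-wise check function, eq. (5.12)."""
    return max(0.0, x + y) - max(x, y)

def phi_mp(x, y, gamma):
    """MP pair-wise check function, eq. (5.17)."""
    return margin2(0.0, x + y, gamma) - margin2(x, y, gamma)
```

With γ = 1.75, small inputs fall inside the dead-zone (phi_mp(0.2, 1.0, 1.75) evaluates to 0), whereas letting γ approach zero recovers the min-sum function, consistent with Fig. 5.4.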
Note that for both the sum-product check function (Figure 5.5(a)) and the min-sum check function (Figure 5.5(b)), the origin is a saddle point and the gradient at the origin is non-zero.

Figure 5.4: Comparison of the pair-wise decoding functions \phi_{1.0}(x) for the sum-product, min-sum and margin propagation (γ = 1.25, 1.75, 2.25) decoding algorithms.

Figure 5.5: Comparison of the 3D LDPC check functions for (a) sum-product, (b) min-sum and (c) margin propagation with γ = 1.75.

5.2.3 Density evolution analysis of MP

As shown above, at the core of margin propagation based decoding is the margin approximation algorithm; consequently, we first simplify it. If we consider two likelihood scores with levels \ell_1 and \ell_2, the solution of the algorithm is the level z. Figure 5.6 displays the possible solutions for different values of \ell_1 and \ell_2. Mathematically, the solution z = M(\ell_1, \ell_2) can be stated as:

z = \max(\ell_1, \ell_2) - \gamma, if |\ell_1 - \ell_2| > \gamma; z = \frac{\ell_1 + \ell_2 - \gamma}{2}, otherwise    (5.18)

  = \max\left( \max(\ell_1, \ell_2) - \gamma, \frac{\ell_1 + \ell_2 - \gamma}{2} \right)    (5.19)

Figure 5.6: Possible solutions z of M(\ell_1, \ell_2) for the cases (a) |\ell_1 - \ell_2| > \gamma and (b) |\ell_1 - \ell_2| \leq \gamma.

If the (recursive) operation at the check side of an LDPC graph is represented as L_1 \boxplus L_2, then:

L_1 \boxplus L_2 = M(L_1^+ + L_2^+, L_1^- + L_2^-, \gamma) - M(L_1^+ + L_2^-, L_1^- + L_2^+, \gamma) = z_1 - z_2    (5.20)

where L_1^+, L_1^-, L_2^+, L_2^- are the differential forms of L_1, L_2 given by:

L_1^+ = T + L_1/2,  L_1^- = T - L_1/2
L_2^+ = T + L_2/2,  L_2^- = T - L_2/2    (5.21)

where T denotes the common-mode signal.
Using equation (5.18), the terms z_1 and z_2 in equation (5.20) can be rewritten as:

z_1 = \max(L_1^+ + L_2^+, L_1^- + L_2^-) - \gamma, if |L_1^+ + L_2^+ - L_1^- - L_2^-| \geq \gamma;
z_1 = \frac{L_1^+ + L_2^+ + L_1^- + L_2^- - \gamma}{2}, otherwise    (5.22)

z_2 = \max(L_1^+ + L_2^-, L_1^- + L_2^+) - \gamma, if |L_1^+ + L_2^- - L_1^- - L_2^+| \geq \gamma;
z_2 = \frac{L_1^+ + L_2^- + L_1^- + L_2^+ - \gamma}{2}, otherwise    (5.23)

The computations outlined in equations (5.20), (5.22) and (5.23) transform the PDFs of the incoming messages L_1, L_2 into a new PDF. This PDF transformation can be captured "exactly" by using the surjective function corresponding to the operation L_1 \boxplus L_2. Let us represent this PDF transfer function as follows:

MP: P_{L_1}(\ell_1) \times P_{L_2}(\ell_2) \to P_Z(z), where \ell_1, \ell_2, z \in \mathbb{R}    (5.24)

More specifically, the resulting PDF is calculated as follows:

P_Z(z) = \int_{\ell_1 \in \mathbb{R}} \int_{\{\ell_2 \in \mathbb{R} \,|\, \ell_1 \boxplus \ell_2 = z\}} P_{L_1, L_2}(\ell_1, \ell_2) \, d\ell_2 \, d\ell_1
       = \int_{\ell_1 \in \mathbb{R}} \int_{\{\ell_2 \in \mathbb{R} \,|\, \ell_1 \boxplus \ell_2 = z\}} P_{L_1}(\ell_1) P_{L_2}(\ell_2) \, d\ell_2 \, d\ell_1    (5.25)

In the above equation, the cycle-free assumption [86] for the message-incident sub-graph (rooted at a variable node) allows us to express the joint distribution P_{L_1, L_2}(\ell_1, \ell_2) as the product of the marginals P_{L_1}(\ell_1) P_{L_2}(\ell_2). Equivalently to the parity degree distribution function (5.5) used in the density evolution analysis of the sum-product algorithm, we can now define the following parity degree distribution function for margin propagation:

\rho_{MP}(x) = \rho_2 x + \rho_3 x^{\boxplus 2} + \cdots + \rho_{q+1} x^{\boxplus q} = \sum_{i=2}^{q+1} \rho_i x^{\boxplus (i-1)}    (5.26)

where x^{\boxplus i} represents MP(\cdots MP(MP(x, x), x) \cdots, x) applied i times. Consequently, the density evolution analysis for margin propagation at the check side becomes:

R_\ell = \rho_{MP}\left( P_0 \otimes \lambda(R_{\ell-1}) \right)    (5.27)

Using equations (5.7), (5.25), (5.26) and (5.27) we determine the AWGN noise thresholds for a (3,6) and an irregular LDPC code, which we list in Table 5.1 (under the exact analysis column).
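The differential combine of equations (5.20)-(5.23) can be sketched directly in software. The sketch below (Python, illustrative only; `margin2` is the two-input closed form of M from equation (5.19)) also lets one check numerically that the differential output is independent of the common-mode level T:

```python
def margin2(a, b, gamma):
    """Two-input margin function M({a, b}, gamma), eq. (5.19)."""
    return max(max(a, b) - gamma, (a + b - gamma) / 2.0)

def mp_combine(L1, L2, gamma, T=0.0):
    """Differential MP check-side combine, eqs. (5.20)-(5.23).
    T is the common-mode level of eq. (5.21); the output z1 - z2
    does not depend on it."""
    L1p, L1n = T + L1 / 2.0, T - L1 / 2.0
    L2p, L2n = T + L2 / 2.0, T - L2 / 2.0
    z1 = margin2(L1p + L2p, L1n + L2n, gamma)
    z2 = margin2(L1p + L2n, L1n + L2p, gamma)
    return z1 - z2
```

For instance, with γ = 1.75 the combine of L_1 = 2 and L_2 = 1 evaluates to 0.625 for any choice of T.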
Note that the outer integral over \ell_1 in the PDF equation (5.25) (corresponding to L_1 \boxplus L_2) involves a calculation over the entire real line \mathbb{R}, which we can restrict to sufficiently large values for practical considerations. However, determining the domain of the inner integral in equation (5.25) involves a search operation identifying the region where the equation \ell_1 \boxplus \ell_2 = z is satisfied. This search operation significantly increases the complexity of the density evolution analysis for margin propagation. We therefore look for simplifications which reduce the complexity involved in computing thresholds for the margin propagation algorithm without sacrificing too much accuracy. A similar complexity-reducing approach has been proposed for sum-product threshold calculation and is often referred to as the Gaussian approximation analysis for sum-product [91]. To obtain approximations which simplify the analysis of margin propagation decoding, let us consider the expressions involved in the first part of equation (5.22) in conjunction with equation (5.21).
We get:

L_1^+ + L_2^+ = (T + L_1/2) + (T + L_2/2) = 2T + \frac{1}{2}[L_1 + L_2]
L_1^- + L_2^- = (T - L_1/2) + (T - L_2/2) = 2T - \frac{1}{2}[L_1 + L_2]

To simplify the representation we substitute U = \frac{1}{2}[L_1 + L_2]; then:

L_1^+ + L_2^+ = 2T + U
L_1^- + L_2^- = 2T - U    (5.28)

Substituting the representation (5.28) in the first part of equation (5.22) gives:

\max(L_1^+ + L_2^+, L_1^- + L_2^-) - \gamma = 2T + \max(U, -U) - \gamma = 2T + |U| - \gamma    (5.29)

Substituting the representation (5.28) in the second part of equation (5.22) gives:

\frac{L_1^+ + L_2^+ + L_1^- + L_2^- - \gamma}{2} = \frac{1}{2}[2T + U + 2T - U] - \frac{\gamma}{2} = 2T - \frac{\gamma}{2}    (5.30)

Using equations (5.19), (5.29) and (5.30) we can rewrite z_1 as:

z_1 = \max\left( 2T + |U| - \gamma, \; 2T - \frac{\gamma}{2} \right) = 2T + \max\left( |U| - \gamma, \; -\frac{\gamma}{2} \right)    (5.31)

If the common-mode signal T is set to zero, the term 2T disappears from the above equations. Using arguments similar to those used for z_1, and substituting V = \frac{1}{2}[L_1 - L_2], we can derive the following expression for z_2:

z_2 = \max\left( 2T + |V| - \gamma, \; 2T - \frac{\gamma}{2} \right) = 2T + \max\left( |V| - \gamma, \; -\frac{\gamma}{2} \right)    (5.32)

Again, if we choose T = 0, the term 2T disappears from the above equations. Substituting (5.31) and (5.32) in (5.20) gives:

L_1 \boxplus L_2 = \max\left( |U| - \gamma, \; -\frac{\gamma}{2} \right) - \max\left( |V| - \gamma, \; -\frac{\gamma}{2} \right)    (5.33)

Figure 5.7: Message PDF evolution (iterations 1, 3, 5) for a (3,6) LDPC code under margin propagation: (a) variable-to-check messages; (b) check-to-variable messages.

Table 5.1: Threshold comparison for the AWGN channel under margin propagation and sum-product

\lambda(x)         \rho(x)/\rho_{MP}(x)   MP threshold (γ = 1.75),   MP threshold (γ = 1.75),        Sum-product
                                          exact density evolution    approximate density evolution   threshold
x^2                x^5 / x^{\boxplus 5}   1.14 dB                    1.29 dB (1)                     1.10 dB [87]
0.4x + 0.6x^2      x^4 / x^{\boxplus 4}   1.09 dB                    0.95 dB                         1.07 dB [93]

Let us now focus on calculating the evolution of the message probability densities. The absolute value operation simply folds the probability distribution.
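Before continuing, the simplified combine of equation (5.33) is easy to sanity-check in software. A minimal sketch (Python, illustrative only) evaluates it with the common mode T set to zero:

```python
def mp_combine_simplified(L1, L2, gamma):
    """L1 [+] L2 from eq. (5.33), with the common mode T set to zero."""
    U = (L1 + L2) / 2.0
    V = (L1 - L2) / 2.0
    return max(abs(U) - gamma, -gamma / 2.0) - max(abs(V) - gamma, -gamma / 2.0)
```

With γ = 1.75, combining L_1 = 2 and L_2 = 1 gives 0.625, matching the value obtained from the unsimplified differential form in equations (5.20)-(5.23); flipping the sign of one input flips the sign of the output, as a check-node combine should.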
Moreover, since L_1 and L_2 are drawn from independent random variables (due to the cycle-free assumption), the summation or difference of the two used to obtain U and V simply leads to the convolution of the probability density functions. The calculations associated with U and V are:

U = \frac{1}{2}[L_1 + L_2]    (5.34)

V = \frac{1}{2}[L_1 - L_2]    (5.35)

The operation c = \max(a, b), where a and b are drawn from two independent random variables A and B, gives c with a probability distribution function C determined as follows [94]:

P(C = c) = P(A = c)P(B \leq c) + P(B = c)P(A \leq c)    (5.36)

Footnote 1: Check-side density evolution for x^{\boxplus 5} is calculated by partitioning it into (x^{\boxplus 4}) \boxplus x. Density evolution corresponding to the part within the brackets is calculated using the approximate analysis, followed by exact analysis for the remaining part.

Using the above arguments we can determine the PDFs of z_1 and z_2 from the PDFs of L_1 and L_2. However, determining the PDF of z_1 - z_2 is not trivial, since in general z_1 and z_2 are drawn from dependent random variables. If, however, the random variables corresponding to L_1 and L_2 are independent identically distributed (i.i.d.) and Gaussian in nature, the PDF of z_1 - z_2 is just the convolution of the PDFs corresponding to z_1 and -z_2. Due to the i.i.d. and Gaussian restriction, the density evolution computation for the parity side needs to be carried out in a particular order. For example, if we are interested in determining the PDF of L_1 \boxplus L_2 \boxplus L_3 \boxplus L_4, we first determine the PDFs corresponding to (L_1 \boxplus L_2) and (L_3 \boxplus L_4). This gives two PDFs which are i.i.d. but not Gaussian, so we approximate them with Gaussian distributions before computing the PDF of (L_1 \boxplus L_2) \boxplus (L_3 \boxplus L_4). Using the approximation approach described above in conjunction with equation (5.33) allows us to track the evolution of the message densities at the parity side of an LDPC graph when margin propagation based decoding is employed.
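For quantized densities, the max-of-independents rule of equation (5.36) can be sketched via the CDF product F_C = F_A F_B, a tie-safe discrete analogue of the continuous formula (Python, illustrative only; both PMFs are assumed to share the same ordered support):

```python
def pdf_of_max(pA, pB):
    """Discrete PDF of C = max(A, B) for independent A and B on a
    common ordered support, via the CDF product F_C = F_A * F_B
    (discrete analogue of eq. (5.36))."""
    FA = FB = 0.0
    out = []
    for a, b in zip(pA, pB):
        nFA, nFB = FA + a, FB + b
        out.append(nFA * nFB - FA * FB)  # P(C = current support point)
        FA, FB = nFA, nFB
    return out
```

For two fair coins on support {0, 1}, the maximum has PDF (0.25, 0.75), as expected.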
However, the i.i.d. and Gaussian requirements on the inputs restrict our ability to track message densities to parity degree distribution functions of the following form:

\rho_{MP}(x) = \sum_{i \in \{2^0+1, 2^1+1, 2^2+1, \cdots\}} \rho_i x^{\boxplus (i-1)}    (5.37)

AWGN noise thresholds obtained for a (3,6) and an irregular rate-1/2 LDPC code using the above approximation are listed in Table 5.1. The noise thresholds obtained using the exact and approximate analyses are found to be comparable: the approximate-analysis noise threshold for margin propagation is within 0.16 dB of the actual threshold. The noise threshold for sum-product decoding is included to highlight the minimal performance loss for margin propagation; the thresholds obtained using the exact analysis show that the loss in performance for margin-propagation decoding is less than 0.04 dB. Using the setup described in section 5.2, we can track the evolution of the message densities for margin propagation decoding using simulations for a (3,6) LDPC code of length 1000. Figure 5.7 displays the message density evolution for margin propagation over 5 iterations. The left column corresponds to the messages from the variable side to the check side and the right column corresponds to the messages from the check side to the variable side. The message PDFs behave similarly to those outlined in Figure 5.2 (sum-product algorithm): as the iterations progress, the message PDF shifts to the right, reducing the probability of error.

5.3 Performance simulation results of MP-based LDPC decoder

This section presents Monte-Carlo simulations for evaluating the performance of LDPC decoding based on margin propagation. The results are presented in the following order: first, we analyze the effect of the hyper-parameter γ on the L0 norm of the messages; next, we compare the BER performance of an optimized margin propagation decoder with equivalent sum-product and min-sum decoders.
Finally (keeping the BER performance the same), we compare the sparsity of LDPC decoding for the MP, sum-product and min-sum algorithms by observing the L0 norm of their messages. For all the simulations described in this section, binary phase shift keying (BPSK) modulation and an AWGN channel have been assumed. The experimental parameter being varied is the signal-to-noise ratio (SNR) of the AWGN channel, which can be expressed as

SNR(dB) = -20 \log_{10} \sigma    (5.38)

where \sigma denotes the standard deviation of the noise.

5.3.1 Effect of γ on the L0 norm

Figure 5.8: (a) Effect of the hyper-parameter γ on the L0 norm of the check-to-variable messages (for different AWGN channel conditions); (b) effect of the hyper-parameter γ on the BER performance (for different AWGN channel conditions).

Figure 5.8a shows the L0 norm of the messages propagated from the check side to the variable side for different values of γ. For this experiment, a (2000,1000) regular LDPC code with a (3, 6) parity-check matrix H was chosen. The maximum number of decoder iterations was set to 20, and the average L0 norm was computed over all the decoding iterations. The simulation results presented here consider only the average L0 norm for messages propagating from the check side to the variable side; as we will show in a later experiment, the L0 norm for messages propagated from the variable side to the check side does not exhibit sparse behavior and hence does not provide any efficiency improvement over the sum-product or min-sum algorithms. As the parameter γ increases, Figure 5.8a shows that the L0 norm of the messages decreases. This can be attributed to the size of the "dead zone" in the margin transfer function (see Fig. 5.4), which increases with an increase in γ.
Thus γ is a consistent knob for modulating the sparsity (L0 norm) of the LDPC messages. Also, for a fixed value of γ, the L0 norm increases with an increase in channel SNR. This is because a lower SNR concentrates the likelihood scores around the "dead-zone" (due to noise), leading to a sparser message distribution (lower L0 norm). An important consequence of this experiment is that the sparsity of the MP decoding algorithm can be controlled using γ and can also be adaptively varied depending on the channel conditions. To understand the trade-off between γ, SNR and the BER performance of the MP algorithm, we conducted a three-parameter experiment for a length-1000 (3,6) LDPC code, whose results are summarized in Fig. 5.8b. It can be clearly seen that for a fixed channel SNR there exists an optimal value of γ (and hence of sparsity) that yields the best BER performance. This indicates that a correct choice of γ provides an optimal balance between sparsity and BER performance. Figure 5.8b could also be used as a calibration curve for designing LDPC decoders with varying specifications of sparsity (energy efficiency). Figure 5.8b shows that, for SNRs ranging from 0.5 dB to 2 dB, the BER surface lowers monotonically as γ increases from 0 to around 1.75; as γ increases further from 1.75 to 3, the trend is reversed. It can also be seen that at an SNR of 2 dB, the "waterfall" in the BER curve occurs for γ = 1.75 ahead of the other values. Hence, for SNRs ranging from 0.5 dB to 2 dB, γ = 1.75 is deemed the optimal value for achieving the lowest BER.

5.3.2 Comparison of BER performance

In the next set of experiments we compare the BER performance of the optimal MP decoding algorithm to the equivalent sum-product and min-sum algorithms. Three different lengths of the (3,6) LDPC code were chosen: 256, 10^4 and 10^6. For each of the runs, the optimal parameter γ was found to be 1.75.
Figure 5.9 summarizes the BER performance for the different algorithms. It can be seen from the results that for a 10^6-length code, min-sum decoding incurs a 0.3 dB penalty in performance compared to its sum-product equivalent, whereas MP decoding incurs less than a 0.03 dB penalty. The result shows that the performance of MP and sum-product decoding is nearly identical and is consistent with the observations made in Table 5.1. This has also been verified to be true for a rate-1/2 irregular code. Figure 5.10 compares the BER performance for a (\lambda(x), \rho(x)) \equiv (0.4x + 0.6x^2, x^4) irregular LDPC code, which we refer to as a (2.5,5) code. For this experiment, the optimal value of γ was again found to be 1.75. It can also be seen from the figure that, with higher degrees on both sides, the regular code performs better than the irregular code, with a penalty of no less than 0.7 dB for all three algorithms. This is embodied by the BER "waterfall", which occurs at a lower SNR and is more pronounced. The results presented here differ slightly from the BER metrics reported in the literature [87]; we attribute this difference to the number of decoding iterations used in the reported experiments.

5.3.3 Comparison of message sparsity

The previous results confirmed that the BER performance of an optimized MP decoding algorithm is nearly identical to that of the sum-product decoding algorithm.
However, the MP decoding algorithm produces a sparser distribution of messages than the sum-product algorithm, as confirmed by the following experiment, which uses a (2000, 1000) regular LDPC code with a (3, 6) parity-check matrix.

Figure 5.9: Comparison of the BER of LDPC decoders based on the different algorithms for a (3,6)-regular LDPC code of length 256, 10^4 and 10^6; the maximum iteration number is set to 20.

Figure 5.10: Comparison of the BER of LDPC decoders based on the different algorithms for a length-10^6 (3,6)-regular LDPC code and a (2.5,5) irregular LDPC code; the maximum iteration number is set to 20.

Figure 5.11 shows the comparison of the average L0 norm for messages propagated between the check (variable) and the variable (check) nodes. It can be seen that the L0 norms (sparsity) of the messages propagated from the variable side to the check side are similar across algorithms. For a high SNR, the message convergence is considerably faster and requires only two iterations. However, the L0 norm of the messages propagated from the check side to the variable side exhibits a significant difference between the algorithms: the MP algorithm achieves the best sparsity, whereas the min-sum algorithm achieves the worst. This result shows that the MP algorithm provides an elegant approach for achieving sparse decoding (and hence superior energy efficiency) without sacrificing BER performance.

5.4 Basic circuits of MP-based LDPC decoder

All the basic operations of the check and variable nodes for this LDPC decoder have been described in section 5.2.
In this section, we present the basic modules of the MP-based LDPC decoder. For the check node, we take c1 in Fig. 5.1 as an example. Note that differential circuits are used for both the check-node and variable-node implementations.

5.4.1 Implementation of the check node

As shown in equation (3.1), the MP approximation requires addition, subtraction and rectification. Each of these operations can be implemented in a CMOS process using basic conservation laws (e.g. Kirchhoff's current summation), which scale across transistor operating regimes (weak, moderate and strong inversion). Therefore, the same decoder architecture can be operated at different decoding speeds by scaling the bias current, and hence by scaling its power dissipation.

Figure 5.11: L0-norm comparison of sum-product, min-sum and margin propagation for iterations 1, 2 and 3; check-to-variable messages: (a), (b), (c); variable-to-check messages: (d), (e), (f).

In this section, we describe the current-mode circuits that have been used for implementing the check-node module of the proposed MP-based LDPC decoder.
A differential topology has been adopted to represent the positive and negative log-likelihood ratios and to cancel out any common-mode interference. We use the check node c1 in Fig. 5.1 as an example to explain the implementation details. c1 is connected to the variable nodes v1, v9, v17 and v24. To avoid long mathematical notation, we denote the message from c1 to v24, L(c1 → v24), as z^sp, and the messages L(v1 → c1), L(v9 → c1) and L(v17 → c1) as L1, L2 and L3 respectively. The check function, according to equation (5.1), can then be written as:

z^sp = log[ (e^(L1) + e^(L2) + e^(L3) + e^(L1+L2+L3)) / (1 + e^(L1+L2) + e^(L1+L3) + e^(L2+L3)) ].   (5.39)

If the log-likelihood ratios Li are represented in differential form as Li = Li^+ - Li^-, with Li^+, Li^- > 0 and i = 1, ..., 3, then equation (5.39) can be rewritten as:

z_diff^sp = log[ (e^(L1^+ + L2^- + L3^-) + e^(L1^- + L2^+ + L3^-) + e^(L1^- + L2^- + L3^+) + e^(L1^+ + L2^+ + L3^+)) / (e^(L1^+ + L2^+ + L3^-) + e^(L1^+ + L2^- + L3^+) + e^(L1^- + L2^+ + L3^+) + e^(L1^- + L2^- + L3^-)) ].   (5.40)

Once the check function has been expressed in terms of log-sum-exp functions, we can apply the MP-based approximation according to the procedure described in section 5.2. The MP approximate check function (3.2) is written as

z_diff^mp = z^+ - z^-,   (5.41)

where z^+ and z^- are given by

z^+ = M(L1^+ + L2^- + L3^-, L1^- + L2^+ + L3^-, L1^- + L2^- + L3^+, L1^+ + L2^+ + L3^+, γ),
z^- = M(L1^+ + L2^+ + L3^-, L1^+ + L2^- + L3^+, L1^- + L2^+ + L3^+, L1^- + L2^- + L3^-, γ).   (5.42)

These equations also apply to the computation of the messages L(c1 → v1), L(c1 → v9) and L(c1 → v17). Fig. 5.12 shows the architecture of the check node c1 implementing equations (5.42). The differential messages generated by the variable nodes are selected, and the permutations and summations are carried out according to equation (5.42). The summation results are then processed by individual MP units as described in (5.42).
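The M(·, γ) operator in (5.42) can be prototyped in software as a reverse water-filling constraint: find the threshold z such that the rectified scores above z sum to γ. A minimal sketch with hypothetical differential message values (γ = 1.75, the value found optimal in section 5.3; units arbitrary):

```python
def margin_prop(scores, gamma, iters=60):
    # reverse water-filling: solve sum_i max(s_i - z, 0) = gamma for z by
    # bisection; the left-hand side is continuous and non-increasing in z
    lo, hi = min(scores) - gamma, max(scores)
    for _ in range(iters):
        z = 0.5 * (lo + hi)
        if sum(max(s - z, 0.0) for s in scores) > gamma:
            lo = z
        else:
            hi = z
    return 0.5 * (lo + hi)

# hypothetical differential messages L_i^+ and L_i^- entering check node c1
Lp, Lm, gamma = [2.0, 1.5, 1.8], [0.0, 0.0, 0.0], 1.75
z_plus = margin_prop([Lp[0] + Lm[1] + Lm[2], Lm[0] + Lp[1] + Lm[2],
                      Lm[0] + Lm[1] + Lp[2], Lp[0] + Lp[1] + Lp[2]], gamma)
z_minus = margin_prop([Lp[0] + Lp[1] + Lm[2], Lp[0] + Lm[1] + Lp[2],
                       Lm[0] + Lp[1] + Lp[2], Lm[0] + Lm[1] + Lm[2]], gamma)
z_mp = z_plus - z_minus   # differential check-to-variable message, eq. (5.41)
```

For these inputs the differential MP output has the same sign as, and a magnitude close to, the exact log-domain check message of (5.40); the circuit in the next section realizes the same constraint with rectifying diodes and Kirchhoff summation.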
The outputs produced by the respective MP units (z^+ and z^- in (5.42)) are then propagated to the neighboring variable nodes. Fig. 5.13a shows the current-mode implementation of equations (5.42). The currents I_Li^+(-) in Fig. 5.13a represent the differential variables Li^+(-) in equation (5.42), i = 1, 2, 3. The hyper-parameter γ in equation (5.42) is represented by the current Iγ. The circuit has two operational modes: (a) when the reset signal RST is set to logic low, transistor N1 pulls the gate voltages of transistors N2-N6 low and the output current I_z^+ is set to zero (implying that no message is conveyed); (b) when the reset signal RST is set to logic high, the gate voltage of N1 is set to Vγ, which determines the current Iγ. At equilibrium, the output current I_z^+ is determined by the MP condition (3.1). The PMOS diodes P1 through P4 ensure a unidirectional flow of current, thus implementing the rectification operation in equation (3.1). The subtraction operations in equation (3.1) are implemented by transistors N2-N5, and the summation constraint is implemented using Kirchhoff's current summation at the node K. However, the diodes introduce a threshold voltage drop,

Figure 5.12: Architecture of the check node c1 in Fig.
5.1.

Figure 5.13: Circuits for implementing the check-node module using the MP algorithm: (a) design used in this paper; and (b) an alternate low-voltage circuit.

requiring a larger supply voltage (≥ 2Vth + Vdsat). An alternative low-voltage implementation is shown in Fig. 5.13b, which consists of a cascode of PMOS transistors P1 through P4. This implementation eliminates the threshold voltage drop but potentially suffers from current leakage between neighboring stages, as shown in Fig. 5.13b. This situation can occur when one of the branch currents is much larger than the other. In the proposed implementation, we have chosen the circuit shown in Fig. 5.13a. Since the operation of the circuit is based on Kirchhoff's current law, it is functional irrespective of the biasing condition of the transistors. Consequently, the magnitude of the current can vary over a wide range, from pA to µA, which only affects the operational speed of the circuit. However, the matching between the transistors N2-N5 is important because it ensures that the constraint in equation (3.1) is accurately satisfied.

5.4.2 Implementation of the variable node

We will use the variable node v1 to illustrate the basic operation described by equation (5.2). The architecture of the variable node is shown in Fig. 5.14, and the circuit implementation for computing the message propagated from variable node v1 to check node c20 is shown in Fig. 5.15. The current I_v1^+ represents the positive part of the differential message L(v1) (L(v1) = L^+(v1) - L^-(v1)). Similarly, I_c1→v1^+ and I_c8→v1^+ represent L^+(c1 → v1) and L^+(c8 → v1).
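The net behavior of this variable-node output stage, rectifying the difference of the summed positive and negative message currents as formalized in equation (5.44), can be sketched as follows (the current values are hypothetical, units arbitrary):

```python
def variable_to_check(i_plus, i_minus):
    # differential rectification: only the sign-carrying excess current
    # propagates on one rail; the other rail carries zero
    diff = i_plus - i_minus
    if diff > 0.0:
        return diff, 0.0      # (L^+, L^-) of the outgoing message
    return 0.0, -diff

# i_plus and i_minus stand for the summed currents I^+ and I^- of eq. (5.43)
assert variable_to_check(3.0, 1.5) == (1.5, 0.0)
assert variable_to_check(1.0, 2.5) == (0.0, 1.5)
```

The common-mode component of the two input currents cancels, which is why only the differential excess is propagated to the check node.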
These currents are summed at node B and mirrored through transistors P0 and P1. Similarly, the current through P2 equals the sum of all the negative portions of the differential messages. For brevity, we denote the two summed currents as I^+ and I^-, as shown in Fig. 5.15, where

I^+ ⇒ L^+(vi) + Σ_{ck ∈ Vi\j} L^+(ck → vi),
I^- ⇒ L^-(vi) + Σ_{ck ∈ Vi\j} L^-(ck → vi).   (5.43)

The two currents I^+ and I^- are compared at node A, and when I^+ is greater than I^-, the current through P5 equals the positive part of the differential message propagated from the variable node to the check node. A similar circuit computes the negative part of the differential message. Thus, we realize the differential form of (5.2) as

L^+(vi → cj) ⇐ I^+ - I^-, if I^+ - I^- > 0; 0, otherwise.
L^-(vi → cj) ⇐ 0, if I^+ - I^- > 0; I^- - I^+, otherwise.   (5.44)

Note that the transistors P3 and P4 in Fig. 5.15 act as current limiters, whereby the common-mode current, and hence the total power dissipation of the decoder, can be limited using the voltage VB.

5.5 Measurement Results and Discussions

Fig. 5.16 shows a system-level schematic of the (32,8) LDPC decoder corresponding to the Tanner graph shown in Fig. 5.1. Besides the variable-node and check-node modules, the decoder also integrates the following modules to facilitate output buffering and testing through an external digital interface (an FPGA in this paper):

Figure 5.14: Architecture of the variable node v1 in Fig.
5.1.

Figure 5.15: Circuit implementing the variable node.

(a) A 6-bit current-mode digital-to-analog converter (DAC), used to digitally program the initial messages (currents) in equation (5.43). The MSB of the DAC is denoted by d1 and the LSB by d6. The architecture of the current DAC (shown in Fig. 5.16) is based on the popular resistive divider [95], where the current through each branch is twice as large as that of its neighbor. The output, determined by the bits d1-d6, is a binary-weighted sum of currents that is mirrored into each of the variable-node modules.

(b) Output comparators, which compute the decoded bits by comparing the differential variable-node currents once the decoder has reached equilibrium. For example, if I_V^+ is greater than I_V^-, the output of the comparator Q is logic "high", indicating that the bit "0" has been decoded. Note that the comparator needs to be reset before each comparison is made.

(c) A digital sample-and-hold input buffer: a shift-register chain that converts the serial input Din to the decoder IC into parallel bit slices that are processed by the DAC module. As shown in Fig. 5.16, the shift-register chain is driven by two sets of non-overlapping control signals, Φ1 and Φ2 (each with its complement). In one period, when (Φ1 = 1 and Φ2 = 0) the previous-stage data di-1 is sampled, and when (Φ1 = 0 and Φ2 = 1) the sampled data is held while the DAC module processes the latched bits.

(d) A digital sample-and-hold output buffer: also a shift-register chain, which samples the parallel decoded bits (from the comparator modules) and converts them into a bit-serial format Qout.

The microphotograph of a prototype MP-based LDPC decoder fabricated in a 0.5 µm standard CMOS process is shown in Fig. 5.17. Table 5.2 summarizes the main specifications of the chip.
Table 5.2: Specifications of the MP-based analog LDPC decoder

Fabrication Process:        Standard CMOS 0.5 µm
Die Size:                   3000 µm × 3000 µm
Number of Transistors:      11424 (6336 PMOS, 5088 NMOS)
Number of Variable Nodes:   32
Number of Check Nodes:      24

The measurement setup used to evaluate the fabricated chip is shown in Fig. 5.18. The decoder chip is hosted on a daughter board, which is mounted on a test-station mother board. A second mountable daughter board hosts a field-programmable gate array (FPGA), which is responsible for clock generation and data synchronization. A sample timing diagram of the digital signals used by the LDPC chip (shown in Fig. 5.16) and generated by the FPGA is shown in Fig. 5.19. The FPGA selectively programs the on-chip DACs to emulate data received over an additive white Gaussian noise (AWGN) channel, after which the FPGA enables the analog decoding core. The output of the decoder is latched on the comparators after a pre-determined decoding time, and the latched bit-stream is serially retrieved by the FPGA. As shown in Fig. 5.18, the experimental setup is also controlled through a Visual Studio interface on a PC. A C script emulates the AWGN channel and generates 6-bit signal values at different SNR levels.

Figure 5.16: System-level architecture of the (32,8) LDPC decoder.

Figure 5.17: Die microphotograph of the chip.

These bit patterns are then stored in the SDRAM of the FPGA, which uses the values to perform real-time decoding. After each noisy transmission, decoded bits are retrieved from the LDPC chip and stored in the FPGA's SDRAM. At the end of each Monte Carlo run, the Visual Studio interface transfers the data logged in the SDRAM and computes the overall BER.
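The 6-bit binary-weighted current DAC that the FPGA programs with the emulated channel values can be modeled behaviorally as follows; the 1 µA MSB branch current is a hypothetical value, not the fabricated bias:

```python
def dac_current(bits, i_msb):
    # 6-bit binary-weighted current DAC: d1 (bits[0]) is the MSB, and each
    # subsequent branch carries half the current of its neighbor
    assert len(bits) == 6
    return sum(d * i_msb / (2 ** k) for k, d in enumerate(bits))

full_scale = dac_current([1, 1, 1, 1, 1, 1], 1.0e-6)   # 63/32 of the MSB current
lsb_step = dac_current([0, 0, 0, 0, 0, 1], 1.0e-6)     # 1/32 of the MSB current
```

The full-scale output is 63 LSB steps, consistent with the binary-weighted branch structure of the resistive-divider DAC.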
The MATLAB interface is also used to adapt the bias voltages of the decoder through a National Instruments data-acquisition card. Fig. 5.20 compares the BER performance of a software implementation of the MP-based decoder with the measured results obtained from the fabricated prototype. For this experiment the decoding throughput was set to 320 Kbps, and the hyper-parameter γ was chosen optimally on the basis of iterative experiments described later. At low SNR, the hardware implementation outperforms its software counterpart, which could be attributed to the continuous-time dynamics of the analog

Figure 5.18: Experimental setup of the chip. The FPGA board (a) sends noisy bits as DAC values to the chip (stored in the SDRAM on the FPGA board); (b) receives the decoded bits to calculate bit error rates; and (c) programs the analog signals through the MATLAB test station.

Figure 5.19: Timing diagram of the digital signals used to control the operation of the LDPC decoder chip.

Figure 5.20: Comparison of the BER of the MP-based LDPC decoder obtained from software simulation and from measurements of the fabricated prototypes.

Figure 5.21: Comparison of the BERs of the measured MP-based LDPC decoder and a benchmark min-sum LDPC decoder.

Figure 5.22: Comparison of measured BER performances for different values of the hyper-parameter γ.
Figure 5.23: Measured BER performances of the MP-based LDPC decoder (Vγ = 0.74 V) for different decoding times (throughput).

decoder. However, at high SNR the software implementation outperforms its hardware counterpart, which could be attributed to the limited dynamic range of the programming DACs and to the offset and gain errors introduced by the current mirrors. Fig. 5.21 shows measured BER curves obtained under two conditions: (a) when the system is configured to operate as a min-sum decoder (γ ≈ 0) [45] and the decoding throughput is set to 320 Kbps; and (b) when the MP-based decoder is configured with an optimal setting of the hyper-parameter γ > 0. For comparison, we have included the BER results reported for a state-of-the-art analog min-sum decoder [77]. The results in Fig. 5.21 show that for γ = 0, the performance of the MP-based min-sum LDPC decoder is inferior to the benchmark min-sum decoder [77] at low SNR; however, at high SNR (> 3.5 dB) the MP min-sum decoder outperforms the benchmark. When γ > 0, the results show that the MP-based LDPC decoder outperforms the benchmark min-sum decoder by more than 3 dB. Fig. 5.22 shows the measured BER curves for different values of the hyper-parameter Vγ. Again, for this experiment the decoding throughput is set to 320 Kbps. As Vγ (and hence γ) increases, the BER of the MP-based LDPC decoder first improves and then degrades, which is consistent with the BER-SNR-γ trade-off previously demonstrated only in simulation (shown in Fig. 5.8b). Fig. 5.23 shows the measured BER performance (at different SNRs) as the decoding time (the inverse of the decoding speed) is varied. For this experiment, Vγ is held constant at 0.74 V (corresponding to one of the four curves shown in Fig. 5.22).
It is seen that as the decoding time increases, the BER performance improves, which is consistent with the response of previously reported analog decoders [77]. When the decoding time is reduced below 2.5 µs (a data throughput of 12.8 Mb/s), the BER performance starts to degrade rapidly. When the decoding time is as low as 833 ns (a data throughput of 38.4 Mb/s), the BER (measured at an SNR of 7 dB) can be 13 times higher than the BER measured at a data throughput of 12.8 Mb/s. Table 5.3 compares the measured specifications of the fabricated MP-based LDPC decoder with those of different digital and analog decoders reported in the literature. The comparison shows that the MP-based LDPC decoder has the second-highest throughput among the reported analog decoders. The turbo decoder with the highest throughput [96], however, has a much lower energy efficiency than the MP-based design, and it also has a longer codeword length. The benchmark min-sum LDPC decoder [77] has the same code length as the proposed MP-based LDPC decoder; however, its reported energy efficiency is based on the power consumption of the whole chip, whereas ours is based on the core, which makes a direct comparison difficult. Nevertheless, the throughput of the MP-based LDPC decoder is more than twice that of the benchmark min-sum LDPC decoder. In terms of energy efficiency, the MP-based LDPC decoder ranks second; the (8,4) Hamming Trellis-graph decoder, which has the highest energy efficiency, has a much lower throughput than this implementation. In terms of silicon area, the proposed implementation achieves an integration density comparable to that of [77], considering the differences in the respective technology feature sizes. The comparison table also shows that state-of-the-art digital decoders can achieve energy efficiencies as high as those of the analog decoders.
This is because these implementations exploit highly parallel architectures [72, 73], early-termination techniques [97], post-processing methods [98] and aggressive feature-size and voltage scaling. However, it should be noted that, unlike analog decoders, digital decoders require an analog-to-digital converter to digitize the analog channel information. For instance, the digital decoder reported in [98] requires 4- to 6-bit digital inputs, and the energy-efficiency metric (pJ/bit) reported in Table 5.3 does not incorporate the power dissipation of the ADC.

Table 5.3: CMOS digital (top 4) and analog (bottom 6) decoders. The digital entries are Blanksby et al. [73], Bickerstaff et al. [72], Darabiha et al. [97] and Zhang et al. [98]; the analog entries are Gaudet et al. [96], Winstead et al. [99], Amat et al. [100], Winstead et al. [78], Hemati et al. [77] and Gu et al. (this work, (32,8) LDPC, 0.5 µm CMOS, 3.3 V). The decoders are compared in terms of code, CMOS technology, core area (mm²), power (mW), throughput (Mb/s) and energy efficiency (nJ/b).

Chapter 6

An analog SVM implemented on MP-based PWL circuits

6.1 History of SVM

SVM theory provides a principled approach for designing classifiers (binary and multi-class), with a strong foundation in statistical learning theory.
Due to their excellent generalization ability and modest requirements on the size of the training data, SVMs have been successfully applied to numerous real-time recognition tasks [101]. Even though several digital implementations of SVMs have been reported in the literature [102], they have been mapped onto analog VLSI using translinear CMOS circuits [103, 104] only recently. Here, we re-formulate SVMs to operate in a log-likelihood domain and apply MP-based transforms to map the architecture onto CMOS circuits. The result is a general-purpose micro-power SVM hardware that can be used in applications ranging from communications to biometrics.

6.2 An MP-based SVM

In its general setting, a binary SVM computes a decision score F ∈ R for every presented feature vector z ∈ R^N, given by:

F = Σ_{i=1}^{N} αi K(xi, z) + b.   (6.1)

K : R^N × R^N → R denotes a kernel function computed between the input vector z and an array of templates (support vectors) xi ∈ R^N. The parameters αi ∈ R and b ∈ R are obtained through a supervised training procedure [105]. Even though several choices of kernel function exist, a second-order polynomial kernel, K(xi, z) = [(xi · z) + β]^2, has been chosen. All the parameters of the SVM are represented in differential form as αi = αi^+ - αi^- and b = b^+ - b^-, which leads to a differential representation of equation (6.1) as:

F = F^+ - F^-,   (6.2)

where F^+ = Σ_i αi^+ K(xi, z) + b^+ and F^- = Σ_i αi^- K(xi, z) + b^-.

The inner product of xi and z can also be computed in differential form:

xi^T · z = Σ_{j=1}^{N} (xij^+ - xij^-)(zj^+ - zj^-)
        = Σ_{j=1}^{N} (xij^+ zj^+ + xij^- zj^-) - Σ_{j=1}^{N} (xij^+ zj^- + xij^- zj^+).   (6.3)

Here xij and zj are the elements of the vectors xi and z. Equation (6.3) can be approximated with MP units as:

z^+ ≈ M(log xi1^+ + log z1^+, ..., log xiN^+ + log zN^+, log xi1^- + log z1^-, ...
, log xiN^- + log zN^-, γ),
z^- ≈ M(log xi1^+ + log z1^-, ..., log xiN^+ + log zN^-, log xi1^- + log z1^+, ..., log xiN^- + log zN^+, γ).   (6.4)

For a classification task, only the sign of F is important, and the monotonicity of log(·) implies sign(F) = sign(log(F^+) - log(F^-)). Therefore, the decision generated by the SVM can be expressed as

sign(F) = sign(z^+ - z^-).   (6.5)

The system architecture and circuit diagram of the MP-based SVM are shown in Fig. 6.1. The MP-based SVM architecture can be partitioned into three major sub-systems, as shown in Fig. 6.1. It consists of a log-MAP matrix-vector multiplier that computes compressive inner-product kernels based on margin propagation. The outputs of the matrix-vector multiplier are scaled in the log domain, which is equivalent to a cubic transformation of an inner-product kernel. The scaled factors are then presented to an output normalization stage that computes the decision function according to equation (6.5). A current-mode circuit for implementing the log inner-product computation log(x^T y) based on MP circuits is also shown in Fig. 6.1. At the core of the network is the basic MP cell shown in Fig. 4.2. Currents representing the logarithms of the vector elements are summed according to Kirchhoff's current law. A constant current sink of magnitude γ constrains the total current at the gate of the output transistor; the reverse water-filling criterion is thus implemented.

Figure 6.1: The circuit diagram of the MP-based SVM.

For the SVM sub-system in Fig. 6.1, the circuit is replicated for each template (support vector). The output current from the first stage is mirrored and scaled to an outer normalization stage that implements another MP normalization procedure.
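The log-domain decision path of equations (6.3)-(6.5) can be sketched in software as follows. The margin_prop solver, the element floor eps, and the test vectors are illustrative assumptions; the hardware realizes the same constraint with the reverse water-filling cell of Fig. 4.2:

```python
import math

def margin_prop(scores, gamma, iters=60):
    # reverse water-filling: solve sum_i max(s_i - z, 0) = gamma for z
    lo, hi = min(scores) - gamma, max(scores)
    for _ in range(iters):
        z = 0.5 * (lo + hi)
        if sum(max(s - z, 0.0) for s in scores) > gamma:
            lo = z
        else:
            hi = z
    return 0.5 * (lo + hi)

def split(v, eps=1e-12):
    # differential representation v = v_plus - v_minus (floor keeps logs finite)
    return max(v, eps), max(-v, eps)

def mp_inner_product_sign(x, z, gamma=1.2):
    # sign(x . z) from the MP approximations z_plus, z_minus of eq. (6.4)
    same, mixed = [], []
    for xv, zv in zip(x, z):
        xp, xm = split(xv)
        zp, zm = split(zv)
        same += [math.log(xp) + math.log(zp), math.log(xm) + math.log(zm)]
        mixed += [math.log(xp) + math.log(zm), math.log(xm) + math.log(zp)]
    return 1 if margin_prop(same, gamma) > margin_prop(mixed, gamma) else -1
```

For example, mp_inner_product_sign([0.8, -0.3], [0.5, 0.4]) recovers the sign of 0.8·0.5 - 0.3·0.4 = +0.28, consistent with equation (6.5).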
Consequently, the outer normalization stage is also implemented using the reverse water-filling circuit of Fig. 4.2. The SVM parameters αi and xi are stored as currents on floating-gate transistors, illustrated in the highlighted box of Fig. 4.2. Each of the storage cells is individually addressable via a serial shift-register chain. Programming is achieved by using hot-electron injection in conjunction with global tunneling [106]. There are 14 reverse water-filling cells per support vector in the log-MAP matrix-vector multiplier. The number of cells in the output normalization sub-system is proportional to the total number of support vectors. Since the positive and negative class labels must be split to maintain positive current flow during analog computation, a fully-differential implementation of the output stage is required to compute the decision score.

6.3 Simulation result

Fig. 6.2 shows the PWL approximation effect of the MP-based algorithm on the inner product expressed by equation (4.34), with N = 2. Fig. 6.2a shows the result computed by the "log-sum-exp" function; Fig. 6.2b shows the result computed by the MP-based function simulated in Matlab; and Fig. 6.2c shows the result computed by the MP-based function simulated in Cadence.

Figure 6.2: Multiplication computed by (a) the "log-sum-exp" function; (b) the MP-based function with γ = 1.2 (in Matlab); (c) the MP-based function with γ = 1.2 (in Cadence).

6.4 Circuit Implementation

The prototype, fabricated in a 0.5 µm CMOS process, integrates an array of floating-gate transistors that serves as storage for up to 740 SVM parameters, as well as novel circuits designed for interfacing with an external digital processor.
These include a novel current-input current-output logarithmic amplifier circuit that achieves a dynamic range of 120 dB while consuming nanowatts of power, and a novel varactor-based, temperature-compensated floating-gate memory that demonstrates a programming range superior to competing designs.

6.4.1 Floating-gate memory cells

The prototype uses a floating-gate memory array to store the SVM parameters. The system architecture and the basic circuit of the floating-gate memory cells are shown in Fig. 6.3. The prototype embeds the floating-gate memory array, temperature-compensation circuitry and logarithmic-amplifier (translinear) circuitry. It also provides a connection pin to a Keithley instrument for current calibration and measurement, as well as a pin for floating-gate programming. The memory array is composed of 740 floating-gate memory cells, distributed into 18 rows and 114 columns. Each cell is addressable through its address switches, Row_select and Col_select1-3; these switches are set by the programming shift registers. As can be seen from Fig. 6.3, the system can also be switched between two states by setting the signal RUN. Once RUN is set to logic high, the parameters stored in the memory cells (as currents) flow into the MP-based computational circuit and the computation is carried out. Once RUN is set to logic low, the stored current is instead routed to the Keithley so that calibration and measurement can be executed. The current through P2 represents the support vectors. As seen in Fig. 6.3, P2 and P1 form a pair of floating-gate transistors; hence the current through P2 is tunable through charge injection and tunneling. During injection, the memory cell to be programmed is selected by the address switches and a negative voltage is applied to the Program pin so that charge is injected onto the gates of P1 and P2. During tunneling, a high voltage is applied to Vtunnel.
Vc is a voltage applied to the floating-gate control capacitor Cc, which can be used to tune the gate voltage of P1 and P2 and hence control the current flowing through them. Vx is the voltage applied to the varactor Cv and is used to tune the value of Cv. This property is exploited for temperature compensation, protecting the memory array against temperature variations; we discuss this feature in detail in later sections. The current through P3 represents the input vectors. It is controlled by Vdd_translinear and Vg_translinear, which are supplied by the reference circuitry, i.e. the current-input current-output logarithmic amplifier circuitry.

6.4.2 Current-input current-output logarithmic amplifier circuit

Logarithmic amplifiers generate output signals that are proportional to the logarithm of the input [107]. Due to the compressive nature of the logarithm function, such an amplifier is useful for processing wide-dynamic-range signals such as sensory (auditory or visual) signals or multi-path signals in radio-frequency and radar applications. The most popular technique for implementing logarithmic amplifiers is a transimpedance-based approach that exploits the exponential dependence between the current and the voltage across a p-n junction diode, a bipolar transistor [14] or a MOSFET in the sub-threshold region [108, 109]. Based on this approach, amplifiers with an input dynamic range greater than 100 dB have been reported.
However, the transimpedance-based approach has three disadvantages: (a) the output of a transimpedance logarithmic amplifier is a voltage signal, which implies that the output dynamic range is limited by the supply voltage; (b) owing to its dependence on the current-to-voltage relationship, a transimpedance logarithmic amplifier is sensitive to temperature variations and therefore requires additional compensation circuitry; and (c) for current-mode analog VLSI computation, the use of a transimpedance logarithmic amplifier requires a linear transconductor stage to convert the output voltage into a current. In our study, we propose the design of a current-input, current-output logarithmic amplifier which, by construction, is insensitive to variations in temperature.

Figure 6.3: The system architecture and basic circuit of the floating-gate memory cells.

At the core of the proposed design is a translinear Ohm's law principle that exploits floating-voltage sources and linear/non-linear resistive elements embedded within a translinear loop. By exploiting multiple translinear networks, temperature compensation is achieved through a resistive cancellation technique. We verify the functionality of the circuit using measured results from fabricated prototypes and justify the results using mathematical models and analysis.
We also extend the design of the logarithmic amplifier to be fully programmable and digitally addressable: all the current references are made programmable using analog floating gates, and the input to the amplifier is made digitally programmable using a standard ladder-based digital-to-analog converter (DAC).

6.4.2.1 Translinear Ohm's Law based Logarithm Computation

The concept of the translinear Ohm's law is an extension of the celebrated translinear principle [14], which exploits the exponential relationship between voltage and current in certain devices (diodes, BJTs and sub-threshold MOSFETs). The translinear principle has been successfully used for implementing linear analog computation, such as matrix-vector multiplication [17], and non-linear analog computation, such as quadratic kernels [61]. The translinear Ohm's law is first explained in this section using the example circuit shown in Fig. 6.4, which consists of several diodes D1-4 acting as translinear elements, connected to a floating-voltage source ∆V and a memoryless, non-linear circuit element N. The element N is characterized by a function f(·) that models the relationship between the current flowing through N (IN) and the voltage across N (VN) according to

IN = f(VN).   (6.6)

If the voltage drop across each diode in Fig. 6.4 is denoted by V1-4, then

V1 + V2 + ∆V = V3 + V4 + VN,   (6.7)

which, after using the translinear diode equation I1-4 = Is exp(V1-4/UT), leads to

I1 · I2 · exp(∆V/UT) = I3 · I4 · exp(VN/UT),   (6.8)

and hence

VN = ∆V + UT ln[(I1 · I2)/(I3 · I4)],   (6.9)

IN = f(∆V + UT ln[(I1 · I2)/(I3 · I4)]).   (6.10)

UT in equations (6.8)-(6.10) denotes the thermal voltage, which is a linear function of absolute temperature and is approximately equal to 26 mV at room temperature (25 °C). If the currents I1-4 satisfy the relation I1 · I2 = I3 · I4, then

IN = f(∆V).
(6.11)

If N is a resistor with resistance R, then f(∆V) = ∆V/R, which when inserted in equation (6.11) leads to an equivalent Ohm's law (note that the current IN is not directly drawn from VN). If N is a general type of resistor (linear or non-linear) with f(0) = 0, then f can be approximated using its first-order and second-order Taylor series terms as f(∆V) ≈ f′(0)∆V + (f″(0)/2)(∆V)². Inserting the approximation in equation (6.11) leads to

IN = f′(0)∆V + (f″(0)/2)(∆V)². (6.12)

Note that f′(0) has the units of a transconductance and equation (6.12) is only satisfied when the translinear condition I1 · I2 = I3 · I4 is satisfied, hence the name "translinear Ohm's law". For the sake of simplicity and to minimize mathematical clutter, we will assume f″(0) ≈ 0 for the rest of the analysis and will reintroduce the higher-order terms later to understand the limitations of the proposed method.

Figure 6.4: The basic concept of translinear Ohm's law.

The translinear Ohm's law is now applied towards approximating the logarithm of a current using the circuit in Fig. 6.5. The circuit is derived from a translinear circuit reported in [61], where the transistors M1 to M4 are biased in weak-inversion and form the translinear loop.

Figure 6.5: Schematic of the translinear logarithmic current-to-current converter.

The drain-to-source voltages of all transistors are larger than 100 mV, in which case the transistors satisfy the following translinear relations [110]:

NMOS: In = Sn · ID0n · exp((VGn − VTn)/(nn · UT)) · exp(−VSn/UT), (6.13)

PMOS: Ip = Sp · ID0p · exp((−VGp + VTp)/(np · UT)) · exp(VSp/UT), (6.14)

where Sn,p, VTn,p, nn,p, UT, VGn,p, and VSn,p are the aspect ratio, the threshold voltage, the sub-threshold slope, the thermal voltage, and the gate and source voltages referred to the bulk potential (Vdd or gnd) for the nMOS and pMOS transistors, respectively. 
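As a quick numerical sanity check of the loop equations (6.7)-(6.11), the sketch below (plain Python; the current and voltage values are illustrative, not taken from the fabricated circuit) evaluates VN = ∆V + UT·ln(I1·I2/(I3·I4)) and confirms that the full floating voltage appears across N whenever I1 · I2 = I3 · I4:

```python
import math

UT = 0.026  # thermal voltage at room temperature (V)

def vn_across_N(dV, I1, I2, I3, I4):
    """Voltage across element N from the translinear loop, eq. (6.9):
    VN = dV + UT*ln(I1*I2/(I3*I4))."""
    return dV + UT * math.log((I1 * I2) / (I3 * I4))

# When the translinear condition I1*I2 = I3*I4 holds, the logarithmic
# term vanishes and the full floating voltage dV appears across N, so
# IN = f(dV) regardless of the absolute current levels.
f = lambda v: v / 1e7  # example: N is a linear 10-Mohm resistor
dV = 0.1
VN = vn_across_N(dV, I1=2e-9, I2=3e-9, I3=1e-9, I4=6e-9)  # I1*I2 == I3*I4
print(abs(VN - dV) < 1e-12, f(VN))
```

Scaling all four currents by a common factor leaves VN unchanged, which is what makes the principle bias-scalable.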
The transistor M5 serves as a feedback element which reduces the output impedance at the drain of M2. If the sizes of the transistors are equal, the current mirror formed by M3 and M4 ensures I1 · I2 = I3 · I4. Then, using the translinear Ohm's law, the output current Iout can be expressed as

Iout = f′(0) · ∆V(Iin), (6.15)

where the floating-voltage source ∆V(Iin) equals the difference in the gate voltages of transistors M1 and M2, which are mirrored from M6 and M7, and can be expressed as

∆V(Iin) = n · UT · ln(Iin/Ib2), (6.16)

which leads to

Iout = n · f′(0) · UT · ln(Iin/Ib2). (6.17)

Thus, the output current is proportional to the logarithm of the input current under the condition Iin > Ib2, which is required for the circuit to be operational. Equation (6.16) contains several temperature-dependent and non-linear parameters: (a) the thermal voltage UT; (b) the transconductance parameter f′(0); and (c) the sub-threshold slope n. However, all these parameters affect only the gain of the amplifier and hence can potentially be cancelled.

6.4.2.2 Complete Logarithmic Amplifier Circuit

Fig. 6.6(a) shows a complete implementation of a temperature-insensitive logarithmic amplifier based on the proposed translinear Ohm's law principle. It consists of an input stage, a reference stage and a translinear stage. The input stage and the reference stage are formed using the basic circuit shown in Fig. 6.6.

Figure 6.6: Proposed logarithmic amplifier: (a) basic circuit of translinear stage; (b) basic circuit of input stage; (c) basic circuit of reference stage.

Based on equation (6.16), the output current generated by the input stage for an input Iin is given by

Ix = n · f′(0) · UT · ln(Iin/Ib2). 
(6.18)

For the reference stage, the ratio of the currents I1 and I2 is set to 2 using the current mirrors M1, M2 and M3. Thus the reference current Iref is given by

Iref = n · f′(0) · UT · ln(2) (6.19)

≈ 0.69 · n · f′(0) · UT. (6.20)

The translinear stage forms a translinear loop using the gate-to-source terminals of the pMOS transistors M1-M4, which leads to Iout · Iref = Igain · Ix. Note that Ib3 in Fig. 6.6 is an external biasing current used to establish the translinear loop and is usually 10 times larger than Ib4. M15 is added at the input of Ib4 to ensure that M11 remains in the saturation region. Thus, using equation (6.16), the output current Iout can be expressed as

Iout = Igain · Ix/Iref (6.21)

= 1.44 · Igain · ln(Iin/Ib2). (6.22)

Thus, if the currents Igain and Ib2 are independent of temperature, Iout is theoretically invariant with respect to temperature. Also, if the resistances R1 and R2 are matched, all the terms dependent on the resistance are potentially cancelled out. The current Igain determines the gain of the amplifier and the current Ib2 determines the minimum input current of the amplifier.

The circuit in Fig. 6.6 requires current sources Iref and Ib4 that can be programmed to different amplifier settings. It turns out that the core circuit shown in Fig. 6.5 can also be used for designing programmable current references, as shown in Fig. 6.7(a).

Figure 6.7: Basic circuit of (a) current reference; (b) current DAC. 
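The cancellation argued for equation (6.22) can be checked with a short numerical sketch. The parameter values below (the sub-threshold slope, f′(0) and its assumed 1/T drift) are illustrative stand-ins rather than extracted device data; the point is only that the common temperature-dependent factors of the input and reference stages divide out in the translinear stage:

```python
import math

def stage_outputs(T, Iin, Ib2, Igain, n=1.5, gm300=1e-7):
    """Eqs. (6.18)-(6.22): the input and reference stages share the same
    temperature-dependent factors n, f'(0) and UT, which cancel in the
    translinear division Iout = Igain*Ix/Iref. gm300 and its assumed
    1/T drift are illustrative, not device data."""
    UT = 1.3806e-23 * T / 1.602e-19      # thermal voltage kT/q
    gm = gm300 * (300.0 / T)             # assumed drift of f'(0)
    Ix = n * gm * UT * math.log(Iin / Ib2)
    Iref = n * gm * UT * math.log(2.0)
    return Igain * Ix / Iref             # all common factors cancel

cold = stage_outputs(273.0, Iin=10e-9, Ib2=100e-12, Igain=1e-9)
hot = stage_outputs(330.0, Iin=10e-9, Ib2=100e-12, Igain=1e-9)
print(abs(cold - hot) < 1e-12)  # Iout is temperature-invariant in this ideal model
```

In this ideal model only Igain and Ib2 remain in the transfer function, which is why only those two currents need temperature-compensated references.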
If the gate voltages of M1 and M2 are given by VG1 and VG2, then the reference current is given by Iout = (VG1 − VG2)/R. If the resistance R is temperature compensated and the voltage difference VG1 − VG2 is independent of temperature, then Iout will also be temperature compensated. Note that once Iout is generated on-chip, different values of currents can be generated using current mirrors. The voltages VG1 and VG2 are generated using programmable floating-gate voltage references, which were reported in [111]. In this section, we briefly describe the operating principle of the voltage reference; the reader is referred to [111] for additional details. If the floating-gate transistor M4 is matched to the current-source transistor M1, the output voltage VG1 is given by

VG1 = Vdd − n·Vb1 + n·Q1/CT, (6.23)

where n is the sub-threshold slope factor, Vdd and Vb1 are the supply and biasing voltages, Q1 is the charge on the floating gate F, and CT is the total capacitance at node F. Thus,

VG1 − VG2 = n·(Q1 − Q2)/CT. (6.24)

The charges Q1 and Q2 are programmed using a linear injection technique [111], where the opamps A1 and A2 implement an active feedback that holds the source current, the source voltage and the gate voltage of M1 and M2 constant. This eliminates any non-linear factors that affect hot-electron injection onto the gate, thus achieving a stable and controllable injection rate. As a result, the injection rate, and hence the resolution with which Q1 and Q2 are programmed, remains constant throughout the programming range, which is rail-to-rail. Programming is achieved by periodically enabling the feedback loop by opening the switch S1 (controlled by an FPGA). Again, for the sake of brevity, we have omitted the details of hot-electron injection programming, which can be found in [111]. Programming of the input current Iin is achieved using a 10-bit current-mode DAC as shown in Fig. 6.6. The MSB of the DAC is denoted by d10 whereas the LSB is denoted by d1. The architecture of the current DAC (shown in Fig. 
6.7(b)) is based on the popular resistive divider [95], where the current through each branch is twice as large as that of its neighbor. The output, as determined by the bits d10-d1, is a binary-weighted sum of currents which are mirrored into the input stage. The digital inputs d10 through d1 are transmitted through a shift-register chain, which converts the serial input Din into parallel bit slices that are processed by the DAC module.

6.4.3 Adaptive, varactor-driven temperature compensation circuitry

Floating-gate transistors serve as an attractive medium for non-volatile storage of analog parameters. However, conventional current memories based on floating-gate transistors are sensitive to variations in temperature, limiting their application to controlled environments. In our research, we propose a temperature-compensated floating-gate array that can be programmed to store currents down to the picoampere level. At the core of the proposed architecture is a control algorithm that uses a varactor to adapt the floating-gate capacitance such that the temperature-dependent factors can be effectively canceled. As a result, the stored current is theoretically a function only of a reference current and the differential charge stored on the floating gates. We validate the proof-of-concept using measurement results obtained from prototype current memory cells fabricated in a 0.5µm CMOS process.

6.4.3.1 Varactor based Floating-gate Temperature Compensation

The architecture of the proposed varactor-based floating-gate current memory cell is shown in Fig. 6.8. It consists of a reference cell formed by two floating-gate transistors P1 and P2. Note that in addition to the usual control-gate capacitor Cc and tunneling capacitor Ctun, all the floating-gate transistors have a varactor Cv connected to their respective gates. 
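Before moving to the varactor-based compensation, the floating-gate reference relations (6.23)-(6.24) can be illustrated with a short numerical sketch. The slope factor, capacitance and resistance below are illustrative values; the point is that the supply and bias terms cancel in VG1 − VG2, leaving a current set purely by the programmed charge difference:

```python
def fg_reference_current(Q1, Q2, n=1.3, CT=1e-12, R=1e7):
    """Eqs. (6.23)-(6.24): the Vdd and Vb1 terms cancel in the
    difference of the two floating-gate voltages, leaving only the
    programmed charge difference. Component values are illustrative."""
    dV = n * (Q1 - Q2) / CT           # VG1 - VG2, eq. (6.24)
    return dV / R                     # Iout = (VG1 - VG2)/R

# A 100 fC charge difference on a 1 pF total gate capacitance with
# n = 1.3 gives 130 mV, i.e. 13 nA through a 10-Mohm resistor.
print(fg_reference_current(200e-15, 100e-15))
```

Because only the charge difference matters, a common drift of both floating-gate charges (e.g. due to leakage) ideally leaves the reference current unchanged.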
The currents through P1 and P2 (measured using an on-chip analog-to-digital converter) are controlled by a control module which sets the control-gate voltage Vc and the tuning voltage Vx using an off-chip digital-to-analog converter. If the charges on the floating-gate nodes A and B are denoted by Q1 and Q2, then the gate voltages of the pMOS transistors P1 and P2 are given by

VA ≈ (Q1 + Cc·Vc + Cv·Vx + Ctun·Vtunnel + Cb·Vdd)/CT, (6.25)

VB ≈ (Q2 + Cc·Vc + Cv·Vx + Ctun·Vtunnel + Cb·Vdd)/CT. (6.26)

Figure 6.8: Proposed floating-gate based current reference and current memory cells.

When P1 and P2 are biased in weak-inversion, the ratio of their drain currents I1 and I2 is given by

I1/I2 = exp(κ·(Q2 − Q1)/(CT·UT)). (6.27)

The control module in Fig. 6.8 monitors I1 and I2 and ensures that their ratio remains constant across temperature variations by tuning the voltage Vx, which changes the capacitance CT. Letting K = I1/I2, the product CT·UT is then also held constant according to

CT·UT = κ·(Q2 − Q1)/ln(K), (6.28)

which leads to the expression for the output current Iout generated through the memory cell:

I1/Iout = exp(ln(K)·(Q3 − Q1)/(Q2 − Q1)). (6.29)

Thus, if the current I1 is compensated for temperature, then according to equation (6.29), the currents through the memory cells are also temperature compensated. In principle, I1 could be generated by a bandgap reference circuit [59]. For this implementation, we maintained I1 constant by co-adapting the control-gate voltage Vc along with the tuning voltage Vx.

6.4.3.2 Varactor Implementation and Second-order Effects

The varactor has been realized using a MOS capacitor operating in accumulation mode [112]. The cross-sectional architecture and the layout of the MOS-cap are shown in Fig. 
6.9, along with the floating-gate transistor, the control gate and the tunneling node. The use of a MOS-cap as a varactor introduces second-order effects into equation (6.29):

• The derivation of equation (6.29) assumed that the floating-gate transistors, including the variable capacitors Cv, are matched with respect to each other. Unfortunately, each floating-gate node is programmed to a different potential, which will lead to mismatch in Cv. In weak-inversion, however, the difference between the respective gate voltages is small (less than 100 mV), which reduces the effect of the mismatch.

• The variation of Cv with respect to the tuning voltage Vx is non-linear, as shown in the measured response in Fig. 6.21. Therefore, Vx and Vc have to be biased properly to ensure: (a) sub-threshold biasing of the transistors; and (b) high sensitivity of Cv with respect to Vx.

• The MOS-cap Cv also varies with temperature. However, the temperature model of a MOS varactor is too complex for any closed-form calibration, which motivates adapting Vx in an online manner.

6.5 Measurement Results

The prototype is still being measured for the SVM computation result.

6.5.1 Measurement results for floating-gate memory

As seen in Fig. 6.3, the programming of the floating-gate cells is controlled by Vcurrent_starving, Vdd_fg, Vc, Vx and the charge-injection voltage applied at the program pin. In the first set of experiments, we vary one of these parameters and keep the others fixed to verify the functionality of each control parameter. Fig. 6.10 shows the current through three different floating-gate transistors when different gate voltages are applied to the current-starving transistor P0 during charge injection.

Figure 6.9: (a) Cross-sectional architecture of MOS varactor; (b) layout for memory cell in Fig. 
6.8.

These three cells are first tunneled until their currents are below a picoampere. As Vcurrent_starving decreases, the charging speed increases, because the injection current through the transistor is increased by lowering the gate voltage of P0.

Figure 6.10: Floating-gate transistor current with respect to the current-starving voltage.

Fig. 6.11 shows the current through three different floating-gate transistors when different values of Vdd_fg are applied. As Vdd_fg increases, the charging speed increases as well, consistent with the principle that the injection current increases with a higher VSG across the floating-gate transistor.

Figure 6.11: Floating-gate transistor current with respect to Vdd_fg.

Fig. 6.12 shows the current through three different floating-gate transistors when different values of Vx are applied. As Vx increases, the charging speed decreases. Since increasing Vx raises the gate voltage of the floating-gate transistor, this is consistent with the principle that the injection current is proportional to the VSG across the floating-gate transistor.

Figure 6.12: Floating-gate transistor current with respect to Vx.

Fig. 6.13 shows the precise programming procedure achieved by tuning the voltage applied at the program pin, i.e., by tuning the injection current. The methodology is to first charge the cell toward a threshold value close to the desired value, and then reduce VPROD to allow a lower injection current, which ensures a fine step size. For this experiment, the desired value is 6.3 µA, and the threshold is set at 6.1 µA. 
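The coarse/fine programming methodology described above can be summarized in a short behavioral sketch. The injection step sizes below are invented for illustration; only the two-phase structure (fast injection up to a threshold, then a reduced injection rate for fine placement) mirrors the actual procedure:

```python
def program_cell(target=6.3e-6, threshold=6.1e-6,
                 coarse_step=50e-9, fine_step=2e-9):
    """Two-phase floating-gate programming: inject at a high rate until
    the cell current reaches `threshold`, then switch to a lower
    injection rate (smaller per-cycle step) to land precisely on
    `target`. Step sizes are illustrative."""
    current, cycles = 0.0, 0
    while current < target:
        current += coarse_step if current < threshold else fine_step
        cycles += 1
    return current, cycles

final, n = program_cell()
print(final)  # lands within one fine step of the 6.3 uA target
```

The final placement error is bounded by the fine step rather than the coarse one, which is the reason for lowering the injection voltage near the target.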
When VPROD is set as low as −3.3 V, the current increases rapidly; as the current approaches 6.1 µA, VPROD is switched to −2.8 V. As a result, the current through the floating-gate transistors is programmed to 6.3 µA precisely, as seen in Fig. 6.13.

Figure 6.13: Precise programming procedure for floating-gate transistors.

6.5.2 Measurement results for current-input current-output CMOS logarithmic amplifier

The microphotograph of a prototype current-input current-output CMOS logarithmic amplifier fabricated in a 0.5µm standard CMOS process is shown in Fig. 6.14. Table 6.1 summarizes the main specifications of the chip.

Figure 6.14: Die microphotograph of the chip.

Table 6.1: Specifications of the current-input current-output CMOS logarithmic amplifier

Fabrication Process: Standard CMOS 0.5µm
Die Size: 3000µm × 3000µm
M1-4, M6-7 (refer to Fig. 6.5): 10µm/5µm
M5 (refer to Fig. 6.5): 30µm/1.5µm
R (refer to Fig. 6.5): 10MΩ

6.5.2.1 Measurement results for the current reference

Fig. 6.15 shows the measured output current Iout (refer to Fig. 6.7(a)) versus the differential gate voltage V1 − V2 (refer to Fig. 6.7(a)). During this experiment, the switches Sp are set to be off so that both V1 and V2 are determined by the Vref of the opamps. V1 is varied while V2 is kept fixed. It is seen from Fig. 6.15 that Iout is proportional to the differential voltage, which confirms that equation (6.11) holds for this circuit.

Figure 6.15: (a) Measured current response and (b) relative error for the current reference.

6.5.2.2 Measurement results for the logarithmic amplifier

The measured results of the circuit are shown in Fig. 6.16. The input current was varied from 100 pA to 40 nA, and the measurement was repeated for different values of Ib4 (refer to Fig. 6.6). 
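For reference, the ideal transfer characteristic that these measurements are compared against, equation (6.22) with illustrative values for Igain and Ib2, can be generated as follows; equal output steps per decade of input current are the signature of the logarithmic transfer:

```python
import math

def logamp_iout(Iin, Igain=1e-9, Ib2=100e-12):
    """Ideal model of the complete logarithmic amplifier, eq. (6.22):
    Iout = 1.44 * Igain * ln(Iin/Ib2), valid for Iin > Ib2.
    Igain and Ib2 are illustrative, not the programmed chip values."""
    if Iin <= Ib2:
        raise ValueError("circuit requires Iin > Ib2")
    return 1.44 * Igain * math.log(Iin / Ib2)

# Each decade of input current adds the same fixed increment to Iout.
outs = [logamp_iout(i) for i in (1e-9, 10e-9, 100e-9)]
print(abs((outs[1] - outs[0]) - (outs[2] - outs[1])) < 1e-15)
```

Lowering Ib2 extends the usable input range downward, which is consistent with the observation that a larger dynamic range is obtained by reprogramming the bias references.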
The results show a close agreement between the measured data and the mathematical model given by equation (6.22). The results also demonstrate that a larger dynamic range can be achieved by increasing Ib4.

Figure 6.16: Measurement results for the logarithmic current conversion.

6.5.2.3 Noise Analysis

When the current memory cell operates at sub-nA current levels, thermal noise and flicker noise (due to the sub-threshold MOSFETs and the resistor) affect the accuracy of the current read out from the cell. Under DC conditions, flicker noise is dominant. The power spectral density (PSD) of the flicker noise of a MOS transistor biased in the sub-threshold region can be expressed as

S_ID = (KF · ID²)/(W · L · Cox · f), (6.30)

where KF is a process-dependent constant; Cox is the gate-oxide capacitance per unit area; W and L are the width and length of the transistor, respectively; ID is the drain current flowing through the transistor; and f is the frequency. Equation (6.30) indicates that the noise spectral density increases with the biasing current. To verify equation (6.30), a set of experiments is performed in which Iin and Iout are varied while Ib4 and Ib2 are fixed. The noise is computed as the standard deviation of 100 current samples for each pair of Iin and Iout. Fig. 6.17(a) shows that the noise of Iout is proportional to the magnitude of Iout, and Fig. 6.17(b) shows the corresponding SNRs.

Figure 6.17: (a) Noise of Iout versus Iin; (b) SNR of Iout.

6.5.2.4 Temperature characteristics

The underlying principle of the translinear Ohm's law intrinsically leads to a temperature-compensated operation. 
According to equation (6.22), the temperature coefficient of Iout is determined by the temperature characteristics of Ib4, Ib2, and Iin. Assume that Ib4 and Ib2 are not temperature compensated whereas Iin is temperature compensated, so that Ib4 and Ib2 are written as functions of temperature, Ib4(T) and Ib2(T). Then equation (6.22) is rewritten as

Iout(T) = Ib4(T) · (ln(Iin) − ln(Ib2(T))) / ln 2. (6.31)

For a specific temperature T0, three outputs are sampled for three inputs Iin1, Iin2 and Iin3:

Iout1(T0) = Ib4(T0) · (ln(Iin1) − ln(Ib2(T0))) / ln 2,
Iout2(T0) = Ib4(T0) · (ln(Iin2) − ln(Ib2(T0))) / ln 2,
Iout3(T0) = Ib4(T0) · (ln(Iin3) − ln(Ib2(T0))) / ln 2. (6.32)

Then it can be deduced that

(Iout1(T0) − Iout2(T0)) / (Iout1(T0) − Iout3(T0)) = ln(Iin1/Iin2) / ln(Iin1/Iin3). (6.33)

As seen from equation (6.33), the temperature-related terms are canceled. Hence the ratio (Iout1(T) − Iout2(T)) / (Iout1(T) − Iout3(T)) is temperature compensated as long as Iin is temperature compensated.

A corresponding experiment with the temperature varying from 27°C to 57°C is conducted; the result is shown in Fig. 6.18. It is seen that in the weak-inversion regime, the current ratios are approximately invariant with respect to temperature, which validates the relationship given by equation (6.33). Based on a first-order polynomial fit to the data shown in Fig. 6.18, a temperature sensitivity of 230 ppm/K is achieved over the temperature range of 27°C to 57°C.

6.5.2.5 Speed analysis

The speed of the circuit is limited by the current levels. According to equation (6.22), the limiting factor could be Iin, Ib4, or Ib2. The following experiment is conducted to characterize the bandwidth variation with respect to these factors. In the experiment, Iin and Ib4 are varied whereas Ib2 is fixed at 100 pA. The result is shown in Fig. 6.19. From Fig. 
6.19(a), it is seen that when Ib4 is set to 0.38 nA and Iin varies from 50 nA to 110 nA, the bandwidth increases from 0.6 kHz to 0.9 kHz; when Ib4 is set to 0.75 nA, the bandwidth increases from 0.8 kHz to 1.2 kHz over the same range of Iin.

Figure 6.18: Measured response plotting (Iout1(T) − Iout2(T)) / (Iout1(T) − Iout3(T)) under different temperatures.

From Fig. 6.19(b), it is seen that when Iin is set to 85 nA, the bandwidth increases monotonically from 0.45 kHz to 1 kHz as Ib4 varies from 0.25 nA to 2.75 nA; when Iin is set to 110 nA, the bandwidth increases from 0.73 kHz to 1.1 kHz as Ib4 varies over the same range. These observations demonstrate that the bandwidth increases as the bias currents are increased.

Figure 6.19: Measured response plotting bandwidth with respect to current.

Table 6.2 compares this work with some of the references.

6.5.3 Measurement results for varactor-driven temperature compensation circuitry

An array of varactor-based FG current memory cells has been prototyped in a 0.5µm standard CMOS process. Table 6.3 summarizes the specifications of the fabricated prototype and Fig. 6.20 shows its micrograph. 
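The compensation relations (6.27)-(6.29), which the following measurements are designed to validate, can be sanity-checked numerically. The charge values and κ below are illustrative; the sketch only demonstrates that once the control loop pins the measured ratio K = I1/I2, the product CT·UT (and hence every other cell ratio) is fixed:

```python
import math

def cell_ratios(Q1, Q2, Q3, CT_UT, kappa=0.7):
    """Eqs. (6.27) and (6.29) for matched weak-inversion floating-gate
    pMOS devices; all parameter values are illustrative."""
    K = math.exp(kappa * (Q2 - Q1) / CT_UT)                        # I1/I2
    I1_over_Iout = math.exp(math.log(K) * (Q3 - Q1) / (Q2 - Q1))   # eq. (6.29)
    return K, I1_over_Iout

K, ratio = cell_ratios(1e-15, 3e-15, 2e-15, CT_UT=2.6e-14)
# Eq. (6.28): measuring K recovers the product CT*UT that the varactor
# control loop is holding constant across temperature.
recovered = 0.7 * (3e-15 - 1e-15) / math.log(K)
print(abs(recovered - 2.6e-14) < 1e-20)
```

Because both (6.27) and (6.29) depend on temperature only through CT·UT, holding K constant by adapting Vx freezes all the memory-cell ratios at once, which is exactly the behavior reported in Fig. 6.23.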
Programming of the FG memory cells is based on hot-electron injection and Fowler-Nordheim (FN) tunneling, whose details have been presented elsewhere [61, 116] and are omitted here for the sake of brevity.

Table 6.2: Comparison of performance

Reference:   [113] | [114] | [115] | [108] | This work
Conversion:  current-in voltage-out | current-in voltage-out | current-in voltage-out | current-in voltage-out | current-in current-out
Process:     1.6µm | 1.5µm | Discrete | 0.5µm | 0.5µm
Area:        2.9 × 3.7mm² | 91 × 75µm² | - | 290 × 140µm² | -
Supply:      - | ±6V | - | 3.3 V | 3.3 V
DR:          100dB | 114dB | 80dB | 140dB | 120dB
Bandwidth:   >10Hz | 1kHz | 25kHz | >3.5kHz | 1.2kHz
Power:       10pA-1µA | 30mW | - | 0.1µW-33µW | 1nW-10µW
ppm/K:       uncompensated | uncompensated | uncompensated | uncompensated | 230

Table 6.3: Specifications of the fabricated prototype

Fabrication Process: Standard CMOS 0.5µm
Die Size: 3000µm × 3000µm
Size of P1-P3: 6µm/3µm
Cc: 100fF

Figure 6.20: Die microphotograph of the chip.

In the first set of experiments, the floating-gate voltage VA was measured (using a voltage buffer shown in Fig. 6.8) when Vx was tuned and Vc = 1.2 V. The ratio Cv/Cc was determined based on a sensitivity analysis of the current I1 with respect to Vx and Vc. The measurement result in Fig. 6.21 shows that the capacitance is inversely proportional to Vx, with the largest change occurring at Vx = 1.25 V. The experiment also demonstrates that Cv can be varied over a large dynamic range with respect to Vx, which should be sufficient to compensate for large temperature variations.

Figure 6.21: The C-V curve of the MOS capacitor.

For the next set of experiments, we measured the current ratios under four different temperatures, namely 28°C, 33°C, 38°C, and 43°C, with the voltage Vc fixed at 2 V and Vx varied. The results shown in Fig. 6.22(a)-(e) demonstrate that with an increase in temperature, the respective current ratios decrease, as indicated by the relationship in equation (6.28). 
The control algorithm then varies Vx to ensure that the ratio I1/I2 is maintained constant. The trajectory of Vx traversed by the control algorithm is shown in Fig. 6.22(a)-(e), together with the corresponding ratios I1/Iout (here Iout is represented by I3-I6 from different memory cells). The compensated results with Vx biased at 1.2 V and 1.3 V are shown in Fig. 6.23. A temperature sensitivity of 150 ppm/K in the current ratio is achieved over the temperature range of 28°C to 43°C. It is seen that the currents through the other memory cells (I3-I6) can be effectively compensated with respect to temperature once I1/I2 is constant, no matter how different each current is from I1. This result validates the proof-of-concept temperature compensation for the proposed floating-gate current memory.

Figure 6.22: Measured response plotting (a) I1/I2 (b) I1/I3 (c) I1/I4 (d) I1/I5 (e) I1/I6 with respect to Vx under 28°C, 33°C, 38°C, and 43°C.

Figure 6.23: Measured response plotting temperature compensated current ratios under Vx of (a) 1.2V (b) 1.3V.

Chapter 7

Conclusions and future work

The major conclusion of this thesis is that it is feasible to use the MP algorithm as an effective alternative for analog computation. The MP algorithm can be implemented with current-mode analog VLSI circuits, and the basic MP circuit can be utilized to synthesize a variety of static analog computation circuits. 
The MP circuit also serves as a fundamental circuit unit in the MP-based applications. The proposed MP algorithm is a novel analog computation algorithm whose underlying principle is the universal conservation law. In contrast to state-of-the-art translinear circuits, which depend on the physics of electronic devices, the MP algorithm depends only on a physical conservation law, which makes it independent of the transistor operation region. As a consequence, the MP circuit has a wide dynamic range and is capable of trading off one performance metric against another when required, owing to its wide operating region. The underlying conservation principle also enables the MP algorithm to be implemented in a diversity of devices, not limited to electronic devices, using physical quantities such as mass, time, or resistance. Since current-mode circuits are the easiest and most mature means of implementation, this dissertation adheres to a current-mode implementation.

One characteristic of the MP algorithm is that it achieves a PWL approximation of the "log-sum-exp" function. The MP algorithm also has several mathematical properties which facilitate MP-based computations. Another characteristic of the MP algorithm lies in its implementation based on the reverse water-filling criterion, which enables MP to be implemented easily in the analog domain. Together, these two characteristics allow diverse analog arithmetic computations to be synthesized from the basic MP circuit. In this study, the investigated computations include addition, subtraction, multiplication, division, power, polynomial, tanh, and inner product. These computations are ubiquitous in applications such as neural networks, communications, and machine learning. The two MP-based application examples presented in this research study are an analog LDPC decoder and an analog SVM. 
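For readers skimming the conclusions, the core MP computation referred to throughout can be captured in a few lines. The sketch below solves the reverse water-filling constraint Σi max(Li − z, 0) = γ by bisection; this is a behavioral model of the algorithm only, not of the current-mode circuit that implements it:

```python
def margin_propagation(L, gamma, tol=1e-12):
    """PWL approximation of log-sum-exp via the reverse water-filling
    constraint sum_i max(L_i - z, 0) = gamma, solved by bisection.
    The left-hand side decreases monotonically in z, so bisection on
    [min(L) - gamma, max(L)] always brackets the unique solution."""
    lo, hi = min(L) - gamma, max(L)
    while hi - lo > tol:
        z = 0.5 * (lo + hi)
        if sum(max(x - z, 0.0) for x in L) > gamma:
            lo = z  # z too small: too much "water" above the level
        else:
            hi = z
    return 0.5 * (lo + hi)

z = margin_propagation([1.0, 2.0, 4.0], gamma=1.0)
print(round(z, 6))
```

Since the constraint involves only additions, subtractions and thresholding, it maps directly onto Kirchhoff's current law, which is what makes the circuit implementation bias-scalable.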
For each application, the mathematical model built upon the MP algorithm is illustrated, simulation results are presented, the system architecture and circuit implementation based on the MP circuit are demonstrated, and measurement results are presented as well.

7.1 Summary of contributions

The contributions of this dissertation are summarized here:

• Propose an analog signal processing (ASP) algorithm named margin propagation (MP) as an efficient piecewise-linear (PWL) approximation technique for the "log-sum-exp" function, and present its current-mode analog circuit implementation. Since the MP algorithm involves only addition, subtraction and threshold operations, it can be implemented using universal conservation principles like Kirchhoff's current law. Its analog circuit implementation is bias-scalable across different operation regions. As a result, the MP circuit can adapt to various application requirements, since different biasing conditions are beneficial for different specifications.

• Synthesize bias-scalable analog computational circuits for a wide range of linear and non-linear mathematical functions based on the MP principle. We have implemented addition, subtraction, multiplication, division, power and polynomial computations, etc., with MP-based circuits. An important attribute of MP-based synthesis is that all computations are performed in the logarithmic domain, as a result of which multiplications map into additions, divisions map into subtractions and power computations map into multiplications. This simplifies the implementation of many complicated non-linear functions. All the computational circuits can be operated in the weak-inversion, moderate-inversion, and strong-inversion regions.

• Design and implement an analog (32,8) MP-based LDPC decoder. 
One fascinating property is its capability of trading off BER performance with energy efficiency through the tunable hyper-parameter γ in the MP algorithm, which has been verified by the (32,8) MP-based LDPC decoder prototype. The prototyped MP-based LDPC decoder can achieve an energy efficiency of 100 pJ/bit, while an optimal configuration can also deliver up to 3 dB improvement in BER compared to the benchmark min-sum LDPC decoder.

• Design and implement an analog energy-scalable MP-based support vector machine (SVM). The prototype stores 2052 SVM parameters in a floating-gate memory array and is fully programmable. Owing to the use of the MP computation circuit, the SVM is bias-scalable.

• Design and implement a novel current-input current-output logarithmic amplifier circuit. Unlike traditional transimpedance-based logarithmic amplifier circuits, our design, based on the translinear Ohm's law, directly generates a current as a logarithmic function of the input current; it exhibits a dynamic range of 120 dB and a temperature sensitivity of 230 ppm/K, while consuming less than 100 nW of power.

• Design and implement a varactor-based temperature compensation circuit for the floating-gate memory array. By adapting the floating-gate capacitance, this design enables the temperature-dependent factors to be effectively canceled, which resolves the temperature-induced variation of the current stored in the floating-gate memory. A temperature sensitivity of 150 ppm/K is achieved. 
Figure 7.1: Improved MP circuit with cascode current mirrors.

Figure 7.2: Decoupled MP circuit to prevent gate leakage.

7.2 Future directions

Some aspects of the MP algorithm and MP circuits can be further improved or remain unexplored:

• Novel methodologies to implement the MP algorithm. The MP algorithm is scalable to different analog structures ranging from silicon transistors to microfluidic devices. This dissertation adheres to the current-mode implementation since it is the easiest. In future research, the physical parameter used to represent a value could be time (to improve energy efficiency), charge (to improve precision), or any other characteristic parameter, depending on the application requirements.

• Ultra-low-power implementation of the MP algorithm. In this thesis, all the prototypes are capable of operating with current magnitudes as low as nanoamperes. However, since the MP algorithm is based on a conservation law, it could in principle function even at current magnitudes of femtoamperes. For this improvement, specific circuit design techniques need to be developed.

• Precision improvement for the MP algorithm. In this thesis, one intuitive solution to improve the PWL approximation of the "log-sum-exp" function by MP is proposed, based on an experiment presented in chapter 3: increasing the number of operands involved in the MP computation. The rigorous mathematical design of the operands, including the optimal number of operands and their values, is a promising topic for research. 
7.2.1 Potential solutions for ultra-deep-submicron computational circuits using margin propagation

7.2.1.1 Cascode current mirrors

As discussed in [117], the performance of the MP circuit is limited by the accuracy of the "current mirrors" composed of N0 through N4 in Fig. 4.2. When the MP circuit is scaled down to the nano-scale, the inaccuracy in current replication caused by systematic mismatch (due to differing VDS values) becomes more pronounced. Several studies have reported methodologies for improving the precision of ultra-deep-submicron current mirrors under low-voltage conditions [118, 119, 120, 121, 122, 123], most of them focused on cascoding schemes. Following these low-voltage high-precision current mirror designs, the simplest way to improve the accuracy of a nano-scale MP circuit is to use the cascode technique. A basic nano-MP circuit with cascode current mirrors is displayed in Fig. 7.1. To achieve a low-voltage implementation, it is also feasible to introduce the other techniques reported in [118, 119, 120, 121, 122, 123].

7.2.1.2 Gate decoupling

Gate leakage is another concern for nano-scale MP circuits since it may cause the MP circuit to malfunction. As can be seen in Fig. 7.1, gate leakage is liable to occur at node A, which would affect the output current mirrored from N1 through N4. To mitigate this, a possible solution is proposed that "decouples" the gates of N1 through N4 from node A. The improved topology is shown in Fig. 7.2. KCL still holds at node A; the difference is that the gates of transistors N0 through N4 are decoupled from node A and connected to node C. In this way, the gate voltages of N1 through N4, which used to be driven by the currents through P1-P4, are self-balanced by the currents through the transistors themselves. This is made possible by the hot-carrier effect in nano-scale circuits.

BIBLIOGRAPHY

[1] A. Graupner, J. Schreiter, S. Getzlaff, and R.
Schüffny, "CMOS image sensor with mixed-signal processor array," IEEE J. Solid-State Circuits, vol. 38, no. 6, pp. 948–957, Jun. 2003.
[2] P. Hasler, P. D. Smith, D. Graham, R. Ellis, and D. V. Anderson, "Analog floating-gate, on-chip auditory sensing system interfaces," IEEE Sensors Journal, vol. 5, no. 5, pp. 1027–1034, Oct. 2005.
[3] P. Hasler and D. V. Anderson, "Cooperative analog-digital signal processing," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 2002, pp. IV-3972–IV-3975.
[4] G. Frantz, "Digital signal processor trends," IEEE Micro, vol. 20, no. 6, pp. 52–59, Dec. 2000.
[5] H. Suzuki, H. Takata, H. Shinohara, E. Teraoka, M. Matsuo, T. Yoshida, H. Sato, N. Honda, N. Masui, and T. Shimizu, "1.047GHz, 1.2V, 90nm CMOS, 2-way VLIW DSP core using saturation anticipator circuit," in 2006 Symposium on VLSI Circuits Digest of Technical Papers, 2006, pp. 152–153.
[6] T. S. Hall, C. M. Twigg, J. D. Gray, P. Hasler, and D. V. Anderson, "Large-scale field-programmable analog arrays for analog signal processing," IEEE Trans. Circuits Syst. I, vol. 52, no. 11, pp. 2298–2307, Nov. 2005.
[7] C. Huang, N. Lajnef, and S. Chakrabartty, "Calibration and characterization of self-powered floating-gate usage monitor with single electron per second operational limit," IEEE Trans. Circuits Syst. I, vol. 57, no. 3, pp. 556–567, Mar. 2010.
[8] F. Silveira, D. Flandre, and P. G. A. Jespers, "A floating-gate MOS learning array with locally computed weight updates," IEEE J. Solid-State Circuits, vol. 31, no. 9, pp. 1314–1319, Sep. 1996.
[9] G. E. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, pp. 114–117, Apr. 1965.
[10] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Computation, vol. 19, pp. 297–301, Apr. 1965.
[11] A. Bandyopadhyay, J. Lee, R. W. Robucci, and P.
Hasler, "MATIA: A programmable 80 µW/frame CMOS block matrix transform imager architecture," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 663–672, Mar. 2006.
[12] N. Zhang, C. Teuscher, H. Lee, and B. Brodersen, "Architectural implementation issues in a wideband receiver using multiuser detection," in Proc. 36th Annu. Allerton Conf. Communication, Control, and Computing, Sep. 1998, pp. 765–771.
[13] R. Sarpeshkar, "Efficient precise computation with noisy components: Extrapolating from an electronic cochlea to the brain," Ph.D. dissertation, Dept. Comput. Neur. Syst., California Institute of Technology, Pasadena, CA, 1997.
[14] B. Gilbert, "Translinear circuits: a proposed classification," Electronics Letters, vol. 11, no. 1, pp. 14–16, 1975.
[15] ——, "A high-accuracy analog multiplier," in Proc. IEEE ISSCC, 1974, pp. 128–129.
[16] S. Hemati and A. Banihashemi, "Full CMOS min-sum analog iterative decoder," Jun. 2003, pp. 347.
[17] E. Seevinck and R. J. Wiegerink, "Generalized translinear circuit principle," IEEE J. Solid-State Circuits, vol. 26, no. 8, pp. 1198–1202, Aug. 1991.
[18] W. Gai, H. Chen, and E. Seevinck, "Quadratic-translinear CMOS multiplier-divider circuit," Electron. Lett., vol. 33, no. 10, pp. 860–861, May 1997.
[19] B. A. Minch, P. Hasler, and C. Diorio, "Multiple-input translinear element networks," IEEE Trans. Circuits Syst. II, vol. 48, no. 1, pp. 20–28, Jan. 2001.
[20] R. W. Adams, "Filtering in the log domain," in 63rd AES Conf., New York, May 1979.
[21] J. Mulder, A. C. van der Woerd, W. A. Serdijn, and A. H. M. van Roermund, "An RMS-DC converter based on the dynamic translinear principle," IEEE J. Solid-State Circuits, vol. 32, no. 7, pp. 1146–1150, Jul. 1997.
[22] Y. P. Tsividis, "Externally linear integrators," IEEE Trans. Circuits Syst. II, vol. 45, no. 9, pp. 1181–1187, Sep. 1998.
[23] B. Gilbert, "Translinear circuits: A historical overview," Analog Integrated Circuits and Signal Processing, vol. 9, no. 2, pp.
95–118, 1996.
[24] B. A. Minch, "MOS translinear principle for all inversion levels," IEEE Trans. Circuits Syst. II, vol. 55, no. 2, pp. 121–125, Feb. 2008.
[25] E. A. Vittoz, "Analog VLSI signal processing: Why, where, and how?" Analog Integr. Circuits Signal Process., vol. 6, no. 1, pp. 27–44, 1994.
[26] B. A. Minch, "Synthesis of static and dynamic multiple-input translinear element networks," IEEE Trans. Circuits Syst. I, vol. 51, no. 2, pp. 409–421, Feb. 2004.
[27] A. G. Andreou, K. A. Boahen, P. O. Pouliquen, A. Pavasovic, R. E. Jenkins, and K. Strohbehn, "Current-mode subthreshold MOS circuits for analog VLSI neural systems," IEEE Trans. Neural Netw., vol. 2, no. 2, pp. 205–213, Mar. 1991.
[28] A. G. Andreou and K. A. Boahen, "Translinear circuits in subthreshold MOS," Analog Integrated Circuits and Signal Processing, vol. 9, no. 2, pp. 141–166, 1996.
[29] P. Hasler, C. Diorio, and B. A. Minch, "A four-quadrant floating-gate synapse," in Proc. IEEE International Symposium on Circuits and Systems, vol. 3, Jun. 1998, pp. 29–31.
[30] C. Diorio, "Neurally inspired silicon learning: From synapse transistors to learning arrays," Ph.D. dissertation, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, 1997.
[31] B. A. Minch, "Multiple-input translinear element log-domain filters," IEEE Trans. Circuits Syst. II, vol. 48, pp. 29–36, Jan. 2001.
[32] ——, "Construction and transformation of multiple-input translinear element networks," IEEE Trans. Circuits Syst. I, vol. 50, pp. 1530–1538, Dec. 2003.
[33] J. Mulder, W. A. Serdijn, A. C. van der Woerd, and A. H. M. van Roermund, "A generalized class of dynamic translinear circuits," IEEE Trans. Circuits Syst. II, vol. 48, no. 5, pp. 501–504, May 2001.
[34] D. Frey, "Log-domain filtering: An approach to current-mode filtering," Proc. Inst. Electron. Eng., vol. 140, no. 6, pp. 406–416, Dec. 1993.
[35] ——, "A general class of current mode filters," in Proc. IEEE ISCAS, vol. 2, 1993, pp.
1435–1438.
[36] ——, "Exponential state space filters: A generic current mode design strategy," IEEE Trans. Circuits Syst. I, vol. 43, pp. 34–42, Jan. 1996.
[37] C. Toumazou and T. S. Lande, "Micropower log-domain filter for electronic cochlea," Electron. Lett., vol. 30, no. 22, pp. 1839–1841, Oct. 1994.
[38] M. Punzenberger and C. C. Enz, "Low-voltage companding current-mode integrators," in Proc. IEEE ISCAS, 1995, pp. 2112–2115.
[39] D. Perry and G. W. Roberts, "Log-domain filters based on LC ladder synthesis," in Proc. IEEE ISCAS, 1995, pp. 311–314.
[40] J. Mulder, A. C. van der Woerd, W. A. Serdijn, and A. H. M. van Roermund, "A current-mode companding √x-domain integrator," Electron. Lett., vol. 32, no. 3, pp. 198–199, Feb. 1996.
[41] W. A. Serdijn, M. Broest, J. Mulder, A. C. van der Woerd, and A. H. M. van Roermund, "A low-voltage ultra-low-power translinear integrator for audio filter applications," IEEE J. Solid-State Circuits, vol. 32, pp. 577–581, Apr. 1997.
[42] W. A. Serdijn, J. Mulder, A. C. van der Woerd, and A. H. M. van Roermund, "A wide-tunable translinear second-order oscillator," IEEE J. Solid-State Circuits, vol. 33, no. 2, pp. 195–201, Feb. 1998.
[43] A. Payne, A. Thanachayanont, and C. Papavassilliou, "A 150-MHz translinear phase-locked loop," IEEE Trans. Circuits Syst. II, vol. 45, no. 9, pp. 1220–1231, Sep. 1998.
[44] T. Lin and A. Payne, "Translinear sinusoidal frequency tripler," Electron. Lett., vol. 38, no. 10, pp. 441–442, May 2002.
[45] M. Gu, K. Misra, H. Radha, and S. Chakrabartty, "Sparse decoding of low density parity check codes using margin propagation," in Proc. of IEEE Globecom, Nov. 2009.
[46] ——, "Sparse decoding of low-density parity-check codes based on margin propagation," IEEE Trans. Commun., 2010, submitted.
[47] W. E. Ryan, "An introduction to LDPC codes," in CRC Handbook for Coding and Signal Processing for Recording Systems. Taylor, CRC Press, 2004, ch. 36.
[48] F. Kschischang, B. J. Frey, and H.-A.
Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
[49] F.-S. Lai and C.-F. E. Wu, "A hybrid number system processor with geometric and complex arithmetic capabilities," IEEE Trans. Comput., vol. 40, no. 8, pp. 952–962, Aug. 1991.
[50] E. I. Chester and J. N. Coleman, "Matrix engine for signal processing applications using the logarithmic number system," in Proc. IEEE Application-Specific Syst., Architectures and Processors, Jul. 2002, pp. 315–324.
[51] J. N. Coleman, E. I. Chester, C. I. Softley, and J. Kadlec, "Arithmetic on the European logarithmic microprocessor," IEEE Trans. Comput., vol. 49, no. 7, pp. 702–715, Jul. 2000.
[52] H. Kim, B.-G. Nam, J.-H. Sohn, J.-H. Woo, and H.-J. Yoo, "A 231-MHz, 2.18-mW 32-bit logarithmic arithmetic unit for fixed-point 3-D graphics system," IEEE J. Solid-State Circuits, vol. 41, no. 11, pp. 2373–2381, Nov. 2006.
[53] B.-G. Nam, H. Kim, and H.-J. Yoo, "A low-power unified arithmetic unit for programmable handheld 3-D graphics systems," IEEE J. Solid-State Circuits, vol. 42, no. 8, pp. 1767–1778, Aug. 2007.
[54] K. H. Abed and R. E. Siferd, "CMOS VLSI implementation of a low-power logarithmic converter," IEEE Trans. Comput., vol. 52, no. 11, pp. 1421–1433, Nov. 2003.
[55] S. L. SanGregory, C. Brothers, and D. Gallagher, "A fast, low-power logarithm approximation with CMOS VLSI implementation," in Proc. IEEE Midwest Symp. Circuits and Systems, Aug. 1999, pp. 388–391.
[56] Y.-H. Lam, W.-H. Ki, and C.-Y. Tsui, "Integrated low-loss CMOS active rectifier for wirelessly powered devices," IEEE Trans. Circuits Syst. II, vol. 53, no. 12, pp. 1378–1382, Dec. 2006.
[57] A. Gore, S. Chakrabartty, S. Pal, and E. C. Alocilja, "A multichannel femtoampere-sensitivity potentiostat array for biosensing applications," IEEE Trans. Circuits Syst. I, vol. 53, no. 11, pp. 2357–2363, Nov. 2006.
[58] T. Serrano-Gotarredona, B. Linares-Barranco, and A. G.
Andreou, "A general translinear principle for subthreshold MOS transistors," IEEE Trans. Circuits Syst. I, vol. 46, no. 5, pp. 607–616, May 1999.
[59] B. Razavi, Design of Analog CMOS Integrated Circuits. McGraw-Hill Science/Engineering/Math, Aug. 2000.
[60] P. R. Kinget, "Device mismatch and tradeoffs in the design of analog circuits," IEEE J. Solid-State Circuits, vol. 40, no. 6, pp. 1212–1224, Jun. 2005.
[61] S. Chakrabartty and G. Cauwenberghs, "Sub-microwatt analog VLSI trainable pattern classifier," IEEE J. Solid-State Circuits, vol. 42, no. 5, pp. 1169–1179, May 2007.
[62] K. Kang and T. Shibata, "An on-chip-trainable Gaussian-kernel analog support vector machine," IEEE Trans. Circuits Syst. I, vol. 57, no. 7, pp. 1513–1524, Jul. 2010.
[63] S. Chakrabartty and G. Cauwenberghs, "Gini-support vector machine: Quadratic entropy based multi-class probability regression," Journal of Machine Learning Research, vol. 8, pp. 813–839, 2007.
[64] R. G. Gallager, Low Density Parity-Check Codes, R. G. Gallager, Ed. MIT Press, Cambridge, MA, 1963.
[65] European Telecommunications Standards Institute (ETSI), "Digital Video Broadcasting (DVB) second generation framing structure, channel coding and modulation systems for broadcasting, interactive services, news gathering and other broadband satellite applications; tm 2860rl dvbs2-74r8," www.dvb.org.
[66] B. Bangerter et al., "Wireless technologies: High-throughput wireless LAN air interface," http://www.intel.com/technology/itj.
[67] Y. Rashi et al., "LDPC: Efficient alternative FEC for the TFI-OFDM PHY proposal," http://www.grouper.ieee.org/groups/802/15/pub.
[68] R. M. Tanner, "A recursive approach to low complexity codes," IEEE Trans. Inf. Theory, vol. 27, no. 5, pp. 533–547, Sep. 1981.
[69] J. Hagenauer, E. Offer, and L. Papke, "Iterative decoding of block and convolutional codes," IEEE Trans. Inf. Theory, vol. IT-42, no. 3, pp. 429–445, Mar. 1996.
[70] D. J. C.
MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Trans. Inf. Theory, vol. 45, no. 2, pp. 399–431, Mar. 1999.
[71] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, and D. A. Spielman, "Improved low-density parity-check codes using irregular graphs," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 585–598, Feb. 2001.
[72] M. M. Mansour and N. R. Shanbhag, "A 640 Mbps 2048-bit programmable LDPC decoder chip," IEEE J. Solid-State Circuits, vol. 41, pp. 684–698, Mar. 2006.
[73] A. J. Blanksby and C. J. Howland, "A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder," IEEE J. Solid-State Circuits, vol. 37, no. 3, Mar. 2002.
[74] A. Darabiha, A. C. Carusone, and F. R. Kschischang, "A 3.3-Gbps bit-serial block-interlaced min-sum LDPC decoder in 0.13-µm CMOS," in IEEE Custom Integrated Circuits Conference, Sep. 2007, pp. 459–462.
[75] T. Brandon, R. Hang, G. Block, V. C. Gaudet, B. Cockburn, S. Howard, C. Giasson, K. Boyle, P. Goud, S. S. Zeinoddin, A. Rapley, S. Bates, D. Elliott, and C. Schlegel, "A scalable LDPC decoder ASIC architecture with bit-serial message exchange," Elsevier Integration, the VLSI Journal, vol. 41, no. 3, pp. 385–398, May 2008.
[76] J. Hagenauer and M. Winklhofer, "The analog decoder," in Proceedings of IEEE International Symposium on Information Theory (ISIT '98), Cambridge, MA, USA, Aug. 1998, p. 145.
[77] S. Hemati, A. Banihashemi, and C. Plett, "An 80-Mb/s 0.18-µm CMOS analog min-sum iterative decoder for a (32,8,10) LDPC code," IEEE J. Solid-State Circuits, vol. 41, pp. 2531–2540, Nov. 2006.
[78] C. Winstead, N. Nguyen, V. Gaudet, and C. Schlegel, "Low-voltage CMOS circuits for analog iterative decoders," IEEE Trans. Circuits Syst. I, vol. 53, pp. 829–841, Apr. 2006.
[79] V. Gaudet, "Energy efficient circuits for LDPC decoding," CMOS Emerging Technologies, 2007.
[80] A. Liveris, Z. Xiong, and C.
Georghiades, "Compression of binary sources with side information at the decoder using LDPC codes," IEEE Communications Letters, vol. 6, no. 10, pp. 440–442, Oct. 2002.
[81] M. Sartipi and F. Fekri, "Source and channel coding in wireless sensor networks using LDPC codes," in Proc. First Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON 2004), Oct. 2004, pp. 309–316.
[82] M. Potluri, S. Chilumuru, S. Kambhampati, and K. Namuduri, "Distributed source coding using non-binary LDPC codes for sensor network applications," in Proc. 10th Canadian Workshop on Information Theory (CWIT '07), Jun. 2007, pp. 105–109.
[83] D. MacKay and R. Neal, "Near Shannon limit performance of low density parity check codes," Electronics Letters, vol. 33, no. 6, pp. 457–458, Mar. 1997.
[84] S.-Y. Chung, G. D. Forney, Jr., T. Richardson, and R. Urbanke, "On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit," IEEE Communications Letters, vol. 5, no. 2, pp. 58–60, Feb. 2001.
[85] ITRS, "International technology roadmap for semiconductors," http://www.itrs.net, 2010.
[86] T. Richardson and R. Urbanke, "The capacity of low-density parity check codes under message-passing decoding," IEEE Trans. Inf. Theory, vol. 47, pp. 599–618, 2001.
[87] T. Richardson, M. Shokrollahi, and R. Urbanke, "Design of capacity approaching irregular low-density parity-check codes," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 619–637, Feb. 2001.
[88] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003, ch. 6, pp. 557–572, http://www.inference.phy.cam.ac.uk/mackay/itila/.
[89] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, Apr. 2006.
[90] L. Mancera and J. Portilla, "L0-norm-based sparse representation through alternate projections," in Proc. 13th International Conference on Image Processing (ICIP), Atlanta, GA, USA, Oct. 2006.
[91] S.-Y. Chung, T.
Richardson, and R. Urbanke, "Analysis of sum-product decoding of low-density parity-check codes using a Gaussian approximation," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 657–670, Feb. 2001.
[92] P. Rusmevichientong and B. Van Roy, "An analysis of belief propagation on the turbo decoding graph with Gaussian densities," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 745–765, Feb. 2001.
[93] A. Amraoui, "LdpcOpt," http://ipgdemos.epfl.ch/ldpcopt/.
[94] O. C. Ibe, Fundamentals of Applied Probability and Random Processes, 1st ed. Academic Press, 2005.
[95] C.-H. Lin and K. Bult, "A 10-b, 500-MSample/s CMOS DAC in 0.6 mm²," IEEE J. Solid-State Circuits, vol. 33, no. 12, pp. 1948–1958, Dec. 1998.
[96] V. Gaudet and G. Gulak, "A 13.3-Mb/s 0.35µm CMOS analog turbo decoder IC with a configurable interleaver," IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 2010–2015, Nov. 2003.
[97] A. Darabiha, A. C. Carusone, and F. R. Kschischang, "Power reduction techniques for LDPC codes," IEEE J. Solid-State Circuits, vol. 43, no. 8, pp. 1835–1845, Aug. 2008.
[98] Z. Zhang, V. Anantharam, M. J. Wainwright, and B. Nikolić, "An efficient 10GBASE-T Ethernet LDPC design with low error floors," IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 843–855, Apr. 2010.
[99] C. Winstead, J. Dai, S. Yu, C. Myers, R. Harrison, and C. Schlegel, "CMOS analog MAP decoder for (8,4) Hamming code," IEEE J. Solid-State Circuits, vol. 39, no. 1, pp. 122–131, Jan. 2004.
[100] D. Vogrig, A. Gerosa, A. Neviani, A. Amat, G. Montorsi, and S. Benedetto, "A 0.35-µm CMOS analog turbo decoder for the 40-bit rate 1/3 UMTS channel code," IEEE J. Solid-State Circuits, vol. 40, pp. 753–762, Mar. 2005.
[101] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," Comput. Vis. Pattern Recogn., pp. 130–136, 1999.
[102] R. Genov and G. Cauwenberghs, "Charge-mode parallel architecture for matrix-vector multiplication," IEEE Trans. Circuits Syst. II, vol. 48, no.
10, pp. 930–936, 2001.
[103] S. Chakrabartty and G. Cauwenberghs, "Sub-microwatt analog VLSI support vector machine for pattern classification and sequence estimation," in Adv. Neural Information Processing Systems (NIPS 2004), vol. 17. Cambridge, MA: MIT Press, 2005.
[104] ——, "Sub-microwatt analog VLSI trainable pattern classifier," IEEE J. Solid-State Circuits, vol. 42, no. 5, pp. 1169–1179, May 2007.
[105] B. Boser, I. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annu. ACM Workshop Computational Learning Theory, 1992, pp. 144–152.
[106] A. Bandyopadhyay, G. J. Serrano, and P. Hasler, "Adaptive algorithm using hot-electron injection for programming analog computational memory elements within 0.2% of accuracy over 3.5 decades," IEEE J. Solid-State Circuits, vol. 41, no. 9, pp. 2107–2114, Sep. 2006.
[107] W. L. Barber and E. R. Brown, "A true logarithmic amplifier for radar IF applications," IEEE J. Solid-State Circuits, vol. 15, no. 3, pp. 291–295, Jun. 1980.
[108] A. Basu, R. W. Robucci, and P. E. Hasler, "A low-power, compact, adaptive logarithmic transimpedance amplifier operating over seven decades of current," IEEE Trans. Circuits Syst. I, vol. 54, no. 10, pp. 2167–2177, Oct. 2007.
[109] C. D. Holdenried, J. W. Haslett, J. G. McRory, R. D. Beards, and A. J. Bergsma, "A DC-4GHz true logarithmic amplifier: Theory and implementation," IEEE J. Solid-State Circuits, vol. 37, no. 10, pp. 1290–1299, Oct. 2002.
[110] E. Vittoz and J. Fellrath, "CMOS analog integrated circuits based on weak inversion operations," IEEE J. Solid-State Circuits, vol. 12, no. 3, pp. 224–231, Jun. 1977.
[111] C. Huang, P. Sarkar, and S. Chakrabartty, "Rail-to-rail, thermal-noise limited, linear hot-electron injection programming of floating-gate voltage bias generators at 13-bit resolution," IEEE J. Solid-State Circuits, vol. 46, no. 11, pp. 2685–2692, Nov. 2011.
[112] R. L. Bunch and S. Raman, "Large-signal analysis of MOS varactors in CMOS-Gm LC VCOs," IEEE J.
Solid-State Circuits, vol. 38, no. 8, pp. 1325–1332, Aug. 2003.
[113] T. Delbruck and D. Oberhoff, "Self-biasing low power adaptive photoreceptor," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 4, May 2004, pp. 844–847.
[114] D. H. K. Hoe and D. B. Ribner, "An auto-ranging photodiode preamplifier with 114 dB dynamic range," IEEE J. Solid-State Circuits, vol. 31, no. 2, pp. 187–194, Feb. 1996.
[115] M. J. Hayes, "A nonlinear optical preamplifier for sensing applications," IEEE Trans. Circuits Syst. I, vol. 49, no. 1, pp. 1–9, Jan. 2002.
[116] A. Basu, S. Brink, C. Schlottmann, S. Ramakrishnan, C. Petre, S. Koziol, F. Baskaya, C. M. Twigg, and P. Hasler, "A floating-gate-based field-programmable analog array," IEEE J. Solid-State Circuits, vol. 45, no. 9, pp. 1781–1794, Sep. 2010.
[117] M. Gu and S. Chakrabartty, "Synthesis of bias-scalable CMOS analog computational circuits using margin propagation," IEEE Trans. Circuits Syst. I, vol. 59, no. 2, pp. 243–254, Feb. 2012.
[118] E. Säckinger and W. Guggenbühl, "A high swing, high impedance MOS cascode circuit," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 289–298, Feb. 1990.
[119] T. Itakura and Z. Czarnul, "High output-resistance CMOS current mirrors for low-voltage applications," IEICE Trans. Fundamentals, vol. E80-A, no. 11, pp. 230–232, Jan. 1997.
[120] F. You, S. H. K. Embabi, J. F. Duque-Carrillo, and E. Sánchez-Sinencio, "An improved tail current source for low voltage applications," IEEE J. Solid-State Circuits, vol. 32, no. 8, pp. 1173–1180, Aug. 1997.
[121] T. Serrano and B. Linares-Barranco, "The active-input regulated cascode current-mirror," IEEE Trans. Circuits Syst. I, vol. 41, no. 6, pp. 464–467, Jun. 1994.
[122] J. Ramírez-Angulo, R. G. Carvajal, and A. Torralba, "Low supply voltage high-performance CMOS current mirror with low input and output voltage requirements," IEEE Trans. Circuits Syst. II, vol. 51, no. 3, pp. 124–129, Mar. 2004.
[123] X. Zhang and E. I.
El-Masry, "A regulated body-driven CMOS current mirror for low-voltage applications," IEEE Trans. Circuits Syst. II, vol. 51, no. 10, pp. 571–577, Oct. 2004.