IMPLANTABLE VLSI SYSTEMS FOR COMPRESSION AND COMMUNICATION IN WIRELESS BIOSENSOR RECORDING ARRAYS

By

Awais Mehmood Kamboh

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Electrical Engineering

2010

ABSTRACT

IMPLANTABLE VLSI SYSTEMS FOR COMPRESSION AND COMMUNICATION IN WIRELESS BIOSENSOR RECORDING ARRAYS

By

Awais Mehmood Kamboh

Successful use of microelectrode arrays to record neural activity in the cortex has opened new opportunities for scientists to decode the intricate functionality of the human brain and the behavior of the neurons that enable its complex operation. The resulting brain-machine interface devices play a critical role in enabling patients with neural disorders to achieve a better lifestyle. Such devices provide a direct interface to the brain and show great promise in many biomedical applications. This thesis explores some of the major obstacles impeding the advance of wireless neural implants and addresses them through the development of highly efficient algorithms and implantable hardware.

An overwhelming amount of data is generated by the microelectrode arrays, resulting in a data bandwidth bottleneck. To overcome this problem, an implantable system has been devised to enable control over the amount of data that must be transmitted without compromising the information contained in the array of neural signals. Furthermore, the nature of the wireless communication channel across the skin tissue is not well characterized. In this thesis, solutions have been developed to maximize the data throughput and enable unfailing yet low-power communication of bidirectional data between the implanted device and the external world. Finally, a unified energy-efficient, implantable CMOS integrated circuit was developed to address these two critical problems. The resulting integrated solution ensures seamless multi-modal operation, and thus establishes a pathway to the design of next-generation neuroprosthetic devices. Although the motivation for this thesis comes from the field of neuroprosthetics, the solutions devised are pertinent to a wide range of implantable applications.

Copyright by AWAIS MEHMOOD KAMBOH 2010

Dedicated to my parents

ACKNOWLEDGEMENTS

"In the name of Allah, the Beneficent, the Merciful" The Quran (Chapter 1: Verse 1)

I would like to thank my advisor, Professor Andrew Mason, for his continued guidance and support through my PhD research. He has been a true mentor to me. I would also like to acknowledge my dissertation committee, including Professor Hayder Radha, Professor Richard Enbody and Professor Ramakrishna Mukkamala, for their valuable feedback that made this dissertation possible.

I must also thank my colleagues at the AMSaC lab who helped me put my ideas into practice. Special thanks to Chao, Yue and Daniel, whose technical insights helped improve my work. Matthew, Garrett and Leslie also helped me achieve my research goals. Waqar, Xiaowen, Xiaoyi, Yuning, Stefan, James, Haitao, Lin and Liya made my stay at MSU wonderful. I am also grateful to Khawar, Saad, Jawad, Khurram, Usman, Sajjad, Attaullah, Qaiser, Abdul Rehman, Ahmed, Aqeel and Zubair for their friendship.

My research would not have been possible without the support from the National University of Sciences and Technology, Islamabad, and the Higher Education Commission of Pakistan. My research was also funded in part by the National Institutes of Health, USA.
Finally and most importantly, I want to thank my parents and siblings for believing in me, and my wife and daughter for supporting me every step of the way.

Awais Mehmood Kamboh

TABLE OF CONTENTS

List of Tables
List of Figures

1 Introduction
1.1 Motivation
1.2 Neural Recording Techniques
1.2.1 Intracranial EEG
1.3 Characteristics of Neural Signals
1.4 Source Separation and Spike Sorting
1.5 Challenges Associated with Array Based Recording
1.6 Contemporary Research Groups
1.6.1 Review
1.6.2 Drawbacks
1.7 Research Goals and Approach
1.7.1 Goals Statement
1.7.2 Approach
1.7.3 System Architecture
1.8 Challenges

2 Hardware Realization of DWT for Neural Recording
2.1 Discrete Wavelet Transform for Neural Recording
2.1.1 Background
2.1.2 Wavelet Representation of Signals
2.1.3 Basis Selection
2.2 Algorithms for DWT
2.2.1 Lifting Factorization
2.2.2 B-Spline Factorization
2.2.3 Algorithmic Comparison
2.3 Minimized Hardware Implementations
2.3.1 Lifting Implementation
2.3.2 B-spline Implementation
2.4 Sequential vs. Pipelined Implementations
2.4.1 Pipelined Design for Lifting DWT
2.4.2 Sequential Design for Lifting DWT
2.4.3 Pipeline vs. Sequential Lifting DWT
2.5 Multi-Channel / Multi-Level Sequential DWT
2.5.1 Single Channel / Single Level Architecture
2.5.2 Multi Channel / Multi Level Architecture
2.6 Comparative Analysis for Multi Channel/Level DWT
2.6.1 Multi Channel / Multi Level Lifting vs. B-spline
2.6.2 Multi Channel / Multi Level Sequential vs. Pipeline
2.7 VLSI Architecture for Multi-Channel DWT
2.7.1 Architecture Overview
2.7.2 Computation Core
2.7.3 Memory Modules
2.7.4 State Machine Controller
2.8 Simulation Results and Analysis
2.8.1 Hardware Resources
2.8.2 Data Compression vs. Reconstruction Error
2.9 The Design Flow
2.10 Integrated Circuit for DWT: Design and Test
2.10.1 Design
2.10.2 Test Setup
2.11 Improvements in DWT

3 Design of Implantable Neural Interface Node
3.1 Motivation and Approach
3.2 Architecture Overview
3.3 Data Compression Engine at the NIN
3.3.1 Analog Interface Controller
3.3.2 Discrete Wavelet Transform
3.3.3 Multi-Level Threshold
3.3.4 Run Length Encoder
3.4 Communication Controller and Interface
3.4.1 Packet Structures
3.4.2 Protocol Design
3.4.3 Efficiency
3.5 Global Controller
3.5.1 Modes of Operation
3.5.2 Power Management
3.6 RF Data and Power Receiver
3.7 Results & Performance Evaluation
3.8 Hardware Implementation and FPGA Test
3.9 VLSI Layout of Digital System on Chip in 0.13 Micron CMOS
3.10 Analysis

4 Channel Characterization for Implant to Body Surface Communication
4.1 Introduction
4.2 Wireless Power / Data Transceiver
4.3 Experimental Setup
4.3.1 Trace Collection
4.3.2 Physical Setup
4.4 Transceiver Performance Analysis
4.4.1 Transceiver Orientation Diversity
4.4.2 Transceiver Characterization Results
4.5 Channel Memory
4.5.1 Bit Level Memory
4.5.2 Packet Level Memory
4.6 Channel Characterization Across Animal Skin
4.6.1 Physical Setup
4.6.2 Channel Characterization Results
4.7 Forward Error Correction
4.8 Conclusions

5 High Density Feature Extraction
5.1 Introduction
5.2 Theory
5.2.1 Spike Detection and Classification
5.2.2 Feature Selection and Extraction
5.2.3 Complexity Calculations and Hardware Resources
5.3 Methods
5.4 Results and Discussion
5.5 ZCF vs DWT

6 Summary and Future Work
6.1 Summary
6.2 Contributions
6.3 Future Work
6.4 Papers Written
Appendix A: Neural Recording Techniques
Appendix B: Quantization
Appendix C: Design Flow
Appendix D: Design for Testability
Bibliography

LIST OF TABLES

2.1 Coefficients of the symlet-4 lifting factorization
2.2 Coefficients of the B-spline factorization
2.3 Comparison of required computation
2.4 Characteristics of single-level, single-channel integer DWT hardware for pipeline and sequential configurations at two technology nodes
2.5 Hardware comparison of lifting and B-spline architectures
2.6 CC transistor counts and worst case values
2.7 Comparison of SRAM and DRAM memory implementations
2.8 Transistor count and area for hardware modules
2.9 Average activity per cycle for hardware modules
3.1 Estimated area and power requirement for implantable hardware modules in 0.5 micron CMOS technology
4.1 Comparison of BER, PER and packet memory for different settings before and after 3-bit FEC is employed
5.1 Comparison of complexity and required memory for different algorithms

LIST OF FIGURES

1.1 A typical neural signal recording using a microelectrode after band pass filtering to remove LFP and high frequency noise
1.2 Concept diagram for the proposed wireless neural recording system showing implantable neural interface node and the extra-cranial manager interface module
1.3 Block diagram of the implantable neural data compression engine and its position within an implantable neural recording system
2.1 The 'symlet-4' wavelet basis function
2.2 The multilevel discrete wavelet transform decomposition and reconstruction
2.3 Lifting scheme for DWT
2.4 B-spline architecture for computing the symlet-4 based DWT
2.5 Computation core for lifting DWT [12]
2.6 (a) Minimum chip area computation core for B-spline, CCB1. (b) Minimum delay computation core for B-spline, CCB2
2.7 Pipeline structure for integer-lifting wavelet transform with data notations to match filter equations in (2.12) at a single point in time. Boxes marked with 'z' represent pipeline stages and registers that need to be stored in memory for multi-channel multi-level processing
2.8 Sequential structure over five operation cycles.
Boxes marked with 'z' represent registers that need to be stored in memory for multi-channel multi-level processing
2.9 The general computation unit architecture for sequential DWT
2.10 Comparison of relative area and power consumption vs. number of levels and channels for B-Spline, B-Spline 2 and Lifting architectures
2.11 Comparison of multichannel/multilevel pipeline and sequential DWT approaches: relative chip area and relative power consumption versus number of levels and channels
2.12 Power-area product versus level and channel for pipeline and sequential approaches
2.13 Relative area versus channels of data memory compared to all other blocks for sequential and pipeline designs, at L = 4
2.14 Data flow diagram for lifting DWT using the symlet-4 basis
2.15 Complete system diagram for sequential calculation of DWT
2.16 Computation Core architecture for integer lifting DWT
2.17 Single bit full adder cell used in the CC adders based on pass transistor logic [8]
2.18 DWT circuit operation phases for a single computation cycle
2.19 Per channel activity of the CC at different levels of decomposition for a 4-level implementation
2.20 Data movement in CC memory during the five calculation cycles within a computation cycle (one per sample per level). Top line shows the six register names. Subsequent lines define how data must be moved as the steps in (2.20) are processed sequentially
2.21 RMS error and entropy in bits per symbol as a function of threshold value for the neural data set used in our experiments
2.22 Original and reconstructed signals for Threshold = 50, RMS error = 37.95 and Entropy = 4.78 bits per symbol at 4 levels of decomposition with 10 bits data and 5 bits coefficients
2.23 Original and reconstructed signals for Threshold = 120, RMS error = 87.69 and Entropy = 1.48 bits per symbol at 4 levels of decomposition with 10 bits data and 5 bits coefficients
2.24 High-level ASIC design flow
2.25 DWT system on chip, fabricated in 0.5 µm technology
2.26 Photograph of test PCB designed to interface ASIC with FPGA and DAQ cards
2.27 Labview interface to control DAQ cards providing input data and recording output data from the ASIC
3.1 Block diagram of the implantable neural data compression engine and its position within an implantable neural recording system
3.2 Analog module for multichannel neural data recording with variable gain amplifier
3.3 System diagram for sequential calculation of DWT
3.4 System diagram for the Threshold and the RLE blocks
3.5 Example of the system specific run length encoding
3.6 (Top) MIM to NIN Command Packet. (Bottom) NIN to MIM Data Packet
3.7 Flow diagrams for the communication protocol. The diagram shows normal flow as well as handling of error conditions
3.8 Functional diagram for the communication controller
3.9 Timing diagrams. (Top) Packet transmission from the NIN. (Bottom) Packet reception at the NIN
3.10 Functional diagram of circuits for inductive power and clock transfer and RF data transmission from MIM to NIN
3.11 System performance as a function of threshold value for the neural data set used in our experiments
3.12 (a) Original signal. (b) 2 times compression. (c) 16 times compression. (d) 62 times compression
3.13 Part of the 32 neural signals received and reconstructed by the MIM, at a compression ratio of 40, in comparison to the original signal acquired by the NIN. Despite some apparent signal loss, spikes are easily discriminated
3.14 Histogram of time taken in milliseconds by the compressed neural data to completely fill each Data Packet
3.15 Layout of the complete NIN digital system in 0.13 micron CMOS technology
4.1 Experimental setup used for transceiver characterization
4.2 Scaled diagram of different settings used to measure the performance of the transceiver, including the distance, the tilt and the misalignment
4.3 Bit Error Rate and Packet Error Rate versus the Distance in mm
4.4 Bit Error Rate and Packet Error Rate versus the Mis-Alignment in mm
4.5 Bit Error Rate and Packet Error Rate versus the Angular Tilt in degrees
4.6 Packet errors versus time for disturbance effect test. The 100,000 packets are divided into bins of 100 packets and show the number of packet errors in each bin
4.7 Bit Error Rate and Packet Error Rate versus the Distance in mm showing the effect of physical disturbance on BER and PER
4.8 Histogram of packet level memory for 5mm distance with disturbance
4.9 Histogram of bit level memory for 5mm distance with disturbance
4.10 Histogram of number of bit errors per packet from 5mm distance with disturbance
4.11 Experimental setup used for channel characterization including animal skin
4.12 Packet errors versus time for the case of 10mm distance with animal skin. The 500,000 packets are divided into bins of 100 packets and show the number of packet errors in each bin
4.13 Histogram of packet level memory for 10mm distance through animal skin
4.14 Histogram of bit level memory for 10mm distance through animal skin
4.15 Histogram of number of bit errors per erroneous packet for 10mm distance through animal skin
4.16 Comparison of BER and PER versus distance for various experimental settings
4.17 Comparison of PER before and after 3-bit FEC for different settings
4.18 Comparison of BER before and after 3-bit FEC for different settings
5.1 General data flow and typical components of a neural recording and classification system
5.2 Graphical representation of ZCF features for a spike detected using dual thresholds
5.3 Hardware architecture of proposed spike detection and feature extraction units
5.4 Channels and spikes used in simulations. Ten different channels consisting of ten different spike shapes were used, with each channel containing three spike shapes. (Top row) Channels 1-5 from left to right. (Bottom row) Channels 6-10 from left to right
5.5a (Top) Spikes detected on the channel. (Bottom) Color-coded spikes for different neurons. This knowledge is available from the neural signal simulator and is presented here only for reference
5.5b (Top) ZCF mapping of the spikes on the ZCF feature space as seen by the spike classifier. (Bottom) ZCF mappings are color-coded to distinguish between neurons
5.5c (Top) ZCF mappings are grouped into three clusters using Mahalanobis distance; 1.71% of spikes are mis-classified. (Bottom) ZCF mappings are grouped into clusters using K-Means clustering; 6.14% of spikes are mis-classified
5.6 Percentage values of (i) DEDT, (ii) CEP, (iii) CEDT and (iv) DCEDT versus different values of SNR. Data represents performance results for the DT spike detector, the ZCF and PCA feature extractors and the combined spike sorting system
6.1 NIN chip fabricated in 0.13 micron CMOS
B.1 Effect of round off and quantization errors on the signal fidelity as a function of quantized data word length and coefficient word length
C.1 High-level ASIC design flow
D.1 Scan chain insertion for testability including both observability and controllability

1 Introduction

1.1 Motivation

How the brain functions is one of the questions that have puzzled scientists and philosophers alike for centuries. Modern scientists have been making significant advances towards solving this mystery. Apart from satiating an intellectual desire, a better understanding of brain function has been helping neurophysiologists develop treatments for a variety of disorders, including Parkinson's disease and several auditory disorders. It also helps us better understand the physiological changes during seizures, migraines and epilepsy. Restoration of motor and sensory capability for the disabled, such as quadriplegic, blind or deaf persons and persons with missing or damaged limbs, is one of the main reasons for continued interest in a better understanding of the working of the brain. Deep brain stimulation and visual implants are some treatments that already benefit from a partial understanding of brain functionality. The ability to monitor, record and process neural signals is thus essential to the development of treatments for any of the numerous neural disorders.

Neuroprosthetic devices and Brain Machine Interfaces (BMIs) are increasingly playing a vital role in helping patients with severe motor disorders achieve a better lifestyle by enabling a direct interface to the central nervous system. Thus far, it has only been possible to monitor the limited brain activity observable above the scalp; however, scientists need to explore cortical signals observable only under the skull to find solutions to the above mentioned problems. Any advancement in the ability to monitor those signals will help millions of individuals with different neural disorders.

Recent advances in biomedical research have enabled recording of neural signals from large populations of cortical cells under the skull. However, the lack of a wireless data acquisition system hinders their use in brain machine interfaces for freely moving subjects. The research presented in this dissertation addresses this problem by developing implantable hardware employing on-chip signal processing and enabling wireless data telemetry for further processing of neural signals.

1.2 Neural Recording Techniques

There are different ways of recording neural activity currently in use: Electroencephalography (EEG), Magnetoencephalography (MEG), functional Magnetic Resonance Imaging (fMRI) and Positron Emission Tomography (PET), all of which are non-invasive. In several situations it is desirable to have higher spatial and temporal resolution than the above mentioned recording techniques provide, for a better understanding of intracranial connectivity and neural activity. This can only be achieved with highly invasive methods and the introduction of strip electrodes or microelectrode arrays. Electrocorticography (ECoG) and intracranial EEG are two such invasive techniques. An overview listing the pros and cons of these techniques can be found in Appendix A.

1.2.1 Intracranial EEG

Intracranial EEG has shown the capability of recording the activity of individual cells, and has recently been made possible by the use of microelectrode arrays. The signals are recorded by implanting electrodes into the cortex.
Recording the activity of cortical neurons with microelectrode arrays was shown to be essential to quantify the degree of involvement of each neuron in encoding movement parameters. The use of microelectrodes allows extremely high spatial and temporal resolution and allows recording from single cells at arbitrary sampling rates, thereby enabling decoding of the neural signals and, eventually, control of artificial limbs [1].

There are several challenges associated with this highly invasive procedure. Firstly, as with ECoG, surgery is required to implant the electrode array and supporting electronics into the cortex. Secondly, such high resolutions, on the order of microseconds and micrometers, result in a tremendous amount of data being generated. Thus far, clinical trials have used wires that extend from the skull to external machines to record the signals. Enabling wireless telemetry of these signals, however, will allow the subject to move freely and will permit long-term or permanent implants without any chance of infection in the scalp. Such BMIs show great promise in many biomedical applications.

1.3 Characteristics of Neural Signals

Neural signals recorded using high density microelectrode arrays can be subdivided into two kinds of potentials: the action potentials, also known as neural spikes, and the Local Field Potentials (LFP). A summary of the salient features of these signals is given below.

1. Action Potentials

Action potentials are the primary signal of interest for implantable neural signal processing systems. They are generated at the cell membrane of the neuron but can be recorded if the electrodes are a few microns away from the cell. Each electrode can thus receive signal from multiple neurons. These neural spikes have a wide frequency bandwidth ranging from 100 Hz to 10 kHz, while a typical spike lasts for about 1.5 ms. The amplitude of these spikes is about 100 mV relative to the extracellular fluid at the cell membrane; however, it reduces to within a range of 50-200 µV when measured using an electrode a few microns away. The recording electrodes also see a relatively large DC offset of 50 mV at the input. A typical neuron fires between 10 and 100 spikes per second when active.

2. Local Field Potentials

Local Field Potentials (LFP) are low frequency signals, below 100 Hz, which are generated as a result of electrical discharges from many neurons at the same time. Thus, where action potentials are the signals from individual neurons, the local field potentials are the superimposed signals from a large collection of neurons. The amplitude of the LFP is about 5 mV. The LFPs contain useful information about the collective activity of the neurons; however, this information does not change drastically from one electrode to another in microelectrode array based data acquisition. All the electrodes in the microelectrode array are likely to record the same LFP signal. As a result, the signal acquired by the array contains the high frequency action potentials superimposed with the low frequency LFP, where the LFPs can be up to a few thousand times stronger in amplitude than the action potentials. Since action potentials are the primary signal of interest, the LFPs have to be filtered out to allow any further signal processing.

1.4 Source Separation and Spike Sorting

Each electrode can acquire signals from more than one neuron, based on its location relative to the cells. The acquired waveform is thus a superposition of a number of spike trains from as many neurons. Since complex brain processes are a result of activity in a large number of neurons, it becomes necessary to separate the activity of each neuron from the others to enable understanding of inter-neuron functional and physical connectivity. The process of separating the spikes of different neurons is called spike sorting. Spikes from these neurons may or may not have different amplitudes, depending on the location of the electrode; however, they almost always have slightly different shapes which are unique to each neuron. This difference in shape can be used to separate the spikes of one neuron from those of others. Two parameters are deemed important by neuroscientists to enable further research: the shape of the spike and its time of arrival. Thus the task of any neural data acquisition system is to maintain those features of the waveform that enable spike sorting.

Fig. 1.1 shows a typical intracranial EEG recording using microelectrodes. Band pass filtering has been applied to remove the LFP and high frequency noise. The inset shows a single action potential, which is about 1.5 ms long. Action potentials from different neurons have slightly different shapes owing to their positions relative to the recording electrode. The noise on the signal is not necessarily white Gaussian.

Fig. 1.1. A typical neural signal recording using a microelectrode after band pass filtering to remove LFP and high frequency noise. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.
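The band pass filtering step shown in Fig. 1.1 is performed in the analog front end of the implant, but the same separation of spikes from the LFP can be prototyped off-line in software. The sketch below is a minimal illustration; the 25 kSps sampling rate follows the system described later, while the 300 Hz and 5 kHz corner frequencies are illustrative assumptions rather than values specified by this design.

import numpy as np
from scipy.signal import butter, filtfilt

def isolate_spike_band(raw, fs=25_000.0, f_lo=300.0, f_hi=5_000.0, order=4):
    # Band-pass away the large, slow LFP (below ~100 Hz) and the high
    # frequency noise, keeping the 100 Hz - 10 kHz action-potential band.
    nyq = fs / 2.0
    b, a = butter(order, [f_lo / nyq, f_hi / nyq], btype="band")
    # Zero-phase filtering preserves spike shape and arrival time, the two
    # waveform features that spike sorting depends on.
    return filtfilt(b, a, raw)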
1.5 Challenges Associated with Array Based Recording

There are various challenges associated with wireless recording of multi-channel neural signals using implanted microelectrode arrays. Any system designed to enable wireless neural recording must address the following issues; the arithmetic behind two of them is worked out after this list.

1. Data Bandwidth

One particular challenge with BMI technology is the need to transmit the high bandwidth neural data from the implanted device to the outside world for further analysis. For example, a typical recording experiment with a 100-electrode array sampled at 25 kHz per channel with 12-bit precision yields an aggregate data rate of 30 Mbps, which is well beyond the reach of state-of-the-art wireless telemetry links for biological applications.

2. Implant Size

Another significant challenge is the need to fit the implanted circuitry for the entire signal processing system within ~1 cm², for surgical reasons. A large system results in physical implantation challenges; there is only a small clearance between the cortex and the skull, so a small size is needed to minimize tissue trauma and displacement.

3. Wireless Power

Power supply to the implantable system is also one of the major challenges, as the use of a battery provides a highly invasive yet short term solution, since batteries need to be replaced and recharged. Inductive power transmission can be used as an alternative to deliver power to an implanted system over short distances from the power source. However, inductive power transmission has its limitations in terms of maximum deliverable power, and the power supply can be interrupted by coil misalignment due to motion in freely moving subjects.

4. Implant Power Density

High power dissipation of the system can result in damage to the neighboring neurons and cortical tissue, in effect killing the very neurons that are being recorded from. Studies show that an increase in temperature of over one degree Celsius can result in tissue damage. A power density of 0.58 mW/mm² results in a one degree rise in temperature in the tissue. Thus any design of an implantable system must conform to this dissipated power density limitation.

5. Processing Capability

The above mentioned challenges also directly influence the processing capability of the implantable system, as more computations translate to higher power dissipation, whereas more memory elements require larger area in silicon. These stringent requirements force the use of application specific integrated circuits over general purpose processors.

6. Signal Noise

The recorded signal contains various kinds of noise. Apart from the recording and thermal noise associated with the electronics, the acquired signal also contains background noise. Background noise is in effect the attenuated signal from neurons located far from the recording electrode. As a result, the background noise is correlated with the signal, occupies the same spectrum as the signal, and also contains features similar to neural spikes. However, the background noise has smaller amplitude than the signal of interest. Cross channel interference between recordings from different electrodes is another source of noise, as is the interference between the analog, digital and wireless transceiver blocks. Physiological noise as a result of the LFP, muscle EMG or cardiac activity also degrades the signal quality.
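Two of the constraints above reduce to simple arithmetic. The data rate in challenge 1 follows directly from the quoted experiment, and combining the power density limit of challenge 4 with the ~1 cm² footprint of challenge 2 bounds the total power budget; the 58 mW figure is a back-of-the-envelope consequence of those quoted numbers, not a specification of this design:

R = 100 \ \text{channels} \times 25{,}000 \ \tfrac{\text{samples}}{\text{s}} \times 12 \ \tfrac{\text{bits}}{\text{sample}} = 3.0 \times 10^{7} \ \tfrac{\text{bits}}{\text{s}} = 30 \ \text{Mbps}

P_{\max} \approx 0.58 \ \tfrac{\text{mW}}{\text{mm}^2} \times 100 \ \text{mm}^2 = 58 \ \text{mW} \quad \text{(for a } {\sim}1\ \text{cm}^2 \text{ implant at a } 1\,^{\circ}\text{C temperature rise)}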
A power density of 0.58 mW/mm results in one degree rise in temperature in the tissue. Thus any design of implantable system must conform to the dissipated power density limitation. 5. Processing capability The above mentioned challenges also directly influence the processing capability of the implantable system as more computations translate to higher power dissipation whereas more memory elements require larger area in silicon. These stringent requirements force the use of application specific integrated circuits over general purpose processors. 6. Signal Noise The recorded signal contains various kinds of noise. Apart from the recording and thermal noise associated with the electronics the acquired signal also contains background noise. Background noise is in effect the attenuated signal from neurons located far from the recording electrode. As a result the background noise is correlated to the signal, occupies the same spectrum as the signal, and also contains similar features as neural spikes. However, the background noise has smaller amplitude than the signal of interest. Cross channel interference between recordings from different electrodes is another source of noise, as well as the interference between the analog, the digital and 8 the wireless transceiver blocks. Physiological noise as a result of LFP, muscle EMG or cardiac activity also damages the signal quality. 1.6 Contemporary Research Groups 1.6.1 Review The problem of acquiring neural signals from the cortex using an implantable system and transmitting them to external processing units is being explored by a number of research groups. These groups tackle similar problems with slightly different goals and approaches. 1. Duke University, Durham, North Carolina Absolute thresholding is used in time domain to detect and extract spikes [2, 3]. Spikes can be extracted and transmitted for only a subset of channels; as a result spike sorting is possible for only a few channels in parallel. Streaming raw data for only one channel can be transmitted in parallel to the extracted spikes. Another mode transmits the spike count in time bins where a packet is transmitted every 50ms. This loses information about spike shapes and their occurrence times. A commercial 1 Mbps transceiver is used for data transmission. 2. University of Utah, Salt Lake City, Utah Absolute thresholding used in time domain to detect spikes in the analog front end before digitization [4, 5]. Thresholds are programmable for individual channels. A „one‟ is sent if a spike is detected, a zero is sent otherwise. As a result spike shapes are not preserved. In addition, one channel of raw data 9 (sampled at 15.6KSps) can be transmitted in parallel. Spike sorting not possible on other channels. Data, power and clock are transmitted over the inductively coupled wireless link. A 96 channel prototype using 100 amplifiers with 10-bit ADC is presented, consuming 8mW power. 3. Stanford University, Stanford, California Raw data can be transmitted for a single channel sampled at 15.6KSps. Recording system is wireless but not implantable [6]. Salient feature is the transmission range of up to 4 meters away from the subject at a power consumption of 63.2mW with a sampling frequency of 15.6KSps. The design has been reported to be scalable and suitable for an increase in the number of channels in future. 4. Ecole Polytechique de Montreal, Montreal, Canada Spike detection is implemented by applying a threshold to a nonlinear energy operator after digitization [7, 8]. 
1.6.2 Drawbacks

All of the above mentioned systems try to reduce the amount of data that needs to be transmitted out of the brain. Almost all systems use programmable thresholds to detect spikes, which means offline training is necessary to determine these thresholds. Some systems maintain spike shapes to allow off-chip spike sorting, while others use binary values to represent the presence of spikes. The following are some of the drawbacks of these approaches:

1. Most designs use spike extraction to preserve spike shapes but are constrained to a small number of channels because of limited bandwidth. Data overflow is reported when burst activity is encountered.

2. Some systems use binary values to represent the presence or absence of a spike but lose all information that enables spike sorting at later stages.

3. Most designs use thresholding in the time domain to detect the spikes. This can lead to a large number of false positives in low SNR conditions, or a large number of spikes can be missed when overly high thresholds are used.
4. False positives result in significant bandwidth loss if complete spike extraction is used; however, these can be detected and ignored in offline processing. In contrast, a false positive will remain undetectable, even in later stages of processing, if binary values are used to represent the presence or absence of spikes.

5. None of the systems takes into account the unreliability of the communication channel. Although the maximum data rates are sometimes mentioned, the performance of the transceivers has not been discussed thoroughly. Neither have any assumptions about the channel been stated, nor has any measure of data transfer accuracy been established.

6. Transmission of raw data has been made possible by the use of UWB transceivers; however, their feasibility for use in freely moving subjects is still under investigation, since precise alignment of the infrared receiver is not possible when the subject is moving.

1.7 Research Goals and Approach

1.7.1 Goals Statement

The goal of this thesis is to overcome some of the major obstacles impeding the advance of wireless neural implants. The problems addressed in this thesis lie in the realm of low-power system design and can be solved by developing highly efficient algorithms and hardware suitable for implantation, establishing a pathway to the design of modern neuroprosthetic devices. Based on a review of state-of-the-art neural recording systems, it has been identified that this goal can be achieved if the following objectives are met.

The first problem is the overwhelming amount of data generated by the microelectrode array, resulting in a data bandwidth bottleneck. The objective is to devise a system that enables some degree of control over the amount of data that needs to be transmitted, without compromising the information contained in the array of signals.

The second problem addressed in this thesis is the unreliable nature of the wireless communication channel across the skin tissue. The objective is to maximize the data throughput and enable unfailing yet low-power communication of bidirectional data between the implanted device and the external world.

The final objective is the integration of these respective solutions into a unified, energy efficient, implantable system for seamless operation of these functions. Although the motivation for this thesis comes from the field of neuroprosthetics, the aim is to devise solutions that are applicable to a wide range of implantable applications.

1.7.2 Approach

To achieve the above mentioned goals, and in contrast to the existing approaches listed above, this research will explore the realization of an implantable low-power on-chip signal processor to acquire and compress the multi-channel neural data before wireless transmission to the external units. The use of high level processing on-chip, for lossy as well as lossless compression, will allow efficient use of the bandwidth. Configurability built into the system will allow various degrees of signal information retention. The communication channel across the skin tissue will be studied, and a communication protocol and hardware will be devised to maximize the data throughput for streaming data communication. Finally, a scheme will be developed for ultra low power, hardware-constrained signal compression, for scalability and suitability to handle simultaneous recording from thousands of channels.
1.7.3 System Architecture

Based on the above mentioned system approach, Fig. 1.2 gives the conceptual diagram of the proposed wireless neural recording system. The system consists of two main modules, the implantable Neural Interface Node (NIN) and the extra-cranial Manager Interface Module (MIM).

The NIN is implanted on the cortex under the skull and is connected to one or more microelectrode arrays. The captured signals are amplified and digitized in the analog front end, which relays them to the neural processing unit (NPU). The NPU is responsible for the overall operation of the NIN, including the different modes of operation, compression of signals and error-free bi-directional communication between the NIN and the MIM. The data is transmitted through an RF wireless transceiver, which also receives commands from the MIM. The NIN is a battery-less module, which receives power and clock through an inductive link from the MIM.

The MIM is placed outside the scalp, opposite the NIN. This provides the best alignment between the NIN and MIM coils for power and data transfer. The two modules are only a few millimeters apart; however, in freely moving subjects the NIN on the cortex cannot remain fixed, and thus misalignment of the coils is expected during an experiment. The MIM receives the data from the NIN and transmits it to the external processing unit (EPU). The EPU issues appropriate commands to the MIM, which relays them to the NIN to change the operation of the system.

Fig. 1.3 shows the detailed block diagram of the NIN. The neural signal is acquired through the microelectrode array and then amplified and band pass filtered before conversion to digital words. The signals are sampled in a time-multiplexed fashion at a rate of 25 kSamples per second per channel. The function of this front-end analog block is controlled by an analog interface controller. The data is then passed to the compression engine, which is responsible for the lossy as well as lossless compression of the multi-channel neural data. Finally, the data is transmitted to the MIM through a communication controller, which handles the outgoing data and incoming commands over a half-duplex wireless link.

Fig. 1.2. Concept diagram for the proposed wireless neural recording system showing the implantable neural interface node and the extra-cranial manager interface module.

The wireless transceiver uses RF circuits to handle bi-directional data transfer. In addition, it is responsible for power transfer from the MIM to the NIN, as well as extraction of the system clock signal from the wireless carrier frequency. A global controller on the NIN unites all these components into a working system. Its most important task is to interpret the commands from the MIM and change the operation of the NIN accordingly. It is also responsible for managing the power and clock for each of the blocks.

All the digital components of the NIN are shown in green in Fig. 1.3. The focus of this thesis is the design and fabrication of these components and the study of the algorithms involved.

Fig. 1.3. Block diagram of the implantable neural data compression engine and its position within an implantable neural recording system.

1.8 Challenges

The following challenges need to be managed to achieve the goal and objectives outlined above.

1. The compression engine has been identified as the most computationally intensive of all the blocks. The main challenge is to identify and model different algorithms and architectures and to achieve the lowest area-power consuming implementation. Intelligent hardware design is required for an area-power efficient circuit. This requires an understanding of the hardware requirements of various compression algorithms.
Intelligent hardware design is required 18 for area-power efficient circuit. This requires understanding of the hardware requirements of various compression algorithms. 2. A communication protocol is required to be designed that enables errorfree half-duplex data communication of the streaming data while maintaining a high data throughput. The challenge is posed by the unavailability of any information about the transceiver performance and channel error profile. Channel characteristics need to be measured and transceiver performance needs to be analyzed and a protocol able to handle this communication needs to be developed. 3. With the rapid increase in the number of channels in a microelectrode array, the challenge is to design an ultra constrained set of features that allows spike sorting for thousands of channels simultaneously. 4. All the components including the analog interface controller, the compression engine and the communication controller have to come together into a single highly configurable system and work seamlessly. The challenge is the design of the system within the „implantability‟ limitations, i.e. the area, power and power density limitations discussed earlier. 19 2 Hardware Realization of DWT for Neural Recording 2.1 Discrete Wavelet Transform for Neural Recording For implanted microelectrode arrays, the power required to wirelessly transmit raw data to extra-cranial processing units is prohibitively large. Likewise, the hardware required to perform full neural data analysis is too complex to be implemented within an implanted system. Data compression before transmission is an attractive alternative, if it can be performed with minimal hardware resources. Discrete wavelet transform (DWT) is a very effective method for compressing neural data [16, 17]. Discrete wavelet transform enables time-frequency domain analysis of neural signals, which manifests more information than either the time domain or the frequency domain analysis with selectable tradeoff between time and frequency resolutions. In addition to the co-existence of time and frequency in wavelet domain, it also provides the flexibility of choice between large set of basis functions, which can result in different numbers of high-energy coefficients after the transform. For an appropriately selected wavelet basis function, the resulting non-zero DWT coefficients give a sparse representation of the signal that greatly reduces the power required to upload data. The useful information is mostly contained in the short transients, or spikes, above the noise level that result from the activity of an unknown number of neurons. It can be observed that the scarcity introduced by 20 the DWT compaction property enables very few “large” coefficients to capture most of the spikes‟ energy, while leaving many “small” coefficients attributed to noise. This property permits the later ones to be thresholded [18], yielding the denoised signal. 2.1.1 Background VLSI implementation of the discrete wavelet transform (DWT) has been widely explored in the literature as a result of the transform efficiency and applicability to a wide range of signals, particularly image and video [19, 20]. These implementations are generally driven by the need to fulfill certain characteristics such as regularity, smoothness and linear phase of the scaling and wavelet filters, as well as perfect reconstruction of the decomposed signals [21]. 
2.1.1 Background

VLSI implementation of the discrete wavelet transform (DWT) has been widely explored in the literature as a result of the transform's efficiency and applicability to a wide range of signals, particularly image and video [19, 20]. These implementations are generally driven by the need to fulfill certain characteristics, such as regularity, smoothness and linear phase of the scaling and wavelet filters, as well as perfect reconstruction of the decomposed signals [21]. In some applications, it is desirable to meet certain design criteria for VLSI implementation to enhance the overall system performance. For example, minimizing the area and energy consumption of the DWT chip is highly desirable in wireless sensor applications where resources are very scarce. In addition to miniaturized size, minimized power dissipation is strongly sought to limit tissue heating in biomedical applications where the chip needs to be implanted subcutaneously.

With few exceptions, recent efforts to optimize DWT hardware have concentrated on increasing throughput at the expense of area and power [22, 23]. In contrast, our work tries to identify an optimal implementation of the DWT architecture where chip area and power have priority over speed [24]. The case of computing the DWT for high throughput streaming data has not been fully explored [25]. It has been argued that the lifting scheme provides the fewest arithmetic operations, allowing larger savings in power consumption, but at the expense of a longer critical path than that of convolution-based implementations [26]. Recent work by Huang et al. [27] focused on analyzing DWT architectures with respect to tradeoffs between the critical path and internal buffer implementations. Such a critical path can be shortened using pipelining with additional registers, or using a so-called flipping structure with a fixed number of registers [28]. The B-spline approach [29], on the other hand, requires fewer multipliers than lifting, replacing them with adders that may permit a smaller chip area [23]. Nonetheless, most of the reported hardware approaches focus on computational speed and do not adequately address severe power and area constraints. By comparing with other implementations of the DWT in this chapter, we demonstrate that the appropriate compromise among power, size and speed of computation is achieved with a sequential implementation of the integer arithmetic lifting approach.

2.1.2 Wavelet Representation of Signals

The classical, convolution-based, dual-band DWT of a given signal involves recursively convolving the signal through two decomposition filters L(z) and H(z) and decimating the result to obtain the approximation and detail coefficients at every decomposition level j. These filters are derived from a scaling function and a wavelet function that satisfy subspace decomposition completeness constraints [16]. A typical FIR low pass and high pass 3-tap filter pair is expressed as

L(z) = l_0 + l_1 z^{-1} + l_2 z^{-2}    (2.1)

H(z) = h_0 + h_1 z^{-1} + h_2 z^{-2}    (2.2)

so that the output approximation and detail coefficients a^j and d^j, respectively, at the j-th level can be computed as

a^j(i) = \sum_{k=0}^{K-1} a^{j-1}(2i - k)\, l_k    (2.3)

d^j(i) = \sum_{k=0}^{K-1} a^{j-1}(2i - k)\, h_k    (2.4)

where K is the number of filter taps. The obtained coefficient vectors a^j and d^j are N/2^j-dimensional, where N is the length of the original single-channel input sequence. Equations (2.3) and (2.4) describe the original pyramidal algorithm reported by Mallat [16]. Reconstruction of the original sequence from the DWT coefficients is achieved through

a^{j-1}(2i - 1) = \sum_{k=0}^{(K/2)-1} a^{j}(i - k)\, \bar{l}_k    (2.5)

d^{j-1}(2i - 1) = \sum_{k=0}^{(K/2)-1} a^{j}(i - k)\, \bar{h}_k    (2.6)

where \bar{l}_k and \bar{h}_k are the coefficients of the synthesis filters, respectively.
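Read literally, (2.3) and (2.4) are a convolution followed by keeping every second output. The following is a minimal software rendering of the analysis side, assuming zero-padded (full) convolution at the signal boundaries; a hardware implementation must instead pick an explicit boundary extension.

import numpy as np

def analysis_level(a_prev, l, h):
    # np.convolve gives y[n] = sum_k a_prev[n - k] * f[k], so taking the
    # even-indexed outputs y[2i] is exactly (2.3) and (2.4).
    ya = np.convolve(a_prev, l)
    yd = np.convolve(a_prev, h)
    return ya[::2], yd[::2]  # a_j, d_j

def pyramid(x, l, h, levels):
    # Mallat's pyramidal algorithm (Fig. 2.2): the approximation is fed
    # back for further decomposition; the details are kept at each level.
    a, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        a, d = analysis_level(a, l, h)
        details.append(d)
    return a, details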
A compromise between signal fidelity and ease of hardware implementation has to be made. A near-optimal choice was proposed in [17] from a compression standpoint, where it was demonstrated that the biorthogonal and the symlet-4 wavelet functions are advantageous over other wavelet basis families for processing neural signals. From a hardware implementation viewpoint, the symlet-4 family has a much smaller support size for a similar number of vanishing moments compared to the biorthogonal basis [30]. Fig. 2.1 displays the wavelet basis function of symlet-4.

Fig. 2.1: The 'symlet-4' wavelet basis function.

Therefore, we will focus on the symlet basis of order 4 throughout the thesis. The low pass and high pass equations of the discrete 'symlet-4' basis are given by

L(z) = -0.076 - 0.030z^{-1} + 0.498z^{-2} + 0.804z^{-3} + 0.298z^{-4} - 0.099z^{-5} - 0.013z^{-6} + 0.032z^{-7}   (2.7)

H(z) = -0.032 - 0.013z^{-1} + 0.099z^{-2} + 0.298z^{-3} - 0.804z^{-4} + 0.498z^{-5} + 0.030z^{-6} - 0.076z^{-7}

2.2 Algorithms for DWT

Mallat's algorithm [16] has been used traditionally for evaluating the wavelet transform of a given signal; it involves recursively convolving the signal with two decomposition filters H(z) and L(z) and down-sampling the result to obtain the approximation and detail coefficients at every decomposition level [25]. Fig. 2.2 shows the multi-level decomposition of input data into DWT coefficients, where the blocks represent the filtering and down-sampling operations. The second half of Fig. 2.2 shows reconstruction of the signal from the transmitted coefficients. This chapter concentrates on the decomposition portion that must be implemented within the implantable device, whereas the reconstruction, if needed [31], can be done extra-cutaneously where computational resources are not scarce.

Fig. 2.2: The multilevel discrete wavelet transform decomposition and reconstruction.

2.2.1 Lifting Factorization

Any finite impulse response (FIR) wavelet transform can be expressed in terms of lifting steps [25]. Optimizing the conventional DWT algorithm, the lifting scheme analysis is described by a sequence of "predict" and "update" filters, S_n and T_n, respectively, which form the lifting steps in (2.8). Each lifting step is generally a one- or two-tap filter, which is computationally more efficient than longer, multiple-tap filters.

P(z) = \left( \prod_{i=1}^{n} \begin{bmatrix} 1 & S_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ T_i(z) & 1 \end{bmatrix} \right) \begin{bmatrix} K_1 & 0 \\ 0 & K_2 \end{bmatrix}   (2.8)

Mathematically, the lifting DWT is implemented by splitting the data into even and odd samples and applying the S_n and T_n filters as shown in Fig. 2.3. The data at each step, after applying the filters, is labeled f, f1, ..., fn and h, h1, ..., hn for the update and predict steps, respectively. The coefficient values B_0 to B_7 are listed in Table 2.1. The last step in (2.8) is a multiplication by the scaling factors K_1 and K_2, where K_1 = 1.571 and K_2 = 0.637. Because this multiplication can be omitted and correspondingly taken care of at the decoding side, the scaling factors have been eliminated in the following discussion. In the case of a multi-level decomposition, the outcome at an arbitrary decomposition level j is obtained as the approximation a and detail d. The former is fed back for the next level of DWT decomposition if required, whereas the latter is stored or transmitted as required.

Fig. 2.3: Lifting scheme for DWT.
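For concreteness, the lifting data flow of Fig. 2.3 can be written out directly. The following Python sketch applies the five symlet-4 lifting steps (the coefficients B_0 to B_7 of Table 2.1) to one decomposition level; it uses floating point, omits the K_1/K_2 scaling as discussed above, and leaves the block borders untransformed for simplicity.

    import numpy as np

    # Lifting coefficients B0..B7 from Table 2.1.
    B = np.array([0.3911, -0.1243, -0.3392, -1.4195,
                  0.162, 0.4312, 0.1459, -1.0492])

    def lifting_sym4(x):
        """One DWT level via the symlet-4 lifting steps (even-length input)."""
        x = np.asarray(x, dtype=float)
        h, f = x[0::2], x[1::2]                    # split even/odd samples
        n = len(h)
        P = h + B[0] * f
        Q = np.zeros(n); R = np.zeros(n); a = np.zeros(n); d = np.zeros(n)
        Q[:n-1] = f[:n-1] + B[1] * P[1:] + B[2] * P[:n-1]   # uses a future P
        R[1:n-1] = P[1:n-1] + B[3] * Q[1:n-1] + B[4] * Q[:n-2]
        a[2:n-1] = Q[2:n-1] + B[5] * R[2:n-1] + B[6] * R[1:n-2]
        d[2:n-2] = R[2:n-2] + B[7] * a[3:n-1]               # uses a future a
        return a, d                                # approximation and detail

The dependence of Q and d on future values is the non-causality discussed in Section 2.2.3; in hardware it is absorbed as a small fixed latency.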
B_0 to B_7 in Fig. 2.3 and Table 2.1 are the coefficients of the predict and update filter steps for the factorization of the symlet-4 filters.

Table 2.1: Coefficients of the symlet-4 lifting factorization.

Coeff.   Value      Coeff.   Value
B0        0.3911    B4        0.162
B1       -0.1243    B5        0.4312
B2       -0.3392    B6        0.1459
B3       -1.4195    B7       -1.0492

2.2.2 B-Spline Factorization

It has been shown [23, 29] that wavelet filters can be decomposed as

H(z) = (1 - z^{-1})^M Q(z) \cdot h_o   (2.9)

L(z) = (1 + z^{-1})^M R(z) \cdot l_o

where (1 \pm z^{-1})^M is called the B-spline factor and Q(z) and R(z) are the distributed factors. Application of this decomposition to the symlet-4 filters in (2.1) results in the H(z) and L(z) given by (2.10),

L(z) = (1 + z^{-1})^4 (1 + A_0 z^{-1} + A_1 z^{-2} + A_2 z^{-3}) \cdot C_1   (2.10)

H(z) = (1 - z^{-1})^4 (1 + A_3 z^{-1} + A_4 z^{-2} + A_5 z^{-3}) \cdot C_2

where C_1 = -0.076 and C_2 = -0.032 are the gain constants. The coefficients A_0 to A_5 resulting from the B-spline factorization of the symlet-4 filters are listed in Table 2.2. Expanding the B-spline polynomials in (2.10) results in (2.11),

(1 \pm z^{-1})^4 = (1 + 6z^{-2} + z^{-4}) \pm (4z^{-1} + 4z^{-3})   (2.11)

which allows all the multiplications required in the B-spline portion of (2.10) to be replaced by less computationally demanding shift and add operations. The polyphase decomposition can therefore be performed on the distributed parts R(z) and Q(z) [29]. This is achieved by splitting the distributed parts into even and odd components R_e(z) and R_o(z), Q_e(z) and Q_o(z), respectively. For example, the even part of the low-pass distributed factor can be represented as R_e(z) = 1 + A_1 z^{-2} + 0 \cdot z^{-4}, and likewise for the remaining components. Accordingly, the computation gain in the B-spline method is a reduction in the number of floating point multiplications at the expense of more additions [23].

Table 2.2: Coefficients of the B-spline factorization.

L(z):  A0 = -3.607   A1 = 1.867   A2 = -0.425
H(z):  A3 = 4.391    A4 = 8.485   A5 = 2.339

Fig. 2.4 is a pictorial description of the DWT implemented using the B-spline optimization of the symlet-4 filters. Note that Fig. 2.4 contains two distinct portions. The first portion (left) splits the input stream and implements the B-spline portion of (2.10), where the multiplications are implemented using shifters and adders. The model for these shift-and-add multiplications is shown in a blowup box in Fig. 2.4. The second portion (right) implements the distributed factors, where multiplications require a hardware multiplier.

Fig. 2.4: B-spline architecture for computing the symlet-4 based DWT.

2.2.3 Algorithmic Comparison

According to Fig. 2.2, the downsampling operation follows filtering, but this discards half of the samples just calculated and results in wasted computation energy. Lifting and B-spline are both motivated by the idea of moving the downsampling stage before the filtering stage, so as not to compute results that would eventually be thrown away. The resulting filter-downsample operation allows polyphase decomposition [25] to be used, which in this case involves splitting the input signal into even and odd samples and filtering them with even and odd filter taps, respectively, to minimize hardware resources. This approach results in almost half the number of calculations compared to standard convolution-based filtering. Both lifting and B-spline use polyphase decomposition.
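The factorization can be checked numerically. The sketch below expands (2.10) with the tabulated A and C values and recovers the direct-form symlet-4 filters of (2.7) to within the rounding of the three-digit constants.

    import numpy as np

    A = [-3.607, 1.867, -0.425, 4.391, 8.485, 2.339]   # Table 2.2
    C1, C2 = -0.076, -0.032

    u_plus = np.poly1d([1, 1]) ** 4                    # (1 + u)^4, with u = z^-1
    u_minus = np.poly1d([-1, 1]) ** 4                  # (1 - u)^4
    R = np.poly1d([A[2], A[1], A[0], 1])               # 1 + A0*u + A1*u^2 + A2*u^3
    Q = np.poly1d([A[5], A[4], A[3], 1])               # 1 + A3*u + A4*u^2 + A5*u^3

    L_taps = C1 * (u_plus * R).coeffs[::-1]            # taps for z^0 ... z^-7
    H_taps = C2 * (u_minus * Q).coeffs[::-1]
    print(np.round(L_taps, 3))   # close to L(z) in (2.7), up to rounding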
Fig. 2.3 illustrates that lifting does not preserve causality, and thus the implementation of some of the resulting update and predict filters introduces latency between the input and the corresponding output of the DWT block. Although this implementation does not produce results strictly in real time, the lag is not critical in neuroprosthetics applications, where the biologically relevant sampling rate is 25 kHz [32]. B-spline does not have any causality issues, and the only delay between an input and its corresponding output is the computation delay.

Table 2.3 compares the computational requirements of the convolution, B-spline and lifting based DWT implementations. The standard convolution based filter requires 16 multiplications and 14 additions. The lifting scheme reduces the computation requirements to only eight multiplications and eight additions. For B-spline factorization, 12 multiplications and 16 additions are required. Out of these 12 multiplications, six can be implemented using shift-and-add operations. Fig. 2.4 and (2.11) show that four of these multiplications require only shift operations to multiply by the binary factor of four. The remaining two multiplications have six as the multiplicand and require two shifts and one addition, as shown in the blowup in Fig. 2.4. As a result, the minimized B-spline implementation requires only six multiplications and 18 additions, as shown in parentheses in Table 2.3. Based on Table 2.3, relative to lifting, B-spline requires two fewer multiplications at the expense of ten more additions for one level of decomposition. Nevertheless, as the detailed low-power/area DWT implementation below will show, any benefit to B-spline is diminished for multilevel, multichannel decomposition. The input data was quantized to 10 bits while the filter coefficients were quantized to 6 bits in sign-magnitude form, based on an analysis of signal integrity; a detailed discussion can be found in the Appendix.

Table 2.3: Comparison of required computation.

              Multiplications   Additions
Convolution   16                14
B-spline      12 (6)            16 (18)
Lifting       8                 8

2.3 Minimized Hardware Implementations

2.3.1 Lifting Implementation

The lifting scheme is a sequence of predict S_n and update T_n filters, as shown in Fig. 2.3. For the symlet-4 basis, the set of lifting filters can be described by

P_0 = h_0 + B_0 f_0
Q_{-1} = f_{-1} + B_1 P_0 + B_2 P_{-1}
R_{-1} = P_{-1} + B_3 Q_{-1} + B_4 Q_{-2}   (2.12)
a_{-1} = Q_{-1} + B_5 R_{-1} + B_6 R_{-2}
d_{-2} = R_{-2} + B_7 a_{-1}

where a and d are the approximation and detail results, and P, Q, and R are intermediate results. B_i are the eight constant filter coefficients, with index i ranging from zero to seven. Each equation represents one filter step, and the values of the filter coefficients are given in Table 2.1. The subscripts of the variables in (2.12) represent the time sample, where 0 represents the current sample, -1 represents the previous sample, and so on [34, 35].

Analysis of the lifting factorization and the resulting equations in (2.12) shows a noticeable regularity in the required computations. All arithmetic operations of the lifting implementation of symlet-4 can be expressed in the general form

W = X + B_i Y + B_j Z   (2.13)

which permits all filter steps to be written as two multiplications and two additions, with one of the coefficients set to zero for the first and last steps in (2.12). This regularity can be exploited to minimize hardware by implementing a standard computation core that executes (2.13), as shown in Fig. 2.5.
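A behavioral sketch of this reuse is shown below: each of the five steps of (2.12) is one pass through the shared W = X + B_i Y + B_j Z datapath, with a small register file holding the previous-sample state (the four saved values per channel/level discussed later in this chapter). The function names and the dictionary register file are illustrative.

    # Lifting coefficients B0..B7 from Table 2.1.
    B = [0.3911, -0.1243, -0.3392, -1.4195, 0.162, 0.4312, 0.1459, -1.0492]

    def compute_core(X, Bi, Y, Bj, Z):
        return X + Bi * Y + Bj * Z                 # the single shared datapath

    def dwt_step(h0, f0, regs):
        """Five CC passes for one new input pair (h0, f0)."""
        P0 = compute_core(h0, B[0], f0, 0.0, 0.0)                  # P_0
        Qm1 = compute_core(regs['f'], B[1], P0, B[2], regs['P'])   # Q_-1
        Rm1 = compute_core(regs['P'], B[3], Qm1, B[4], regs['Q'])  # R_-1
        am1 = compute_core(Qm1, B[5], Rm1, B[6], regs['R'])        # a_-1
        dm2 = compute_core(regs['R'], B[7], am1, 0.0, 0.0)         # d_-2
        regs.update(f=f0, P=P0, Q=Qm1, R=Rm1)      # state for the next pair
        return am1, dm2

    regs = dict(f=0.0, P=0.0, Q=0.0, R=0.0)        # per channel/level state

Each call returns approximation and detail values whose indices lag the inputs, reflecting the three-sample latency noted below.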
This single hardware block of Fig. 2.5 can be used repeatedly to perform all the computational steps in (2.12) sequentially [36, 37]. Sequential reuse of the same hardware reduces the area required by the overall DWT block without impacting performance in this low bandwidth application [36]. The hardware block in Fig. 2.5 will be called the lifting computation core (CCL) in the following discussion.

Fig. 2.5: Computation core for lifting DWT [12].

We have shown that fixed-point integer DWT computation using the symlet-4 wavelet, with filter coefficients truncated to six bits and data values truncated to 10 bits, gives performance comparable to floating-point calculations while significantly reducing computational demand [38]. A customized CCL block has been designed to support fixed-point multiplications and additions. Using this CCL to sequentially execute the steps in (2.12) requires five cycles to compute results for one input sample pair. The critical path for this CCL is Dm + 2Da, where Dm is the delay of the multiplier and Da is the delay of an adder.

Applying lifting to a wavelet basis does not guarantee that the resulting filters are causal. In the case of symlet-4, there are two filters that introduce non-causality into the system. These filters can be identified as S0 and T2 in Fig. 2.3 and are used for the calculation of Q and d in (2.12). Thus, the current calculations depend on the availability of future samples. The corresponding calculations can be delayed to attain causality, resulting in a latency of three samples between input and output.

2.3.2 B-spline Implementation

To implement the DWT using the B-spline factorization in (2.9), equations corresponding to (2.12) need to be obtained. The time dependent equations governing the calculation of the B-spline approximation and detail coefficients of the symlet-4 wavelet filters can be expressed as

J_0 = h_0 + 6h_{-1} + h_{-2}        L_0 = f_0 + 6f_{-1} + f_{-2}
K_0 = 4h_{-1} + 4h_{-2}             M_0 = 4f_0 + 4f_{-1}
N_0 = J_0 + K_0                     P_0 = J_0 - K_0
Q_0 = L_0 + M_0                     R_0 = L_0 - M_0              (2.14)
S_0 = N_0 + A_0 N_{-1}              T_0 = A_1 P_0 + A_2 P_{-1}
U_0 = Q_0 + A_3 Q_{-1}              V_0 = A_4 R_0 + A_5 R_{-1}
a_0 = S_0 + T_0                     d_0 = U_0 + V_0

where a and d are the approximation and detail results and J, K, L, M, N, P, Q, R, S, T, U, and V are intermediate results. The constant coefficients A_0 to A_5 resulting from this factorization are listed in Table 2.2. The subscripts of the variables in (2.14) represent the time samples, where 0 is the current sample, -1 is the previous sample, and so on.

The B-spline equations in (2.14) do not exhibit the regularity of lifting that allowed the general equation (2.13) to be utilized. However, the required B-spline computations can be generalized into two expressions,

W = A_i Y + A_j Z   (2.15)

W = Y + Z   (2.16)

where A_i and A_j are constant coefficients. All of the equations in (2.14) can be implemented using (2.15) or (2.16) in one or multiple cycles. Although the CCL used for the lifting architecture cannot be applied to B-spline, a new computation core suitable for sequential evaluation of the filter steps in (2.14) has been defined to minimize hardware requirements. Fig. 2.6(a) shows the hardware-efficient computation core tailored to the B-spline implementation of (2.15) and (2.16), hereafter called CCB1. Though the B-spline algorithm requires six multiplies and 18 additions, the task can be handled by sequential reuse of two multipliers, an adder and two multiplexers.
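Written out behaviorally, one B-spline computation cycle per (2.14) looks as follows. This is a floating-point sketch of the equation set as reconstructed above, with a dictionary standing in for the state registers the hardware keeps per channel/level.

    A = [-3.607, 1.867, -0.425, 4.391, 8.485, 2.339]   # Table 2.2

    def bspline_step(h0, f0, st):
        """One approximation/detail pair from one new input pair."""
        j0 = h0 + 6 * st['h1'] + st['h2']     # shift-and-add B-spline part
        k0 = 4 * st['h1'] + 4 * st['h2']
        l0 = f0 + 6 * st['f1'] + st['f2']
        m0 = 4 * f0 + 4 * st['f1']
        n0, p0 = j0 + k0, j0 - k0             # sum/difference branches
        q0, r0 = l0 + m0, l0 - m0
        a = (n0 + A[0] * st['N']) + (A[1] * p0 + A[2] * st['P'])
        d = (q0 + A[3] * st['Q']) + (A[4] * r0 + A[5] * st['R'])
        st.update(h2=st['h1'], h1=h0, f2=st['f1'], f1=f0,
                  N=n0, P=p0, Q=q0, R=r0)     # eight saved values
        return a, d

    st = dict(h1=0.0, h2=0.0, f1=0.0, f2=0.0, N=0.0, P=0.0, Q=0.0, R=0.0)

Each assignment maps onto one or two passes through the (2.15)/(2.16) primitives, which is where the 18-cycle count for CCB1 comes from.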
Since most of the equations do not need multipliers, the inputs are fed directly to the adder stage through multiplexers to conserve power and reduce average delay. In the following discussion, the shifters needed to implement the binary integer multiplications are ignored because they do not require significant hardware resources. Using the CCB1 block, 18 cycles are required to execute all of the steps in (2.14) because some of the equations cannot be implemented in a single cycle. The CCB1 implementation relies on the fact that (2.16) is a special case of (2.15) with A_i and A_j assigned appropriate unity values. As a result, some of the steps in (2.14) require two cycles to calculate, and no steps can be executed simultaneously using CCB1. To reduce the calculation delay, another adder can be included, as shown in Fig. 2.6(b), cutting the time required to execute all of the steps in (2.14) to 11 cycles.

Fig. 2.6: (a) Minimum chip area computation core for B-spline, CCB1. (b) Minimum delay computation core for B-spline, CCB2.

The new computation core, CCB2, allows several computations to be performed in parallel. The delay reduction permitted by CCB2 comes at the cost of an increase in chip area and a more complex controller to handle the parallelism in this multiple-input multiple-output block. Additional performance comparisons of CCB1 and CCB2 are discussed later in the chapter. The critical path for both of the B-spline CCs is Dm + Dx + Da, where Dm and Da are defined above and Dx is the delay due to the multiplexer.

2.4 Sequential vs. Pipelined Implementations

In a first-order analysis, the area of a CMOS integrated circuit is proportional to the number of transistors required, and power consumption is proportional to the product of the number of transistors and the clocking frequency. Through transistor-level custom circuit design, circuit area and power consumption can be further reduced, with significant improvement in efficiency over field-programmable gate array (FPGA) or standard cell ASIC implementations.

Parallel execution of the DWT filter steps using a pipelined implementation is known to provide efficient hardware utilization and fast computation. In fact, a vast majority of the reported hardware implementations for lifting-based DWT rely on pipeline structures [22, 30, 39]. However, these circuits target image and video applications where speed has the highest priority and the wavelet basis is chosen to optimize signal representation. A different approach is required to meet the power and area constraints imposed by implantability requirements, the low bandwidth of neural signals, and the type of signals observed. Two different integer lifting DWT implementations, a pipeline approach and a sequential scheme, have been optimized and compared for the symlet-4 factorization. Furthermore, the hardware requirements for lifting DWT have been compared to a B-spline implementation to verify the advantage of lifting in the application at hand.

2.4.1 Pipelined Design for Lifting DWT

The integer DWT filter equations in (2.12) can be implemented simultaneously in a pipeline structure that permits real time, continuous signal processing to take place. Fig. 2.7 illustrates a pipeline structure designed around the customized three-term computation core of Fig. 2.5.
Fig. 2.7: Pipeline structure for the integer-lifting wavelet transform, with data notations matching the filter equations in (2.12) at a single point in time. Boxes marked with 'z' represent pipeline stages and registers that need to be stored in memory for multi-channel, multilevel processing.

The output of each of the five filter stages is held by a pipeline register, and other registers provide the necessary delays. By clocking all of the registers out of phase from the CC blocks, continuous operation is achieved. The computation latency is seven cycles, due to the five pipeline stages and the two delay cycles built into (2.12). The temporal latency for the detail d and approximation a results is 14 samples, because each computation cycle operates on a pair of data samples. The overall pipelined computational node consists of five CC blocks, 15 10-bit registers, and an 8 x 6-bit coefficient ROM. An additional delay stage could be added at the d output to synchronize the latency of the detail and approximation outputs.

2.4.2 Sequential Design for Lifting DWT

The pipeline structure achieves fast integer DWT processing via a large hardware overhead, and is thus well suited for low-power, single-channel neural signal processing. However, as discussed below, scaling the pipeline to multiple data channels and/or multiple decomposition levels begins to break down the efficiency of the pipeline structure. An alternative approach is to process each of the filter steps (or pipeline stages) sequentially, using a single CC block and fewer memory registers. This approach takes advantage of the low bandwidth of neural signals, which permits the CC to be clocked much faster than the input data sampling frequency (typically in the range of 25-40 kHz). Sequential processing of the integer DWT filter steps can be achieved when each stage depends only on data from previous cycles or on same-cycle outputs generated in a preceding step. The simplicity of the data dependencies relative to the pipeline structure can be observed from Fig. 2.8, which illustrates the sequential structure in a format comparable to the pipeline.

Fig. 2.8: Sequential structure over five operation cycles. Boxes marked with 'z' represent registers that need to be stored in memory for multi-channel, multi-level processing.

Here, each section of the circuit represents a temporal phase rather than a physical stage. An important observation is that significantly fewer registers are needed, because the inputs of subsequent phases rely largely on preceding outputs from the same computation cycle. It can be shown that the overall sequential DWT circuit can be efficiently implemented with six 10-bit registers to manage data flow between computation cycles, a single CC block, an 8 x 6-bit coefficient ROM, and a simple control block to direct data from memory to the appropriate CC input during each phase of operation. Sequential execution has a computation latency of two cycles, and the temporal latency for the detail and approximation results is four samples.

2.4.3 Pipeline vs. Sequential Lifting DWT

As stated above, the sequential approach requires only one CC unit and six 10-bit memory registers, compared to five CC units and 15 registers for the pipeline circuit. The sequential design does, however, require additional multiplexers and control logic to redirect data and coefficients to the CC inputs, which are not necessary in the inherently hardware-efficient pipeline design.
This added circuitry will make the critical path of the sequential circuit longer than that of the pipeline structure. Furthermore, to maintain the same throughput, the sequential design must be operated at five times the clock rate of the pipeline. Because data is processed in a real-time streaming mode, neither approach requires a large input data buffer.

Both architectures have been thoroughly analyzed to determine which approach is best suited to the power and area requirements of an implantable neural signal processor. To first validate that both approaches can achieve the application speed requirements, a custom computation core was implemented in CMOS; analog simulations show the critical path delay is 6.5 ns in 0.5 μm technology. Thus, approximately 6000 computation cycles could be performed within one period of a nominal 25 kHz sampling frequency for neural signals. Even if an equally large memory access delay is included, 3000 computations can be completed within a single sampling period. This indicates that speed is not a critical design constraint and that circuit optimization can focus on chip area and power consumption.

Using custom design techniques, the chip area A required to implement both approaches will be roughly proportional to the number of transistors in the circuit,

A = T \sum_i N_i   (2.17)

where T is the average area per transistor and N_i is the number of transistors in the i-th circuit block. Empirical observation of several custom circuit layouts shows that a single value of T reasonably approximates all of the integer DWT blocks, especially for comparing two similar circuits. Although absolute power consumption is inherently difficult to estimate, for the purpose of comparing the two similar design alternatives, dynamic power can be determined from

P = k \cdot V_{DD}^2 \cdot f_s \cdot \sum_i N_i s_i   (2.18)

where V_DD is the supply voltage and f_s is the data sampling frequency (nominally 25 kHz). The parameter k accounts for the average output load capacitance, the average number of transistors operating per output, and the average output transitions per clock cycle. This parameter is a function of both fabrication process and circuit topology; it has been derived empirically and set to 3 fF and 0.75 fF for 0.5 μm and 0.13 μm technology, respectively. The variable s_i is the clock rate scaling factor relative to f_s for each block i, such that the clocking frequency of each circuit block is s_i x f_s. For example, in the pipeline configuration, the computation core is clocked only every other cycle, i.e., at 0.5 f_s, so that the first of the pair of samples to be processed can be acquired in the idle cycle. Correspondingly, because the sequential configuration must be clocked at five times the rate of the pipeline, it has an average clocking rate of 2.5 f_s. In the pipeline approach, all of the blocks are clocked at the same frequency, except the coefficient memory, which is static in both designs. Since one of the multipliers is idle during two of the five stages, we estimate the sequential CC clock-scaling factor to be 2. Similarly, in the sequential controller, most of the circuits are clocked at 2.5 f_s while others are clocked at 0.5 f_s, so we estimate the clock-scaling factor to be 2 as well. Table 2.4 lists the total number of transistors in each approach, along with the area and power estimated from (2.17) and (2.18) for both 0.5 μm and 0.13 μm technology.
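The two models are simple enough to tabulate in a few lines. The sketch below encodes (2.17) and (2.18); the per-block transistor split and the supply voltage are illustrative assumptions (only the 7,931-transistor total and the clock-scaling factors come from the text), so the printed power will match Table 2.4 only approximately.

    def area_mm2(counts, T_um2):
        """Eq. (2.17): total area as transistor count times average area each."""
        return T_um2 * sum(counts.values()) / 1e6

    def power_uW(counts, s, k_fF, vdd, fs=25e3):
        """Eq. (2.18): dynamic power from per-block counts and clock scaling."""
        tot = sum(counts[b] * s[b] for b in counts)
        return k_fF * 1e-15 * vdd**2 * fs * tot * 1e6

    seq = {'cc': 2000, 'ctrl': 3000, 'mem': 2931}   # assumed split of 7,931 Tx
    s   = {'cc': 2.0,  'ctrl': 2.0,  'mem': 0.5}    # scaling factors per the text
    print(area_mm2(seq, T_um2=80))                  # -> 0.634 mm^2 in 0.5 um
    print(power_uW(seq, s, k_fF=3.0, vdd=5.0))      # assumes VDD = 5 V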
As expected, the pipeline computation unit requires nearly three times the area of the sequential approach; it would occupy about 21% of the chip area on a 3 x 3 mm chip in 0.5 μm technology, or 5% of a 1.5 x 1.5 mm chip in a 0.13 μm process. The power model predicts that the sequential approach will consume only 23% more power than the pipeline. The larger power consumption of the sequential approach can be attributed to its requirement for a more complex controller and the need to move more data around within the single computation core. Overall, these results show a tradeoff between area and power consumption between the two approaches. Conservative values of average area per transistor (80 μm² in 0.5 μm technology) were selected to estimate the required chip real estate.

Table 2.4: Characteristics of single-level, single-channel integer DWT hardware for pipeline and sequential configurations at two technology nodes.

                               0.5 μm technology         0.13 μm technology
Architecture  # transistors   area (mm²)  power (μW)   area (mm²)  power (μW)
pipeline      23,336          1.867       15.10        0.117       0.71
sequential    7,931           0.634       18.55        0.040       0.87

If the area-power product is taken as the figure of merit, then the pipelined design is about 2.5 times more expensive than the sequential design. Moreover, the above analysis does not include static power consumption, but since static power is directly proportional to the number of transistors, it is expected to be higher for the pipelined design.

2.5 Multi-Channel / Multi-Level Sequential DWT

2.5.1 Single Channel / Single Level Architecture

The computation cores described above require supporting memory and a controller to implement the full DWT. The collection of memory, controller and computation core blocks into a larger module that is capable of independently calculating DWT results will be referred to as a computational node (CN). The CN requires three different memory modules to store filter coefficients, intermediate results for sequential CC calculations, and intermediate channel/level results, as described below. It has been shown previously that quantizing the data to 10 bits and the filter coefficients to 6 bits maintains a high signal to noise ratio and does not distort the signal of interest when lifting DWT is applied. For proper comparison, the same quantization is assumed for B-spline.

1. Computation Core Memory. Due to the prescribed sequential reuse of the CC hardware, temporary memory is required to store intermediate calculation results for proper evaluation of (2.12) and (2.14). These 'computation core memory' registers are reused for every cycle of the CC. A careful analysis shows that, for lifting-based DWT, a total of seven 10-bit registers are required while the CC is processing one sample. Both CC implementations for B-spline need twelve 10-bit registers to hold temporary values during sequential execution.

2. Coefficient Memory. A separate 'coefficient memory' block is required to store the filter coefficients listed in Table 2.1 and Table 2.2. For lifting, the coefficient memory block size is eight 6-bit registers, while for B-spline only six registers are required. The coefficient memory data is constant and can be implemented as ROM.

2.5.2 Multi Channel / Multi Level Architecture

Most neuroprosthetics applications require data from more than one nerve or from spatially scattered locations. As a result, multiple data streams need to be compressed simultaneously in real time using the DWT and transmitted to extra-cutaneous processing units.
Because of area constraints in surgically implanted neuroprosthetics, it is desirable to process all data channels within a single interface chip. Furthermore, increasing the number of decomposition levels inherently improves the compression ratio of the DWT [16, 25], which is vital for reducing the power required to transmit data from the implanted device. Therefore, the design of a computational node that can compute multi-level DWT for multiple channels pseudo-simultaneously, i.e., within a single sampling period, is necessary. Recall that the relatively long period between neurologically relevant data samples motivated the design of a sequentially cycled single-stage computation core. Analysis of preliminary DWT implementations [36] indicates that there is sufficient bandwidth between samples to process many data channels sequentially with a single computation core, permitting significant savings in chip area.

1. Channel/Level Memory. Having adopted a pseudo-simultaneous approach for processing multiple input channels, notice that, for each data sample, the CC finishes calculating results for a particular channel and then proceeds to calculate results for the next channel. However, the intermediate values, or state, of the current channel need to be stored in memory so they are available when the CC returns to this channel for the next data sample. Similarly, when the CC processes a lower level of decomposition, the current state of the CC needs to be stored. This 'channel/level memory' is critical to restoring the state of computation when the CC switches between different channels and levels. Equations (2.12) and (2.14) define which values need to be stored and made available to process future samples. For each level and channel (beyond level = 1 and channel = 1), the lifting architecture requires four 10-bit values, f_0, P_0, R_0 and Q_0, to be saved in CN memory. Similarly, for B-spline there are a total of eight 10-bit values, R_0, Q_0, P_0, N_0, f_{-1}, f_0, h_{-1} and h_0, that need to be stored in channel/level memory. Note that these values are set by the algorithm and thus do not change with either of the B-spline CCs presented above. Compared to lifting, B-spline requires four more 10-bit registers for each additional level or channel. Although the number of levels would typically remain small (<5), modern neural recording arrays continue to push the boundary on available channels, and this difference in memory requirements becomes a key factor for large numbers of channels. Other memory blocks, such as the pairing memory and the input buffer, pose equal hardware requirements on both algorithms and are discussed in detail in the next chapter.

2. DWT Computational Node. Arranging all the above-mentioned blocks, the CC, the controller, and the memories, the general computation unit architecture of Fig. 2.9 is obtained. The size of each block varies with the algorithm, lifting or B-spline, as noted in the discussions above. Based on this design, the next section provides a thorough comparison of the VLSI implementations of the two algorithms.

Fig. 2.9: The general computation unit architecture for sequential DWT.

2.6 Comparative Analysis for Multi Channel/Level DWT

2.6.1 Multi Channel / Multi Level Lifting vs. B-spline

For the sake of comparison, the coefficients and data for both implementations have been quantized with the same precision, i.e., 6 bits and 10 bits, respectively, with negligible loss in signal fidelity.
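To make the memory axis of this comparison concrete, the helper below tallies channel/level memory for the two factorizations (four saved 10-bit values per channel/level for lifting versus eight for B-spline, as established in Section 2.5.2).

    def chn_lvl_memory_bits(channels, levels, regs_per_state, width=10):
        return channels * levels * regs_per_state * width

    for name, regs in (('lifting', 4), ('B-spline', 8)):
        bits = chn_lvl_memory_bits(32, 4, regs)
        print(f"{name:9s}: {bits} bits for 32 channels x 4 levels")
    # lifting : 5120 bits; B-spline : 10240 bits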
Table 2.5 summarizes the hardware and timing characteristics of both architectures when implemented with the CC modules described above. Multipliers, adders, and memory blocks were fully custom-designed in 0.18 μm CMOS and used in all implementations. The multipliers were optimized for 10 x 6 bits to minimize transistor count and reduce delay compared to a standard square multiplier. The B-spline shifters are listed in Table 2.5, though they require a negligible amount of hardware and contribute very little delay.

To compare critical path delays, notice that, for different adder/multiplier implementations, the relative delay of multipliers, Dm, and adders, Da, can vary significantly. To evaluate the impact of this ratio on computational delays, we defined Dm ≈ α·Da and observed the delays in units of Da as a function of α. Defining the total critical path delay as the CC critical path delay times the required number of CC cycles per sample, and assuming that the delay due to the multiplexers, Dx, in B-spline is negligible, B-spline 2 always has less total delay than B-spline 1, indicating that CCB2 would be preferred where delay poses a critical constraint. With α = 3, as it is for our custom designed adder and multiplier, the lifting critical path is only slightly worse than that of B-spline 2. The computational load per sample listed in Table 2.5 is the measure of computational resources used to calculate results for a single sample and thus directly affects the power dissipation.

Table 2.5: Hardware comparison of lifting and B-spline architectures.

                               Lifting CCL   B-spline CCB1    B-spline CCB2
Multipliers                    2             2                2
Adders                         2             1                2
Multiplexers                   0             2                2
Shifters                       0             4                4
Latency (samples)              3             0                0
Cycles per Sample              5             18               11
Coeff. Mem. (6-bit)            8             6                6
CC Mem. (10-bit)               7             12               12
Per Chn/Lvl Mem.               4             8                8
Critical Path Delay            Dm+2Da        Dm+Dx+Da         Dm+Dx+Da
Computational Load per Sample  5Dm+10Da      4Dm+18Dx+18Da    4Dm+11Dx+18Da
No. of Tx for CC               1786          1650             1810

The lifting approach results in a highly regular set of equations, so the corresponding CC has a high utilization factor. The equations governing B-spline do not follow a regular pattern, which results in a minimum-area CCB1 that does not optimally use the capabilities of the hardware. As a result, CCB1 requires a high number of cycles to process a set of samples. CCB2, on the other hand, is designed to parallelize the computation and improve the hardware utilization, and thus relatively fewer cycles are required for the calculations.

In the lifting architecture, the inputs h and f lead to P_{-1}, which is required for the calculation of Q_{-2}, which is subsequently required for a_{-2}, which finally produces d_{-3}. Thus, we cannot compute d_{-3} until we have received h and f. Hence, as discussed above, there is a latency of three samples between an input and its corresponding output. B-spline has no future-sample dependence, so there is no latency between input and output samples. The cost, however, is that B-spline requires 18 or 11 cycles, depending upon the type of CC used, to process one sample, as compared to the 5 cycles taken by lifting. This translates to a higher clocking rate for B-spline. Note also that most of the transistors in the lifting architecture are active all the time; however, since B-spline does not use the hardware very efficiently, a large fraction of the transistors in B-spline remain inactive most of the time, dissipating undesirable static power.
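These cycle counts and critical paths can be combined into a first-order, per-sample delay model, sketched below in units of Da with Dm = α·Da and the multiplexer delay neglected; the figures come straight from Table 2.5.

    def per_sample_delay_in_Da(alpha):
        lifting  = 5 * (alpha + 2)     # 5 cycles x (Dm + 2Da)
        bspline1 = 18 * (alpha + 1)    # 18 cycles x (Dm + Dx + Da), Dx ~ 0
        bspline2 = 11 * (alpha + 1)    # 11 cycles x (Dm + Dx + Da)
        return lifting, bspline1, bspline2

    for alpha in (1, 3, 8):
        print(alpha, per_sample_delay_in_Da(alpha))
    # at alpha = 3: lifting 25, B-spline 1 72, B-spline 2 44 (units of Da)

Per cycle, both B-spline cores have the shorter path (α + 1 versus α + 2 for lifting); per sample, lifting's five cycles keep its total delay the lowest for any α under these assumptions. Either way, the delays are far below the sampling-period budget established earlier.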
By counting the number of transistors active over the different calculation cycles for a single sample, the power requirements can be modeled by

P = K \cdot V_{DD}^2 \cdot f_s \cdot \sum_j \sum_i N_i S_{i,j}   (2.19)

where the first summation, over j, covers the total number of cycles (5, 18 and 11 for lifting, B-spline 1 and B-spline 2, respectively) and i refers to the circuit blocks, i.e., the computation core, the controller, and the different memory blocks. N_i is the total number of transistors in the i-th block and S_{i,j} is the fraction of transistors active in the i-th block during the j-th cycle. K is a constant that depends on IC fabrication process parameters. Note that the clock frequency is different for the three implementations because of the different number of required cycles per sample.

Based on (2.19), and assuming the relative area is directly proportional to transistor count, Fig. 2.10 compares the relative area and power consumption of the B-spline and lifting designs for an increasing number of channels and decomposition levels. The plots show that lifting requires smaller area and consumes less power than both implementations of B-spline. CCB1 for B-spline requires one less adder than lifting and thus has fewer transistors. However, in a multi-channel, multi-level implementation the number of transistors for the entire computational node, including memory and controller, increases much more rapidly with B-spline than with lifting, thus offsetting any advantage gained by a smaller CC. The B-spline implementation with CCB2 requires less power than CCB1, mainly because it requires a lower clocking frequency. The lifting implementation shows increasingly superior performance as the number of channels and levels is scaled up, due in large part to its lower channel/level memory requirements compared to either B-spline implementation.

Fig. 2.10: Comparison of relative area and power consumption vs. number of levels and channels for the B-spline, B-spline 2 and lifting architectures.

For a single-channel, single-level DWT, no channel/level memory is required. As channels or levels increase, the difference between the two B-spline CC implementations remains constant; however, CCB2 retains the advantages of a smaller critical path delay and lower power consumption. For every increase in the number of channels or levels, lifting requires significantly fewer transistors than B-spline, making it preferable for multi-channel applications with power and area constraints.

Multiplication-Free Lifting: The CC unit proposed for lifting DWT uses two multipliers, so that the calculations required per sample are eight multiplications and eight additions that can be completed in five cycles. It is noteworthy that a general purpose lifting approach based on only shifts and additions was proposed in [21]. For the sake of completeness, we compared the demands of a CC unit with multipliers to a CC unit without any multiplier, i.e., composed of only a shifter and an adder. The latter approach resulted in 12 shift operations and 21 add operations, and required 21 cycles per sample. This is because the equations required to compute multiplication-free lifting DWT do not show any regular structure such as that in (2.13). Therefore, substituting another adder and shifter in the data path did not help in reducing the number of cycles required to complete the computation. With respect to area demands, we found that for one sample pair, a CC unit without a multiplier requires 52% less area than a CC with multipliers. This obviously translates into large savings in chip area.
However, these savings are not substantial when the system is scaled up. For example, a 32-channel/4-level DWT system using a CC with multipliers would occupy 6.5% of the total chip area, as opposed to 3.3% using a CC without multipliers; the overall savings in chip area are thus only 3.2%. In contrast, the CC without multipliers requires 13.3% more power than the CC with multipliers for this specification. We therefore concluded that the reduction in area offered by a shift-and-add strategy in the lifting approach is overshadowed by the increase in power dissipation when multichannel/multilevel decomposition is sought.

2.6.2 Multi Channel / Multi Level Sequential vs. Pipeline

Both the pipeline and sequential architectures can be scaled to multiple channels and/or levels by reusing the computational node hardware and increasing the clocking frequency to complete all computations within the input sample period. In both approaches, registers within the computational node hold the data necessary for the next cycle's calculation. To sequentially reuse the computational node, some register values for a specific channel/level must be saved so that they are available when that channel/level is next processed in a future cycle. Fig. 2.11 shows the multichannel, multilevel implementations of the pipeline and sequential configurations. For multiple channels/levels, the need to copy the entire set of pipeline registers to memory effectively negates one of the primary advantages of the pipeline over the sequential approach. The sequential processing circuit, on the other hand, is inherently designed to swap new data in and out each clock cycle. To quantitatively compare these two approaches, circuit models have been developed that describe the power and area of each option as a function of the number of channels and the number of decomposition levels. The following models assume the hardware (including control logic) has been scaled to manage multiple channels and levels, though they remain valid for single channel, single level implementations.

Fig. 2.11: Comparison of multichannel/multilevel pipeline and sequential DWT approaches: relative chip area and relative power consumption versus number of levels and channels.

The relative area for the pipeline and sequential architectures as a function of levels and channels is shown in Fig. 2.11. These results demonstrate that the pipeline requires significantly more chip area than the sequential approach, and its area needs grow faster with larger numbers of channels and levels. This is due primarily to the relatively large number of registers that must be stored per channel or level (11 for the pipeline compared to 4 for the sequential design). Fig. 2.11 also shows the relative power consumption for the two approaches. The linear increase in power per channel is slightly higher with the sequential design than with the pipeline. Although there is a sharp jump in power from L = 1 to L = 2, further increases in levels require less and less additional power as the usage rate approaches one. The most important observation from Fig. 2.11 is that the power consumption of the two implementations is nearly the same, but the sequential design requires significantly less chip area. Due to size and power constraints in implantable systems, an important figure of merit is the relative area-power product, which is plotted in Fig. 2.12 versus both levels and channels.
Fig. 2.12 illustrates that the sequential approach is increasingly preferable as the number of channels or the number of decomposition levels increases. The only significant benefits of the pipeline within the enforced design constraints are that it can be clocked at a higher rate and that it takes fewer clock cycles to complete a computation.

Fig. 2.12: Power-area product versus level and channel for pipeline and sequential approaches.

Both of these factors give the pipeline a higher limit on the maximum number of channels that can be simultaneously processed. However, based on the parameters defined above, the sequential execution architecture has an estimated maximum of around 500 data channels (at L = 1). Given the chip area limitations, the area-efficient sequential approach is best suited for this application. In an example implementation with 32 channels and 4 levels of decomposition, the models predict that the sequential approach will require 0.692 mm² and 50.1 μW in 0.13 μm CMOS, indicating the feasibility of performing front-end signal processing within the constraints of an implanted device.

Another interesting result of this study is the comparison of the area required by the computational node circuitry versus the area required by the memory that holds the register values needed for multichannel/multilevel operation. Fig. 2.13 illustrates this result for both the sequential and pipeline configurations as a function of channels at L = 4. Increasing the number of levels or channels does not increase the area taken by the computation blocks; however, the memory requirements scale linearly. With 10-bit data resolution, at L = 4 and C = 32, the pipeline requires over 14,000 bits of SRAM, while the sequential circuit requires only about 5,000 bits. Reducing memory requirements becomes increasingly important in multichannel applications, again highlighting the advantage of the sequential approach. As illustrated in Fig. 2.13, the memory required to store intermediate calculation values will dominate circuit area in multichannel implementations. Careful analysis of an optimized sequential B-spline implementation [24] has shown that eight memory registers are required per channel/level, compared to four for sequential lifting and 11 for pipeline lifting. Based on this information and the comparisons above, B-spline has a slight advantage over pipeline lifting but incurs a significant penalty relative to sequential lifting in terms of area. Furthermore, the sequential lifting implementation requires only about 25% of the dynamic power of sequential B-spline, primarily because B-spline takes 18 cycles to execute sequentially compared to 5 cycles for lifting [24]. The advantage of sequential lifting becomes even more profound when static power is considered, especially in deep submicron technologies.

Fig. 2.13: Relative area versus channels of data memory compared to all other blocks for sequential and pipeline designs, at L = 4.

2.7 VLSI Architecture for Multi-Channel DWT

To utilize DWT-based compression with modern neuroprosthetics devices, a multi-channel, multi-level implementation is necessary, because microelectrode arrays sample multiple data channels simultaneously and multiple decomposition levels improve signal reproduction accuracy. Thus, area and power efficient hardware that can perform multi-channel, multi-level DWT in real time is highly desirable.
In contrast to traditional DWT applications, neuroprosthetics can afford long computation intervals, up to 40 μs [17], permitting the hardware to prioritize power and area efficiency over speed. From the hardware point of view, the integer-lifting and B-spline DWT factorization schemes have very efficient implementations. The lifting approach to the DWT reduces the required arithmetic operations and, as an in-place implementation, requires less memory, at the expense of a longer critical path [22, 23]. The B-spline factorization reduces the critical path delay by converting some of the required multiply operations into less computationally intensive shift-and-add operations [23]. With few exceptions, recent efforts to optimize DWT hardware have concentrated on increasing throughput at the expense of area and power [22, 23]. In contrast, our prior work has identified an optimal implementation of the DWT architecture where chip area and power have priority over speed [24]. The approach relies on a hardware-efficient integer lifting factorization scheme and the 'symlet' family of wavelet bases, which has been shown to provide near-optimal compression of neural signals [17]. This chapter describes the design of a highly area and power efficient VLSI circuit that implements multi-channel, multi-level lifting-based DWT in real time.

2.7.1 Architecture Overview

Fig. 2.14 shows the data flow diagram of lifting-based DWT using the symlet-4 basis. If h and f are two sequential input samples from the same channel, a and d are the approximation and detail results, and P, Q, and R are intermediate results, the five filter steps in this flow can be expressed as

P_0 = h_0 + B_0 f_0
Q_{-1} = f_{-1} + B_1 P_0 + B_2 P_{-1}
R_{-1} = P_{-1} + B_3 Q_{-1} + B_4 Q_{-2}   (2.20)
a_{-1} = Q_{-1} + B_5 R_{-1} + B_6 R_{-2}
d_{-2} = R_{-2} + B_7 a_{-1}

where B_0-B_7 are the eight constant filter coefficients and the subscripts of the other variables represent the time sample, 0 being the current sample, -1 the previous sample, and so on. To ease hardware requirements, it has been shown that integer lifting DWT with quantized data and filter coefficient values maintains a high signal to noise ratio. Based on this analysis, we have chosen 10-bit data and 6-bit coefficients (including the sign bit) for our hardware implementation.

Our previous work evaluated two structurally different hardware approaches, namely pipeline and sequential, for suitability in implantable devices based on their resource demands [35]. This system-level analysis found that, for the single channel case, the pipeline architecture consumes less power at the expense of considerably larger area than the sequential architecture, resulting in a higher area-power product. The difference in area-power product increases with the number of channels and levels, making the sequential architecture the better choice of the two, especially for higher numbers of channels.

Fig. 2.14: Data flow diagram for lifting DWT using the symlet-4 basis.

Fig. 2.15 describes the DWT architecture resulting from our circuit-level design efforts. It is composed of a customized computation core (CC), a digital controller, and five memory modules for incoming data, filter coefficients, intermediate CC products, and the intermediate values necessary to sequentially process multiple levels and channels.

Fig. 2.15: Complete system diagram for sequential calculation of DWT.
Sequential data samples (inputs h and f) from a given channel are processed in pairs to generate the approximation and detail DWT coefficients (outputs a and d). To achieve the system-level goals of power and area efficiency, each of these architectural blocks has been customized at the circuit level to minimize transistor count and eliminate unnecessary output transitions.

2.7.2 Computation Core

Analysis of the 'symlet-4' lifting factorization and the resulting equations in (2.20) shows a noticeable computational regularity; all arithmetic operations can be expressed in the general form W = X + B_i Y + B_j Z. This regularity can be exploited to minimize hardware by implementing a single computation core (CC) that sequentially executes all the computational steps in (2.20). Sequential reuse of the same hardware significantly reduces chip real estate requirements without impacting performance in this low bandwidth application.

The merits of using two's complement arithmetic versus sign-magnitude arithmetic were considered. The 10-bit input data is received in sign-magnitude form, as are the constant 6-bit filter coefficients. The CC multipliers are required to perform 10 x 6 operations. Methods for multiplying sign-magnitude numbers do not work with two's complement numbers without adaptation. We implemented several combinations of adders and multipliers using sign-magnitude and two's complement representations and found that it is most efficient to handle multiplication in sign-magnitude form while additions are performed in two's complement.

As shown in Fig. 2.16, the computation core can be implemented using two multipliers and a three-term adder formed by cascading two two-term adders. Using this CC to sequentially execute the steps in (2.20) requires five cycles to compute results for one sample pair. The critical path for this CC is Dm + 2Da, where Dm and Da are the delays of the multiplier and the adder, respectively. Overflows that can occur during these computations are handled by replacing the results with the appropriate positive or negative saturation values.

Fig. 2.16: Computation core architecture for integer lifting DWT.

1. Adders. Ripple carry, carry save and carry look-ahead adders were analyzed for this DWT implementation. Because delay does not pose a bottleneck, and area and power are much more important, the lower transistor count of the ripple carry adder was favored. Ripple carry adder performance depends largely upon the structure of its individual full-adder cells, whose properties can vary widely to match different applications. A comparative analysis of several full adder structures was performed at the circuit level, taking into account transistor count, activity factor, and dynamic power due to parasitics. Based on these results, the pass gate logic 1-bit full adder presented in [40] and shown in Fig. 2.17 was chosen as best suited to our DWT application. Unlike some adder cells with slightly lower transistor counts, this cell has no internal direct path between power and ground, eliminating short circuit current during transistor switching. Furthermore, the use of transmission gates limits the output voltage swing and reduces dynamic power consumption without compromising overall speed or reliability. In 0.18 μm CMOS, this 16-transistor cell was designed to dissipate 0.4 μW of power and occupy 41.5 μm². It was utilized to construct the 10-bit adders for the three-term adder block and the 13-bit adders for the 10 x 6 multipliers.

Fig. 2.17: Single bit full adder cell based on pass transistor logic [8].
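The arithmetic convention just described is easy to mock up in software. The sketch below is illustrative only (the helper names are assumptions): signs and magnitudes are multiplied separately, sums are formed as ordinary two's complement integers, and overflow is replaced by the saturation values mentioned above.

    DATA_BITS = 10
    MAX10 = (1 << (DATA_BITS - 1)) - 1     # +511
    MIN10 = -(1 << (DATA_BITS - 1))        # -512

    def sm_multiply(a_sign, a_mag, b_sign, b_mag):
        """Sign-magnitude multiply: XOR the signs, multiply the magnitudes."""
        return a_sign ^ b_sign, a_mag * b_mag

    def saturating_add(x, y):
        """Two's complement add with saturation instead of wrap-around."""
        return max(MIN10, min(MAX10, x + y))

    print(saturating_add(500, 100))        # -> 511 (positive saturation)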
2. Multipliers. The CC requires two multipliers to perform 10 x 6 operations. Both Booth and array multiplier structures were optimized to the specific bit resolutions of our DWT application and compared for area and power. The Booth multiplier consists of recoders, partial product generators, 3:2 and 2:1 compressors, and a final adder. The recoders and partial product generators were optimized for low power [41]. Two different recoders and partial product generators were considered, both requiring 18 transistors each. The two partial product generator architectures, NPR3a and NPR3b, presented in [41], were tailored to this application, and it was determined that, while power consumption is similar, NPR3b needs slightly fewer transistors (14 vs. 16) and is preferred. The 3:2 and 2:1 compressors are full and half adders, respectively. In contrast, the array multiplier ANDs every bit of the multiplier with every bit of the multiplicand to create partial products. A Wallace tree structure was implemented to compress the partial products. Both Booth and array multipliers require an adder at the final stage. In comparison, a CC design using a Booth multiplier was shown to occupy 16% more area than a CC design with an array multiplier.

Table 2.6 gives the area, average power and delay measurements of the custom designed adder and array multiplier sub-modules selected for use in our DWT CC. It also includes the average power, area and delay measurements for the complete computation core, including the two's complement circuits. The average power includes static and dynamic power dissipation at a clock frequency of 6.4 MHz.

Table 2.6: CC transistor counts and worst case values.

Sub-module        # of Tx   Area (μm²)   Power (μW)   Delay (ns)
10-bit Adder      160       418          2.07         0.41
10x6 Multiplier   733       2858         9.04         2.13
Comp. Core        1854      9828         27.36        6.07

2.7.3 Memory Modules

1. Multi-channel/level memory module. Neuroprosthetics applications generally depend upon multiple data streams taken from an array of microelectrodes. Thus, our DWT system has been designed to compress data from multiple channels simultaneously using multi-level decomposition in real time. Analysis of our DWT implementation indicates that there is sufficient bandwidth within a single computation core to process well over 100 data channels sequentially. Sequential processing significantly reduces the computational hardware demand; however, the intermediate values, or 'state', of the current channel need to be stored in memory so they are available when the CC later processes the next sample from that channel. Similarly, memory is needed to store intermediate values while switching between different levels. The memory block required to hold these intermediate values is called the channel/level memory.

The values that need to be stored and made available to process future samples are defined by (2.20). For each level and channel (beyond one), the lifting architecture requires five 10-bit values, a_0, f_0, P_0, Q_0 and R_0, to be saved, four of which are stored in channel/level memory while one is stored in the pairing memory. Every level and every channel therefore requires a corresponding 40-bit memory register; in a 32-channel, 4-level design, this amounts to 128 registers. The channel/level memory was implemented both as an SRAM using standard six transistor cells and as a DRAM. For a large number of channels and levels, the channel/level memory module was found to dominate the power and area of the overall DWT system.
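Behaviorally, this memory is a per-(channel, level) context store: a 40-bit record is loaded at the start of a computation cycle and written back at the end. The Python sketch below mirrors that contract (the class and field names are illustrative; a dictionary stands in for the SRAM/DRAM array, and the a_0 value held in the pairing memory is not shown).

    class ChannelLevelMemory:
        def __init__(self, channels, levels):
            # One 40-bit record (four 10-bit words: f, P, Q, R) per state.
            self.state = {(c, l): dict(f=0, P=0, Q=0, R=0)
                          for c in range(channels) for l in range(levels)}

        def load(self, channel, level):           # read cycle
            return dict(self.state[(channel, level)])

        def store(self, channel, level, regs):    # write cycles
            self.state[(channel, level)] = dict(regs)

    mem = ChannelLevelMemory(channels=32, levels=4)   # 128 records, as above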
For the prototype DWT chip, the channel/level memory was implemented in SRAM to maximize reliability and ease system-level testing. However, a DRAM implementation is particularly efficient in this application; due to the sequential nature of the DWT, the entire memory block is periodically overwritten, eliminating the need for refresh circuitry during normal operation. For example, at a sampling rate of 25 kHz, the least frequently accessed level 4 data is overwritten every 0.64 ms, which is much shorter than the hold time easily achievable by a DRAM cell implemented in a standard CMOS process. This feature permits significant area savings over the SRAM implementation, as shown in Table 2.7, which compares relative performance criteria for a 32 channel, 4 level memory block. In a standard 0.18 μm CMOS process, use of a discrete DRAM storage capacitor is prohibitively area expensive. Alternatively, the gate capacitance of a large MOSFET has been used to store charge. Assuming that the sense amplifiers need a stored value of at least 1 V to reliably read the correct value, a MOSFET with at least 0.2 μm² of gate area is required to realize a dependable hold time. Constructed in this fashion, a comparison of two custom designed CMOS memory blocks shows that a DRAM block would permit a 64% area savings over an SRAM block for a 32-channel, 4-level implementation.

Table 2.7: Comparison of SRAM and DRAM memory implementations.

32 Chn / 4 Level        SRAM      DRAM
No. of Tx               34,814    13,414
Total Area (μm²)        129,830   46,433
Bit Density (μm²/bit)   25.36     9.07

2. Input, coefficient, pairing and CC memory modules. Because the DWT operates on data pairs, there is a hold cycle between each computation of input pairs wherein the first sample of the pair is acquired. Thus, an input data buffer of size equal to the number of channels is required to store input samples during the hold cycles. Note that in this architecture, the hold cycle is used to compute all results for higher (beyond one) levels of decomposition, where the necessary data is available from previous computations.

A separate coefficient memory block is required to store the eight 6-bit filter coefficients. Because this data is static, it is most efficiently implemented using hardwired connections, effectively operating as a ROM structure with only the electronics necessary to switch the proper coefficient values into the CC at the appropriate computation phase.

To accommodate sequential reuse of the CC hardware, at the beginning of each computation cycle four intermediate results from the previous computation cycle for the same channel must be loaded into the CC. These values, loaded from the channel/level memory module, are stored in a block defined as the CC memory. This CC memory block must be capable of loading all four words in parallel and moving data to the appropriate CC input as it cycles through the filter steps in (2.20). As described below, this requires six 10-bit registers, which were implemented using flip-flops for parallel load/store and word-wise shift operations.

When performing calculations for levels 2 and beyond, the CC input data (f and h in Fig. 2.15) does not come from the inputs to the DWT circuit. Rather, the inputs must be pulled from results calculated previously for a lower level. These values are stored in a pairing memory that must contain two 10-bit values for every channel and level except the highest level. All computation cycles except those of the highest level generate one 10-bit word to be written to this pairing memory.
During all computation cycles for level two and beyond, two 10-bit values will be read from this block. Due to this unique read/write structure, this block was implemented independently of the channel/level memory, using an SRAM with address decoding circuitry that allows each 10-bit byte block to be enabled independently.

3. Power saving strategies for memory

To reduce power consumption in the SRAM blocks, a divided bit line structure [42] was adopted, reducing the overall bit line capacitance and eliminating unnecessary dynamic power consumption during read and write cycles. Each of the sub bit lines was connected to only eight SRAM cells, and extra logic within the decoder controls access to these sub bit lines. With reduced bit line capacitance, the SRAM could be implemented without a sense amplifier, eliminating the need to access both sides of the SRAM cell and reducing bit line currents during read operations.

Instead of using a single memory block for all storage, the memory was partitioned into smaller modules based on their different access patterns. As a result, only a comparably small portion of the whole memory is active and undergoes switching at a given time, resulting in saved power.

2.7.4 State Machine Controller

The main functions of the controller are to direct overall system operation and route data between DWT circuit blocks at the proper time [43, 44]. The sequence of actions needed to complete a single DWT computation cycle consists of three different phases over a total of eight clock cycles, as shown in Fig. 2.18. As managed by the DWT controller, one read cycle is followed by five calculation cycles and then two write cycles. During the read cycle, stored values from prior calculations are loaded into the CC memory. During the calculation cycles, the five filter steps in (2.20) are executed. During the write cycles, results are stored onto the channel/level and pairing memory blocks. This sequence must be repeated for each channel within the input sampling period. For example, with 32 data channels and a typical neural signal sampling rate of 25kHz, the eight operation cycles must be clocked at 6.4MHz. For fewer channels, the clock rate can be reduced to ensure power is only consumed when computations are required.

Fig. 2.18: DWT circuit operation phases for a single computation cycle.

Note that the number of levels has no effect on the clock frequency, because higher-level results are computed during intermediate hold cycles while data input pairs are being stored. This is illustrated in Fig. 2.19, which describes the sequence of computation cycles managed by the controller for one channel and four levels. Here, the top line shows the computation cycle count, and the next line counts the cycles within each operation sequence (two shown). A level 1 result is computed when a pair of data samples is received, a level 2 result is computed when two level 1 results are available, and so on. The first results calculated at each level are trash values because the memory initially contains meaningless values. After every 2^L computation cycles (where L is the number of levels), there is an idle (no calc) cycle. Each column represents a computation cycle composed of eight clock cycles. The number of computation cycles necessary for a complete, repeatable DWT operation sequence is 2^L; as shown in Fig. 2.19, four levels require 16 computation cycles for a complete operation sequence.
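The interleaving of Fig. 2.19 can be reproduced with a small behavioral model. The Python sketch below is illustrative only (the names and the scheduling policy are my own simplification): even cycles complete an input pair and yield a level 1 result, while each odd (hold) cycle computes the lowest level that has two pending lower-level results. After the start-up transient that produces the trash values noted above, every 2^L-cycle sequence contains exactly one idle cycle.

```python
def cc_schedule(levels: int, cycles: int) -> list:
    """Behavioral model of Fig. 2.19: for each computation cycle, return
    the level computed (1..levels) or 0 for an idle (NO CALC) cycle."""
    pending = [0] * (levels + 1)          # unconsumed results per level
    schedule = []
    for c in range(1, cycles + 1):
        if c % 2 == 0:                    # even cycle: input pair complete
            pending[1] += 1
            schedule.append(1)
        else:                             # odd (hold) cycle: higher levels
            for lvl in range(2, levels + 1):
                if pending[lvl - 1] >= 2: # two lower-level results ready
                    pending[lvl - 1] -= 2
                    pending[lvl] += 1
                    schedule.append(lvl)
                    break
            else:
                schedule.append(0)        # idle cycle
    return schedule

# Steady-state sequence for a 4-level design (cycles 17..32):
print(cc_schedule(4, 32)[16:])
# -> [2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 0, 1]
```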
Within each operation sequence (2^L computation cycles), one cycle will be an idle cycle where no computations are necessary, as indicated by NO CALC in Fig. 2.19. Notice that, in each odd numbered computation cycle, no level one results are being calculated. These odd cycles correspond to the input hold cycles discussed above. While input data pairs are being stored in these cycles (to be processed in the following cycle), the CC hardware can be utilized to process results for all levels greater than one in the sequence defined by Fig. 2.19. Thus, an arbitrary number of levels could be processed without additional hardware or increased frequency. The number of channels has no impact on this sequence; the channels are processed sequentially, each within its own computation cycle, which can be achieved by increasing the clocking frequency by a factor equal to the number of channels. Thus, the only major impact of increasing channels or levels is an increase in dynamic power consumption and the addition of channel/level memory.

To control the sequence and timing of operations within the DWT circuit, both an instruction based microprocessor and a state machine based controller were analyzed. Because of the relatively straightforward and repetitive nature of operations, the efficiency of a state machine design was found to be much more compatible with the power and area goals of this DWT circuit. The inherent tradeoff for this efficiency is limited flexibility, and the maximum number of levels and channels had to be set before implementing the state machine controller. Here we describe a state machine for 32 channels and 4 levels. This design assumes multi-channel input data is multiplexed into a single 10-bit input bus, and it outputs results sequentially onto a single output bus.

Fig. 2.19. Per channel activity of the CC at different levels of decomposition for a 4-level implementation.

For a 32-channel, 4-level DWT system, the state machine controller utilizes a 12-bit counter to keep track of the current state, channel, and level. The three least significant bits of this counter determine the eight clock cycles within a computation cycle (Fig. 2.18), defining the operation phase for the CC and memory blocks regardless of the channel or level being processed. The next five bits of the counter specify the channel being processed. The remaining four bits represent the level being processed, where each bit represents an individual level (Fig. 2.19). The level 1 bit also determines whether the input stream should go to the input buffer (hold cycle) or to the computation core (compute level 1 results). In this fashion, a level 1 result is computed once a pair of data samples is received; a level 2 result is computed when two level one results are available; and so on. By increasing the width of the counter, more states could be added to support additional channels or levels.
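The counter partitioning just described maps directly onto bit-field extraction. The sketch below is my own illustration of the decode (the field order follows the text; none of the names come from the design):

```python
def decode_state(counter: int):
    """Decode the 12-bit state counter described above:
    bits [2:0]  -> clock cycle within the computation cycle (Fig. 2.18),
    bits [7:3]  -> channel being processed (0..31),
    bits [11:8] -> level flags, one bit per level (Fig. 2.19)."""
    phase   = counter & 0x7           # 3 LSBs: operation phase
    channel = (counter >> 3) & 0x1F   # next 5 bits: channel
    level   = (counter >> 8) & 0xF    # top 4 bits: one-hot level field
    return phase, channel, level

phase, channel, level = decode_state(0b0001_00101_011)
print(phase, channel, bin(level))  # -> 3 5 0b1 (phase 3, channel 5, level 1)
```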
Fig. 2.20 describes operations within the CC memory, which are controlled by the three least significant bits of the state machine counter; this is where the complex data movement within the CC memory and its interaction with other memory blocks is managed.

Fig. 2.20. Data movement in CC memory during the five calculation cycles within a computation cycle (one per sample per level). The top line shows the six register names; subsequent lines define how data must be moved as the steps in (2.20) are processed sequentially.

Fig. 2.20 defines the CC memory register names, where X, Y and Z are registers connected to the respective inputs of the CC while M1, M2 and M3 store intermediate computation results. Values are read into the CC memory at the beginning of the computation cycle. This is followed by five cycles of calculations and register data shifts. The last calculation cycle produces the detail and approximation DWT results. The detail DWT coefficient is sent to the output bus. The approximation DWT coefficient is sent to the output bus and, depending on the level being processed, stored in the pairing memory. After the last calculation cycle, four of the values in CC memory, f0, P0, Q0 and R0, are stored to the channel/level memory for use in subsequent calculations of the current channel/level results.

2.8 Simulation Results and Analysis

2.8.1 Hardware Resources

All of the DWT circuit blocks were custom designed and laid out in 0.18μm CMOS. Table 2.8 lists the number of transistors required for each module. The corresponding chip area required by each module, including routing, is also shown. The dominance of the channel/level memory over the other modules is evident: increasing the number of channels by a factor of four (from 8 to 32) results in roughly a threefold increase in both the number of transistors and the area consumption. Our model 32-channel, 4-level DWT implementation requires ~54k transistors and occupies roughly 470μm x 470μm. These values were obtained with the major memory blocks implemented with SRAM for reliability and testing purposes. If the input buffer, pairing and channel/level memory were replaced by DRAM blocks, the estimated transistor count for the 32-channel, 4-level circuit would drop to around 25,000 and the total DWT circuit area would reduce to a little more than 100,000μm², which is less than 50% of the overall circuit using SRAM. Table 2.8 also gives area and transistor count estimates for 100-channel and 250-channel designs.

A 32-channel system with a sampling frequency of 25kHz, operating at four levels of DWT decomposition, requires a clock frequency of 6.4MHz. The power is managed separately for each block; thus, hardware portions not working at a given time do not consume dynamic power. Operation of the DWT block can be modeled as a state machine which requires 8 cycles to complete its operation, as shown in Fig. 2.18.

Table 2.8: Transistor Count and Area for Hardware Modules

Module                                    No. of Tx     Area (μm²)
Controller                                987           6319
Computation Core                          1854          9828
Comp. Core Memory (Flip-Flops)            2803          10453
Coefficient Memory (ROM)                  122           714
*Pairing Mem. per Chn/Lvl (SRAM)          120*C*(L-1)   Varies
*Input Buffer per Chn (SRAM)              60*C          Varies
*Per Chn or Lvl Mem (SRAM)                240*C*L       Varies
Complete DWT system: 8 Chn, 4 Lvl         17910         74253
Complete DWT system: 32 Chn, 4 Lvl        54339         221643
Complete DWT system: 100 Chn, 4 Lvl       157562        641344
Complete DWT system: 250 Chn, 4 Lvl       385256        1583832
*Counts exclude transistors for address decoding circuitry

Table 2.9 gives the average activity per cycle of each hardware module. If a maximum delay of 10ns is assumed to allow for the effects of fabrication non-idealities, there is enough time to process 500 channels at a 25kHz sampling frequency in a time-multiplexed fashion. However, handling 500 channels does not seem feasible, especially at the analog-to-digital converters and the neural signal amplifiers.
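The clock requirement scales linearly with the channel count, and the 500-channel ceiling follows directly from the 10ns worst-case delay. A quick back-of-the-envelope check (illustrative Python, not from the thesis):

```python
CYCLES_PER_COMPUTATION = 8   # one read + five calculate + two write (Fig. 2.18)
SAMPLE_RATE_HZ = 25_000      # neural signal sampling rate per channel

def required_clock_hz(channels: int) -> int:
    """Clock needed to serve every channel once per sampling period."""
    return CYCLES_PER_COMPUTATION * channels * SAMPLE_RATE_HZ

print(required_clock_hz(32) / 1e6)  # -> 6.4 (MHz), as quoted above

# A 10 ns worst-case delay permits a 100 MHz clock at most:
max_channels = int(100e6) // (CYCLES_PER_COMPUTATION * SAMPLE_RATE_HZ)
print(max_channels)                 # -> 500 channels
```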
Table 2.9: Average Activity per Cycle for Hardware Modules

Module                                 Average Activity per Cycle
Controller                             8/8 = 1
Comp. Core                             5/8
Comp. Core Memory (Flip-Flops)         8/8 = 1
Coefficient Memory (ROM)               5/8
Pairing Memory per Chn/Lvl             2/8 = 1/4
Input Buffer per Chn (SRAM)            1/8
Per Chn or Lvl Memory                  2/8 = 1/4

The overall 32-channel, 4-level system, implemented in a 0.18 micron process with a 1.3V VDD and operating at 6.4MHz, consumes an average power of 76μW, including static and dynamic power dissipation, with the highest power consumed by the CC memory, the computation core and the controller, respectively. Since these three modules are operational most of the time, they account for about 80% of the total power.

2.8.2 Data Compression vs. Reconstruction Error

The DWT circuit outputs interleaved approximation and detail transform coefficients that give a sparse representation of the original neural signal. Coefficient values below a specific threshold can be set to zero to compress the results into a smaller number of bytes. The non-zero coefficients can then be encoded using a lossless encoding scheme and transmitted to the extra-cranial or extra-cutaneous processing units [17, 45]. Choosing the value of the zeroing threshold provides a tradeoff between signal integrity and compression ratio.

To test our DWT circuit implementation on real neural data, a linear-step analog to digital converter was used to convert experimentally obtained neural data into a stream of 10-bit digital values. The data was then processed through the DWT system and the results were stored for analysis. In one test, the stored transform coefficients were used to reconstruct the neural signal, and this result was compared to the original signal to measure the quality of reconstruction. This analysis was performed for several different zeroing threshold values to evaluate signal quality versus the amount of compression obtained.

The final performance metrics can be defined in terms of three quantities: the root mean squared (RMS) error, Shannon's entropy [46], and the assigned threshold. RMS error is a measure of the average difference between the original and the reconstructed signal. Here, the difference between the original signal and the reconstructed signal is comprised of two components: the quantization noise due to finite word length, and the lossy-compression noise resulting from the thresholding operation. The mathematical representation of RMS error is given by

$\mathrm{RMS\_error} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(x_n - \hat{x}_n\right)^2}$   (2.21)

where $x_n$ is the original signal, $\hat{x}_n$ is the reconstructed signal and N is the length of the signal.

Shannon's entropy is a measure of the uncertainty associated with a random signal and can be interpreted as the average minimum message length, in bits, that must be transmitted to communicate the exact value of the signal. Entropy gives the theoretical limit to the best possible lossless data compression for any communication. Entropy can be mathematically represented as

$H(X) = -\sum_{i=1}^{N} p(x_i)\log_2 p(x_i)$   (2.22)

where N is the total number of possible values (also called symbols in the information theory literature) and $p(x_i)$ is the probability of occurrence of the i-th value.

Increasing the zeroing threshold improves the amount of compression achieved (decreasing entropy) but degrades RMS error. For neural data processed through our DWT circuit, Fig. 2.21 plots the RMS error and entropy as the zeroing threshold increases and confirms the anticipated tradeoff.
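Both metrics are straightforward to compute offline. The sketch below is an illustrative Python/NumPy implementation of (2.21) and (2.22) (assuming integer-valued 10-bit samples; the toy data is mine):

```python
import numpy as np

def rms_error(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Eq. (2.21): root mean squared difference between two signals."""
    return float(np.sqrt(np.mean((original - reconstructed) ** 2)))

def entropy_bits(symbols: np.ndarray) -> float:
    """Eq. (2.22): Shannon entropy, in bits per symbol, of a discrete stream."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Toy illustration: zeroing small coefficients lowers the entropy (better
# compressibility) at the cost of error introduced by the thresholding.
coeffs = np.array([0, 3, -2, 250, 0, 1, -1, 0, -310, 2])
thresholded = np.where(np.abs(coeffs) < 5, 0, coeffs)
print(entropy_bits(coeffs), entropy_bits(thresholded))
print(rms_error(coeffs, thresholded))  # error in the coefficient domain
```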
Because of this direct tradeoff between RMS error and compression ratio, the zeroing threshold must be chosen to match application requirements: if bandwidth limitations are not severe, a low threshold can maximize signal reconstruction quality, while if near-perfect reconstruction is not needed, a high threshold can reduce the power required for data transmission.

It is important to note that the measures of entropy and RMS error presented here also depend on the signal to noise ratio (SNR) of the input signal. In high SNR signals, most of the data values (which represent noise) have very low amplitude, whereas in low SNR signals the data values are more uniformly distributed. As a result, though the trend of decreasing entropy and increasing RMS error with threshold holds true for all data sets, the slope of these curves depends on the particular data set and its SNR.

Fig. 2.21. RMS error and Entropy in bits per symbol as a function of threshold value for the neural data set used in our experiments.

Fig. 2.22 and Fig. 2.23 show the original and reconstructed signals for threshold values of 50 and 120, respectively, as obtained in our experiments with real neural data. At a threshold of 50, an entropy of 4.78 bits per symbol and an RMS error of 37.95 were obtained. For a threshold value of 120, an entropy of 1.48 bits per symbol and an RMS error of 87.69 were measured. It can be seen that a low threshold value, as in Fig. 2.22, results in good signal reconstruction, making it suitable for applications that require high quality signal reconstruction. A higher threshold value, as in Fig. 2.23, results in a relatively worse reconstruction; however, it still captures the spikes in the neural signal very effectively and preserves their shapes and sizes. Since most neural signal analysis and applications are based on the information embedded in the spikes only, even this level of reconstruction is highly accurate and useful. Thus, depending on the specific application requirements, the threshold value can be chosen to optimally exploit the available bandwidth, the available power, and the desired reconstruction quality.

Fig. 2.22. Original and reconstructed signals for Threshold = 50, RMS error = 37.95 and Entropy = 4.78 bits per symbol at 4 levels of decomposition with 10 bits data and 5 bits coefficients.

Fig. 2.23. Original and reconstructed signals for Threshold = 120, RMS error = 87.69 and Entropy = 1.48 bits per symbol at 4 levels of decomposition with 10 bits data and 5 bits coefficients.

The resulting DWT circuit can pseudo-simultaneously process multiple channels of incoming neural signals in real time, and its ability to perform multi-level DWT enables very high data compression while maintaining signal fidelity. The small size and low power consumption of this DWT VLSI circuit make it highly suitable for front-end data compression in implantable applications.

2.9 The Design Flow

The design of an ASIC to implement an algorithm in hardware is completed in two main phases: the technology independent phase and the technology dependent phase. The technology independent phase involves fixed-point algorithm design in Matlab, followed by behavioral and structural design in a hardware description language. The ASIC design is then verified and prototyped in hardware using an FPGA. Although FPGA verification is a technology dependent technique, it has various advantages, as described in the following section.
The FPGA-verified hardware description is finally translated to the ASIC design for chip fabrication in a particular technology. Fig. 2.24 gives the high-level steps for the design of the proposed ASIC hardware in silicon.

Fig. 2.24: High-level ASIC design flow

The design flow is divided into three phases:

1. Technology Independent Phase
2. Technology Dependent Phase for FPGA based Prototype
3. Technology Dependent Phase for ASIC

These phases are described in detail in the Appendix.

2.10 Integrated Circuit for DWT: Design and Test

2.10.1 Design

A 0.5μm CMOS process was used to design all the blocks of the neural compression engine. The DWT block has been fabricated, and the 3mm x 3mm chip is shown in Fig. 2.25. The controller was synthesized using OSU's Standard Cell Library [47], while all other blocks were custom designed for low power and low area. The active components of the prototype 32-channel, 4-level DWT implementation occupy roughly 3.84mm². The layout for the threshold and RLE blocks requires about 0.95mm² of area. Thus the combined compression system is expected to require approximately 5.75mm², including global routing, in a 0.5μm process. The tested DWT module consumes only 3mW of power while processing 32 channels at 25Ksamples per second, or equivalently 95μW per channel. The power consumption per channel is directly proportional to the neural data sampling frequency.

Power Consumed = 95μW per channel @ 25KSps per channel
Core Area = 3.84mm²

Fig. 2.25. DWT system on chip, fabricated in 0.5μm technology.

2.10.2 Test Setup

To test our designs and algorithms, a stream of experimentally obtained 10-bit neural data was processed through the compression system and the resulting transform coefficients were used to reconstruct the neural signal. The results were compared to the original signal to measure the quality of spike reconstruction versus the compression obtained.

The chip was tested using a National Instruments DAQ card, Labview and an FPGA. The virtual interface used to control the data acquisition is shown in Fig. 2.27. Initially, Labview and the DAQ card were used to test the functionality of the chip at low speeds. Once verified, the FPGA was used to test the chip at the required frequency of 6.4MHz. The resulting coefficients were an exact match to the results from fixed-point simulations in Matlab.

The PCB shown in Fig. 2.26 was designed and fabricated to connect the ASIC under test to the DAQ cards and the FPGA. Apart from connection points included to provide flexible connections, the PCB also contains several adjustable power supplies to allow chip testing at several different supply voltages. In addition, the test board contains voltage level shifters to provide connectivity between different voltage domains; this is necessary since the DAQ card uses 5V digital IO and the FPGA uses 3.3V digital IO, while the ASIC can operate at a range of supply voltages. The same setup was later used to connect the FPGA directly to the DAQ cards during FPGA-based prototyping of other modules, detailed in the next chapter.

Fig. 2.26. Photograph of Test PCB designed to interface ASIC with FPGA and DAQ cards.
Fig. 2.27. Labview interface to control DAQ cards providing input data and recording output data from the ASIC.

2.11 Improvements in DWT

A second version of the DWT block has been designed to improve the functionality and power consumption of the block.

1. Channel Selectivity

When microelectrode arrays are used for recording from the cortex, it is not possible to position each electrode independently at the desired location. As a result, a significant number of electrodes generally fail to capture any signal at all, and it is not useful to waste precious transmission bandwidth on coefficients from those channels. The initial design captured and transformed signals from all 32 channels in a time-multiplexed sequence. The block has been modified to handle a selectable number of channels; it can now be instructed to compress 32, 16, 8, 4, 2 or 1 channels at a given time. Since a reduced number of channels does not require as many computations per second as the maximum number of channels, the block can be clocked at a slower frequency to save dynamic power, which is linearly dependent on the operating frequency.

2. Overflow

The second version of the DWT block has been designed to incorporate better handling of data overflow at the output of the computation core. If an overflow is detected at the output of an arithmetic computation, the sign of the correct result is determined and the output is saturated to the corresponding maximum value: a negative overflow is replaced by -511 and a positive overflow by +511.
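A minimal sketch of this saturating behavior (illustrative Python; the hardware operates on 10-bit sign-magnitude words rather than native integers):

```python
def saturate_10bit(value: int) -> int:
    """Clamp an arithmetic result to the 10-bit sign-magnitude range,
    mirroring the overflow handling above: negative overflows become -511
    and positive overflows become +511."""
    return max(-511, min(511, value))

print(saturate_10bit(730), saturate_10bit(-605))  # -> 511 -511
```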
3 Design of Implantable Neural Interface Node

3.1 Motivation and Approach

The work in Chapter 2 establishes the design of a highly efficient circuit for computing the multi-level DWT of neural signals. However, for this DWT block to be useful within a neural implant system, it must be combined with several other functional circuit blocks. The DWT does not provide compression on its own; rather, it enables compression by providing a time-frequency transformation that compacts signal energy into a few coefficients. Coefficients with significant energy are separated from those with insignificant energy by the use of a multi-level threshold, and the low-energy coefficients are reduced to zero, providing lossy compression. The resulting stream then undergoes lossless compression through a run length encoder block. The DWT, the multi-level threshold and the run length encoder collectively perform neural data compression.

Apart from the compression engine, the implantable node needs two more interfaces: the analog frontend interface and the RF transceiver interface. The analog interface is controlled by a dedicated controller, while the RF transceiver is interfaced with the communication controller, which is also responsible for running the communication protocol and providing error-free communication. The aim for the communication controller is to provide maximum data throughput. A global controller is necessary to synchronize the data flow between blocks, to manage various configurations, and to maintain the power management block. This chapter describes the circuits designed to meet the needs of these individual blocks as well as their collective operation under the constraints of low area and low power.

3.2 Architecture Overview

Fig. 3.1 provides a system level view of an integrated neural data compression circuit fabricated on the back of a microelectrode array of the NIN, which is implanted in the cortex and communicates wirelessly with an EPU. Neural signals from multiple channels are amplified, digitized and fed to the DWT block, which generates a sparse representation of these signals. The threshold block serves the dual purpose of de-noising and spike detection. The Run Length Encoder (RLE) removes redundancy from the streaming data, which is then formatted into packets for wireless transmission and sent to the transceiver. The system employs both lossy and lossless compression to reduce the data, where the DWT, threshold and RLE blocks form the compression engine.

Fig. 3.1. Block diagram of the implantable neural data compression engine and its position within an implantable neural recording system.

3.3 Data Compression Engine at the NIN

3.3.1 Analog Interface Controller

Fig. 3.2 shows the analog front end assumed in designing the digital control portion of the NIN, which receives, amplifies and filters signals from the electrodes. The individual signal channels are AC coupled to remove any DC offset and are then passed through pre-amplifiers, which provide much needed buffering at this stage. The channels are then fed to a transmission-gate based analog multiplexer, which selects one channel at a time. The output of the mux contains a superposition of the LFP and the neural spikes. This signal is then passed through a band pass filter followed by a variable gain amplifier. A number of amplifiers satisfying the gain-bandwidth requirements of this study have been reported in the literature [4, 48]; a modified version of [48] is being designed to allow variable gain amplification. The output of the amplifier contains the amplified neural spikes. Depending upon the mode of operation, the amplified spike train or the LFP is then relayed to the ADC, which converts the analog voltage into 10-bit digital data. The analog frontend is expected to cover an area of 1.77mm² with a power dissipation of 1.1mW.

Fig. 3.2. Analog module for multichannel neural data recording with variable gain amplifier.

Operation of this frontend analog module is controlled by a state machine based controller which, depending upon the runtime configuration settings, allows the flexibility to choose individual channels for recording. The controller time-multiplexes the channels before the data is handed over to the DWT block, and it selects the data path for the different modes of operation, switching between spikes and local field potentials. The controller allows setting individual channel gains, out of four possible gain settings, for each of the channels. It is also responsible for providing an appropriate clock to the ADC.

3.3.2 Discrete Wavelet Transform

As detailed in Chapter 2, based on the chip real estate available to this research project, the DWT block shown in Fig. 3.3 was designed to support up to 32 channels of data simultaneously with 4 levels of wavelet decomposition. Multiple decomposition levels generally result in fewer significant coefficients. The relatively long intervals between samples of neural signals allow for computation hardware that prioritizes power and area efficiency over speed [43]. Derived from our prior system-level analysis [35], the power-area product can be minimized by an architecture that sequentially evaluates the DWT of multi-channel data in real time.
The lifting scheme is used to compute results; it has been shown to require fewer computations than convolution based filtering [24].

Fig. 3.3. System diagram for sequential calculation of DWT.

The computation core performs sequential calculation of DWT coefficients. The memory blocks store temporary data and intermediate results and have been partitioned based on their different access patterns. The controller manages timing and data flow among the blocks. The controller was synthesized from a library, while the other blocks were custom designed to minimize power and chip area.

3.3.3 Multi-Level Threshold

Coefficients at the output of the DWT block can be viewed as sparse packets of energy which do not, by themselves, result in any compression. However, following the DWT with a thresholding stage, which reduces the insignificant low-energy coefficients to zero and lets high-energy coefficients pass, does permit compression. The low-energy coefficients carry little or no significant information and mainly contribute to noise, whereas the high-energy coefficients invariably correspond to different spikes and events in the neural signal, allowing spike sorting even without reconstruction [17].

The values to which the thresholds are set are of critical importance, since they determine both the quality of reconstruction and the final rate of compression. Methods to determine the optimal threshold are under ongoing investigation. However, it has been established that the best compression is achieved by using separate threshold values for each decomposition level of each channel [17]. Thus, DWT results from each level of each channel must be treated separately and will form a stream of data containing information that is virtually independent of the information from other channels and levels within the data stream. The DWT and threshold blocks combine to implement the lossy compression of the input signals; the multi-level thresholds thus result in compression and de-noising of the signal at the same time.

Fig. 3.4 shows the main functions of the threshold and RLE stages. The threshold block includes a set of memory registers that contain threshold values for each level of each channel. Since the last level of DWT produces two separate coefficient streams, an N-level system has N+1 threshold values for each channel. These values are determined externally and stored into the memory sequentially before the DWT operation begins; any of them can be updated during system operation. The DWT block generates values in sign-magnitude form. The magnitude is compared against the threshold value using a magnitude comparator; if it is smaller than the threshold, a 10-bit zero is generated at the output, and if it is equal or greater, the original value is regenerated at the output.

As explained later in this chapter, the overall compression system is designed to operate in two modes: Monitor mode and Compression mode. In Monitor mode, the system bypasses the DWT, threshold and RLE blocks and sends the uncompressed neural signal directly to the transceiver. Monitor mode is used by the EPU to analyze the statistical properties of individual channels and calculate the optimal threshold values for each channel and level. These threshold values are then transmitted back to the implanted system to set compression parameters for use in Compression mode, where all system blocks are activated to compress data.
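The per-level thresholding described above amounts to a compare-and-zero on each coefficient magnitude. The Python sketch below is illustrative only; the per-channel/per-level thresholds are held in a dictionary here rather than in the hardware registers, and the values are made up:

```python
def threshold_coefficient(value: int, thr: int) -> int:
    """Zero a DWT coefficient whose magnitude falls below the threshold;
    pass it through unchanged otherwise."""
    return 0 if abs(value) < thr else value

# An N-level system keeps N+1 thresholds per channel, since the last
# level produces two coefficient streams.
thresholds = {(0, 1): 40, (0, 2): 55, (0, 3): 70, (0, 4): 80, (0, 5): 80}

coeffs = [(0, 1, 12), (0, 2, -310), (0, 3, 68), (0, 4, -95)]  # (chn, lvl, value)
print([threshold_coefficient(v, thresholds[(c, l)]) for c, l, v in coeffs])
# -> [0, -310, 0, -95]
```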
3.3.4 Run Length Encoder

Several lossless data compressors exist in the literature, with varied computational complexity and storage requirements. A few popular techniques are Huffman coders, Lempel-Ziv coders, arithmetic encoders and their variants. Most of these algorithms require prior statistical knowledge of the incoming data set, and thus the rate of compression achieved is directly related to the accuracy of this prior information. Since these algorithms are variable length encoders, under certain conditions they approach the theoretical limits of compression, bounded by the entropy of the incoming signal. However, these algorithms require prohibitively large storage to maintain the dictionary of codes. Use of Huffman coding is not possible, since in this application the size and statistical properties of the source alphabet (the number of possible data values) depend on the threshold values, which are ideally controllable by the neuroscientist.

The only statistical information available for our thresholding compression system is that much of the data stream at the output of the threshold block consists of zeros. Run Length Encoding is best suited for data with long repetitive strings of values; in addition, it is very conservative in its hardware resource requirements. Because we expect long strings of zeros at the output of the threshold block, RLE is a good lossless compression choice. Though RLE is not an optimal encoding scheme in general, when given very long repetitive sequences it approaches the performance of near-optimal algorithms.

Fig. 3.4. System diagram for the Threshold and the RLE blocks.

Given that, for this implementation, a byte refers to a 10-bit value in sign-magnitude form ranging from -511 to +511, and that a 10-bit counter can count up to 1023, our implementation of RLE can be summarized by the following rules (illustrated in the code sketch at the end of this subsection):

1. Transmit all non-zero bytes as is.
2. Convert all negative zeros (represented by X) to positive zeros.
3. Replace a sequence of zeros (two or more) with an X (negative zero) byte and a zero-count byte.
4. If the zero count reaches 1023, send 1023 and restart a new sequence of zeros.

Following these rules, the example RLE operation of Fig. 3.5 shows the original 40-byte sequence reduced to a 20-byte sequence. Sequences from real neural recordings have been observed to yield much better compression ratios. Since we do not expect long sequences of repeating non-zero values, this implementation compresses only the sequences of zeros. This scheme results in fixed-length codes, which have a computational resource advantage.

Fig. 3.5. Example of the system specific run length encoding.
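The four rules translate almost directly into code. The Python model below is my own illustration (negative zeros are assumed already converted to positive zeros per rule 2, since ordinary integers cannot represent -0; the escape byte is modeled by the marker string "X"):

```python
NEG_ZERO = "X"  # models the negative-zero escape byte (sign bit set, magnitude 0)

def rle_encode(stream):
    """Encode per the rules above: non-zero bytes pass through, a lone zero
    passes through, and a run of two or more zeros becomes (X, count),
    with the count capped at 1023 per escape."""
    out, zeros = [], 0

    def flush():
        nonlocal zeros
        if zeros == 1:
            out.append(0)                  # a single zero is sent as-is
        elif zeros >= 2:
            out.extend([NEG_ZERO, zeros])  # rule 3: escape byte plus count
        zeros = 0

    for byte in stream:
        if byte == 0:
            zeros += 1
            if zeros == 1023:              # rule 4: cap and restart the run
                out.extend([NEG_ZERO, 1023])
                zeros = 0
        else:
            flush()
            out.append(byte)               # rule 1: non-zero bytes pass through
    flush()
    return out

print(rle_encode([37, 0, 0, 0, 0, -12, 0, 5]))  # -> [37, 'X', 4, -12, 0, 5]
```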
3.4 Communication Controller and Interface

A communication system is required to carry neural spike data from the NIN to the MIM and, at the same time, to carry configuration settings and commands from the MIM to the NIN. This communication takes place across the skull and the scalp; thus the NIN and MIM are effectively only a few inches apart. A Communication Controller module is designed to control the transmission and reception of data and the general operation of the RF transceiver. The controller must adhere to a few requirements for the system to work effectively.

Power dissipation is again a primary concern in designing the communication system. Since human tissue is known to absorb higher radio frequencies more strongly than lower frequencies, carrier frequencies in the range of a few megahertz are suitable. The system has been designed to abide by the FCC Part 15 recommendation of 13.56MHz.

3.4.1 Packet Structures

The neural data must be delivered intact, which means error detection is necessary at both ends of the communication channel so that packets can be retransmitted until error-free data has been received. Two different packet structures are used during communication: one for the uplink from NIN to MIM and the other for the downlink from MIM to NIN.

As shown in Fig. 3.6, the uplink packet, also called the Data Packet, is large and contains eighty 10-bit values representing the compressed neural data. The packet contains 8 bits of channel and DWT level information as the first data value in the packet. This channel and level information is only required for the first data value; subsequent values conform to a predetermined sequence of channels and levels set by the DWT block. The packet contains one bit to represent the current Packet ID, used to establish proper data sequencing at the receiving side. It also contains a Command Request ID as an acknowledgement of the last packet received; if the request ID is the same as the ID of the last transmitted packet, the receiver is requesting retransmission of the last packet. Finally, the packet contains 8 bits of Cyclic Redundancy Check (CRC) for error detection. A standard 8-bit CRC (CRC-8-CCITT) is used for this application to detect the error patterns anticipated during communication. The polynomial for the CRC-8-CCITT is given as

$x^8 + x^7 + x^3 + x^2 + 1$

The CRC bits allow error detection at the receiver side. The choice of CRC depends on two factors: the length of the message and the desired percentage of error detection. No CRC choice can guarantee absolutely error-free communication; however, a greater number of CRC bits corresponds to better error detection capability, achieved essentially at the cost of higher redundancy and thus lower throughput. This code is guaranteed to detect all single-bit and double-bit errors, all odd numbers of errors, and all burst errors of up to 8 consecutive bits. Some errors may pass through if the error pattern results in another valid codeword. For the 8-bit CRC there is, on average, 1 undetected error pattern for every 255 that are detected; that is, it should detect 255/256 or 99.6% of all possible 8-bit errors [49]. As a result, the total size of the Data Packet becomes 818 bits.

The downlink packet, also called the Command Packet, is used to send commands from the MIM to the NIN. This 25-bit packet contains 8 bits for the Command Code and 10 bits for the corresponding Command Data. As in the Data Packet, the Command Packet also contains identification and request bits. A 5-bit CRC is sufficiently powerful for a packet of this length; the polynomial used to generate the CRC-5 is

$x^5 + x^4 + x^2 + 1$

Fig. 3.6. (Top) MIM to NIN Command Packet. (Bottom) NIN to MIM Data Packet.
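As an illustration of how these CRCs are computed, the generic bitwise long-division sketch below (Python; a textbook implementation, not the register-level circuit on the chip, with a zero initial value and no bit reflection assumed) accepts either polynomial:

```python
def crc_remainder(bits, poly_bits, crc_len):
    """Generic bitwise CRC: divide the message (with crc_len appended
    zeros) by the generator polynomial and return the remainder bits."""
    msg = list(bits) + [0] * crc_len
    for i in range(len(bits)):
        if msg[i]:                             # leading bit set: subtract
            for j, p in enumerate(poly_bits):  # (XOR) the polynomial
                msg[i + j] ^= p
    return msg[-crc_len:]

CRC8 = [1, 1, 0, 0, 0, 1, 1, 0, 1]  # x^8 + x^7 + x^3 + x^2 + 1 (Data Packet)
CRC5 = [1, 1, 0, 1, 0, 1]           # x^5 + x^4 + x^2 + 1 (Command Packet)

payload = [1, 0, 1, 1, 0, 0, 1, 0]  # toy message bits
print(crc_remainder(payload, CRC8, 8))
print(crc_remainder(payload, CRC5, 5))
```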
3.4.2 Protocol Design

A major limitation in the protocol design is the use of a half-duplex transceiver, chosen to conserve power and bandwidth. As a result, a single 1Mbps channel is used for both uplink and downlink data transfer. A stop-and-wait based approach has been taken to meet the requirements of the system, including the need for the NIN to wait for a packet to fill before transmitting streaming data. At the same time, the MIM has to transfer various configurations and commands to the NIN. However, the MIM does not have to wait for streaming data to become available. Since the system clock is generated wirelessly, another issue is transmitter-receiver re-synchronization once a disruption takes place.

Automatic Repeat Request (ARQ) is the error control method based on acknowledgements and timeouts used to achieve reliable data transmission. The Stop-and-Wait protocol has been customized and redesigned to account for the fact that the NIN generates streaming data and thus may have to wait for data to become available for transmission. The protocol in its innate form is known to be less efficient than more sophisticated window-based variations of ARQ, e.g. Selective Repeat ARQ or Hybrid ARQ, but since the error characteristics of communication across human tissue for such an application are unknown, and are highly dependent upon the ambience and transceiver coil alignment, it is not evident that any gain in efficiency would be achieved by switching to one of the other variations. Several sophisticated window-based protocols have been reported in the literature. However, with streaming data, where the transmitter sometimes has to wait for data to become available, window-based protocols may take very long to hand channel control back to the MIM, which would preclude instant changes in the configuration of the NIN. In addition, the protocol presented here has the obvious hardware area and complexity advantage of not having to maintain a number of previous packets. The other protocols require a large buffer to maintain a 'transmission window'; since this buffer would be built from hundreds of bits of shift registers, its hardware cost in comparison to the other blocks in the system is tremendous. However, improvements can be made to the current protocol, or to the choice of CRC, depending upon the profile of error statistics generated and collected from the first generation of the design.

Fig. 3.7 shows the flow diagram of the modified protocol. The diagram depicts the normal flow of data as well as error handling. Command Packets are represented with a 'C', Data Packets with a 'D', and acknowledgements with an 'A'. Three important parameters in the flow diagram are the three configurable timeout values described below.

An essential feature included in the protocol is the programmable 'Data Not Ready' timeout, referred to in Fig. 3.7 as Timeout N1 (TN1). This timeout occurs when the MIM requests new data from the NIN but the NIN packet is not completely filled and thus not ready for transmission. TN1 is the maximum time that the NIN waits before bit-stuffing and transmitting an incompletely filled packet. This timeout is only relevant when neural activity is low and spikes are sparse. A packet is transmitted as soon as the buffer fills; however, if a single spike is followed by a very long string of zeros, it can take up to approximately 100ms to completely fill the Data Packet. This delay in transmission may not be acceptable for certain applications where real-time spike recording and decoding is important. The default value of TN1 is set to 8ms, but it can be reconfigured to any temporally relevant value of less than 32ms through a command from the MIM.

Fig. 3.7. Flow diagrams for the communication protocol. The diagram shows normal flow as well as handling of error conditions.
The MIM transmits a Command Packet as soon as it receives a Data Packet from the NIN; it does not have to wait for a new command to become available and always maintains a preconfigured Command Packet. If the NIN does not receive a Command Packet within a specified time, Timeout N2 (TN2) occurs: the NIN assumes a 'Data or Command Packet lost' error and retransmits the Data Packet. Let TD be the time to transmit a Data Packet and TC be the transmit time of a Command Packet. With a transmission rate of 1Mbps, the 818-bit Data Packet takes TD = 0.818ms to transmit, and the 25-bit Command Packet takes TC = 0.025ms. TN2 can then be expressed as

$T_{N2} = T_D + T_C + 2T_P \approx 0.85\,\text{ms}$   (3.1)

where TP is the packet propagation delay, which is negligible due to the close proximity of the NIN and MIM. Similarly, the idle time between reception of a packet and transmission of a preconfigured packet is also negligible.

The reconfigurable Timeout N3 (TN3) acts as a watchdog timer for the communication controller. If the NIN does not receive a packet from the MIM for three consecutive Data Packet transmissions, the NIN communication controller resets itself, assuming a 'break' in communication requiring re-synchronization. TN3 can be expressed as

$T_{N3} = T_{N1} + 3T_{N2}$   (3.2)

The default value of TN1 = 8ms gives TN3 ≈ 10.6ms. If needed, a reset command can be issued through the MIM that resets the NIN and initializes all settings to their default values.

Fig. 3.8 shows the functional diagram of the communication controller. The controller contains two frames: at any given time, one frame is the active frame while the other acts as a reserve frame. The reserve frame is filled while data from the active frame are being transmitted. It is desirable to be able to transmit the active frame twice (in case of an error) before the reserve frame fills and data overflows. As a result, from (3.1), the buffer should be large enough to store incoming data for a time greater than 2TN2 ≈ 1.7ms. This can be achieved by setting an appropriate threshold and frame size.

The RF transmitter and receiver use serial data-in and data-out lines, in conjunction with corresponding clock signals, to communicate with the communication controller. The NIN controls the select line, which enables the transmitter or the receiver according to the protocol presented in Fig. 3.7. The transmit and receive clocks run at 1MHz, matching the 1Mbps data rate, and data bits are transmitted or received on the positive edge of the clock, as shown in Fig. 3.9. The Tx/Rx Select signal is generated by the communication controller, in addition to the Tx clock and Tx data; the Rx clock and Rx data are generated by the RF receiver.

Fig. 3.8. Functional diagram for the communication controller.

Fig. 3.9. Timing diagrams. (Top) Packet transmission from NIN. (Bottom) Packet reception at the NIN.

3.4.3 Efficiency

The analysis and results of the DWT, threshold and RLE blocks have been presented in [50]. The communication system can be seen as operating in two phases. In the beginning, the NIN has data to send to the MIM and the MIM has configuration commands to send to the NIN. During this phase the protocol efficiency is at its maximum and is given by

$\eta = \frac{T_D + T_C}{T_{N2}} \approx 1$   (3.3)

Steady state is reached when the initial configuration is complete and the MIM has very few commands for the NIN.
During this phase, the efficiency of the protocol is given by

$\eta = \frac{T_D}{T_{N2}} \approx 97\%$   (3.4)

The efficiency expression in (3.4) is valid when there are no errors during transmission. The probability of error because of a CRC failure or a lost packet is denoted by Pe. Each transmission attempt occupies one TN2 interval and succeeds with probability (1 - Pe), so the expected number of attempts is 1/(1 - Pe), and the average time it takes to successfully transmit a packet and receive its acknowledgement in the presence of errors, E[T], is given by

$E[T] = \frac{T_{N2}}{1 - P_e}$   (3.5)

During steady state, in the absence of any errors, the data throughput of the compressed coefficients can be given as

NIN Throughput = (No. of Data Bits / Total Bits) x Tx. Rate = (800 / (818 + 25)) x 1Mbps ≈ 950Kbps   (3.6)

Thus, a 1Mbps channel can transfer data generated at rates less than 950Kbps. However, as with other protocols, if successive packets result in failed CRCs or are lost, multiple retransmissions may be required for each packet. If at the same time there is a burst of activity in the neurons, data may be generated at a higher rate than the throughput of the channel, raising the possibility of a buffer overflow and thus data loss at the NIN.
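These timing and throughput relations are easy to sanity-check numerically. The short Python sketch below plugs in the packet sizes from Section 3.4.1 (variable names are mine):

```python
DATA_PACKET_BITS = 818
CMD_PACKET_BITS  = 25
DATA_BITS        = 800        # eighty 10-bit values per Data Packet
LINK_RATE_BPS    = 1_000_000  # 1 Mbps half-duplex channel

t_d  = DATA_PACKET_BITS / LINK_RATE_BPS  # 0.818 ms
t_c  = CMD_PACKET_BITS / LINK_RATE_BPS   # 0.025 ms
t_n2 = t_d + t_c                         # eq. (3.1), with T_P negligible

print(round(t_n2 * 1e3, 3))   # -> 0.843 (ms), i.e. ~0.85 ms
print(round(t_d / t_n2, 3))   # eq. (3.4): steady-state efficiency ~0.97
throughput = DATA_BITS / (DATA_PACKET_BITS + CMD_PACKET_BITS) * LINK_RATE_BPS
print(round(throughput / 1e3))  # eq. (3.6): ~949 Kbps of compressed data
```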
3.5 Global Controller

In addition to the local controllers for each block, a global controller binds all the components together into a system and is responsible for the correct operation of the NIN. It is responsible for the synchronization of data flow between blocks, the generation of the respective clocks, and the engagement or bypass of certain blocks in the data path. The controller also contains a power management module. Most importantly, the global controller interprets the different commands from the MIM and changes the configuration and operation of the NIN accordingly. The operation of the NIN is controlled by several configuration registers whose values can be changed by the MIM.

3.5.1 Modes of Operation

The system operates in three modes controlled by the configuration registers. The first two are monitor modes in which no signal processing is employed and the data is transmitted without compression. These modes are called monitor spikes (MS) mode and monitor LFP (MF) mode, where LFP refers to local field potential. Because data is uncompressed in the monitor modes, the obvious bandwidth limitation dictates that only one selectable channel can be active at a given time. In MF mode, the system bypasses the high gain amplifiers and routes the analog signals directly to the ADC; the signal is transmitted out without any further processing and thus contains the information embedded in the LFP as well as the neural spikes. In MS mode, the amplifiers and filters are activated, resulting in a signal that contains only the neural spikes and is free of LFP effects. The gain of the amplifiers is adjustable to any of four different settings.

The third mode is called the compression mode (CM). This mode engages the digital signal compression blocks in the NIN and is capable of processing up to 32 channels simultaneously. The configuration registers dictate which channels will be sampled and which, if any, will be ignored. The channels are sampled in a time division multiplexed manner, maintaining separate gain settings for each channel. The ADC output is then fed to the DWT block, which computes the wavelet coefficients up to four levels of scaling. These coefficients are generated in a predefined sequence. They are then passed through the threshold block, which keeps the significant coefficients and discards the insignificant values. The MIM assigns appropriate values to each threshold, which are stored on the NIN. The resulting signals are sent to the RLE block, which compresses the data and forwards it to the communication controller for transmission to the MIM.

Upon reset, the system samples channel one at 25Ksamples/s in MS mode. The MIM receives and processes the uncompressed data to compute appropriate thresholds and transmits the resulting values back to the NIN. Then the NIN is configured to monitor the next channel and the procedure repeats. Once the thresholds for all the channels have been set, the NIN can be reconfigured for operation in CM mode and begin processing all channels simultaneously.

3.5.2 Power Management

The global controller also manages power for the NIN. Two main techniques are used to conserve power. First, dynamic frequency scaling (DFS) is employed whenever the NIN system is not required to function at its maximum capability; the clock frequency is scaled down, which linearly affects the dynamic power dissipation. A dedicated block in the global controller is used to generate the clock signals for all the NIN blocks. Second, the controller disables any components not on the signal data path for a given mode of operation, reducing unnecessary transistor switching. The analog, digital and RF modules are being designed to work with separate supply voltages, each working at the minimum required voltage to conserve power. Reducing the supply voltage increases signal propagation delay; however, as the NIN has a pipeline-based architecture, the performance of its components is not limited by signal propagation delays, which allows for the use of dynamic voltage scaling (DVS) in addition to DFS in future generations of the design.

3.6 RF Data and Power Receiver

A transceiver is responsible for inductively transferring power from the MIM to the NIN as well as providing the master clock to the NIN. The transceiver also allows bidirectional data transfer between the two modules. Two pairs of perpendicularly aligned coils are required for these transfers. Since the communication protocol is half-duplex, only one module transmits at a given time, which allows a single pair of coils to be used for both directions of communication. There has been extensive research in designing high data rate, high efficiency RF transceivers and antenna coils, and various designs have been published recently [48, 51]. The design presented in [51] can be modified accordingly and used in our application. Fig. 3.10 shows the functional diagram of the RF power, clock and data transmission circuit from MIM to NIN. It must be noted that the design of the transceiver is outside the scope of this thesis; it is described here to help understand the blocks the global controller must interface with.

Fig. 3.10. Functional diagram of circuits for inductive power and clock transfer and RF data transmission from MIM to NIN.

The NIN and the MIM are separated by the skull and the scalp, and the coils transfer the power and data inductively. An LC tank is used in conjunction with a power amplifier to generate the power carrier, which is received at the NIN coil, rectified, and regulated to generate a constant and stable power source for the other blocks in the NIN.
The data is transferred using frequency shift keying (FSK) modulation and a high quality LC tank. The receiver amplifies and demodulates the received signal to regenerate the transmitted data bits. The RF transceiver covers an area of 2.2mm² and consumes up to 7.3mW of power. The coils for the power and data transfer, however, are expected to be 10mm x 10mm.

3.7 Results & Performance Evaluation

A system enabling data telemetry using very high data compression of neural recordings has been presented. All the necessary components of the system have been detailed, including the analog frontend, the digital neural signal processor and the RF data and power transceiver. The analog frontend interfaces with the electrodes implanted in the cortex. The neural signal processor controls the functionality of the NIN and employs the compression engine to compress the neural signals while maintaining high signal fidelity. The processor also manages the different modes of operation and the various configurations built into the design. Finally, a communication protocol has been presented that enables error-free transmission of data to the MIM and reception of commands at the NIN.

A 0.5μm CMOS process was used to design all the blocks of the neural compression engine. The DWT block has been fabricated; the 3mm x 3mm chip was shown in Fig. 2.25. The controller was synthesized using OSU's Standard Cell Library [47], while all other blocks were custom designed for low power and low area. The active components of the prototype 32-channel, 4-level DWT implementation occupy roughly 3.84mm². The layout for the threshold and RLE blocks requires about 0.95mm² of area. Thus the combined compression system is expected to require approximately 5.75mm², including global routing, in a 0.5μm process. The tested DWT module consumes only 3mW of power while processing 32 channels at 25Ksamples per second, or equivalently 95μW per channel. The power consumption per channel is directly proportional to the neural data sampling frequency.

To test our designs and algorithms for the compression engine, a stream of experimentally obtained 10-bit neural data was processed through the compression system and the resulting transform coefficients were used to reconstruct the neural signal. The results were compared to the original signal to measure the quality of spike reconstruction versus the compression obtained. This analysis was performed for several different zeroing threshold values. For a fair comparison, the same threshold value was used for all channels and levels.

As discussed in the previous chapter, root mean squared (RMS) error and entropy are used as the primary measures to evaluate the performance of our system. RMS error is a measure of the average difference between the original and the reconstructed signal. Shannon's entropy gives the theoretical limit to the achievable lossless data compression for a given data set.

For a given spike train, Fig. 3.11 plots the number of detected spikes, the RMS error, the entropy of the transmitted sequence and the RLE compression achieved with respect to the threshold. The plots confirm the anticipated tradeoff between compression and RMS error. As expected, when thresholding is not employed, the RLE does not result in any compression, while entropy is at its maximum and RMS error at its minimum. The RMS error never goes to zero because of quantization noise.
As the threshold value (and thus the number of zeros) increases, the RLE compression approaches the theoretical limit set by the entropy, which demonstrates the effectiveness of our design. A threshold increase also results in an increase in the RMS error. This region of operation removes noise from the signal while preserving all the neural spikes and their shapes. At very high threshold values, the system starts distorting neural spikes, which results in a drop in the number of spikes detected at the output. The point of operation, thus, must be selected just before this region. The optimal point of operation may vary from one application to another depending upon the quality of reconstruction desired. Because of this direct tradeoff between RMS error and compression ratio, the zeroing threshold must be chosen to match application requirements, i.e. the available bandwidth, the required quality of signal reconstruction, and the power available for data transmission.

Fig. 3.11. System performance as a function of threshold value for the neural data set used in our experiments.

Fig. 3.11 shows the results for a sample spike train which contains 27 spikes. Almost perfect reconstruction is achieved for a zero threshold, but at the cost of a high data rate. Beyond a threshold value of 110, the system starts losing spikes for this particular data set. Thus, the region of operation should be set between threshold values of 50 and 100, depending on the application and the desired spike reconstruction quality. The measure of reconstruction quality depends highly on the application-specific spike detection and classification algorithms employed by the neuroscientist. The DWT chip has been tested and works as designed, with an excellent match between simulated and experimental results. Because the RLE block has not yet been fabricated, the data presented in Fig. 3.11 is based on a combination of measured and simulated results.

Fig. 3.12 shows the same spike at four different de-noising thresholds and compression ratios. For our prototype 32-channel design, a conservative threshold value of 80 resulted in an output data rate of less than 370Kbps, providing a compression of more than 20 times over the 8Mbps rate of unprocessed data. The author is not aware of any other publications in which the spike shapes have been maintained; thus we are unable to compare our results against other methods of neural signal compression.

Fig. 3.12. (a) Original signal. (b) 2 times compression. (c) 16 times compression. (d) 62 times compression.

To test the functionality and integrity of the design, 32 channels of neural spike trains were simulated and sampled at 25Ksamples/sec per channel. Each channel contains spikes at a high firing rate of 90 spikes per second; the channels are multiplexed into a single stream and provided to the compression engine. The system was tested in both the monitor and compression modes. The monitor mode resulted in an exact reconstruction of the original signal. For the compression mode, Fig. 3.13 shows a comparison of the original signal with a signal reconstructed by keeping only 2.5% of the coefficients, which means each bit of compressed data represents 40 bits of raw signal. No data loss due to overflows was recorded during the transmission. The same thresholds were used for consistency across channels, resulting in similar reconstruction quality for all channels.
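The raw-rate and compression figures quoted above follow directly from the channel count and sample format; a quick check (illustrative Python):

```python
CHANNELS, SAMPLE_RATE_HZ, BITS_PER_SAMPLE = 32, 25_000, 10

raw_bps = CHANNELS * SAMPLE_RATE_HZ * BITS_PER_SAMPLE
print(raw_bps / 1e6)    # -> 8.0 (Mbps) of unprocessed data

print(raw_bps / 370e3)  # threshold of 80: ~21.6x compression
print(1 / 0.025)        # keeping 2.5% of coefficients: 40 raw bits per
                        #    transmitted bit, i.e. a 40x compression ratio
```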
It can be seen that, on average, each packet was filled in about 4ms. Corresponding to the highest level of activity in the neurons, the minimum time taken to fill a packet was recorded as 2.6ms, which is safely above the 1.7ms lower limit needed to avoid data overflow. However, in experiments using very low threshold values, packets have been observed to fill faster than they can be transmitted, which is expected and demonstrates the need for setting thresholds that result in greater data compression. Our ongoing work aims at raising the thresholds to their highest possible value to enable spike sorting to take place simultaneously [52].

Fig. 3.13. Part of the 32 neural signals received and reconstructed by the MIM, at a compression ratio of 40, in comparison to the original signal acquired by the NIN. Despite some apparent signal loss, spikes are easily discriminated.

Fig. 3.14. Histogram of time taken in milliseconds by the compressed neural data to completely fill each Data Packet.

Fabricated in 0.5μm CMOS, the DWT block requires only 3.84mm² of area to process 32 channels of data at 4 levels in real time, while consuming only 3mW of power. The layout of the neural signal processor covers an area of 13.8mm² with a power dissipation of 8.8mW. The analog frontend is expected to cover an area of 1.77mm² with a power dissipation of 1.1mW. The RF transceiver covers an area of 2.2mm² and consumes up to 7.3mW of power. Thus at peak performance, the NIN module is expected to consume 11.4mW of power, covering an area of 17.77mm² in 0.5μm technology. The coils for the power and data transfer, however, are expected to be 10mm x 10mm wide. Table 3.1 summarizes the estimated area and power requirements for all the modules in the NIN. Empirical measurements show that this area would be reduced by roughly a factor of 10 if implemented in a 0.18μm process, reducing the chip area to 1.8mm² with power consumption in the proximity of 5mW. However, this system is sufficiently small for implantation even in 0.5μm CMOS.

Table 3.1. Estimated area and power requirements for implantable hardware modules in 0.5 micron CMOS technology.

  Module                                 Area (mm²)   Power (mW)
  Analog Frontend Module                    1.77          1.1
  Digital Neural Signal Processor          13.8           8.8
  *RF Data and Power Transceiver            2.2           7.3
  *Complete NIN block for 32 Channels      17.77         11.4

  *Area numbers do not include coil area for RF transceiver.

This highly configurable, area-power efficient hardware results in large compression rates while extending flexibility to the neuroscientists to choose the 'perfectness' of reconstruction of neural data from a multitude of neural signals. The small size and low power consumption of the system make it highly suitable for implantable high-density microelectrode array devices.

3.8 Hardware Implementation and FPGA Test

Once the algorithms were designed and tested in Matlab, the hardware for the complete NIN digital system, including the Multi-Level Threshold block, the Run Length Encoder, the Communication Controller, the Analog Frontend Controller and the Global Controller, was designed in HDL using Verilog. In addition, the design of the DWT was improved to include channel selectivity and dynamic frequency scaling. To enable a high degree of testability, the analog frontend controller was modified to receive input from two different sources, one serial and one parallel. Test points were inserted in the design to enable testing of data outputs from each block.
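The dynamic frequency scaling added to the DWT can be illustrated with a short model: the processing clock only needs to run fast enough to service the channels currently enabled. The sketch below shows that relation under an assumed cycles-per-sample cost; it is an illustration, not the chip's actual clock-management logic.

SAMPLE_RATE_HZ = 25_000        # per-channel neural sampling rate
CYCLES_PER_SAMPLE = 16         # assumed DWT datapath cost (illustrative)

def required_clock_hz(active_channels: int) -> int:
    """Minimum DWT clock for real-time service of the enabled channels.
    Dynamic frequency scaling drops the clock, and hence the dynamic
    power, proportionally when channels are deselected."""
    return active_channels * SAMPLE_RATE_HZ * CYCLES_PER_SAMPLE

for n in (32, 16, 4):
    print(f"{n:2d} channels -> {required_clock_hz(n) / 1e6:.1f} MHz")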
To test the design of the digital system in the absence of the prototype analog frontend and RF transceivers, the system was mapped onto an Altera Cyclone III FPGA and interfaced with a custom-designed printed circuit board containing the analog frontend components. The amplified signals were sent to the PCB, designed with discrete components, containing an analog multiplexer and an analog-to-digital converter. The control signals for the A/D were generated by the analog frontend controller present in the FPGA. The time-multiplexed outputs from the board are received by the FPGA, which processes the signals as per the settings. The input data was received at the analog frontend, passed through the various blocks and observed at the output of the communication controller. The RF interface was replaced with a wired connection between the NIN-FPGA and a host computer acting as the MIM. The control signals for the transceiver were generated by the communication controller in the FPGA. The signals were then received by the MIM computer through a data acquisition system. The MIM completes the necessary processing and generates the command packets, which are finally received by the NIN in the FPGA. The basic functionality of the NIN was tested and verified; however, since the MIM emulation was very rudimentary, elaborate testing of the NIN functionality and its various configurations and modes of operation was not possible. These tests allowed us to exercise the system using real input signals. In the process, the system could be tested against situations that were ignored or not represented accurately during simulations.

Although the basic functionality was tested, the power consumption and area requirements of the system implemented on the FPGA could not be used to estimate the requirements of the system implemented in the ASIC, because of the difference in fabrication technology and the various architectural overheads associated with the FPGA. As a result, the prototype was not able to give an accurate estimate of the power consumption of the system.

3.9 VLSI Layout of Digital System on Chip in 0.13 Micron CMOS

Once the design was functionally verified on the FPGA, the layout was prepared for fabrication in the IBM 8RF 0.13 micron CMOS process. This involved preparation of the various components of the ASIC for automatic place and route, including synthesis and routing for all the different digital blocks and generation of the different memory blocks. Third-party low-Vt standard cell libraries and low-Vt SRAM generators provided by ARM were used for low-power operation. Synopsys Design Vision was used for logic synthesis. Once all the components were available, floorplanning was completed, clock trees were created, and the final place and route was completed with the inclusion of pads with electrostatic discharge protection. The layout was passed through the design rule checks (DRC) for the rules outlined by the fabrication foundry. Fig. 3.15 shows the final layout of the system.

The chip layout was generated to enable measurement of chip area and to allow simulations for power measurements. The chip may be fabricated at a future date. Note that for area optimization all the blocks have been merged together into a single unit. The core of the design is 1.1mm x 1.1mm, whereas including the IO and power pads the area of the NIN chip becomes 1.9mm x 1.9mm = 3.61mm². Simulation results on the core show 800μW of power consumption with a power supply of 1.2V.
Fig. 3.15. Layout of the complete NIN digital system in 0.13 micron CMOS technology.

3.10 Analysis

The digital neural signal processor described above has exceeded the original design goals for area and power. However, while developing this chip, some limitations and unexpected challenges were observed; they are described below. The research described in the next two chapters was undertaken in order to address these challenges and demonstrate how the performance of neural recording systems can be improved for various applications.

Optimal design of the communication controller requires information about the error characteristics of the channel within its intended environment, including across skin. Although inductively coupled transmitters and receivers have been used in earlier studies, there has been no effort to quantify the error characteristics of the channel. For example, the design of the communication protocol and the packet format rests on the distribution of bit errors in a packet. This affects the decision on the use of error correction vs. error detection and thus affects the total number of retransmissions necessary for error-free transmission. The number of parity bits needed depends on the maximum possible number of errors in a single frame. Similarly, the number of consecutive packets in error defines the transmission buffer size at the receiver. Because this error profile has never been reported in the literature, extensive trace collection and analysis of this error process was undertaken, as described in the next chapter.

The DWT-based compression engine design has its own merits and demerits. The DWT-based neural recording system developed in this research permits a large amount of processing to be completed on-chip, retaining much of the information in the spike train while needing only a fraction of the coefficients to be transmitted off-chip. The result is the ability to reconstruct the signal as faithfully as needed for a particular application. This, however, is achieved at the tremendous cost of processing every sample of every neural channel, which requires 8 additions and 8 multiplications per sample, and then deciding whether the resulting coefficients need to be kept or discarded. This is overkill for applications that do not require signal reconstruction (an issue that is still highly debated in the neural engineering community) or applications that can afford some degree of inaccuracy in spike sorting capability (again, this issue is debatable as no clear physiological need has been established). The research in this chapter has shown that it is possible to process many channels within an implanted chip; however, the presented solution does not scale well to neural systems demanding simultaneous processing of a higher number of channels, say on the order of a few hundred or more. With the DWT-based design, area, power and fabrication costs make it impossible to complete this level of processing on an implantable system. It thus becomes imperative to search for an extremely hardware-resource-constrained and computationally efficient set of features that can provide high spike sorting performance at a fraction of the area-power cost. A set of features that meets this goal was explored, as presented in Chapter 5.
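The scaling argument can be made concrete with a back-of-the-envelope count. Assuming the per-sample cost quoted above (8 additions and 8 multiplications for the 4-level transform) and the 25Ksamples/sec channel rate used in this work, the sketch below tallies the arithmetic load as the channel count grows; the figures are illustrative estimates, not measurements.

SAMPLE_RATE = 25_000   # samples/sec per channel
ADDS_PER_SAMPLE = 8    # 4-level DWT cost quoted in the text
MULTS_PER_SAMPLE = 8

for channels in (32, 256, 1024):
    adds = channels * SAMPLE_RATE * ADDS_PER_SAMPLE
    mults = channels * SAMPLE_RATE * MULTS_PER_SAMPLE
    print(f"{channels:4d} channels: {adds / 1e6:6.1f}M adds/s, "
          f"{mults / 1e6:6.1f}M mults/s")

At a thousand channels the multiplier load alone grows past 200 million operations per second, which motivates the multiplier-free features developed in Chapter 5.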
4 Channel Characterization for Implant to Body Surface Communication

4.1 Introduction

To achieve reliable communication between nodes on an implant-to-body-surface biotelemetry link, knowledge of the power and data transfer characteristics of the transceivers and the channel is required. There has been recent interest in the field of Body Area Networks; however, it is still a new field and no standards for the design of power and data transceivers or communication protocols have been adopted. The search for the most suitable transceiver is a topic of continuing research, and hence there are no established limitations, requirements or capabilities that the transceiver hardware should conform to. Recent research has focused on employing inductive links for data communication [5, 9, 48, 53]. The transceiver design reported in [53] is distinctive in that it is capable of transmitting power and data over the same inductive channel using a technique called 'back telemetry'. For the rest of the chapter we call such inductive power/data transceivers 'Inductively Coupled Transceivers' (ICTs). Implantable ICTs are of particular interest to the biotelemetry research community because they are based on a wireless communication technique that allows low-power communication at reasonable data rates. It is expected that ICTs will be used extensively in biotelemetry applications. Although various ICT designs have been presented, a concerted effort and detailed experimental analysis are necessary to quantify the performance and limitations of these transceivers before they can be used in implantable applications.

Communication of digital data sees errors at two levels, the bit level and the packet level. These random errors observed over time can be modeled as a random process. A deeper understanding of this error process is necessary for the optimal design of various components of the communication system, including the use of error detection and correction codes, the protocol design, the packet size, and the hardware resources required to allow retransmission of erroneous packets.

The purpose of this study is twofold. Firstly, it is to estimate the performance limitations of the ICT presented in [53] under bandwidth and power constraints by studying the effects of various orientations of the transmitter and receiver. Secondly, the error patterns of the channel across the skin for implant-to-body-surface power/data communication need to be characterized to enable the design of protocols for reliable data communication. Trace collection has been used to model the error processes of various wireless communication standards including 802.11x WLANs [54-55] and 802.15.4 LR-WPANs [56-57]. The same approach is followed here to analyze the ICT link.

4.2 Wireless Power / Data Transceiver

The neural recording system consists of two modules, the Implanted Module (IM) and the Surface Module (SM). The IM is implanted inside the body, preferably within a couple of centimeters of the body surface. It needs to be powered wirelessly from an external source. The SM is placed on the surface of the body and is supplied with continuous power from a battery. The data transmitter is part of the IM, while the data receiver is part of the SM. Conversely, the power transmitter is part of the SM while the power receiver is part of the IM. Each module contains a flat inductive coil etched on a printed circuit board to allow inductive coupling. The SM generates an AC signal of 13.56MHz which is coupled to the IM coil.
This frequency belongs to the FCC-mandated ISM band, which is allocated for industrial, scientific and medical research. The IM houses a power rectifier that converts the signal received at the coil into a DC voltage. The detailed design is presented in [53]. The rectifier has been designed to provide 20mW of power to the IM, which includes the frontend analog and digital processing parts in addition to the transceiver.

The system uses 'passive back telemetry' to transmit data over the same power channel. This is achieved through the use of Load Shift Keying (LSK). Under normal operation, when there is no data to be transmitted, the IM is constantly receiving power from the SM. However, when the IM wants to transmit a 'one' bit, it creates an open circuit at the power receiver coil for 0.2μsec. As a result, the IM stops receiving power from the SM for this duration. This change in 'load' is sensed at the SM as a change in current, and a 'one' is detected by the data receiver. Thus at the IM the power receiver coil also works as the data transmitter coil, whereas at the SM the power transmitter coil also works as the data receiver coil. The current design is capable of data transfer at a maximum rate of 500kbps. The transmitter power is fixed at 1mW, which corresponds to 2nJ of energy per transmitted bit.

Both the power and data transfer depend on a single-coil inductive link. Inductive links are highly directional, and the quality of induction depends heavily on the orientation and alignment of the coils. The distance between the two coils also affects the amount of induction achieved. Since in practical situations it is not possible to guarantee perfect alignment between an implanted coil and a surface coil, it is important to study the effect of relative coil orientation on the quality of data transfer, to enable the design of suitable communication protocols that maximize data throughput.

4.3 Experimental Setup

4.3.1 Trace Collection

Trace collection refers to the transmission of a large number of packets from the transmitter and the reception of those packets at the receiver. Since the contents of the transmitted and received packets are known, any errors in transmission can be identified at the bit and packet levels. A bit-wise XOR operation on the transmitted and received packets results in a 'zero' if the bits match, and a 'one' if the bit has been flipped during communication. Thousands of packets need to be transmitted and received in order to formulate an unbiased model. The statistics that can be extracted from these traces are the Packet Error Rate (PER) and the Bit Error Rate (BER). PER is the fraction of transmitted packets that are received with at least one bit in error:

PER = Packet_Errors / Total_Packets.     (4.1)

BER is the fraction of transmitted bits that are received in error:

BER = Bit_Errors / Total_Bits.     (4.2)

Some papers report the Packet Reception Rate (PRR) instead of PER, which is related to PER by

PRR = 1 − PER.     (4.3)

To our knowledge, this is the first thorough trace collection effort for inductively coupled transceivers on biotelemetry links. For this study, error traces of approximately 3 million packets were collected, where each transmitted and received bit was logged for detailed analysis.

4.3.2 Physical Setup

The trace-collection setup is depicted in Fig. 4.1. It consists of a surface module (SM) and an implantable module (IM).
The SM houses a power transmitter in addition to a data receiver, whereas the IM houses a power receiver in addition to a data transmitter. The power source is connected only to the SM. A data acquisition (DAQ) card is connected to both the SM and the IM to provide the data to be transmitted by the IM and to record the data received at the SM. The DAQ card is connected to a host PC running a LabView application that continuously retrieves data from the receiver and logs it. The SM and IM are placed in a Faraday cage to minimize the effect of external electromagnetic noise on the data communication.

Fig. 4.1. Experimental setup used for transceiver characterization.

The size of the packet is 820 bits, and the packet format is detailed in previous chapters as well as in [58]. The packet is composed of 800 bits of data, 10 bits of addressing information and 8 bits of CRC.

4.4 Transceiver Performance Analysis

The performance of the power and data transceiver was examined to determine the operating conditions for which the transceiver provides acceptable performance.

4.4.1 Transceiver Orientation Diversity

Packets were collected over a period of weeks, at different times of the day, in a typical lab environment. Each trace collection was characterized by the location of the transmitter and receiver, the separation between them, the tilt of the coils, and the misalignment from the center. Fig. 4.2 shows the scaled diagram of the different settings used to measure the performance of the transceiver. The amount of misalignment is represented by 'z', the distance by 'x' and the tilt by θ. For the scaled Fig. 4.2, x=10mm, z=10mm and θ=30°. The transmitter has a fixed transmission power of 1mW. For each setting, a total of 100,000 packets was transmitted in segments of 5000 each, with two minutes of inactivity between segments. This corresponds to 200 seconds of packet transmission for each segment. As a result, a total of 82 million bits was transmitted for each setting. The data was transmitted at a rate of 500kbps, which corresponds to packets being transmitted at a rate of 25 packets per second.

Fig. 4.2. Scaled diagram of the different settings used to measure the performance of the transceiver, including the distance, the tilt and the misalignment.

4.4.2 Transceiver Characterization Results

Figures 4.3-4.5 plot the BER and PER versus the three parameters of distance, misalignment and angular tilt. The BER is plotted on a logarithmic scale. As expected, both the BER and PER deteriorate with increasing values of these parameters. It must be noted that the minimum value plotted in each plot corresponds to zero errors, i.e. no errors were detected for those settings. However, because BER can only be resolved down to 1E-8 (given the number of transmitted bits), the value 1E-8 is plotted as the floor. Similarly, the minimum plotted value for PER is 1E-5. It must be noted from Figs. 4.3-4.5 that the initial settings of 5mm distance, 11 degrees tilt and 5mm misalignment do not result in any transmission errors. These are the typical settings expected in practical situations when the transmitter is implanted in the body and the receiver is placed on the body surface. Fig. 4.3 shows that a nominally acceptable BER value of 1E-4 is reached at 14.5mm of distance, with a PER of about 10%.

Fig. 4.3. Bit Error Rate and Packet Error Rate versus the Distance in mm.
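The PER and BER values in these plots follow directly from (4.1) and (4.2) applied to the logged traces. The sketch below illustrates that XOR-based computation, assuming packets are stored as 0/1 bit arrays; the synthetic channel and the function name are illustrative only.

import numpy as np

def trace_stats(sent: np.ndarray, received: np.ndarray):
    """Compute PER and BER per (4.1)-(4.2). `sent` and `received` are
    (num_packets, bits_per_packet) arrays of 0/1 values."""
    errors = np.bitwise_xor(sent, received)        # 1 wherever a bit flipped
    ber = float(errors.sum()) / errors.size
    per = float(errors.any(axis=1).sum()) / errors.shape[0]
    return per, ber

# Illustrative use with a synthetic channel flipping bits at ~1E-5.
rng = np.random.default_rng(0)
sent = rng.integers(0, 2, size=(10_000, 820), dtype=np.uint8)
received = sent ^ (rng.random(sent.shape) < 1e-5).astype(np.uint8)
print(trace_stats(sent, received))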
Fig. 4.4 shows that 11mm of misalignment yields a BER of 3E-5, beyond which the BER increases sharply. It is interesting to note that from 9mm to 11mm the BER remains almost constant (2.6E-5 to 3.1E-5) while the PER increases rapidly (0.4% to 2.5%). This suggests that, although the number of bits in error remains approximately the same, the errors are spread over a larger number of packets. Conversely, it can be seen that at 12mm and 13mm misalignment the BER increases sharply (0.25% to 1.4%) while the PER barely increases (84% to 100%), suggesting an increase in the number of bit errors per packet. Fig. 4.5 shows that at 34 degrees of tilt a BER of 2E-4 is observed, with a corresponding PER of 2.3%. The BER and the PER show similar amounts of change with increasing tilt.

Fig. 4.4. Bit Error Rate and Packet Error Rate versus the Misalignment in mm.

Fig. 4.5. Bit Error Rate and Packet Error Rate versus the Angular Tilt in degrees.

Disturbance Effect: The transmitter and receiver were physically disturbed to simulate the effect of body movement, which is expected to affect all three parameters of distance, tilt and misalignment. The disturbance was created as a combination of all six degrees of freedom, including the three translational and three rotational axes. Although the disturbance magnitude could not be measured precisely, the relative initial displacement of the transceivers was recorded to be ±1.5mm from the center points. The disturbances were created at a frequency of about 2Hz, but each disturbance resulted in higher-frequency damped vibrations because the Tx and Rx were suspended in air using copper wires. Fig. 4.6 shows the number of packets in error over time for the case when disturbance was added to the initial setting of 5mm distance between the transceivers. To plot the results, the 100,000 packets have been divided into bins of 100 packets each. As a result, the 1000 bins show the number of erroneous packets in each bin. Note the 'burst' nature of the errors, i.e. whenever there is an error it is very likely that there will be more packets with errors in the same vicinity.

Fig. 4.6. Packet errors versus time for the disturbance effect test. The 100,000 packets are divided into bins of 100 packets; the plot shows the number of packet errors in each bin.

Fig. 4.7 compares the BER and PER at distances of 5mm and 10mm with and without disturbance. It can be seen that disturbance has a significant effect on the quality of the channel and the total number of errors.

Fig. 4.7. Bit Error Rate and Packet Error Rate versus the Distance in mm, showing the effect of physical disturbance on BER and PER.

4.5 Channel Memory

Most wireless channels show 'burstiness' in the packet-level and bit-level errors. Thus, if a packet (or bit) error has been seen in transmission, the probability that the subsequent one or more packets (or bits) will be in error increases. This increase in probability is a characteristic of the channel and varies from one channel to another. This phenomenon is called channel memory. The degree to which the effects of a particular bit error are carried over time can be quantified by measurements of the packet-level channel memory, PM, and the bit-level channel memory, BM, respectively [57]. In essence, channel memory provides an estimate of the average number of consecutive errors expected whenever an error is observed.

4.5.1 Bit Level Memory

Knowledge of the amount of bit-level memory of a channel helps in the design of error detection and correction schemes.
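As detailed in the remainder of this subsection, this memory is estimated from correlograms of the binary error traces. The sketch below illustrates one way such an estimate can be computed; the 0.15 cutoff follows [54], while the function name and synthetic trace are ours.

import numpy as np

def channel_memory(errors: np.ndarray, cutoff: float = 0.15) -> int:
    """Estimate the memory of a binary error trace (1 = error) as the
    last lag at which the normalized autocovariance stays significant,
    so that a burst of n consecutive errors yields a memory of n - 1."""
    x = errors.astype(float)
    mu, var = x.mean(), x.var()
    if var == 0:                      # error-free trace: no memory
        return 0
    for lag in range(1, len(x)):
        r = np.mean((x[:-lag] - mu) * (x[lag:] - mu)) / var
        if r < cutoff:
            return lag - 1
    return len(x) - 1

# Illustrative: a bursty trace shows nonzero memory.
trace = np.zeros(10_000, dtype=int)
trace[500:504] = 1                    # a 4-bit error burst
print(channel_memory(trace))          # -> 3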
Correlograms are commonly used to measure the memory of a process [54-56]. A correlogram is a plot of the autocorrelation function, RXX(t), of the process. In this case, the process is a bit-level trace consisting of zeros (when the bit is correct) and ones (when the bit is inverted) as a function of time, t:

RXX(t) = E{(Xτ − μ)(Xτ+t − μ)} / σ².     (4.4)

The correlogram was computed for each erroneous packet over the length of the packet. The lag 't' for which RXX(t) becomes insignificantly small, less than 0.15, was taken as the measure of bit-level memory [54]. A histogram of the memories was computed and a cumulative distribution function is presented.

4.5.2 Packet Level Memory

A similar correlogram-based analysis was applied at the packet level, in which each erroneous packet is represented by a one and each correct packet by a zero. In order to observe time-dependent effects on PM, this measurement was performed separately for all 20 segments of each setting. A histogram of the memories of these individual traces was computed and a cumulative distribution is provided to show the trend. For the test case of 5mm distance with disturbance, the histogram and cumulative distribution of packet-level memory are shown in Fig. 4.8, where the red (line) plot shows an 85% chance that the errors will be solitary and a 95% chance that there will be four or fewer consecutive errors. No error correction capability is assumed for the memory calculations; however, if error correction were applied, the packet memory would be expected to reduce even further. Note that a memory of zero means that either zero packets are in error or solitary packets are in error. A memory of one means that two consecutive packets are in error, a memory of two means that three consecutive packets are in error, and so on.

Fig. 4.8. Histogram of packet level memory for 5mm distance with disturbance.

Fig. 4.9 gives the bit-level memory of the channel, showing the number of consecutive bits in error. It can be seen that 79% of the time the bit errors are solitary, and 98% of the time there are four or fewer consecutive bits in error.

Fig. 4.9. Histogram of bit level memory for 5mm distance with disturbance.

For the movement artifacts at 5mm distance there were 913 erroneous packets out of 100,000 and 1909 bits in error out of 82 million, giving PER = 9.13E-3 and BER = 2.33E-5. Fig. 4.10 shows the histogram of the number of bit errors in each erroneous packet. Only the packets containing errors are considered for this plot. It can be seen that most packets have only one bit in error, while 89.4% of packets have three or fewer bits in error. If error correction techniques were applied that can correct up to 3 errors, then retransmissions could be reduced by 90%.

Fig. 4.10. Histogram of number of bit errors per erroneous packet for 5mm distance with disturbance.

4.6 Channel Characterization Across Animal Skin

4.6.1 Physical Setup

Several experiments were conducted using chicken skin to model the general effects of skin tissue on inductive power transfer and data communication. The experiments were conducted without a Faraday cage in the open lab environment, containing many sources of electromagnetic noise. For the 5mm distance measurements, a 4mm thick layer of skin was sandwiched between the transmitter and receiver coils. For the 10mm distance measurements, layers of skin adding up to 8mm in thickness were used. However, tilt and movement effects could not be measured with the skin between the transmitter and receiver.
Fig. 4.11 shows the setup used for these measurements.

Fig. 4.11. Experimental setup used for channel characterization including animal skin.

4.6.2 Channel Characterization Results

The results for the case of 10mm distance with an 8mm thick layer of skin between the transmitter and receiver are discussed in this section. A total of 0.5 million packets was transmitted and a PER of 7.43E-3 was recorded. The 5000 bins of 100 packets each are shown in Fig. 4.12. The cumulative distribution of packet memory is shown in Fig. 4.13, where the red plot shows an 84% chance that there will be four or fewer consecutive errors and an 89% chance that there will be 7 or fewer consecutive errors.

Fig. 4.12. Packet errors versus time for the case of 10mm distance with animal skin. The 500,000 packets are divided into bins of 100 packets; the plot shows the number of packet errors in each bin.

Fig. 4.13. Histogram of packet level memory for 10mm distance through animal skin.

Fig. 4.14 gives the bit-level memory of the channel, showing the number of consecutive bits in error. It can be seen that 68% of the time the bit errors are solitary, and 97% of the time there are 3 or fewer consecutive bits in error. For the skin effects at 10mm distance, there are 3716 erroneous packets out of 500,000 and 15144 bits in error out of 410 million, giving PER = 7.43E-3 and BER = 3.69E-5. Fig. 4.15 shows that most packets have only one bit in error; 66% of packets have three or fewer bits in error and 89.8% have 9 or fewer bits in error. If error correction techniques were applied that can correct up to 3 errors, then retransmissions could be reduced by 66%. It must be noted that one packet was found with 69 bits in error, which means the error detection capability should be able to handle such cases of large numbers of errors in a packet, even though most of the packets have only a few bits in error.

Fig. 4.14. Histogram of bit level memory for 10mm distance through animal skin.

Fig. 4.15. Histogram of number of bit errors per erroneous packet for 10mm distance through animal skin.

Fig. 4.16 shows the comparison of disturbance artifacts for the skin-based channel and the air-based channel. It can be seen that disturbance has the greatest effect on the performance of the transceiver, while the introduction of skin also degrades the error characteristics of the channel in comparison to the air-based channel.

Fig. 4.16. Comparison of BER and PER versus distance for various experimental settings.

4.7 Forward Error Correction

There are several forward error correction (FEC) coding techniques and algorithms which provide various degrees of error correction capability. Also, different codes can handle certain types of errors better than others. Most codes have separate error detection and correction capabilities. The experiments described above have established that error detection capability is extremely important for the system. The addition of error correction capability can improve the throughput of the system and can also help in reducing the retransmission buffer length at the transmitter, saving hardware resources.

Table 4.1 shows the comparison of BER, PER and packet memory PM for different settings before and after a 3-bit FEC is applied. The minimum measurable PER is 1.0E-5 and the minimum measurable BER is 1.0E-8. Note that apart from the reduction in BER and PER, the FEC also impacts the packet memory, which dictates the amount of data-buffering capability required at the transmitter end.
Smaller memory means only a small buffer is required to hold the data until it is successfully transmitted. The cumulative distribution value of 0.9 was chosen to calculate these memories.

Table 4.1. Comparison of BER, PER and packet memory for different settings before and after 3-bit FEC is employed.

                            Before FEC                       After 3-bit FEC
  Distance  Setting         BER        PER       PM          BER        PER       PM
  5mm       Air             1.00E-08   1.00E-05  0           1.00E-08   1.00E-05  0
            Skin            1.32E-06   5.22E-04  0           5.64E-07   8.75E-05  0
            Disturbance     2.33E-05   9.13E-03  1           9.66E-06   9.70E-04  1
  10mm      Air             6.27E-06   5.13E-03  0           1.00E-08   1.00E-05  0
            Skin            3.69E-05   7.43E-03  5           2.75E-05   2.48E-03  2
            Disturbance     1.28E-04   2.74E-02  3           9.17E-05   7.52E-03  1

Fig. 4.17 shows the comparison of PER before and after the 3-bit FEC is applied. It can be seen that using as few as 3 bits of FEC reduces the PER greatly, which in turn reduces the number of retransmissions necessary for error-free data transmission.

Fig. 4.17. Comparison of PER before and after 3-bit FEC for different settings.

Fig. 4.18 shows the comparison of BER before and after the 3-bit FEC is applied. The effect of the FEC on BER is not very pronounced, which means that, although most of the packets can be corrected using FEC, the BER is contributed mostly by a few packets with a large number of bits in error. It can be concluded that, although the BER may appear high, the system needs only a few packet retransmissions to maintain a high data throughput.

Fig. 4.18. Comparison of BER before and after 3-bit FEC for different settings.

4.8 Conclusions

From the above results it can be concluded that, although it is important to keep the BER to a minimum, it is the PER and the packet memory that most affect the throughput of the system (by forcing packet retransmissions) and the hardware resources required for error-free communication. For the typical setting of 5mm distance between transmitter and receiver with no disturbance, the communication controller developed in this research and presented in previous chapters is capable of handling the communication task effectively. A departure from the typical channel parameters will still work correctly as long as the transceiver settings result in a small packet-level channel memory and the compression engine produces a reasonable output data rate. However, if the channel has a greater packet-level memory, then the transmission buffer may overflow. This, however, can be avoided by increasing the size of the transmission buffer and making appropriate changes to the packet structure. These changes are discussed in the section on future work.

5 High Density Feature Extraction

5.1 Introduction

Advances in microelectronics and nanostructures have enabled scientists to combine thousands of electrodes into microelectrode arrays [59], permitting capture of neural signals from hundreds of neurons simultaneously for use in neuroprosthetics applications and neuroscience research. However, existing implantable wireless transceivers lack the capability to transmit the large amount of data generated by these electrodes. This mandates the use of implantable signal processing systems that enable on-chip data reduction while maintaining the necessary information embedded in the neural spikes. On-chip feature extraction for spike sorting has been identified as a way to reduce transmission bandwidth [60]. Existing methods of off-chip feature extraction for spike sorting, e.g.
template matching, principal component analysis (PCA), and time-frequency transforms, are computationally too demanding to be realized in implantable hardware, as reported in the comprehensive survey [61]. For implantable systems, the area required by the circuitry is a major concern in addition to the power consumed during computations. It has been shown that memory is the largest block in neural recording systems incorporating on-chip signal processing [43]. Thus, it is highly desirable to reduce the number of computations while also using a minimum number of memory elements. Most high-performance algorithms are not scalable to 1024 channels within the area and power limitations, due to the incurred hardware cost of complex processing [43, 61]. An implanted system has to be low-power, low-area, highly accurate, automatic, and able to operate in real time [61].

This section presents a new set of features for spike sorting, named Zero-Crossing Features (ZCF), and compares their sorting accuracy against PCA features for several data sets. PCA has been known to provide good separability between spikes for channels with a reasonable signal-to-noise ratio. The ZCF have been explicitly designed to be extremely computationally efficient and to require no offline training. Figures of merit for the different stages of the system have been established. The effect of non-ideal spike detection on the performance of ZCF and PCA was studied. In addition, the computational complexity was analyzed, and a scalable hardware architecture suitable for processing thousands of channels in an implantable system is presented.

5.2 Theory

Fig. 5.1 shows the general data flow of a spike sorting system. The input signal X[n] is the recording from electrodes containing spikes from multiple neurons. S[k] is the result of spike detection, where only spike samples are kept while noise is discarded, thereby reducing the number of samples. The spikes are fed to a feature extractor which generates F[p], where the number of values generated per spike equals the number of features.

Fig. 5.1. General data flow and typical components of a neural recording and classification system.

The features are then used, either on-chip or off-chip, to classify each incoming spike as belonging to one of the possible neurons on that channel. All multichannel systems keep track of the spikes by recording a 'spike arrival time' and a 'channel identifier' in addition to the features. Most systems have two phases of operation: the 'training' phase followed by the 'acquisition' phase [61-62]. During the training phase, the neural data is transmitted uncompressed and is used by external processors to calculate different parameters, e.g. detection thresholds and feature vectors. These parameters are then relayed back to the implanted module, where they are used in the acquisition phase. These steps are repeated for all the channels. Nevertheless, ideally, a system suitable for recording from hundreds of channels should require minimal user intervention. It is important to note that the ZCF feature extraction presented in this section does not require any training. Throughout this section, the 'training' phase is used only to train detection thresholds and classification algorithms.

Although this section primarily discusses feature extraction, the performance of feature extraction also depends on the quality of the spike detector used. In addition, it depends on the quality of the spike classifier in the system.
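The stages of Fig. 5.1 can be summarized in a short software skeleton. It is a sketch of the data flow only; the function names are placeholders, and the bodies correspond to the blocks developed in the following subsections.

from typing import Iterable, List, Tuple

def detect_spikes(x: Iterable[int], pos_th: int, neg_th: int) -> List[List[int]]:
    """X[n] -> S[k]: keep fixed-length windows around threshold crossings."""
    ...  # dual-threshold detection, Section 5.2.1

def extract_features(spike: List[int]) -> Tuple[float, float]:
    """S[k] -> F[p]: map each detected spike to a small feature vector."""
    ...  # ZCF extraction, Section 5.2.2

def classify(features: Tuple[float, float]) -> int:
    """F[p] -> neuron label, typically computed off-chip."""
    ...  # e.g. Mahalanobis-distance clustering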
5.2.1 Spike Detection and Classification

Several methods for spike detection have been reported in the literature, including data transformations, derivatives and template matching [62]. All methods employ a thresholding step for spike detection, either at the front-end or later in the system. Most high-performance methods are computationally very intense and require offline training in addition to a large number of memory elements to maintain templates or temporary data. Thus, time-domain front-end thresholding is the most suitable option for high-density multi-channel implanted systems [61], where the thresholds are computed off-chip in the training phase. Using two independent thresholds, one positive and one negative, referred to herein as Dual Thresholds (DT), provides significantly improved performance compared to the 'absolute' threshold. The 'absolute' threshold is a specific instance of DT in which a single magnitude is compared to the incoming sample and the polarity of the sample is ignored. As a result, DT always performs at least as well as the absolute threshold. DT is chosen as the spike detector for this study because of its good detection performance.

Once the features have been extracted, several algorithms are available for off-chip spike classification. K-Means and Mahalanobis-distance based classification were considered for this study. Mahalanobis classification was found to perform better in almost all tests and was used for classification henceforth. Off-chip Mahalanobis clustering also allows for adaptability in clustering as channel statistics change and spike amplitudes vary over time.

5.2.2 Feature Selection and Extraction

PCA and Wavelet Transforms (WT) have traditionally been used for off-chip spike sorting; however, they are deemed unsuitable for implantable on-chip multichannel feature extraction because of the high computational cost and hardware resources required to map the neural data onto the feature space. Furthermore, both methods require offline training, and neither provides a generic set of features applicable across channels. This means each channel requires separate training, resulting in a separate set of features per channel. This high computational cost mandates the design of a new set of features that is explicitly framed to be computationally efficient and to require no offline training, while maintaining good sorting performance.

The most prominent and visually distinctive features of a set of different spikes are the relative amplitude (or energy) of the spikes, the relative position of the positive and negative peaks in the spikes, and the widths of the spikes. Each of these features can be used to separate spikes of one kind from the others; however, the ability to sort spikes based on these features falls sharply with an increase in noise level, i.e., in signals with a low signal-to-noise ratio (SNR). These attributes can, however, be combined to form a new set of features that is distinctive, resilient to noise and computationally efficient. The set of ZCF features can be expressed mathematically as

ZC1 = Σ_{n=0}^{Z−1} x(n),    ZC2 = Σ_{n=Z}^{K−1} x(n),     (5.1)

where ZC1 and ZC2 are the two features, K is the number of samples in a spike and Z is the index of the first zero crossing after the spike has been detected. Note that in a DT-based spike detection system, a spike is detected when the value of the signal exceeds either of the specified thresholds.
The value of Z in (5.1) is thus the index of the first zero crossing after a significant amount of the spike's energy has already been recorded. The features in (5.1) can also be seen as recording the pre-zero-crossing and post-zero-crossing areas of the spike. Collectively, these features are referred to as Zero-Crossing Features (ZCF) for the rest of the section. The underlying assumption in selecting these quantities as a feature set is that spikes from different neurons have different 'area' before and after the first zero crossing. Though, in theory, it is possible for multiple neurons to generate spikes with the same energy before and after the zero crossing, in practice, experimentally recorded spike waveforms show considerable differences in these two values, allowing the features to be used for spike sorting. In case a spike has no zero crossing, ZC1 contains the whole area of the spike while ZC2 remains zero. Even if multiple non-zero-crossing spikes are present in the channel, the features still maintain a degree of separability.

An average spike is about 1.5ms long, which amounts to 30 samples at a 20Ksps sampling rate. The factor K in (5.1) is thus set to 30 and is dependent on the sampling rate. The ZCF features are graphically represented in the example of Fig. 5.2. When a spike is generated, the electrode potential changes very rapidly. In this example, the spike is detected when the electrode potential crosses the negative threshold. There are still some samples before the detection point that belong to the spike and may contain useful information; it is thus desirable to store some samples in a buffer. For a sampling rate of 20Ksps, empirical results show that a buffer length of only 3 successfully captures all the useful samples and the onset of the spike, as shown in Fig. 5.2. Buffer length and spike length are the two parameters that affect the amount of required hardware resources. The buffer length depends on the type of spike detector used; the ideal detector is one that detects the spike as soon as it is generated. Also note that, as in Fig. 5.2, only one spike can be detected within any K samples.

Fig. 5.2. Graphical representation of ZCF features for a spike detected using dual thresholds.

5.2.3 Complexity Calculations and Hardware Resources

The computational complexity of implementing the two ZCF features was compared to that of implementing the two most significant components of PCA, where each of the PCA components has K dimensions. Computations for spike detection algorithms occur for each input sample, while the computations for feature extraction occur for each detected spike. To compare complexity in terms of addition operations, each multiplication is considered equal to 10 additions [62], whereas each comparison operation is considered equal to one addition. For the case of DT spike detection, the use of two thresholds means that every sample is compared against either the positive or the negative threshold, so only one computation occurs per sample. From Table 5.1, it can be seen that for each detected spike, the ZCF needs less than 5% of the computations and less than 8% of the memory required by PCA.

Table 5.1. Comparison of complexity and required memory for different algorithms.

                     DT    ZCF    PCA
  Additions           1    30*    60*
  Multiplications     0     0     60*
  Complexity          1    30*   660*
  Memory              2     5     65

  *Computations per spike.

Fig. 5.3 gives the hardware architecture of the proposed DT signal detection and ZCF feature extraction stages.
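Before walking through the hardware blocks, a compact software model of the same datapath may clarify its behavior: a 3-sample pre-detection buffer, K = 30, and signed running sums split at the first post-detection zero crossing. This is an illustrative sketch, not the RTL, and the function name is ours.

import numpy as np

K = 30          # samples per spike (1.5 ms at 20 Ksps)
BUF = 3         # pre-detection buffer length

def zcf_extract(x: np.ndarray, pos_th: float, neg_th: float):
    """Dual-threshold detection followed by ZCF extraction per (5.1).
    Returns a list of (ZC1, ZC2) pairs, one per detected spike."""
    features, n = [], BUF
    while n < len(x):
        if x[n] > pos_th or x[n] < neg_th:          # DT detection
            window = x[n - BUF : n - BUF + K]       # include buffered samples
            if len(window) < K:
                break
            # First sign change after the detection sample splits the window.
            signs = np.sign(window)
            flips = np.where(signs[BUF:-1] * signs[BUF + 1:] < 0)[0]
            z = BUF + 1 + flips[0] if flips.size else K
            features.append((float(window[:z].sum()), float(window[z:].sum())))
            n += K - BUF            # at most one spike per K samples
        else:
            n += 1
    return features

With no zero crossing in the window, z defaults to K, so ZC1 holds the whole area and ZC2 stays zero, matching the behavior described above.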
In Fig. 5.3, light colored blocks represent memory elements. A three-word FIFO is updated with every new sample to implement the buffer that ensures spike data preceding detection is not lost. When a spike is detected, the control logic uses the data stored in the FIFO and the subsequent input values to accumulate ZC1. When the zero-crossing detector (ZCD) detects a change in sign, ZC2 is accumulated until all 30 samples representing the spike have been received.

Fig. 5.3. Hardware architecture of the proposed signal detection and feature extraction stages.

It is important to note that, in a system for 1024 channels, one additional byte of storage per channel translates to 1Kbyte of storage for the system. Similarly, each additional computation per channel means 1024 additional system computations.

5.3 Methods

Neural spikes recorded from live experiments were used in conjunction with a neural signal simulator to generate signals that mimic electrode recordings. A detailed discussion of the experimental procedures used to collect the neural data can be found in [52]. A total of ten different spike shapes was used to mimic ten separate neural channels, each consisting of three spikes. Fig. 5.4 shows the 10 spike channels used in these simulations. For each dataset, a one-second 'training' phase was completed to determine the thresholds for DT, the principal components for PCA and the initial statistics for Mahalanobis clustering. The generated signals did not contain any overlapped spikes, and were generated for different values of SNR, ranging from 18dB to -5dB, as computed by

SNR = 20 log10(σx / σn),     (5.2)

where σx and σn are the signal and noise standard deviations, respectively.

Fig. 5.4. Channels and spikes used in simulations. Ten different channels consisting of ten different spike shapes were used, with each channel containing three spike shapes. (Top Row) Channels 1-5 from left to right. (Bottom Row) Channels 6-10 from left to right.

The performance accuracy of a spike sorting system depends on the collective and individual performance of each of its components. The performance accuracy of the detection algorithm is computed based on the number of spikes missed and the false alarms in relation to the number of original spikes generated by the simulator. The detection error for Dual Threshold, DEDT, is given by [63]

DEDT = (MS + FA) / (OS + FA) × 100,     (5.3)

where OS is the number of original spikes, FA is the number of false alarms and MS is the number of missed spikes. The performance of ZCF is compared against the performance of PCA; however, PCA requires a training phase to find the two most significant principal components. When 'perfect' detection is used, the classification error, CE, can be calculated as

CEP = (MCS / OS) × 100,     (5.4)

where MCS is the number of mis-classified spikes. The accuracy of the feature extraction algorithm is computed through (5.4), treating misses and false alarms as misclassified spikes. The effects of a non-ideal spike detector on classification accuracy are reflected in the classification error formula

CEDT = (MCS + FA) / (DS + FA) × 100,     (5.5)

where DS is the total number of spikes detected. The performance of feature extraction is isolated from the performance of the non-ideal spike detector by (5.5). Since different sets of features are affected differently, (5.5) captures the effects on classification accuracy of the mis-alignment caused by imperfect spike detection.
This can happen when the noise suppresses the onset of the spike, creating a slight 'delay' and resulting in a spike of shorter duration; or when the noise creates a high magnitude just before the onset of the spike, giving the illusion that the spike started earlier and thus lasts longer than other spikes from the same neuron. Here, the number of misclassified spikes includes all the false alarms, because false alarms are always wrongly classified. The performance of the complete system, including the non-ideal DT spike detector, the feature extractor and the classifier, can be computed as

DCEDT = (MS + MCS) / (OS + FA) × 100.     (5.6)

5.4 Results and Discussion

Fig. 5.5 shows an example of how well ZCF-based clustering performs and how Mahalanobis clustering outperforms K-Means clustering, sanctioning the use of Mahalanobis to create the clusters.

Fig. 5.5a. (Top) Spikes detected on the channel. (Bottom) Color-coded spikes for different neurons. This knowledge is available from the neural signal simulator and is presented here only for reference.

Fig. 5.5b. (Top) ZCF mapping of the spikes on the ZCF feature space as seen by the spike classifier. (Bottom) ZCF mappings color-coded to distinguish between neurons.
Fig. 5.5c. (Top) ZCF mappings grouped into three clusters using the Mahalanobis distance; 1.71% of spikes are mis-classified. (Bottom) ZCF mappings grouped into clusters using K-Means clustering; 6.14% of spikes are mis-classified.

Neural channel 6 from Fig. 5.4 was simulated at an SNR of 3dB. Fig. 5.5a (Top) shows all the spikes that are present on the channel and are detected. Since prior knowledge is available from the neural simulator regarding the true identities of these spikes, the detected spikes are color-coded to distinguish between the different neurons in Fig. 5.5a (Bottom). Fig. 5.5b (Top) shows the mapping of the spikes onto the ZCF feature space, with ZC1 (the pre-zero-crossing feature) on the x-axis and ZC2 (the post-zero-crossing feature) on the y-axis, as seen by the classifier. Fig. 5.5b (Bottom) shows the color-coded mappings, using prior knowledge from the neural simulator, as a reference to distinguish between the three neurons. Fig. 5.5c (Top) shows the clusters created by the classifier using the Mahalanobis distance. Note that, although it looks very similar to the color-coded reference mapping, there are still some mis-classified spikes. Out of the 1286 spikes shown, 22 spikes are misclassified, giving a mis-classification ratio of 1.71%. Fig. 5.5c (Bottom) shows the clusters created by the classifier using K-Means clustering. The mis-classified spikes can easily be identified in the middle of the feature space; a mis-classification ratio of 6.14% is recorded. Both classifiers need some training: the K-Means classifier needs sample points to calculate the centroids of the clusters, while Mahalanobis clustering needs sample points to compute the covariances of the clusters.

Fig. 5.6 shows the performance results of the detection and classification stages using the four figures of merit described by (5.3)-(5.6). Ten different data sets were used in the simulations, and the resulting averages have been plotted. Fig. 5.6(i) shows that the positive and negative DT threshold values that generate the minimum detection error (DEDT) in the initial training phase result in comparable error in the acquisition phase. The detection error increases sharply for SNRs of 3dB and below, where the noise power becomes comparable to the signal power. Fig. 5.6(ii) compares the performance of ZCF and PCA in terms of the classification error when ideal detection is assumed (CEP). It can be seen that ZCF and PCA perform equally well, even under low SNR conditions. Thus, even though ZCF requires only 5% of the resources of PCA, it performs as well as PCA-based features. Fig. 5.6(iii) shows that, when non-ideal spike detection is employed, the resulting mis-alignment of detected spikes adversely affects the classification performance (CEDT) of both the ZCF and PCA algorithms. The errors increase by about 50% at each SNR. However, it can be seen that there is no performance loss at high SNRs and that ZCF and PCA still perform equally well at all SNRs. Fig. 5.6(iv) shows the system-level results including spike detection, feature extraction and classification. The performance of ZCF and PCA are nearly indistinguishable.
It can be seen that, even though ZCF and PCA are capable of performing well under low SNR conditions (e.g. 5% errors at 0dB), the system performance (55% errors at 0dB) is dominated by the performance of the spike detector (45% errors at 0dB, in addition to the mis-alignment of detected spikes). Note that this analysis does not reject the possibility that, if more than two features were used in PCA, its classification performance might improve, albeit at the significant cost of increased hardware resources.

A 1024-channel system with a sampling frequency of 25Ks/sec per channel and 10 bits of ADC resolution would generate raw data at about 250Mbits per second. Using ZCF feature extraction, the information to be transmitted can be reduced to a time stamp, a channel ID and the two computed features (around 50 bits per spike). Assuming an average of three neurons on each channel and a neuron firing rate of 30 spikes per second, data would only need to be transmitted at a rate of ~4.7Mbps using ZCF. Thus, ZCF permits transmission at less than 2% of the raw data rate. The resulting savings in transmission power are purchased at the cost of the ZCF computations, which average 2.86 million computations per second during steady-state operation. In comparison, PCA provides similar data reduction but costs 60.91 million computations per second, many times the dynamic power of ZCF.

Fig. 5.6. Percentage values of (i) DEDT, (ii) CEP, (iii) CEDT and (iv) DCEDT versus different values of SNR. Data represents performance results for the DT spike detector, the ZCF and PCA feature extractors and the combined spike sorting system.

5.5 ZCF vs DWT

This section presents a comparison of a DWT-based spike sorting system, henceforth called DWT, against a ZCF-based spike sorting system, henceforth called ZCF.

1. Sorting Performance: DWT has been shown to give slightly better spike sorting results than PCA. Since ZCF performs as well as PCA, it can be assumed that DWT would perform somewhat better than ZCF.

2. Reconstruction Flexibility: DWT is capable of allowing various degrees of reconstruction by changing the multi-level thresholds. As a result, the actual spike shapes can be reconstructed if needed. ZCF-based features do not allow such reconstruction; however, a 'monitor' mode can be used to record the spike shapes in the time domain.

3. Computational Complexity and Power Consumption: There is no spike detection before the DWT; as a result, all of the input samples go through the time-frequency transformation, which is computationally hungry. Each input sample requires 8 multiplications and 8 additions for a 4-level transform. ZCF processes detected spikes only, and thus computes features only when a spike is detected, which saves a large amount of power and resources.

4. Detection Performance: In DWT, detection is applied offline once the data has been transmitted. This allows higher-performance algorithms to be used for spike detection. The performance of ZCF is affected by the performance of an on-chip spike detector, which may limit the overall system performance.

5. Neural Activity: The power consumed by a ZCF system is proportional to the neural activity on the channels. DWT processes all samples of a channel even if it has low neural activity.

6. Classifier Complexity: ZCF requires only a 2-dimensional spike classifier, while a 4-level DWT requires a 5-dimensional classifier.
5.5 ZCF vs DWT

In this section we present a comparison of a DWT-based spike sorting system, henceforth called DWT, against a ZCF-based spike sorting system, henceforth called ZCF.

1. Sorting Performance: DWT has been shown to produce slightly better spike sorting results than PCA. Since ZCF performs as well as PCA, we can expect DWT to perform somewhat better than ZCF.

2. Reconstruction Flexibility: DWT allows various degrees of reconstruction by changing the multi-level thresholds. As a result, the actual spike shapes can be reconstructed if needed. ZCF-based features do not allow such reconstruction; however, a 'monitor' mode can be used to record the spike shapes in the time domain.

3. Computational Complexity and Power Consumption: There is no spike detection before DWT; as a result, all of the input samples go through the time-frequency transformation, which is computationally expensive. Each input sample requires 8 multiplications and 8 additions for a 4-level transform. ZCF processes 'detected' spikes only, and thus computes features only when a spike is detected, which saves a large amount of power and resources.

4. Detection Performance: In DWT, detection is applied offline, once the data has been transmitted. This allows higher-performance algorithms to be used for spike detection. The performance of ZCF is affected by the performance of an on-chip spike detector, which may limit the overall system performance.

5. Neural Activity: Power consumed by a ZCF system is proportional to the neural activity on the channels. DWT processes all samples of a channel even if it has low neural activity.

6. Classifier Complexity: ZCF requires only a 2-dimensional spike classifier, while a 4-level DWT requires a 5-dimensional classifier. However, the complexity of classification may not be directly proportional to the dimensionality of the classifier.

7. Scalability: DWT requires a large amount of memory for computation and thus does not appear scalable to systems with over a thousand channels. ZCF has been designed to require minimal memory, only one adder and no multipliers, and thus is scalable to a very large number of channels.

8. Training and Feedback: Although taking the time-frequency transform in DWT does not require any supervision, selection of appropriate thresholds to retain or discard features requires initial training and feedback from an external processor or user. ZCF does not require any initial training; however, the preceding spike detector may require some offline training.

9. Compression: ZCF produces two features for each detected spike, while DWT can, theoretically, be trained to produce a single feature capable of providing good sorting results in addition to some feature identification information. The resulting throughputs of the two systems appear to be similar.

The comparison presented above highlights the strengths and weaknesses of both the DWT-based and the ZCF-based systems. The DWT has a clear advantage when any level of signal reconstruction is needed after the data has been received at the output. In addition, because of its time-frequency domain processing, a DWT-based system may provide better spike detection and sorting performance than the purely time-domain ZCF; however, the difference in performance is not clear-cut and depends heavily on the performance of the threshold determination algorithm running in the external processing units. On the other hand, though the DWT has been shown to handle multichannel processing effectively, the ZCF is scalable to a much higher number of channels that can be processed on-chip. This is possible because of the extremely resource-efficient implementation of ZCF, which results in smaller IC area and very low power consumption. Moreover, its power consumption is directly related to the amount of neural activity observed, as opposed to the constant power consumed by the DWT.

It can be concluded that the two systems suit different kinds of applications. The DWT is suitable when detection and sorting accuracy take precedence over the number of simultaneously processed channels, whereas ZCF is suitable when an extremely large number of channels must be processed and a slight performance loss is acceptable.

6 Summary and Future Work

A summary of this Ph.D. thesis research is provided below, followed by a brief description of the contributions made in this thesis and a list of tasks that could further this research in the future.

6.1 Summary

Neuroprosthetic devices and Brain Machine Interfaces (BMIs) are increasingly playing a vital role in helping patients with severe motor disorders achieve a better lifestyle by enabling a direct interface to the central nervous system. Such BMIs show great promise in many biomedical applications. This thesis addresses some of the critical challenges that impede progress in wireless neural implants by developing highly efficient algorithms and hardware suitable for implantation. Employing front-end on-chip signal processing, the system is tailored to enable simultaneous recording, compression and transmission of 32 neural channels.
The system uses an area-power efficient discrete wavelet transform based compression engine to compress the multi-channel data while providing control over the quality of reconstruction needed for the particular application. Lossy as well as lossless compression algorithms combine to provide high compression ratios. A host of low-power design techniques have been employed, in addition to a power management module, to keep the system within the low-area, low-power and low-bandwidth constraints. A communication protocol has been designed to enable error-free data communication between the implanted device and the external module on the body surface. A salient feature of the system is its ability to compress multi-channel data while preserving the important attributes of the signals, allowing off-chip spike sorting. Moreover, a scalable design enables simultaneous processing of a higher number of channels. In addition to this core functionality, the ASIC allows configurability of the entire system, supporting real-time operation, different modes of operation and different levels of compression as desired by the neuroscientist.

The design of a reliable communication and power link between an implanted transmitter and a receiver on the body surface depends heavily on the channel characteristics across animal skin and the characteristics of low-power inductive links. These error characteristics had not previously been explored in the literature. This thesis empirically determines the channel characteristics and analyzes the performance of the inductively coupled data and power transceiver by presenting results from extensive testing for different settings and orientations of the directional transceivers. The channel characteristics can be used to model the communication link for an inductive power-data transceiver implanted in any part of the body, with different thicknesses of skin.

Finally, to match the pace of the rapidly increasing number of channels on a microelectrode array, a new set of features named 'Zero-Crossing Features' has been identified and analyzed for extremely efficient feature extraction in hardware without sacrificing spike sorting performance. The sorting capability of the features is compared against features from principal component analysis and shown to perform equally well. A resource-efficient hardware architecture is also presented for complexity analysis and demonstrates its feasibility for implantation.

6.2 Contributions

This thesis makes the following contributions to the field of biomedical engineering.

1. The prohibitively large amount of data generated by microelectrode arrays results in a data communication bandwidth bottleneck. We have designed and developed a new implantable system that enables control over the amount of transmitted data without compromising the vital information contained in the array of signals. As a result, a large number of channels can be acquired simultaneously within the same bandwidth, with the scientist controlling which channels receive a larger share of the bandwidth. This allows scientists to allocate more resources to the channels of interest and is expected to improve neural decoding accuracy in neuroprosthetics applications.

2. We have developed a unique architecture for the discrete wavelet transform that is designed to be extremely area-power efficient, in addition to being easily adaptable to implement any other wavelet basis with any number of levels of decomposition.
Accordingly, the DWT architecture can be used as-is or can be adapted for use in a wide variety of sensor arrays for signal compression or pattern recognition applications.

3. The unreliable nature of wireless channels poses a formidable challenge in the pattern of bit and packet errors seen at the receiver. We have created the first error profile for inductively coupled transceivers transferring data and power between an implanted device and the external world across the skin tissue. In addition, we have designed an efficient half-duplex protocol for streaming data to maximize the data throughput and to enable unfailing bidirectional data transfer. This error profile enables scientists to evaluate design tradeoffs between error correction, data throughput, transceiver hardware complexity and power consumption. The general nature of the transceiver and skin channel ensures that the profile and the resulting protocol are valid not only for neural but for any implantable system.

4. We have developed the first implantable digital system that employs on-chip high-level signal processing of multi-channel neural data. Hardware has been designed for all the components, including the analog interface controller, the compression engine and the communication controller, which have been combined into a single system within the implantability limitations of area, power and power density. Although the motivation for this design comes from the field of neuroprosthetics, the system developed can be used in a wide range of implantable applications.

5. Advances in MEMS have allowed thousands of channels to be fabricated on a microelectrode array; however, simultaneous recording from such a large number of channels has not been possible. To enable recording from such high-density microelectrode arrays, we have designed a new ultra-constrained set of spike sorting features named 'Zero-Crossing Features' which allows computationally and hardware efficient online spike sorting without requiring any training or user intervention. Enabling simultaneous recording from thousands of channels will allow decoding of the physical and functional interconnections between large populations of neurons, furthering our understanding of brain functionality.

6.3 Future Work

The following is a set of tasks that could further this research.

1. Update the Communication Protocol using Channel Characteristics

With the channel error characteristics available through this research, it is now possible to improve the communication protocol and optimize it for this particular biotelemetry application. The goal is to maximize the throughput while keeping the power consumption and transmission buffer length as small as possible. Based on the results from Chapter 4, the number of parity bits included for error detection could also be re-evaluated. The cost to compute and transmit FEC bits needs to be evaluated further, weighing the power consumed in transmission against the benefit of increased throughput. It is evident from Fig. 4.10 and Fig. 4.15 that the system should have a very strong error detection capability, although only a small error correction capability is sufficient. FEC algorithms need to be evaluated for the amount of error correction and error detection capability they provide relative to what this system needs for error-free, low-power data communication. One way to frame this tradeoff is sketched below.
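The Python sketch below assumes independent bit errors and a stop-and-wait retransmission scheme, both of which are illustrative simplifications (the measured channel of Chapter 4 is bursty, which is exactly why the empirical error profile matters), and the raw rate, payload size and BER are placeholder values.

```python
# Rough effective-throughput model for a half-duplex link with
# retransmission of corrupted packets. Illustrative assumptions only:
# independent bit errors, no ACK overhead, placeholder parameter values.
def effective_throughput(raw_bps, payload_bits, parity_bits, ber):
    packet = payload_bits + parity_bits
    p_ok = (1.0 - ber) ** packet      # probability a packet arrives clean
    # useful bits delivered per channel bit, discounting retransmissions
    return raw_bps * (payload_bits / packet) * p_ok

for parity in (8, 16, 32):
    print(parity, effective_throughput(2_000_000, 1024, parity, 1e-5))
```

Sweeping the parity-bit count and BER in such a model gives a first-order view of where added redundancy stops paying for itself; the measured error profile would replace the independent-error assumption in a refined version.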
2. Optimization for Low Power Operation

Once the functional verification of the system is complete, the design can be analyzed to identify areas where power-saving techniques can be employed to reduce the power consumption of the system as a whole, including both the digital blocks and the analog front end. For the target technology of 0.13 micron CMOS, the focus of this effort is to reduce the combined leakage and dynamic power of the system while keeping a check on the area overhead. These techniques may include multiple supply voltages, dynamic/adaptive voltage and frequency scaling, clock gating, transition rate buffering, operand isolation, logic restructuring, memory splitting, substrate biasing, multiple threshold voltages and power shutoff.

3. System Design for the Ultra-Constrained Spike Sorting

The hardware design of ultra-constrained feature extraction can be extended to include spike detection and spike classification. The designs need to be of low complexity, low power and resource efficient enough that their resource consumption does not overshadow the complexity gains achieved by the feature extraction algorithm. Current designs for spike detection are either too resource-hungry for an implantable system or do not have good detection performance. Spike classification can also be performed through several different algorithms; however, most high-performance classifiers are computationally expensive. It is thus necessary to evaluate the tradeoff between the power consumed by on-chip spike classification and the transmission power required to transmit unsorted spike features for use off-chip.

4. Evaluate the Number of Input Output Pins

Our experience has shown that, for advanced technologies like 0.13 micron CMOS, the IO and power supply pads for the various inputs and outputs take up a sizeable amount of the overall area, as shown in Fig. 6.1, mainly to enable packaging and to protect against electrostatic discharge. The pads are also the biggest power-consuming blocks in the system, since they add large capacitances at the interface. Methods to reduce this non-essential power consumption could be explored. One way to reduce the number of IO pads is the use of serial communication between the various components of the design. The cost may involve increasing the clock frequency or the use of frequency multipliers.

Fig. 6.1. NIN chip fabricated in 0.13 micron CMOS

5. Test of Fabricated ASIC in Live Animals

Once the system chip presented in Chapter 3 is fabricated, it needs to be tested in live animals for two things mainly: functionality and power consumption. The tests will verify the initial choices of sampling rate and quantization levels, and will verify the assumptions about the amount of data generated and the amount of data necessary for off-chip spike sorting. Long-term usability of the system could also be evaluated for any unforeseen glitches in the hardware implementation. Including the wireless data and power transfer will enable testing of the effects of short-time power outages and the functionality of the communication controller. Once testing is complete, a data sheet and user manual could be created to enable transfer of the chip to users unfamiliar with the design.

6.4 Papers Written

Journals

1. A. M. Kamboh, A. Mason, and K. G. Oweiss, "Analysis of Lifting and B-Spline DWT Implementations for Implantable Neuroprosthetics," Journal of Signal Processing Systems, vol. 52, no. 3, pp. 249-261, September 2008.
2. K. G. Oweiss, A. Mason, Y. Suhail, A. M. Kamboh, and K. E. Thomson, "A Scalable Wavelet Transform VLSI Architecture for Real-Time Signal Processing in High-Density Intra-Cortical Implants," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 54, no. 6, pp. 1266-1278, June 2007.

3. A. M. Kamboh, M. Raetz, K. G. Oweiss, and A. Mason, "Area-Power Efficient VLSI Implementation of Multichannel DWT for Data Compression in Implantable Neuroprosthetics," IEEE Transactions on Biomedical Circuits and Systems, vol. 1, no. 2, pp. 128-135, June 2007.

4. A. M. Kamboh and A. J. Mason, "High Fidelity Neural Signal Compression and Wireless Communication," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 2010 (in preparation).

5. A. M. Kamboh and A. J. Mason, "Ultra Constrained On Chip Feature Extraction for Neural Spike Sorting," IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2010 (in preparation).

6. A. M. Kamboh, M. Kiani, M. Ghovanloo, and A. J. Mason, "Performance Analysis of Inductively Coupled Transceivers for Power and Data Transfer in Wireless Implants," IEEE Transactions on Biomedical Circuits and Systems, 2010 (in preparation).

Conferences

7. A. M. Kamboh and A. J. Mason, "On Chip Feature Extraction for Spike Sorting in High Density Implantable Neural Recording Systems," IEEE Conference on Biomedical Circuits and Systems, Paphos, Cyprus, pp. 13-16, Nov 2010.

8. Y. Yang, A. M. Kamboh, and A. J. Mason, "Adaptive Threshold Spike Detection Using Stationary Wavelet Transform for Neural Recording Implants," IEEE Conference on Biomedical Circuits and Systems, Paphos, Cyprus, pp. 9-12, Nov 2010.

9. A. M. Kamboh, Y. Yang, K. G. Oweiss, and A. J. Mason, "Design of a Configurable Neural Data Compression System for Intra-Cortical Implants," IEEE International Symposium on Circuits and Systems (ISCAS 2010), Paris, France, pp. 3473-3476, June 2010.

10. A. M. Kamboh, K. G. Oweiss, and A. J. Mason, "Resource Constrained VLSI Architecture for Implantable Neural Data Compression Systems," IEEE International Symposium on Circuits and Systems (ISCAS 2009), Taipei, Taiwan, pp. 1481-1484, May 2009.

11. F. T. Abu-Nimeh, A. Kamboh, M. Aghagolzadeh, U.-M. Jow, A. Mason, M. Ghovanloo, and K. Oweiss, "A Highly Modular, Wireless, Implantable Interface to the Cortex," IEEE/EMBS Conference on Neural Engineering, Antalya, Turkey, pp. 375-378, April 2009.

12. A. M. Kamboh, M. Raetz, A. Mason, and K. Oweiss, "Area-Power Efficient Lifting-Based DWT Hardware for Implantable Neuroprosthetics," IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, Louisiana, USA, pp. 2371-2374, May 2007.

13. M. A. Shetliffe, A. M. Kamboh, A. Mason, and K. G. Oweiss, "Impact of Lossy Compression on Neural Response Characteristics Extracted from High-Density Intra-Cortical Implant Data," International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS 2007), Lyon, France, pp. 5358-5361, August 2007.

14. A. M. Kamboh, A. Mason, and K. Oweiss, "A High-Yield Area-Power Efficient DWT Hardware for Implantable Neural Interface Applications," 3rd International IEEE/EMBS Conference on Neural Engineering (CNE 2007), Kohala Coast, Hawaii, USA, pp. 212-216, May 2007.

15. A. M. Kamboh and A. Mason, "Comparison of Lifting and B-spline DWT Implementations for Implantable Neuroprosthetics," 5th IEEE Conference on Sensors 2006, Daejeon, South Korea, pp. 811-814, Oct 2006.
APPENDICES

Appendix A: Neural Recording Techniques Overview

Several techniques are currently in use for recording neural activity: Electroencephalography (EEG), Magnetoencephalography (MEG), functional Magnetic Resonance Imaging (fMRI) and Positron Emission Tomography (PET), all of which are non-invasive.

1. EEG:

EEG is the recording of the electrical activity along the scalp produced by the firing of neurons in the cortex. These signals are recorded by placing electrodes on the scalp. EEG has been used extensively to test neural activity in different regions of the brain and has been helpful in characterizing seizures, brain death, coma and epilepsy. The electrical activity of the brain can be described at spatial scales ranging from the currents within a single neuron to the relatively 'accumulated' potentials that the EEG records from the scalp. Although EEG is the most helpful technique for some of these situations, it has several limitations, the most important of which is poor spatial resolution. EEG has very high temporal resolution, on the order of milliseconds, enabling detection of very brief events of interest.

2. MEG:

MEG and EEG are generated by the same neurophysiologic process, where EEG results from accumulated electric fields and MEG results from accumulated magnetic fields. The skull and the scalp do not distort magnetic fields as much as electric fields; thus MEG has better spatial resolution than EEG. MEG has found its uses in the study of epilepsy; however, the EEG signal is much richer in information content and is the primary monitoring technique for most neurophysiological processes.

3. fMRI & PET:

fMRI and PET record neural activity indirectly, by monitoring changes in blood flow and metabolic activity in the cortical tissues, respectively. Both techniques have better spatial resolution than EEG in three dimensions; however, there are limitations associated with both. One is poor temporal resolution: successive recordings can contain several minutes of superimposed information. Another limitation is that the subject cannot move freely (or cannot move at all) for the duration of the recording.

In several situations it is desirable to have both higher spatial resolution and higher temporal resolution than the above recording techniques provide, for a better understanding of intra-cranial connectivity and neural activity. This can only be achieved with highly invasive methods and the introduction of strip electrodes or microelectrode arrays.

4. ECoG:

Electrocorticography (ECoG) provides better resolution by placing the electrodes on the cortex, under the skull. A fine strip carrying numerous electrodes is placed over the cortex. It records signals that are more pristine and much better defined than the EEG. These signals are recorded without having to pass through the skull or the scalp, and thus have not been smeared. Though ECoG provides better resolution, it is still not possible to monitor the activity of individual cells. A disadvantage of this technique is that surgery is needed to place the strip of electrodes on the cortex, and thus far no system has been developed to wirelessly transmit the recorded data to external machines. As a result, a set of wires extends out of the scalp and feeds the signals to the processing machines.

5. Intracranial EEG:

Intracranial EEG has shown the capability of recording the activity of individual cells, and has recently been made possible by the use of microelectrode arrays.
The signals are recorded by implanting electrodes into the cortex. Recording the activity of cortical neurons with microelectrode arrays has been shown to be essential for quantifying the degree of involvement of each neuron in encoding movement parameters. The use of microelectrodes provides extremely high spatial and temporal resolution and allows recording from single cells at arbitrary sampling rates, thereby enabling decoding of the neural signals and, eventually, control of artificial limbs [1]. There are several challenges associated with this highly invasive procedure. Firstly, as with ECoG, surgery is required to implant the electrode array and supporting electronics into the cortex. Secondly, such high resolutions, on the order of microseconds and micrometers, result in a tremendous amount of data being generated. Thus far, clinical trials have used wires that extend from the skull to external machines to record the signals. However, enabling wireless telemetry of these signals will allow the subject free movement, in addition to long-term or permanent implants without any chance of infection in the scalp. Such BMIs show great promise in many biomedical applications.

Appendix B: Quantization

Power and area requirements of the DWT hardware are determined largely by the complexity of the computational circuitry and the required memory. To systematically reduce hardware requirements, we have explored different options to reduce computation and memory requirements at the algorithm level and analyzed their impact on signal integrity to determine an optimal solution. Quantization of the input signal and the wavelet coefficients into finite-width words affects signal integrity negatively; at the same time, reducing the word lengths has a large effect on the area and power consumed by the processing hardware. This tradeoff is analyzed below.

Signal Quantization

Fixed-point integer approximation limits the range and precision of data values but also greatly reduces the computational demand and memory requirements for processing and storage. The choice of signal width also affects the design of the analog-to-digital converter, as the width of the data path equals the number of bits of the ADC. The input analog signal is first scaled and then digitized to obtain data samples within an acceptable signal-to-quantization-noise ratio. Traditionally, neural signals are digitized with a word width between 8 and 12 bits. Our analysis shows that 10-bit quantization provides a sufficient signal-to-noise ratio for most spike sorting algorithms, and that increasing the width does not improve sorting performance. As the results presented in this thesis demonstrate, this data precision is sufficient to maintain signal integrity.
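This analysis is straightforward to reproduce numerically; the Python sketch below quantizes an illustrative synthetic spike (not recorded data, and not our Matlab analysis code) to different word widths and reports the signal-to-quantization-noise ratio.

```python
import numpy as np

def quantize(x, bits):
    """Uniformly quantize x (assumed scaled into [-1, 1)) to a signed fixed-point word."""
    levels = 2 ** (bits - 1)
    return np.clip(np.round(x * levels), -levels, levels - 1) / levels

# Synthetic biphasic "spike" used purely for illustration.
t = np.linspace(0, 1, 250)
spike = np.exp(-((t - 0.30) ** 2) / 0.001) - 0.4 * np.exp(-((t - 0.45) ** 2) / 0.004)

for bits in (8, 10, 12):
    err = spike - quantize(spike, bits)
    sqnr = 10 * np.log10(np.mean(spike ** 2) / np.mean(err ** 2))
    print(bits, round(sqnr, 1))  # SQNR improves roughly 6 dB per added bit
```

The familiar 6 dB-per-bit behavior is visible in the output; the conclusion above is that past 10 bits the extra fidelity no longer translates into better spike sorting.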
Quantization of the Filter Coefficients

Since the algorithms involve many multiplications and additions, it is important to design the multipliers and adders to require low power and small area. Multipliers that multiply two 10-bit inputs are considerably larger in area than those that multiply a 10-bit word by a 6-bit word. As a result, wherever possible, using unequal input word lengths has a power advantage over using equal word lengths. Another factor that encourages the use of unequal word lengths is that the result of a multiply operation has a word length equal to the sum of the input word lengths, which means that the dynamic range of the output is different from the dynamic range of the input. In hardware systems where the input and the output of the multiplier are required to have the same word length, it is necessary to further quantize the output by discarding a number of least significant bits, which introduces quantization error. Using unequal-wordlength inputs requires fewer bits to be discarded, resulting in a smaller quantization error. We quantified the effect of the round-off and quantization errors on signal fidelity as a function of the word lengths of the wavelet filter coefficients. Our results demonstrate that 6 bits (5 magnitude + 1 sign) for coefficients, in conjunction with 10 bits (9 magnitude + 1 sign) for data, can adequately preserve signal integrity.

Signal Integrity

We assessed the effects of the data and filter coefficient approximations on the quality of the signals obtained after reconstruction with the lifting-based DWT; the results are illustrated in Fig. B.1. The wavelet filter coefficients were quantized to resolutions ranging from 4 to 12 bits, and the data was quantized over the same range. Fig. B.1 plots the root mean squared (RMS) reconstruction error versus the data and coefficient word lengths. These results demonstrate that, with sufficient precision, the use of integer computations does not result in significant signal degradation as quantified by the observed output RMS error. Note that there is no significant improvement in SNR when the signal is quantized with more than 10 bits or when the coefficients are quantized with more than 6 bits of resolution.

Fig. B.1. Effect of round-off and quantization errors on signal fidelity as a function of quantized data word length and coefficient word length.

Taking these results together, it is clear that the choice of 6/10-bit coefficient/data quantization offers the best compromise between multiplier complexity and signal fidelity, as concluded earlier.
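The sweep behind Fig. B.1 can be emulated in a few lines. The Python sketch below fixes the data at 10 bits and sweeps the coefficient word length of an illustrative symmetric 9-tap low-pass filter (stand-in values, not necessarily the wavelet filter used in this work), reporting RMS error against a full-precision reference; sweeping the data word length the same way reproduces the second axis of the figure.

```python
import numpy as np

def fir(x, h):
    """Direct-form FIR filtering (stand-in for one DWT filtering stage)."""
    return np.convolve(x, h, mode="same")

rng = np.random.default_rng(0)
x = 0.25 * rng.standard_normal(4096)               # test signal well inside [-1, 1)
x10 = np.clip(np.round(x * 512), -512, 511) / 512  # 10-bit (9 magnitude + 1 sign) data

# Illustrative symmetric low-pass coefficients; the exact values are not the point.
h = np.array([0.0378, -0.0238, -0.1106, 0.3774, 0.8527,
              0.3774, -0.1106, -0.0238, 0.0378])
ref = fir(x, h)                                    # full-precision reference
for cbits in (4, 6, 8, 10, 12):
    scale = 2 ** (cbits - 1)
    hq = np.round(h * scale) / scale               # quantized coefficients
    rms = np.sqrt(np.mean((fir(x10, hq) - ref) ** 2))
    print(cbits, rms)  # total error mixes the data noise floor with coefficient rounding
```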
We should emphasize that perfect reconstruction of the signals off-chip may not always be needed. Typically, neural signals contain the activity of multiple neurons that need to be sorted out, and this information remains in the compressed data at the output of the DWT block. It has been shown elsewhere that sorting multi-source neuronal signals can be performed directly on the wavelet-transformed data [32, 33]; this topic is outside the scope of this research.

Appendix C: Design Flow

The design of an ASIC to implement an algorithm in hardware is completed in two main phases, the technology-independent phase and the technology-dependent phase. The technology-independent phase involves fixed-point algorithm design in Matlab followed by behavioral and structural design in a hardware description language. The design is then verified and prototyped in hardware using an FPGA; although this is a technology-dependent technique, verification using an FPGA has various advantages, as described in the following section. The FPGA-verified hardware description is finally translated into the ASIC design for chip fabrication in a particular technology. Fig. C.1 gives the high-level steps for the design of the proposed ASIC hardware in silicon.

Fig. C.1: High-level ASIC design flow

Technology Independent Phase

The technology-independent phase of the design involves the algorithm design and the register transfer level (RTL) description of the hardware in a hardware description language (HDL). This phase has the following main components.

1. Algorithm Design

The first step in fabricating an ASIC that implements a signal-processing algorithm is the design of the algorithm in a high-level language. Apart from the desired functionality, design of the algorithm involves calculating the resources required for processing the data. Two main resources need consideration: the number of calculations and the number of memory elements required. As a general rule, a higher number of calculations means higher power dissipation, whereas a higher number of memory elements means more silicon area in the final ASIC. However, area and power are not independent concerns, as leakage has lately become a dominant contributor to power loss. The achievable frequency of operation also depends on the degree of parallelism inherent in the algorithm. Algorithms are generally designed in high-level languages such as C, Python or Matlab. We use Matlab for our designs.

2. Fixed-Point Algorithm

Hardware designs require a finite word length, which is generally much smaller than the word lengths available in high-level languages. This introduces two issues, quantization errors and overflows, both of which degrade the signal-to-noise ratio at the output of the system. The word length is generally kept to a minimum while maintaining an acceptable signal-to-noise ratio. This is critical, as increasing or decreasing the word length of an 8-bit system by only one bit may increase or decrease the final silicon area by about 10%, with a corresponding change in the power consumption of the system. Similarly, it is essential to handle overflows correctly whenever the result of an arithmetic operation exceeds the available word length. These issues make it mandatory to adapt the algorithm to finite-precision, fixed-point arithmetic and to ascertain the integrity of the algorithm. Our system is designed with a 10-bit word length, chosen to keep the quantization error small enough to preserve the integrity of the incoming data and to enable spike sorting in the later stages. Matlab was used to perform the analysis of the required bit resolution. Once the fixed-point simulations have verified the integrity of the algorithm, a "list of specifications" can be created, outlining the key requirements for the hardware to be designed. Among other items, this list includes the inputs and outputs of the system and the number of calculations per second that individual components must sustain to maintain the desired throughput.
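The saturation and re-quantization conventions described above can be captured in small helper routines when porting a floating-point model to fixed point. The Python sketch below is illustrative of those conventions, not a copy of our Matlab code.

```python
def to_fixed(value, word_bits, frac_bits):
    """Convert a float to a saturating two's-complement fixed-point integer."""
    scaled = int(round(value * (1 << frac_bits)))
    hi = (1 << (word_bits - 1)) - 1    # e.g. +511 for a 10-bit word
    lo = -(1 << (word_bits - 1))       # e.g. -512
    return max(lo, min(hi, scaled))    # saturate on overflow instead of wrapping

def fixed_mul(a, b, frac_bits, word_bits):
    """Multiply two fixed-point integers, then requantize the double-width
    product back to word_bits by discarding least-significant bits
    (arithmetic right shift, which floors toward negative infinity)."""
    full = a * b                       # word length = sum of input word lengths
    hi = (1 << (word_bits - 1)) - 1
    lo = -(1 << (word_bits - 1))
    return max(lo, min(hi, full >> frac_bits))
```

Running the whole fixed-point model through helpers like these makes the quantization and overflow behavior explicit, so the Matlab results can later be matched bit-for-bit by the HDL.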
3. Register Transfer Level Hardware Description

The list of specifications is then used to design the actual hardware in a hardware description language (HDL). The benefit of using an HDL is that it is easier and faster to create, manage, test and modify large designs than with a schematic-based design. Register transfer level (RTL) coding is the abstraction level just above gates. Registers are the memory elements in the design, and meaningful designs contain combinational logic between the registers; RTL thus refers to the transfer of signals from one set of registers to the next through the combinational logic. The RTL code contains several high-level language constructs, such as if-then-else and switch-case, which are later mapped during synthesis to complex structures like edge-triggered finite state machines or to simpler structures like multiplexers and decoders. There are several hardware description languages prevalent in the industry, including Verilog, VHDL, Verilog-AMS and SystemVerilog. We use Verilog to design the hardware, as it is one of the most popular and highly developed languages and is supported by the majority of vendors to model their technologies and systems. Similarly, several software tools enable HDL coding, e.g. ModelSim, NC-Verilog and Xilinx ISE.

4. Functional Verification & Simulation

The functionality of the designed hardware is verified through simulations. Real data is provided in parallel to the fixed-point Matlab algorithm and to the Verilog HDL design, and the outputs of the two are compared for differences. For any given hardware configuration, the outputs of the Verilog design should exactly match the outputs of the Matlab code. Simulations at this stage are independent of the fabrication technology and do not account for the gate delays present in the system. As a result, the simulations cannot provide insight into race conditions or spurious glitches that may arise because of different delays and path lengths. However, functional simulations are still important at this stage: if problems are diagnosed later, the design is rolled back to this stage and appropriate changes are made.
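The comparison itself is easily automated. The Python sketch below assumes, purely for illustration, that both the Matlab model and the Verilog simulation dump their outputs as one integer sample per line; the file names and format are hypothetical.

```python
# Compare golden-model output against HDL simulation output sample by sample.
# File names and the one-integer-per-line format are illustrative assumptions.
def compare_outputs(golden_path="matlab_out.txt", hdl_path="verilog_out.txt"):
    with open(golden_path) as g, open(hdl_path) as h:
        golden = [int(line) for line in g if line.strip()]
        hdl = [int(line) for line in h if line.strip()]
    if len(golden) != len(hdl):
        print(f"length mismatch: {len(golden)} vs {len(hdl)}")
    mismatches = [(i, a, b) for i, (a, b) in enumerate(zip(golden, hdl)) if a != b]
    for i, a, b in mismatches[:10]:       # report the first few discrepancies
        print(f"sample {i}: golden={a} hdl={b}")
    return not mismatches                 # True only for a bit-exact match
```

Any mismatch pinpoints the first sample at which the HDL diverges from the fixed-point model, which is usually enough to localize the offending block.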
5. Design Constraints File

The design constraints file is required by the synthesizers to optimize the design and to conform to design requirements such as operating frequencies and various IO delays. The file specifies the names of the clock pins, the internally generated clocks, their frequencies and their rise and fall times. The file also specifies other parameters, e.g. the amount of uncertainty expected in the input clock or the maximum allowed delay between two registers in the design. Multi-cycle paths can also be specified here to improve the synthesis results.

Technology Dependent Phase for FPGA Based Prototype

As ASIC designs become more complex, prototyping based on FPGAs has become one of the most reliable techniques for verification. There are several reasons for using FPGAs for prototyping. Bigger designs mean more bugs, and higher complexity means the bugs are harder to find. Finding and fixing these bugs in software requires very long verification and simulation times, and it is very difficult to develop comprehensive testbenches that exercise all possible situations and conditions. FPGA-based prototyping offers several benefits, including comprehensive verification, software development in parallel and field-testing. First and foremost, it is a way to prove the functionality of the design by rigorously verifying it physically, using real data, before an expensive ASIC fabrication is undertaken. About 70% of ASIC re-spins are the result of functional errors caused by bugs that could not be detected during simulation and verification. Since ASIC fabrication is a very expensive and time-consuming process spanning several weeks, it is very important to get the design right on the first attempt. Secondly, a hardware prototype permits design and early integration of the complete system, allowing software and board development to progress in parallel with the ASIC manufacturing. Finally, a functionally verified hardware prototype instills greater confidence in the design, which enables the development of a "right-first-time" ASIC. About 40% of all digital ASIC designs are prototyped using FPGAs.

There are various FPGA platforms available for prototyping. We use Altera's Cyclone III FPGA, which has a sufficient number of logic elements to map the hardware design. The following steps are necessary to ensure proper verification.

1. Synthesis, Placement and Routing for FPGA

The Verilog files are synthesized to the FPGA-specific libraries provided by the FPGA vendor. The synthesis software converts the RTL-level Verilog description into a gate-level circuit. The vendor-provided libraries contain information about the delays introduced by each cell/gate. Since there are multiple ways of implementing any given function, the synthesizer uses the constraints listed in the design constraints file to choose the most suitable topology. This choice results in a tradeoff between area, power and performance (speed). These parameters are compared against the requirements set in the design constraints file; in case of any violations, the synthesis is iterated using one of the other synthesis topologies. If the synthesizer is unsuccessful in creating a gate-level circuit, it becomes necessary to modify the high-level RTL code at the top of the design flow. The output of the synthesis step is another Verilog file that contains a gate-level description of the circuit.

The synthesized output is then provided to the place-and-route software, which maps these gates to the logic elements present in the FPGA. This mapping also takes into account the delays incurred by the long wires between different cells. Apart from mapping the gates to the logic elements, this step generates a file in the Standard Delay Format (SDF) that contains information about the delays contributed by the individual wires. FPGA vendors provide the required software for synthesis and place-and-route; however, general-purpose synthesizers like Mentor Graphics' Leonardo Spectrum or Synopsys' Design Vision can also be used to synthesize the code for a particular FPGA. We use Altera's Quartus II to perform these steps.

2. Post-Synthesis Timing Simulation

Once synthesis is complete, the design is simulated and verified again in ModelSim. This simulation is based on the gate-level synthesized code and also includes the information present in the SDF file. Consequently, the simulation includes the delays experienced by the signals through all the gates and the wires connecting them. This gate- and wire-delay based simulation is necessary to establish the integrity of the design, and it gives insight into any race conditions present in the system as well as the effects of any undesirable change in signal value at the output of a gate. Again, the output of the gate-level simulation is compared to the output of the fixed-point Matlab simulation for an exact match. Although the synthesis results for the FPGA are expected to differ from the synthesis results for the ASIC, due to differences in technology and the FPGA architecture, success at this stage verifies that the hardware has been designed suitably enough that it can be converted to physical hardware while staying within the design constraints.

3. Hardware Verification

The simulated and verified code is then 'burned' onto the actual FPGA for hardware verification. Hardware verification of the design requires additional hardware and software resources to provide real data at the inputs of the FPGA and, at the same time, record the outputs of the FPGA. The output of the FPGA needs to match the simulation results, which in turn should match the Matlab-based algorithmic simulations.
We use National Instruments' data acquisition cards in conjunction with LabView and the Python programming language to generate the signals required at the input of the FPGA. This setup provides a high degree of flexibility in the test design; however, these software-based interfaces are not fast enough to test the hardware at high frequencies. To test the design at appropriate frequencies, additional FPGAs need to be used to provide and record the signals from the FPGA under test. Though using an FPGA to monitor the signals lacks the flexibility of software-based monitoring, it verifies the functionality of the design at high operating frequencies. As mentioned earlier, due to the possible technology difference from the ASIC, FPGA-based prototyping does not provide an accurate estimate of the power consumption or the IO delays.
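The stimulus files replayed by the DAQ card can be generated with a short script. The Python sketch below writes a synthetic spike-laden test signal as unsigned 10-bit codes, one per line; the file format, names and signal parameters are illustrative assumptions, not the actual test fixture.

```python
import numpy as np

def write_stimulus(path="fpga_stimulus.txt", n=10_000, fs=25_000, bits=10):
    """Write a synthetic neural-like test signal as unsigned integer codes,
    one per line, for the DAQ card to replay into the FPGA inputs."""
    rng = np.random.default_rng(1)
    t = np.arange(n) / fs
    signal = 0.1 * rng.standard_normal(n)        # background noise
    for t0 in rng.uniform(0, t[-1], 40):         # sprinkle in 40 synthetic spikes
        signal += np.exp(-((t - t0) ** 2) / 2e-7)
    signal /= np.abs(signal).max()               # normalize to [-1, 1]
    codes = np.round((signal + 1) / 2 * (2 ** bits - 1)).astype(int)
    np.savetxt(path, codes, fmt="%d")

write_stimulus()
```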
Technology Dependent Phase for ASIC

Once the design has been verified on an FPGA-based prototype, it is ready to be ported to an ASIC. The choice of fabrication technology for an ASIC depends on many factors. Newer technologies mean smaller feature sizes, which generally translate into less area consumption, smaller power dissipation and faster designs. However, fabrication costs as well as design complexity increase tremendously with reduced feature size. We used the AMI 0.5 micron CMOS fabrication process to verify the functionality of the DWT chip and to develop the process design flow. The complete system is to be fabricated in an IBM 0.13 micron CMOS process, a choice mandated by the low-area and low-power requirements of this research project. The design of digital ASICs requires various libraries and models provided by the vendor for proper verification and reliable fabrication. These files include HDL models for simulations, standard-cell library files for the synthesizer, and cell layout files for the final layouts. Technology-specific memory generators are also required to compile large and reliable memory blocks. The following technology-specific steps are being followed to ensure a 'right-first-time' ASIC design for the algorithm.

1. Synthesis for ASIC

The Verilog files containing the RTL-level design description are imported into the synthesizer along with the design constraints file; it is ensured that the constraints file does not contain any FPGA-specific constraints. Synopsys Design Vision is used to synthesize the design and map it to the IBM 0.13 micron CMOS libraries. The synthesizer outputs a Verilog file containing a gate-level description of the synthesized block. Synthesizers are not capable of generating memory blocks as six-transistor SRAM; they can only generate flip-flop based registers, which are about three times larger than SRAM. As a result, special memory generators are needed to compile the memory blocks while the remaining portions of the design are synthesized. Each block can be synthesized independently, or the complete system can be synthesized as one large block. We chose to synthesize the blocks independently, since this allows the system to be divided into modules. As different blocks operate at different clock frequencies, a modular approach helps to reduce power consumption at the expense of additional silicon area.

2. Memory Generation and Custom Blocks

The required memory blocks are generated using the technology-specific memory generators provided by the vendor. Different memory blocks are generated with different word sizes and memory depths, depending upon their access patterns, to reduce the power consumption of memory read and write operations. Once generated, the memory is simulated for functional and timing verification. The memory generators also produce the Verilog model files for the memory. We use Cadence's Virtuoso to compile the memories.

3. Timing Verification for ASIC

The synthesized blocks and the memory models are combined into a system and verified through simulations in ModelSim. These simulations incorporate the IO delays and the delays associated with individual gates, but do not incorporate the wire delays introduced by parasitic capacitances.

4. Abstract Generation

Abstract generation is necessary if the design contains any custom blocks that are not included in the standard-cell library. Abstract generation is a process that extracts the necessary information from a custom layout block to enable the automatic place-and-route software to include the custom block in the design. Apart from other useful items, this information includes the precise locations of the input and output pins, as well as the areas occupied by the different metals. We use Virtuoso for abstract generation.

5. Automatic Placement and Routing

Automatic place-and-route is a layout generation process that automatically places the cells representing logic gates on a layout and then routes the inputs and outputs of those gates according to the netlist generated by the synthesizer. The resulting layout contains all the gates and connections specified in the synthesized Verilog gate-level netlist. Various steps are involved in the generation of the layout, including floor-planning and the generation of clock trees and power rings. Each module is placed and routed individually. Once all the digital blocks have been routed, global routing is performed, in which all the digital blocks, memory blocks and custom blocks are connected to each other. The output of this stage is a set of SDF files that contain information about the delays introduced by the various connecting wires, at the block level and at the global level.

6. Layout Integrity Check

Automatic place-and-route follows several technology-specific design rules while routing signals through the various cells. These rules are defined in the technology files. However, since the various components of the design flow come from different vendors, it is still possible that some rules are overlooked by the router. As a result, a Design Rules Check is performed on the layout files to verify the routing. Some rules cannot be checked automatically and thus need to be checked manually once the process is complete.

7. Timing Verification for ASIC

The gate-level netlists and the SDF files generated by the place-and-route stage are then simulated in ModelSim. These simulations include all the system components connected to each other (except the pad frame for the input/output pins) and use the gate-delay and wire-delay information to generate accurate results. It is highly desirable to simulate at this stage, before the pad frame is added, since simulation of even moderately sized complete layouts may take weeks to generate the output vectors of interest.
8. Post-Layout Simulations for Timing Verification

In the final step, the layout is completed by including the pad frame for the input and output pins and the power and ground pins. Simulation of the complete layout is time-consuming but may be necessary to verify some elements of the design. Since it is difficult to verify all the test vectors, a selection of test scenarios can be simulated to verify the functionality of the chip. Ocean, a scripting language that can be used in Cadence's Virtuoso, is used to analyze and extract useful information from the multiple gigabytes of data generated during these simulations. This step concludes the ASIC design process, after which the final layout is taped out for fabrication.

Appendix D: Design for Testability

Key signals in the design are included in a scan chain designed to increase the testability of the system in future generations. Fig. D.1 shows a design that provides observability and controllability for a particular signal. A standard 4-pin interface is used to operate the scan chain. One pin, T/N, selects the Test or Normal mode of operation. Serial scan-in and serial scan-out are used to serially input data to the scan chain and to read register values at the output. T-Clk and U-Clk are the Test and Update clocks; they can be controlled through designated pins, or through a single T-Clk and accompanying control logic.

Fig. D.1: Scan chain insertion for testability, providing both observability and controllability
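The intended behavior of the chain can be captured in a small behavioral model. The Python sketch below is our reading of Fig. D.1 (shift serially on T-Clk in test mode, capture design signals in normal mode, apply shifted-in values on U-Clk) and is illustrative only; the actual RTL may differ in its capture and update details.

```python
class ScanChain:
    """Behavioral model of the 4-pin scan chain of Fig. D.1 (illustrative)."""
    def __init__(self, length):
        self.shift = [0] * length     # serially loaded shadow registers
        self.update = [0] * length    # values driven into the design (controllability)

    def t_clk(self, scan_in, test_mode, design_values):
        scan_out = self.shift[-1]     # last register drives serial scan-out
        if test_mode:                 # Test mode (T/N = 1): shift the chain by one bit
            self.shift = [scan_in] + self.shift[:-1]
        else:                         # Normal mode: capture design signals (observability)
            self.shift = list(design_values)
        return scan_out

    def u_clk(self):
        self.update = list(self.shift)  # Update clock: apply shifted-in values
        return self.update
```

Shifting a known pattern in, pulsing U-Clk, and then shifting the captured state back out exercises both the controllability and the observability paths of the chain.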
Najafi, A. Sodagar, "An Implantable 64Channel Wireless Microsystem for Single-Unit Neural Recording," SolidState Circuits, IEEE Journal of, vol. 44, no. 9, pp. 2591-2604, 2009. [11] K. Wise, A. Sodagar, Y. Yao, M. Gulari, G. Perlin, K. Najafi, "Microelectrodes, Microelectronics, and Implantable Neural Microsystems," Proceedings of the IEEE, vol. 96, no. 7, pp. 1184-1202, 2008. [12] A. Sodagar, K. Wise, K. Najafi, "A fully integrated mixed-signal neural processor for implantable multichannel cortical recording," Biomedical Engineering, IEEE Transactions on, vol. 54, no. 6, p. 1075-1088, 2007. [13] Y. Song, D. Borton, S. Park, W. Patterson, C. Bull, F. Laiwalla, J. Mislow, J. Simeral, J. Donoghue, A. Nurmikko, "Active Microelectronic Neurosensor Arrays for Implantable Brain Communication Interfaces," Neural Systems and Rehabilitation Engineering, IEEE Transactions on, vol. 17, no. 4, pp. 339-345, 2009. [14] M. Yuce, W. Liu, M. S. Chae, J. S. Kim, " A Wideband Telemetry Unit for Multi-Channel Neural Recording Systems," Ultra-wideband, IEEE international conference on, pp. 612-617, 2007. [15] M. S. Chae, Z. Yang, M. Yuce, L. Hoang, W. Liu, "A 128-Channel 6 mW Wireless Neural Recording IC With Spike Feature Extraction and UWB Transmitter," Neural Systems and Rehabilitation Engineering, IEEE Transactions on, vol. 17, no. 4, pp. 312-321, 2009. [16] S. Mallat, A wavelet tour of signal processing, Academic Press, 1999. [17] K. Oweiss, "A systems approach for data compression and latency reduction in cortically controlled brain machine interfaces," Biomedical Engineering, IEEE Transactions on, vol. 53, no. 7, pp. 1364-1377, 2006. [18] D. Donoho, "Denoising by soft-thresholding," Information Theory, IEEE Transactions on, vol. 41, no. 3, pp. 613-627, 1995. 215 [19] K. Parhi, T. Nishitani, "Folded VLSI architectures for discrete wavelet transforms," 1993 IEEE International Symposium on Circuits and Systems, pp. 1734-1737, 1993. [20] C. Chakrabarti, M. Vishwanath, R. Owens, "Architectures for wavelet transforms: A survey," The Journal of VLSI Signal Processing, vol. 14, no. 2, pp. 171-192, 1996. [21] H. Olkkonen, J. Olkkonen, P. Pesola, "Efficient lifting wavelet transform for microprocessor and VLSI applications," IEEE Signal Processing Letters, vol. 12, no. 2, pp. 120-122, 2005. [22] H. Liao, M. Mandal, B. Cockburn, "Efficient architectures for 1-D and 2-D lifting-based wavelet transforms," Signal Processing, IEEE Transactions on, vol. 52, no. 5, pp. 1315-1326, 2004. [23] C. Huang, P. Tseng, L. Chen, "VLSI architecture for forward discrete wavelet transform based on B-spline factorization," The Journal of VLSI Signal Processing, vol. 40, no. 3, pp. 343-353, 2005. [24] A. Kamboh, A. Mason, K. Oweiss, "Analysis of Lifting and B-Spline DWT Implementations for Implantable Neuroprosthetics," Journal of Signal Processing Systems, vol. 52, no. 3, pp. 249-261, 2008. [25] I. Daubechies, W. Sweldens, "Factoring wavelet transforms into lifting steps," Journal of Fourier Analysis and Applications, vol. 4, no. 3, pp. 247269, 1998. [26] K. Kotteri, S. Barua, A. Bell, J. Carletta, "A comparison of hardware implementations of the biorthogonal 9/7 DWT: convolution versus lifting," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 52, no. 5, pp. 256-260, 2005. [27] C. Huang, P. Tseng, L. Chen, "Analysis and VLSI architecture for 1-D and 2-D discrete wavelet transform," IEEE Transactions on Signal Processing, vol. 53, no. 4, pp. 1575-1586, 2005. [28] C. Huang, P. Tseng, L. 
Chen, "Flipping structure: An efficient VLSI architecture for lifting-based discrete wavelet transform," IEEE Transactions on Signal Processing, vol. 52, no. 4, pp. 1080-1089, 2004. 216 [29] M. Unser, T. Blu, "Wavelet theory demystified," IEEE Transactions on Signal Processing, vol. 51, no. 2, pp. 470-483, 2003. [30] K. Andra, C. Chakrabarti, T. Acharya, others, "A VLSI architecture for lifting-based forward and inverse wavelet transform," IEEE Transactions on Signal Processing, vol. 50, no. 4, pp. 966-977, 2002. [31] K. Oweiss, "Compressed and Distributed Sensing of Multivariate Neural Point Processes," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 577-580, 2007. [32] K. Oweiss. Multiresolution analysis of multichannel neural recordings in the context of signal detection, estimation, classification and noise suppression. University of Michigan, Ann Arbor. 2002. [33] K. Oweiss, "Compressed sensing of large-scale ensemble neural activity with resource-constrained cortical implants," Society of Neuroscience Abstracts, 2006. [34] K. Oweiss, D. Anderson, M. Papaefthymiou, "Optimizing signal coding in neural interface system-on-a-chip modules," IEEE conference on EMBS, pp. 2016-2019, 2003. [35] K. Oweiss, A. Mason, Y. Suhail, A. Kamboh, K. Thomson, "A scalable wavelet transform VLSI architecture for real-time signal processing in highdensity intra-cortical implants," IEEE Transactions on Circuits and Systems Part I, vol. 54, no. 6, pp. 1266-1278, 2007. [36] A. Mason, J. Li, K. Thomson, Y. Suhail, K. Oweiss, "Design optimization of integer lifting DWT circuitry for implantable neuroprosthetics," IEEE/EMBS Special Topics Conference on Microtechnologies in Medicine and Biology, pp. 136-139, 2005. [37] K. Thomson, Y. Suhail, K. Oweiss, "A Scalable Architecture for Streaming Neural Information from Implantable Multichannel Neuroprosthetic Devices," IEEE International Symposium on Circuits and Systems, pp. 1342-1345, 2005. [38] Y. Suhail, K. Oweiss, "A reduced complexity integer lifting wavelet based module for real-time processing in implantable neural interface devices," IEEE International Conference on Engineering in Medicine and Biology, pp. 4552-4555, 2004. 217 [39] B. Wu, C. Lin, "A rescheduling and fast pipeline VLSI architecture for liftingbased discrete wavelet transform," IEEE International Symposium on Circuits and Systems, pp.732-735, 2003. [40] A. Shams, T. Darwish, M. Bayoumi, "Performance analysis of low-power 1bit CMOS full adder cells," IEEE Transactions on Very Large Scale Integration Systems, vol. 10, no. 1, pp. 20-29, 2002. [41] Z. Huang. High-level optimization techniques for low-power multiplier design. University of California, Los Angeles. 2003. [42] B. Yang, L. Kim, "A low-power sram using hierarchical bit line and local sense amplifiers," IEEE Journal of Solid-State Circuits, vol. 40, no. 6, pp. 1366-1376, 2005. [43] A. Kamboh, M. Raetz, K. Oweiss, A. Mason, "Area-power efficient VLSI implementation of multichannel DWT for data compression in implantable neuroprosthetics," Biomedical Circuits and Systems, IEEE Transactions on, vol. 1, no. 2, pp. 128-135, 2007. [44] A. Kamboh, M. Raetz, A. Mason, K. Oweiss, " Area-Power Efficient LiftingBased DWT Hardware for Implantable Neuroprosthetics," Circuits and systems. IEEE International Symposium on, pp. 2371-2374, 2007. [45] K. Oweiss, Y. Suhail, K. Thomson, J. Li, A. 
Mason, " Augmenting real-time DSP in implantable high-density neuroprosthetic devices," Microtechnology in Medicine and Biology, IEEE/EMBS Special Topic Conference on, pp. 108-111, 2005. [46] C. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, vol. 27, pp. 379-423, 1948. [47] J. Grad, J. Stine, "A standard cell library for student projects," International conference on microelectronic systems education, pp. 98-99, 2003. [48] R. Sarpeshkar, W. Wattanapanitch, B. Rapoport, S. Arfin, M. Baker, S. Mandal, M. Fee, S. Musallam, R. Andersen, "Low-power circuits for brainmachine interfaces," IEEE Transactions on Biomedical Circuits and Systems, vol. 2, no. 3, pp. 173-183, 2008. [49] A. Tanenbaum, Computer Networks, Prentice Hall, 1981. 218 [50] A. Kamboh, K. Oweiss, A. Mason, " Resource constrained VLSI architecture for implantable neural data compression systems," Circuits and systems, IEEE International Symposium on, pp. 1481-1484, 2009. [51] M. Ghovanloo, S. Atluri, "A wide-band power-efficient inductive wireless link for implantable microelectronic devices using multiple carriers," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 54, no. 10, pp. 2211-2221, 2007. [52] M. Aghagolzadeh, K. Oweiss, "Compressed and Distributed Sensing of Neuronal Activity for Real Time Spike Train Decoding," Neural Systems and Rehabilitation Engineering, IEEE Transactions on, vol. 17, no. 2, pp. 116-127, 2009. [53] M. Ghovanloo, and Suresh Atluri, “An integrated full-wave CMOS rectifier with built-in telemetry for RFID and implantable biomedical applications,” IEEE Transactions on Circuits and Systems I, vol. 55, no. 10, pp 33283334, Nov 2008. [54] S. A. Khayam, “Analysis and Modeling of Errors and Losses Over 802.11b Networks,” M.S. Thesis, Dept. Elect. Eng., Michigan State Univ., East Lansing, MI, 2003. [55] S. Karande, U. Parrikar, K. Misra, and H. Radha, “On Modeling of 802.11b Residual Errors,” Conference on Information Science and Systems, pp. 283-288, March 2006. [56] M. U. Ilyas, and H. Radha, “Measurement Based Analysis and Modeling of the Error Process in IEEE 802.15.4 LR-WPANs,” IEEE INFOCOM Conf on Computer Communication, pp. 1274-1282, Apr 2008. [57] M. U. Ilyas, “Analytical and Quantitative Characterization of Wireless Sensor Networks,” Ph.D. dissertation, Dept. Elect. Eng., Michigan State Univ., East Lansing, MI, 2009. [58] A. M. Kamboh, Y. Yang, K. G. Oweiss and A. J. Mason, “Design of a Configurable Neural Data Compression System for Intra-Cortical Implants,” IEEE International Symposium on Circuits and Systems, pp. 3473-3476, Paris, France, June 2010. [59] M. Ghovanloo, K. D. Wise, and K. Najafi, “Towards a Button-Sized 1024Site Wireless Cortical Microstimulating Array,” IEEE/EMBS Conference on Neural Engineering, pp. 138-141, March 2003. 219 [60] Z. Zumsteg, R. Ahmed, G. Santhanam, K. Shenoy, and T. Meng, “Power Feasibility of Implantable Digital Spike-Sorting Circuits for Neural Prosthetic Systems,” IEEE Proceedings on Engineering in Medicine and Biology Society, pp. 4237-4240, Sep. 2004. [61] S. Gibson, J. Judy, and D. Markovic, “Comparison of Spike Sorting Algorithms for Future Hardware Implementation,” IEEE Proceedings on Engineering in Medicine and Biology Society, pp. 5015-5020, Aug. 2008. [62] A. Zviagintsev, Y. Perelman, and R. Ginosar, “Low Power Architectures for Spike Sorting,” IEEE Conference on Neural Engineering, pp. 16-19, Mar. 2005. [63] C. Rogers, J. Harris, J. Principe, and J. 
[63] C. Rogers, J. Harris, J. Principe, and J. Sanchez, "A Pulse Based Feature Extractor for Spike Sorting Neural Signals," IEEE Conference on Neural Engineering, pp. 490-493, May 2007.