This is to certify that the thesis entitled HARDWARE CONSIDERATIONS FOR SPACE-TIME PROCESSING IN IMPLANTABLE NEUROPROSTHETIC DEVICES, presented by KYLE E. THOMSON, has been accepted towards fulfillment of the requirements for the M.S. degree in Electrical Engineering.

Major Professor's Signature / Date

MSU is an Affirmative Action/Equal Opportunity Institution.

HARDWARE CONSIDERATIONS OF SPACE-TIME PROCESSING IN IMPLANTABLE NEUROPROSTHETIC DEVICES

By Kyle E. Thomson

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
Department of Electrical and Computer Engineering
2006

ABSTRACT

HARDWARE CONSIDERATIONS OF SPACE-TIME PROCESSING IN IMPLANTABLE NEUROPROSTHETIC DEVICES

By Kyle E. Thomson

Advances in microfabrication technology have allowed hundreds of microelectrodes to be implanted in the vicinity of small populations of neurons in the cortex, opening new avenues for neural interface technology to unveil many mysteries about the connectivity and functionality of the nervous system at the single-cell and population levels. However, many challenges have arisen. One particular challenge is to transmit the recorded neural data from the implanted device to the outside world for further analysis. Previous work has shown that the incorporation of advanced signal processing algorithms on chip enables efficient compression of the data while maintaining high signal fidelity. However, adapting these signal processing algorithms to be hardware friendly requires a nontrivial solution due to the severe constraints imposed on an implantable system. Chip size, signal fidelity, and computational complexity must be balanced to create an ASIC device tailored to this application. This thesis focuses on critical hardware considerations for implementing these algorithms for an implantable neuroprosthetic signal processor operating in vivo in real time. Detailed performance analysis and suggestions for future work are presented.

Copyright © by Kyle Thomson 2006

To my family and friends

ACKNOWLEDGEMENTS

I would first and foremost like to acknowledge my advisor, Dr. Karim G. Oweiss. Through his guidance and support, I was able to complete my Master's degree. Additionally, he has given me many opportunities to interact with members of the neural engineering community. For that, I am extremely grateful. I would also like to acknowledge Dr. Erik Goodman for his support, both financial and emotional, through my graduate career. His efforts to ensure that I could continue my graduate career were greatly appreciated. I would also like to thank my collaborators, Dr. Andrew Mason and Dr. Shantanu Chakrabartty. The many collaborative efforts between our labs gave new insights and ideas to approaching our research. I want to thank my fellow lab mates, who made the experience of being in the lab an enjoyable one, in addition to aiding in my research; including Yasir Suhail, April Thompson, Scott Chrispell, Martin Gerster, Michael Shetliffe, and Seif El-Dawlatly.
I would not have been able to make it through my graduate career without the love and support of my parents, my sister, and my new brother-in-law. Their support and caring words through the years made my degree possible.

Table of Contents

List of Figures ............................................................................... viii
List of Tables ................................................................................. ix
CHAPTER 1 ..................................................................................... 1
1.1 Introduction ............................................................................... 1
1.2 Signal Processing Algorithms ................................................... 4
1.3 Hardware Considerations .......................................................... 6
1.4 Multichannel Processing ........................................................... 8
1.5 Implementation of System Components ................................... 8
CHAPTER 2 ..................................................................................... 10
2.1 Theory ....................................................................................... 10
2.1.1 Lifting Factorization .............................................................. 12
2.1.2 B-spline Factorization ........................................................... 14
2.1.3 Integer Approximation .......................................................... 15
2.1.4 Multi-level Processing ........................................................... 16
2.2 Hardware Implementations ....................................................... 17
2.2.1 Lifting Implementation .......................................................... 18
2.2.2 B-spline Implementation ....................................................... 21
2.3 Results ....................................................................................... 22
2.3.1 Lifting and B-spline Comparison .......................................... 23
2.3.2 Integer Approximation .......................................................... 25
2.4 Conclusion ................................................................................ 26
CHAPTER 3 ..................................................................................... 28
3.1 Theory ....................................................................................... 28
3.1.1 Low Rank Approximation ..................................................... 30
3.2 Hardware Considerations .......................................................... 31
3.2.1 Integer Approximation .......................................................... 31
3.2.2 Truncation .............................................................................. 32
3.2.3 Fixed Low-Rank Approximation ........................................... 33
3.3 Results ....................................................................................... 34
3.4 Conclusion ................................................................................ 37
CHAPTER 4 ..................................................................................... 39
4.1 Theory ....................................................................................... 39
4.1.1 Modified RLE ........................................................................ 40
4.1.2 Multichannel and Multilevel Processing ............................... 41
4.2 Hardware Considerations .......................................................... 42
4.2.1 Multichannel Processing ........................................................ 43
4.2.2 Multi-Level Processing .......................................................... 43
4.3 Results ....................................................................................... 44
4.3.1 Modified RLE ........................................................................ 44
4.3.2 Vertical RLE .......................................................................... 45
4.4 Conclusion ................................................................................ 48
CHAPTER 5 ..................................................................................... 49

List of Figures

Figure 1.1: (a) Close-up of NeuroNexus probes; (b) NeuroNexus implantable acute probes [23] ........... 3
Figure 1.2: Proposed neuroprosthetic system design ........... 3
Figure 1.3: Four Spike Templates ........... 4
Figure 2.1: Flow of the Discrete Wavelet Transform ........... 11
Figure 2.2: Filter steps of the lifting factorization ........... 13
Figure 2.3: Execution pattern of a 13-sample interval ........... 17
Figure 2.4: Hardware block for performing a single lifting step ........... 18
Figure 2.5: Hardware-intensive implementation of the lifting approach [13] ........... 20
Figure 2.6: Size-efficient implementation of the lifting approach [13] ........... 21
Figure 2.7: B-spline hardware, adapted from [9] ........... 22
Figure 2.8: B-spline hardware notation ........... 22
Figure 2.9: Distortion caused by integer approximation ........... 25
Figure 2.10: Distortion based on wavelet levels ........... 26
Figure 3.1: Effects of integer approximation on spatial filtering ........... 35
Figure 3.2: A comparison between e_LRA, e_IA+LRA, e_TR and e_IA+LRA+TR as a function of B. M=8, P=8, Q=3 ........... 36
Figure 3.3: A comparison of e_LRA and e_IA+LRA+TR as a function of Q/M for B=3, 4 and 5. M=32, P=32 ........... 37
Figure 4.1: Block diagram of a typical RLE for single-channel compression ........... 42
Figure 4.2: Modified RLE block diagram ........... 43
Figure 4.3: Typical vs. Modified RLE ........... 44
Figure 4.4: Vertical vs. Horizontal RLE for 4-channel clusters ........... 46
Figure 4.5: Vertical vs. Horizontal RLE for 16 channels ........... 47

List of Tables

Table 2.1: Coefficients of the symmlet4 family of wavelet basis [8] ........... 11
Table 2.2: Coefficients of the Lifting Wavelet Transform [12] ........... 14
Table 2.3: Coefficients of the B-spline Factorization ........... 15
Table 2.4: Results from Lifting and B-spline implementation ........... 23
Table 4.1: Examples of RLE Encoding ........... 40
Table 4.2: Typical vs. Modified RLE ........... 41

Images in this thesis are presented in color.

CHAPTER 1
Hardware Considerations of Space-Time Processing in Implantable Neuroprosthetic Devices

1.1 Introduction

Handicapped patients, such as quadriplegics and paraplegics, have greatly benefited from advances in the field of neuroprosthetic devices. A neuroprosthetic device takes its input by reading brain activity and uses that information to control an external device, ranging from mechanical hands to computer cursors [1][3]. This gives patients who previously had little to no control over their environment a new sense of freedom and independence.

Research in the field of neuroprosthetics has been significantly assisted by the introduction of high-density Microelectrode Recording Arrays (MEAs). An MEA is an array of sensors that can be implanted into the cortex, recording the neural activity from the surrounding cortical tissue. The number of sensors per device has grown exponentially, and there are now devices that have as many as 1000 closely spaced sensors on a single array [1]. The increasing density allows neuroscientists to better analyze entire populations of neurons and understand the inner workings of these populations. However, the increase in data being recorded results in an increase in the bandwidth required to transmit the data.

Currently, MEAs [23], shown in Figure 1.1, are connected from beneath the skull to computers outside the cortex using large wires. Wired systems are a source of infection, which can easily become fatal due to the location of the implant. Additionally, wired systems cause patient discomfort due to the movement of the wires against the surrounding skin tissue. Thus, an externally powered wireless system is the most advantageous route for long-term MEA implantation. However, as the density of sensors on an MEA increases, the raw data cannot possibly be transmitted with available implantable wireless technology. Thus, the data must be pre-processed [1].

Figure 1.1: (a) Close-up of NeuroNexus probes; (b) NeuroNexus implantable acute probes [23]

Advanced signal processing algorithms have shown that neural data can be compressed by as much as 90% while still retaining the information required to determine the workings of a neural population [4].
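To put that figure in perspective, a rough bandwidth estimate shows why transmitting raw data is impractical for a wireless implant. The numbers used here are representative only: 1000 recording sites as mentioned above, and the 25 kHz sampling rate and 10-bit samples cited in later chapters.

\[
1000 \ \text{channels} \times 25{,}000 \ \tfrac{\text{samples}}{\text{s}} \times 10 \ \tfrac{\text{bits}}{\text{sample}} = 250 \ \text{Mbit/s (raw)},
\qquad
250 \ \text{Mbit/s} \times (1 - 0.90) = 25 \ \text{Mbit/s (after 90\% compression)}.
\]

Even the compressed rate remains demanding for a transcutaneous telemetry link, which is why as much of the data reduction as possible must happen inside the implant before transmission.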
The system design shown in Figure 1.2 has been proposed by adapting these signal processing algorithms to process the data in vivo [4]. The area external to the green box is existing hardware, currently used with MEAs. The focus of this work is to present the considerations required when adapting the advanced signal processing algorithms to implantable neuroprosthetic devices.

Figure 1.2: Proposed neuroprosthetic system design (front-end analog stage, spatial filter, and encoder)

1.2 Signal Processing Algorithms

To understand the signal processing algorithms used to compress neural signals, the mechanics of neural recordings must be understood. MEAs record the electrical activity of neurons in the form of action potentials. Action potentials, or neural spikes, are pulses of voltage that travel from the cell body of a neuron down the neuron's axon, where they are picked up by the dendrites of a different neuron. Spikes are the means by which the brain receives and processes data and creates commands for the body [2]. Figure 1.3 shows an example of four spikes.

Figure 1.3: Four spike templates (amplitude in µV versus time)

Controlling an implantable neuroprosthetic device with the information recorded from an MEA requires capturing the spikes sent to and from the neurons surrounding the implant site. Neurons that produce discernable spikes in an MEA recording, i.e. spikes that can be reliably extracted from the recorded data, are called units. The observation model for the MEA data assumes that P units impinge on an array of M recording sites within the discrete-time recording interval T = [t_1, ..., t_N]. The recording matrix Y is expressed as

Y = AS + Z    (1.1)

where A ∈ ℝ^{M×P} denotes the mixing matrix that expresses the array response to the P recorded units, S ∈ ℝ^{P×N} denotes the recorded signal from each unit, and Z ∈ ℝ^{M×N} denotes the noise from neurons in the surrounding tissue. The noise is a zero-mean additive component with an arbitrary spatial and temporal correlation.

Isolating the spike activity in Y is required for neuroscientists to understand and use a population of neurons. First, the spikes must be detected, which is called spike detection. Then, the spike activity is associated with the unit that produced it, which is defined as spike sorting. Spike sorting is a computationally complex algorithm and is not well suited for implementation in an in vivo signal processor [4]. Additionally, traditional spike detection based on simple thresholding is often inaccurate [22]. A proposed solution to these problems is the use of the Discrete Wavelet Transform.

The Discrete Wavelet Transform (DWT) is a cascade of high-pass and low-pass filters which localize signals in both time and frequency [8]. This removes temporal redundancy by localizing the spikes into a few large coefficients, while presumably leaving the noise as many small coefficients. Previous work has shown that the DWT is well suited for compression of neural signals [4]. Additionally, the DWT has shown many other useful properties that aid in processing neural signals. Algorithms for spike detection and spike sorting can be improved through use of DWT coefficients.

While the DWT removes the temporal redundancy from the neural recorded signals, there is still redundancy in the spatial mixing. Spatial filtering can reduce the spatial redundancy, thus reducing the number of channels of data being sent [4]. The physical channels, or electrode sites, can be reduced to a few principal channels.
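The spatial redundancy that makes this reduction possible follows directly from the observation model in (1.1): the M channels share energy from only P underlying sources. The short sketch below generates synthetic data of that form; the dimensions, spike waveform, firing times, and noise level are invented for illustration and are not taken from any recording.

```python
import numpy as np

rng = np.random.default_rng(0)
M, P, N = 16, 4, 1000          # recording sites, units, time samples (illustrative)

# S: each unit fires a simple biphasic "spike" at random times (illustrative waveform)
spike = np.concatenate([np.hanning(10), -0.4 * np.hanning(14)])
S = np.zeros((P, N))
for p in range(P):
    for t in rng.choice(N - spike.size, size=8, replace=False):
        S[p, t:t + spike.size] += spike

A = rng.normal(size=(M, P))          # mixing matrix: array response to the P units
Z = 0.1 * rng.normal(size=(M, N))    # additive noise from surrounding tissue (white here)

Y = A @ S + Z                        # observation model (1.1)
print(Y.shape)                       # (M, N): M channels, but energy from only P sources
```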
Principal channels contain a compacted version of all of the raw data recorded by the MEA. Additionally, the recorded data can be completely reconstructed by sending the parameters used to perform the spatial filtering. By reducing the number of transmitted channels, the effective bandwidth is reduced while maintaining full signal fidelity.

The previous signal processing algorithms reduce spatial and temporal redundancy in the recorded neural signals; however, they do not result in actual data compression [22]. Instead, the majority of the information about the neural signal is compacted into a few large coefficients, clustered among many small coefficients containing very little information or noise. To have actual data compression, the small coefficients must be discarded, and the large coefficients as well as their relationship to the discarded small coefficients must be maintained. A method known as Run-Length Encoding (RLE) [21] is used to compress the transformed signal. While RLE is not an optimal compression algorithm, it has the benefit of low computational complexity and can be performed in real time on streaming data. The goal of RLE is to remove consecutive repetitive coefficients. RLE is well suited to the implantable environment due to its computational simplicity and streaming processing.

To use these algorithms effectively, signal processing hardware must be designed with the goal of maximum efficiency while respecting the constraints of an implantable system.

1.3 Hardware Considerations

Most modern applications utilizing signal processing algorithms can rely on advanced, multipurpose DSP chips to perform the required operations. This approach is inexpensive and time-effective to implement. Neuroprosthetic devices, however, do not allow for such a simple solution. The implantable environment requires the design of an Application Specific Integrated Circuit (ASIC). To design an implantable signal processor, the algorithms must be adapted so that they can be implemented in a hardware-friendly manner, and hardware considerations specific to the implantable neuroprosthetic context must be made to simplify the processing.

Design of an ASIC for a neuroprosthetic device requires special considerations. First, the device must be low powered for two reasons: the power is transmitted externally through use of RF, which can only deliver minimal amounts of energy, and the cortical tissue surrounding the implant cannot be subjected to heating that raises the tissue temperature by more than 1-2 °C, a limit that prevents damage to the surrounding brain tissue. This severely limits the power dissipation and clock speed of the neuroprosthetic device. Additionally, the implant size is constrained because the implant cannot rise more than 1 mm above the cortical surface; a larger implant would compromise its stability.

However, certain factors alleviate the constraints in designing the neural signal processor. First, all channels are typically sampled at the biologically relevant sampling rate of 25 kHz [3]. This provides a 40 µs sampling interval in which to perform any calculations on the incoming sample set. Ideally, all signal processing required on a given sample would be performed in that interval, precluding the need for buffering and reducing the memory requirements for the design. Additionally, a 14 mm burr hole used by neurosurgeons is sufficient for a package of several hundred thousand transistors [1].
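As a rough sanity check on that budget, a common rule of thumb of about four transistors per two-input-gate equivalent (an assumption, not a figure from this thesis) puts several hundred thousand transistors at roughly

\[
\frac{400{,}000 \ \text{transistors}}{4 \ \tfrac{\text{transistors}}{\text{gate}}} \approx 100{,}000 \ \text{gate equivalents},
\]

which is well above the few-thousand-gate integer designs reported in Chapter 2, leaving room for per-channel memory.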
This provides enough transistors to perform low-complexity algorithms and to maintain some internal memory.

1.4 Multichannel Processing

MEA data is sampled across all neural recording channels simultaneously. While ideally each channel would have dedicated processing hardware of its own, the hardware cost of such an implementation is prohibitive. Thus, when considering system designs for implementing the aforementioned signal processing algorithms, the idea of using shared hardware must be considered: at each sampling interval, the hardware for executing each process must be reused for each of the channels. This minimizes the computation hardware and maximizes the available room for memory, thus allowing more channels to be processed. Additionally, all data should be handled in a streaming fashion. Buffering of data causes delays in the system and also requires additional memory that could otherwise be used for increasing the number of channels the processor can handle. These factors must be considered when designing the system for a multichannel neuroprosthetic device. Thus, when discussing the system, it will be assumed that data is available across all channels, and that all data being recorded must leave the system at the same rate at which it is recorded, with only minimal delay.

1.5 Implementation of System Components

ASIC design consists of taking the signal processing algorithms and finding the most computationally simple method to execute them. Additionally, trade-offs between precision and memory must be accounted for. The goal of this work is to present these trade-offs in a quantitative manner, so that the decisions can be made before the time-intensive process of implementing the algorithms in VLSI. The focus of this work is to make considerations for implementing the aforementioned signal processing algorithms for a minimized implantable application, suited for hardware design. Chapter 2 deals with the design of the Discrete Wavelet Transform (DWT) block. Chapter 3 deals with the design of the Spatial Filter block. Chapter 4 discusses the hardware implementation of the compression algorithm, Run-Length Encoding.

CHAPTER 2
Discrete Wavelet Transform System Design

2.1 Theory

The Discrete Wavelet Transform (DWT) is performed on a data set by convolving a high-pass filter H(z) and a low-pass filter L(z) with the data set being compressed. This results in data that is localized in both time and frequency [8]. The wavelet transform is a recursive filter. The result of the high-pass filter is stored and defined as the detail coefficients. The results of the low-pass filter, defined as the approximation coefficients, are fed back into the wavelet transform [8]. This is called higher levels of decomposition, illustrated by Figure 2.1.

Figure 2.1: Flow of the Discrete Wavelet Transform (legend: raw data, approximation coefficients, detail coefficients)

It has been shown that the symmlet4 family of wavelet basis results in the best compression of neural signals [4]. Therefore, we will focus on this basis. The filter coefficients of the discrete symmlet4 basis, composed of the high-pass filter H(z) and the low-pass filter L(z), are listed in Table 2.1.

Lag     z^0     z^-1    z^-2    z^-3    z^-4    z^-5    z^-6    z^-7
H(z)    -.076   -.030   .498    .804    .298    -.099   -.013   .032
L(z)    -.032   -.013   .099    .298    -.804   .498    .030    -.076

Table 2.1: Coefficients of the symmlet4 family of wavelet basis [8]

Certain factorizations provide a more computationally simple approach than performing the two discrete convolutions required by the DWT.
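For reference, a behavioural sketch of the direct filter-bank form (the two convolutions plus downsampling that these factorizations replace) is given below. PyWavelets is used only to supply the symmlet4 taps; the function name, the downsampling phase, and the boundary handling are illustrative choices and do not describe the hardware datapath.

```python
import numpy as np
import pywt

def dwt_level(x, wavelet="sym4"):
    """One level of the direct DWT: convolve with L(z) and H(z), keep every other sample."""
    w = pywt.Wavelet(wavelet)
    lo, hi = np.asarray(w.dec_lo), np.asarray(w.dec_hi)   # 8-tap symmlet4 filters
    approx = np.convolve(x, lo)[1::2]   # approximation coefficients (low-pass branch)
    detail = np.convolve(x, hi)[1::2]   # detail coefficients (high-pass branch)
    return approx, detail

x = np.random.randn(256)
a, d = dwt_level(x)
for _ in range(2):                       # recurse on the approximation for more levels
    a, d = dwt_level(a)
```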
Two such approaches have previously been shown to reduce the computational complexity and memory requirements: the B-spline and lifting factorizations [16][7]. These factorizations must be analyzed for their cost in hardware, their computational complexity, and their suitability for implementation in an implantable neuroprosthetic device. Additionally, it has been shown that the lifting symmlet4 wavelet decomposition can be performed using coefficients approximated to integers [12]. Integer-based math requires less complex hardware than fixed-point or floating-point math.

2.1.1 Lifting Factorization

The lifting approach [7] takes the two wavelet filters, H(z) and L(z), splits each into its even and odd polyphase components, and factorizes the resulting polyphase matrix into n lifting steps S_n(z) and T_n(z) as

P(z) = [ L_e(z)  L_o(z) ; H_e(z)  H_o(z) ] = Π_n [ 1  S_n(z) ; 0  1 ] [ 1  0 ; T_n(z)  1 ] K    (2.1)

Each of the filter steps takes the form

S_{n,i,j} = C_i h_n + C_j h_n z^{-1}
T_{n,i,j} = C_i f_n + C_j f_n z^{+1}    (2.2)

where n is the number of the filter step and C_i, C_j are the constants resulting from factorizing the DWT. This creates a filter bank, shown in Figure 2.2.

Figure 2.2: Filter steps of the lifting factorization

It should be noted that the lifting approach splits each sampling window into odd and even sample pairs. For a factorization of the symmlet4 wavelet, the filtering steps corresponding to Figure 2.2 can be written as

step 1: h_1(t) = h_0(t) + C_1 · f_0(t)
step 2: f_1(t) = f_0(t) + C_2 · h_1(t) + C_3 · h_1(t+1)
step 3: h_2(t) = h_1(t) + C_4 · [...]

Figure 2.7: B-spline hardware, adapted from [9]

Figure 2.8: B-spline hardware notation (a block labeled C_1 + C_2 z^-1 + C_3 z^-2 denotes a three-tap filter section)

It can be noted that the amount of hardware required for performing the B-spline factorization is repetitive for both the Q(z) and R(z) parts. Thus, to make a size-efficient version, the top half could be reused, simply by alternating between the Q(z) and R(z) coefficients on each sampling interval.

2.3 Results

Analysis of the proposed implementations relies on determining the best balance between the approximated size and computation speed. Size is approximated by transistor counts. Computation speed is approximated by the critical path. Specific details on power consumption, actual size, and processing latency require the hardware to be fully created as a VLSI design.

2.3.1 Lifting and B-spline Comparison

Table 2.4 illustrates the comparison between the two factorizations implemented in hardware. Floating-point multipliers, adders, and memory blocks were 24-bit floating-point designs written in Verilog and synthesized to a custom library. Integer multipliers, adders, and memory blocks are fully customized designs tailored to 10-bit data with 5-bit coefficients [14].

Implementation              B-spline       Lifting
Hardware intensive
  Multipliers               6              8
  Adders                    18             8
  Memory                    8              11
  Critical path             Tm + 5Ta       Tm + 2Ta
  Floating-point gates      45504          54944
  Integer gates             9426           10008
Size efficient
  Multipliers               3              2
  Adders                    9              2
  Memory                    8              4
  Critical path             2Tm + 10Ta     4Tm + 8Ta
  Floating-point gates      22752          13736
  Integer gates             4713           2502

Table 2.4: Results from the lifting and B-spline implementations

Previous work has shown [14] that as many as 512 channels of data can be processed by the size-efficient lifting hardware in the available sampling interval. This implies that the critical path does not constrain the design. Figure 2.8 shows the approximated total number of gates for a given number of channels. There is a large divergence between B-spline and lifting as the number of channels being processed increases.
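The lifting datapath that dominates these size-efficient figures can also be read as software. The sketch below walks through one lifting update of the form in (2.2): the integer constants are placeholders standing in for Table 2.2 (whose values are not legible in this copy), the 5-bit right shift stands in for the division by 2^B, and the boundary is handled circularly purely for brevity, so this shows the structure of a lifting step rather than the exact symmlet4 factorization.

```python
import numpy as np

B = 5                      # bits used to approximate each lifting constant (per Chapter 2)

def lift_step(primary, other, c_i, c_j, shift=0):
    """One lifting update: primary[t] += (c_i*other[t] + c_j*other[t+shift]) >> B.

    c_i and c_j are B-bit integer approximations of the real lifting constants;
    the arithmetic right shift replaces the division by 2**B.
    """
    other_shifted = np.roll(other, -shift)       # circular at the boundary, for simplicity
    return primary + ((c_i * other + c_j * other_shifted) >> B)

# Split a window into even/odd samples, then apply a few illustrative steps.
x = np.random.randint(-512, 512, size=256)       # 10-bit samples, as in Chapter 2
even, odd = x[0::2].copy(), x[1::2].copy()

C1, C2, C3 = -12, 3, 7        # placeholder integer constants, NOT the Table 2.2 values
odd  = lift_step(odd,  even, C1, 0)              # step 1: predict odd from even
even = lift_step(even, odd,  C2, C3, shift=1)    # step 2: update even from odd
# ...further steps and the output scaling follow the same pattern.
```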
The divergence seen in Figure 2.8 is due to the difference in memory requirement of the two architectures, and it shows that the dominating factor in the size of a given implementation is the memory required per channel and per decomposition level.

Figure 2.8: Size comparison between the size-efficient approaches (total number of gates versus number of channels, for lifting and B-spline at 2, 3, 4, and 5 decomposition levels)

2.3.2 Integer Approximation

Previous work [12] has shown that for single-level decomposition, as few as 4 bits can be used to approximate the wavelet coefficients for the lifting approach. Further analysis must consider the use of 4- and 5-bit integer approximation for multi-level wavelet decomposition. Figure 2.9 shows the distortion caused by the approximated coefficients for 4 and 5 bits. It can be seen that the losses incurred by approximating the wavelet coefficients to integers are larger for 4 bits.

Figure 2.9: Distortion caused by integer approximation (7-level reconstructions with 4-bit and 5-bit coefficients, overlaid on the original post-wavelet signal)

Figure 2.10 shows the MSE distortion of spike waveforms. One hundred random noiseless spike trains of 256 samples were generated and then decomposed and reconstructed with 1 to 7 levels of wavelet transform. The quality measure shown is the ratio of the energy of the error to the total signal energy.

Figure 2.10: Distortion based on wavelet levels (error energy to total signal energy ratio versus decomposition level, for 4-bit and 5-bit quantization)

2.4 Conclusion

Two architectural implementations of the Discrete Wavelet Transform have been considered. Additionally, each implementation has been considered in two forms, one minimizing hardware and one minimizing delay. The minimized-hardware approach to implementing the lifting DWT has shown many benefits for an implantable neuroprosthetic device. The minimal memory requirement of the size-efficient lifting architecture maximizes the ability to process the largest number of channels. Additionally, its increased critical path does not constrain the number of channels that can be processed on the device. Integer approximation of wavelet coefficients was also considered for multilevel decomposition. It has been shown that multilevel integer-approximated DWT can be performed with minimal losses to signal energy.

CHAPTER 3
Spatial Filter System Design

3.1 Theory

The redundancy caused by observing the signals with multiple closely spaced electrodes can be reduced using advanced signal processing techniques. A spatial filter decorrelates the data by aggregating the signal energy spread across many physical channels into a few principal channels that can be further filtered and compressed [4]. This process is known as whitening the data, while the inverse process will be referred to as coloring the data. The actual process involved in performing spatial filtering is computationally complex.
However, like the DWT, certain concessions can be made to make the spatial filter feasible to implement in hardware while causing minimal and acceptable losses to signal fidelity. These concessions must be analyzed for their validity.

Decorrelation in space can be optimally achieved by first estimating the sample spatial covariance matrix [17] of the data as

R_Y = (1/(N-1)) Σ_{n=1}^{N} y[n] y^T[n]    (3.2)

where y[n] ∈ ℝ^{M×1} is the array snapshot at time sample n, and N denotes the length of the time interval from which the filter is estimated. This covariance can be further decomposed using singular value decomposition (SVD) [17] to yield

R_Y = U_0 D_0 U_0^T = Σ_{m=1}^{M} λ_m u_m u_m^T    (3.3)

where λ_m denotes the m-th eigenvalue corresponding to the m-th diagonal entry in D_0, and U_0 = [u_1, u_2, ..., u_M] ∈ ℝ^{M×M} comprises the eigenvectors spanning the column space of R_Y. The whitening process is performed by the following matrix multiplication

Ỹ = U_0^T Y    (3.4)

The matrix U_0 is the optimal whitening matrix, and its entries are generally represented by floating-point numbers.

3.1.1 Low Rank Approximation

One way to reduce the number of transmitted channels in Ỹ is to use low-rank approximation (LRA) [17] to shrink the sum in (3.3) to only the principal channels comprising most of the signal energy. This is guaranteed if P ≤ M and the P signal sources have orthogonal waveforms. However, in a typical neural recording experiment, both conditions are often violated. Spike waveforms from distinct neurons are generally not orthogonal amongst each other, and typically the number of single units recorded exceeds the number of electrode channels. Nevertheless, the principal role of the spatial processing stage is not to sort out the distinct signal sources, but rather to reduce the overall signal energy impinging on the M-channel array to a much smaller number of principal channels, say Q << M, that can be encoded and transmitted, thus reducing the required bandwidth.

Performing an LRA [17] on (3.4) is equivalent to zeroing out the M-Q rightmost columns of U_0. This is denoted as Ū_0 = [u_1, u_2, ..., u_Q, 0, ..., 0_M] ∈ ℝ^{M×M}. If the data is whitened using Ū_0, the mean square error between the original data and the reconstructed data is defined as

e_LRA(Q) = (Y - Ū_0 Ū_0^T Y)^2    (3.5)

This error decays to zero as Q → M.

3.2 Hardware Considerations

The process described by equation (3.4) requires M×M floating-point multiplications. This is not feasible to perform on an in vivo signal processor. To deal with this, the coefficients of the spatial filter must be approximated to integers. The number of bits used when approximating the coefficients of the spatial filter affects the computational complexity, the memory requirement, the achievable compression, and the losses to signal fidelity.

3.2.1 Integer Approximation

The hardware constraints imposed by implantability requirements rely on the minimization of chip size and power consumption. Typically, the computations being undertaken on chip require the elements to be processed and stored in quantized integer format. Since the whitening process defined in (3.4) is a floating-point process, the coefficients of U_0 must be approximated to integers to reduce the hardware complexity of the whitening process. This creates a hardware-friendly whitening filter, which will be referred to as U_HF. Quantizing the filter amounts to expressing the real numbers in U_0 by their quantized values as

U_HF(i,j) = floor( 2^B · U_0(i,j) ) / 2^B    (3.6)

where B is the number of bits used for quantization.
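A compact numerical sketch of the whitening estimate (3.2)-(3.4) together with the quantization in (3.6) is given below. The data are synthetic, the choice B = 5 is for illustration only, and an eigendecomposition is used because R_Y is symmetric; none of this describes the on-chip implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
M, P, N, B = 8, 3, 2000, 5

# Synthetic correlated multichannel data: a few sources mixed onto M channels plus noise.
A = rng.normal(size=(M, P))
Y = A @ rng.normal(size=(P, N)) + 0.1 * rng.normal(size=(M, N))

R = (Y @ Y.T) / (N - 1)                       # sample spatial covariance, Eq. (3.2)
eigvals, U0 = np.linalg.eigh(R)               # eigendecomposition of the symmetric R
order = np.argsort(eigvals)[::-1]
U0 = U0[:, order]                             # columns sorted by decreasing eigenvalue

Y_white = U0.T @ Y                            # optimal floating-point whitening, Eq. (3.4)

U_int = np.floor((2 ** B) * U0).astype(int)   # scaled integer coefficients, as in Eq. (3.6)
Y_hw = U_int.T @ Y                            # integer-coefficient filtering
Y_hw_scaled = Y_hw / (2 ** B)                 # the division that hardware replaces by a shift
```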
It should be noted that the smaller B is, the more size- and power-efficient the hardware implementation becomes. Thus, a trade-off arises between the error introduced by selecting a B that creates a suboptimal U_HF and the overall system performance. To simplify the hardware that performs the spatial filtering, the denominator of U_HF is moved outside of the matrix, leaving only integer multiplications; the division is discussed in Section 3.2.2. The following example illustrates the process of quantizing the spatial filter. A sample U_0 is approximated to U_HF with B=3 and B=5 bits:

U_0 = [ .9501  .4460 ; .2311  .6068 ]

B=3:  U_HF = (1/2^3) [ 8  4 ; 2  5 ]        B=5:  U_HF = (1/2^5) [ 30  14 ; 7  19 ]

It is obvious from the first row of the example that the error between quantizing to 3 bits versus 5 bits can be substantial. The error from approximating the coefficients of U_0 to the hardware-friendly U_HF is defined as

e_IA(B) = (U_0^T Y - U_HF^T Y)^2    (3.7)

Note that if the matrix U_HF^T Y is transformed back into Y using the same U_HF, no error is introduced into Y. The approximation to U_HF only creates a suboptimal spatial filter, which affects the performance of LRA.

3.2.2 Truncation

After the multiplication U_HF^T Y, the resulting matrix still requires that the division by the integer scaling factor be performed. Performing true arithmetic division would require large amounts of hardware. Instead, the binary nature of the integer scaling can be exploited: to perform the division, the B least significant bits can be dropped, resulting in

Ỹ' = floor( U_HF^T Y / 2^B )    (3.9)

However, performing the floor operation means that precision is lost. The error due to the loss of the decimal places is defined as

e_IA,TR(B) = (Ỹ' - Ỹ)^2    (3.10)

Note that more precision is lost as B shrinks. This error occurs due to the loss of orthonormality of the columns of U_0 when approximating U_0 to integers.

3.2.3 Fixed Low-Rank Approximation

The ideal signal processing situation when performing LRA is to perform the entire whitening process and then determine the Q principal channels via thresholding. This approach is hardware intensive, as each sampling interval requires M^2 multiplications. Since it is expected that Q < [...]

Figure 4.1: Block diagram of a typical RLE for single-channel compression (comparator, accumulator, multiplexer, transmitter)

Modifying the block diagram as discussed in 4.2.2 means that threshold detection must be added. Additionally, the post-thresholded data is compared only to zero, since only zeros are accumulated. This further reduces the size of the RLE block. The comparator circuit also only needs to compare against a zero coefficient. Additionally, as soon as a non-zero coefficient is detected, the accumulator value is sent, and then normal data flow continues. Figure 4.2 illustrates these changes.

Figure 4.2: Modified RLE block diagram (threshold detector, accumulator, transmitter)

4.2.1 Multichannel Processing

Implementing horizontal RLE for multiple channels would require multiple accumulators and multiplexers to determine the current state of each individual channel as data arrives. Vertical RLE removes the need for additional hardware to separately process each individual channel's data. This changes the compression ratios that can be achieved; the effect of this change is outlined in the results section.

4.2.2 Multi-Level Processing

The data being streamed from the DWT block is interleaved by channels, due to the processing pattern outlined in Chapter 2.
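Before turning to the level interleaving, a behavioural sketch of the modified encoder may help. The function below thresholds a stream and emits (zero-run length, coefficient) pairs; applied to a channel-interleaved stream it behaves like the vertical RLE described above. The function name, threshold, and data are invented for illustration and do not reproduce the hardware block.

```python
import numpy as np

def modified_rle(stream, threshold):
    """Threshold, then emit (zero_run_length, nonzero_value) pairs.

    Coefficients below `threshold` in magnitude are treated as zeros; only runs of
    zeros are counted, matching the modified RLE where the comparator checks for zero.
    """
    out, run = [], 0
    for c in stream:
        if abs(c) < threshold:
            run += 1                      # accumulate the zero run
        else:
            out.append((run, int(c)))     # flush the run length together with the coefficient
            run = 0
    if run:
        out.append((run, 0))              # trailing zeros
    return out

# "Vertical" use: interleave samples channel by channel and encode the single stream.
coeffs = np.random.randint(-40, 40, size=(4, 64))   # 4 channels of wavelet coefficients
interleaved = coeffs.T.reshape(-1)                  # c0[0], c1[0], c2[0], c3[0], c0[1], ...
encoded = modified_rle(interleaved, threshold=25)
ratio = coeffs.size / max(len(encoded), 1)          # crude symbol-level compression ratio
```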
In addition to being interleaved across channels, the data also comes out of the processor interleaved across levels. The pattern noted in Figure 2.3 displays how the data is seen by the processor. The data must therefore be compressed in an interleaved fashion to minimize the hardware required to perform vertical RLE. It is assumed in the following that vertical RLE incorporates both interleaved channels and interleaved levels.

4.3 Results

Modified RLE was compared with typical RLE performed on neural signals. Modified RLE was then tested on multi-channel, multi-level data using both vertical and horizontal RLE.

4.3.1 Modified RLE

Figure 4.3 shows the compression ratios achieved by modifying RLE for the compression of neural signals. Single channels from recorded neural data were passed through the wavelet transform outlined in Chapter 2. The resulting data was compressed using typical RLE and the modified RLE described earlier. The savings from modifying RLE can be easily seen.

Figure 4.3: Typical vs. modified RLE (histogram of compression ratios)

4.3.2 Vertical RLE

Data from a 16-channel array was clustered into 4 sets of 4 channels and one set of 16 channels. The correlation between the channels is noted in addition to the compression ratios achieved. Figure 4.4 shows the 4 separate clusters of 4 channels; the correlation between the channels is noted in the top right corner of each panel. The cluster with the highest correlation, channels 1-4, shows that vertical RLE outperforms horizontal RLE. Channels 13-16 show that as correlation decreases, so do the gains given by vertical RLE. Figure 4.5 compares horizontal and vertical RLE over the entire array of 16 channels. While not a substantial gain, vertical RLE still outperforms horizontal RLE for the entire array of channels.

Figure 4.4: Vertical vs. horizontal RLE for 4-channel clusters (compression ratio histograms for channels 1-4, 5-8, 9-12, and 13-16)

Figure 4.5: Vertical vs. horizontal RLE for 16 channels (compression ratio histogram)

4.4 Conclusion

Run-Length Encoding has been evaluated for implementation in an implantable neuroprosthetic device. The scheme has been modified to adapt the processing to neural signals. Additionally, a new concept of vertical RLE across multiple channels and levels, which simplifies operations, has been explored. RLE has been demonstrated to adequately compress the neural data in a computationally simple and effective manner.

CHAPTER 5
Conclusion

The feasibility of ASIC implementation of advanced signal processing algorithms for implantable neuroprosthetic devices has been discussed. The focus of this work has been to provide design parameters that describe the trade-offs between computational complexity, hardware size, and signal fidelity. These parameters can be used in the context of decision making during the process of laying out the signal processor in an ASIC.
It has been shown that the size-efficient implementation of the lifting approach to the DWT is superior to the hardware-intensive implementation under the constraint of limited memory, and that its increased critical path does not limit the number of channels that may be processed in a given sampling interval. It has also been shown that 5-bit approximation of the lifting DWT filter coefficients is sufficient for maintaining signal fidelity using integer-approximated multi-level wavelet decomposition.

Considerations for implementing the spatial filter in hardware have also been discussed. It has been shown that 5-bit approximation of the estimated filter coefficients does not affect the quality of spatial filtering for up to 32 channels. Additionally, fixed low-rank approximation can greatly reduce the number of computations required to perform the filtering operation.

Run-length encoding has been adapted to best fit the statistics of the neural signals compressed by the DWT. Additionally, it has been shown that performing vertical RLE is feasible and provides equal or better compression than horizontal RLE.

Additional work needs to be performed to integrate the individual components and study the overall effect of these hardware considerations. The chip must further be laid out as an ASIC design, which will give the final estimated power consumption, the actual latency, and the chip size consumed per channel.

Bibliography

[1] K.D. Wise, D.J. Anderson, J.F. Hetke, D.R. Kipke and K. Najafi, "Wireless Implantable Microsystems: High-Density Electronic Interfaces to the Nervous System," Proc. of the IEEE, vol. 92, no. 1, pp. 76-97, 2004.
[2] E. Kandel, J. Schwartz, T. Jessell, Principles of Neural Science, 4th Edition, McGraw-Hill, New York. ISBN 0-8385-7701-6.
[3] M. Nicolelis, "Actions from Thoughts," Nature, vol. 409, pp. 403-407, Jan. 2001.
[4] K.G. Oweiss, "A Systems Approach for Data Compression and Latency Reduction in Cortically Controlled Brain Machine Interfaces," IEEE Transactions on Biomedical Engineering, vol. 53, no. 7, July 2006.
[5] T.M. Seese, H. Harasaki, G.M. Saidel, and C.R. Davies, "Characterization of tissue morphology, angiogenesis, and temperature in adaptive response of muscle tissue to chronic heating," Lab Investigation, vol. 78, no. 12, pp. 1553-1562, 1998.
[6] K.G. Oweiss, D.J. Anderson, M.M. Papaefthymiou, "Optimizing Signal Coding in Neural Interface System-On-A-Chip Modules," Proc. of 25th IEEE Int. Conf. on Eng. in Medicine and Biology, pp. 2016-2019, Cancun, Mexico, September 2003.
[7] I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps," J. Fourier Anal. Appl., vol. 4, no. 3, pp. 245-267, 1998.
[8] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 2nd Edition, pp. 413, 1999.
[9] Chao-Tsung Huang, Po-Chih Tseng, Liang-Gee Chen, "VLSI Architecture for Forward Discrete Wavelet Transform Based on B-spline Factorization," Journal of VLSI Signal Processing, vol. 40, pp. 343-353, 2005.
[10] H. Liao, M.K. Mandal, B.F. Cockburn, "Efficient architectures for 1-D and 2-D lifting-based wavelet transforms," IEEE Transactions on Signal Processing, vol. 52, no. 5, pp. 1315-1326, May 2004.
[11] M. Unser and T. Blu, "Wavelet Theory Demystified," IEEE Transactions on Signal Processing, vol. 51, no. 2, pp. 470-483, 2003.
[12] Y. Suhail, K.G. Oweiss, "A Reduced Complexity Integer Lifting Wavelet Based Module for Real-Time Processing in Implantable Neural Interface Devices," 26th IEEE Int. Conf. on Eng. in Medicine and Biology, pp. 4552-4555, September 2004.
[13] K. Thomson, Y. Suhail, and K. Oweiss, "A Scalable Architecture for Streaming Neural Information from Implantable Multichannel Neuroprosthetic Devices," Proc. IEEE Int. Conf. on Circuits & Systems, May 2005.
[14] A. Mason, J. Li, K. Thomson, Y. Suhail, K. Oweiss, "Design Optimization of Integer Lifting DWT Circuitry for Implantable Neuroprosthetics," Proc. of the 3rd Annual International IEEE EMBS Special Topic Conference on Microtechnologies in Medicine and Biology, Kahuku, Oahu, Hawaii, May 2005.
[15] P.P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall, 1993.
[16] Prasanna Balasundaram, "Low Energy Hardware for Sensor Signal Calibration and Compensation," Master's Thesis, Michigan State University, May 2003.
[17] H. Van Trees, Optimum Array Processing, New York: John Wiley & Sons, 1st ed., 2002.
[18] K.G. Oweiss, D. Anderson, "Spike Sorting: A novel shift and amplitude invariant technique," J. Neurocomputing, vol. 44-46, pp. 1133-1139, July 2002.
[19] W. Sweldens, A.R. Calderbank, I. Daubechies and B.L. Yeo, "Wavelet transforms that map integers to integers," Appl. Comput. Harmon. Anal., vol. 5, no. 3, pp. 332-369, 1998.
[20] H. Tanaka, A. Leon-Garcia, "Efficient run-length encodings," IEEE Trans. Information Theory, vol. IT-28, pp. 880-890, Nov. 1982.
[21] K. Oweiss, K. Thomson, D. Anderson, "A Systems Approach for Real-Time Data Compression in Advanced Brain-Machine Interfaces," Proc. of 2nd IEEE-EMBS Conf. on Neural Engineering, pp. 62-65, Arlington, VA, March 2005.
[22] K.G. Oweiss, D.J. Anderson, "Neural Channel Identification of Multichannel Neural Recordings Using Multiresolution Analysis," Annals of the 22nd Biomedical Engineering Society annual meeting, 28 (Suppl. 1): p. S-116, October 2000.
[23] NeuroNexus Technologies, http://www.neuronexustech.com