This is to certify that the thesis entitled HARDWARE CONSIDERATIONS FOR SPACE-TIME PROCESSING IN IMPLANTABLE NEUROPROSTHETIC DEVICES, presented by KYLE E. THOMSON, has been accepted towards fulfillment of the requirements for the M.S. degree in Electrical Engineering.

Major Professor's Signature / Date

MSU is an Affirmative Action/Equal Opportunity Institution.

HARDWARE CONSIDERATIONS OF SPACE-TIME PROCESSING IN IMPLANTABLE NEUROPROSTHETIC DEVICES

By Kyle E. Thomson

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
Department of Electrical and Computer Engineering
2006

ABSTRACT

HARDWARE CONSIDERATIONS OF SPACE-TIME PROCESSING IN IMPLANTABLE NEUROPROSTHETIC DEVICES

By Kyle E. Thomson

Advances in microfabrication technology have allowed hundreds of microelectrodes to be implanted in the vicinity of small populations of neurons in the cortex, opening new avenues for neural interface technology to unveil many mysteries about the connectivity and functionality of the nervous system at the single-cell and population levels. However, many challenges have arisen. One particular challenge is to transmit the recorded neural data from the implanted device to the outside world for further analysis. Previous work has shown that the incorporation of advanced signal processing algorithms on chip enables efficient compression of the data while maintaining high signal fidelity. However, adapting these signal processing algorithms to be hardware friendly requires a nontrivial solution due to the severe constraints imposed on an implantable system. Chip size, signal fidelity, and computational complexity must be balanced to create an ASIC device tailored to this application. This thesis focuses on critical hardware considerations for implementing these algorithms for an implantable neuroprosthetic signal processor operating in vivo in real time. Detailed performance analysis and suggestions for future work are presented.

Copyright © by Kyle Thomson 2006

To my family and friends

ACKNOWLEDGEMENTS

I would first and foremost like to acknowledge my advisor, Dr. Karim G. Oweiss. Through his guidance and support, I was able to complete my Master's degree. Additionally, he has given me many opportunities to interact with members of the neural engineering community. For that, I am extremely grateful. I would also like to acknowledge Dr. Erik Goodman for his support, both financial and emotional, through my graduate career. His efforts to ensure that I could continue my graduate career were greatly appreciated. I would also like to thank my collaborators, Dr. Andrew Mason and Dr. Shantanu Chakrabartty. The many collaborative efforts between our labs gave new insights and ideas to approaching our research. I want to thank my fellow lab mates, who made the experience of being in the lab an enjoyable one, in addition to aiding in my research; including Yasir Suhail, April Thompson, Scott Chrispell, Martin Gerster, Michael Shetliffe, and Seif El-Dawlatly.
I would not have been able to make it through my graduate career without the love and support of my parents, my sister, and my new brother-in-law. Their support and caring words through the years made my degree possible.

Table of Contents

List of Figures ............................................................................... viii
List of Tables ................................................................................. ix
CHAPTER 1 ..................................................................................... 1
1.1 Introduction ............................................................................... 1
1.2 Signal Processing Algorithms ................................................... 4
1.3 Hardware Considerations .......................................................... 6
1.4 Multichannel Processing ........................................................... 8
1.5 Implementation of System Components ................................... 8
CHAPTER 2 ..................................................................................... 10
2.1 Theory ....................................................................................... 10
2.1.1 Lifting Factorization .............................................................. 12
2.1.2 B-spline Factorization ........................................................... 14
2.1.3 Integer Approximation .......................................................... 15
2.1.4 Multi-level Processing ........................................................... 16
2.2 Hardware Implementations ....................................................... 17
2.2.1 Lifting Implementation .......................................................... 18
2.2.2 B-spline Implementation ....................................................... 21
2.3 Results ....................................................................................... 22
2.3.1 Lifting and B-spline Comparison .......................................... 23
2.3.2 Integer Approximation .......................................................... 25
2.4 Conclusion ................................................................................ 26
CHAPTER 3 ..................................................................................... 28
3.1 Theory ....................................................................................... 28
3.1.1 Low Rank Approximation ..................................................... 30
3.2 Hardware Considerations .......................................................... 31
3.2.1 Integer Approximation .......................................................... 31
3.2.2 Truncation .............................................................................. 32
3.2.3 Fixed Low-Rank Approximation ........................................... 33
3.3 Results ....................................................................................... 34
3.4 Conclusion ................................................................................ 37
CHAPTER 4 ..................................................................................... 39
4.1 Theory ....................................................................................... 39
4.1.1 Modified RLE ........................................................................ 40
4.1.2 Multichannel and Multilevel Processing ............................... 41
4.2 Hardware Considerations .......................................................... 42
4.2.1 Multichannel Processing ........................................................ 43
4.2.2 Multi-Level Processing .......................................................... 43
4.3 Results ....................................................................................... 44
4.3.1 Modified RLE ........................................................................ 44
4.3.2 Vertical RLE .......................................................................... 45
4.4 Conclusion ................................................................................ 48
CHAPTER 5 ..................................................................................... 49

List of Figures

Figure 1.1: (a) Close-up of NeuroNexus probes; (b) NeuroNexus implantable acute probes [23] ........... 3
Figure 1.2: Proposed neuroprosthetic system design ........... 3
Figure 1.3: Four Spike Templates ........... 4
Figure 2.1: Flow of the Discrete Wavelet Transform ........... 11
Figure 2.2: Filter steps of the lifting factorization ........... 13
Figure 2.3: Execution pattern of a 13-sample interval ........... 17
Figure 2.4: Hardware block for performing a single lifting step ........... 18
Figure 2.5: Hardware-intensive implementation of the lifting approach [13] ........... 20
Figure 2.6: Size-efficient implementation of the lifting approach [13] ........... 21
Figure 2.7: B-spline hardware, adapted from [9] ........... 22
Figure 2.8: B-spline hardware notation ........... 22
Figure 2.9: Distortion caused by integer approximation ........... 25
Figure 2.10: Distortion based on wavelet levels ........... 26
Figure 3.1: Effects of integer approximation on spatial filtering ........... 35
Figure 3.2: A comparison between e_LRA, e_IA+LRA, e_TR and e_IA+LRA+TR as a function of B. M=8, P=8, Q=3 ........... 36
Figure 3.3: A comparison of e_LRA and e_IA+LRA+TR as a function of Q/M for B=3, 4 and 5. M=32, P=32 ........... 37
Figure 4.1: Block diagram of a typical RLE for single-channel compression ........... 42
Figure 4.2: Modified RLE block diagram ........... 43
Figure 4.3: Typical vs. Modified RLE ........... 44
Figure 4.4: Vertical vs. Horizontal RLE for 4-channel clusters ........... 46
Figure 4.5: Vertical vs. Horizontal RLE for 16 channels ........... 47

List of Tables

Table 2.1: Coefficients of the symmlet4 family of wavelet basis [8] ........... 11
Table 2.2: Coefficients of the Lifting Wavelet Transform [12] ........... 14
Table 2.3: Coefficients of the B-spline Factorization ........... 15
Table 2.4: Results from Lifting and B-spline implementation ........... 23
Table 4.1: Examples of RLE Encoding ........... 40
Table 4.2: Typical vs. Modified RLE ........... 41

Images in this thesis are presented in color.

CHAPTER 1
Hardware Considerations of Space-Time Processing in Implantable Neuroprosthetic Devices

1.1 Introduction

Handicapped patients, such as quadriplegics and paraplegics, have greatly benefited from advances in the field of neuroprosthetic devices. A neuroprosthetic device takes its input by reading brain activity and uses that information to control an external device, ranging from mechanical hands to computer cursors [1][3]. This gives patients who previously had little to no control over their environment a new sense of freedom and independence.

Research in the field of neuroprosthetics has been significantly assisted by the introduction of high-density Microelectrode Recording Arrays (MEAs). An MEA is an array of sensors that can be implanted into the cortex, recording the neural activity from the surrounding cortical tissue. The number of sensors per device has grown exponentially, and there are now devices that have as many as 1000 closely spaced sensors on a single array [1]. The increasing density allows neuroscientists to better analyze entire populations of neurons and understand the inner workings of these populations. However, the increase in data being recorded results in an increase in the bandwidth required to transmit the data.

Currently, MEAs [23], shown in Figure 1.1, are connected from beneath the skull to computers outside the cortex using large wires. Wired systems are a source of infection, which can easily become fatal due to the location of the implant. Additionally, wired systems cause patient discomfort due to the movement of the wires against the surrounding skin tissue. Thus, an externally powered wireless system is the most advantageous route for long-term MEA implantation. However, as the density of sensors on an MEA increases, the raw data cannot possibly be transmitted with available implantable wireless technology. Thus, the data must be pre-processed [1].

Figure 1.1: (a) Close-up of NeuroNexus probes; (b) NeuroNexus implantable acute probes [23]

Advanced signal processing algorithms have shown that neural data can be compressed by as much as 90% while still retaining the information required to determine the workings of a neural population [4].
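To put that figure in perspective, a rough bandwidth estimate shows why transmitting raw data is impractical for a wireless implant. The numbers used here are representative only: 1000 recording sites as mentioned above, and the 25 kHz sampling rate and 10-bit samples cited in later chapters.

\[
1000 \ \text{channels} \times 25{,}000 \ \tfrac{\text{samples}}{\text{s}} \times 10 \ \tfrac{\text{bits}}{\text{sample}} = 250 \ \text{Mbit/s (raw)},
\qquad
250 \ \text{Mbit/s} \times (1 - 0.90) = 25 \ \text{Mbit/s (after 90\% compression)}.
\]

Even the compressed rate remains demanding for a transcutaneous telemetry link, which is why as much of the data reduction as possible must happen inside the implant before transmission.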
The system design shown in Figure 1.2 has been proposed by adapting these signal processing algorithms to process the data in vivo [4]. The area external to the green box is existing hardware, currently used with MEAs. The focus of this work is to present the considerations required when adapting the advanced signal processing algorithms to implantable neuroprosthetic devices.

Figure 1.2: Proposed neuroprosthetic system design (front-end analog stage, spatial filter, and encoder)

1.2 Signal Processing Algorithms

To understand the signal processing algorithms used to compress neural signals, the mechanics of neural recordings must be understood. MEAs record the electrical activity of neurons in the form of action potentials. Action potentials, or neural spikes, are pulses of voltage that travel from the cell body of a neuron down the neuron's axon, where they are picked up by the dendrites of a different neuron. Spikes are the means by which the brain receives and processes data and creates commands for the body [2]. Figure 1.3 shows an example of four spikes.

Figure 1.3: Four spike templates (amplitude in µV versus time)

Controlling an implantable neuroprosthetic device with the information recorded from an MEA requires capturing the spikes sent to and from the neurons surrounding the implant site. Neurons that produce discernable spikes in an MEA recording, i.e. spikes that can be reliably extracted from the recorded data, are called units. The observation model for the MEA data assumes that P units impinge on an array of M recording sites within the discrete-time recording interval T = [t_1, ..., t_N]. The recording matrix Y is expressed as

Y = AS + Z    (1.1)

where A ∈ ℝ^{M×P} denotes the mixing matrix that expresses the array response to the P recorded units, S ∈ ℝ^{P×N} denotes the recorded signal from each unit, and Z ∈ ℝ^{M×N} denotes the noise from neurons in the surrounding tissue. The noise is a zero-mean additive component with an arbitrary spatial and temporal correlation.

Isolating the spike activity in Y is required for neuroscientists to understand and use a population of neurons. First, the spikes must be detected, which is called spike detection. Then, the spike activity is associated with the unit that produced it, which is defined as spike sorting. Spike sorting is a computationally complex algorithm and is not well suited for implementation in an in vivo signal processor [4]. Additionally, traditional spike detection based on simple thresholding is often inaccurate [22]. A proposed solution to these problems is the use of the Discrete Wavelet Transform.

The Discrete Wavelet Transform (DWT) is a cascade of high-pass and low-pass filters which localize signals in both time and frequency [8]. This removes temporal redundancy by localizing the spikes into a few large coefficients, while presumably leaving the noise as many small coefficients. Previous work has shown that the DWT is well suited for compression of neural signals [4]. Additionally, the DWT has shown many other useful properties that aid in processing neural signals. Algorithms for spike detection and spike sorting can be improved through use of DWT coefficients.

While the DWT removes the temporal redundancy from the neural recorded signals, there is still redundancy in the spatial mixing. Spatial filtering can reduce the spatial redundancy, thus reducing the number of channels of data being sent [4]. The physical channels, or electrode sites, can be reduced to a few principal channels.
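The spatial redundancy that makes this reduction possible follows directly from the observation model in (1.1): the M channels share energy from only P underlying sources. The short sketch below generates synthetic data of that form; the dimensions, spike waveform, firing times, and noise level are invented for illustration and are not taken from any recording.

```python
import numpy as np

rng = np.random.default_rng(0)
M, P, N = 16, 4, 1000          # recording sites, units, time samples (illustrative)

# S: each unit fires a simple biphasic "spike" at random times (illustrative waveform)
spike = np.concatenate([np.hanning(10), -0.4 * np.hanning(14)])
S = np.zeros((P, N))
for p in range(P):
    for t in rng.choice(N - spike.size, size=8, replace=False):
        S[p, t:t + spike.size] += spike

A = rng.normal(size=(M, P))          # mixing matrix: array response to the P units
Z = 0.1 * rng.normal(size=(M, N))    # additive noise from surrounding tissue (white here)

Y = A @ S + Z                        # observation model (1.1)
print(Y.shape)                       # (M, N): M channels, but energy from only P sources
```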
Principal channels contain a compacted version of all of the raw data recorded by the MEA. Additionally, the recorded data can be completely reconstructed by sending the parameters used to perform the spatial filtering. By reducing the number of transmitted channels, the effective bandwidth is reduced while maintaining full signal fidelity.

The previous signal processing algorithms reduce spatial and temporal redundancy in the recorded neural signals; however, they do not result in actual data compression [22]. Instead, the majority of the information about the neural signal is compacted into a few large coefficients, clustered among many small coefficients containing very little information or noise. To have actual data compression, the small coefficients must be discarded, and the large coefficients as well as their relationship to the discarded small coefficients must be maintained. A method known as Run-Length Encoding (RLE) [21] is used to compress the transformed signal. While RLE is not an optimal compression algorithm, it has the benefit of low computational complexity and can be performed in real time on streaming data. The goal of RLE is to remove consecutive repetitive coefficients. RLE is well suited to the implantable environment due to its computational simplicity and streaming processing.

To use these algorithms effectively, signal processing hardware must be designed with the goal of maximum efficiency while respecting the constraints of an implantable system.

1.3 Hardware Considerations

Most modern applications utilizing signal processing algorithms can rely on advanced, multipurpose DSP chips to perform the required operations. This approach is inexpensive and time-effective to implement. Neuroprosthetic devices, however, do not allow for such a simple solution. The implantable environment requires the design of an Application Specific Integrated Circuit (ASIC). To design an implantable signal processor, the algorithms must be adapted so that they can be implemented in a hardware-friendly manner, and hardware considerations specific to the implantable neuroprosthetic context must be made to simplify the processing.

Design of an ASIC for a neuroprosthetic device requires special considerations. First, the device must be low powered for two reasons: the power is transmitted externally through use of RF, which can only deliver minimal amounts of energy, and the cortical tissue surrounding the implant cannot be subjected to heating that raises the tissue temperature by more than 1-2 °C, a limit that prevents damage to the surrounding brain tissue. This severely limits the power dissipation and clock speed of the neuroprosthetic device. Additionally, the implant size is constrained because the implant cannot rise more than 1 mm above the cortical surface; a larger implant would compromise its stability.

However, certain factors alleviate the constraints in designing the neural signal processor. First, all channels are typically sampled at the biologically relevant sampling rate of 25 kHz [3]. This provides a 40 µs sampling interval in which to perform any calculations on the incoming sample set. Ideally, all signal processing required on a given sample would be performed in that interval, precluding the need for buffering and reducing the memory requirements for the design. Additionally, a 14 mm burr hole used by neurosurgeons is sufficient for a package of several hundred thousand transistors [1].
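As a rough sanity check on that budget, a common rule of thumb of about four transistors per two-input-gate equivalent (an assumption, not a figure from this thesis) puts several hundred thousand transistors at roughly

\[
\frac{400{,}000 \ \text{transistors}}{4 \ \tfrac{\text{transistors}}{\text{gate}}} \approx 100{,}000 \ \text{gate equivalents},
\]

which is well above the few-thousand-gate integer designs reported in Chapter 2, leaving room for per-channel memory.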
This provides enough transistors to perform low-complexity algorithms and to maintain some internal memory.

1.4 Multichannel Processing

MEA data is sampled across all neural recording channels simultaneously. While ideally each channel would have dedicated processing hardware of its own, the hardware cost of such an implementation is prohibitive. Thus, when considering system designs for implementing the aforementioned signal processing algorithms, the idea of using shared hardware must be considered: at each sampling interval, the hardware for executing each process must be reused for each of the channels. This minimizes the computation hardware and maximizes the available room for memory, thus allowing more channels to be processed. Additionally, all data should be handled in a streaming fashion. Buffering of data causes delays in the system and also requires additional memory that could otherwise be used for increasing the number of channels the processor can handle. These factors must be considered when designing the system for a multichannel neuroprosthetic device. Thus, when discussing the system, it will be assumed that data is available across all channels, and that all data being recorded must leave the system at the same rate at which it is recorded, with only minimal delay.

1.5 Implementation of System Components

ASIC design consists of taking the signal processing algorithms and finding the most computationally simple method to execute them. Additionally, trade-offs between precision and memory must be accounted for. The goal of this work is to present these trade-offs in a quantitative manner, so that the decisions can be made before the time-intensive process of implementing the algorithms in VLSI. The focus of this work is to make considerations for implementing the aforementioned signal processing algorithms for a minimized implantable application, suited for hardware design. Chapter 2 deals with the design of the Discrete Wavelet Transform (DWT) block. Chapter 3 deals with the design of the Spatial Filter block. Chapter 4 discusses the hardware implementation of the compression algorithm, Run-Length Encoding.

CHAPTER 2
Discrete Wavelet Transform System Design

2.1 Theory

The Discrete Wavelet Transform (DWT) is performed on a data set by convolving a high-pass filter H(z) and a low-pass filter L(z) with the data set being compressed. This results in data that is localized in both time and frequency [8]. The wavelet transform is a recursive filter. The result of the high-pass filter is stored and defined as the detail coefficients. The results of the low-pass filter, defined as the approximation coefficients, are fed back into the wavelet transform [8]. This is called higher levels of decomposition, illustrated by Figure 2.1.

Figure 2.1: Flow of the Discrete Wavelet Transform (legend: raw data, approximation coefficients, detail coefficients)

It has been shown that the symmlet4 family of wavelet basis results in the best compression of neural signals [4]. Therefore, we will focus on this basis. The filter coefficients of the discrete symmlet4 basis, composed of the high-pass filter H(z) and the low-pass filter L(z), are listed in Table 2.1.

Lag     z^0     z^-1    z^-2    z^-3    z^-4    z^-5    z^-6    z^-7
H(z)    -.076   -.030   .498    .804    .298    -.099   -.013   .032
L(z)    -.032   -.013   .099    .298    -.804   .498    .030    -.076

Table 2.1: Coefficients of the symmlet4 family of wavelet basis [8]

Certain factorizations provide a more computationally simple approach than performing the two discrete convolutions required by the DWT.
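For reference, a behavioural sketch of the direct filter-bank form (the two convolutions plus downsampling that these factorizations replace) is given below. PyWavelets is used only to supply the symmlet4 taps; the function name, the downsampling phase, and the boundary handling are illustrative choices and do not describe the hardware datapath.

```python
import numpy as np
import pywt

def dwt_level(x, wavelet="sym4"):
    """One level of the direct DWT: convolve with L(z) and H(z), keep every other sample."""
    w = pywt.Wavelet(wavelet)
    lo, hi = np.asarray(w.dec_lo), np.asarray(w.dec_hi)   # 8-tap symmlet4 filters
    approx = np.convolve(x, lo)[1::2]   # approximation coefficients (low-pass branch)
    detail = np.convolve(x, hi)[1::2]   # detail coefficients (high-pass branch)
    return approx, detail

x = np.random.randn(256)
a, d = dwt_level(x)
for _ in range(2):                       # recurse on the approximation for more levels
    a, d = dwt_level(a)
```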
Two such approaches have previously been shown to reduce the computational complexity and memory requirements: the B-spline and lifting factorizations [16][7]. These factorizations must be analyzed for their cost in hardware, their computational complexity, and their suitability for implementation in an implantable neuroprosthetic device. Additionally, it has been shown that the lifting symmlet4 wavelet decomposition can be performed using coefficients approximated to integers [12]. Integer-based math requires less complex hardware than fixed-point or floating-point math.

2.1.1 Lifting Factorization

The lifting approach [7] takes the two wavelet filters, H(z) and L(z), splits each into its even and odd polyphase components, and factorizes the resulting polyphase matrix into n lifting steps S_n(z) and T_n(z) as

P(z) = [ L_e(z)  L_o(z) ; H_e(z)  H_o(z) ] = Π_n [ 1  S_n(z) ; 0  1 ] [ 1  0 ; T_n(z)  1 ] K    (2.1)

Each of the filter steps takes the form

S_{n,i,j} = C_i h_n + C_j h_n z^{-1}
T_{n,i,j} = C_i f_n + C_j f_n z^{+1}    (2.2)

where n is the number of the filter step and C_i, C_j are the constants resulting from factorizing the DWT. This creates a filter bank, shown in Figure 2.2.

Figure 2.2: Filter steps of the lifting factorization

It should be noted that the lifting approach splits each sampling window into odd and even sample pairs. For a factorization of the symmlet4 wavelet, the filtering steps corresponding to Figure 2.2 can be written as

step 1: h_1(t) = h_0(t) + C_1 · f_0(t)
step 2: f_1(t) = f_0(t) + C_2 · h_1(t) + C_3 · h_1(t+1)
step 3: h_2(t) = h_1(t) + C_4 · [...]

Figure 2.7: B-spline hardware, adapted from [9]

Figure 2.8: B-spline hardware notation (a block labeled C_1 + C_2 z^-1 + C_3 z^-2 denotes a three-tap filter section)

It can be noted that the amount of hardware required for performing the B-spline factorization is repetitive for both the Q(z) and R(z) parts. Thus, to make a size-efficient version, the top half could be reused, simply by alternating between the Q(z) and R(z) coefficients on each sampling interval.

2.3 Results

Analysis of the proposed implementations relies on determining the best balance between the approximated size and computation speed. Size is approximated by transistor counts. Computation speed is approximated by the critical path. Specific details on power consumption, actual size, and processing latency require the hardware to be fully created as a VLSI design.

2.3.1 Lifting and B-spline Comparison

Table 2.4 illustrates the comparison between the two factorizations implemented in hardware. Floating-point multipliers, adders, and memory blocks were 24-bit floating-point designs written in Verilog and synthesized to a custom library. Integer multipliers, adders, and memory blocks are fully customized designs tailored to 10-bit data with 5-bit coefficients [14].

Implementation              B-spline       Lifting
Hardware intensive
  Multipliers               6              8
  Adders                    18             8
  Memory                    8              11
  Critical path             Tm + 5Ta       Tm + 2Ta
  Floating-point gates      45504          54944
  Integer gates             9426           10008
Size efficient
  Multipliers               3              2
  Adders                    9              2
  Memory                    8              4
  Critical path             2Tm + 10Ta     4Tm + 8Ta
  Floating-point gates      22752          13736
  Integer gates             4713           2502

Table 2.4: Results from the lifting and B-spline implementations

Previous work has shown [14] that as many as 512 channels of data can be processed by the size-efficient lifting hardware in the available sampling interval. This implies that the critical path does not constrain the design. Figure 2.8 shows the approximated total number of gates for a given number of channels. There is a large divergence between B-spline and lifting as the number of channels being processed increases.
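The lifting datapath that dominates these size-efficient figures can also be read as software. The sketch below walks through one lifting update of the form in (2.2): the integer constants are placeholders standing in for Table 2.2 (whose values are not legible in this copy), the 5-bit right shift stands in for the division by 2^B, and the boundary is handled circularly purely for brevity, so this shows the structure of a lifting step rather than the exact symmlet4 factorization.

```python
import numpy as np

B = 5                      # bits used to approximate each lifting constant (per Chapter 2)

def lift_step(primary, other, c_i, c_j, shift=0):
    """One lifting update: primary[t] += (c_i*other[t] + c_j*other[t+shift]) >> B.

    c_i and c_j are B-bit integer approximations of the real lifting constants;
    the arithmetic right shift replaces the division by 2**B.
    """
    other_shifted = np.roll(other, -shift)       # circular at the boundary, for simplicity
    return primary + ((c_i * other + c_j * other_shifted) >> B)

# Split a window into even/odd samples, then apply a few illustrative steps.
x = np.random.randint(-512, 512, size=256)       # 10-bit samples, as in Chapter 2
even, odd = x[0::2].copy(), x[1::2].copy()

C1, C2, C3 = -12, 3, 7        # placeholder integer constants, NOT the Table 2.2 values
odd  = lift_step(odd,  even, C1, 0)              # step 1: predict odd from even
even = lift_step(even, odd,  C2, C3, shift=1)    # step 2: update even from odd
# ...further steps and the output scaling follow the same pattern.
```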
The divergence seen in Figure 2.8 is due to the difference in memory requirement of the two architectures, and it shows that the dominating factor in the size of a given implementation is the memory required per channel and per decomposition level.

Figure 2.8: Size comparison between the size-efficient approaches (total number of gates versus number of channels, for lifting and B-spline at 2, 3, 4, and 5 decomposition levels)

2.3.2 Integer Approximation

Previous work [12] has shown that for single-level decomposition, as few as 4 bits can be used to approximate the wavelet coefficients for the lifting approach. Further analysis must consider the use of 4- and 5-bit integer approximation for multi-level wavelet decomposition. Figure 2.9 shows the distortion caused by the approximated coefficients for 4 and 5 bits. It can be seen that the losses incurred by approximating the wavelet coefficients to integers are larger for 4 bits.

Figure 2.9: Distortion caused by integer approximation (7-level reconstructions with 4-bit and 5-bit coefficients, overlaid on the original post-wavelet signal)

Figure 2.10 shows the MSE distortion of spike waveforms. One hundred random noiseless spike trains of 256 samples were generated and then decomposed and reconstructed with 1 to 7 levels of wavelet transform. The quality measure shown is the ratio of the energy of the error to the total signal energy.

Figure 2.10: Distortion based on wavelet levels (error energy to total signal energy ratio versus decomposition level, for 4-bit and 5-bit quantization)

2.4 Conclusion

Two architectural implementations of the Discrete Wavelet Transform have been considered. Additionally, each implementation has been considered in two forms, one minimizing hardware and one minimizing delay. The minimized-hardware approach to implementing the lifting DWT has shown many benefits for an implantable neuroprosthetic device. The minimal memory requirement of the size-efficient lifting architecture maximizes the ability to process the largest number of channels. Additionally, its increased critical path does not constrain the number of channels that can be processed on the device. Integer approximation of wavelet coefficients was also considered for multilevel decomposition. It has been shown that multilevel integer-approximated DWT can be performed with minimal losses to signal energy.

CHAPTER 3
Spatial Filter System Design

3.1 Theory

The redundancy caused by observing the signals with multiple closely spaced electrodes can be reduced using advanced signal processing techniques. A spatial filter decorrelates the data by aggregating the signal energy spread across many physical channels into a few principal channels that can be further filtered and compressed [4]. This process is known as whitening the data, while the inverse process will be referred to as coloring the data. The actual process involved in performing spatial filtering is computationally complex.
However, like the DWT, certain concessions can be made to make the spatial filter feasible to implement in hardware while causing minimal and acceptable losses to signal fidelity. These concessions must be analyzed for their validity.

Decorrelation in space can be optimally achieved by first estimating the sample spatial covariance matrix [17] of the data as

R_Y = (1/(N-1)) Σ_{n=1}^{N} y[n] y^T[n]    (3.2)

where y[n] ∈ ℝ^{M×1} is the array snapshot at time sample n, and N denotes the length of the time interval from which the filter is estimated. This covariance can be further decomposed using singular value decomposition (SVD) [17] to yield

R_Y = U_0 D_0 U_0^T = Σ_{m=1}^{M} λ_m u_m u_m^T    (3.3)

where λ_m denotes the m-th eigenvalue corresponding to the m-th diagonal entry in D_0, and U_0 = [u_1, u_2, ..., u_M] ∈ ℝ^{M×M} comprises the eigenvectors spanning the column space of R_Y. The whitening process is performed by the following matrix multiplication

Ỹ = U_0^T Y    (3.4)

The matrix U_0 is the optimal whitening matrix, and its entries are generally represented by floating-point numbers.

3.1.1 Low Rank Approximation

One way to reduce the number of transmitted channels in Ỹ is to use low-rank approximation (LRA) [17] to shrink the sum in (3.3) to only the principal channels comprising most of the signal energy. This is guaranteed if P ≤ M and the P signal sources have orthogonal waveforms. However, in a typical neural recording experiment, both conditions are often violated. Spike waveforms from distinct neurons are generally not orthogonal amongst each other, and typically the number of single units recorded exceeds the number of electrode channels. Nevertheless, the principal role of the spatial processing stage is not to sort out the distinct signal sources, but rather to reduce the overall signal energy impinging on the M-channel array to a much smaller number of principal channels, say Q << M, that can be encoded and transmitted, thus reducing the required bandwidth.

Performing an LRA [17] on (3.4) is equivalent to zeroing out the M-Q rightmost columns of U_0. This is denoted as Ū_0 = [u_1, u_2, ..., u_Q, 0, ..., 0_M] ∈ ℝ^{M×M}. If the data is whitened using Ū_0, the mean square error between the original data and the reconstructed data is defined as

e_LRA(Q) = (Y - Ū_0 Ū_0^T Y)^2    (3.5)

This error decays to zero as Q → M.

3.2 Hardware Considerations

The process described by equation (3.4) requires M×M floating-point multiplications. This is not feasible to perform on an in vivo signal processor. To deal with this, the coefficients of the spatial filter must be approximated to integers. The number of bits used when approximating the coefficients of the spatial filter affects the computational complexity, the memory requirement, the achievable compression, and the losses to signal fidelity.

3.2.1 Integer Approximation

The hardware constraints imposed by implantability requirements rely on the minimization of chip size and power consumption. Typically, the computations being undertaken on chip require the elements to be processed and stored in quantized integer format. Since the whitening process defined in (3.4) is a floating-point process, the coefficients of U_0 must be approximated to integers to reduce the hardware complexity of the whitening process. This creates a hardware-friendly whitening filter, which will be referred to as U_HF. Quantizing the filter amounts to expressing the real numbers in U_0 by their quantized values as

U_HF(i,j) = floor( 2^B · U_0(i,j) ) / 2^B    (3.6)

where B is the number of bits used for quantization.
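A compact numerical sketch of the whitening estimate (3.2)-(3.4) together with the quantization in (3.6) is given below. The data are synthetic, the choice B = 5 is for illustration only, and an eigendecomposition is used because R_Y is symmetric; none of this describes the on-chip implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
M, P, N, B = 8, 3, 2000, 5

# Synthetic correlated multichannel data: a few sources mixed onto M channels plus noise.
A = rng.normal(size=(M, P))
Y = A @ rng.normal(size=(P, N)) + 0.1 * rng.normal(size=(M, N))

R = (Y @ Y.T) / (N - 1)                       # sample spatial covariance, Eq. (3.2)
eigvals, U0 = np.linalg.eigh(R)               # eigendecomposition of the symmetric R
order = np.argsort(eigvals)[::-1]
U0 = U0[:, order]                             # columns sorted by decreasing eigenvalue

Y_white = U0.T @ Y                            # optimal floating-point whitening, Eq. (3.4)

U_int = np.floor((2 ** B) * U0).astype(int)   # scaled integer coefficients, as in Eq. (3.6)
Y_hw = U_int.T @ Y                            # integer-coefficient filtering
Y_hw_scaled = Y_hw / (2 ** B)                 # the division that hardware replaces by a shift
```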
It should be noted that the smaller B is, the more size- and power-efficient the hardware implementation becomes. Thus, a trade-off arises between the error introduced by selecting a B that creates a suboptimal U_HF and the overall system performance. To simplify the hardware that performs the spatial filtering, the denominator of U_HF is moved outside of the matrix, leaving only integer multiplications; the division is discussed in Section 3.2.2. The following example illustrates the process of quantizing the spatial filter. A sample U_0 is approximated to U_HF with B=3 and B=5 bits:

U_0 = [ .9501  .4460 ; .2311  .6068 ]

B=3:  U_HF = (1/2^3) [ 8  4 ; 2  5 ]        B=5:  U_HF = (1/2^5) [ 30  14 ; 7  19 ]

It is obvious from the first row of the example that the error between quantizing to 3 bits versus 5 bits can be substantial. The error from approximating the coefficients of U_0 to the hardware-friendly U_HF is defined as

e_IA(B) = (U_0^T Y - U_HF^T Y)^2    (3.7)

Note that if the matrix U_HF^T Y is transformed back into Y using the same U_HF, no error is introduced into Y. The approximation to U_HF only creates a suboptimal spatial filter, which affects the performance of LRA.

3.2.2 Truncation

After the multiplication U_HF^T Y, the resulting matrix still requires that the division by the integer scaling factor be performed. Performing true arithmetic division would require large amounts of hardware. Instead, the binary nature of the integer scaling can be exploited: to perform the division, the B least significant bits can be dropped, resulting in

Ỹ' = floor( U_HF^T Y / 2^B )    (3.9)

However, performing the floor operation means that precision is lost. The error due to the loss of the decimal places is defined as

e_IA,TR(B) = (Ỹ' - Ỹ)^2    (3.10)

Note that more precision is lost as B shrinks. This error occurs due to the loss of orthonormality of the columns of U_0 when approximating U_0 to integers.

3.2.3 Fixed Low-Rank Approximation

The ideal signal processing situation when performing LRA is to perform the entire whitening process and then determine the Q principal channels via thresholding. This approach is hardware intensive, as each sampling interval requires M^2 multiplications. Since it is expected that Q < [...]

Figure 4.1: Block diagram of a typical RLE for single-channel compression (comparator, accumulator, multiplexer, transmitter)

Modifying the block diagram as discussed in 4.2.2 means that threshold detection must be added. Additionally, the post-thresholded data is compared only to zero, since only zeros are accumulated. This further reduces the size of the RLE block. The comparator circuit also only needs to compare against a zero coefficient. Additionally, as soon as a non-zero coefficient is detected, the accumulator value is sent, and then normal data flow continues. Figure 4.2 illustrates these changes.

Figure 4.2: Modified RLE block diagram (threshold detector, accumulator, transmitter)

4.2.1 Multichannel Processing

Implementing horizontal RLE for multiple channels would require multiple accumulators and multiplexers to determine the current state of each individual channel as data arrives. Vertical RLE removes the need for additional hardware to separately process each individual channel's data. This changes the compression ratios that can be achieved; the effect of this change is outlined in the results section.

4.2.2 Multi-Level Processing

The data being streamed from the DWT block is interleaved by channels, due to the processing pattern outlined in Chapter 2.
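Before turning to the level interleaving, a behavioural sketch of the modified encoder may help. The function below thresholds a stream and emits (zero-run length, coefficient) pairs; applied to a channel-interleaved stream it behaves like the vertical RLE described above. The function name, threshold, and data are invented for illustration and do not reproduce the hardware block.

```python
import numpy as np

def modified_rle(stream, threshold):
    """Threshold, then emit (zero_run_length, nonzero_value) pairs.

    Coefficients below `threshold` in magnitude are treated as zeros; only runs of
    zeros are counted, matching the modified RLE where the comparator checks for zero.
    """
    out, run = [], 0
    for c in stream:
        if abs(c) < threshold:
            run += 1                      # accumulate the zero run
        else:
            out.append((run, int(c)))     # flush the run length together with the coefficient
            run = 0
    if run:
        out.append((run, 0))              # trailing zeros
    return out

# "Vertical" use: interleave samples channel by channel and encode the single stream.
coeffs = np.random.randint(-40, 40, size=(4, 64))   # 4 channels of wavelet coefficients
interleaved = coeffs.T.reshape(-1)                  # c0[0], c1[0], c2[0], c3[0], c0[1], ...
encoded = modified_rle(interleaved, threshold=25)
ratio = coeffs.size / max(len(encoded), 1)          # crude symbol-level compression ratio
```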
In addition to being interleaved across channels, the data also comes out of the processor interleaved across levels. The pattern noted in Figure 2.3 displays how the data is seen by the processor. The data must therefore be compressed in an interleaved fashion to minimize the hardware required to perform vertical RLE. It is assumed in the following that vertical RLE incorporates both interleaved channels and interleaved levels.

4.3 Results

Modified RLE was compared with typical RLE performed on neural signals. Modified RLE was then tested on multi-channel, multi-level data using both vertical and horizontal RLE.

4.3.1 Modified RLE

Figure 4.3 shows the compression ratios achieved by modifying RLE for the compression of neural signals. Single channels from recorded neural data were passed through the wavelet transform outlined in Chapter 2. The resulting data was compressed using typical RLE and the modified RLE described earlier. The savings from modifying RLE can be easily seen.

Figure 4.3: Typical vs. modified RLE (histogram of compression ratios)

4.3.2 Vertical RLE

Data from a 16-channel array was clustered into 4 sets of 4 channels and one set of 16 channels. The correlation between the channels is noted in addition to the compression ratios achieved. Figure 4.4 shows the 4 separate clusters of 4 channels; the correlation between the channels is noted in the top right corner of each panel. The cluster with the highest correlation, channels 1-4, shows that vertical RLE outperforms horizontal RLE. Channels 13-16 show that as correlation decreases, so do the gains given by vertical RLE. Figure 4.5 compares horizontal and vertical RLE over the entire array of 16 channels. While not a substantial gain, vertical RLE still outperforms horizontal RLE for the entire array of channels.

Figure 4.4: Vertical vs. horizontal RLE for 4-channel clusters (compression ratio histograms for channels 1-4, 5-8, 9-12, and 13-16)

Figure 4.5: Vertical vs. horizontal RLE for 16 channels (compression ratio histogram)

4.4 Conclusion

Run-Length Encoding has been evaluated for implementation in an implantable neuroprosthetic device. The scheme has been modified to adapt the processing to neural signals. Additionally, a new concept of vertical RLE across multiple channels and levels, which simplifies operations, has been explored. RLE has been demonstrated to adequately compress the neural data in a computationally simple and effective manner.

CHAPTER 5
Conclusion

The feasibility of ASIC implementation of advanced signal processing algorithms for implantable neuroprosthetic devices has been discussed. The focus of this work has been to provide design parameters that describe the trade-offs between computational complexity, hardware size, and signal fidelity. These parameters can be used in the context of decision making during the process of laying out the signal processor in an ASIC.
It has been shown that the size-efficient implementation of the lifting approach to the DWT is superior to the hardware-intensive implementation under the constraint of limited memory, and that its increased critical path does not limit the number of channels that may be processed in a given sampling interval. It has also been shown that 5-bit approximation of the lifting DWT filter coefficients is sufficient for maintaining signal fidelity using integer-approximated multi-level wavelet decomposition.

Considerations for implementing the spatial filter in hardware have also been discussed. It has been shown that 5-bit approximation of the estimated filter coefficients does not affect the quality of spatial filtering for up to 32 channels. Additionally, fixed low-rank approximation can greatly reduce the number of computations required to perform the filtering operation.

Run-length encoding has been adapted to best fit the statistics of the neural signals compressed by the DWT. Additionally, it has been shown that performing vertical RLE is feasible and provides equal or better compression than horizontal RLE.

Additional work needs to be performed to integrate the individual components and study the overall effect of these hardware considerations. The chip must further be laid out as an ASIC design, which will give the final estimated power consumption, the actual latency, and the chip size consumed per channel.

Bibliography

[1] K.D. Wise, D.J. Anderson, J.F. Hetke, D.R. Kipke and K. Najafi, "Wireless Implantable Microsystems: High-Density Electronic Interfaces to the Nervous System," Proc. of the IEEE, vol. 92, no. 1, pp. 76-97, 2004.
[2] E. Kandel, J. Schwartz, T. Jessell, Principles of Neural Science, 4th Edition, McGraw-Hill, New York. ISBN 0-8385-7701-6.
[3] M. Nicolelis, "Actions from Thoughts," Nature, vol. 409, pp. 403-407, Jan. 2001.
[4] K.G. Oweiss, "A Systems Approach for Data Compression and Latency Reduction in Cortically Controlled Brain Machine Interfaces," IEEE Transactions on Biomedical Engineering, vol. 53, no. 7, July 2006.
[5] T.M. Seese, H. Harasaki, G.M. Saidel, and C.R. Davies, "Characterization of tissue morphology, angiogenesis, and temperature in adaptive response of muscle tissue to chronic heating," Lab Investigation, vol. 78, no. 12, pp. 1553-1562, 1998.
[6] K.G. Oweiss, D.J. Anderson, M.M. Papaefthymiou, "Optimizing Signal Coding in Neural Interface System-On-A-Chip Modules," Proc. of 25th IEEE Int. Conf. on Eng. in Medicine and Biology, pp. 2016-2019, Cancun, Mexico, September 2003.
[7] I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps," J. Fourier Anal. Appl., vol. 4, no. 3, pp. 245-267, 1998.
[8] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 2nd Edition, pp. 413, 1999.
[9] Chao-Tsung Huang, Po-Chih Tseng, Liang-Gee Chen, "VLSI Architecture for Forward Discrete Wavelet Transform Based on B-spline Factorization," Journal of VLSI Signal Processing, vol. 40, pp. 343-353, 2005.
[10] H. Liao, M.K. Mandal, B.F. Cockburn, "Efficient architectures for 1-D and 2-D lifting-based wavelet transforms," IEEE Transactions on Signal Processing, vol. 52, no. 5, pp. 1315-1326, May 2004.
[11] M. Unser and T. Blu, "Wavelet Theory Demystified," IEEE Transactions on Signal Processing, vol. 51, no. 2, pp. 470-483, 2003.
[12] Y. Suhail, K.G. Oweiss, "A Reduced Complexity Integer Lifting Wavelet Based Module for Real-Time Processing in Implantable Neural Interface Devices," 26th IEEE Int. Conf. on Eng. in Medicine and Biology, pp. 4552-4555, September 2004.
[13] K. Thomson, Y. Suhail, and K. Oweiss, "A Scalable Architecture for Streaming Neural Information from Implantable Multichannel Neuroprosthetic Devices," Proc. IEEE Int. Conf. on Circuits & Systems, May 2005.
[14] A. Mason, J. Li, K. Thomson, Y. Suhail, K. Oweiss, "Design Optimization of Integer Lifting DWT Circuitry for Implantable Neuroprosthetics," Proc. of the 3rd Annual International IEEE EMBS Special Topic Conference on Microtechnologies in Medicine and Biology, Kahuku, Oahu, Hawaii, May 2005.
[15] P.P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall, 1993.
[16] Prasanna Balasundaram, "Low Energy Hardware for Sensor Signal Calibration and Compensation," Master's Thesis, Michigan State University, May 2003.
[17] H. Van Trees, Optimum Array Processing, New York: John Wiley & Sons, 1st ed., 2002.
[18] K.G. Oweiss, D. Anderson, "Spike Sorting: A novel shift and amplitude invariant technique," J. Neurocomputing, vol. 44-46, pp. 1133-1139, July 2002.
[19] W. Sweldens, A.R. Calderbank, I. Daubechies and B.L. Yeo, "Wavelet transforms that map integers to integers," Appl. Comput. Harmon. Anal., vol. 5, no. 3, pp. 332-369, 1998.
[20] H. Tanaka, A. Leon-Garcia, "Efficient run-length encodings," IEEE Trans. Information Theory, vol. IT-28, pp. 880-890, Nov. 1982.
[21] K. Oweiss, K. Thomson, D. Anderson, "A Systems Approach for Real-Time Data Compression in Advanced Brain-Machine Interfaces," Proc. of 2nd IEEE-EMBS Conf. on Neural Engineering, pp. 62-65, Arlington, VA, March 2005.
[22] K.G. Oweiss, D.J. Anderson, "Neural Channel Identification of Multichannel Neural Recordings Using Multiresolution Analysis," Annals of the 22nd Biomedical Engineering Society annual meeting, 28 (Suppl. 1): p. S-116, October 2000.
[23] NeuroNexus Technologies, http://www.neuronexustech.com