STATISTICAL SIGNAL PROCESSING TOOLS FOR ANALYZING LARGE-SCALE NEURAL ENSEMBLES

By

Mehdi Aghagolzadeh

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Electrical Engineering

2012

ABSTRACT

STATISTICAL SIGNAL PROCESSING TOOLS FOR ANALYZING LARGE-SCALE NEURAL ENSEMBLES

By

Mehdi Aghagolzadeh

Understanding how a neuron processes information and communicates with other neurons is key to unraveling the brain's mechanisms underlying perception, learning, and motor processing. Key to characterizing neurons acting in concert is the ability to simultaneously measure their firing patterns in awake, behaving subjects. Microelectrode arrays (MEAs) implanted in the brain allow simultaneous monitoring of the activity of large ensembles of cells while subjects carry out specific tasks. Isolating the spike pattern that characterizes each cell's signature firing requires sophisticated signal processing algorithms, particularly to track the nonstationary behavior of these waveforms over extended periods of time. In this thesis, we introduce a compressive spike sorting technique that discriminates spike patterns from individual neurons using a sparse representation of the ensemble raw data. An iterative learning algorithm is introduced to estimate and adapt a set of optimal thresholds that maximize the separability between spike classes while minimizing the average waveform reconstruction error. Once derived, spike trains are then used to infer functional connectivity patterns among the recorded neural ensemble constituents. We introduce two information-theoretic approaches within the realm of graphical models that capture the spatiotemporal dependency between neurons' spiking patterns. Specifically, the class of spatiotemporal maximum entropy models is shown to incorporate higher-order interactions. We demonstrate the richness of these techniques in improving our understanding of neural function and dysfunction at millisecond and micron resolutions, and their potential to be applied in emerging applications of brain-machine interface technology to help improve the lifestyle of people with severe disabilities.

To my wife, Hamideh Rezaee, for her unconditional support

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Karim Oweiss, for his continuous guidance and support since I joined Michigan State University in August 2007. His patience, dedication, and encouragement, in addition to meeting with me at least once every week and sending me to prestigious conferences, helped me advance my research and writing skills. I would also like to express my appreciation to my dissertation committee members, Dr. John Deller, Dr. Hayder Radha, and Dr. Rong Jin. I would also like to thank all my previous and current lab mates with whom I have coauthored papers: Seif Eldawlatly, Fei Zhang, Jianbo Liu, Ki Yong Kwon, Faisal Abunimeh, Ali Mohebi, and John Daly. Finally, I would like to thank the National Institutes of Health (NIH) for the generous funding, and the Department of Electrical and Computer Engineering and the Graduate School for summer scholarships.
TABLE OF CONTENTS

List of Tables
List of Figures
Table of Abbreviations
Introduction
Overview of Neural Analysis Tools
2.1. Spike Detection and Sorting
2.2. Generalized Linear Model
Neural Activity in the Sparse Representation Domain
3.1. Spike Trains as Point Processes
3.2. Sparse Representation
3.3. Computational Complexity of Spike Sorting in the Sparse Domain
3.4. Spike Class Generation and Separability
3.5. Decoding Performance
3.6. Discussion
Sorting and Tracking Neural Activity via Simple Thresholding
4.1. Spike Sorting as a Sparse Optimization Problem
4.1.1. Optimal Thresholding
4.1.2. Threshold selection to maximize sorting performance
4.1.3. Multilevel Wavelet Decomposition
4.2. Collecting Neural Data
4.2.1. Generating Surrogate Data with Known Waveform Characteristics
4.2.2. Tracking Neurons over Long-Term Recordings
4.2.3. Considerations for Practical Implementation
4.3. Compressive Spike Sorting Performance
4.3.1. Spike Sorting Performance vs. SNR and Compression Rate
4.3.2. Wavelet Basis Function vs. Sampling Rate
4.3.3. Analysis of Real Data
4.4. Discussion
Spatiotemporal Graphical Models for Ensemble Spike Train Analysis
5.1. Pairwise Maximum Entropy Models
5.1.1. Principle of Maximum Entropy
5.1.2. Instantaneous Maximum Entropy Model
5.1.3. Spatiotemporal Maximum Entropy Model
5.1.4. Training Maximum Entropy Models
5.2. V1 Population Encoding in Response to Stimulus Grating
5.2.1. Single Unit Analysis
5.2.2. Multiunit Analysis
5.3. Discussion
Hypergraph Models with Higher Order Interactions Better Explain the Network Dynamics
6.1. Hypergraph Maximum Entropy Model
6.1.1. Factoring Multi-information
6.1.2. Graphical representation
6.1.3. Cooperative Model
6.1.4. Minimum Entropy Distance
6.2. Probabilistic Spiking Model with Synaptic Plasticity
6.2.1. Probabilistic Spiking Neurons
6.2.2. Task Learning
6.3. Experimental Results
6.3.1. Order of the Encoding Model
6.3.2. Model Consistency
6.4. Discussion
Concluding Remarks
REFERENCES

List of Tables

Table 4.1 Pseudo-code for determining the optimal thresholds by (a) minimizing the MSE, (b) maximizing the sorting performance

Table 6.1 Values of the parameters used for the model of equations (6-16), (6-17), (6-18) and (6-19)

List of Figures

Figure 1.1 Illustration of the two-category data analysis applied to microelectrode array recordings

Figure 2.1 Pattern recognition approach to spike sorting

Figure 2.2 PCA for feature extraction and EM for unsupervised cluster cutting of 6469 spike events

Figure 2.3 Schematic of the network structure used to simulate the spike trains

Figure 3.1 Schematic diagram of a typical data flow in a neuro-motor prosthetic application

Figure 3.2 Sparse representation of sample events

Figure 3.3 Five units obtained from recordings in an anesthetized preparation

Figure 3.4 Unit isolation quality of the data

Figure 3.5 Schematic of encoding 2D, non-goal-directed arm movement: the sample network of neurons is randomly connected with positive and negative connections

Figure 3.6 Top-left: 400 msec segment of angular direction from a movement trajectory superimposed on tuning "bands" of five representative units
Figure 3.7 Average mutual information (in bits) between movement direction, $\theta$, and rate estimators averaged across the two subgroups of neurons in the entire population as a function of decomposition level (i.e., kernel size)

Figure 3.8 Decoding performance of a sample 2D movement trajectory

Figure 3.9 Computational complexity of PCA/EM/Gaussian kernel and the compressed sensing method

Figure 4.1 The distribution of spike events in a 2D space obtained using PCA

Figure 4.2 Schematic diagram for obtaining the sparse representation from the raw recordings using a 4-level wavelet decomposition in a tree structure using the Symlets 2, 4, and 6, at 25 kHz

Figure 4.3 Distribution of spike PCA features in a 3D space

Figure 4.4 Tracking neurons for multiday recordings on a single electrode

Figure 4.5 Tracking neurons for single day recordings

Figure 4.6 Schematic diagram of the data flow in the implantable wireless system

Figure 4.7 Average TPR of an FDA classifier trained for different SNR and compression rates

Figure 4.8 Effect of wavelet basis choice on the average TPR for variable sampling rates between 12 and 28 kHz, and SNRs of 0, 0.5 and 1.5 dB

Figure 4.9 A 10 second demonstration of in vivo recording from the rat's somatosensory cortex

Figure 4.10 The compression rate and MSE for the iterative simulated annealing learning

Figure 4.11 Rate-distortion scatter plot. Each dot represents the rate and the distortion for a random threshold drawn from a uniform distribution
Figure 4.12 Spike sorting performance of the compressive spike sorting (CSS) vs. the conventional PCA sorting for neurons tracked over spans of 1, 2 and 3 days

Figure 5.1 Representation of hypothetical instantaneous patterns, $X^{(i)}$ (for $i = 1, 2, 3$)

Figure 5.2 First-order Markov representation of the pairwise maximum entropy model

Figure 5.3 Experimental setup: recording from the V1 area of an anesthetized cat in response to drifting grating stimulus with tetrode microwires

Figure 5.4 The circular representation demonstrates the 16 different stimuli in the center with the arrows indicating the movement direction of the drifting bars

Figure 5.5 Predicting spatiotemporal patterns of activity for the 1st-order maximum entropy model

Figure 5.6 Predicting spatiotemporal patterns of activity for the 2nd-order maximum entropy model

Figure 5.7 Independent encoding model

Figure 5.8 Spatiotemporal maximum entropy encoding model

Figure 5.9 Actual counted number of patterns in the collective activity of the neural population (solid blue line) and estimated counts by the independent, instantaneous maximum entropy, and first- and second-order spatiotemporal maximum entropy models

Figure 6.1 A stimulus entropy representation by a population of 3 neurons

Figure 6.2 A hypergraph including four factors, representing a Markov network model of a population of six neurons

Figure 6.3 A simplified version of the network structure
Figure 6.4 Schematic of the statistical learning rule. Neurons A and B are non-cooperative layer neurons that excite cooperative layer neurons C, D, and E

Figure 6.5 Mean connectivity strength versus trial number for functionally correlated (blue) and uncorrelated (red) cooperative layer neurons

Figure 6.6 An example of the connectivity between one cooperative layer neuron (neuron 103) and four non-cooperative layer neurons

Figure 6.7 Probability of the number of neurons firing within a 10 ms window for actual data generated from the model and a jittered version of the data, for cooperative layer neurons 99 to 108

Figure 6.8 Sample data from 60 cooperative layer neurons generated by the model during one trial after task learning was completed

Figure 6.9 Connected information for different orders of the cooperative model for a fixed subpopulation size (n = 10)

Figure 6.10 Inferred factor-graphs for sample targets '1' and '8'

Figure 6.11 Model fit for a sample subpopulation of 10 neurons for: 1- independent model (blue), 2- maximum entropy model (red), 3- MinED cooperative model (green)

Figure 6.12 The Kullback-Leibler distance between the true and the single-trial distribution of states for each model

Figure 6.13 The KL-divergence for the independent, maximum entropy, and cooperative models as a function of the size of the observed population

Figure 6.14 Two sample inputs of drifting grating stimulus

TABLE OF ABBREVIATIONS

MEA Multielectrode Array
BMI Brain Machine Interface
GLM Generalized Linear Model
EEG Electroencephalogram
FMRI Functional Magnetic Resonance Imaging
LFP Local Field Potential
PCA Principal Component Analysis
ISI Inter-Spike Interval
EM Expectation Maximization
DWT Discrete Wavelet Transform
SR Separability Ratio
GMM Gaussian Mixture Model
MAP Maximum a Posteriori
FDA Fisher Discriminant Analysis
LFSR Linear Feedback Shift Register
CSS Compressive Spike Sorting
RLE Run Length Encoding
KL Kullback-Leibler
MRF Markov Random Fields
RGC Retinal Ganglion Cell
MinED Minimum Entropy Distance
STDP Spike Timing Dependent Plasticity
PSTH Post Stimulus Time Histograms

1 Introduction

Understanding the role of neurons in brain functionality is a cornerstone of neuroscience, psychology, and the cognitive sciences [1]. Neurons participate in encoding sensory information, motor intentions, and decision-making via excitatory or inhibitory connectivity to thousands of other neurons. It is the collective behavior of these connected neuronal elements that forms the important aspects of complex behavior [2].
The ability to simultaneously record the activity of an ensemble of neurons in vivo using multielectrode arrays (MEAs) implanted in selected brain areas has made it feasible to study functional connectivity between neurons while awake, behaving subjects carry out specific functions [3-6]. The advantage of these arrays is their ability to sample neuronal activity at high temporal and spatial resolution, in contrast to fMRI, which lacks temporal resolution, and EEG or ECoG recordings, which lack spatial resolution [7].

The objective of this dissertation is to design statistical signal processing tools to extract the critical information from large-scale recordings of neuronal populations and to model the way neurons communicate with each other. These tools fall broadly into two categories: first, detecting and sorting action potentials from the background neural activity; and second, constructing models of interaction among neuronal constituents. Figure 1.1 illustrates this two-category analysis applied to extracellular recordings from MEAs.

Figure 1.1 Illustration of the two-category data analysis applied to microelectrode array recordings. The bottom plate demonstrates the recorded neural data from three channels and the noisy background activity. Each channel can record from one or more neurons simultaneously; e.g., the first and third rows record two neurons, while the second row records only one neuron. The middle plate demonstrates the spike trains for the 5 segregated neurons, represented by the exact time stamps of individual events. The top plate demonstrates a spatiotemporal model fit to the collective neural activity of the observed population. The hypothetical interactions among neuronal constituents can be either at instantaneous time instants, representing bidirectional connections, or at different time instants, representing unidirectional connections. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Chapter 2 provides an overview of some basic neural analysis tools that are used repeatedly in this dissertation for analyzing extracellular recordings from MEAs. The chapter begins with a general introduction to neurons and the different methods developed for recording their activity. Then, spike detection and sorting algorithms, which are the first step in analyzing any extracellular recording of a mixture of neural activities, are described. We then explain generalized linear models (GLMs) as the generative model for simulating neural encoding data, which has been the centerpiece of evaluating decoding performance and connectivity inference in this dissertation.

Chapter 3 introduces a novel spike sorting algorithm that operates directly on the sparse representation of the neural recordings, referred to as compressive spike sorting. In this chapter, we first introduce the state-of-the-art algorithms for detecting, sorting, and transmitting neural data from the site of an implantable MEA recording to an external observer. The objective is to automatically detect and sort action potentials in a compressed domain by exploiting a sparse representation of the extracellular spike waveforms using a multilevel wavelet decomposition transform.
We will show that by restricting this sparse representation to a subset of projections that simultaneously preserve features of the spike waveforms, we can reasonably approximate the spike trains and the instantaneous firing rates of the recorded neurons. This approach eliminates multiple steps from the typical processing path of neural signals that are customarily performed for instantaneous neural decoding, thereby reducing the computational complexity, which is essential for an actual implementation of miniaturized implantable brain-machine interface (BMI) systems. We compared the computational complexity of compressive spike sorting with that of state-of-the-art spike sorting and demonstrated the substantial savings in power and hardware resources that are the cornerstone of the design of implantable BMI systems.

Chapter 4 introduces an iterative learning approach for estimating the thresholds for the sparse representation. It was shown in the previous chapter that these thresholds are neuron-specific and permit direct sorting in the compressed domain. The simple thresholding mechanism was demonstrated to discriminate between signals of different neurons, given an efficient multidimensional feature extraction technique. Since thresholding is a binary classifier, we show here that the complex nonlinear decision boundaries required for efficient class discrimination can be achieved by fusing a set of weak binary classifiers to optimally represent individual units. The thresholds for these binary classifiers were adaptively estimated through a learning algorithm that maximizes the separability between the sparsely represented signal classes. This approach substantially reduces the computational complexity of extracting, aligning, and sorting multiple single unit activity early in the data stream, making it highly suitable for implementation in large-scale BMI applications.

Chapter 5 introduces the use of spatiotemporal maximum entropy models for understanding the role of statistical dependence between the output spike trains of cortical neurons in encoding an input stimulus. This dependence may be reminiscent of network state transitions that characterize the dynamics of the stimulus. Pairwise maximum entropy models have been suggested to fit instantaneous patterns of activity in small populations that exhibit spatial correlations, but they did not account for the temporal correlations that likely represent the effect of spiking history on network state transitions. In this chapter, we introduce an extension to current instantaneous maximum entropy models that accounts for correlations over longer temporal intervals among neurons. We evaluated the performance of our spatiotemporal model in predicting the network states of V1 cortical neurons in response to drifting grating stimuli, and compared it to that of the instantaneous maximum entropy model, as well as to an independent model. We also examined how stimulus-specific the network representations are for each of these models.

Chapter 6 introduces a novel method for inferring connectivity by including higher-order interactions among neuronal constituents and representing this connectivity with hypergraphs. In this chapter, we formulate an information-theoretic approach to determine whether cooperation among neurons may constitute a governing mechanism of information processing.
Specifically, we show that cooperation among neurons can provide a "message-passing" mechanism that preserves most of the information in the covariates under specific constraints governing their connectivity structure. Using a biologically plausible statistical learning model, we demonstrate the performance of this information-theoretic approach in synergistically encoding a motor task using a subset of neurons drawn randomly from a larger population. We demonstrate its superiority in approximating the joint density of the population from limited data compared to the independent and instantaneous maximum entropy models.

Finally, Chapter 7 provides concluding remarks and future directions.

2 Overview of Neural Analysis Tools

A typical neuron possesses three distinct parts: the cell body, also referred to as the soma, the axon, and the dendrites [1]. The axon and dendrites are the means by which neurons communicate with each other, acting as the neuron's transmitter and receivers, respectively [8]. Neurons are also referred to as excitable cells, since they tend to generate an action potential when the dendrites receive enough input from the axons of neighboring neurons that the collective activity passes a certain threshold. As a result of neural firing, the generated action potential propagates through the axon to influence the activity of neighboring or distal neurons.

The different techniques developed for recording neural activity can be categorized into noninvasive and invasive methods. While techniques such as electroencephalogram (EEG) [9] and functional magnetic resonance imaging (FMRI) [10] recordings are considered noninvasive, patch-clamp recordings of single cells [11] and extracellular recording using multi-electrode arrays (MEAs) [12] are invasive. Although noninvasive techniques, owing to their harmless nature, have been commonly practiced for medical purposes, they lack the spatial and temporal resolution required for studying the behavior of neurons in brain functionality [13].

This dissertation mainly focuses on signals collected by invasive recording techniques using implantable high-density MEAs, where the objective is to study the role of the activity of single neurons in a population. These invasive devices have been demonstrated to probe the implanted site at high spatial (< 50 μm) and temporal (< 1 millisecond) resolution [14]. The two major kinds of MEAs are wire-based arrays, also referred to as micro-wires [15], and silicon-based arrays [16]. Both micro-wires and silicon-based MEAs have been designed in different shapes and lengths to enable recording from various brain locations in different subjects. Although MEAs are designed to record the activity of a subpopulation of neurons, the recorded signal is contaminated by the cumulative activity of millions of surrounding neurons, referred to as the background activity, as well as by noise [17].

We focus explicitly on spikes for a number of reasons. First, their origin, the neurons, is known or can be estimated. Second, spikes fired by a given neuron tend to be highly stereotypical, permitting them to be modeled with probability distributions with relatively few parameters. Third, spike trains, albeit largely variable across multiple repeated trials, carry a great deal of information about neuronal tuning characteristics to sensory and motor covariates, thereby permitting adequate modeling of the corresponding neural systems and their dynamics [18].
In contrast, other neural signals, such as local field potentials (LFP) and electroencephalographic (EEG) signals, are more obscure in terms of the signal sources they originate from, are far more limited in their spatial and temporal resolution, and their correlation with sensory and/or motor covariates is ill-defined. This makes them hard to model and consequently hard to detect or discriminate from pure noise. Therefore, the first step in most neural analysis algorithms is to extract the signal of interest from raw MEA recordings by performing spike detection and sorting. In the following, we explain spike detection and sorting from a general pattern recognition standpoint.

2.1. Spike Detection and Sorting

Spike detection and sorting play a crucial role in processing neural signals, largely due to the highly stochastic nature of these signals [17]. Statistical hypothesis testing decides which generative model or hypothesis, among many possible ones, may have generated the observed signals [19]. The complexity of the detection/sorting task can be viewed as directly proportional to the degree of "closeness" of the candidate generative models; i.e., the task gets more complex as the models get closer together in a geometrical sense. In most spike detection/sorting applications, it is often assumed that the signal and noise are independent and therefore uncorrelated.

Spikes are transient stochastic signals that are non-zero only for a subset of samples, $L$, within an observation interval. Detecting the transient signal involves knowing the arrival time, the duration, and the waveform shape. The most important parameter, however, is the spike arrival time, as it carries all the information about how neurons encode and communicate amongst themselves. Given observations $x$, detection can be cast as a binary hypothesis testing problem in which the goal is to determine which of the null hypothesis, $H_0: x = n$, or the alternative hypothesis, $H_1: x = s + n$, is in place, where $s$ is the transient signal and $n$ is noise [20]. Most spike detection algorithms assume the noise and signal are uncorrelated, and the binary test can be performed using simple thresholding. The threshold is set to three times the standard deviation of the noise [21], and spikes that pass this threshold are cropped and aligned based on a common extremum point; i.e., all spikes are cropped $L_1$ samples before the local minimum point of the spike and $L_2$ samples after, where $L = L_1 + L_2 + 1$.

Spike sorting refers to the process of classifying the individual spikes extracted after the spike detection step. The objective is to segregate the activity of each individual neuron to characterize its encoding characteristics. Spike sorting has been one of the most extensively studied problems since techniques for extracellular recording first started to emerge [17]. Since then, the need for more sophisticated spike sorting techniques has continued to grow with the ability to simultaneously sample large populations of neurons.

Spike sorting is essentially a pattern recognition problem [22]. The approach relies on the sequence of steps illustrated in Figure 2.1. After spikes are detected and extracted, they are aligned relative to a common reference point. This step is important because spikes may be translated, and potentially scaled, from previous occurrences relative to the detection point, especially when the extraction window length $L$ is chosen empirically by the user.
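To make the detection and extraction steps concrete, the following is a minimal sketch of this procedure in Python; the function name, the median-based noise estimate, and the particular window lengths are illustrative assumptions rather than choices taken from this dissertation.

```python
import numpy as np

def detect_and_extract(x, L1=10, L2=21):
    """Detect spikes by amplitude thresholding and extract aligned snippets.

    Follows the rule described above: the threshold is three times an
    estimate of the noise standard deviation, and each detected event is
    cropped L1 samples before and L2 samples after its local minimum,
    giving waveforms of length L = L1 + L2 + 1.
    """
    # Robust noise estimate (median absolute deviation); a plain np.std(x)
    # would be inflated by the spikes themselves.
    sigma = np.median(np.abs(x)) / 0.6745
    thr = 3.0 * sigma

    spikes, times = [], []
    i = L1
    while i < len(x) - L2:
        if abs(x[i]) > thr:
            # Align on the local minimum (trough) within a short search window
            k = i + int(np.argmin(x[i:i + L2]))
            if L1 <= k < len(x) - L2:
                spikes.append(x[k - L1:k + L2 + 1])
                times.append(k)
            i = k + L2  # skip past this event (enforces a dead time)
        else:
            i += 1
    return np.array(spikes), np.array(times)
```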
The feature extraction step finds the most "discriminative" features of the spike waveforms. These features can be viewed as a vector in a $d$-dimensional space. The criterion for choosing the features should allow the corresponding pattern vectors from different spike classes to occupy compact and disjoint regions in this $d$-dimensional space. The clustering step attempts to separate the extracted features by partitioning the feature space into distinct clusters. This is done by identifying statistical decision boundaries based on perceived similarity among the patterns. This step is difficult because the clusters can have very different shapes and their number is unknown beforehand. The last step is to label the waveforms in each cluster as belonging to a single neuron. These steps constitute the training mode of the pattern classifier. Supervision by the user through the feedback path allows optimization of the preprocessing and feature extraction/selection strategies (e.g., determining the number of clusters). In the test mode, similar preprocessing and feature extraction take place for each incoming unknown spike. A decision-making step compares the extracted features to the set of stored features and assigns the incoming spike to the closest pattern class using some distance metric.

Figure 2.1 Pattern recognition approach to spike sorting. Spike events are extracted after detection, and discriminative features are computed and clustered in a 2D or 3D feature space.

The effectiveness of the representation space (feature set) is determined by how well patterns from different classes can be separated, while the original details of the spike waveform shapes become less important once these patterns are reasonably separable. Therefore, most of the work in spike sorting has focused on the feature extraction and clustering steps. Initial work relied on extracting simple features of the waveform shape, such as amplitude and duration [23]. Because these features are highly variable, cluster separability using these methods is often poor unless the SNR is considerably high and the waveforms are stationary. Template matching, on the other hand, relies on determining the degree of similarity between stored template spike waveforms and incoming spikes [17]. The templates are obtained by averaging many occurrences of labeled spikes in the training mode. The method is computationally demanding because the comparison needs to take place for all possible translations and rotations of the waveforms, and it is also vulnerable to nonstationarity unless the stored templates are regularly updated.

Principal component analysis (PCA) is a popular method that attempts to alleviate these limitations by representing the spike waveform as a weighted sum of orthonormal eigenvectors. Typically, the PCA feature space uses $d = 2$ or 3 features (depending on how separable the clusters are) to represent each spike. These PCA features are subsequently clustered using algorithms such as expectation-maximization clustering [24] or fuzzy c-means clustering [25]. Figure 2.2 illustrates an example of a PCA feature space with two scores representing each spike. In this case, we have five clearly isolated clusters that amount to five single units in the observed ensemble activity.
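The PCA/EM pipeline just described can be sketched in a few lines; here scikit-learn's GaussianMixture stands in for the EM cluster-cutting step, and the Bayesian information criterion (BIC) is used as one plausible way to choose the unknown number of clusters, an illustrative choice rather than the dissertation's.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def sort_spikes(spikes, max_units=5, d=2):
    """Cluster extracted spike waveforms (rows of `spikes`) in a
    d-dimensional PCA feature space using EM-fitted Gaussian mixtures."""
    # Project each L-sample waveform onto its first d principal components
    feats = PCA(n_components=d).fit_transform(spikes)

    # Fit mixtures of increasing order; keep the best fit by BIC, since the
    # number of units is unknown beforehand
    best, best_bic = None, np.inf
    for n in range(1, max_units + 1):
        gmm = GaussianMixture(n_components=n, covariance_type='full').fit(feats)
        bic = gmm.bic(feats)
        if bic < best_bic:
            best, best_bic = gmm, bic

    labels = best.predict(feats)  # one putative unit label per spike
    return feats, labels
```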
Figure 2.2 PCA for feature extraction and EM for unsupervised cluster cutting of 6469 spike events. Five distinct units can be separated.

2.2. Generalized Linear Model

Generalized linear models, or GLMs, are a family of generative models that have been extensively used to simulate neural encoding. A GLM models the spike train of a neuron as a binary point process governed by the conditional intensity function, $\lambda(t|F(t))$, where $F(t)$ includes all factors that influence the output of the neuron at time $t$. In this dissertation, we have used GLMs to simulate a motor encoding model of classical 2D goal-directed arm reach movement in a center-out reach [18] or a pinball task [26]. The trajectory information was collected from actual repeated arm movements using a computer-interface pointing device. The direction of movement was extracted from these trajectories and used as an input to the GLM to modulate the firing pattern of a population of neurons.

We designed the population network in two separate layers: the input layer (also referred to as the non-cooperative layer) and the integration layer (also referred to as the cooperative layer). In the input layer, neurons were tuned only to movement direction, such that the conditional intensity function, $\lambda_i^{\mathrm{Input}}(t|s(t))$, for neuron $i$ was given by

$$\lambda_i^{\mathrm{Input}}\big(t\,\big|\,s(t)\big) = \exp\left(\beta_i + \delta_i \cos\left(\frac{\kappa(t) - \kappa_i}{\omega_i}\right)\right) \qquad (2\text{-}1)$$

where $\delta_i$ denotes the modulation depth, $\kappa(t)$ denotes the movement direction, $\kappa_i$ denotes the neuron's preferred direction (or receptive field in the case of sensory encoding neurons), and $\omega_i$ denotes the tuning width. This layer was designed such that there were no connections between neurons, and the observed correlations were due to the overlap between their tuning functions. The cosine function reflects the cosine tuning seen in the activity of M1 motor encoding neurons [18].

In the integration layer, neurons were not tuned to any movement parameters. Their encoding resulted from forward connectivity from neurons in the input layer through either excitatory or inhibitory connections. The conditional intensity function of these neurons, $\lambda_i^{\mathrm{Int}}(t|H_i(t))$, was expressed as

$$\lambda_i^{\mathrm{Int}}\big(t\,\big|\,H_i(t)\big) = \exp\left(\beta_i + \sum_{j\in\pi(i)}\sum_{m=0}^{M} \alpha_{ij}^{m}\, r_j^{t-m} + \sum_{k\in\dot{\pi}(i)}\sum_{m=0}^{M} \gamma_{ik}^{m}\, r_k^{t-m}\right) \qquad (2\text{-}2)$$

where $H_i(t)$ is the history term for neuron $i$, $M$ is the length of the effective history, $\pi(i)$ and $\dot{\pi}(i)$ are the sets of indices of the pre-synaptic neurons to neuron $i$ in the input layer and the integration layer, respectively, while $\alpha_{ij}$ and $\gamma_{ik}$ represent the forward connections and the within-integration-layer connections, respectively.

Figure 2.3a demonstrates the two-layer GLM used for modeling motor encoding. The blue coupling filters are the forward connections between the input and integration layers, expressed as $\alpha_{ij}^{m}$, and the purple coupling filters are the connections within the integration layer, expressed as $\gamma_{ik}^{m}$ in (2-2). The green filters are known as the intrinsic filters, which are self-inhibitory connections that prevent neurons from bursting. They impose a refractory period of at least 2 msec after a neuron fires a spike, during which no spikes are fired. The outputs of the input and integration layers are in the form of binary spike trains, demonstrated in Figure 2.3b for 15 input layer neurons and 15 integration layer neurons.
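A minimal simulation of the input-layer model of equation (2-1) could look as follows; the parameter values (baseline, modulation depth, tuning width) are illustrative placeholders, and spikes are drawn as Bernoulli events with probability λ·Δt per bin, a standard discrete-time approximation of the underlying point process.

```python
import numpy as np

def simulate_input_layer(kappa_t, dt=0.001, n_neurons=15, seed=0):
    """Simulate direction-tuned input-layer neurons per equation (2-1).

    kappa_t : array of movement direction (radians), one value per time bin.
    Returns an (n_neurons x T) binary spike matrix.
    """
    rng = np.random.default_rng(seed)
    beta = np.log(5.0)    # ~5 spikes/s baseline (illustrative)
    delta = 1.0           # modulation depth (illustrative)
    omega = 1.0           # tuning width (illustrative)
    kappa_i = np.linspace(0, 2 * np.pi, n_neurons, endpoint=False)  # preferred dirs

    T = len(kappa_t)
    spikes = np.zeros((n_neurons, T), dtype=int)
    for i in range(n_neurons):
        # Conditional intensity of equation (2-1)
        lam = np.exp(beta + delta * np.cos((kappa_t - kappa_i[i]) / omega))
        p = np.clip(lam * dt, 0.0, 1.0)  # Bernoulli approximation per bin
        spikes[i] = rng.random(T) < p
    return spikes
```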
Figure 2.3 Schematic of the network structure used to simulate the spike trains. (a) The input layer consists of 15 direction-tuned neurons. Each integration layer neuron receives forward connections from the input layer. (b) Sample raster plots for one trial during a reach for neurons of the input and integration layers. Every row represents the binary spike train of a neuron, where each dot is a spike event.

3 Neural Activity in the Sparse Representation Domain

Spike trains are the fundamental neural communication mechanism used by cortical neurons to relay, process, and store information in the brain. Decoding the information in spike trains is thus important in neuroscience in order to better understand the complex mechanisms underlying brain functionality. In motor systems, spike trains have been demonstrated to carry important information about movement intention and execution [18], and have been shown to be useful in the development of brain-machine interface (BMI) systems and neuroprosthetics that improve the lifestyle of severely disabled people [27-29]. Cortically-controlled BMI systems depend fundamentally on instantaneous decoding of spike trains from motor cortical neurons recorded during short intervals of around 100-200 milliseconds, often referred to as the movement planning period [29].

The decoding process is typically a cascade of preprocessing steps, illustrated in Figure 3.1. It features amplification and filtering, followed by spike detection and sorting to segregate single unit activity in the form of binary spike trains. The spike trains are then filtered using a Gaussian kernel to yield a smoothed estimate of the firing rate [30, 31]. These steps have to be performed within the movement preparation period to enable the subject to experience natural motor behavior.

Figure 3.1 Schematic diagram of a typical data flow in a neuro-motor prosthetic application. Ensemble neural recordings are first amplified and filtered prior to telemetry transmission to the outside world. Three data processing paths are considered: (a) Wired systems (top): information is extracted through the cascade of spike detection and sorting followed by rate estimation with massive computational power. (b) Wireless systems (middle): telemetry bandwidth is reduced by moving the spike detection block inside the implantable device. (c) Proposed system (bottom): the spike detection, sorting, and rate estimation blocks are replaced with one "compressive classifier" block that permits adaptive firing rate estimation in real time, allowing instantaneous decoding to take place.

The spike sorting step is arguably the most computationally prohibitive task in Figure 3.1, requiring two modes of analysis: a training mode and a runtime mode. During the training mode, spikes are detected, aligned, and sorted based on certain discriminating features [17], while during runtime, the features of an observed event are classified as one of the predefined neuronal classes. As a result of the significant amount of computation required to enable this identification/classification process, most existing systems feature an external computing engine connected through a wire to permit streaming of the high-bandwidth neural data.
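As context for the kernel smoothing step in the cascade above, the Gaussian-kernel rate estimate reduces to a single convolution of the binary spike train; this sketch assumes 1 ms bins and a 45 ms kernel width, values chosen here only for illustration.

```python
import numpy as np

def kernel_rate(spike_train, dt=0.001, sigma=0.045):
    """Smooth a binary spike train with a Gaussian kernel to obtain an
    instantaneous firing rate estimate in spikes/s."""
    half = int(3 * sigma / dt)     # truncate the kernel at +/- 3 sigma
    t = np.arange(-half, half + 1) * dt
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= kernel.sum() * dt    # normalize so the output is a rate
    return np.convolve(spike_train, kernel, mode='same')
```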
To eliminate the necessity of the wired connection, it has been suggested to obtain a sparse representation of the recorded data that could potentially be used for denoising and compression [21] as well as spike detection and sorting [32, 33]. In this chapter, we will show that the same sparse representation not only overcomes the severe bandwidth limitations of a wireless implantable system, but also enables adequate estimation of neuronal firing rates without the need to decompress, reconstruct, and sort the spikes on an external device, as illustrated at the bottom of Figure 3.1. In the following, we first introduce the model of neuronal firing as a point process and outline the mapping from neural recordings to spike train realizations through the sparse representation. Then, the details of the model used to simulate neural activity encoding 2D arm movement are described, followed by the experimental results.

3.1. Spike Trains as Point Processes

In a typical recording experiment, the observations of interest are the times at which events occur in a population of neurons. A spike train is the set of event times that expresses the discharge pattern of an arbitrary neuron $n$. This pattern can be modeled as a realization of an underlying point process with a conditional intensity function, $\lambda_n(t|F)$, also referred to as the firing rate [34]. This intensity function is conditioned on some set, $F$, of intrinsic properties of the neuron itself and the neurons connected to it, and of the neuron's tuning characteristics to external stimuli [35]. Because many of these properties are hard to measure, the number of events in a given interval, $N_p$, is typically random by nature. The sum of $\lambda_n$ over a finite time interval between times $T_a$ and $T_b$
This nonstationarity arises in part because of the dependence of the firing rate on multiple factors such as the degree of tuning (sharp or broad) to behavioral parameters, the behavioral state, the subject’s level of attention to the task, level of 18 fatigue, prior experience with the task, etc. While across-trial averaging of rate histograms helps to reduce this variability, it destroys any information about the dynamics of interaction between neurons that are widely believed to affect the receptive fields of cortical neurons, particularly when plastic changes occur across multiple repeated trials. Typically, a nonparametric kernel smoothing step (e.g. a Parzen window [43]) is needed. The temporal support of the kernel function is known to strongly impact the rate estimator [44]. Moreover, the selection of temporal support is arguably important to determine the type of neural response property sought. For small temporal supports (< 2-3ms), precise event times can be obtained. As the temporal support approaches the trial length, we obtain the overall average firing rate over that trial. In between these two limits, the temporal support needs to be adaptively selected to capture any nonstationarities in ߣ௡ that may reflect continuously varying degrees of neuronal inhibition and excitation related to the degree of tuning to behavioral parameters. We ultimately seek to estimate ߣ௡ directly from the recorded raw data. However, two complications arise: First, the events detected are not directly manifested as binary sequence of zeros and ones to permit direct convolution with a kernel to take place, but rather by full Action Potential (AP) waveforms. Second, these events are typically a combination of multiple single unit activity in the form of AP waveforms with generally distinct -but occasionally similar- shapes. This mandates the spike sorting step before the actual firing rate can be estimated. Let’s assume that the actual spike waveforms are uniformly sampled over a period ܶ௦ . Each spike from neuron ݊ is a vector of length ܰ௦ samples that we will denote by ݃௡ . If the event time is taken as the first sample of the spike waveform, but can be generalized to any time index (e.g. that of a detection threshold crossing), then the discrete time series corresponding to the entire activity of neuron ݊ over a single trial of length ܶ can be expressed as 19 ேೞ ିଵ ‫ݏ‬௡ = ෍ ෍ ݃௡ ሾ݇ሿߜሾ݅ − ݇ሿ (3-2) ௜∈ሼ௧೙ ሽ ௞ୀ଴ where the time index ݅ includes all the refractory and rebound effects of the neuron and takes values from the set ሼ‫ݐ‬௡ ሽ, while ߜሾ . ሿ is the Dirac delta function. 3.2. Sparse Representation For compression purposes, it was shown in [21, 32] that a carefully-chosen sparse transformation operator -such as a wavelet transform- can significantly reduce the number of coefficients representing the spike waveform to some ܰ௖ ≪ ܰ௦ . This number is determined based on the degree of sparseness ‫ ݍ‬as ܰ௖ ≈ ߝ ሺ‫2 − ݍ‬ሻ⁄2‫ ݍ‬where 0 < ‫ 0 = ݍ( 2 < ݍ‬implies no sparseness, while ‫ 2 = ݍ‬implies fully sparse) and ߝ denotes some arbitrarily chosen signal reconstruction error [45]. Mathematically, an observed spike, ݃, is represented by the transform coefficients obtained from the inner product ݃ ௝ = 〈݃, ߱ ௝ 〉, where ߱ ௝ is an arbitrary wavelet basis at time scale ݆. The wavelet basis should be selected to keep the smallest number of coefficients without significant loss of spike features. 
In the traditional spike sorting approach, this means performing inverse wavelet transformation followed by spike sorting in the time domain. When multiple units are simultaneously recorded, the spike recordings from the entire population can be expressed as ೕ ே೎ ‫ ݏ‬௝ = ෍ ෍ ݃ ௝ ሾ݇ሿߜሾ݅ − ݇ሿ ௜∈ሼ௧ೞ ሽ ௞ୀ଴ (3-3) where ܰ௖ is the number of nonzero transform coefficients at time scale ݆, and ݅ takes values from ௝ 20 ௝ the set of spike times in the whole trial, ሼ‫ݐ‬௦ ሽ. Note that ܰ௖ ≪ ܰ௦ and the total number of coefficients obtained is ܰ௖ = ∑௝ ܰ௖ . ௝ Node 8 Node 7 (a) Figure 3.2 (a) Sparse representation of sample events from three neurons, red, green and blue in the noisy neural trace for four wavelet decomposition levels indicated by the binary tree. Sensing thresholds are set to allow only one coefficient/event to survive in a given node. For example, node 6, 7 and 8 can be used to mark the blue, green and red events, respectively. (b) 1D and 2D joint distributions of wavelet features for nodes 7 and 8 for the three neurons over many spike occurrences showing 3 distinct clusters. While the 2D projection represents a clear separation between the three neurons, a similar performance can be achieved from the 1D projections by using the two-class discrimination task. 21 Figure 3.2 (continued) (b) To minimize the number of the most important coefficients per event, ideally to a single feature, we note that the magnitude of the coefficients ݃ ௝ carry information about the degree of correlation of the spike waveforms with the basis ߱ ௝ . Therefore, this information can be used to single out “the most significant” coefficient via a thresholding process. This process requires defining a sensing threshold for each neuron ݊ at time scale ݆, denoted ߛ௡ . This threshold should be selected to ௝ preserve the ability to discriminate neuron ݊’s events from those belonging to other neurons. Specifically, in every time scale ݆, we cast the problem as a binary hypothesis test in which ‫ܪ‬ଵ ௝ ௝ ห݃ ௝ ሾ݇ሿห ≷ ߛ௡ ݂‫ܰ , ⋯ ,1,0 = ݇ ݎ݋‬௖ and ݆ = 0,1, ⋯ , ‫ܬ‬ ‫ܪ‬଴ (3-4) The set of thresholds ߛ௡ are selected based on iterative learning approach described in chapter 4. ௝ 22 The outcome of this statistical binary test is a set of time indices per event, ݇ ∗ , for which the alternative hypothesis ‫ܪ‬ଵ is in effect. In other words, the sensing threshold in a given time scale should allow only one coefficient to be kept per event, i.e., ݇ ∗ takes only one value. Once this is achieved, ݃ ௝ ሾ݇ሿ at indices ݇ where ‫ܪ‬଴ is in effect are automatically set to zero. Note that this step allows suppressing both noise coefficients as well as those belonging to neurons' other than neuron ݊'s. In such case, the thresholded signal can be expressed as ‫ ̃ݏ‬௝ = ෍ ݃௡ ሾ݇ ∗ ሿߜሾ݅ − ݇ ∗ ሿ ௞ ∗ ∈ሼ௧೙ ሽ ௝ (3-5) ෠ The outcome of (3-5), after proper normalization, is an estimate, ܾ௡ , of the true binary spike ෠ train vector ܾ௡ . It can be readily seen that the temporal characteristics of ܾ௡ will exactly match that of the binary spike train of neuron ݊ and consequently preserves all the critical information such as spike counts and inter-spike interval (ISI) statistics allowing rate estimation to be readily implemented [46]. The scaling factor ݃ ௝ ሾ݇ ∗ ሿ can simply be normalized to obtain a binary sequence exactly matching the spike train of neuron ݊. The simple example in Figure 3.2 illustrates this idea. 
In each wavelet decomposition node, the binary hypothesis test (i.e., the thresholding) is equivalent to a 2-class discrimination task whereby one unit at a time is identified at each decomposition level. The spike class separability (defined below) is compared to that in the time domain and a unit is extracted (i.e. its coefficients removed) from the data set if the unit separability is higher than that of the time domain. This process is repeated until the separability no longer exceeds that of the time domain, or the size of the remaining events is smaller than a minimum cluster size (typically five events), or the maximum number of levels has been reached (typically 4-5 levels). A fundamental property of the DWT sparse representation suggests that as ݆ increases, ‫ ̃ݏ‬௝ 23 becomes more representative of the intensity function rather than the temporal details of the spikes themselves, which were eventually captured in finer time scales. This is because the coefficients that survive the sensing threshold will spread their energy across multiple adjacent time indices, thereby performing the same role as the kernel smoothing approach, but at a much less computational ෠ overhead as will be shown later. Mathematically, extending the DWT of the vector ܾ௡ to higher level requires convolving it with a wavelet basis kernel with increasing support. This support, denoted ‫ݐ‬௅ at level ‫ ,ܮ‬is related to the sampling period ܶ௦ as ‫ݐ‬௅ = ܶ௦ ݊௪ 2௅ିଶ , where ݊௪ is the wavelet filter support. For the symmlet4 basis (݊௪ = 8), this temporal support is equivalent to ~1.2 ms at level 4 at a 25 kHz sampling rate, which roughly corresponds to one full event duration. Extending the decomposition to level 5 will include refractory and rebound effects of neurons typically observed in the cerebral cortex [40]. Therefore, temporal characteristics of the firing rate will be best characterized starting at level 6 and beyond where the basis support becomes long enough to include two or more consecutive spike events. 3.3. Computational Complexity of Spike Sorting in the Sparse Domain Herein, we compare the cost of estimating the firing rate through the standard time domain spike sorting/kernel smoothing approach and the compressed sensing approach. Both involve calculating the computational cost in two different modes of operation, the “training” mode and the “runtime” mode. In the training mode, features are extracted and the population size is estimated using cluster cutting in the feature space. This should ideally correspond to the number of distinct spike templates in the data. In the runtime mode, the observed waveforms are assigned to any of the existing classes, typically using a Bayesian classifier with equal priors 24 ݊ = max ܲሺ݃|‫ܥ‬௡ ሻ ௡ (3-6) ܲሺ݃|‫ܥ‬௡ ሻ = ࣨሺ݃ − ߤ௡ , Σ௡ ሻ where ߤ௡ and Σ௡ are the ܰ௦ × 1 mean vector and ܰ௦ × ܰ௦ temporal covariance matrix for each neuron class ‫ܥ‬௡ , for ݊ = 1, ⋯ , ܰ. The overall computations for the Bayesian classifier are in the order of ܱ൫ܰ × ܰ௦ ଶ ൯. A standard PCA-based spike sorting followed by a Gaussian Kernel rate estimator was used as the benchmark for evaluating the computational cost of the traditional path that appears in the top of Figure 3.1. First, spikes are aligned by searching for a local extreme followed by cropping the waveform symmetrically around that location, which requires computations in the order of ܱ൫2ܰ௦ × ܰ௣ ൯. Finding the eigenvalues and eigenvectors, for example, using a cyclic Jacobi method [47], requires ܱ൫ܰ௦ ଷ + ܰ௦ ଶ × ܰ௣ ൯ computations. 
For projection, $O(2N_s \times N_p)$ operations are performed to reduce the dimensionality of the spike waveforms to a two-dimensional feature space. A cluster-cutting algorithm, such as expectation-maximization (EM), is then performed on the resulting 2-D feature space. Optimizing the EM clustering requires $O(N d^2 \times N_p^2)$ computations, where $N$ here denotes the number of Gaussian models and $d$ is the data dimension (here $d = 2$). To detect the various spike prototypes, EM clustering is run for different values of $N$ and the best fit is selected. The overall computation required for EM clustering up to a maximum of $N$ units is of order $\sum_{k=1}^{N} O(4k N_p^2) = O(2N(N+1) N_p^2)$. Consequently, the overall computation required for training the PCA-based spike sorting is $O(4N_s \times N_p + N_s^3 + N_s^2 \times N_p + 2N(N+1)N_p^2)$. In the runtime mode, detected spikes are aligned, projected, and then classified to one of the predefined units using the Bayesian classifier, requiring computations of order $O(4N_s + 4N)$.

In contrast, a five-level wavelet decomposition requires operations of order $O(23N_s)$ if classical convolution is used. However, this number can be significantly reduced using the approach we reported in [48]. Local averaging with a node-dependent filter, typically used to remedy the shift-variance property of the DWT, requires computations of order $O(8N_s)$, since this filter is only applied to nodes 4, 6, 8, 9, and 10, in which spike features are mostly captured. At each node, one unit is discriminated at a time using a 2-class cluster cutting (binary classification), requiring computations of order $O(10 N_p^2)$. Consequently, the overall computation required for the training mode is of order $O(31 N_s N_p + 10 N_p^2)$. In the runtime mode, every detected event is decomposed, filtered, and classified using a 1-D Bayesian classifier, with computations of order $O(31 N_s + N)$.

For rate estimation, three methods were considered: the rectangular kernel (rate histogram), the Gaussian kernel, and the extended DWT (EDWT) we propose. In the EDWT, the firing rate is obtained directly by normalizing the thresholded vectors and extending the decomposition to lower levels (higher frequency resolution). This requires $O(45 N_s n_\omega \sum_{l=5}^{\infty} 2^{-l}) = O(22.5 N_s)$. In the kernel-based methods, a kernel function is convolved with the spike train and the rate is estimated by sampling the result. Assuming a 45-msec bin width and a 2-msec refractory period, the number of computations required is of order $O(22.5 n_W)$. A Gaussian kernel width of $n_W = 100$ is typically used to limit the amount of computation.

3.4. Spike Class Generation and Separability

Spike waveforms were detected and extracted from spontaneous activity recorded in the primary motor cortex of an anesthetized rat using a 16-channel microelectrode array. All procedures were approved by the Institutional Animal Care and Use Committee at Michigan State University following NIH guidelines. Details of the experimental procedures used to obtain these recordings are described elsewhere [21]. These spikes were manually aligned and sorted using a custom spike sorting algorithm [32]. Of the 24 units recorded, the action potential waveforms of five representative units recorded on one electrode are shown in Figures 3.3a and 3.3b. Figure 3.3c shows a scatter plot of the first two principal components of the five representative spike classes in Figure 3.3a.
Consider, for example, unit 4, which appears quite well isolated in the time-domain feature space; the other classes are clearly poorly isolated. Results of manual, extensive, offline sorting using hierarchical clustering of all the features in the data are displayed in Figure 3.3d. Figure 3.3e illustrates the clustering result using automated, online PCA/EM cluster-cutting with two principal features. Examination of these figures reveals that the lack of separability in the feature space, particularly for units 1, 2, 3, and 5, results in significant differences between the manual, extensive, offline sorting result and the automated online PCA/EM result.

The separability of spike classes in feature space was calculated to determine the sensing thresholds for each neuron at any given time scale $j$. Specifically, we used the measure $\Gamma(\mathbf{C}) = S_B / S_W$, where $S_B$ is the between-cluster separability, $S_W$ is the within-cluster separability, and $\mathbf{C} = \{C_n \mid n = 1, \cdots, N\}$ is the set of clusters. The between-cluster separability is defined as [49]

$$S_B = \sum_{n=1}^{N} \frac{1}{|C_n| \sum_{m \neq n} |C_m|} \sum_{x \in C_n} \sum_{\substack{y \in C_m \\ m \neq n}} \| x - y \| \qquad (3\text{-}7)$$

where $|C_n|$ equals the number of spikes belonging to cluster $C_n$, $x$ and $y$ are elements from the set of all spike waveforms, and $\| \cdot \|$ denotes the Euclidean distance (L2-norm) between two elements. $S_B$ is proportional to the overall separation between clusters; for improved separability, a large $S_B$ is desired. The within-cluster separability, in turn, is defined as

$$S_W = \sum_{n=1}^{N} \frac{1}{|C_n| (|C_n| - 1)} \sum_{x \in C_n} \sum_{y \in C_n} \| x - y \| \qquad (3\text{-}8)$$

and is proportional to the overall spread within each individual cluster; for improved separability, a small $S_W$ is desired. A large $\Gamma(\mathbf{C})$ therefore indicates greater overall separability. We computed a separability ratio (SR) as the ratio between $\Gamma(\mathbf{C})$ in every node and that in the time domain. An SR of 1 thus indicates an equal degree of separability in both domains, while ratios larger than 1 indicate superior separability in the sparse representation domain. This latter case implies that at least one unit can be separated better in that node's feature space than in the time-domain feature space. The detected unit is subsequently removed from the data, and the decomposition process continues until all possible units are detected or all nodes have been examined on any given electrode. If the same unit can be discriminated in more than one node, the "best node" for discriminating this unit is the node that provides the largest SR.

Figure 3.4 illustrates that each spike class of Figure 3.3 is separable in at least one node of the sparse representation, in which a two-class situation is considered and a single cluster is isolated in a given node while all other spike classes are lumped together. The different degrees of separability across nodes permit isolating one class at a time, owing to the compactness property of the transform in the nodes that best represent each class. For example, class 1 appears poorly isolated from class 5 in the time-domain feature space, yet it is well separated from all the other classes in node 6. Since the sensing threshold is chosen to discriminate between spike events rather than to minimize the MSE of the reconstructed spike, this selection rule results in thresholds that are typically higher than those obtained from the universal thresholding rule for denoising and near-optimal signal reconstruction [50].
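A compact NumPy sketch of (3-7) and (3-8), with the hypothetical helper name separability; clusters are given as arrays whose rows are feature vectors:

```python
import numpy as np

def separability(clusters):
    """Gamma(C) = S_B / S_W per (3-7) and (3-8); `clusters` is a list of
    arrays, each holding one cluster's feature vectors as rows."""
    S_B = 0.0
    for n, Cn in enumerate(clusters):
        others = np.vstack([C for m, C in enumerate(clusters) if m != n])
        d = np.linalg.norm(Cn[:, None, :] - others[None, :, :], axis=2)
        S_B += d.sum() / (len(Cn) * len(others))
    S_W = 0.0
    for Cn in clusters:
        d = np.linalg.norm(Cn[:, None, :] - Cn[None, :, :], axis=2)
        S_W += d.sum() / (len(Cn) * (len(Cn) - 1))
    return S_B / S_W

# separability ratio (SR): node-domain Gamma over time-domain Gamma
# SR = separability(node_features) / separability(time_features)
```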
It can be seen from Figure 3.4a that in most nodes the SR is larger than 1 (except in nodes 2 and 10). For the 24 units recorded in this data set, the performance of the compressed sensing strategy was 92.88 ± 6.33%, compared to 93.49 ± 6.36% for PCA-EM. The performance of the sensing threshold selection process is quantified in Figure 3.4b as a function of the number of coefficients retained. As the sensing threshold is increased, the number of retained coefficients naturally decreases, thereby improving compression. The most interesting result, however, is the improvement in separability of more than 70% over time-domain separability at roughly 97% compression. This implies that discarding some of the coefficients that would classically be needed for optimal spike reconstruction and time-domain sorting actually improves the ability to discriminate between spike classes based on their magnitudes alone. Maximum separability is reached when a few coefficients are retained per event; beyond that point, some classes are entirely lost and the performance deteriorates.

Figure 3.3 Five units obtained from recordings in an anesthetized preparation. (a) Events from each recorded unit, aligned and superimposed on top of each other for comparison. (b) Corresponding spike templates obtained by averaging all events from each unit in the left panel. (c) PCA 2D feature space; the dimensions represent the projection of spike events onto the two largest principal components. (d) Clustering result of manual, extensive, offline sorting using hierarchical clustering on all features in the data. (e) Clustering result using the two largest principal components and EM cluster-cutting based on Gaussian mixture models (GMM). This is an example of a suboptimal online sorting method with relatively unlimited computational power.

Figure 3.4 (a) Unit isolation quality of the data in Figure 3.3. Each cell on the left side shows the separation (displayed as a 2-D feature space for illustration only) obtained using the compressed sensing method with the binary hypothesis test in each node. The highest-magnitude coefficients that survive the sensing threshold in a given node are considered irregular samples of the underlying unit's firing rate and are marked with the "gold" symbols in the left panel. The feature space of the sorted spikes using the manual, extensive, offline spike sorting is re-displayed on the right side (with the same color code as Figure 3.3) for comparison. If a gold cluster from the left panel matches a single colored cluster from the right panel in any given row, the corresponding unit is well isolated in this node using the single coefficient/event magnitude alone. The unit is then removed from the coefficient data before the subsequent DWT calculation is performed at the next time scale. Using this approach, three out of five units (pink, red, and green) in the original data were isolated during the first iteration in nodes 4, 6, and 9, respectively, leaving two units to be isolated with one additional iteration on node 9's remaining coefficients. In the first iteration, node 2 shows weak separation between units (SR = 0.45). Unit class 4 has larger separability in node 4 (SR = 1.07). Units 1 and 2 are separated in nodes 6 and 9 (SR = 1.15 and 1.51, respectively). Units 3 and 5 are then separated in node 9 (SR = 1.14). (b) Quantitative analysis of spike class separability vs. compression rate (i.e.,
thresholding) for 24 units recorded in the primary motor cortex of an anesthetized rat. A ~70% improvement can be observed at around a 97% compression ratio compared to time-domain separability (raw data).

3.5. Decoding Performance

Because our purpose was to demonstrate the ability to decode movement trajectory directly from neural data using the compressed signal representation, and given that the nature of cortical encoding of movement remains a subject of debate in the neuroscience community [18, 28, 29, 51], investigating the methods developed here required generating neural data with known spike train encoding properties. This section describes in detail the methods we used to model and analyze the data to demonstrate the validity of the approach. Since instantaneous decoding of spike trains is the ultimate goal in this application, we used decoding performance as the measure of success of the method.

To simulate spike trains from motor cortex neurons during movement planning and execution, we used a probabilistic population encoding model of a natural, non-goal-directed, 2D arm movement trajectory. The arm movement data were experimentally collected to ensure realistic kinematics. The discrete-time representation of the conditional intensity governing each neuron's firing rate was modeled as a variant of the cosine tuning model of the neuron's preferred direction $\theta_n$ (ranging from 0 to $2\pi$) [18]

$$\lambda_n(t_k \mid x_n) = \exp\!\left( \beta_n + \delta_n \cos\!\left( \frac{\theta(t_k) - \theta_n}{\omega_n} \right) \right) \qquad (3\text{-}9)$$

where $\beta_n$ denotes the background firing rate, $\theta(t_k)$ denotes the actual movement direction, and $x_n = [\theta_n, \delta_n, \omega_n]$ is a parameter vector governing the tuning characteristics of neuron $n$. The tuning depth $\delta_n$ and background rate $\beta_n$ were assumed constant and fixed for all neurons at 1 and $\log(5)$, respectively; the preferred direction $\theta_n$ was uniformly distributed, while the tuning width $\omega_n$ was varied across experiments. Using this model, event times were generated from an inhomogeneous Poisson process with a 2-ms refractory period. The tuning term in (3-9) incorporates a neuron-dependent tuning width, $\omega_n$, an important parameter that affects the choice of bin width for rate estimation prior to decoding. Variability in this term ($\omega_n$ ranged from 0.25 to 4 in each experiment) resulted in firing rates that are more stochastic in nature and served to closely approximate the characteristics of cortical neurons' firing patterns [40]. We used the mean squared error between the rate functions obtained from the simulated trajectory data and the estimated rates,

$$MSE = \frac{1}{N} \sum_{n=1}^{N} \left( \lambda[n] - \hat{\lambda}_j[n] \right)^2 \qquad (3\text{-}10)$$

While (3-10) provides a simple and obvious measure of performance, it should be noted that in practice the true rate function is unknown. Information-theoretic measures are useful in such cases, since they assess higher-order statistical correlation between the estimators and measurable quantities such as the observed movement, and can help determine the time scale that best characterizes the information in the instantaneous firing rate. We used a node-dependent mutual information metric between the encoded movement parameter and the rate estimator [52], defined as

$$I_j = \sum_{\theta, \hat{\lambda}_j} P\big(\theta, \hat{\lambda}_j\big) \log \frac{P\big(\theta, \hat{\lambda}_j\big)}{P(\theta)\, P\big(\hat{\lambda}_j\big)} \qquad (3\text{-}11)$$

This metric is particularly useful when the instantaneous rate function is not Gaussian distributed. A sample trajectory, rate functions from neurons with distinct tuning characteristics, and their spike train realizations are shown in Figure 3.5.
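A minimal simulation sketch of the encoding model in (3-9), assuming a 1-kHz time grid and a Bernoulli approximation to the inhomogeneous Poisson draw (all names and parameter values are illustrative):

```python
import numpy as np

def simulate_spikes(theta_t, theta_n, omega_n, dt=0.001,
                    beta=np.log(5), delta=1.0, refractory=0.002):
    """Draw one spike train from the model of (3-9):
    lambda(t) = exp(beta + delta * cos((theta(t) - theta_n) / omega_n)),
    thinning out spikes that violate the 2-ms refractory period."""
    lam = np.exp(beta + delta * np.cos((theta_t - theta_n) / omega_n))
    spikes = np.random.rand(len(lam)) < lam * dt   # Bernoulli approximation
    out = np.zeros_like(spikes)
    last = -np.inf
    for i in np.flatnonzero(spikes):               # enforce refractory period
        if i * dt - last >= refractory:
            out[i] = True
            last = i * dt
    return out

# example: a slowly rotating movement direction sampled at 1 kHz
t = np.arange(0, 9.0, 0.001)
theta_t = np.mod(0.5 * t, 2 * np.pi)
train = simulate_spikes(theta_t, theta_n=np.pi, omega_n=0.5)
```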
It can be clearly seen that the tuning width has a direct influence on the spike train statistics, particularly the ISI: a broadly tuned neuron exhibits a more regular ISI distribution, while a sharply tuned neuron exhibits a more irregular ISI pattern. Figure 3.5b illustrates the tuning characteristics of a subpopulation of the ensemble over a limited range (for clarity) to demonstrate the heterogeneous characteristics of the model we employed. The 9-second raster plot in Figure 3.5c illustrates the stochastic patterns obtained for the trajectory illustrated later.

In Figure 3.6a, a 3-second segment of the movement's angular direction over time is shown superimposed on the neuronal tuning ranges of three representative units with distinct tuning widths. The resulting firing rates and their estimators using the rate histogram, Gaussian kernel, and extended DWT methods are illustrated for the three units, showing various degrees of estimation quality. As expected, the rate histogram estimate is noisy, while the Gaussian and EDWT methods perform better. In Figure 3.6b, the relation between the wavelet kernel size and the MSE is quantified. As expected, decomposition levels with shorter kernel width (i.e., fine time scales) tend to provide the lowest MSE for sharply tuned neurons. In contrast, a global minimum in the MSE is observed for broadly tuned neurons at coarser time scales, suggesting that these decomposition levels are better suited for capturing the time-varying characteristics of the firing rates. Interestingly, at the optimal time scale the MSE for the EDWT method attains a lower level than both the rectangular and Gaussian kernel methods, clearly demonstrating the advantage of our approach. The relation between the tuning width and the kernel size for the entire population is illustrated in Figure 3.6c: as the tuning broadens, larger kernel sizes (i.e., deeper decomposition levels) are required to attain a minimum MSE and thus better performance.

The mutual information between the actual movement trajectory and the rate estimators is shown in Figure 3.7. The mutual information increases steadily with kernel support until a maximum is reached at the optimal decomposition level, in agreement with the minimum-MSE result. This maximum coincides with a rate estimator spectral bandwidth matching that of the underlying movement parameter; rate estimators beyond the optimal time scale do not carry any additional information about the movement trajectory.

A sample trajectory and the decoded trajectory are shown in Figure 3.8 for four different cases. First, when no spike sorting is required; this is the ideal case in which every electrode records the activity of exactly one unit, but it is rarely encountered in practice. Second, when two or more units are recorded on a single electrode but no spike sorting is performed prior to rate estimation. Third, when spike sorting is performed for the latter case using the PCA/EM/Gaussian kernel algorithm. And fourth, when combined spike sorting and rate estimation are performed using the compressed sensing method. We used a linear filter for decoding in all cases [53]. It is clear that our method has a decoding error variance comparable to that of the PCA/EM/Gaussian kernel algorithm, suggesting that its performance is as good as, if not superior to, the standard method.
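For reference, the mutual information in (3-11) can be estimated with a plug-in joint histogram; a minimal sketch (the bin count and names are assumptions, not necessarily the estimator used in [52]):

```python
import numpy as np

def mutual_information(theta, rate_est, bins=16):
    """Plug-in estimate of (3-11), in bits, between the movement
    direction theta and a rate estimator sampled on the same time grid,
    using a joint 2-D histogram as P(theta, lambda_hat)."""
    joint, _, _ = np.histogram2d(theta, rate_est, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))
```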
Figure 3.5 (a) Schematic of encoding a 2D, non-goal-directed arm movement: the sample network of neurons is randomly connected with positive (excitatory) and negative (inhibitory) connections. The right panel shows a symbolic movement trajectory to indicate the movement parameter encoded in the neural population model. Sample firing rates and corresponding spike trains illustrate the distinct firing patterns obtained with broad and sharp tuning characteristics. (b) Sample tuning characteristics (over a partial range) of a subset of the 50 neurons modeled with randomly chosen preferred directions and widths. (c) Sample 9-sec raster plot of spike trains obtained from the population model.

An important step in validating and confirming the advantage of our approach is to compare the computational complexity of the standard PCA/EM/Gaussian kernel rate estimator with that of the compressed sensing method for different event lengths ($N_s$) and different numbers of events ($N_p$) per neuron. The results illustrated in Figure 3.9 show that our method requires significantly less computation for training. This is mainly attributable to the complexity of computing the eigenvectors of the spike data every time a new unit is recorded. In contrast, wavelets are universal approximators of a wide variety of transient signals and therefore do not need to be updated when events from new units occur. In the runtime mode, the computational cost of our method becomes higher when the number of samples per event exceeds 128. At a nominal sampling rate of 40 kHz (lower rates are typically used), this corresponds to a 3.2-msec interval, much longer than the typical action potential duration (estimated to be 1.2-1.5 msec).

Figure 3.6 (a) Top left: 400-msec segment of angular direction from a movement trajectory superimposed on the tuning "bands" of five representative units. Top right, middle, and bottom panels: firing rates obtained from the point process model for five units and their extended DWT (EDWT), Gaussian, and rectangular kernel estimators. As expected, the rectangular kernel estimator is the noisiest, while the Gaussian and EDWT estimators are closest to the true rates. (b) Mean square error between the actual (solid black line) and estimated firing rate for each neuron with the three methods. Each pair of dotted and dashed lines is the MSE for the rectangular and Gaussian kernel methods, respectively, for the five units in (a); these remain flat because they do not depend on the DWT kernel window length. For sharply tuned neurons, on average, ten levels of decomposition yield a minimum MSE that is lower than the MSE of the rectangular and Gaussian kernel methods. For broadly tuned neurons, 12 levels of decomposition yield optimal performance. (c) Tuning width versus optimal kernel size. As the tuning broadens, larger kernel windows (i.e., coarser time scales) are needed to obtain optimal rate estimators.
Figure 3.7 Average mutual information (in bits) between the movement direction, θ, and the rate estimators, averaged across the two subgroups of neurons in the population, as a function of decomposition level (i.e., kernel size). Solid lines indicate the performance of the EDWT method (dark for the broadly tuned group and gray for the sharply tuned group). The two dashed lines represent the Gaussian kernel method (broad and sharp tuning groups), and the two dotted lines represent the rectangular kernel method in the same way. As expected, sharply tuned neurons require a smaller kernel size to estimate their firing rates. Overall, the EDWT method achieves higher mutual information than either the fixed-width Gaussian or rectangular kernels for broadly tuned neurons, and slightly less for sharply tuned neurons owing to the relatively more limited response time of these neurons.

Figure 3.8 Decoding performance for a sample 2D movement trajectory. The black line is the average over 20 trials, and the gray shading around the trajectory represents the estimate variance. Top left: one unit is observed on any given electrode (i.e., neural yield = 1), so no spike sorting is required; the observed variance is due to the network interaction. Top right: every electrode records two units on average (neural yield = 2) and no spike sorting is performed. Bottom left: PCA/EM/Gaussian kernel spike sorting and rate estimation. Bottom right: compressed sensing decoding.

Figure 3.9 Computational complexity of the PCA/EM/Gaussian kernel and compressed sensing methods. (a) Computations per event vs. number of events and number of samples per event in the training mode. (b) Computations per event vs. number of samples per event and kernel size in the runtime mode. At a sampling rate of 40 kHz and an event duration of ~1.2-1.5 ms (48-60 samples), the compressed sensing method requires fewer computations than the PCA/EM/Gaussian kernel method. The number of units is assumed fixed in the training mode for both methods (P = 50).

3.6. Discussion

In this chapter, we have proposed a new approach to directly estimate a critical neuronal response property, the instantaneous firing rate, from a compressed representation of the recorded neural data. The approach has three major benefits. First, the near-optimal denoising and compression make it possible to efficiently transmit the activity of large populations of neurons while simultaneously maintaining the features of their individual spike waveforms needed for spike sorting, if desired. Second, firing rates are estimated across a multitude of time scales, an essential feature for coping with the heterogeneous tuning characteristics of motor cortex neurons. These characteristics are important to consider in long-term experiments, where plasticity in the ensemble interaction is likely to affect the optimal time scale for rate estimation. Third, as our extensive body of prior work has demonstrated [48, 54], the algorithm can be efficiently implemented in low-power, small-size electronics, enabling direct decoding of the neural signals without the need for massive computing power.
Taken together, these are highly desirable features for real-time adaptive decoding in BMI applications. We used a particular model for encoding the 2D hand trajectory for demonstration purposes only; the method is completely independent of that model. What matters is that the sparse representation preserves all the information that needs to be extracted from the recorded neural data to permit faithful decoding downstream. This includes the features of the spike waveforms as well as the temporal characteristics of the underlying rate functions.

In the tests performed here we used the same wavelet basis, the symmlet4, for both spike sorting and rate estimation. This basis was previously demonstrated to be near-optimal for denoising, compression, and hardware implementation. However, it is possible to use this basis in the first few levels and then extend the decomposition from that point on with a different basis that may better represent features of the rate functions not well approximated by the symmlet4. For example, the "bumps" in the sparse rate estimates in Figure 3.6 are not as symmetric in shape as those in the original rate or in the Gaussian estimator; for this particular example, a more symmetric basis may be better suited.

Estimating the rate using a fixed bin width may be adequate for applications that use firing rates as the sole information for decoding cortical responses during instructed behavioral tasks, such as goal-directed arm reach tasks [27-29, 55], which operate over a limited range of behavioral time scales. However, natural motor behavior is characterized by more heterogeneous temporal characteristics that reflect highly nonstationary sensory feedback mechanisms from surrounding cortical areas. The firing rates of motor neurons during naturalistic movements are highly stochastic and require a statistically driven technique that can adapt to the expected variability [40, 56]. This is particularly important given the significant degree of synchrony typically observed between cortical neurons during movement preparation [57], as well as during expected and unexpected transitions between behavioral goal representations [58].

While it has been argued that precise spike timing does not carry information about motor encoding [59], one must note that most BMI demonstrations to date were carried out in highly trained subjects performing highly stereotypical, goal-directed behavioral tasks. Very few studies, if any, have characterized naturally occurring movements in naive subjects. Thus, the potential still exists for new studies to demonstrate the utility of both neuronal response properties, namely precise spike timing and firing rate, in decoding cortical activity. The sparse representation is able to simultaneously extract these two elements, which are widely believed to be at the core of the neural code [60]. Our approach is therefore the first to offer a solution for extracting both properties within a single computational platform for future generations of BMI systems.

We note that for a fully implantable cortical interface to be clinically viable, spike detection, sorting, and instantaneous rate estimation need to be implemented in miniaturized electronics that dissipate very little power into the surrounding brain tissue.
More recently, it has been shown that tethering the device to the subject's skull to maintain a wired connection to the implant significantly increases adverse brain tissue reaction, which is believed to negatively affect implant longevity [61]. The interface therefore needs to feature wireless telemetry to minimize potential risks of infection and discomfort to the patient and to prolong the implant's lifespan. We believe that eliminating steps from the signal processing path while preserving the critical information in the neural data will significantly reduce the computational overhead, permitting small-size, low-power electronics to be deployed and accelerating the translation of this promising technology to clinical use.

4 Sorting and Tracking Neural Activity via Simple Thresholding

There has been extensive research on discriminating between spikes from multiple single units in a recorded ensemble [2]. Even when clearly isolated units remain stable over long periods of time, sorting their spike waveforms remains challenging for a number of reasons. First, action potentials from single cells, while highly stationary when recorded intracellularly, are highly nonstationary when recorded extracellularly, requiring extensive user judgment during the discrimination process. Second, spike waveforms can be substantially similar even though they were generated by different cells. Third, the presence of contaminating noise often makes clustering spike waveform features a daunting task; the degree of overlap between putative clusters increases with the noise variance. Fourth, the role of an individual neuron with a clearly isolated spike waveform in encoding a behavioral correlate on one day can substantially change on the next day, a hallmark of brain plasticity [7]. It is therefore imperative to take these factors, among many others, into consideration when devising new methods for discriminating spikes from multiple single units in a recorded neural ensemble.

This chapter introduces a novel approach to discriminating multiple single-unit activity in a given neural ensemble recording. The main contribution over the previous chapter [6] is that the approach is remarkably simple, yet very efficient compared to classical and state-of-the-art techniques. The approach shares a similar "learning phase" with other techniques, in which decision boundaries between clusters of putative single units are first determined. One substantial difference in this phase is that instead of identifying these boundaries using multidimensional search algorithms such as expectation-maximization (EM) clustering [8], the decision boundaries are formed by combining simple linear boundaries from a set of binary classifiers. These linear decision boundaries are identified by a set of optimal thresholds determined using a simulated annealing search. The testing phase, in turn, retains the most discriminative features of a given set of spikes and is therefore robust to long-term variability in waveform shapes. In addition, its low computational cost makes it suitable for implementation on ultra-low-power miniaturized hardware modules that can be fully implanted in the brain of awake behaving subjects [9].

This chapter is organized as follows. First, we formulate spike sorting as a sparse optimization problem and describe the theory behind the simple thresholding.
Then, the details of the methods used to collect the neural data and generate surrogate data with known spike waveform characteristics are explained. Finally, we demonstrate the performance of the method for variable signal-to-noise ratios and sampling rates.

4.1. Spike Sorting as a Sparse Optimization Problem

Let us assume that the extracellular neural voltage trace, $v(t)$, recorded at the tip of an electrode can be expressed as a linear superposition of spike waveforms generated by $N$ spiking neurons:

$$v(t) = \sum_{n=1}^{N} \sum_{k=1}^{K_n} W_n\!\left(t - \tau_n^{(k)}\right) + \eta(t) \qquad (4\text{-}1)$$

where $W_n(t)$ is a deterministic signal indicating the spike waveform of neuron $n$, $K_n$ is the number of spikes fired by this neuron within an observation interval of length $T$, $\tau_n^{(k)}$ are the exact spike times, and $\eta(t)$ represents all the noise elements, including the background neural noise and the instrumentation noise caused by thermal and electrical effects of the measurement system. Given the prior distribution $P_\Omega(W_n)$ and the conditional distribution $P(v(t) \mid W_n)$, Bayes' theorem can be used to compute the maximum a posteriori (MAP) estimate $P(v(t) \mid W_n) P_\Omega(W_n)$, for which the optimal class is the one that maximizes the posterior density [10]:

$$n^* = \arg\max_n P(v(t) \mid W_n)\, P_\Omega(W_n) \qquad (4\text{-}2)$$

where $\Omega$ is the set of all spike waveforms, $\{W_n(t)\}_{n=1}^{N} \in \Omega$.

Since the spike waveforms are typically not known, spike sorting involves segmenting the interval around the events surpassing a detection threshold (usually set at three times the standard deviation of the noise) and projecting these segments into a lower-dimensional space using principal component analysis to obtain $\Omega$ [8]. This set is subsequently clustered into $N$ independent clusters using algorithms such as $k$-means or EM clustering, and the conditional and prior distributions are calculated. The MAP classifier in (4-2) assigns an event to class $n$ if it maximizes the MAP estimate for that class. In conventional approaches, it is therefore imperative to build the priors and likelihoods, $P_\Omega(W_n)$ and $P(v(t) \mid W_n)$, during the training phase. In spike sorting, however, the average spike waveforms, $W_n(t)$, and the number of neurons, $N$, are unknown.

For sparse signals, the problem of identifying the signal sources in $v(t)$ can be formulated as an optimization problem in which the Euclidean norm between $v(t)$ and the observation model in (4-1) is minimized below the noise variance, $\varepsilon = \|\eta(t)\|_2$, where $\|\cdot\|_2$ is the Euclidean norm [11]:

$$\min_{N,\, W_n(t),\, \vec{\tau}_n} \sum_{n=1}^{N} K_n \quad \text{s.t.} \quad \left\| v(t) - \sum_{n=1}^{N} \sum_{k=1}^{K_n} W_n\!\left(t - \tau_n^{(k)}\right) \right\|_2 \leq \varepsilon \qquad (4\text{-}3)$$

Because there are a number of unknowns in (4-3), solving this optimization problem requires dissociating the spike times, $\vec{\tau}_n$, from the spike waveforms, $W_n(t)$. The first step in achieving this is to sample $v(t)$ within $T$ using $M = \lceil T/\Delta \rceil$ bins, where $\Delta$ is the bin width (in seconds), to obtain the discrete observations, $V(t)$:

$$V(t) = \sum_{n=1}^{N} \sum_{m=1}^{M} \alpha_n^m\, W_n(t - m\Delta) + \eta(t) \qquad (4\text{-}4)$$

where $\alpha_n^m = 1$ indicates the presence of a spike from neuron $n$ within the range $(m-1)\Delta < t \leq m\Delta$, and $\alpha_n^m = 0$ indicates otherwise.
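A minimal sketch of how the binary indicator of (4-4) could be built from spike times (the names and the toy example are illustrative):

```python
import numpy as np

def bin_spike_times(tau, T, delta):
    """Build the binary indicator vector alpha of (4-4) for one neuron:
    alpha[m] = 1 iff a spike time falls in ((m-1)*delta, m*delta]."""
    M = int(np.ceil(T / delta))
    alpha = np.zeros(M, dtype=int)
    idx = np.ceil(np.asarray(tau) / delta).astype(int) - 1  # bin index of each spike
    alpha[np.clip(idx, 0, M - 1)] = 1
    return alpha

# example: three spike times (s) in a 1-s window with 1-ms bins
print(bin_spike_times([0.0104, 0.2501, 0.9990], T=1.0, delta=0.001).sum())  # -> 3
```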
Substituting (4-4) into (4-3), the optimization problem can be expressed as

$$\min_{N,\, W_n(t),\, \vec{\alpha}_n} \sum_n \vec{\alpha}_n' \vec{\alpha}_n \quad \text{s.t.} \quad \left\| V(t) - \sum_{n=1}^{N} \sum_{m=1}^{M} \alpha_n^m\, W_n(t - m\Delta) \right\|_2 \leq \varepsilon \qquad (4\text{-}5)$$

where the vector $\vec{\alpha}_n = [\alpha_n^m]_{m=1}^{M}$ represents the binary spike train of neuron $n$, and $A'$ denotes the transpose of $A$. The observation model in (4-4) can be expressed as a linear combination of shifted spike waveforms, $V(t) = \sum_{n=1}^{N} \vec{\alpha}_n \mathbb{W}_n(t)$, where the matrix $\mathbb{W}_n(t) \in \mathbb{R}^{m \times M}$ represents $m$ shifted copies of $W_n(t)$, zero-padded up to a length of $M$:

$$\mathbb{W}_n(t) = \begin{bmatrix} W_n(t) & 0 & 0 & \cdots & 0 \\ 0 & W_n(t) & 0 & \cdots & 0 \\ & & \ddots & & \\ 0 & 0 & 0 & \cdots & W_n(t) \end{bmatrix}_{m \times M} \qquad (4\text{-}6)$$

Concatenating the $\mathbb{W}_n(t)$ matrices for the $N$ different neurons gives $\mathbb{W}(t) = [\mathbb{W}_1(t) \mid \cdots \mid \mathbb{W}_N(t)]$. The observation model can then be represented as $V(t) = \vec{\alpha}\, \mathbb{W}(t)$, where $\vec{\alpha} = [\vec{\alpha}_1 \mid \cdots \mid \vec{\alpha}_N]$. Given the sparse nature of spiking activity, we can expect $\vec{\alpha}$ to also exhibit sparsity, with a small number of large coefficients distributed among a large number of small ones. Thus, the minimization of $\vec{\alpha}_n' \vec{\alpha}_n$ in (4-5) can be replaced with the following sparse optimization problem:

$$\min_{\vec{\alpha}} \|\vec{\alpha}\|_{L_0} \quad \text{s.t.} \quad \| V(t) - \vec{\alpha}\, \mathbb{W}(t) \|_2 \leq \varepsilon \qquad (4\text{-}7)$$

where the operator $\|\cdot\|_{L_0}$ counts the number of nonzero elements in a vector. The problem in (4-7) can be viewed as a classical rate-distortion problem, in which the objective is to reduce the rate, expressed here in terms of the number of nonzero elements in $\vec{\alpha}$, while keeping the distortion, expressed in terms of the Euclidean norm, below a predetermined level [12]. To solve (4-7) in practice, we replace the $L_0$ operator with a thresholding operator. Determining the optimal thresholds is addressed next.

4.1.1. Optimal Thresholding

A threshold, $\gamma_n$, is applied to the coefficient vector $\vec{\alpha}_n$ to produce a sparse coefficient vector, $\vec{x}_n$, defined as

$$\vec{x}_n : x_n^m = \begin{cases} \alpha_n^m & |\alpha_n^m| \geq \gamma_n \\ 0 & \text{otherwise} \end{cases} \quad \text{for } m = 1, \cdots, M \text{ and } n = 1, \cdots, N \qquad (4\text{-}8)$$

The degree of sparsity in $\vec{x}_n$ is determined by $\gamma_n$; sparsity increases with increasing $\gamma_n$. This means that the minimization of $\|\vec{\alpha}\|_{L_0}$ can be approximately replaced by the maximization of the product of the $\gamma_n$'s, for $n = 1, \cdots, N$. Thus, the objective function in (4-7) can be replaced by one that maximizes the product of the neuron-specific thresholds for the $N$ neurons:

$$\max_{\vec{\gamma} \in \mathbb{R}^N} \prod_{n=1}^{N} \gamma_n \quad \text{s.t.} \quad \| V(t) - \vec{x}\, \mathbb{W}(t) \|_2 \leq \varepsilon \qquad (4\text{-}9)$$

where $\vec{\gamma} = [\gamma_n]_{n=1}^{N}$ is the vector comprising these optimal thresholds. This objective function is non-convex due to the nonlinear thresholding operator in (4-8), so it cannot be solved using standard convex optimization techniques. Instead, iterative methods that perform a generic search, such as simulated annealing, can be used to find the optimal thresholds in (4-9). Simulated annealing provides a near-optimal solution without the computational complexity of exhaustive search methods [13]. The unknown parameters in (4-9) are the thresholds, $\vec{\gamma}$, and the dictionary, $\mathbb{W}(t)$. We reduce the number of unknown parameters to facilitate the iterative learning by approximating $\mathbb{W}(t)$ with a predetermined reference dictionary, $\mathcal{W}(t)$. This is feasible because spike waveforms generated by different neurons can be faithfully estimated from a linear combination of carefully chosen orthogonal kernel functions [6]. We obtain these kernel functions, and subsequently the reference dictionary, from a multilevel wavelet transform with $L$ decomposition levels.
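For illustration, the shifted, zero-padded dictionary of (4-6) can be assembled as follows (a sketch; the names are hypothetical):

```python
import numpy as np

def shifted_dictionary(w, M):
    """Build the m x M matrix of (4-6): shifted, zero-padded copies of
    one waveform w. Rows can then be stacked across neurons to form the
    full dictionary of the observation model V = alpha @ W."""
    m = M - len(w) + 1
    W = np.zeros((m, M))
    for i in range(m):
        W[i, i:i + len(w)] = w
    return W

# toy check: a 3-sample waveform shifted across a 6-sample window
print(shifted_dictionary(np.array([1.0, -2.0, 1.0]), 6).shape)  # -> (4, 6)
```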
The simulated annealing approach can be summarized as follows. Starting from an initial set of arbitrary thresholds, $\vec{\gamma}^{(0)}$, we compute the sparse coefficients, $\vec{x}^{(0)}$, given $\mathcal{W}(t)$, and estimate the mean square error (MSE) of the reconstruction, $MSE^{(0)}$, setting the regularization term $\rho^{(0)} = 1$. For a maximum of $K$ iterations, we perturb the thresholds of the previous iteration, $\vec{\gamma}^{(k-1)}$, to obtain a new set of thresholds, $\overline{\gamma}^{(k)} = f_p(\vec{\gamma}^{(k-1)})$. The perturbation function, $f_p(\cdot)$, adds to $\vec{\gamma}^{(k-1)}$ a random vector drawn from a zero-mean multivariate Gaussian distribution with fixed variance $\delta^2 I$ to obtain $\overline{\gamma}^{(k)}$. Given $\overline{\gamma}^{(k)}$, we compute the sparse coefficients, $\vec{x}^{(k)}$, the reconstruction error, $MSE^{(k)}$, and $\rho^{(k)}$ for iteration $k$. The perturbed set of thresholds and the updated regularization term are accepted if $MSE^{(k)}$ is less than $MSE^{(k-1)}$; otherwise, they are accepted only if $P^{(k)}$ is larger than a random probability, $\varsigma$, drawn from a uniform distribution on $[0, 1]$. This probability is computed as $P^{(k)} = \kappa_1 - \kappa_2\, k/K$, where the constants $\kappa_1$ and $\kappa_2$ are set to 0.5 and 0.25, respectively. Table 4.1a gives the pseudo code for learning the optimal thresholds that discriminate the spike trains while minimizing the reconstruction MSE.

In any given iteration, as we increase the thresholds in the search for the optimal set, more coefficients are set to zero. Although this increases the sparsity of the representation, it also decreases the energy of the reconstructed spike waveforms, because a significant number of coefficients is lost. The regularization term $\rho$ increases monotonically with increasing thresholds, thereby compensating for the energy loss. Specifically, we adjust $\rho$ in the $k$-th iteration to give the lowest mean square error between the observation, $V(t)$, and $\rho \times \vec{x}^{(k)} \mathcal{W}(t)$, where $\vec{x}^{(k)}$ are the sparse coefficients obtained by applying the thresholds $\vec{\gamma}^{(k)}$.

4.1.2. Threshold selection to maximize sorting performance

So far, the optimization in (4-9) determines the optimal thresholds that minimize the mean square error and thus provide a faithful representation upon reconstruction. These thresholds, however, must not only provide a faithful representation but also maintain good separability between putative spike waveform clusters in the sparse domain to achieve good sorting performance. It is noteworthy that using this criterion instead of minimizing the MSE may not result in the same thresholds. This is because a separability criterion accounts for maximal preservation of the distance between individual class probabilities in the observed joint density, whereas a mean square error criterion accounts for maximal separation between signal and noise probabilities and considers only bias and variance. Here, the Euclidean norm is replaced with the Fisher information, a measure of cluster separability, in the sparse domain. For cluster $n$, the Fisher information is calculated from the within-class separability, $S_W$, and the between-class separability, $S_B$ [10]:

$$S_{W_n} = \sum_{t \in \{\tau_n\}} (\vec{x}(t) - \vec{\mu}_n)(\vec{x}(t) - \vec{\mu}_n)^{\mathsf{T}}, \qquad S_{B_n} = K_n (\vec{\mu}_n - \vec{\mu})(\vec{\mu}_n - \vec{\mu})^{\mathsf{T}}, \qquad F_n(S_B, S_W) = \frac{\vec{\beta}_n' S_B \vec{\beta}_n}{\vec{\beta}_n' S_W \vec{\beta}_n} \qquad (4\text{-}10)$$

where $\vec{\mu}_n = (1/K_n) \sum_{t \in \{\tau_n\}} \vec{x}(t)$ is the average spike representation of neuron $n$, and $\vec{\mu} = (1/K) \sum \vec{x}(t)$ is the total average across all realizations.
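A minimal sketch of the MSE-driven annealing loop of Table 4.1a summarized above, under the simplifying assumptions of a single threshold and an orthonormal dictionary whose rows are the kernels (all names are hypothetical); the Fisher-driven variant of Table 4.1b would swap the acceptance test for the separability sum:

```python
import numpy as np

def anneal_threshold(V, W, K=500, sigma=0.05, k1=0.5, k2=0.25, seed=0):
    """Simulated annealing of Table 4.1a: perturb the threshold,
    re-evaluate the regularized reconstruction MSE, and accept worse
    moves with the cooling probability P(k) = k1 - k2*k/K."""
    rng = np.random.default_rng(seed)
    alpha = V @ W.T                       # analysis coefficients (orthonormal W)

    def evaluate(gamma):
        x = np.where(np.abs(alpha) >= gamma, alpha, 0.0)   # thresholding, (4-8)
        r = x @ W                                          # reconstruction
        rho = float(V @ r) / max(float(r @ r), 1e-12)      # least-squares energy scaling
        return rho, float(np.sum((V - rho * r) ** 2))

    gamma = 0.1
    rho, mse = evaluate(gamma)
    for k in range(1, K + 1):
        g_new = abs(gamma + sigma * rng.standard_normal()) # keep threshold nonnegative
        rho_new, mse_new = evaluate(g_new)
        if mse_new <= mse or rng.random() < (k1 - k2 * k / K):
            gamma, rho, mse = g_new, rho_new, mse_new
    return gamma, rho, mse
```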
For a two-class scenario (neuron $n$ versus all other neurons), the within-class separability is $S_W = S_{W_n} + S_{W_{\dot{n}}}$ and the between-class separability is $S_B = S_{B_n} + S_{B_{\dot{n}}}$, where $\dot{n}$ is the set of cluster indices $j \in \{1, \cdots, N\}$, $j \neq n$. The vector $\vec{\beta}_n$ is the optimal linear classifier that separates cluster $n$ from $\dot{n}$. This classifier fuses the outcomes of the thresholding blocks, $\vec{x}(t)$, to decide whether an event is a spike waveform of neuron $n$ or not. The classifier $\vec{\beta}_n$ is the eigenvector of the matrix $S_W^{-1} S_B$ corresponding to the largest eigenvalue. This classifier segregates the spike waveforms of neuron $n$, $W_n(t)$, from $\Omega$, in that the projected sparse coefficients of that neuron satisfy $\vec{\beta}_n' \vec{x}(t)\big|_{\{\vec{x}(t) \in n\}} > \vec{\beta}_n' \vec{x}(t)\big|_{\{\vec{x}(t) \in \dot{n}\}}$. Note that the cluster boundaries for each neuron are first calculated using offline analysis, from which the thresholds are subsequently calculated to achieve a similar degree of separability. The optimization can therefore be stated as simultaneously maximizing the Fisher information and the product of the $\gamma_n$'s, for $n = 1, \cdots, N$:

$$\max_{\vec{\gamma} \in \mathbb{R}^N,\ \vec{\beta}_n \in \mathbb{R}^N} \frac{\vec{\beta}_n' S_B \vec{\beta}_n}{\vec{\beta}_n' S_W \vec{\beta}_n} \times \prod_{n=1}^{N} \gamma_n \qquad (4\text{-}11)$$

where the objective is to find the optimal set of thresholds, $\vec{\gamma}$, along with $N$ optimal binary classifiers, $\vec{\beta}_n$, that simultaneously achieve the highest separability and sparsity. Combining the information from $N$ binary classifiers can also overcome possible labeling errors by individual classifiers and can detect nonlinear boundaries among classes. It also reduces the complexity of training by exploiting computationally inexpensive classifiers, and it increases the reliability of classification decisions [14], [15].

Figure 4.1 provides an example of the performance of both classifiers in sorting two neurons that were initially sorted by a human expert. Each spike waveform is projected onto the two most significant principal components, PC1 and PC2, as the x-axis and y-axis, respectively. The distribution of projected spike waveforms for each neuron is illustrated by ellipsoids to show the nonlinear decision boundary between the two clusters. These ellipsoids were obtained by fitting the distribution with a Gaussian mixture model (GMM) using at most 10 Gaussian kernels [10]. Given the actual labels, the optimal linear classifier (dashed line), determined by Fisher discriminant analysis (FDA), produces 91.5% classification accuracy [10], while the ensemble classifier that fuses the outcomes of two binary classifiers, determined by the optimal thresholds Thr1 and Thr2 shown as the vertical and horizontal lines, produces a comparable 90.5% accuracy. Note that the two optimal thresholds are not necessarily the optimal Bayesian thresholds estimated independently from the horizontal or vertical histograms, as demonstrated in Figure 4.1.
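A self-contained sketch of the two classifiers compared in Figure 4.1, run on synthetic 2-D clusters (the toy data, thresholds, and names are illustrative, not the thesis data); for two classes, the leading eigenvector of $S_W^{-1} S_B$ reduces to $S_W^{-1}(\mu_1 - \mu_0)$ up to scale, which the sketch exploits:

```python
import numpy as np

def fda_direction(X0, X1):
    """Fisher discriminant for two clusters of feature rows:
    beta = S_W^{-1} (mu1 - mu0), up to scale."""
    mu0, mu1 = X0.mean(0), X1.mean(0)
    Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    return np.linalg.solve(Sw, mu1 - mu0)

def threshold_fusion(X, thr):
    """Ensemble of per-axis binary classifiers: label a point class 1
    only if every feature exceeds its axis-aligned threshold."""
    return np.all(X >= thr, axis=1).astype(int)

rng = np.random.default_rng(1)
X0 = rng.normal([-1, -1], 0.6, size=(500, 2))   # cluster of neuron 0
X1 = rng.normal([+1, +1], 0.6, size=(500, 2))   # cluster of neuron 1
beta = fda_direction(X0, X1)
acc_fda = np.mean(np.r_[X0 @ beta < 0, X1 @ beta >= 0])  # midpoint cut at 0
acc_thr = np.mean(np.r_[threshold_fusion(X0, [0, 0]) == 0,
                        threshold_fusion(X1, [0, 0]) == 1])
```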
Table 4.1 Pseudo code for determining the optimal thresholds by (a) minimizing the MSE and (b) maximizing the sorting performance

(a)
Inputs:
• Arbitrary thresholds, $\vec{\gamma}^{(0)}$
• Regularization term, $\rho^{(0)} = 1$
• Compute the sparse coefficients, $\vec{x}^{(0)}$, then $MSE^{(0)} = \| V(t) - \rho^{(0)} \times \vec{x}^{(0)} \mathcal{W}(t) \|_2$
1: for iterations $k = 1, \ldots, K$
2:   $\overline{\gamma}^{(k)} = f_p(\vec{\gamma}^{(k-1)})$
3:   compute $\vec{x}^{(k)}$, $\rho^{(k)}$, and $MSE^{(k)}$
4:   if $MSE^{(k)} \leq MSE^{(k-1)}$ or $P^{(k)} > \varsigma$
5:     update $\vec{\gamma}^{(k)} = \overline{\gamma}^{(k)}$
6:   end
7: end

(b)
Inputs:
• $\{W_n(t), \{\tau_n\}\}_{n=1}^{N}$
• Arbitrary thresholds, $\vec{\gamma}^{(0)}$
• Estimate the optimal ensemble classifiers, $\vec{\beta}_n^{(0)}$ for $n = 1, \ldots, N$, and $S^{(0)} = \sum_n F_n(S_B, S_W)$
1: for iterations $k = 1, \ldots, K$
2:   $\overline{\gamma}^{(k)} = f_p(\vec{\gamma}^{(k-1)})$
3:   estimate the optimal ensemble classifiers $\vec{\beta}_n^{(k)}$, for $n = 1, \ldots, N$, and $S^{(k)}$
4:   if $S^{(k)} \geq S^{(k-1)}$ or $P^{(k)} > \varsigma$
5:     update $\vec{\gamma}^{(k)} = \overline{\gamma}^{(k)}$
6:   end
7: end

Figure 4.1 The distribution of spike events in a 2D space obtained using PCA. Spike events were initially clustered into two sample groups and colored to represent two separate neurons. The average spike waveform of each neuron is shown, with the shaded area representing the standard deviation. The nonlinear boundary between the two clusters is illustrated by the ellipsoids representing the distribution of each cluster, estimated by fitting at most 10 Gaussian kernels with a Gaussian mixture model. Directly applying a linear classifier (dashed line) gives a 91.5% classification accuracy, while optimal thresholding (horizontal and vertical lines) gives a comparable 90.5% accuracy. To demonstrate that these optimal thresholds are not necessarily the optimal thresholds for independently segregating the two clusters in either the 1D horizontal or vertical representation, the 1D horizontal and vertical marginal distributions are shown as 1D histograms of the features corresponding to the first and second most significant principal components, PC1 and PC2, respectively.

Maximizing the Fisher information in (4-11) using simulated annealing can be summarized as follows. The spike waveforms of the $N$ neurons, $W_n(t)$, and their spike times, $\{\tau_n\}$, are obtained using an offline spike sorting algorithm. Starting from an arbitrary set of thresholds at iteration (0), $\vec{\gamma}^{(0)}$, we find the optimal classifier for neuron $n$, $\vec{\beta}_n^{(0)}$, for $n = 1, \cdots, N$, and form the sum of Fisher information, $S^{(0)} = \sum_n F_n(S_B, S_W)$. For a maximum of $K$ iterations, we perturb the thresholds found in the previous iteration, $\vec{\gamma}^{(k-1)}$, to obtain $\overline{\gamma}^{(k)}$. The optimal classifiers, $\vec{\beta}^{(k)}$, are determined for the perturbed set of thresholds. If the sum of Fisher information for the perturbed thresholds, $S^{(k)}$, is larger than $S^{(k-1)}$, the perturbation is accepted; otherwise it is rejected with increasing probability. Table 4.1b gives the pseudo code for this approach.

4.1.3. Multilevel Wavelet Decomposition

An important element in the process of finding the optimal thresholds is the approximation of the spike waveforms, $\mathbb{W}(t)$, through the predetermined reference dictionary, $\mathcal{W}(t)$. This dictionary should represent a wide range of spike waveforms, such that most observed spike waveforms can be built from the kernels (rows) of $\mathcal{W}(t)$. These kernels can be obtained through orthogonal multilevel wavelet decomposition [17].
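As a concrete illustration, such kernels and their temporal support can be obtained with a wavelet library; a minimal sketch assuming PyWavelets (pywt) is available, using the support relation $t_L = T_s n_w 2^{L-2}$ quoted in Chapter 3:

```python
import pywt

# Orthonormal Symlet-4 scaling (phi) and wavelet (psi) kernels that could
# populate the reference dictionary W(t).
phi, psi, x = pywt.Wavelet('sym4').wavefun(level=4)

def kernel_support_ms(order, level, fs):
    """Temporal support t_L = n_w * 2^(L-2) / fs, in ms, where n_w is the
    filter length of the chosen Symlet."""
    n_w = pywt.Wavelet(f'sym{order}').dec_len
    return 1000.0 * n_w * 2 ** (level - 2) / fs

print(kernel_support_ms(4, 4, 25_000))   # -> 1.28, i.e. ~1.2 ms at 25 kHz
```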
Symlets, for example, have been shown to be near-optimal for preserving the spike features of a large number of typical spike waveform classes. The left column of Figure 4.2a illustrates four Symlet kernels of orders 2, 4, and 6 at a 25 kHz sampling rate, used to estimate the detail coefficients at decomposition levels 2, 3, and 4 and the approximation coefficients at decomposition level 4. Note that the sampling rate only affects the width of these kernels, not their shape. The figure shows a sample recording, $v(t)$, with four spike waveforms, color-coded as belonging to either the blue or the green neuron. The right column shows the coefficients of detail levels 2, 3, and 4 and approximation level 4 using Symlet 4, before and after thresholding. The green neuron can be segregated with a neuron-specific threshold at detail level 2, and the blue neuron at approximation level 4. The remaining coefficients are set to zero to increase the overall compression rate.

Since the typical duration of an action potential is around one millisecond [18], wavelet kernels are expected to have a similar temporal support, defined as the width that accounts for more than 99% of the total energy. Figure 4.2b shows the temporal support of Symlet kernels for different sampling rates and decomposition levels. The temporal support increases as the sampling rate decreases or as the decomposition level increases [19]. A particular Symlet can be selected to provide the best temporal support for a given spike waveform. As an example, Figure 4.2c shows the temporal support of Symlet kernels with four decomposition levels for different sampling rates. For a sampling rate of 25 to 30 kHz, only Symlets 4 and 6 provide the necessary temporal support of ~1.2 msec; alternatively, Symlet 2 provides this support at a sampling frequency of around 16 kHz. Figure 4.2d shows the temporal support of Symlet kernels at a 25 kHz sampling rate for different decomposition levels. Depending on the sampling rate at which the neural data are recorded and the intended number of decomposition levels, we can always find the Symlet order that yields the best sparse representation.

4.2. Collecting Neural Data

4.2.1. Generating Surrogate Data with Known Waveform Characteristics

To evaluate the performance of the optimization methods, synthetic neural recordings for which the ground truth is available are needed. To generate such data, action potentials were detected and extracted from neural recordings in the somatosensory cortex of an anesthetized rat using a 32-channel microelectrode array in response to unilateral whisker stimulation. Details of the experimental procedures used to obtain these recordings are described elsewhere [16]. All procedures were approved by the Institutional Animal Care and Use Committee at Michigan State University following NIH guidelines. Spike waveforms that passed a threshold set at 3 times the standard deviation of the noise were detected, aligned, and extracted relative to their minimum point, and sorted using the classical spike sorting algorithm. We detected 11 clearly distinguishable units on 5 channels of one recording session. To generate synthetic data, multiple spike realizations were drawn from a Gaussian distribution with the mean set as the spike waveform of neuron $n$, $W_n(t)$, and the covariance set as $W_n'(t) W_n(t)$.
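A minimal sketch of the surrogate generation just described (the names and the toy template are illustrative; the covariance is the template outer product, rescaled here to sweep the SNR):

```python
import numpy as np

def surrogate_spikes(template, n_spikes, noise_scale=1.0, seed=0):
    """Draw spike realizations around a mean template: multivariate
    Gaussian with mean W_n(t) and covariance proportional to the
    template's outer product W_n'(t) W_n(t)."""
    rng = np.random.default_rng(seed)
    cov = noise_scale * np.outer(template, template)
    return rng.multivariate_normal(template, cov, size=n_spikes)

# example: a biphasic toy template of 48 samples (~1.2 ms at 40 kHz)
t = np.linspace(-1, 1, 48)
template = -np.exp(-(t / 0.15) ** 2) + 0.4 * np.exp(-((t - 0.45) / 0.3) ** 2)
spikes = surrogate_spikes(template, n_spikes=50, noise_scale=0.02)
```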
Figure 4.3 shows 50 spike realizations for the 11-neuron population, in addition to 50 noise samples that were extracted, cropped, and aligned from an actual experimental recording with a length similar to that of the spike realizations, and displays the distribution of these samples in a 3-dimensional PCA feature space. The signal-to-noise ratio (SNR) of these data was calculated as $10 \log_{10}(\sigma / \varepsilon)$, where $\sigma$ is the average root mean square of the spike waveforms. We produced synthetic data with variable SNR by varying the scaling of the noise signal, $\eta(t)$.

Figure 4.2 (a) Schematic diagram for obtaining the sparse representation from the raw recordings using a 4-level wavelet decomposition in a tree structure with Symlets 2, 4, and 6, sampled at 25 kHz. The left column shows, in order, the orthonormal basis functions for detail levels 2, 3, and 4 and approximation level 4. The right column shows the wavelet coefficients obtained from this decomposition before and after thresholding. The 8-msec simulated neural recording contains four spike waveforms generated by the activity of two neurons. For this toy example, a neuron-specific threshold can be estimated to segregate the activity of one neuron from the rest: the green neuron is segregated at detail level 2, while the blue neuron is segregated at approximation level 4. (b) Temporal support of wavelet basis functions extracted from Symlets 2, 4, and 6 for different decomposition levels and different sampling rates. (c) Temporal support for different sampling rates at decomposition level 4. (d) Temporal support for different decomposition levels at a 25 kHz sampling rate.

Figure 4.3 Distribution of spike PCA features in a 3D space. The simulated spikes were generated from the average spike waveforms of 11 neurons recorded in vivo in the rat somatosensory cortex, in addition to noise events. Each sample is color-coded to distinguish the 11 neurons and the noise.

4.2.2. Tracking Neurons over Long-Term Recordings

To analyze the stability of spike sorting algorithms in detecting isolated units across many sessions of recorded data, we used a dataset containing multiple units collected across multiple consecutive days. Details of the data collection are described elsewhere [20]. In brief, the subject was trained to perform a delayed sensory-guided task and was implanted with a 32-microwire array in mPFC, where extracellular activity was recorded in hour-long sessions. Of the multiple sessions acquired, 11 consecutive sessions with appropriate behavioral performance over a span of three weeks were selected and used in this study to assess the robustness of sorting with respect to changes in the statistical characteristics of the collected signal. The number of neurons in each recording session was determined by a fully automatic spike sorting algorithm that takes into account features extracted from the spike waveforms, in addition to the inter-spike interval (ISI) and average firing rate of each neuron [21].
The features include the peak-to-peak amplitude of the spike waveform, the time between the two peaks, and the spectral ratio (SR), estimated as the energy in the 2-5 kHz band over the total energy. The number of neurons in each session is set as the smallest number of neurons that can faithfully explain the variance observed among all spike waveforms. Using a similar combination of spike waveform features and encoding characteristics, neurons from consecutive days are matched as being the same (Figure 4.4a). These neurons show high similarity in both their spike waveforms (Figure 4.4b) and their encoding characteristics and can confidently be assumed to be the same neuron, at least from a neural decoding perspective [21]. A total of 212 neurons were detected on 16 functional electrodes over 11 days, providing an average neural yield of 1.2 ± 1.3 units per channel per day with an average SNR of 5.75 ± 1.67 dB.

Figure 4.4 Tracking neurons across multiday recordings on a single electrode. (a) Each plate represents one day of recording, for a total of 11 days over a period of three weeks. The neurons detected on each day are illustrated by projecting their average spike waveforms using PCA; each is shown as a dot whose size is proportional to the neuron's average firing rate. Neurons tracked over multiple days are shown in matching colors; the others are black. A total of nine neurons were found to be stable for at least two consecutive days. (b) The average spike waveforms of the nine stable neurons over the 11 days, color-coded to match (a).

Variations in the spike waveforms fired by a single neuron have been observed not only over a span of daily recordings but also over a span of minutes, thus requiring robust sorting even for short-term recordings. Figure 4.5 illustrates these variations in the spike waveforms fired by neurons over hour-long sessions. Figure 4.5a shows the 2D PCA representation of the spike waveforms of 4 different neurons. Some of these neurons show a significant change in their energy as the session progresses. Figure 4.5b shows the changes in the spike waveforms more clearly. These variations turn out to be changes in the peak-to-peak amplitude (Figure 4.5c), while the peak-to-peak time and spectral ratio appear more robust. Any spike sorting algorithm that uses PCA features will be sensitive to these variations and cannot appropriately detect and track these neurons across recording sessions.

Figure 4.5 Tracking neurons within single-day recordings. (a) 2D PCA representation of the spike waveforms detected in a single day of recording (4 neurons were detected). Each dot represents the projection of a spike waveform, color-coded to match one of the 4 neurons, with the color intensity indicating the detection time (higher intensity means later times). The average 2D representation of these dots is shown as a line with the same coloring pattern.
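One way the three tracking features described above could be computed is sketched below (an illustration under stated assumptions; the exact estimators in [21] may differ):

```python
import numpy as np

def waveform_features(w, fs):
    """Compute, for one spike waveform w sampled at fs Hz: peak-to-peak
    amplitude, time between the two peaks, and spectral ratio
    (energy in the 2-5 kHz band over total energy)."""
    p2p_amp = w.max() - w.min()
    p2p_time = abs(int(np.argmax(w)) - int(np.argmin(w))) / fs
    spec = np.abs(np.fft.rfft(w)) ** 2
    freqs = np.fft.rfftfreq(len(w), d=1.0 / fs)
    band = (freqs >= 2000) & (freqs <= 5000)
    spectral_ratio = spec[band].sum() / spec.sum()
    return p2p_amp, p2p_time, spectral_ratio
```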
Figure 4.5 Tracking neurons within single-day recordings: (a) 2D PCA representation of the spike waveforms detected in a single day of recording (4 neurons were detected). Each dot represents the projection of a spike waveform, color coded to match one of the 4 neurons, where the intensity of its color indicates the time at which it was detected (higher intensity means later times). The average 2D representation of these dots is shown as a line with a matching coloring pattern. The line that extends the longest in the 2D representation indicates the largest change in the average energy of the spike waveforms fired by the corresponding neuron (here, the purple-colored neuron). (b) The spike waveforms of the purple-colored neuron. The x-axis indicates the time, in minutes, at which each spike waveform was fired, and the y-axis indicates the time in milliseconds within each recorded waveform. (c) Three key features of the spike waveforms generated by the purple-colored neuron: the peak-to-peak amplitude, the time span between peaks, and the spectral ratio (SR). While the peak-to-peak time and SR features are nearly constant for this neuron, the peak-to-peak amplitude decreases rapidly in the first 30 minutes.

4.2.3. Considerations for Practical Implementation

In our previous work [6], we demonstrated how a thresholding strategy could fulfill the design constraints of a fully implantable neural interface that performs instantaneous spike detection and sorting to cope with a limited wireless telemetry bandwidth and to minimize the overall latency in BMI applications [9], [17], [22], [23]. We proposed to equip the system with three 'operational' modes, namely a monitoring, a compression, and a sensing mode. The optimal thresholds for the compression and sensing modes are determined external to the implantable system using raw data collected during the monitoring mode [9]. Here, we refine this design to include only the compression and sensing modes, thereby removing the need for the monitoring mode, which is time consuming due to the bandwidth constraints of the wireless telemetry. Figure 4.6 shows a schematic diagram of this alternative design. It features an additional block that determines the optimal thresholds in the compression mode by continuously calculating the MSE through the regularization terms indicated in Table 4.1a. Because clustering is needed before a sensing mode can be implemented, the optimal thresholds and classifiers for sorting are computed externally (as in Table 4.1b) using the data collected during the compression mode. The important advantage of this design over the previous one is the self-evaluation capability that permits continuous adjustment of the optimal compression thresholds to cope with possible SNR degradation during long-term recordings.

The computational complexity logically becomes an important evaluation metric. Let us assume that a 32-channel neural recording is quantized into 8-bit samples, which has been shown previously to maintain the desired information [9]. Using the lifting method to implement the multilevel wavelet decomposition [22], 8 multipliers and 8 adders are required per sample; thus, for a spike waveform of length L samples, 224 × L full-adder operations are required for each recording channel, considering that an 8-bit adder requires 8 full adders and an 8-bit multiplier requires 20 full adders [24]. A similar number is required for the inverse multilevel wavelet decomposition. Applying the optimal thresholds and regularization terms requires four 8-bit adders and four 8-bit multipliers, for a total of 560 × L full-adder operations. To estimate the MSE over a length of T samples, we need $1 + \lceil \log_8 T \rceil$ bytes of memory, in addition to approximately 100 fixed bytes of memory to store the residuals of the lifting method in the multilevel wavelet decomposition, the optimal thresholds, and the regularization terms for each recording channel [9].
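As a sanity check on the operation counts above, the following sketch tallies the full-adder operations and memory bytes from the unit costs quoted in the text (8 full adders per 8-bit adder, 20 per 8-bit multiplier [24]); it is an illustrative calculation, not firmware.

```python
import math

FA_PER_ADD, FA_PER_MUL = 8, 20        # full adders per 8-bit adder / multiplier [24]

def full_adders_per_channel(L):
    """Full-adder operations per channel for an L-sample spike waveform."""
    lifting = L * (8 * FA_PER_MUL + 8 * FA_PER_ADD)     # forward DWT: 224*L
    inverse = lifting                                    # inverse DWT: another 224*L
    threshold = L * (4 * FA_PER_MUL + 4 * FA_PER_ADD)   # thresholds + regularization
    return lifting + inverse + threshold                 # 560*L in total

def mse_memory_bytes(T):
    """Bytes needed to accumulate the MSE over T samples (sub-linear in T)."""
    return 1 + math.ceil(math.log(T, 8))

assert full_adders_per_channel(1) == 560
print(full_adders_per_channel(48), mse_memory_bytes(25000))  # e.g. L=48, T=25000
```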
It can be seen that the memory size scales sub-linearly, while the number of full adders scales linearly, with the number of samples. The perturbation of the thresholds and regularization terms every T samples is performed by the central controller using a linear feedback shift register (LFSR) with XOR as the linear function [25]. The LFSR for each threshold requires an 8-bit register and an 8-bit full adder. Accepted perturbed thresholds are stored in predetermined registers before beginning the next round of perturbations.

Figure 4.6 Schematic diagram of the data flow in the implantable wireless system. Neural recordings are initially amplified and filtered in the analog front end prior to sampling/quantization. The raw data is then projected into the sparse representation by the discrete wavelet transform (DWT) block, followed by the optimal thresholding block, to reduce the telemetry bandwidth before transmission over the wireless channel. The diagram also features the adaptive regularization and inverse DWT blocks that project the sparse representation back into the original representation to continuously estimate the MSE. The green arrows indicate the iterative learning that takes place by perturbing the set of thresholds and regularization terms for the compression mode.

4.3. Compressive Spike Sorting Performance

4.3.1. Spike Sorting Performance vs. SNR and Compression Rate

We evaluated the performance of our approach for both compression and spike sorting as a function of the SNR and sampling rate, and compared it to classical spike sorting. The compression metric is important for evaluating strategies for telemetry bandwidth reduction during wireless neural recording using implantable systems [17]. The compression rate was estimated as the number of zero coefficients over the total number of coefficients. In classical PCA-based spike sorting, the lower-dimensional PCA representation can be viewed as one form of signal-dependent compression of the neural data; the compression rate in that case is the fraction of the total variance accounted for by the retained subset of eigenvectors.

Figure 4.7 shows the average performance using two neurons, collected at SNRs between 0 and 5 dB. The sorting performance in (a) is expressed as the average true positive rate (TPR), where the TPR is estimated as the number of accurately sorted spike waveforms over the total number of spike waveforms from each neuron. The sorting performance in (b) is expressed in terms of the average false positive rate (FPR), where the FPR is estimated as the number of noise events assigned to a neuron over the total number of non-spike events that pass the detection threshold of that neuron. For the 2-neuron mixture, we sampled all possible combinations and produced $\binom{11}{2} = 55$ synthetic data sets using spikes from the 11-neuron population recorded in one session. An FDA classifier (Figure 4.1) was trained given the spike trains of each neuron, and a test data set was used for calculating the TPR and FPR. The average TPR and FPR of our spike sorting approach are plotted as a blue trace in Figures 4.7a and 4.7b, respectively. We also plotted the projection of this trace on the TPR-compression, SNR-compression, and TPR-SNR planes for clarity.
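Before turning to the results, the TPR and FPR definitions above can be written directly in code. A small sketch follows, assuming arrays in which each detected event carries its true source (a neuron index, or -1 for noise) and the sorter's assignment; all names are illustrative.

```python
import numpy as np

def sorting_rates(true_labels, assigned_labels, neuron):
    """TPR: correctly sorted spikes over all true spikes of `neuron`.
    FPR: noise events assigned to `neuron` over all detected noise events."""
    true_labels = np.asarray(true_labels)
    assigned_labels = np.asarray(assigned_labels)
    is_spike = true_labels == neuron
    is_noise = true_labels == -1            # non-spike events past the threshold
    tpr = np.mean(assigned_labels[is_spike] == neuron)
    fpr = np.mean(assigned_labels[is_noise] == neuron)
    return tpr, fpr

# Toy usage: 6 detected events, two of them noise.
tpr, fpr = sorting_rates([0, 0, 1, -1, 1, -1], [0, 1, 1, 0, 1, -1], neuron=0)
```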
From Figure 4.7, it can be seen that better spike sorting performance and compression rates are achieved compared to the PCA-based approach. Unlike the classical approach, our approach is highly robust against SNR variations, as measured by the high TPR. A decrease in SNR, as expected, results in an increase in FPR and consequently a decrease in the compression rate.

4.3.2. Wavelet Basis Function vs. Sampling Rate

The sampling rate is an important element in the selection of the wavelet basis function used to sparsify the signals. Figure 4.8 shows the effect of choosing Symlet 2 or Symlet 4 on the average TPR for different SNRs as a function of the sampling rate. The sorting performance of the Symlet 4 basis function dominates that of Symlet 2 as the sampling rate increases. For sampling rates close to Nyquist (around 15 kHz), Symlet 2 provides a temporal support close to that of action potentials, while for sampling rates around 25 kHz, Symlet 4 provides a similar support. The difference in performance between the two basis kernels becomes negligible at high SNRs.

Figure 4.7 (a) Average TPR of an FDA classifier trained for different SNRs and compression rates. The compression is achieved by projecting the raw recordings using PCA. A combination of 55 recording segments, each including the activity of two distinct neurons, was used. The average TPR of our spike sorting approach is shown as the nonlinear blue trace in the 3D space. The projection of this trace is plotted on three planes for better visibility. (b) Average FPR of the FDA classifier and of our spike sorting approach versus variable SNR and compression rates.

4.3.3. Analysis of Real Data

Figure 4.9a shows 10 seconds out of a 27-minute neural recording at a sampling rate of 25 kHz obtained from the rat's somatosensory cortex during an optogenetic stimulation experiment. Optogenetics was used here to evoke precisely timed spikes from genetically targeted neurons, which helped in faithfully estimating the prior density of spiking compared to spontaneous or natural stimulus-driven activity. Figure 4.9b shows the 3D PCA feature space of all events that passed the detection threshold, aligned relative to their minimum point. We clustered the dataset into 5 classes and illustrate the average spike waveform and its standard deviation in Figure 4.9b [21]. The cluster colored black contains putative noise events that passed the detection threshold.

Using simulated annealing (Table 4.1a), we calculated the optimal thresholds for minimizing the MSE, termed the "compressed mode". After 300 iterations, we found the optimal set of thresholds, which provided a 90% compression rate. The reconstruction using the remaining 10% of the coefficients is shown in Figure 4.9a, superimposed on the original data trace for comparison. It can be seen that the shapes of the spike waveforms are faithfully preserved, while the noise has been significantly suppressed. Starting from the compressed mode, simulated annealing (Table 4.1b) is then used to find the optimal thresholds that maximize separability, labeled the "sensed mode". After another 300 iterations, these thresholds amounted to a compression rate of more than 99%. Figure 4.10a shows the compression rate and normalized MSE over the two learning phases, and Figure 4.10b shows the progression of the threshold values over the same phases.
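As a rough illustration of the thresholding and compression-rate bookkeeping described above, the sketch below uses the PyWavelets package for a 4-level Symlet-4 decomposition; the per-level thresholds are placeholders standing in for the ones learned by simulated annealing.

```python
import numpy as np
import pywt  # PyWavelets

def threshold_and_rate(x, thresholds, wavelet="sym4", level=4):
    """Zero out sub-threshold coefficients of a 4-level DWT and report the
    compression rate (zeroed coefficients over total coefficients)."""
    coeffs = pywt.wavedec(x, wavelet, level=level)   # [cA4, cD4, cD3, cD2, cD1]
    kept = [np.where(np.abs(c) >= thr, c, 0.0)
            for c, thr in zip(coeffs, thresholds)]
    total = sum(c.size for c in kept)
    zeros = sum(int(np.sum(c == 0.0)) for c in kept)
    recon = pywt.waverec(kept, wavelet)              # used for the MSE estimate
    return zeros / total, recon

rng = np.random.default_rng(1)
trace = rng.standard_normal(1024)                    # stand-in raw recording
rate, recon = threshold_and_rate(trace, thresholds=[0.5, 0.8, 0.8, 0.8, 0.8])
```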
Figure 4.8 Effect of the wavelet basis choice on the average TPR for sampling rates between 12 and 28 kHz, at SNRs of 0, 0.5, and 1.5 dB. Each dot represents the average TPR for the synthetic data sets described above, and error bars denote the variance.

Figure 4.9 (a) A 10-second segment of an in vivo recording from the rat's somatosensory cortex. The second row is a magnification of the red box shown in the first row. The black line represents the raw neural recording, the green line the reconstruction in the compressed mode, and the blue line the reconstruction in the sensed mode. (b) All detected events in a 3D space obtained using PCA. Each dot represents the projection of a detected event in the 3D space. Dots are color coded to indicate the 5 clusters obtained using the spike sorting algorithm explained in the methods.

Figure 4.11a shows the rate-distortion tradeoff obtained using random sets of thresholds drawn from a uniform distribution. The learning of the thresholds for the compressed and sensed modes is shown as the trace approximating the lower bound of the rate-distortion tradeoff. While the compressed-mode thresholds are the closest to the origin, the sensed-mode thresholds achieve the highest compression (i.e., lowest rate) at the highest distortion. Although the MSE increased significantly as a result of the learning phases in the compressed and sensed modes, the degree of separability in both modes remained qualitatively comparable to the raw data separability, as shown in Figure 4.11b, which also quantifies the changes in MSE, rate, and degree of separability as training proceeds in the two modes.

Figure 4.10 (a) The compression rate and MSE during the iterative simulated annealing learning. The first 300 iterations correspond to finding the optimal thresholds for the compressed mode, in which the MSE is minimized; a compression rate of 90% was reached. In the next 300 iterations, the optimal thresholds for the sensed mode were estimated using the iterative learning, starting from where the compressed mode stopped; after these 300 trials, a compression rate of 99% was achieved. (b) Threshold variations as learning progresses through the compressed and sensed modes (normalized thresholds for detail levels 2, 3, and 4 and approximation level 4).

We compared the sorting performance of our thresholding algorithm, which we refer to as compressive spike sorting (CSS), with PCA-based algorithms on recording sessions at least an hour long. For each session, we initially obtained the number of neurons and their firing times offline using [21]. We then learned the optimal thresholds (Table 4.1b) and classifiers for both the CSS and PCA spike sorting algorithms. At runtime, both algorithms classified the detected events that passed a universal threshold as one of the given neurons.
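The two-phase learning in Figure 4.10 is, at its core, a simulated-annealing loop over the threshold vector. The sketch below shows a generic accept/reject structure under a cooling schedule; the cost function, perturbation scale, and schedule are assumptions for illustration and not the exact settings of Table 4.1.

```python
import numpy as np

def anneal_thresholds(cost, thr0, n_iter=300, step=0.05, t0=1.0, seed=0):
    """Generic simulated annealing over a threshold vector: perturb, then
    accept if the cost drops, or with a temperature-dependent probability."""
    rng = np.random.default_rng(seed)
    thr = np.asarray(thr0, float)
    best = cost(thr)
    for i in range(n_iter):
        temp = t0 * (1.0 - i / n_iter)               # linear cooling schedule
        cand = np.clip(thr + rng.normal(0, step, thr.shape), 0, None)
        c = cost(cand)
        if c < best or rng.random() < np.exp(-(c - best) / max(temp, 1e-9)):
            thr, best = cand, c                      # accept the perturbation
    return thr, best

# Toy cost standing in for the MSE-plus-regularization objective of Table 4.1a.
thr, best = anneal_thresholds(lambda t: float(np.sum((t - 0.3) ** 2)),
                              thr0=[0.0] * 5)
```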
Figure 4.12a compares the sorting performance of the CSS and PCA algorithms for an m-day span, where the initial learning happened on a given day and the runtime happened m − 1 days afterward on the same electrode. For a 1-day span, the initial learning and runtime happen on the same day, while there is a one-day difference for the 2-day span. For the 1-day span, Figure 4.12a shows the average and standard deviation of the sorting performance for the 212 neurons detected over the 16 electrodes and 11 days. For the 2- and 3-day spans, we only report on the neurons that were tracked over 2 or 3 consecutive days, which amounted to 15 and 2 cases, respectively. While the sorting performance of the PCA and CSS algorithms was very close for the 1-day span, the sorting performance of the CSS algorithm is statistically significantly higher than that of the PCA algorithm for the longer spans (p < 0.001). Thus, the CSS algorithm could potentially be used for faithfully tracking neurons over consecutive days of recording. Note that the higher sorting performance for the 3-day span compared to the 1-day span arises because the tracked neurons typically have higher-SNR spike waveforms, while the average sorting performance for the 1-day span is lowered by low-SNR neurons that are not tracked over consecutive days.

We have demonstrated the performance of compressive sorting to be comparable, and in some cases superior, to conventional spike sorting algorithms that rely on PCA features. Although PCA features inherently introduce a significant compression of the raw neural recordings, CSS provides additional compression through thresholding, due to the elimination of a large number of wavelet coefficients. Figure 4.12b shows a schematic diagram comparing the compression of the two methods. Using n principal components, the compression rate of PCA is $(L-n)/L$, where L is the number of samples needed to faithfully represent the spike waveforms as the signal of interest. Assuming an n-level wavelet decomposition and holding on to the largest coefficient in magnitude at each of detail levels 2 to n and approximation level n (n coefficients), we obtain a similar compression rate. Thresholding to obtain a sparse representation, however, sets a large portion of these coefficients to zero. Using run length encoding (RLE), an additional compression rate can then be obtained, which is what we report as the compression rate of the CSS algorithm.

Figure 4.11 (a) Rate-distortion scatter plot. Each dot represents the rate and the distortion for a random set of thresholds drawn from a uniform distribution. The red trace shows the same rate-distortion measures for the threshold variations during training in the compressed and sensed modes. The compression thresholds are in close proximity to the origin, while the sensing thresholds are at extremely low rates. The normalized MSE, or distortion, is estimated as $(E_{Raw} - E_{Recon})/E_{Raw}$, where $E_{Raw}$ is the root mean square (RMS) of the raw recordings and $E_{Recon}$ is the RMS of the reconstructed signal.
The normalized rate is estimated as the number of zero coefficients over the total number of coefficients. (b) Rate, distortion, and degree of separability as learning progresses through the 600 iterations. The separability was calculated as the normalized inverse of the classification error. The error bars denote the standard deviation.

Figure 4.12 (a) Spike sorting performance of the compressive spike sorting (CSS) vs. the conventional PCA sorting for neurons tracked over 1-, 2-, and 3-day spans. The bars show the average sorting performance, and the error bars the standard deviation. The performance of the CSS is statistically significantly better than that of the PCA algorithm for neurons tracked over a span of 2 or 3 days (p < 0.001). (b) Schematic diagram comparing the compression of spike waveforms using PCA vs. the sparse representation. The green matrix on the left, which includes all detected events, i.e., spike waveforms and noise that pass the detection threshold, is of size N × L, where N is the number of detected events and L is the number of samples in each event. Using n principal components, events are projected into an n-dimensional space; the projected matrix is therefore N × n, and the compression rate is $(L-n)/L$. In the sparse representation using a 4-level wavelet decomposition (subbands d1-d4 and a4), each event is projected into a 5-dimensional space, given that only the maximum coefficient survives at each decomposition level, introducing a $(L-5)/L$ compression rate. Additional compression is obtained by thresholding out a large portion of these coefficients, replacing them with zeros; run length encoding (RLE) then provides an additional $k/(N \times n)$ compression, where k is the number of zero coefficients and n (here 5) is the number of coefficients per event.

4.4. Discussion

The CSS algorithm directly sorts spike waveforms from the sparse representation without the need to reconstruct the original neural recording, yet it provides sorting performance comparable to conventional and commercial spike sorting algorithms that rely on PCA features. The advantage of jointly compressing and sorting the neural data comes from the adaptive search for the optimal set of thresholds for the sparse representation that maintains the sorting performance. This search, however, is performed in two separate modes, the compressed and sensed modes. Initially, the MSE is minimized in the compressed mode to provide an offline spike sorting algorithm with a faithful reconstruction of the neural recording, which is used to estimate the number of neurons and their firing times. Given this information, a new set of thresholds is learned by maximizing the separability between the neurons in the sparse representation. We have illustrated these two separate stages of learning on the rate-distortion plot in Figure 4.11a, where we showed that, at the expense of higher distortion, much lower rates can be achieved without compromising the sorting performance. This perspective on implantable data acquisition and processing systems can have direct implications for future generations of BMI, in which the information of several thousand neurons has to be transferred instantaneously to an external actuator.
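Since the compression rate reported for CSS counts zeroed coefficients that are subsequently run-length encoded, a minimal RLE sketch over the binary keep/zero mask may be helpful; this is a simple illustrative encoder, not the implanted implementation.

```python
def rle_encode(mask):
    """Run length encode a binary sequence (1 = kept coefficient, 0 = zeroed)
    as (value, run_length) pairs."""
    runs, prev, count = [], mask[0], 0
    for bit in mask:
        if bit == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = bit, 1
    runs.append((prev, count))
    return runs

# A mostly-zero coefficient mask compresses to a handful of runs.
print(rle_encode([0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0]))
# [(0, 4), (1, 1), (0, 3), (1, 2), (0, 2)]
```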
To verify that the compromise CSS makes in not reconstructing the original neural recording does not degrade the overall sorting performance relative to conventional PCA spike sorting algorithms, which rely heavily on a faithful reconstruction, we provided two different sets of comparisons. First, we compared CSS with PCA sorting for different SNRs and different compression rates. Using synthetically generated data, for which the ground truth is known, we demonstrated the superior performance of CSS, especially at low SNR, while providing a compression rate far beyond that obtained with the PCA approach. Second, we compared the sorting performance of CSS to that of the PCA algorithm in handling the non-stationarities observed in neural data, particularly over long-term recordings. We showed, using multielectrode multiday recordings, that CSS is more robust to variability in the spike waveforms, especially as the span of the recordings increases from an hour to multiple days. Unlike the PCA algorithm, which is sensitive to variations in the energy of the spike waveforms, CSS takes into account the distribution of this energy over different sub-bands. This provides a valuable advantage of CSS over PCA sorting algorithms in tracking neurons over long-term recordings. For BMI systems, this implies that a neural decoder initialized using stable units provided by CSS does not require everyday retraining.

Key to CSS is the ability to sparsely represent the neural waveforms using a dictionary of basis kernels that preserves the discriminative features of the spike waveforms. The optimal thresholds are estimated using a learning algorithm that minimizes the mean square error while maximizing the separability between putative signal classes. We evaluated the performance for variable SNRs and compared it with classical spike sorting techniques. We also demonstrated the importance of choosing the appropriate basis kernel, as a function of the sampling rate, for the sorting performance.

It is important to note that the compression in our approach is measured by the number of zeroed coefficients over the total number of coefficients. Accordingly, more zeroed coefficients after thresholding result in more compression. Note that this definition should not be confused with the traditional definition of compression from information theory, in which the compression rate is measured in bits using entropy coding techniques [62]. In our case, we use run length encoding, a variant of entropy coding for binary sequences, to estimate the compressed output stream for the implantable BMI configuration (Figure 4.6) [63].

It is noteworthy that the compression in our approach closely resembles the idea behind the compressed sensing framework [64, 65]. Since extracellular unit activity is transient in nature, compressed sensing could also be used to compress the signal using a dictionary of orthogonal random samplers [66]. It is important to note, however, that we have replaced the random sampler matrix with a known dictionary extracted from the multilevel wavelet decomposition, based on the prior information we have about the general form of spike waveforms. The wavelet dictionary also has the advantage of requiring less memory to implement in an implantable BMI system compared to a random matrix.
Changing the L0 optimization in equation (4-7) to an L1 optimization had minimal effect on the performance of the remainder of the CSS algorithm, since the optimization is eventually approximated by maximizing the product of thresholds in equation (4-9).

The simple yet efficient CSS algorithm based on simple thresholding has its own tradeoffs. First, as mentioned before, it provides a poor reconstruction of the raw neural recording from the sparse representation in the sensed mode. Second, like many other spike sorting algorithms, CSS depends on an offline spike sorter to provide the number of neurons and their firing times used to learn the optimal set of thresholds.

An ensemble classifier was used to fuse the classification results from multiple binary classifiers, improving the overall performance and approximating the typically nonlinear boundary between classes with multiple linear classifiers. There are several theoretical and practical reasons to select an ensemble classifier. First, sophisticated classifiers are normally required to detect the nonlinear boundaries between spike clusters; by integrating the classification results of multiple simple thresholding blocks, the performance becomes comparable to that of nonlinear classifiers. Second, spike sorting becomes challenging at low SNRs, which results in an even more complex nonlinear decision boundary between classes; a combination of binary classifiers provides a solution that can approximate this nonlinear boundary. Third, since the coefficients obtained from the sparse decomposition represent different bands of the frequency spectrum, each thresholding block extracts different information, which can be integrated in the ensemble classifier to increase performance. Fourth, due to hardware limitations, simple binary classifiers, such as thresholds, are highly desirable.

Our approach identifies the independent neural channels in the raw ensemble recordings while maintaining low computational complexity, satisfying the necessary power and telemetry bandwidth constraints of fully implantable hardware. Unlike state-of-the-art systems, this approach reduces the continued reliance on an external computing site to monitor changes in signal quality over prolonged recording periods, a problem often encountered in BMI applications that rely on long-term use of implantable recording electrodes. This translates into substantial savings in the computational, communication, and human supervision costs of adjusting parameters for the implanted system. Consequently, we anticipate this will lead to further improvements in the performance and potential use of these systems in clinical applications.

5 Spatiotemporal Graphical Models for Ensemble Spike Train Analysis

Spike trains are known to carry important information about the response of cortical neurons to the external stimuli that mediate perception, learning, and motor processing. Spike trains simultaneously recorded from neurons within a local population exhibit variable degrees of statistical dependency. A large number of studies have attempted to identify the role of this dependency and whether it improves the encoding of a jointly represented stimulus attribute [67-71] or simply expresses redundancy with respect to that stimulus [59, 72-75]. To assess the optimality of any population code, the first step is to model this dependency in the population response.
For a better understanding of the exact role of statistical dependence between the spike trains of cortical neurons, one has to capture this dependence by fitting the spatiotemporal patterns of neural activity with a statistical model. Maximum entropy models have been suggested for fitting the simultaneously observed patterns of neuronal spike trains, although they predict the probability of instantaneous patterns and thereby ignore the temporal correlations in the response. Using higher-order Markov processes to represent the sequence of instantaneous patterns, we introduce a spatiotemporal maximum entropy model that takes into consideration the temporal correlations as well as the spatial ones. We provide an iterative method for learning the parameters of this model, in which the parameters are truncated to obtain sparse graphs that represent the most important interactions in a population network.

5.1. Pairwise Maximum Entropy Models

Let us assume a population of N simultaneously observed neuronal spike trains, in which a population vector, $X^t$, represents the number of spikes observed from these spike trains at time t in response to an external stimulus, s. The population vector has N elements, $X^t = [X_1^t, \cdots, X_N^t]$, where $X_n^t$ is the number of spikes fired by neuron n within time interval t. Let us also assume binary spike trains, where a zero represents an observation of no spikes in the observed time interval, and a one otherwise. In response to a continually varying stimulus s over T time bins, $s^{1:T}$, we observe a series of nonstationary population vectors, expressed as $X^{1:T} = [X^1, \cdots, X^T]$. To fully characterize the nonstationary encoding system, one has to first describe the conditional distribution $P(X^{1:T}|s^{1:T})$. The main challenge in estimating this distribution is the sample size: for a population of N neurons, the maximum number of different binary population vectors is $2^N$, though the actual number is usually smaller.

The conditional distribution $P(X^{1:T}|s^{1:T})$ aims to explain the dynamics of the population responses given a particular stimulus, and hence is used to explain the causal relationships, or effective connectivity, between the variables, which can greatly improve our understanding of the encoding mechanism. However, other methods with simplifying assumptions have also been used to explain some of the variability observed in population responses. One is to independently estimate the conditional distribution of each neuron, also referred to as the tuning function, as a direct yet simple way to characterize the encoding mechanism for the entire population of neurons. The tuning function expresses the average firing rate of neuron n in response to different values of s, and is written $P(X_n|s)$. Under the independence assumption, the joint conditional distribution can be expressed as the normalized product of the individual tuning functions, $P_{ind}(X|s) = \prod_{n=1}^N P(X_n|s)/Z$, where Z is the normalization factor, or partition function, estimated as $Z = \sum_X \prod_{n=1}^N P(X_n|s)$. The independence assumption can also be extended across time intervals, so that the conditional distribution becomes $P(X^{1:T}|s^{1:T}) = \prod_{t=1}^T P_{ind}(X^t|s^t)/\dot{Z}$, where $\dot{Z}$ is the product of all T normalization factors. This representation assumes independence between all neuronal constituents at different times, and thus ignores both the spatial and temporal correlations among them.
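A minimal sketch of the independent model under these definitions: estimate each neuron's tuning P(X_n = 1 | s) from binned binary spike trains and form the product distribution for a population vector. For binary responses the Bernoulli product is already normalized, so the partition function Z above reduces to one here; all names are illustrative.

```python
import numpy as np

def fit_tuning(binary_spikes, stim_ids, n_stimuli):
    """P(X_n = 1 | s): per-neuron firing probability for each stimulus.
    binary_spikes: (T, N) 0/1 array of binned population vectors.
    stim_ids: (T,) stimulus index presented in each bin."""
    return np.stack([binary_spikes[stim_ids == s].mean(axis=0)
                     for s in range(n_stimuli)])       # (n_stimuli, N)

def p_independent(x, s, tuning):
    """Product of per-neuron Bernoulli terms for population vector x given s."""
    p = tuning[s]
    return float(np.prod(np.where(x == 1, p, 1.0 - p)))

# Toy usage: 3 neurons, 2 stimuli, 1000 bins.
rng = np.random.default_rng(0)
spikes = (rng.random((1000, 3)) < 0.2).astype(int)
stims = rng.integers(0, 2, 1000)
tuning = fit_tuning(spikes, stims, 2)
print(p_independent(np.array([1, 0, 1]), s=0, tuning=tuning))
```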
Although the independent model is less sensitive to the sample size problem and is less affected by an increase in population size [76], more complex models are required to capture the dynamics of nonstationary population vectors. By including the statistical dependency among neuronal pairs, a pairwise model can be constructed to express the conditional distribution as $P_{pair}(X|s) = \prod_{n=1}^N p(X_n, X_{\pi(n)}|s)/Z$, where $\pi(n)$ is the set of neurons connected to neuron n and Z is the partition function. Note that a connection here does not necessarily indicate an anatomical connection between neurons, but rather a statistical dependence on each other's neural activity. To extend the pairwise model over population vectors at different time intervals, $t = 1, \cdots, T$, the set $\pi(n)$ can include any neuronal constituents at time intervals up to time t that affect the activity of neuron n. The conditional distribution for the pairwise model can then be expressed as $P_{pair}(X^{1:T}|s^{1:T}) = \prod_{t=1}^T P_{pair}(X^t, X_{\pi(n)}|s^t)/\dot{Z}$. This model not only explains the instantaneous dependencies within a time interval, but also the dependencies between neuronal constituents at different time intervals, thus including both the spatial and temporal correlations. In this chapter, we address the question of finding the set of parent neurons, $\pi(n)$, and building a spatiotemporal pairwise model that incorporates both spatial and temporal correlations. Out of the many approaches that can be used for fitting a conditional distribution, we use maximum entropy models, which provide the models with maximum uncertainty.

5.1.1. Principle of Maximum Entropy

Using the principle of maximum entropy, we construct a pairwise model that matches the expectations of certain moments of the neuronal constituents, $f_\alpha(X): X \to \mathbb{R}$, with the analogous empirical estimates obtained from the true conditional distribution of the observed population vectors, $P(X|s)$ [77]. The empirical expectation of these moments is estimated as

$$E_P[f_\alpha] = \frac{1}{T}\sum_{t=1}^{T} f_\alpha(X^t) \quad \text{(5-1)}$$

where T is the total number of observed population vectors. Now assume an estimate of the conditional distribution, $P_\theta(X|s)$, where $\theta$ are the unknown parameters of this distribution. The expectation of the moments under $P_\theta(X|s)$ is

$$E_\theta[f_\alpha] = \sum_X P(s) P_\theta(X|s) f_\alpha(X, s) \quad \text{(5-2)}$$

where $P(s)$ is the true distribution of the stimulus. The goal is to estimate $P_\theta(X|s)$ such that the model and empirical expectations match, $E_\theta[f_\alpha] = E_P[f_\alpha]$. This problem, however, is underdetermined, meaning there is an infinite number of distributions that conform to this condition. To choose among them, the principle of maximum entropy selects the one that maximizes Shannon's measure of entropy, $H(X) = -\sum_X P(X)\log P(X)$:

$$P^* = \max_{P \in \mathcal{P}} H(P) \quad \text{subject to} \quad E_\theta[f_\alpha] = E_P[f_\alpha] \quad \text{(5-3)}$$

where $\mathcal{P}$ is the set of all distributions that satisfy the condition. The solution to this now-determined problem represents the distribution with the maximum uncertainty. It has been shown that all maximum entropy distributions conform to the family of exponential functions, such as the Gaussian, Poisson, and binomial distributions, and are expressed as $P_\theta(X|s) = \exp(\sum_\alpha \theta_\alpha f_\alpha(X))/Z$, where $\theta_\alpha$ is the parameter associated with $f_\alpha(X)$.
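The empirical moment estimates of (5-1), for the first- and second-order features used below (means and pairwise co-activations of the binary population vectors), can be sketched as follows; this is purely illustrative.

```python
import numpy as np

def empirical_moments(X):
    """First moments E[X_i] and second moments E[X_i X_j] from binned
    binary population vectors X of shape (T, N), as in equation (5-1)."""
    X = np.asarray(X, float)
    first = X.mean(axis=0)            # E_P[X_i]
    second = (X.T @ X) / X.shape[0]   # E_P[X_i X_j], an (N, N) matrix
    return first, second

rng = np.random.default_rng(0)
X = (rng.random((500, 4)) < 0.3).astype(int)
m1, m2 = empirical_moments(X)
```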
5.1.2. Instantaneous Maximum Entropy Model

If we define the moments $f_\alpha(X)$ as the first-order moments, $X_i^t$, and second-order moments, $X_i^t X_j^t$, within the same time interval t, the maximum entropy model can be expressed as

$$P_\theta(X^t|s) = \frac{1}{Z}\exp\left(-\sum_{i=1}^{N} h_i X_i^t - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} J_{ij} X_i^t X_j^t\right) \quad \text{(5-4)}$$

where $h_i$ and $J_{ij}$ are the parameters associated with the first- and second-order moments, respectively, and Z is the partition function ensuring $\sum_X P_\theta(X^t|s) = 1$. This model is referred to as the instantaneous maximum entropy model, since $X^t$ is the pattern of activity only at time t. The conditional distribution of a sequence of population vectors under the instantaneous model is the product of the individual instantaneous models at different times, $P_\theta(X^{1:T}) = \prod_{t=1}^T P_\theta(X^t)/\dot{Z}$. This model was shown to be effective in representing almost 90 percent of the variance seen in the responses of populations of fewer than 20 neurons [78, 79]. Its main disadvantage, however, is that it ignores the temporal correlation that represents the effect of the history of spiking activity on the current population state. Extending the width of the neural patterns has been proposed as a method to incorporate spiking history; it destroys, however, the information in precise spike timings [80]. Thus, the instantaneous maximum entropy model does not consider interactions between neurons at different times, which are believed to play an important role in the encoding mechanism.

5.1.3. Spatiotemporal Maximum Entropy Model

Given the population vectors observed up to time t, $X^{1:t}$, one can estimate the probability of observing the population vector at time t + 1 as $P(X^{t+1}|X^{1:t})$. Assuming a $k^{th}$-order Markov model, this conditional probability reduces to $P(X^{t+1}|X^{t-k:t})$, and the joint density of a sequence of population vectors can be expressed as

$$P(X^{1:T}) = P(X^1)\prod_{t=2}^{T} P(X^t|X^{t-k:t-1}) \quad \text{(5-5)}$$

For a first-order Markov model, the transitional probabilities can be expressed as $P_{nm} = P(X^t = X^{(n)}|X^{t-1} = X^{(m)})$, where $X^{(n)}$ and $X^{(m)}$ are two sample population vectors. The transitional probabilities governing the transitions between population vectors in a neural recording are nonsymmetrical, $P_{nm} \neq P_{mn}$. To illustrate this, we constructed a toy example in which a population network generates one of three possible population vectors, $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$. Using a first-order Markov model, Figure 5.1a illustrates the transitional probabilities that govern the transitions between the three population vectors. Assume that the transitional probabilities are stationary. Given a particular sequence of population vectors, e.g., $\{X^{(1)}, X^{(2)}, X^{(2)}, X^{(3)}\}$, the probability of observing this sequence can be estimated as $P(X^{(1)}) P_{21} P_{22} P_{32}$, where $P(X^{(1)})$ is the a priori probability of observing $X^{(1)}$.

By inserting the pairwise maximum entropy model of (5-4) into the transitional probabilities of the first-order Markov model in (5-5), we can express these probabilities in terms of the neuronal constituents as

$$P_\theta(X^t|X^{t-1}) = \frac{1}{Z}\prod_{i=1}^{N}\prod_{j=1}^{N}\exp\left(J_{ij}^{t-1,t} X_j^{t-1} X_i^t\right) = \frac{1}{Z}\exp\left(\sum_{i=1}^{N}\sum_{j=1}^{N} J_{ij}^{t-1,t} X_i^t X_j^{t-1}\right) \quad \text{(5-6)}$$

where the $J_{ij}^{t-1,t}$ are the transitional couplings and Z is the partition function.
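The toy example above is easy to verify numerically. The sketch below builds a 3-state transition matrix with hypothetical values, checks its asymmetry, and evaluates the probability of the sequence {X(1), X(2), X(2), X(3)} as P(X(1)) P21 P22 P32.

```python
import numpy as np

# Hypothetical transition matrix: P[n, m] = P(X^t = X^(n) | X^(t-1) = X^(m)).
# Columns sum to 1; the matrix need not be symmetric (P_nm != P_mn).
P = np.array([[0.5, 0.2, 0.4],
              [0.3, 0.5, 0.6],
              [0.2, 0.3, 0.0]])
prior = np.array([0.4, 0.35, 0.25])   # a priori probabilities P(X^(n))
assert np.allclose(P.sum(axis=0), 1.0) and not np.allclose(P, P.T)

# Probability of observing the sequence {X(1), X(2), X(2), X(3)}:
seq = [0, 1, 1, 2]                    # zero-based state indices
prob = prior[seq[0]] * np.prod([P[b, a] for a, b in zip(seq, seq[1:])])
# = P(X(1)) * P21 * P22 * P32
```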
By including all elements of the pairwise maximum entropy model, the joint distribution for the first-order Markov model can be expressed as

$$P_\theta(X^{t-1}, X^t) = \frac{1}{Z}\exp\Bigg(-\sum_{i=1}^{N} h_i^{t-1} X_i^{t-1} - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} J_{ij}^{t-1,t-1} X_i^{t-1} X_j^{t-1} - \sum_{i=1}^{N}\sum_{j=1}^{N} J_{ij}^{t-1,t} X_i^{t-1} X_j^t - \sum_{i=1}^{N} h_i^t X_i^t - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} J_{ij}^{t,t} X_i^t X_j^t\Bigg) \quad \text{(5-7)}$$

where $h_i^t$ and $h_i^{t-1}$ are the individual terms, and $J_{ij}^{t,t}$, $J_{ij}^{t-1,t}$, and $J_{ij}^{t-1,t-1}$ are the coupling terms. This spatiotemporal model, which we refer to as the spatiotemporal maximum entropy model, is more general than the representation introduced by Marre et al. to account for history terms [81], but at the expense of increasing the number of parameters to be trained.

Figure 5.1 (a) Representation of hypothetical instantaneous patterns, $X^{(i)}$ (for i = 1, 2, 3), generated by a neural population, with transitional probabilities $P_{nm}$ between them. (b) The first-order Markov representation of the state transitions in (a). Each state can transfer either to itself or to one of the two other states, generating a total of 9 transient states, $Y^{(k)}$ (for k = 1, ..., 9). (c) Another representation of the transient states, $Y^{(k)}$, with $Y^{(8)}$ being unobserved.

We simplify (5-7) as $p_\theta(Y^t) = Z^{-1}\exp\left(-\sum_i \dot{h}_i y_i^t - \sum_{i,j}\dot{J}_{ij} y_i^t y_j^t\right)$, where $Y^t = [X^t, X^{t-1}]^T$ includes the $X^t$ and $X^{t-1}$ population vectors, $\dot{h}_i = [h_i^t, h_i^{t-1}]^T$ holds the parameters associated with the first-order moments, $\dot{J}_{ij} = [J_{ij}^{t,t}, J_{ij}^{t-1,t}; J_{ij}^{t,t-1}, J_{ij}^{t-1,t-1}]$ holds the parameters associated with the second-order moments, and T denotes transpose. This representation can easily be extended to higher-order Markov models as well. Figures 5.2a and 5.2b show the model parameters $J_{ij}^{t-1,t-1}$, $J_{ij}^{t,t}$, and $J_{ij}^{t-1,t}$ as blue, red, and black connecting lines, respectively.

To better understand the necessity of having both $h_i^{t-1}$ and $h_i^t$, or $J_{ij}^{t-1,t-1}$ and $J_{ij}^{t,t}$, Figure 5.1b shows all possible transitions between the three hypothetical population vectors, $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$, as joint patterns $Y^{(k)}$ for $k = 1, \cdots, 9$. If we assume that all $Y^{(k)}$ are equally probable, then $h_i^{t-1}$ is equivalent to $h_i^t$: in this configuration, the probability of transiting from the current pattern to any of $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$ is 1/3, making the probability of observing each $Y^{(k)}$ equal to 1/9. Now consider the different configuration in Figure 5.1c, in which $X^{(3)}$ can only transit to $X^{(1)}$ or $X^{(3)}$, making the probability of observing $Y^{(8)}$ equal to 0. In this case, $h_i^{t-1}$ differs from $h_i^t$, since it is estimated using three $X^{(1)}$, three $X^{(2)}$, and two $X^{(3)}$, while $h_i^t$ is estimated using three $X^{(1)}$, two $X^{(2)}$, and three $X^{(3)}$. A similar explanation holds for the difference between $J_{ij}^{t-1,t-1}$ and $J_{ij}^{t,t}$ in the configuration of Figure 5.1c.

Figure 5.2 First-order Markov representation of the pairwise maximum entropy model, shown over the nodes $X_1^{t-1}, \cdots, X_N^{t-1}$ and $X_1^t, \cdots, X_N^t$.
The model parameters $J_{ij}^{t-1,t-1}$, $J_{ij}^{t,t}$, and $J_{ij}^{t-1,t}$ are shown as blue, red, and black connecting lines, respectively.

The first-order Markov model can be simply extended to higher-order Markov representations. For a second-order Markov representation of the maximum entropy model, we define three population vectors jointly as one pattern, $Z^t = [X^{t-1}, X^t, X^{t+1}]$. The maximum entropy model can then be expressed as $P_\theta(Z^t) = \frac{1}{Z}\exp\left(-\sum_i \ddot{h}_i Z_i^t - \frac{1}{2}\sum_i\sum_j \ddot{J}_{ij} Z_i^t Z_j^t\right)$, in which $\ddot{h}_i$ and $\ddot{J}_{ij}$ are its parameters, defined as

$$\ddot{h}_i = \begin{bmatrix} h_i^{t-1} \\ h_i^t \\ h_i^{t+1} \end{bmatrix}, \qquad \ddot{J}_{ij} = \begin{bmatrix} J_{ij}^{t-1,t-1} & J_{ij}^{t-1,t} & J_{ij}^{t-1,t+1} \\ J_{ij}^{t-1,t} & J_{ij}^{t,t} & J_{ij}^{t,t+1} \\ J_{ij}^{t-1,t+1} & J_{ij}^{t,t+1} & J_{ij}^{t+1,t+1} \end{bmatrix} \quad \text{(5-8)}$$

5.1.4. Training Maximum Entropy Models

Two iterative methods, gradient descent [82] and the iterative scaling algorithm [79], are commonly used to find the unknown parameters of the spatiotemporal maximum entropy model. Here we discuss how these methods are derived and introduce a quadratic estimation that learns faster than the gradient descent and iterative scaling algorithms. The objective is to estimate the model parameters such that the model estimate of the probability of the population vectors provides the maximum entropy:

$$\theta^* = \min_\theta L(P_\theta|\theta) = \min_\theta \sum_t P_\theta(X^t)\log\left(P(X^t)\right) \quad \text{(5-9)}$$

where $L(P_\theta|\theta)$ is the likelihood function. The model parameters are learned with an iterative approach, in which the update for parameter k, $\delta_k$, is estimated as follows. First, we estimate the variation in the likelihood, $\Delta L$, given the parameter updates:

$$\Delta L = L(P_\theta|\theta + \delta) - L(P_\theta|\theta) = \sum_t \sum_k P_\theta(X^t)\,\delta_k f_k(X^t) - \log\left(\frac{\dot{Z}}{Z}\right) \quad \text{(5-10)}$$

where $\dot{Z} = \sum_t \exp\left(\sum_k(\theta_k + \delta_k) f_k(X^t)\right)$. By applying the inequality $-\log(x) \geq 1 - x$, $\forall x > 0$, we find a lower bound for $\Delta L$:

$$\Delta L \geq \sum_t \sum_k P_\theta(X^t)\,\delta_k f_k(X^t) + 1 - \sum_t P(X^t)\exp\left(\sum_k \delta_k f_k(X^t)\right) \quad \text{(5-11)}$$

By applying Jensen's inequality for a probability distribution $p(x)$, $e^{\sum_X p(x)q(x)} \leq \sum_X p(x)e^{q(x)}$, defining $F(X) = \sum_k |f_k(X)|$, and decomposing $f_k(X) = |f_k(X)|\, s_k(X)$, where $s_k(X) = \mathrm{sign}(f_k(X))$, the lower bound can be written as

$$\Delta L \geq \sum_t \sum_k P_\theta(X^t)\,\delta_k f_k(X^t) + 1 - \sum_t P(X^t)\sum_k \frac{|f_k(X^t)|}{F(X^t)}\exp\left(\delta_k s_k(X^t) F(X^t)\right) \quad \text{(5-12)}$$

We can find the optimal parameter updates by solving $\partial \Delta L/\partial \delta_k = 0$, which yields

$$\sum_t P_\theta(X^t) f_k(X^t) = \sum_t P(X^t) f_k(X^t)\exp\left(\delta_k s_k(X^t) F(X^t)\right) \quad \text{(5-13)}$$

To obtain the iterative scaling algorithm for learning the unknown parameters, we can approximate the right-hand side of (5-13) as $\sum_t P(X^t) f_k(X^t) \times \exp\left(\delta_k \sum_t s_k(X^t) F(X^t)\right)$; the updates are then estimated as

$$\delta_k = \frac{1}{\sum_t s_k(X^t) F(X^t)}\log\left(\frac{\sum_t P_\theta(X^t) f_k(X^t)}{\sum_t P(X^t) f_k(X^t)}\right) \quad \text{(5-14)}$$

To obtain the gradient descent algorithm, we can approximate the exponential on the right-hand side of (5-13) as $\exp(x) = 1 + x$; the updates are then estimated as

$$\delta_k = \frac{\sum_t P_\theta(X^t) f_k(X^t) - \sum_t P(X^t) f_k(X^t)}{\sum_t P(X^t)\,|f_k(X^t)|\, F(X^t)} \quad \text{(5-15)}$$

Here we instead approximate the exponential on the right-hand side of (5-13) as $\exp(x) = 1 + x + 0.5x^2$, and thus update the unknown parameters as

$$A = 2\,\frac{\sum_t P_\theta(X^t)\,|f_k(X^t)|\, F(X^t)}{\sum_t P(X^t) f_k(X^t)\, F^2(X^t)}, \qquad \delta_k = -0.5\left(A \pm \sqrt{A^2 - 8\,\frac{\sum_t P(X^t) f_k(X^t) - \sum_t P_\theta(X^t) f_k(X^t)}{\sum_t P(X^t) f_k(X^t)\, F^2(X^t)}}\right) \quad \text{(5-16)}$$

which we refer to as the quadratic iterative learning algorithm.
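As an illustration of training by moment matching, the sketch below fits the h_i and J_ij terms of a small instantaneous model by enumerating all 2^N binary patterns, which is feasible only for small N. For simplicity it uses plain gradient-style updates in place of the exact rules (5-14)-(5-16); it is a toy stand-in, not the algorithm used in this work.

```python
import numpy as np
from itertools import product

def fit_maxent(X, n_iter=2000, lr=0.1):
    """Fit h_i and J_ij of an instantaneous maximum entropy model (5-4)
    by matching first and second moments, enumerating all 2^N patterns."""
    T, N = X.shape
    patterns = np.array(list(product([0, 1], repeat=N)), float)  # (2^N, N)
    emp1 = X.mean(axis=0)                  # empirical E[X_i]
    emp2 = (X.T @ X) / T                   # empirical E[X_i X_j]
    h, J = np.zeros(N), np.zeros((N, N))
    for _ in range(n_iter):
        logp = -(patterns @ h) - 0.5 * np.einsum('ki,ij,kj->k',
                                                 patterns, J, patterns)
        p = np.exp(logp - logp.max()); p /= p.sum()      # model distribution
        mod1 = p @ patterns                              # model E[X_i]
        mod2 = (patterns * p[:, None]).T @ patterns      # model E[X_i X_j]
        h += lr * (mod1 - emp1)            # sign follows the convention of (5-4)
        J += lr * (mod2 - emp2)            # converges when moments match
    return h, J

rng = np.random.default_rng(0)
h, J = fit_maxent((rng.random((400, 4)) < 0.3).astype(float))
```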
5.2. V1 Population Encoding in Response to Stimulus Grating

We evaluated the performance of the different encoding models in predicting the activity patterns of cat V1 neurons in response to drifting grating stimuli, as illustrated in Figure 5.3a [83]. In a drifting grating stimulus, a set of dark bars, oriented and drifting in one of 16 possible directions, from 0 to $15\pi/8$ in steps of $\pi/8$, was displayed to an anesthetized cat. A stimulus is randomly selected and presented for 2.5 seconds, with less than 0.5 seconds of blank screen before presenting the next stimulus. Figure 5.3b shows 9 seconds of an hour-long session, in which three of the 16 stimuli were randomly presented to the subject. During the blank screen, the spiking activity is significantly reduced compared to when one of the stimuli is presented.

Figure 5.3 (a) Experimental setup: recording from the V1 area of an anesthetized cat in response to drifting grating stimuli with tetrode microwires. The bars in the image presented to the subject move in the forward direction. (b) The sequence of stimuli for an interval of 9 seconds, in which randomly selected stimuli are presented to the subject. Each stimulus is presented for approximately 2.5 seconds, with less than 0.5 seconds of blank screen afterwards. (c) & (d) Single unit activity: the peristimulus time histograms (PSTH) of neurons 18 and 3 for two different stimuli, respectively [84]. Each dotted row shows the spiking activity of the neuron in one trial of the same stimulus. The red plot is the average firing rate of that neuron over all trials, in average spikes per second. The gray bar indicates when the stimulus was presented. (e) Multiunit activity: spike raster plots of the neural population over 5 seconds. Each row represents the spike train of one neuron, and each bar indicates a spike event.

5.2.1. Single Unit Analysis

Using tetrode microwires, a population of 21 neurons was obtained through spike sorting. In a tetrode array, 4 wires are bundled together; the spatial distribution of the signal propagation can thus be used to improve the performance of spike sorting. The spike trains were sampled in 5 msec time intervals, and intervals that contained one or more spikes were set to '1' and to '0' for no spikes. Depending on the encoding characteristics of their single unit activity, these V1 neurons can be categorized as either movement-selective or orientation-selective. Movement-selective neurons, as shown in Figure 5.3c, are sensitive to the movement direction of the drifting bars. Orientation-selective neurons, on the other hand, respond to the intensity level of the bars, as shown in Figure 5.3d. Unlike the movement-selective neurons, the activity of orientation-selective neurons is periodic and time locked to the frequency of the drifting bars.
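The binarization step described above (5 msec bins, '1' for one or more spikes) can be sketched as follows; spike times are assumed to be given in seconds, and the function name is hypothetical.

```python
import numpy as np

def binarize_spike_trains(spike_times, duration, bin_ms=5.0):
    """Convert a list of per-neuron spike-time arrays (seconds) into a
    (T, N) binary matrix: 1 if the neuron fired in the bin, else 0."""
    bin_s = bin_ms / 1000.0
    n_bins = int(np.ceil(duration / bin_s))
    X = np.zeros((n_bins, len(spike_times)), dtype=int)
    for n, times in enumerate(spike_times):
        idx = (np.asarray(times) / bin_s).astype(int)
        X[np.clip(idx, 0, n_bins - 1), n] = 1   # >=1 spike in a bin -> 1
    return X

# Toy usage: two neurons over one second.
X = binarize_spike_trains([[0.010, 0.012, 0.400], [0.205]], duration=1.0)
```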
Figure 5.4 illustrates the encoding of movement-selective and orientation-selective neurons in V1 using a circular representation of all 16 stimuli. In Figure 5.4a, the preferred movement direction of a movement-selective neuron is $\bar{\varphi} = 7\pi/8$. This neuron fires continually within the 2.5-second interval, with a higher average firing rate the closer the stimulus direction $\varphi$ is to $\bar{\varphi}$. The tuning function in this case resembles the cosine tuning to preferred velocity found in motor encoding, such that the average firing rate is directly proportional to $\cos(\varphi - \bar{\varphi})$. As a result, this neuron is responsive to movement directions of $3\pi/4$ and $\pi$. The further $\varphi$ is from $\bar{\varphi}$, the lower the total average firing rate, indicated by the size of the stimulus icon in the center.

The orientation-selective neuron shown in Figure 5.4b, on the other hand, is sensitive to vertically oriented bars, so it is equally responsive to bars drifting with $\varphi = 3\pi/2$ or $\pi/2$. This neuron generates similar firing rates for stimuli positioned across from each other, i.e., differing by $\pi$, and thus fires at higher average rates when $\varphi$ is close to the preferred orientation $\bar{\bar{\varphi}}$ or to $\bar{\bar{\varphi}} - \pi$. The tuning function in this case can be expressed as $\cos(2(\varphi - \bar{\bar{\varphi}}))$. Of the 21 neurons, 14 were orientation-selective and 7 were movement-selective. The important question here is whether the multiunit activity of the population of 21 neurons, as shown in Figure 5.3e, can be predicted given only the tuning functions of the individual neurons, or whether it is essential to also include the spatial and temporal correlations among the neuronal constituents to improve the predictive power.

5.2.2. Multiunit Analysis

Although the encoding mechanism of neuron n can be fully characterized by its tuning function, $P(X_n|s)$, its role in the population network is not thereby explained. Here, we evaluate and compare the performance of the independent, instantaneous maximum entropy, and spatiotemporal maximum entropy models in explaining the network encoding mechanism. We examined the performance of these encoding models from two perspectives. First, we evaluated their predictive power in predicting the probability of observing a particular sequence of up to three consecutive population vectors. Second, we evaluated the separability of the network representations for different stimuli, and how this separability relates to the single unit tuning functions.

Figure 5.5 shows the predictive power of the three encoding models in predicting the probability of observing a sequence of population vectors, $\{X^{(i)}, X^{(j)}\}$. The joint distribution, $P(X^{(i)}, X^{(j)})$, can be expressed for the independent model as $P_{ind}(X^{(i)})P_{ind}(X^{(j)})$, for the instantaneous maximum entropy model as $P_{MaxEnt}(X^{(i)})P_{MaxEnt}(X^{(j)})$, and for the spatiotemporal maximum entropy model as $P_{MaxEnt}([X^{(i)}, X^{(j)}])$. The quantile-quantile plot in Figure 5.5a shows the model estimates versus the true empirical probabilities for all observed combinations of $\{X^{(i)}, X^{(j)}\}$, each shown as a dot. The encoding model whose distribution of dots lies closest to the equality line is the better one. In this case, the spatiotemporal model was better than the instantaneous model, which in turn was better than the independent model. Figure 5.5b shows the sum of all estimated probabilities for population vectors with the same number of '1's, where the plot closest to the empirical one indicates the best model fit.

Figure 5.4 The circular representation shows the 16 different stimuli in the center, with the arrows indicating the movement direction of the drifting bars. The PSTHs of a neuron in response to the 16 different stimuli are shown on the outside (firing rates in spikes/sec). The size of the icon representing each stimulus is consistent with the overall average firing rate of the neuron for that stimulus, with larger sizes indicating higher firing rates. (a) A movement-selective neuron; (b) an orientation-selective neuron.
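The two tuning shapes can be written down directly. A brief sketch follows, using the preferred directions mentioned in the text; the baseline and gain values are hypothetical and serve only to keep the rates nonnegative.

```python
import numpy as np

def movement_tuning(phi, phi_pref, base=5.0, gain=5.0):
    """Cosine tuning of a movement-selective neuron: rate ~ cos(phi - phi_pref)."""
    return np.maximum(base + gain * np.cos(phi - phi_pref), 0.0)

def orientation_tuning(phi, phi_pref, base=5.0, gain=5.0):
    """Orientation tuning: rate ~ cos(2*(phi - phi_pref)), so directions phi
    and phi - pi evoke identical rates."""
    return np.maximum(base + gain * np.cos(2.0 * (phi - phi_pref)), 0.0)

phis = np.arange(16) * np.pi / 8            # the 16 drift directions
r_move = movement_tuning(phis, phi_pref=7 * np.pi / 8)
r_orient = orientation_tuning(phis, phi_pref=np.pi / 2)
assert np.allclose(orientation_tuning(0.3, np.pi / 2),
                   orientation_tuning(0.3 + np.pi, np.pi / 2))
```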
Figure 5.5 Predicting spatiotemporal patterns of activity. (a) Quantile-quantile plot: the estimated probability of observing population vectors using the different encoding models versus the true probability from the empirical distribution. Each dot represents a combination of two population vectors at times t and t + 1. The dotted line represents equality of the estimated and true probabilities. (b) Synchronous pattern estimation: the sum of the estimated probabilities of all population vectors with the same firing count is plotted versus the number of '1's in the population vector (on a logarithmic probability scale, for the true distribution and the independent, instantaneous, and first-order maximum entropy models). The best model is the one closest to the solid black line, which in this case is the first-order maximum entropy model.

Figure 5.6 shows the predictive power of the four different encoding models in estimating the probability of observing a sequence of three population vectors, $\{X^{(i)}, X^{(j)}, X^{(k)}\}$. The joint probability distribution, $P(X^{(i)}, X^{(j)}, X^{(k)})$, can be expressed for the independent model as $P_{ind}(X^{(i)})P_{ind}(X^{(j)})P_{ind}(X^{(k)})$, for the instantaneous maximum entropy model as $P_{MaxEnt}(X^{(i)})P_{MaxEnt}(X^{(j)})P_{MaxEnt}(X^{(k)})$, for the first-order spatiotemporal maximum entropy model as $P_{MaxEnt}([X^{(i)}, X^{(j)}])P_{MaxEnt}([X^{(j)}, X^{(k)}])/P_{MaxEnt}(X^{(j)})$, and for the second-order spatiotemporal maximum entropy model as $P_{MaxEnt}([X^{(i)}, X^{(j)}, X^{(k)}])$. Figure 5.6a shows the quantile-quantile plot, in which the second-order spatiotemporal maximum entropy model was better than the first-order spatiotemporal, instantaneous, and independent models. Figure 5.6b shows the sum of all estimated probabilities for population vectors with the same number of '1's, where the plot closest to the empirical one indicates the best model fit, in this case the second-order spatiotemporal model.

Figure 5.6 Predicting spatiotemporal patterns of activity. (a) Quantile-quantile plot: the estimated probability of observing population vectors using the different encoding models versus the true probability from the empirical distribution. Each dot represents a combination of three population vectors at times t, t + 1, and t + 2. The dotted line represents equality of the estimated and true probabilities. (b) Synchronous pattern estimation: the sum of the estimated probabilities of all population vectors with the same firing count is plotted versus the number of '1's in the population vector, for the true distribution and the independent, instantaneous, first-order, and second-order maximum entropy models. The best model is the one closest to the solid black line, which in this case is the second-order maximum entropy model.
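The first-order factorization of a triple above is the standard Markov chain rule. The sketch below composes it from hypothetical pair and singleton probabilities purely to show the bookkeeping.

```python
def p_triple_first_order(p_pair, p_single, i, j, k):
    """First-order spatiotemporal estimate of P(X(i), X(j), X(k)):
    P([X(i),X(j)]) * P([X(j),X(k)]) / P(X(j)), per the text."""
    return p_pair[(i, j)] * p_pair[(j, k)] / p_single[j]

# Hypothetical fitted probabilities for three patterns (indices 0, 1, 2).
p_single = {0: 0.5, 1: 0.3, 2: 0.2}
p_pair = {(0, 1): 0.15, (1, 1): 0.09, (1, 2): 0.06}
print(p_triple_first_order(p_pair, p_single, 0, 1, 2))  # 0.15*0.06/0.3 = 0.03
```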
Another evaluation method is to compare the separability of the network representations of the different stimuli. A network representation is obtained from the neural responses in one trial of stimulus $\varphi$; representations are expected to be similar across trials of the same stimulus $\varphi$ and dissimilar to the other network representations. We also expect the network representation obtained from trials of stimulus $\varphi$ to be somewhat close to the network representation obtained from trials of stimulus $\varphi \pm \pi$, referred to as the across network, and also somewhat close to the network representations from trials of stimulus $\varphi \pm \pi/8$, referred to as the adjacent networks, and far from the rest of the network representations, referred to as the other networks. The similarity to the across and adjacent networks is due to the existence of orientation-selective and movement-selective neurons in the V1 subpopulation, respectively.

The network representation for the independent encoding model is obtained directly from the spike trains using PCA. For that, a response matrix was computed from the binary spike trains by summing the number of spikes fired within 50 msec time intervals, and was arranged as a vector. Using the three most significant principal components, the vectors for approximately 40 trials of each of the 16 stimuli were projected into a 3D space, as shown in Figure 5.7a. The level of separability among the clusters of network representations was not equal across the 16 stimuli. Figure 5.7b shows a circular representation of the Euclidean distances computed between the network representations of the within, across, adjacent, and other stimuli using boxplots. The within distances are typically smaller than the across and adjacent distances, and they are in general smaller than the other distances. In this figure, '**' and '***' indicate that one set of distances is statistically smaller than another at significance levels of P < 0.001 and P < 0.0001, respectively. The overall dissimilarity between the different network representations is shown in Figure 5.7c, in which the within and across distances are statistically smaller than the other distances, but no conclusive statement can be made for adjacent vs. other, or for within vs. across or adjacent, under this independent encoding model.

Figure 5.7 Independent encoding model. (a) 3D projection of the individual network representations of trials from the 16 stimuli using PCA. (b) Circular representation of the Euclidean distance between the network representations of one particular stimulus and the within, across, adjacent, and other network representations, using boxplots. In the boxplots, the central green line is the median (second quartile), the edges of the box are the first and third quartiles, the whiskers extend to the most extreme samples not considered outliers, and outliers are shown individually as green dots. (c) The overall distances for all network representations of the different stimuli.

Figure 5.8 Spatiotemporal maximum entropy encoding model. (a) Circular representation of the Euclidean distance between the network representations of one particular stimulus and the within, across, adjacent, and other network representations, using boxplots. (b) The overall distances, or dissimilarity measure, for all network representations of the different stimuli.
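A compact sketch of the within/across/adjacent/other grouping, assuming each trial's network representation is a vector and 16 equally spaced stimulus directions; the grouping logic follows the definitions above, with hypothetical data.

```python
import numpy as np

def group_distances(reps, stims, n_stim=16):
    """Euclidean distances between trial representations, grouped by the
    stimulus relation: within (same), across (+/- pi, i.e. 8 steps),
    adjacent (+/- pi/8, i.e. 1 step), other."""
    groups = {"within": [], "across": [], "adjacent": [], "other": []}
    for a in range(len(reps)):
        for b in range(a + 1, len(reps)):
            d = np.linalg.norm(reps[a] - reps[b])
            step = abs(int(stims[a]) - int(stims[b])) % n_stim
            step = min(step, n_stim - step)          # circular stimulus distance
            key = {0: "within", 8: "across", 1: "adjacent"}.get(step, "other")
            groups[key].append(d)
    return groups

rng = np.random.default_rng(0)
reps = rng.standard_normal((64, 3))      # e.g. 4 trials x 16 stimuli in 3D PCA
stims = np.repeat(np.arange(16), 4)
dists = group_distances(reps, stims)
```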
Figure 5.8 Spatiotemporal maximum entropy encoding model. (a) Circular representation of the Euclidean distance between the network representations of one particular stimulus and the within, across, adjacent, and other network representations using boxplots. (b) The overall distances, or dissimilarity measure, for all network representations of different stimuli.

[Figure 5.8 (continued), panel (b): dissimilarity (0 to 0.3) for the within, across, adjacent, and other groups.]

Figure 5.8a demonstrates a circular representation of the Euclidean distance computed between the network representations of the spatiotemporal encoding model for the within, across, adjacent, and other stimuli using boxplots. Within distances are always less than the across and adjacent distances, and they are also less than the other distances. The overall dissimilarity between different network representations is demonstrated in Figure 5.8b, in which the within distances are statistically less than the across, adjacent, and other distances; the across distances are statistically less than the adjacent and other distances; and the adjacent distances are statistically less than the other distances. Compared to the independent encoding model, the spatiotemporal maximum entropy model therefore provides a much more conclusive encoding model, with network representations that are less affected by noise in the neural response.

To show the importance of accounting for temporal correlations in addition to spatial correlations in predicting the population vectors, we counted the number of patterns that can be observed in an arbitrary network of 21 neurons under different assumptions on their collective activity. In the independent model, with no temporal or spatial correlations among the neuronal constituents, neurons in an infinitely long session could produce 2^21 population vectors for the 21 sorted neurons, while this number grows exponentially, to 2^(2×21), for a sequence of two population vectors, as demonstrated by the dotted red line in Figure 5.9a. In this figure, sequences of one, two, three, and four population vectors are represented as patterns with 0, 1, 2, and 3 Markov lags, respectively. The number of distinct population vectors observed over an hour-long session was only 748, far fewer than the 2^21 patterns allowed by the independent model, suggesting that neurons do not fire independently. For the instantaneous maximum entropy model, the predicted number of sequences of two population vectors is 748^2, demonstrated as the dotted green line, while the actual count is fewer than 7000 patterns. This inherently suggests using a spatiotemporal maximum entropy model that includes both temporal and spatial correlations among the neuronal constituents. An important factor of this model, however, is the order of the Markov model. The solid blue line demonstrates the actual number of sequences of three population vectors. The numbers of patterns estimated by the first-order and second-order spatiotemporal models are demonstrated by the dotted cyan and magenta lines, respectively. The estimate of the first-order spatiotemporal model is somewhat off the actual count, but the second-order spatiotemporal model is much closer. This suggests that a second-order spatiotemporal maximum entropy model is sufficient to model the patterns generated by the V1 cortical neurons, and that higher-order Markov models would be redundant. It also suggests that the effective temporal support of the history of the population response is around 15 msec.
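The pattern counting behind Figure 5.9a can be sketched in a few lines of Python; the layout of the binary response matrix X (neurons × time bins) is an assumption made for illustration.

    import numpy as np

    def count_sequences(X, markov_lags):
        # X: binary response matrix (neurons x time bins), assumed given.
        # A pattern with L Markov lags is a sequence of L+1 population vectors.
        seen = set()
        T = X.shape[1]
        for t in range(T - markov_lags):
            window = X[:, t:t + markov_lags + 1].astype(np.uint8)
            seen.add(window.tobytes())      # hashable key for the whole sequence
        return len(seen)

    # e.g., count_sequences(X, 0) would return the number of distinct single
    # population vectors (748 in the hour-long session described above),
    # versus the 2**21 combinatorially possible vectors.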
Figure 5.9b demonstrates the average Kullback-Leibler (KL) divergence between the actual probability distribution of the observed population vectors in response to the 16 stimuli and the distributions estimated by the different encoding models. Error bars denote the standard deviation. The KL divergence between the true distribution, P(X), and its estimate, Q(X), is computed as

D(P \,\|\, Q) = \sum_{X} P(X) \log \frac{P(X)}{Q(X)}.

The KL divergence for the spatiotemporal maximum entropy model is significantly less than that of the instantaneous maximum entropy model (P < 0.01), and less than that of the independent model (P < 0.001), whereas the instantaneous maximum entropy model does not provide a statistically significant improvement over the independent model.
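A minimal sketch of this divergence computation, assuming the true and estimated distributions have been aligned over the same set of observed patterns; the small floor on Q guards against zero estimated probabilities and is our assumption, not part of the thesis.

    import numpy as np

    def kl_divergence(p_true, q_est, floor=1e-12):
        # D(P||Q) = sum_X P(X) log(P(X)/Q(X)); terms with P(X) = 0 contribute 0
        p = np.asarray(p_true, dtype=float)
        q = np.maximum(np.asarray(q_est, dtype=float), floor)
        m = p > 0
        return float(np.sum(p[m] * np.log(p[m] / q[m])))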
[Figure 5.9a: number of patterns (log scale, 10^3 to 10^9) versus Markov time lags (0 to 3).]

Figure 5.9 (a) The actual counted number of patterns in the collective activity of the neural population (solid blue line), and the counts estimated by the independent, instantaneous maximum entropy, and first- and second-order spatiotemporal maximum entropy models, demonstrated as the dotted red, green, cyan, and magenta lines, respectively. (b) Average KL divergence between the true and estimated distributions over the 16 stimuli (error bars are the standard deviation).

[Figure 5.9 (continued), panel (b): average KL divergence for the spatiotemporal maximum entropy, instantaneous maximum entropy, and independent models.]

5.3. Discussion

We evaluated the performance of the spatiotemporal maximum entropy model in predicting the activity patterns of cat V1 neurons in response to drifting grating stimuli. A population of 21 neurons was recorded using microwires with tetrode arrays. We compared the predictive power of this model with the instantaneous pairwise maximum entropy model, as well as with the independent model. We demonstrated that the extended model converges faster than that of [81], and also reduces the Kullback-Leibler divergence between the estimated and true distributions by 20 percent. Taken together, these results suggest the importance of accounting for temporal correlations in predicting spatiotemporal patterns.

6 Hypergraph Models with Higher Order Interactions Better Explain the Network Dynamics

Cortical neurons have long been hypothesized to encode information about external stimuli using one of two encoding modalities: temporal coding or rate coding [40]. In temporal coding, a neuron fires a spike that is almost always time-locked to the external stimulus. Rate coding, on the other hand, presumes that a neuron responds to a stimulus by modulating the number of spikes it fires above or below a baseline level during a fixed time interval. In sensory systems such as the visual and auditory cortices, temporal codes are more pronounced [85-88], while in motor systems, rate codes seem to be predominant [18, 89, 90]. The mapping between an external stimulus s and a neural response r (whether it expresses a temporal code or a rate code) is typically described by the conditional density P(r|s). This density completely expresses the "receptive field" of a sensory neuron, or the "tuning function" of a motor neuron [91, 92]. It has been the standard metric of stimulus-driven neural response in neurophysiological experiments for many decades. Recently, correlation among neurons has emerged as another cortical response property that may play a role in encoding stimuli [72, 93-95]. This property has naturally emerged as a result of the ability to simultaneously record the activity of an ensemble of neurons in selected brain targets while subjects carry out specific functions [3-6]. Not surprisingly, many important aspects of behavior-neurophysiology relationships stem from the collective behavior of many neuronal elements [2, 96]. The major gain over sequentially recording single-neuron activity is the ability to access the n-th-order conditional density of the observed population, P(r_1, r_2, ..., r_n | s), thereby providing a closer examination of the response space of the underlying biological system that these neurons represent. Clearly, calculation of the joint distribution requires knowledge of all the possible network states, which is practically impossible. In many cases, the variables r_1, r_2, ..., r_n exhibit variable degrees of statistical dependency [73, 97]. Generally speaking, this statistical dependency may result from a significant overlap in the receptive fields of these neurons, often referred to as response to a common input, from some type of anatomical connectivity between the neurons, or from both [98-100]. To date, it is unclear whether this dependency is part of an efficient neural encoding mechanism that attempts to synergistically and cooperatively represent the stimulus through a network-wide interaction between the neuronal elements, or whether it hinders this encoding mechanism by redundantly expressing independent neuronal responses to that stimulus.

Numerous information-theoretic measures have been proposed to address this synergy-redundancy question [94, 101-106]. In the previous chapter, a maximum entropy model relying on second-order interactions (pairwise correlations) between neurons was shown to result in a strong cooperative network behavior. The principle of maximum entropy was argued to yield a data distribution with the maximum uncertainty among all possible distributions that could explain the observed data [107]. More recently, a minimum mutual information principle was argued to yield a tighter lower bound on the mutual information between the stimulus and the response of any system that explains the data [108]. However, neither approach has directly addressed the following fundamental questions: how do higher-order interactions observed in the spatiotemporal spiking patterns of cortical neurons contribute to actual information transfer through the system, and how can the joint distribution be optimally expressed using knowledge of any intrinsic structure in the underlying network connectivity?

In this chapter, we propose an information-theoretic metric that helps assess whether a connection-induced dependency in neural firing is part of a synergistic population code. Specifically, we propose to optimally express the conditional neural response in terms of a consistent hypergraph. We argue that the stimulus-specific network model associated with each hypergraph attempts to discover how stimulus information governs the way neurons signal one another. We demonstrate that this hypergraph representation yields a minimum entropy distance between the conditional response entropy of a single trial and the "true" response entropy estimated from a concatenation of many trials. We also show that this approach falls under the class of maximum entropy models that seek the probability distribution maximizing the entropy with respect to the known constraints.
The chapter is organized as follows: first, we discuss the theory behind hypergraph models. Then we describe the methods used to generate the neural data, followed by the obtained results. A discussion then relates the approach to previous work and suggests directions for future work.

6.1. Hypergraph Maximum Entropy Model

Let us assume we have a population of n neurons encoding a stimulus S that can take one of P possible values {s_1, s_2, ..., s_P}. These can be a set of discrete values of a continuous, time-varying deterministic function, or discrete-valued parameters of a latent stochastic process. For notational convenience, we denote herein by P_s(.) quantities conditioned on the presence of S, and make the distinction clear wherever needed. The encoding dictionary is characterized by the conditional probability of the neural response given s, i.e., the likelihood P_s(r), where r = [r_1, r_2, ..., r_n] is a matrix of population responses. Each neuron's response is expressed in terms of the spike count in a fixed-length time bin that is small enough to contain a binary '1' (spike) or a binary '0' (no spike). In the presence of s_p, we assume that the variables r_1, r_2, ..., r_n exhibit some form of stimulus-specific statistical dependency that can be characterized by the k-th-order marginal, P_s([r_1, r_2, ..., r_k]), where k is the number of neurons involved in encoding S = s_p and can vary with p. We seek to express the encoding dictionary as a product of all the possible k-th-order conditional marginals, P_s^{(k)}(r). Given that a population of n neurons has 2^n states, expressing the encoding dictionary by searching for the optimum product of marginals is a computationally prohibitive task. Therefore, it is highly desirable to reduce the search space by finding an optimum set of interactions between the variables r_1, r_2, ..., r_n for each value of the stimulus S.

Shannon's mutual information can be used to compute how much information about S is encoded in r. However, it depends on obtaining a good estimate of the noise entropy [52], denoted H_s(r). When correlation between the variables is present, multi-information [109], also known as excess entropy [110], can be used. It is the relative entropy of the joint density of the variables with respect to the product of their individual densities:

I_s(\mathbf{r}) = \sum_{\mathbf{r}} P_s(\mathbf{r}) \log \frac{P_s(\mathbf{r})}{\prod_{i} P_s(r_i)} = \sum_{i} H_s(r_i) - H_s(\mathbf{r})     (6-1)

where P_s(r_i) is the marginal distribution of neuron i given s, and H_s(r_i) represents the conditional entropy of neuron i. Since the joint entropy cannot exceed the sum of the marginal entropies, i.e., \sum_i H_s(r_i) \ge H_s(\mathbf{r}), the multi-information is always positive, and equality holds only when the responses are statistically independent.
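A minimal sketch of equation (6-1) from empirical estimates, assuming the conditional joint distribution over population states and the per-neuron firing probabilities have been estimated elsewhere; names and data layout are illustrative.

    import numpy as np

    def entropy(p):
        # Shannon entropy (bits) of a probability vector; zero terms dropped
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def multi_information(joint_probs, marginal_rates):
        # I_s(r) = sum_i H_s(r_i) - H_s(r), equation (6-1)
        # joint_probs: probabilities of all observed population states given s
        # marginal_rates: P(r_i = 1) for each neuron, conditioned on s
        h_marginals = sum(entropy([q, 1.0 - q]) for q in marginal_rates)
        return h_marginals - entropy(joint_probs)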
6.1.1. Factoring Multi-information

Although multi-information accounts for the presence of correlation, it does so in a global context. It is highly desirable to quantify how much of that correlation is intrinsic to independent, pairwise, triplet, or higher-order interactions among neurons given a particular stimulus. To do this, the k-th-order connected information measure, I_s^{(k)}(r) [111], also known as co-occurrences [112], computes the distance between the entropies of the (k−1)-th-order and k-th-order conditional densities:

I_s^{(k)}(\mathbf{r}) = H\!\left[ P_s^{(k-1)}(\mathbf{r}) \right] - H\!\left[ P_s^{(k)}(\mathbf{r}) \right]     (6-2)

where P_s^{(k)}(r) is the k-th-order conditional density, estimated as a normalized product of marginals up to the k-th order. It can easily be shown that the multi-information is the sum of all k-th-order connected information terms:

I_s(\mathbf{r}) = \sum_{k=2}^{N} I_s^{(k)}(\mathbf{r})     (6-3)

By computing the ratio of each k-th-order connected information to the multi-information, I_s^{(k)}/I_s, the contribution of each k-th-order correlation to the encoding dictionary can be obtained. The order of the encoding model is estimated as the maximum k with significant connected information, as will be shown in the results section.
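Once the entropies of the order-k maximum entropy fits are available, equations (6-2) and (6-3) amount to successive differences; a minimal sketch under that assumption.

    def connected_information(H_k):
        # H_k[k] = entropy of the k-th-order maximum entropy fit, k = 1..N
        # (H_k[1] corresponds to the independent model); assumed precomputed.
        N = max(H_k)
        I = {k: H_k[k - 1] - H_k[k] for k in range(2, N + 1)}  # equation (6-2)
        # sum(I.values()) equals the multi-information, equation (6-3)
        return I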
We demonstrate the above theoretical analysis with a simple example. In Figure 6.1, we graphically represent the encoding of S using a population of three neurons, i, j, and k. Here we have P possible stimuli that these three neurons are encoding; these may also represent different variations of parameters associated with a single time-varying stimulus. The stimulus is encoded independently (e.g., by neuron k) and jointly (by neurons i and j). Since no 3rd-order interaction is observed between the variables, the multi-information in this case equals the second-order connected information, I_s(r) = I_s^{(2)}(r), and the encoding model order for this population is k = 2, as it fully characterizes the response properties using interactions limited to pairwise correlations. The latter can also be measured by the synaptic efficacy [113]. The mutual information between the response and the stimulus is shown by the blue region in Figure 6.1, while the noise entropy, H_s(r), is shown by the red region.

Figure 6.1 A stimulus entropy representation by a population of 3 neurons. H(S) expresses the stimulus dictionary, and H_s(r_i), H_s(r_j), and H_s(r_k) the single-neuron conditional entropies. The blue region is the contribution of the three neurons to encoding the stimulus space. H_s(r) represents the noise entropy and is shown in red. While neuron k independently encodes the stimulus, neurons i and j encode it independently as well as jointly.

6.1.2. Graphical Representation

A stimulus-induced dependency between the neurons can be graphically represented using Markov Random Fields (MRF) [114]. Correlation can be represented in this graph, denoted G, as an undirected link between the neurons, or factors. In such a case, the encoding dictionary is expressed as

P_s(\mathbf{r}) = \frac{1}{Z} \prod_{\pi_i \in G} P_s(r_{\pi_i}), \qquad Z = \sum_{\mathbf{r}} \prod_{\pi_i \in G} P_s(r_{\pi_i})     (6-4)

where π_i denotes the set of neurons connected to neuron i, and Z is a normalization factor (also known as the partition function). The edges in a Markov network structure, or cliques [115], express the role of second-order interactions between neurons in encoding S. A Markov network incorporating only first-order cliques assumes neurons to be conditionally independent. This probabilistic model is non-cooperative because it does not consider any stimulus-induced correlation among neurons. The encoding dictionary in such a case is expressed as

P_s(\mathbf{r}) = \frac{1}{Z} \prod_{i=1}^{n} P_s(r_i)     (6-5)

On the other hand, when second-order cliques are present, we obtain the Ising model from statistical physics [107, 116]. This model represents a maximum entropy model when the density is expressed as an exponential function of the weighted individual responses and pairwise responses:

P_s(\mathbf{r}) = \frac{1}{Z} \exp\!\left( \sum_{i} \delta_i r_i + \frac{1}{2} \sum_{i \neq j} \gamma_{ij} r_i r_j \right)     (6-6)

The unknown parameters δ_i and γ_ij are estimated using an iterative scaling algorithm so that the expected values of the individual responses, <r_i>, and pairwise responses, <r_i r_j>, match the measurable information in the experiment [79]. This information is the mean spike count within a fixed window (i.e., the firing rate) of each individual neuron and the observed pairwise correlations. Maximum entropy models similar to (6-6) have been shown to faithfully explain the collective stimulus-driven network behavior of ~50-100 retinal ganglion cells (RGCs) in isolated slices in vitro [79, 107, 116-118]. These neurons have well-characterized receptive fields, typically exhibit low firing rates, and were shown to capture ~98% of the multi-information under stimulus presence [116], while accounting for only ~88% of the multi-information in spontaneous network activity [79]. Cortical neurons observed in vivo, on the other hand, are morphologically very different from RGCs, exhibit very heterogeneous firing patterns [39], and are therefore believed to encode stimuli in a more complex fashion than sensory neurons observed in vitro such as RGCs [119, 120]. These neurons receive inputs from many cortical and subcortical areas and generate outputs through numerous feedback loops with error-correcting capabilities that are known to play a very important role in optimizing the subject's experience with the stimulus over multiple repeated trials [121, 122]. Their underlying complex structure therefore results in interaction patterns with much longer spatial and temporal dependence, yielding sequence lengths that are highly nonstationary.

6.1.3. Cooperative Model

Herein, we generalize the Markov structure to edges that can connect any number of vertices, to model higher-order interactions at a relatively fine time scale (3 msec). This time scale has been shown to describe the excess synchrony typically observed between cortical neurons in vivo [90, 123, 124]. This yields a hypergraph G in which every vertex expresses a marginal distribution, and represents a cooperative model of the encoding dictionary as [125]

P_s(\mathbf{r}) = \frac{1}{Z_G} \prod_{\pi_i \in G} \prod_{\pi_j \subseteq \pi_i} P_s(r_{\pi_j})^{q_j}     (6-7)

where Z_G is a normalization factor, and the P_s(r_{π_j}) are factors defined by the sets π_j, each a subset of π_i. The subsets π_i are the edges of the hypergraph G. The power of each marginal in equation (6-7), q_j = ±1, is defined by the Möbius inversion property as [126]

q_j = (-1)^{\left| \pi_j \right| - \left| \pi_i \right|}     (6-8)

where | . | denotes the cardinality of a set. The factorization in (6-7) indicates that all the marginals associated with the subsets of π_i are represented, in addition to the marginal of the set π_i itself. The entropy of the cooperative model can be derived from the entropies of all its factors as

H_s(\mathbf{r}) = H_{Z_G} + \sum_{\pi_i \in G} \sum_{\pi_j \subseteq \pi_i} q_j H_s(r_{\pi_j})     (6-9)

where the entropy H_{Z_G} represents a constant bias term resulting from the partition function.
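Equations (6-7) through (6-9) can be evaluated mechanically once the marginal entropies are estimated. The sketch below assumes the hypergraph is supplied as a list of factor sets of every order (singletons, pairs, triplets, ...), each expanded over its subsets with Möbius exponents, mirroring the bracketed factor sets in the expansion (6-10) that follows; H_marginal is a hypothetical lookup of estimated marginal entropies.

    from itertools import combinations

    def mobius_exponents(edge):
        # q_j = (-1)^(|pi_j| - |pi_i|) for every subset pi_j of edge pi_i (6-8)
        qs = {}
        for size in range(1, len(edge) + 1):
            for subset in combinations(sorted(edge), size):
                qs[subset] = (-1) ** (size - len(edge))
        return qs

    def model_entropy(hyperedges, H_marginal, H_Z=0.0):
        # H_s(r) = H_Z + sum over edges and their subsets of q_j H_s(r_pi_j),
        # equation (6-9). H_marginal maps a tuple of neuron indices to the
        # entropy of that marginal, assumed estimated from data elsewhere.
        total = H_Z
        for edge in hyperedges:
            for subset, q in mobius_exponents(tuple(edge)).items():
                total += q * H_marginal[subset]
        return total

    # For the Figure 6.2 example, hyperedges would include the six singletons,
    # the pairs (1,2), (1,3), (2,3), (2,4), (2,5), (3,6), and the triplet
    # (1,2,3); the sum then reduces to equation (6-12) below.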
An example is illustrated in Figure 6.2a. Here we have a hypergraph with four factors that represents the network structure of a population of six neurons; circles represent neurons and squares represent factors. Using (6-7), the encoding dictionary of the cooperative model represented by this hypergraph is factorized as

P_s(\mathbf{r}) = \frac{1}{Z_G} \left[ P_1 P_2 P_3 P_4 P_5 P_6 \right] \left[ \frac{P_{12}}{P_1 P_2} \cdot \frac{P_{13}}{P_1 P_3} \cdot \frac{P_{23}}{P_2 P_3} \cdot \frac{P_{24}}{P_2 P_4} \cdot \frac{P_{25}}{P_2 P_5} \cdot \frac{P_{36}}{P_3 P_6} \right] \left[ \frac{P_{123} P_1 P_2 P_3}{P_{12} P_{13} P_{23}} \right]     (6-10)

where P_{ijk} is short for P_s(r_i, r_j, r_k), and every bracket represents the factor sets of a different order. The cooperative model in this example can be further simplified to

P_s(\mathbf{r}) = \frac{1}{Z_G} \frac{P_{123}\, P_{24}\, P_{25}\, P_{36}}{P_2^{\,2}\, P_3}     (6-11)

The k-th-order interactions among neurons illustrated by the hypergraph are shown as ovals, representing the entropy space of each factor. Using (6-9), the entropy estimate of the cooperative model in Figure 6.2 is expressed as

H_s(\mathbf{r}) = H_{Z_G} + H_{123} + H_{24} + H_{25} + H_{36} - 2H_2 - H_3     (6-12)

Figure 6.2 A hypergraph including four factors, representing a Markov network model of a population of six neurons. Neurons are illustrated as circles and factors as squares. The terms H_{123}, H_{24}, H_{25}, and H_{36} are the entropies of the marginal factors defined by the hypergraph.

6.1.4. Minimum Entropy Distance

In a typical experiment, the stimulus is presented over multiple repeated trials to assess all the possible system states expressed by the output spike trains. The stimulus itself can be an actual information-bearing signal or the output of another neural population, as is typical in the layered structure of the cortex [127]. We therefore assume that the "true" entropy of the population given S is computed by concatenating a large number of repeated trials for that stimulus. Given the distribution of these data, we seek the optimal factorization that results in a model entropy with minimum distance to the true entropy. We refer to this as the minimum entropy distance (MinED) algorithm. The algorithm yields a hypergraph, G*, obtained by minimizing the following objective function:

MinED: \; G^* \equiv \arg\min_{G} \left| H_{True} - H_G \right|     (6-13)

where H_{True} is the true population entropy and H_G is the entropy estimate of the population from single-trial data. To initialize the search, we assume the least informative prior, expressed by the independent model. The algorithm then searches for the best k-th-order marginal, out of all \binom{N}{k} k-th-order marginals, that minimizes the entropy distance. In each step, only k-th-order marginal factors are accumulated in the hypergraph, and changes to the graph structure are allowed only when the entropy distance decreases. This operation is repeated until convergence occurs. The hypergraph is saved at the end of each iteration, and the search then continues to higher orders.

The MinED algorithm can also be categorized as a maximum entropy model. One notable difference is that the testable information is the rate of occurrence of network states, P_{True}, and not the estimates of the individual and pairwise neuron firing rates, <r_i> and <r_i r_j>, as in [107]. Although we do not have access to the true rate, we have a repertoire of network states collected over multiple repeated trials. Based on this testable information, (6-13) can be rewritten as

H_{True} - H_G = \sum_{\mathbf{r}} P_{True} \log P_{True} - \sum_{\mathbf{r}} P_{True} \log P_G     (6-14)

Based on the maximum entropy principle, H_G is always less than or equal to H_{True}, and the maximum H_G is reached when the MinED objective function is minimized.
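The search described above can be sketched as a greedy loop over candidate factors; model_entropy stands for an evaluation of (6-9) on a candidate factor set and is assumed to be available. This is an illustration of the MinED search logic under those assumptions, not the thesis implementation.

    from itertools import combinations

    def mined(H_true, model_entropy, n, k_max=3):
        # Start from the least informative (independent) prior, then add the
        # k-th-order factors that decrease |H_true - H_G|, one order at a time.
        G = [(i,) for i in range(n)]
        best = abs(H_true - model_entropy(G))
        for k in range(2, k_max + 1):
            improved = True
            while improved:                      # repeat until convergence
                improved = False
                for factor in combinations(range(n), k):
                    if factor in G:
                        continue
                    d = abs(H_true - model_entropy(G + [factor]))
                    if d < best:                 # accept only distance decreases
                        best, G = d, G + [factor]
                        improved = True
        return G, best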
6.2. Probabilistic Spiking Model with Synaptic Plasticity

Pairwise maximum entropy models have, to our knowledge, not been applied to the stimulus-driven responses of large networks of cortical neurons observed in vivo. There are multiple reports in the literature suggesting that these models are only suitable for neural systems that are small in size (~10-20 cells) and operate below a crossover point of the model's predictive power [128]. To gain some insight into whether this is the case, we sought here to simulate an encoding model with the following features:

1. A relatively large spiking network model, to test the generalization capability of the MinED algorithm to larger systems and to contrast it with maximum entropy models that perform well in small systems.
2. Testable information limited to a small subpopulation of the entire simulated network, so that the true probability distribution can be faithfully approximated across many experiments.
3. Spike trains at the input of the network (i.e., a discrete binary representation) rather than continuous inputs (e.g., the current injection typically used in experiments in vitro), to mimic the input-output mapping of cortical neurons.
4. A biologically plausible statistical learning rule that optimizes the system's experience with the stimulus by varying the network structure over multiple repeated trials.
5. Population response statistics similar to those observed experimentally.

6.2.1. Probabilistic Spiking Neurons

We simulated a motor task in which a moderately large population of neurons (166 neurons) encoded the 2D arm trajectory of a subject reaching from a center position to one of 8 peripheral targets [18]. We used an inhomogeneous Poisson model to simulate the spike train data for two interconnected layers of neurons. This model has been shown to fit the firing patterns of motor cortical neurons under a variety of motor tasks [129, 130]. In this model, the firing probability of each neuron at time t is expressed by the conditional mean intensity function λ_i(t | g_i(t), h_i(t)), where g_i(t) describes the tuning to the movement parameters (i.e., the stimulus), and h_i(t) denotes the spiking history of neuron i (to account for refractory effects) and that of the other neurons connected to it up to time t [34, 35]:

\lambda_i\!\left( t \,|\, g_i(t), h_i(t) \right) = \exp\!\left( \beta_i + g_i(t) + h_i(t) \right)     (6-15)

where β_i is the background rate of neuron i, representing membrane voltage fluctuations, and

h_i(t) = \sum_{j \in \pi_i} \sum_{m=0}^{M_{ij}} \alpha_{ij}(m\Delta)\, S_j(t - m\Delta)     (6-16)

where π_i is the set of (input) neurons affecting the firing of (output) neuron i, M_ij is the number of history time bins relating the firing probability of neuron i to the activity of parent neuron j, S_j(t − mΔ) is the state of neuron j in bin m (taking the value 0 or 1), and Δ is the bin width. The parameter α_ij models the coupling between neuron i and neuron j as [131]

\alpha_{ij}(t) = \begin{cases} 0, & t < l_{ij}\Delta \\ A_{ij}(t)\, \exp\!\left( -3000\, t / M_{ij} \right), & t \ge l_{ij}\Delta \end{cases}     (6-17)

where A_ij(t) is the strength of the coupling, positive or negative depending on whether the connection is excitatory or inhibitory, and l_ij is the synaptic latency (in bins) associated with that connection.
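A compact sketch of the conditional-intensity simulation in (6-15)-(6-16); the container layouts (per-pair history kernels already delayed and truncated per (6-17)) and names are illustrative assumptions, not the thesis code.

    import numpy as np

    def simulate_population(beta, g, alpha, parents, T, rng):
        # beta[i]: background log-rate; g[i][t]: tuning drive (0 for the
        # cooperative layer); alpha[(i, j)]: history kernel over M_ij bins,
        # assumed zeroed below the synaptic latency as in (6-17).
        n = len(beta)
        S = np.zeros((n, T), dtype=np.uint8)
        for t in range(T):
            for i in range(n):
                h = 0.0
                for j in parents[i]:
                    kernel = alpha[(i, j)]
                    m_max = min(len(kernel), t + 1)
                    # sum of alpha_ij(m*Delta) * S_j(t - m*Delta), eq. (6-16)
                    h += float(kernel[:m_max] @ S[j, t - m_max + 1:t + 1][::-1])
                p = np.exp(beta[i] + g[i][t] + h)       # eq. (6-15)
                S[i, t] = rng.random() < min(p, 1.0)    # per-bin Bernoulli draw
        return S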
Our model consisted of a 2-layer network, as depicted in Figure 6.3a. The input layer (66 neurons) was non-cooperative, meaning that the term h_i(t) was absent from the model (6-15) for these 66 neurons. For this layer, neurons were tuned to two movement parameters, direction and end-point, through g_i(t) as

g_i(t) = \eta_i\, V(t+\varsigma)\, \cos\!\left( \frac{\theta(t+\varsigma) - \theta_i}{\omega_i} \right) - \kappa_i\, \frac{(\varphi_x - x_i)^2 + (\varphi_y - y_i)^2}{\sigma_i^2}     (6-18)

where η_i and κ_i are the weights assigned to the cosine and end-point tuning terms, respectively; V(t+ς) is the movement speed (kept constant) to be reached after ς sec; θ(t+ς) is the movement direction and θ_i is the neuron's preferred direction [18]; φ_x and φ_y are the x- and y-components of the intended target end-point, while x_i and y_i are the x- and y-components of the preferred end-point of neuron i. The parameters ω_i and σ_i control the direction and end-point tuning widths, respectively. Under these settings, neural responses appear correlated due to the overlap between tuning characteristics (i.e., common input), and not due to actual interaction among the neurons themselves. In our model, neurons 1 to 40 were each cosine-tuned to a uniformly distributed preferred direction between -180° and 180°, as illustrated in Figure 6.3b. Neurons 31 to 66 were each tuned to a preferred end-point in a 2D space divided into a 6 × 6 grid, with the center of each grid cell taken as the end-point, as shown in Figure 6.3c. Accordingly, 10 neurons (neurons 31 to 40) were tuned to both direction and end-point. This mimicked the heterogeneous tuning characteristics frequently observed in motor cortex recording experiments [27, 132].

The cooperative layer (100 neurons), on the other hand, followed the model (6-15) with g_i(t) = 0. Initially, these neurons had no tuning preferences and were fully interconnected among themselves with fixed weights, with 80 excitatory and 20 inhibitory neurons, similar to the 1:4 ratio typically observed in cortex [133]. For this layer, h_i(t) was expressed as

h_i(t) = \sum_{j \in \pi_i^{nc}} \sum_{m=0}^{M_{ij}} \alpha_{ij}^{nc}(m\Delta)\, S_j(t - m\Delta) + \sum_{k \in \pi_i^{c}} \sum_{m=0}^{M_{ik}} \alpha_{ik}^{c}(m\Delta)\, S_k(t - m\Delta)     (6-19)

where π_i^{nc} and π_i^{c} are the parent sets of neuron i, while α_ij^{nc} and α_ik^{c} model the connections from non-cooperative layer and cooperative layer neurons, respectively, with the form given in (6-17). The encoding of movement parameters by these neurons therefore resulted from their random connections with subpopulations in the non-cooperative layer. Specifically, each cooperative layer neuron received excitatory connections from 4 randomly selected non-cooperative layer neurons. The synaptic latency l_ij of the connections from non-cooperative layer neurons was randomly selected from a uniform distribution between 1 and 3 bins, while for the cooperative layer connections, l_ik was set to 1 bin. Table 6.1 lists the values of the parameters used in (6-16) to (6-19).
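The tuning drive of equation (6-18) for a single input-layer neuron can be written directly; the default constants follow Table 6.1, while the function signature itself is an illustrative assumption.

    import numpy as np

    def tuning_drive(theta_move, phi, theta_pref, xy_pref, V=1.0,
                     eta=1.5, kappa=5.0, omega=1.0, sigma=1.0):
        # theta_move: movement direction (rad); phi: target end-point (x, y);
        # theta_pref, xy_pref: the neuron's preferred direction and end-point.
        direction_term = eta * V * np.cos((theta_move - theta_pref) / omega)
        dx, dy = phi[0] - xy_pref[0], phi[1] - xy_pref[1]
        endpoint_term = kappa * (dx**2 + dy**2) / sigma**2
        return direction_term - endpoint_term    # g_i(t) in equation (6-18)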
6.2.2. Task Learning

We used a biologically plausible statistical learning rule to optimize the system's experience with the task. This rule was governed by a spike-timing-dependent plasticity (STDP) mechanism, which has been shown to take place in vitro [134] and is believed to underlie cortical plasticity in vivo [135-137]. Specifically, under this mechanism, the connection strength between any given pair of interconnected neurons in the cooperative layer was allowed to change naturally if the pair was "functionally correlated". Two neurons in the cooperative layer were considered functionally correlated if they received connections from the same non-cooperative layer neuron with different synaptic latencies, or from more than one non-cooperative layer neuron with overlapping tuning. This is illustrated in Figure 6.4a by cooperative layer neurons C and D. In this example, only the connection from neuron C to neuron D is expected to be strengthened as a result of learning, while all other connections are expected to be weakened if no overlap exists between the tuning of neurons A and B. The details of this learning rule are reported elsewhere [136, 138].

Table 6.1 Values of the parameters used in the model of equations (6-16), (6-17), (6-18), and (6-19). These parameters were fixed for all neurons in the non-cooperative and cooperative layers. A_ik^{c+}(t) and A_ik^{c−}(t) model the weights of the excitatory and inhibitory connections, respectively, between cooperative layer neurons.

Parameter        Value
β_i              log(5)
η_i              1.5
ω_i              1
κ_i              5
σ_i              1
M_ij             40
M_ik             40
A_ij^{nc}(t)     8
A_ik^{c+}(t)     0.01 (initial value)
A_ik^{c−}(t)     −8

[Figure 6.3a schematic: the non-cooperative (input) network of neurons 1-66, tuned to direction and end-point, projecting to the cooperative network of neurons 67-166. Figure 6.3b: sample cosine direction-tuning curves (amplitude versus direction, -180° to 180°).]

Figure 6.3 (a) A simplified version of the network structure. The non-cooperative (input) layer consists of 40 direction-tuned neurons and 36 end-point-tuned neurons, with 10 neurons tuned to both movement parameters. Each cooperative layer neuron receives 4 excitatory connections from the non-cooperative layer. Connections within the cooperative layer are 80% excitatory and 20% inhibitory. (b) Sample cosine-tuning functions for 7 of the 40 direction-tuned neurons (only a subset is shown for clarity). (c) Gaussian-tuning functions for the end-point-tuned neurons.

The model was trained with 1000 trials (trajectories) per target, where the duration of each trial was 2 sec: 1 sec to reach the target and 1 sec to return to the center position. Excitatory connections between cooperative layer neurons were the only connections allowed to change as a result of learning. Figure 6.4b shows sample trajectories used during task learning. Every 2 sec, one of the 1000 trajectories was randomly selected, from which the movement direction, movement speed, and end-point were extracted. These movement features were then used to modulate the firing of non-cooperative layer neurons according to equation (6-18). Figure 6.4c shows sample spike raster plots obtained in one trial for two adjacent targets, 1 and 8, after the training phase. Non-cooperative layer neurons exhibit clear task-dependent response modulation reminiscent of their tuning characteristics. In contrast, cooperative layer neurons do not exhibit noticeable changes in their firing rates, but significant synchrony appears to be present. Figure 6.5a shows the average change in connection strength for both functionally correlated and uncorrelated neurons during the learning phase. The strength of the connections between functionally correlated neurons increases above the initial strength (0.01), while those of uncorrelated neurons decrease substantially over multiple training sessions.
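The thesis defers the exact plasticity rule to [136, 138]; purely as an illustration of the class of rule involved, here is a generic exponential STDP update. The constants and window shape are textbook values chosen for illustration, not the ones used in the model.

    import numpy as np

    def stdp_update(w, dt_ms, A_plus=1e-4, A_minus=1.2e-4, tau_ms=20.0,
                    w_max=0.02):
        # dt_ms = t_post - t_pre; pre-before-post potentiates the synapse,
        # post-before-pre depresses it. All constants are generic, not from
        # the dissertation.
        if dt_ms >= 0:
            w = w + A_plus * np.exp(-dt_ms / tau_ms)
        else:
            w = w - A_minus * np.exp(dt_ms / tau_ms)
        return float(np.clip(w, 0.0, w_max))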
Figure 6.5b shows the connectivity matrices of cooperative layer neurons before and after learning. Sparse connectivity is apparent, indicating that the learning phase consolidated the task information in a relatively small number of connections.

[Figure 6.4a schematic: input-layer neurons A and B projecting to integration-layer neurons C, D, and E with latencies l_CA, l_DA, and l_EB. Figure 6.4b: the eight peripheral targets.]

Figure 6.4 (a) Schematic of the statistical learning rule. Neurons A and B are non-cooperative layer neurons that excite cooperative layer neurons C, D, and E. The parameter l_ik is the synaptic latency associated with each connection. With l_CA greater than l_DA, neurons C and D were considered functionally correlated; therefore, the only connection that is strengthened is C → D. Dashed connections indicate connections that were expected to weaken over time. (b) Sample trajectories used to simulate the spike trains for the 8 targets. (c) Sample raster plots of the spike trains generated by the model corresponding to trajectories for targets 1 (top) and 8 (bottom) for one trial.

[Figure 6.5a: mean connectivity strength (0 to 0.02) versus trial number (0 to 800). Figure 6.5b: connectivity matrices (from/to, neurons 1-100) before and after learning.]

Figure 6.5 (a) Mean connectivity strength versus trial number for functionally correlated (blue) and uncorrelated (red) cooperative layer neurons. (b) Connectivity matrices for the cooperative layer neurons before training (left) and after training (right). The columns represent pre-synaptic (transmitter) neurons while the rows represent post-synaptic (receiver) neurons.

An example of the statistics of the input and output spike trains for sample neurons during the reach portion of the trial is shown in Figure 6.6. Cooperative layer neuron '103' receives input from neurons '8', '13', '24' (direction-tuned) and '54' (end-point-tuned). As can be seen in Figure 6.6f, neuron '103' modulated its firing significantly when the movement was to targets '5' or '8', as depicted by the 10 ms bin post-stimulus time histograms (PSTHs) averaged across 1000 trials per target [84]. On the other hand, for these two targets, two out of the four input neurons modulated their firing significantly (neurons '8' and '13' for target '5', and neurons '24' and '54' for target '8'), as shown in Figures 6.6b-e. Because target '5' is independently encoded by neurons '8' and '13', and both excite neuron '103', the response modulation of neuron '103' to target '5' is much stronger than its response modulation to targets '4' and '6' combined; the latter targets are independently encoded (with similar strength) by the same two neurons '8' and '13', respectively. A similar observation can be made for the response modulation of neuron '103' to target '8' relative to its responses to targets '1' and '7', mediated by neurons '24' and '54', respectively.

To assess whether the correlation observed between cooperative layer neurons is an actual indicator of stimulus-driven interaction, we created a version of the data in which each spike was randomly jittered around its position over a uniform range of [−9, 9] msec. This keeps the firing rate of each neuron the same (i.e., a similar individual response to the common input) but destroys the time-dependent correlation in the patterns.
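A sketch of this surrogate construction, assuming each neuron's spike times are held in a NumPy array in milliseconds; the generator API and the ±9 ms range follow the description above.

    import numpy as np

    def jitter_spike_trains(spike_times_ms, rng, jitter_ms=9.0):
        # Each spike is displaced independently by U(-9, 9) ms; firing rates
        # are preserved while fine time-dependent correlations are destroyed.
        return [np.sort(t + rng.uniform(-jitter_ms, jitter_ms, size=t.shape))
                for t in spike_times_ms]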
For a representative population of 10 cooperative layer neurons (neurons '99' to '108'), Figure 6.7a shows the probability of the number of neurons firing within a 10 ms window for both the actual data and the jittered data. The probability of two or more neurons firing within a 10 msec window is smaller for the jittered data than for the actual data, indicating that the temporal coordination among the neurons is important to maintain the spatiotemporal patterns with higher-order interactions, and that it is destroyed by jittering.

[Figure 6.6a schematic: non-cooperative neurons 8, 13, 24, and 54 projecting to cooperative neuron 103. Figures 6.6b-f: 10 ms bin PSTHs (spikes/bin versus time) of neurons 8, 13, 24, 54, and 103 for each of the 8 targets.]

Figure 6.6 (a) An example of the connectivity between one cooperative layer neuron (neuron 103) and four non-cooperative layer neurons (8, 13, 24 and 54). Neurons 8, 13 and 24 are direction-tuned while neuron 54 is end-point-tuned. (b), (c), (d), (e) and (f): 10 ms bin PSTHs of neurons 8, 13, 24, 54, and 103, respectively, for each of the 8 targets. Dashed lines represent the significance level computed by averaging the PSTHs of each neuron across all targets and bins.

Figure 6.7b shows the number of times each specific firing pattern occurred in the jittered data versus the actual data. Each point in the figure represents one of the 1024 possible patterns. The graph indicates that jittering the spike trains decreased the rate of occurrence of most of the patterns compared to the actual data. This confirms the observation in Figure 6.7a that the higher-order correlation in the data exceeds what can be obtained from neurons acting independently.

6.3. Experimental Results

6.3.1. Order of the Encoding Model

After the motor task was learned, no further changes in the connection strengths were permitted, and the obtained network topology was kept fixed for the rest of the analysis. 1000 trajectories per target were subsequently presented to the network and the corresponding spike trains were obtained. Figure 6.8a shows sample trial data from target 1, obtained from the cooperative layer neurons. Target-specific trials were then concatenated into a single response matrix, r. To estimate the true entropy, the joint density was calculated as the rate of occurrence of distinct patterns. Figure 6.8b shows a sample pattern from a subpopulation of 10 neurons, indicated by the box in Figure 6.8a.

Figure 6.7 (a) Probability of the number of neurons firing within a 10 ms window for actual data generated from the model (blue) and a jittered version of the data (red), for cooperative layer neurons 99 to 108. (b) Rate of occurrence of specific firing patterns within a 3 ms window (1 bin) for the jittered data versus the actual data generated from the model.
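The "rate of occurrence of distinct patterns" estimator lends itself to a few lines of Python; the layout of the binary response matrix (neurons × bins) is an assumption made for illustration.

    import numpy as np
    from collections import Counter

    def empirical_state_distribution(R):
        # R: binary response matrix (neurons x bins) from concatenated trials.
        # Returns the rate of occurrence of each distinct network state, the
        # estimate from which the "true" entropy is computed.
        counts = Counter(col.tobytes() for col in R.T.astype(np.uint8))
        total = R.shape[1]
        return {state: c / total for state, c in counts.items()}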
The order of the encoding model was estimated by determining how much information is intrinsic to each order of interactions. This was done by estimating the k-th-order connected information in (6-2) and comparing it to the multi-information in (6-1). For a subpopulation size of 10 neurons, the 2nd-, 3rd-, and 4th-order connected information is illustrated in Figure 6.9. The k-th-order connected information for k > 3 is negligible, indicating that the information in the model was intrinsic to independent, pairwise, and triplet interactions among neurons; no significant interactions beyond k = 3 contributed to the encoding model.

6.3.2. Model Consistency

The above analysis was repeated and the underlying network model was inferred for each of the eight targets using the MinED algorithm. Figure 6.10a illustrates two hypergraphs inferred for sample targets 1 and 8, for which spike train data from a sample trial was shown in Figure 6.4c. A considerable number of second-order interactions exists, confirming the results reported in the literature using in vitro data [107]. Furthermore, a considerable portion of third-order interactions is present. Together, both account for more than 99.9% of the multi-information in the population activity.

[Figure 6.8a: raster of 60 neurons over a 2-sec trial. Figure 6.8b: binary network states of neurons 22-30.]

Figure 6.8 (a) Sample data from 60 cooperative layer neurons generated by the model during one trial after task learning was completed. (b) A binary representation of the network states indicated by the box in (a). Every state describes the activity pattern of the population in one bin. Some states are repeated more often than others (e.g., the all-silent state is observed in 7 out of 20 bins). Although there are 2^n possible states, many states never occur.

Figure 6.9 Connected information for different orders of the cooperative model for a fixed subpopulation size (n = 10). The k-th-order connected information for k > 3 is negligible compared to the 2nd and 3rd orders. The sum of all connected information terms equals the multi-information.

To verify that the observed factors are reminiscent of stimulus-dependent higher-order interactions, we computed crosscorrelograms between sample pairs of neurons that were found to have factors in the hypergraphs. Features in the correlograms can be used to determine whether serial excitatory or inhibitory synaptic interactions exist between biological neurons [139]. Figures 6.10b and 6.10c show sample crosscorrelograms for the neuron pair ('9', '10') with a bin size of 9 ms, computed from target '1' and target '8' trials, respectively. A single-bin central peak at zero lag occurs in the crosscorrelogram for target '1', indicating a serial excitatory interaction [100]. This peak is not observed between the two neurons for the target '8' data, which agrees with the absence of a factor between the two neurons in the corresponding hypergraph in Figure 6.10a. It is also noteworthy that the MinED algorithm extracted many additional pairwise factors that are not detected by the crosscorrelogram method.
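For completeness, a minimal sketch of the crosscorrelogram computation used in Figures 6.10b-c, assuming binary spike trains sampled at 1 msec resolution; normalization by the total count is our choice for illustration, and the all-pairs difference is memory-hungry for long recordings.

    import numpy as np

    def crosscorrelogram(s1, s2, bin_ms=9, max_lag_ms=500):
        # s1, s2: binary spike trains at 1-ms resolution (assumed layout)
        edges = np.arange(-max_lag_ms, max_lag_ms + bin_ms, bin_ms)
        t1 = np.flatnonzero(s1)
        t2 = np.flatnonzero(s2)
        diffs = (t2[None, :] - t1[:, None]).ravel()   # all pairwise lags
        counts, _ = np.histogram(diffs, bins=edges)
        return edges[:-1], counts / max(counts.sum(), 1)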
[Figure 6.10a: inferred hypergraphs for targets 1 and 8 over neurons 1-10, with factors labeled F1-F24. Figures 6.10b-c: crosscorrelograms (normalized count versus lag, ±500 ms).]

Figure 6.10 (a) Inferred factor graphs for sample targets '1' and '8'. Every ellipse represents a neuron, and the cyan and purple squares represent second- and third-order factors, respectively. A considerable number of 2nd-order interactions is present, confirming experimental results in vitro. Although the number of 3rd-order factors is considerably lower than the number of 2nd-order factors, they constitute a significant portion of the multi-information. (b), (c) Crosscorrelograms for neurons '9' and '10' with a bin size of 9 ms for target '1' and target '8', respectively. A significant peak occurs in the crosscorrelogram for target '1', indicating an excitatory interaction between the pair that is not observed for target '8'.

Figure 6.11 Model fit for a sample subpopulation of 10 neurons for: 1- the independent model (blue), 2- the maximum entropy model (red), and 3- the MinED cooperative model (green). The true density is the measured rate of occurrence of distinct network states, while the estimated density is the rate predicted by the probabilistic graphical model.

Figure 6.12 The KL distance between the true and the single-trial distribution of states for each model, for each of the eight targets. On average, the MinED cooperative model results in the smallest distance.

Figure 6.13 The KL divergence for the independent, maximum entropy, and cooperative models as a function of the size of the observed population. The distance is normalized by the number of neurons. Error bars represent standard deviation.

Figure 6.11 illustrates the goodness of fit of the independent, maximum entropy, and MinED cooperative models. Further departure from the reference line (solid) indicates that the compared distributions are increasingly distinct. According to Figure 6.11, the MinED model is the closest to the reference line, further demonstrating its superiority in capturing the intrinsic higher-order interactions between the neurons. This is further demonstrated by the distance between the true and estimated densities displayed in Figure 6.12 for each of the eight targets.

To demonstrate the consistency of the MinED cooperative model as a function of the population size, Figure 6.13 shows the average normalized Kullback-Leibler (KL) distance for the different models as a function of the observed population size. The error for the cooperative model remains bounded as the number of neurons increases, compared to a larger error for the independent and maximum entropy models. One possible explanation is that the probability of observing neurons with significant overlaps in their tuning characteristics increases with the population size. This means that the output correlation becomes dominated by the input correlation rather than by local interaction between cooperative layer neurons.
In such a case, the maximum entropy and independent models tend to exhibit some redundancy in their encoding characteristics.

Figure 6.14b demonstrates the different hypergraph models obtained for the population responses of the V1 data to different inputs of the drifting grating stimulus (chapter 5), shown in Figure 6.14a. Similar to the hypergraphs of the synthetic motor encoding model in Figure 6.10a, different inputs result in different hypergraphs.

6.4. Discussion

As tools for large-scale recording of neural activity in the brain continue to excel, some challenging questions about the nature of the brain's complex information processing naturally arise. Information theory is increasingly popular for addressing many of these questions. In this chapter, we mainly addressed the problem of how to discover stimulus-dependent spatiotemporal patterns of activity using an information-theoretic approach. These patterns may indicate how cooperation among neurons can provide a near-optimal mechanism for encoding external stimuli.

Figure 6.14 (a) Two sample inputs of the drifting grating stimulus. (b) Hypergraph models for the 21 neurons for the two sample inputs in (a). Red squares represent factors of second-order interactions, and green squares represent factors of third-order interactions.

Using factor graphs, we have demonstrated that a network model can best fit single-trial data when connection constraints are guided by information available from the concatenation of multiple trials. Generally speaking, limited data constitute a significant challenge [128], particularly for large populations, because the dimension of the response space grows exponentially with the number of neurons. In such a case, an infinite amount of data would be needed to completely characterize the joint distribution, and the density estimates can be severely biased when only limited data are available. Our approach therefore attempts to circumvent this problem by finding the graphical model that explains single-trial data and can simultaneously be extrapolated to account for unlimited data size, if available.

Using a model of motor encoding, we demonstrated that task information can be transformed from a non-cooperative network output with independent firing to a cooperative network interaction using a spike-timing-dependent plasticity learning rule. This biologically plausible mechanism accounts for the ordered set of event occurrences and therefore captures temporal correlations between neuronal states that are caused by delayed synaptic interactions between neurons. Our analysis revealed that a substantial amount of information in the model is represented in 2nd-order interactions and is efficiently represented by cliques in the graph structures, confirming the results obtained by other groups. However, our results also demonstrate that 3rd-order interactions are present and contribute significantly to the multi-information. These results were not surprising, given the model's dependence on the variable-length history of interaction that contributed to the emergence of temporal patterns of network states. In addition, spatial patterns between neurons emerged as a result of task learning, a component strongly believed to underlie cortical network organization [2].

In our analysis, we have mainly focused on the development of the algorithm and therefore restricted our analysis to spike train data from simulated motor cortex neurons with inhomogeneous Poisson firing characteristics.
These models generate neural responses that share a number of features with experimental data. First, it has been shown that cortical neurons can simultaneously encode multiple parameters (e.g., direction and end-point [27]), reflecting the complex anatomical and functional organization of cortical neural circuits. Second, synaptic efficacy has been hypothesized to be mediated by the precise timing of the presynaptic (transmitter) neuron's firing relative to that of the postsynaptic (receiver) neuron [134, 138], and it is widely accepted that synaptic modification plays an important role in learning and memory formation [91]. Third, the neurons in our model that were found to exhibit higher-order interactions were driven by discrete neuronal inputs (spike trains). For these reasons, we believe that our encoding model mimics, to some extent, a biological "information transform" that may take place in an actual spiking neural system. Our model, however, did not include an important element known to guide motor behavior: visual feedback. We chose to simulate a task (the center-out reach task) that minimally relies on this feedback mechanism to guide reach movements (as opposed to, for example, a task with a freely moving target, which would require continuous adaptation of motor commands).

We also found strong agreement between the statistics of our encoding model data and related models that have been applied to RGCs in vitro. Specifically, we found that 2nd-order interactions seem to be predominantly present, a result observed by a multitude of studies (e.g., [72, 75, 79, 94, 107]). We also found that our biological learning rule results in network connectivity that is relatively sparse, in agreement with recent experimental results during direct activation of biological cortical networks [140].

As our results demonstrate, the MinED approach is superior to the maximum entropy model in inferring the underlying model structure, particularly for relatively large systems. The maximum entropy model finds the network realization that provides the maximum entropy with respect to the testable information (i.e., known constraints such as firing rates and pairwise correlations). This limits its application to small networks in which the true distribution can be measured with a high degree of accuracy. Our MinED approach, on the other hand, was developed to optimally explain single-trial data with knowledge of network behavior from a concatenation of a large number of trials. This has the advantage of capturing information about transitions between specific sequences of network states, a feature not accounted for by the maximum entropy model. We therefore expect the MinED approach to yield superior performance in quantifying the spatiotemporal patterns of activity frequently observed in cortical systems in vivo. Full analysis of real experimental neural data with our approach is currently underway. We believe this framework can be extremely useful in improving our understanding of brain circuits, and in developing efficient decoders for brain machine interface applications [141]. Despite our focus on spiking neural data analysis, the framework is applicable to many other fields, for example, gene regulatory networks, where the collective role of multiple genes needs to be assessed to characterize an observed phenotype.
It can also be useful in the development of sensor network coding strategies, in which sensors optimally cooperate to encode signals from their surroundings while optimizing their limited resources. In some sense, one can argue that natural brain encoding mechanisms can be extremely insightful for the design of efficient network codes.

7 Concluding Remarks

In this dissertation we have discussed two main steps in analyzing neural data: spike sorting and spike train analysis. Although these are fundamentally separate algorithms, the latter builds upon the outcome of the former; as a result, its success in fitting graphical models of population encoding heavily depends on the sorting performance [142]. Falsely detected noise events and missed spikes alter the estimates of the empirical moments and of the true entropy of the population response, and can therefore affect the estimated parameters of the spatiotemporal maximum entropy and hypergraph models.

As discussed earlier in chapter 3, spike sorting is important but computationally prohibitive in BMI systems. As a result, not all BMI applications feature spike sorting; some simply avoid its extensive computations by classifying events based on their maximum amplitudes via thresholding the raw data. For example, implantable devices designed to wirelessly transmit the activity of a large number of electrodes only transmit a small interval of the action potentials that pass a detection threshold, and rely on each event's maximum amplitude for sorting [143-145]. Although we have shown the effect of poor or absent spike sorting on decoding performance in Figure 3.8, some neural decoders rely on higher-level analysis of the encoding characteristics of the detected units to compensate for the loss of performance [146]. Therefore, the need for spike sorting is application dependent, and must be weighed by considering its computational cost against the gain in performance and reliability.

Setting aside its computational cost, spike sorting remains challenging with respect to deciding on the number of units recorded on an electrode and determining the decision boundaries that separate these units. Although tetrodes, which include four closely spaced electrodes, can assist spike sorting algorithms in deciding on the number of units given the spatial propagation of the spike waveforms [147-149], the challenge of determining discriminative boundaries remains. Including the encoding characteristics of the units, if stationary, can help. These encoding characteristics can be the inter-spike interval (ISI) [150], the conditional firing rate [151], or the neuronal tuning function [152]. This additional information can provide reliable performance and can be used for tracking stable units over consecutive days, provided these characteristics do not change as a result of neuroplasticity [153].

In this dissertation we introduced compressive spike sorting (CSS) to directly sort units in a sparse domain obtained by thresholding multilevel wavelet coefficients. We also introduced an iterative learning algorithm to find the optimal set of thresholds for the sparse representation. CSS has three major benefits. First, while the discriminative features necessary for high sorting performance are preserved, significant data compression is obtained.
CSS has three major benefits. First, while the discriminative features necessary for high sorting performance are preserved, significant data compression is obtained. Second, sorting directly in the sparse representation requires fewer computations than conventional spike sorting algorithms, and can therefore be efficiently implemented in low-power, small-size electronics. Third, the sorting performance is less sensitive to low-SNR recordings and can be used to track units over consecutive days of recording. For the spike train analyses, by contrast, high-performance spike sorting was required, and its computational cost was not an issue.

Although the term ‘functional connectivity’ was used to refer to the graphical models fit to the conditional distribution of the population response given the stimulus, this connectivity is entirely different from a physical synaptic connection between neurons. The graphical models have three main applications. The first is the prediction of future population responses given a sequence of observed neural responses. The second is neural decoding using a maximum a posteriori (MAP) estimator: here the likelihood function is factorized into marginal and pairwise distributions using the graphical model realization, providing a more reliable estimate given the limited training data. The third is linking specific models to certain behavioral events. The spatiotemporal graphical models introduced in this dissertation proved more efficient in all three applications than typical independent models, and also than instantaneous maximum entropy models that do not incorporate temporal correlations among neuronal constituents.

While the probability of a physical connection between two neurons randomly selected from a small localized area of the brain is very small, the algorithms introduced in Chapters 5 and 6 tend to find numerous statistical interactions among recorded neurons, i.e. the maximum entropy model will find a nonzero value for all unknown parameters. Although there is a small chance that this connectivity reflects an actual physical connection, directly or indirectly through hidden neurons, it more likely originates from neurons responding to a common input, e.g. overlapping receptive fields for sensory neurons or overlapping tuning functions for motor neurons. Whatever the source, we are interested in limiting the number of nonzero elements of the inferred connectivity for the sake of model simplification and to avoid overfitting. This naturally suggests a sparse optimization algorithm that learns the unknown parameters while obtaining graphical models with the minimum number of nonzero elements [154, 155].
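A minimal sketch of such a sparsity-promoting fit is given below. It uses an L1-penalized logistic regression of each neuron on the remaining neurons (a pseudolikelihood surrogate for the full maximum entropy fit) with proximal-gradient, i.e. soft-thresholding, updates; this is one standard instantiation of the idea, not the specific algorithms of [154, 155], and all names and constants are illustrative.

    import numpy as np

    def soft_threshold(w, lam):
        # Proximal operator of the L1 norm: shrinks weights toward zero.
        return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

    def sparse_couplings(words, lam=0.05, lr=0.05, n_iter=500):
        # words: (trials x neurons) binary array. Each neuron is regressed
        # on the others; the L1 penalty drives most couplings exactly to
        # zero, yielding a graphical model with few nonzero elements.
        T, N = words.shape
        J = np.zeros((N, N))
        for i in range(N):
            y = words[:, i].astype(float)
            X = np.delete(words, i, axis=1).astype(float)
            w, b = np.zeros(N - 1), 0.0
            for _ in range(n_iter):
                p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted firing probability
                g = X.T @ (p - y) / T                     # logistic-loss gradient
                w = soft_threshold(w - lr * g, lr * lam)  # ISTA step enforces sparsity
                b -= lr * np.mean(p - y)
            J[i, np.arange(N) != i] = w
        return J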
REFERENCES

1. Kandel, E., J. Schwartz, and T. Jessell, Principles of neural science. 2000: McGraw-Hill, New York.
2. Hebb, D.O., The Organization of Behavior; a Neuropsychological Theory. 1949: Wiley, New York.
3. Wise, K., et al., Wireless Implantable Microsystems: High-Density Electronic Interfaces to the Nervous System. Proceedings of the IEEE, 2004. 92(1): p. 76-97.
4. Normann, R., et al., A neural interface for a cortical vision prosthesis. Vision Research, 1999. 39(15): p. 2577-2587.
5. Buzsaki, G., Large-scale recording of neuronal ensembles. Nature Neuroscience, 2004. 7(5): p. 446-451.
6. Nicolelis, M.A.L., Methods for Neural Ensemble Recordings. 1999: CRC Press.
7. Abbott, L.F. and S.B. Nelson, Synaptic plasticity: taming the beast. Nature Neuroscience, 2000. 3(suppl.): p. 1178-1183.
8. Barnes, A.P. and F. Polleux, Establishment of axon-dendrite polarity in developing neurons. Annual Review of Neuroscience, 2009. 32: p. 347.
9. Achermann, P. and A. Borbely, Low-frequency (< 1 Hz) oscillations in the human sleep electroencephalogram. Neuroscience, 1997. 81(1): p. 213.
10. Jurkiewicz, M.T., et al., Sensorimotor Cortical Plasticity During Recovery Following Spinal Cord Injury: A Longitudinal fMRI Study. Neurorehabilitation and Neural Repair, 2007. 21(6): p. 527.
11. Armstrong-Gold, C.E. and F. Rieke, Bandpass Filtering at the Rod to Second-Order Cell Synapse in Salamander (Ambystoma tigrinum) Retina. Journal of Neuroscience, 2003. 23(9): p. 3796-3806.
12. Stark, E. and M. Abeles, Predicting Movement from Multiunit Activity. Journal of Neuroscience, 2007. 27(31): p. 8387-8394.
13. Frohlich, Z., Combined spatial and temporal imaging of brain activity during visual selective attention in humans. Nature, 1994. 372: p. 8.
14. Normann, R.A., et al., High-resolution spatio-temporal mapping of visual pathways using multi-electrode arrays. Vision Research, 2001. 41(10): p. 1261-1275.
15. Palmer, C., A microwire technique for recording single neurons in unrestrained animals. Brain Research Bulletin, 1978. 3(3): p. 285-289.
16. Najafi, K. and K.D. Wise, An implantable multielectrode array with on-chip signal processing. IEEE Journal of Solid-State Circuits, 1986. 21(6): p. 1035-1044.
17. Lewicki, M.S., A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems, 1998. 9(4): p. 53-78.
18. Georgopoulos, A.P., A.B. Schwartz, and R.E. Kettner, Neuronal population coding of movement direction. Science, 1986. 233(4771): p. 1416.
19. Oweiss, K.G. and D.J. Anderson, A new technique for blind source separation using subband subspace analysis in correlated multichannel signal environments. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'01), 2001. 5.
20. Oweiss, K. and M. Aghagolzadeh, Detection and Classification of Extracellular Action Potential Recordings, in Statistical Signal Processing for Neuroscience and Neurotechnology, K.G. Oweiss, Editor. 2010, Academic Press, Elsevier. p. 15-74.
21. Oweiss, K.G., A systems approach for data compression and latency reduction in cortically controlled brain machine interfaces. IEEE Transactions on Biomedical Engineering, 2006. 53(7): p. 1364-1377.
22. Bishop, C.M., Pattern recognition and machine learning. 2006: Springer, New York.
23. Dinning, G.J. and A.C. Sanderson, Real-time classification of multiunit neural signals using reduced feature sets. IEEE Transactions on Biomedical Engineering, 1981(12): p. 804-812.
24. Dempster, A.P., N.M. Laird, and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1977: p. 1-38.
25. Bezdek, J.C., R. Ehrlich, and W. Full, FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 1984. 10(2): p. 191-203.
26. Suminski, A.J., et al., Incorporating feedback from multiple sensory modalities enhances brain-machine interface control. Journal of Neuroscience, 2010. 30(50): p. 16777-16787.
27. Carmena, J.M., et al., Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biology, 2003. 1(2): p. e2.
28. Serruya, M.D., et al., Brain-machine interface: Instant neural control of a movement signal. Nature, 2002. 416(6877): p. 141-142.
29. Taylor, D.M., S.I.H. Tillery, and A.B. Schwartz, Direct Cortical Control of 3D Neuroprosthetic Devices. Science, 2002. 296(5574): p. 1829.
30. Kass, R.E., V. Ventura, and C. Cai, Statistical smoothing of neuronal data. Network: Computation in Neural Systems, 2003. 14(1): p. 5-15.
31. Paulin, M.G. and L.F. Hoffman, Optimal firing rate estimation. Neural Networks, 2001. 14(6-7): p. 877-881.
32. Oweiss, K.G., Multiresolution Analysis of Multichannel Neural Recordings in the Context of Signal Detection, Estimation, Classification and Noise Suppression. 2002, University of Michigan.
33. Oweiss, K.G. and D.J. Anderson, Tracking Signal Subspace Invariance for Blind Separation and Classification of Nonorthogonal Sources in Correlated Noise. EURASIP Journal on Advances in Signal Processing, 2007.
34. Brillinger, D., The Identification of Point Process Systems. Annals of Probability, 1975. 3(6): p. 909-924.
35. Brown, E.N., Theory of Point Processes for Neural Systems, in Methods and Models in Neurophysics, C.C. Chow, et al., Editors. 2005, Elsevier: Paris. p. 691-726.
36. Brillinger, D., Nerve Cell Spike Train Data Analysis. Journal of the American Statistical Association, 1992. 87(418): p. 260-271.
37. Kaneoke, Y. and J.L. Vitek, Burst and oscillation as disparate neuronal properties. Journal of Neuroscience Methods, 1996. 68(2): p. 211-223.
38. Goldberg, J.A., et al., Enhanced Synchrony among Primary Motor Cortex Neurons in the 1-Methyl-4-Phenyl-1,2,3,6-Tetrahydropyridine Primate Model of Parkinson's Disease. Journal of Neuroscience, 2002. 22(11): p. 4639.
39. Churchland, M.M. and K.V. Shenoy, Temporal Complexity and Heterogeneity of Single-Neuron Activity in Premotor and Motor Cortex. Journal of Neurophysiology, 2007. 97(6): p. 4235.
40. Shadlen, M.N. and W.T. Newsome, The Variable Discharge of Cortical Neurons: Implications for Connectivity, Computation, and Information Coding. Journal of Neuroscience, 1998. 18(10): p. 3870-3896.
41. Galan, R.F., G.B. Ermentrout, and N.N. Urban, Optimal Time Scale for Spike-Time Reliability: Theory, Simulations, and Experiments. Journal of Neurophysiology, 2008. 99(1): p. 277-283.
42. Kass, R.E. and V. Ventura, Spike Count Correlation Increases with Length of Time Interval in the Presence of Trial-to-Trial Variation. Neural Computation, 2006. 18(11): p. 2583.
43. Parzen, E., On estimation of a probability density function and mode. Annals of Mathematical Statistics, 1962. 33(3): p. 1065-1076.
44. Cherif, S., K.E. Cullen, and H.L. Galiana, An improved method for the estimation of firing rate dynamics using an optimal digital filter. Journal of Neuroscience Methods, 2008.
45. Candes, E.J., J. Romberg, and T. Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 2006. 52(2): p. 489-509.
46. Oweiss, K.G., Compressed and Distributed Sensing of Multivariate Neural Point Processes. IEEE International Conference on Acoustics, Speech and Signal Processing, 2007. 2.
47. Golub, G.H. and C.F. Van Loan, Matrix Computations. 1996: Johns Hopkins University Press.
48. Oweiss, K.G., et al., A Scalable Wavelet Transform VLSI Architecture for Real-Time Signal Processing in High-Density Intra-Cortical Implants. IEEE Transactions on Circuits and Systems I, 2007. 54(6): p. 1266-1278.
49. Tan, P.N., M. Steinbach, and V. Kumar, Introduction to Data Mining. 2006: Addison-Wesley.
50. Donoho, D.L., De-noising by soft-thresholding. IEEE Transactions on Information Theory, 1995. 41(3): p. 613-627.
51. Georgopoulos, A.P., et al., Reply to Kurtzer and Herter. Journal of Neurophysiology, 2007. 97(6): p. 4391-4392.
52. Cover, T.M. and J.A. Thomas, Elements of Information Theory. 2006: Wiley-Interscience, New York.
53. Warland, D.K., P. Reinagel, and M. Meister, Decoding Visual Information From a Population of Retinal Ganglion Cells. Journal of Neurophysiology, 1997. 78(5): p. 2336-2350.
54. Kamboh, A.M., et al., Area-Power Efficient Lifting-Based DWT Hardware for Implantable Neuroprosthetics. IEEE International Symposium on Circuits and Systems (ISCAS 2007), 2007: p. 2371-2374.
55. Srinivasan, L., et al., A State-Space Analysis for Reconstruction of Goal-Directed Movements Using Neural Signals. Neural Computation, 2006. 18(10): p. 2465-2494.
56. Santhanam, G., et al., A high-performance brain-computer interface. Nature, 2006. 442(7099): p. 195-198.
57. Hatsopoulos, N., et al., At what time scale does the nervous system operate? Neurocomputing, 2003. 52-54: p. 25-29.
58. Riehle, A., et al., Dynamical changes and temporal precision of synchronized spiking activity in monkey motor cortex during movement preparation. Journal of Physiology (Paris), 2000. 94: p. 569-582.
59. Oram, M.W., et al., Excess Synchrony in Motor Cortical Neurons Provides Redundant Direction Information With That From Coarse Temporal Measures. Journal of Neurophysiology, 2001. 86(4): p. 1700-1716.
60. Eldawlatly, S., R. Jin, and K. Oweiss, Identifying Functional Connectivity in Large Scale Neural Ensemble Recordings: A Multiscale Data Mining Approach. Neural Computation, 2008. 20: p. 1-28.
61. Biran, R., D.C. Martin, and P.A. Tresco, The brain tissue response to implanted silicon microelectrode arrays is increased when the device is tethered to the skull. Journal of Biomedical Materials Research Part A, 2007. 82(1): p. 169-178.
62. Wu, X., High-order context modeling and embedded conditional entropy coding of wavelet coefficients for image compression, in Conference Record of the Thirty-First Asilomar Conference on Signals, Systems & Computers. 1997, IEEE.
63. Hauck, E.L., Data compression using run length encoding and statistical encoding. 1986, Google Patents.
64. Donoho, D.L., Compressed Sensing. IEEE Transactions on Information Theory, 2006. 52(4): p. 1289-1306.
65. Candès, E.J., The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 2008. 346(9): p. 589-592.
66. Rauhut, H., K. Schnass, and P. Vandergheynst, Compressed sensing and redundant dictionaries. IEEE Transactions on Information Theory, 2008. 54(5): p. 2210-2219.
67. Johnson, K.O., Sensory discrimination: decision process. Journal of Neurophysiology, 1980a. 43(6): p. 1771-1792.
68. Johnson, D.H. and W. Ray, Optimal Stimulus Coding by Neural Populations Using Rate Codes. Journal of Computational Neuroscience, 2004. 16(2): p. 129-138.
69. Wiener, M.C. and B.J. Richmond, Using Response Models to Estimate Channel Capacity for Neuronal Classification of Stationary Visual Stimuli Using Temporal Coding. Journal of Neurophysiology, 1999. 82(6): p. 2861-2875.
70. Abbott, L.F. and P. Dayan, The effect of correlated variability on the accuracy of a population code. Neural Computation, 1999. 11(1): p. 91-101.
71. Johnson, K.O., Sensory discrimination: neural processes preceding discrimination decision. Journal of Neurophysiology, 1980b. 43(6): p. 1793-1815.
72. Averbeck, B.B., P. Latham, and A. Pouget, Neural correlations, population coding and computation. Nature Reviews Neuroscience, 2006. 7: p. 358-366.
73. Zohary, E., M.N. Shadlen, and W.T. Newsome, Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 1994. 370(6485): p. 140-143.
74. Puchalla, J.L., et al., Redundancy in the Population Code of the Retina. Neuron, 2005. 46(3): p. 493-504.
75. Latham, P.E. and S. Nirenberg, Synergy, Redundancy, and Independence in Population Codes, Revisited. Journal of Neuroscience, 2005. 25(21): p. 5195.
76. Stevenson, I.H. and K.P. Kording, How advances in neural recording affect data analysis. Nature Neuroscience, 2011. 14(2): p. 139-142.
77. Wainwright, M.J. and M.I. Jordan, Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 2008. 1(1-2): p. 1-305.
78. Schneidman, E., et al., Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 2006. 440(7087): p. 1007-1012.
79. Tang, A., et al., A maximum entropy model applied to spatial and temporal correlations from cortical networks in vitro. Journal of Neuroscience, 2008. 28(2): p. 505.
80. Truccolo, W., L.R. Hochberg, and J.P. Donoghue, Collective dynamics in human and monkey sensorimotor cortex: predicting single neuron spikes. Nature Neuroscience, 2009.
81. Marre, O., et al., Prediction of spatiotemporal patterns of neural activity from pairwise correlations. Physical Review Letters, 2009. 102(13): p. 138101.
82. Boyd, S.P. and L. Vandenberghe, Convex optimization. 2004: Cambridge University Press.
83. Shelley, M. and D. McLaughlin, Coarse-grained reduction and analysis of a network model of cortical response: I. Drifting grating stimuli. Journal of Computational Neuroscience, 2002. 12(2): p. 97-122.
84. Palm, G., A. Aertsen, and G.L. Gerstein, On the significance of correlations among neuronal spike trains. Biological Cybernetics, 1988. 59(1): p. 1-11.
85. Gilbert, C.D., Microcircuitry of the Visual Cortex. Annual Review of Neuroscience, 1983. 6(1): p. 217-247.
86. Reich, D.S., et al., Response Variability and Timing Precision of Neuronal Spike Trains In Vivo. Journal of Neurophysiology, 1997: p. 2836-2841.
87. Eggermont, J.J., Function aspects of synchrony and correlation in the auditory nervous system. Concepts in Neuroscience, 1993. 4(2): p. 105-129.
88. Dzakpasu, R. and M. Zochowski, Discriminating different types of synchrony in neural systems. Physica D, 2005. 208: p. 115-122.
89. Georgopoulos, A.P., et al., On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience, 1982. 2(11): p. 1527-1537.
90. Riehle, A., et al., Spike Synchronization and Rate Modulation Differentially Involved in Motor Cortical Function. Science, 1997. 278(5345): p. 1950.
91. Kandel, E.R., J.H. Schwartz, and T.M. Jessell, Principles of neural science. 2000: Appleton & Lange.
92. Rieke, F., Spikes: Exploring the Neural Code. 1997: MIT Press.
93. Montani, F., et al., The Role of Correlations in Direction and Contrast Coding in the Primary Visual Cortex. Journal of Neuroscience, 2007. 27(9): p. 2338.
94. Nirenberg, S. and P.E. Latham, Decoding neuronal spike trains: How important are correlations? Proceedings of the National Academy of Sciences, 2003. 100(12): p. 7348-7353.
95. Jermakowicz, W.J. and V.A. Casagrande, Neural networks a century after Cajal. Brain Research Reviews, 2007. 55(2): p. 264-284.
96. Gourevitch, B. and J.J. Eggermont, A nonparametric approach for detection of bursts in spike trains. Journal of Neuroscience Methods, 2007. 160(2): p. 349-358.
97. Salinas, E. and T.J. Sejnowski, Correlated neuronal activity and the flow of neural information. Nature Reviews Neuroscience, 2001. 2(8): p. 539-550.
98. Perkel, D.H., G. Gerstein, and G.P. Moore, Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophysical Journal, 1967. 7(4): p. 419-440.
99. Gerstein, G.L., D.H. Perkel, and J.E. Dayhoff, Cooperative firing activity in simultaneously recorded populations of neurons: detection and measurement. Journal of Neuroscience, 1985. 5(4): p. 881.
100. Smith, W.S. and E.E. Fetz, Synaptic Interactions Between Forelimb-Related Motor Cortex Neurons in Behaving Primates. Journal of Neurophysiology, 2009. 102(2): p. 1026.
101. Schneidman, E., W. Bialek, and M.J. Berry, Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 2003. 23(37): p. 11539-11553.
102. Globerson, A., et al., The minimum information principle and its application to neural code analysis. Proceedings of the National Academy of Sciences, 2009. 106(9): p. 3490-3495.
103. Brenner, N., et al., Synergy in a Neural Code. Neural Computation, 2000. 12(7): p. 1531-1552.
104. Martignon, L.G., et al., Neural Coding: Higher-Order Temporal Patterns in the Neurostatistics of Cell Assemblies. Neural Computation, 2000. 12: p. 2621-2653.
105. Borst, A. and F.E. Theunissen, Information theory and neural coding. Nature Neuroscience, 1999. 2: p. 947-958.
106. Nelken, I. and G. Chechik, Information theory in auditory research. Hearing Research, 2007. 229(1-2): p. 94-105.
107. Schneidman, E., et al., Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 2006. 440(7087): p. 1007.
108. Globerson, A., et al., The minimum information principle and its application to neural code analysis. Proceedings of the National Academy of Sciences, 2009. 106(9): p. 3490.
109. Studeny, M. and J. Vejnarova, The multiinformation function as a tool for measuring stochastic dependence, in Learning in Graphical Models. 1998: p. 261-297.
110. Van Emden, M.H., An analysis of complexity. 1971: Mathematisch Centrum, Amsterdam.
111. Schneidman, E., et al., Network information and connected correlations. Physical Review Letters, 2003. 91(23): p. 238701.
112. Silva, J. and R. Willett, Hypergraph-Based Anomaly Detection of High-Dimensional Co-Occurrences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009. 31(3): p. 563-569.
113. London, M., et al., The information efficacy of a synapse. Nature Neuroscience, 2002. 5(4): p. 332-340.
114. Kindermann, R. and J.L. Snell, Markov random fields and their applications. 1980: American Mathematical Society, Providence, RI.
115. Golumbic, M.C., Algorithmic graph theory and perfect graphs. 1980: Academic Press.
116. Shlens, J., et al., The structure of multi-neuron firing patterns in primate retina. Journal of Neuroscience, 2006. 26(32): p. 8254.
117. Pillow, J.W., et al., Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 2008. 454(7207): p. 995-999.
118. Shlens, J., et al., The Structure of Large-Scale Synchronized Firing in Primate Retina. Journal of Neuroscience, 2009. 29(15): p. 5022-5031.
119. Callaway, E.M., Local circuits in primary visual cortex of the macaque monkey. Annual Review of Neuroscience, 1998. 21: p. 47-74.
120. Yoshimura, Y. and E.M. Callaway, Fine-scale specificity of cortical networks depends on inhibitory cell type and connectivity. Nature Neuroscience, 2005. 8(11): p. 1552-1559.
121. Sanes, J.N. and J.P. Donoghue, Plasticity and primary motor cortex. Annual Review of Neuroscience, 2000. 23(1): p. 393-415.
122. Cohen, M.R. and W.T. Newsome, Context-dependent changes in functional circuitry in visual area MT. Neuron, 2008. 60(1): p. 162-173.
123. Azouz, R. and C.M. Gray, Dynamic spike threshold reveals a mechanism for synaptic coincidence detection in cortical neurons in vivo. Proceedings of the National Academy of Sciences, 2000: p. 8110-8115.
124. Grün, S., M. Diesmann, and A. Aertsen, Unitary events in multiple single-neuron spiking activity: I. Detection and significance. Neural Computation, 2002. 14(1): p. 43-80.
125. Bondy, J.A. and U.S.R. Murty, Graph theory. 2007: Springer Verlag.
126. Bell, A.J., The co-information lattice. 2003.
127. Crochet, S. and C.C. Petersen, Cortical dynamics by layers. Neuron, 2009. 64(3): p. 298-300.
128. Roudi, Y., S. Nirenberg, and P.E. Latham, Pairwise maximum entropy models for studying large biological systems: when they can work and when they can't. PLoS Computational Biology, 2009. 5(5).
129. Truccolo, W., et al., A Point Process Framework for Relating Neural Spiking Activity to Spiking History, Neural Ensemble, and Extrinsic Covariate Effects. Journal of Neurophysiology, 2005. 93(2): p. 1074-1089.
130. Brockwell, A.E., R.E. Kass, and A.B. Schwartz, Statistical Signal Processing and the Motor Cortex. Proceedings of the IEEE, 2007. 95(5): p. 881-898.
131. Sprekeler, H., C. Michaelis, and L. Wiskott, Slowness: An Objective for Spike-Timing-Dependent Plasticity? PLoS Computational Biology, 2007. 3(6): p. e112.
132. Jackson, A., J. Mavoori, and E.E. Fetz, Long-term motor cortex plasticity induced by an electronic neural implant. Nature, 2006. 444: p. 56-60.
133. Hendry, S.H., et al., Numbers and proportions of GABA-immunoreactive neurons in different areas of monkey cerebral cortex. Journal of Neuroscience, 1987. 7: p. 1503-1519.
134. Bi, G.Q. and M.M. Poo, Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 1998. 18: p. 10464-10472.
135. Jacob, V., et al., Spike Timing-Dependent Synaptic Depression in the In Vivo Barrel Cortex of the Rat. Journal of Neuroscience, 2007. 27: p. 1271-1284.
136. Froemke, R.C. and Y. Dan, Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 2002. 416(6879): p. 433-438.
137. Dan, Y. and M.M. Poo, Spike Timing-Dependent Plasticity: From Synapse to Perception. Physiological Reviews, 2006. 86(3): p. 1033.
138. Dan, Y. and M.M. Poo, Spike timing-dependent plasticity of neural circuits. Neuron, 2004. 44: p. 23-30.
139. Perkel, D.H., G.L. Gerstein, and G.P. Moore, Neuronal spike trains and stochastic point processes II. Simultaneous spike trains. Biophysical Journal, 1967. 7(4): p. 419-440.
140. Histed, M.H., A. Pasupathy, and E.K. Miller, Learning Substrates in the Primate Prefrontal Cortex and Striatum: Sustained Activity Related to Successful Actions. Neuron, 2009. 63(2): p. 244-253.
141. Aghagolzadeh, M. and K. Oweiss, Compressed and Distributed Sensing of Neuronal Activity for Real Time Spike Train Decoding. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2009. 17(2): p. 116-127.
142. Brown, E.N., R. Kass, and P. Mitra, Multiple neural spike train data analysis: State-of-the-Art and future challenges. Nature Neuroscience, 2004. 7(5): p. 456-461.
143. Harrison, R.R., et al., A Low-Power Integrated Circuit for a Wireless 100-Electrode Neural Recording System. IEEE Journal of Solid-State Circuits, 2007. 42(1): p. 123.
144. Wise, K.D., et al., Microelectrodes, Microelectronics, and Implantable Neural Microsystems. Proceedings of the IEEE, 2008. 96(7): p. 1184-1202.
145. Sodagar, A.M., K.D. Wise, and K. Najafi, A wireless implantable microsystem for multichannel neural recording. IEEE Transactions on Microwave Theory and Techniques, 2009. 57(10): p. 2565-2573.
146. Ventura, V., Spike Train Decoding Without Spike Sorting. Neural Computation, 2007. 20(4): p. 923-963.
147. Harris, K.D., et al., Accuracy of tetrode spike separation as determined by simultaneous intracellular and extracellular measurements. Journal of Neurophysiology, 2000. 84(1): p. 401-414.
148. Jog, M., et al., Tetrode technology: advances in implantable hardware, neuroimaging, and data analysis techniques. Journal of Neuroscience Methods, 2002. 117(2): p. 141-152.
149. Mechler, F., et al., Three-dimensional localization of neurons in cortical tetrode recordings. Journal of Neurophysiology, 2011. 106(2): p. 828-848.
150. Reich, D.S., et al., Interspike intervals, receptive fields, and information encoding in primary visual cortex. Journal of Neuroscience, 2000. 20(5): p. 1964-1974.
151. Rad, K.R. and L. Paninski, Efficient, adaptive estimation of two-dimensional firing rate surfaces via Gaussian process methods. Network: Computation in Neural Systems, 2010. 21(3-4): p. 142-168.
152. Zhang, K. and T.J. Sejnowski, Neuronal tuning: To sharpen or broaden? Neural Computation, 1999. 11(1): p. 75-84.
153. Aghagolzadeh, M., A. Mohebi, and K. Oweiss, Sorting Neural Activity via Simple Thresholding. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2012: to appear.
154. Boyd, S. and L. Vandenberghe, Convex optimization. 2004: Cambridge University Press.
155. van den Berg, E. and M.P. Friedlander, Sparse optimization with least-squares constraints. SIAM Journal on Optimization, 2011. 21(4): p. 1201-1229.